# Follow up: Twitter analysis

This notebook covers two assignments:
* **Assignment 1**: Given a set of Twitter users identified by user id (suggested: 20 users), build a model that classifies users based on their available Twitter features, including media, text, and metrics
    * Output: a working model that takes any user id and outputs the probability of each class, plus performance metrics on the training and test sets

* **Assignment 2**: Additional insights for the twitter users
    * Output: popular topics, analysis, suggestions, and visualizations

We begin with Assignment 1.

In [1]:
import extract_features
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from gensim import corpora, models
import pyLDAvis.gensim
import numpy
import sklearn

%matplotlib inline

%load_ext autoreload
%autoreload 2

## Assignment 1: Classifying twitter users

### Dataset descriptions:

* Number of classes: 2 (influencer, and non-influencer)
* The dataset is split into three sets: train (~60% = 48 Twitter users), validation (~20% = 16 Twitter users), and test (~20% = 18 Twitter users)
* How to annotate data?
    * Reputation experts manually identified the opinion makers (i.e. twitter users with reputational influence) as 'Influencer'.
    * The profiles that are not considered opinion makers are assigned the 'non-influencer' label.
    * Our Twitter user data is a small subset of the REPLAB dataset (http://nlp.uned.es/replab2014/) collected for the **Author Profiling Task**

### First experiment: Classify twitter users based on users' public profile and users' posts

In this experiment, we consider the following features:

#### User features

* Number of followers
* Number of lists that contain the user
* Number of statuses
* Number of accounts followed (followees)
* Number of favourites
* Username in description: boolean value indicating whether the username or screen name appears in the user's profile description
* Average number of tweets per month
* Followee per follower ratio: the ratio between the number of followees and the number of followers
* URL in profile: boolean value
* Description length
* Proportion of replies: the ratio between the number of replies to other users and the total number of posts
* Number of distinct users replied to
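
As an illustration, a few of the profile features above could be derived as follows. This is a minimal sketch, not the code of the `extract_features` module (which is not shown here). It assumes each user is a dict laid out like a Twitter API v1.1 user object; the function name `profile_features` and the output keys are hypothetical.

```python
def profile_features(user):
    """Derive a few profile features from a Twitter user object (a dict).

    Assumption: field names follow the Twitter API v1.1 user object.
    """
    followers = user.get("followers_count", 0)
    followees = user.get("friends_count", 0)
    description = user.get("description") or ""
    screen_name = user.get("screen_name", "")
    return {
        "followers": followers,
        "following": followees,
        "listed": user.get("listed_count", 0),
        "statuses": user.get("statuses_count", 0),
        "favourites": user.get("favourites_count", 0),
        # Guard against division by zero for accounts without followers.
        "followee_per_follower": followees / followers if followers else 0.0,
        "username_in_description": bool(screen_name)
        and screen_name.lower() in description.lower(),
        "url_in_profile": bool(user.get("url")),
        "description_length": len(description),
    }


# Hypothetical user, for illustration only.
example_user = {
    "screen_name": "acme_news",
    "description": "Official account of @acme_news",
    "followers_count": 200,
    "friends_count": 100,
    "url": None,
}
features = profile_features(example_user)
print(features["followee_per_follower"], features["username_in_description"])
```

Features that depend on the timeline, such as the proportion of replies, need the user's posts and are not covered by this sketch.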

#### User posts
* Number of hashtags: the ratio between the number of posts containing a hashtag and the total number of posts
* Number of URLs: the ratio between the number of posts containing a URL and the total number of posts
* Number of retweets: the ratio between the number of posts that are retweets and the total number of posts
* Number of favourites: the ratio between the number of posts that have been favourited and the total number of posts
* Update frequency: the average number of seconds between consecutive posts
* Update frequency sd: the standard deviation of the update frequency
* Proportion of retweets among tweets: the ratio between the number of posts that are retweets of someone else's post and the total number of posts
* Number of posts retweeted: the ratio between the number of posts created by the user that were retweeted by others and the total number of posts
* Words per post: the average number of words per post (excluding retweets of someone else's posts)
* Number of posts retweet: the average retweet count across posts
* Number of posts favourite: the average favourite count across posts
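
The ratio and timing features above can be sketched in a few lines. This is a hedged illustration, not the notebook's actual extraction code. It assumes each post is a dict laid out like a Twitter API v1.1 tweet object (`entities`, `retweeted_status`, `created_at`), and it uses the population standard deviation; the original does not say which variant it uses. The function name `post_features` is hypothetical.

```python
from datetime import datetime
from statistics import mean, pstdev

# Assumed timestamp layout of v1.1 tweets, e.g. "Wed Aug 03 10:00:00 +0000 2016".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def post_features(tweets):
    """Compute a few per-user post features from a list of tweet dicts."""
    total = len(tweets)
    if total == 0:
        return {}
    with_hashtag = sum(1 for t in tweets if t.get("entities", {}).get("hashtags"))
    with_url = sum(1 for t in tweets if t.get("entities", {}).get("urls"))
    retweets = sum(1 for t in tweets if "retweeted_status" in t)
    # Gaps in seconds between consecutive posts, in chronological order.
    times = sorted(
        datetime.strptime(t["created_at"], TWITTER_TIME_FORMAT) for t in tweets
    )
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    return {
        "hashtag_ratio": with_hashtag / total,
        "url_ratio": with_url / total,
        "retweet_ratio": retweets / total,
        "update_frequency": mean(gaps) if gaps else 0.0,
        "update_frequency_sd": pstdev(gaps) if gaps else 0.0,
    }


# Hypothetical posts, for illustration only.
example_tweets = [
    {"created_at": "Wed Aug 03 10:00:00 +0000 2016",
     "entities": {"hashtags": [{"text": "fintech"}], "urls": []}},
    {"created_at": "Wed Aug 03 10:01:00 +0000 2016",
     "entities": {"hashtags": [], "urls": [{"url": "https://t.co/x"}]},
     "retweeted_status": {}},
    {"created_at": "Wed Aug 03 10:03:00 +0000 2016",
     "entities": {"hashtags": [], "urls": []}},
]
feats = post_features(example_tweets)
print(feats)
```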

In [2]:
# Load the JSON files and compute all of the features above for each set
extract_features.to_csv('./data/test', 'basic_features_test.csv')
extract_features.to_csv('./data/train', 'basic_features_train.csv')
extract_features.to_csv('./data/validation', 'basic_features_validation.csv')

X, y = extract_features.read_csv('basic_features_train.csv')
X_test, y_test = extract_features.read_csv('basic_features_test.csv')
X_val, y_val = extract_features.read_csv('basic_features_validation.csv')

#### Perform classification and report results

In [5]:
# Train on the training set and evaluate on the test set

basic_model = LogisticRegression(solver='lbfgs')
basic_model = basic_model.fit(X, y)

y_true, y_pred = y_test, basic_model.predict(X_test)

print('Classification report')

print(classification_report(y_true, y_pred))

Classification report
             precision    recall  f1-score   support

        0.0       0.53      0.89      0.67         9
        1.0       0.67      0.22      0.33         9

avg / total       0.60      0.56      0.50        18
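
The figures in the report can be cross-checked by hand. The confusion matrix below is not printed in the notebook; it is inferred from the reported recall and support (8 of 9 users of class 0 and 2 of 9 users of class 1 classified correctly), so treat it as an assumption. The helper name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(tp, fp, fn):
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Rows: true class (0, 1); columns: predicted class (0, 1).
# Inferred from the report above, not taken from the notebook output.
confusion = [[8, 1],
             [7, 2]]

p0, r0, f0 = per_class_metrics(tp=confusion[0][0], fp=confusion[1][0], fn=confusion[0][1])
p1, r1, f1 = per_class_metrics(tp=confusion[1][1], fp=confusion[0][1], fn=confusion[1][0])
accuracy = (confusion[0][0] + confusion[1][1]) / sum(map(sum, confusion))

print("class 0: precision=%.2f recall=%.2f f1=%.2f" % (p0, r0, f0))
print("class 1: precision=%.2f recall=%.2f f1=%.2f" % (p1, r1, f1))
print("accuracy=%.2f" % accuracy)
```

For the per-class probabilities requested in the assignment, scikit-learn's `LogisticRegression` also provides `predict_proba`, which returns one column per class.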



### Second experiment: Classify twitter users based on users' public profile, users' posts, and tweet content clusters

In this experiment, we consider the following features:
* User features
* User posts
* Tweet content analysis:
    * Preprocess tweets by removing URLs and tokenizing
    * Represent tweets as TF-IDF features
    * Perform K-means clustering on the tweets
    * For each Twitter user, represent the frequency of tweet clusters
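
The last step, turning the cluster labels of a user's tweets into a fixed-length feature vector, can be sketched as below. This is an assumption about how the step might work, not the code in `extract_features`. In particular, the vector is normalized by the number of tweets, whereas the original may use raw counts. The labels would typically come from a fitted K-means model applied to the TF-IDF vectors; here they are hard-coded, hypothetical values.

```python
from collections import Counter


def cluster_frequencies(cluster_ids, num_clusters):
    """Return the share of a user's tweets that falls into each cluster."""
    total = len(cluster_ids)
    if total == 0:
        return [0.0] * num_clusters
    counts = Counter(cluster_ids)
    return [counts.get(cluster, 0) / total for cluster in range(num_clusters)]


# Hypothetical cluster labels for the tweets of two users, with k = 4.
user_tweet_clusters = {
    "user_a": [0, 0, 2, 3],
    "user_b": [1, 1, 1],
}
user_vectors = {
    user: cluster_frequencies(labels, num_clusters=4)
    for user, labels in user_tweet_clusters.items()
}
print(user_vectors)
```

Each vector has length k regardless of how many tweets a user has, so it can be appended to the profile and post features before fitting the classifier.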

In [7]:
# Tune the number of clusters K for K-means on the validation set

num_clusters = range(10, 70, 10)
f1_scores = []

for k in num_clusters:
    transformer, km = extract_features.train_text_model('./data/train', k)
    extract_features.to_csv('./data/test', 'features_test.csv', transformer, km, k)
    extract_features.to_csv('./data/train', 'features_train.csv', transformer, km, k)
    extract_features.to_csv('./data/validation', 'features_validation.csv', transformer, km, k)
    
    X, y = extract_features.read_csv('features_train.csv')
    X_val, y_val = extract_features.read_csv('features_validation.csv')
    
    model = LogisticRegression(solver='lbfgs')
    model = model.fit(X, y)
    
    y_true, y_pred = y_val, model.predict(X_val)
    f1_score = sklearn.metrics.f1_score(y_true, y_pred)
    
    f1_scores.append(f1_score)
    print('F1-score: %.2f%%' %(f1_score*100.0))

learning kmeans 10 
F1-score: 66.67%
learning kmeans 20 
F1-score: 82.35%
learning kmeans 30 
F1-score: 53.33%
learning kmeans 40 
F1-score: 66.67%
learning kmeans 50 
F1-score: 66.67%
learning kmeans 60 
F1-score: 70.59%


#### Report results on test set after tuning parameters

In [19]:
# Index of the best validation F1-score (the first occurrence wins on ties)
optimal_k = f1_scores.index(max(f1_scores))

# Perform evaluation on test set

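transformer, km = extract_features.train_text_model('./data/train', num_clusters[optimal_k])
# Regenerate the train features as well, so that train and test share the same clustering
extract_features.to_csv('./data/train', 'features_train.csv', transformer, km, num_clusters[optimal_k])
extract_features.to_csv('./data/test', 'features_test.csv', transformer, km, num_clusters[optimal_k])
extract_features.to_csv('./data/validation', 'features_validation.csv', transformer, km, num_clusters[optimal_k])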
    
X, y = extract_features.read_csv('features_train.csv')
X_test, y_test = extract_features.read_csv('features_test.csv')
    
model = LogisticRegression(solver='lbfgs')
model = model.fit(X, y)
    
y_true, y_pred = y_test, model.predict(X_test)

print('Classification report')

print(classification_report(y_true, y_pred))

learning kmeans 20 
Classification report
             precision    recall  f1-score   support

        0.0       0.62      0.56      0.59         9
        1.0       0.60      0.67      0.63         9

avg / total       0.61      0.61      0.61        18



## Assignment 2: Gain insights about what twitter users are talking about

In [23]:
# Load the data: the tweets were extracted with Twython and stored as a collection of JSON files
tweets_data = extract_features.read_json_file('./data/train')

### Clean up and prepare the documents for LDA

* Remove URLs
* Tokenize the data: break the documents into words
* Remove stop words and words of only one or two characters
* Remove numbers
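
The cleaning steps above can be sketched with the standard library alone. This is not the implementation of `extract_features.preprocess_tweet`, which is not shown. The regular expressions, the tiny stop-word list, and the function name `clean_tweet` are all assumptions made for illustration.

```python
import re

# Assumed patterns: anything that looks like a link, and runs of word characters.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
TOKEN_PATTERN = re.compile(r"\w+", re.UNICODE)

# Deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "and", "for", "with", "that", "this", "are", "was"}


def clean_tweet(text):
    """Lowercase, strip URLs, tokenize, and drop stop words, short words and numbers."""
    text = URL_PATTERN.sub(" ", text.lower())
    tokens = TOKEN_PATTERN.findall(text)
    return [
        token for token in tokens
        if len(token) > 2 and token not in STOP_WORDS and not token.isdigit()
    ]


print(clean_tweet("The 2 banks are up 10% today: https://t.co/abc123 #fintech"))
```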

In [25]:
docs = extract_features.preprocess_tweet(tweets_data)

### To gain insights about what twitter users are talking about, we perform Latent Dirichlet Allocation (LDA)

* LDA is a way of automatically discovering topics that these tweets contain

In [26]:
# Assign a unique integer id to each unique token while also collecting word counts and relevant statistics
dictionary = corpora.Dictionary(docs)

# Convert into bag-of-words representation
corpus = [dictionary.doc2bow(doc) for doc in docs]
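
To make the bag-of-words representation concrete, here is a plain-Python illustration of the idea: each document becomes a list of `(token_id, count)` pairs. It is not gensim's implementation, and the helper names `build_vocabulary` and `to_bow` are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert a tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy documents, for illustration only.
toy_docs = [["bank", "market", "bank"], ["market", "car"]]
vocab = build_vocabulary(toy_docs)
toy_corpus = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(toy_corpus)
```

Word order is discarded in this representation; only the counts per document are passed on to LDA.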

In [28]:
# First LDA model with 20 topics
lda_20 = models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary, passes=10)

In [41]:
# Select top 15 words for each topic
top_words = [[word for word,_ in lda_20.show_topic(topicno, topn=15)] for topicno in range(lda_20.num_topics)]

In [42]:
def display(topics):
    for index, words in enumerate(topics):
        print("Topic %s: %s" % (index, ", ".join(words)))

In [43]:
display(top_words)

Topic 0: los, seeking, con, fiat, facebook, alfa, digital, año, audi, vía, porsche, toyota, https, ecstuning, ecs
Topic 1: don, people, want, twitter, pero, going, fund, said, many, set, one, group, days, sagittarius, family
Topic 2: via, bank, social, read, use, capital, est, target, reports, wall, esto, bankthink, https, wsj, fintech
Topic 3: money, best, business, live, brexit, looking, car, top, stop, ever, businessinsider, yet, gold, street, socialmedia
Topic 4: alpha, las, things, please, follow, keep, getting, brooklyn, august, start, agrifields, means, far, around, soup
Topic 5: make, win, call, new, take, year, will, let, just, thing, even, tweet, stay, full, isn
Topic 6: earnings, now, just, black, never, women, back, tech, look, someone, tell, always, can, will, team
Topic 7: que, ipad, trump, para, abarth_es, abarth, arribi_rs, una, media, really, guillealfonsin, danim_andrade, happy, todos, donald
Topic 8: know, del, good, market, need, ceo, video, más, better, nuevo, stoc

In [44]:
data_20 = pyLDAvis.gensim.prepare(lda_20, corpus, dictionary)
pyLDAvis.display(data_20)

In [45]:
# LDA model with 40 topics
lda_40 = models.ldamodel.LdaModel(corpus, num_topics=40, id2word=dictionary, passes=10)

# Select top 15 words for each topic
top_words = [[word for word,_ in lda_40.show_topic(topicno, topn=15)] for topicno in range(lda_40.num_topics)]

display(top_words)

Topic 0: like, good, morning, oil, sale, start, hedge, awesome, one, can, move, number, boston, hear, reason
Topic 1: world, read, fund, air, problem, biggest, reading, info, point, delicious, calls, means, trying, place, box
Topic 2: businessinsider, latest, madrid, https, talking, since, chicken, vote, daily, ideas, rates, raises, friend, watching, plans
Topic 3: market, más, things, stock, pay, investors, getting, wants, euromoney, india, comes, benzinga, talks, says, cost
Topic 4: para, pero, este, todos, aquí, todo, august, hoy, mind, otro, mucho, mola, break, uso, three
Topic 5: women, doesn, fraud, makes, apple, billion, post, fuck, coming, feel, spider, aapl, package, advice, rules
Topic 6: make, great, twitter, work, sure, can, someone, ready, try, sin, short, god, wall, people, aliamjadrizvi
Topic 7: seeking, going, news, car, plan, audi, debt, bloomberg, using, chicago, uae, security, key, tan, buying
Topic 8: real, give, looks, everything, companies, text, whole, focus, int

In [46]:
data_40 = pyLDAvis.gensim.prepare(lda_40, corpus, dictionary)
pyLDAvis.display(data_40)

## Discussion

* Because Twitter users are diverse and there are many reasons to categorize them, the two classes used here (influencer, non-influencer) could be extended to six: personal, professional, business, spam, feed/news, and viral
* The classifier in experiment 2 performs better on the test set (average F1-score of 0.61 versus 0.50 in experiment 1), most likely because it also analyzes the content of the tweets
* We applied LDA topic modeling to the Twitter data; however, the words within each topic are often not coherent, and it remains difficult to determine the optimal number of topics automatically
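
One way to put a number on topic coherence is a co-occurrence score such as UMass coherence, which rewards topics whose top words tend to appear in the same documents. The sketch below is a plain-Python illustration on toy data and was not used in this notebook. The function name `umass_coherence` and the example documents are hypothetical.

```python
import math


def umass_coherence(top_words, docs):
    """UMass-style coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs.

    top_words: the topic's top words, most probable first.
    docs: tokenized documents (lists of words).
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for doc in doc_sets if all(word in doc for word in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            single = doc_freq(top_words[l])
            if single == 0:
                continue  # The word never occurs; skip to avoid division by zero.
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / single)
    return score


# Toy documents, for illustration only.
toy_docs = [
    ["bank", "money", "market"],
    ["bank", "money"],
    ["car", "audi"],
    ["car", "audi", "porsche"],
]
coherent = umass_coherence(["bank", "money"], toy_docs)
mixed = umass_coherence(["bank", "audi"], toy_docs)
print(coherent, mixed)
```

Averaging such a score over all topics for several values of `num_topics` is one possible way to compare models such as `lda_20` and `lda_40`. gensim also ships a `CoherenceModel` class for this purpose.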