### Russian Trolls: Proposal #2

This dataset, released by NBC on Feb. 14, contains 200K tweets from Russian trolls.

#### Topic:
Interesting Dataset; Beginner Tutorial

#### Proposal:  
1.  Create a beginner tutorial on topic modeling  
2.  Create a tutorial on predicting the number of retweets (caveat: very hard to predict)

#### Yellowbrick:  
1.  Proposal #1: Use FreqDist to show the most frequent words; use t-SNE to visualize clustering  
2.  Proposal #2: Use regression visualizers
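
As a rough illustration of what a frequency distribution over tweet text would surface, the sketch below counts tokens with the standard library rather than Yellowbrick. The sample texts, the `STOP_WORDS` set, and the `word_frequencies` helper are all hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import re
from collections import Counter

# Hypothetical sample texts, not taken from the dataset.
texts = [
    "Breaking news about the election",
    "Election news tonight",
    "The news is breaking",
]

# Tiny illustrative stop-word set; a real tutorial would use a fuller list.
STOP_WORDS = {"the", "about"}

# Tokens are runs of three or more ASCII letters (an assumption for this sketch).
TOKEN_RE = re.compile(r"[a-z]{3,}")


def word_frequencies(docs):
    """Count lowercase tokens across all documents, skipping stop words."""
    counts = Counter()
    for doc in docs:
        tokens = TOKEN_RE.findall(doc.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


freqs = word_frequencies(texts)
for word, count in freqs.most_common(3):
    print(word, count)
```

A frequency-distribution plot is essentially a bar chart of these counts, so a sketch like this can help sanity-check what the visualizer shows.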

![](images/twitter_russian_troll.jpg)

In [41]:
import pandas as pd

In [42]:
df = pd.read_csv('tweets.csv.xz', dtype={'user_id': object, 
                                         'created_at': object,
                                         'tweet_id': object})

In [None]:
df.shape

In [44]:
df[['user_key', 'retweet_count', 'text']].head()

| | user_key | retweet_count | text |
|---|---|---|---|
| 0 | ryanmaxwell_1 | NaN | #IslamKills Are you trying to say that there w... |
| 1 | detroitdailynew | 0.0 | Clinton: Trump should’ve apologized more, atta... |
| 2 | cookncooks | NaN | RT @ltapoll: Who was/is the best president of ... |
| 3 | queenofthewo | NaN | RT @jww372: I don't have to guess your religio... |
| 4 | mrclydepratt | NaN | RT @Shareblue: Pence and his lawyers decided w... |
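
Four of the five previewed rows have no `retweet_count`, which bears on the retweet-prediction proposal. A minimal sketch of checking how much of the target column is missing follows. It uses a toy frame (`preview`, a hypothetical name) that mirrors only the five rows shown above; in the notebook the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the five preview rows; the notebook would use `df` instead.
preview = pd.DataFrame({
    "user_key": ["ryanmaxwell_1", "detroitdailynew", "cookncooks",
                 "queenofthewo", "mrclydepratt"],
    "retweet_count": [np.nan, 0.0, np.nan, np.nan, np.nan],
})

# Fraction of rows where the regression target is missing.
missing_fraction = preview["retweet_count"].isna().mean()

# Rows that could actually be used to train a retweet model.
usable = preview.dropna(subset=["retweet_count"])

print(f"missing: {missing_fraction:.0%}, usable rows: {len(usable)}")
```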


In [45]:
data = df[~df.text.isna()]

In [None]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

NUM_TOPICS = 10
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9, 
                             stop_words='english', lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data.text)
 
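# Build a Latent Dirichlet Allocation model
# (n_components replaces the n_topics argument, which newer scikit-learn releases no longer accept)
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')
# lda_Z has one row per tweet and one column per topic
lda_Z = lda_model.fit_transform(data_vectorized)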

In [None]:
print(lda_Z[0])
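
Each row of the matrix returned by `fit_transform` is a document's distribution over topics, so the first row printed above is the topic mixture for the first tweet. A minimal sketch of reading off the dominant topic per document, using a small hypothetical matrix (`doc_topic`) in place of `lda_Z`:

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents, 4 topics, each row sums to 1.
doc_topic = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])

# Index of the highest-weight topic for each document
# (ties resolve to the lowest topic index).
dominant_topic = doc_topic.argmax(axis=1)

# Weight of that topic, a rough indicator of how confident the assignment is.
dominant_weight = doc_topic.max(axis=1)

for i, (topic, weight) in enumerate(zip(dominant_topic, dominant_weight)):
    print(f"doc {i}: topic {topic} (weight {weight:.2f})")
```

A document with a nearly flat row, like the third one here, has no clear dominant topic, so the weight is worth inspecting alongside the index.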

In [9]:
from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()


print("Loading dataset...")
t0 = time()
# Drop rows with missing text; CountVectorizer cannot tokenize NaN values
data = df[['user_key', 'retweet_count', 'text']].dropna(subset=['text'])
data_samples = data.text[:n_samples]
print("done in %0.3fs." % (time() - t0))

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
print()

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
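# get_feature_names_out() is the scikit-learn >= 1.0 name; older releases use get_feature_names()
tf_feature_names = tf_vectorizer.get_feature_names_out()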
print_top_words(lda, tf_feature_names, n_top_words)

Loading dataset...
done in 0.017s.
Extracting tf features for LDA...
done in 0.130s.

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 3.126s.

Topics in LDA model:
Topic #0: rt http post life news live new conservatexian ho bad making la 10 perfect tonight makes john person dear hot
Topic #1: rt just hate stop fake christmas shot friends todolistbeforechristmas ll christmasaftermath false nation brexit ihavearighttoknow rules cause hashtag topnews things
Topic #2: rt https trump clinton hillary obama amp people donald don politics tcot just like realdonaldtrump said know america right pjnet
Topic #3: kids future followthemoney awesome single sure sleep ll oscarhasnocolor rid hours 20 favorite dangerous arizona watch spread fly press jenner
Topic #4: stein risk murder bush recount book prisonplanet years jill 2015 jerusalem prison field takes incident elected gets care longer officer
Topic #5: rt https pence didn people plan help breaking lost early mi