# Twitter Classifier

## Description
For this project, I will build a Twitter classifier.  I pull the most recent ~3200 tweets from four Twitter handles: realDonaldTrump, justinbieber, hillaryclinton, and katyperry.  I then use a variety of supervised and unsupervised models to classify the tweets.  

In [1]:
import tweepy

%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter
from sklearn.model_selection import train_test_split


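# Credentials are read from environment variables instead of being hard-coded,
# so that secrets are not stored in the notebook. The variable names below are
# an assumption; use whatever names your environment defines.
import os

consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']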

In [2]:
# Thank you, yanofsky! Adapted from: https://gist.github.com/yanofsky/5436496

def get_all_tweets(screen_names):

    #authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    
    #initialize a DataFrame to hold the tweets from every handle
    df = pd.DataFrame()
    
    for screen_name in screen_names:
        
        alltweets = []

        #make initial request for most recent tweets (200 is the maximum allowed count)
        new_tweets = api.user_timeline(screen_name = screen_name ,count=200)

        #save most recent tweets
        alltweets.extend(new_tweets)

        #save the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

        #keep grabbing tweets until there are no tweets left to grab
        while len(new_tweets) > 0:
            print("getting tweets before %s" % (oldest))

            #all subsequent requests use the max_id param to prevent duplicates
            new_tweets = api.user_timeline(screen_name = screen_name,count=200,max_id=oldest)

            #save most recent tweets
            alltweets.extend(new_tweets)

            #update the id of the oldest tweet less one
            oldest = alltweets[-1].id - 1

            print("...%s tweets downloaded so far" % (len(alltweets)))

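        #transform the tweepy tweets into rows of [screen name, tweet text]
        outtweets = [[tweet.user.screen_name, tweet.text] for tweet in alltweets]

        #pd.concat is used because DataFrame.append was removed in pandas 2.0
        df = pd.concat([df, pd.DataFrame(data=outtweets)], ignore_index=True)

    #print before returning; a statement placed after return never runs
    print('done!')
    return df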
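
The `max_id` paging used in `get_all_tweets` can be tried without network access. The following is a minimal sketch under stated assumptions: `fetch_all`, `make_fake_timeline`, and `fetch_page` are hypothetical names, and `fetch_page` only imitates the `count`/`max_id` behaviour of `api.user_timeline` on an in-memory list. Unlike the cell above, it also stops cleanly when the very first page is empty.

```python
def fetch_all(fetch_page, page_size=200):
    """Collect items newest-first by repeatedly asking for ids <= max_id.

    fetch_page is a hypothetical stand-in for api.user_timeline: it takes
    (count, max_id) and returns a list of dicts with an integer 'id'.
    """
    collected = []
    max_id = None  # None means "start from the newest item"
    while True:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:  # empty page: nothing older is left
            break
        collected.extend(page)
        # Ask only for items strictly older than the oldest one seen so far.
        max_id = page[-1]['id'] - 1
    return collected


def make_fake_timeline(n_items):
    """Build an in-memory timeline with ids n_items..1 (newest first)."""
    timeline = [{'id': i} for i in range(n_items, 0, -1)]

    def fetch_page(count, max_id=None):
        eligible = [t for t in timeline if max_id is None or t['id'] <= max_id]
        return eligible[:count]

    return fetch_page


tweets_demo = fetch_all(make_fake_timeline(450), page_size=200)
print(len(tweets_demo))
```

Subtracting one from the oldest id before the next request is what prevents the last tweet of one page from being returned again as the first tweet of the next.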

In [3]:
# Getting the most recent ~3200 tweets from Donald Trump, Justin Bieber, Hillary Clinton, 
#  and Katy Perry
tweets = get_all_tweets(['realDonaldTrump','justinbieber','hillaryclinton','katyperry'])

#tweets = get_all_tweets(['acupofjoanne'])

getting tweets before 985489930343321599
...400 tweets downloaded so far
getting tweets before 972835128056664065
...600 tweets downloaded so far
getting tweets before 961693860916289535
...800 tweets downloaded so far
getting tweets before 949619270631256063
...1000 tweets downloaded so far
getting tweets before 938752267611721727
...1197 tweets downloaded so far
getting tweets before 928769154345324543
...1397 tweets downloaded so far
getting tweets before 921319017826091007
...1596 tweets downloaded so far
getting tweets before 914089003745468416
...1796 tweets downloaded so far
getting tweets before 907579024960098303
...1996 tweets downloaded so far
getting tweets before 897783159038910465
...2196 tweets downloaded so far
getting tweets before 889579795176181760
...2396 tweets downloaded so far
getting tweets before 880017678978736128
...2595 tweets downloaded so far
getting tweets before 868840252227674112
...2795 tweets downloaded so far
getting tweets before 854547423464759295


In [4]:
tweets[0].value_counts()

realDonaldTrump    3220
katyperry          3218
HillaryClinton     3203
justinbieber       3167
Name: 0, dtype: int64

In [5]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12808 entries, 0 to 12807
Data columns (total 2 columns):
0    12808 non-null object
1    12808 non-null object
dtypes: object(2)
memory usage: 200.2+ KB


In [6]:
tweets.columns = ['screenname','tweet']

In [7]:
tweets.head()

Unnamed: 0,screenname,tweet
0,realDonaldTrump,"RT @WhiteHouse: ""Finally, I want to deliver a ..."
1,realDonaldTrump,"RT @WhiteHouse: ""At the heart of the Iran deal..."
2,realDonaldTrump,Statement on the Iran Nuclear Deal: https://t....
3,realDonaldTrump,John Kerry can’t get over the fact that he had...
4,realDonaldTrump,"I will be speaking to my friend, President Xi ..."


In [8]:
tweets['tweet'] = tweets['tweet'].astype(str)

In [9]:
def text_cleaner(tweet):
    tweet = re.sub(r'http.*','',tweet)
    tweet = re.sub(r'bit\.ly.*', "", tweet)  # match bit.ly short links
    tweet = re.sub(r'b\'', "", tweet)
    tweet = re.sub(r'b"', "", tweet)
    return tweet

In [10]:
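# Clean every tweet; .apply avoids chained-assignment writes inside a Python loop.
tweets['tweet'] = tweets['tweet'].apply(text_cleaner)

#Dropping retweets (note: this removes any tweet containing "RT" anywhere in its text)
tweets = tweets[~tweets['tweet'].str.contains("RT")].reset_index(drop=True)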
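
The retweet filter above matches the substring "RT" anywhere in a tweet, so it can also drop original tweets that merely contain those letters. The sketch below illustrates the difference on a few made-up strings; the stricter `^RT @` pattern is a hypothetical alternative, not what the notebook ran, and it assumes retweets begin with "RT @" as in the rows shown earlier.

```python
import re

samples = [
    "RT @WhiteHouse: a retweeted status",
    "Thank you for your SUPPORT!",
    "Great ART show tonight",
    "An ordinary tweet",
]

# Substring test, as in the filter above: flags any tweet containing "RT".
substring_hits = [s for s in samples if "RT" in s]

# Stricter, hypothetical alternative: only tweets that start with "RT @".
retweet_pattern = re.compile(r'^RT @')
prefix_hits = [s for s in samples if retweet_pattern.search(s)]

print(len(substring_hits))
print(len(prefix_hits))
```

Switching patterns would change how many rows survive the filter, so the row counts reported later in this notebook apply only to the substring version.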
    

In [11]:
tweets.head(20)

Unnamed: 0,screenname,tweet
0,realDonaldTrump,Statement on the Iran Nuclear Deal:
1,realDonaldTrump,John Kerry can’t get over the fact that he had...
2,realDonaldTrump,"I will be speaking to my friend, President Xi ..."
3,realDonaldTrump,"Gina Haspel, my highly respected nominee to le..."
4,realDonaldTrump,I will be announcing my decision on the Iran D...
5,realDonaldTrump,National Prescription Drug #TakeBackDay number...
6,realDonaldTrump,The United States does not need John Kerry’s p...
7,realDonaldTrump,Is this Phony Witch Hunt going to go on even l...
8,realDonaldTrump,"Lisa Page, who may hold the record for the mos..."
9,realDonaldTrump,"Good luck to Ric Grenell, our new Ambassador t..."


Helpful link about disabling pipelines: https://spacy.io/usage/processing-pipelines#disabling

In [12]:
# Concatenating all tweets into one text and parsing.
nlp = spacy.load('en', disable=['parser'])

text = tweets['tweet'].str.cat()
text = nlp(text)

In [13]:
# Parse tweets
nlp = spacy.load('en')

parsed = []

for r in tweets['tweet']:
    p = nlp(r)
    parsed.append(p)
    
    

In [14]:
tweets['parsed'] = parsed

In [15]:
tweets.head()

Unnamed: 0,screenname,tweet,parsed
0,realDonaldTrump,Statement on the Iran Nuclear Deal:,"(Statement, on, the, Iran, Nuclear, Deal, :)"
1,realDonaldTrump,John Kerry can’t get over the fact that he had...,"(John, Kerry, ca, n’t, get, over, the, fact, t..."
2,realDonaldTrump,"I will be speaking to my friend, President Xi ...","(I, will, be, speaking, to, my, friend, ,, Pre..."
3,realDonaldTrump,"Gina Haspel, my highly respected nominee to le...","(Gina, Haspel, ,, my, highly, respected, nomin..."
4,realDonaldTrump,I will be announcing my decision on the Iran D...,"(I, will, be, announcing, my, decision, on, th..."


# Bag of Words
Let's use bag of words! For each tweet, we will count how many times each word appears and use those counts as features.  
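
As a toy illustration of the idea (not the spaCy-based implementation used below), here is a minimal sketch that builds a vocabulary from the most common words and turns each document into a vector of counts. The three example documents are made up, and whitespace splitting stands in for real tokenization and lemmatization.

```python
from collections import Counter

docs = [
    "thank you thank you",
    "great thank you",
    "thank great people",
]

# Vocabulary: the most common words across all documents.
all_words = [w for d in docs for w in d.split()]
vocab = [w for w, _ in Counter(all_words).most_common(3)]

# One count vector per document, in vocabulary order.
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
print(vectors)
```

Words outside the vocabulary ("people" here) are simply ignored, which is also what happens to uncommon words in the feature builder below.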

In [16]:
# Utility function to create a list of the 1000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(1000)]
    

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(tweets, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = tweets['parsed']
    df['text_source'] = tweets['screenname']
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 500 == 0:
            print("Processing row {}".format(i))
            
    return df


In [17]:
# Finding the top 1000 common words in all the tweets.

common_words = bag_of_words(text)

In [18]:
common_words

['-PRON-',
 'be',
 'trump',
 'not',
 'great',
 'amp',
 '\n',
 'the',
 'thank',
 "'s",
 'good',
 'people',
 'hillary',
 'time',
 'president',
 'today',
 'day',
 'go',
 'get',
 'america',
 'love',
 '\n\n',
 '️',
 'year',
 'want',
 'vote',
 ' ',
 'donald',
 'u',
 'new',
 'come',
 'work',
 'country',
 'big',
 'tax',
 'like',
 '’',
 'let',
 'have',
 'make',
 '❤',
 'news',
 'woman',
 'know',
 'this',
 'need',
 'right',
 '’s',
 'job',
 'will',
 'say',
 'tonight',
 'family',
 'look',
 'a',
 'fake',
 'american',
 'election',
 '❗',
 'help',
 'see',
 'live',
 '🏼',
 'watch',
 'thing',
 'honor',
 'night',
 'do',
 '🇺',
 'if',
 '—hillary',
 'take',
 'win',
 'man',
 'what',
 'way',
 'u.s.',
 '🇸',
 'pay',
 'to',
 'world',
 'in',
 '👁',
 'life',
 'campaign',
 'believe',
 'state',
 'talk',
 'think',
 'tomorrow',
 'join',
 'democrats',
 'purpose',
 'house',
 'clinton',
 'happy',
 'million',
 'plan',
 'hard',
 'stand',
 'ready',
 'so',
 'friend',
 'united',
 '✨',
 'high',
 'strong',
 'cut',
 'just',
 'mean'

In [19]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(tweets, common_words)
word_counts.head()

Processing row 0
Processing row 500
Processing row 1000
Processing row 1500
Processing row 2000
Processing row 2500
Processing row 3000
Processing row 3500
Processing row 4000
Processing row 4500
Processing row 5000
Processing row 5500
Processing row 6000
Processing row 6500
Processing row 7000
Processing row 7500
Processing row 8000
Processing row 8500
Processing row 9000
Processing row 9500
Processing row 10000


Unnamed: 0,-PRON-,be,trump,not,great,amp,Unnamed: 7,the,thank,'s,...,@hillaryclinton,˚,primary,demand,receive,michigan,d.c.,pour,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Statement, on, the, Iran, Nuclear, Deal, :)",realDonaldTrump
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(John, Kerry, ca, n’t, get, over, the, fact, t...",realDonaldTrump
2,1,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,"(I, will, be, speaking, to, my, friend, ,, Pre...",realDonaldTrump
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Gina, Haspel, ,, my, highly, respected, nomin...",realDonaldTrump
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(I, will, be, announcing, my, decision, on, th...",realDonaldTrump


In [20]:
word_counts.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10227 entries, 0 to 10226
Columns: 1002 entries, -PRON- to text_source
dtypes: object(1002)
memory usage: 78.2+ MB


In [21]:
#word_counts.to_csv('tweet_word_counts.csv')


### Models for BoW

In [22]:
# Train-test split
from sklearn import ensemble
from sklearn.model_selection import train_test_split

rfc = ensemble.RandomForestClassifier()
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0,
                                                    stratify=Y
                                                   )

In [23]:
y_train.value_counts()

realDonaldTrump    1715
katyperry          1708
HillaryClinton     1514
justinbieber       1199
Name: text_source, dtype: int64

In [24]:
y_test.value_counts()

realDonaldTrump    1143
katyperry          1139
HillaryClinton     1009
justinbieber        800
Name: text_source, dtype: int64

#### Random forest

In [25]:
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.9488265971316818

Test set score: 0.7311170862869714


In [26]:
from sklearn.model_selection import cross_val_score
cross_val_score(rfc, X, Y, cv=5)

array([0.63067904, 0.73522228, 0.71847507, 0.73727984, 0.69358786])

In [27]:
y_pred = rfc.fit(X_train, y_train).predict(X_train)

In [28]:
from collections import Counter
Counter(y_pred)

Counter({'HillaryClinton': 1488,
         'justinbieber': 1067,
         'katyperry': 1887,
         'realDonaldTrump': 1694})

#### Multinomial Logistic Regression

In [29]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(6136, 1000) (6136,)
Training set score: 0.8864080834419817

Test set score: 0.8127597164507455


In [30]:
cross_val_score(lr, X, Y, cv=5)

array([0.71812408, 0.81875916, 0.80938416, 0.82485323, 0.79001468])

#### Gradient Boost

In [31]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))


Training set score: 0.7865058670143416

Test set score: 0.7450501099975556


In [32]:
cross_val_score(clf, X, Y, cv=5)

array([0.61553493, 0.74059599, 0.7487781 , 0.78571429, 0.72589329])

#### SVM

In [33]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Code adapted from http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
# Set the parameters by cross-validation
def gridsearch(X_train, y_train, X_test, y_test):
    
    tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9],
                     'C': [1, 10, 100, 1000, 10000]}]

    scores = ['precision', 'recall']

    for score in scores:
        print("# Tuning hyper-parameters for %s" % score)
        print()

        clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                           scoring='%s_macro' % score)
        clf.fit(X_train, y_train)

        print("Best parameters set found on development set:")
        print()
        print(clf.best_params_)
        print()
        print("Grid scores on development set:")
        print()
        means = clf.cv_results_['mean_test_score']
        stds = clf.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, clf.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean, std * 2, params))
        print()

        print("Detailed classification report:")
        print()
        print("The model is trained on the full development set.")
        print("The scores are computed on the full evaluation set.")
        print()
        y_true, y_pred = y_test, clf.predict(X_test)
        print(classification_report(y_true, y_pred))
        print()

In [34]:
# To speed up the grid search, we take the X and Y test dataset, split it again into
#  training and test subsets, and tune on those.  A smaller dataset should make the search run faster.
X2_train, X2_test, y2_train, y2_test = train_test_split(X_test, y_test, test_size=0.4,
                                                    random_state=0, stratify=y_test)
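
To see why a smaller dataset helps, it is worth counting how many models the search has to fit. The sketch below is plain arithmetic over the grid values copied from `gridsearch()`; it does not run any model, and the variable names are illustrative.

```python
from itertools import product

# Grid values copied from the gridsearch() definition above.
gammas = [1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9]
costs = [1, 10, 100, 1000, 10000]
n_folds = 5
n_scores = 2  # 'precision' and 'recall' are tuned in separate passes

combinations = list(product(costs, gammas))
fits_per_score = len(combinations) * n_folds
total_fits = fits_per_score * n_scores  # excludes the final refit per pass

print(len(combinations), fits_per_score, total_fits)
```

Every one of those fits trains an RBF-kernel SVM, so the number of training rows has a large effect on total run time.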

In [35]:
y2_train.value_counts(1)

realDonaldTrump    0.279544
katyperry          0.278321
HillaryClinton     0.246536
justinbieber       0.195599
Name: text_source, dtype: float64

In [37]:
#Show warning once
import warnings
warnings.filterwarnings('once')

gridsearch(X2_train, y2_train, X2_test, y2_test)

# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}

Grid scores on development set:

0.270 (+/-0.200) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.070 (+/-0.000) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.070 (+/-0.000) for {'C': 1, 'gamma': 1e-05, 'kernel': 'rbf'}
0.070 (+/-0.000) for {'C': 1, 'gamma': 1e-06, 'kernel': 'rbf'}
0.070 (+/-0.000) for {'C': 1, 'gamma': 1e-07, 'kernel': 'rbf'}
0.070 (+/-0.000) for {'C': 1, 'gamma': 1e-08, 'kernel': 'rbf'}
0.070 (+/-0.000) for {'C': 1, 'gamma': 1e-09, 'kernel': 'rbf'}
0.724 (+/-0.051) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.270 (+/-0.200) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.070 (+/-0.000) for {'C': 10, 'gamma': 1e-05, 'kernel': 'rbf'}
0.070 (+/-0.000) for {'C': 10, 'gamma': 1e-06, 'kernel': 'rbf'}
0.070 (+/-0.000) for {'C': 10, 'gamma': 1e-07, 'kernel': 'rbf'}
0.070 (+/-0.000) for {'C': 10, 'gamma': 1e-08, 'kernel': 'rbf'}
0.070 (+/-0.000) for {'C': 10, 'gamma': 1e-09, '

In [40]:
svc = SVC(C=100, gamma=0.001, kernel='rbf')

svc.fit(X_train, y_train)


print('Training set score:', svc.score(X_train, y_train))
print('\nTest set score:', svc.score(X_test, y_test))

Training set score: 0.8688070404172099

Test set score: 0.781715961867514


In [42]:
cross_val_score(svc, X, Y, cv=5)

array([0.6790425 , 0.79970689, 0.79374389, 0.82387476, 0.77141459])

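Of the different models built on BoW, logistic regression has the highest test-set accuracy (about 0.81) and the highest cross-validation scores.  Its gap between training and test accuracy (0.89 vs. 0.81) is much smaller than the random forest's (0.95 vs. 0.73), so it overfits far less, although gradient boosting shows the smallest gap of all (0.79 vs. 0.75) at a lower accuracy.  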
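
One rough way to compare overfitting across the four models is the gap between training and test accuracy. The sketch below only does arithmetic on the scores printed in the cells above; the dictionary layout and names are illustrative.

```python
# Scores copied from the outputs above: (training accuracy, test accuracy).
scores = {
    'random forest': (0.9488265971316818, 0.7311170862869714),
    'logistic regression': (0.8864080834419817, 0.8127597164507455),
    'gradient boosting': (0.7865058670143416, 0.7450501099975556),
    'SVM (rbf)': (0.8688070404172099, 0.781715961867514),
}

gaps = {name: train - test for name, (train, test) in scores.items()}
best_test = max(scores, key=lambda name: scores[name][1])
smallest_gap = min(gaps, key=gaps.get)

# Print the models from smallest to largest train-test gap.
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print('{:<20} test={:.3f} gap={:.3f}'.format(name, scores[name][1], gap))
```

A small gap alone is not enough to prefer a model; it has to be read together with the test accuracy itself.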

## Latent Semantic Analysis
What if we don't have information on the handle that a tweet belongs to?  How could the tweets be categorized?  For this unsupervised learning problem, I will use Latent Semantic Analysis to generate clusters of terms that each reflect a topic.  First, I will use tf-idf to convert the tweets into vectors.  Then I will apply dimension reduction (Singular Value Decomposition, SVD) to reduce the feature space and generate the clusters.
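
To make the tf-idf step concrete, here is a minimal hand-rolled sketch on three made-up, already tokenized documents. It assumes the smoothed weighting `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is how scikit-learn documents its defaults for `smooth_idf=True` and `norm='l2'`; the function and variable names are illustrative.

```python
import math
from collections import Counter

docs = [
    ['thank', 'you', 'great'],
    ['great', 'country'],
    ['thank', 'you', 'thank'],
]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Smoothed tf-idf with L2 normalisation for one tokenized document."""
    counts = Counter(doc)
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}


weights = tfidf(docs[1])
print(weights)
```

In the second document, "country" gets a higher weight than "great" because it appears in fewer documents, which is the behaviour the idf term is meant to produce.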

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [63]:
X_train, X_test = train_test_split(tweets['tweet'], test_size=0.4, random_state=0)

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the tweets
                             min_df=3, # only use words that appear at least three times
                             stop_words='english', 
                             lowercase=True, #convert everything to lower case (since Donald Trump has the HABIT of CAPITALIZING WORDS for EMPHASIS)
                             use_idf=True,#we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', #Applies a correction factor so that longer tweets and shorter tweets get treated equally
                             smooth_idf=True #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                            )


#Applying the vectorizer
tweets_tfidf=vectorizer.fit_transform(tweets['tweet'])
print("Number of features: %d" % tweets_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(tweets_tfidf, test_size=0.4, random_state=0)


#Reshapes the vectorizer output into something people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()

#number of tweets
n = X_train_tfidf_csr.shape[0]
print('number of tweets: %d' %n)

#A list of dictionaries, one per tweet
tfidf_bytweet = [{} for _ in range(0,n)]

#List of features
terms = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0

#for each tweet, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bytweet[i][terms[j]] = X_train_tfidf_csr[i, j]

Number of features: 4116
number of tweets: 6136


In [71]:
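#With smooth_idf=True, scikit-learn computes idf as ln((1 + n) / (1 + df)) + 1, so every word present in a tweet
# gets a nonzero tf-idf score; words that are absent from the tweet score 0 and are not listed in the dictionary.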
print('Original sentence:', X_train.iloc[3])
print('Tf_idf vector:', tfidf_bytweet[3])

Original sentence: Make sure @realDonaldTrump's bullying never reaches the White House. Chip in now: 
Tf_idf vector: {'bullying': 0.4443024245049827, 'realdonaldtrump': 0.3651346305739997, 'chip': 0.3632204031774269, 'reaches': 0.465644263823698, 'sure': 0.3017629775755063, 'make': 0.23353437369164426, 'house': 0.2857506743178306, 'white': 0.30539444856515924}


Great! Let's apply SVD and see how the tweets are classified.

In [83]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

#Our SVD data reducer.  We are going to reduce the feature space to 200.
svd= TruncatedSVD(200)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf)

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100, '\n')

#Looking at which tweets our solution considers similar, for the first ten components
paras_by_component=pd.DataFrame(X_train_lsa,index=X_train)
for i in range(10):
    print('Component {}:'.format(i))
    print(paras_by_component.loc[:,i].sort_values(ascending=False)[0:10])
    print('\n')


Percent variance captured by all components: 33.40365749788838 

Component 0:
tweet
@billboard :) thanks. Thank you #beliebers love you                                             0.723914
And thanks for all the great musicians who inspire me everyday. Thank you. I love music         0.653585
#PurposeTourGlendale another great show. Thank u                                                0.600435
Thanks @PeoplesChoice                                                                           0.586154
THANK YOU to all of the great volunteers helping out with #HurricaneHarvey relief in Texas!     0.585894
thanks @mtvema :) \n#EMABiggestFansJustinBieber                                                 0.585516
Thanks Barry                                                                                    0.585280
See more on my @OfficialFahlo #cosmo thanks                                                     0.585280
@officialellenk thanks                                                      

Using Latent Semantic Analysis with 200 components, we are able to capture about 33% of the variance in the tf-idf features.  The first three clusters have to do with thanks.  The fourth cluster captures tweets with the word "great"; Make America Great Again tweets make up a good portion of this cluster.  The fifth cluster is built around "love".  The sixth cluster contains tweets with ampersands (&).  The seventh cluster is made up of "lol" tweets.  The eighth and ninth clusters contain the word "purpose", which is the title of Justin Bieber's new album.  The final cluster is about American Idol, which currently stars Katy Perry. 

In summary, some of the clusters do focus more on one Twitter handle than on the others.  The "great" cluster is oriented toward Donald Trump and MAGA.  #Purpose is mostly Justin Bieber.  #AmericanIdol is Katy Perry.  

# Word2vec
Word2vec is a widely used unsupervised neural network approach for NLP.  It converts words to vectors using a distributed representation, where each word is represented by many neurons and each neuron contributes to the representation of many words.  Word2vec starts by assigning a vector of random values to each word W, then adjusts the vectors based on the words that appear around W in a sentence.  Words that tend to appear near W end up with vectors that are close together, while words that rarely appear near W end up with vectors that are far apart.  Word2vec is a good fit for tweets because tweets can express the same concept in many different ways (e.g. expressing thanks).
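
The "words around W" idea can be made concrete with the training pairs used by the skip-gram variant of word2vec. The sketch below only extracts (center, context) pairs from one tokenized sentence; it does not train any vectors, and `skipgram_pairs` and its `window` parameter are illustrative names rather than part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(['thank', 'you', 'so', 'much'], window=1)
print(pairs)
```

During training, the vectors of the two words in each pair are nudged toward each other, which is why words that share contexts end up close together.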

We do need a larger corpus when using word2vec.  So first, let's get more tweets!

In [111]:
# Getting more tweets from Ellen DeGeneres, Barack Obama, Rihanna, Senator John McCain, and Pope Francis
more_twts = get_all_tweets(['TheEllenShow','BarackObama','Rihanna','senjohnmccain','pontifex'])

getting tweets before 984497357092827135
...400 tweets downloaded so far
getting tweets before 973616503151980543
...600 tweets downloaded so far
getting tweets before 962084589329002495
...800 tweets downloaded so far
getting tweets before 956228551983841284
...1000 tweets downloaded so far
getting tweets before 948732513190465536
...1200 tweets downloaded so far
getting tweets before 938204950361804799
...1400 tweets downloaded so far
getting tweets before 928795784740401152
...1600 tweets downloaded so far
getting tweets before 921132488625340415
...1800 tweets downloaded so far
getting tweets before 912786414751723519
...2000 tweets downloaded so far
getting tweets before 904453885787389951
...2200 tweets downloaded so far
getting tweets before 880153610738311167
...2400 tweets downloaded so far
getting tweets before 864178069061357567
...2600 tweets downloaded so far
getting tweets before 854073711732727807
...2800 tweets downloaded so far
getting tweets before 841703160213200895
...3000 tweets downloaded so far
getting tweets before 830092662245986303
...3200 tweets downloaded so far
getting tweets before 820806326531932159
...3226 tweets downloaded so far
getting tweets before 819252834965164032
...3226 tweets downloaded so far
getting tweets before 782250219027128319
...400 tweets downloaded so far
getting tweets before 759058522453659652
...600 tweets downloaded so far
getting tweets before 732313014720860159
...800 tweets downloaded so far
getting tweets before 710577503329341439
...1000 tweets downloaded so far
getting tweets before 688385432032133119
...1200 tweets downloaded so far
getting tweets before 675059752959926272
...1400 tweets downloaded so far
getting tweets before 654315653671727103
...1600 tweets downloaded so far
getting tweets before 631925544141979647
...1800 tweets downloaded so far
getting tweets before 618147962242252800
...2000 tweets downloaded so far
getting tweets before 603942876301557759
...2200 tweets downloaded so far
getting tweets before 585486104272445439
...2399 tweets downloaded so far
getting tweets before 566675009628688385
...2599 tweets downloaded so far
getting tweets before 551829212370575359
...2798 tweets downloaded so far
getting tweets before 535872666763137023
...2996 tweets downloaded so far
getting tweets before 520603040382849023
...3196 tweets downloaded so far
getting tweets before 506514154594004991
...3211 tweets downloaded so far
getting tweets before 504647157526183936
...3211 tweets downloaded so far
getting tweets before 878366661224550399
...399 tweets downloaded so far
getting tweets before 695322347180457983
...595 tweets downloaded so far
getting tweets before 562468011416637440
...794 tweets downloaded so far
getting tweets before 529690459270967295
...994 tweets downloaded so far
getting tweets before 488571123031085056
...1191 tweets downloaded so far
getting tweets before 478575175508557824
...1389 tweets downloaded so far
getting tweets before 434840507559067647


...1583 tweets downloaded so far
getting tweets before 392910613380607999
...1780 tweets downloaded so far
getting tweets before 381165979772125183
...1978 tweets downloaded so far
getting tweets before 365543437325451263


...2175 tweets downloaded so far
getting tweets before 349150331931852799
...2370 tweets downloaded so far
getting tweets before 333817717910011903
...2569 tweets downloaded so far
getting tweets before 315879839481610240


...2760 tweets downloaded so far
getting tweets before 297758860435935233
...2959 tweets downloaded so far
getting tweets before 283132850021203967


...3153 tweets downloaded so far
getting tweets before 270459671675015167
...3190 tweets downloaded so far
getting tweets before 268363645673680895
...3190 tweets downloaded so far


getting tweets before 953395261371494399


...400 tweets downloaded so far
getting tweets before 930092897894092800
...600 tweets downloaded so far
getting tweets before 918085781322977279


...800 tweets downloaded so far
getting tweets before 908027343293296639
...1000 tweets downloaded so far
getting tweets before 896015893137956863
...1200 tweets downloaded so far
getting tweets before 885120099706970112


...1400 tweets downloaded so far
getting tweets before 870361148738211839


...1600 tweets downloaded so far
getting tweets before 854796024367517701
...1800 tweets downloaded so far
getting tweets before 844318544380788736


...2000 tweets downloaded so far
getting tweets before 827569427910684673
...2200 tweets downloaded so far
getting tweets before 808416594271543303


...2400 tweets downloaded so far
getting tweets before 785861325759062015
...2600 tweets downloaded so far
getting tweets before 770028695629209599
...2800 tweets downloaded so far
getting tweets before 753965400237506560


...3000 tweets downloaded so far
getting tweets before 740910420299517951
...3200 tweets downloaded so far
getting tweets before 727601063189225473
...3212 tweets downloaded so far
getting tweets before 726147260879364095


...3212 tweets downloaded so far
getting tweets before 930050117737914367


...400 tweets downloaded so far
getting tweets before 854295948985475071
...600 tweets downloaded so far
getting tweets before 782905543472013311


...800 tweets downloaded so far
getting tweets before 716188469559566337
...1000 tweets downloaded so far
getting tweets before 611478471131316224


...1200 tweets downloaded so far
getting tweets before 487174462928736256
...1400 tweets downloaded so far
getting tweets before 384965928229695488
...1564 tweets downloaded so far
getting tweets before 313247631054864383
...1564 tweets downloaded so far




In [112]:
more_twts[0].value_counts()

TheEllenShow     3226
SenJohnMcCain    3212
BarackObama      3211
rihanna          3190
Pontifex         1564
Name: 0, dtype: int64

In [114]:
more_twts.columns = ['screenname','tweet']

In [115]:
more_twts['tweet'] = more_twts['tweet'].astype(str)


In [116]:
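# Clean every tweet in the new pull. The original loop ranged over the
# earlier `tweets` frame, whose length does not match `more_twts`.
more_twts['tweet'] = more_twts['tweet'].apply(text_cleaner)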

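# Drop retweets. Note: this removes every tweet that contains the
# upper-case substring "RT" anywhere, not only tweets that start with it.
more_twts = more_twts[~more_twts['tweet'].str.contains("RT")].reset_index(drop=True)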
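The substring check above also discards original tweets that merely contain the capital letters "RT" inside another word (for example "SMART" or "START"). A stricter alternative, sketched here with only the standard library, treats a tweet as a retweet only when it begins with a standalone `RT` token. The helper name `is_retweet` and the sample tweets are hypothetical, and the sketch assumes `text_cleaner` leaves a leading `RT` in place.

```python
import re

# Hypothetical pattern: a retweet conventionally starts with the token "RT".
RETWEET_PATTERN = re.compile(r"^\s*RT\b")


def is_retweet(tweet):
    """Return True if the tweet text starts with a standalone 'RT' token."""
    return bool(RETWEET_PATTERN.match(tweet))


# Made-up sample tweets, for illustration only.
samples = [
    "RT @someone: great news today",
    "What a SMART idea",
    "START of the tour tonight",
]

# Keep only the tweets that are not retweets.
originals = [t for t in samples if not is_retweet(t)]
print(originals)
```

On the dataframe itself, the same pattern could likely be applied with `more_twts['tweet'].str.match(r'^\s*RT\b')`, though this would change the row counts reported further down.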

In [119]:
# Parse tweets
nlp = spacy.load('en')

parsed = []

for r in more_twts['tweet']:
    p = nlp(r)
    parsed.append(p)



In [120]:
more_twts['parsed'] = parsed

In [121]:
# Adding tweets from previous pull to dataframe.
df = pd.concat([more_twts, tweets], axis=0)

In [122]:
df = df.reset_index(drop=True)
df.tail()

Unnamed: 0,screenname,tweet,parsed
23504,katyperry,Um...YES IT IS #magritte,"(Um, ..., YES, IT, IS, #, magritte)"
23505,katyperry,This could be us but ur playin' #magritte,"(This, could, be, us, but, ur, playin, ', #, m..."
23506,katyperry,"First things first, I'm surrealist. #magritte","(First, things, first, ,, I, 'm, surrealist, ...."
23507,katyperry,GO SEE THE MAGRITTE EXHIBIT artinstitutechi It...,"(GO, SEE, THE, MAGRITTE, EXHIBIT, artinstitute..."
23508,katyperry,Dear Jason @Starbucks on Ohio &amp; N State in...,"(Dear, Jason, @Starbucks, on, Ohio, &, amp, ;,..."


In [125]:
# Turn each parsed tweet into a list of lower-case lemmas,
# filtering out punctuation and stop words.
twts = []
for tweet in df['parsed']:
    tweet = [
        token.lemma_.lower()
        for token in tweet
        if not token.is_stop
        and not token.is_punct
    ]
    twts.append(tweet)


print(twts[10])
print('We have {} tweets and {} tokens.'.format(len(twts), len(text)))

We have 23509 tweets and 168978 tokens.


In [130]:
import gensim
from gensim.models import word2vec

model = word2vec.Word2Vec(
    twts,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3,   # Downsample very frequent words.
    size=300,      # Word vector length.
    hs=1           # Use hierarchical softmax.
)

print('done!')

done!
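The `min_count=10` setting means that any token appearing fewer than 10 times in the corpus is dropped before training, so it never receives a vector. The sketch below illustrates that pruning step on a toy corpus with a lower threshold. It is not gensim's implementation; `toy_tweets` and `MIN_COUNT` are made-up stand-ins for `twts` and the real setting.

```python
from collections import Counter

# Made-up tokenized tweets standing in for `twts`.
toy_tweets = [
    ["great", "rally", "tonight"],
    ["great", "show", "tonight"],
    ["great", "crowd"],
]

MIN_COUNT = 2  # the notebook uses min_count=10 on the real corpus

# Count how often each token appears across all tweets.
counts = Counter(token for tweet in toy_tweets for token in tweet)

# Keep only tokens that meet the frequency threshold.
kept = sorted(tok for tok, n in counts.items() if n >= MIN_COUNT)
total_tokens = sum(counts.values())

print(kept)
print(total_tokens)
```

Running the same count over `twts` is one way to check how much of the vocabulary survives a given threshold before training.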


In [131]:
# List of words in model.
vocab = model.wv.vocab.keys()

print(vocab)



In [144]:
print(model.wv.most_similar(positive=['crooked', 'honest'], negative=['dishonest']))

# Similarity is the cosine of the angle between two word vectors:
# 1 means the vectors point the same way, 0 means they are unrelated,
# and negative values (down to -1) are possible.
print(model.wv.similarity('💄', '👠'))
print(model.wv.similarity('repeal', 'delay'))
print(model.wv.similarity('laugh', 'joy'))

# One of these things is not like the other...
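# Call doesnt_match on model.wv, consistent with the calls above;
# calling it on the model directly is deprecated.
print(model.wv.doesnt_match("💪 🙌 🍕 👏".split()))
print(model.wv.doesnt_match("phony dishonest crooked forgiveness".split()))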

[('michelle', 0.6871036291122437), ('difference', 0.6757922172546387), ('—@flotus', 0.6386368274688721), ('laugh', 0.6340092420578003), ('supporter', 0.6325311064720154), ('passion', 0.6190457344055176), ('invite', 0.609879732131958), ('quit', 0.6039540767669678), ('—@flotu', 0.6027598977088928), ('how', 0.6005479693412781)]
0.928660143849258
0.7308674041490782
0.5728832561326569
🍕
forgiveness
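To make the similarity scores and the "odd one out" results above easier to interpret, here is a simplified stand-alone sketch of both ideas. It uses made-up two-dimensional vectors (`toy_vectors`) rather than the trained 300-dimensional ones, and `odd_one_out` is a hypothetical helper that only approximates what gensim does: it picks the word whose vector is least similar to the mean of the group.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-D vectors, for illustration only.
toy_vectors = {
    "phony": [0.9, 0.1],
    "dishonest": [0.8, 0.2],
    "crooked": [1.0, 0.0],
    "forgiveness": [-0.1, 1.0],
}


def odd_one_out(words, vectors):
    """Return the word least similar to the mean vector of the group."""
    dims = len(vectors[words[0]])
    mean = [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]
    return min(words, key=lambda w: cosine(vectors[w], mean))


print(cosine(toy_vectors["phony"], toy_vectors["dishonest"]))
print(odd_one_out(list(toy_vectors), toy_vectors))
```

Because three of the toy vectors point in nearly the same direction, their mean does too, and the fourth word ends up with the lowest cosine against it.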


