
# NLP Using the Twitter API: Guided Lab


---


<img src="https://snag.gy/RNAEgP.jpg" width="600">

### Can we correctly identify which of these two old men tweeted what?

> *Note: this lab is intended to be a guided lab until the independent practice questions.*


## Goals
---

We are going to attempt to classify whether a tweet comes from Trump or Sanders.  This lab involves multiple steps:
- Create a developer account on Twitter
- Create a method to pull a list of tweets from the Twitter API
- Perform proper preprocessing on our text
- Engineer sentiment feature in our dataset using TextBlob
- Explore supervised classification techniques


## Twitter API Developer Registration
---

If you haven't registered a Twitter account yet, this is a requirement in order to have a "developer" account.

[Twitter Sign Up](https://twitter.com/signup)



## Create an "App"

---

![](https://snag.gy/HPBQbJ.jpg)

Go to Twitter and register an "app" [apps.twitter.com](https://apps.twitter.com/).

> **Note**: For the required website field you can put a placeholder.

After you set up our app, you will only need to reference the corresponding keys Twitter generates for our app.  These are the keys that we will use with our application to communicate with the Twitter API.

## Install Python Twitter API library

---

Someone was nice enough to build a Python libary for us. It makes pulling tweets simple: we only need to plug in our keys and start collecting data. The library we will be using is provided by [Python Twitter Tools](http://mike.verdone.ca/twitter/).

To install it, just run the next frame (there is no conda package).

In [1]:
!pip install twitter python-twitter



## Some Boring Twitter Rules
---

**Twitter notifies you they will rate limit your requests:**

>THERE ARE LIMITS TO THE AMOUNT OF TIMES YOU CAN HIT THE API PER 15 MINUTE WINDOW. BEWARE!

Here's a quick overview of what Twitter says are "[the rules](https://dev.twitter.com/rest/public/rate-limiting)":

![](https://snag.gy/yJ6vIH.jpg)


## Our Application Keys
---

Take note of your application keys you will use to connect to Twitter and mine tweets from the official Bernie Sanders and Donald Trump twitter accounts:

![](https://snag.gy/H1djQK.jpg)

## `TweetMiner` class structure

---

The following code will get you up and running, providing connectivity to twitter. The class has the ability to make requests and can eventually transform the JSON responses into DataFrames.

This is a great example of using object-oriented Python to organize our code!

> **Note:** "request_limit" is used in this class to limit the number of tweets that are pulled per instance request.  Setting it to something lower until you've worked the bugs out of your request, and captured the data you want, is essential to avoiding the rate limit blocks.

### Twitter API key setup

Fill the information below in with the keys for your account.

- **consumer_key** - Find this in your app page under the "Keys and Access Tokens"
- **consumer_secret** - Right under **consumer_key** in the "Keys and Access Tokens" tab
- **access_token_key** - You will need to click the button to generate tokens to get this
- **access_token_secret** - Also available after you generate tokens


In [1]:
import twitter, re, datetime, pandas as pd

twitter_keys = {
    'consumer_key':        'gNh9ECBWkSJlxMDWQDhSq4zGA',
    'consumer_secret':     'anVMk1BJd1ai58MOPpItIVIM3kpLu8OBgw01ZBn3KSN1espRlL',
    'access_token_key':    '898996122479468545-atpLi5baU2fmCyddsa8bEMUD70tNRaE',
    'access_token_secret': 'fL5qxmKYn39VZDYxNVG3XgZLJXE1OqWp4MvqGKwyCaiO7'
}

api = twitter.Api(
    consumer_key         =   twitter_keys['consumer_key'],
    consumer_secret      =   twitter_keys['consumer_secret'],
    access_token_key     =   twitter_keys['access_token_key'],
    access_token_secret  =   twitter_keys['access_token_secret']
)

In [179]:
## We've build a function below that uses this python twitter package. Before we work with the function, let's
## take a look at some of the fundamental pieces of the package.

# This call will fetch the most recent tweet "statuses". Tweet statuses are objects that hold tweet id's,
# tweet text, tweeter name, when a tweet was created etc. 
# Here, we explicitly define that we'd like to pull the last 20 tweets for Bernie Sanders. 
x = api.GetUserTimeline(screen_name="berniesanders", count=20)
# Below we've pulled in a max_id parameter. 
# If max Id pertains to a tweet made 40 tweets ago, we'll pull tweet tweets from 40 to 60 tweets ago (since count=20).
y = api.GetUserTimeline(screen_name="berniesanders", count=20, max_id=900003837913817088)

In [180]:
# This is what the Get User Timeline returns: A list of status objects.
y

[Status(ID=900003837913817088, ScreenName=BernieSanders, Created=Tue Aug 22 14:36:50 +0000 2017, Text=u'LIVE: Join Bernie for a town hall in Ohio to talk about our priorities on health care, wages and infrastructure: https://t.co/6wkxdUxVbN'),
 Status(ID=899730088450719747, ScreenName=BernieSanders, Created=Mon Aug 21 20:29:03 +0000 2017, Text=u'RT @evanmcmurry: .@BernieSanders in Indianapolis: "We\'re not going to rebuild the shrinking middle class unless we rebuild the trade union\u2026'),
 Status(ID=899729416569356288, ScreenName=BernieSanders, Created=Mon Aug 21 20:26:23 +0000 2017, Text=u'RT @ABCPolitics: .@BernieSanders: "What an embarrassment it is that we have a president who cannot condemn in unequivocal terms" white supr\u2026'),
 Status(ID=899727519460593665, ScreenName=BernieSanders, Created=Mon Aug 21 20:18:51 +0000 2017, Text=u'We defeated the disastrous Republican health care plan. Now we must work together to guarantee health care to all. https://t.co/4uGxKpz73S'),
 Sta

In [184]:
# Individual status obj
y[0].created_at

u'Tue Aug 22 14:36:50 +0000 2017'

In [5]:
# Checking Type
type(y[0])

twitter.models.Status

In [6]:
# Here we'll take a look at some of the attributes of the Status object (which we use in our function)
print(y[0].id, y[0].user.name, y[0].retweet_count, y[0].text, y[0].created_at)

(900003837913817088, u'Bernie Sanders', 391, u'LIVE: Join Bernie for a town hall in Ohio to talk about our priorities on health care, wages and infrastructure: https://t.co/6wkxdUxVbN', u'Tue Aug 22 14:36:50 +0000 2017')


In [178]:
y[0]

0    0
0    1
0    2
Name: handle, dtype: int64

In [190]:
# Here we create a Tweetminer class that initializes the twitter api with our login credentials.
# Additionally, we define a method, mine_user_tweets that allows us to call the GetUserTimeline method
# for a user seamlessly. 

class TweetMiner(object):

    result_limit    =   20    
    api             =   False
    data            =   []
    
    twitter_keys = {
        'consumer_key':        'gNh9ECBWkSJlxMDWQDhSq4zGA',
        'consumer_secret':     'anVMk1BJd1ai58MOPpItIVIM3kpLu8OBgw01ZBn3KSN1espRlL',
        'access_token_key':    '898996122479468545-atpLi5baU2fmCyddsa8bEMUD70tNRaE',
        'access_token_secret': 'fL5qxmKYn39VZDYxNVG3XgZLJXE1OqWp4MvqGKwyCaiO7'
    }
    
    def __init__(self, keys_dict, api, result_limit = 20):
        
        self.api = api
        self.twitter_keys = keys_dict
        
        self.result_limit = result_limit
        

    def mine_user_tweets(self, user="michaelromanGA", mine_rewteets=False, max_pages=5, max_id=False):

        data           =  []
        last_tweet_id  =  False
        page           =  1
        
        while page <= max_pages:
            
            if last_tweet_id:
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.result_limit, max_id=max_id)        
            else:
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.result_limit)
                
            for item in statuses:

                mined = {
                    'tweet_id':        item.id,
                    'handle':          item.user.name,
                    'retweet_count':   item.retweet_count,
                    'text':            item.text,
                    'mined_at':        datetime.datetime.now(),
                    'created_at':      item.created_at,
                }
                
                last_tweet_id = item.id
                data.append(mined)
                
            page += 1
            
        return data

## Instantiate the class
---

Make sure you pass the keys dictionary and the api as arguments.

**Check:** call the object's `mine_user_tweets()` method, providing a user to pull the tweets of.

In [191]:
# result limit is the "count" argument we tuned earlier. 
miner = TweetMiner(twitter_keys, api, result_limit=2)

In [192]:
# We're pulling result_limit * max_pages results = 10 results for Trump and Sanders.
# sanders = miner.mine_user_tweets(user="berniesanders", max_pages=5)
donald = miner.mine_user_tweets(user="realDonaldTrump", max_pages=5, max_id=748920081942327296)

In [193]:
donald

[{'created_at': u'Fri Aug 25 19:18:49 +0000 2017',
  'handle': u'Donald J. Trump',
  'mined_at': datetime.datetime(2017, 8, 25, 16, 33, 33, 500144),
  'retweet_count': 4744,
  'text': u'I encourage everyone in the path of #HurricaneHarvey to heed the advice &amp; orders of their local and state officials. https://t.co/N6uEWCZUrv',
  'tweet_id': 901161964994539520},
 {'created_at': u'Fri Aug 25 16:02:33 +0000 2017',
  'handle': u'Donald J. Trump',
  'mined_at': datetime.datetime(2017, 8, 25, 16, 33, 33, 500174),
  'retweet_count': 5147,
  'text': u'Received a #HurricaneHarvey briefing this morning from Acting @DHSgov Secretary Elaine Duke, @FEMA_Brock,\u2026 https://t.co/VGdeIdgLbO',
  'tweet_id': 901112569322237952},
 {'created_at': u'Fri Jul 01 16:43:56 +0000 2016',
  'handle': u'Donald J. Trump',
  'mined_at': datetime.datetime(2017, 8, 25, 16, 33, 33, 670511),
  'retweet_count': 6727,
  'text': u"These crimes won't be happening if I'm elected POTUS. Killer should have never been her

In [11]:
print donald[0]

{'handle': u'Donald J. Trump', 'mined_at': datetime.datetime(2017, 8, 25, 11, 31, 53, 526864), 'created_at': u'Fri Aug 25 12:25:10 +0000 2017', 'tweet_id': 901057864516734978, 'text': u"Strange statement by Bob Corker considering that he is constantly asking me whether or not he should run again in '18. Tennessee not happy!", 'retweet_count': 7342}


### Convert the tweet ouputs to a pandas DataFrame

> *Hint: this is as easy as passing it to the DataFrame constructor!*

In [189]:
pd.DataFrame(donald)

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Fri Aug 25 19:18:49 +0000 2017,Donald J. Trump,2017-08-25 16:31:26.074230,4688,I encourage everyone in the path of #Hurricane...,901161964994539520
1,Fri Aug 25 16:02:33 +0000 2017,Donald J. Trump,2017-08-25 16:31:26.074241,5130,Received a #HurricaneHarvey briefing this morn...,901112569322237952
2,Fri Aug 25 15:46:40 +0000 2017,Donald J. Trump,2017-08-25 16:31:26.263565,8218,I have spoken w/ @GovAbbott of Texas and @Loui...,901108572041433089
3,Fri Aug 25 12:25:10 +0000 2017,Donald J. Trump,2017-08-25 16:31:26.263574,10789,Strange statement by Bob Corker considering th...,901057864516734978
4,Fri Aug 25 11:32:23 +0000 2017,Donald J. Trump,2017-08-25 16:31:26.449134,11904,"Nick Adams, ""Retaking America"" ""Best things o...",901044579750825985
5,Fri Aug 25 10:44:17 +0000 2017,Donald J. Trump,2017-08-25 16:31:26.449144,12530,"Few, if any, Administrations have done more in...",901032475111116800
6,Fri Aug 25 10:40:32 +0000 2017,Donald J. Trump,2017-08-25 16:31:26.628988,14187,General John Kelly is doing a fantastic job as...,901031532164468736
7,Fri Aug 25 10:33:32 +0000 2017,Donald J. Trump,2017-08-25 16:31:26.629000,13506,If Senate Republicans don't get rid of the Fil...,901029770401546243
8,Fri Aug 25 10:25:06 +0000 2017,Donald J. Trump,2017-08-25 16:31:27.038811,4257,RT @EricTrump: Honored to speak at the RNC Sum...,901027651216969728
9,Fri Aug 25 03:23:34 +0000 2017,Donald J. Trump,2017-08-25 16:31:27.038861,8163,RT @GregAbbott_TX: Spoke with Pres. Trump &amp...,900921565868687360


##  Create the training data

---

Let's get our "mined" data from the Twitter API.  

1. Mine Trump tweets
- Create a tweet DataFrame
- Mine Sanders tweets
- Append the results to our DataFrame

In [13]:
# we only need to "instantiate" once.  Then we can call mine_user_tweets as much as we want. Upping result limit
# from example above.
miner = TweetMiner(twitter_keys, api, result_limit=400)
trump_tweets = miner.mine_user_tweets("realDonaldTrump")

In [14]:
trump_df = pd.DataFrame(trump_tweets)
print trump_df.shape

(1000, 6)


In [15]:
bernie_tweets = miner.mine_user_tweets('berniesanders')

In [16]:
bernie_df = pd.DataFrame(bernie_tweets) 
print bernie_df.shape

(1000, 6)


In [17]:
tweets = pd.concat([trump_df, bernie_df], axis=0)
tweets.shape

(2000, 6)

## Any interesting ngrams going on with Trump?
---

Set up a vectorizer from sklearn and fit the text of Trump's tweets with an ngram range from 2 to 4. Figure out what the most common ngrams are.

> **Note:** It's up to you whether you want to remove stopwords or not. How does keeping or removing stopwords affect the results?

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,4), stop_words='english')

# Pulls all of trumps tweet text's into one giant string
summaries = "".join(trump_df['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'fake news', 59),
 (u'rt foxandfriends', 29),
 (u'north korea', 25),
 (u'united states', 21),
 (u'great honor', 20),
 (u'white house', 20),
 (u'america great', 20),
 (u'news media', 19),
 (u'make america', 18),
 (u'republican senators', 18),
 (u'fake news media', 18),
 (u'make america great', 16),
 (u'honor welcome', 15),
 (u'president trump', 14),
 (u'jobs jobs', 13),
 (u'repeal amp', 13),
 (u'repeal amp replace', 13),
 (u'amp replace', 13),
 (u'working hard', 12),
 (u'great state', 12)]

### Look at the ngrams for Bernie Sanders

In [19]:
# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,4), stop_words='english')

# Pulls all of trumps tweet text's into one giant string
summaries = "".join(bernie_df['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'health care', 115),
 (u'bernie sanders', 54),
 (u'donald trump', 46),
 (u'democratic party', 24),
 (u'working families', 24),
 (u'wall street', 23),
 (u'american people', 22),
 (u'climate change', 20),
 (u'health insurance', 18),
 (u'tax breaks', 17),
 (u'https rt', 16),
 (u'working people', 16),
 (u'millions people', 14),
 (u'minimum wage', 14),
 (u'guarantee health', 13),
 (u'guarantee health care', 13),
 (u'care right', 13),
 (u'health care right', 13),
 (u'mr trump', 12),
 (u'drug companies', 11)]

## Processing the tweets and building a model

---

To do classfication we will need to convert the tweets into a set of features.

**You will need to:**
- Vectorize input text data.
- Intialize a model (try Logistic regression).
- Train / Predict / cross-validate.
- Evaluate the performance of the model.

> **Bonus:** you may have noticed that there are website links in the tweets. What additional preprocessing steps can you do before building the model?


In [20]:
# BONUS
# Using the textacy package to do some more comprehensive preprocessing
# http://textacy.readthedocs.io/en/latest/
# To install: pip install textacy
# We're going to take 5 minutes so that you can look through this code and the arguments. Then we'll come together 
# to talk about as a group.
from textacy.preprocess import preprocess_text

tweet_text = tweets['text'].values
clean_text = [preprocess_text(x, fix_unicode=True, lowercase=True, transliterate=False,
                              no_urls=True, no_emails=True, no_phone_numbers=True, no_currency_symbols=True,
                              no_punct=True, no_accents=True)
              for x in tweet_text]

In [54]:
print tweet_text[4:8]

[ u"2nd-Grade Teacher Can't Believe How Much Fatter They Keep Getting https://t.co/YP2quqpgsA https://t.co/9XK6AbYpeO"
 u'Alarming New Adult Trend \u2018Plateauing In Your Career And Relationship\u2019 Sweeps Nation https://t.co/lqdaZmZkAS https://t.co/Ro2AgZjZFU'
 u'National News Highlights https://t.co/nI6924DWcR'
 u'Man Somehow Getting Worse At Sex https://t.co/Q3lkIxBhtI https://t.co/6TfIaBjOod']


In [55]:
print clean_text[4:8]

[u'2ndgrade teacher cant believe how much fatter they keep getting url url', u'alarming new adult trend plateauing in your career and relationship sweeps nation url url', u'national news highlights url', u'man somehow getting worse at sex url url']


In [23]:
# target is the handle.
# make trump 1 and sanders 0
import numpy as np
y = tweets['handle'].map(lambda x: 1 if x == 'Donald J. Trump' else 0).values
print np.mean(y)

0.5


In [24]:
from sklearn.linear_model import LogisticRegression

# Preprocess our text data to Tfidf
tfv = TfidfVectorizer(ngram_range=(1,4), max_features=2000)
X = tfv.fit_transform(clean_text).todense()
print X.shape

(2000, 2000)


In [25]:
# cross-validate the accuracy:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(LogisticRegression(), X, y, cv=10)

print accuracies.mean()
print np.mean(accuracies)

# Setup logistic regression (or try another classification method here)
estimator = LogisticRegression()
estimator.fit(X, y)


0.899
0.899


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [26]:
# Very good accuracy considering the baseline is 50%

## Check the predicted probability for a random Sanders and Trump tweet
---

Below are provided a couple of tweets from both Sanders and Trump. I'm sure you can figure out on your own which one is which.

Estimate the predicted probability of being trump for the two tweets.

In [27]:
# Prep our source as TfIdf vectors
source_test = [
    "Demanding that the wealthy and the powerful start paying their fair share of taxes that's exactly what the American people want.",
    "Crooked Hillary is spending tremendous amounts of Wall Street money on false ads against me. She is a very dishonest person!"
]

############
# NOTE:  Do not re-initialize the tfidf vectorizor or the feature space willbe overwritten and
# hence your transform will not match the number of features you trained your model on.
#
# This is why you only need to "transform" since you already "fit" previously
#
####

Xtest = tfv.transform(source_test)

# Predict using previously trained logist regression `estimator`
estimator.predict_proba(Xtest)


array([[ 0.76219839,  0.23780161],
       [ 0.37726024,  0.62273976]])

In [28]:
# The 1st column is probability of being Bernie, and 2nd Trump. The classifier is getting it right.

## Independent practice questions

---

### 1. Pull tweets for some new users.

Experiment with using more data.  The API will not like it if you blow through their limits - be careful.  Try to grab only what you need one time, then work on the copy of the objects that are returned.  

> Read the documentation about rate limits and see if you can get enough without hitting the rate limit.  Are there any options available in the API to avoid such a problem?

**Pull tweets for more than two different users of your choice.**

In [29]:
# We deviate from trump / sanders using student tweets here to illustrate the NLP pipeine with twitter data

twitter_handles = ["theonion", "vice",'warriors']
tweets = {}

for twitter_handle in twitter_handles:
    print "Mining tweets for: ", twitter_handle
    miner = TweetMiner(twitter_keys, api, result_limit=500)
    tweets[twitter_handle] = miner.mine_user_tweets(user=twitter_handle, max_pages=10)


Mining tweets for:  theonion
Mining tweets for:  vice
Mining tweets for:  warriors


In [158]:
multi = pd.DataFrame(tweets[twitter_handles[0]])
multi = multi.append(pd.DataFrame((tweets[twitter_handles[1]])))
multi = multi.append(pd.DataFrame((tweets[twitter_handles[2]])))

print multi.shape

(6000, 6)


### 2. Build a multi-class classification model to distinguish between the users.

Try a new type of model than we used before.

In [159]:
tweet_text = multi['text'].values
clean_text = [preprocess_text(x, fix_unicode=True, lowercase=True, transliterate=False,
                              no_urls=True, no_emails=True, no_phone_numbers=True, no_currency_symbols=True,
                              no_punct=True, no_accents=True)
              for x in tweet_text]

y = multi['handle'].apply(lambda x: 0 if x == "The Onion" else 1 if x == "VICE" else 2)
X = pd.DataFrame({"clean_text":clean_text})

In [161]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [162]:
tfv = TfidfVectorizer(ngram_range=(1,3), max_features=2500)
# BEWARE! FIT-TRANSFORMING OUR TFIDFVECTORIZER ON OUR TRAIN, BUT ONLY TRANSFORMING OUR TEST!
# We do this so that none of the top 2500 features come from the test data!
X_train, X_test = tfv.fit_transform(X_train['clean_text']), tfv.transform(X_test['clean_text'])

In [163]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

rf = RandomForestClassifier(n_estimators=250, verbose=1)
knn = KNeighborsClassifier(n_neighbors=7)

rf.fit(X_train, y_train)
knn.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed:    7.6s finished


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=7, p=2,
           weights='uniform')

In [164]:
# Random forest score:
print 'RF:', rf.score(X_test, y_test)
print 'KNN:', knn.score(X_test, y_test)

RF:

[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed:    1.1s finished


 0.898888888889
KNN: 0.460555555556


In [165]:
# Baseline score:
multi.handle.value_counts()/multi.shape[0]

VICE                   0.333333
GoldenStateWarriors    0.333333
The Onion              0.333333
Name: handle, dtype: float64

In [166]:
rf_yhat = knn.predict(X_test)

### 3. Make a confusion matrix and classification report.

In [167]:
from sklearn.metrics import classification_report, confusion_matrix

print classification_report(y_test, rf_yhat)

             precision    recall  f1-score   support

          0       0.93      0.27      0.42       634
          1       0.89      0.14      0.25       591
          2       0.38      0.99      0.55       575

avg / total       0.74      0.46      0.40      1800



In [168]:
# Confusion Matrix
print confusion_matrix(y_test, rf_yhat)

[[173   8 453]
 [ 12  85 494]
 [  1   3 571]]


### 4. What is the most and least "distinctive" tweets for each user?

To find this, identify the tweet that has the highest (correct) predicted probability of being that user's tweet for each user.

In [169]:
pp = rf.predict_proba(X_train)

[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed:    1.2s finished


In [170]:
pp[0:5]

array([[ 0.084,  0.104,  0.812],
       [ 0.912,  0.088,  0.   ],
       [ 0.02 ,  0.964,  0.016],
       [ 0.964,  0.036,  0.   ],
       [ 0.672,  0.288,  0.04 ]])

In [171]:
pp = pd.DataFrame(pp, columns=['The_Onion_Prob', 'VICE_Prob', 'GoldenStateWarriors_Prob'])

In [172]:
pp.head()

Unnamed: 0,The_Onion_Prob,VICE_Prob,GoldenStateWarriors_Prob
0,0.084,0.104,0.812
1,0.912,0.088,0.0
2,0.02,0.964,0.016
3,0.964,0.036,0.0
4,0.672,0.288,0.04


In [173]:
print y_train.shape, pp.shape

(4200,) (4200, 3)


In [174]:
tweets_pp = pd.concat([multi.reset_index(), pp.reset_index()], axis=1)
tweets_pp.drop('index', axis=1, inplace=True)
tweets_pp.head(10)

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id,The_Onion_Prob,VICE_Prob,GoldenStateWarriors_Prob
0,Fri Aug 25 15:29:05 +0000 2017,The Onion,2017-08-25 11:33:26.074366,3,Today’s Weather Report https://t.co/oIlLmRF53p,901104150229811201,0.084,0.104,0.812
1,Fri Aug 25 15:13:02 +0000 2017,The Onion,2017-08-25 11:33:26.074381,55,What You Need To Know About Taylor Swift https...,901100110892998656,0.912,0.088,0.0
2,Fri Aug 25 14:42:09 +0000 2017,The Onion,2017-08-25 11:33:26.074386,166,College Roommates Surprised To Find Dorm Room ...,901092335995543553,0.02,0.964,0.016
3,Fri Aug 25 13:55:08 +0000 2017,The Onion,2017-08-25 11:33:26.074390,146,Astronomers Discover Planet Identical To Earth...,901080503855591424,0.964,0.036,0.0
4,Fri Aug 25 13:08:07 +0000 2017,The Onion,2017-08-25 11:33:26.074394,135,2nd-Grade Teacher Can't Believe How Much Fatte...,901068672311197696,0.672,0.288,0.04
5,Fri Aug 25 12:21:05 +0000 2017,The Onion,2017-08-25 11:33:26.074399,204,Alarming New Adult Trend ‘Plateauing In Your C...,901056835662295044,0.928,0.024,0.048
6,Fri Aug 25 04:30:04 +0000 2017,The Onion,2017-08-25 11:33:26.074404,14,National News Highlights https://t.co/nI6924DWcR,900938300600352768,1.0,0.0,0.0
7,Fri Aug 25 03:31:04 +0000 2017,The Onion,2017-08-25 11:33:26.074408,140,Man Somehow Getting Worse At Sex https://t.co/...,900923454295347200,0.064,0.92,0.016
8,Fri Aug 25 02:32:06 +0000 2017,The Onion,2017-08-25 11:33:26.074412,600,Flying Squirrel Loves It Every Time https://t....,900908613472014337,0.0,0.0,1.0
9,Fri Aug 25 01:33:02 +0000 2017,The Onion,2017-08-25 11:33:26.074417,72,Alternative Birth Control Methods https://t.co...,900893749177335808,0.992,0.0,0.008


In [175]:
print 'Most Onion-like:', tweets_pp[tweets_pp.handle == 'The Onion'].sort_values('The_Onion_Prob', ascending=False).text.values[0]
print 'Least Onion-like:', tweets_pp[tweets_pp.handle == 'The Onion'].sort_values('The_Onion_Prob', ascending=True).text.values[0]

Most Onion-like: Rec Sports League Organizer Needs To Cool It With The Emails https://t.co/JXnYbW7owp https://t.co/OrkT6jQ9Cj
Least Onion-like: Today's Weather Report https://t.co/1GkG5aaRAL


In [176]:
print 'Most VICE-like:', tweets_pp[tweets_pp.handle == 'VICE'].sort_values('VICE_Prob', ascending=False).text.values[0]
print 'Least VICE-like:', tweets_pp[tweets_pp.handle == 'VICE'].sort_values('VICE_Prob', ascending=True).text.values[0]

Most VICE-like: "If Batman teaches us anything, it's that not all rich people are dicks.” https://t.co/d0ABtwaW5l
Least VICE-like: Michelle 'suicide-by-text' Carter just got sentenced to prison https://t.co/4KoaXihjjD https://t.co/BAVHrjhiez


In [177]:
print 'Most GoldenStateWarriors-like:', tweets_pp[tweets_pp.handle == 'GoldenStateWarriors'].sort_values('GoldenStateWarriors_Prob', ascending=False).text.values[0]
print 'Least GoldenStateWarriors-like:', tweets_pp[tweets_pp.handle == 'GoldenStateWarriors'].sort_values('GoldenStateWarriors_Prob', ascending=True).text.values[0]

Most GoldenStateWarriors-like: ⛳️🏌🏾 @Andre is taking over the @PGA IG account for the #PGAChamp 👉🏽 https://t.co/qlqiERND8b https://t.co/DLY7aqyvms
Least GoldenStateWarriors-like: RT @WebDotComTour: Right in the heart. 👌

@StephenCurry30's first #WebTour par. ⛳️ https://t.co/uZPSbzAtHV
