# Building a Classifier

**GOALS**

Build a classifier, comparing several classification methods, that predicts which of the following accounts wrote a given tweet:


```
@iamcardib
@hillaryclinton
@_yiannopoulos
@thrashermag
@fwmagazine
```

To do this, we will review how to retrieve a tweet and build a dataframe from its text. From there, your goal is to:

1. Build a labeled dataframe containing at least 100 tweets from the five users.
2. Explore the top 5 retweeted tweets from each user, make a visualization, and discuss what you find.
3. Prepare the data for modeling using a `CountVectorizer` or `TfidfVectorizer`. Remember to incorporate stop words and n-grams in your work.
4. Use a `LogisticRegression` classifier to determine the user. How did it perform?
5. Use a Naive Bayes classifier (for example, `MultinomialNB`) to determine the user. Did this do better?
6. Use a `DecisionTreeClassifier` to model the tweets. How did this compare to the other two methods?
7. Build a table that compares the important information about these models.
8. Suppose your task is to verify whether or not another account was actually Milo (`@_yiannopoulos`) all along. Which model would you use? Why?
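
Steps 3 through 7 can be sketched end to end before any real tweets are collected. The block below is only an illustration under stated assumptions: the `toy` frame, its `user_a` and `user_b` labels, and its text are hypothetical stand-ins for the labeled tweets, and 3-fold cross-validated accuracy is one possible choice for the "important information" in the comparison table.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; replace with the real labeled tweets.
toy = pd.DataFrame({
    "tweets": [
        "new single out friday stream it now",
        "tour dates announced tickets on sale",
        "studio all night working on the album",
        "thank you fans for streaming the single",
        "album release party this weekend",
        "behind the scenes from the video shoot",
        "register to vote before the deadline",
        "proud of the volunteers knocking on doors",
        "every vote counts in this election",
        "join us at the town hall tonight",
        "policy matters for working families",
        "thank you to the organizers and voters",
    ],
    "user": ["user_a"] * 6 + ["user_b"] * 6,
})

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    # The vectorizer sits inside the pipeline so it is refit on each training fold.
    pipe = make_pipeline(
        TfidfVectorizer(stop_words="english", ngram_range=(1, 2)), model
    )
    scores = cross_val_score(pipe, toy["tweets"], toy["user"], cv=3)
    rows.append({"model": name, "mean_accuracy": scores.mean(), "std": scores.std()})

comparison = pd.DataFrame(rows)
print(comparison)
```

Scores on a dozen invented sentences mean nothing by themselves; the point is the shape of the workflow, which should carry over once `toy` is replaced by the real dataframe.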

In [2]:
import importlib

In [3]:
#auth info
import twitter_credentials as Tw
importlib.reload(Tw)

<module 'twitter_credentials' from '/Users/karenhao/Google Drive/02 Working/Quartz/Education/GA Data Science/DAT-NYC-6.13/twitter_credentials.py'>

In [4]:
import tweepy
import json
from tweepy import OAuthHandler
auth = OAuthHandler(Tw.consumer_key, Tw.consumer_secret)
auth.set_access_token(Tw.access_token, Tw.access_token_secret)
 
api = tweepy.API(auth)

In [5]:
for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(status.text)

Sitting here trying to even come up with the last white player to get nabbed for PEDs https://t.co/pYJsY23sIa
In an era of national political discord, a candidate preaching constitutional localism would possess a new, unique… https://t.co/GXdRUpFA18
RT @MichaelEMann: @NatureNews Keep in mind that there are substantial limitations with model-based attribution  studies. They likely undere…
Evening run groove. https://t.co/iJRQWGu4fu
Firefighters are making small but promising headway on California wildfires: 

- #CarrFire: 20% contained
-… https://t.co/jMJSklKept
“The Perils of Adding Layers of Management” by C. L. Hamer https://t.co/w4Bu3CfQqG
RT @EMViews: Sub-Saharan #Africa continues to adopt use of the #renminbi. We've taken a closer look at why, and what it means for trade. ht…
Watching #TrayvonMartinStory tonight on @BET. Hear it picks up where 13TH left off about Florida’s disastrous “Stan… https://t.co/mpBaPqGysc
It took 2 months longer than planned to pull it together, but @thet

In [6]:
for status in tweepy.Cursor(api.home_timeline).items(1):
    # Process a single status
    print(status._json)

{'created_at': 'Tue Jul 31 01:07:32 +0000 2018', 'id': 1024099208884678656, 'id_str': '1024099208884678656', 'text': 'Sitting here trying to even come up with the last white player to get nabbed for PEDs https://t.co/pYJsY23sIa', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/pYJsY23sIa', 'expanded_url': 'https://twitter.com/btbscore/status/1024048323752275968', 'display_url': 'twitter.com/btbscore/statu…', 'indices': [86, 109]}]}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 19273919, 'id_str': '19273919', 'name': 'rhea butcher', 'screen_name': 'RheaButcher', 'location': 'Los Angeles, CA.', 'description': "comic with a 31 year old farmer's tan from akron, ohio.", 'url': 'https://t.co/ZZ2ANBpJdD

In [7]:
def process_or_store(tweet):
    print(json.dumps(tweet))

In [8]:
for status in tweepy.Cursor(api.home_timeline).items(1):
    print(status)

Status(_api=<tweepy.api.API object at 0x1046686d8>, _json={'created_at': 'Tue Jul 31 01:09:06 +0000 2018', 'id': 1024099600360189952, 'id_str': '1024099600360189952', 'text': 'RT @family_equality: This piece from @mmfa is a must-read following today\'s announcement of a "religious liberty task force" at DOJ. https:…', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'family_equality', 'name': 'Family Equality', 'id': 79295663, 'id_str': '79295663', 'indices': [3, 19]}, {'screen_name': 'mmfa', 'name': 'Media Matters', 'id': 13493302, 'id_str': '13493302', 'indices': [37, 42]}], 'urls': []}, 'source': '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 13493302, 'id_str': '13493302', 'name': 'Media Matters', 'screen_name': 'mmfa', '

In [9]:
for tweet in tweepy.Cursor(api.user_timeline, screen_name="hillaryclinton").items(1):
    print(tweet)

Status(_api=<tweepy.api.API object at 0x1046686d8>, _json={'created_at': 'Fri Jul 27 22:40:58 +0000 2018', 'id': 1022975158536097792, 'id_str': '1022975158536097792', 'text': 'Such a pleasure seeing @BetteMidler back where she belongs! Huge thanks to the cast and crew of @HelloDollyBway for… https://t.co/eSxSVjMh7I', 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'BetteMidler', 'name': 'Bette Midler', 'id': 139823781, 'id_str': '139823781', 'indices': [23, 35]}, {'screen_name': 'HelloDollyBway', 'name': 'Hello, Dolly!', 'id': 4785031154, 'id_str': '4785031154', 'indices': [96, 111]}], 'urls': [{'url': 'https://t.co/eSxSVjMh7I', 'expanded_url': 'https://twitter.com/i/web/status/1022975158536097792', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [117, 140]}]}, 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': N

In [10]:
for tweet in tweepy.Cursor(api.user_timeline, screen_name="hillaryclinton").items(1):
    print(tweet._json['text'])

Such a pleasure seeing @BetteMidler back where she belongs! Huge thanks to the cast and crew of @HelloDollyBway for… https://t.co/eSxSVjMh7I
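
The text above ends in an ellipsis because the payload's `truncated` flag is `True`. One way to handle this, assuming the timeline request is made with `tweet_mode='extended'` so that the payload carries a `full_text` field, is a small accessor that prefers `full_text` and falls back to `text`. The helper name `tweet_text` and both sample payloads below are hypothetical.

```python
def tweet_text(payload):
    """Return the longest text available in a tweet payload (a dict like status._json)."""
    # Assumption: 'full_text' is only present when the request used tweet_mode='extended'.
    return payload.get("full_text") or payload.get("text", "")


# Hypothetical payloads, trimmed to the fields that matter here.
truncated_payload = {"text": "Huge thanks to the cast and crew for…", "truncated": True}
extended_payload = {"full_text": "Huge thanks to the cast and crew for a great night."}

print(tweet_text(truncated_payload))
print(tweet_text(extended_payload))
```

Since the classifier only sees the words in each tweet, working from untruncated text is likely to give the vectorizer more to learn from.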


In [11]:
for tweet in tweepy.Cursor(api.user_timeline, screen_name="hillaryclinton").items(1):
    print(json.dumps(tweet._json['text']))

"Such a pleasure seeing @BetteMidler back where she belongs! Huge thanks to the cast and crew of @HelloDollyBway for\u2026 https://t.co/eSxSVjMh7I"


In [12]:
for tweet in tweepy.Cursor(api.user_timeline, screen_name="hillaryclinton").items(1):
    print(json.dumps(tweet._json))

{"created_at": "Fri Jul 27 22:40:58 +0000 2018", "id": 1022975158536097792, "id_str": "1022975158536097792", "text": "Such a pleasure seeing @BetteMidler back where she belongs! Huge thanks to the cast and crew of @HelloDollyBway for\u2026 https://t.co/eSxSVjMh7I", "truncated": true, "entities": {"hashtags": [], "symbols": [], "user_mentions": [{"screen_name": "BetteMidler", "name": "Bette Midler", "id": 139823781, "id_str": "139823781", "indices": [23, 35]}, {"screen_name": "HelloDollyBway", "name": "Hello, Dolly!", "id": 4785031154, "id_str": "4785031154", "indices": [96, 111]}], "urls": [{"url": "https://t.co/eSxSVjMh7I", "expanded_url": "https://twitter.com/i/web/status/1022975158536097792", "display_url": "twitter.com/i/web/status/1\u2026", "indices": [117, 140]}]}, "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_re

In [13]:
for tweet in tweepy.Cursor(api.user_timeline, screen_name="hillaryclinton").items(1):
    print(tweet._json.keys())

dict_keys(['created_at', 'id', 'id_str', 'text', 'truncated', 'entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'quoted_status_id', 'quoted_status_id_str', 'quoted_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'lang'])


In [14]:
tweets = []
retweets = []
user = []
for tweet in tweepy.Cursor(api.user_timeline, screen_name="hillaryclinton").items(2000):
    tweets.append(tweet._json['text'])
    retweets.append(tweet._json['retweet_count'])
    user.append(tweet._json['user']['screen_name'])

In [15]:
import pandas as pd
df = pd.DataFrame({'tweets': tweets, 'retweets': retweets, 'user': user})
df.head()

| | retweets | tweets | user |
|---|---|---|---|
| 0 | 5717 | Such a pleasure seeing @BetteMidler back where... | HillaryClinton |
| 1 | 18964 | Yesterday was the court-ordered deadline for t... | HillaryClinton |
| 2 | 7944 | From mother to activist to candidate - congrat... | HillaryClinton |
| 3 | 7514 | It was wonderful to spend some time with the t... | HillaryClinton |
| 4 | 590 | RT @domesticworkers: Miles de niños y niñas si... | HillaryClinton |


In [16]:
tweets = []
retweets = []
user = []
for tweet in tweepy.Cursor(api.user_timeline, screen_name="iamcardib").items(2000):
    tweets.append(tweet._json['text'])
    retweets.append(tweet._json['retweet_count'])
    user.append(tweet._json['user']['screen_name'])

In [17]:
df2 = pd.DataFrame({'tweets': tweets, 'retweets': retweets, 'user': user})
df2.head()

| | retweets | tweets | user |
|---|---|---|---|
| 0 | 10975 | Mood https://t.co/burVuT9Apz | iamcardib |
| 1 | 16 | @CardiMila__ 😎 | iamcardib |
| 2 | 1051 | I got a baby i need some money shieeettt i nee... | iamcardib |
| 3 | 2748 | DO YOU SMELL WHAT CARDI IS COOKING ? | iamcardib |
| 4 | 2022 | People love doubting you then when you hit the... | iamcardib |


In [19]:
tweets = pd.concat([df,df2])
tweets.head()

| | retweets | tweets | user |
|---|---|---|---|
| 0 | 5717 | Such a pleasure seeing @BetteMidler back where... | HillaryClinton |
| 1 | 18964 | Yesterday was the court-ordered deadline for t... | HillaryClinton |
| 2 | 7944 | From mother to activist to candidate - congrat... | HillaryClinton |
| 3 | 7514 | It was wonderful to spend some time with the t... | HillaryClinton |
| 4 | 590 | RT @domesticworkers: Miles de niños y niñas si... | HillaryClinton |


In [20]:
tweets.to_csv('tweets.csv')

In [21]:
len(tweets)

4000
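
With both users in one frame, the "top retweeted tweets per user" goal can be approached with a sort followed by a grouped `head`. This is a minimal sketch: the `demo` frame and the helper name `top_retweeted` are hypothetical, and only the column names are taken from the dataframe built above.

```python
import pandas as pd

# Hypothetical stand-in rows using the same column names as the combined frame.
demo = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "b"],
    "tweets": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "retweets": [10, 50, 30, 7, 99, 1],
})


def top_retweeted(frame, n=5):
    """Return the n most retweeted rows for each user, highest first."""
    ranked = frame.sort_values(["user", "retweets"], ascending=[True, False])
    return ranked.groupby("user").head(n).reset_index(drop=True)


top2 = top_retweeted(demo, n=2)
print(top2)
```

The returned frame is already ordered by user and retweet count, so it can be passed directly to a bar chart for the visualization step.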

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [24]:
# Pass only options here; the text itself goes to fit/fit_transform later.
# (The first positional parameter is `input`, not the data.)
tfidf = TfidfVectorizer(ngram_range=(1, 3), stop_words='english')

In [25]:
tfidf

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
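
A vectorizer is normally constructed from keyword options only and receives the text when it is fit. The sketch below shows that pattern on a hypothetical three-document corpus; in the notebook, the text column `tweets['tweets']` would presumably take the place of `texts`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the text column of the tweets dataframe.
texts = [
    "new album dropping friday",
    "register to vote before friday",
    "vote early and bring a friend",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
X = vectorizer.fit_transform(texts)  # sparse matrix with one row per document

print(X.shape)
print(sorted(vectorizer.vocabulary_)[:5])  # a few of the learned n-gram features
```

The resulting matrix `X` is what a scikit-learn classifier expects as its feature input, with the `user` column serving as the target.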

### To Do

- Write a function that takes a list of usernames and a number of tweets, and returns a `DataFrame` with the requested number of tweets, with columns for the user label, tweet body, retweet count, and geolocation information.
- Explore top retweets
- Prepare for `sklearn`
- Classification Models
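
One possible starting point for the first item is sketched below. The function names `statuses_to_frame` and `collect_tweets` are hypothetical, the payload fields are the ones printed by `tweet._json.keys()` earlier, and the fetch step assumes an authenticated `tweepy.API` object like the `api` created at the top of the notebook. Parsing is kept separate from fetching so it can be checked without network access.

```python
import pandas as pd


def statuses_to_frame(payloads):
    """Build a labeled DataFrame from tweet payload dicts (such as status._json)."""
    rows = [
        {
            "user": p["user"]["screen_name"],
            "tweets": p["text"],
            "retweets": p.get("retweet_count", 0),
            "geo": p.get("geo"),
        }
        for p in payloads
    ]
    return pd.DataFrame(rows, columns=["user", "tweets", "retweets", "geo"])


def collect_tweets(api, usernames, n_tweets):
    """Fetch up to n_tweets per username using an authenticated tweepy.API object."""
    import tweepy  # imported here so the parsing helper works without tweepy installed

    payloads = []
    for name in usernames:
        cursor = tweepy.Cursor(api.user_timeline, screen_name=name)
        payloads.extend(status._json for status in cursor.items(n_tweets))
    return statuses_to_frame(payloads)


# Hypothetical payloads for a quick offline check of the parsing step.
fake = [
    {"user": {"screen_name": "user_a"}, "text": "hello", "retweet_count": 3, "geo": None},
    {"user": {"screen_name": "user_b"}, "text": "world", "retweet_count": 8, "geo": None},
]
frame = statuses_to_frame(fake)
print(frame)
```

With live credentials, a call such as `collect_tweets(api, ["iamcardib", "hillaryclinton"], 100)` would be expected to return one labeled frame covering both accounts.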