# Supervised Learning
So far we have looked at how we can work with numerical data in performing different types of classification.

Today we look at how machine learning can be used to process text. This is generally a field of machine learning called Natural Language Processing

## Harvesting Tweets

In [1]:
import tweepy # https://github.com/tweepy/tweepy
import csv # Write csv files

In [2]:
#Twitter API credentials
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''

In [19]:
usernames = ['BarackObama', 'realDonaldTrump']

In [4]:
# Twitter only allows access to a users most recent 3240 tweets with this method

def get_all_tweets(screen_name):
    
    #authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)
     
    #initialize a list to hold all the tweepy Tweets
    alltweets = []  
     
    #make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name = screen_name,count=200)
     
    #save most recent tweets
    alltweets.extend(new_tweets)
     
    #save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1
     
    #keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        print("\tGetting tweets before %s" % (oldest))
         
        #all subsiquent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name = screen_name,count=200,max_id=oldest)

         
        #save most recent tweets
        alltweets.extend(new_tweets)
         
        #update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1
         
        print("\t...%s tweets downloaded so far" % (len(alltweets)))
        # transform the tweepy tweets into a 2D array that will populate the csv
        outtweets = [[tweet.id_str, tweet.created_at, tweet.text.encode("utf-8")] for tweet in alltweets]
                 
        #write the csv
        with open('datasets/{}_tweets.csv'.format(screen_name), 'a') as f:
                writer = csv.writer(f)
                #writer.writerow(["id","created_at","text"])
                writer.writerows(outtweets)
                #pass

In [20]:
 for username in usernames:
            try:
                    #pass in the username of the account you want to download
                    print("\nGetting @{}'s tweets\n".format(username))
                    get_all_tweets(username)
            except:
                    print('\tError! Failed to fetch tweets for {}'.format(username))


Getting @BarackObama's tweets

	Getting tweets before 912724426709503999
	...400 tweets downloaded so far
	Getting tweets before 776160016483033087
	...600 tweets downloaded so far
	Getting tweets before 748957408878211071
	...800 tweets downloaded so far
	Getting tweets before 726521029401665535
	...1000 tweets downloaded so far
	Getting tweets before 705160499902726144
	...1200 tweets downloaded so far
	Getting tweets before 687099389483958271
	...1400 tweets downloaded so far
	Getting tweets before 668933428620857346
	...1600 tweets downloaded so far
	Getting tweets before 648539602848886784
	...1800 tweets downloaded so far
	Getting tweets before 628965547690954751
	...2000 tweets downloaded so far
	Getting tweets before 616281266975956991
	...2200 tweets downloaded so far
	Getting tweets before 598921453338136575
	...2400 tweets downloaded so far
	Getting tweets before 581193144181604351
	...2599 tweets downloaded so far
	Getting tweets before 560847716578623487
	...2799 tweets d




# Author Attribution


In [7]:
# Imports
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.utils import shuffle
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


## Train

#### a.) Load the Data

In [21]:
user1 = pd.read_csv('datasets/{}_tweets.csv'.format(usernames[0]))
user2 = pd.read_csv('datasets/{}_tweets.csv'.format(usernames[1]))

# Assign columns to the data
user1.columns = ['id', 'timestamp', 'tweet']
user2.columns = ['id', 'timestamp', 'tweet']

# Create target columns for the users 
user1['Name'] = 0
user2['Name'] = 1

# Join the two dataframes into one huge dataframe and shuffle them
collectiveTweets = pd.concat([user1, user2])
collectiveTweets = shuffle(collectiveTweets)

# Target names
target_names = [usernames[0], usernames[1]]

### b. Split the data in training and test set

In [22]:
X = collectiveTweets['tweet']
y = collectiveTweets['Name']

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)

X_train = X[:-1000]
y_train = y[:-1000]

X_test = X[-1000:]
y_test = y[-1000:]

### c.) Create and Train a classifier
#### Feature Extraction

In [23]:
# Occurences
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(62505, 14643)

In [24]:
# Frequencies

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(62505, 14643)

In [25]:
#Training a classifier
classifier = LogisticRegression()
clf = classifier.fit(X_train_tfidf, y_train)

### Test

In [26]:
X_tests_counts = count_vect.transform(X_test)
X_tests_tfidf = tfidf_transformer.transform(X_tests_counts)
expected  = y_test
predicted = clf.predict(X_tests_tfidf)
print("Accuracy of our model is:\n%s" % metrics.accuracy_score(expected, predicted))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

Accuracy of our model is:
0.999
Confusion matrix:
[[540   0]
 [  1 459]]


### Apply

In [32]:
#Predicting Outcome
tweet1 = 'Food'
tweet2 = 'Machine Learning'
tweet3 = 'Music'
tweet4 = 'yes we can'
tweet5 = 'I love you'
tweet6 = 'Go to hell'
tweet7 = 'Yaaay'
tweet8 = 'Nice'
tweet9 = 'God bless America'
tweet10 = 'billions'

tweets_new = [tweet1, tweet2, tweet3, tweet4, tweet5, tweet6, tweet7, tweet8, tweet9, tweet10]
X_new_counts = count_vect.transform(tweets_new)

X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for tw, category in zip(tweets_new, predicted):
    print('\n{} ===> {}'.format(tw, target_names[category]))



Food ===> BarackObama

Machine Learning ===> realDonaldTrump

Music ===> realDonaldTrump

yes we can ===> BarackObama

I love you ===> BarackObama

Go to hell ===> BarackObama

Yaaay ===> realDonaldTrump

Nice ===> realDonaldTrump

God bless America ===> realDonaldTrump

billions ===> realDonaldTrump
