<h2>Final Project: Identifying Trump's Tweets</h2>

<center>
<img src="white_house.jpg"/>
</center>


<h3>Introduction</h3>

<p>The goal is to classify which device each of Trump's tweets was written on. It has been hypothesized that President Trump tweets only from his Android phone and that someone else (his staff) tweets from his account using an iPhone. Analyze the text of each tweet, along with other contextual information, to predict where it came from.</p>

<h3>Rules</h3>

<p> Rules of the competition: You may use any techniques you've learned in class, including open source implementations in packages such as scikit-learn and tensorflow, or pre-trained models. If you use any open source implementations, <b>please cite them in your comments</b>. Sharing code between teams is strictly prohibited. Additionally, obtaining a copy of the labeled test set through any means is expressly forbidden. </p>

<p><b>NOTE: You are only allowed 10 submissions for this project. Please use them carefully. We will use your 10th and final submission (not necessarily your best one) for grading.</b></p>

<h3>Grading</h3>

<p>We have implemented two baselines: <code>Baseline 1 = 0.7</code> and <code>Baseline 2 = 0.82</code>. If you beat the first baseline, you will get 90 points. If you beat the second baseline, you will get 100 points.</p>
<p>The top 30 teams on the leaderboard will receive an extra 5 bonus points.</p>

In [2]:
#<GRADED>
import numpy as np
import pandas as pd
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
#</GRADED>
## include your imports as necessary and cite open-source implementations appropriately

In [11]:
def read_files(train_file):
    """
    Output:
    df_X : pandas data frame of training data
    Y    : numpy array of labels
    """
    df = pd.read_csv(train_file, index_col=0)
    # .copy() so that later column assignments do not trigger SettingWithCopyWarning
    df_X = df[df.columns[0:17]].copy()
    Y = np.array(df['label'])
        
    return df_X, Y

<h3> Training Data </h3>

<p> Take a look at the file <code>train.csv</code>. Here are the first 4 tweets in the train dataset.</p>

In [12]:
df_X_train, Y_train = read_files('train.csv')
df_X_train[:4]

id,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id.1,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted,longitude,latitude,label
0,Senior United States District Judge Robert E. ...,False,14207,,7/12/2016 0:56,False,,752668000000000000,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,5256,False,False,,,-1
1,Speech on Veterans' Reform: https://t.co/XB7R...,False,9666,,7/11/2016 22:18,False,,752628000000000000,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,3432,False,False,,,-1
2,Great poll- Florida! Thank you! https://t.co/4...,False,25531,,7/11/2016 21:40,False,,752619000000000000,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,8810,False,False,,,-1
3,Thoughts and prayers with the victims; and the...,False,28850,,7/11/2016 19:51,False,,752591000000000000,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,9112,False,False,,,-1


In [16]:
pd.concat([df_X_train['text'],df_X_train['label']],axis=1)

id,text,label
0,Senior United States District Judge Robert E. ...,-1
1,Speech on Veterans' Reform: https://t.co/XB7R...,-1
2,Great poll- Florida! Thank you! https://t.co/4...,-1
3,Thoughts and prayers with the victims; and the...,-1
4,Join me in Westfield; Indiana- tomorrow night ...,-1
5,I heard that the underachieving John King of @...,1
6,The media is so dishonest. If I make a stateme...,1
7,President Obama thinks the nation is not as di...,1
8,Look what is happening to our country under th...,1
9,New poll - thank you! #Trump2016 https://t.co...,-1


In [8]:
df_X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1089 entries, 0 to 1088
Data columns (total 17 columns):
text             1089 non-null object
favorited        1089 non-null bool
favoriteCount    1089 non-null int64
replyToSN        3 non-null object
created          1089 non-null object
truncated        1089 non-null bool
replyToSID       0 non-null float64
id.1             1089 non-null int64
replyToUID       3 non-null float64
statusSource     1089 non-null object
screenName       1089 non-null object
retweetCount     1089 non-null int64
isRetweet        1089 non-null bool
retweeted        1089 non-null bool
longitude        2 non-null float64
latitude         2 non-null float64
label            1089 non-null int64
dtypes: bool(4), float64(4), int64(4), object(5)
memory usage: 123.4+ KB


In [5]:
def Find_url(string):
    # Count URL-like substrings in the tweet text.
    # A raw string keeps the regex escapes (\( and \)) intact.
    url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
    return len(url)
# Binary flag: 1 if the tweet contains at least one link, else 0
df_X_train['link'] = df_X_train['text'].apply(Find_url)
df_X_train.loc[df_X_train['link'] >0, 'link'] = 1

def Find_sc(string):
    # Count special characters (mentions, hashtags, punctuation) in the tweet text
    regex = re.compile(r'[@_!#;$%^&*()<>?/\|}{~:]')
    return len(regex.findall(string))
df_X_train['sc_count'] = df_X_train['text'].apply(Find_sc)

df_X_train['word_count'] = df_X_train['text'].str.split().str.len()

# Bucket the hour of day (0-23) into six 4-hour bins via integer division:
# 0 = late night, 1 = early morning, 2 = morning, 3 = noon, 4 = evening, 5 = night
df_X_train['time'] = df_X_train['created'].str.split().str[1].str.split(':').str[0].apply(int) // 4

df_X_train['longitude'] = df_X_train['longitude'].fillna(0)
df_X_train['latitude'] = df_X_train['latitude'].fillna(0)
df_X_train['replyToSN'] = df_X_train['replyToSN'].fillna(0)
df_X_train['replyToUID'] = df_X_train['replyToUID'].fillna(0)
df_X_train.loc[(df_X_train['replyToSN'] != 0), 'replyToSN'] = 1
df_X_train.loc[(df_X_train['replyToUID'] != 0), 'replyToUID'] = 1
df_X_train.loc[(df_X_train['latitude'] != 0), 'latitude'] = 1
df_X_train.loc[(df_X_train['longitude'] != 0), 'longitude'] = 1
df_X_train['replyToSN'] = df_X_train['replyToSN'].astype(int)


df_X_train['retweetCount_normalized'] = (df_X_train['retweetCount'] - df_X_train['retweetCount'].min()) / (df_X_train['retweetCount'].max() - df_X_train['retweetCount'].min())
df_X_train['retweetCount_z'] = (df_X_train['retweetCount'] - df_X_train['retweetCount'].mean()) / df_X_train['retweetCount'].std() 

df_X_train['favoriteCount_z'] = (df_X_train['favoriteCount'] - df_X_train['favoriteCount'].mean()) / df_X_train['favoriteCount'].std()
df_X_train['favoriteCount_normalized'] = (df_X_train['favoriteCount'] - df_X_train['favoriteCount'].min()) / (df_X_train['favoriteCount'].max() - df_X_train['favoriteCount'].min())

#df_X_train = df_X_train.drop(columns=['statusSource','truncated','replyToSID','favorited','statusSource','screenName',
#                         'isRetweet','retweeted',])
#df_X_train = df_X_train['link','sc_count','word_count','time','longitude','replyToSN','retweetCount_normalized','favoriteCount_normalized']
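<p>The features above are counts and flags derived from the tweet; the introduction also suggests analyzing the text itself. One common way to do that is a TF-IDF bag of words. The following is only a minimal sketch, not part of the original solution: <code>toy_texts</code> and <code>toy_labels</code> are made-up stand-ins for the <code>text</code> column and labels, and the vectorizer settings are illustrative assumptions rather than tuned values.</p>

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the `text` column and its labels (NOT rows from train.csv).
toy_texts = [
    "thank you for the great rally tonight",
    "join us tomorrow at the event https://example.com",
    "the press coverage is so unfair",
    "they got it totally wrong again",
]
toy_labels = [-1, -1, 1, 1]

# Learn a vocabulary of unigrams and bigrams, then turn each text
# into a sparse TF-IDF row vector.
vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 2))
X_text = vectorizer.fit_transform(toy_texts)

# scikit-learn random forests accept sparse input directly.
text_rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
text_rf.fit(X_text, toy_labels)

# New text must go through transform (not fit_transform) so that it is
# encoded with the vocabulary learned from the training texts.
preds = text_rf.predict(vectorizer.transform(["thank you for the event"]))
print(X_text.shape, preds)
```

<p>In a real pipeline the vectorizer would be fit on the training tweets only and then applied to the test tweets, and its sparse output could be stacked next to the hand-built features.</p>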

In [6]:
correlation_data = df_X_train.corr()['label'].sort_values()

In [7]:
correlation_data

retweetCount    -0.027656
id.1             0.015367
favoriteCount    0.021665
label            1.000000
favorited             NaN
truncated             NaN
replyToSID            NaN
replyToUID            NaN
isRetweet             NaN
retweeted             NaN
longitude             NaN
latitude              NaN
Name: label, dtype: float64
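<p>Many entries in the correlation output are <code>NaN</code>. A likely reason is that Pearson correlation is undefined for a column that never varies (zero variance) or that has no observed values. The toy example below illustrates this behavior; the data frame <code>toy</code> and its column names are hypothetical and are not taken from <code>train.csv</code>.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one informative column, one constant column,
# and one column with no observed values.
toy = pd.DataFrame({
    "label": [-1, 1, 1, -1, 1],
    "varying": [3.0, 8.0, 9.0, 2.0, 7.0],
    "constant": [0.0, 0.0, 0.0, 0.0, 0.0],   # zero variance
    "all_missing": [np.nan] * 5,             # nothing to correlate
})

# Pearson correlation of every column with the label.
corr_with_label = toy.corr()["label"]
print(corr_with_label)
```

<p>Here <code>constant</code> and <code>all_missing</code> both come out as <code>NaN</code>, so a <code>NaN</code> in this table says the column carries no usable linear signal as encoded, not that the computation failed.</p>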

In [8]:
df_X_train = df_X_train[['link','sc_count','word_count','time','longitude','replyToSN','retweetCount_z','favoriteCount_z']]

In [19]:
df_X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1089 entries, 0 to 1088
Data columns (total 8 columns):
link               1089 non-null int64
sc_count           1089 non-null int64
word_count         1089 non-null int64
time               1089 non-null int64
longitude          1089 non-null float64
replyToSN          1089 non-null int64
retweetCount_z     1089 non-null float64
favoriteCount_z    1089 non-null float64
dtypes: float64(3), int64(5)
memory usage: 76.6 KB


In [18]:
rf = RandomForestClassifier(n_estimators = 100, oob_score=True,max_depth=10)
rf.fit(df_X_train,Y_train)
acc = round(rf.score(df_X_train, Y_train)*100,2)
print(acc)
print("oob:",round(rf.oob_score_,4)*100,"%")

94.95
oob: 83.47 %
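<p>The gap between the training accuracy and the out-of-bag score suggests that training accuracy overstates how well the model generalizes. The already-imported <code>cross_val_score</code> gives a second held-out estimate. The sketch below runs 5-fold cross-validation on synthetic data; <code>X_toy</code> and <code>y_toy</code> are hypothetical stand-ins for the 8-column feature matrix and the labels, and the fold count is an assumption.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 rows, 8 features, labels in {-1, 1}
# that depend noisily on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 8))
y_toy = np.where(X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0, 1, -1)

# Same hyperparameters as the notebook's forest; random_state fixed
# so that repeated runs give the same folds' scores.
rf_cv = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)

# Each of the 5 folds is held out once while the model trains on the rest.
cv_scores = cross_val_score(rf_cv, X_toy, y_toy, cv=5, scoring="accuracy")
print("fold accuracies:", np.round(cv_scores, 3))
print("mean CV accuracy:", round(cv_scores.mean(), 3))
```

<p>With the real data, the call would be <code>cross_val_score(rf, df_X_train, Y_train, cv=5)</code>, and the mean fold accuracy can be compared directly against the two baselines.</p>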


<h3> Train and Classify </h3>

<p> Implement <code>train_and_classify</code>. It should extract feature vectors from the given pandas data frames, train a model, and return the predicted labels for the test data. The feature vectors and models to use are up to you to decide.</p>

<p><b>Your final score will be determined by executing <code>train_and_classify</code> with the provided training set for training and a hidden test set for classification. We will then evaluate the accuracy of your output.</b></p>
<p><b>NOTE: Please limit your training time to 10 minutes.</b></p>
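<p>One simple way to check the time limit is to wrap the training call in a wall-clock timer. This is a minimal sketch on synthetic data: <code>TIME_LIMIT_SECONDS</code>, <code>X_toy</code>, and <code>y_toy</code> are hypothetical names, and the grading machine may be slower or faster than yours.</p>

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier

TIME_LIMIT_SECONDS = 10 * 60  # the 10-minute limit stated above

# Synthetic stand-in for the training features and labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 8))
y_toy = np.where(X_toy[:, 0] > 0, 1, -1)

# perf_counter is a monotonic clock intended for measuring elapsed time.
start = time.perf_counter()
RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0).fit(X_toy, y_toy)
elapsed = time.perf_counter() - start

within_limit = elapsed < TIME_LIMIT_SECONDS
print(f"training took {elapsed:.2f}s, within limit: {within_limit}")
```

<p>Timing the full <code>train_and_classify</code> call, rather than only <code>fit</code>, also accounts for feature extraction.</p>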

In [13]:
#<GRADED>
def train_and_classify(df_X_train, Y_train, df_X_test):
    """
    Extracts features from df_X_train. Train a model
    on training data and training labels (Y_train).
    Predict the labels of df_X_test.
    
    df_X_train : pandas data frame of training data
    Y_train    : numpy array of labels for training data
    df_X_test  : pandas data frame of test data
    
    Output:
    Y_test : numpy array of labels for test data
    """
    
    ## fill in code here
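    def extract_feature_vec(df_X):
        # Build the feature matrix: link flag, special-character count, word count,
        # time-of-day bin, geotag and reply flags, and z-scored engagement counts.
        df_X = df_X.copy()  # avoid mutating the caller's data frame

        def Find_url(string):
            # Count URL-like substrings in the tweet text
            url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
            return len(url)
        df_X['link'] = df_X['text'].apply(Find_url)
        df_X.loc[df_X['link'] >0, 'link'] = 1

        def Find_sc(string):
            # Count special characters in the tweet text
            regex = re.compile(r'[@_!#;$%^&*()<>?/\|}{~:]')
            return len(regex.findall(string))
        df_X['sc_count'] = df_X['text'].apply(Find_sc)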

        df_X['word_count'] = df_X['text'].str.split().str.len()

        # Bucket the hour of day (0-23) into six 4-hour bins via integer division:
        # 0 = late night, 1 = early morning, 2 = morning, 3 = noon, 4 = evening, 5 = night
        df_X['time'] = df_X['created'].str.split().str[1].str.split(':').str[0].apply(int) // 4

        df_X['longitude'] = df_X['longitude'].fillna(0)
        df_X['latitude'] = df_X['latitude'].fillna(0)
        df_X['replyToSN'] = df_X['replyToSN'].fillna(0)
        df_X['replyToUID'] = df_X['replyToUID'].fillna(0)
        df_X.loc[(df_X['replyToSN'] != 0), 'replyToSN'] = 1
        df_X.loc[(df_X['replyToUID'] != 0), 'replyToUID'] = 1
        df_X.loc[(df_X['latitude'] != 0), 'latitude'] = 1
        df_X.loc[(df_X['longitude'] != 0), 'longitude'] = 1
        df_X['replyToSN'] = df_X['replyToSN'].astype(int)


        df_X['retweetCount_normalized'] = (df_X['retweetCount'] - df_X['retweetCount'].min()) / (df_X['retweetCount'].max() - df_X['retweetCount'].min())
        df_X['retweetCount_z'] = (df_X['retweetCount'] - df_X['retweetCount'].mean()) / df_X['retweetCount'].std() 

        df_X['favoriteCount_z'] = (df_X['favoriteCount'] - df_X['favoriteCount'].mean()) / df_X['favoriteCount'].std()
        df_X['favoriteCount_normalized'] = (df_X['favoriteCount'] - df_X['favoriteCount'].min()) / (df_X['favoriteCount'].max() - df_X['favoriteCount'].min())


        df_X = df_X[['link','sc_count','word_count','time','longitude','replyToSN','retweetCount_z','favoriteCount_z']]
        
        return df_X
    
    X_train = extract_feature_vec(df_X_train)
    X_test  = extract_feature_vec(df_X_test)
    
    # create and train model (consider doing k-fold cross validation as well)
    rf = RandomForestClassifier(n_estimators = 100,max_depth=10)
    rf.fit(X_train,Y_train)

    # predict labels for the test set
    Y_test = rf.predict(X_test)
    #Y_test = np.zeros(len(df_X_test)) 

    return Y_test
#</GRADED>

<h3> Evaluation</h3>

<p>Below is some code to see your accuracy when trained and tested on the training data set.</p>

In [14]:
# train on the training set, then classify that same set
Y_pred = train_and_classify(df_X_train, Y_train, df_X_train)

def accuracy(Y_pred, Y_true):
    return (Y_pred == Y_true).sum() / Y_pred.shape[0]

acc = accuracy(Y_pred, Y_train)
print('accuracy: ' + str(round(acc * 100, 2)) + '%')

accuracy: 95.32%
