<h2>Final Project: Identifying Trump's Tweets</h2>

<center>
<img src="white_house.jpg"/>
</center>


<h3>Introduction</h3>

<p>The goal is to classify the device used to write each tweet. It has been hypothesized that President Trump tweets only from his Android phone and that someone else (his staff) tweets from his account using an iPhone. Analyze the text of each tweet, as well as other contextual information, to predict which device it came from.</p>

<h3>Rules</h3>

<p>Rules of the competition: You may use any techniques you've learned in class, including open-source implementations in packages such as scikit-learn or TensorFlow, as well as pre-trained models. If you use any open-source implementations, <b>please cite them in your comments</b>. Sharing personal code between teams is strictly not allowed. Additionally, obtaining a copy of the labeled test set through any means is expressly forbidden.</p>

<p><b>NOTE: You are only allowed 10 submissions for this project. Please use them carefully. We will use your 10th and final submission (not the best one) for grading.</b></p>

<h3>Grading</h3>

<p>We have implemented two baselines: <code>Baseline 1 = 0.7</code> and <code>Baseline 2 = 0.82</code>. If you beat the first baseline, you will get 90 points. If you beat the second baseline, you will get 100 points.</p>
<p>The top 30 teams on the leaderboard will receive an extra 5 bonus points.</p>

### To do (added by Martin)

Implement multiple learning algorithms.

Implement k-fold cross validation.

Optimize functions.
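
The first two to-do items can be tried together: score several classifiers with the same k-fold split and compare their mean accuracy. The block below is only a sketch. The choice of candidate models, the helper name `compare_models`, and the synthetic `X_demo`/`y_demo` data are assumptions made for illustration; in the project the real feature matrix and `Y_train` would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC


def compare_models(X, y, n_splits=5, seed=0):
    """Return {model name: mean k-fold accuracy} for a few candidate classifiers."""
    candidates = {
        "svm": SVC(),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    # Stratified folds keep the ratio of the two labels similar in every fold.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {}
    for name, model in candidates.items():
        fold_accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        scores[name] = fold_accuracy.mean()
    return scores


# Synthetic stand-in for the real feature matrix (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.where(X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0, 1, -1)

demo_scores = compare_models(X_demo, y_demo)
for name, acc in demo_scores.items():
    print(f"{name}: {acc:.3f}")
```

Because every model is scored on the same folds, the differences between the means reflect the models rather than the split.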

Feature ideas:
- Average number of words per sentence
- Average word length
- Number of punctuation symbols
- Sentiment analysis
- Day of the week
- Time of the day
- Number of capital letters
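
Several of these ideas can be computed directly with pandas. The following is a rough sketch, not the project's implementation: the function names are hypothetical, the text is assumed to contain no missing values, and the timestamp format `%m/%d/%Y %H:%M` is an assumption based on sample values such as `7/12/2016 0:56` in `train.csv`.

```python
import string

import pandas as pd


def extract_text_features(tweets):
    """tweets: pandas Series of strings. Returns one row of text features per tweet."""
    words = tweets.str.split()
    return pd.DataFrame({
        # Mean length of whitespace-separated tokens (0.0 for an empty tweet).
        "avg_word_length": words.apply(
            lambda ws: sum(len(w) for w in ws) / len(ws) if ws else 0.0
        ),
        # Count of ASCII punctuation characters.
        "n_punctuation": tweets.apply(
            lambda t: sum(ch in string.punctuation for ch in t)
        ),
        # Count of uppercase characters.
        "n_capitals": tweets.apply(lambda t: sum(ch.isupper() for ch in t)),
    })


def extract_time_features(created):
    """created: pandas Series of timestamps such as '7/12/2016 0:56'."""
    ts = pd.to_datetime(created, format="%m/%d/%Y %H:%M")
    # dayofweek: Monday is 0 and Sunday is 6.
    return pd.DataFrame({"day_of_week": ts.dt.dayofweek, "hour": ts.dt.hour})


demo_tweets = pd.Series(["Great poll- Florida! Thank you!", "MAKE AMERICA GREAT AGAIN"])
demo_created = pd.Series(["7/12/2016 0:56", "7/10/2016 18:58"])
text_feats = extract_text_features(demo_tweets)
time_feats = extract_time_features(demo_created)
print(pd.concat([text_feats, time_feats], axis=1))
```

Each function returns a DataFrame indexed like its input, so the results can be concatenated column-wise with the existing features.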

### What has been done

Implemented features:
- Number of sentences per tweet
- Number of characters per tweet
- Number of characters per sentence

Implemented sklearn's SVM as the learning algorithm.

In [61]:
#<GRADED>
import numpy as np
import pandas as pd
from sklearn import svm
#</GRADED>

## include your imports as necessary and cite open-source implementations appropriately

In [2]:
def read_files(train_file):
    """
    Output:
    df_X : pandas data frame of training data
    Y    : numpy array of labels
    """
    df = pd.read_csv(train_file, index_col=0)
    df_X = df[df.columns[0:17]]
    Y = np.array(df['label'])
        
    return df_X, Y

<h3> Training Data </h3>

<p>Take a look at the file <code>train.csv</code>. Here are the first 10 tweets in the training dataset.</p>

In [186]:
df_X_train, Y_train = read_files('train.csv')
df_X_train[:]

id,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id.1,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted,longitude,latitude,label
0,Senior United States District Judge Robert E. ...,False,14207,,7/12/2016 0:56,False,,752668000000000000,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,5256,False,False,,,-1
1,Speech on Veterans' Reform: https://t.co/XB7R...,False,9666,,7/11/2016 22:18,False,,752628000000000000,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,3432,False,False,,,-1
2,Great poll- Florida! Thank you! https://t.co/4...,False,25531,,7/11/2016 21:40,False,,752619000000000000,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,8810,False,False,,,-1
3,Thoughts and prayers with the victims; and the...,False,28850,,7/11/2016 19:51,False,,752591000000000000,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,9112,False,False,,,-1
4,Join me in Westfield; Indiana- tomorrow night ...,False,12567,,7/11/2016 11:57,False,,752472000000000000,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,4144,False,False,,,-1
5,I heard that the underachieving John King of @...,False,22978,,7/10/2016 18:58,False,,752215000000000000,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,6564,False,False,,,1
6,The media is so dishonest. If I make a stateme...,False,44600,,7/10/2016 18:42,False,,752211000000000000,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,14520,False,False,,,1
7,President Obama thinks the nation is not as di...,False,35167,,7/10/2016 18:27,False,,752208000000000000,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,11975,False,False,,,1
8,Look what is happening to our country under th...,False,55495,,7/10/2016 12:02,False,,752111000000000000,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,19030,False,False,,,1
9,New poll - thank you! #Trump2016 https://t.co...,False,24040,,7/9/2016 21:22,False,,751889000000000000,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,9147,False,False,,,-1


<h3> Train and Classify </h3>

<p>Implement <code>train_and_classify</code>. It should extract feature vectors from the given pandas dataframes, train a model, and return the predicted labels of the test data. The choice of feature vectors and models is up to you.</p>

<p><b>Your final score will be determined by executing <code>train_and_classify</code> with the provided training set for training and a hidden test set for classification. We will then evaluate the accuracy of your output.</b></p>
<p><b>NOTE: Please limit your training time to 10 minutes.</b></p>

In [181]:
def extract_length(df_X_train):
    # the tweets themselves are in the zero-th column. Extract from dataframe
    tweets = df_X_train.iloc[:,0]
    length = tweets.str.len()
    return length

In [182]:
def split_sentences(string):
    '''
    Splits a tweet on ". " so that sentences are separated from each other.
    Does not split on a bare "." so that periods inside URLs are not mistaken
    for the end of a sentence.
    
    TODO: also split on other sentence-ending characters, such as
    exclamation marks and question marks.
    '''
    return string.split(". ")

def extract_number_of_sentences(df_X_train):
    tweets = df_X_train.iloc[:,0]
    splitted_tweets = tweets.apply(split_sentences)
    n_sentences = splitted_tweets.apply(len)
    
    return n_sentences

def extract_number_of_characters_per_sentence(df_X_train):    
    def average_characters_per_string(list_of_strings):
        return np.mean(list(map(len,list_of_strings)))

    tweets = df_X_train.iloc[:,0]
    splitted_tweets = tweets.apply(split_sentences)
    n_characters_per_sentence = splitted_tweets.apply(average_characters_per_string)
    
    return n_characters_per_sentence
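
<p>The docstring of <code>split_sentences</code> suggests also splitting on exclamation marks and question marks. One possible way to do that, offered as a sketch rather than the intended solution, is a regular expression that splits on whitespace preceded by <code>.</code>, <code>!</code> or <code>?</code>. The name <code>split_sentences_regex</code> is hypothetical. Unlike <code>split(". ")</code>, this variant keeps the closing punctuation on each sentence and drops empty pieces, so the sentence and character counts can differ slightly from the current features.</p>

```python
import re

# Whitespace that directly follows '.', '!' or '?' marks a sentence boundary.
# A period inside a URL is not followed by whitespace, so URLs stay intact.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences_regex(text):
    """Split a tweet into sentences, keeping the end punctuation of each one."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


demo_sentences = split_sentences_regex("Great poll- Florida! Thank you! https://t.co/4")
print(demo_sentences)
```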

In [183]:
#<GRADED>
def train_and_classify(df_X_train, Y_train, df_X_test):
    """
    Extracts features from df_X_train. Train a model
    on training data and training labels (Y_train).
    Predict the labels of df_X_test.
    
    df_X_train : pandas data frame of training data
    Y_train    : numpy array of labels for training data
    df_X_test  : pandas data frame of test data
    
    Output:
    Y_test : numpy array of labels for test data
    """
    
    ## fill in code here
    def extract_feature_vec(df_X):
        # extracts feature vectors
        features = []
        
        features.append(extract_length(df_X))
        features.append(extract_number_of_sentences(df_X))
        features.append(extract_number_of_characters_per_sentence(df_X))
        
        return pd.concat(features, axis=1)
    
    X_train = extract_feature_vec(df_X_train)
    X_test  = extract_feature_vec(df_X_test)
    
    # create and train model (consider doing k-fold cross validation as well)
    clf = svm.SVC()
    clf.fit(X_train, Y_train)
    
    # evaluate model: predict labels for the test data, not the training data
    Y_test = clf.predict(X_test)

    return Y_test
#</GRADED>

<h3> Evaluation</h3>

<p>Below is some code to see your accuracy when trained and tested on the training data set.</p>

In [184]:
# train and classify on the training set, then evaluate
Y_pred = train_and_classify(df_X_train, Y_train, df_X_train)

def accuracy(Y_pred, Y_true):
    return (Y_pred == Y_true).sum() / Y_pred.shape[0]

acc = accuracy(Y_pred, Y_train)
print('accuracy: ' + str(round(acc * 100, 2)) + '%')

accuracy: 77.78%
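
<p>Accuracy measured on the same data the model was trained on tends to be optimistic. A common complement is to hold out part of the labeled data and score only on that part. The sketch below assumes a classifier function with the same signature as <code>train_and_classify</code> that returns one prediction per row of its third argument. The names <code>holdout_accuracy</code> and <code>majority_class</code>, and the toy data, are hypothetical and used only for illustration.</p>

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def holdout_accuracy(classify_fn, df_X, Y, test_size=0.2, seed=0):
    """Train on one part of the labeled data and score on the held-out rest."""
    df_tr, df_val, Y_tr, Y_val = train_test_split(
        df_X, Y, test_size=test_size, random_state=seed, stratify=Y
    )
    Y_pred = classify_fn(df_tr, Y_tr, df_val)
    return float(np.mean(Y_pred == Y_val))


def majority_class(df_tr, Y_tr, df_te):
    # Toy stand-in for a real classifier: always predicts the most common training label.
    values, counts = np.unique(Y_tr, return_counts=True)
    return np.full(len(df_te), values[np.argmax(counts)])


# Toy data: 60 rows labeled 1 and 40 rows labeled -1.
demo_X = pd.DataFrame({"text": ["a"] * 60 + ["b"] * 40})
demo_Y = np.array([1] * 60 + [-1] * 40)
demo_acc = holdout_accuracy(majority_class, demo_X, demo_Y)
print(f"hold-out accuracy: {demo_acc:.2f}")
```

<p>With the real data, the call would look like <code>holdout_accuracy(train_and_classify, df_X_train, Y_train)</code>.</p>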


