# Who's Tweeting? Trump vs Trudeau 
## is a project I've seen on Datacamp. It asks you to classify a given tweet of either Donald Trump or Justin Trudeau.
The dataset consists of three columns, ID, Tweet itself and the authors (being either Donald Trump or Justin Trudeau). I have used support vector classifier and logistic regressor in this code, and also compared two word vectorizers; count vectorizer and TF-IDF vectorizer.

**Importing the libraries**

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.linear_model import LogisticRegression

**Importing the data, and I've also removed the label row that was given in the dataset**

In [2]:
tweetsdf = pd.read_table('../input/tweets-of-trump-and-trudeau/tweets.csv', sep=',', names=('ID', 'Author', 'tweet'))
tweetsdf=tweetsdf.iloc[1:]
tweetsdf.head()

Unnamed: 0,ID,Author,tweet
1,1,Donald J. Trump,I will be making a major statement from the @W...
2,2,Donald J. Trump,Just arrived at #ASEAN50 in the Philippines fo...
3,3,Donald J. Trump,"After my tour of Asia, all Countries dealing w..."
4,4,Donald J. Trump,Great to see @RandPaul looking well and back o...
5,5,Donald J. Trump,Excited to be heading home to see the House pa...


**We will predict the author from the tweet column, splitting the data as training and test**

In [3]:
y=tweetsdf['Author']
x=tweetsdf['tweet']
x_train, x_test, y_train, y_test =train_test_split(x,y,test_size=0.33, random_state=50)
print(x_train)

86     Everybody is asking why the Justice Department...
270    RT @RalphGoodale: Excellent travail par @rcmpg...
16     Met with President Putin of Russia who was at ...
109    Thanks to @SenateMajLdr McConnell and the @Sen...
271    Next stop: Da Nang for #APEC2017. Here’s a rec...
                             ...                        
133    The biggest story yesterday, the one that has ...
290    RT @SherryRomanado: Honoured to be leading the...
110    Thank you to the GREAT NYPD, First Responders ...
396    RT @googlecanada: Watch tmw: @JustinTrudeau di...
177    RT @seanhannity: BOOM!!  Tick Tock https://t.c...
Name: tweet, Length: 268, dtype: object


**TF-IDF vectorizer, vectorizes the words by dividing the frequency of that specific word by how many times that word appears in how many documents, it yields a matrix with values between 0 and 1 so it gives better precision than the count vectorizer** The columns of matrix are the words and the rows are the documents. 
It removes English stopwords, and n-gram determines the number of words taken in a phrase, and max and min df values get rid of words either used too much or too rare.

In [4]:
tvec= TfidfVectorizer(stop_words="english", ngram_range=(1,2), max_df=0.9, min_df=0.05)


Splitting the data for the comparison of vectorizers.

In [5]:
t_train=tvec.fit_transform(x_train)
t_test=tvec.fit_transform(x_test)

**Count vectorizer basically counts the words that appear and returns a matrix with columns being the words and rows being tweets.** The elements of matrix are integers. Applying the same procedure with TF-IDF. 

In [6]:
cvec = CountVectorizer(stop_words="english",ngram_range=(1,2), max_df=0.9, min_df=0.05)
c_train=cvec.fit_transform(x_train)
c_test=cvec.fit_transform(x_test)

**Classification with SVC with RBF kernel on the TF-IDF data**

In [7]:
svclassifier = SVC(kernel='rbf')
svclassifier.fit(t_train, y_train)
t_predsvc = svclassifier.predict(t_test)



**Classification with SVC with RBF kernel on Count Vectorizer data**

In [8]:
svclassifier = SVC(kernel='rbf')
svclassifier.fit(c_train, y_train)
c_predsvc = svclassifier.predict(c_test)

**Calculation of accuracies of both vectorizers with SVC**

In [9]:
countsvcacc = accuracy_score(c_predsvc,y_test)
print(confusion_matrix(y_test,c_predsvc))
print(classification_report(y_test,c_predsvc))

tfidfsvmacc = accuracy_score(t_predsvc,y_test)
print(confusion_matrix(y_test,t_predsvc))
print(classification_report(y_test,t_predsvc))

[[70  7]
 [16 39]]
                 precision    recall  f1-score   support

Donald J. Trump       0.81      0.91      0.86        77
 Justin Trudeau       0.85      0.71      0.77        55

       accuracy                           0.83       132
      macro avg       0.83      0.81      0.82       132
   weighted avg       0.83      0.83      0.82       132

[[64 13]
 [ 8 47]]
                 precision    recall  f1-score   support

Donald J. Trump       0.89      0.83      0.86        77
 Justin Trudeau       0.78      0.85      0.82        55

       accuracy                           0.84       132
      macro avg       0.84      0.84      0.84       132
   weighted avg       0.84      0.84      0.84       132



**Classification with logistic regressor on the TF-IDF data**

In [10]:
logclassifier=LogisticRegression(random_state=0, solver='lbfgs') 
logclassifier.fit(t_train, y_train) 
t_predlog = logclassifier.predict(t_test)

**Classification with logistic regressor on the Count Vectorizer data**

In [11]:
logclassifier=LogisticRegression(random_state=0, solver='lbfgs')
logclassifier.fit(c_train, y_train)
c_predlog = logclassifier.predict(c_test)

**Calculation of accuracies of both vectorizers with Logistic Regression**

In [12]:
countlogacc = accuracy_score(c_predlog,y_test)
print(confusion_matrix(y_test,c_predlog))
print(classification_report(y_test,c_predlog))

countlogacc = accuracy_score(t_predlog,y_test)
print(confusion_matrix(y_test,t_predlog))
print(classification_report(y_test,t_predlog))

[[65 12]
 [12 43]]
                 precision    recall  f1-score   support

Donald J. Trump       0.84      0.84      0.84        77
 Justin Trudeau       0.78      0.78      0.78        55

       accuracy                           0.82       132
      macro avg       0.81      0.81      0.81       132
   weighted avg       0.82      0.82      0.82       132

[[64 13]
 [10 45]]
                 precision    recall  f1-score   support

Donald J. Trump       0.86      0.83      0.85        77
 Justin Trudeau       0.78      0.82      0.80        55

       accuracy                           0.83       132
      macro avg       0.82      0.82      0.82       132
   weighted avg       0.83      0.83      0.83       132



**Confusion matrices for both vectorizers**

In [13]:
tlog_confmatrix = confusion_matrix(t_predlog,y_test)
clog_confmatrix = confusion_matrix(c_predlog,y_test)

tsvc_confmatrix = confusion_matrix(t_predsvc,y_test)
csvc_confmatrix = confusion_matrix(c_predsvc,y_test)
print(tlog_confmatrix)
print(clog_confmatrix)
print(tsvc_confmatrix)
print(csvc_confmatrix)

[[64 10]
 [13 45]]
[[65 12]
 [12 43]]
[[64  8]
 [13 47]]
[[70 16]
 [ 7 39]]
