# Classification of MEPs' Tweets
Erasmia Kornelatou, f2821907

In [233]:
import os
import tweepy as tw
import pandas as pd
import json
import nltk
import re
import unicodedata
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
import numpy as np

# Downloading the stop words list
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\astar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\astar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Data Preparation

* Get the dataset from <https://www.clarin.si/repository/xmlui/handle/11356/1071>.

* You will use the `retweets.csv` file.

In [234]:
#reading the retweets.csv
data = pd.read_csv('...retweets.csv')

* Keep only the records for which the language of the original tweet is in English.

In [235]:
data = data[data['lang'] == 'en']

* Get the text of the *original tweet* and add it to the dataset as an extra column. Use the Tweeter API to get the text (e.g., with Tweepy). In order not to run into rate limits you can ask for multiple tweets with one call.
* Keep only the records for which you were able to download the tweet text.

1. We convert origin ids of tweets to list

In [236]:
ids =data['origTweetId'].to_list()

2. We connect to twitter with credentials

In [237]:
auth = tw.OAuthHandler('...', '...')
auth.set_access_token('...', '...')
api = tw.API(auth,wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

3. We load 100 tweets per request.

In [238]:
# number of ids
num = len(ids)
# limit for forloop
limit = int(num / 100) + 1
# tweets loaded by twitter
tweets = []
    
 #loading 100 tweets per request   
for i in range(limit):
    temp = (i + 1) * 100
    if(num < temp):
        tweets.extend(api.statuses_lookup(id_ = ids[i * 100:num]))
    else:
        tweets.extend(api.statuses_lookup(id_ = ids[i * 100:temp]))
        
tweets  
tweet_list = []
for tweet in tweets:
    tweet_list.append(tweet._json)
data2 = pd.DataFrame(tweet_list) 

4. We add the text in a new column in the dataframe

In [239]:
tw_array = pd.merge(left = data,right = data2[['id','text']],left_on= 'origTweetId',right_on= 'id',how ='inner').drop('id', axis=1)

* Group the records by the European group of the MEP that posted the original tweet. If you see that there are groups with very few tweets (less than 50), drop them. 

In [240]:
eurogroup =pd.DataFrame(tw_array['origUserId'].groupby(tw_array['origMepGroupShort']).count())
eurogroup

Unnamed: 0_level_0,origUserId
origMepGroupShort,Unnamed: 1_level_1
ALDE,1868
ECR,1102
EFDD,3283
ENL,23
EPP,2112
GUE-NGL,356
Greens-EFA,1199
NI,1
S&D,3107


As it seems, NI and ENL  are groups with very few tweets (less than 50), so we are going to drop them.

In [241]:
tw_array = tw_array[tw_array['origMepGroupShort'] != 'ENL']
tw_array = tw_array[tw_array['origMepGroupShort'] != 'NI']

## Data Cleaning
* You may want to strip accents, convert everything to lowercase, and remove all English stopwords. In general, you may experiment with additional ideas about how best to tokenize etc.

1. We remove duplicate rows as they contain the same tweets.

In [242]:
tw_array = tw_array.drop_duplicates(subset='text', keep="last")

2. We remove URLS as they do not help in classification, repeated characters (ex. likeeeeeeeeee into like) and accents (ex. café into cafe) .

In [243]:
for t in tw_array['text']:
    t = re.sub('((@\S+|https?://\S+))', 'URL', t) 
    t = word_tokenize(t) 
    for word in t:
       word = unicodedata.normalize('NFKD', word).encode('ASCII', 'ignore')    

3. We remove symbols: #.,:;/\()'|+“%‘’*&€@'?!-  .

In [244]:
symbols = list("#.,:;/\()'|+“%‘’*&€@'?!-")
for s in symbols:
    tw_array['text'] = tw_array['text'].str.replace(s, '')   

4. We convert everything to lowercase.

In [245]:
tw_array['text'] = tw_array['text'].str.lower()

5. We remove Possessive pronouns.

In [246]:
tw_array['text'] = tw_array['text'].str.replace("'s", "")

6. We remove \r , \n and "  .

In [247]:
tw_array['text'] = tw_array['text'].str.replace("\r", " ")
tw_array['text'] = tw_array['text'].str.replace("\n", " ")
tw_array['text'] = tw_array['text'].str.replace('"', '')

7. We remove stop words in english.

In [248]:
stop_words = list(stopwords.words('english'))
for stop in stop_words:

    regex = r"\b" + stop + r"\b"
    tw_array['text'] = tw_array['text'].str.replace(regex, '')   

8. We remove double spaces.  

In [249]:
#remove double spaces    
tw_array['text'] = tw_array['text'].str.replace("  ", " ")

9. We remove null tweet texts.

In [250]:
tw_array = tw_array[tw_array.text.notnull()]

## Classification
* Train at least two algorithms to learn to classify an unseen tweet. The target variable should be the political party of the original poster and the training features should be the original tweet's text.

* You should split your data to training and testing datasets, try the different algorithms with cross validation on the training dataset, and find the best hyperparameters for the best algorithm. 
 
* The tweet texts must be converted to a format suitable the classification, bag of word matrices or tf-idf matrices. You must investigate which one gives the best results. 


In [252]:
#split train and test data
X_train, X_test, y_train, y_test = train_test_split(
    tw_array['text'],tw_array['origMepGroupId'], test_size=0.15, random_state=42,shuffle = True)

1a. Multinomial Naive Bayes with tf-idf matrices format

In [254]:
text_clf = Pipeline([('vect', CountVectorizer(strip_accents = 'unicode',lowercase =True,stop_words ='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(X_train, y_train)

predicted = text_clf.predict(X_test)
print("The accuracy is: ")
np.mean(predicted == y_test) 

The accuracy is: 


0.5492772667542707

1b. Multinomial Naive Bayes with bag of word matrices format

In [255]:
text_clf = Pipeline([('vect', CountVectorizer(strip_accents = 'unicode',lowercase =True,stop_words ='english')),
                     ('clf', MultinomialNB())])
text_clf.fit(X_train, y_train)

predicted = text_clf.predict(X_test)
print("The accuracy is: ")
np.mean(predicted == y_test) 

The accuracy is: 


0.6254927726675427

2a. Stochastic Gradient Descent with tf-idf matrices format

In [259]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer(strip_accents = 'unicode',lowercase =True,stop_words ='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3,
                                           random_state=42))])
text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)
print("The accuracy is: ")
np.mean(predicted == y_test)

The accuracy is: 


0.621550591327201

2b. Stochastic Gradient Descent with bag of word matrices format

In [262]:
text_clf = Pipeline([('vect', CountVectorizer(strip_accents = 'unicode',lowercase =True,stop_words ='english')),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3,
                                           random_state=42))])
text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)
print("The accuracy is: ")
np.mean(predicted == y_test)

The accuracy is: 


0.6353482260183968

3a. Support Vector Machines Classifier with tf-idf matrices format

In [265]:
text_clf = Pipeline([('vect', CountVectorizer(strip_accents = 'unicode',lowercase =True,stop_words ='english')),
                    ('tfidf', TfidfTransformer()),
                     ('clf',SVC(C=1, kernel='linear'))])                                           

text_clf.fit(X_train, y_train)
predicted_svc = text_clf.predict(X_test)
print("The accuracy is: ")
np.mean(predicted_svc == y_test)

The accuracy is: 


0.6360052562417872

3b. Support Vector Machines Classifier with  bag of word matrices format 

In [267]:
from sklearn.svm import SVC
text_clf = Pipeline([('vect', CountVectorizer(strip_accents = 'unicode',lowercase =True,stop_words ='english')),
                     ('clf',SVC(C=1, kernel='linear'))])                                           

text_clf.fit(X_train, y_train)
predicted_svc = text_clf.predict(X_test)
print("The accuracy is: ")
np.mean(predicted_svc == y_test)

The accuracy is: 


0.6090670170827858

* To gauge the efficacy of the algorithm, report also the results of a baseline classifier, using, for instance, scikit-learn's [`DummyClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html).

In [268]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(tw_array['text'], tw_array['origMepGroupId'])
dummy_clf.predict(tw_array['text'])
dummy_clf.score(tw_array['text'], tw_array['origMepGroupId'])

0.238391008577344

In conclusion, we choose the Support Vector Machines Classifier with tf-idf matrices format format as it has the highest accuracy  (0.6360052562417872) , so it is greater than 0.238391008577344 which is the accuracy for the dummyClassifier. 