# Obama-McCain 2008 debate tweet classification
### Mohammadreza Osouli - 610395077

This jupyter note-book compare different classifiers and word-vectorizing methods accuracy on classifing emotions of tweets.

### Loding dataset
First of all, we should read data from excel file. In the code bellow, the dataset coloumns text and target are loaded.

In [1]:
import pandas as pd

def load_dataset(filename, cols):
    dataset = pd.read_csv(filename, encoding='latin-1')
    dataset.columns = cols
    return dataset

dataset = load_dataset("StrictOMD.csv", ['target', 'text'])

### Pre-processing
Pre-processing is one of the most important parts in text-mining or text-classification tasks. In this task, I did the following edits to any tweet in dataset.
1. Making the tweet lowercase
2. Removing mentions and hashtags
3. Removing punctuations (this part can be very tricky but in the models I used, I should do it)
4. Removing stop-words (like the last part, this may be tricky at all, but should be done in this order)
4. Stemming words (I did this part to only have the actual root of any word as result)

In [2]:
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))

def preprocess_tweet_text(tweet):
    tweet.lower()
    # Remove urls
    tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet, flags=re.MULTILINE)
    # Remove user @ references and '#' from tweet
    tweet = re.sub(r'\@\w+|\#\w+', '', tweet)
    # Remove punctuations
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    # Remove stopwords
    tweet_tokens = word_tokenize(tweet)
    filtered_words = [w for w in tweet_tokens if not w in stop_words]

    ps = PorterStemmer()
    stemmed_words = [ps.stem(w) for w in filtered_words]
    lemmatizer = WordNetLemmatizer()
    lemma_words = [lemmatizer.lemmatize(w, pos='a') for w in stemmed_words]

    result = " ".join(lemma_words)
    # print(result)

    return result

dataset.text = dataset['text'].apply(preprocess_tweet_text)

### Feature vectors

I used two methods for getting feature vector for tweets, tf-idf and glove.

For glove method, I used a pre-trained model.

In [3]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def get_feature_vector(train_fit):
    vector = TfidfVectorizer(sublinear_tf=True)
    vector.fit(train_fit)
    return vector

tf_vector = get_feature_vector(np.array(dataset.iloc[:, 1]).ravel())

tf_X = tf_vector.transform(np.array(dataset.iloc[:, 1]).ravel())
tf_y = np.array(dataset.iloc[:, 0]).ravel()

In [4]:
from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer('glove')
glove_X = glove.transform(np.array(dataset.iloc[:, 1]).ravel())
glove_y = np.array(dataset.iloc[:, 0]).ravel()

### Cross-validation

As we didn't have test and train data separated, I used cross validation with 5 splits to evalute my models.


In [5]:
from sklearn.model_selection import cross_val_score, ShuffleSplit
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

### Classifiers: 1. Naive-Bayes

Naive-Bayes method used as first classifier. As it cannot handle negative values, I did'nt run it on glove features.

In [10]:
from sklearn.naive_bayes import MultinomialNB

NB_model = MultinomialNB()
scores = cross_val_score(NB_model, tf_X, tf_y, cv=cv)
print("Naive Bayes Model with tf-idf:", scores)

# scores = cross_val_score(NB_model, glove_X, glove_y, cv=cv)
# print("Naive Bayes Model with glove:", scores)

Naive Bayes Model with tf-idf: [0.74545455 0.81818182 0.76363636 0.74909091 0.77818182]


### Classifiers: 2. Logistic Regression

Loggistic regression is always a choice for classifying large size feature vectors. As the result shown, It had better accuracy on tf-idf method. 

In [7]:
from sklearn.linear_model import LogisticRegression

LR_model = LogisticRegression(solver='lbfgs')
scores = cross_val_score(LR_model, tf_X, tf_y, cv=cv)
print("Logistic Regression with tf_idf:", scores)

scores = cross_val_score(LR_model, glove_X, glove_y, cv=cv)
print("Logistic Regression with glove:", scores)

Logistic Regression with tf_idf: [0.76727273 0.80363636 0.76       0.75272727 0.79636364]
Logistic Regression with glove: [0.74181818 0.74909091 0.73090909 0.72       0.71636364]


### Classifieres: 3. Support Vector Machine

After loggistic reggresion, Support vector machines are always a powerfull choice for classifying well-separated features. It had the best score among all classifiers with tf-idf features. 

In [8]:
from sklearn.svm import SVC

SVC_model = SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(SVC_model, tf_X, tf_y, cv=cv)
print("Support vector machine with tf_idf:", scores)

scores = cross_val_score(SVC_model, glove_X, glove_y, cv=cv)
print("Support vector machine with glove:", scores)

Support vector machine with tf_idf: [0.81090909 0.82545455 0.81454545 0.82181818 0.82909091]
Support vector machine with glove: [0.73454545 0.74181818 0.72       0.70909091 0.72363636]


### Classifiers: 4. Multi-Layer Perceptron
At last, I tried neural networks for classifying our vectors. As I guessed, features were not well separated, so I used MLP networks to have a better clustering. Its results on tf-idf feauters seems fine but the network can be tuned better in my opinion.

In [14]:
from sklearn.neural_network import MLPClassifier

MLP_model = MLPClassifier(solver='lbfgs', alpha=1e-4,
                    hidden_layer_sizes=(1200, 200), random_state=1)
scores = cross_val_score(MLP_model, tf_X, tf_y, cv=cv)
print("Multi-Layer Perceptron with tf_idf:", scores)

scores = cross_val_score(MLP_model, glove_X, glove_y, cv=cv)
print("Multi-Layer Perceptron with glove:", scores)

Multi-Layer Perceptron with tf_idf: [0.81090909 0.8        0.83272727 0.85818182 0.81454545]
Multi-Layer Perceptron with glove: [0.69454545 0.71272727 0.76727273 0.68727273 0.72363636]


### Conclusion
Four classifiers and two feature methods were implemented in this notebook to classify Obama-McCaine 2008 debate tweets' emotions. Best accuracy achived by using SVM on tf-idf vectors with about 81% accuracy. By contrast we have MLP on Glove vectors with about 68% accuracy.