# Sentiment classifier for Twitter classification dataset

The goal of this exercise is to develop a sentiment classifier using logistic regression for the Twitter sentiment classification dataset presented below. We are going to build bag-of-words features with tools from Scikit-Learn and see how the classifier performs on the given dataset, aiming for high accuracy. Afterwards, we are going to evaluate the classifier using several metrics.

## Import Libraries

In [1]:
# For data visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
# For large and multi-dimensional arrays
import numpy as np
# For data manipulation and analysis
import pandas as pd
# Necessary for data format
from sklearn.feature_extraction.text import CountVectorizer
# Machine learning model
from sklearn.linear_model import LogisticRegression
# Metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import precision_recall_fscore_support 

# Best parameters during classification
from sklearn.model_selection import GridSearchCV
# For basic cleaning and data preprocessing 
import re
# Data preprocessing
from sklearn.model_selection import train_test_split
# Validation of the model
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate


## Dataset Configuration

We are going to import the dataset...

In [2]:
df = pd.read_csv("SentimentTweets.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86681 entries, 0 to 86680
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  86681 non-null  int64 
 1   target      86681 non-null  int64 
 2   id          86681 non-null  int64 
 3   date        86681 non-null  object
 4   flag        86681 non-null  object
 5   user        86680 non-null  object
 6   text        86680 non-null  object
dtypes: int64(3), object(4)
memory usage: 4.6+ MB


...and take a look at its columns

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,target,id,date,flag,user,text
0,680949,0,2249621587,Fri Jun 19 22:41:08 PDT 2009,NO_QUERY,sukumarpant,#brokenpromises...
1,406741,0,2059003515,Sat Jun 06 16:03:21 PDT 2009,NO_QUERY,MTMSparrow,David Carradine so sad. Thai's law not sure i...
2,1337108,4,2017466467,Wed Jun 03 08:26:14 PDT 2009,NO_QUERY,itsmemcee,A @ 415 B @ 425. Tell your bro i say congrats!
3,1560887,4,2186457254,Mon Jun 15 18:52:04 PDT 2009,NO_QUERY,jdfreivald,@littlefluffycat Indeed.
4,1466295,4,2064458395,Sun Jun 07 06:19:20 PDT 2009,NO_QUERY,CrazyHan,Completed Race 4 Life in 58mins with girlies f...


Since we are going to use bag-of-words features, we have to preprocess our dataset. We are going to convert everything to lowercase and remove links, punctuation and other stray characters

In [4]:
def text_normalization(text):
    # convert text to lowercase
    text = text.lower()
    # remove links first, while the "://" and "." they rely on are still present
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # remove newlines, carriage returns and tabs
    text = re.sub(r'\n|\r|\t', '', text)
    # remove punctuation and other special characters (keeps letters, digits, "_" and whitespace)
    text = re.sub(r'[^\w\s]+', '', text)
    # first group of escape remnants: "u" followed by a digit and word characters
    text = re.sub(r'u\d\w+', '', text)
    # second group: "x" followed by a letter and a digit
    text = re.sub(r'x[a-z]\d', '', text)
    # return normalized text
    return text
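
The order of the substitutions matters for link removal: a URL pattern can only match while the `://` or the dots are still in the text. The following is a minimal sketch of that effect, using a made-up sample tweet and a URL pattern chosen for illustration (neither comes from the dataset):

```python
import re

# Hypothetical sample tweet, used only for illustration
sample = "Loving this! http://example.com/abc www.example.org #happy"

# Stripping punctuation first destroys the "://" and "." that a URL pattern relies on,
# so the link survives as a glued-together token
no_punct = re.sub(r'[^\w\s]+', '', sample.lower())

# Removing links first, then punctuation, leaves only the actual words
no_links = re.sub(r'https?://\S+|www\.\S+', ' ', sample.lower())
cleaned = re.sub(r'[^\w\s]+', '', no_links)
cleaned = re.sub(r'\s+', ' ', cleaned).strip()

print(no_punct)
print(cleaned)
```

In the first result the link remains in the text as one long pseudo-word, which would end up as a spurious entry in the bag-of-words vocabulary.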

In [5]:
def preprocess(input_df):
    # Remove rows with missing values in any column
    input_df.dropna(inplace=True)
    # Apply the normalization to every tweet; np.vectorize is a convenience
    # wrapper around a Python loop, not a performance optimization
    vfunc = np.vectorize(text_normalization)
    input_df.text = vfunc(input_df.text.values)
    # return processed input_df
    return input_df

Let's now apply those techniques to our dataset

In [6]:
df = preprocess(df)
df.head()

Unnamed: 0.1,Unnamed: 0,target,id,date,flag,user,text
0,680949,0,2249621587,Fri Jun 19 22:41:08 PDT 2009,NO_QUERY,sukumarpant,brokenpromises
1,406741,0,2059003515,Sat Jun 06 16:03:21 PDT 2009,NO_QUERY,MTMSparrow,david carradine so sad thais law not sure if ...
2,1337108,4,2017466467,Wed Jun 03 08:26:14 PDT 2009,NO_QUERY,itsmemcee,a 415 b 425 tell your bro i say congrats
3,1560887,4,2186457254,Mon Jun 15 18:52:04 PDT 2009,NO_QUERY,jdfreivald,littlefluffycat indeed
4,1466295,4,2064458395,Sun Jun 07 06:19:20 PDT 2009,NO_QUERY,CrazyHan,completed race 4 life in 58mins with girlies f...


For convenience, we are going to convert the target value `4` to `1`, so that the labels are `0` and `1`

In [7]:
def change_to_1(x):
    if (x == 4):
        return 1
    else:
        return x

In [8]:
df['target'] = df['target'].apply(change_to_1)
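
The same relabeling can probably be written without a helper function. A minimal sketch on a toy Series (the name `toy_target` and its values are illustrative, not taken from the dataset), using pandas' `Series.replace`:

```python
import pandas as pd

# Toy labels using the dataset's 0/4 encoding (illustrative values only)
toy_target = pd.Series([0, 4, 4, 0, 4])

# Replace every 4 with 1 and leave all other values untouched
remapped = toy_target.replace(4, 1)

print(remapped.tolist())
```

This returns a new Series rather than modifying the original, so the result has to be assigned back to the column.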

In [9]:
df.head()

Unnamed: 0.1,Unnamed: 0,target,id,date,flag,user,text
0,680949,0,2249621587,Fri Jun 19 22:41:08 PDT 2009,NO_QUERY,sukumarpant,brokenpromises
1,406741,0,2059003515,Sat Jun 06 16:03:21 PDT 2009,NO_QUERY,MTMSparrow,david carradine so sad thais law not sure if ...
2,1337108,1,2017466467,Wed Jun 03 08:26:14 PDT 2009,NO_QUERY,itsmemcee,a 415 b 425 tell your bro i say congrats
3,1560887,1,2186457254,Mon Jun 15 18:52:04 PDT 2009,NO_QUERY,jdfreivald,littlefluffycat indeed
4,1466295,1,2064458395,Sun Jun 07 06:19:20 PDT 2009,NO_QUERY,CrazyHan,completed race 4 life in 58mins with girlies f...


### Splitting the dataset

In order to train our model, we are going to use a significant portion of the dataset, but we also need some data to test our classifier. Thus, we are going to split the dataset into two parts: one for training and one for testing

First, we are going to define our X and Y variables

In [10]:
X = df['text']
Y = df['target']

Then, we split the dataset, keeping 80% for training and 20% for testing

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2)
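
Without a fixed seed, every run produces a different split and therefore slightly different scores. A minimal sketch of a reproducible, class-balanced 80/20 split on toy data follows; the `random_state=42` and `stratify` arguments and the toy values are assumptions for illustration, not settings used in this notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: ten placeholder tweets with alternating 0/1 labels
toy_X = pd.Series([f"tweet {i}" for i in range(10)])
toy_Y = pd.Series([0, 1] * 5)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, Y_tr, Y_te = train_test_split(
    toy_X, toy_Y, test_size=0.2, random_state=42, stratify=toy_Y
)

print(len(X_tr), len(X_te))
print(Y_te.value_counts().to_dict())
```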

## Feature extraction

We are going to use Bag-of-Words in order to train our system

In [12]:
# creating the feature matrix 
cv = CountVectorizer()

# fit the vocabulary on the training set only, then reuse it to transform the test set
X_train = cv.fit_transform(X_train)
X_test = cv.transform(X_test)
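
To make the bag-of-words representation concrete, here is a minimal sketch on a tiny made-up corpus (the `toy_*` names and sentences are illustrative only). Each column counts one vocabulary word, and words that were not seen during fitting are ignored when transforming new text:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus, not taken from the dataset
toy_train = ["good day", "bad day", "good good mood"]
toy_test = ["good night"]

toy_cv = CountVectorizer()
# Learn the vocabulary from the training texts and count the words
train_counts = toy_cv.fit_transform(toy_train)
# Reuse the learned vocabulary; "night" was never seen, so it is dropped
test_counts = toy_cv.transform(toy_test)

print(sorted(toy_cv.vocabulary_))
print(train_counts.toarray())
print(test_counts.toarray())
```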

## Linear Classifier

As we stated, we are going to use a linear classifier with logistic regression, which is given to us as a handy class by Scikit-Learn

In [13]:
model = LogisticRegression(max_iter=1000)

# fit the classifier
model.fit(X = X_train, y = Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Now, let's use our model in order to predict the Y values of the test set

In [14]:
y_pred = model.predict(X_test)

Let's now see the results of our classifying method


In [15]:
print(classification_report(Y_test,y_pred))

              precision    recall  f1-score   support

           0       0.77      0.76      0.76      8600
           1       0.76      0.78      0.77      8736

    accuracy                           0.77     17336
   macro avg       0.77      0.77      0.77     17336
weighted avg       0.77      0.77      0.77     17336



In [16]:
print("Accuracy score: {}".format(round(accuracy_score(Y_test,y_pred),3)))

Accuracy score: 0.769
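
The accuracy can also be read off a confusion matrix, which additionally shows which kind of error the classifier makes. A minimal sketch with made-up labels (the `toy_true` and `toy_pred` values are illustrative, not the notebook's predictions):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative ground truth and predictions for a binary problem
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of correct predictions (the diagonal) among all samples
acc = (tn + tp) / cm.sum()

print(cm)
print(acc, accuracy_score(toy_true, toy_pred))
```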


As we can see, without applying extreme preprocessing or complicated features, our logistic regression classifier works reasonably well, achieving an accuracy of about 77%


### k-Fold cross validation


Finally, we are going to evaluate our model using 10-fold cross validation

In [17]:
scores = cross_val_score(model, X_train, Y_train, cv = 10)
print('Cross-Validation Accuracy Scores\n', scores)

Cross-Validation Accuracy Scores
 [0.77217015 0.77822639 0.7666907  0.77361211 0.77502163 0.77545428
 0.77920392 0.77516585 0.7636285  0.77588693]
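
One possible refinement is to cross-validate on the raw text with the vectorizer inside a pipeline, so that each fold learns its vocabulary only from that fold's training part. The sketch below assumes scikit-learn's `make_pipeline` and uses a tiny made-up corpus with 3 folds; none of these names or values come from the notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus: six positive and six negative texts
toy_texts = [
    "good day", "great mood", "love this", "so happy", "nice one", "really good",
    "bad day", "awful mood", "hate this", "so sad", "terrible one", "really bad",
]
toy_labels = [1] * 6 + [0] * 6

# The vectorizer is refitted on the training part of every fold
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipe_scores = cross_val_score(pipe, toy_texts, toy_labels, cv=3)

print(pipe_scores)
```

With a corpus this small the scores themselves are meaningless; the point is only the structure of the evaluation.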


We can then compute the mean of our scores

In [18]:
scores = pd.Series(scores)
print(scores.mean())

0.7735060470240681
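
Besides the mean, the spread of the fold scores indicates how stable the model is across splits. A minimal sketch that summarizes the ten scores printed above (copied by hand into the assumed name `fold_scores`):

```python
import numpy as np

# The ten cross-validation accuracy scores reported above
fold_scores = np.array([
    0.77217015, 0.77822639, 0.7666907, 0.77361211, 0.77502163,
    0.77545428, 0.77920392, 0.77516585, 0.7636285, 0.77588693,
])

# Mean, standard deviation and range across folds
print("mean:", fold_scores.mean())
print("std: ", fold_scores.std())
print("min: ", fold_scores.min(), "max:", fold_scores.max())
```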


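We can see that our results are rather satisfying: the mean cross-validation accuracy (about 0.774) is close to the accuracy on the held-out test set (0.769), which suggests that our model is not overfitting to a particular split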