Christopher O'Brien

DSE 511 - Project 2

In [1]:
import pandas as pd
import numpy as np
import time
import preprocessor as p


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer


from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

import nltk
from nltk.corpus import stopwords
from collections import Counter

# DATASET DESCRIPTION

The dataset analyzed was obtained from Kaggle’s ‘Natural Language Processing with Disaster Tweets’ competition. From the competition, we were instructed to use only the "train.csv" file, which contains 7,613 tweets. Each row of the dataset contains the following fields: id, keyword, location, text, and target. The task of the project was to classify whether each tweet discusses a real disaster or not.

# FEATURE EXTRACTION  

I began by preprocessing the dataset. After installing the tweet-preprocessor package with pip, I used it to clean the text by removing URLs, hashtags, mentions, reserved words (RT, FAV), emojis, and smileys. As a final preprocessing step, I replaced punctuation with spaces and collapsed repeated whitespace. I then converted the cleaned text into ML-compatible features using TfidfVectorizer. I chose TfidfVectorizer over CountVectorizer to avoid biasing the features toward the most frequent words in the dataset. The main parameters I tuned were min_df and ngram_range. A min_df of 3 worked best, and combining unigrams and bigrams performed best in my initial testing; performance suffered when only unigrams or only bigrams were used. I also found that performance improved when the tweets were converted to lowercase by TfidfVectorizer. I used a 70/15/15 split to separate the vectorized data into train, validation, and test sets.
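
To make the roles of ngram_range=(1, 2) and min_df=3 concrete, here is a minimal pure-Python sketch on a made-up corpus. The helper `ngrams` and the toy documents are hypothetical and only illustrate the idea; the sketch skips stop-word removal and is not how TfidfVectorizer is implemented internally.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max as strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Hypothetical toy corpus, already lowercased and cleaned
toy_docs = [
    "forest fire near town",
    "forest fire spreads fast",
    "forest fire evacuation ordered",
    "my mixtape is fire",
]

# Document frequency: in how many documents does each n-gram appear?
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(set(ngrams(doc.split())))

# Keep only n-grams that appear in at least min_df documents
min_df = 3
vocabulary = sorted(g for g, df in doc_freq.items() if df >= min_df)
print(vocabulary)
```

Only 'fire', 'forest', and the bigram 'forest fire' survive the min_df filter here. The bigram is what lets a linear model tell a literal "forest fire" apart from a figurative use of "fire".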

## Data Import and Cleaning

In [2]:
tweets = pd.read_csv("train.csv") #load all data

#uses the tweet-preprocessor package - https://pypi.org/project/tweet-preprocessor/
#it removes URLs, Hashtags, Mentions, Reserved words (RT, FAV), Emojis, Smileys
def preprocess_tweet(row):
    text = row['text']
    text = p.clean(text)
    return text

#apply preprocess function
tweets['text'] = tweets.apply(preprocess_tweet, axis=1)

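#replace punctuation with spaces and collapse repeated whitespace
#(regex=True is passed explicitly because newer pandas treats the pattern as a literal by default)
tweets['text'] = tweets['text'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s\s+', ' ', regex=True)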

dC_x_fin = tweets['text'] #only tweets
dC_y = tweets['target'] #only labels
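
As a quick illustration of what the two regular expressions in the cleaning step do, the sketch below applies them to a single string with the standard `re` module. The function name `clean_punctuation` and the sample sentence are hypothetical and not part of the notebook.

```python
import re

def clean_punctuation(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", " ", text)  # anything that is not a word character or whitespace
    text = re.sub(r"\s\s+", " ", text)    # two or more whitespace characters -> one space
    return text

sample = "Forest fire near La Ronge, Sask. -- Canada!"  # hypothetical example
print(repr(clean_punctuation(sample)))
```

A single trailing space is left where the final "!" used to be, because the second pattern only matches runs of two or more whitespace characters. This is harmless for the vectorizer, which splits on token boundaries anyway.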




### Tfidf

In [3]:
# process raw text into ML compatible features
vectorizer = TfidfVectorizer(min_df=3, 
             stop_words='english',ngram_range=(1, 2), lowercase=True)  
vectorizer.fit(dC_x_fin)
#print(vectorizer.vocabulary_)
X = vectorizer.transform(dC_x_fin)
#vectorizer.get_stop_words()

#### Tfidf Split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, dC_y, 
                                   test_size=0.15, shuffle=True, stratify=dC_y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, 
                                 test_size=0.15/0.85, shuffle=True, stratify=y_train, random_state=42)

In [5]:
print('original:',X.shape,dC_y.shape)
print('train:',X_train.shape,y_train.shape)
print('val:',X_val.shape,y_val.shape)
print('test:',X_test.shape,y_test.shape)

original: (7613, 5826) (7613,)
train: (5329, 5826) (5329,)
val: (1142, 5826) (1142,)
test: (1142, 5826) (1142,)
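
The second call to train_test_split uses test_size=0.15/0.85 because the validation set should be 15% of the original data, not 15% of what remains after the test set is removed. The arithmetic can be checked with a short sketch; it assumes the held-out count is rounded up, which is consistent with the shapes printed above.

```python
import math

n_total = 7613      # rows in train.csv, from the dataset description
test_frac = 0.15

# First split: hold out 15% of everything as the test set
n_test = math.ceil(test_frac * n_total)   # assumption: held-out count is rounded up
n_rest = n_total - n_test

# Second split: 15% of the ORIGINAL data is 0.15 / 0.85 of the remainder
val_frac_of_rest = test_frac / (1 - test_frac)
n_val = math.ceil(val_frac_of_rest * n_rest)
n_train = n_rest - n_val

print(n_train, n_val, n_test)
```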


# MODELS AND HYPERPARAMETER OPTIMIZATION 

The machine learning models that I chose to implement were Logistic Regression, RandomForestClassifier, and LinearSVC. For hyperparameter optimization, I used GridSearchCV. Logistic Regression was my starting point for analyzing the text data. For the exhaustive grid search I examined the following hyperparameters: solver, penalty, and C. I started with an arbitrary range of C values and, once a semi-optimal value was found, narrowed the range around it. Next, I used a RandomForestClassifier. I chose this model mainly because I had no prior experience working with it. This model took by far the longest to run an exhaustive grid search on, largely due to the model's run time and the number of hyperparameters evaluated. The hyperparameters that I examined were max_features, max_depth, min_samples_split, min_samples_leaf, and bootstrap. Finally, I implemented the LinearSVC model. This model was chosen after examining the scikit-learn algorithm cheat sheet that was presented at the beginning of our lecture. The hyperparameters that I evaluated were penalty, loss, and C. As with Logistic Regression, I narrowed in on an optimal C value in the same way.
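
The narrowed C grids used in the cells below are consistent with evenly spaced values, 10 points from 1.0 to 1.2 for Logistic Regression and 20 points from 0.7 to 1.0 for LinearSVC. The sketch below reproduces them with a small pure-Python helper. The helper `linspace` is a hypothetical stand-in, and it is an assumption that the original values came from something like np.linspace.

```python
def linspace(start, stop, num):
    """Return num evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

# Assumed ranges, inferred from the values listed in the grids
lr_c_values = linspace(1.0, 1.2, 10)
svc_c_values = linspace(0.7, 1.0, 20)

print([round(c, 8) for c in lr_c_values])
print([round(c, 8) for c in svc_c_values])
```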

## Logistic Regression

In [6]:
model = LogisticRegression(random_state=0, max_iter=1000)
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'saga', 'sag']
penalty = ['l2']
c_values = [1.        , 1.02222222, 1.04444444, 1.06666667, 1.08888889,
       1.11111111, 1.13333333, 1.15555556, 1.17777778, 1.2]

#define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_val,y_val)

#summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.739987 using {'C': 1.0, 'penalty': 'l2', 'solver': 'newton-cg'}
0.739987 (0.030535) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'newton-cg'}
0.739987 (0.030535) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'lbfgs'}
0.739110 (0.030041) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
0.739110 (0.030041) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'saga'}
0.739987 (0.030535) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'sag'}
0.739110 (0.030041) with: {'C': 1.02222222, 'penalty': 'l2', 'solver': 'newton-cg'}
0.739110 (0.030041) with: {'C': 1.02222222, 'penalty': 'l2', 'solver': 'lbfgs'}
0.739106 (0.029039) with: {'C': 1.02222222, 'penalty': 'l2', 'solver': 'liblinear'}
0.738233 (0.029644) with: {'C': 1.02222222, 'penalty': 'l2', 'solver': 'saga'}
0.739110 (0.030041) with: {'C': 1.02222222, 'penalty': 'l2', 'solver': 'sag'}
0.738233 (0.030280) with: {'C': 1.04444444, 'penalty': 'l2', 'solver': 'newton-cg'}
0.738233 (0.030280) with: {'C': 1.04444444, 'penalty': 'l2', 'solver'

## RandomForestClassifier

In [8]:
model = RandomForestClassifier(random_state = 0, n_estimators = 200)
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 5)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5] #10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2] #4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
grid = {
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_val,y_val)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.747870 using {'bootstrap': True, 'max_depth': 110, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 5}
0.636624 (0.013252) with: {'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2}
0.635750 (0.015438) with: {'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 5}
0.638374 (0.011936) with: {'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 2}
0.638374 (0.011936) with: {'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 5}
0.636624 (0.013252) with: {'bootstrap': True, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2}
0.635750 (0.015438) with: {'bootstrap': True, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5}
0.638374 (0.011936) with: {'bootstrap': True, 'max_depth': 10, 'ma
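
To see why the random forest search took the longest, it helps to count how much work the grid implies. This is a rough sketch: the fold count of 5 is an assumption based on the GridSearchCV default in recent scikit-learn versions, since no cv argument is passed above.

```python
from itertools import product

# Random forest grid as defined in the cell above
rf_grid = {
    "max_features": ["auto", "sqrt"],
    "max_depth": [10, 35, 60, 85, 110, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True, False],
}

n_candidates = len(list(product(*rf_grid.values())))
cv_folds = 5          # assumption: GridSearchCV default
n_estimators = 200    # from the RandomForestClassifier constructor above

n_fits = n_candidates * cv_folds
n_trees = n_fits * n_estimators
print(n_candidates, n_fits, n_trees)
```

By the same count, the Logistic Regression grid has 5 x 1 x 10 = 50 candidates and the LinearSVC grid has 1 x 2 x 20 = 40, and each of those fits a single linear model rather than 200 trees.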

## LinearSVC

In [10]:
model = LinearSVC(random_state = 0)
penalty = ['l2']
loss = ['hinge', 'squared_hinge']
c_values = [0.7       , 0.71578947, 0.73157895, 0.74736842, 0.76315789,
       0.77894737, 0.79473684, 0.81052632, 0.82631579, 0.84210526,
       0.85789474, 0.87368421, 0.88947368, 0.90526316, 0.92105263,
       0.93684211, 0.95263158, 0.96842105, 0.98421053, 1.        ]
#define grid search
grid = dict(penalty=penalty,loss=loss, C=c_values)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_val,y_val)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.752222 using {'C': 1.0, 'loss': 'squared_hinge', 'penalty': 'l2'}
0.742599 (0.040004) with: {'C': 0.7, 'loss': 'hinge', 'penalty': 'l2'}
0.750471 (0.029141) with: {'C': 0.7, 'loss': 'squared_hinge', 'penalty': 'l2'}
0.744350 (0.039736) with: {'C': 0.71578947, 'loss': 'hinge', 'penalty': 'l2'}
0.750471 (0.029141) with: {'C': 0.71578947, 'loss': 'squared_hinge', 'penalty': 'l2'}
0.745223 (0.036826) with: {'C': 0.73157895, 'loss': 'hinge', 'penalty': 'l2'}
0.749598 (0.029044) with: {'C': 0.73157895, 'loss': 'squared_hinge', 'penalty': 'l2'}
0.745219 (0.034377) with: {'C': 0.74736842, 'loss': 'hinge', 'penalty': 'l2'}
0.748721 (0.029348) with: {'C': 0.74736842, 'loss': 'squared_hinge', 'penalty': 'l2'}
0.745223 (0.031432) with: {'C': 0.76315789, 'loss': 'hinge', 'penalty': 'l2'}
0.748717 (0.029510) with: {'C': 0.76315789, 'loss': 'squared_hinge', 'penalty': 'l2'}
0.746966 (0.031743) with: {'C': 0.77894737, 'loss': 'hinge', 'penalty': 'l2'}
0.746966 (0.027871) with: {'C': 0.77894737

# TESTING

## Logistic Regression

In [12]:
lg = LogisticRegression(random_state=0, C=1.17777778, penalty='l2', solver = 'liblinear', max_iter=1000) 

t0 = time.time()
lg.fit(X_train,y_train)
t1 = time.time() # ending time
lg_train_time = t1-t0

t0 = time.time()
y_true, y_pred_lg = y_test, lg.predict(X_test)
t1 = time.time() # ending time
lg_pred_time = t1-t0

lg_report = classification_report(y_true, y_pred_lg, output_dict=True)
df_lg = pd.DataFrame(lg_report)

## RandomForestClassifier

In [13]:
rf = RandomForestClassifier(random_state = 0, bootstrap=True, max_depth=None, max_features='auto', min_samples_leaf=1, min_samples_split=5)

t0 = time.time()
rf.fit(X_train,y_train)
t1 = time.time() # ending time
rf_train_time = t1-t0

t0 = time.time()
y_true, y_pred_rf = y_test, rf.predict(X_test)
t1 = time.time() # ending time
rf_pred_time = t1-t0

rf_report = classification_report(y_true, y_pred_rf, output_dict=True)
df_rf = pd.DataFrame(rf_report)

## LinearSVC

In [14]:
l_svc = LinearSVC(random_state = 0, C = 0.79473684, loss = 'hinge', penalty = 'l2')

t0 = time.time()
l_svc.fit(X_train,y_train)
t1 = time.time() # ending time
l_svc_train_time = t1-t0

l_svc_score = l_svc.score(X_test,y_test)

t0 = time.time()
y_true, y_pred_lSVC = y_test, l_svc.predict(X_test)
t1 = time.time() # ending time
l_svc_pred_time = t1-t0

l_svc_report = classification_report(y_true, y_pred_lSVC, output_dict=True)
df_l_svc = pd.DataFrame(l_svc_report)

## Table Generation

In [15]:
acc_all = {'Logistic Regression': [df_lg.iloc[0]['accuracy']], 
                 'RandomForestClassifier': [df_rf.iloc[0]['accuracy']], 
                 'LinearSVC': [df_l_svc.iloc[0]['accuracy']]}
acc_table = pd.DataFrame(data=acc_all)
acc_table.index = ['accuracy']
acc_table.style.set_caption("ML Model Accuracy Results")

ML Model Accuracy Results

| | Logistic Regression | RandomForestClassifier | LinearSVC |
|---|---|---|---|
| accuracy | 0.793345 | 0.792469 | 0.801226 |


In [16]:
#class 0 info
c_0_all = {'Logistic Regression': [df_lg.iloc[0]['0'], df_lg.iloc[1]['0'], df_lg.iloc[2]['0']], 
                 'RandomForestClassifier': [df_rf.iloc[0]['0'], df_rf.iloc[1]['0'], df_rf.iloc[2]['0']], 
                 'LinearSVC': [df_l_svc.iloc[0]['0'], df_l_svc.iloc[1]['0'], df_l_svc.iloc[2]['0']]}
c_0_table = pd.DataFrame(data=c_0_all)
c_0_table.index = ['precision', 'recall', 'f1-score']
c_0_table.style.set_caption("Class 0 Results")

Class 0 Results

| | Logistic Regression | RandomForestClassifier | LinearSVC |
|---|---|---|---|
| precision | 0.774834 | 0.791549 | 0.782667 |
| recall | 0.898618 | 0.863287 | 0.90169 |
| f1-score | 0.832148 | 0.825863 | 0.837973 |


In [17]:
#class 1 info
c_1_all = {'Logistic Regression': [df_lg.iloc[0]['1'], df_lg.iloc[1]['1'], df_lg.iloc[2]['1']], 
                 'RandomForestClassifier': [df_rf.iloc[0]['1'], df_rf.iloc[1]['1'], df_rf.iloc[2]['1']], 
                 'LinearSVC': [df_l_svc.iloc[0]['1'], df_l_svc.iloc[1]['1'], df_l_svc.iloc[2]['1']]}
c_1_table = pd.DataFrame(data=c_1_all)
c_1_table.index = ['precision', 'recall', 'f1-score']
c_1_table.style.set_caption("Class 1 Results")

Class 1 Results

| | Logistic Regression | RandomForestClassifier | LinearSVC |
|---|---|---|---|
| precision | 0.829457 | 0.793981 | 0.836735 |
| recall | 0.653768 | 0.698574 | 0.668024 |
| f1-score | 0.731207 | 0.743229 | 0.742922 |


In [18]:
time_all = {'Logistic Regression': [lg_train_time, lg_pred_time], 
                 'RandomForestClassifier': [rf_train_time, rf_pred_time], 
                 'LinearSVC': [l_svc_train_time, l_svc_pred_time]}
time_table = pd.DataFrame(data=time_all)
time_table.index = ['train time', 'prediction time']
time_table.style.set_caption("ML Train and Prediction Times")

ML Train and Prediction Times

| | Logistic Regression | RandomForestClassifier | LinearSVC |
|---|---|---|---|
| train time | 0.010318 | 3.170267 | 0.030324 |
| prediction time | 0.000343 | 0.132451 | 0.000175 |


# DISCUSSION

## Summary

Of the three machine learning algorithms implemented, LinearSVC achieved the highest overall classification accuracy at 0.801226, with Logistic Regression (0.793345) and RandomForestClassifier (0.792469) close behind. For class 0, LinearSVC again performed best, with the highest recall and F1-score, although RandomForestClassifier had the highest precision. The class 1 results were somewhat different: RandomForestClassifier outperformed the other two models in recall and F1-score, while LinearSVC had the highest precision. The train and prediction times are shown directly above. RandomForestClassifier took by far the longest to train and to predict.
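
Because the F1-score is the harmonic mean of precision and recall, the reported values can be sanity-checked from the tables above. The sketch below does this for three entries; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the class 0 and class 1 tables
reported = {
    ("LinearSVC", 0): (0.782667, 0.90169, 0.837973),
    ("RandomForestClassifier", 1): (0.793981, 0.698574, 0.743229),
    ("Logistic Regression", 1): (0.829457, 0.653768, 0.731207),
}

for (model, cls), (prec, rec, f1) in reported.items():
    print(model, cls, round(f1_from(prec, rec), 4), f1)
```

This also shows why RandomForestClassifier edges out LinearSVC on class 1 F1 despite lower precision: its recall is noticeably higher, and the harmonic mean rewards the more balanced pair.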

## Questions

### What kinds of tweets/language are consistently misclassified?

I wanted to examine whether any tweets were misclassified by ALL of the models implemented. Interestingly, 167 tweets from the test set were not correctly classified by any of the models. This is a remarkably large share of the test set, at nearly 15% of the tweets. I removed the stopwords from these tweets and used a counter to find which words appeared most frequently. A few notable words were: ('fire', 6), ('refugees', 5), ('dead', 5), ('snowstorm', 5), ('trapped', 5), ('nuclear', 4), ('death', 4), ('bioterror', 4), ('burning', 4), ('mass', 4), ('electrocuted', 3), ('weapons', 3), and ('emergency', 3). In some cases, tweets are misclassified when disaster terms are used to describe other things. For example: tweet 6086 – “years afloat pension plans start sinking”.

In [19]:
#find which tweets were misclassified by each model
misclas_lg = (y_pred_lg != y_true)
misclas_rf = (y_pred_rf != y_true)
misclas_svc = (y_pred_lSVC != y_true)

#keep only the tweets that were misclassified
misclas_lg = misclas_lg[misclas_lg == True]
misclas_rf = misclas_rf[misclas_rf == True]
misclas_svc = misclas_svc[misclas_svc == True]

#convert tweet idx to a list
misclas_lg_idk = list(misclas_lg.index.values)
misclas_rf_idk = list(misclas_rf.index.values)
misclas_svc_idk = list(misclas_svc.index.values)

#find matches between the 3 models
all_misclas = set(misclas_rf_idk) & set(misclas_svc_idk) & set(misclas_lg_idk)

#select the tweets misclassified by all models and lowercase them
#(recent pandas versions reject a set as an indexer, so convert it to a list)
dF_tweets_misclas = dC_x_fin.loc[list(all_misclas)]
dF_tweets_misclas = dF_tweets_misclas.str.lower()

#remove stopwords
stop = stopwords.words('english')
what_words = dF_tweets_misclas.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

#words that were in tweets that were misclassified by all models
miss_count = Counter(" ".join(what_words).split()).most_common(100)

In [20]:
#misclassified tweets
pd.set_option('display.max_rows', None)
what_words

5633    one couple using drones save refugees world de...
4609    prediction vikings game sunday dont expect who...
2564    6 destroy reg c competitiveness entire region ...
7175                       one direction concert war zone
5128    government pressure abandon plans construct uk...
2569                  sj gist houses farm produce destroy
5642    newlyweds feed thousands syrian refugees inste...
6667    purple heart vet finds jihad threat car mall i...
5648    chpsre rt refugees followers paris visit dream...
4115                                           gotta love
5142                little filming inside nuclear reactor
535                               chiasson sens come deal
6167                    thu aug 20 32 gmt 0000 utc sirens
4121                               ready close errrr nope
2079                              said dead one else dies
544                                 avalanche city sunset
38      barbados jamaica two cars set ablaze santa cru...
3117    blow d

In [21]:
#count of words that are in the above tweets
miss_count

[('one', 10),
 ('amp', 9),
 ('people', 8),
 ('like', 8),
 ('via', 6),
 ('fire', 6),
 ('school', 6),
 ('refugees', 5),
 ('dead', 5),
 ('snowstorm', 5),
 ('man', 5),
 ('trapped', 5),
 ('buildings', 5),
 ('nuclear', 4),
 ('car', 4),
 ('death', 4),
 ('fedex', 4),
 ('bioterror', 4),
 ('hope', 4),
 ('back', 4),
 ('hollywood', 4),
 ('movie', 4),
 ('miners', 4),
 ('burning', 4),
 ('mass', 4),
 ('mudslide', 4),
 ('save', 3),
 ('world', 3),
 ('game', 3),
 ('whole', 3),
 ('think', 3),
 ('st', 3),
 ('c', 3),
 ('entire', 3),
 ('b', 3),
 ('heart', 3),
 ('love', 3),
 ('come', 3),
 ('dies', 3),
 ('set', 3),
 ('police', 3),
 ('get', 3),
 ('electrocuted', 3),
 ('weapons', 3),
 ('would', 3),
 ('emergency', 3),
 ('stops', 3),
 ('shipping', 3),
 ('potential', 3),
 ('pathogens', 3),
 ('way', 3),
 ('oil', 3),
 ('price', 3),
 ('us', 3),
 ('day', 3),
 ('storm', 3),
 ('take', 3),
 ('deluge', 3),
 ('believe', 3),
 ('sinking', 3),
 ('done', 3),
 ('around', 3),
 ('exploded', 3),
 ('face', 3),
 ('murder', 3),
 ('ho

### Does TFIDF do better than plain old bag of words?

For the questions below, I use the ML model and hyperparameters that performed best (LinearSVC) to obtain answers.

TFIDF clearly results in a higher accuracy than the plain old bag of words: 0.8012 versus 0.7653, a difference of about 3.59 percentage points (4.4809 percent relative to the TFIDF accuracy). As I mentioned at the beginning, this may be related to how CountVectorizer can introduce a bias toward the most frequent words in the dataset.
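
The frequency bias mentioned above can be illustrated with a small hand-rolled TF-IDF on a made-up corpus. This is a minimal sketch, not the library's implementation: it uses the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is my understanding of the TfidfVectorizer defaults. The toy documents are hypothetical, and stop-word removal is skipped so that a very common word stays visible.

```python
import math
from collections import Counter

# Hypothetical toy corpus (already tokenized)
toy_docs = [
    ["fire", "in", "the", "forest"],
    ["fire", "at", "the", "school"],
    ["earthquake", "hit", "the", "city"],
]

n_docs = len(toy_docs)
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    counts = Counter(doc)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 normalization
    return {t: w / norm for t, w in raw.items()}

counts = Counter(toy_docs[0])
weights = tfidf(toy_docs[0])
for term in ["the", "fire", "forest"]:
    print(term, counts[term], round(weights[term], 3))
```

In the first document every word has a raw count of 1, so a bag-of-words model treats them equally. Under TF-IDF, "the" (in all three documents) gets the lowest weight, "fire" (in two) a middle weight, and "forest" (in one) the highest.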

In [22]:
# process raw text into ML compatible features
vectorizer = TfidfVectorizer(min_df=3, 
             stop_words='english',ngram_range=(1, 2), lowercase=True)  
vectorizer.fit(dC_x_fin)
#print(vectorizer.vocabulary_)
X = vectorizer.transform(dC_x_fin)
#vectorizer.get_stop_words()

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, dC_y, 
                                   test_size=0.15, shuffle=True, stratify=dC_y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, 
                                 test_size=0.15/0.85, shuffle=True, stratify=y_train, random_state=42)

In [24]:
l_svc = LinearSVC(random_state = 0, C = 0.79473684, loss = 'hinge', penalty = 'l2')

l_svc.fit(X_train,y_train)

l_svc_score = l_svc.score(X_test,y_test)

print(f'With TFIDF, LinearSVC gives an accuracy of: {l_svc_score}')

With TFIDF, LinearSVC gives an accuracy of: 0.8012259194395797


In [25]:
countVectorizer = CountVectorizer(min_df=3, 
             stop_words='english',ngram_range=(1, 2), lowercase=True) 
XX = countVectorizer.fit_transform(dC_x_fin)
#countVectorizer.get_stop_words()

In [26]:
X_train, X_test, y_train, y_test = train_test_split(XX, dC_y, 
                                   test_size=0.15, shuffle=True, stratify=dC_y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, 
                                 test_size=0.15/0.85, shuffle=True, stratify=y_train, random_state=42)

In [27]:
l_svc = LinearSVC(random_state = 0, C = 0.79473684, loss = 'hinge', penalty = 'l2') 

l_svc.fit(X_train,y_train)

l_svc_score = l_svc.score(X_test,y_test)

print(f'With CountVectorizer, LinearSVC gives an accuracy of: {l_svc_score}')

With CountVectorizer, LinearSVC gives an accuracy of: 0.7653239929947461
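
The gap between the two accuracies printed above can be stated in absolute or relative terms, and the numbers differ depending on the baseline. The short sketch below computes all three variants from the printed values.

```python
acc_tfidf = 0.8012259194395797   # TF-IDF features, from the output above
acc_count = 0.7653239929947461   # raw counts, from the output above

abs_diff_points = (acc_tfidf - acc_count) * 100
rel_to_tfidf = (acc_tfidf - acc_count) / acc_tfidf * 100
rel_to_count = (acc_tfidf - acc_count) / acc_count * 100

print(f"{abs_diff_points:.2f} percentage points")
print(f"{rel_to_tfidf:.4f}% relative to the TF-IDF accuracy")
print(f"{rel_to_count:.4f}% relative to the bag-of-words accuracy")
```

The 4.4809 figure quoted in the discussion is the difference relative to the TF-IDF accuracy; the absolute gap is about 3.59 percentage points.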


### What's better, lowercasing or not lowercasing text?

To my surprise, in the experiment below, not lowercasing the text resulted in a slightly higher classification accuracy. The difference is quite small, though: 0.8030 without lowercasing versus 0.8012 with it, which is about 0.18 percentage points (0.2181 percent in relative terms), or two tweets out of the 1,142 in the test set. It seems that my inference from the initial tests, that the text should be lowercased, may not hold. Because the difference is so small, I am not sure what causes it or whether it is meaningful.

In [28]:
# process raw text into ML compatible features
vectorizer = TfidfVectorizer(min_df=3, 
             stop_words='english',ngram_range=(1, 2), lowercase=True)  
vectorizer.fit(dC_x_fin)
#print(vectorizer.vocabulary_)
X = vectorizer.transform(dC_x_fin)
#vectorizer.get_stop_words()

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, dC_y, 
                                   test_size=0.15, shuffle=True, stratify=dC_y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, 
                                 test_size=0.15/0.85, shuffle=True, stratify=y_train, random_state=42)

In [30]:
l_svc = LinearSVC(random_state = 0, C = 0.79473684, loss = 'hinge', penalty = 'l2') 

l_svc.fit(X_train,y_train)

l_svc_score = l_svc.score(X_test,y_test)

print(f'Lowercase=True, LinearSVC gives an accuracy of: {l_svc_score}')

Lowercase=True, LinearSVC gives an accuracy of: 0.8012259194395797


In [31]:
# process raw text into ML compatible features
vectorizer = TfidfVectorizer(min_df=3, 
             stop_words='english',ngram_range=(1, 2), lowercase=False)  
vectorizer.fit(dC_x_fin)
#print(vectorizer.vocabulary_)
X = vectorizer.transform(dC_x_fin)
#vectorizer.get_stop_words()

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, dC_y, 
                                   test_size=0.15, shuffle=True, stratify=dC_y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, 
                                 test_size=0.15/0.85, shuffle=True, stratify=y_train, random_state=42)

In [33]:
l_svc = LinearSVC(random_state = 0, C = 0.79473684, loss = 'hinge', penalty = 'l2')  

l_svc.fit(X_train,y_train)

l_svc_score = l_svc.score(X_test,y_test)

print(f'Lowercase=False, LinearSVC gives an accuracy of: {l_svc_score}')

Lowercase=False, LinearSVC gives an accuracy of: 0.8029772329246935
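
One way to judge whether the lowercase difference matters is to translate it into a number of test tweets and compare it with the sampling noise of an accuracy estimate. This is a rough sketch under a simple binomial assumption; it ignores the fact that both runs are scored on the same test set, so treat it as an order-of-magnitude check only.

```python
import math

n_test = 1142                     # test-set size from the split above
acc_lower = 0.8012259194395797    # lowercase=True
acc_mixed = 0.8029772329246935    # lowercase=False

# How many test tweets separate the two runs?
extra_correct = round((acc_mixed - acc_lower) * n_test)

# Rough binomial standard error of a single accuracy estimate
std_err = math.sqrt(acc_lower * (1 - acc_lower) / n_test)

print(extra_correct)
print(f"{std_err * 100:.2f} percentage points")
```

The two runs differ by only two tweets, while the standard error of either accuracy is more than one percentage point, so this experiment alone does not show that either setting is better.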
