# Εργασία 3 (Τεχνικές Εξόρυξης Δεδομένων)
## Data Mining: Assignment 3
***
### Μαρία Φριτζελά 1115201400218
***

In [1]:
import pandas as pd
import numpy as np
from unicodedata import normalize
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import cross_val_score, GridSearchCV
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn import  svm, metrics
from sklearn.ensemble import RandomForestClassifier
from scipy.sparse import hstack

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
nltk.download('stopwords')

## Collection and cleaning of data (Pre-processing text)
Date information is not needed so it is not added to our dataframes

In [2]:
traindf = pd.read_csv("data/train.csv", usecols=['Insult', 'Comment'])
testdf = pd.read_csv("data/impermium_verification_labels.csv", index_col='id', usecols=['id', 'Insult', 'Comment'])

Looking at traindf:

In [3]:
traindf

Unnamed: 0,Insult,Comment
0,1,"""You fuck your dad."""
1,0,"""i really don't understand your point.\xa0 It ..."
2,0,"""A\\xc2\\xa0majority of Canadians can and has ..."
3,0,"""listen if you dont wanna get married to a man..."
4,0,"""C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd..."
...,...,...
3942,1,"""you are both morons and that is never happening"""
3943,0,"""Many toolbars include spell check, like Yahoo..."
3944,0,"""@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F..."
3945,0,"""How about Felix? He is sure turning into one ..."


Splitting the train and test data to X and y

In [4]:
X_train, y_train = traindf.Comment, traindf.Insult

In [5]:
X_test, y_test = testdf.Comment, testdf.Insult

### Clean up train comments' text:
- convert all letters to lowercase
- remove multiple instances of `\`<br>
For example "\\\n" becomes "\n"
- remove "\n" and "\xa0" (non-breaking space latin)
- remove usernames
- remove URLs
- remove special unicode characters (like \xe1, \xe2...)<br>
- remove puctuation
- remove all words containing digits, and any digits
- remove multiple spaces


In [6]:
def clean_comments(comments):
    return comments.apply(lambda comment: comment.lower())\
                .apply(lambda comment: re.sub("\\\\{2,}", " \\\\" ,comment))\
                .apply(lambda comment: re.sub("\\\\+n|\\\\+xa0", " ", comment))\
                .apply(lambda comment: re.sub('@\S+',' ',comment))\
                .apply(lambda comment: re.sub('(http(s)?:\/\/|www\.)(\S|[a-z]|[A-Z]| [0-9])+', " ", comment))\
                .apply(lambda comment: re.sub('\\\\+\S+'," ", comment))\
                .apply(lambda comment: re.sub('[^A-Za-z0-9 ]+', ' ',comment))\
                .apply(lambda comment: re.sub(r'\w*\d\w*', '', comment))\
                .apply(lambda comment: re.sub(r"\s+"," ", comment, flags = re.I))

In [7]:
X_train = clean_comments(traindf.Comment)

In [8]:
X_test = clean_comments(testdf.Comment)

For example this comment:

In [9]:
traindf.Comment[124]

'"Nope. Not working for me either.32-23-34www.facebook.com/annagillmodel\\\\n\\\\n \\\\n\\\\nYou have my email! :) "'

Has been transformed into this:

In [10]:
X_train[124]

' nope not working for me either you have my email '

Our cleaned up data looks like this:

In [11]:
X_train

0                                      you fuck your dad 
1        i really don t understand your point it seems...
2        a majority of canadians can and has been wron...
3        listen if you dont wanna get married to a man...
4        c b xu bi t c ho kh c ng d ng cu chi nh c ho ...
                              ...                        
3942     you are both morons and that is never happening 
3943     many toolbars include spell check like yahoo ...
3944     sioux falls s d i told my boy he should call ...
3945     how about felix he is sure turning into one h...
3946     you re all upset defending this hipster band ...
Name: Comment, Length: 3947, dtype: object

##  Naive Bayes

Transform the comments into word count vectors using CountVectorizer from sklearn

In [100]:
#Create bag-of-words vector
bow_vectorizer = CountVectorizer()

X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

Looking at the vector of the first comment in the train data:

In [101]:
pd.DataFrame(X_train_bow[0:1].T.todense(), index=bow_vectorizer.get_feature_names(), columns=["counts"])\
.sort_values(by=["counts"],ascending=False)

Unnamed: 0,counts
you,1
dad,1
fuck,1
your,1
aamir,0
...,...
firms,0
firmly,0
firing,0
fired,0


Trying the Naive Bayes

In [102]:
# Instantiate the model
nb = GaussianNB()

# Train the model on the BoW training set
nb.fit(X_train_bow.toarray(), y_train)
# predict the BoW test set
y_pred_nb_bow = nb.predict(X_test_bow.toarray())

In [103]:
print("10-fold Cross Validation Precision NB for BoW:",
      np.mean(cross_val_score(nb, X_train_bow.toarray(), y_train, cv=10, scoring='precision_macro')))
print("10-fold Cross Validation Recall NB for BoW:",
      np.mean(cross_val_score(nb, X_train_bow.toarray(), y_train, cv=10, scoring='recall_macro')))
print("10-fold Cross Validation F-Measure NB for BoW:",
     np.mean(cross_val_score(nb, X_train_bow.toarray(), y_train, cv=10, scoring='f1_macro')))
print("10-fold Cross Validation Accuracy NB for BoW:",
      np.mean(cross_val_score(nb, X_train_bow.toarray(), y_train, cv=10, scoring='accuracy')))

10-fold Cross Validation Precision NB for BoW: 0.5924473946218446
10-fold Cross Validation Recall NB for BoW: 0.6073470556111351
10-fold Cross Validation F-Measure NB for BoW: 0.593717216685966
10-fold Cross Validation Accuracy NB for BoW: 0.6534100109233438


Test Scores:

In [104]:
print("Precision NB for BoW:",metrics.precision_score(y_test, y_pred_nb_bow, average=None))
print("Recall NB for BoW:",metrics.recall_score(y_test, y_pred_nb_bow, average=None))
print("F-Measure NB for BoW:", metrics.f1_score(y_test, y_pred_nb_bow, average=None))
print()
print("Accuracy NB for BoW:",metrics.accuracy_score(y_test,y_pred_nb_bow))

Precision NB for BoW: [0.54675468 0.5015083 ]
Recall NB for BoW: [0.42918826 0.6174559 ]
F-Measure NB for BoW: [0.48089018 0.55347482]

Accuracy NB for BoW: 0.519910514541387


These scores are not very good... Let's improve them!

## Improving the scores of Naive Bayes

### 1) Lemmatization

Use lemmatization of words to improve the scores from the previous question, using the WordNetLemmatizer from nltk

In [13]:
lemmatizer = WordNetLemmatizer()

X_train_lem = X_train.apply(lambda item: ' '.join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(item)]))
X_test_lem = X_test.apply(lambda item: ' '.join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(item)]))

In [78]:
#Create bag-of-words vector
bow_vectorizer = CountVectorizer()

X_train_bow = bow_vectorizer.fit_transform(X_train_lem)
X_test_bow = bow_vectorizer.transform(X_test_lem)

In [79]:
# Train the model on the lemmatized BoW training set
nb.fit(X_train_bow.toarray(), y_train)
# predict the BoW test set
y_pred_nb_bow = nb.predict(X_test_bow.toarray())

In [80]:
print("F-Measure NB for Lemmatized BoW data:", metrics.f1_score(y_test, y_pred_nb_bow, average=None))
print()
print("Accuracy NB for Lemmatized BoW data:",metrics.accuracy_score(y_test,y_pred_nb_bow))

F-Measure NB for Lemmatized BoW data: [0.51469923 0.52501107]

Accuracy NB for Lemmatized BoW data: 0.519910514541387


### 2) Stop word filtering

Try a bag-of-words vector, removing stopwords

In [81]:
#Create bag-of-words vector
bow_vectorizer = CountVectorizer(stop_words='english')

X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

In [82]:
# Train the model on the stopword free BoW training set
nb.fit(X_train_bow.toarray(), y_train)
# predict the BoW test set
y_pred_nb_bow = nb.predict(X_test_bow.toarray())

In [83]:
print("F-Measure NB for stopword free BoW data:", metrics.f1_score(y_test, y_pred_nb_bow, average=None))
print()
print("Accuracy NB for stopword free BoW data:",metrics.accuracy_score(y_test,y_pred_nb_bow))

F-Measure NB for stopword free BoW data: [0.52037618 0.5212338 ]

Accuracy NB for stopword free BoW data: 0.5208053691275167


### 3) Use of bigrams

Try a bag-of-words vector, including bigrams

In [84]:
#Create bag-of-words vector for only bigrams
bow_vectorizer = CountVectorizer(ngram_range=(2,2))

X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

In [85]:
# Train the model on the bigrams BoW training set
nb.fit(X_train_bow.toarray(), y_train)
# predict the BoW test set
y_pred_nb_bow = nb.predict(X_test_bow.toarray())

In [86]:
print("F-Measure NB for bigrams BoW data:", metrics.f1_score(y_test, y_pred_nb_bow, average=None))
print()
print("Accuracy NB for bigrams BoW data:",metrics.accuracy_score(y_test,y_pred_nb_bow))

F-Measure NB for bigrams BoW data: [0.58853077 0.52763095]

Accuracy NB for bigrams BoW data: 0.5601789709172259


### 4) Laplace Smoothing

In [87]:
#Create bag-of-words vector
bow_vectorizer = CountVectorizer()

X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

Setting a=1 is called Laplace smoothing

_(https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes)_

In [88]:
# Instantiate the model
mnb = MultinomialNB(alpha=1.0)

# Train the model on the BoW training set
mnb.fit(X_train_bow.toarray(), y_train)
# predict the BoW test set
y_pred_mnb_bow = mnb.predict(X_test_bow.toarray())

In [89]:
print("F-Measure NB for BoW:", metrics.f1_score(y_test, y_pred_mnb_bow, average=None))
print()
print("Accuracy NB for BoW:",metrics.accuracy_score(y_test,y_pred_mnb_bow))

F-Measure NB for BoW: [0.71803018 0.63627049]

Accuracy NB for BoW: 0.6823266219239373


### Putting it all together:

After trying different combinations of configurations, this seems like the optimal setting for the highest accuracy score:

In [148]:
#Create bag-of-words vector
bow_vectorizer = CountVectorizer(min_df=2, ngram_range=(1,2), stop_words='english')

X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

In [147]:
# Train the model on the BoW training set
mnb.fit(X_train_bow.toarray(), y_train)
# predict the BoW test set
y_pred_mnb_bow = mnb.predict(X_test_bow.toarray())

In [46]:
print("10-fold Cross Validation Precision NB for BoW:",
      np.mean(cross_val_score(mnb, X_train_bow.toarray(), y_train, cv=10, scoring='precision_macro')))
print("10-fold Cross Validation Recall NB for BoW:",
      np.mean(cross_val_score(mnb, X_train_bow.toarray(), y_train, cv=10, scoring='recall_macro')))
print("10-fold Cross Validation F-Measure NB for BoW:",
     np.mean(cross_val_score(mnb, X_train_bow.toarray(), y_train, cv=10, scoring='f1_macro')))
print("10-fold Cross Validation Accuracy NB for BoW:",
      np.mean(cross_val_score(mnb, X_train_bow.toarray(), y_train, cv=10, scoring='accuracy')))

10-fold Cross Validation Precision NB for BoW: 0.7389453859092777
10-fold Cross Validation Recall NB for BoW: 0.792935771671752
10-fold Cross Validation F-Measure NB for BoW: 0.7479243192090579
10-fold Cross Validation Accuracy NB for BoW: 0.7762892758465592


In [145]:
print("Precision NB for BoW:",metrics.precision_score(y_test, y_pred_mnb_bow, average=None))
print("Recall NB for BoW:",metrics.recall_score(y_test, y_pred_mnb_bow, average=None))
print("F-Measure NB for BoW:", metrics.f1_score(y_test, y_pred_mnb_bow, average=None))
print()
print("Accuracy NB for BoW:",metrics.accuracy_score(y_test,y_pred_mnb_bow))

Precision NB for BoW: [0.66136835 0.74492386]
Recall NB for BoW: [0.82642487 0.5450325 ]
F-Measure NB for BoW: [0.73474088 0.62949062]

Accuracy NB for BoW: 0.69082774049217


Nice! **~20%** improvement in scores

Below, a table of combinations of settings tested along with their accuracy scores

| NB accuracy Score  | # of features | Lemmatization | Stopwords removal | n-gram_range | Laplace Smoothing |
|:------------------:|:-------------:|:-------------:|:-----------------:|:------------:|:-----------------:|
|        0.519       |  14220 (all)  |       N       |         N         |     (1,1)    |         N         |
|        0.510       |      4000     |       N       |         N         |     (1,1)    |         N         |
|       0.5199       |    min_df=2   |       N       |         N         |     (1,1)    |         N         |
|       0.5176       |    min_df=2   |       Y       |         N         |     (1,1)    |         N         |
|       0.5208       |    min_df=2   |       N       |         Y         |     (1,1)    |         N         |
|        0.519       |    min_df=2   |       Y       |         Y         |     (1,1)    |         N         |
|       0.5297       |    min_df=2   |       Y       |         Y         |     (1,2)    |         N         |
|       0.5364       |    min_df=2   |       N       |         Y         |     (1,2)    |         N         |
|        0.489       |    min_df=2   |       N       |         Y         |     (2,2)    |         N         |
|        0.508       |      all      |       N       |         Y         |     (2,2)    |         N         |
|     **0.6908**     |    min_df=2   |       N       |         Y         |     (1,2)    |         Y         |
|        0.688       |    min_df=2   |       Y       |         Y         |     (1,2)    |         Y         |
|        0.63        |      all      |       N       |         Y         |     (1,2)    |         Y         |
|        0.652       |    min_df=2   |       N       |         N         |     (1,2)    |         Y         |

**Notes:**<br>
- After trying different number of features (all, max=400, min_df=2) it appears that the best accuracy score is achieved by ignoring terms that have a document frequency strictly lower than 2. This leaves us with 5873 features.

- The use of lemmatization brings the accuracy score down. This can be explained because when analysing insults, the result might differ depending on the form of the word and therefore the input should not be stemmed or lemmatized. _(Example: "this situation sucks, man" (no insult) vs "you suck" (insult))_

- Stopwords removal causes an increase in the score, as well as using an n-gram range of (1,2) (unigrams and bigrams)

- Finally, applying Laplace smoothing results in great increase of accuracy score

## Creation of a custom feature vector: TF/IDF Vector & Part-of-Speech

### Part-of-Speech frequency features

Use nltk's pos_tag method for each word of every comment in our data. Set the tagset attribute to 'universal' in the pos_tag method.

In [14]:
X_train_tagged = X_train.apply(lambda item: nltk.pos_tag(nltk.word_tokenize(item), tagset='universal'))
X_test_tagged = X_test.apply(lambda item: nltk.pos_tag(nltk.word_tokenize(item), tagset='universal'))

Let's see what the result looks like for comment 1

In [254]:
X_train_tagged[1]

[('i', 'NOUN'),
 ('really', 'ADV'),
 ('don', 'ADJ'),
 ('t', 'NOUN'),
 ('understand', 'VERB'),
 ('your', 'PRON'),
 ('point', 'NOUN'),
 ('it', 'PRON'),
 ('seems', 'VERB'),
 ('that', 'ADP'),
 ('you', 'PRON'),
 ('are', 'VERB'),
 ('mixing', 'VERB'),
 ('apples', 'NOUN'),
 ('and', 'CONJ'),
 ('oranges', 'NOUN')]

Frequency distribution (`nltk.FreqDist`) can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.<br>
It will be used to record the frequency of each word type in each comment

In [255]:
nltk.FreqDist(tag for word, tag in X_train_tagged[1])    

FreqDist({'NOUN': 5, 'VERB': 4, 'PRON': 3, 'ADV': 1, 'ADJ': 1, 'ADP': 1, 'CONJ': 1})

**Function fractPOS:** Creates a list of dictionaries iterating through every comment passed in X_tagged. Each dictionary holds the fraction (=frequency_of_tag/number_of_words_in_comment) of each tag for that comment:

In [13]:
def fractPOS(X_tagged):
    fractions = []
    for tagged_comment in X_tagged:
        n_of_words = len(tagged_comment)
        freq = nltk.FreqDist(tag for word, tag in tagged_comment)
        # freq[tag_type] if a tag type doesn't exist, zero is returned
        try:
            d = {
                'ADV': freq['ADV']/n_of_words,
                'VERB': freq['VERB']/n_of_words,
                'ADJ': freq['ADJ']/n_of_words,
                'NOUN': freq['NOUN']/n_of_words
            }
        except ZeroDivisionError: #n_of_words ==0
            d = {'ADV': 0, 'VERB': 0, 'ADJ': 0,'NOUN': 0}
        fractions.append(d)
    return fractions

Dataframe is created using the list of dictionaries.<br>
_We chose not to fill the dataframe row by row, because iteratively appending rows to a DataFrame can be computationally intensive_ 

In [16]:
tags = ['ADV', 'VERB', 'ADJ', 'NOUN']
X_train_freqdf = pd.DataFrame(fractPOS(X_train_tagged), columns=tags)
X_test_freqdf = pd.DataFrame(fractPOS(X_test_tagged), columns=tags)

Looking at our custom feature vector (dataframe) for the train data

In [17]:
X_train_freqdf

Unnamed: 0,ADV,VERB,ADJ,NOUN
0,0.000000,0.250000,0.000000,0.250000
1,0.062500,0.250000,0.062500,0.312500
2,0.086957,0.202899,0.057971,0.202899
3,0.033898,0.305085,0.084746,0.118644
4,0.000000,0.047619,0.063492,0.809524
...,...,...,...,...
3942,0.111111,0.333333,0.000000,0.111111
3943,0.076923,0.346154,0.076923,0.153846
3944,0.000000,0.307692,0.076923,0.269231
3945,0.111111,0.222222,0.027778,0.277778


### TF/IDF

Create a TF/IDF vector

In [259]:
tfidf_vectorizer= TfidfVectorizer()

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

In [260]:
pd.DataFrame(X_train_tfidf[0:1].T.todense(), index=tfidf_vectorizer.get_feature_names(), columns=["tfidf"])\
.sort_values(by=["tfidf"],ascending=False)

Unnamed: 0,tfidf
dad,0.794537
fuck,0.481069
your,0.308020
you,0.205932
aaaah,0.000000
...,...
flying,0.000000
flynn,0.000000
focking,0.000000
focus,0.000000


### Combining the custom part-of-speech features with the TF/IDF vector

TFIDF vector is a matrix where the rows are comments and the columns are features.<br>
To combine all features, the custom features will be added as columns to the end of the TF/IDF matrix.<br>

In [261]:
type(X_train_tfidf)

scipy.sparse.csr.csr_matrix

TF/IDF matrix is a sparse matrix (from Scipy).<br>
To save memory, do not convert to dense, rather use `scipy.sparse.hstack` to stack the matrices horizontally (column wise)

In [265]:
X_train_combined = hstack([X_train_tfidf, X_train_freqdf])
X_test_combined = hstack([X_test_tfidf, X_test_freqdf])

The combined matrix consists of 3947 rows (same as the number of comments in the train dataset) and 14220 features from TFIDF + 4 custom part-of-speech features

In [266]:
X_train_combined.shape

(3947, 14224)

Source:
_(https://stackoverflow.com/questions/48573174/how-to-combine-tfidf-features-with-other-features)_

## Support Vector Machines (SVM)

According to the assigment details, scoring should be calculated calculated using: classification accuracy and F1 score.

Find optimal parameters for the SVM model as shown in this example:<br>
https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html

In [23]:
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['f1_macro', 'accuracy']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)

    clf = GridSearchCV(
        svm.SVC(), tuned_parameters, scoring=score
    )
    clf.fit(X_train_combined, y_train)

    print("Best parameters set found on development set:")
    print(clf.best_params_)
    print()

# Tuning hyper-parameters for f1_macro
Best parameters set found on development set:
{'C': 1, 'kernel': 'linear'}

# Tuning hyper-parameters for accuracy
Best parameters set found on development set:
{'C': 1, 'kernel': 'linear'}



In [267]:
#instantiate the model
svm_clf = svm.SVC(C=1, kernel='linear')

# train the model on the custom training set
svm_clf.fit(X_train_combined, y_train)
# predict the custom test set
y_pred_svm = svm_clf.predict(X_test_combined)

In [113]:
print("10-fold Cross Validation F-Measure SVM for custom feature vector:",
     np.mean(cross_val_score(svm_clf, X_train_combined, y_train, cv=10, scoring='f1_macro')))
print("10-fold Cross Validation Accuracy SVM for custom feature vector:",
      np.mean(cross_val_score(svm_clf, X_train_combined, y_train, cv=10, scoring='accuracy')))

10-fold Cross Validation F-Measure SVM for custom feature vector: 0.7716475529688832
10-fold Cross Validation Accuracy SVM for custom feature vector: 0.8386172331812635


Test Scores:

In [268]:
print("F-Measure SVM for custom feature vector:", metrics.f1_score(y_test, y_pred_svm, average=None))
print("Accuracy SVM for custom feature vector:",metrics.accuracy_score(y_test, y_pred_svm))

F-Measure SVM for custom feature vector: [0.75598086 0.62179122]
Accuracy SVM for custom feature vector: 0.7033557046979866


## Random Forest

In [269]:
# Instantiate the model
rf = RandomForestClassifier()

# Train the model on the custom training set
rf.fit(X_train_combined, y_train)
# predict the custom test set
y_pred_rf = rf.predict(X_test_combined)

In [270]:
print("10-fold Cross Validation F-Measure RF for custom feature vector:",
     np.mean(cross_val_score(rf, X_train_combined, y_train, cv=10, scoring='f1_macro')))
print("10-fold Cross Validation Accuracy RF for custom feature vector:",
      np.mean(cross_val_score(rf, X_train_combined, y_train, cv=10, scoring='accuracy')))

10-fold Cross Validation F-Measure RF for custom feature vector: 0.6588474815828835
10-fold Cross Validation Accuracy RF for custom feature vector: 0.8016205101844116


Test Scores:

In [271]:
print("F-Measure RF for custom feature vector:", metrics.f1_score(y_test, y_pred_rf, average=None))
print("Accuracy RF for custom feature vector:",metrics.accuracy_score(y_test, y_pred_rf))

F-Measure RF for custom feature vector: [0.73501474 0.42907551]
Accuracy RF for custom feature vector: 0.6380313199105145


##  Naive Bayes

In [272]:
# Train the model on the custom training set
mnb.fit(X_train_combined.toarray(), y_train)
# predict the custom test set
y_pred_mnb_bow = mnb.predict(X_test_combined.toarray())

Test Scores:

In [273]:
print("F-Measure NB for custom feature vector:", metrics.f1_score(y_test, y_pred_mnb_bow, average=None))
print()
print("Accuracy NB for custom feature vector:",metrics.accuracy_score(y_test,y_pred_mnb_bow))

F-Measure NB for custom feature vector: [0.69154972 0.07850134]

Accuracy NB for custom feature vector: 0.537807606263982


## Beat the benchmark 

Attempting to improve the scores on the test set

### Function for creating the combined features vector

To test the classifiers for different data preprocessing options we create a function which returns the combined test and train vector. (TF/IDF + Part of Speech features)

- **lemm**: (True/False) whether or not we want to lemmatize the text data before creating the TFIDF vector
- **tfidfSW**: ('english'/None) what type of stopwords to use for TFIDF vector
- **ngram_r**: the ngram_range of the TFIDF vector
- **posSW**: (True/False) whether to keep stopwords before counting the PoS 
- **svd**: (True/False) whether or not to Perform dimensionality reduction (LSA via TruncatedSVD) on the sparse PoS data to make it dense and combine the features into a single dense matrix. 
_(because the features from the TF/IDF matrix are be sparse, meaning it will contain a lot of 0s. Whilst the PoS feature vector is dense and continuous.)_

In [18]:
# function which creates the combined feature vector with different parameters
# lemm = (True/False) whether or not we want to lemmatize the data before creating the TFIDF vector
# tfidfSW = ('english'/None) what type of stopwords to use for TFIDF vector
# ngram_r = the ngram_range of the TFIDF vector
# posSW = (True/False) whether to keep stopwords when counting the PoS 
# svd = whether or not to Perform dimensionality reduction (LSA via TruncatedSVD) on the sparse PoS data 
# to make it dense and combine the features into a single dense matrix

def create_combined(lemm, tfidfSW, ngram_r, posSW, svd,  X_train, X_test):
    #print("Preparing features...")
    #print("PoS stopwords: "+str(posSW)+" TFIDF Lemmatization: "+str(lemm))
    #print("TFIDF: ngram_range="+str(ngram_r)+" stopwords="+str(tfidfSW))
    #print("SVD="+str(svd))
    if posSW: #if we want stopwords in our PoS features
        X_train_tagged = X_train.apply(lambda item: nltk.pos_tag(nltk.word_tokenize(item), tagset='universal'))
        X_test_tagged = X_test.apply(lambda item: nltk.pos_tag(nltk.word_tokenize(item), tagset='universal'))
    else:
        stop_words = set(stopwords.words('english')) 
        X_train_noSW = X_train.apply(lambda item: ' '.join(list(filter(lambda word: word not in stop_words, item.split()))))
        X_test_noSW = X_test.apply(lambda item: ' '.join(list(filter(lambda word: word not in stop_words, item.split()))))
        X_train_tagged = X_train_noSW.apply(lambda item: nltk.pos_tag(nltk.word_tokenize(item), tagset='universal'))
        X_test_tagged = X_test_noSW.apply(lambda item: nltk.pos_tag(nltk.word_tokenize(item), tagset='universal'))
    tags = ['ADV', 'VERB', 'ADJ', 'NOUN']
    X_train_freqdf = pd.DataFrame(fractPOS(X_train_tagged), columns=tags)
    X_test_freqdf = pd.DataFrame(fractPOS(X_test_tagged), columns=tags)
    
    tfidf_vectorizer= TfidfVectorizer(ngram_range=ngram_r, stop_words=tfidfSW)
    if lemm: #if we want lemmatization in our TFIDF features
        lemmatizer = WordNetLemmatizer()
        X_train_lem = X_train.apply(lambda item: ' '.join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(item)]))
        X_test_lem = X_test.apply(lambda item: ' '.join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(item)]))
        
        X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_lem)
        X_test_tfidf = tfidf_vectorizer.transform(X_test_lem)
    else:
        X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
        X_test_tfidf = tfidf_vectorizer.transform(X_test)
        
    if svd:
        svd = TruncatedSVD(n_components=4)

        X_train_tfidf_svd = svd.fit_transform(X_train_tfidf)
        X_test_tfidf_svd = svd.fit_transform(X_test_tfidf)
        X_train_combined_ns = np.hstack([X_train_tfidf_svd, X_train_freqdf])
        X_test_combined_ns = np.hstack([X_test_tfidf_svd, X_test_freqdf])
        return X_train_combined_ns, X_test_combined_ns
    
    else:
        X_train_combined = hstack([X_train_tfidf, X_train_freqdf])
        X_test_combined = hstack([X_test_tfidf, X_test_freqdf])
        return X_train_combined, X_test_combined

### SVM rbf

In [None]:
svm_rbf = []
for lemm in [True, False]:
    for tfidfSW in [None, 'english']:
        for posSW in [True, False]:
            for svd in [True, False]:
                #print()
                X_train_combined, X_test_combined = create_combined(lemm, tfidfSW, (1,2), posSW, svd,  X_train, X_test)
                #instantiate the model
                svm_clf = svm.SVC(C=1000, gamma=0.001, kernel='rbf')

                # train the model on the custom training set
                svm_clf.fit(X_train_combined, y_train)
                # predict the custom test set
                y_pred_svm = svm_clf.predict(X_test_combined)
                
                # scores
                f1score = metrics.f1_score(y_test, y_pred_svm, average=None)
                acc = metrics.accuracy_score(y_test, y_pred_svm)
                #print("F-Measure SVM for custom feature vector:", f1score)
                #print("Accuracy SVM for custom feature vector:", acc)
                d = {"PoS_SW": posSW, "TFIDF Lemmatization":lemm,
                    "TFIDF: ngram_range": (1,2), "TFIDF_SW": tfidfSW, "SVD": svd,
                    "F1 score": f1score, "Accuracy":acc}
                svm_rbf.append(d)
svm_rbf_scores = pd.DataFrame(svm_rbf)

In [72]:
svm_rbf_scores

Unnamed: 0,PoS_SW,TFIDF Lemmatization,TFIDF: ngram_range,TFIDF_SW,SVD,F1 score,Accuracy
0,True,True,"(1, 2)",,True,"[0.6777117849327915, 0.18882769472856017]",0.538702
1,True,True,"(1, 2)",,False,"[0.7477911646586345, 0.6828282828282829]",0.719016
2,False,True,"(1, 2)",,True,"[0.6795752654590881, 0.19085173501577288]",0.54094
3,False,True,"(1, 2)",,False,"[0.7485988791032825, 0.6815415821501014]",0.719016
4,True,True,"(1, 2)",english,True,"[0.6764982742390964, 0.19641465315666407]",0.538702
5,True,True,"(1, 2)",english,False,"[0.7548872180451128, 0.6397790055248619]",0.708277
6,False,True,"(1, 2)",english,True,"[0.6767106089139988, 0.19781931464174457]",0.53915
7,False,True,"(1, 2)",english,False,"[0.7548022598870057, 0.6413223140495867]",0.708725
8,True,False,"(1, 2)",,True,"[0.6781142678738683, 0.186266771902131]",0.538702
9,True,False,"(1, 2)",,False,"[0.744484556758925, 0.6777946383409205]",0.714989


Comments:
- Stop words seem critical to understanding the actual meaning being considered, considering  SVM looks at the interactions between the features to a certain degree, when using a non-linear kernel (Gaussian, rbf, poly etc.)

Highest Accuracy score for SVM rbf:

In [94]:
svm_rbf_scores.loc[svm_rbf_scores['Accuracy'].idxmax()]

PoS_SW                                                     True
TFIDF Lemmatization                                        True
TFIDF: ngram_range                                       (1, 2)
TFIDF_SW                                                   None
SVD                                                       False
F1 score               [0.7477911646586345, 0.6828282828282829]
Accuracy                                               0.719016
Name: 1, dtype: object

### SVM linear

In [75]:
svm_lin = []
for lemm in [True, False]:
    for tfidfSW in [None, 'english']:
        for posSW in [True, False]:
            for svd in [True, False]:
                #print()
                X_train_combined, X_test_combined = create_combined(lemm, tfidfSW, (1,2), posSW, svd,  X_train, X_test)
                #instantiate the model
                svm_clf = svm.SVC(C=1, kernel='linear')

                # train the model on the custom training set
                svm_clf.fit(X_train_combined, y_train)
                # predict the custom test set
                y_pred_svm = svm_clf.predict(X_test_combined)
                
                # scores
                f1score = metrics.f1_score(y_test, y_pred_svm, average=None)
                acc = metrics.accuracy_score(y_test, y_pred_svm)
                #print("F-Measure SVM for custom feature vector:", f1score)
                #print("Accuracy SVM for custom feature vector:", acc)
                d = {"PoS_SW": posSW, "TFIDF Lemmatization":lemm,
                    "TFIDF: ngram_range": (1,2), "TFIDF_SW": tfidfSW, "SVD": svd,
                    "F1 score": f1score, "Accuracy":acc}
                svm_lin.append(d)
svm_lin_scores = pd.DataFrame(svm_rbf)

In [76]:
svm_lin_scores

Unnamed: 0,PoS_SW,TFIDF Lemmatization,TFIDF: ngram_range,TFIDF_SW,SVD,F1 score,Accuracy
0,True,True,"(1, 2)",,True,"[0.6777117849327915, 0.18882769472856017]",0.538702
1,True,True,"(1, 2)",,False,"[0.7477911646586345, 0.6828282828282829]",0.719016
2,False,True,"(1, 2)",,True,"[0.6795752654590881, 0.19085173501577288]",0.54094
3,False,True,"(1, 2)",,False,"[0.7485988791032825, 0.6815415821501014]",0.719016
4,True,True,"(1, 2)",english,True,"[0.6764982742390964, 0.19641465315666407]",0.538702
5,True,True,"(1, 2)",english,False,"[0.7548872180451128, 0.6397790055248619]",0.708277
6,False,True,"(1, 2)",english,True,"[0.6767106089139988, 0.19781931464174457]",0.53915
7,False,True,"(1, 2)",english,False,"[0.7548022598870057, 0.6413223140495867]",0.708725
8,True,False,"(1, 2)",,True,"[0.6781142678738683, 0.186266771902131]",0.538702
9,True,False,"(1, 2)",,False,"[0.744484556758925, 0.6777946383409205]",0.714989


Highest Accuracy score for SVM linear:

In [98]:
svm_lin_scores.loc[svm_lin_scores['Accuracy'].idxmax()]

PoS_SW                                                     True
TFIDF Lemmatization                                        True
TFIDF: ngram_range                                       (1, 2)
TFIDF_SW                                                   None
SVD                                                       False
F1 score               [0.7477911646586345, 0.6828282828282829]
Accuracy                                               0.719016
Name: 1, dtype: object

Only a **1%** improvement from the previous linear svm accuracy score:

_F-Measure SVM for custom feature vector: 0.75598086 0.62179122
Accuracy SVM for custom feature vector: 0.7033557046979866_


The features from the TF/IDF matrix are be sparse, meaning it will contain a lot of 0s. Whilst the PoS feature vector is dense and continuous. This will probably cause the prediction to be dominated by the dense variables.<br>
<br>
To combat this: perform dimensionality reduction (such as LSA via TruncatedSVD) on the sparse data to make it dense and combine the features into a single dense matrix.
<br>
SVM, RF and NB models are not additive models, they can combat the imbalance in frequency of sparse and dense features. This is why using the dimentionality reduction did not help with increasing their accuracy.
<br>
_(Further reading: https://datascience.stackexchange.com/questions/987/text-categorization-combining-different-kind-of-features )_


### Random Forest

In [101]:
rf_df = []
for lemm in [True, False]:
    for tfidfSW in [None, 'english']:
        for posSW in [True, False]:
            for svd in [True, False]:
                #print()
                X_train_combined, X_test_combined = create_combined(lemm, tfidfSW, (1,2), posSW, svd,  X_train, X_test)
                # Instantiate the model
                # Instantiate the model
                rf = RandomForestClassifier()

                # Train the model on the custom training set
                rf.fit(X_train_combined, y_train)
                # predict the custom test set
                y_pred_rf = rf.predict(X_test_combined)
                
                # scores
                f1score = metrics.f1_score(y_test, y_pred_rf, average=None)
                acc = metrics.accuracy_score(y_test, y_pred_rf)
                
                d = {"PoS_SW": posSW, "TFIDF Lemmatization":lemm,
                    "TFIDF: ngram_range": (1,2), "TFIDF_SW": tfidfSW, "SVD": svd,
                    "F1 score": f1score, "Accuracy":acc}
                rf_df.append(d)
rf_scores = pd.DataFrame(rf_df)

In [102]:
rf_scores

Unnamed: 0,PoS_SW,TFIDF Lemmatization,TFIDF: ngram_range,TFIDF_SW,SVD,F1 score,Accuracy
0,True,True,"(1, 2)",,True,"[0.6539618856569709, 0.3002028397565923]",0.536913
1,True,True,"(1, 2)",,False,"[0.7180500658761528, 0.4030683403068341]",0.617002
2,False,True,"(1, 2)",,True,"[0.6679764243614932, 0.2838983050847458]",0.546309
3,False,True,"(1, 2)",,False,"[0.7219834710743801, 0.4179930795847751]",0.623714
4,True,True,"(1, 2)",english,True,"[0.5312899106002553, 0.48090523338048086]",0.507383
5,True,True,"(1, 2)",english,False,"[0.7414372061786434, 0.4839142091152816]",0.655481
6,False,True,"(1, 2)",english,True,"[0.5365649811951525, 0.4660568127106403]",0.503803
7,False,True,"(1, 2)",english,False,"[0.7399662731871839, 0.4877076411960133]",0.655034
8,True,False,"(1, 2)",,True,"[0.6565087777409738, 0.2853204686423157]",0.536018
9,True,False,"(1, 2)",,False,"[0.7164965426407639, 0.3991625959525471]",0.614765


Highest Accuracy score for rf:

In [103]:
rf_scores.loc[rf_scores['Accuracy'].idxmax()]

PoS_SW                                                     True
TFIDF Lemmatization                                       False
TFIDF: ngram_range                                       (1, 2)
TFIDF_SW                                                english
SVD                                                       False
F1 score               [0.7450980392156862, 0.5013227513227513]
Accuracy                                                0.66264
Name: 13, dtype: object

Only a **3%** improvement in accuracy from the previous rf scores:

_F-Measure RF for custom feature vector: 0.73501474 0.42907551
Accuracy RF for custom feature vector: 0.6380313199105145_

### Voting Classifier

Use sklearn VotingClassifier to combine two different clasiffiers and predict the "most voted" output from two of classifiers.

In [19]:
X_train_combined, X_test_combined = create_combined(lemm=True, tfidfSW=None, ngram_r=(1,2), 
                                                    posSW=True, svd=False, X_train=X_train, X_test=X_test)

In [24]:
from sklearn.ensemble import VotingClassifier

rf = RandomForestClassifier()
rf.fit(X_train_combined, y_train)

svm_clf = svm.SVC(C=1, kernel='linear', probability=True)
svm_clf.fit(X_train_combined, y_train)

est_ensemble = VotingClassifier(estimators=[('RF', rf), ('SVM', svm_clf)],
                        voting='soft',
                        weights=[1, 1])

est_ensemble.fit(X_train_combined, y_train)
y_pred_ensemble = est_ensemble.predict(X_test_combined)

f1score = metrics.f1_score(y_test, y_pred_ensemble, average=None)
acc = metrics.accuracy_score(y_test, y_pred_ensemble)

In [25]:
print("F-Measure of Voting Classifier for custom feature vector:", f1score)
print("Accuracy of Voting Classifier for custom feature vector:", acc)

F-Measure of Voting Classifier for custom feature vector: [0.74586697 0.64633494]
Accuracy of Voting Classifier for custom feature vector: 0.7042505592841163


voting='soft' predicts the labels as the ones with the maximum probabilities after suming all models predictions.<br>
voting='hard' predicts the labels following majority vote rule, i.e. the mode of the models predictions.<br>
_( https://stackoverflow.com/questions/22433646/combine-two-different-classifier-result-in-scikit-learn-python )_

## Another idea...

Since the different types of preprocessing of the text data did not result in great improvement of scores for the models, another idea is suggested:<br>
<br>
Creation of a model using only the sparse TFIDF data and then combine its predictions (probabilities) as a dense feature with the PoS dense features to create a model (ie: ensembling via stacking).


**No time to try this!!**