# Spam Detection in YouTube Comments

#nlp #classification

Your task is to classify YouTube comments into spam and ham categories. The models must go beyond bag of words with Naive Bayes, although that combination may serve as a baseline.

Dataset source: https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection.

Assignment: https://is.muni.cz/auth/el/fi/jaro2021/IB031/um/cviceni/assignment.pdf?predmet=1323750

### Imports

In [1]:
# Math and data stuff
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sklearn
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.naive_bayes import GaussianNB, ComplementNB, BernoulliNB, MultinomialNB, CategoricalNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier

# Tensorflow Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GaussianNoise, LSTM, Bidirectional, Dropout, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# Language stuff
from pymagnitude import Magnitude, MagnitudeUtils
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download("stopwords")

# Other
import html
from mlxtend.feature_selection import ColumnSelector

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/petr.janik/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data loading utils

In [2]:
RANDOM_STATE = 42

def load_all_data(drop_duplicates=True):
    data_dir = "data"

    psy = pd.read_csv(f"{data_dir}/Youtube01-Psy.csv", parse_dates=["DATE"])
    katy = pd.read_csv(f"{data_dir}/Youtube02-KatyPerry.csv", parse_dates=["DATE"])
    lmfao = pd.read_csv(f"{data_dir}/Youtube03-LMFAO.csv", parse_dates=["DATE"])
    eminem = pd.read_csv(f"{data_dir}/Youtube04-Eminem.csv", parse_dates=["DATE"])
    shakira = pd.read_csv(f"{data_dir}/Youtube05-Shakira.csv", parse_dates=["DATE"])
    
    all_datasets = [psy, katy, lmfao, eminem, shakira]
    dataset_names = ["psy", "katy", "lmfao", "eminem", "shakira"]

    # keep info about which video the comment appeared in
    for dataset_name, dataset in zip(dataset_names, all_datasets):
        dataset["INTERPRET"] = dataset_name

    # join all datasets
    joined = pd.concat(all_datasets).reset_index(drop=True)
    
    # common preprocessing
    if drop_duplicates:
        joined.drop_duplicates(inplace=True)
    
    # convert object types to strings
    object_cols = joined.select_dtypes("object").columns
    joined[object_cols] = joined[object_cols].astype("string")
    
    return joined

def load_data():
    all_data = load_all_data()
    
    df, final_test_df = train_test_split(
        all_data, test_size=0.2, random_state=RANDOM_STATE
    )
    
    return df

def load_final_test_data():
    all_data = load_all_data()
    
    df, final_test_df = train_test_split(
        all_data, test_size=0.2, random_state=RANDOM_STATE
    )
    
    return final_test_df

def load_train_test_all_cols_data(test_size=0.2):
    df = load_data()
    df_X, df_y = df.drop(columns="CLASS"), df.CLASS
    
    return train_test_split(
        df_X, df_y, test_size=test_size, random_state=RANDOM_STATE
    )

def load_train_test_data():
    df = load_data()
    df_X, df_y = df.CONTENT, df.CLASS
    
    return train_test_split(
        df_X, df_y, test_size=0.2, random_state=RANDOM_STATE
    )

Load data:

In [3]:
data_dir = "data"

psy = pd.read_csv(f"{data_dir}/Youtube01-Psy.csv", parse_dates=["DATE"])
katy = pd.read_csv(f"{data_dir}/Youtube02-KatyPerry.csv", parse_dates=["DATE"])
lmfao = pd.read_csv(f"{data_dir}/Youtube03-LMFAO.csv", parse_dates=["DATE"])
eminem = pd.read_csv(f"{data_dir}/Youtube04-Eminem.csv", parse_dates=["DATE"])
shakira = pd.read_csv(f"{data_dir}/Youtube05-Shakira.csv", parse_dates=["DATE"])

## Exploration analysis

Presentation: https://docs.google.com/presentation/d/1GIoRrhQ2_eERuowLotWscQhx7X4s4A4mwBm3-aGpO_E/edit

The dataset comes from 2017.
It consists of five datasets with 1 956 real comments in total, extracted from five videos that were among the 10 most viewed during the collection period.

| Artist     | Song (our guess)     | Song published on YouTube (date) | Comments range |
|------------|----------------------|----------------------------------|----------------|
| Psy        | Gangnam Style        | 15. 7. 2012                      | 2013–2015      |
| Katy Perry | Roar                 | 5. 9. 2013                       | 2014–2015      |
| LMFAO      | Party Rock Anthem    | 9. 3. 2011                       | 2014–2015      |
| Eminem     | Love The Way You Lie | 5. 8. 2010                       | 2015–2015      |
| Shakira    | Waka Waka            | 5. 6. 2010                       | 2013–2015      |


In [4]:
joined = pd.concat([psy, katy, lmfao, eminem, shakira]).reset_index(drop=True)
joined

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07 06:20:48.000,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07 12:37:15.000,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08 17:34:21.000,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09 08:28:43.000,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10 16:05:38.000,watch?v=vtaRGgvGtWQ Check this out .﻿,1
...,...,...,...,...,...
1951,_2viQ_Qnc6-bMSjqyL1NKj57ROicCSJV5SwTrw-RFFA,Katie Mettam,2013-07-13 13:27:39.441,I love this song because we sing it at Camp al...,0
1952,_2viQ_Qnc6-pY-1yR6K2FhmC5i48-WuNx5CumlHLDAI,Sabina Pearson-Smith,2013-07-13 13:14:30.021,I love this song for two reasons: 1.it is abou...,0
1953,_2viQ_Qnc6_k_n_Bse9zVhJP8tJReZpo8uM2uZfnzDs,jeffrey jules,2013-07-13 12:09:31.188,wow,0
1954,_2viQ_Qnc6_yBt8UGMWyg3vh0PulTqcqyQtdE7d4Fl0,Aishlin Maciel,2013-07-13 11:17:52.308,Shakira u are so wiredo,0


In [5]:
print(f"There are {len(joined)} total rows in the dataset.")

There are 1956 total rows in the dataset.


The column `CLASS` is the label column we want to predict.

In [6]:
f"There are {len(joined[joined.CLASS == 1])} spams and {len(joined[joined.CLASS == 0])} hams."

'There are 1005 spams and 951 hams.'

We split the data set at the very beginning so that the test data is not contaminated by anything we learn during exploration.
`final_test_X` and `final_test_y` will be used at the very end to evaluate and compare our models.

In [7]:
joined_X, joined_y = joined.drop(columns="CLASS"), joined.CLASS

random_state = 42

train_X_orig, final_test_X, train_y, final_test_y = train_test_split(
    joined_X, joined_y, test_size=0.2, random_state=random_state
)

train_df, test_df = train_test_split(
    joined, test_size=0.2, random_state=random_state
)

train_df = train_df.copy()

In [8]:
print(f"After putting test data apart, there are {len(train_X_orig)} train data rows.")

After putting test data apart, there are 1564 train data rows.


In [9]:
f"There are {(train_y == 1).sum()} spams and {(train_y == 0).sum()} hams."

'There are 789 spams and 775 hams.'

...which is well balanced, so we do not need any fancy sampling.
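
The balance can also be expressed as class shares. A minimal sketch on a toy label series that reproduces the counts above (789 spams, 775 hams); the name `toy_y` is hypothetical and only stands in for `train_y`.

```python
import pandas as pd

# Toy labels reproducing the counts reported above: 789 spam (1), 775 ham (0).
toy_y = pd.Series([1] * 789 + [0] * 775, name="CLASS")

# Share of each class; values close to 0.5 indicate a balanced binary problem.
class_shares = toy_y.value_counts(normalize=True)
print(class_shares.round(3))
```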

In [10]:
train_X_orig.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1564 entries, 836 to 1126
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   COMMENT_ID  1564 non-null   object        
 1   AUTHOR      1564 non-null   object        
 2   DATE        1369 non-null   datetime64[ns]
 3   CONTENT     1564 non-null   object        
dtypes: datetime64[ns](1), object(3)
memory usage: 61.1+ KB


#### Dataset includes the following columns:
- COMMENT_ID - string
- AUTHOR - string
- DATE - date
- CONTENT (comment itself) - string
- CLASS (1 = spam, 0 = ham) - categorical

The column `COMMENT_ID` provides very little (no) value, let's drop it.

In [11]:
train_X = train_X_orig.drop(columns="COMMENT_ID")
train_X

Unnamed: 0,AUTHOR,DATE,CONTENT
836,Blaze Blaziken,2015-05-20 03:50:29.098,you cant stop the shuffle﻿
1688,Lucia Scarlet,2015-05-22 11:56:53.104,Amazing song﻿
1505,Paul Buxton,2015-05-21 11:12:22.066,Omg! This guy sounds like an american professo...
1650,Tierra Carbon,2015-05-24 14:21:54.411,It was cool the best song ever ﻿
1573,David Bottenberg,NaT,subscribe to my channel /watch?v=NxK32i0HkDs
...,...,...,...
1130,ItsJoey Dash,2014-07-22 10:04:05.755,EVERYONE PLEASE SUBSCRIBE TO MY CHANNEL OR CAN...
1294,louis canellony,NaT,watch youtube video &quot;EMINEM -YTMA artist ...
860,vieshva .d.exodous,2015-05-18 08:38:34.236,Awesome﻿
1459,viviane trinh,2015-05-21 22:35:35.753,i like the lyrics but not to music video﻿


<font color='red'>The `AUTHOR` column can help in classifying spam – spammers may have specific names (e.g. having a "normal" name instead of a username, copying a well-known username etc.).</font>

There are no correlations between columns.

**Hypothesis**: Something that looks like spam is not always spam; it depends on the video.
E.g. if a word appears in the video subtitles, a comment containing it is less likely to be spam. However, we will not develop this hypothesis any further.

#### Exploration: Missing values? (Eminem is missing a lot of dates)

In [12]:
for dataset, dataset_name in zip([psy, katy, lmfao, eminem, shakira], ["psy", "katy", "lmfao", "eminem", "shakira"]):
    na = dataset.isna().any()
    print(dataset_name)
    for missing, column_name in zip(na, na.index):
        print(f'\t{column_name:10}: {"" if missing else "NO "}missing values')
    

psy
	COMMENT_ID: NO missing values
	AUTHOR    : NO missing values
	DATE      : NO missing values
	CONTENT   : NO missing values
	CLASS     : NO missing values
katy
	COMMENT_ID: NO missing values
	AUTHOR    : NO missing values
	DATE      : NO missing values
	CONTENT   : NO missing values
	CLASS     : NO missing values
lmfao
	COMMENT_ID: NO missing values
	AUTHOR    : NO missing values
	DATE      : NO missing values
	CONTENT   : NO missing values
	CLASS     : NO missing values
eminem
	COMMENT_ID: NO missing values
	AUTHOR    : NO missing values
	DATE      : missing values
	CONTENT   : NO missing values
	CLASS     : NO missing values
shakira
	COMMENT_ID: NO missing values
	AUTHOR    : NO missing values
	DATE      : NO missing values
	CONTENT   : NO missing values
	CLASS     : NO missing values


We can see that only the eminem dataset has missing values, and they are all in the `DATE` column.

In [13]:
f"{eminem.DATE.isna().sum()} out of {len(eminem)} DATE values in eminem dataset are missing."

'245 out of 448 DATE values in eminem dataset are missing.'

First, we wanted to replace `NaN` dates with the previous non-`NaN` date in a sorted dataset, but...

In [14]:
eminem[eminem.DATE.isna()]["CLASS"].value_counts()

1    245
Name: CLASS, dtype: int64

<font color='red'>All Eminem comments that do not have DATE are spams. The information that DATE is missing is therefore valuable.</font>
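
The relation between a missing `DATE` and the label can be read off a small contingency table. The sketch below uses a toy frame with made-up values (the name `toy` and its rows are assumptions, not rows from the dataset).

```python
import pandas as pd

# Toy frame (hypothetical values): two spam rows without a date, two dated rows.
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2015-05-20", None, None, "2015-05-21"]),
    "CLASS": [0, 1, 1, 1],
})

# Rows: is DATE missing? Columns: class label. Cells: number of comments.
missing_vs_class = pd.crosstab(toy["DATE"].isna(), toy["CLASS"])
print(missing_vs_class)
```

A row for missing dates with a zero in the ham column is the pattern we see in the eminem data.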

#### Exploration: What range are comment dates in?

In [15]:
for dataset, dataset_name in zip([psy, katy, lmfao, eminem, shakira], ["psy", "katy", "lmfao", "eminem", "shakira"]):
    sorted_dataset = dataset[~dataset.DATE.isna()].DATE.sort_values()
    earliest = sorted_dataset.iloc[0].year
    latest = sorted_dataset.iloc[-1].year
    print(f"{dataset_name}: {earliest}-{latest}")
    

psy: 2013-2015
katy: 2014-2015
lmfao: 2014-2015
eminem: 2015-2015
shakira: 2013-2015


#### Hypothesis: Longer comment implies spam.

In [16]:
y_pred = train_X.CONTENT.str.len() >= 50
accuracy_score(train_y, y_pred)

0.6304347826086957

...better than random.
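
The threshold of 50 characters is a guess, so it may be worth trying a few values. A minimal sketch on toy comments (the texts, labels and candidate thresholds are made up for illustration):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Toy data (hypothetical): short hams and longer spams.
toy_content = pd.Series([
    "wow",
    "Amazing song",
    "subscribe to my channel and check out my new video please",
    "Hey guys check out my new channel and our first video, thanks",
])
toy_y = pd.Series([0, 0, 1, 1])

# Accuracy of the rule "length >= threshold implies spam" for several thresholds.
scores = {}
for threshold in (10, 50, 100):
    y_pred = (toy_content.str.len() >= threshold).astype(int)
    scores[threshold] = accuracy_score(toy_y, y_pred)
print(scores)
```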

#### Hypothesis: Spams contain "check" (e.g. "Check out my channel")

In [17]:
y_pred = train_X.CONTENT.str.contains("check")
accuracy_score(train_y, y_pred)

0.5818414322250639

...better than random.
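
Note that `str.contains("check")` is case sensitive, so "CHECK" and "Check" are not matched. A small sketch of the difference on toy strings; whether the case-insensitive variant scores better on the real data is not verified here.

```python
import pandas as pd

toy_content = pd.Series(["CHECK OUT MY VIDEO", "Check this out", "nice song"])

# Case-sensitive match, as in the cell above: misses "CHECK" and "Check".
sensitive = toy_content.str.contains("check")
# Case-insensitive variant via the `case` parameter of `Series.str.contains`.
insensitive = toy_content.str.contains("check", case=False)
print(sensitive.tolist(), insensitive.tolist())
```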

#### Exploration: Are there any duplicate comments based on ID?

In [18]:
train_df[train_df.COMMENT_ID.duplicated(keep=False)]

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
1421,LneaDw26bFvPh9xBHNw1btQoyP60ay_WWthtvXCx37s,janez novak,NaT,share and like this page to win a hand signed ...,1
1441,LneaDw26bFuH6iFsSrjlJLJIX3qD4R8-emuZ-aGUj0o,Amir bassem,NaT,if u love rihanna subscribe me,1
1420,LneaDw26bFvPh9xBHNw1btQoyP60ay_WWthtvXCx37s,janez novak,NaT,share and like this page to win a hand signed ...,1
1797,_2viQ_Qnc68fX3dYsfYuM-m4ELMJvxOQBmBOFHqGOk0,tyler sleetway,2013-10-05 00:57:25.078,so beutiful,0
1798,_2viQ_Qnc68fX3dYsfYuM-m4ELMJvxOQBmBOFHqGOk0,tyler sleetway,2013-10-05 00:57:25.078,so beutiful,0
1443,LneaDw26bFuH6iFsSrjlJLJIX3qD4R8-emuZ-aGUj0o,Amir bassem,NaT,if u love rihanna subscribe me,1


<font color='red'>We want to drop duplicate rows.</font>

In [19]:
f"We lose only {len(train_df) - len(train_df.drop_duplicates())} rows by dropping duplicates."

'We lose only 3 rows by dropping duplicates.'

Are there any duplicate comments based on Author name?

In [20]:
duplicate_authors = train_df[train_df.duplicated(subset=["AUTHOR"], keep=False)]
duplicate_authors

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
192,z13xtdlovm2hzl05d04ccz1pnvqtezdriqc0k,Uroš Slemenjak,2014-11-07 12:08:13.000,"People, here is a new network like FB...you re...",1
462,z12wj5g52rzbcvprl04cenuj1yyifhxq3hw,LuckyMusiqLive,2014-09-15 17:47:57.000,Katy has the voice of gold. this video really ...,1
141,z12udxjwpwurtlwz304ccbrhdtusth4herk0k,PacKmaN,2014-11-05 21:56:39.000,check men out i put allot of effort into my mu...,1
1828,_2viQ_Qnc6_umVgV0fI-CSScDHuFxNHIVvezCGhajW8,Alain Bruno,2013-10-02 04:01:20.922,Shakira is very beautiful,0
1131,z12dw3tbbzm2gpzty22gtf1bvviqeha2j,ItsJoey Dash,2014-07-22 10:04:00.700,EVERYONE PLEASE SUBSCRIBE TO MY CHANNEL OR CAN...,1
...,...,...,...,...,...
769,z13jvndqyr2zjzpxo04cij4pdva4yhqzlq40k,D Maw,2015-05-25 05:43:54.048,Remeber when this song was good﻿,0
1332,LneaDw26bFuUfz6WhRf9xiRHsxHm0t4fXYLbcPhB7Lk,Scott Johnson,NaT,You guys should check out this EXTRAORDINARY w...,1
1638,z132svd4fvq1wntfd221w5szfzezjri2r,Abdullah Fawzi,2015-05-25 06:23:24.405,"see this<br /><a href=""http://adf.ly"">http://a...",1
1724,z13qfn5yusqoslnn222dutdw4yqmhzhej,LiveLikeLien x,2015-05-20 00:57:56.444,"It makes me happy instantly, and makes me forg...",0


In [21]:
train_df[train_df.AUTHOR == "Uroš Slemenjak"]

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
192,z13xtdlovm2hzl05d04ccz1pnvqtezdriqc0k,Uroš Slemenjak,2014-11-07 12:08:13,"People, here is a new network like FB...you re...",1
658,z13ocdbaxwqdvjnwx04ccz1pnvqtezdriqc0k,Uroš Slemenjak,2014-11-07 12:16:21,"People, here is a new network like FB...you re...",1


In [22]:
duplicate_authors["CLASS"].value_counts()

1    149
0     35
Name: CLASS, dtype: int64

In [23]:
# CLASS is 0/1, so its mean is the spam share (independent of the order returned by value_counts)
spam_share = duplicate_authors["CLASS"].mean()
f"{spam_share * 100:.02f}% of comments from authors who posted multiple comments are spams."

'80.98% of comments from authors who posted multiple comments are spams.'

Does the same author post both spams and hams? If so, are the labels truly correct?

In [24]:
grouped_by_author = duplicate_authors.groupby("AUTHOR").CLASS.mean()
posted_both_spam_and_ham = (grouped_by_author != 0) & (grouped_by_author != 1)
f"There are {len(grouped_by_author[posted_both_spam_and_ham])} authors who posted both spam and ham."

'There are 0 authors who posted both spam and ham.'

In [25]:
duplicate_comments = train_df[train_df.duplicated(subset=["CONTENT"], keep=False)]
grouped_by_content = duplicate_comments.groupby("CONTENT").CLASS.mean()
same_content_spam_and_ham = (grouped_by_content != 0) & (grouped_by_content != 1)
f"There are {len(grouped_by_content[same_content_spam_and_ham])} comments that are classified both as spam and ham."

'There are 0 comments that are classified both as spam and ham.'

But this might hold only for the data we are working with, and not in general.
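
The same kind of check, for authors or for comment texts, can also be written with `nunique`. A minimal sketch on a toy frame (the authors `a`, `b`, `c` and their labels are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "AUTHOR": ["a", "a", "b", "b", "c"],
    "CLASS": [1, 1, 0, 1, 0],
})

# Number of distinct labels per author; more than one means both spam and ham.
labels_per_author = toy.groupby("AUTHOR")["CLASS"].nunique()
mixed_authors = labels_per_author[labels_per_author > 1].index.tolist()
print(mixed_authors)
```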

In [26]:
test_df[test_df.AUTHOR == "Connor Mire"]

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
810,z13xxf3qlq2bxpm1o22zidpqbn2tfpcjr04,Connor Mire,2015-05-21 20:46:54.704,This Song is AWESOME!!!!﻿,0
809,z13uzb3zqoylw1kl022zidpqbn2tfpcjr04,Connor Mire,2015-05-21 20:47:08.673,I&#39;m A SUBSCRIBER﻿,1


In the held-out data, however, *Connor Mire* posted both spam and ham, and both labels look correct.

In [27]:
# CLASS is 0/1, so its mean is the spam share (independent of the order returned by value_counts)
spam_share = duplicate_comments["CLASS"].mean()
f"{spam_share * 100:.02f}% of duplicate comments are spams."

'77.30% of duplicate comments are spams.'

We will take advantage of custom transformers. This way, it is easy to use them in pipelines both for training and testing. We can easily select a subset of transformations and vary transformations based on used model.
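
To illustrate the idea, here is a minimal sketch of two stateless transformers chained with `make_pipeline`. The classes `AddLengthColumn` and `LowerContent` and the toy frame are hypothetical; they only show the pattern used in this notebook.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class AddLengthColumn(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds the character length of CONTENT."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X, y=None):
        X = X.copy()  # do not modify the caller's frame
        X["LENGTH"] = X["CONTENT"].str.len()
        return X


class LowerContent(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: lowercases CONTENT."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X["CONTENT"] = X["CONTENT"].str.lower()
        return X


toy = pd.DataFrame({"CONTENT": ["Check THIS out", "wow"]})
preprocessing = make_pipeline(AddLengthColumn(), LowerContent())
result = preprocessing.fit_transform(toy)
print(result)
```

Because each step copies its input, the original frame stays unchanged, and steps can be added or removed per model without touching the others.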

In [28]:
train_X, test_X, train_y, test_y = load_train_test_all_cols_data()

In [29]:
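class ExplorativeTransformer(BaseEstimator, TransformerMixin):
    """
    Adds hand-crafted exploration features: HAS_LINK, NOT_UNIQUE_AUTHOR,
    NULL_IN_DATE_TIME and SUSPICIOUS_WORDS_COUNT.
    """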
    def __init__(self):
        super().__init__()
        self.duplicated_and_spammers = 0
        self.duplicated_and_not_spammers = 0
        self.not_unique_authors = set()

    def fit(self, X, y=None):
        X = X.copy()
        
        stop = stopwords.words('english')

        author_set = set()
        for author in X["AUTHOR"]:
            if author in author_set:
                self.not_unique_authors.add(author)
            author_set.add(author)
            
        # are all duplicated authors spammers?
        for index, row in X.iterrows():
            if row["AUTHOR"] in self.not_unique_authors and y[index] == 1:
                self.duplicated_and_spammers += 1
            if row["AUTHOR"] in self.not_unique_authors and y[index] == 0:
                self.duplicated_and_not_spammers += 1
        
        """
        Builds two dictionaries, spam_dict and ham_dict, that count how often each word occurs
        in spam comments and in ham comments respectively.
        Both are then inverted to count -> word, so only one word survives per distinct count.
        Spam words occurring more than 15 times and ham words occurring more than 35 times are kept,
        stop words excluded.
        The thresholds differ because the intent was that marking a ham as spam is worse
        than missing a spam.
        Finally, only the kept spam words that are not among the kept ham words remain.
        """

        self.spam_dict = ExplorativeTransformer.get_suspicious_words(X, y, 1)
        ham_dict = ExplorativeTransformer.get_suspicious_words(X, y, 0)
        my_inverted_dict_spam = dict(map(reversed, self.spam_dict.items()))
        my_inverted_dict_ham = dict(map(reversed, ham_dict.items()))
        suspicious_spam = []
        suspicious_ham = []

        for key in my_inverted_dict_spam:
            if key > 15 and my_inverted_dict_spam[key] not in stop:
                suspicious_spam.append(my_inverted_dict_spam[key])

        for key in my_inverted_dict_ham:
            if key > 35 and my_inverted_dict_ham[key] not in stop:
                suspicious_ham.append(my_inverted_dict_ham[key])

        # remove all from spam that is in ham
        self.suspicious_words_list = []
        for word in suspicious_spam:
            if word not in suspicious_ham:
                self.suspicious_words_list.append(word)

        return self

    def transform(self, X, y=None):
        X = X.copy()
        X["HAS_LINK"] = np.where(X['CONTENT'].str.contains('http') |
                                  X['CONTENT'].str.contains('//'), 2, 0)
    
        # isin() tests each author; `X['AUTHOR'].str in ...` gave a single False for all rows
        X["NOT_UNIQUE_AUTHOR"] = np.where(X['AUTHOR'].isin(self.not_unique_authors), self.duplicated_and_not_spammers
                                           // self.duplicated_and_spammers, 0)

        X["NULL_IN_DATE_TIME"] = np.where(X["DATE"].isna(), 2, 0)
                
        result = []
        for index, row in X.iterrows():
            suspicious_counter = 0
            for word in self.suspicious_words_list:
                my_row = row["CONTENT"].lower().split()
                if word in my_row:
                    suspicious_counter += self.spam_dict[word.lower()] // 100
            result.append(suspicious_counter)

        X["SUSPICIOUS_WORDS_COUNT"] = result

        return X
    
    @staticmethod
    def get_suspicious_words(X, y, num):
        result = {}
        for index, row in X.iterrows():
            if y[index] != num:
                continue
            words = (row["CONTENT"].lower()).split()
            for word in words:
                if word in result:
                    result[word] += 1
                else:
                    result[word] = 1
                    
        return result

# DEMO
transformed = ExplorativeTransformer().fit_transform(train_X, train_y)
transformed

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,INTERPRET,HAS_LINK,NOT_UNIQUE_AUTHOR,NULL_IN_DATE_TIME,SUSPICIOUS_WORDS_COUNT
1447,z13wxtdpeznid12et23ogtd4zoyvzbnoz04,Sonny Carter,2015-05-22 11:46:35.988,I love this song sooooooooooooooo much﻿,eminem,0,0,0,0
1846,_2viQ_Qnc6-adCzTDLAhqNVQ5hFYcjPyPI5m7pHY4BY,Lizzy Molly,2013-09-09 17:34:07.052,PLEASE CHECK OUT MY VIDEO CALLED &quot;WE LOVE...,shakira,0,0,0,3
1304,z13uutbriumnuj3rq04ccbvqlwjuj1srhyk0k,Warcorpse666,2015-05-26 02:27:43.254,sorry but eminmem is a worthless wife beating ...,eminem,0,0,0,0
402,z121gbuy2unhc5m4n04cf3kyslqhepeqgvo0k,Santeri Saariokari,2014-09-03 16:32:59.000,"Hey guys go to check my video name ""growtopia ...",katy,0,0,0,3
652,z133hdqrqpukup0lp22chhoaztrhvxov5,Quinho Divulgaçoes,2014-11-06 19:50:16.000,me segue ha https://www.facebook.com/marcos.s...,katy,2,0,0,0
...,...,...,...,...,...,...,...,...,...
1865,_2viQ_Qnc6_FlLJN0izQaKVQNe6LGDmPZMmkVDjjymE,Neeru bala,2013-09-05 23:07:09.056,Hi.. Everyone.. If anyone after real online wo...,shakira,0,0,0,0
1427,LneaDw26bFsltJodWnZAafXscqrATBuKDM8-8lA4TQE,miamiscraziest,NaT,LADIES!!! -----&gt;&gt; If you have a broken h...,eminem,0,0,2,3
172,z12sitjpgyy3f5j2322iy1figqa4vnyja04,OverSpace33,2014-11-06 19:40:59.000,For Christmas Song visit my channel! ;)﻿,psy,0,0,0,0
269,z13axbnqtxfrc3ncc23xxp2wivqbgx43o,sahil samal,2014-11-08 06:28:01.000,1 millioon dislikessssssssssssssssssssssssssss...,psy,0,0,0,0
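
The word-selection step can also be sketched with `collections.Counter`, which keeps every word instead of one word per count. This is an alternative sketch, not the transformer above: the function names, the toy comments, the thresholds and the stop-word set are all assumptions chosen for illustration.

```python
from collections import Counter


def count_words(comments):
    """Count lowercase whitespace-separated tokens over all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(comment.lower().split())
    return counts


def select_suspicious_words(spam, ham, min_spam, min_ham, stop_words=frozenset()):
    """Words frequent in spam, excluding stop words and words frequent in ham."""
    spam_counts, ham_counts = count_words(spam), count_words(ham)
    frequent_ham = {w for w, c in ham_counts.items() if c > min_ham}
    return sorted(
        w for w, c in spam_counts.items()
        if c > min_spam and w not in stop_words and w not in frequent_ham
    )


# Toy comments and small thresholds, chosen only for illustration.
spam = ["check my channel", "subscribe to my channel", "check this out"]
ham = ["love this song", "this song is great"]
suspicious = select_suspicious_words(spam, ham, min_spam=1, min_ham=1, stop_words={"my"})
print(suspicious)
```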


In [30]:
class AddMissingDateColumn(BaseEstimator, TransformerMixin):
    """
    Adds boolean column if DATE value is missing.
    The model should then understand that True most likely means spam.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # do not modify the original dataset
        X = X.copy()
        X["DATE_MISSING"] = X.DATE.isna()
        return X
    
# DEMO
transformed = AddMissingDateColumn().transform(train_X)
transformed[1244:1246]

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,INTERPRET,DATE_MISSING
1865,_2viQ_Qnc6_FlLJN0izQaKVQNe6LGDmPZMmkVDjjymE,Neeru bala,2013-09-05 23:07:09.056,Hi.. Everyone.. If anyone after real online wo...,shakira,False
1427,LneaDw26bFsltJodWnZAafXscqrATBuKDM8-8lA4TQE,miamiscraziest,NaT,LADIES!!! -----&gt;&gt; If you have a broken h...,eminem,True


In [31]:
class AddLongCommentColumn(BaseEstimator, TransformerMixin):
    """
    Adds a boolean column that is True when CONTENT is at least 50 characters long.
    The model should then understand that True most likely means spam.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X["LONG_COMMENT"] = X.CONTENT.str.len() >= 50
        return X

# DEMO
transformed = AddLongCommentColumn().transform(train_X)
transformed[:2]

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,INTERPRET,LONG_COMMENT
1447,z13wxtdpeznid12et23ogtd4zoyvzbnoz04,Sonny Carter,2015-05-22 11:46:35.988,I love this song sooooooooooooooo much﻿,eminem,False
1846,_2viQ_Qnc6-adCzTDLAhqNVQ5hFYcjPyPI5m7pHY4BY,Lizzy Molly,2013-09-09 17:34:07.052,PLEASE CHECK OUT MY VIDEO CALLED &quot;WE LOVE...,shakira,True


In [32]:
class AddContainsCheckColumn(BaseEstimator, TransformerMixin):
    """
    Adds boolean column if CONTENT contains 'check'.
    The model should then understand that True most likely means spam.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X["CONTAINS_CHECK"] = X.CONTENT.str.contains("check")
        return X

# DEMO
transformed = AddContainsCheckColumn().transform(train_X)
transformed[2:4]

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,INTERPRET,CONTAINS_CHECK
1304,z13uutbriumnuj3rq04ccbvqlwjuj1srhyk0k,Warcorpse666,2015-05-26 02:27:43.254,sorry but eminmem is a worthless wife beating ...,eminem,False
402,z121gbuy2unhc5m4n04cf3kyslqhepeqgvo0k,Santeri Saariokari,2014-09-03 16:32:59.000,"Hey guys go to check my video name ""growtopia ...",katy,True


In [33]:
class AddMultipleCommentsColumn(BaseEstimator, TransformerMixin):
    """
    Adds boolean column if author posted multiple comments.
    The model should then understand that True most likely means spam.
    
    Looks also into data it has been fit on.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        self.prev_X = X.copy()
        return self

    def transform(self, X, y=None):
        X = X.copy()
        joined = pd.concat([X, self.prev_X])
        joined.drop_duplicates(inplace=True)
        X["MULTIPLE_COMMENTS"] = joined.duplicated(subset=["AUTHOR"], keep=False)[:len(X)]
        return X
    
# DEMO
mollys = train_X[train_X.AUTHOR == "Lizzy Molly"]
admc = AddMultipleCommentsColumn()
admc.fit_transform(mollys[:1])

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,INTERPRET,MULTIPLE_COMMENTS
1846,_2viQ_Qnc6-adCzTDLAhqNVQ5hFYcjPyPI5m7pHY4BY,Lizzy Molly,2013-09-09 17:34:07.052,PLEASE CHECK OUT MY VIDEO CALLED &quot;WE LOVE...,shakira,False


In [34]:
transformed = admc.transform(mollys[1:2])
transformed[transformed.AUTHOR == "Lizzy Molly"]

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,INTERPRET,MULTIPLE_COMMENTS
1472,LneaDw26bFuhuiZ8uX6C-qYLIsOFj9BIWtKWtCz870c,Lizzy Molly,NaT,PLEASE CHECK OUT MY VIDEO CALLED &quot;WE LOVE...,eminem,True


In [35]:
class AddMultipleCommentsSameVideoColumn(BaseEstimator, TransformerMixin):
    """
    Adds boolean column if author posted multiple comments for the same video.
    The model should then understand that True most likely means spam.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X["MULTIPLE_COMMENTS_SAME_VIDEO"] = X.duplicated(subset=["AUTHOR", "INTERPRET"], keep=False)
        return X
    
# DEMO    
transformed = AddMultipleCommentsSameVideoColumn().transform(train_X)
transformed[transformed.AUTHOR == "Lizzy Molly"]

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,INTERPRET,MULTIPLE_COMMENTS_SAME_VIDEO
1846,_2viQ_Qnc6-adCzTDLAhqNVQ5hFYcjPyPI5m7pHY4BY,Lizzy Molly,2013-09-09 17:34:07.052,PLEASE CHECK OUT MY VIDEO CALLED &quot;WE LOVE...,shakira,False
1472,LneaDw26bFuhuiZ8uX6C-qYLIsOFj9BIWtKWtCz870c,Lizzy Molly,NaT,PLEASE CHECK OUT MY VIDEO CALLED &quot;WE LOVE...,eminem,False


In [36]:
class AddContainsHttpColumn(BaseEstimator, TransformerMixin):
    """
    Adds boolean column if CONTENT contains a link.
    The model should then understand that True most likely means spam.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X["CONTAINS_HTTP"] = X.CONTENT.str.contains("http")
        return X
    
# DEMO    
transformed = AddContainsHttpColumn().transform(train_X)
transformed[3:5]

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,INTERPRET,CONTAINS_HTTP
402,z121gbuy2unhc5m4n04cf3kyslqhepeqgvo0k,Santeri Saariokari,2014-09-03 16:32:59,"Hey guys go to check my video name ""growtopia ...",katy,False
652,z133hdqrqpukup0lp22chhoaztrhvxov5,Quinho Divulgaçoes,2014-11-06 19:50:16,me segue ha https://www.facebook.com/marcos.s...,katy,True


In [37]:
class AddTimeColumn(BaseEstimator, TransformerMixin):
    """
    Adds time column.
    Hypothesis: spams are posted at night or, on the contrary, during main working hours. Let the model decide...
    This probably won't work, because the dates are most likely relative to the time zone where they were collected.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X["TIME"] = X.DATE.dt.time
        return X
    
# DEMO    
transformed = AddTimeColumn().transform(train_X)
transformed[0:1]

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,INTERPRET,TIME
1447,z13wxtdpeznid12et23ogtd4zoyvzbnoz04,Sonny Carter,2015-05-22 11:46:35.988,I love this song sooooooooooooooo much﻿,eminem,11:46:35.988000
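
`dt.time` yields Python `time` objects in an object column, which most estimators cannot consume directly. A numeric hour of day is one possible alternative; the sketch below is an assumption about how such a feature could look (the column name `HOUR` and the toy dates are made up), and the time-zone caveat from the docstring still applies.

```python
import pandas as pd

toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2015-05-22 11:46:35", None, "2013-09-09 23:07:09"]),
})

# Hour of day as a number; missing dates become <NA> in a nullable integer column.
toy["HOUR"] = toy["DATE"].dt.hour.astype("Int64")
print(toy)
```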


In [38]:
class HtmlUnescaper(BaseEstimator, TransformerMixin):
    """
    For example, `&amp;` is escaped ampersand. Unescape it and other characters as well.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X["CONTENT"] = X["CONTENT"].apply(html.unescape)
        return X
    
# DEMO    
print("Before: ", train_X.CONTENT[700])
transformed = HtmlUnescaper().transform(train_X)
print("After: ", transformed.CONTENT[700])

Before:  <a href="http://www.youtube.com/watch?v=KQ6zr6kCPj8&amp;t=2m19s">2:19</a> best part﻿
After:  <a href="http://www.youtube.com/watch?v=KQ6zr6kCPj8&t=2m19s">2:19</a> best part﻿


In [39]:
class BOMRemover(BaseEstimator, TransformerMixin):
    """
    Remove Byte Order Mark from comments.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X["CONTENT"] = X["CONTENT"].str.replace("\ufeff", "", regex=False)
        return X
    

In [40]:
# DEMO
"Before: " + train_X.CONTENT.loc[700]

'Before: <a href="http://www.youtube.com/watch?v=KQ6zr6kCPj8&amp;t=2m19s">2:19</a> best part\ufeff'

In [41]:
transformed = BOMRemover().transform(train_X)
"After: " + transformed.CONTENT.loc[700]

'After: <a href="http://www.youtube.com/watch?v=KQ6zr6kCPj8&amp;t=2m19s">2:19</a> best part'

In [42]:
class AnchorTransformer(BaseEstimator, TransformerMixin):
    """
    Transforms all anchor tags into one keyword. 
    The model will figure out, that the presence of this keyword probably means spam.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X["CONTENT"] = X["CONTENT"].str.replace("(<a.+>)", "anchortag", regex=True)
        return X
    
# DEMO
print("Before: " + train_X.CONTENT.loc[700])
transformed = AnchorTransformer().transform(train_X)
print("After: " + transformed.CONTENT.loc[700])

Before: <a href="http://www.youtube.com/watch?v=KQ6zr6kCPj8&amp;t=2m19s">2:19</a> best part﻿
After: anchortag best part﻿
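
The pattern `<a.+>` is greedy: it runs up to the last `>` in the comment, so any text between the anchor and a later tag is swallowed as well. A small sketch of the difference on a made-up comment; the non-greedy pattern is only one possible alternative.

```python
import re

comment = '<a href="http://example.com">link</a> best part <b>ever</b>'

# Greedy, as in the transformer above: `.+` runs to the last `>` in the string.
greedy = re.sub(r"<a.+>", "anchortag", comment)
# Non-greedy alternative: matches one whole <a ...>...</a> element only.
non_greedy = re.sub(r"<a\b.*?</a>", "anchortag", comment)
print(greedy)
print(non_greedy)
```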


In [43]:
class AddContainsAnchorTagColumn(BaseEstimator, TransformerMixin):
    """
    Adds new column CONTAINS_ANCHOR_TAG which is True when CONTENT contains <a> tag.
    The model should then understand that True most likely means spam.
    Removes the link from CONTENT as well.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
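        # No capture group here: pandas warns when str.contains receives a pattern with match groups
        X["CONTAINS_ANCHOR_TAG"] = X["CONTENT"].str.contains("<a.+>", regex=True)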
        X["CONTENT"] = X["CONTENT"].str.replace("(<a.+>)", "", regex=True)
        return X
    
# DEMO
print("Before: " + train_X.CONTENT.loc[700])
transformed = AddContainsAnchorTagColumn().transform(train_X)
transformed.loc[700]

Before: <a href="http://www.youtube.com/watch?v=KQ6zr6kCPj8&amp;t=2m19s">2:19</a> best part﻿



COMMENT_ID             z13uwn2heqndtr5g304ccv5j5kqqzxjadmc0k
AUTHOR                                          Corey Wilson
DATE                              2015-05-28 21:39:52.376000
CONTENT                                           best part﻿
INTERPRET                                              lmfao
CONTAINS_ANCHOR_TAG                                     True
Name: 700, dtype: object

In [44]:
class Lower(BaseEstimator, TransformerMixin):
    """
    Makes CONTENT lowercase.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X["CONTENT"] = X["CONTENT"].str.lower()
        return X
    
# DEMO
print("Before: " + train_X.CONTENT.loc[192][:40])
transformed = Lower().transform(train_X)
print("After: " + transformed.CONTENT.loc[192][:40])

Before: People, here is a new network like FB...
After: people, here is a new network like fb...


In [45]:
class UrlTransformer(BaseEstimator, TransformerMixin):
    """
    Replaces every URL with a single keyword.
    The model can then learn that the presence of this keyword probably indicates spam.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        X["CONTENT"] = X["CONTENT"].str.replace(r"\S*\.com\S*|\S*watch\?\S*", "urllink", regex=True)
        return X
    
# DEMO
print("Before: " + train_X.CONTENT.loc[1573])
print("Before: " + train_X.CONTENT.loc[14])
transformed = UrlTransformer().transform(train_X)
print("After: " + transformed.CONTENT.loc[1573])
print("After: " + transformed.CONTENT.loc[14])

Before: subscribe to my channel  /watch?v=NxK32i0HkDs
Before: please like :D https://premium.easypromosapp.com/voteme/19924/616375350﻿
After: subscribe to my channel  urllink
After: please like :D urllink
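To get a feeling for what the URL pattern does and does not catch, it can be run on a few strings with the standard `re` module. The first sample comes from the demo above; the other two are made-up examples chosen to show a missed link and a false positive.

```python
import re

# Same pattern as in UrlTransformer
URL_PATTERN = re.compile(r"\S*\.com\S*|\S*watch\?\S*")

samples = [
    "subscribe to my channel  /watch?v=NxK32i0HkDs",  # from the dataset demo
    "visit https://example.org/page now",             # made-up: link without ".com"
    "welcome.come and see",                           # made-up: no link at all
]

replaced = [URL_PATTERN.sub("urllink", s) for s in samples]
for before, after in zip(samples, replaced):
    print(f"{before!r} -> {after!r}")
```

The `.org` link is left untouched, while the plain text `welcome.come` is replaced because it happens to contain `.com`. For this dataset the simple pattern may be good enough, but a broader pattern (for example one that also matches `http://` and `https://` prefixes) would be a possible refinement.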


We do not need to transform emoticons, because some word embeddings handle them well:

<div class="alert alert-block alert-danger">Will download <b>1.6 GB</b> large word embeddings model.</div>

In [46]:
vectors = Magnitude(MagnitudeUtils.download_model('fasttext/medium/wiki-news-300d-1M'))
print(vectors.similarity(":)", ":D"))

0.5047033597406592
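The similarity score above is, as far as we understand Magnitude, a cosine similarity between the two embedding vectors. The sketch below shows that computation in plain Python on made-up 3-dimensional vectors (real embeddings here have 300 dimensions); the vector values are purely illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product divided by both norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, not real embeddings
smile = [0.9, 0.1, 0.3]
grin = [0.8, 0.3, 0.1]
unrelated = [-0.2, 0.9, -0.4]

print(cosine_similarity(smile, grin) > cosine_similarity(smile, unrelated))  # True
```

Values close to 1 mean the vectors point in almost the same direction, values near 0 mean they are unrelated, so a score of about 0.5 for `:)` and `:D` suggests the embeddings treat them as related tokens.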


## Feature engineering

In [47]:
# One-hot-encode AUTHOR column
pipeline = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(), ["AUTHOR"]),
        remainder="passthrough",
        sparse_threshold=0
    )
)
# DEMO
pipeline.fit(train_X, train_y)
pipeline.transform(train_X[:1])

array([[0.0, 0.0, 0.0, ..., Timestamp('2015-05-22 11:46:35.988000'),
        'I love this song sooooooooooooooo much\ufeff', 'eminem']],
      dtype=object)

## Possible models and techniques to use:
- [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model)
- [Naive Bayes](https://www.kaggle.com/mohammedakhil/youtube-spam-filter-using-naive-bayes)
- [LSTM](https://keras.io/api/layers/recurrent_layers/lstm/)
- [DecisionTreeClassifier](https://datascience.stackexchange.com/questions/67250/decision-tree-in-sentiment-analysis)
- [Magnitude Word Embeddings](https://colab.research.google.com/drive/1lOcAhIffLW8XC6QsKzt5T_ZqPP4Y9eS4#scrollTo=95Xg9EyU-ZYr)

## Baseline models

In [48]:
train_X, test_X, train_y, test_y = load_train_test_data()

### Dummy Classifier

In [49]:
for strategy, strategy_name in zip(["uniform", "stratified", "most_frequent"],
                                   ["completely random", "proportional", "most frequent"]):
    dummy_clf = DummyClassifier(strategy=strategy)
    dummy_clf.fit(train_X, train_y)
    print(f"{strategy_name}: {dummy_clf.score(test_X, test_y):.2f}")  # DummyClassifier ignores the input features

completely random: 0.50
proportional: 0.48
most frequent: 0.48
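The `most_frequent` score can be checked by hand. The classification reports later in this notebook list 163 ham and 150 spam comments in the test set; assuming the same split is used here, a classifier that always predicts one class gets exactly the share of that class as its accuracy.

```python
# Test-set class counts taken from the classification reports in this notebook
n_ham, n_spam = 163, 150
n_total = n_ham + n_spam

always_spam_accuracy = n_spam / n_total
always_ham_accuracy = n_ham / n_total

print(f"always spam: {always_spam_accuracy:.2f}")  # always spam: 0.48
print(f"always ham:  {always_ham_accuracy:.2f}")   # always ham:  0.52
```

A score of 0.48 therefore suggests that spam is the majority class in the training data while ham is slightly more frequent in the test data. Any real model should be compared against roughly 0.5 on this dataset.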


### Bag of words + Naive Bayes

In [50]:
N_FEATURES = len(CountVectorizer().fit(train_X).vocabulary_)*10

vectorizers = [("Count vect.", CountVectorizer()), 
               ("TF-IDF vect.", TfidfVectorizer()), 
               ("Count binary vect.", CountVectorizer(binary=True)),
               ("Hashing vect.", HashingVectorizer(n_features = N_FEATURES, alternate_sign=False)),
              ]
bayeses = [
    ("ComplementNB", ComplementNB()), 
    ("BernoulliNB", BernoulliNB()), 
    ("MultinomialNB", MultinomialNB())
]

for vectorizer_name, vectorizer in vectorizers:
    for bayes_name, bayes in bayeses:
        pipeline = make_pipeline(
            vectorizer,
            bayes
        )

        pipeline.fit(np.array(train_X), train_y)
        y_pred = pipeline.predict(test_X)

        print(f"{vectorizer_name}; {bayes_name}: {pipeline.score(test_X, test_y):.2f}")

        tn, fp, fn, tp = confusion_matrix(test_y, y_pred).ravel()
        print(f"{vectorizer_name}; {bayes_name} - FP: {fp}, FN: {fn}")

Count vect.; ComplementNB: 0.95
Count vect.; ComplementNB - FP: 8, FN: 8
Count vect.; BernoulliNB: 0.89
Count vect.; BernoulliNB - FP: 4, FN: 29
Count vect.; MultinomialNB: 0.94
Count vect.; MultinomialNB - FP: 10, FN: 8
TF-IDF vect.; ComplementNB: 0.94
TF-IDF vect.; ComplementNB - FP: 8, FN: 10
TF-IDF vect.; BernoulliNB: 0.89
TF-IDF vect.; BernoulliNB - FP: 4, FN: 29
TF-IDF vect.; MultinomialNB: 0.94
TF-IDF vect.; MultinomialNB - FP: 10, FN: 10
Count binary vect.; ComplementNB: 0.95
Count binary vect.; ComplementNB - FP: 8, FN: 9
Count binary vect.; BernoulliNB: 0.89
Count binary vect.; BernoulliNB - FP: 4, FN: 29
Count binary vect.; MultinomialNB: 0.94
Count binary vect.; MultinomialNB - FP: 10, FN: 9
Hashing vect.; ComplementNB: 0.91
Hashing vect.; ComplementNB - FP: 21, FN: 7
Hashing vect.; BernoulliNB: 0.91
Hashing vect.; BernoulliNB - FP: 6, FN: 23
Hashing vect.; MultinomialNB: 0.90
Hashing vect.; MultinomialNB - FP: 24, FN: 7
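The accuracy and the FP/FN counts printed above are two views of the same result. As a quick sanity check, the accuracy can be recomputed from the error counts, assuming the test set has 313 comments (the support reported by the classification reports in this notebook).

```python
TEST_SIZE = 313  # assumed from the classification reports in this notebook

def accuracy_from_errors(fp, fn, total=TEST_SIZE):
    """Accuracy is the share of predictions that are neither false positives nor false negatives."""
    return (total - fp - fn) / total

# (FP, FN) pairs copied from the output above
results = {
    "Count vect.; ComplementNB": (8, 8),
    "Count vect.; BernoulliNB": (4, 29),
    "Hashing vect.; MultinomialNB": (24, 7),
}

for name, (fp, fn) in results.items():
    print(f"{name}: {accuracy_from_errors(fp, fn):.2f}")
```

The FP/FN split matters beyond accuracy: BernoulliNB rarely flags ham as spam (4 false positives) but misses many spam comments (29 false negatives), which may or may not be the preferred trade-off for a spam filter.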


## Our models

### Word embeddings + Keras LSTM

In [51]:
MAX_WORDS = 30 # The maximum number of words the sequence model will consider
STD_DEV = 0.01 # Deviation of noise for Gaussian Noise applied to the embeddings
HIDDEN_UNITS = 100 # The number of hidden units from the LSTM
DROPOUT_RATIO = .8 # The ratio to dropout
BATCH_SIZE = 100 # The number of examples per train/validation step
EPOCHS = 50 # The number of times to repeat through all of the training data
LEARNING_RATE = .01 # The learning rate for the optimizer

<div class="alert alert-block alert-danger">Will download <b>1.6 GB</b> large word embeddings model.</div>

In [52]:
vectors = Magnitude(MagnitudeUtils.download_model('fasttext/medium/wiki-news-300d-1M'), pad_to_length = MAX_WORDS)

In [53]:
train_X, test_X, train_y, test_y = load_train_test_data()

In [54]:
class WordEmbeddings(BaseEstimator, TransformerMixin):
    """
    ! Works with Series, not DataFrame !
    For each row in series, which contains a sentence, 
    it embeds the words in the series into 300 dimensional vectors (one vector for each word).
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        tokenized = [word_tokenize(line) for line in X]
        return vectors.query(tokenized)
    
# DEMO
transformed = WordEmbeddings().transform(train_X[:10])
transformed.shape

(10, 30, 300)
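The middle dimension of the shape is always `MAX_WORDS`, because every comment is brought to the same length before it is embedded. Magnitude does this internally through `pad_to_length`; the function below is only a plain-Python sketch of the idea, with a hypothetical name and `None` standing in for the padding vector.

```python
def pad_or_truncate(tokens, max_words, pad_token=None):
    """Cut the token list to max_words and fill the rest with a padding token."""
    clipped = tokens[:max_words]
    return clipped + [pad_token] * (max_words - len(clipped))

short_comment = ["best", "part"]
long_comment = ["spam"] * 50

print(len(pad_or_truncate(short_comment, 30)))  # 30
print(len(pad_or_truncate(long_comment, 30)))   # 30
```

Words beyond the 30th are therefore invisible to the LSTM, and short comments consist mostly of padding.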

**GaussianNoise** - adds zero-mean Gaussian noise to its input, which helps to mitigate overfitting and can be seen as a form of random data augmentation. Gaussian noise is a natural choice of corruption process for real-valued inputs such as word embeddings.
As it is a regularization layer, it is only active at training time.

**Bidirectional** - a wrapper that runs two LSTMs over the sequence: one reads the input in the forward direction and the other in the backward direction. Their outputs are combined (here concatenated), so each prediction can use context from both sides of the sequence.

**Dropout** - a stochastic regularization technique that randomly sets a fraction of its inputs to zero at every training step. This reduces overfitting, because training effectively samples many different thinned networks and the final model behaves like their combination.
Like GaussianNoise, it is only active at training time.

**Dense** - a fully connected layer: each neuron in the dense layer receives input from all neurons of the previous layer.
Internally, the dense layer performs a matrix-vector multiplication, adds a bias and applies an activation function. The values in the matrix are trainable parameters that are updated by backpropagation.
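To make the last two layers more concrete, here is a minimal plain-Python sketch of dropout and of a single dense unit with a sigmoid activation. The function names, weights and inputs are made up for illustration; Keras implements the same ideas on tensors.

```python
import math
import random

def dropout(inputs, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate`, scale survivors by 1 / (1 - rate)."""
    if not training:
        return list(inputs)  # inactive at prediction time
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in inputs]

def dense_sigmoid(weights, bias, inputs):
    """One dense unit: weighted sum of all inputs plus bias, squashed into (0, 1)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(42)
hidden = [0.5, -1.0, 2.0, 0.25]           # made-up LSTM output
weights = [0.1, 0.2, -0.3, 0.4]           # made-up trainable parameters

dropped = dropout(hidden, rate=0.8, rng=rng)                  # training: most values become 0
prob = dense_sigmoid(weights, 0.0, dropout(hidden, 0.8, rng, training=False))
print(dropped)
print(prob)  # spam probability for this toy input
```

With a rate of 0.8, as in `DROPOUT_RATIO`, only about one value in five survives each training step, which is a fairly aggressive setting.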

In [55]:
def create_model():
    model = Sequential()

    model.add(GaussianNoise(STD_DEV, input_shape=(MAX_WORDS, vectors.dim)))
    model.add(Bidirectional(LSTM(HIDDEN_UNITS, activation='tanh'), merge_mode='concat'))
    model.add(Dropout(DROPOUT_RATIO))
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(
        loss='binary_crossentropy',
        optimizer=Adam(learning_rate=LEARNING_RATE),  # `lr` is a deprecated alias of `learning_rate`
        metrics=['accuracy'])
    
    return model

In [56]:
clf = make_pipeline(
    WordEmbeddings(),
    KerasClassifier(create_model, epochs=5, batch_size=32, validation_split=0.1)
)

clf.fit(train_X, train_y)
clf.score(test_X, test_y)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


0.920127809047699

In [57]:
lstm_pred_y = (clf.predict(test_X) > 0.5).astype("int32")
print("Confusion Matrix")
print(confusion_matrix(test_y, lstm_pred_y))
print("\nClassification Report")
print(classification_report(test_y, lstm_pred_y, target_names=["Ham", "Spam"]))



Confusion Matrix
[[149  14]
 [ 11 139]]

Classification Report
              precision    recall  f1-score   support

         Ham       0.93      0.91      0.92       163
        Spam       0.91      0.93      0.92       150

    accuracy                           0.92       313
   macro avg       0.92      0.92      0.92       313
weighted avg       0.92      0.92      0.92       313



### Support Vector Classifier

In [58]:
train_X, test_X, train_y, test_y = load_train_test_all_cols_data()

In [59]:
svc = make_pipeline(
    ExplorativeTransformer(),
    ColumnSelector(["SUSPICIOUS_WORDS_COUNT", "NULL_IN_DATE_TIME", "HAS_LINK", "NOT_UNIQUE_AUTHOR"]),
    SVC(),
)

svc.fit(train_X, train_y)
svc_pred_y = svc.predict(test_X)
print("SVC accuracy score :" + str(accuracy_score(test_y, svc_pred_y)))

SVC accuracy score :0.8785942492012779


In [60]:
print("Confusion Matrix")
print(confusion_matrix(test_y, svc_pred_y))
print("\nClassification Report")
print(classification_report(test_y, svc_pred_y, target_names=["Ham", "Spam"]))

Confusion Matrix
[[159   4]
 [ 34 116]]

Classification Report
              precision    recall  f1-score   support

         Ham       0.82      0.98      0.89       163
        Spam       0.97      0.77      0.86       150

    accuracy                           0.88       313
   macro avg       0.90      0.87      0.88       313
weighted avg       0.89      0.88      0.88       313
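The numbers in the classification report follow directly from the confusion matrix. The short sketch below recomputes precision, recall and F1 for both classes from the four counts printed above, using the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions for one class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrix of the SVC: rows = true (ham, spam), columns = predicted (ham, spam)
(tn, fp), (fn, tp) = (159, 4), (34, 116)

spam = precision_recall_f1(tp=tp, fp=fp, fn=fn)
ham = precision_recall_f1(tp=tn, fp=fn, fn=fp)  # roles swap when ham is the positive class

print("Spam: " + ", ".join(f"{v:.2f}" for v in spam))  # Spam: 0.97, 0.77, 0.86
print("Ham:  " + ", ".join(f"{v:.2f}" for v in ham))   # Ham:  0.82, 0.98, 0.89
```

The low spam recall (0.77) comes from the 34 spam comments that the SVC let through as ham.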



### Decision Tree Classifier

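Steps:
* splitting the feature space by learning decision rules
* removing unnecessary splits (pruning)
* predicting the majority class of the leaf that an instance falls into

A decision tree chooses its splits using these concepts: pure and impure nodes, impurity measures, and information gain.
A node is pure when all of its training samples belong to one class. For each candidate split, the impurity of the resulting child nodes is measured, and the split with the largest gain (the largest decrease in impurity) is chosen.
To classify an instance, the tree is traversed from the root, following at each node the branch whose rule the instance satisfies, until a leaf is reached.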
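The impurity and gain computation can be sketched in a few lines of plain Python. The example uses Gini impurity, which is the default criterion of scikit-learn's `DecisionTreeClassifier`; the toy labels and the split on a hypothetical "contains the word subscribe" rule are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a balanced two-class node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Gain of a split: parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Made-up toy node with 4 spam and 4 ham comments
parent = ["spam"] * 4 + ["ham"] * 4
left = ["spam"] * 4 + ["ham"] * 1   # hypothetical rule: comment contains "subscribe"
right = ["ham"] * 3                 # comment does not contain it (pure node)

print(gini(parent))
print(impurity_decrease(parent, left, right))
```

The tree evaluates such a gain for every candidate rule and keeps the one with the highest value. With `criterion="entropy"` the same procedure uses entropy instead of Gini, which gives the classical information gain.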

In [61]:
class WordEmbeddingsDF(BaseEstimator, TransformerMixin):
    """
    For each column and each row, computes word embeddings of the value tokenized with nltk.word_tokenize.
    The mean of these word embeddings is then computed, giving a 300-dimensional vector for each row.
    This vector is then appended to the dataframe as 300 new columns per source column.
    """
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        # Iterate over a snapshot of the original columns, because new columns are added inside the loop
        for col in list(X.columns):
            tokenized = [word_tokenize(line) for line in X[col]]
            mean = np.mean(vectors.query(tokenized), axis=1)
            for i in range(mean.shape[1]):
                # Prefix with the source column name so that several columns do not overwrite each other
                X[f"{col}_{i}_EMBEDDED"] = mean[:, i]

        return X
    
transformed = WordEmbeddingsDF().transform(train_X[["CONTENT"]])
transformed.shape

(1249, 301)

In [62]:
dtc = make_pipeline(
    CountVectorizer(stop_words="english"),
    DecisionTreeClassifier()
)

dtc.fit(train_X["CONTENT"], train_y)
print("DTC accuracy score :" + str(dtc.score(test_X["CONTENT"], test_y)))

DTC accuracy score :0.9648562300319489


In [63]:
dtc_pred_y = dtc.predict(test_X["CONTENT"])
print("Confusion Matrix")
print(confusion_matrix(test_y, dtc_pred_y))
print("\nClassification Report")
print(classification_report(test_y, dtc_pred_y, target_names=["Ham", "Spam"]))

Confusion Matrix
[[159   4]
 [  7 143]]

Classification Report
              precision    recall  f1-score   support

         Ham       0.96      0.98      0.97       163
        Spam       0.97      0.95      0.96       150

    accuracy                           0.96       313
   macro avg       0.97      0.96      0.96       313
weighted avg       0.96      0.96      0.96       313



## Summary:
**Baseline Naive Bayes** - 0.89–0.95

**LSTM** - 0.92

This model uses the Magnitude library to convert words into vectors of numbers (word embeddings). This conversion, combined with the LSTM, preserves the order of words.
However, an MLP with CountVectorizer, which does not preserve the word order, performed slightly better (0.95). This indicates that the architecture and hyperparameters of the LSTM still need some fine-tuning.

**SVC** - 0.88

This result was reached with our `ExplorativeTransformer`. The SVM classifier was also tried with a CountVectorizer and, interestingly, it performed better with stop words included than without them (0.94 vs 0.92).

**Tree Classifier** - 0.96

Considering the relative simplicity of this model, it performed rather well on the data set.

We have prepared many transformers, which could be used to create new columns.
We could then try different combinations of these columns and select the best, perhaps by using SelectKBest.

However, considering the success of even the baseline model, and the success of other models on just the vectorized comments, the performance on this dataset would probably increase only marginally.
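Trying different combinations of columns could look roughly like the sketch below. It is an exhaustive search over column subsets, not SelectKBest, and it is only feasible for a handful of columns. The `evaluate` callback is a hypothetical stand-in for a cross-validated accuracy of a pipeline trained on the given columns; the scores in `toy_evaluate` are made-up numbers, not measured results.

```python
from itertools import combinations

def best_column_subset(columns, evaluate, max_size=None):
    """Score every non-empty subset of columns and return the best one with its score."""
    max_size = max_size or len(columns)
    best_score, best_subset = float("-inf"), ()
    for size in range(1, max_size + 1):
        for subset in combinations(columns, size):
            score = evaluate(subset)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score

columns = ["SUSPICIOUS_WORDS_COUNT", "NULL_IN_DATE_TIME", "HAS_LINK", "NOT_UNIQUE_AUTHOR"]

def toy_evaluate(subset):
    """Hypothetical scoring function with made-up gains and a small penalty per column."""
    made_up_gain = {
        "SUSPICIOUS_WORDS_COUNT": 0.30,
        "NULL_IN_DATE_TIME": 0.01,
        "HAS_LINK": 0.20,
        "NOT_UNIQUE_AUTHOR": 0.02,
    }
    return 0.5 + sum(made_up_gain[c] for c in subset) - 0.03 * len(subset)

subset, score = best_column_subset(columns, toy_evaluate)
print(subset)
```

In a real experiment `evaluate` would fit the model on the selected columns and return a validation score, ideally measured on data that is separate from the final test set.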