# Context

Comment spam has grown sharply in recent years on the channels of many major YouTube creators.

Several of them have expressed frustration with malicious comments that impersonate them in order to scam their viewers.

This makes spam detection an important problem to solve.

To address it, I use a machine learning based approach.

**Outline:**
1. Data Collection and Exploration
2. Data Preprocessing for ML
3. Modeling and Validation
4. Choice of the King Model
5. Model Saving


# Importing Libraries

In this section, I import all the libraries needed.

In [1]:
from flask import Flask, render_template, url_for, request
import pandas as pd
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import svm
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier




# 1. Data Collection and Exploration

## 1.1 Data Collection

In this article, I use the YouTube Spam Collection Data Set, a public dataset of comments collected for spam research, available __[here](https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection)__. It consists of five datasets totalling 1,956 real messages, extracted from five videos that were among the 10 most viewed during the collection period. I will combine the five datasets into one.

In [12]:
# Load the five datasets, one per video
df1 = pd.read_csv("Youtube01-Psy.csv")
df2 = pd.read_csv("Youtube02-KatyPerry.csv")
df3 = pd.read_csv("Youtube03-LMFAO.csv")
df4 = pd.read_csv("Youtube04-Eminem.csv")
df5 = pd.read_csv("Youtube05-Shakira.csv")

# Concatenate the five datasets into one
df = pd.concat([df1, df2, df3, df4, df5])
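As written, `pd.concat` keeps each file's own row index, so the index labels 0, 1, 2, ... repeat once per file. The sketch below is one possible variant, not the code used in this notebook: it rebuilds a unique index with `ignore_index=True` and records which file each row came from. The in-memory CSVs, the `SOURCE` column and the `video_a` / `video_b` names are made up for illustration.

```python
import io

import pandas as pd

# Two tiny in-memory CSVs standing in for the real files (made-up rows).
csv_a = io.StringIO("CONTENT,CLASS\nnice song,0\ncheck my channel,1\n")
csv_b = io.StringIO("CONTENT,CLASS\nlove this video,0\n")

frames = {"video_a": pd.read_csv(csv_a), "video_b": pd.read_csv(csv_b)}

# Tag each row with its (hypothetical) source, then stack the frames.
# ignore_index=True rebuilds a unique 0..n-1 index instead of repeating 0, 1, ...
combined = pd.concat(
    [frame.assign(SOURCE=name) for name, frame in frames.items()],
    ignore_index=True,
)

print(combined)
print(combined.index.is_unique)
```

A unique index is not required for the rest of this notebook, but it avoids surprises if rows are later selected by label.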




## 1.2 Data Exploration

In [13]:
# Display the column names
list(df.columns)

['COMMENT_ID', 'AUTHOR', 'DATE', 'CONTENT', 'CLASS']

In [14]:
# Show the first five rows of the data
df.head()

| | COMMENT_ID | AUTHOR | DATE | CONTENT | CLASS |
|---|---|---|---|---|---|
| 0 | LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU | Julius NM | 2013-11-07T06:20:48 | Huh, anyway check out this you[tube] channel: ... | 1 |
| 1 | LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A | adam riyati | 2013-11-07T12:37:15 | Hey guys check out my new channel and our firs... | 1 |
| 2 | LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8 | Evgeny Murashkin | 2013-11-08T17:34:21 | just for test I have to say murdev.com | 1 |
| 3 | z13jhp0bxqncu512g22wvzkasxmvvzjaz04 | ElNino Melendez | 2013-11-09T08:28:43 | me shaking my sexy ass on my channel enjoy ^_^ | 1 |
| 4 | z13fwbwp1oujthgqj04chlngpvzmtt3r3dw | GsMega | 2013-11-10T16:05:38 | watch?v=vtaRGgvGtWQ Check this out . | 1 |


In [15]:
# Show the dataset dimensions (rows, columns)
df.shape

(1956, 5)

In [16]:
# Count the null values in each column
df.isnull().sum()

COMMENT_ID      0
AUTHOR          0
DATE          245
CONTENT         0
CLASS           0
dtype: int64

**Note:** The DATE column has 245 null values and the other columns have none. Since DATE will not be used by the model, I do not handle its null values.

In [17]:
# Keep only the columns needed for the model
df_data = df[['CONTENT', 'CLASS']]
df_data.head()

| | CONTENT | CLASS |
|---|---|---|
| 0 | Huh, anyway check out this you[tube] channel: ... | 1 |
| 1 | Hey guys check out my new channel and our firs... | 1 |
| 2 | just for test I have to say murdev.com | 1 |
| 3 | me shaking my sexy ass on my channel enjoy ^_^ | 1 |
| 4 | watch?v=vtaRGgvGtWQ Check this out . | 1 |


**Note:** I keep only the CONTENT and CLASS columns because the other columns are not useful for the model.

In [18]:
# Display the distribution of the CLASS values (1 = spam, 0 = not spam)
counts = df_data["CLASS"].value_counts()
counts

1    1005
0     951
Name: CLASS, dtype: int64

**Note:** The two classes are roughly balanced (1,005 spam comments against 951 non-spam comments), which is a good starting point.
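To put a number on that balance, the class counts can be turned into proportions. This is a small sketch that simply re-enters the two counts from the output above rather than reading the dataset; `proportions` and `imbalance_ratio` are names chosen here for illustration.

```python
import pandas as pd

# Class counts copied from the output above (1 = spam, 0 = not spam).
class_counts = pd.Series({1: 1005, 0: 951}, name="CLASS")

# Convert the counts to proportions of the whole dataset.
proportions = class_counts / class_counts.sum()
print(proportions.round(3))

# Ratio of the larger class to the smaller one; values near 1 mean balanced.
imbalance_ratio = class_counts.max() / class_counts.min()
print(round(imbalance_ratio, 3))
```

On the real DataFrame, `df_data["CLASS"].value_counts(normalize=True)` gives the same proportions directly.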

# 2. Data Preprocessing for Machine Learning

In [19]:
# Extract the features 
df_x = df_data['CONTENT']
df_x.head()


0    Huh, anyway check out this you[tube] channel: ...
1    Hey guys check out my new channel and our firs...
2               just for test I have to say murdev.com
3     me shaking my sexy ass on my channel enjoy ^_^ ﻿
4              watch?v=vtaRGgvGtWQ   Check this out .﻿
Name: CONTENT, dtype: object

In [20]:
# Extract the labels
df_y = df_data.CLASS
# Show the first five rows
df_y.head()

0    1
1    1
2    1
3    1
4    1
Name: CLASS, dtype: int64

In [21]:
# Encode the comments as word counts with CountVectorizer
corpus = df_x
cv = CountVectorizer()
X = cv.fit_transform(corpus)

# Check the shape of X: (number of comments, vocabulary size)
X.shape

(1956, 4454)
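To make the encoding concrete: each row of `X` is a comment and each column is a word of the vocabulary, holding the number of times that word appears in the comment. The sketch below shows this on three made-up comments; `toy_corpus`, `toy_cv` and `toy_X` are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up comments, used only to illustrate the encoding.
toy_corpus = [
    "check out my channel",
    "great song",
    "check out this great channel",
]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus)  # sparse matrix: rows = comments, columns = words

# The learned vocabulary; column indices follow the alphabetical order of the words.
print(sorted(toy_cv.vocabulary_))

# Dense view of the counts, for display only; keep the matrix sparse on real data.
print(toy_X.toarray())
```

With the default settings, the text is lowercased and split into tokens of at least two characters, so the toy vocabulary contains seven words.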

In [31]:
# Split the data into train (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, df_y, test_size=0.25, random_state=42)
# Display names for the labels, in label order: 0 -> 'No Spam', 1 -> 'Spam'
target_names = ['No Spam', 'Spam']
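In this notebook the vectorizer is fitted on the whole corpus before the split, so the vocabulary also contains words that only occur in test comments. A possible alternative, sketched here on made-up data and not the approach used above, is to split the raw text first and let a scikit-learn `Pipeline` fit the vectorizer on the training part only. The comments, labels and variable names below are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up comments and labels (1 = spam, 0 = not spam), for illustration only.
texts = [
    "check out my channel", "subscribe to my channel please",
    "free gift card click here", "visit my website now",
    "i love this song", "best video ever",
    "this song is amazing", "great voice and great video",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the class ratio in both parts.
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training comments only,
# then reuses that vocabulary to encode the test comments.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(text_train, y_tr)

predictions = spam_pipeline.predict(text_test)
print(len(text_train), len(text_test))
print(predictions)
```

A side benefit of this design is that the pipeline accepts raw strings, so the same object can be used later to classify new comments.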

# 3. Modeling and Validation

In this part, I train and evaluate the following algorithms: Naive Bayes, Logistic Regression, Random Forest and SVM.


##  3.1 Naive Bayes

In [32]:
# Train with Naive Bayes
NaiveBayes = MultinomialNB()
NaiveBayes.fit(X_train, y_train)

# Predict on the test set and compute the accuracy
y_predNB = NaiveBayes.predict(X_test)
accuracy_NB = NaiveBayes.score(X_test, y_test)

# Print the classification report
print(classification_report(y_test, y_predNB, target_names=target_names))

             precision    recall  f1-score   support

    No Spam       0.92      0.92      0.92       221
       Spam       0.94      0.93      0.93       268

avg / total       0.93      0.93      0.93       489
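For reference, the precision, recall and f1-score columns of the report can be derived from the confusion matrix. The sketch below recomputes them by hand for the Spam class on ten made-up predictions; `y_true` and `y_hat` are illustrative and unrelated to the results of this notebook.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels and predictions (1 = spam, 0 = not spam), for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
y_hat = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of real spam that was detected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The hand-computed values match scikit-learn's own metric functions.
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
print(f1, f1_score(y_true, y_hat))
```

For a spam filter, precision reflects how often a flagged comment is really spam, while recall reflects how much of the spam is caught.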

##  3.2 Logistic Regression 

In [33]:
# Train with Logistic Regression
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train)

# Predict on the test set and compute the accuracy
y_predRL = logisticRegr.predict(X_test)
accuracy_RL = logisticRegr.score(X_test, y_test)

# Print the classification report
print(classification_report(y_test, y_predRL, target_names=target_names))

             precision    recall  f1-score   support

    No Spam       0.91      0.98      0.95       221
       Spam       0.98      0.92      0.95       268

avg / total       0.95      0.95      0.95       489

## 3.3 Random Forest

In [34]:
# Train with Random Forest
RandomForest = RandomForestClassifier(n_estimators=100)
RandomForest.fit(X_train, y_train)

# Predict on the test set and compute the accuracy
y_predRF = RandomForest.predict(X_test)
accuracy_RF = RandomForest.score(X_test, y_test)

# Print the classification report
print(classification_report(y_test, y_predRF, target_names=target_names))

             precision    recall  f1-score   support

    No Spam       0.92      0.98      0.95       221
       Spam       0.98      0.93      0.96       268

avg / total       0.96      0.96      0.96       489

## 3.4 SVM

In [35]:
# Train with SVM
SVM = svm.SVC()
SVM.fit(X_train, y_train)

# Predict on the test set and compute the accuracy
y_predSVM = SVM.predict(X_test)
accuracy_SVM = SVM.score(X_test, y_test)

# Print the classification report
print(classification_report(y_test, y_predSVM, target_names=target_names))

             precision    recall  f1-score   support

    No Spam       0.50      1.00      0.67       221
       Spam       1.00      0.18      0.31       268

avg / total       0.78      0.55      0.47       489

# 4. Choice of the King Model

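My main goal is to detect spam YouTube comments effectively, that is, comments with CLASS = 1. On the Spam class, Random Forest (f1-score 0.96) and Logistic Regression (f1-score 0.95) perform best, ahead of Naive Bayes (0.93) and far ahead of SVM (0.31). I keep Random Forest, which has the highest f1-score.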

# 5. Model Saving

In [46]:
# Save the trained Random Forest model to disk
with open("MODEL.pickle", "wb") as output_file:
    pickle.dump(RandomForest, output_file)
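The saved model expects inputs encoded with the same vocabulary as the training data, so the fitted `CountVectorizer` is needed again at prediction time. The sketch below shows one way to handle this, under the assumption that both objects are stored in a single pickle file; the training comments, the `spam_bundle.pickle` file name and the dictionary keys are made up for illustration.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training comments (1 = spam, 0 = not spam), for illustration only.
demo_texts = ["check out my channel", "subscribe to my channel",
              "i love this song", "best video ever"]
demo_labels = [1, 1, 0, 0]

demo_cv = CountVectorizer()
demo_model = RandomForestClassifier(n_estimators=10, random_state=0)
demo_model.fit(demo_cv.fit_transform(demo_texts), demo_labels)

# Save the vectorizer and the model together; the file name is hypothetical.
bundle_path = os.path.join(tempfile.mkdtemp(), "spam_bundle.pickle")
with open(bundle_path, "wb") as output_file:
    pickle.dump({"vectorizer": demo_cv, "model": demo_model}, output_file)

# Reload the bundle and classify a new comment with the same vocabulary.
with open(bundle_path, "rb") as input_file:
    bundle = pickle.load(input_file)

new_X = bundle["vectorizer"].transform(["please check my channel"])
reloaded_prediction = bundle["model"].predict(new_X)
print(reloaded_prediction)
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.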