## Introduction

YouTube is one of the most popular video-sharing platforms, with more than 1 billion users. Users have long been frustrated by the overwhelming amount of spam in the comment section. In 2012, users created a petition asking YouTube to provide tools for dealing with undesired content. In 2013 the spam problem got worse when Google overhauled the YouTube comment system to connect it to Google+, which allowed users to post links. This attracted more malicious users who promote their own videos through the platform. This project builds a spam filter that automatically detects spam comments.



## Data Wrangling

In this part we will:

    1: load the data into the notebook
    2: check for any missing values

In [1]:
import pandas as pd


In [2]:
s1=pd.read_csv('./data/Youtube01-Psy.csv',encoding = 'utf-8-sig')
s2=pd.read_csv('./data/Youtube02-KatyPerry.csv',encoding = 'utf-8-sig')
s3=pd.read_csv('./data/Youtube03-LMFAO.csv',encoding = 'utf-8-sig')
s4=pd.read_csv('./data/Youtube04-Eminem.csv',encoding = 'utf-8-sig')
s5=pd.read_csv('./data/Youtube05-Shakira.csv',encoding = 'utf-8-sig')
s1['song']='Psy'
s2['song']='KatyPerry'
s3['song']='LMFAO'
s4['song']='Eminem'
s5['song']='Shakira'

In [3]:
df=pd.concat([s1,s2,s3,s4,s5])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1956 entries, 0 to 369
Data columns (total 6 columns):
COMMENT_ID    1956 non-null object
AUTHOR        1956 non-null object
DATE          1711 non-null object
CONTENT       1956 non-null object
CLASS         1956 non-null int64
song          1956 non-null object
dtypes: int64(1), object(5)
memory usage: 107.0+ KB


In [5]:
df.head()

|   | COMMENT_ID | AUTHOR | DATE | CONTENT | CLASS | song |
|---|---|---|---|---|---|---|
| 0 | LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU | Julius NM | 2013-11-07T06:20:48 | Huh, anyway check out this you[tube] channel: ... | 1 | Psy |
| 1 | LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A | adam riyati | 2013-11-07T12:37:15 | Hey guys check out my new channel and our firs... | 1 | Psy |
| 2 | LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8 | Evgeny Murashkin | 2013-11-08T17:34:21 | just for test I have to say murdev.com | 1 | Psy |
| 3 | z13jhp0bxqncu512g22wvzkasxmvvzjaz04 | ElNino Melendez | 2013-11-09T08:28:43 | me shaking my sexy ass on my channel enjoy ^_^ | 1 | Psy |
| 4 | z13fwbwp1oujthgqj04chlngpvzmtt3r3dw | GsMega | 2013-11-10T16:05:38 | watch?v=vtaRGgvGtWQ Check this out . | 1 | Psy |


In [32]:
df.groupby('CLASS').size()

CLASS
0     951
1    1005
dtype: int64


The dataset contains 1956 comments in total, collected from the videos of five popular songs: 951 ham and 1005 spam. The DATE column has missing values (only 1711 of 1956 entries are non-null), but we will focus mostly on the CONTENT column, so this will not affect us. The CLASS column is the label: 1 means spam and 0 means ham.
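The missing-value check can also be made explicit per column. A minimal sketch on a small made-up frame (the rows and the name `toy` are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for df; the rows are invented for illustration.
toy = pd.DataFrame({
    "DATE": ["2013-11-07T06:20:48", None, "2013-11-08T17:34:21"],
    "CONTENT": ["check out my channel", "love this song", "nice video"],
    "CLASS": [1, 0, 0],
})

# Count missing values in every column.
missing = toy.isna().sum()
print(missing)
```

Going by the `df.info()` output above, the same call on `df` should report 1956 - 1711 = 245 missing DATE values and none in the other columns.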


## Data Analysis

We start the data analysis by asking some basic questions on the dataset:

    *is the average comment length different for ham and spam?
    *can we identify suspicious user accounts that seem related to spam?
    *are spam comments more likely to contain a URL?
    *do spam comments correlate with time? (A fake account may be set up to send out spam comments periodically.)


In [6]:
## check the length of the comment.

test= df['CONTENT'].iloc[100]

In [7]:
# first expand contractions (e.g. "aren't" becomes "are not") so that each one is counted as two words

import contractions

def contraction_expand(text):
    return contractions.fix(text)

df['CONTENT']=df['CONTENT'].apply(contraction_expand)

In [8]:

# tokenize words

import nltk
df['words']=df['CONTENT'].apply(nltk.word_tokenize)

# calculate comment length in characters (len of the raw string, not the token count)

df['length']=df['CONTENT'].apply(len)

df.groupby('CLASS')['length'].agg(['mean','std','min','max'])

| CLASS | mean | std | min | max |
|---|---|---|---|---|
| 0 | 49.82755 | 56.526731 | 2 | 755 |
| 1 | 137.769154 | 159.459172 | 10 | 1200 |


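Spam comments are on average much longer than ham comments: about 138 characters versus about 50. Their lengths also vary more (standard deviation of about 159 versus about 57).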

In [9]:
## check whether a comment contains a URL
## TODO: also detect YouTube-specific links

import re

def URL(string):
    # match "http://" or "https://" anywhere in the comment
    pattern=r'https?://'
    return bool(re.search(pattern,string))

df['URL']=df['CONTENT'].apply(URL)
sum(df['URL'])/len(df['URL'])
#df.head()

0.10071574642126789

In [10]:

df.groupby('CLASS')['URL'].agg(sum)

CLASS
0     11.0
1    186.0
Name: URL, dtype: float64

About 10% of the comments contain a URL, and almost all of them (186 of 197) are spam.
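The note in the URL cell about also catching YouTube links could be handled by widening the pattern. The sketch below is only a guess at what such a pattern might look like: `URL_PATTERN` and `has_url` are hypothetical names, and the alternatives (`www.` and `watch?v=`) are assumptions, not a validated URL detector.

```python
import re

# Assumed pattern: http(s) links, "www." links, and bare YouTube
# "watch?v=" fragments. Illustrative only; bare domains are not matched.
URL_PATTERN = re.compile(r"https?://|www\.|watch\?v=", re.IGNORECASE)

def has_url(text):
    """Return True if the text contains something that looks like a link."""
    return bool(URL_PATTERN.search(text))

examples = [
    "watch?v=vtaRGgvGtWQ Check this out .",    # from df.head() above
    "just for test I have to say murdev.com",  # bare domain, not caught
    "see https://example.com for more",        # made-up example
]
flags = [has_url(t) for t in examples]
print(flags)
```

The second example shows the limit of this approach: a bare domain such as `murdev.com` still slips through, so the URL flag is best treated as one signal among several.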

In [11]:
# calculate the fraction of capital letters (A-Z) in a comment

def capital_letters(string):
    return (len(re.findall('[A-Z]',string))/len(string))

df['capital']=df['CONTENT'].apply(capital_letters)
df['capital'].head()

0    0.017857
1    0.119760
2    0.026316
3    0.000000
4    0.153846
Name: capital, dtype: float64

In [12]:
df.groupby('CLASS')['capital'].agg(['mean','std','min','max'])

| CLASS | mean | std | min | max |
|---|---|---|---|---|
| 0 | 0.090289 | 0.161128 | 0.0 | 1.0 |
| 1 | 0.108451 | 0.173301 | 0.0 | 0.919355 |


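The mean fraction of capital letters is slightly higher for spam (0.108) than for ham (0.090), but the spread within each class is large. TODO: perform a significance test here to check whether the difference is real.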
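One possible way to run that test is Welch's t-test, which does not assume equal variances in the two groups. This is a sketch on synthetic numbers, not the notebook's result: the arrays `ham` and `spam` below are randomly generated stand-ins, and on the real data they would be the `capital` column filtered by `CLASS`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for the capital-letter fractions of each class.
# On the real data: df.loc[df['CLASS'] == 0, 'capital'] and
# df.loc[df['CLASS'] == 1, 'capital'].
ham = rng.beta(1, 9, size=200)
spam = rng.beta(1, 8, size=200)

# Welch's t-test: two-sample t-test without the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(ham, spam, equal_var=False)
print(t_stat, p_value)
```

Because the fractions are bounded between 0 and 1 and heavily skewed, a rank-based test such as `scipy.stats.mannwhitneyu` may be a reasonable alternative.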

In [13]:
df.groupby('CLASS')['AUTHOR'].value_counts()

CLASS  AUTHOR               
0      5000palo                 7
       Marshmallow Kingdom      3
       Seth Ryan                3
       Alain Bruno              2
       Athena Gomez             2
       BigBird Larry            2
       Brian Brai               2
       Chris Madzier            2
       D Maw                    2
       Eric Gonzalez            2
       Juan Martinez            2
       LaiLa Steudle            2
       LiveLikeLien x           2
       Naga Berapi              2
       Paul Crowder             2
       Pepe The Meme King       2
       Sonny Carter             2
       The Technology Zoo       2
       Warcorpse666             2
       janet rangel             2
       lol Ippocastano          2
       tyler sleetway           2
          Berty  Winata         1
       Aarjav Parmar            1
       Abdinasir Omar           1
       Abdou Abdou              1
       Abdullah Alawani         1
       Abhi Vats                1
       Abhishek Kum

There are 1793 unique user names in this dataset. A small fraction of users commented more than once, but all of the users who posted spam commented only once. Some spam accounts have names containing non-English characters, which suggests they may come from non-English-speaking countries. Next we count how many such user names appear in the spam and ham groups.

In [23]:
# flag user names that contain only ASCII characters (True) or at least one non-ASCII character (False)

def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
    
df['user_isEnglish']=df['AUTHOR'].apply(isEnglish)
df.groupby('CLASS')['user_isEnglish'].value_counts()

CLASS  user_isEnglish
0      True              901
       False              50
1      True              956
       False              49
Name: user_isEnglish, dtype: int64

In [31]:
df[(df['CLASS']==1) & (df['user_isEnglish']==False)]['AUTHOR']

90                               Никита Безухов
91                             Михаил Панкратов
95                                    Олег Пась
175                                 David Boček
192                              Uroš Slemenjak
210                          O sábio das 7 eras
333                           Александр Федоров
337                               Tofik Miedzyń
20                                Cléda Dimitri
40                                   Mai Nguyễn
80                                 Nicolás Jara
83                                Mättr Valleni
179                     Mehmet Ertuğrul Tohumcu
248                                احمد الهوارى
282                            Nedim Alp SEÇGEL
302                          Quinho Divulgaçoes
308                              Uroš Slemenjak
175    TelePricol - FUNNY VIDEOS,ЛУЧШИЕ ПРИКОЛЫ
205                     Synchronized™ Nightcore
210             Almohtarif Info | المحترف انفو‎
227                                Nerdy

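Non-English user names are about equally common in both classes (50 of 951 ham authors versus 49 of 1005 spam authors), so the character set of a user name does not help to identify suspicious accounts.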

In [None]:
# extract hour from time data
# explore the distribution of # of spam/ham among hours
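The two steps noted above could look roughly like this. It is a sketch on a made-up frame (`toy` and its rows are invented), assuming DATE holds ISO-style timestamps as in the `df.head()` output and that rows without a date are simply left out.

```python
import pandas as pd

# Toy frame standing in for df; timestamps and labels are invented.
toy = pd.DataFrame({
    "DATE": ["2013-11-07T06:20:48", "2013-11-07T12:37:15", None, "2013-11-08T06:34:21"],
    "CLASS": [1, 1, 0, 0],
})

# Drop rows with a missing date, then extract the hour of day.
valid = toy.dropna(subset=["DATE"]).copy()
valid["hour"] = pd.to_datetime(valid["DATE"]).dt.hour

# Count ham (0) and spam (1) comments for each hour.
by_hour = pd.crosstab(valid["hour"], valid["CLASS"])
print(by_hour)
```

On the real data this leaves out the comments without a DATE value, so any pattern found only describes the dated subset.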



## Machine learning model

In this section we are going to build classifiers to filter out spam. 

As a first step, let's start with something simple: a bag-of-words representation fed into a naive Bayes classifier.

In [50]:
# create a bag of words
# train with naive Bayes

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

X=df.drop(['CLASS'],axis=1)
y=df['CLASS']
# default split: 75% train, 25% test (no random_state, so results vary between runs)
X_train,X_test,y_train,y_test=train_test_split(X,y)

pl=Pipeline([('vec',CountVectorizer()),('clf',MultinomialNB())])
pl.fit(X_train['CONTENT'],y_train)

# predict once and reuse the predictions for both metrics
y_pred=pl.predict(X_test['CONTENT'])
print("the confusion matrix: \n", confusion_matrix(y_test,y_pred))
print("the accuracy score is ", accuracy_score(y_test,y_pred))


the confusion matrix: 
 [[221  29]
 [ 11 228]]
the accuracy score is  0.918200408997955
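Accuracy alone hides which kind of mistake the filter makes. The short calculation below reads the confusion matrix printed above using scikit-learn's convention (rows are true classes, columns are predicted classes, with 0 = ham and 1 = spam) and derives precision and recall for the spam class.

```python
# Values copied from the confusion matrix above.
tn, fp = 221, 29   # true ham: predicted ham / predicted spam
fn, tp = 11, 228   # true spam: predicted ham / predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

In this run the model misses 11 of 239 spam comments but flags 29 of 250 ham comments as spam, so precision (about 0.887) is lower than recall (about 0.954). For a comment filter, wrongly hiding legitimate comments may be the more costly error.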


In [None]:
# using logistic regression
# use cross-validation to tune hyperparameters

from sklearn.linear_model import LogisticRegression
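The cell above is still a stub. One way it might be filled in is a pipeline tuned with `GridSearchCV`. The sketch below runs on a tiny made-up corpus so that it is self-contained; the texts, the grid of `C` values, and `cv=2` are all assumptions chosen for illustration. On the real data, `X_train['CONTENT']` and `y_train` would take the place of `texts` and `labels`, with a larger `cv`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 1 = spam, 0 = ham.
texts = [
    "check out my channel and subscribe",
    "subscribe to my channel for free gifts",
    "visit my page and subscribe now",
    "free gift cards check my channel",
    "I love this song so much",
    "this video never gets old",
    "best song of the year",
    "her voice is amazing in this video",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pl_lr = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune the inverse regularization strength C with cross-validation.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pl_lr, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

After fitting, `search` behaves like the best pipeline refit on all training data, so `search.predict(...)` can be used directly on held-out comments.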



In [None]:
# a bag of words creates a lot of features, and SVMs are known to work well in high dimensions. Let's try an SVM.
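A linear SVM is a common choice for sparse bag-of-words features. A minimal sketch follows, again on a made-up corpus rather than the real comments; `LinearSVC` with `C=1.0` is an assumption, not a tuned setting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy corpus: 1 = spam, 0 = ham.
texts = [
    "check out my channel and subscribe",
    "subscribe to my channel for free gifts",
    "visit my page and subscribe now",
    "free gift cards check my channel",
    "I love this song so much",
    "this video never gets old",
    "best song of the year",
    "her voice is amazing in this video",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Same pipeline shape as the naive Bayes model, with the classifier swapped.
pl_svm = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LinearSVC(C=1.0)),
])
pl_svm.fit(texts, labels)

pred = pl_svm.predict(["subscribe to my channel please"])
print(pred)
```

Keeping the same pipeline structure means the SVM can be evaluated with the same confusion matrix and accuracy code used for naive Bayes.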



In [None]:
# next we explore whether adding more features helps
# add the length of the comment
# add whether it contains a URL
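Combining the word counts with the hand-made features could be done with a `ColumnTransformer`. The sketch below is one possible layout, not the notebook's final model: the frame `toy`, its rows, and the choice of logistic regression are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy frame standing in for df: 1 = spam, 0 = ham.
toy = pd.DataFrame({
    "CONTENT": [
        "check out my channel http://example.com/abc",
        "subscribe here https://example.com/free-gifts now",
        "free gift cards check my channel",
        "I love this song",
        "this video never gets old",
        "best song of the year",
    ],
    "CLASS": [1, 1, 1, 0, 0, 0],
})
toy["length"] = toy["CONTENT"].str.len()
toy["URL"] = toy["CONTENT"].str.contains(r"https?://").astype(int)

features = ColumnTransformer([
    # CountVectorizer expects a 1-D column, so pass the name as a string.
    ("bow", CountVectorizer(), "CONTENT"),
    # Numeric columns are passed through unchanged.
    ("num", "passthrough", ["length", "URL"]),
])

pl_extra = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

cols = ["CONTENT", "length", "URL"]
pl_extra.fit(toy[cols], toy["CLASS"])
print(pl_extra.predict(toy[cols]))
```

Since `length` is on a much larger scale than the word counts, scaling it (for example with `StandardScaler` in place of `"passthrough"`) may be worth trying before comparing against the text-only models.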
