# SIN Project

Models implemented:
<ul>
    <li>MultinomialNB</li>
    <li>RandomForestClassifier</li>
    <li>LogisticRegression</li>
    <li>SGDClassifier</li>
    <li>Perceptron</li>
    <li>MLPClassifier</li>
</ul>

In [1]:
import pandas as pd
from sklearn.pipeline import make_union
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import gc

In [2]:
## Reading CSV
train = pd.read_csv('train.csv').fillna('unknown')
test = pd.read_csv('test.csv').fillna('unknown')

In [3]:
## Initializing class labels

In [4]:
class_names = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
y = train[class_names]
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

In [5]:
## making tf-idf vectors
# Word-level features: unigrams to trigrams, English stop words removed
word_vectorizer = TfidfVectorizer(ngram_range=(1, 3),
                                  min_df=3, max_df=0.9,
                                  strip_accents='unicode',
                                  stop_words='english',
                                  analyzer='word',
                                  use_idf=True,
                                  smooth_idf=True,
                                  sublinear_tf=True)

# Character-level features: 1- to 4-grams, capped at 50,000 features.
# stop_words is omitted because it only applies when analyzer='word'.
char_vectorizer = TfidfVectorizer(ngram_range=(1, 4),
                                  min_df=3, max_df=0.9,
                                  strip_accents='unicode',
                                  analyzer='char',
                                  use_idf=True,
                                  smooth_idf=True,
                                  sublinear_tf=True,
                                  max_features=50000)

In [6]:
vectorizer = make_union(word_vectorizer, char_vectorizer)
vectorizer.fit(all_text)

train_matrix = vectorizer.transform(train['comment_text'])
test_matrix = vectorizer.transform(test['comment_text'])
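
As a rough illustration of what `make_union` does with the two vectorizers, the sketch below fits a word-level and a character-level `TfidfVectorizer` on a tiny made-up corpus and checks that the combined matrix is the two feature blocks placed side by side. The corpus, the n-gram ranges, and the variable names are assumptions chosen for the example, not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_union

# Toy corpus (hypothetical, for illustration only).
docs = ["you are great", "you are awful", "great work everyone"]

word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))

# make_union fits each vectorizer and stacks their outputs column-wise.
union = make_union(word_vec, char_vec)
combined = union.fit_transform(docs)

# Vocabulary sizes of the two fitted vectorizers inside the union.
n_word = len(union.transformer_list[0][1].vocabulary_)
n_char = len(union.transformer_list[1][1].vocabulary_)

print("combined shape:", combined.shape)
print("word features:", n_word, "char features:", n_char)
```

Each row of `combined` is one document; its columns are the word n-gram features followed by the character n-gram features.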



In [7]:
## Building naive bayes model
model = MultinomialNB()

In [9]:
# Fit one binary classifier per label and report accuracy on the training data
for clas in class_names:

    print(clas)
    train_target = train[clas]
    model.fit(train_matrix, train_target)

    predictions = model.predict(train_matrix)
    print('\nAccuracy Score\n', accuracy_score(y[clas], predictions))

toxic

Accuracy Score
 0.9224608481490998
severe_toxic

Accuracy Score
 0.9900545838529557
obscene

Accuracy Score
 0.9539139317294496
threat

Accuracy Score
 0.9969982014275777
insult

Accuracy Score
 0.9543964755500686
identity_hate

Accuracy Score
 0.9912139423830144
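
The scores above are computed on the same rows the model was fit on, so they measure training accuracy. A minimal sketch of a held-out check with `train_test_split` (imported in the first cell but not used) could look like the following. It runs on synthetic count data rather than the comment matrix, and `X_demo`, `y_demo`, and the 80/20 split are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Synthetic non-negative count features stand in for the tf-idf matrix (assumption).
y_demo = rng.integers(0, 2, size=500)
rates = np.where(y_demo[:, None] == 1, 3.0, 1.0)  # class-dependent Poisson rates
X_demo = rng.poisson(lam=rates, size=(500, 10))

# Hold out 20% of the rows; stratify keeps the label ratio similar in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = MultinomialNB().fit(X_tr, y_tr)
train_acc = accuracy_score(y_tr, clf.predict(X_tr))
val_acc = accuracy_score(y_val, clf.predict(X_val))

print("training accuracy:  ", train_acc)
print("validation accuracy:", val_acc)
```

For rare labels, plain accuracy can look high even for a classifier that always predicts 0, so a held-out score is best read alongside a per-class report such as `classification_report`.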


## SGDClassifier
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) with an estimate of it (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems, this reduces the computational burden, achieving faster iterations in exchange for a lower convergence rate. In scikit-learn, `SGDClassifier` fits a linear model with this procedure; with its default hinge loss it behaves like a linear SVM.
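
To make the "estimate from a random subset" idea concrete, here is a small NumPy sketch of mini-batch SGD on a logistic loss. It is not how `SGDClassifier` is implemented internally; the synthetic data, learning rate, batch size, and step count are all assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a known linear decision rule (hypothetical).
X_demo = rng.normal(size=(1000, 5))
true_w = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
y_demo = (X_demo @ true_w > 0).astype(float)


def log_loss_grad(w, X_batch, y_batch):
    """Gradient of the mean logistic loss over the given rows."""
    p = 1.0 / (1.0 + np.exp(-(X_batch @ w)))
    return X_batch.T @ (p - y_batch) / len(y_batch)


w = np.zeros(5)
learning_rate, batch_size = 0.1, 32

for step in range(2000):
    # Random subset of rows: its gradient is a noisy estimate of the full one.
    idx = rng.integers(0, len(y_demo), size=batch_size)
    w -= learning_rate * log_loss_grad(w, X_demo[idx], y_demo[idx])

sgd_accuracy = np.mean((X_demo @ w > 0) == (y_demo == 1))
print("training accuracy after SGD:", sgd_accuracy)
```

Each step touches only 32 of the 1000 rows, which is where the cheaper iterations come from; the price is a noisier path toward the minimum.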

In [10]:
## Building SGD Classifiers
from sklearn.linear_model import SGDClassifier
model1 = SGDClassifier()

In [11]:
for clas in class_names:

    print(clas)
    train_target = train[clas]
    model1.fit(train_matrix,train_target)

    predictions = model1.predict(train_matrix)
    print('\nAccuracy Score\n',accuracy_score(y[clas], predictions))

toxic

Accuracy Score
 0.9553490295855763
severe_toxic

Accuracy Score
 0.9900044494300343
obscene

Accuracy Score
 0.9790563448245608
threat

Accuracy Score
 0.9970044682304429
insult

Accuracy Score
 0.9720061916012308
identity_hate

Accuracy Score
 0.9916651521893076


## LogisticRegression


In [12]:
from sklearn.linear_model import LogisticRegression as LR
model5 = LR()
for clas in class_names:

    print(clas)
    train_target = train[clas]
    model5.fit(train_matrix,train_target)

    predictions = model5.predict(train_matrix)
    print('\nAccuracy Score\n',accuracy_score(y[clas], predictions))

toxic


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



Accuracy Score
 0.9730464808768511
severe_toxic

Accuracy Score
 0.9927618426907144
obscene

Accuracy Score
 0.9850599419694055
threat

Accuracy Score
 0.997693816545613
insult


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



Accuracy Score
 0.9804162410463054
identity_hate

Accuracy Score
 0.9937582643462785
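
The convergence messages in the output above say the solver stopped at its iteration cap, which defaults to `max_iter=100` in `LogisticRegression`. One possible adjustment, sketched here on synthetic data, is to raise that cap and check `n_iter_` to see how many iterations were actually used. The dataset and the specific values 3 and 1000 are assumptions for the demonstration, not tuned settings for the comment data.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for the tf-idf matrix (assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=50, random_state=0)

# A deliberately tiny cap: the solver will most likely stop early and warn,
# so the warning is silenced here to keep the output short.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    capped = LogisticRegression(max_iter=3).fit(X_demo, y_demo)

# A generous cap lets the solver stop on its own tolerance instead.
relaxed = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

print("iterations used (capped): ", capped.n_iter_[0])
print("iterations used (relaxed):", relaxed.n_iter_[0])
```

If `n_iter_` comes back equal to `max_iter`, the fit probably did not converge and the cap may need to go higher, or a different solver may be worth trying.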


## RandomForestClassifier
A random forest is an ensemble of decision trees, each trained on a bootstrap sample of the data. To classify a sample, the forest combines the predictions of its trees by majority rule and assigns the class that most trees agree on.
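
The voting idea can be sketched by asking each fitted tree for its prediction and counting votes. Note that scikit-learn's `RandomForestClassifier` actually averages the trees' predicted class probabilities rather than counting hard votes; with fully grown trees the two usually agree. The synthetic data and the choice of 25 trees are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem (hypothetical stand-in for one label column).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# An odd number of trees avoids tied votes in a binary problem.
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted tree predicts a class index (0 or 1 here) for every sample.
tree_votes = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# Majority rule: class 1 wins when more than half of the trees vote for it.
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)

agreement = np.mean(majority == forest.predict(X_demo))
print("share of samples where hard voting matches forest.predict:", agreement)
```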

In [13]:
from sklearn.ensemble import RandomForestClassifier as rfc
model3 = rfc(n_estimators=50)

In [14]:
for clas in class_names:

    print(clas)
    train_target = train[clas]
    model3.fit(train_matrix,train_target)

    predictions = model3.predict(train_matrix)
    print('\nAccuracy Score\n',accuracy_score(y[clas], predictions))

toxic

Accuracy Score
 0.9997493278853927
severe_toxic

Accuracy Score
 0.9999059979570223
obscene

Accuracy Score
 0.9998495967312356
threat

Accuracy Score
 0.9999373319713482
insult

Accuracy Score
 0.9997618614911231
identity_hate

Accuracy Score
 0.9998997311541571


## Perceptron
The Perceptron is a linear machine learning algorithm and may be considered one of the first and one of the simplest types of artificial neural networks.
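
A minimal NumPy sketch of the classic perceptron update rule is shown below: whenever a sample is misclassified, the weights are nudged toward it. This is the textbook rule, not scikit-learn's internal implementation, and the synthetic data, the margin filter, and the epoch limit are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D points, keeping only those with a clear margin from the
# line x0 + x1 = 0 so the classes are linearly separable (hypothetical data).
points = rng.normal(size=(200, 2))
points = points[np.abs(points[:, 0] + points[:, 1]) > 0.3]
labels = np.where(points[:, 0] + points[:, 1] > 0, 1, -1)  # labels in {-1, +1}

w = np.zeros(2)
b = 0.0

for epoch in range(50):
    mistakes = 0
    for x_i, y_i in zip(points, labels):
        # A non-positive margin means the sample is on the wrong side.
        if y_i * (x_i @ w + b) <= 0:
            w += y_i * x_i  # move the boundary toward the correct side
            b += y_i
            mistakes += 1
    if mistakes == 0:  # a full pass without errors: stop
        break

perceptron_accuracy = np.mean(np.sign(points @ w + b) == labels)
print("training accuracy:", perceptron_accuracy)
```

Because the update only fires on mistakes, the model stops changing once every training point is on the correct side of the boundary.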

In [15]:
from sklearn.linear_model import Perceptron
perceptron = Perceptron()

for clas in class_names:

    print(clas)
    train_target = train[clas]
    perceptron.fit(train_matrix, train_target)

    predictions = perceptron.predict(train_matrix)
    print('\nAccuracy Score\n', accuracy_score(y[clas], predictions))

toxic

Accuracy Score
 0.9979820894774113
severe_toxic

Accuracy Score
 0.9988406414699412
obscene

Accuracy Score
 0.9986338369753902
threat

Accuracy Score
 0.9998057291111794
insult

Accuracy Score
 0.998145026351906
identity_hate

Accuracy Score
 0.9994422545449988


## MLPClassifier
Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function $f(\cdot): \mathbb{R}^m \rightarrow \mathbb{R}^o$ by training on a dataset, where $m$ is the number of dimensions for input and $o$ is the number of dimensions for output. Given a set of features $X = \{x_1, x_2, \ldots, x_m\}$ and a target $y$, it can learn a non-linear function approximator for either classification or regression. It is different from logistic regression in that between the input and the output layer there can be one or more non-linear layers, called hidden layers.
#### The advantages of Multi-layer Perceptron are:
<ul>
    <li>Capability to learn non-linear models.</li>
    <li>Capability to learn models in real-time (on-line learning) using partial_fit.</li>
</ul>

#### The disadvantages of Multi-layer Perceptron (MLP) include:
<ul>
    <li>MLPs with hidden layers have a non-convex loss function with more than one local minimum. Therefore, different random weight initializations can lead to different validation accuracy.</li>
    <li>MLP requires tuning a number of hyperparameters such as the number of hidden neurons, layers, and iterations.</li>
    <li>MLP is sensitive to feature scaling.</li>
</ul>
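
The on-line learning point can be illustrated with `MLPClassifier.partial_fit`, which updates the network one chunk of data at a time. This is a sketch on synthetic, already standardized data; the chunk size, the single hidden layer, the learning rate, and the number of passes are assumptions for the example rather than recommended settings.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic features drawn from a standard normal, so they are already on a
# comparable scale (hypothetical stand-in for real, scaled inputs).
X_demo = rng.normal(size=(600, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

mlp = MLPClassifier(hidden_layer_sizes=(10,), learning_rate_init=0.01, random_state=0)
all_classes = np.unique(y_demo)  # partial_fit needs the full label set up front

# Feed the data in chunks of 100 rows, as if it arrived as a stream.
for epoch in range(20):
    for start in range(0, len(X_demo), 100):
        chunk = slice(start, start + 100)
        mlp.partial_fit(X_demo[chunk], y_demo[chunk], classes=all_classes)

online_accuracy = mlp.score(X_demo, y_demo)
print("training accuracy after incremental updates:", online_accuracy)
```

Changing `random_state` gives a different weight initialization, which is one quick way to see the sensitivity to initialization mentioned in the list above.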

In [16]:
from sklearn.neural_network import MLPClassifier
# max_iter=10 keeps training short; the optimizer may stop before it converges
model2 = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=10)

In [17]:
for clas in class_names:

    print(clas)
    #cross_validation(model2,train[clas])
    train_target = train[clas]
    model2.fit(train_matrix,train_target)

    predictions = model2.predict(train_matrix)
    print('\nAccuracy Score\n',accuracy_score(y[clas], predictions))

toxic

Accuracy Score
 0.9990161119501664
severe_toxic

Accuracy Score
 0.9994547881507292
obscene

Accuracy Score
 0.9993983869249425
threat

Accuracy Score
 0.9997681282939882
insult

Accuracy Score
 0.9993169184876951
identity_hate

Accuracy Score
 0.9996678594481453
