# Test_Kaggle_Hackathon

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
# Any results you write to the current directory are saved as output.

from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import LSTM, Bidirectional, GlobalMaxPool1D, Dropout
from keras.preprocessing import text, sequence
from keras.callbacks import EarlyStopping, ModelCheckpoint

max_features = 20000
maxlen = 100

Using TensorFlow backend.


In [2]:
#print(check_output(["ls", "../input"]).decode("utf8"))

train = pd.read_csv("train_csv/train.csv")
test = pd.read_csv("test_csv/test.csv")
train = train.sample(frac=1)


In [3]:

list_sentences_train = train["comment_text"].fillna("CVxTz").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("CVxTz").values

In [4]:

tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_t = sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = sequence.pad_sequences(list_tokenized_test, maxlen=maxlen)
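
By default, `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`). The NumPy sketch below mimics that default behaviour on made-up token ids, to make the shape of `X_t` easier to picture. `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Illustrative stand-in for Keras pad_sequences with its default settings."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        trunc = seq[-maxlen:]                   # keep only the last maxlen tokens
        out[row, maxlen - len(trunc):] = trunc  # right-align, padding on the left
    return out

toy = [[5, 8], [1, 2, 3, 4, 5, 6]]  # made-up token ids
padded = pad_pre(toy, maxlen=4)
print(padded)
```

Short comments are left-padded with zeros and long ones lose their earliest tokens, so every row ends up with exactly `maxlen` entries.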


In [5]:

def get_model():
    embed_size = 128
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size)(inp)
    x = Bidirectional(LSTM(50, return_sequences=True))(x)
    x = GlobalMaxPool1D()(x)
    x = Dropout(0.1)(x)
    x = Dense(50, activation="relu")(x)
    x = Dropout(0.1)(x)
    x = Dense(6, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model


model = get_model()
batch_size = 32
epochs = 3


file_path = "weights_base_3epochs.best.hdf5"
checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

# With only 3 epochs, patience=20 never triggers, so early stopping is effectively disabled here.
early = EarlyStopping(monitor="val_loss", mode="min", patience=20)

callbacks_list = [checkpoint, early]
model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1, callbacks=callbacks_list)

model.load_weights(file_path)

y_test = model.predict(X_te)

Train on 143613 samples, validate on 15958 samples
Epoch 1/3
Epoch 00001: val_loss improved from inf to 0.04839, saving model to weights_base_3epochs.best.hdf5
Epoch 2/3
Epoch 00002: val_loss improved from 0.04839 to 0.04685, saving model to weights_base_3epochs.best.hdf5
Epoch 3/3
Epoch 00003: val_loss did not improve


In [7]:
sample_submission = pd.read_csv("sample_submission_csv/sample_submission.csv")

sample_submission[list_classes] = y_test



sample_submission.to_csv("baseline_epoch3.csv", index=False)

## Part 3: Vectorizing our dataset

Instead of vectorizing with `TfidfVectorizer`, we could use `CountVectorizer`. However, `CountVectorizer` weights each word only by its raw count, so a word that appears in nearly every document gets at least as much weight as a rare, informative one. TF-IDF down-weights words that occur across many documents, which is what we want: a word like "hairfall" should carry more weight than a word like "the".
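
To make the difference concrete, here is a minimal sketch on a made-up three-document corpus (the sentences and variable names are assumptions for illustration, not part of the dataset). With raw counts, "the" and "hairfall" get the same value in the first document; with TF-IDF, the rarer "hairfall" receives the larger weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (assumed for illustration): "the" occurs in every document,
# "hairfall" in only one.
docs = [
    "the shampoo stopped hairfall",
    "the weather is nice",
    "the bus is late",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs)

# Compare the two words in the first document under each scheme.
count_the = counts[0, count_vec.vocabulary_["the"]]
count_hairfall = counts[0, count_vec.vocabulary_["hairfall"]]
tfidf_the = weights[0, tfidf_vec.vocabulary_["the"]]
tfidf_hairfall = weights[0, tfidf_vec.vocabulary_["hairfall"]]

print("counts:", count_the, count_hairfall)  # each word occurs once
print("tf-idf:", round(tfidf_the, 3), round(tfidf_hairfall, 3))
```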

In [None]:
# instantiate the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(min_df=1, max_features=10000, ngram_range=(1, 3), stop_words='english')

In [None]:
# learn training data vocabulary, then use it to create a document-term matrix

tfidf.fit(X_train)
X_train_dtm = tfidf.transform(X_train)

In [None]:
# equivalently: combine fit and transform into a single step
X_train_dtm = tfidf.fit_transform(X_train)

In [None]:
# examine the document-term matrix
X_train_dtm

In [None]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = tfidf.transform(X_test)
X_test_dtm
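
A small sketch of why the test set is only transformed and never fitted: the vocabulary learned from the training texts fixes the columns of the matrix, and words seen only at test time are dropped. The toy sentences below are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["free prize call now", "are you home now"]
toy_test = ["free pizza now", "zebra"]  # "pizza" and "zebra" never appear in training

vec = TfidfVectorizer()
train_dtm = vec.fit_transform(toy_train)  # learns the vocabulary
test_dtm = vec.transform(toy_test)        # reuses it and learns nothing new

print(train_dtm.shape, test_dtm.shape)    # same number of columns
print("pizza" in vec.vocabulary_)         # unseen word is not added
print(test_dtm[1].nnz)                    # number of stored entries for the all-unseen document
```

Fitting the vectorizer on the test set instead would produce a different set of columns, and the trained model could no longer be applied to it.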

## Part 4: Building and evaluating a model

Models used: SGD Classifier, Multinomial Naive Bayes, and Logistic Regression.

> Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention just recently in the context of large-scale learning.

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.



> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

In [None]:
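# instantiate an SGD classifier with logistic loss and L1 regularization
# (note: scikit-learn 1.1+ names this loss 'log_loss'; 'log' was removed in 1.3)
from sklearn.linear_model import SGDClassifier
lr = SGDClassifier(loss='log', penalty='l1', alpha=1e-06)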

In [None]:
%time lr.fit(X_train_dtm, y_train)

In [None]:
y_pred_class = lr.predict(X_test_dtm)

In [4]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)


In [None]:
X_test[4517],y_pred_class

In [None]:
y_test

In [None]:
# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]

In [5]:
X_test[763]


In [24]:
# print message text for the false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]

3132    LookAtMe!: Thanks for your purchase of a video...
684     Hi I'm sue. I am 20 years old and work as a la...
5368    IMPORTANT MESSAGE. This is a final contact att...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
1073    Dear U've been invited to XCHAT. This is our f...
2821    INTERFLORA - It's not too late to order Inter...
1963    it to 80488. Your 500 free text messages are v...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object
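
The comparisons `y_test < y_pred_class` and `y_test > y_pred_class` work on the assumption that labels are encoded as 0 (ham) and 1 (spam): a false positive has true label 0 and prediction 1, and a false negative is the reverse. A toy sketch with made-up messages and predictions:

```python
import numpy as np
import pandas as pd

# Hypothetical messages, true labels, and predictions (0 = ham, 1 = spam).
messages = pd.Series(
    ["see you at 5", "WIN cash now", "call me later", "free entry"],
    index=[10, 11, 12, 13],
    name="message",
)
y_true = pd.Series([0, 1, 0, 1], index=messages.index)
y_hat = np.array([1, 1, 0, 0])

false_positives = messages[y_true < y_hat]  # true 0, predicted 1
false_negatives = messages[y_true > y_hat]  # true 1, predicted 0

print(false_positives.tolist())
print(false_negatives.tolist())
```

The boolean mask keeps the original index, which is why the outputs in this notebook show row labels from the full dataset.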

In [25]:
X_test[3132]

"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

In [115]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [116]:
# train the model using X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)

Wall time: 10 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [119]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [120]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.82093663911845727

In [68]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[1196,   12],
       [  12,  173]])
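
For binary labels, scikit-learn's `confusion_matrix` puts true classes on the rows and predicted classes on the columns, so the layout is `[[TN, FP], [FN, TP]]`. The sketch below uses made-up labels (not the notebook's data) to show how to unpack the matrix and derive accuracy, recall, and precision from it.

```python
from sklearn import metrics

# Hypothetical labels (0 = ham, 1 = spam), for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

cm = metrics.confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)     # share of actual spam that was caught
precision = tp / (tp + fp)  # share of spam predictions that were correct

print(cm)
print(accuracy, recall, precision)
```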

In [69]:
# print message text for the false positives (ham misclassified as spam)
X_test[y_test < y_pred_class]

136                I only haf msn. It's yijue@hotmail.com
5306    Ill be at yours in about 3 mins but look out f...
1615    Me sef dey laugh you. Meanwhile how's my darli...
694     Will purchase d stuff today and mail to you. D...
503                               Check with nuerologist.
5218            I accidentally brought em home in the box
4229                             Have you started in skye
5089                      What type of stuff do you sing?
2289                          Dont you have message offer
1497    I'm always on yahoo messenger now. Just send t...
3120                             Stop knowing me so well!
2890                               My battery is low babe
Name: message, dtype: object

In [35]:
# print message text for the false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]

1045    We know someone who you know that fancies you....
3530    Xmas & New Years Eve tickets are now on sale f...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
1625    500 free text msgs. Just text ok to 80488 and ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object

In [124]:
# example false negative
X_test[773]

'mobile monthly cell services charges for att'

In [125]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([  8.97567105e-01,   5.87126437e-01,   9.77139320e-01,
         6.29794242e-01,   1.70750363e-01,   6.53000144e-01,
         9.11693165e-01,   8.19610098e-01,   4.61506264e-01,
         8.84190626e-01,   5.48769888e-01,   7.70334801e-01,
         6.31593529e-01,   9.15613086e-01,   4.13413859e-01,
         9.52851543e-01,   2.82954434e-03,   3.27568531e-01,
         4.64356360e-01,   2.15699559e-01,   1.54564542e-01,
         6.08540722e-01,   5.95943196e-01,   8.46533058e-01,
         5.64835440e-01,   6.63289081e-01,   9.85019052e-01,
         2.58292306e-01,   7.22993287e-01,   6.65343814e-01,
         2.64037499e-01,   3.90199550e-01,   6.51796511e-01,
         2.28275163e-01,   6.65913915e-01,   7.24301917e-01,
         8.94321384e-01,   9.60481242e-01,   8.32109883e-01,
         5.96554897e-01,   7.96684274e-01,   5.21675228e-01,
         5.88256984e-01,   9.56694829e-01,   3.16618839e-01,
         4.87841391e-01,   6.23821990e-01,   5.49683328e-01,
         8.45490050e-01,

In [127]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [128]:
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)

Wall time: 50 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [129]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

In [130]:
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([ 0.8360655 ,  0.5606743 ,  0.92067365,  0.50273048,  0.26795449,
        0.64725413,  0.89364957,  0.71344741,  0.51893192,  0.74116959,
        0.56460897,  0.72798002,  0.60893396,  0.78762047,  0.49807165,
        0.98067   ,  0.09266066,  0.39853635,  0.49863597,  0.20607042,
        0.30246763,  0.58962732,  0.56478771,  0.78029607,  0.69606878,
        0.62747733,  0.94813795,  0.31644793,  0.59843078,  0.6882525 ,
        0.37862951,  0.44885336,  0.60263369,  0.26913908,  0.5483894 ,
        0.63346747,  0.94098849,  0.92770084,  0.68344273,  0.56013626,
        0.60158428,  0.53519747,  0.56069876,  0.80955911,  0.43133067,
        0.54588554,  0.59225398,  0.63533255,  0.68580584,  0.52367748,
        0.66062092,  0.60150463,  0.64987947,  0.92643085,  0.9247504 ,
        0.48047527,  0.75147966,  0.83589687,  0.54978848,  0.66515226,
        0.62440879,  0.08318384,  0.41530255,  0.83576578,  0.51109476,
        0.79953214,  0.67875431,  0.74957688,  0.24310028,  0.74

In [131]:
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)

0.83057851239669422

## Part 5: Examining a model for further insight

We will examine our **trained Naive Bayes model** to calculate the approximate **"spamminess" of each token**.
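
One way to approximate this is sketched below on a made-up corpus. `MultinomialNB` stores per-class token totals in `feature_count_`, so the ratio of smoothed, class-normalised spam counts to ham counts gives a rough spamminess score. The toy messages, the use of `CountVectorizer`, and the +1 smoothing are all assumptions for illustration, not this notebook's own procedure.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical messages: 1 = spam, 0 = ham.
texts = ["free prize now", "free cash prize", "see you at home", "call me at home now"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
dtm = vec.fit_transform(texts)
nb_toy = MultinomialNB().fit(dtm, labels)

# Column order of the matrix follows the indices stored in vocabulary_.
token_names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# feature_count_ has one row per class, ordered as in classes_ (0, then 1).
tokens = pd.DataFrame(
    {"ham": nb_toy.feature_count_[0], "spam": nb_toy.feature_count_[1]},
    index=token_names,
)

# Add 1 to avoid dividing by zero, then normalise by the number of documents per class.
tokens["ham_freq"] = (tokens["ham"] + 1) / nb_toy.class_count_[0]
tokens["spam_freq"] = (tokens["spam"] + 1) / nb_toy.class_count_[1]
tokens["spam_ratio"] = tokens["spam_freq"] / tokens["ham_freq"]

print(tokens.sort_values("spam_ratio", ascending=False))
```

A ratio above 1 marks a token that leans towards spam, a ratio below 1 one that leans towards ham. With TF-IDF features the "counts" are fractional weights, so the ratio is only a rough guide.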

In [28]:
# store the vocabulary of X_train
# (scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2)
X_train_tokens = tfidf.get_feature_names()
len(X_train_tokens)

10000

In [27]:
# examine the first 50 tokens
print(X_train_tokens[0:50])


In [38]:
# densify only the first row (document) of the document-term matrix
X_train_dtm[0].todense()

matrix([[ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [32]:
tfidf.vocabulary_

{u'reboot ym': 7018,
 u'scold': 7403,
 u'beautiful tomorrow': 1294,
 u'wine kudi yarasu': 9586,
 u'bringing': 1521,
 u'wednesday': 9405,
 u'box385 m6': 1484,
 u'got hella': 3571,
 u'anytime network mins': 1071,
 u'reason brilliant': 7014,
 u'free text messages': 3237,
 u'cooking': 2164,
 u'china': 1860,
 u'w111wx': 9243,
 u'kids': 4465,
 u'kidz': 4469,
 u'amt lt gt': 1037,
 u'stamps country': 7997,
 u'fone weekly new': 3146,
 u'stylish simple pls': 8138,
 u'dnt': 2624,
 u'music': 5616,
 u'dear school': 2451,
 u'worse gastroenteritis takes': 9811,
 u'yahoo': 9910,
 u'couldn': 2206,
 u'usher britney fml': 9112,
 u'index wml id': 4193,
 u'themob': 8440,
 u'fone weekly': 3145,
 u'award free': 1203,
 u'hav nice': 3825,
 u'wana': 9303,
 u'pints': 6324,
 u'argument wins': 1110,
 u'callin': 1633,
 u'want': 9307,
 u'travel': 8732,
 u'url gt': 9088,
 u'wan2': 9300,
 u'wrong': 9833,
 u'yunny': 9981,
 u'09066364311 collect': 212,
 u'took place': 8692,
 u'welcomes': 9447,
 u'ur opinion': 9034,
 u'f

The vectorizer is worth tuning, just like a model. Here are a few parameters that you might want to tune:

- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used.

- **ngram_range:** tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    - All values of n such that min_n <= n <= max_n will be used.

- **max_df:** float in range [0.0, 1.0] or int, default=1.0
    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.
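
As a rough illustration of how these parameters change the vocabulary, the sketch below fits `TfidfVectorizer` on a made-up four-document corpus under different settings. The corpus and the `vocab_size` helper are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "the" and "is" occur in all four documents,
# "free" and "prize" in two, every other word in one.
docs = [
    "the free prize is waiting",
    "the prize draw is free",
    "the bus is late again",
    "the weather is nice today",
]

def vocab_size(**params):
    """Fit a TfidfVectorizer with the given settings and return its vocabulary size."""
    return len(TfidfVectorizer(**params).fit(docs).vocabulary_)

print("default:        ", vocab_size())
print("stop_words:     ", vocab_size(stop_words="english"))  # common words removed
print("ngram_range 1-2:", vocab_size(ngram_range=(1, 2)))    # bigrams added
print("min_df=2:       ", vocab_size(min_df=2))              # drops words found in one document
print("max_df=0.5:     ", vocab_size(max_df=0.5))            # drops words in more than half the documents
```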

**Guidelines for tuning TfidfVectorizer:**

- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.
- **Experiment**, and let the data tell you the best approach!
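
One way to let the data decide is a cross-validated grid search over vectorizer settings. The sketch below is a minimal example under stated assumptions: the eight messages, the labels, the two-fold split, and the choice of `MultinomialNB` are made up for illustration. On real data you would pass the training texts and labels instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical messages: 1 = spam, 0 = ham.
texts = [
    "win a free prize now",
    "free cash prize call now",
    "claim your free prize today",
    "call now to win cash",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me after lunch",
    "i will be home late tonight",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means it is refitted on each
# training fold, so the validation fold never leaks into the vocabulary.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__stop_words": [None, "english"],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```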