## Prerequisites



In [0]:
import pandas as pd 
import numpy as np 
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

### Note! Some of these models support only multiclass classification, please, while selecting your dataset,  
### be sure that for algorithms which does not support multilabel classification you use only examples with only one label. 
### Examples without a label in any of the provided categories are clean messages, without any toxicity.

In [2]:
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv("/content/drive/My Drive/train.csv")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [4]:
df.shape

(159571, 8)

### As one of the methods to make the training simpier, use only examples, assigned to any category vs clean examples.  
For example:  
- Select only messages with obscene label == 1  
- Select all of the "clean" messages  
Implement a model which can perform a binary classification  - to understand whether your message is obscene or not.   

##### If you want to perform a multilabel classification, please understand the difference between multilabel and multiclass classification and be sure that you are solving the correct task - choose only algorithms applicable for solving this type of problem.

#### To work with multiclass task:  
You only need to select messages which have only one label assigned: message cannot be assigned to 2 or more categories.  

#### To work with multilabel task: 
You can work with the whole dataset - some of your messages have only 1 label, some more than 1. 

## Text vectorization

Previously we worked only with words vectorization. But we need to have a vector for each text, not only words from it. 

Before starting a text vectorization, please, make sure you are working with clean data - use the dataset created on the previous day. Cleaned from punctuation, stop words, lemmatized or stemmed, etc. 

In [5]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [0]:
from string import punctuation

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer() 
stop_words = set(stopwords.words('english'))

In [0]:
def preprocess_text(tokenizer, lemmatizer, stop_words, punctuation, text): 
    tokens = tokenizer(text.lower())
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return [token for token in lemmas if token not in stop_words and token not in punctuation]

df['cleaned'] = df.comment_text.apply(lambda x: preprocess_text(word_tokenize, lemmatizer, stop_words, punctuation, x))

In [8]:
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,"[explanation, edits, made, username, hardcore,..."
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,"[d'aww, match, background, colour, 'm, seeming..."
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,"[hey, man, 'm, really, trying, edit, war, 's, ..."
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,"[``, ca, n't, make, real, suggestion, improvem..."
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,"[sir, hero, chance, remember, page, 's]"


In [0]:
def flat_nested(nested):
    flatten = []
    for item in nested:
        if isinstance(item, list):
            flatten.extend(item)
        else:
            flatten.append(item)
    return flatten

In [0]:
vocab = set(flat_nested(df.cleaned.tolist()))

In [11]:
len(vocab)

249736

As we see, probably you vocabulary is too large.  
Let's try to make it smaller.  
For example, let's get rig of words, which has counts in our dataset less than some threshold.

In [0]:
from collections import Counter, defaultdict 

cnt_vocab = Counter(flat_nested(df.cleaned.tolist()))

In [13]:
cnt_vocab.most_common(10)

[("''", 242528),
 ('``', 155370),
 ('article', 73284),
 ("'s", 66767),
 ("n't", 57144),
 ('wa', 56592),
 ('page', 56263),
 ('wikipedia', 45418),
 ('talk', 35356),
 ('ha', 31896)]

You can clean words which are shorter that particular length and occur less than N times. 

In [0]:
threshold_count = 10
threshold_len = 4 
cleaned_vocab = [token for token, count in cnt_vocab.items() if count > threshold_count and len(token) > threshold_len]

In [15]:
len(cleaned_vocab)

18696



Much better!  
Let's try to vectorize the text summing one-hot vectors for each word. 

In [0]:
vocabulary = defaultdict()

for i, token in enumerate(cleaned_vocab): 
    empty_vec = np.zeros(len(cleaned_vocab))
    empty_vec[i] = 1 
    vocabulary[token] = empty_vec

In [17]:
vocabulary['source']

array([0., 0., 0., ..., 0., 0., 0.])


Rigth now we have vectors for words (words are one-hot vectorized)  
Let's try to create vectors for texts: 

In [18]:
sample_text = df.cleaned[10]
print(sample_text)

['``', 'fair', 'use', 'rationale', 'image', 'wonju.jpg', 'thanks', 'uploading', 'image', 'wonju.jpg', 'notice', 'image', 'page', 'specifies', 'image', 'used', 'fair', 'use', 'explanation', 'rationale', 'use', 'wikipedia', 'article', 'constitutes', 'fair', 'use', 'addition', 'boilerplate', 'fair', 'use', 'template', 'must', 'also', 'write', 'image', 'description', 'page', 'specific', 'explanation', 'rationale', 'using', 'image', 'article', 'consistent', 'fair', 'use', 'please', 'go', 'image', 'description', 'page', 'edit', 'include', 'fair', 'use', 'rationale', 'uploaded', 'fair', 'use', 'medium', 'consider', 'checking', 'specified', 'fair', 'use', 'rationale', 'page', 'find', 'list', "'image", 'page', 'edited', 'clicking', '``', "''", 'contribution', "''", "''", 'link', 'located', 'top', 'wikipedia', 'page', 'logged', 'selecting', '``', "''", 'image', "''", "''", 'dropdown', 'box', 'note', 'fair', 'use', 'image', 'uploaded', '4', 'may', '2006', 'lacking', 'explanation', 'deleted', 'one

### One-hot vectorization and count vectorization

In [0]:
sample_vector = np.zeros(len(cleaned_vocab))

for token in sample_text: 
    try: 
        sample_vector += vocabulary[token]
    except KeyError: 
        continue

In [20]:
sample_vector

array([3., 0., 0., ..., 0., 0., 0.])

Right now we have count vectorization for our text.   
Use this pipeline to create vectors for all of the texts. Save them into np.array. i-th raw in np.array is a vector which represents i-th text from the dataframe.  

In [0]:
from scipy.sparse import lil_matrix

def vectorize(text_list):
    vec = np.zeros(len(cleaned_vocab))
    for token in text_list: 
        try: 
            vec += vocabulary[token]
        except KeyError: 
            continue
    return vec   
def cnt_vectorize(series):
    vectorized_df = lil_matrix((series.size,len(cleaned_vocab)))  
    for i in range(series.size):
        vectorized_df[i] = vectorize(series.iloc[i])
    return vectorized_df  
#cnt_vectorized_df = cnt_vectorize(df.cleaned)

### The next step is to train any classification model on top of the received vectors and report the quality. 

Please, select any of the proposed pipelines for performing a text classification task. (Binary, multiclass or multilabel).  

The main task to calculate our models performance is to create a training and test sets. When you selected a texts for your task, please, use https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html to have at least two sets - train and test.  

Train examples you will use to train your model on and test examples to evaluate your model - to understand how your model works on the unseen data. 

### Train-test split 

In [0]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, random_state=314, test_size=0.30)

### TF-IDF score 

#### Please, review again this article or read it if you have not done it before. 

https://medium.com/@paritosh_30025/natural-language-processing-text-data-vectorization-af2520529cf7

#### Implement calculating a tf-idf score for each of the words from your vocabulary. 

The main goal of this taks is to create a dictionary - keys of the dictionary would be tokens and values would be corresponding tf-idf score of the token.

#### Calculate it MANUALLY and compare the received scores for words with the sklearn implementation:  
from sklearn.feature_extraction.text import TfidfTransformer 

#### Tip: 

##### TF = (Number of time the word occurs in the current text) / (Total number of words in the current text)  

##### IDF = (Total number of documents / Number of documents with word t in it)

##### TF-IDF = TF*IDF 

When you calculated a tf-idf score for each of the words in your vocabulary - revectorize the texts.  
Instead of using number of occurences of the i-th word in the i-th cell of the text vector, use it's tf-idf score.   

Revectorize the documents, save vectors into np.array. 

In [34]:
from numpy.linalg import norm
def tfidf_vectorize(series):
    N = series.size
    m = len(cleaned_vocab)
    tfidf_vectorized = lil_matrix((N,m))
    cnt = Counter(flat_nested(series.apply(lambda x: list(set(x))).tolist()))
    for i in range(N):
        vec = np.zeros(m)
        for word in set(series[i]):
            try:
                tf = float(series[i].count(word))/len(series[i])
                d = cnt[word]
                #vec += vocabulary[word]*np.log(float(N)/(d+1))*tf
                vec +=  vocabulary[word]*tf*(np.log(float(1+N)/(1+d))+1)
            except KeyError: 
                continue    
        if norm(vec) == 0: 
            tfidf_vectorized[i] = vec
        else: 
            tfidf_vectorized[i] = vec/norm(vec)
    return tfidf_vectorized 

tfidf_vectorized = tfidf_vectorize(df.cleaned)
tfidf_vectorized

<159571x18696 sparse matrix of type '<class 'numpy.float64'>'
	with 2717279 stored elements in List of Lists format>

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(vocabulary=cleaned_vocab)
print(tfidf_vectorized[0].todense(), vectorizer.fit_transform(df.cleaned.str.join(' '))[0].todense())

[[0.23737584 0.17370166 0.24984909 ... 0.         0.         0.        ]] [[0.23799837 0.17384245 0.25054612 ... 0.         0.         0.        ]]


In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
vectorizer = TfidfVectorizer(vocabulary=cleaned_vocab)
x_train = vectorizer.fit_transform(train.cleaned.str.join(' '))
x_test = vectorizer.fit_transform(test.cleaned.str.join(' '))
y_train = train[categories].values
y_test = test[categories].values

### Training the model 

As it was said before, select any of the text classification models for the selected task and train the model. 

When the model is trained, you need to evaluate it somehow. 

Read about True positive, False positive, False negative and True negative counts and how to calculate them:   

https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative 

##### Calculate TP, FP, FN and TN on the test set for your model to measure its performance. 


In [37]:
#tfidf KNN
classifier = KNeighborsClassifier()
classifier.fit(x_train,y_train)

predictions = classifier.predict(x_test)

def score(pred,y):
    TN = np.sum([(pred==y) & (y == 0)]) 
    TP = np.sum([(pred==y) & (y == 1)])
    
    FN = np.sum([(pred!=y) & (y == 1)])
    FP = np.sum([(pred!=y) & (y == 0)])

    return TN, TP, FN, FP

def score_2(pred,y):
    TN, TP, FN, FP = score(pred,y)
    prec = TP/(FP+TP)
    rec = TP/(TP+FN)
    F1 = 2*TP/(2*TP + FP + FN)
    F2 = 5*prec*rec/(4*prec + rec)
    acc = (TP +TN)/(TN + TP + FN + FP)
    lst = np.around([prec, rec, F1, F2, acc], decimals=2)
    names = ['prec', 'rec', 'F1', 'F2', 'acc']
    return dict(zip(names, lst))


for i in range(6):
    print(categories[i], ':', score_2(predictions.T[i], y_test.T[i])) 

toxic : {'prec': 0.27, 'rec': 0.33, 'F1': 0.29, 'F2': 0.31, 'acc': 0.85}
severe_toxic : {'prec': 0.29, 'rec': 0.14, 'F1': 0.19, 'F2': 0.16, 'acc': 0.99}
obscene : {'prec': 0.31, 'rec': 0.29, 'F1': 0.3, 'F2': 0.3, 'acc': 0.93}
threat : {'prec': 0.5, 'rec': 0.02, 'F1': 0.03, 'F2': 0.02, 'acc': 1.0}
insult : {'prec': 0.42, 'rec': 0.29, 'F1': 0.35, 'F2': 0.31, 'acc': 0.95}
identity_hate : {'prec': 0.51, 'rec': 0.1, 'F1': 0.17, 'F2': 0.12, 'acc': 0.99}


In [38]:
#tfidf RandomForest
classifier = RandomForestClassifier()
classifier.fit(x_train,y_train)

predictions = classifier.predict(x_test)

for i in range(6):
    print(categories[i], ':', score_2(predictions.T[i], y_test.T[i]))     

toxic : {'prec': 0.79, 'rec': 0.48, 'F1': 0.6, 'F2': 0.52, 'acc': 0.94}
severe_toxic : {'prec': 0.24, 'rec': 0.05, 'F1': 0.08, 'F2': 0.06, 'acc': 0.99}
obscene : {'prec': 0.78, 'rec': 0.5, 'F1': 0.61, 'F2': 0.54, 'acc': 0.97}
threat : {'prec': 0.13, 'rec': 0.02, 'F1': 0.03, 'F2': 0.02, 'acc': 1.0}
insult : {'prec': 0.71, 'rec': 0.5, 'F1': 0.59, 'F2': 0.53, 'acc': 0.97}
identity_hate : {'prec': 0.42, 'rec': 0.1, 'F1': 0.16, 'F2': 0.12, 'acc': 0.99}


In [60]:
#one-hot knn
x_train = cnt_vectorize(train.cleaned)
x_test = cnt_vectorize(test.cleaned)

classifier = KNeighborsClassifier()
classifier.fit(x_train,y_train)

predictions = classifier.predict(x_test)

for i in range(6):
    print(categories[i], ':', score_2(predictions.T[i], y_test.T[i]))     

toxic : {'prec': 0.57, 'rec': 0.33, 'F1': 0.42, 'F2': 0.36, 'acc': 0.91}
severe_toxic : {'prec': 0.34, 'rec': 0.16, 'F1': 0.22, 'F2': 0.18, 'acc': 0.99}
obscene : {'prec': 0.64, 'rec': 0.34, 'F1': 0.44, 'F2': 0.37, 'acc': 0.96}
threat : {'prec': 0.33, 'rec': 0.03, 'F1': 0.06, 'F2': 0.04, 'acc': 1.0}
insult : {'prec': 0.68, 'rec': 0.33, 'F1': 0.45, 'F2': 0.37, 'acc': 0.96}
identity_hate : {'prec': 0.42, 'rec': 0.08, 'F1': 0.13, 'F2': 0.09, 'acc': 0.99}


In [61]:
#one-hot RandomForest 
classifier = RandomForestClassifier()
classifier.fit(x_train,y_train)

predictions = classifier.predict(x_test)

for i in range(6):
    print(categories[i], ':', score_2(predictions.T[i], y_test.T[i]))   

toxic : {'prec': 0.7, 'rec': 0.51, 'F1': 0.59, 'F2': 0.54, 'acc': 0.93}
severe_toxic : {'prec': 0.32, 'rec': 0.14, 'F1': 0.19, 'F2': 0.16, 'acc': 0.99}
obscene : {'prec': 0.71, 'rec': 0.51, 'F1': 0.59, 'F2': 0.54, 'acc': 0.96}
threat : {'prec': 0.17, 'rec': 0.03, 'F1': 0.05, 'F2': 0.04, 'acc': 1.0}
insult : {'prec': 0.64, 'rec': 0.51, 'F1': 0.57, 'F2': 0.53, 'acc': 0.96}
identity_hate : {'prec': 0.31, 'rec': 0.1, 'F1': 0.15, 'F2': 0.12, 'acc': 0.99}


Calculate these metrics for the vectorization created using count vectorizing and for tf-idf vectorization.  
Compare them. 

### Conclusions and improvements 

For all of the vectorization pipelines we used all of the words, which were available in our dictionary, as experiment try to use the most meaningful words - select them using TF-IDF score. (for example for each text you can select not more than 10 words for vectorization, or less). 

Compare this approach with the first and second ones. Did your model improve? 



In [0]:
from scipy.sparse import csr_matrix
x_train = vectorizer.fit_transform(train.cleaned.str.join(' '))
x_test = vectorizer.fit_transform(test.cleaned.str.join(' '))
x_train = csr_matrix(x_train)
vec = np.ones(10)
for i in range(x_train.shape[0]):
    temp = x_train[i].toarray()[0]
    maxidx = np.argpartition(temp, -10)[-10:]
    vec1 = np.zeros(len(cleaned_vocab))
    np.put(vec1, maxidx, vec)
    x_train[i]=csr_matrix(np.multiply(temp,vec1))

In [63]:
classifier = KNeighborsClassifier()
classifier.fit(x_train,y_train)

predictions = classifier.predict(x_test)

for i in range(6):
    print(categories[i], ':', score_2(predictions.T[i], y_test.T[i]))     

toxic : {'prec': 0.29, 'rec': 0.38, 'F1': 0.33, 'F2': 0.36, 'acc': 0.86}
severe_toxic : {'prec': 0.28, 'rec': 0.14, 'F1': 0.19, 'F2': 0.16, 'acc': 0.99}
obscene : {'prec': 0.3, 'rec': 0.35, 'F1': 0.32, 'F2': 0.34, 'acc': 0.92}
threat : {'prec': 0.4, 'rec': 0.02, 'F1': 0.03, 'F2': 0.02, 'acc': 1.0}
insult : {'prec': 0.36, 'rec': 0.35, 'F1': 0.35, 'F2': 0.35, 'acc': 0.94}
identity_hate : {'prec': 0.48, 'rec': 0.08, 'F1': 0.13, 'F2': 0.09, 'acc': 0.99}


### Additionally, visualisations 

For now you have a vector for each word from your vocabulary. 
You have vectors with lenght > 18000, so the dimension of your space is more than 18000 - it's impossible to visualise it in 2d space. 

So try to research and look for algorithms which perform dimensionality reduction. (t-SNE, PCA) 
Try to visualise obtained vectors in a vectorspace, only subset from the vocabulary, don't plot all of the words. (100) 

Probably on this step you will realise how this type of vectorization using these techniques is not the best way to vectorize words. 

Please, analyse the obtained results and explain why visualisation looks like this. 