## Prerequisites



In [1]:
import pandas as pd 
import numpy as np 
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

### Note! Some of these models support only multiclass classification. When selecting your dataset,  
### make sure that, for algorithms that do not support multilabel classification, you use only examples with exactly one label.  
### Examples without a label in any of the provided categories are clean messages, without any toxicity.

In [2]:
df = pd.read_csv("train.csv")

In [3]:
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [4]:
df.shape

(159571, 8)

### One way to make the training simpler is to use only the examples assigned to a single chosen category versus the clean examples.  
For example:  
- Select only messages with obscene label == 1  
- Select all of the "clean" messages  

Then implement a model that performs binary classification: deciding whether a message is obscene or not.   
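
The selection described above can be sketched with pandas as follows. The toy DataFrame and the `target` column are illustrative assumptions; only the label column names come from the dataset shown earlier.

```python
import pandas as pd

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy stand-in for train.csv (the rows are made up for illustration).
toy = pd.DataFrame(
    {
        "comment_text": ["thanks for the edit", "obscene remark", "threatening note", "see the talk page"],
        "toxic":         [0, 1, 1, 0],
        "severe_toxic":  [0, 0, 0, 0],
        "obscene":       [0, 1, 0, 0],
        "threat":        [0, 0, 1, 0],
        "insult":        [0, 0, 0, 0],
        "identity_hate": [0, 0, 0, 0],
    }
)

is_clean = toy[label_cols].sum(axis=1) == 0   # no label in any category
is_obscene = toy["obscene"] == 1

# Keep obscene and clean messages only; other toxic messages are dropped.
binary_df = toy[is_clean | is_obscene].copy()
binary_df["target"] = binary_df["obscene"]    # 1 = obscene, 0 = clean

print(binary_df[["comment_text", "target"]])
```

On the real data the same masks would be applied to `df` instead of `toy`.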

##### If you want to perform multilabel classification, make sure you understand the difference between multilabel and multiclass classification and that you are solving the correct task: choose only algorithms applicable to that type of problem.

#### To work with a multiclass task:  
Select only messages that have exactly one label assigned: a message cannot be assigned to 2 or more categories.  

#### To work with a multilabel task: 
You can work with the whole dataset: some of your messages have only 1 label, some have more than 1. 
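
As a rough sketch of the two setups, the number of labels per row can be counted and used as a filter. The toy rows are made up, and treating clean messages as an extra `clean` class is an assumption, not something the task prescribes.

```python
import pandas as pd

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy label matrix (made-up rows): clean, toxic only, three labels, threat only.
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 0],
     [1, 0, 1, 0, 1, 0],
     [0, 0, 0, 1, 0, 0]],
    columns=label_cols,
)

n_labels = toy[label_cols].sum(axis=1)

# Multiclass: keep rows with at most one label and collapse them to a single class name.
single = toy[n_labels <= 1].copy()
single["label"] = single[label_cols].idxmax(axis=1)
single.loc[single[label_cols].sum(axis=1) == 0, "label"] = "clean"  # assumption: clean is its own class

# Multilabel: keep every row; the target is the full 0/1 indicator matrix.
Y = toy[label_cols].to_numpy()

print(single["label"].tolist())
print(Y.shape)
```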

## Text vectorization

Previously we worked only with word vectorization. But we need a vector for each text, not only for the words in it. 

Before starting text vectorization, make sure you are working with clean data: use the dataset created on the previous day, with punctuation and stop words removed, tokens lemmatized or stemmed, etc. 

In [5]:
from string import punctuation

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
#from replacers import
lemmatizer = WordNetLemmatizer() 
stop_words = set(stopwords.words('english'))

In [6]:
def preprocess_text(tokenizer, lemmatizer, stop_words, punctuation, text):  # lowercase, tokenize, lemmatize, then drop stop words and tokens that occur in the `punctuation` string
    tokens = tokenizer(text.lower())
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return [token for token in lemmas if token not in stop_words and token not in punctuation and token]

df['cleaned'] = df.comment_text.apply(lambda x: preprocess_text(word_tokenize, lemmatizer, stop_words, punctuation, x))

In [7]:
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,"[explanation, edits, made, username, hardcore,..."
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,"[d'aww, match, background, colour, 'm, seeming..."
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,"[hey, man, 'm, really, trying, edit, war, 's, ..."
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,"[``, ca, n't, make, real, suggestion, improvem..."
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,"[sir, hero, chance, remember, page, 's]"


In [8]:
def flat_nested(nested):
    flatten = []
    for item in nested:
        if isinstance(item, list):
            flatten.extend(item)
        else:
            flatten.append(item)
    return flatten

In [9]:
vocab = set(flat_nested(df.cleaned.tolist()))

In [10]:
len(vocab)

249531

As we can see, the vocabulary is probably too large.  
Let's try to make it smaller.  
For example, let's get rid of words whose count in our dataset is below some threshold.

In [11]:
from collections import Counter, defaultdict 

cnt_vocab = Counter(flat_nested(df.cleaned.tolist()))

In [12]:
cnt_vocab.most_common(10)

[("''", 241319),
 ('``', 156982),
 ('article', 73264),
 ("'s", 66766),
 ("n't", 57144),
 ('wa', 56590),
 ('page', 56239),
 ('wikipedia', 45413),
 ('talk', 35356),
 ('ha', 31896)]

You can remove words that are not longer than a particular length or that do not occur more than N times. 

In [13]:
threshold_count = 10
threshold_len = 4 
cleaned_vocab = [token for token, count in cnt_vocab.items() if count > threshold_count and len(token) > threshold_len]

In [14]:
len(cleaned_vocab)

18705

Much better!  
Let's try to vectorize a text by summing the one-hot vectors of its words. 

In [15]:
vocabulary = defaultdict()

for i, token in enumerate(cleaned_vocab): 
    empty_vec = np.zeros(len(cleaned_vocab))
    empty_vec[i] = 1 
    vocabulary[token] = empty_vec

In [16]:
vocabulary['source']

array([0., 0., 0., ..., 0., 0., 0.])

Right now we have vectors for words (the words are one-hot vectorized).  
Let's try to create vectors for texts: 

In [17]:
sample_text = df.cleaned[10]
print(sample_text)

['``', 'fair', 'use', 'rationale', 'image', 'wonju.jpg', 'thanks', 'uploading', 'image', 'wonju.jpg', 'notice', 'image', 'page', 'specifies', 'image', 'used', 'fair', 'use', 'explanation', 'rationale', 'use', 'wikipedia', 'article', 'constitutes', 'fair', 'use', 'addition', 'boilerplate', 'fair', 'use', 'template', 'must', 'also', 'write', 'image', 'description', 'page', 'specific', 'explanation', 'rationale', 'using', 'image', 'article', 'consistent', 'fair', 'use', 'please', 'go', 'image', 'description', 'page', 'edit', 'include', 'fair', 'use', 'rationale', 'uploaded', 'fair', 'use', 'medium', 'consider', 'checking', 'specified', 'fair', 'use', 'rationale', 'page', 'find', 'list', "'image", 'page', 'edited', 'clicking', '``', "''", 'contribution', "''", "''", 'link', 'located', 'top', 'wikipedia', 'page', 'logged', 'selecting', '``', "''", 'image', "''", "''", 'dropdown', 'box', 'note', 'fair', 'use', 'image', 'uploaded', '4', 'may', '2006', 'lacking', 'explanation', 'deleted', 'one

### One-hot vectorization and count vectorization

In [18]:
sample_vector = np.zeros(len(cleaned_vocab))

for token in sample_text: 
    try: 
        sample_vector += vocabulary[token]
    except KeyError: 
        continue

In [19]:
sample_vector

array([3., 0., 0., ..., 0., 0., 0.])

Right now we have a count vectorization of our text.   
Use this pipeline to create vectors for all of the texts. Save them into an np.array: the i-th row of the np.array is the vector that represents the i-th text from the dataframe.  

In [20]:
# df['cleaned_cleaned'] = df.cleaned.apply(lambda x: [w for w in x if cnt_vocab[w] > threshold_count and len(w)>threshold_len])
# df.head()

In [21]:
### Your code here
def vectorize(x):
    vector = np.zeros(len(cleaned_vocab))
    
    for token in x: 
        try: 
            vector += vocabulary[token]
        except KeyError: 
            continue
    return vector

text_vectorize = np.array(df.cleaned.apply(lambda x: vectorize(x)))
text_vectorize

array([array([1., 1., 1., ..., 0., 0., 0.]),
       array([0., 0., 0., ..., 0., 0., 0.]),
       array([0., 1., 0., ..., 0., 0., 0.]), ...,
       array([0., 0., 0., ..., 0., 0., 0.]),
       array([0., 0., 0., ..., 0., 0., 0.]),
       array([0., 0., 0., ..., 0., 0., 0.])], dtype=object)
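
The result above is an object array of separate vectors rather than a single 2D matrix. A minimal sketch of an alternative, assuming a token-to-index map instead of one stored one-hot vector per token, is shown below; `toy_vocab`, `toy_texts` and `token_to_index` are hypothetical names used only for illustration.

```python
import numpy as np

toy_vocab = ["article", "image", "please", "thanks"]   # hypothetical vocabulary
toy_texts = [
    ["thanks", "image", "image"],
    ["please", "unknown", "article"],                  # "unknown" is out of vocabulary
]

# Column index of each vocabulary token.
token_to_index = {token: i for i, token in enumerate(toy_vocab)}

# One row per text, one column per vocabulary token.
counts = np.zeros((len(toy_texts), len(toy_vocab)))
for row, tokens in enumerate(toy_texts):
    for token in tokens:
        col = token_to_index.get(token)
        if col is not None:                            # skip out-of-vocabulary tokens
            counts[row, col] += 1

print(counts)
```

Note that a dense float64 matrix of shape (159571, 18705) would need roughly 24 GB, so on the full dataset a sparse representation (for example the one produced by scikit-learn's `CountVectorizer`) is likely the more practical choice.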

### The next step is to train any classification model on top of the obtained vectors and report the quality. 

Please select any of the proposed pipelines for performing a text classification task (binary, multiclass or multilabel).  

To measure the model's performance, you first need training and test sets. Once you have selected the texts for your task, use https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html to split them into at least two sets: train and test.  

You will train your model on the train examples and evaluate it on the test examples, to understand how it performs on unseen data. 

### Train-test split 

In [90]:
### Your code here: split your dataset into train and test parts. The target used here is the `obscene` label; another label, such as `toxic`, could be used instead.
from sklearn.model_selection import train_test_split
corpus = df.cleaned.apply(lambda x: ' '.join(x))
X = corpus
y = df.obscene.tolist()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### TF-IDF score 

#### Please review this article again, or read it if you have not done so before. 

https://medium.com/@paritosh_30025/natural-language-processing-text-data-vectorization-af2520529cf7

#### Implement calculating a tf-idf score for each of the words from your vocabulary. 

The main goal of this task is to create a dictionary: the keys of the dictionary are tokens and the values are the corresponding tf-idf scores of the tokens.

#### Calculate it MANUALLY and compare the received scores for words with the sklearn implementation:  
from sklearn.feature_extraction.text import TfidfTransformer 

#### Tip: 

##### TF = (Number of times the word occurs in the current text) / (Total number of words in the current text)  

##### IDF = log(Total number of documents / Number of documents with word t in it)

##### TF-IDF = TF*IDF 
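
A small worked example of these formulas on a made-up three-document corpus is sketched below. It takes the natural logarithm of the document ratio for IDF, which is the usual textbook form; scikit-learn's `TfidfTransformer` with its default `smooth_idf=True` instead uses ln((1 + n) / (1 + df)) + 1 on raw counts and then L2-normalizes each row, so its numbers will differ from this plain version.

```python
import math
from collections import Counter

# Made-up corpus of already cleaned, tokenized documents.
docs = [
    ["fair", "use", "image"],
    ["image", "page", "image"],
    ["talk", "page"],
]
n_docs = len(docs)

# Document frequency: the number of documents that contain each token.
doc_freq = Counter(token for doc in docs for token in set(doc))

def tf(token, doc):
    """Occurrences of the token divided by the total number of tokens in the document."""
    return doc.count(token) / len(doc)

def idf(token):
    """Natural log of (number of documents / number of documents containing the token)."""
    return math.log(n_docs / doc_freq[token])

# TF-IDF score for every (token, document index) pair.
tf_idf = {
    (token, i): tf(token, doc) * idf(token)
    for i, doc in enumerate(docs)
    for token in set(doc)
}

print(tf_idf[("image", 1)])   # (2/3) * ln(3/2)
print(tf_idf[("fair", 0)])    # (1/3) * ln(3)
```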

Once you have calculated a tf-idf score for each of the words in your vocabulary, revectorize the texts.  
Instead of putting the number of occurrences of the i-th word in the i-th cell of the text vector, use its tf-idf score.   

Revectorize the documents, save vectors into np.array. 

In [83]:
### Your code here for obtaining a tf-idf vectorized documents.
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
import numpy as np
vocabulary = cleaned_vocab
pipe = Pipeline([('count', CountVectorizer(vocabulary=vocabulary)),
                 ('tfid', TfidfTransformer())]).fit(corpus)
pipe['count'].transform(corpus)

pipe['tfid'].idf_

pipe.transform(corpus).shape

(159571, 18705)

In [60]:
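# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
df_idf = pd.DataFrame(pipe['tfid'].idf_, index=pipe['count'].get_feature_names_out())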
df_idf.head()

Unnamed: 0,0
explanation,5.563271
edits,4.06361
username,5.856578
hardcore,8.40554
metallica,9.648046


In [28]:
cleaned_voc = {token: count for token, count in cnt_vocab.items() if count > threshold_count and len(token) > threshold_len}
len(cleaned_voc)

18705

In [66]:
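# TF is stored as a dict keyed by (word, document index).
# Note: len(comment_voc) is the number of DISTINCT tokens in the comment; the tip above
# divides by the total number of tokens, which would be len(df.cleaned[i]).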
def tf(word, i):
    comment_voc = Counter(df.cleaned[i])
    return comment_voc[word]/len(comment_voc)
TF = {(word, i): (tf(word,i)) for i in range(len(df.cleaned)) for word in df.cleaned[i]}

In [67]:
TF['explanation', 0]

0.038461538461538464

In [24]:
number_doc = len(df.cleaned)
number_doc

159571

In [25]:
list_comments = df.cleaned.tolist()

In [26]:
df['cleaned_cleaned'] = df.cleaned.apply(lambda x: [w for w in x if cnt_vocab[w] > threshold_count and len(w)>threshold_len])
# df.head()

In [28]:
def df_(token):
    count = 0
    for comment in df.cleaned_cleaned.tolist():
        if token in comment:
            count+=1
    return count
DF = {token: df_(token) for token in cleaned_vocab} # document frequency for every word 

In [29]:
IDF = {token: np.log((1+number_doc)/(1+DF[token]))+1 for token in cleaned_vocab}
IDF

{'explanation': 5.571114066090075,
 'edits': 4.076707074345483,
 'username': 5.863856365916739,
 'hardcore': 8.502913695531998,
 'metallica': 9.648045999835,
 'reverted': 4.92859095316825,
 'vandalism': 4.595218222196305,
 'closure': 8.525903213756695,
 'voted': 7.393001851609954,
 'please': 2.9774607564002133,
 'remove': 4.541451386021977,
 'template': 5.018531694028839,
 'since': 4.065221237650293,
 'retired': 7.722755137982422,
 'match': 6.3844699960488915,
 'background': 6.357514186060364,
 'colour': 7.707250951446457,
 'seemingly': 7.892654174777819,
 'stuck': 7.196425327680466,
 'thanks': 3.6178759821785227,
 'january': 6.0957638579674205,
 'really': 4.015554954694744,
 'trying': 4.584321406087005,
 'constantly': 6.996314229323013,
 'removing': 5.315903877911587,
 'relevant': 5.297307340131912,
 'information': 3.894453473108621,
 'talking': 5.463273285405882,
 'instead': 4.876151453866621,
 'seems': 4.381108735946799,
 'formatting': 7.205698964465795,
 'actual': 5.671707712471013

In [45]:
TF_IDF = {(word, i): TF[word,i]*IDF[word] for i in range(len(df.cleaned)) for word in df.cleaned[i] if word in cleaned_vocab}

In [69]:
[TF_IDF[w, 0] for w in cleaned_vocab[:10]] # for first comment

[0.21427361792654137,
 0.15679642593636475,
 0.22553293715064382,
 0.32703514213584606,
 0.3710786923013462,
 0.18956119050647116,
 0.17673916239216558,
 0.32791935437525754,
 0.2843462250619213,
 0.11451772140000821]

### Training the model 

As noted earlier, choose a text classification model suited to the selected task and train it. 

When the model is trained, you need to evaluate it somehow. 

Read about True positive, False positive, False negative and True negative counts and how to calculate them:   

https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative 

##### Calculate TP, FP, FN and TN on the test set for your model to measure its performance. 
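
A minimal sketch of counting the four outcomes by hand for a binary task is shown below. The helper name `confusion_counts` and the toy labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels, treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1          # predicted positive, actually positive
        elif pred == positive and truth != positive:
            fp += 1          # predicted positive, actually negative
        elif pred != positive and truth == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return tp, fp, fn, tn

# Made-up labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)
```

For comparison, `sklearn.metrics.confusion_matrix(y_true, y_pred).ravel()` returns the same four counts for a binary problem, in the order TN, FP, FN, TP.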


In [107]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
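# NOTE: ngram_range, max_df, min_df and max_features are not defined in the cells shown here;
# set them before running this cell (the output shape below suggests max_features=300).
tfidf = TfidfVectorizer(encoding='utf-8',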
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(106912, 300)
(52659, 300)


In [108]:
lg = LogisticRegression().fit(features_train, labels_train)

In [109]:
lrc_pred = lg.predict(features_test)

In [111]:
# Training accuracy
print("The training accuracy is: ")
print(accuracy_score(labels_train, lg.predict(features_train)))

The training accuracy is: 
0.9692831487578569


In [112]:
# Test accuracy
print("The test accuracy is: ")
print(accuracy_score(labels_test, lrc_pred))

The test accuracy is: 
0.9684194534647449


#### The next step is to calculate the Precision, Recall, F1 and F2 scores 

https://en.wikipedia.org/wiki/Sensitivity_and_specificity
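
These scores follow directly from the TP, FP and FN counts. A minimal sketch is given below; the function name and the counts are hypothetical. F-beta is (1 + beta^2) * P * R / (beta^2 * P + R), so beta=1 gives F1 and beta=2 gives F2, which weights recall more heavily than precision.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) computed from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Hypothetical counts: 40 true positives, 10 false positives, 60 false negatives.
tp, fp, fn = 40, 10, 60
precision, recall, f1 = precision_recall_fbeta(tp, fp, fn, beta=1)
_, _, f2 = precision_recall_fbeta(tp, fp, fn, beta=2)

print(precision, recall, f1, f2)
```

With high precision but low recall, as in this toy case, F2 comes out lower than F1 because it penalizes the missed positives more.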

In [113]:
# Classification report
print("Classification report")
print(classification_report(labels_test,lrc_pred))

Classification report
              precision    recall  f1-score   support

           0       0.97      1.00      0.98     49828
           1       0.94      0.44      0.60      2831

    accuracy                           0.97     52659
   macro avg       0.96      0.72      0.79     52659
weighted avg       0.97      0.97      0.96     52659



Calculate these metrics both for the count vectorization and for the tf-idf vectorization.  
Compare them. 

### Conclusions and improvements 

In all of the vectorization pipelines so far we used every word available in our dictionary. As an experiment, try using only the most meaningful words, selected by TF-IDF score (for example, for each text keep no more than 10 words for vectorization, or fewer). 

Compare this approach with the first and second ones. Did your model improve? 
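
One possible way to pick the most meaningful words of a single text is sketched below, assuming a precomputed token-to-IDF dictionary. The helper `top_k_tokens` and the IDF values are made up for illustration, and TF is taken over in-vocabulary tokens only, which is a design choice rather than a requirement.

```python
from collections import Counter

def top_k_tokens(tokens, idf_scores, k=10):
    """Return up to k tokens of one text, ranked by TF * IDF (highest first)."""
    counts = Counter(t for t in tokens if t in idf_scores)   # ignore out-of-vocabulary tokens
    total = sum(counts.values())
    scores = {t: (c / total) * idf_scores[t] for t, c in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up IDF values and a made-up cleaned text.
idf_scores = {"image": 2.0, "fair": 3.0, "please": 1.0, "rationale": 5.0}
tokens = ["fair", "use", "image", "image", "please", "rationale"]

print(top_k_tokens(tokens, idf_scores, k=2))
```

The selected tokens can then be fed into the same count or tf-idf vectorization as before, with every other position left at zero.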



### Additionally, visualisations 

You now have a vector for each word from your vocabulary. 
These vectors have length > 18000, so the space has more than 18000 dimensions and cannot be plotted directly in 2D. 

So research algorithms that perform dimensionality reduction (t-SNE, PCA). 
Try to visualise the obtained vectors, but only for a subset of the vocabulary (about 100 words); don't plot all of the words. 
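
As a starting point, PCA can be sketched with plain NumPy through an SVD of the centred data. The example below uses 100 one-hot vectors as a stand-in for a vocabulary subset (an assumption made for illustration) and also computes the pairwise distances, which may help when interpreting the plot.

```python
import numpy as np

n_words = 100
one_hot = np.eye(n_words)                  # row i is the one-hot vector of word i

# Pairwise Euclidean distances between all word vectors.
diff = one_hot[:, None, :] - one_hot[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# PCA via SVD: centre the data, then project onto the first two principal axes.
centred = one_hot - one_hot.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
coords_2d = centred @ vt[:2].T             # shape (n_words, 2), ready for a scatter plot

print(coords_2d.shape)
print(singular_values[:5])
```

Every pair of distinct one-hot vectors is exactly sqrt(2) apart, and all non-zero singular values are equal, so no projection direction carries more information than any other.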

At this step you will probably realise that this type of vectorization is not the best way to vectorize words. 

Please analyse the obtained results and explain why the visualisation looks the way it does. 