### Text classification:

In this notebook you will run experiments on text classification. The challenge is to build a
multi-headed model capable of detecting different types of toxicity, such as threats, obscenity,
insults, and identity-based hate, in social media and document posts. You can download the dataset
from the following link.

https://drive.google.com/file/d/17jj450A1ViZkHhWJOImxZLCmK3vQRlQk/view

#### Requirements:
spacy <br>
pandas <br>
numpy<br>
plotly<br>
scikit-learn<br>

In [1]:
!pip install -r requirements.txt



In [2]:
import pandas as pd
import re
import spacy
import string
import plotly.express as px
from spacy.tokens import Doc, Span, Token
import numpy as np

In [3]:
from spacy.language import Language

@Language.component("lower_case_lemmas")
def lower_case_lemmas(doc):
    # Lower-case every lemma so that differently cased forms share one feature
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

In [4]:
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
nlp.disable_pipes('ner', 'parser')
nlp.add_pipe('lower_case_lemmas', after='lemmatizer')
print(nlp.pipe_names)
is_punc_func = lambda token: token.text in string.punctuation
Token.set_extension("is_punc", getter = is_punc_func, force=True)


['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'lower_case_lemmas']


In [5]:
train_df = pd.read_csv('./dataset/train.csv')
test_df = pd.read_csv('./dataset/test.csv')

In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             159571 non-null  object
 1   comment_text   159571 non-null  object
 2   toxic          159571 non-null  int64 
 3   severe_toxic   159571 non-null  int64 
 4   obscene        159571 non-null  int64 
 5   threat         159571 non-null  int64 
 6   insult         159571 non-null  int64 
 7   identity_hate  159571 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 9.7+ MB


In [7]:
train_df.describe()

       toxic     severe_toxic  obscene   threat    insult    identity_hate
count  159571.0  159571.0      159571.0  159571.0  159571.0  159571.0
mean   0.095844  0.009996      0.052948  0.002996  0.049364  0.008805
std    0.294379  0.099477      0.223931  0.05465   0.216627  0.09342
min    0.0       0.0           0.0       0.0       0.0       0.0
25%    0.0       0.0           0.0       0.0       0.0       0.0
50%    0.0       0.0           0.0       0.0       0.0       0.0
75%    0.0       0.0           0.0       0.0       0.0       0.0
max    1.0       1.0           1.0       1.0       1.0       1.0


## EDA
Find patterns in the data with respect to label frequency, comment length, label correlation, etc.

In [8]:
train_df['comment_length'] = train_df['comment_text'].apply(lambda x: len(str(x)))

In [9]:
px.histogram(train_df.comment_length, title='Text length across the corpus')

In [10]:
px.bar(train_df.iloc[:,2:8].sum(), title='Distribution of comment types')

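As the plot above shows, out of roughly 160K comments (159,571):
- Most comments do not fall under any of the categories.
- The comments that are labelled are highly imbalanced across categories: `toxic` covers about 9.6% of all comments, while `threat` covers about 0.3%.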
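
The two observations above can also be checked numerically rather than read off the plot. The following is a minimal sketch that uses a tiny made-up frame (`toy_df`, a hypothetical stand-in for `train_df`; the rows are illustrative and not taken from the dataset). It computes the share of comments with no label, the positive rate per label, and the pairwise label correlation mentioned in the EDA goal.

```python
import pandas as pd

target_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical toy labels standing in for train_df[target_cols]
toy_df = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 1, 0],
        [1, 1, 1, 0, 1, 0],
        [1, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 1],
    ],
    columns=target_cols,
)

# Share of comments that carry no label at all
clean_share = (toy_df[target_cols].sum(axis=1) == 0).mean()

# Fraction of positive examples per label, most frequent first
positive_rate = toy_df[target_cols].mean().sort_values(ascending=False)

# Pairwise Pearson correlation between the binary label columns
label_corr = toy_df[target_cols].corr()

print(f'Unlabelled share: {clean_share:.2f}')
print(positive_rate)
print(label_corr.round(2))
```

On the real data the same three lines would run against `train_df`; the correlation matrix could then be passed to a heatmap to see which labels tend to co-occur.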

### Cleaning & Pre-processing

In [11]:
misspelled_words = {
    r'\s+u\s+': ' you ',
    r'\s+yor\s+': ' your ',
    r'\s+bich\s+': ' bitch ',
    r'\s+ur\s+': ' your ',
    r'\s+fuk+\s+': ' fuck ',
    r'\s+im\s+': ' i am '
}

# TODO check for other spelling mistakes, integrate a spellchecker

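def clean_text(text):
    """Lower-case a comment, expand common misspellings, and keep only letters and apostrophes."""
    text = text.lower()
    text = re.sub(r'\n', ' ', text)  # newlines -> spaces
    text = re.sub(r'=', '', text)    # drop '=' characters
    for pattern, replacement in misspelled_words.items():
        text = re.sub(pattern, replacement, text)
    text = re.sub(r'\s+', ' ', text)  # collapse repeated whitespace
    # Replace every run of characters other than letters and apostrophes with one space
    text = re.sub(r"[^A-Za-z']+", ' ', text)
    return text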

In [12]:
def preprocess_extract_text(cleaned_text):
    doc = nlp(cleaned_text)
    extracted_text = [w.lemma_ for w in doc if not w.is_stop]
    return " ".join(extracted_text)

In [13]:
# Sample of pre-processing output
preprocess_extract_text(clean_text(train_df.comment_text[0]))

'explanation edit username hardcore metallica fan revert vandalism closure gas vote new york doll fac remove template talk page retire'
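
The `\s+...\s+` patterns in `misspelled_words` only fire when the token has whitespace on both sides, so they miss a token at the very start or end of a comment, and two adjacent tokens that share a single space. One possible alternative, sketched below, is to anchor on word boundaries instead. `SLANG`, `SLANG_RE`, and `normalize_slang` are hypothetical names introduced for illustration, not part of the notebook.

```python
import re

# Hypothetical slang map: keys are whole tokens, values are their replacements
SLANG = {
    'u': 'you',
    'ur': 'your',
    'yor': 'your',
    'im': 'i am',
}

# One compiled alternation; \b matches at token edges without consuming the spaces
SLANG_RE = re.compile(r'\b(' + '|'.join(map(re.escape, SLANG)) + r')\b')


def normalize_slang(text):
    """Lower-case the text and replace whole-token slang using SLANG."""
    return SLANG_RE.sub(lambda m: SLANG[m.group(1)], text.lower())


print(normalize_slang('U think ur funny'))
print(normalize_slang('im sure u r'))
```

Note that `\b` also treats apostrophes and periods as boundaries, so a string such as "u.s." would be rewritten too; whether that trade-off is acceptable depends on the corpus.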

### Feature Extraction & Model Pipeline

In [14]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.multioutput import ClassifierChain
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import GridSearchCV

from sklearn.metrics import accuracy_score, f1_score


In [15]:
target_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [16]:
# Reference: https://ryan-cranfill.github.io/sentiment-pipeline-sklearn-3/

def pipelinize(function, active=True):
    def list_comprehend_a_function(list_or_series, active=True):
        if active:
            return [function(i) for i in list_or_series]
        else:
            return list_or_series
    return FunctionTransformer(list_comprehend_a_function, validate=False, kw_args={'active':active})
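
A minimal usage sketch of the wrapper pattern above: an element-wise string function is lifted into a transformer and chained inside a `Pipeline`. The wrapper is repeated here only so the example is self-contained, and `strip_and_lower` is a hypothetical stand-in for `clean_text`.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def pipelinize(function, active=True):
    """Wrap an element-wise function so it can be used as a pipeline step."""
    def apply_to_each(list_or_series, active=True):
        if active:
            return [function(item) for item in list_or_series]
        return list_or_series
    return FunctionTransformer(apply_to_each, validate=False, kw_args={'active': active})


def strip_and_lower(text):
    # Hypothetical stand-in for clean_text
    return text.strip().lower()


steps = Pipeline([
    ('normalize', pipelinize(strip_and_lower)),
    # An inactive step passes the data through unchanged
    ('skip_me', pipelinize(str.upper, active=False)),
])

result = steps.fit_transform(['  Hello World ', 'SPAM  '])
print(result)
```

The `active` flag makes it possible to switch a preprocessing step on or off without rebuilding the pipeline.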

In [17]:
def train_test_eval(dataset, train_config, target_cols):
    """Split the data, grid-search the pipeline in train_config, and print micro-F1 scores."""
    X = dataset.comment_text
    y = dataset.loc[:, target_cols]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=42, test_size=0.2)
    print(f'X_train:{X_train.shape} y_train:{y_train.shape}')
    print(f'X_test:{X_test.shape} y_test:{y_test.shape}')

    # Grid-search the pipeline; read the grid from the argument rather than a global
    pipeline = train_config['pipeline']
    grid = GridSearchCV(pipeline, train_config['param_grid'], scoring='f1_micro', cv=3)
    grid.fit(X_train, y_train)

    # Evaluate the refitted best estimator on both splits
    train_preds = grid.predict(X_train)
    print(f"Training set f1 score:{f1_score(y_train, train_preds, average='micro')}")

    test_preds = grid.predict(X_test)
    print(f"Test set f1 score:{f1_score(y_test, test_preds, average='micro')}")

In [18]:
lr_classifier = LogisticRegression(class_weight='balanced', C=12, random_state=42)
tree_classifier = DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=42)
# loss='log' was renamed to 'log_loss' in scikit-learn 1.1; use 'log' on older versions
sgd_classifier = SGDClassifier(max_iter=200, tol=1e-3, class_weight='balanced', loss='log_loss', early_stopping=True, n_jobs=-1, random_state=42)
rf_classifier = RandomForestClassifier(class_weight='balanced', max_depth=5, max_features=1000, random_state=42)
ada_boost_classifier = AdaBoostClassifier(tree_classifier, n_estimators=50, random_state=42)



pipeline_steps = [
    ('clean_text', pipelinize(clean_text)),
    ('spacy-extract', pipelinize(preprocess_extract_text)),
    ('tfidf-vectorize', TfidfVectorizer(max_features=2000, ngram_range=(1, 2))),
    ('clf', MultiOutputClassifier(rf_classifier)),
]

pipeline = Pipeline(pipeline_steps)

hyper_param_tuning_config = {
    'pipeline': pipeline,
    'param_grid': {
        'tfidf-vectorize__max_features': [1000, 2000],
        'clf__estimator__max_depth': [5,7]
    }
}

train_test_eval(train_df, hyper_param_tuning_config, target_cols)

X_train:(160,) y_train:(160, 6)
X_test:(40,) y_test:(40, 6)
Training set f1 score:0.9620253164556962
Test set f1 score:0.0
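
Both scores printed above are micro-averaged F1: true positives, false positives, and false negatives are pooled over all six label columns before precision and recall are combined, so the frequent labels dominate the result. The sketch below reproduces that computation by hand on made-up arrays; `micro_f1` is a hypothetical helper, and it should agree with `f1_score(..., average='micro')` on the same input.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label rows of 0/1 values."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # No positives in either truth or prediction: report 0.0
    return 2 * tp / denom if denom else 0.0


# Made-up labels for three comments and three label columns
y_true = [[1, 0, 1], [0, 0, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 0, 0], [1, 0, 1]]

print(micro_f1(y_true, y_pred))
```

A score of exactly 0.0 therefore means that no true positive was predicted on the evaluated rows (or that there were no positive labels at all).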


### Comments and Future Work:

The current notebook provides a base framework for feature extraction and a pipeline for further experimentation with classifiers and hyperparameter tuning.

<br />
Future Work: <br>

- Further hyperparameter tuning: the current configuration overfits (training F1 of about 0.96 versus 0.0 on the held-out split).
- Try ClassifierChain.
- Neural network training
    - Intuition: the general architecture would be Embedding -> multi-layer ANN -> Pooling -> Dense layer with sigmoid activation
- Top n phrases:
    - Analyse the TF-IDF output for each category and extract top contributors based on the TF-IDF score.