# WikiShop Comment Toxicity Classification Project

## Project Overview

The objective of this project is to develop a machine learning model that classifies user comments as toxic or non-toxic. We have a labeled dataset in which each comment is marked as toxic or not.

The primary goal is to build a model that reaches an F1 score of at least 0.75.
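
For reference, F1 is the harmonic mean of precision and recall on the positive (toxic) class. The sketch below computes it by hand on made-up labels; the function name `f1_from_labels` and the toy data are illustrative assumptions, not part of the project code, which uses `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute F1 for the positive class from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 3 toxic comments, 5 non-toxic.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 4))
```

Because both precision and recall enter the formula, a model cannot reach 0.75 by favoring one of them alone.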

To successfully complete this project, we will follow these steps:

1. Data Preparation
   - Load and prepare the provided dataset.
   - Ensure that the dataset is correctly formatted and that it contains the necessary columns: `text` (the comment text) and `toxic` (the toxicity label).
2. Model Training
   - Experiment with different machine learning models for comment classification.
   - Consider trying various natural language processing techniques.
   - Evaluate the models using relevant evaluation metrics.
3. Conclusion
   - Summarize our findings and observations from the model training process.
   - Discuss the performance of different models and techniques.
   - Reflect on the achievement of the project's goal: an F1 score of at least 0.75.

## Data Description

The dataset for this project is in the file `toxic_comments.csv`. It has two main columns:

`text`: This column contains the text of user comments.

`toxic`: This column serves as the target attribute, indicating whether a comment is toxic (1) or not toxic (0).

## Library imports

In [1]:
import pandas as pd
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings("ignore")


## Preparation

1. Load dataset

In [2]:
data = pd.read_csv('./toxic_comments.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
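
The `Unnamed: 0` columns in the preview look like saved row indices rather than features. A minimal sketch of checking for the required columns and keeping only those is shown here; the three-row CSV is a hypothetical stand-in for `toxic_comments.csv`, and the name `required` is an assumption made for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for toxic_comments.csv.
sample_csv = io.StringIO(
    "Unnamed: 0,text,toxic\n"
    "0,Thanks for the fix,0\n"
    "1,You are an idiot,1\n"
    "2,See the talk page,0\n"
)
sample = pd.read_csv(sample_csv)

# Fail early if a required column is absent.
required = {"text", "toxic"}
missing = required - set(sample.columns)
if missing:
    raise ValueError(f"missing columns: {sorted(missing)}")

# Keep only the columns the model needs; the saved index column is dropped.
sample = sample[["text", "toxic"]]

# The target should contain only the labels 0 and 1.
labels_ok = bool(sample["toxic"].isin([0, 1]).all())
print(sample.shape, labels_ok)
```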


In [3]:
data['toxic'].value_counts()

toxic
0    143106
1     16186
Name: count, dtype: int64

The dataset is imbalanced: 143,106 comments are non-toxic and 16,186 are toxic, so toxic comments make up about 10% of the data.
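
To see why this imbalance makes F1 a more informative metric than accuracy, consider a trivial baseline that labels every comment non-toxic. The sketch below uses only the counts printed above; the baseline itself is a hypothetical illustration, not a model from this project.

```python
non_toxic, toxic = 143106, 16186  # counts from value_counts() above
total = non_toxic + toxic
toxic_share = toxic / total

# Baseline that predicts "non-toxic" for every comment.
baseline_accuracy = non_toxic / total  # every non-toxic comment is "correct"
baseline_true_positives = 0            # it never predicts the toxic class
baseline_f1 = 0.0                      # zero true positives gives F1 = 0

print(f"toxic share: {toxic_share:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline F1: {baseline_f1}")
```

The baseline is right on roughly nine comments out of ten while finding no toxic comments at all, which is why the grids later in the notebook use `scoring='f1'`.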

2. Check for missing values and duplicates

In [4]:
data.isna().sum()

Unnamed: 0    0
text          0
toxic         0
dtype: int64

In [5]:
data.duplicated().sum()

0
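
Note that `duplicated()` without arguments compares whole rows, so rows that differ only in an index-like column are not counted. A small sketch on made-up data of restricting the check to the comment text follows; whether the real dataset contains such repeats is not known from the output above.

```python
import pandas as pd

# Hypothetical toy frame: two rows share the same text but differ in the index column.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "text": ["nice edit", "nice edit", "stop vandalising"],
    "toxic": [0, 0, 1],
})

full_row_dupes = int(toy.duplicated().sum())               # compares all columns
text_dupes = int(toy.duplicated(subset=["text"]).sum())    # compares text only
print(full_row_dupes, text_dupes)
```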

3. Let's prepare a function that replaces every non-letter character with a space, converts the text to lowercase, and lemmatizes each word.

In [6]:
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text): 
    cleared_text = re.sub(r'[^a-zA-Z]', ' ', text) 
    cleared_text = cleared_text.lower()
    tokenized = nltk.word_tokenize(cleared_text)
    lemmatized = [lemmatizer.lemmatize(word) for word in tokenized]
    return " ".join(lemmatized)

In [7]:
data['text'] = data['text'].apply(lemmatize_text)
data.head()

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,explanation why the edits made under my userna...,0
1,1,d aww he match this background colour i m seem...,0
2,2,hey man i m really not trying to edit war it s...,0
3,3,more i can t make any real suggestion on impro...,0
4,4,you sir are my hero any chance you remember wh...,0
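
The fragments such as `i m` and `can t` in the output come from the regular expression, which replaces apostrophes with spaces. The sketch below isolates just the cleaning step; it is a simplified stand-in in which `str.split` replaces `nltk.word_tokenize` and lemmatization is left out, so it needs no NLTK data. The name `clean_text` is hypothetical.

```python
import re


def clean_text(text):
    """Replace non-letters with spaces, lowercase, and collapse whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return " ".join(letters_only.lower().split())


cleaned = clean_text("D'aww! He matches this background colour")
print(cleaned)
```

Digits and non-Latin letters are removed by the same pattern, which is worth keeping in mind if such characters carry signal in the comments.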


4. Let's split the dataset into training and testing sets.

In [8]:
train, test = train_test_split(data, test_size=0.2)

In [9]:
corpus_train = train['text'].values.astype('U')
corpus_test = test['text'].values.astype('U')
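
The split above is random on every run and does not fix the class proportions. One possible variation, shown here on made-up data, passes `stratify` and `random_state` to `train_test_split` so that both parts keep the same toxic share and the split is reproducible. The seed value 42 and the toy lists are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data with the same 90/10 imbalance as the real dataset.
labels = [0] * 90 + [1] * 10
texts = [f"comment {i}" for i in range(100)]

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,
    stratify=labels,   # keep the toxic share equal in both parts
    random_state=42,   # arbitrary seed, fixed for reproducibility
)
print(len(y_train), sum(y_train), len(y_test), sum(y_test))
```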

5. Create matrices with TF-IDF values for the samples, while simultaneously removing stop words.

In [10]:
stopwords = list(nltk_stopwords.words('english'))  # a list, since newer scikit-learn versions reject a set for stop_words

In [11]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)
tf_idf_train = count_tf_idf.fit_transform(corpus_train)
tf_idf_test = count_tf_idf.transform(corpus_test)
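
The vectorizer is fit on the training corpus only and then applied to the test corpus, so the test set does not influence the vocabulary or the IDF weights. The toy sketch below, with made-up documents, shows the consequence: both matrices share the same columns, and a word seen only in the test text is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpora.
train_docs = ["you are great", "you are awful", "great edit thanks"]
test_docs = ["awful vandal", "thanks"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF
X_test = vectorizer.transform(test_docs)        # reuses them unchanged

vocab = vectorizer.vocabulary_
print(X_train.shape, X_test.shape, "vandal" in vocab)
```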

6. Split features and target.

In [12]:
features_train = tf_idf_train
target_train = train['toxic']

features_test = tf_idf_test
target_test = test['toxic']

## Training models

In [13]:
best_score = []

### DecisionTreeClassifier

In [14]:
dt = DecisionTreeClassifier()
param_grid = {'max_depth': range(1, 10, 2),
    'class_weight': [None, 'balanced']}
dt_grid = GridSearchCV(dt, param_grid=param_grid, cv=3, scoring='f1')
dt_grid.fit(features_train, target_train)
predict_train_dt = dt_grid.predict(features_train)
best_score.append(dt_grid.best_score_)
print('Best score:',dt_grid.best_score_)
print('Best params:',dt_grid.best_params_)

Best score: 0.5788533017523151
Best params: {'class_weight': None, 'max_depth': 9}
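
In the grid searches in this section, the TF-IDF matrix is fit once on the whole training set before cross-validation, so each validation fold has already contributed to the vocabulary and IDF weights. One possible alternative is to put the vectorizer and the classifier into a `Pipeline`, so the vectorizer is refit inside every fold. This is a sketch on made-up comments, not the setup that produced the scores reported here; the step names `tfidf` and `clf` are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy comments: label 1 marks a toxic comment.
texts = [
    "thanks for the helpful edit", "you are a stupid idiot",
    "great source well done", "shut up you moron",
    "please see the talk page", "idiot go away",
    "nice work on the article", "stupid useless moron",
    "i agree with this change", "you idiot stop editing",
    "good point thanks", "what a stupid edit you moron",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                 # refit on each training fold
    ("clf", LogisticRegression(max_iter=1000)),
])
# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {"clf__class_weight": [None, "balanced"]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="f1")
grid.fit(texts, labels)
print(grid.best_params_, grid.best_score_)
```

After fitting, `grid.predict` accepts raw strings directly, because the refit pipeline includes the vectorizer.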


### RandomForestClassifier

In [15]:
rf = RandomForestClassifier()
params = {'n_estimators': range(1,5),
         'max_depth': range(1,7),
         'class_weight': [None, 'balanced']}
rf_grid = GridSearchCV(rf, param_grid=params, cv=3, scoring='f1')
rf_grid.fit(features_train, target_train)
predict_train_rf = rf_grid.predict(features_train)
best_score.append(rf_grid.best_score_)
print('Best score:',rf_grid.best_score_)
print('Best params:',rf_grid.best_params_)

Best score: 0.25097271789456593
Best params: {'class_weight': 'balanced', 'max_depth': 6, 'n_estimators': 3}


### LogisticRegression

In [16]:
lr = LogisticRegression()
params = {'class_weight': [None, 'balanced'],
         'solver': ['lbfgs','liblinear'],
         'fit_intercept' : [True, False],
         'multi_class': ['ovr','multinomial']}
lr_grid = GridSearchCV(lr, param_grid=params,  cv=5, scoring='f1')
lr_grid.fit(features_train, target_train)
best_score.append(lr_grid.best_score_)
print('Best score:',lr_grid.best_score_)
print('Best params:',lr_grid.best_params_)

Best score: 0.7575055854259978
Best params: {'class_weight': 'balanced', 'fit_intercept': True, 'multi_class': 'multinomial', 'solver': 'lbfgs'}
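
The selected `class_weight='balanced'` option weights each class by `n_samples / (n_classes * class_count)`, so errors on the rare toxic class cost more during training. The sketch below applies that formula to the full-dataset counts shown earlier; the weights on the actual training split will differ slightly, since the split is random.

```python
non_toxic, toxic = 143106, 16186  # full-dataset counts from value_counts()
n_samples = non_toxic + toxic
n_classes = 2

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * class_count)
weight_non_toxic = n_samples / (n_classes * non_toxic)
weight_toxic = n_samples / (n_classes * toxic)

print(f"non-toxic weight: {weight_non_toxic:.3f}")
print(f"toxic weight: {weight_toxic:.3f}")
print(f"ratio: {weight_toxic / weight_non_toxic:.2f}")
```

The ratio of the two weights equals the ratio of the class counts, so each class contributes the same total weight to the loss.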


In [17]:
conclusion = [best_score]
index = ['f1 score']
columns = ['Decision tree','Random Forest', 'Logistic regression'] 
df = pd.DataFrame(conclusion, index, columns) 
df.T.head(6)

Unnamed: 0,f1 score
Decision tree,0.578853
Random Forest,0.250973
Logistic regression,0.757506



Logistic regression achieved the best cross-validated F1 score (about 0.758). Let's evaluate it on the test set.

### Testing the Best Model

In [18]:
# Refit the best configuration found by the grid search on the full training set
lr = LogisticRegression(class_weight='balanced', fit_intercept=True,
                        multi_class='multinomial', solver='lbfgs')
lr.fit(features_train, target_train)
predict = lr.predict(features_test)
f1 = f1_score(target_test, predict)
print(f1)

0.7487982419997253


## Conclusion


In this project, we compared three machine learning models for detecting toxic comments. After preprocessing the text (removing non-letter characters, lowercasing, and lemmatizing) and building TF-IDF features, we tuned a decision tree, a random forest, and a logistic regression with cross-validated grid search. Logistic regression with balanced class weights, a fitted intercept, 'multinomial' multi-class handling, and the 'lbfgs' solver performed best, with a cross-validated F1 score of about 0.7575, compared with 0.5789 for the decision tree and 0.2510 for the random forest. On the test set it scored approximately 0.7488, which is just below the 0.75 target, so the goal was met in cross-validation but narrowly missed on the held-out data. Further fine-tuning and optimization could potentially close this gap.
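
One direction for such fine-tuning, offered as a sketch rather than a tested result, is to adjust the decision threshold instead of using the default of 0.5. In the real notebook the scores would come from something like `lr.predict_proba(...)[:, 1]` on a validation split (not the test set); the labels and probabilities below are made up for illustration, and `f1_at_threshold` is a hypothetical helper.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 of the positive class when predicting 1 for scores >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical validation labels and predicted toxic probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.55, 0.60, 0.65, 0.70, 0.90]

thresholds = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
results = {t: f1_at_threshold(y_true, scores, t) for t in thresholds}
best_threshold = max(results, key=results.get)
best_f1 = results[best_threshold]
print(best_threshold, round(best_f1, 3))
```

On this toy data a threshold above 0.5 trades a little recall for better precision; whether the same holds for the real model would need to be checked on held-out validation data.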