**Dataset**
Labeled dataset collected from Twitter (Lab 1 - Hate Speech.tsv)

**Objective**
Distinguish tweets that contain hate speech from other tweets. <br>
0 -> no hate speech <br>
1 -> contains hate speech <br>

**Total Estimated Time = 90-120 Mins**

**Evaluation metric**
Macro F1 score

### Import used libraries

In [1]:
pip install optuna

Collecting optuna
  Downloading optuna-3.6.1-py3-none-any.whl (380 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.13.1-py3-none-any.whl (233 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.8.2-py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.5-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, alembic, optuna
Successfully installed Mako-1.3.5 alembic-1.13.1 colorlog-6.8.2 optuna-3.6.1


In [105]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import re
import string
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
import optuna
import tqdm
import spacy
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import numpy as np

In [3]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 500)

### Load Dataset

###### Note: search how to load the data from tsv file

In [123]:
df = pd.read_csv("Lab 1 - Hate Speech.tsv", sep= "\t")
df.head()

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦ |
| 4 | 5 | 0 | factsguide: society now #motivation |


### Data splitting

It is good practice to split the data before EDA. Doing so helps maintain the integrity of the machine learning process, prevents data leakage, simulates real-world scenarios more accurately, and ensures that model performance is evaluated reliably on unseen data.

In [169]:
x = df.tweet

In [170]:
y = df.label

In [171]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [172]:
x_train.head()

18275              @user you were my childhood, why are you suppoing those who want to maintain a tax advantage over first time buyers?  
7544     note for those who call #wn  for opposing the latest @user ad: damn right we are! get that filth off the net, you race-traitors-
18342                                                                                                    i almost always trust brazilians
20301                                                  people want to love in the moment, they don't want that forever love ð¯ #truth  
15034                                                                    set of 6 glass ... gbp 24.99 get here:  #shop #cool   #home #fun
Name: tweet, dtype: object

### EDA on training data

In [173]:
x_train.info()

<class 'pandas.core.series.Series'>
Index: 25228 entries, 18275 to 2732
Series name: tweet
Non-Null Count  Dtype 
--------------  ----- 
25228 non-null  object
dtypes: object(1)
memory usage: 394.2+ KB


- check NaNs

In [174]:
x_train.isnull().sum()

0

- check duplicates

In [175]:
x_train.duplicated().sum()

1832

In [176]:
duplicates = x_train[x_train.duplicated()]
duplicates

9280                             #model   i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
27552                            #model   i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
16238                           #flagday2016   #flag #day #2016 #(30 #photos) buy things about "flag day 2016": â¦  
30272               @user #feminismiscancer #feminismisterrorism #feminismmuktbharat why  #malevote is ignored  @user
21290              i finally found a way how to delete old tweets! you might find it useful as well:    #deletetweets
                                                             ...                                                     
19721                     ð #smile #smiling top.tags #toptags #smiles #beautifulsmile #smiley #smilee #pretty  â¦
797                              #model   i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
7599     #zaynmalik   bull up: you will dominate your bu

- show a representative sample of data texts to find out required preprocessing steps

In [177]:
x_train.head(20)

18275                      @user you were my childhood, why are you suppoing those who want to maintain a tax advantage over first time buyers?  
7544             note for those who call #wn  for opposing the latest @user ad: damn right we are! get that filth off the net, you race-traitors-
18342                                                                                                            i almost always trust brazilians
20301                                                          people want to love in the moment, they don't want that forever love ð¯ #truth  
15034                                                                            set of 6 glass ... gbp 24.99 get here:  #shop #cool   #home #fun
23579                                          thank you very much for the s!!! :) :)  @user @user @user    fridayeveryone!   #friday! friday! :)
15981                                                                                                                       

- check dataset balancing

In [178]:
y_train.unique()

array([0, 1])

In [179]:
y_train[y_train == 0].count()

23467

In [180]:
not_hate = y_train[y_train == 0].count()
hate = y_train[y_train == 1].count()
not_hate, hate

(23467, 1761)
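To make the imbalance concrete, the short sketch below turns the two counts printed above into a class share and a ratio. The variable names are illustrative only; the numbers are copied from the output of the previous cell.

```python
# Counts copied from the cell output above
not_hate, hate = 23467, 1761

total = not_hate + hate
hate_share = hate / total              # fraction of training tweets labeled 1
ratio = not_hate / hate                # how many label-0 tweets per label-1 tweet
majority_accuracy = not_hate / total   # accuracy of always predicting 0

print(f"hate share: {hate_share:.1%}")
print(f"ratio: {ratio:.1f} : 1")
print(f"always-predict-0 accuracy: {majority_accuracy:.1%}")
```

Because a model that never predicts hate speech would still score a high accuracy on data this skewed, accuracy alone is a weak signal here, which is consistent with the choice of macro F1 as the evaluation metric.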

- The cleaning and preprocessing steps are:
    - Drop duplicates.
    - Remove some special characters and emojis.
       - Remove =>      @user, #, &, -, _, :, ?, ","
       - Keep   =>      !
    - Remove numbers.
    - Lowercase the text, collapse extra whitespace, and apply `clean` from cleantext.

### Cleaning and Preprocessing

#### Extra: use custom scikit-learn Transformers

Custom scikit-learn transformers give you flexibility, reusability, and control over the data transformation process. Because they plug into scikit-learn's pipelines, you can combine multiple preprocessing steps and the model into a single workflow, which makes the code more modular, readable, and easier to maintain.

##### link: https://www.andrewvillazon.com/custom-scikit-learn-transformers/

#### Example usage:

In [181]:
pip install cleantext



In [182]:
from sklearn.base import BaseEstimator, TransformerMixin
from cleantext import clean

class CustomTransformer(BaseEstimator, TransformerMixin):
    """Clean raw tweet text; expects a pandas Series of strings."""

    def __init__(self):
        pass

    def fit(self, X, y=None):
        # Nothing to learn: the preprocessing is stateless
        return self

    def transform(self, X):
        # Apply the cleaning function to every tweet
        transformed_X = X.copy()
        transformed_X = transformed_X.apply(self.preprocess)
        return transformed_X

    def preprocess(self, text):

        text = text.lower()

        # Remove @mentions and the '#' character (the hashtag word itself is kept)
        text = re.sub(r'@\w+|#', '', text)

        # Remove punctuation but keep '!', since it may signal angry speech
        punctuation = string.punctuation.replace('!', '')
        text = text.translate(str.maketrans('', '', punctuation))

        # Remove numbers
        text = re.sub(r'\d+', '', text)

        # Apply cleantext's default cleaning
        text = clean(text)

        text = text.strip()

        # Collapse runs of whitespace into a single space
        preprocessed_text = re.sub(r'\s+', ' ', text)

        return preprocessed_text

    def fit_transform(self, X, y=None):
        # Combines fit and transform in one call
        self.fit(X, y)
        return self.transform(X)
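To see what the regex and punctuation steps do on their own, here is a minimal sketch that reproduces only those steps on a made-up tweet. It deliberately leaves out the `clean` call from cleantext, so its output will differ from what `CustomTransformer` produces; `basic_clean` and the sample string are hypothetical names introduced for illustration.

```python
import re
import string

def basic_clean(text):
    """Hypothetical helper: the regex/punctuation steps only, without cleantext."""
    text = text.lower()
    # Drop @mentions and the '#' sign (the hashtag word stays)
    text = re.sub(r'@\w+|#', '', text)
    # Drop all punctuation except '!'
    keep_bang = string.punctuation.replace('!', '')
    text = text.translate(str.maketrans('', '', keep_bang))
    # Drop digits
    text = re.sub(r'\d+', '', text)
    # Collapse whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

sample = "@user thanks for #lyft credit, set of 6 glass!!! :)"
print(basic_clean(sample))  # thanks for lyft credit set of glass!!!
```

The exclamation marks survive while the mention, the hash sign, the comma, the emoticon, and the digit are removed.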

In [183]:
combined = pd.concat([x_train, y_train], axis=1)

In [184]:
combined.duplicated().sum()

1832

In [185]:
combined = combined.drop_duplicates()

In [186]:
combined.duplicated().sum()

0

In [187]:
combined.iloc[:, 0]

18275              @user you were my childhood, why are you suppoing those who want to maintain a tax advantage over first time buyers?  
7544     note for those who call #wn  for opposing the latest @user ad: damn right we are! get that filth off the net, you race-traitors-
18342                                                                                                    i almost always trust brazilians
20301                                                  people want to love in the moment, they don't want that forever love ð¯ #truth  
15034                                                                    set of 6 glass ... gbp 24.99 get here:  #shop #cool   #home #fun
                                                                       ...                                                               
13123                                                   all ready to pay xx #saturday #daughter #love #pay #igers #instagood   #goodtimes
19648                             

In [188]:
x_train = combined.iloc[:, 0]
y_train = combined.iloc[:, 1]

In [189]:
len(x_train), len(y_train)

(23396, 23396)

**You are doing great so far!**

### Modelling

#### Extra: use scikit-learn pipeline

##### link: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Using pipelines in scikit-learn promotes better code organization, reproducibility, and efficiency in machine learning workflows.

#### Example usage:

In [25]:
from sklearn.pipeline import Pipeline

model = LogisticRegression()

# Create the pipeline
pipeline = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('Vectorizing', CountVectorizer()),
    ('model', model),
])

# Now you can use the pipeline for training and prediction
# pipeline.fit(X_train, y_train)
# pipeline.predict(X_test)

In [26]:
pipeline.fit(x_train, y_train)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [27]:
predictions = pipeline.predict(x_test)

#### Evaluation

**Evaluation metric:**
macro f1 score

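Macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. This makes it a useful metric for evaluating the overall performance of a classification model, **particularly when the classes are imbalanced**, as they are in this dataset.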

![Calculation](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/639c3d934e82c1195cdf3c60_macro-f1.webp)
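The calculation can be sketched by hand in plain Python. This is a minimal illustration on made-up labels, not the scikit-learn implementation; `f1_for_class` and `macro_f1` are hypothetical helper names.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Made-up, imbalanced example: eight tweets of class 0, two of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 0.8
print(macro_f1(y_true, y_pred))  # 0.6875
```

In this toy case the per-class F1 scores are 0.875 for class 0 and 0.5 for class 1, so the macro average is noticeably lower than the accuracy: the weak performance on the minority class is not hidden by the majority class.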

In [28]:
f1_score(y_test, predictions, average='macro')

0.8181435309973045

### Enhancement

- Using different N-grams
- Using different text representation techniques
- Hyperparameter tuning

**N_Grams**

In [29]:
pipeline1 = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('Vectorizing', CountVectorizer(ngram_range=(1, 4))),
    ('model', model),
])

pipeline1.fit(x_train, y_train)
predictions = pipeline1.predict(x_test)

f1_score(y_test, predictions, average='macro')

0.7986669892058218

**Other Text Representation Techniques**
- TfidfVectorizer
- spaCy (GloVe)
- Gensim

In [190]:
pipeline1 = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('Vectorizing', TfidfVectorizer()),
    ('model', LogisticRegression()),
])

pipeline1.fit(x_train, y_train)
predictions = pipeline1.predict(x_test)

f1_score(y_test, predictions, average='macro')

0.6898352457431646

In [30]:
! python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


**GLOVE**

In [31]:
nlp = spacy.load('en_core_web_md')

In [160]:
x1 = CustomTransformer().fit_transform(x_train)
x1.head()

18275          childhood suppo want maintain tax advantag first time buyer
7544     note call wn oppos latest ad damn right get filth net racetraitor
18342                                         almost alway trust brazilian
20301               peopl want love moment dont want forev love ð¯ truth
15034                                 set glass gbp get shop cool home fun
Name: tweet, dtype: object

In [118]:
len(x1), len(y_train)

(23397, 25228)

In [33]:
x_train_v = np.zeros((len(x_train), 300))
x_test_v = np.zeros((len(x_test), 300))

for i, doc in enumerate(nlp.pipe(x1)):
    x_train_v[i, :] = doc.vector

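# Caution: x_test is vectorized from raw text here, while the training vectors come from the cleaned x1
for i, doc in enumerate(nlp.pipe(x_test)):
    x_test_v[i, :] = doc.vector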

In [34]:
model = LogisticRegression(max_iter=500)
model.fit(x_train_v, y_train)
predictions = model.predict(x_test_v)

f1_score(y_test, predictions, average='macro')

0.6251364538644731

**Gensim**

In [166]:
train_sent = [row.split(".") for row in x1]
train_tokenized = [sublist[0].split() for sublist in train_sent]

word2vec = Word2Vec(train_tokenized,
                          min_count=3,
                          vector_size=100,
                          window=3,
                          sg=0)

In [86]:
pd.Series(train_tokenized)

0                 [childhood, suppo, want, maintain, tax, advantag, first, time, buyer]
1        [note, call, wn, oppos, latest, ad, damn, right, get, filth, net, racetraitor]
2                                                     [almost, alway, trust, brazilian]
3                     [peopl, want, love, moment, dont, want, forev, love, ð¯, truth]
4                                         [set, glass, gbp, get, shop, cool, home, fun]
                                              ...                                      
23391         [readi, pay, xx, saturday, daughter, love, pay, iger, instagood, goodtim]
23392                   [spend, afternoon, cant, take, posit, goodvib, memyselfandiâ¦]
23393                               [differ, type, recip, one, easi, make, walk, think]
23394                                             [jon, ran, squirrel, ð, done, wtf]
23395                                        [control, battl, loud, soft, suppoiv, via]
Length: 23396, dtype: object

In [59]:
word2vec.wv[0]

array([-0.36470422,  0.87039053, -0.00648388, -0.05554286, -0.16097887,
       -1.3730615 , -0.09811382,  1.4306861 , -0.7076815 , -0.53736824,
        0.27361578, -0.9230445 , -0.12791973,  0.27494735,  0.04590406,
       -0.48602778, -0.13009144, -0.16108018,  0.31629166, -1.2647496 ,
        0.12621203,  0.51611763,  0.36407587, -0.4276884 , -0.30173057,
       -0.25948307, -0.292025  ,  0.458892  , -0.5321031 , -0.3511609 ,
        0.18481325, -0.2164476 ,  0.27778938, -0.77444494, -0.20947906,
        0.8119903 ,  0.17909867, -0.60880923, -0.5475136 , -0.62688655,
        0.30641574, -0.30525425, -1.1008073 , -0.09442446,  0.47591054,
        0.03392561, -0.6873704 ,  0.02381719,  0.47162238,  0.41128257,
       -0.18784167, -0.4126852 , -0.10907754,  0.3179588 , -0.1526898 ,
        0.4122841 , -0.02151402, -0.29560632, -0.44754672,  0.7314956 ,
        0.74504846,  0.05091804, -0.40987176, -0.28479385, -0.43422192,
        0.4334904 ,  0.16306376,  0.73945683, -0.6960311 ,  1.00

In [90]:
# https://www.kaggle.com/code/kstathou/word-embeddings-logistic-regression

def document_vector(doc):
    """Create document vectors by averaging word vectors. Remove out-of-vocabulary words."""
    doc = [word for word in doc if word in word2vec.wv.key_to_index]

    if not doc:
        return np.zeros(word2vec.vector_size)
    return np.mean([word2vec.wv[word] for word in doc], axis=0)

In [149]:
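x1['doc_vector'] = pd.Series(train_tokenized).apply(document_vector)
# Caution: x_test still holds raw, untokenized strings, so document_vector
# iterates over their characters rather than over word tokens.
x_test['doc_vector'] = x_test.apply(document_vector)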

In [150]:
x1['doc_vector']

0        [-0.05541523, 0.26788476, 0.11543044, 0.0048089437, 0.095442444, -0.43380132, 0.10923269, 0.49446762, -0.06488189, -0.15599471, -0.1183351, -0.33013192, -0.09655377, 0.04567255, 0.10300186, -0.18212761, -0.004630252, -0.33735475, 0.003000922, -0.4562627, 0.2594765, 0.1450267, 0.2492792, -0.048775237, -0.06430552, 0.07769963, -0.23772484, -0.090306066, -0.24443446, 0.07717265, 0.20778905, 0.062358305, 0.14810404, -0.1509816, -0.13534695, 0.25243106, -0.05178558, -0.29308784, -0.12242588, -0.4...
1        [-0.084377006, 0.24851449, 0.11342337, -0.014248826, 0.09511699, -0.4640367, 0.1354576, 0.53322965, -0.07179531, -0.15465583, -0.13806377, -0.34910148, -0.09902184, 0.06366843, 0.11396919, -0.16861427, 0.022879977, -0.35881862, -0.029306363, -0.49346122, 0.2568024, 0.13411511, 0.24803543, -0.07017474, -0.008812878, 0.056000523, -0.2055588, -0.122456096, -0.2621875, 0.09598617, 0.24822788, 0.08756009, 0.1517838, -0.17344034, -0.13915189, 0.310237, -0.0033930843, -0.32084984, -0.

In [151]:
X_train = np.vstack(x1['doc_vector'].values)
X_test = np.vstack(x_test['doc_vector'].values)

In [152]:
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

f1_score(y_test, predictions, average='macro')

0.4814175300115113
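When averaging word vectors, it matters whether the function receives a list of tokens or a raw string, because iterating over a Python string yields single characters. The sketch below illustrates the difference with a tiny made-up embedding table standing in for `word2vec.wv`; `toy_vectors` and `average_vector` are hypothetical names, and no claim is made about how this would change the score above.

```python
import numpy as np

# Hypothetical toy embedding table standing in for word2vec.wv
toy_vectors = {
    "love": np.array([1.0, 0.0]),
    "truth": np.array([0.0, 1.0]),
    "a": np.array([5.0, 5.0]),
}

def average_vector(tokens, vectors, size=2):
    """Average the vectors of the in-vocabulary items of `tokens`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

text = "love a truth"
from_tokens = average_vector(text.split(), toy_vectors)  # iterates over words
from_chars = average_vector(text, toy_vectors)           # iterates over characters

print(from_tokens)
print(from_chars)
```

With the token list, all three words contribute to the average; with the raw string, only the single character `a` matches the vocabulary, so the document vector is built from a very different set of entries.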

**Use Optuna To Tune Hyperparameters**

In [156]:
x_trans = CustomTransformer().fit_transform(x_train)

In [157]:
def objective(trial):
    lg_c = trial.suggest_float('C', 1e-2, 1)
    lg_tol = trial.suggest_float('tol', 1e-6 , 1e-3)
    lg_solver = trial.suggest_categorical('solver' , ['newton-cg', 'lbfgs','liblinear'])

    lg = LogisticRegression(C=lg_c, tol=lg_tol, solver=lg_solver)

    x_vectorized = CountVectorizer().fit_transform(x_trans)

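    # No `scoring` argument is passed, so cross_val_score uses the estimator's
    # default scorer, which for LogisticRegression is accuracy (not macro F1).
    return cross_val_score(lg, x_vectorized, y_train, n_jobs=-1, cv=3).mean()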


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print('Best hyperparameters: ', study.best_params)
print('Best performance: ', study.best_value)

[I 2024-05-24 09:03:29,750] A new study created in memory with name: no-name-bf23e672-631c-4ebf-8aff-ad2a1806fc61
[I 2024-05-24 09:03:33,474] Trial 0 finished with value: 0.9462728779223805 and parameters: {'C': 0.23730231431052806, 'tol': 0.00033884158527558206, 'solver': 'liblinear'}. Best is trial 0 with value: 0.9462728779223805.
[I 2024-05-24 09:03:34,775] Trial 1 finished with value: 0.9459736833044371 and parameters: {'C': 0.2375206999645625, 'tol': 6.40684044493892e-05, 'solver': 'newton-cg'}. Best is trial 0 with value: 0.9462728779223805.
[I 2024-05-24 09:03:35,756] Trial 2 finished with value: 0.9476834138151947 and parameters: {'C': 0.2942652127328899, 'tol': 0.0006286063893570586, 'solver': 'liblinear'}. Best is trial 2 with value: 0.9476834138151947.
[I 2024-05-24 09:03:37,554] Trial 3 finished with value: 0.9526415281581609 and parameters: {'C': 0.767974105302214, 'tol': 0.000865955711937618, 'solver': 'lbfgs'}. Best is trial 3 with value: 0.9526415281581609.
[I 2024-05-

Best hyperparameters:  {'C': 0.997239296692065, 'tol': 0.0008082421095328949, 'solver': 'liblinear'}
Best performance:  0.9537528365472747
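The values logged above come from `cross_val_score` with its default scorer. If the goal is to tune directly for the lab's metric, one option is to pass `scoring="f1_macro"` and to put the vectorizer inside a pipeline so that it is refit within each fold. The sketch below is a minimal illustration under those assumptions; `cv_macro_f1` and the tiny corpus are made up, and the function is shown without Optuna so that it runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def cv_macro_f1(texts, labels, C=1.0, tol=1e-4, solver="liblinear", cv=3):
    """Hypothetical helper: mean cross-validated macro F1 for one configuration."""
    pipe = Pipeline(steps=[
        ("vectorizing", CountVectorizer()),  # refit on the training part of each fold
        ("model", LogisticRegression(C=C, tol=tol, solver=solver)),
    ])
    scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1_macro")
    return scores.mean()

# Tiny made-up corpus, only to show that the call runs
texts = ["good day", "nice day", "lovely day", "happy day", "great day", "fine day",
         "bad filth", "awful filth", "nasty filth", "vile filth", "gross filth", "foul filth"]
labels = [0] * 6 + [1] * 6

score = cv_macro_f1(texts, labels, C=0.5)
print(score)
```

Inside an Optuna objective, the same helper could be called with the suggested values, for example `return cv_macro_f1(x_trans, y_train, C=lg_c, tol=lg_tol, solver=lg_solver)`, so that the study maximizes macro F1 instead of accuracy.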


### Conclusion and final results


In [191]:
pipeline2 = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('Vectorizing', CountVectorizer()),
    ('model', LogisticRegression(C=0.997239296692065, tol=0.0008082421095328949, solver='liblinear')),
])

pipeline2.fit(x_train, y_train)
predictions = pipeline2.predict(x_test)

f1_score(y_test, predictions, average='macro')

0.8181435309973045

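- CountVectorizer (macro F1 ≈ 0.82) performed better than TfidfVectorizer (≈ 0.69), GloVe (≈ 0.63), and Gensim (≈ 0.48).
- Using unigrams only gave a better result than using n-grams up to 2, 3, 4, or 5: as the n-gram range increases, the F1 score decreases (for example, ≈ 0.80 with `ngram_range=(1, 4)`).
- The Optuna-tuned hyperparameters reached the same test macro F1 (≈ 0.818) as the default LogisticRegression.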

#### Done!