**Dataset**
Labeled dataset collected from Twitter (Lab 1 - Hate Speech.tsv)

**Objective**
Classify tweets as containing hate speech or not. <br>
0 -> no hate speech <br>
1 -> contains hate speech <br>

**Total Estimated Time = 90-120 Mins**

**Evaluation metric**
macro f1 score

### Import used libraries

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import re
import string
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
import optuna
import tqdm
import spacy
import gensim
from gensim.models import word2vec
from gensim.models import KeyedVectors



In [4]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 500)

### Load Dataset

###### Note: search how to load the data from tsv file

In [5]:
df = pd.read_csv("Lab 1 - Hate Speech.tsv", sep= "\t")
df.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation


### Data splitting

It is good practice to split the data before EDA. Doing so helps maintain the integrity of the machine learning process, prevents data leakage, simulates real-world scenarios more accurately, and ensures reliable model performance evaluation on unseen data.

In [6]:
x = df.tweet

In [7]:
y = df.label

In [8]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
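
Because the labels are imbalanced, it may be worth keeping the class ratio identical in the train and test parts. The sketch below is one possible way to do that with the `stratify` argument of `train_test_split`; the toy DataFrame and the `toy_*` names are made up for illustration and are not part of the lab data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the tweets: 90 negatives and 10 positives.
toy = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(100)],
    "label": [0] * 90 + [1] * 10,
})

# stratify=... asks train_test_split to preserve the label ratio in both parts.
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy["tweet"], toy["label"],
    test_size=0.2, random_state=0, stratify=toy["label"],
)

# Share of positive labels in each part
print(toy_train_y.mean(), toy_test_y.mean())
```

Without `stratify`, a random split of a rare class can leave the test set with noticeably more or fewer positive examples than the training set.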

In [9]:
x_train.head()

18275              @user you were my childhood, why are you suppoing those who want to maintain a tax advantage over first time buyers?  
7544     note for those who call #wn  for opposing the latest @user ad: damn right we are! get that filth off the net, you race-traitors-
18342                                                                                                    i almost always trust brazilians
20301                                                  people want to love in the moment, they don't want that forever love ð¯ #truth  
15034                                                                    set of 6 glass ... gbp 24.99 get here:  #shop #cool   #home #fun
Name: tweet, dtype: object

### EDA on training data

In [10]:
x_train.info()

<class 'pandas.core.series.Series'>
Index: 25228 entries, 18275 to 2732
Series name: tweet
Non-Null Count  Dtype 
--------------  ----- 
25228 non-null  object
dtypes: object(1)
memory usage: 394.2+ KB


- check NaNs

In [11]:
x_train.isnull().sum()

0

- check duplicates

In [12]:
x_train.duplicated().sum()

1832

In [13]:
duplicates = x_train[x_train.duplicated()]
duplicates

9280                             #model   i love u take with u all the time in urð±!!! ðððð
ð¦ð¦ð¦
27552                            #model   i love u take with u all the time in urð±!!! ðððð
ð¦ð¦ð¦
16238                           #flagday2016   #flag #day #2016 #(30 #photos) buy things about "flag day 2016": â¦  
30272               @user #feminismiscancer #feminismisterrorism #feminismmuktbharat why  #malevote is ignored  @user
21290              i finally found a way how to delete old tweets! you might find it useful as well:    #deletetweets
                                                             ...                                                     
19721                     ð #smile #smiling top.tags #toptags #smiles #beautifulsmile #smiley #smilee #pretty  â¦
797                              #model   i love u take with u all the time in urð±!!! ðððð
ð¦ð¦ð¦
7599     #zaynmalik   bull up: you will dominate your bu

- show a representative sample of data texts to find out required preprocessing steps

In [14]:
x_train.head(20)

18275                      @user you were my childhood, why are you suppoing those who want to maintain a tax advantage over first time buyers?  
7544             note for those who call #wn  for opposing the latest @user ad: damn right we are! get that filth off the net, you race-traitors-
18342                                                                                                            i almost always trust brazilians
20301                                                          people want to love in the moment, they don't want that forever love ð¯ #truth  
15034                                                                            set of 6 glass ... gbp 24.99 get here:  #shop #cool   #home #fun
23579                                          thank you very much for the s!!! :) :)  @user @user @user    fridayeveryone!   #friday! friday! :)
15981                                                                                                                       

- check dataset balancing

In [15]:
y_train.unique()

array([0, 1], dtype=int64)

In [16]:
y_train[y_train == 0].count()

23467

In [17]:
not_hate = y_train[y_train == 0].count()
hate = y_train[y_train == 1].count()
not_hate, hate

(23467, 1761)
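
To make the imbalance concrete, the short calculation below uses the two counts from the output above. The names `hate_share` and `majority_accuracy` are introduced here only for illustration.

```python
# Class counts copied from the cell output above.
not_hate, hate = 23467, 1761
total = not_hate + hate

hate_share = hate / total              # fraction of tweets labelled 1
majority_accuracy = not_hate / total   # accuracy of always predicting 0

print(f"hate share: {hate_share:.3f}")
print(f"always-0 accuracy: {majority_accuracy:.3f}")
```

Roughly 7% of the training tweets are labelled as hate speech, so a model that always predicts 0 would already reach about 93% accuracy. This is why accuracy alone is a weak signal on this dataset.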

- The cleaning and preprocessing steps are:
    - Drop duplicates
    - Remove some special characters and emojis
       - Remove =>      @user, #, &, -, _, :, ?, ","
       - Keep   =>      !, :)
    - Lowercase the text, remove numbers, and collapse repeated whitespace

### Cleaning and Preprocessing

#### Extra: use custom scikit-learn Transformers

Using custom transformers in scikit-learn provides flexibility, reusability, and control over the data transformation process, allowing you to seamlessly integrate with scikit-learn's pipelines, enabling you to combine multiple preprocessing steps and modeling into a single workflow. This makes your code more modular, readable, and easier to maintain.

##### link: https://www.andrewvillazon.com/custom-scikit-learn-transformers/

#### Example usage:

In [18]:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    """Text-cleaning step that expects a pandas Series of tweets."""

    def fit(self, X, y=None):
        # Nothing to learn: the cleaning rules are fixed
        return self

    def transform(self, X):
        # Apply the cleaning function to every tweet (returns a new Series)
        return X.apply(self.preprocess)

    def preprocess(self, text):
        text = text.lower()

        # Remove @mentions (e.g. @user) and the '#' symbol, keeping the hashtag word
        text = re.sub(r'@\w+|#', '', text)

        # Remove punctuation but keep '!', which may signal angry speech
        punctuation = string.punctuation.replace('!', '')
        text = text.translate(str.maketrans('', '', punctuation))

        # Remove numbers
        text = re.sub(r'\d+', '', text)

        # Trim and collapse repeated whitespace
        text = text.strip()
        preprocessed_text = re.sub(r'\s+', ' ', text)

        return preprocessed_text

    # fit_transform is inherited from TransformerMixin
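
To see what these cleaning rules do to a single tweet, here is a minimal standalone sketch. `clean_tweet` is a hypothetical helper that repeats the same steps outside the class, and the sample tweet is made up.

```python
import re
import string

def clean_tweet(text):
    """Standalone copy of the preprocessing steps, for quick experiments."""
    text = text.lower()
    text = re.sub(r'@\w+|#', '', text)                   # drop @mentions and '#'
    no_bang = string.punctuation.replace('!', '')        # all punctuation except '!'
    text = text.translate(str.maketrans('', '', no_bang))
    text = re.sub(r'\d+', '', text)                      # drop digits
    return re.sub(r'\s+', ' ', text.strip())             # normalise whitespace

sample = "@user Thanks for #lyft credit!!! 24.99 :)"    # made-up example tweet
print(clean_tweet(sample))  # thanks for lyft credit!!!
```

Note that the smiley disappears: `:` and `)` are both in `string.punctuation`, so these rules do not keep `:)` even though the plan above lists it as kept. Handling emoticons would need an extra rule before the punctuation step.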

In [19]:
combined = pd.concat([x_train, y_train], axis=1)

In [20]:
combined.duplicated().sum()

1832

In [21]:
combined = combined.drop_duplicates()

In [22]:
combined.duplicated().sum()

0

In [23]:
combined.iloc[:, 0]

18275              @user you were my childhood, why are you suppoing those who want to maintain a tax advantage over first time buyers?  
7544     note for those who call #wn  for opposing the latest @user ad: damn right we are! get that filth off the net, you race-traitors-
18342                                                                                                    i almost always trust brazilians
20301                                                  people want to love in the moment, they don't want that forever love ð¯ #truth  
15034                                                                    set of 6 glass ... gbp 24.99 get here:  #shop #cool   #home #fun
                                                                       ...                                                               
13123                                                   all ready to pay xx #saturday #daughter #love #pay #igers #instagood   #goodtimes
19648                             

In [24]:
x_train = combined.iloc[:, 0]
y_train = combined.iloc[:, 1]

**You are doing great so far!**

### Modelling

#### Extra: use scikit-learn pipeline

##### link: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Using pipelines in scikit-learn promotes better code organization, reproducibility, and efficiency in machine learning workflows.

#### Example usage:

In [25]:
from sklearn.pipeline import Pipeline

model = LogisticRegression()

# Create the pipeline
pipeline = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('Vectorizing', CountVectorizer()),
    ('model', model),
])

# Now you can use the pipeline for training and prediction
# pipeline.fit(X_train, y_train)
# pipeline.predict(X_test)

In [26]:
pipeline.fit(x_train, y_train)

In [27]:
predictions = pipeline.predict(x_test)

#### Evaluation

**Evaluation metric:**
macro f1 score

Macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. This makes it a useful metric for evaluating the overall performance of a multi-class classification model, **particularly when the classes are imbalanced**.

![Calculation](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/639c3d934e82c1195cdf3c60_macro-f1.webp)
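
The calculation can also be written out by hand. The following is a small pure-Python sketch on made-up labels; `f1_for_class` and `macro_f1` are illustrative helper names, and `f1_score(..., average='macro')` from scikit-learn should give the same value on the same inputs.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 0]
print(macro_f1(toy_true, toy_pred))  # 0.625
```

Here class 0 has an F1 of 0.75 and class 1 an F1 of 0.5, and the macro score is their plain average, so the minority class has as much influence as the majority class.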

In [29]:
f1_score(y_test, predictions, average='macro')

0.8040956691515797

### Enhancement

- Using different N-grams
- Using different text representation technique
- Hyperparameter tuning

**N_Grams**

In [30]:
pipeline1 = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('Vectorizing', CountVectorizer(ngram_range=(1, 4))),
    ('model', model),
])

pipeline1.fit(x_train, y_train)
predictions = pipeline1.predict(x_test)

f1_score(y_test, predictions, average='macro')

0.7763281261733274
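
One plausible reason for the drop is that longer n-grams multiply the number of features while most of them occur only once or twice. The sketch below counts word n-grams for one training tweet using a simple whitespace split. `word_ngrams` is a hypothetical helper, not what `CountVectorizer` runs internally; for example, `CountVectorizer`'s default token pattern ignores single-character tokens such as "i".

```python
def word_ngrams(text, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

sentence = "i almost always trust brazilians"
print(len(word_ngrams(sentence, 1, 1)))  # 5
print(len(word_ngrams(sentence, 1, 4)))  # 14
```

A five-word tweet already produces almost three times as many features with n-grams up to length 4, and across thousands of tweets the vocabulary grows much faster than the amount of training data.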

**Other Text Representation Techniques**
- spaCy
- Gensim

In [None]:
nlp = spacy.load('en_core_web_md')
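
A common way to turn word embeddings into a fixed-size feature vector per tweet is to average the vectors of its words. The sketch below illustrates the idea with a tiny made-up embedding table so that it runs without downloading a model. `toy_vectors` and `mean_vector` are hypothetical names. With spaCy, `nlp(text).vector` gives a comparable averaged vector by default, and with Gensim the per-word vectors could be looked up in a `KeyedVectors` object instead of the dictionary.

```python
# Hypothetical 3-dimensional embeddings; real models use hundreds of dimensions.
toy_vectors = {
    "love":   [0.9, 0.1, 0.0],
    "hate":   [-0.8, 0.2, 0.1],
    "people": [0.0, 0.5, 0.5],
}

def mean_vector(text, vectors, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:
        return [0.0] * dim          # no known word: fall back to a zero vector
    return [sum(column) / len(found) for column in zip(*found)]

print(mean_vector("people love people", toy_vectors))
```

The resulting vectors could then be passed to `LogisticRegression` in place of the `CountVectorizer` output.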

**Use Optuna To Tune Hyperparameters**

In [56]:
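def objective(trial):
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
    lg_c = trial.suggest_float('C', 1e-2, 1, log=True)
    lg_tol = trial.suggest_float('tol', 1e-6, 1e-3, log=True)
    lg_solver = trial.suggest_categorical('solver', ['newton-cg', 'lbfgs', 'liblinear'])

    lg = LogisticRegression(C=lg_c, tol=lg_tol, solver=lg_solver)

    x1 = CustomTransformer().fit_transform(x_train)
    x2 = CountVectorizer().fit_transform(x1)

    # cross_val_score uses the estimator's default scorer (accuracy here),
    # so the values below are not directly comparable to the macro F1 scores
    return cross_val_score(lg, x2, y_train, n_jobs=-1, cv=3).mean()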


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print('Best hyperparameters: ', study.best_params)
print('Best performance: ', study.best_value)

[I 2024-05-23 09:46:00,902] A new study created in memory with name: no-name-e8220931-cc92-4b9a-b747-6300fc6e4077
[I 2024-05-23 09:46:02,393] Trial 0 finished with value: 0.9519576852824936 and parameters: {'C': 0.6662531299067997, 'tol': 0.0004726286135837765, 'solver': 'liblinear'}. Best is trial 0 with value: 0.9519576852824936.
[I 2024-05-23 09:46:03,603] Trial 1 finished with value: 0.9439648294282099 and parameters: {'C': 0.1506769249993073, 'tol': 0.00047444989367019606, 'solver': 'lbfgs'}. Best is trial 0 with value: 0.9519576852824936.
[I 2024-05-23 09:46:05,380] Trial 2 finished with value: 0.9455035540021347 and parameters: {'C': 0.20357177189116119, 'tol': 0.000412332

Best hyperparameters:  {'C': 0.9941607778943732, 'tol': 1.2468661945980451e-05, 'solver': 'liblinear'}
Best performance:  0.9539238260412292
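
The study above scores each trial with `cross_val_score`'s default scorer, which is accuracy for a classifier, and the vectorizer is fitted on all training tweets before cross-validation. A possible alternative is to cross-validate the whole pipeline with `scoring="f1_macro"`, so the tuning target matches the evaluation metric and the vocabulary is rebuilt inside each fold. The sketch below shows this on a tiny made-up corpus; `toy_texts`, `toy_labels` and `cv_pipeline` are illustrative names, and the preprocessing step is left out to keep the block self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Tiny made-up corpus so the sketch runs on its own.
toy_texts = ["good day", "nice people", "lovely day", "kind people",
             "good people", "nice day",
             "awful hate", "hate them", "awful people", "hate hate",
             "awful day hate", "them hate"]
toy_labels = [0] * 6 + [1] * 6

cv_pipeline = Pipeline(steps=[
    ("vectorizing", CountVectorizer()),
    ("model", LogisticRegression()),
])

# The vectorizer is refit inside every fold because it is part of the pipeline.
scores = cross_val_score(cv_pipeline, toy_texts, toy_labels,
                         cv=3, scoring="f1_macro")
print(scores.mean())
```

Inside an Optuna objective, the mean of `scores` would be the value to return, with the suggested hyperparameters passed to `LogisticRegression`.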


### Conclusion and final results


In [32]:
pipeline2 = Pipeline(steps=[
    ('preprocessing', CustomTransformer()),
    ('Vectorizing', CountVectorizer()),
    ('model', LogisticRegression(C=0.9941607778943732, tol=1.2468661945980451e-05, solver='liblinear')),
])

pipeline2.fit(x_train, y_train)
predictions = pipeline2.predict(x_test)

f1_score(y_test, predictions, average='macro')

0.8031022489902082

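- CountVectorizer (macro F1 of about 0.80) performed better than TfidfVectorizer (about 0.60).
- Unigrams (n = 1) gave the best result. As the maximum n-gram length increased to 2, 3, 4, or 5, the macro F1 score decreased.
- The tuned hyperparameters did not improve the test macro F1 (0.803 versus 0.804 for the default LogisticRegression).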

#### Done!