# Sentiment Analysis

The aim is to find hate speech in tweets that have been labelled as racist or sexist. A dataset of 31,962 labelled tweets is provided. Tweets are labelled with a '1' if they are racist or sexist, and a '0' if they are not.

**Loading the dataset**

In [80]:
import os
import pandas as pd

In [81]:
os.getcwd()

'/Users/meliscelik'

In [82]:
path = "/Users/meliscelik/Desktop/projects"
df = pd.read_csv(path + "/train_tweets_vidhya_2021.csv")
df.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [83]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [84]:
import numpy as np 
import re
import string 
import nltk
import warnings

%matplotlib inline
warnings.filterwarnings("ignore")

**Cleaning and preprocessing the data**

In [70]:
#function to identify and delete pattterns

def remove_pattern(input_text, pattern):
    matches = re.findall(pattern, input_text)
    for match in matches:
        input_text = re.sub(match, "", input_text)
    return input_text

In [71]:
df['clean_set'] = np.vectorize(remove_pattern)(df['tweet'], "@[\w]*")
df['clean_set'] = df['clean_set'].str.replace("[^a-zA-Z#]", " ")
df.head()

Unnamed: 0,id,label,tweet,clean_set
0,1,0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so sel...
1,2,0,@user @user thanks for #lyft credit i can't us...,thanks for #lyft credit i can't use cause th...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation,factsguide: society now #motivation


**Converting words to integers for modelling in next step**

In [72]:
toki_twt = df['clean_set'].apply(lambda x: x.split())
toki_twt.head()

0    [when, a, father, is, dysfunctional, and, is, ...
1    [thanks, for, #lyft, credit, i, can't, use, ca...
2                              [bihday, your, majesty]
3    [#model, i, love, u, take, with, u, all, the, ...
4             [factsguide:, society, now, #motivation]
Name: clean_set, dtype: object

In [74]:
from nltk.stem.porter import PorterStemmer

stem = PorterStemmer()
df['clean_set'] = toki_twt.apply(lambda sentence: [stem.stem(word) for word in sentence])
df['clean_set'] = toki_twt.apply(lambda sentence: " ".join(sentence))

toki_twt.head()

0    [when, a, father, is, dysfunctional, and, is, ...
1    [thanks, for, #lyft, credit, i, can't, use, ca...
2                              [bihday, your, majesty]
3    [#model, i, love, u, take, with, u, all, the, ...
4             [factsguide:, society, now, #motivation]
Name: clean_set, dtype: object

**Using TF-IDF for feature extraction**

In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer
# feature TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')

In [76]:
from sklearn.model_selection import train_test_split
x_train_tfidf, x_test_tfidf, y_train, y_test = train_test_split(tfidf_vectorizer.fit_transform(df['clean_set']), df['label'], random_state=17, test_size=0.3)

**Model training**

In [77]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score

model = LogisticRegression()
model.fit(x_train_tfidf, y_train)
pred = model.predict(x_test_tfidf)

**Checking model performance**

In [78]:
f1_score(y_test, pred)

0.5213675213675214

In [79]:
accuracy_score(y_test,pred)

0.9532797997705704