**Data preprocessing**

In [2]:
import pandas as pd
#path of hate.csv file is /content/drive/MyDrive/Colab Notebooks/hate.csv
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/hate.csv', encoding='latin1')

In [3]:
df = df.drop('Unnamed: 0', axis=1)


In [4]:
df

Unnamed: 0,comment,label
0,dalits are lowlives,N
1,gay people are a burden to our society,N
2,Arabs are not welcome,N
3,I'm not saying we should actually eliminate he...,N
4,bananas are for black people,N
...,...,...
41139,Indian culture is so backwards,N
41140,"Women don't want equality, they just want to b...",N
41141,fuck covid,P
41142,This computer is garbage,P


<h1>We'll remove unnecessary characters, punctuation, and symbols using regular expressions.</h1>


In [7]:
import re

def clean_text(text):
    # Remove everything except letters and whitespace (digits and punctuation are dropped)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text)  # Removing extra whitespaces
    return text.lower()  # Converting text to lowercase

df['cleaned_comment'] = df['comment'].apply(clean_text)
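
To make the effect of `clean_text` concrete, here is a small standalone sketch that repeats the function from the cell above and runs it on two made-up strings (the sample inputs are illustrative and are not taken from the dataset). It shows that digits are removed along with punctuation, and that trailing whitespace survives as a single space, so appending `.strip()` may be worth considering.

```python
import re

def clean_text(text):
    # Same logic as the notebook cell: keep only letters and whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    return text.lower()

# Illustrative inputs (assumptions, not rows from hate.csv)
samples = ["I'm NOT 100% sure!!", "Room 101 "]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```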


In [8]:
df

Unnamed: 0,comment,label,cleaned_comment
0,dalits are lowlives,N,dalits are lowlives
1,gay people are a burden to our society,N,gay people are a burden to our society
2,Arabs are not welcome,N,arabs are not welcome
3,I'm not saying we should actually eliminate he...,N,im not saying we should actually eliminate hee...
4,bananas are for black people,N,bananas are for black people
...,...,...,...
41139,Indian culture is so backwards,N,indian culture is so backwards
41140,"Women don't want equality, they just want to b...",N,women dont want equality they just want to be ...
41141,fuck covid,P,fuck covid
41142,This computer is garbage,P,this computer is garbage


<h1>We'll tokenize the text into individual words using NLTK's word_tokenize function.</h1>

In [9]:
!pip install nltk
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

df['tokens'] = df['cleaned_comment'].apply(word_tokenize)




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [10]:
df

Unnamed: 0,comment,label,cleaned_comment,tokens
0,dalits are lowlives,N,dalits are lowlives,"[dalits, are, lowlives]"
1,gay people are a burden to our society,N,gay people are a burden to our society,"[gay, people, are, a, burden, to, our, society]"
2,Arabs are not welcome,N,arabs are not welcome,"[arabs, are, not, welcome]"
3,I'm not saying we should actually eliminate he...,N,im not saying we should actually eliminate hee...,"[im, not, saying, we, should, actually, elimin..."
4,bananas are for black people,N,bananas are for black people,"[bananas, are, for, black, people]"
...,...,...,...,...
41139,Indian culture is so backwards,N,indian culture is so backwards,"[indian, culture, is, so, backwards]"
41140,"Women don't want equality, they just want to b...",N,women dont want equality they just want to be ...,"[women, dont, want, equality, they, just, want..."
41141,fuck covid,P,fuck covid,"[fuck, covid]"
41142,This computer is garbage,P,this computer is garbage,"[this, computer, is, garbage]"


<h1>We'll remove stop words using NLTK's stopwords list.</h1>


In [11]:
from nltk.corpus import stopwords

# Download NLTK stopwords
import nltk
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [token for token in tokens if token not in stop_words]

df['clean_tokens'] = df['tokens'].apply(remove_stopwords)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [12]:
df

Unnamed: 0,comment,label,cleaned_comment,tokens,clean_tokens
0,dalits are lowlives,N,dalits are lowlives,"[dalits, are, lowlives]","[dalits, lowlives]"
1,gay people are a burden to our society,N,gay people are a burden to our society,"[gay, people, are, a, burden, to, our, society]","[gay, people, burden, society]"
2,Arabs are not welcome,N,arabs are not welcome,"[arabs, are, not, welcome]","[arabs, welcome]"
3,I'm not saying we should actually eliminate he...,N,im not saying we should actually eliminate hee...,"[im, not, saying, we, should, actually, elimin...","[im, saying, actually, eliminate, heebs, wish,..."
4,bananas are for black people,N,bananas are for black people,"[bananas, are, for, black, people]","[bananas, black, people]"
...,...,...,...,...,...
41139,Indian culture is so backwards,N,indian culture is so backwards,"[indian, culture, is, so, backwards]","[indian, culture, backwards]"
41140,"Women don't want equality, they just want to b...",N,women dont want equality they just want to be ...,"[women, dont, want, equality, they, just, want...","[women, dont, want, equality, want, charge]"
41141,fuck covid,P,fuck covid,"[fuck, covid]","[fuck, covid]"
41142,This computer is garbage,P,this computer is garbage,"[this, computer, is, garbage]","[computer, garbage]"
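
Row 2 in the table above shows a side effect of the default stop-word list: "arabs are not welcome" becomes `[arabs, welcome]`, because "not" is itself a stop word. One possible adjustment, sketched below, is to keep negation words. The short `base_stop_words` set is a hand-written stand-in for `set(stopwords.words('english'))` so that the example runs on its own, and the choice of `negations` is an assumption rather than something the notebook does.

```python
# Small illustrative stop-word list; in the notebook this would be
# set(stopwords.words('english')) from NLTK.
base_stop_words = {"are", "is", "a", "to", "our", "not", "no", "we", "for"}

# Hypothetical choice: negation words to keep because they can flip meaning
negations = {"not", "no", "nor", "never"}
stop_words_keep_neg = base_stop_words - negations

def remove_stopwords_keep_negation(tokens):
    # Drop stop words, but leave negations in place
    return [t for t in tokens if t not in stop_words_keep_neg]

tokens = ["arabs", "are", "not", "welcome"]
result = remove_stopwords_keep_negation(tokens)
print(result)
```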


<h1>We can convert text data into numerical representation using techniques like Bag-of-Words (BoW) or TF-IDF (Term Frequency-Inverse Document Frequency).</h1>
<h4>from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Example using CountVectorizer
count_vectorizer = CountVectorizer()
X_count = count_vectorizer.fit_transform(df['cleaned_comment'])

# Example using TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['cleaned_comment'])

</h4>
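
As a rough illustration of what a Bag-of-Words matrix contains, the sketch below builds one by hand with the standard library. It only approximates what `CountVectorizer` computes (for example, it does not apply scikit-learn's tokenisation rules), and the second document is made up for the example.

```python
from collections import Counter

# First document appears in the dataset; the second is an illustrative assumption
docs = ["this computer is garbage", "garbage in garbage out"]

# Vocabulary: sorted unique words across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, aligned with the vocabulary order
bow = []
for doc in docs:
    counts = Counter(doc.split())
    bow.append([counts[word] for word in vocab])

print(vocab)
print(bow)
```

TF-IDF starts from the same counts and then down-weights words that appear in many documents.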

<h1>Model training</h1>




Map the string labels to integers: N = 0, O = 1, P = 2.

In [21]:
from sklearn.preprocessing import LabelEncoder

# Define label encoder
label_encoder = LabelEncoder()

# Convert string labels to numerical values
df['label_encoded'] = label_encoder.fit_transform(df['label'])

# Check the mapping between original labels and encoded values
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Label Mapping:", label_mapping)


Label Mapping: {'N': 0, 'O': 1, 'P': 2}


Checking the labels


In [14]:
print(df['label'].dtype)

object


In [15]:
df['label'].value_counts()

N    22158
P    18950
O       36
Name: label, dtype: int64
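
The counts above are heavily imbalanced: class O has 36 rows out of 41144. One common way to compensate is to weight each class inversely to its frequency. The sketch below computes such weights with the usual "balanced" heuristic; it is an optional addition, not something the notebook applies.

```python
# Class counts copied from the value_counts() output above
counts = {"N": 22158, "P": 18950, "O": 36}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count)
class_weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label, weight in class_weights.items():
    print(label, round(weight, 3))
```

If used with Keras, the weights would need to be keyed by the encoded class index (0, 1, 2) and passed through the `class_weight` argument of `model.fit`.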

In [16]:
df['label']

0        N
1        N
2        N
3        N
4        N
        ..
41139    N
41140    N
41141    P
41142    P
41143    N
Name: label, Length: 41144, dtype: object

Importing the necessary libraries


In [17]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

In [18]:
# Encode the three string labels (N, O, P) as integers 0-2
label_encoder = LabelEncoder()
df['label_encoded'] = label_encoder.fit_transform(df['label'])


In [26]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df['clean_tokens'], df['label_encoded'], test_size=0.2, random_state=42)


In [30]:
print(X_train)
print(y_train)

15232                               [head, fucking, hurts]
21775    [im, gobsmacked, behavious, tbh, simply, doesn...
22060    [messy, curly, fade, classic, ladyboy, haircut...
37800                     [muslims, terrorists, murderers]
6010                   [yeah, bit, naive, fuck, im, sorry]
                               ...                        
6265     [always, associate, watermelon, muslims, id, s...
11284            [pakkis, always, smell, like, shit, piss]
38158    [chinese, new, year, zodiacal, rabbit, one, tw...
860                      [benders, arent, welcome, around]
15795                                 [youre, awfulwaffle]
Name: clean_tokens, Length: 32915, dtype: object
15232    2
21775    0
22060    0
37800    0
6010     2
        ..
6265     2
11284    0
38158    2
860      0
15795    2
Name: label_encoded, Length: 32915, dtype: int64


In [None]:
# Tokenize text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_test = tokenizer.texts_to_sequences(X_test)

In [31]:
print(sequences_train)


[[466, 6, 1994], [9, 11836, 9871, 1028, 397, 87, 128, 360, 11837, 11838, 11839, 121, 42, 108, 673, 3279, 226, 345, 94, 227, 3424, 385, 18079, 1, 18080, 71, 2, 470], [3742, 7339, 6537, 1712, 18081, 5942, 407, 617, 5079, 208, 45], [27, 418, 4440], [252, 368, 3151, 20, 9, 313], [1, 196, 21, 26, 138, 119, 66, 1278, 522, 482, 1, 19, 522, 674, 1303, 286, 674, 1303, 4730, 1427, 238, 210, 26, 191, 18082, 1802, 122, 101, 1427, 5444, 18083, 8306, 85], [16, 7340, 6, 816, 466], [668, 1638, 24, 20, 587], [115, 57, 19], [63, 6, 346, 6538, 1802, 6, 7341, 1574, 1403, 24, 34, 95, 4190, 2275, 1605, 263], [370, 137, 3939, 2901, 11840, 1803, 3743, 473, 3940, 1015, 3579, 119, 3579, 38, 23, 451, 3940, 84, 696, 18084, 11841, 3280, 2100, 6539], [595, 1, 1279, 2697, 3152, 3281, 2385, 1125, 1029, 48, 289, 132, 9872, 146, 363, 18085, 11842, 117, 414, 94, 18086, 104, 1713, 911, 419, 5080, 132, 3028, 649, 186, 5943, 140, 2539, 697, 105, 179, 595, 228, 8, 1995, 11843, 137, 4731, 3152, 3281, 228, 7342, 2211, 182, 22

In [32]:
# Pad sequences to ensure uniform length
max_sequence_length = max(len(seq) for seq in sequences_train + sequences_test)
padded_sequences_train = pad_sequences(sequences_train, maxlen=max_sequence_length, padding='post')
padded_sequences_test = pad_sequences(sequences_test, maxlen=max_sequence_length, padding='post')
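
For readers unfamiliar with `pad_sequences`, the following pure-Python sketch mimics what `padding='post'` does: shorter sequences get zeros appended, and longer ones are cut from the front, which is Keras' default `truncating='pre'` behaviour. The helper name `pad_post` and the sample `maxlen` are assumptions for illustration; the index values are borrowed from the printed training sequences.

```python
def pad_post(sequences, maxlen, value=0):
    # Assumes maxlen > 0
    padded = []
    for seq in sequences:
        # Keep the last `maxlen` items, mirroring Keras' default truncating='pre'
        trimmed = seq[-maxlen:]
        # Append padding values at the end (padding='post')
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

seqs = [[466, 6, 1994], [27, 418, 4440, 9, 313], [115]]
maxlen = 4  # illustrative; the notebook uses the longest sequence length
padded = pad_post(seqs, maxlen)
print(padded)
```

Because the notebook sets `maxlen` to the longest sequence, nothing is truncated there, but a single very long comment makes every row that long. Choosing a smaller cap is a possible alternative.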

In [33]:
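# Define model architecture
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding
embedding_dim = 100  # assumed value; the original cell does not show how this was set

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_sequence_length),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(3, activation='softmax')  # 3 output classes: N, O, P
])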

In [34]:
# Compile model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [35]:
# Fit model to data
history = model.fit(padded_sequences_train, y_train, epochs=15, batch_size=32, validation_split=0.2)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [37]:
def predict_user_input(user_input):
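    # Apply the same preprocessing used for the training data
    tokens = remove_stopwords(word_tokenize(clean_text(user_input)))

    # Convert the tokens to an index sequence and pad it
    sequences_input = tokenizer.texts_to_sequences([tokens])
    padded_sequences_input = pad_sequences(sequences_input, maxlen=max_sequence_length, padding='post')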

    # Predict
    prediction = model.predict(padded_sequences_input)

    # Get the class with the highest probability
    predicted_class = np.argmax(prediction)

    # Decode the predicted class
    predicted_label = label_encoder.inverse_transform([predicted_class])

    return predicted_label[0]

In [39]:
user_input = input("Enter your text: ")
print(predict_user_input(user_input))

Enter your text: dalits 
N


In [36]:
# Evaluate model performance on test set
loss, accuracy = model.evaluate(padded_sequences_test, y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

Test Loss: 2.7488174438476562
Test Accuracy: 0.5841535925865173


In [41]:
# Predict probabilities for each class
y_pred_probabilities = model.predict(padded_sequences_test)

# Convert probabilities to class labels by selecting the class with the highest probability
y_pred = np.argmax(y_pred_probabilities, axis=1)




In [42]:
from sklearn.metrics import classification_report

# Generate a classification report
class_report = classification_report(y_test, y_pred)
print(class_report)


              precision    recall  f1-score   support

           0       0.60      0.66      0.63      4375
           1       0.00      0.00      0.00         4
           2       0.56      0.49      0.53      3850

    accuracy                           0.58      8229
   macro avg       0.39      0.39      0.39      8229
weighted avg       0.58      0.58      0.58      8229



Sentiment Analysis Model Evaluation Report

Evaluation Metrics:
- Accuracy: 58.42%
- Precision:
  - Negative sentiment (Class 0): 60%
  - Neutral sentiment (Class 1): 0%
  - Positive sentiment (Class 2): 56%
- Recall:
  - Negative sentiment (Class 0): 66%
  - Neutral sentiment (Class 1): 0%
  - Positive sentiment (Class 2): 49%
- F1-score:
  - Negative sentiment (Class 0): 63%
  - Neutral sentiment (Class 1): 0%
  - Positive sentiment (Class 2): 53%

Support:
- Negative sentiment (Class 0): 4375
- Neutral sentiment (Class 1): 4
- Positive sentiment (Class 2): 3850

Analysis:
- Strengths:
  - Negative sentiment (Class 0) is the best-handled class, with precision 60%, recall 66%, and F1-score 63%.
  - Overall accuracy (58.42%) is about 5 points above the 53.2% obtained by always predicting the majority class (4375 of 8229 test samples), so the model has learned some patterns in the data, but not many.
- Weaknesses:
  - Performance on neutral sentiment (Class 1) is zero for precision, recall, and F1-score. The class has only 36 samples in the full dataset and 4 in the test set, so the model has almost nothing to learn from, and the score itself rests on too few examples to be reliable.
  - Positive sentiment (Class 2) has a recall of 49%, meaning roughly half of the positive comments are misclassified.
  - The test loss (2.75) is well above the ≈1.10 (ln 3) that a uniform guess over three classes would give, which points to confident wrong predictions and is consistent with overfitting.
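
To put the accuracy and loss in context, the sketch below derives three reference numbers from figures already printed in this notebook: the accuracy of always predicting the largest class, the cross-entropy of a uniform guess over three classes, and the macro-averaged F1. Only the standard library is used, and the inputs are copied from the evaluation and classification report output.

```python
import math

# Figures copied from the evaluation output and classification report
support = {0: 4375, 1: 4, 2: 3850}
f1 = {0: 0.63, 1: 0.00, 2: 0.53}
test_accuracy = 0.5841535925865173
test_loss = 2.7488174438476562

total = sum(support.values())

# Accuracy of a model that always predicts the largest class
majority_baseline = max(support.values()) / total

# Cross-entropy of predicting 1/3 for every class
uniform_loss = math.log(len(support))

# Unweighted mean of the per-class F1 scores
macro_f1 = sum(f1.values()) / len(f1)

print("Majority-class accuracy:", round(majority_baseline, 3))
print("Gain over baseline:", round(test_accuracy - majority_baseline, 3))
print("Uniform-guess loss:", round(uniform_loss, 4), "vs test loss:", round(test_loss, 2))
print("Macro F1:", round(macro_f1, 2))
```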

In [47]:
model.save('sentiment.keras')


In [57]:
# Save the tokenizer using joblib
!pip install joblib
import joblib
joblib.dump(tokenizer, 'tokenizers.pkl')



['tokenizers.pkl']
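
`predict_user_input` also depends on the label mapping and on `max_sequence_length`, neither of which is saved by the cells above. One possible way to persist them is a small JSON file, sketched below. The file name is an assumption, the `classes` list follows the printed label mapping, and the value used for `max_sequence_length` is a hypothetical placeholder, since the notebook never prints the real one.

```python
import json
import os
import tempfile

metadata = {
    "classes": ["N", "O", "P"],   # index 0, 1, 2 as in the printed label mapping
    "max_sequence_length": 100,   # hypothetical placeholder; use the notebook's actual value
}

# Hypothetical file name, written to a temp directory for this example
path = os.path.join(tempfile.gettempdir(), "sentiment_metadata.json")
with open(path, "w") as f:
    json.dump(metadata, f)

# Later, at inference time: reload and decode a predicted class index
with open(path) as f:
    restored = json.load(f)

predicted_class = 2
print(restored["classes"][predicted_class])
```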