<a href="https://colab.research.google.com/github/omkarade/Hindi-Toxic-Comment-Classification-Using-BERT-Embeddings/blob/main/Hindi_Toxic%C2%A0Comment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install --upgrade pandas tensorflow_gpu ktrain
! rm -rf apex
! git clone https://www.github.com/nvidia/apex
! cd apex && python setup.py install

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from transformers import BertTokenizer
from transformers import AutoTokenizer
import tensorflow as tf
import numpy as np
import pandas as pd

**Loading data**

In [3]:
import pandas as pd 

df = pd.read_csv("/hi_3500.csv", header=0, names=['review', 'sentiment'])
print(df.head())

                                              review sentiment
0  गुमनाम है वतन पर मिटने वाले लोग आतन्कवादियों स...  negative
1  ज़ंजीर बदली जा रही थी मैं समझा था रिहाई हो गयी है  negative
2  यूपी में बड़े स्तर पर दंगे करवा सकती है बीजेपी...  negative
3  अंग्रेजी नहीं आती है इसलिए हिन्दी ट्विट ज्यादा...  negative
4                    कश्मीर में हो रहा है जल जिहाद ।  negative


**Converting three class to two class**

In [4]:
hui=[]
for i in df.sentiment.values:
  if i =='negative':
    hui.append(1)
  else:
    hui.append(0)
df=df.drop(columns='sentiment')
df['sentiment']=hui      

**Data pre-processing**

In [17]:
def preprocess(text):
    new_text = []
    for i in text.split(" "):
        i= '' if i.startswith('@') and len(i) > 1 else i
        i = '' if i.startswith('http') else i
        i = i.replace("#","")
        new_text.append(i)
    return " ".join(new_text).strip().replace("  ", " ")

In [18]:
df['review']=df.review.apply(preprocess)

In [20]:
df.head()

Unnamed: 0,review,sentiment
0,गुमनाम है वतन पर मिटने वाले लोग आतन्कवादियों स...,1
1,ज़ंजीर बदली जा रही थी मैं समझा था रिहाई हो गयी है,1
2,यूपी में बड़े स्तर पर दंगे करवा सकती है बीजेपी...,1
3,अंग्रेजी नहीं आती है इसलिए हिन्दी ट्विट ज्यादा...,1
4,कश्मीर में हो रहा है जल जिहाद ।,1


**Converting text into Bert Tokenizer**

In [5]:
from transformers import BertTokenizer
from transformers import AutoTokenizer
from tqdm import tqdm
tokenizer = BertTokenizer.from_pretrained('google/muril-large-cased', do_lower_case=True)

Downloading:   0%|          | 0.00/3.02M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/181 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/406 [00:00<?, ?B/s]

In [30]:
new=df.review.values
newdf=[]
for i in tqdm(new):
  tokenizer = BertTokenizer.from_pretrained('google/muril-large-cased', do_lower_case=True)
  i2= tokenizer(i)
  i3=i2['input_ids']
#we do padding also. 60 words max in one sentence
  out=[0 for i in range(60)]
  out[:len(i3)]=i3
  newdf.append(out)

100%|██████████| 9076/9076 [2:39:52<00:00,  1.06s/it]


In [40]:
y_hat=df.sentiment.values
Xtrain=np.array(newdf)
y_hat=np.array(y_hat)

**Training RandomForest Model**

In [44]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Xtrain, y_hat, random_state=91)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print(classification_report(clf.predict(X_test),y_test))

              precision    recall  f1-score   support

           0       0.92      0.76      0.83      1777
           1       0.46      0.76      0.58       492

    accuracy                           0.76      2269
   macro avg       0.69      0.76      0.70      2269
weighted avg       0.82      0.76      0.78      2269



**Deep Learning approach**

**I use  ktrain library is a lightweight wrapper for tf.keras in Tensor Flow 2. It is designed to make deep learning and AI more accessible and easier to apply for beginners and domain experts. this library has inbuilt text cleaner**

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], random_state=91)

In [7]:
import ktrain
from ktrain import text
t = text.Transformer("monsoon-nlp/hindi-bert", maxlen=500, class_names=list(set(y_train.values)))


Downloading:   0%|          | 0.00/572 [00:00<?, ?B/s]

In [8]:
trn = t.preprocess_train(X_train.to_numpy(), y_train.to_numpy())


preprocessing train...
language: hi
train sequence lengths:
	mean : 16
	95percentile : 27
	99percentile : 30


Downloading:   0%|          | 0.00/181 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/593k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Is Multi-Label? False


In [9]:
evalr = t.preprocess_test(X_test.to_numpy(), y_test.to_numpy())


preprocessing test...
language: hi
test sequence lengths:
	mean : 16
	95percentile : 27
	99percentile : 30


In [10]:
model = t.get_classifier()

Downloading:   0%|          | 0.00/56.4M [00:00<?, ?B/s]

**Training Model**

In [11]:
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=evalr, batch_size=6)
learner.fit(1.2e-4, 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f714db7dd30>

**Evaluation on Train Dataset**

In [17]:
model.explain("इन लोगो ने हमारे देश का नाम ख़राब किया हे,इन्हे जूते से मारना चाहिए🤬")



Contribution?,Feature
3.15,Highlighted in text (sum)
-0.001,<BIAS>


In [18]:
model.explain("तुम सबको काट देना चाहिए")



Contribution?,Feature
0.374,<BIAS>
0.369,Highlighted in text (sum)


In [23]:
model.explain("बहुत खूबसूरत हो तुम कभी मैं कहु के मोहब्बत है तुमसे तो मुझको खुदरा ग़लत ना समझना के मेरी ज़रुरत हो तुम बहुत खूबसूरत हो तुम")



Contribution?,Feature
3.777,Highlighted in text (sum)
-1.158,<BIAS>


**failure cases**

In [21]:
model.explain('महाराष्ट्र में बड़े स्तर पर दंगे करवा सकती है बीजेपी')



Contribution?,Feature
3.284,Highlighted in text (sum)
-0.609,<BIAS>


In [29]:
model.explain('आगे से तुम ने फिर से ऐसे कपड़ों में फोटो डाला तो तुम्हारा बलात्कार कर देंगे')



Contribution?,Feature
0.855,<BIAS>
-0.743,Highlighted in text (sum)


**due to less data this model is overfit**