<a href="https://colab.research.google.com/github/msrepo/ml-mscise-2023/blob/master/Problem_set/Problem_set_naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
import numpy as np
import pandas as pd
import sklearn

In [1]:
csv_path = 'https://raw.githubusercontent.com/naamiinepal/covid-tweet-classification/main/analysis/Dataset/nepali_tweets_dataset_vectors_EHnM_analysis_v1.csv'

In [45]:
df = pd.read_csv(csv_path)
TEXT_COLUMN_ID = 7
LABEL_ID_START = 8

categories = list(df)[LABEL_ID_START:]
tweet_text = df.iloc[:,TEXT_COLUMN_ID].to_numpy()
target = df.iloc[:,LABEL_ID_START:].to_numpy()
df.iloc[:,TEXT_COLUMN_ID:].head()


Unnamed: 0,text,covid_stats,vaccination,covid_politics,humour,lockdown,civic_views,life_during_pandemic,covid_waves_and_variants,misinformation,others
0,चितवनमा ९३ हजार बढीले लगाए कोरोनाविरुद्धको खोप,1,1,0,0,0,0,0,0,0,0
1,जोरबिजोर भनेको गाडी संख्या धेरै भएर ट्राफिक जा...,0,0,0,0,0,1,0,0,0,0
2,३१ सय ८ जना संक्रमित थपिदा १५ सय ९५ जना डिस्चा...,1,0,0,0,0,0,0,0,0,0
3,कोरोनाको जोखिम बढ्दै: झापाको दमक फेरी १ हफ्ता ...,0,0,0,0,1,0,0,0,0,0
4,कोरोना खोप राज्यले निःशुल्क लगाइरहेकै छ र थप ल...,0,1,0,0,0,1,0,0,0,0


**Bag of Words representation of text**

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(binary=True)
X_train = count_vect.fit_transform(tweet_text)
X_train.shape

(8089, 4225)
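
To make the shape above concrete: with `binary=True`, each row is a 0/1 vector that records which vocabulary terms occur in a tweet, not how often they occur. The following is a minimal pure-Python sketch of that idea. It assumes whitespace tokenization and uses a hypothetical helper `binary_bow` with made-up documents; it is not how `CountVectorizer` is implemented.

```python
def binary_bow(docs):
    """Return (vocabulary, rows) where rows[i][j] is 1 if term j occurs in docs[i]."""
    # Assumption: tokens are whitespace-separated; CountVectorizer uses a regex instead.
    tokenized = [doc.split() for doc in docs]
    terms = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {term: idx for idx, term in enumerate(terms)}

    rows = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for token in tokens:
            row[vocabulary[token]] = 1  # presence only: a repeated word still gives 1
        rows.append(row)
    return vocabulary, rows


docs = ["a b b", "b c"]  # made-up documents
vocabulary, rows = binary_bow(docs)
print(vocabulary)  # {'a': 0, 'b': 1, 'c': 2}
print(rows)        # [[1, 1, 0], [0, 1, 1]]
```

The number of rows equals the number of documents and the number of columns equals the vocabulary size, which is how the `(8089, 4225)` shape should be read.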

In [40]:
print('Index for the word कोरोना:', count_vect.vocabulary_.get('कोरोना'))
print('Index for the word खोप:', count_vect.vocabulary_.get('खोप'))
print('Index for the word ऋण:', count_vect.vocabulary_.get('ऋण'))

Index for the word कोरोना: None
Index for the word खोप: None
Index for the word ऋण: 1493
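
The `None` results above are most likely a tokenization artifact, not missing data. The default `token_pattern` of `CountVectorizer` is `(?u)\b\w\w+\b`, and in Python's `re` module `\w` does not match Devanagari dependent vowel signs (matras). Words such as कोरोना and खोप are therefore broken at every matra and never yield a run of two or more word characters, while ऋण, which contains no matra, survives. The sketch below reproduces this with `re` alone and compares it with a whitespace pattern, `\S+`. That alternative is an assumption made for illustration, not the notebook's choice.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # CountVectorizer's documented default token_pattern
WHITESPACE_PATTERN = r"\S+"         # assumed alternative: split on whitespace only

text = "कोरोना खोप ऋण"

default_tokens = re.findall(DEFAULT_PATTERN, text)
whitespace_tokens = re.findall(WHITESPACE_PATTERN, text)

print(default_tokens)     # words that contain matras are lost
print(whitespace_tokens)  # all three words are kept intact
```

The whitespace pattern could be passed to the vectorizer as `CountVectorizer(binary=True, token_pattern=r'\S+')`. Note that it keeps punctuation attached to words, so some additional cleaning may be needed.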


**Preprocess class categories**

Each tweet can belong to more than one category:

> चितवनमा ९३ हजार बढीले लगाए कोरोनाविरुद्धको खोप : covid_stats, vaccination





In [None]:
# Show tweet text and class labels side by side
for x, t in zip(tweet_text, target):
  indices = np.where(t == 1)[0]
  print(x, [categories[p] for p in indices])
  


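To turn this multi-label problem into a single-label one, we duplicate each tweet once per label it carries. A tweet with two labels becomes two samples with the same text, each assigned exactly one of the class labels; a tweet with no label set contributes no sample.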

In [62]:
target_processed = []      # one class index per sample
tweet_text_processed = []  # tweet text, repeated once per label

for x, t in zip(tweet_text, target):
  indices = np.where(t == 1)[0]  # positions of the labels set for this tweet
  target_processed.extend(indices)
  tweet_text_processed.extend([x] * len(indices))

print(f'Before Preprocessing: {len(tweet_text)}')
print(f'After Preprocessing: {len(tweet_text_processed)}')

Before Preprocessing: 8089
After Preprocessing: 10518
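
As a toy illustration of this expansion, the sketch below applies the same logic to made-up texts and a three-class label matrix, not to rows from the dataset. It uses plain Python lists in place of `np.where`, and `expand_multilabel` is a hypothetical helper name.

```python
def expand_multilabel(texts, label_rows):
    """Duplicate each text once per positive label and return (texts, class_ids)."""
    expanded_texts, expanded_labels = [], []
    for text, row in zip(texts, label_rows):
        positive = [idx for idx, flag in enumerate(row) if flag == 1]
        expanded_texts.extend([text] * len(positive))  # zero labels: text is dropped
        expanded_labels.extend(positive)
    return expanded_texts, expanded_labels


texts = ["tweet A", "tweet B", "tweet C"]
label_rows = [
    [1, 1, 0],  # two labels: two samples
    [0, 0, 1],  # one label: one sample
    [0, 0, 0],  # no label: no sample
]

new_texts, new_labels = expand_multilabel(texts, label_rows)
print(new_texts)   # ['tweet A', 'tweet A', 'tweet B']
print(new_labels)  # [0, 1, 2]
```

Because the duplicated samples have identical features but different labels, a single-label classifier can match at most one of the copies of each multi-label tweet.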


**Classification using BoW features**

In [73]:
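from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Sequential 90/10 split (no shuffling); X_train and X_test hold raw tweet text here
NUM_TRAIN = int(0.9 * len(tweet_text_processed))
X_train, X_test = tweet_text_processed[:NUM_TRAIN], tweet_text_processed[NUM_TRAIN:]
target_train, target_test = target_processed[:NUM_TRAIN], target_processed[NUM_TRAIN:]

# Fit the vocabulary on the training tweets only, keeping the 4000 most frequent terms
count_vect = CountVectorizer(binary=True, max_features=4000)
X_train = count_vect.fit_transform(X_train)  # X_train is now a sparse 0/1 matrix
print(f'Training set shape after BoW features: {X_train.shape}')

# Multinomial (softmax) logistic regression over the class indices.
# Note: newer scikit-learn releases deprecate the multi_class argument.
logreg = LogisticRegression(multi_class='multinomial', solver='newton-cg', max_iter=1000, verbose=True)
logreg.fit(X_train, target_train)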

Training set shape after BoW features: (9466, 4000)


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.4s finished


LogisticRegression(max_iter=1000, multi_class='multinomial', solver='newton-cg',
                   verbose=True)
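
One caveat about the split used above: it is sequential, and copies of the same tweet sit next to each other in the expanded list, so a multi-label tweet at the 90% boundary could in principle end up in both the training and the test portion. A possible alternative, sketched here on made-up data with a hypothetical helper `split_then_expand`, is to shuffle and split the original tweets first and expand each side afterwards. This is an illustration under those assumptions, not the procedure the notebook follows.

```python
import random


def split_then_expand(texts, label_rows, train_fraction=0.9, seed=0):
    """Shuffle tweet indices, split them, then expand each side to one sample per label."""
    order = list(range(len(texts)))
    random.Random(seed).shuffle(order)  # fixed seed keeps the split reproducible
    cut = int(train_fraction * len(order))

    def expand(indices):
        samples = []
        for i in indices:
            for class_id, flag in enumerate(label_rows[i]):
                if flag == 1:
                    samples.append((texts[i], class_id))
        return samples

    return expand(order[:cut]), expand(order[cut:])


texts = [f"tweet {i}" for i in range(10)]
label_rows = [[1, i % 2] for i in range(10)]  # made-up labels: odd tweets carry two labels

train, test = split_then_expand(texts, label_rows)
train_texts = {text for text, _ in train}
test_texts = {text for text, _ in test}
print(len(train), len(test))
print(train_texts & test_texts)  # empty set: no tweet appears on both sides
```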

We can now use the fitted model to predict a class for each held-out test tweet and compare it with the ground truth.

In [74]:
X_test_vectorized = count_vect.transform(X_test)
target_predicted = logreg.predict(X_test_vectorized)

for x, t_pred, t_groundtruth in zip(X_test,target_predicted,target_test):
  print(f'{x} Predicted:{categories[t_pred]} GroundTruth:{categories[t_groundtruth]}')

थपिए २२३७ जना कोरोना संक्रमित, १८ जनाको मृत्यु Predicted:covid_stats GroundTruth:covid_stats
एकच अडानचोट विना मास्क दिसतय... Predicted:vaccination GroundTruth:others
एक हजार ५४९ सङ्क्रमण पुष्टि, दुई हजार ५० संक्रमित निको Predicted:covid_stats GroundTruth:covid_stats
भोलिबाट सेमेष्टर शुरु, धन्न लकडाउन छ र घर गएर घरबाट नै क्लास लिन पाइएको छ। नत्र त आउने ६ हप्तासम्म हरेक दोस्रो सोमबार बिजोग हुनेरहेछ। 😣😐️😩 Predicted:lockdown GroundTruth:lockdown
भोलिबाट सेमेष्टर शुरु, धन्न लकडाउन छ र घर गएर घरबाट नै क्लास लिन पाइएको छ। नत्र त आउने ६ हप्तासम्म हरेक दोस्रो सोमबार बिजोग हुनेरहेछ। 😣😐️😩 Predicted:lockdown GroundTruth:life_during_pandemic
यि लकडाउन बढाउनेहरु कालो सिसा लाएर हिँड्छन् क्या हो गाडीमा ? नत्र बाहिर कस्तो छ हालत देख्न पर्ने हो सिमित ब्यापार ब्यबसाय मात्र लकडाउनको नाममा किन ठप्प ?? Predicted:lockdown GroundTruth:lockdown
यि लकडाउन बढाउनेहरु कालो सिसा लाएर हिँड्छन् क्या हो गाडीमा ? नत्र बाहिर कस्तो छ हालत देख्न पर्ने हो सिमित ब्यापार ब्यबसाय मात्र लकडाउनको नाममा किन ठप्प ?? Predicted:loc