This notebook demonstrates sentiment analysis on Roman Urdu.

## Imports
Here we import the modules used throughout the notebook.


In [16]:
import sys
print(sys.executable)
print(sys.version)
print(sys.version_info)

/home/programmer/.pyenv/versions/3.6.3/envs/roman-urdu-classification/bin/python3.6
3.6.3 (default, May 13 2023, 13:46:03) 
[GCC Clang 15.0.7 (Fedora 15.0.7-2.fc37)]
sys.version_info(major=3, minor=6, micro=3, releaselevel='final', serial=0)


In [17]:

from __future__ import print_function

import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics
from sklearn.model_selection import train_test_split
import pandas as pd
import eli5

import re
from tqdm import *



## Preprocessing
The utility functions below clean the data and, optionally, apply a phonetic hashing algorithm to the words.

In [18]:

def cleaner(word):
  word = re.sub(r'\#\.', '', word)
  word = re.sub(r'\n', '', word)
  word = re.sub(r',', '', word)
  word = re.sub(r'\-', ' ', word)
  word = re.sub(r'\.', '', word)
  word = re.sub(r'\\', ' ', word)
  word = re.sub(r'\\x\.+', '', word)
  word = re.sub(r'\d', '', word)
  word = re.sub(r'^_.', '', word)
  word = re.sub(r'_', ' ', word)
  word = re.sub(r'^ ', '', word)
  word = re.sub(r' $', '', word)
  word = re.sub(r'\?', '', word)

  return word.lower()


def hashing(word):
  word = re.sub(r'ain$', r'ein', word)
  word = re.sub(r'ai', r'ae', word)
  word = re.sub(r'ay$', r'e', word)
  word = re.sub(r'ey$', r'e', word)
  word = re.sub(r'ie$', r'y', word)
  word = re.sub(r'^es', r'is', word)
  word = re.sub(r'a+', r'a', word)
  word = re.sub(r'j+', r'j', word)
  word = re.sub(r'd+', r'd', word)
  word = re.sub(r'u', r'o', word)
  word = re.sub(r'o+', r'o', word)
  word = re.sub(r'ee+', r'i', word)
  if not re.match(r'ar', word):
    word = re.sub(r'ar', r'r', word)
  word = re.sub(r'iy+', r'i', word)
  word = re.sub(r'ih+', r'eh', word)
  word = re.sub(r's+', r's', word)
  if re.search(r'[rst]y', word) and word[-1] != 'y':
    word = re.sub(r'y', r'i', word)
  if re.search(r'[bcdefghijklmnopqrtuvwxyz]i', word):
    word = re.sub(r'i$', r'y', word)
  if re.search(r'[acefghijlmnoqrstuvwxyz]h', word):
    word = re.sub(r'h', '', word)
  word = re.sub(r'k', r'q', word)
  return word

def array_cleaner(array):
  # X = array
  X = []
  for sentence in array:
    clean_sentence = ''
    words = str(sentence).split(' ')
    for word in words:
      clean_sentence = clean_sentence +' '+ cleaner(word)
    X.append(clean_sentence)
  return X
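
To illustrate what the phonetic hashing is for, here is a minimal sketch that applies only four of the vowel rules from hashing() to a few spellings of the same word. The helper name mini_hash and the spelling "bohaat" are assumptions made for this example; the notebook itself never calls hashing() on the data.

```python
import re

def mini_hash(word):
    """Apply a reduced subset of the vowel rules from hashing() (illustrative only)."""
    word = re.sub(r'a+', 'a', word)   # collapse repeated 'a'
    word = re.sub(r'u', 'o', word)    # treat 'u' and 'o' as the same sound
    word = re.sub(r'o+', 'o', word)   # collapse repeated 'o'
    word = re.sub(r'ee+', 'i', word)  # a long 'ee' becomes 'i'
    return word

# Spelling variants of one word; "bohaat" is a made-up variant for the demo.
variants = ["bohat", "buhat", "bohaat"]
hashed = {w: mini_hash(w) for w in variants}
print(hashed)
```

All three spellings collapse to the same key, so a vectorizer fed hashed text would count them as a single term rather than three.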


## Data

Here we read the CSV file containing the dataset.

In [19]:
import os
print(os.getcwd())
data = pd.read_csv('Dataset/Roman Urdu DataSet.csv', encoding="ISO-8859-1", header=None)
data.head()

/mnt/programmer/projects/anti-social-behavior/Roman-Urdu-Dataset.git_Smat26


Unnamed: 0,0,1,2
0,Sai kha ya her kisi kay bus ki bat nhi hai lak...,Positive,
1,sahi bt h,Positive,
2,"Kya bt hai,",Positive,
3,Wah je wah,Positive,
4,Are wha kaya bat hai,Positive,


We train the model on the entire dataset; no test split is held out.

In [20]:
numpy_array = data.values  # as_matrix() is deprecated and removed in newer pandas
X = numpy_array[:, 0]
# Clean X here
X_train = array_cleaner(X)
y_train = numpy_array[:, 1]

## Vectorizing
We use TF-IDF as the vectorizing method.
The n-gram range is set to (1, 3), so unigrams, bigrams, and trigrams are all used as features.


In [21]:
ngram = 3
vectorizer = TfidfVectorizer(sublinear_tf=True, ngram_range=(1, ngram), max_df=0.5)
X_train = vectorizer.fit_transform(X_train)
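
As a rough illustration of what ngram_range=(1, 3) and max_df=0.5 mean, the sketch below enumerates word n-grams by hand and drops terms that occur in more than half of the documents. It is not scikit-learn's implementation; the helper word_ngrams and the toy corpus are assumptions made for this example.

```python
from collections import Counter

def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Toy corpus, made up for illustration (not taken from the dataset).
docs = ["bohat acha hai", "acha nahi hai", "yeh hai", "kamal"]

# Document frequency: the number of documents each n-gram occurs in.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(word_ngrams(doc.split())))

# max_df=0.5 drops terms found in more than half of the documents.
max_df = 0.5
vocabulary = sorted(t for t, df in doc_freq.items() if df <= max_df * len(docs))

print(word_ngrams("bohat acha hai".split()))
print("hai" in vocabulary, "acha" in vocabulary)
```

Here "hai" occurs in three of the four documents and is dropped, while "acha" occurs in exactly two and is kept.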


## Classification

A utility function to help us train different classifiers:


In [22]:
def benchmark(clf, name):
  print('_' * 80)
  print("Training: ")
  print(clf)
  clf.fit(X_train, y_train)
  return clf

Uncomment a single classifier line to train that model.

The top features (both positive and negative) for each class are listed.


In [23]:
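# Note: `n_iter` matches the scikit-learn version used here (see the output below);
# it was removed in scikit-learn 0.21, where `max_iter` takes its place.
# clf = benchmark(RidgeClassifier(tol=1e-2, solver="sag"), "Ridge Classifier")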
clf = benchmark(SGDClassifier(alpha=.0001, n_iter=50,penalty="elasticnet"), 'SGD-elasticnet')
# clf = benchmark(SGDClassifier(alpha=.0001, n_iter=50,penalty='l1'), 'SGD-L1')
# clf = benchmark(LinearSVC(penalty='l1', dual=False,tol=1e-3), 'liblinear L1')
# clf = benchmark(LinearSVC(penalty='l2', dual=False,tol=1e-3), 'liblinear L2')
# clf = benchmark(MultinomialNB(alpha=.01), 'MultiNB')
# clf = benchmark(BernoulliNB(alpha=.01), 'BernoulliNB')
# clf = benchmark(NearestCentroid(), 'Rocchio')
# clf = benchmark(KNeighborsClassifier(n_neighbors=10), "kNN")
# clf = benchmark(PassiveAggressiveClassifier(n_iter=50), "Passive-Aggressive")

eli5.show_weights(clf, vec=vectorizer)

________________________________________________________________________________
Training: 
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=50,
       n_jobs=1, penalty='elasticnet', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)




Weight?,Feature
-1.066,<BIAS>

Weight?,Feature
+7.230,lanat
+3.881,chor
+3.597,police
+3.544,bc
+3.403,band
+3.361,jahil
+3.222,mar
+2.981,firing
+2.868,zakhmi
+2.736,pagal

Weight?,Feature
… 8879 more positive …
… 8542 more negative …
-3.099,pak
-3.141,ne
-3.211,se
-3.405,bohat
-3.444,acha
-3.448,great
-3.450,nahi
-3.451,or

Weight?,Feature
+7.364,allah
+6.699,dua
+5.519,achi
+5.284,acha
+5.138,good
+4.657,love
+4.598,bohat
+4.434,great
+4.370,pak
+4.340,kamal


## Testing

We can check the model against a test sentence to see how it performs.

In [27]:
test_sentence = "tum buhat bury admi ho"
eli5.show_prediction(clf, doc=test_sentence, vec=vectorizer)

Contribution?,Feature
-1.066,<BIAS>

Contribution?,Feature
0.281,Highlighted in text (sum)
-1.001,<BIAS>

Contribution?,Feature
0.283,<BIAS>
-1.719,Highlighted in text (sum)

Contribution?,Feature
0.301,Highlighted in text (sum)
-1.001,<BIAS>
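
For a linear classifier, each table above can be read as a sum: every feature found in the text contributes its weight times its TF-IDF value, and the class score is that sum plus the bias. The sketch below reproduces this arithmetic with made-up numbers; the class names, weights, and TF-IDF values are hypothetical and are not taken from the trained model.

```python
# Hypothetical per-class bias and feature weights (illustration only).
classes = {
    "Negative": {"bias": -1.0, "weights": {"bury": 2.0, "admi": 0.5}},
    "Positive": {"bias": -1.0, "weights": {"bury": -1.5, "admi": 0.25}},
}

# Hypothetical TF-IDF values for the test sentence; "tum" has no weight
# in either class here, so it contributes nothing.
tfidf = {"bury": 0.5, "admi": 0.5, "tum": 0.5}

def class_score(model, features):
    """Return (sum of feature contributions, total score including bias)."""
    text_sum = sum(model["weights"].get(t, 0.0) * v for t, v in features.items())
    return text_sum, model["bias"] + text_sum

scores = {name: class_score(model, tfidf) for name, model in classes.items()}
predicted = max(scores, key=lambda name: scores[name][1])

for name, (text_sum, total) in scores.items():
    print(name, "text sum:", text_sum, "score:", total)
print("predicted:", predicted)
```

The "text sum" corresponds to the "Highlighted in text (sum)" row, and the class with the highest total score is the prediction.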
