<a href="https://colab.research.google.com/github/limesun/GitHub/blob/master/Urdu_ML_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Data & Libraries


Hello! This notebook works with a new type of text data: Urdu-language reviews for sentiment analysis.

Let's start by importing the dataset from my GitHub and installing/importing the required Python libraries.

(The dataset had no column names, so I added them before importing it, for convenience in the later data analysis.)

In [7]:
pip install cytoolz



In [0]:
#Import required libraries 
import pandas as pd
import numpy as np
import statistics as stat
import math
import re
import string
from cytoolz import concat

#Import the dataset
url = 'https://raw.githubusercontent.com/limesun/Sentiment_Analysis_Urdu_Language/master/Roman%20Urdu%20DataSet_my.csv'

df = pd.read_csv(url)


In [9]:
df.head()

Unnamed: 0,Review,Sentiment
0,Sai kha ya her kisi kay bus ki bat nhi hai lak...,Positive
1,sahi bt h,Positive
2,"Kya bt hai,",Positive
3,Wah je wah,Positive
4,Are wha kaya bat hai,Positive


# Data Cleaning & Pre-processing

When I looked into the dataset, I found that the language was not familiar to me at all. Fortunately, I could still get some sense of the dataset by studying it.

It is a social media review dataset, so it contains relatively short, messy, tweet-like texts as well as emoji.

Thus, I decided to use 'TweetTokenizer' later for tokenization, to handle emoji and elongated expressions like 'goooooooood'.
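
To make the elongated-word handling concrete, here is a minimal sketch of the idea behind `reduce_len=True`: runs of three or more identical characters are shortened to exactly three. The helper name `shorten_repeats` is hypothetical, and the regex reflects my understanding of the behavior rather than a copy of NLTK's code.

```python
import re

# Hypothetical helper: collapse any run of 3+ identical characters to 3.
_REPEATS = re.compile(r"(.)\1{2,}")

def shorten_repeats(text):
    return _REPEATS.sub(r"\1\1\1", text)

print(shorten_repeats("goooooooood"))  # goood
print(shorten_repeats("good"))         # good
```

Under this rule, variants such as 'gooood' and 'goooooooood' collapse to the same token, while the regular spelling 'good' stays a separate one.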

Before applying the text pre-processing, I had to do some data cleaning first: I found one row with 'Neative' misspelled for 'Negative' in the Sentiment column, as well as one null row and many empty rows in the Review column.

The code below performs these data cleaning tasks.


In [0]:
def remove_white_space_row(data):
    # Collapse runs of any whitespace (spaces, tabs, newlines) into single spaces
    data['Review'] = data['Review'].apply(lambda x: ' '.join(x.split()))
    # Turn empty strings into NaN (assigned back instead of a chained inplace call)
    data['Review'] = data['Review'].replace('', np.nan)
    # Drop every row whose Review is NaN
    data.dropna(subset=['Review'], inplace=True)
    return data

def pre_processing(data):
    # Lower case
    data = data.astype(str).str.lower()
    # Replace punctuation with spaces (characters escaped, regex=True made explicit)
    data = data.str.replace('[{}]'.format(re.escape(string.punctuation)), ' ', regex=True)
    # Drop tokens made up only of digits
    data = data.apply(lambda x: ' '.join(item for item in x.split() if not item.isdigit()))
    # Strip leading/trailing whitespace
    data = data.str.strip()
    return data

Then I applied some text pre-processing steps, such as lower-case normalization and filtering punctuation and numbers. I was not able to apply steps such as stopword filtering or POS tagging, because the language is unknown to the tools for those tasks.
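
As a small illustration of the steps just described, the sketch below applies them to a single string using only the standard library. The function name `clean_text` and the sample input are hypothetical; the notebook itself works on a whole pandas column.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
_PUNCT = re.compile("[{}]".format(re.escape(string.punctuation)))

def clean_text(text):
    """Hypothetical single-string version of the cleaning steps."""
    text = text.lower()                                     # lower-case normalization
    text = _PUNCT.sub(" ", text)                            # punctuation -> space
    tokens = [t for t in text.split() if not t.isdigit()]   # drop pure numbers
    return " ".join(tokens)                                 # also collapses whitespace

print(clean_text("Kya bt hai, 100 !!"))  # kya bt hai
```

A review made up only of punctuation or digits becomes an empty string after these steps, which is why `remove_white_space_row` is called a second time after `pre_processing`.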

In [11]:
# 1. Wrong spelling in 'Sentiment'
df.groupby('Sentiment').count()
#df[df['Sentiment'] == 'Neative'] #Check the index of the problematic rows
df.loc[df['Sentiment'] == 'Neative', ['Sentiment']] = 'Negative'  # changed one row
#df.loc[13277,:] #Verification
df.groupby('Sentiment').count() #verification

# 2. Null/NaN/empty in 'Review' 
np.where(pd.isnull(df)) #Null case - one row
df.dropna(subset=['Review'], inplace=True)
print(len(df))

df = remove_white_space_row(df)
print(len(df))
df.Review = pre_processing(df.Review)
print(len(df))
df = remove_white_space_row(df)
print(len(df))

# 3. Drop 'Neutral' data to make it a binary problem
df = df[df.Sentiment != 'Neutral']
df = df.reset_index(drop=True)
len(df)

20228
20116
20116
20045


11299

I also dropped the 'Neutral' data, since the target then has a clear direction from positive to negative; I expected this to make the prediction more efficient and accurate.

After the data cleaning tasks, I had a corpus of 11,299 records with a vocabulary size of 24,689.

The target (Sentiment) distribution is fairly well balanced (6,013 positive : 5,286 negative), so we may not need to worry about class imbalance.
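
The balance claim can be checked with a little arithmetic. The sketch below uses only the two counts quoted above and also derives the accuracy of a classifier that always predicts the majority class, which is a handy floor to compare the later models against. The variable names are my own.

```python
# Class counts reported in the notebook after cleaning.
counts = {"Positive": 6013, "Negative": 5286}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class gets this accuracy.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"{label}: {share:.2%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
```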

In [13]:
# Vocabulary Size Check
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
bow = df['Review'].apply(tokenizer.tokenize)
f = pd.DataFrame({'all': pd.Series(list(concat(bow))).value_counts()})
corpus_size, vocab_size = len(df['Review']), len(f)
print(corpus_size, vocab_size)
#f.sort_values('all', ascending=False)

11299 24689


In [14]:
# Encode 'Sentiment' as numerical category codes
df['Target_category'] = df['Sentiment'].factorize()[0]
category_id_df = df[['Sentiment', 'Target_category']].drop_duplicates().sort_values('Target_category')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['Target_category', 'Sentiment']].values)

df.groupby(['Target_category', 'Sentiment'])['Review'].agg(['count'])

Target_category,Sentiment,count
0,Positive,6013
1,Negative,5286


# Word Representation & Conventional Machine Learning



Now, let's start the main prediction task using conventional machine learning algorithms.

For this part I chose 'TweetTokenizer' from NLTK for tokenization and 'TF/TFIDF' from sklearn for the word representation.

I also chose five different algorithms that are known to work well on text data.

- Support Vector Machine (Linear)
- Naive Bayes
- Logistic Regression
- Random Forest
- Neural Network (Multi-Layer Perceptron)

In the code results below, you can check each model's performance with the different word representation methods and performance metrics.

In [0]:
# import keras & sklearn libraries for various text mining techniques

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score


from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
from sklearn.metrics import classification_report 
from sklearn.metrics import confusion_matrix
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import Pipeline
'''
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SimpleRNN
from keras.utils.np_utils import to_categorical
'''
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)

I split the dataset into train and test sets. The ratio is about 75%:25%.
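
As a quick sanity check on that ratio, the sketch below works out the expected set sizes for the 11,299 cleaned rows. It assumes the 25% default hold-out of `train_test_split` and that the test count is rounded up; both are my understanding of sklearn's behavior rather than something stated in this notebook.

```python
import math

n_samples = 11299      # rows left after cleaning
test_fraction = 0.25   # assumed default when no test_size is passed

n_test = math.ceil(n_samples * test_fraction)  # assumed rounding: up
n_train = n_samples - n_test

print(n_train, n_test)
```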

In [16]:
# Train & Test Split

X = df.Review
y = df.Target_category  # 0 = Positive, 1 = Negative (codes assigned by factorize)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

# Class 0 is positive and class 1 is negative, so the labels follow that order
print("Train set has total {0} entries with {1:.2f}% positive, {2:.2f}% negative".format(len(X_train),
                                                                                         (len(X_train[y_train == 0]) / (len(X_train)*1.))*100,
                                                                                         (len(X_train[y_train == 1]) / (len(X_train)*1.))*100))
print("Test set has total {0} entries with {1:.2f}% positive, {2:.2f}% negative".format(len(X_test),
                                                                                        (len(X_test[y_test == 0]) / (len(X_test)*1.))*100,
                                                                                        (len(X_test[y_test == 1]) / (len(X_test)*1.))*100))
print("# of positive in train set :", len(X_train[y_train == 0]))
print("# of positive in test set :", len(X_test[y_test == 0]))

Train set has total 8474 entries with 53.46% positive, 46.54% negative
Test set has total 2825 entries with 52.50% positive, 47.50% negative
# of positive in train set : 4530
# of positive in test set : 1483


In [0]:
# Making two different word representations using TweetTokenizer


tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
max_feature = 10000

# TF_matrix
cv = TfidfVectorizer(max_features=max_feature, ngram_range=(1,3), norm='l2', min_df = 0, use_idf=False, smooth_idf=False, lowercase = True, 
                                sublinear_tf=False, tokenizer=tokenizer.tokenize)


# TFIDF_matrix
tfidf = TfidfVectorizer(max_features=max_feature, ngram_range=(1,3), norm='l2', min_df = 0, use_idf=True, smooth_idf=False, lowercase = True, 
                                sublinear_tf=True, tokenizer=tokenizer.tokenize)
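
To show what the TFIDF settings above do to a single document, here is a pure-Python sketch of the weighting as I understand it for `sublinear_tf=True`, `smooth_idf=False` and `norm='l2'`: term frequency becomes 1 + ln(tf), inverse document frequency becomes ln(n / df) + 1, and each document vector is scaled to unit length. The function name `tfidf_vectors` and the toy documents are hypothetical, and the sketch ignores n-grams and `max_features`.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        weights = {}
        for term, count in Counter(doc).items():
            tf = 1.0 + math.log(count)               # sublinear term frequency
            idf = math.log(n_docs / df[term]) + 1.0  # unsmoothed IDF
            weights[term] = tf * idf
        # L2-normalize; an empty document keeps an empty vector.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Toy tokenized documents for illustration only.
docs = [["wah", "je", "wah"], ["sahi", "bt", "h"], ["kya", "bt", "hai"]]
vecs = tfidf_vectors(docs)
print(vecs[1])
```

In the second toy document, 'bt' also appears in another document, so it receives a lower weight than 'sahi', which is unique to that document.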



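I set max_feature to 10,000 when generating the word representations, given the corpus vocab_size of 24,689. By Zipf's law, a small number of terms account for most occurrences while most of the vocabulary is rare, so the most frequent 10,000 features should carry most of the usable signal.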
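
One way to check a cutoff like this is to measure how much of the corpus the top-k terms cover. The sketch below is a minimal version of that idea; the helper name `coverage_of_top_k` and the toy tokens are hypothetical, and it counts single tokens only, not the 1-3 grams used by the vectorizers.

```python
from collections import Counter

def coverage_of_top_k(tokens, k):
    """Share of all token occurrences covered by the k most frequent terms."""
    counts = Counter(tokens)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(tokens)

# Toy tokens for illustration only, not the real corpus.
tokens = "wah je wah kya bt hai sahi bt h wah bt".split()
print(coverage_of_top_k(tokens, 2))
```

On the real data, the flattened token list from the vocabulary check, `list(concat(bow))`, could be passed in with `k=10000`.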

In [0]:
# Various Machine Learning Methods

# set parameters
models = [  LinearSVC(),
            MultinomialNB(),
            LogisticRegression(random_state=0),
            RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0),
            MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)]

data_set =[cv, tfidf]
data_set_name= ["TF", "TFIDF"]
label_set = [df.Target_category, df.Target_category]            
cross_validation = 5
cv_df = pd.DataFrame(index=range(cross_validation * len(models)))

In [20]:
for data_set_index in range(len(data_set)):
  
  for model in models:
   
    model_name = model.__class__.__name__
    pipeline = Pipeline([
        ('vectorizer', data_set[data_set_index]),
        ('classifier', model)
    ])
    sentiment_fit = pipeline.fit(X_train, y_train)
    y_pred = sentiment_fit.predict(X_test)
    
    print("--------------------------------------------------------------")
    print("Data set", data_set_name[data_set_index])
    print("Model name", model_name)
    print(classification_report(y_test, y_pred, target_names=['positive','negative']))

--------------------------------------------------------------
Data set TF
Model name LinearSVC
              precision    recall  f1-score   support

    positive       0.79      0.79      0.79      1483
    negative       0.77      0.77      0.77      1342

    accuracy                           0.78      2825
   macro avg       0.78      0.78      0.78      2825
weighted avg       0.78      0.78      0.78      2825

--------------------------------------------------------------
Data set TF
Model name MultinomialNB
              precision    recall  f1-score   support

    positive       0.79      0.80      0.79      1483
    negative       0.77      0.77      0.77      1342

    accuracy                           0.78      2825
   macro avg       0.78      0.78      0.78      2825
weighted avg       0.78      0.78      0.78      2825





--------------------------------------------------------------
Data set TF
Model name LogisticRegression
              precision    recall  f1-score   support

    positive       0.77      0.79      0.78      1483
    negative       0.76      0.74      0.75      1342

    accuracy                           0.77      2825
   macro avg       0.77      0.76      0.76      2825
weighted avg       0.77      0.77      0.77      2825

--------------------------------------------------------------
Data set TF
Model name RandomForestClassifier
              precision    recall  f1-score   support

    positive       0.54      0.99      0.70      1483
    negative       0.91      0.08      0.15      1342

    accuracy                           0.56      2825
   macro avg       0.73      0.54      0.43      2825
weighted avg       0.72      0.56      0.44      2825

--------------------------------------------------------------
Data set TF
Model name MLPClassifier
              precision    recal

  'precision', 'predicted', average, warn_for)


--------------------------------------------------------------
Data set TFIDF
Model name LinearSVC
              precision    recall  f1-score   support

    positive       0.78      0.79      0.79      1483
    negative       0.77      0.76      0.76      1342

    accuracy                           0.77      2825
   macro avg       0.77      0.77      0.77      2825
weighted avg       0.77      0.77      0.77      2825

--------------------------------------------------------------
Data set TFIDF
Model name MultinomialNB
              precision    recall  f1-score   support

    positive       0.81      0.79      0.80      1483
    negative       0.77      0.79      0.78      1342

    accuracy                           0.79      2825
   macro avg       0.79      0.79      0.79      2825
weighted avg       0.79      0.79      0.79      2825





--------------------------------------------------------------
Data set TFIDF
Model name LogisticRegression
              precision    recall  f1-score   support

    positive       0.78      0.80      0.79      1483
    negative       0.77      0.75      0.76      1342

    accuracy                           0.78      2825
   macro avg       0.78      0.78      0.78      2825
weighted avg       0.78      0.78      0.78      2825

--------------------------------------------------------------
Data set TFIDF
Model name RandomForestClassifier
              precision    recall  f1-score   support

    positive       0.54      0.99      0.70      1483
    negative       0.91      0.08      0.15      1342

    accuracy                           0.56      2825
   macro avg       0.73      0.54      0.43      2825
weighted avg       0.72      0.56      0.44      2825

--------------------------------------------------------------
Data set TFIDF
Model name MLPClassifier
              precision

  'precision', 'predicted', average, warn_for)


The results show that the Naive Bayes model with TFIDF gives the best prediction (accuracy 0.79), although the differences among SVM, NB, and LR with TF and TFIDF are small (accuracy 0.77 to 0.79). Random Forest performs far worse (accuracy 0.56), predicting almost every review as positive.
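
To put those small differences in perspective, the sketch below estimates the sampling uncertainty of an accuracy measured on the 2,825 test reviews. It uses the normal approximation to the binomial, which is an assumption on my part and only a rough check, not a formal comparison of the models. The function name `accuracy_margin` is hypothetical.

```python
import math

def accuracy_margin(accuracy, n_samples, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n_samples."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    return z * std_err

margin = accuracy_margin(0.79, 2825)
print(f"0.79 +/- {margin:.3f}")
```

The margin comes out at roughly 1.5 percentage points, so accuracies between 0.77 and 0.79 on this test set are hard to tell apart.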

In [19]:
import sklearn; print(sklearn.__version__)

0.21.2
