
# Machine Learning Models for Classifying a Persian-Language Dataset



Here I work with two simple machine learning models, naive Bayes and SVM, for sentiment analysis on a Persian-language dataset, Taaghche.

Since we are working with a Persian dataset, we need a few libraries to preprocess the data before using it. We need the hazm package, one of the best-known libraries for tokenization, lemmatization and other Persian text preprocessing tasks. For removing Persian stopwords we can use the guilannlp stopwords package, so let's install both first.

In [1]:
!pip install hazm



In [3]:
!pip install stopwords_guilannlp

Collecting stopwords_guilannlp
  Downloading stopwords_guilannlp-13.2019.3.5-py3-none-any.whl (8.2 kB)
Installing collected packages: stopwords-guilannlp
Successfully installed stopwords-guilannlp-13.2019.3.5


After installation, we import the libraries needed for this task.

In [28]:
# General
import numpy as np
import pandas as pd
import codecs
# sklearn
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC
from sklearn.pipeline import Pipeline
# Preprocessing
from stopwords_guilannlp import stopwords_output
from hazm import Normalizer, word_tokenize
# Visualization
%matplotlib inline
import matplotlib.pyplot as plt
# Measuring metrics
from sklearn.metrics import f1_score

# Let's look at the dataset
The Taaghche dataset is a collection of book reviews with ratings from 0 to 5. Like many datasets it has some problems, so it needs preprocessing. First we load the dataset with pandas and look at its structure by displaying the first 5 rows.

In [29]:
df = pd.read_csv("taghche.csv", encoding="utf-8")
df.head()

Unnamed: 0,date,comment,bookname,rate,bookID,like
0,1395/11/14,اسم کتاب No one writes to the Colonel\nترجمش...,سرهنگ کسی ندارد برایش نامه بنویسد,0.0,3.0,2.0
1,1395/11/14,"طاقچه عزیز،نام کتاب""کسی به سرهنگ نامه نمینویسد...",سرهنگ کسی ندارد برایش نامه بنویسد,5.0,3.0,2.0
2,1394/06/06,بنظرم این اثر مارکز خیلی از صد سال تنهایی که ب...,سرهنگ کسی ندارد برایش نامه بنویسد,5.0,3.0,0.0
3,1393/09/02,به نظر کتاب خوبی میومد اما من از ترجمش خوشم نی...,سرهنگ کسی ندارد برایش نامه بنویسد,2.0,3.0,0.0
4,1393/06/29,کتاب خوبی است,سرهنگ کسی ندارد برایش نامه بنویسد,3.0,3.0,0.0


Since we don't need the date, bookname, bookID and like columns, we can drop them from the dataset.

In [30]:
# drop all unused columns in a single call
df.drop(columns=['date', 'bookname', 'bookID', 'like'], inplace=True)
df.head()

Unnamed: 0,comment,rate
0,اسم کتاب No one writes to the Colonel\nترجمش...,0.0
1,"طاقچه عزیز،نام کتاب""کسی به سرهنگ نامه نمینویسد...",5.0
2,بنظرم این اثر مارکز خیلی از صد سال تنهایی که ب...,5.0
3,به نظر کتاب خوبی میومد اما من از ترجمش خوشم نی...,2.0
4,کتاب خوبی است,3.0


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69829 entries, 0 to 69828
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   comment  69808 non-null  object 
 1   rate     69790 non-null  float64
dtypes: float64(1), object(1)
memory usage: 1.1+ MB


There are some inconsistencies: part of the dataset is null (missing values) and there are duplicated rows, so we should fix these first.

In [32]:
print("missing_values_stat:")
print(df.isnull().sum())

missing_values_stat:
comment    21
rate       39
dtype: int64


To skip null values we can use dropna(), then remove any duplicated comments with drop_duplicates() and finally reset the index.

In [33]:
data = df.dropna(subset=['comment'])
data = data.dropna(subset=['rate'])
data = data.drop_duplicates(subset=['comment'], keep='first')
data = data.reset_index(drop=True)

In [34]:
print("missing_values_stat:")
print(data.isnull().sum())

missing_values_stat:
comment    0
rate       0
dtype: int64
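
To make the effect of `dropna` and `drop_duplicates` concrete, here is a small sketch on a made-up DataFrame. The rows are invented for illustration and are not taken from the Taaghche data:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows): one missing comment, one missing rate, one duplicate.
toy = pd.DataFrame({
    "comment": ["good book", None, "good book", "bad translation", "average"],
    "rate": [5.0, 4.0, 5.0, 1.0, np.nan],
})

# Drop rows with a missing value in either column, then drop repeated comments.
toy_clean = (
    toy.dropna(subset=["comment", "rate"])
       .drop_duplicates(subset=["comment"], keep="first")
       .reset_index(drop=True)
)
print(toy_clean)
```

Only the first "good book" row and the "bad translation" row should remain, with a fresh 0-based index.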


In [35]:
unique_rates = list(sorted(data['rate'].unique()))
print(f'We have #{len(unique_rates)}: {unique_rates}')

We have #21: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 2932.0, 4206.0, 4385.0, 10053.0, 12836.0, 13482.0, 15965.0, 17152.0, 22694.0, 28912.0, 38473.0, 38591.0, 41097.0, 42368.0, 54374.0]


As we might have guessed, the rate column contains some values greater than 5, so we need to drop the rows whose rate is above 5.

In [36]:
data = data[data['rate'] <= 5]
data.head()

Unnamed: 0,comment,rate
0,اسم کتاب No one writes to the Colonel\nترجمش...,0.0
1,"طاقچه عزیز،نام کتاب""کسی به سرهنگ نامه نمینویسد...",5.0
2,بنظرم این اثر مارکز خیلی از صد سال تنهایی که ب...,5.0
3,به نظر کتاب خوبی میومد اما من از ترجمش خوشم نی...,2.0
4,کتاب خوبی است,3.0


In [37]:
unique_rates = list(sorted(data['rate'].unique()))
print(f'We have #{len(unique_rates)}: {unique_rates}')

We have #6: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
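
The filter above only bounds the rate from above. A slightly stricter variant, sketched here on invented values, keeps only rates inside the closed interval [0, 5], so a negative rate would be rejected as well. Whether the real data contains such values is not something this notebook checks, so treat it as an optional guard:

```python
import pandas as pd

# Hypothetical rows: two valid rates, one far too large, one negative.
toy = pd.DataFrame({
    "comment": ["a", "b", "c", "d"],
    "rate": [0.0, 5.0, 2932.0, -1.0],
})

# Series.between is inclusive on both ends by default.
toy_valid = toy[toy["rate"].between(0, 5)]
print(toy_valid["rate"].tolist())
```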


Now, for binary classification and for simplicity, we can map the rates to two labels: 0 for rates below a threshold and 1 for rates at or above it (the code uses a threshold of 3.0).

In [38]:
def rate_to_label(rate, threshold=3.0):
    # rates below the threshold are negative (0.0); the rest are positive (1.0)
    if rate < threshold:
        return 0.0
    else:
        return 1.0

data['rate'] = data['rate'].apply(lambda t: rate_to_label(t, 3.0))
labels = list(sorted(data['rate'].unique()))
data.head()

Unnamed: 0,comment,rate
0,اسم کتاب No one writes to the Colonel\nترجمش...,0.0
1,"طاقچه عزیز،نام کتاب""کسی به سرهنگ نامه نمینویسد...",1.0
2,بنظرم این اثر مارکز خیلی از صد سال تنهایی که ب...,1.0
3,به نظر کتاب خوبی میومد اما من از ترجمش خوشم نی...,0.0
4,کتاب خوبی است,1.0
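
The same mapping can also be written as a single vectorized comparison instead of a row-by-row `apply`. This is only a sketch on made-up rates; the names `toy_rates` and `toy_labels` are illustrative and not part of the notebook:

```python
import pandas as pd

# Hypothetical rates covering every value that survives the earlier filter.
toy_rates = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
threshold = 3.0

# True where the rate is at or above the threshold, then cast to 0.0 / 1.0.
toy_labels = (toy_rates >= threshold).astype(float)
print(toy_labels.tolist())
```

Note that with this threshold a rate of exactly 3 lands in the positive class, which matches what `rate_to_label` does.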


Now we can do some cleaning: removing HTML tags, certain patterns and emojis, and normalizing the text.

In [12]:
!pip install clean-text



In [39]:
from cleantext import clean
import re

In [40]:
def cleanHTML(raw_html):
    # strip anything that looks like an HTML tag (non-greedy match)
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext


def cleaning(text):
    text = text.strip()
    
    # regular cleaning
    text = clean(text,
        fix_unicode=True,
        to_ascii=False,
        lower=True,
        no_line_breaks=True,
        no_urls=True,
        no_emails=True,
        no_phone_numbers=True,
        no_numbers=False,
        no_digits=False,
        no_currency_symbols=True,
        no_punct=False,
        replace_with_url="",
        replace_with_email="",
        replace_with_phone_number="",
        replace_with_number="",
        replace_with_digit="0",
        replace_with_currency_symbol="",
    )
    # cleaning HTML tags
    text = cleanHTML(text)

    # normalizing
    normalizer = Normalizer()
    text = normalizer.normalize(text)

    # removing weird patterns
    wierd_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u'\U00010000-\U0010ffff'
        u"\u200d"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\u3030"
        u"\ufe0f"
        u"\u2069"
        u"\u2066"
        # u"\u200c"
        u"\u2068"
        u"\u2067"
        "]+", flags=re.UNICODE)
    
    text = wierd_pattern.sub(r'', text)
    
    # removing hashtags and extra spaces (raw string avoids an invalid-escape warning)
    text = re.sub("#", "", text)
    text = re.sub(r"\s+", " ", text)
    
    return text
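
To see the regex steps in isolation, here is a reduced sketch that uses only the standard library. It deliberately skips the `clean(...)` call and the hazm `Normalizer`, and it covers only a small subset of the emoji ranges above; `quick_clean` and the sample sentence are made up for illustration:

```python
import re

TAG_RE = re.compile(r"<.*?>")                        # non-greedy: one tag at a time
EMOJI_RE = re.compile("[\U0001F300-\U0001F6FF]+")    # subset of the emoji ranges
ZWNJ = "\u200c"                                      # zero-width non-joiner, kept on purpose


def quick_clean(text):
    text = TAG_RE.sub("", text)        # remove HTML tags
    text = EMOJI_RE.sub("", text)      # remove emojis in the covered ranges
    text = text.replace("#", "")       # keep hashtag words, drop the symbol
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace
    return text.strip()


sample = "<b>کتاب</b>   خوبی   بود \U0001F600 #پیشنهاد می" + ZWNJ + "کنم"
cleaned = quick_clean(sample)
print(cleaned)
```

The zero-width non-joiner is not whitespace as far as `\s` is concerned, so it survives the last substitution. That matters in Persian, where it separates parts of a word, and it is presumably why `\u200c` is commented out in the pattern above.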

In [41]:
data['cleaned_comment'] = data['comment'].apply(cleaning)
data = data.dropna(subset=['cleaned_comment'])
data = data.reset_index(drop=True)

data.head()

Unnamed: 0,comment,rate,cleaned_comment
0,اسم کتاب No one writes to the Colonel\nترجمش...,0.0,اسم کتاب no one writes to the colonel ترجمش می...
1,"طاقچه عزیز،نام کتاب""کسی به سرهنگ نامه نمینویسد...",1.0,طاقچه عزیز، نام کتاب «کسی به سرهنگ نامه نمینوی...
2,بنظرم این اثر مارکز خیلی از صد سال تنهایی که ب...,1.0,بنظرم این اثر مارکز خیلی از صد سال تنهایی که ب...
3,به نظر کتاب خوبی میومد اما من از ترجمش خوشم نی...,0.0,به نظر کتاب خوبی میومد اما من از ترجمش خوشم نی...
4,کتاب خوبی است,1.0,کتاب خوبی است
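
One caveat: `dropna` only removes missing values. A comment that consisted solely of tags or emojis can become an empty string after cleaning rather than `NaN`, so it would survive the call above. A possible extra guard, sketched on invented rows (the name `toy` is illustrative):

```python
import pandas as pd

# Hypothetical cleaned comments, two of which are empty or whitespace-only.
toy = pd.DataFrame({
    "cleaned_comment": ["کتاب خوبی است", "", "   ", "aliii"],
    "rate": [1.0, 0.0, 1.0, 1.0],
})

# Keep only rows whose comment still has at least one non-space character.
has_text = toy["cleaned_comment"].str.strip().str.len() > 0
toy = toy[has_text].reset_index(drop=True)
print(toy)
```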


In [42]:
# keep only the cleaned text and the label
data = data[['cleaned_comment', 'rate']]
data.head()

Unnamed: 0,cleaned_comment,rate
0,اسم کتاب no one writes to the colonel ترجمش می...,0.0
1,طاقچه عزیز، نام کتاب «کسی به سرهنگ نامه نمینوی...,1.0
2,بنظرم این اثر مارکز خیلی از صد سال تنهایی که ب...,1.0
3,به نظر کتاب خوبی میومد اما من از ترجمش خوشم نی...,0.0
4,کتاب خوبی است,1.0


Now we balance the two classes by randomly sampling each one down to the size of the smaller class.

In [43]:
negative_data = data[data['rate'] == 0.0]
positive_data = data[data['rate'] == 1.0]
# downsample both classes to the size of the smaller one
cutting_point = min(len(negative_data), len(positive_data))
negative_data = negative_data.sample(n=cutting_point).reset_index(drop=True)
positive_data = positive_data.sample(n=cutting_point).reset_index(drop=True)
# concatenate the two classes and shuffle the rows
new_data = pd.concat([negative_data, positive_data])
new_data = new_data.sample(frac=1).reset_index(drop=True)
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26360 entries, 0 to 26359
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   cleaned_comment  26360 non-null  object 
 1   rate             26360 non-null  float64
dtypes: float64(1), object(1)
memory usage: 412.0+ KB
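
The sampling above is not seeded, so each run picks different rows. If reproducibility matters, one option is to downsample per class with a fixed `random_state`. The following is a sketch on made-up data, not the notebook's actual procedure; the seed value and the names `toy` and `balanced` are arbitrary:

```python
import pandas as pd

# Hypothetical imbalanced data: 7 positive and 3 negative reviews.
toy = pd.DataFrame({
    "cleaned_comment": [f"review {i}" for i in range(10)],
    "rate": [1.0] * 7 + [0.0] * 3,
})

# Size of the smaller class.
n_min = int(toy["rate"].value_counts().min())

# Sample n_min rows from each class, then shuffle the combined result.
balanced = (
    toy.groupby("rate").sample(n=n_min, random_state=0)
       .sample(frac=1, random_state=0)
       .reset_index(drop=True)
)
print(balanced["rate"].value_counts())
```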


In [44]:
new_data.head()

Unnamed: 0,cleaned_comment,rate
0,نامه‌ای به داروین...... سلام جناب داروین! … می...,0.0
1,یک ستاره؛ برای اینکه کتاب گویا با نمایش رادیوی...,0.0
2,یه عاشقانه خیلی معمولی ولی چون من- دختری ک رها...,1.0
3,بسیار خوب و آموزنده بود,1.0
4,aliii,1.0


In [45]:
unique_rates = list(sorted(new_data['rate'].unique()))
print(f'We have #{len(unique_rates)}: {unique_rates}')

We have #2: [0.0, 1.0]


Now we need to split the dataset into 3 parts: train, validation and test sets. For this we use scikit-learn with a ratio of 0.2 for both splits: 20% of the data is held out as the test set, and 20% of the remaining data becomes the validation set.

In [46]:
from sklearn.model_selection import train_test_split

In [47]:
new_data['rate_id'] = new_data['rate'].apply(lambda t: labels.index(t))
train, test = train_test_split(new_data, test_size=0.2, stratify=new_data['rate'])
train, valid = train_test_split(train, test_size=0.2, stratify=train['rate'])

train = train.reset_index(drop=True)
valid = valid.reset_index(drop=True)
test = test.reset_index(drop=True)
x_train, y_train = train['cleaned_comment'].values.tolist(), train['rate_id'].values.tolist()
x_valid, y_valid = valid['cleaned_comment'].values.tolist(), valid['rate_id'].values.tolist()
x_test, y_test = test['cleaned_comment'].values.tolist(), test['rate_id'].values.tolist()

print(train.shape)
print(valid.shape)
print(test.shape)

(16870, 3)
(4218, 3)
(5272, 3)
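
Because the second split is applied to what is left after the first, the final proportions are roughly 64% train, 16% validation and 20% test, which is consistent with the shapes printed above. A small sketch with 100 dummy items shows the arithmetic; the fixed `random_state` is my addition, used only to make the sketch repeatable:

```python
from sklearn.model_selection import train_test_split

# 100 dummy items with perfectly balanced binary targets.
items = list(range(100))
targets = [0] * 50 + [1] * 50

# First split: hold out 20% of everything as the test part.
rest, test_part, y_rest, y_test_part = train_test_split(
    items, targets, test_size=0.2, stratify=targets, random_state=0)

# Second split: 20% of the remaining 80 items becomes the validation part.
train_part, valid_part, y_train_part, y_valid_part = train_test_split(
    rest, y_rest, test_size=0.2, stratify=y_rest, random_state=0)

print(len(train_part), len(valid_part), len(test_part))  # 64 16 20
```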


To feed the data into the models, we convert it from DataFrames and lists to NumPy arrays.

In [48]:
train = train.to_numpy()
valid = valid.to_numpy()
test = test.to_numpy()

In [49]:
x_train = np.array(x_train)
y_train = np.array(y_train)
x_test = np.array(x_test)
y_test = np.array(y_test)

# Model
Now it's time to define our ML models. Here I used a simple naive Bayes model and an SVM model, with a TF-IDF vectorizer to transform the cleaned, preprocessed text into feature vectors.

In [50]:
def tokenize(text):
    return word_tokenize(text)
min_df = 1

In [51]:
naive_bayes = Pipeline([('vect', CountVectorizer(tokenizer=tokenize,
                                              analyzer='word', ngram_range=(1, 2), min_df=min_df, lowercase=False)),
                     ('tfidf', TfidfTransformer(sublinear_tf=True)),
                     ('clf', MultinomialNB())])
naive_bayes = naive_bayes.fit(x_train, y_train)
naive_score = naive_bayes.score(x_test, y_test)
print('Naive Bayes Model: ', naive_score)
predict_nb = naive_bayes.predict(x_test)

Naive Bayes Model:  0.7659332321699545
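
To get a feel for what the first two pipeline steps produce, here is a sketch on three tiny English documents. It uses `str.split` as a stand-in tokenizer instead of hazm's `word_tokenize`, and the documents are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["good book", "bad book", "good good translation"]

# Same settings as the pipeline above, except for the stand-in tokenizer.
# token_pattern=None silences the warning about the unused default pattern.
vect = CountVectorizer(tokenizer=str.split, token_pattern=None,
                       analyzer="word", ngram_range=(1, 2), lowercase=False)
counts = vect.fit_transform(docs)

# With ngram_range=(1, 2) the vocabulary holds single words and word pairs.
vocab = sorted(vect.vocabulary_)
print(vocab)

# sublinear_tf=True replaces a raw count tf with 1 + log(tf) before IDF weighting.
tfidf = TfidfTransformer(sublinear_tf=True).fit_transform(counts)
print(tfidf.shape)
```

The vocabulary should contain the four unigrams plus the four bigrams "bad book", "good book", "good good" and "good translation", so each document becomes an 8-dimensional vector.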


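# SVM Model
We can also train an SVM model and compare the two models. Note that `max_iter=5` limits the solver to five iterations, so scikit-learn may warn that it did not converge.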

In [52]:
svm = Pipeline([('vect', CountVectorizer(tokenizer=tokenize,
                                         analyzer='word', ngram_range=(1, 2),
                                         min_df=min_df, lowercase=False)),
                ('tfidf', TfidfTransformer(sublinear_tf=True)),
                ('clf-svm', LinearSVC(loss='hinge', penalty='l2',
                                      max_iter=5))])

svm = svm.fit(x_train, y_train)
linear_svc_score = svm.score(x_test, y_test)
print('Linear SVC Model: ', linear_svc_score)
predict_svm = svm.predict(x_test)



Linear SVC Model:  0.756638846737481


# Evaluation with F1 Score

In [53]:
print("F1 score of NB model:")
f1_score(y_test, predict_nb, average='weighted')

F1 score of NB model:


0.765913827447639

In [54]:
print("F1 score of SVM model:")
f1_score(y_test, predict_svm, average='weighted')

F1 score of SVM model:


0.7565813857585859
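
For reference, `average='weighted'` computes one F1 score per class and averages them, weighting each class by its number of true samples. The sketch below reproduces that by hand on four made-up predictions; all names in it are illustrative:

```python
from sklearn.metrics import f1_score

toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]


def f1(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Class 0: the single predicted 0 is correct; 1 of the 2 true 0s is found.
f1_neg = f1(1 / 1, 1 / 2)
# Class 1: 2 of the 3 predicted 1s are correct; both true 1s are found.
f1_pos = f1(2 / 3, 2 / 2)

# Each class has 2 true samples, so the weights are 2/4 each.
manual = (2 * f1_neg + 2 * f1_pos) / 4
weighted = f1_score(toy_true, toy_pred, average="weighted")
print(manual, weighted)
```

On a balanced test set like the one used here, the weighted F1 tends to sit close to plain accuracy, which is what the scores above show.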

###### The preprocessing part was heavily inspired by: https://github.com/hooshvare/parsbert/tree/master/notebooks.
###### The Taaghche dataset is available at https://www.kaggle.com/saeedtqp/taaghche