
<h1> Computer Assignment 3 Report </h1>
<h2> Artificial Intelligence Course - University of Tehran - Fall 1400 </h2>
<h2> Naive Bayes Classifier </h2>
<h3> Name: Kianoush Arshi <br>
 Student ID: 810198438 </h3>

In this assignment, we'll use a Naive Bayes classifier to classify a dataset of advertisements. There are six different classes:<br>
<li>Vehicles</li>
<li>Electronic devices</li>
<li>Businesses</li>
<li>For the home</li>
<li>Personal</li>
<li>Leisure & Hobbies</li>

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("Data/divar_train.csv")
df

Unnamed: 0,title,description,categories
0,بلبل خرمایی,سه عدد بلبل خرمایی سه ماهه.از وقتی جوجه بودن خ...,leisure-hobbies
1,عینک اسکی در حد,عینک اسکی دبل لنز مارک يو وكس در حد نو اصلی م...,leisure-hobbies
2,تکیه سر تویوتا پرادو,پارچه ای سالم و تمیز.,vehicles
3,مجسمه کریستال24%,مجسمه دکوری کریستال بالرین Rcr24%,for-the-home
4,کیف و ساک,هر 2 کاملا تمیز هستند,personal
...,...,...,...
10195,ان هاش 85,نیمه دوم همه چی به شرط در حد خشک 260تا کار,vehicles
10196,405 دوگانه کارخانه. تمیز,فابریک 4 حلقه لاستیک 205 نو بیمه یکسال تخفیف ب...,vehicles
10197,بخاری گازی دودکش دار پلار,بخاری نو و بسیار تمیز هستش\nبا مشتری واقعی کنا...,for-the-home
10198,نر کله برنجی چتری,سلام به دلیل کمبود جا واسباب کشی به کمترین قیم...,leisure-hobbies


First, let's check the count of advertisements in each category:

In [29]:
df['categories'].value_counts()

leisure-hobbies       1700
vehicles              1700
for-the-home          1700
personal              1700
electronic-devices    1700
businesses            1700
Name: categories, dtype: int64

As mentioned in the project description, every category has the same number of advertisements, so no resampling is needed.

<h2>Phase 1: Pre-processing the Data</h2>
<p>In this phase, the dataset is prepared so that it can be used efficiently and correctly in the later steps. The changes made to the dataset include:</p>
<li>Stemming</li>
<li>Lemmatizing</li>
<li>Tokenizing</li>


<h3>Stemming and lemmatization</h3>

In [4]:
# Import only the hazm tools used in this notebook
from hazm import Stemmer, Lemmatizer, word_tokenize

stemmer = Stemmer()
lemmatizer = Lemmatizer()

In [20]:
print(stemmer.stem(df['title'][0]))
print(stemmer.stem('خوابیدند'))
print(stemmer.stem('خوابید'))
print(stemmer.stem('خواب'))
print(stemmer.stem('بخوابی'))
print(stemmer.stem('خوابیدم'))

بلبل خرما
خوابیدند
خوابید
خواب
بخواب
خوابید


The stemmer reduces each word to a stem by cutting off parts of the word. This isn't very useful for us: it simply removes some of the characters and, as the output above shows, leaves most of the verb forms unchanged.

In [21]:
print(lemmatizer.lemmatize(df['title'][0]))
print(lemmatizer.lemmatize('خوابیدند'))
print(lemmatizer.lemmatize('خوابید'))
print(lemmatizer.lemmatize('خواب'))
print(lemmatizer.lemmatize('بخوابی'))
print(lemmatizer.lemmatize('خوابیدم'))

بلبل خرمایی
خوابید#خواب
خوابید#خواب
خواب
خوابید#خواب
خوابید#خواب


Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. It is similar to stemming, but it brings context to the words, so it maps related forms of a word to a single entry. For verbs, this is made possible by extracting the principal parts of the verb, printed as past stem # present stem (bon-e mazi # bon-e mozare).
Lemmatization is the preferred method over stemming.<br>

To make use of lemmatization, however, we first need to tokenize the data:

In [23]:
print(word_tokenize(df['title'][0]))
print(word_tokenize(df['description'][0]))

['بلبل', 'خرمایی']
['سه', 'عدد', 'بلبل', 'خرمایی', 'سه', 'ماهه', '.', 'از', 'وقتی', 'جوجه', 'بودن', 'خودم', 'بزرگشون', 'کردم', 'اما', 'دستی', 'نیستن', 'واسه', 'همین', 'قیمت', 'پایین', 'دادم', '.', 'هر', 'سه', 'با', 'هم', '100', 'تومان', 'مقطوع', 'مقطوع']


The word_tokenize() function breaks a sentence into its words.<br>
This will prove useful later, since we'll be using the bag-of-words model to solve the problem.
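
As a minimal sketch of what the bag-of-words model means here, the tokens of an advertisement can be reduced to word counts, ignoring word order. The helper name `bag_of_words` is hypothetical, and the token list is just the first few tokens of the description printed above:

```python
from collections import Counter

def bag_of_words(tokens):
    """Return a mapping from each token to the number of times it occurs."""
    return Counter(tokens)

# First six tokens of the description shown in the output above
tokens = ['سه', 'عدد', 'بلبل', 'خرمایی', 'سه', 'ماهه']
bag = bag_of_words(tokens)

print(bag['سه'])    # the word appears twice in the list
print(bag['بلبل'])  # the word appears once in the list
```

A `Counter` returns 0 for words it has never seen, which is convenient when looking up words that don't occur in an advertisement.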

In [28]:
print(word_tokenize(df['title'][0])[1])
print(lemmatizer.lemmatize(word_tokenize(df['title'][0])[1]))

خرمایی
خرما


We need to load the stop words so that they can be ignored. Also, some characters need to be removed and the data needs to be cleaned.

In [33]:
with open('stop_words.txt', encoding="utf8") as stop_words:
    # Strip surrounding whitespace (including the trailing newline) from each stop word
    ignore = [line.strip() for line in stop_words]

ignore[20:50]

['…',
 'آاو و و و',
 'آخ',
 'آخر',
 'آخرها',
 'آخه',
 'آدمهاست',
 'آرام',
 'آرام آرام',
 'آره',
 'آزادانه',
 'آسان',
 'آسيب پذيرند',
 'آشكارا',
 'آشنايند',
 'آمرانه',
 'آن',
 'آن گاه',
 'آن ها',
 'آنان',
 'آناني',
 'آنجا',
 'آنچنان',
 'آنچنان كه',
 'آنچه',
 'آنرا',
 'آنقدر',
 'آنگاه',
 'آنها',
 'آنهاست']
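
The cleaning step described above could be sketched as follows. This is only an illustration under assumptions: the helper `clean_tokens`, the `PUNCTUATION` set, and the two sample stop words are hypothetical and are not taken from the assignment or from `stop_words.txt`.

```python
import string

# Assumed set of characters to remove: ASCII punctuation plus a few Persian marks
PUNCTUATION = set(string.punctuation) | {'،', '؛', '؟', '…', '«', '»'}

def clean_tokens(tokens, stop_words):
    """Remove punctuation characters, then drop empty tokens and stop words."""
    stop_words = set(stop_words)  # set membership tests are fast
    cleaned = []
    for token in tokens:
        token = ''.join(ch for ch in token if ch not in PUNCTUATION)
        if token and token not in stop_words:
            cleaned.append(token)
    return cleaned

# Tokens taken from the tokenized description shown earlier
sample_tokens = ['سه', 'عدد', 'بلبل', 'خرمایی', 'سه', 'ماهه', '.', 'از', 'وقتی']
# Illustrative stop words only; in the notebook this would be the `ignore` list
sample_stop_words = ['از', 'وقتی']

cleaned = clean_tokens(sample_tokens, sample_stop_words)
print(cleaned)
```

Note that multi-word entries in the stop-word list (such as 'آن ها') can never match a single token, so they would need separate handling.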

<h2>Phase 2: Problem Solving Process</h2>
<p>In this phase, we aim to solve the problem using Naive Bayes. As mentioned previously, the bag-of-words model will be used. The feature used for classifying an advertisement is how often the words seen in each category appear in that advertisement. The base formula used for classifying advertisements is as follows:</p><br>

$$P(c|x)=\frac{P(x|c)P(c)}{P(x)}$$

$x$: The word(s) detected<br>
$c$: Advertisement class<br>
$P(c|x)$: Probability of the class being $c$, given that the word $x$ has appeared in the title and/or description (Posterior)<br>
$P(x|c)$: Probability of seeing word $x$ in an advertisement of class $c$ (Likelihood)<br>
$P(c)$: Probability of seeing an advertisement of class $c$. This is equal for all classes, since they all occur the same number of times in the dataset. (Class Prior Probability)<br>
$P(x)$: Probability of seeing word $x$ in the context (Predictor Prior Probability, or Evidence)<br>
Note that $x$ can also stand for multiple words $X = (x_1, x_2, ..., x_n)$. Assuming the words are independent given the class, we then have the following formula:<br>
$$P(c|X) \propto P(x_1|c)P(x_2|c)...P(x_n|c)P(c)$$
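
As a short worked step: the evidence $P(X)$ is the same for every class, so it can be ignored when comparing classes, and the predicted class $\hat{c}$ is the one with the largest product. Taking logarithms is a common numerical convenience (an assumption here, not something stated in the assignment) that turns the product of many small probabilities into a sum:

$$\hat{c}=\arg\max_{c}\;P(c)\prod_{i=1}^{n}P(x_i|c)=\arg\max_{c}\left[\log P(c)+\sum_{i=1}^{n}\log P(x_i|c)\right]$$

Since the six classes are equally frequent in this dataset, $P(c)=\frac{1}{6}$ for every class, and the prior term does not change which class wins.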

The process of solving this classification problem is listed below:<br>
<li>Tokenize the words (this can be unigram, bigram or n-gram).</li>
<li>Count the tokens in each class and estimate the probabilities on the right-hand side of the above formula from the training data (everything except $P(c|x)$, which is computed for the advertisements of the test dataset).</li>
<li>Test the classifier.</li>
<li>Calculate the accuracy.</li>
<li>Repeat until we get an adequate accuracy.</li>
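
The steps above can be sketched on toy data as follows. This is a minimal illustration, not the implementation used in this assignment: the function names, the English toy tokens, and the additive smoothing with `alpha` (used so that unseen words don't give a zero probability) are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """Count words per class. `samples` is a list of (tokens, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    class_counts = Counter()            # label -> number of advertisements
    vocabulary = set()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary

def predict(tokens, word_counts, class_counts, vocabulary, alpha=1.0):
    """Return the label with the highest log P(c) + sum of log P(x_i|c)."""
    total_ads = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_ads)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Additive smoothing (an assumption of this sketch)
            likelihood = (word_counts[label][token] + alpha) / (
                total_words + alpha * len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(samples, model):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(tokens, *model) == label for tokens, label in samples)
    return correct / len(samples)

# Toy data standing in for tokenized advertisements
train_samples = [
    (["car", "tyre", "engine"], "vehicles"),
    (["car", "insurance"], "vehicles"),
    (["phone", "charger"], "electronic-devices"),
    (["laptop", "phone", "screen"], "electronic-devices"),
]
test_samples = [
    (["car", "engine"], "vehicles"),
    (["phone", "screen"], "electronic-devices"),
]

model = train(train_samples)
print(accuracy(test_samples, model))
```

Working with log probabilities avoids numerical underflow when an advertisement contains many words.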

Tokenizing the words:<br>
First, we'll tokenize the title column and use it as our only feature.<br>
We'll then repeat this with the description only, and with both title and description, and choose the feature set with the best accuracy.<br>
This part is related to the Grouping Operations technique [1].<br>


Bigrams and N-grams:<br>
This part relates to the Feature Splitting technique of feature engineering. To learn more about this technique, see source [1].<br>
Note that using only unigrams can reduce the accuracy of the model. For example, consider the following sentences:<br>
I left my phone in the room.<br>
I'm left alone.<br>
"Left" has a different meaning in each sentence, and we need more than one word of context to figure out which one is intended.<br>
Persian example:<br>
<br>شیر امید را خورد (The lion ate Omid.)
<br>امید شیر را خورد (Omid drank the milk.)
<br>
This example is extremely difficult! Here, not only do we need to look at all the words, but we also need to take their order into account.
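
To illustrate why word order matters, here is a minimal sketch of n-gram extraction applied to the two Persian sentences above. The helper `ngrams` is hypothetical, and plain whitespace splitting stands in for a real tokenizer:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

first = "شیر امید را خورد".split()
second = "امید شیر را خورد".split()

# As unigrams, the two sentences contain exactly the same set of words...
same_unigrams = set(ngrams(first, 1)) == set(ngrams(second, 1))
# ...but their bigrams differ, because bigrams keep some of the word order.
same_bigrams = set(ngrams(first, 2)) == set(ngrams(second, 2))

print(same_unigrams, same_bigrams)
```

A unigram bag-of-words model cannot tell these two sentences apart, while a bigram model sees different features for each.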

References:<br>
[1] https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114<br>


