<h2 style = "font-size: 35px; text-align: center;">AI CA3</h2>
<h2 style = "font-size: 35px; text-align: center;">Data processing and bayesian networks</h2>
<h2 style = "font-size: 32px; text-align: center; color: #666">Houmch 810196443</h2>

In [1]:
import pandas as pd
import nltk
import numpy as np
import math
import hazm as hz
from hazm import stopwords_list
from hazm import word_tokenize
from collections import Counter

In [2]:
train_file_name = "comment_train.csv"
test_file_name = "comment_test.csv"

<br>
<h1 style = "text-align: center">Prelude</h1>
<br>
<h2>Definition of project:</h2>
<p style = "font-size: 14px">Text classification is useful in lots of aspects, for instance: spam email recognition, automatic book classification and ...
<br>    
    The goal of this project is to classify texts based on <mark>comments</mark> and <mark>titles</mark>. We have two main categories: <b>recommended and not_recommended </b> and data stored at <mark>comment_train.csv & comment_test.csv</mark>.
<br>
First of all we train system with train dataset with a strategy called <b>bag of words</b>, then for evaulation we use test dataset.Lastly we measure accuracy, recall, and precision for each one of the four processing methods.
<br></p>

<br>

<h1 style= "text-align: center">Prerequisites </h1>
<br>
<h3>Preprocessing data: </h3>
<p><mark>preprocess()</mark>: Every context in training data should be preprocessed in order to extract main words and to calculate probability of words per categories. preprocessing steps are shown below:</p>
<p style="text-indent :2em;">1. <mark>Normalize()</mark>: The normalization process can improve text matching. For example, there are several ways that the term "modem router" can be expressed, such as modem and router, modem & router, modem/router, and modem-router. By normalizing these words to the common form, it makes it easier to supply the right information to a shopper. By transforming the words to a standard format, other operations are able to work with the data and will not have to deal with issues that might compromise the process.</p>

<p style="text-indent :2em;">2. <mark>lemmatize()</mark>: This function uses <b>hazm library</b> to lemmatizes words.</p>
<p style="text-indent :2em;">3. <mark>remove_stopwords()</mark>: This function uses <b>hazm library</b> and tokenizer to remove <b>stopWords and punctuations</b>.</p>

## Question 1:  what is stemming and lemmatization?
<p>The main difference between lemmatization and stemming is the way they work and therefore the result they each of them returns</p>
<p><b>Stemming</b> algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word while <b>lemmatization</b>, on the other hand, takes into consideration the morphological analysis of the words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link the form back to its lemma.</p>
<p>In our project lemmatization works a bit better with better accuracy.</p>
<br>

In [3]:
stop_words = stopwords_list()

def normalize():
    normalizer = hz.Normalizer()
    train_df['title'] = train_df['title'].apply(lambda x: normalizer.normalize(x))
    train_df['comment'] = train_df['comment'].apply(lambda x: normalizer.normalize(x))
    test_df['title'] = test_df['title'].apply(lambda x: normalizer.normalize(x))
    test_df['comment'] = test_df['comment'].apply(lambda x: normalizer.normalize(x))

def lemmatize():
    lemmatizer = hz.Lemmatizer()
    train_df['title'] = train_df['title'].apply(lambda x: lemmatizer.lemmatize(x))
    train_df['comment'] = train_df['comment'].apply(lambda x: lemmatizer.lemmatize(x))
    test_df['title'] = test_df['title'].apply(lambda x: lemmatizer.lemmatize(x))
    test_df['comment'] = test_df['comment'].apply(lambda x: lemmatizer.lemmatize(x))

# def stemmer():
#     stemmer = hz.Stemmer()
#     train_df['title'] = train_df['title'].apply(lambda x: stemmer.stem(x))
#     train_df['comment'] = train_df['comment'].apply(lambda x: stemmer.stem(x))
#     test_df['title'] = test_df['title'].apply(lambda x: stemmer.stem(x))
#     test_df['comment'] = test_df['comment'].apply(lambda x: stemmer.stem(x))
    
def trim_cell(input_cell):
    cell_array = word_tokenize(input_cell)
    filtered_sentence = [] 
    for i in cell_array:
        if i not in stop_words:
             filtered_sentence.append(i)
    return filtered_sentence

def remove_stopwords():
    train_df['title'] = train_df['title'].apply(lambda x: trim_cell(str(x)))
    train_df['comment'] = train_df['comment'].apply(lambda x: trim_cell(str(x)))
    test_df['title'] = test_df['title'].apply(lambda x: trim_cell(str(x)))
    test_df['comment'] = test_df['comment'].apply(lambda x: trim_cell(str(x)))
    

def join_columns():
    train_df["joined"] = train_df["title"] + train_df["comment"]
    test_df["joined"] = test_df["title"] + test_df["comment"]
 

    
def preprocess():
    normalize()
#     stemmer()
    lemmatize()
    remove_stopwords()


## Question 2:  what is bag of words?

<p>Bag of words is a model which is insensitive to word orders. it uses bayesian networks and naive bayes to calculate probabilites. In our model we have 4 types of probabilites: </p>
<p style="text-indent :2em;">1. <mark>Posterier: p(c|x)</mark>: If we have the <b>x</b> as a word in our text, what is the probability of <b>c</b> category for this text.</p>
<p style="text-indent :2em;">2. <mark>Prior: p(c)</mark>: the probability of <b>c</b> as category (calculated based on training data). </p>
<p style="text-indent :2em;">3. <mark>Likelihood: p(x|c)</mark>: If we know that <b>c</b> is category of current text, what is the probability that this text contains word <b>x</b>.</p>
<p style="text-indent :2em;">4. <mark>Evidence: p(x)</mark>: the probability of exitance of <b>x</b> as a word in text. We devide the number of current word to total words to calculate p(x) for every x.</p>
<p><b>NOTE 1</b>: At the end we use <b>p(c|X = {x1,x2,x3,...}) = p(x1|c)*p(x2|c)* ..... * p(c)</b> to evaluate.</p>

### Implementation Explanation:
<p> First off I divided train_df into to parts based on their "recommend" column. After that on each new dataframe these steps are carried out sequnetially:
<p style="text-indent :2em;">1. <mark>List definition</mark>: Go through all the rows of the dataframe and append each one of the encountered words to a list.</p>    
<p style="text-indent :2em;">2. <mark>Dict definition</mark>: Make a dictionary based on that list. This dict is made up of a key denoted by "word" and a value equal to the number of occurrences of the said word divided by the total number of words.</p> <p style="text-indent :2em;">3. <mark>Dict Normalization</mark>: In order to make values a bit more comprehensible, each value has been replaced by it's corresponding logarithm (base 10). By using this method we can use addition instead of multiplication when calculating the p(c|X = {x1,x2,x3,...}).</p>
<p style="text-indent :2em;">4. <mark>Smoothin value definition</mark>: Smoothing method has been fully explained below but in brief, this variable is defined for each one of the dataframes and it contains smoothing value. It will be used in cases where a word has not been encountered before so that the naive bayes model would not crash!</p>
<p style="text-indent :2em;">5. <mark>P(c) calculation</mark>: Last but not least, this probability is the lenght of each one of the dataframes divided by the lenght of the original dataframe(train_df).</p> 
</p>

## Question 3&4:  Additive smoothing
<p>Let's say you've trained your Naive Bayes Classifier on 2 classes, "Ham" and "Spam" (i.e. it classifies emails). For the sake of simplicity, we'll assume prior probabilities to be 50/50.
Now let's say you have an email (w1,w2,...,wn) which your classifier rates very highly as "Ham", say: <mark> P(Ham|w1,w2,...wn)=0.90 and P(Spam|w1,w2,..wn)=0.10 </mark>.Now let's say you have another email (w1,w2,...,wn,wn+1) which is exactly the same as the above email except that there's one word in it that isn't included in the vocabulary. Therefore, since this word's count is 0. So now we have: <mark> P(Ham|wn+1)=P(Spam|wn+1)=0 </mark> and <mark> P(Spam|w1,w2,..wn,wn+1)=P(Spam|w1,w2,...wn)∗P(Spam|wn+1)=0 </mark>. Despite the 1st email being strongly classified in one class, this 2nd email may be classified differently because of that last word having a probability of zero.

<b>Laplace smoothing</b> solves this by giving the last word a small non-zero probability for both classes, so that the posterior probabilities don't suddenly drop to zero.
</p>



### Method of calculating p(x|c):
#### This method is used to consider the additive smoothing value in probabilities.
<h5 style = "font-size: 20px; text-align: center;"><mark> div_value </mark> = Unique words in Category c + occurance of Category c + 1</h5>
<h5 style = "font-size: 20px; text-align: center;"> <mark> P(x|c) </mark> = Number of occurances + 1 / div_value</h5>

In [4]:
def train_predictor():
    good_train_df = train_df[train_df.recommend == "recommended"]
    bad_train_df = train_df[train_df.recommend == "not_recommended"]
    method_index = 1
    good_train_df

    words_in_goods = []
    words_in_bads = []
    for i in range(len(good_train_df.index)):
        lst = good_train_df.iloc[i]['joined']
        for j in lst:
            words_in_goods.append(j)

    for i in range(len(bad_train_df.index)):
        lst = bad_train_df.iloc[i]['joined']
        for j in lst:
            words_in_bads.append(j)



    good_words_dict = {}
    bad_words_dict = {}
    g_words_size = len(words_in_goods)
    b_words_size = len(words_in_bads)
    g_vocab_size = len(Counter(words_in_goods).keys())
    b_vocab_size = len(Counter(words_in_bads).keys())
    count_g = len(good_train_df.index)
    count_b = len(bad_train_df.index)
    
    if (method_index == 1):
        for word in words_in_goods:
            if (word not in good_words_dict):
                good_words_dict[word] = 1/g_words_size
            else:
                good_words_dict[word] += 1/g_words_size

        for word in words_in_bads:
            if (word not in bad_words_dict):
                bad_words_dict[word] = 1/b_words_size
            else:
                bad_words_dict[word] += 1/b_words_size
                
    else:
        for word in words_in_goods:
            if (word not in good_words_dict):
                good_words_dict[word] = 1/(count_g + g_vocab_size + 1)
            else:
                good_words_dict[word] += 1/(count_g + g_vocab_size + 1)

        for word in words_in_bads:
            if (word not in bad_words_dict):
                bad_words_dict[word] = 1/(count_b + b_vocab_size + 1)
            else:
                bad_words_dict[word] += 1/(count_b + b_vocab_size + 1)




    for word in good_words_dict:
        good_words_dict[word] = math.log(good_words_dict[word], 10)

    for word in bad_words_dict:
        bad_words_dict[word] = math.log(bad_words_dict[word], 10)    

    if (method_index == 1):
        gmoothing_value = int(good_words_dict[min(good_words_dict, key=good_words_dict.get)]) - 1
        bsmoothin_value = int(bad_words_dict[min(bad_words_dict, key=bad_words_dict.get)]) - 1
    else:
        gmoothing_value = math.log(1/(count_g + g_vocab_size + 1), 10)
        bsmoothin_value = math.log(1/(count_b + b_vocab_size + 1), 10)


    good_p_c = math.log(len(good_train_df.index) / len(train_df.index), 10)
    bad_p_c = math.log(len(bad_train_df.index) / len(train_df.index), 10)
    
    return (good_words_dict, good_p_c, bad_words_dict, bad_p_c, gmoothing_value, bsmoothin_value)

## Question 5:  Precision or Recall?
<p> <b>Precision</b> means the percentage of your results which are relevant. On the other hand, <b>recall</b> refers to the percentage of total relevant results correctly classified by your algorithm </p>
<p> Imagine a retail app, wherein there is a limited space on each webpage, and extremely limited attention span of the customer. Therefore, if the customer is shown a lot of irrelevant results and very few relevant results (in order to achieve a high recall), the customer will not keep browsing each and every product forever and will eventually leave the app. </p>
<p> <b>Better Examples</b>: </p> 
<p style="text-indent :2em;">1. <mark>Recall is High.</mark> Choose a scenario where we will consider all encountered comments to <mark> recommended </mark>. Since recall is correct detected recommended / total recommended then recall is 100% but our model is definately poor. </p>
<p style="text-indent :2em;">2. <mark>Precision is Hight</mark> Choose a scenario where we will consider only one test comment to be recommended. Since precision is calculated using correct detected recommended / all detected recommended and both values are 1 then our precision will be 100% but our model is definately poor.</p>

## Question 6:  F1?
<p> The F1 score is the <mark>harmonic mean</mark> of precision and recall taking both metrics into account. We use the harmonic mean instead of a simple average because it punishes extreme values. A classifier with a precision of 1.0 and a recall of 0 has a simple average of 0.5 but an F1 score of 0. The F1 score gives equal weight to both measures. <b> If we want to create a balanced classification model with the optimal balance of recall and precision, then we try to maximize the F1 score. </b></p>

In [5]:
def test_predictor(use_smoothing, good_words_dict, good_p_c, bad_words_dict, bad_p_c, gmoothing_value, bsmoothin_value):
    answers = test_df['recommend'].tolist()
    predictions = []
    for m in range(800):
        gprob = 0
        gflag = False
        bflag = False
        bprob = 0
        for i in test_df.iloc[m]['joined']:
            if (i in good_words_dict):
                gprob += good_words_dict[i]
            elif (use_smoothing):
                gprob += gmoothing_value
            else:
                gflag = True
            
            if (i in bad_words_dict):
                bprob += bad_words_dict[i]
            elif (use_smoothing):
                bprob += bsmoothin_value
            else:
                bflag = True
        
        if (gflag):
            gprob = 0
        else:
            gprob += good_p_c
            gprob = 10**gprob
        
        if (bflag):
            bprob = 0
        else:
            bprob += bad_p_c
            bprob = 10**bprob
        
        if (gprob > bprob):
            predictions.append('recommended')
        elif (gprob < bprob):
            predictions.append('not_recommended')
        else:
            choice = np.random.randint(2)
            if (choice == 0):
                predictions.append('recommended')
            else:
                predictions.append('not_recommended')
    
    wrong_shots_idx = [] 
    head_shots = 0
    correct_detected_recomm = 0
    all_detected_recomm = predictions.count('recommended')
    total_recomm = answers.count('recommended')
    for i in range(len(answers)):
        if (answers[i] == predictions[i]):
            if (answers[i] == 'recommended'):
                correct_detected_recomm += 1
            head_shots += 1
        else:
             wrong_shots_idx.append(i)

    accuracy = head_shots / len(answers)
    precision = correct_detected_recomm / all_detected_recomm
    recall = correct_detected_recomm / total_recomm
    f1 = 2*((precision * recall) / (precision + recall))
    return (accuracy , precision, recall, f1, wrong_shots_idx)

In [6]:
def pretty_list(inp):
    a = [100*x for x in inp]
    return [round(num, 2) for num in a]


report_df = pd.DataFrame(columns = ['Accuracy', 'Precision', 'Recall', 'F1'])
report_df['Accuracy'] = [0, 0, 0, 0]
report_df['Precision'] = [0, 0, 0, 0]
report_df['Recall'] = [0, 0, 0, 0]
report_df['F1'] = [0, 0, 0, 0]
report_df.index = ['a', 'b' , 'c', 'd']
wrong_shots_idx = []


for i in range(4):
    train_df = pd.read_csv(train_file_name)
    test_df = pd.read_csv(test_file_name)
    if (i == 0):
        preprocess()
        join_columns()
        good_words_dict, good_p_c, bad_words_dict, bad_p_c, gmoothing_value, bsmoothin_value = train_predictor()
        out = test_predictor(True, good_words_dict, good_p_c, bad_words_dict, 
                             bad_p_c, gmoothing_value, bsmoothin_value)
        wrong_shots_idx = out[4]
        out = out[0:4]
        report_df.loc['a'] = pretty_list(out)
        
    elif (i == 1):
        join_columns()
        good_words_dict, good_p_c, bad_words_dict, bad_p_c, gmoothing_value, bsmoothin_value = train_predictor()
        out = test_predictor(True, good_words_dict, good_p_c, bad_words_dict, 
                             bad_p_c, gmoothing_value, bsmoothin_value)
        out = out[0:4]
        report_df.loc['b'] = pretty_list(out)
        
    elif (i == 2):
        preprocess()
        join_columns()
        good_words_dict, good_p_c, bad_words_dict, bad_p_c, gmoothing_value, bsmoothin_value = train_predictor()
        out = test_predictor(False, good_words_dict, good_p_c, bad_words_dict, 
                             bad_p_c, gmoothing_value, bsmoothin_value)
        out = out[0:4]
        report_df.loc['c'] = pretty_list(out)
    
    elif (i == 3):
        join_columns()
        good_words_dict, good_p_c, bad_words_dict, bad_p_c, gmoothing_value, bsmoothin_value = train_predictor()
        out = test_predictor(False, good_words_dict, good_p_c, bad_words_dict, 
                             bad_p_c, gmoothing_value, bsmoothin_value)
        out = out[0:4]
        report_df.loc['d'] = pretty_list(out)

report_df

Unnamed: 0,Accuracy,Precision,Recall,F1
a,91.38,92.11,90.5,91.3
b,63.75,64.86,60.0,62.34
c,85.88,85.79,86.0,85.89
d,64.38,65.67,60.25,62.84


## Question 8: 
<p> When we are using both preprocess and additive smoothing, values are above 90 percent as presented above. The reason is that we are doing both essential parts of the bag of words model, first one being the cleaning of data and second one considering a small amount of probability for non-encountered words in training. One reason that the results are not in the hight 90's is that by using the bag of words method we are ignoring the realtion between words and structure of sentences within comments. </p>
<p> In the second one we are no using any sort of preprocess and we are only using additive smoothing. As we can see, results are significantly lower. The reason is by ignoring the preprocess level there are words which must be eliminated in order make a better predictor. <b>Why do we need to remove stop words?</b> That is because stop words occur in abundance, hence providing little to no unique information that can be used for classification or clustering. Also no normalization is being done on the dataset.However, the result are better that random guess (50 precent) since we are using a method after all! </p>

<p> The third method is better than the second since it uses preprocessing. We have explained why to use stop words removal above. Now, <b> Why do we need to normalize?</b> Basically, normalization is the process of efficiently organising data in a database. There are two main objectives of the normalization process: eliminate redundant data (storing the same data in more than one table) and ensure data dependencies make sense (only storing related data in a table). By ignoring the additive smoothing we will face problems when we encounter a new word in test dataset. But these ocuurances are not too much to present big obsticles. Also, a random method has been applied in those situations and as a result, there's a 50 precent chance (ammortized) that our prediction will be correct. </p>

<p> Last but not least, we will have near random results when we igonre all mentioned methods in out prediction. No normalization and no additive smoothing will result in a dataset where there is <mark> resundant data </mark> and also luck will decide the fate in cases where we encounter a new word in test dataset. However, results are a bit better than absolute luck since at least a model (weak one!) is being applied on the dataset </p>

## Question 9: 
<p> A better thing we could have done is to consider the linguistic features of comments. In the bag of words method, we are totaly ignoring how words are combined together into a sentence and we're just focusing on occurances of various words in each category. Also, I think it would be better to use a specific way removing stop words. when a comment is full of words associated with happiness and joy it is more likely to be under recommended category than not_recommended. (Calculating this probability is a project for itself!). </p>

In [7]:
five_wrongs = wrong_shots_idx[0:5]
for i in range(5):
    element = test_df.iloc[five_wrongs[i]]
    print("comment title: ", element['title'])
    print("comment body: ", element['comment'])
    print("expected result: ", element['recommend'])
    print('-------------------------------------------------')

comment title:  دستگاه خیلی ضعیف
comment body:  من این فیس براس چند روز یپش به دستم رسید و الان بعد از چند روز استفاده از همه سری هاش دارم نظرم رو مینویسم
اول تشکر کنم بابت ارسال و بسته بندیه خوب دیجیکالا 
و اینکه این فیس براش رو من توی تخفیف خریدم. اول اینکه دستگاه به شدت ضعیفه با اینکه من 4تا باطری خوب هم انداختم روش ولی بازم خیلی ضعیفه در حدی که زود خاموش میشه! برس صورتش به نظر من باید لطیف تر باشه و برعکس برس بدنش باید یکم زبر تر باشه که بتونه لایه برداری انجام بده ولی برای شستشو معمولی بدن بد نیست ولی خب انقد ضعیفه که نمیشه باهاش شست. من تصمیم دارم بازش کنم و خودم قوی ترش کنم امیدوارم اون موقع بهتر بشه. در کل بنظر من بد حد یه فیس براش دستیه ولی آدم پولش رو جمع کنه و یه بهترش رو بخره خیلی بهتره چون این حتی سری هاشم گیر نمیاد و وقتی سری ها خراب بشن دیگه قابل استفاده نیست..
expected result:  not_recommended
-------------------------------------------------
comment title:  نقد پس از خرید
comment body:  سلام ، راحت شدم از کابل شارژ ، توصیه میشود به شدت . ارزان گوشی خود را به شارژ وای