<br>
<b>
    <font size="10" face="verdana">AI Fall 98 Project 3</font>
</b>
<hr>

## By Parsa Hoseininejad

## Introduction
Many industries now use artificial intelligence methods to improve their products. One such method is text classification, which determines some property of a text. In this project we build a mechanism that classifies news items into categories based on a few lines of each item. We use machine learning algorithms for this task.<br>

## Definition of Question
To classify the news, we need a model that takes the description, headline, and other fields of a news item and determines its type. Building this model requires a dataset of news items labeled with their types. The dataset is first preprocessed to build a dictionary. We then train the model on one part of the dataset and evaluate it on the rest. Finally, we use the model to determine the types of unlabeled news.<br>
The dataset used in this project is a portion of the dataset available at this link: https://www.kaggle.com/rmisra/news-category-dataset <br>
It provides the authors, category, date, headline, link, and a short description for about 25,000 news items. Note that some fields can be empty in some rows, and that every news item belongs to one of three types: TRAVEL, BUSINESS, or STYLE & BEAUTY.<br>
The approach used in this project is the bag-of-words model. In this model, each word gets a probability of appearing in a certain news type. We use these probabilities to determine the news type.
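
As a small illustration of the bag-of-words idea, the sketch below reduces a text to word counts, so two texts with the same words in a different order produce the same bag. The function name `bag_of_words` and the sample sentences are made up for this example and are not part of the project code.

```python
from collections import Counter


def bag_of_words(text):
    """Reduce a text to word counts; word order is discarded."""
    return Counter(text.lower().split())


# Two hypothetical headlines with the same words in a different order.
first = bag_of_words("best beach hotel beach")
second = bag_of_words("beach hotel best beach")

print(first == second)   # the two bags are identical
print(first["beach"])    # number of times "beach" occurs
```

Only the counts survive, which is why the model can later work with per-word probabilities alone.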

## Model
<hr>
To build a model from the dataset, Bayesian networks and naive Bayes are used.<br>
<img src="./Images/Bayes_rule.png"/> <br>
Each probability is defined below:<br>
1. Posterior Probability: The probability that a news item belongs to class c, given that it contains the words X.<br>
2. Likelihood: The probability that word x occurs in news of type c. This probability is calculated in the training step. For each word, it is the number of occurrences of the word in the class divided by the total number of words in that class.<br>
3. Class Prior Probability: The probability that a news item is of type c. These probabilities are extracted from the original dataset.<br>
4. Predictor Prior Probability: The probability that words x1, x2, and so on occur in the news. This probability is the same for every class given the input, so it is omitted.<br>
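
The definitions above can be turned into a scoring rule: for each class, add the logarithm of the class prior to the logarithms of the likelihoods of the words, and pick the class with the highest total. The sketch below is a minimal illustration under stated assumptions: the priors and likelihoods are made-up numbers, the names `log_score`, `classify`, and `UNSEEN_LOG_PENALTY` are hypothetical, and the fixed penalty of -6 for words unseen in a class mirrors the value used in this notebook's implementation.

```python
import math

# Assumption: a fixed log10 penalty for words that never occurred in a class,
# mirroring the "p -= 6" used in the notebook's code.
UNSEEN_LOG_PENALTY = -6.0


def log_score(words, log_prior, log_likelihood):
    """log10 of prior times the product of word likelihoods, computed as a sum."""
    score = log_prior
    for word in words:
        score += log_likelihood.get(word, UNSEEN_LOG_PENALTY)
    return score


def classify(words, models):
    """models maps a class name to a (log_prior, log_likelihood dict) pair."""
    return max(models, key=lambda c: log_score(words, *models[c]))


# Made-up numbers, for illustration only.
models = {
    "TRAVEL": (math.log10(0.5), {"hotel": math.log10(0.6), "market": math.log10(0.1)}),
    "BUSINESS": (math.log10(0.5), {"hotel": math.log10(0.1), "market": math.log10(0.7)}),
}

print(classify(["hotel", "beach"], models))
print(classify(["market"], models))
```

The predictor prior does not appear anywhere in the code because it is identical for every class and cannot change which class wins.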

### Dataset Preprocessing
To turn this dataset into useful data, four major steps are performed:<br>
1. Deleting stop words: Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and” would easily qualify as stop words. These words carry little information, so they are removed to let the model focus on the important words instead.
2. Converting words to lowercase: In the bag-of-words model, the position of a word in the sentence doesn't matter, so a word capitalized at the start of a sentence should count the same as its lowercase form. For instance, "Style" and "style" are treated as one word, and the probability of "style" becomes the sum of the probabilities of "Style" and "style", which is larger than either one alone. This is done to improve the accuracy of the model.
3. Tokenizing: To tokenize the news, non-alphabetic characters are used as delimiters.
4. Lemmatization: Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For instance, "stones" and "stone" are both interpreted as "stone". Because these two words have the same root, the frequency of the word should be the sum of both frequencies. In this project, lemmatization is preferred over stemming because lemmatization performs morphological analysis of the words.<br>
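
The first three steps can be sketched with the standard library alone. This is a simplified illustration, not the project's implementation: the stop word set is a tiny hypothetical sample rather than the NLTK list, the function name `preprocess` is made up, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Assumption: a tiny illustrative stop word set, not the full NLTK list.
STOP_WORDS = {"the", "is", "and", "of", "a", "in", "to"}


def preprocess(text):
    """Tokenize on non-alphabetic characters, lowercase, and drop stop words."""
    tokens = re.split(r"[^A-Za-z]+", text)
    return [t.lower() for t in tokens if t and t.lower() not in STOP_WORDS]


print(preprocess("The Style and Beauty of Paris is timeless!"))
```

Here each token is lowercased before the stop word check, so capitalized stop words such as "The" are removed as well.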

### Training
After preprocessing the data, the first 80% of each news type is selected for training. Then, for every word, its number of occurrences is stored in a data structure. Three dictionaries, one per class, store the likelihoods. The likelihood of every word is calculated as described before. Note that multiplying many numbers less than one produces smaller and smaller results, and floating-point arithmetic rounds sufficiently small values to 0 (underflow). To prevent this, the logarithm of each likelihood is stored, so the later steps add logarithms instead of multiplying probabilities.
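
A minimal sketch of this step, assuming each document is already a list of preprocessed words. The function name `log_likelihoods` and the sample documents are hypothetical. The second half shows why logarithms are stored: the plain product of many small probabilities underflows to zero, while the sum of their logarithms stays representable.

```python
import math
from collections import Counter


def log_likelihoods(documents):
    """Map each word to log10(count of the word / total words in the class)."""
    counts = Counter(word for doc in documents for word in doc)
    total = sum(counts.values())
    return {word: math.log10(n / total) for word, n in counts.items()}


# Hypothetical training documents for one class.
travel_docs = [["hotel", "beach"], ["hotel", "flight"]]
travel_ll = log_likelihoods(travel_docs)

# Multiplying 200 probabilities of 0.001 underflows to 0.0 ...
product = 1.0
for _ in range(200):
    product *= 1e-3

# ... while the sum of their logarithms is an ordinary number.
log_sum = sum(math.log10(1e-3) for _ in range(200))

print(travel_ll["hotel"], product, log_sum)
```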

### Evaluating
The evaluation has 2 phases. In the first phase, the prediction is made between the BUSINESS and TRAVEL classes only. In the second phase, all three classes take part in the evaluation.<br>
To evaluate the trained model, the rest of the dataset is used to calculate three indicators:
1. Recall: Recall is the fraction of the relevant instances (items that truly belong to a class) that were actually retrieved (assigned to that class by the model).
2. Precision: Precision is the fraction of relevant instances among the retrieved instances.
3. Accuracy: Accuracy is the fraction of all instances, across all classes, that were classified correctly.
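
Written as formulas, the three indicators look as follows. The symbols are introduced here for illustration and do not appear in the original text: for a class $c$, $TP_c$ is the number of items of class $c$ predicted as $c$, $FN_c$ is the number of items of class $c$ predicted as another class, $FP_c$ is the number of items of other classes predicted as $c$, and $N$ is the total number of evaluated items.

$$
\text{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \qquad
\text{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad
\text{Accuracy} = \frac{\sum_c TP_c}{N}
$$

Recall and precision are computed per class, while accuracy is a single value for the whole model.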

### Oversampling
Oversampling is a technique used in data analysis to adjust the class distribution of a dataset. In the given dataset, the BUSINESS class has far fewer items than the other two classes (about 4,000 fewer). So, oversampling is done by selecting random news items of the BUSINESS class and appending them to the dataset. Oversampling ensures that the probability of important words increases, because a randomly selected news item is more likely to contain an important word than not to contain it. Meanwhile, the probability of unimportant words decreases.
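
Random oversampling can be sketched as drawing minority-class rows with replacement until the class reaches a target size. This is a simplified illustration on plain dictionaries rather than a dataframe. The function name `oversample`, the `target_count` parameter, and the sample rows are assumptions made for the example.

```python
import random
from collections import Counter


def oversample(rows, minority, target_count, seed=0):
    """Append random copies of minority-class rows until it has target_count rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pool = [row for row in rows if row["category"] == minority]
    missing = target_count - len(pool)
    if missing <= 0 or not pool:
        return list(rows)
    return list(rows) + rng.choices(pool, k=missing)


# Hypothetical rows: five TRAVEL items and two BUSINESS items.
rows = (
    [{"category": "TRAVEL", "keywords": ["hotel"]}] * 5
    + [{"category": "BUSINESS", "keywords": ["market"]}] * 2
)
balanced = oversample(rows, "BUSINESS", target_count=5)
counts = Counter(row["category"] for row in balanced)
print(counts)
```

Duplicates should be drawn only from the training portion of the data. Otherwise copies of the same item can end up in both the training and the evaluation sets.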

## Implementation
<hr>
To implement this model, the data is read and stored in a dataframe from the <code>Pandas</code> library. Tokenizing, lemmatization, and stop word deletion are implemented using the <code>NLTK</code> library.

In [1]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer 
from nltk.stem import WordNetLemmatizer   
import re
import string
import math
import random 

TRAVEL = 'TRAVEL'
BUSINESS = 'BUSINESS'
STYLE = 'STYLE & BEAUTY'
DESCRIPTION_COL = 'short_description'
CATEGORY_COL = 'category'
HEADLINE_COL = 'headline'
WORDS_COL = 'keywords'

businessP = math.log(5937 / 200853, 10) 
travelP = math.log(9887 / 200853, 10)
styleP = math.log(9649 / 200853, 10)
trainPercentage = 0.8

stopWords = set(stopwords.words('english')) 

class Classifier:
    def __init__(self, dataFileAddr, testFileAddr):
        self.dataFileAddr = dataFileAddr
        self.testFileAddr = testFileAddr
        self.travelDict = {}
        self.styleDict = {}
        self.businessDict = {}

    def run(self):
        self.data = self.cleanData(self.dataFileAddr)
        self.oversample()
        self.train()

    def cleanData(self, path):
        df = pd.read_csv(path, delimiter = ',')
        df.drop(['index', 'authors', 'date', 'link'], axis=1, inplace=True)
        df.dropna(how='all', inplace=True)
        tokenizer = RegexpTokenizer(r'\w+')
        lemmatizer = WordNetLemmatizer()
        wordsCol = []

        for index, row in df.iterrows():
            # Missing fields are read as NaN (a float), so only string fields are used.
            description = ''
            if not isinstance(row[HEADLINE_COL], float):
                description = row[HEADLINE_COL]
            if not isinstance(row[DESCRIPTION_COL], float):
                description += ' ' + row[DESCRIPTION_COL]
            words = tokenizer.tokenize(description)
            filteredWords = [lemmatizer.lemmatize(w.lower()) for w in words if w not in stopWords]
            wordsCol.append(filteredWords)
        
        df.drop([DESCRIPTION_COL, HEADLINE_COL], axis=1, inplace=True)
        df[WORDS_COL] = wordsCol
        return df

    def oversample(self):
        travelDf = self.data.loc[self.data[CATEGORY_COL] == TRAVEL]
        businessDf = self.data.loc[self.data[CATEGORY_COL] == BUSINESS]
        businessTrainDf = businessDf.head(int(trainPercentage * len(businessDf)))

        randomSelected = []

        for i in range(len(travelDf) - len(businessDf)):
            idx = int(random.randint(0, len(businessTrainDf)-1))
            for index, row in businessTrainDf.iloc[[idx]].iterrows():
                randomSelected.insert(0, {CATEGORY_COL: row[CATEGORY_COL], WORDS_COL: row[WORDS_COL]})
        self.data = pd.concat([pd.DataFrame(randomSelected), self.data], ignore_index=True)

    def train(self):
        travelDf = self.data.loc[self.data[CATEGORY_COL] == TRAVEL]
        businessDf = self.data.loc[self.data[CATEGORY_COL] == BUSINESS]
        styleDf = self.data.loc[self.data[CATEGORY_COL] == STYLE]

        wordsCount = 0
        lastTravelTrainIdx = int(trainPercentage * len(travelDf))
        idx = 0
        for index, row in travelDf.iterrows():
            for word in row[WORDS_COL]:
                wordsCount += 1
                if word in self.travelDict:
                    self.travelDict[word] += 1
                else:
                    self.travelDict[word] = 1
            idx += 1
            if idx > lastTravelTrainIdx:
                break
        for key, value in self.travelDict.items():
            self.travelDict[key] = math.log(value/wordsCount, 10)
    
        wordsCount = 0
        lastBusinessTrainIdx = int(trainPercentage * len(businessDf))
        idx = 0
        for index, row in businessDf.iterrows():
            for word in row[WORDS_COL]:
                wordsCount += 1
                if word in self.businessDict:
                    self.businessDict[word] += 1
                else:
                    self.businessDict[word] = 1
            idx += 1
            if idx > lastBusinessTrainIdx:
                break
        for key, value in self.businessDict.items():
            self.businessDict[key] = math.log(value/wordsCount, 10)

        wordsCount = 0
        lastStyleTrainIdx = int(trainPercentage * len(styleDf))
        idx = 0
        for index, row in styleDf.iterrows():
            for word in row[WORDS_COL]:
                wordsCount += 1
                if word in self.styleDict:
                    self.styleDict[word] += 1
                else:
                    self.styleDict[word] = 1
            idx += 1
            if idx > lastStyleTrainIdx:
                break
        for key, value in self.styleDict.items():
            self.styleDict[key] = math.log(value/wordsCount, 10)

    def evaluatePhase1(self):
        travelDf = self.data.loc[self.data[CATEGORY_COL] == TRAVEL]
        businessDf = self.data.loc[self.data[CATEGORY_COL] == BUSINESS]
        travelDf = travelDf.tail(int((1-trainPercentage) * len(travelDf)))
        businessDf = businessDf.tail(int((1-trainPercentage) * len(businessDf)))
        
        currectTravel = 0
        wrongTravel = 0
        for index, row in travelDf.iterrows():
            if (self.getNewsTypePhase1(row[WORDS_COL]) == TRAVEL): 
                currectTravel += 1
            else:
                wrongTravel += 1
        travelRecall = currectTravel / len(travelDf) * 100

        currectBusiness = 0
        wrongBusiness = 0
        for index, row in businessDf.iterrows():
            if (self.getNewsTypePhase1(row[WORDS_COL]) == BUSINESS): 
                currectBusiness += 1
            else:
                wrongBusiness += 1
        businessRecall = currectBusiness / len(businessDf) * 100

        travelPrecision = currectTravel / (currectTravel+wrongBusiness) * 100
        businessPrecision = currectBusiness / (currectBusiness+wrongTravel) * 100

        accuracy = (currectBusiness + currectTravel) / (len(travelDf) + len(businessDf)) * 100

        outDf = pd.DataFrame(columns=['Phase1', 'Travel', 'Business'])
        outDf['Phase1'] = ['Recall', 'Precision', 'Accuracy']
        outDf['Travel'] = [travelRecall, travelPrecision, accuracy]
        outDf['Business'] = [businessRecall, businessPrecision, accuracy]
        return outDf
            
    def evaluatePhase2(self):
        travelDf = self.data.loc[self.data[CATEGORY_COL] == TRAVEL]
        businessDf = self.data.loc[self.data[CATEGORY_COL] == BUSINESS]
        styleDf = self.data.loc[self.data[CATEGORY_COL] == STYLE]
        travelDf = travelDf.tail(int((1-trainPercentage) * len(travelDf)))
        businessDf = businessDf.tail(int((1-trainPercentage) * len(businessDf)))
        styleDf = styleDf.tail(int((1-trainPercentage) * len(styleDf)))
        
        travelCount = 0
        businessCount = 0
        styleCount = 0        
        currectTravel = 0
        for index, row in travelDf.iterrows():
            if (self.getNewsTypePhase2(row[WORDS_COL]) == TRAVEL): 
                currectTravel += 1
                travelCount += 1
            elif (self.getNewsTypePhase2(row[WORDS_COL]) == BUSINESS):
                businessCount += 1
            else:
                styleCount += 1
        travelRecall = currectTravel / len(travelDf) * 100

        currectBusiness = 0
        for index, row in businessDf.iterrows():
            if (self.getNewsTypePhase2(row[WORDS_COL]) == BUSINESS): 
                currectBusiness += 1
                businessCount += 1
            elif (self.getNewsTypePhase2(row[WORDS_COL]) == TRAVEL):
                travelCount += 1
            else:
                styleCount += 1
        businessRecall = currectBusiness / len(businessDf) * 100

        currectStyle = 0
        for index, row in styleDf.iterrows():
            if (self.getNewsTypePhase2(row[WORDS_COL]) == STYLE): 
                currectStyle += 1
                styleCount += 1
            elif (self.getNewsTypePhase2(row[WORDS_COL]) == TRAVEL):
                travelCount += 1
            else:
                businessCount += 1
        styleRecall = currectStyle / len(styleDf) * 100

        travelPrecision = currectTravel / travelCount * 100
        businessPrecision = currectBusiness / businessCount * 100
        stylePrecision = currectStyle / styleCount * 100

        accuracy = (currectBusiness + currectTravel + currectStyle) / (len(travelDf) + len(businessDf) + len(styleDf)) * 100

        outDf = pd.DataFrame(columns=['Phase1', 'Travel', 'Business', 'Style & Beauty'])
        outDf['Phase1'] = ['Recall', 'Precision', 'Accuracy']
        outDf['Travel'] = [travelRecall, travelPrecision, accuracy]
        outDf['Business'] = [businessRecall, businessPrecision, accuracy]
        outDf['Style & Beauty'] = [styleRecall, stylePrecision, accuracy]
        return outDf
    
    def getNewsTypePhase1(self, words):
        if self.travelProbability(words) > self.businessProbablity(words):
            return TRAVEL
        else:
            return BUSINESS

    def getNewsTypePhase2(self, words):
        t = self.travelProbability(words)
        b = self.businessProbablity(words)
        s = self.styleProbablity(words)
        if t > b and t > s:
            return TRAVEL
        elif b > s:
            return BUSINESS
        else:
            return STYLE
    
    def travelProbability(self, words):
        p = travelP
        for word in words:
            if word in self.travelDict:
                p += self.travelDict[word]
            else:
                p -= 6
        return p

    def businessProbablity(self, words):
        p = businessP
        for word in words:
            if word in self.businessDict:
                p += self.businessDict[word]
            else:
                p -= 6
        return p

    def styleProbablity(self, words):
        p = styleP
        for word in words:
            if word in self.styleDict:
                p += self.styleDict[word]
            else:
                p -= 6
        return p
            
    def classify(self):
        self.test = self.cleanData(self.testFileAddr)
        # The context manager closes the file even if a prediction raises.
        with open('output.csv', 'w') as file:
            file.write('index,category\n')
            for index, row in self.test.iterrows():
                file.write(str(index) + ',' + self.getNewsTypePhase2(row[WORDS_COL]) + '\n')
        
    def getConfusionMatrix(self):
        travelDf = self.data.loc[self.data[CATEGORY_COL] == TRAVEL]
        businessDf = self.data.loc[self.data[CATEGORY_COL] == BUSINESS]
        styleDf = self.data.loc[self.data[CATEGORY_COL] == STYLE]
        travelDf = travelDf.tail(int((1-trainPercentage) * len(travelDf)))
        businessDf = businessDf.tail(int((1-trainPercentage) * len(businessDf)))
        styleDf = styleDf.tail(int((1-trainPercentage) * len(styleDf)))

        travelPredictedStyle = 0
        travelPredictedBusiness = 0
        businessPredictedStyle = 0
        businessPredictedTravel = 0
        stylePredictedTravel = 0
        stylePredictedBusiness = 0

        currectTravel = 0
        for index, row in travelDf.iterrows():
            if (self.getNewsTypePhase2(row[WORDS_COL]) == TRAVEL): 
                currectTravel += 1
            elif (self.getNewsTypePhase2(row[WORDS_COL]) == BUSINESS):
                travelPredictedBusiness += 1
            else:
                travelPredictedStyle += 1

        currectBusiness = 0
        for index, row in businessDf.iterrows():
            if (self.getNewsTypePhase2(row[WORDS_COL]) == BUSINESS): 
                currectBusiness += 1
            elif (self.getNewsTypePhase2(row[WORDS_COL]) == TRAVEL):
                businessPredictedTravel += 1
            else:
                businessPredictedStyle += 1

        currectStyle = 0
        for index, row in styleDf.iterrows():
            if (self.getNewsTypePhase2(row[WORDS_COL]) == STYLE): 
                currectStyle += 1
            elif (self.getNewsTypePhase2(row[WORDS_COL]) == TRAVEL):
                stylePredictedTravel += 1
            else:
                stylePredictedBusiness += 1

        outDf = pd.DataFrame(columns=['x', 'Predicted: Travel', 'Predicted: Business', 'Predicted: Style & Beauty'])
        outDf['x'] = ['Actual: Travel', 'Actual: Business', 'Actual: Style & Beauty']
        outDf['Predicted: Travel'] = [currectTravel, businessPredictedTravel, stylePredictedTravel]
        outDf['Predicted: Business'] = [travelPredictedBusiness, currectBusiness, stylePredictedBusiness]
        outDf['Predicted: Style & Beauty'] = [travelPredictedStyle, businessPredictedStyle, currectStyle]
        return outDf

        
        
cl = Classifier('./Attachment/data.csv', './Attachment/test.csv')
cl.run()

### Phase1 Evaluation

In [2]:
cl.evaluatePhase1()

| | Phase1 | Travel | Business |
|---|---|---|---|
| 0 | Recall | 95.559303 | 92.636313 |
| 1 | Precision | 92.84544 | 95.425594 |
| 2 | Accuracy | 94.097808 | 94.097808 |


### Phase2 Evaluation

In [3]:
cl.evaluatePhase2()

| | Phase1 | Travel | Business | Style & Beauty |
|---|---|---|---|---|
| 0 | Recall | 93.816751 | 90.949972 | 93.260369 |
| 1 | Precision | 89.827772 | 92.881745 | 95.572609 |
| 2 | Accuracy | 92.670948 | 92.670948 | 92.670948 |


### Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the performance of an algorithm to be visualized. The numbers of correct and incorrect predictions are summarized as counts and broken down by class. The confusion matrix of this model is shown below.

In [4]:
cl.getConfusionMatrix()

| | x | Predicted: Travel | Predicted: Business | Predicted: Style & Beauty |
|---|---|---|---|---|
| 0 | Actual: Travel | 1669 | 78 | 32 |
| 1 | Actual: Business | 118 | 1618 | 43 |
| 2 | Actual: Style & Beauty | 71 | 46 | 1619 |
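
The recall, precision, and accuracy values of the Phase2 evaluation can be recovered from this confusion matrix. The sketch below copies the counts from the matrix above. The function name `per_class_metrics` is made up for the example. Rows are actual classes and columns are predicted classes.

```python
LABELS = ["Travel", "Business", "Style & Beauty"]

# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
CONFUSION = [
    [1669, 78, 32],
    [118, 1618, 43],
    [71, 46, 1619],
]


def per_class_metrics(matrix):
    """Return (recall per class, precision per class, accuracy), all in percent."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    # Recall divides by the row sum: all items that truly belong to the class.
    recall = [100 * matrix[i][i] / sum(matrix[i]) for i in range(n)]
    # Precision divides by the column sum: all items predicted as the class.
    precision = [
        100 * matrix[i][i] / sum(matrix[r][i] for r in range(n)) for i in range(n)
    ]
    return recall, precision, 100 * correct / total


recall, precision, accuracy = per_class_metrics(CONFUSION)
for label, r, p in zip(LABELS, recall, precision):
    print(f"{label}: recall={r:.2f} precision={p:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The diagonal holds the correct predictions, so most of the counts lying on it corresponds to the high accuracy reported for Phase2.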


## Questions
1. __Stemming or Lemmatization?__ Lemmatization is preferred over stemming because lemmatization performs morphological analysis of the words. In this case, lemmatization improves each of the indicators by less than 1%.
2. __What is the tf-idf measure?__ TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents. This measure is calculated by the formula __tf-idf(t, d) = tf(t, d) * idf(t)__ for a term t in document d. <code>tf(t, d)</code> is basically our measure in this project. <code>idf(t)</code> is calculated by the formula idf(t) = log(N / df(t)), where N is the total number of documents and df(t), the document frequency of t, is the number of documents containing the term t. If this measure had been selected, the tf-idf would have been calculated for each word in each class. Then, the rest of the operations would be the same.
3. __What if a model considers only precision?__ For example, consider a model that takes an email and determines whether it is spam or ham. If the model is trained to require very strong evidence before marking an email as spam, nearly every email it marks will truly be spam, so the precision can approach 100%. However, it will not mark an email as spam even when there is a good chance that it is spam, so many spam emails are missed and the recall is low. In this case, the precision is high but the model doesn't work correctly.
4. __What happens if a word occurs in only one class?__ In this case, the probability contributed by that word is 10^-6 for the classes that don't contain the word. So the model will probably choose the class containing the word as the answer.
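
To make the tf-idf measure from question 2 concrete, the sketch below computes it for a tiny made-up collection, using the common logarithmic form idf(t) = log10(N / df(t)) with N the number of documents and df(t) the number of documents containing t. The function name `tf_idf` and the sample documents are assumptions. Libraries often use smoothed variants of this formula.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: tf * idf} dictionary per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return scores


# Hypothetical collection of two preprocessed documents.
docs = [["hotel", "beach", "hotel"], ["hotel", "market"]]
scores = tf_idf(docs)
print(scores[0])
print(scores[1])
```

A term that occurs in every document, such as "hotel" here, gets a score of zero, which is how tf-idf discounts words that do not help to tell documents apart.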