# Introduction
### Naive Bayes
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.

For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods.

### Bag of Words
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.
The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:<br>
1- A vocabulary of known words.<br>
2- A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
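
As a minimal sketch of the two ingredients above (a vocabulary and a measure of presence), the following counts vocabulary words in a tokenized document. The function name `bag_of_words`, the toy documents, and the vocabulary are all hypothetical and only serve as illustration; they are not part of this project's code.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in a tokenized document."""
    counts = Counter(tokens)
    # Words outside the vocabulary are ignored; word order is discarded.
    return {word: counts[word] for word in vocabulary}

# Hypothetical toy documents, already lowercased and split into tokens.
doc_a = "the flight to paris was cheap".split()
doc_b = "cheap paris flight the was to".split()
vocabulary = ["cheap", "flight", "hotel", "paris"]

print(bag_of_words(doc_a, vocabulary))
# Same words in a different order give the same representation.
print(bag_of_words(doc_a, vocabulary) == bag_of_words(doc_b, vocabulary))
```

The two documents contain the same words in a different order, so their bag-of-words representations are identical.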

Here we apply both techniques to classify news items into three categories based on their short description and headline.

# Main Program and Algorithms
## Imports

In [1]:
import pandas as pd
import numpy as np
import math
import nltk
import time
import string
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

## Removing what we don't need
Here we read the files and remove the columns we don't need.<br> After that, we drop rows containing NaN values from both the test file and the data file.<br> Finally, we use the `index` column as the DataFrame index and remove the now-redundant column to get a cleaner view of the dataset.

In [2]:
data = pd.read_csv("data.csv")
test = pd.read_csv("test.csv")
data.pop('authors')
# data.pop('headline')
data.pop('date')
data.pop('link')
data.dropna(inplace=True)
test.pop('authors')
# test.pop('headline')
test.pop('date')
test.pop('link')
test.dropna(inplace=True)
data.index = data['index']
data.pop('index')
test.index = test['index']
test.pop('index')
pass

## NLTK
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries and other usages.

Here we use NLTK to load the English stopword list, and then tokenize the text so that punctuation and stopwords can be removed. The other downloads and objects are for the alternative approaches.

In [3]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
sw = list(stopwords.words('english'))
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
tokenizer = RegexpTokenizer(r'\w+')

[nltk_data] Downloading package stopwords to /Users/parsa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/parsa/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/parsa/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/parsa/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Alternative Professional Lemmatization
The first step is to convert the sentence to a list of tuples where every tuple contains both the word and its part-of-speech tag. Since `WordNetLemmatizer` expects a different kind of POS tags, we have to convert the ones generated by `nltk.pos_tag()` to those expected by `WordNetLemmatizer.lemmatize()`. This is done in `nltk2wn_tag()`.

There are some POS tags that correspond to words where the lemmatized form does not differ from the original word. For these, `nltk2wn_tag()` returns `None` and `lemmatize_sentence()` just copies them from the input to the output sentence.

In [4]:
def nltk2wn_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:                    
        return None

def lemmatize_sentence(sentence):
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))    
    wn_tagged = map(lambda x: (x[0], nltk2wn_tag(x[1])), nltk_tagged)

    res_words = []
    for word, tag in wn_tagged:
        if tag is None:                        
            res_words.append(word)
        else:
            res_words.append(lemmatizer.lemmatize(word, tag))

    return " ".join(res_words)

## Main PreProcess
### Stemming:
Stemming is the process of reducing the morphological variants of a word to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. Ideally, a stemming algorithm reduces the words “chocolates”, “chocolatey”, and “choco” to the root word “chocolate”, and “retrieval”, “retrieved”, and “retrieves” to the stem “retrieve”. In practice the stem need not be a dictionary word. Applications of stemming:<br>
1- Stemming is used in information retrieval systems like search engines.<br>
2- It is used to determine domain vocabularies in domain analysis.

### Lemmatization:
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming but it brings context to the words. So it links words with similar meaning to one word. Applications of lemmatization are:<br>
1- Used in comprehensive retrieval systems like search engines.<br>
2- Used in compact indexing

### We use:
The best approach for this project, though, is stemming. Because we use a bag-of-words model, the lemmatizer does not know the role of each word in its sentence and by default assumes that every word is a noun, which reduces accuracy.

Before stemming, we tokenize the text, removing punctuation, stopwords, and numbers, and convert all letters to lowercase. If we skip this step, accuracy may drop, because words will not match the entries of the word-probability dictionaries or of the stopword list.

Some treat stemming and lemmatization as the same thing. In general, lemmatization is often preferred over stemming because it performs a morphological analysis of the words, but in this project stemming worked better for the reason given above.
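
The cleaning step described above can be sketched without NLTK as follows. This is only an illustration under stated assumptions: `STOPWORDS` is a hypothetical, heavily shortened list (the notebook uses NLTK's English stopword list), the function name `tokenize` is made up for this sketch, and stemming is left out.

```python
import re

# Hypothetical, heavily shortened stopword list; the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "in", "of", "and", "to", "is"}

def tokenize(text):
    """Split on word characters, lowercase, and drop stopwords and non-alphabetic tokens."""
    # r"\w+" is the same pattern the notebook passes to RegexpTokenizer.
    words = re.findall(r"\w+", text)
    return [w.lower() for w in words if w.lower() not in STOPWORDS and w.isalpha()]

print(tokenize("The 10 Best Beaches in Europe, and How to Get There!"))
# ['best', 'beaches', 'europe', 'how', 'get', 'there']
```

Punctuation never reaches the token list because the pattern only matches word characters, and the number `10` is removed by the `isalpha()` check.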

In [5]:
def preprocess(text):
    # Tokenize on word characters, lowercase, drop stopwords and non-alphabetic tokens, then stem.
    # Alternatives that were tried instead of ps.stem(word.lower()):
    #   - lemmatize_sentence(text) before tokenizing
    #   - lemmatizer.lemmatize(word.lower()) for each token
    #   - no stemming at all, or de-duplicating the tokens with set()
    tokens = tokenizer.tokenize(text)
    return [ps.stem(word.lower()) for word in tokens if (word.lower() not in sw) and (word.isalpha())]

for frame in (data, test):
    for column in ('short_description', 'headline'):
        # Assign the whole column at once; chained indexing such as
        # frame[column][i] = ... may write to a temporary copy in pandas.
        frame[column] = frame[column].apply(preprocess)

data['Model_Predict'] = ' '

## Category Probabilities and Oversampling
Here we first compute the prior probability of each category, and then copy randomly chosen rows until each category has 9000 rows. Balancing the classes this way reduces the difference in accuracy between the three categories.

### Random Oversampling
I used random oversampling, which is the most naive strategy: new samples are generated by randomly re-drawing rows from the currently available samples. This differs from SMOTE, which synthesizes new samples instead of copying existing ones.

### Update Index numbers
Finally, I reset the indexes so that every row has a unique, sequential index number.
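
A minimal sketch of random oversampling with the standard library is shown below. The function `oversample` and the toy rows are hypothetical. Note one assumption that differs from the cell that follows: this sketch draws with replacement, whereas `DataFrame.sample` draws without replacement unless `replace=True` is passed.

```python
import random

def oversample(rows, target_size, seed=0):
    """Pad `rows` up to `target_size` by drawing extra copies at random, with replacement."""
    rng = random.Random(seed)
    extra = max(0, target_size - len(rows))
    # The original rows are kept; only the padding is drawn at random.
    return rows + rng.choices(rows, k=extra)

# Hypothetical minority class with 3 rows, padded to 5.
minority = ["row_a", "row_b", "row_c"]
balanced = oversample(minority, target_size=5)
print(len(balanced), set(balanced) <= set(minority))
```

No new information is created by this step: every added row is a copy of an existing one, so the class only gains weight, not variety.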

In [6]:
tDataFrame = data.loc[data['category']=='TRAVEL']
sDataFrame = data.loc[data['category']=='STYLE & BEAUTY']
bDataFrame = data.loc[data['category']=='BUSINESS']

lens = []
lens.append(len(tDataFrame))
lens.append(len(sDataFrame))
lens.append(len(bDataFrame))

probT = math.log(len(tDataFrame) / len(data))
probS = math.log(len(sDataFrame) / len(data))
probB = math.log(len(bDataFrame) / len(data))

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
tDataFrame = pd.concat([tDataFrame, tDataFrame.sample(9000 - len(tDataFrame))])
sDataFrame = pd.concat([sDataFrame, sDataFrame.sample(9000 - len(sDataFrame))])
bDataFrame = pd.concat([bDataFrame, bDataFrame.sample(9000 - len(bDataFrame))])

tDataFrame.index = range(len(tDataFrame))
sDataFrame.index = range(len(sDataFrame))
bDataFrame.index = range(len(bDataFrame))

## Bayes' Rule

<img src="Bayesian Formula.png">

We should note that in the picture above:<br>
The posterior probability is the probability that a news item containing the word xi is in category c.<br>
The likelihood is the probability that the word xi appears in a news item of category c. It is computed as the total number of occurrences of xi in category c news divided by the number of category c news items.<br>
The class probability is the number of news items in each category divided by the total number of news items.<br>
Finally, the evidence is the probability that the word xi appears in any news item regardless of its category. This probability is not computed directly.

#### Note:
p(c|xi) covers only a single word. What we need is p(c|X), where X is the combination of the words x1 to xn: the probability that a news item containing the words x1 to xn belongs to category c. This probability can be computed with the second formula in the picture above. The category with the highest p(c|X) for a given news item is the category the model predicts for it.
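
The picture is not reproduced in this text, so the formulas below are written out under the assumption that it shows the standard form of Bayes' rule and of the naive Bayes decision rule. The symbols follow the description above: c is a category, x_i is a word, and X = (x_1, ..., x_n) is the set of words in a news item.

```latex
% Bayes' rule for a single word x_i:
% posterior = likelihood * class probability / evidence
P(c \mid x_i) = \frac{P(x_i \mid c)\, P(c)}{P(x_i)}

% Naive Bayes for all words of a news item; the evidence is the same
% for every category, so it can be dropped when comparing categories
P(c \mid X) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)

% Equivalent decision rule in log space
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) \right]
```

The last line is why the evidence never has to be computed: it does not depend on c and therefore does not change which category has the highest score.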

## Implementation
Here we separate the training set from the cross-validation set using NumPy's built-in random function. The training set is roughly 8/10 of each category's DataFrame.<br>
### Probability dictionaries for each category:
First we count the occurrences of each word in each category, and then we divide the count of each word by the number of items in that category. We then take the logarithm of the result, so that probabilities can later be added instead of multiplied and very small values are not lost to underflow.
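
The counting and log step can be sketched as follows. This is an illustration, not the notebook's code: the function name `log_probabilities` and the toy documents are hypothetical, and the last line only demonstrates why working in log space is useful.

```python
import math
from collections import Counter

def log_probabilities(documents):
    """Map each word to log(count of the word / number of documents)."""
    counts = Counter(word for tokens in documents for word in tokens)
    return {word: math.log(n / len(documents)) for word, n in counts.items()}

# Hypothetical tokenized documents from one category.
travel_docs = [["beach", "hotel"], ["hotel", "flight"], ["beach", "hotel"], ["island"]]
log_probs = log_probabilities(travel_docs)
print(log_probs["hotel"], log_probs["island"])

# Why logs: a product of many small probabilities underflows to 0.0,
# while the corresponding sum of logs stays representable.
print(0.001 ** 200, 200 * math.log(0.001))
```

Words that never occur in a category get no entry at all in this scheme, which is why the scoring code has to handle missing keys.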

In [7]:
def build_log_probs(frame):
    # Random split: roughly 80% training, 20% cross-validation.
    msk = np.random.rand(len(frame)) < 0.8
    train = frame[msk]
    val = frame[~msk]
    # Count every token in the training descriptions and headlines.
    counts = {}
    for column in ('short_description', 'headline'):
        for tokens in train[column]:
            for word in tokens:
                counts[word] = counts.get(word, 0) + 1
    # As before, counts are divided by the size of the whole category frame.
    log_probs = {word: math.log(n / len(frame)) for word, n in counts.items()}
    return train, val, set(counts), log_probs

tTrain, tTest, tWords, tDict = build_log_probs(tDataFrame)
sTrain, sTest, sWords, sDict = build_log_probs(sDataFrame)
bTrain, bTest, bWords, bDict = build_log_probs(bDataFrame)

## Phase 1
Once the dictionaries are built, we go through the words of each item in the cross-validation sets one by one and add their log-probabilities to the score variables. This tells us which label the model considers more likely.<br>
We then assign the category whose score is higher.
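
For comparison, the textbook form of this scoring step is sketched below: every category's score is its log prior plus the sum of the log-likelihoods of the words it knows. This is close to the commented-out variant in the cells that follow, not to the active per-word rule used there. The function `predict` and the toy model are hypothetical, and skipping unseen words (instead of smoothing) is an assumption of this sketch.

```python
import math

def predict(tokens, log_priors, log_likelihoods):
    """Return the category with the highest log prior plus summed word log-likelihoods."""
    scores = {}
    for category, prior in log_priors.items():
        word_scores = log_likelihoods[category]
        # Assumption: words unseen in a category are skipped rather than smoothed.
        scores[category] = prior + sum(word_scores[w] for w in tokens if w in word_scores)
    return max(scores, key=scores.get)

# Hypothetical toy model with two categories.
log_priors = {"TRAVEL": math.log(0.5), "BUSINESS": math.log(0.5)}
log_likelihoods = {
    "TRAVEL": {"hotel": math.log(0.6), "market": math.log(0.1)},
    "BUSINESS": {"hotel": math.log(0.1), "market": math.log(0.7)},
}
print(predict(["hotel", "hotel", "market"], log_priors, log_likelihoods))  # TRAVEL
print(predict(["market"], log_priors, log_likelihoods))                    # BUSINESS
```

Because skipped words contribute nothing, a category that knows fewer of the words in an item is not penalized for it, which is one reason smoothing is commonly used instead.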

### Note:
The commented-out parts, and the reasons for commenting them out, are described in question 4 at the end of this report.

In [8]:
t1Pred = 0
b1Pred = 0
for i in tTest.index:
    tScore = 0
    bScore = 0
    for j in tTest['short_description'][i]:
        try:
            temp = tDict[j] + bDict[j]
            if tDict[j] < bDict[j]:
                tScore += tDict[j]
                continue
            else:
                bScore += bDict[j]
                continue
        except:
            pass
#         try:
#             tScore += tDict[j]
#         except:
#             pass
#         try:
#             bScore += bDict[j]
#         except:
#             pass
    for j in tTest['headline'][i]:
        try:
            temp = tDict[j] + bDict[j]
            if tDict[j] < bDict[j]:
                tScore += tDict[j]
                continue
            else:
                bScore += bDict[j]
                continue
        except:
            pass
#         try:
#             tScore += tDict[j]
#         except:
#             pass
#         try:
#             bScore += bDict[j]
#         except:
#             pass
    tScore += probT
    bScore += probB
    if tScore > bScore:
        tTest['Model_Predict'][i] = 'TRAVEL'
        t1Pred += 1
    else:
        tTest['Model_Predict'][i] = 'BUSINESS'
        b1Pred += 1
t1 = t1Pred / (t1Pred + b1Pred)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  exec(code_obj, self.user_global_ns, self.user_ns)


In [9]:
t2Pred = 0
b2Pred = 0
for i in bTest.index:
    tScore = 0
    bScore = 0
    for j in bTest['short_description'][i]:
        try:
            temp = tDict[j] + bDict[j]
            if tDict[j] < bDict[j]:
                tScore += tDict[j]
                continue
            else:
                bScore += bDict[j]
                continue
        except:
            pass
#         try:
#             tScore += tDict[j]
#         except:
#             pass
#         try:
#             bScore += bDict[j]
#         except:
#             pass
    for j in bTest['headline'][i]:
        try:
            temp = tDict[j] + bDict[j]
            if tDict[j] < bDict[j]:
                tScore += tDict[j]
                continue
            else:
                bScore += bDict[j]
                continue
        except:
            pass
#         try:
#             tScore += tDict[j]
#         except:
#             pass
#         try:
#             bScore += bDict[j]
#         except:
#             pass
    tScore += probT
    bScore += probB
    if tScore > bScore:
        bTest['Model_Predict'][i] = 'TRAVEL'
        t2Pred += 1
    else:
        bTest['Model_Predict'][i] = 'BUSINESS'
        b2Pred += 1
b1 = b2Pred / (b2Pred + t2Pred)

## Phase 1 Recall - Precision - Accuracy

In [10]:
travel_recall = t1Pred / (t1Pred + b1Pred)
business_recall = b2Pred / (b2Pred + t2Pred)
travel_precision = t1Pred / (t1Pred + t2Pred)
business_precision = b2Pred / (b1Pred + b2Pred)

# Correct predictions (travel as travel, business as business) over all predictions.
accuracy = (t1Pred + b2Pred) / (t1Pred + b1Pred + t2Pred + b2Pred)
print('Recall:')
print('Business:', business_recall, 'Travel:', travel_recall)
print('Precision:')
print('Business:', business_precision, 'Travel:', travel_precision)
print('Phase 1 accuracy:')
print(accuracy)

Recall:
Business: 0.9021323127392018 Travel: 0.8744565217391305
Precision:
Business: 0.8771929824561403 Travel: 0.8998881431767338
Phase 1 accuracy:
0.8882529299536659


<img src="Phase 1.png">

## Phase 2
We go through the words one by one and add their log-probabilities to the score variables. This tells us which label the model considers most likely.<br>
We then assign the category whose score is higher than the others.

### Note:
The commented-out parts, and the reasons for commenting them out, are described in question 4 at the end of this report.

#### Travel

In [11]:
t3Pred = 0
b3Pred = 0
s3Pred = 0
for i in tTest.index:
    tScore = 0
    bScore = 0
    sScore = 0
    for j in tTest['short_description'][i]:
        try:
            temp = tDict[j] + bDict[j] + sDict[j]
            if (tDict[j] < bDict[j])and(tDict[j] < sDict[j]):
                tScore += tDict[j]
                if (bDict[j] < sDict[j]):
                    bScore += bDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif (tDict[j] > bDict[j])and(sDict[j] > bDict[j]):
                bScore += bDict[j]
                if (tDict[j] < sDict[j]):
                    tScore += tDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif (tDict[j] > sDict[j])and(bDict[j] > sDict[j]):
                sScore += sDict[j]
                if (tDict[j] < bDict[j]):
                    tScore += tDict[j]
                else:
                    bScore += bDict[j]
                continue
        except:
            pass
#         try:
#             if (bDict[j] < sDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (tDict[j] < sDict[j]):
#                 tScore += tDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (bDict[j] < tDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 tScore += tDict[j]
#                 continue
#         except:
#             pass
#         try:
#             tScore += tDict[j]
#         except:
#             pass
#         try:
#             bScore += bDict[j]
#         except:
#             pass
#         try:
#             sScore += sDict[j]
#         except:
#             pass
    for j in tTest['headline'][i]:
        try:
            temp = tDict[j] + bDict[j] + sDict[j]
            if (tDict[j] < bDict[j])and(tDict[j] < sDict[j]):
                tScore += tDict[j]
                if (bDict[j] < sDict[j]):
                    bScore += bDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif (tDict[j] > bDict[j])and(sDict[j] > bDict[j]):
                bScore += bDict[j]
                if (tDict[j] < sDict[j]):
                    tScore += tDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif (tDict[j] > sDict[j])and(bDict[j] > sDict[j]):
                if (tDict[j] < bDict[j]):
                    tScore += tDict[j]
                else:
                    bScore += bDict[j]
                sScore += sDict[j]
                continue
        except:
            pass
#         try:
#             if (bDict[j] < sDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (tDict[j] < sDict[j]):
#                 tScore += tDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (bDict[j] < tDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 tScore += tDict[j]
#                 continue
#         except:
#             pass
#         try:
#             tScore += tDict[j]
#         except:
#             pass
#         try:
#             bScore += bDict[j]
#         except:
#             pass
#         try:
#             sScore += sDict[j]
#         except:
#             pass
    tScore += probT
    bScore += probB
    sScore += probS
    if (tScore > bScore)and(tScore > sScore):
        tTest['Model_Predict'][i] = 'TRAVEL'
        t3Pred += 1
    elif (tScore < bScore)and(bScore > sScore):
        tTest['Model_Predict'][i] = 'BUSINESS'
        b3Pred += 1
    else:
        tTest['Model_Predict'][i] = 'STYLE & BEAUTY'
        s3Pred += 1
t2 = t3Pred / (b3Pred + t3Pred + s3Pred)

#### Business 

In [12]:
t4Pred = 0
b4Pred = 0
s4Pred = 0
for i in bTest.index:
    tScore = 0
    bScore = 0
    sScore = 0
    for j in bTest['short_description'][i]:
        try:
            temp = tDict[j] + bDict[j] + sDict[j]
            if (tDict[j] < bDict[j])and(tDict[j] < sDict[j]):
                tScore += tDict[j]
                if (bDict[j] < sDict[j]):
                    bScore += bDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif (tDict[j] > bDict[j])and(sDict[j] > bDict[j]):
                bScore += bDict[j]
                if (tDict[j] < sDict[j]):
                    tScore += tDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif (tDict[j] > sDict[j])and(bDict[j] > sDict[j]):
                sScore += sDict[j]
                if (tDict[j] < bDict[j]):
                    tScore += tDict[j]
                else:
                    bScore += bDict[j]
                continue
        except:
            pass
#         try:
#             if (bDict[j] < sDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (tDict[j] < sDict[j]):
#                 tScore += tDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (bDict[j] < tDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 tScore += tDict[j]
#                 continue
#         except:
#             pass
#         try:
#             tScore += tDict[j]
#         except:
#             pass
#         try:
#             bScore += bDict[j]
#         except:
#             pass
#         try:
#             sScore += sDict[j]
#         except:
#             pass
    for j in bTest['headline'][i]:
        try:
            temp = tDict[j] + bDict[j] + sDict[j]
            if (tDict[j] < bDict[j])and(tDict[j] < sDict[j]):
                tScore += tDict[j]
                if (bDict[j] < sDict[j]):
                    bScore += bDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif (tDict[j] > bDict[j])and(sDict[j] > bDict[j]):
                bScore += bDict[j]
                if (tDict[j] < sDict[j]):
                    tScore += tDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif (tDict[j] > sDict[j])and(bDict[j] > sDict[j]):
                if (tDict[j] < bDict[j]):
                    tScore += tDict[j]
                else:
                    bScore += bDict[j]
                sScore += sDict[j]
                continue
        except:
            pass
#         try:
#             if (bDict[j] < sDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (tDict[j] < sDict[j]):
#                 tScore += tDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (bDict[j] < tDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 tScore += tDict[j]
#                 continue
#         except:
#             pass
#         try:
#             tScore += tDict[j]
#         except:
#             pass
#         try:
#             bScore += bDict[j]
#         except:
#             pass
#         try:
#             sScore += sDict[j]
#         except:
#             pass
    tScore += probT
    bScore += probB
    sScore += probS
    if (tScore > bScore)and(tScore > sScore):
        bTest['Model_Predict'][i] = 'TRAVEL'
        t4Pred += 1
    elif (tScore < bScore)and(bScore > sScore):
        bTest['Model_Predict'][i] = 'BUSINESS'
        b4Pred += 1
    else:
        bTest['Model_Predict'][i] = 'STYLE & BEAUTY'
        s4Pred += 1
b2 = b4Pred / (b4Pred + t4Pred + s4Pred)

#### Style and Beauty

In [13]:
t5Pred = 0
b5Pred = 0
s5Pred = 0
for i in sTest.index:
    tScore = 0
    bScore = 0
    sScore = 0
    for j in sTest['short_description'][i]:
        try:
            temp = tDict[j] + bDict[j] + sDict[j]
            if (tDict[j] < bDict[j])and(tDict[j] < sDict[j]):
                tScore += tDict[j]
                if (bDict[j] < sDict[j]):
                    bScore += bDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif (tDict[j] > bDict[j])and(sDict[j] > bDict[j]):
                bScore += bDict[j]
                if (tDict[j] < sDict[j]):
                    tScore += tDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif (tDict[j] > sDict[j])and(bDict[j] > sDict[j]):
                sScore += sDict[j]
                if (tDict[j] < bDict[j]):
                    tScore += tDict[j]
                else:
                    bScore += bDict[j]
                continue
        except:
            pass
#         try:
#             if (bDict[j] < sDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (tDict[j] < sDict[j]):
#                 tScore += tDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (bDict[j] < tDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 tScore += tDict[j]
#                 continue
#         except:
#             pass
#         try:
#             tScore += tDict[j]
#         except:
#             pass
#         try:
#             bScore += bDict[j]
#         except:
#             pass
#         try:
#             sScore += sDict[j]
#         except:
#             pass
    for j in sTest['headline'][i]:
        try:
            temp = tDict[j] + bDict[j] + sDict[j]
            if (tDict[j] < bDict[j])and(tDict[j] < sDict[j]):
                tScore += tDict[j]
                if (bDict[j] < sDict[j]):
                    bScore += bDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif (tDict[j] > bDict[j])and(sDict[j] > bDict[j]):
                bScore += bDict[j]
                if (tDict[j] < sDict[j]):
                    tScore += tDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif (tDict[j] > sDict[j])and(bDict[j] > sDict[j]):
                if (tDict[j] < bDict[j]):
                    tScore += tDict[j]
                else:
                    bScore += bDict[j]
                sScore += sDict[j]
                continue
        except:
            pass
#         try:
#             if (bDict[j] < sDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (tDict[j] < sDict[j]):
#                 tScore += tDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (bDict[j] < tDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 tScore += tDict[j]
#                 continue
#         except:
#             pass
#         try:
#             tScore += tDict[j]
#         except:
#             pass
#         try:
#             bScore += bDict[j]
#         except:
#             pass
#         try:
#             sScore += sDict[j]
#         except:
#             pass
    tScore += probT
    bScore += probB
    sScore += probS
    # Write with .loc to avoid chained assignment on the DataFrame.
    if tScore > bScore and tScore > sScore:
        sTest.loc[i, 'Model_Predict'] = 'TRAVEL'
        t5Pred += 1
    elif tScore < bScore and bScore > sScore:
        sTest.loc[i, 'Model_Predict'] = 'BUSINESS'
        b5Pred += 1
    else:
        sTest.loc[i, 'Model_Predict'] = 'STYLE & BEAUTY'
        s5Pred += 1
s2 = s5Pred / (b5Pred + t5Pred + s5Pred)

## Phase 2 Recall - Precision - Accuracy


In [14]:
travel_recall = t3Pred / (t3Pred + b3Pred + s3Pred)
business_recall = b4Pred / (b4Pred + t4Pred + s4Pred)
style_recall = s5Pred / (b5Pred + t5Pred + s5Pred)
travel_precision = t3Pred / (t3Pred + t4Pred + t5Pred)
business_precision = b4Pred / (b3Pred + b4Pred + b5Pred)
style_precision = s5Pred / (s3Pred + s4Pred + s5Pred)

accuracy = (t3Pred + b4Pred + s5Pred) / (t3Pred + t4Pred + t5Pred + b3Pred + b4Pred + b5Pred + s3Pred + s4Pred + s5Pred)
print('Recall:')
print('Business:', business_recall, 'Travel:', travel_recall, 'Style&Beauty', style_recall)
print('Precision:')
print('Business:', business_precision, 'Travel:', travel_precision , 'Style&Beauty', style_precision)
print('Phase 2 accuracy:')
print(accuracy)

Recall:
Business: 0.9229086932750137 Travel: 0.8217391304347826 Style&Beauty 0.77088948787062
Precision:
Business: 0.7704244637151986 Travel: 0.8395335924486397 Style&Beauty 0.933420365535248
Phase 2 accuracy:
0.8381607530774801


<img src="Phase 2.png">

## Confusion matrix
A confusion matrix summarizes the prediction results of a classification problem. The numbers of correct and incorrect predictions are given as counts and broken down by class, which is the key to the confusion matrix: it shows the ways in which a classification model is confused when it makes predictions. It gives insight not only into how many errors a classifier makes but, more importantly, into the types of errors it makes.

In [22]:
print('Confusion Matrix:')
print('Travel:', t3Pred, b3Pred, s3Pred)
print('Business:', t4Pred, b4Pred, s4Pred)
print('Style & Beauty:', t5Pred, b5Pred, s5Pred)
print(' \n')
Confusion_Matrix = pd.read_csv('CM.csv')
print(Confusion_Matrix)

Confusion Matrix:
Travel: 1512 254 74
Business: 113 1688 28
Style & Beauty: 176 249 1430
 

  Actual \ Predicted  Travel  Business  Beauty
0             Travel    1512       254      74
1           Business     113      1688      28
2     Style & Beauty     176       249    1430


<img src="Confusion Matrix.png">
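As an illustration of how the matrix is read, the sketch below recomputes recall, precision, and accuracy directly from the counts printed above. It assumes rows are actual classes and columns are predicted classes, in the order Travel, Business, Style & Beauty; the helper names `recall`, `precision`, and `accuracy` are hypothetical and not part of the notebook.

```python
# Rows are actual classes, columns are predicted classes,
# in the order Travel, Business, Style & Beauty (counts from the output above).
labels = ['Travel', 'Business', 'Style & Beauty']
cm = [
    [1512, 254, 74],
    [113, 1688, 28],
    [176, 249, 1430],
]

def recall(cm, k):
    """Correct predictions for class k divided by all actual members of class k (row sum)."""
    return cm[k][k] / sum(cm[k])

def precision(cm, k):
    """Correct predictions for class k divided by everything predicted as class k (column sum)."""
    return cm[k][k] / sum(row[k] for row in cm)

def accuracy(cm):
    """Sum of the diagonal divided by the total number of predictions."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

for k, name in enumerate(labels):
    print(name, 'recall:', round(recall(cm, k), 4), 'precision:', round(precision(cm, k), 4))
print('accuracy:', round(accuracy(cm), 4))
```

Reading a row gives recall for that class, while reading a column gives precision, so both metrics come out of the same table.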

## Test set 

In [16]:
test['Model_Predict'] = ' '
for i in test.index:
    tScore = 0
    bScore = 0
    sScore = 0
    for j in test['short_description'][i]:
        try:
            # Raises KeyError unless the word occurs in all three dictionaries.
            temp = tDict[j] + bDict[j] + sDict[j]
            # Add the word's value to the scores of the two classes where it is
            # lowest; if the lowest value is tied, nothing is added.
            if tDict[j] < bDict[j] and tDict[j] < sDict[j]:
                tScore += tDict[j]
                if bDict[j] < sDict[j]:
                    bScore += bDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif tDict[j] > bDict[j] and sDict[j] > bDict[j]:
                bScore += bDict[j]
                if tDict[j] < sDict[j]:
                    tScore += tDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif tDict[j] > sDict[j] and bDict[j] > sDict[j]:
                sScore += sDict[j]
                if tDict[j] < bDict[j]:
                    tScore += tDict[j]
                else:
                    bScore += bDict[j]
                continue
        except KeyError:
            # Word missing from at least one dictionary: skip it.
            pass
#         try:
#             if (bDict[j] < sDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (tDict[j] < sDict[j]):
#                 tScore += tDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (bDict[j] < tDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 tScore += tDict[j]
#                 continue
#         except:
#             pass
#         try:
#             tScore += tDict[j]
#         except:
#             pass
#         try:
#             bScore += bDict[j]
#         except:
#             pass
#         try:
#             sScore += sDict[j]
#         except:
#             pass
    for j in test['headline'][i]:
        try:
            # Raises KeyError unless the word occurs in all three dictionaries.
            temp = tDict[j] + bDict[j] + sDict[j]
            # Add the word's value to the scores of the two classes where it is
            # lowest; if the lowest value is tied, nothing is added.
            if tDict[j] < bDict[j] and tDict[j] < sDict[j]:
                tScore += tDict[j]
                if bDict[j] < sDict[j]:
                    bScore += bDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif tDict[j] > bDict[j] and sDict[j] > bDict[j]:
                bScore += bDict[j]
                if tDict[j] < sDict[j]:
                    tScore += tDict[j]
                else:
                    sScore += sDict[j]
                continue
            elif tDict[j] > sDict[j] and bDict[j] > sDict[j]:
                if tDict[j] < bDict[j]:
                    tScore += tDict[j]
                else:
                    bScore += bDict[j]
                sScore += sDict[j]
                continue
        except KeyError:
            # Word missing from at least one dictionary: skip it.
            pass
#         try:
#             if (bDict[j] < sDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (tDict[j] < sDict[j]):
#                 tScore += tDict[j]
#                 continue
#             else:
#                 sScore += sDict[j]
#                 continue
#         except:
#             pass
#         try:
#             if (bDict[j] < tDict[j]):
#                 bScore += bDict[j]
#                 continue
#             else:
#                 tScore += tDict[j]
#                 continue
#         except:
#             pass
#         try:
#             tScore += tDict[j]
#         except:
#             pass
#         try:
#             bScore += bDict[j]
#         except:
#             pass
#         try:
#             sScore += sDict[j]
#         except:
#             pass
    tScore += probT
    bScore += probB
    sScore += probS
    # Write with .loc to avoid chained assignment on the DataFrame.
    if tScore > bScore and tScore > sScore:
        test.loc[i, 'Model_Predict'] = 'TRAVEL'
    elif tScore < bScore and bScore > sScore:
        test.loc[i, 'Model_Predict'] = 'BUSINESS'
    else:
        test.loc[i, 'Model_Predict'] = 'STYLE & BEAUTY'
test.to_csv('output.csv', header=True)

# Questions & Discussion
Let's address the questions first: <br>
1- This has been discussed at the very beginning of the "Main PreProcess" section. <br>

2- tf–idf (or TFIDF) is short for term frequency–inverse document frequency. It is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today.<br>
As a matter of fact, we used the described approach in our model. Previously, in the main preprocess code section, I tried making the words in each dictionary unique (by converting the list to a set and then back to a list again). Keeping the repeated words instead improved our model's accuracy by more than 10 percent.<br>
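The weighting described in this answer can be sketched in a few lines. This is a minimal illustration rather than the notebook's own code: it assumes the common definitions tf = (count of the word in the document) / (document length) and idf = log(N / number of documents containing the word), and the function name `tf_idf` and the sample documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document.

    tf  = count of the word in the document / number of words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical, already tokenized documents.
docs = [
    ['cheap', 'flight', 'to', 'paris'],
    ['stock', 'market', 'to', 'rise'],
    ['summer', 'dress', 'to', 'wear'],
]
weights = tf_idf(docs)
print(weights[0])
```

A word such as `to`, which occurs in every document, gets a weight of zero, while a word that occurs in only one document keeps a positive weight.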

3- Precision is the result of dividing True Positives by the sum of True Positives and False Positives. This kind of evaluation does not take False Negatives into account, which can cause serious problems when False Negatives are crucial for us. For example, if we want to know whether a building is safe and strong, precision can look good as long as the positive predictions are mostly correct, even if the model misses many of the buildings in bad condition, and those misses could get us into serious trouble.<br>
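A small worked example may make this concrete. The counts below are invented, and the sketch assumes that "unsafe" is treated as the positive class, so that a False Negative is an unsafe building the model declared safe.

```python
def precision(tp, fp):
    """True Positives divided by everything predicted positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """True Positives divided by everything that is actually positive."""
    return tp / (tp + fn)

# Hypothetical inspection results, with "unsafe" as the positive class.
tp = 10   # unsafe buildings correctly flagged as unsafe
fp = 0    # safe buildings wrongly flagged as unsafe
fn = 40   # unsafe buildings the model declared safe

print('precision:', precision(tp, fp))
print('recall:', recall(tp, fn))
```

Here precision is perfect because every flagged building really is unsafe, yet recall shows that 40 of the 50 unsafe buildings were missed, which is why precision should not be read on its own in such cases.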

4- If a word like "Tabriz" appears in just one dataframe, its probability in that dataframe is 1 divided by the size of the dataframe, which is most likely a low value. On the other hand, since the word "Tabriz" does not occur in the other datasets, this value is added to only one category's score and lowers the probability of that category, even though it is the correct one. This is not the behavior we want! That's why I commented out the parts in Phase 1, Phase 2, and the Test part, so that only the words occurring in all categories are compared.
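The effect can be reproduced with a toy example. This sketch assumes that the scores are sums of log probabilities and uses a plain sum over the words, which is a simplification of the notebook's scoring rule; the dictionaries, their values, and the function names are all hypothetical.

```python
import math

# Hypothetical per-class dictionaries of log probabilities (made-up values).
# 'tabriz' occurs once, and only in the travel data.
t_dict = {'hotel': math.log(0.02), 'tabriz': math.log(1 / 50000)}
b_dict = {'hotel': math.log(0.005)}
s_dict = {'hotel': math.log(0.004)}
class_dicts = {'TRAVEL': t_dict, 'BUSINESS': b_dict, 'STYLE & BEAUTY': s_dict}

def score_all_words(words, class_dicts):
    """Add a word's log probability to every class whose dictionary contains it."""
    return {name: sum(d[w] for w in words if w in d) for name, d in class_dicts.items()}

def score_common_words(words, class_dicts):
    """Only use words that occur in every class dictionary."""
    common = [w for w in words if all(w in d for d in class_dicts.values())]
    return {name: sum(d[w] for w in common) for name, d in class_dicts.items()}

def best(scores):
    """Return the class name with the highest score."""
    return max(scores, key=scores.get)

words = ['hotel', 'tabriz']
print(best(score_all_words(words, class_dicts)))     # rare word penalizes TRAVEL only
print(best(score_common_words(words, class_dicts)))  # rare word is ignored
```

With all words scored, the very negative log probability of the rare word is charged only to the travel class, so another class wins; restricting the score to words shared by all classes removes that penalty.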


Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. It is not a single algorithm but a family of algorithms that all share a common principle, i.e. every pair of features being classified is assumed to be independent of each other, given the class. Here we implemented these algorithms to show that this is a very simple and fairly accurate way of solving problems of medium difficulty.

Thanks!