# Lab 1
## Part 1 Naive Bayes
### Import packages and data

In [22]:
import os
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

PATH_ROOT = os.getcwd()
PATH_TRAIN = os.path.join(PATH_ROOT, 'train.csv')
PATH_TEST = os.path.join(PATH_ROOT, 'test.csv')

print("Use train data:", PATH_TRAIN)
print("Use test  data:", PATH_TEST)

train_data = pd.read_csv(PATH_TRAIN)
test_data = pd.read_csv(PATH_TEST)

Use train data: d:\Windows\Documents\Code\conda\machine-learning-lab\lab1\train.csv
Use test  data: d:\Windows\Documents\Code\conda\machine-learning-lab\lab1\test.csv


### Recognize categories

In [23]:
categories = train_data['category'].unique()
print(categories)

['Restaurants' 'Nightlife' 'Shopping']


The data set contains three categories: `Restaurants`, `Nightlife`, and `Shopping`.

Next, we encode this column as integer codes.

In [24]:
categories_type = CategoricalDtype(categories = categories)
train_data['category'] = train_data['category'].astype(categories_type).cat.codes.astype('long')
test_data['category'] = test_data['category'].astype(categories_type).cat.codes.astype('long')
print(train_data['category'].head(), "\n", test_data['category'].head(), sep="")

0    0
1    0
2    0
3    1
4    0
Name: category, dtype: int32
0    0
1    0
2    0
3    0
4    0
Name: category, dtype: int32
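
As a small illustration of the encoding step, the sketch below applies the same `CategoricalDtype` approach to toy labels. The label `'Hotels'` is a hypothetical value that does not occur in the lab data; it is included only to show what happens when a label is missing from the category list.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy labels; 'Hotels' is a hypothetical label that the dtype does not know.
labels = pd.Series(['Restaurants', 'Nightlife', 'Shopping', 'Hotels'])
dtype = CategoricalDtype(categories=['Restaurants', 'Nightlife', 'Shopping'])

# Codes follow the order of the category list; unknown labels become -1.
codes = labels.astype(dtype).cat.codes
print(codes.tolist())  # [0, 1, 2, -1]
```

Because the category list is taken from the training data, any test label absent from the training set would be encoded as `-1`, so it may be worth checking for negative codes before scoring.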


### Naive Bayes process
We can now start the Naive Bayes process.

First, we build the training and testing variables.


In [25]:
train_x = train_data['review']
train_y = train_data['category']

test_x = test_data['review']
test_y = test_data['category']

Next, we build a vector of word counts for each review using the built-in `CountVectorizer` class. The vectorizer is fitted on the training reviews only, and the same fitted vocabulary is then used to transform the test reviews.

In [26]:
vector = CountVectorizer()
train_x = vector.fit_transform(train_x).toarray()
test_x = vector.transform(test_x).toarray()
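
To make the word-count step concrete, here is a minimal sketch on toy sentences (the sentences are made up and are not part of the lab data). It shows that the vocabulary comes from the training text only, so words that appear only in the test text are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences standing in for the review column.
docs_train = ["good pizza good beer", "cheap shoes"]
docs_test = ["good shoes and socks"]

vec = CountVectorizer()
X_train = vec.fit_transform(docs_train)  # learns the vocabulary and counts words
X_test = vec.transform(docs_test)        # reuses the learned vocabulary

# Vocabulary terms ordered by their column index.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)              # ['beer', 'cheap', 'good', 'pizza', 'shoes']
print(X_train.toarray())  # [[1 0 2 1 0]
                          #  [0 1 0 0 1]]
print(X_test.toarray())   # [[0 0 1 0 1]]  ('and' and 'socks' are unseen)
```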

Other attributes, such as `mean_checkin_time`, also need to be considered. We prepend these columns to the training and testing feature matrices.

In [27]:
train_x = np.append(train_data[['latitude', 'longitude', 'mean_checkin_time']], train_x, axis=1)
test_x = np.append(test_data[['latitude', 'longitude', 'mean_checkin_time']], test_x, axis=1)

The training and testing sets are now ready.

Finally, we classify the reviews with Naive Bayes.

In [28]:
nb = GaussianNB()
nb.fit(train_x, train_y)

GaussianNB()
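
For reference, the decision rule behind a Gaussian Naive Bayes classifier can be written as follows. This is the standard textbook formulation rather than something stated in the lab; $x_j$ denotes the $j$-th feature, $c$ a category, $d$ the number of features, and $\mu_{cj}$, $\sigma_{cj}^{2}$ the per-class mean and variance estimated from the training data.

$$
\hat{y} = \arg\max_{c} \; P(c) \prod_{j=1}^{d} P(x_j \mid c),
\qquad
P(x_j \mid c) = \frac{1}{\sqrt{2\pi\sigma_{cj}^{2}}}
\exp\!\left(-\frac{(x_j - \mu_{cj})^{2}}{2\sigma_{cj}^{2}}\right)
$$

Word counts are non-negative integers and mostly zero, so the Gaussian assumption is probably only a rough fit for these features.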

We can now check what fraction of the test predictions are correct.

In [29]:
nb.score(test_x,test_y)

0.7532467532467533
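
The value returned by `score` is the mean accuracy, that is, the share of samples whose predicted label equals the true label. The sketch below checks this on toy one-dimensional data (all values are made up for illustration).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two well-separated toy classes on a single feature.
X = np.array([[0.0], [0.2], [0.1], [5.0], [5.2], [5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
model = GaussianNB().fit(X, y)

X_new = np.array([[0.1], [4.9], [5.3], [0.3]])
y_new = np.array([0, 1, 1, 1])  # the last label deliberately disagrees

# Accuracy computed by hand: fraction of predictions that match the labels.
manual = np.mean(model.predict(X_new) == y_new)
print(manual, model.score(X_new, y_new))  # 0.75 0.75
```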

### Summary
About $75.3\%$ of the test set is classified correctly. However, this is the simplest model; a more refined one may classify the data better.

## Part 2 Data preprocessing
Note that `CountVectorizer` by default simply splits text into single words. We can try adjusting how it splits the text.

We set the `ngram_range` parameter to `(1, 2)`, so that both single words and pairs of adjacent words are counted.

In [30]:
vector2 = CountVectorizer(ngram_range = (1, 2))
train_x2 = train_data['review']
train_y2 = train_data['category']
train_x2 = vector2.fit_transform(train_x2).toarray()
train_x2 = np.append(train_data[['latitude', 'longitude', 'mean_checkin_time']], train_x2, axis=1)
nb2 = GaussianNB()
nb2.fit(train_x2, train_y2)

test_x2 = test_data['review']
test_y2 = test_data['category']
test_x2 = vector2.transform(test_x2).toarray()
test_x2 = np.append(test_data[['latitude', 'longitude', 'mean_checkin_time']], test_x2, axis=1)
nb2.score(test_x2,test_y2)

0.797979797979798
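
The effect of `ngram_range` on the feature space can be seen on a toy corpus (the sentences below are made up). With `(1, 2)` the vocabulary contains every single word plus every pair of adjacent words, which is why the feature matrix grows so much.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences standing in for the review column.
docs = ["great happy hour", "happy hour deals", "great shoe deals"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
both = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(len(unigrams.vocabulary_))  # 5
print(len(both.vocabulary_))      # 10

# The extra features are the two-word phrases.
bigrams = sorted(term for term in both.vocabulary_ if " " in term)
print(bigrams)
# ['great happy', 'great shoe', 'happy hour', 'hour deals', 'shoe deals']
```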

### Summary
After a long processing period, we get an accuracy of $79.8\%$, which is an improvement. However, we should consider **whether to drop** this optimization, because it costs a lot of time and brings only a modest improvement.

Because of this cost, we comment out this code to avoid wasting time on an optional model.
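
One likely contributor to the cost is the `.toarray()` call, which turns the mostly-zero count matrix into a dense array. The sketch below is an assumption about how this could be avoided, not the approach used in the lab: it keeps the features sparse with `scipy.sparse.hstack`. `GaussianNB` does not accept sparse input, so the sketch uses `MultinomialNB` (introduced later in this lab), which does. All data values are made up.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-ins for the 'review', 'mean_checkin_time' and 'category' columns.
reviews = ["good pizza and beer", "cheap shoes and bags", "late night beer"]
checkin = np.array([[12.5], [15.0], [22.0]])
labels = np.array([0, 2, 1])

vec = CountVectorizer(ngram_range=(1, 2))
counts = vec.fit_transform(reviews)  # stays a scipy sparse matrix

# Prepend the numeric column without converting the counts to a dense array.
features = sparse.hstack([sparse.csr_matrix(checkin), counts]).tocsr()

clf = MultinomialNB().fit(features, labels)
print(sparse.issparse(features), features.shape)
```

Whether this actually shortens the run time on the lab data would need to be measured.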

## Part 3 Model improving
After `GaussianNB`, we can try the other Naive Bayes models.

### `BernoulliNB`

In [31]:
vector3 = CountVectorizer()
train_x3 = train_data['review']
train_y3 = train_data['category']
train_x3 = vector3.fit_transform(train_x3).toarray()
train_x3 = np.append(train_data[['latitude', 'longitude', 'mean_checkin_time']], train_x3, axis=1)
nb3 = BernoulliNB()
nb3.fit(train_x3, train_y3)

test_x3 = test_data['review']
test_y3 = test_data['category']
test_x3 = vector3.transform(test_x3).toarray()
test_x3 = np.append(test_data[['latitude', 'longitude', 'mean_checkin_time']], test_x3, axis=1)
nb3.score(test_x3,test_y3)

0.8253968253968254
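
It may help to keep in mind that `BernoulliNB` binarizes its input: with the default `binarize=0.0`, every value greater than zero becomes 1 and everything else becomes 0. The sketch below checks this on toy rows (the coordinates and counts are hypothetical), by comparing the default model with one trained on explicitly binarized data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical rows: latitude, longitude, one word count.
X = np.array([[36.1, -115.2, 3],
              [36.2, -115.1, 0],
              [33.4, -111.9, 1],
              [33.5, -112.0, 0]])
y = np.array([0, 0, 1, 1])

# Explicit binarization: 1 where the value is positive, 0 otherwise.
X_bin = (X > 0).astype(int)

default_model = BernoulliNB().fit(X, y)                 # binarize=0.0
explicit_model = BernoulliNB(binarize=None).fit(X_bin, y)

same = np.allclose(default_model.feature_log_prob_,
                   explicit_model.feature_log_prob_)
print(same)  # True
```

In this toy data the latitude column becomes all ones and the longitude column all zeros, so such columns would carry no information for the classifier; whether that holds for the lab data depends on the actual coordinate values.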

### `MultinomialNB`

Because `MultinomialNB` cannot accept negative values, we remove the `latitude` and `longitude` attributes from the feature matrix.

In [32]:
vector4 = CountVectorizer()
train_x4 = train_data['review']
train_y4 = train_data['category']
train_x4 = vector4.fit_transform(train_x4).toarray()
train_x4 = np.append(train_data[['mean_checkin_time']], train_x4, axis=1)
nb4 = MultinomialNB()
nb4.fit(train_x4, train_y4)

test_x4 = test_data['review']
test_y4 = test_data['category']
test_x4 = vector4.transform(test_x4).toarray()
test_x4 = np.append(test_data[['mean_checkin_time']], test_x4, axis=1)
nb4.score(test_x4,test_y4)

0.8701298701298701
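
The restriction on negative values mentioned above can be reproduced on toy data. In the sketch below (hypothetical coordinates and counts), fitting fails with a `ValueError` while a negative column is present and succeeds once it is dropped.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical rows: latitude, longitude (negative), one word count.
X = np.array([[36.1, -115.2, 2.0],
              [33.4, -111.9, 1.0]])
y = np.array([0, 1])

try:
    MultinomialNB().fit(X, y)
    raised = False
except ValueError as err:
    raised = True
    print("fit failed:", err)

# Without the negative column the model fits normally.
clf = MultinomialNB().fit(X[:, [0, 2]], y)
print(raised)  # True
```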

### The combination

What happens if we combine the best methods from Part 2 and Part 3, that is, the `(1, 2)` n-gram vectorizer with `MultinomialNB`?

In [33]:
vector5 = CountVectorizer(ngram_range = (1, 2))
train_x5 = train_data['review']
train_y5 = train_data['category']
train_x5 = vector5.fit_transform(train_x5).toarray()
train_x5 = np.append(train_data[['mean_checkin_time']], train_x5, axis=1)
nb5 = MultinomialNB()
nb5.fit(train_x5, train_y5)

test_x5 = test_data['review']
test_y5 = test_data['category']
test_x5 = vector5.transform(test_x5).toarray()
test_x5 = np.append(test_data[['mean_checkin_time']], test_x5, axis=1)
nb5.score(test_x5,test_y5)

0.8051948051948052

Unfortunately, combining the two methods gives a worse result than `MultinomialNB` with single words alone.

### Summary
Of the models tried, `MultinomialNB` performs best. Its accuracy of $87.0\%$ is clearly higher than that of the other models.

Some further experiments are possible. For example, we can make `latitude` and `longitude` positive so that `MultinomialNB` accepts them. However, that did not change the results, which suggests that `latitude` and `longitude` have little association with the category.
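
The lab does not say how the coordinates were made positive. One possible approach, shown here purely as an assumption, is to subtract the per-column minimum of the training coordinates and clip any test values that still fall below zero. The coordinate values are hypothetical.

```python
import numpy as np

# Hypothetical latitude/longitude pairs for training and test rows.
train_coords = np.array([[36.1, -115.2],
                         [33.4, -111.9],
                         [35.2, -80.8]])
test_coords = np.array([[33.0, -116.0],
                        [36.0, -90.0]])

# Shift by the training minimum so that training values start at zero.
offset = train_coords.min(axis=0)
train_shifted = train_coords - offset

# Test points below the training minimum would be negative, so clip them to 0.
test_shifted = np.clip(test_coords - offset, 0, None)

print(train_shifted.min(), test_shifted.min())
```

The shifted columns could then be prepended to the word counts in the same way as `mean_checkin_time`.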