# Classification with NB

In this notebook we implement document classification with Naive Bayes (NB) using a bag-of-words representation.
We also show how to run a grid search to find the best NB hyperparameters.

Note that most of the code is implemented in the module find_best_hyperparameters.
Thus, we only need to import the Bag_of_words class and the pandas library.

We will perform the classification on two datasets: Dmoz-Sports.csv and Dmoz-Computers.csv.

In [1]:
from find_best_hyperparameters import Bag_of_words
import pandas as pd

We then instantiate the class Bag_of_words, which contains the code to find the best hyperparameters and the code to fit the best NB model.

In [2]:
BG = Bag_of_words()

## Dmoz-Sports.csv

First we need to load the dataset and split it into text and labels.

In [3]:
df = pd.read_csv('Dmoz-Sports.csv')
X = df['text'].values
labels = df['class'].values

Next we perform the grid search.
You can check the code of the grid_search method; the important part is that it first counts each word in every document with a CountVectorizer(), then splits X and labels into train and test sets (80% train and 20% test).

It then performs a grid search over the alpha parameter of NB. You can pass other parameters in the following format:
params = {'alpha': [0.1, 0.4, 0.7, 1, 1.3, 1.5]}
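The steps described above can be sketched roughly as follows. This is not the actual code of Bag_of_words.grid_search: the function name `grid_search_sketch`, the `seed` argument, the use of three folds (suggested by the three split columns in the results table) and the default scoring are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB


def grid_search_sketch(texts, labels, params=None, cv=3, seed=0):
    """Hypothetical stand-in for Bag_of_words.grid_search."""
    if params is None:
        params = {'alpha': [0.1, 0.4, 0.7, 1, 1.3, 1.5]}
    # Count each word in every document (bag of words).
    counts = CountVectorizer().fit_transform(texts)
    # Hold out 20% of the documents; only the 80% train split is searched.
    X_train, X_test, y_train, y_test = train_test_split(
        counts, labels, test_size=0.2, random_state=seed)
    # Cross-validated search over the smoothing parameter alpha
    # (default scoring of the estimator, which is an assumption here).
    grid = GridSearchCV(MultinomialNB(), params, cv=cv)
    grid.fit(X_train, y_train)
    return grid
```

Calling `grid_search_sketch(X, labels)` would return a fitted GridSearchCV object exposing `cv_results_` and `best_estimator_`, the two attributes used in the following cells.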

In [4]:
grid_sports = BG.grid_search(X, labels)

The results of the grid search are then stored in a pandas DataFrame.

In [5]:
df_sports = pd.DataFrame(grid_sports.cv_results_)
df_sports

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,0.05641,0.003497,0.009141,0.001624,0.1,{'alpha': 0.1},0.815556,0.820556,0.798333,0.811481,0.009519,3
1,0.056795,0.00232,0.009523,0.002122,0.4,{'alpha': 0.4},0.821667,0.823889,0.804722,0.816759,0.00856,1
2,0.064059,0.008115,0.008237,0.001673,0.7,{'alpha': 0.7},0.816944,0.824444,0.803889,0.815093,0.008493,2
3,0.058787,0.001997,0.010417,0.001095,1.0,{'alpha': 1},0.811944,0.818333,0.798611,0.80963,0.008216,4
4,0.053993,0.002129,0.010217,0.001147,1.3,{'alpha': 1.3},0.806667,0.814167,0.797778,0.806204,0.006699,5
5,0.05646,0.001979,0.007876,0.001909,1.5,{'alpha': 1.5},0.805556,0.809722,0.7975,0.804259,0.005073,6
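As a quick check on the table above, the best alpha can also be read off directly by looking for the highest mean_test_score. The snippet below is a small sketch that copies only two columns of the table by hand; the names `cv_table`, `best_row` and `best_alpha` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Mean cross-validation scores copied from the Dmoz-Sports table above.
cv_table = pd.DataFrame({
    'param_alpha': [0.1, 0.4, 0.7, 1.0, 1.3, 1.5],
    'mean_test_score': [0.811481, 0.816759, 0.815093,
                        0.80963, 0.806204, 0.804259],
})

# The row with the highest mean score is the one ranked first.
best_row = cv_table.loc[cv_table['mean_test_score'].idxmax()]
best_alpha = float(best_row['param_alpha'])
print(best_alpha)  # 0.4
```

On the real results the same selection could be applied to df_sports, which has the same two columns.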


We then check the best estimator found by the grid search.

In [6]:
grid_sports.best_estimator_

MultinomialNB(alpha=0.4)

Afterwards, we run the train_best method, which trains the best model on the whole training set and returns the results on the test set.

In [7]:
report, micro_f1 = BG.train_best(X, labels, grid_sports.best_estimator_)

The train_best method returns a report containing the precision, recall, accuracy and f1-score, together with the micro-averaged F1 score (micro_f1).
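For readers who do not want to open find_best_hyperparameters, here is a minimal sketch of what such a method might look like. It is an assumption, not the actual implementation: the name `train_best_sketch`, the `seed` argument, and the choice to redo the vectorization and the 80/20 split inside the function are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def train_best_sketch(texts, labels, estimator, seed=0):
    """Hypothetical stand-in for Bag_of_words.train_best."""
    # Same bag-of-words counts and 80/20 split as in the grid search.
    counts = CountVectorizer().fit_transform(texts)
    X_train, X_test, y_train, y_test = train_test_split(
        counts, labels, test_size=0.2, random_state=seed)
    # Refit the chosen model on the whole training split.
    estimator.fit(X_train, y_train)
    predictions = estimator.predict(X_test)
    # Per-class precision, recall and f1-score, plus overall accuracy.
    report = classification_report(y_test, predictions)
    # Micro-averaged F1 over all test documents.
    micro_f1 = f1_score(y_test, predictions, average='micro')
    return report, micro_f1
```

For single-label multiclass problems such as this one, the micro-averaged F1 equals the overall accuracy, so the two numbers should agree.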

We then print those values to evaluate the results.

In [8]:
print(report)
print(f'f1_micro = {micro_f1}')

              precision    recall  f1-score   support

    Baseball       0.80      0.59      0.68       114
  Basketball       0.84      0.90      0.87       102
     Bowling       0.92      0.96      0.94       101
     Cricket       0.66      0.94      0.78        90
     Cycling       0.88      0.85      0.87       107
  Equestrian       0.98      0.87      0.92        95
     Fencing       0.91      0.93      0.92       111
      Flying       0.81      0.81      0.81        97
    Football       0.75      0.68      0.72       107
        Golf       0.92      0.85      0.88        94
  Gymnastics       0.91      0.91      0.91        95
      Hockey       0.76      0.58      0.66       101
    Lacrosse       0.72      0.91      0.80       100
     Martial       0.92      0.95      0.94        88
 Motorsports       0.86      0.85      0.86       101
   Paintball       0.95      0.90      0.93        84
     Running       0.85      0.87      0.86       108
     Skating       0.90    

## Dmoz-Computers.csv

For this dataset we are going to repeat the entire process that we followed for the previous one.

In [9]:
df = pd.read_csv('Dmoz-Computers.csv')
X = df['text'].values
labels = df['class'].values

In [10]:
grid_computers = BG.grid_search(X, labels)


In [11]:
df_computers = pd.DataFrame(grid_computers.cv_results_)
df_computers

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,0.0312,0.001084,0.00857,0.00067,0.1,{'alpha': 0.1},0.679163,0.667588,0.674694,0.673815,0.004766,1
1,0.043484,0.002989,0.008343,0.001376,0.4,{'alpha': 0.4},0.674428,0.66285,0.671141,0.669473,0.004871,2
2,0.045094,0.004437,0.010048,0.002065,0.7,{'alpha': 0.7},0.662194,0.645085,0.650217,0.652499,0.007169,3
3,0.050058,0.005084,0.007237,0.002577,1.0,{'alpha': 1},0.644436,0.62653,0.632057,0.634341,0.007486,4
4,0.043411,0.00184,0.004826,0.000781,1.3,{'alpha': 1.3},0.632991,0.611923,0.614291,0.619735,0.009423,5
5,0.03329,0.001464,0.003009,0.000335,1.5,{'alpha': 1.5},0.623125,0.602448,0.604816,0.61013,0.00924,6


In [12]:
grid_computers.best_estimator_

MultinomialNB(alpha=0.1)

In [13]:
report, micro_f1 = BG.train_best(X, labels, grid_computers.best_estimator_)

In [14]:
print(report)
print(f'f1_micro = {micro_f1}')

              precision    recall  f1-score   support

  Artificial       0.76      0.84      0.80       104
         CAD       0.84      0.83      0.83       100
   Companies       0.55      0.61      0.58       106
    Computer       0.78      0.81      0.79        95
 Consultants       0.54      0.80      0.64       100
        Data       0.74      0.70      0.72       202
   Education       0.86      0.91      0.88        96
    Graphics       0.83      0.92      0.87       106
    Hardware       0.68      0.61      0.64       104
    Internet       0.79      0.59      0.68       108
      Mobile       0.79      0.68      0.73       103
  Multimedia       0.70      0.75      0.73       101
        Open       0.61      0.58      0.59        95
 Programming       0.53      0.53      0.53       104
    Robotics       0.86      0.85      0.86       101
    Security       0.62      0.60      0.61        85
    Software       0.28      0.22      0.24        97
     Systems       0.56    

### Save the grid search results

In [15]:
df_computers.to_csv('grid_search_NB_Dmoz-Computer.csv')
df_sports.to_csv('grid_search_NB_Dmoz-Sports.csv')
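Because to_csv writes the DataFrame index as an unnamed first column by default, reading these files back with a plain read_csv typically yields an extra `Unnamed: 0` column like the one visible in the result tables earlier. The sketch below illustrates this on a small made-up DataFrame written to a temporary file; the file name and the values are placeholders, not results from this notebook.

```python
import os
import tempfile

import pandas as pd

# Small placeholder table standing in for the grid search results.
results = pd.DataFrame({'param_alpha': [0.1, 0.4],
                        'mean_test_score': [0.81, 0.82]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'grid_search_example.csv')
    # The index is written as a first column with an empty header.
    results.to_csv(path)
    # A plain read turns that column into 'Unnamed: 0'.
    naive = pd.read_csv(path)
    # Passing index_col=0 restores it as the index instead.
    restored = pd.read_csv(path, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Alternatively, passing `index=False` to to_csv avoids writing the index in the first place.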