## 1 Information Summary

* **Input/Output**: Each row describes a link through a set of keyword attributes: 'faculty', 'staff', 'people', 'professor', 'bio', 'index', 'id', and 'profile'. A binary label ('outcome') indicates whether the link is a faculty link or not.

* **Final Goal**: We want to build a classifier that predicts whether a link is a faculty link or not. To do this, we randomly split the whole dataset into a training dataset and an evaluation dataset. We train the classifier on the training dataset using a Gaussian Naive Bayes model. Then we predict the outcomes of the evaluation dataset and measure the accuracy of our binary classifier.

## 2 Loading Data

In [23]:
%matplotlib inline
import pandas as pd
import numpy as np

In [24]:
df = pd.read_csv('./bio_Binary.csv')
df.head()

Unnamed: 0,faculty,staff,people,professor,bio,index,id,profile,outcome
0,1,0,1,0,0,1,0,0,1
1,1,0,1,0,0,1,0,0,1
2,1,0,1,0,0,1,0,0,1
3,1,0,1,0,0,1,0,0,1
4,1,0,1,0,0,1,0,0,1


## 3 Splitting The Data

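First, we will randomly assign each row to either the training set (about 80% of the rows) or the evaluation set (the remaining 20%), so that the split does not depend on the row order in the original CSV file.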

* The training and evaluation dataframes will be named `train_df` and `eval_df`, respectively.

* We will also create the 2-D NumPy array `train_features`, whose number of rows is the number of training samples and whose number of columns is 8 (i.e., the number of features). We will define `eval_features` in a similar fashion.

* We will also create the 1-D NumPy arrays `train_labels` and `eval_labels`, which contain the training and evaluation labels, respectively.

In [25]:
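# Let's generate the split ourselves, using a seeded generator for reproducibility.
np_random = np.random.RandomState(seed=12345)
# Draw one uniform number per row; rows below the 80th percentile go to training.
rand_unifs = np_random.uniform(0, 1, size=df.shape[0])
division_thresh = np.percentile(rand_unifs, 80)
train_indicator = rand_unifs < division_thresh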
eval_indicator = rand_unifs >= division_thresh
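
As a quick sanity check of the percentile-based split, the sketch below repeats the same steps on a toy row count and confirms that the two boolean masks are disjoint, cover every row, and give an 80/20 split. The value of `n_rows` is an assumption chosen for illustration and is not the size of `bio_Binary.csv`.

```python
import numpy as np

# Toy stand-in for df.shape[0]; this is an illustrative assumption.
n_rows = 1000
rng = np.random.RandomState(seed=12345)
rand_unifs = rng.uniform(0, 1, size=n_rows)

# The 80th percentile of the draws is the cut-off between the two subsets.
division_thresh = np.percentile(rand_unifs, 80)
train_indicator = rand_unifs < division_thresh
eval_indicator = rand_unifs >= division_thresh

# Every row lands in exactly one of the two subsets.
disjoint = not np.any(train_indicator & eval_indicator)
complete = bool(np.all(train_indicator | eval_indicator))
n_train, n_eval = int(train_indicator.sum()), int(eval_indicator.sum())
print(disjoint, complete, n_train, n_eval)
```

Because the threshold is computed from the draws themselves, the split fraction is fixed by the percentile rather than varying from run to run, which would happen if each row were compared against a constant such as 0.8.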

In [26]:
train_df = df[train_indicator].reset_index(drop=True)
train_features = train_df.loc[:, train_df.columns != 'outcome'].values
train_labels = train_df['outcome'].values
train_df.head()

Unnamed: 0,faculty,staff,people,professor,bio,index,id,profile,outcome
0,1,0,1,0,0,1,0,0,1
1,1,0,1,0,0,1,0,0,1
2,1,0,1,0,0,1,0,0,1
3,1,0,1,0,0,1,0,0,1
4,0,0,0,0,0,0,0,0,1


In [27]:
eval_df = df[eval_indicator].reset_index(drop=True)
eval_features = eval_df.loc[:, eval_df.columns != 'outcome'].values
eval_labels = eval_df['outcome'].values
eval_df.head()

Unnamed: 0,faculty,staff,people,professor,bio,index,id,profile,outcome
0,1,0,1,0,0,1,0,0,1
1,1,0,1,0,0,1,0,0,1
2,0,0,0,0,0,0,0,0,1
3,1,0,1,0,0,1,0,0,1
4,0,0,0,0,0,0,0,0,1


In [28]:
train_features.shape, train_labels.shape, eval_features.shape, eval_labels.shape

((14444, 8), (14444,), (3611, 8), (3611,))

## 4 Train Classifier

We first train the classifier `gnb`, scikit-learn's built-in Gaussian Naive Bayes model, on `train_features` and `train_labels`. We then predict labels for both the training features and the evaluation features. The resulting accuracy is about 73% on both the training dataset and the evaluation dataset.
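
To make explicit what a Gaussian Naive Bayes model computes, here is a minimal NumPy sketch. It is not scikit-learn's implementation: the function names `fit_gaussian_nb` and `predict_gaussian_nb`, the toy data, and the small variance-smoothing constant are assumptions made for illustration. Each class gets a prior plus a per-feature mean and variance, and prediction picks the class with the highest log-posterior under the assumption that features are independent given the class.

```python
import numpy as np

def fit_gaussian_nb(features, labels, var_smoothing=1e-9):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(labels)
    priors = np.array([(labels == c).mean() for c in classes])
    means = np.array([features[labels == c].mean(axis=0) for c in classes])
    variances = np.array([features[labels == c].var(axis=0) for c in classes])
    # Assumption: add a small constant so that features which are constant
    # within a class do not produce a zero variance.
    variances = variances + var_smoothing * features.var(axis=0).max()
    return classes, priors, means, variances

def predict_gaussian_nb(model, features):
    """Return the class with the highest log-posterior for each sample."""
    classes, priors, means, variances = model
    # diff has shape (n_samples, n_classes, n_features).
    diff = features[:, None, :] - means[None, :, :]
    # Sum of per-feature Gaussian log-densities for every sample and class.
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_post = np.log(priors)[None, :] + log_lik
    return classes[np.argmax(log_post, axis=1)]

# Hypothetical toy data with two binary features; not taken from the dataset.
toy_features = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]], dtype=float)
toy_labels = np.array([1, 1, 1, 0, 0, 0])
toy_model = fit_gaussian_nb(toy_features, toy_labels)
toy_pred = predict_gaussian_nb(toy_model, toy_features)
print((toy_pred == toy_labels).mean())
```
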

In [22]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(train_features, train_labels)
train_pred_sk = gnb.predict(train_features)
eval_pred_sk = gnb.predict(eval_features)
print(f'The training data accuracy of your trained model is {(train_pred_sk == train_labels).mean()}')
print(f'The evaluation data accuracy of your trained model is {(eval_pred_sk == eval_labels).mean()}')

The training data accuracy of your trained model is 0.7343033256880734
The evaluation data accuracy of your trained model is 0.7294353683003726
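
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent training label. The helper name `majority_baseline_accuracy` and the toy labels are hypothetical, and no claim is made here about the class balance of the actual dataset.

```python
import numpy as np

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    values, counts = np.unique(train_labels, return_counts=True)
    majority = values[np.argmax(counts)]
    return float((eval_labels == majority).mean())

# Hypothetical labels for illustration only; not taken from bio_Binary.csv.
toy_train = np.array([1, 1, 1, 0])
toy_eval = np.array([1, 0, 1, 1, 0])
baseline = majority_baseline_accuracy(toy_train, toy_eval)
print(f'Majority-class baseline accuracy: {baseline}')
```

In the notebook, the same helper could be called with `train_labels` and `eval_labels` and the result compared against the evaluation accuracy reported above.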
