# <font color="blue">Lesson 3 - Basic Machine Learning Models</font>

### NBC on Iris Data

For this lesson, we'll use [sklearn's Iris flower dataset](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html); this dataset contains measurements for three classes of Iris, with 50 observations per class.

Each observation records four measurements, plus a label for the class it belongs to:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class = which of the three Iris classes this flower belongs to

We can use the four measurements as features to train our model, and use the class attribute as the targets that serve as our ground truth.

## Load Dataset and Packages
We can load this dataset directly from sklearn: 

In [1]:
from sklearn import datasets
import numpy as np

# load iris dataset
iris = datasets.load_iris()

Now we can use built-in sklearn functionality to pull apart the features and the targets. On a non-sklearn dataset, we would separate the features and targets manually by specifying column names. 

In [2]:
# data gives us access to the features we can use to classify each sample
features = iris.data
targets = iris.target
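
To illustrate what separating features and targets by column name looks like, here is a minimal sketch that loads the same data as a pandas DataFrame and splits it manually. It assumes scikit-learn 0.23 or later (for `as_frame=True`) and that pandas is installed; the names `iris_df`, `features_df`, and `targets_series` are introduced only for this example.

```python
from sklearn import datasets

# as_frame=True returns the data as pandas objects (assumes scikit-learn >= 0.23)
iris_frame = datasets.load_iris(as_frame=True)
iris_df = iris_frame.frame  # four feature columns plus a "target" column

# separate the features and the targets by column name
features_df = iris_df.drop(columns=["target"])
targets_series = iris_df["target"]

print(features_df.columns.tolist())
print(targets_series.value_counts().to_dict())
```

The same pattern applies to any DataFrame: drop the target column to get the features, and select it to get the targets.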

We can see that this gives us arrays of measurements and targets to use for training our model.

In [3]:
features[0]

array([5.1, 3.5, 1.4, 0.2])

In [4]:
targets

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

## Splitting the dataset into training and test sets
To test how well our model learned from the dataset, we'll need to set aside a portion of our features and targets to test on; a common convention is to hold out 20% to 30% of the data (an 80/20 or 70/30 split).

We'll do this using sklearn's [train_test_split function](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). The arguments that matter here are:
- test_size = hold aside 30% of the features and targets for testing
- random_state = set the seed used by the random number generator so that you can get the same results each time you run this same example
- shuffle = whether or not to shuffle the data before splitting (this defaults to True, so we don't pass it explicitly)

In [None]:
# split the dataset 70/30; train_test_split shuffles the data by default
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, targets, 
                                                    test_size=0.30, 
                                                    random_state=42)

Let's check our feature training and test sets to see if the distribution of samples looks around 70/30:

In [None]:
X_train.shape

In [None]:
X_test.shape
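
As a quick check of what `test_size` and `random_state` do, the sketch below runs the same split twice with the same seed and compares the results. The variable names (`split_a`, `split_b`, and so on) are made up for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# run the same 70/30 split twice with the same seed
split_a = train_test_split(iris.data, iris.target, test_size=0.30, random_state=42)
split_b = train_test_split(iris.data, iris.target, test_size=0.30, random_state=42)

X_train_a, X_test_a = split_a[0], split_a[1]
X_train_b, X_test_b = split_b[0], split_b[1]

# 150 samples split 70/30 gives 105 training rows and 45 test rows
print(X_train_a.shape, X_test_a.shape)

# the same seed selects the same rows both times
print(np.array_equal(X_test_a, X_test_b))
```

Changing `random_state` to a different integer would generally put different rows in the test set, which is why fixing it makes results reproducible.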

## Training a Naive Bayes Model
Now that we've prepared our dataset by pulling out the targets and features, and separating it into training and testing sets, we can train our model with the training features and targets, X_train and y_train.

We'll use the [GaussianNB model from sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) to create our model. By default, the class priors are learned from the data as we train. If we already knew our priors, we could pass them in as a list using the "priors" option.

### Create the Classifier

In [None]:
# train gaussian naive bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb
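
For comparison, here is a minimal sketch of creating the classifier with priors supplied up front. The equal priors are an assumption made purely for illustration, and `gnb_fixed` is a name introduced for this example; the list needs one entry per class and should sum to 1.

```python
from sklearn.naive_bayes import GaussianNB

# assumption for illustration: all three Iris classes are equally likely
equal_priors = [1 / 3, 1 / 3, 1 / 3]

# the priors are stored on the estimator and used instead of being learned
gnb_fixed = GaussianNB(priors=equal_priors)
print(gnb_fixed.get_params()["priors"])
```

With `priors` set, the classifier does not adjust the priors to the class frequencies it sees during training.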

### Fit the Model
Now that we have set up our model, we can "fit" it to our data, a process called "training". We only train our model on the training portion of the data and targets; never use any of the data that you've held aside for testing.

In [None]:
# fit the model on your training set
gnb_model = gnb.fit(X_train, y_train)

Now that you've trained the Gaussian model, you can access the priors:

In [None]:
gnb_model.class_prior_
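
When no priors are supplied, the learned priors are simply the fraction of training samples in each class. The sketch below checks this by computing those fractions by hand; `class_frequencies` is a name introduced for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=42
)
gnb_model = GaussianNB().fit(X_train, y_train)

# fraction of training samples that belong to each class
class_frequencies = np.bincount(y_train) / len(y_train)

print(gnb_model.class_prior_)
print(class_frequencies)
```

Because the split is random rather than stratified, these fractions will typically not be exactly one third each, even though the full dataset is balanced.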

## Make Predictions
We've instantiated a model and trained it on our iris dataset. Now we can use this model to make predictions. 

Our model comes with a predict() method that we use on our testing features, X_test. 

In [None]:
# make a prediction from the trained model on your test set
predictions = gnb_model.predict(X_test)
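
Besides hard class labels, the model can also report how confident it is in each class. This sketch uses `predict_proba()` to get per-class probabilities and shows that the predicted label is the class with the highest probability; `probabilities` and `most_likely` are names introduced for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=42
)
gnb_model = GaussianNB().fit(X_train, y_train)
predictions = gnb_model.predict(X_test)

# one row per test sample, one column per class; each row sums to 1
probabilities = gnb_model.predict_proba(X_test)

# the class with the highest probability in each row
most_likely = gnb_model.classes_[probabilities.argmax(axis=1)]

print(probabilities[:3].round(3))
print(np.array_equal(most_likely, predictions))
```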

## Test Accuracy
We've used our model to make predictions, so now we finally test how accurate those predictions are. 

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions, normalize = True)

Not bad for our first try; our model was 97% accurate! 
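
To make clear what the accuracy score measures, the sketch below recomputes it by hand as the fraction of test predictions that match the true labels, and uses a confusion matrix to show where any mistakes fall. The names `manual_accuracy` and `cm` are introduced for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=42
)
predictions = GaussianNB().fit(X_train, y_train).predict(X_test)

# accuracy is the fraction of predictions that equal the true label
accuracy = accuracy_score(y_test, predictions)
manual_accuracy = np.mean(predictions == y_test)

# rows are true classes, columns are predicted classes;
# correct predictions sit on the diagonal
cm = confusion_matrix(y_test, predictions)

print(accuracy, manual_accuracy)
print(cm)
```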

## NBC on Adult Income Data
Now it's your turn to try a Naive Bayes classifier. For this portion of the lab, we'll be using the Adult Income Dataset that contains income data from about 32,000 adults. We're going to use these measurements to try to predict whether someone's income will be over or under $50k/year.

We'll walk you through importing and pre-processing the data, and then let you try training and testing your model. 

### Importing data and pre-processing

In [None]:
import pandas as pd
import io
import requests
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
s = requests.get(url).content
adult_df = pd.read_csv(io.StringIO(s.decode('utf-8')), header=None)
adult_df.columns = ["age", "workclass", "fnlwgt", "education", "education_num", "marital_status",
                    "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
                    "hours_per_week", "native_country", "income"]
adult_df.head()

The data contains categorical variables, which we'll identify and encode as numeric values that can be used in our model.

In [None]:
obj_cols = adult_df.select_dtypes(include=["object"]).columns
obj_cols

In the step above, we collected the names of the columns that hold our categorical variables (those with the "object" dtype).

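Now we can use the pandas get_dummies() function to one-hot-encode these categorical variables into numeric values. Setting drop_first=True drops the first category of each variable, since it can be inferred from the remaining ones.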

In [None]:
# encode your dataframe
data = pd.get_dummies(adult_df, columns=obj_cols, drop_first=True)

In [None]:
data.head()
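
To see what this encoding does on a small scale, here is a sketch on a made-up three-row DataFrame. The rows in `toy_df` are invented for illustration; the leading space in the string values mimics how values appear in the raw Adult data, which is why the encoded column names contain a space.

```python
import pandas as pd

# made-up rows for illustration; values carry a leading space like the raw file
toy_df = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": [" Male", " Female", " Male"],
    "income": [" <=50K", " >50K", " <=50K"],
})

# pick out the categorical (object dtype) columns by name
toy_obj_cols = toy_df.select_dtypes(include=["object"]).columns

# one indicator column per category, minus the first category of each variable
toy_encoded = pd.get_dummies(toy_df, columns=toy_obj_cols, drop_first=True)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Each two-category variable collapses into a single indicator column, which is how the income variable becomes one column that is true for incomes over $50k.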

### Pull out Features and Targets
Our data is now ready to be split into features and targets (income).

In [None]:
features = data.drop('income_ >50K', axis=1)
targets = data['income_ >50K']

### Split data into training and test sets
Just as we did above, we'll split our dataset into training and test sets. 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    targets, 
                                                    test_size=0.20,
                                                   random_state=42)

### Scale Dataset
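Because the measurements in this dataset are not on the same scale, we'll standardize each feature so that it has a mean of 0 and a variance of 1. We'll use sklearn's StandardScaler, fitting it on the training set only and then applying the same transformation to the test set.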

In [None]:
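import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
# fit the scaler on the training set only, then reuse it on the test set
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
# keep X_test as a DataFrame too, so both sets carry the same column names
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)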
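
The sketch below shows what the scaler does on a tiny made-up array with two features on very different scales. The values in `toy_train` and `toy_test` are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data: two features on very different scales
toy_train = np.array([[20.0, 1000.0],
                      [30.0, 2000.0],
                      [40.0, 3000.0],
                      [50.0, 4000.0]])
toy_test = np.array([[60.0, 1000.0]])

toy_scaler = StandardScaler()

# learn each column's mean and standard deviation from the training data
toy_train_scaled = toy_scaler.fit_transform(toy_train)

# reuse the training mean and standard deviation on the test data
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.mean(axis=0), toy_train_scaled.std(axis=0))
print(toy_test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The test row is shifted and scaled with the training statistics, so it will generally not be centered at 0, and that is expected.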

## Implement NBC Classifier
Now that we've imported and processed the Adult Income dataset, implement a Naive Bayes Classifier on this dataset, and assess the accuracy of your model, just like we did above.

## Consider this
How do you think we could improve the accuracy of this model? 