# Implement Naive Bayes to Make Predictions Based on Weather

You want to go to a golf tournament, but would like to know whether the weather will hold up. In this exercise, we will use Naive Bayes to predict whether the event will take place under given weather conditions.


# Import Libraries

Before you get started, you need to import a few libraries. You can do this by executing the following code. Remember, run code in a cell by selecting the cell, holding the shift key, and pressing enter/return.


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

We will also import scikit-learn's `CategoricalNB` class, the `train_test_split` function, the `accuracy_score` and `confusion_matrix` metrics, and a `LabelEncoder` that will help us transform our raw data into features that can be used by our Naive Bayes model.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix

### Step 1: Load and Inspect the Data Set

A small data set was created for this exercise. Load the data from our data set and store it in a dataframe called `dfgolf`.
The data set contains four features: the weather outlook, the temperature, the level of humidity, and the strength of the wind. The label, `play golf`, takes one of two values: 'yes' or 'no'. This makes it a binary classification problem.

Execute the cell below to inspect 10 data points from the data set.

In [3]:
dfgolf = pd.read_csv('playGolf.csv')
dfgolf.sample(n=10, replace=False, random_state=1)

Unnamed: 0,outlook,temperature,humidity,wind,play golf
17,rainy,WARM,VERY HIGH,STRONG,no
21,sunny,HOT,LOW,WEAK,yes
10,sunny,WARM,AVERAGE,STRONG,no
19,rainy,WARM,AVERAGE,WEAK,yes
14,sunny,WARM,AVERAGE,WEAK,yes
20,sunny,WARM,HIGH,MILD,yes
26,overcast,HOT,HIGH,MILD,yes
3,rainy,WARM,AVERAGE,MILD,yes
24,overcast,HOT,LOW,MILD,yes
22,sunny,COLD,LOW,STRONG,no


### Step 2: Format the Data

Notice that all of the features in the data set are in plain text, which is not a format our machine learning algorithm can work with directly, so we have to represent these plain text features as numerical values.

There are different techniques you can use to transform text data, and these techniques have corresponding modules available in scikit-learn.

Let's analyze the data. Notice that each feature has a categorical value: the value belongs to a category. For example, the `outlook` can be rainy, sunny or overcast. There are also limited possible values for each feature:

`outlook`: rainy, sunny, overcast

`temperature`: HOT, COLD, WARM, MILD

`wind`: STRONG, MILD, WEAK

`humidity`: AVERAGE, HIGH, LOW, VERY HIGH


Considering the data, we can use scikit-learn's `LabelEncoder` to transform the feature values into numerical representations. If a feature has n possible values, then `LabelEncoder` assigns each one an integer from 0 to n-1. It is worth noting that `LabelEncoder` should not always be used to transform features, since some models assume the numerical values carry an order. For example, a model may treat 4 as "greater than" 1 even though the categories themselves have no such ranking. The data and the model being used have to be taken into consideration when choosing an encoder. In our case, `LabelEncoder` works well because `CategoricalNB` treats each integer as a distinct category.

The code cell below uses `LabelEncoder` to transform both our feature values and the label into numerical representations. Execute the code cell below and inspect the results of the transformation.

Notice that all of the features now have numerical values. Also notice that the two labels are now 0 and 1: "no" (will not play golf) is now equal to 0, and "yes" (will play golf) is now equal to 1.

In [4]:
outlookE = LabelEncoder()
dfgolf['outlook'] = outlookE.fit_transform(dfgolf['outlook'])
tempE = LabelEncoder()
dfgolf['temperature'] = tempE.fit_transform(dfgolf['temperature'])
humE = LabelEncoder()
dfgolf['humidity'] = humE.fit_transform(dfgolf['humidity'])
windE = LabelEncoder()
dfgolf['wind'] = windE.fit_transform(dfgolf['wind'])
le = LabelEncoder()
dfgolf['play golf'] = le.fit_transform(dfgolf['play golf'])
dfgolf.sample(n=10, replace=False, random_state=1)

Unnamed: 0,outlook,temperature,humidity,wind,play golf
17,1,3,3,1,0
21,2,1,2,2,1
10,2,3,0,1,0
19,1,3,0,2,1
14,2,3,0,2,1
20,2,3,1,0,1
26,0,1,1,0,1
3,1,3,0,0,1
24,0,1,2,0,1
22,2,0,2,1,0
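
To see which integer each category received, we can inspect a fitted encoder. The sketch below uses a hypothetical list of humidity values (not the actual data set) and relies on `LabelEncoder` assigning codes in sorted order of the unique values.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of humidity values, mirroring the categories listed earlier
humidity_values = ['AVERAGE', 'HIGH', 'LOW', 'VERY HIGH', 'HIGH', 'LOW']

encoder = LabelEncoder()
encoded = encoder.fit_transform(humidity_values)

# classes_ holds the sorted unique values; each value's position is its code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform recovers the original strings from the integer codes
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())
```

The same inspection should work on the encoders created above, for example `humE.classes_`.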


### Step 2: Create labeled examples from our data set for the training phase

Execute the code cell below and inspect the results. You will see that we have 30 labeled examples. Each example contains four features and one label.


In [5]:
x = dfgolf[['outlook', 'temperature', 'humidity','wind']]
y = dfgolf['play golf'].to_frame()
print(x.shape)
print(y.shape)

(30, 4)
(30, 1)



### Step 3: Create Training & Test Data Sets

Now that we have specified examples, we will need to split them into a training set and a test set.

We will refer to the training feature vectors as `x_train` with labels `y_train`. 

Our testing vectors are `x_test` with labels `y_test`. 


In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=4)
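
As a quick sanity check on the split sizes, the sketch below applies the same `test_size` and `random_state` to stand-in arrays with 30 rows. The arrays `x_demo` and `y_demo` are placeholders, not the golf data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the golf data (30 examples, 4 features)
x_demo = np.arange(120).reshape(30, 4)
y_demo = np.arange(30) % 2

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.3, random_state=4)

# With 30 examples and test_size=0.3, 9 go to the test set and 21 to the training set
print(x_tr.shape, x_te.shape)

# Fixing random_state makes the split reproducible across runs
x_tr2, x_te2, _, _ = train_test_split(x_demo, y_demo, test_size=0.3, random_state=4)
print(np.array_equal(x_te, x_te2))
```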

### Step 4: Fit a Categorical Naive Bayes Model with the Training Set

Scikit-learn offers several Naive Bayes classifiers, each suited to a different kind of input data. Since our features all have categorical values, we can use scikit-learn's ```CategoricalNB()``` class.
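
To make the idea concrete, here is a minimal sketch that computes the Naive Bayes posterior by hand from category counts and compares it with `CategoricalNB`. The toy arrays `X` and `y` and the helper `manual_posterior` are hypothetical and exist only for illustration. The sketch assumes Laplace smoothing with `alpha=1.0` (the `CategoricalNB` default) and that the number of categories per feature is inferred from the training data.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical toy data: two integer-encoded categorical features, binary label
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 1], [2, 0]])
y = np.array([0, 0, 1, 1, 1, 0])
alpha = 1.0  # Laplace smoothing

def manual_posterior(x_new):
    """Posterior P(class | x_new) computed directly from counts."""
    scores = []
    for c in np.unique(y):
        X_c = X[y == c]
        score = len(X_c) / len(X)  # class prior
        for j, value in enumerate(x_new):
            n_categories = X[:, j].max() + 1
            count = np.sum(X_c[:, j] == value)
            # smoothed likelihood P(feature j = value | class c)
            score *= (count + alpha) / (len(X_c) + alpha * n_categories)
        scores.append(score)
    scores = np.array(scores)
    return scores / scores.sum()  # normalize so the posteriors sum to 1

x_new = np.array([2, 1])
model = CategoricalNB(alpha=alpha).fit(X, y)
manual = manual_posterior(x_new)
sk = model.predict_proba(x_new.reshape(1, -1))[0]
print(manual)
print(sk)
```

The "naive" part is the product inside the loop: each feature contributes its own likelihood term, as if the features were independent given the class.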

The code cell below:

1. Creates a ```CategoricalNB()``` object and stores it in the variable ```nb_model```.

2. Calls the ```.fit()``` method on ```nb_model``` to fit the model to the training data. The first argument is ```x_train``` and the second is ```y_train```. Note that CategoricalNB's `.fit()` method expects the second argument to be a 1D array. We use the `.values` attribute to convert our data frame to a NumPy array and then `.ravel()` to turn it into a 1D array. The additional `.flatten()` call returns a copy, so the array passed to `.fit()` does not share memory with `y_train`.

3. Uses the ```.predict()``` method on ```nb_model``` with the argument ```x_test``` to predict values for the testing data with the fitted model. The outcome is stored in the variable ```prediction```. We will compare these values to ```y_test``` later.

In [7]:
# Initialize the model
nb_model = CategoricalNB()

# Train the model using the training sets
nb_model.fit(x_train, y_train.values.ravel().flatten())

# Make predictions using the test set
prediction = nb_model.predict(x_test)
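
The shape conversion applied to the labels can be checked in isolation. The sketch below uses a hypothetical single-column data frame `y_frame`, shaped like `y_train`, to show what `.values`, `.ravel()`, and `.flatten()` each return.

```python
import numpy as np
import pandas as pd

# Hypothetical single-column label frame, shaped like y_train
y_frame = pd.DataFrame({'play golf': [1, 0, 1, 1]})

values_2d = y_frame.values        # 2D NumPy array, one column
raveled = values_2d.ravel()       # 1D; a view of the same memory when possible
flattened = values_2d.flatten()   # 1D; always a new copy

print(values_2d.shape, raveled.shape, flattened.shape)

# flatten() never shares memory with its source
print(np.shares_memory(values_2d, flattened))
```

Selecting the label column as a Series (for example `dfgolf['play golf']` without `.to_frame()`) would be another way to obtain 1D labels.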


### Step 5: Check the accuracy of your model

Execute the code cell below to see the accuracy score of your model and the confusion matrix.

In [8]:
# Compute and print model's accuracy score
score = accuracy_score(y_test, prediction)
print('Accuracy score of model: ' + str(score))

# For the purpose of producing a confusion matrix,
# convert numerical values back to strings using numpy's where method
prediction = np.where(prediction == 1, 'Yes', 'No')
y_test = np.where(y_test == 1, 'Yes','No')

# Display a confusion matrix
print('Confusion Matrix for the model: ')

# Rows (actual) and columns (predicted) follow the order given in `labels`:
# 'No' first, then 'Yes'
pd.DataFrame(
    confusion_matrix(y_test, prediction, labels=['No', 'Yes']),
    columns=['Predicted: Do Not Play Golf', 'Predicted: Play Golf'],
    index=['Actual: Do Not Play Golf', 'Actual: Play Golf']
)


Accuracy score of model: 0.7777777777777778
Confusion Matrix for the model: 


Unnamed: 0,Predicted: Do Not Play Golf,Predicted: Play Golf
Actual: Do Not Play Golf,4,1
Actual: Play Golf,1,3
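
The accuracy score and the confusion matrix are two views of the same results: accuracy is the sum of the diagonal divided by the total count. The sketch below checks this with the counts shown above, then uses a small hypothetical set of labels to illustrate how `confusion_matrix` orders its rows (actual classes) and columns (predicted classes) according to the `labels` argument.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Counts taken from the confusion matrix shown above
cm = np.array([[4, 1],
               [1, 3]])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
print(accuracy)

# Hypothetical labels to illustrate ordering: rows are actual classes,
# columns are predicted classes, both in the order given by `labels`
actual = ['No', 'No', 'Yes', 'Yes', 'Yes']
predicted = ['No', 'Yes', 'Yes', 'Yes', 'No']
demo = confusion_matrix(actual, predicted, labels=['No', 'Yes'])
print(demo)
```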


The code cell below contains code that asks a user to enter values for all four weather features and outputs a prediction. Execute the code cell below to see whether you will be attending the game.


In [None]:
outlook_list = ['sunny', 'rainy', 'overcast']
temperature_list = ['hot', 'warm', 'cold', 'mild']
humidity_list = ['very high', 'high', 'average', 'low']
wind_list = ['strong', 'mild', 'weak']

print('Please enter values for weather:')
outlook = input('Enter outlook: ')
while outlook.lower() not in outlook_list:
    outlook = input('Enter a correct value for outlook: ')
temperature = input('Enter temperature: ')
while temperature.lower() not in temperature_list:
    temperature = input('Enter a correct value for temperature: ')
humidity = input('Enter humidity: ')
while humidity.lower() not in humidity_list:
    humidity = input('Enter a correct value for humidity: ')
wind = input('Enter wind: ')
while wind.lower() not in wind_list:
    wind = input('Enter a correct value for wind: ')

# match the casing used in the data set:
# outlook is lowercase, the other features are uppercase
outlook = outlook.lower()
temperature = temperature.upper()
humidity = humidity.upper()
wind = wind.upper()

# create new data frame
df = pd.DataFrame(columns=['outlook', 'temperature', 'humidity', 'wind'], data=[[outlook, temperature, humidity, wind]])

# use the encoders fitted above to encode the data;
# .transform() reuses the mapping learned from the data set instead of refitting
df['outlook'] = outlookE.transform(df['outlook'])
df['temperature'] = tempE.transform(df['temperature'])
df['humidity'] = humE.transform(df['humidity'])
df['wind'] = windE.transform(df['wind'])

# create test data out of your entries
x_test = df[['outlook', 'temperature', 'humidity', 'wind']]

# make a prediction
prediction = nb_model.predict(x_test)

# output the prediction
if prediction[0] == 0:
    print('\n\nThey will NOT play golf!')
else:
    print('\n\nHAVE FUN!!')

Please enter values for weather:
Enter outlook: Summer
Enter a correct value for outlook: sunny
Enter temperature: temperature
Enter a correct value for temperature: wind
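
When encoding new input for prediction, the integer codes need to match the ones the model saw during training. The sketch below, using a hypothetical list of training values, contrasts reusing a fitted encoder with `.transform()` against refitting a fresh encoder on a single new value.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical training values for one feature
train_outlook = ['sunny', 'rainy', 'overcast', 'sunny']
encoder = LabelEncoder().fit(train_outlook)

new_value = ['sunny']

# Reusing the fitted encoder keeps the training mapping
reused = encoder.transform(new_value)

# Refitting on the single new value learns a one-entry mapping instead
refit = LabelEncoder().fit_transform(new_value)

print(reused.tolist())
print(refit.tolist())
```

Refitting on one value always yields code 0, regardless of which category was entered, so the model would receive the same encoded input every time.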
