# DTSC670: Foundations of Machine Learning Models
## Module 1
## Assignment 2: Johnny Likes Pie

#### Name: Betty Tai

This assignment is similar to the golf example, so begin by watching the video called "Should You Play Golf Today?" and completing the Jupyter Notebook called PlayGolfRegression_template.ipynb.

You will build a model that predicts whether or not Johnny likes a particular pie. You will do this by reading in the data, performing a one-hot encoding, and fitting the data to a linear regression model.  Then, you will make predictions with the model and compare each output against a threshold value of 0.5 to reach a classification decision.

The following data describes features of different types of pie, along with a positive or negative classification of each pie based on whether or not Johnny likes it.  A positive classification means Johnny likes that pie; a negative classification means Johnny does not like that pie.

<img src="JohnnyPies.png" width="600" />

Begin by placing the data file called `JohnnyPiesData.csv` and this Jupyter notebook in the same directory.

## Import Data


1. Use the [read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to read in data from a comma-separated values (csv) file to a Pandas DataFrame called `pie_df`.

2. Display the DataFrame (by typing `pie_df`)

In [12]:
import pandas as pd

fileName = "JohnnyPiesData.csv"
pie_df = pd.read_csv(fileName)
pie_df

|  | Example | Crust Shape | Crust Size | Crust Shade | Filling Size | Filling Shade | Class |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ex1 | Circle | Thick | Gray | Thick | Dark | pos |
| 1 | ex2 | Circle | Thick | White | Thick | Dark | pos |
| 2 | ex3 | Triangle | Thick | Dark | Thick | Gray | pos |
| 3 | ex4 | Circle | Thin | White | Thin | Dark | pos |
| 4 | ex5 | Square | Thick | Dark | Thin | White | pos |
| 5 | ex6 | Circle | Thick | White | Thin | Dark | pos |
| 6 | ex7 | Circle | Thick | Gray | Thick | White | neg |
| 7 | ex8 | Square | Thick | White | Thick | Gray | neg |
| 8 | ex9 | Triangle | Thin | Gray | Thin | Dark | neg |
| 9 | ex10 | Circle | Thick | Dark | Thick | White | neg |


## Prepare Data for Linear Regression

1. Drop the `Example` column from the `pie_df` DataFrame, because it offers no information.

2. Encode all categorical data into numeric data via the "One Hot Encoding" technique provided by the Pandas [get_dummies()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function.  Display this DataFrame.

3. Since we are performing ordinary least squares linear regression, we will drop one of the newly created Boolean-valued features (output from the `get_dummies()` function) for each categorical column, so that the encoded columns are not perfectly correlated with one another.

4. Store the final features in a DataFrame called `features`, and store the positive class labels in a DataFrame called `response`.  Display both of these DataFrames.

In [13]:
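pie_df_X = pie_df.drop(columns=['Example'])  # identifier column carries no information
# One-hot encode every categorical column, dropping the first level of each
X = pd.get_dummies(pie_df_X, columns=['Crust Shape', 'Crust Size', 'Crust Shade', 'Filling Size', 'Filling Shade', 'Class'], drop_first=True)
# Response: positive-class indicator; features: every other encoded column
response = X[['Class_pos']]
features = X.drop(columns=['Class_pos'])
# You may create more cells throughout as needed, 
# but your final submission must be neat and concise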

In [14]:
response

|  | Class_pos |
| --- | --- |
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |


In [15]:
features

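|  | Crust Shape_Square | Crust Shape_Triangle | Crust Size_Thin | Crust Shade_Gray | Crust Shade_White | Filling Size_Thin | Filling Shade_Gray | Filling Shade_White |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 5 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 7 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 8 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |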
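The effect of `drop_first=True` can be illustrated with a minimal sketch on a made-up one-column DataFrame. The name `toy` and its three rows are hypothetical and are not part of the assignment data. With all three dummy columns present, every row sums to 1, which duplicates the constant column that ordinary least squares uses for the intercept; dropping the first category removes that redundancy.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three levels
toy = pd.DataFrame({"Crust Shape": ["Circle", "Square", "Triangle"]})

# Full one-hot encoding: one indicator column per level
full = pd.get_dummies(toy, columns=["Crust Shape"])

# Reduced encoding: the first level (alphabetically, "Circle") is dropped
reduced = pd.get_dummies(toy, columns=["Crust Shape"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding, the indicators in each row always add up to 1
print((full.sum(axis=1) == 1).all())
```

In the reduced encoding, a row of all zeros stands for the dropped level, so no information is lost.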


## Perform Linear Regression Model Fitting

1. Import the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) class from the `sklearn.linear_model` module.

2. Instantiate an object of the `LinearRegression` class called `reg_model`.

3. Train the model by invoking the `fit()` method of the `reg_model` object and passing it `features` and `response`.

In [16]:
import numpy as np
from sklearn.linear_model import LinearRegression
reg_model = LinearRegression()
reg_model.fit(features, response)
reg_model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

## Examine Linear Regression Model Parameters

1. View the trained model parameters by using the `coef_` and `intercept_` attributes of the trained model.

In [17]:
reg_model.coef_

array([[-0.52586207, -0.83189655, -0.56465517, -0.63793103, -0.92672414,
         0.70258621,  0.12068966, -1.07327586]])

In [18]:
reg_model.intercept_

array([1.56034483])
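As a rough check on how these parameters are used, the sketch below recomputes the regression output for the first pie (ex1) by hand as the intercept plus the dot product of the coefficients with the encoded feature row. The coefficient and intercept values are copied from the outputs above; the variable names (`coef`, `intercept`, `x_ex1`) are illustrative only.

```python
import numpy as np

# Values copied from reg_model.coef_ and reg_model.intercept_ above
coef = np.array([-0.52586207, -0.83189655, -0.56465517, -0.63793103,
                 -0.92672414, 0.70258621, 0.12068966, -1.07327586])
intercept = 1.56034483

# Encoded features for ex1: only Crust Shade_Gray (fourth column) is 1
x_ex1 = np.array([0, 0, 0, 1, 0, 0, 0, 0])

# Linear model: prediction = intercept + sum(coef_i * x_i)
manual_prediction = float(intercept + coef @ x_ex1)
print(round(manual_prediction, 6))  # 0.922414
```

The result agrees with the value that `predict()` reports for the first training row.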

## Making Predictions Using the Linear Regression Model

1. Evaluate the model's performance on the training data set by invoking the `predict()` method and passing `features` to it. 


In [20]:
reg_model.predict(features)

array([[ 0.92241379],
       [ 0.63362069],
       [ 0.84913793],
       [ 0.77155172],
       [ 0.6637931 ],
       [ 1.3362069 ],
       [-0.15086207],
       [ 0.22844828],
       [ 0.22844828],
       [ 0.48706897],
       [ 0.10775862],
       [-0.07758621]])

Below are the results from the linear regression model:

The column "Class_pos" holds the actual classification of each pie (1 for positive, 0 for negative).  The column "Regression_Predictions" holds the raw outputs of the linear regression model.  The column "Predicted_Responses" holds the class labels obtained by applying the 0.5 threshold to those outputs: an output of 0.5 or greater becomes 1, and an output below 0.5 becomes 0.

In [21]:
# resp_comp = Response Comparison
resp_comp = response.copy()

# Raw regression outputs for every training row, flattened to a 1-D array
reg_outputs = reg_model.predict(features).ravel()

# Apply the 0.5 threshold: outputs of 0.5 or greater become class 1
predicted_resp = (reg_outputs >= 0.5).astype(int)

resp_comp = resp_comp.assign(Regression_Predictions=reg_outputs,
                             Predicted_Responses=predicted_resp)
resp_comp

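|  | Class_pos | Regression_Predictions | Predicted_Responses |
| --- | --- | --- | --- |
| 0 | 1 | 0.922414 | 1 |
| 1 | 1 | 0.633621 | 1 |
| 2 | 1 | 0.849138 | 1 |
| 3 | 1 | 0.771552 | 1 |
| 4 | 1 | 0.663793 | 1 |
| 5 | 1 | 1.336207 | 1 |
| 6 | 0 | -0.150862 | 0 |
| 7 | 0 | 0.228448 | 0 |
| 8 | 0 | 0.228448 | 0 |
| 9 | 0 | 0.487069 | 0 |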
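One detail worth noticing in the table is that the raw regression outputs are not probabilities: a linear model is free to produce values below 0 or above 1. The sketch below, which hard-codes the ten outputs shown above (the array name `reg_outputs_shown` is illustrative), counts how many fall outside that range and confirms that thresholding still yields valid 0/1 labels.

```python
import numpy as np

# Regression outputs copied from the Regression_Predictions column above
reg_outputs_shown = np.array([0.922414, 0.633621, 0.849138, 0.771552, 0.663793,
                              1.336207, -0.150862, 0.228448, 0.228448, 0.487069])

# Flag outputs that lie outside the interval [0, 1]
outside_unit_interval = (reg_outputs_shown < 0) | (reg_outputs_shown > 1)
print(int(outside_unit_interval.sum()))  # 2

# Thresholding at 0.5 maps every output to a valid class label
labels = (reg_outputs_shown >= 0.5).astype(int)
print(labels.tolist())  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The closest call is the last row, whose output of 0.487069 sits just under the threshold.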


## Calculate Model Accuracy

1. Use the [accuracy_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function to calculate the accuracy score of the model.

In [22]:
y_outputs = response.copy()
y_outputs = y_outputs.assign(predicted=predicted_resp)
y_outputs

|  | Class_pos | predicted |
| --- | --- | --- |
| 0 | 1 | 1 |
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 1 |
| 4 | 1 | 1 |
| 5 | 1 | 1 |
| 6 | 0 | 0 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
| 9 | 0 | 0 |


In [23]:
from sklearn.metrics import accuracy_score

print("accuracy score: ", accuracy_score(y_outputs['Class_pos'], y_outputs['predicted']))

accuracy score:  1.0
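For classification labels like these, the accuracy score is simply the fraction of rows whose predicted label equals the actual label. A minimal sketch of that calculation, with both label vectors hard-coded from the table above (the names `y_true` and `y_pred` are illustrative):

```python
import numpy as np

# Actual and predicted labels copied from the y_outputs table above
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

# Accuracy = number of matching labels / total number of labels
manual_accuracy = float(np.mean(y_true == y_pred))
print("accuracy score: ", manual_accuracy)
```

Keep in mind that this score is measured on the same ten pies used to fit the model, so it describes how well the model reproduces its training data rather than how it would perform on pies it has not seen.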
