# Lab 9 - Dummy variables and Scikit Learn

We will continue learning about linear regression by predicting health insurance prices, using the same dataset from Lab 7.

Data URL: [https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv](https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv)

In this data, each row represents an insurance policy and the 7 columns contain the following information about it:
- age: age of policy holder
- sex: sex of policy holder
- bmi: body mass index (bmi) of policy holder.  bmi is a (sometimes unreliable) measurement of body fat in adults
- children: number of children (dependents) on the policy
- smoker: whether the policy holder is a smoker
- region: region of the country the policy holder lives in
- charges: price for insurance policy

### Section 1: Loading the data and exploratory data analysis

Run the import statements to load the libraries.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import seaborn as sns
import numpy as np

%matplotlib inline

Load the CSV file into a dataframe and display it:

In Lab 7, we performed a lot of exploratory data analysis on this data set.  Let's recall how being a smoker affected the insurance policy charges.  Use Seaborn to make a scatter plot with age on the x axis and charges on the y axis, colored by whether the person is a smoker.

Based on this graph, how does being a smoker affect insurance policy charges?

The `smoker` variable is qualitative or categorical, so we could not include it in our linear model in Lab 7.  To see the effect of omitting it, use Seaborn to plot a scatter plot of age (x axis) vs. charges (y axis) with the regression line for this relationship.

Recall you can do this with the function `regplot()`.

How useful is this linear model?

### Section 2: Dummy variables

We have seen that whether someone smokes is important information for predicting their health insurance policy cost.  So we need a way to include qualitative or categorical data as independent variables in linear regression.  We do this by creating quantitative *dummy variables* from the categorical variables.

The code below creates a new DataFrame called `insurance_new` with the `smoker` column replaced by a dummy variable for it.

In [None]:
insurance_new = pd.get_dummies(insurance, columns = ["smoker"], drop_first = True)
insurance_new.head()

What is the name of the new dummy variable?  What values are stored in this column?  What does a 1 correspond to?  A 0?

In this case, the new dummy variable is called `smoker_yes`, which indicates that a `yes` in the original `smoker` column will be replaced by a 1 in the new `smoker_yes` column.  A `no` in the `smoker` column is replaced by a 0 in the `smoker_yes` column.  Thus, we have converted categorical data (with categories `yes` and `no`) into quantitative data (with values 0 or 1).
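To see the conversion in isolation, here is a minimal sketch on a small made-up DataFrame (the `toy` data and its values are hypothetical, not rows from the insurance file).  It passes `dtype=int` because recent pandas versions store dummy columns as True/False by default; with `dtype=int` the column holds 1 and 0 as described above.

```python
import pandas as pd

# Toy data (hypothetical, not the insurance file) with a yes/no column.
toy = pd.DataFrame({"age": [19, 45, 33, 60],
                    "smoker": ["yes", "no", "no", "yes"]})

# drop_first=True drops the first category in sorted order ("no"),
# leaving a single indicator column for "yes".
# dtype=int requests 0/1 values instead of True/False.
toy_dummies = pd.get_dummies(toy, columns=["smoker"], drop_first=True, dtype=int)
print(toy_dummies)
```

If your `smoker_yes` column shows True/False instead of 1/0, adding `dtype = int` to the `get_dummies` call should give the numeric version.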

Make a new linear regression model with independent variables `age` and `smoker_yes` and dependent variable `charges`. 

If you haven't already, display the summary of your linear model.

What is R-squared for this model?  How does it compare to the R-squared we computed in Lab 7 for independent variables `age`, `bmi`, `children`?  Which model do you think is better?

Plot the histogram of the residuals for your new model.

Are these residuals approximately normally distributed and centered at 0?  How do they compare to the residuals of the Lab 7 linear model (with independent variables `age`, `bmi`, `children`)?

Finally, make a scatterplot of the actual charges on the x axis and the residuals on the y axis.

What do you notice about this plot?  How does it compare to the one we made in Lab 7?

Based on both R-squared and the residual plots, which linear model do you think is better - the one with independent variables `age` and `smoker_yes` or the one with independent variables `age`, `bmi`, `children` from Lab 7?

### Section 3:  How dummy variables work 

What is the equation of the linear model found in Section 2?  Let $X_1$ be the variable `age` and let $X_2$ be the variable `smoker_yes`.

<details><summary>Answer:</summary>
$$Y = -2391.6264 + 274.8712X_1 + 23860X_2$$
</details>

Imagine we are trying to predict the insurance policy cost for someone who smokes.  Then the variable $X_2$ will be 1.

Plug this into the equation of the linear model:
$$Y = -2391.6264 + 274.8712X_1 + 23860(1)$$
$$Y = -2391.6264 + 274.8712X_1 + 23860$$
$$Y = (-2391.6264 + 23860) + 274.8712X_1$$

We now have a linear equation with only one variable ($X_1$).  Let's plot the corresponding line on our scatterplot.

In [None]:
# Create an array (set) of 100 evenly spaced ages between 18 and 65.
xx = np.linspace(18, 65, 100)
# Find the predicted insurance policy cost for each of the ages in our array, 
# assuming the person is a smoker.
yy_smoker = (-2391.6264 + 23860) + 274.8712*xx

# Plot the Seaborn scatter plot of age (x) vs. charges (y) colored by smoker variable.
sns.relplot(x = "age", y = "charges", hue = "smoker", data = insurance)

# Add the line showing how insurance charges change with age for a smoker
# in our linear model.  
plt.plot(xx, yy_smoker)

Where is this line situated, relative to the rest of the scatter plot?  Does this make sense?

Now imagine we want to predict the insurance policy cost for a non-smoker.  Then the variable $X_2$ will be 0.   Plug $X_2 = 0$ into our original linear equation:
$$Y = -2391.6264 + 274.8712X_1 + 23860(0)$$
$$Y = -2391.6264 + 274.8712X_1 + 0$$
$$Y = -2391.6264 + 274.8712X_1$$

We again have the equation for a line.  Let's plot it on our scatterplot as well.

In [None]:
# Create an array (set) of 100 evenly spaced ages between 18 and 65.
xx = np.linspace(18, 65, 100)
# Find the predicted insurance policy cost for each of the ages in our array, 
# assuming the person is a smoker.
yy_smoker = (-2391.6264 + 23860) + 274.8712*xx
# Find the predicted insurance policy cost for each of the ages in our array, 
# assuming the person is NOT a smoker.
yy_non_smoker = -2391.6264 + 274.8712*xx

# Plot the Seaborn scatter plot of age (x) vs. charges (y) colored by smoker variable.
sns.relplot(x = "age", y = "charges", hue = "smoker", data = insurance)

# Add the line showing how insurance charges change with age for a smoker
# in our linear model.  
plt.plot(xx, yy_smoker)

# Add the line (also in blue) showing how insurance charges change with age for a non-smoker
# in our linear model.  
plt.plot(xx, yy_non_smoker, color = "blue")

Where is the non-smoker line?  Does this make sense?  

In conclusion, the `smoker_yes` dummy variable means the linear regression uses the top line to predict the insurance cost from age when the person is a smoker, and it uses the bottom line to predict the insurance cost from age when the person is not a smoker.
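The two-line picture can also be checked numerically.  The sketch below plugs a few ages into the Section 3 equation for both values of the dummy variable; the function name `predict_charges` and the chosen ages are illustrative only, and the coefficients are copied from the answer above rather than refit.

```python
# Coefficients copied from the Section 3 equation.
INTERCEPT = -2391.6264
AGE_COEF = 274.8712
SMOKER_COEF = 23860

def predict_charges(age, smoker_yes):
    """Evaluate Y = intercept + age_coef * X1 + smoker_coef * X2."""
    return INTERCEPT + AGE_COEF * age + SMOKER_COEF * smoker_yes

for age in (20, 40, 60):
    smoker = predict_charges(age, 1)
    non_smoker = predict_charges(age, 0)
    # The gap between the two lines equals the smoker coefficient at every age.
    print(age, round(non_smoker, 2), round(smoker, 2), round(smoker - non_smoker, 2))
```

The last printed column is the same for every age, which is why the two lines are parallel: the dummy variable shifts the intercept but leaves the slope for age unchanged.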

### Section 4: Using all the variables

Let's add the remaining quantitative and qualitative variables as independent variables in our linear regression model.

First, change the other two qualitative variables into dummy variables.  Use the same code as we used to change `smoker`, but 

1) change the `columns` parameter to `columns = ["sex","region"]` and 

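2) change the DataFrame being converted to `insurance_new`, so that the `smoker_yes` column is kept (the answer in Section 5 assumes the result is saved as `insurance_new2`).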

What does a 1 represent in the new `sex_male` column?  What about a 0?

How many dummy variables were created from the `region` variable?  Why?

Make a linear regression model using all the columns except `charges` as independent variables.

Display the summary of your new linear model.

How much did R-squared improve?  Do you think this is significant?  Look at the p-values.  Should all variables be included in the linear model?

Now plot the residuals for this new model.

Does this look like a normal distribution?  How does it compare to the plot of the residuals in the previous model?

Let's also plot the residuals (y) against the actual charges (x):

What do you notice about this plot?

Which linear model would you use to predict insurance policy charges?

### Section 5:  Linear regression with the scikit-learn library

In this final section, we will learn to perform linear regression with the [scikit-learn library](https://scikit-learn.org/stable/).  The scikit-learn library is a widely used machine learning library.  *Machine learning* broadly refers to using computer models and algorithms to make predictions from data.  Linear regression can be considered machine learning, and we will be learning several other machine learning techniques in the following labs.  Being able to run linear regression with the scikit-learn library will make it easier to compare the performance of linear regression to other machine learning methods operating on the same data.

First import the functions we will use from the scikit-learn library (sklearn).

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

To use scikit-learn, we need to split our DataFrame into the independent variables (x) and the dependent variable (y).

First, create a new DataFrame `x` by dropping the `charges` column.  We can do this easily with the `.drop()` method, which has the pattern:
`new_df = df.drop("column_name_to_drop", axis = 1)`
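As a quick illustration of the pattern, here is a sketch on a made-up DataFrame (the `toy` data and its column names are hypothetical).  It shows that `.drop()` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Toy data (hypothetical) with two feature columns and one column to remove.
toy = pd.DataFrame({"height": [1.6, 1.8], "weight": [55, 80], "label": [0, 1]})

# axis=1 tells pandas that "label" names a column rather than a row.
features = toy.drop("label", axis=1)

print(features.columns.tolist())  # columns of the new DataFrame
print(toy.columns.tolist())       # the original still has all three columns
```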

Can you figure out the code to drop the `charges` column?

<details><summary>Answer:</summary>
<code>
x = insurance_new2.drop("charges", axis = 1)
</code>
</details>

Display your new DataFrame `x` to check it is correct.

Create a new variable `y` equal to the `charges` column in your DataFrame.

Now that we have created two new variables `x` and `y` with just the independent and dependent variable(s), respectively, we can run linear regression on them in scikit-learn. 

First, we create a linear regression model object:

In [None]:
linear_regressor = LinearRegression()

Next, we fit the linear regression model to the data:

In [None]:
linear_regressor.fit(x,y)

There is no summary method in scikit-learn, but we can find the predicted charges for each row in our data as follows.

In [None]:
y_pred = linear_regressor.predict(x)
y_pred

This function gives the same values as `lm3.fittedvalues` in the statsmodels library.
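Although there is no summary table, a fitted `LinearRegression` object does expose the intercept, the coefficients, and R-squared.  The sketch below shows this on made-up data constructed to follow an exact linear rule, so the recovered numbers are easy to check; the names `toy_x`, `toy_y`, and `toy_model` are illustrative only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): y is built exactly as 2 + 3*x1 + 5*x2.
toy_x = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0, 5.0],
                      "x2": [0, 1, 0, 1, 1]})
toy_y = 2 + 3 * toy_x["x1"] + 5 * toy_x["x2"]

toy_model = LinearRegression().fit(toy_x, toy_y)

# intercept_ and coef_ hold the fitted equation; coef_ follows the column order of toy_x.
print("intercept:", toy_model.intercept_)
print(dict(zip(toy_x.columns, toy_model.coef_)))

# score() returns R-squared for the data passed in.
print("R-squared:", toy_model.score(toy_x, toy_y))
```

The same attributes on `linear_regressor` should let you compare its coefficients with those in the statsmodels summary from Section 4.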

We will mainly compare different machine learning models using the *Mean Squared Error (MSE)*, which for linear regression is the sum of the squares of the residuals divided by the number of residuals.  Remember that the linear regression equation minimizes the sum of the squares of the residuals.

To compute the Mean Squared Error for all our data points:

In [None]:
mean_squared_error(y, y_pred)
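To connect the function call to the definition, the sketch below computes the MSE by hand on three made-up values and compares it with `mean_squared_error`.  The arrays `y_true` and `y_hat` are hypothetical and stand in for `y` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy actual and predicted values (hypothetical).
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Residuals are actual minus predicted: 1, 0, -2.
residuals = y_true - y_hat

# Sum of squared residuals divided by the number of residuals: (1 + 0 + 4) / 3.
mse_by_hand = np.sum(residuals ** 2) / len(residuals)
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_by_hand, mse_sklearn)
```

Because the residuals are squared, the MSE is in squared units of `charges`; taking its square root gives an error in the same units as the charges themselves.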

By itself, the Mean Squared Error does not tell us a lot.  But in the next lab we will compare it with mean squared errors for other prediction methods.