# STUDENT EDIT THIS CELL
Student Name: **REPLACE WITH YOUR NAME**

# Predictive Data Mining with Linear Regression, Scikit Learn and Python

## BUS280 - Spring 2022
Professor John Michl

## Didn't we do regression?

* Yes, in the context of Time Series Analysis
* Analyzed historical data to project into the future
* For non-time-series data, it's best to take a train-test approach

## California Housing Prices Dataset
* Built-in to scikit-learn
* We're familiar with the data structure of the Bunch
* Applicable to similar analysis from other data sources

In [None]:
# import dependencies, modules and magics

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
# create the Bunch object

california = fetch_california_housing()

In [None]:
# show which keys are in the Bunch object

california.keys()

## Let's explore the data set
* The previous cell demonstrates how to call a *method* of an object.
* `.keys()` is a method of this particular object; the parentheses tell Python to run it.
* We can view and act on the object's *attributes* by accessing their key names (no parentheses).
* View the description of the data set using the `DESCR` attribute.
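
To make the difference between a method and an attribute concrete, here is a minimal sketch using a hand-built `Bunch`. The name `toy` and its contents are invented for illustration; they are not the California data.

```python
from sklearn.utils import Bunch

# A hypothetical toy Bunch; the real one comes from fetch_california_housing()
toy = Bunch(data=[[1, 2], [3, 4]], DESCR='A tiny made-up data set')

# keys() is a method: the parentheses call it
key_list = list(toy.keys())
print(key_list)

# DESCR is an attribute: no parentheses, just the key name after the dot
print(toy.DESCR)

# A Bunch also behaves like a dictionary, so this returns the same value
print(toy['DESCR'])
```

The same dot notation should work on the `california` object for any key that `.keys()` lists.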

In [None]:
# Output the description using the DESCR attribute -- ugh!
# Remember, Python is case sensitive


In [None]:
# Use the print() function to show the description of the data set.
# In this case, print() applies the formatting (such as line breaks)
# so the description is easier to read.



## The `Bunch` object contains various data arrays
* The `data` array consists of all of the independent variables or features. 
* The `target` array consists of one variable...the dependent variable or thing we hope to predict.
* Therefore `feature_names` are the variable names for the `data` attribute. 
* If we viewed the data in a spreadsheet, these names would be the column headers.
* The `target` attribute is the data for the dependent variable. The `target_names` can take on one of two roles depending on the type of the target value. 

In [None]:
# show the feature names for the california object
california.feature_names

In [None]:
# show the target names for the california object

california.target_names

# STUDENT EDIT THIS CELL
Why does that `target_names` result look different than what we saw in the Iris data set?

**Double-click on this cell and type in your answer. Run the cell and then wait.**

## Arrays are a special data structure
* Arrays are collections of values.
* They can have different shapes.
* One dimensional - like a single column in a spreadsheet (rows only)
* Two dimensional - like multiple columns (rows and columns)
* Even multi-dimensional (though that gets complicated!)

Given there is a `shape` attribute of an array, how might you find the shape of the `data` and `target` arrays?
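
As a small illustration of how shapes are reported, the sketch below builds two made-up NumPy arrays (`one_d` and `two_d` are hypothetical names, not part of the housing data) and reads their `shape` attribute.

```python
import numpy as np

# One-dimensional: like a single spreadsheet column with five rows
one_d = np.array([10, 20, 30, 40, 50])

# Two-dimensional: three rows and two columns
two_d = np.array([[1, 2], [3, 4], [5, 6]])

# shape is an attribute (no parentheses) and is always a tuple
print(one_d.shape)   # (5,)
print(two_d.shape)   # (3, 2)
```

Note the trailing comma in `(5,)`: that is how Python writes a tuple with a single value.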

In [None]:
# Show the shape of the data array
# Show the shape of the target array


In [None]:
# How many lines of output did you get in the previous cell? One or two? 
# Hold tight for a Python tip!


## `pandas` and the `DataFrame` object
* We learned earlier that a `DataFrame` is a special type of data structure.
* Acts like a spreadsheet but can handle almost unlimited data...and has many superpowers.
* We imported `pandas` at the start so we don't need to import it again. 
* **ACTION:** comment out the import line with a # sign.
* The following lines will set some options such as default widths of output.

In [None]:
# Import pandas and adjust some defaults
import pandas as pd
# use the full option names; the short forms are ambiguous in newer pandas
pd.set_option('display.precision', 4)
pd.set_option('display.max_columns', 9)
pd.set_option('display.width', None)

In [None]:
# import the california data object into a new dataframe named df
# use the feature names as the column names (a.k.a. Series names)

df = pd.DataFrame(REPLACE-WITH-DATA-ARRAY, columns=california.feature_names)

In [None]:
# create a new Series (column) called MedHouseValue
# assign to it the target array from the Bunch

df['MedHouseValue'] = pd.Series(REPLACE-WITH-TARGET-ARRAY)


In [None]:
# what was that method of a dataframe that shows the top of the 
# dataframe? (There's also one attached to your neck.) Show the first 7 rows.


In [None]:
# show some descriptive stats for the dataframe
df.describe()

### Congratulations, you've successfully...
* Imported a `Bunch` object from sklearn.
* Viewed the attributes of the object including names, data and target
* Converted that object to a DataFrame
* Explored the descriptive statistics

## Visualizing the Features

* Helpful to plot the target value against each feature
* Identify patterns and correlations
* However, with over 20,000 observations, scatterplots may become a giant ink blot blocking out the patterns.
* Let's take a small sample of the 20,000 for visualization purposes.


### **pandas** `sample`
* `sample` is a pandas `DataFrame` method
* **frac** is the fraction of rows to select (.10 or 10%)
* **random_state** is the random number seed to be sure we can reproduce results
* If you and your neighbor both use the same seed, you'll end up with the same sample data set.
* We'll use the sample just for exploratory plotting. 
* We'll still use the full data set for training and testing.
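
The effect of the seed can be checked on a small made-up `DataFrame`. This is only a sketch: `toy_df` and its single `value` column are hypothetical stand-ins for the housing data.

```python
import pandas as pd

# A hypothetical 100-row DataFrame standing in for the housing data
toy_df = pd.DataFrame({'value': range(100)})

# Two samples drawn with the same fraction and the same seed
first = toy_df.sample(frac=0.10, random_state=17)
second = toy_df.sample(frac=0.10, random_state=17)

print(len(first))             # 10 rows, which is 10% of 100
print(first.equals(second))   # True: same seed, same rows
```

Changing `random_state` to a different number would give a different set of 10 rows.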

In [None]:
# Create a new DataFrame for the sample, with a seed of 17

sample_df = df.sample(frac=.10, random_state=17)

### Prepare for plotting
* import `matplotlib` and `seaborn`
* `matplotlib` is the underlying plotting engine
* `seaborn` is the pretty face on top of that engine that has many desirable defaults
* We'll start by setting some defaults

In [None]:
# Note: we've already imported matplotlib and seaborn at the start
# Comment them out, if you like, but keep the sns lines

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=2)
sns.set_style('whitegrid')

### Create the scatter plots
* Create several scatter plots showing the target variable (MedHouseValue) as y-axis against all other features.
* Use a `for` loop for each feature_name
* Substitute the feature name as the x-axis
* Create one plot and show it, then loop to the next

In [None]:
# Create a sample plot for just one feature
# Note -- data is coming from the sample_df
# Since we've designated a data frame as the source, we only need to 
# indicate the Series names for the x and y data

plt.figure(figsize=(16,9))
sns.scatterplot(data=sample_df, x='MedInc',
                  y='MedHouseValue', hue='MedHouseValue',
                  palette='cool', legend=False)




## Efficient chart creation
Since we want a chart that compares each feature to the target, we could loop through the feature names and substitute each name for the x data. 

When using Python (or most other languages), always think of ways to simplify the code when doing repetitive tasks.

In [None]:
# loop through the feature_names, take each feature and plot it
for feature in california.feature_names:
  plt.figure(figsize=(16,9))
  sns.scatterplot(data=sample_df, x=feature,
                  y='MedHouseValue', hue='MedHouseValue',
                  palette='cool', legend=False)

# STUDENT EDIT THIS CELL
Double-click to open this cell. Enter your observations of the charts. Then run the cell.
* Use bullets

This could be useful for the previous cells. To see how to reference an internet image in a Markdown cell, double-click to see the Markdown code.

![Map of california](https://images.mapsofworld.com/usa/states/california/california-lat-long-map.jpg "California Map")

## Splitting the Data for Training and Testing
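
Before splitting the real data, it may help to see what `train_test_split` does to row counts. The sketch below uses made-up arrays (`toy_X` and `toy_y` are hypothetical names) rather than the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features, and one target value per row
toy_X = np.arange(300).reshape(100, 3)
toy_y = np.arange(100)

# With no test_size given, scikit-learn holds out 25% of the rows for testing
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=11
)

print(toy_X_train.shape, toy_X_test.shape)   # (75, 3) (25, 3)
print(toy_y_train.shape, toy_y_test.shape)   # (75,) (25,)
```

The feature arrays stay two-dimensional while the target arrays stay one-dimensional, and the row counts in each pair match.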


In [None]:
# Create the train and test splits
# first, import the train_test_split function from sklearn
from sklearn.model_selection import train_test_split

# second, call the function and pass the data and target arrays from the `Bunch`
X_train, X_test, y_train, y_test = train_test_split(
    california.data, california.target,random_state = 11
)

In [None]:
# Show the shape of the X_train training set (upper case X)
X_train.shape

In [None]:
# Show the shape of the y_train training set (lower case y)


In [None]:
# Show the shape of the X_test test set (upper case X)


In [None]:
# Show the shape of the y_test test set (lower case y)


## Train the Model
* `LinearRegression` estimator uses *all* features by default
* Throws an error if any feature is categorical (non-numeric)
* Invoke the `fit` method using the **training** set only; the test set is held back for later
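
Here is a minimal sketch of fitting on training rows only. The data is synthetic and built from a rule chosen for this example (y = 2*x1 - 3*x2 + 5), so we know what the model should find; all `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data built from a known rule: y = 2*x1 - 3*x2 + 5
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 2))
toy_y = 2 * toy_X[:, 0] - 3 * toy_X[:, 1] + 5

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=11
)

# fit() sees only the training rows; the test rows are held back
toy_model = LinearRegression()
toy_model.fit(X=toy_X_train, y=toy_y_train)

print(toy_model.coef_)        # close to [2, -3]
print(toy_model.intercept_)   # close to 5
```

Because the made-up data has no noise, the fitted values land essentially on the rule. Real data such as the housing set will not be this tidy.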

In [None]:
# import the LinearRegression class from scikit-learn
from sklearn.linear_model import LinearRegression

In [None]:
# create an empty linear_regression object
linear_regression = LinearRegression()

# pass the training features and target and attempt to fit the model
linear_regression.fit(X=X_train, y=y_train) 

### Looping through a list
* Use `for` to loop through items in an iterable object such as a list
* Show the index number or position using `enumerate`
* Access specific positions using a number starting with 0 for the first value
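
A short sketch of these three ideas with a made-up list (`colors` is a hypothetical name used only here):

```python
# A hypothetical list of names
colors = ['red', 'green', 'blue']

# Positions start at 0, so colors[0] is the first item
print(colors[0])

# enumerate() hands back the position and the item together
for i, color in enumerate(colors):
    print(f'{i}: {color:>6}')
```

The `:>6` in the f-string right-aligns each name in a field six characters wide, which keeps the output lined up.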

In [None]:
# View the resulting coefficients. They are now an attribute of the 
# linear_regression object we created.

linear_regression.coef_

### Results are stored in the `linear_regression` object
* That array isn't very helpful. Or, is it?
* The values align with the features. So, the first value goes with the first feature name.
* We can loop through those feature_names, pull the position number, and pass that to the coefficient array to look up the matching value.

In [None]:
# Quick review - what are those features again?
# Add code to show the contents of the object

california.feature_names

In [None]:
# if we only wanted to see the coefficient for AveRooms, run this:

linear_regression.coef_[2]  # MedInc is item 0, AveRooms is item 2

In [None]:
# the enumerate() function returns each item in a list along with
# the corresponding position number.

# loop through the feature_names, print each name and its coefficient
# special formatting applied for alignment and decimal places
for i, name in enumerate(california.feature_names):
  print(f'{name:>10}: {linear_regression.coef_[i]:.4}')

In [None]:
# our linear_regression object also has an intercept_ attribute
# show it. (Note it is spelled intercept_ )



# STUDENT EDIT THIS CELL
Review the results 
* Which features have little to no predictive power? 
* What is the *b* intercept?
* Does population significantly increase or decrease Median House Value?
* How about bedrooms?
* How does this align with the scatter charts? Any surprises or does it make sense?

## Testing the Model
* First create a predicted object using the **test** data.
* (We've been using the **training** set so far.)

In [None]:
# pass the X_test data to the predict method of the regression object
# create a new object to hold the predicted output
predicted = linear_regression.predict(X_test)

# create an expected object which is the target value previously saved to the y_test object
expected = y_test

In [None]:
# View first five predictions with a slice
predicted[:5]

In [None]:
# View first five expected with a slice
expected[:5]

In [None]:
# View them together
print("predicted","expected")
for i in range(0,10):
    print(f'{predicted[i]:<10.3} {expected[i]:.3}')

## Visualizing the Expected vs. Predicted Prices
* Create a DataFrame of values from the arrays

In [None]:
# since we already have a DataFrame named df, let's use df1 here
df1 = pd.DataFrame()    # empty DataFrame
df1['Expected'] = pd.Series(expected)   # add a series from the array
df1['Predicted'] = pd.Series(predicted)

In [None]:
figure = plt.figure(figsize=(9,9))
# The following will throw some errors. Fix them.
axes = sns.scatterplot(data=df1, x='Expected',y='Predicted',
                       hue='Predicted', palette='cool', legend=False)

In [None]:
# Recreate plot with some other options
figure = plt.figure(figsize=(9,9))

axes = sns.scatterplot(data=df1, x='Expected',y='Predicted',
                       hue='Predicted', palette='cool', legend=False)

# Determine the min and max for the two axes
start = min(expected.min(), predicted.min())
end = max(expected.max(), predicted.max())

# Set the axis limits
axes.set_xlim(start, end)
axes.set_ylim(start, end)

# plot a prediction line
line = plt.plot([start, end], [start, end], 'k--')   # k = black, -- means dashed

## Regression Model Metrics
* Scikit-learn provides many metrics functions for evaluating how well the model predicts.
* For classification problems, use `confusion_matrix` and `classification_report`.
* For regression, use **coefficient of determination** and **mean squared error** as we did earlier in the semester.

### Scikit-learn's `r2_score` function
* Calculates the coefficient of determination.
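* Recall the coefficient of determination typically ranges from 0 to 1; `r2_score` can return a negative value on test data when the model predicts worse than simply using the mean.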
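
To see where the number comes from, the sketch below computes the coefficient of determination by hand and compares it with `r2_score`. The arrays are made up for illustration (`toy_expected`, `toy_predicted`, and `bad_predicted` are hypothetical names).

```python
import numpy as np
from sklearn import metrics

# Hypothetical expected and predicted values
toy_expected = np.array([1.0, 2.0, 3.0, 4.0])
toy_predicted = np.array([1.1, 1.9, 3.2, 3.8])

# R-squared = 1 - (sum of squared errors) / (total sum of squares)
ss_res = ((toy_expected - toy_predicted) ** 2).sum()
ss_tot = ((toy_expected - toy_expected.mean()) ** 2).sum()
r2_by_hand = 1 - ss_res / ss_tot

print(r2_by_hand)
print(metrics.r2_score(toy_expected, toy_predicted))

# Predictions that are worse than guessing the mean give a score below zero
bad_predicted = np.array([4.0, 3.0, 2.0, 1.0])
print(metrics.r2_score(toy_expected, bad_predicted))
```

The first two printed values should agree (about 0.98), and the last one is negative because the made-up predictions run in the opposite direction from the expected values.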

In [None]:
# import the metrics module from Scikit-learn
from sklearn import metrics

In [None]:
# run the r2_score function using the expected and predicted values
metrics.r2_score(expected, predicted)

## STUDENT EDIT THIS CELL
* Evaluate the predictive power of this model given the output of the previous cell.

YOUR OBSERVATIONS HERE


### Scikit-Learn's `mean_squared_error` function
* Calculates the average of the squared differences between the expected and predicted values.
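
As a rough check on what the function returns, this sketch computes the mean squared error by hand on made-up values and compares it with scikit-learn's result. The arrays are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Hypothetical expected and predicted values
toy_expected = np.array([1.0, 2.0, 3.0, 4.0])
toy_predicted = np.array([1.5, 2.0, 2.0, 4.5])

# Square each error, then average the squares
mse_by_hand = ((toy_expected - toy_predicted) ** 2).mean()

print(mse_by_hand)   # 0.375
print(metrics.mean_squared_error(toy_expected, toy_predicted))
```

Because the errors are squared, the single miss of 1.0 contributes four times as much as each miss of 0.5.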

In [None]:
# Calculate the estimator's mean squared error
metrics.mean_squared_error(expected, predicted)

## Trying other models
* Frequently, we'll try several approaches to see if something will be more predictive than the linear regression model. 
* Scikit-learn includes the `ElasticNet`, `Lasso`, and `Ridge` estimators, in addition to `LinearRegression`.
* The code below will run each of those models several times using the `KFold` method of pulling samples. 
* The model that has the highest mean R-squared is most likely the most predictive.
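
The sketch below shows the mechanics of k-fold scoring on a small synthetic data set, so it runs without the housing data. The data comes from scikit-learn's `make_regression`, and the `toy_` names, the five folds, and the two models compared are choices made for this illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data, generated locally
toy_X, toy_y = make_regression(n_samples=200, n_features=4, noise=10.0,
                               random_state=11)

# Five folds: each fold takes one turn as the test set
toy_kfold = KFold(n_splits=5, shuffle=True, random_state=11)

toy_results = {}
for name, model in {'LinearRegression': LinearRegression(),
                    'Ridge': Ridge()}.items():
    scores = cross_val_score(estimator=model, X=toy_X, y=toy_y,
                             cv=toy_kfold, scoring='r2')
    toy_results[name] = scores
    print(f'{name:>16}: {len(scores)} scores, mean r2 = {scores.mean():.3f}')
```

Each model gets one R-squared score per fold, and the mean of those scores is what gets compared across models.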

In [None]:
# import the other models from Scikit-learn
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# create a dictionary of the estimators so we can loop through
# the models rather than coding them separately

estimators = {
    'LinearRegression': linear_regression,
    'ElasticNet': ElasticNet(),
    'Lasso': Lasso(),
    'Ridge': Ridge()
}

In [None]:
# import KFold and cross_val_score from Scikit-learn for comparisons
from sklearn.model_selection import KFold, cross_val_score

# loop through all of the estimators and run all of the models
for estimator_name, estimator_object in estimators.items():
    kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    scores = cross_val_score(estimator=estimator_object,
                            X=california.data, y=california.target,
                            cv=kfold, scoring='r2')
    print(f'{estimator_name:>16}: '+
          f'mean of r2 scores = {scores.mean():.3f}')

# STUDENT EDIT THIS CELL
Interpret the results in the previous cell. 

* Which, if any, of the regression models should be used to predict the home prices? 
* Is there anything you noticed from working through this, perhaps in the graphs, that might suggest a way to improve the predictive power?

Save the notebook, share the link, and submit it to the in-class assignment.