# DS106 Machine Learning : Lesson Six Companion Notebook

### Table of Contents <a class="anchor" id="DS106L6_toc"></a>

* [Table of Contents](#DS106L6_toc)
    * [Page 1 - Introduction](#DS106L6_page_1)
    * [Page 2 - What is Machine Learning?](#DS106L6_page_2)
    * [Page 3 - Supervised Machine Learning in Python](#DS106L6_page_3)
    * [Page 4 - Interpreting Supervised Machine Learning Model Accuracy](#DS106L6_page_4)
    * [Page 5 - Challenges in Machine Learning](#DS106L6_page_5)
    * [Page 6 - Cross Validation ](#DS106L6_page_6)
    * [Page 7 - k-Fold Cross Validation in Python](#DS106L6_page_7)
    * [Page 8 - Key Terms](#DS106L6_page_8)
    * [Page 9 - Lesson 1 Hands-On](#DS106L6_page_9)

    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Overview of this Module<a class="anchor" id="DS106L6_page_1"></a>

[Back to Top](#DS106L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

```python
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to Machine Learning
VimeoVideo('244082549', width=720, height=480)
```

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO106-ML-L01overview.zip)**.

# Introduction

Machine learning is the science of programming computers to learn from data. Many people benefit from machine learning every day, from Google's search engine results to song recommendations on Spotify. Because these services can see the streaming history of their users, they can find correlations between customers and make better recommendations. In this course, you will learn the theory and the practice of machine learning in order to discover hidden insights.

By the end of this lesson, you will be able to: 

* Differentiate between the different types of machine learning
* Complete supervised machine learning in Python
* Understand model over- and under-fitting
* Conceptualize *k*-fold cross validation
* Perform *k*-fold cross validation in Python

This lesson will culminate in a hands-on in which you use supervised machine learning to predict and cross-validate diamond price.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - What is Machine Learning?<a class="anchor" id="DS106L6_page_2"></a>

[Back to Top](#DS106L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# What is Machine Learning?

A computer scientist by the name of Arthur Samuel coined the term "machine learning." Here is how he defined it:

> _"[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed."_ - Arthur Samuel, 1959

The key portion of that phrase is _"without being explicitly programmed"_. How is machine learning different from systems that have been explicitly programmed? Fundamentally, explicit programming involves writing code that solves a set of known problems in discrete steps. Some problems, however, can't be solved efficiently this way. How is this possible? Consider for a moment how a developer would go about filtering spam email out from the "ham" email (good email).

The following _pseudocode_, or high-level code, demonstrates how a programmer would explicitly filter out spam emails:

```text
spam_phrases = ["FREE!!", "Black Friday Sale!", "Dear Sir, I hope this email finds you well. Wire me money."]
If (email contains phrases from spam_phrases)
    Flag email as spam
    Send email to "Spam" folder
Otherwise
    Keep email in the inbox
```
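The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration rather than a real spam filter; the function name `route_email` and the sample messages are assumptions made up for the example.

```python
# Hypothetical rule-based filter mirroring the pseudocode above.
SPAM_PHRASES = [
    "FREE!!",
    "Black Friday Sale!",
    "Dear Sir, I hope this email finds you well. Wire me money.",
]

def route_email(body, spam_phrases=SPAM_PHRASES):
    """Return the folder an email lands in under the fixed phrase rules."""
    if any(phrase in body for phrase in spam_phrases):
        return "Spam"
    return "Inbox"

print(route_email("FREE!! Claim your prize today"))
print(route_email("Lunch on Thursday?"))
# A well-meaning warning that quotes a spam phrase still trips the rule.
print(route_email("Don't click those 'Black Friday Sale!' emails."))
```

The third call shows the weakness of fixed rules: any email that merely mentions a listed phrase is routed to the spam folder.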

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun fact!</h3>
    </div>
    <div class="panel-body">
        <p>One of the first mainstream applications of machine learning was used to filter out spam emails in the 1990s.</p>
    </div>
</div>

This might seem relatively straightforward. If the email contains any of the spam phrases, mark it as spam and place it in the spam folder. However, this is not optimal for many reasons. For one, imagine what would happen to an email from a friend or family member warning the user about malicious spam emails: _"Don't click those 'Black Friday Sale!' emails. It downloads a virus."_ According to the rules set up above, this email will be flagged as spam even though it's not. One reaction might be to add the sender's email address to a list of approved email addresses (a _whitelist_). That solution would work in the short term, but it is a reactive response rather than a proactive one. Spam creators would catch on early and switch to a different set of keywords, being careful not to use any phrases from the spam list. The list of spam phrases would then need to be manually updated whenever a new kind of spam email appears. The process becomes even more complex when a user receives an email from a known contact whose email account has been compromised (or "hacked"). The email service can also prove impractical to use unless the user whitelists every known sender.

This can be contrasted with the machine learning approach. Whenever an email comes in, the user can manually flag it as spam. The machine learning algorithm, or _learner_, then gathers information about the spam email, such as the subject line, timestamp, sender's email address, and server location, to name a few examples. The _learner_ then begins to recognize patterns in spam. The benefit of this approach is that all the users of the email service can both _train_ the learner to recognize spam and benefit from the intelligent spam filtering. If the learner notices that emails sent from outside of the country during non-business hours tend to be spam, it can automatically decide to mark such an email as spam. This system is effective, but not without its flaws. You may have experienced this when an email sent from a trusted source was marked as spam. While the learner is noticing the trends for spam emails, it can also learn to recognize "ham" emails (good emails) by noting the same information about them (e.g. the timestamp). Over time, the learner builds an intuition for what makes up both a ham and a spam email. Once given enough data, learners can flag a spam email even if it's the first of its kind. This is where machine learning shines: pattern recognition and prediction.

---

## Types of Machine Learning Systems

Machine learning is powerful, and there are multiple ways to go about solving a problem. Should the learner be given all the information at once or in batches? Should the learner be given examples of labeled ham and spam or be left to discover patterns on its own? These different approaches are important enough to be discussed below, each contrasted with its counterpart.

---

### Supervised and Unsupervised Machine Learning

When training a learner with data (e.g. ham and spam examples), the data can come with labels denoting what each example is (e.g. "ham"). Training on labeled data is referred to as _supervised learning_. This allows the learner to predict which label a new email should be _classified_ under. Supervised learning is often done through regression and modeling by splitting a known data set into two parts.  The first part is used to fit a regression model and thus train the computer, and the second part, which the computer has never seen before, is used to check how well the trained model predicts.  

In contrast, training on data without labels is referred to as _unsupervised learning_. Since there are no labels to learn from, the computer doesn't have anything to go on and so has to discover the pattern for itself.  Typically, unsupervised learning programs cluster the data together based on likeness of certain values. 
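As a rough sketch of the clustering idea, the snippet below groups a handful of made-up points without being told which group any point belongs to. The choice of `KMeans` from ```sklearn``` and the sample points are assumptions for illustration; the lesson does not prescribe a particular clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two obvious groups; no labels are supplied.
points = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])

# Ask for two clusters; random_state only makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

# The cluster numbers themselves are arbitrary; what matters is which
# points end up sharing a number.
print(cluster_ids)
```

The first three points should share one cluster number and the last three the other, even though the learner was never told the groups existed.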

---

### Reinforcement Learning

The AlphaGo artificial intelligence was developed to play the game Go. Go is a strategic board game, even more complex than chess, in which you must surround and capture your opponent's pieces. 

![The board game Go. White and black pieces are on the board.](Media/learning1.jpg)

It was originally thought that technology was not yet advanced enough for an artificial intelligence to win matches against human players at the game of Go. However, in October 2015, AlphaGo beat a professional human player at the game. This achievement is remarkable because brute-forcing each possible move of Go is not computationally feasible. AlphaGo learned from many Go games and from playing against itself using *reinforcement learning*.

The learner is referred to as an _agent_ in the context of reinforcement learning. The agent observes the environment, performs an action, and is either rewarded or punished depending on the outcome of the choice. The agent will then learn what happened and either perform or avoid the action. This approach is then repeated by taking more actions and learning from them. This is more complex compared to the other systems but provides a great gain in output.
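The observe, act, and reward loop described above can be sketched with a toy agent that chooses between two actions. Everything here is an assumption made for illustration (the reward rule, the `train_agent` name, and the parameter values); it is far simpler than the approach used for Go.

```python
import random

def train_agent(steps=200, epsilon=0.1, learning_rate=0.1, seed=0):
    """Toy agent: action 1 is always rewarded, action 0 always punished."""
    rng = random.Random(seed)
    values = [0.0, 0.0]  # the agent's current estimate of each action's worth
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(2)           # occasionally explore at random
        else:
            action = values.index(max(values))  # otherwise pick the best-looking action
        reward = 1.0 if action == 1 else -1.0   # the environment responds
        # Nudge the estimate for the chosen action toward the observed reward.
        values[action] += learning_rate * (reward - values[action])
    return values

values = train_agent()
print(values)
```

After training, the estimate for action 1 ends up higher than the estimate for action 0, so the agent prefers the rewarded action and avoids the punished one.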

---

### Batch and Online Learning

A machine learning algorithm can either be trained on the fly or trained once and then deployed for use. _Batch learning_ is a system that cannot learn incrementally: it must be trained using all of the data available. It is important to gather all available data up front, since the system has to be stopped and retrained in order to learn new information. This is often called _offline learning_. While the process of training the algorithm can be time consuming, it can fortunately be automated. This is contrasted with _online learning_, which is a system capable of learning incrementally, one observation or small batch at a time. Although this style of incremental learning can be updated very quickly, the training is usually not done on a live system such as Google Search. Instead, the system is pulled offline, quickly retrained to extend the algorithm's knowledge base, and then redeployed to the live environment. The term _online learning_ can be confusing when it's contrasted with offline, so it's safe to think of this system simply as *incremental learning*.
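To make the contrast concrete, here is a minimal sketch in which the "model" is nothing more than an average. The batch version recomputes from the full data set, while the incremental version folds in one observation at a time. The names `batch_mean` and `RunningMean` are made up for this example.

```python
def batch_mean(all_values):
    """Batch style: recompute from the full data set every time."""
    return sum(all_values) / len(all_values)

class RunningMean:
    """Incremental style: fold in one observation at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Shift the current mean toward the new value by 1/count of the gap.
        self.mean += (value - self.mean) / self.count
        return self.mean

data = [10.0, 20.0, 30.0, 40.0]

online = RunningMean()
for value in data:
    online.update(value)

print(batch_mean(data), online.mean)
```

Both approaches arrive at the same answer here; the difference is that the incremental version never needs to see the earlier observations again.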

---

### Instance-based and Model-based Learning

Machine learning systems can generalize based on instances or models. Spam emails can be flagged and the learner can spot patterns in emails to flag future spam email never before seen. The learner can generalize patterns via _instance-based_ learning or _model-based_ learning.

_Instance-based_ learning is the equivalent of a human learning by heart. This is effective for known data, such as spam email that has already been flagged, but it suffers when predicting new spam email with new terminology. Fortunately, measuring the similarity between two emails allows the learner to decide whether or not a new email is spam, given some similarity threshold.
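One simple way to measure similarity between two emails is to compare the sets of words they contain. The sketch below uses Jaccard similarity (shared words divided by total distinct words) with a threshold of 0.5; both the similarity measure and the threshold are assumptions chosen for illustration, not something the lesson prescribes.

```python
def jaccard_similarity(text_a, text_b):
    """Share of distinct words the two texts have in common (0.0 to 1.0)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def looks_like_spam(new_email, known_spam, threshold=0.5):
    """Flag the email if it is similar enough to any remembered spam."""
    return any(
        jaccard_similarity(new_email, spam) >= threshold for spam in known_spam
    )

# Made-up examples of spam the learner has already memorized.
known_spam = ["claim your free prize now", "wire me money today"]

print(looks_like_spam("claim your free gift now", known_spam))
print(looks_like_spam("meeting moved to friday", known_spam))
```

The first email shares four of six distinct words with a known spam message, so it clears the threshold; the second shares none.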

_Model-based_ learning takes a set of examples and builds a model that represents the data. Predictions can be made using the model as the source of knowledge. Suppose you wish to test the hypothesis that higher rated movies make more money. Being an aspiring data scientist, you collect the top 20 movies, create a table, and plot the rating versus gross income on a graph.

**Table 1.1: Movie Ratings and Gross Income**

<table>
    <tr><th>Rank</th><th>Movie Title</th><th>Released Year</th><th>Rating (out of 10)</th><th>Gross (Millions of US Dollars)</th></tr>
    <tr>
        <td>1</td>
        <td>The Shawshank Redemption</td>
        <td>1994</td>
        <td>9.3</td>
        <td>$28.3</td>
    </tr>
    <tr>
        <td>2</td>
        <td>The Godfather</td>
        <td>1972</td>
        <td>9.2</td>
        <td>$135</td>
    </tr>
    <tr>
        <td>3</td>
        <td>The Godfather: Part II</td>
        <td>1974</td>
        <td>9.0</td>
        <td>$57.3</td>
    </tr>
    <tr>
        <td>4</td>
        <td>The Dark Knight</td>
        <td>2008</td>
        <td>9.0</td>
        <td>$535</td>
    </tr>
    <tr>
        <td>5</td>
        <td>Schindler's List</td>
        <td>1993</td>
        <td>8.9</td>
        <td>$96</td>
    </tr>
    <tr>
        <td>6</td>
        <td>Pulp Fiction</td>
        <td>1994</td>
        <td>8.9</td>
        <td>$108</td>
    </tr>
    <tr>
        <td>7</td>
        <td>The Lord of the Rings: The Return of the King</td>
        <td>2003</td>
        <td>8.9</td>
        <td>$377.8</td>
    </tr>
    <tr>
        <td>8</td>
        <td>The Good, the Bad and the Ugly</td>
        <td>1966</td>
        <td>8.9</td>
        <td>$6.1</td>
    </tr>
    <tr>
        <td>9</td>
        <td>Fight Club</td>
        <td>1999</td>
        <td>8.8</td>
        <td>$37</td>
    </tr>
    <tr>
        <td>10</td>
        <td>The Lord of the Rings: The Fellowship of the Ring</td>
        <td>2001</td>
        <td>8.8</td>
        <td>$315.5</td>
    </tr>
    <tr>
        <td>11</td>
        <td>Forrest Gump</td>
        <td>1994</td>
        <td>8.8</td>
        <td>$330</td>
    </tr>
    <tr>
        <td>12</td>
        <td>Star Wars: Episode V - The Empire Strikes Back</td>
        <td>1980</td>
        <td>8.8</td>
        <td>$290.5</td>
    </tr>
    <tr>
        <td>13</td>
        <td>Inception</td>
        <td>2010</td>
        <td>8.8</td>
        <td>$292.5</td>
    </tr>
    <tr>
        <td>14</td>
        <td>The Lord of the Rings: The Two Towers</td>
        <td>2002</td>
        <td>8.7</td>
        <td>$342.5</td>
    </tr>
    <tr>
        <td>15</td>
        <td>One Flew Over the Cuckoo's Nest</td>
        <td>1975</td>
        <td>8.7</td>
        <td>$112</td>
    </tr>
    <tr>
        <td>16</td>
        <td>Goodfellas</td>
        <td>1990</td>
        <td>8.7</td>
        <td>$46.8</td>
    </tr>
    <tr>
        <td>17</td>
        <td>The Matrix</td>
        <td>1999</td>
        <td>8.7</td>
        <td>$171.5</td>
    </tr>
    <tr>
        <td>18</td>
        <td>Star Wars: Episode IV - A New Hope</td>
        <td>1977</td>
        <td>8.7</td>
        <td>$322.7</td>
    </tr>
    <tr>
        <td>19</td>
        <td>Se7en</td>
        <td>1995</td>
        <td>8.6</td>
        <td>$100</td>
    </tr>
    <tr>
        <td>20</td>
        <td>The Silence of the Lambs</td>
        <td>1991</td>
        <td>8.6</td>
        <td>$130.7</td>
    </tr>
</table>

![A chart titled top twenty open parentheses ratings versus gross income close parentheses. The x axis is labeled rating open parentheses I M D B close parentheses and runs from eight point four to nine point four. The y axis is labeled gross income open parentheses million U S D close parentheses and runs from zero to three hundred fifty. Data for the top twenty movies is plotted on the chart.](Media/top-20.png)

Intuitively, twenty movies may seem like too few to build a good model for predicting the gross income of a movie, so the pool of movies is increased from the top 20 to the top 30. Then, a new graph is generated: 

![A chart titled top thirty open parentheses ratings versus gross income close parentheses. The x axis is labeled rating open parentheses I M D B close parentheses and runs from eight point four to nine point four. The y axis is labeled gross income open parentheses million U S D close parentheses and runs from zero to six hundred fifty. Data for the top thirty movies is plotted on the chart.](Media/top-30.png)

This example ignores potentially relevant data, thus introducing bias. For instance, this model didn't account for the movie production cost, production year, genre, running time, whether or not it's a sequel, and more. 
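As a sketch of the model-based approach, the snippet below fits a straight line to the 20 rows of Table 1.1 and uses it to estimate the gross income for a rating. The choice of ```np.polyfit``` and the example rating of 9.1 are assumptions for illustration; with so few points, the result should not be read as a reliable model.

```python
import numpy as np

# Ratings and gross income (millions of US dollars) copied from Table 1.1.
ratings = np.array([9.3, 9.2, 9.0, 9.0, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8,
                    8.8, 8.8, 8.8, 8.7, 8.7, 8.7, 8.7, 8.7, 8.6, 8.6])
gross = np.array([28.3, 135, 57.3, 535, 96, 108, 377.8, 6.1, 37, 315.5,
                  330, 290.5, 292.5, 342.5, 112, 46.8, 171.5, 322.7, 100, 130.7])

# Fit gross = slope * rating + intercept by least squares.
slope, intercept = np.polyfit(ratings, gross, 1)

# Use the fitted line as the "model" to estimate a hypothetical new movie.
new_rating = 9.1
estimated_gross = slope * new_rating + intercept

print(slope, intercept, estimated_gross)
```

The fitted line is the model: once it exists, predictions come from the line rather than from looking up individual movies.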

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Machine learning relies on a pool of data to become accurate, making the "top 30" example a mere puddle in comparison. In reality, data scientists will play with data sets with tens of thousands of data elements.</p>
    </div>
</div>

This section discussed the different systems a machine learning algorithm can implement. These categories are not mutually exclusive, however. For instance, it's possible to have an unsupervised, batch learning algorithm that processes unlabeled data.

---

## Comparing Machine Learning and Statistics

Many machine learning techniques have a statistical counterpart because statistical models form the foundation of machine learning.  Statistics came first, and machine learning was built on top to automate and make better use of the statistics. However, the way they get used tends to be different.  Machine learning tends to be about the "what": you are trying to get to a working algorithm that will be used with new data for making recommendations immediately. Statistics tends to be a little bit more about the "how" and "why." It gives you a little more information about what's going on, so you can explore and understand your data better, and make changes to best practices based upon it. Neither is a better approach; they are just very different.  To get an even better feel for how to utilize all the tools in your arsenal, check out this **[Woz U Blog entitled "Stats Trained? Re-Orient Yourself Towards Machine Learning and Data Science"](https://woz-u.com/stats-trained-re-orient-yourself-towards-machine-learning-and-data-science/)**.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Supervised Machine Learning in Python<a class="anchor" id="DS106L6_page_3"></a>

[Back to Top](#DS106L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Supervised Machine Learning in Python

Now that you understand the basics of machine learning and all the different types, you will learn how to complete supervised machine learning in Python, using linear regression as the supervised machine learning base. 

---

## Import Packages

First, you will need to import some packages.  You will need ```pandas``` for loading in data, ```numpy``` for square-rooting your model estimates, ```sklearn``` for the bulk of the linear regression and modeling work, and ```matplotlib``` to graph the model's residuals to get a visual representation of accuracy.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import metrics
import numpy as np
```

---

## Load in Data

Next, you will need to load in your data.  For this exercise, you will be using **[housing data](https://repo.exeterlms.com/documents/V2/DataScience/Machine-Learning/realestate.zip)**. 
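The later cells refer to a dataframe named ```realestate```. A minimal sketch of loading it with ```pandas``` is shown below; the helper name `load_realestate` and the file name `realestate.csv` are assumptions, so adjust the path to match the file you extract from the zip.

```python
import pandas as pd

def load_realestate(csv_path):
    """Read the housing data from a CSV file into a DataFrame."""
    return pd.read_csv(csv_path)

# Hypothetical file name; change it to wherever you extracted the data.
# realestate = load_realestate("realestate.csv")
# realestate.head()
```

Calling ```.head()``` afterwards is a quick way to confirm that the column names match the ones used in the cells that follow.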

---

## Goal

With the above data, your goal is to accurately predict housing prices.  This variable is conveniently labeled ```Y house price of unit area``` for you. You will use X variables numbered 2-6 to determine housing prices. When completed, if desired, you should be able to take completely new data, maybe from a new geographic location, and predict housing prices there.

---

## Data Wrangling

The first thing you need to do to kick off machine learning is to create your x and y variables as their own arrays. You cannot feed in the entire dataframe all at once, so you will need to subset your data. The ```x``` data will consist of X2-X6.  You are going to skip X1 since it is a date, and dates can sometimes be tricky to format correctly for machine learning. 

```python
x = realestate[['X2 house age', 'X3 distance to the nearest MRT station', 'X4 number of convenience stores', 'X5 latitude', 'X6 longitude']]
```

The y data will consist of the target variable, or what you are trying to predict.  In this case, that is housing price: 

```python
y = realestate['Y house price of unit area']
```

---

## Train Test Split

One of the key things that separates machine learning from statistics is that machine learning utilizes the concept of "train test split." In statistics, you typically run your analysis on all the data you have available.  In machine learning, you split your data into two chunks, reserving the first chunk for training the model and the second for testing the model. How big should a "chunk" be? Typically you want more data to be used for training than for testing. 80/20, 70/30, and 60/40 splits are all acceptable.  

You will utilize the ```train_test_split()``` function from ```sklearn``` to split your data.  You will end up with four data sets at the end: 

* x_train
* x_test
* y_train
* y_test

There will be one training dataset and one testing dataset each for x and y.

As arguments into the ```train_test_split()``` function, you will place your ```x``` and ```y``` data, and specify how much of your data you want to test with the argument ```test_size=```. In this case, the value of ```test_size=``` is .4, because you are going to use a 60/40 train/test split. This means that you are reserving 40% of your data for testing, and training with the remaining 60%. 

```python
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = .4, random_state=101)
```

You will note that another argument is included: ```random_state=101```. This argument is not required when you are doing machine learning on your own, but when everyone uses the same number (101), the randomly selected 60% of training data will be the same as what is seen in the lesson.  So, including the ```random_state=``` argument in the function makes it a bit easier to follow along, because you will get the same results as what is presented here.

Once you have completed that line, if you want to see the shape of the data you'll be using for your machine learning algorithm, you can then print it out: 

```python
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
```

And here is the result:

```text
(248, 5) (248,)
(166, 5) (166,)
```

This is showing that in the ```x_train``` dataset, there are 248 rows and 5 columns, and in the ```x_test``` dataset, there are 166 rows and 5 columns.  So you can see how the training and testing data is broken up.
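The row counts can be checked with a little arithmetic. The sketch below assumes the test share is rounded up to a whole number of rows and the remainder goes to training, which is consistent with the shapes printed above.

```python
import math

n_rows = 248 + 166   # total rows, taken from the two shapes printed above
test_size = 0.4

n_test = math.ceil(test_size * n_rows)  # 40% of 414 is 165.6, rounded up
n_train = n_rows - n_test               # the remaining rows are used for training

print(n_train, n_test)  # 248 166
```

Because 40% of 414 is not a whole number, the split is only approximately 60/40.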

---

## Create the Linear Regression Model

Next, you will run the linear regression model on your training data.  You could call this linear model anything you’d like, but below you’ll see it has been named ```lm```.  You will then fit this model to the training data using the ```.fit()``` function, specifying the x and y training sets.  

```python
lm = LinearRegression()
lm.fit(x_train, y_train)
```

If you see output like the following, you know it has worked (newer versions of ```sklearn``` may print a shorter form such as ```LinearRegression()```): 

```text
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
```

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>You should be aware that in order to run this, you cannot have any non-number values in your dataset.   If you do, and you attempt this step, you will end up with an error message such as: "ValueError: could not convert string to float: 'See notes for:'" The “convert string to float” should tip you off that you have a problem with data types, and the quoted text at the end (here, “See notes for:”) is the value that could not be converted, which helps you track down the variables causing the error. You can either drop them out, or dummy code them into your dataframe - the choice is yours based on what variables you want to retain in your model.</p>
    </div>
</div>
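The two options mentioned in the caution above, dropping a text column or dummy coding it, can be sketched as follows. The tiny dataframe and its `neighborhood` column are made up for illustration and are not part of the housing data.

```python
import pandas as pd

# Made-up rows; 'neighborhood' is a hypothetical text column.
homes = pd.DataFrame({
    "house_age": [5.0, 12.5, 30.0],
    "neighborhood": ["north", "south", "north"],
})

# Option 1: drop the text column so only numbers remain.
numeric_only = homes.drop(columns=["neighborhood"])

# Option 2: dummy code it into one indicator column per category.
dummy_coded = pd.get_dummies(homes, columns=["neighborhood"])

print(list(numeric_only.columns))
print(list(dummy_coded.columns))
```

Dropping is simpler, while dummy coding keeps the information in the column available to the model.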

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Interpreting Supervised Machine Learning Model Accuracy<a class="anchor" id="DS106L6_page_4"></a>

[Back to Top](#DS106L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Interpreting Supervised Machine Learning Model Accuracy

Now that you have created the model, it's time to take a peek at it and determine whether it is any good. 

---

## Examine Predictions

Now that you’ve created your model, you can see the predictions it has made about housing price:

```python
predictions = lm.predict(x_test)
predictions
```

Which will provide you with a large array: 

```text
array([14.77852916,  8.35848599, 23.1113017 , 47.67384657, 30.05251015,
       37.468435  , 38.01762284, 41.08294225, 46.50080685, 40.34536509,
       43.87818623, 33.77279613, 40.08116941, 37.31066596, 46.15211908,
       48.22093568, 39.48594154, 46.43844951, 49.94962395, 47.552992  ,
       41.60580876, 52.60152777, 47.16226231, 37.48194878, 32.40811002,
       50.67597957, 39.35917038, 47.99287312, 45.4694465 , 39.33112551,
       49.61736207, 42.53188577, 42.96261018, 46.15577268, 44.94124757,
        7.13730951, 39.15074038, 39.77497805,  7.07979164, 54.43242047,
       31.26660065, 46.90435905, 24.89017208, 48.80711134, 42.6710441 ,
       50.08982154, 41.0044385 , 37.39701978, 44.86394799, 36.76558821,
       46.8133099 , 35.89912014, 42.35933217, 14.7421879 , 38.74428879,
       47.50157796, 43.06612319, 45.44985241, 43.77496083, 39.48259244,
       34.31225036, 45.52392252, 42.44560897, 42.0625614 , 51.89857656,
       42.74806676, 24.28752167, 48.68058491, 31.25018334, 40.06346133,
       43.6178354 , 48.68240545, 14.21653961, 35.23519914, 14.76427345,
       43.25900943, 33.7425475 , 44.18683365, 42.22275082, 11.21376847,
       45.59819933, 36.51146884, 42.35933217, 29.6210743 , 52.1620338 ,
       14.75338445, 35.2064402 , 33.2566497 , 40.22496408, 14.09152523,
       47.50926438, 34.37096962, 45.11380117, 25.01302325, 33.54177669,
       30.06022011, 23.53156264, 46.64460151, 27.77120309, 37.6169996 ,
       47.67413156, 30.23443112, 38.67231057, 40.81568301, 48.46849393,
       27.3840657 , 28.40540026, 30.66691363, 32.9788148 , 42.56064471,
       46.55832471, 46.09825481, 49.45208001, 33.9117351 , 47.53802332,
       42.53188577, 42.46357723, 46.50080685, 43.96908151, 44.54806893,
       51.11916869, 42.91232964, 32.24698686, 14.78214338, 35.89873172,
       33.94049403, 14.38904544, 42.79542479, 49.39456214, 43.94751443,
       28.31477818, 39.86164578, 45.17199505, 48.46849393, 52.57276884,
       38.23919165, 36.23935025, 45.52392252, 42.3923762 , 39.99498272,
       34.53964756, 48.84883348, 32.83255357, 45.68510943, 33.27573801,
       39.69163345, 15.22305451, 33.74403719, 39.2721091 , 24.77513635,
       46.01197801, 45.95357381, 31.62887841, 31.20555338, 46.55832471,
       33.73827309, 46.82193446, 29.9451677 , 46.78249532, 11.76318638,
       52.77408137, 46.49381411, 47.21978018, 54.51869727, 40.81742848,
       52.86035817])
```

This information is not super useful by itself, but plotting it gives you a better idea of how accurate your predictions (and thus your model) are.  The more closely the points follow a straight line, the better the model fit. Go ahead and make a scatterplot with the ```plt.scatter()``` function, graphing the ```y_test``` data against the ```predictions``` from your trained model:

```python
plt.scatter(y_test, predictions)
```

And here is the resulting graph:

![An unlabeled graph. The x axis runs from ten to eight. The y axis runs from ten to fifty. Data is scattered on the graph.](Media/learning2.png)

Looks like the accuracy is not wonderful just by eyeballing it, but you can also quantify it a number of different ways.

---

## Accuracy Score

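The first way is to print a score for this model.  Place the ```.score()``` function inside the ```print()``` function with your testing data as arguments to get it. For a linear regression model, this score is R-squared (the coefficient of determination) rather than a percentage of correct predictions: 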

```python
print("Score:", lm.score(x_test, y_test))
```

And here is the result:

```text
Score: 0.644238084512178
```

This means your model explains approximately 64% of the variation in the test-set housing prices, which is not too shabby in the real world.  
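For a regression model, ```.score()``` reports R-squared: one minus the ratio of the squared prediction errors to the squared deviations of the real values from their mean. The sketch below computes it by hand on four made-up values; the function name `r_squared` and the numbers are assumptions for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_residual = np.sum((y_true - y_pred) ** 2)      # error left after predicting
    ss_total = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_residual / ss_total

# Made-up values, not taken from the housing data.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(r_squared(actual, predicted))
```

A value of 1 would mean the predictions match the real values exactly, and a value near 0 would mean the model does no better than always predicting the mean.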

---

### Examining Error

The next way to determine model fit is to look at the error terms.  This is just another way to quantify the residuals - how close is your predicted data to the real data? There are many different mathematical ways to examine error, but you will look at *mean absolute error (MAE)*, *mean squared error (MSE)* and *root mean squared error (RMSE)*. There are no cut-off values when interpreting error scores, because each model with different variables and different units for those variables will generate radically different error values.  The main thing to know about interpreting error is that the smaller the error value, the better, and they range from zero to infinity. You want as close to zero as you can get.

---

### Mean Absolute Error (MAE)

This is exactly what it sounds like - the average absolute difference between the prediction and the real data. It's a nice one to use because it's pretty simple to understand. To get it, utilize the ```metrics``` package from ```sklearn``` and the ```mean_absolute_error()``` function:

```python
metrics.mean_absolute_error(y_test, predictions)
```

You'll place as arguments ```y_test``` and ```predictions```, because these are the things you are comparing to get error. Here is the resulting statistic:

```text
5.550201321415433
```

Since the lowest you can have is 0, a value of 5.55 means the predictions are off by about 5.55 price units on average, which is pretty good!

---

### Mean Squared Error (MSE)

This is the average of the squared errors, rather than the average of their absolute values as above.  It's a good one to use because squaring penalizes large errors more heavily, and large errors often happen in the real world.  You'll get it with the ```mean_squared_error()``` function from the ```sklearn metrics``` package: 

```python
metrics.mean_squared_error(y_test, predictions)
```

And here is the resulting value: 

```text
54.37572854492122
```

Note that because the errors are squared, it comes out much larger than the mean absolute error and is measured in squared units.

---

### Root Mean Squared Error (RMSE)

This one is the square root of the mean squared error you saw above. It is probably the most popular. You will need to utilize the ```numpy sqrt()``` function to get the square root of the ```mean_squared_error()``` function you used above:

```python
np.sqrt(metrics.mean_squared_error(y_test, predictions))
```

Here is the RMSE value:

```text
7.373990001683025
```

Again, this model fits decently well - an RMSE of about 7, in the same units as the housing price, is fairly close to zero! That is not to say that there aren't better fitting models for this data out there - there very well might be! But at first blush this looks ok.
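To see how the three error measures relate, the sketch below computes each one by hand with ```numpy``` on four made-up values (not taken from the housing data).

```python
import numpy as np

# Made-up values for illustration only.
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0])

errors = actual - predicted

mae = np.mean(np.abs(errors))   # average size of the errors
mse = np.mean(errors ** 2)      # average of the squared errors
rmse = np.sqrt(mse)             # back in the original units

print(mae, mse, rmse)
```

Here the MAE is 2.5 and the MSE is 6.5, so the MSE is not simply the MAE squared (which would be 6.25); the larger errors count for more once they are squared.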

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>Want to dive really deep into how to calculate error and which is better? Then check out <a href="https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d"> this Medium post.</a></p>
    </div>
</div>

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Challenges in Machine Learning<a class="anchor" id="DS106L6_page_5"></a>

[Back to Top](#DS106L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Challenges in Machine Learning

The effectiveness of the machine learning algorithm, or the _learner_, is dependent on the quality and quantity of data. Additionally, the algorithm itself needs to be well designed and be capable of adapting to new information. Two problems plague data scientists in the world of machine learning: "bad" data and "bad" algorithms. In this lesson, you will learn the challenges confronting machine learning and the solutions to overcome those challenges.

In the previous lesson, you performed your first supervised machine learning task.  Now, you will learn how to cross validate the algorithm to ensure it truly is the best fit.  

---

## "Bad" Data

One problem with "bad" data is having an _insufficient amount_ of data to make accurate decisions. Researchers at Microsoft investigated how the amount of training data affects different algorithms and **[published a paper](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/acl2001.pdf)** in 2001. In the paper, they compared the performance of different machine learning algorithms. Their conclusion? The algorithms performed almost identically on complex tasks once given enough data. The authors of this research study summarized the findings as follows:

> _"These results suggest that we may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development."_ Michele Banko and Eric Brill (Microsoft Researchers)

Since the multiple learning algorithms were converging to the same accuracy, the researchers were led to believe that acquiring more data is more economical (in time and money) than attempting to perfect an algorithm. In short, a learner can only be as good as the data it learns from.

This also makes sense intuitively. The more data there is to model, the less susceptible the model is to noise (e.g. outliers). A learner is capable of predicting unseen events based on previously seen data by generalizing. However, the data that is used to train the model must be _representative_ of the data that will be seen during real world use cases. Not only does the data need to be representative of the real world, it must also be _quality_ data. This means the data needs to be free of errors, outliers, and noise. If the data contains these problems, then _data cleaning_ needs to happen before the model can be properly trained.

---

## "Bad" Algorithms

---

### Overfitting and Underfitting

One trap learners can fall into is _overfitting_ data. Overfitting is when the model performs well (nearly perfect) on the training data but fails at generalizing to new data. It presumes that the training data includes every known case, and as such will force-fit the data. This ties into the problem where there is an insufficient amount of data to create a model accurate enough to generalize unseen cases: the model will overfit.

By contrast, _underfitting_ occurs when the model is too simple (e.g. linear) to capture the more complex structure of the data. For instance, using a linear model to project what time of day is best to sell stocks in the stock market is very likely to be too simple a model. This can be remedied by using a more complex model, such as a polynomial model, which can account for more variables and attempt to fit a more complex data structure. As a rule of thumb, the more relevant features a model has to work with, the "smarter" the learner can be when predicting events. However, some of the features present may be irrelevant.
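
To make the contrast concrete, here is a minimal sketch on made-up data (not the lesson's housing data). The names `x_demo`, `y_demo`, and `fit_scores` are hypothetical, and the polynomial degrees 1, 2, and 10 are arbitrary choices for illustration. The data follows a curve, so a straight line should underfit, while a very flexible polynomial tends to score higher on the training data than on the test data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): a noisy parabola
rng = np.random.default_rng(0)
x_demo = rng.uniform(-3, 3, size=(60, 1))
y_demo = x_demo.ravel() ** 2 + rng.normal(scale=1.0, size=60)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.5, random_state=0
)

fit_scores = {}
for degree in (1, 2, 10):
    # Expand x into polynomial terms, then fit an ordinary linear regression
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression()
    )
    model.fit(x_tr, y_tr)
    # Store (training R^2, testing R^2) for each level of model complexity
    fit_scores[degree] = (model.score(x_tr, y_tr), model.score(x_te, y_te))
    print(f"degree {degree}: train R^2 = {fit_scores[degree][0]:.2f}, "
          f"test R^2 = {fit_scores[degree][1]:.2f}")
```

A low score on both sets points toward underfitting. A training score that keeps climbing while the testing score stalls or drops is the usual sign of overfitting.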

---

### Irrelevant Features

_Irrelevant features_, as the name implies, are features that are not needed to make an educated prediction. The process of _feature engineering_ focuses on choosing which features are relevant and which are not. Feature engineering involves two processes: _feature selection_ and *feature extraction*. _Feature selection_ is when only the most useful features of a dataset are used for training the model. _Feature extraction_ is when features are combined to produce new ones in a process known as *dimensionality reduction.* 
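
The difference between the two processes can be sketched with `sklearn` on made-up data. Everything here is an assumption for illustration: the synthetic `features` array has two columns that drive `target` and three that are pure noise, `SelectKBest` with `f_regression` stands in for feature selection, and `PCA` stands in for feature extraction. Other techniques would be equally valid choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (hypothetical): columns 0-1 matter, columns 2-4 are irrelevant
rng = np.random.default_rng(1)
n_rows = 200
relevant = rng.normal(size=(n_rows, 2))
irrelevant = rng.normal(size=(n_rows, 3))
features = np.hstack([relevant, irrelevant])
target = 3 * relevant[:, 0] - 2 * relevant[:, 1] + rng.normal(scale=0.1, size=n_rows)

# Feature selection: keep the 2 original columns most related to the target
selector = SelectKBest(score_func=f_regression, k=2)
selected = selector.fit_transform(features, target)
kept_columns = selector.get_support(indices=True)
print("columns kept by selection:", kept_columns)

# Feature extraction: combine all 5 columns into 2 new ones
extracted = PCA(n_components=2).fit_transform(features)
print("shape after extraction:", extracted.shape)
```

Selection keeps a subset of the original columns unchanged, while extraction produces new columns that are combinations of the originals. Note that `PCA` does not look at the target at all, so it reduces dimensions without any guarantee that the new features predict well.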

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Cross Validation<a class="anchor" id="DS106L6_page_6"></a>

[Back to Top](#DS106L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Cross Validation

Train-test-split is a good first step, but what happens if, for whatever reason, the split ends up not being random?  It’s important to perform additional cross-validation steps to ensure that bias hasn’t been introduced to your model through the split method.  

There are typically two different ways to cross validate in machine learning: 

1. **K-folds Cross Validation**: You split the data into *k* groups and repeat train-test-split *k* times, so that each group is held out for testing once. 

2. **Leave One Out (LOO) Cross Validation:** This creates the same number of folds as observations in your dataset, so each iteration tests on a single observation. Then you’ll average the results of every iteration together to evaluate the model. This method is computationally quite expensive (takes a long time to process), so it should really only be used for smaller datasets. Therefore, this method is only mentioned here, and you won't practice it.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>If you would like to read more about leave one out cross validation, please visit <a href="https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6"> Towards Data Science - Cross Validation in Python</a> or visit <a href="https://machinelearningmastery.com/k-fold-cross-validation/"> Machine Learning Mastery - Cross Validation</a> </p>
    </div>
</div>
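
Although you won't practice leave one out cross validation in this lesson, a short sketch may help show why it is expensive. The data below is made up, and the names `x_small`, `y_small`, and `loo_mae` are hypothetical. Mean absolute error is used for scoring here because R² cannot be calculated on a test set that contains only one observation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic data (an assumption for illustration): 20 rows on a noisy line
rng = np.random.default_rng(2)
x_small = rng.uniform(0, 10, size=(20, 1))
y_small = 2.0 * x_small.ravel() + 1.0 + rng.normal(scale=0.5, size=20)

loo = LeaveOneOut()
n_folds = loo.get_n_splits(x_small)
print("number of folds:", n_folds)  # one fold per observation

# One model is fit per fold, so 20 rows means 20 separate fits
loo_scores = cross_val_score(
    LinearRegression(), x_small, y_small,
    cv=loo, scoring="neg_mean_absolute_error",
)
loo_mae = -loo_scores.mean()  # sklearn reports errors as negative scores
print("average absolute error:", round(loo_mae, 3))
```

With 20 rows this finishes instantly, but the number of model fits grows with every added row, which is why the method is usually reserved for small datasets.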

---

## K-Folds Cross Validation 

The idea behind k-folds cross validation is that you don’t want to rely on just one iteration of train-test-split, because it could be biased accidentally.  So if one is good, isn’t more better?  You split the data into *k* groups, called folds, and run *k* iterations.  In each iteration, you train the model on all of the folds but one and test it on the fold you held out.  You can thus feel much more confident (and convey that confidence to supervisors and customers) that your estimate of the model's accuracy does not depend on one lucky or unlucky split.  

Here’s a visual of the k-folds cross validation method, using four iterations.  You can see that each iteration has a different testing set and uses the rest of the data for training.  By the end of the fourth iteration, every observation has been used for testing exactly once.

![Four rows, each of which has twenty circles that are either green or red. Row one, iteration one. Test data is first five circles. The remaining are training data. Row two, iteration two, second five circles are test data. Row three, iteration three, third five circles are test data. Row four, iteration k equals four, final five circles are test data.](Media/106.L1.21.jpg)

If you break down k-fold cross validation into its most basic components, here is what this function does: 

* Randomizes the data
* Splits the data into *k* groups
* For each group, creates a test set and a training set, then fits a model and retains the accuracy score
* Summarizes the model using each iteration’s accuracy score

Each separate group of data will be the testing data once, and will be used as training data for the remainder of the iterations.  
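
The four steps above can be sketched by hand with `numpy`. This is only an illustration on made-up data, not how `sklearn` implements it internally. The names `x_toy`, `y_toy`, `folds`, and `fold_scores` are hypothetical, and the 20 rows and *k* = 4 are chosen to mirror the picture above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): 20 rows on a noisy line
rng = np.random.default_rng(3)
x_toy = rng.uniform(0, 10, size=(20, 1))
y_toy = 3.0 * x_toy.ravel() + rng.normal(scale=1.0, size=20)

k = 4
shuffled = rng.permutation(len(x_toy))   # step 1: randomize the row order
folds = np.array_split(shuffled, k)      # step 2: split into k groups of row indices

fold_scores = []
for i, test_idx in enumerate(folds):     # step 3: each group is the test set once
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    model = LinearRegression().fit(x_toy[train_idx], y_toy[train_idx])
    fold_scores.append(model.score(x_toy[test_idx], y_toy[test_idx]))

# step 4: summarize the model using every iteration's score
print("scores per fold:", np.round(fold_scores, 3))
print("average score:", round(float(np.mean(fold_scores)), 3))
```

Every row index appears in exactly one of the four folds, so every row is tested once and trained on three times.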

---

### How do you know how many iterations (k) to use? 

The main goal in choosing the number of iterations is that each group should be representative of the dataset overall.   If you have a pretty small dataset, then you’ll need a smaller k to ensure representativeness.  Otherwise, folks often use k=5 or k=10 as standard. 
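
A quick bit of arithmetic shows how the choice of *k* changes the size of each group. The row count of 414 is an assumption, taken from the fold indices printed later in this lesson (which run from 0 to 413). The name `fold_sizes` is hypothetical.

```python
n_rows = 414  # assumed size of the housing data used later in this lesson

fold_sizes = {}
for k in (3, 5, 10):
    base, extra = divmod(n_rows, k)
    # When the rows don't divide evenly, the first `extra` folds get one more row
    fold_sizes[k] = [base + 1] * extra + [base] * (k - extra)
    print(f"k={k}: test folds of sizes {fold_sizes[k]}")
```

A larger *k* means each test fold is smaller, so on a small dataset a single fold may be too small to represent the data well.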

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - k-Fold Cross Validation in Python<a class="anchor" id="DS106L6_page_7"></a>

[Back to Top](#DS106L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# k-Fold Cross Validation in Python

Now that you understand the process and theory of cross validation, you will complete it in Python! To do so, you will be starting where the last lesson left off.  In order to perform k-fold cross validation, you need to have already defined your x and y variables, performed the regular version of ```train_test_split()```, and created your model.  Then the k-fold fun can begin!

---

## Import Packages

First, you will need to import a couple of items in addition to everything you imported last lesson.  From ```sklearn.model_selection```, you will need the ```KFold``` class and the ```cross_val_score``` function.  

```python
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
```

---

## Load in Data

Next, you will need to load in your data.  As stated above, you will use the **[exact same housing data as last lesson](https://repo.exeterlms.com/documents/V2/DataScience/Machine-Learning/realestate.zip)**. 

---

## Goal

With the above data, your goal is to accurately predict housing prices.  To ensure this is the most accurate and rigorous model, you will be cross-validating it using the k-folds method. 

---

## Create the Folds

You will use the ```KFold()``` function to create your different training and test sets.  By setting ```n_splits=3```, you have chosen to have 3 iterations, a good number given the small number of cases in your dataset.  Then ```shuffle=True``` means that you want your data to be shuffled.  Lastly, if you choose to shuffle your data, then you can specify the randomization version with ```random_state``` - just like with ```train_test_split()```. If you were doing this on your own, it wouldn't matter, but since you are following along, and you want your numbers to be the same, specify ```random_state=1``` as done here so that everyone ends up with the same randomization. Recent versions of ```sklearn``` require these arguments to be passed by keyword rather than by position.

```python
# 3 folds, shuffled, with a fixed seed so everyone gets the same folds
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
for train, test in kfold.split(x, y):
    print('train: %s, test: %s' % (train, test))
```

And it prints out all the sets for you, in case you need to keep track of it:

```text
train: [  0   1   2   3   7   8   9  10  14  15  16  19  20  21  22  24  25  26
  28  30  31  32  33  34  35  36  37  38  42  43  44  45  46  47  48  49
  50  51  52  53  54  55  56  57  60  63  64  66  68  70  71  72  74  75
  76  77  79  83  84  86  87  88  94  96  97  99 100 103 104 105 108 109
 110 111 112 113 114 115 116 118 121 123 124 126 128 129 130 131 133 134
 135 136 137 138 140 141 142 143 144 145 147 148 149 150 151 152 153 154
 155 156 157 158 160 163 166 167 168 169 170 174 175 176 177 178 181 182
 183 184 188 190 193 194 195 196 197 198 199 200 201 202 203 205 206 208
 209 210 212 215 216 217 219 220 221 222 224 225 226 227 229 231 234 235
 236 237 239 240 241 243 246 248 249 250 251 252 253 254 255 258 259 260
 262 263 264 265 266 267 269 275 276 277 278 279 280 281 282 283 284 285
 287 288 290 293 296 297 301 302 303 305 306 307 308 309 310 313 315 316
 317 318 319 321 324 326 327 328 331 333 334 335 336 339 340 342 343 344
 345 347 349 352 353 354 355 356 357 358 359 361 362 365 366 369 371 372
 375 376 377 381 382 383 384 386 387 390 391 392 393 394 396 399 400 402
 404 406 407 408 411 413], test: [  4   5   6  11  12  13  17  18  23  27  29  39  40  41  58  59  61  62
  65  67  69  73  78  80  81  82  85  89  90  91  92  93  95  98 101 102
 106 107 117 119 120 122 125 127 132 139 146 159 161 162 164 165 171 172
 173 179 180 185 186 187 189 191 192 204 207 211 213 214 218 223 228 230
 232 233 238 242 244 245 247 256 257 261 268 270 271 272 273 274 286 289
 291 292 294 295 298 299 300 304 311 312 314 320 322 323 325 329 330 332
 337 338 341 346 348 350 351 360 363 364 367 368 370 373 374 378 379 380
 385 388 389 395 397 398 401 403 405 409 410 412]
train: [  1   2   3   4   5   6   7  10  11  12  13  15  17  18  20  22  23  25
  26  27  29  30  36  37  39  40  41  43  49  50  52  54  57  58  59  60
  61  62  64  65  67  68  69  71  72  73  74  75  76  77  78  80  81  82
  83  85  86  87  89  90  91  92  93  94  95  96  97  98 101 102 103 104
 106 107 109 114 115 117 118 119 120 121 122 125 126 127 129 130 132 133
 136 139 140 141 143 144 146 148 149 151 152 153 155 156 159 161 162 164
 165 166 170 171 172 173 176 178 179 180 181 182 183 185 186 187 189 190
 191 192 193 194 195 196 198 200 202 203 204 207 209 210 211 213 214 215
 216 218 220 223 226 228 230 232 233 235 237 238 239 240 241 242 243 244
 245 247 252 253 254 255 256 257 259 261 262 263 264 265 266 268 269 270
 271 272 273 274 276 278 279 280 281 282 286 288 289 291 292 294 295 297
 298 299 300 301 302 303 304 308 309 311 312 313 314 316 317 318 319 320
 321 322 323 325 329 330 332 335 336 337 338 339 340 341 345 346 347 348
 350 351 352 357 359 360 363 364 365 366 367 368 369 370 372 373 374 376
 378 379 380 381 382 385 386 388 389 390 393 395 396 397 398 399 401 402
 403 405 407 409 410 412], test: [  0   8   9  14  16  19  21  24  28  31  32  33  34  35  38  42  44  45
  46  47  48  51  53  55  56  63  66  70  79  84  88  99 100 105 108 110
 111 112 113 116 123 124 128 131 134 135 137 138 142 145 147 150 154 157
 158 160 163 167 168 169 174 175 177 184 188 197 199 201 205 206 208 212
 217 219 221 222 224 225 227 229 231 234 236 246 248 249 250 251 258 260
 267 275 277 283 284 285 287 290 293 296 305 306 307 310 315 324 326 327
 328 331 333 334 342 343 344 349 353 354 355 356 358 361 362 371 375 377
 383 384 387 391 392 394 400 404 406 408 411 413]
train: [  0   4   5   6   8   9  11  12  13  14  16  17  18  19  21  23  24  27
  28  29  31  32  33  34  35  38  39  40  41  42  44  45  46  47  48  51
  53  55  56  58  59  61  62  63  65  66  67  69  70  73  78  79  80  81
  82  84  85  88  89  90  91  92  93  95  98  99 100 101 102 105 106 107
 108 110 111 112 113 116 117 119 120 122 123 124 125 127 128 131 132 134
 135 137 138 139 142 145 146 147 150 154 157 158 159 160 161 162 163 164
 165 167 168 169 171 172 173 174 175 177 179 180 184 185 186 187 188 189
 191 192 197 199 201 204 205 206 207 208 211 212 213 214 217 218 219 221
 222 223 224 225 227 228 229 230 231 232 233 234 236 238 242 244 245 246
 247 248 249 250 251 256 257 258 260 261 267 268 270 271 272 273 274 275
 277 283 284 285 286 287 289 290 291 292 293 294 295 296 298 299 300 304
 305 306 307 310 311 312 314 315 320 322 323 324 325 326 327 328 329 330
 331 332 333 334 337 338 341 342 343 344 346 348 349 350 351 353 354 355
 356 358 360 361 362 363 364 367 368 370 371 373 374 375 377 378 379 380
 383 384 385 387 388 389 391 392 394 395 397 398 400 401 403 404 405 406
 408 409 410 411 412 413], test: [  1   2   3   7  10  15  20  22  25  26  30  36  37  43  49  50  52  54
  57  60  64  68  71  72  74  75  76  77  83  86  87  94  96  97 103 104
 109 114 115 118 121 126 129 130 133 136 140 141 143 144 148 149 151 152
 153 155 156 166 170 176 178 181 182 183 190 193 194 195 196 198 200 202
 203 209 210 215 216 220 226 235 237 239 240 241 243 252 253 254 255 259
 262 263 264 265 266 269 276 278 279 280 281 282 288 297 301 302 303 308
 309 313 316 317 318 319 321 335 336 339 340 345 347 352 357 359 365 366
 369 372 376 381 382 386 390 393 396 399 402 407]
```
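
Index arrays like the ones printed above can be used directly to slice the data and score a model fold by fold. The sketch below does this on made-up data, since the housing variables are not redefined here. The names `x_demo`, `y_demo`, `kfold_demo`, `manual_scores`, and `shortcut_scores` are hypothetical. It also checks that the fold-by-fold scores match what `cross_val_score()` returns when it is given the same `KFold` object.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration): 30 rows, 2 predictors
rng = np.random.default_rng(4)
x_demo = rng.uniform(0, 10, size=(30, 2))
y_demo = 1.5 * x_demo[:, 0] - 0.5 * x_demo[:, 1] + rng.normal(scale=1.0, size=30)

kfold_demo = KFold(n_splits=3, shuffle=True, random_state=1)

# Fold by fold: fit on the training rows, score on the held-out rows
manual_scores = []
for train_idx, test_idx in kfold_demo.split(x_demo):
    fold_model = LinearRegression().fit(x_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(x_demo[test_idx], y_demo[test_idx]))

# The same thing in one call, reusing the exact same folds
shortcut_scores = cross_val_score(LinearRegression(), x_demo, y_demo, cv=kfold_demo)

print("fold by fold:", np.round(manual_scores, 4))
print("one call:    ", np.round(shortcut_scores, 4))
```

One design point worth knowing: passing an integer such as `cv=3` to `cross_val_score()` for a regression model splits the rows into consecutive, unshuffled folds. To score the model on shuffled folds like the ones printed above, pass the `KFold` object itself as `cv`.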

The ability to run something cross-validated is already built into ```sklearn```. If you were to do this the long way, you would use these printed index lists to slice your data into six separate data sets, three for training and three for testing, then fit and score the model on each train and test pair, and finally average the scores together.  Sounds long and time consuming, right?  Well, luckily it’s as simple as one line of code with ```sklearn```: 

```python
print(cross_val_score(lm, x,y, cv=3))
```

Here is the result: 

```text
[0.62051774 0.50393467 0.55970703]
```

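You’ll notice that these scores vary somewhat. That is to be expected.  Assuming ```lm``` is the linear regression model from earlier, each score is an R² value: the proportion of variance in the held-out data that the model explains, not the percentage of predictions it got right. The first trained model explained about 62% of the variance in its test fold, while the second explained about 50% and the third about 56%. 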

Using cross-validation, your model has now been thoroughly tested, and you should feel secure in your knowledge that you have created a rigorous model that has stood up to some serious testing! You also have a better idea of how the accuracy might vary.
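
A common way to report cross-validated results is as an average score plus a measure of spread. Here is a minimal sketch using the three scores printed above, typed in by hand. The names `cv_scores`, `mean_score`, and `spread` are hypothetical.

```python
import numpy as np

# The three scores printed by cross_val_score() above
cv_scores = np.array([0.62051774, 0.50393467, 0.55970703])

mean_score = cv_scores.mean()  # typical performance across folds
spread = cv_scores.std()       # how much the score varies from fold to fold

print(f"score: {mean_score:.2f} +/- {spread:.2f}")
```

In your own notebook you would typically store the result of `cross_val_score()` in a variable and call `.mean()` and `.std()` on it directly, rather than retyping the numbers.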

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Key Terms<a class="anchor" id="DS106L6_page_8"></a>

[Back to Top](#DS106L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Machine Learning</td>
        <td>The science of programming computers to learn from data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Pseudocode</td>
        <td>High-level code that provides theory without using the actual functions and arguments.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Supervised Learning</td>
        <td>Requires a human to classify the data with labels to aid the learner.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Unsupervised Learning</td>
        <td>The learner is capable of learning without information labeling.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Reinforcement Learning</td>
        <td>A learner that learns from choices that lead to reward (or punishment).</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Learner</td>
        <td>AKA agent. A machine learning algorithm capable of adapting, or <em>learning</em>, to new data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Batch Learning</td>
        <td>AKA offline learning. Requires the program to be stopped and retrained with additional information.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Online Learning</td>
        <td>AKA incremental learning. The system is capable of learning incrementally in real time.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Instance-based</td>
        <td>Learning from individual scenarios.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Model-based</td>
        <td>Learning by using data to build a model and make predictions.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Environment</td>
        <td>Everything the agent interacts with.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Actions</td>
        <td>What the agent can do.</td>
    </tr>
        <tr>
        <td style="font-weight: bold;" nowrap>Overfitting</td>
        <td>When the model performs well on training data but performs poorly when generalizing.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Underfitting</td>
        <td>When the model is too simple.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Feature Selection</td>
        <td>Choosing the most relevant features from a dataset.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Feature Extraction</td>
        <td>Combining features to create new features.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Irrelevant Features</td>
        <td>Variables that are unnecessary for prediction.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>k-folds Cross Validation</td>
        <td>Splits the data into k groups and repeats train-test-split k times, so that each group is used for testing once.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Leave One Out Cross Validation</td>
        <td>Creates the same number of folds as observations in your dataset.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>k</td>
        <td>Number of folds (and therefore iterations) in k-folds cross validation.</td>
    </tr>
</table>

---

## Key Python Packages and Metrics

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sklearn.linear_model</td>
        <td>Used to create a regression model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sklearn.model_selection</td>
        <td>For train-test splitting and cross validation of data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Mean Absolute Error</td>
        <td>The mean of the absolute differences between the predicted values and the real values.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Mean Squared Error</td>
        <td>The mean of the squared differences between the predicted values and the real values.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Root Mean Squared Error</td>
        <td>The square root of mean squared error.</td>
    </tr>
</table>
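
The three error metrics can be checked by hand on a tiny example. The `actual` and `predicted` values below are made up for illustration. The sketch calculates each metric with `numpy` and compares the first two against the `sklearn` functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration
actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

errors = actual - predicted          # differences between real and predicted values

mae = np.mean(np.abs(errors))        # mean of the absolute differences
mse = np.mean(errors ** 2)           # mean of the squared differences
rmse = np.sqrt(mse)                  # square root of the mean squared error

print("MAE: ", mae, "| sklearn:", mean_absolute_error(actual, predicted))
print("MSE: ", mse, "| sklearn:", mean_squared_error(actual, predicted))
print("RMSE:", round(float(rmse), 4))
```

Note that the mean squared error is calculated from the individual squared differences. Squaring the mean absolute error gives a different number.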

---

## Key Python Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>train_test_split()</td>
        <td>For splitting your data into training and testing data.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>test_size=</td>
        <td>An argument for train_test_split() that allows you to specify how much data to reserve for testing.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>random_state=</td>
        <td>An argument for train_test_split() that sets the seed for the random shuffling, so the split is the same every time and you can follow along.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>lm.predict()</td>
        <td>Provides predictions for your model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>lm.score()</td>
        <td>Provides a score for your model; for a linear regression, this is the R² value.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>mean_absolute_error()</td>
        <td>Calculates the mean absolute error.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>mean_squared_error()</td>
        <td>Calculates the mean squared error.</td>
    </tr>
</table>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Lesson 1 Hands-On<a class="anchor" id="DS106L6_page_9"></a>

[Back to Top](#DS106L6_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Lesson 1 Hands-On

This Hands-On **will** be graded, so make sure you complete each part. When you are done, please submit one document with all of your findings for grading.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

Now that you've learned your first machine learning algorithm, it's time to put that knowledge to work. In this Hands-On exercise you will create a project which will require you to take data, clean it so that it's usable, and finally create a linear model to predict unknown data. This Hands-On project should be completed using the browser for downloading data and Python for plotting and modeling the data.

You should leverage what you have learned about machine learning and data modeling. Import the diamonds dataset from seaborn using this code:

```python
import seaborn as sns
from sklearn.utils import shuffle
Diamonds = shuffle(sns.load_dataset('diamonds'))
```

If seaborn isn't working for you, **[click here](https://repo.exeterlms.com/documents/V2/DataScience/Machine-Learning/Diamonds.zip)** to download the data.

And use the following variables to predict the price of diamonds:
* carat
* cut
* color
* clarity

You will need to utilize the ```train_test_split()``` method as well as ```LinearRegression()``` to train and test your algorithm. Then, leverage your knowledge of cross-validation and Python programming to cross-validate the work you did. Note the variation in model accuracy once you have cross-validated the model using 5 iterations.

Your final product should be a slide presentation that explains the process you took to analyze the data and the conclusions you found. Make sure you can explain everything in the most basic of ways and that you have included visualizations. In addition, please attach your Python code for grading.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>