# End-to-End Machine Learning Project 

1. *Look at the big picture*
2. Get the data
3. Discover and visualize the data to gain insights
4. Prepare the data for machine learning algorithms
5. Select a model and train it
6. Fine-tune your solution
7. *Present your solution*
8. *Launch, monitor, and maintain your solution*

Today we'll walk through steps 2-6 using Scikit-Learn.

## Step 2: Get the data

In [None]:
# import statements
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import datasets

#### Jupyter pro-tip: Getting more info about an object
Use `?` after an object name to get more info (type, string form, path, docstring). Use `??` instead for even more info (also includes source).

Hitting `shift+tab` next to an object will get you a lot of the same info as `?`, except in a smaller, floating pop-up.

Hitting `shift+tab` next to a function will show the function signature and docstring.

Use `.` after an object name, then hit tab to get a list of all the methods associated with that object.
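
The `?`, `??`, and tab shortcuts above are Jupyter/IPython conveniences and are not valid in a plain Python script. As a rough analogue (a sketch, not an exact equivalent), the standard library's `inspect` module and the built-in `dir` expose similar information. The choice of `train_test_split` as the object to inspect is only an example.

```python
import inspect

from sklearn.model_selection import train_test_split

# Roughly what `?` shows: the signature and the docstring
signature = inspect.signature(train_test_split)
docstring = inspect.getdoc(train_test_split)
print(signature)
print(docstring.splitlines()[0])

# Roughly what `??` adds: the source code (only for objects written in Python)
source = inspect.getsource(train_test_split)
print(source[:300])

# Roughly what `.` + tab shows: the public attributes and methods of an object
numbers = [3, 1, 2]
public_names = [name for name in dir(numbers) if not name.startswith("_")]
print(public_names)
```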

#### Load a data set

**PRACTICE**: load a data set from the `datasets` module and assign it to a variable. Hint: most of the functions that fetch data sets start with `load_` or `fetch_`.
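
As one possible way to approach this, the sketch below lists the loader functions by name and then loads the iris data set. Iris is an arbitrary choice made for illustration; any other loader would work the same way.

```python
from sklearn import datasets

# List the functions in the module whose names start with load_ or fetch_
loaders = [name for name in dir(datasets) if name.startswith(("load_", "fetch_"))]
print(loaders)

# load_iris returns a dictionary-like Bunch object
iris = datasets.load_iris()

# The Bunch holds the data and its metadata under named keys
print(iris.keys())
print(iris.data.shape)    # (number of samples, number of features)
print(iris.target.shape)  # one label per sample
```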

## Step 3: Discover and visualize the data to gain insights

Goals: 
1. Figure out what kind of object our dataset is: what data and metadata it contains, and how we access them.
2. Get a sense of our data: its size and structure, the information it contains, the types and ranges of its values, and how we might use it to solve problems.


Useful explorations:

- try using `print` on your dataset's `DESCR` attribute to get a prettier output
- append `.shape` to an array to get its shape
- if you `import matplotlib.pyplot as plt`, `plt.hist(array)` will get you a histogram of the array
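
The explorations above might look like the following sketch, again assuming the iris data set as a stand-in for whichever one you loaded. The histogram is drawn for a single feature column; passing the whole 2-D array to `plt.hist` would instead overlay one histogram per column.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()

# print renders the newlines in the description instead of showing "\n"
print(iris.DESCR[:500])

# shape is (number of samples, number of features)
print(iris.data.shape)

# Histogram of the first feature column
counts, bin_edges, _ = plt.hist(iris.data[:, 0])
plt.xlabel(iris.feature_names[0])
plt.ylabel("count")
```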

Tricks with the [Pandas](https://pandas.pydata.org/) Python data analysis library:

- `import pandas as pd`
- make a dataframe (basically a table) with `pd.DataFrame(data=<your dataset>.data, columns=<your dataset>.feature_names)`
- try calling `.describe()` on your dataframe
- try calling `.hist()` on your dataframe
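
Put together, the Pandas tricks above could be sketched as follows. The iris data set is an assumption made for illustration, and the `figsize` value is arbitrary.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# One row per sample, one named column per feature
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Summary statistics (count, mean, std, min, quartiles, max) for each column
summary = df.describe()
print(summary)

# One histogram per column, drawn with matplotlib under the hood
axes = df.hist(figsize=(8, 6))
```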

**PRACTICE**: explore the data set you loaded in step 2. Briefly describe your data set to a partner: what it is, what features it has, and what kinds of data science questions you could use it to answer.

## Step 4: Prepare the data for machine learning algorithms

#### Set the random seed

Setting the random seed ensures that your results will be *reproducible*. It means that when we pseudorandomly split the data into test and training sets, the split will happen the same way each time.
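
A minimal sketch of what reproducibility means here, using NumPy's global seed. The seed value 42 is arbitrary. Many Scikit-Learn functions also accept a `random_state` argument, which is an alternative way to get the same effect for a single call.

```python
import numpy as np

# Seed the generator, then draw a pseudorandom ordering of 0..9
np.random.seed(42)
first = np.random.permutation(10)

# Re-seeding with the same value replays the same sequence
np.random.seed(42)
second = np.random.permutation(10)

print(first)
print(second)
print(np.array_equal(first, second))  # True
```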

In [None]:
# set the random seed


#### Split into test and train sets

In [None]:
# split the data


In [None]:
# sanity check 


**PRACTICE**: set a random seed and split your data into training and test sets
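
One way this could look, assuming the iris data set and an 80/20 split (both choices are illustrative, not prescribed by the steps above):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# set the random seed so the split is the same on every run
np.random.seed(42)

iris = datasets.load_iris()

# split the data: hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2
)

# sanity check: the two pieces should add up to the full data set
print(X_train.shape, X_test.shape)
print(len(X_train) + len(X_test) == len(iris.data))
```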

In [None]:
# set the random seed


In [None]:
# split the data


## Step 5: Select a model and train it

[A list of Scikit-Learn algorithms.](http://scikit-learn.org/stable/user_guide.html)
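
Most Scikit-Learn estimators follow the same pattern: create the model object, `fit` it on the training set, `predict` on the test set, then score the predictions. The sketch below assumes the iris data set and a k-nearest-neighbors classifier; both are arbitrary picks for illustration, and any classifier from the list could be substituted.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# make a model object
model = KNeighborsClassifier(n_neighbors=5)

# fit the model on the training set only
model.fit(X_train, y_train)

# make predictions for the held-out test set
predictions = model.predict(X_test)

# score the predictions against the true labels
accuracy = accuracy_score(y_test, predictions)
matrix = confusion_matrix(y_test, predictions)
print(accuracy)
print(matrix)  # rows are true classes, columns are predicted classes
```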

In [None]:
# import the submodule

# make a model object


In [None]:
# fit the model


In [None]:
# make predictions


In [None]:
# score the predictions


In [None]:
# visualize 


**PRACTICE**: try a model on your data.

**CHALLENGE**: implement [cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html)
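
As a starting point for the challenge, `cross_val_score` handles the splitting, fitting, and scoring in a single call. This sketch assumes the iris data set, a k-nearest-neighbors classifier, and 5 folds, all chosen only for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
model = KNeighborsClassifier(n_neighbors=5)

# Train and score the model 5 times, each time holding out a different fold
scores = cross_val_score(model, iris.data, iris.target, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```

Because every sample is used for testing exactly once, the mean score is usually a steadier estimate than the score from a single train/test split.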

In [None]:
# import the submodule

# make a model


In [None]:
# fit the model


In [None]:
# make predictions


In [None]:
# score predictions


In [None]:
# visualize


## Step 6: Fine-tune your model

Most models have **hyperparameters**: parameters that are set before training starts, as opposed to the model parameters that are learned during training. 

Changing the hyperparameters you pass in when you create the model can drastically affect its accuracy.
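
To make the distinction concrete, the sketch below varies one hyperparameter (`max_depth` of a decision tree) and compares the results. The data set, the model, and the depth values are assumptions chosen for illustration. In practice you would compare settings with cross-validation rather than on the test set.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

scores = {}
learned_depths = {}
for depth in (1, 3, None):  # None lets the tree grow without a depth limit
    # max_depth is a hyperparameter: it is fixed when the model is created
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # the structure of the fitted tree is learned during training
    learned_depths[depth] = tree.get_depth()
    scores[depth] = tree.score(X_test, y_test)

print(learned_depths)
print(scores)
```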


**PRACTICE**: fine-tune your model or try another type of model to see if you can improve accuracy.

**CHALLENGE**: use [grid search](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) to fine-tune your hyperparameters.
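
A minimal sketch of a grid search, assuming the iris data set and a k-nearest-neighbors classifier. The grid values are arbitrary examples. `GridSearchCV` cross-validates every combination in the grid on the training set and then refits the best one on all of the training data.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Candidate values for each hyperparameter (5 x 2 = 10 combinations)
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

# make a model and let the search set the hyperparameters
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)

# fit the model: cross-validates every combination on the training set
search.fit(X_train, y_train)

# score the model: the refit best estimator is evaluated on the test set
test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(search.best_score_)  # mean cross-validated accuracy of the best combination
print(test_accuracy)
```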

In [None]:
# import a submodule, if necessary

# make a model and set any hyperparameters

# fit the model

# score the model


#### References
- "End-to-End Machine Learning Project" steps from Aurélien Géron. 2017. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (1st ed.). O'Reilly Media, Inc.