# End-to-End Machine Learning Project
This chapter has you work through an example project from end to end, to get a taste of an ML workflow.

## Working with Real Data

Places to get real data:
* Popular open data repos:
    * [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
    * [Kaggle datasets](https://www.kaggle.com/datasets)
    * [Amazon's AWS datasets](https://registry.opendata.aws/)
* Meta portals (they list open data repos):
    * Data Portals
    * [OpenDataMonitor](https://opendatamonitor.eu/frontend/web/index.php?r=dashboard%2Findex)
    * [Quandl](https://www.quandl.com/)
* Other pages listing many popular open data repositories:
    * [Wikipedia's list of machine learning datasets](https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research)
    * Quora.com
    * [the datasets subreddit](https://www.reddit.com/r/datasets/)

This example problem uses California housing data to model housing prices.

## Framing the Problem
Building a model is not the end goal; you need to ask what the actual objective is. How do you want to use and benefit from the model? Here, the model's output will be fed into another machine learning system, along with many other signals. That downstream system will determine whether or not it is worth investing in a given area.

NOTE: a piece of information fed into a machine learning model is often called a **signal**.


**Pipelines**: 'A sequence of data processing components is called a *data pipeline*. Each component in the pipeline can be preprocessing, an ML system, or some other data processing system. Components typically run asynchronously, and each component typically pulls in a large amount of data, processes it, and spits out the results in another data store.'
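
To make the idea concrete, here is a minimal sketch of two pipeline components that only communicate through data stores. Everything in it is an assumption for illustration: the stores are plain Python lists, the names `preprocess` and `predict` are hypothetical, and the "model" is a made-up rule rather than a trained system. A real pipeline would use databases or files and run the components asynchronously.

```python
# Hypothetical in-memory "data stores"; a real pipeline would use databases or files.
raw_store = [
    {"district": "A", "total_rooms": 880.0, "households": 126.0},
    {"district": "B", "total_rooms": 7099.0, "households": 1138.0},
]
feature_store = []
prediction_store = []


def preprocess(source, sink):
    """Component 1: read raw records, derive a feature, write to the next store."""
    for record in source:
        sink.append({
            "district": record["district"],
            "rooms_per_household": record["total_rooms"] / record["households"],
        })


def predict(source, sink):
    """Component 2: stand-in for an ML system (a made-up rule, not a trained model)."""
    for record in source:
        sink.append({
            "district": record["district"],
            "score": 10.0 * record["rooms_per_household"],
        })


# Run sequentially here; each component only touches its input and output store,
# so in principle they could run independently of each other.
preprocess(raw_store, feature_store)
predict(feature_store, prediction_store)
print(prediction_store)
```

Because each component depends only on the store it reads from, one component can be replaced or restarted without changing the others.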

When designing your system, you first need to frame your problem:
* Supervised, unsupervised, or reinforcement learning?
* Classification, regression, or something else?
* Batch learning or online learning?

The information you have so far: you are given *labeled* training examples, so this is a supervised learning task. It is a typical regression task; more specifically, it is a *multiple regression problem*, since the system uses multiple features to make a prediction. It is also a *univariate* problem, since we are making a single prediction (market value) per district. There is also no continuous flow of data into the system, so there is no need to adjust to rapidly changing data, i.e. batch learning should do just fine.

NOTE: if we were trying to predict multiple values per district, it would be a *multivariate* problem.


## Select a Performance Measure
The next step is to select a performance measure. A typical measure for regression is the **Root Mean Square Error (RMSE)**, which gives an idea of how much error the system typically makes in its predictions.

$$RMSE(\boldsymbol{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(h(x^{(i)}) - y^{(i)}\right)^{2}}$$

### Notations
This equation introduces many common machine learning notations that will be used throughout this book.
* $m$ is the number of instances in the dataset you are measuring the RMSE on.
* $x^{(i)}$ is a vector of all the feature values (excluding the label) of the $i$th instance in the dataset, and $y^{(i)}$ is its label.
* $\boldsymbol{X}$ (capital) is a matrix containing all the feature values. There is one instance per row, and the $i$th row is equal to the transpose of $x^{(i)}$.
* $h$ is the prediction function, also called a *hypothesis*. The output of a hypothesis is a predicted label $\hat{y}^{(i)}=h(x^{(i)})$.
* In general, a lowercase *italic* font is used for scalar values, such as $m$ or $y^{(i)}$, and function names, such as $h$. Uppercase bold letters are used for matrices, such as $\boldsymbol{X}$.
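
As a small illustration of this notation, the sketch below builds $x^{(1)}$, $x^{(2)}$, $y$ and $\boldsymbol{X}$ with NumPy. The values come from the first two rows of the housing data shown later in these notes; restricting each vector to four features (longitude, latitude, housing_median_age, median_income) is an arbitrary choice made here for brevity.

```python
import numpy as np

# Feature vectors x^(1) and x^(2): longitude, latitude,
# housing_median_age, median_income (an illustrative subset of the features).
x_1 = np.array([-122.23, 37.88, 41.0, 8.3252])
x_2 = np.array([-122.22, 37.86, 21.0, 8.3014])

# Labels y^(1) and y^(2): the median_house_value of each district.
y = np.array([452600.0, 358500.0])

# X stacks the transposed feature vectors, one instance per row.
X = np.vstack([x_1, x_2])

m = X.shape[0]  # number of instances
print(m)        # 2
print(X.shape)  # (2, 4)
```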
 
$RMSE(\boldsymbol{X}, h)$ is the cost function measured on a set of examples using hypothesis $h$. RMSE is generally the preferred performance measure for regression tasks. However, in some contexts you may prefer the **Mean Absolute Error (MAE)**, for example when there are many outliers in the data.

$$MAE(\boldsymbol{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left| h(x^{(i)})-y^{(i)} \right|$$
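
The two measures can be compared with a few lines of NumPy. This is a minimal sketch: the functions take the predictions $h(x^{(i)})$ directly rather than $\boldsymbol{X}$ and $h$, and the numbers are made up for illustration. Both sets of predictions have the same MAE, but concentrating the error in a single outlier doubles the RMSE.

```python
import numpy as np


def rmse(y_pred, y_true):
    """Root Mean Square Error between predictions and labels."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


def mae(y_pred, y_true):
    """Mean Absolute Error between predictions and labels."""
    return np.mean(np.abs(y_pred - y_true))


# Made-up labels and two made-up sets of predictions.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_spread = np.array([110.0, 190.0, 310.0, 390.0])   # every prediction is off by 10
y_outlier = np.array([100.0, 200.0, 300.0, 440.0])  # one prediction is off by 40

print(mae(y_spread, y_true), rmse(y_spread, y_true))    # 10.0 10.0
print(mae(y_outlier, y_true), rmse(y_outlier, y_true))  # 10.0 20.0
```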

These two measures are two of many *norms* that are possible:
* RMSE corresponds to the *Euclidean norm*. This is the notion of distance you are familiar with. It is also called the $l_{2}$ norm, noted $\|\cdot\|_{2}$ (or just $\|\cdot\|$).
* MAE corresponds to the *Manhattan norm*, or the $l_{1}$ norm, noted $\|\cdot\|_{1}$.
* In general, the $l_{k}$ norm of a vector $\boldsymbol{v}$ containing $n$ elements is defined as $\|\boldsymbol{v}\|_{k}=(\mid v_{1}\mid ^{k} + \mid v_{2}\mid ^{k} + \ldots + \mid v_{n} \mid ^{k})^{1/k}$.
* $l_{0}$ gives the number of nonzero elements in the vector, and $l_{\infty}$ gives the maximum absolute value in the vector.
* The higher the norm index, the more it focuses on large values and neglects smaller ones. This is why RMSE is more sensitive to outliers than MAE.
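
A short sketch of these norms on a made-up vector. The helper `lk_norm` is a hypothetical name that implements the general definition directly; `np.linalg.norm` is used only as a cross-check. As $k$ grows, the norm approaches the largest absolute value in the vector.

```python
import numpy as np


def lk_norm(v, k):
    """l_k norm computed straight from the definition (k >= 1)."""
    return np.sum(np.abs(v) ** k) ** (1.0 / k)


v = np.array([3.0, -4.0, 0.0])  # made-up example vector

l1 = lk_norm(v, 1)         # Manhattan norm: 3 + 4 = 7
l2 = lk_norm(v, 2)         # Euclidean norm: sqrt(9 + 16) = 5
l10 = lk_norm(v, 10)       # already close to the largest absolute value
l0 = np.count_nonzero(v)   # number of nonzero elements: 2
linf = np.max(np.abs(v))   # maximum absolute value: 4

# Cross-check against NumPy's built-in vector norms.
assert np.isclose(l1, np.linalg.norm(v, ord=1))
assert np.isclose(l2, np.linalg.norm(v, ord=2))
assert np.isclose(linf, np.linalg.norm(v, ord=np.inf))

print(l1, l2, l10, linf)
```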

## Check the Assumptions
It is good practice to verify, early on, all the assumptions that have been made; this can help you catch serious issues before they become a problem. What if the regressed values were to be converted into categories later in the pipeline? Then the problem would have been better framed as a classification task.
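
A minimal sketch of that scenario, assuming a downstream system that buckets prices into three categories. The thresholds and category names are hypothetical; the two prices are taken from the housing data. Once the values are bucketed, the exact predicted price no longer matters, only the category it lands in.

```python
import numpy as np

# Hypothetical category boundaries used by the downstream system.
thresholds = [150000.0, 300000.0]
categories = ["cheap", "medium", "expensive"]

# Two different predicted prices...
predicted_prices = np.array([352100.0, 341300.0])

# ...end up in the same category, so the precision of the regression is wasted.
indices = np.digitize(predicted_prices, thresholds)
labels = [categories[i] for i in indices]
print(labels)  # ['expensive', 'expensive']
```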

## Get the Data
The next step is to start collecting the data. This can be a messy process, but make sure you do not introduce sampling bias along the way. Missing data values are a fixable issue; non-representative data is much less so.

In a typical environment, your data (if already collected) would be in a relational database (or some other common data store). This project uses a single compressed file containing a CSV file called housing.csv.

## Create the Workspace
(In Python and similar languages, at least) it is good practice to set up a dedicated workspace directory and an isolated environment for each project. This avoids version conflicts between the libraries that different projects depend on. I personally use Anaconda. An alternative is to use virtualenv to create the separate environments.

## Downloading the Data
You should automate your tasks where possible. You could download this archive through a web browser, but a function makes the download repeatable and scales better to large quantities of data.

In [1]:
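#in a real project you should save this in its own python file, but oh well
import os
import tarfile
#import the submodule explicitly; a bare "import urllib" does not guarantee urllib.request is loaded
import urllib.request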

In [2]:
#location of the datasets folder
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
#combined url plus filename
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

In [7]:
def fetch_housing_data(housingUrl = HOUSING_URL, housingPath = HOUSING_PATH, fileName = "housing.tgz"):
    #make the folder for our (future) datasets
    os.makedirs(housingPath, exist_ok = True)
    tgzPath = os.path.join(housingPath, fileName)

    #download the archive to tgzPath
    urllib.request.urlretrieve(housingUrl, tgzPath)

    #extract the contents next to the archive; the with block closes the file even on error
    with tarfile.open(tgzPath) as housingTgz:
        housingTgz.extractall(path = housingPath)

In [8]:
fetch_housing_data()

In [9]:
#similarly, you should automate the loading process
import pandas as pd

In [16]:
def load_housing_data(housingPath = HOUSING_PATH, fileName = "housing.csv"):
    csvPath = os.path.join(housingPath, fileName)
    #TODO read up on encodings
    return pd.read_csv(csvPath, encoding='latin1')

In [17]:
housing = load_housing_data()

### A Quick Look at the Data
This dataset has ten attributes. Each row represents one district, and the attributes are the dimensions along which the model you choose can learn to make predictions.

In [18]:
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


The `info()` method can be used to get a quick description of the data.