# 1. Framing

# 2. Descending into ML


**Linear regression** - models a linear relationship between a feature and a label. The familiar line equation y = mx + b is rewritten for ML as y' = b + w1x1, where y' is the predicted label, b is the bias, w1 is the weight of feature 1, and x1 is the feature value.
**L2 Loss** = (observation - prediction)^2  
The square of the difference between the observation and the prediction for a single example. During training we minimize the loss summed over all the examples (empirical risk minimization).
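
As a small illustration of the definitions above, here is a sketch of the linear model and its L2 loss in NumPy. The function names (`predict`, `l2_loss`, `mse`) and the toy data are made up for this example and are not part of the course material.

```python
import numpy as np

def predict(x, w1, b):
    """Linear model: y' = b + w1 * x1."""
    return b + w1 * x

def l2_loss(y_true, y_pred):
    """Sum of squared differences over all examples."""
    return float(np.sum((y_true - y_pred) ** 2))

def mse(y_true, y_pred):
    """Mean squared error: the L2 loss averaged over the examples."""
    return float(np.mean((y_true - y_pred) ** 2))

# Toy data generated from y = 1 + 2x (an assumption for illustration).
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

good = predict(x, w1=2.0, b=1.0)  # matches the data exactly
poor = predict(x, w1=1.0, b=0.0)  # errors of 2, 3 and 4

print(l2_loss(y, good))  # 0.0
print(l2_loss(y, poor))  # 29.0 (4 + 9 + 16)
```

The model with the lower loss is the better fit, which is what Figure 3 contrasts.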

Figure 3. High loss in the left model; low loss in the right model.

# 3. Reducing Loss

Gradient descent takes repeated small steps in the direction that reduces loss; each step is a gradient step. The gradient is the derivative of the loss function (L2, MSE, etc.) with respect to the model parameters, and each step moves the parameters in the direction of the negative gradient.

- Loss minimization demo - https://developers.google.com/machine-learning/crash-course/fitter/graph
- Learning rate convergence - https://developers.google.com/machine-learning/crash-course/reducing-loss/playground-exercise
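
A minimal sketch of full-batch gradient descent for the one-feature linear model, assuming MSE as the loss. The learning rate, step count, and toy data are arbitrary choices for illustration, not values from the course.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.05, steps=2000):
    """Fit y' = b + w * x by stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        error = (b + w * x) - y
        # Partial derivatives of mean((b + w*x - y)^2) w.r.t. w and b.
        grad_w = 2.0 * np.mean(error * x)
        grad_b = 2.0 * np.mean(error)
        # Move each parameter a small step in the negative gradient direction.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 1 + 2x (an assumption for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

w, b = gradient_descent(x, y)
print(round(w, 4), round(b, 4))
```

A learning rate that is too large makes the steps overshoot and diverge; one that is too small makes convergence slow, which is what the learning rate exercise linked above explores.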

Mini-batches - we could compute the gradient over the entire data set on each step, but this turns out to be unnecessary. Computing the gradient on a small data sample works well, i.e. on every step, draw a new random sample.

- Stochastic gradient descent - one example at a time.
- Mini-batch gradient descent - batches of 10-1000 examples.
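
The two variants above differ only in how many examples feed each gradient estimate. The sketch below draws a fresh random batch on every step; `minibatch_sgd`, its defaults, and the noise-free toy data are assumptions made for this example.

```python
import numpy as np

def minibatch_sgd(x, y, batch_size=10, learning_rate=0.05, steps=3000, seed=0):
    """Fit y' = b + w * x, estimating the gradient from a random batch each step."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Draw a new random sample of examples for this step.
        idx = rng.choice(len(x), size=batch_size, replace=False)
        xb, yb = x[idx], y[idx]
        error = (b + w * xb) - yb
        w -= learning_rate * 2.0 * np.mean(error * xb)
        b -= learning_rate * 2.0 * np.mean(error)
    return w, b

# Noise-free toy data from y = 1 + 2x (an assumption for illustration).
data_rng = np.random.default_rng(42)
x = data_rng.uniform(0.0, 4.0, size=200)
y = 1.0 + 2.0 * x

w, b = minibatch_sgd(x, y)
print(round(w, 3), round(b, 3))
```

Setting `batch_size=1` gives stochastic gradient descent in the one-example-at-a-time sense; larger batches give a less noisy gradient estimate at a higher cost per step.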



## 4. First steps with TF:


## 5. Generalization
Generalization refers to your model's ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.

## 6. Training and test sets:
Large training set -> better trained model.
Large test set -> more confidence in the evaluation metrics.

# 7. Validation 
**Learning Objectives:**
  * Use multiple features, instead of a single feature, to further improve the effectiveness of a model
  * Debug issues in model input data
  * Use a test data set to check if a model is overfitting the validation data.
  
  
First off, let's load up and prepare our data. This time, we're going to work with multiple features, so we'll modularize the logic for preprocessing the features a bit:

In [1]:
import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
#from tensorflow.python.data import Dataset  # importError

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")

# california_housing_dataframe = california_housing_dataframe.reindex(
#     np.random.permutation(california_housing_dataframe.index))

In [5]:
california_housing_dataframe.describe()  # basic statistics of the data

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 |
| mean | -119.6 | 35.6 | 28.6 | 2643.7 | 539.4 | 1429.6 | 501.2 | 3.9 | 207300.9 |
| std | 2.0 | 2.1 | 12.6 | 2179.9 | 421.5 | 1147.9 | 384.5 | 1.9 | 115983.8 |
| min | -124.3 | 32.5 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.5 | 14999.0 |
| 25% | -121.8 | 33.9 | 18.0 | 1462.0 | 297.0 | 790.0 | 282.0 | 2.6 | 119400.0 |
| 50% | -118.5 | 34.2 | 29.0 | 2127.0 | 434.0 | 1167.0 | 409.0 | 3.5 | 180400.0 |
| 75% | -118.0 | 37.7 | 37.0 | 3151.2 | 648.2 | 1721.0 | 605.2 | 4.8 | 265000.0 |
| max | -114.3 | 42.0 | 52.0 | 37937.0 | 6445.0 | 35682.0 | 6082.0 | 15.0 | 500001.0 |


In [4]:
california_housing_dataframe.head(3)  # printing first 3 rows 

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.3 | 34.2 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.5 | 66900.0 |
| 1 | -114.5 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8 | 80100.0 |
| 2 | -114.6 | 33.7 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.7 | 85700.0 |


In [21]:
california_housing_dataframe.shape

(17000, 9)

In [6]:
def preprocess_features(california_housing_dataframe):
    """Prepares input features from California housing data set.

    Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
    Returns:
    A DataFrame that contains the features to be used for the model, including
    synthetic features.
    """
    ## slicing dataframe i.e. the features we need to use
    selected_features = california_housing_dataframe[
    ["latitude",
     "longitude",
     "housing_median_age",
     "total_rooms",
     "total_bedrooms",
     "population",
     "households",
     "median_income"]]
    processed_features = selected_features.copy()
    # Create a synthetic feature.
    processed_features["rooms_per_person"] = (california_housing_dataframe["total_rooms"] /
                                              california_housing_dataframe["population"])
    return processed_features

def preprocess_targets(california_housing_dataframe):
    """Prepares target features (i.e., labels) from California housing data set.

    Args:
    california_housing_dataframe: A Pandas DataFrame expected to contain data
      from the California housing data set.
    Returns:
    A DataFrame that contains the target feature.
    """
    output_targets = pd.DataFrame()
    # Scale the target to be in units of thousands of dollars.
    output_targets["median_house_value"] = (california_housing_dataframe["median_house_value"] / 1000.0)
    return output_targets

In [15]:
# Select features to use - the first 12,000 examples - out of total 17,000
training_examples = preprocess_features(california_housing_dataframe.head(12000))
training_examples.shape


(12000, 9)

In [16]:
training_examples.describe()

| | latitude | longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | rooms_per_person |
|---|---|---|---|---|---|---|---|---|---|
| count | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 |
| mean | 34.6 | -118.5 | 27.5 | 2655.7 | 547.1 | 1476.0 | 505.4 | 3.8 | 1.9 |
| std | 1.6 | 1.2 | 12.1 | 2258.1 | 434.3 | 1174.3 | 391.7 | 1.9 | 1.3 |
| min | 32.5 | -121.4 | 1.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.5 | 0.0 |
| 25% | 33.8 | -118.9 | 17.0 | 1451.8 | 299.0 | 815.0 | 283.0 | 2.5 | 1.4 |
| 50% | 34.0 | -118.2 | 28.0 | 2113.5 | 438.0 | 1207.0 | 411.0 | 3.5 | 1.9 |
| 75% | 34.4 | -117.8 | 36.0 | 3146.0 | 653.0 | 1777.0 | 606.0 | 4.6 | 2.3 |
| max | 41.8 | -114.3 | 52.0 | 37937.0 | 5471.0 | 35682.0 | 5189.0 | 15.0 | 55.2 |


In [17]:
# the target column - to prepare data for training.
training_targets = preprocess_targets(california_housing_dataframe.head(12000))
training_targets.shape

(12000, 1)

In [18]:
training_targets.describe()

| | median_house_value |
|---|---|
| count | 12000.0 |
| mean | 198.0 |
| std | 111.9 |
| min | 15.0 |
| 25% | 117.1 |
| 50% | 170.5 |
| 75% | 244.4 |
| max | 500.0 |


For the **validation set**, we'll choose the last 5000 examples out of the total of 17000.

In [14]:
validation_examples = preprocess_features(california_housing_dataframe.tail(5000))
validation_examples.describe()

validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))
validation_targets.describe()

| | median_house_value |
|---|---|
| count | 5000.0 |
| mean | 229.5 |
| std | 122.5 |
| min | 15.0 |
| 25% | 130.4 |
| 50% | 213.0 |
| 75% | 303.2 |
| max | 500.0 |
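
Note that the training targets have a mean of 198.0 while the validation targets have a mean of 229.5, so the two splits do not look like draws from the same distribution. One plausible cause, stated here as an assumption, is that the rows are ordered (the first rows printed earlier all have longitude near -114) and that the shuffle in the loading cell is commented out. The sketch below uses a small made-up DataFrame, not the real housing data, to show how shuffling before taking `head`/`tail` removes this kind of gap.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an ordered data set: rows sorted by longitude,
# with the target drifting along that ordering.
toy = pd.DataFrame({
    "longitude": np.linspace(-114.3, -124.3, 1000),
    "median_house_value": np.linspace(100.0, 300.0, 1000),
})

def split_means(frame, n_train=700, n_val=300):
    """Mean target in the head (training) and tail (validation) splits."""
    train = frame.head(n_train)
    val = frame.tail(n_val)
    return train["median_house_value"].mean(), val["median_house_value"].mean()

# Splitting the ordered frame: the two means differ a lot.
train_mean, val_mean = split_means(toy)
ordered_gap = abs(train_mean - val_mean)

# Shuffle the rows first, then split the same way.
rng = np.random.default_rng(0)
shuffled = toy.iloc[rng.permutation(len(toy))]
train_mean, val_mean = split_means(shuffled)
shuffled_gap = abs(train_mean - val_mean)

print(ordered_gap, shuffled_gap)
```

Comparing `describe()` output across splits, as done above for the targets, is a cheap way to catch this kind of input-data issue before training.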


# 8. Representation:
Data usually come from heterogeneous sources, and we need to turn them into feature vectors (the rows we feed to the model/trainer). This process of extracting useful features from raw data is called feature engineering.

Using non-numeric features (addresses, categorical features) - **One-hot encoding**.
- A one-hot encoding is a representation of categorical variables as binary vectors.
- This first requires that the categorical values be mapped to integer values.
- Then, each integer value is represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1.
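
The steps above can be sketched in a few lines of NumPy. The helper `one_hot_encode` and the street names are invented for this example; the vocabulary is sorted only to make the integer mapping deterministic.

```python
import numpy as np

def one_hot_encode(values):
    """Map categorical values to integers, then to one-hot binary vectors."""
    # Step 1: map each distinct category to an integer index.
    vocabulary = sorted(set(values))
    index_of = {value: i for i, value in enumerate(vocabulary)}
    # Step 2: one row per example, all zeros except a 1 at the category index.
    encoded = np.zeros((len(values), len(vocabulary)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index_of[value]] = 1
    return vocabulary, encoded

streets = ["Main St", "Oak Ave", "Main St", "Pine Rd"]
vocabulary, encoded = one_hot_encode(streets)

print(vocabulary)  # ['Main St', 'Oak Ave', 'Pine Rd']
print(encoded)
```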



**Characteristics of a good feature**:
- A feature should occur with a non-zero value at least a handful of times in our data set.
- Should have a clear and obvious meaning.
- Should not have extreme outlier values (remove outliers as a pre-processing step).
- Some raw values, e.g. latitude and longitude, have no clear linear correlation with the target (e.g. house prices). Here we can bucket them into bins so that the model can learn a per-bin effect, e.g. prices are higher in the San Diego region (a single bin).
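
A sketch of the bucketing idea from the last point, using `np.digitize`. The bin edges (2-degree latitude bands) and the sample latitudes are arbitrary choices for illustration, not values from the course.

```python
import numpy as np

# Sample latitudes within the range seen in the data set (32.5 to 42.0).
latitude = np.array([32.6, 33.9, 34.2, 37.7, 41.8])

# Assumed interior bin boundaries: below 34, 34-36, 36-38, 38-40, 40 and above.
boundaries = np.array([34.0, 36.0, 38.0, 40.0])

# np.digitize returns, for each value, the index of the bin it falls into.
bucket = np.digitize(latitude, boundaries)

# Represent each bucket as a one-hot vector so the model learns one weight per bin.
num_buckets = len(boundaries) + 1
latitude_one_hot = np.eye(num_buckets, dtype=int)[bucket]

print(bucket)  # [0 0 1 2 4]
print(latitude_one_hot)
```

With this representation the model no longer has to assume that house prices rise or fall steadily with latitude; each band gets its own weight.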



**Feature scaling** -
Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1). If a feature set consists of only a single feature, then scaling provides little to no practical benefit. If, however, a feature set consists of multiple features, then feature scaling provides the following benefits:
- Helps gradient descent converge more quickly.
- Helps avoid the "NaN trap," in which one number in the model becomes a NaN (e.g., when a value exceeds the floating-point precision limit during training), and—due to math operations—every other number in the model also eventually becomes a NaN.
- Helps the model learn appropriate weights for each feature. Without feature scaling, the model will pay too much attention to the features having a wider range.


You don't have to give every floating-point feature exactly the same scale. Nothing terrible will happen if Feature A is scaled from -1 to +1 while Feature B is scaled from -3 to +3. However, your model will react poorly if Feature B is scaled from 5000 to 100000. 
A commonly used scaling technique is the **Z-score** (normalization): **"(value - mean) / std.dev"**
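
Both kinds of scaling mentioned above can be sketched as follows. The helper names and the sample values (taken from the 100 to 900 range used as an example earlier) are assumptions for illustration.

```python
import numpy as np

def z_score(values):
    """Z-score scaling: (value - mean) / std.dev."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def min_max(values):
    """Linear scaling of the natural range onto 0 to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

feature = np.array([100.0, 300.0, 500.0, 700.0, 900.0])

z_scaled = z_score(feature)
unit_scaled = min_max(feature)

print(z_scaled)     # mean 0, standard deviation 1
print(unit_scaled)  # [0.   0.25 0.5  0.75 1.  ]
```

In practice the mean and standard deviation would be computed on the training set and reused for validation and test data, so that all splits are scaled the same way.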



## 9. Feature cross:
A feature cross is a synthetic feature formed by multiplying (crossing) two or more features. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually.
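
A minimal sketch of two kinds of cross: multiplying two numeric features, and crossing two one-hot (bucketed) features so that each combination of buckets gets its own column. All values and bucket counts here are made up for illustration.

```python
import numpy as np

# Cross of two numeric features: element-wise product.
x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 6.0])
x1_x2 = x1 * x2  # [4., 10., 18.]

# Cross of two bucketed features (hypothetical bucket indices per example).
lat_bucket = np.array([0, 1, 1])  # 2 latitude buckets
lon_bucket = np.array([2, 0, 2])  # 3 longitude buckets
lat_one_hot = np.eye(2, dtype=int)[lat_bucket]
lon_one_hot = np.eye(3, dtype=int)[lon_bucket]

# Per-example outer product, flattened: one column per (lat, lon) pair.
crossed = np.einsum("ni,nj->nij", lat_one_hot, lon_one_hot).reshape(len(lat_bucket), -1)

print(x1_x2)
print(crossed)  # shape (3, 6), exactly one 1 per row
```

The crossed one-hot vector lets a linear model learn a separate weight for each latitude-longitude cell, something neither bucketed feature can express alone.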