# Machine Learning - Notes
This is the next step in data analytics. I hope you are excited too! Let's crack on.

## Table of Contents
- [Selecting and Filtering](#selecting-and-filtering)
- [Predicting](#predicting)
- [Bias Variance Trade Off](#bias-variance-trade-off)

## Selecting and Filtering

Column names can be listed using **dot** notation.
`my_data.columns`

**Sample** rows can be viewed using dot notation and parentheses.
`my_data.head()`

### Selecting
A single column, using dot notation:
`my_data.col1`

Multiple columns (like a SQL **select**), using square brackets and a list:
`my_data[['col1', 'col2']]`

Summary statistics of the data, using dot notation and parentheses:
`my_data.describe()`
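
The selectors above can be tried on a tiny hand-made frame. This is only a minimal sketch: `my_data` and the column names `col1`, `col2` and `col3` are placeholders made up for illustration, not a real dataset.

```python
import pandas as pd

# Hypothetical data; the column names are placeholders, not a real dataset.
my_data = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [10.0, 20.0, 30.0, 40.0],
    "col3": ["a", "b", "c", "d"],
})

column_names = list(my_data.columns)   # all column labels
first_rows = my_data.head(2)           # first 2 rows (the default is 5)
single = my_data.col1                  # one column, returned as a Series
subset = my_data[["col1", "col2"]]     # several columns, returned as a DataFrame
summary = my_data.describe()           # summary statistics of the numeric columns

print(column_names)
print(subset)
print(summary)
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute; `my_data["col1"]` works in every case.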


## Predicting 
*The prediction target is referred to as* **y** *and the data (features) as* **X**

The steps to building and using a model are:

* Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
* Fit: Capture patterns from provided data. This is the heart of modeling.
* Predict: Just what it sounds like
* Evaluate: Determine how accurate the model's predictions are.

### Linear Regression
Linear regression fits a line that passes as close to the observations as possible. Comparing the predicted y with the actual y gives us the errors.

y can be some column, say sale price.

X can be the columns that describe characteristics, like area, floors, pool, parking, park facing etc.

We need to divide our data into a training set and a testing set.

[Refer workbook](https://www.kaggle.com/iyadavvaibhav/machine-learning-linear-regression)

#### Define Model
First we define the model:
`my_model = DecisionTreeRegressor()`

#### Fit Model
Then we fit the model to the data:
`my_model.fit(X, y)`

#### Predict Values
We can predict using the following code:
`my_model.predict(X.head())`

**Conclusion**
We have built a decision tree model that can predict the prices of houses based on their characteristics.


## Model Validation
How good is the model?
We could measure this by predicting values for the training data, but that is not a good test because the model has already seen those values.
One common metric is *Mean Absolute Error*, also called **MAE**.

*error = actual - predicted*

We take the absolute value of each error and then find the average. In plain English:
*On average, our predictions are off by about X*
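
The MAE arithmetic can be checked by hand. The sketch below uses made-up `actual` and `predicted` values (they do not come from any dataset in these notes) and follows the definition step by step.

```python
import numpy as np

# Hypothetical actual and predicted values, chosen only for illustration.
actual = np.array([200.0, 150.0, 320.0, 275.0])
predicted = np.array([210.0, 140.0, 300.0, 275.0])

errors = actual - predicted      # signed error per observation: -10, 10, 20, 0
abs_errors = np.abs(errors)      # drop the sign so errors cannot cancel out
mae = abs_errors.mean()          # average absolute error

print(mae)  # 10.0
```

Here the predictions are off by 10 on average. Without the absolute value, the first two errors would cancel each other out and make the model look better than it is.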

### Computing Error
This can be calculated using the **metrics** module from sklearn.

`from sklearn import metrics`

**Predicting using train data and validation data**

We can use `train_test_split` to divide our data into a training subset and a test subset, then compute the error on the test subset.

`print('MAE',metrics.mean_absolute_error(y_test,predictions))`

### Summary
The entire process can be summarized as follows:

1. Take the data
2. Split into training and testing set
3. Create a linear model from sklearn
4. Fit the model on the training set
5. Look at `intercept_` and `coef_` to learn about the model
6. Predict the values from the model
7. Analyze the residuals and errors of the model
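
The seven steps above can be sketched end to end as follows. The data here is synthetic and generated inside the example (an assumption for illustration, not the housing data from the workbook), so the names `X`, `y` and `lm` are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# 1. Take the data: synthetic, generated from y = 3*x1 + 2*x2 + 5 plus small noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + 5 + rng.normal(0, 0.1, size=200)

# 2. Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 3-4. Create the linear model and fit it on the training set.
lm = LinearRegression()
lm.fit(X_train, y_train)

# 5. Inspect the fitted line.
print("intercept:", lm.intercept_)
print("coefficients:", lm.coef_)

# 6. Predict on the held-out test set.
predictions = lm.predict(X_test)

# 7. Analyze the residuals and errors.
residuals = y_test - predictions
test_mae = mean_absolute_error(y_test, predictions)
print("MAE:", test_mae)
```

Because the data was generated from a known line, `coef_` and `intercept_` should land close to 3, 2 and 5, which is a quick sanity check that the fit worked.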

## Bias Variance Trade Off
A fundamental topic in understanding model performance.

It is the point beyond which adding model complexity only fits noise. If we make the model more and more complex so that the line touches every training point, the training error goes down but the test error may go up.

Beyond the bias-variance trade-off point, the model begins to **overfit**.

Sometimes the training data is not very scattered and a simple model predicts well; otherwise it may be scattered with outliers, and we have to find the bias-variance trade-off point.

If we plot 'model complexity' against 'prediction error' for the test sample, the point at which the curve is lowest can be treated as the bias-variance trade-off point.
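
The complexity-versus-error comparison can be sketched in code. This is an illustration under stated assumptions: the data is a synthetic noisy sine curve generated in the example, and the number of leaves of a decision tree stands in for "model complexity".

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a sine curve plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(600, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=600)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

train_mae = {}
val_mae = {}
for max_leaf_nodes in [2, 5, 10, 50, 500]:
    # More leaves means a more complex model.
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    train_mae[max_leaf_nodes] = mean_absolute_error(train_y, model.predict(train_X))
    val_mae[max_leaf_nodes] = mean_absolute_error(val_y, model.predict(val_X))
    print(max_leaf_nodes,
          round(train_mae[max_leaf_nodes], 3),
          round(val_mae[max_leaf_nodes], 3))

# The complexity with the lowest validation error is the trade-off point.
best_leaves = min(val_mae, key=val_mae.get)
print("Lowest validation MAE at max_leaf_nodes =", best_leaves)
```

The training error keeps falling as leaves are added, because the largest tree can memorize the training points. The validation error typically stops improving at a moderate size and then gets worse, which is the overfitting described above.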

```python
from sklearn.metrics import mean_absolute_error

predicted_home_prices = my_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
```

```python
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Split data into training and validation data, for both predictors and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# Define model
my_model = DecisionTreeRegressor()
# Fit model
my_model.fit(train_X, train_y)

# Get predicted prices on validation data
val_predictions = my_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
```

**Conclusion:** We can now measure the quality of our model using MAE. Next is comparing models.

## Depth of Decision Tree
The depth of the decision tree and its number of leaves play a major role in determining the accuracy of the model. We can compare MAE against the number of leaves to find the best fit for our model.

Models can suffer from either:
* **Overfitting:** capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
* **Underfitting:** failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn't used in model training, to measure a candidate model's accuracy.

Let's compare MAE with number of leaves.




```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" % (max_leaf_nodes, my_mae))
```

**Conclusion:** Here we see that 50 leaf nodes gives the lowest MAE, so it is the sweet spot for our data.
Next, we look at a more advanced ML model.

## Random Forest
The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
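
The averaging can be checked directly. The sketch below is an illustration on synthetic data generated in the example. It relies on the fitted forest exposing its individual trees through the `estimators_` attribute, and compares their mean prediction with what the forest itself predicts.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 3))
y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(0, 1, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X, y)

new_X = X[:5]
# Prediction of each individual tree, shape (n_trees, n_samples).
per_tree = np.array([tree.predict(new_X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(new_X)

# The two should agree up to floating-point rounding.
matches = np.allclose(manual_average, forest_prediction)
print(matches)
```

Each tree is trained on a different random resample of the data, so the individual predictions differ. Averaging them smooths out the errors of any single tree.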


```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
```

**Conclusion:** One of the best features of Random Forest models is that they generally work reasonably even without tuning.

## Competition

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Read the data
train = pd.read_csv('../input/train.csv')

# pull data into target (y) and predictors (X)
train_y = train.SalePrice
predictor_cols = ['LotArea', 'OverallQual', 'YearBuilt', 'TotRmsAbvGrd']

# Create training predictors data
train_X = train[predictor_cols]

my_model = RandomForestRegressor()
my_model.fit(train_X, train_y)
```

```python
# Read the test data
test = pd.read_csv('../input/test.csv')
# Treat the test data in the same way as training data. In this case, pull same columns.
test_X = test[predictor_cols]
# Use the model to make predictions
predicted_prices = my_model.predict(test_X)
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)
```

```python
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)
```