## 2. Choosing the right estimator/algorithm for your problem

Some things to note:
    
* Sklearn refers to machine learning models and algorithms as estimators.
* Classification problem - predicting a category (heart disease or not)
    * Sometimes you'll see `clf` (short for classifier) used as the variable name for a classification estimator
* Regression problem - predicting a number (selling price of a car)

If you're working on a machine learning problem and want to use Sklearn but aren't sure which model to use, refer to the sklearn machine learning map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
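
To make the classification vs regression split concrete, here is a minimal sketch showing that both kinds of estimator share the same `fit()` / `score()` workflow. The synthetic data from `make_classification` and `make_regression` and the helper `fit_and_score` are assumptions for illustration only, not part of the course data.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the course data)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)


def fit_and_score(estimator, X, y):
    """Hypothetical helper: split the data, fit on train, score on test."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    estimator.fit(X_train, y_train)
    return estimator.score(X_test, y_test)


# For classifiers, score() returns mean accuracy
clf_score = fit_and_score(LogisticRegression(), X_clf, y_clf)

# For regressors, score() returns the coefficient of determination (R^2)
reg_score = fit_and_score(Ridge(), X_reg, y_reg)

print(f"classifier accuracy: {clf_score:.3f}")
print(f"regressor R^2: {reg_score:.3f}")
```

The same three steps (instantiate, `fit`, `score`) apply to whichever estimator the map points you to.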

## 2.1 Picking a machine learning model for a regression problem

Let's use the California Housing dataset  

https://scikit-learn.org/stable/datasets/toy_dataset.html  

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

In [10]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [11]:
# Get California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [12]:

housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [13]:
# We want to use the feature variables to predict the target variable
# The target value is the median house value (MedHouseVal)
housing_df["target"] = housing["target"]
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [None]:
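# errors="ignore" avoids a KeyError here: the target column was added above as "target", not "MedHouseVal"
housing_df = housing_df.drop("MedHouseVal", axis=1, errors="ignore")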
housing_df

In [22]:
# Import the train/test split helper and the algorithm / estimator
from sklearn.model_selection import train_test_split
# https://www.youtube.com/watch?v=Yj7sIK0VMg0
from sklearn.linear_model import Ridge

# Setup random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"]  # median house price in $100,000s

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the model (on the training set)
model = Ridge()
model.fit(X_train, y_train)

# Check the score of the model (on the test set)
model.score(X_test, y_test)

0.5758549611440125

# What is the coefficient of determination (R squared)?
> We'll dive deeper into different regression evaluation metrics later, but what you should know now is that the default evaluation metric for regression models in scikit-learn is the R squared (R^2) value.
> In the simplest case of one feature and one target, it reflects how strong the linear relationship between the two variables is. In our case we have more than two variables, but the idea is the same: it looks at the relationship between the feature variables and the target variable and asks how predictive our model is, that is, given these features, how much of the variation in the target it is able to explain.
> That's a lot, so just keep this in mind: the best possible value is 1.0, a model that always predicts the mean of the target scores 0.0, and a model that does worse than that can score below 0. Higher is better.

https://www.youtube.com/watch?v=MDNuFbvc6Vo
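
To see where the R^2 number comes from, here is a minimal sketch that computes it by hand as 1 minus (residual sum of squares / total sum of squares). The function name `r_squared` and the toy values are made up for illustration; in practice `model.score()` does this for you.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left over after predicting
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot


# Toy target values (made up for illustration)
y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)                       # every prediction is exact
mean_only = r_squared(y_true, np.full(4, y_true.mean()))  # always predict the mean
worse = r_squared(y_true, y_true[::-1])                   # predictions in reversed order

print(perfect, mean_only, worse)
```

The three cases land at 1.0, 0.0 and a negative value, which is why "higher is better" holds but 0 is not a hard floor.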

# Say we have this value, 0.5758549611440125. How might we improve it? Let's think about what data we have.
How do we improve machine learning models? One way would be to add more data.
Another way would be to choose a different model and try it out: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Remember, a lot of this is experimental. So far we have only tried one of the green boxes on that map.
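
One way to run that kind of experiment is to loop over a few candidate estimators and compare their test scores. This is only a sketch: it uses synthetic data from `make_regression` as a stand-in (an assumption, so the numbers say nothing about the housing data), and the three candidates are picked purely for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption, not the California Housing dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate estimators, chosen only for illustration
candidates = {
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "SVR (linear kernel)": SVR(kernel="linear"),
}

# Fit each model on the training set and record its R^2 on the test set
results = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    results[name] = candidate.score(X_test, y_test)

# Print the candidates from best to worst score
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: R^2 = {score:.3f}")
```

Which model comes out on top depends on the data, which is why trying a few is worthwhile.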

What if `Ridge` didn't work or the score didn't fit our needs?

Well, we could always try a different model...

How about we try an ensemble model (an ensemble is a combination of smaller models that tries to make better predictions than a single model)?

Sklearn's ensemble models can be found here: https://scikit-learn.org/stable/modules/ensemble.html

The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.        

Generalizability is how well a model predicts on unseen data.


## The main takeaway of ensemble models compared to other models:
  An ensemble combines the predictions of several models rather than relying on one model.
  It's like making a decision: ten doctors are probably better at making a diagnosis than one doctor. That's the whole goal of ensemble methods.
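
The "ten doctors" idea can be sketched with plain NumPy. Here each "model" is simulated as the true value plus its own independent random noise (a simplifying assumption; real models tend to make correlated errors, so real gains are usually smaller), and the ensemble prediction is just the average of the ten.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up "true" target values for 1,000 samples
true_values = rng.uniform(0.0, 5.0, size=1000)

# Ten simulated models: each one predicts the truth plus independent noise
n_models = 10
predictions = true_values + rng.normal(0.0, 1.0, size=(n_models, 1000))

# Mean squared error of each individual model
individual_mse = np.mean((predictions - true_values) ** 2, axis=1)

# Mean squared error of the ensemble (average of the ten predictions)
ensemble_prediction = predictions.mean(axis=0)
ensemble_mse = np.mean((ensemble_prediction - true_values) ** 2)

print(f"best single model MSE: {individual_mse.min():.3f}")
print(f"ensemble MSE: {ensemble_mse:.3f}")
```

Averaging cancels out some of the individual mistakes, so the ensemble error ends up lower than that of any single simulated model.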

# If you'd like a deeper dive on random forests, they're based on decision trees
A random forest is a combination of lots and lots of different decision trees.

An ensemble is a combination of different models used to make one prediction. A very common and very powerful machine learning algorithm that leverages the ensemble technique is `RandomForestRegressor`.
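
To check the "forest is a combination of trees" idea, here is a small sketch that fits a `RandomForestRegressor`, asks each fitted tree in `estimators_` for its own prediction, and compares the average with the forest's prediction. The synthetic data and the choice of 20 trees are assumptions to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption, not the California Housing dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# A small forest of 20 trees to keep the example fast
forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(X, y)

# Each fitted decision tree lives in forest.estimators_
n_trees = len(forest.estimators_)

# Collect every tree's prediction for the first 5 samples, then average them
tree_predictions = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

# The forest's own prediction for the same samples
forest_prediction = forest.predict(X[:5])
matches = np.allclose(manual_average, forest_prediction)

print(f"number of trees: {n_trees}")
print(f"forest prediction equals the average of its trees: {matches}")
```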

In [18]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create random forest model
model = RandomForestRegressor()  # by default, 100 decision trees predict on our data, versus the single Ridge model
model.fit(X_train, y_train)

# Check the score of the model (on the test set)
model.score(X_test, y_test)

0.8065734772187598

In [None]:
# It has about 20,000 samples to go through, so the forest has to make lots of decisions across all of those samples