## California Housing Price Revisit

__The dataset:__
1. longitude: A measure of how far west a house is; a lower (more negative) value is farther west
2. latitude: A measure of how far north a house is; a higher value is farther north
3. housingMedianAge: Median age of a house within a block; a lower number is a newer building
4. totalRooms: Total number of rooms within a block
5. totalBedrooms: Total number of bedrooms within a block
6. population: Total number of people residing within a block
7. households: Total number of households within a block, where a household is a group of people residing within a home unit
8. medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
9. medianHouseValue: Median house value for households within a block (measured in US Dollars)
10. oceanProximity: Location of the house relative to the ocean/sea (a categorical feature)

### EDA Review
- Histogram 
- Correlation Matrix

### Feature Engineering 
- Create a processing pipeline for categorical features.
- Combine the two processing pipelines with [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)
- Create new features

### Pipeline 
- Build a pipeline with a decision tree regressor and a random forest regressor
- Experiment with cross validation
- Use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find the best parameters 


In [None]:
import pandas as pd

df = pd.read_csv("../data/housing.csv")
df.info()

Task 1. 

- Find which columns have missing values. 
- Plot a histogram to see the value distribution for `median_house_value`
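
One possible approach to Task 1 is sketched below. It runs on a small toy DataFrame with made-up values (an assumption, so the snippet does not depend on `housing.csv`), and it computes the histogram counts with `np.histogram` instead of drawing a figure.

```python
import numpy as np
import pandas as pd

# Toy stand-in for housing.csv: hypothetical values, same column names
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0, np.nan, 213.0],
    "households": [126, 1138, 177, 219, 259, 193],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0, 269700.0],
})

# Count missing values per column and keep only the columns that have any
missing_counts = toy.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(columns_with_missing)

# Bin the target values; in a notebook with matplotlib installed,
# toy["median_house_value"].hist(bins=50) draws the same kind of counts as a plot
counts, bin_edges = np.histogram(toy["median_house_value"], bins=3)
print(counts)
```

On the real data, the same two steps apply to `df` directly.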

Task 2.

- Print out correlation matrix for all features, including the `median_house_value`

- Find out which feature is the most correlated feature to `median_house_value`
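
A minimal sketch of Task 2 follows, again on toy data. The toy target is deliberately built from `median_income`, so the ranking it prints is a property of the toy data only and not a result for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); the target is constructed from median_income
rng = np.random.default_rng(0)
income = rng.uniform(1, 10, size=200)
toy = pd.DataFrame({
    "median_income": income,
    "housing_median_age": rng.integers(1, 52, size=200).astype(float),
    "median_house_value": 40000 * income + rng.normal(0, 20000, size=200),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY"], size=200),
})

# Correlation is only defined for numeric columns, so select those first
corr_matrix = toy.select_dtypes(include="number").corr()

# Rank the features by correlation with the target, excluding the target itself
target_corr = corr_matrix["median_house_value"].drop("median_house_value")
most_correlated = target_corr.abs().idxmax()
print(target_corr.sort_values(ascending=False))
print(most_correlated)
```

Taking the absolute value before `idxmax()` means a strong negative correlation would also count as "most correlated".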

Task 3.

- Create `model_features` and `model_target` column indexes for the feature columns and the target column
- Create `numerical_features_all` and `categorical_features_all` to index the numerical and categorical features.
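
The column indexes for Task 3 could be built by dtype, as in the sketch below. The name `model_target` and the spelling `categorical_features_all` are assumptions made for this example, and the DataFrame is a two-row toy stand-in.

```python
import pandas as pd

# Two-row toy stand-in with the same column naming as housing.csv
toy = pd.DataFrame({
    "longitude": [-122.23, -122.22],
    "median_income": [8.3252, 8.3014],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY"],
    "median_house_value": [452600.0, 358500.0],
})

model_target = "median_house_value"  # assumed name for the target column
model_features = toy.columns.drop(model_target)

# Split the feature columns into numerical and categorical groups by dtype
numerical_features_all = toy[model_features].select_dtypes(include="number").columns
categorical_features_all = toy[model_features].select_dtypes(exclude="number").columns
print(list(numerical_features_all), list(categorical_features_all))
```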

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline

Task 4. Create two processing pipelines for numerical and categorical features. 

- For categorical features, use `strategy='constant', fill_value='missing'` for the SimpleImputer

- Use `OneHotEncoder()` for one hot encoding. 
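
The two processors for Task 4 might look like the sketch below. The mean imputation strategy for numerical features, the `handle_unknown="ignore"` option, the step names, and the name `categorical_processor` are all assumptions; only the categorical imputer settings and the use of `OneHotEncoder` and `MinMaxScaler` come from the task and the imports above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Numerical features: fill gaps with the column mean (assumed), then scale to [0, 1]
numerical_processor = Pipeline([
    ("num_imputer", SimpleImputer(strategy="mean")),
    ("num_scaler", MinMaxScaler()),
])

# Categorical features: fill gaps with the literal string 'missing', then one-hot encode
categorical_processor = Pipeline([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("cat_encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Tiny toy inputs (hypothetical values) to show what each processor produces
num_toy = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0]})
cat_toy = pd.DataFrame({"ocean_proximity": pd.Series(["INLAND", np.nan, "NEAR BAY"], dtype=object)})

num_out = numerical_processor.fit_transform(num_toy)
cat_out = categorical_processor.fit_transform(cat_toy).toarray()
print(num_out.ravel())
print(cat_out)
```

With the constant strategy, a missing category becomes its own one-hot column instead of being dropped or guessed.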

Task 5. 

- Combine the two pipelines in Task 4 with a `ColumnTransformer`
- Add a decision tree model as the estimator to create the full pipeline. 
- Train and evaluate the pipeline using the `mean_squared_error` metric

In [None]:
from sklearn.compose import ColumnTransformer

data_preprocessor = ColumnTransformer([
    ('numerical_pre', numerical_processor, numerical_features_all),
    # fill in here
]) 

In [None]:
# The Pipeline takes all the data transformers, followed by an estimator as the last step
# Each step gets a name, which you can later use to set or read its parameters (for hyperparameter tuning, for example)
pipeline = Pipeline([
    # fill in here
])

# Visualize the pipeline
from sklearn import set_config
set_config(display='diagram')
pipeline

In [None]:
from sklearn.model_selection import train_test_split

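train_data, test_data = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)
# Separate the features from the target; the ColumnTransformer selects the feature columns it needs by name
X_train, y_train = train_data.drop(columns="median_house_value"), train_data["median_house_value"]
X_test, y_test = test_data.drop(columns="median_house_value"), test_data["median_house_value"]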

In [None]:
from sklearn.metrics import mean_squared_error
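
Putting Task 5 together, a self-contained sketch on toy data could look like this. The imputation strategy, the step names other than `numerical_pre` and `dt`, and the toy columns are assumptions; the step name `dt` is chosen to match the `dt__` prefix used in the parameter grid later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical values) with one numerical and one categorical feature
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "median_income": rng.uniform(1, 10, size=n),
    "ocean_proximity": pd.Series(rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], size=n), dtype=object),
})
toy["median_house_value"] = 40000 * toy["median_income"] + rng.normal(0, 10000, size=n)

numerical_processor = Pipeline([
    ("num_imputer", SimpleImputer(strategy="mean")),  # assumed strategy
    ("num_scaler", MinMaxScaler()),
])
categorical_processor = Pipeline([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("cat_encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own processor
data_preprocessor = ColumnTransformer([
    ("numerical_pre", numerical_processor, ["median_income"]),
    ("categorical_pre", categorical_processor, ["ocean_proximity"]),
])

# Preprocessing followed by the estimator; 'dt' is the name the grid search refers to
pipeline = Pipeline([
    ("data_preprocessing", data_preprocessor),
    ("dt", DecisionTreeRegressor(random_state=42)),
])

train_data, test_data = train_test_split(toy, test_size=0.2, shuffle=True, random_state=42)
X_train, y_train = train_data.drop(columns="median_house_value"), train_data["median_house_value"]
X_test, y_test = test_data.drop(columns="median_house_value"), test_data["median_house_value"]

pipeline.fit(X_train, y_train)
test_predictions = pipeline.predict(X_test)
test_mse = mean_squared_error(y_test, test_predictions)
print(f"Test RMSE: {np.sqrt(test_mse):.0f}")
```

Because the preprocessing lives inside the pipeline, the imputer and scaler are fitted on the training split only and then reused unchanged on the test split.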

Task 6. 

- Use 5-fold cross validation to validate the training of the pipeline. 

- Use `GridSearchCV` to optimize the hyperparameters of the decision tree model: `max_depth`, `min_samples_leaf` and `min_samples_split`

- Evaluate the final model obtained from the parameter tuning. 

In [None]:
import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=5)
rmse_scores = np.sqrt(-scores)
rmse_scores

In [None]:
from sklearn.model_selection import GridSearchCV

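# Parameter grid for GridSearch
# The 'dt__' prefix targets the pipeline step named 'dt', so the estimator step must use that name
param_grid = {'dt__max_depth': [50, 100, 150, 200],
              'dt__min_samples_leaf': [5, 10, 15, 20],
              'dt__min_samples_split': [2, 5, 15, 20]
             }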

grid_search = GridSearchCV(pipeline, # Base model
                           param_grid, # Parameters to try
                           cv = 5, # Apply 5-fold cross validation
                           verbose = 1, # Print summary
                           n_jobs = -1 # Use all available processors
                          )

# Fit the GridSearch to our training data
grid_search.fit(X_train, y_train)

In [None]:
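print(grid_search.best_params_)
# No `scoring` was passed to GridSearchCV, so best_score_ is the mean cross-validated R^2 (the regressor default), not an MSE
print(grid_search.best_score_)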
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)

In [None]:
final_model = grid_search.best_estimator_
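
The last step of Task 6, evaluating the tuned model on held-out data, is sketched below on toy arrays. The small grid, the toy data, and the explicit `scoring="neg_mean_squared_error"` are assumptions made to keep the example short and consistent with the RMSE metric; they are not the settings used in the cell above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (hypothetical): the target depends on the first feature only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + rng.normal(0, 1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("dt", DecisionTreeRegressor(random_state=42)),
])
grid_search = GridSearchCV(
    pipeline,
    {"dt__max_depth": [2, 4, 8], "dt__min_samples_leaf": [5, 10]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid_search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set by default (refit=True)
final_model = grid_search.best_estimator_

# The test set is used once, only after tuning is finished
test_rmse = np.sqrt(mean_squared_error(y_test, final_model.predict(X_test)))
print(grid_search.best_params_, test_rmse)
```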

Task 7. 

Let's add a new feature `rooms_per_household` which is the total number of rooms divided by the total number of households. See if this improves the result of our model. 
```
df["rooms_per_household"] = df["total_rooms"]/df["households"]
```
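
One way to check whether the new feature helps is to compare cross-validated RMSE with and without it, as in this sketch. The toy target is built directly from the rooms-to-households ratio, so the comparison here only illustrates the procedure and says nothing about the real dataset; the helper `cv_rmse` and the model settings are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical values); the target is constructed from the ratio on purpose
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame({
    "total_rooms": rng.integers(200, 6000, size=n).astype(float),
    "households": rng.integers(50, 1500, size=n).astype(float),
})
toy["median_house_value"] = (
    50000 * toy["total_rooms"] / toy["households"] + rng.normal(0, 20000, size=n)
)

# The engineered feature from Task 7
toy["rooms_per_household"] = toy["total_rooms"] / toy["households"]

def cv_rmse(feature_names):
    """Mean 5-fold cross-validated RMSE of a decision tree on the given columns."""
    model = DecisionTreeRegressor(min_samples_leaf=10, random_state=42)
    scores = cross_val_score(model, toy[feature_names], toy["median_house_value"],
                             scoring="neg_mean_squared_error", cv=5)
    return np.sqrt(-scores).mean()

baseline_rmse = cv_rmse(["total_rooms", "households"])
engineered_rmse = cv_rmse(["total_rooms", "households", "rooms_per_household"])
print(f"baseline: {baseline_rmse:.0f}, with rooms_per_household: {engineered_rmse:.0f}")
```

On the real data, the new column also has to be added to the numerical feature list so that the `ColumnTransformer` passes it to the model.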
