# Task 1: Data Analysis
Please make sure you <span style="color: red">run every cell</span> in the task. You <span style="color: red">don't need to</span> read the details of the code snippets we provide.

You should **never** unfold the next part before you have finished the previous one.

Today, you would like to build a model to predict the median housing price in California.

Thankfully, your coworker has prepared a few options to help you build the model.

**Once you see "CHOOSE ONE", copy, paste, and execute <span style="color: red"> any one of the choices </span> in the cell below.**

You are free to run other cells.

## Part 0: Load Libraries and Read Data

In [None]:
import numpy as np
import pandas as pd
from joblib import dump, load
from kishu.user_study import install_submit_cell_execution
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

install_submit_cell_execution()

In [None]:
housing = pd.read_csv("./housing_california.csv")
print(housing.head())

`median_house_value` is the target you need to predict from the other features.

## Part 1: Data Preprocessing

### Impute the missing values (CHOOSE ONE)

There are two choices for imputing the 207 missing values in the `total_bedrooms` column.

1. **Choice 1**: Median Imputation. A simple approach is adequate for a dataset of this size.
```python
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(housing.iloc[:,4:5])
housing.iloc[:,4:5] = imputer.transform(housing.iloc[:,4:5])
```
2. **Choice 2**: kNN Unsupervised Learning Imputation. Median imputation can introduce bias; a kNN imputer fills each gap from the most similar rows and can represent the data better.
```python
scaler = MinMaxScaler()
housing_scaled = pd.DataFrame(scaler.fit_transform(housing), columns=housing.columns)
knn_imputer = KNNImputer(n_neighbors=2)
housing_imputed = pd.DataFrame(knn_imputer.fit_transform(housing_scaled), columns=housing.columns)
housing_imputed = pd.DataFrame(scaler.inverse_transform(housing_imputed), columns=housing.columns)
housing['total_bedrooms'] = housing_imputed['total_bedrooms']
```
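
To make the difference between the two choices concrete, here is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are hypothetical, not taken from the housing data). Scaling is omitted because only one feature drives the distances here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data: bedrooms are about 20% of rooms; one value is missing.
toy = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0, 900.0, 1000.0, 1100.0],
    "total_bedrooms": [20.0, 40.0, 60.0, 180.0, np.nan, 220.0],
})

# Median imputation uses only the column's own observed values.
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

# kNN imputation averages the bedrooms of the 2 rows closest in total_rooms.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)

print("median fill:", median_filled[4, 1])
print("kNN fill:   ", knn_filled[4, 1])
```

In this toy case the median fill ignores that the row has many rooms, while the kNN fill borrows from the two rows with a similar room count.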

**Paste and run** the code of your choice **in the cell below**.

In [None]:
# Make sure there are no missing values
housing.isnull().sum()

## Part 2: Feature Engineering

Several features (`total_rooms`, `total_bedrooms`, `longitude`, `latitude`) are highly correlated, so let's simply drop some of them.

In [None]:
housing.drop(["total_bedrooms"], axis=1, inplace=True)
housing.drop(["longitude"], axis=1, inplace=True)
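
As a rough illustration of how such a correlation could be checked, the sketch below computes a correlation matrix on a small made-up table (the `toy` values are hypothetical and only mimic the column names of the housing data).

```python
import pandas as pd

# Hypothetical toy data: bedrooms closely track rooms, age does not.
toy = pd.DataFrame({
    "total_rooms": [100, 200, 300, 900, 1000],
    "total_bedrooms": [20, 41, 59, 182, 199],
    "housing_median_age": [30, 12, 45, 8, 27],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()
print(corr.round(2))

# A value close to 1 suggests the two columns carry nearly the same information.
rooms_vs_bedrooms = corr.loc["total_rooms", "total_bedrooms"]
```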

## Part 3: Model Training

### Train test set split (CHOOSE ONE)

How would you split the dataset into train and test sets?

1. **Choice 1**: Random split to be fair.
```python
tr_data, te_data = train_test_split(housing, test_size=0.2, random_state=43)
```
2. **Choice 2**: Stratified split on an income category, so that the train and test sets keep the same income distribution.
```python
housing['income_cat'] = np.ceil(housing['median_income'] / 1.5)
# Cap the category at 5; assign the result instead of using a chained in-place call
housing['income_cat'] = housing['income_cat'].where(housing['income_cat'] < 5, 5.0)
tr_data, te_data = train_test_split(housing, test_size=0.2, stratify=housing['income_cat'])
housing = housing.drop('income_cat',axis=1)
tr_data = tr_data.drop('income_cat',axis=1)
te_data = te_data.drop('income_cat',axis=1)
```
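
The effect of stratification can be sketched on a small made-up table (the `toy` data, the 80/20 class mix, and `random_state=0` are assumptions for illustration only): the test set keeps the same share of each income category as the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 low-income rows and 20 high-income rows.
toy = pd.DataFrame({
    "median_income": np.r_[np.full(80, 1.0), np.full(20, 3.0)],
})
# Same binning rule as Choice 2: category = ceil(income / 1.5).
toy["income_cat"] = np.ceil(toy["median_income"] / 1.5)

# Stratify on the category so each split keeps the 80/20 mix.
tr, te = train_test_split(
    toy, test_size=0.2, stratify=toy["income_cat"], random_state=0
)

print(te["income_cat"].value_counts(normalize=True).sort_index())
```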

In [None]:
# Make sure the split succeeded
print(tr_data.shape, te_data.shape)

### Divide the data into features (X) and target (Y)

In [None]:
# For the training data
X_train = tr_data.drop("median_house_value", axis=1)
Y_train = tr_data["median_house_value"]

# For the testing data
X_test = te_data.drop("median_house_value", axis=1)
Y_test = te_data["median_house_value"]

In [None]:
# Checking that the dataset is complete
print(X_train.head())
print(Y_train.head())

### Select the model to use (CHOOSE ONE)

Let's try one of the following two models.

1. **Choice 1**: Linear regression
```python
model = LinearRegression(n_jobs=-1)
model.fit(X_train, Y_train)
```
2. **Choice 2**: Random Forest
```python
model = RandomForestRegressor(n_estimators=30)
model.fit(X_train, Y_train)
```

## Part 4: Model Evaluation

In [None]:
# Evaluate your model using RMSE
Y_pred = model.predict(X_test)
rmse = np.sqrt(metrics.mean_squared_error(Y_test, Y_pred))
rmse
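
For reference, RMSE is the square root of the mean squared prediction error. The sketch below works it out by hand on three made-up values (the `y_true` and `y_pred` arrays are hypothetical) and checks the result against the same scikit-learn call used above.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true values and predictions.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 320.0])

# Errors are -10, 10, -20; squared: 100, 100, 400; mean: 200.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's mean squared error.
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is in the same unit as the target, a lower value means predictions are closer to the true house values on average.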

<hr />

## Part 5: Alternative Feature Engineering

Suppose you would like to explore a new approach. 

**Previously:** We tried deleting correlated features in Part 2.

```python
housing.drop(["total_bedrooms"], axis=1, inplace=True)
housing.drop(["longitude"], axis=1, inplace=True)
```

**Now:** Instead of simply deleting correlated features, let's first combine them into new features and then delete the originals.
 
```python
housing['diag_coord'] = housing['longitude'] + housing['latitude']
housing['bedperroom'] = housing['total_bedrooms'] / housing['total_rooms']
housing.drop(['longitude', 'latitude'], axis=1, inplace=True)
housing.drop(['total_bedrooms', 'total_rooms'], axis=1, inplace=True)
```
**Task**: Apply the new feature engineering method instead of the previous one, then retrain and re-evaluate the model you chose earlier.

<span style="color:red">**Note**</span>: `housing['longitude']` has already been dropped by the old method, so you cannot run the new feature engineering method directly without first recovering the data in the kernel.
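
The sketch below illustrates the note on a tiny made-up table (the `original`, `old`, and `new` DataFrames and their values are hypothetical). It shows that the new method fails once the old in-place drops have run, and what the new features look like when computed from untouched data. It is not meant as the recovery procedure for this task.

```python
import pandas as pd

# Hypothetical two-row stand-in for the housing data.
original = pd.DataFrame({
    "longitude": [-122.0, -118.0],
    "latitude": [37.0, 34.0],
    "total_rooms": [1000.0, 500.0],
    "total_bedrooms": [200.0, 125.0],
})

# Old method: the in-place drops remove the columns for good.
old = original.copy()
old.drop(["total_bedrooms"], axis=1, inplace=True)
old.drop(["longitude"], axis=1, inplace=True)

try:
    old["longitude"] + old["latitude"]
    new_method_possible = True
except KeyError:
    # 'longitude' no longer exists, so the new method cannot start from here.
    new_method_possible = False

# New method on untouched data: combine first, then drop the originals.
new = original.copy()
new["diag_coord"] = new["longitude"] + new["latitude"]
new["bedperroom"] = new["total_bedrooms"] / new["total_rooms"]
new.drop(["longitude", "latitude"], axis=1, inplace=True)
new.drop(["total_bedrooms", "total_rooms"], axis=1, inplace=True)

print(new_method_possible)
print(new)
```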


__________________________

Make sure you see the new RMSE before moving on.

__________________________

## Part 6

### Report result

**Task:** Please report the `rmse` for the old model and the new model.

Answer in the <span style="color:red">text file</span> opened for you.

## Part 7 

### Dump model

Let's export both models for later investigation.

**Task**: Please export the models you trained with the first and second feature engineering methods.

The exported files should be named `"old_way.joblib"` and `"new_way.joblib"`, respectively.

To dump a model to a file, you can use the following code:
   ```python
   dump(model, "old_way.joblib") 
   ```
and
   ```python
   dump(model, "new_way.joblib") 
   ```

You should be able to see the `old_way.joblib` and `new_way.joblib` files in the left panel, which means the models were exported successfully.
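
As a quick sanity check of what `dump` produces, the sketch below saves a tiny made-up model to a temporary file and loads it back (the `toy_model` object, its training data, and the file name `toy_model.joblib` are hypothetical and unrelated to the files required above).

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Hypothetical toy model fitted on y = 2x.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
toy_model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.joblib")
    dump(toy_model, path)   # write the fitted model to disk
    restored = load(path)   # read it back as a new object
    same = np.allclose(restored.predict(X), toy_model.predict(X))

print(same)
```

A loaded model predicts the same values as the one that was dumped, so each exported file captures the model as it was at the moment of the `dump` call.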