# Task 1: Data Analysis
Please make sure you <span style="color: red">run every cell</span> in the task. You <span style="color: red">don't need to</span> read the details of the code snippets we provide.

You should **never** unfold the next part before the previous part is done.

Today, you would like to build a model to predict the median housing price in California.

Thankfully, your coworker has prepared a few options to help you build the model.

**Once you see "CHOOSE ONE", copy, paste, and execute <span style="color: red"> any one of the choices </span> in the cell below.**

You are free to run other cells.

## Part 0: Load Libraries and Read Data

In [None]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from joblib import dump, load
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
from kishu.user_study import install_submit_cell_execution
install_submit_cell_execution()

In [None]:
housing = pd.read_csv("./housing_california.csv")
print(housing.head())

`median_house_value` is the target you need to predict from the other features.

## Part 1: Data Preprocessing

### Impute the missing values (CHOOSE ONE)

There are two ways to impute the 207 missing values in the `total_bedrooms` column.

1. **Choice 1**: Median Imputation. Simplicity is good for this dataset size.
```python
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(housing.iloc[:,4:5])
housing.iloc[:,4:5] = imputer.transform(housing.iloc[:,4:5])
```
2. **Choice 2**: kNN Unsupervised Learning Imputation. Median imputation can introduce bias; kNN imputer can provide a better representation.
```python
scaler = MinMaxScaler()
housing_scaled = pd.DataFrame(scaler.fit_transform(housing), columns=housing.columns)
knn_imputer = KNNImputer(n_neighbors=2)
housing_imputed = pd.DataFrame(knn_imputer.fit_transform(housing_scaled), columns=housing.columns)
housing_imputed = pd.DataFrame(scaler.inverse_transform(housing_imputed), columns=housing.columns)
housing['total_bedrooms'] = housing_imputed['total_bedrooms']
```
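
To see how the two strategies can differ, here is a minimal sketch on a hypothetical toy table (the values are made up and are not from `housing_california.csv`). Scaling is skipped because only one other feature is involved. The median imputer fills the gap with the column median, while the kNN imputer averages the values of the two rows closest in `rooms`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical toy data: 'bedrooms' has one missing value (row 2).
toy = pd.DataFrame({
    "rooms":    [10.0, 20.0, 22.0, 40.0, 100.0],
    "bedrooms": [2.0, 4.0, np.nan, 8.0, 20.0],
})

# Median imputation: the missing entry gets the median of the observed column.
median_filled = SimpleImputer(strategy="median").fit_transform(toy[["bedrooms"]])

# kNN imputation: the missing entry gets the mean 'bedrooms' of the
# two rows that are nearest in the remaining feature ('rooms').
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)

median_value = median_filled[2, 0]
knn_value = knn_filled[2, 1]
print("median:", median_value, "kNN:", knn_value)
```

On this toy table the kNN estimate follows the small neighbouring houses, whereas the median is pulled up by the large ones.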

**Paste and run** the code of your choice **in the cell below**.

In [None]:
# Making sure there is no missing value
housing.isnull().sum()

### Train test set split (CHOOSE ONE)

How would you split the dataset into train and test sets?

1. **Choice 1**: Random split to be fair.
```python
tr_data, te_data = train_test_split(housing, test_size=0.2, random_state=43)
```
2. **Choice 2**: Stratified split on income category, given the correlation between columns.
```python
housing['income_cat'] = np.ceil(housing['median_income'] / 1.5)
housing['income_cat'] = housing['income_cat'].where(housing['income_cat'] < 5, 5.0)  # cap categories at 5
tr_data, te_data = train_test_split(housing, test_size=0.2, stratify=housing['income_cat'])
housing = housing.drop('income_cat',axis=1)
tr_data = tr_data.drop('income_cat',axis=1)
te_data = te_data.drop('income_cat',axis=1)
```
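
As an illustration of what stratification does, the sketch below uses a hypothetical 100-row table with an 80/20 category mix (not the real dataset) and checks that the test set keeps the same mix. The `random_state=0` argument is an assumption added only to make the sketch repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 80 rows in category 1.0 and 20 rows in category 2.0.
toy = pd.DataFrame({
    "value": np.arange(100),
    "income_cat": [1.0] * 80 + [2.0] * 20,
})

# Stratify on the category column so both splits keep the 80/20 ratio.
tr, te = train_test_split(
    toy, test_size=0.2, stratify=toy["income_cat"], random_state=0
)

# Share of each category in the test set.
test_share = te["income_cat"].value_counts(normalize=True).sort_index()
print(test_share)
```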

In [None]:
# Making sure splitting is successful
print(tr_data.shape, te_data.shape)

### Divide the data into feature (X) and target (Y)

In [None]:
# For the training data
X_train = tr_data.drop('median_house_value', axis=1)
Y_train = tr_data['median_house_value']

# For the testing data
X_test = te_data.drop('median_house_value', axis=1)
Y_test = te_data['median_house_value']

In [None]:
# Checking that the dataset is complete
print(X_train.head())
print(Y_train.head())

## Part 2: Feature Engineering

Several features (`total_rooms`, `total_bedrooms`, `longitude`, `latitude`) are highly correlated, so let's simply drop some of them.

In [None]:
del X_train['total_bedrooms']
del X_test['total_bedrooms']
del X_train['longitude']
del X_test['longitude']
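
One way to check such a correlation claim is `DataFrame.corr()`. The sketch below runs on synthetic data, not on the housing file: `total_bedrooms` is generated as roughly a fixed fraction of `total_rooms`, and `unrelated_feature` is a hypothetical independent column added for contrast.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: bedrooms are about 20% of rooms plus small noise.
rooms = rng.uniform(100, 5000, size=200)
toy = pd.DataFrame({
    "total_rooms": rooms,
    "total_bedrooms": rooms * 0.2 + rng.normal(0, 10, size=200),
    "unrelated_feature": rng.uniform(1, 52, size=200),  # hypothetical column
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()
rooms_vs_bedrooms = corr.loc["total_rooms", "total_bedrooms"]
rooms_vs_unrelated = corr.loc["total_rooms", "unrelated_feature"]
print(corr.round(2))
```

A coefficient close to 1 or -1 suggests that two columns carry largely the same information.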

## Part 3: Model Training

### Select the model to use (CHOOSE ONE)

Let's try either of the two models.

1. **Choice 1**: Linear regression
```python
model = LinearRegression(n_jobs=-1)
model.fit(X_train, Y_train)
```
2. **Choice 2**: Random Forest
```python
model = RandomForestRegressor(30)
model.fit(X_train, Y_train)
```

## Part 4: Model Evaluation

In [None]:
# Evaluate your model using RMSE
Y_pred = model.predict(X_test)
rmse = np.sqrt(metrics.mean_squared_error(Y_test, Y_pred))
rmse
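
For reference, RMSE is the square root of the mean squared difference between the true and predicted values, so a lower value means a better fit. A minimal sketch on two made-up data points shows that the manual formula matches the `metrics.mean_squared_error` route used above.

```python
import numpy as np
from sklearn import metrics

# Made-up values: the errors are 2 and 0.
y_true = np.array([3.0, 5.0])
y_pred = np.array([1.0, 5.0])

# RMSE written out: sqrt(mean((y_true - y_pred)^2)) = sqrt((4 + 0) / 2).
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The same quantity via scikit-learn's mean squared error.
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_pred))
print(rmse_manual, rmse_sklearn)
```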

<hr />

## Part 5: Alternative Feature Engineering: <span style="color: red">Please make sure to read every line of this part</span>

Your coworker realized that `Part 2 Feature Engineering` should instead combine the correlated features into new features.

**Use the following feature engineering method instead of the one in Part 2, but <span style="color: red">keep all your other choices</span>. Then, re-run the model training and evaluation (Part 3 and Part 4).**



### Code for New Feature Engineering Method
**Note**: <span style="color:red">Because `X_train['longitude']` has already been dropped, the following code will raise a `KeyError` if you execute it directly</span>.

```python
X_train['diag_coord'] = X_train['longitude'] + X_train['latitude']
X_train['bedperroom'] = X_train['total_bedrooms'] / X_train['total_rooms']
X_test['diag_coord'] = X_test['longitude'] + X_test['latitude']
X_test['bedperroom'] = X_test['total_bedrooms'] / X_test['total_rooms']
X_train = X_train.drop(['longitude', 'latitude', 'total_bedrooms', 'total_rooms'], axis=1)
X_test = X_test.drop(['longitude', 'latitude', 'total_bedrooms', 'total_rooms'], axis=1)
```
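
One possible way around the missing columns, sketched here under assumptions, is to rebuild the feature table from the split data (which still contains every column) before applying the new method. The toy `tr_data` below is a made-up stand-in for the real split, and `add_combined_features` is a hypothetical helper that only wraps the lines above; the same steps would apply to `te_data` and `X_test`.

```python
import pandas as pd

def add_combined_features(X):
    """Hypothetical helper: apply the new feature engineering to a copy of X."""
    X = X.copy()
    X["diag_coord"] = X["longitude"] + X["latitude"]
    X["bedperroom"] = X["total_bedrooms"] / X["total_rooms"]
    return X.drop(["longitude", "latitude", "total_bedrooms", "total_rooms"], axis=1)

# Made-up stand-in for the tr_data produced by the train/test split.
tr_data = pd.DataFrame({
    "longitude": [-122.0, -118.0],
    "latitude": [37.0, 34.0],
    "total_rooms": [100.0, 200.0],
    "total_bedrooms": [20.0, 50.0],
    "median_income": [8.0, 3.0],
    "median_house_value": [450000.0, 200000.0],
})

# Rebuild the features from the untouched split, then engineer the new columns.
X_train = tr_data.drop("median_house_value", axis=1)
X_train = add_combined_features(X_train)
print(X_train)
```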

### Please answer the following question here.

Which feature engineering method gives better accuracy, i.e. a lower RMSE? (Please answer either `old way` or `new way` below.)

## Part 6: Export the Models


Please export the models you trained with the first and second feature engineering methods, respectively, using the following code:
1. For the first model
   ```python
   dump(model, 'drop_feature.joblib') 
   ```
2. For the second model
   ```python
   dump(model, 'replace_feature.joblib')
   ```
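
Both snippets dump the variable `model`, so presumably each export has to run while `model` still holds the corresponding trained model. To confirm that an exported file is usable, a round trip with `load` can be sketched as follows; the tiny linear model and the temporary directory are assumptions made only for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Tiny illustrative model fitted on made-up data (y = 2x + 1).
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X, y)

# Write the model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "drop_feature.joblib")
    dump(model, path)
    restored = load(path)

# The restored model should give the same predictions as the original.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```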