## Homework


> Note: sometimes your answer doesn't match one of the options exactly. That's fine.
> Select the option that's closest to your solution.

### Dataset

In this homework, we will use the California Housing Prices data from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).

Here's a wget-able [link](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv):

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
```
We'll keep working with the `'median_house_value'` variable, and we'll transform it to a classification task. 


In [1]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv


--2022-09-02 18:07:47--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1423529 (1.4M) [text/plain]
Saving to: ‘housing.csv.1’


2022-09-02 18:07:48 (1.18 MB/s) - ‘housing.csv.1’ saved [1423529/1423529]



### Features

For the rest of the homework, you'll need to use only these columns:

* `'latitude'`
* `'longitude'`
* `'housing_median_age'`
* `'total_rooms'`
* `'total_bedrooms'`
* `'population'`
* `'households'`
* `'median_income'`
* `'median_house_value'`
* `'ocean_proximity'`

### Data preparation

* Select only the features listed above and fill in the missing values with 0.
* Create a new column `rooms_per_household` by dividing the column `total_rooms` by the column `households`.
* Create a new column `bedrooms_per_room` by dividing the column `total_bedrooms` by the column `total_rooms`.
* Create a new column `population_per_household` by dividing the column `population` by the column `households`.

In [2]:
import pandas as pd
import numpy as np

In [3]:
housing = pd.read_csv('housing.csv')
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


### Missing values

In [4]:
housing.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [5]:
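# Fill missing values first. The bedrooms ratio is passed as a callable so that it is
# computed from the filled `total_bedrooms`, not from the original column with NaNs.
housing = housing.assign(total_bedrooms=housing['total_bedrooms'].fillna(0),
                        rooms_per_household=housing['total_rooms'] / housing['households'],
                        bedrooms_per_room=lambda df: df['total_bedrooms'] / df['total_rooms'],
                        population_per_household=housing['population'] / housing['households']
                       )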

housing.head().T

Unnamed: 0,0,1,2,3,4
longitude,-122.23,-122.22,-122.24,-122.25,-122.25
latitude,37.88,37.86,37.85,37.85,37.85
housing_median_age,41.0,21.0,52.0,52.0,52.0
total_rooms,880.0,7099.0,1467.0,1274.0,1627.0
total_bedrooms,129.0,1106.0,190.0,235.0,280.0
population,322.0,2401.0,496.0,558.0,565.0
households,126.0,1138.0,177.0,219.0,259.0
median_income,8.3252,8.3014,7.2574,5.6431,3.8462
median_house_value,452600.0,358500.0,352100.0,341300.0,342200.0
ocean_proximity,NEAR BAY,NEAR BAY,NEAR BAY,NEAR BAY,NEAR BAY
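
The order of operations in `assign` matters when one new column depends on another. Plain expressions passed as keyword arguments are evaluated against the DataFrame as it was before the call, while callables receive the frame with the earlier keyword arguments already applied. A minimal sketch on a made-up two-row frame (the values are illustrative, not taken from `housing.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (illustrative values only)
toy = pd.DataFrame({'total_rooms': [10.0, 20.0],
                    'total_bedrooms': [2.0, np.nan]})

# Plain expression: the division uses the original column, so the NaN survives
eager = toy.assign(
    total_bedrooms=toy['total_bedrooms'].fillna(0),
    bedrooms_per_room=toy['total_bedrooms'] / toy['total_rooms'],
)

# Callable: the lambda receives the frame with `total_bedrooms` already filled
lazy = toy.assign(
    total_bedrooms=toy['total_bedrooms'].fillna(0),
    bedrooms_per_room=lambda df: df['total_bedrooms'] / df['total_rooms'],
)

eager_missing = int(eager['bedrooms_per_room'].isna().sum())
lazy_missing = int(lazy['bedrooms_per_room'].isna().sum())
print(eager_missing, lazy_missing)
```

Checking `isna().sum()` on the derived columns is a quick way to confirm that no missing values slipped through.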


In [6]:
numerical = ['latitude', 'longitude', 'housing_median_age', 'total_rooms',
             'total_bedrooms', 'population', 'households', 'median_income',
             'rooms_per_household', 'bedrooms_per_room',
             'population_per_household']

categorical = ['ocean_proximity']

### Question 1

What is the most frequent observation (mode) for the column `ocean_proximity`?

Options:
* `NEAR BAY`
* `<1H OCEAN`
* `INLAND`
* `NEAR OCEAN`

In [7]:
housing['ocean_proximity'].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
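
As a cross-check on `value_counts()`, pandas can also return the mode directly. A minimal sketch on a made-up Series (the values are illustrative, not taken from `housing.csv`):

```python
import pandas as pd

# Toy column standing in for `ocean_proximity` (illustrative values only)
proximity = pd.Series(['INLAND', 'NEAR BAY', 'INLAND', '<1H OCEAN', 'INLAND'])

# idxmax() on the counts gives the label with the highest frequency
most_frequent = proximity.value_counts().idxmax()

# mode() returns a Series because ties are possible; take the first entry
mode_value = proximity.mode().iloc[0]

print(most_frequent, mode_value)
```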

### Question 2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your training dataset.
    - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* Which two features have the highest correlation in this dataset?

Options:
* `total_bedrooms` and `households`
* `total_bedrooms` and `total_rooms`
* `population` and `households`
* `population_per_household` and `total_rooms`

In [8]:
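correlation_matrix = housing[numerical].corr()
# Mask the diagonal by position rather than by comparing values to 1,
# which is fragile with floating-point numbers
off_diagonal = ~np.eye(len(correlation_matrix), dtype=bool)
non_diag = correlation_matrix.where(off_diagonal)
# stack() drops the masked cells; idxmax() returns the (row, column) pair
non_diag.stack().idxmax()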


('total_bedrooms', 'households')
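
Because a correlation matrix is symmetric, every pair appears twice. One way to list each pair only once is to keep the strict upper triangle before searching for the maximum. The sketch below uses synthetic columns `a`, `b`, and `c` (hypothetical names) and takes the absolute value, which is an assumption about what "highest" should mean when strong negative correlations are present:

```python
import numpy as np
import pandas as pd

# Synthetic data: `b` is almost a copy of `a`, `c` is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr()

# Keep only cells above the diagonal so each pair appears exactly once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(upper_mask)

# Strongest pair by absolute correlation
top_pair = upper.abs().stack().idxmax()
print(top_pair)
```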

### Make `median_house_value` binary

* We need to turn the `median_house_value` variable from numeric into binary.
* Let's create a variable `above_average` which is `1` if the `median_house_value` is above its mean value and `0` otherwise.

In [9]:
# 1 if the value is above the mean of `median_house_value`, 0 otherwise
housing['above_average'] = (housing['median_house_value'] > housing['median_house_value'].mean()).astype(int)


### Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value (`median_house_value`) is not in your dataframe.

In [10]:
from sklearn.model_selection import train_test_split

# First split: 80% full train, 20% test
full_train, test = train_test_split(housing, test_size=.2, random_state=42)
# Second split: 25% of the remaining 80% gives a 20% validation set
train, val = train_test_split(full_train, test_size=.25, random_state=42)

y_train = train['above_average']
y_val = val['above_average']
y_test = test['above_average']

# Remove the binary target from the feature dataframes
del train['above_average']
del val['above_average']
del test['above_average']

print(test.shape, val.shape, train.shape)



(4128, 13) (4128, 13) (12384, 13)
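
The second call uses `test_size=.25` because it splits the remaining 80% of the data, and 0.25 × 0.8 = 0.2 of the original. A small sketch on 100 dummy rows makes the arithmetic visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # 100 dummy rows

# First split: 80 rows for full train, 20 rows for test
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)
# Second split: 25% of the 80 remaining rows is 20 rows for validation
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)  # (60, 20, 20)
```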


### Question 3

* Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only.
* What is the value of mutual information?
* Round it to 2 decimal digits using `round(score, 2)`

Options:
- 0.26
- 0
- 0.10
- 0.16

In [11]:
from sklearn.metrics import mutual_info_score

mutual_info_score(y_train, train[categorical[0]]).round(3)

0.101
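
To build some intuition for the scale of this score, here is a minimal sketch on made-up labels. A feature that fully determines a balanced binary target has mutual information equal to the target's entropy, ln 2 ≈ 0.693 (`mutual_info_score` uses natural logarithms), while an independent feature scores 0:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

target = [0, 0, 1, 1]

# Fully determines the target: MI equals the entropy of the target, ln(2)
informative = ['a', 'a', 'b', 'b']
# Independent of the target: MI is 0
uninformative = ['a', 'b', 'a', 'b']

mi_informative = mutual_info_score(target, informative)
mi_uninformative = mutual_info_score(target, uninformative)

print(round(mi_informative, 3), round(mi_uninformative, 3))
```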

### Question 4

* Now let's train a logistic regression.
* Remember that we have one categorical variable `ocean_proximity` in the data. Include it using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

Options:
- 0.60
- 0.72
- 0.84
- 0.95

In [12]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

dv = DictVectorizer(sparse=False)

train_dict = train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)


model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

y_val_pred = model.predict(X_val)

accuracy_score(y_val, y_val_pred).round(2)

0.84

### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 
* Which of the following features has the smallest difference?
   * `total_rooms`
   * `total_bedrooms` 
   * `population`
   * `households`

> **Note**: the difference doesn't have to be positive.

In [13]:
features = ['total_rooms', 'total_bedrooms', 'population', 'households']

train_full_feature_selection = train[features].copy()
val_full_feature_selection = val[features].copy()

# Fit on the four candidate features only; the score below is the accuracy on the training set
model.fit(train_full_feature_selection, y_train)
model.score(train_full_feature_selection, y_train).round(2)

0.7

In [14]:
# Validation accuracy after dropping each feature in turn (same order as `features`)
accuracy_scores = []
for feature in features:
    remaining_features = [f for f in features if f != feature]
    train_remaining_features = train[remaining_features]
    val_remaining_features = val[remaining_features]
    model.fit(train_remaining_features, y_train)
    accuracy_scores.append(model.score(val_remaining_features, y_val))

accuracy_scores

[0.6276647286821705, 0.6608527131782945, 0.656734496124031, 0.6719961240310077]
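
The question asks for the difference between a baseline accuracy and the accuracy without each feature, both measured on the validation set. A minimal sketch of that comparison on synthetic data follows; the column names and the helper `val_accuracy` are hypothetical and serve only as illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): only `signal` drives the target
rng = np.random.default_rng(42)
X = pd.DataFrame({
    'signal': rng.normal(size=500),
    'noise_1': rng.normal(size=500),
    'noise_2': rng.normal(size=500),
})
y = (X['signal'] > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)

def val_accuracy(columns):
    """Fit on the training part and return accuracy on the validation part."""
    clf = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    clf.fit(X_tr[columns], y_tr)
    return clf.score(X_va[columns], y_va)

all_columns = list(X.columns)
baseline = val_accuracy(all_columns)

# Difference = baseline accuracy minus accuracy without the feature
differences = {
    col: baseline - val_accuracy([c for c in all_columns if c != col])
    for col in all_columns
}
print(baseline, differences)
```

A difference close to zero suggests the model barely relies on that feature, while a large positive difference marks a feature the model depends on.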

### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `'median_house_value'`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model (`model = Ridge(alpha=a, solver="sag", random_state=42)`) on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.

Options:
- 0
- 0.01
- 0.1
- 1
- 10

In [15]:
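from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# The targets do not depend on alpha, so compute them once outside the loop
y_train_log = np.log1p(train['median_house_value'])
y_val_log = np.log1p(val['median_house_value'])

# 10000000 is an extra value beyond the listed options
alphas = [0, 0.01, 0.1, 1, 10, 10000000]
errors = []
for a in alphas:
    model = Ridge(alpha=a, solver="sag", random_state=42)
    model.fit(X_train, y_train_log)
    y_val_pred = model.predict(X_val)

    # mean_squared_error returns the MSE; the RMSE is its square root,
    # so both rank the alphas in the same order
    errors.append(mean_squared_error(y_val_log, y_val_pred))

list(zip(alphas, errors))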

[(0, 0.2746426261364207),
 (0.01, 0.2746426261543594),
 (0.1, 0.274642626324762),
 (1, 0.27464262803776485),
 (10, 0.27464264514086784),
 (10000000, 0.28391754986338924)]
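
The question asks for RMSE, which is the square root of the value returned by `mean_squared_error`. A minimal sketch on made-up prices (illustrative values only) shows the conversion, together with the `log1p`/`expm1` round trip used for the target:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up prices, not values from housing.csv
prices = np.array([100_000.0, 250_000.0, 500_000.0])

# log1p(x) = log(1 + x); expm1 is its inverse
log_prices = np.log1p(prices)
restored = np.expm1(log_prices)

# Pretend predictions on the log scale, off by 0.1, -0.2 and 0.2
log_predictions = log_prices + np.array([0.1, -0.2, 0.2])

mse = mean_squared_error(log_prices, log_predictions)
rmse = np.sqrt(mse)
print(round(float(mse), 3), round(float(rmse), 3))
```

Because the square root is monotonic, the alpha with the lowest MSE also has the lowest RMSE; only the rounded values that are compared against the options differ.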

## Submit the results

* Submit your results here: https://forms.gle/vQXAnQDeqA81HSu86
* You can submit your solution multiple times. In this case, only the last submission will be used.
* If your answer doesn't match options exactly, select the closest one


## Deadline

The deadline for submitting is 26 September (Monday), 23:00 CEST.

After that, the form will be closed.