In [76]:
import pandas as pd
import numpy as np

## Homework

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can follow the instructions from
[06-environment.md](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/06-environment.md).
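Once everything is installed, a quick sanity check can confirm which of the packages are visible to the current Python interpreter. The snippet below is a minimal sketch, not part of the course instructions: the helper name `installed_versions` is made up for illustration, and it assumes the packages were installed under their usual distribution names.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


# Assumed distribution names for the libraries listed above.
versions = installed_versions(["numpy", "pandas", "matplotlib", "seaborn"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Any package reported as `not installed` needs another look at the environment setup.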

### Question 1

What's the version of Pandas that you installed?

You can get the version information using the `__version__` field:

In [6]:
pd.__version__

'2.1.1'

### Getting the data 

For this homework, we'll use the California Housing Prices dataset. Download it from 
[here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
```

Or just open it with your browser and click "Save as...".
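A third option is to download the file from Python itself. The following is a small sketch using only the standard library; the helper name `ensure_downloaded` and the constant `DATA_URL` are illustrative choices, not something the course prescribes. It skips the download when the file already exists, which is handy when re-running the notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Same URL as in the wget command above.
DATA_URL = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv"


def ensure_downloaded(url, path):
    """Download url to path unless the file is already there; return the path."""
    target = Path(path)
    if not target.exists():
        urlretrieve(url, str(target))
    return target
```

Calling `ensure_downloaded(DATA_URL, "housing.csv")` would then leave a local `housing.csv` ready for `pd.read_csv`.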

Now read it with Pandas.


In [10]:
! curl https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv > housing.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1390k  100 1390k    0     0  13.7M      0 --:--:-- --:--:-- --:--:-- 14.5M


In [12]:
housing_df = pd.read_csv("housing.csv")

### Question 2

How many columns are in the dataset?

- 10
- 6560
- 10989
- 20640

In [14]:
len(housing_df.columns)

10
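Some of the other answer options are easy to pick if rows and columns get mixed up. As a small sketch on a made-up frame (the name `toy_df` and its values are illustrative, not taken from the housing data), `DataFrame.shape` returns both counts at once:

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns.
toy_df = pd.DataFrame({"rooms": [3, 4, 5], "age": [10, 20, 30]})

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)
```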

### Question 3

Which columns in the dataset have missing values?

- `total_rooms`
- `total_bedrooms`
- both of the above
- no empty columns in the dataset

In [16]:
housing_df.isna().head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False |


In [25]:
nan_columns = housing_df.isna().any()

In [27]:
nan_columns

longitude             False
latitude              False
housing_median_age    False
total_rooms           False
total_bedrooms         True
population            False
households            False
median_income         False
median_house_value    False
ocean_proximity       False
dtype: bool

In [33]:
list(nan_columns[nan_columns].index)

['total_bedrooms']
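Beyond knowing which columns have gaps, it is often useful to know how many values are missing in each. A minimal sketch on a made-up frame (the column names mirror the dataset, but the values are invented for illustration) sums the boolean mask instead of calling `any()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one complete column, one with two gaps.
toy_df = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0],
    "total_bedrooms": [20.0, np.nan, np.nan],
})

# True counts as 1, so summing the mask gives the number of NaNs per column.
missing_counts = toy_df.isna().sum()

# Keep only the columns that actually have missing values.
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```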

### Question 4

How many unique values does the `ocean_proximity` column have?

- 3
- 5
- 7
- 9

In [34]:
housing_df.ocean_proximity.unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [35]:
housing_df.ocean_proximity.nunique()

5
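If the frequency of each category matters as well, `value_counts()` gives the unique values and their counts in one call. A short sketch on a made-up series (the labels reuse category names from above; the counts are invented):

```python
import pandas as pd

# Hypothetical toy column with three distinct categories.
toy_proximity = pd.Series(["INLAND", "NEAR BAY", "INLAND", "ISLAND", "INLAND"])

# One row per unique value, sorted by frequency (most common first).
counts = toy_proximity.value_counts()
print(counts)
```

The length of the result equals what `nunique()` reports for the same column.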

### Question 5

What's the average value of the `median_house_value` for the houses located near the bay?

- 49433
- 124805
- 259212
- 380440

In [45]:
housing_df[housing_df["ocean_proximity"] == "NEAR BAY"].median_house_value.mean()

259212.31179039303
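An alternative to filtering by one category is to compute the mean for every category at once with `groupby`. This is a sketch on made-up numbers (the frame `toy_df` and its values are illustrative only), not the approach the homework requires:

```python
import pandas as pd

# Hypothetical toy data with two categories.
toy_df = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
    "median_house_value": [300000.0, 100000.0, 200000.0, 150000.0],
})

# Mean house value per category; the result is indexed by category label.
mean_by_category = toy_df.groupby("ocean_proximity")["median_house_value"].mean()
print(mean_by_category["NEAR BAY"])
```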

### Question 6

1. Calculate the average of the `total_bedrooms` column in the dataset.
2. Use the `fillna` method to fill the missing values in `total_bedrooms` with the mean value from the previous step.
3. Now, calculate the average of `total_bedrooms` again.
4. Has it changed?

> Hint: take into account only 3 digits after the decimal point.

- Yes
- No

In [51]:
housing_df.total_bedrooms[housing_df.total_bedrooms.notna()].mean()

537.8705525375618

In [54]:
housing_df.total_bedrooms.fillna(value=537.870).mean()

537.870546996124

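Let there be $X$ valid values with mean $M$ and $Y$ NaN values. After replacing the NaNs with $M$, the new mean is

$$\frac{X \times M + Y \times M}{X + Y} = M$$

So the mean should not change. The two results above differ only from the fifth decimal place on, because the NaNs were filled with the rounded value 537.870 rather than the exact mean; rounded to 3 digits after the decimal point, both are 537.871.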
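The same argument can be checked numerically. Below is a minimal sketch on a made-up series (the values are illustrative, not from the housing data) that fills the gaps with the exact, unrounded mean:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: three valid values and two NaNs.
toy = pd.Series([100.0, 200.0, np.nan, 600.0, np.nan])

# mean() skips NaNs by default, so this averages only the valid values.
mean_before = toy.mean()

# Fill the gaps with that mean and average again, now over all five values.
filled = toy.fillna(mean_before)
mean_after = filled.mean()

print(mean_before, mean_after)
```

Passing the computed mean directly, rather than a hand-typed rounded number, avoids introducing a small rounding difference.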

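In [None]:
# To replace the NaNs in the data frame itself, assign the filled column back.
# (Calling fillna(..., inplace=True) on a selected column is deprecated in newer pandas versions.)

In [56]:
housing_df["total_bedrooms"] = housing_df["total_bedrooms"].fillna(value=537.870)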

In [61]:
housing_df.isna().any()

longitude             False
latitude              False
housing_median_age    False
total_rooms           False
total_bedrooms        False
population            False
households            False
median_income         False
median_house_value    False
ocean_proximity       False
dtype: bool

### Question 7

1. Select all the rows where `ocean_proximity` is `ISLAND`.
2. Select only columns `housing_median_age`, `total_rooms`, `total_bedrooms`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Compute the inverse of `XTX`.
6. Create an array `y` with values `[950, 1300, 800, 1000, 1300]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the value of the last element of `w`?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- -1.4812
- 0.001
- 5.6992
- 23.1233






In [73]:
feature_df = housing_df[housing_df["ocean_proximity"] == "ISLAND"][["housing_median_age", "total_rooms", "total_bedrooms"]].reset_index(drop=True)

In [78]:
feature_df

| | housing_median_age | total_rooms | total_bedrooms |
|---|---|---|---|
| 0 | 27.0 | 1675.0 | 521.0 |
| 1 | 52.0 | 2359.0 | 591.0 |
| 2 | 52.0 | 2127.0 | 512.0 |
| 3 | 52.0 | 996.0 | 264.0 |
| 4 | 29.0 | 716.0 | 214.0 |


In [81]:
X = np.array(feature_df)

In [83]:
X

array([[  27., 1675.,  521.],
       [  52., 2359.,  591.],
       [  52., 2127.,  512.],
       [  52.,  996.,  264.],
       [  29.,  716.,  214.]])

In [80]:
X.T

array([[  27.,   52.,   52.,   52.,   29.],
       [1675., 2359., 2127.,  996.,  716.],
       [ 521.,  591.,  512.,  264.,  214.]])

In [84]:
X.shape, X.T.shape

((5, 3), (3, 5))

In [85]:
y = np.array([950, 1300, 800, 1000, 1300])

In [103]:
y, y.shape

(array([ 950, 1300,  800, 1000, 1300]), (5,))

In [91]:
XTX = np.matmul(X.T,X)

In [92]:
XTX

array([[9.6820000e+03, 3.5105300e+05, 9.1357000e+04],
       [3.5105300e+05, 1.4399307e+07, 3.7720360e+06],
       [9.1357000e+04, 3.7720360e+06, 9.9835800e+05]])

In [93]:
XTX_inv = np.linalg.inv(XTX)

In [94]:
XTX_inv

array([[ 9.19403586e-04, -3.66412216e-05,  5.43072261e-05],
       [-3.66412216e-05,  8.23303633e-06, -2.77534485e-05],
       [ 5.43072261e-05, -2.77534485e-05,  1.00891325e-04]])

In [101]:
np.matmul(XTX_inv, X.T), np.matmul(XTX_inv, X.T).shape

(array([[-0.00825608, -0.00653208, -0.00232159,  0.02565144,  0.01204934],
        [-0.00165852,  0.0011141 ,  0.00139656, -0.00103215, -0.00110698],
        [ 0.00754365, -0.00301964, -0.00455125,  0.00181685,  0.00329418]]),
 (3, 5))

In [97]:
y.shape

(5,)

In [104]:
w = np.matmul(np.matmul(XTX_inv, X.T), y)

In [105]:
w

array([23.12330961, -1.48124183,  5.69922946])

In [106]:
w[-1]

5.699229455065586
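For reference, the whole computation fits in a few lines. The sketch below re-enters the `X` and `y` values shown above so that it runs on its own, uses the `@` operator (equivalent to `np.matmul` for 2-D arrays), and cross-checks the result against `np.linalg.lstsq`. The cross-check is an addition for illustration, not a step the homework asks for.

```python
import numpy as np

# Feature matrix for the five island rows, copied from the output above.
X = np.array([
    [27.0, 1675.0, 521.0],
    [52.0, 2359.0, 591.0],
    [52.0, 2127.0, 512.0],
    [52.0, 996.0, 264.0],
    [29.0, 716.0, 214.0],
])
y = np.array([950, 1300, 800, 1000, 1300])

# Normal equation: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check with NumPy's least-squares solver, which avoids the explicit inverse.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w)
print(np.allclose(w, w_lstsq))
```

In practice a solver such as `np.linalg.lstsq` or `np.linalg.solve` is usually preferred over forming the inverse explicitly, since it tends to be more numerically stable.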