In [1]:
import pandas as pd
import numpy as np
from numpy.linalg import inv

### Question 1

What's the version of Pandas that you installed?

You can get the version information from the `__version__` attribute:

```python
pd.__version__
```

In [2]:
pd.__version__

'1.5.3'

### Getting the data 

For this homework, we'll use the California Housing Prices dataset. Download it from 
[here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
```

Or just open it with your browser and click "Save as...".

Now read it with Pandas.

In [3]:
# Assumes housing.csv was saved under ./data/; adjust the path if you downloaded it elsewhere.
df = pd.read_csv("./data/housing.csv")
print(df.head())
print(df.shape)

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY  
(20640, 10)


### Question 2

How many columns are in the dataset?

- 10
- 6560
- 10989
- 20640

In [4]:
len(df.columns)

10

### Question 3

Which columns in the dataset have missing values?

- `total_rooms`
- `total_bedrooms`
- both of the above
- no empty columns in the dataset

In [5]:
print(df.isna().sum())

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
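
The count table above can also be reduced to just the names of the affected columns. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the housing data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Count missing values per column, as in the cell above.
missing_counts = toy.isna().sum()

# Keep only the columns that have at least one missing value.
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_missing)
```

Applied to `df`, the same filter should leave only the columns whose count in the table above is non-zero.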


### Question 4

How many unique values does the `ocean_proximity` column have?

- 3
- 5
- 7
- 9

In [6]:
len(df.ocean_proximity.unique())

5

### Question 5

What's the average value of the `median_house_value` for the houses located near the bay?

- 49433
- 124805
- 259212
- 380440

In [7]:
int(df.median_house_value.loc[df["ocean_proximity"] == "NEAR BAY"].mean())

259212
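
An alternative to the boolean mask is `groupby`, which computes the mean for every category at once. This is only a sketch on a hypothetical toy frame (the values are invented), intended to show that both approaches agree:

```python
import pandas as pd

# Hypothetical toy data; the real dataset has many more rows.
toy = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "ISLAND"],
    "median_house_value": [400000.0, 300000.0, 100000.0, 450000.0],
})

# Mean house value per ocean_proximity category.
mean_by_group = toy.groupby("ocean_proximity")["median_house_value"].mean()
near_bay_mean = mean_by_group["NEAR BAY"]

# The same number via a boolean mask, as in the cell above.
mask_mean = toy.loc[toy["ocean_proximity"] == "NEAR BAY", "median_house_value"].mean()

print(int(near_bay_mean), int(mask_mean))
```

The mask is simpler when only one category is needed; `groupby` is convenient when comparing all categories.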

### Question 6

1. Calculate the average of the `total_bedrooms` column in the dataset.
2. Use the `fillna` method to fill the missing values in `total_bedrooms` with the mean value from the previous step.
3. Now, calculate the average of `total_bedrooms` again.
4. Has it changed?

> Hint: take into account only 3 digits after the decimal point.

- Yes
- No

In [8]:
print(df.total_bedrooms.mean())
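# Assign the result back instead of using inplace=True on a selected column,
# which is deprecated in newer pandas versions.
df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].mean())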
print(df.total_bedrooms.mean())

537.8705525375618
537.8705525375617
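
The two printed values differ only in the last digit, which is floating-point rounding. Both round to 537.871, so at 3 decimal places the mean has not changed. The sketch below illustrates the same comparison on a hypothetical series (`s` and its values are made up); filling missing entries with the mean leaves the mean unchanged because each filled value equals the existing average.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values.
s = pd.Series([129.0, np.nan, 190.0, 235.0, np.nan, 280.0])

mean_before = s.mean()           # NaN values are skipped by default
filled = s.fillna(mean_before)   # returns a new series; s is untouched
mean_after = filled.mean()

# Compare at 3 decimal places, as the hint suggests.
changed = round(mean_before, 3) != round(mean_after, 3)
print(mean_before, mean_after, changed)
```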


### Question 7

1. Select all the rows where `ocean_proximity` is `ISLAND`.
2. Select only columns `housing_median_age`, `total_rooms`, `total_bedrooms`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Compute the inverse of `XTX`.
6. Create an array `y` with values `[950, 1300, 800, 1000, 1300]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the value of the last element of `w`?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- -1.4812
- 0.001
- 5.6992
- 23.1233

In [9]:
df_islands = df.loc[df["ocean_proximity"] == "ISLAND"]
df_islands_limited = df_islands[["housing_median_age", "total_rooms", "total_bedrooms"]]
X = df_islands_limited.to_numpy()
XTX = np.dot(X.T, X)
XTX_inverse = inv(XTX)
y = np.array([950, 1300, 800, 1000, 1300])
w = np.dot(np.dot(XTX_inverse, X.T), y)
print(w)

[23.12330961 -1.48124183  5.69922946]
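
As a sanity check on the note about linear regression: the steps in Question 7 compute `w = inv(X.T @ X) @ X.T @ y`, the normal-equation solution of a least-squares problem without an intercept. The sketch below compares that formula with NumPy's `np.linalg.lstsq` solver on synthetic data (`X_demo` is randomly generated and is an assumption, not the island rows from the dataset):

```python
import numpy as np

# Synthetic 5x3 feature matrix; a stand-in for the island rows.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))
y_demo = np.array([950, 1300, 800, 1000, 1300], dtype=float)

# Normal equation, written with the @ operator instead of np.dot.
w_normal = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Least-squares solver, which avoids forming the inverse explicitly.
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

When `X.T @ X` is well conditioned, the two results should agree up to floating-point error. For nearly singular matrices the explicit inverse can become inaccurate, which is one reason to prefer a dedicated solver in practice.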
