# Homework

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Question 1

What's the version of Pandas that you installed?

You can get the version information using the `__version__` field:

```python
pd.__version__
```

In [2]:
pd.__version__

'1.5.3'

### Getting the data

For this homework, we'll use the California Housing Prices dataset. Download it from
[here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv).

In [3]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

--2023-09-19 20:10:46--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1423529 (1.4M) [text/plain]
Saving to: ‘housing.csv’


2023-09-19 20:10:46 (134 MB/s) - ‘housing.csv’ saved [1423529/1423529]



### Question 2

How many columns are in the dataset?

- 10
- 6560
- 10989
- 20640

In [4]:
df_housing = pd.read_csv('housing.csv')
df_housing.shape

(20640, 10)

In [5]:
print("Columns:", df_housing.shape[1])

Columns: 10


### Question 3

Which columns in the dataset have missing values?

- `total_rooms`
- `total_bedrooms`
- both of the above
- no empty columns in the dataset

In [6]:
df_housing.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
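
The counts above can be filtered to list only the columns that actually contain missing values. A minimal sketch, assuming a small made-up frame named `toy` (its values are illustrative and do not come from `housing.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, standing in for df_housing
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Number of missing values per column
missing = toy.isnull().sum()

# Keep only the columns with at least one missing value
cols_with_missing = missing[missing > 0].index.tolist()
print(cols_with_missing)
```

Applied to `df_housing`, the same filter should leave only `total_bedrooms`, matching the counts shown above.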

### Question 4

How many unique values does the `ocean_proximity` column have?

- 3
- 5
- 7
- 9

In [7]:
print('Unique values:', df_housing['ocean_proximity'].nunique())

Unique values: 5


### Question 5

What's the average value of the `median_house_value` for the houses located near the bay?

- 49433
- 124805
- 259212
- 380440

In [8]:
df_housing.loc[df_housing['ocean_proximity'] == 'NEAR BAY', 'median_house_value'].mean()

259212.31179039303
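
As a cross-check, the same per-category average can be computed for every `ocean_proximity` value at once with `groupby`. A small sketch on made-up numbers (the `toy` frame and its values are assumptions for illustration only):

```python
import pandas as pd

# Toy frame with hypothetical values, standing in for df_housing
toy = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "ISLAND"],
    "median_house_value": [100.0, 300.0, 50.0, 400.0],
})

# Boolean mask: average for a single category
near_bay_mean = toy.loc[
    toy["ocean_proximity"] == "NEAR BAY", "median_house_value"
].mean()

# groupby: average for every category in one pass
means_by_category = toy.groupby("ocean_proximity")["median_house_value"].mean()

print(near_bay_mean)
print(means_by_category)
```

Both approaches should agree on the `NEAR BAY` entry; `groupby` is handy when you want to compare categories side by side.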

### Question 6

1. Calculate the average of `total_bedrooms` column in the dataset.
2. Use the `fillna` method to fill the missing values in `total_bedrooms` with the mean value from the previous step.
3. Now, calculate the average of `total_bedrooms` again.
4. Has it changed?

Has it changed?

> Hint: take into account only 3 digits after the decimal point.

- Yes
- No

In [9]:
df_housing['total_bedrooms'].mean()

537.8705525375618

In [10]:
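# Assign the result back instead of calling fillna(..., inplace=True) on a
# selected column, which newer pandas versions warn about
df_housing['total_bedrooms'] = df_housing['total_bedrooms'].fillna(df_housing['total_bedrooms'].mean())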
df_housing['total_bedrooms'].isnull().sum()

0

In [12]:
df_housing['total_bedrooms'].mean()

537.8705525375617

In [13]:
# No change: both means round to 537.871 (3 digits after the decimal point), so the answer is "No"
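
The mean stays the same because `mean()` skips missing values, and filling `k` gaps with the mean `m` of the `n` known values gives `(n*m + k*m) / (n + k) = m`. The tiny difference in the last printed digit above is floating-point rounding. A minimal sketch on a made-up series (the values in `s` are assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy series with one missing value (hypothetical numbers)
s = pd.Series([100.0, 200.0, np.nan, 600.0])

# mean() ignores NaN by default: (100 + 200 + 600) / 3
mean_before = s.mean()

# Fill the gap with that mean, then average all four values
filled = s.fillna(mean_before)
mean_after = filled.mean()

# Compare with 3 digits after the decimal point, as the hint suggests
print(round(mean_before, 3) == round(mean_after, 3))
```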

### Question 7

1. Select all the options located on islands.
2. Select only columns `housing_median_age`, `total_rooms`, `total_bedrooms`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Compute the inverse of `XTX`.
6. Create an array `y` with values `[950, 1300, 800, 1000, 1300]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the value of the last element of `w`?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- -1.4812
- 0.001
- 5.6992
- 23.1233

In [15]:
# List the categories to find the label used for island locations
df_housing['ocean_proximity'].unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [16]:
df_houseIsland = df_housing[df_housing['ocean_proximity'] == 'ISLAND']
df_houseIsland.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')

In [17]:
df_houseIsland = df_houseIsland[['housing_median_age', 'total_rooms', 'total_bedrooms']]

In [18]:
X = df_houseIsland.to_numpy()
X

array([[  27., 1675.,  521.],
       [  52., 2359.,  591.],
       [  52., 2127.,  512.],
       [  52.,  996.,  264.],
       [  29.,  716.,  214.]])

In [19]:
TX = X.T
TX

array([[  27.,   52.,   52.,   52.,   29.],
       [1675., 2359., 2127.,  996.,  716.],
       [ 521.,  591.,  512.,  264.,  214.]])

In [20]:
# Matrix-matrix product of the transpose of X (TX = X.T) and X
XTX = np.dot(TX, X)
XTX

array([[9.6820000e+03, 3.5105300e+05, 9.1357000e+04],
       [3.5105300e+05, 1.4399307e+07, 3.7720360e+06],
       [9.1357000e+04, 3.7720360e+06, 9.9835800e+05]])

In [22]:
# Compute the inverse of XTX.
XTX_inv = np.linalg.inv(XTX)

In [23]:
# Create an array y
y = np.array([950, 1300, 800, 1000, 1300])

In [27]:
# Multiply the inverse of XTX with the transpose of X, and then multiply the result by y.
w = np.dot(np.dot(XTX_inv, TX), y)
w

array([23.12330961, -1.48124183,  5.69922946])

In [28]:
w[-1]

5.699229455065586
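
The steps above compute `w = (XᵀX)⁻¹ Xᵀ y`, the normal-equation solution of linear regression. As a hedged cross-check, the sketch below repeats the computation with `X` and `y` copied from the cells above and compares the explicit inverse against `np.linalg.solve` and `np.linalg.lstsq`. The variable names `w_inv`, `w_solve` and `w_lstsq` are introduced here for illustration.

```python
import numpy as np

# X and y copied from the cells above
X = np.array([
    [27., 1675., 521.],
    [52., 2359., 591.],
    [52., 2127., 512.],
    [52., 996., 264.],
    [29., 716., 214.],
])
y = np.array([950, 1300, 800, 1000, 1300])

# Explicit inverse, as in the homework steps
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Solve (X^T X) w = X^T y without forming the inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that works on X directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_inv)
print(w_solve)
print(w_lstsq)
```

All three should agree up to floating-point error. Outside of exercises, solving the linear system or calling a least-squares routine is usually preferred over forming the inverse explicitly, since it tends to be more stable numerically.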