# Introduction to Machine Learning

## Set up the development environment

Install the following packages using pip:

```bash
pip install numpy pandas matplotlib seaborn scikit-learn
```

## Homework 1

### Question 1: What's the version of Pandas that you installed?

You can get the version from the `__version__` attribute:

In [3]:
import pandas as pd
pd.__version__

'1.5.2'
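To record the versions of all the packages installed above in one pass, the standard library's `importlib.metadata` can be queried without importing each package. This is a minimal sketch: the helper name `installed_versions` is hypothetical, and a package that is not installed is reported as `None`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the pip command above.
packages = ["numpy", "pandas", "matplotlib", "seaborn", "scikit-learn"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```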

Download the sample data:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
```

### Question 2: How many columns are in the dataset?

In [6]:
df = pd.read_csv('housing.csv')

# Get the number of columns
num_columns = df.shape[1]
print('Number of columns:', num_columns)

Number of columns: 10
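`df.shape` is a `(rows, columns)` tuple, so the column count is its second element. The sketch below uses a small made-up CSV held in memory (not the housing data), so it runs without downloading anything:

```python
import io

import pandas as pd

# Made-up three-column sample standing in for housing.csv.
csv_text = """longitude,latitude,ocean_proximity
-122.23,37.88,NEAR BAY
-122.22,37.86,NEAR BAY
-118.30,34.26,<1H OCEAN
"""

sample = pd.read_csv(io.StringIO(csv_text))

# shape is (number of rows, number of columns)
n_rows, n_cols = sample.shape
print("Rows:", n_rows)
print("Columns:", n_cols)
print("Column names:", list(sample.columns))
```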


### Question 3: Which columns in the dataset have missing values?

In [7]:
# Check for missing data in each column
missing_data = df.isnull().sum()

print('Missing data in each column:')
print(missing_data)

Missing data in each column:
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
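When a dataset has many columns, it can help to keep only those with at least one missing value. A sketch on made-up numbers (the variable names are illustrative, not part of the homework):

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps in a single column.
sample = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274],
    "total_bedrooms": [129, np.nan, 190, np.nan],
    "population": [322, 2401, 496, 558],
})

missing_counts = sample.isnull().sum()

# Keep only the columns that have at least one missing value.
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)

# The same column names as a plain Python list.
names = sample.columns[sample.isnull().any()].tolist()
print(names)
```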


### Question 4: How many unique values does the ocean_proximity column have?

In [8]:
# get the unique values of the column 'ocean_proximity'
unique_values = df['ocean_proximity'].unique()
print('Unique values of the column "ocean_proximity":')
print(unique_values)

Unique values of the column "ocean_proximity":
['NEAR BAY' '<1H OCEAN' 'INLAND' 'NEAR OCEAN' 'ISLAND']
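The question asks for a count, while `unique()` returns the values themselves (five in the output above). `nunique()` gives the count directly, and `value_counts()` shows how many rows carry each value. A sketch on a made-up column:

```python
import pandas as pd

# Made-up sample of category labels.
sample = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "ISLAND", "NEAR BAY"],
})

# Number of distinct values (NaN is excluded by default).
n_unique = sample["ocean_proximity"].nunique()
print("Number of unique values:", n_unique)

# Number of rows for each distinct value.
counts = sample["ocean_proximity"].value_counts()
print(counts)
```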


### Question 5: What's the average value of the median_house_value for the houses located near the bay?

In [11]:
# Get the average value of the column median_house_value for houses located near the bay
bay_area = df[df['ocean_proximity'] == 'NEAR BAY']
average_value = bay_area['median_house_value'].mean()
print('Average value of the column median_house_value for houses located near the bay:')
# Format the output to 2 decimal places
print(f'{average_value:.2f}')

Average value of the column median_house_value for houses located near the bay:
259212.31
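The filter-then-mean approach answers the question for one category. To compare every category at once, `groupby` is a common alternative. A sketch on made-up values, not the real dataset:

```python
import pandas as pd

# Made-up sample; the values are chosen so the means are easy to check by hand.
sample = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
    "median_house_value": [400000.0, 300000.0, 100000.0, 150000.0],
})

# Mean house value for each ocean_proximity category.
mean_by_location = sample.groupby("ocean_proximity")["median_house_value"].mean()
print(mean_by_location)

# Pick out a single category from the grouped result.
near_bay_mean = mean_by_location["NEAR BAY"]
print(f"{near_bay_mean:.2f}")
```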


### Question 6: Does the average of total_bedrooms change after filling its missing values with the mean?

In [17]:
# Calculate the average of the total_bedrooms column
average_bedrooms = df['total_bedrooms'].mean()
print(f'Average of total_bedrooms column in the dataset: {average_bedrooms:.3f}')

# Fill the missing values in total_bedrooms with the mean from the previous step.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
df['total_bedrooms'] = df['total_bedrooms'].fillna(average_bedrooms)

# Calculate the average of the total_bedrooms column again
updated_average_bedrooms = df['total_bedrooms'].mean()
print(f'Average of total_bedrooms column in the dataset after filling missing values: {updated_average_bedrooms:.3f}')

# Has the average changed after filling the missing values? (exact comparison)
print(f'Average has changed after filling the missing values: {average_bedrooms != updated_average_bedrooms}')

# Round both averages to 3 decimal places and compare again
print(f'Average has changed after rounding to 3 decimal places: {round(average_bedrooms, 3) != round(updated_average_bedrooms, 3)}')

Average of total_bedrooms column in the dataset: 537.871
Average of total_bedrooms column in the dataset after filling missing values: 537.871
Average has changed after filling the missing values: False
Average has changed after rounding to 3 decimal places: False
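The unchanged average is expected. `mean()` skips missing values, so if the `n` observed values have mean `m`, filling `k` gaps with `m` gives `(n*m + k*m) / (n + k) = m`. A sketch on made-up numbers; `math.isclose` is used for the comparison because floating-point sums can differ in the last digits on real data:

```python
import math

import numpy as np
import pandas as pd

# Made-up series with two missing entries.
bedrooms = pd.Series([100.0, np.nan, 300.0, np.nan, 200.0])

# mean() ignores NaN: (100 + 300 + 200) / 3
mean_before = bedrooms.mean()

# fillna returns a new Series; the original is left untouched.
filled = bedrooms.fillna(mean_before)
mean_after = filled.mean()

print(f"Mean before filling: {mean_before:.3f}")
print(f"Mean after filling:  {mean_after:.3f}")
print("Means agree:", math.isclose(mean_before, mean_after))
```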


### Question 7: What's the value of the last element of w?

In [25]:
import numpy as np

# Select all the options located on islands
island_options = df[df['ocean_proximity'] == 'ISLAND']

# Select only the columns housing_median_age, total_rooms, total_bedrooms
island_options = island_options[['housing_median_age', 'total_rooms', 'total_bedrooms']]

# Get the underlying NumPy array. Let's call it X
X = island_options.values

# Multiply the transpose of X (X.T) by X. Let's call the result XTX
XTX = X.T.dot(X)

# Compute the inverse of XTX. Let's call the result inv_XTX
inv_XTX = np.linalg.inv(XTX)

# Create an array y with values [950, 1300, 800, 1000, 1300]
y = np.array([950, 1300, 800, 1000, 1300])

# Multiply inv_XTX by the transpose of X, then multiply the result by y. Call the result w
w = inv_XTX.dot(X.T).dot(y)
print('w:', w)

# What's the value of the last element of w?
print(f'Value of the last element of w: {w[-1]:.4f}')



w: [23.12330961 -1.48124183  5.69922946]
Value of the last element of w: 5.6992
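The steps above compute `w = inv(X.T X) X.T y`, the normal-equation solution of a least-squares fit. The explicit inverse is what the homework asks for, but the same `w` can usually be obtained without forming it, which tends to be more stable numerically. A sketch comparing the three routes on a made-up 5x3 matrix (not the island rows from the dataset):

```python
import numpy as np

# Made-up 5x3 feature matrix with full column rank.
X = np.array([
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 3.0],
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [3.0, 2.0, 1.0],
])
y = np.array([950.0, 1300.0, 800.0, 1000.0, 1300.0])

XTX = X.T @ X

# Route 1: explicit inverse, as in the homework steps.
w_inv = np.linalg.inv(XTX) @ X.T @ y

# Route 2: solve XTX w = X.T y directly, without forming the inverse.
w_solve = np.linalg.solve(XTX, X.T @ y)

# Route 3: least-squares solver applied to X itself.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("w (inverse):", w_inv)
print("solve agrees:", np.allclose(w_inv, w_solve))
print("lstsq agrees:", np.allclose(w_inv, w_lstsq))
```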
