## Homework

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can use the instructions from
[06-environment.md](../../../01-intro/06-environment.md).

### Q1. Pandas version

What's the version of Pandas that you installed?

You can get the version information using the `__version__` field:


In [None]:
pd.__version__


### Getting the data 

For this homework, we'll use the Car Fuel Efficiency dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

Or just open it with your browser and click "Save as...".

Now read it with Pandas.
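As a minimal sketch of the reading step, the helper below uses a local copy of the file when one exists and otherwise falls back to the URL given above. The name `load_cars` and the default local path are assumptions made for illustration; they are not part of the homework.

```python
from pathlib import Path

import pandas as pd

DATA_URL = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv"


def load_cars(local_path="car_fuel_efficiency.csv"):
    """Hypothetical helper: read the local CSV if present, else the URL."""
    source = local_path if Path(local_path).exists() else DATA_URL
    return pd.read_csv(source)


# Usage (needs the downloaded file or network access):
# df = load_cars()
```

Reading from the local file avoids downloading the data again every time the notebook is re-run.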

**How I got the result**

Used the `__version__` attribute of pandas:


In [None]:
import pandas as pd
pd.__version__
# 2.3.2

Q1 — Pandas version: 2.3.2

### Q2. Records count

How many records are in the dataset?

- 4704
- 8704
- 9704
- 17704

Counted the number of rows in the DataFrame:


In [None]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv")
len(df)        # 9704
# or: df.shape[0]

Q2 — Records count: 9704

### Q3. Fuel types

How many fuel types are presented in the dataset?

- 1
- 2
- 3
- 4

Counted the number of distinct categories in `fuel_type` and checked the distribution:


In [None]:
df['fuel_type'].nunique()        # 2
df['fuel_type'].value_counts()   # optional: to see which and how many of each

Q3 — Fuel types: 2

### Q4. Missing values

How many columns in the dataset have missing values?

- 0
- 1
- 2
- 3

Checked, per column, whether at least one value is missing, then counted the columns where that is true:


In [None]:
cols_com_na = df.columns[df.isna().any()]   # names of columns with at least one NaN
qtd_cols_com_na = df.isna().any().sum()     # number of such columns
list(cols_com_na)


> Since the options did not include 4, I selected the closest one (3), as instructed in the assignment.
>
Q4 — Columns with missing values: 4
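To see how many values are missing in each column, not just whether any are, `isna().sum()` gives a per-column count. The sketch below uses a small made-up frame (the column names `a`, `b`, `c` and their values are assumptions, not the homework data):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; not the homework dataset.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", "z"],
    "c": [np.nan, np.nan, 1.0],
})

na_per_column = toy.isna().sum()                  # missing values in each column
cols_with_na = na_per_column[na_per_column > 0]   # keep only columns with gaps
n_cols_with_na = len(cols_with_na)

print(cols_with_na)
print(n_cols_with_na)
```

The same three lines applied to `df` would list the affected columns together with their missing-value counts.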

### Q5. Max fuel efficiency

What's the maximum fuel efficiency of cars from Asia?

- 13.75
- 23.75
- 33.75
- 43.75

Filtered the records where `origin == 'Asia'` and calculated the maximum of the efficiency column. Depending on the dataset version, the column may be named `fuel_efficiency`, `combined_mpg` or `fuel_efficiency_mpg`, so the code below picks the first column whose name contains `efficien` or `mpg`:


In [None]:
df_asia = df[df['origin'] == 'Asia']

# 'mpg' in the name already covers names that end with 'mpg'
candidatas = [c for c in df.columns
              if 'efficien' in c.lower() or 'mpg' in c.lower()]

col_ef = candidatas[0]
df_asia[col_ef].max()

Q5 — Max fuel efficiency (Asia): 23.75
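The column lookup takes the first match and would fail with a bare `IndexError` if nothing matched. A slightly more defensive variant is sketched below; `find_efficiency_column` is a hypothetical helper name, and the matching rule is the same substring test used above.

```python
def find_efficiency_column(columns):
    """Hypothetical helper: first column whose name mentions efficiency or MPG."""
    candidates = [c for c in columns
                  if "efficien" in c.lower() or "mpg" in c.lower()]
    if not candidates:
        # Fail with a readable message instead of an IndexError on candidates[0].
        raise KeyError("no efficiency/MPG column found in: " + ", ".join(columns))
    return candidates[0]


# Usage with the homework frame:
# col_ef = find_efficiency_column(df.columns)
# df[df["origin"] == "Asia"][col_ef].max()
```

If more than one column matches, this still returns only the first, so it is worth printing the candidates once to confirm the choice.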

### Q6. Median value of horsepower



1. Find the median value of the `horsepower` column in the dataset.
2. Next, calculate the most frequent value of the same `horsepower` column.
3. Use the `fillna` method to fill the missing values in the `horsepower` column with the most frequent value from the previous step.
4. Now, calculate the median value of `horsepower` once again.

Has it changed?


- Yes, it increased
- Yes, it decreased
- No

1) Original median of `horsepower`:


In [None]:
mediana_before = df['horsepower'].median()


2) Mode (most frequent value) of `horsepower`:


In [None]:
moda_hp = df['horsepower'].mode()[0]


3) Filling the missing values with the mode:


In [None]:
df_hp = df.copy()
df_hp['horsepower'] = df_hp['horsepower'].fillna(moda_hp)


4) New median and comparison:


In [None]:
mediana_after = df_hp['horsepower'].median()
change = ("increase" if mediana_after > mediana_before
         else "reduce" if mediana_after < mediana_before
         else "don't change")

Q6 — Median horsepower change: Yes, it increased (149.0 → 152.0 after filling with mode=152)
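Why the median moves can be seen on a tiny example. `median()` skips missing values, so filling them with a mode that lies above the median adds extra values to the upper half and pulls the median up. The numbers below are made up for illustration and are not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Made-up values, chosen so that the mode (160) sits above the median (140).
hp = pd.Series([100.0, 120.0, 140.0, 160.0, 160.0, np.nan, np.nan])

median_before = hp.median()                    # NaN values are skipped
mode_value = hp.mode()[0]                      # most frequent value
median_after = hp.fillna(mode_value).median()  # two extra 160s shift the middle

print(median_before, mode_value, median_after)
```

Had the mode been below the original median, the same procedure would have lowered it instead.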

### Q7. Sum of weights

1. Select all the cars from Asia
2. Select only columns `vehicle_weight` and `model_year`
3. Select the first 7 values
4. Get the underlying NumPy array. Let's call it `X`.
5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
6. Invert `XTX`.
7. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.
8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
9. What's the sum of all the elements of the result?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- 0.051
- 0.51
- 5.1
- 51

1) Select cars from Asia  
2) Select columns `vehicle_weight` and `model_year`  
3) Take the first 7 rows  
4) Get `X` as `ndarray`  
5) Compute `XTX = X.T @ X`  
6) Invert `XTX`  
7) Create `y = [1100, 1300, 800, 900, 1000, 1100, 1200]`  
8) Compute `w = (XTX^-1) @ X.T @ y`  
9) Sum the elements of `w`


In [None]:
import numpy as np
from numpy.linalg import inv

asia = df[df['origin'] == 'Asia'][['vehicle_weight', 'model_year']].iloc[:7]
X = asia.to_numpy(dtype=float)

XTX = X.T @ X
XTX_inv = inv(XTX)

y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200], dtype=float)

w = XTX_inv @ X.T @ y
w.sum()

Q7 — Sum of weights: 0.51 (sum ≈ 0.5187709)
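As a sanity check on the procedure, the explicit-inverse formula can be compared against `np.linalg.solve` and `np.linalg.lstsq`, which solve the same least-squares problem without forming the inverse. The sketch below uses a made-up 7×2 matrix in place of the real `vehicle_weight` and `model_year` rows, so its weights will differ from the homework answer:

```python
import numpy as np

# Made-up 7x2 matrix standing in for the [vehicle_weight, model_year] rows.
X = np.array([
    [2700.0, 2003.0],
    [3100.0, 2007.0],
    [2400.0, 2018.0],
    [3500.0, 2009.0],
    [2900.0, 2012.0],
    [3300.0, 2005.0],
    [2600.0, 2015.0],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200], dtype=float)

w_inv = np.linalg.inv(X.T @ X) @ X.T @ y         # normal equation, explicit inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)      # same linear system, no inverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares directly on X

print(np.allclose(w_inv, w_solve), np.allclose(w_inv, w_lstsq))
```

The explicit inverse follows the homework steps literally; `solve` or `lstsq` are the usual choice in practice because they avoid inverting `XTX`.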

## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw01
* If your answer doesn't match options exactly, select the closest one