### Homework #3: ML workflow and target encoding

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error, r2_score

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

from itertools import product

In [5]:
df = pd.read_csv("winequality-red.csv")
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


We will solve a regression problem: the goal is to predict the **quality of wine** from its characteristics.

### Step 1. (**0.2 points**)

Create the feature matrix **X** (object-feature) and the target vector **y** ("quality").

In [6]:
y = df.iloc[:, -1] 
X = df.iloc[:, :-1]

### Step 2. (**0.2 points**)

Split the data into **train** and **test** sets (test data share — **30%**).

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=16)

### Step 3. (**0.2 points**)

Train a **linear regression** model on the training data and make predictions on both the **train** and **test** sets.

In [8]:
reg = LinearRegression().fit(X_train, y_train)
y_pred_mse = reg.predict(X_test)
y_pred_train_mse = reg.predict(X_train)

### Step 4. (**0.4 points**)

Display the **MSE error** on the **train** and **test** sets, then display the **R² score** on the **train** and **test** sets.

In [9]:
y_pred_train = reg.predict(X_train)
y_pred_test  = reg.predict(X_test)

print(f"MSE on training data: {mean_squared_error(y_train, y_pred_train)}")
print(f"MSE on test data: {mean_squared_error(y_test, y_pred_test)}")
print(f"R² on training data: {r2_score(y_train, y_pred_train)}")
print(f"R² on test data: {r2_score(y_test, y_pred_test)}")

MSE on training data: 0.4029193184207987
MSE on test data: 0.4641292014818482
R² on training data: 0.38369769677763454
R² on test data: 0.28169107467930066
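
For reference, both metrics can be reproduced by hand. The sketch below uses a small made-up target vector (not the wine data) and hypothetical helper names `mse` and `r2`. It also shows why R² can be negative: a model that does worse than always predicting the mean gets a score below zero.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, where SS_tot uses the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Toy values, chosen only for illustration
y_true_toy = [5, 6, 5, 7]
y_pred_toy = [5.5, 5.5, 5.0, 6.0]

toy_mse = mse(y_true_toy, y_pred_toy)
toy_r2 = r2(y_true_toy, y_pred_toy)

# Predicting the mean everywhere gives R² = 0;
# a constant far from the mean gives a negative R².
r2_of_mean = r2(y_true_toy, [np.mean(y_true_toy)] * 4)
r2_of_bad_constant = r2(y_true_toy, [10.0] * 4)
```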


### Step 5. (**0.5 points**)

Calculate the **average quality (R²)** of the model using **cross-validation** with **k = 5 folds**.

In [10]:
r2_scores = cross_val_score(reg, X, y, cv=5, scoring='r2')

r2_scores.mean()

0.29004162884219536
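
To make explicit what `cross_val_score` does here, the sketch below repeats the same procedure with an explicit loop over `KFold`. It runs on synthetic data (an assumption, not the wine dataset) and relies on the fact that, for a regressor and an integer `cv`, scikit-learn uses unshuffled `KFold` splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, used only for illustration
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Manual 5-fold cross-validation: fit on 4 folds, score on the held-out fold
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_toy):
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(r2_score(y_toy[val_idx], model.predict(X_toy[val_idx])))

# The library call should produce the same per-fold scores
library_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5, scoring="r2")
```

Because the folds are not shuffled, the score depends on the row order of the data, which is worth keeping in mind when comparing it with the shuffled train/test split from Step 2.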

### Step 6. (**0.5 points**)

Now apply **linear regression with L1 regularization (Lasso)** to this task.
Declare the model and tune the **regularization parameter α** using a grid search.
Search for **α** in the range **(0.1, 1.1)** with a **step of 0.1**.

Perform the tuning of **α** using the **training data (X_train, y_train)**.

In [11]:
from sklearn.linear_model import Lasso

best_alpha = None
best_r2 = float('-inf')  # R² is maximized, so start from -inf
results = []

for alpha in np.arange(0.1, 1.1, 0.1):
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)

    y_pred_tr = lasso.predict(X_train)

    train_mse = mean_squared_error(y_train, y_pred_tr)
    train_r2 = r2_score(y_train, y_pred_tr)

    results.append((alpha, train_mse, train_r2))

    # Keep track of the alpha with the highest training R²
    if train_r2 > best_r2:
        best_alpha, best_r2 = alpha, train_r2

    print(f'alpha={alpha:.1f}, Train MSE: {train_mse:.4f}, Train R²: {train_r2:.4f}')

alpha=0.1, Train MSE: 0.4774, Train R²: 0.2698
alpha=0.2, Train MSE: 0.5185, Train R²: 0.2070
alpha=0.3, Train MSE: 0.5648, Train R²: 0.1361
alpha=0.4, Train MSE: 0.6295, Train R²: 0.0371
alpha=0.5, Train MSE: 0.6316, Train R²: 0.0339
alpha=0.6, Train MSE: 0.6317, Train R²: 0.0338
alpha=0.7, Train MSE: 0.6318, Train R²: 0.0336
alpha=0.8, Train MSE: 0.6319, Train R²: 0.0334
alpha=0.9, Train MSE: 0.6321, Train R²: 0.0332
alpha=1.0, Train MSE: 0.6323, Train R²: 0.0329
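
The flattening of the scores for larger α is consistent with Lasso driving coefficients to exactly zero. The sketch below illustrates that effect on synthetic data (an assumption, not the wine dataset) by counting the nonzero coefficients for a few arbitrarily chosen α values.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: one strong feature, one weak feature, three pure-noise features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] + 0.2 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# Count how many coefficients survive at each regularization strength
nonzero_counts = {}
for alpha in [0.01, 0.5, 10.0]:
    coefs = Lasso(alpha=alpha).fit(X_toy, y_toy).coef_
    nonzero_counts[alpha] = int(np.sum(coefs != 0))

print(nonzero_counts)
```

Once every coefficient is zero, the model predicts a constant (the intercept), so raising α further cannot change the fit much.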


### Step 7. (**0.5 points**)

Display the **best algorithm** and the **best quality** obtained from tuning **α** (`best_estimator_` and `best_score_`).

In [12]:
alphas = np.arange(0.1, 1.1, 0.1) 
params = {'alpha': alphas}

# Pass a fresh estimator rather than the leftover `lasso` from the previous cell
cv = GridSearchCV(Lasso(), params, scoring='r2', cv=5)

cv.fit(X_train, y_train)

print('Best estimator :', cv.best_estimator_)
print('Best score (R²):', cv.best_score_)

Best estimator : Lasso(alpha=0.1)
Best score (R²): 0.26162552533012484


### Step 8. (**0.5 points**)

Using the obtained **best_estimator_**, make predictions on the **test data** and display the **R² score** on the test set.

In [13]:
y_pred_test = cv.best_estimator_.predict(X_test)
test_r2_score = r2_score(y_test, y_pred_test)

test_r2_score

0.18860624128065862

### Step 9. (**0.5 points**)

Let’s try to improve model quality by adding **polynomial features**. Create a **pipeline** that first adds **degree-2 polynomial features**, then applies **linear regression**.

Then compute the **R² score** of this model using **5-fold cross-validation**.

In [15]:
# Pipeline: expand to degree-2 polynomial features, then fit ordinary least squares
linear_pipe = Pipeline([('poly', PolynomialFeatures(degree=2)),
                       ('linear_model', LinearRegression())])
print('R²:', cross_val_score(linear_pipe, X, y, cv=5, scoring='r2').mean())

R²: 0.23009616946201242


### Step 10. (**0.5 points**)

Train the **pipeline model** on the **training data** and make predictions for both the **train** and **test** sets.
Then display the **R² score** and **MSE** for the **training** and **test** data.

In [17]:
linear_pipe.fit(X_train, y_train)

y_pred_train_pipe = linear_pipe.predict(X_train)
y_pred_test_pipe = linear_pipe.predict(X_test)

print(f"MSE on training data: {mean_squared_error(y_train, y_pred_train_pipe):}")
print(f"MSE on test data: {mean_squared_error(y_test, y_pred_test_pipe):}")
print(f"R² on training data: {r2_score(y_train, y_pred_train_pipe):}")
print(f"R² on test data: {r2_score(y_test, y_pred_test_pipe):}")

MSE on training data: 1.4544816422007008
MSE on test data: 1.7487605012189913
R² on training data: -1.2247639790424807
R² on test data: -1.7064668033455512


### Conclusions (1 point)

1. **Evaluation of the linear model without regularization**
   The linear model without regularization performs modestly: R² is 0.38 on the training set and 0.28 on the test set, so it explains well under half of the variance in wine quality. There is no strong evidence of overfitting: the test R² is only about 0.1 lower than the training R², and the MSE values are close (0.40 vs. 0.46). The results also shift slightly when the `random_state` parameter changes, which suggests sensitivity to the particular split rather than overfitting.

2. **Effect of L1 regularization on model quality**
   L1 regularization (Lasso) did **not** improve model performance. The best cross-validated R² was about **0.26** (at α = 0.1), and the test R² fell to about 0.19, compared with 0.28 for the original linear regression. Even the smallest α in the grid over-penalized the coefficients, reducing model flexibility without improving generalization. Since the unregularized model showed little overfitting to begin with, regularization had nothing to fix and only decreased predictive power.

3. **Effect of adding polynomial features**
   Adding polynomial (degree-2) features **significantly worsened** model quality. On the training data, **R² became negative (-1.2248)**, meaning the model does worse than simply predicting the mean of the target variable. On the test data, the result was even worse (**R² = -1.7065**).

This pattern is not classic **overfitting**: an overfit model fits the training set well and fails only on unseen data, whereas here the fit on the training set itself is poor. For least squares with an intercept, the training R² should not fall below zero, so the negative value more likely points to a numerical problem, for example ill-conditioned, unscaled polynomial features. The exact cause was not checked here. In any case, adding nonlinear terms in this form made the model unstable and ineffective.
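
To get a sense of how much the degree-2 expansion enlarges the model, the sketch below counts the generated columns, assuming the 11 physico-chemical columns shown in `df.head()` are the raw features. It then fits a standardized variant of the pipeline on synthetic data. The `StandardScaler` step and the synthetic data are illustrative assumptions and were not part of the experiment above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

n_raw_features = 11  # assumption: the 11 wine characteristics, without the target

# Number of columns after the degree-2 expansion (bias, linear, squares, interactions)
poly = PolynomialFeatures(degree=2).fit(np.zeros((1, n_raw_features)))
n_poly_features = poly.n_output_features_

# Synthetic, deliberately unscaled features (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=20.0, size=(300, n_raw_features))
y_toy = 0.01 * X_toy[:, 0] + rng.normal(size=300)

# Standardizing before the expansion keeps the polynomial terms on comparable scales
scaled_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('linear_model', LinearRegression()),
])
train_r2 = scaled_pipe.fit(X_toy, y_toy).score(X_toy, y_toy)
print(n_poly_features, train_r2)
```

With 11 inputs the expansion has 78 columns, including the bias term. Whether scaling changes the wine results would have to be checked on the actual data.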

### *Attempt to improve the model (achieve the best possible quality)*

When I was working on improving model performance, I came across an interesting method called the **Voting Regressor**.
(I tried many options — this one was the only one that showed a relatively substantial improvement.)

As I understand it, the **Voting Regressor** works as follows: it trains several models on the same dataset and then combines their predictions.
For example, in my implementation I used a simple **Linear Regression** and a **Random Forest** — two quite different approaches.
Instead of relying on just one of them, the **Voting Regressor** averages their predictions, which helps **reduce error and improve the model’s generalization ability**.

In [18]:
from sklearn.ensemble import VotingRegressor, RandomForestRegressor

voting_regressor = VotingRegressor(estimators=[
    ('lr', LinearRegression()),
    ('rf', RandomForestRegressor(n_estimators=100))
])
voting_regressor.fit(X_train, y_train)
y_pred_test_voting = voting_regressor.predict(X_test)

print('Voting Regressor R²:', r2_score(y_test, y_pred_test_voting))

Voting Regressor R²: 0.37916548267789696
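
To make the averaging concrete, the sketch below checks on synthetic data (an assumption, not the wine dataset) that an unweighted `VotingRegressor` predicts the plain mean of its fitted members' predictions. The small forest size and the fixed `random_state` are chosen only to keep the example fast and reproducible.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data with one nonlinear and one linear effect
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy[:, 0] ** 2 + X_toy[:, 1] + rng.normal(scale=0.1, size=120)

voter = VotingRegressor(estimators=[
    ('lr', LinearRegression()),
    ('rf', RandomForestRegressor(n_estimators=20, random_state=0)),
])
voter.fit(X_toy, y_toy)

# Predictions of each fitted member, one column per model
individual = np.column_stack([est.predict(X_toy) for est in voter.estimators_])
manual_average = individual.mean(axis=1)

# The ensemble prediction should match the column-wise mean
ensemble_pred = voter.predict(X_toy)
```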


### Data Preparation

In [20]:
sales = pd.read_csv('sales_train.csv.gz')
sales.columns = ['date', 'date_block_num', 'shop_id', 'item_id', 'item_price', 'target']
sales

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,target
0,02.01.2013,0,59,22154,999.00,1.0
1,03.01.2013,0,25,2552,899.00,1.0
2,05.01.2013,0,25,2552,899.00,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.00,1.0
...,...,...,...,...,...,...
2935844,10.10.2015,33,25,7409,299.00,1.0
2935845,09.10.2015,33,25,7460,299.00,1.0
2935846,14.10.2015,33,25,7459,349.00,1.0
2935847,22.10.2015,33,25,7440,299.00,1.0


In [21]:
index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = []
for block_num in sales['date_block_num'].unique():
    cur_shops = sales[sales['date_block_num']==block_num]['shop_id'].unique()
    cur_items = sales[sales['date_block_num']==block_num]['item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])),dtype='int32'))

#turn the grid into pandas dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)

#get aggregated values for (shop_id, item_id, month)
gb = sales.groupby(index_cols,as_index=False).agg({'target':'sum'})

#join aggregated data to the grid
all_data = pd.merge(grid,gb,how='left',on=index_cols).fillna(0)
#sort the data
all_data.sort_values(['date_block_num','shop_id','item_id'],inplace=True)

### Mean encodings 

#### Method 1

In [22]:
# Calculate a mapping: {item_id: target_mean}
item_id_target_mean = all_data.groupby('item_id').target.mean()

# In our non-regularized case we just *map* the computed means to the `item_id`'s
all_data['item_target_enc'] = all_data['item_id'].map(item_id_target_mean)

# Fill NaNs
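# Assign the result back: in-place fillna on a selected column is deprecated in recent pandas
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)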

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.4830386988621764


#### Method 2

In [23]:
'''
   Unlike `.target.mean()`, `transform` returns a Series
   with the same index as `all_data`.
   This single line of code is equivalent to the first two lines of Method 1.
'''
all_data['item_target_enc'] = all_data.groupby('item_id')['target'].transform('mean')

# Fill NaNs
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

# Print correlation
encoded_feature = all_data['item_target_enc'].values
print(np.corrcoef(all_data['target'].values, encoded_feature)[0][1])

0.4830386988621764


### KFold scheme (**1.25 points**)

You need to implement a KFold scheme with five folds. Use `KFold(5)` from `sklearn.model_selection`.

1. Split the data into **5 folds** using `sklearn.model_selection.KFold` with the parameter `shuffle=False`.
2. Iterate over the folds: use the **4 training folds** to compute the **mean target values by `item_id`**, and fill these values into the **validation fold** at each iteration.

Pay attention to **Method 1** from the example. In particular, study how the functions `map` and `pd.Series.map` work — they are quite useful in many situations.

In [25]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)
item_target_enc = np.zeros(all_data.shape[0])

for train_index, test_index in kf.split(all_data):
    train_data = all_data.iloc[train_index]
    test_data = all_data.iloc[test_index]
    
    item_id_target_mean = train_data.groupby('item_id')['target'].mean()
    item_target_enc[test_index] = test_data['item_id'].map(item_id_target_mean)

all_data['item_target_enc'] = item_target_enc
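# Items never seen in the training folds map to NaN; fill them with the global mean
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)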

correlation = np.corrcoef(all_data['target'].values, all_data['item_target_enc'].values)[0][1]
correlation

0.41645907127988446
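
The scheme is easier to follow on a tiny example. The sketch below applies the same out-of-fold logic to a made-up six-row frame with three folds. The frame `toy` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy data: two items alternating, made-up targets
toy = pd.DataFrame({
    'item_id': [1, 2, 1, 2, 1, 2],
    'target': [1.0, 0.0, 3.0, 0.0, 5.0, 2.0],
})

# Start from NaN so that rows never filled are easy to spot
oof_enc = np.full(len(toy), np.nan)

for train_idx, val_idx in KFold(n_splits=3, shuffle=False).split(toy):
    # Means are computed on the training folds only...
    fold_means = toy.iloc[train_idx].groupby('item_id')['target'].mean()
    # ...and written into the held-out fold
    oof_enc[val_idx] = toy.iloc[val_idx]['item_id'].map(fold_means).to_numpy()

print(oof_enc)  # [4. 1. 3. 1. 2. 0.]
```

Each row's encoding is built without its own target value, which is the point of the scheme.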

### Leave-one-out scheme (**1.25 points**)

You need to implement a **leave-one-out** scheme. Note: if you run the code from the first task with the number of folds equal to the dataset size, you might get the correct result, but you’ll be waiting **a very, very long time**.

For a faster way to compute the **mean target for all objects except one**, you can:

1. Compute the **total (sum) of the target** over all objects.
2. Subtract the target of the specific object and divide the result by `n_objects - 1`.

Note that step **1** should be done for **all** objects. Also, step **2** can be implemented **without** `for` loops.

The `.transform` function from **Method 2** in the example may be useful here.


In [27]:
total_target_sum = all_data.groupby('item_id')['target'].transform('sum')
n_objects = all_data.groupby('item_id')['target'].transform('size')

all_data['item_target_enc'] = (total_target_sum - all_data['target']) / (n_objects - 1)
# Items with a single row give 0 / 0 = NaN; fill them with the global mean
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)

zakodirovannaya = all_data['item_target_enc']
correlation = np.corrcoef(all_data['target'], zakodirovannaya)[0, 1]
correlation

0.4803848311293092
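
As a sanity check of the vectorized formula, the sketch below compares it with a brute-force leave-one-out loop on a made-up six-row frame. The frame `toy` is an illustrative assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    'item_id': [1, 2, 1, 2, 1, 2],
    'target': [1.0, 0.0, 3.0, 0.0, 5.0, 2.0],
})

# Vectorized version: (group sum - own target) / (group size - 1)
group_sum = toy.groupby('item_id')['target'].transform('sum')
group_size = toy.groupby('item_id')['target'].transform('size')
loo_fast = (group_sum - toy['target']) / (group_size - 1)

# Brute force: drop each row in turn and average the remaining rows of the same item
loo_slow = []
for i in toy.index:
    others = toy.drop(index=i)
    same_item = others[others['item_id'] == toy.at[i, 'item_id']]
    loo_slow.append(same_item['target'].mean())

print(loo_fast.tolist())  # [4.0, 1.0, 3.0, 1.0, 2.0, 0.0]
```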

### Smoothing (**1.25 points**)

In [28]:
localmean = all_data.groupby('item_id').target.mean()  
nrows = all_data.groupby('item_id').target.count()

all_data['item_target_enc'] = all_data['item_id'].map((localmean * nrows + 0.3343 * 100) / (nrows + 100))

all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)
zakodirovannaya = all_data['item_target_enc'].values

correlation = np.corrcoef(all_data['target'].values, zakodirovannaya)[0][1]
print(f'Correlation: {correlation:.4f}')

Correlation: 0.4818


Expected answer: 0.4818
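
The smoothing formula blends each item's mean with the global mean, weighted by how many rows the item has. The sketch below isolates that formula in a hypothetical helper, `smoothed_mean`, using the same constants as the cell above (global mean 0.3343, smoothing weight 100) and made-up item statistics.

```python
def smoothed_mean(local_mean, n_rows, global_mean=0.3343, alpha=100):
    """Weighted blend of the item's mean and the global mean."""
    return (local_mean * n_rows + global_mean * alpha) / (n_rows + alpha)


# A rare item is pulled almost entirely to the global mean...
rare = smoothed_mean(local_mean=10.0, n_rows=1)
# ...while a frequent item keeps nearly its own mean
frequent = smoothed_mean(local_mean=10.0, n_rows=10_000)
# With no rows at all, the encoding is just the global mean
unseen = smoothed_mean(local_mean=0.0, n_rows=0)

print(round(rare, 4), round(frequent, 4), round(unseen, 4))
```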

### Expanding mean scheme (**1.25 points**)

You need to implement the *expanding mean* scheme. The idea is to walk through the dataset sorted in a particular order (the dataset is sorted at the very beginning of the assignment) and, to compute the encoding for row $m$, use only rows $0$ through $m-1$. You will need the pandas functions [`cumsum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.cumsum.html) and [`cumcount`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html).

In [10]:
all_data = all_data.sort_values(['date_block_num', 'shop_id', 'item_id'])

cumulative_sum = all_data.groupby('item_id')['target'].cumsum()
previous_target_sum = cumulative_sum - all_data['target']
item_counts = all_data.groupby('item_id')['target'].cumcount()

all_data['item_target_enc'] = previous_target_sum / item_counts
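# The first occurrence of each item gives 0 / 0 = NaN; fill it with the global mean
all_data['item_target_enc'] = all_data['item_target_enc'].fillna(0.3343)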

zakodirovannaya = all_data['item_target_enc'].values  

correlation = np.corrcoef(all_data['target'].values, zakodirovannaya)[0][1] 
print(f'Correlation: {correlation:.4f}')

Correlation: 0.5025


Expected answer: 0.5025
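
The same computation on a made-up six-row frame shows what each row sees. The frame `toy` is an illustrative assumption, and its row order stands in for the sorted dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'item_id': [1, 2, 1, 2, 1, 2],
    'target': [1.0, 0.0, 3.0, 0.0, 5.0, 2.0],
})

# Sum of the targets of earlier rows with the same item (own target excluded)
prev_sum = toy.groupby('item_id')['target'].cumsum() - toy['target']
# Number of earlier rows with the same item
prev_count = toy.groupby('item_id').cumcount()

# The first occurrence of each item has no history, so 0 / 0 gives NaN
expanding_enc = prev_sum / prev_count
print(expanding_enc.tolist())  # [nan, nan, 1.0, 0.0, 2.0, 0.0]
```

Unlike the leave-one-out scheme, a row here never uses targets from rows that come after it in the sort order.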