# 1. Theory

## 1.a Analytical solution of the regression problem: $\theta=(X^TX)^{-1}X^Ty$
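
The closed-form solution can be checked numerically. The following is a minimal sketch, assuming a small synthetic, noise-free dataset; the helper name `normal_equation` and the toy data are illustrative and not part of the original notebook. It solves the linear system $(X^TX)\theta = X^Ty$ rather than forming the inverse explicitly, which is usually the numerically safer choice.

```python
import numpy as np

def normal_equation(X, y):
    # Solve (X^T X) theta = X^T y; equivalent to theta = (X^T X)^{-1} X^T y
    # when X^T X is invertible, but avoids computing the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(21)
# Design matrix with an explicit column of ones for the intercept.
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true  # noise-free target, so the solution should be recovered

theta_hat = normal_equation(X, y)
print(theta_hat)
```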

## 1.b-c Adding regularization:
Regularization adds to the loss function a term that depends on the model parameters.

For L1 (Lasso):
$\min_{\theta} \left( \sum_{i=1}^{N}L(f(x_i,\theta),y_i)+\lambda \sum_{j=1}^{M}|\theta_j| \right)$


Main idea: we add to the standard loss a "penalty" proportional to the sum of the absolute values of the model weights (the L1 norm). This keeps only the important features (the weights of the others are driven to exactly zero) and thereby helps against overfitting. The larger the parameter $\lambda$, the more strongly the weights are shrunk.

For L2 (Ridge):
$\min_{\theta} \left( \sum_{i=1}^{N}L(f(x_i,\theta),y_i)+\lambda \sum_{j=1}^{M}\theta_j^2 \right)$

Main idea: Ridge regularization adds to the standard loss a "penalty" proportional to the sum of the squared model weights (the squared L2 norm). The model must then not only predict the target well but also do so with weights that are as small as possible, which prevents them from growing excessively. Like Lasso, Ridge fights overfitting by penalizing large weights. Unlike Lasso, it shrinks the weights toward zero but almost never sets them exactly to zero. Ridge regularization also stabilizes the estimates, making the model less sensitive to small changes in the training set (which helps against multicollinearity).
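
The difference between the two penalties is easiest to see in a special case. A minimal sketch, assuming $L$ is the squared error and the columns of $X$ are orthonormal ($X^TX=I$): the Lasso solution is then a soft-thresholding of the unregularized solution with threshold $\lambda/2$, and the Ridge solution is the unregularized solution divided by $1+\lambda$. The function names and the example vector `theta_ols` are hypothetical.

```python
import numpy as np

def lasso_orthonormal(theta_ols, lam):
    # Minimizer of ||y - X theta||^2 + lam * sum(|theta_j|) when X^T X = I:
    # each coordinate is moved toward zero by lam / 2 and clipped at zero.
    return np.sign(theta_ols) * np.maximum(np.abs(theta_ols) - lam / 2.0, 0.0)

def ridge_orthonormal(theta_ols, lam):
    # Minimizer of ||y - X theta||^2 + lam * sum(theta_j^2) when X^T X = I:
    # each coordinate is scaled by the same factor 1 / (1 + lam).
    return theta_ols / (1.0 + lam)

theta_ols = np.array([3.0, -0.4, 0.2, 1.5])  # assumed unregularized weights
lam = 1.0

theta_lasso = lasso_orthonormal(theta_ols, lam)
theta_ridge = ridge_orthonormal(theta_ols, lam)

print("Lasso:", theta_lasso)  # small weights become exactly zero
print("Ridge:", theta_ridge)  # all weights shrink, none become zero
```

In this sketch the two small weights fall below the Lasso threshold and are zeroed, whereas Ridge only rescales them.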



## 1.d Nonlinear dependencies:
If we want to keep the same model but also capture nonlinear dependencies, there are several ways to do it. They share one idea: the model itself is not changed. Instead, we transform the original data and build new features from it. The model remains linear in its parameters, but those parameters now multiply nonlinear functions of the original data, so the model can approximate nonlinear dependencies.
- Polynomial Features:
We create new features by raising the original feature to powers.
- Splines:
We split the domain of the variable x into several intervals (bins) and approximate the dependence on each interval with its own polynomial (here, a linear one). The pieces are joined at the breakpoints (knots) so that the resulting curve is continuous; higher-degree splines additionally make it smooth. This gives a set of basis functions, one group per interval, and the linear model learns the weights for this set.
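
Both approaches can be sketched with plain NumPy. This is an illustrative sketch under stated assumptions: the helper names `polynomial_design` and `linear_spline_design` are hypothetical, the linear spline uses the hinge basis $\max(x-k, 0)$ for each knot $k$, and the targets are noise-free toy functions. In both cases the fit itself is ordinary linear least squares on transformed features.

```python
import numpy as np

def polynomial_design(x, degree):
    # Columns: x^0, x^1, ..., x^degree
    return np.vander(x, N=degree + 1, increasing=True)

def linear_spline_design(x, knots):
    # Columns: 1, x, and one hinge max(x - k, 0) per knot k.
    # Any continuous piecewise-linear function with these knots
    # is a linear combination of these columns.
    hinges = [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack([np.ones_like(x), x] + hinges)

x = np.linspace(-2.0, 2.0, 41)

# Quadratic target fitted by a linear model on polynomial features.
y_quadratic = 1.0 + 2.0 * x - 3.0 * x ** 2
coef_poly, *_ = np.linalg.lstsq(polynomial_design(x, 2), y_quadratic, rcond=None)

# |x| fitted by a linear model on a linear-spline basis with one knot at 0.
y_abs = np.abs(x)
coef_spline, *_ = np.linalg.lstsq(linear_spline_design(x, [0.0]), y_abs, rcond=None)

print(coef_poly)    # weights for 1, x, x^2
print(coef_spline)  # weights for 1, x, max(x, 0)
```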

# 2. Introduction

## 2.a

In [1]:
import pandas as pd
import numpy as np
import re
import sklearn
import lightgbm
import scipy
import statsmodels
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler  
from sklearn.tree import DecisionTreeRegressor

## 2.b

In [2]:
df_train = pd.read_json('train.json')

## 2.c

In [3]:
df_train['interest_level_enc'] = df_train['interest_level'].map({'low': 0, 'medium': 1, 'high': 2})

# 3. Data analysis

## 3.a

In [4]:
df_train['features']

4         [Dining Room, Pre-War, Laundry in Building, Di...
6         [Doorman, Elevator, Laundry in Building, Dishw...
9         [Doorman, Elevator, Laundry in Building, Laund...
10                                                       []
15        [Doorman, Elevator, Fitness Center, Laundry in...
                                ...                        
124000              [Elevator, Dishwasher, Hardwood Floors]
124002    [Common Outdoor Space, Cats Allowed, Dogs Allo...
124004    [Dining Room, Elevator, Pre-War, Laundry in Bu...
124008    [Pre-War, Laundry in Unit, Dishwasher, No Fee,...
124009    [Dining Room, Elevator, Laundry in Building, D...
Name: features, Length: 49352, dtype: object

## 3.b Remove unused characters ([, ], ', " and space) from the column.


In [5]:
def clean_features(feature_list):
    if not feature_list:  
        return pd.NA  
    cleaned_list = []
    PAIRS = {
        '[': ']',
        '"': '"',
        "'": "'"
    }
    for item in feature_list:
        if item is pd.NA:
            continue  
        if not isinstance(item, str):
            item = str(item)
        for open_char, close_char in PAIRS.items():
            if open_char == close_char:
                quote_count = item.count(open_char)
                if quote_count % 2 != 0:  
                    item = item.replace(open_char, '')
            else:
                open_count = item.count(open_char)
                close_count = item.count(close_char)
                if open_count != close_count: 
                    item = item.replace(open_char, '')
                    item = item.replace(close_char, '')
        item = re.sub(r'\s+', ' ', item).strip()
        cleaned_list.append(item)
    return cleaned_list if cleaned_list else pd.NA

In [6]:
df_train['features'] = df_train['features'].apply(lambda x: clean_features(x) if isinstance(x, list) else x)
df_train['features']

4         [Dining Room, Pre-War, Laundry in Building, Di...
6         [Doorman, Elevator, Laundry in Building, Dishw...
9         [Doorman, Elevator, Laundry in Building, Laund...
10                                                     <NA>
15        [Doorman, Elevator, Fitness Center, Laundry in...
                                ...                        
124000              [Elevator, Dishwasher, Hardwood Floors]
124002    [Common Outdoor Space, Cats Allowed, Dogs Allo...
124004    [Dining Room, Elevator, Pre-War, Laundry in Bu...
124008    [Pre-War, Laundry in Unit, Dishwasher, No Fee,...
124009    [Dining Room, Elevator, Laundry in Building, D...
Name: features, Length: 49352, dtype: object

## 3.c Collect all values from every list and merge the results into one large list for the whole dataset.

In [7]:
all_features = []
for _, row in df_train.iterrows():
    features = row['features']
    if isinstance(features, list):
        all_features.extend(features)
all_features[:10]

['Dining Room',
 'Pre-War',
 'Laundry in Building',
 'Dishwasher',
 'Hardwood Floors',
 'Dogs Allowed',
 'Cats Allowed',
 'Doorman',
 'Elevator',
 'Laundry in Building']

In [8]:
print(len(all_features))

267906


## 3.d How many unique values does the resulting list contain?

In [9]:
unique_values = np.unique(all_features)
num_unique = len(unique_values)
print(num_unique)

1552


## 3.f

In [10]:
feature_counts = Counter(all_features)
top_20_features = [feature for feature, _ in feature_counts.most_common(20)]
top_20_features 

['Elevator',
 'Cats Allowed',
 'Hardwood Floors',
 'Dogs Allowed',
 'Doorman',
 'Dishwasher',
 'No Fee',
 'Laundry in Building',
 'Fitness Center',
 'Pre-War',
 'Laundry in Unit',
 'Roof Deck',
 'Outdoor Space',
 'Dining Room',
 'High Speed Internet',
 'Balcony',
 'Swimming Pool',
 'Laundry In Building',
 'New Construction',
 'Terrace']
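
The top-20 list contains both 'Laundry in Building' and 'Laundry In Building', so names that differ only in letter case are counted as separate features. If merging them were desired (the notebook does not do this), a case-insensitive count could look like the sketch below. The function name and the sample input are hypothetical.

```python
from collections import Counter

def count_features_case_insensitive(feature_lists):
    # Count feature names ignoring case and surrounding whitespace.
    # The first spelling seen for each name is kept for display.
    counts = Counter()
    display = {}
    for features in feature_lists:
        if not isinstance(features, list):
            continue  # skip missing values such as pd.NA or None
        for name in features:
            key = name.strip().lower()
            display.setdefault(key, name.strip())
            counts[key] += 1
    return Counter({display[key]: n for key, n in counts.items()})

sample = [
    ["Elevator", "Laundry in Building"],
    ["Laundry In Building"],
    None,
    ["elevator"],
]
merged = count_features_case_insensitive(sample)
print(merged.most_common(2))
```

With the real data the call would take `df_train['features']` as its argument.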

## 3.h

In [11]:
for feature in top_20_features:
    df_train[f'{feature}'] = df_train['features'].apply(
        lambda x: 1 if isinstance(x, list) and feature in x else 0
    )
df_train

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,...,Laundry in Unit,Roof Deck,Outdoor Space,Dining Room,High Speed Internet,Balcony,Swimming Pool,Laundry In Building,New Construction,Terrace
4,1.0,1,8579a0b0d54db803821a35a4a615e97a,2016-06-16 05:55:27,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,145 Borinquen Place,"[Dining Room, Pre-War, Laundry in Building, Di...",40.7108,7170325,-73.9539,...,0,0,0,1,0,0,0,0,0,0
6,1.0,2,b8e75fc949a6cd8225b455648a951712,2016-06-01 05:44:33,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,East 44th,"[Doorman, Elevator, Laundry in Building, Dishw...",40.7513,7092344,-73.9722,...,0,0,0,0,0,0,0,0,0,0
9,1.0,2,cd759a988b8f23924b5a2058d5ab2b49,2016-06-14 15:19:59,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,East 56th Street,"[Doorman, Elevator, Laundry in Building, Laund...",40.7575,7158677,-73.9625,...,1,0,0,0,0,0,0,0,0,0
10,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,,40.7145,7211212,-73.9425,...,0,0,0,0,0,0,0,0,0,0
15,1.0,0,bfb9405149bfff42a92980b594c28234,2016-06-28 03:50:23,Over-sized Studio w abundant closets. Availabl...,East 34th Street,"[Doorman, Elevator, Fitness Center, Laundry in...",40.7439,7225292,-73.9743,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124000,1.0,3,92bbbf38baadfde0576fc496bd41749c,2016-04-05 03:58:33,There is 700 square feet of recently renovated...,W 171 Street,"[Elevator, Dishwasher, Hardwood Floors]",40.8433,6824800,-73.9396,...,0,0,0,0,0,0,0,0,0,0
124002,1.0,2,5565db9b7cba3603834c4aa6f2950960,2016-04-02 02:25:31,"2 bedroom apartment with updated kitchen, rece...",Broadway,"[Common Outdoor Space, Cats Allowed, Dogs Allo...",40.8198,6813268,-73.9578,...,0,0,0,0,0,0,0,1,0,0
124004,1.0,1,67997a128056ee1ed7d046bbb856e3c7,2016-04-26 05:42:03,No Brokers Fee * Never Lived 1 Bedroom 1 Bathr...,210 Brighton 15th St,"[Dining Room, Elevator, Pre-War, Laundry in Bu...",40.5765,6927093,-73.9554,...,1,0,0,1,0,0,0,0,0,0
124008,1.0,2,3c0574a740154806c18bdf1fddd3d966,2016-04-19 02:47:33,Wonderful Bright Chelsea 2 Bedroom apartment o...,West 21st Street,"[Pre-War, Laundry in Unit, Dishwasher, No Fee,...",40.7448,6892816,-74.0017,...,1,0,1,0,0,0,0,0,0,0


In [12]:
df_filter_tr=df_train['price']
lower_tr = df_filter_tr.quantile(0.01) 

upper_tr = df_filter_tr.quantile(0.99) 

df_train = df_train[ (df_train['price'] <= upper_tr)] 


# 4. Linear regression

## 4.a 

In [13]:
np.random.seed(21)

## 4.b Linear regression

In [14]:
class Linreg:
    def __init__(self, learning_rate=0.001, n_iter=100, random_state=21):
        self.learning_rate = learning_rate
        self.n_iter = n_iter
        self.random_state = random_state
        self.weights = None
        self.bias = None
        
    def fit(self,X,y):
        np.random.seed(self.random_state)
        n_samples,n_features=X.shape
        self.weights= np.random.randn(n_features)
        self.bias=0
        X = X.to_numpy() if hasattr(X, 'to_numpy') else np.array(X)
        y = y.to_numpy() if hasattr(y, 'to_numpy') else np.array(y)
        for _ in range(self.n_iter):
            idxs= np.random.permutation(n_samples)
            for idx in idxs:
                y_pred = X[idx] @ self.weights + self.bias
                dw = -2 * X[idx] *(y[idx] - y_pred)
                db = -2 * (y[idx] - y_pred)
                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db
                
    def predict(self,X):
        X=X.to_numpy() if hasattr(X, 'to_numpy') else np.array(X)
        y=X@self.weights+self.bias
        return y

    def deterministic_sgd(self,X,y):
        # Same per-sample updates as fit(), but samples are visited in a fixed order
        n_samples, n_features = X.shape
        # Start from zeros so the result does not depend on an earlier fit() call
        self.weights = np.zeros(n_features)
        self.bias = 0
        X = X.to_numpy() if hasattr(X, 'to_numpy') else np.array(X)
        y = y.to_numpy() if hasattr(y, 'to_numpy') else np.array(y)
        for _ in range(self.n_iter):
            for i in range(n_samples): 
                y_pred = X[i] @ self.weights + self.bias
                dw = -2 * X[i] *(y[i] - y_pred)
                db = -2 * (y[i] - y_pred)
                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db
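
The update rule in the class follows from the per-sample squared error $(y - (w^Tx + b))^2$, whose gradients are $-2x(y-\hat{y})$ with respect to $w$ and $-2(y-\hat{y})$ with respect to $b$. A minimal sketch that compares these formulas with central finite differences; the helper names and the random toy values are assumptions for illustration.

```python
import numpy as np

def sample_loss(w, b, x, y):
    # Squared error for a single sample
    return (y - (x @ w + b)) ** 2

def sample_gradients(w, b, x, y):
    # Analytical gradients used in the SGD update
    residual = y - (x @ w + b)
    return -2.0 * x * residual, -2.0 * residual

rng = np.random.default_rng(21)
x = rng.normal(size=4)
w = rng.normal(size=4)
b, y = 0.3, 1.7

dw, db = sample_gradients(w, b, x, y)

# Central finite differences, one coordinate at a time.
eps = 1e-6
dw_numeric = np.array([
    (sample_loss(w + eps * e, b, x, y) - sample_loss(w - eps * e, b, x, y)) / (2 * eps)
    for e in np.eye(4)
])
db_numeric = (sample_loss(w, b + eps, x, y) - sample_loss(w, b - eps, x, y)) / (2 * eps)

print(np.max(np.abs(dw - dw_numeric)), abs(db - db_numeric))
```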

## 4.d Implement a function to compute $R^2$

## Formula for $R^2$: $R^2=1-\frac{SS_{res}}{SS_{tot}}$, where $SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the model error (residual sum of squares) and $SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares (proportional to the variance of the target)

In [15]:
def r2(y, y_pred):
    SS_res = np.sum((y - y_pred) ** 2)
    SS_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - SS_res / SS_tot
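
Three boundary cases help to interpret the metric: a perfect prediction gives $R^2 = 1$, always predicting the mean of the target gives $R^2 = 0$, and a model worse than the mean gives a negative value. A minimal self-contained sketch; the name `r2_score_manual` and the toy vector are illustrative.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # model error
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # spread around the mean
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2_score_manual(y_true, y_true)
r2_mean = r2_score_manual(y_true, np.full_like(y_true, y_true.mean()))
r2_bad = r2_score_manual(y_true, y_true[::-1])  # predictions in reversed order

print(r2_perfect, r2_mean, r2_bad)
```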

## 4.e Prediction with the implemented class

In [16]:
model = Linreg()
feature_list = [
    'bathrooms','bedrooms','Elevator', 'Cats Allowed', 'Hardwood Floors', 'Dogs Allowed', 
    'Doorman', 'Dishwasher', 'No Fee', 'Laundry in Building', 
    'Fitness Center', 'Pre-War', 'Laundry in Unit', 'Roof Deck',
    'Outdoor Space', 'Dining Room', 'High Speed Internet', 'Balcony',
    'Swimming Pool', 'Laundry In Building', 'New Construction', 'Terrace'
]
X = df_train[feature_list]  
y = df_train['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
model.fit(X_train, y_train)
y_pred_sgd_tr=model.predict(X_train)
y_pred_sgd_tt = model.predict(X_test)

In [17]:
print(f'R2: {r2(y_test,y_pred_sgd_tt)}')
print(f'MAE: {mean_absolute_error(y_test,y_pred_sgd_tt)}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_sgd_tt))}')

R2: 0.5685403963191817
MAE: 731.6765635396449
RMSE: 1067.2323774689633


In [18]:
model.deterministic_sgd(X_train,y_train)
y_pred_dsgd_tt = model.predict(X_test)

In [19]:
print(f'R2: {r2(y_test,y_pred_dsgd_tt)}')
print(f'MAE: {mean_absolute_error(y_test,y_pred_dsgd_tt)}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_dsgd_tt))}')

R2: 0.5671367403683487
MAE: 718.1774366883362
RMSE: 1068.9669670687688


## 4.f

In [20]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_sklinreg_tt = lr.predict(X_test)
print(f'R2: {r2(y_test, y_pred_sklinreg_tt)}')
print(f'MAE: {mean_absolute_error(y_test, y_pred_sklinreg_tt)}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_sklinreg_tt))}')

R2: 0.5695456320124295
MAE: 725.6702506334775
RMSE: 1065.9884073778624


In [21]:
result_MAE = pd.DataFrame(columns=['model', 'train', 'test'])
result_RMSE = pd.DataFrame(columns=['model', 'train', 'test'])
result_R2 = pd.DataFrame(columns=['model', 'train', 'test'])

In [22]:
lr_mae_train = mean_absolute_error(y_train, y_pred_sgd_tr)
lr_mae_test = mean_absolute_error(y_test, y_pred_sgd_tt)

lr_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_sgd_tr))
lr_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_sgd_tt))

lr_r2_train = r2(y_train, y_pred_sgd_tr)
lr_r2_test = r2(y_test, y_pred_sgd_tt)

result_MAE.loc[len(result_MAE)] = {
    'model': 'linear regression',
    'train': lr_mae_train,
    'test': lr_mae_test
}

result_RMSE.loc[len(result_RMSE)] = {
    'model': 'linear regression',
    'train': lr_rmse_train,
    'test': lr_rmse_test
}

result_R2.loc[len(result_R2)] = {
    'model': 'linear regression',
    'train': lr_r2_train,
    'test': lr_r2_test
}

In [23]:
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564


In [24]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232377


In [25]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854


# 5. Implementing regularized models: Ridge, Lasso, ElasticNet

In [26]:
import numpy as np
class Linreg:
    def __init__(self, learning_rate=0.001, n_iter=100, random_state=21,
                 reg_type=None, alpha=0.1, l1_ratio=0.5):
        self.learning_rate = learning_rate
        self.n_iter = n_iter
        self.random_state = random_state
        self.reg_type = reg_type
        self.alpha = alpha
        self.l1_ratio = l1_ratio
        self.weights = None
        self.bias = None
        
    def _add_regularization(self, dw):
        if self.reg_type == 'Ridge':
            dw += self.alpha * self.weights
        elif self.reg_type == 'Lasso':
            dw += self.alpha * np.sign(self.weights)
        elif self.reg_type == 'ElasticNet':
            dw += (self.alpha * self.l1_ratio * np.sign(self.weights) + 
                   self.alpha * (1 - self.l1_ratio) * self.weights)
        return dw
        
    def fit(self, X, y):
        np.random.seed(self.random_state)
        n_samples, n_features = X.shape
        self.weights = np.random.randn(n_features)
        self.bias = 0
        X = X.to_numpy() if hasattr(X, 'to_numpy') else np.array(X)
        y = y.to_numpy() if hasattr(y, 'to_numpy') else np.array(y)
        for _ in range(self.n_iter):
            idxs = np.random.permutation(n_samples)
            for idx in idxs:
                y_pred = X[idx] @ self.weights + self.bias
                dw = -2 * X[idx] * (y[idx] - y_pred)
                db = -2 * (y[idx] - y_pred)
                dw = self._add_regularization(dw)
                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db
        
    def deterministic_sgd(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        X = X.to_numpy() if hasattr(X, 'to_numpy') else np.array(X)
        y = y.to_numpy() if hasattr(y, 'to_numpy') else np.array(y)
        for _ in range(self.n_iter):
            for i in range(n_samples):
                y_pred = X[i] @ self.weights + self.bias
                dw = -2 * X[i] * (y[i] - y_pred)
                db = -2 * (y[i] - y_pred)
                dw = self._add_regularization(dw)
                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db
                
    def predict(self, X):
        X = X.to_numpy() if hasattr(X, 'to_numpy') else np.array(X)
        return X @ self.weights + self.bias
    
    def set_regularization(self, reg_type, alpha=0.1, l1_ratio=0.5):
        self.reg_type = reg_type
        self.alpha = alpha
        self.l1_ratio = l1_ratio

## 5.b Prediction with the algorithm and model evaluation using MAE, RMSE, and R2.

## Lasso

In [27]:
model=Linreg(learning_rate=0.01,n_iter=100,reg_type='Lasso', alpha=0.1,l1_ratio=1)
model.fit(X_train,y_train)
y_pred_tt=model.predict(X_test)
y_pred_tr=model.predict(X_train)
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'Lasso',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'Lasso',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'Lasso',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154


In [28]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232377
1,Lasso,1059.277123,1087.95425


In [29]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623


In [30]:
model=Lasso(alpha=0.1,max_iter=100)
model.fit(X_train,y_train)
y_lasso_pred_tt=model.predict(X_test)
y_lasso_pred_tr=model.predict(X_train)
print(f'R2: {r2(y_test,y_lasso_pred_tt)}')
print(f'MAE: {mean_absolute_error(y_test,y_lasso_pred_tt)}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_lasso_pred_tt))}')

R2: 0.569540156445242
MAE: 725.6146776583989
RMSE: 1065.9951872748943


In [31]:
print(f'R2: {r2(y_train,y_lasso_pred_tr)}')
print(f'MAE: {mean_absolute_error(y_train,y_lasso_pred_tr)}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_train, y_lasso_pred_tr))}')

R2: 0.5838550308315844
MAE: 712.9159688073073
RMSE: 1032.0980385894438


## Ridge

In [32]:
model=Linreg(learning_rate=0.01,n_iter=100,reg_type='Ridge', alpha=0.1,l1_ratio=0.001)
model.fit(X_train,y_train)
y_pred_tt=model.predict(X_test)
y_pred_tr=model.predict(X_train)
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'Ridge',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'Ridge',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'Ridge',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903


In [33]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232377
1,Lasso,1059.277123,1087.95425
2,Ridge,1072.487348,1101.423083


In [34]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623
2,Ridge,0.550648,0.540452


In [35]:
model=Ridge(alpha=0.1,max_iter=100)
model.fit(X_train,y_train)
y_ridge_pred_tt=model.predict(X_test)
y_ridge_pred_tr=model.predict(X_train)
print(f'R2: {r2(y_test,y_ridge_pred_tt)}')
print(f'MAE: {mean_absolute_error(y_test,y_ridge_pred_tt)}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_ridge_pred_tt))}')

R2: 0.5695456742056093
MAE: 725.6698540255635
RMSE: 1065.9883551337164


In [36]:
print(f'R2: {r2(y_train,y_ridge_pred_tr)}')
print(f'MAE: {mean_absolute_error(y_train,y_ridge_pred_tr)}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_train, y_ridge_pred_tr))}')

R2: 0.5838566374177008
MAE: 712.9791811017221
RMSE: 1032.0960463077838


## ElasticNet

In [37]:
model=Linreg(learning_rate=0.01,n_iter=100,reg_type='ElasticNet', alpha=0.1,l1_ratio=0.5)
model.fit(X_train,y_train)
y_pred_tt=model.predict(X_test)
y_pred_tr=model.predict(X_train)
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'ElasticNet',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'ElasticNet',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'ElasticNet',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164


In [38]:
model=ElasticNet(alpha=0.1,max_iter=100,l1_ratio=0.5)
model.fit(X_train,y_train)
y_elastic_net_pred_tt=model.predict(X_test)
y_elastic_net_pred_tr=model.predict(X_train)
print(f'R2: {r2(y_test,y_elastic_net_pred_tt)}')
print(f'MAE: {mean_absolute_error(y_test,y_elastic_net_pred_tt)}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_elastic_net_pred_tt))}')

R2: 0.5587793865971715
MAE: 729.9192818012281
RMSE: 1079.2369830648886


In [39]:
print(f'R2: {r2(y_train,y_elastic_net_pred_tr)}')
print(f'MAE: {mean_absolute_error(y_train,y_elastic_net_pred_tr)}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_train, y_elastic_net_pred_tr))}')

R2: 0.5722513656603003
MAE: 717.2566998423132
RMSE: 1046.3884665871433


# 6. Feature normalization

## 6.a Normalization is the process of rescaling numeric data so that all features lie in comparable ranges. Whether it is needed depends directly on the model used and on the properties of the data. It is needed in order to:
- Prevent features with a large original range from dominating features with a small range in algorithms based on distances or gradients.
- Improve the speed and stability of convergence.
- Make the data better suited to scale-sensitive methods.
## It is applied with algorithms that compute distances or use gradient descent:
Example: the k-nearest neighbors method (KNN). It assigns a new object to the class held by the majority of its k nearest neighbors in feature space: the algorithm computes the distance from the new object to every object in the training set and takes a majority vote among the k closest ones. Because KNN computes distances, the features must be on a common scale; otherwise the feature with the largest absolute values will swamp all the others.
## It is not applied with tree-based algorithms, because they are insensitive to the scale of the data:
Example: Decision Tree, Random Forest. These algorithms make decisions based on the ordering of values, not on their absolute magnitude, so normalization does not affect model quality.
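
The effect on distance-based methods can be illustrated with a toy example. A minimal sketch, assuming two features on very different scales (a price-like column and a bathrooms-like column); the numbers are made up for illustration. The nearest neighbour of the query changes once both columns are min-max scaled.

```python
import numpy as np

data = np.array([
    [1000.0, 1.0],  # query listing: price, bathrooms
    [1050.0, 4.0],  # candidate 0: similar price, very different bathrooms
    [1200.0, 1.0],  # candidate 1: same bathrooms, somewhat higher price
    [3000.0, 2.0],  # extra point that widens the price range
])

def nearest(query, candidates):
    # Index of the candidate with the smallest Euclidean distance to the query
    distances = np.linalg.norm(candidates - query, axis=1)
    return int(np.argmin(distances))

# On raw data the price column dominates the distance.
nearest_raw = nearest(data[0], data[1:3])

# After min-max scaling both columns contribute on a comparable scale.
col_min, col_max = data.min(axis=0), data.max(axis=0)
scaled = (data - col_min) / (col_max - col_min)
nearest_scaled = nearest(scaled[0], scaled[1:3])

print(nearest_raw, nearest_scaled)
```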

## 6.b Mathematical formula for MinMaxScaler normalization: $X_{norm}=\frac{X-X_{min}}{X_{max}-X_{min}}$

In [40]:
class minmaxscaler:
    def __init__(self, feature_range=(0, 1)):
        self.feature_range = feature_range
        self.min_ = None
        self.max_ = None

    def fit(self, X):
        self.min_ = np.min(X, axis=0)
        self.max_ = np.max(X, axis=0)
        return self

    def transform(self, X):
        X_scaled = (X - self.min_) / (self.max_ - self.min_)
        a, b = self.feature_range
        return X_scaled * (b - a) + a

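    def reverse(self, X):
        # Undo the feature_range mapping first, then the min-max scaling
        a, b = self.feature_range
        return (X - a) / (b - a) * (self.max_ - self.min_) + self.min_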

    def fit_transform(self, X):
        return self.fit(X).transform(X)
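
The scaling formula and its inverse can be verified on a toy array. A minimal sketch, assuming a non-default target range $(a, b) = (-1, 1)$: scaling maps each column to $[a, b]$, and the inverse first undoes the range mapping and then the min-max step. The function names and data are illustrative.

```python
import numpy as np

def min_max_scale(X, x_min, x_max, feature_range=(0.0, 1.0)):
    a, b = feature_range
    return (X - x_min) / (x_max - x_min) * (b - a) + a

def min_max_inverse(X_scaled, x_min, x_max, feature_range=(0.0, 1.0)):
    a, b = feature_range
    return (X_scaled - a) / (b - a) * (x_max - x_min) + x_min

X_demo = np.array([[1.0, 10.0], [2.0, 30.0], [4.0, 20.0]])
x_min, x_max = X_demo.min(axis=0), X_demo.max(axis=0)

X_scaled = min_max_scale(X_demo, x_min, x_max, feature_range=(-1.0, 1.0))
X_back = min_max_inverse(X_scaled, x_min, x_max, feature_range=(-1.0, 1.0))

print(X_scaled)
print(X_back)
```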

## 6.e 

In [41]:
minmax = minmaxscaler()
minmaxnorm= minmax.fit_transform(X_test.copy())
MinMax=MinMaxScaler().fit_transform(X_test.copy())
print(minmaxnorm.iloc[:3, :5])
print(pd.DataFrame(MinMax, columns=X_test.copy().columns).iloc[:3, :5])

       bathrooms  bedrooms  Elevator  Cats Allowed  Hardwood Floors
51074        0.1  0.333333       1.0           0.0              1.0
67862        0.1  0.166667       1.0           1.0              1.0
70185        0.2  0.166667       0.0           1.0              0.0
   bathrooms  bedrooms  Elevator  Cats Allowed  Hardwood Floors
0        0.1  0.333333       1.0           0.0              1.0
1        0.1  0.166667       1.0           1.0              1.0
2        0.2  0.166667       0.0           1.0              0.0


## 6.f Mathematical formula for StandardScaler normalization: $X_{norm}=\frac{X-u}{s}$, where u is the arithmetic mean and s is the standard deviation

In [42]:
class standartscaler:
    def __init__(self):
        self.mean_ = None
        self.std_ = None

    def fit(self, X):
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)

    def transform(self, X):
        return (X - self.mean_) / self.std_

    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)
    def reverse(self, X):
        return (X * self.std_) + self.mean_
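
One detail of the formula is which standard deviation is used. scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`), and `np.std` has the same default. A minimal sketch comparing it with the sample standard deviation (`ddof=1`); the helper `standardize` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def standardize(X, ddof=0):
    # Subtract the column mean and divide by the column standard deviation
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=ddof)
    return (X - mean) / std

X_demo = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [6.0, 0.0]])

Z_population = standardize(X_demo, ddof=0)  # matches the np.std default
Z_sample = standardize(X_demo, ddof=1)      # sample standard deviation

print(Z_population.mean(axis=0), Z_population.std(axis=0))
print(Z_sample.std(axis=0, ddof=1))
```

On small samples the two variants differ noticeably, so the same choice should be used when comparing a custom scaler with a library one.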

In [43]:
stdscaler = standartscaler()
stdnorm= stdscaler.fit_transform(X_test.copy())
StdScaler=StandardScaler().fit_transform(X_test.copy())
print(stdnorm.iloc[:3, :5])
print(pd.DataFrame(StdScaler, columns=X_test.copy().columns).iloc[:3, :5])

       bathrooms  bedrooms  Elevator  Cats Allowed  Hardwood Floors
51074  -0.427916  0.428340  0.961951     -0.966494         1.029594
67862  -0.427916 -0.475594  0.961951      1.034667         1.029594
70185   1.731546 -0.475594 -1.039554      1.034667        -0.971257
   bathrooms  bedrooms  Elevator  Cats Allowed  Hardwood Floors
0  -0.427916  0.428340  0.961951     -0.966494         1.029594
1  -0.427916 -0.475594  0.961951      1.034667         1.029594
2   1.731546 -0.475594 -1.039554      1.034667        -0.971257


# 7. Fitting custom and sklearn models on normalized data

## 7.1 Fit all models (linear regression, Ridge, Lasso, and ElasticNet) with MinMaxScaler.

In [44]:
model=Linreg()
X_train_minmaxsc=minmax.fit_transform(X_train)
X_test_minmaxsc=minmax.transform(X_test)
y_train_minmaxsc=minmax.fit_transform(y_train)
y_test_minmaxsc=minmax.transform(y_test)
model.fit(X_train_minmaxsc,y_train_minmaxsc)
y_pred_tt=model.predict(X_test_minmaxsc)
y_pred_tr=model.predict(X_train_minmaxsc)
y_pred_tt=minmax.reverse(y_pred_tt)
y_pred_tr=minmax.reverse(y_pred_tr)
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'linear regression minmaxsc',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'linear regression minmaxsc',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'linear regression minmaxsc',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE


Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164
4,linear regression minmaxsc,715.010744,727.987333
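
In the cell above a single `minmax` object is fitted first on the features and then again on the target, so after the second `fit_transform` its `min_` and `max_` describe the target only. This works because the features have already been transformed by then, but keeping one scaler per role may be easier to follow. A minimal sketch under that assumption; the class `SimpleMinMax` and the toy arrays are hypothetical and only mirror the idea of the notebook's scaler.

```python
import numpy as np

class SimpleMinMax:
    """Hypothetical minimal min-max scaler for 1-D or 2-D NumPy arrays."""

    def fit(self, values):
        self.min_ = np.min(values, axis=0)
        self.max_ = np.max(values, axis=0)
        return self

    def transform(self, values):
        return (values - self.min_) / (self.max_ - self.min_)

    def inverse(self, values):
        return values * (self.max_ - self.min_) + self.min_

X_train_toy = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]])
y_train_toy = np.array([1000.0, 2500.0, 4000.0])

# One scaler per role: statistics of X and y never overwrite each other.
x_scaler = SimpleMinMax().fit(X_train_toy)
y_scaler = SimpleMinMax().fit(y_train_toy)

X_scaled = x_scaler.transform(X_train_toy)
y_scaled = y_scaler.transform(y_train_toy)

# Predictions made in the scaled target space are mapped back with y_scaler.
y_restored = y_scaler.inverse(y_scaled)
print(y_restored)
```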


In [45]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623
2,Ridge,0.550648,0.540452
3,ElasticNet,0.560134,0.549854
4,linear regression minmaxsc,0.582581,0.568815


In [46]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232377
1,Lasso,1059.277123,1087.95425
2,Ridge,1072.487348,1101.423083
3,ElasticNet,1061.106312,1090.09857
4,linear regression minmaxsc,1033.677199,1066.89244


In [47]:
model=Linreg(learning_rate=0.01,n_iter=100,reg_type='Ridge', alpha=0.01,l1_ratio=0)
X_train_minmaxsc=minmax.fit_transform(X_train)
X_test_minmaxsc=minmax.transform(X_test)
y_train_minmaxsc=minmax.fit_transform(y_train)
y_test_minmaxsc=minmax.transform(y_test)
model.fit(X_train_minmaxsc,y_train_minmaxsc)
y_pred_tt=model.predict(X_test_minmaxsc)
y_pred_tr=model.predict(X_train_minmaxsc)
y_pred_tt=minmax.reverse(y_pred_tt)
y_pred_tr=minmax.reverse(y_pred_tr)
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'Ridge minmaxsc',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'Ridge minmaxsc',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'Ridge minmaxsc',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164
4,linear regression minmaxsc,715.010744,727.987333
5,Ridge minmaxsc,751.396336,764.261076


In [48]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623
2,Ridge,0.550648,0.540452
3,ElasticNet,0.560134,0.549854
4,linear regression minmaxsc,0.582581,0.568815
5,Ridge minmaxsc,0.536743,0.526574


In [49]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232377
1,Lasso,1059.277123,1087.95425
2,Ridge,1072.487348,1101.423083
3,ElasticNet,1061.106312,1090.09857
4,linear regression minmaxsc,1033.677199,1066.89244
5,Ridge minmaxsc,1088.954624,1117.930895


In [50]:
model=Linreg(learning_rate=0.01,n_iter=100,reg_type='Lasso', alpha=0.01,l1_ratio=1)
X_train_minmaxsc=minmax.fit_transform(X_train)
X_test_minmaxsc=minmax.transform(X_test)
y_train_minmaxsc=minmax.fit_transform(y_train)
y_test_minmaxsc=minmax.transform(y_test)
model.fit(X_train_minmaxsc,y_train_minmaxsc)
y_pred_tt=model.predict(X_test_minmaxsc)
y_pred_tr=model.predict(X_train_minmaxsc)
y_pred_tt=minmax.reverse(y_pred_tt)
y_pred_tr=minmax.reverse(y_pred_tr)
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'Lasso minmaxsc',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'Lasso minmaxsc',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'Lasso minmaxsc',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164
4,linear regression minmaxsc,715.010744,727.987333
5,Ridge minmaxsc,751.396336,764.261076
6,Lasso minmaxsc,844.732604,858.106806


In [51]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623
2,Ridge,0.550648,0.540452
3,ElasticNet,0.560134,0.549854
4,linear regression minmaxsc,0.582581,0.568815
5,Ridge minmaxsc,0.536743,0.526574
6,Lasso minmaxsc,0.349182,0.341487


In [52]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232377
1,Lasso,1059.277123,1087.95425
2,Ridge,1072.487348,1101.423083
3,ElasticNet,1061.106312,1090.09857
4,linear regression minmaxsc,1033.677199,1066.89244
5,Ridge minmaxsc,1088.954624,1117.930895
6,Lasso minmaxsc,1290.708981,1318.472911


In [53]:
model=Linreg(learning_rate=0.01,n_iter=100,reg_type='ElasticNet', alpha=0.01,l1_ratio=0.5)
X_train_minmaxsc=minmax.fit_transform(X_train)
X_test_minmaxsc=minmax.transform(X_test)
y_train_minmaxsc=minmax.fit_transform(y_train)
y_test_minmaxsc=minmax.transform(y_test)
model.fit(X_train_minmaxsc,y_train_minmaxsc)
y_pred_tt=model.predict(X_test_minmaxsc)
y_pred_tr=model.predict(X_train_minmaxsc)
y_pred_tt=minmax.reverse(y_pred_tt)
y_pred_tr=minmax.reverse(y_pred_tr)
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'ElasticNet minmaxsc',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'ElasticNet minmaxsc',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'ElasticNet minmaxsc',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164
4,linear regression minmaxsc,715.010744,727.987333
5,Ridge minmaxsc,751.396336,764.261076
6,Lasso minmaxsc,844.732604,858.106806
7,ElasticNet minmaxsc,764.632678,777.438827


In [54]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623
2,Ridge,0.550648,0.540452
3,ElasticNet,0.560134,0.549854
4,linear regression minmaxsc,0.582581,0.568815
5,Ridge minmaxsc,0.536743,0.526574
6,Lasso minmaxsc,0.349182,0.341487
7,ElasticNet minmaxsc,0.485268,0.475617


In [55]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232377
1,Lasso,1059.277123,1087.95425
2,Ridge,1072.487348,1101.423083
3,ElasticNet,1061.106312,1090.09857
4,linear regression minmaxsc,1033.677199,1066.89244
5,Ridge minmaxsc,1088.954624,1117.930895
6,Lasso minmaxsc,1290.708981,1318.472911
7,ElasticNet minmaxsc,1147.861055,1176.558213
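
The MinMax cells above reuse one `minmax` object for both the features and the target. That works only because every `X` transform is finished before the scaler is fitted again on `y`. Below is a minimal sketch of the same scale-then-reverse round trip with a dedicated target scaler. `MinMax1D` and its methods are hypothetical stand-ins written for illustration, not the notebook's own scaler class, and the prices are made up.

```python
class MinMax1D:
    """Hypothetical 1-D min-max scaler (illustration only)."""

    def fit(self, values):
        self.lo = min(values)
        self.hi = max(values)
        # Guard against a constant column to avoid dividing by zero.
        self.span = (self.hi - self.lo) or 1.0
        return self

    def transform(self, values):
        return [(v - self.lo) / self.span for v in values]

    def reverse(self, scaled):
        # Inverse of transform: back to the original units.
        return [s * self.span + self.lo for s in scaled]


y_train_demo = [1500.0, 2500.0, 4000.0, 3000.0]  # made-up prices
y_scaler = MinMax1D().fit(y_train_demo)           # fitted on the target only

y_scaled = y_scaler.transform(y_train_demo)
y_restored = y_scaler.reverse(y_scaled)
```

Keeping a separate instance per array means the feature statistics remain available after the target is scaled, for example to transform new rows at prediction time.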


## 7.2 Fit all models (linear regression, Ridge, Lasso, and ElasticNet) with StandardScaler.

In [56]:
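model=Linreg(n_iter=100)
# Standardize features: fit on the training split only, then reuse those statistics for test.
X_train_stdsc=stdscaler.fit_transform(X_train)
X_test_stdsc=stdscaler.transform(X_test)
# The same stdscaler object is fitted again on the target here, so X must already be transformed above.
y_train_stdsc=stdscaler.fit_transform(y_train)
y_test_stdsc=stdscaler.transform(y_test)
model.fit(X_train_stdsc,y_train_stdsc)
y_pred_tt=model.predict(X_test_stdsc)
y_pred_tr=model.predict(X_train_stdsc)
# Map predictions back to price units before computing metrics.
y_pred_tt=stdscaler.reverse(y_pred_tt)
y_pred_tr=stdscaler.reverse(y_pred_tr)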
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'linear regression stdsc',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'linear regression stdsc',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'linear regression stdsc',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164
4,linear regression minmaxsc,715.010744,727.987333
5,Ridge minmaxsc,751.396336,764.261076
6,Lasso minmaxsc,844.732604,858.106806
7,ElasticNet minmaxsc,764.632678,777.438827
8,linear regression stdsc,730.233319,740.482546


In [57]:
model=Linreg(learning_rate=0.01,n_iter=100,reg_type='Ridge', alpha=0.01,l1_ratio=0)
X_train_stdsc=stdscaler.fit_transform(X_train)
X_test_stdsc=stdscaler.transform(X_test)
y_train_stdsc=stdscaler.fit_transform(y_train)
y_test_stdsc=stdscaler.transform(y_test)
model.fit(X_train_stdsc,y_train_stdsc)
y_pred_tt=model.predict(X_test_stdsc)
y_pred_tr=model.predict(X_train_stdsc)
y_pred_tt=stdscaler.reverse(y_pred_tt)
y_pred_tr=stdscaler.reverse(y_pred_tr)
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'Ridge stdsc',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'Ridge stdsc',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'Ridge stdsc',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164
4,linear regression minmaxsc,715.010744,727.987333
5,Ridge minmaxsc,751.396336,764.261076
6,Lasso minmaxsc,844.732604,858.106806
7,ElasticNet minmaxsc,764.632678,777.438827
8,linear regression stdsc,730.233319,740.482546
9,Ridge stdsc,816.643552,822.890281


In [58]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232377
1,Lasso,1059.277123,1087.95425
2,Ridge,1072.487348,1101.423083
3,ElasticNet,1061.106312,1090.09857
4,linear regression minmaxsc,1033.677199,1066.89244
5,Ridge minmaxsc,1088.954624,1117.930895
6,Lasso minmaxsc,1290.708981,1318.472911
7,ElasticNet minmaxsc,1147.861055,1176.558213
8,linear regression stdsc,1043.675211,1074.466085
9,Ridge stdsc,1143.936559,1171.797955


In [59]:
model=Linreg(learning_rate=0.01,n_iter=100,reg_type='Lasso', alpha=0.01,l1_ratio=1)
X_train_stdsc=stdscaler.fit_transform(X_train)
X_test_stdsc=stdscaler.transform(X_test)
y_train_stdsc=stdscaler.fit_transform(y_train)
y_test_stdsc=stdscaler.transform(y_test)
model.fit(X_train_stdsc,y_train_stdsc)
y_pred_tt=model.predict(X_test_stdsc)
y_pred_tr=model.predict(X_train_stdsc)
y_pred_tt=stdscaler.reverse(y_pred_tt)
y_pred_tr=stdscaler.reverse(y_pred_tr)
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'Lasso stdsc',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'Lasso stdsc',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'Lasso stdsc',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164
4,linear regression minmaxsc,715.010744,727.987333
5,Ridge minmaxsc,751.396336,764.261076
6,Lasso minmaxsc,844.732604,858.106806
7,ElasticNet minmaxsc,764.632678,777.438827
8,linear regression stdsc,730.233319,740.482546
9,Ridge stdsc,816.643552,822.890281


In [60]:
model=Linreg(learning_rate=0.01,n_iter=100,reg_type='ElasticNet', alpha=0.01,l1_ratio=0.5)
X_train_stdsc=stdscaler.fit_transform(X_train)
X_test_stdsc=stdscaler.transform(X_test)
y_train_stdsc=stdscaler.fit_transform(y_train)
y_test_stdsc=stdscaler.transform(y_test)
model.fit(X_train_stdsc,y_train_stdsc)
y_pred_tt=model.predict(X_test_stdsc)
y_pred_tr=model.predict(X_train_stdsc)
y_pred_tt=stdscaler.reverse(y_pred_tt)
y_pred_tr=stdscaler.reverse(y_pred_tr)
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'ElasticNet stdsc',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'ElasticNet stdsc',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'ElasticNet stdsc',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164
4,linear regression minmaxsc,715.010744,727.987333
5,Ridge minmaxsc,751.396336,764.261076
6,Lasso minmaxsc,844.732604,858.106806
7,ElasticNet minmaxsc,764.632678,777.438827
8,linear regression stdsc,730.233319,740.482546
9,Ridge stdsc,816.643552,822.890281


In [61]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623
2,Ridge,0.550648,0.540452
3,ElasticNet,0.560134,0.549854
4,linear regression minmaxsc,0.582581,0.568815
5,Ridge minmaxsc,0.536743,0.526574
6,Lasso minmaxsc,0.349182,0.341487
7,ElasticNet minmaxsc,0.485268,0.475617
8,linear regression stdsc,0.574467,0.562672
9,Ridge stdsc,0.488781,0.479851
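
The `Linreg` calls above select the penalty through `reg_type`, `alpha`, and `l1_ratio` (1 for Lasso, 0 for Ridge, 0.5 for ElasticNet). The implementation of `Linreg` is not shown in this section, so the sketch below is only an assumed form of such a combined penalty, built from the L1 and L2 terms defined in the theoretical part. The function name `elastic_penalty` and the weight vector are hypothetical.

```python
def elastic_penalty(weights, alpha, l1_ratio):
    """Assumed penalty: alpha * (l1_ratio * L1 + (1 - l1_ratio) * L2)."""
    l1 = sum(abs(w) for w in weights)   # sum of absolute weights
    l2 = sum(w * w for w in weights)    # sum of squared weights
    return alpha * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)


w = [0.5, -2.0, 0.0]  # made-up weights
lasso_pen = elastic_penalty(w, alpha=0.01, l1_ratio=1)    # pure L1
ridge_pen = elastic_penalty(w, alpha=0.01, l1_ratio=0)    # pure L2
mixed_pen = elastic_penalty(w, alpha=0.01, l1_ratio=0.5)  # equal mix
```

With this form, `l1_ratio` interpolates linearly between the two penalties, so the ElasticNet value always lies between the Lasso and Ridge values for the same weights.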


# 8. Overfitting models

## 8.b 

## 8.c

In [62]:
# Degree-10 polynomial expansion of three features, used here to provoke overfitting.
X_poly = df_train[['bathrooms', 'bedrooms', 'interest_level_enc']]
X_poly = PolynomialFeatures(degree=10).fit_transform(X_poly)
y_poly = df_train['price']
X_train, X_test, y_train, y_test = train_test_split(X_poly, y_poly, test_size=0.2, random_state=21)

# Separate scalers for features and target, both fitted on the training split only.
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_train = scaler_X.fit_transform(X_train)
y_train_sc = scaler_y.fit_transform(y_train.values.reshape(-1, 1)).flatten()
X_test = scaler_X.transform(X_test)

In [63]:
X_train

array([[ 0.        , -0.42442894, -1.38352827, ..., -0.19558162,
        -0.2338375 , -0.28757608],
       [ 0.        ,  1.78273705,  0.43213489, ..., -0.19558162,
        -0.2338375 , -0.29122427],
       [ 0.        , -0.42442894,  1.33996648, ..., -0.19558162,
        -0.2338375 , -0.29122427],
       ...,
       [ 0.        , -0.42442894, -1.38352827, ..., -0.19558162,
        -0.2338375 , -0.29122427],
       [ 0.        , -0.42442894, -0.47569669, ..., -0.19558162,
        -0.2338375 , -0.29122427],
       [ 0.        , -0.42442894, -1.38352827, ..., -0.19558162,
        -0.2338375 , -0.29122427]], shape=(39096, 286))
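
The 286 columns in the shape above follow from counting the monomials of total degree at most 10 in 3 variables, including the constant term. The first of those columns is the constant term itself, which is consistent with the zeros in the first column after standardization. A short check of the arithmetic, using only the numbers visible in this cell:

```python
from math import comb

n_features = 3   # bathrooms, bedrooms, interest_level_enc
degree = 10

# Monomials of total degree <= degree in n_features variables,
# including the bias (constant) term.
n_columns = comb(n_features + degree, degree)
print(n_columns)  # 286

# Training rows available per generated column (39096 rows in X_train).
rows_per_column = 39096 / n_columns
```

Most of these columns are high powers and products of the same three features, so they are strongly correlated with one another, which is what makes this setup prone to overfitting.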

In [64]:
model = Linreg(n_iter=100, learning_rate=0.000001,alpha=0.01)
#model = LinearRegression()
model.fit(X_train, y_train_sc)

y_pred_tt = model.predict(X_test)  
y_pred_tr = model.predict(X_train) 

y_pred_tt = scaler_y.inverse_transform(y_pred_tt.reshape(-1, 1)).flatten()
y_pred_tr = scaler_y.inverse_transform(y_pred_tr.reshape(-1, 1)).flatten()

y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)

result_MAE.loc[len(result_MAE)] = {
    'model': 'linear regression poly',
    'train': y_mae_train,
    'test': y_mae_test
}
result_RMSE.loc[len(result_RMSE)] = {
    'model': 'linear regression poly',
    'train': y_rmse_train,
    'test': y_rmse_test
}
result_R2.loc[len(result_R2)] = {
    'model': 'linear regression poly',
    'train': y_r2_train,
    'test': y_r2_test
}
result_R2
    

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623
2,Ridge,0.550648,0.540452
3,ElasticNet,0.560134,0.549854
4,linear regression minmaxsc,0.582581,0.568815
5,Ridge minmaxsc,0.536743,0.526574
6,Lasso minmaxsc,0.349182,0.341487
7,ElasticNet minmaxsc,0.485268,0.475617
8,linear regression stdsc,0.574467,0.562672
9,Ridge stdsc,0.488781,0.479851


In [65]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232
1,Lasso,1059.277123,1087.954
2,Ridge,1072.487348,1101.423
3,ElasticNet,1061.106312,1090.099
4,linear regression minmaxsc,1033.677199,1066.892
5,Ridge minmaxsc,1088.954624,1117.931
6,Lasso minmaxsc,1290.708981,1318.473
7,ElasticNet minmaxsc,1147.861055,1176.558
8,linear regression stdsc,1043.675211,1074.466
9,Ridge stdsc,1143.936559,1171.798


In [66]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623
2,Ridge,0.550648,0.540452
3,ElasticNet,0.560134,0.549854
4,linear regression minmaxsc,0.582581,0.568815
5,Ridge minmaxsc,0.536743,0.526574
6,Lasso minmaxsc,0.349182,0.341487
7,ElasticNet minmaxsc,0.485268,0.475617
8,linear regression stdsc,0.574467,0.562672
9,Ridge stdsc,0.488781,0.479851


In [68]:
model=Linreg(learning_rate=0.000001,n_iter=100,reg_type='Lasso', alpha=0.01,l1_ratio=1)
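# Fit on the standardized target, as in the unregularized polynomial cell, then map predictions back to price units.
model.fit(X_train,y_train_sc)
y_pred_tt=scaler_y.inverse_transform(model.predict(X_test).reshape(-1, 1)).flatten()
y_pred_tr=scaler_y.inverse_transform(model.predict(X_train).reshape(-1, 1)).flatten()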
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'Lasso poly',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'Lasso poly',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'Lasso poly',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164
4,linear regression minmaxsc,715.010744,727.987333
5,Ridge minmaxsc,751.396336,764.261076
6,Lasso minmaxsc,844.732604,858.106806
7,ElasticNet minmaxsc,764.632678,777.438827
8,linear regression stdsc,730.233319,740.482546
9,Ridge stdsc,816.643552,822.890281


In [69]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232
1,Lasso,1059.277123,1087.954
2,Ridge,1072.487348,1101.423
3,ElasticNet,1061.106312,1090.099
4,linear regression minmaxsc,1033.677199,1066.892
5,Ridge minmaxsc,1088.954624,1117.931
6,Lasso minmaxsc,1290.708981,1318.473
7,ElasticNet minmaxsc,1147.861055,1176.558
8,linear regression stdsc,1043.675211,1074.466
9,Ridge stdsc,1143.936559,1171.798


In [70]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623
2,Ridge,0.550648,0.540452
3,ElasticNet,0.560134,0.549854
4,linear regression minmaxsc,0.582581,0.568815
5,Ridge minmaxsc,0.536743,0.526574
6,Lasso minmaxsc,0.349182,0.341487
7,ElasticNet minmaxsc,0.485268,0.475617
8,linear regression stdsc,0.574467,0.562672
9,Ridge stdsc,0.488781,0.479851


In [71]:
model=Linreg(learning_rate=0.000001,n_iter=100,reg_type='Ridge', alpha=0.01,l1_ratio=0)
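# Fit on the standardized target, as in the unregularized polynomial cell, then map predictions back to price units.
model.fit(X_train,y_train_sc)
y_pred_tt=scaler_y.inverse_transform(model.predict(X_test).reshape(-1, 1)).flatten()
y_pred_tr=scaler_y.inverse_transform(model.predict(X_train).reshape(-1, 1)).flatten()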
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'Ridge poly',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'Ridge poly',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'Ridge poly',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164
4,linear regression minmaxsc,715.010744,727.987333
5,Ridge minmaxsc,751.396336,764.261076
6,Lasso minmaxsc,844.732604,858.106806
7,ElasticNet minmaxsc,764.632678,777.438827
8,linear regression stdsc,730.233319,740.482546
9,Ridge stdsc,816.643552,822.890281


In [72]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623
2,Ridge,0.550648,0.540452
3,ElasticNet,0.560134,0.549854
4,linear regression minmaxsc,0.582581,0.568815
5,Ridge minmaxsc,0.536743,0.526574
6,Lasso minmaxsc,0.349182,0.341487
7,ElasticNet minmaxsc,0.485268,0.475617
8,linear regression stdsc,0.574467,0.562672
9,Ridge stdsc,0.488781,0.479851


In [73]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232
1,Lasso,1059.277123,1087.954
2,Ridge,1072.487348,1101.423
3,ElasticNet,1061.106312,1090.099
4,linear regression minmaxsc,1033.677199,1066.892
5,Ridge minmaxsc,1088.954624,1117.931
6,Lasso minmaxsc,1290.708981,1318.473
7,ElasticNet minmaxsc,1147.861055,1176.558
8,linear regression stdsc,1043.675211,1074.466
9,Ridge stdsc,1143.936559,1171.798


In [74]:
model=Linreg(learning_rate=0.000001,n_iter=100,reg_type='ElasticNet', alpha=0.01,l1_ratio=0.5)
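# Fit on the standardized target, as in the unregularized polynomial cell, then map predictions back to price units.
model.fit(X_train,y_train_sc)
y_pred_tt=scaler_y.inverse_transform(model.predict(X_test).reshape(-1, 1)).flatten()
y_pred_tr=scaler_y.inverse_transform(model.predict(X_train).reshape(-1, 1)).flatten()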
y_mae_train = mean_absolute_error(y_train, y_pred_tr)
y_mae_test = mean_absolute_error(y_test, y_pred_tt)
y_rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_tr))
y_rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_tt))
y_r2_train = r2(y_train, y_pred_tr)
y_r2_test = r2(y_test, y_pred_tt)
result_MAE.loc[len(result_MAE)] = {
            'model': 'ElasticNet poly',
            'train': y_mae_train,
            'test': y_mae_test
        }
result_RMSE.loc[len(result_RMSE)] = {
            'model': 'ElasticNet poly',
            'train': y_rmse_train,
            'test': y_rmse_test
        }
result_R2.loc[len(result_R2)] = {
            'model': 'ElasticNet poly',
            'train': y_r2_train,
            'test': y_r2_test
        }
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164
4,linear regression minmaxsc,715.010744,727.987333
5,Ridge minmaxsc,751.396336,764.261076
6,Lasso minmaxsc,844.732604,858.106806
7,ElasticNet minmaxsc,764.632678,777.438827
8,linear regression stdsc,730.233319,740.482546
9,Ridge stdsc,816.643552,822.890281


In [75]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232
1,Lasso,1059.277123,1087.954
2,Ridge,1072.487348,1101.423
3,ElasticNet,1061.106312,1090.099
4,linear regression minmaxsc,1033.677199,1066.892
5,Ridge minmaxsc,1088.954624,1117.931
6,Lasso minmaxsc,1290.708981,1318.473
7,ElasticNet minmaxsc,1147.861055,1176.558
8,linear regression stdsc,1043.675211,1074.466
9,Ridge stdsc,1143.936559,1171.798


In [76]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623
2,Ridge,0.550648,0.540452
3,ElasticNet,0.560134,0.549854
4,linear regression minmaxsc,0.582581,0.568815
5,Ridge minmaxsc,0.536743,0.526574
6,Lasso minmaxsc,0.349182,0.341487
7,ElasticNet minmaxsc,0.485268,0.475617
8,linear regression stdsc,0.574467,0.562672
9,Ridge stdsc,0.488781,0.479851


## 9 Naive models

In [77]:
# Constant baselines: both statistics come from the training target only.
mean_price = y_train.mean()
median_price = y_train.median()

In [78]:
y_train_pred_mean = [mean_price] * len(y_train)
y_train_pred_median = [median_price] * len(y_train)
y_test_pred_mean = [mean_price] * len(y_test)
y_test_pred_median = [median_price] * len(y_test)

In [79]:
mae_train_mean = mean_absolute_error(y_train, y_train_pred_mean)
mae_test_mean = mean_absolute_error(y_test, y_test_pred_mean)
mae_train_median = mean_absolute_error(y_train, y_train_pred_median)
mae_test_median = mean_absolute_error(y_test, y_test_pred_median)

In [80]:
rmse_train_mean = np.sqrt(mean_squared_error(y_train, y_train_pred_mean))
rmse_test_mean = np.sqrt(mean_squared_error(y_test, y_test_pred_mean))
rmse_train_median = np.sqrt(mean_squared_error(y_train, y_train_pred_median))
rmse_test_median = np.sqrt(mean_squared_error(y_test, y_test_pred_median))

In [81]:
y_r2_mean_train = r2(y_train, y_train_pred_mean)
y_r2_mean_test = r2(y_test, y_test_pred_mean)
y_r2_median_train = r2(y_train, y_train_pred_median)
y_r2_median_test = r2(y_test, y_test_pred_median)

In [82]:
result_MAE.loc[len(result_MAE)] = {
    'model': 'naive mean',
    'train': mae_train_mean,
    'test': mae_test_mean
}
result_RMSE.loc[len(result_RMSE)] = {
    'model': 'naive mean',
    'train': rmse_train_mean,
    'test': rmse_test_mean
}
result_R2.loc[len(result_R2)] = {
    'model': 'naive mean',
    'train': y_r2_mean_train,
    'test': y_r2_mean_test
}

In [83]:
result_MAE.loc[len(result_MAE)] = {
    'model': 'naive median',
    'train': mae_train_median,
    'test': mae_test_median  # was mae_test_mean: the median row must report the median baseline
}
result_RMSE.loc[len(result_RMSE)] = {
    'model': 'naive median',
    'train': rmse_train_median,
    'test': rmse_test_median  # was rmse_test_mean
}
result_R2.loc[len(result_R2)] = {
    'model': 'naive median',
    'train': y_r2_median_train,
    'test': y_r2_median_test
}
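
As a sanity check on the naive baselines, the sketch below uses a small made-up price list (not the notebook's data) to show two properties the result tables should reflect: a constant mean prediction has an R2 of 0 on the data it was computed from, and the median gives an MAE no larger than the mean does. The helper names `mae` and `r2_manual` are hypothetical and written out by hand so the example needs no external libraries.

```python
from statistics import mean, median

prices = [1000.0, 2000.0, 2500.0, 3000.0, 9000.0]  # made-up, right-skewed


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


pred_mean = [mean(prices)] * len(prices)
pred_median = [median(prices)] * len(prices)

r2_mean = r2_manual(prices, pred_mean)    # residuals equal deviations from the mean
mae_mean = mae(prices, pred_mean)
mae_median = mae(prices, pred_median)     # the median minimizes absolute error
```

Any trained model worth keeping should beat both baselines on the test split, so these two rows serve as the floor for the comparison tables.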

## 10 Final table

In [84]:
result_MAE

Unnamed: 0,model,train,test
0,linear regression,718.497361,731.676564
1,Lasso,731.613031,740.291154
2,Ridge,719.604702,728.795903
3,ElasticNet,721.377651,730.411164
4,linear regression minmaxsc,715.010744,727.987333
5,Ridge minmaxsc,751.396336,764.261076
6,Lasso minmaxsc,844.732604,858.106806
7,ElasticNet minmaxsc,764.632678,777.438827
8,linear regression stdsc,730.233319,740.482546
9,Ridge stdsc,816.643552,822.890281


In [85]:
result_RMSE

Unnamed: 0,model,train,test
0,linear regression,1034.2543,1067.232
1,Lasso,1059.277123,1087.954
2,Ridge,1072.487348,1101.423
3,ElasticNet,1061.106312,1090.099
4,linear regression minmaxsc,1033.677199,1066.892
5,Ridge minmaxsc,1088.954624,1117.931
6,Lasso minmaxsc,1290.708981,1318.473
7,ElasticNet minmaxsc,1147.861055,1176.558
8,linear regression stdsc,1043.675211,1074.466
9,Ridge stdsc,1143.936559,1171.798


In [86]:
result_R2

Unnamed: 0,model,train,test
0,linear regression,0.582114,0.56854
1,Lasso,0.561649,0.551623
2,Ridge,0.550648,0.540452
3,ElasticNet,0.560134,0.549854
4,linear regression minmaxsc,0.582581,0.568815
5,Ridge minmaxsc,0.536743,0.526574
6,Lasso minmaxsc,0.349182,0.341487
7,ElasticNet minmaxsc,0.485268,0.475617
8,linear regression stdsc,0.574467,0.562672
9,Ridge stdsc,0.488781,0.479851


## The best and most stable model is linear regression. Ridge without feature normalization can also be considered stable.
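
One way to read this conclusion off the final R2 table is to rank the models by test R2 and look at the gap between train and test. The sketch below copies the values from that table. Treating a small train-test gap as "stability" is an assumption made here, not a definition given in the notebook.

```python
# (model, train R2, test R2) copied from the final result_R2 table.
r2_rows = [
    ("linear regression", 0.582114, 0.56854),
    ("Lasso", 0.561649, 0.551623),
    ("Ridge", 0.550648, 0.540452),
    ("ElasticNet", 0.560134, 0.549854),
    ("linear regression minmaxsc", 0.582581, 0.568815),
    ("Ridge minmaxsc", 0.536743, 0.526574),
    ("Lasso minmaxsc", 0.349182, 0.341487),
    ("ElasticNet minmaxsc", 0.485268, 0.475617),
    ("linear regression stdsc", 0.574467, 0.562672),
    ("Ridge stdsc", 0.488781, 0.479851),
]

# Rank by test R2, best first.
ranked = sorted(r2_rows, key=lambda row: row[2], reverse=True)
best_model = ranked[0][0]

# Train-test gap per model; a small gap suggests little overfitting.
gaps = {name: train - test for name, train, test in r2_rows}

for name, train, test in ranked:
    print(f"{name:28s} test={test:.4f} gap={gaps[name]:.4f}")
```

On these numbers the unregularized linear regression variants lead on test R2, and every listed model has a train-test gap below 0.02.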