### Multiple regression analysis

In this lesson we apply linear regression in practice: we try to predict car prices and understand which factors drive car pricing. We also find out which variables matter for prediction and how well the resulting model describes the data.

Task:
* Load the data and check that it is correct, whether any values are missing, and what the data types are.
* Create a new feature, the car make (company). Which manufacturers appear in the dataset? Then fix the names and check the changes.
* Convert the categorical variables with pd.get_dummies().
* Build a model with a single predictor of price, horsepower. What percentage of the variability does the resulting model explain ($R^2$)?
* Next, build two models (one with all predictors, one with all predictors except the car makes). Note the changes in $R^2$, in the coefficients, and in their significance. Which model is better to keep?
* Fill in the blanks in the results.

Column descriptions:
* car_ID: Unique identifier of each car.
* symboling: Symbolic risk rating of the car (typically used for risk assessment).
* CarName: Name of the car.
* fueltype: Type of fuel the car uses (for example, "gas" for gasoline).
* aspiration: Type of air intake system (for example, "std" for standard).
* doornumber: Number of doors (for example, "two" or "four").
* carbody: Body type (for example, "convertible" or "hatchback").
* drivewheel: Drive type (for example, "rwd" for rear-wheel drive).
* enginelocation: Location of the engine (for example, "front").
* wheelbase: Wheelbase (distance between the axles).
* carlength: Length of the car.
* carwidth: Width of the car.
* curbweight: Curb weight (mass of the car without passengers or cargo).
* enginetype: Engine type (for example, "dohc" for a dual overhead camshaft engine).
* cylindernumber: Number of cylinders in the engine.
* enginesize: Engine size (displacement).
* boreratio: Bore (diameter of the engine cylinder).
* stroke: Piston stroke in the cylinder.
* compressionratio: Compression ratio (ratio of the largest to the smallest cylinder volume over the piston's travel).
* horsepower: Engine power (horsepower).
* peakrpm: Peak engine speed (revolutions per minute).
* citympg: Fuel economy in the city (miles per gallon).
* highwaympg: Fuel economy on the highway (miles per gallon).
* price: Price of the car.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [2]:
url = "https://raw.githubusercontent.com/sayanbiswasgithub/car-price-prediction-multiple-linear-regression/refs/heads/main/archive_dataset/CarPrice_Assignment.csv"

In [39]:
cars_data = pd.read_csv(url)
cars_data.head(3)

| | car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | enginesize | fuelsystem | boreratio | stroke | compressionratio | horsepower | peakrpm | citympg | highwaympg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495.0 |
| 1 | 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500.0 |
| 2 | 3 | 1 | alfa-romero Quadrifoglio | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500.0 |


In [72]:
cars_data.dtypes

car_ID                int64
symboling             int64
CarName              object
fueltype             object
aspiration           object
doornumber           object
carbody              object
drivewheel           object
enginelocation       object
wheelbase           float64
carlength           float64
carwidth            float64
carheight           float64
curbweight            int64
enginetype           object
cylindernumber       object
enginesize            int64
fuelsystem           object
boreratio           float64
stroke              float64
compressionratio    float64
horsepower            int64
peakrpm               int64
citympg               int64
highwaympg            int64
price               float64
company              object
dtype: object

In [73]:
cars_data.isnull().sum()

car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
company             0
dtype: int64

In [75]:
cars_data.isna().sum()


car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
company             0
dtype: int64
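
The dtype and missing-value checks above can also be read side by side. The following is a minimal sketch on a made-up three-row frame (the `toy` data and the `overview` name are illustrative assumptions, not part of the notebook); on the real data one would pass `cars_data` instead.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for cars_data; the values are invented for illustration.
toy = pd.DataFrame({
    "CarName": ["audi 100", "bmw x1", None],
    "horsepower": [111, 154, 102],
    "price": [13495.0, np.nan, 16500.0],
})

# One row per column: its dtype, the number of missing values,
# and the number of distinct non-missing values.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "n_unique": toy.nunique(),
})
print(overview)
```

Since `isna()` and `isnull()` are aliases in pandas, a single call is enough for the missing-value count.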

Create a new feature, the car make (company). Which manufacturers appear in the dataset? Then fix the names and check the changes.

In [6]:
cars_data['company'] = cars_data.CarName.apply(lambda x: x.split()[0])
cars_data.company.value_counts()

company
toyota         31
nissan         17
mazda          15
honda          13
mitsubishi     13
subaru         12
peugeot        11
volvo          11
volkswagen      9
dodge           9
buick           8
bmw             8
audi            7
plymouth        7
saab            6
isuzu           4
porsche         4
alfa-romero     3
chevrolet       3
jaguar          3
vw              2
maxda           2
renault         2
toyouta         1
vokswagen       1
Nissan          1
mercury         1
porcshce        1
Name: count, dtype: int64

In [7]:
cars_data.company.value_counts().count()

28

In [8]:
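# Note: this filter drops the rows with misspelled makes rather than renaming them,
# so 197 of the 205 rows remain.
cars = cars_data.query("company not in ('Nissan', 'porcshce', 'vw', 'vokswagen', 'toyouta', 'maxda')")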
cars.company.value_counts()

company
toyota         31
nissan         17
mazda          15
mitsubishi     13
honda          13
subaru         12
peugeot        11
volvo          11
dodge           9
volkswagen      9
buick           8
bmw             8
audi            7
plymouth        7
saab            6
isuzu           4
porsche         4
jaguar          3
chevrolet       3
alfa-romero     3
renault         2
mercury         1
Name: count, dtype: int64

In [9]:
cars.company.value_counts().count()

22
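
The filter above removes the rows with misspelled makes, while the task asks for the names to be fixed. A possible alternative is to map each misspelling to its intended label. The sketch below works on a small made-up sample; the `fixes` mapping reflects my reading of the typos in the value counts and is an assumption.

```python
import pandas as pd

# Toy sample of raw labels, including the misspellings seen in the value counts.
toy = pd.DataFrame({
    "company": ["toyota", "toyouta", "Nissan", "vw",
                "vokswagen", "maxda", "porcshce", "mazda"]
})

# Misspelled label -> intended label (assumed mapping).
fixes = {
    "toyouta": "toyota",
    "Nissan": "nissan",
    "vw": "volkswagen",
    "vokswagen": "volkswagen",
    "maxda": "mazda",
    "porcshce": "porsche",
}

# replace() leaves every label that is not a key of the mapping untouched.
toy["company"] = toy["company"].replace(fixes)
print(toy["company"].value_counts())
```

Applied to the full dataset, this would keep all 205 rows, so the regression outputs further down (fitted on 197 rows) would not be reproduced exactly.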

From here on we use the following columns: 'company', 'fueltype', 'aspiration', 'carbody', 'drivewheel', 'wheelbase', 'carlength', 'carwidth', 'curbweight', 'enginetype', 'cylindernumber', 'enginesize', 'boreratio', 'horsepower', plus the target 'price'.

In [10]:
data_cars = cars[['company', 'fueltype', 'aspiration', 'carbody', 'drivewheel', 'wheelbase', 'carlength', 'carwidth',
                  'curbweight', 'enginetype', 'cylindernumber', 'enginesize', 'boreratio', 'horsepower', 'price']]

# From here on, the CarName column holds the make, not the full model name.
data_cars = data_cars.rename(columns={'company': 'CarName'})

data_cars.head(3)

| | CarName | fueltype | aspiration | carbody | drivewheel | wheelbase | carlength | carwidth | curbweight | enginetype | cylindernumber | enginesize | boreratio | horsepower | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | gas | std | convertible | rwd | 88.6 | 168.8 | 64.1 | 2548 | dohc | four | 130 | 3.47 | 111 | 13495.0 |
| 1 | alfa-romero | gas | std | convertible | rwd | 88.6 | 168.8 | 64.1 | 2548 | dohc | four | 130 | 3.47 | 111 | 16500.0 |
| 2 | alfa-romero | gas | std | hatchback | rwd | 94.5 | 171.2 | 65.5 | 2823 | ohcv | six | 152 | 2.68 | 154 | 16500.0 |


Let's look at the correlations between the numeric variables.

In [21]:
corr_matrix = data_cars.corr(numeric_only=True)
corr_matrix

| | wheelbase | carlength | carwidth | curbweight | enginesize | boreratio | horsepower | price |
|---|---|---|---|---|---|---|---|---|
| wheelbase | 1.0 | 0.874502 | 0.794294 | 0.779521 | 0.581549 | 0.500375 | 0.36896 | 0.603003 |
| carlength | 0.874502 | 1.0 | 0.840107 | 0.877083 | 0.68661 | 0.61015 | 0.559491 | 0.695764 |
| carwidth | 0.794294 | 0.840107 | 1.0 | 0.868727 | 0.74167 | 0.560484 | 0.656177 | 0.773887 |
| curbweight | 0.779521 | 0.877083 | 0.868727 | 1.0 | 0.851462 | 0.645432 | 0.753425 | 0.841843 |
| enginesize | 0.581549 | 0.68661 | 0.74167 | 0.851462 | 1.0 | 0.573355 | 0.804739 | 0.871825 |
| boreratio | 0.500375 | 0.61015 | 0.560484 | 0.645432 | 0.573355 | 1.0 | 0.56128 | 0.538311 |
| horsepower | 0.36896 | 0.559491 | 0.656177 | 0.753425 | 0.804739 | 0.56128 | 1.0 | 0.800908 |
| price | 0.603003 | 0.695764 | 0.773887 | 0.841843 | 0.871825 | 0.538311 | 0.800908 | 1.0 |
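
When the goal is to see which predictors move together with the target, the `price` column of the correlation matrix is usually the most relevant part. Here is a minimal sketch of extracting and sorting it, using synthetic data (the column names `x_strong` and `x_weak` are hypothetical stand-ins for real predictors).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors: one tied strongly to the target, one only weakly.
x_strong = rng.normal(size=n)
x_weak = rng.normal(size=n)
toy = pd.DataFrame({
    "x_strong": x_strong,
    "x_weak": x_weak,
    "price": 3.0 * x_strong + 0.3 * x_weak + rng.normal(size=n),
})

# Keep only the correlations with the target, drop the trivial
# self-correlation, and sort from strongest to weakest.
corr_with_price = (
    toy.corr()["price"]
    .drop("price")
    .sort_values(ascending=False)
)
print(corr_with_price)
```

In the matrix above, the same reading puts enginesize (0.87), curbweight (0.84) and horsepower (0.80) at the top of the list.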


Convert the categorical variables with pd.get_dummies().

In [60]:
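# Columns with string values are the categorical variables to encode.
object_columns = data_cars.select_dtypes(include=['object']).columns.tolist()
cars_dummies = data_cars.copy()

for col in object_columns:
    # One boolean column per level, named <column>_<level>.
    col_dummies = pd.get_dummies(cars_dummies[col], prefix=col)
    # Swap the original column for its dummy columns.
    cars_dummies.drop(columns=[col], inplace=True)
    cars_dummies = pd.concat([cars_dummies, col_dummies], axis=1)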




In [61]:
cars_dummies.head(3)

| | wheelbase | carlength | carwidth | curbweight | enginesize | boreratio | horsepower | price | CarName_alfa-romero | CarName_audi | ... | enginetype_ohcf | enginetype_ohcv | enginetype_rotor | cylindernumber_eight | cylindernumber_five | cylindernumber_four | cylindernumber_six | cylindernumber_three | cylindernumber_twelve | cylindernumber_two |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 88.6 | 168.8 | 64.1 | 2548 | 130 | 3.47 | 111 | 13495.0 | True | False | ... | False | False | False | False | False | True | False | False | False | False |
| 1 | 88.6 | 168.8 | 64.1 | 2548 | 130 | 3.47 | 111 | 16500.0 | True | False | ... | False | False | False | False | False | True | False | False | False | False |
| 2 | 94.5 | 171.2 | 65.5 | 2823 | 152 | 2.68 | 154 | 16500.0 | True | False | ... | False | True | False | False | False | False | True | False | False | False |
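
The encoding above keeps a column for every level of every categorical variable. In a model with an intercept, the dummies of one variable then sum to one in every row, which makes them perfectly collinear with the intercept. A common way around this is `drop_first=True`, which omits one level per variable so that it becomes the reference category. A minimal sketch on made-up data (the `toy` frame is an illustrative assumption):

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "drivewheel": ["rwd", "fwd", "4wd"],
    "horsepower": [111, 68, 154],
})

# Encode both categorical columns in one call. With drop_first=True the
# first level of each variable (in sorted order) is omitted.
encoded = pd.get_dummies(toy, columns=["fueltype", "drivewheel"], drop_first=True)
print(encoded.columns.tolist())
# ['horsepower', 'fueltype_gas', 'drivewheel_fwd', 'drivewheel_rwd']
```

Here "diesel" and "4wd" are the omitted reference levels, and the coefficients of the remaining dummies are read as differences from them.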


Build a model with a single predictor of price, horsepower. What percentage of the variability does the resulting model explain ($R^2$)?


In [62]:
res = smf.ols(formula='price ~ 1 + horsepower', data=data_cars).fit()
res.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.641 |
| Model: | OLS | Adj. R-squared: | 0.640 |
| Method: | Least Squares | F-statistic: | 348.9 |
| Date: | Sun, 02 Feb 2025 | Prob (F-statistic): | 2.63e-45 |
| Time: | 14:09:32 | Log-Likelihood: | -1947.5 |
| No. Observations: | 197 | AIC: | 3899 |
| Df Residuals: | 195 | BIC: | 3906 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3693.3512 | 973.270 | -3.795 | 0.000 | -5612.838 | -1773.865 |
| horsepower | 163.3031 | 8.743 | 18.678 | 0.000 | 146.060 | 180.546 |

| | | | |
|---|---|---|---|
| Omnibus: | 44.988 | Durbin-Watson: | 0.788 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 83.188 |
| Skew: | 1.130 | Prob(JB): | 8.63e-19 |
| Kurtosis: | 5.243 | Cond. No. | 318 |
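
The R-squared in the summary above is the share of the variance of price that the fitted line accounts for, $R^2 = 1 - SS_{res}/SS_{tot}$. With a single predictor it equals the squared Pearson correlation, which agrees with the correlation matrix: $0.800908^2 \approx 0.641$. The sketch below checks this identity on synthetic data; the numbers only loosely imitate horsepower and price and are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(50, 250, size=200)                    # stand-in for horsepower
y = -3700 + 160 * x + rng.normal(0, 4000, size=200)   # stand-in for price

# Ordinary least squares fit of a straight line (highest degree first).
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# R^2 = 1 - residual sum of squares / total sum of squares.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Squared Pearson correlation between predictor and response.
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```

The two printed values agree up to floating-point error. The identity holds only for simple regression with an intercept; with several predictors $R^2$ has to be taken from the fitted model.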


Now, two models:
* a model with all predictors
* a model with all predictors except the car makes

In [64]:
cars_dummies.columns

Index(['wheelbase', 'carlength', 'carwidth', 'curbweight', 'enginesize',
       'boreratio', 'horsepower', 'price', 'CarName_alfa-romero',
       'CarName_audi', 'CarName_bmw', 'CarName_buick', 'CarName_chevrolet',
       'CarName_dodge', 'CarName_honda', 'CarName_isuzu', 'CarName_jaguar',
       'CarName_mazda', 'CarName_mercury', 'CarName_mitsubishi',
       'CarName_nissan', 'CarName_peugeot', 'CarName_plymouth',
       'CarName_porsche', 'CarName_renault', 'CarName_saab', 'CarName_subaru',
       'CarName_toyota', 'CarName_volkswagen', 'CarName_volvo',
       'fueltype_diesel', 'fueltype_gas', 'aspiration_std', 'aspiration_turbo',
       'carbody_convertible', 'carbody_hardtop', 'carbody_hatchback',
       'carbody_sedan', 'carbody_wagon', 'drivewheel_4wd', 'drivewheel_fwd',
       'drivewheel_rwd', 'enginetype_dohc', 'enginetype_dohcv', 'enginetype_l',
       'enginetype_ohc', 'enginetype_ohcf', 'enginetype_ohcv',
       'enginetype_rotor', 'cylindernumber_eight', 'cylindernumber_

In [80]:
# The dummy columns live in cars_dummies, not data_cars.
# CarName_alfa-romero is not in the formula (its hyphen would need quoting),
# so alfa-romero acts as the reference make.
res = smf.ols(formula='price ~ wheelbase + carlength + carwidth + curbweight + enginesize + boreratio + horsepower + \
                      CarName_audi + CarName_bmw + CarName_buick + CarName_chevrolet + \
                      CarName_dodge + CarName_honda + CarName_isuzu + CarName_jaguar + CarName_mazda + \
                      CarName_mercury + CarName_mitsubishi + CarName_nissan + CarName_peugeot + CarName_plymouth + \
                      CarName_porsche + CarName_renault + CarName_saab + CarName_subaru + CarName_toyota + \
                      CarName_volkswagen + CarName_volvo + fueltype_diesel + fueltype_gas + aspiration_std + \
                      aspiration_turbo + carbody_convertible + carbody_hardtop + carbody_hatchback + carbody_sedan + \
                      carbody_wagon + drivewheel_4wd + drivewheel_fwd + drivewheel_rwd + enginetype_dohc + \
                      enginetype_dohcv + enginetype_l + enginetype_ohc + enginetype_ohcf + enginetype_ohcv + \
                      enginetype_rotor + cylindernumber_eight + cylindernumber_five + cylindernumber_four + \
                      cylindernumber_six + cylindernumber_three + cylindernumber_twelve + cylindernumber_two',
              data=cars_dummies).fit()
res.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.959 |
| Model: | OLS | Adj. R-squared: | 0.946 |
| Method: | Least Squares | F-statistic: | 75.48 |
| Date: | Sun, 02 Feb 2025 | Prob (F-statistic): | 2.66e-83 |
| Time: | 14:26:11 | Log-Likelihood: | -1734.9 |
| No. Observations: | 197 | AIC: | 3564 |
| Df Residuals: | 150 | BIC: | 3718 |
| Df Model: | 46 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.431e+04 | 4249.380 | -3.369 | 0.001 | -2.27e+04 | -5917.872 |
| CarName_audi[T.True] | 315.7134 | 2212.251 | 0.143 | 0.887 | -4055.486 | 4686.913 |
| CarName_bmw[T.True] | 7940.3117 | 2180.275 | 3.642 | 0.000 | 3632.294 | 1.22e+04 |
| CarName_buick[T.True] | 3313.7514 | 2580.460 | 1.284 | 0.201 | -1784.992 | 8412.495 |
| CarName_chevrolet[T.True] | -2165.3772 | 2107.375 | -1.028 | 0.306 | -6329.350 | 1998.596 |
| CarName_dodge[T.True] | -3065.7371 | 1744.678 | -1.757 | 0.081 | -6513.056 | 381.582 |
| CarName_honda[T.True] | -2376.7338 | 1682.808 | -1.412 | 0.160 | -5701.804 | 948.336 |
| CarName_isuzu[T.True] | -932.8693 | 1888.128 | -0.494 | 0.622 | -4663.632 | 2797.893 |
| CarName_jaguar[T.True] | 2710.1368 | 2624.205 | 1.033 | 0.303 | -2475.043 | 7895.317 |

| | | | |
|---|---|---|---|
| Omnibus: | 79.061 | Durbin-Watson: | 1.390 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 487.677 |
| Skew: | 1.379 | Prob(JB): | 1.27e-106 |
| Kurtosis: | 10.198 | Cond. No. | 1.49e+16 |
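
Typing out several dozen dummy names by hand is error-prone, so the formula string can also be assembled from a column list. The sketch below is one possible helper (`build_formula` is a hypothetical name, and the short `columns` list is made up). It wraps names that are not valid Python identifiers, such as `CarName_alfa-romero`, in patsy's `Q("...")` quoting.

```python
# Hypothetical column list; in the notebook it would come from cars_dummies.columns.
columns = ["wheelbase", "horsepower", "price",
           "CarName_alfa-romero", "CarName_audi", "fueltype_gas"]
target = "price"


def build_formula(columns, target):
    """Return a patsy-style formula using every non-target column as a predictor."""
    terms = []
    for name in columns:
        if name == target:
            continue
        # Names with hyphens or spaces must be quoted with Q("..."),
        # otherwise the hyphen is parsed as an operator.
        terms.append(name if name.isidentifier() else f'Q("{name}")')
    return f"{target} ~ " + " + ".join(terms)


formula = build_formula(columns, target)
print(formula)
# price ~ wheelbase + horsepower + Q("CarName_alfa-romero") + CarName_audi + fueltype_gas
```

The resulting string can be passed as `formula=` to `smf.ols`. To reproduce the model above, `CarName_alfa-romero` would have to be left out of the list, since there it serves as the reference make.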


In [82]:
# Same model without the make dummies; the dummy columns live in cars_dummies.
res = smf.ols(formula='price ~ wheelbase + carlength + carwidth + curbweight + enginesize + boreratio + horsepower + \
                      fueltype_diesel + fueltype_gas + aspiration_std + \
                      aspiration_turbo + carbody_convertible + carbody_hardtop + carbody_hatchback + carbody_sedan + \
                      carbody_wagon + drivewheel_4wd + drivewheel_fwd + drivewheel_rwd + enginetype_dohc + \
                      enginetype_dohcv + enginetype_l + enginetype_ohc + enginetype_ohcf + enginetype_ohcv + \
                      enginetype_rotor + cylindernumber_eight + cylindernumber_five + cylindernumber_four + \
                      cylindernumber_six + cylindernumber_three + cylindernumber_twelve + cylindernumber_two',
              data=cars_dummies).fit()
res.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.914 |
| Model: | OLS | Adj. R-squared: | 0.901 |
| Method: | Least Squares | F-statistic: | 69.64 |
| Date: | Sun, 02 Feb 2025 | Prob (F-statistic): | 5.65e-77 |
| Time: | 14:27:09 | Log-Likelihood: | -1806.7 |
| No. Observations: | 197 | AIC: | 3667 |
| Df Residuals: | 170 | BIC: | 3756 |
| Df Model: | 26 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.266e+04 | 4551.828 | -2.780 | 0.006 | -2.16e+04 | -3669.672 |
| fueltype_diesel[T.True] | -5055.1082 | 2341.872 | -2.159 | 0.032 | -9678.004 | -432.213 |
| fueltype_gas[T.True] | -7599.9474 | 2320.906 | -3.275 | 0.001 | -1.22e+04 | -3018.440 |
| aspiration_std[T.True] | -5760.1807 | 2305.265 | -2.499 | 0.013 | -1.03e+04 | -1209.548 |
| aspiration_turbo[T.True] | -6894.8750 | 2316.469 | -2.976 | 0.003 | -1.15e+04 | -2322.127 |
| carbody_convertible[T.True] | 1056.6990 | 1360.621 | 0.777 | 0.438 | -1629.189 | 3742.587 |
| carbody_hardtop[T.True] | -3307.3574 | 1280.586 | -2.583 | 0.011 | -5835.256 | -779.459 |
| carbody_hatchback[T.True] | -4129.0838 | 1115.471 | -3.702 | 0.000 | -6331.042 | -1927.126 |
| carbody_sedan[T.True] | -2805.1992 | 1114.442 | -2.517 | 0.013 | -5005.126 | -605.272 |

| | | | |
|---|---|---|---|
| Omnibus: | 19.773 | Durbin-Watson: | 1.266 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 64.066 |
| Skew: | 0.261 | Prob(JB): | 1.23e-14 |
| Kurtosis: | 5.744 | Cond. No. | 1.02e+16 |
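
Plain $R^2$ never decreases when predictors are added, so models of different size are usually compared on adjusted $R^2$, which penalises the number of predictors: $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$. The sketch below recomputes it from the rounded values in the summaries above (n = 197, p taken from "Df Model"); the function name `adjusted_r2` is an illustrative assumption.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 for a model with an intercept and n_predictors regressors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# (R-squared, Df Model) as reported in the three summaries so far.
models = {
    "horsepower only": (0.641, 1),
    "all predictors": (0.959, 46),
    "all except makes": (0.914, 26),
}

for name, (r2, p) in models.items():
    print(f"{name}: {adjusted_r2(r2, 197, p):.3f}")
```

Because the inputs are rounded to three decimals, the results can differ from the reported "Adj. R-squared" in the last digit.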


Let's work on improving the model.

In [84]:
res = smf.ols(formula='price ~ wheelbase + carlength + carwidth + curbweight + enginesize + boreratio + horsepower',
              data=data_cars).fit() 
res.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.823 |
| Model: | OLS | Adj. R-squared: | 0.816 |
| Method: | Least Squares | F-statistic: | 125.5 |
| Date: | Sun, 02 Feb 2025 | Prob (F-statistic): | 1.53e-67 |
| Time: | 14:27:58 | Log-Likelihood: | -1878.0 |
| No. Observations: | 197 | AIC: | 3772 |
| Df Residuals: | 189 | BIC: | 3798 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -4.396e+04 | 1.33e+04 | -3.302 | 0.001 | -7.02e+04 | -1.77e+04 |
| wheelbase | 119.0305 | 98.522 | 1.208 | 0.228 | -75.313 | 313.374 |
| carlength | -47.4822 | 54.660 | -0.869 | 0.386 | -155.303 | 60.339 |
| carwidth | 546.0843 | 256.612 | 2.128 | 0.035 | 39.893 | 1052.276 |
| curbweight | 2.8109 | 1.570 | 1.790 | 0.075 | -0.286 | 5.908 |
| enginesize | 80.2785 | 12.718 | 6.312 | 0.000 | 55.190 | 105.367 |
| boreratio | -1573.1817 | 1211.519 | -1.299 | 0.196 | -3963.017 | 816.654 |
| horsepower | 53.6684 | 12.851 | 4.176 | 0.000 | 28.319 | 79.018 |

| | | | |
|---|---|---|---|
| Omnibus: | 40.796 | Durbin-Watson: | 0.828 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 110.083 |
| Skew: | 0.866 | Prob(JB): | 1.25e-24 |
| Kurtosis: | 6.227 | Cond. No. | 1.44e+05 |
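
Several predictors in this model are strongly correlated with each other (for example carlength and curbweight at 0.88 in the correlation matrix), which can inflate standard errors and make individual coefficients look insignificant. One common diagnostic, not used in the notebook itself, is the variance inflation factor (VIF). For standardised predictors it equals the diagonal of the inverse correlation matrix. A minimal sketch on synthetic data, where the names `length`, `weight` and `bore` are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

length = rng.normal(size=n)
weight = 0.9 * length + 0.3 * rng.normal(size=n)  # strongly tied to length
bore = rng.normal(size=n)                         # unrelated to the others
X = np.column_stack([length, weight, bore])

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on the other predictors; this equals the j-th diagonal element of the
# inverse of the predictors' correlation matrix.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["length", "weight", "bore"], vif):
    print(f"{name}: {value:.2f}")
```

A VIF near 1 means the predictor is almost unrelated to the others. Values well above 5 to 10 are often treated as a sign that some predictors carry overlapping information and one of them could be dropped.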


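The car makes can be dropped, because they have little effect on the coefficient of determination: without them $R^2$ falls from 0.959 to 0.914 (adjusted $R^2$ from 0.946 to 0.901).
If we keep only the car characteristics (the original quantitative variables), $R^2$ falls by roughly another 0.1, from 0.914 to 0.823, which is also a small loss for the model.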