# Feature Processing

In this homework assignment you will predict car prices from the cars' characteristics.

In [1]:
import pandas as pd

RANDOM_STATE = 42

In [158]:
df = pd.read_csv("https://raw.githubusercontent.com/evgpat/edu_stepik_practical_ml/main/datasets/cars_prices.csv", decimal='.')

In [3]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


### Description of selected features

* `symboling` - insurance risk rating: how much riskier the car is than its price suggests (+3 is the riskiest, -3 is the safest)
* `make` - car brand
* `fuel-type` - fuel type (gas or diesel)
* `aspiration` - engine aspiration (std or turbo)
* `num-of-doors` - number of doors (two or four)
* `body-style` - car body style (for example sedan, hatchback, convertible)
* `drive-wheels` - driven wheels (fwd: front-wheel drive, rwd: rear-wheel drive, 4wd: four-wheel drive)
* `engine-location` - where the engine is mounted (front or rear)
* `wheel-base` - distance between the front and rear axles
* `length` - car length
* `curb-weight` - car weight
* `width` - car width
* `height` - car height

In [4]:
df.shape

(205, 26)

In [171]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          201 non-null    int64  
 1   normalized-losses  201 non-null    object 
 2   make               201 non-null    object 
 3   fuel-type          201 non-null    object 
 4   aspiration         201 non-null    object 
 5   num-of-doors       201 non-null    object 
 6   body-style         201 non-null    object 
 7   drive-wheels       201 non-null    object 
 8   engine-location    201 non-null    object 
 9   wheel-base         201 non-null    float64
 10  length             201 non-null    float64
 11  width              201 non-null    float64
 12  height             201 non-null    float64
 13  curb-weight        201 non-null    int64  
 14  engine-type        201 non-null    object 
 15  num-of-cylinders   201 non-null    object 
 16  engine-size        201 non

## Filling missing values

Missing values in this dataset are marked as `?`.
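
As a rough illustration of what the `?` marker does to a column, the sketch below reads a tiny made-up CSV (the values are hypothetical, not taken from the dataset). It counts the markers directly and also shows the `na_values` option of `pd.read_csv`, which converts them to `NaN` at load time. The notebook itself does not use `na_values`; this is only one possible alternative.

```python
import io
import pandas as pd

# Toy CSV (hypothetical data) that uses "?" as its missing-value marker.
csv_text = "horsepower,price\n111,13495\n?,16500\n154,?\n"

# Read as-is: a column containing "?" is loaded as strings, not numbers.
raw = pd.read_csv(io.StringIO(csv_text))
question_marks = (raw == "?").sum()

# Alternative: ask pandas to treat "?" as NaN while reading.
parsed = pd.read_csv(io.StringIO(csv_text), na_values="?")
nan_counts = parsed.isna().sum()

print(question_marks.to_dict())
print(nan_counts.to_dict())
```

With `na_values="?"` the affected columns are parsed as floats, so the later `pd.to_numeric(..., errors='coerce')` step would not be needed.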

In [177]:
for c in df.columns:
    print(c, len(df[df[c] == '?']))

symboling 0
normalized-losses 0
make 0
fuel-type 0
aspiration 0
num-of-doors 0
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 0
stroke 0
compression-ratio 0
horsepower 0
peak-rpm 0
city-mpg 0
highway-mpg 0
price 0


Remove the rows where `price` is unknown, since it is the target variable.

In [161]:
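# Keep only rows with a known price; .copy() makes df an independent frame,
# so later column assignments do not trigger SettingWithCopyWarning
df = df.loc[df['price'] != '?', :].copy()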

## Quiz question

How many rows remain in the data?

In [162]:
# your code here
df.shape

(201, 26)

In the columns below, fill missing values with the mean for numeric features and with the most frequent value for categorical features:
* `num-of-doors`
* `bore`
* `stroke`
* `horsepower`
* `peak-rpm`
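
The two filling rules can be sketched on a toy frame. The values below are hypothetical; only the column names are borrowed from the dataset. The numeric column is first coerced so that `?` becomes `NaN`, and the most frequent category is computed from the data rather than typed in by hand.

```python
import pandas as pd

# Toy frame (hypothetical values); "?" marks a missing entry.
toy = pd.DataFrame({
    "num-of-doors": ["four", "two", "?", "four"],
    "horsepower": ["100", "?", "120", "110"],
})

# Numeric column: coerce "?" to NaN, then fill with the mean of the observed values.
hp = pd.to_numeric(toy["horsepower"], errors="coerce")
toy["horsepower"] = hp.fillna(hp.mean())

# Categorical column: find the most frequent non-missing value, then substitute it.
doors = toy["num-of-doors"]
most_common = doors[doors != "?"].mode()[0]
toy["num-of-doors"] = doors.replace("?", most_common)

print(toy)
```

Assigning the result back to the column, instead of calling `fillna(..., inplace=True)` on a selected column, keeps the code working regardless of whether pandas returns a view or a copy.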

In [163]:
df['horsepower'].describe()

count     201
unique     59
top        68
freq       19
Name: horsepower, dtype: object

In [89]:
pd.to_numeric(df['horsepower'], errors='coerce').mean()

103.39698492462311

In [164]:
# 'four' is the most frequent value of num-of-doors.
# Assign the result back: inplace=True on a selected column may act on a copy
df['num-of-doors'] = df['num-of-doors'].replace('?', 'four')

In [165]:
# '?' becomes NaN after coercion, then NaN is replaced by the column mean
df['bore'] = pd.to_numeric(df['bore'], errors='coerce')
df['bore'] = df['bore'].fillna(df['bore'].mean())

In [166]:
df['stroke'] = pd.to_numeric(df['stroke'], errors='coerce')
df['stroke'] = df['stroke'].fillna(df['stroke'].mean())

In [167]:
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())

In [168]:
df['peak-rpm'] = pd.to_numeric(df['peak-rpm'], errors='coerce')
df['peak-rpm'] = df['peak-rpm'].fillna(df['peak-rpm'].mean())

In [169]:
df['price'] = df['price'].astype('int64')

## Quiz question

What is the mean of `peak-rpm` before the missing values are filled? Round the answer to the nearest integer.

Predict the missing values in the `normalized-losses` column with a linear regression on the features
`symboling`, `wheel-base`, `length`, `width`, `height`, `curb-weight`, `engine-size`, `compression-ratio`, `city-mpg`, `highway-mpg` and fill them with the predictions.
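
The idea of regression-based filling can be sketched on toy data. Everything here is hypothetical: the columns `a`, `b` and `target` are made up, and the target is constructed as exactly `2*a + 3*b` so the result is easy to check by hand. The model is fitted on the rows where the target is known and then predicts the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): target = 2*a + 3*b exactly, with two targets missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.0, 0.0, 2.0, 1.0, 3.0, 0.0],
    "target": [5.0, 4.0, 12.0, np.nan, 19.0, np.nan],
})

features = ["a", "b"]
known = toy["target"].notna()

# Fit only on rows with a known target.
reg = LinearRegression().fit(toy.loc[known, features], toy.loc[known, "target"])

# Predict the missing targets and write them back into the same rows.
toy.loc[~known, "target"] = reg.predict(toy.loc[~known, features])

print(toy)
```

Because the toy target is an exact linear function of the features, the filled values come out as approximately 11 and 12. On real data the predictions are only estimates.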

In [103]:
from sklearn.linear_model import LinearRegression

model_fill = LinearRegression()

In [172]:
fill_features = [
    'symboling', 'wheel-base', 'length', 'width', 'height', 'curb-weight',
    'engine-size', 'compression-ratio', 'city-mpg', 'highway-mpg',
]
# True where normalized-losses is known, False where it is marked '?'
separator = df['normalized-losses'] != '?'
fill_test = df.loc[~separator, fill_features]
fill_train = df.loc[separator, fill_features]
# The column holds strings, so convert the known targets to numbers before fitting
fill_target = pd.to_numeric(df.loc[separator, 'normalized-losses'])
fill_target_zero = df.loc[~separator, 'normalized-losses']

In [173]:
model_fill.fit(fill_train, fill_target)
fill_pred = model_fill.predict(fill_test)

In [174]:
# Write the predictions into the rows where the value was missing
df.loc[~separator, 'normalized-losses'] = fill_pred

## Quiz question

What does the linear regression predict for the first missing value? Round the answer to the nearest integer.

In [175]:
round(fill_pred[0])

168

In [176]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,168.072493,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495
1,3,168.072493,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500
2,1,134.001799,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450


## 2. Encoding categorical features

1. Encode the binary features `fuel-type`, `aspiration`, `num-of-doors`, `engine-location`, each as a single column of 0s and 1s.
Use 1 for the most frequent category.
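
One way to avoid hard-coding which category gets the 1 is to look it up from the data. The sketch below uses a toy series with hypothetical values and assumes the column really has just two categories and no ties in frequency.

```python
import pandas as pd

# Toy binary column (hypothetical values).
toy = pd.Series(["gas", "gas", "diesel", "gas"], name="fuel-type")

# The most frequent category becomes 1, everything else becomes 0.
most_frequent = toy.value_counts().idxmax()
encoded = (toy == most_frequent).astype(int)

print(most_frequent, encoded.tolist())
```

An explicit `map` with a dictionary, as used in the cells of this notebook, is equally valid and makes the chosen encoding visible at a glance.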

In [183]:
print(df['engine-location'].unique())
df['engine-location'].describe()

[1 0]


count    201.000000
mean       0.985075
std        0.121557
min        0.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: engine-location, dtype: float64

In [179]:
df['fuel-type'] = df['fuel-type'].map({'diesel': 0, 'gas': 1})

In [180]:
df['aspiration'] = df['aspiration'].map({'turbo': 0, 'std': 1})

In [181]:
df['num-of-doors'] = df['num-of-doors'].map({'two': 0, 'four': 1})

In [182]:
df['engine-location'] = df['engine-location'].map({'rear': 0, 'front': 1})

2. Put the target variable `price` into `y` and all remaining columns into the matrix `X`.

Encode the features `make`, `body-style`, `engine-type`, `fuel-system` with LeaveOneOutEncoder.

**From here on, always work with the objects `X` and `y`.**
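
To see what leave-one-out target encoding computes, here is a hand-written pandas sketch on toy data (hypothetical values). Each row is replaced by the mean target of the other rows in the same category. For a category that occurs only once there are no other rows, so this sketch falls back to the global mean. That fallback is an assumption of the sketch; whether `category_encoders` behaves identically in every detail is not checked here.

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "make": ["audi", "audi", "audi", "bmw", "bmw", "saab"],
    "price": [10.0, 20.0, 30.0, 40.0, 60.0, 50.0],
})

grp = toy.groupby("make")["price"]
sums = grp.transform("sum")      # per-row: total price of the row's category
counts = grp.transform("count")  # per-row: size of the row's category

# Mean of the category with the current row left out.
loo = (sums - toy["price"]) / (counts - 1)

# Singleton categories have nothing left after removal: use the global mean.
loo = loo.where(counts > 1, toy["price"].mean())

print(loo.tolist())
```

Leaving the current row out means a row's own price never enters its encoded value, which is why the same category can receive slightly different numbers in different rows.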

In [213]:
X = df.drop(['price'], axis=1)
y = df['price']

In [214]:
X.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg
0,3,168.072493,alfa-romero,1,1,0,convertible,rwd,1,88.6,...,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27
1,3,168.072493,alfa-romero,1,1,0,convertible,rwd,1,88.6,...,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27
2,1,134.001799,alfa-romero,1,1,0,hatchback,rwd,1,94.5,...,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26
3,2,164.0,audi,1,1,1,sedan,fwd,1,99.8,...,four,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30
4,2,164.0,audi,1,1,1,sedan,4wd,1,99.4,...,five,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22


In [212]:
df2 = df.copy()

In [188]:
!pip install category_encoders -q

In [215]:
LOE_cols = ['make', 'body-style', 'engine-type', 'fuel-system']

In [216]:
from category_encoders.leave_one_out import LeaveOneOutEncoder
encoder = LeaveOneOutEncoder(cols=LOE_cols)
# your code here

In [218]:
X = encoder.fit_transform(X, y)

## Quiz question

What is the mean of the `body-style` column after encoding? Round the answer to the nearest integer.

In [219]:
round(X['body-style'].mean())

13207

3. Encode the `drive-wheels` feature with OHE (one-hot encoding) from the category_encoders library.
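
For intuition about what one-hot encoding produces, the sketch below uses `pd.get_dummies` on a toy frame with hypothetical values. This is a stand-in, not the `category_encoders` encoder the task asks for: the column naming differs (`get_dummies` appends the category name, while the encoder output in this notebook uses numeric suffixes such as `drive-wheels_1`).

```python
import pandas as pd

# Toy frame (hypothetical values).
toy = pd.DataFrame({
    "drive-wheels": ["rwd", "fwd", "4wd", "fwd"],
    "width": [64.1, 66.2, 66.4, 65.5],
})

# One indicator column per category; other columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["drive-wheels"], dtype=int)

print(encoded)
```

Each row has exactly one 1 among the three indicator columns, so a three-category feature turns one column into three.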

In [221]:
X['drive-wheels'].describe()

count     201
unique      3
top       fwd
freq      118
Name: drive-wheels, dtype: object

In [223]:
from category_encoders.one_hot import OneHotEncoder

encoder = OneHotEncoder(cols=['drive-wheels'])
X = encoder.fit_transform(X)

In [225]:
X.iloc[:5, 14:]

Unnamed: 0,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg
0,48.8,2548,18536.545455,four,130,17650.307692,3.47,2.68,9.0,111.0,5000.0,21,27
1,48.8,2548,18263.363636,four,130,17617.285714,3.47,2.68,9.0,111.0,5000.0,21,27
2,52.4,2823,25814.916667,six,152,17617.285714,2.68,3.47,9.0,154.0,5000.0,19,26
3,54.3,2337,11550.8125,four,109,17645.307692,3.19,3.4,10.0,102.0,5500.0,24,30
4,54.3,2824,11526.506944,five,136,17606.846154,3.19,3.4,8.0,115.0,5500.0,18,22


4. The categories in the `num-of-cylinders` column have a natural order. Encode them with consecutive integers starting from 1, following that order.

Consecutive means 1, 2, 3, and so on, with no gaps.
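
A small sketch of ordinal encoding: the mapping is built from an explicitly ordered list, so the codes are consecutive by construction. The category names are the ones that appear in this dataset; the toy series itself is hypothetical.

```python
import pandas as pd

# Categories listed in their natural order, smallest first.
order = ["two", "three", "four", "five", "six", "eight", "twelve"]
mapping = {name: rank for rank, name in enumerate(order, start=1)}

# Toy column (hypothetical values).
toy = pd.Series(["four", "six", "twelve", "two"])
codes = toy.map(mapping)

# map() yields NaN for any value missing from the mapping, so check for that.
assert codes.notna().all()
print(codes.tolist())
```

Note that the codes are ranks, not cylinder counts: `twelve` becomes 7 because it is the seventh category in the order.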

In [230]:
print(sorted(X['num-of-cylinders'].unique()))
X['num-of-cylinders'].describe()

[1, 2, 3, 4, 5, 6, 7]


count    201.000000
mean       3.323383
std        0.877440
min        1.000000
25%        3.000000
50%        3.000000
75%        3.000000
max        7.000000
Name: num-of-cylinders, dtype: float64

In [229]:
X['num-of-cylinders'] = X['num-of-cylinders'].map({'two': 1, 'three': 2, 'four': 3, 'five': 4, 'six': 5, 'eight': 6, 'twelve': 7})

## Quiz question

How many columns does the matrix `X` have?

In [231]:
X.shape

(201, 27)

In [232]:
X['normalized-losses'] = X['normalized-losses'].astype(float)
X['bore'] = X['bore'].astype(float)
X['stroke'] = X['stroke'].astype(float)
X['horsepower'] = X['horsepower'].astype(float)
X['peak-rpm'] = X['peak-rpm'].astype(float)

y = y.astype(float)

In [233]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 27 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          201 non-null    int64  
 1   normalized-losses  201 non-null    float64
 2   make               201 non-null    float64
 3   fuel-type          201 non-null    int64  
 4   aspiration         201 non-null    int64  
 5   num-of-doors       201 non-null    int64  
 6   body-style         201 non-null    float64
 7   drive-wheels_1     201 non-null    int64  
 8   drive-wheels_2     201 non-null    int64  
 9   drive-wheels_3     201 non-null    int64  
 10  engine-location    201 non-null    int64  
 11  wheel-base         201 non-null    float64
 12  length             201 non-null    float64
 13  width              201 non-null    float64
 14  height             201 non-null    float64
 15  curb-weight        201 non-null    int64  
 16  engine-type        201 non

Split the data into training and test sets in a 3:1 ratio, with random_state = 42 fixed.
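
A 3:1 split corresponds to `test_size=0.25`. The sketch below checks the resulting sizes on toy arrays (hypothetical data) and confirms that fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples with 2 features each.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# Splitting again with the same seed gives the same partition.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

print(X_tr.shape, X_te.shape)
```

With 20 samples this gives 15 training rows and 5 test rows.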

In [234]:
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42
TEST_SIZE = 0.25

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)

Scale the data with MinMaxScaler.

Fit the scaler on the training data only, then apply it to both the training and test sets.
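
Why fit on the training data only? The minimum and maximum are parameters learned from data, and the test set should not influence them. The NumPy sketch below reproduces the min-max formula by hand on toy arrays (hypothetical values). It assumes no column is constant in the training set, since that would make the range zero.

```python
import numpy as np

# Toy data (hypothetical values).
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0]])

# Parameters come from the training set only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# The same parameters are applied to both sets.
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(train_scaled)
print(test_scaled)
```

The training columns land exactly in [0, 1], while the test row comes out as `[0.25, 1.5]`: a test value outside the training range is mapped outside [0, 1], which is expected behavior rather than a bug.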

In [239]:
from sklearn.preprocessing import MinMaxScaler

processor = MinMaxScaler()
processor.fit(X_train)
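# Keep the original row index so X_train/X_test stay aligned with y_train/y_test
X_train = pd.DataFrame(processor.transform(X_train), columns=X.columns, index=X_train.index)
X_test = pd.DataFrame(processor.transform(X_test), columns=X.columns, index=X_test.index)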

Train a linear regression on the training data, predict on the test set, and compute $R^2$ on the test data.
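
For reference, $R^2$ is defined as $1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The sketch below computes it by hand on toy numbers (hypothetical values); `r2_score` from scikit-learn uses the same definition.

```python
import numpy as np

# Toy values (hypothetical).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# Residual sum of squares: 1 + 0 + 1 + 0 = 2
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares around the mean (6): 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
print(r2)
```

Here $R^2 = 1 - 2/20 = 0.9$. A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean.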

In [245]:
import numpy as np

In [242]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [247]:
from sklearn.metrics import r2_score

In [248]:
score = r2_score(y_test, y_pred)

## Quiz question

What is $R^2$ on the test data? Round the answer to two decimal places.

In [249]:
round(score, 2)

0.91