# Обработка признаков

В этом домашнем задании вы будете решать задачу предсказания стоимости автомобилей по их различным характеристикам.

In [99]:
import pandas as pd
import numpy as np
RANDOM_STATE = 42

In [100]:
df = pd.read_csv("https://raw.githubusercontent.com/evgpat/edu_stepik_practical_ml/main/datasets/cars_prices.csv", decimal='.')

In [101]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [90]:
df['horsepower']

0      111
1      111
2      154
3      102
4      115
      ... 
200    114
201    160
202    134
203    106
204    114
Name: horsepower, Length: 205, dtype: object

### Описание некоторых признаков

`symboling` - rating corresponds to the degree to which the auto is more risky than its price indicates (+3 more risk and -3 is pretty safe)  
`make` - car types (i.e. car brand)  
`fuel-type` - types of fuel (gas or diesel)  
`aspiration` - engine aspiration (standard or turbo)  
`num-of-doors` - numbers of doors (two or four)  
`body-style` - car body style (sedan or hachback)  
`drive-wheels` - which types of drive wheel (forward-fwd, reversed-rwd)  
`engine-location` - engine mounted location (front or back)  
`wheel-base` - расстояние между осями передних и задних колес  
`length` - car lenght  
`weight` - car weight  
`width` - car width  
`height` - car height  

In [102]:
df.shape

(205, 26)

In [103]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       205 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non

## Заполнение пропусков

Пропуски в этом датасете обозначены как `?`

In [104]:
for c in df.columns:
    print(c, len(df[df[c] == '?']))

symboling 0
normalized-losses 41
make 0
fuel-type 0
aspiration 0
num-of-doors 2
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4


Удалите строки, для которых неизвестно значение price, так как это целевая переменная.

## Вопрос для Quiz

Сколько строк осталось в данных?

In [105]:
# your code here
df = df[df.price != '?']
df.shape

(201, 26)

Заполните средним значением пропуски в столбцах для числовых признаков и самым популярным значением для категориальных признаков
* `num-of-doors` - кат
* `bore` - num
* `stroke` - num
* `horsepower` - num
* `peak-rpm` - num

In [106]:
# your code here
df[df['peak-rpm'] != '?']['peak-rpm'].astype(float).mean()

5117.587939698493

In [107]:
value = df['num-of-doors'].describe(include=object)['top']
value

'four'

In [108]:
for j in df.index:
    if df['num-of-doors'][j] == '?':
      df['num-of-doors'][j] = value
    else:
      continue

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['num-of-doors'][j] = value


In [109]:
for i in ['bore', 'stroke', 'horsepower', 'peak-rpm']:
  value = df[df[i] != '?'][i].astype(float).mean()
  for j in df.index:
    if df[i][j] == '?':
      df[i][j] = value
    else:
      continue
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[i][j] = value


Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [110]:
for c in df.columns:
    print(c, len(df[df[c] == '?']))

symboling 0
normalized-losses 37
make 0
fuel-type 0
aspiration 0
num-of-doors 0
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 0
stroke 0
compression-ratio 0
horsepower 0
peak-rpm 0
city-mpg 0
highway-mpg 0
price 0


## Вопрос для Quiz

Чему равно среднее значение `peak-rpm` до заполнения пропусков? Ответ округлите до целого числа.

Пропуски в столбце `normalized-losses` предскажите при помощи линейной регрессии по признакам
`symboling`, `wheel-base`, `length`, `width`, `height`, `curb-weight`, `engine-size`, `compression-ratio`, `city-mpg`, `highway-mpg` и заполните их предсказаниями

In [127]:
from sklearn.linear_model import LinearRegression
X_train = df[df['normalized-losses'] != '?'][['symboling', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', \
                                        'compression-ratio', 'city-mpg', 'highway-mpg']]
X_test = df[df['normalized-losses'] == '?'][['symboling', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', \
                                        'compression-ratio', 'city-mpg', 'highway-mpg']]
y_train =  df[df['normalized-losses'] != '?']['normalized-losses']
y_test = df[df['normalized-losses'] == '?']['normalized-losses']
reg = LinearRegression().fit(X_train, y_train)
reg.predict(X_test)
# your code here

array([168.07249262, 168.07249262, 134.0017988 , 150.03347669,
       124.36459916, 136.54127678, 127.28771126, 138.09039231,
       130.51306948, 113.6478041 , 155.82813541, 175.60424066,
       166.04457052,  88.0845543 , 122.53305095, 127.5822689 ,
       161.54047049, 154.33123955, 132.97923436, 178.13829126,
       179.87212963, 179.97064318, 131.27495017, 119.95762868,
       144.30448909, 121.04127767, 177.84275063, 157.63816248,
       157.63816248, 158.50508167,  98.29098277, 154.71344201,
       121.38477454, 137.06488059, 108.12442301,  94.62530543,
       106.22799742])

In [None]:
pr = list(reg.predict(X_test))
pr

In [137]:
list_ = list(df[df['normalized-losses'] == '?']['normalized-losses'].index)
for i in range(0,len(list_)):
  if df['normalized-losses'][list_[i]] == '?':
    df['normalized-losses'][list_[i]] = pr[i]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['normalized-losses'][list_[i]] = pr[i]


In [138]:
df[df['normalized-losses'] == '?']['normalized-losses']

Series([], Name: normalized-losses, dtype: object)

## Вопрос для Quiz

Чему равно предсказание линейной регрессии на первом пропущенном значении? Ответ округлите до целого числа.

## 2. Кодирование категориальных признаков

1. Закодируйте бинарные признаки `fuel-type`, `aspiration`, `num-of-doors`, `engine-location` каждый отдельной колонкой, состоящей из 0 и 1.
Единицей кодируйте самую частую категорию.

In [112]:
# your code here
for i in ['fuel-type', 'aspiration', 'num-of-doors', 'engine-location']:
  value = df[i].describe(include=object)['top']
  for j in df.index:
    if df[i][j] == value:
      df[i][j] = 1
    else:
      df[i][j] = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[i][j] = 1
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[i][j] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[i][j] = 1
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[i][j] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pan

2. Вынесите в переменную `y` целевую переменную `price`, а все остальные колонки - в матрицу `X`.

Закодируйте признаки `make`, `body-style`, `engine-type`, `fuel-system` при помощи LeaveOneOutEncoder.

**Дальше все время работайте с объектами `X`, `y`.**

In [96]:
!pip install category_encoders -q

In [113]:
from category_encoders.leave_one_out import LeaveOneOutEncoder

y = df['price']
X = df.drop(columns = ['price'])

# your code here

In [114]:
X

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg
0,3,?,alfa-romero,1,1,0,convertible,rwd,1,88.6,...,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27
1,3,?,alfa-romero,1,1,0,convertible,rwd,1,88.6,...,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27
2,1,?,alfa-romero,1,1,0,hatchback,rwd,1,94.5,...,six,152,mpfi,2.68,3.47,9.0,154,5000,19,26
3,2,164,audi,1,1,1,sedan,fwd,1,99.8,...,four,109,mpfi,3.19,3.40,10.0,102,5500,24,30
4,2,164,audi,1,1,1,sedan,4wd,1,99.4,...,five,136,mpfi,3.19,3.40,8.0,115,5500,18,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,volvo,1,1,1,sedan,rwd,1,109.1,...,four,141,mpfi,3.78,3.15,9.5,114,5400,23,28
201,-1,95,volvo,1,0,1,sedan,rwd,1,109.1,...,four,141,mpfi,3.78,3.15,8.7,160,5300,19,25
202,-1,95,volvo,1,1,1,sedan,rwd,1,109.1,...,six,173,mpfi,3.58,2.87,8.8,134,5500,18,23
203,-1,95,volvo,0,0,1,sedan,rwd,1,109.1,...,six,145,idi,3.01,3.40,23.0,106,4800,26,27


In [115]:
encoder = LeaveOneOutEncoder(return_df=True)

X['make'] = encoder.fit_transform(X['make'], y)
X['body-style'] = encoder.fit_transform(X['body-style'], y)
X['engine-type'] = encoder.fit_transform(X['engine-type'], y)
X['fuel-system'] = encoder.fit_transform(X['fuel-system'], y)

## Вопрос для Quiz

Чему равно среднее значение в столбце `body-style` после кодирования? Ответ округлите до целого числа.

In [48]:
X['body-style'].mean()

13207.129353233831

3. Закодируйте признак `drive-wheels` при помощи OHE из библиотеки category_encoders.

In [117]:
from category_encoders.one_hot import OneHotEncoder

# your code here
enc = OneHotEncoder()
enc_data = pd.DataFrame(enc.fit_transform(X[['drive-wheels']]))

# Merge with main
New_X = X.join(enc_data)
New_X

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,drive-wheels_1,drive-wheels_2,drive-wheels_3
0,3,?,16500.0,1,1,0,23569.600000,rwd,1,88.6,...,3.47,2.68,9.0,111,5000,21,27,1,0,0
1,3,?,14997.5,1,1,0,22968.600000,rwd,1,88.6,...,3.47,2.68,9.0,111,5000,21,27,1,0,0
2,1,?,14997.5,1,1,0,9859.791045,rwd,1,94.5,...,2.68,3.47,9.0,154,5000,19,26,1,0,0
3,2,164,18641.0,1,1,1,14465.236559,fwd,1,99.8,...,3.19,3.40,10.0,102,5500,24,30,0,1,0
4,2,164,17941.0,1,1,1,14427.602151,4wd,1,99.4,...,3.19,3.40,8.0,115,5500,18,22,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,18185.0,1,1,1,14434.107527,rwd,1,109.1,...,3.78,3.15,9.5,114,5400,23,28,1,0,0
201,-1,95,17965.0,1,0,1,14410.451613,rwd,1,109.1,...,3.78,3.15,8.7,160,5300,19,25,1,0,0
202,-1,95,17721.0,1,1,1,14384.215054,rwd,1,109.1,...,3.58,2.87,8.8,134,5500,18,23,1,0,0
203,-1,95,17622.5,0,0,1,14373.623656,rwd,1,109.1,...,3.01,3.40,23.0,106,4800,26,27,1,0,0


In [58]:
X['drive-wheels'].unique()

array(['rwd', 'fwd', '4wd'], dtype=object)

4. В столбце `num-of-cylinders` категории упорядочены по смыслу. Закодируйте их подряд идущими числами, начиная с 1, согласно смыслу.

Подряд идущими числами означает - 1, 2, 3 и так далее без пропусков.

In [118]:
X['num-of-cylinders'].unique()

array(['four', 'six', 'five', 'three', 'twelve', 'two', 'eight'],
      dtype=object)

In [121]:
# your code here
for i in New_X.index:
  if New_X['num-of-cylinders'][i] == 'four':
    New_X['num-of-cylinders'] = 1
  elif New_X['num-of-cylinders'][i] == 'six':
    New_X['num-of-cylinders'] = 2
  elif New_X['num-of-cylinders'][i] == 'five':
    New_X['num-of-cylinders'] = 3
  elif New_X['num-of-cylinders'][i] == 'three':
    New_X['num-of-cylinders'] = 4
  elif New_X['num-of-cylinders'][i] == 'twelve':
    New_X['num-of-cylinders'] = 5
  elif New_X['num-of-cylinders'][i] == 'two':
    New_X['num-of-cylinders'] = 6
  elif New_X['num-of-cylinders'][i] == 'eight':
    New_X['num-of-cylinders'] = 7

In [123]:
New_X = New_X.drop(columns=['drive-wheels'])

In [124]:
New_X

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,engine-location,wheel-base,length,...,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,drive-wheels_1,drive-wheels_2,drive-wheels_3
0,3,?,16500.0,1,1,0,23569.600000,1,88.6,168.8,...,3.47,2.68,9.0,111,5000,21,27,1,0,0
1,3,?,14997.5,1,1,0,22968.600000,1,88.6,168.8,...,3.47,2.68,9.0,111,5000,21,27,1,0,0
2,1,?,14997.5,1,1,0,9859.791045,1,94.5,171.2,...,2.68,3.47,9.0,154,5000,19,26,1,0,0
3,2,164,18641.0,1,1,1,14465.236559,1,99.8,176.6,...,3.19,3.40,10.0,102,5500,24,30,0,1,0
4,2,164,17941.0,1,1,1,14427.602151,1,99.4,176.6,...,3.19,3.40,8.0,115,5500,18,22,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,-1,95,18185.0,1,1,1,14434.107527,1,109.1,188.8,...,3.78,3.15,9.5,114,5400,23,28,1,0,0
201,-1,95,17965.0,1,0,1,14410.451613,1,109.1,188.8,...,3.78,3.15,8.7,160,5300,19,25,1,0,0
202,-1,95,17721.0,1,1,1,14384.215054,1,109.1,188.8,...,3.58,2.87,8.8,134,5500,18,23,1,0,0
203,-1,95,17622.5,0,0,1,14373.623656,1,109.1,188.8,...,3.01,3.40,23.0,106,4800,26,27,1,0,0


## Вопрос для Quiz

Сколько столбцов получилось в матрице `X`?

In [141]:
X = New_X.copy()

In [None]:
list_ = list(New_X[New_X['normalized-losses'] == '?']['normalized-losses'].index)
for i in range(0,len(list_)):
  if New_X['normalized-losses'][list_[i]] == '?':
    New_X['normalized-losses'][list_[i]] = pr[i]

In [142]:
X['normalized-losses'] = X['normalized-losses'].astype(float)
X['bore'] = X['bore'].astype(float)
X['stroke'] = X['stroke'].astype(float)
X['horsepower'] = X['horsepower'].astype(float)
X['peak-rpm'] = X['peak-rpm'].astype(float)

y = y.astype(float)

Разбейте данные на тренировочную и тестовую часть в пропорции 3 к 1, зафиксируйте random_state = 42.

In [156]:
from sklearn.model_selection import train_test_split

# your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Масштабируйте данные при помощи MinMaxScaler.

Обучайте масштабирование на тренировочных данных, а потом примените и к трейну, и к тесту.

In [157]:
from sklearn.preprocessing import MinMaxScaler

# your code here
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [158]:
X_train = pd.DataFrame(X_train)

In [159]:
X_test = pd.DataFrame(X_test)

Обучите на тренировочных данных линейную регрессию, сделайте предсказание на тесте и вычислите значение $R^2$ на тестовых данных.

In [160]:
# your code here
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
reg = LinearRegression().fit(X_train, y_train)
reg.predict(X_test)
r2_score(y_test, reg.predict(X_test))
# your code here

0.9139278213995001

## Вопрос для Quiz

Чему равно значение $R^2$ на тестовых данных? Ответ округлите до сотых.