Course project for the "Python for Data Science" course

Project materials (files):
train.csv
test.csv

Task:
Using the data in the training dataset (train.csv), build a model that predicts real-estate (apartment) prices.
Then use that model to predict prices for the apartments in the test dataset (test.csv).

Target variable:
Price

Quality metric:
R2, the coefficient of determination (sklearn.metrics.r2_score)
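
For reference, the coefficient of determination is R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. A minimal NumPy sketch of that formula follows; the function name `r2_manual` and the toy numbers are illustrative and not part of the assignment.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy values: SS_res = 200, SS_tot = 20000, so R2 = 0.99.
toy_true = [100.0, 200.0, 300.0]
toy_pred = [110.0, 190.0, 300.0]
toy_r2 = r2_manual(toy_true, toy_pred)
```

A model that always predicts the mean of the true values scores 0, and a perfect model scores 1, so the required threshold of 0.6 sits between those two reference points.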

Solution requirements:
1. R2 > 0.6
2. A Jupyter Notebook with the code of your solution, named following the pattern {FullName}_solution.ipynb, e.g. SShirkin_solution.ipynb
3. A CSV file with predictions of the target variable for the test dataset, named following the pattern {FullName}_predictions.csv, e.g. SShirkin_predictions.csv
The file must contain two fields, Id and Price, and 5001 lines (header + 5000 predictions).
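
The file requirements above can be checked before submitting. The sketch below is one possible check, not part of the assignment: `check_submission` is a hypothetical helper, and a two-row in-memory file stands in for the real 5000-row one.

```python
import io
import pandas as pd

def check_submission(csv_source, expected_rows=5000):
    """Return a list of problems found in a predictions file (empty if none)."""
    df = pd.read_csv(csv_source)
    problems = []
    if list(df.columns) != ["Id", "Price"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    if df.isna().any().any():
        problems.append("file contains missing values")
    return problems

# Toy two-row file standing in for the real 5000-row submission.
toy_csv = io.StringIO("Id,Price\n725,160472.2\n15856,213625.2\n")
toy_problems = check_submission(toy_csv, expected_rows=2)
```

For the real file, a path such as 'SShirkin_predictions.csv' would be passed instead of the `StringIO` object, with the default `expected_rows=5000`.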

Submission deadline:
The project must be submitted within 72 hours after the end of the last webinar. Scores for projects submitted before the deadline will be published as a ranking ordered by the quality metric. Projects submitted after the deadline, or resubmitted, are not included in the ranking, but their result will still be available.

Recommendations for the code file (ipynb):
1. The file should contain headings and comments (markdown)
2. Repeated operations are better written as functions
3. Do not print large numbers of table rows (5-10 is enough)
4. Where possible, add plots describing the data (about 3-5)
5. Include only the best model, i.e. do not include every variant of the solution in the code
6. The project script must run from start to finish (from loading the data to exporting the predictions)
7. The whole project must be in a single script (ipynb file).
8. Python libraries and machine learning models covered in this course may be used.

Dataset description:
Id - apartment identifier
DistrictId - district identifier
Rooms - number of rooms
Square - total area
LifeSquare - living area
KitchenSquare - kitchen area
Floor - floor
HouseFloor - number of floors in the building
HouseYear - year the building was built
Ecology_1, Ecology_2, Ecology_3 - environmental indicators of the area
Social_1, Social_2, Social_3 - social indicators of the area
Healthcare_1, Helthcare_2 - healthcare-related indicators of the area
Shops_1, Shops_2 - indicators related to the presence of shops and shopping centers
Price - apartment price

Importing libraries:

In [236]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score as r2
import matplotlib
import matplotlib.pyplot as plt 

In [515]:
train_df = pd.read_csv("project_task/train.csv")
test_df = pd.read_csv("project_task/test.csv")

In [516]:
train_df.head()

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
0,14038,35,2.0,47.981561,29.442751,6.0,7,9.0,1969,0.08904,B,B,33,7976,5,,0,11,B,184966.93073
1,15053,41,3.0,65.68364,40.049543,8.0,7,9.0,1978,7e-05,B,B,46,10309,1,240.0,1,16,B,300009.450063
2,4765,53,2.0,44.947953,29.197612,0.0,8,12.0,1968,0.049637,B,B,34,7759,0,229.0,1,3,B,220925.908524
3,5809,58,2.0,53.352981,52.731512,9.0,8,17.0,1977,0.437885,B,B,23,5735,3,1084.0,0,5,B,175616.227217
4,10783,99,1.0,39.649192,23.776169,7.0,11,12.0,1976,0.012339,B,B,35,5776,1,2078.0,2,4,B,150226.531644


In [517]:
test_df.head()

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2
0,725,58,2.0,49.882643,33.432782,6.0,6,14.0,1972,0.310199,B,B,11,2748,1,,0,0,B
1,15856,74,2.0,69.263183,,1.0,6,1.0,1977,0.075779,B,B,6,1437,3,,0,2,B
2,5480,190,1.0,13.597819,15.948246,12.0,2,5.0,1909,0.0,B,B,30,7538,87,4702.0,5,5,B
3,15664,47,2.0,73.046609,51.940842,9.0,22,22.0,2007,0.101872,B,B,23,4583,3,,3,3,B
4,14275,27,1.0,47.527111,43.387569,1.0,17,17.0,2017,0.072158,B,B,2,629,1,,0,0,A


In [518]:
train_df.isna().sum()

Id                  0
DistrictId          0
Rooms               0
Square              0
LifeSquare       2113
KitchenSquare       0
Floor               0
HouseFloor          0
HouseYear           0
Ecology_1           0
Ecology_2           0
Ecology_3           0
Social_1            0
Social_2            0
Social_3            0
Healthcare_1     4798
Helthcare_2         0
Shops_1             0
Shops_2             0
Price               0
dtype: int64

In [519]:
test_df.isna().sum()

Id                  0
DistrictId          0
Rooms               0
Square              0
LifeSquare       1041
KitchenSquare       0
Floor               0
HouseFloor          0
HouseYear           0
Ecology_1           0
Ecology_2           0
Ecology_3           0
Social_1            0
Social_2            0
Social_3            0
Healthcare_1     2377
Helthcare_2         0
Shops_1             0
Shops_2             0
dtype: int64

In [520]:
train_df.nunique()

Id               10000
DistrictId         205
Rooms                9
Square           10000
LifeSquare        7887
KitchenSquare       58
Floor               33
HouseFloor          44
HouseYear           97
Ecology_1          129
Ecology_2            2
Ecology_3            2
Social_1            51
Social_2           142
Social_3            30
Healthcare_1        79
Helthcare_2          7
Shops_1             16
Shops_2              2
Price            10000
dtype: int64

In [521]:
test_df.nunique()

Id               5000
DistrictId        201
Rooms               8
Square           5000
LifeSquare       3959
KitchenSquare      38
Floor              35
HouseFloor         41
HouseYear          97
Ecology_1         130
Ecology_2           2
Ecology_3           2
Social_1           51
Social_2          143
Social_3           30
Healthcare_1       79
Helthcare_2         7
Shops_1            16
Shops_2             2
dtype: int64

Baseline model:

In [522]:
train_df.replace({'Ecology_2': {'A':0,'B':1}, 'Ecology_3' : {'A':0,'B':1}, 'Shops_2': {'A':0,'B':1}}, inplace = True)

In [523]:
test_df.replace({'Ecology_2': {'A':0,'B':1}, 'Ecology_3' : {'A':0,'B':1}, 'Shops_2': {'A':0,'B':1}}, inplace = True)
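
The same encoding is applied to both frames, so it could be wrapped in a function, in line with the recommendation to factor out repeated operations. This is only a sketch: `encode_binary` and `BINARY_COLUMNS` are hypothetical names, and the toy frame stands in for train_df and test_df.

```python
import pandas as pd

BINARY_COLUMNS = ("Ecology_2", "Ecology_3", "Shops_2")

def encode_binary(df, columns=BINARY_COLUMNS):
    """Return a copy of df with the 'A'/'B' flags mapped to 0/1."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].map({"A": 0, "B": 1})
    return out

toy = pd.DataFrame({
    "Ecology_2": ["B", "A"],
    "Ecology_3": ["B", "B"],
    "Shops_2": ["A", "B"],
})
toy_encoded = encode_binary(toy)
```

Unlike `replace`, `map` turns any value other than 'A' or 'B' into NaN, which makes unexpected categories visible; the function also returns a copy instead of modifying the frame in place.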

In [524]:
X_train,X_test,y_train,y_test = train_test_split(train_df.drop(['Id','LifeSquare','Healthcare_1','Price'], axis = 'columns'),train_df['Price'],test_size = 0.2,random_state=50)

In [508]:
r=[]
for i in [15,20,25]:
    for j in [100,150,200]:
        model=RandomForestRegressor(max_depth=i,n_estimators=j, random_state=55)
        model.fit(X_train,y_train)
        first_pr=model.predict(X_test)
        r.append(r2(y_test,first_pr))
r

[0.7472705781206366,
 0.7492099804411674,
 0.7491797948636563,
 0.7466377600689109,
 0.748494149440379,
 0.7488023484836063,
 0.7458166952833022,
 0.7477838106824765,
 0.7474383782174129]

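We select max_depth=15, n_estimators=200: at max_depth=15 the scores for 150 and 200 trees are practically identical (0.74921 vs 0.74918), and deeper trees score slightly lower.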
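
GridSearchCV is imported at the top of the notebook, but the search above is a hand-written loop scored on a single hold-out split. As a hedged alternative, the same kind of search can be run with cross-validation. The sketch below uses synthetic data and a deliberately small grid so that it runs quickly; in the notebook, X_train and y_train and the grid from the loop would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for this sketch only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

param_grid = {"max_depth": [3, 6], "n_estimators": [20, 40]}
search = GridSearchCV(
    RandomForestRegressor(random_state=55),
    param_grid,
    scoring="r2",  # same metric as the project
    cv=3,          # 3-fold cross-validation instead of one hold-out split
)
search.fit(X_demo, y_demo)

best_params = search.best_params_  # dict with the best max_depth / n_estimators
best_score = search.best_score_    # mean cross-validated R2 of that combination
```

Averaging over folds makes small differences between parameter combinations, like those in the list above, somewhat less dependent on one particular split.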

In [525]:
model=RandomForestRegressor(max_depth=15,n_estimators=200, random_state=55)
model.fit(X_train,y_train)
first_pr=model.predict(X_test)
r2(y_test,first_pr)

0.7528564506466789

Refit the model on the full training set:

In [526]:
model=RandomForestRegressor(max_depth=15,n_estimators=200, random_state=55)
model.fit(train_df.drop(['Id','LifeSquare','Healthcare_1','Price'], axis = 'columns'),train_df['Price'])

RandomForestRegressor(max_depth=15, n_estimators=200, random_state=55)

In [527]:
f_importance = pd.DataFrame()
f_importance['name'] = train_df.set_index('Id').drop(['Healthcare_1','Price','LifeSquare'],axis='columns').columns.tolist()
f_importance['values'] = model.feature_importances_
f_importance.sort_values('values',ascending= False).reset_index(drop=True)

Unnamed: 0,name,values
0,Square,0.408505
1,Social_1,0.101846
2,Social_2,0.099371
3,Rooms,0.082868
4,Social_3,0.061396
5,DistrictId,0.051143
6,Ecology_1,0.04558
7,HouseYear,0.03895
8,Floor,0.029145
9,KitchenSquare,0.027619


In [528]:
first_pr=model.predict(test_df.drop(['Id','LifeSquare','Healthcare_1'], axis = 'columns'))
test_df['Price'] = first_pr
test_df.head()

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
0,725,58,2.0,49.882643,33.432782,6.0,6,14.0,1972,0.310199,1,1,11,2748,1,,0,0,1,161688.51831
1,15856,74,2.0,69.263183,,1.0,6,1.0,1977,0.075779,1,1,6,1437,3,,0,2,1,223207.960883
2,5480,190,1.0,13.597819,15.948246,12.0,2,5.0,1909,0.0,1,1,30,7538,87,4702.0,5,5,1,201699.454156
3,15664,47,2.0,73.046609,51.940842,9.0,22,22.0,2007,0.101872,1,1,23,4583,3,,3,3,1,340680.83267
4,14275,27,1.0,47.527111,43.387569,1.0,17,17.0,2017,0.072158,1,1,2,629,1,,0,0,0,142791.050441


In [529]:
test_df[['Id','Price']].to_csv('first_predict.csv', index = False)

Second model with data corrections:

In [530]:
train_df = pd.read_csv("project_task/train.csv")
test_df = pd.read_csv("project_task/test.csv")

In [531]:
p1 = (train_df['LifeSquare']/train_df['Square']).dropna().mean()
p1

0.665884778674159

In [532]:
train_df['LifeSquare']=train_df['LifeSquare'].fillna(train_df['Square']*p1)
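
The imputation above replaces a missing LifeSquare with Square multiplied by the average LifeSquare/Square ratio. A toy illustration with made-up numbers (not taken from the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Square": [50.0, 60.0, 40.0],
    "LifeSquare": [30.0, np.nan, 28.0],
})

# Mean LifeSquare/Square ratio over rows where LifeSquare is known;
# Series.mean() skips NaN by default, so an explicit dropna() is optional.
ratio = (toy["LifeSquare"] / toy["Square"]).mean()  # (0.6 + 0.7) / 2 = 0.65

# The missing value becomes 60.0 * 0.65 = 39.0; known values are untouched.
toy["LifeSquare"] = toy["LifeSquare"].fillna(toy["Square"] * ratio)
```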

In [533]:
train_df.head()

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
0,14038,35,2.0,47.981561,29.442751,6.0,7,9.0,1969,0.08904,B,B,33,7976,5,,0,11,B,184966.93073
1,15053,41,3.0,65.68364,40.049543,8.0,7,9.0,1978,7e-05,B,B,46,10309,1,240.0,1,16,B,300009.450063
2,4765,53,2.0,44.947953,29.197612,0.0,8,12.0,1968,0.049637,B,B,34,7759,0,229.0,1,3,B,220925.908524
3,5809,58,2.0,53.352981,52.731512,9.0,8,17.0,1977,0.437885,B,B,23,5735,3,1084.0,0,5,B,175616.227217
4,10783,99,1.0,39.649192,23.776169,7.0,11,12.0,1976,0.012339,B,B,35,5776,1,2078.0,2,4,B,150226.531644


In [534]:
Price_Square_District = train_df.groupby('DistrictId')[['Price','Square']].median().reset_index()
Price_Square_District['m_price']=Price_Square_District['Price']/Price_Square_District['Square']
Price_Square_District

Unnamed: 0,DistrictId,Price,Square,m_price
0,0,165963.054142,48.998792,3387.084585
1,1,183663.443595,60.139271,3053.968579
2,2,208539.501373,47.362676,4403.034587
3,3,169094.013281,47.202193,3582.333846
4,4,278639.482329,53.179791,5239.574588
...,...,...,...,...
200,202,394150.861857,68.245487,5775.486120
201,205,220501.566180,43.226985,5101.016598
202,207,426186.409334,76.780960,5550.678339
203,208,431137.654083,53.860839,8004.659114


In [535]:
# This made the model worse
#median_price=train_df.groupby('DistrictId')[['Price']].median().reset_index()
#median_price['median_price']=median_price['Price']
#train_df=pd.merge(train_df,median_price.drop(['Price'],axis = 'columns'),how='left',on='DistrictId')

In [536]:
train_df=pd.merge(train_df,Price_Square_District.drop(['Price','Square'],axis = 'columns'),how='left',on='DistrictId')

In [537]:
# Filling the missing values made the result worse, so Healthcare_1 is excluded when building the model
train_df['Healthcare_1']=train_df['Healthcare_1'].fillna(train_df['Healthcare_1'].median())

In [538]:
X_train,X_test,y_train,y_test = train_test_split(train_df.drop(['Id','Healthcare_1','Price','DistrictId','Shops_2','Ecology_3','Ecology_2','LifeSquare'], axis = 'columns'),train_df['Price'],test_size = 0.2,random_state=50)
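
Note that m_price is derived from Price over all training rows before the split, so the held-out rows have already contributed to the feature and the validation R2 may be somewhat optimistic. One possible way to avoid this, sketched on toy data, is to compute the district statistic from the fitting part only. `district_m_price` and `add_m_price` are hypothetical helpers, and the row slices stand in for a real train_test_split.

```python
import pandas as pd

def district_m_price(df):
    """Median price divided by median square, per district."""
    med = df.groupby("DistrictId")[["Price", "Square"]].median()
    return (med["Price"] / med["Square"]).rename("m_price")

def add_m_price(df, m_price, fallback):
    """Attach m_price by DistrictId; districts without a value get fallback."""
    out = df.copy()
    out["m_price"] = out["DistrictId"].map(m_price).fillna(fallback)
    return out

toy = pd.DataFrame({
    "DistrictId": [1, 1, 2, 2, 3],
    "Square": [50.0, 50.0, 40.0, 40.0, 60.0],
    "Price": [100.0, 200.0, 400.0, 400.0, 600.0],
})
fit_part, valid_part = toy.iloc[:4], toy.iloc[4:]  # stand-in for a real split

m_price = district_m_price(fit_part)  # uses prices from fit_part only
fallback = m_price.median()           # for districts unseen in fit_part
fit_part = add_m_price(fit_part, m_price, fallback)
valid_part = add_m_price(valid_part, m_price, fallback)
```

With this arrangement, the validation rows are treated the same way as the rows of test.csv, for which no prices are available.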

In [539]:
model=RandomForestRegressor(max_depth=15,n_estimators=200, random_state=55)
model.fit(X_train,y_train)
second_pr=model.predict(X_test)
r2(y_test,second_pr)

0.7567449770798407

Refit the model on the full training set:

In [540]:
model=RandomForestRegressor(max_depth=15,n_estimators=200, random_state=55)
model.fit(train_df.drop(['Id','Healthcare_1','Price','DistrictId','Shops_2','Ecology_3','Ecology_2','LifeSquare'], axis = 'columns'),train_df['Price'])

RandomForestRegressor(max_depth=15, n_estimators=200, random_state=55)

In [541]:
p2 = (p1+(test_df['LifeSquare']/test_df['Square']).dropna().mean())/2
p2

0.6603070102544011

In [542]:
test_df['LifeSquare']=test_df['LifeSquare'].fillna(test_df['Square']*p2)

In [543]:
test_df=pd.merge(test_df,Price_Square_District.drop(['Price','Square'],axis = 'columns'),how='left',on='DistrictId')

In [544]:
test_df['m_price']=test_df['m_price'].fillna(test_df['m_price'].median())
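
After the left merge, m_price is missing for test rows whose DistrictId does not occur in train.csv, which is why the fill above is needed. The toy sketch below shows how such rows could be counted before filling. It takes the fallback from the district-level training table, which is a variation on the cell above (that cell uses the median over the merged test rows); all values are made up.

```python
import pandas as pd

train_districts = pd.DataFrame({"DistrictId": [1, 2], "m_price": [3.0, 10.0]})
toy_test = pd.DataFrame({"Id": [10, 11, 12], "DistrictId": [1, 3, 2]})

merged = pd.merge(toy_test, train_districts, how="left", on="DistrictId")

# Rows whose district is absent from the training table have NaN here.
n_unseen = int(merged["m_price"].isna().sum())

# Fallback taken from the training-side table (an assumption of this sketch).
fallback = train_districts["m_price"].median()
merged["m_price"] = merged["m_price"].fillna(fallback)
```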

In [545]:
test_df.head()

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,m_price
0,725,58,2.0,49.882643,33.432782,6.0,6,14.0,1972,0.310199,B,B,11,2748,1,,0,0,B,2955.306014
1,15856,74,2.0,69.263183,45.734966,1.0,6,1.0,1977,0.075779,B,B,6,1437,3,,0,2,B,3483.810063
2,5480,190,1.0,13.597819,15.948246,12.0,2,5.0,1909,0.0,B,B,30,7538,87,4702.0,5,5,B,8361.39068
3,15664,47,2.0,73.046609,51.940842,9.0,22,22.0,2007,0.101872,B,B,23,4583,3,,3,3,B,3821.027752
4,14275,27,1.0,47.527111,43.387569,1.0,17,17.0,2017,0.072158,B,B,2,629,1,,0,0,A,2669.535159


In [546]:
second_pr=model.predict(test_df.drop(['Id','Healthcare_1','DistrictId','Shops_2','Ecology_3','Ecology_2','LifeSquare'], axis = 'columns'))
test_df['Price'] = second_pr
test_df.head()

Unnamed: 0,Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,...,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,m_price,Price
0,725,58,2.0,49.882643,33.432782,6.0,6,14.0,1972,0.310199,...,B,11,2748,1,,0,0,B,2955.306014,160472.2256
1,15856,74,2.0,69.263183,45.734966,1.0,6,1.0,1977,0.075779,...,B,6,1437,3,,0,2,B,3483.810063,213625.223845
2,5480,190,1.0,13.597819,15.948246,12.0,2,5.0,1909,0.0,...,B,30,7538,87,4702.0,5,5,B,8361.39068,323164.568709
3,15664,47,2.0,73.046609,51.940842,9.0,22,22.0,2007,0.101872,...,B,23,4583,3,,3,3,B,3821.027752,308806.038656
4,14275,27,1.0,47.527111,43.387569,1.0,17,17.0,2017,0.072158,...,B,2,629,1,,0,0,A,2669.535159,147196.389291


In [547]:
test_df[['Id','Price']].to_csv('MDubrovin_predictions.csv', index = False)

In [548]:
f_importance = pd.DataFrame()
f_importance['name'] = train_df.set_index('Id').drop(['Healthcare_1','Price','DistrictId','Shops_2','Ecology_3','Ecology_2','LifeSquare'],axis='columns').columns.tolist()
f_importance['values'] = model.feature_importances_
f_importance.sort_values('values',ascending= False).reset_index(drop=True)

Unnamed: 0,name,values
0,Square,0.442939
1,m_price,0.334913
2,HouseYear,0.034227
3,Rooms,0.027043
4,Floor,0.025818
5,KitchenSquare,0.022375
6,HouseFloor,0.022321
7,Social_2,0.018202
8,Ecology_1,0.018153
9,Social_3,0.017831
