**In this assignment, the goal is to predict the popularity of a housing listing in New York, based on its text description, location, number of bedrooms, price, etc.**

This is a classification task: predict TARGET, which is low, medium, or high. The metric is accuracy.

train.csv: the training dataset (about 34.5 thousand examples, 15 features)
test.csv: the dataset used to evaluate solutions (about 15 thousand examples, 15 features)

In [89]:
import pandas as pd
import numpy as np

In [90]:
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

In [91]:
print(df_train.shape)
print(df_test.shape)

(34546, 16)
(14806, 15)


In [92]:
df_train.head()

Unnamed: 0,Id,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,TARGET
0,57094,1.0,3,0,2016-05-19 18:06:27,A FABULOUS 3BR IN MIDTOWN WEST! PERFECT APAR...,HOW AMAZING IS THIS MIDTOWN WEST STEAL!! NO FE...,"['Laundry In Unit', 'No Fee', 'Elevator']",40.7647,7039994,-73.9918,4bdc3d8c1aaa90d997ce2cb77680679b,['https://photos.renthop.com/2/7039994_07be01b...,4495,W 50 & AVE 10,medium
1,33389,1.0,1,9225efdfb57a50bf3ec17ebab082f94a,2016-06-16 02:01:49,Renovated Kitchen and Bathroom!,55 River Drive South,"['Dogs Allowed', 'Cats Allowed', 'No Fee']",40.7275,7166774,-74.0322,e5808a5e6cc13988fe596704428d38d5,['https://photos.renthop.com/2/7166774_03cf63a...,2570,55 River Drive South,medium
2,60458,1.0,0,320de7d3cc88e50a7fbbcfde1e825d21,2016-05-04 02:42:50,RARE AND BEST DEAL ON THE MARKET!!!! PERFECT S...,W 77 Street,"['Elevator', 'Hardwood Floors']",40.7798,6962716,-73.9751,d69d4e111612dd12ef864031c1148543,['https://photos.renthop.com/2/6962716_ec7f56f...,1795,22 W 77 Street,low
3,53048,1.0,2,ce6d18bf3238e668b2bf23f4110b7b67,2016-05-12 05:57:56,Newly renovated flex 2 apartment offers the ne...,John Street,"['Swimming Pool', 'Doorman', 'Elevator', 'Fitn...",40.7081,7002458,-74.0065,e6472c7237327dd3903b3d6f6a94515a,['https://photos.renthop.com/2/7002458_93f4010...,3400,100 John Street,low
4,592,1.0,3,fee4d465932160318364d9d48d272879,2016-06-16 06:06:15,LOW FEE apartments do not come around like thi...,West 16th Street,"['Laundry in Building', 'Laundry in Unit', 'Di...",40.7416,7170465,-74.0025,6fba9b3a8327c607b8b043716efee684,['https://photos.renthop.com/2/7170465_9c3f173...,5695,321 West 16th Street,low


In [93]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34546 entries, 0 to 34545
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               34546 non-null  int64  
 1   bathrooms        34546 non-null  float64
 2   bedrooms         34546 non-null  int64  
 3   building_id      34546 non-null  object 
 4   created          34546 non-null  object 
 5   description      33509 non-null  object 
 6   display_address  34458 non-null  object 
 7   features         34546 non-null  object 
 8   latitude         34546 non-null  float64
 9   listing_id       34546 non-null  int64  
 10  longitude        34546 non-null  float64
 11  manager_id       34546 non-null  object 
 12  photos           34546 non-null  object 
 13  price            34546 non-null  int64  
 14  street_address   34542 non-null  object 
 15  TARGET           34546 non-null  object 
dtypes: float64(3), int64(4), object(9)
memory usage: 4.2+ MB


So we have 34546 listings and 15 features; TARGET is the target variable.

In [94]:
X_train = df_train.drop(columns=['TARGET'])

In [95]:
y_train = df_train['TARGET']
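Since the metric is accuracy, it may help to know the class balance before modelling: a model that always predicts the most frequent class already reaches that class's share. A minimal sketch on made-up labels (`y_toy` and its proportions are assumptions, not the real `y_train`):

```python
import pandas as pd

# Toy labels standing in for df_train['TARGET']; the proportions are invented.
y_toy = pd.Series(["low"] * 7 + ["medium"] * 2 + ["high"] * 1)

# Share of each class in the labels.
class_share = y_toy.value_counts(normalize=True)

# Accuracy of a constant model that always predicts the most frequent class.
majority_baseline = class_share.max()

print(class_share)
print(majority_baseline)
```

On the real data the same two lines can be applied to `y_train` to get a reference point for the scores reported later.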

Let's check how many features contain missing values.

In [96]:
X_train.isna().any().sum()

3

In [97]:
X_train.isna().sum()

Id                    0
bathrooms             0
bedrooms              0
building_id           0
created               0
description        1037
display_address      88
features              0
latitude              0
listing_id            0
longitude             0
manager_id            0
photos                0
price                 0
street_address        4
dtype: int64

Missing values occur in the description and in the two address columns; we will not use these features.

In [98]:
X_train.fillna(value='', inplace=True)

In [99]:
X_train.isna().sum()

Id                 0
bathrooms          0
bedrooms           0
building_id        0
created            0
description        0
display_address    0
features           0
latitude           0
listing_id         0
longitude          0
manager_id         0
photos             0
price              0
street_address     0
dtype: int64
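The two checks above count different things: `isna().any().sum()` gives the number of columns with at least one gap, while `isna().sum()` gives the number of gaps per column. A small sketch on a toy frame (the frame and the explicit `text_cols` list are assumptions for illustration) that also fills only the text columns, so that numeric columns are never touched by the empty-string fill:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "description": ["nice flat", np.nan],
    "price": [2500, 3100],
})

# Number of columns that contain at least one missing value.
n_cols_with_na = toy.isna().any().sum()

# Number of missing values in each column.
na_per_col = toy.isna().sum()

# Fill only the listed text columns with an empty string.
text_cols = ["description"]
toy[text_cols] = toy[text_cols].fillna("")

print(n_cols_with_na)
print(na_per_col)
print(toy)
```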

Now let's work with the date and time:

In [100]:
X_train.dtypes 

Id                   int64
bathrooms          float64
bedrooms             int64
building_id         object
created             object
description         object
display_address     object
features            object
latitude           float64
listing_id           int64
longitude          float64
manager_id          object
photos              object
price                int64
street_address      object
dtype: object

In [101]:
X_train['crdata'] = pd.to_datetime(X_train['created'], infer_datetime_format=True)

In [102]:
# Month in which the listing was created
X_train['crmonth'] = X_train['crdata'].dt.month

In [103]:
X_train['crmonth'].value_counts()

6    12061
4    11375
5    11110
Name: crmonth, dtype: int64

In [104]:
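# Whether the listing was created on a weekend.
# weekday() numbers Monday as 0 and Sunday as 6, so Saturday and Sunday are 5 and 6.
X_train['is_weekend'] = X_train['crdata'].apply(lambda x: 1 if x.date().weekday() in (5, 6) else 0)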
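Python's `weekday()` and pandas' `dt.dayofweek` both number Monday as 0 and Sunday as 6, so Saturday and Sunday correspond to 5 and 6. A vectorized sketch of the weekend flag on a few hand-picked dates (the dates are chosen for illustration only):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series([
    "2016-05-19 18:06:27",  # Thursday
    "2016-05-21 10:00:00",  # Saturday
    "2016-05-22 10:00:00",  # Sunday
    "2016-05-23 10:00:00",  # Monday
]))

# dayofweek >= 5 is True for Saturday (5) and Sunday (6) only.
is_weekend = (dates.dt.dayofweek >= 5).astype(int)
print(is_weekend.tolist())
```

The `.dt` accessor avoids a Python-level `apply` over every row, which tends to matter on tens of thousands of listings.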

In [105]:
X_train.head()

Unnamed: 0,Id,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,crdata,crmonth,is_weekend
0,57094,1.0,3,0,2016-05-19 18:06:27,A FABULOUS 3BR IN MIDTOWN WEST! PERFECT APAR...,HOW AMAZING IS THIS MIDTOWN WEST STEAL!! NO FE...,"['Laundry In Unit', 'No Fee', 'Elevator']",40.7647,7039994,-73.9918,4bdc3d8c1aaa90d997ce2cb77680679b,['https://photos.renthop.com/2/7039994_07be01b...,4495,W 50 & AVE 10,2016-05-19 18:06:27,5,0
1,33389,1.0,1,9225efdfb57a50bf3ec17ebab082f94a,2016-06-16 02:01:49,Renovated Kitchen and Bathroom!,55 River Drive South,"['Dogs Allowed', 'Cats Allowed', 'No Fee']",40.7275,7166774,-74.0322,e5808a5e6cc13988fe596704428d38d5,['https://photos.renthop.com/2/7166774_03cf63a...,2570,55 River Drive South,2016-06-16 02:01:49,6,0
2,60458,1.0,0,320de7d3cc88e50a7fbbcfde1e825d21,2016-05-04 02:42:50,RARE AND BEST DEAL ON THE MARKET!!!! PERFECT S...,W 77 Street,"['Elevator', 'Hardwood Floors']",40.7798,6962716,-73.9751,d69d4e111612dd12ef864031c1148543,['https://photos.renthop.com/2/6962716_ec7f56f...,1795,22 W 77 Street,2016-05-04 02:42:50,5,0
3,53048,1.0,2,ce6d18bf3238e668b2bf23f4110b7b67,2016-05-12 05:57:56,Newly renovated flex 2 apartment offers the ne...,John Street,"['Swimming Pool', 'Doorman', 'Elevator', 'Fitn...",40.7081,7002458,-74.0065,e6472c7237327dd3903b3d6f6a94515a,['https://photos.renthop.com/2/7002458_93f4010...,3400,100 John Street,2016-05-12 05:57:56,5,0
4,592,1.0,3,fee4d465932160318364d9d48d272879,2016-06-16 06:06:15,LOW FEE apartments do not come around like thi...,West 16th Street,"['Laundry in Building', 'Laundry in Unit', 'Di...",40.7416,7170465,-74.0025,6fba9b3a8327c607b8b043716efee684,['https://photos.renthop.com/2/7170465_9c3f173...,5695,321 West 16th Street,2016-06-16 06:06:15,6,0


In [106]:
X_train.dtypes

Id                          int64
bathrooms                 float64
bedrooms                    int64
building_id                object
created                    object
description                object
display_address            object
features                   object
latitude                  float64
listing_id                  int64
longitude                 float64
manager_id                 object
photos                     object
price                       int64
street_address             object
crdata             datetime64[ns]
crmonth                     int64
is_weekend                  int64
dtype: object

Let's see which features are categorical:

In [107]:
def find_cat(data, num_uniq=6):
    """Print columns that hold strings and/or have at most num_uniq distinct values."""
    for name in data.columns:
        s = name
        # Look at the first value to decide whether the column stores strings
        if isinstance(data[name].iloc[0], str):
            s += ' string,'
        if data[name].nunique() <= num_uniq:
            s += ' few unique values'
        if s != name:
            print(s)

find_cat(X_train)

building_id string,
created string,
description string,
display_address string,
features string,
manager_id string,
photos string,
street_address string,
crmonth few unique values
is_weekend few unique values


Compute the mean listing price for each month:

In [108]:
def code_mean(data, cat_feature, real_feature):
    return (data[cat_feature].map(data.groupby(cat_feature)[real_feature].mean()))

X_train['month_mean_price'] = code_mean(X_train, 'crmonth', 'price')
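`code_mean` replaces each category with the mean of a numeric column within that category. One common variant, sketched here on toy frames (the frames and the name `month_means` are assumptions), computes the lookup table on the training rows only and reuses it for the test rows, so both sets share the same encoding:

```python
import pandas as pd

train = pd.DataFrame({"crmonth": [4, 4, 5, 6], "price": [3000, 5000, 4000, 2000]})
test = pd.DataFrame({"crmonth": [4, 6, 5], "price": [9000, 9000, 9000]})

# Per-month mean price, computed on the training rows only.
month_means = train.groupby("crmonth")["price"].mean()

# Apply the same lookup table to both frames.
train["month_mean_price"] = train["crmonth"].map(month_means)
test["month_mean_price"] = test["crmonth"].map(month_means)

print(train)
print(test)
```

Note that the test prices do not influence the encoded values here; a month that never occurs in the training rows would map to NaN and would need separate handling.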

In [109]:
X_train.head()

Unnamed: 0,Id,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,crdata,crmonth,is_weekend,month_mean_price
0,57094,1.0,3,0,2016-05-19 18:06:27,A FABULOUS 3BR IN MIDTOWN WEST! PERFECT APAR...,HOW AMAZING IS THIS MIDTOWN WEST STEAL!! NO FE...,"['Laundry In Unit', 'No Fee', 'Elevator']",40.7647,7039994,-73.9918,4bdc3d8c1aaa90d997ce2cb77680679b,['https://photos.renthop.com/2/7039994_07be01b...,4495,W 50 & AVE 10,2016-05-19 18:06:27,5,0,3877.583888
1,33389,1.0,1,9225efdfb57a50bf3ec17ebab082f94a,2016-06-16 02:01:49,Renovated Kitchen and Bathroom!,55 River Drive South,"['Dogs Allowed', 'Cats Allowed', 'No Fee']",40.7275,7166774,-74.0322,e5808a5e6cc13988fe596704428d38d5,['https://photos.renthop.com/2/7166774_03cf63a...,2570,55 River Drive South,2016-06-16 02:01:49,6,0,4137.745709
2,60458,1.0,0,320de7d3cc88e50a7fbbcfde1e825d21,2016-05-04 02:42:50,RARE AND BEST DEAL ON THE MARKET!!!! PERFECT S...,W 77 Street,"['Elevator', 'Hardwood Floors']",40.7798,6962716,-73.9751,d69d4e111612dd12ef864031c1148543,['https://photos.renthop.com/2/6962716_ec7f56f...,1795,22 W 77 Street,2016-05-04 02:42:50,5,0,3877.583888
3,53048,1.0,2,ce6d18bf3238e668b2bf23f4110b7b67,2016-05-12 05:57:56,Newly renovated flex 2 apartment offers the ne...,John Street,"['Swimming Pool', 'Doorman', 'Elevator', 'Fitn...",40.7081,7002458,-74.0065,e6472c7237327dd3903b3d6f6a94515a,['https://photos.renthop.com/2/7002458_93f4010...,3400,100 John Street,2016-05-12 05:57:56,5,0,3877.583888
4,592,1.0,3,fee4d465932160318364d9d48d272879,2016-06-16 06:06:15,LOW FEE apartments do not come around like thi...,West 16th Street,"['Laundry in Building', 'Laundry in Unit', 'Di...",40.7416,7170465,-74.0025,6fba9b3a8327c607b8b043716efee684,['https://photos.renthop.com/2/7170465_9c3f173...,5695,321 West 16th Street,2016-06-16 06:06:15,6,0,4137.745709


We use **one-hot encoding**.

In [110]:
X_train = pd.get_dummies(X_train, columns=['is_weekend', 'crmonth'])
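`pd.get_dummies` creates one column per category value that is present in the frame it is given. If a later frame lacks some value, its dummy columns will differ. A minimal sketch, on toy frames, of aligning a second frame to the training columns with `reindex`:

```python
import pandas as pd

train = pd.DataFrame({"crmonth": [4, 5, 6]})
test = pd.DataFrame({"crmonth": [4, 4]})  # months 5 and 6 are absent here

train_oh = pd.get_dummies(train, columns=["crmonth"])
test_oh = pd.get_dummies(test, columns=["crmonth"])

# Give the test frame exactly the training columns; missing dummies become 0.
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)

print(list(test_oh.columns))
```

In this notebook both sets appear to cover the same months, so the alignment would be a safeguard rather than a fix.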


In [111]:
X_train.head()

Unnamed: 0,Id,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,...,photos,price,street_address,crdata,month_mean_price,is_weekend_0,is_weekend_1,crmonth_4,crmonth_5,crmonth_6
0,57094,1.0,3,0,2016-05-19 18:06:27,A FABULOUS 3BR IN MIDTOWN WEST! PERFECT APAR...,HOW AMAZING IS THIS MIDTOWN WEST STEAL!! NO FE...,"['Laundry In Unit', 'No Fee', 'Elevator']",40.7647,7039994,...,['https://photos.renthop.com/2/7039994_07be01b...,4495,W 50 & AVE 10,2016-05-19 18:06:27,3877.583888,1,0,0,1,0
1,33389,1.0,1,9225efdfb57a50bf3ec17ebab082f94a,2016-06-16 02:01:49,Renovated Kitchen and Bathroom!,55 River Drive South,"['Dogs Allowed', 'Cats Allowed', 'No Fee']",40.7275,7166774,...,['https://photos.renthop.com/2/7166774_03cf63a...,2570,55 River Drive South,2016-06-16 02:01:49,4137.745709,1,0,0,0,1
2,60458,1.0,0,320de7d3cc88e50a7fbbcfde1e825d21,2016-05-04 02:42:50,RARE AND BEST DEAL ON THE MARKET!!!! PERFECT S...,W 77 Street,"['Elevator', 'Hardwood Floors']",40.7798,6962716,...,['https://photos.renthop.com/2/6962716_ec7f56f...,1795,22 W 77 Street,2016-05-04 02:42:50,3877.583888,1,0,0,1,0
3,53048,1.0,2,ce6d18bf3238e668b2bf23f4110b7b67,2016-05-12 05:57:56,Newly renovated flex 2 apartment offers the ne...,John Street,"['Swimming Pool', 'Doorman', 'Elevator', 'Fitn...",40.7081,7002458,...,['https://photos.renthop.com/2/7002458_93f4010...,3400,100 John Street,2016-05-12 05:57:56,3877.583888,1,0,0,1,0
4,592,1.0,3,fee4d465932160318364d9d48d272879,2016-06-16 06:06:15,LOW FEE apartments do not come around like thi...,West 16th Street,"['Laundry in Building', 'Laundry in Unit', 'Di...",40.7416,7170465,...,['https://photos.renthop.com/2/7170465_9c3f173...,5695,321 West 16th Street,2016-06-16 06:06:15,4137.745709,1,0,0,0,1


Compute the rental price per room:

In [141]:
c_rooms = X_train["bedrooms"] + X_train["bathrooms"]
c_rooms = c_rooms.apply(lambda x : x if x > 0 else 1)
X_train['price_per_room'] = X_train["price"] / c_rooms
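The same price-per-room feature can be written without `apply`. This sketch uses `Series.where` on a toy frame (values invented) to replace a zero room count with 1 before dividing:

```python
import pandas as pd

toy = pd.DataFrame({
    "bedrooms": [3, 0, 0],
    "bathrooms": [1.0, 1.0, 0.0],
    "price": [4000, 1800, 1500],
})

rooms = toy["bedrooms"] + toy["bathrooms"]

# Keep positive room counts, replace everything else with 1 to avoid dividing by zero.
rooms = rooms.where(rooms > 0, 1)

toy["price_per_room"] = toy["price"] / rooms
print(toy)
```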



We are interested in the following features:

In [204]:
real_features = ['bathrooms',
                 'bedrooms',
                 'price',
                 'latitude',
                 'longitude',
                 'is_weekend_0',
                 'is_weekend_1',
                 'crmonth_4',
                 'crmonth_5',
                 'crmonth_6',
                 'month_mean_price',
                 'price_per_room']

Apply the Z-score transformation.

In [144]:
from sklearn.preprocessing import StandardScaler

In [145]:
scater = StandardScaler().fit(X_train.loc[:, real_features])

In [146]:
scater_X_train = scater.transform(X_train.loc[:, real_features])

In [147]:
scater_X_train

array([[-0.42559027,  1.30667192,  0.0230431 , ..., -0.73239473,
        -0.05436113, -0.05131692],
       [-0.42559027, -0.48428954, -0.05013345, ...,  1.36538393,
         1.20401843, -0.03393881],
       [-0.42559027, -1.37977027, -0.07959414, ..., -0.73239473,
        -0.05436113,  0.02102452],
       ...,
       [ 1.58968511,  0.41119119,  0.04642158, ..., -0.73239473,
        -1.22353531, -0.0347471 ],
       [-0.42559027,  0.41119119, -0.018582  , ..., -0.73239473,
        -0.05436113, -0.05028412],
       [-0.42559027,  0.41119119, -0.03568821, ..., -0.73239473,
        -1.22353531, -0.0664498 ]])
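For reference, `StandardScaler` subtracts the column mean and divides by the column standard deviation (the population one, `ddof=0`). A small check on a toy array that the manual formula and the scaler agree:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 600.0]])

# Manual Z-score: np.std uses ddof=0 by default, as StandardScaler does.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
```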

Apply MinMax scaling.

In [148]:
from sklearn.preprocessing import MinMaxScaler
min_max_scater = MinMaxScaler().fit(X_train.loc[:, real_features])

In [149]:
min_max_scater_X_train = min_max_scater.transform(X_train.loc[:, real_features])

In [150]:
min_max_scater_X_train

array([[1.66666667e-01, 3.75000000e-01, 9.91101247e-04, ...,
        0.00000000e+00, 4.81626486e-01, 7.20790046e-04],
       [1.66666667e-01, 1.25000000e-01, 5.62366438e-04, ...,
        1.00000000e+00, 1.00000000e+00, 8.28532706e-04],
       [1.66666667e-01, 0.00000000e+00, 3.89758917e-04, ...,
        0.00000000e+00, 4.81626486e-01, 1.16930019e-03],
       ...,
       [3.33333333e-01, 2.50000000e-01, 1.12807367e-03, ...,
        0.00000000e+00, 0.00000000e+00, 8.23521420e-04],
       [1.66666667e-01, 2.50000000e-01, 7.47223525e-04, ...,
        0.00000000e+00, 4.81626486e-01, 7.27193357e-04],
       [1.66666667e-01, 2.50000000e-01, 6.46999803e-04, ...,
        0.00000000e+00, 0.00000000e+00, 6.26967626e-04]])

Select the k best features using ANOVA.

In [154]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.feature_selection import VarianceThreshold

In [156]:
# Select the 5 best features using f_classif, the ANOVA F-test scoring function for classification
x_data_kbest = SelectKBest(f_classif, k=5).fit_transform(min_max_scater_X_train, y_train)
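`fit_transform` returns only the reduced array, so the names of the chosen columns are lost. Keeping the fitted selector and calling `get_support()` recovers them. A sketch on synthetic data (the data and the `names` array are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)

informative = y + rng.normal(scale=0.1, size=y.size)  # closely tracks the class
noise = rng.normal(size=(y.size, 2))                  # unrelated to the class
X = np.column_stack([noise[:, 0], informative, noise[:, 1]])
names = np.array(["noise_a", "informative", "noise_b"])

# Fit the selector, then read the boolean mask of kept columns.
selector = SelectKBest(f_classif, k=1).fit(X, y)
selected = names[selector.get_support()]

print(selected)
```

With the notebook's data, the same mask could be applied to `np.array(real_features)`.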

By variance:

In [159]:
# Select features whose variance exceeds a threshold
x_data_varth = VarianceThreshold(.2).fit_transform(min_max_scater_X_train)
x_data_varth.shape

(34546, 3)

In [164]:
x_data_varth = VarianceThreshold(.1).fit_transform(min_max_scater_X_train)
x_data_varth.shape

(34546, 6)
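For a 0/1 column with a share p of ones, the variance is p(1 - p), which never exceeds 0.25. A threshold of 0.2 therefore keeps only fairly balanced dummy columns. A toy sketch of this behaviour:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.column_stack([
    np.array([0, 1] * 50),          # p = 0.5 -> variance 0.25
    np.array([1] * 10 + [0] * 90),  # p = 0.1 -> variance 0.09
])

vt = VarianceThreshold(threshold=0.2).fit(X)

print(vt.variances_)      # per-column variances
print(vt.get_support())   # which columns survive the threshold
```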

In [165]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [205]:
# Result on the unscaled features from X_train
cross_val_score(LogisticRegression(max_iter=10000), X_train.loc[:, real_features], y_train, scoring='accuracy').mean()

0.6952469178082908

In [197]:
# Result for the Z-score standardized data
cross_val_score(LogisticRegression(max_iter=10000), scater_X_train, y_train, scoring='accuracy').mean()

0.6928732233109397

In [169]:
cross_val_score(LogisticRegression(max_iter=10000), min_max_scater_X_train, y_train, scoring='accuracy').mean()

0.6946390318297471

In [171]:
# For variance-based selection
cross_val_score(LogisticRegression(max_iter=10000), x_data_varth, y_train, scoring='accuracy').mean()

0.694696927328372

In [172]:
# For ANOVA-based selection
cross_val_score(LogisticRegression(max_iter=10000), x_data_kbest, y_train, scoring='accuracy').mean()

0.6946390318297471

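The scores are nearly identical: ANOVA selection matches the full MinMax-scaled data (0.69464), variance-based selection is marginally higher (0.69470), and the original unscaled features give the best score (0.69525).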
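In the comparisons above the scalers were fitted on the whole training set before cross-validation. A common alternative, sketched here on synthetic data (the data and the name `pipe` are assumptions), is to put the scaler in a pipeline so that it is refitted on the training part of every fold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Four features on very different scales; the label depends on the first two.
X = rng.normal(size=(300, 4)) * [1, 10, 100, 1000]
y = (X[:, 0] + X[:, 1] / 10 > 0).astype(int)

# The scaler is fitted inside each fold, on that fold's training rows only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, scoring="accuracy", cv=5)

print(scores.mean())
```

For simple scalers the difference is usually small, but the pipeline form keeps the evaluation free of information from the held-out fold.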

### Feature subset search

In [None]:
!pip install mlxtend

In [186]:
from mlxtend.feature_selection import SequentialFeatureSelector

selector = SequentialFeatureSelector(LogisticRegression(), scoring='accuracy',
                                     verbose=2, k_features=7, forward=True, n_jobs=-1)

selector.fit(scater_X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:   11.2s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:   11.2s finished

[2022-05-26 19:42:47] Features: 1/7 -- score: 0.694696927328372[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  11 out of  11 | elapsed:    7.3s finished

[2022-05-26 19:42:54] Features: 2/7 -- score: 0.694696927328372[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    7.8s finished

[2022-05-26 19:43:02] Features: 3/7 -- score: 0.694696927328372[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   7 out of   9 | elapsed:    7.9s remaining:    2.2s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:   10.9s finished

[2022-05-26 19:43:13] Features: 4/7 -- score: 0.694725870

SequentialFeatureSelector(estimator=LogisticRegression(), k_features=7,
                          n_jobs=-1, scoring='accuracy', verbose=2)
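If installing mlxtend is not an option, scikit-learn ships its own `SequentialFeatureSelector` (a different class from the mlxtend one used above, with different parameter names). A sketch of forward selection on synthetic data; the dataset and parameter values are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# Greedy forward selection: add one feature at a time, scored by CV accuracy.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    scoring="accuracy",
    cv=3,
)
sfs.fit(X, y)

mask = sfs.get_support()   # boolean mask of the selected columns
X_selected = sfs.transform(X)
print(mask)
```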

The result improved by 0.02 percent. This may be within the margin of error, or it may not.
Overall, the best result, 0.6952, was obtained on X_train.loc[:, real_features].

Train the final model.

In [215]:
# Work on a copy so that the feature columns added below do not modify df_test
X_test = df_test.copy()

In [216]:
X_test['crdata'] = pd.to_datetime(X_test['created'], infer_datetime_format=True)

In [217]:
X_test['crmonth'] = X_test['crdata'].dt.month

In [220]:
# weekday() numbers Monday as 0 and Sunday as 6, so Saturday and Sunday are 5 and 6.
X_test['is_weekend'] = X_test['crdata'].apply(lambda x: 1 if x.date().weekday() in (5, 6) else 0)

In [221]:
X_test['month_mean_price'] = code_mean(X_test, 'crmonth', 'price')

In [223]:
X_test = pd.get_dummies(X_test, columns=['is_weekend', 'crmonth'])

In [225]:
c_rooms = X_test["bedrooms"] + X_test["bathrooms"]
c_rooms = c_rooms.apply(lambda x : x if x > 0 else 1)
X_test['price_per_room'] = X_test["price"] / c_rooms

In [200]:
lg = LogisticRegression(max_iter=10000).fit(X_train.loc[:, real_features], y_train)

In [226]:
y_pred = lg.predict(X_test.loc[:, real_features])
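The test-set cells repeat the training-set feature engineering step by step. One way to keep the two in sync is a single helper applied to both frames; the sketch below is a hypothetical refactoring (`add_features`, `month_means`, and the toy frames are not part of the notebook). It takes the per-month mean prices from the training data and treats Saturday and Sunday as the weekend:

```python
import pandas as pd

def add_features(df, month_means):
    """Return a copy of df with the date and price features added."""
    out = df.copy()  # leave the caller's frame untouched
    created = pd.to_datetime(out["created"])
    out["crmonth"] = created.dt.month
    out["is_weekend"] = (created.dt.dayofweek >= 5).astype(int)
    out["month_mean_price"] = out["crmonth"].map(month_means)
    rooms = out["bedrooms"] + out["bathrooms"]
    out["price_per_room"] = out["price"] / rooms.where(rooms > 0, 1)
    return out

train = pd.DataFrame({
    "created": ["2016-04-02 10:00:00", "2016-05-19 18:06:27"],
    "bedrooms": [1, 3], "bathrooms": [1.0, 1.0], "price": [2000, 4000],
})
test = pd.DataFrame({
    "created": ["2016-05-04 02:42:50"],
    "bedrooms": [0], "bathrooms": [0.0], "price": [1800],
})

# Month -> mean price lookup, computed on the training rows only.
train_month = pd.to_datetime(train["created"]).dt.month
month_means = train.groupby(train_month)["price"].mean()

train_f = add_features(train, month_means)
test_f = add_features(test, month_means)
print(test_f)
```

One-hot encoding and the final column selection would follow as in the cells above.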

In [244]:
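# Note: df_test appears to have no TARGET column (15 columns versus 16 in train),
# and 'Id' is an identifier rather than a label, so true labels are needed here.
y_test = df_test['Id']

In [245]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))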

0.6945157368634337


The accuracy is almost 70 percent (0.6945), in line with the cross-validation score of 0.6952.

Create the submission:

In [251]:
submit = pd.DataFrame.from_dict({'Id':np.arange(0, 14806), 'TARGET': y_pred})
submit.to_csv("sumbit.csv", index=False)

In [252]:
submit.head()

Unnamed: 0,Id,TARGET
0,0,low
1,1,low
2,2,low
3,3,low
4,4,low
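Before uploading, it can be worth checking the submission's columns and label values. The sketch below uses invented ids and predictions and writes to an in-memory buffer instead of a file; whether the competition expects positional ids (as above) or the `Id` column from test.csv is not stated here, so the ids are purely illustrative:

```python
import io
import pandas as pd

test_ids = [0, 1, 2]                    # illustrative ids
y_pred_toy = ["low", "medium", "low"]   # illustrative predictions

submit_toy = pd.DataFrame({"Id": test_ids, "TARGET": y_pred_toy})

# Basic sanity checks: one row per id and only the three allowed labels.
labels_ok = set(submit_toy["TARGET"]) <= {"low", "medium", "high"}
ids_unique = submit_toy["Id"].is_unique

# Write the CSV to a string buffer to inspect the exact output.
buffer = io.StringIO()
submit_toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```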
