### Unit 3. Introduction to Machine Learning

#### Preproject (preliminary analysis)

The initial version of the dataset has ten columns containing the following information:

 - **Restaurant_id**: identifier of the restaurant or restaurant chain;
 - **City**: the city where the restaurant is located;
 - **Cuisine Style**: the cuisine or cuisines that the restaurant's dishes belong to;
 - **Ranking**: the restaurant's position among all restaurants in its city;
 - **Rating**: the restaurant's rating according to TripAdvisor (this is the value the model must predict);
 - **Price Range**: the restaurant's price range;
 - **Number of Reviews**: the number of reviews of the restaurant;
 - **Reviews**: data on the two reviews displayed on the restaurant's page;
 - **URL_TA**: URL of the restaurant's page on TripAdvisor;
 - **ID_TA**: the restaurant's identifier in the TripAdvisor database.

The task in front of you comes down to three points:

 - Remove from the DataFrame the columns whose data are not numeric (you have already done this; simply repeat the familiar steps, but this time perform this step last).
 - Get rid of missing (None) values (in the previous step we did this in the crudest possible way; now we will try a more flexible approach).
 - Create new columns from information contained in other columns of the DataFrame (for example, a column giving the number of days since the most recent review shown on the site was published).

That said, this task has plenty of pitfalls.

In [20]:
# import libraries
import pandas as pd
import numpy as np
import datetime
from sklearn.model_selection import train_test_split # splits the data into training and test sets
from sklearn.ensemble import RandomForestRegressor # creates and trains the model
from sklearn import metrics # tools for evaluating model accuracy
from collections import Counter
import os
# import re
# import math
# import copy

In [22]:
df = pd.read_csv('main_task.csv')
display(df.head())

Unnamed: 0,Restaurant_id,City,Cuisine Style,Ranking,Rating,Price Range,Number of Reviews,Reviews,URL_TA,ID_TA
0,id_5569,Paris,"['European', 'French', 'International']",5570.0,3.5,$$ - $$$,194.0,"[['Good food at your doorstep', 'A good hotel ...",/Restaurant_Review-g187147-d1912643-Reviews-R_...,d1912643
1,id_1535,Stockholm,,1537.0,4.0,,10.0,"[['Unique cuisine', 'Delicious Nepalese food']...",/Restaurant_Review-g189852-d7992032-Reviews-Bu...,d7992032
2,id_352,London,"['Japanese', 'Sushi', 'Asian', 'Grill', 'Veget...",353.0,4.5,$$$$,688.0,"[['Catch up with friends', 'Not exceptional'],...",/Restaurant_Review-g186338-d8632781-Reviews-RO...,d8632781
3,id_3456,Berlin,,3458.0,5.0,,3.0,"[[], []]",/Restaurant_Review-g187323-d1358776-Reviews-Es...,d1358776
4,id_615,Munich,"['German', 'Central European', 'Vegetarian Fri...",621.0,4.0,$$ - $$$,84.0,"[['Best place to try a Bavarian food', 'Nice b...",/Restaurant_Review-g187309-d6864963-Reviews-Au...,d6864963


In [23]:
df.columns = df.columns.str.replace(' ','_')
df

Unnamed: 0,Restaurant_id,City,Cuisine_Style,Ranking,Rating,Price_Range,Number_of_Reviews,Reviews,URL_TA,ID_TA
0,id_5569,Paris,"['European', 'French', 'International']",5570.0,3.5,$$ - $$$,194.0,"[['Good food at your doorstep', 'A good hotel ...",/Restaurant_Review-g187147-d1912643-Reviews-R_...,d1912643
1,id_1535,Stockholm,,1537.0,4.0,,10.0,"[['Unique cuisine', 'Delicious Nepalese food']...",/Restaurant_Review-g189852-d7992032-Reviews-Bu...,d7992032
2,id_352,London,"['Japanese', 'Sushi', 'Asian', 'Grill', 'Veget...",353.0,4.5,$$$$,688.0,"[['Catch up with friends', 'Not exceptional'],...",/Restaurant_Review-g186338-d8632781-Reviews-RO...,d8632781
3,id_3456,Berlin,,3458.0,5.0,,3.0,"[[], []]",/Restaurant_Review-g187323-d1358776-Reviews-Es...,d1358776
4,id_615,Munich,"['German', 'Central European', 'Vegetarian Fri...",621.0,4.0,$$ - $$$,84.0,"[['Best place to try a Bavarian food', 'Nice b...",/Restaurant_Review-g187309-d6864963-Reviews-Au...,d6864963
...,...,...,...,...,...,...,...,...,...,...
39995,id_499,Milan,"['Italian', 'Vegetarian Friendly', 'Vegan Opti...",500.0,4.5,$$ - $$$,79.0,"[['The real Italian experience!', 'Wonderful f...",/Restaurant_Review-g187849-d2104414-Reviews-Ro...,d2104414
39996,id_6340,Paris,"['French', 'American', 'Bar', 'European', 'Veg...",6341.0,3.5,$$ - $$$,542.0,"[['Parisian atmosphere', 'Bit pricey but inter...",/Restaurant_Review-g187147-d1800036-Reviews-La...,d1800036
39997,id_1649,Stockholm,"['Japanese', 'Sushi']",1652.0,4.5,,4.0,"[['Good by swedish standards', 'A hidden jewel...",/Restaurant_Review-g189852-d947615-Reviews-Sus...,d947615
39998,id_640,Warsaw,"['Polish', 'European', 'Eastern European', 'Ce...",641.0,4.0,$$ - $$$,70.0,"[['Underground restaurant', 'Oldest Restaurant...",/Restaurant_Review-g274856-d1100838-Reviews-Ho...,d1100838


### PRACTICE

### 2.1 Task 1
Which columns do NOT contain missing (None) values?

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Restaurant_id      40000 non-null  object 
 1   City               40000 non-null  object 
 2   Cuisine_Style      30717 non-null  object 
 3   Ranking            40000 non-null  float64
 4   Rating             40000 non-null  float64
 5   Price_Range        26114 non-null  object 
 6   Number_of_Reviews  37457 non-null  float64
 7   Reviews            40000 non-null  object 
 8   URL_TA             40000 non-null  object 
 9   ID_TA              40000 non-null  object 
dtypes: float64(3), object(7)
memory usage: 3.1+ MB


In [25]:
df.columns[df.isna().sum() == 0]

Index(['Restaurant_id', 'City', 'Ranking', 'Rating', 'Reviews', 'URL_TA',
       'ID_TA'],
      dtype='object')

### 2.2 Task 2
Which columns store data in a numeric format?  
Answer:  
- Ranking 
- Rating   
- Number of Reviews

In [26]:
df.columns[df.dtypes != object]  # the np.object alias was removed from recent NumPy releases

Index(['Ranking', 'Rating', 'Number_of_Reviews'], dtype='object')
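
An alternative to comparing dtypes by hand is `DataFrame.select_dtypes`. The sketch below uses a small toy frame with made-up values (the name `toy` is hypothetical and not part of the notebook) to show the idea:

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the dataset's mix of column types
toy = pd.DataFrame({
    "City": ["Paris", "Berlin"],
    "Ranking": [5570.0, 3458.0],
    "Rating": [3.5, 5.0],
})

# Keep only the columns with a numeric dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['Ranking', 'Rating']
```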

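### 2.3 Task 3
In which columns are the stored values lists?

In [17]:
# print the name of every column whose first-row value is a Python list
for col, val in df.iloc[0].items():
    if isinstance(val, list): print(col)


Answer:  
   - None of them (values that look like lists, as in Cuisine Style and Reviews, are stored as strings)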
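
Values read from a CSV file arrive as strings even when they look like Python lists. A minimal sketch, assuming the value is a valid Python literal (the sample string is copied from the first row of the table above), of how such a string could be turned into a real list:

```python
import ast

raw = "['European', 'French', 'International']"
print(type(raw).__name__)  # str

# literal_eval safely parses Python literals (lists, strings, numbers, ...)
parsed = ast.literal_eval(raw)
print(type(parsed).__name__, len(parsed))  # list 3
```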

In [37]:
# wrap the model-building code in a function, since we will need to repeat it
def first_model(df):
    # Split the DataFrame into the parts needed to train and test the model
    # X holds the restaurant features, y is the target variable (restaurant ratings)
    X = df.drop(['Rating'], axis = 1)  
    y = df['Rating']  
    
    # The datasets labelled "train" are used to train the model,
    # those labelled "test" are used to test it.
    # 25% of the original dataset is used for testing.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)  

    # Create the model
    regr = RandomForestRegressor(n_estimators=100)  
    
    # Train the model on the training set
    regr.fit(X_train, y_train)  
    
    # Use the trained model to predict restaurant ratings for the test set.
    # The predicted values are stored in y_pred
    y_pred = regr.predict(X_test) 

    # Compare the predicted values (y_pred) with the actual ones (y_test)
    # and see how much they differ on average.
    # The metric is called Mean Absolute Error (MAE) and shows
    # the average deviation of the predicted values from the actual ones.
    print('MAE:', metrics.mean_absolute_error(y_test, y_pred)) 

In [27]:
# run the function that builds the model and evaluates the MAE
first_model(df)

NameError: name 'first_model' is not defined

### 3.1 Question to think about  
Why did a ValueError occur while training the model?  

***Answer:*** The error occurred while processing a column with string values. Therefore, all data passed to the model for training must be numeric. 
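
The failure can be reproduced without a model. The sketch below assumes that the estimator converts the feature table to a floating-point array before fitting, which is an assumption about its internals rather than something shown in this notebook; the toy values are made up:

```python
import numpy as np

# Toy feature table (hypothetical values): one string column, one numeric column
features = np.array([["Paris", 5570.0], ["Berlin", 3458.0]], dtype=object)

try:
    features.astype(np.float64)  # strings such as 'Paris' cannot become floats
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Any string column left in the feature table triggers the same kind of error, which is why such columns are either encoded as numbers or dropped before training.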


In [29]:
# fill in the gaps
# first fill the numeric column with its mean value
df['Number_of_Reviews'] = df['Number_of_Reviews'].fillna(df['Number_of_Reviews'].mean())

In [30]:
# check that the gaps are filled
df['Number_of_Reviews'].isnull().sum()

0

In [31]:
# now fill in Price Range
# first replace all values with numeric ones
dic_value_Price = {'$':1,'$$ - $$$':2,'$$$$':3}
df['Price_Range']=df['Price_Range'].map(lambda x: dic_value_Price.get(x,x))
# then fill the gaps with the most frequent value
df['Price_Range'] = df['Price_Range'].fillna(df["Price_Range"].value_counts().idxmax())

In [32]:
# check that the gaps are filled
df['Price_Range'].isnull().sum()

0
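
The price-range step is an ordinal encoding followed by mode imputation. A compact sketch on a toy Series with made-up values (the names `price`, `price_levels`, `encoded` and `filled` are hypothetical) shows what happens to each value:

```python
import pandas as pd

# Toy price ranges (hypothetical values) with two gaps
price = pd.Series(["$$ - $$$", None, "$$$$", "$", "$$ - $$$", None])

# Ordinal encoding: cheaper levels get smaller numbers
price_levels = {"$": 1, "$$ - $$$": 2, "$$$$": 3}
encoded = price.map(price_levels)  # missing values become NaN

# Fill the gaps with the most frequent level
filled = encoded.fillna(encoded.mode()[0])
print(filled.tolist())  # [2.0, 2.0, 3.0, 1.0, 2.0, 2.0]
```

Passing the dictionary directly to `map` turns any label that is not in the dictionary into NaN, whereas `dic_value_Price.get(x, x)` keeps unknown labels unchanged; which behaviour is preferable depends on whether unexpected labels should be imputed or inspected.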

In [33]:
# convert City to numbers with the get_dummies function
df_city_dummies = pd.get_dummies(df['City'])
df = pd.concat([df,df_city_dummies], axis=1)
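
To make the one-hot step concrete, here is a small sketch on a toy `City` column (made-up values). Recent pandas versions return boolean indicator columns by default, so `dtype=int` is passed here to obtain the 0/1 columns shown in the table below:

```python
import pandas as pd

city = pd.Series(["Paris", "Berlin", "Paris"], name="City")

# One indicator column per distinct city; dtype=int yields 0/1 instead of booleans
dummies = pd.get_dummies(city, dtype=int)
print(dummies.columns.tolist())   # ['Berlin', 'Paris']
print(dummies["Paris"].tolist())  # [1, 0, 1]
```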

In [34]:
# at this practice stage, drop all columns
# that are hard to fill or convert to numbers
df.drop(['Restaurant_id', 'City', 'Cuisine_Style', 'Reviews', 'URL_TA', 'ID_TA'], axis=1, inplace=True)

In [35]:
df

Unnamed: 0,Ranking,Rating,Price_Range,Number_of_Reviews,Amsterdam,Athens,Barcelona,Berlin,Bratislava,Brussels,...,Munich,Oporto,Oslo,Paris,Prague,Rome,Stockholm,Vienna,Warsaw,Zurich
0,5570.0,3.5,2.0,194.0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,1537.0,4.0,2.0,10.0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,353.0,4.5,3.0,688.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3458.0,5.0,2.0,3.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,621.0,4.0,2.0,84.0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,500.0,4.5,2.0,79.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39996,6341.0,3.5,2.0,542.0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
39997,1652.0,4.5,2.0,4.0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
39998,641.0,4.0,2.0,70.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [38]:
# rebuild the model
first_model(df)

MAE: 0.21757349999999998
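
MAE is the mean of the absolute differences between actual and predicted values. A hand-computed sketch on four made-up ratings (the arrays are illustrative, not taken from the model):

```python
import numpy as np

y_true = np.array([4.0, 3.5, 5.0, 4.5])  # actual ratings (toy values)
y_pred = np.array([4.2, 3.0, 4.5, 4.5])  # predicted ratings (toy values)

# Absolute errors are about 0.2, 0.5, 0.5 and 0.0, so the mean is about 0.3
mae = np.mean(np.abs(y_true - y_pred))
print(round(float(mae), 3))
```

Because `first_model` does not fix a random seed, the reported MAE changes slightly from run to run. Passing the same `random_state` to both `train_test_split` and `RandomForestRegressor` would make comparisons between feature sets repeatable.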


### 4.1 Questions about prices
1. How many distinct non-empty values occur in the Price Range column?  
Answer: 3

2. How is the lowest price level denoted in the DataFrame?  
Answer: `$`

3. How is the highest price level denoted in the DataFrame?  
Answer: `$$$$`

4. How many restaurants belong to the mid-range price segment?

In [40]:
# before solving the next tasks, reload the dataset,
# because we have already modified the original one
df = pd.read_csv('main_task.csv')

temp = len(df[df['Price Range'] == '$$ - $$$'])
print(round(temp,0))

18412


### 4.2 Question about cities
How many cities are represented in the dataset?

In [43]:
temp = df['City'].nunique()
temp

31

### 4.3 Questions about cuisines
1. How many cuisine types are represented in the dataset?  
2. Which cuisine is offered by the largest number of restaurants? Enter the cuisine name without quotation marks or apostrophes. 
3. What is the average number of cuisines offered per restaurant? If cuisine information is missing, assume that the restaurant offers only one cuisine type. Round the answer to one decimal place.

In [45]:
# fill the gaps in Cuisine Style
# at the practice stage, simply assign a placeholder "Other" cuisine
df['Cuisine Style'] = df['Cuisine Style'].fillna("['Other']")
df['Cuisine Style'] = df['Cuisine Style'].str.findall(r"'(\b.*?\b)'") 

# build a list of cuisine lists
temp_list = df['Cuisine Style'].tolist()

# a list of lists is awkward to work with,
# so write a function that flattens it
# (this approach is taken from Task 1 because it is fairly fast)
def list_unrar(list_of_lists):
    result=[]
    for lst in list_of_lists:
        result.extend(lst)
    return result

# load the flattened list into a Counter
temp_counter=Counter(list_unrar(temp_list))
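
The same flatten-and-count idea can be tried on a few strings using only the standard library. This is a sketch with made-up values; `itertools.chain.from_iterable` is swapped in here for the hand-written flattening function:

```python
import re
from collections import Counter
from itertools import chain

# Toy cuisine strings (hypothetical values) in the same format as the dataset
raw_styles = [
    "['European', 'French']",
    "['Other']",
    "['French', 'Vegetarian Friendly']",
]

# Extract the quoted names from each string, then count them across all rows
parsed = [re.findall(r"'(.*?)'", s) for s in raw_styles]
counts = Counter(chain.from_iterable(parsed))
print(counts.most_common(1))  # [('French', 2)]
```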

In [50]:
result1 = len(temp_counter) - 1  # subtract our one placeholder "Other" cuisine
result1

125

In [48]:
result2 = temp_counter.most_common()[0][0]
result2

'Vegetarian Friendly'

In [51]:
# build an analogue of get_dummies
for cuisine in temp_counter:
    df[cuisine] = df['Cuisine Style'].apply(lambda x: 1 if cuisine in x else 0 ).astype('float64')
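
Adding more than a hundred columns one at a time in a loop works, but it may be slow and can trigger a pandas fragmentation warning. A possible alternative, sketched on toy data with made-up values, is to explode the lists and build all indicator columns in a single call:

```python
import pandas as pd

# Toy column of cuisine lists (hypothetical values)
styles = pd.Series([["European", "French"], ["Other"], ["French", "Sushi"]])

# explode() yields one row per cuisine while repeating the original index;
# get_dummies() encodes each cuisine, and max() per index merges rows back
indicators = pd.get_dummies(styles.explode(), dtype=int).groupby(level=0).max()
print(indicators.columns.tolist())  # ['European', 'French', 'Other', 'Sushi']
print(indicators.loc[2].tolist())   # [0, 1, 0, 1]
```

The resulting frame could then be joined to the original one with a single `pd.concat(..., axis=1)`.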

In [52]:
# to avoid dropping the column and creating a new one,
# store the number of cuisines in the existing column
df['Cuisine Style'] = df['Cuisine Style'].apply(lambda x: len(x)).astype('float64')

In [54]:
result3 = df['Cuisine Style'].mean()
print(round(result3, 1))

2.6


### 4.4 Questions about reviews
1. When was the most recent review posted? Enter the answer in yyyy-mm-dd format.
2. What is the maximum number of days between the publication dates of the reviews shown on a restaurant's page? Enter the number of days as an integer.

In [55]:
# fill empty values in Reviews
df['Reviews'] = df['Reviews'].fillna("['no_Reviews']")
# find the dates and store them in a new column as lists of date strings
df['date_of_Review'] = df['Reviews'].str.findall(r'\d+/\d+/\d+')

In [57]:
# check the result
with pd.option_context('display.max_columns', None):
    display(df.head())

Unnamed: 0,Restaurant_id,City,Cuisine Style,Ranking,Rating,Price Range,Number of Reviews,Reviews,URL_TA,ID_TA,European,French,International,Other,Japanese,Sushi,Asian,Grill,Vegetarian Friendly,Vegan Options,Gluten Free Options,German,Central European,Italian,Pizza,Fast Food,Mediterranean,Spanish,Healthy,Cafe,Thai,Vietnamese,Bar,Pub,Chinese,British,Polish,Fusion,Dutch,Mexican,Venezuelan,South American,Soups,Belgian,Steakhouse,Latin,Barbecue,Argentinean,Irish,Seafood,Swiss,Portuguese,Contemporary,Wine Bar,Greek,Central American,Indian,Middle Eastern,Turkish,Hungarian,Pakistani,Peruvian,Delicatessen,Eastern European,Swedish,Scandinavian,Tibetan,Nepali,Korean,Southwestern,Czech,American,Slovenian,Balti,Street Food,Diner,Brew Pub,Caribbean,Austrian,Moroccan,Halal,Lebanese,Russian,African,Ethiopian,Egyptian,Danish,Brazilian,Ecuadorean,Israeli,Kosher,Gastropub,Australian,Singaporean,Malaysian,Minority Chinese,Scottish,Arabic,Ukrainian,Chilean,Mongolian,Cuban,Persian,Indonesian,Colombian,Jamaican,Norwegian,Hawaiian,Armenian,Taiwanese,Bangladeshi,Sri Lankan,Cambodian,Albanian,New Zealand,Croatian,Central Asian,Filipino,Tunisian,Cajun & Creole,Romanian,Georgian,Polynesian,Azerbaijani,Caucasian,Afghani,Uzbek,Salvadoran,Yunnan,Native American,Canadian,Xinjiang,Burmese,Fujian,Welsh,Latvian,date_of_Review
0,id_5569,Paris,3.0,5570.0,3.5,$$ - $$$,194.0,"[['Good food at your doorstep', 'A good hotel ...",/Restaurant_Review-g187147-d1912643-Reviews-R_...,d1912643,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[12/31/2017, 11/20/2017]"
1,id_1535,Stockholm,1.0,1537.0,4.0,,10.0,"[['Unique cuisine', 'Delicious Nepalese food']...",/Restaurant_Review-g189852-d7992032-Reviews-Bu...,d7992032,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[07/06/2017, 06/19/2016]"
2,id_352,London,7.0,353.0,4.5,$$$$,688.0,"[['Catch up with friends', 'Not exceptional'],...",/Restaurant_Review-g186338-d8632781-Reviews-RO...,d8632781,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[01/08/2018, 01/06/2018]"
3,id_3456,Berlin,1.0,3458.0,5.0,,3.0,"[[], []]",/Restaurant_Review-g187323-d1358776-Reviews-Es...,d1358776,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,[]
4,id_615,Munich,3.0,621.0,4.0,$$ - $$$,84.0,"[['Best place to try a Bavarian food', 'Nice b...",/Restaurant_Review-g187309-d6864963-Reviews-Au...,d6864963,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"[11/18/2017, 02/19/2017]"


In [58]:
# time from the current moment back to the most recent of the two reviews
def time_to_now(row):
    if not row['date_of_Review']:
        return None
    return pd.Timestamp.now() - pd.to_datetime(row['date_of_Review']).max()

# time between the two reviews
def time_between_Reviews(row):
    if not row['date_of_Review']:
        return None
    return pd.to_datetime(row['date_of_Review']).max() - pd.to_datetime(row['date_of_Review']).min()

# number of days from the most recent of the two reviews to the current moment
df['day_to_now'] = df.apply(time_to_now, axis = 1).dt.days

# number of days between the two reviews
df['day_between_Reviews'] = df.apply(time_between_Reviews, axis = 1).dt.days
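
A feature measured against the current moment changes every time the notebook is rerun. The sketch below, on toy review strings with made-up text, computes both day counts without a row-wise `apply` and against a fixed reference date. The names `reviews`, `reference_date`, `days_since_last` and `days_between` are hypothetical, the reference date is chosen arbitrarily, and the month/day/year format is an assumption based on the sample rows:

```python
import pandas as pd

# Toy Reviews strings (hypothetical text, dates in the dataset's format)
reviews = pd.Series([
    "[['Good food', 'Nice'], ['12/31/2017', '11/20/2017']]",
    "[[], []]",
    "[['Unique cuisine', 'Tasty'], ['07/06/2017', '06/19/2016']]",
])

# One row per date; rows without dates become NaT after parsing
dates = pd.to_datetime(
    reviews.str.findall(r"\d+/\d+/\d+").explode(), format="%m/%d/%Y"
)
per_row = dates.groupby(level=0)
latest, earliest = per_row.max(), per_row.min()

reference_date = pd.Timestamp("2018-03-01")  # assumed fixed reference date
days_since_last = (reference_date - latest).dt.days
days_between = (latest - earliest).dt.days

print(days_since_last.tolist())  # [60.0, nan, 238.0]
print(days_between.tolist())     # [41.0, nan, 382.0]
```

Rows without any review dates stay as NaN and can be filled afterwards in whatever way suits the model.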


In [59]:
temp1 = datetime.date.today() - datetime.timedelta(days=df['day_to_now'].min())
temp1

datetime.date(2018, 2, 26)

In [60]:
temp2 = df['day_between_Reviews'].max()
temp2

3207.0

In [61]:
# to look at the MAE again, apply the necessary gap filling and column removal

df['Number of Reviews'] = df['Number of Reviews'].fillna(df['Number of Reviews'].mean())

# now fill in Price Range
# first replace all values with numeric ones
dic_value_Price = {'$':1,'$$ - $$$':2,'$$$$':3}
df['Price Range']=df['Price Range'].map(lambda x: dic_value_Price.get(x,x))
# then fill the gaps with the most frequent value
df['Price Range'] = df['Price Range'].fillna(df["Price Range"].value_counts().idxmax())

# convert City to numbers with the get_dummies function
df_city_dummies = pd.get_dummies(df['City']).astype('float64')
df = pd.concat([df,df_city_dummies], axis=1)

# fill empty values in the day-count columns
df['day_to_now'] = df['day_to_now'].fillna(df['day_to_now'].max())
df['day_between_Reviews'] = df['day_between_Reviews'].fillna(0)

# drop everything that is no longer needed
df.drop(['Restaurant_id', 'City', 'Reviews', 'URL_TA', 'ID_TA', 'date_of_Review' ], axis=1, inplace=True)

In [62]:
# check the result
with pd.option_context('display.max_columns', None):
    display(df.describe())

Unnamed: 0,Cuisine Style,Ranking,Rating,Price Range,Number of Reviews,European,French,International,Other,Japanese,Sushi,Asian,Grill,Vegetarian Friendly,Vegan Options,Gluten Free Options,German,Central European,Italian,Pizza,Fast Food,Mediterranean,Spanish,Healthy,Cafe,Thai,Vietnamese,Bar,Pub,Chinese,British,Polish,Fusion,Dutch,Mexican,Venezuelan,South American,Soups,Belgian,Steakhouse,Latin,Barbecue,Argentinean,Irish,Seafood,Swiss,Portuguese,Contemporary,Wine Bar,Greek,Central American,Indian,Middle Eastern,Turkish,Hungarian,Pakistani,Peruvian,Delicatessen,Eastern European,Swedish,Scandinavian,Tibetan,Nepali,Korean,Southwestern,Czech,American,Slovenian,Balti,Street Food,Diner,Brew Pub,Caribbean,Austrian,Moroccan,Halal,Lebanese,Russian,African,Ethiopian,Egyptian,Danish,Brazilian,Ecuadorean,Israeli,Kosher,Gastropub,Australian,Singaporean,Malaysian,Minority Chinese,Scottish,Arabic,Ukrainian,Chilean,Mongolian,Cuban,Persian,Indonesian,Colombian,Jamaican,Norwegian,Hawaiian,Armenian,Taiwanese,Bangladeshi,Sri Lankan,Cambodian,Albanian,New Zealand,Croatian,Central Asian,Filipino,Tunisian,Cajun & Creole,Romanian,Georgian,Polynesian,Azerbaijani,Caucasian,Afghani,Uzbek,Salvadoran,Yunnan,Native American,Canadian,Xinjiang,Burmese,Fujian,Welsh,Latvian,day_to_now,day_between_Reviews,Amsterdam,Athens,Barcelona,Berlin,Bratislava,Brussels,Budapest,Copenhagen,Dublin,Edinburgh,Geneva,Hamburg,Helsinki,Krakow,Lisbon,Ljubljana,London,Luxembourg,Lyon,Madrid,Milan,Munich,Oporto,Oslo,Paris,Prague,Rome,Stockholm,Vienna,Warsaw,Zurich
count,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0,40000.0
mean,2.6224,3676.028525,3.993037,1.8786,124.82548,0.2515,0.07975,0.0396,0.232075,0.0366,0.0289,0.075275,0.01225,0.279725,0.11215,0.102825,0.01655,0.034825,0.1491,0.071225,0.042625,0.156925,0.06995,0.0155,0.058125,0.018575,0.012825,0.082425,0.061225,0.028625,0.039875,0.009125,0.014425,0.00735,0.011125,0.001,0.0093,0.01235,0.00675,0.014325,0.00755,0.013875,0.004225,0.0071,0.037625,0.00445,0.027675,0.013075,0.017425,0.0151,0.003375,0.026025,0.01955,0.0111,0.0088,0.003075,0.00235,0.0098,0.0124,0.004275,0.00855,0.0006,0.002075,0.004175,0.000425,0.014875,0.032875,0.0017,0.002075,0.006575,0.007375,0.005625,0.002325,0.0095,0.003525,0.01495,0.008225,0.00145,0.00385,0.00105,0.0005,0.004425,0.003175,0.0001,0.00185,0.0012,0.011775,0.0006,0.000475,0.000775,0.000275,0.004075,0.001275,0.00045,0.0002,0.00035,0.000575,0.001325,0.001475,0.0005,0.000625,0.001875,0.000525,0.00035,0.00055,0.0019,0.000625,0.0005,0.0002,0.00015,0.000675,0.0003,0.0003,0.000525,0.0005,0.00025,0.000425,5e-05,5e-05,0.0001,0.00055,0.000125,2.5e-05,2.5e-05,0.0001,0.000125,2.5e-05,2.5e-05,5e-05,5e-05,2.5e-05,2063.973675,102.645675,0.02715,0.0157,0.06835,0.053875,0.007525,0.0265,0.0204,0.016475,0.016825,0.0149,0.012025,0.023725,0.0094,0.011075,0.0325,0.004575,0.143925,0.00525,0.0223,0.0777,0.053325,0.022325,0.012825,0.009625,0.122425,0.036075,0.05195,0.0205,0.02915,0.018175,0.01345
std,1.817292,3708.749567,0.668417,0.421683,286.113292,0.433881,0.270909,0.19502,0.422162,0.18778,0.167528,0.263838,0.110001,0.44887,0.315555,0.303734,0.127579,0.183339,0.356191,0.257204,0.202013,0.363735,0.255066,0.123532,0.233983,0.13502,0.11252,0.275015,0.239746,0.166752,0.195668,0.095089,0.119236,0.085418,0.104888,0.031607,0.095988,0.110444,0.081882,0.118828,0.086563,0.116974,0.064863,0.083963,0.19029,0.066561,0.164042,0.113597,0.13085,0.121952,0.057997,0.159212,0.13845,0.104771,0.093396,0.055368,0.04842,0.09851,0.110664,0.065244,0.092071,0.024488,0.045505,0.06448,0.020611,0.121054,0.178312,0.041197,0.045505,0.08082,0.085562,0.07479,0.048163,0.097005,0.059268,0.121354,0.090319,0.038052,0.06193,0.032387,0.022355,0.066374,0.056258,0.01,0.042972,0.034621,0.107873,0.024488,0.02179,0.027828,0.016581,0.063706,0.035685,0.021209,0.014141,0.018705,0.023973,0.036377,0.038378,0.022355,0.024992,0.043261,0.022907,0.018705,0.023446,0.043548,0.024992,0.022355,0.014141,0.012247,0.025972,0.017318,0.017318,0.022907,0.022355,0.01581,0.020611,0.007071,0.007071,0.01,0.023446,0.01118,0.005,0.005,0.01,0.01118,0.005,0.005,0.007071,0.007071,0.005,1787.774983,198.594208,0.162522,0.124314,0.252349,0.225774,0.086421,0.160619,0.141366,0.127295,0.128617,0.121154,0.108999,0.152193,0.096498,0.104655,0.177326,0.067485,0.351018,0.072267,0.147659,0.267702,0.224684,0.14774,0.11252,0.097635,0.32778,0.186479,0.221929,0.141705,0.168229,0.133586,0.115193
min,1.0,1.0,1.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1003.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,973.0,3.5,2.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1094.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2.0,2285.0,4.0,2.0,38.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1206.0,28.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,4.0,5260.0,4.5,2.0,124.82548,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1674.0,120.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,21.0,16444.0,5.0,3.0,9660.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,6062.0,3207.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [63]:
first_model(df)

MAE: 0.21033949999999998
