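Note: the training set contains 3,000 movies with features such as "id, cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries". Remakes can share a title with the original film and should be treated as different movies. The test set contains 4,398 movies.

Here we add a further dataset of roughly 5,000 movies, also from TMDB, to improve prediction accuracy (the training set is really too small!).

Let's load the data and take a look.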

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import re

In [3]:
def loaddata(file, train=True):
    """Load a CSV, dropping the first column (id); for training data, split off the last column as the target."""
    data = pd.read_csv(file)
    if train:
        X_train = data[data.columns[1:-1]]
        y_train = data[data.columns[-1]]  # last column is the target (the revenue to predict)
        return X_train, y_train
    X_test = data[data.columns[1:]]
    return X_test

In [4]:
X_train_orignal, y_train_orignal = loaddata('./数据集/train.csv')
X_test = loaddata('./数据集/test.csv', train=False)

In [5]:
X_train_orignal.shape

(3000, 21)

In [6]:
X_test.shape

(4398, 21)

In [7]:
X_train_orignal.columns

Index(['belongs_to_collection', 'budget', 'genres', 'homepage', 'imdb_id',
       'original_language', 'original_title', 'overview', 'popularity',
       'poster_path', 'production_companies', 'production_countries',
       'release_date', 'runtime', 'spoken_languages', 'status', 'tagline',
       'title', 'keywords', 'cast', 'crew'],
      dtype='object')

In [8]:
len(X_train_orignal.columns)

21

In [9]:
X_train_orignal.head(1)

Unnamed: 0,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,...,production_countries,release_date,runtime,spoken_languages,status,tagline,title,keywords,cast,crew
0,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,/tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",2/20/15,93.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de..."


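There are 21 features in total. We go through them one by one, decide whether each is likely to help predict box-office revenue, and drop the ones that are not.

belongs_to_collection: The collection the movie belongs to. Judging by the sample row (a "Hot Tub Time Machine ..." collection), this is the series or franchise rather than a theme. We simply encode whether a collection exists: 1 if it does, 0 otherwise.

budget: The budget clearly affects revenue, since it influences production quality. We apply a log1p transform, i.e. log(1 + x), which compresses the wide range of values and is still defined when the budget is 0.

genres: The movie's genres, such as science fiction, action, or romance. Different audiences prefer different genres, so this should also affect revenue. Each genre carries an id that can serve as a category label. A movie may have several genres, so we split them apart and give each genre its own indicator column.

homepage: Whether the movie has a homepage: 1 if it does, 0 otherwise.

imdb_id: The id of the movie in the IMDb database. It does not seem relevant to the prediction, so we discard it for now.

original_language: The movie's original language (English, Korean, Chinese, and so on). This probably has some effect on the result, so we one-hot encode it.

original_title: The movie's original title. For movies from different countries it is hard to tell whether the title affects the prediction, although some viewers may well decide to watch a movie because of a good title. We considered looking for meaningful features in it, such as a time or a year, but gave up because such features are hard to find.

overview: A summary of the movie. This is also hard to represent. Some viewers read the synopsis before deciding whether to watch, but that varies from person to person, and a notion such as "interesting" is too subjective to pin down. We therefore only encode whether an overview exists: 1 if it does, 0 otherwise.

popularity: TMDB's popularity score (not a rating). It is more objective and easier to measure than the title or the overview. We apply log1p here as well.

poster_path: The movie's promotional poster. A real feature would have to capture whether the poster matches what one expects for that kind of movie, which is again subjective, so we only encode whether a poster exists: 1 if it does, 0 otherwise.

production_companies: The production companies. Each company has an id, which we use to tell them apart. Several companies may cooperate on one movie, so they are split and encoded separately.

production_countries: The production countries, handled in the same way as the production companies because international co-productions exist. Countries are identified by their ISO 3166-1 code rather than a numeric id.

release_date: The release date. Some people assume an old movie cannot be good and skip it, so how should this feature be defined? We bucket it into: before the 1980s, 1980 to 1990, 1990 to 2000, 2000 to 2010, and 2010 to now. We also record the season in which the movie was released, which should have some effect too.

runtime: The length of the movie. Some people lack the patience for long movies. We bucket it into: up to 1 hour, 1 to 1.5 hours, 1.5 to 2 hours, 2 to 3 hours, and more than 3 hours.

spoken_languages: The languages spoken in the movie, distinguished by their ISO 639-1 codes such as ko, zh, and en.

status: Whether the movie has been released: 1 if released, 0 otherwise.

tagline: The movie's promotional slogan. We encode whether a tagline exists: 1 if it does, 0 otherwise.

title: The title actually used for release. We use whether it differs from original_title as a feature.

keywords: Each keyword is encoded separately when a movie has several.

cast: The cast, i.e. the lineup of actors. Who appears in a movie matters a great deal to whether people go to see it, and therefore to revenue. We encode each actor id in cast, and also add the billing order, which may have an effect as well.

crew: Information about the crew. We likewise encode each crew member's id from crew, although the effect is probably small.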
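The list-valued columns (genres, production_companies, keywords, cast, crew, and so on) arrive as strings. As a minimal sketch, assuming the cells are either Python-literal style (single quotes, as in the training sample) or JSON (double quotes, as in the TMDB 5000 sample), such a cell could be parsed into real dictionaries instead of being pattern-matched. The helper name `parse_dict_list` is hypothetical and not part of the notebook.

```python
import ast
import json


def parse_dict_list(cell):
    """Parse a stringified list of dicts; return [] for missing or malformed cells."""
    if not isinstance(cell, str) or not cell.strip():
        return []  # NaN and empty cells carry no entries
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(cell)
        except (ValueError, SyntaxError):
            continue  # try the next parser
        return parsed if isinstance(parsed, list) else []
    return []


train_style = "[{'id': 35, 'name': 'Comedy'}]"
tmdb_style = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

genre_ids_train = [d['id'] for d in parse_dict_list(train_style)]
genre_ids_tmdb = [d['id'] for d in parse_dict_list(tmdb_style)]
missing = parse_dict_list(float('nan'))
```

Reading the `id` key from parsed dictionaries avoids picking up unrelated digits, for example digits inside a company name or a `credit_id` string.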

Now let's look at the additional data we need.

In [10]:
tmdb_5000_movies = pd.read_csv('./数据集/tmdb_5000_movies.csv')
tmdb_5000_credits = pd.read_csv('./数据集/tmdb_5000_credits.csv')

In [11]:
tmdb_5000_movies.shape

(4803, 19)

In [12]:
tmdb_5000_credits.shape

(4803, 4)

In [13]:
tmdb_5000 = pd.concat([tmdb_5000_movies, tmdb_5000_credits], axis=1)

In [14]:
tmdb_5000.shape

(4803, 23)
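Note that `pd.concat(..., axis=1)` pairs rows by position, so it is only correct if both files list the movies in the same order. A hedged alternative is to join on the movie id. The sketch below uses two toy frames (made-up rows, deliberately in different order) rather than the real files.

```python
import pandas as pd

# Toy stand-ins for the movies and credits tables; the rows are in different order on purpose.
movies = pd.DataFrame({'id': [19995, 285], 'budget': [237000000, 300000000]})
credits = pd.DataFrame({'movie_id': [285, 19995],
                        'cast': ['cast of 285', 'cast of 19995']})

# validate='one_to_one' raises if an id is duplicated on either side.
merged = movies.merge(credits, left_on='id', right_on='movie_id',
                      how='inner', validate='one_to_one')
```

With a keyed join, each movie keeps its own cast even when the two files are sorted differently.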

In [15]:
tmdb_5000.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'vote_average', 'vote_count',
       'movie_id', 'title', 'cast', 'crew'],
      dtype='object')

In [16]:
tmdb_5000.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,vote_average,vote_count,movie_id,title,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [17]:
len(tmdb_5000.columns)

23

In [18]:
matched_features = []
without_matched_features = []
for i in X_train_orignal.columns:
    i = i.lower()
    if i in tmdb_5000.columns:
        matched_features.append(i)
    else:
        without_matched_features.append(i)

In [19]:
matched_features

['budget',
 'genres',
 'homepage',
 'original_language',
 'original_title',
 'overview',
 'popularity',
 'production_companies',
 'production_countries',
 'release_date',
 'runtime',
 'spoken_languages',
 'status',
 'tagline',
 'title',
 'keywords',
 'cast',
 'crew']

In [None]:
len(matched_features)

18

In [None]:
X_train_5000_part1 = tmdb_5000[matched_features]
X_train_5000_part2 = pd.DataFrame(columns=without_matched_features)
X_train_5000 = pd.concat([X_train_5000_part1, X_train_5000_part2], axis=1)
X_train_5000 = X_train_5000[X_train_orignal.columns]
y_train_5000 = tmdb_5000['revenue']

In [None]:
X_train_5000.shape

(4803, 21)

In [None]:
X_train_5000.columns

Index(['belongs_to_collection', 'budget', 'genres', 'homepage', 'imdb_id',
       'original_language', 'original_title', 'overview', 'popularity',
       'poster_path', 'production_companies', 'production_countries',
       'release_date', 'runtime', 'spoken_languages', 'status', 'tagline',
       'title', 'keywords', 'cast', 'crew'],
      dtype='object')

In [None]:
X_train = pd.concat([X_train_orignal, X_train_5000], axis=0, ignore_index=True)
X = pd.concat([X_train, X_test], axis=0, ignore_index=True)
y_train = pd.concat([y_train_orignal, y_train_5000], axis=0, ignore_index=True)
y_train = np.log1p(y_train)

In [None]:
# 1 if the movie belongs to a collection, 0 otherwise
X['belongs_to_collection'] = X['belongs_to_collection'].notnull().astype(int)
# log(1 + budget): compresses the range and stays defined for a budget of 0
X['budget'] = np.log1p(X['budget'])
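As a small illustration of the log1p transform used for budget, popularity, and the target: `np.log1p(x)` computes log(1 + x), and `np.expm1` is its inverse. Because the target is trained on the log1p scale, a model's predictions would need `np.expm1` to get back to revenue. The values below are illustrative only.

```python
import numpy as np

budget = np.array([0.0, 14000000.0, 237000000.0])

logged = np.log1p(budget)     # log(1 + x); a budget of 0 maps to exactly 0
restored = np.expm1(logged)   # exp(x) - 1 undoes the transform
```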

In [None]:
# Capture only the number that follows an 'id' key. Both quote styles are accepted because
# train/test use single quotes while the TMDB 5000 file uses JSON double quotes.
genres_find_id = X['genres'].fillna('').str.findall(r"""['"]id['"]:\s*(\d+)""")
genres_id_lst = sorted({int(j) for ids in genres_find_id for j in ids})
genres_features = pd.DataFrame(np.zeros((len(X), len(genres_id_lst)), dtype='int'), columns=['genres_' + str(i) for i in genres_id_lst])
for i, ids in enumerate(genres_find_id):
    for j in ids:
        # .at[row, col] writes into the frame; chained .loc[i][col] = 1 may write to a copy
        genres_features.at[i, 'genres_' + str(int(j))] = 1
X.drop('genres', axis=1, inplace=True)
X = pd.concat([X, genres_features], axis=1)
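The multi-hot encoding built by the loops in this cell can also be sketched without explicit loops. This is an alternative formulation on toy data, not the notebook's own method: `explode` gives one row per (movie, genre id), `get_dummies` one-hot encodes those rows, and `groupby(level=0).max()` folds them back into one row per movie.

```python
import pandas as pd

# One list of genre ids (as strings, as str.findall would return them) per movie;
# the third movie has no genres.
genre_ids = pd.Series([['35'], ['28', '12'], []])

exploded = genre_ids.explode()          # an empty list becomes a single NaN row
multi_hot = (pd.get_dummies(exploded, prefix='genres', dtype=int)
             .groupby(level=0).max())   # collapse back to one row per movie
```

For columns with very many distinct ids (keywords, cast, crew), the dense result can become large, so this form is best suited to small vocabularies such as genres.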

In [None]:
# Presence indicators: 1 if the value exists, 0 otherwise
X['homepage'] = X['homepage'].notnull().astype(int)
# imdb_id is only an identifier (and is absent from the TMDB 5000 rows), so it is dropped
X.drop('imdb_id', axis=1, inplace=True)

original_language = pd.get_dummies(X['original_language'], prefix='original_language', dtype=int)
X.drop('original_language', axis=1, inplace=True)
X = pd.concat([X, original_language], axis=1)

X['overview'] = X['overview'].notnull().astype(int)
X['popularity'] = np.log1p(X['popularity'])
X['poster_path'] = X['poster_path'].notnull().astype(int)

In [None]:
# Capture only the number after an 'id' key; a bare digit pattern would also pick up
# digits that appear inside company names. Both quote styles are accepted.
production_companies_find_id = X['production_companies'].fillna('').str.findall(r"""['"]id['"]:\s*(\d+)""")
production_companies_id_lst = sorted({int(j) for ids in production_companies_find_id for j in ids})
production_companies_features = pd.DataFrame(np.zeros((len(X), len(production_companies_id_lst)), dtype='int'), columns=['production_companies_' + str(i) for i in production_companies_id_lst])
for i, ids in enumerate(production_companies_find_id):
    for j in ids:
        production_companies_features.at[i, 'production_companies_' + str(int(j))] = 1
X.drop('production_companies', axis=1, inplace=True)
X = pd.concat([X, production_companies_features], axis=1)

In [None]:
# Countries have no numeric id; a digit pattern would only match the "3166" and "1" in the
# key name 'iso_3166_1'. Capture the two-letter ISO 3166-1 code instead (both quote styles).
production_countries_find_id = X['production_countries'].fillna('').str.findall(r"""['"]iso_3166_1['"]:\s*['"]([A-Z]{2})['"]""")
production_countries_id_lst = sorted({j for codes in production_countries_find_id for j in codes})
production_countries_features = pd.DataFrame(np.zeros((len(X), len(production_countries_id_lst)), dtype='int'), columns=['production_countries_' + i for i in production_countries_id_lst])
for i, codes in enumerate(production_countries_find_id):
    for j in codes:
        production_countries_features.at[i, 'production_countries_' + j] = 1
X.drop('production_countries', axis=1, inplace=True)
X = pd.concat([X, production_countries_features], axis=1)

In [None]:
# train/test dates look like '2/20/15', while the TMDB 5000 file uses ISO dates (YYYY-MM-DD);
# parse each format separately and combine the results
raw_date = X['release_date'].fillna('2014-06-01')
release_date = pd.to_datetime(raw_date, format='%Y-%m-%d', errors='coerce')
release_date = release_date.fillna(pd.to_datetime(raw_date, format='%m/%d/%y', errors='coerce'))
X['release_year'] = release_date.dt.year
X['release_month'] = release_date.dt.month
X.drop('release_date', axis=1, inplace=True)

# '%y' maps 00-68 to 2000-2068, so an old film can land in the future; move those back a century
current_year = int(time.strftime('%Y'))
X.loc[X['release_year'] > current_year, 'release_year'] -= 100

X.loc[X['release_year'] < 1980, 'release_time'] = 'before 80'
X.loc[(X['release_year'] >= 1980) & (X['release_year'] < 1990), 'release_time'] = '80 to 90'
X.loc[(X['release_year'] >= 1990) & (X['release_year'] < 2000), 'release_time'] = '90 to 00'
X.loc[(X['release_year'] >= 2000) & (X['release_year'] < 2010), 'release_time'] = '00 to 10'
X.loc[X['release_year'] >= 2010, 'release_time'] = '10 to now'
X.drop('release_year', axis=1, inplace=True)

X.loc[X['release_month'].isin([12, 1, 2]), 'release_season'] = 'winter'
X.loc[X['release_month'].isin([3, 4, 5]), 'release_season'] = 'spring'
X.loc[X['release_month'].isin([6, 7, 8]), 'release_season'] = 'summer'
X.loc[X['release_month'].isin([9, 10, 11]), 'release_season'] = 'autumn'
X.drop('release_month', axis=1, inplace=True)

release_time = pd.get_dummies(X['release_time'], prefix='release_time', dtype=int)
X.drop('release_time', axis=1, inplace=True)
X = pd.concat([X, release_time], axis=1)

release_season = pd.get_dummies(X['release_season'], prefix='release_season', dtype=int)
X.drop('release_season', axis=1, inplace=True)
X = pd.concat([X, release_season], axis=1)
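One pitfall with dates such as '2/20/15' is the two-digit year: `%y` follows the POSIX convention, so '65' is read as 2065, not 1965. The sketch below shows one possible way to handle both date formats and the century problem. The sample strings and the cutoff `reference_year` are assumptions chosen for illustration.

```python
import pandas as pd

raw = pd.Series(['2/20/15', '5/1/65', '2009-12-10'])

# Parse each format on its own; strings in the other format become NaT and are filled in.
iso = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
short = pd.to_datetime(raw, format='%m/%d/%y', errors='coerce')
parsed = iso.fillna(short)

year = parsed.dt.year
reference_year = 2019  # assumed cutoff: no release in this toy data is later than this
# Years beyond the cutoff can only come from a misread two-digit year, so shift them back.
year = year.where(year <= reference_year, year - 100)
```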

In [None]:
X.loc[X['runtime'] <= 60, 'runtime_interval'] = 'within one hour'
X.loc[(X['runtime'] > 60) & (X['runtime'] <= 90), 'runtime_interval'] = 'within one and a half hour'
X.loc[(X['runtime'] > 90) & (X['runtime'] <= 120), 'runtime_interval'] = 'within two hours'
X.loc[(X['runtime'] > 120) & (X['runtime'] <= 180), 'runtime_interval'] = 'within three hours'
X.loc[X['runtime'] > 180, 'runtime_interval'] = 'more than three hours'

X.drop('runtime', axis=1, inplace=True)
runtime_interval = pd.get_dummies(X['runtime_interval'], prefix='runtime_interval')
X.drop('runtime_interval', axis=1, inplace=True)
X = pd.concat([X, runtime_interval], axis=1)
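The same runtime buckets can be expressed with `pd.cut`, which makes the bin edges explicit in one place. This is a sketch on toy values, using the interval labels from the cell above. With the default `right=True`, each bin includes its upper edge, matching the `<=` comparisons.

```python
import numpy as np
import pandas as pd

runtime = pd.Series([45.0, 93.0, 162.0, 200.0, np.nan])
labels = ['within one hour', 'within one and a half hour', 'within two hours',
          'within three hours', 'more than three hours']

# Bins: (-inf, 60], (60, 90], (90, 120], (120, 180], (180, inf)
runtime_interval = pd.cut(runtime, bins=[-np.inf, 60, 90, 120, 180, np.inf], labels=labels)
```

A missing runtime stays missing, so it ends up with all-zero indicator columns after `pd.get_dummies`, the same as in the original cell.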

In [None]:
# Capture the two-letter ISO 639-1 code after the 'iso_639_1' key; both quote styles are
# accepted so that the JSON-formatted TMDB 5000 rows are matched as well.
spoken_languages_find_lang = X['spoken_languages'].fillna('').str.findall(r"""['"]iso_639_1['"]:\s*['"]([a-z]{2})['"]""")
spoken_languages_lang_lst = sorted({j for langs in spoken_languages_find_lang for j in langs})
spoken_languages_features = pd.DataFrame(np.zeros((len(X), len(spoken_languages_lang_lst)), dtype='int'), columns=['spoken_languages_' + i for i in spoken_languages_lang_lst])
for i, langs in enumerate(spoken_languages_find_lang):
    for j in langs:
        spoken_languages_features.at[i, 'spoken_languages_' + j] = 1
X.drop('spoken_languages', axis=1, inplace=True)
X = pd.concat([X, spoken_languages_features], axis=1)

In [None]:
# 1 only for released movies; any other status (or a missing one) is 0
X['status'] = (X['status'] == 'Released').astype(int)

X['tagline'] = X['tagline'].notnull().astype(int)

# 1 if the release title differs from the original title, 0 if they are the same
X['title_changed'] = (X['original_title'] != X['title']).astype(int)
X.drop(['original_title', 'title'], axis=1, inplace=True)

In [None]:
# Capture only the number after an 'id' key; a bare digit pattern would also pick up
# digits inside keyword names. Both quote styles are accepted.
keywords_find_id = X['keywords'].fillna('').str.findall(r"""['"]id['"]:\s*(\d+)""")
keywords_id_lst = sorted({int(j) for ids in keywords_find_id for j in ids})
keywords_features = pd.DataFrame(np.zeros((len(X), len(keywords_id_lst)), dtype='int'), columns=['keywords_' + str(i) for i in keywords_id_lst])
for i, ids in enumerate(keywords_find_id):
    for j in ids:
        keywords_features.at[i, 'keywords_' + str(int(j))] = 1
X.drop('keywords', axis=1, inplace=True)
X = pd.concat([X, keywords_features], axis=1)

In [None]:
# Actor ids: the quote before 'id' keeps 'cast_id' and 'credit_id' from matching.
# Both quote styles are accepted so the JSON-formatted TMDB 5000 rows are matched as well.
cast_find_actor_id = X['cast'].fillna('').str.findall(r"""['"]id['"]:\s*(\d+)""")
cast_actor_id = sorted({int(j) for ids in cast_find_actor_id for j in ids})
cast_actor_features = pd.DataFrame(np.zeros((len(X), len(cast_actor_id)), dtype='int'), columns=['cast_actor_' + str(i) for i in cast_actor_id])
for i, ids in enumerate(cast_find_actor_id):
    for j in ids:
        cast_actor_features.at[i, 'cast_actor_' + str(int(j))] = 1
X = pd.concat([X, cast_actor_features], axis=1)

# Billing order values that occur in each movie's cast list
cast_find_order_id = X['cast'].fillna('').str.findall(r"""['"]order['"]:\s*(\d+)""")
cast_order_id = sorted({int(j) for orders in cast_find_order_id for j in orders})
cast_order_features = pd.DataFrame(np.zeros((len(X), len(cast_order_id)), dtype='int'), columns=['cast_order_' + str(i) for i in cast_order_id])
for i, orders in enumerate(cast_find_order_id):
    for j in orders:
        cast_order_features.at[i, 'cast_order_' + str(int(j))] = 1
X = pd.concat([X, cast_order_features], axis=1)
X.drop('cast', axis=1, inplace=True)

In [None]:
# Capture only the number after an 'id' key; a bare digit pattern would also split the
# hexadecimal 'credit_id' strings into many bogus ids. Both quote styles are accepted.
crew_find_id = X['crew'].fillna('').str.findall(r"""['"]id['"]:\s*(\d+)""")
crew_id_lst = sorted({int(j) for ids in crew_find_id for j in ids})
crew_features = pd.DataFrame(np.zeros((len(X), len(crew_id_lst)), dtype='int'), columns=['crew_' + str(i) for i in crew_id_lst])
for i, ids in enumerate(crew_find_id):
    for j in ids:
        crew_features.at[i, 'crew_' + str(int(j))] = 1
X.drop('crew', axis=1, inplace=True)
X = pd.concat([X, crew_features], axis=1)

In [None]:
min_outlier_boundary = y_train.mean() - 3 * y_train.std()
max_outlier_boundary = y_train.mean() + 3 * y_train.std()
len(y_train[(y_train < min_outlier_boundary) | (y_train > max_outlier_boundary)])
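The cell above counts targets that lie more than three standard deviations from the mean. A toy illustration of that rule follows, assuming synthetic log-scale values plus two zeros (a revenue of 0 becomes 0 after log1p, far below typical values). Note that NumPy's `std` defaults to the population formula, whereas pandas uses the sample formula, so the boundaries differ very slightly between the two.

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 synthetic log-revenues around 16, plus two zeros standing in for revenue == 0
values = np.append(rng.normal(loc=16.0, scale=1.0, size=1000), [0.0, 0.0])

lower = values.mean() - 3 * values.std()
upper = values.mean() + 3 * values.std()
outlier_mask = (values < lower) | (values > upper)
```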

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_error

In [None]:
len(X_train), len(X_test)

In [None]:
X.head(1)

In [None]:
X_train = X[:len(X_train)]
X_test = X[len(X_train):]

In [None]:
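# Fix the seed so the split, and therefore every validation score below, is reproducible
Xtrain, Xval, ytrain, yval = train_test_split(X_train, y_train, test_size=0.25, random_state=42)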

In [None]:
lr = LinearRegression(n_jobs=-1)
lr.fit(Xtrain, ytrain)
ypred = lr.predict(Xval)
np.sqrt(mean_squared_error(yval, ypred))
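Because `y_train` was transformed with log1p, the RMSE computed here is a root mean squared logarithmic error (RMSLE) in terms of the original revenue. A small sketch of what that implies, with made-up revenue figures: a prediction that is off by a constant factor gives roughly the same error regardless of the size of the movie. The function name `rmsle` is introduced only for this illustration.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """RMSE between log1p-transformed values, i.e. RMSLE on the raw scale."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))


revenue_true = np.array([1e6, 5e7, 2e9])
revenue_pred = 2.0 * revenue_true  # every prediction is twice the true revenue

error = rmsle(revenue_true, revenue_pred)  # close to log(2), about 0.693
```

An error of about 0.69 on this scale therefore corresponds to predictions that are typically off by a factor of two.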

In [None]:
# random_state is a seed, not a hyperparameter, so it is fixed here instead of searched.
# The `iid` argument was removed from GridSearchCV in scikit-learn 0.24.
enet = ElasticNet(random_state=0)
gscv_enet = GridSearchCV(enet, {'alpha': [0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
                                'l1_ratio': [.1, .5, .7, .9, .95, .99, 1],
                                'max_iter': [100, 200, 500, 1000, 2000, 5000],
                                'tol': [1e-4, 1e-5, 1e-6],
                                'selection': ['random', 'cyclic']}, cv=10, n_jobs=-1)
gscv_enet.fit(Xtrain, ytrain)
# refit=True (the default) has already retrained the best configuration on all of Xtrain
enet = gscv_enet.best_estimator_
ypred = enet.predict(Xval)
np.sqrt(mean_squared_error(yval, ypred))
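A grid search fits one model per parameter combination per fold, so the cost is the product of all list lengths times the number of folds. The sketch below counts the fits for the six ElasticNet parameters with 10-fold cross-validation, once with 501 seeds treated as a searched parameter and once with the seed fixed. The grid sizes are taken from the lists used in this notebook.

```python
from math import prod

# Number of candidate values per parameter
grid_sizes = {'alpha': 6, 'l1_ratio': 7, 'max_iter': 6, 'tol': 3, 'selection': 2}
n_folds = 10

candidates_fixed_seed = prod(grid_sizes.values())
fits_fixed_seed = candidates_fixed_seed * n_folds

# Searching over 501 random_state values multiplies the work by 501
candidates_with_seeds = candidates_fixed_seed * 501
fits_with_seeds = candidates_with_seeds * n_folds
```

Here the fixed-seed search needs 15,120 fits, while the search that also sweeps the seed needs 7,575,120, which is why treating the seed as a hyperparameter is usually avoided.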

In [None]:
# SVR has no random_state parameter, and `iid` was removed from GridSearchCV in scikit-learn 0.24
svm_r = SVR()
gscv_svr = GridSearchCV(svm_r, {'C': [0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
                                'kernel': ['linear', 'rbf'],
                                'tol': [1e-3, 1e-4, 1e-5, 1e-6]}, cv=10, n_jobs=-1)
gscv_svr.fit(Xtrain, ytrain)
# refit=True (the default) has already retrained the best configuration on all of Xtrain
svm_r = gscv_svr.best_estimator_
ypred = svm_r.predict(Xval)
np.sqrt(mean_squared_error(yval, ypred))

In [None]:
# The seed is fixed rather than searched; `iid` was removed from GridSearchCV in scikit-learn 0.24
rfr = RandomForestRegressor(random_state=0)
gscv_rfr = GridSearchCV(rfr, {'n_estimators': [50, 80, 100, 200],
                              'max_depth': [3, 10, 30, 50],
                              'min_samples_split': [2, 5, 10],
                              'min_samples_leaf': [1, 2, 5]}, cv=10, n_jobs=-1)
gscv_rfr.fit(Xtrain, ytrain)
# refit=True (the default) has already retrained the best configuration on all of Xtrain
rfr = gscv_rfr.best_estimator_
ypred = rfr.predict(Xval)
np.sqrt(mean_squared_error(yval, ypred))

In [None]:
# The seed is fixed rather than searched; `iid` was removed from GridSearchCV in scikit-learn 0.24
gbr = GradientBoostingRegressor(random_state=0)
gscv_gbr = GridSearchCV(gbr, {'n_estimators': [50, 80, 100, 200],
                              'max_depth': [3, 5, 10, 30, 50],
                              'learning_rate': [0.05, 0.1, 0.2],
                              'min_samples_split': [2, 5, 10],
                              'min_samples_leaf': [1, 2, 5],
                              'subsample': [0.6, 0.8, 1.0]}, cv=10, n_jobs=-1)
gscv_gbr.fit(Xtrain, ytrain)
# loss='exponential' is only valid for the classifier, so the regressor keeps its default
# squared-error loss, which is also the loss the grid search evaluated
gbr = gscv_gbr.best_estimator_
ypred = gbr.predict(Xval)
np.sqrt(mean_squared_error(yval, ypred))

In [None]:
# The seed is fixed rather than searched; `iid` was removed from GridSearchCV in scikit-learn 0.24
xgbr = xgb.XGBRegressor(random_state=0)
gscv_xgbr = GridSearchCV(xgbr, {'n_estimators': [50, 80, 100, 200],
                                'max_depth': [3, 5, 10, 30, 50],
                                'learning_rate': [0.05, 0.1, 0.2],
                                'min_child_weight': [3, 5, 7, 9],
                                'subsample': [0.6, 0.8, 1.0],
                                # the original list ended in 0.1, presumably a typo for 1.0
                                'colsample_bytree': [0.6, 0.8, 1.0],
                                'reg_lambda': [0.01, 0.05, 0.1, 0.5, 1.0],
                                'reg_alpha': [0, 0.1, 0.5, 1.0]}, cv=10, n_jobs=-1)
gscv_xgbr.fit(Xtrain, ytrain)
# refit=True (the default) has already retrained the best configuration on all of Xtrain
xgbr = gscv_xgbr.best_estimator_
ypred = xgbr.predict(Xval)
np.sqrt(mean_squared_error(yval, ypred))

In [None]:
from keras import layers, models, callbacks

In [None]:
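# Keras expects numeric arrays; cast explicitly in case any column was left as object dtype
Xtrain = np.array(Xtrain, dtype='float32')
Xval = np.array(Xval, dtype='float32')
X_test = np.array(X_test, dtype='float32')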

In [None]:
min_val_mae = []
params = []

early_stopping = callbacks.EarlyStopping(patience=50) 

for hidden_size1 in [64, 128, 256, 512]:
    for hidden_size2 in [64, 128, 256, 512]:
        for activation in ['elu', 'selu', 'relu']:
            for epochs in [10, 50, 100, 200]:
                for batch_size in [100, 200, 500, 1000]:
                    model = models.Sequential()
                    model.add(layers.Dense(hidden_size1, activation=activation))
                    model.add(layers.Dense(hidden_size2, activation=activation))
                    model.add(layers.Dense(1))

                    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
                    history = model.fit(Xtrain, ytrain, epochs=epochs, batch_size=batch_size, validation_data=(Xval, yval), verbose=0, callbacks=[early_stopping])
                    
                    min_val_mae.append(min(history.history['val_mae']))
                    params.append([hidden_size1, hidden_size2, activation, epochs, batch_size])

In [None]:
best_params = params[np.argmin(min_val_mae)]

model = models.Sequential()
model.add(layers.Dense(best_params[0], activation=best_params[2]))
model.add(layers.Dense(best_params[1], activation=best_params[2]))
model.add(layers.Dense(1))

model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.fit(Xtrain, ytrain, epochs=best_params[3], batch_size=best_params[4], validation_data=(Xval, yval), callbacks=[early_stopping])

ypred = model.predict(Xval)
np.sqrt(mean_squared_error(yval, ypred))