<a href="https://colab.research.google.com/github/yandexdataschool/MLatImperial2021/blob/master/02_lab/kaggle_challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set up variables and download data

Register on [Kaggle](https://www.kaggle.com) and accept the [competition](https://www.kaggle.com/t/e69c3ea6e14d4b34b0cc608f80691676) rules.

Go to My Account and, under the API section, click **Create New API Token**.
Download the generated `kaggle.json`.

Upload this file to the root folder of your Google Drive.

Now execute the following cells. They mount Google Drive, install the `kaggle` package, and download the competition data to your drive.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
!mkdir -p /root/.kaggle
!cp /content/gdrive/My\ Drive/kaggle.json /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
!ls -l /root/.kaggle

In [None]:
DATA_PATH = "/content/gdrive/My Drive/mlimperial2021-predict-the-house-price"

In [None]:
ls /content/gdrive/My\ Drive/mlimperial2021-predict-the-house-price

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle

In [None]:
#!kaggle config set -n path -v /content
!kaggle competitions download -c mlimperial2021-predict-the-house-price -p '/content/gdrive/My Drive/mlimperial2021-predict-the-house-price'

In [None]:
!unzip -q /content/gdrive/My\ Drive/mlimperial2021-predict-the-house-price/mlimperial2021-predict-the-house-price.zip -d /content/gdrive/My\ Drive/mlimperial2021-predict-the-house-price/

In [None]:
ls /content/gdrive/My\ Drive/mlimperial2021-predict-the-house-price


### Metric

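For a regression task, the most common metric is mean squared error (MSE). However, sometimes it is better to use a logarithmic error, for example when the target spans several orders of magnitude and relative errors matter more than absolute ones. In this challenge we will use RMSLE, the root mean squared logarithmic error: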

$$
RMSLE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} [\log(y_i + 1) - \log(p_i + 1)]^2},
$$

where $y_i$ is the true value and $p_i$ is the predicted value.
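
As a minimal sketch of the formula above (the function name `rmsle` and the toy values are illustrative and not part of the competition code), the metric can be computed with NumPy like this:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(x + 1)
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Toy values, not taken from the competition data
y_true = np.array([100.0, 200.0, 300.0])
perfect = rmsle(y_true, y_true)          # identical predictions give zero error
# For p = 2y + 1 we get log(p + 1) - log(y + 1) = log(2) for every sample
doubled = rmsle(y_true, 2 * y_true + 1)
print(perfect, doubled)
```

Note that RMSLE is the ordinary RMSE computed on `log1p`-transformed values, which is why a model trained on `np.log1p(y)` can be scored with a plain RMSE.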

# Grading

Your task is to try as many of the techniques you have learned this week as possible.

The outcome of your work should be a small table of results with columns such as: method, parameters tuned with CV, score, and features created on top of the existing ones. The table should be accompanied by a short report on your workflow and reasoning. You also need to send the code.

Send the archive with the files to mlicl-2021-seminars@yandex.ru with the subject: Surname_name_kaggle_1

### The total number of points is 10. You will get additional points based on your final ranking

- 1 Point. Find correlated features in train.csv and macro.csv. Try to run linear regression after removing these features. What do you observe?
- 1 Point. Try various linear methods such as Ridge, ElasticNet and more. Grid search the parameters.
- 1 Point. Work with missing values. Try to impute them, remove them, or use a clustering technique to fill them in with the best value.
- 1 Point. Work with categorical features. Find them. Try to one-hot encode them. Does this improve your score? Try standard techniques for working with them, such as counting them or calculating their frequency or inverse frequency.
- 1 Point. Work with the timestamps. What information can you extract from them? Can you come up with some date-based features?
- 1 Point. Try to find badly defined features and outliers in the dataset. Remove them. Did it help?
- 1 Point. Try using PCA/SVD. Is it useful? Why?
- 1 Point. Create your own features and explain why you decided to create those particular ones. Did they improve the score?
- 1 Point. Apply decision tree, random forest, and boosting-based algorithms. Grid search the parameters.
- 1 Point. Estimate feature importances. Try to remove bad features. What difference did you notice compared with removing correlated features?
- 1 Point. Use stacking and blending of the models trained above. Does it improve your score?
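
To make two of the items above more concrete, here is a small sketch of date-based features and frequency encoding on a toy frame. The helper names `add_date_features` and `frequency_encode` and the `district` column are hypothetical; they only illustrate the idea and are not a reference solution.

```python
import pandas as pd

def add_date_features(df, column="timestamp"):
    """Return a copy of df with calendar features derived from a datetime column."""
    out = df.copy()
    out["year"] = out[column].dt.year
    out["month"] = out[column].dt.month
    out["day_of_week"] = out[column].dt.dayofweek  # Monday = 0
    return out

def frequency_encode(train_col, apply_col):
    """Map each category to its relative frequency in train_col.

    Categories that never occur in train_col are encoded as 0.
    """
    freq = train_col.value_counts(normalize=True)
    return apply_col.map(freq).fillna(0.0)

# Toy data; "district" is a made-up categorical column
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-01-06", "2014-02-15", "2015-03-01"]),
    "district": ["north", "south", "north"],
})
toy = add_date_features(toy)
toy["district_freq"] = frequency_encode(toy["district"], toy["district"])
print(toy)
```

Computing the frequencies on the training part only and then applying them to validation and test rows avoids leaking information from the held-out data.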

## Bonus

Beat the medium baseline and we will give you +3 points :)

# Baseline

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error,mean_squared_error
import os

In [None]:
data = pd.read_csv(os.path.join(DATA_PATH, 'X_train.csv'), parse_dates=['timestamp'])
test = pd.read_csv(os.path.join(DATA_PATH, 'X_test.csv'), parse_dates=['timestamp'])
macro = pd.read_csv(os.path.join(DATA_PATH, "macro.csv"), parse_dates=['timestamp'])

In [None]:
data.shape, test.shape, macro.shape

In [None]:
data.head()

In [None]:
macro.head()

As you can see, the timestamp is important here because it determines variables that change over time, for example GDP or the mortgage rate. Let's, for example, merge the training data with the macro data on the timestamp.
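
The following toy sketch shows what such a merge does; the column names `area` and `rate` are made up for illustration. It also compares `pd.merge_ordered`, which returns rows sorted by the key, with `DataFrame.merge(how='left')`, which keeps the row order of the left frame. The difference matters whenever something (a target, a list of ids) was split off before the merge.

```python
import pandas as pd

# Three toy "house" rows, deliberately not sorted by date
houses = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-03-01", "2014-01-01", "2014-02-01"]),
    "area": [50, 40, 60],
})
# One toy macro value per date
macro_toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-01-01", "2014-02-01", "2014-03-01"]),
    "rate": [1.0, 2.0, 3.0],
})

# A plain left merge keeps the original row order of `houses`
plain = houses.merge(macro_toy, on="timestamp", how="left")

# merge_ordered returns the rows sorted by timestamp
ordered = pd.merge_ordered(houses, macro_toy, on="timestamp", how="left")

print(plain)
print(ordered)
```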

In [None]:
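# merge_ordered returns rows sorted by timestamp, so take the target from the
# merged frame to keep features and target aligned row by row.
X_all = pd.merge_ordered(data, macro, on='timestamp', how='left')

y_train = X_all.pop("price_doc")
X_all.drop(['id'], axis=1, inplace=True)

# num_train = len(X_train)
# X_all = pd.concat([X_train, X_test])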


In [None]:
X_all.head()

A small hint: do we really need all 389 columns? What is the distribution of the target we are predicting?

For now, let's split the training set into train and validation parts and fit the simplest linear model on top of it. But before that we must get rid of NaNs, because not all algorithms can deal with them.
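
Filling with zeros is the quickest option, not necessarily the best one. As one possible alternative, the sketch below uses scikit-learn's `SimpleImputer` with the median strategy and an extra indicator column that marks which values were missing. The toy array is made up; in practice the imputer would be fitted on the training part only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with one missing value in the first column
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
])

# strategy="median" replaces NaN with the column median;
# add_indicator=True appends a 0/1 column for each feature that had missing values
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

The indicator column lets a model learn that "this value was missing" can itself be informative.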

In [None]:
X_all.fillna(0, inplace=True)

In [None]:
training_ind, validation_ind = train_test_split(range(len(X_all)), random_state=11, train_size=0.10)

In [None]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_all.iloc[training_ind])

# Error! Why?

Because the table contains categorical values. For now we will just drop them, but you should not! They might be important for the prediction.
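
As a hedged illustration of what keeping them could look like, the sketch below one-hot encodes a made-up categorical column `material` with `pd.get_dummies`. The toy frame and column names are assumptions, not columns taken from the competition data.

```python
import pandas as pd

toy = pd.DataFrame({
    "area": [40.0, 55.0, 70.0],
    "material": ["brick", "panel", "brick"],  # hypothetical categorical column
})

# Non-numeric columns are the candidates for encoding
# (in a real table this would also pick up datetime columns, so check the list)
categorical_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()

# One 0/1 column per category; dtype=float keeps the result ready for StandardScaler
encoded = pd.get_dummies(toy, columns=categorical_cols, dtype=float)
print(encoded)
```

When encoding train and test separately, the resulting columns may differ if some category appears in only one of them, so the column sets need to be aligned afterwards.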

In [None]:
# Drop text columns and the datetime column in one step; chaining avoids
# modifying a possible view of X_all in place.
df_numeric = X_all.select_dtypes(exclude=['object']).drop(columns=["timestamp"])

In [None]:
df_numeric.head()

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(df_numeric.iloc[training_ind])

In [None]:
predictor = Ridge()
predictor.fit(X_train, np.log1p(y_train[training_ind]))

X_test = scaler.transform(df_numeric.iloc[validation_ind])
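# RMSE on log1p targets is the RMSLE; np.sqrt avoids the `squared` argument,
# which was deprecated in scikit-learn 1.4.
np.sqrt(mean_squared_error(np.log1p(y_train[validation_ind]), predictor.predict(X_test)))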

In [None]:
%%time

param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10]}

gscv = GridSearchCV(predictor, param_grid, scoring='neg_root_mean_squared_error', cv=3, n_jobs=-1, verbose=1)
gscv.fit(X_train, np.log1p(y_train[training_ind]))

In [None]:
gscv.cv_results_
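
The raw `cv_results_` dictionary is hard to read. A common way to inspect it, sketched here on synthetic data with a separate `search` object so that it runs on its own, is to load it into a DataFrame and look at `best_params_` and `best_score_`. The data and the alpha grid below are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem, unrelated to the competition data
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 1, 100]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# One row per parameter setting, best setting first
columns = ["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_  # scores are negated errors, so flip the sign
print(results)
print(best_alpha, best_rmse)
```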

## Now we can fit on all the data and make a prediction

Refit the model with the best parameters found by the grid search.

In [None]:
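# Use the best alpha from the grid search instead of a hard-coded value
predictor = Ridge(alpha=gscv.best_params_['alpha'])
predictor.fit(X_train, np.log1p(y_train[training_ind]))

X_test = scaler.transform(df_numeric.iloc[validation_ind])
# np.sqrt of the MSE gives the RMSE without relying on the deprecated `squared` argument
np.sqrt(mean_squared_error(np.log1p(y_train[validation_ind]), predictor.predict(X_test)))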

# Make predictions on the test set



In [None]:
test.head()

In [None]:
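# Merge first and read the ids from the merged frame: merge_ordered returns rows
# sorted by timestamp, so ids taken before the merge could end up misaligned.
X_predict = pd.merge_ordered(test, macro, on='timestamp', how='left')
pred_ids = X_predict['id']
X_predict.fillna(0, inplace=True)
X_predict = X_predict[df_numeric.columns]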

In [None]:
predictions = np.expm1(predictor.predict(scaler.transform(X_predict)))
predictions = pd.DataFrame(predictions, columns=["price_doc"])
predictions = pd.concat([pred_ids, predictions], axis=1)

In [None]:
predictions.to_csv(os.path.join(DATA_PATH, "predictions.csv"), index=False)

In [None]:
!head -n 5 '/content/gdrive/My Drive/mlimperial2021-predict-the-house-price/predictions.csv'

# Let's use the Kaggle API again to submit the results


In [None]:
!kaggle competitions submit -c mlimperial2021-predict-the-house-price -f "{DATA_PATH}/predictions.csv" -m "Message"