# Project Overview
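- Course : 2022 K-Digital Training program - Big Data Analysis Specialist Course Using an AI Data Platform
- Subjects : Big Data Analysis and Visualization, AI Development Basics, Artificial Intelligence Programming
- Project topic : Demand forecasting competition using data from the Kaggle competition Bike Sharing Demand
- Project deadline : Tuesday, July 19, 2022
- Student : 홍승기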

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Outline
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

# Reference
- https://www.kaggle.com/code/viveksrinivasan/eda-ensemble-model-top-10-percentile

# Data Fields
- datetime - hourly date + timestamp  
- season -  1 = spring, 2 = summer, 3 = fall, 4 = winter 
- holiday - whether the day is considered a holiday
- workingday - whether the day is neither a weekend nor holiday
- weather
    + 1: Clear, Few clouds, Partly cloudy
    + 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    + 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    + 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals

# Chapter 01. Load Required Libraries 

In [None]:
import pandas as pd   # Data processing
import numpy as np    # Numerical operation
import matplotlib as mpl           # Data visualization
import matplotlib.pyplot as plt    # Data visualization
import seaborn as sns              # Data visualization
import sklearn   # Machine Learning
import warnings
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline


# Version check
print("pandas version : ", pd.__version__)
print("numpy version : ", np.__version__)
print("matplotlib version : ", mpl.__version__)
print("seaborn version : ", sns.__version__)
print("sklearn version : ", sklearn.__version__)

# Chapter 02. Read The Datasets

In [None]:
DATA_PATH = '/kaggle/input/bike-sharing-demand/'
train_data = pd.read_csv(DATA_PATH + 'train.csv')   # Train dataset
test_data = pd.read_csv(DATA_PATH + 'test.csv')     # Test dataset
submission_data = pd.read_csv(DATA_PATH + 'sampleSubmission.csv')   # submission dataset

train_data.shape, test_data.shape, submission_data.shape

## Sample of First Five Rows

In [None]:
print(train_data.head())
print(test_data.head())

Besides the "count" column, the training dataset contains two columns, "casual" and "registered", that do not exist in the test dataset. Our goal is to predict the "count" column, so we will drop the "casual" and "registered" columns from the training dataset.
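
The column difference described above can be checked programmatically. The following is a minimal sketch on toy frames (`train_toy` and `test_toy` are hypothetical stand-ins with only a few of the real columns), not the notebook's own code:

```python
import pandas as pd

# Toy frames that mimic the column layout described above (illustrative values).
train_toy = pd.DataFrame({"datetime": ["2011-01-01 00:00:00"], "temp": [9.84],
                          "casual": [3], "registered": [13], "count": [16]})
test_toy = pd.DataFrame({"datetime": ["2011-01-20 00:00:00"], "temp": [10.66]})

# Columns present in train but absent from test, kept in train's column order.
train_only = [c for c in train_toy.columns if c not in test_toy.columns]
print(train_only)  # ['casual', 'registered', 'count']
```

With the real data, the same comparison would presumably be run on `train_data.columns` and `test_data.columns`.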

## Information of the Dataset


In [None]:
train_data.info()   # info() prints by itself; wrapping it in print() only adds "None"
test_data.info()

# Chapter 03. Data Transformation for EDA

As the results above show, the columns "season", "holiday", "workingday" and "weather" are stored as "int", although they represent categorical data. We will transform the dataset before starting the EDA.

- Create new columns "date", "year", "month", "day", "hour", "weekday" from the "datetime" column.
- Coerce the datatype of "season", "holiday", "workingday" and "weather" to category.
- Drop the datetime column as we already extracted useful features from it.
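
The notebook later maps the integer codes to string labels instead of using the pandas category dtype. For reference, the coercion named in the list above could look like this minimal sketch, where `toy` is a hypothetical frame with made-up values:

```python
import pandas as pd

# Hypothetical frame holding only the four columns named in the list above.
toy = pd.DataFrame({"season": [1, 2, 3, 4], "holiday": [0, 0, 1, 0],
                    "workingday": [1, 1, 0, 1], "weather": [1, 2, 1, 3]})

categorical_cols = ["season", "holiday", "workingday", "weather"]
for col in categorical_cols:
    # astype("category") keeps the values but marks them as categorical.
    toy[col] = toy[col].astype("category")

print(toy.dtypes)
```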

## Create a copy


In [None]:
temp_train = train_data.copy()
print(temp_train.head())
temp_train.info()   # DataFrame information

### The 'datetime' data processing for Data visualization

- Create new columns "date", "year", "month", "day", "hour" from the "datetime" column.

In [None]:
import time

# Running time test
start_time = time.time()

temp_train['date'] = pd.to_datetime(temp_train['datetime'])
temp_train['year'] = temp_train['date'].dt.year
temp_train['month'] = temp_train['date'].dt.month
temp_train['day'] = temp_train['date'].dt.day
temp_train['hour'] = temp_train['date'].dt.hour

end_time = time.time()
elapsed_time = end_time - start_time

print("Elapsed time (s) : ", np.round(elapsed_time, 4))
print(temp_train[['datetime', 'year', 'month', 'day', 'hour']])
temp_train.info()

- Create the column 'weekday'
    + We can create a column that holds the name of the day of the week.

In [None]:
temp_train['weekday'] = temp_train['date'].dt.day_name()
temp_train['weekday']

- Convert the number values of the column 'season' into the character values


In [None]:
temp_train['season'] = temp_train['season'].map({
    1 : 'Spring',
    2 : 'Summer',
    3 : 'Fall',
    4 : 'Winter'
})
temp_train['season']

- Convert the number values of the column 'weather' into the character values

In [None]:
temp_train['weather'] = temp_train['weather'].map({
    1 : 'Clear',
    2 : 'Mist, Cloudy',   # per the Data Fields section, 2 = mist/cloudy ("few clouds" belongs to 1)
    3 : 'Light snow, Rain',
    4 : 'Heavy snow, Rain'
})
temp_train['weather']

temp_train.info()
temp_train.head()

## Missing Data

Once we have a feel for the data and its columns, the next step is generally to check for missing values. Luckily, the dataset has no missing values.
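
The notebook does not show the check itself. A minimal sketch of a per-column missing-value count, using a hypothetical toy frame with one deliberately missing entry:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing temperature reading (illustrative values).
toy = pd.DataFrame({"temp": [9.84, np.nan, 9.02], "humidity": [81, 80, 80]})

# isnull() flags missing cells; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

On the notebook's own frame, the equivalent call would presumably be `temp_train.isnull().sum()`, which should report zero for every column if the statement above holds.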

# Chapter 04. Exploratory Data Analysis
This competition asks us to predict a numeric value. To predict it well, we need to visualize the data and decide which columns to delete. We also need to visualize the dependent variable in the training data and check its distribution to decide whether to apply a log transformation.
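
One way to put a number on "check the distribution" is skewness. The sketch below uses synthetic right-skewed data (an assumption; it is not the real 'count' column) to show how `np.log1p` pulls a long right tail toward symmetry:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for the 'count' column.
toy_count = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5000))

raw_skew = toy_count.skew()             # strongly positive for a long right tail
log_skew = np.log1p(toy_count).skew()   # much closer to 0 after the transform

print("skew before:", round(raw_skew, 3))
print("skew after :", round(log_skew, 3))
```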

### Check the difference between 'Normal Graph' and 'Log Transformed Graph'

In [None]:
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (12, 10))

sns.histplot(train_data['count'], ax = ax[0])
sns.histplot(np.log1p(train_data['count']), ax = ax[1])  # Log conversion

# title option
ax[0].set_title('Normal Graph')
ax[1].set_title('Log Transformed Graph')

plt.show()

### The graph of Rental amounts by 'year', 'month', 'day', 'hour'

In [None]:
fig, ax = plt.subplots(nrows = 2, ncols = 2)

# step 1. Basic setting of the entire graph
## Spacing Between Graphs
fig.tight_layout()

## Manage overall graph size
fig.set_size_inches(15,12)

# step 2. Enter each individual graph
sns.boxplot(x = 'year', y = 'count', hue = 'season', 
            data = temp_train,
            palette = "Set2",
           ax = ax[0, 0])
sns.boxplot(x = 'month', y = 'count', data = temp_train, ax = ax[0, 1])
sns.barplot(x = 'day', y = 'count', data = temp_train,
            palette = "flare", ax = ax[1, 0], capsize = .005)
sns.boxplot(x = 'hour', y = 'count', data = temp_train, ax = ax[1,1])

# step 3. Insert detailed options
ax[0, 0].set(xlabel = 'The Years : 2011 / 2012', ylabel = 'Count', title = 'Box Plot on Count By Year')
ax[0, 1].set(xlabel = 'Month', ylabel = 'Count', title = 'Box Plot on Count By Month')
ax[1, 0].set(xlabel = 'Day', ylabel = 'Count', title = 'Bar Plot on Count By Day')
ax[1, 1].set(xlabel = 'Hour Of The Day', ylabel = 'Count', title = 'Box Plot on Count By Hour Of The Day')


plt.show()

### The graph of Rental amounts by 'season', 'weather', 'holiday', 'workingday'

In [None]:
temp_train['season'].value_counts()

In [None]:
fig, ax = plt.subplots(nrows = 2, ncols = 2)

fig.tight_layout()
fig.set_size_inches(15,12)

sns.boxplot(x = 'season', y = 'count', data = temp_train, ax = ax[0, 0], palette = "husl")
sns.boxplot(x = 'weather', y = 'count', data = temp_train, ax = ax[0, 1], palette = "pastel")
sns.boxplot(x = 'holiday', y = 'count', data = temp_train, ax = ax[1, 0])
sns.boxplot(x = 'workingday', y = 'count', data = temp_train, ax = ax[1,1], palette = "Set2")

ax[0, 0].set(xlabel = 'Season', ylabel = 'Count', title = 'Box Plot on Count By Season')
ax[0, 1].set(xlabel = 'Weather', ylabel = 'Count', title = 'Box Plot on Count By Weather')
ax[1, 0].set(xlabel = 'Holiday', ylabel = 'Count', title = 'Box Plot on Count By Holiday')
ax[1, 1].set(xlabel = 'Workingday', ylabel = 'Count', title = 'Box Plot on Count By Workingday')

plt.show()

- Since the 'season' and 'month' columns show similar shapes on the graphs, we decided to remove the 'month' column.
- The 'day' column yields fewer analytical insights than the other columns, so we delete it as well.

### The graph of Rental amounts by 'hour' based on various variables


In [None]:
fig, ax = plt.subplots(nrows = 5)

fig.tight_layout()
fig.set_size_inches(15,30)

sns.pointplot(x = 'hour', y = 'count', hue = 'workingday', data = temp_train, ax = ax[0], palette = 'hls')
sns.pointplot(x = 'hour', y = 'count', hue = 'holiday', data = temp_train, ax = ax[1], palette = "icefire")
sns.pointplot(x = 'hour', y = 'count', hue = 'weekday',  data = temp_train, ax = ax[2], palette = "colorblind")
sns.pointplot(x = 'hour', y = 'count', hue = 'season', data = temp_train, ax = ax[3], palette = 'Paired')
sns.pointplot(x = 'hour', y = 'count', hue = 'weather', data = temp_train, ax = ax[4], palette = 'rocket' )

ax[0].set(xlabel = 'Hour', ylabel = 'Count', title = 'Point Plot on Count By Workingday Hour')
ax[1].set(xlabel = 'Hour', ylabel = 'Count', title = 'Point Plot on Count By Holiday Hour')
ax[2].set(xlabel = 'Hour', ylabel = 'Count', title = 'Point Plot on Count By Weekday Hour')
ax[3].set(xlabel = 'Hour', ylabel = 'Count', title = 'Point Plot on Count By Season Hour')
ax[4].set(xlabel = 'Hour', ylabel = 'Count', title = 'Point Plot on Count By Weather Hour')

plt.show()

- Check values of the column 'weather'
    + When you look at the weather graph, something seems strange. Let's check the frequency of the weather column.

In [None]:
print(temp_train['weather'].value_counts())

- The 'Heavy snow, Rain' value occurs only once
    + Let's extract and examine the row corresponding to the 'Heavy snow, Rain' value.

In [None]:
temp_train.loc[temp_train['weather'] == 'Heavy snow, Rain']

- Since 'Heavy snow, Rain' appears in only one row, a model cannot learn a meaningful pattern for it, so predicting that condition in the test dataset seems pointless. Therefore, it seems better to delete the row with the 'Heavy snow, Rain' value from the training data.
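
The same reasoning can be applied mechanically: count each label and drop rows whose label is too rare. A minimal sketch on a toy series, where the threshold `min_rows` is a hypothetical choice and not something the notebook defines:

```python
import pandas as pd

# Toy weather labels: one label occurs only once (illustrative data).
toy_weather = pd.Series(["Clear"] * 6 + ["Few Clouds"] * 3 + ["Heavy snow, Rain"])

freq = toy_weather.value_counts()
min_rows = 2  # hypothetical threshold: labels seen fewer times are treated as rare
rare_labels = freq[freq < min_rows].index.tolist()

# Keep only the rows whose label is not rare.
filtered = toy_weather[~toy_weather.isin(rare_labels)].reset_index(drop=True)
print(rare_labels, len(filtered))
```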

### Scatterplot with regression lines

In [None]:
fig, ax = plt.subplots(nrows = 2, ncols = 2)

fig.tight_layout()

fig.set_size_inches(12, 18)

sns.regplot(x = 'temp', y = 'count', data = temp_train, 
            scatter_kws = {'alpha' : 0.3, 'color' : '#3EFA66'},
            line_kws = {'color' : '#FC28D7'}, ax = ax[0, 0])
sns.regplot(x = 'atemp', y = 'count', data = temp_train, 
            scatter_kws = {'alpha' : 0.3, 'color' : '#E69130'},
            line_kws = {'color' : '#717AFF'}, ax = ax[0, 1])
sns.regplot(x = 'humidity', y = 'count', data = temp_train, 
            scatter_kws = {'alpha' : 0.3, 'color' : '#D16A4D'},
            line_kws = {'color' : '#6BE6D9'}, ax = ax[1, 0])
sns.regplot(x = 'windspeed', y = 'count', data = temp_train, 
            scatter_kws = {'alpha' : 0.3, 'color' : '#7A6CE6'},
            line_kws = {'color' : '#FC28D7'}, ax = ax[1, 1])

ax[0, 0].set(xlabel = 'Temp', ylabel = 'Count', title = 'Rental amounts by Temp')
ax[0, 1].set(xlabel = 'aTemp', ylabel = 'Count', title = 'Rental amounts by aTemp')
ax[1, 0].set(xlabel = 'Humidity', ylabel = 'Count', title = 'Rental amounts by Humidity')
ax[1, 1].set(xlabel = 'Windspeed', ylabel = 'Count', title = 'Rental amounts by Windspeed')

plt.show()

- There is something odd in the 'windspeed' column
    + There are many '0' values in the 'windspeed' column

In [None]:
temp_train['windspeed'].value_counts()

- We cannot know exactly what 0 means among the values in the 'windspeed' column.
    + Therefore, we drop the 'windspeed' column to avoid confusion.
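
To see how large the problem is before dropping the column, the share of zero readings can be computed directly. A minimal sketch on a toy column (made-up values, not the real data):

```python
import pandas as pd

# Toy windspeed readings; half of them are exactly zero.
toy = pd.DataFrame({"windspeed": [0.0, 0.0, 6.0032, 16.9979, 0.0, 19.0012]})

# A boolean mask averages to the fraction of True values.
zero_share = (toy["windspeed"] == 0).mean()
print(f"share of zero readings: {zero_share:.2f}")  # 0.50

# Dropping the column, as decided above.
dropped = toy.drop(columns=["windspeed"])
```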

### Create Heatmap Graph
- Correlation coefficient analysis
    + Positive coefficient : a positive relationship
    + Negative coefficient : a negative relationship
    + 0 ~ ±0.2 : Little or no linear correlation between the two variables.
    + ±0.2 ~ ±1 : The larger the absolute value, the stronger the correlation between the two variables.
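
The rule of thumb above can be illustrated on synthetic data. In this sketch the column names, the noise level, and the helper `strength` are assumptions made for illustration; only the 0.2 cutoff comes from the list above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
temp = rng.uniform(0, 40, size=2000)

toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 1, size=2000),  # nearly a copy of temp
    "unrelated": rng.normal(size=2000),           # independent of temp
})
corr = toy.corr()  # Pearson correlation by default

def strength(r, threshold=0.2):
    """Label a coefficient using the 0.2 cutoff from the list above."""
    return "little or none" if abs(r) < threshold else "correlated"

print(strength(corr.loc["temp", "atemp"]))      # correlated
print(strength(corr.loc["temp", "unrelated"]))  # little or none
```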

In [None]:
CorrMat = temp_train[['temp','atemp', 'humidity', 'windspeed','count']].corr()
CorrMat

In [None]:
sns.heatmap(CorrMat, annot = True)
# Check the correlation ratio between count column and other Columns

# Chapter 05. Data Preprocessing
- step 01. Drop the 'casual', 'registered' columns from the train_data
- step 02. Drop data with a value of 4 in the weather column & Data combine
- step 03. 'Datetime' data processing(including dropping 'month', 'day' columns)
- step 04. The 'season', 'weather' columns processing
    + convert 'numeric' into 'character'  
- step 05. Drop the 'windspeed' column
- step 06. Drop the 'datetime', 'date' columns
- step 07. Encoding all characters to numbers

## step 01. Drop the 'casual', 'registered' columns from the train_data

In [None]:
train_data.shape

In [None]:
train_data = train_data.drop(['casual', 'registered'], axis = 1)
train_data

## step 02. Drop data with a value of 4 in the weather column & Data combine

- Drop data with a value of 4 in the weather column

In [None]:
train_data = train_data[train_data['weather'] != 4].reset_index(drop = True)
train_data.shape

In [None]:
train_data.info()

- Combine Train And Test Dataset


In [None]:
all_data = pd.concat([train_data, test_data]).reset_index(drop = True)
all_data.info()

## step 03. 'Datetime' data processing(including dropping 'month', 'day' columns)

In [None]:
all_data['date'] = pd.to_datetime(all_data['datetime'])
all_data['year'] = all_data['date'].dt.year
all_data['hour'] = all_data['date'].dt.hour
all_data['weekday'] = all_data['date'].dt.day_name()

all_data.shape
all_data.info()

## step 04. The 'season',  'weather' columns processing (convert 'numeric' into 'character')

- 'season' column processing

In [None]:
all_data['season'] = all_data['season'].replace(to_replace = [1, 2, 3, 4],
                                               value = ['Spring', 'Summer', 'Fall', 'Winter'])
print(all_data['season'])

all_data.shape
all_data.info()

- 'weather' column processing

In [None]:
all_data['weather'] = all_data['weather'].map({
    1 : 'Clear',
    2 : 'Mist, Cloudy',   # per the Data Fields section, 2 = mist/cloudy
    3 : 'Light rain, snow',
    4 : 'Heavy rain, snow'
})

all_data['weather']

## step 05. Drop the 'windspeed' column

In [None]:
all_data = all_data.drop(['windspeed'], axis = 1)
all_data.shape

## step 06. Drop the 'datetime', 'date' columns

In [None]:
all_data = all_data.drop(['datetime', 'date'], axis = 1)
all_data.shape

## step 07. Encoding all characters to numbers

In [None]:
all_data = pd.get_dummies(all_data).reset_index(drop = True)
all_data.shape

In [None]:
all_data.info()

# Chapter 06. Dividing dataset
- Re-separation into train_data and test_data
- 'count' : target data(Dependent variable)
    + train_data if target data is present
    + test_data if target data isn't present

In [None]:
train_data = all_data[~pd.isnull(all_data['count'])]
test_data = all_data[pd.isnull(all_data['count'])]
# train_data if Null isn't in 'count' column
# test_data if Null is in 'count' column

In [None]:
train_data.shape, test_data.shape

- Target data extraction

In [None]:
y = train_data['count']

- Drop the 'count' column from train and test data

In [None]:
X = train_data.drop(['count'], axis = 1)
test_data = test_data.drop(['count'], axis = 1)

X.shape, test_data.shape

- Check the data 'X', 'y', 'test_data'

In [None]:
X.shape, y.shape, test_data.shape

# Chapter 07. Machine Learning

### RMSLE Scorer

In [None]:
def rmsle(y_act, y_pred, convertExp = True):
    # convertExp determines whether the inputs are log-scale values
    # that must be exponentiated before scoring.
    if convertExp:
        y_act = np.exp(y_act)    # no trailing comma: it would turn y_act into a tuple
        y_pred = np.exp(y_pred)
    log1 = np.nan_to_num(np.log1p(np.asarray(y_act)))
    log2 = np.nan_to_num(np.log1p(np.asarray(y_pred)))
    calc = (log1 - log2) ** 2
    return np.sqrt(np.mean(calc))
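
As a sanity check, an RMSLE computed by hand should agree with the square root of scikit-learn's `mean_squared_log_error`. The sketch below uses a separate, hypothetical helper `rmsle_vectorized` and made-up values so that it runs on its own:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle_vectorized(y_true, y_pred):
    """RMSLE on original-scale values, using log1p for log(x + 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Made-up actual and predicted rental counts.
y_true = np.array([16, 40, 32, 13, 1])
y_pred = np.array([14.2, 45.0, 30.5, 10.0, 2.0])

ours = rmsle_vectorized(y_true, y_pred)
reference = np.sqrt(mean_squared_log_error(y_true, y_pred))
print(ours, reference)
```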

### Splitting the data 'X', 'y'

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size = 0.3, random_state = 42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

### Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

lr_model = LinearRegression()

# Log conversion
log_y = np.log(y)
log_y_train = np.log(y_train)
log_y_test = np.log(y_test)
lr_model.fit(X_train, log_y_train)

# Model Prediction
lr_preds = lr_model.predict(X)
print("RMSLE Value For Linear Regression: ", rmsle(np.exp(log_y), np.exp(lr_preds), False))

- Linear Regression RMSLE score
    + y_train : 1.0091
    + y_test : 1.0128
    + y : 1.0102

### Ensemble Models - Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators = 100)

# Log conversion
log_y = np.log(y)
log_y_train = np.log(y_train)
log_y_test = np.log(y_test)
rf_model.fit(X_train, log_y_train)

# Model Prediction
rf_preds = rf_model.predict(X)
print("RMSLE Value For Random Forest: ", rmsle(np.exp(log_y), np.exp(rf_preds), False))

- Random Forest RMSLE score
    + y_train : 0.1182
    + y_test : 0.3191
    + y : 0.2001
    
    
- The model is overfitted to the (X_train, y_train) data: the y_train score is far lower than the y_test score.
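
The cells above print only the score on the full data, so the per-split scores have to be computed separately. A minimal sketch of that comparison, assuming synthetic data and a hypothetical helper `rmsle_from_log` (neither comes from the notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(400, 3))
# Synthetic log-scale target: a linear signal plus noise.
log_y_toy = 3 + 2 * X_toy[:, 0] + rng.normal(0, 0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, log_y_toy, test_size=0.3, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

def rmsle_from_log(log_true, log_pred):
    # Both inputs are on the log scale; convert back before applying log1p.
    diff = np.log1p(np.exp(log_true)) - np.log1p(np.exp(log_pred))
    return np.sqrt(np.mean(diff ** 2))

scores = {
    "train": rmsle_from_log(y_tr, model.predict(X_tr)),
    "test": rmsle_from_log(y_te, model.predict(X_te)),
}
print(scores)  # a train score well below the test score suggests overfitting
```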

### Ensemble Model - Gradient Boost

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

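# Note: 'alpha' only affects the 'huber' and 'quantile' losses; it has no effect with the default loss.
gb_model = GradientBoostingRegressor(n_estimators = 5000, alpha = 0.01)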

# Log conversion
log_y = np.log(y)
log_y_train = np.log(y_train)
log_y_test = np.log(y_test)
gb_model.fit(X_train, log_y_train)

# Model Prediction
gb_preds = gb_model.predict(X)
print("RMSLE Value For Gradient Boost: ", rmsle(np.exp(log_y), np.exp(gb_preds), False))

- Gradient Boost RMSLE score
    + y_train : 0.2008
    + y_test : 0.3103
    + y : 0.2356
    


### XGBoost

In [None]:
from xgboost import XGBRegressor 

xgb_model = XGBRegressor(n_estimators = 100,
                          max_depth = 5,
                          learning_rate = 0.1,
                          random_state = 42,
                          eval_metric = 'rmsle')

# Log conversion
log_y = np.log(y)
log_y_train = np.log(y_train)
log_y_test = np.log(y_test)
w_list = [(X_train, log_y_train), (X_test, log_y_test)]

xgb_model.fit(X_train, log_y_train, eval_set = w_list)

# Model Prediction
xgb_preds = xgb_model.predict(X)
print("RMSLE Value For XGBoost: ", rmsle(np.exp(log_y), np.exp(xgb_preds), False))

- XGBoost RMSLE score
    + y_train : 0.2607
    + y_test : 0.3077
    + y : 0.2757

### Create a cross-validation function (Based on RMSE scorer)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor 
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor

def cv_rmse(model, n_folds=5):
    cv = KFold(n_splits=n_folds, random_state=42, shuffle=True)
    rmse_list = np.sqrt(-cross_val_score(model, X, log_y, scoring='neg_mean_squared_error', cv=cv))
    print('CV RMSE value list:', np.round(rmse_list, 4))
    print('CV RMSE mean value:', np.round(np.mean(rmse_list), 4))
    return (rmse_list)

n_folds = 5
rmse_scores = {}
# lr_model = LinearRegression()
xgb_model =  XGBRegressor()

score = cv_rmse(xgb_model, n_folds)
print("XGBoost - mean: {:.4f} (std: {:.4f})".format(score.mean(), score.std()))
rmse_scores['XGBoost'] = (score.mean(), score.std())

- Cross-validation rmse_scores
    + Linear Regression : 1.0647
    + RandomForest : 0.3422
    + GradientBoosting : 0.4369
    + XGBoost : 0.3290
    + LightGBM : Error -> [LightGBM] [Fatal] Do not support special JSON characters in feature name.
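
The LightGBM error is likely triggered by one-hot column names such as `weather_Light rain, snow`, since a comma is among the characters LightGBM rejects in feature names. A possible workaround, sketched here on a toy frame and not tested against the notebook's data, is to sanitize the column names before fitting:

```python
import re
import pandas as pd

# Toy column reproducing the kind of labels created in step 04.
toy = pd.DataFrame({"weather": ["Clear", "Light rain, snow", "Few Cloud"]})
encoded = pd.get_dummies(toy)

# Replace every run of characters that is not a letter, digit, or underscore.
safe_names = {col: re.sub(r"[^0-9A-Za-z_]+", "_", col) for col in encoded.columns}
encoded = encoded.rename(columns=safe_names)

print(list(encoded.columns))
# ['weather_Clear', 'weather_Few_Cloud', 'weather_Light_rain_snow']
```

Note that two different original names could collapse into the same sanitized name, so it is worth checking for duplicates after renaming.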


- Based on the model evaluation and cross-validation results, XGBoost, which showed the best performance, is selected as the final prediction model.


### Hyperparameter Tuning for best prediction score


In [None]:
from xgboost import XGBRegressor 

xgb_model = XGBRegressor(n_estimators = 200,
                          max_depth = 8,
                          learning_rate = 0.1,
                          random_state = 42,
                          eval_metric = 'rmsle')

# Log conversion
log_y = np.log(y)
log_y_train = np.log(y_train)
log_y_test = np.log(y_test)
w_list = [(X_train, log_y_train), (X_test, log_y_test)]

xgb_model.fit(X_train, log_y_train, eval_set = w_list)

# Model Prediction
xgb_preds = xgb_model.predict(X)
print("RMSLE Value For XGBoost: ", rmsle(np.exp(log_y), np.exp(xgb_preds), False))

In [None]:
from xgboost import XGBRegressor 

xgb_model = XGBRegressor(n_estimators = 60,
                          max_depth = 10,
                          learning_rate = 0.1,
                          random_state = 42,
                          eval_metric = 'rmsle')

# Log conversion
log_y = np.log(y)
log_y_train = np.log(y_train)
log_y_test = np.log(y_test)
w_list = [(X_train, log_y_train), (X_test, log_y_test)]

xgb_model.fit(X_train, log_y_train, eval_set = w_list)

# Model Prediction
xgb_preds = xgb_model.predict(X)
print("RMSLE Value For XGBoost: ", rmsle(np.exp(log_y), np.exp(xgb_preds), False))

In [None]:
from xgboost import XGBRegressor 

xgb_model = XGBRegressor(n_estimators = 80,
                          max_depth = 10,
                          learning_rate = 0.1,
                          random_state = 42,
                          eval_metric = 'rmsle')

# Log conversion
log_y = np.log(y)
log_y_train = np.log(y_train)
log_y_test = np.log(y_test)
w_list = [(X_train, log_y_train), (X_test, log_y_test)]

xgb_model.fit(X_train, log_y_train, eval_set = w_list)

# Model Prediction
xgb_preds = xgb_model.predict(X)
print("RMSLE Value For XGBoost: ", rmsle(np.exp(log_y), np.exp(xgb_preds), False))

In [None]:
from xgboost import XGBRegressor 

xgb_model = XGBRegressor(n_estimators = 80,
                          max_depth = 7,
                          learning_rate = 0.1,
                          random_state = 42,
                          eval_metric = 'rmsle')

# Log conversion
log_y = np.log(y)
log_y_train = np.log(y_train)
log_y_test = np.log(y_test)
w_list = [(X_train, log_y_train), (X_test, log_y_test)]

xgb_model.fit(X_train, log_y_train, eval_set = w_list)

# Model Prediction
xgb_preds = xgb_model.predict(X)
print("RMSLE Value For XGBoost: ", rmsle(np.exp(log_y), np.exp(xgb_preds), False))
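
The cells above try `n_estimators` and `max_depth` combinations by hand, one cell at a time. The same search could be scripted with `GridSearchCV`. The sketch below is only an illustration: it uses synthetic data, a hypothetical parameter grid, and scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor` so that it runs without xgboost installed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(300, 4))
# Synthetic log-scale target standing in for log_y.
log_y_toy = 3 + 2 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(0, 0.3, size=300)

param_grid = {"n_estimators": [60, 80], "max_depth": [3, 5]}  # hypothetical grid
search = GridSearchCV(
    GradientBoostingRegressor(learning_rate=0.1, random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # RMSE on the log-scale target
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_toy, log_y_toy)

best_rmse = -search.best_score_  # the scorer is negated, so flip the sign
print(search.best_params_, round(best_rmse, 4))
```

Because the scores come from cross-validation and not from data the model was fitted on, this comparison is less prone to favoring overfitted settings than scoring on the full `X`.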

In [None]:
from xgboost import XGBRegressor 

xgb_model = XGBRegressor(n_estimators = 80,
                          max_depth = 7,
                          learning_rate = 0.1,
                          random_state = 42,
                          eval_metric = 'rmsle')

# Log conversion
log_y = np.log(y)
log_y_train = np.log(y_train)
log_y_test = np.log(y_test)
w_list = [(X_train, log_y_train), (X_test, log_y_test)]

xgb_model.fit(X_train, log_y_train, eval_set = w_list)

# Model Prediction
xgb_preds = xgb_model.predict(test_data)
xgb_preds[:10]

# Chapter 08. Model Prediction

In [None]:
# Exponential conversion
final_preds = np.exp(xgb_preds)
final_preds

# Chapter 09. Submission

In [None]:
submission_data['count'] = final_preds
submission_data.to_csv('submission.csv', index = False)