## The easiest way to get an RMSE < 180,000

Sometimes, less is more. Simplicity is often overlooked when diving into a data science problem, and understandably so! Here's a model that scored an RMSE of 172926.36 on the test data.

In [1]:
# Libraries
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
import sklearn.metrics as metrics

# A helper for saving predictions in the proper submission format.
# Note: it reads the global `test_id`, which is defined in cell 5 before the first call.
def submit(preds, submission_num):
    # Build the two-column frame directly rather than transposing a list of rows
    submission_df = pd.DataFrame({'id': test_id, 'price': preds})
    submission_df.to_csv(f'submission{submission_num}.csv', index=False)


In [2]:
# Read in train data
df_train = pd.read_csv('/kaggle/input/playground-series-s3e6/train.csv')
df_train.head()

Unnamed: 0,id,squareMeters,numberOfRooms,hasYard,hasPool,floors,cityCode,cityPartRange,numPrevOwners,made,isNewBuilt,hasStormProtector,basement,attic,garage,hasStorageRoom,hasGuestRoom,price
0,0,34291,24,1,0,47,35693,2,1,2000,0,1,8,5196,369,0,3,3436795.2
1,1,95145,60,0,1,60,34773,1,4,2000,0,1,729,4496,277,0,6,9519958.0
2,2,92661,45,1,1,62,45457,4,8,2020,1,1,7473,8953,245,1,9,9276448.1
3,3,97184,99,0,0,59,15113,1,1,2000,0,1,6424,8522,256,1,9,9725732.2
4,4,61752,100,0,0,57,64245,8,4,2018,1,0,7151,2786,863,0,7,6181908.8


In [3]:
# Read in test data
df_test = pd.read_csv('/kaggle/input/playground-series-s3e6/test.csv')
df_test.head()

Unnamed: 0,id,squareMeters,numberOfRooms,hasYard,hasPool,floors,cityCode,cityPartRange,numPrevOwners,made,isNewBuilt,hasStormProtector,basement,attic,garage,hasStorageRoom,hasGuestRoom
0,22730,47580,89,0,1,8,54830,5,3,1995,0,0,6885,8181,241,0,8
1,22731,62083,38,0,0,87,8576,10,3,1994,1,1,4601,9237,393,1,4
2,22732,90499,75,1,1,37,62454,9,6,1997,0,1,7454,2680,305,0,2
3,22733,16354,47,1,1,9,9262,6,5,2019,1,1,705,5097,122,1,5
4,22734,67510,8,0,0,55,24112,3,7,2014,1,1,3715,7979,401,1,9


In [4]:
# Basic info
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22730 entries, 0 to 22729
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 22730 non-null  int64  
 1   squareMeters       22730 non-null  int64  
 2   numberOfRooms      22730 non-null  int64  
 3   hasYard            22730 non-null  int64  
 4   hasPool            22730 non-null  int64  
 5   floors             22730 non-null  int64  
 6   cityCode           22730 non-null  int64  
 7   cityPartRange      22730 non-null  int64  
 8   numPrevOwners      22730 non-null  int64  
 9   made               22730 non-null  int64  
 10  isNewBuilt         22730 non-null  int64  
 11  hasStormProtector  22730 non-null  int64  
 12  basement           22730 non-null  int64  
 13  attic              22730 non-null  int64  
 14  garage             22730 non-null  int64  
 15  hasStorageRoom     22730 non-null  int64  
 16  hasGuestRoom       227

**Clearly, the id column should go. Besides that, everything can stay for now.**

In [5]:
# Drop id from train data
df_train.drop('id', axis=1, inplace=True)

# Store the test id column before dropping for submission purposes
test_id = df_test['id']
df_test.drop('id', axis=1, inplace=True)

**We will train an XGBRegressor on the entire training data using default parameters.**

In [6]:
# Model: XGBRegressor
X = df_train.drop('price', axis=1)
y = df_train['price']

# Call upon thy regressor
xgbr = XGBRegressor()

# Fit thy model
xgbr.fit(X, y)


XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)

In [7]:
# Predictions
preds = xgbr.predict(df_test).astype(int)
submit(preds=preds, submission_num='1')

preds[0:20]

array([4755510, 6213942, 9077159, 1642430, 6756827,  122364, 9924012,
       5280922, 5586454, 9837013, 6984843, 6100250, 5205847, 1096391,
       5103403, 3408830, 5802486, 4246467, 3515929, 7051813])
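
One detail in the cell above is worth a quick look: `.astype(int)` truncates toward zero rather than rounding to the nearest integer. The sketch below uses made-up numbers (not the competition predictions) to show the difference, with `np.rint` as one possible alternative. With prices in the millions the effect on RMSE should be negligible here, so treat this as a note on behavior rather than a needed fix.

```python
import numpy as np

# Toy values, not real predictions
values = np.array([1.9, 2.5, -0.7])

# astype(int) drops the fractional part (truncation toward zero)
truncated = values.astype(int)

# np.rint rounds to the nearest integer, with ties going to the even neighbor
rounded = np.rint(values).astype(int)

print(truncated)  # [1 2 0]
print(rounded)    # [ 2  2 -1]
```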

In [8]:
# Predictions on training set
train_preds = xgbr.predict(X)

def regression_results(y_true, y_pred):
    """Print common regression metrics for a set of predictions."""
    explained_variance = metrics.explained_variance_score(y_true, y_pred)
    mean_absolute_error = metrics.mean_absolute_error(y_true, y_pred)
    mse = metrics.mean_squared_error(y_true, y_pred)
    r2 = metrics.r2_score(y_true, y_pred)

    # "Accuracy" here means 100 minus the mean absolute percentage error (MAPE)
    mape = 100 * np.mean(np.abs(y_pred - y_true) / y_true)
    accuracy = 100 - mape

    print('Accuracy:', round(accuracy, 2), '%.')
    print('explained_variance: ', round(explained_variance, 4))
    print('r2: ', round(r2, 4))
    print('MAE: ', round(mean_absolute_error, 4))
    print('MSE: ', round(mse, 4))
    print('RMSE: ', round(np.sqrt(mse), 4))

regression_results(y, train_preds)

Accuracy: 99.09 %.
explained_variance:  1.0
r2:  1.0
MAE:  9445.3713
MSE:  352129861.4569
RMSE:  18765.1235
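
To make the "Accuracy" and RMSE lines concrete, here is a tiny worked example with made-up numbers (not the competition data). It assumes, as in the function above, that "Accuracy" means 100 minus the mean absolute percentage error.

```python
import numpy as np

# Toy targets and predictions, chosen so the arithmetic is easy to follow
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])

# Percentage errors are 10%, 5% and 0%, so MAPE = 5%
mape = 100 * np.mean(np.abs(y_pred - y_true) / y_true)
accuracy = 100 - mape

# Squared errors are 100, 100 and 0, so RMSE = sqrt(200 / 3)
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))

print(f"Accuracy: {accuracy:.2f} %")  # Accuracy: 95.00 %
print(f"RMSE: {rmse:.4f}")            # RMSE: 8.1650
```

Note that the two numbers measure different things: the percentage-based figure is relative to each target, while RMSE is in the same units as the price and penalizes large misses more heavily.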


**The RMSE on the test data (172926.36) is about 9x higher than on the training data (18765.12). We're overfitting! (Shocker, I know.) This is a good baseline to improve upon.**
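
One way to see this gap without spending a submission is to hold out part of the training data and score on it. The following is a minimal sketch, not part of the original notebook: `holdout_rmse` and the `LeastSquares` stand-in model are hypothetical helpers, and the data is synthetic so the example runs with NumPy alone.

```python
import numpy as np

def holdout_rmse(model, X, y, val_fraction=0.2, seed=0):
    """Fit on a random subset of rows and return (train_rmse, val_rmse)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Shuffle row indices, then carve off the validation rows
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_val = int(len(y) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]

    model.fit(X[train_idx], y[train_idx])

    def rmse(idx):
        residuals = model.predict(X[idx]) - y[idx]
        return float(np.sqrt(np.mean(residuals ** 2)))

    return rmse(train_idx), rmse(val_idx)


class LeastSquares:
    """Hypothetical stand-in model so the sketch runs without xgboost."""

    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])  # add an intercept column
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        A = np.column_stack([X, np.ones(len(X))])
        return A @ self.coef_


# Synthetic data, not the competition data
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(500, 3))
y_demo = 100 * X_demo[:, 0] + rng.normal(0, 50, size=500)

train_rmse, val_rmse = holdout_rmse(LeastSquares(), X_demo, y_demo)
print(f"train RMSE: {train_rmse:.1f}, validation RMSE: {val_rmse:.1f}")
```

Any object with `fit` and `predict` should work here, so in this notebook the call would presumably look like `holdout_rmse(XGBRegressor(), X, y)`. A validation RMSE far above the training RMSE would point to the same overfitting seen on the leaderboard.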