# House Prices Prediction Model
In the last notebook, we did some exploratory data analysis. Now, we're going to modeling directly.

![Abstract houses](https://storage.googleapis.com/kaggle-media/competitions/House%20Prices/kaggle_5407_media_housesbanner.png)

<br>

#### >> [Copy this notebook](https://www.kaggle.com/code/mohamedyosef101/house-prices-prediction)

<hr>

# Set up
- Import the libraries
- Load the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# load the data
trainData = pd.read_csv('data/train.csv')
testData = pd.read_csv('data/test.csv')
submission = pd.read_csv('data/sample_submission.csv')

# Step 1: Handling missing values
Just to be the same page, this is my missing value table that I've created in the EDA.

In [2]:
def get_missing_value_counts(data_frame):
    missing_counts = data_frame.isnull().sum()
    missing_counts = (missing_counts[missing_counts > 0]).sort_values(ascending=False)
    
    percent = data_frame.isnull().sum()/data_frame.isnull().count()
    percent = (percent[percent > 0]).sort_values(ascending=False)
    
    missing_data = pd.concat([missing_counts, percent], axis=1, keys=['Missing_counts', 'Percent'])
    return missing_data

train_missing_values = get_missing_value_counts(trainData)
print(train_missing_values)

              Missing_counts   Percent
PoolQC                  1453  0.995205
MiscFeature             1406  0.963014
Alley                   1369  0.937671
Fence                   1179  0.807534
MasVnrType               872  0.597260
FireplaceQu              690  0.472603
LotFrontage              259  0.177397
GarageType                81  0.055479
GarageYrBlt               81  0.055479
GarageFinish              81  0.055479
GarageQual                81  0.055479
GarageCond                81  0.055479
BsmtFinType2              38  0.026027
BsmtExposure              38  0.026027
BsmtFinType1              37  0.025342
BsmtCond                  37  0.025342
BsmtQual                  37  0.025342
MasVnrArea                 8  0.005479
Electrical                 1  0.000685


> Since it is the action time, I'll remove all features with more than 81 missing value.

In [3]:
trainData = trainData.drop((train_missing_values[train_missing_values['Missing_counts'] > 81]).index,axis=1)
trainData = trainData.apply(lambda x:x.fillna(x.value_counts().index[0]))
trainData.isnull().sum().max()

0

**Let's address the missing values in the test data - we cannot drop rows.**

In [4]:
test_missing_values = get_missing_value_counts(testData)
testData = testData.drop((test_missing_values[test_missing_values['Missing_counts'] > 81]).index,axis=1)
testData = testData.apply(lambda x:x.fillna(x.value_counts().index[0]))

# Step 2: Data Preprocessing
Get the data ready for the model

### 2.1 Remove the identifiers

In [5]:
trainData.drop("Id", axis = 1, inplace = True)
testData.drop("Id", axis = 1, inplace = True)

### 2.2 Encoding Categorical Variables

In [6]:
from sklearn.preprocessing import LabelEncoder
cat_cols = trainData.select_dtypes(include='object').columns

for c in cat_cols:
    lbl = LabelEncoder()
    lbl.fit(list(trainData[c].values)) 
    trainData[c] = lbl.transform(list(trainData[c].values))
    testData[c] = lbl.transform(list(testData[c].values))

print("done")

done


### 2.3 Remove outliers

In [7]:
outliers = trainData[(trainData['GrLivArea']>4000) & (trainData['SalePrice']<300000)].index 
train = trainData.drop(outliers)

### 2.4 Split the data

In [8]:
x_train = trainData.drop('SalePrice', axis=1)
y_train = trainData['SalePrice']

# Step 3: Model building
Maybe you've already explored some notebooks and learned that achieving results quickly can be done using XGBoost. Since you want reproducibility, let's begin with XGBoost for our modeling process.

But, If you have any specific questions or need guidance on setting up the XGBoost model, please feel free to ask.

In [9]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

model_xgb = xgb.XGBRegressor(n_estimators=2200)
n_folds = 5

def rmsle(model):
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

model_xgb.fit(x_train, y_train)
xgb_train_pred = model_xgb.predict(x_train)
xgb_pred = model_xgb.predict(testData)
print(rmsle(y_train, xgb_train_pred))

0.04013432145719167


### The submission

In [11]:
original_test = pd.read_csv('data/test.csv')
sub = pd.DataFrame()
sub['Id'] = original_test['Id'].values
sub['SalePrice'] = xgb_pred
sub.to_csv('submission.csv',index=False)

### The Score is 0.14875
There is a posibility of a higher score 0.044 but it lack of sense and nobody understand the real reason behind it. So I'll keep this score.