# House Price Prediction - Advanced Regression Techniques

## Table of Contents
- [Data collection & initial exploration](#Data-collection-&-initial-exploration)
  - [Setup and imports](#Setup-and-imports)
  - [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis-(EDA))
- [Feature engineering](#Feature-engineering)
  - [Preprocessing](#Preprocessing)
    - [Handling outliers](#Handling-outliers)
    - [Concatenation](#Concatenation)
    - [New features](#New-features)
    - [Data cleaning](#Data-cleaning)
    - [Fixing Skewed Features](#Fixing-Skewed-Features)
  - [Encoding](#Encoding)
  - [Split](#Split)
- [Model building](#Model-building)
  - [Model 1 - Linear regression](#Model-1---Linear-regression)
  - [Model 2 - Random forest](#Model-2---Random-forest)
  - [Model 3 - XGBoost](#Model-3---XGBoost)
- [Conclusion & submission](#Conclusion-&-submission)



# Data collection & initial exploration

## Setup and imports

In [None]:
#Import relevant packages 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import random
from sklearn.preprocessing import OneHotEncoder

In [None]:
#Read in training data 
train_df = pd.read_csv('train.csv')
#Read in test data
test_df = pd.read_csv('test.csv')
# set seed for reproducibility
np.random.seed(0)

#Delete ID columns
train_df.drop(['Id'], axis=1, inplace=True)
test_df.drop(['Id'], axis=1, inplace=True)

We drop the ID column to make sure this feature does not affect the prediction. 

## Exploratory Data Analysis (EDA)

In [None]:
#Show first five rows of dataframe. This gives us an initial overview over what we are dealing with. 
train_df.head()

In [None]:
train_df.shape

In [None]:
train_df.dtypes

Initial exploration shows that the training dataset consists of 1460 rows and 80 columns (after dropping the ID column). The final column is the target feature: sale price. The dataset contains several datatypes (int64, float64 and object). We will handle the object columns later under "Encoding".

In [None]:
# find categorical variables
categorical = [var for var in train_df.columns if train_df[var].dtype=='O']
print('There are {} categorical variables'.format(len(categorical)))

In [None]:
# find numerical variables
numerical = [var for var in train_df.columns if train_df[var].dtype!='O']
print('There are {} numerical variables'.format(len(numerical)))

We will assume that numerical variables with fewer than 20 unique values are discrete.

In [None]:
# Visualizing the values of the discrete variables
discrete = []
for var in numerical:
    if len(train_df[var].unique())<20:
        print(var, ' values: ', train_df[var].unique())
        discrete.append(var)

In [None]:
# first we make a list of continuous variables (from the numerical ones)
continuous = [var for var in numerical if var not in discrete and var not in ['SalePrice']]
continuous

### HeatMap

We use a correlation matrix to see which numerical features are most strongly correlated with the house price.

In [None]:
# Correlation Matrix Heatmap
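# Restrict to numeric columns; pandas >= 2.0 raises an error on object columns otherwise
corrmat = train_df.corr(numeric_only=True)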
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.show()

# Top 10 Heatmap
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train_df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

most_corr = pd.DataFrame(cols)
most_corr.columns = ['Most Correlated Features']
most_corr

### BoxPlot

In [None]:
# Make boxplots to visualise outliers in the continuous variables 
# and histograms to get an idea of the distribution

for var in continuous:
    plt.figure(figsize=(12,6))
    plt.subplot(1, 2, 1)
    fig = sns.boxplot(y=train_df[var])
    fig.set_title('')
    fig.set_ylabel(var)
    
    plt.subplot(1, 2, 2)
    # histplot replaces the deprecated distplot (seaborn >= 0.11) and plots counts
    fig = sns.histplot(train_df[var].dropna(), kde=True)
    fig.set_ylabel('Number of houses')
    fig.set_xlabel(var)

    plt.show()

We see that several of the continuous features have outliers. We will handle these under "Feature engineering".

# Feature engineering

## Preprocessing

### Handling outliers

#### Outliers in continuous variables

In [None]:
#Qualitative assessment of which features have outliers.
outlier_columns = ['1stFlrSF','GrLivArea', '3SsnPorch', 'ScreenPorch', 'MiscVal', 'YearBuilt', 'BsmtFinSF1', "2ndFlrSF"]


for column in outlier_columns:
    # Calculate quartiles
    Q1 = train_df[column].quantile(0.25)
    Q3 = train_df[column].quantile(0.75)

    # Calculate IQR
    IQR = Q3 - Q1

    # Define lower and upper bounds
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)

    # Filter the DataFrame to exclude outliers
    train_df = train_df[(train_df[column] >= lower_bound) & (train_df[column] <= upper_bound)]
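
The 1.5 × IQR rule used above can be tried on a small example. The sketch below uses made-up columns (`spread_out` and `mostly_zero` are hypothetical, not taken from the housing data) and a helper `iqr_bounds` that we introduce only for illustration. It also shows a side effect worth checking for: when at least three quarters of a column's values are identical, the IQR is zero, so every row with a different value is filtered out. This may apply to zero-heavy columns such as porch areas.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) bounds of the k * IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy columns, not taken from the housing data
toy = pd.DataFrame({
    "spread_out": [10, 12, 13, 15, 16, 18, 19, 21, 22, 95],
    "mostly_zero": [0, 0, 0, 0, 0, 0, 0, 0, 40, 120],
})

for col in toy.columns:
    low, high = iqr_bounds(toy[col])
    kept = toy[col].between(low, high)  # inclusive on both ends, like >= and <=
    print(col, "bounds:", (low, high), "rows kept:", int(kept.sum()))
```

Printing the bounds per column before filtering is a cheap way to see how many rows each feature would remove.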


#### Outliers in discrete variables

In [None]:
for var in discrete:
    value_counts = train_df[var].value_counts() / len(train_df)
    categories_to_remove = value_counts[value_counts < 0.01].index
    train_df = train_df[~train_df[var].isin(categories_to_remove)] 

In [None]:
train_df.shape
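
The rare-value filter for discrete variables can be sketched on a toy column. The column name `rooms` and its values are hypothetical. The 1% threshold is the one used in the cell above.

```python
import pandas as pd

# 200 hypothetical rows: the value 9 appears only once (0.5% of rows)
toy = pd.DataFrame({"rooms": [2] * 60 + [3] * 139 + [9] * 1})

share = toy["rooms"].value_counts() / len(toy)   # relative frequency per value
rare_values = share[share < 0.01].index          # values seen in fewer than 1% of rows
filtered = toy[~toy["rooms"].isin(rare_values)]  # drop the rows holding a rare value

print(share.to_dict())
print(len(toy), "->", len(filtered))
```

In the notebook the frequencies are recomputed for each variable on the already filtered data, so the rows removed can depend on the order of the variables.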

### Concatenation

In [None]:
train_test_df = pd.concat([train_df,test_df],axis=0)

### New features

In [None]:
train_test_df['TotalIndoorArea']  = train_test_df['1stFlrSF'] + train_test_df['2ndFlrSF'] + train_test_df['TotalBsmtSF'] + train_test_df['GrLivArea']

import matplotlib.pyplot as plt

# # Plotting 'SalePrice' vs 'BsmtFinSF1'
# plt.figure(figsize=(8, 6))
# plt.scatter(train_test_df['BsmtFinSF1'], train_test_df['SalePrice'], color='blue', label='SalePrice vs BsmtFinSF1')
# plt.xlabel('BsmtFinSF1')
# plt.ylabel('SalePrice')
# plt.title('SalePrice vs BsmtFinSF1')
# plt.legend()
# plt.show()

# Plotting 'SalePrice' vs 'TotalBsmtSF'
plt.figure(figsize=(8, 6))
plt.scatter(train_test_df['TotalBsmtSF'], train_test_df['SalePrice'], color='green', label='SalePrice vs TotalBsmtSF')
plt.xlabel('TotalBsmtSF')
plt.ylabel('SalePrice')
plt.title('SalePrice vs TotalBsmtSF')
plt.legend()
plt.show()

train_test_df["Square_BsmtFinSF1"] = train_test_df["BsmtFinSF1"] ** 2
train_test_df["Square_TotalBsmtSF"] = train_test_df["TotalBsmtSF"] ** 2


### Replace NA with None for features where NA means that the house does not have it

In [None]:
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 
            'BsmtFinType1', 'BsmtFinType2',
            'GarageType', 'GarageFinish', 'GarageQual', 
            'GarageCond'):
    train_test_df[col] = train_test_df[col].fillna('None')

### Data cleaning

#### Identify columns with missing values
The first data-cleaning step is to identify the columns with missing data points.

In [None]:
#Get the number of missing data points per column
missing_values_count = train_test_df.isnull().sum()
#Get columns with at least one missing data point
columns_with_missing_data = missing_values_count[missing_values_count > 0]

print(columns_with_missing_data)


all_data_na = (train_test_df.isnull().sum() / len(train_test_df)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)  # numeric angle; recent matplotlib versions reject the string '90'
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)

#### Feature Selection
We will drop columns which are missing more than 80% of their values. The 80% threshold is our own judgment call; we have not yet found a source that recommends a specific cut-off.

In [None]:
#Create list of columns which miss more values than 80%.
columns_to_drop = missing_values_count[missing_values_count > 0.8*len(train_test_df)].index
print(columns_to_drop)
train_test_df.drop(columns=columns_to_drop, inplace=True)
train_test_df.shape

We can now see that the four columns "Alley", "PoolQC", "Fence" and "MiscFeature" have been removed.
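
As a small illustration of the thresholding step, the sketch below applies the same "more than 80% missing" rule to a made-up DataFrame. The column names are hypothetical, and `isnull().mean()` is used here as a shorthand for the count-based comparison in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missingness
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 9 + [1.0],     # 90% missing
    "some_missing": [np.nan] * 3 + [1.0] * 7,   # 30% missing
    "complete": list(range(10)),                # 0% missing
})

missing_share = toy.isnull().mean()                 # fraction of NaN per column
to_drop = missing_share[missing_share > 0.8].index  # columns above the threshold
reduced = toy.drop(columns=to_drop)

print(missing_share.to_dict())
print(list(reduced.columns))
```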

#### Impute missing values
For the rest of the columns with missing values we will impute values. For most numeric columns we impute the mean of the values in the column. For the categorical columns, and for the basement bathroom counts BsmtFullBath and BsmtHalfBath, we impute the mode.

In [None]:
# Numeric columns: fill missing values with the column mean
mean_imputed_columns = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt',
                        'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
                        'GarageCars', 'GarageArea', 'TotalIndoorArea',
                        'Square_BsmtFinSF1', 'Square_TotalBsmtSF']
for col in mean_imputed_columns:
    train_test_df[col] = train_test_df[col].fillna(train_test_df[col].mean())

# Categorical columns and bathroom counts: fill missing values with the most frequent value
mode_imputed_columns = ['MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
                        'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu',
                        'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
                        'MSZoning', 'Utilities', 'Exterior1st', 'Exterior2nd',
                        'BsmtFullBath', 'BsmtHalfBath', 'KitchenQual', 'Functional',
                        'SaleType']
for col in mode_imputed_columns:
    train_test_df[col] = train_test_df[col].fillna(train_test_df[col].mode()[0])



# Check that we have no missing datapoints.
missing_values_count = train_test_df.isnull().sum()
columns_with_missing_data = missing_values_count[missing_values_count > 0]
print(columns_with_missing_data)

SalePrice is still missing for 1459 rows, which are the rows of our test set. This is expected, because these are the values we are going to predict.

### Fixing Skewed Features

In [None]:
from operator import itemgetter
# TODO Haakon look over
def find_skewness(train, numeric_cols):
    """
    Calculate the skewness of the columns and segregate the positive
    and negative skewed data.
    """
    skew_dict = {}
    for col in numeric_cols:
        skew_dict[col] = train[col].skew()

    skew_dict = dict(sorted(skew_dict.items(),key=itemgetter(1)))
    positive_skew_dict = {k:v for (k,v) in skew_dict.items() if v>0}
    negative_skew_dict = {k:v for (k,v) in skew_dict.items() if v<0}
    return skew_dict, positive_skew_dict, negative_skew_dict

def add_constant(data, highly_pos_skewed):
    """
    Look for zeros in the columns. If zeros are present then the log(0) would result in -infinity.
    So before transforming it we need to add it with some constant.
    """
    C = 1
    for col in highly_pos_skewed.keys():
        if(col != 'SalePrice'):
            if(len(data[data[col] == 0]) > 0):
                data[col] = data[col] + C
    return data

def log_transform(data, highly_pos_skewed):
    """
    Log transformation of highly positively skewed columns.
    """
    for col in highly_pos_skewed.keys():
        if(col != 'SalePrice'):
            data[col] = np.log10(data[col])
    return data

def sqrt_transform(data, moderately_pos_skewed):
    """
    Square root transformation of moderately skewed columns.
    """
    for col in moderately_pos_skewed.keys():
        if(col != 'SalePrice'):
            data[col] = np.sqrt(data[col])
    return data

def reflect_sqrt_transform(data, moderately_neg_skewed):
    """
    Reflection and square root transformation of moderately negatively
    skewed columns.
    """
    for col in moderately_neg_skewed.keys():
        if(col != 'SalePrice'):
            K = max(data[col]) + 1
            data[col] = np.sqrt(K - data[col])
    return data


"""
If skewness is less than -1 or greater than 1, the distribution is highly skewed.
If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
"""
skew_dict, positive_skew_dict, negative_skew_dict = find_skewness(train_test_df, numerical)
moderately_pos_skewed = {k:v for (k,v) in positive_skew_dict.items() if v>0.5 and v<=1}
highly_pos_skewed = {k:v for (k,v) in positive_skew_dict.items() if v>1}
# Moderately negative skew lies between -1 and -0.5 (see the rule above)
moderately_neg_skewed = {k:v for (k,v) in negative_skew_dict.items() if v>=-1 and v<=-0.5}
highly_neg_skewed = {k:v for (k,v) in negative_skew_dict.items() if v<-1}

'''Transform the combined train and test data.'''
train_test_df  = add_constant(train_test_df , highly_pos_skewed)
train_test_df  = log_transform(train_test_df , highly_pos_skewed)
train_test_df  = sqrt_transform(train_test_df , moderately_pos_skewed)
train_test_df  = reflect_sqrt_transform(train_test_df , moderately_neg_skewed )
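
To see what these transformations do to skewness, here is a minimal sketch on synthetic data. The log-normal sample is an assumption chosen only because it is clearly right-skewed. Unlike `add_constant` above, the shift by 1 is applied unconditionally here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature: 2000 log-normal draws (all strictly positive)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

logged = np.log10(raw + 1)  # shift by 1, then log10 (for highly skewed columns)
rooted = np.sqrt(raw)       # square root (for moderately skewed columns)

print("skew before:     ", round(raw.skew(), 2))
print("skew after log10:", round(logged.skew(), 2))
print("skew after sqrt: ", round(rooted.skew(), 2))
```

Both transformations should pull the skewness of such a sample closer to zero, with the logarithm compressing the long right tail more strongly than the square root.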


## Encoding
To use our models we have to encode the categorical columns in our dataset. To do this we will use scikit-learn's OneHotEncoder.

In [None]:
# Identify categorical columns
categorical_columns = train_test_df.select_dtypes(include=['object']).columns

# Preprocess the data to ensure categorical columns contain only strings
train_test_df[categorical_columns] = train_test_df[categorical_columns].astype(str)

# Initialize the OneHotEncoder
# (scikit-learn >= 1.2 names this argument sparse_output; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore', drop='first')

# Fit and transform the categorical data using one-hot encoding
X_encoded = encoder.fit_transform(train_test_df[categorical_columns])

# Get the feature names
feature_names = encoder.get_feature_names_out(input_features=categorical_columns)

# Create a DataFrame with the one-hot encoded features
X_encoded_train_test_df = pd.DataFrame(X_encoded, columns=feature_names)

# Reset the index of both DataFrames
X_encoded_train_test_df.reset_index(drop=True, inplace=True)
train_test_df.reset_index(drop=True, inplace=True)

# Combine the one-hot encoded features with the original numerical features
train_test_df = pd.concat([X_encoded_train_test_df, train_test_df.drop(categorical_columns, axis=1)], axis=1)

## Split
Now we will split the combined data back into the train and test sets, so that we can train a regression model and then use this model to predict the test data.

In [None]:
# nan_mask = train_df['SalePrice'].isna()
nan_mask = train_test_df['SalePrice'].isna()

# Use diff() to find the transition from non-NaN to NaN
transition_mask = nan_mask.diff() == True

# Find the index of the first occurrence of the transition
border_index = transition_mask.idxmax()

print(border_index)

HPP_data_Train = train_test_df.iloc[:border_index,:].copy()

#Dropping SalePrice in test-data. This is the value we are going to predict.
#drop() returns a new DataFrame, which avoids modifying a slice in place.
HPP_data_Test = train_test_df.iloc[border_index:,:].drop(['SalePrice'],axis=1)
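
The idea behind the split can be sketched on toy data. Note that this sketch uses a boolean mask on the missing target instead of the border index computed above. It assumes that no training row has a missing SalePrice, and the toy values are made up.

```python
import pandas as pd

# Hypothetical miniature train and test sets
train = pd.DataFrame({"GrLivArea": [1500, 1700, 900],
                      "SalePrice": [200000, 230000, 120000]})
test = pd.DataFrame({"GrLivArea": [1100, 2000]})

# After concatenation the test rows have NaN in SalePrice
combined = pd.concat([train, test], axis=0, ignore_index=True)

is_test = combined["SalePrice"].isna()
train_part = combined[~is_test].copy()
test_part = combined[is_test].drop(columns=["SalePrice"])

print(train_part.shape, test_part.shape)
```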

# Model building

## Model 1 - Linear regression

### Linear regression & performance

In [None]:
# Import the necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# X_train contains features, and y_train contains the target variable (SalePrice).
X_train= HPP_data_Train.drop(['SalePrice'],axis=1)
y_train= HPP_data_Train['SalePrice']

#Initialize the linear regression model
linear_regression_model = LinearRegression()

#Cross Validation
k_folds = 5
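# Pass the KFold object itself so that shuffle and random_state take effect;
# get_n_splits() only returns the integer number of folds.
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)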
rmse= np.sqrt(-cross_val_score(linear_regression_model, X_train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
print(rmse.mean())

#Below we split the training data into train and test, so that we have labels for the actual Sale price.
#This is used to analyze our model to further enhance it. 
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Fit the linear_regression_model to the training data
linear_regression_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = linear_regression_model.predict(X_test)

# Create a scatter plot to visualize the correlation
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, c='blue', alpha=0.5, label='Data Points')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--', label='y = x')
plt.xlabel("Actual Sale Price (y_test)")
plt.ylabel("Predicted Sale Price (y_pred)")
plt.title("Correlation between Actual and Predicted Sale Prices")
plt.legend()
plt.show()
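
The score printed by the cross-validation above is a root-mean-squared error (RMSE), expressed in the same unit as the sale price. scikit-learn's "neg_mean_squared_error" scorer returns the negated MSE, which is why the scores are negated before the square root is taken. A minimal sketch of the metric itself, with made-up prices and a helper `rmse` that we introduce only for illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical actual and predicted sale prices
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]

print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones.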

### Submission of LR predictions

In [None]:
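# X_train contains features, and y_train contains the target variable (SalePrice).
X_train= HPP_data_Train.drop(['SalePrice'],axis=1)
y_train= HPP_data_Train['SalePrice']

# Refit on the full training data and predict the Kaggle test set.
# (y_pred from the cell above only holds predictions for the 20% hold-out split.)
linear_regression_model.fit(X_train, y_train)
pred_LR=pd.DataFrame(linear_regression_model.predict(HPP_data_Test))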
# Create a DataFrame with 'Id' values (1461 to N+1460) and the 'SalePrice' values from 'pred'
pred_LR['Id'] = range(1461, 1461 + len(pred_LR))
pred_LR = pred_LR.rename(columns={0: 'SalePrice'})
# Create a new DataFrame with columns named "Id" and "SalePrice"
result_df = pred_LR[['Id', 'SalePrice']]
# Save the DataFrame to a CSV file
result_df.to_csv('predicted_saleprice_LR.csv', index=False)

## Model 2 - Random forest

### Random forest & performance

In [None]:
# Import the necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X_train should contain your features, and y_train should contain the target variable (SalePrice).
X_train = HPP_data_Train.drop(['SalePrice'], axis=1)
y_train = HPP_data_Train['SalePrice']

# Initialize the RandomForestRegressor model
random_forest_regressor = RandomForestRegressor(n_estimators=100, random_state=42, max_depth=11)

#Cross Validation
k_folds = 5
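# Pass the KFold object itself so that shuffle and random_state take effect;
# get_n_splits() only returns the integer number of folds.
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)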
rmse= np.sqrt(-cross_val_score(random_forest_regressor, X_train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
print(rmse.mean())

# Split your data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Fit the model to the training data
random_forest_regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = random_forest_regressor.predict(X_test)

# Create a scatter plot to visualize the correlation
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, c='blue', alpha=0.5, label='Data Points')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--', label='y = x')
plt.xlabel("Actual SalePrice (y_test)")
plt.ylabel("Predicted Sale Price (y_pred)")
plt.title("Correlation between Actual and Predicted Sale Prices")
plt.legend()
plt.show()

### Submission of Random Forest prediction

In [None]:
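# X_train contains features, and y_train contains the target variable (SalePrice).
X_train= HPP_data_Train.drop(['SalePrice'],axis=1)
y_train= HPP_data_Train['SalePrice']
# Refit on the full training data and predict the Kaggle test set.
# (y_pred from the cell above only holds predictions for the 20% hold-out split.)
random_forest_regressor.fit(X_train, y_train)
pred_RF=pd.DataFrame(random_forest_regressor.predict(HPP_data_Test))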
# Create a DataFrame with 'Id' values (1461 to N+1460) and the 'SalePrice' values from 'pred'
pred_RF['Id'] = range(1461, 1461 + len(pred_RF))
pred_RF = pred_RF.rename(columns={0: 'SalePrice'})
# Create a new DataFrame with columns named "Id" and "SalePrice"
result_df = pred_RF[['Id', 'SalePrice']]
# Save the DataFrame to a CSV file
result_df.to_csv('predicted_saleprice_RF.csv', index=False)

## Model 3 - XGBoost

### XGBoost & performance

In [None]:
# Import the necessary libraries
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt

# X_train should contain your features, and y_train should contain the target variable (SalePrice).
X_train = HPP_data_Train.drop(['SalePrice'], axis=1)
y_train = HPP_data_Train['SalePrice']

# Initialize the XGBoost model
xgboost_regressor = xgb.XGBRegressor(n_estimators=100, random_state=42, max_depth=2)  # You can adjust the number of estimators as needed

# Cross Validation
k_folds = 5
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)
rmse = np.sqrt(-cross_val_score(xgboost_regressor, X_train.values, y_train, scoring="neg_mean_squared_error", cv=kf))
print(rmse.mean())

# Split your data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Fit the model to the training data
xgboost_regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = xgboost_regressor.predict(X_test)

# Create a scatter plot to visualize the correlation
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, c='blue', alpha=0.5, label='Data Points')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--', label='y = x')
plt.xlabel("Actual SalePrice (y_test)")
plt.ylabel("Predicted Sale Price (y_pred)")
plt.title("Correlation between Actual and Predicted Sale Prices")
plt.legend()
plt.show()

### Submission of XGBoost prediction


In [None]:
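# X_train contains features, and y_train contains the target variable (SalePrice).
X_train= HPP_data_Train.drop(['SalePrice'],axis=1)
y_train= HPP_data_Train['SalePrice']
# Refit on the full training data and predict the Kaggle test set.
# (y_pred from the cell above only holds predictions for the 20% hold-out split.)
xgboost_regressor.fit(X_train, y_train)
pred_XGB=pd.DataFrame(xgboost_regressor.predict(HPP_data_Test))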
# Create a DataFrame with 'Id' values (1461 to N+1460) and the 'SalePrice' values from 'pred'
pred_XGB['Id'] = range(1461, 1461 + len(pred_XGB))
pred_XGB = pred_XGB.rename(columns={0: 'SalePrice'})
# Create a new DataFrame with columns named "Id" and "SalePrice"
result_df = pred_XGB[['Id', 'SalePrice']]
# Save the DataFrame to a CSV file
result_df.to_csv('predicted_saleprice_XGB.csv', index=False)

# Conclusion & submission

In [None]:
# Our best model is the random forest (pred_RF)
pred = pred_RF
# Create a DataFrame with 'Id' values (1461 to N+1460) and the 'SalePrice' values from 'pred'
pred['Id'] = range(1461, 1461 + len(pred))
pred = pred.rename(columns={0: 'SalePrice'})
# Create a new DataFrame with columns named "Id" and "SalePrice"
result_df = pred[['Id', 'SalePrice']]
# Save the DataFrame to a CSV file
result_df.to_csv('predicted_saleprice.csv', index=False)