# House Price Prediction
**Workspace for the [Machine Learning course](https://www.kaggle.com/learn/machine-learning).**


Reading the csv files and printing basic details about the data

In [None]:
import pandas as pd

pd.set_option('display.max_rows', 5)
main_file_path = '../input/house-prices-advanced-regression-techniques/train.csv' # this is the path to the Iowa data that you will use
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
# test = pd.read_csv('test.csv')
# main_file_path = "train.csv"
data = pd.read_csv(main_file_path)

print(data.describe())
print(data.head())
print(data.columns)

**Correlation Plot**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

corrmat = data.corr(numeric_only=True)  # correlation is only defined for numeric columns
f, ax = plt.subplots(figsize=(20, 9))
sns.heatmap(corrmat, vmax=.8, annot=True);

In [None]:
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(data[cols], height=2.5)  # `size` was renamed to `height` in seaborn
plt.show();

**Analysis on SalePrice**

In [None]:
from scipy import stats
from scipy.stats import norm, skew #for some statistics

sns.distplot(data['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(data['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

fig = plt.figure()
res = stats.probplot(data['SalePrice'], plot=plt)
plt.show()

**Checking Sale Price Column**

In [None]:
print(data["SalePrice"].describe())
print(data["SalePrice"].head())

# Feature Selection

In [None]:
feature_list = ["LotArea","YearBuilt","1stFlrSF","2ndFlrSF","FullBath","BedroomAbvGr","TotRmsAbvGrd"]
X = data[feature_list]
y = data["SalePrice"]
X.head()
y.head()

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor as dt

iowa_model = dt()
iowa_model.fit(X,y)

In [None]:
iowa_model.predict(X.head())

**Checking MAE using built-in method**

In [None]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = iowa_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
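
For reference, the built-in metric is simply the mean of the absolute differences between actual and predicted values. The sketch below checks this by hand; the numbers are made up for illustration and are not taken from the Iowa data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy values chosen for illustration only (not from the Iowa data).
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# MAE is the average of the absolute prediction errors.
manual_mae = np.mean(np.abs(actual - predicted))
builtin_mae = mean_absolute_error(actual, predicted)

print(manual_mae, builtin_mae)
```

Both values should agree, which is a quick sanity check that the arguments were passed in a sensible form.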

**Splitting into test-train sets**

In [None]:
# split into training and validation data using scikit-learn's built-in function

from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)

iowa_model.fit(train_X,train_y)
predicted_home_prices = iowa_model.predict(val_X)
mean_absolute_error(val_y,predicted_home_prices)

**Searching for optimal leaf nodes**

In [None]:
def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = dt(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return(mae)

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))
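
A tree with too few leaves tends to underfit and one with too many tends to overfit, so the usual next step is to keep the size with the lowest validation MAE. The sketch below shows one way to automate that choice. It runs on synthetic data from `make_regression` so that it is self-contained; the variable names (`X_syn`, `scores`, `best_size`) are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: used only so the sketch runs on its own).
X_syn, y_syn = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, random_state=0)

candidates = [5, 50, 500, 5000]
scores = {}
for n in candidates:
    # Fit one tree per candidate size and record its validation MAE.
    tree = DecisionTreeRegressor(max_leaf_nodes=n, random_state=0).fit(X_tr, y_tr)
    scores[n] = mean_absolute_error(y_va, tree.predict(X_va))

# Keep the candidate with the lowest validation error.
best_size = min(scores, key=scores.get)
print(best_size, scores[best_size])
```

With the notebook's own data, the same pattern would apply to `train_X`, `val_X`, `train_y` and `val_y`.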

# Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
forest_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, forest_preds))

In [None]:
test_features = test[feature_list]
predicted_prices = forest_model.predict(test_features)
print(predicted_prices)

In [None]:
# create submission file called submission.csv
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
my_submission.to_csv('submission.csv', index=False)

# Handling Missing Values

**Drop Columns with Missing Values**

In [None]:
# NOTE: score_dataset is defined in the Imputation cell below; run that cell first.
data_without_missing_values = data.dropna(axis=1)

# Only columns that are present in train_X can be dropped from it
cols_with_missing = [col for col in train_X.columns if train_X[col].isnull().any()]
reduced_X_train = train_X.drop(cols_with_missing, axis=1)
reduced_X_test  = val_X.drop(cols_with_missing, axis=1)

print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, train_y, val_y))

**Imputation**

Imputation fills in each missing value with some number. By default, the imputer fills in the mean of the column. Statisticians have researched more complex strategies, but those strategies typically give no additional benefit once you plug the results into sophisticated machine learning models.

One (of many) nice things about imputation is that it can be included in a scikit-learn Pipeline. Pipelines simplify model building, model validation and model deployment.

In [None]:
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)

my_imputer = SimpleImputer()  # default strategy fills in the column mean
imputed_X_train = my_imputer.fit_transform(train_X)
imputed_X_test = my_imputer.transform(val_X)
print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, train_y, val_y))

In [None]:
data

**One Hot Encoding for Categorical Data**

Use pd.get_dummies() to get encodings
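
As a small illustration of what `pd.get_dummies()` produces, the sketch below encodes a toy frame. The frame and its values are hypothetical and only mimic the shape of the Iowa data.

```python
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Numeric columns pass through unchanged; each category of a text column
# becomes its own indicator column named <column>_<category>.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Note that a frame containing only numeric columns comes back unchanged, so one-hot encoding only makes a difference when the predictors include text (object) columns.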

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def get_mae(X, y):
    # multiply by -1 to get a positive MAE; sklearn's convention is to return the negated value
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50),
                                X, y, 
                                scoring = 'neg_mean_absolute_error').mean()

predictors_without_categoricals = X.select_dtypes(exclude=['object'])

mae_without_categoricals = get_mae(predictors_without_categoricals, y)

one_hot_encoded_training_predictors = pd.get_dummies(X)
mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, y)

print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))

In [None]:
print(one_hot_encoded_training_predictors)

# XGBoost - eXtreme Gradient Boosting

In [None]:
from xgboost import XGBRegressor

# early_stopping_rounds is a constructor argument in current xgboost releases (fit() no longer accepts it)
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5)
my_model.fit(train_X, train_y, eval_set=[(val_X, val_y)], verbose=False)
predictions = my_model.predict(val_X)
print("Mean Absolute Error : " + str(mean_absolute_error(val_y, predictions)))

In [None]:
# create submission file called submission.csv, this time from the XGBoost model
test_features = test[feature_list]
predicted_prices = my_model.predict(test_features)
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
my_submission.to_csv('submission.csv', index=False)

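**XGBoost with One-Hot Encoding**

In [None]:
# Split the one-hot encoded rows so that predictors and targets stay aligned
ohe_train_X, ohe_val_X, ohe_train_y, ohe_val_y = train_test_split(one_hot_encoded_training_predictors, y, random_state=0)
ohe_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5)
ohe_model.fit(ohe_train_X, ohe_train_y, eval_set=[(ohe_val_X, ohe_val_y)], verbose=False)
predictions = ohe_model.predict(ohe_val_X)
print("Mean Absolute Error : " + str(mean_absolute_error(ohe_val_y, predictions)))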

# Partial Dependence Plot

**The partial dependence plot is calculated only after the model has been fit.**

Note: scikit-learn originally implemented partial dependence plots only for Gradient Boosting models. This was due to an implementation detail; current releases provide partial dependence for other fitted estimators as well, through the sklearn.inspection module.
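
To make the idea concrete, partial dependence for one feature can be computed by hand: force that feature to a fixed value for every row, predict, and average the predictions, then repeat over a grid of values. The sketch below does this on synthetic data with a linear model, chosen only because the expected answer is easy to verify. The helper `manual_partial_dependence` is hypothetical and is not a scikit-learn function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data where the true relationship is y = 2*x0 + 3*x1 (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 2 * X_toy[:, 0] + 3 * X_toy[:, 1]

# The model must be fitted before any partial dependence can be computed.
model = LinearRegression().fit(X_toy, y_toy)

def manual_partial_dependence(fitted_model, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        averages.append(fitted_model.predict(X_mod).mean())
    return np.array(averages)

grid = np.linspace(0, 10, 5)
pd_values = manual_partial_dependence(model, X_toy, feature=0, grid=grid)
print(pd_values)
```

For this linear toy model the curve rises by 2 per unit of the first feature, matching the coefficient used to generate the data.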

In [None]:
from sklearn.inspection import PartialDependenceDisplay
from sklearn.ensemble import GradientBoostingRegressor
my_model = GradientBoostingRegressor()
# fit the model as usual
my_model.fit(X, y)
# sklearn.ensemble.partial_dependence was removed from scikit-learn; sklearn.inspection replaces it
my_plots = PartialDependenceDisplay.from_estimator(my_model,
                                   X,               # raw predictors data; its column names label the graphs
                                   features=[0, 2], # column numbers of plots we want to show (LotArea, 1stFlrSF)
                                   grid_resolution=10) # number of values to plot on x axis

# Pipelines

Pipelines are a simple way to keep your data processing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

In [None]:
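from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())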

In [None]:
my_pipeline.fit(train_X,train_y)
predictions = my_pipeline.predict(val_X)
predictions

**Understanding Pipelines**

Most scikit-learn objects are either transformers or models.

Transformers are for pre-processing before modeling. The imputer class (SimpleImputer in current scikit-learn, formerly Imputer), which fills in missing values, is an example of a transformer. Over time, you will learn many more transformers, and you will frequently use multiple transformers sequentially.

Models are used to make predictions. You will usually preprocess your data (with transformers) before putting it in a model.

You can tell if an object is a transformer or a model by how you apply it. After fitting a transformer, you apply it with the transform command. After fitting a model, you apply it with the predict command. Your pipeline must start with transformer steps and end with a model. This is what you'd want anyway.
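
The sketch below illustrates the distinction on toy data: the transformer is applied with `transform`, the model with `predict`, and a pipeline of the two gives the same predictions through a single `fit`/`predict` interface. `LinearRegression` is swapped in for the random forest here only so that the result is deterministic; the data is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data with one missing value (illustrative only).
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([2.0, 4.0, 5.0, 8.0])

# Transformer: fit, then transform.
imputer = SimpleImputer()
X_filled = imputer.fit_transform(X_toy)

# Model: fit, then predict.
model = LinearRegression()
model.fit(X_filled, y_toy)
manual_preds = model.predict(X_filled)

# The pipeline runs the same two steps behind one fit/predict interface.
pipeline = make_pipeline(SimpleImputer(), LinearRegression())
pipeline.fit(X_toy, y_toy)
pipeline_preds = pipeline.predict(X_toy)

print(np.allclose(manual_preds, pipeline_preds))
```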

Eventually you will want to apply more transformers and combine them more flexibly. We will cover this later in an Advanced Pipelines tutorial.

# Cross Validation

In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality. For example, we could have 5 folds or experiments. We divide the data into 5 pieces, each being 20% of the full dataset.
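
A minimal sketch of how five folds partition a dataset, using scikit-learn's `KFold` on 100 placeholder rows (the row count is an assumption chosen to make the 20% split easy to see):

```python
import numpy as np
from sklearn.model_selection import KFold

# 100 placeholder rows standing in for a dataset (illustrative only).
rows = np.arange(100)

fold_sizes = []
validated = []
for train_idx, val_idx in KFold(n_splits=5).split(rows):
    # Each fold holds out a different 20% for validation and trains on the other 80%.
    fold_sizes.append((len(train_idx), len(val_idx)))
    validated.extend(val_idx.tolist())

print(fold_sizes)
```

Every row ends up in the validation set exactly once, which is why the averaged score uses all of the data.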

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(my_pipeline, X, y, scoring='neg_mean_absolute_error')
print(scores)

In [None]:
print('Mean Absolute Error %.2f' %(-1 * scores.mean()))

# Data Leakage

**Leaky Predictors**

This occurs when your predictors include data that will not be available at the time you make predictions.
Check which predictors might depend on the target variable and drop them from training.

**Leaky Validation Strategy**

A different type of leak occurs when you aren't careful to distinguish training data from validation data.
Fit preprocessing steps and models on the training data only, never on validation or test data.
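
A small sketch of how a validation leak can creep in through preprocessing: an imputer fitted on all rows learns a fill value that is influenced by the validation data. The numbers are made up to exaggerate the effect.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature column split into training and validation rows (illustrative only).
train_col = np.array([[1.0], [2.0], [np.nan], [3.0]])
val_col = np.array([[100.0], [np.nan]])

# Leaky: the imputer sees validation rows, so its mean is pulled toward them.
leaky_imputer = SimpleImputer().fit(np.vstack([train_col, val_col]))

# Safe: the imputer learns its fill value from training rows only.
safe_imputer = SimpleImputer().fit(train_col)

print(leaky_imputer.statistics_[0], safe_imputer.statistics_[0])
```

Putting the imputer inside a pipeline and passing the pipeline to `cross_val_score` avoids this, because each fold then fits the preprocessing on its own training portion.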