# Predict sales prices and practice feature engineering, random forests, and gradient boosting

## 1. Problem definition


> Predict the value of the SalePrice variable. 

## 2. Data


https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data



**Data details:**

Here's a brief version of what you'll find in the data description file.

* SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* MSZoning: The general zoning classification
* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* Street: Type of road access
* Alley: Type of alley access
* LotShape: General shape of property
* LandContour: Flatness of the property
* Utilities: Type of utilities available
* LotConfig: Lot configuration
* LandSlope: Slope of property
* Neighborhood: Physical locations within Ames city limits
* Condition1: Proximity to main road or railroad
* Condition2: Proximity to main road or railroad (if a second is present)
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* YearRemodAdd: Remodel date
* RoofStyle: Type of roof
* RoofMatl: Roof material
* Exterior1st: Exterior covering on house
* Exterior2nd: Exterior covering on house (if more than one material)
* MasVnrType: Masonry veneer type
* MasVnrArea: Masonry veneer area in square feet
* ExterQual: Exterior material quality
* ExterCond: Present condition of the material on the exterior
* Foundation: Type of foundation
* BsmtQual: Height of the basement
* BsmtCond: General condition of the basement
* BsmtExposure: Walkout or garden level basement walls
* BsmtFinType1: Quality of basement finished area
* BsmtFinSF1: Type 1 finished square feet
* BsmtFinType2: Quality of second finished area (if present)
* BsmtFinSF2: Type 2 finished square feet
* BsmtUnfSF: Unfinished square feet of basement area
* TotalBsmtSF: Total square feet of basement area
* Heating: Type of heating
* HeatingQC: Heating quality and condition
* CentralAir: Central air conditioning
* Electrical: Electrical system
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* LowQualFinSF: Low quality finished square feet (all floors)
* GrLivArea: Above grade (ground) living area square feet
* BsmtFullBath: Basement full bathrooms
* BsmtHalfBath: Basement half bathrooms
* FullBath: Full bathrooms above grade
* HalfBath: Half baths above grade
* BedroomAbvGr: Number of bedrooms above basement level
* KitchenAbvGr: Number of kitchens
* KitchenQual: Kitchen quality
* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
* Functional: Home functionality rating
* Fireplaces: Number of fireplaces
* FireplaceQu: Fireplace quality
* GarageType: Garage location
* GarageYrBlt: Year garage was built
* GarageFinish: Interior finish of the garage
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* GarageQual: Garage quality
* GarageCond: Garage condition
* PavedDrive: Paved driveway
* WoodDeckSF: Wood deck area in square feet
* OpenPorchSF: Open porch area in square feet
* EnclosedPorch: Enclosed porch area in square feet
* 3SsnPorch: Three season porch area in square feet
* ScreenPorch: Screen porch area in square feet
* PoolArea: Pool area in square feet
* PoolQC: Pool quality
* Fence: Fence quality
* MiscFeature: Miscellaneous feature not covered in other categories
* MiscVal: Dollar value of miscellaneous feature
* MoSold: Month Sold
* YrSold: Year Sold
* SaleType: Type of sale
* SaleCondition: Condition of sale





## 3. Evaluation

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted sale prices. Taking logs means that errors on expensive and cheap houses affect the result equally.
 
 https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/evaluation
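
As a rough illustration of the metric, RMSLE can be obtained from scikit-learn's `mean_squared_log_error` by taking a square root. The helper name `rmsle` and the sample prices below are made up for this sketch; note that scikit-learn applies log(1 + x) rather than a plain log, which makes almost no difference at house-price scale.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle(y_true, y_pred):
    """Root mean squared log error (hypothetical helper for illustration)."""
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

# Made-up prices, only to show the call
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# Equivalent manual form: RMSE on log1p-transformed values
manual = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
print(rmsle(y_true, y_pred), manual)
```

Lower is better, and a perfect prediction gives 0.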



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn


In [None]:
df_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
df_train.head()

In [None]:
df_train.info()

In [None]:
df_train.isna().sum()

In [None]:
df_train.columns

In [None]:
fig, ax = plt.subplots()
ax.scatter(df_train['Id'], df_train['SalePrice'])

In [None]:
df_train.head()

In [None]:
df_train.head().transpose()

## Manipulating data

In [None]:
df_train.dtypes.T

### Preprocessing the data 

In [None]:
def preprocess_data(df):
    """
    Performs transformations on df and returns transformed df.
    """
    # Fill missing values in numeric columns with the column median
    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                # Add a binary column which tells us if the data was missing or not
                df[label+"_is_missing"] = pd.isnull(content)
                # Fill missing numeric values with median
                df[label] = content.fillna(content.median())
    
        # Flag missing categorical data and turn categories into numbers
        if not pd.api.types.is_numeric_dtype(content):
            df[label+"_is_missing"] = pd.isnull(content)
            # We add +1 to the category code because pandas encodes missing categories as -1
            df[label] = pd.Categorical(content).codes+1
    
    return df
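
To make the `+1` step in `preprocess_data` concrete, here is a small sketch on a made-up column (the values are illustrative, not taken from the dataset). It shows how `pd.Categorical` encodes a missing value as -1 and how the shift moves it to 0.

```python
import pandas as pd

# Toy column with one missing value (made-up data)
s = pd.Series(["Pave", "Grvl", None, "Pave"])

codes = pd.Categorical(s).codes   # categories are sorted: Grvl -> 0, Pave -> 1, missing -> -1
shifted = codes + 1               # missing becomes 0, real categories start at 1

print(codes.tolist())    # [1, 0, -1, 1]
print(shifted.tolist())  # [2, 1, 0, 2]
```

One thing to keep in mind: the codes depend on which categories are present in the column, so encoding the training and test frames separately can assign different numbers to the same category.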

In [None]:
# Process the training data
df_train = preprocess_data(df_train)
df_train.head()

In [None]:
# Make a copy
df_train_cop = df_train.copy()


In [None]:
df_train_cop.info()


### Save preprocessed data

In [None]:
# df_train_cop.to_csv('data/df_train_cop.csv', index=False)

In [None]:
# df_train_cop = pd.read_csv('data/df_train_cop.csv')

In [None]:
df_train_cop.head()

In [None]:
df_train_cop.info()

### Modelling

In [None]:
%%time
# Instantiate model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_jobs=-1,
                              random_state=42)

# Fit the model
model.fit(df_train_cop.drop("SalePrice", axis=1), df_train_cop["SalePrice"])

In [None]:
# Score the model (R^2 on the same data it was trained on, so this is optimistic)
model.score(df_train_cop.drop("SalePrice", axis=1), df_train_cop["SalePrice"])


### Building an evaluation function

In [None]:
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error, mean_absolute_error, r2_score

X = df_train_cop.drop('SalePrice', axis=1)
y = df_train_cop['SalePrice']

# Fix the seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Put models in a dictionary (regressors only: SalePrice is continuous,
# so a classifier such as LogisticRegression does not apply here)
models = {'Linear Regression': LinearRegression(),
          'Random Forest': RandomForestRegressor(),
          'Lasso': Lasso(alpha=0.1)
          }

# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data (no labels)
    X_test : testing data (no labels)
    y_train : training labels
    y_test : test labels
    """
    # Set random seed
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and append its score to model_scores
        model_scores[name] = model.score(X_test, y_test)
    return model_scores
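
For regressors, `model.score` returns the coefficient of determination (R²), so higher is better and the values are not errors in dollars. The sketch below checks this on synthetic data; the arrays `X_toy` and `y_toy` are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data (made-up, for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=50)

reg = LinearRegression().fit(X_toy, y_toy)

# .score() and r2_score() on the model's predictions agree
same = bool(np.isclose(reg.score(X_toy, y_toy), r2_score(y_toy, reg.predict(X_toy))))
print(same)
```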
    

In [None]:
model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)

model_scores



In [None]:
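# Use the random forest fitted on X_train inside fit_and_score; the earlier
# `model` was trained on the full data, which includes the test rows
y_preds = models['Random Forest'].predict(X_test)


print('MAE', mean_absolute_error(y_test, y_preds))
print('MSLE', mean_squared_log_error(y_test, y_preds))
print('RMSLE', np.sqrt(mean_squared_log_error(y_test, y_preds)))
print('r-squared', r2_score(y_test, y_preds))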

## Testing our model on a subset (to tune the hyperparameters)

In [None]:
%%time

# Instantiate model
model = RandomForestRegressor(n_jobs=-1,
                              random_state=42)

# Fit the model
model.fit(X_train, y_train)
y_preds = model.predict(X_test)



### Hyperparameter tuning with RandomizedSearchCV


In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Different RandomForestRegressor hyperparameters
rf_grid = {'n_estimators': np.arange(10, 100, 10),
           'max_depth': [None, 3, 5, 10],
           'min_samples_split': np.arange(2, 20, 2),
           'min_samples_leaf': np.arange(1, 20, 2),
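           # "auto" was removed from recent scikit-learn releases; 1.0 (all features) is its equivalent for regressors
           'max_features': [0.5, 1, "sqrt", 1.0],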
           'max_samples': [500]}

# Instantiate RandomizedSearchCV model
rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs=-1,
                                                    random_state=42),
                              param_distributions=rf_grid,
                              n_iter=2,
                              cv=5,
                              verbose=True)

# Fit the RandomizedSearchCV model
rs_model.fit(X_train, y_train)

In [None]:
# Find the best model hyperparameters
rs_model.best_params_
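
Without a `scoring` argument, `RandomizedSearchCV` ranks candidates by the estimator's default score (R² for regressors), and without `random_state` it samples different candidates on each run. The following is a minimal sketch, on synthetic data, of a search that is reproducible and ranks candidates by mean squared log error instead. The data, the small `param_dist` grid, and the variable names are all assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic positive targets standing in for sale prices (made-up data)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
y_toy = np.exp(0.3 * X_toy[:, 0] + 12.0)

param_dist = {"n_estimators": [10, 20, 30],
              "max_depth": [None, 3, 5],
              "min_samples_leaf": [1, 3, 5]}

search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                            param_distributions=param_dist,
                            n_iter=3,
                            cv=3,
                            scoring="neg_mean_squared_log_error",
                            random_state=42)
search.fit(X_toy, y_toy)

# best_score_ is the negated MSLE, so flip the sign before the square root
best_rmsle = np.sqrt(-search.best_score_)
print(search.best_params_, best_rmsle)
```

Scikit-learn scorers follow a "greater is better" convention, which is why the error is reported with a negative sign.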

In [None]:
%%time

# Evaluate the RandomizedSearchCV model fitted above
# (predict uses the best estimator, refitted on all of X_train)
y_preds = rs_model.predict(X_test)

print('MAE', mean_absolute_error(y_test, y_preds))
print('MSLE', mean_squared_log_error(y_test, y_preds))
print('r-squared', r2_score(y_test, y_preds))


In [None]:
rs_model.best_params_

In [None]:
%%time

# Best hyperparameters found
ideal_model = RandomForestRegressor(n_estimators=40,
                                    min_samples_leaf=1,
                                    min_samples_split=14,
                                    max_features=0.5,
                                    n_jobs=-1,
                                    max_samples=None,
                                    random_state=42) # random state so our results are reproducible

# Fit the ideal model
ideal_model.fit(X_train, y_train)

In [None]:
# ideal_model was already fitted in the previous cell
y_preds = ideal_model.predict(X_test)

print('MAE', mean_absolute_error(y_test, y_preds))
print('MSLE', mean_squared_log_error(y_test, y_preds))
print('r-squared', r2_score(y_test, y_preds))


### Make predictions on test data

In [None]:
# Import the test data
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
df_test.head()

In [None]:
# Process the test data 
df_test = preprocess_data(df_test)
df_test.head()

In [None]:
df_train.head()

In [None]:
set(df_test.columns) - set(X_train.columns)

In [None]:
# Drop the missing-value indicator columns that exist only in df_test
# (these features had no missing values in the training data)
extra_cols = ['BsmtFinSF1_is_missing', 'BsmtFinSF2_is_missing',
              'BsmtFullBath_is_missing', 'BsmtHalfBath_is_missing',
              'BsmtUnfSF_is_missing', 'GarageArea_is_missing',
              'GarageCars_is_missing', 'TotalBsmtSF_is_missing']
df_test = df_test.drop(columns=extra_cols)
df_test.head()
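
Listing the extra columns by hand works for this dataset, but it has to be redone whenever the data changes. A possible alternative, sketched here on toy frames with made-up values, is to reindex the test frame to the training columns. `X_train_toy` and `df_test_toy` are stand-ins for `X_train` and `df_test`.

```python
import pandas as pd

# Toy frames standing in for X_train and df_test (made-up values)
X_train_toy = pd.DataFrame({"Id": [1, 2],
                            "LotArea": [8450, 9600],
                            "LotFrontage_is_missing": [False, True]})
df_test_toy = pd.DataFrame({"Id": [3],
                            "LotArea": [11622],
                            "GarageArea_is_missing": [True],
                            "LotFrontage_is_missing": [False]})

# Keep exactly the training columns, in training order; columns absent
# from the test frame would be created and filled with False
aligned = df_test_toy.reindex(columns=X_train_toy.columns, fill_value=False)
print(aligned.columns.tolist())  # ['Id', 'LotArea', 'LotFrontage_is_missing']
```

The `fill_value=False` choice only makes sense for `_is_missing` indicator columns; a missing feature column would need a real imputation strategy instead.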

In [None]:
# Make predictions on updated test data
test_preds = ideal_model.predict(df_test)

In [None]:
test_preds

## Format predictions as required by Kaggle

In [None]:
df_preds = pd.DataFrame()
df_preds['Id'] = df_test['Id']
df_preds["SalePrice"] = test_preds
df_preds

In [None]:
# Export prediction data
df_preds.to_csv("data/df_preds.csv", index=False)

In [None]:
df_preds = pd.read_csv('data/df_preds.csv')
df_preds