**[Introduction to Machine Learning Home Page](https://www.kaggle.com/learn/intro-to-machine-learning)**

---


# Introduction
Machine learning competitions are a great way to improve your data science skills and measure your progress. 

In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) and see how you stack up against others taking this micro-course.

The steps in this notebook are:
1. Build a Random Forest model with all of your data (**X** and **y**)
2. Read in the "test" data, which doesn't include values for the target.  Predict home values in the test data with your Random Forest model.
3. Submit those predictions to the competition and see your score.
4. Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.
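
The steps above can be sketched end to end as follows. This is a minimal illustration on a small synthetic table, not the competition data: the column names `LotArea` and `YearBuilt` are borrowed from the Iowa data set, while the generated values, the `make_homes` helper and the `submission_demo` name are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def make_homes(n, first_id):
    """Build a small synthetic table of homes (illustrative values only)."""
    return pd.DataFrame({
        "Id": np.arange(first_id, first_id + n),
        "LotArea": rng.integers(2000, 20000, size=n),
        "YearBuilt": rng.integers(1900, 2010, size=n),
    })

features = ["LotArea", "YearBuilt"]

# Step 1: fit on all of the training data (X and y)
train = make_homes(200, first_id=1)
train["SalePrice"] = 50 * train["LotArea"] + 500 * (train["YearBuilt"] - 1900)
X, y = train[features], train["SalePrice"]
rf_model = RandomForestRegressor(n_estimators=50, random_state=1)
rf_model.fit(X, y)

# Step 2: the "test" data has the same features but no target column
test = make_homes(50, first_id=201)
test_preds = rf_model.predict(test[features])

# Step 3: a submission holds one Id and one prediction per test row
submission_demo = pd.DataFrame({"Id": test["Id"], "SalePrice": test_preds})
print(submission_demo.head())
```

Step 4 then amounts to changing `features` or the model and repeating the same three steps.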

## Recap
Here's the code you've written so far. Start by running it again.

In [None]:
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex7 import *

In [None]:
# Code you have previously used to load data
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from xgboost import XGBRegressor
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from scipy.stats import norm, skew

In [None]:
# Treating outliers: cap values that lie far from the mean

def upper_cap(data, col):
    # Upper bound: mean + 2.1 standard deviations
    return np.mean(data[col]) + np.std(data[col]) * 2.1

def lower_cap(data, col):
    # Lower bound: mean - 2.5 standard deviations
    return np.mean(data[col]) - np.std(data[col]) * 2.5

def capping(data, cols):
    # Clip each listed column to [lower_cap, upper_cap]; modifies `data` in place
    for col in cols:
        data.loc[data[col] > upper_cap(data, col), col] = upper_cap(data, col)
        data.loc[data[col] < lower_cap(data, col), col] = lower_cap(data, col)
    return data[cols]

# Removing elements from a list (in place); the same list object is returned
def element_remove(items, to_remove):
    for element in to_remove:
        if element in items:
            items.remove(element)
    return items
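
For comparison, the capping logic can also be written so that both bounds are computed once, before any value is changed, and the input frame is left untouched. The sketch below is a hedged alternative rather than a drop-in replacement: `cap_outliers` and its `upper_k` / `lower_k` parameters are hypothetical names, and `Series.clip` stands in for the `.loc` assignments.

```python
import pandas as pd

def cap_outliers(df, cols, upper_k=2.1, lower_k=2.5):
    """Return a copy of df[cols] clipped to [mean - lower_k*std, mean + upper_k*std]."""
    capped = df[cols].copy()
    for col in cols:
        mean = capped[col].mean()
        std = capped[col].std(ddof=0)  # population std, matching NumPy's default
        capped[col] = capped[col].clip(lower=mean - lower_k * std,
                                       upper=mean + upper_k * std)
    return capped

# Toy column with a single extreme value (illustrative numbers)
toy = pd.DataFrame({"area": [10.0, 11.0, 12.0, 13.0, 14.0,
                             15.0, 16.0, 17.0, 18.0, 500.0]})
capped_toy = cap_outliers(toy, ["area"])
print(capped_toy["area"].max())  # the 500.0 outlier is pulled down to the upper bound
print(toy["area"].max())         # the original frame is unchanged
```

Because the bounds come from the unmodified column, the lower bound is not affected by values that were already capped from above.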

In [None]:
# Path of the file to read. We changed the directory structure to simplify submitting to a competition
iowa_file_path = '../input/train.csv'
# path to file you will use for predictions
test_data_path = '../input/test.csv'

# read training data file using pandas
home_data = pd.read_csv(iowa_file_path)
# read test data file using pandas
test_data = pd.read_csv (test_data_path)

sns.set_style ('whitegrid')

In [None]:
plt.scatter(home_data ['GrLivArea'], home_data ['SalePrice'])

In [None]:
home_data = home_data.drop(home_data[(home_data['GrLivArea']>4000) & (home_data['SalePrice']<300000)].index)

#Check the graphic
fig, ax = plt.subplots()
ax.scatter(home_data ['GrLivArea'], home_data ['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

In [None]:
home_data ['SalePrice'].describe ()

In [None]:
Y = home_data ['SalePrice']
train_ID = home_data ['Id']
test_ID = test_data ['Id']

home_data.drop ('SalePrice', axis = 1, inplace = True)
home_data.drop ('Id', axis = 1, inplace = True)
test_data.drop ('Id', axis = 1, inplace = True)

In [None]:
y = np.log1p (Y)
sns.histplot (y, kde = True)  # distplot is deprecated in recent seaborn releases
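
Since the target is transformed with `np.log1p`, predictions made on that scale have to be mapped back with its inverse, `np.expm1`. The short check below uses illustrative prices (not taken from the data set) to show the round trip, and what happens when plain `np.exp` is used instead.

```python
import numpy as np

prices = np.array([34900.0, 163000.0, 755000.0])  # illustrative sale prices

log_prices = np.log1p(prices)      # log(1 + price), the transform applied to the target
recovered = np.expm1(log_prices)   # exp(x) - 1, the exact inverse of log1p
off_by_one = np.exp(log_prices)    # plain exp undoes log, not log1p

print(recovered - prices)    # approximately zero
print(off_by_one - prices)   # approximately one for every entry
```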

In [None]:
data = pd.concat ([home_data, test_data])

print ('data shape', data.shape)
print ('Dtypes in the home_data are : {}'.format (data.dtypes.unique ()))

categorical_cols = list (data.dtypes [(data.dtypes == object)|(data.dtypes == 'O')].index)
numerical_cols = list (data.dtypes [(data.dtypes == 'int64')|(data.dtypes == 'float64')].index)

print ('categorical_cols', len (categorical_cols))
print ('numerical_cols', len (numerical_cols))

Correcting Dtypes

In [None]:
data ['MSSubClass'] = data ['MSSubClass'].astype ('object')
data ['OverallQual'] = data ['OverallQual'].astype ('object')
data ['Fireplaces'] = data ['Fireplaces'].astype ('object')
data ['OverallCond'] = data ['OverallCond'].astype ('object')
data ['GarageCars'] = data ['GarageCars'].astype ('object')
data [['BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']] = data [['BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']].astype ('object')
data [['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath']] = data [['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath']].astype ('object')


categorical_cols = list (data.dtypes [(data.dtypes == object)|(data.dtypes == 'O')].index)

numerical_cols = list (data.dtypes [(data.dtypes == 'int64')|(data.dtypes == 'float64')].index)

print ('categorical_cols', len (categorical_cols))
print ('\n')
print ('numerical_cols', len (numerical_cols))

Categorical_cols : Number of levels per column / levels

In [None]:
# Number of distinct levels per categorical column
nlevels = {col: data [col].nunique () for col in categorical_cols}
cat_cols_nlevels = pd.Series (nlevels).sort_values ()
cat_cols_nlevels.plot.barh (figsize = (7,10))
plt.title ('Categorical_cols_nlevels')
plt.yticks (fontsize = 7)
plt.show ()

# Distinct levels per categorical column, as strings for printing
levels = {col: str (data [col].unique ()) for col in categorical_cols}
cat_cols_levels = pd.Series (levels)
print (cat_cols_levels)

Categorical_cols : Missing values

In [None]:
plt.figure (figsize = (8,10))
cat_nulls = data [categorical_cols].isnull ().sum ()
cat_nulls_gr0 = cat_nulls [cat_nulls > 0].sort_values ()
cat_nulls_gr0.plot.barh (title = 'Categorical_cols_MissingValues')
plt.show ()

In [None]:
print (cat_nulls_gr0.index, ' ')

In [None]:
data['MSZoning'] = data['MSZoning'].fillna(data['MSZoning'].mode()[0])
data['GarageCars'] = data['GarageCars'].fillna(0)
data['Exterior1st'] = data['Exterior1st'].fillna(data['Exterior1st'].mode()[0])
data['Exterior2nd'] = data['Exterior2nd'].fillna(data['Exterior2nd'].mode()[0])
data['Electrical'] = data['Electrical'].fillna(data['Electrical'].mode()[0])
data['Functional'] = data['Functional'].fillna(data['Functional'].mode()[0])
data['BsmtHalfBath'] = data['BsmtHalfBath'].fillna(0)
data['BsmtFullBath'] = data['BsmtFullBath'].fillna(0)
data['SaleType'] = data['SaleType'].fillna(data['SaleType'].mode()[0])
# In these columns a missing value is treated as "feature not present"
none_cols = ['FireplaceQu', 'Fence', 'Alley', 'MiscFeature', 'PoolQC', 'GarageType', 'GarageQual', 'GarageFinish', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual',
             'BsmtCond', 'BsmtExposure', 'GarageCond', 'MasVnrType']
data [none_cols] = data [none_cols].fillna ('None')
data['KitchenQual'] = data['KitchenQual'].fillna(data['KitchenQual'].mode()[0])
data['Utilities'] = data['Utilities'].fillna(data['Utilities'].mode()[0])

In [None]:
print (data [categorical_cols].isnull ().sum ().sum ())

Numerical_cols : Missing values

In [None]:
num_nulls = data [numerical_cols].isnull ().sum ()
num_nulls [num_nulls > 0].sort_values ().plot.barh (title = 'Numerical_cols_MissingValues')
plt.show ()

print (num_nulls [num_nulls>0])

In [None]:
data ['MasVnrArea'] = data ['MasVnrArea'].fillna (0)
data ['GarageYrBlt'] = data ['GarageYrBlt'].fillna (0)
data ['LotFrontage'] = data.groupby ('Neighborhood') ['LotFrontage'].transform (lambda x : x.fillna (np.mean (x)))
data ['BsmtFinSF1'] = data ['BsmtFinSF1'].fillna (0)
data ['BsmtFinSF2'] = data ['BsmtFinSF2'].fillna (0)
data ['TotalBsmtSF'] = data ['TotalBsmtSF'].fillna (0)
data ['BsmtUnfSF'] = data ['BsmtUnfSF'].fillna (0)
data ['GarageArea'] = data ['GarageArea'].fillna (0)

In [None]:
print (data [numerical_cols].isnull ().sum ().sum ())

# label encoding

In [None]:
categorical_cols_lb =['OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC']

In [None]:
for c in categorical_cols_lb:
    lbl = LabelEncoder() 
    lbl.fit(data[c])
    data[c] = lbl.transform(data[c])
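
One thing to keep in mind with `LabelEncoder` is that it assigns codes in sorted label order, so quality strings receive alphabetical codes rather than codes that follow the quality scale. The sketch below contrasts this with an explicit mapping. The `quality_order` dictionary is an assumption for illustration (poor < fair < typical < good < excellent, with `'None'` for a missing feature), not something defined elsewhere in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

quality = pd.Series(["TA", "Gd", "Ex", "None", "Fa", "Po"], name="quality")

# LabelEncoder sorts the class labels, so the codes follow alphabetical order
encoder = LabelEncoder().fit(quality)
alphabetical_codes = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
print(alphabetical_codes)

# An explicit mapping (assumed ordering) keeps the codes monotonic in quality
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
ordinal_codes = quality.map(quality_order)
print(ordinal_codes.tolist())
```

Whether the ordering matters depends on the model and on how the column is used later: tree-based models can split on arbitrary codes, while linear models read the codes as magnitudes.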

In [None]:
data [categorical_cols_lb].head (2)

In [None]:
print (data.shape)

# feature engineering

In [None]:
# numerical features
data ['TotalSF'] = data  ['TotalBsmtSF'] + data['1stFlrSF'] + data ['2ndFlrSF']
data ['YearBuilt_YrSold'] = data ['YrSold'] - data ['YearBuilt']
data ['YearRemodAdd_YrSold'] = data ['YrSold'] - data ['YearRemodAdd']
data ['GarageYrBlt_YrSold'] = data ['YrSold'] - data ['GarageYrBlt']
#data ['Other_rooms'] = data ['TotRmsAbvGrd'] - data ['BedroomAbvGr']
numerical_cols = element_remove (numerical_cols, ['YrSold', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt']) + ['TotalSF','YearBuilt_YrSold', 'YearRemodAdd_YrSold', 'GarageYrBlt_YrSold' ]
print (len (numerical_cols))

In [None]:
# categorical features
data ['BsmtBath'] = data ['BsmtFullBath'] + data ['BsmtHalfBath']
data ['Bath'] = data ['HalfBath'] + data ['FullBath']

categorical_cols = element_remove (categorical_cols, ['BsmtFullBath','FullBath', 'BsmtHalfBath', 'HalfBath']) + ['Bath', 'BsmtBath']
print (len (categorical_cols))

In [None]:
data = data [categorical_cols + numerical_cols]

In [None]:
data.shape

# checking skewness of numerical cols

In [None]:
skewed_cols = data [numerical_cols].apply (lambda x : skew (x)).sort_values (ascending = False)
skewness = pd.DataFrame ({'skew': skewed_cols})

In [None]:
skewness.tail (10)

In [None]:
skewness = skewness [abs (skewness) > 0.75].dropna ()

In [None]:
print ('Skewed numerical features {}'.format (skewness.shape [0]))

In [None]:
skewed_features = skewness.index

# Log-transform the skewed features (values are shifted by 10 before log1p)
for feat in skewed_features:
    data[feat] = np.log1p(data[feat]+10)

fig = plt.figure (figsize = (20,35))

for i, col in enumerate (numerical_cols): 
    fig.add_subplot (10,4,i+1) 
    data [col].hist ()
    plt.xticks (rotation = 90) 
    plt.xlabel (col) 
    plt.tight_layout ()
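
The skewness check and transform can be tried on synthetic data to see the effect in isolation. The sketch below is an illustration under stated assumptions: the two toy columns are generated, the 0.75 threshold is the one used in this notebook, and plain `np.log1p` is applied without the shift of 10.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1000),
    "symmetric": rng.normal(loc=50.0, scale=5.0, size=1000),
})

# Sample skewness per column; only columns above the threshold are transformed
skew_before = toy.apply(skew)
to_transform = skew_before[skew_before.abs() > 0.75].index

transformed = toy.copy()
transformed[to_transform] = np.log1p(transformed[to_transform])
skew_after = transformed.apply(skew)

print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```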

# Split test - train data now 

In [None]:
n_train = len (home_data)  # 1458 rows remain after the outliers were dropped
training_data = data [categorical_cols + numerical_cols].iloc [:n_train]
testing_data = data [categorical_cols + numerical_cols].iloc [n_train:].copy ()
testing_data ['Id'] = test_ID
print (training_data.shape, testing_data.shape)
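
The pattern of concatenating train and test data for shared preprocessing and then splitting them again can be sketched on a tiny example. Here the row count is recorded before concatenation instead of being written as a literal; the frame names and `Id` values are illustrative assumptions.

```python
import pandas as pd

train_part = pd.DataFrame({"LotArea": [8450, 9600, 11250]})
test_part = pd.DataFrame({"LotArea": [11622, 14267]})

# Shared preprocessing would be applied to the combined frame
combined = pd.concat([train_part, test_part])

n_train = len(train_part)                     # recorded row count, not a literal
train_again = combined.iloc[:n_train]
test_again = combined.iloc[n_train:].copy()   # copy before adding columns
test_again["Id"] = [1461, 1462]               # illustrative Ids

print(train_again.shape, test_again.shape)
```

Positional slicing with `iloc` is used because the concatenated frame keeps the original row labels of both parts, so labels can repeat.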

In [None]:
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(training_data, y, train_size=0.7, test_size=0.3,random_state=0)

In [None]:
print (X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)

In [None]:
X_train [categorical_cols].shape

# removing low variance categorical features

In [None]:
D = data [categorical_cols].describe (exclude = 'number').transpose ()
D ['portion'] = D['freq']*100/D ['count']
#print (D.sort_values (by = 'portion', ascending = False))
low_variance = D [D.portion > 90].index

In [None]:
categorical_cols = element_remove (categorical_cols, low_variance)
print (len (categorical_cols))

Categorical_cols - Univariate EDA

fig = plt.figure (figsize = (12,40))

for i, col in enumerate (categorical_cols):
    fig.add_subplot (20,4,i+1)
    sns.countplot (x = X_train [col])
    plt.xticks (rotation = 90, fontsize = 5)
plt.tight_layout ()

Categorical_cols - Bivariate EDA

fig = plt.figure (figsize = (12,40))

for i, col in enumerate (categorical_cols): 
    fig.add_subplot (20,4,i+1) 
    sns.boxplot (x = X_train [col], y = y_train) 
    plt.xticks (rotation = 90, fontsize = 6) 
    plt.tight_layout ()

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

f=[];p=[]

for col in categorical_cols:
    df = pd.DataFrame ({'Y': y_train, 'categorical_col': X_train [col]})
    model = ols ('Y ~ C(categorical_col)', data=df).fit()
    anova_table = sm.stats.anova_lm (model, typ=2)
    f.append(anova_table.iloc[0, 2])
    p.append(anova_table.iloc[0, 3])

In [None]:
anova = pd.DataFrame({'column': categorical_cols, 'F': f, 'p':p})
anova = anova.sort_values(by=['F'], ascending=False).reset_index(drop=True)
plt.figure(figsize=(15,8))

plt.subplot (1,2,1)
ax = sns.barplot(x=anova.F, y=anova.column)
plt.title('ANOVA F-values on Y', fontsize=14)

plt.subplot (1,2,2)
ax = sns.barplot(x=anova.p, y=anova.column)
plt.title('ANOVA p-values on Y', fontsize=14)
plt.tight_layout ()

print (anova ['F'].describe ())

In [None]:
categorical_cols = list (anova [anova.F > 10]['column'])
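
The idea behind the ANOVA filter is that a categorical column is informative when the target's mean differs clearly between its levels, which shows up as a large F statistic. The sketch below illustrates this on synthetic data. It swaps in `scipy.stats.f_oneway` for the statsmodels calls used above (for a single factor both give the one-way ANOVA F), and the `anova_f` helper and toy columns are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "informative": rng.choice(["low", "mid", "high"], size=n),
    "noise": rng.choice(["a", "b", "c"], size=n),
})
# The target depends on "informative" only
level_effect = toy["informative"].map({"low": 0.0, "mid": 1.0, "high": 2.0})
toy["target"] = level_effect + rng.normal(scale=0.5, size=n)

def anova_f(df, col, target="target"):
    """One-way ANOVA F statistic and p-value of `target` across the levels of `col`."""
    groups = [group[target].to_numpy() for _, group in df.groupby(col)]
    return f_oneway(*groups)

for col in ["informative", "noise"]:
    f_stat, p_value = anova_f(toy, col)
    print(col, round(f_stat, 2), p_value)
```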

Numerical_cols : Bivariate EDA

In [None]:
num_data = X_train [numerical_cols].copy ()  # copy so the new column is not written to a slice
num_data ['SalePrice'] = y_train

In [None]:
corr = abs (round (num_data.corr (),3))
plt.figure (figsize = (5, 5))
corr ['SalePrice'].sort_values (ascending = True).plot.barh ()
plt.title ('Numerical_cols Vs SalePrice Correlation')
plt.show ()

print (corr ['SalePrice'].sort_values (ascending = False))

In [None]:
to_drop = list (corr [abs (corr.SalePrice) <= 0.30]['SalePrice'].sort_values (ascending = False).index)
numerical_cols = element_remove (numerical_cols, to_drop)

In [None]:
plt.figure (figsize = (9,6))
sns.heatmap (corr [corr > 0.7], annot = True, cmap = 'Blues')
plt.show ()

Numerical_cols : After deleting correlated features.

In [None]:
numerical_cols = element_remove (numerical_cols, ['1stFlrSF', 'GrLivArea', '2ndFlrSF', 'TotalBsmtSF'])

In [None]:
fig = plt.figure (figsize = (20,35))

for i, col in enumerate (numerical_cols):
    fig.add_subplot (10,4,i+1)
    sns.regplot (x = num_data [col], y = num_data ['SalePrice'])
    plt.xticks (rotation = 90)
    
    plt.title (col + '  ' + str (round (np.corrcoef (num_data.SalePrice, num_data [col])[0,1],3)))
plt.tight_layout ()

In [None]:
X_train [numerical_cols] = capping (X_train, numerical_cols)

In [None]:
total_features = categorical_cols + numerical_cols

print (len (total_features))

In [None]:
# Preprocessing for numerical data
numerical_transformer = Pipeline (steps = [('scaler', StandardScaler ())])
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
     ])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
    ])
# Define model
#model = Lasso (alpha = 10)
#model = GradientBoostingRegressor ()
#model = RandomForestRegressor(n_estimators=100, max_depth = 5, random_state=0)
model = XGBRegressor (colsample_bytree=0.4603, gamma=0.05, 
                             learning_rate=0.06, max_depth=5, 
                             min_child_weight=1.7817, n_estimators=5000,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, verbosity=0,
                             random_state =7, n_jobs = -1)
# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),('model', model)])
#param = {"model__n_estimators":[100,150,200], "model__learning_rate":[0.01,0.1,0.05], "model__max_depth":[2,3,4,5]}
#grid = GridSearchCV (clf,param, n_jobs = -1)
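
The structure of the pipeline can be seen more easily on a toy example. The sketch below is an illustration under assumptions: the two-column frame and its values are made up, and `RandomForestRegressor` stands in for `XGBRegressor` so that only scikit-learn is needed. It also shows what `handle_unknown='ignore'` does when a category appears at prediction time that was not present during fitting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy_train = pd.DataFrame({
    "area": [50.0, 80.0, 120.0, 200.0, 65.0, 150.0],
    "zone": ["RL", "RM", "RL", "FV", "RM", "RL"],
})
toy_target = [100.0, 150.0, 220.0, 400.0, 120.0, 290.0]

# Scale the numerical column, one-hot encode the categorical one
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["zone"]),
])
toy_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])
toy_pipeline.fit(toy_train, toy_target)

# "RH" never appeared during fitting; it is encoded as all zeros instead of raising
toy_new = pd.DataFrame({"area": [90.0, 170.0], "zone": ["RM", "RH"]})
toy_preds = toy_pipeline.predict(toy_new)
print(toy_preds)
```

Bundling the steps this way means the scaler and encoder are fitted only on the data passed to `fit`, and the same fitted transformations are reused at prediction time.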

In [None]:
# Preprocessing of training data, fit model 
clf.fit(X_train [total_features], y_train)

In [None]:
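# Validation function: cross-validated RMSE on the log-transformed target
from sklearn.model_selection import KFold

def rmsle_cv(model):
    n_folds = 5
    # Pass the KFold object itself so that shuffling is actually applied
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, X_train [total_features], y_train,
                                    scoring="neg_mean_squared_error", cv = kf))
    return rmse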
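
The `cv` argument of `cross_val_score` accepts either an integer or a splitter object. Passing a `KFold` instance keeps its `shuffle` and `random_state` settings in effect, whereas an integer gives unshuffled folds for a regressor. The sketch below shows the splitter form on synthetic data; `Ridge` and the generated arrays are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Passing the splitter object keeps shuffle=True and the seed in effect
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=kf_demo)

# The scorer returns negated MSE, so flip the sign before taking the root
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold.mean(), rmse_per_fold.std())
```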

In [None]:
# grid.best_params_

In [None]:
X_valid [numerical_cols] = capping (X_valid, numerical_cols)

In [None]:
# Make validation predictions and map them back from the log1p scale to sale prices
pred_valid = np.expm1 (clf.predict(X_valid [total_features]))

In [None]:
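# y_valid is on the log1p scale, so convert it back before comparing with price predictions
val_mae = mean_absolute_error(np.expm1 (y_valid), pred_valid)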

print("Validation MAE : {:,.0f}".format(val_mae))
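
When the target has been log-transformed, predictions and true values need to be on the same scale before an error metric is computed. The small numeric sketch below uses made-up prices and a pretend model output to show the error in log units, the error in dollars, and the misleading value that results from mixing the two scales.

```python
import numpy as np

true_prices = np.array([100000.0, 200000.0])   # illustrative values
true_log = np.log1p(true_prices)                # scale the model is trained on
pred_log = true_log + np.array([0.1, -0.1])     # pretend model output, log scale

# Error measured in log units
mae_log = np.mean(np.abs(pred_log - true_log))
# Error measured in dollars: both sides converted back with expm1
mae_dollars = np.mean(np.abs(np.expm1(pred_log) - true_prices))
# Mixed scales: dollar predictions compared with log-scale targets
mae_mixed = np.mean(np.abs(np.expm1(pred_log) - true_log))

print(mae_log, mae_dollars, mae_mixed)
```

The mixed value is roughly the size of the prices themselves, which is a useful warning sign when a reported MAE looks implausibly large.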

In [None]:
X_train [total_features].shape

In [None]:
pred_train = np.expm1 (clf.predict(X_train [total_features]))

In [None]:
# Training error, computed on the price scale
val_mae_t = mean_absolute_error(np.expm1 (y_train), pred_train)

In [None]:
print("Training MAE : {:,.0f}".format(val_mae_t))

# Creating a Model For the Competition

Build a Random Forest model and train it on all of **X** and **y**.

# Make Predictions
Read the file of "test" data and apply your model to make predictions.

In [None]:
testing_data [numerical_cols] = capping (testing_data, numerical_cols)

In [None]:
testing_data [total_features].columns

In [None]:
X = testing_data [total_features]

In [None]:
test_preds = np.expm1 (clf.predict (X))

In [None]:
output = pd.DataFrame({'Id': testing_data.Id,
                      'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)

In [None]:
output.head ()

Before submitting, run a check to make sure your `test_preds` have the right format.

In [None]:
# Check your answer
step_1.check()
step_1.solution()

# Test Your Work

To test your results, you'll need to join the competition (if you haven't already).  So open a new window by clicking on [this link](https://www.kaggle.com/c/home-data-for-ml-course).  Then click on the **Join Competition** button.

![join competition image](https://i.imgur.com/wLmFtH3.png)

Next, follow the instructions below:
1. Begin by clicking on the blue **Save Version** button in the top right corner of this window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the blue **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the **Submit to Competition** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

5. If you want to keep working to improve your performance, select the blue **Edit** button in the top right of the screen. Then you can change your model and repeat the process. There's a lot of room to improve your model, and you will climb up the leaderboard as you work.

# Continuing Your Progress
There are many ways to improve your model, and **experimenting is a great way to learn at this point.**

The best way to improve your model is to add features.  Look at the list of columns and think about what might affect home prices.  Some features will cause errors because of issues like missing values or non-numeric data types. 

The **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** micro-course will teach you how to handle these types of features. You will also learn to use **xgboost**, a technique that often gives better accuracy than Random Forest.


# Other Micro-Courses
The **[Pandas](https://kaggle.com/Learn/Pandas)** micro-course will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. 

You are also ready for the **[Deep Learning](https://kaggle.com/Learn/Deep-Learning)** micro-course, where you will build models with better-than-human level performance at computer vision tasks.

*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*