# Introduction
<hr style="border:2px solid black"> </hr>

- **What?** Kaggle competition: House Prices - Advanced Regression Techniques. This particular notebook serves as a common repository of code snippets, either my own or taken from other Kagglers.

- **Dataset description** Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

# General info
<hr style="border:2px solid black"> </hr>

- To promote code reusability and tidiness, I will try to wrap most of the actions performed in this notebook inside a method.
- If you do not like it, it would be extremely easy to get rid of the method and use its content as a code snippet.
- Please consider this notebook a collection of ideas taken (and made mine with some modifications) from several notebooks published by other Kagglers who generously shared their ideas. Here I am returning the favour for the benefit of others.

- This notebook is part 2 of a 4-part series:
    - Step_#1_Train_test_comparison.ipynb
    - **Step_#2_EDA.ipynb**
    - Step_#3_Data_preparation.ipynb
    - Step_#4_Modelling.ipynb

# Import modules
<hr style="border:2px solid black"> </hr>

In [None]:
# Data wrangling
import pandas as pd
import numpy as np
from collections import Counter
import copy
from functools import reduce
import pandas_profiling as pp
from sklearn.decomposition import PCA

# Features engineering
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import RobustScaler

# Plotting
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
%matplotlib inline

# Statistics
from scipy.stats import norm
from scipy import stats
from scipy.stats import ttest_ind
import statsmodels.api as sm
from scipy.stats import spearmanr, kendalltau
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

In [None]:
# Other notebook settings
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
import warnings
warnings.filterwarnings("ignore")
from IPython.display import display as dsp

# Load datasets
<hr style="border:2px solid black"> </hr>

- To know what this step does, read the comments inside the `load_data` method.
- There is nothing new here compared to what we have seen in `Step_#1_Train_test_comparison.ipynb`. I've just repeated the whole section to have a semi self-contained notebook.

In [None]:
def load_data():
    """Load data

    Load the train and test data as provided by Kaggle.
    Keep in mind that the way Kaggle provides the data is
    different than the usual idea we have of the train-test
    split. In particular, the target column is not present
    in the test set.

    Parameters
    ----------
    None

    Returns
    -------
    train : pandas dataframe
    test : pandas dataframe
    """

    print("\nLoading data")
    # Read the train data
    print("Read train set")
    train = pd.read_csv('./DATASETS/train.csv')

    # Read the test data
    print("Read test set")
    test = pd.read_csv('./DATASETS/test.csv')

    print("Train size:", train.shape)
    print("Test size:", test.shape)

    train_features = train.columns
    test_features = test.columns
    print("Columns only in train:", set(train_features).difference(test_features))
    print("Columns only in test:", set(test_features).difference(train_features))

    return train, test

In [None]:
# Read data for the first time
train, test = load_data()

- Get the number of categorical and numerical features.
- This information can be used to debug the final dataset.
- The difference of (-1) in the number of numerical features between the train and test set is due to the fact that the test set provided by Kaggle does not have the target column. The target of the test set is held back by Kaggle and used to score your submission on the public leaderboard.

In [None]:
def get_features_type(SET):

    print("\nGet feature types")

    df_numerical_features = SET.select_dtypes(exclude=['object'])
    df_non_numerical_features = SET.select_dtypes(include=['object'])

    print("No of numerical features: ", df_numerical_features.shape)
    print("No of NON numerical features: ", df_non_numerical_features.shape)

    return df_numerical_features, df_non_numerical_features

In [None]:
_,_ = get_features_type(train)

In [None]:
_,_ = get_features_type(test)

- Check for duplicates in both the train and test set.
- The check is done against the ID, but a more thorough one would involve checking whether different IDs have exactly the same entry for each column.

In [None]:
Ids_dupli_train = train.shape[0] - len(train["Id"].unique())
Ids_dupli_test = test.shape[0] - len(test["Id"].unique())

print("There are " + str(Ids_dupli_train) + " duplicate IDs for " +
      str(train.shape[0]) + " total TRAIN entries")
print("There are " + str(Ids_dupli_test) + " duplicate IDs for " +
      str(test.shape[0]) + " total TEST entries")
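
- The more thorough check mentioned above could look like the minimal sketch below. It runs on a small made-up dataframe (`toy` and its values are assumptions used only for illustration); the same two lines could be applied to `train` and `test`.

```python
import pandas as pd

# Toy frame (hypothetical data): rows 0 and 2 differ only by their Id
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 8450],
    "Street": ["Pave", "Pave", "Pave"],
})

# Ignore the Id column and flag every row that repeats an earlier row
content_dupli = toy.drop(columns=["Id"]).duplicated(keep="first")
n_content_dupli = int(content_dupli.sum())

print("Rows duplicated apart from their Id:", n_content_dupli)
```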

# Minor changes
<hr style="border:2px solid black"> </hr>

- Just a collection of actions that are generally forgotten. It is better to do them here so that the notebook stays clean for later steps.
- More info is provided in the `minor_changes` docstring.

In [None]:
def minor_changes(train, test, target_name):
    """Minor changes.

    This method performs minor changes to the dataset.
    These are generally forgotten actions hence the
    name of the method.

    At the moment we have:
    - Get train and test IDs, the latter used for the Kaggle
    competition submission file
    - Remove the column Id as it is not needed. This is an
    artifact used by Kaggle to keep track of the submitted
    predictions.

    Parameters
    ----------
    train : pandas dataframe
    test : pandas dataframe
    target_name : string

    Returns
    -------
    train : pandas dataframe
        Dataframe without the Id column
    test : pandas dataframe
        Dataframe without the Id column
    Id_train : numpy array
        IDs of the train set
    Id_test : numpy array
        IDs of the test set
    df_target : pandas series
        Series containing only the target
    """

    print("Train size BEFORE:", train.shape)
    print("Test size BEFORE:", test.shape)

    Id_train = train.Id.values
    Id_test = test.Id.values

    if test['Id'].count() != 0.0:
        print("Removing column Id from test set")
        test = test.drop(['Id'], axis=1)
    else:
        print("No column ID present")

    if train['Id'].count() != 0.0:
        print("Removing column Id from train set")
        train = train.drop(['Id'], axis=1)
    else:
        print("No column ID present")

    print("Train size AFTER:", train.shape)
    print("Test size AFTER:", test.shape)
    df_target = train[target_name]

    return train, test, Id_train, Id_test, df_target

In [None]:
train, test, Id_train, Id_test, TARGET = minor_changes(train, test, "SalePrice")

# Basic details

In [None]:
def basic_details(df, sort_by="Feature"):
    """Get basic details of the dataset.

    Missing values and their percentage
    Unique values and their percentage (Cardinality)
    Cardinality is important if you are trying to understand
    feature importance.
    Data type of each feature (numerical or not)

    Parameters
    ----------
    df : pandas dataframe
    sort_by : string, default = "Feature"

    Returns
    -------
    b - pandas dataframe
    """

    b = pd.DataFrame()
    b['No missing value'] = df.isnull().sum()
    b["Missing[%]"] = df.isna().mean()*100
    b['No unique value'] = df.nunique()
    b['Cardinality[%]'] = (df.nunique()/len(df.values))*100
    b["No Values"] = [len(df.values) for _ in range(len(df.columns))]
    b['dtype'] = df.dtypes

    # Some cosmetics on the table
    # Turn the index into a column
    b['Feature'] = b.index
    # Getting rid of the index
    b.reset_index(drop=True, inplace=True)
    # Order by feature name
    b.sort_values(by=[sort_by], inplace=True)
    # Move feature as a first column
    b = b[['Feature'] + [col for col in b.columns if col != 'Feature']]

    return b

In [None]:
basic_details(train)

In [None]:
basic_details(test)

# Utilities to compare two sets
<hr style="border:2px solid black"> </hr>

- These are a series of utilities we are going to use routinely while analysing each feature.
- These will help us keep the code structure tidy.

In [None]:
def get_IQR_frequency(df):
    """Get the IQR or the frequency.

    This method returns the summary statistics (including the
    quartiles) or the frequency depending on the feature being
    numerical or categorical.

    Parameters
    ----------
    df : pandas dataframe

    Returns
    -------
    dummy : pandas dataframe
        dummy storing IQR if numerical
        dummy storing each instance frequency if categorical
    """

    if all(df.dtypes != object):
        dummy = pd.DataFrame(df.describe()).T
    else:
        frequency = []
        frequency_percentage = []
        unique = list(set([i[0] for i in pd.DataFrame(df).values]))
        for i in unique:
            # .iloc[0] picks the single column's count by position
            frequency.append(df[df == i].count().iloc[0])
            frequency_percentage.append((df[df == i].count().iloc[0]/len(df))*100)

        dummy = pd.DataFrame()
        dummy["Entries"] = unique
        dummy["Frequency"] = frequency
        dummy["Frequency[%]"] = frequency_percentage
        dummy.sort_values(by=['Frequency[%]'], inplace=True, ascending=False)

    return dummy

In [None]:
def compare_distribution_sets_on_numerical_columns(train, test):
    """Compare distribution sets on numerical columns

    Test if a feature has the same mean in both the train
    and test set. This is achieved via a T-test, which is used
    here as a proxy for the two distributions being similar.
    To be able to perform this test we ONLY select the numerical
    variables, hence the name of the method.

    Parameters
    ----------
    train : pandas dataframe
    test : pandas dataframe

    Returns
    -------
    dummy : pandas Styler (the summary dataframe with highlights)
    """

    # Test.columns because it does not have the target
    numeric_features = test.dtypes[test.dtypes != object].index

    similar = []
    p_value = []
    mu, sigma = [], []
    min_test_within_train = []
    max_test_within_train = []

    for feature in numeric_features:

        # Filling the null values with zeros
        train_clean = train[feature].fillna(0.0)
        test_clean = test[feature].fillna(0.0)

        stat, p = ttest_ind(train_clean, test_clean)
        p_value.append(p)
        
        alpha = 0.05
        if p > alpha:
            similar.append("similar")
            #print('Same distributions (fail to reject H0)')
        else:
            similar.append("different")
            #print('Different distributions (reject H0)')

        min_train = get_IQR_frequency(pd.DataFrame(train_clean))[
            "min"].values[0]
        min_test = get_IQR_frequency(pd.DataFrame(test_clean))["min"].values[0]

        max_train = get_IQR_frequency(pd.DataFrame(train_clean))[
            "max"].values[0]
        max_test = get_IQR_frequency(pd.DataFrame(test_clean))["max"].values[0]        

        min_test_within_train.append(min_test >= min_train)
        max_test_within_train.append(max_test <= max_train)

    # Create a pandas dataframe
    dummy = pd.DataFrame()
    dummy["numerical_feature"] = numeric_features
    dummy["type"] = ["numerical" for _ in range(len(numeric_features))]
    dummy["train_test_similar?"] = similar
    dummy["ttest_p_value"] = p_value
    dummy["min_test>=min_train"] = min_test_within_train
    dummy["max_test<=max_train"] = max_test_within_train

    # Decorate the dataframe for quick visualisation
    def highlight(x):
        return ['background: yellow' if v == "different" or v == False else '' for v in x]
    def bold(x):
        return ['font-weight: bold' if v == "different" or v == False else '' for v in x]
    
    return dummy.style.apply(highlight).apply(bold)
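
- A T-test only compares the means of the two samples. As a complementary check, and not something the method above does, a two-sample Kolmogorov-Smirnov test compares the whole empirical distributions. The sketch below uses synthetic samples (assumed data) that share the same mean but have a different spread.

```python
import numpy as np
from scipy.stats import ks_2samp, ttest_ind

rng = np.random.default_rng(0)

# Two synthetic samples: same mean, different standard deviation
a = rng.normal(loc=0.0, scale=1.0, size=2000)
b = rng.normal(loc=0.0, scale=3.0, size=2000)

# The T-test looks at the means only
_, p_ttest = ttest_ind(a, b)
# The KS test looks at the largest gap between the two empirical CDFs
_, p_ks = ks_2samp(a, b)

print("T-test p-value:", p_ttest)
print("KS test p-value:", p_ks)
```

- Since the means coincide by construction, the T-test has little to detect here, whereas the KS test is sensitive to the difference in spread.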

In [None]:
def get_unique_values(df):
    """Get unique values.
    
    Parameters
    ----------
    df : pandas dataframe
    
    Returns
    -------
    unique : set
    """    
    
    unique = set([i[0] for i in df.dropna().values])
    return unique

In [None]:
def compare_hist_kde(train, test):
    """Compare hist and kde
    
    Parameters
    ----------
    train : pandas dataframe
    test : pandas dataframe
    
    Returns
    -------
    None    
    """
    rcParams['figure.figsize'] = 17, 5
    rcParams['font.size'] = 15

    dummy = get_unique_values(train)
    No_bins = 50 #int(len(dummy))
    print("No of bins used for histogram: ", No_bins)

    for i in set(list(train.columns.values) + list(test.columns.values)):
        
        print("***************")
        print("Feature's name:", i)
        print("***************")
        
        # Plot histogram
        try:
            test[i].hist(legend=True, bins=No_bins)
        except:
            print("Feature", i, " NOT present in TEST set!")
        try:
            train[i].hist(legend=True, bins=No_bins)
        except:
            print("Feature", i, " NOT present in TRAIN set!")
        plt.xticks(rotation=90)
        plt.show()

        # Plot density function
        try:
            sns.distplot(test[i], hist=False, kde=True,
                         kde_kws={'shade': True, 'linewidth': 3})            
        except:
            print("Feature", i, " is categorical in TEST set!")
        try:
            sns.distplot(train[i], hist=False, kde=True,
                         kde_kws={'shade': True, 'linewidth': 3})            
        except:
            print("Feature", i, " is categorical in TRAIN set!")
        plt.xticks(rotation=90)
        plt.show()

In [None]:
def compare_QQ_kde(distr1, distr2):
    """Compare QQ and KDE plot.

    Parameters
    ----------
    distr1 : pandas.core.series.Series
    distr2 : pandas.core.series.Series

    Returns
    -------
    None
    """

    fig, ax = plt.subplots(2, 2, figsize=(15, 5))
    fig.suptitle(
        " qq-plot & distribution: original vs. log-transform ", fontsize=15)

    # SalePrice BEFORE transformation
    #sm.qqplot(distr1, stats.t, distargs=(4,), fit=True, line="45", ax=ax[0, 0])
    stats.probplot(distr1, plot=ax[0, 0])
    sns.distplot(distr1, kde=True, hist=True, fit=norm, ax=ax[0, 1])

    # SalePrice AFTER transformation
    #sm.qqplot(distr2, stats.t, distargs=(4,), fit=True, line="45", ax=ax[1, 0])
    stats.probplot(distr2, plot=ax[1, 0])
    sns.distplot(distr2, kde=True, hist=True, fit=norm, ax=ax[1, 1])
    plt.show()

In [None]:
def get_cat_num_df(df):
    """Get categorical and numerical features name.

    Parameters
    ----------
    df : pandas dataframe

    Returns
    -------
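    cat_cols : pandas dataframe
        Dataframe with the categorical columns only
    num_cols : pandas dataframe
        Dataframe with the numerical columns only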
    """

    cat_cols = df.select_dtypes(include=['object'])
    num_cols = df.select_dtypes(exclude=['object'])
    print(
        f'The dataset contains {len(cat_cols.columns.tolist())} categorical columns')
    print(
        f'The dataset contains {len(num_cols.columns.tolist())} numeric columns')

    return cat_cols, num_cols

# Target analysis

- It seems (although the description talks only about RMSE) that the error metric is the RMSE **on the log of the sale prices.**
- The target `SalePrice` variable shows the following traits:
    - It deviates from the normal distribution.
    - It has appreciable positive skewness.
    - It shows peakedness.

In [None]:
sns.set_style("white")
sns.set_color_codes(palette='deep')

f, ax = plt.subplots(figsize=(16, 5))
# fit=norm overlays the fitted normal curve referred to in the legend
sns.distplot(train['SalePrice'], fit=norm, color="b")

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])

# Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
           loc='best', fontsize = 15)

ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")
sns.despine(trim=True, left=True)
plt.show()

- To corroborate the intuition that the target is **NOT** normally distributed, we can compute two further measures:
    - Skewness
    - Kurtosis

In [None]:
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())

- The SalePrice is **skewed to the right** (aka positive skewness).
- This can be a problem because several ML models, linear ones in particular, tend to do better when the target is close to normally distributed. But bear in mind that other methods, such as tree-based ones, are much less affected by this transformation. Since we are spot-checking a wide variety of methods, we'll apply it knowing it is going to benefit only some of them.

In [None]:
sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(16, 5))

# Check the new distribution
sns.distplot(np.log1p(train["SalePrice"]), fit=norm, color="b")

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(np.log1p(train['SalePrice']))

# Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
           loc='best', fontsize = 15)
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")
sns.despine(trim=True, left=True)

plt.show()

- In this case, the `log(1+x)` transform works fairly well and fixes the skewness. **Why is it so important?** The real advantage is that taking the log **means that errors in predicting expensive houses and cheap houses will affect the result equally**.
- We'll log the target after we analyse all the features.
- We can compare the two distributions, with and without the log transform.
- The method `compare_QQ_kde` will do just this.

In [None]:
compare_QQ_kde(train["SalePrice"], np.log1p(train["SalePrice"]))
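
- To make the point about cheap and expensive houses concrete, here is a minimal sketch assuming the metric is the RMSE on `log(1+x)` of the prices. The helper names `rmse` and `rmsle` and the prices are made up for illustration. Both predictions are 10% too high: the plain RMSE penalises the expensive house five times more, while the RMSE on the logs treats the two errors almost identically.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error on the raw values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def rmsle(y_true, y_pred):
    """Root mean squared error on log(1 + value)."""
    return rmse(np.log1p(y_true), np.log1p(y_pred))

# Hypothetical prices: both predictions overshoot by 10%
cheap_true, cheap_pred = [100_000.0], [110_000.0]
pricey_true, pricey_pred = [500_000.0], [550_000.0]

print("RMSE :", rmse(cheap_true, cheap_pred), rmse(pricey_true, pricey_pred))
print("RMSLE:", rmsle(cheap_true, cheap_pred), rmsle(pricey_true, pricey_pred))
```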

- **How about data standardisation?**
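- Data standardization means converting data values to have a mean of 0 and a standard deviation of 1.
- This would not make the distribution normal, since it only shifts and rescales the values without changing the shape of the distribution. Furthermore, some of the values would become negative, which does not make sense for a price.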

In [None]:
from sklearn.preprocessing import StandardScaler
# StandardScaler expects a 2D input, hence the double brackets
saleprice_scaled = StandardScaler().fit_transform(train[['SalePrice']])

In [None]:
sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(16, 5))
# Check the new distribution
sns.distplot(saleprice_scaled, fit=norm, color="b")

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(saleprice_scaled)
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")
sns.despine(trim=True, left=True)

plt.show()
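
- The statement that standardisation does not change the shape of the distribution can be checked numerically. The sketch below uses a synthetic right-skewed sample as a stand-in for `SalePrice` (an assumption; the real column could be used instead) and compares the skewness before scaling, after scaling, and after the log transform.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic right-skewed "prices" (log-normal), a stand-in for SalePrice
prices = pd.DataFrame(
    {"SalePrice": rng.lognormal(mean=12.0, sigma=0.4, size=1000)})

# Standardise (mean 0, std 1) and, separately, apply log(1 + x)
scaled = pd.Series(StandardScaler().fit_transform(prices).ravel())
logged = np.log1p(prices["SalePrice"])

skew_raw = prices["SalePrice"].skew()
skew_scaled = scaled.skew()
skew_logged = logged.skew()

print("Skewness raw   :", skew_raw)
print("Skewness scaled:", skew_scaled)
print("Skewness logged:", skew_logged)
print("Negative values after scaling?", bool((scaled < 0).any()))
```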

- This [reference](https://www.kaggle.com/dgawlik/house-prices-eda) suggested an interesting insight: it is possible that correlations shift as SalePrice changes.
- Here houses are divided into two price groups: cheap (under 200000) and expensive. Then the means of the quantitative variables are compared.
- Expensive houses have pools, better overall quality and condition, open porches and an increased importance of MasVnrArea.

In [None]:
quantitative = [f for f in train.columns if train.dtypes[f] != 'object']
quantitative.remove('SalePrice')

features = quantitative

standard = train[train['SalePrice'] < 200000]
pricey = train[train['SalePrice'] >= 200000]

diff = pd.DataFrame()
diff['feature'] = features
diff['difference'] = [(pricey[f].fillna(0.).mean() - standard[f].fillna(0.).mean())/(standard[f].fillna(0.).mean())
                      for f in features]

rcParams['figure.figsize'] = 17, 8
rcParams['font.size'] = 20

# Sorting by difference: from low to high
diff.sort_values(by=['difference'], inplace=True, ascending=True)

sns.barplot(data=diff, x='feature', y='difference')
x = plt.xticks(rotation=90)

- Another interesting way of visualising the data was suggested by this [reference](https://www.kaggle.com/janiobachmann/house-prices-useful-regression-techniques)

In [None]:
# To understand better our data I will create a category column for SalePrice.
train['Price_Range'] = np.nan
lst = [train]

# Create a categorical variable for SalePrice
# I am doing this for further visualizations.
for column in lst:
    column.loc[column['SalePrice'] < 150000, 'Price_Range'] = 'Low'
    column.loc[(column['SalePrice'] >= 150000) & (
        column['SalePrice'] <= 300000), 'Price_Range'] = 'Medium'
    column.loc[column['SalePrice'] > 300000, 'Price_Range'] = 'High'

train.head()

In [None]:
# Most outliers are in the high price category nevertheless, in the year
# of 2007 saleprice of two houses look extremely high!

fig = plt.figure(figsize=(12, 8))
ax = sns.boxplot(x="YrSold", y="SalePrice", hue='Price_Range', data=train)
plt.title('Detecting outliers', fontsize=16)
plt.xlabel('Year the House was Sold', fontsize=14)
plt.ylabel('Price of the house', fontsize=14)
plt.show()

In [None]:
# We'll drop Price_Range as we used it just for visualisation purposes
train.drop(columns = ['Price_Range'], inplace = True)

# PCA

- Principal component analysis is a technique that will help us understand the minimum number of dimensions we can use to describe the differences in this dataset.
- This needs to be taken with a pinch of salt as we have not touched the dataset yet.

In [None]:
plt.figure(figsize=(16, 5))
rbst_scaler = RobustScaler()
train_rbst = rbst_scaler.fit_transform(train[quantitative].dropna())

pca = PCA(36).fit(train_rbst)
plt.plot(pca.explained_variance_ratio_.cumsum(), "k*-", lw=3)
plt.xticks(np.arange(0, 36, 1))
plt.xlabel('Number of components', fontweight='bold', size=14)
plt.ylabel('Cumulative explained variance ratio', fontweight='bold', size=14)

train_pca = PCA(3).fit_transform(train_rbst)
plt.show()
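
- Rather than reading the number of components off the plot, it can be computed from the cumulative explained variance. The sketch below runs on synthetic data (10 features driven by 3 hidden factors, an assumption made purely for illustration) and uses a 95% threshold, which is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 10 observed features driven by 3 latent factors plus noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

pca = PCA().fit(X)
cumulative = pca.explained_variance_ratio_.cumsum()

# First position where the cumulative ratio reaches the threshold,
# +1 to turn the zero-based index into a number of components
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1

print("Components needed for 95% of the variance:", n_components_95)
```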

# Normality test

- **Skewness**:

    - A skewness of zero or near zero indicates a symmetric distribution.
    - A negative value for the skewness indicates a left skewness (tail to the left).
    - A positive value for the skewness indicates a right skewness (tail to the right).

- **Kurtosis**:
    - Kurtosis is a measure of how extreme the observations in a dataset are.
    - The greater the kurtosis coefficient, the more peaked the distribution around the mean tends to be.
    - A greater coefficient also means fatter tails, which means there is an increase in tail risk (extreme results).

- Reference: Investopedia: https://www.investopedia.com/terms/m/mesokurtic.asp
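
- The sketch below illustrates these definitions on three synthetic samples (assumed data, not taken from the competition). Note that pandas' `kurt()` returns the excess kurtosis, so a normal distribution scores around 0 rather than 3.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

symmetric = pd.Series(rng.normal(size=5000))                # bell-shaped
right_skewed = pd.Series(rng.exponential(size=5000))        # tail to the right
heavy_tailed = pd.Series(rng.standard_t(df=3, size=5000))   # fat tails

samples = [("symmetric", symmetric),
           ("right_skewed", right_skewed),
           ("heavy_tailed", heavy_tailed)]
for name, s in samples:
    print(f"{name:>12}: skew={s.skew():.2f}, excess kurtosis={s.kurt():.2f}")
```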


In [None]:
def test_set_for_normality(df):
    """Test set for normality.

    Generally the Shapiro test is used to check if a distribution
    is normal. In practice the Shapiro p-value is reported but not used
    because it is badly affected if the samples are too large!
    We'll use the kurtosis and skew factors to establish if it is
    normal or not, but that can easily be modified.

    Parameters
    ----------
    df : pandas dataframe    

    Returns
    -------
    dummy : pandas Styler (the summary dataframe with highlights)

    References
    ----------
    https://www.researchgate.net/post/P-value-equal-to-0000-in-every-test-how-is-it-possible
    """

    numerical_features = df.dtypes[df.dtypes != object].index

    mu, sigma = [], []
    normal_pvalue = []
    normal = []
    skew_all = []
    kurtosis_all = []
    for feature in numerical_features:
        # Use 0.0 for all null values, an alternative could be to drop the row
        df_clean = df[feature].fillna(0.0)

        shap = stats.shapiro(df_clean)
        normal_pvalue.append(shap.pvalue)
        skew = df_clean.skew()
        skew_all.append(skew)
        kurtosis = df_clean.kurt()
        kurtosis_all.append(kurtosis)
        condition_No1 = skew < 0.5 and skew > -0.5
        condition_No2 = kurtosis < 2.0 and kurtosis > -2.0
        if condition_No1 and condition_No2:
            normal.append("normal")
        else:
            normal.append("not_normal")

    # Create a pandas dataframe
    dummy = pd.DataFrame()
    dummy["numerical_feature"] = numerical_features
    dummy["type"] = ["numerical" for _ in range(len(numerical_features))]
    dummy["shapiro_pvalue"] = normal_pvalue
    dummy["skew"] = skew_all
    dummy["kurtosis"] = kurtosis_all
    dummy["normal?"] = normal

    # Decorate the dataframe for quick visualisation
    def highlight(x):
        return ['background: yellow' if v == "different" or v == "not_normal" else '' for v in x]

    def bold(x):
        return ['font-weight: bold' if v == "different" or v == "not_normal" else '' for v in x]

    dummy.sort_values(by=['numerical_feature'], inplace=True, ascending=True)
    
    # Visualise the highlighted df
    return dummy.style.apply(highlight).apply(bold)

In [None]:
test_set_for_normality(train)

In [None]:
test_set_for_normality(test)

# Low variability

- We have seen how some features mostly consist of just a single value or 0s, which is not useful to us.
- Therefore, we set a user-defined threshold of **96%**. If a single value represents more than 96% of the entries of a column, then it is reasonable to consider the feature useless since there isn't much information we can extract from it.
- These features are often called `overfitting features` because the fact that they are dominated by a single value makes it too easy for the algorithm to **overfit them**, as suggested in this [notebook](https://www.kaggle.com/angqx95/data-science-workflow-top-2-with-tuning)

In [None]:
# Get numerical and categorical features
cat_cols_train, num_cols_train = get_cat_num_df(train)
cat_cols_test, num_cols_test = get_cat_num_df(test)

In [None]:
def get_low_cardinality_features(df, threshold, feature_names):
    """Get low cardinality features

    Parameters
    ---------
    df : pandas dataframe
    threshold : float 
        float representing a threshold
    feature_names : list of strings
        list of strings containing the name of the
        features to be checked
    Returns
    -------
    None
    """
    selected_features = []
    for feature in feature_names:
        counts = df[feature].value_counts()

        # Share (in %) of the rows taken by each distinct value.
        # Use len(df), not len(train), so it also works on the test set
        instance_percentage = [count / len(df) * 100 for count in counts]
        if any(share > float(threshold) for share in instance_percentage):
            selected_features.append(feature)

    print("Features where a single value accounts for more than\n",
          str(threshold), "% of the entries:", selected_features)

In [None]:
# Check cardinality on numerical features
get_low_cardinality_features(train, 96, num_cols_train.columns)
get_low_cardinality_features(test, 96, num_cols_test.columns)

In [None]:
# Check cardinality on categorical features
get_low_cardinality_features(train, 96, cat_cols_train.columns)
get_low_cardinality_features(test, 96, cat_cols_test.columns)

- Two comments:
    - There seems to be consistency between the train and test sets.
    - Although we have made the argument that these features should be dropped, we have not yet checked for null values and we have made no imputation on the dataset. Thus, we'll keep this analysis in mind and come back to double-check whether this statement still holds after the cleaning and imputation steps.
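
- The same idea can be written more compactly with `value_counts(normalize=True)`. This is only a sketch on a made-up dataframe (`toy`, `dominant_share` and the values are assumptions). Unlike the method above, the share is computed over the non-null entries only, so the two approaches can disagree on columns with many missing values.

```python
import pandas as pd

def dominant_share(series):
    """Share (in %) of the most frequent value, missing values excluded."""
    return float(series.value_counts(normalize=True).iloc[0] * 100)

# Toy frame (hypothetical data): Street is dominated by a single value
toy = pd.DataFrame({
    "Street": ["Pave"] * 97 + ["Grvl"] * 3,
    "LotShape": ["Reg"] * 60 + ["IR1"] * 40,
})

threshold = 96
flagged = [col for col in toy.columns if dominant_share(toy[col]) > threshold]

print("Features above the threshold:", flagged)
```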

# Features analysis

- This section concentrates on some basic analysis for each feature.
- It offers an occasion to study every single variable in detail.
- This is a fundamental step for the subsequent feature engineering step.
- The comments made and the actions taken in this analysis are then implemented either under the imputation or the feature engineering sections.
- The main goal of the `analyse_single_feature` method is to collect everything a Data Scientist needs to know about each single feature. **Essentially, can we produce a systematic (read it -> semi-automatic) series of plots that will drive the DS decisions?**

- *Are train and test sets representative of the same data?*
    - **Scenario No#1** - An entry is present in the TRAIN but not in the TEST set. The issue lies in how the data was split. This is a red flag telling you that the private leaderboard could probably suffer from the same issue.
    - **Scenario No#2** - An entry is present in the TEST but not in the TRAIN set. This is an even worse scenario, as the model will see no correlation because there was no data to train on. Again, the issue lies in how the data was split.
- *What to do if one feature is NOT distributed uniformly between the train and test sets?* As the data is provided directly by Kaggle, you cannot really do anything about it!
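
- The two scenarios can be detected with plain set operations. The sketch below uses two made-up categorical columns (`train_col` and `test_col` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical categorical feature as seen in the train and test sets
train_col = pd.Series(["A", "B", "C", None])
test_col = pd.Series(["B", "C", "D"])

train_values = set(train_col.dropna().unique())
test_values = set(test_col.dropna().unique())

only_in_train = train_values - test_values   # Scenario No#1
only_in_test = test_values - train_values    # Scenario No#2
not_shared = train_values ^ test_values      # entries in one set only

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
print("Not shared:", not_shared)
```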

In [None]:
def compare_sets_over_non_usable_entries(train, test, delta_threshold=2.0):
    """Compare sets over non usable entries

    As the name suggests, two sets are compared over their
    percentage of non-usable (NaN) entries. If the percentage
    difference is greater than the user-defined threshold, the
    feature is then highlighted.

    Parameters
    ----------
    train : pandas dataframe
    test : pandas dataframe
    delta_threshold : float, default = 2.0
         Value in percentage above which the row gets highlighted

    Returns
    -------
    a : pandas Styler (the merged dataframe with highlights)
    """

    # Pandas dataframe showing the number of null values for the train set
    nan_train = pd.DataFrame(train.isna().sum(), columns=['Nan_sum_train'])
    nan_train['feature_name'] = nan_train.index
    nan_train = nan_train[nan_train['Nan_sum_train'] > 0]
    nan_train['Percentage_train'] = (nan_train['Nan_sum_train']/len(train))*100
    nan_train = nan_train.sort_values(by=['feature_name'])    

    # Pandas dataframe showing the number of null values for the test set
    nan_test = pd.DataFrame(test.isna().sum(), columns=['Nan_sum_test'])

    nan_test['feature_name'] = nan_test.index
    nan_test = nan_test[nan_test['Nan_sum_test'] > 0]
    nan_test['Percentage_test'] = (nan_test['Nan_sum_test']/len(test))*100
    nan_test = nan_test.sort_values(by=['feature_name'])    

    # Merge the two dataset by "feature_name"    
    pd_merge = pd.merge(nan_test, nan_train, how='outer', on='feature_name')
    pd_merge = pd_merge.fillna(0)
    pd_merge["NaN_tot"] = pd_merge["Nan_sum_train"] + pd_merge["Nan_sum_test"]
    pd_merge["delta_percentage"] = abs(
        pd_merge["Percentage_test"] - pd_merge["Percentage_train"])
    pd_merge = pd_merge.sort_values(by=['feature_name'])

    # We'd like to highlight those entries where the differences > delta_threshold
    def highlight(x):
        return ['background: yellow' if v > delta_threshold else '' for v in x]

    def bold(x):
        return ['font-weight: bold' if v > delta_threshold else '' for v in x]

    # Highlight the entries
    a = pd_merge.style.apply(highlight, subset="delta_percentage").apply(
        bold, subset="delta_percentage")
    return a

In [None]:
def get_features_correlation(feature1, feature2, df):
    """Get feature correlation.

    Parameters
    ----------
    feature1 : string
        Feature No1's name
    feature2 : string
        Feature No2's name
    df : pandas dataframe
        Generally either the train or the test set

    Returns
    -------
    None
    """

    # Drop the rows where either feature is NaN, otherwise the
    # correlation coefficients would come out as NaN
    pair = df[[feature1, feature2]].dropna(how='any', axis=0)
    data1 = pair[feature1].values
    data2 = pair[feature2].values

    print("Are ", feature1, " and ", feature2, " correlated?")
    # Calculate spearman's correlation
    coef, p = spearmanr(data1, data2)
    print('Spearmans correlation coefficient: %.3f' % coef)
    # Interpret the significance
    alpha = 0.05
    if p > alpha:
        print('Samples are NOT correlated (fail to reject H0) p=%.3f' % p)
    else:
        print('Samples are correlated (reject H0) p=%.3f' % p)

    # Calculate kendall's correlation
    coef, p = kendalltau(data1, data2)
    print("-------------------------")
    print('Kendall correlation coefficient: %.3f' % coef)
    # Interpret the significance given a value of alpha
    alpha = 0.05
    if p > alpha:
        print('Samples are NOT correlated (fail to reject H0) p=%.3f' % p)
    else:
        print('Samples are correlated (reject H0) p=%.3f' % p)

In [None]:
def get_non_common_values(df1, df2):
    """Get non common values.

    Given two dataframes, df1 and df2, find the values
    that are in df1 but NOT in df2. Swap the arguments
    to get the values that are in df2 but not in df1.

    Parameters
    ----------
    df1 : pandas dataframe
    df2 : pandas dataframe

    Returns
    -------
    c : set
        Set containing the entries of df1 that are missing from df2
    """

    # Get values that are in df1 but not in df2
    c = get_unique_values(df1).difference(get_unique_values(df2))
    return c

In [None]:
def plot_box_stripplot_violin_plots(df, features, y):
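    """Plot how the target varies wrt feature changes.

    This is achieved via four plots:
    [1] box plot : Shows the median, the quartiles and the outliers
    of the target for each feature value.
    [2] strip plot : Shows every single observation as a point.
    [3] violin plot : Combines a box plot with a kernel density
    estimate of the target distribution.
    [4] bar plot : A bar plot represents an estimate of central
    tendency for a numeric variable with the height of each rectangle
    and provides some indication of the uncertainty around that
    estimate using error bars.

    Parameters
    ----------
    df : pandas dataframe
        Dataframe holding both the features and the target
    features : list of string
        List of feature names

    y : string
        The target variable name

    Returns
    -------
    None
    """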

    for i in features:
        rcParams['font.size'] = 20
        if i != "SalePrice":
            fig, ax = plt.subplots(4, 1, figsize=(20, 15))
            sns.boxplot(data=df, x=i, y=y, ax=ax[0])
            sns.stripplot(data=df, x=i, y=y, ax=ax[1])
            sns.violinplot(data=df, x=i, y=y, ax=ax[2])
            sns.barplot(x=i, y=y, data=df, ax=ax[3])
            plt.tight_layout()
            plt.show()

In [None]:
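def create_boxen_count_2x1(df, first_feature, figsize, target):
    """Plot a count plot stacked on top of a boxen plot.

    Both plots share the x axis, so the number of observations of each
    feature value can be read right above its target distribution.
    """
    plt.figure(figsize=figsize)

    # Create boxen plot of first_feature and target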
    ax2 = plt.subplot(212)
    sns.boxenplot(x=first_feature, y=target, data=df, color='tomato')
    plt.xticks(rotation='horizontal')

    # Create countplot of first_feature
    ax1 = plt.subplot(211, sharex=ax2)
    sns.countplot(x=first_feature, data=df, color='dimgrey')
    plt.setp(ax1.get_xticklabels(), visible=False)
    plt.xlabel('')

    # Adjusting the spaces between graphs
    plt.subplots_adjust(hspace=0)

    plt.show()

In [None]:
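def get_jointplot(df, feature, target):
    """Plot the joint distribution of a feature and the target.

    A regression line is drawn on top of the scatter plot.
    """
    sns.jointplot(x=feature,
                  y=target,
                  data=df,
                  kind='reg',
                  height=9,
                  color='darkseagreen')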

In [None]:
def analyse_single_feature(feature_name, train, test, correlated_feature="None", target="None"):
    """Analyse single feature.

    Parameters
    ----------
    feature_name : string
        Name of the feature
    train : pandas dataframe
    test : pandas dataframe
    correlated_feature : string, default "None"
        The name of the feature we'd like to know is correlated
        with feature_name
    target : string, default "None"
        Name of the target column; the plots against the target
        are skipped if "None"

    Returns
    -------
    None
    """

    # Turn pandas series into pandas dataframe
    current_feature = feature_name
    current_train_feature_df = pd.DataFrame(train[current_feature])
    current_test_feature_df = pd.DataFrame(test[current_feature])

    print("\nUnique TRAIN values")
    print(get_unique_values(current_train_feature_df))
    print("\nUnique TEST values")
    print(get_unique_values(current_test_feature_df))
    print("\nNon common values")
    print("Present in TRAIN but not in TEST:", get_non_common_values(
        current_train_feature_df, current_test_feature_df))
    if len(get_non_common_values(current_train_feature_df, current_test_feature_df)) != 0.0:
        # Print the warning ONLY for categorical features.
        # In theory the same can be done for numerical features, but in my
        # opinion only if the values are outside the min and max.
        if all(current_train_feature_df.dtypes == object) and all(current_test_feature_df.dtypes == object):
            print("WARNING!")
            print("TEST set is not representative of the TRAIN set. Deal with it.")
            print("WARNING!")

    print("Present in TEST but not in TRAIN:", get_non_common_values(
        current_test_feature_df, current_train_feature_df))
    if len(get_non_common_values(current_test_feature_df, current_train_feature_df)) != 0.0:
        # Print the warning ONLY for categorical features.
        # In theory the same can be done for numerical features, but in my
        # opinion only if the values are outside the min and max.
        if all(current_train_feature_df.dtypes == object) and all(current_test_feature_df.dtypes == object):
            print("WARNING!")
            print("This is serious!")
            print("WARNING!")

    print("\nNon-usable entries")
    dsp(compare_sets_over_non_usable_entries(
        current_train_feature_df, current_test_feature_df))

    print("\nBasics details TRAIN")
    dsp(basic_details(current_train_feature_df))
    dsp(get_IQR_frequency(current_train_feature_df))

    print("\nBasics details TEST")
    dsp(basic_details(current_test_feature_df))
    dsp(get_IQR_frequency(current_test_feature_df))

    print("\nCompare TRAIN and TEST histogram and kde")
    compare_hist_kde(current_train_feature_df, current_test_feature_df)

    if target != "None":
        # Only for the train set as the test set does not have a target column
        create_boxen_count_2x1(train, feature_name, (20, 8), target)
        plot_box_stripplot_violin_plots(train, [feature_name], target)

    # Only for numerical features
    if all(current_train_feature_df.dtypes != object) and all(current_test_feature_df.dtypes != object):

        print("\nNo of values equal to zero.")
        # This is important when using a log transform
        print("No of entries equal to 0.0 in TRAIN set", list(
            current_train_feature_df.values).count(0.0))
        print("No of entries equal to 0.0 in TEST set", list(
            current_test_feature_df.values).count(0.0))

        print("\nCompare TRAIN and TEST distributions")
        dsp(compare_distribution_sets_on_numerical_columns(
            current_train_feature_df, current_test_feature_df))

        print("\nNormality on the TRAIN set")
        dsp(test_set_for_normality(current_train_feature_df))

        print("\nNormality on the TEST set")
        dsp(test_set_for_normality(current_test_feature_df))

        # Scatter plot for target vs. feature
        if target != "None":
            print("Scatter plot for target vs. feature")
            #data = pd.concat([train[target], current_train_feature_df], axis=1)
            #data.plot.scatter(x=current_feature, y=target)

            fig, (ax1, ax2) = plt.subplots(
                figsize=(16, 8), ncols=2, sharey=False)
            # This shows whether the relationship is linear or not!
            ax1.title.set_text('Scatter + regression plot + variance')
            sns.regplot(x=train[feature_name], y=train[target], ax=ax1)
            # This is the residual plot which shows the error variance
            ax2.title.set_text('Regression plot errors')
            sns.residplot(x=train[feature_name], y=train[target], ax=ax2)

            get_jointplot(train, feature_name, target)

        # dropna is necessary otherwise it'd not run; categorical
        # variables never get here thanks to the dtype check above
        print("QQ plot on TRAIN set for original and normalised values")
        compare_QQ_kde(train[current_feature].dropna(),
                       np.log1p(train[current_feature].dropna()))

        print("QQ plot on TEST set for original and normalised values")
        compare_QQ_kde(test[current_feature].dropna(),
                       np.log1p(test[current_feature].dropna()))

        if correlated_feature != "None":
            print("Correlation against ", correlated_feature,
                  " on the train set ONLY!")
            get_features_correlation(feature_name, correlated_feature, train)

    else:
        print(current_feature, " is NOT numerical!")

In [None]:
train.columns

In [None]:
test.columns

- **Feature**: MSSubClass: Identifies the type of dwelling involved in the sale.

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES
- **Missing value**: NONE
- **Type**: Numeric -> treated as categorical -> use one-hot encoding.
- **Encoding needed**: Yes
- **Comments:** Which encoder should we use: ordinal, label or one-hot? Setting aside the specifics of any package implementation, let us focus on a high-level understanding of the 3 options:
    - **Ordinal encoding** should be used for ordinal variables (where order matters, like cold, warm, hot).
    - **Label encoding** should be used for non-ordinal (aka nominal) variables (where order doesn't matter, like blonde, brunette). No extra column is added, but the algorithm may use the values as if they were ordinal.
    - **One-hot encoding** creates one binary column per category and every entry gets a 1 in the column of its own category. The number of columns is equal to the number of categories, or one fewer if a reference category is dropped.
- A note on the implementation: in sklearn, if you want ordinal encoding (order is preserved), you must supply the order yourself, since neither OrdinalEncoder nor LabelEncoder can infer it. See the OrdinalEncoder constructor parameter called `categories`.
- **EDA**: Each of these classes represents a very different style of building, as shown in the data description. Hence, we can see a large variance in SalePrice between classes. After turning the entries from numerical to string I will use one-hot encoding because I am unable to establish an order.
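
The plan above can be sketched as follows. This is only a minimal illustration: the `toy` dataframe and its values are made up and stand in for the real `train` set.

```python
import pandas as pd

# Hypothetical toy data standing in for train["MSSubClass"]
toy = pd.DataFrame({"MSSubClass": [20, 60, 20, 120]})

# Cast the numeric codes to strings so they are treated as labels, not magnitudes
toy["MSSubClass"] = toy["MSSubClass"].astype(str)

# One binary column per class; no order is implied between the classes
encoded = pd.get_dummies(toy, columns=["MSSubClass"], prefix="MSSubClass")
print(sorted(encoded.columns))
```

Each row ends up with exactly one active column, so the model cannot read `60` as "three times `20`".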

In [None]:
analyse_single_feature("MSSubClass", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: MSZoning Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density
	
    
- **Missing value**: 4
- **Cardinality**: LOW
- **Type**: Ordinal 
- **Encoding needed**: Yes -> Use one-hot encoding
- **Comments:** We could make some assumptions about which types of area are worth more and try to come up with a ranking, but I feel this would mean enforcing a personal bias on the encoding. Therefore, I will use one-hot encoding for this feature. Also note that not all the entries mentioned in the description are present. For instance, there are no Industrial zones in the dataset.
- **EDA:** Most of the houses are located in RL (residential low population density area), and houses located in lower population density areas generally have a higher SalePrice than houses in higher population density areas. Nevertheless, this is not enough to impose an order on the data.

In [None]:
analyse_single_feature("MSZoning", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: LotFrontage: Linear feet of street connected to property
- **Missing value**: High, above 15% in both the train and test sets
- **Cardinality**: Low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: Maybe we could group the data into different categories (binning)?
- **EDA**: There seem to be some outliers with very high LotFrontage values but a relatively low SalePrice.

In [None]:
analyse_single_feature("LotFrontage", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: LotArea: Lot size in square feet
- **Missing value**: No
- **Cardinality**: High
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: Maybe grouped into different bins? We'll not use this in this notebook, but I'll share how to do it under the `feature_engineering` section.
- **EDA**: This feature shows a high correlation with `SalePrice`, but it is strongly positively skewed.
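
A quick way to see what a log transform does to a positively skewed feature is sketched below. The numbers are invented for illustration and are not taken from the dataset; `np.log1p` is the same transform used for the QQ plots in `analyse_single_feature`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for train["LotArea"]
lot_area = pd.Series([1000, 2000, 4000, 8000, 16000, 32000, 64000])

# log1p(x) = log(1 + x), which is also safe for entries equal to zero
lot_area_log = np.log1p(lot_area)

skew_before = lot_area.skew()
skew_after = lot_area_log.skew()
print("Skewness before log1p:", skew_before)
print("Skewness after log1p :", skew_after)
```

On this toy sample the skewness drops close to zero after the transform; on the real feature the effect should be checked rather than assumed.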

In [None]:
analyse_single_feature("LotArea", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: Street Type of road access to property
- **Missing value**: NONE
- **Cardinality**: Low
- **Type**: Ordinal categorical, as having a paved street is better than having a gravel one!
- **Encoding needed**: Yes
- **Comments:** One can argue that there is not enough data for "Grvl", making this feature useless. Another option would be to create an isPaved feature, which essentially creates two bins: has it and hasn't it! What I'd like to use here is ordinal encoding, as having a paved street seems to lead to a higher sale price.
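
A minimal sketch of ordinal encoding with an explicit order, assuming gravel ranks below paved. The `toy` dataframe is hypothetical; the order is passed through the `categories` parameter of sklearn's `OrdinalEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for train[["Street"]]
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave"]})

# The order is supplied explicitly: gravel (0) < paved (1)
encoder = OrdinalEncoder(categories=[["Grvl", "Pave"]])
toy["Street_encoded"] = encoder.fit_transform(toy[["Street"]]).ravel()
print(toy)
```

With only two levels, the encoded column is effectively the isPaved flag mentioned above.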

In [None]:
analyse_single_feature("Street", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: Alley: Type of alley access to property
- **Missing value**: over 90%
- **Cardinality**: Low
- **Type**: Categorical -> use ordinal encoder
- **Encoding needed**: Yes
- **Comments**: Use NA for missing value imputation
- **EDA**: Here we see a fairly even split between the two classes in terms of frequency, but a much higher average SalePrice for paved alleys as opposed to gravel ones. Having a paved alley seems to be correlated with higher prices.
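
The imputation mentioned in the comments could look like the sketch below. It assumes, as the competition's data description states, that a missing `Alley` means the property has no alley access; the `alley` series is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train["Alley"]
alley = pd.Series(["Grvl", np.nan, np.nan, "Pave", np.nan], name="Alley")

# Missing entries become their own level instead of being dropped
alley_filled = alley.fillna("NA")
print(alley_filled.value_counts())
```

Since over 90% of the entries are missing, dropping those rows is not an option; keeping "NA" as a level preserves them.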

In [None]:
analyse_single_feature("Alley", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: LotShape: General shape of property
       Reg	Regular	
       IR1	Slightly irregular
       IR2	Moderately Irregular
       IR3	Irregular        
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical -> Use one-hot encoding
- **Encoding needed**: Yes
- **Comments:** I was not able to clearly see a consistent trend from regular to irregular, hence I am not going to use ordinal encoding. Another option could be to add another feature flagging regular or irregular shapes.

In [None]:
analyse_single_feature("LotShape", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: LandContour: Flatness of the property
       Lvl	Near Flat/Level	
       Bnk	Banked - Quick and significant rise from street grade to building
       HLS	Hillside - Significant slope from side to side
       Low	Depression        
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical -> use one-hot encoding
- **Encoding needed**: Yes
- **Comments**: Since this is a categorical feature without order, I will create dummy features. Note that most properties are level; thus we could add another categorical feature such as "not_level".
- **EDA**: Most houses are indeed on a flat contour, however the houses with the highest SalePrice seem to come from properties on a hill! Better view!?

In [None]:
analyse_single_feature("LandContour", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: Utilities: Type of utilities available
		
       AllPub	All public Utilities (E,G,W,& S)	
       NoSewr	Electricity, Gas, and Water (Septic Tank)
       NoSeWa	Electricity and Gas Only
       ELO	Electricity only	
- **Missing value**: 2
- **Cardinality**: Low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**: The vast majority of entries are AllPub, and it is reasonable to assume that the 2 missing values are most likely AllPub too. If that is the case, there is essentially no variability in the test data: it is a constant. We have seen that `NoSeWa` is present only in the train data and not in the test set we have access to. To complicate the situation, it appears in only one entry. This means that:
    - If we drop this feature and the private set has one entry with `NoSeWa`, we end up with a model that was not trained for this.
    - Even if we'd like to train a model on it, we literally have only one entry. How can we train and test on this? We could try to collect more data, but this is not an option available to us.
    - Dropping this feature seems to be an almost forced option. However, since selecting only the top features would generally do this for us, we probably do not even need to drop it explicitly.
    - A more elaborate option would be to create a flag hasAllPub with Yes and No and then one-hot encode that flag. I feel this is the most robust option we have. Again, we can still rely on feature importance to exclude it if it is deemed not useful.
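
The hasAllPub idea can be sketched as below. The `toy` dataframe is hypothetical, and a 0/1 column is used in place of Yes/No since a binary flag is already its own one-hot encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Utilities column
toy = pd.DataFrame({"Utilities": ["AllPub", "AllPub", "NoSeWa", np.nan]})

# Assumption taken from the comments above: missing entries are most likely AllPub
utilities = toy["Utilities"].fillna("AllPub")

# 1 if the property has all public utilities, 0 otherwise
toy["hasAllPub"] = (utilities == "AllPub").astype(int)
print(toy)
```

Any level other than AllPub, including one never seen in training, maps to 0, so the flag stays well defined on the private set.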

In [None]:
analyse_single_feature("Utilities", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: LotConfig: Lot configuration

       Inside	Inside lot
       Corner	Corner lot
       CulDSac	Cul-de-sac
       FR2	Frontage on 2 sides of property
       FR3	Frontage on 3 sides of property
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Ordinal -> use one-hot encoding
- **Encoding needed**: Yes
- **EDA**: Cul-de-sacs seem to boast the highest average prices within Ames, however most houses sit on inside or corner lots. To simplify this feature we could cluster "FR2" and "FR3", then create dummy features afterwards. This has the benefit of limiting the number of extra columns created.
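
Clustering the two frontage classes before creating the dummies could look like this sketch. The `toy` dataframe and the merged label "FR" are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for train["LotConfig"]
toy = pd.DataFrame({"LotConfig": ["Inside", "Corner", "FR2", "FR3", "CulDSac"]})

# Merge the two frontage classes into a single label
toy["LotConfig"] = toy["LotConfig"].replace({"FR2": "FR", "FR3": "FR"})

# Four dummy columns instead of five
dummies = pd.get_dummies(toy["LotConfig"], prefix="LotConfig")
print(list(dummies.columns))
```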

In [None]:
analyse_single_feature("LotConfig", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: LandSlope: Slope of property
		
       Gtl	Gentle slope
       Mod	Moderate Slope	
       Sev	Severe Slope        
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Ordinal categorical
- **Encoding needed**: Yes
- **Comments**: We are going to cluster "Mod" and "Sev" into one class and create a new flag indicating whether the slope is gentle or not. This will also help us reduce the number of columns when one-hot encoding.
- **EDA**:  Most houses have a gentle slope of land and overall, the severity of the slope doesn't appear to have much of an impact on SalePrice.


In [None]:
analyse_single_feature("LandSlope", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: Neighborhood: Physical locations within Ames city limits
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical, but we do not have enough info at the moment to rank them.
- **Encoding needed**: Yes
- **Comments**: Since this is a categorical feature without order, I will create dummy features.
- **EDA**: Neighborhood clearly has an important contribution towards SalePrice, since we see such high values for certain areas and low values for others. 

In [None]:
analyse_single_feature("Neighborhood", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: Condition1: Proximity to various conditions
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to postive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad        
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Ordinal -> Use one-hot encoding
- **Encoding needed**: Yes
- **Comments**: Most of the houses have normal condition and there seems to be no clear order in the labels. There is another feature called `Condition2` which seems very similar. Please take a look at how we dealt with this under the `feature_engineering` section.
- **EDA**: Since this feature is based around local features, it is understandable that having more desirable things nearby, like a park, would contribute towards a higher SalePrice.

In [None]:
analyse_single_feature("Condition1", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: Condition2: Proximity to various conditions (if more than one is present)
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to postive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical -> use one-hot encoding
- **Encoding needed**: Yes
- **Comments**: Most of the houses have normal condition, and there seems to be neither a correlation with the target nor a clear order in the labels. There is another feature called `Condition1` which seems very similar. Please take a look at how we dealt with this under the `feature_engineering` section.

In [None]:
analyse_single_feature("Condition2", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: BldgType: Type of dwelling
       1Fam	Single-family Detached	
       2FmCon	Two-family Conversion; originally built as one-family dwelling
       Duplx	Duplex
       TwnhsE	Townhouse End Unit
       TwnhsI	Townhouse Inside Unit        
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical -> use one-hot encoding
- **Encoding needed**: Yes
- **Comments**: The different categories exhibit a range of average SalePrice. The class with the most observations is "1Fam". We can also see that the variance within classes is quite tight, with only a few extreme values in each case. There could be a possibility to cluster these classes, however for now I am going to create dummy features.

In [None]:
analyse_single_feature("BldgType", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: HouseStyle: Style of dwelling
       1Story	One story
       1.5Fin	One and one-half story: 2nd level finished
       1.5Unf	One and one-half story: 2nd level unfinished
       2Story	Two story
       2.5Fin	Two and one-half story: 2nd level finished
       2.5Unf	Two and one-half story: 2nd level unfinished
       SFoyer	Split Foyer
       SLvl	Split Level        
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical
- **Encoding needed**: Yes -> see the `feature_engineering` section to see how we dealt with it
- **Comments**: Here we see quite a few extreme values across the categories and a large weighting of observations towards the integer-story houses. Although the highest average SalePrice comes from "2.5Fin", this class has a very high standard deviation; more reliably, the "2Story" houses are also very highly priced on average. Since there are some categories with very few values, I will cluster these into another category and create dummy variables.
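
One possible way to cluster the sparse categories is sketched below. The `house_style` series, the `min_count` threshold and the "Other" label are all hypothetical; the actual treatment is in the `feature_engineering` section.

```python
import pandas as pd

# Hypothetical toy data standing in for train["HouseStyle"]
house_style = pd.Series(
    ["1Story"] * 5 + ["2Story"] * 4 + ["2.5Fin", "1.5Unf"], name="HouseStyle")

# Categories seen fewer than min_count times are considered rare
min_count = 2
counts = house_style.value_counts()
rare = counts[counts < min_count].index

# Replace the rare categories with a single "Other" label, then one-hot encode
grouped = house_style.where(~house_style.isin(rare), "Other")
dummies = pd.get_dummies(grouped, prefix="HouseStyle")
print(grouped.value_counts())
```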

In [None]:
analyse_single_feature("HouseStyle", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: OverallQual: Rates the overall material and finish of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor        
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Numerical, but really categorical with a precise order
- **Encoding needed**: No
- **Comments**: Others have pointed out that there is one outlier at OverallQual ~ 5 with a SalePrice which is too high. We'll check later whether our automated outlier procedure has removed it or not. Although numeric, this feature is actually categorical and ordinal: ordinal because as the value increases so does the SalePrice. We see here a nice positive correlation between OverallQual and SalePrice, as you'd expect.

In [None]:
analyse_single_feature("OverallQual", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: OverallCond: Rates the overall condition of the house
       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average	
       5	Average
       4	Below Average	
       3	Fair
       2	Poor
       1	Very Poor
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: Interestingly, we see here that it does follow a positive correlation with SalePrice, however we see a peak at a value of 5, along with a high number of observations at this value. The highest average SalePrice actually comes from a value of 5 as opposed to 10. For this feature, I will leave it as being numeric and ordinal.

In [None]:
analyse_single_feature("OverallCond", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- Here we explore how strongly the features `OverallQual` and `OverallCond` are correlated with each other.
    - Option #1 - leave them as they are if we know they are correlated (bad option)
    - Option #2 - get the mean between them and drop the two feature (better option)
    - Option #3 - get the mean and keep the two features, three in total (worst option)
    - Option #4 - keep only one feature (best option)
- What follows is an attempt to study the relationship between these two features.

In [None]:
def compare_categorical_variables(feature1, feature2, df):
    """Compare two categorical variables.
    
    Highlights entries that are not the same. This allows us to make
    a line-by-line comparison of the entries.
    
    Parameters
    ----------
    feature1 : string
        First feature's name 
    feature2 : string
        Second feature's name
    df : pandas dataframe
        Generally either the train or the test set
    
    Returns
    -------
    df : pandas Styler
        This is where the line-by-line comparison is returned
    """
    
    df1 = df[feature1]
    df2 = df[feature2]    
    
    # Element-wise comparison of the two columns
    result = np.where(df1.values != df2.values, "different", "equal")
    
    # Create a pandas dataframe
    dummy = pd.DataFrame()
    dummy[feature1] = df1
    dummy[feature2] = df2
    dummy["Equal?"] =  result
    

    # Decorate the dataframe for quick visualisation
    def highlight(x):    
        return ['background: yellow' if v == "different" else '' for v in x]

    def bold(x):
        return ['font-weight: bold' if v == "different" else '' for v in x]

    # Visualise the highlighted df
    return dummy.style.apply(highlight).apply(bold)            

In [None]:
compare_categorical_variables("OverallQual", "OverallCond", train)

In [None]:
compare_categorical_variables("OverallQual", "OverallCond", test)

- The two dataframes above show that most of the values differ in absolute terms, but that does not mean they are not correlated. In fact, if you look carefully, even when the values are different most of them capture the same trend: when `OverallQual` is high, `OverallCond` is also high.
- Further, we can also use Spearman's and Kendall's rank correlations to check whether these two variables are correlated with each other.
- As you can see, both correlation tests suggest the two variables are strongly correlated. If that is the case and we drop one variable, the piece of information dropped is then recovered by the presence of the other variable.
- This makes me think that creating a `meanQuality` is not going to be such a great idea.

In [None]:
get_features_correlation("OverallQual", "OverallCond", train)

In [None]:
get_features_correlation("OverallQual", "OverallCond", test)

- **Feature**: YearBuilt: Original construction date
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Numerical -> treated as categorical
- **Encoding needed**: Yes
- **Comments**: It depends on whether it is used as a measurement of time (quantitative) or as the name of a particular interval of time (categorical). It could be either. One test we can use to understand this is to ask whether the ratio between two years is meaningful or not. Taking the ratio between two years is not meaningful, which is why it is not appropriate to classify it as a quantitative variable. However, one-hot encoding every single year would be problematic. What I'd suggest is to use a bin size of 7 to 10 years. This will lower the number of extra columns being added.

In [None]:
analyse_single_feature("YearBuilt", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- Since we have suggested binning, we'd like to see what the bins would look like on both sets.
- Notice how the bins differ: a confirmation of the differences between the two datasets.
- It is also shown how the bins would have looked if we had merged the two sets.

In [None]:
dummy = pd.cut(train['YearBuilt'], 7)
for i in dummy.unique():
    print(i)

In [None]:
dummy = pd.cut(test['YearBuilt'], 7)
for i in dummy.unique():
    print(i)

In [None]:
all_data = pd.concat((train, test)).reset_index(drop=True)
dummy = pd.cut(all_data['YearBuilt'], 7)
for i in dummy.unique():
    print(i)
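
Since the bins computed on each set separately do not match, one option is to derive the edges once and reuse them on both sets. The sketch below assumes made-up years in place of the real `YearBuilt` columns and relies on the `retbins` parameter of `pd.cut`.

```python
import pandas as pd

# Hypothetical toy data standing in for train["YearBuilt"] and test["YearBuilt"]
train_years = pd.Series([1872, 1950, 1975, 2005, 2010])
test_years = pd.Series([1879, 1960, 1999, 2008])

# Derive the bin edges once, from the combined range of the two sets
all_years = pd.concat([train_years, test_years], ignore_index=True)
_, edges = pd.cut(all_years, bins=7, retbins=True)

# Reuse the very same edges on both sets so the categories are identical
train_binned = pd.cut(train_years, bins=edges)
test_binned = pd.cut(test_years, bins=edges)
print(train_binned.cat.categories)
```

Whether using the test set range to define the edges is acceptable is a modelling decision; the edges could equally be derived from the train set alone, at the cost of possible out-of-range test values.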

- **Feature**: YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Numerical -> treated as categorical
- **Encoding needed**: Yes
- **Comments**: The more recent the remodelling of a house, the higher the SalePrice. From the data description, I believe that creating a new feature describing the number of years between construction and remodelling could be a good choice.
- **EDA**: If we take this difference we can find out whether the house was recently remodelled. This could be used as an extra feature.

In [None]:
analyse_single_feature("YearRemodAdd", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- The number of years between construction and the last refurbishment **does not seem** to correlate well with the `SalePrice`.
- I guess this would correlate well with the price delta if only we had the previous sale price. Since we do not have it, I suggest we drop the feature `YearRemodAdd`.

In [None]:
dummy = train['YearRemodAdd'] - train['YearBuilt']

plt.subplots(figsize=(40, 10))
sns.barplot(x=dummy, y=train["SalePrice"])
plt.xticks(rotation=90);
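
The difference plotted above could be stored as a feature of its own, together with a simple flag. This is a sketch on made-up rows; the column names `YearsToRemod` and `WasRemodelled` are hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows with made-up years; in the notebook this would be `train` or `test`.
toy = pd.DataFrame({
    "YearBuilt":    [1950, 1990, 2005],
    "YearRemodAdd": [1950, 2004, 2006],
})

# Years elapsed between construction and the last remodel (0 = never remodelled,
# because YearRemodAdd equals YearBuilt when there was no remodelling).
toy["YearsToRemod"] = toy["YearRemodAdd"] - toy["YearBuilt"]

# Binary flag: 1 if the house was remodelled at some point, 0 otherwise.
toy["WasRemodelled"] = (toy["YearsToRemod"] > 0).astype(int)

print(toy)
```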

- **Feature**: RoofStyle: Type of roof
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical -> use one-hot encoding
- **Encoding needed**: Yes
- **Comments**: This feature has two highly frequent categories. Since this is a categorical feature without order, I will create dummy variables.

In [None]:
analyse_single_feature("RoofStyle", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: RoofMatl: Roof material
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**: Interestingly, there are very few observations in the training data for several classes. 
- **EDA**: The two sets do not have the same entries. This will cause some issues while encoding the feature if we treat the train and test sets separately.

In [None]:
analyse_single_feature("RoofMatl", train, test,
                       correlated_feature="SalePrice", target="SalePrice")
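
To illustrate the encoding issue mentioned for `RoofMatl`, here is a small sketch on made-up data. It declares one shared list of categories before one-hot encoding each set; taking the union of the labels seen in both sets is an assumption, and aligning the test columns to the train columns would be an alternative.

```python
import pandas as pd

# Made-up samples in which each set contains a label the other one lacks.
train_toy = pd.DataFrame({"RoofMatl": ["CompShg", "Metal", "CompShg", "Roll"]})
test_toy = pd.DataFrame({"RoofMatl": ["CompShg", "Tar&Grv", "CompShg"]})

# One shared category list for both sets.
categories = sorted(set(train_toy["RoofMatl"]) | set(test_toy["RoofMatl"]))
shared_dtype = pd.CategoricalDtype(categories=categories)

# get_dummies creates one column per declared category, observed or not.
train_ohe = pd.get_dummies(train_toy["RoofMatl"].astype(shared_dtype), prefix="RoofMatl")
test_ohe = pd.get_dummies(test_toy["RoofMatl"].astype(shared_dtype), prefix="RoofMatl")

print(list(train_ohe.columns))
print(list(test_ohe.columns))
```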

- **Feature**: Exterior1st: Exterior covering on house

       AsbShng	Asbestos Shingles
       AsphShn	Asphalt Shingles
       BrkComm	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       CemntBd	Cement Board
       HdBoard	Hard Board
       ImStucc	Imitation Stucco
       MetalSd	Metal Siding
       Other	Other
       Plywood	Plywood
       PreCast	PreCast	
       Stone	Stone
       Stucco	Stucco
       VinylSd	Vinyl Siding
       Wd Sdng	Wood Siding
       WdShing	Wood Shingles        
- **Missing value**: 1 in the test set
- **Cardinality**: Low
- **Type**: Categorical -> see `Feature_engineering`
- **Encoding needed**: Yes
- **Comments**: 
- **EDA**: Looking at these 2 features (`Exterior1st` and `Exterior2nd`) together, we can see that they exhibit very similar behaviours against SalePrice. This tells me that they are very closely related. Hence, I will create a flag to indicate whether there is a different 2nd exterior covering to the first. Then I will keep "Exterior1st" and create dummy variables from this.

In [None]:
analyse_single_feature("Exterior1st", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: Exterior2nd: Exterior covering on house (if more than one material)
- **Missing value**: 1 in the test set
- **Cardinality**: Low
- **Type**: Categorical -> see `Feature_engineering`
- **Encoding needed**: Yes
- **Comments**: use the mode for the missing value imputation.
- **EDA**: see the discussion for `Exterior1st`

In [None]:
analyse_single_feature("Exterior2nd", train, test,
                       correlated_feature="SalePrice", target="SalePrice")
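
The flag discussed for `Exterior1st` and `Exterior2nd`, combined with the mode imputation, might look like the sketch below. The data is made up and `ExteriorDiffers` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Exterior1st": ["VinylSd", "MetalSd", "Wd Sdng", "VinylSd"],
    "Exterior2nd": ["VinylSd", "Plywood", np.nan, "VinylSd"],
})

# Impute the missing entry with the most frequent category (the mode).
mode_2nd = toy["Exterior2nd"].mode()[0]
toy["Exterior2nd"] = toy["Exterior2nd"].fillna(mode_2nd)

# Flag: 1 when the second covering differs from the first one.
toy["ExteriorDiffers"] = (toy["Exterior1st"] != toy["Exterior2nd"]).astype(int)

print(toy)
```

Imputing before building the flag matters here, because a comparison against NaN would always count as "different".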

- **Feature**: MasVnrType: Masonry veneer type

       BrkCmn	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       None	None
       Stone	Stone
- **Missing value**: ~ 20
- **Cardinality**: Low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**: 
- **EDA**: Each class has quite a unique range of values for `SalePrice`. The only class that stands out is `BrkCmn` which has a low frequency. Clearly `Stone` demands the highest `SalePrice` on average, although there are some extreme values within `BrkFace`. Since this is a categorical feature without order, I will create dummy variables here.

In [None]:
analyse_single_feature("MasVnrType", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: MasVnrArea: Masonry veneer area in square feet
- **Missing value**: ~ 20
- **Cardinality**: 22%
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: The median is zero, which means that at least 50% of the entries have zero `MasVnrArea`. Using the 75th percentile would give an estimate that is very much biased toward the highest values. The mean seems a good compromise.
- **EDA**: This feature has negligible correlation with `SalePrice`, and the values for this feature vary widely based on house type, style and size. Since this feature is insignificant with regard to `SalePrice`, and it also correlates highly with `MasVnrType` (if `MasVnrType` = "None" then it has to be equal to 0), I will drop this feature.

In [None]:
analyse_single_feature("MasVnrArea", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: ExterQual, Evaluates the quality of the material on the exterior 
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor    
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical -> ordinal
- **Encoding needed**: Yes
- **EDA**: This feature shows a clear order and has a positive correlation with `SalePrice`. As the quality increases, so does the `SalePrice`. We see the largest number of observations within the two middle classes, and the lowest observations within the lowest class. Since this is a categorical feature with order, I will replace these values by hand.

In [None]:
analyse_single_feature("ExterQual", train, test,
                       correlated_feature="SalePrice", target="SalePrice")
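
Replacing the quality labels "by hand" can be done with a plain dictionary. This is a sketch on made-up values; the integers 1 to 5 are an assumption, only their order matters.

```python
import pandas as pd

# Hand-written mapping from worst to best (the integer values are an assumption).
quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

toy = pd.DataFrame({"ExterQual": ["TA", "Gd", "Ex", "Fa", "TA"]})
toy["ExterQual_ord"] = toy["ExterQual"].map(quality_map)

# Any label missing from the mapping would silently become NaN, so check for it.
unmapped = int(toy["ExterQual_ord"].isna().sum())
print(toy, unmapped)
```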

- **Feature**: ExterCond, Evaluates the present condition of the material on the exterior
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor    
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical -> use one-hot encoding
- **Encoding needed**: Yes
- **EDA**: Interestingly we see the largest values of SalePrice for the second and third best classes. This is perhaps because of the large frequency of values within these classes, whereas we only see 3 observations within "Ex" from the training data. Since this categorical feature has an order, **but the SalePrice does not necessarily correlate with this order, I will create dummy variables**.

In [None]:
analyse_single_feature("ExterCond", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: Foundation: Type of foundation
       BrkTil	Brick & Tile
       CBlock	Cinder Block
       PConc	Poured Concrete	
       Slab	Slab
       Stone	Stone
       Wood	Wood
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical -> use one-hot encoding
- **Encoding needed**: Yes
- **Comments**: Creating some ordinal label will not make sense here.
- **EDA**: We have 3 classes with high frequency but we also have 3 of low frequency.

In [None]:
analyse_single_feature("Foundation", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: BsmtQual: Evaluates the height of the basement
       Ex	Excellent (100+ inches)	
       Gd	Good (90-99 inches)
       TA	Typical (80-89 inches)
       Fa	Fair (70-79 inches)
       Po	Poor (<70 inches)
       NA	No Basement
- **Missing value**: ~80
- **Cardinality**: Low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**: The data description tells us that a NaN most likely stands for `NA`, i.e. no basement.
- **EDA**: `SalePrice` is clearly affected by `BsmtQual`: the better the quality, the higher the price. However, it looks as though most houses have either `Good` or `Typical` sized basements. Since this feature is ordinal, i.e. the categories represent different levels of order, I will replace the values by hand. Also note that there are no `Po` entries in either the train or the test set. I've seen that other kagglers tend not to encode this value if there are no entries.

In [None]:
analyse_single_feature("BsmtQual", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: BsmtCond: Evaluates the general condition of the basement
       Ex	Excellent
       Gd	Good
       TA	Typical - slight dampness allowed
       Fa	Fair - dampness or some cracking or settling
       Po	Poor - Severe cracking, settling, or wetness
       NA	No Basement
- **Missing value**: ~80
- **Cardinality**: Low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**: The data description tells us that a NaN most likely stands for `NA`, i.e. no basement.
- **EDA**: As the condition of the basement improves, the SalePrice also increases. However, we see some very high SalePrice values for the houses with "Typical" basement conditions. This perhaps suggests that although these two features correlate positively, BsmtCond may not have a largely influential contribution on SalePrice. We also see the largest number of houses falling into the "TA" category. Since this feature is ordinal, I will replace the values by hand.

In [None]:
analyse_single_feature("BsmtCond", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- Note also that having no basement correlates better with `SalePrice` than having one in poor general condition!
- However, the trend is not replicated for `BsmtQual`: there, not having a basement shows less correlation than having one in fair condition.
- Unfortunately, this statement is very weak because we have no `Po` entries for the `BsmtQual` feature.

In [None]:
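plt.subplots(figsize=(20, 7))

# Work on a copy and turn the missing basements into an explicit "NA" category
dummy = copy.deepcopy(train)
dummy['BsmtCond'] = dummy['BsmtCond'].fillna("NA")
dummy['BsmtQual'] = dummy['BsmtQual'].fillna("NA")

# Same category order on both panels, from no basement up to excellent,
# so that no class present in the data is silently left out of the plot
order = ['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex']

plt.subplot(1, 2, 1)
sns.boxplot(x="BsmtCond", y="SalePrice", data=dummy, order=order)

plt.subplot(1, 2, 2)
sns.boxplot(x="BsmtQual", y="SalePrice", data=dummy, order=order)
plt.show()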
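
A numeric summary can complement the box plots when comparing houses without a basement to the other classes. The sketch below uses made-up prices, so it only demonstrates the mechanics and says nothing about the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN stands for "no basement".
toy = pd.DataFrame({
    "BsmtCond": ["TA", "Po", np.nan, "Fa", "TA", np.nan, "Gd"],
    "SalePrice": [180000, 70000, 100000, 120000, 200000, 110000, 210000],
})

order = ["NA", "Po", "Fa", "TA", "Gd", "Ex"]

# Count and median SalePrice per class, listed from worst to best.
summary = (
    toy.assign(BsmtCond=toy["BsmtCond"].fillna("NA"))
       .groupby("BsmtCond")["SalePrice"]
       .agg(["count", "median"])
       .reindex(order)
)
print(summary)
```

Classes with no rows (here `Ex`) show up as NaN after the `reindex`, which makes empty categories easy to spot.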

- **Feature**: BsmtExposure: Refers to walkout or garden level walls
       Gd	Good Exposure
       Av	Average Exposure (split levels or foyers typically score average or above)	
       Mn	Minimum Exposure
       No	No Exposure
       NA	No Basement
- **Missing value**: ~80
- **Cardinality**: Low
- **Type**: Categorical -> ordinal
- **Encoding needed**: Yes
- **Comments**: Use `NA` (No Basement) for null value imputation
- **EDA**: As the amount of exposure increases, so does the typical `SalePrice`. Interestingly, the average difference of `SalePrice` between categories is quite low here which is telling me that some houses are sold for very high prices, even with no exposure. 

In [None]:
analyse_single_feature("BsmtExposure", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: BsmtFinType1
       GLQ	Good Living Quarters
       ALQ	Average Living Quarters
       BLQ	Below Average Living Quarters	
       Rec	Average Rec Room
       LwQ	Low Quality
       Unf	Unfinished
       NA	No Basement    
- **Missing value**: ~79
- **Cardinality**: Low
- **Type**: Categorical -> use one-hot encoding
- **Encoding needed**: Yes
- **Comments**: Use `NA` (No Basement) for null value imputation
- **EDA**: Houses with an unfinished basement on average sold for more money than houses having an average rating. However, houses with a good finish within the basement still demand more money than unfinished ones. This is an ordinal feature, however as you can see this order does not necessarily cause a higher SalePrice. **By creating an ordinal variable we are suggesting that as the order of the feature increases, then the target variable would increase as well. We can see that this is not the case. Therefore, I will create dummy variables from this feature.**

In [None]:
analyse_single_feature("BsmtFinType1", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: BsmtFinSF1: Type 1 finished square feet
- **Missing value**: ~1
- **Cardinality**: ~ 40%
- **Type**: Numerical
- **Encoding needed**: No (yes if we use bins!)
- **Comments**: Use 0.0 as null value imputation. The mean will be too skewed to the right due to outliers.
- **EDA**: This feature has a positive correlation with SalePrice and the spread of data points is quite large. It is also clear that the local area (Neighborhood) and style of building (BldgType, HouseStyle and LotShape) have a varying effect on this feature. Since this is a continuous numeric feature we have two options:
    - Leave it as it is 
    - Bin it and then use one-hot encoding. 

In [None]:
analyse_single_feature("BsmtFinSF1", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

In [None]:
dummy_train = pd.cut(train['BsmtFinSF1'], 4)
for i in dummy_train.unique():
    print(i)

In [None]:
dummy_test = pd.cut(test['BsmtFinSF1'], 4)
for i in dummy_test.unique():
    print(i)

- **Feature**: BsmtFinType2: Rating of basement finished area (if multiple types)
       GLQ	Good Living Quarters
       ALQ	Average Living Quarters
       BLQ	Below Average Living Quarters	
       Rec	Average Rec Room
       LwQ	Low Quality
       Unf	Unfinished
       NA	No Basement
- **Missing value**: ~80
- **Cardinality**: ~ Low
- **Type**: Categorical -> use one-hot encoding
- **Encoding needed**: Yes
- **Comments**: Use `NA` (No Basement) for null value imputation
- **EDA**: There are a lot of houses with unfinished second basements, and this may cause the skew in terms of SalePrice's being relatively high for these. There are only a few values for each of the other categories, with the highest average SalePrice coming from the second best category. Although this is intended to be an ordinal feature, we can see that the SalePrice does not necessarily increase with order. Thus, I will use one-hot encoding.

In [None]:
analyse_single_feature("BsmtFinType2", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: BsmtFinSF2, Type 2 finished square feet
- **Missing value**: 1
- **Cardinality**: ~ 10%
- **Type**: Numerical -> use a flag
- **Encoding needed**: Yes
- **Comments**: 
- **EDA**: There are a large number of data points with this feature = 0. Apart from this, there is no significant correlation with `SalePrice` and there is a large spread of values. Hence, I will replace this feature with a flag.

In [None]:
analyse_single_feature("BsmtFinSF2", train, test,
                       correlated_feature="SalePrice", target="SalePrice")
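
Replacing a mostly-zero area with a flag could be sketched as follows. The values are made up, `HasBsmtFinSF2` is a hypothetical column name, and treating the single missing entry as zero is an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"BsmtFinSF2": [0.0, 0.0, 294.0, np.nan, 0.0, 1120.0]})

# Treat the missing entry as "no second finished area" (an assumption).
area = toy["BsmtFinSF2"].fillna(0.0)

# 1 if there is any type 2 finished area, 0 otherwise; then drop the original.
toy["HasBsmtFinSF2"] = (area > 0).astype(int)
toy = toy.drop(columns="BsmtFinSF2")

print(toy)
```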

- **Feature**: BsmtUnfSF, Unfinished square feet of basement area
- **Missing value**: 1
- **Cardinality**: ~ 50%
- **Type**: Numerical
- **Encoding needed**: Yes
- **Comments**: Use median for nan value imputation
- **EDA**: This feature has a significant positive correlation with SalePrice, with a small proportion of data points having a value of 0. This tells me that most houses will have some amount of square feet unfinished within the basement, and this actually positively contributes towards SalePrice. The amount of unfinished square feet also varies widely based on location and style, whereas the average unfinished square feet within the basement is fairly consistent across the different lot shapes. Since this is a continuous numeric feature with a significant correlation, one option could be to create bins and then create dummy variables.

In [None]:
analyse_single_feature("BsmtUnfSF", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

In [None]:
dummy_train = pd.cut(train['BsmtUnfSF'], 4)
for i in dummy_train.unique():
    print(i)

In [None]:
dummy_test = pd.cut(test['BsmtUnfSF'], 4)
for i in dummy_test.unique():
    print(i)

- **Feature**: TotalBsmtSF, Total square feet of basement area
- **Missing value**: 1
- **Cardinality**: ~ 50%
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: Use the median for NaN value imputation. More important is the knowledge that the distribution is NOT normal and that there are many entries with zeros. If we were to use a log transform, we would have to deal with the entries that have a value of zero. In the figure below we can see how the log transform on the original data would not work because of these zero entries. We'll take note of that and do something about it later on.
- **EDA**: This is a very important feature within my analysis, due to such a high correlation with `SalePrice`. We can see that it varies widely based on location; however, the average basement size has a lower variance based on type, style and lot shape. I will create some bins and dummy variables.

In [None]:
analyse_single_feature("TotalBsmtSF", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

In [None]:
dummy_train = pd.cut(train['TotalBsmtSF'], 10)
for i in dummy_train.unique():
    print(i)

In [None]:
dummy_test = pd.cut(test['TotalBsmtSF'], 10)
for i in dummy_test.unique():
    print(i)
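
The problem with zero entries and the log transform, noted in the comments for `TotalBsmtSF`, can be seen on a few made-up values. Using `np.log1p` is only one possible workaround and is an assumption here, not necessarily what is done later in this series.

```python
import numpy as np
import pandas as pd

toy = pd.Series([0.0, 0.0, 864.0, 1262.0, 2110.0], name="TotalBsmtSF")

# np.log maps 0 to -inf, which is what breaks the plain log transform.
with np.errstate(divide="ignore"):
    plain_log = np.log(toy)

# np.log1p computes log(1 + x), so zeros stay at 0 and every value is finite.
shifted_log = np.log1p(toy)

print(plain_log.tolist())
print(shifted_log.tolist())
```

The transform can be undone with `np.expm1`, which is handy if the same trick is applied to a target variable.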

- **Feature**: Heating
       Floor	Floor Furnace
       GasA	Gas forced warm air furnace
       GasW	Gas hot water or steam heat
       Grav	Gravity furnace	
       OthW	Hot water or steam heat other than gas
       Wall	Wall furnace
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**: No imputation is needed since there are no missing values.
- **EDA**: We see the highest frequency and highest average SalePrice coming from "GasA" and a very low frequency from all other classes. Hence, I will create a flag to indicate whether "GasA" is present or not and drop the Heating feature.

In [None]:
analyse_single_feature("Heating", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: HeatingQC
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical -> ordinal
- **Encoding needed**: Yes
- **Comments**: No imputation is needed since there are no missing values.
- **EDA**: Here we see a positive correlation with SalePrice as the heating quality increases, with "Ex" bringing the highest average SalePrice. We also see a high number of houses with this heating quality, which means most houses had very good heating! This is a categorical feature which exhibits an order.

In [None]:
analyse_single_feature("HeatingQC", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: CentralAir
       N	No
       Y	Yes
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Categorical -> ordinal
- **Encoding needed**: Yes
- **Comments**: No imputation is needed since there are no missing values.
- **EDA**: We see that houses with central air conditioning are able to demand a higher average `SalePrice` than ones without. For this feature, I will simply replace the categories with the numbers 0 and 1. However, please consider this:
    - If we use 0, 1, we are also implicitly giving an order as 0 < 1, which seems appropriate here.
    - If we use one-hot encoding then the algorithm would not see an order and the risk of the algorithm learning a possible order would be eliminated.

In [None]:
analyse_single_feature("CentralAir", train, test,
                       correlated_feature="SalePrice", target="SalePrice")
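
The two options discussed for `CentralAir` can be compared side by side on made-up values. This is just a sketch of the two encodings.

```python
import pandas as pd

toy = pd.DataFrame({"CentralAir": ["Y", "N", "Y", "Y"]})

# Option 1: replace the two categories by hand with 0 and 1.
by_hand = toy["CentralAir"].map({"N": 0, "Y": 1})

# Option 2: one-hot encoding, which yields one column per category.
one_hot = pd.get_dummies(toy["CentralAir"], prefix="CentralAir", dtype=int)

print(by_hand.tolist())
print(one_hot)
```

For a feature with only two categories, the `CentralAir_Y` column is identical to the hand-made 0/1 column and `CentralAir_N` is simply its complement, so both options carry the same information.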

- **Feature**: Electrical: Electrical system
       SBrkr	Standard Circuit Breakers & Romex
       FuseA	Fuse Box over 60 AMP and all Romex wiring (Average)	
       FuseF	60 AMP Fuse Box and mostly Romex wiring (Fair)
       FuseP	60 AMP Fuse Box and mostly knob & tube wiring (poor)
       Mix	Mixed
- **Missing value**: 1
- **Cardinality**: Low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**: Use mode for nan value imputation
- **EDA**: We see the highest average `SalePrice` coming from houses with `SBrkr` electrics, and these are also the most frequent electrical systems installed in the houses from this area. We have 2 categories in particular that have very low frequencies, `FuseP` and `Mix`. I am going to cluster all the classes related to fuses.

In [None]:
analyse_single_feature("Electrical", train, test,
                       correlated_feature="SalePrice", target="SalePrice")
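
Clustering the fuse-related classes, after the mode imputation, might be sketched like this. The data is made up, and the group label `Fuse` and the column name `Electrical_grp` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Electrical": ["SBrkr", "FuseA", "SBrkr", np.nan, "FuseP", "FuseF", "Mix", "SBrkr"],
})

# Impute the missing entry with the most frequent class (the mode).
toy["Electrical"] = toy["Electrical"].fillna(toy["Electrical"].mode()[0])

# Collapse the three fuse box classes into a single one.
fuse_classes = {"FuseA": "Fuse", "FuseF": "Fuse", "FuseP": "Fuse"}
toy["Electrical_grp"] = toy["Electrical"].replace(fuse_classes)

counts = toy["Electrical_grp"].value_counts()
print(counts)
```

What to do with the rare `Mix` class is left open in this sketch.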

- **Feature**: 1stFlrSF: First Floor square feet
- **Missing value**: None
- **Cardinality**: 50%
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: 
- **EDA**: Clearly this shows a very high positive correlation with SalePrice; this will be an important feature during modeling. Once again, this feature varies greatly across neighborhoods, and its size varies across building types and styles. It does not vary so much across the lot size. Since this is a continuous numeric feature, once again I will bin this feature and create dummy variables.

In [None]:
analyse_single_feature("1stFlrSF", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

In [None]:
dummy_train = pd.cut(train['1stFlrSF'], 6)
for i in dummy_train.unique():
    print(i)

In [None]:
dummy_test = pd.cut(test['1stFlrSF'], 6)
for i in dummy_test.unique():
    print(i)

- **Feature**: 2ndFlrSF, Second floor square feet
- **Missing value**: None
- **Cardinality**: 25%
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: 
- **EDA**: Interestingly we see a highly positively correlated relationship with `SalePrice`; however, we also see a significant number of houses with value = 0. We also see a high dependence and variation between neighborhoods, building types and lot sizes. It is evident that all the variables related to "space" are important in this analysis.

In [None]:
analyse_single_feature("2ndFlrSF", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

In [None]:
dummy_train = pd.cut(train['2ndFlrSF'], 6)
for i in dummy_train.unique():
    print(i)

In [None]:
dummy_test = pd.cut(test['2ndFlrSF'], 6)
for i in dummy_test.unique():
    print(i)

- **Feature**: LowQualFinSF, Low quality finished square feet (all floors)
- **Missing value**: None
- **Cardinality**: 1%
- **Type**: Numerical -> use a flag
- **Encoding needed**: No
- **Comments**:
- **EDA**: We can see that there is a large number of properties with a value of 0 for this feature. Clearly, it does not have a significant correlation with SalePrice. For this reason, I will replace this feature with a flag.

In [None]:
analyse_single_feature("LowQualFinSF", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: GrLivArea, Above grade (ground) living area square feet
- **Missing value**: None
- **Cardinality**: 60%
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: We can see that there are large values of `GrLivArea` that have low prices. These look like outliers. We'll keep this in mind and check them out under the `Detect outliers` section. There is also one in the test set, but we obviously can't drop that one, otherwise the submission will be invalid.
Can we use linear regression here? Ideally, if the assumptions are met, the residuals will be randomly scattered around the centerline of zero with no apparent pattern. The residuals will look like an unstructured cloud of points centered around zero. However, our residual plot is anything but an unstructured cloud of points. Even though it seems like there is a linear relationship between the response variable and predictor variable, the residual plot looks more like a funnel. The error plot shows that as the GrLivArea value increases, the variance also increases, which is the characteristic known as heteroscedasticity. Credit to this [reference](https://www.kaggle.com/masumrumi/a-detailed-regression-guide-with-house-pricing/notebook).
- **EDA**: We see a very high positive correlation with `SalePrice`. We also see the values varying very highly between styles of houses and neighborhood. We could create some bins and dummy features. The code snippet for this is reported under `feature_engineering`.

In [None]:
analyse_single_feature("GrLivArea", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

In [None]:
dummy_train = pd.cut(train['GrLivArea'], 6)
for i in dummy_train.unique():
    print(i)

In [None]:
dummy_test = pd.cut(test['GrLivArea'], 6)
for i in dummy_test.unique():
    print(i)
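
The funnel-shaped residuals described in the `GrLivArea` comments can be reproduced on synthetic data. In this sketch the noise is multiplicative by construction, `residual_spread_ratio` is a hypothetical helper, and the log transform is shown as one common remedy rather than as the method used in this series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the noise is multiplicative, so its size grows with the area.
area = rng.uniform(500, 4000, size=2000)
price = 100 * area * np.exp(rng.normal(0, 0.2, size=2000))


def residual_spread_ratio(x, y):
    """Fit y = a*x + b and compare the residual spread for large vs small x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    large = x > np.median(x)
    return residuals[large].std() / residuals[~large].std()


# Raw scale: the residuals fan out as the area grows (ratio well above 1).
ratio_raw = residual_spread_ratio(area, price)

# Log scale on both variables: the spread is roughly constant (ratio near 1).
ratio_log = residual_spread_ratio(np.log(area), np.log(price))

print(ratio_raw, ratio_log)
```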

- **Feature**: BsmtFullBath, Basement full bathrooms
- **Missing value**: None
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**:
- **EDA**:

In [None]:
analyse_single_feature("BsmtFullBath", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: BsmtHalfBath
- **Missing value**: 2
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: Use zero for missing value imputation

In [None]:
analyse_single_feature("BsmtHalfBath", train, test, correlated_feature="SalePrice", target="SalePrice")

- **Feature**: FullBath, Full bathrooms above grade
- **Missing value**: 0
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: Use zero for missing value imputation

In [None]:
analyse_single_feature("FullBath", train, test, correlated_feature="SalePrice", target="SalePrice")

- **Feature**: HalfBath, Half baths above grade
- **Missing value**: 0
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: Use zero for missing value imputation

In [None]:
analyse_single_feature("HalfBath", train, test, correlated_feature="SalePrice", target="SalePrice")

- **Feature**: KitchenQual
       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
- **Missing value**: 1
- **Cardinality**: low
- **Type**: Categorical -> ordinal
- **Encoding needed**: Yes
- **Comments**: use TA Typical/Average for missing value imputation
- **EDA**: There is a clear positive correlation with the `SalePrice` and the quality of the kitchen. There is one value for "Gd" that has an extremely high SalePrice however.

In [None]:
analyse_single_feature("KitchenQual", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: TotRmsAbvGrd, Total rooms above grade (does not include bathrooms)
- **Missing value**: 0
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: No imputation is needed since there are no missing values.
- **EDA**: Generally we see a positive correlation, as the number of rooms increases, so does the SalePrice. However due to low frequency, we do see some unreliable results for the very large and small values for this feature. Since this is a discrete numerical feature, I will leave it as it is.

In [None]:
analyse_single_feature("TotRmsAbvGrd", train, test, correlated_feature="SalePrice", target="SalePrice")

- **Feature**: Functional, Home functionality (Assume typical unless deductions are warranted)
       Typ	Typical Functionality
       Min1	Minor Deductions 1
       Min2	Minor Deductions 2
       Mod	Moderate Deductions
       Maj1	Major Deductions 1
       Maj2	Major Deductions 2
       Sev	Severely Damaged
       Sal	Salvage only
- **Missing value**: 2
- **Cardinality**: low
- **Type**: Categorical -> ordinal
- **Encoding needed**: Yes
- **Comments**: Use the mode for missing value imputation
- **EDA**: This categorical feature shows that most houses have "Typ" functionality, and looking at the data description leads me to believe that there is an order within these categories, "Typ" being of the highest order. Therefore, I will replace the values of this feature by hand with numbers.

In [None]:
analyse_single_feature("Functional", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: Fireplaces, Number of fireplaces
- **Missing value**: 0
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: use mode for missing value imputation
- **EDA**: Once again we have a positive correlation with SalePrice, with most houses having just 1 or 0 fireplaces. I will leave this feature as it is.

In [None]:
analyse_single_feature("Fireplaces", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: FireplaceQu

       Ex	Excellent - Exceptional Masonry Fireplace
       Gd	Good - Masonry Fireplace in main level
       TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
       Fa	Fair - Prefabricated Fireplace in basement
       Po	Poor - Ben Franklin Stove
       NA	No Fireplace
       
- **Missing value**: 730
- **Cardinality**: low
- **Type**: Categorical, use Label Encoding and **NOT** one-hot encoding because order is important.
- **Encoding needed**: Yes
- **Comments**: use NA-> No Fireplace for missing value imputation
- **EDA**: We see a positive correlation as the fireplace quality increases. Most houses have either "TA" or "Gd" quality fireplaces. Since this is a categorical feature with order, I will replace the values by hand.

In [None]:
analyse_single_feature("FireplaceQu", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: GarageType
       2Types	More than one type of garage
       Attchd	Attached to home
       Basment	Basement Garage
       BuiltIn	Built-In (Garage part of house - typically has room above garage)
       CarPort	Car Port
       Detchd	Detached from home
       NA	No Garage
- **Missing value**: 76
- **Cardinality**: low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**: use NA-> No Garage for missing value imputation
- **EDA**: Here we see "BuiltIn" and "Attchd" having the 2 highest average SalePrices, with only a few extreme values within each class. Since this is categorical without order, I will create dummy variables.

In [None]:
analyse_single_feature("GarageType", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: GarageYrBlt: Year garage was built.
- **Missing value**: 159
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: Use the mode only if `GarageType` is different from NA? Otherwise, something must have gone wrong!
- **EDA**: We can see a slight upward trend as the garage building year becomes more modern. We could create bins and then dummy variables on it.

In [None]:
analyse_single_feature("GarageYrBlt", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

In [None]:
dummy_train = pd.cut(train['GarageYrBlt'], 3)
for i in dummy_train.unique():
    print(i)

In [None]:
dummy_test = pd.cut(test['GarageYrBlt'], 3)
for i in dummy_test.unique():
    print(i)
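
The conditional imputation suggested in the `GarageYrBlt` comments could be sketched as follows, on made-up rows: the mode is used only where a garage exists but its year is missing, and houses without a garage are left untouched.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GarageType":  ["Attchd", "Detchd", np.nan, "Attchd", "Detchd"],
    "GarageYrBlt": [2003.0, np.nan, np.nan, 2003.0, 1976.0],
})

# A garage exists when GarageType is not missing.
has_garage = toy["GarageType"].notna()

# Rows that need a year: garage present but year missing.
needs_fill = has_garage & toy["GarageYrBlt"].isna()

# Mode of the known years, computed on houses with a garage only.
mode_year = toy.loc[has_garage, "GarageYrBlt"].mode()[0]
toy.loc[needs_fill, "GarageYrBlt"] = mode_year

print(toy)
```

How to fill the year for houses without a garage is a separate decision that this sketch does not take.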

- **Feature**: GarageFinish
       Fin	Finished
       RFn	Rough Finished	
       Unf	Unfinished
       NA	No Garage
- **Missing value**: 159
- **Cardinality**: low
- **Type**: Categorical -> use one-hot encoding
- **Encoding needed**: Yes
- **Comments**: At the moment we are using NA for imputation but that needs to be cross-checked!
- **EDA**: Here we see a nice split between the 3 classes, with "Fin" having the highest SalePrice on average. I will create dummy variables for this feature.

In [None]:
analyse_single_feature("GarageFinish", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: GarageCars
- **Missing value**: 1
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: Yes
- **Comments**: Use 0.0 for missing value imputation
- **EDA**: We generally see a positive correlation with an increasing garage car capacity. However, we see a slight dip for 4 cars I believe due to the low frequency of houses with a 4 car garage.

In [None]:
analyse_single_feature("GarageCars", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: GarageArea
- **Missing value**: 1
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: Yes
- **Comments**: Use 0.0 for missing value imputation
- **EDA**: This has an extremely high positive correlation with SalePrice, and it is highly dependent on Neighborhood, building type and style of the house. This could be an important feature in the analysis, so I will bin this feature and create dummy variables.

In [None]:
analyse_single_feature("GarageArea", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: GarageQual
       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       NA	No Garage
- **Missing value**: 159
- **Cardinality**: low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**: use NA -> No Garage for missing values imputation
- **EDA**: We see a lot of homes having "TA" quality garages, with very few homes having high quality and low quality ones. I am going to cluster the classes here, and then create dummy variables.

In [None]:
analyse_single_feature("GarageQual", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: GarageCond
       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       NA	No Garage
- **Missing value**: 159
- **Cardinality**: low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**: use NA -> No Garage for missing values imputation. GarageCond and GarageQual can be merged together?
- **EDA**: We see a fairly similar pattern to the previous feature. We see a slight positive correlation and then a dip, which I believe is due to the low number of houses that have "Ex" or "Gd" garage conditions. Similarly to before, I am going to cluster and then dummy (one-hot encoding) this feature.

In [None]:
analyse_single_feature("GarageCond", train, test,
                       correlated_feature="SalePrice", target="SalePrice")
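For both `GarageQual` and `GarageCond` the plan is to cluster the rare classes and then one-hot encode. A minimal sketch of that idea on made-up values. The grouping in `quality_groups` is a hypothetical choice of mine (the notebook does not say which classes to merge) and the group names are illustrative only:

```python
import pandas as pd

# Toy stand-in for GarageQual after NaN has been imputed with "NA" (made-up values)
toy = pd.Series(["TA", "TA", "Gd", "Ex", "Fa", "Po", "NA", "TA"], name="GarageQual")

# Hypothetical grouping: merge the rare levels above and below "TA"
quality_groups = {
    "Ex": "AboveAverage", "Gd": "AboveAverage",
    "TA": "Average",
    "Fa": "BelowAverage", "Po": "BelowAverage",
    "NA": "NoGarage",
}
clustered = toy.map(quality_groups)

# One-hot encode the clustered classes
dummies = pd.get_dummies(clustered, prefix="GarageQual", dtype=int)
print(dummies.sum())
```

Since `map` returns NaN for any level missing from the dictionary, checking `clustered.isna().sum()` afterwards is a cheap way to catch unexpected classes.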

- **Feature**: PavedDrive
       Y	Paved 
       P	Partial Pavement
       N	Dirt/Gravel
- **Missing value**: 0
- **Cardinality**: low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**:
- **EDA**: Here we see the highest average price being demanded from houses with a paved driveway, and most houses in this area seem to have one. Since this is a categorical feature without order, I will create dummy variables.

In [None]:
analyse_single_feature("PavedDrive", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: WoodDeckSF, Wood deck area in square feet
- **Missing value**: 0
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**:
- **EDA**: This feature has a high positive correlation with SalePrice. We can also see that it varies widely with location, building type, style and size of the lot. There is a significant number of data points with a value of 0, so I will create a flag to indicate no Wood Deck. Then, since this is a continuous numeric feature, and I believe it to be an important one, I will bin this and then create dummy features. 

In [None]:
analyse_single_feature("WoodDeckSF", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

In [None]:
dummy_train = pd.cut(train['WoodDeckSF'], 4)
for i in dummy_train.unique():
    print(i)

In [None]:
dummy_test = pd.cut(test['WoodDeckSF'], 4)
for i in dummy_test.unique():
    print(i)
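The EDA note for `WoodDeckSF` describes two steps: a flag for houses without a deck, then binning plus dummy variables. A minimal sketch on made-up values. The column names `NoWoodDeck` and `WoodDeckSF_bin` are hypothetical, and binning only the strictly positive areas is my own assumption, made so that the many zeros do not fill up the first bin:

```python
import pandas as pd

# Toy stand-in for the WoodDeckSF column (made-up values)
toy = pd.DataFrame({"WoodDeckSF": [0, 0, 120, 250, 0, 400, 80]})

# Flag: 1 when the house has no wood deck
toy["NoWoodDeck"] = (toy["WoodDeckSF"] == 0).astype(int)

# Bin only the strictly positive areas; zeros become NaN and stay out of the bins
positive = toy["WoodDeckSF"].where(toy["WoodDeckSF"] > 0)
toy["WoodDeckSF_bin"] = pd.cut(positive, bins=4)

# One dummy column per bin; rows without a deck get zeros in all of them
dummies = pd.get_dummies(toy["WoodDeckSF_bin"], prefix="WoodDeckSF", dtype=int)
print(pd.concat([toy[["WoodDeckSF", "NoWoodDeck"]], dummies], axis=1))
```

The same pattern would apply to `OpenPorchSF`, which follows the same flag-then-bin plan.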

- **Feature**: OpenPorchSF, Open porch area in square feet
- **Missing value**: 0
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**:
- **EDA**: We can see a high number of data points having a value of 0 here once again. Apart from this, we see a high positive correlation with SalePrice showing that this may be an influential factor for analysis. Finally, we see that this value ranges widely based on location, building type, style and lot. I will create a flag to indicate no open porch, then I will bin the feature and create dummy variables.

In [None]:
analyse_single_feature("OpenPorchSF", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

In [None]:
dummy_train = pd.cut(train['OpenPorchSF'], 4)
for i in dummy_train.unique():
    print(i)

In [None]:
dummy_test = pd.cut(test['OpenPorchSF'], 4)
for i in dummy_test.unique():
    print(i)

- **Feature**: EnclosedPorch, Enclosed porch area in square feet
- **Missing value**: 0
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: will combine this feature into a single `TotalPorchSF` feature
- **EDA**:

In [None]:
analyse_single_feature("EnclosedPorch", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: 3SsnPorch, Three season porch area in square feet
- **Missing value**: 0
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: will combine this feature into a single `TotalPorchSF` feature
- **EDA**:

In [None]:
analyse_single_feature("3SsnPorch", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: ScreenPorch, Screen porch area in square feet
- **Missing value**: 0
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: will combine this feature into a single `TotalPorchSF` feature

In [None]:
analyse_single_feature("ScreenPorch", train, test,
                       correlated_feature="SalePrice", target="SalePrice")
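The comments on the porch features mention combining them into a single `TotalPorchSF` feature. A minimal sketch of that combination on made-up values. Whether `OpenPorchSF` should be part of the sum is an assumption on my side, and the `HasPorch` flag is a hypothetical extra:

```python
import pandas as pd

# Toy stand-in for the porch columns (made-up values)
toy = pd.DataFrame({
    "OpenPorchSF": [0, 35, 60],
    "EnclosedPorch": [0, 0, 112],
    "3SsnPorch": [0, 0, 0],
    "ScreenPorch": [0, 120, 0],
})

# Assumption: all four porch areas contribute to the total
porch_columns = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]
toy["TotalPorchSF"] = toy[porch_columns].sum(axis=1)

# Hypothetical companion flag for houses with any kind of porch
toy["HasPorch"] = (toy["TotalPorchSF"] > 0).astype(int)
print(toy[["TotalPorchSF", "HasPorch"]])
```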

- **Feature**: PoolArea: Pool area in square feet
- **Missing value**: 0
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**: will combine this feature into a single `TotalPorchSF` feature
- **EDA**: We see almost 0 correlation due to the high number of houses without a pool. Hence, I will create a flag here, and then we'll drop the feature.

In [None]:
analyse_single_feature("PoolArea", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: PoolQC: Pool quality
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       NA	No Pool
- **Missing value**: 2909
- **Cardinality**: low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**: Use NA-> No Pool for data imputation as suggested by the `data_description.txt`
- **EDA**: Due to not many houses having a pool, we see very low numbers of observations for each class. Since this feature does not hold much information, I will simply remove it. Also consider what we discussed for the `PoolArea` feature.

In [None]:
analyse_single_feature("PoolQC", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: Fence: Fence quality
    		
       GdPrv	Good Privacy
       MnPrv	Minimum Privacy
       GdWo	Good Wood
       MnWw	Minimum Wood/Wire
       NA	No Fence
- **Missing value**: 2348
- **Cardinality**: low
- **Type**: Categorical -> use one-hot encoding
- **Encoding needed**: Yes
- **Comments**: Use NA-> No Fence for data imputation
- **EDA**: Here we see that the houses with the most privacy have the highest average SalePrice. There seems to be a slight order within the classes, however some of the class descriptions are slightly ambiguous, therefore I will create dummy variables here from this categorical feature.

In [None]:
analyse_single_feature("Fence", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: MiscFeature: Miscellaneous feature not covered in other categories
		
       Elev	Elevator
       Gar2	2nd Garage (if not described in garage section)
       Othr	Other
       Shed	Shed (over 100 SF)
       TenC	Tennis Court
       NA	None
- **Missing value**: 1408
- **Cardinality**: low
- **Type**: Categorical
- **Encoding needed**: Yes
- **Comments**: Use NA-> None for data imputation
- **EDA**: We can see here that only a low number of houses in this area have any miscellaneous features. Hence, I do not believe that this feature holds much information. Therefore I will drop this feature along with MiscVal.

In [None]:
analyse_single_feature("MiscFeature", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: MiscVal: Value of miscellaneous feature
- **Missing value**: 2814
- **Cardinality**: low
- **Type**: Numerical
- **Encoding needed**: No
- **Comments**:
- **EDA**: We can see here that only a low number of houses in this area have any miscellaneous features. Hence, I do not believe that this feature holds much information. Therefore I will drop this feature along with MiscFeature.

In [None]:
analyse_single_feature("MiscVal", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: MoSold: Month Sold (MM)
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Numerical but treated as categorical.
- **Encoding needed**: No
- **Comments**: Categorical data represent characteristics such as a person’s gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as “1” indicating male and “2” indicating female), but those numbers don’t have mathematical meaning. You couldn’t add them together, for example. (Other names for categorical data are qualitative data, or Yes/No data.)
- **EDA**: Although this feature is numeric, it should really be a category. We can see that there is no real indication of any month that consistently sold houses at a higher price; however, there seems to be a fairly even distribution of values between classes. I will create dummy variables from each category.

In [None]:
analyse_single_feature("MoSold", train, test,
                       correlated_feature="SalePrice", target="SalePrice")

- **Feature**: YrSold: Year Sold (YYYY)
- **Missing value**: None
- **Cardinality**: Low
- **Type**: Numerical but treated as categorical.
- **Encoding needed**: No
- **Comments**:
- **EDA**: Here we see just a 5 year time period over which the houses in this dataset were sold. There is an even distribution of values between each class, and each year has a very similar average SalePrice. Even though this is numeric, it should be categorical. Therefore I will create dummy variables.

In [None]:
analyse_single_feature("YrSold", train, test,
                       correlated_feature="SalePrice", target="SalePrice")
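Both `MoSold` and `YrSold` are stored as numbers but are meant to be treated as categories. A minimal sketch, on made-up values, of how the dummy variables could be created. Casting to string first is my own choice to make sure `pd.get_dummies` treats the codes as labels rather than magnitudes:

```python
import pandas as pd

# Toy stand-in for the two date columns (made-up values)
toy = pd.DataFrame({
    "MoSold": [6, 7, 6, 12],
    "YrSold": [2006, 2007, 2010, 2008],
})

# Cast to string so that the codes are treated as labels, not as magnitudes
as_category = toy[["MoSold", "YrSold"]].astype(str)

# One indicator column per month and per year found in the data
dummies = pd.get_dummies(as_category, dtype=int)
print(dummies.columns.tolist())
```

Left as integers, the two columns would pass through `pd.get_dummies` unchanged, because by default only object, string and category columns are encoded.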

# Other insights

- People tend to move during the summer?
- Is the trend consistent over the years?
- [reference](https://www.kaggle.com/janiobachmann/house-prices-useful-regression-techniques)

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(16,8))
sns.countplot(y="MoSold", hue="YrSold", data=train)
plt.show()
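To put a number on the two questions above, one option is to look at the share of each year's sales that fall in each month. A minimal sketch on made-up values. Taking June to August as "summer" is my own assumption:

```python
import pandas as pd

# Toy stand-in for train[['MoSold', 'YrSold']] (made-up values)
toy = pd.DataFrame({
    "MoSold": [6, 6, 7, 1, 6, 12, 3, 11],
    "YrSold": [2006, 2006, 2006, 2006, 2007, 2007, 2007, 2007],
})

# Share of each year's sales that happened in each month (every column sums to 1)
monthly_share = pd.crosstab(toy["MoSold"], toy["YrSold"], normalize="columns")

# Share of sales in the June-August window, per year (months with no sales count as 0)
summer_share = monthly_share.reindex([6, 7, 8], fill_value=0).sum()
print(summer_share)
```

If the summer share is similar across years, the seasonal trend is consistent. Large differences would point to a year-specific effect.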

- Which neighborhoods gave the most revenue?
- This might indicate higher demand toward certain neighborhoods.
- [reference](https://www.kaggle.com/janiobachmann/house-prices-useful-regression-techniques)

In [None]:
plt.style.use('seaborn-white')
zoning_value = train.groupby(by=['MSZoning'], as_index=False)[
    'SalePrice'].sum()
zoning = zoning_value['MSZoning'].values.tolist()


# Let's create a pie chart.
labels = ['C: Commercial', 'FV: Floating Village Res.', 'RH: Res. High Density', 'RL: Res. Low Density',
          'RM: Res. Medium Density']
total_sales = zoning_value['SalePrice'].values.tolist()
explode = (0, 0, 0, 0.1, 0)

fig, ax1 = plt.subplots(figsize=(12, 8))
texts = ax1.pie(total_sales, explode=explode, autopct='%.1f%%', shadow=True, startangle=90, pctdistance=0.8,
                radius=0.5)


ax1.axis('equal')
plt.title('Sales Groupby Zones', fontsize=16)
plt.tight_layout()
plt.legend(labels, loc='best')
plt.show()
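The total revenue shown in the pie chart mixes two effects: how many houses were sold in a zone and how expensive they were. A minimal sketch, on made-up values, that separates the two. The column names in `summary` are hypothetical:

```python
import pandas as pd

# Toy stand-in for train[['MSZoning', 'SalePrice']] (made-up values)
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RL", "RM", "FV"],
    "SalePrice": [200000, 180000, 220000, 120000, 250000],
})

# Total revenue, number of sales and median price per zone
summary = toy.groupby("MSZoning")["SalePrice"].agg(
    total="sum", n_sales="count", median="median"
)

# Share of the overall revenue, i.e. what the pie chart displays
summary["share_of_total"] = summary["total"] / summary["total"].sum()
print(summary)
```

A zone can dominate the revenue simply because it has many sales, so the median price is a useful companion when reading the chart as a sign of demand.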

In [None]:
plt.style.use('seaborn-white')
SalesbyZone = train.groupby(['YrSold','MSZoning']).SalePrice.count()
SalesbyZone.unstack().plot(kind='bar',stacked=True, colormap= 'gnuplot',  
                           grid=False,  figsize=(12,8))
plt.title('Building Sales (2006 - 2010) by Zoning', fontsize=18)
# The bars show the count of sales per year, not the sale price
plt.ylabel('Number of Sales', fontsize=14)
plt.xlabel('Year Sold', fontsize=14)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
sns.countplot(x="Neighborhood", data=train, palette="Set2")
ax.set_title("Types of Neighborhoods", fontsize=20)
ax.set_xlabel("Neighborhoods", fontsize=16)
ax.set_ylabel("Number of Houses Sold", fontsize=16)
ax.set_xticklabels(labels=train['Neighborhood'].unique(),rotation=45)
plt.show()

In [None]:
# Sawyer and SawyerW tend to be the most expensive neighborhoods. Nevertheless, what makes them the most expensive?
# Is it the LotArea or LotFrontage? Let's find out!
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.boxplot(x="Neighborhood", y="SalePrice", data=train)
ax.set_title("Range Value of the Neighborhoods", fontsize=18)
ax.set_ylabel('Price Sold', fontsize=16)
ax.set_xlabel('Neighborhood', fontsize=16)
ax.set_xticklabels(labels=train['Neighborhood'].unique(), rotation=45)
plt.show()

In [None]:
# Which Neighborhoods had the best Quality houses?
plt.style.use('seaborn-white')
types_foundations = train.groupby(['Neighborhood', 'OverallQual']).size()
types_foundations.unstack().plot(kind='bar', stacked=True, colormap='RdYlBu', figsize=(13,11), grid=False)
# The stacked bars count houses per OverallQual level, not prices
plt.ylabel('Number of Houses', fontsize=16)
plt.xlabel('Neighborhood', fontsize=16)
plt.xticks(rotation=90, fontsize=12)
plt.title('Overall Quality of the Neighborhoods', fontsize=18)
plt.show()

- **Overall Condition**: The condition of the house or building; it hints at whether further remodelling is likely to happen in the future, either for reselling or to accumulate real-estate value.
- **Overall Quality**: The quality of the house is one of the factors that most impacts SalePrice. It seems that the overall material used for construction and the finish of the house have a great impact on SalePrice.
- **Year of remodelling**: Houses in the high price range were remodelled sooner. The sooner the remodelling, the higher the value of the house.

- What follows was taken from [here](https://www.kaggle.com/ar2017/house-price-prediction-systematic-eda)

In [None]:
def create_boxen_count_2x2(df, first_feature, second_feature, figsize, feature_against):
    """
    Plot two features against `feature_against` on a 2x2 grid:
    count plots on the top row, boxen plots on the bottom row.
    """
    plt.figure(figsize=figsize)

    # Create boxen plot of first_feature against feature_against
    ax3 = plt.subplot(223)
    sns.boxenplot(x=first_feature, y=feature_against, data=df, color='dimgrey')

    # Create boxen plot of second_feature against feature_against
    ax4 = plt.subplot(224, sharey=ax3)
    sns.boxenplot(x=second_feature, y=feature_against, data=df, color='tomato')
    plt.setp(ax4.get_yticklabels(), visible=False)
    plt.ylabel('')

    #---------------------------------------------------------------------------------------

    # Create countplot of first_feature
    ax1 = plt.subplot(221, sharex=ax3)
    sns.countplot(x=first_feature, data=df, color='dimgrey')
    plt.setp(ax1.get_xticklabels(), visible=False)
    plt.xlabel('')

    # Create countplot of second_feature
    ax2 = plt.subplot(222, sharey=ax1, sharex=ax4)
    sns.countplot(x=second_feature, data=df, color='tomato')
    plt.setp(ax2.get_yticklabels(), visible=False)
    plt.setp(ax2.get_xticklabels(), visible=False)
    plt.ylabel('')
    plt.xlabel('')

    # Adjusting the spaces between graphs
    plt.subplots_adjust(wspace=0, hspace=0)

    plt.show()

In [None]:
create_boxen_count_2x1(train, 'MSZoning', (16,7), "SalePrice")

- One way to encode this is as follows. We'll make a copy for now and then do it properly under the `feature_engineering` section.
- In this way the feature is also plotted in order of importance (from high to low).
- Most of the houses are located in zone number 2 (residential low population density area), and houses located in lower population density areas generally have a higher `SalePrice` than houses in higher population density areas.

In [None]:
MSZoning_map = {'FV':1, 'RL':2, 'RM':3, 'RH':4, 'C (all)':5}
train_copy = copy.deepcopy(train)
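# Assign the result back: an in-place replace on a selected column may not update the frame in recent pandas
train_copy['MSZoning'] = train_copy['MSZoning'].map(MSZoning_map)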

create_boxen_count_2x1(train_copy, 'MSZoning', (16,7), "SalePrice")

- Price per square foot? Can we use this and plot it against the neighborhood to rank the best neighborhoods?

In [None]:
create_boxen_count_2x1(train, 'Neighborhood', (16,7), "SalePrice")
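The price-per-square-foot idea above can be sketched as follows, on made-up values. I am assuming `GrLivArea` (above-ground living area) as the area to divide by. The `PricePerSF` column name and the neighborhood labels "A", "B", "C" are hypothetical:

```python
import pandas as pd

# Toy stand-in for train[['Neighborhood', 'SalePrice', 'GrLivArea']] (made-up values)
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C"],
    "SalePrice": [200000, 240000, 150000, 90000, 300000],
    "GrLivArea": [1000, 1200, 1500, 900, 1000],
})

# Price per square foot of living area
toy["PricePerSF"] = toy["SalePrice"] / toy["GrLivArea"]

# Rank the neighborhoods by their median price per square foot
ranking = (
    toy.groupby("Neighborhood")["PricePerSF"]
    .median()
    .sort_values(ascending=False)
)
print(ranking)
```

The median is used instead of the mean so that a few very expensive houses do not dominate the ranking.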

- In this cell I just want to highlight how the correlation changes depending on which encoding we use:
    - If we use label encoding, we keep one column and assign a different value to each entry.
    - If we use one-hot encoding, we add one column for each distinct entry in the column.

- The results tell us two interesting things:
    - If we use label encoding, we get the general correlation of the feature with respect to the target. In this case, label encoding keeps the relative ordinal importance of the feature, and the feature turns out to be slightly negatively correlated with the target variable.
    - On the other hand, if we one-hot encode the feature, the model is unable to rank the entries of the feature, but we get an extra piece of information: some of the entries are instead positively correlated with the target. Nevertheless, we can still see that most of them are negatively correlated.

In [None]:
MSZoning_map

In [None]:
plt.figure(figsize=(6, 6))
train_copy = copy.deepcopy(train)
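# Assign the result back: an in-place replace on a selected column may not update the frame in recent pandas
train_copy['MSZoning'] = train_copy['MSZoning'].map(MSZoning_map)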

surround_feats = train_copy[['MSZoning', 'SalePrice']]
sns.heatmap(surround_feats.corr(), annot=True, cmap='RdBu')

plt.show()


In [None]:
plt.figure(figsize=(6, 6))
train_copy2 = copy.deepcopy(train)

surround_feats = train_copy2[['MSZoning', 'SalePrice']]
surround_feats = pd.get_dummies(surround_feats)

sns.heatmap(surround_feats.corr(), annot=True, cmap='RdBu')

plt.show()
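The same comparison can be done without the heatmaps. A minimal sketch on made-up prices, assuming the same zone-to-code mapping as above (re-declared here as `zoning_map` so that the block is self-contained):

```python
import pandas as pd

# Toy stand-in for train[['MSZoning', 'SalePrice']] (made-up values)
toy = pd.DataFrame({
    "MSZoning": ["FV", "FV", "RL", "RL", "RM", "RM", "C (all)"],
    "SalePrice": [250000, 230000, 200000, 180000, 130000, 120000, 80000],
})
zoning_map = {"FV": 1, "RL": 2, "RM": 3, "RH": 4, "C (all)": 5}

# Label encoding: one column, hence a single correlation coefficient
label_corr = toy["MSZoning"].map(zoning_map).corr(toy["SalePrice"])

# One-hot encoding: one column per level, hence one coefficient per level
one_hot = pd.get_dummies(toy["MSZoning"], prefix="MSZoning", dtype=float)
one_hot_corr = one_hot.corrwith(toy["SalePrice"])

print(f"label encoding: {label_corr:.2f}")
print(one_hot_corr.round(2))
```

In this toy data the single label-encoded coefficient is negative, while the one-hot coefficients have mixed signs, which is the extra piece of information mentioned above.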

- Now, we'd like to answer another question. **How is the correlation affected if we log the target?**
- The answer is that by logging the target we get (*for some, not for all!*) an increase in the absolute value of the correlation, meaning:
    - Those that were positively correlated are now even more so.
    - Those that were negatively correlated are now even more so.

In [None]:
plt.figure(figsize=(6, 6))
train_copy = copy.deepcopy(train)
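# Assign the result back: an in-place replace on a selected column may not update the frame in recent pandas
train_copy['MSZoning'] = train_copy['MSZoning'].map(MSZoning_map)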
train_copy['SalePrice'] = np.log1p(train_copy['SalePrice'])

surround_feats = train_copy[['MSZoning', 'SalePrice']]
sns.heatmap(surround_feats.corr(), annot=True, cmap='RdBu')

plt.show()

In [None]:
plt.figure(figsize=(6, 6))
train_copy2 = copy.deepcopy(train)
train_copy2['SalePrice'] = np.log1p(train_copy2['SalePrice'])

surround_feats = train_copy2[['MSZoning', 'SalePrice']]
surround_feats = pd.get_dummies(surround_feats)

sns.heatmap(surround_feats.corr(), annot=True, cmap='RdBu')

plt.show()
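Why would the log help for some features? A minimal sketch on made-up values, assuming a price that grows roughly multiplicatively with quality. In that situation the log turns a curved relationship into an almost linear one, and the Pearson correlation, which only measures linear association, goes up. This is an illustration on toy data, not a general rule:

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): the price grows by roughly 30% per quality step
toy = pd.DataFrame({
    "OverallQual": [3, 4, 5, 6, 7, 8, 9, 10],
    "SalePrice": [80000, 100000, 130000, 160000, 210000, 280000, 400000, 620000],
})

# Correlation with the raw target and with the logged target
raw_corr = toy["OverallQual"].corr(toy["SalePrice"])
log_corr = toy["OverallQual"].corr(np.log1p(toy["SalePrice"]))

print(f"raw target: {raw_corr:.3f}")
print(f"log target: {log_corr:.3f}")
```

For a feature whose relationship with the raw price is already linear, the log can just as well lower the correlation, which matches the "for some, not for all" remark above.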

In [None]:
create_boxen_count_2x2(train, 'Condition1', 'Condition2', (16,7), "SalePrice")

In [None]:
def create_boxen_count_2x3(df,first_feature, second_feature, third_feature, figsize, feature_against):
    plt.figure(figsize=figsize)

    # Create boxenplot of first_feature and Log_SalePrice
    ax4 = plt.subplot(234)
    sns.boxenplot(x=first_feature, y=feature_against, data=df, color='dimgrey')

    # Create boxenplot of second_feature and Log_SalePrice
    ax5 = plt.subplot(235, sharey=ax4)
    sns.boxenplot(x=second_feature, y=feature_against, data=df, color='tomato')
    plt.setp(ax5.get_yticklabels(), visible=False)
    plt.ylabel('')

    # Create boxenplot of third_feature and Log_SalePrice
    ax6 = plt.subplot(236, sharey=ax4)
    sns.boxenplot(x=third_feature, y=feature_against, data=df, color='darkseagreen')
    plt.setp(ax6.get_yticklabels(), visible=False)
    plt.ylabel('')

    #---------------------------------------------------------------------------------------

    # Create countplot of first_feature
    ax1 = plt.subplot(231, sharex=ax4)
    sns.countplot(x=first_feature, data=df, color='dimgrey')
    plt.setp(ax1.get_xticklabels(), visible=False)
    plt.xlabel('')

    # Create countplot of second_feature
    ax2 = plt.subplot(232, sharey=ax1, sharex=ax5)
    sns.countplot(x=second_feature, data=df, color='tomato')
    plt.setp(ax2.get_yticklabels(), visible=False)
    plt.setp(ax2.get_xticklabels(), visible=False)
    plt.ylabel('')
    plt.xlabel('')

    # Create countplot of third_feature
    ax3 = plt.subplot(233, sharey=ax1, sharex=ax6)
    sns.countplot(x=third_feature, data=df, color='darkseagreen')
    plt.setp(ax3.get_yticklabels(), visible=False)
    plt.setp(ax3.get_xticklabels(), visible=False)
    plt.ylabel('')
    plt.xlabel('')

    # Adjusting the spaces between graphs
    plt.subplots_adjust(wspace=0, hspace=0)

    plt.show()

- There seems to be no relationship between MSSubClass and SalePrice.
- There seems to be no relationship between BldgType and SalePrice.
- There seems to be no relationship between HouseStyle and SalePrice.

In [None]:
create_boxen_count_2x3(train, 'MSSubClass', 'BldgType', 'HouseStyle', (16,8), "SalePrice")

- Houses with higher OverallQual have higher SalePrice. In other words, there seems to be a strong relationship between OverallQual and SalePrice.
- There seems to be no relationship between OverallCond and SalePrice. Similarly, there seems to be no relationship between Functional and SalePrice. These features are not likely to help in predicting house prices.

In [None]:
create_boxen_count_2x3(train, 'OverallQual', 'OverallCond', 'Functional', (16,8), "SalePrice")

In [None]:
def create_boxen_count_2x4(df, first_feature, second_feature, third_feature, fourth_feature, figsize, feature_against):
    plt.figure(figsize=figsize)

    # Create boxenplot of first_feature and Log_SalePrice
    ax5 = plt.subplot(245)
    sns.boxenplot(x=first_feature, y=feature_against,
                  data=df, color='dimgrey')

    # Create boxenplot of second_feature and Log_SalePrice
    ax6 = plt.subplot(246, sharey=ax5)
    sns.boxenplot(x=second_feature, y=feature_against,
                  data=df, color='tomato')
    plt.setp(ax6.get_yticklabels(), visible=False)
    plt.ylabel('')

    # Create boxenplot of third_feature and Log_SalePrice
    ax7 = plt.subplot(247, sharey=ax5)
    sns.boxenplot(x=third_feature, y=feature_against,
                  data=df, color='darkseagreen')
    plt.setp(ax7.get_yticklabels(), visible=False)
    plt.ylabel('')

    # Create boxenplot of fourth_feature and Log_SalePrice
    ax8 = plt.subplot(248, sharey=ax5)
    sns.boxenplot(x=fourth_feature, y=feature_against,
                  data=df, color='seagreen')
    plt.setp(ax8.get_yticklabels(), visible=False)
    plt.ylabel('')

    # ---------------------------------------------------------------------------------------

    # Create countplot of first_feature
    ax1 = plt.subplot(241, sharex=ax5)
    sns.countplot(x=first_feature, data=df, color='dimgrey')
    plt.setp(ax1.get_xticklabels(), visible=False)
    plt.xlabel('')

    # Create countplot of second_feature
    ax2 = plt.subplot(242, sharey=ax1, sharex=ax6)
    sns.countplot(x=second_feature, data=df, color='tomato')
    plt.setp(ax2.get_yticklabels(), visible=False)
    plt.setp(ax2.get_xticklabels(), visible=False)
    plt.ylabel('')
    plt.xlabel('')

    # Create countplot of third_feature
    ax3 = plt.subplot(243, sharey=ax1, sharex=ax7)
    sns.countplot(x=third_feature, data=df, color='darkseagreen')
    plt.setp(ax3.get_yticklabels(), visible=False)
    plt.setp(ax3.get_xticklabels(), visible=False)
    plt.ylabel('')
    plt.xlabel('')

    # Create countplot of fourth_feature
    ax4 = plt.subplot(244, sharey=ax1, sharex=ax8)
    sns.countplot(x=fourth_feature, data=df, color='seagreen')
    plt.setp(ax4.get_yticklabels(), visible=False)
    plt.setp(ax4.get_xticklabels(), visible=False)
    plt.ylabel('')
    plt.xlabel('')

    # Adjusting the spaces between graphs
    plt.subplots_adjust(wspace=0, hspace=0)

    plt.show()

- There seems to be no relationship between the lot characteristics features and SalePrice.
- These features are not likely to help in predicting house prices.

In [None]:
create_boxen_count_2x4(train,'LotShape', 'LandContour', 'LotConfig', 'LandSlope', (20,8), "SalePrice")

- I need to find a better place to put this plot!
- A joint plot does two things:
    - Scatter plot the data and fit a regression line
    - Show a histogram and its KDE on each margin

In [None]:
sns.jointplot(x='MasVnrArea', 
              y='SalePrice', 
              data=train, 
              kind='reg', 
              height=9,
              color='darkseagreen')

# EDA's conclusions

- There are features with **ambiguous types**. `GarageYrBlt`, `MoSold`, `YearBuilt`, `YearRemodAdd` and `YrSold` are date features. Those are numerical features, but it might be better to use some of them as categorical features.

- There were a lot of features with **missing entries**, which made them sparse.

- Target distribution is **highly skewed** and long-tailed because of the outliers. It requires a transformation in order to perform better in models. Dealing with the outliers could also achieve better model performance.

- Many features are **strongly correlated with each other** and with the target. This relationship can be used to create new features through feature interaction in order to mitigate the multicollinearity issue.

- Some numerical features have **too many zeros**, something that needs to be addressed especially if a log transformation is to be used.

- Some categorical features are **not informative** for two reasons. The feature is either too homogeneous, like the Utilities feature, or all of the values have the same characteristics, like the MoSold feature. Those features can be combined with other features or dropped completely.

- There are some numerical feature distributions that are **too noisy** and therefore bring little value to the model.

- Some features have **different distributions** in the training and test sets. They may require grouping to overcome this problem.
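Two of the conclusions above, the skewed target and the features with too many zeros, are easy to quantify. A minimal sketch on made-up values. The `toy` frame stands in for the real training data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few training columns (made-up values)
toy = pd.DataFrame({
    "SalePrice": [90000, 120000, 140000, 160000, 185000, 230000, 320000, 755000],
    "PoolArea": [0, 0, 0, 0, 0, 0, 0, 512],
    "WoodDeckSF": [0, 0, 120, 0, 250, 0, 400, 80],
})

# Skewness of the target before and after the log transformation
skew_raw = toy["SalePrice"].skew()
skew_log = np.log1p(toy["SalePrice"]).skew()
print(f"skew raw: {skew_raw:.2f}, skew log1p: {skew_log:.2f}")

# Share of zeros per column, largest first
zero_share = (toy == 0).mean().sort_values(ascending=False)
print(zero_share)
```

Columns near the top of `zero_share` are candidates for a flag feature, as proposed for `PoolArea` and `WoodDeckSF`.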

In [None]:
# This is just a small reminder of the numerical vs. categorical features in the two datasets
_,_ = get_features_type(train)
_,_ = get_features_type(test)