# Introduction
<hr style="border:2px solid black"> </hr>

- **What?** Kaggle competition: House Prices - Advanced Regression Techniques. This particular notebook serves as a common repository of code snippets, either my own or taken from other Kagglers.

- **Dataset description** Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

# General info
<hr style="border:2px solid black"> </hr>

- To promote code reusability and tidiness, I will try to wrap most of the actions performed in this notebook inside a method.
- If you do not like it, it would be extremely easy to get rid of the method and use its content as a code snippet.
- Please consider this notebook a collection of ideas taken (and made mine with some modifications) from several notebooks published by other Kagglers who generously shared their ideas. Here I am returning the favour for the benefit of others.

- This notebook is part 1 of a 4-part series:
    - **Step_#1_Train_test_comparison.ipynb**
    - Step_#2_EDA.ipynb
    - Step_#3_Data_preparation.ipynb
    - Step_#4_Modelling.ipynb

# Import modules
<hr style="border:2px solid black"> </hr>

In [None]:
# Data wrangling
import pandas as pd
import numpy as np
from collections import Counter
import copy
from functools import reduce
import pandas_profiling as pp

# Plotting
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns

# Statistics
from scipy.stats import norm
from scipy import stats
from scipy.stats import ttest_ind
import statsmodels.api as sm
from scipy.stats import spearmanr, kendalltau
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

In [None]:
# Other notebook settings
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
import warnings
warnings.filterwarnings("ignore")
from IPython.display import display as dsp

# Load datasets
<hr style="border:2px solid black"> </hr>

- To know what this step does, read the docstring and the comments inside the `load_data` method.

In [None]:
def load_data():
    """Load data

    Load the train and test data as provided by Kaggle.
    Keep in mind that the way Kaggle provides the data is
    different than the usual idea we have of the train-test
    split. In particular, the target column is not present
    in the test set.

    Parameters
    ----------
    None

    Returns
    -------
    train : pandas dataframe
    test : pandas dataframe
    """

    print("\nLoading data")

    # Read the train data
    print("Read train set")
    train = pd.read_csv('./DATASETS/train.csv')

    # Read the test data
    print("Read test set")
    test = pd.read_csv('./DATASETS/test.csv')

    print("Train size:", train.shape)
    print("Test size:", test.shape)

    # Report the columns that are present in one set only
    train_features = train.columns
    test_features = test.columns
    print("Columns only in train:", set(train_features).difference(test_features))
    print("Columns only in test:", set(test_features).difference(train_features))

    return train, test

In [None]:
# Read data for the first time
train, test = load_data()

- Get the number of categorical and numerical features.
- This information can be used to debug the final dataset.
- The difference of one numerical feature between the train and test sets is due to the fact that the test set provided by Kaggle does not include the target column. Kaggle holds the test targets back and uses them to score your submission on the public leaderboard.

In [None]:
def get_features_type(SET):
    
    print("\nGet feature types")
    
    df_numerical_features = SET.select_dtypes(exclude=['object'])
    df_non_numerical_features = SET.select_dtypes(include=['object'])
    
    print("No of numerical features: ", df_numerical_features.shape)
    print("No of NON numerical features: ", df_non_numerical_features.shape)    
        
    return df_numerical_features, df_non_numerical_features

In [None]:
_,_ = get_features_type(train)

In [None]:
_,_ = get_features_type(test)

- Check for duplicates in both the train and test sets.
- The check is done against the ID, but a more thorough one would involve checking whether different IDs have exactly the same entry for each column.

In [None]:
Ids_dupli_train = train.shape[0] - len(train["Id"].unique())
Ids_dupli_test = test.shape[0] - len(test["Id"].unique())

print("There are " + str(Ids_dupli_train) + " duplicated IDs for " +
      str(train.shape[0]) + " total TRAIN entries")
print("There are " + str(Ids_dupli_test) + " duplicated IDs for " +
      str(test.shape[0]) + " total TEST entries")
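
- A minimal sketch of the more thorough check mentioned above, assuming that "exactly the same entry for each column" means identical values in every column except `Id`. The helper name `count_full_row_duplicates` and the toy dataframe are made up for illustration.

```python
import pandas as pd


def count_full_row_duplicates(df, id_col="Id"):
    """Count rows that repeat an earlier row on every column except id_col."""
    feature_cols = [col for col in df.columns if col != id_col]
    # duplicated() flags every repeat after the first occurrence
    return int(df.duplicated(subset=feature_cols).sum())


# Toy data (hypothetical): Ids 1 and 3 describe the same house
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 8450],
    "Street": ["Pave", "Pave", "Pave"],
})
print("Full-row duplicates:", count_full_row_duplicates(toy))
```

- On the real data the call would be `count_full_row_duplicates(train)`, run before the `Id` column is removed or after (the helper simply ignores `Id` if it is present).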

# Minor changes
<hr style="border:2px solid black"> </hr>

- Just a collection of actions that are generally forgotten. It is better to do them here so that we keep the notebook clean for later steps.
- More info is provided in the `minor_changes` docstring.

In [None]:
def minor_changes(train, test, target_name):
    """Minor changes.

    This method performs minor changes to the dataset.
    These are generally forgotten actions hence the
    name of the method.

    At the moment we have:
    - Get train and test IDs, the latter used for the Kaggle
    competition submission file
    - Remove the column Id as it is not needed. This is an
    artifact used by Kaggle to keep track of the submitted
    predictions.

    Parameters
    ----------
    train : pandas dataframe
    test : pandas dataframe
    target_name : string

    Returns
    -------
    train : pandas dataframe
        Dataframe without the Id column
    test : pandas dataframe
        Dataframe without the Id column
    Id_train : numpy array
        IDs of the train set
    Id_test : numpy array
        IDs of the test set
    df_target : pandas series
        Series containing only the target
    """

    print("Train size BEFORE:", train.shape)
    print("Test size BEFORE:", test.shape)

    Id_train = train.Id.values
    Id_test = test.Id.values

    # Drop the Id column only if it is actually present
    if 'Id' in test.columns:
        print("Removing column Id from test set")
        test = test.drop(['Id'], axis=1)
    else:
        print("No column Id present in test set")

    if 'Id' in train.columns:
        print("Removing column Id from train set")
        train = train.drop(['Id'], axis=1)
    else:
        print("No column Id present in train set")

    print("Train size AFTER:", train.shape)
    print("Test size AFTER:", test.shape)
    df_target = train[target_name]

    return train, test, Id_train, Id_test, df_target

In [None]:
train, test, Id_train, Id_test, TARGET = minor_changes(train, test, "SalePrice")
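
- As a sketch of why `Id_test` is kept: the submission file for this competition pairs each test `Id` with a predicted `SalePrice`. The helper name `make_submission`, the toy Ids and the constant predictions below are placeholders and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def make_submission(ids, predictions):
    """Pair each test Id with its predicted SalePrice."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({"Id": ids, "SalePrice": predictions})


# Placeholder values standing in for Id_test and for real model output
toy_ids = np.array([1461, 1462, 1463])
toy_predictions = np.full(len(toy_ids), 180000.0)

submission = make_submission(toy_ids, toy_predictions)
# submission.to_csv("submission.csv", index=False)  # uncomment to write the file
print(submission)
```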

# Quick and comprehensive data overview with Pandas

- The third-party `pandas_profiling` package (imported above as `pp`) provides a little-known profiler that automatically generates a comprehensive description of your data.
- It can be used as a reminder of what you can try to build in terms of a systematic data profiler, and also as a quick overview of what you should really pay attention to.
- **Just a small warning**: the size of the notebook will increase considerably (~60MB), so you may want to comment out these two lines of code if size/loading is an issue!

In [None]:
# Comment this if notebook size/loading is an issue!
#pp.ProfileReport(train)

In [None]:
# Comment this if notebook size/loading is an issue!
#pp.ProfileReport(test)

# Train vs. test sets
<hr style="border:2px solid black"> </hr>

- In this section we are going to compare the train set against the test set. This is something that is generally not done, but I feel comparing the two helps us understand more about the data and how it was split.
- It is generally assumed that train and test are representative of the same population. This is a **naive assumption**, as there is very little guarantee that this is the case. I feel (just speculation here) that the comparison is usually skipped because, in a Kaggle competition, even if you are aware of such a difference, the only thing you can do is be aware of it and, **more importantly**, be wary of the score you get on the **public board**. Essentially, we are asking whether the test set is statistically representative of the train set.
- **Why are we doing this?** It has been reported that, in some cases, the private set was not representative of the one used for scoring submissions on the leaderboard. If that was a possibility in the past, then it is fair to assume that the same may hold for the train and test sets provided by Kaggle. Whether this is a necessary step or not is arguable, but it will help us understand the data. The only shared notebook I've found covering something similar is [this one](https://www.kaggle.com/gunesevitan/house-prices-advanced-stacking-tutorial).

In [None]:
def basic_details(df, sort_by="Feature"):
    """Get basic details of the dataset.

    The following features are recorded:
        [1] Missing values and their percentage
        [2] Unique values and their percentage (Cardinality)
            Cardinality is important if you are trying to
            understand feature importance.
        [3] Type, numerical or not

    Parameters
    ----------
    df : pandas dataframe
    sort_by : string, default = "Feature"

    Returns
    -------
    b : pandas dataframe
    """

    b = pd.DataFrame()
    b['No missing value'] = df.isnull().sum()
    b["Missing[%]"] = df.isna().mean()*100
    b['No unique value'] = df.nunique()
    b['Cardinality[%]'] = (df.nunique()/len(df.values))*100
    b["No Values"] = [len(df.values) for _ in range(len(df.columns))]
    b['dtype'] = df.dtypes

    # Turn the index into a column
    b['Feature'] = b.index
    # Getting rid of the index
    b.reset_index(drop=True, inplace=True)
    # Order by feature name
    b.sort_values(by=[sort_by], inplace=True)
    # Move feature as first column
    b = b[['Feature'] + [col for col in b.columns if col != 'Feature']]

    return b

In [None]:
basic_details(train)

In [None]:
basic_details(test)

- We'd like to know the number of **non usable entries**. Non usable entries are defined here as entries that cannot be used directly (null values). In some cases we can impute them and use a valid value instead.
- This is another occasion to check whether the split provided is representative of the real data distribution. It is not so rare to be given a test set whose properties are not *similar* to those of the train set.
- We'll then highlight all the features for which the difference in percentage is greater than a threshold of your choice. I am not sure what that threshold should be, but I am assuming a value around 2% is a sensible choice.

In [None]:
def compare_sets_over_non_usable_entries(train, test, delta_threshold=2.0):
    """Compare sets over non usable entries

    As the name suggests, two sets are compared over their numbers
    of non-usable entries. If the absolute difference in percentage
    is greater than delta_threshold, this is then highlighted.

    Parameters
    ----------
    train : pandas dataframe
    test : pandas dataframe
    delta_threshold : float, default = 2.0
         Value in percentage above which the column gets highlighted

    Returns
    -------
    a : pandas Styler
        Styled dataframe with the highlighted entries
    """

    # Pandas dataframe showing the number of null values for the train set
    nan_train = pd.DataFrame(train.isna().sum(), columns=['Nan_sum_train'])
    nan_train['feature_name'] = nan_train.index
    nan_train = nan_train[nan_train['Nan_sum_train'] > 0]
    nan_train['Percentage_train'] = (nan_train['Nan_sum_train']/len(train))*100
    nan_train = nan_train.sort_values(by=['feature_name'])

    # Pandas dataframe showing the number of null values for the test set
    nan_test = pd.DataFrame(test.isna().sum(), columns=['Nan_sum_test'])
    nan_test['feature_name'] = nan_test.index
    nan_test = nan_test[nan_test['Nan_sum_test'] > 0]
    nan_test['Percentage_test'] = (nan_test['Nan_sum_test']/len(test))*100
    nan_test = nan_test.sort_values(by=['feature_name'])

    # Merge the two datasets by "feature_name"
    pd_merge = pd.merge(nan_test, nan_train, how='outer', on='feature_name')
    pd_merge = pd_merge.fillna(0)
    pd_merge["NaN_tot"] = pd_merge["Nan_sum_train"] + pd_merge["Nan_sum_test"]
    pd_merge["delta_percentage"] = abs(
        pd_merge["Percentage_test"] - pd_merge["Percentage_train"])
    pd_merge = pd_merge.sort_values(by=['feature_name'])

    # We'd like to highlight those entries where the differences > delta_threshold
    def highlight(x):
        return ['background: yellow' if v > delta_threshold else '' for v in x]

    def bold(x):
        return ['font-weight: bold' if v > delta_threshold else '' for v in x]

    # Highlight the entries
    a = pd_merge.style.apply(highlight, subset="delta_percentage").apply(
        bold, subset="delta_percentage")
    return a

In [None]:
compare_sets_over_non_usable_entries(train, test)
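
- For a quick cross-check of `delta_percentage`, the same quantity can be computed in a few lines with `isna().mean()`. This is a compact sketch on made-up data, not a replacement for the styled table above; the helper name `missing_delta` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_delta(train, test):
    """Absolute train/test difference in the percentage of missing values."""
    # Only columns present in both sets can be compared
    shared = [col for col in test.columns if col in train.columns]
    pct_train = train[shared].isna().mean() * 100
    pct_test = test[shared].isna().mean() * 100
    return (pct_test - pct_train).abs().sort_values(ascending=False)


# Toy data (hypothetical): "Alley" is missing in 50% of train and 100% of test,
# "LotArea" is complete in train and 50% missing in test
toy_train = pd.DataFrame({"Alley": ["Grvl", None, "Pave", None],
                          "LotArea": [1.0, 2.0, 3.0, 4.0]})
toy_test = pd.DataFrame({"Alley": [None, None],
                         "LotArea": [5.0, np.nan]})

delta = missing_delta(toy_train, toy_test)
print(delta[delta > 2.0])  # features above the 2% threshold
```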

- The barplot below shows the percentage of NaN for each feature, for both the train and test sets.
- We can see that the percentage of NaN is pretty similar for both sets. Nevertheless, we can still capture some small deviations.

In [None]:
def plot_sets_over_non_usable_entries(train, test):
    """Plot sets over non usable entries.
    
    Parameters
    ----------
    train : pandas dataframe
    test : pandas dataframe
    
    Returns
    -------
    None
    """
    
    # Pandas dataframe showing the number of null values for the train set
    nan_train = pd.DataFrame(train.isna().sum(), columns=['Nan_sum_train'])
    nan_train['feature_name'] = nan_train.index
    nan_train = nan_train[nan_train['Nan_sum_train'] > 0]
    nan_train['Percentage_train'] = (nan_train['Nan_sum_train']/len(train))*100
    nan_train = nan_train.sort_values(by=['feature_name'])    

    # Pandas dataframe showing the number of null values for the test set
    nan_test = pd.DataFrame(test.isna().sum(), columns=['Nan_sum_test'])
    nan_test['feature_name'] = nan_test.index
    nan_test = nan_test[nan_test['Nan_sum_test'] > 0]
    nan_test['Percentage_test'] = (nan_test['Nan_sum_test']/len(test))*100
    nan_test = nan_test.sort_values(by=['feature_name'])    

    # Merge the two dataset by "feature_name"
    pd_merge = pd.merge(nan_test, nan_train, how='outer', on='feature_name')
    pd_merge = pd_merge.fillna(0)
    pd_merge["NaN_tot"] = pd_merge["Nan_sum_train"] + pd_merge["Nan_sum_test"]
    pd_merge["delta_percentage"] = abs(
        pd_merge["Percentage_test"] - pd_merge["Percentage_train"])
    pd_merge = pd_merge.sort_values(by=['feature_name'])

    # Plotting
    rcParams['figure.figsize'] = 19, 8
    rcParams['font.size'] = 15
    pd_merge = pd_merge.sort_values(by=['feature_name'])
    plt.figure()
    labels = ["train", "test"]

    sns.barplot(x=pd_merge['feature_name'],
                y=pd_merge['Percentage_train'], linewidth=2.5, facecolor="w",
                errcolor=".2", edgecolor="k",)

    ax = sns.barplot(x=pd_merge['feature_name'],
                     y=pd_merge['Percentage_test'], linewidth=2.5, facecolor="w",
                     errcolor=".2", edgecolor="r", ls="--")

    plt.xticks(rotation=90, size=25)
    plt.title('Train vs. test sets', size=25)
    plt.xlabel('Features')
    plt.ylabel('% of Missing Data', size=25)
    plt.show()

In [None]:
plot_sets_over_non_usable_entries(train, test)

- **Are the differences in percentage shown above significant?** There is a more rigorous way to check whether the differences between the two sets are really significant: the `t-test`, which tests whether the means of two samples differ. We can only run this test on numerical variables, and we also have to get rid of the null values. At the moment I am turning each null value into zero, but I am unsure how to best handle this case. [See this discussion](https://www.researchgate.net/post/How-to-handle-missing-data-for-a-paired-t-test)
- In this particular case we can see that only three features have a statistically different distribution. **What can we do about it?** Considering that this is a competition and you are NOT in a position to change the dataset, you can only be aware of this fact and that's all!
- Further, we'll also check whether any of the instances in the test set fall outside the domain of the train set. If so, the model would have to predict on values it has never seen during training.
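
- On the open question of null handling: one alternative to zero-filling, sketched below, is to drop the missing values before testing, since imputed zeros shift the mean of a feature. The sketch also swaps in two techniques that are not used in this notebook: Welch's t-test (`equal_var=False`), which does not assume equal variances, and the two-sample Kolmogorov-Smirnov test (`ks_2samp`), which compares whole distributions rather than only means. The helper name `compare_feature` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp, ttest_ind


def compare_feature(train_col, test_col, alpha=0.05):
    """Compare one numerical feature across two sets, ignoring NaN."""
    a = train_col.dropna().to_numpy()
    b = test_col.dropna().to_numpy()
    # Welch's t-test: are the two means different?
    _, p_mean = ttest_ind(a, b, equal_var=False)
    # Kolmogorov-Smirnov: are the two distributions different?
    _, p_dist = ks_2samp(a, b)
    return {"p_ttest": float(p_mean),
            "p_ks": float(p_dist),
            "different_mean": bool(p_mean <= alpha),
            "different_distribution": bool(p_dist <= alpha)}


# Synthetic data (hypothetical): the second sample is shifted by one unit
rng = np.random.default_rng(0)
toy_train = pd.Series(rng.normal(0.0, 1.0, 500))
toy_test = pd.Series(rng.normal(1.0, 1.0, 500))
toy_train.iloc[:20] = np.nan  # missing values are dropped, not zero-filled

result = compare_feature(toy_train, toy_test)
print(result)
```

- Whether dropping is preferable to imputing depends on why the values are missing, so this should be read as one option among several.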

In [None]:
def get_IQR_frequency(df):
    """Get the IQR or the frequency.
    
    This method returns the interquartiles or the frequency
    depending on the feature being numerical or categorical.

    Parameters
    ----------
    df : pandas dataframe

    Returns
    -------
    dummy : pandas dataframe
        dummy storing IQR if numerical
        dummy storing the instance frequency if categorical
    """

    if all(df.dtypes != object):
        dummy = pd.DataFrame(df.describe()).T
    else:
        frequency = []
        frequency_percentage = []
        unique = list(set([i[0] for i in pd.DataFrame(df).values]))
        for i in unique:
            frequency.append(df[df == i].count()[0])
            frequency_percentage.append((df[df == i].count()[0]/len(df))*100)

        dummy = pd.DataFrame()
        dummy["Entries"] = unique
        dummy["Frequency"] = frequency
        dummy["Frequency[%]"] = frequency_percentage
        dummy.sort_values(by=['Frequency[%]'], inplace=True, ascending=False)

    return dummy

In [None]:
def compare_distribution_sets_on_numerical_columns(train, test):
    """Compare feature distribution on both train and test set.
    
    Test if each feature has the same mean in both the train
    and test set, as a proxy for having the same distribution.
    This is achieved via t-test. To be able to perform this test
    we ONLY select the numerical variables.
          
    Parameters
    ----------
    train : pandas dataframe
    test : pandas dataframe
    
    Returns
    -------
    dummy : pandas dataframe
    """

    # We take the features from the test set because every test feature
    # is also present in the train set, but not vice versa: the target
    # is not present in the test set
    numeric_features = test.dtypes[test.dtypes != object].index

    similar = []
    p_value = []
    mu, sigma = [], []
    min_test_within_train = []
    max_test_within_train = []
    
    for feature in numeric_features:

        # Getting rid of all null values. We are using zero as a form of imputation.
        # This was not a deliberate choice: without it the test cannot
        # handle the null values.
        train_clean = train[feature].fillna(0.0)
        test_clean = test[feature].fillna(0.0)

        stat, p = ttest_ind(train_clean, test_clean) 
        p_value.append(p)

        #print('Statistics=%.3f, p=%.3f' % (stat, p))
        alpha = 0.05
        if p > alpha:
            similar.append("similar")                
            #print('Same distributions (fail to reject H0)')
        else:
            similar.append("different")
            #print('Different distributions (reject H0)')
                
        min_train = get_IQR_frequency(pd.DataFrame(train_clean))["min"].values[0]
        min_test = get_IQR_frequency(pd.DataFrame(test_clean))["min"].values[0]
        
        max_train = get_IQR_frequency(pd.DataFrame(train_clean))["max"].values[0]
        max_test = get_IQR_frequency(pd.DataFrame(test_clean))["max"].values[0]
        
        #print("----",max_test <= max_train)
        
        min_test_within_train.append(min_test >= min_train)
        max_test_within_train.append(max_test <= max_train)
        

    # Create a pandas dataframe
    dummy = pd.DataFrame()
    dummy["numerical_feature"] =  numeric_features
    dummy["type"] = ["numerical" for _ in range(len(numeric_features))]
    dummy["train_test_similar?"] = similar
    dummy["ttest_p_value"] = p_value
    dummy["min_test>=min_train"] = min_test_within_train
    dummy["max_test<=max_train"] = max_test_within_train

    # Decorate the dataframe for quick visualisation
    def highlight(x):    
        return ['background: yellow' if v == "different" or v == False else '' for v in x]

    def bold(x):
        return ['font-weight: bold' if v == "different" or v == False else '' for v in x]

    # Visualise the highlighted df
    return dummy.style.apply(highlight).apply(bold)        

In [None]:
compare_distribution_sets_on_numerical_columns(train, test)
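
- The min/max check above only covers numerical features. For categorical features, a comparable sketch is to list the categories that appear in the test set but never in the train set. The helper name `unseen_categories` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def unseen_categories(train, test):
    """Map each non-numerical feature to test categories absent from train."""
    unseen = {}
    for col in test.select_dtypes(exclude=["number"]).columns:
        if col not in train.columns:
            continue
        train_values = set(train[col].dropna().unique())
        test_values = set(test[col].dropna().unique())
        extra = test_values - train_values
        if extra:
            unseen[col] = sorted(extra)
    return unseen


# Toy data (hypothetical): "Metal" roofs only show up in the test set
toy_train = pd.DataFrame({"RoofMatl": ["CompShg", "Tar&Grv"],
                          "Street": ["Pave", "Grvl"]})
toy_test = pd.DataFrame({"RoofMatl": ["CompShg", "Metal"],
                         "Street": ["Pave", "Pave"]})

print(unseen_categories(toy_train, toy_test))
```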

- One of the things we can do is to visually compare (via a histogram and its KDE) the distribution of each feature over the two sets.
- This is much easier to read, but it is not as precise as the t-test described above.
- Also consider that we have not transformed the data yet.

In [None]:
def get_unique_values(df):
    """Get unique values.

    Parameters
    ----------
    df : pandas dataframe

    Returns
    -------
    unique : set
    """

    unique = set([i[0] for i in df.dropna().values])
    return unique

In [None]:
def compare_hist_kde(train, test):
    """Compare hist and kde for two histogram

    Parameters
    ----------
    train : pandas dataframe
    test : pandas dataframe

    Returns
    -------
    None    
    """
    rcParams['figure.figsize'] = 17, 5
    rcParams['font.size'] = 15

    dummy = get_unique_values(train)
    No_bins = 50  # int(len(dummy))
    print("No of bins used for histogram: ", No_bins)

    for i in set(list(train.columns.values) + list(test.columns.values)):

        print("***************")
        print("Feature's name:", i)
        print("***************")

        # Plot histogram
        try:
            test[i].hist(legend=True, bins=No_bins)
        except:
            print("Feature", i, " NOT present in TEST set!")
        try:
            train[i].hist(legend=True, bins=No_bins)
        except:
            print("Feature", i, " NOT present in TRAIN set!")
        plt.xticks(rotation=90)
        plt.show()

        # Plot density function
        try:
            sns.distplot(test[i], hist=False, kde=True,
                         kde_kws={'shade': True, 'linewidth': 3})
        except:
            print("Feature", i, " is categorical in TEST set!")
        try:
            sns.distplot(train[i], hist=False, kde=True,
                         kde_kws={'shade': True, 'linewidth': 3})
        except:
            print("Feature", i, " is categorical in TRAIN set!")
        plt.xticks(rotation=90)
        plt.show()

In [None]:
compare_hist_kde(train, test)

# Conclusions

- In this notebook we have compared the train and test sets. The reasons for this are many (a non-exhaustive list is reported below):
    - We need to know if leakage is a possibility or not. For instance, if I merge the data, am I leaking info from the test to the train set?
    - We need to know if the two sets are statistically equivalent; if NOT, this will tell us how much we can trust the CV results.
    - Knowing the data, especially when it comes to explaining the results, is extremely important.

- For this particular dataset:
    - Leakage is a risk; however, whether this risk will affect the final results still needs to be established.
    - Not all the features have the same distribution in both sets, meaning that there are some statistically relevant differences.