# Introduction

This notebook extracts useful insights from the datasets of the ‘Tabular Playground Series - Dec 2021’ competition on Kaggle.

For this competition, you will be predicting a categorical target based on a number of feature columns given in the data. 

The data is synthetically generated by a GAN that was trained on the data from the [Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction/overview) competition. The synthetic dataset (a) is much larger than the original, and (b) may or may not have the same relationship to the target as the original data.

**Note:** Please refer to this [data page](https://www.kaggle.com/c/forest-cover-type-prediction/data) for a detailed explanation of the features.

We are going to perform a comprehensive EDA in two steps:
-	Automate the generic aspects of EDA with AutoViz, a free rapid-EDA tool for Python
-	Dig into the problem-specific analytical questions with custom EDA routines built on top of the standard capabilities of Plotly and Matplotlib

# EDA Findings

This section focuses on the discoveries and insights obtained from the *AutoViz*-automated EDA of the contest dataset.

**Note:** The diagrams used in the subsections have been automatically generated by running the code in the next chapter.

# Now What? Let's see it in action

The sections below show the source code of the express EDA experiment that led to the insights collected above.

Executing this code generates the charts used as images in the previous chapter.

In [1]:
# !pip install AutoViz

## Initial Preparations

We are going to start with the essential prerequisites:

- installing *AutoViz* into this notebook
- importing the standard Python packages we need down the road
- programming the automation routines for the repeatable data visualizations we are going to draw in the advanced analytical EDA trials later on

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
from typing import Tuple, List, Dict

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline


# read data
in_kaggle = False

def get_data_file_path(is_in_kaggle: bool) -> Tuple[str, str, str]:
    train_path = ''
    test_path = ''
    sample_submission_path = ''

    if is_in_kaggle:
        # running in Kaggle, inside the competition
        train_path = '../input/tabular-playground-series-dec-2021/train.csv'
        test_path = '../input/tabular-playground-series-dec-2021/test.csv'
        sample_submission_path = '../input/tabular-playground-series-dec-2021/sample_submission.csv'
    else:
        # running locally
        train_path = 'data/train.csv'
        test_path = 'data/test.csv'
        sample_submission_path = 'data/sample_submission.csv'

    return train_path, test_path, sample_submission_path


In [3]:
# main flow
start_time = dt.datetime.now()
print("Started at ", start_time)

Started at  2021-12-05 23:11:05.489576


In [4]:
%%time
# get the training set and labels
train_set_path, test_set_path, sample_subm_path = get_data_file_path(in_kaggle)

df_train = pd.read_csv(train_set_path)
df_test = pd.read_csv(test_set_path)

subm = pd.read_csv(sample_subm_path)

Wall time: 17.7 s
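
The `in_kaggle` flag used above is set by hand. As a minimal sketch, the flag could instead be inferred from the presence of the competition input directory that `get_data_file_path` already refers to. The helper name `detect_kaggle` and the constant `KAGGLE_INPUT_DIR` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

# Assumption: inside the Kaggle competition the input directory below exists,
# while a local checkout keeps its files under data/ instead.
KAGGLE_INPUT_DIR = Path('../input/tabular-playground-series-dec-2021')


def detect_kaggle(input_dir: Path = KAGGLE_INPUT_DIR) -> bool:
    """Return True when the competition input directory is present."""
    return input_dir.is_dir()


in_kaggle = detect_kaggle()
print('Running in Kaggle:', in_kaggle)
```

With this in place, `get_data_file_path(in_kaggle)` can be called unchanged.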


## Training Set Overview

In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000000 entries, 0 to 3999999
Data columns (total 56 columns):
 #   Column                              Dtype
---  ------                              -----
 0   Id                                  int64
 1   Elevation                           int64
 2   Aspect                              int64
 3   Slope                               int64
 4   Horizontal_Distance_To_Hydrology    int64
 5   Vertical_Distance_To_Hydrology      int64
 6   Horizontal_Distance_To_Roadways     int64
 7   Hillshade_9am                       int64
 8   Hillshade_Noon                      int64
 9   Hillshade_3pm                       int64
 10  Horizontal_Distance_To_Fire_Points  int64
 11  Wilderness_Area1                    int64
 12  Wilderness_Area2                    int64
 13  Wilderness_Area3                    int64
 14  Wilderness_Area4                    int64
 15  Soil_Type1                          int64
 16  Soil_Type2                          

## Detecting Cardinality of the Variables in Training Set

In [6]:
cols = df_train.columns
for f in cols:
    dist_value = df_train[f].value_counts().shape[0]
    print('Variable {:>40} has {} distinct values'.format(f, dist_value))

Variable                                       Id has 4000000 distinct values
Variable                                Elevation has 2525 distinct values
Variable                                   Aspect has 440 distinct values
Variable                                    Slope has 68 distinct values
Variable         Horizontal_Distance_To_Hydrology has 1636 distinct values
Variable           Vertical_Distance_To_Hydrology has 916 distinct values
Variable          Horizontal_Distance_To_Roadways has 7760 distinct values
Variable                            Hillshade_9am has 301 distinct values
Variable                           Hillshade_Noon has 221 distinct values
Variable                            Hillshade_3pm has 326 distinct values
Variable       Horizontal_Distance_To_Fire_Points has 8112 distinct values
Variable                         Wilderness_Area1 has 2 distinct values
Variable                         Wilderness_Area2 has 2 distinct values
Variable                         Wi

As a result, we see that *'Soil_Type15'* and *'Soil_Type7'* take a single value across all training records. Such constant features carry no information, so it makes no sense to use them in model training down the road.

The 'Id' feature is a nominal identifier, so it should also be excluded from the training set at model training time.
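
Rather than reading the constant columns off the printed list, they can be collected programmatically. The following is a minimal sketch on a toy frame standing in for `df_train`; the helper name `constant_columns` and the toy values are illustrative assumptions. Note that `nunique(dropna=False)` counts `NaN` as a value, whereas the `value_counts()` call used above ignores it.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of the columns that hold a single distinct value."""
    n_unique = df.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()


# Toy frame standing in for df_train (values are illustrative only)
toy = pd.DataFrame({
    'Elevation': [2500, 2710, 3012],
    'Soil_Type7': [0, 0, 0],
    'Soil_Type15': [0, 0, 0],
})

print(constant_columns(toy))  # ['Soil_Type7', 'Soil_Type15']
```

The returned list can be passed directly to `DataFrame.drop(columns=...)`.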

In [7]:
features_to_drop = ['Soil_Type15', 'Soil_Type7']
df_train = df_train.drop(features_to_drop, axis=1)

## Metadata

To facilitate the data management, we'll store meta-information about the variables in a DataFrame. This will be helpful when we want to select specific variables for analysis, visualization, modeling etc.

In [8]:
data = []

for f in df_train.columns:
    # Role of the variable in modeling
    if f == 'Cover_Type':
        role = 'target'
    elif f == 'Id':
        role = 'id'
    else:
        role = 'input'

    # Measurement level: the one-hot Soil_Type/Wilderness_Area columns,
    # the target (Cover_Type) and the identifier are nominal
    if 'Type' in f or 'Area' in f or f == 'Id':
        level = 'nominal'
    elif pd.api.types.is_float_dtype(df_train[f]):
        level = 'interval'
    elif pd.api.types.is_integer_dtype(df_train[f]):
        # `dtype == int` can be False for int64 columns on platforms where
        # the default NumPy integer is 32-bit, so test the dtype family instead
        level = 'ordinal'
    else:
        # explicit default so a value never carries over from the previous column
        level = 'unknown'

    # Keep every variable except the identifier
    keep = f != 'Id'

    dtype = df_train[f].dtype

    f_dict = {
        'varname' : f,
        'role' : role,
        'level' : level,
        'keep' : keep,
        'dtype' : dtype
    }

    data.append(f_dict)

meta = pd.DataFrame(data, columns = ['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)

In [9]:
print(meta)

                                      role    level   keep  dtype
varname                                                          
Id                                      id  nominal  False  int64
Elevation                            input  ordinal   True  int64
Aspect                               input  ordinal   True  int64
Slope                                input  ordinal   True  int64
Horizontal_Distance_To_Hydrology     input  ordinal   True  int64
Vertical_Distance_To_Hydrology       input  ordinal   True  int64
Horizontal_Distance_To_Roadways      input  ordinal   True  int64
Hillshade_9am                        input  ordinal   True  int64
Hillshade_Noon                       input  ordinal   True  int64
Hillshade_3pm                        input  ordinal   True  int64
Horizontal_Distance_To_Fire_Points   input  ordinal   True  int64
Wilderness_Area1                     input  nominal   True  int64
Wilderness_Area2                     input  nominal   True  int64
Wilderness_Area3         
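
The metadata table is meant for selecting groups of variables by role and level. Below is a minimal sketch of such a selection on a toy table with the same layout as `meta`; the helper name `select_vars` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy metadata table with the same layout as `meta` (rows are illustrative)
toy_meta = pd.DataFrame({
    'varname': ['Id', 'Elevation', 'Wilderness_Area1', 'Cover_Type'],
    'role': ['id', 'input', 'input', 'target'],
    'level': ['nominal', 'ordinal', 'nominal', 'nominal'],
    'keep': [False, True, True, True],
}).set_index('varname')


def select_vars(meta: pd.DataFrame, role: str = None, level: str = None) -> list:
    """Return the kept variable names, optionally filtered by role and level."""
    mask = meta['keep'].copy()
    if role is not None:
        mask = mask & (meta['role'] == role)
    if level is not None:
        mask = mask & (meta['level'] == level)
    return meta.index[mask].tolist()


print(select_vars(toy_meta, role='input', level='nominal'))  # ['Wilderness_Area1']
print(select_vars(toy_meta, level='ordinal'))                # ['Elevation']
```

The resulting list can index the training frame directly, for example `df_train[select_vars(meta, role='input')]`.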

## Express EDA Analysis with AutoViz

We are going to invoke *AutoViz*, a free rapid-EDA tool for Python, to quickly draw basic insights from the data.

In [12]:
from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
dftc = AV.AutoViz(
    filename='', 
    sep='' , 
    depVar='Cover_Type', 
    dfte=df_train, 
    header=0, 
    verbose=2, 
    lowess=False, 
    chart_format='png', 
    max_rows_analyzed=400000, 
    max_cols_analyzed=55
)


Imported AutoViz_Class version: 0.1.23. Call using:
    AV = AutoViz_Class()
    AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=0, lowess=False,
               chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30, save_plot_dir=None)
Note: verbose=0 or 1 generates charts and displays them in your local Jupyter notebook.
      verbose=2 does not display plots but saves them in AutoViz_Plots folder in local machine.
Note: chart_format='bokeh' generates and displays charts in your local Jupyter notebook.
      chart_format='server' generates and displays charts in the browser - one tab for each chart.


ImportError: cannot import name 'TypeGuard' from 'typing_extensions' (C:\ProgramData\Anaconda3\lib\site-packages\typing_extensions.py)
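
The `ImportError` above reports that the installed `typing_extensions` package does not expose `TypeGuard`, which typically points to an outdated version of that package. A minimal sketch for checking this before importing AutoViz follows; the helper name `module_has` is hypothetical, and the suggested upgrade is an assumption about the likely fix rather than a verified remedy for this environment.

```python
import importlib


def module_has(module_name: str, attr: str) -> bool:
    """Return True if `module_name` can be imported and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


if not module_has('typing_extensions', 'TypeGuard'):
    # Assumption: a newer typing_extensions release provides TypeGuard
    print('TypeGuard is missing; try: pip install --upgrade typing_extensions')
```

After upgrading, restart the kernel so the new package version is picked up.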

# References

- https://www.kaggle.com/damagejun/tps-dec-2021-eda

In [None]:
print('We are done. That is all, folks!')
finish_time = dt.datetime.now()
print("Finished at ", finish_time)
elapsed = finish_time - start_time
print("Elapsed time: ", elapsed)