<div style="display:block">
    <div style="width: 10%; display: inline-block; text-align: left;">
        <img src="https://upload.wikimedia.org/wikipedia/en/a/ad/Naruto_-_Shippuden_DVD_season_1_volume_1.jpg" style="height:75px; margin-left:0px" />
    </div>
    <div style="width: 69%; display: inline-block">
        <h5  style="color:maroon; text-align: center; font-size:25px;">Bricks - Exploratory Data Analysis for Panel Data in Python - v20.06.03</h5>
        <div style="width: 90%; text-align: center; display: inline-block;"><i>Author(s): </i> <strong>Naruto</strong> </div>
    </div>
    <div style="width: 20%; text-align: right; display: inline-block;">
        <div style="width: 100%; text-align: left; display: inline-block;">
            <i>Modified: June 100th, 2200</i>
        </div>
    </div>
</div>

***
***

# Introduction

Exploratory Data Analysis (EDA) is an approach to analyzing data sets and summarizing their main characteristics with visualizations. It is an essential step before modeling in any data analytics project. In order to better understand the data associated with the problem, we need to perform certain activities to ensure that we get relevant insights and decide on the appropriate next steps.

This notebook can handle two different types of data:

* __Time-series data__ is a collection of observations (behavior) for a __single subject__ (entity) at different time intervals (generally equally spaced).
* __Panel data__ is __cross-sectional time-series data__: a collection of observations for __multiple subjects__ at multiple points in time.
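
To make the distinction concrete, the sketch below stores a few observations as plain Python records and groups them by subject. The field names (`hotel_id`, `month`, `occupancy`), the values, and the helper `split_by_subject` are all hypothetical and serve only as illustration; the notebook itself works on pandas dataframes.

```python
from collections import defaultdict

# Hypothetical long-format records: one row per (subject, time) observation.
records = [
    {"hotel_id": "H1", "month": "2020-02", "occupancy": 0.64},
    {"hotel_id": "H1", "month": "2020-01", "occupancy": 0.71},
    {"hotel_id": "H2", "month": "2020-01", "occupancy": 0.80},
    {"hotel_id": "H2", "month": "2020-02", "occupancy": 0.77},
]

def split_by_subject(rows, id_key, time_key):
    """Group long-format rows into one time-ordered series per subject."""
    series = defaultdict(list)
    for row in rows:
        series[row[id_key]].append(row)
    for rows_for_subject in series.values():
        rows_for_subject.sort(key=lambda r: r[time_key])
    return dict(series)

panels = split_by_subject(records, "hotel_id", "month")

# More than one subject means panel data; exactly one means a single time series.
data_kind = "panel" if len(panels) > 1 else "time series"
print(data_kind, {subject: len(rows) for subject, rows in panels.items()})
```

Each value in `panels` is an ordinary time series, so panel data can be read as several time series stacked on top of each other and told apart by an identifier column.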

## Installing the necessary packages (Mandatory)

Please ensure all the required packages are installed on your system. Follow the instructions in the __README__ file before executing any cells.

## Prerequisite Functions (for aesthetics)

The following code chunks improve the usability of the notebook. Please execute them before proceeding.

The following cell defines a function that __highlights negative values__ in red and shows all other values in black.

In [None]:
# defining the function "color_negative_red", which color-codes values across the notebook

import numbers

from IPython.display import HTML, display, Markdown, clear_output

def color_negative_red(val):
    '''
    Function to color-code negative numbers in red and all other values in black.
    
    input: 
        val : a single cell value (any type)
    return: 
        `color: red` or `color: black`
    '''
    
    # only real numbers (Python or NumPy scalars) can be negative; everything else stays black
    if isinstance(val, numbers.Real) and val < 0:
        color = 'red'
    else:
        color = 'black'
    return 'color: {}'.format(color)

display(Markdown('<span style="color:darkgreen; font-style: italic; font-size: 15px">Prerequisite Code #1 for <b>color-coded highlights</b> is EXECUTED!</span>'))

The following cell __suppresses warnings__ that may be raised across the notebook and __sets the random seed__.

In [1]:
# importing libraries for setting the random seed and suppressing warnings

import warnings
import random
import logging

# imported here as well so that this cell also runs on its own
from IPython.display import display, Markdown

# suppress warnings
warnings.filterwarnings('ignore')

logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

# setting random seed - 369 in this case (seeds Python's `random` module only)
random.seed(369)

display(Markdown('<span style="color:darkgreen; font-style: italic; font-size: 15px">Prerequisite Code #2 for <b>suppressing warning</b> is EXECUTED!</span>'))

The following cell __center-aligns__ the plots and images rendered in the notebook output and defines a __loading symbol__ that shows the user that a cell is still processing.

In [None]:
# setting the notebook output to be in the center and creating loading symbol

display(HTML("""
<style>
.output_png img {
    display: block;
    margin-left: auto;
    margin-right: auto;
}
 
.loader {
  border: 5px solid #f3f3f3;
  border-radius: 50%;
  border-top: 5px solid teal;
  border-right: 5px solid grey;
  border-bottom: 5px solid maroon;
  border-left: 5px solid tan;
  width: 20px;
  height: 20px;
  -webkit-animation: spin 1s linear infinite;
  animation: spin 1s linear infinite;
  float: left;
}

@-webkit-keyframes spin {
  0% { -webkit-transform: rotate(0deg); }
  100% { -webkit-transform: rotate(360deg); }
}

@keyframes spin {
  0% { transform: rotate(0deg); }
  100% { transform: rotate(360deg); }
}

</style>
"""))

display(Markdown('<span style="color:darkgreen; font-style: italic; font-size: 15px">Prerequisite Code #3 for <b>image and table aesthetics</b> is EXECUTED!</span>'))

The notebook also keeps __track of the activities__ performed across it: later cells report their outcome by calling `track_cell(value, flag)` on success and `track_cell(value, flag, err)` on failure.
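
The definition of `track_cell` and the initial value of `flag` do not appear in this section. The sketch below is one possible minimal implementation, inferred only from how the later cells call it. The in-memory log `activity_log`, the starting value `flag = 1`, and the reading of `flag` as 1 for success and 0 for failure are assumptions, not the author's code.

```python
from datetime import datetime

# ASSUMPTION: 1 marks success; the notebook's cells set flag = 0 before reporting an error.
flag = 1

# ASSUMPTION: an in-memory list serves as the activity log (hypothetical name).
activity_log = []

def track_cell(value, flag, err=None):
    """Record the outcome of one notebook step.

    value : short description of the step (e.g. "Loading data from local")
    flag  : 1 if the step succeeded, 0 if it failed
    err   : error message as a string, only passed on failure
    """
    entry = {
        "step": value,
        "status": "success" if flag else "failed",
        "error": err,
        "time": datetime.now().isoformat(timespec="seconds"),
    }
    activity_log.append(entry)
    return entry

# usage mirroring the calls made by later cells
track_cell("Import Libraries for processing", flag)
track_cell("Loading data from URL", 0, "HTTP 404")
print(activity_log)
```

Note that the cells in this section assign `flag = 0` in their `except` branches and never set it back, so under these assumptions a real implementation would probably need to reset `flag` to 1 at the start of every cell.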

The following cell helps __render Bokeh plots in the HTML report__.

In [None]:
# enabling bokeh plots to be rendered in HTML report

from jinja2 import Template
from bokeh.embed import components

html_plot = Template("""
<!DOCTYPE html>
<html lang="en-US">

<link
    href="http://cdn.pydata.org/bokeh/release/bokeh-1.4.0.min.css"
    rel="stylesheet" type="text/css"
>
<script src="http://cdn.pydata.org/bokeh/release/bokeh-1.4.0.min.js"></script>
<script src="https://cdn.bokeh.org/bokeh/release/bokeh-widgets-1.4.0.min.js"></script>
<script src="https://cdn.bokeh.org/bokeh/release/bokeh-tables-1.4.0.min.js"></script>

<body>
    {{ script }}
    {{ div }}
</body>

</html>
""")

display(Markdown('<span style="color:darkgreen; font-style: italic; font-size: 15px">Prerequisite Code #5 for <b>embedding Bokeh plots in HTML</b> is EXECUTED!</span>'))

The following cell defines a function that __displays dataframes using Plotly with scrollable rows__.

In [None]:
import plotly.graph_objects as go


def display_data(data_table):
    '''
    Function to display a dataframe as a Plotly table with scrollable rows.
    
    input:
        data_table : dataframe to be displayed
    '''
    n_rows = data_table.shape[0]
    data_table_series = [data_table[i] for i in data_table.columns]

    fig = go.Figure(data=[go.Table(
        header=dict(values=list(data_table.columns),
                    fill_color='grey',
                    align='center',
                    font=dict(color='white', size=15)
                   ),
        cells=dict(values=data_table_series,
                   fill_color='lightblue',
                   align='center',
                   font=dict(color='black', size=10)
                  ))
    ])

    # the figure height grows with the number of rows and is capped at 500px
    if n_rows <= 5:
        fig_ht = 50*n_rows
    elif n_rows <= 22:
        fig_ht = 20*n_rows
    else:
        fig_ht = 500
    
    fig.update_layout(width=150*len(data_table.columns), 
                      height=fig_ht,
                      margin=dict(l=0,r=0,b=0,t=0,pad=0))
    fig.show(config={'displaylogo': False})
    
display(Markdown('<span style="color:darkgreen; font-style: italic; font-size: 15px">Prerequisite Code #6 for <b>displaying dataframes using Plotly</b> is EXECUTED!</span>'))

***

# Import libraries 

## Importing libraries for data loading and processing

In [None]:
# importing the following libraries for data loading and processing
value= "Import Libraries for processing"

try:
    if __name__ == '__main__':
        display(Markdown('<div><div class="loader"></div><h2> &nbsp; IMPORTING libraries for data loading and processing</h2></div>'))
    
    from math import ceil, sqrt, pi, isnan
    from statistics import mean
    from pathlib import Path
    from itertools import combinations, groupby, product
    from io import StringIO
    
    import numpy as np
    import pandas as pd
    pd.set_option('display.max_columns', 500)
    pd.options.display.float_format = '{:.4f}'.format
    
    from scipy import stats
    from scipy.signal import periodogram

    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    from statsmodels.stats.anova import anova_lm
    from statsmodels.stats.multicomp import MultiComparison
    from statsmodels.tsa.stattools import acf, pacf, grangercausalitytests, coint
    from statsmodels.tsa.seasonal import seasonal_decompose
    from statsmodels.tsa.api import VAR

    from factor_analyzer import FactorAnalyzer, Rotator
    from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo
    from statsmodels.multivariate.factor import Factor, FactorResults

    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score
     
    import pymssql
    import psycopg2
    import pandas.io.sql as psql
    
    from arch.unitroot import ADF, KPSS, VarianceRatio, PhillipsPerron
    
    from spectrum import aryule, AIC
    
    import concurrent.futures
    
    track_cell(value, flag)
except Exception as err:
    if __name__ == '__main__':
        clear_output()
    
    print(err)
    flag = 0
    err = str(err)
    track_cell(value, flag, err)
else:
    if __name__ == '__main__':
        clear_output()
    display(Markdown('<span style="color:darkgreen; font-weight: bold; font-size: 15px">All the libraries for <b>data loading and processing</b> are successfully IMPORTED!</span>'))

## Importing libraries for charts and visualization

In [None]:
# importing the following libraries for charts and visualization
value= "Import Libraries for Viz"

try:
    if __name__ == '__main__':
        display(Markdown('<div><div class="loader"></div><h2> &nbsp; IMPORTING libraries for charts and visualization</h2></div>'))
    
    from IPython.display import display, Markdown, HTML
    
    from termcolor import colored

    from matplotlib import pyplot as plt
    from matplotlib import gridspec, cm
    
    import seaborn as sns
    
    import plotly
    import plotly.express as px
    import plotly.graph_objects as go
    from plotly.offline import plot, iplot
    from plotly.subplots import make_subplots
    import plotly.io as pio
    pio.renderers.default = "notebook"
    
    from bokeh.models.widgets import Panel, Tabs, TextInput, DataTable, TableColumn
    from bokeh.io import output_file, show, reset_output, output_notebook
    from bokeh.plotting import figure
    from bokeh.layouts import column, row, gridplot, widgetbox
    from bokeh.models import Div, ColumnDataSource, Plot, LinearAxis, Grid, Range1d, Band, LinearColorMapper, CategoricalColorMapper, ColorBar, FactorRange, Legend
    from bokeh.models.glyphs import Text
    from bokeh.models.tools import HoverTool, SaveTool, ResetTool, PanTool, WheelZoomTool, CrosshairTool
    from bokeh.palettes import Category20
    from bokeh.transform import factor_cmap
    from bokeh.embed import components
    
    from scipy.cluster.hierarchy import dendrogram
    
    from dtaidistance import clustering
    from dtaidistance import dtw

    reset_output()
    output_notebook()
    
    track_cell(value, flag)
except Exception as err:
    if __name__ == '__main__':
        clear_output()
    print(err)
    flag = 0
    err = str(err)
    track_cell(value, flag, err)
    
else:
    if __name__ == '__main__':
        clear_output()
    display(Markdown('<span style="color:darkgreen; font-weight: bold; font-size: 15px">All the libraries for <b>charts and visualization</b> are successfully IMPORTED!</span>'))

***

# Import Dataset

Data can be loaded in three ways, depending on where it is stored:

## Import dataset from LOCAL

Execute the cell below and provide an __absolute__ or __relative path__ to load the data from your __computer__.

In [None]:
# loading the data from computer (local)
value="Loading data from local"

display(Markdown('### Loading data from LOCAL'))

if __name__ == '__main__':
    display(Markdown('Enter the path for CSV file and press `Enter`.'))
    display(Markdown('E.g.- \nFor __Ubuntu/macOS__: `./Sample_Datasets/HotelPanel_100.csv`'))
    display(Markdown('While running on __JupyterHub__ in __Ubuntu/macOS__, place your datasets in the `Sample_Datasets` folder and enter the path in the above mentioned format.'))
    display(Markdown('For __Windows__, replace each forward slash `/` with a backslash `\\`.'))
    display(Markdown('__NOTE__ : If you want to __skip__ this step, simply press `Enter` once the user input box appears.'))

    # read the data from computer
    data_path = input('Path: ')

    if data_path == '':
        print(colored('\nNO path entered!','red',attrs=['bold']),
              colored('Please run this cell again to enter the path.','grey'))
    else:
        try:
            # data is read
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; LOADING</h2></div>'))

            # obtaining the file name from the data path provided
            if '/' in data_path:                     # if the OS is Ubuntu or macOS
                file_name = data_path.split('/')[-1]
            elif '\\' in data_path:                  # if the OS is Windows
                file_name = data_path.split('\\')[-1]
            else:                                   # if the notebook and the data are in the same directory
                file_name = data_path

            df = pd.read_csv(data_path)
            clear_output()
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; LOADING data from computer</h2></div>'))

            track_cell(value, flag)
        except Exception as err:
            # display the error
            clear_output()
            print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
            flag = 0
            err = str(err)
            track_cell(value, flag, err)

        else:
            # displaying the necessary information
            clear_output()
            display(Markdown('File `{}` having __{} rows__ and __{} columns__ loaded.'.format(file_name, df.shape[0], df.shape[1])))
            
            if df.shape[0] > 2500:
                display_data(df.head(2500))
            else:
                display_data(df)
else:
    data_path = './Sample_Datasets/HotelPanel_100.csv'
    file_name = data_path.split('/')[-1]
    
    df = pd.read_csv(data_path)
    display(Markdown('File `{}` having __{} rows__ and __{} columns__ loaded.'.format(file_name, df.shape[0], df.shape[1])))
    display_data(df)
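
The cell above extracts the file name by splitting the path on `/` or `\`. As an alternative sketch (not the notebook's own code), the pure-path classes in the standard-library `pathlib` module can take over the separator handling. The helper name `file_name_from_path` and the Windows example path are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath

def file_name_from_path(data_path):
    """Return the last component of a POSIX- or Windows-style path string."""
    # PureWindowsPath understands both '\\' and '/' as separators, so it is
    # used whenever a backslash is present; otherwise the string is treated
    # as a POSIX path so that nothing in it is misread as a drive letter.
    if "\\" in data_path:
        return PureWindowsPath(data_path).name
    return PurePosixPath(data_path).name

posix_name = file_name_from_path("./Sample_Datasets/HotelPanel_100.csv")
windows_name = file_name_from_path("C:\\data\\HotelPanel_100.csv")
bare_name = file_name_from_path("HotelPanel_100.csv")
print(posix_name, windows_name, bare_name)
```

Pure paths never touch the file system, so the helper can be checked on any operating system regardless of where the notebook runs.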

## Import dataset from a URL

Execute the cell below to access the data from a __URL__. _Link provided by the user needs to be a __direct link to the CSV__._

Access the set of sample datasets [here](https://vincentarelbundock.github.io/Rdatasets/datasets.html), choose your dataset, right-click on the `CSV` hyperlink, copy the link and paste it in the user input.

In [None]:
# loading the data from URL
value="Loading data from URL"

if __name__ == '__main__':
    display(Markdown('### Loading data from URL'))
    display(Markdown('Enter your URL and press `Enter`.'))
    display(Markdown('__NOTE__ : If you want to __skip__ this step, simply press `Enter` once the user input box appears.'))

    cell_name = 'load_from_url'

    # library to send the request and receive the content from URL
    from requests import get

    # once the box appears after executing this cell, enter your URL and press 'Enter'
    enter_url = input('URL: ')

    if enter_url == '':
        print(colored('\nNO URL entered!','red',attrs=['bold']),
              colored('Please run this cell again to enter the URL.','grey'))
    else:
        try:
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; DOWNLOADING data from URL </h2></div>'))
            # send request to obtain the data from the URL and stop on an HTTP error status
            data_from_url = get(enter_url)
            data_from_url.raise_for_status()
            clear_output()

            # CSV file name obtained from online
            file_name = enter_url.split('/')[-1]

            # file that will be created on your local machine when the data is pulled from the URL
            # (`os` is never imported in this notebook, so the path is built with pathlib)
            file_path = Path.cwd() / 'Sample_Datasets' / file_name

            # create the target folder if needed and write the pulled data to the CSV file
            file_path.parent.mkdir(parents=True, exist_ok=True)
            file_path.write_bytes(data_from_url.content)

            display(Markdown('<div><div class="loader"></div><h2> &nbsp; LOADING the data </h2></div>'))
            # read the data
            df = pd.read_csv(file_path)
            clear_output()

            track_cell(value, flag)

        except Exception as err:
            clear_output()
            # display the error
            print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
            flag = 0
            err = str(err)
            track_cell(value, flag, err)

        else:
            # displaying necessary information
            clear_output()
            display(Markdown('File `{}` having __{} rows__ and __{} columns__ loaded.'.format(file_name, df.shape[0], df.shape[1])))
            display_data(df)

else:
    pass
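
The cell above takes everything after the last `/` of the URL as the file name, which also picks up any query string. The sketch below shows one way to derive the name from the URL path alone using only the standard library. The helper `file_name_from_url`, its `default` fallback, and the `example.com` URLs are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit

def file_name_from_url(url, default="downloaded_data.csv"):
    """Derive a local file name from the path component of a URL."""
    # urlsplit separates the path from the query string and fragment,
    # so "?raw=true" or "#section" never ends up in the file name.
    path = unquote(urlsplit(url).path)
    name = PurePosixPath(path).name
    # `default` is a hypothetical fallback for URLs whose path has no file name.
    return name or default

with_query = file_name_from_url("https://example.com/data/AirPassengers.csv?raw=true")
encoded = file_name_from_url("https://example.com/data/Hotel%20Panel.csv")
no_name = file_name_from_url("https://example.com/")
print(with_query, encoded, no_name)
```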

## Import dataset from a DATABASE

Execute the cell below to access the data from a __Database Server__. Change the __Host IP Address and Credentials__ based on your details.

The example snippets below connect to PostgreSQL and MS SQL database servers. Other database servers can be accessed using the same logic. Both cells are commented out; uncomment the one you need before executing it.


1) Connection through __PostgreSQL__ :

In [None]:
# # pulling data through PostgreSQL

# value = "Loading data from DATABASE (postgreSQL)"

# try:
#     display(Markdown('<div><div class="loader"></div><h2> &nbsp; LOADING data from DATABASE (POSTGRESQL) </h2></div>'))
    
#     #code to establish a connection to the server
#     conn = psycopg2.connect("dbname='postgres' user='' host='10.1.2.60' password=''")

#     #code to fetch the dataset from the server
#     df = psql.read_sql("""SELECT * FROM "msu_greco"."cosmo_store_table" """, conn)

#     track_cell(value, flag)
# except Exception as err:
#     clear_output()
#     # display the error
#     print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
#     flag = 0
#     err = str(err)
#     track_cell(value, flag, err)

# else:
#     # displaying necessary information
#     clear_output()
#     display(Markdown('File having __{} rows__ and __{} columns__ loaded.\
#     The first 10 rows are shown below:'.format(df.shape[0], df.shape[1])))
#     display(df.head(10).style.applymap(color_negative_red).highlight_null(null_color='lightblue'))
 

2) Connection through __MS SQL__ :

In [None]:
# # pulling data through ms sql

# value = "Loading data from DATABASE (MS SQL)"

# try:
#     display(Markdown('<div><div class="loader"></div><h2> &nbsp; LOADING data from DATABASE (MS SQL) </h2></div>'))  
#     server = 'INBAAVVMMSUSQL'
#     db = 'MSU_2016'
#     user = ''
#     pw = ''
    
#     # code to establish a connection to the server
#     conn = pymssql.connect(server, user, pw, db)
#     cursor = conn.cursor()

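#     # code to fetch the dataset from the server
#     # NOTE: `query` is a placeholder; assign your own SQL statement to it first
#     cursor.execute(query)
#     df = pd.DataFrame(cursor.fetchall())
# #     df.columns = [desc[0] for desc in cursor.description]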
#     track_cell(value, flag)

# except Exception as err:
#     clear_output()
#     # display the error
#     print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
#     flag = 0
#     err = str(err)
#     track_cell(value, flag, err)

# else:
#     # displaying necessary information
#     clear_output()
#     display(Markdown('File having __{} rows__ and __{} columns__ loaded.\
#     The first 10 rows are shown below:'.format(df.shape[0], df.shape[1])))
#     display(df.head(10).style.applymap(color_negative_red).highlight_null(null_color='lightblue'))
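
Both snippets follow the same Python DB-API pattern: connect, execute a query, fetch the rows, and read the column names from the cursor. The sketch below runs that pattern against an in-memory SQLite database from the standard library, which stands in for a remote server. The table `store` and its rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a remote server (illustration only).
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Hypothetical table and rows, invented for this example.
cursor.execute("CREATE TABLE store (store_id TEXT, sales REAL)")
cursor.executemany("INSERT INTO store VALUES (?, ?)", [("S1", 120.5), ("S2", 98.0)])

# 1) run the query, 2) fetch the rows, 3) read the column names from the cursor
cursor.execute("SELECT store_id, sales FROM store ORDER BY store_id")
rows = cursor.fetchall()
columns = [desc[0] for desc in cursor.description]
conn.close()

print(columns)
print(rows)
```

With pandas available, `pd.DataFrame(rows, columns=columns)` turns the result into a dataframe with proper column names.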


***

# Dataset Understanding

## Quick peek

In [None]:
# replacing any empty cell of data type string in any column with NaN
df = df.replace(r'^\s*$', np.nan, regex=True)
# replacing inf with NaN
df = df.replace([np.inf, -np.inf], np.nan)

display(Markdown('The __data summary__ is shown below:'))
print('_'*75+'\n')
df.info()
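
The two `replace` calls above apply a simple per-cell rule: a string that is empty or contains only white space becomes missing, and so does positive or negative infinity. The plain-Python sketch below restates that rule for a single value, with `None` standing in for `NaN`. The helper `clean_cell` and the sample values are hypothetical.

```python
import math
import re

BLANK = re.compile(r"^\s*$")  # same pattern as in the cell above

def clean_cell(value):
    """Return None for blank strings and infinities, else the value unchanged."""
    if isinstance(value, str) and BLANK.match(value):
        return None
    if isinstance(value, float) and math.isinf(value):
        return None
    return value

raw = ["12.5", "", "   ", float("inf"), -float("inf"), 0.0, "n/a"]
cleaned = [clean_cell(v) for v in raw]
print(cleaned)
```

Placeholder strings such as `"n/a"` are not blank, so this rule leaves them untouched.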

## Deleting irrelevant columns

The cell below defines a function that accepts user input and generates a list of column names to be dropped.

In [None]:
# function to accept the list of column names and prompt for the column name(s) to be deleted

def columns_to_delete(column_names):
    '''
    Function to collect the column(s) to delete from user input. For multiple columns, the user
    enters the names in comma-separated format.
    It validates that every column name entered is available in the dataframe.
    If any name is entered incorrectly, it keeps prompting the user until the entry is valid.
    
    input:
        column_names : List of columns available in the dataframe
    return:
        col_to_drop : List of columns to be deleted (empty if the user skips)
    '''
    # keep prompting until the entry is valid or the user skips
    while True:
        col_to_drop_input = input("\nEnter column name(s) to drop: ")

        # an empty entry skips this operation; returning a fresh empty list ensures that
        # names rejected in an earlier attempt are never returned
        if col_to_drop_input.strip() == '':
            return []

        # split the string on ',' and discard elements that are empty after stripping
        col_to_drop = [i.strip() for i in col_to_drop_input.split(',') if i.strip()]

        # proceed only if every column name entered exists in the dataframe
        if col_to_drop and all(item in column_names for item in col_to_drop):
            return col_to_drop

        print(colored("\nINCORRECT COLUMN NAME(S). Please enter all the names again!","red",attrs=['bold']))

display(Markdown('<span style="color:darkgreen; font-size: 15px"><i>CODE EXECUTED!</i> Continue executing the next cell for performing <b>column deletion</b>.</span>'))
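
Because the function above reads from `input()`, its parsing and validation are hard to check in isolation. The sketch below separates that logic into a pure function that also reports which names were not found. The helper `parse_column_entry` and the sample column names are hypothetical.

```python
def parse_column_entry(raw_entry, available_columns):
    """Split a comma-separated entry into known and unknown column names."""
    names = [name.strip() for name in raw_entry.split(",") if name.strip()]
    # dict.fromkeys removes duplicates while keeping the order typed by the user
    names = list(dict.fromkeys(names))
    known = [name for name in names if name in available_columns]
    unknown = [name for name in names if name not in available_columns]
    return known, unknown

columns = ["fac_id", "date", "Room Type", "X"]   # hypothetical column names
known, unknown = parse_column_entry("X, Room Type, , price, X", columns)
print(known, unknown)
```

Returning the unknown names separately would let the prompt tell the user exactly which entry to correct.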

Now, we apply this function and obtain the modified dataframe.

In [None]:
# eliminate the columns
value="Deleting Cols"

try:
    if __name__ == '__main__':
        display(Markdown('__PLEASE READ THE INSTRUCTION TO DELETE COLUMN BEFORE PROCEEDING!__'))
        display(Markdown('Enter the column name(s) from the list shown _(comma `,` separated for multiple)_ and press `Enter`.'))
        display(Markdown('__NOTE__ : If you want to __skip__ this step, simply press `Enter` once the user input box appears.'))
        print('_'*75)

        df_cols = df.columns
        num_cols = df._get_numeric_data().columns.tolist()
        cat_cols = [col for col in df_cols if col not in num_cols]   # keeps the original column order

        # display list of numerical column names for ease
        num_cols_vis = ' || '.join(num_cols)
        print(colored("\nNumerical Columns:\nCount:",'magenta',attrs=['bold']),"{}\n{}".format(len(num_cols), 
                                                                                             num_cols_vis))
        # display list of categorical column names for ease
        cat_cols_vis = ' || '.join(cat_cols)
        print(colored("\nCategorical Columns:\nCount:",'blue',attrs=['bold']),"{}\n{}".format(len(cat_cols), 
                                                                                              cat_cols_vis))

        col_to_drop = columns_to_delete(df_cols)
        if col_to_drop == []:
            clear_output()
            display(Markdown('__No columns were dropped!__'))
        else:
            clear_output()
            df.drop(col_to_drop, axis=1, inplace=True)

            # updated list of columns in dataframe
            cols = df.columns

            # updated list of numerical columns
            num_cols = df._get_numeric_data().columns.tolist()

            # updated list of categorical columns
            cat_cols = [col for col in cols if col not in num_cols]   # keeps the original column order

            display(Markdown('__Column(s) dropped__ : {}'.format(col_to_drop)))
            
            # displaying the updated dataframe
            print('_'*75)
            display(Markdown('__Updated data type summary__ :'))
            # df.info() prints the summary itself and returns None, so it is not wrapped in display()
            df.info()
            
    else:
        df.drop('X', axis=1, inplace=True)
        display(Markdown('__Column dropped__ : X'))
    
    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

## Formatting & renaming columns

##### Formatting Columns

Here we lowercase every column name, replace any _white space_, _dot_ `.` or _comma_ `,` in it with an underscore, and remove apostrophes, for convenience.

In [None]:
# formatting the columns
# value="Formatting Cols"
value="Formatting Cols 2"

try:
    # before formatting
    original_cols = df.columns
    cols_vis = ' || '.join(original_cols)
    print(colored("\nBEFORE FORMATTING Column Names:",'grey',attrs=['bold']),"\n{}".format(cols_vis))

    # lower-case every column name, replace any white-space/dot/comma separator with '_' and remove apostrophes
    df = df.rename(columns = lambda x: (x.lower()).replace(' ','_').replace('.','_').replace(',','_').replace("'",""))

    updated_cols = df.columns
    cols_vis = ' || '.join(updated_cols)
    print(colored("\nAFTER FORMATTING Column Names:",'grey',attrs=['bold']),"\n{}".format(cols_vis))

    # displaying the necessary information
    print('_'*75)
    display(Markdown('__Updated data summary is as follows:__'))
    # df.info() prints the summary itself and returns None, so it is not wrapped in display()
    df.info()
  
    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)
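
One side effect of this formatting is that two different original names can end up identical. The sketch below repeats the same replacement rules in a standalone function and lists such collisions. The helpers `normalize_column_name` and `find_collisions` and the sample names are hypothetical.

```python
from collections import Counter

def normalize_column_name(name):
    """Lower-case a name, replace space/dot/comma with '_' and drop apostrophes."""
    return (str(name).lower()
            .replace(" ", "_").replace(".", "_").replace(",", "_").replace("'", ""))

def find_collisions(names):
    """Return the normalized names that more than one original name maps to."""
    counts = Counter(normalize_column_name(n) for n in names)
    return sorted(name for name, count in counts.items() if count > 1)

original = ["Room Type", "room.type", "Guest's Age", "ADR"]   # hypothetical names
normalized = [normalize_column_name(n) for n in original]
collisions = find_collisions(original)
print(normalized)
print(collisions)
```

Checking for collisions before renaming avoids ending up with duplicate column labels in the dataframe.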

##### Renaming Columns

To __rename__ one or more columns, execute the following cell.

In [None]:
# renaming the columns
value="Renaming Cols"

try:
    if __name__ == '__main__':
        # displaying the instruction
        display(Markdown('__PLEASE READ THE INSTRUCTION TO RENAME COLUMN BEFORE PROCEEDING!__'))
        display(Markdown('Enter the column name from the list shown, press `enter` and then give your custom column name.'))
        display(Markdown('If you wish to proceed for more than one column, type `y` else `n`, and press `Enter`.'))
        display(Markdown('__NOTE__ : If you want to __skip__ this step, simply press `enter` once the user input box appears.'))
        print('_'*75)

        # the list of columns in dataframe
        original_cols = df.columns
        cols_vis = ' || '.join(original_cols)
        print(colored("\nColumn Names:",'grey',attrs=['bold']),"\n{}".format(cols_vis))
        print('_'*75)

        # creating an empty dictionary to keep the old and new column name in pair
        col_dict = dict()
        no_change = False

        # entering an infinite loop until the user doesn't want to change column names anymore
        while True:
            # check if the user is entering the right column name
            while True:
                org_col_in = input('\nEnter the original column name: ')
                if org_col_in in original_cols or org_col_in == '':
                    break
                else:
                    print(colored("\nINCORRECT COLUMN NAME. Please try again!","red",attrs=['bold']))
                    continue

            if org_col_in == '':
                no_change = True
                break
            else:
                pass

            # check if the user is entering the new name in correct format
            while True:
                new_col_in = input('\nEnter the new column name (use under-score [_] to separate multiple words): ')

                if new_col_in in original_cols:
                    print(colored("\nCOLUMN NAME ALREADY EXISTS. Please try again!","red",attrs=['bold']))
                    continue
                else:
                    # a valid name is non-empty and contains no white space
                    if new_col_in and not any(ch.isspace() for ch in new_col_in):
                        break
                    else:
                        print(colored("\nINCORRECT FORMAT. Please try again!","red",attrs=['bold']))
                        continue

            # keep the pair of old and new column name in the dictionary
            col_dict[org_col_in] = new_col_in

            # check if the user wishes to rename more columns
            while True:
                rename_another_col = input('\nDo you want to rename another column? (Y/N): ').strip().lower()
                if rename_another_col in ('y', 'n'):
                    break
                else:
                    print(colored("\nINCORRECT ENTRY. Please select y or n!","red",attrs=['bold']))
                    continue

            # if user wants to rename columns again, then repeat else break from this infinite loop
            if rename_another_col == 'y':
                continue
            else:
                break


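        # nothing is renamed only if no pair was collected; an empty entry after at least
        # one completed rename simply ends the loop and keeps the collected pairs
        if not col_dict: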
            clear_output()
            display(Markdown('__No columns renamed!__'))
        else:
            # renaming the columns in the original dataframe
            df.rename(columns=col_dict, inplace=True)

            # obtain the list of updated column names and display them
            updated_cols = df.columns
            updated_cols_vis = ' || '.join(updated_cols)
            print('_'*75)
            print(colored("\nUpdated Column Names:",'grey',attrs=['bold']),"\n{}".format(updated_cols_vis))
    else:
        pass
    
    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

## Type of data

Here, specify the type of data: multi-panel data or single-panel time-series data. The rest of the analysis depends on this input.

In [None]:
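# select what kind of data it is

if __name__ == '__main__':
    # keep prompting until a valid type is entered, so that a typo is not silently read as SP
    while True:
        dataset_type = input("Enter the type of the data (MP for Multi-Panel/SP for Single-Panel): ").strip().lower()
        if dataset_type in ('mp', 'sp'):
            break
        print(colored("\nINCORRECT ENTRY. Please enter MP or SP!","red",attrs=['bold']))
    clear_output()

    if dataset_type == 'mp':
        display(Markdown("The selected datatype is: __Multi-Panel__"))
    else:
        display(Markdown("The selected datatype is: __Single-Panel Time Series__"))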

else:
    dataset_type = 'mp'
    display(Markdown("The selected datatype is: __Multi-Panel__"))

## Assigning a unique row/panel identifier

Here, the user enters the name of the column that will be treated as the unique panel identifier throughout the notebook. This column is excluded from any analysis performed on the dataset.

In [None]:
# assigning panel identifier
value="Assigning panel identifier"

try:
    if __name__ == '__main__':
        if dataset_type.lower() == 'mp':
            display(Markdown('\nEnter the name of column to be used as the __panel identifier__.'))

            # the list of columns in dataframe
            original_cols = df.columns
            cols_vis = ' || '.join(original_cols)
            print(colored("\nColumn Names:",'grey',attrs=['bold']),"\n{}".format(cols_vis))
            print('_'*75)

            while True:
                panel_col = input("\nEnter the column name: ")

                if panel_col not in df.columns:
                    print(colored("\nPlease enter the column name properly!","red",attrs=['bold']))
                    continue
                else:
                    clear_output()
                    display(Markdown("The column which will be considered as the __unique panel identifier__ is: __{}__".format(panel_col)))
                    # list down the panels
                    panel_ids = list(set(df[panel_col]))
                    display(Markdown('__Panel Information:__'))
                    
                    panel_shape = []
                    for each_panel in panel_ids:
                        panel_data = df.groupby(panel_col).get_group(each_panel)
                        panel_shape.append(panel_data.shape[0])
                    
                    panel_info = pd.DataFrame()
                    panel_info['panel_names'] = panel_ids
                    panel_info['size'] = panel_shape

                    display_data(panel_info)

                    break
        else:
            display(Markdown('Not applicable for __single panel time series data__!'))
            
    else:
        panel_col = 'fac_id'
        panel_ids = list(set(df[panel_col]))
        display(Markdown("The column which will be considered as the __unique panel identifier__ is: __{}__".format(panel_col)))

    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

## Assigning Target Variable

Here, the user enters the column name of the target variable to be used in the univariate and multivariate analysis.

In [None]:
# assigning target variable
value="Assigning target variable"

try:
    if __name__ == '__main__':
        display(Markdown('Enter the name of column to be used as the __target variable__.'))

        # the list of columns in dataframe
        original_cols = df.columns
        cols_vis = ' || '.join(original_cols)
        print(colored("\nColumn Names:",'grey',attrs=['bold']),"\n{}".format(cols_vis))
        print('_'*75)

        while True:
            target_var = input("\nEnter the column name: ")
            if target_var not in original_cols:
                print(colored("\nPlease enter the column name properly!","red",attrs=['bold']))
                continue
            else:
                clear_output()
                display(Markdown("The column which will be considered as the __target variable__ is: __{}__".format(target_var)))
                break
    else:
        target_var = 'occupancy'
        display(Markdown("The column which will be considered as the __target variable__ is: __{}__".format(target_var)))

    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

## Formatting date-time column(s)

In this section, we choose the date-time column and convert it to a proper date-time format for later use. This step is essential: other important components of the time series, such as the `frequency` and the `hierarchy` of the data, can only be selected once the date-time column has been set.

_Sample Format_

|       date-time       | Format                   |
| ----------------------| ------------------------ |
| 01/01/2013 00:00:00   |  "%d/%m/%Y %H:%M:%S"     |
| 01:01:2013 00:00:01   |  "%d:%m:%Y %H:%M:%S"     |
| 01-01-2013 00:00:02   |  "%d-%m-%Y %H:%M:%S"     |
| 01012013 00:00:03     |  "%d%m%Y %H:%M:%S"       |
| 01:01:2013:00:00:04 AM|  "%d:%m:%Y:%H:%M:%S %p"  |
| 01/01-2013 00:00:05 PM|  "%d/%m-%Y %H:%M:%S %p"  |

**NOTE:** By default pandas stores timestamps as 64-bit nanosecond counts, so date-times before `1677-09-22 00:12:43.145225` cannot be converted.
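
To make the role of the format string concrete, here is a minimal sketch on made-up strings. It assumes day-first dates as in the sample table above, and shows how `errors="coerce"` (an optional argument that the cell below does not use) turns unparseable values into `NaT` instead of raising an error.

```python
import pandas as pd

raw = pd.Series(["01/02/2013 00:00:00", "15/02/2013 13:30:00"])

# an explicit format removes the day/month ambiguity of "01/02/2013"
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")
print(parsed.dt.month.tolist())  # both rows fall in February

# with errors="coerce", values that cannot be parsed become NaT
# instead of raising an exception (31 February does not exist)
bad = pd.to_datetime(
    pd.Series(["31/02/2013 00:00:00"]),
    format="%d/%m/%Y %H:%M:%S",
    errors="coerce",
)
print(bad.isna().tolist())
```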

In [None]:
# formatting date-time
value="Formatting date-time"

try:
    if __name__ == '__main__':
        while True:

            # Selection of date-time column
            date_time_col = input("Enter the date time column: ")
            ts_format = input("Enter the existing format of datetime column (E.g. %Y-%m-%d): ")

            # Converting Date-Time to Date-Time format
            # check the selected column itself (not the dataframe index) for an existing datetime dtype
            if pd.api.types.is_datetime64_any_dtype(df[date_time_col]):
                print(colored("Selected datetime column is already in datetime format.","green",attrs=["bold"]))
                break
            else:
                df[date_time_col] = pd.to_datetime(df[date_time_col], format= ts_format)
                
                # `pd.to_datetime` above already yields a proper datetime column, so no
                # string round-trip is needed; re-parsing a formatted string without its
                # format could swap day and month for inputs such as "%d-%m-%Y"

                while True:
                    display(Markdown('If your dataset has time stamps and you want to retain it along with the date, press `Y`.\
                    Otherwise, if there is date only and you want to keep it that way, press `N`.'))
                    is_timestamp = input('\nDo you want to keep both the date and timestamp? (Y/N): ')
                    if is_timestamp.lower() == 'y':
                        df[date_time_col] = pd.to_datetime(df[date_time_col])
                        break
                    elif is_timestamp.lower() == 'n':
                        df[date_time_col] = pd.to_datetime(df[date_time_col]).dt.date
                        break
                    else:
                        print(colored('Incorrect entry! Please try again.','red',attrs=['bold']))
                        continue

                clear_output()
                display(Markdown("The selected date-time column is: __{}__".format(date_time_col)))
                display(Markdown('Datetime column converted!'))
                
                df = df.sort_values(date_time_col)
                
                if df.shape[0] > 2500:
                    display_data(df.head(2500).round(4))
                else:
                    display_data(df.round(4))

                break
                
    else:
        date_time_col = 'yearmonth'
        ts_format = '%Y%m'
        df[date_time_col] = pd.to_datetime(df[date_time_col], format=ts_format)
        ts_format_final = '-'.join(ts_format[i:i+2] for i in range(0, len(ts_format), 2))
        df[date_time_col] = df[date_time_col].map(lambda x: x.strftime(ts_format_final))
        df[date_time_col] = pd.to_datetime(df[date_time_col]).dt.date
        
        df = df.sort_values(date_time_col)
        
        display(Markdown("The selected date-time column is: __{}__".format(date_time_col)))
        display(Markdown('Datetime column converted!'))

        if df.shape[0] > 2500:
            display_data(df.head(2500).round(4))
        else:
            display_data(df.round(4))
    
    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

## Changing data type of columns

In this section, data types for selected column(s) can be updated.

__NOTE__ : If your data contains special characters such as `$` or `@` that should be ignored during conversion, add them to the `special_chars` pattern at the beginning of the code, separated from the other characters by `|`. The `|` character itself cannot be handled this way, so it is removed in a separate step first.
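
The sketch below shows the same cleaning idea on a few made-up strings. It is not the exact pattern used in the cell below. It relies on the fact that every character inside a regex character class `[...]` is matched literally, so `|` can sit in the class like any other character.

```python
import pandas as pd

prices = pd.Series(["$1,200", "3|400", "56%", "7"])

# inside [...] every character is literal, so `|` needs no separate handling
# and no separators are required between the characters
special_chars = r"[?$,@#%!|]"

# strip the special characters, then convert the remaining digits
cleaned = prices.str.replace(special_chars, "", regex=True)
as_int = cleaned.astype("int64")
print(as_int.tolist())
```

Passing `regex=True` or `regex=False` explicitly avoids depending on the default, which differs between pandas versions.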

In [None]:
# type-cast columns for proper analysis
value='Type Cast Columns'

try:
    if __name__ == '__main__':
        # list of special characters to be ignored
        special_chars = r'[?|$|,|@|#|%|!]'
        
        # displaying the instruction
        display(Markdown('__PLEASE READ THE INSTRUCTION TO TYPE-CAST COLUMN BEFORE PROCEEDING!__'))
        display(Markdown('Enter the column name from the list shown, press `Enter` and then give your data type.'))
        display(Markdown('Enter the name of column(s) comma `,` separated for __multiple inputs__.'))
        display(Markdown('__NOTE__ : If you want to __skip__ this step, simply press `Enter` once the user input box appears.'))
        print('_'*75)

        # the list of columns in dataframe
        display(Markdown('__Original data type of columns:__'))
        display(df.dtypes)
        print('_'*75)

        col_type_cast = []

        # keep asking for column names until the input is valid
        while True:
            col_type_cast_input = input("\nEnter the column name(s): ")

            # if the user presses Enter without typing anything, skip this operation
            if col_type_cast_input == '':
                col_type_cast = []
                break
            else:
                # if it is a single column entry, keep on asking for input unless correctly entered
                if ',' not in col_type_cast_input:
                    if ' ' in col_type_cast_input:
                        print(colored("\nPlease read the instruction and enter properly!","red",attrs=['bold']))
                        continue
                    else:
                        if col_type_cast_input in df.columns:
                            col_type_cast = [col_type_cast_input]
                            break
                        else:
                            print(colored("\nIncorrect column name! Please try again.","red",attrs=['bold']))
                            continue
                # if it is multiple column entry and properly entered, then it is stored in list
                else:
                    # split the string with ',' and convert it into list
                    # check if any element is empty, then eliminate it
                    # remove extra white spaces from column name using `strip`

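                    col_type_cast = [i.strip() for i in col_type_cast_input.split(',') if i.strip()]

                    # ask again if any of the entered names is not a column of the dataframe
                    invalid_cols = [i for i in col_type_cast if i not in df.columns]
                    if invalid_cols:
                        print(colored("\nIncorrect column name(s): {}. Please try again.".format(invalid_cols),"red",attrs=['bold']))
                        continue
                    break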

        if col_type_cast == []:
            clear_output()
            display(Markdown('__No data type conversion took place!__'))
            display(Markdown('__Original data type of the columns is:__'))
            print('_'*75)
            pass
        else:
            for each_col in col_type_cast:
                # ask the user the desired type cast
                what_type = input('\nEnter the data type (int/str/float) for column `{}`: '.format(each_col))
                if 'int' in what_type:
                    try:
                        if df[each_col].dtype == object:
                            # remove the literal `|` character (regex=False so it is not read as a pattern)
                            df[each_col] = df[each_col].str.replace('|','', regex=False)
                            df[each_col] = df[each_col].map(lambda x:re.sub(special_chars,r'',x)).astype('int64')
                        else:
                            df[each_col] = df[each_col].astype('int64')
                    except Exception as err:
                        # display the error
                        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
                    else:
                        print(colored('\nCONVERTED!:','green',attrs=['bold']))

                elif 'str' in what_type:
                    try:
                        df[each_col] = df[each_col].astype('str')
                    except Exception as err:
                        # display the error
                        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
                    else:
                        print(colored('\nCONVERTED!:','green',attrs=['bold']))

                elif 'float' in what_type:
                    try:
                        if df[each_col].dtype == object:
                            # remove the literal `|` character (regex=False so it is not read as a pattern)
                            df[each_col] = df[each_col].str.replace('|','', regex=False)
                            df[each_col] = df[each_col].map(lambda x:re.sub(special_chars,r'',x)).astype('float64')
                        else:
                            df[each_col] = df[each_col].astype('float64')
                    except Exception as err:
                        # display the error
                        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
                    else:
                        print(colored('\nCONVERTED!:','green',attrs=['bold']))

                else:
                    pass

            print("_"*75)
            display(Markdown('__Updated data type of columns:__'))

        display(df.dtypes)
        
    else:
        pass
    
    track_cell(value, flag)
    
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

## Checking and removing duplicates

Duplicate observations can be misleading. This section helps identify and remove such rows from the dataset.
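
The counting logic used in the cell below can be sketched on a small invented frame. The sketch assumes that "duplicate" means a row identical to another row in every column, which is how `DataFrame.duplicated` behaves by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 3, 3],
    "val": ["a", "a", "a", "b", "c", "c"],
})

# keep=False flags every row that has at least one identical twin
all_dupes = toy[toy.duplicated(keep=False)]

# keep="last" flags every copy except the last one of each group
extra_copies = toy[toy.duplicated(keep="last")]

# the difference is the number of distinct rows that occur more than once
n_duplicated_groups = len(all_dupes) - len(extra_copies)

# dropping duplicates keeps one (here: the last) copy of each row
deduped = toy.drop_duplicates(keep="last")
print(n_duplicated_groups, len(deduped))
```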

In [None]:
# check for duplicate rows
value="Check Duplicates"

try:
    # length of original data
    len_df_before = len(df)
    display(Markdown('__The length of the original dataframe__ : {}'.format(len_df_before)))

    # check for duplicate rows using parameter 'keep' having 3 possible values
    # 
    # *first (Default) : considers (counts) duplicates except for the first occurrence.
    # *last : considers (counts) duplicates except for the last occurrence.
    # *False : considers (counts) all duplicates.

    # collect duplicate using 'last'
    df_duplicate_avoid_one_val = df[df.duplicated(keep='last')]

    # collect all the duplicates
    df_duplicate_all = df[df.duplicated(keep=False)]

    # get unique count of duplicates
    len_df_duplicate_avoid_one_val = len(df_duplicate_avoid_one_val)
    len_df_duplicate_all = len(df_duplicate_all)
    count_unique_val_with_duplicates = len_df_duplicate_all - len_df_duplicate_avoid_one_val
    display(Markdown('\n__The number of unique duplicates__ : {}'.format(count_unique_val_with_duplicates)))

    if count_unique_val_with_duplicates == 0:
        print(colored("NO DUPLICATES FOUND!",'green',attrs=['bold']))
    else:
        print(colored("DUPLICATES ARE SPOTTED!",'red',attrs=['bold']))
        
    track_cell(value, flag)
except Exception as err:
    clear_output()
    # display the error
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

### Deleting duplicate rows

If there are duplicate values, then it is __recommended to delete__ those rows.

In [None]:
# drop duplicate rows, keeping the last occurrence of each
value="Treat duplicate data"

try:
    if count_unique_val_with_duplicates == 0:
        print(colored('NO DUPLICATE VALUES to remove!','green',attrs=['bold']))
    else:
        display(Markdown('<div><div class="loader"></div><h2> &nbsp; Processing</h2></div>'))

        print('Number of rows of the original dataframe: {}'.format(df.shape[0]))

        # dropping the rows
        df.drop_duplicates(keep='last',inplace=True)
        clear_output()

        print(colored('DUPLICATE VALUES removed.','red',attrs=['bold']))
        print('\nAfter removing duplicate values, the number of rows in the dataframe is: {}'.format(df.shape[0]))
        display(Markdown('The __data type summary__ is shown below:'))
        df.info() 
        
    track_cell(value, flag)        
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

## List of numerical & categorical column names

In [None]:
# the list of columns in dataframe
df_cols = df.columns

# obtain the list of numerical columns
num_cols = df._get_numeric_data().columns.tolist()
num_cols_vis = ' || '.join(num_cols)
print(colored("Numerical Columns:\nCount:",'magenta',attrs=['bold']),"{}\n{}".format(len(num_cols), 
                                                                                     num_cols_vis))

# obtain the list of categorical columns
cat_cols = list(set(df_cols) - set(num_cols))
cat_cols_vis = ' || '.join(cat_cols)
print(colored("\nCategorical Columns:\nCount:",'blue',attrs=['bold']),"{}\n{}".format(len(cat_cols), 
                                                                                      cat_cols_vis))

## Filtering dataset for analysis

Here, the user can filter the rows of the data by entering a query expression, which is passed to `DataFrame.query`. Check the code for a sample query.
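
For readers unfamiliar with query expressions, here is a minimal sketch on an invented frame. The column names `location_type` and `rooms1mile` are taken from the sample query shown in the cell below; the data is made up.

```python
import pandas as pd

hotels = pd.DataFrame({
    "location_type": ["HIGHWAY", "URBAN", "HIGHWAY", "AIRPORT"],
    "rooms1mile": [120, 300, 40, 75],
})

# column names are written bare, string literals go in quotes,
# and conditions are combined with `and` / `or`
filtered = hotels.query('location_type == "HIGHWAY" and rooms1mile >= 50')
print(filtered)
```

A query that matches no rows returns an empty frame rather than raising an error, which is why the cell below checks the number of rows in the result.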

In [None]:
# filtering dataset
value="Filtering data"

try:
    if __name__ == '__main__':
        display(Markdown('__NOTE__ : If you execute this code, this will filter out from the original data and __this filtered data will be used throughout the rest of the notebook__.'))

        original_df_shape = df.shape

        while True:
            operation_input = input("Do you want to filter the data? (Y/N): ")

            # if user enters 'n', then it will break the infinite loop and skip this operation
            if operation_input == 'N' or operation_input =='n' or operation_input =='':
                break
            elif operation_input == 'Y' or operation_input =='y':
                # add conditions to filter data below

                display(Markdown('Please provide a query based on your data to filter. Take a look at this __sample query__ :\
                *location_type == "HIGHWAY" and rooms1mile>=50*'))

                filter_query = input('Enter your query: ')
                if filter_query == '':
                    display(Markdown('No query entered. _Breaking from the filter operation!_'))
                    break
                else:

                    # sample query based filtering
                    df_filtered = df.query(filter_query)
                    if df_filtered.shape[0] == 0:
                        print(colored('\nIncorrect query! Please try again.\n','red',attrs=['bold']))
                        continue
                    else:
                        print(colored('\nFiltering operation successful!\n','green',attrs=['bold']))
                        df = df_filtered.copy()
                        break
            else:
                print(colored("\nPlease read the instruction and enter properly!","red",attrs=['bold']))
                continue

        modified_df_shape = df.shape

        if modified_df_shape[0] == original_df_shape[0]:
            clear_output()
            display(Markdown('__No filtering operation performed!__'))
        else:
            clear_output()
            display(Markdown('__Filtered data__: {} rows and {} columns.'.format(modified_df_shape[0], modified_df_shape[1])))
            # df.info() prints its summary itself and returns None, so it is not wrapped in display()
            df.info()
    else:
        pass
    
    track_cell(value, flag)
except Exception as err:
    clear_output()
    # display the error
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

> __Notes__:
 
```
*Add notes here*
```

***

# Missing Value Analysis <a name='missing_val'></a>

Missing values in the training data can bias a model, because the behavior of the affected variables and their relationships with other variables cannot be analyzed correctly. This can lead to wrong predictions or classifications. Missing values can be of 3 types:
1. **Missing Completely At Random (MCAR)**: When missing data are MCAR, the presence/absence of data is completely independent of the observable variables and the parameters of interest. In this case the __probability of a value being missing is the same__ for all observations. _For example, each respondent in a data collection process tosses a fair coin and declares their earnings only if it lands heads._
2. **Missing At Random (MAR)**: The missingness is not completely random, but it can be related to another observed variable for which complete information is available. This kind of missing data can induce a bias in the analysis, especially if it unbalances the data because of many missing values in a certain category. _For example, when collecting data on age, women have a higher share of missing values than men._
3. **Missing Depending on Unobserved Predictors**: The missing values are not random and are related to a variable that was not observed.
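
The difference between the first two mechanisms can be made concrete with a small simulation. This is only a sketch on synthetic data: the column names, the sample size and the missingness probabilities (20%, 35%, 5%) are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
survey = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})

# MCAR: every row has the same 20% chance of losing its income value
mcar_mask = rng.random(n) < 0.20

# MAR: the chance of a missing income depends on the observed `gender` column
mar_prob = np.where(survey["gender"] == "F", 0.35, 0.05)
mar_mask = rng.random(n) < mar_prob

# share of missing incomes per gender under each mechanism
mcar_rate = pd.Series(mcar_mask).groupby(survey["gender"]).mean()
mar_rate = pd.Series(mar_mask).groupby(survey["gender"]).mean()
print(mcar_rate.round(2), mar_rate.round(2), sep="\n")
```

Under MCAR the missing share is roughly the same in both groups, while under MAR it differs sharply between them. Comparing missing rates across the categories of an observed variable is therefore a simple first diagnostic.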

## Missing Value on entire data

In [None]:
# calculate the percentage of total missing values in the data
percent_msng_val = (df.isnull().sum().sum()/(df.shape[0]*df.shape[1]))*100

display(Markdown('Percentage of missing value in entire dataset is: __{}%__'.format(round(percent_msng_val,4))))

## Missing Value on Column-level

The following code visualizes the percentage of missing values in each column (if any) as a bar chart.

In [None]:
# visualize the missing data
value="Visualizing missing data"

try:
    # calculate the sum
    total_msng_val = df.isnull().sum().sort_values(ascending=False)
    
    if sum(total_msng_val.tolist()) == 0:
        print(colored('NO MISSING VALUES to visualize!','green',attrs=['bold']))
    else:
        if __name__ == '__main__':
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating plot</h2></div>'))
        
            # calculate the percentage
            percent_msng_val = ((df.isnull().sum()/df.isnull().count())*100).sort_values(ascending=False)
            # generate a table for displaying the information
            missing_data = pd.concat([total_msng_val, percent_msng_val], axis=1, keys=['Total', 'Percentage']).reset_index()

            missing_data.rename(columns = {'index': 'columns'}, inplace=True)

            fig = go.Figure()
            fig.add_trace(go.Bar(x=missing_data['Percentage'], y=missing_data['columns'],
                                 orientation='h'))
            fig.update_traces(marker_color='maroon')
            fig.update_xaxes(title = "Percentage missing (%)",range=[0, 100])
            fig.update_yaxes(title = "Columns")
            
            if df.shape[1] < 30:
                fig_ht = 600
            else:
                # grow the figure with the number of columns so that the bars stay readable
                fig_ht = int(600*(df.shape[1]/30))
            fig.update_layout(title ="Percentage of Missing Values per Column",
                              height = fig_ht, width = 900)
            
            clear_output()
            fig.show(config={'displaylogo': False})
        else:
            pass
    
    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

## Missing Value on Panel-level

The following code visualizes the percentage of missing values (if any) in each column of every panel as a heatmap.
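
The table behind such a heatmap can be sketched as follows. The frame is invented and the column names are placeholders. The sketch uses the mean of the boolean missing-value mask per panel, which is equivalent to dividing the per-panel count of missing values by the panel size.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "fac_id": ["A", "A", "A", "A", "B", "B"],
    "occupancy": [0.7, np.nan, 0.8, np.nan, 0.5, 0.6],
    "rate": [100.0, 110.0, np.nan, 120.0, 90.0, 95.0],
})

# isnull() gives booleans; their mean per panel is the share of missing values
missing_pct = (
    toy.drop(columns="fac_id")
    .isnull()
    .groupby(toy["fac_id"])
    .mean()
    * 100
)
print(missing_pct)
```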

In [None]:
# visualize the missing value on panel level
value = "Panel-level Missing Value"

try:
    if dataset_type.lower() == 'mp':
        # calculate the sum
        total_msng_val = df.isnull().sum().sum()

        if total_msng_val == 0:
            print(colored('No Missing Values!','green',attrs=['bold']))
        else:
            # read the columns from the current dataframe so the cell still works after columns are dropped
            cols_except_id = df.columns.tolist()
            cols_except_id.remove(panel_col)
            cols_except_id.remove(date_time_col)

            df_missing_heatmap = df.groupby(panel_col)[cols_except_id].apply(lambda x: x.isnull().sum()/len(x)*100)

            # generate and display the plot
            fig_height = 30
            col_count = df_missing_heatmap.shape[0]
            if col_count > 30:
                fig_height = 30*((col_count/30) - (col_count//30)) + 30
            else:
                pass

            plt.figure(figsize=(30,fig_height))
            sns.set_style("ticks", {"xtick.major.size": 8, "ytick.major.size": 8})
            sns.heatmap(df_missing_heatmap.round(3), cmap='RdYlGn', annot=True, alpha=0.9)
            clear_output()

            # display the heatmap
            display(Markdown('__Missing Value Heatmap on Panel-level__ '))
            plt.show()
    else:
        display(Markdown('This operation is __not applicable__ for __single-panel__ data!'))
        
    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

## Missing Time-Stamp(s)

Identifying missing time-stamps is very important in any time-series analysis.

Missing time-stamps distort the spacing between observations and can therefore lead a forecasting model to wrong conclusions. To get the best results from a fitted time-series model, these missing time-stamps need to be treated.
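
One common way to locate missing time-stamps in a single series is to compare it against the full calendar it should cover. The sketch below assumes a daily frequency and an invented column named `date`. It illustrates the idea and is not the exact logic of the cell below.

```python
import pandas as pd

# daily series in which 2013-01-03 and 2013-01-04 are absent
ts = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-05", "2013-01-06"]),
    "occupancy": [0.70, 0.72, 0.65, 0.68],
})

# the full calendar the series should cover at daily frequency
expected = pd.date_range(ts["date"].min(), ts["date"].max(), freq="D")

# time-stamps in the calendar that never appear in the data
missing = expected.difference(pd.DatetimeIndex(ts["date"]))
print(missing.strftime("%Y-%m-%d").tolist())

# reindexing pads the gaps with NaN rows so that they can be treated later
padded = ts.set_index("date").reindex(expected)
```

For panel data the same comparison can be repeated per panel, using the union of all observed time-stamps as the expected calendar.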

In [None]:
value='missing time stamp'

try:
    if dataset_type.lower() == 'mp':
        
        # Creating a padded time-series with all dates
        unique_timestamps = sorted(set(df[date_time_col].astype(str)))

        panel_ts_dict = {}
        mssng_panels = []
        mssng_panels_cnt = []

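        # iterate over a copy, because imbalanced panels are removed from `panel_ids` inside the loop
        for each_id in list(panel_ids):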
            panel_data =  df.groupby(panel_col).get_group(each_id)

            if panel_data[date_time_col].shape[0] == len(unique_timestamps):
                panel_ts_dict[each_id] = {'timestamps': panel_data[date_time_col].tolist(), 
                                          'missing_count': 0}
            else:
                missing_timestamp = list(set(unique_timestamps) - set(panel_data[date_time_col].astype(str)))
                panel_ts_dict[each_id] = {'timestamps': panel_data[date_time_col].tolist(), 
                                          'missing_count': len(missing_timestamp)}
                print('_'*75)
                display(Markdown('Panel ID: __{}__'.format(each_id)))
                print('Missing Timestamps count: {}'.format(len(missing_timestamp)))
                mssng_panels.append(each_id)
                mssng_panels_cnt.append(len(missing_timestamp))
                panel_ids.remove(each_id)

        mssng_panels_df = pd.DataFrame({'panel':mssng_panels, 'count': mssng_panels_cnt})
        
        if mssng_panels_df.shape[0] == 0:
            display(Markdown('__NO IMBALANCED PANELS FOUND!__'))
        else:
            plot_msng_ts = figure(plot_width=900, plot_height=600, title='Count of missing time-stamps in the panels',
                                   y_axis_label="count", x_axis_label='panel', 
                                   x_range=mssng_panels_df['panel'].astype(str).to_list())
            plot_msng_ts.vbar(x=mssng_panels_df['panel'].astype(str).to_list(), 
                               top=mssng_panels_df['count'].astype(int).to_list(), width=0.5)
            plot_msng_ts.xaxis.major_label_orientation = pi/4
            plot_msng_ts.toolbar.logo = None    
            show(plot_msng_ts)

            display(Markdown('<span style="color:red; font-size: 15px"><b>NOTE</b>: The above panels are <b>IMBALANCED</b> and hence will be <b>IGNORED FROM REST OF THE ANALYSIS</b>!</span>'))

    else:
        df_test = df.sort_values(by=date_time_col)
        d = df_test[date_time_col].diff()

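        # the value is passed to `pd.date_range(freq=...)`, so it must be a pandas frequency alias
        periodicity = input('Enter the periodicity as a pandas frequency alias (e.g. D for daily, W for weekly): ')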
        missing_timestamp = []
        missing_timestamp_start = []
        missing_timestamp_end = []
        # the most frequent gap between consecutive rows is taken as the regular step
        expected_gap = d.mode()[0]
        all_dates = df_test[date_time_col].to_list()
        for i,each_gap in enumerate(d.to_list()):
            # d[i] is the gap between rows i-1 and i; the first entry is NaT and is never flagged
            if each_gap > expected_gap:
                prev_date = all_dates[i-1]
                next_date = all_dates[i]
                missing_timestamp_start.append(prev_date)
                missing_timestamp_end.append(next_date)
                missing_timestamp.append(pd.date_range(prev_date, next_date,freq=periodicity).astype(str).to_list())

        display(Markdown('__Missing Timestamps Count__: {}'.format(len(missing_timestamp))))
        missing_timestamp_df = pd.DataFrame({'missing_timestamp_start':missing_timestamp_start,
                                             'missing_timestamp_end':missing_timestamp_end})
        if len(missing_timestamp) == 0:
            pass
        else:
            display(missing_timestamp_df.T)
        
    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

## Missing value treatment

### Drop column(s) with missing values

The cell below accepts user input and drops the specified columns.

In [None]:
# eliminate the columns
value="Deleting Cols missing"

try:
    if __name__ == '__main__':
        # columns with missing values
        total_msng_val = df.isnull().sum()
    
        if total_msng_val.sum() == 0:
            print(colored('No Columns with Missing Values to drop!','green',attrs=['bold']))
        
        else:
            display(Markdown('__PLEASE READ THE INSTRUCTION TO DELETE COLUMN DUE TO MISSING VALUE BEFORE PROCEEDING!__'))
            display(Markdown('Enter the column name(s) from the list shown _(comma `,` separated for multiple)_ and press `Enter`.'))
            display(Markdown('__NOTE__ : If you want to __skip__ this step, simply press `Enter` once the user input box appears.'))
            print('_'*75)
        
            missing_cols = missing_data[missing_data['Percentage']>0]['columns'].to_list()
            display(Markdown("Columns with Missing Values: {}".format(missing_cols)))
        
            col_to_drop = columns_to_delete(df_cols)
            if col_to_drop == []:
                clear_output()
                display(Markdown('__No columns were dropped due to missing value!__'))
            else:
                clear_output()
                df.drop(col_to_drop, axis=1, inplace=True)

                # updated list of columns in dataframe
                cols = df.columns

                # updated list of numerical columns
                num_cols = df._get_numeric_data().columns.tolist()

                # updated list of categorical columns
                cat_cols = list(set(cols) - set(num_cols))

                display(Markdown('__Column(s) dropped__ : {}'.format(col_to_drop)))

                # displaying the updated dataframe
                print('_'*75)
                display(Markdown('__Updated data type summary__ :'))
                # df.info() prints its summary itself and returns None, so it is not wrapped in display()
                df.info()
            
    else:
        display(Markdown('__No columns were dropped due to missing value!__'))
    
    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

### Drop row(s) with missing values


If missing values remain, it is __recommended to delete__ the affected rows. The cell below drops every row that contains a missing value in any column.
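
Because dropping rows can discard a lot of data, it may help to compare a few dropping rules before committing to one. The sketch below uses an invented frame. The `subset` and `how` arguments are standard `DataFrame.dropna` options, but which rule is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "occupancy": [0.7, np.nan, 0.8, 0.6],
    "rate": [100.0, 110.0, np.nan, 95.0],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# drop a row if ANY column is missing (the rule applied by the cell below)
any_missing = toy.dropna(axis=0)

# only look at the columns that matter for the analysis
key_cols_only = toy.dropna(subset=["occupancy", "rate"])

# drop a row only if ALL of its columns are missing
all_missing = toy.dropna(how="all")

print(len(toy), len(any_missing), len(key_cols_only), len(all_missing))
```

In this toy frame a sparsely filled `notes` column causes three of the four rows to be dropped under the default rule, which is a reason to drop such columns first.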

In [None]:
# drop rows having a missing value in any column
value="Treat missing data"

try:
    # calculate the sum
    total_msng_val = df.isnull().sum()
    
    if sum(total_msng_val.tolist()) == 0:
        print(colored('NO MISSING VALUES to remove!','green',attrs=['bold']))
    else:
        display(Markdown('<div><div class="loader"></div><h2> &nbsp; Processing</h2></div>'))

        print('Number of rows of the original dataframe: {}'.format(df.shape[0]))

        # dropping the rows
        df.dropna(axis=0, inplace=True)
        clear_output()

        print(colored('MISSING VALUES removed!','green',attrs=['bold']))
        print('\nAfter removing missing values, the number of rows in the dataframe is: {}'.format(df.shape[0]))
    
    display(Markdown('The __data summary__ is shown below:'))
    # df.info() prints its summary itself and returns None, so it is not wrapped in display()
    df.info()
    
    track_cell(value, flag)        
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

> __Notes__:
 
```
*Add notes here*
```

***

# Final Dataset Summary

All further operations will be performed on the following dataset.

In [None]:
# displays the final data along with the datatype summary
value = 'final data summary'

try:
    display(Markdown('Final dataframe contains __{} rows__ and __{} columns__.'.format(df.shape[0], df.shape[1])))
    
    if dataset_type.lower() == 'mp':
        display(Markdown('The unique panel identifier is __{}__.'.format(panel_col)))
        display(Markdown("The selected datatype is: __Multi-Panel__"))
        display(Markdown("Total number of unique panels: __{}__".format(len(panel_ids))))
        if mssng_panels is not None:
            display(Markdown('The imbalanced panel(s): _{}_'.format(mssng_panels)))
        else:
            display(Markdown('No panels to be ignored!'))
    else:
        display(Markdown("The selected datatype is: __Single-Panel__"))
        display(Markdown('__Missing Timestamps Count__: {}'.format(len(missing_timestamp))))
        # show the missing timestamps only when there are any
        if len(missing_timestamp) > 0:
            display(missing_timestamp_df.T)

    display(Markdown('The target variable is: __{}__'.format(target_var)))
    
    if df.shape[0] > 2500:
        display_data(df.head(2500).round(4))
    else:
        display_data(df.round(4))
    
    display(Markdown('The __data summary__ is shown below:'))
    print('_'*75)
    df.info()
    
    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

##### List of Numerical & Categorical columns

The column entered as the __unique panel identifier__ and the date-time column are needed for the rest of the analysis to flow properly, so they are left out of the lists of numerical and categorical column names.

__NOTE__: The column is not deleted, just ignored.
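A minimal sketch of this split, assuming a made-up frame with a numeric panel identifier: the identifier and date columns are excluded from both lists while the frame itself keeps all of its columns. The names `toy`, `store_id`, `ignore`, `numeric` and `categorical` are hypothetical; `select_dtypes` is the public pandas method used here in place of the private helper called in the cell below.

```python
import pandas as pd

# Hypothetical toy frame, not the notebook's dataset.
toy = pd.DataFrame({
    "store_id": [101, 101, 102],  # numeric panel identifier
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-01"]),
    "sales": [10.0, 12.5, 7.0],
    "region": ["north", "north", "south"],
})
ignore = ["store_id", "date"]

# Numeric columns, minus the identifier and date columns.
numeric = [c for c in toy.select_dtypes(include="number").columns if c not in ignore]

# Everything else that is not ignored is treated as categorical.
categorical = [c for c in toy.columns if c not in numeric and c not in ignore]

print(numeric)             # ['sales']
print(categorical)         # ['region']
print(list(toy.columns))   # all four columns are still present
```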

In [None]:
# the list of columns in dataframe

df_cols = df.columns

# obtain the list of numerical columns
num_cols = df._get_numeric_data().columns.tolist()

if dataset_type.lower() == 'mp':
    cols_to_ignore = [panel_col,date_time_col]
else:
    cols_to_ignore = [date_time_col]

for col_name in cols_to_ignore:
    # the column stays in `df`; it is only left out of the numerical list
    if col_name in num_cols:
        num_cols.remove(col_name)

num_cols_vis = ' || '.join(num_cols)
print(colored("Numerical Columns:\nCount:",'magenta',attrs=['bold']),"{}\n{}".format(len(num_cols), 
                                                                                     num_cols_vis))

# obtain the list of categorical columns
cat_cols = list(set(df_cols) - set(num_cols))

for col_name in cols_to_ignore:
    if col_name in cat_cols:
        cat_cols.remove(col_name)
    else:
        pass

cat_cols_vis = ' || '.join(cat_cols)
print(colored("\nCategorical Columns:\nCount:",'blue',attrs=['bold']),"{}\n{}".format(len(cat_cols), 
                                                                                      cat_cols_vis))

##### Selection of panels for analysis

The following function lets the user run the EDA either on every panel or on a selected subset of panels.
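The validation at the heart of such a selection prompt can be sketched without any interactive input. The helper `parse_selection` and the identifiers in `panels` are hypothetical and only illustrate the idea: split the comma-separated entry, trim whitespace, discard empty pieces, and accept the entry only if every name is known.

```python
def parse_selection(raw, valid):
    """Return the chosen names if every one is valid, otherwise None."""
    names = [part.strip() for part in raw.split(",")]
    names = [name for name in names if name]  # drop empty pieces such as a trailing comma
    if names and all(name in valid for name in names):
        return names
    return None


panels = ["P1", "P2", "P3"]  # hypothetical panel identifiers

print(parse_selection("P1, P3", panels))  # ['P1', 'P3']
print(parse_selection("P1,P9", panels))   # None
```

Keeping the parsing separate from `input()` makes the rule easy to check on its own; the interactive loop then only has to re-prompt while the helper returns `None`.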

In [None]:
# creating the function for the user to select panels of their choices

if __name__ == '__main__':
    def panel_selection(panel_ids):
        '''
        Function to allow the users to select panels for panel data
        '''

        while True:
            selected_panels = None
            panel_choice = input('Do you want to generate results for all {} panels? (Y/N): '.format(len(panel_ids)))
            if panel_choice.lower() == 'y':
                selected_panels = panel_ids
                break
            elif panel_choice.lower() == 'n':
                while True:
                    display(Markdown('List of panels:\n{}'.format(panel_ids)))
                    panel_in = input('Enter the panel names (comma separated for multiple): ')

                    if ',' not in panel_in:
                        if ' ' in panel_in:
                            print(colored("\nPlease read the instruction and enter properly!","red",attrs=['bold']))
                            continue
                        else:
                            if panel_in in panel_ids:
                                selected_panels = [panel_in]
                                break
                            else:
                                print(colored("\nINCORRECT PANEL NAME. Please try again!","red",attrs=['bold']))
                                continue
                    # if it is multiple column entry and properly entered, then it is stored in list
                    else:
                        # split the string with ',' and convert it into list
                        # check if any element is empty, then eliminate it
                        selected_panels = [i.strip() for i in panel_in.split(',') if i.strip()]

                        # if all the column names are correctly entered, then proceed
                        if all(item in panel_ids for item in selected_panels):
                            break
                        else:
                            print(colored("\nINCORRECT PANEL NAME(S). Please enter all the names again!","red",attrs=['bold']))
                            continue
            else:
                print(colored('Incorrect Entry! Please try again.','red',attrs=['bold']))
                continue

            if selected_panels != None:
                break
            else:
                continue

        return selected_panels

    display(Markdown('<span style="color:darkgreen; font-style: italic; font-size: 15px">Code cell to activate function for <b>panel selection</b> is EXECUTED!</span>'))
    
else:
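    # default selection when the notebook is imported; kept as a list so that loops iterate over panels
    selected_panels = ['H136400C09']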

##### Selection of columns for analysis

The following function lets the user run the EDA either on every column or on a selected subset of columns.

In [None]:
if __name__ == '__main__':
    def col_selection(cols):
        while True:
            selected_cols = None
            if cols == num_cols:
                col_choice = input('Do you want to generate results for all numerical {} columns? (Y/N): '.format(len(cols)))
            else:
                col_choice = input('Do you want to generate results for all categorical {} columns? (Y/N): '.format(len(cols)))
            if col_choice.lower() == 'y':
                selected_cols = cols
                break
            elif col_choice.lower() == 'n':
                while True:
                    display(Markdown('List of cols:\n{}'.format(cols)))
                    col_in = input('Enter the column names (comma separated for multiple): ')
                    
                    if ',' not in col_in:
                        if ' ' in col_in:
                            print(colored("\nPlease read the instruction and enter properly!","red",attrs=['bold']))
                            continue
                        else:
                            if col_in in cols:
                                selected_cols = [col_in]
                                break
                            else:
                                print(colored("\nINCORRECT COLUMN NAME. Please try again!","red",attrs=['bold']))
                                continue
                    # if it is multiple column entry and properly entered, then it is stored in list
                    else:
                        # split the string with ',' and convert it into list
                        # check if any element is empty, then eliminate it
                        selected_cols = [i.strip() for i in col_in.split(',') if i.strip()]

                        # if all the column names are correctly entered, then proceed
                        if all(item in cols for item in selected_cols):
                            break
                        else:
                            print(colored("\nINCORRECT COLUMN NAME(S). Please enter all the names again!","red",attrs=['bold']))
                            continue
            else:
                print(colored('Incorrect Entry! Please try again.','red',attrs=['bold']))
                continue

            if selected_cols != None:
                break
            else:
                continue

        return selected_cols
    
    display(Markdown('<span style="color:darkgreen; font-style: italic; font-size: 15px">Code cell to activate function for <b>column selection</b> is EXECUTED!</span>'))
    
else:
    selected_cols = num_cols

## Time-series plots of all variables on Panel-level

In [None]:
# generating the time series plot for every numerical columns across panels
value = 'time series plot'

try:
    if __name__ == '__main__':

        if dataset_type.lower() == 'mp':
            
            selected_cols = col_selection(num_cols)
            
            if len(selected_cols) <= 4:
                row_cnt = 1
                col_cnt = len(selected_cols)
            else:
                col_cnt = 4
                row_cnt = len(selected_cols)//4+1
                if len(selected_cols)%4 == 0:
                    row_cnt = len(selected_cols)//4

            fig = make_subplots(rows=row_cnt, cols=col_cnt, subplot_titles=selected_cols)

            # calculating the number of rows and columns
            r, c = list(range(1,(len(selected_cols)//4+2))), list(range(1,5))
            rc_pair = list(product(r, c))

            selected_panels = panel_selection(panel_ids)

            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))

            # performing the operation on panel level
            for p_id,each_panel in enumerate(selected_panels):
                panel_data = df.groupby(panel_col).get_group(each_panel)
                panel_data = panel_data.sort_values(date_time_col)
                vis = True if each_panel == selected_panels[0] else False

                # generating the plots for every numerical columns
                for c_id,each_col in enumerate(selected_cols):
                    fig_r = rc_pair[c_id][0]
                    fig_c = rc_pair[c_id][1]

                    if each_col == target_var:
                        lin_col = 'firebrick'
                    else:
                        lin_col = 'royalblue'
                    fig.add_trace(go.Scatter(x=panel_data[date_time_col], y=panel_data[each_col],
                                             line=dict(color=lin_col, width=1), visible=vis), 
                                  row=fig_r, col=fig_c)
            
            panel_dict_list = []
            # creating the drop-down option
            for each_panel in selected_panels:
                vis_check = [[True]*len(selected_cols) if i==each_panel else [False]*len(selected_cols) for i in selected_panels]
                vis_check_flat = [i for sublist in vis_check for i in sublist]
                panel_dict_list.append(dict(args = [{"visible": vis_check_flat},
                                                    {"title": "Observation of each columns in panel: {}".format(each_panel)}],
                                            label=each_panel, method="update"))

            # Add dropdown by opening the option on horizontal direction
            fig.update_layout(updatemenus=[dict(buttons=list(panel_dict_list),
                                                direction="right",
                                                x=0, xanchor="left", y=1.15, yanchor="top")],
                             showlegend=False, title_x=0.5)

        else:
            selected_cols = col_selection(num_cols)
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))

            row_cnt = len(selected_cols)//2+1
            if len(selected_cols)%2 == 0:
                row_cnt = len(selected_cols)//2

            fig = make_subplots(rows=row_cnt, cols=2, subplot_titles=selected_cols)

            # calculating the number of rows and columns
            r, c = list(range(1,(len(selected_cols)//2+2))), list(range(1,3))
            rc_pair = list(product(r, c))

            for c_id,each_col in enumerate(selected_cols):
                fig_r = rc_pair[c_id][0]
                fig_c = rc_pair[c_id][1]

                if each_col == target_var:
                    lin_col = 'firebrick'
                else:
                    lin_col = 'royalblue'
                fig.add_trace(go.Scatter(x=df[date_time_col], y=df[each_col],
                                         line=dict(color=lin_col, width=1)),
                              row=fig_r, col=fig_c)

#         fig.update_xaxes(tickangle=90)
        if len(selected_cols) <= 4:
            fig_ht = 550
        elif len(selected_cols) > 4 and len(selected_cols) <= 20:
            fig_ht = 1000
        else:
            fig_ht = 1500
        fig.update_layout(width=900, height=fig_ht, showlegend=False, title_x=0.5)

        clear_output()

        fig.show(config={'displaylogo': False})
        
    else:
        pass
    
    track_cell(value, flag)
    
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

## Rolling Statistics on Panel-level

A moving (rolling) mean or standard deviation at a given time point is the arithmetic mean or standard deviation of the values inside a fixed-size window ending at that point.

A window of size `k` covers `k` consecutive values at a time, so the smoothness of the resulting curve depends on the window size. For example, a small window is enough to reveal the smoothed pattern of a short time series, whereas a longer series usually needs a larger window. In the cell below, the window is set to 30% of the number of observations.
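To make the effect of the window size concrete, here is a small sketch on a synthetic series (a noisy sine wave, which is an assumption and not the notebook's data). It computes the rolling mean and a one-standard-deviation band for two window sizes; 36 corresponds to 30% of the 120 points, mirroring the rule used in the cell below.

```python
import numpy as np
import pandas as pd

# Synthetic series for illustration only: three sine periods plus noise.
rng = np.random.default_rng(0)
series = pd.Series(np.sin(np.linspace(0, 6 * np.pi, 120)) + rng.normal(0, 0.3, 120))

spread = {}
for k in (5, 36):
    roll_mean = series.rolling(window=k).mean()
    roll_std = series.rolling(window=k).std()

    # Band of one rolling standard deviation around the rolling mean.
    upper = roll_mean + roll_std
    lower = roll_mean - roll_std

    # The first k-1 positions have no full window, so they are NaN.
    n_missing = int(roll_mean.isna().sum())

    # How much the smoothed curve still varies: smaller means smoother.
    spread[k] = float(roll_mean.std())
    print(k, n_missing, round(spread[k], 3))
```

The larger window yields a flatter rolling mean, at the cost of more leading positions without a value.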

In [None]:
# time series for selected target column across panels
value='rolling stats'

if __name__ == '__main__':
    try:
        col_color_pair = {target_var:'blue', 'roll_mean':'orange',
                         'roll_std_upper':'green', 'roll_std_lower':'red'}
        fig = go.Figure()

        if dataset_type.lower() == 'mp':
            selected_panels = panel_selection(panel_ids)

            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))
            
            # performing the operation on panel level
            for p_id,each_panel in enumerate(selected_panels):
                panel_data = df.groupby(panel_col).get_group(each_panel)
                panel_data = panel_data.sort_values(date_time_col)
                
                window_size = int(0.3*panel_data.shape[0])
                
                # calculating the rolling mean
                col_rolling_mean = panel_data[target_var].rolling(window=window_size).mean().to_frame()
                col_rolling_mean.columns = ['roll_mean']

                # calculating the rolling std
                col_rolling_std = panel_data[target_var].rolling(window=window_size).std().to_frame()

                # concatenating the original dataframe with the dataframe storing the rolling mean result
                df_rolling_ts = pd.concat([panel_data,col_rolling_mean],axis=1)
                # calculating the upper and lower bound 
                df_rolling_ts['roll_std_upper'] = col_rolling_mean.iloc[:,0] + col_rolling_std.iloc[:,0]
                df_rolling_ts['roll_std_lower'] = col_rolling_mean.iloc[:,0] - col_rolling_std.iloc[:,0]
                
                visible = True if each_panel == selected_panels[0] else False
                for k,v in col_color_pair.items():
                    fig.add_trace(go.Scatter(x=df_rolling_ts[date_time_col], y=df_rolling_ts[k], name=k,
                                         line=dict(color=v, width=1), visible = visible))
            
            panel_dict_list = []
            # creating the drop-down option
            for each_panel in selected_panels:
                vis_check = [[True]*4 if i==each_panel else [False]*4 for i in selected_panels]
                vis_check_flat = [i for sublist in vis_check for i in sublist]
                panel_dict_list.append(dict(args = [{"visible": vis_check_flat},
                                                    {"title": "Rolling statistics in panel: {}".format(each_panel)}],
                                            label=each_panel, method="update"))

            # Add dropdown by opening the option on horizontal direction
            fig.update_layout(updatemenus=[dict(buttons=list(panel_dict_list),
                                                direction="right",
                                                x=0, xanchor="left", y=1.11, yanchor="top")],
                             showlegend=False, width=900, title_x=0.5)

        else:
            window_size = int(0.3*df.shape[0])
            
            # calculating the rolling mean
            col_rolling_mean = df[target_var].rolling(window=window_size).mean().to_frame()
            col_rolling_mean.columns = ['roll_mean']

            # calculating the rolling std
            col_rolling_std = df[target_var].rolling(window=window_size).std().to_frame()

            df_rolling_ts = pd.concat([df[[target_var, date_time_col]],col_rolling_mean],axis=1)
            df_rolling_ts['roll_std_upper'] = col_rolling_mean.iloc[:,0] + col_rolling_std.iloc[:,0]
            df_rolling_ts['roll_std_lower'] = col_rolling_mean.iloc[:,0] - col_rolling_std.iloc[:,0]
            
            for k,v in col_color_pair.items():
                fig.add_trace(go.Scatter(x=df_rolling_ts[date_time_col], y=df_rolling_ts[k], name=k,
                                     line=dict(color=v, width=1)))
            
            fig.update_layout(width=900, showlegend=False, title_x=0.5)
        
        fig.update_layout(hovermode='x unified')
        fig.update_yaxes(title_text= target_var)
        fig.update_xaxes(tickangle=90)
        clear_output()

        display(Markdown('__Rolling statistics plots for target variable `{}` with `window size {}` generated!__'.format(target_var, window_size)))
        fig.show(config={'displaylogo': False})
        track_cell(value, flag)

    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
else:
    pass

> __Notes__:
 
```
*Add notes here*
```

***

# Variable Analysis

## Numerical features

### Histogram plots for numerical columns

A __Histogram__ groups values into bins of equal width. Its shape may contain clues about the underlying distribution type: Gaussian, exponential, etc. It also helps to spot skewness or anomalies when the distribution is otherwise nearly regular.

There is also another, often clearer, way to grasp the distribution: __Density Plots__ or, more formally, __Kernel Density Plots__. They can be considered a smoothed version of the histogram. Their main advantage over the histogram is that they do not depend on the size of the bins, although they do depend on the chosen kernel bandwidth.
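The difference between the two views can be sketched with NumPy alone. The sample `x`, the bin counts and the `bandwidth` value are assumptions chosen for illustration; the kernel density estimate is written out by hand as an average of Gaussian bumps rather than taken from a plotting library.

```python
import numpy as np

# Synthetic sample for illustration only.
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=500)

# Density-normalised histograms: bar heights change with the bin count,
# but the total area under the bars stays 1.
hist_area = {}
for bins in (5, 50):
    heights, edges = np.histogram(x, bins=bins, density=True)
    hist_area[bins] = float(np.sum(heights * np.diff(edges)))

# Gaussian kernel density estimate on a grid (bandwidth is a hypothetical choice).
bandwidth = 0.3
grid = np.linspace(x.min() - 1.5, x.max() + 1.5, 400)
z = (grid[:, None] - x[None, :]) / bandwidth
kde = np.exp(-0.5 * z**2).sum(axis=1) / (len(x) * bandwidth * np.sqrt(2 * np.pi))

# Riemann-sum approximation of the area under the KDE curve.
kde_area = float(kde.sum() * (grid[1] - grid[0]))
print(hist_area, round(kde_area, 3))
```

Both estimates integrate to roughly one; the histogram's shape is governed by the number of bins, while the smooth curve is governed by the bandwidth.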

In [None]:
# histogram plots on the numerical column(s)

value="Plotting Numeric Cols"

if __name__ == '__main__':
    try:
        if num_cols == []:
            display(Markdown('__NO NUMERICAL COLUMNS AVAILABLE!__'))

        # for multiple numerical columns
        else:
            # initializing the tabs
            fig = go.Figure()
            selected_cols = col_selection(num_cols)
            
            for each_col in selected_cols:
                vis = True if each_col == selected_cols[0] else False
                fig.add_trace(go.Histogram(x=df[each_col], opacity=0.5, histnorm='probability density',
                                          visible=vis))

            tab_dict_list = []
            for each_col in selected_cols:
                vis_check = [[True] if i==each_col else [False] for i in selected_cols]
                vis_check_flat = [i for sublist in vis_check for i in sublist]
                tab_dict_list.append(dict(args = [{"visible": vis_check_flat},
                                                  {"title": "Histogram plots for: {}".format(each_col)}],
                                          label=each_col, method="update"))
                fig.update_layout(updatemenus=[dict(buttons=list(tab_dict_list),
                                                    direction="right", x=0, xanchor="left", y=1.11, yanchor="top")],
                                  showlegend=False, title_x=0.5)

            clear_output()
            fig.update_yaxes(title_text= 'Probability density')
            display(Markdown('__Histogram Plots for the numerical columns are generated!__'))
            fig.show(config={'displaylogo': False})

        track_cell(value, flag)
    except Exception as err:
        clear_output()
        # display the error
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
else:
    pass

### Descriptive statistics on Column-level

In [None]:
# descriptive and numerical statistics of numerical column(s)

if num_cols == []:
    display(Markdown('__NO NUMERICAL COLUMNS AVAILABLE!__'))
else:
    num_desc = df[num_cols].describe().T
    num_desc['count'] = num_desc['count'].astype(int)
    num_desc['median'] = df[num_cols].median()
    num_desc.drop(['25%','50%','75%'], axis=1 ,inplace = True)
    num_desc.insert(loc=0, column='columns', value=num_desc.index)
    
    display_data(num_desc.round(4))

### Descriptive statistics on Panel-level

In [None]:
# dot and whisker plot of descriptive stats on panel level

value = 'num desc stats'

try:
    if __name__ == '__main__':
        if dataset_type.lower() == 'mp':

            def dot_whisk(selected_panels, stat):
                dw_plot_slots = [Panel() for i in range(len(selected_cols))]

                hover = HoverTool()
                save = SaveTool()

                # performing the operation on every numerical columns
                for idx,each_col in enumerate(selected_cols):
                    panel_data_q1 = []
                    panel_data_q3 = []
                    panel_data_mean = []
                    if stat == 'Standard Deviation':
                        panel_data_std = []

                    # calculating the stats on every panels
                    for each_id in selected_panels:
                        panel_data = df.groupby(panel_col)[each_col].get_group(each_id)
                        panel_data_mean.append(panel_data.mean())
                        panel_data_q1.append(panel_data.quantile(.25))
                        panel_data_q3.append(panel_data.quantile(.75))
                        if stat == 'Standard Deviation':
                            panel_data_std.append(panel_data.std())

                    if len(selected_panels) < 10:
                        dw_fig_height = 100*(len(selected_panels)+1)
                    else:
                        dw_fig_height = 10*(len(selected_panels)+1)

                    if stat == 'Mean':
                        # storing the result in dataframe
                        dw_df = pd.DataFrame({'panels':selected_panels, 'mean':panel_data_mean,
                                              'Q1':panel_data_q1, 'Q3':panel_data_q3}).sort_values('mean').reset_index()

                        # creating the plot
                        dw_plot = figure(y_range=dw_df['panels'], plot_width=900, plot_height=dw_fig_height, 
                                         tools=[hover, save])
                        # plotting the mean values
                        dw_plot.circle(dw_df['mean'], dw_df['panels'], 
                                       size=5, color="navy", alpha=0.5, legend_label='Mean')
                        # plotting the 1st and 3rd quantiles
                        for each_row in range(dw_df.shape[0]):
                            dw_plot.line([dw_df.loc[each_row,'Q1'], dw_df.loc[each_row,'Q3']], 
                                         [dw_df.loc[each_row,'panels'], dw_df.loc[each_row,'panels']],
                                         line_width=1, color="navy", alpha=0.5, legend_label='Whisker (Q1 & Q3)')
                        # plotting the mean of the mean values of every panel
                        dw_plot.line([mean(panel_data_mean), mean(panel_data_mean)], [0, len(selected_panels)],
                                     line_width=2, color="maroon", line_dash='dashed', legend_label='Column Mean wrt MEAN')
                    else:
                        # storing the result in a dataframe
                        dw_df = pd.DataFrame({'panels':selected_panels, 'std':panel_data_std}).sort_values('std').reset_index()

                        # generating the plot
                        dw_plot = figure(y_range=dw_df['panels'], plot_width=900, plot_height=dw_fig_height, 
                                         tools=[hover, save])
                        # plotting the std
                        dw_plot.circle(dw_df['std'], dw_df['panels'], 
                                       size=5, color="navy", alpha=0.5, legend_label='Standard Deviation')
                        # plotting the mean of std of every panel
                        dw_plot.line([mean(panel_data_std), mean(panel_data_std)], [0, len(selected_panels)],
                                     line_width=2, color="maroon", line_dash='dashed', legend_label='Column Mean wrt STD')

                    dw_plot.legend.location = "bottom_right"
                    dw_plot.toolbar.logo = None

                    dw_plot_slots[idx] = Panel(child=dw_plot, title=each_col)

                # creating the tabs having plots
                dw_plot_tabs = Tabs(tabs = dw_plot_slots)
                dw_plot_layout = column(dw_plot_tabs)

                plot_title = Div(text='''<span style="font-size: 15px"><b>Dot(Mean) & Whisker(IQR) Plot</b> for every panel is generated!</span>''',
                                width=900)
                if stat == 'Standard Deviation':
                    plot_title = Div(text='''<span style="font-size: 15px"><b>Dot(Standard Deviation) Plot</b> for every panel is generated!</span>''',
                                    width=900)
                title_widg = widgetbox(plot_title)

                return dw_plot_layout, title_widg


            selected_panels = panel_selection(panel_ids)
            selected_cols = col_selection(num_cols)
    #         selected_panels = ['H453700C08', 'H346600C03', 'H136400C09', 'H635900C03']

            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the result</h2></div>'))
        
            mean_df = pd.DataFrame(columns=selected_cols)
            std_df = pd.DataFrame(columns=selected_cols)

            num_desc_dfs = {'Mean':mean_df, 'Standard Deviation':std_df}

            # performing the operation on panel level
            for each_id in selected_panels:
                panel_data = df.groupby(panel_col).get_group(each_id)

                # restrict to the selected columns so the row matches the columns of mean_df/std_df
                desc_df = panel_data[selected_cols].describe().round(3)
                mean_df.loc[mean_df.shape[0]] = desc_df.loc['mean',:].to_list()
                std_df.loc[std_df.shape[0]] = desc_df.loc['std',:].to_list()

            summary_plot_slots = [Panel() for i in range(len(num_desc_dfs))]

            data_table = []
            # generate the result on `mean` and `std` on every panel
            for idx,(each_id,each_df) in enumerate(num_desc_dfs.items()):
                each_df.insert(loc=0, column='panels', value=selected_panels)

                table_height = 150
                if each_df.shape[0] > 5:
                    table_height = int(150*((each_df.shape[0]/2) - (each_df.shape[0]//2)) + 150)
                else:
                    pass

                # generate the table output 
                data_table.append(DataTable(columns=[TableColumn(field=Ci, title=Ci) for Ci in each_df.columns],
                                       source=ColumnDataSource(each_df), height=table_height, width=900))

                dw_plot, title_widg = dot_whisk(selected_panels, each_id)
                data_table_plot = gridplot([[data_table[idx]], [title_widg], [dw_plot]], plot_width=900)

                clear_output()
                summary_plot_slots[idx] = Panel(child=data_table_plot, title=each_id)


            clear_output()

            # creating the tabs having plots
            summary_plot_tabs = Tabs(tabs = summary_plot_slots)
            summary_plot_layout = column(summary_plot_tabs)

            show(summary_plot_layout)
        else:
            clear_output()
            display(Markdown('This operation is __not applicable__ for __single-panel__ data!'))
            
    else:
        pass
        
    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)
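The per-panel mean, standard deviation and quartiles gathered in the loops above can also be obtained in a single `groupby` call. The sketch below uses a made-up frame; `toy`, `q1` and `q3` are hypothetical names, and this is one possible alternative rather than the notebook's method.

```python
import pandas as pd

# Hypothetical toy frame with two panels.
toy = pd.DataFrame({
    "panel": ["A", "A", "A", "B", "B", "B"],
    "sales": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})


def q1(s):
    """First quartile of a series."""
    return s.quantile(0.25)


def q3(s):
    """Third quartile of a series."""
    return s.quantile(0.75)


# One row per panel, one column per statistic.
summary = toy.groupby("panel")["sales"].agg(["mean", "std", q1, q3])
print(summary)
```

The resulting frame has the panels as its index, so sorting by `mean` or `std` before plotting is a one-line `sort_values` call.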

## Categorical features

### Frequency plots for categorical columns

Here we build a __frequency table__, which shows how often each value of a categorical variable occurs, and visualize it with a __bar plot__. Columns with more than 30 distinct categories are skipped.
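As a minimal sketch, a frequency table of this kind can be produced directly from `value_counts`. The frame `toy` and its `region` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame, not the notebook's dataset.
toy = pd.DataFrame({"region": ["north", "south", "north", "east", "north", "south"]})

# Count each category, then turn the result into a two-column frame.
freq = toy["region"].value_counts().rename_axis("region").reset_index(name="count")
print(freq)
```

`value_counts` sorts by descending count, so the most frequent category ends up in the first row, which is also the order in which the bars would be drawn.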

In [None]:
# frequency bar plot of categories in categorical column(s) 

def cat_feature_viz(data, categorical_column):
    '''
    Function to build the frequency table of a categorical variable.
    input:
        data               : data frame
        categorical_column : name of a single categorical column
    return:
        A data frame with one row per category: the category under
        `categorical_column` and its number of occurrences under `count`.
    '''
    
    # getting the value count against each category (key) and storing it as key:value pair
    cat_value_pair = dict(data[categorical_column].value_counts().items())
    
    # list of category names and their corresponding values
    cat_value_pair_keys = list(cat_value_pair.keys())
    cat_value_pair_values = list(cat_value_pair.values())
    
    # map the keys under the `categorical_column` column and value count under `count` column
    cat_dict = {categorical_column:cat_value_pair_keys, 'count':cat_value_pair_values}
    
    # create the dataframe
    cat_value_df = pd.DataFrame(cat_dict).set_index(categorical_column)
    cat_value_df.reset_index(inplace=True)
    
    return cat_value_df
    
display(Markdown('<span style="color:darkgreen; font-size: 15px"><i>CODE EXECUTED!</i> Continue executing the next cell for generating <b>Frequency Plot</b>.</span>'))

In [None]:
# generating the plots (if any)
value="Plotting Cat Cols"

if __name__ == '__main__':
    try:
        if cat_cols == []:
            display(Markdown('__NO CATEGORICAL COLUMNS AVAILABLE!__'))
            script, div = None, None

        # for multiple categorical columns
        else:
            selected_cols = col_selection(cat_cols)
            
            fig = go.Figure()

            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))
            
            ignored_cols = []
            plotted_cols = []
            for c in selected_cols:
                if df[c].nunique()>30:
                    ignored_cols.append(c)
                else:
                    # only the first plotted column is visible initially
                    vis = (len(plotted_cols) == 0)
                    plotted_cols.append(c)
                    cat_value_df = cat_feature_viz(df, c)
                    fig.add_trace(go.Bar(x=cat_value_df[c], y=cat_value_df['count'],width=0.5, visible=vis))
            
            tab_dict_list = []
            # one button per plotted column, so the visibility list matches the traces
            for each_col in plotted_cols:
                vis_check_flat = [i==each_col for i in plotted_cols]
                tab_dict_list.append(dict(args = [{"visible": vis_check_flat},
                                                  {"title": "Frequency plots for: {}".format(each_col)}],
                                          label=each_col, method="update"))
            fig.update_layout(updatemenus=[dict(buttons=list(tab_dict_list),
                                                direction="right", x=0, xanchor="left", y=1.11, yanchor="top")],
                              showlegend=False, title_x=0.5)
            
            fig.update_xaxes(tickangle=45)
            clear_output()
            display(Markdown('__Plots generated!__'))
            if ignored_cols:
                display(Markdown('Plots of the following columns are NOT generated for having __too many categories__!'))
                display(Markdown('__Columns__ : {}'.format(ignored_cols)))
            
            fig.update_yaxes(title='Frequency')
            fig.show(config={'displaylogo':False})

#         track_cell(value, flag)
    except Exception as err:
        clear_output()
        # display the error
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
            
else:
    pass

### Descriptive Statistics on Column-level

In [None]:
# descriptive statistics of categorical column(s)

if cat_cols == []:
    display(Markdown('__NO CATEGORICAL COLUMNS AVAILABLE!__'))
else:
    cat_desc = df.describe(include=[object]).T
    cat_desc.insert(loc=0, column='columns', value=cat_desc.index)
    display_data(cat_desc)

### Descriptive statistics on Panel-level

In [None]:
# calculating the descriptive statistics on categorical columns
value = 'cat desc stats'

try:
    if __name__ == '__main__':
        if cat_cols == []:
            display(Markdown('__NO CATEGORICAL COLUMNS AVAILABLE!__'))
        else:
            if dataset_type.lower() == 'mp':
                selected_panels = panel_selection(panel_ids)
#                 selected_panels = ['H453700C08', 'H346600C03', 'H136400C09', 'H635900C03']
                selected_cols = col_selection(cat_cols)

                display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the results</h2></div>'))
                
                unique_df = pd.DataFrame(columns=selected_cols)
                top_df = pd.DataFrame(columns=selected_cols)

                cat_desc_dfs = {'Unique':unique_df, 'Top':top_df}

                # performing the operation on panel level
                for each_id in selected_panels:
                    panel_data = df.groupby(panel_col)[selected_cols].get_group(each_id)

                    desc_df = panel_data.describe(include=[object])
                    unique_df.loc[unique_df.shape[0]] = desc_df.loc['unique',:].to_list()
                    top_df.loc[top_df.shape[0]] = desc_df.loc['top',:].to_list()

                fig = go.Figure()

                for name,data_table in cat_desc_dfs.items():
                    data_table.insert(loc=0, column='panels', value=selected_panels)
                    data_table_series = [data_table[i] for i in data_table.columns]

                    fig.add_traces(data=[go.Table(
                        header=dict(values=list(data_table.columns),
                                    fill_color='grey',
                                    align='center',
                                    font=dict(color='white', size=15)
                                   ),
                        cells=dict(values=data_table_series,
                                   fill_color='lightblue',
                                   align='center',
                                   font=dict(color='black', size=10)
                                  ))
                    ])

                    if data_table.shape[0] <= 5:
                        fig_ht = 50*data_table.shape[0]
                    elif data_table.shape[0] > 5 and data_table.shape[0] <= 22:
                        fig_ht = 20*data_table.shape[0]
                    else:
                        fig_ht = 500

                    fig.update_layout(width=150*len(data_table.columns), 
                                      height=fig_ht,
                                      margin=dict(l=0,r=0,b=0,t=0,pad=0))

                fig.update_layout(updatemenus=[dict(type="buttons", direction="right", active=0,
                                                    x=0.1, y=1.2, 
                                                    buttons=list([dict(label="Unique",
                                                                      method="update",
                                                                      args=[{"visible": [True, False]}]),
                                                                  dict(label="Top",
                                                                      method="update",
                                                                      args=[{"visible": [False, True]}])
                                                                 ])
                                                   )
                                              ])

                clear_output()
                display(Markdown('__Summary generated!__'))
                fig.show(config={'displaylogo': False})

            else:
                clear_output()
                display(Markdown('This operation is __not applicable__ for __single-panel__ data!'))
    
    else:
        pass
    
    track_cell(value, flag)
except Exception as err:
    # display the error
    clear_output()
    print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
    flag = 0
    err = str(err)
    track_cell(value, flag, err)

__Stacked Bar Chart for distribution of categories of categorical columns across panels__

In [None]:
def cat_dist(data, selected_cols):
    # generating the pivot table
    cnt_data = pd.DataFrame(columns=['cat','count','column'])
    for each_col in selected_cols:
        cat_prop = pd.DataFrame(data[each_col].value_counts()).reset_index()
        cat_prop['column'] = [each_col]*cat_prop.shape[0]
        cat_prop.columns = cnt_data.columns
        cnt_data = pd.concat([cnt_data, cat_prop], ignore_index=True)

    data_pivot = cnt_data.pivot(index='column',columns='cat',values='count').fillna(0)
    data_pivot.index = [x for x in data_pivot.index]

    # converting the pivot table to a dictionary supported to generate the plot
    data_cols = data_pivot.columns.to_list()
    data = {}
    for each_col in data_cols:
        data[each_col] = data_pivot[each_col].to_list()
    data['column'] = data_pivot.index.to_list()

    # generate the plot
    p_stacked_bar = figure(x_range=data_pivot.index.to_list(), 
                           plot_height=600, plot_width=900, toolbar_location=None, 
                           tools="hover", tooltips="$name @column: @$name")

    viridis_cmap = cm.get_cmap('viridis', 256)
    viridis_cat_rgba = viridis_cmap(np.linspace(0,1,len(data_cols)))

    color_range = []
    for each_rgb in viridis_cat_rgba:
        r_val, g_val, b_val = 255*each_rgb[0], 255*each_rgb[1], 255*each_rgb[2]
        hex_val = '#%02x%02x%02x' % (int(r_val), int(g_val), int(b_val))
        color_range.append(hex_val)

    p_stacked_bar_0 = p_stacked_bar.vbar_stack(data_cols, x='column', width=0.6, 
                                               source=data, 
                                               fill_color=color_range)

    vbar_comp = []
    for p_id,each_part in enumerate(p_stacked_bar_0):
        vbar_comp.append([p_stacked_bar_0[p_id]])

    legend_list = list(zip(data_cols,vbar_comp))

    if len(legend_list) <= 6:
        data_legend = Legend(items=legend_list, location=(100, 0), orientation="horizontal")
        p_stacked_bar.add_layout(data_legend, 'below')
    else:
        slicing_index = list(range(0, len(legend_list), 6))
        for i in range(len(slicing_index) - 1):
            idx_start = slicing_index[i]
            idx_end = slicing_index[i+1]
            data_legend = Legend(items=legend_list[idx_start:idx_end], location=(100, i*10), orientation="horizontal")
            p_stacked_bar.add_layout(data_legend, 'below')

        if idx_end != len(legend_list):
            i += 1
            data_legend = Legend(items=legend_list[idx_end:], location=(100, i*10), orientation="horizontal")
            p_stacked_bar.add_layout(data_legend, 'below')
        else:
            pass

    p_stacked_bar.toolbar.logo = None
    p_stacked_bar.xaxis.major_label_orientation = pi/4

    return p_stacked_bar

In [None]:
# distribution of categories on categorical variables across panels
value = 'cat dist plot'

if __name__ == '__main__':
    if cat_cols == []:
        display(Markdown('__NO CATEGORICAL COLUMNS AVAILABLE!__'))
    else:
        try:
            if dataset_type.lower() == 'mp':
                selected_panels = panel_selection(panel_ids)
                # H453700C08, H346600C03, H136400C09, H635900C03
                selected_cols = col_selection(cat_cols)

                display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))

                cat_freq_plot_slots = [Panel() for i in range(len(selected_panels))]

                # performing the operation on panel level
                for idx,each_id in enumerate(selected_panels):
                    panel_data = df.groupby(panel_col)[selected_cols].get_group(each_id)

                    p_stacked_bar = cat_dist(panel_data, selected_cols)

                    cat_freq_plot_slots[idx] = Panel(child=p_stacked_bar, title=each_id)

                cat_freq_plot_tabs = Tabs(tabs = cat_freq_plot_slots)
                cat_freq_plot_layout = column(cat_freq_plot_tabs)

                script, div = components(cat_freq_plot_layout)

            else:
                selected_cols = col_selection(cat_cols)
                p_stacked_bar = cat_dist(df, selected_cols)
                script, div = components(p_stacked_bar)

            track_cell(value, flag)

        except Exception as err:
            # display the error
            clear_output()
            print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
            flag = 0
            err = str(err)
            track_cell(value, flag, err)

        else:
            clear_output()
            if script is None and div is None:
                display(Markdown('This operation is __not applicable__ for __single-panel__ data!'))
            else:
                display(Markdown('__Plots generated!__'))
                display(HTML(html_plot.render(script=script, div=div)))
            
else:
    pass

> __Notes__:
 
```
*Add notes here*
```

***

# Univariate Analysis

## Decomposition

The decomposition of time series data is a statistical task that deconstructs a time series into several components to describe its characteristics. Usually, time-series data contains four types of patterns:

* __Trend (T)__ : The general change (long-term upward or downward pattern) in the level of the data over a duration longer than a year.
* __Seasonal (S)__ : The regular wavelike fluctuations of constant length, repeating themselves within each 12-month period year after year.
* __Cyclical (C)__ : The _quasi-regular_ (4 phases) wavelike fluctuations - from peak (prosperity) to contractions (recession) to trough (depression) to expansion (recovery) - around the long-term trend, lasting longer than a year.
* __Irregular (R)__ : The short-duration, non-repeating random variations that remain in the data after the other components are accounted for, caused by unforeseen events such as strikes or natural disasters.

However, when we decompose the time series into components, we combine the trend and cycle components into a single __trend-cycle__ component, also known simply as the __Trend__. Thus, a time series can be considered as comprising 3 major components: 

* __Trend-cycle__ or __Trend__ component
* __Seasonal__ component
* __Remainder__ component (containing noise)

For decomposition, we need to define the __type__ of the observed series, which is stored in `series_type` variable. It can either be __additive__ or __multiplicative__.
* __Additive__ : The seasonal, cyclical and random variations are absolute deviations from the trend. The series is _linear_: its seasonal pattern keeps roughly the _same frequency_ (width of cycles) and _amplitude_ (height of cycles) over time.

$$\begin{gather*}
y(t) = T_{t}+S_{t}+C_{t}+R_{t}
\end{gather*}$$

* __Multiplicative__ : The seasonal, cyclical and random variations are relative (percentage) deviations from the trend. The series is _non-linear_ (for example, quadratic or exponential): its seasonal pattern has an increasing or decreasing frequency and/or amplitude over time.

$$\begin{gather*}
y(t) = T_{t}*S_{t}*C_{t}*R_{t}
\end{gather*}$$
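To make the two formulas concrete, here is a minimal, dependency-free sketch of a classical moving-average decomposition. It is not the `seasonal_decompose` implementation used in the next cell; the helper names `centered_moving_average` and `classical_decompose` are hypothetical, and the cycle is folded into the trend as described above.

```python
def centered_moving_average(values, period):
    """Centred moving average; uses a 2 x m average when the period m is even."""
    half = period // 2
    trend = [None] * len(values)
    for t in range(half, len(values) - half):
        window = values[t - half:t + half + 1]
        if period % 2 == 0:
            # end points get half weight so the window stays centred on t
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    return trend


def classical_decompose(values, period, model="additive"):
    """Split a series into trend, seasonal and remainder components."""
    multiplicative = model == "multiplicative"
    trend = centered_moving_average(values, period)

    # remove the trend: subtract (additive) or divide (multiplicative)
    detrended = [
        None if tr is None else (y / tr if multiplicative else y - tr)
        for y, tr in zip(values, trend)
    ]

    # average the detrended values at each position within the period
    pattern = []
    for pos in range(period):
        obs = [d for i, d in enumerate(detrended)
               if i % period == pos and d is not None]
        pattern.append(sum(obs) / len(obs))

    # normalise so the seasonal effects sum to 0 (additive) or average 1 (multiplicative)
    centre = sum(pattern) / period
    pattern = [p / centre if multiplicative else p - centre for p in pattern]
    seasonal = [pattern[i % period] for i in range(len(values))]

    resid = [
        None if tr is None else (y / (tr * s) if multiplicative else y - tr - s)
        for y, tr, s in zip(values, trend, seasonal)
    ]
    return trend, seasonal, resid


# synthetic additive series: linear trend plus a quarterly pattern, no noise
true_pattern = [3.0, -1.0, -2.0, 0.0]
series = [10 + 0.5 * t + true_pattern[t % 4] for t in range(24)]

trend, seasonal, resid = classical_decompose(series, period=4, model="additive")
print(seasonal[:4])   # estimated seasonal effects for one full period
print(trend[2:6])     # trend is undefined (None) at the first and last period // 2 points
```

Because the synthetic series has no noise, the estimated seasonal effects match the pattern used to build it and the remainder is zero wherever the trend is defined. On real data the remainder absorbs whatever the trend and seasonal components do not explain.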

In [None]:
value="Decomposition Plot"

if __name__ == '__main__':
    try:
        display(Markdown('Enter `A` or `M` for `additive` or `multiplicative` series type.'))
        while True:
            series_type_inp = input('Enter the series_type (A/M): ')
            if series_type_inp.lower() == 'a':
                series_type = 'additive'
                break
            elif series_type_inp.lower() == 'm':
                series_type = 'multiplicative'
                break
            else:
                print(colored('Invalid choice! Try again.','red',attrs=['bold']))
                continue
        
        display(Markdown('In order to give the frequency, see the following mapping and provide the __integer value__:'))
        display(Markdown('[_month_ - 12] [_week_ - 52] [_day_ - 365] [_hour_ - 60] [_minute_ - 3600]'))
        
        while True:
            enter_freq = int(input('Enter the freq: '))
            if enter_freq in [12,52,365,60,3600]:
                break
            else:
                print(colored('Incorrect! Try again','red',attrs=['bold']))
                continue
        
        decomp_df_cols =  ['Observed', 'Trend', 'Seasonal' ,'Residual']
        fig = make_subplots(rows=4, cols=1, subplot_titles=tuple(decomp_df_cols))
        
        if dataset_type.lower() == 'mp':
            selected_panels = panel_selection(panel_ids)
            
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))
            
            for idx,each_id in enumerate(selected_panels):
                panel_data = df.groupby(panel_col).get_group(each_id)
                
                # Decomposing the time-series
                decomp_res = seasonal_decompose(panel_data[target_var], model=series_type, 
                                                filt=None, freq=enter_freq)
                decomp_df = pd.DataFrame({'Observed': decomp_res.observed, 'Trend':decomp_res.trend,
                                          'Seasonal':decomp_res.seasonal, 'Residual':decomp_res.resid})
                
                decomp_df_cols = decomp_df.columns
                decomp_df[date_time_col] = panel_data[date_time_col]
                visible = True if each_id == selected_panels[0] else False
                for i,each_col in enumerate(decomp_df_cols):
                    j= i+1
                    fig.add_trace(go.Scatter(x= decomp_df[date_time_col],y= decomp_df[each_col],
                                             mode = 'lines', name = 'value', opacity= 0.8, showlegend= False,
                                             visible = visible, legendgroup='value')
                                  ,row = j, col = 1)
            
            tab_dict_list = []
            for each_id in selected_panels:
                vis_check = [[True]*len(decomp_df_cols) if i==each_id else [False]*len(decomp_df_cols) for i in selected_panels]
                vis_check_flat = [i for sublist in vis_check for i in sublist]
                tab_dict_list.append(dict(args=[{"visible": vis_check_flat},
                                                {"title": "Decomposition plots for panel: {}".format(each_id)}],
                                          label=each_id, method="update"))
                fig.update_layout(updatemenus=[dict(buttons=list(tab_dict_list),
                                                    direction="right", x=0, xanchor="left", y=1.11, yanchor="top")],
                                  showlegend=False, title_x=0.5)
        else:
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))
            
            decomp_res = seasonal_decompose(df[target_var], model=series_type, 
                                            filt=None, freq=enter_freq)
            decomp_df = pd.DataFrame({'Observed': decomp_res.observed, 'Trend':decomp_res.trend,
                                      'Seasonal':decomp_res.seasonal, 'Residual':decomp_res.resid})
            decomp_df['Residual'].fillna(0, inplace=True)
            decomp_df_cols = decomp_df.columns
            for i,each_col in enumerate(decomp_df_cols):
                j= i+1
                fig.add_trace(go.Scatter(x= df[date_time_col],y= decomp_df[each_col],
                                             mode = 'lines', name = 'value', opacity= 0.8, showlegend= False,
                                             visible = True, legendgroup='value')
                                  ,row = j, col = 1)
        
        fig.update_layout(height=800, hovermode='x unified')
        clear_output()
        display(Markdown('__Decomposition plot for `{}` generated!__'.format(target_var)))
        fig.show(config={'displaylogo': False})
        
#         track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
else:
    pass

## Stationary Analysis

For a time-series to have the [stationary](https://en.wikipedia.org/wiki/Stationary_process) property,
the series should have three basic properties:

* The mean of the series should NOT be a function of time and should be a constant.
* The variance of the series should NOT be a function of time. This property is known as [homoscedasticity](https://www.statisticssolutions.com/homoscedasticity/).
* The [covariance](https://www.investopedia.com/terms/c/covariance.asp) of the $i^{th}$ term and the $(i+k)^{th}$ term (`k` being the time lag) should NOT be a function of time.

There are 3 different types of stationarity present in time-series analysis:

* **Strict Stationary:**
A strict stationary series satisfies the mathematical definition of a stationary process. For a strict stationary series, the __mean, variance and covariance are not functions of time__. The aim is to convert a non-stationary series into a strict stationary series before making predictions.

* **Trend Stationary:**
A __unit root__ is a feature of some stochastic/random processes (such as random walks). A series that has __NO unit root__ but exhibits a trend is referred to as a trend stationary series. __Once the trend is removed, the resulting series will be strict stationary__.

* **Difference Stationary:**
A time series that can be made strict stationary by differencing is known as difference stationary.
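As a small illustration of differencing, the sketch below builds a random-walk-like series from a list of steps and shows that the first difference recovers those steps, and that differencing a linear trend leaves a constant. The helper name `difference` is hypothetical and the numbers are made up for the example.

```python
from itertools import accumulate


def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


steps = [1, -2, 3, 0, -1]
walk = list(accumulate(steps))          # cumulative sum of the steps
print(walk)                             # [1, -1, 2, 2, 1]
print(difference(walk))                 # [-2, 3, 0, -1], i.e. steps[1:]

linear_trend = [3 * t + 5 for t in range(6)]
print(difference(linear_trend))         # [3, 3, 3, 3, 3]
```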

There are several techniques to check whether a time-series is stationary or not such as:

* **Look at Plots:** You can review a time series plot of your data and visually check if there are any obvious trends or seasonality. However, if a time-series has complex stationary patterns, it is not recommended to analyze that using this approach.
* **Summary Statistics:** You can review the summary statistics for your data for seasons or random partitions and check for obvious or significant differences.
* **Statistical Tests:** You can use statistical tests to check if the expectations of stationarity are met or have been violated.
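The summary-statistics check can be sketched in a few lines: split the series into two halves and compare their means and variances. This is only an informal illustration on synthetic data; the helper name `split_summary` and the seeded series are assumptions made for the example.

```python
import random
from statistics import mean, variance


def split_summary(series):
    """Compare the mean and sample variance of the two halves of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return {
        "mean": (mean(first), mean(second)),
        "variance": (variance(first), variance(second)),
    }


rng = random.Random(7)
noise = [rng.gauss(0, 1) for _ in range(1000)]        # no time-dependent structure
trending = [0.01 * t + e for t, e in enumerate(noise)]  # mean drifts upward over time

white = split_summary(noise)
trended = split_summary(trending)
print("noise   :", white)
print("trending:", trended)
```

For the pure-noise series the two halves have similar means, while for the trending series the second half has a clearly larger mean, which hints at non-stationarity. A formal statistical test is still needed to confirm it.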

In this notebook, we will be discussing the most popular **statistical tests** to identify the presence of stationarity in the data.

### Parametric Test

##### ADF (Augmented Dickey-Fuller) Test

The __[Augmented Dickey-Fuller test](https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test)__ is a type of statistical test called a __unit root test__. The intuition behind a unit root test is that it determines how strongly a time series is defined by a trend.

ADF uses an autoregressive model and optimizes an information criterion across multiple different lag values.

* __Null Hypothesis ($H_0$)__ : The series contains a __unit root__, and hence it is non-stationary. It has some time-dependent structure.
* __Alternative Hypothesis ($H_1$)__ : The series is weakly stationary. It does not have time-dependent structure.

The function __`ADF`__ has the `trend` parameter with the following 4 options: 
* `nc` - `N`o `C`onstant trend
* `c` - `C`onstant trend (default) 
* `ct` - `C`onstant and linear `T`ime trend
* `ctt` - `C`onstant and linear `T`ime and quadratic `T`ime trends

`lags` is the number of lags to use in the ADF regression. If omitted or None, `method` such as __AIC, BIC__ or __t-stat__ is used to automatically select the lag length.

__HOW TO INTERPRET__ : If the __p-value__ obtained from the test is __less than the significance level of 0.05__, then we reject $H_0$, i.e., __the series is stationary__.
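The intuition behind the unit root regression can be sketched without any library. The code below fits the plain (non-augmented) Dickey-Fuller regression of the change in the series on its lagged level, with an intercept. It is a simplified illustration, not what the `ADF` function does: there are no extra lag terms and no p-value, and the helper name `df_slope` and the simulated series are assumptions.

```python
import random


def df_slope(series):
    """OLS slope of (y[t] - y[t-1]) on y[t-1], with an intercept."""
    lagged = series[:-1]
    delta = [series[t] - series[t - 1] for t in range(1, len(series))]
    n = len(delta)
    mean_x = sum(lagged) / n
    mean_d = sum(delta) / n
    cov = sum((x - mean_x) * (d - mean_d) for x, d in zip(lagged, delta))
    var = sum((x - mean_x) ** 2 for x in lagged)
    return cov / var


rng = random.Random(42)
shocks = [rng.gauss(0, 1) for _ in range(2000)]

# random walk: y[t] = y[t-1] + e[t]  (has a unit root)
random_walk = [0.0]
for e in shocks:
    random_walk.append(random_walk[-1] + e)

# stationary AR(1): y[t] = 0.5 * y[t-1] + e[t]
ar1 = [0.0]
for e in shocks:
    ar1.append(0.5 * ar1[-1] + e)

slope_rw = df_slope(random_walk)
slope_ar = df_slope(ar1)
print("random walk slope:", round(slope_rw, 4))   # close to 0
print("AR(1) slope      :", round(slope_ar, 4))   # close to 0.5 - 1 = -0.5
```

A slope near zero means the level of the series tells us nothing about its next change, which is the unit root case. A clearly negative slope means the series is pulled back towards its mean, which is the behaviour of a stationary series.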

In [None]:
# ADF test
value="ADF test"

def adf_analysis(adf_test, each_id):
    # assigning whether the null hypothesis is accepted
    stationary_check = 1 if adf_test.pvalue < 0.05 else 0

    # storing the result in a dataset
    df_adf_output = pd.DataFrame([adf_test.stat, adf_test.pvalue, adf_test.lags, stationary_check],
                                index=['Test Statistic','p-value','Number of Lags Used',
                                      'Stationary Check']).reset_index()
    df_adf_output.columns = (['Panel','value'])
    df_adf_output['index'] = each_id

    for k,v in adf_test.critical_values.items():
        df_adf_output.loc[df_adf_output.index.max()+1] = (['Critical Value at {}'.format(k),v,each_id])

    df_adf_output = df_adf_output.pivot(columns = 'Panel',values = 'value',index = 'index').round(4)
    df_adf_output.index = [x for x in df_adf_output.index]
    
    df_adf_output[['Number of Lags Used', 'Stationary Check']] = df_adf_output[['Number of Lags Used', 'Stationary Check']].astype(int)
    
    df_adf_output.insert(loc=0, column='Panels', value=df_adf_output.index)
    
    return df_adf_output


if __name__ == '__main__':
    try:
        if dataset_type.lower() == 'mp':
            selected_panels = panel_selection(panel_ids)
            # H453700C08, H346600C03, H136400C09, H635900C03

            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the result</h2></div>'))

            df_adf = pd.DataFrame()
            for each_id in selected_panels:
                panel_data = df.groupby(panel_col).get_group(each_id)

                # performing ADF test
                adf_test = ADF(panel_data[panel_data[panel_col] == each_id][target_var])

                df_adf_output = adf_analysis(adf_test, each_id)

                df_adf = pd.concat([df_adf, df_adf_output])
        else:
            adf_test = ADF(df[target_var])

            df_adf = adf_analysis(adf_test, 0)

        track_cell(value, flag)

    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)

    else:
        clear_output()
        # Showing adf test summary
        display(Markdown('__Results of Augmented Dickey-Fuller Test for `{}`:__'.format(target_var)))
        if df_adf.shape[0] == 1:
            display(df_adf)
        else:
            display_data(df_adf.round(4))
            
else:
    pass

##### KPSS (Kwiatkowski–Phillips–Schmidt–Shin) Test

The __Kwiatkowski–Phillips–Schmidt–Shin test__ is used to determine whether differencing is required on the data. 

* __Null Hypothesis ($H_0$)__ : The series is weakly stationary.
* __Alternative Hypothesis ($H_1$)__ : The series contains __unit root__, and hence it is non-stationary.

Note that the $H_0$ and $H_1$ for the KPSS test are opposite to that of the ADF test, which often creates confusion.

The function __`KPSS`__ has the `trend` parameter with the following 2 options:
* `c` - `C`onstant trend (default) 
* `ct` - `C`onstant and linear `T`ime trend

Also, the `lags` parameter can be set manually (its value must be less than the number of observations) or left unset, in which case it is automatically set to `12*(nobs/100)**(1/4)`, where `nobs` is the number of observations in the sample.

__HOW TO INTERPRET__ : If the __p-value__ obtained from the test is __less than the significance level of 0.05__, then we reject $H_0$, i.e., __the series is non-stationary__.
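Since the ADF and KPSS hypotheses point in opposite directions, a small helper can make the decision rule explicit. This is a sketch of the interpretation described in this notebook; the function name `interpret_stationarity` is hypothetical and the p-values are made up.

```python
def interpret_stationarity(test, pvalue, alpha=0.05):
    """Translate a p-value into a verdict, given which test produced it."""
    reject_null = pvalue < alpha
    if test == "ADF":
        # H0: unit root, so rejecting H0 suggests a stationary series
        stationary = reject_null
    elif test == "KPSS":
        # H0: stationary, so rejecting H0 suggests a non-stationary series
        stationary = not reject_null
    else:
        raise ValueError("test must be 'ADF' or 'KPSS'")
    return "stationary" if stationary else "non-stationary"


# the same small p-value leads to opposite conclusions
print(interpret_stationarity("ADF", 0.01))    # stationary
print(interpret_stationarity("KPSS", 0.01))   # non-stationary
```

When both tests agree, the conclusion is comparatively safe. When they disagree, the series may be trend stationary or difference stationary and deserves a closer look.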

In [None]:
# KPSS test
value="KPSS test"


def kpss_analysis(kpss_test, each_id):
    # assigning whether the null hypothesis is accepted
    stationary_check = 0 if kpss_test.pvalue < 0.05 else 1

    # storing the result in dataframe
    df_kpss_output = pd.DataFrame([kpss_test.stat, kpss_test.pvalue, kpss_test.lags, stationary_check],
                                index=['Test Statistic','p-value','Number of Lags Used',
                                      'Stationary Check']).reset_index()
    df_kpss_output.columns = (['Panel','value'])
    df_kpss_output['index'] = each_id

    for k,v in kpss_test.critical_values.items():
        df_kpss_output.loc[df_kpss_output.index.max()+1] = (['Critical Value at {}'.format(k),v,each_id])

    df_kpss_output = df_kpss_output.pivot(columns = 'Panel',values = 'value',index = 'index').round(4)
    df_kpss_output.index = [x for x in df_kpss_output.index]
    
    df_kpss_output[['Number of Lags Used', 'Stationary Check']] = df_kpss_output[['Number of Lags Used', 'Stationary Check']].astype(int)
    
    df_kpss_output.insert(loc=0, column='Panels', value=df_kpss_output.index)
    
    return df_kpss_output


if __name__ == '__main__':
    try:
        if dataset_type.lower() == 'mp':
            selected_panels = panel_selection(panel_ids)
            # H453700C08, H346600C03, H136400C09, H635900C03
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the result</h2></div>'))

            df_kpss = pd.DataFrame()
            for each_id in selected_panels:
                panel_data = df.groupby(panel_col).get_group(each_id)

                # performing KPSS test
                kpss_test = KPSS(panel_data[panel_data[panel_col] == each_id][target_var])

                df_kpss_output = kpss_analysis(kpss_test, each_id)
                df_kpss = pd.concat([df_kpss, df_kpss_output])
        else:
            kpss_test = KPSS(df[target_var])

            df_kpss = kpss_analysis(kpss_test, 0)

        track_cell(value, flag)

    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)

    else:
        clear_output()
        # Showing kpss test summary
        display(Markdown('__Results of KPSS Test for `{}`:__'.format(target_var)))
        if df_kpss.shape[0] == 1:
            display(df_kpss)
        else:
            display_data(df_kpss.round(4))
            
else:
    pass

### Semi-Parametric Test

##### Variance Ratio Test

__Variance Ratio test__ is one of the most popular semi-parametric tests to check the _random walk_ hypothesis. Note that the variance ratio test is __NOT a unit root test__. This test is used to check whether the observed series is a _random walk_ or it has some predictability.

* __Null Hypothesis ($H_0$)__ : The series is a random walk series.
* __Alternative Hypothesis ($H_1$)__ : The series is __NOT a random walk__ series.

Rejection of the null with a positive test statistic indicates the presence of positive serial correlation in the time series.

In the `trend` parameter of the function __`VarianceRatio`__, `c` allows for a non-zero drift in the random walk, while `nc` requires that the increments to y have a mean of 0.

The value of `lags` must be at least 2 and less than the number of observations in the sample.

__HOW TO INTERPRET__ : If the __p-value__ obtained from the test is __less than the significance level of 0.05__, then we reject $H_0$, i.e., __the series is not a random walk__.
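The idea behind the statistic can be sketched by hand: for a random walk, the variance of `q`-period changes is about `q` times the variance of one-period changes, so their ratio is close to 1. The code below computes a simplified, uncorrected ratio with overlapping `q`-period changes. It is an illustration only and omits the bias corrections and the test statistic that `VarianceRatio` reports; the function name `variance_ratio` is hypothetical.

```python
import random


def variance_ratio(levels, q=2):
    """Uncorrected variance ratio VR(q) for a series of levels."""
    n = len(levels) - 1                       # number of one-period changes
    changes = [levels[t] - levels[t - 1] for t in range(1, n + 1)]
    mu = sum(changes) / n                     # estimated drift per period
    var_1 = sum((r - mu) ** 2 for r in changes) / n

    # overlapping q-period changes, each with expected value q * mu
    q_changes = [levels[t] - levels[t - q] for t in range(q, n + 1)]
    var_q = sum((d - q * mu) ** 2 for d in q_changes) / len(q_changes)

    return var_q / (q * var_1)


# simulated random walk: the ratio should be close to 1
rng = random.Random(0)
walk = [0.0]
for _ in range(5000):
    walk.append(walk[-1] + rng.gauss(0, 1))
vr_walk = variance_ratio(walk, q=2)

# zig-zag series: every change is reversed by the next one
zigzag = [t % 2 for t in range(101)]
vr_zigzag = variance_ratio(zigzag, q=2)

print("random walk VR(2):", round(vr_walk, 3))
print("zig-zag VR(2)    :", vr_zigzag)        # 0.0
```

A ratio well below 1 points to negative serial correlation (changes tend to reverse), and a ratio well above 1 points to positive serial correlation, which matches the remark about the sign of the test statistic.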

In [None]:
# Variance Ratio test
value="Variance Ratio test"


def var_ratio_analysis(var_ratio_test, each_id):
    # assigning whether the null hypothesis is accepted
    random_walk_check = 0 if var_ratio_test.pvalue < 0.05 else 1

    # storing the result in the dataframe
    df_var_ratio_output = pd.DataFrame([var_ratio_test.stat, var_ratio_test.pvalue, 
                                        var_ratio_test.lags, random_walk_check],
                                       index=['Test Statistic','p-value',
                                              'Number of Lags Used','Random Walk Check']).reset_index()
    df_var_ratio_output.columns = (['Panel','value'])
    df_var_ratio_output['index'] = each_id

    for k,v in var_ratio_test.critical_values.items():
        df_var_ratio_output.loc[df_var_ratio_output.index.max()+1] = (['Critical Value at {}'.format(k),v,each_id])

    df_var_ratio_output = df_var_ratio_output.pivot(columns = 'Panel',values = 'value',index = 'index').round(4)
    df_var_ratio_output.index = [x for x in df_var_ratio_output.index]
    
    df_var_ratio_output[['Number of Lags Used', 'Random Walk Check']] = df_var_ratio_output[['Number of Lags Used', 'Random Walk Check']].astype(int)
    
    df_var_ratio_output.insert(loc=0, column='Panels', value=df_var_ratio_output.index)
    
    return df_var_ratio_output


if __name__ == '__main__':
    try:
        if dataset_type.lower() == 'mp':
            selected_panels = panel_selection(panel_ids)
            # H453700C08, H346600C03, H136400C09, H635900C03
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the result</h2></div>'))

            df_var_ratio = pd.DataFrame()
            for each_id in selected_panels:
                panel_data = df.groupby(panel_col).get_group(each_id)

                # performing the variance ratio test
                var_ratio_test = VarianceRatio(panel_data[panel_data[panel_col] == each_id][target_var])

                df_var_ratio_output = var_ratio_analysis(var_ratio_test, each_id)

                df_var_ratio = pd.concat([df_var_ratio, df_var_ratio_output])

        else:
            var_ratio_test = VarianceRatio(df[target_var])

            df_var_ratio = var_ratio_analysis(var_ratio_test, 0)

        track_cell(value, flag)

    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)

    else:
        clear_output()
        # Showing var_ratio test summary
        display(Markdown('__Results of Variance Ratio Test for `{}`:__'.format(target_var)))
        if df_var_ratio.shape[0] == 1:
            display(df_var_ratio)
        else:
            display_data(df_var_ratio.round(4))
            
else:
    pass

### Non-Parametric Test

##### Phillips-Perron Test

Compared with the ADF test, the __Phillips-Perron__ unit root test applies a non-parametric correction to the test statistic, which makes it robust to unspecified autocorrelation and heteroscedasticity in the errors. There are two test statistics, $Z_{\rho}$ and $Z_{\tau}$, which have the same asymptotic distributions as the ADF statistics. $Z_{\tau}$ (default) is based on the t-statistic, while $Z_{\rho}$ is based on the length of the time series (`nobs`) times the re-centered regression coefficient.

* __Null Hypothesis ($H_0$)__ : The series contains __unit root__, and hence it is non-stationary.
* __Alternative Hypothesis ($H_1$)__ : The series is weakly stationary.

The function __`PhillipsPerron`__ has a `trend` parameter with the following 3 options: 
* `nc` - `N`o `C`onstant trend
* `c` - `C`onstant trend (default) 
* `ct` - `C`onstant and linear `T`ime trend

Also, the `lags` parameter can be set manually, with a maximum value less than the length of the sample data, or left unset, in which case it defaults to `12*(nobs/100)**(1/4)`.

__HOW TO INTERPRET__ : If the __p-value__ obtained from the test is __less than the significance level of 0.05__, then we reject the $H_0$, i.e., __the series is stationary__.
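
As a rough illustration of the two rules above, the sketch below computes the automatic lag length and applies the 0.05 decision rule. The helper names `default_pp_lags` and `is_stationary` are hypothetical, and rounding the lag up to an integer is an assumption rather than something stated above.

```python
import math


def default_pp_lags(nobs):
    """Automatic lag length 12*(nobs/100)**(1/4); rounding up is an assumption."""
    return math.ceil(12 * (nobs / 100) ** (1 / 4))


def is_stationary(pvalue, alpha=0.05):
    """True when the unit-root null hypothesis is rejected at level alpha."""
    return pvalue < alpha


print(default_pp_lags(100))   # lag length for a series of 100 observations
print(is_stationary(0.01))    # small p-value: reject H0, treat as stationary
print(is_stationary(0.30))    # large p-value: cannot reject H0
```

The lag length grows slowly with the sample size, so even long series use a modest number of lags.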

In [None]:
# Phillips-Perron test
value="Phillips-Perron test"


def pp_analysis(pp_test, each_id):
    # flag is 1 when the null hypothesis (unit root) is rejected at the 5% level
    stationary_check = 1 if pp_test.pvalue < 0.05 else 0

    # storing the result in dataframe
    df_pp_output = pd.DataFrame([pp_test.stat, pp_test.pvalue, pp_test.lags, stationary_check],
                                index=['Test Statistic','p-value','Number of Lags Used',
                                      'Stationary Check']).reset_index()
    df_pp_output.columns = (['Panel','value'])
    df_pp_output['index'] = each_id

    for k,v in pp_test.critical_values.items():
        df_pp_output.loc[df_pp_output.index.max()+1] = (['Critical Value at {}'.format(k),v,each_id])

    df_pp_output = df_pp_output.pivot(columns = 'Panel',values = 'value',index = 'index').round(4)
    df_pp_output.index = [x for x in df_pp_output.index]
    
    df_pp_output[['Number of Lags Used', 'Stationary Check']] = df_pp_output[['Number of Lags Used', 'Stationary Check']].astype(int)
    
    df_pp_output.insert(loc=0, column='Panels', value=df_pp_output.index)
    
    return df_pp_output


if __name__ == '__main__':
    try:
        if dataset_type.lower() == 'mp':
            selected_panels = panel_selection(panel_ids)
            # H453700C08, H346600C03, H136400C09, H635900C03
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the result</h2></div>'))

            df_pp = pd.DataFrame()
            for each_id in selected_panels:
                panel_data = df.groupby(panel_col).get_group(each_id)

                # performing the PP test
                pp_test = PhillipsPerron(panel_data[target_var])

                df_pp_output = pp_analysis(pp_test, each_id)

                # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
                df_pp = pd.concat([df_pp, df_pp_output])

        else:
            pp_test = PhillipsPerron(df[target_var])

            df_pp = pp_analysis(pp_test, 0)

        track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)

    else:
        clear_output()
        # Showing pp test summary
        display(Markdown('__Results of Phillips-Perron Test for `{}`:__'.format(target_var)))
        if df_pp.shape[0] == 1:
            display(df_pp)    
        else:
            display_data(df_pp.round(4))
            
else:
    pass

## Autocorrelation and Partial Autocorrelation plots

__Auto-Correlation__ and __Partial Auto-Correlation__ are measures of association between current and past series values to indicate which past series values are most useful in predicting future values. These plots can be used to determine the auto-regressive and moving average components of your forecasting model.

* __Auto-Correlation Function (ACF):__ Autocorrelation is the correlation between a signal's observations as a function of the time lag between them. The __ACF plot__ therefore shows the total correlation between the series and its own lagged values, one bar per lag. The analysis of autocorrelation is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise.

* __Partial Auto-Correlation Function (PACF):__ PACF plot is a plot of the partial correlation of a stationary time series with its own lagged values, controlling for the values of the time series at all shorter lags.

__HOW TO INTERPRET__ : 

* If the PACF displays a sharp cutoff while the ACF decays more slowly (i.e., has significant spikes at higher lags), we say that the stationarized series displays an `AR signature`, meaning that the autocorrelation pattern can be explained more easily by adding AR terms.
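* Alternatively, if the ACF cuts off sharply at the $n^{th}$ lag (i.e., $x(t)$ is no longer correlated with values that far back) while the PACF decays more slowly, the series is said to display an `MA signature`, meaning that the pattern can be explained more easily by adding MA terms. This often indicates that the series has been over-differenced.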

**NOTE:** Here the plots are generated without differencing the original data.
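
To make the idea of correlation at a lag concrete, here is a minimal pure-Python sketch of the sample autocorrelation. It is not the routine used in the cell below; the function name `sample_acf` and the toy series are assumptions made for illustration only.

```python
def sample_acf(series, nlags):
    """Sample autocorrelation for lags 0..nlags (nlags + 1 values)."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(nlags + 1)]


# toy series that repeats every 4 observations
toy = [1, 0, -1, 0] * 10
for lag, value in enumerate(sample_acf(toy, 4)):
    print(lag, round(value, 3))
```

In the toy series the correlation returns to a high positive value at lag 4, the length of the repeating pattern, which is the kind of spike an ACF plot makes visible.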

In [None]:
# performing ACF and PACF on the target variable across panels
value="ACF and PACF"

if __name__ == '__main__':
    try:
        fig = make_subplots(rows = 2, cols = 1, subplot_titles=('ACF','PACF'))
        
        def acf_pacf_plot(data,vis):
            acf_res, acf_conf = acf(data[target_var], nlags= inp_lag, alpha=.05)
            pacf_res, pacf_conf = pacf(data[target_var], nlags= inp_lag, alpha=.05)
            # acf/pacf return nlags+1 values (lag 0 included), so plot lags 0..inp_lag
            lags = list(range(len(acf_res)))
            # ACF
            fig.add_trace(go.Bar(x= lags, y= acf_res.tolist(), 
                                 marker_color = 'maroon', width = 0.07,
                                 showlegend= False,visible=vis), 
                          row = 1,col = 1)
            # confidence band, re-centred around zero
            fig.add_trace(go.Scatter(x=lags, y=acf_conf[:, 0] - acf_res,
                                     line=dict(shape = 'spline',width = 0.01,color='lightgray'),
                                     showlegend= False,visible=vis),
                          row = 1,col = 1)
            fig.add_trace(go.Scatter(x=lags, y=acf_conf[:, 1] - acf_res,
                                     line=dict(shape = 'spline',width = 0.01,color='lightgray'),
                                     showlegend= False,visible=vis, fill='tonexty'),
                          row = 1,col = 1)
            # PACF
            fig.add_trace(go.Bar(x= lags, y= pacf_res.tolist(),
                                 marker_color = 'maroon', width = 0.07,
                                 showlegend= False,visible=vis), 
                          row = 2,col = 1)

            fig.add_trace(go.Scatter(x=lags,
                                     y=pacf_conf[:, 0] - pacf_res,
                                     line=dict(shape = 'spline',width = 0.01,color='lightgray'),
                                     showlegend= False,visible=vis), 
                          row =2,col = 1)
            fig.add_trace(go.Scatter(x=lags,
                                     y=pacf_conf[:, 1] - pacf_res,
                                     line=dict(shape = 'spline',width = 0.01,color='lightgray'),
                                     showlegend= False,visible=vis, fill='tonexty'),
                          row = 2,col = 1)
        
        
        if dataset_type.lower() == 'mp':
            selected_panels = panel_selection(panel_ids)
            
            all_panel_length = []
            for idx,each_id in enumerate(selected_panels):
                panel_data = df.groupby(panel_col).get_group(each_id)
                all_panel_length.append(panel_data.shape[0])

            lag_thresh = 30
            if min(all_panel_length) <= lag_thresh:
                lag_thresh = min(all_panel_length)
                display(Markdown('Length of the smallest panel data: __{}__'.format(min(all_panel_length))))

            # asking the user to add the lag
            while True:
                inp_lag = int(input('Enter the lag (should be less than {}): '.format(lag_thresh)))
                if inp_lag <= lag_thresh:
                    break
                else:
                    print(colored('Please enter the right lag value.','red',attrs=['bold']))
                    continue
            
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))
            for i, each_id in enumerate(selected_panels):
                vis = True if each_id == selected_panels[0] else False
                panel_data = df.groupby(panel_col).get_group(each_id)
                acf_pacf_plot(panel_data, vis)
            
            tab_dict_list = []
            for each_id in selected_panels:
                vis_check = [[True]*6 if i==each_id else [False]*6 for i in selected_panels]
                vis_check_flat = [i for sublist in vis_check for i in sublist]
                tab_dict_list.append(dict(args = [{"visible": vis_check_flat},
                                                    {"title": "ACF-PACF plots for panel: {}".format(each_id)}],
                                            label=each_id, method="update"))

                fig.update_layout(updatemenus=[dict(buttons=list(tab_dict_list),
                                                direction="right", x=0, xanchor="left", y=1.11, yanchor="top")],
                             showlegend=False, title_x=0.5)
        else:
            while True:
                inp_lag = int(input('Enter the lag (at max 30): '))
                if inp_lag <= 30:
                    break
                else:
                    print(colored('Please enter the right lag value.','red',attrs=['bold']))
                    continue
            
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))
            acf_pacf_plot(df, True)
        
        fig.update_xaxes(title_text='Lags')
        clear_output()
        display(Markdown('__ACF-PACF plot for column `{}` generated!__'.format(target_var)))
        fig.show(config={'displaylogo': False})
        
        track_cell(value, flag)
    except Exception as err:
        # display the error
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
else:
    pass

## Spectral Analysis

Spectral analysis is the decomposition of a time series into underlying __sine and cosine__ functions of different frequencies using __Fourier transform__, which allows us to determine __those frequencies that appear particularly strong or important__. This enables us to find underlying periodicities.

The spectral intensities are plotted against frequency.  

$$\begin{aligned}
x_{t} = \sum_{k} \left[ a_{k} \sin(2 \pi k f t) + b_{k} \cos(2 \pi k f t) \right]
\end{aligned}$$

where:  
* `f` is the fundamental frequency, and `k` indexes its integer multiples (`f`, `2f`, `3f`, ...)
* `1/f` is the period of seasonality
* $a_{k}$ and $b_{k}$ are coefficients which can be calculated as $a_{k} = \sum_{t} x_{t} \sin(2 \pi k f t)$  and  $b_{k} = \sum_{t} x_{t} \cos(2 \pi k f t)$.

The coefficients are usually used to generate the spectrum ($s_{k}$) of the data to find out the importance of each frequency. The spectrum is calculated as:

$$\begin{aligned}
s_{k} = \frac{1}{2}(a_{k}^2 + b_{k}^2)
\end{aligned}$$

For large $s_{k}$, the frequency $k f$ is important.

__Power-Spectral-Density (PSD)__ analysis is a type of frequency-domain analysis that describes how the power (variance) of a series is distributed across frequencies. In the cell below it is estimated with a periodogram.

__Interpretation of the spectral plot__ : The spectral intensity is plotted against integer multiples of the frequency `f` (`f`, `2f`, `3f`, ..., etc.). Spikes appear farther along the X axis when the seasonal pattern repeats more often within the given time period.
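
The spectrum formula can be tried out with a small pure-Python sketch. It assumes the frequencies are taken as `k/n` for a series of length `n` and scales the coefficients by `2/n`; both choices, along with the function name `spectrum`, are assumptions for illustration and not the `periodogram` routine used in the cell below.

```python
import math


def spectrum(series):
    """Return (frequency, s_k) pairs for k = 1 .. n//2, with frequency k/n."""
    n = len(series)
    mean = sum(series) / n
    centred = [v - mean for v in series]
    result = []
    for k in range(1, n // 2 + 1):
        angle = 2 * math.pi * k / n
        # the 2/n scaling is an assumption; it turns a_k, b_k into wave amplitudes
        a_k = (2 / n) * sum(x * math.sin(angle * t) for t, x in enumerate(centred))
        b_k = (2 / n) * sum(x * math.cos(angle * t) for t, x in enumerate(centred))
        result.append((k / n, 0.5 * (a_k ** 2 + b_k ** 2)))
    return result


# toy series: one sine wave with a period of 12 observations
toy = [math.sin(2 * math.pi * t / 12) for t in range(48)]
peak_freq, peak_power = max(spectrum(toy), key=lambda pair: pair[1])
print(peak_freq, 1 / peak_freq)   # dominant frequency and the period it implies
```

The largest $s_{k}$ sits at the frequency whose inverse is the length of the seasonal cycle, which is how a spike in the plot is read.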

In [None]:
# performing PSD using periodogram
value="PSD"

if __name__ == '__main__':
    try:
        fig = go.Figure()
        
        if dataset_type.lower() == 'mp':
            selected_panels = panel_selection(panel_ids)
            
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))
            
            for idx,each_id in enumerate(selected_panels):
                panel_data = df.groupby(panel_col).get_group(each_id)
                freqs, psd = periodogram(panel_data[target_var], window=('tukey', 0.25), detrend='linear')
                vis = True if each_id == selected_panels[0] else False
                
                fig.add_trace(go.Bar(x=freqs, y=psd, width=0.002, visible=vis))
                fig.add_trace(go.Scatter(x=freqs, y=psd, mode='markers',opacity=0.8, visible=vis))
            
            tab_dict_list = []
            for each_id in selected_panels:
                vis_check = [[True]*2 if i==each_id else [False]*2 for i in selected_panels]
                vis_check_flat = [i for sublist in vis_check for i in sublist]
                tab_dict_list.append(dict(args = [{"visible": vis_check_flat},
                                                 {"title": "PSD plots for panel: {}".format(each_id)}],
                                            label=each_id, method="update"))
                fig.update_layout(updatemenus=[dict(buttons=list(tab_dict_list),
                                                direction="right",
                                                x=0, xanchor="left", y=1.11, yanchor="top")],
                                  showlegend=False,title_x=0.5)
            
        else:
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))
            
            freqs, psd = periodogram(df[target_var], window=('tukey', 0.25), detrend='linear')
            fig.add_trace(go.Bar(x=freqs, y=psd, width=0.002))
            fig.add_trace(go.Scatter(x=freqs, y=psd, mode='markers',opacity=0.8))
            
        fig.update_xaxes(title_text='Frequency')
        fig.update_yaxes(title_text= 'PSD')
        
        clear_output()
        display(Markdown('__PSD plots generated!__'))
        fig.show(config={'displaylogo': False})
        track_cell(value, flag)
    except Exception as err:
        # display the error
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
else:
    pass

> __Notes__:
 
```
*Add notes here*

```

***
# Multivariate Analysis

First, the user will be allowed to enter the regressor i.e., the independent variable.

In [None]:
# selecting the regressor

if __name__ == '__main__':
    display(Markdown('\nEnter the name of column to be used as the __regressor__.'))

    # the list of columns in dataframe
    original_cols = df.columns
    cols_vis = ' || '.join(original_cols)
    print(colored("\nColumn Names:",'grey',attrs=['bold']),"\n{}".format(cols_vis))
    print('_'*75)

    while True:
        reg_var = input("\nEnter the column name: ")
        if reg_var not in original_cols:
            print(colored("\nPlease enter the column name properly!","red",attrs=['bold']))
            continue
        elif reg_var == target_var:
            print(colored("\nRegressor and Target variable CANNOT BE SAME!","red",attrs=['bold']))
            continue
        else:
            clear_output()
            display(Markdown("The column which will be considered as the __regressor__ is: __{}__".format(reg_var)))
            break
else:
    reg_var = 'adr'
    display(Markdown("The column which will be considered as the __regressor__ is: __{}__".format(reg_var)))

## Causality

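Granger causality is a way to investigate whether one time series helps forecast another, i.e., whether past values of one variable carry predictive information about the other. It is a statement about predictability rather than proof of a direct causal mechanism.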

It is based on the idea that if `X` causes `Y`, then the forecast of `Y` based on previous values of `Y` AND the previous values of `X` should outperform the forecast of `Y` based on previous values of `Y` _alone_.

* __Null Hypothesis ($H_0$)__ : The __lagged value__ of a regressor does NOT affect the value of the target variable. So, no causation.
* __Alternate Hypothesis ($H_1$)__ : The lagged value of a regressor affects the value of the target variable.

__HOW TO INTERPRET__ : If the __p-value__ obtained from the test is __less than the significance level of 0.05__, then we reject the $H_0$, i.e., __the regressor Granger-causes the target variable__.
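
The forecasting comparison behind the test can be sketched in plain Python. The example below is a simplified one-lag F-statistic comparing a restricted and an unrestricted regression; it is not the VAR-based Wald test used in the cell below, it does not compute a p-value, and the names `ols_rss` and `granger_f` and the simulated data are assumptions for illustration.

```python
import random


def ols_rss(rows, y):
    """Residual sum of squares of an OLS fit, solved via the normal equations."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    aug = [xtx[i] + [xty[i]] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][p] / aug[i][i] for i in range(p)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def granger_f(cause, effect):
    """One-lag F statistic for 'cause helps forecast effect'."""
    y = effect[1:]
    restricted = [[1.0, effect[t - 1]] for t in range(1, len(effect))]
    unrestricted = [[1.0, effect[t - 1], cause[t - 1]] for t in range(1, len(effect))]
    rss_r = ols_rss(restricted, y)      # forecast from own past only
    rss_u = ols_rss(unrestricted, y)    # forecast from own past plus the other series
    dof = len(y) - 3
    return (rss_r - rss_u) / (rss_u / dof)


# simulated data in which y follows x with a delay of one step
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.9 * x[t - 1] + random.gauss(0, 0.1) for t in range(1, 200)]
print(granger_f(x, y))   # large: lagged x improves the forecast of y
print(granger_f(y, x))   # small: lagged y adds little to the forecast of x
```

A large statistic in one direction and a small one in the other is the asymmetric pattern the test is designed to detect.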

In [None]:
# performing causality test between two selected series
value = 'causality'


def caus_analysis(data, each_id):
    data_rqrd = data[[target_var, reg_var]].copy()  # copy so the index assignment below does not modify a view of `data`
    data_rqrd.index = pd.DatetimeIndex(data[date_time_col].astype(str).to_list())

    # getting the correct order using AIC method
    order_range = range(1, len(data[target_var]))
    rho = [aryule(data[target_var], i, norm='unbiased')[1] for i in order_range]
    aic = AIC(len(data[target_var]), rho, order_range).tolist()
    min_aic = min(aic)
    min_rho_index = aic.index(min_aic)

    final_data = data_rqrd.diff(order_range[min_rho_index]).dropna()

    model = VAR(final_data)
    results = model.fit()
    caus_result = results.test_causality(target_var, reg_var, kind='wald')

    # flag is 1 when the null hypothesis (no causality) is rejected at the 5% level
    caus_check = 1 if caus_result.pvalue < 0.05 else 0

    # storing the result in a dataframe
    df_caus_output = pd.DataFrame([caus_result.test_statistic, caus_result.pvalue, caus_check],
                                   index=['Test Statistic','p-value', 'Causality Check']).reset_index()
    df_caus_output.columns = (['Panel','value'])
    df_caus_output['index'] = each_id

    df_caus_output = df_caus_output.pivot(columns = 'Panel',values = 'value',index = 'index')
    df_caus_output.index = [x for x in df_caus_output.index]
    
    df_caus_output['Causality Check'] = df_caus_output['Causality Check'].astype(int)
    
    df_caus_output.insert(loc=0, column='Panels', value=df_caus_output.index)
    
    return df_caus_output


if __name__ == '__main__':
    try:
        if dataset_type.lower() == 'mp':
            df_caus = pd.DataFrame()

            selected_panels = panel_selection(panel_ids)
            # H453700C08, H346600C03, H136400C09, H635900C03

            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the result</h2></div>'))

            for each_id in selected_panels:
                panel_data = df.groupby(panel_col).get_group(each_id)

                df_caus_output = caus_analysis(panel_data, each_id)

                # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
                df_caus = pd.concat([df_caus, df_caus_output])

        else:
            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the result</h2></div>'))
            df_caus = caus_analysis(df,0)

        clear_output()        
        track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)

    else:
        clear_output()
        display(Markdown('__Result of Granger Causality Test between `{}` and `{}`__'.format(target_var, reg_var)))
        if df_caus.shape[0] == 1:
            display(df_caus)
        else:
            display_data(df_caus.round(4))
else:
    pass

## Cointegration Test

Let's define the __order of integration `d`__ as the number of times a non-stationary time series must be differenced to become stationary. Consider a pair of time series, both of which are non-stationary. A particular linear combination of these series can sometimes be stationary. Such a pair of series is then termed __cointegrated__, and the order of integration of the combination is lower than that of the individual series.

* __Null Hypothesis ($H_0$)__ : There is __NO cointegration__ between the pair of series.
* __Alternative Hypothesis ($H_1$)__: The pair of series are cointegrated.

The `trend` parameter included in regression for cointegrating equation has 4 options:
* `nc` - `N`o `C`onstant trend
* `c` - `C`onstant trend (default) 
* `ct` - `C`onstant and linear `T`ime trend
* `ctt` - `C`onstant and linear `T`ime and quadratic `T`ime trends

__HOW TO INTERPRET__ : If the __p-value__ obtained from the test is __less than the significance level of 0.05__, then we reject the $H_0$, i.e., __the pair of series are cointegrated__.
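
The two-step idea can be sketched without any library: regress one series on the other, then check whether the residuals behave like a stationary series. The sketch below omits lag augmentation and p-values, which need special critical values that the `coint` function in the cell below takes care of. The helper names `ols_line` and `df_tstat` and the simulated data are assumptions for illustration.

```python
import random


def ols_line(x, y):
    """Intercept and slope of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope


def df_tstat(resid):
    """t-statistic of rho in the regression  diff(e_t) = rho * e_{t-1} + error."""
    lagged = resid[:-1]
    delta = [b - a for a, b in zip(resid[:-1], resid[1:])]
    sxx = sum(e * e for e in lagged)
    rho = sum(e * d for e, d in zip(lagged, delta)) / sxx
    rss = sum((d - rho * e) ** 2 for e, d in zip(lagged, delta))
    se = (rss / (len(delta) - 1) / sxx) ** 0.5
    return rho / se


random.seed(1)
x = [0.0]
for _ in range(199):
    x.append(x[-1] + random.gauss(0, 1))            # random walk, non-stationary
y = [2.0 * xi + random.gauss(0, 0.5) for xi in x]   # shares the same stochastic trend

intercept, slope = ols_line(x, y)                   # step 1: cointegrating regression
resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
print(round(slope, 2), round(df_tstat(resid), 2))   # step 2: unit-root statistic on residuals
```

A strongly negative statistic on the residuals suggests they revert to their mean, which is what cointegration implies; the statistic has to be compared with cointegration-specific critical values rather than a standard t table.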

In [None]:
# performing cointegration test between 2 selected series
value = 'coint test'

def coint_analysis(coint_test, each_id):
    # flag is 1 when the null hypothesis (no cointegration) is rejected at the 5% level
    coint_check = 1 if coint_test[1] < 0.05 else 0

    coint_test_stats = list(coint_test[:2])
    coint_test_stats.append(coint_check)
    coint_conf_intrvl = coint_test[2].tolist()

    # storing the result in a dataframe
    df_coint_output = pd.DataFrame(coint_test_stats+coint_conf_intrvl,
                                   index=['Test Statistic','p-value','Cointegration Check',
                                          'Critical Value at 1%','Critical Value at 5%', 'Critical Value at 10%']).reset_index()
    df_coint_output.columns = (['Panel','value'])
    df_coint_output['index'] = each_id

    df_coint_output = df_coint_output.pivot(columns = 'Panel',values = 'value',index = 'index')
    df_coint_output.index = [x for x in df_coint_output.index]
    
    df_coint_output['Cointegration Check'] = df_coint_output['Cointegration Check'].astype(int)
    
    df_coint_output.insert(loc=0, column='Panels', value=df_coint_output.index)
    
    return df_coint_output


if __name__ == '__main__':
    try:
        if dataset_type.lower() == 'mp':
            df_coint = pd.DataFrame()

            selected_panels = panel_selection(panel_ids)
            # H453700C08, H346600C03, H136400C09, H635900C03

            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the result</h2></div>'))

            # performing the operation on panel level
            for each_id in selected_panels:
                panel_data = df.groupby(panel_col).get_group(each_id)

                # performing the cointegration test
                coint_test = coint(panel_data[target_var], panel_data[reg_var])

                df_coint_output = coint_analysis(coint_test, each_id)

                # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
                df_coint = pd.concat([df_coint, df_coint_output])

        else:
            # performing the cointegration test
            coint_test = coint(df[target_var], df[reg_var])

            df_coint = coint_analysis(coint_test, 0)

        track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)

    else:
        clear_output()
        display(Markdown('__Result of Augmented Engle-Granger 2-step Cointegration Test between `{}` and `{}`__'.format(target_var, reg_var)))
        if df_coint.shape[0] == 1:
            display(df_coint)
        else:
            display_data(df_coint.round(4))
else:
    pass

## Cross-correlation

This step uses the regressor column selected earlier, which is different from the target column. The code computes the correlation between the target series and __lagged/lead (time-shifted) versions__ of the regressor series, to find out whether the two series are correlated at some offset in time.

The maximum number of lags against which this can be performed is 1 less than the length of the series.
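
As a minimal sketch of the lag/lead correlation described above, the pure-Python function below pairs `target[t]` with `other[t - lag]` and computes the Pearson correlation over the overlapping observations. The name `lagged_corr` and the toy data are assumptions for illustration.

```python
def lagged_corr(target, other, lag):
    """Pearson correlation between target[t] and other[t - lag]."""
    n = len(target)
    pairs = [(target[t], other[t - lag]) for t in range(n) if 0 <= t - lag < n]
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5


# toy data: `target` repeats `other` two steps later
other = [(4 * t) % 11 for t in range(40)]
target = [0, 0] + other[:-2]
scores = {lag: lagged_corr(target, other, lag) for lag in range(-5, 6)}
best_lag = max(scores, key=scores.get)
print(best_lag)   # the lag at which the two series line up best
```

The lag with the highest correlation indicates how far one series trails the other; each extra lag shortens the overlap, which is why the usable range is bounded by the series length.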

In [None]:
# performing cross correlation
value='cross-corr'

if __name__ == '__main__':
    def cross_corr_analysis(data, cross_cor_lag, vis):
        d1 = data[target_var]
        d2 = data[reg_var]

        # calculating the cross correlation between the 2 series against a range of lag
        cross_corr_res = [d1.corr(d2.shift(lag)) for lag in range(-cross_cor_lag,cross_cor_lag+1)]
        cross_corr_res = [0 if isnan(x) else x for x in cross_corr_res]

        # generating the plot
        fig.add_trace(go.Bar(x=list(range(-cross_cor_lag,cross_cor_lag+1)), 
                             y=cross_corr_res, width=0.25, visible=vis))
        fig.add_trace(go.Scatter(x=list(range(-cross_cor_lag,cross_cor_lag+1)), 
                                 y=cross_corr_res, mode='markers',opacity=0.8, visible=vis))
    
    try:
        fig = go.Figure()
        
        if dataset_type.lower() == 'mp':
            selected_panels = panel_selection(panel_ids)
            # H453700C08, H346600C03, H136400C09, H635900C03


            all_panel_length = []
            for idx,each_id in enumerate(selected_panels):
                panel_data = df.groupby(panel_col).get_group(each_id)
                all_panel_length.append(panel_data.shape[0])

            lag_thresh = 30
            if min(all_panel_length) < lag_thresh:
                lag_thresh = min(all_panel_length)
                display(Markdown('Length of the smallest panel data: __{}__'.format(min(all_panel_length))))

            # asking the user to add the lag
            while True:
                cross_cor_lag = int(input('Enter the lag (should be less than {}): '.format(lag_thresh)))
                if cross_cor_lag <= lag_thresh:
                    break
                else:
                    print(colored('Please enter the right lag value.','red',attrs=['bold']))
                    continue

            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))

            # performing the operation on panel level
            for each_id in selected_panels:
                panel_data = df.groupby(panel_col).get_group(each_id)
                vis = True if each_id == selected_panels[0] else False
                cross_corr_plot = cross_corr_analysis(panel_data, cross_cor_lag, vis)
            
            tab_dict_list = []
            for each_id in selected_panels:
                vis_check = [[True]*2 if i==each_id else [False]*2 for i in selected_panels]
                vis_check_flat = [i for sublist in vis_check for i in sublist]
                tab_dict_list.append(dict(args=[{"visible": vis_check_flat},
                                                {"title": "Cross-correlation plots for panel: {}".format(each_id)}],
                                          label=each_id, method="update"))
                fig.update_layout(updatemenus=[dict(buttons=list(tab_dict_list),
                                                    direction="right", x=0, xanchor="left", y=1.11, yanchor="top")],
                                  showlegend=False, title_x=0.5)

        else:
            lag_thresh = 30
            while True:
                cross_cor_lag = int(input('Enter the lag (at max 30): '))
                if cross_cor_lag <= lag_thresh:
                    break
                else:
                    print(colored('Please enter the right lag value.','red',attrs=['bold']))
                    continue

            cross_corr_plot_layout = cross_corr_analysis(df, cross_cor_lag, True)
            fig.update_layout(showlegend=False)

        fig.update_xaxes(title='Lags')
        fig.update_yaxes(title='Corr. Coeff.')
        clear_output()
        display(Markdown('__The cross-correlation plots between `{}` and `{}` are generated.__'.format(target_var, reg_var)))
        display(Markdown('__Range of lags selected__: -{0} to {0}'.format(cross_cor_lag)))
        fig.show(config={'displaylogo':False})

        track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
        
else:
    pass

> __Notes__:
 
```
*Add notes here*
```

# Non-stationary to Stationary Conversion

## Data Differencing

One of the simplest and most popular ways to remove non-stationarity from a time series is to take the difference between the observed series and its $k^{th}$ lag. This technique is used to eliminate trend from the time-series data.
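
A small pure-Python sketch shows what differencing with the $k^{th}$ lag does. The function name `difference` and the toy series are assumptions for illustration; positions without a lagged value are left as `None`.

```python
def difference(series, k=1):
    """Subtract the k-th lag; the first k positions have no lag and stay None."""
    return [None] * k + [series[t] - series[t - k] for t in range(k, len(series))]


trend = [3 + 2 * t for t in range(6)]       # linear trend: 3, 5, 7, ...
print(difference(trend, 1))                 # constant differences, trend removed

seasonal = [10, 20, 30, 40] * 2             # pattern that repeats every 4 steps
print(difference(seasonal, 4))              # lag-4 difference removes the repeat
```

Choosing `k` equal to the length of a repeating pattern removes that pattern, while `k = 1` targets a steady trend.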

In [None]:
# performing data differencing to remove non-stationarity
value = 'data diff'


if __name__ == '__main__':
    try:

        def data_diff_generate(data, differ_order, vis):
            diff_panel_data = data[target_var].diff(periods = differ_order)

            # storing the value in a dataframe
            target_var_mod = target_var+'_mod'
            diff_panel_data_df = pd.DataFrame({target_var_mod:diff_panel_data})

            # Add traces
            fig.add_trace(go.Scatter(x=data[date_time_col], y=data[target_var], 
                                     name="actual {}".format(target_var), visible=vis), secondary_y=False)

            fig.add_trace(go.Scatter(x=data[date_time_col], y=diff_panel_data, 
                                     name="differenced {}".format(target_var), visible=vis), secondary_y=True)

            fig.update_layout(hovermode='x unified')

            fig.update_layout(legend_orientation="h", legend=dict(x=.25, y=-0.1))

            # Set y-axes titles
            fig.update_yaxes(title_text="<b>actual</b> observation", secondary_y=False)
            fig.update_yaxes(title_text="<b>differenced</b> observation", secondary_y=True)

        fig = make_subplots(specs=[[{"secondary_y": True}]])
        
        if dataset_type.lower() == 'mp':
            selected_panels = panel_selection(panel_ids)
            # H453700C08, H346600C03, H136400C09, H635900C03

            all_panel_length = []
            for idx,each_id in enumerate(selected_panels):
                panel_data = df.groupby(panel_col).get_group(each_id)
                all_panel_length.append(panel_data.shape[0])

            lag_thresh = 30
            if min(all_panel_length) < lag_thresh:
                lag_thresh = min(all_panel_length)
                display(Markdown('Length of the smallest panel data: __{}__'.format(min(all_panel_length))))

            # asking the user to add the lag
            while True:
                differ_order = int(input('Enter the lag (should be less than {}): '.format(lag_thresh)))
                if differ_order <= lag_thresh:
                    break
                else:
                    print(colored('Please enter the right lag value.','red',attrs=['bold']))
                    continue

            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))
            
            # performing the operation on panel level
            for idx,each_id in enumerate(selected_panels):
                panel_data = df.groupby(panel_col).get_group(each_id)
                visible = True if each_id == selected_panels[0] else False
                # calling the function to generate the plot
                data_diff_generate(panel_data, differ_order,visible)
                
            tab_dict_list = []
            for each_id in selected_panels:
                vis_check = [[True]*2 if i==each_id else [False]*2 for i in selected_panels]
                vis_check_flat = [i for sublist in vis_check for i in sublist]
                tab_dict_list.append(dict(args=[{"visible": vis_check_flat},
                                                {"title": "Data differenced plot for panel: {}".format(each_id)}],
                                          label=each_id, method="update"))
            # build the panel-selection buttons once, after all entries are collected
            fig.update_layout(updatemenus=[dict(buttons=list(tab_dict_list),
                                                direction="right", x=0, xanchor="left", y=1.11, yanchor="top")],
                              showlegend=False, title_x=0.5)

        else:

            while True:
                differ_order = int(input('Enter the lag (at max 30): '))
                if 0 < differ_order <= 30:
                    break
                else:
                    print(colored('Please enter the right lag value.','red',attrs=['bold']))
                    continue

            display(Markdown('<div><div class="loader"></div><h2> &nbsp; Generating the plots</h2></div>'))

            # calling the function to generate the plot
            data_diff_generate(df, differ_order, True)
            
        clear_output()
        display(Markdown('__Data Differencing Plots for `{}` generated!__'.format(target_var)))
        display(Markdown('__Order of differencing__ : {}'.format(differ_order)))
        fig.show(config={'displaylogo': False})

        track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
else:
    pass

> __Notes__:
 
```
*Add notes here*
```

***

# Dynamic Time Warping

Dynamic time warping (DTW) is a family of algorithms that compute the local stretching or compression to apply to the time axes of two time series in order to optimally map one (the query) onto the other (the reference). DTW computes the Euclidean distance between each point of one series and every point of the other, and then finds the minimum-cost path that aligns the two signals.

The greatest advantage of this method is that it can also deal with signals of different lengths. One downside is that it cannot handle missing values, so any missing data points must be interpolated beforehand.

At the core of the algorithms, the model proceeds as follows:

1. Divide the two series into equal points.
2. Calculate the __Euclidean distance__ between the $1^{st}$ point in the $1^{st}$ series and every point in the $2^{nd}$ series. Store the minimum distance calculated. This is the __time warp__ stage.
3. Move to the $2^{nd}$ point and __repeat `2`__. Continue point by point, _repeating 2_, until all points are exhausted.
4. Now __repeat `2` and `3`__, but with the $2^{nd}$ series as the reference.
5. Add up all the stored minimum distances. The total measures how similar the two series are: the smaller the total, the more similar the series.
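
The steps above are a simplified picture. The following is a minimal sketch of the textbook dynamic-programming formulation, not the `dtaidistance` implementation this notebook calls. The function name `dtw_distance` and the toy series are hypothetical, and the absolute difference is assumed as the local cost. Unlike the simplified steps, this version forces the alignment to move forward in time on both series.

```python
import numpy as np

def dtw_distance(query, reference):
    """Classic O(n*m) DTW with absolute difference as the local cost (illustrative only)."""
    n, m = len(query), len(reference)
    # cost[i, j] = cheapest cumulative cost of aligning query[:i] with reference[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(query[i - 1] - reference[j - 1])
            # extend the cheapest of the three admissible predecessor alignments
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]   # same shape as a, stretched in time
c = [2.0, 2.0, 2.0, 2.0, 2.0]        # flat series

print(dtw_distance(a, b))   # a and b differ only by a time stretch
print(dtw_distance(a, c))
```

Because `b` is only a time-stretched copy of `a`, warping absorbs the difference entirely, whereas the flat series `c` stays far from `a` regardless of how the time axis is stretched.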

##### DTW Clustering

Here, we perform agglomerative clustering on the target variable to show which panels exhibit similar characteristics and which of them behave abnormally.

In [None]:
# performing dtw and see which series are approximately similar
value = 'dtw'

if __name__ == '__main__':
    try:
        def add_distance(ddata, dist_threshold=None, fontsize=8):
            '''
            Function to plot cluster points & distance labels in dendrogram

            Arguments
                ddata: scipy dendrogram output
                dist_threshold: distance threshold where label will be drawn, if None, 1/10 from base leafs will not be labelled to prevent clutter
                fontsize: size of distance labels
            '''
            if dist_threshold is None:
                # add labels except for 1/10 from base leaf nodes
                dist_threshold = max([a for i in ddata['dcoord'] for a in i])/10

            for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
                y = sum(i[1:3])/2
                x = d[1]
                # only label above distance threshold
                if x > dist_threshold:
                    plt.plot(x, y, 'o', c=c, markeredgewidth=0)
                    plt.annotate(int(x), (x, y), xytext=(15, 3),
                                 textcoords='offset points', va='top', 
                                 ha='center', fontsize=15)


        def maxclust_draw(data):
            '''
            Function to draw agglomerative clustering dendrogram based on maximum cluster criterion

            input:
                data       : dataframe or arrays of timeseries (one row per series)

            note:
                max_cluster is not an argument; it is set to the number of rows in data

            return:
                Plot of Dendrogram with timeseries graphs on the side
            '''
            max_cluster = data.shape[0]
            # define gridspec space
            gs = gridspec.GridSpec(max_cluster,60)

            # add dendrogram to gridspec
            fig_height = 40
            if max_cluster > 50:
                fig_height = max_cluster

            plt.figure(figsize=(25, fig_height), facecolor="w")
            plt.subplot(gs[:, 0:35])
            plt.xlabel('Distance', fontsize=15)
            plt.ylabel('Cluster', fontsize=15)

            data_series = np.matrix(data)

            # Custom Hierarchical clustering
            model = clustering.Hierarchical(dtw.distance_matrix_fast, {})
            # Augment Hierarchical object to keep track of the full tree
            h_model = clustering.HierarchicalTree(model, method='average')

            # Fit Model:
            h_model.fit(series=data_series)
            
            linkage_val = h_model.linkage

            ddata = dendrogram(linkage_val, orientation='left',
                               leaf_font_size=15, labels=data.T.columns)

            # add distance labels in dendrogram
            add_distance(ddata)

            # add timeseries graphs to gridspec
            for cluster in range(1,max_cluster+1):
                reverse_plot = max_cluster+1-cluster
                plt.subplot(gs[reverse_plot-1:reverse_plot, 45:60])

                cluster_id = ddata['ivl'][cluster-1]
                plt.plot(data.T[cluster_id])

                plt.tick_params(axis='y', which='both', labelleft=False, labelright=True)
                if cluster-1 != 0:
                    plt.xticks([])
                else:
                    plt.xticks(rotation=90)

        if dataset_type.lower() == 'mp':

            selected_panels = panel_selection(panel_ids)

            # creating the dataframe to be used to create dendrogram
            dendogram_df = pd.DataFrame()
            for idx,each_id in enumerate(selected_panels):
                # select the columns with a list; tuple-style selection on a groupby fails in recent pandas
                panel_data = df.groupby(panel_col)[[date_time_col,target_var]].get_group(each_id)
                dendogram_df[each_id] = panel_data[target_var].to_list()

            dendogram_df.index = panel_data[date_time_col]

            clear_output()
            display(Markdown('__DTW Clustering Plot for `{}` generated!__'.format(target_var)))
            maxclust_draw(dendogram_df.T)

        else:
            df_dtw = df[num_cols].copy()
            for each_col in num_cols:
                df_dtw[each_col] = (df[each_col] - df[each_col].mean())/df[each_col].std()
            
            display(Markdown('__DTW Clustering Plot for all numerical columns generated!__'))
            maxclust_draw(df_dtw.T)

        track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
else:
    pass

> __Notes__:
 
```
*Add notes here*
```

***

# Factor Analysis

Factor Analysis provides a tool for analyzing the structure of the interrelationships among a large number of variables by defining sets of highly interrelated variables, known as __factors__ or __components__. It helps in data interpretation by reducing the number of variables.

There are 2 types of factor analysis:
* __Exploratory Factor Analysis (EFA)__: It is the most popular factor analysis approach among social and management researchers. Its basic assumption is that any observed variable may be directly associated with any factor.
* __Confirmatory Factor Analysis (CFA)__: Its basic assumption is that each factor is associated with a particular set of observed variables. CFA confirms what is expected on the basis of prior theory.

_In this notebook, we will be covering EFA._

##### How does factor analysis work?
As mentioned, the primary objective of factor analysis is to reduce the number of observed variables and find unobservable variables. It can be achieved in 3 steps:
* __Step 1 - Adequacy Test__: We need to evaluate the “factorability” of our dataset. Factorability means "can we find the factors in the dataset?" Only if the answer is yes can we proceed with the remaining steps.
* __Step 2 - Factor Extraction__: In this step, the number of factors is chosen and the factor loadings are computed.
* __Step 3 - Factor Rotation__: In this step, the factors are rotated to make the loadings easier to interpret. Orthogonal rotations keep the factors __uncorrelated__, i.e., independent of each other, while oblique rotations allow them to correlate. Several rotation methods are available, and rotation is performed [here](#fact_rot). The methods include:
    <a name='rot_types'></a>
    * __Orthogonal__ rotation: _varimax, oblimax, quartimax, equamax_
    * __Oblique__ rotation: _promax, oblimin, quartimin_


##### Step 1. : Is Factor Analysis necessary?

1. _If a visual inspection reveals no substantial number of correlations greater than .30, then factor analysis is __probably inappropriate__._ The correlations among variables can also be analyzed by computing the __partial correlations__ among variables. A partial correlation is a correlation that is unexplained when the effects of other variables are taken into account. _If “true” factors exist in the data, the partial correlation should be small, because the variable can be explained by the variables loading on the factors. __If the partial correlations are high, indicating no underlying factors, then factor analysis is inappropriate__._

2. __Bartlett’s Test of Sphericity__ checks whether or not the observed correlation matrix is significantly different from an identity matrix, i.e., from a matrix in which the variables are completely uncorrelated. Small values __(less than 0.05)__ of the significance level indicate that __factor analysis may be useful with the data__.

3. __Measure of Sampling Adequacy (MSA)__ checks if it is possible to factorize the main variables efficiently. It ranges from 0 to 1, reaching 1 when each variable is _perfectly predicted without error by the other variables_. The measure can be interpreted with the following guidelines: 
    * >= 0.80 -> meritorious
    * >= 0.70 -> middling
    * >= 0.60 -> mediocre
    * >= 0.50 -> miserable
    * < 0.50  -> unacceptable
    
   The MSA increases as 
   * the sample size increases, OR
   * the average correlations increase, OR
   * the number of variables increases, OR
   * the number of factors decreases
       
   __MSA values must exceed .50__ for both the overall test and each variable - variables with values less than .50 should be omitted from the factor analysis one at a time, with the smallest one being omitted each time. __Kaiser-Meyer-Olkin (KMO) Test__ is a measure of the adequacy of sampling.
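
To make the two tests concrete, here is a minimal NumPy/SciPy sketch of the standard formulas: Bartlett's statistic $-(n - 1 - \frac{2p + 5}{6})\ln|R|$ with $p(p-1)/2$ degrees of freedom, and the overall KMO as the ratio of squared correlations to squared correlations plus squared partial correlations. The function names and the simulated data are assumptions for illustration; the notebook itself relies on `calculate_bartlett_sphericity` and `calculate_kmo`.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test: is the correlation matrix of X different from the identity?"""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2
    return statistic, chi2.sf(statistic, dof)

def kmo_overall(X):
    """Overall KMO / MSA from correlations and partial correlations."""
    R = np.corrcoef(X, rowvar=False)
    R_inv = np.linalg.inv(R)
    # partial correlations are obtained by rescaling the inverse correlation matrix
    scale = np.sqrt(np.outer(np.diag(R_inv), np.diag(R_inv)))
    partial = -R_inv / scale
    off_diag = ~np.eye(R.shape[0], dtype=bool)
    r2 = (R[off_diag] ** 2).sum()
    a2 = (partial[off_diag] ** 2).sum()
    return r2 / (r2 + a2)

# simulated data: four variables driven by one shared (hypothetical) factor plus noise
rng = np.random.default_rng(0)
factor = rng.normal(size=(500, 1))
X = factor + 0.5 * rng.normal(size=(500, 4))

stat, p_value = bartlett_sphericity(X)
kmo = kmo_overall(X)
print(round(stat, 1), p_value, round(kmo, 3))
```

Since all four simulated variables share one factor, the p-value should be far below 0.05 and the KMO should fall in the upper ranges of the scale above.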

## Data Preprocessing for factor analysis

In this section, the categorical columns are one-hot encoded and the numerical columns are standardized (within each panel for panel data).
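
A minimal sketch of both operations on a made-up frame is shown below. The column names `panel`, `sales` and `region` are hypothetical, and `groupby(...).transform` is used here as a compact alternative to the per-panel loop in the cell that follows.

```python
import pandas as pd

# toy panel data: two panels on very different scales
toy = pd.DataFrame({
    'panel': ['p1', 'p1', 'p1', 'p2', 'p2', 'p2'],
    'sales': [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
    'region': ['north', 'south', 'north', 'south', 'south', 'north'],
})

# one-hot encode the categorical column
encoded = pd.get_dummies(toy, columns=['region'], prefix=['region'])

# z-score the numerical column within each panel
encoded['sales'] = toy.groupby('panel')['sales'].transform(
    lambda x: (x - x.mean()) / x.std())

print(encoded)
```

Scaling within each panel puts panels of very different magnitudes on a common scale, so a large panel does not dominate the correlations that factor analysis works from.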

In [None]:
# one-hot encoding the categorical columns and scaling the numerical columns
value="FA data preprocess"

if __name__ == '__main__':
    try:
        # create a copy of the dataframe
        df_fact_anal = df.copy()
        cols_to_be_used = num_cols.copy()
        cols_to_be_used.extend(cat_cols)
        df_fact_anal = df_fact_anal[cols_to_be_used]
        
        if cat_cols == []:
            pass
        else:
            # perform one hot encoding to the categorical columns
            df_fact_anal = pd.get_dummies(df_fact_anal, prefix=cat_cols)
            df_fact_anal = df_fact_anal.rename(columns = lambda x: (x.replace(' ','_')))
        
        if num_cols == []:
            pass
        else:
            if dataset_type.lower() == 'mp':
                df_fact_anal[panel_col] = df[panel_col]
                selected_panels = panel_selection(panel_ids)
                
                display(Markdown('<div><div class="loader"></div><h2> &nbsp; Processing</h2></div>'))
                
                def func_scale(x):
                    return (x - x.mean())/x.std()
                
                scaled_panels = []
                for idx,each_id in enumerate(selected_panels):
                    # copy so that scaling does not write into a view of df_fact_anal
                    panel_data = df_fact_anal.groupby(panel_col).get_group(each_id).copy()
                    
                    # perform scaling
                    panel_data[num_cols] = panel_data[num_cols].apply(func_scale, axis=0)
                    scaled_panels.append(panel_data.fillna(0))
                
                # DataFrame.append was removed in pandas 2.0; concatenate once instead
                df_fact_anal_final = pd.concat(scaled_panels, ignore_index=True)
                
                df_fact_anal = df_fact_anal_final.copy()
                df_fact_anal.drop(panel_col, axis=1, inplace=True)
            else:
                for each_col in num_cols:
                    df_fact_anal[each_col] = (df[each_col] - df[each_col].mean())/df[each_col].std()
        
        # replace any white-space/dot/comma separator with '_' and strip apostrophes in every column name
        df_fact_anal = df_fact_anal.rename(columns = lambda x: x.replace(' ','_').replace('.','_').replace(',','_').replace("'",""))
        
        df_fact_anal = df_fact_anal.replace([np.inf, -np.inf], np.nan)
        
        clear_output()
        display(Markdown('Displaying the __preprocessed dataframe__ :'))
        display_data(df_fact_anal.round(4))
        
        track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
else:
    pass

## Variance check for scaled variables

Here, we will calculate the variance of each column and if we notice any column giving variance 0, then those columns will be eliminated.
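
A constant column has a standard deviation of zero, so its correlation with every other column is undefined, which is why such columns are removed first. A minimal sketch on a made-up frame (the column names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [5.0, 5.0, 5.0],   # constant column, variance 0
    'c': [0.0, 1.0, 0.0],
})

# find the columns whose variance is exactly zero and drop them
variances = toy.var()
constant_cols = variances[variances == 0].index.tolist()
reduced = toy.drop(columns=constant_cols)

print(constant_cols, list(reduced.columns))
```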

In [None]:
# performing variance check on the scaled dataset
value='variance check fa'

if __name__ == '__main__':
    try:
        # calculating the variance for each column and storing it in dataframe
        df_fact_anal_var = pd.DataFrame(data = df_fact_anal.var().round(4),
                                        columns = ['variance']).reset_index()
        df_fact_anal_var.rename(columns = {'index': 'column_name'}, inplace=True)

        # filtering out the rows having variance 0
        df_fact_anal_var_0 = df_fact_anal_var[df_fact_anal_var['variance']==0]
        
        if df_fact_anal_var_0.shape[0] == 0:
            display(Markdown('__No columns have been dropped in this process!__'))
        else:
            display_data(df_fact_anal_var_0.round(4))

            # drop the columns having 0 variance
            var_0_cols = df_fact_anal_var_0['column_name'].to_list()
            df_fact_anal.drop(var_0_cols, axis=1, inplace=True)

            display(Markdown('The above listed __{} columns have been dropped__ from the scaled dataframe.'.format(df_fact_anal_var_0.shape[0])))

        track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
else:
    pass

## Singularity check for variables

Here, we need to check whether the data is singular (i.e., the determinant of its correlation matrix is zero). This can be checked and handled using the __Variance Inflation Factor (VIF)__, which helps us drop the columns causing singularity.

VIF estimates how much the variance of a coefficient is _"inflated"_ because of __linear dependence with other predictors__. Thus, a __VIF of 1.4__ tells us that the variance of a particular coefficient is __40% larger__ than it would be if that predictor was completely uncorrelated with all the other predictors.

The VIF has a lower bound of 1 but the upper bound varies depending on the problem statement. In order to learn more about the threshold, check this [link](https://www.statisticshowto.datasciencecentral.com/variance-inflation-factor/).

In order to know at what scenarios a high VIF is not a problem and can be safely ignored, check this [link](https://statisticalhorizons.com/multicollinearity).
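
For intuition, the VIF of a column equals $1/(1 - R^2)$, where $R^2$ comes from regressing that column on all the others, and for standardized data these values are the diagonal of the inverse correlation matrix. The sketch below uses that identity on simulated data. The helper name `vif_from_correlation` is hypothetical, and the cell below uses the `statsmodels` function `variance_inflation_factor` instead.

```python
import numpy as np

def vif_from_correlation(X):
    """VIF of every column, read off the diagonal of the inverse correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

rng = np.random.default_rng(1)
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)                 # independent of the others
x3 = x1 + 0.1 * rng.normal(size=1000)      # nearly a copy of x1
X = np.column_stack([x1, x2, x3])

vif = vif_from_correlation(X)
print(vif.round(2))
```

The two nearly identical columns should show a VIF far above the threshold of 5 used in this notebook, while the independent column stays close to the lower bound of 1.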

In [None]:
# finding the columns responsible for singularity
from statsmodels.stats.outliers_influence import variance_inflation_factor
value="singularity check fa"

if __name__ == '__main__':
    try:
        display(Markdown('<div><div class="loader"></div><h2> &nbsp; Processing</h2></div>'))
        
        vif_thresh = 5

        col_to_drop_for_fa = []

        # column names and the positional indices of the columns still under consideration
        df_fa_cols = df_fact_anal.columns
        fa_variables = np.arange(df_fact_anal.shape[1])

        # enter infinite loop and iterate through it until all the columns causing singularity is removed
        while True:    
            # obtaining the matrix with updated column names
            c = df_fact_anal[df_fa_cols[fa_variables]].values

            # generating the VIF for the available columns
            vif = [variance_inflation_factor(c, ix) for ix in np.arange(c.shape[1])]

            # check the max value of that specific VIF
            maxloc = vif.index(max(vif))

            # if it crosses the threshold, then remove the column generating that max value
            if max(vif) > vif_thresh:
                col_to_drop_for_fa.append(df_fact_anal.iloc[:, fa_variables].columns[maxloc])
                # drop the column
                fa_variables = np.delete(fa_variables, maxloc)
                continue
            else:
                break
        
        clear_output()
        display(Markdown('__Columns dropped for causing singularity:__'))
        print('{}'.format(' || '.join(col_to_drop_for_fa)))

        fact_anal_cols = df_fact_anal.columns

        # columns on which factor analysis will be performed (original column order preserved)
        col_to_retain_for_fa = [col for col in fact_anal_cols if col not in col_to_drop_for_fa]

        if col_to_drop_for_fa == []:
            print('No columns to be eliminated!')
            pass
        else:
            try:
                # removing the columns flagged by the VIF check before running the adequacy tests
                df_fact_anal.drop(col_to_drop_for_fa, axis=1, inplace=True)

                # display updated list of column names
                updated_fact_anal_cols_vis = ' || '.join(col_to_retain_for_fa)
                print(colored("\nColumns Retained:",'blue',attrs=['bold']),"\n{}".format(updated_fact_anal_cols_vis))

                track_cell(value, flag)
            except Exception as err:
                print(colored('ERROR!','red',attrs=['bold']), colored('{}'.format(err),'grey'))
                flag = 0
                err = str(err)
                track_cell(value, flag, err)

        track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
    else:
        display(Markdown('<span style="color:darkgreen; font-size: 15px"><i>CODE EXECUTED!</i> Continue executing the next cell for performing <b>Adequacy Test</b>.</span>'))
        
else:
    pass

## Adequacy Test

The adequacy test checks whether the variables in the dataset are correlated strongly enough for factors to be extracted, i.e., whether it makes sense to go ahead with factor analysis.
Henry Kaiser (1970) introduced a Measure of Sampling Adequacy (MSA) of factor analytic data matrices.

In [None]:
# performing Bartlett’s Test of Sphericity and  Kaiser-Meyer-Olkin (KMO) Test
value="Bartlett's and KMO"

if __name__ == '__main__':

    try:
        # performing Bartlett's test
        chi_square_value, p_value = calculate_bartlett_sphericity(df_fact_anal)
        display(Markdown('__Bartlett’s test of Sphericity__'))
        display(Markdown('_Chi-Squared Value_ : __{}__'.format(chi_square_value.round(3))))
        display(Markdown('_p-value_ : __{}__'.format(p_value)))

        # performing Kaiser-Meyer-Olkin (KMO) Test
        kmo_per_variable, kmo_total = calculate_kmo(df_fact_anal)
        display(Markdown('__Kaiser-Meyer-Olkin (KMO) Test__'))
        display(Markdown('_The KMO score overall (MSA)_  = __{}__'.format(kmo_total.round(3))))

        # check for NaN first, because any comparison against NaN evaluates to False
        if isnan(p_value) or isnan(kmo_total):
            print(colored('You have singular matrix! CAN\'T PROCEED for Factor Analysis!','red',attrs=['bold']))
        elif p_value > 0.05 or kmo_total < 0.5:
            print(colored('DON\'T PROCEED for Factor Analysis!','red',attrs=['bold']))
        else:
            print(colored('PROCEED for Factor Analysis!','green',attrs=['bold']))

        track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
else:
    pass

## Scree plot

##### Step 2.1. : How to obtain the right number of factors?

Any decision on the number of factors to be retained should be based on several considerations:

* __Latent Root Criterion__: This is the most commonly used technique. The rationale for this criterion is that any individual factor should account for the variance of at least a single variable if it is to be retained for interpretation. Thus, only the factors having __eigen-values__ _(the share of the total variance explained by each factor)_ greater than 1 are considered significant, and the rest are not. We can check this using a __Scree Plot__. It is derived by plotting the eigen-values against the number of factors in their order of extraction, and the cut-off is the last point before the eigen-value drops below 1.
* __A Priori Criterion__: A predetermined number of factors based on research objectives and/or prior research.
* __Percentage of Variance Criterion__: It is an approach based on achieving a specified cumulative percentage of total variance extracted by successive factors. The purpose is to ensure practical significance for the derived factors by ensuring that they explain at least a specified amount of variance.

_Here, we would be performing Latent Root Criterion._
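
The criterion itself can be sketched with plain NumPy by counting the eigen-values of the correlation matrix that are at least 1. The simulated data and the loading values below are assumptions for illustration; the cell that follows obtains the eigen-values from `FactorAnalyzer` instead.

```python
import numpy as np

# simulate six variables driven by two hypothetical latent factors
rng = np.random.default_rng(2)
latent = rng.normal(size=(400, 2))
true_loadings = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
                          [0.0, 0.9], [0.0, 0.8], [0.0, 0.7]])
X = latent @ true_loadings.T + 0.4 * rng.normal(size=(400, 6))

# eigen-values of the correlation matrix, largest first
R = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

# latent root criterion: keep the factors whose eigen-value is at least 1
n_factors = int((eigenvalues >= 1).sum())
print(eigenvalues.round(2), n_factors)
```

The eigen-values of a correlation matrix always add up to the number of variables, so an eigen-value above 1 means the factor explains more variance than a single standardized variable does.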

In [None]:
# scree plot
value="Scree Plot"

if __name__ == '__main__':
    try:
        display(Markdown('<div><div class="loader"></div><h2> &nbsp; LOADING. Might take longer to process in case of factors greater than 200. </h2></div>'))
        # generate the eigen value matrix
        fa = FactorAnalyzer(rotation=None)
        fa.fit(df_fact_anal)
        ev, v = fa.get_eigenvalues()
        
        fig = go.Figure()
        
        # generate the scree plot
        fig.add_trace(go.Scatter(x=list(range(1, df_fact_anal.shape[1]+1)), y=ev,
                                 mode='markers+lines', marker_color='teal', name='Factor'))
        
        # highlight the point where latent root criterion is satisfied using a tan cross marker
        latent_fact = 0
        eigen_val = 0
        for i,val in enumerate(list(ev)):
            if val < 1:
                latent_fact = i
                eigen_val = list(ev)[latent_fact-1]
                break
            else:
                pass
            
        fig.add_trace(go.Scatter(x=[latent_fact], y=[eigen_val], marker_symbol='x', marker_size=15,
                                 opacity=0.75, mode='markers', marker_color='tan', name='Latent Root Factor'))
        
        # highlight the eigen value threshold
        fig.add_shape(dict(type="line", x0=0,y0=1, x1=df_fact_anal.shape[1]+1,y1=1,
                           line=dict(color="maroon",width=1,dash='dash')))
        fig.add_trace(go.Scatter(x=[2.75], y=[0.9], text=["Eigen Value Threshold"], 
                                 mode="text", showlegend=False))
        fig.update_layout(title='Scree Plot',title_x=0.5)
        
        fig.update_yaxes(title='Eigen Value')
        fig.update_xaxes(title='Number of columns after scaling')
        clear_output()
        display(Markdown('__Highest number of factors__ that can be taken based on _Latent Root Criterion_ is __{}__'.format(latent_fact)))
        fig.show(config={'displaylogo': False})
        track_cell(value, flag)
    except Exception as err:
        clear_output()
        print(colored('ERROR!','red',attrs=['bold']), colored('{}'.format(err),'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
else:
    pass

## Factor loading

##### Step 2.2. : Generate the initial factor loading

Once we have figured out the maximum number of factors required, we need to obtain the __factor loadings__. This is a matrix that shows the relationship of each variable to the underlying factors: each entry is the __correlation coefficient__ between an observed variable and a factor. We will generate a heat map to visualize the distribution of the factors across the columns.

In [None]:
# performing factor loading
value="Factor Loading"

if __name__ == '__main__':
    try:
        display(Markdown('<div><div class="loader"></div><h2> &nbsp; PROCESSING</h2></div>'))

        if latent_fact>100 :
            clear_output()
            print(colored('Cannot display the result due to large number of factors!','red'))
        else:    
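            # analyze the factors with the maximum number of factors obtained
            # rotation=None keeps the loadings unrotated; FactorAnalyzer applies 'promax' by default
            fa_threshold_fact = FactorAnalyzer(n_factors = latent_fact, rotation=None)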
            fa_threshold_fact.fit(df_fact_anal)

            # generate those factor values in a matrix
            matrix_fa_threshold_fact = fa_threshold_fact.loadings_

            # store the matrix in a dataframe
            df_fa_threshold_fact = pd.DataFrame(data = matrix_fa_threshold_fact,
                                                columns = ['fact_{}'.format(i) for i in range(latent_fact)],
                                                index = df_fact_anal.columns)

            # generate and display the plot
            fig_height = 20
            col_count = df_fa_threshold_fact.shape[0]
            if col_count > 30:
                fig_height = 30*((col_count/30) - (col_count//30)) + 30
            else:
                pass

            plt.figure(figsize=(20,fig_height))
            sns.heatmap(df_fa_threshold_fact.round(3), cmap='RdYlGn', annot=True, alpha=0.9)
            clear_output()
            # display the heatmap
            display(Markdown('__Factor Loading Heatmap__ '))
            plt.show()

        track_cell(value, flag)
    except Exception as err:
        clear_output()
        print(colored('ERROR!','red',attrs=['bold']), colored('{}'.format(err),'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
else:
    pass

## Variance explained

Now, we need to obtain the __variance of each factor__.

* The __SS loadings__ row is the sum of squared loadings. This is sometimes used to determine the value of a particular factor. _We say a factor is worth keeping if the __SS loading is greater than 1__._
* __Variance__ is simply the proportion of variance explained by each factor.
* __Cumulative Variance__ tells us about the total variance explained by all the factors put together.
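
The three rows can be reproduced by hand from a loading matrix. The sketch below assumes standardized variables and uses a made-up 4x2 loading matrix; the cell that follows obtains the same quantities from `get_factor_variance()`.

```python
import numpy as np

# hypothetical loadings: 4 variables (rows) on 2 factors (columns)
loadings = np.array([[0.9, 0.1],
                     [0.8, 0.2],
                     [0.1, 0.7],
                     [0.2, 0.6]])

# SS loadings: sum of squared loadings per factor
ss_loadings = (loadings ** 2).sum(axis=0)
# proportion of total variance: each standardized variable contributes a variance of 1
proportion_var = ss_loadings / loadings.shape[0]
# running total across the factors
cumulative_var = np.cumsum(proportion_var)

print(ss_loadings, proportion_var, cumulative_var)
```

In this toy case the first factor has an SS loading above 1 and would be kept under the rule above, while the second falls just short of it.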

In [None]:
# calculating the variance of the factors
value="Variance of factors"

if __name__ == '__main__':
    try:
        display(Markdown('<div><div class="loader"></div><h2> &nbsp; LOADING</h2></div>'))
        if latent_fact>100:
            clear_output()
            print(colored('Cannot display the result due to large number of factors!','red'))
        else:
            # generating the variances across each factors
            matrix_fa_var = fa_threshold_fact.get_factor_variance()

            # naming the indexes
            index_names = ['Sum of Square (SS) Loading', 'Variance', 'Cumulative Variance']

            # store the matrix in a dataframe
            df_fa_var = pd.DataFrame(data = matrix_fa_var, 
                                     columns = ['fact_{}'.format(i) for i in range(latent_fact)],
                                     index = index_names)
            clear_output()
            # displaying the necessary information
            display(df_fa_var.head(10).round(3))

        track_cell(value, flag)
    except Exception as err:
        clear_output()
        print(colored('ERROR!','red',attrs=['bold']), colored('{}'.format(err),'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
else:
    pass

## Uniqueness and Communality

__Uniqueness__ is the variance that is 'unique' to the variable and not shared with other variables. It ranges from 0 to 1. _A high uniqueness value implies that the variable __doesn't fit neatly__ into our factors_.

If we subtract the uniqueness from 1, we get __communality__. The communality of a variable is the proportion of variance of that variable contributed by the common factors.
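
As a small worked example, assuming standardized variables and a made-up loading matrix, the communality of a variable is the sum of its squared loadings across the factors, and the uniqueness is whatever remains:

```python
import numpy as np

# hypothetical loadings: 4 variables (rows) on 2 factors (columns)
loadings = np.array([[0.9, 0.1],
                     [0.8, 0.2],
                     [0.1, 0.7],
                     [0.2, 0.6]])

# communality: variance of each variable explained by the common factors
communality = (loadings ** 2).sum(axis=1)
# uniqueness: the remaining, variable-specific share
uniqueness = 1 - communality

print(communality, uniqueness)
```

Here the last variable has the highest uniqueness, so it is the one that fits the two factors least well.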

In [None]:
# obtaining the factors
value = 'uniqueness communality'

if __name__ == '__main__':
    try:
        fa_fact = Factor(endog=df_fact_anal, n_factor=latent_fact)

        # fit() returns the FactorResults object directly
        fa_fact_result = fa_fact.fit()

        # finding uniqueness and communality
        df_fa_unq_comm = pd.DataFrame({'columns':df_fact_anal.columns, 'uniqueness':fa_fact_result.uniqueness,
                                      'communality':fa_fact_result.communality})

        display_data(df_fa_unq_comm.round(4))

        track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
else:
    pass

## Factor Rotation

##### Step 3 : Factor Rotation

Here, we can perform factor rotation using different rotation techniques. By default, it performs `varimax`. Refer to the available sets of rotations [here](#rot_types).
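
To show what an orthogonal rotation does to a loading matrix, here is a minimal sketch of the widely used SVD-based varimax iteration. It is not the `Rotator` implementation used in the cell below: it skips Kaiser normalization, and the function name `varimax` and the toy loadings are assumptions for illustration.

```python
import numpy as np

def varimax(loadings, max_iter=50, tol=1e-6):
    """Plain varimax (no Kaiser normalization); returns rotated loadings and the rotation matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # gradient of the varimax criterion, projected back onto orthogonal matrices via SVD
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        previous, criterion = criterion, s.sum()
        # stop once the criterion no longer improves
        if previous != 0 and criterion / previous < 1 + tol:
            break
    return loadings @ rotation, rotation

# hypothetical unrotated loadings: every variable loads on both factors
unrotated = np.array([[0.7, 0.5],
                      [0.8, 0.4],
                      [0.6, -0.5],
                      [0.7, -0.6]])

rotated, rotation = varimax(unrotated)
print(rotated.round(3))
```

Because the rotation matrix is orthogonal, each variable keeps its communality; only the way that variance is split between the factors changes.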

In [None]:
# performing factor rotation
value="Factor Rotation"

if __name__ == '__main__':
    try:
        display(Markdown('<div><div class="loader"></div><h2> &nbsp; PROCESSING</h2></div>'))

        if latent_fact>100 :
            clear_output()
            print(colored('Cannot display the result due to large number of factors!','red'))
        else:
            # default method = 'varimax'
            fa_rotator = Rotator(method='varimax')
            # generate those factor values in a matrix
            matrix_rot_fa_threshold_fact = fa_rotator.fit_transform(fa_threshold_fact.loadings_)
            # store the matrix in a dataframe
            df_rot_fa_threshold_fact = pd.DataFrame(data = matrix_rot_fa_threshold_fact, 
                                                    columns = ['fact_{}'.format(i) for i in range(latent_fact)],
                                                    index = df_fact_anal.columns)
            clear_output()

            # generate and display the plot
            fig_height = 20
            col_count = df_rot_fa_threshold_fact.shape[0]
            if col_count > 30:
                fig_height = 30*((col_count/30) - (col_count//30)) + 30
            else:
                pass

            plt.figure(figsize=(20,fig_height))
            sns.heatmap(df_rot_fa_threshold_fact.round(3), cmap='RdYlGn', annot=True, alpha=0.9)
            clear_output()
            # display the heatmap
            display(Markdown('__Factor Rotation Heatmap__ '))
            plt.show()        

        track_cell(value, flag)
    except Exception as err:
        clear_output()
        print(colored('ERROR!','red',attrs=['bold']), colored('{}'.format(err),'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
else:
    pass

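Once the rotated loadings are obtained, we compare the variance of each factor's loadings between the __unrotated factor loading__ and the __rotated factor loading__, and check the sum of these variances. Because `varimax` is an orthogonal rotation, the sum stays the same and only its distribution across the factors can change. If we notice __changes in the individual variances__, the rotation has __redistributed the loadings across the factors__; otherwise, the unrotated solution already had that structure and did not require rotation.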
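
The comparison can be sketched on a toy example. The snippet below assumes a hypothetical two-factor loading matrix and an arbitrary 45 degree orthogonal rotation standing in for the `varimax` result; it uses `statistics.variance`, which is the sample variance (the same default as `DataFrame.var()`).

```python
import math
from statistics import variance

# hypothetical unrotated loadings: 4 variables x 2 factors
unrotated = [(0.7, 0.5), (0.6, 0.6), (0.6, -0.5), (0.7, -0.6)]

# orthogonal rotation by 45 degrees (a stand-in for the varimax solution)
theta = math.radians(45)
c, s = math.cos(theta), math.sin(theta)
rotated = [(a * c + b * s, -a * s + b * c) for a, b in unrotated]

def factor_variances(loadings):
    """Sample variance of each factor's loadings (one value per factor)."""
    return [variance([row[j] for row in loadings]) for j in range(2)]

var_before = factor_variances(unrotated)
var_after = factor_variances(rotated)

print('unrotated:', [round(v, 3) for v in var_before],
      'sum =', round(sum(var_before), 3))
print('rotated:  ', [round(v, 3) for v in var_after],
      'sum =', round(sum(var_after), 3))
```

The individual variances differ between the two rows, while the two sums agree up to floating-point error.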

In [None]:
# comparing the variances of the factor loadings before and after rotation
value="Validating variance of factors"

if __name__ == '__main__':
    try:
        display(Markdown('<div><div class="loader"></div><h2> &nbsp; LOADING</h2></div>'))
        if latent_fact > 100 :
            clear_output()
            print(colored('Cannot display the result due to large number of factors!','red'))
        else:
            # generate the series for the unrotated loadings
            df_fa_var = pd.DataFrame(data = df_fa_threshold_fact.var(),
                                            columns = ['Unrotated FA Variance'])

            # generate the series for the rotated loadings
            df_fa_rot_var = pd.DataFrame(data = df_rot_fa_threshold_fact.var(),
                                            columns = ['Rotated FA Variance'])

            # merge the 2 series
            df_fa_result = pd.concat([df_fa_var, df_fa_rot_var], axis=1)

            # generating the sum of variances for each columns
            df_fa_result.loc['sum_of_variance',:] = df_fa_result.sum(axis=0)
            clear_output()
            # displaying the result
            display(df_fa_result.T.round(3))

        track_cell(value, flag)
    except Exception as err:
        clear_output()
        print(colored('ERROR!','red',attrs=['bold']), colored('{}'.format(err),'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
        
else:
    pass

> __Notes__:
 
```
*Add notes here*
```

***

# Generating HTML Report

Execute the __following two cells__ to generate an HTML report of this notebook. The report will __not display any code cells__.

__NOTE__ : 
1. Ensure you have __saved the notebook with the latest checkpoint__ before running these cells.
2. If you get an error while rendering, __always__ re-execute both cells, in order.
3. Depending on your __Jupyter nbconvert version__, you might need to switch the `--to` format between `html_embed` and `html` for proper rendering.
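
As a rough illustration of the conversion step, the sketch below builds the `jupyter nbconvert` call as an argument list, which avoids shell quoting problems when the notebook name contains spaces. The helper `build_nbconvert_command` and the notebook name are hypothetical; the `toc2` template is assumed to be installed, as in the cell further down.

```python
import os
import subprocess

def build_nbconvert_command(notebook_path, to_format='html', template=None):
    """Return the nbconvert call as an argument list (no shell quoting needed)."""
    command = ['jupyter', 'nbconvert', notebook_path,
               '--no-input', '--no-prompt', '--to', to_format]
    if template is not None:
        command += ['--template', template]
    return command

# hypothetical notebook name containing spaces
notebook_path = os.path.join(os.getcwd(), 'EDA Panel Data.ipynb')
command = build_nbconvert_command(notebook_path, template='toc2')
print(command)

# uncomment to run the conversion; check=True raises CalledProcessError on failure
# subprocess.run(command, check=True)
```

Passing `to_format='html_embed'` instead of the default covers the alternative mentioned in note 3.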

In [1]:
%%javascript
IPython.notebook.kernel.execute(`notebook_name = '${IPython.notebook.notebook_name}'`);

In [3]:
value="Generating HTML report"

if __name__ == '__main__':
    try:
        display(Markdown('<div><div class="loader"></div><h2> &nbsp; LOADING</h2></div>'))

        # build a platform-independent path to the notebook
        notebook_path = os.path.join(os.getcwd(), notebook_name)

        # quote the path so that names containing spaces work on every platform
        exit_code = os.system('jupyter nbconvert "{}" --no-input --no-prompt --template toc2 --to=html'.format(notebook_path))

        # os.system does not raise on failure, so check the exit code explicitly
        if exit_code != 0:
            raise RuntimeError('jupyter nbconvert exited with code {}'.format(exit_code))

        clear_output()
        display(Markdown('<span style="color:darkgreen; font-size: 15px"><b><i>HTML Report generated and saved in your current local folder!</i></b></span>'))
        track_cell(value, flag)
    except Exception as err:
        # display the error
        clear_output()
        print(colored('\nERROR:','red',attrs=['bold']),colored(err,'grey'))
        flag = 0
        err = str(err)
        track_cell(value, flag, err)
else:
    pass

***

# Executive Summary

```
*Add summary here*

```

<div style="text-align: right;">
    <i>&copy; 2020 Copyright Mu Sigma Inc.</i>
</div>