# Introduction
<a id='top'></a>

<a href=#columns>Comparing columns</a>

<a href=#rows>Comparing rows</a>

<a href=#sums>Comparing sums</a>

<a href=#conclusions>Conclusion</a>

<a href=#End>End</a>

This notebook compares the output of the data cleaning and merging step for the static-year version against the rolling-month version. Both files have the whole of 2018 as the target year.

# Loading data

In [1]:
import pandas as pd
import numpy as np
from IPython.display import HTML

In [2]:
cols_key = ['date_month', 'id_company', 'id_branch']

In [3]:
# Explicit dtypes for the static-year file; columns not listed here are inferred by pandas.
dtype_static = {'id_company': np.float64, 'id_branch': np.int64, 'is_discontinued': bool,
                'code_discontinuation': np.float64, 'code_financial_calamity': object,
                'financial_calamity_outcome': np.float64, 'code_legal_form': np.float64,
                'qty_employees': np.float64, 'year_qty_employees': np.float64,
                'id_company_creditproxy': object, 'score_payment_assessment': np.float64,
                'amt_revenue': np.float64, 'year_revenue': np.float64,
                'amt_operating_result': np.float64, 'year_operating_result': object,
                'amt_consolidated_revenue': np.float64, 'year_consolidated_revenue': object,
                'amt_consolidated_operating_result': np.float64, 'year_consolidated_operating_result': object,
                'qty_issued_credit_reports': np.float64, 'perc_credit_limit_adjustment': object,
                'color_credit_status': object, 'rat_pd': object, 'score_pd': np.float64,
                'has_increased_risk': bool, 'is_sole_proprietor': bool, 'code_sbi_2': np.float64,
                'qty_address_mutations_total': np.float64, 'qty_address_mutations_month': np.float64,
                'has_relocated': bool, 'qty_started_names': np.float64, 'qty_stopped_names': np.float64,
                'total_changeof_board_members_': np.float64}

df_static = pd.read_csv('2017_merged.csv', dtype=dtype_static)

In [5]:
dtype_rolling = {'id_company':np.float64,  'id_branch':np.int64, 'is_discontinued':bool,
                 'code_discontinuation':np.float64, 'code_financial_calamity':object,
                  'financial_calamity_outcome':np.float64, 'code_legal_form':np.float64,
                  'qty_employees':np.float64, 'year_qty_employees':np.float64,
                  'id_company_creditproxy':object, 'score_payment_assessment':np.float64,
                  'amt_revenue':np.float64, 'year_revenue':np.float64,
                  'amt_operating_result':np.float64, 'year_operating_result':object,
                  'amt_consolidated_revenue':np.float64, 'year_consolidated_revenue':object,
                  'amt_consolidated_operating_result':np.float64, 'year_consolidated_operating_result':object,
                  'qty_issued_credit_reports':np.float64, 'perc_credit_limit_adjustment':object,
                  'color_credit_status':object, 'rat_pd':object, 'score_pd': np.float64,
                  'has_increased_risk':bool, 'is_sole_proprietor':bool, 'code_sbi_2':np.float64,
                  'qty_address_mutations_total':np.float64, 'qty_address_mutations_month':np.float64,
                  'has_relocated':bool, 'qty_started_names':np.float64, 'qty_stopped_names':np.float64,
                  'total_changeof_board_members_':np.float64 }

df_rolling = pd.read_csv('cleaned_merged_2018-01-01.csv', dtype=dtype_rolling)

<a id='columns'></a>
# Comparing columns
<a href=#top>Top</a>

Is the number of columns the same, and if not, how do the column sets differ?

In [6]:
qty_cols_static = df_static.shape[1]
qty_cols_rolling = df_rolling.shape[1]
print("Columns for static:", "{:,}".format(qty_cols_static).replace(',', '.'), 
      "versus columns for rolling:", "{:,}".format(qty_cols_rolling).replace(',', '.'), 
      ", meaning there is a difference of:", "{:,}".format(qty_cols_rolling - qty_cols_static).replace(',', '.'), "columns.")

Columns for static: 44 versus columns for rolling: 48 , meaning there is a difference of: 4 columns.


Columns in static, but not in rolling:

In [7]:
df_static.columns.difference(df_rolling.columns)

Index([], dtype='object')

No columns are missing from the rolling dataset.

Columns in rolling, but not in static:

In [8]:
df_rolling.columns.difference(df_static.columns)

Index(['date_dataset', 'date_month_last', 'years_current_location',
       'years_previous_location'],
      dtype='object')

## Comparing data-types

In [17]:
cols_intersection = df_static.columns.intersection(df_rolling.columns).tolist()
df_rolling[cols_intersection].dtypes == df_static[cols_intersection].dtypes

Unnamed: 0                             True
date_month                             True
id_company                             True
id_branch                              True
date_established                       True
is_discontinued                        True
code_discontinuation                   True
code_financial_calamity                True
date_financial_calamity_started        True
date_financial_calamity_stopped        True
financial_calamity_outcome             True
code_legal_form                        True
qty_employees                          True
year_qty_employees                     True
id_company_creditproxy                 True
score_payment_assessment               True
amt_revenue                            True
year_revenue                           True
amt_operating_result                   True
year_operating_result                  True
amt_consolidated_revenue               True
year_consolidated_revenue              True
amt_consolidated_operating_resul

All shared columns have the same dtype except has_relocated, which is a rather important one.
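
With this many columns, the full True/False listing is hard to scan. A minimal sketch of how the comparison could be reduced to only the mismatching columns is shown here; the toy frames and the names toy_static, toy_rolling, dtype_cmp and dtype_mismatch are hypothetical stand-ins, not part of the original analysis.

```python
import pandas as pd

# Toy stand-ins for df_static and df_rolling (hypothetical data).
toy_static = pd.DataFrame({"id_branch": [1, 2], "has_relocated": [True, False]})
toy_rolling = pd.DataFrame({"id_branch": [1, 2], "has_relocated": [True, None]})

shared = toy_static.columns.intersection(toy_rolling.columns)

# Put both dtype listings side by side, as strings, indexed by column name.
dtype_cmp = pd.DataFrame({
    "static": toy_static[shared].dtypes.astype(str),
    "rolling": toy_rolling[shared].dtypes.astype(str),
})

# Keep only the columns whose dtypes differ between the two frames.
dtype_mismatch = dtype_cmp[dtype_cmp["static"] != dtype_cmp["rolling"]]
print(dtype_mismatch)
```

Applied to the real dataframes, the same filter would list only the columns that need attention instead of every shared column.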

In [24]:
print("Rolling dtype:", df_rolling['has_relocated'].dtypes, "\n", 
      "Static dtype:", df_static['has_relocated'].dtypes)

Rolling dtype: object 
 Static dtype: bool


In [21]:
df_rolling['has_relocated'].value_counts()

False    21090118
True      1057676
Name: has_relocated, dtype: int64

In [22]:
df_static['has_relocated'].value_counts()

False    22605016
True       124746
Name: has_relocated, dtype: int64
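
The counts above do not explain why the rolling column ends up as object. Since value_counts() leaves out missing values by default, one possible cause is missing or mixed-type entries. The sketch below shows how that could be checked on a toy column; the data and the names toy, counts_with_missing, type_counts and toy_nullable are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical object column mixing booleans with a missing value.
toy = pd.Series([True, False, None, True], dtype=object, name="has_relocated")

# value_counts() drops missing values by default; dropna=False makes them visible.
counts_with_missing = toy.value_counts(dropna=False)

# Count the Python types that are actually stored in the column.
type_counts = toy.map(type).value_counts()

# The nullable "boolean" dtype keeps missing values as <NA> instead of using object.
toy_nullable = toy.astype("boolean")

print(counts_with_missing)
print(type_counts)
print(toy_nullable.dtype)
```

Whether the real column contains missing values, strings or something else is not known from the outputs above; the type counts would show it.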

### Conclusion
No columns are missing from the rolling dataframe, but it has a few extra ones. This is expected, because there are:
* New columns added for identifying the dataset:
   - date_dataset
* Columns that are created in the 'Clean and Merge' phase instead of the 'Aggregate and Transform' phase:
   - date_month_last 
   - years_current_location
   - years_previous_location
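
The expectation in the list above could also be written down as an explicit check, so that an unexpected column shows up immediately. This is a minimal sketch on hypothetical column indexes; cols_static, cols_rolling, cols_expected_extra, cols_missing and cols_unexpected are illustrative names, and the shortened column lists are not the real ones.

```python
import pandas as pd

# Hypothetical, shortened stand-ins for df_static.columns and df_rolling.columns.
cols_static = pd.Index(["date_month", "id_company", "id_branch"])
cols_rolling = pd.Index(["date_month", "id_company", "id_branch", "date_dataset",
                         "date_month_last", "years_current_location",
                         "years_previous_location"])

# The extra columns that are expected by design.
cols_expected_extra = {"date_dataset", "date_month_last",
                       "years_current_location", "years_previous_location"}

# Columns the rolling data lacks, and extra columns that were not anticipated.
cols_missing = set(cols_static) - set(cols_rolling)
cols_unexpected = set(cols_rolling) - set(cols_static) - cols_expected_extra
print(cols_missing, cols_unexpected)
```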

The columns in the rolling dataframe are:

In [None]:
df_rolling.columns

<a id='rows'></a>
# Comparing rows
<a href=#top>Top</a>

In [None]:
qty_rows_static = df_static.shape[0]
qty_rows_rolling = df_rolling.shape[0]
print("Rows for static:", "{:,}".format(qty_rows_static).replace(',', '.'), 
      "versus rows for rolling:", "{:,}".format(qty_rows_rolling).replace(',', '.'), 
      ", meaning there is a difference of:", "{:,}".format(qty_rows_rolling - qty_rows_static).replace(',', '.'), "rows.")

## Conclusion
So the rolling dataframe has far more rows. Are there repeating keys, meaning duplicate combinations of month, company and branch?
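
A minimal sketch of how this question could be answered is given here, using the same three key columns on hypothetical miniature frames. It counts duplicate keys within each frame and then checks which distinct keys occur in only one of the two. The toy data and the names qty_dup_static, qty_dup_rolling and keys_merged are assumptions, not results from the real files.

```python
import pandas as pd

cols_key = ["date_month", "id_company", "id_branch"]

# Hypothetical miniature versions of the two datasets.
toy_static = pd.DataFrame({
    "date_month": ["2018-01-01", "2018-01-01", "2018-02-01"],
    "id_company": [1.0, 2.0, 1.0],
    "id_branch": [10, 20, 10],
})
toy_rolling = pd.DataFrame({
    "date_month": ["2018-01-01", "2018-01-01", "2018-01-01", "2018-02-01", "2018-03-01"],
    "id_company": [1.0, 1.0, 2.0, 1.0, 1.0],
    "id_branch": [10, 10, 20, 10, 10],
})

# Rows whose key already occurred earlier in the same frame.
qty_dup_static = int(toy_static.duplicated(subset=cols_key).sum())
qty_dup_rolling = int(toy_rolling.duplicated(subset=cols_key).sum())

# Outer-join the distinct keys; the _merge column says where each key occurs.
keys_merged = toy_static[cols_key].drop_duplicates().merge(
    toy_rolling[cols_key].drop_duplicates(),
    on=cols_key, how="outer", indicator=True,
)

print("Duplicate keys, static:", qty_dup_static, "rolling:", qty_dup_rolling)
print(keys_merged["_merge"].value_counts())
```

Duplicate keys and keys that exist on one side only are two separate explanations for a row-count difference, which is why the sketch reports them separately.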

<a id='sums'></a>
# Comparing sums
<a href=#top>Top</a>

Check whether the summable columns yield the same totals in both datasets.

In [None]:
#cols_summable = ['qty_employees', 'year_qty_employees', 'score_payment_assessment', 'amt_revenue', 'year_revenue',
#                 'amt_operating_result', 'year_operating_result', 'amt_consolidated_revenue', 'year_consolidated_revenue',
#                 'amt_consolidated_operating_result', 'year_consolidated_operating_result', 'qty_issued_credit_reports', 
#                 'perc_credit_limit_adjustment', 'rat_pd', 'score_pd', 'has_increased_risk', 'is_sole_proprietor',  
#                 'code_sbi_2', 'code_sbi_1', 'qty_address_mutations_total', 'qty_address_mutations_month',
#                 'date_start', 'from_date_start', 'qty_started_names', 'qty_stopped_names', 'has_name_change',
#                 'total_changeof_board_members_', 'has_relocated']

cols_summable = ['qty_employees', 'year_qty_employees', 'score_payment_assessment', 'amt_revenue', 'year_revenue',
                 'amt_operating_result', 'year_operating_result', 'amt_consolidated_revenue', 'year_consolidated_revenue',
                 'amt_consolidated_operating_result', 'year_consolidated_operating_result', 'qty_issued_credit_reports', 
                 'rat_pd', 'score_pd', 'qty_address_mutations_total', 'qty_address_mutations_month']

In [None]:
sums_static = df_static[cols_summable].sum()
sums_rolling = df_rolling[cols_summable].sum()
sums_static == sums_rolling
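
Two things can make the exact comparison above misleading: several of the listed columns are read as object, and floating-point sums over rows in a different order can differ in the last digits. The following sketch, on hypothetical toy frames, coerces the columns to numeric first and then compares with a tolerance. The helper numeric_sums and the toy data are illustrative assumptions, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: same values in a different row order, one column stored as text.
toy_static = pd.DataFrame({"amt_revenue": [0.1, 0.2, 0.3],
                           "year_revenue": ["2016", "2017", None]})
toy_rolling = pd.DataFrame({"amt_revenue": [0.3, 0.2, 0.1],
                            "year_revenue": ["2017", None, "2016"]})


def numeric_sums(df, cols):
    """Coerce each column to numeric (unparseable values become NaN) and sum it."""
    return df[cols].apply(pd.to_numeric, errors="coerce").sum()


cols = ["amt_revenue", "year_revenue"]
sums_static_toy = numeric_sums(toy_static, cols)
sums_rolling_toy = numeric_sums(toy_rolling, cols)

# Compare with a tolerance instead of exact equality.
sums_close = pd.Series(
    np.isclose(sums_static_toy.to_numpy(), sums_rolling_toy.to_numpy()),
    index=cols,
)
print(sums_close)
```

Note that coercion silently turns unparseable values into NaN, so a count of NaN values per column before and after coercion would be a sensible companion check.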

<a id='conclusions'></a>
# Conclusion
<a href=#top>Top</a>

I looked at three criteria:
* The columns: the rolling month dataset has additional columns, but these are there by design.
* The rows: the number of rows is equal, so that is alright. 
* Sums of some summable columns: 