# Introduction
<a id='top'></a>

<a href=#columns>Comparing columns</a>

<a href=#rows>Comparing rows</a>

<a href=#sums>Comparing sums</a>

<a href=#conclusions>Conclusion</a>

<a href=#End>End</a>

The purpose of this notebook is to compare the results of data cleaning and merging for the static year version with those for the rolling month version. Both files have the whole of 2018 as the target year.

# Loading data

In [1]:
import pandas as pd
import numpy as np
from IPython.display import HTML

In [2]:
cols_key = ['date_month', 'id_company', 'id_branch']

In [20]:
df_static = pd.read_csv('2017_merged.csv')



In [4]:
df_rolling = pd.read_csv('cleaned_merged_2018-01-01.csv')



<a id='columns'></a>
# Comparing columns
<a href=#top>Top</a>

Is the number of columns the same, and if not, how do the two dataframes differ?

In [22]:
qty_cols_static = df_static.shape[1]
qty_cols_rolling = df_rolling.shape[1]
print("Columns for static:", "{:,}".format(qty_cols_static).replace(',', '.'), 
      "versus columns for rolling:", "{:,}".format(qty_cols_rolling).replace(',', '.'), 
      ", meaning there is a difference of:", "{:,}".format(qty_cols_rolling - qty_cols_static).replace(',', '.'), "columns.")

Columns for static: 44 versus columns for rolling: 49 , meaning there is a difference of: 5 columns.


Columns in static, but not in rolling:

In [23]:
df_static.columns.difference(df_rolling.columns)

Index([], dtype='object')

No columns are missing from the rolling dataset.

Columns in rolling, but not in static:

In [24]:
df_rolling.columns.difference(df_static.columns)

Index(['date_dataset_x', 'date_dataset_y', 'date_month_last',
       'years_current_location', 'years_previous_location'],
      dtype='object')

## Conclusion
There are no missing columns in the rolling dataframe, but there are five extra ones. This is as expected, since they consist of:
* New columns added for identifying the dataset:
   - date_dataset_x 
   - date_dataset_y
* Columns that are created in the 'Clean and Merge' phase instead of the 'Aggregate and Transform' phase:
   - date_month_last 
   - years_current_location
   - years_previous_location
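
The expectation above can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming small toy dataframes in place of df_static and df_rolling; the helper name `compare_columns` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def compare_columns(df_a, df_b):
    """Return (columns only in df_a, columns only in df_b) as sorted lists."""
    only_a = sorted(df_a.columns.difference(df_b.columns))
    only_b = sorted(df_b.columns.difference(df_a.columns))
    return only_a, only_b

# Toy stand-ins for df_static and df_rolling (illustrative data only)
toy_static = pd.DataFrame({'date_month': ['2018-01'], 'id_company': [1]})
toy_rolling = pd.DataFrame({'date_month': ['2018-01'], 'id_company': [1],
                            'date_month_last': ['2018-01']})

# The set of extra columns that is expected by design (assumed for the toy data)
expected_extra = {'date_month_last'}

missing, extra = compare_columns(toy_static, toy_rolling)
assert missing == [], "columns missing from the rolling dataframe"
assert set(extra) == expected_extra, "unexpected extra columns"
```

Applied to the real dataframes, `expected_extra` would hold the five column names listed above, so the cell fails loudly if a later run adds or drops a column.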

The columns in the rolling dataframe are:

In [25]:
df_rolling.columns

Index(['Unnamed: 0', 'date_month', 'id_company', 'id_branch',
       'date_established', 'is_discontinued', 'code_discontinuation',
       'code_financial_calamity', 'date_financial_calamity_started',
       'date_financial_calamity_stopped', 'financial_calamity_outcome',
       'code_legal_form', 'qty_employees', 'year_qty_employees',
       'id_company_creditproxy', 'score_payment_assessment', 'amt_revenue',
       'year_revenue', 'amt_operating_result', 'year_operating_result',
       'amt_consolidated_revenue', 'year_consolidated_revenue',
       'amt_consolidated_operating_result',
       'year_consolidated_operating_result', 'qty_issued_credit_reports',
       'perc_credit_limit_adjustment', 'color_credit_status', 'rat_pd',
       'score_pd', 'has_increased_risk', 'is_sole_proprietor', 'code_sbi_2',
       'code_sbi_1', 'qty_address_mutations_total',
       'qty_address_mutations_month', 'date_start', 'from_date_start',
       'qty_started_names', 'qty_stopped_names', 'has_name_c

<a id='rows'></a>
# Comparing rows
<a href=#top>Top</a>

In [26]:
qty_rows_static = df_static.shape[0]
qty_rows_rolling = df_rolling.shape[0]
print("Rows for static:", "{:,}".format(qty_rows_static).replace(',', '.'), 
      "versus rows for rolling:", "{:,}".format(qty_rows_rolling).replace(',', '.'), 
      ", meaning there is a difference of:", "{:,}".format(qty_rows_rolling - qty_rows_static).replace(',', '.'), "rows.")

Rows for static: 22.729.762 versus rows for rolling: 22.729.762 , meaning there is a difference of: 0 rows.
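
Equal row counts do not by themselves guarantee that both dataframes contain the same keys. One possible way to check this is an outer merge on the key columns with `indicator=True`. This is only a sketch on toy data; the helper name `compare_keys` is hypothetical, and running it on the full dataframes may be memory-intensive.

```python
import pandas as pd

cols_key = ['date_month', 'id_company', 'id_branch']

def compare_keys(df_a, df_b, key_cols):
    """Count key combinations found only in df_a, only in df_b, or in both."""
    keys_a = df_a[key_cols].drop_duplicates()
    keys_b = df_b[key_cols].drop_duplicates()
    # indicator=True adds a '_merge' column with left_only / right_only / both
    merged = keys_a.merge(keys_b, on=key_cols, how='outer', indicator=True)
    counts = merged['_merge'].value_counts()
    return {name: int(counts.get(name, 0))
            for name in ('left_only', 'right_only', 'both')}

# Toy data: same number of rows, but one key differs (illustrative only)
toy_static = pd.DataFrame({'date_month': ['2018-01', '2018-02'],
                           'id_company': [1, 1],
                           'id_branch': [0, 0]})
toy_rolling = pd.DataFrame({'date_month': ['2018-01', '2018-02'],
                            'id_company': [1, 2],
                            'id_branch': [0, 0]})

key_counts = compare_keys(toy_static, toy_rolling, cols_key)
print(key_counts)
```

If `left_only` and `right_only` are both zero, the two dataframes cover exactly the same month, company and branch combinations.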


## Conclusion
The number of rows is identical in both dataframes. Are there repeating keys (meaning month, company and branch)?
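
The question about repeating keys could be answered with `DataFrame.duplicated` on the key columns. A minimal sketch on toy data, assuming the same `cols_key` as defined at the top of the notebook; the helper name `count_duplicate_keys` is hypothetical.

```python
import pandas as pd

cols_key = ['date_month', 'id_company', 'id_branch']

def count_duplicate_keys(df, key_cols):
    """Count rows whose key combination already occurred earlier in df."""
    return int(df.duplicated(subset=key_cols).sum())

# Toy data with one repeated key (illustrative only)
toy = pd.DataFrame({
    'date_month': ['2018-01', '2018-01', '2018-02'],
    'id_company': [1, 1, 1],
    'id_branch': [0, 0, 0],
})

qty_duplicates = count_duplicate_keys(toy, cols_key)
print("Duplicate keys:", qty_duplicates)
```

A result of zero for both df_static and df_rolling would mean every month, company and branch combination occurs at most once.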

<a id='sums'></a>
# Comparing sums
<a href=#top>Top</a>

Check whether the summable columns yield the same sums in both dataframes.

In [None]:
cols_summable = ['qty_employees', 'year_qty_employees', 'score_payment_assessment', 'amt_revenue', 'year_revenue',
                 'amt_operating_result', 'year_operating_result', 'amt_consolidated_revenue', 'year_consolidated_revenue',
                 'amt_consolidated_operating_result', 'year_consolidated_operating_result', 'qty_issued_credit_reports', 
                 'perc_credit_limit_adjustment', 'rat_pd', 'score_pd', 'has_increased_risk', 'is_sole_proprietor',  
                 'code_sbi_2', 'code_sbi_1', 'qty_address_mutations_total', 'qty_address_mutations_month',
                 'date_start', 'from_date_start', 'qty_started_names', 'qty_stopped_names', 'has_name_change',
                 'total_changeof_board_members_', 'has_relocated']

In [36]:
sums_static = df_static[cols_summable].sum()
sums_rolling = df_rolling[cols_summable].sum()
sums_static == sums_rolling

qty_employees         True
year_qty_employees    True
dtype: bool
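
Exact equality with `==` can report a mismatch for float columns when the rows are summed in a different order. A tolerance-based comparison with `numpy.isclose` is one possible alternative. This is a sketch on toy data, not the notebook's own method; the helper name `compare_sums` and the tolerance value are assumptions.

```python
import numpy as np
import pandas as pd

def compare_sums(df_a, df_b, cols, rtol=1e-9):
    """Compare per-column sums of two dataframes within a relative tolerance."""
    sums_a = df_a[cols].sum()
    sums_b = df_b[cols].sum()
    # Both sums share the same index (cols), so positional comparison is safe
    is_close = np.isclose(sums_a.to_numpy(), sums_b.to_numpy(),
                          rtol=rtol, equal_nan=True)
    return pd.DataFrame({'static': sums_a, 'rolling': sums_b, 'equal': is_close})

# Toy data: same values in a different row order (illustrative only)
toy_static = pd.DataFrame({'qty_employees': [1.0, 2.0, np.nan],
                           'amt_revenue': [0.1, 0.2, 0.3]})
toy_rolling = pd.DataFrame({'qty_employees': [2.0, 1.0, np.nan],
                            'amt_revenue': [0.3, 0.2, 0.1]})

result = compare_sums(toy_static, toy_rolling, ['qty_employees', 'amt_revenue'])
# Show only the columns whose sums differ beyond the tolerance
print(result[~result['equal']])
```

Returning the sums next to the boolean flag makes it easier to judge how large any mismatch is.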

<a id='conclusions'></a>
# Conclusion
<a href=#top>Top</a>

I've looked at three criteria:
* The columns: there are added columns in the rolling month dataset, but they are there by design.
* The rows: the number of rows is equal, so that is alright.
* The sums: the sums of the summable columns shown in the output (qty_employees and year_qty_employees) are equal.