# How to...clean up and validate banking IDs

This notebook shows how to use the **financial_entity_cleaner.id.banking** module to validate IDs such as LEI, ISIN and SEDOL. You can use this module in three different ways:
1. [by validating single values of text as IDs](#validate_id)
2. [by cleaning & validating single values of text as IDs](#clean_validate_id)
3. [by cleaning & validating IDs in a tabular dataframe](#df)

No matter which approach you choose, you will need to import and create an object based on the **BankingIdCleaner()** class which is available in the **financial_entity_cleaner.id.banking** module. This notebook shows how you can customize the behaviour of this class to adapt the cleaning to your own needs.   

<div class="alert alert-block alert-info">
<b>Note:</b> The complete documentation of the <b>financial-entity-cleaner</b> is available at <a href="https://financial-entity-cleaner.readthedocs.io/en/latest/" title="ReadTheDocs">ReadTheDocs</a>.</div>

In [1]:
# Sets up the location of the financial-entity-cleaner library relative to this notebook 
import sys
sys.path.append('../../')

In [2]:
# Import the BankingIdCleaner() class for ID validation
from financial_entity_cleaner.id.banking import BankingIdCleaner

In [3]:
# Create an object based on BankingIdCleaner() class to perform validation over string values, dataframe or .csv file
id_cleaner_obj = BankingIdCleaner()

To see all the supported ID types:

In [4]:
# Check the ID's supported by the library
id_cleaner_obj.get_types()

['lei', 'isin', 'sedol']

<div class="alert alert-block alert-danger">
<b>EXCEPTION:</b> The library throws an exception if the ID type is not supported.
</div>

In [5]:
id_cleaner_obj.id_type='test'

TypeOfBankingIdNotSupported: Financial-Entity-Cleaner (Error) - The ID type <test> is not supported.

## 1. Validating single values of text as IDs <a id="validate_id"></a>

Use the **is_valid()** method to verify whether an ID is valid. This method returns:
- None if the value is not a string or has no characters in it
- True if the value is a valid ID of the specified type
- False if the value is not a valid ID of the specified type

By default, the library assumes that the value passed as parameter is an ISIN code. 

In [None]:
# Checking the default type
print(id_cleaner_obj.id_type)

In [None]:
# Testing an invalid ISIN code
print(id_cleaner_obj.is_valid('tttt0B1YW4409'))
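
For intuition about what an ISIN check involves, here is a minimal standalone sketch of the standard ISIN rule (two-letter prefix, nine alphanumeric characters, one check digit verified with the Luhn algorithm). It is not taken from the library, whose internal implementation may differ, and the function name `isin_checksum_ok` is hypothetical.

```python
import re

# Standard ISIN layout: 2 letters, 9 alphanumeric characters, 1 check digit.
ISIN_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{9}[0-9]")


def isin_checksum_ok(isin: str) -> bool:
    """Hypothetical helper: True if `isin` has a valid layout and Luhn check digit."""
    if not ISIN_PATTERN.fullmatch(isin):
        return False
    # Expand letters to numbers (A=10 ... Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    # Luhn: starting from the right, double every second digit.
    for position, ch in enumerate(reversed(digits)):
        d = int(ch)
        if position % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(isin_checksum_ok("US0378331005"))   # True
print(isin_checksum_ok("tttt0B1YW4409"))  # False (wrong length and layout)
```

The second value is the one used in the cell above: it fails before any checksum is computed because it has 13 characters.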

In [None]:
# Testing a valid LEI code
id_cleaner_obj.id_type='lei'
print(id_cleaner_obj.is_valid('969500DPKGC9JE9F0820'))
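
As a rough illustration of why this LEI is considered valid, the sketch below applies the standard LEI check (20 characters, the last two being check digits, verified with the ISO 7064 MOD 97-10 scheme). This is an independent sketch, not the library's code, and `lei_checksum_ok` is a hypothetical name.

```python
import re

# Standard LEI layout: 18 alphanumeric characters followed by 2 check digits.
LEI_PATTERN = re.compile(r"[A-Z0-9]{18}[0-9]{2}")


def lei_checksum_ok(lei: str) -> bool:
    """Hypothetical helper: True if `lei` passes the MOD 97-10 check."""
    if not LEI_PATTERN.fullmatch(lei):
        return False
    # Expand letters to numbers (A=10 ... Z=35) and read the result as one integer.
    as_number = int("".join(str(int(ch, 36)) for ch in lei))
    # A valid LEI leaves a remainder of 1 when divided by 97.
    return as_number % 97 == 1


print(lei_checksum_ok("969500DPKGC9JE9F0820"))  # True
print(lei_checksum_ok("96XX00DPKGC9JE9F0820"))  # False
```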

In [None]:
# Testing a valid SEDOL code
id_cleaner_obj.id_type='sedol'
print(id_cleaner_obj.is_valid('2595708'))
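
The SEDOL check can be sketched in the same spirit. The snippet below uses the standard SEDOL rule (seven characters, no vowels, weighted sum with weights 1, 3, 1, 7, 3, 9, 1 divisible by 10). It is an illustration only: the library may implement the check differently, and `sedol_checksum_ok` is a hypothetical name.

```python
import re

# Standard SEDOL layout: 6 characters (digits or consonants) plus 1 check digit.
SEDOL_PATTERN = re.compile(r"[0-9BCDFGHJKLMNPQRSTVWXYZ]{6}[0-9]")
SEDOL_WEIGHTS = (1, 3, 1, 7, 3, 9, 1)


def sedol_checksum_ok(sedol: str) -> bool:
    """Hypothetical helper: True if the weighted sum of `sedol` is a multiple of 10."""
    if not SEDOL_PATTERN.fullmatch(sedol):
        return False
    # Letters count as A=10 ... Z=35; each character is multiplied by its weight.
    total = sum(int(ch, 36) * weight for ch, weight in zip(sedol, SEDOL_WEIGHTS))
    return total % 10 == 0


print(sedol_checksum_ok("2595708"))  # True (weighted sum is 90)
print(sedol_checksum_ok("2595709"))  # False
```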

The **operation mode** of **BankingIdCleaner()** class is set by **default** to be in **SILENT_MODE**. When in this default mode, if the ID is not a string or is empty, **None** is returned. 

In [6]:
# Validating a value that is not a string
print(id_cleaner_obj.is_valid(12345))

None


The code below imports the modes of operation from the **utils.lib** module and uses them to change the **mode property** of the **BankingIdCleaner()** object.

In [7]:
# Import the modes from utils.lib
from financial_entity_cleaner.utils.lib import ModeOfUse

In [8]:
id_cleaner_obj.mode = ModeOfUse.EXCEPTION_MODE

<div class="alert alert-block alert-danger">
<b>EXCEPTION MODE:</b> Now, the code below will throw a customized exception because the parameter of is_valid() method is not a string.
</div>

In [9]:
# Validating a value that is not a string
print(id_cleaner_obj.is_valid(12345))

BankingIdIsNotAString: Financial-Entity-Cleaner (Error) - The input data <12345> is not a string.

In [10]:
# Back to SILENT mode
id_cleaner_obj.mode = ModeOfUse.SILENT_MODE
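
The difference between the two modes can be pictured with a small standalone sketch. Everything in it (`Mode`, `NotAStringError`, `TinyIdChecker`) is a hypothetical stand-in written for this illustration; it does not reproduce the library's classes, and the validity check is only a placeholder.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for the library's modes of operation."""
    SILENT = 1
    EXCEPTION = 2


class NotAStringError(TypeError):
    """Hypothetical error raised for unusable input in EXCEPTION mode."""


class TinyIdChecker:
    """Toy checker that only demonstrates the silent/exception switch."""

    def __init__(self, mode: Mode = Mode.SILENT):
        self.mode = mode

    def is_valid(self, value):
        # Non-string or empty input is treated as unusable.
        if not isinstance(value, str) or not value.strip():
            if self.mode is Mode.EXCEPTION:
                raise NotAStringError(f"The input data <{value}> is not a non-empty string.")
            return None  # silent mode: signal "nothing to check"
        # Placeholder rule, NOT a real ID check: accept alphanumeric text.
        return value.strip().isalnum()


checker = TinyIdChecker()
print(checker.is_valid(12345))  # None

checker.mode = Mode.EXCEPTION
try:
    checker.is_valid(12345)
except NotAStringError as error:
    print(f"raised: {error}")
```

Silent mode is convenient for bulk processing where missing values are expected, while exception mode helps surface bad input early during development.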

## 2. Cleaning and validating single values of text as IDs <a id="clean_validate_id"></a>

The library can also clean and validate in a single step. In this case, it returns a dictionary with the cleaned ID and a flag indicating whether it is valid:

In [11]:
id_cleaner_obj.id_type='lei'

In [12]:
# Cleaning a valid LEI code
clean_lei = id_cleaner_obj.get_clean_data('969500DPKGC9JE9F0820')
clean_lei

{'cleaned_id': '969500DPKGC9JE9F0820', 'isvalid_id': True}

You may want to change the dictionary keys returned by get_clean_data() method. Instead of calling them 'cleaned_id' and 'isvalid_id' you can define other names by changing the properties **output_cleaned_id** and **output_validated_id**:

In [13]:
id_cleaner_obj.output_cleaned_id = 'LEI'
id_cleaner_obj.output_validated_id = 'IS_VALID'
clean_lei = id_cleaner_obj.get_clean_data('969500DPKGC9JE9F0820')
clean_lei

{'LEI': '969500DPKGC9JE9F0820', 'IS_VALID': True}

You can reset these output names at any time by calling the **reset_output_names()** method of your BankingIdCleaner() object:

In [14]:
id_cleaner_obj.reset_output_names()

In [15]:
clean_lei = id_cleaner_obj.get_clean_data('969500DPKGC9JE9F0820')
clean_lei

{'cleaned_id': '969500DPKGC9JE9F0820', 'isvalid_id': True}

You may also want to return NaN if the ID is invalid. For this, set the property **invalid_ids_as_nan** to True. By default, it is set to False and, therefore, the get_clean_data() method will always return the ID text:

In [16]:
id_cleaner_obj.invalid_ids_as_nan = True
clean_lei = id_cleaner_obj.get_clean_data('96XX00DPKGC9JE9F0820')
clean_lei

{'cleaned_id': nan, 'isvalid_id': False}
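
To make the options of this section concrete, the following standalone sketch mimics the behaviour shown above: it returns a dictionary with configurable key names and can replace invalid IDs with NaN. It is not the library's implementation. The function names are hypothetical, and the cleaning rule (drop non-alphanumeric characters, then upper-case) is an assumption.

```python
import math


def lei_checksum_ok(lei: str) -> bool:
    """Standard LEI check (ISO 7064 MOD 97-10); expects ASCII alphanumeric input."""
    if len(lei) != 20:
        return False
    return int("".join(str(int(ch, 36)) for ch in lei)) % 97 == 1


def clean_and_validate(value, cleaned_key="cleaned_id", valid_key="isvalid_id",
                       invalid_as_nan=False):
    """Hypothetical helper that cleans a LEI candidate and reports its validity."""
    if not isinstance(value, str) or not value.strip():
        return None
    # Assumed cleaning rule: keep ASCII letters and digits only, then upper-case.
    cleaned = "".join(ch for ch in value if ch.isascii() and ch.isalnum()).upper()
    is_valid = lei_checksum_ok(cleaned)
    if invalid_as_nan and not is_valid:
        cleaned = math.nan
    return {cleaned_key: cleaned, valid_key: is_valid}


print(clean_and_validate("969500DPKGC9JE9F0820", cleaned_key="LEI", valid_key="IS_VALID"))
print(clean_and_validate("96XX00DPKGC9JE9F0820", invalid_as_nan=True))
```

Keeping the invalid text (the default) is useful for auditing, whereas NaN makes it easy to filter out unusable IDs later.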

## 3. Cleaning and validating IDs on tabular dataframe <a id="df"></a>

A more realistic scenario is that your data is in tabular format and you already use the pandas library to work with it. You could write your own code that iterates over the dataframe and calls the get_clean_data() method on the relevant columns. However, financial-entity-cleaner makes this task easier: the BankingIdCleaner() class provides the **get_clean_df()** method, which normalizes IDs stored in dataframe columns. The code below shows how to apply this method:

In [17]:
import pandas as pd

In [18]:
# Read the .csv file as a pandas dataframe object
input_filename = '../../tests/data/test_cleaner.csv'
df_original = pd.read_csv(input_filename,sep=';',encoding='utf-8', usecols=['NAME','LEI'])
df_original

Unnamed: 0,NAME,LEI
0,Bechel *Australia (Services) Pty Ltd,98 4500 e8da1de9a0d939
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05
3,"Glass Coatings & Concepts ""CBG"" LLC",34123456
4,"Brault Loisirs, Orl. SARL",
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82
6,StarCOM Group Servizi **CAT** SRL,
7,Wolbeck (Archer Daniels) *Unified* GmbH,
8,"Anheuser-BUSCH, Brothers (food services), LLC",US4567899888
9,"Susamar-Patino, colectores (adm) SA",


In [19]:
# Set up the resultant letter case
id_cleaner_obj.output_lettercase='upper'

In [20]:
id_cleaner_obj.id_type='lei'

In [21]:
df_cleaner = id_cleaner_obj.get_clean_df(df_original, 'LEI')
df_cleaner

Normalizing IDs...100%|██████████████████████████████████████████████████| 10/10 [00:00<00:00, 833.87it/s]


Unnamed: 0,NAME,LEI,cleaned_id,isvalid_id
0,Bechel *Australia (Services) Pty Ltd,98 4500 e8da1de9a0d939,984500E8DA1DE9A0D939,True
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,,,
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05,5493001MT6YISZH3YV05,True
3,"Glass Coatings & Concepts ""CBG"" LLC",34123456,,False
4,"Brault Loisirs, Orl. SARL",,,
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82,213800UHTTV6EGY74X82,True
6,StarCOM Group Servizi **CAT** SRL,,,
7,Wolbeck (Archer Daniels) *Unified* GmbH,,,
8,"Anheuser-BUSCH, Brothers (food services), LLC",US4567899888,,False
9,"Susamar-Patino, colectores (adm) SA",,,


In the result above, cleaned_id is NaN wherever the ID is invalid (rows 3 and 8). This happens because the **invalid_ids_as_nan** property is still True from the previous section. Set it back to False if you don't want this behaviour:

In [22]:
id_cleaner_obj.invalid_ids_as_nan = False 

In [23]:
df_cleaner = id_cleaner_obj.get_clean_df(df_original, 'LEI')
df_cleaner

Normalizing IDs...100%|██████████████████████████████████████████████████| 10/10 [00:00<00:00, 667.12it/s]


Unnamed: 0,NAME,LEI,cleaned_id,isvalid_id
0,Bechel *Australia (Services) Pty Ltd,98 4500 e8da1de9a0d939,984500E8DA1DE9A0D939,True
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,,,
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05,5493001MT6YISZH3YV05,True
3,"Glass Coatings & Concepts ""CBG"" LLC",34123456,34123456,False
4,"Brault Loisirs, Orl. SARL",,,
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82,213800UHTTV6EGY74X82,True
6,StarCOM Group Servizi **CAT** SRL,,,
7,Wolbeck (Archer Daniels) *Unified* GmbH,,,
8,"Anheuser-BUSCH, Brothers (food services), LLC",US4567899888,US4567899888,False
9,"Susamar-Patino, colectores (adm) SA",,,


Change the properties **output_cleaned_id** and **output_validated_id** to change the output column names.

In [24]:
id_cleaner_obj.output_cleaned_id = 'LEI_CLEANED'
id_cleaner_obj.output_validated_id = 'IS_VALID_LEI'

In [25]:
df_cleaner = id_cleaner_obj.get_clean_df(df_original, 'LEI')
df_cleaner

Normalizing IDs...100%|██████████████████████████████████████████████████| 10/10 [00:00<00:00, 714.74it/s]


Unnamed: 0,NAME,LEI,LEI_CLEANED,IS_VALID_LEI
0,Bechel *Australia (Services) Pty Ltd,98 4500 e8da1de9a0d939,984500E8DA1DE9A0D939,True
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,,,
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05,5493001MT6YISZH3YV05,True
3,"Glass Coatings & Concepts ""CBG"" LLC",34123456,34123456,False
4,"Brault Loisirs, Orl. SARL",,,
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82,213800UHTTV6EGY74X82,True
6,StarCOM Group Servizi **CAT** SRL,,,
7,Wolbeck (Archer Daniels) *Unified* GmbH,,,
8,"Anheuser-BUSCH, Brothers (food services), LLC",US4567899888,US4567899888,False
9,"Susamar-Patino, colectores (adm) SA",,,


If **output_cleaned_id** has the same name as the input column, the original values are replaced by the cleaned ones. Be careful when combining this with **invalid_ids_as_nan** set to True, because invalid IDs are then overwritten with NaN and the original text is lost. Compare rows 3 and 8 of the result below with the previous one:

In [26]:
id_cleaner_obj.invalid_ids_as_nan = True

In [27]:
id_cleaner_obj.output_cleaned_id = 'LEI'

In [28]:
df_cleaner = id_cleaner_obj.get_clean_df(df_original, 'LEI')
df_cleaner

Normalizing IDs...100%|██████████████████████████████████████████████████| 10/10 [00:00<00:00, 714.79it/s]


Unnamed: 0,NAME,LEI,IS_VALID_LEI
0,Bechel *Australia (Services) Pty Ltd,984500E8DA1DE9A0D939,True
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,,
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05,True
3,"Glass Coatings & Concepts ""CBG"" LLC",,False
4,"Brault Loisirs, Orl. SARL",,
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82,True
6,StarCOM Group Servizi **CAT** SRL,,
7,Wolbeck (Archer Daniels) *Unified* GmbH,,
8,"Anheuser-BUSCH, Brothers (food services), LLC",,False
9,"Susamar-Patino, colectores (adm) SA",,
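
For comparison, the manual alternative mentioned at the start of this section (iterating over the dataframe yourself and cleaning one value at a time) could look roughly like the sketch below. The per-value function `clean_one_lei` is a hypothetical stand-in, not the library's get_clean_data(), and its cleaning rule (keep ASCII letters and digits, then upper-case) is an assumption.

```python
import pandas as pd


def clean_one_lei(value):
    """Hypothetical per-value cleaner: returns a dict, or None for unusable input."""
    if not isinstance(value, str) or not value.strip():
        return None
    cleaned = "".join(ch for ch in value if ch.isascii() and ch.isalnum()).upper()
    # Standard LEI check: 20 characters and remainder 1 under MOD 97-10.
    is_valid = (
        len(cleaned) == 20
        and int("".join(str(int(ch, 36)) for ch in cleaned)) % 97 == 1
    )
    return {"cleaned_id": cleaned, "isvalid_id": is_valid}


# Small illustrative frame with one valid, one missing and one invalid LEI.
df = pd.DataFrame({
    "NAME": ["Company A", "Company B", "Company C"],
    "LEI": ["5493001MT6YISZH3YV05", None, "34123456"],
})

# Clean each value, then spread the results into two new columns.
results = [clean_one_lei(value) for value in df["LEI"]]
df["cleaned_id"] = [r["cleaned_id"] if r else None for r in results]
df["isvalid_id"] = [r["isvalid_id"] if r else None for r in results]
print(df)
```

Writing the results to new columns, as done here, leaves the input column untouched, which avoids the loss of original values described above.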
