# How to...clean up and validate banking IDs

This notebook shows how to use the **BankingIdCleaner** class to validate IDs such as LEI, ISIN and SEDOL. You can use this class in four different ways:
1. [to validate single values of text as IDs](#validate_id)
2. [to clean & validate single values of text as IDs](#clean_validate_id)
3. [to convert a SIRET code into a SIREN code](#siret_to_siren)
4. [to clean & validate IDs on tabular dataframes](#df)

No matter which approach you choose, you will need to import the **BankingIdCleaner()** class, which is available in the **financial_entity_cleaner.id** package, and create an object from it. This notebook shows how you can customize the behaviour of this class to adapt the cleaning to your own needs.

In [1]:
# Sets up the location of the financial-entity-cleaner library relative to this notebook 
import sys
sys.path.append('../../')

In [2]:
# Import BankingIdCleaner
from financial_entity_cleaner.id import BankingIdCleaner

In [3]:
# Create an object based on the BankingIdCleaner() class
id_cleaner_obj = BankingIdCleaner()

To see all the supported ID types:

In [4]:
# Check the ID's supported by the library
id_cleaner_obj.get_types()

['lei', 'isin', 'sedol', 'cusip', 'bic', 'siren', 'siret']

<div class="alert alert-block alert-danger">
<b>EXCEPTION:</b> The library throws an exception if the ID type is not supported.
</div>

In [5]:
id_cleaner_obj.id_type='test'

TypeOfBankingIdNotSupported: Financial-Entity-Cleaner (Error) <BankingIdCleaner> - The ID type(s) <test> is/are not supported.

## 1. Validating single values of text as IDs <a id="validate_id"></a>

Use the **is_valid()** method to verify whether an ID is valid. This method returns:
- None if the value is not a string or has no characters in it
- True if the value is a valid ID of the specified type
- False if the value is not a valid ID of the specified type

By default, the library assumes that the value passed as parameter is an ISIN code. 

In [6]:
# Checking the default type
print(id_cleaner_obj.id_type)

isin


In [7]:
# Testing an invalid ISIN code
print(id_cleaner_obj.is_valid('tttt0B1YW4409'))

False
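
For readers curious about what an ISIN check typically involves, here is a minimal, self-contained sketch of the standard ISIN rule: 12 characters, a two-letter country prefix, and a Luhn check over the digits obtained by expanding letters to numbers. This is not taken from the library's source; the function name `isin_checksum_ok` is hypothetical.

```python
import re


def isin_checksum_ok(value: str) -> bool:
    """Standard ISIN check: format test plus Luhn over the expanded digits."""
    code = value.strip().upper()
    # 2 letters (country), 9 alphanumerics, 1 numeric check digit
    if not re.fullmatch(r"[A-Z]{2}[A-Z0-9]{9}[0-9]", code):
        return False
    # Expand letters to numbers: A=10 ... Z=35 (digits stay as they are)
    digits = "".join(str(int(ch, 36)) for ch in code)
    total = 0
    # Luhn: walking from the right, double every second digit
    for pos, ch in enumerate(reversed(digits)):
        d = int(ch)
        if pos % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(isin_checksum_ok("US0378331005"))   # a well-formed ISIN
print(isin_checksum_ok("tttt0B1YW4409"))  # 13 characters, so it fails the format test
```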


In [8]:
# Testing a valid LEI code
id_cleaner_obj.id_type='lei'
print(id_cleaner_obj.is_valid('969500DPKGC9JE9F0820'))

True
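
As an illustration of why the LEI above is accepted, the sketch below applies the ISO 7064 MOD 97-10 rule that LEI codes follow: letters are expanded to numbers (A=10 ... Z=35) and the resulting integer must leave a remainder of 1 when divided by 97. The function name `lei_checksum_ok` is hypothetical and the library may perform additional checks.

```python
def lei_checksum_ok(value: str) -> bool:
    """ISO 7064 MOD 97-10 check, as used for LEI check digits."""
    code = "".join(value.split()).upper()  # drop whitespace, upper-case
    # 20 ASCII alphanumeric characters, the last two being the check digits
    if len(code) != 20 or not (code.isascii() and code.isalnum()):
        return False
    if not code[-2:].isdigit():
        return False
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35
    number = int("".join(str(int(ch, 36)) for ch in code))
    return number % 97 == 1


print(lei_checksum_ok("969500DPKGC9JE9F0820"))
print(lei_checksum_ok("969500DPKGC9JE9F0855"))
```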


In [9]:
# Testing a valid SEDOL code
id_cleaner_obj.id_type='sedol'
print(id_cleaner_obj.is_valid('2595708'))

True
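
The SEDOL check can be sketched in a similar way. A SEDOL has seven characters, contains no vowels, and its weighted sum (weights 1, 3, 1, 7, 3, 9, 1) must be a multiple of 10. The code below is a stand-alone illustration with a hypothetical function name, not the library's implementation.

```python
SEDOL_WEIGHTS = (1, 3, 1, 7, 3, 9, 1)


def sedol_checksum_ok(value: str) -> bool:
    """Weighted-sum check for 7-character SEDOL codes."""
    code = value.strip().upper()
    if len(code) != 7 or not (code.isascii() and code.isalnum()):
        return False
    # The last character is a numeric check digit; vowels are never used
    if not code[-1].isdigit() or any(ch in "AEIOU" for ch in code):
        return False
    # Letters count as 10-35, digits as themselves
    total = sum(w * int(ch, 36) for w, ch in zip(SEDOL_WEIGHTS, code))
    return total % 10 == 0


print(sedol_checksum_ok("2595708"))
```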


Sometimes it is necessary to represent boolean values as categorical data (0s and 1s), for example when storing data in a database or performing machine learning operations on it. To return categorical data as the result of the validation process, set the property **validation_as_categorical** to **True**:

In [10]:
id_cleaner_obj.validation_as_categorical = True 

In [11]:
# Testing a valid SIRET code
id_cleaner_obj.id_type='siret'
print(id_cleaner_obj.is_valid('73282932000074'))

1


In [12]:
# Testing an invalid LEI code
id_cleaner_obj.id_type='lei'
print(id_cleaner_obj.is_valid('969500DPKGC9JE9F0855'))

0


In [13]:
# Undo the categorical output
id_cleaner_obj.validation_as_categorical = False 

In [14]:
# Testing a valid SIRET code
id_cleaner_obj.id_type='siret'
print(id_cleaner_obj.is_valid('73282932000074'))

True
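
The mapping between the boolean and the categorical outputs shown above can be pictured with a tiny helper. This is only a sketch with a hypothetical name (`as_categorical`); it assumes that a missing result (None) is left untouched rather than turned into 0.

```python
from typing import Optional


def as_categorical(result: Optional[bool]) -> Optional[int]:
    """Map True/False to 1/0 and keep a missing result (None) as None."""
    return None if result is None else int(result)


for outcome in (True, False, None):
    print(outcome, "->", as_categorical(outcome))
```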


By **default**, the **operation mode** of the **BankingIdCleaner()** class is **SILENT_MODE**, meaning that if the ID is not a string or is empty, **None** is returned instead of an error.

In [15]:
# Validating a value that is not a string
print(id_cleaner_obj.is_valid(12345))

None


The code below shows how to change the operation mode so that an exception is raised when the input is not a string or is empty.

In [16]:
id_cleaner_obj.mode = BankingIdCleaner.EXCEPTION_MODE

<div class="alert alert-block alert-danger">
<b>EXCEPTION MODE:</b> Now, the code below will throw a customized exception because the parameter of is_valid() method is not a string.
</div>

In [17]:
# Validating a value that is not a string
print(id_cleaner_obj.is_valid(12345))

BankingIdIsNotAString: Financial-Entity-Cleaner (Error) <BankingIdCleaner> - The input data <12345> is not a string.

In [18]:
# Back to SILENT mode
id_cleaner_obj.mode = BankingIdCleaner.SILENT_MODE
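
The difference between the two operation modes can be summarised with a small stand-alone sketch. The constants, the exception class and the `check_input` function below are hypothetical stand-ins written for this illustration; they only mimic the behaviour described above.

```python
SILENT_MODE = "silent"        # hypothetical constants for this sketch
EXCEPTION_MODE = "exception"


class NotAStringError(TypeError):
    """Hypothetical stand-in for the library's custom exception."""


def check_input(value, mode=SILENT_MODE):
    """Return the stripped string, or None (silent) / raise (exception mode)."""
    if isinstance(value, str) and value.strip():
        return value.strip()
    if mode == EXCEPTION_MODE:
        raise NotAStringError(f"The input data <{value}> is not a non-empty string.")
    return None


print(check_input(12345))  # silent mode: no error is raised
try:
    check_input(12345, mode=EXCEPTION_MODE)
except NotAStringError as err:
    print("raised:", err)
```

Silent mode is convenient when processing many values in bulk, while exception mode makes bad inputs visible as soon as they occur.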

## 2. Cleaning and validating single values of text as IDs <a id="clean_validate_id"></a>

The library can also clean and validate in a single step. In this case, it returns a dictionary with the cleaned ID and a flag indicating whether it is valid:

In [19]:
id_cleaner_obj.id_type='lei'

In [20]:
# Cleaning a valid LEI code
clean_lei = id_cleaner_obj.clean('969500  dpKGC9JE9F0820')
clean_lei

{'cleaned_id': '969500DPKGC9JE9F0820', 'isvalid_id': True}

You may want to change the dictionary keys returned by the clean() method. Instead of calling them 'cleaned_id' and 'isvalid_id' you can define other names by changing the properties **output_cleaned_id** and **output_validated_id**:

In [21]:
id_cleaner_obj.output_cleaned_id = 'LEI'
id_cleaner_obj.output_validated_id = 'IS_VALID'
clean_lei = id_cleaner_obj.clean('969500DPKGC9JE9F0820')
clean_lei

{'LEI': '969500DPKGC9JE9F0820', 'IS_VALID': True}

You can reset these output names at any time by calling the **reset_output_names()** method:

In [22]:
id_cleaner_obj.reset_output_names()

In [23]:
clean_lei = id_cleaner_obj.clean('969500DPKGC9JE9F0820')
clean_lei

{'cleaned_id': '969500DPKGC9JE9F0820', 'isvalid_id': True}

You may also want to return NaN if the ID is invalid. For this, set the property **invalid_ids_as_nan** to **True**. By default, it is set to False and, therefore, the clean() method will always return the ID text:

In [24]:
id_cleaner_obj.invalid_ids_as_nan = True
clean_lei = id_cleaner_obj.clean('96XX00DPKGC9JE9F0820')
clean_lei

{'cleaned_id': nan, 'isvalid_id': False}
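
To make the shape of the returned dictionary concrete, here is a minimal sketch of a clean-and-validate step with configurable key names and an optional NaN for invalid IDs. All names are hypothetical, and `looks_like_lei` is a deliberately simplistic placeholder (length and character test only), not a real LEI validation.

```python
import math


def looks_like_lei(code: str) -> bool:
    # Placeholder rule for this sketch only: 20 ASCII alphanumeric characters
    return len(code) == 20 and code.isascii() and code.isalnum()


def clean_sketch(value, validator=looks_like_lei, invalid_as_nan=False,
                 cleaned_key="cleaned_id", valid_key="isvalid_id"):
    """Remove whitespace, upper-case, validate and return a two-key dict."""
    cleaned = "".join(value.split()).upper()
    valid = validator(cleaned)
    if invalid_as_nan and not valid:
        cleaned = math.nan  # hide the text of invalid IDs
    return {cleaned_key: cleaned, valid_key: valid}


print(clean_sketch("969500  dpKGC9JE9F0820"))
print(clean_sketch("not an id", invalid_as_nan=True,
                   cleaned_key="LEI", valid_key="IS_VALID"))
```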

## 3. Convert Siret to Siren <a id="siret_to_siren"></a>

You can use the method **siret_to_siren()** to convert a SIRET code into a SIREN code:

In [25]:
siret = '440 558 062'
siren = id_cleaner_obj.siret_to_siren(siret)
siren

'440558062'

In [26]:
id_cleaner_obj.id_type='siren'
id_cleaner_obj.is_valid(siren)

True

The cleaning library will do its best to clean up the text and retrieve the corresponding SIREN code:

In [27]:
siret = '533 602 942 R.C.S Nanterre'
siren = id_cleaner_obj.siret_to_siren(siret)
siren

'533602942'

In [28]:
id_cleaner_obj.id_type='siren'
id_cleaner_obj.is_valid(siren)

True
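
The conversion works because a SIRET is a 14-digit number whose first nine digits are the SIREN of the company. A rough stand-alone sketch of that idea is shown below; the function name is hypothetical and the rule "strip everything that is not a digit, then keep the first nine digits" is an assumption inferred from the outputs in this notebook, not taken from the library's source.

```python
import re
from typing import Optional


def siret_to_siren_sketch(text: str) -> Optional[str]:
    """Keep only the digits and return the first nine, if there are enough."""
    digits = re.sub(r"[^0-9]", "", text)
    if len(digits) < 9:
        return None  # not enough digits to contain a SIREN
    return digits[:9]


for raw in ("333 171 544 00013", "533 602 942 R.C.S Nanterre", "440 558 062"):
    print(raw, "->", siret_to_siren_sketch(raw))
```

The extracted value still needs to be validated as a SIREN afterwards, as done with is_valid() above.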

## 4. Cleaning and validating IDs on tabular dataframe <a id="df"></a>

A more realistic scenario is that your data is in a tabular format and you are already using the Pandas library to operate on it. You could write your own code to iterate over the pandas dataframe and call the clean() method on some columns. However, the financial-entity-cleaner makes this task easier for you.

You can use the method **siret_to_siren_df()** to convert a siret column of a dataframe into a siren column:

In [29]:
import pandas as pd

In [30]:
# Read a .csv file as a pandas dataframe object
input_filename = '../../tests/data/test_cleaner.csv'
df_original = pd.read_csv(input_filename,sep=';',encoding='utf-8', usecols=['NAME','LEI', 'SIRET', 'SIREN'])
df_original

Unnamed: 0,NAME,LEI,SIRET,SIREN
0,Bechel *Australia (Services) Pty Ltd,98 4500 e8da1de9a0d939,533 602 942 R.C.S. Nanterre,733602966.0
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,,440 558 062,
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05,784-00-03-412,
3,"Glass Coatings & Concepts ""CBG"" LLC",34123456,be 0413458441,
4,"Brault Loisirs, Orl. SARL",,fr24313140642,
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82,333 171 544 00013,
6,StarCOM Group Servizi **CAT** SRL,,,
7,Wolbeck (Archer Daniels) *Unified* GmbH,,,
8,"Anheuser-BUSCH, Brothers (food services), LLC",US4567899888,,
9,"Susamar-Patino, colectores (adm) SA",,,


In [31]:
df_original = id_cleaner_obj.siret_to_siren_df(df_original, 
                                              siret_col_name='SIRET', 
                                              siren_col_name='SIREN', 
                                              replace_if_exists=False)

In [32]:
df_original

Unnamed: 0,NAME,LEI,SIRET,SIREN
0,Bechel *Australia (Services) Pty Ltd,98 4500 e8da1de9a0d939,533 602 942 R.C.S. Nanterre,733602966.0
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,,440 558 062,440558062.0
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05,784-00-03-412,784000341.0
3,"Glass Coatings & Concepts ""CBG"" LLC",34123456,be 0413458441,41345844.0
4,"Brault Loisirs, Orl. SARL",,fr24313140642,243131406.0
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82,333 171 544 00013,333171544.0
6,StarCOM Group Servizi **CAT** SRL,,,
7,Wolbeck (Archer Daniels) *Unified* GmbH,,,
8,"Anheuser-BUSCH, Brothers (food services), LLC",US4567899888,,
9,"Susamar-Patino, colectores (adm) SA",,,


The BankingIdCleaner() class provides the **clean_df()** method to normalize IDs stored in dataframe columns. The code below shows how to apply this method:

In [33]:
df_cleaner = id_cleaner_obj.clean_df(df_original, cols=['LEI', 'SIRET', 'SIREN'], 
                                     remove_cols= False, 
                                     output_names_as= 'prefix',
                                     types = ['lei', 'siret', 'siren'])

Normalizing IDs...100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 611.50it/s]


In [34]:
df_cleaner

Unnamed: 0,NAME,LEI,SIRET,SIREN,cleaned_id_LEI,isvalid_id_LEI,cleaned_id_SIRET,isvalid_id_SIRET,cleaned_id_SIREN,isvalid_id_SIREN
0,Bechel *Australia (Services) Pty Ltd,98 4500 e8da1de9a0d939,533 602 942 R.C.S. Nanterre,733602966.0,984500E8DA1DE9A0D939,True,,False,,
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,,440 558 062,440558062.0,,,,False,440558062.0,True
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05,784-00-03-412,784000341.0,5493001MT6YISZH3YV05,True,,False,784000341.0,True
3,"Glass Coatings & Concepts ""CBG"" LLC",34123456,be 0413458441,41345844.0,,False,,False,41345844.0,True
4,"Brault Loisirs, Orl. SARL",,fr24313140642,243131406.0,,,,False,243131406.0,True
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82,333 171 544 00013,333171544.0,213800UHTTV6EGY74X82,True,33317154400013.0,True,333171544.0,True
6,StarCOM Group Servizi **CAT** SRL,,,,,,,,,
7,Wolbeck (Archer Daniels) *Unified* GmbH,,,,,,,,,
8,"Anheuser-BUSCH, Brothers (food services), LLC",US4567899888,,,,False,,,,
9,"Susamar-Patino, colectores (adm) SA",,,,,,,,,


Another important property of the BankingIdCleaner() class is **letter_case**, which defines whether the output is returned in lower, upper or title case. By default, the letter case is set to 'lower'. To change it, set the letter_case property to **'lower'**, **'upper'** or **'title'**, or use the built-in class constants LOWER_LETTER_CASE, UPPER_LETTER_CASE and TITLE_LETTER_CASE, as shown below:

In [35]:
# Set up the resultant letter case
id_cleaner_obj.letter_case = BankingIdCleaner.UPPER_LETTER_CASE

In [36]:
id_cleaner_obj.id_type='lei'

In [37]:
df_cleaner = id_cleaner_obj.clean_df(df_original, cols=['LEI'])
df_cleaner

Normalizing IDs...100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 1572.45it/s]


Unnamed: 0,NAME,LEI,SIRET,SIREN,LEI_cleaned_id,LEI_isvalid_id
0,Bechel *Australia (Services) Pty Ltd,98 4500 e8da1de9a0d939,533 602 942 R.C.S. Nanterre,733602966.0,984500E8DA1DE9A0D939,True
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,,440 558 062,440558062.0,,
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05,784-00-03-412,784000341.0,5493001MT6YISZH3YV05,True
3,"Glass Coatings & Concepts ""CBG"" LLC",34123456,be 0413458441,41345844.0,,False
4,"Brault Loisirs, Orl. SARL",,fr24313140642,243131406.0,,
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82,333 171 544 00013,333171544.0,213800UHTTV6EGY74X82,True
6,StarCOM Group Servizi **CAT** SRL,,,,,
7,Wolbeck (Archer Daniels) *Unified* GmbH,,,,,
8,"Anheuser-BUSCH, Brothers (food services), LLC",US4567899888,,,,False
9,"Susamar-Patino, colectores (adm) SA",,,,,


Now, let's set the ID validation as categorical and check the results. As you can see, the values are 1.0 for valid IDs and 0.0 for invalid ones. Floating-point numbers (0.0 or 1.0) are returned because there are null values in this dataset. Without them, you would see only integer values (1 or 0).

In [38]:
id_cleaner_obj.validation_as_categorical = True 
df_cleaner = id_cleaner_obj.clean_df(df_original, cols=['LEI'])
df_cleaner

Normalizing IDs...100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 1572.29it/s]


Unnamed: 0,NAME,LEI,SIRET,SIREN,LEI_cleaned_id,LEI_isvalid_id
0,Bechel *Australia (Services) Pty Ltd,98 4500 e8da1de9a0d939,533 602 942 R.C.S. Nanterre,733602966.0,984500E8DA1DE9A0D939,1.0
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,,440 558 062,440558062.0,,
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05,784-00-03-412,784000341.0,5493001MT6YISZH3YV05,1.0
3,"Glass Coatings & Concepts ""CBG"" LLC",34123456,be 0413458441,41345844.0,,0.0
4,"Brault Loisirs, Orl. SARL",,fr24313140642,243131406.0,,
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82,333 171 544 00013,333171544.0,213800UHTTV6EGY74X82,1.0
6,StarCOM Group Servizi **CAT** SRL,,,,,
7,Wolbeck (Archer Daniels) *Unified* GmbH,,,,,
8,"Anheuser-BUSCH, Brothers (food services), LLC",US4567899888,,,,0.0
9,"Susamar-Patino, colectores (adm) SA",,,,,


In the first row you can see that the cleaning process did a good job on the LEI ID. However, keep in mind that the cleaning performed by **BankingIdCleaner()** is very simple: it only removes extra spaces and unicode characters. If you need to perform a more advanced cleaning task, use the **SimpleCleaner()** class provided in the financial-entity-cleaner.text package, which can apply different pre-defined regex rules to texts or string attributes.

The result above shows that two new columns were created, named with the suffixes defined by the properties **output_cleaned_id** and **output_validated_id**, and that the original LEI column was preserved. But what if we want to clean and validate more than one column in the dataframe, use a standard prefix to name the new columns, and remove the original columns? Also, notice that we did not specify the ID type of the LEI column. If the argument **types** is not passed, the cleaning method assumes that all columns are of the type defined by the property **id_type**. The code below performs the cleaning on different IDs:

In [39]:
id_cleaner_obj.validation_as_categorical = False
id_cleaner_obj.output_cleaned_id = 'CLEANED'
id_cleaner_obj.output_validated_id = 'IS_VALID'
df_cleaner = id_cleaner_obj.clean_df(df_original, cols=['SIRET', 'SIREN'], 
                                     remove_cols= False, 
                                     output_names_as= 'prefix',
                                     types = ['siret', 'siren'])
df_cleaner

Normalizing IDs...100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 1100.58it/s]


Unnamed: 0,NAME,LEI,SIRET,SIREN,CLEANED_SIRET,IS_VALID_SIRET,CLEANED_SIREN,IS_VALID_SIREN
0,Bechel *Australia (Services) Pty Ltd,98 4500 e8da1de9a0d939,533 602 942 R.C.S. Nanterre,733602966.0,,False,,
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,,440 558 062,440558062.0,,False,440558062.0,True
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05,784-00-03-412,784000341.0,,False,784000341.0,True
3,"Glass Coatings & Concepts ""CBG"" LLC",34123456,be 0413458441,41345844.0,,False,41345844.0,True
4,"Brault Loisirs, Orl. SARL",,fr24313140642,243131406.0,,False,243131406.0,True
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82,333 171 544 00013,333171544.0,33317154400013.0,True,333171544.0,True
6,StarCOM Group Servizi **CAT** SRL,,,,,,,
7,Wolbeck (Archer Daniels) *Unified* GmbH,,,,,,,
8,"Anheuser-BUSCH, Brothers (food services), LLC",US4567899888,,,,,,
9,"Susamar-Patino, colectores (adm) SA",,,,,,,
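
The way column types and output column names are resolved can be summarised with a small sketch. The helper below is hypothetical and only mirrors what the outputs in this notebook suggest: when no types are given, every column uses the default ID type; new columns are named either `<label>_<column>` (prefix) or `<column>_<label>` (suffix).

```python
def plan_columns(cols, types=None, default_type="isin",
                 output_names_as="suffix",
                 cleaned_label="cleaned_id", valid_label="isvalid_id"):
    """Return one (column, id_type, cleaned_name, valid_name) tuple per column."""
    if types is None:
        # No explicit types: every column is treated as the default ID type
        types = [default_type] * len(cols)
    if len(types) != len(cols):
        raise ValueError("cols and types must have the same length")
    plan = []
    for col, id_type in zip(cols, types):
        if output_names_as == "prefix":
            names = (f"{cleaned_label}_{col}", f"{valid_label}_{col}")
        else:
            names = (f"{col}_{cleaned_label}", f"{col}_{valid_label}")
        plan.append((col, id_type, *names))
    return plan


print(plan_columns(["LEI"], default_type="lei"))
print(plan_columns(["SIRET", "SIREN"], types=["siret", "siren"],
                   output_names_as="prefix",
                   cleaned_label="CLEANED", valid_label="IS_VALID"))
```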


In all the results above, the cleaned ID is set to NaN when the ID is invalid (see, for example, rows 0 to 4 of the CLEANED_SIRET column). This happens because the **invalid_ids_as_nan** property was set to True earlier in this notebook. Set it to False if you don't want this behaviour:

In [40]:
id_cleaner_obj.invalid_ids_as_nan = False 

In [41]:
df_cleaner = id_cleaner_obj.clean_df(df_original, cols=['SIRET', 'SIREN'], 
                                     remove_cols= True, 
                                     output_names_as= 'prefix',
                                     types = ['siret', 'siren'])
df_cleaner

Normalizing IDs...100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 1100.79it/s]


Unnamed: 0,NAME,LEI,CLEANED_SIRET,IS_VALID_SIRET,CLEANED_SIREN,IS_VALID_SIREN
0,Bechel *Australia (Services) Pty Ltd,98 4500 e8da1de9a0d939,533602942R.C.S.NANTERRE,False,,
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,,440558062,False,440558062.0,True
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05,784-00-03-412,False,784000341.0,True
3,"Glass Coatings & Concepts ""CBG"" LLC",34123456,BE0413458441,False,41345844.0,True
4,"Brault Loisirs, Orl. SARL",,FR24313140642,False,243131406.0,True
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82,33317154400013,True,333171544.0,True
6,StarCOM Group Servizi **CAT** SRL,,,,,
7,Wolbeck (Archer Daniels) *Unified* GmbH,,,,,
8,"Anheuser-BUSCH, Brothers (food services), LLC",US4567899888,,,,
9,"Susamar-Patino, colectores (adm) SA",,,,,


By default, the result of the ID validation is a boolean attribute. As stated before, it is also possible to return it as categorical data (0s and 1s), and this behaviour applies to dataframes too. Just set the property **validation_as_categorical** to **True**; this also works for multiple ID types, as shown below:

In [42]:
id_cleaner_obj.validation_as_categorical = True 

In [43]:
id_cleaner_obj.cleaning_rules = ["remove_all_punctuation","remove_spaces"]
df_cleaner = id_cleaner_obj.clean_df(df_original, cols=['LEI', 'SIRET', 'SIREN'], 
                                     remove_cols= True, 
                                     output_names_as= 'prefix',
                                     types = ['lei','siret', 'siren'])
df_cleaner

Normalizing IDs...100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 611.50it/s]


Unnamed: 0,NAME,CLEANED_LEI,IS_VALID_LEI,CLEANED_SIRET,IS_VALID_SIRET,CLEANED_SIREN,IS_VALID_SIREN
0,Bechel *Australia (Services) Pty Ltd,984500E8DA1DE9A0D939,1.0,533602942RCSNANTERRE,0.0,,
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,,,440558062,0.0,440558062.0,1.0
2,meo - serviços de comunicação e multimedia SA...,5493001MT6YISZH3YV05,1.0,7840003412,0.0,784000341.0,1.0
3,"Glass Coatings & Concepts ""CBG"" LLC",34123456,0.0,BE0413458441,0.0,41345844.0,1.0
4,"Brault Loisirs, Orl. SARL",,,FR24313140642,0.0,243131406.0,1.0
5,"Cole & Brothers Fabric, Services LLC.",213800UHTTV6EGY74X82,1.0,33317154400013,1.0,333171544.0,1.0
6,StarCOM Group Servizi **CAT** SRL,,,,,,
7,Wolbeck (Archer Daniels) *Unified* GmbH,,,,,,
8,"Anheuser-BUSCH, Brothers (food services), LLC",US4567899888,0.0,,,,
9,"Susamar-Patino, colectores (adm) SA",,,,,,


Notice that the validation columns were converted to float type (0.0 or 1.0) because of the null values in some of the cells.

In [None]:
df_cleaner.info()