# How to...normalize country information

This notebook shows how to use the **financial_entity_cleaner.country.iso3166** module to normalize country information according to the ISO 3166 standard. You can use this module in two ways:
1. [by normalizing single instances of text with country information](#string_values)
2. [by normalizing country information on tabular data](#df)

Whichever approach you choose, you need to import the **CountryCleaner()** class from the **financial_entity_cleaner.country.iso3166** module and create an object from it. This notebook also shows how to customize the behaviour of this class to suit your projects.

<div class="alert alert-block alert-info">
<b>Note:</b> The complete documentation of the <b>financial-entity-cleaner</b> is available at <a href="https://financial-entity-cleaner.readthedocs.io/en/latest/" title="ReadTheDocs">ReadTheDocs</a>.</div>

In [1]:
# Sets up the location of the financial-entity-cleaner library relative to this notebook 
import sys
sys.path.append('../../')

In [2]:
# Import the CountryCleaner() class to normalize country information
from financial_entity_cleaner.country.iso3166 import CountryCleaner

In [3]:
# Create a CountryCleaner() object to clean string values, dataframes or .csv files
country_cleaner_obj = CountryCleaner()

## 1. Normalizing single instances of text with country information <a id="string_values"></a>

To normalize a single string value, use the **get_clean_data()** method. It takes one parameter: a string that contains country information, such as a code or a name. The examples below show three successful calls to get_clean_data(), each returning a dictionary with the following standardized country information as defined by ISO 3166:
- iso_name: official ISO country name, 
- iso_alpha2: official ISO code with two characters, and
- iso_alpha3: official ISO code with three characters.

In [4]:
# Get the complete information about a country given an alpha2 code
country_info_dict = country_cleaner_obj.get_clean_data('PT')
print('Country information: {}'.format(country_info_dict))

Country information: {'iso_name': 'portugal', 'iso_alpha2': 'pt', 'iso_alpha3': 'prt'}


In [5]:
# Get the complete information about a country given alpha3 code
country_info_dict = country_cleaner_obj.get_clean_data('BRA')
print('Country information: {}'.format(country_info_dict))

Country information: {'iso_name': 'brazil', 'iso_alpha2': 'br', 'iso_alpha3': 'bra'}


In [6]:
# Get the complete information about a country given a name
country_info_dict = country_cleaner_obj.get_clean_data(' China ')
print('Country information: {}'.format(country_info_dict))

Country information: {'iso_name': 'china', 'iso_alpha2': 'cn', 'iso_alpha3': 'chn'}
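The three lookups above accept an alpha2 code, an alpha3 code and a name with surrounding spaces, and all of them return lower-case values. One way such a lookup could work is sketched below: the input is trimmed and lower-cased, then matched against an index keyed by every known code and name. The table `_COUNTRIES` and the function `normalize_country` are hypothetical stand-ins with only three entries; they are not the library's implementation.

```python
# Hypothetical three-country table; a full ISO 3166 table is much larger.
_COUNTRIES = [
    {'iso_name': 'portugal', 'iso_alpha2': 'pt', 'iso_alpha3': 'prt'},
    {'iso_name': 'brazil', 'iso_alpha2': 'br', 'iso_alpha3': 'bra'},
    {'iso_name': 'china', 'iso_alpha2': 'cn', 'iso_alpha3': 'chn'},
]

# Build one index so that a name, an alpha2 code or an alpha3 code
# all point to the same record.
_INDEX = {}
for record in _COUNTRIES:
    for key in record.values():
        _INDEX[key] = record


def normalize_country(value):
    """Return a copy of the matching record, or None when nothing matches."""
    if not isinstance(value, str):
        return None
    record = _INDEX.get(value.strip().lower())
    return dict(record) if record is not None else None


print(normalize_country('PT'))
print(normalize_country('BRA'))
print(normalize_country(' China '))
```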


Now, let's see three cases in which the search fails:
- when the value passed as parameter does not match any country code or name,
- when the value passed as parameter is not a string, and
- when the value passed as parameter is too short to be a country code or name.

In [7]:
# Testing with a string value that is not a country code or name.
print(country_cleaner_obj.get_clean_data('123'))

None


In [8]:
# Testing with a value that is not a string.
print(country_cleaner_obj.get_clean_data(123))

None


In [9]:
# Testing with a very short value ('uk' is not an ISO 3166 code, so it is not found; a single character such as 'u' is rejected as too short)
print(country_cleaner_obj.get_clean_data('uk'))

None


In the cases above, the **get_clean_data()** method returned 'None', which indicates that the search failed. In some cases, however, you may prefer to receive an exception instead. To do so, change the **operation mode** of the **CountryCleaner()** class, which is set to **SILENT_MODE** by **default**. The code below imports the modes of operation from the **utils.lib** module and uses them to set the **mode** property:

In [10]:
# Import the modes from utils.lib
from financial_entity_cleaner.utils.lib import ModeOfUse

In [11]:
country_cleaner_obj.mode = ModeOfUse.EXCEPTION_MODE

<div class="alert alert-block alert-danger">
<b>EXCEPTION MODE:</b> Now, the code below will throw a customized exception because the parameter of get_clean_data() method is not a string.
</div>

In [12]:
# Trigger an exception if a value is not a string.
print(country_cleaner_obj.get_clean_data(123))

CountryIsNotAString: Financial-Entity-Cleaner (Error) - The input data <123> is not a string.

<div class="alert alert-block alert-danger">
<b>EXCEPTION MODE:</b> Another exception is shown because get_clean_data() couldn't locate the country.
</div>

In [17]:
# Trigger an exception if the country is not found.
print(country_cleaner_obj.get_clean_data('uk'))

CountryNotFound: Financial-Entity-Cleaner (Error) - The country <uk> was not found.

<div class="alert alert-block alert-danger">
<b>EXCEPTION MODE:</b> The last case below shows an exception because the input parameter is too short to be a country code or name.
</div>

In [18]:
# Testing with a value that is too short to be a country name or code
print(country_cleaner_obj.get_clean_data('u'))

CountryInputDataTooSmall: Financial-Entity-Cleaner (Error) - The input data <u> is too small to be a country data.

<div class="alert alert-block alert-success">
<b>SILENT MODE:</b> You can set up your cleaning object back to SILENT_MODE as shown below:
</div>

In [19]:
country_cleaner_obj.mode = ModeOfUse.SILENT_MODE

In [20]:
# Back in silent mode, a failed search returns None instead of raising an exception.
print(country_cleaner_obj.get_clean_data('123'))

None
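The difference between the two modes can be sketched with a small stand-in. Everything in the block below is hypothetical and only mimics the behaviour shown above: the `Mode` enum, the three exception classes, the `lookup` function, its three-key table and the minimum length of two characters are all assumptions, not the library's code.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical counterpart of the two operation modes."""
    SILENT = 1
    EXCEPTION = 2


class NotAString(Exception):
    """Raised when the input is not a string."""


class TooShort(Exception):
    """Raised when the input has fewer than two characters (assumed threshold)."""


class NotFound(Exception):
    """Raised when the input matches no known code or name."""


# Hypothetical table with a single country
_TABLE = {'pt': 'portugal', 'prt': 'portugal', 'portugal': 'portugal'}


def lookup(value, mode=Mode.SILENT):
    """Return the normalized name, or fail according to the chosen mode."""
    try:
        if not isinstance(value, str):
            raise NotAString(value)
        key = value.strip().lower()
        if len(key) < 2:
            raise TooShort(value)
        if key not in _TABLE:
            raise NotFound(value)
        return _TABLE[key]
    except (NotAString, TooShort, NotFound):
        if mode is Mode.EXCEPTION:
            raise  # exception mode: let the caller handle the error
        return None  # silent mode: report the failure as None


print(lookup('uk'))  # silent mode
try:
    lookup('uk', Mode.EXCEPTION)
except NotFound as error:
    print('not found:', error)
```

In a design like this, silent mode suits bulk cleaning where a few misses are acceptable, while exception mode suits code that should stop as soon as it meets bad input.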


Another important property of the CountryCleaner() class is **letter_case**, which defines whether the output of the country search is returned in lower, upper or title case. By default, it is set to 'lower'. To change it, set the letter_case property to **'lower'**, **'upper'** or **'title'** as shown below:

In [21]:
# Set the letter case of the output
country_cleaner_obj.letter_case = 'upper'

In [22]:
# Get the complete information about a country given an alpha2 code
country_info_dict = country_cleaner_obj.get_clean_data('us')
print('Complete country information: {}'.format(country_info_dict))

Complete country information: {'iso_name': 'UNITED STATES OF AMERICA', 'iso_alpha2': 'US', 'iso_alpha3': 'USA'}
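The effect of the letter case setting on a result dictionary can be reproduced with Python's built-in string methods. The helper `apply_letter_case` below is a hypothetical illustration; whether the library uses these exact string methods internally is an assumption.

```python
def apply_letter_case(info, letter_case='lower'):
    """Return a copy of info with every value converted to the requested case."""
    converters = {'lower': str.lower, 'upper': str.upper, 'title': str.title}
    if letter_case not in converters:
        raise ValueError("letter_case must be 'lower', 'upper' or 'title'")
    convert = converters[letter_case]
    return {key: convert(value) for key, value in info.items()}


info = {'iso_name': 'united states of america', 'iso_alpha2': 'us', 'iso_alpha3': 'usa'}
print(apply_letter_case(info, 'upper'))
print(apply_letter_case(info, 'title'))
```

Note that `str.title` capitalizes every word, so in this sketch 'of' becomes 'Of' and the codes become 'Us' and 'Usa'.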


Finally, you may want to change the dictionary keys returned by the country search. Instead of calling them 'iso_name', 'iso_alpha2' and 'iso_alpha3' you can define other names. The code below shows how you can do that by changing the properties **output_name**, **output_alpha2** and **output_alpha3** of the CountryCleaner() object:

In [23]:
country_cleaner_obj.output_name = 'clean_name'
country_cleaner_obj.output_alpha2 = 'clean_alpha2'
country_cleaner_obj.output_alpha3 = 'clean_alpha3'

In [24]:
# Get the complete information about a country given an alpha2 code
country_info_dict = country_cleaner_obj.get_clean_data('US')
print('Complete country information: {}'.format(country_info_dict))

Complete country information: {'clean_name': 'UNITED STATES OF AMERICA', 'clean_alpha2': 'US', 'clean_alpha3': 'USA'}


You can reset these output names at any time by calling the **reset_output_names()** method of your CountryCleaner() object:

In [25]:
country_cleaner_obj.reset_output_names()

In [26]:
# Get the complete information about a country given an alpha2 code
country_info_dict = country_cleaner_obj.get_clean_data('US')
print('Complete country information: {}'.format(country_info_dict))

Complete country information: {'iso_name': 'UNITED STATES OF AMERICA', 'iso_alpha2': 'US', 'iso_alpha3': 'USA'}
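If you already hold a result with the default keys and need different names further along in your code, the same renaming can be applied to a plain dictionary. This is a minimal sketch; `rename_keys` and `mapping` are hypothetical names used only for illustration.

```python
def rename_keys(info, mapping):
    """Return a copy of info with renamed keys; keys missing from mapping are kept."""
    return {mapping.get(key, key): value for key, value in info.items()}


info = {'iso_name': 'UNITED STATES OF AMERICA', 'iso_alpha2': 'US', 'iso_alpha3': 'USA'}
mapping = {'iso_name': 'clean_name', 'iso_alpha2': 'clean_alpha2', 'iso_alpha3': 'clean_alpha3'}
print(rename_keys(info, mapping))
```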


## 2. Normalizing country information on pandas dataframes <a id="df"></a>

A more realistic scenario is that your data is in a tabular format and you are already using the pandas library to work with it. You could write your own code to iterate over the dataframe and call get_clean_data() on each value of a column. However, the financial-entity-cleaner makes this task easier: the CountryCleaner() class provides the **get_clean_df()** method, which normalizes the country information stored in a dataframe column. The code below shows how to apply this method:

In [27]:
# Import the pandas library to read data from a .csv file
import pandas as pd

In [28]:
# Read the .csv file as a pandas dataframe object
input_filename = '../../tests/data/test_cleaner.csv'
df_original = pd.read_csv(input_filename, sep=';', encoding='utf-8', usecols=['NAME', 'COUNTRY_HEAD'])
df_original

Unnamed: 0,NAME,COUNTRY_HEAD
0,Bechel *Australia (Services) Pty Ltd,au
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,Australia
2,meo - serviços de comunicação e multimedia SA...,PT
3,"Glass Coatings & Concepts ""CBG"" LLC",United States
4,"Brault Loisirs, Orl. SARL",FR
5,"Cole & Brothers Fabric, Services LLC.",uk
6,StarCOM Group Servizi **CAT** SRL,italy
7,Wolbeck (Archer Daniels) *Unified* GmbH,de
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb
9,"Susamar-Patino, colectores (adm) SA",ES


In [29]:
# Set the output to upper case
country_cleaner_obj.letter_case = 'upper'

In [30]:
# Perform the cleaning on the dataframe using the column named 'COUNTRY_HEAD'
df_cleaner = country_cleaner_obj.get_clean_df(df=df_original, column_name='COUNTRY_HEAD')
df_cleaner

Normalizing countries...100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 500.28it/s]


Unnamed: 0,NAME,COUNTRY_HEAD,iso_name,iso_alpha2,iso_alpha3
0,Bechel *Australia (Services) Pty Ltd,au,AUSTRALIA,AU,AUS
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,Australia,AUSTRALIA,AU,AUS
2,meo - serviços de comunicação e multimedia SA...,PT,PORTUGAL,PT,PRT
3,"Glass Coatings & Concepts ""CBG"" LLC",United States,UNITED STATES OF AMERICA,US,USA
4,"Brault Loisirs, Orl. SARL",FR,FRANCE,FR,FRA
5,"Cole & Brothers Fabric, Services LLC.",uk,,,
6,StarCOM Group Servizi **CAT** SRL,italy,ITALY,IT,ITA
7,Wolbeck (Archer Daniels) *Unified* GmbH,de,GERMANY,DE,DEU
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb,UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN I...,GB,GBR
9,"Susamar-Patino, colectores (adm) SA",ES,SPAIN,ES,ESP


From the results above, it is easy to spot where the country search failed, because the library returns 'NaN' values in the new columns (row 5, where 'uk' was not found). You can therefore quickly identify the countries that were not normalized.
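To list the rows that were not normalized, you can filter on missing values in one of the output columns. The sketch below rebuilds three rows of the result above by hand and uses the standard pandas `isna()` method; the variable names `df_result` and `df_failed` are illustrative.

```python
import pandas as pd

# Three rows copied from the result above; None marks the failed search
df_result = pd.DataFrame({
    'COUNTRY_HEAD': ['FR', 'uk', 'italy'],
    'iso_name': ['FRANCE', None, 'ITALY'],
    'iso_alpha2': ['FR', None, 'IT'],
    'iso_alpha3': ['FRA', None, 'ITA'],
})

# Rows with a missing alpha2 code are the ones the search could not resolve
df_failed = df_result[df_result['iso_alpha2'].isna()]
print(df_failed['COUNTRY_HEAD'].tolist())
```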

Also notice that the cleaning process created three new columns, whose names are defined by the properties output_name, output_alpha2 and output_alpha3. By default, these names are iso_name, iso_alpha2 and iso_alpha3. The code below shows how you can change them:

In [35]:
# Changing the output column names
country_cleaner_obj.output_name = 'COUNTRY_NAME'
country_cleaner_obj.output_alpha2 = 'ALPHA2'
country_cleaner_obj.output_alpha3 = 'ALPHA3'

In [32]:
# Calling the cleaning again to reflect the new naming convention
df_cleaner = country_cleaner_obj.get_clean_df(df=df_original, column_name='COUNTRY_HEAD')
df_cleaner

Normalizing countries...100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 524.09it/s]


Unnamed: 0,NAME,COUNTRY_HEAD,COUNTRY_NAME,ALPHA2,ALPHA3
0,Bechel *Australia (Services) Pty Ltd,au,AUSTRALIA,AU,AUS
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,Australia,AUSTRALIA,AU,AUS
2,meo - serviços de comunicação e multimedia SA...,PT,PORTUGAL,PT,PRT
3,"Glass Coatings & Concepts ""CBG"" LLC",United States,UNITED STATES OF AMERICA,US,USA
4,"Brault Loisirs, Orl. SARL",FR,FRANCE,FR,FRA
5,"Cole & Brothers Fabric, Services LLC.",uk,,,
6,StarCOM Group Servizi **CAT** SRL,italy,ITALY,IT,ITA
7,Wolbeck (Archer Daniels) *Unified* GmbH,de,GERMANY,DE,DEU
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb,UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN I...,GB,GBR
9,"Susamar-Patino, colectores (adm) SA",ES,SPAIN,ES,ESP


If output_name, output_alpha2 or output_alpha3 coincides with an existing column name in the dataframe, the values of that column are replaced by the cleaned/normalized ones. For instance, the code below replaces the values in the COUNTRY_HEAD column with the alpha2 codes returned by the get_clean_df() method:

In [37]:
country_cleaner_obj.output_alpha2 = 'COUNTRY_HEAD'
df_cleaner = country_cleaner_obj.get_clean_df(df=df_original, column_name='COUNTRY_HEAD')
df_cleaner

Normalizing countries...100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 478.48it/s]


Unnamed: 0,NAME,COUNTRY_NAME,COUNTRY_HEAD,ALPHA3
0,Bechel *Australia (Services) Pty Ltd,AUSTRALIA,AU,AUS
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,AUSTRALIA,AU,AUS
2,meo - serviços de comunicação e multimedia SA...,PORTUGAL,PT,PRT
3,"Glass Coatings & Concepts ""CBG"" LLC",UNITED STATES OF AMERICA,US,USA
4,"Brault Loisirs, Orl. SARL",FRANCE,FR,FRA
5,"Cole & Brothers Fabric, Services LLC.",,,
6,StarCOM Group Servizi **CAT** SRL,ITALY,IT,ITA
7,Wolbeck (Archer Daniels) *Unified* GmbH,GERMANY,DE,DEU
8,"Anheuser-BUSCH, Brothers (food services), LLC",UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN I...,GB,GBR
9,"Susamar-Patino, colectores (adm) SA",SPAIN,ES,ESP
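For comparison, the manual approach mentioned at the start of this section (one lookup per row) can be sketched as follows. The `lookup` function, its two-entry table and the company names are hypothetical stand-ins so that the block runs on its own. In your own code, a call to get_clean_data() would take the place of `lookup`, with extra handling for a 'None' result.

```python
import pandas as pd

# Hypothetical stand-in table so the block runs on its own (not the library's data)
_TABLE = {
    'pt': {'iso_name': 'PORTUGAL', 'iso_alpha2': 'PT', 'iso_alpha3': 'PRT'},
    'fr': {'iso_name': 'FRANCE', 'iso_alpha2': 'FR', 'iso_alpha3': 'FRA'},
}
_EMPTY = {'iso_name': None, 'iso_alpha2': None, 'iso_alpha3': None}


def lookup(value):
    """Return the record for a code, or a record of missing values when the lookup fails."""
    if not isinstance(value, str):
        return dict(_EMPTY)
    return dict(_TABLE.get(value.strip().lower(), _EMPTY))


# Illustrative input data
df = pd.DataFrame({
    'NAME': ['Company A', 'Company B', 'Company C'],
    'COUNTRY_HEAD': ['PT', 'FR', 'uk'],
})

# One lookup per row, collected into a dataframe that shares the original index
df_clean = pd.DataFrame([lookup(value) for value in df['COUNTRY_HEAD']], index=df.index)

# Put the new columns next to the original ones
df_manual = pd.concat([df, df_clean], axis=1)
print(df_manual)
```

The get_clean_df() method spares you this boilerplate and also takes care of naming the output columns.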
