# How to... normalize country information

This notebook shows how to use the **CountryCleaner** class to normalize country information according to the ISO 3166 standard. You can use this class in two different ways:
1. [to normalize single instances of text with country information](#string_values)
2. [to normalize country information in tabular data](#df)

Whichever approach you choose, you first need to import the **CountryCleaner()** class, which is available in the **financial_entity_cleaner.location** package, and create an object from it. This notebook also shows how you can customize the behaviour of this class to adapt the cleaning to your own needs.

In [1]:
# Sets up the location of the financial-entity-cleaner library relative to this notebook 
import sys
sys.path.append('../../')

In [2]:
# Import CountryCleaner
from financial_entity_cleaner.location import CountryCleaner

In [3]:
# Create an object based on the CountryCleaner() class
country_cleaner_obj = CountryCleaner()

## 1. Normalizing single instances of text with country information <a id="string_values"></a>

To normalize country information in a single string value, use the **clean()** method. This method receives only one parameter: a string that contains country information, such as a code or a name. The examples below show three cases in which this method successfully returns a dictionary with the standardized country information:

In [4]:
# Get the complete information about a country given an alpha2 code
country_info_dict = country_cleaner_obj.clean('PT')
print('Country information: {}'.format(country_info_dict))

Country information: {'code': '620', 'name': 'portugal', 'alpha2': 'pt', 'alpha3': 'prt'}


In [5]:
# Get the complete information about a country given an alpha3 code
country_info_dict = country_cleaner_obj.clean('BRA')
print('Country information: {}'.format(country_info_dict))

Country information: {'code': '76', 'name': 'brazil', 'alpha2': 'br', 'alpha3': 'bra'}


In [6]:
# Get the complete information about a country given a name
country_info_dict = country_cleaner_obj.clean(' China ')
print('Country information: {}'.format(country_info_dict))

Country information: {'code': '156', 'name': 'china', 'alpha2': 'cn', 'alpha3': 'chn'}


Now, let's see three cases in which the search fails:
- when the value passed as parameter cannot be found as a country code or name,
- when the value passed as parameter is not a string, and
- when the value passed as parameter is too short to be a country code or name.

In [7]:
# Testing with a string value that is not a country code or name.
print(country_cleaner_obj.clean('123'))

None


In [8]:
# Testing with a value that is not a string.
print(country_cleaner_obj.clean(123))

None


In [9]:
# Testing with a value that is too short to be a country name or code
print(country_cleaner_obj.clean('u'))

None


In the cases above, the **clean()** method returned None, which indicates that the search failed. In some situations, however, you may prefer to receive an exception instead. To get that behaviour, change the **operation mode** of the **CountryCleaner()** class, which is set by **default** to **SILENT_MODE**. The code below shows how to do that:

In [10]:
country_cleaner_obj.mode = CountryCleaner.EXCEPTION_MODE

<div class="alert alert-block alert-danger">
<b>EXCEPTION MODE:</b> Now, the code below will throw a customized exception because the parameter of the clean() method is not a string.
</div>

In [11]:
# Trigger an exception if a value is not a string.
print(country_cleaner_obj.clean(123))

CountryIsNotAString: Financial-Entity-Cleaner (Error) - The input data <123> is not a string.

<div class="alert alert-block alert-danger">
<b>EXCEPTION MODE:</b> Another exception is shown because clean() couldn't locate the country.
</div>

In [12]:
# Trigger an exception if the country is not found.
print(country_cleaner_obj.clean('uk'))

CountryNotFound: Financial-Entity-Cleaner (Error) - The country <uk> was not found.

<div class="alert alert-block alert-danger">
<b>EXCEPTION MODE:</b> The last case below shows an exception because the input parameter is too short to be a country code or name.
</div>

In [13]:
# Testing with a value that is too short to be a country name or code
print(country_cleaner_obj.clean('u'))

CountryInputDataTooSmall: Financial-Entity-Cleaner (Error) - The input data <u> is too short to be a country data.
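In EXCEPTION_MODE, calling code would normally wrap clean() in a try/except block. The sketch below shows one way to do that. Since this notebook does not show where the library's exception classes are imported from, the sketch catches the generic Exception and reports the class name. The helper `try_clean`, the function `fake_clean` and the local `CountryNotFound` class are hypothetical stand-ins used only to keep the example self-contained; they are not part of the library.

```python
def try_clean(clean_fn, value):
    """Call clean_fn(value) and return a (result, error_name) pair.

    clean_fn is any callable that behaves like country_cleaner_obj.clean
    in EXCEPTION_MODE: it returns a dict on success and raises on failure.
    """
    try:
        return clean_fn(value), None
    except Exception as exc:
        # Deliberately broad: narrow this to the library's own exception
        # classes once you know where to import them from.
        return None, type(exc).__name__


# --- Hypothetical stand-ins, for illustration only ---
class CountryNotFound(Exception):
    pass


def fake_clean(value):
    # Mimics a cleaner in EXCEPTION_MODE for two sample inputs.
    if value == 'PT':
        return {'code': '620', 'name': 'portugal', 'alpha2': 'pt', 'alpha3': 'prt'}
    raise CountryNotFound('The country <{}> was not found.'.format(value))


for raw_value in ['PT', 'uk']:
    result, error_name = try_clean(fake_clean, raw_value)
    print(raw_value, result, error_name)
```

With a real cleaner you would pass `country_cleaner_obj.clean` as the first argument instead of `fake_clean`.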

<div class="alert alert-block alert-success">
<b>SILENT MODE:</b> You can set up your cleaning object back to SILENT_MODE as shown below:
</div>

In [14]:
country_cleaner_obj.mode = CountryCleaner.SILENT_MODE

In [15]:
# Back in SILENT_MODE, a failed search returns None instead of raising an exception.
print(country_cleaner_obj.clean('123'))

None
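Because a failed search returns None in SILENT_MODE, code that reads a field from the result needs to guard against it before indexing. A minimal sketch of such a guard follows; the helper name `field_or_default` is an assumption for illustration, and the sample dictionary is copied from the output shown earlier in this notebook.

```python
def field_or_default(country_info, key, default=None):
    """Return country_info[key], or default when the search failed.

    country_info is whatever clean() returned in SILENT_MODE:
    a dictionary on success or None on failure.
    """
    if country_info is None:
        return default
    return country_info.get(key, default)


# Result of a successful search (copied from the 'PT' example above)
found = {'code': '620', 'name': 'portugal', 'alpha2': 'pt', 'alpha3': 'prt'}
# Result of a failed search in SILENT_MODE
not_found = None

print(field_or_default(found, 'alpha3'))
print(field_or_default(not_found, 'alpha3', default='unknown'))
```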


Another important property of the CountryCleaner() class is **letter_case**, which defines whether the output of the country search is returned in lower, upper or title case. By default, the letter case is 'lower'. To change it, set the letter_case property to **'lower'**, **'upper'** or **'title'**, or use the built-in class constants LOWER_LETTER_CASE, UPPER_LETTER_CASE and TITLE_LETTER_CASE, as shown below:

In [16]:
# Set up the resultant letter case
country_cleaner_obj.letter_case = CountryCleaner.UPPER_LETTER_CASE

In [17]:
# Get the complete information about a country given an alpha2 code
country_info_dict = country_cleaner_obj.clean('us')
print('Complete country information: {}'.format(country_info_dict))

Complete country information: {'code': '840', 'name': 'UNITED STATES OF AMERICA', 'alpha2': 'US', 'alpha3': 'USA'}
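To get a feel for the three letter cases, the sketch below applies Python's built-in string methods to a country name. It assumes the library's 'lower', 'upper' and 'title' options behave like `str.lower`, `str.upper` and `str.title`, which this notebook does not confirm; `apply_letter_case` is a hypothetical helper, not a library function.

```python
def apply_letter_case(text, letter_case='lower'):
    """Convert text using plain Python string methods.

    Assumption: this mirrors the effect of the letter_case property;
    the library's actual implementation may differ.
    """
    converters = {
        'lower': str.lower,
        'upper': str.upper,
        'title': str.title,
    }
    if letter_case not in converters:
        raise ValueError('Unknown letter case: {}'.format(letter_case))
    return converters[letter_case](text)


for case in ('lower', 'upper', 'title'):
    print(case, '->', apply_letter_case('United States of America', case))
```

Note that `str.title` capitalizes every word, including short ones such as "of".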


Finally, you may want to change the dictionary keys returned by the country search. Instead of calling them 'code', 'name', 'alpha2' and 'alpha3' you can define other names. This can be useful when cleaning up several columns in a dataframe. The code below shows how you can do that by changing the properties **output_code**, **output_name**, **output_alpha2** and **output_alpha3** of the CountryCleaner() object:

In [18]:
country_cleaner_obj.output_code = 'clean_code'
country_cleaner_obj.output_name = 'clean_name'
country_cleaner_obj.output_alpha2 = 'clean_alpha2'
country_cleaner_obj.output_alpha3 = 'clean_alpha3'

In [19]:
# Get the complete information about a country given an alpha2 code
country_info_dict = country_cleaner_obj.clean('US')
print('Complete country information: {}'.format(country_info_dict))

Complete country information: {'clean_code': '840', 'clean_name': 'UNITED STATES OF AMERICA', 'clean_alpha2': 'US', 'clean_alpha3': 'USA'}


You can reset these output names at any time by calling the **reset_output_names()** method of your CountryCleaner() object:

In [20]:
country_cleaner_obj.reset_output_names()

In [21]:
# Get the complete information about a country given an alpha2 code
country_info_dict = country_cleaner_obj.clean('US')
print('Complete country information: {}'.format(country_info_dict))

Complete country information: {'code': '840', 'name': 'UNITED STATES OF AMERICA', 'alpha2': 'US', 'alpha3': 'USA'}
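If you would rather leave the cleaner's output names at their defaults, the same renaming can be done afterwards on the returned dictionary. The sketch below is plain Python and does not use the library; `rename_keys` is a hypothetical helper, and the sample dictionary is copied from the output above.

```python
def rename_keys(country_info, new_names):
    """Return a copy of country_info with its keys renamed.

    Keys that do not appear in new_names are kept unchanged.
    """
    return {new_names.get(key, key): value for key, value in country_info.items()}


# Dictionary as returned by clean('US') with the default output names
country_info = {'code': '840', 'name': 'UNITED STATES OF AMERICA',
                'alpha2': 'US', 'alpha3': 'USA'}

renamed = rename_keys(country_info, {
    'code': 'clean_code',
    'name': 'clean_name',
    'alpha2': 'clean_alpha2',
    'alpha3': 'clean_alpha3',
})
print(renamed)
```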


## 2. Normalizing country information on pandas dataframes <a id="df"></a>

A more realistic scenario is that your data is in tabular format and you already use the pandas library to work with it. You could write your own code that iterates over the dataframe and calls the clean() method on each value, but the financial-entity-cleaner library makes this task easier. The CountryCleaner() class provides the **clean_df()** method, which normalizes the countries stored in dataframe columns. The code below shows how to apply this method:

In [22]:
# Import the pandas library to read data from a .csv file
import pandas as pd

In [23]:
# Read the .csv file as a pandas dataframe object
input_filename = '../../tests/data/test_cleaner.csv'
df_original = pd.read_csv(
    input_filename,
    sep=';',
    encoding='utf-8',
    usecols=['NAME', 'COUNTRY_HEAD', 'COUNTRY_BRANCH'],
)
df_original

Unnamed: 0,NAME,COUNTRY_HEAD,COUNTRY_BRANCH
0,Bechel *Australia (Services) Pty Ltd,au,au
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,Australia,au
2,meo - serviços de comunicação e multimedia SA...,PT,PT
3,"Glass Coatings & Concepts ""CBG"" LLC",United States,us
4,"Brault Loisirs, Orl. SARL",FR,France
5,"Cole & Brothers Fabric, Services LLC.",uk,uk
6,StarCOM Group Servizi **CAT** SRL,italy,ita
7,Wolbeck (Archer Daniels) *Unified* GmbH,de,de
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb,gb
9,"Susamar-Patino, colectores (adm) SA",ES,spain


In [24]:
# Let's set up the output in upper case
country_cleaner_obj.letter_case = 'upper'

In [25]:
# Let's perform the cleaning on the dataframe by using the 'COUNTRY_HEAD' column
df_cleaner = country_cleaner_obj.clean_df(df=df_original, cols=['COUNTRY_HEAD'])
df_cleaner

Cleaning column [COUNTRY_HEAD]: 100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 314.45it/s]


Unnamed: 0,NAME,COUNTRY_HEAD,COUNTRY_BRANCH,COUNTRY_HEAD_code,COUNTRY_HEAD_name,COUNTRY_HEAD_alpha2,COUNTRY_HEAD_alpha3
0,Bechel *Australia (Services) Pty Ltd,au,au,36.0,AUSTRALIA,AU,AUS
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,Australia,au,36.0,AUSTRALIA,AU,AUS
2,meo - serviços de comunicação e multimedia SA...,PT,PT,620.0,PORTUGAL,PT,PRT
3,"Glass Coatings & Concepts ""CBG"" LLC",United States,us,840.0,UNITED STATES OF AMERICA,US,USA
4,"Brault Loisirs, Orl. SARL",FR,France,250.0,FRANCE,FR,FRA
5,"Cole & Brothers Fabric, Services LLC.",uk,uk,,,,
6,StarCOM Group Servizi **CAT** SRL,italy,ita,380.0,ITALY,IT,ITA
7,Wolbeck (Archer Daniels) *Unified* GmbH,de,de,276.0,GERMANY,DE,DEU
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb,gb,826.0,UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN I...,GB,GBR
9,"Susamar-Patino, colectores (adm) SA",ES,spain,724.0,SPAIN,ES,ESP


The results above make it easy to spot where the country search failed: the new columns hold NaN for that row (row 5, where the input was 'uk'). You can therefore easily identify the countries that were not normalized.
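In a larger dataframe you would not scan for NaN by eye. A minimal sketch of how to select the failed rows with pandas follows. It builds a small dataframe by hand, with values copied from two rows of the output above, so that it runs without the library; the name `df_example` is an assumption for illustration.

```python
import pandas as pd

# Hand-built stand-in for part of the cleaned dataframe shown above
df_example = pd.DataFrame({
    'NAME': ['Brault Loisirs, Orl. SARL', 'Cole & Brothers Fabric, Services LLC.'],
    'COUNTRY_HEAD': ['FR', 'uk'],
    'COUNTRY_HEAD_alpha2': ['FR', None],
})

# Rows where the search failed have a missing value in the output column
not_normalized = df_example[df_example['COUNTRY_HEAD_alpha2'].isna()]
print(not_normalized[['NAME', 'COUNTRY_HEAD']])
```

The same filter should apply to the real result by replacing `df_example` with `df_cleaner`.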

Another important thing to notice is that the cleaning process created four new columns to store the cleaning attributes code, name, alpha2 and alpha3. Their names combine the original column name with the values of the properties output_code, output_name, output_alpha2 and output_alpha3. The code below changes two of these names, uses the **output_info** property to return only the alpha2 code and the name, and asks **clean_df()** to use the names as a prefix of the new columns.

In [26]:
# Changing the output column names
country_cleaner_obj.output_name = 'CLEANED_NAME'
country_cleaner_obj.output_alpha2 = 'CLEANED_ALPHA2'

In [27]:
country_cleaner_obj.output_info = [CountryCleaner.ATTRIBUTE_ALPHA2, CountryCleaner.ATTRIBUTE_NAME]

In [28]:
df_cleaner = country_cleaner_obj.clean_df(df=df_original, cols=['COUNTRY_HEAD'], output_names_as='prefix')
df_cleaner

Cleaning column [COUNTRY_HEAD]: 100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 423.31it/s]


Unnamed: 0,NAME,COUNTRY_HEAD,COUNTRY_BRANCH,CLEANED_NAME_COUNTRY_HEAD,CLEANED_ALPHA2_COUNTRY_HEAD
0,Bechel *Australia (Services) Pty Ltd,au,au,AUSTRALIA,AU
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,Australia,au,AUSTRALIA,AU
2,meo - serviços de comunicação e multimedia SA...,PT,PT,PORTUGAL,PT
3,"Glass Coatings & Concepts ""CBG"" LLC",United States,us,UNITED STATES OF AMERICA,US
4,"Brault Loisirs, Orl. SARL",FR,France,FRANCE,FR
5,"Cole & Brothers Fabric, Services LLC.",uk,uk,,
6,StarCOM Group Servizi **CAT** SRL,italy,ita,ITALY,IT
7,Wolbeck (Archer Daniels) *Unified* GmbH,de,de,GERMANY,DE
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb,gb,UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN I...,GB
9,"Susamar-Patino, colectores (adm) SA",ES,spain,SPAIN,ES


To clean several columns of the dataframe at the same time, add their names to the cols argument. You can also ask for the original country columns to be removed, as shown in the code below:

In [29]:
country_cleaner_obj.output_info = [CountryCleaner.ATTRIBUTE_ALPHA2]

In [30]:
df_cleaner = country_cleaner_obj.clean_df(df=df_original,
                                          cols=['COUNTRY_HEAD', 'COUNTRY_BRANCH'],
                                          remove_cols=True,
                                          output_names_as='suffix')
df_cleaner

Cleaning column [COUNTRY_BRANCH]: 100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 262.06it/s]


Unnamed: 0,NAME,COUNTRY_HEAD_CLEANED_ALPHA2,COUNTRY_BRANCH_CLEANED_ALPHA2
0,Bechel *Australia (Services) Pty Ltd,AU,AU
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,AU,AU
2,meo - serviços de comunicação e multimedia SA...,PT,PT
3,"Glass Coatings & Concepts ""CBG"" LLC",US,US
4,"Brault Loisirs, Orl. SARL",FR,FR
5,"Cole & Brothers Fabric, Services LLC.",,
6,StarCOM Group Servizi **CAT** SRL,IT,IT
7,Wolbeck (Archer Daniels) *Unified* GmbH,DE,DE
8,"Anheuser-BUSCH, Brothers (food services), LLC",GB,GB
9,"Susamar-Patino, colectores (adm) SA",ES,ES
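The naming pattern of the new columns can be summarized in a few lines of Python. The sketch below is reconstructed from the column headers in the outputs above and is not the library's implementation; `output_column` is a hypothetical helper, and treating 'suffix' as the default is inferred from the first clean_df() call, which did not pass output_names_as.

```python
def output_column(source_col, output_name, output_names_as='suffix'):
    """Build the name of a new column the way the outputs above suggest.

    'suffix' appends the output name to the source column name,
    'prefix' puts it in front.
    """
    if output_names_as == 'suffix':
        return '{}_{}'.format(source_col, output_name)
    if output_names_as == 'prefix':
        return '{}_{}'.format(output_name, source_col)
    raise ValueError("output_names_as must be 'prefix' or 'suffix'")


print(output_column('COUNTRY_HEAD', 'code'))
print(output_column('COUNTRY_HEAD', 'CLEANED_NAME', 'prefix'))
print(output_column('COUNTRY_BRANCH', 'CLEANED_ALPHA2', 'suffix'))
```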
