# How to...normalize country information

This notebook shows how to use the **financial_entity_cleaner.country.iso3166** module to normalize country information (alpha2 code, alpha3 code and official name) according to the ISO 3166 standard. You can use this module in three ways:
1. [by normalizing string values, one by one](#string_values)
2. [by normalizing a pandas dataframe that contains column(s) with country information (name, code or a mix of both)](#df)
3. [by normalizing a csv file that contains column(s) with country information (name, code or a mix of both)](#csv)

Whichever approach you choose, you need to import the **CountryCleaner()** class from the **financial_entity_cleaner.country.iso3166** module and create an object from it. This notebook also shows how to customize the behaviour of this class to suit your own cleaning needs.

In [36]:
# Sets up the location of the financial-entity-cleaner library relative to this notebook 
import sys
sys.path.append('../../')

In [2]:
# Import the CountryCleaner() class to normalize country information
from financial_entity_cleaner.country.iso3166 import CountryCleaner

In [37]:
# Create an object based on the CountryCleaner() class to clean string values, dataframes or .csv files
country_cleaner_obj = CountryCleaner()

## 1. Normalizing country information on single string values <a id="string_values"></a>

To normalize a single string value, use the **get_info()** method. It takes one parameter: a string containing any country information, such as a code or a name. The examples below show three successful calls to get_info(), each returning a dictionary with the following standardized country information as defined in ISO 3166:
- iso_name: official ISO country name, 
- iso_alpha2: official ISO code with two characters, and
- iso_alpha3: official ISO code with three characters.

In [4]:
# Get the complete information about a country given an alpha2 code
country_info_dict = country_cleaner_obj.get_info('PT')
print('Country information: {}'.format(country_info_dict))

Country information: {'iso_name': 'portugal', 'iso_alpha2': 'pt', 'iso_alpha3': 'prt'}


In [5]:
# Get the complete information about a country given an alpha3 code
country_info_dict = country_cleaner_obj.get_info('BRA')
print('Country information: {}'.format(country_info_dict))

Country information: {'iso_name': 'brazil', 'iso_alpha2': 'br', 'iso_alpha3': 'bra'}


In [6]:
# Get the complete information about a country given a name
country_info_dict = country_cleaner_obj.get_info(' China ')
print('Country information: {}'.format(country_info_dict))

Country information: {'iso_name': 'china', 'iso_alpha2': 'cn', 'iso_alpha3': 'chn'}


Now, let's see three cases in which the search fails:
- when the value passed as parameter is not a known country code or name,
- when the value passed as parameter is not a string, and
- when the value passed as parameter is too short to be a country code or name.

In [7]:
# Testing with a string value that is not a country code or name.
print(country_cleaner_obj.get_info('123'))

None


In [8]:
# Testing with a value that is not a string.
print(country_cleaner_obj.get_info(123))

None


In [9]:
# Testing with a value that is too short to be a country name or code
print(country_cleaner_obj.get_info('u'))

None
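
Because a failed search returns 'None' in SILENT_MODE, calling code usually needs a guard before reading the dictionary keys. The sketch below shows one way to do that. It does not import the library: `fake_get_info` is a hypothetical stand-in that only mimics the return contract shown above (a dictionary on success, 'None' on failure), and `alpha3_or_default` is an illustrative helper name, not part of financial-entity-cleaner.

```python
def fake_get_info(value):
    """Hypothetical stand-in that mimics get_info() in SILENT_MODE."""
    known = {
        "pt": {"iso_name": "portugal", "iso_alpha2": "pt", "iso_alpha3": "prt"},
    }
    if not isinstance(value, str):
        return None
    return known.get(value.strip().lower())


def alpha3_or_default(get_info, value, default="unknown"):
    """Return the alpha3 code, or `default` when the lookup returns None."""
    info = get_info(value)
    if info is None:
        return default
    return info["iso_alpha3"]


print(alpha3_or_default(fake_get_info, "PT"))   # prt
print(alpha3_or_default(fake_get_info, "123"))  # unknown
print(alpha3_or_default(fake_get_info, 123))    # unknown
```

With the real library, you would pass `country_cleaner_obj.get_info` as the first argument instead of the stand-in.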


In the cases above, the **get_info()** method returned 'None', which indicates that the search failed. In some cases, however, you may prefer to receive an exception instead. To do so, change the **operation mode** of the **CountryCleaner()** class, which is set by **default** to **SILENT_MODE**. The code below imports the modes of operation from the **utils.lib** module and uses them to set the **mode property** of the CountryCleaner() object.

In [10]:
# Import the modes from utils.lib
from financial_entity_cleaner.utils.lib import ModeOfUse

In [11]:
country_cleaner_obj.mode = ModeOfUse.EXCEPTION_MODE

<font color=red>**Now, the code below will throw an exception because the parameter of get_info() method is not a string.**</font>

In [12]:
# Trigger an exception if a value is not a string.
print(country_cleaner_obj.get_info(123))

CountryIsNotAString: FinancialCleanerError - The input data <123> is not a string.

<font color=red>**Another exception is shown because get_info() couldn't locate the country.**</font>

In [14]:
# Trigger an exception if the country is not found.
print(country_cleaner_obj.get_info('123'))

CountryNotFound: FinancialCleanerError - The country <123> was not found.

<font color=red>**The last case below shows an exception because the input parameter is too short to be a country code or name.**</font>

In [13]:
# Testing with a value that is too short to be a country name or code
print(country_cleaner_obj.get_info('u'))

CountryInputDataTooSmall: FinancialCleanerError - The input data <u> is too small to be a country data.
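
In EXCEPTION_MODE, uncaught errors stop the notebook cell, so you will typically wrap the call in a try/except block. The sketch below illustrates the pattern under stated assumptions: the notebook does not show where the exception classes can be imported from, so the code catches `Exception` broadly and reports the class name. `CountryNotFound` is redefined locally as a hypothetical stand-in, and `fake_get_info_strict` and `lookup_or_report` are illustrative names only.

```python
class CountryNotFound(Exception):
    """Hypothetical local stand-in; the real class is defined by the library."""


def fake_get_info_strict(value):
    """Hypothetical stand-in that mimics get_info() in EXCEPTION_MODE."""
    if value != "PT":
        raise CountryNotFound("The country <{}> was not found.".format(value))
    return {"iso_name": "portugal", "iso_alpha2": "pt", "iso_alpha3": "prt"}


def lookup_or_report(get_info, value):
    """Return (info, None) on success or (None, error_class_name) on failure."""
    try:
        return get_info(value), None
    except Exception as error:  # broad on purpose: the import path is an unknown here
        return None, type(error).__name__


print(lookup_or_report(fake_get_info_strict, "PT"))
print(lookup_or_report(fake_get_info_strict, "123"))
```

If you know the import location of the library's exception classes, catching them by name is preferable to a broad `except Exception`.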

You can switch your cleaning object back to SILENT_MODE as shown below:

In [15]:
country_cleaner_obj.mode = ModeOfUse.SILENT_MODE

In [16]:
# Back in SILENT_MODE, a failed search returns None instead of raising an exception.
print(country_cleaner_obj.get_info('123'))

None


Another important property of the CountryCleaner() class is **letter_case**, which defines whether the output of the country search is returned in lower, upper or title case. By default, the letter case is 'lower'. To change it, set the letter_case property to **'lower'**, **'upper'** or **'title'** as shown below:

In [17]:
# Set up the resultant letter case
country_cleaner_obj.letter_case = 'upper'

In [18]:
# Get the complete information about a country given an alpha2 code
country_info_dict = country_cleaner_obj.get_info('us')
print('Complete country information: {}'.format(country_info_dict))

Complete country information: {'iso_name': 'UNITED STATES OF AMERICA', 'iso_alpha2': 'US', 'iso_alpha3': 'USA'}
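
To make the three letter_case options concrete, the sketch below applies the equivalent plain-Python string methods to a country name. This is only an illustration of what 'lower', 'upper' and 'title' mean. It is not the library's implementation, and `apply_letter_case` is a hypothetical helper name.

```python
def apply_letter_case(text, letter_case="lower"):
    """Convert `text` using one of the three letter-case options."""
    converters = {"lower": str.lower, "upper": str.upper, "title": str.title}
    if letter_case not in converters:
        raise ValueError("letter_case must be 'lower', 'upper' or 'title'")
    return converters[letter_case](text)


for option in ("lower", "upper", "title"):
    print(option, "->", apply_letter_case("United States of America", option))
```

Note that Python's `str.title()` capitalizes every word, including "Of". The notebook does not show a 'title' output, so whether the library behaves the same way is an open assumption.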


Finally, you may want to change the dictionary keys returned by the country search. Instead of calling them 'iso_name', 'iso_alpha2' and 'iso_alpha3' you can define other names. The code below shows how you can do that by changing the properties **output_name**, **output_alpha2** and **output_alpha3** of the CountryCleaner() object:

In [19]:
country_cleaner_obj.output_name = 'clean_name'
country_cleaner_obj.output_alpha2 = 'clean_alpha2'
country_cleaner_obj.output_alpha3 = 'clean_alpha3'

In [20]:
# Get the complete information about a country given an alpha2 code
country_info_dict = country_cleaner_obj.get_info('US')
print('Complete country information: {}'.format(country_info_dict))

Complete country information: {'clean_name': 'UNITED STATES OF AMERICA', 'clean_alpha2': 'US', 'clean_alpha3': 'USA'}


You can reset these output names at any time by calling the **reset_output_names()** method of your CountryCleaner() object:

In [21]:
country_cleaner_obj.reset_output_names()

In [22]:
# Get the complete information about a country given an alpha2 code
country_info_dict = country_cleaner_obj.get_info('US')
print('Complete country information: {}'.format(country_info_dict))

Complete country information: {'iso_name': 'UNITED STATES OF AMERICA', 'iso_alpha2': 'US', 'iso_alpha3': 'USA'}
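
As an alternative to changing output_name, output_alpha2 and output_alpha3 on the cleaner, you can keep the default keys and rename them at the point of use. The sketch below does this with a plain dictionary remap. `rename_keys` is a hypothetical helper, and the input dictionary is copied from the output shown above rather than produced by the library.

```python
def rename_keys(info, key_map):
    """Return a copy of `info` whose keys are renamed according to `key_map`."""
    return {key_map.get(key, key): value for key, value in info.items()}


# Dictionary copied from the get_info('US') output above (upper case).
info = {
    "iso_name": "UNITED STATES OF AMERICA",
    "iso_alpha2": "US",
    "iso_alpha3": "USA",
}
key_map = {
    "iso_name": "clean_name",
    "iso_alpha2": "clean_alpha2",
    "iso_alpha3": "clean_alpha3",
}

print(rename_keys(info, key_map))
```

Keys that are missing from `key_map` are left unchanged, so a partial mapping is also fine.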


## 2. Normalizing country information on pandas dataframes <a id="df"></a>

A more realistic scenario is that your data is in a tabular format and you are already using the pandas library to work with it. You could write your own code to iterate over the dataframe and call get_info() on each value, but the financial-entity-cleaner library makes this task easier. The CountryCleaner() class provides the **get_clean_df()** method, which normalizes the countries stored in a dataframe column. The code below shows how to apply this method:

In [23]:
# Import the pandas library to read data from a .csv file
import pandas as pd

In [24]:
# Read the .csv file as a pandas dataframe object
input_filename = '../../tests/data/test_cleaner_country.csv'
df_original = pd.read_csv(input_filename, sep=',', encoding='utf-8')

In [25]:
# Check the dataframe content, which contains four columns that can be used to look for the official country info
df_original

Unnamed: 0,COUNTRY,ALPHA2,ALPHA3,NAME
0,NC,nc,ncl,new caledonia
1,Fr,fr,fra,france
2,CA,ca,can,canada
3,IT,it,ita,italy
4,es,es,esp,spain
5,ZZ,,,
6,XZ,,,
7,usa,us,usa,united states of america
8,Arg,ar,arg,argentina
9,AUS,au,aus,australia


In [26]:
# Let's set up the output in upper case
country_cleaner_obj.letter_case = 'upper'

In [27]:
# Let's perform the cleaning on the dataframe by using the column named as 'COUNTRY'
df_cleaner = country_cleaner_obj.get_clean_df(df=df_original, column_name='COUNTRY')
df_cleaner

Normalizing countries...100%|██████████████████████████████████████████████████| 18/18 [00:00<00:00, 189.58it/s]


Unnamed: 0,COUNTRY,ALPHA2,ALPHA3,NAME,iso_name,iso_alpha2,iso_alpha3
0,NC,nc,ncl,new caledonia,NEW CALEDONIA,NC,NCL
1,Fr,fr,fra,france,FRANCE,FR,FRA
2,CA,ca,can,canada,CANADA,CA,CAN
3,IT,it,ita,italy,ITALY,IT,ITA
4,es,es,esp,spain,SPAIN,ES,ESP
5,ZZ,,,,,,
6,XZ,,,,,,
7,usa,us,usa,united states of america,UNITED STATES OF AMERICA,US,USA
8,Arg,ar,arg,argentina,ARGENTINA,AR,ARG
9,AUS,au,aus,australia,AUSTRALIA,AU,AUS


In the results above, it is easy to spot where the country search failed, because the library leaves the output columns empty (NaN) for those rows (rows 5 and 6 above, along with rows 10 and 16 of the full 18-row dataset). You can therefore quickly identify the countries that were not normalized.
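
Since failed searches appear as missing values, a standard pandas filter is enough to list them. The sketch below builds a small dataframe by hand, using a few rows copied from the output above, and selects the rows whose `iso_name` is missing. The name `df_sample` is illustrative. With real data you would apply the same filter to the dataframe returned by get_clean_df().

```python
import pandas as pd

# A few rows copied from the cleaned output above; None marks a failed search.
df_sample = pd.DataFrame(
    {
        "COUNTRY": ["NC", "ZZ", "XZ", "usa"],
        "iso_name": ["NEW CALEDONIA", None, None, "UNITED STATES OF AMERICA"],
        "iso_alpha2": ["NC", None, None, "US"],
        "iso_alpha3": ["NCL", None, None, "USA"],
    }
)

# Keep only the rows where the normalized name is missing.
failed_rows = df_sample[df_sample["iso_name"].isna()]
print(failed_rows["COUNTRY"].tolist())
```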

Also notice that the cleaning process created three new columns: iso_name, iso_alpha2 and iso_alpha3. These names are defined by the properties output_name, output_alpha2 and output_alpha3. Be aware that the output names do not follow the letter_case property: each column gets exactly the name set in the corresponding property. By default, the names are in lower case: iso_name, iso_alpha2 and iso_alpha3. The code below shows how you can change them:

In [28]:
# Changing the output column names
country_cleaner_obj.output_name = 'CLEAN_COUNTRY'
country_cleaner_obj.output_alpha2 = 'CLEAN_ALPHA2'
country_cleaner_obj.output_alpha3 = 'CLEAN_ALPHA3'

In [29]:
# Calling the cleaning again to reflect the new naming convention
df_cleaner = country_cleaner_obj.get_clean_df(df=df_original, column_name='COUNTRY')
df_cleaner

Normalizing countries...100%|██████████████████████████████████████████████████| 18/18 [00:00<00:00, 181.92it/s]


Unnamed: 0,COUNTRY,ALPHA2,ALPHA3,NAME,CLEAN_COUNTRY,CLEAN_ALPHA2,CLEAN_ALPHA3
0,NC,nc,ncl,new caledonia,NEW CALEDONIA,NC,NCL
1,Fr,fr,fra,france,FRANCE,FR,FRA
2,CA,ca,can,canada,CANADA,CA,CAN
3,IT,it,ita,italy,ITALY,IT,ITA
4,es,es,esp,spain,SPAIN,ES,ESP
5,ZZ,,,,,,
6,XZ,,,,,,
7,usa,us,usa,united states of america,UNITED STATES OF AMERICA,US,USA
8,Arg,ar,arg,argentina,ARGENTINA,AR,ARG
9,AUS,au,aus,australia,AUSTRALIA,AU,AUS
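
For comparison, the manual approach mentioned at the start of this section (iterating over the dataframe and calling get_info() on each value) could look roughly like the sketch below. It is an illustration of the idea, not how get_clean_df() is implemented. `fake_get_info` is a hypothetical stand-in for the library's lookup, and `clean_column` is an illustrative helper name.

```python
import pandas as pd


def fake_get_info(value):
    """Hypothetical stand-in for get_info(): returns a dict or None."""
    known = {
        "fr": {"iso_name": "france", "iso_alpha2": "fr", "iso_alpha3": "fra"},
        "ca": {"iso_name": "canada", "iso_alpha2": "ca", "iso_alpha3": "can"},
    }
    if not isinstance(value, str):
        return None
    return known.get(value.strip().lower())


def clean_column(df, column_name, get_info):
    """Append iso_name, iso_alpha2 and iso_alpha3 columns for `column_name`."""
    keys = ["iso_name", "iso_alpha2", "iso_alpha3"]
    rows = []
    for value in df[column_name]:
        info = get_info(value) or {}  # a failed search yields empty cells
        rows.append({key: info.get(key) for key in keys})
    cleaned = pd.DataFrame(rows, index=df.index, columns=keys)
    return pd.concat([df, cleaned], axis=1)


df = pd.DataFrame({"COUNTRY": ["Fr", "CA", "ZZ"]})
result = clean_column(df, "COUNTRY", fake_get_info)
print(result)
```

With the real library, get_clean_df() does this work in a single call and also shows a progress bar.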


## 3. Normalizing country information on .csv files <a id="csv"></a>

In [30]:
# Import the AutoCleaner() class, which cleans a .csv file according to a settings file
from financial_entity_cleaner.batch.cleaner import AutoCleaner

In [31]:
# Create an AutoCleaner object
auto_cleaner_obj = AutoCleaner()

In [32]:
input_filename = '../../tests/data/test_cleaner_country.csv'

In [33]:
setup_cleaning_filename = '../../tests/data/test_cleaner_country.json'

In [34]:
output_filename = '../../tests/data/test_cleaner_country_result.csv'

In [35]:
auto_cleaner_obj.clean_csv_file(input_filename, setup_cleaning_filename, output_filename)

Reading cleaning settings from ../../tests/data/test_cleaner_country.json
Reading csv file from ../../tests/data/test_cleaner_country.csv
Saving csv output file at ../../tests/data/test_cleaner_country_result.csv


True
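
After a batch run, you may want to check which values were left unnormalized in the result file. The sketch below does this with Python's standard csv module. It relies on assumptions: the output columns depend on the JSON settings file, whose content is not shown in this notebook, so the column names `COUNTRY` and `iso_name` and the in-memory sample data are hypothetical, and `unnormalized_values` is an illustrative helper name.

```python
import csv
import io

# Hypothetical sample standing in for the cleaned output file.
sample_output = io.StringIO(
    "COUNTRY,iso_name,iso_alpha2,iso_alpha3\n"
    "Fr,FRANCE,FR,FRA\n"
    "ZZ,,,\n"
    "CA,CANADA,CA,CAN\n"
)


def unnormalized_values(csv_file, source_column, output_column):
    """List the source values whose normalized output column is empty."""
    reader = csv.DictReader(csv_file)
    return [row[source_column] for row in reader if not row[output_column].strip()]


failed = unnormalized_values(sample_output, "COUNTRY", "iso_name")
print(failed)
```

To run the same check on a real file, open it with `open(output_filename, newline='', encoding='utf-8')` and pass the file object in place of the sample.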