# How to... clean up a company's name

This notebook shows how to use the **financial-entity-cleaner.text.name** module to clean attributes that contain a company's name. The cleaning process relies on two internal, customized dictionaries: 
- **Dictionary of pre-defined cleaning rules**: defines the regex rules to be applied, for example: remove numbers, punctuation, extra spaces, etc.
- **Dictionary of Legal Terms**: defines the replacement rules used to normalize business legal forms. For instance, a company's name described as LT in the USA will be normalized as LIMITED.
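As a rough illustration of how these two dictionaries might cooperate, the sketch below applies a few regex rules in order and then replaces legal-form abbreviations. The rule names mirror the ones used by the library, but the regex patterns, the `LEGAL_TERMS` mapping and the `clean_name` helper are hypothetical stand-ins, not the library's actual implementation.

```python
import re

# Hypothetical cleaning rules: rule name -> (regex pattern, replacement).
CLEANING_RULES = {
    "replace_amperstand_between_space_by_AND": (r"\s&\s", " and "),
    "remove_words_in_parentheses": (r"\([^)]*\)", ""),
    "enforce_single_space_between_words": (r"\s+", " "),
}

# Hypothetical legal terms: abbreviation -> normalized legal form.
LEGAL_TERMS = {
    "llc": "limited liability company",
    "ltd": "limited",
}


def clean_name(name, rule_names):
    """Apply the regex rules in order, then normalize legal terms."""
    for rule in rule_names:
        pattern, replacement = CLEANING_RULES[rule]
        name = re.sub(pattern, replacement, name)
    words = name.strip().lower().split(" ")
    return " ".join(LEGAL_TERMS.get(word, word) for word in words)


rules = [
    "replace_amperstand_between_space_by_AND",
    "remove_words_in_parentheses",
    "enforce_single_space_between_words",
]
print(clean_name("Glass  Coatings & Concepts (CBG) LLC", rules))
# glass coatings and concepts limited liability company
```

Keeping the rules and the legal terms in separate dictionaries means either one can be swapped (for example, per country) without touching the other.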

You can use the **financial-entity-cleaner.text.name** module in two ways:
1. [by cleaning up single instances of company's name](#clean_names)
2. [by cleaning up multiple companies' names on a tabular dataframe](#df)

No matter which approach you choose, you will need to import the **CompanyNameCleaner()** class and create an object from it. This notebook shows how to customize the behaviour of this class to adapt the cleaning process to your projects.

<div class="alert alert-block alert-info">
<b>Note:</b> The complete documentation of the <b>financial-entity-cleaner</b> is available at <a href="https://financial-entity-cleaner.readthedocs.io/en/latest/" title="ReadTheDocs">ReadTheDocs</a>.</div>

In [1]:
# Sets up the location of the api relative to this notebook 
import sys
sys.path.append('../../')

In [2]:
# Import the module for cleaning company's name
from financial_entity_cleaner.text.name import CompanyNameCleaner

In [3]:
# Create a CompanyNameCleaner object
company_cleaner_obj = CompanyNameCleaner()

## 1. Cleaning single instances of company's names  <a id="clean_names"></a>

In [4]:
# Check which legal terms dictionary is currently set
# returns [country, language]
company_cleaner_obj.get_info_current_legal_term_dict()

['us', 'en']

In [5]:
# Checking the cleaning rules set as default
company_cleaner_obj.default_cleaning_rules

['replace_amperstand_between_space_by_AND',
 'replace_hyphen_between_spaces_by_single_space',
 'replace_underscore_between_spaces_by_single_space',
 'remove_text_puctuation_except_dot',
 'remove_math_symbols',
 'remove_words_in_parentheses',
 'remove_parentheses',
 'remove_brackets',
 'remove_curly_brackets',
 'enforce_single_space_between_words']

With the default parameters, the cleaner normalizes the legal terms and returns the clean name in lower case.

In [6]:
company_name = "Glass  Coatings & Concepts (CBG) LLC"

In [7]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_data(company_name)
print(clean_name)

glass coatings and concepts limited liability company


In [8]:
# Remove normalization of the legal terms
company_cleaner_obj.normalize_legal_terms = False

# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_data(company_name)
print(clean_name)

glass coatings and concepts llc


In [9]:
# Apply normalization of the legal terms
company_cleaner_obj.normalize_legal_terms = True

In [10]:
# Changing the resultant letter case
company_cleaner_obj.output_lettercase="upper"

# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_data(company_name)
print(clean_name)

GLASS COATINGS AND CONCEPTS LIMITED LIABILITY COMPANY


In [11]:
# Set up new cleaning rules
my_custom_cleaning_rules = [
    'remove_parentheses',
    'replace_amperstand_between_space_by_AND',
    'enforce_single_space_between_words',
]
company_cleaner_obj.default_cleaning_rules = my_custom_cleaning_rules

In [12]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_data(company_name)
print(clean_name)

GLASS COATINGS AND CONCEPTS CBG LIMITED LIABILITY COMPANY
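"CBG" survives in this output because the custom list contains `remove_parentheses` but not `remove_words_in_parentheses`. The sketch below contrasts the two behaviours with plain regular expressions. The patterns are assumptions chosen to reproduce the outputs shown in this notebook; the library's own regexes may differ.

```python
import re

name = "Glass Coatings and Concepts (CBG) LLC"

# Assumed pattern for 'remove_words_in_parentheses':
# drop the parentheses together with their content.
without_words = re.sub(r"\s*\([^)]*\)", "", name)

# Assumed pattern for 'remove_parentheses':
# drop only the '(' and ')' characters, keeping the content.
without_parens = re.sub(r"[()]", "", name)

print(without_words)   # Glass Coatings and Concepts LLC
print(without_parens)  # Glass Coatings and Concepts CBG LLC
```

With patterns like these, order matters: if the parentheses were stripped first, the rule that removes words in parentheses would find nothing left to match. This is consistent with the default rule list, which places `remove_words_in_parentheses` before `remove_parentheses`.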


Now, let's change the country/language to translate the legal terms of a French company:

In [13]:
company_name='assurant holdings france sas'

In [14]:
# Set the current legal term dictionary to France (country) - French (language)
company_cleaner_obj.set_current_legal_term_dict('fr', 'fr')

In [15]:
# Check which legal terms dictionary is currently set
# returns [country, language]
company_cleaner_obj.get_info_current_legal_term_dict()

['fr', 'fr']

In [16]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_data(company_name)
print(clean_name)

ASSURANT HOLDINGS FRANCE SOCIÉTÉ PAR ACTIONS SIMPLIFIÉE


In [17]:
# Check the countries/languages available
company_cleaner_obj.get_info_available_legal_term_dict()

{'ar': ['es'],
 'at': ['de'],
 'au': ['en'],
 'br': ['pt'],
 'ca': ['en', 'fr'],
 'de': ['de'],
 'es': ['es'],
 'fr': ['fr'],
 'gb': ['en', 'cy'],
 'id': ['id'],
 'in': ['en'],
 'it': ['it'],
 'nl': ['nl'],
 'nz': ['en'],
 'pt': ['pt'],
 'sg': ['en'],
 'us': ['en']}
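The mapping above is keyed by country code, with the supported languages as values. Before switching dictionaries, it can be useful to check that a country/language pair exists. The helper below is a minimal sketch: `is_supported` is a hypothetical function, and the dictionary is copied by hand from the output above rather than read from the library.

```python
# Copied from the output of get_info_available_legal_term_dict() above.
AVAILABLE_LEGAL_TERM_DICTS = {
    "ar": ["es"], "at": ["de"], "au": ["en"], "br": ["pt"],
    "ca": ["en", "fr"], "de": ["de"], "es": ["es"], "fr": ["fr"],
    "gb": ["en", "cy"], "id": ["id"], "in": ["en"], "it": ["it"],
    "nl": ["nl"], "nz": ["en"], "pt": ["pt"], "sg": ["en"],
    "us": ["en"],
}


def is_supported(country, language):
    """Return True if a legal terms dictionary exists for the pair."""
    languages = AVAILABLE_LEGAL_TERM_DICTS.get(country.lower(), [])
    return language.lower() in languages


print(is_supported("ca", "fr"))  # True
print(is_supported("fr", "en"))  # False
```

A check like this could guard a call to `set_current_legal_term_dict(country, language)`, assuming (as the comment in cell 14 suggests) that the country is the first argument.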

## 2. Cleaning up multiple companies' names on a tabular dataframe <a id="df"></a>

In [18]:
# Import the pandas library to read data from a .csv file
import pandas as pd

In [19]:
# Read the .csv file as a pandas dataframe object
input_filename = '../../tests/data/test_cleaner.csv'
df_original = pd.read_csv(input_filename,sep=';',encoding='utf-8', usecols=['NAME','COUNTRY_HEAD'])
df_original

Unnamed: 0,NAME,COUNTRY_HEAD
0,Bechel *Australia (Services) Pty Ltd,au
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,Australia
2,meo - serviços de comunicação e multimedia SA...,PT
3,"Glass Coatings & Concepts ""CBG"" LLC",United States
4,"Brault Loisirs, Orl. SARL",FR
5,"Cole & Brothers Fabric, Services LLC.",uk
6,StarCOM Group Servizi **CAT** SRL,italy
7,Wolbeck (Archer Daniels) *Unified* GmbH,de
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb
9,"Susamar-Patino, colectores (adm) SA",ES


In [23]:
# Set up the resultant letter case
company_cleaner_obj.output_lettercase='upper'

In [24]:
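# Clean the 'NAME' column and store the result in 'NAME_CLEAN'.
# The current legal terms dictionary (still France/French from the previous section) is applied to every row.
df_cleaner = company_cleaner_obj.get_clean_df(df_original, 'NAME', 'NAME_CLEAN')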

In [25]:
df_cleaner

Unnamed: 0,NAME,COUNTRY_HEAD,NAME_CLEAN
0,Bechel *Australia (Services) Pty Ltd,au,BECHEL *AUSTRALIA SERVICES PTY LTD
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,Australia,NRI - KELLY'S MERCHANDISE AUST PTY LTD
2,meo - serviços de comunicação e multimedia SA...,PT,MEO - SERVIÇOS DE COMUNICAÇÃO E MULTIMEDIA SA ...
3,"Glass Coatings & Concepts ""CBG"" LLC",United States,"GLASS COATINGS AND CONCEPTS ""CBG"" LLC"
4,"Brault Loisirs, Orl. SARL",FR,"BRAULT LOISIRS, ORL. SOCIÉTÉ À RESPONSABILITÉ ..."
5,"Cole & Brothers Fabric, Services LLC.",uk,"COLE AND BROTHERS FABRIC, SERVICES LLC."
6,StarCOM Group Servizi **CAT** SRL,italy,STARCOM GROUP SERVIZI **CAT** SRL
7,Wolbeck (Archer Daniels) *Unified* GmbH,de,WOLBECK ARCHER DANIELS *UNIFIED* GMBH
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb,"ANHEUSER-BUSCH, BROTHERS FOOD SERVICES, LLC"
9,"Susamar-Patino, colectores (adm) SA",ES,"SUSAMAR-PATINO, COLECTORES ADM SOCIÉTÉ ANONYME"


In [26]:
# Import the CountryCleaner() class to normalize country information
from financial_entity_cleaner.country.iso3166 import CountryCleaner

In [27]:
# Create an object based on CountryCleaner() class to perform cleaning over string values, dataframe or .csv file
country_cleaner_obj=CountryCleaner()

In [41]:
# Set the output to lower case and name the alpha-2 output column 'COUNTRY_HEAD'
country_cleaner_obj.letter_case='lower'
country_cleaner_obj.output_alpha2 = 'COUNTRY_HEAD'

# Perform the cleaning on the dataframe using the column named 'COUNTRY_HEAD'
df_cleaner = country_cleaner_obj.get_clean_df(df=df_original, column_name='COUNTRY_HEAD')

Normalizing countries...100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 423.30it/s]


In [42]:
# Remove unnecessary columns
df_cleaner.drop(['iso_name', 'iso_alpha3'], inplace=True, axis=1)
df_cleaner

Unnamed: 0,NAME,COUNTRY_HEAD
0,Bechel *Australia (Services) Pty Ltd,au
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,au
2,meo - serviços de comunicação e multimedia SA...,pt
3,"Glass Coatings & Concepts ""CBG"" LLC",us
4,"Brault Loisirs, Orl. SARL",fr
5,"Cole & Brothers Fabric, Services LLC.",
6,StarCOM Group Servizi **CAT** SRL,it
7,Wolbeck (Archer Daniels) *Unified* GmbH,de
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb
9,"Susamar-Patino, colectores (adm) SA",es
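Row 5 comes back with an empty country, presumably because `uk` is not the ISO 3166-1 alpha-2 code for the United Kingdom (the code is `gb`). One possible workaround is to map known aliases to ISO codes before normalizing the column. The sketch below is an illustration only: `COUNTRY_ALIASES` and `resolve_alias` are hypothetical names and are not part of the library.

```python
# Hypothetical alias table: common non-ISO spellings -> ISO 3166-1 alpha-2 code.
COUNTRY_ALIASES = {"uk": "gb"}


def resolve_alias(value):
    """Map known aliases to ISO alpha-2 codes; leave other values untouched."""
    key = value.strip().lower()
    return COUNTRY_ALIASES.get(key, value)


countries = ["au", "Australia", "PT", "uk", "gb"]
print([resolve_alias(c) for c in countries])
# ['au', 'Australia', 'PT', 'gb', 'gb']
```

On a dataframe, the same function could be applied with `df_original['COUNTRY_HEAD'].map(resolve_alias)` before calling the country cleaner.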


In [43]:
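# Clean the names again, this time using the normalized 'COUNTRY_HEAD' column
# to select the legal terms dictionary for each row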
df_cleaner = company_cleaner_obj.get_clean_df(df_cleaner, 'NAME', 'NAME_CLEAN', 'COUNTRY_HEAD', 'True')
df_cleaner

Unnamed: 0,NAME,COUNTRY_HEAD,NAME_CLEAN
0,Bechel *Australia (Services) Pty Ltd,au,BECHEL *AUSTRALIA SERVICES LIMITED PROPRIETARY...
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,au,NRI - KELLY'S MERCHANDISE AUST LIMITED PROPRIE...
2,meo - serviços de comunicação e multimedia SA...,pt,MEO - SERVIÇOS DE COMUNICAÇÃO E MULTIMEDIA SA ...
3,"Glass Coatings & Concepts ""CBG"" LLC",us,"GLASS COATINGS AND CONCEPTS ""CBG"" LIMITED LIAB..."
4,"Brault Loisirs, Orl. SARL",fr,"BRAULT LOISIRS, ORL. SOCIÉTÉ À RESPONSABILITÉ ..."
5,"Cole & Brothers Fabric, Services LLC.",,"COLE AND BROTHERS FABRIC, SERVICES LIMITED LIA..."
6,StarCOM Group Servizi **CAT** SRL,it,STARCOM GROUP SERVIZI **CAT** SOCIETÀ A RESPON...
7,Wolbeck (Archer Daniels) *Unified* GmbH,de,WOLBECK ARCHER DANIELS *UNIFIED* GESELLSCHAFT ...
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb,"ANHEUSER-BUSCH, BROTHERS FOOD SERVICES, LIMITE..."
9,"Susamar-Patino, colectores (adm) SA",es,"SUSAMAR-PATINO, COLECTORES ADM SOCIEDAD ANÓNIMA"
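Compare row 9 with the earlier run: with the country column supplied, "SA" is expanded to the Spanish form instead of the French one. The idea of selecting a legal terms dictionary per row can be sketched as follows. The dictionaries, the `normalize_legal_terms` helper and the fallback to a default country are all assumptions made for illustration; the library's own lookup and fallback behaviour may differ.

```python
# Hypothetical per-country legal terms; the library ships its own dictionaries.
LEGAL_TERMS_BY_COUNTRY = {
    "us": {"llc": "limited liability company"},
    "fr": {"sa": "société anonyme"},
    "es": {"sa": "sociedad anónima"},
}


def normalize_legal_terms(name, country, default_country="us"):
    """Replace legal-form abbreviations using the dictionary for `country`.

    Falls back to `default_country` when the country is unknown (an assumption).
    """
    terms = LEGAL_TERMS_BY_COUNTRY.get(
        country, LEGAL_TERMS_BY_COUNTRY[default_country]
    )
    return " ".join(terms.get(word, word) for word in name.lower().split())


rows = [("Susamar-Patino colectores SA", "es"), ("Brault Loisirs SA", "fr")]
for name, country in rows:
    print(normalize_legal_terms(name, country).upper())
# SUSAMAR-PATINO COLECTORES SOCIEDAD ANÓNIMA
# BRAULT LOISIRS SOCIÉTÉ ANONYME
```

The same abbreviation therefore resolves to different legal forms depending on the country attached to each row.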
