# How to... clean up a company's name

This notebook shows how to use the **CompanyNameCleaner** class to clean up attributes that contain a company's name. The cleaning process relies on two internal, customizable dictionaries:
- **Dictionary of pre-defined cleaning rules**: defines the regex rules to be applied, for example: remove numbers, punctuation, extra spaces, etc.
- **Dictionary of Legal Terms**: defines the replacement rules that normalize business legal forms. For instance, a company's name described as LT in the USA will be normalized as LIMITED.
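
The interplay between the two dictionaries can be pictured with a minimal, self-contained sketch. It does not reproduce the library's internals: the regex patterns, the `CLEANING_RULES` and `LEGAL_TERMS` mappings and the `clean_name` helper are all hypothetical, and only illustrate the idea of applying regex rules in order and then replacing legal-form abbreviations.

```python
import re

# Hypothetical cleaning rules: rule name -> (regex pattern, replacement).
CLEANING_RULES = {
    "remove_parentheses": (r"[()]", ""),
    "replace_hyphen_by_space": (r"-", " "),
    "enforce_single_space_between_words": (r"\s+", " "),
}

# Hypothetical legal-term dictionary for a single country/language pair.
LEGAL_TERMS = {
    "llc": "limited liability company",
    "ltd": "limited",
}


def clean_name(name, rule_names, legal_terms=None):
    """Apply the selected regex rules in order, then normalize legal terms."""
    cleaned = name
    for rule_name in rule_names:
        pattern, replacement = CLEANING_RULES[rule_name]
        cleaned = re.sub(pattern, replacement, cleaned)
    cleaned = cleaned.strip().lower()
    if legal_terms:
        # Replace whole words only, so 'ltd' inside a longer word is untouched.
        words = [legal_terms.get(word, word) for word in cleaned.split(" ")]
        cleaned = " ".join(words)
    return cleaned


rules = [
    "remove_parentheses",
    "replace_hyphen_by_space",
    "enforce_single_space_between_words",
]
result = clean_name("Glass  Coatings (CBG) LLC", rules, LEGAL_TERMS)
print(result)  # glass coatings cbg limited liability company
```

Because the rules run in the order given, the rule that collapses repeated spaces is listed last so that it also tidies up the gaps left by earlier rules.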

You can use the **CompanyNameCleaner** class in two ways:
1. [to clean up single instances of a company's name](#clean_names)
2. [to clean up multiple companies' names in a tabular dataframe](#df)

No matter which approach you choose, you will need to import the **CompanyNameCleaner()** class and create an object from it. This notebook also shows how to customize the behaviour of this class to adapt the cleaning process to your projects.

In [1]:
# Sets up the location of the api relative to this notebook 
import sys
sys.path.append('../../')

In [2]:
# Import the module for cleaning company's name
from financial_entity_cleaner.company import CompanyNameCleaner

In [3]:
# Create a CompanyNameCleaner object
company_cleaner_obj = CompanyNameCleaner()

## 1. Cleaning single instances of company's names  <a id="clean_names"></a>

In [4]:
# Check which legal terms dictionary is set by default
# returns [country, language]
company_cleaner_obj.get_info_current_legal_term_dict()

['us', 'en']

In [5]:
# Checking the cleaning rules set as default
company_cleaner_obj.default_cleaning_rules

['place_word_the_at_the_beginning',
 'remove_words_in_asterisk',
 'remove_question_marks_in_parentheses',
 'replace_hyphen_by_space',
 'replace_underscore_by_space',
 'remove_text_puctuation_except_dot',
 'remove_math_symbols',
 'remove_parentheses',
 'remove_brackets',
 'remove_curly_brackets',
 'remove_single_quote_next_character',
 'remove_double_quote',
 'enforce_single_space_between_words']

Default parameters: performs the normalization of legal terms and returns the clean name in lower case.

In [6]:
company_name = "Glass  Coatings & Concepts (CBG) LLC"

In [7]:
# Call the cleaning function
clean_name = company_cleaner_obj.clean(company_name)
print(clean_name)

glass coatings & concepts cbg limited liability company


In [8]:
# Remove normalization of the legal terms
company_cleaner_obj.normalize_legal_terms = False

# Call the cleaning function
clean_name = company_cleaner_obj.clean(company_name)
print(clean_name)

glass coatings & concepts cbg llc


In [9]:
# Apply normalization of the legal terms
company_cleaner_obj.normalize_legal_terms = True

In [10]:
# Changing the resultant letter case
company_cleaner_obj.output_lettercase="upper"

# Call the cleaning function
clean_name = company_cleaner_obj.clean(company_name)
print(clean_name)

glass coatings & concepts cbg limited liability company


In [11]:
# Set up new cleaning rules
my_custom_cleaning_rules=['remove_parentheses', 
                          'replace_amperstand_between_space_by_AND',
                          'enforce_single_space_between_words']
company_cleaner_obj.default_cleaning_rules = my_custom_cleaning_rules

In [12]:
# Call the cleaning function
clean_name = company_cleaner_obj.clean(company_name)
print(clean_name)

glass coatings and concepts cbg limited liability company


Now, let's change the country/language to translate the legal terms of a French company:

In [13]:
company_name='assurant holdings france sas'

In [14]:
# Set the current legal term dictionary to France (country) - French (language)
company_cleaner_obj.set_current_legal_term_dict('fr', 'fr')

In [15]:
# Check which legal terms dictionary is currently set
# returns [country, language]
company_cleaner_obj.get_info_current_legal_term_dict()

['fr', 'fr']

In [16]:
# Call the cleaning function
clean_name = company_cleaner_obj.clean(company_name)
print(clean_name)

assurant holdings france société par actions simplifiée


In [17]:
# Removing accents when cleaning
company_cleaner_obj.remove_accents = True
clean_name = company_cleaner_obj.clean(company_name)
print(clean_name)

assurant holdings france societe par actions simplifiee
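
The source does not say how `remove_accents` is implemented. One common way to get the same effect in plain Python is Unicode decomposition with the standard `unicodedata` module, sketched below. The `strip_accents` helper is hypothetical and is not part of the library.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFKD splits each accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep every other character.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("société par actions simplifiée"))  # societe par actions simplifiee
```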


In [18]:
# Checking the countries/languages available
company_cleaner_obj.get_info_available_legal_term_dict()

{'ar': ['es'],
 'at': ['de'],
 'au': ['en'],
 'aw': ['nl'],
 'az': ['az'],
 'be': ['nl', 'de', 'fr'],
 'bg': ['bg'],
 'bo': ['es'],
 'br': ['pt'],
 'bs': ['en'],
 'ca': ['en', 'fr'],
 'ch': ['en', 'de', 'fr', 'it'],
 'co': ['es'],
 'cr': ['es'],
 'de': ['de'],
 'dk': ['da'],
 'do': ['es'],
 'ec': ['es'],
 'ee': ['et'],
 'es': ['es'],
 'fi': ['fi', 'sv'],
 'fr': ['fr'],
 'gb': ['en', 'cy'],
 'gg': ['en'],
 'gi': ['en'],
 'gr': ['el'],
 'gt': ['es'],
 'hn': ['es'],
 'hr': ['hr'],
 'hu': ['hu'],
 'id': ['id'],
 'ie': ['en', 'ga'],
 'in': ['en'],
 'is': ['is'],
 'it': ['it'],
 'je': ['en'],
 'ky': ['en'],
 'lu': ['fr'],
 'nl': ['nl'],
 'no': ['no'],
 'nz': ['en'],
 'pl': ['pl'],
 'pt': ['pt'],
 'se': ['sv'],
 'sg': ['en'],
 'tr': ['tr'],
 'us': ['en'],
 'vg': ['en']}
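
The mapping above goes from a country code to the list of language codes available for it. A small check like the following sketch can confirm that a pair exists before switching dictionaries. The `is_supported` helper is hypothetical, and `available` is only an excerpt of the output above.

```python
# Excerpt of the mapping printed above: country code -> supported language codes.
available = {
    "be": ["nl", "de", "fr"],
    "ca": ["en", "fr"],
    "fr": ["fr"],
    "us": ["en"],
}


def is_supported(country, language, available_dicts):
    """Return True when the country/language pair exists in the mapping."""
    return language.lower() in available_dicts.get(country.lower(), [])


print(is_supported("ca", "fr", available))  # True
print(is_supported("us", "fr", available))  # False
```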

## 2. Cleaning up multiple companies' names on a tabular dataframe <a id="df"></a>

In [19]:
# Import the pandas library to read data from a .csv file
import pandas as pd

In [20]:
# Read the .csv file as a pandas dataframe object
input_filename = '../../tests/data/test_cleaner.csv'
df_original = pd.read_csv(input_filename, sep=';', encoding='utf-8', usecols=['NAME', 'COUNTRY_HEAD'])
df_original

Unnamed: 0,NAME,COUNTRY_HEAD
0,Bechel *Australia (Services) Pty Ltd,au
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,Australia
2,meo - serviços de comunicação e multimedia SA...,PT
3,"Glass Coatings & Concepts ""CBG"" LLC",United States
4,"Brault Loisirs, Orl. SARL",FR
5,"Cole & Brothers Fabric, Services LLC.",uk
6,StarCOM Group Servizi **CAT** SRL,italy
7,Wolbeck (Archer Daniels) *Unified* GmbH,de
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb
9,"Susamar-Patino, colectores (adm) SA",ES


In [21]:
# Set up the letter case, accent removal and cleaning rules
company_cleaner_obj.output_lettercase='upper'
company_cleaner_obj.remove_accents = True
company_cleaner_obj.default_cleaning_rules = ['place_word_the_at_the_beginning',
                                              'remove_words_in_asterisk', 
                                              'remove_question_marks_in_parentheses', 
                                              'replace_hyphen_by_space', 
                                              'replace_underscore_by_space', 
                                              'remove_text_puctuation_except_dot', 
                                              'remove_math_symbols', 
                                              'remove_parentheses', 
                                              'remove_brackets', 
                                              'remove_curly_brackets', 
                                              'remove_single_quote_next_character', 
                                              'remove_double_quote', 
                                              'enforce_single_space_between_words']

In [22]:
df_cleaner = company_cleaner_obj.clean_df(df_original, 'NAME', 'NAME_CLEAN')

In [23]:
df_cleaner

Unnamed: 0,NAME,COUNTRY_HEAD,NAME_CLEAN
0,Bechel *Australia (Services) Pty Ltd,au,bechel australia services pty ltd
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,Australia,nri kelly merchandise aust pty ltd
2,meo - serviços de comunicação e multimedia SA...,PT,meo servicos de comunicacao e multimedia sa a....
3,"Glass Coatings & Concepts ""CBG"" LLC",United States,glass coatings & concepts cbg llc
4,"Brault Loisirs, Orl. SARL",FR,brault loisirs orl. societe a responsabilite l...
5,"Cole & Brothers Fabric, Services LLC.",uk,cole & brothers fabric services llc.
6,StarCOM Group Servizi **CAT** SRL,italy,starcom group servizi srl
7,Wolbeck (Archer Daniels) *Unified* GmbH,de,wolbeck archer daniels gmbh
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb,anheuser busch brothers food services llc
9,"Susamar-Patino, colectores (adm) SA",ES,susamar patino colectores adm societe anonyme


In [24]:
# Import the CountryCleaner() class to normalize country information
from financial_entity_cleaner.location import CountryCleaner

In [25]:
# Create an object based on CountryCleaner() class to perform cleaning over string values, dataframe or .csv file
country_cleaner_obj=CountryCleaner()

In [26]:
# Let's set up the output in lower case
country_cleaner_obj.letter_case='lower'
country_cleaner_obj.output_alpha2 = 'COUNTRY_HEAD'

# Let's perform the cleaning on the dataframe by using the column named 'COUNTRY_HEAD'
df_cleaner = country_cleaner_obj.clean_df(df=df_original, cols=['COUNTRY_HEAD'])

Cleaning column [COUNTRY_HEAD]: 100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 55.03it/s]


In [27]:
# Remove unnecessary columns
df_cleaner.drop(['COUNTRY_HEAD', 'COUNTRY_HEAD_name', 'COUNTRY_HEAD_alpha3'], inplace=True, axis=1)
df_cleaner

Unnamed: 0,NAME,COUNTRY_HEAD_code,COUNTRY_HEAD_COUNTRY_HEAD
0,Bechel *Australia (Services) Pty Ltd,36.0,au
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,36.0,au
2,meo - serviços de comunicação e multimedia SA...,620.0,pt
3,"Glass Coatings & Concepts ""CBG"" LLC",840.0,us
4,"Brault Loisirs, Orl. SARL",250.0,fr
5,"Cole & Brothers Fabric, Services LLC.",,
6,StarCOM Group Servizi **CAT** SRL,380.0,it
7,Wolbeck (Archer Daniels) *Unified* GmbH,276.0,de
8,"Anheuser-BUSCH, Brothers (food services), LLC",826.0,gb
9,"Susamar-Patino, colectores (adm) SA",724.0,es


In [28]:
# Clean the names again, this time passing the column with the normalized country code
df_cleaner = company_cleaner_obj.clean_df(df_cleaner, 'NAME', 'NAME_CLEAN', 'COUNTRY_HEAD_COUNTRY_HEAD', 'True')
df_cleaner

Unnamed: 0,NAME,COUNTRY_HEAD_code,COUNTRY_HEAD_COUNTRY_HEAD,NAME_CLEAN
0,Bechel *Australia (Services) Pty Ltd,36.0,au,bechel australia services limited proprietary ...
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,36.0,au,nri kelly merchandise aust limited proprietary...
2,meo - serviços de comunicação e multimedia SA...,620.0,pt,meo servicos de comunicacao e multimedia sa a....
3,"Glass Coatings & Concepts ""CBG"" LLC",840.0,us,glass coatings & concepts cbg limited liabilit...
4,"Brault Loisirs, Orl. SARL",250.0,fr,brault loisirs orl. societe a responsabilite l...
5,"Cole & Brothers Fabric, Services LLC.",,,cole & brothers fabric services limited liabil...
6,StarCOM Group Servizi **CAT** SRL,380.0,it,starcom group servizi societa a responsabilita...
7,Wolbeck (Archer Daniels) *Unified* GmbH,276.0,de,wolbeck archer daniels gesellschaft mit beschr...
8,"Anheuser-BUSCH, Brothers (food services), LLC",826.0,gb,anheuser busch brothers food services limited ...
9,"Susamar-Patino, colectores (adm) SA",724.0,es,susamar patino colectores adm sociedad anonima
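
The last call passes the country column, so the legal terms of each row are normalized according to that row's country. The sketch below illustrates the idea with plain Python records standing in for dataframe rows. It is not the library's implementation: `LEGAL_TERMS_BY_COUNTRY`, `normalize_legal_terms` and the fallback to a default country for rows without one are all assumptions made for illustration.

```python
# Hypothetical excerpts of legal-term dictionaries, keyed by country code.
LEGAL_TERMS_BY_COUNTRY = {
    "us": {"llc": "limited liability company"},
    "fr": {"sas": "societe par actions simplifiee"},
    "es": {"sa": "sociedad anonima"},
}


def normalize_legal_terms(name, country, default_country="us"):
    """Replace legal-form abbreviations using the dictionary of the row's country."""
    # Assumption: rows with a missing or unknown country use a default dictionary.
    terms = LEGAL_TERMS_BY_COUNTRY.get(
        country or default_country,
        LEGAL_TERMS_BY_COUNTRY[default_country],
    )
    return " ".join(terms.get(word, word) for word in name.lower().split())


rows = [
    {"NAME": "assurant holdings france sas", "COUNTRY": "fr"},
    {"NAME": "Susamar Patino colectores adm SA", "COUNTRY": "es"},
    {"NAME": "Cole & Brothers Fabric Services LLC", "COUNTRY": None},
]
for row in rows:
    row["NAME_CLEAN"] = normalize_legal_terms(row["NAME"], row["COUNTRY"])
    print(row["NAME_CLEAN"])
```

Keeping one dictionary per country avoids ambiguous abbreviations: in this sketch `sa` is only expanded for rows tagged `es`, and would be left untouched for a row tagged `us`.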
