# How to...clean a company's name

This notebook shows how to use the entity-matching API to clean attributes that contain a company's name. The cleaning process relies on two internal, customized dictionaries to apply the desired cleaning rules:
- Dictionary of legal terms: defines the replacement rules that normalize the legal forms of businesses. For instance, a company name whose legal form is written as LT is normalized to LIMITED.
- Dictionary of cleaning rules: defines the cleaning rules to be applied, for example removing numbers, punctuation, etc.
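
As a rough illustration of what the dictionary of legal terms does, the sketch below replaces tokens using a small hand-written mapping. The `LEGAL_TERMS` mapping and the `normalize_legal_terms` function are hypothetical and are not part of the library; the three entries are only inferred from the outputs shown in this notebook.

```python
# Hypothetical, simplified legal-terms dictionary (assumption for illustration).
# The library's real dictionaries are internal and much larger.
LEGAL_TERMS = {
    "lt": "limited",
    "inc": "incorporated",
    "assn": "association",
}


def normalize_legal_terms(name, legal_terms=LEGAL_TERMS):
    """Lower-case the name and replace every token found in the dictionary."""
    tokens = name.lower().split()
    return " ".join(legal_terms.get(token, token) for token in tokens)


normalized = normalize_legal_terms("chugach electric assn inc")
print(normalized)
```

Tokens that are not in the mapping pass through unchanged, which is why only the legal form at the end of a name is usually affected.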

In [1]:
# Add the location of the API, relative to this notebook, to the import path
import sys
sys.path.append('../../')

In [2]:
# Import the module for cleaning company names
from financial_entity_cleaner.company_cleaner import company

In [3]:
# Create a CompanyNameCleaner object
company_cleaner_obj = company.CompanyNameCleaner()

In [4]:
# Set up some properties of the cleaner library
company_cleaner_obj.normalize_legal_terms = True
company_cleaner_obj.output_lettercase = "title"
company_cleaner_obj.remove_unicode = True

In [5]:
# Check which legal-terms dictionaries are available
# The method returns a list of 'language,country' strings
company_cleaner_obj.get_types_available_legal_term_dict()

['en,us', 'pt,pt', 'pt,br', 'fr,fr']

In [6]:
# Example of a company name with unicode, punctuation, a URL and an email address
company_name = ' [89]	GRAND BUDAPEST HOTEL adm@budapest.com %& 7((888)) www.gbhotel.com lt'
print('COMPANY NAME TO CLEAN: {}'.format(company_name))

COMPANY NAME TO CLEAN:  [89]	GRAND BUDAPEST HOTEL adm@budapest.com %& 7((888)) www.gbhotel.com lt


## 1. Using the default cleaning

In [7]:
test = 'chugach electric assn inc'

In [8]:
company_cleaner_obj.get_clean_name(test)

'Chugach Electric Association Incorporated'

In [9]:
# Check which legal-terms dictionary is set as default
# Returns a list of the form [language, country]
company_cleaner_obj.get_type_current_legal_term_dict()

['en', 'us']

In [10]:
# Checking the cleaning rules set as default
company_cleaner_obj.default_cleaning_rules

['remove_email',
 'remove_url',
 'remove_www_address',
 'remove_words_in_parentheses',
 'remove_numbers',
 'replace_hyphen_underscore_by_space',
 'remove_all_punctuation',
 'enforce_single_space_between_words']
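
The rule names listed above suggest a pipeline in which each rule is a small text transformation applied in order. The following is a minimal sketch of that idea, assuming simple regular expressions; the `RULES` table and the `apply_rules` helper are hypothetical, and the library's actual patterns may differ.

```python
import re
import string

# Hypothetical regex-based stand-ins for some of the rule names (assumption).
# They only illustrate the idea of an ordered rule pipeline.
RULES = {
    "remove_email": lambda s: re.sub(r"\S+@\S+", " ", s),
    "remove_www_address": lambda s: re.sub(r"\bwww\.\S+", " ", s),
    "remove_words_in_parentheses": lambda s: re.sub(r"\([^()]*\)", " ", s),
    "remove_numbers": lambda s: re.sub(r"\d+", " ", s),
    "replace_hyphen_underscore_by_space": lambda s: re.sub(r"[-_]", " ", s),
    "remove_all_punctuation": lambda s: s.translate(
        str.maketrans("", "", string.punctuation)
    ),
    "enforce_single_space_between_words": lambda s: " ".join(s.split()),
}


def apply_rules(name, rule_names):
    """Apply the named rules to the text, one after another."""
    for rule_name in rule_names:
        name = RULES[rule_name](name)
    return name


sample = "acme-widgets (holding) 42 info@acme.com www.acme.com ltd."
cleaned = apply_rules(sample, list(RULES))
print(cleaned)
```

Because each rule sees the output of the previous one, the order of the list matters: removing emails before punctuation keeps the address from being fused into a single word.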

By default, the cleaner normalizes legal terms and returns the clean name in lower case. The output below is in title case because `output_lettercase` was set to "title" in cell In [4].

In [11]:
# Call the cleaning function with the default cleaning rules
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

Grand Budapest Hotel Corporation Corporation Limited


## 2. Not applying the normalization of legal terms

In [12]:
# Remove the normalization of the legal terms
company_cleaner_obj.normalize_legal_terms = False

In [13]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

Grand Budapest Hotel Lt


## 3. Returning the clean name in upper case

In [14]:
# Change the letter case of the output
company_cleaner_obj.output_lettercase = "upper"

# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

GRAND BUDAPEST HOTEL LT
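
The `output_lettercase` values used so far ("title" and "upper") produce results that look like those of Python's built-in string methods. A minimal sketch, assuming that mapping (the `apply_lettercase` helper is hypothetical and is not the library's code):

```python
# Hypothetical mapping from a letter-case option to a string method (assumption).
LETTERCASE = {
    "lower": str.lower,
    "upper": str.upper,
    "title": str.title,
}


def apply_lettercase(name, lettercase="lower"):
    """Return the name converted to the requested letter case."""
    return LETTERCASE[lettercase](name)


for option in ("lower", "title", "upper"):
    print(option, "->", apply_lettercase("Grand Budapest Hotel Lt", option))
```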


## 4. Changing the cleaning rules

In [15]:
# Checking the cleaning rules available
company_cleaner_obj.get_cleaning_rules_available()

['remove_email',
 'remove_url',
 'remove_www_address',
 'enforce_single_space_between_words',
 'treat_AND',
 'replace_hyphen_underscore_by_space',
 'remove_all_punctuation',
 'remove_mentions',
 'remove_hashtags',
 'remove_numbers',
 'remove_single_quote_next_character',
 'remove_words_in_parentheses']

In [16]:
company_name

' [89]\tGRAND BUDAPEST HOTEL adm@budapest.com %& 7((888)) www.gbhotel.com lt'

In [17]:
# Set up new cleaning rules
my_custom_cleaning_rules = [
    'remove_words_in_parentheses',
    'remove_all_punctuation',
    'enforce_single_space_between_words',
]

In [18]:
# Set the new cleaning rules as the default in the library
company_cleaner_obj.default_cleaning_rules = my_custom_cleaning_rules

In [19]:
# Print original name
company_name

' [89]\tGRAND BUDAPEST HOTEL adm@budapest.com %& 7((888)) www.gbhotel.com lt'

In [20]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

89 GRAND BUDAPEST HOTEL ADMBUDAPESTCOM 7 WWWGBHOTELCOM LT
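
The output above keeps `ADMBUDAPESTCOM` and `WWWGBHOTELCOM`, most likely because the custom list omits `remove_email` and `remove_www_address`, so `remove_all_punctuation` strips the `@` and `.` characters and fuses the remaining letters. A small standalone sketch of that effect, using hypothetical helper functions rather than the library's own:

```python
import re
import string


def remove_all_punctuation(text):
    """Hypothetical helper: delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_email(text):
    """Hypothetical helper: replace anything shaped like an email with a space."""
    return re.sub(r"\S+@\S+", " ", text)


raw = "GRAND BUDAPEST HOTEL adm@budapest.com lt"

# Punctuation removal alone fuses the email address into one token.
without_email_rule = " ".join(remove_all_punctuation(raw).split())

# Removing the email first leaves nothing behind to fuse.
with_email_rule = " ".join(remove_all_punctuation(remove_email(raw)).split())

print(without_email_rule)
print(with_email_rule)
```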


## 5. Changing the dictionary of legal terms

In [21]:
company_name='assurant holdings france sas'

In [22]:
# Re-enable the normalization of the legal terms
company_cleaner_obj.normalize_legal_terms = True

In [23]:
company_cleaner_obj.set_current_legal_term_dict('fr', 'fr')

In [24]:
# Check which legal-terms dictionary is currently set
# Returns a list of the form [language, country]
company_cleaner_obj.get_type_current_legal_term_dict()

['fr', 'fr']

In [25]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

ASSURANT HOLDINGS FRANCE SOCIÉTÉ PAR ACTIONS SIMPLIFIÉE


In [26]:
company_name ='c.s.n.s.p. 452, s.a.'

In [27]:
company_cleaner_obj.set_current_legal_term_dict('pt', 'pt')

In [29]:
# Check which legal-terms dictionary is currently set
# Returns a list of the form [language, country]
company_cleaner_obj.get_type_current_legal_term_dict()

['pt', 'pt']

In [30]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

CSNSP 452 SOCIEDADE ANÔNIMA


In [31]:
company_cleaner_obj.set_current_legal_term_dict('pt', 'br')

In [32]:
# Check which legal-terms dictionary is currently set
# Returns a list of the form [language, country]
company_cleaner_obj.get_type_current_legal_term_dict()

['pt', 'br']

In [33]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

CSNSP 452 SOCIEDADE ANÔNIMA
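
One way to picture the switchable dictionaries is a lookup keyed by a (language, country) pair. The sketch below assumes such a structure; `LEGAL_TERMS_BY_LOCALE` and `normalize` are hypothetical names, and the entries are only those visible in the outputs of this notebook.

```python
# Hypothetical per-locale legal-terms dictionaries (assumption for illustration).
LEGAL_TERMS_BY_LOCALE = {
    ("en", "us"): {"lt": "limited", "inc": "incorporated"},
    ("fr", "fr"): {"sas": "société par actions simplifiée"},
    ("pt", "pt"): {"sa": "sociedade anônima"},
    ("pt", "br"): {"sa": "sociedade anônima"},
}


def normalize(name, language, country):
    """Replace legal-form tokens using the dictionary of the given locale."""
    terms = LEGAL_TERMS_BY_LOCALE[(language, country)]
    return " ".join(terms.get(token, token) for token in name.lower().split())


name = "assurant holdings france sas"
french = normalize(name, "fr", "fr")
american = normalize(name, "en", "us")
print(french)
print(american)
```

With the French dictionary the trailing `sas` is expanded, while the US English dictionary in this sketch has no entry for it and leaves the name untouched, which is why choosing the right dictionary for the data matters.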
