# How to... clean up a company's name

This notebook shows how to use the **financial-entity-cleaner.text.name** module to clean attributes that contain a company's name. The cleaning process relies on two internal, customized dictionaries:
- **Dictionary of pre-defined cleaning rules**: defines the regex rules to be applied, for example: remove numbers, punctuation, extra spaces, etc.
- **Dictionary of Legal Terms**: defines the replacement rules that normalize business legal forms. For instance, the legal form LT in the name of a US company is normalized to LIMITED.
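The interplay of the two dictionaries can be pictured with a few lines of plain Python. This is only a minimal sketch under stated assumptions: the names `CLEANING_RULES`, `LEGAL_TERMS` and `clean` are hypothetical, and the regex patterns are illustrative rather than the ones shipped with the library.

```python
import re

# Hypothetical rule dictionary: rule name -> (regex pattern, replacement).
# These patterns are illustrative and are not the library's own definitions.
CLEANING_RULES = {
    "remove_numbers": (r"\d+", " "),
    "remove_all_punctuation": (r"[^\w\s]", " "),
    "enforce_single_space_between_words": (r"\s+", " "),
}

# Hypothetical legal-term dictionary: abbreviation -> normalized legal form.
LEGAL_TERMS = {"lt": "limited", "ltd": "limited", "inc": "incorporated"}


def clean(name, rule_names):
    """Apply the named regex rules in order, then normalize legal terms."""
    for rule in rule_names:
        pattern, replacement = CLEANING_RULES[rule]
        name = re.sub(pattern, replacement, name)
    # Look up each lower-cased word in the legal-term dictionary.
    words = [LEGAL_TERMS.get(word, word) for word in name.lower().split()]
    return " ".join(words)


cleaned = clean("Acme  Widgets, Lt. 42", list(CLEANING_RULES))
print(cleaned)  # acme widgets limited
```

The regex rules only reshape the text; the legal-term lookup is a separate step that runs on the words left after cleaning.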

You can use the **financial-entity-cleaner.text.name** module in two ways:
1. [by cleaning up single instances of a company's name](#clean_names)
2. [by cleaning up multiple companies' names in a tabular dataframe](#df)

No matter which approach you choose, you will need to import the **CompanyNameCleaner()** class and create an object from it. This notebook shows how to customize the behaviour of this class to adapt the cleaning process to your projects.

<div class="alert alert-block alert-info">
<b>Note:</b> The complete documentation of the <b>financial-entity-cleaner</b> is available at <a href="https://financial-entity-cleaner.readthedocs.io/en/latest/" title="ReadTheDocs">ReadTheDocs</a>.</div>

In [1]:
# Sets up the location of the api relative to this notebook 
import sys
sys.path.append('../../')

In [2]:
# Import the module for cleaning company's name
from financial_entity_cleaner.text.name import CompanyNameCleaner

In [3]:
# Create a CompanyNameCleaner object
company_cleaner_obj = CompanyNameCleaner()

## 1. Cleaning single instances of a company's name  <a id="clean_names"></a>

In [12]:
# Check which legal terms dictionary is set by default
# returns [country, language]
company_cleaner_obj.get_info_current_legal_term_dict()

['us', 'en']

In [13]:
# Checking the cleaning rules set as default
company_cleaner_obj.default_cleaning_rules

['replace_amperstand_between_space_by_AND',
 'replace_hyphen_between_spaces_by_single_space',
 'replace_underscore_between_spaces_by_single_space',
 'remove_text_puctuation_except_dot',
 'remove_math_symbols',
 'remove_words_in_parentheses',
 'remove_parentheses',
 'remove_brackets',
 'remove_curly_brackets',
 'enforce_single_space_between_words']
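Since the rules are given as an ordered list, each rule presumably works on the output of the previous one, so the order can change the result. The sketch below illustrates this with two hypothetical patterns that stand in for `remove_words_in_parentheses` and `remove_parentheses`; the library's actual regexes may differ.

```python
import re

# Illustrative patterns only; the library's own regexes may differ.
REMOVE_WORDS_IN_PARENTHESES = r"\([^()]*\)"
REMOVE_PARENTHESES = r"[()]"


def apply_in_order(text, patterns):
    """Apply each pattern in sequence, replacing matches with a space."""
    for pattern in patterns:
        text = re.sub(pattern, " ", text)
    # Collapse the extra spaces left behind by the replacements.
    return " ".join(text.split())


name = "monsters (text) inc"
words_first = apply_in_order(name, [REMOVE_WORDS_IN_PARENTHESES, REMOVE_PARENTHESES])
parens_first = apply_in_order(name, [REMOVE_PARENTHESES, REMOVE_WORDS_IN_PARENTHESES])
print(words_first)   # monsters inc
print(parens_first)  # monsters text inc
```

When the parentheses are stripped first, the second pattern has nothing left to match and the enclosed word survives.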

Default parameters: performs the normalization of legal terms and returns the clean name in lower case.

In [14]:
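# Sample name to clean (the same value is printed again in section 4)
company_name = ' [89]\tGRAND BUDAPEST HOTEL adm@budapest.com %& 7((888)) www.gbhotel.com lt'
# Call the cleaning function with the default cleaning rules
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)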

89 Grand Budapest Hotel Adm@Budapest.Com & 7 Www.Gbhotel.Com Limited


## 2. Not applying the normalization of legal terms

In [15]:
# Remove the normalization of the legal terms
company_cleaner_obj.normalize_legal_terms = False

In [16]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

89 Grand Budapest Hotel Adm@Budapest.Com & 7 Www.Gbhotel.Com Lt


## 3. Returning the clean name in upper case

In [17]:
# Changing the resultant letter case
company_cleaner_obj.output_lettercase="upper"

# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

89 GRAND BUDAPEST HOTEL ADM@BUDAPEST.COM & 7 WWW.GBHOTEL.COM LT
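Judging by the outputs in this notebook, the letter-case options behave like Python's built-in string methods. The sketch below assumes exactly that; `apply_lettercase` and `LETTERCASE_CONVERTERS` are hypothetical names, not part of the library.

```python
# Hypothetical helper: maps a letter-case option to a built-in str method.
LETTERCASE_CONVERTERS = {
    "lower": str.lower,
    "upper": str.upper,
    "title": str.title,
}


def apply_lettercase(text, lettercase):
    """Return text converted with the chosen letter-case option."""
    return LETTERCASE_CONVERTERS[lettercase](text)


sample = "grand budapest hotel www.gbhotel.com lt"
upper_name = apply_lettercase(sample, "upper")
title_name = apply_lettercase(sample, "title")
print(upper_name)  # GRAND BUDAPEST HOTEL WWW.GBHOTEL.COM LT
print(title_name)  # Grand Budapest Hotel Www.Gbhotel.Com Lt
```

Note that `str.title` capitalizes the first letter after every non-letter character, which is consistent with `Www.Gbhotel.Com` in the output of section 1.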


## 4. Changing the cleaning rules

In [18]:
# Checking the cleaning rules available
company_cleaner_obj.get_cleaning_rules_available()

['remove_email',
 'remove_url',
 'remove_word_the_from_the_end',
 'place_word_the_at_the_beginning',
 'remove_www_address',
 'enforce_single_space_between_words',
 'replace_amperstand_by_AND',
 'replace_amperstand_between_space_by_AND',
 'replace_hyphen_by_space',
 'replace_hyphen_between_spaces_by_single_space',
 'replace_underscore_by_space',
 'replace_underscore_between_spaces_by_single_space',
 'remove_all_punctuation',
 'remove_punctuation_except_dot',
 'remove_mentions',
 'remove_hashtags',
 'remove_numbers',
 'remove_text_puctuation',
 'remove_text_puctuation_except_dot',
 'remove_math_symbols',
 'remove_math_symbols_except_dash',
 'remove_parentheses',
 'remove_brackets',
 'remove_curly_brackets',
 'remove_single_quote_next_character',
 'remove_words_in_parentheses',
 'repeat_remove_words_in_parentheses']

In [20]:
# Set up new cleaning rules
my_custom_cleaning_rules=['remove_words_in_parentheses','remove_all_punctuation','enforce_single_space_between_words']

In [21]:
# Set the new cleaning rules as default in the library
company_cleaner_obj.default_cleaning_rules=my_custom_cleaning_rules

In [22]:
# Print original name
company_name

' [89]\tGRAND BUDAPEST HOTEL adm@budapest.com %& 7((888)) www.gbhotel.com lt'

In [23]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

89 GRAND BUDAPEST HOTEL ADM BUDAPEST COM 7 WWW GBHOTEL COM LT
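The result above can be approximated with three regular expressions applied in the same order as the custom rule list. The patterns are illustrative stand-ins for the named rules, not the ones stored in the library, and the final upper-casing mirrors the `output_lettercase` setting from section 3.

```python
import re

# Illustrative stand-ins for the three custom rules, in the same order.
CUSTOM_PATTERNS = [
    r"\([^()]*\)",  # remove_words_in_parentheses (innermost pair only)
    r"[^\w\s]",     # remove_all_punctuation
    r"\s+",         # enforce_single_space_between_words
]


def clean_with_custom_rules(name):
    """Replace every match with a space, trim the ends, then upper-case."""
    for pattern in CUSTOM_PATTERNS:
        name = re.sub(pattern, " ", name)
    return name.strip().upper()


company_name = " [89]\tGRAND BUDAPEST HOTEL adm@budapest.com %& 7((888)) www.gbhotel.com lt"
approx_clean_name = clean_with_custom_rules(company_name)
print(approx_clean_name)
# 89 GRAND BUDAPEST HOTEL ADM BUDAPEST COM 7 WWW GBHOTEL COM LT
```

Because every punctuation mark becomes a space, the e-mail and web addresses break into separate words instead of disappearing; the rules `remove_email` and `remove_url` listed above look like better candidates if those fragments should be dropped.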


## 5. Changing the dictionary of legal terms

In [24]:
company_name='assurant holdings france sas'

In [25]:
# Re-enable the normalization of the legal terms
company_cleaner_obj.normalize_legal_terms = True

In [26]:
company_cleaner_obj.set_current_legal_term_dict('fr', 'fr')

In [27]:
# Check which legal terms dictionary is currently set
# returns [country, language]
company_cleaner_obj.get_info_current_legal_term_dict()

['fr', 'fr']

In [28]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

ASSURANT HOLDINGS FRANCE SOCIÉTÉ PAR ACTIONS SIMPLIFIÉE


In [29]:
company_name ='c.s.n.s.p. 452, s.a.'

In [30]:
company_cleaner_obj.set_current_legal_term_dict('pt', 'pt')

In [31]:
# Check which legal terms dictionary is currently set
# returns [country, language]
company_cleaner_obj.get_info_current_legal_term_dict()

['pt', 'pt']

In [32]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

C S N S P 452 S A


In [33]:
company_cleaner_obj.set_current_legal_term_dict('br')

In [34]:
# Check which legal terms dictionary is currently set
# returns [country, language]
company_cleaner_obj.get_info_current_legal_term_dict()

['br', '']

In [35]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

C S N S P 452 S A


In [36]:
# Changing the dictionary and merging with the default dictionary
company_cleaner_obj.set_current_legal_term_dict('pt', 'pt', merge_legal_terms=True)

In [37]:
company_name ='c.s.n.s.p. 452, s.a.'

In [38]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

C S N S P 452 S A


In [39]:
company_name = 'chugach electric assn inc'

In [40]:
# Call the cleaning function
clean_name = company_cleaner_obj.get_clean_name(company_name)
print(clean_name)

CHUGACH ELECTRIC ASSN INCORPORATED
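The last two results suggest that, with `merge_legal_terms=True`, the default English terms stay available after switching to another dictionary. One way to picture such a merge is with plain Python dictionaries. Everything below is a hypothetical sketch: the entries are made up for illustration and the library's internal merge logic may work differently.

```python
# Hypothetical dictionaries; the real ones ship with the library.
DEFAULT_TERMS = {"inc": "incorporated", "lt": "limited"}
PT_TERMS = {"sa": "sociedade anonima", "lda": "limitada"}


def merge_legal_terms(default_terms, country_terms):
    """Return a new dict in which country-specific terms override defaults."""
    merged = dict(default_terms)
    merged.update(country_terms)
    return merged


def normalize(name, terms):
    """Replace each whole word found in the dictionary of terms."""
    return " ".join(terms.get(word, word) for word in name.split())


merged_terms = merge_legal_terms(DEFAULT_TERMS, PT_TERMS)
english_name = normalize("chugach electric assn inc", merged_terms).upper()
split_name = normalize("c s n s p 452 s a", merged_terms).upper()
print(english_name)  # CHUGACH ELECTRIC ASSN INCORPORATED
print(split_name)    # C S N S P 452 S A
```

In this sketch the second name stays unchanged because removing punctuation has already split `s.a.` into the separate words `s` and `a`, and neither matches a dictionary key on its own.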


## 6. Cleaning a dataframe  <a id="df"></a>

In [41]:
import pandas as pd

In [42]:
input_filename = '../../tests/data/test_cleaner_company.csv'

In [43]:
df_original = pd.read_csv(input_filename,sep=',',encoding='utf-8')

In [44]:
df_original

| Unnamed: 0 | NAME | CLEAN_NAME | COUNTRY_ALPHA2 |
| --- | --- | --- | --- |
| 0 | Central PERK | Central Perk | us |
| 1 | dunder-mifflin | Dunder Mifflin | us |
| 2 | HOnEYDUKES | Honeydukes | us |
| 3 | STARCOURT; MALL | Starcourt Mall | us |
| 4 | RR DINER | Rr Diner | us |
| 5 | \tSterling Cooper | Sterling Cooper | us |
| 6 | &\tWONKA INDUSTRIES! | Wonka Industries | us |
| 7 | [89]\tGRAND BUDAPEST HOTEL www.gbhotel.com\t | Grand Budapest Hotel | us |
| 8 | (text)\tMONSTERS INC.\tmonsters@gmail.com | Monsters Incorporated | us |
| 9 | \t123 hotel CHeVALIER\t2005720784 AC. | Hotel Chevalier Asociacion Civil | us |


In [45]:
# Set up the resultant letter case
company_cleaner_obj.output_lettercase='title'

In [46]:
df_cleaner = company_cleaner_obj.apply_cleaner_to_df(df_original, 'NAME', 'NAME_CLEAN')

In [47]:
df_cleaner

| Unnamed: 0 | NAME | CLEAN_NAME | COUNTRY_ALPHA2 | NAME_CLEAN |
| --- | --- | --- | --- | --- |
| 0 | Central PERK | Central Perk | us | Central Perk |
| 1 | dunder-mifflin | Dunder Mifflin | us | Dunder Mifflin |
| 2 | HOnEYDUKES | Honeydukes | us | Honeydukes |
| 3 | STARCOURT; MALL | Starcourt Mall | us | Starcourt Mall |
| 4 | RR DINER | Rr Diner | us | Rr Diner |
| 5 | \tSterling Cooper | Sterling Cooper | us | Sterling Cooper |
| 6 | &\tWONKA INDUSTRIES! | Wonka Industries | us | Wonka Industries |
| 7 | [89]\tGRAND BUDAPEST HOTEL www.gbhotel.com\t | Grand Budapest Hotel | us | 89 Grand Budapest Hotel Www Gbhotel Com |
| 8 | (text)\tMONSTERS INC.\tmonsters@gmail.com | Monsters Incorporated | us | Monsters Inc Monsters Gmail Com |
| 9 | \t123 hotel CHeVALIER\t2005720784 AC. | Hotel Chevalier Asociacion Civil | us | 123 Hotel Chevalier 2005720784 Ac |


In [48]:
# Clean again, this time passing the country column (COUNTRY_ALPHA2)
df_cleaner = company_cleaner_obj.apply_cleaner_to_df(df_original, 'NAME', 'NAME_CLEAN', 'COUNTRY_ALPHA2', True)

In [49]:
df_cleaner

| Unnamed: 0 | NAME | CLEAN_NAME | COUNTRY_ALPHA2 | NAME_CLEAN |
| --- | --- | --- | --- | --- |
| 0 | Central PERK | Central Perk | us | Central Perk |
| 1 | dunder-mifflin | Dunder Mifflin | us | Dunder Mifflin |
| 2 | HOnEYDUKES | Honeydukes | us | Honeydukes |
| 3 | STARCOURT; MALL | Starcourt Mall | us | Starcourt Mall |
| 4 | RR DINER | Rr Diner | us | Rr Diner |
| 5 | \tSterling Cooper | Sterling Cooper | us | Sterling Cooper |
| 6 | &\tWONKA INDUSTRIES! | Wonka Industries | us | Wonka Industries |
| 7 | [89]\tGRAND BUDAPEST HOTEL www.gbhotel.com\t | Grand Budapest Hotel | us | 89 Grand Budapest Hotel Www Gbhotel Com |
| 8 | (text)\tMONSTERS INC.\tmonsters@gmail.com | Monsters Incorporated | us | Monsters Inc Monsters Gmail Com |
| 9 | \t123 hotel CHeVALIER\t2005720784 AC. | Hotel Chevalier Asociacion Civil | us | 123 Hotel Chevalier 2005720784 Ac |
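Passing a country column can be pictured as choosing a legal-term dictionary row by row. The sketch below uses plain Python dictionaries instead of pandas so that it runs without extra dependencies; `TERMS_BY_COUNTRY` and `clean_rows` are hypothetical, the dictionary entries are illustrative, and this is not how `apply_cleaner_to_df` is implemented.

```python
# Hypothetical per-country dictionaries, keyed by ISO alpha-2 country code.
TERMS_BY_COUNTRY = {
    "us": {"inc": "incorporated", "lt": "limited"},
    "fr": {"sas": "société par actions simplifiée"},
}


def clean_rows(rows, name_key, out_key, country_key):
    """Return copies of the rows with a cleaned name added under out_key."""
    cleaned = []
    for row in rows:
        # Pick the dictionary for this row's country; fall back to no terms.
        terms = TERMS_BY_COUNTRY.get(row[country_key], {})
        words = [terms.get(word, word) for word in row[name_key].lower().split()]
        cleaned.append({**row, out_key: " ".join(words).title()})
    return cleaned


rows = [
    {"NAME": "MONSTERS INC", "COUNTRY_ALPHA2": "us"},
    {"NAME": "assurant holdings france sas", "COUNTRY_ALPHA2": "fr"},
]
result = clean_rows(rows, "NAME", "NAME_CLEAN", "COUNTRY_ALPHA2")
for row in result:
    print(row["COUNTRY_ALPHA2"], "->", row["NAME_CLEAN"])
```

The input rows are left untouched and each output row carries the original columns plus the new one, which matches the shape of the dataframe returned above.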
