# How to... clean up text with pre-defined rules

This notebook shows how to use the **SimpleCleaner** class to clean up text using the pre-defined rules available in the financial-entity-cleaner library. 

In [1]:
# Sets up the location of the financial-entity-cleaner library relative to this notebook 
import sys
sys.path.append('../../')

In [2]:
# Import SimpleCleaner
from financial_entity_cleaner.text import SimpleCleaner

In [3]:
# Create a SimpleCleaner object
simple_cleaner_obj = SimpleCleaner()

In [4]:
# Checking the rules available
simple_cleaner_obj.show_cleaning_rules()

['remove_email',
 'remove_url',
 'remove_word_the_from_the_end',
 'place_word_the_at_the_beginning',
 'remove_www_address',
 'enforce_single_space_between_words',
 'replace_amperstand_by_AND',
 'add_space_between_amperstand',
 'replace_amperstand_between_space_by_AND',
 'replace_hyphen_by_space',
 'replace_hyphen_between_spaces_by_single_space',
 'replace_underscore_by_space',
 'replace_underscore_between_spaces_by_single_space',
 'remove_all_punctuation',
 'remove_punctuation_except_dot',
 'remove_mentions',
 'remove_hashtags',
 'remove_asterisk',
 'remove_numbers',
 'remove_text_puctuation',
 'remove_text_puctuation_except_dot',
 'remove_math_symbols',
 'remove_math_symbols_except_dash',
 'remove_parentheses',
 'remove_brackets',
 'remove_curly_brackets',
 'remove_single_quote_next_character',
 'remove_single_quote',
 'remove_double_quote',
 'remove_words_in_parentheses',
 'remove_words_in_asterisk',
 'remove_question_marks_in_parentheses',
 'repeat_remove_words_in_parentheses']

In [5]:
# Text to clean
text_to_clean = 'Star.COM Group Serviços - Irmãos sociedade anônima **CAT** S.R.L (PT) / \|) adm@starcom.pt'

In [6]:
# Apply some rules to a text
selected_rules = ['remove_email', 
                  'remove_words_in_asterisk', 
                  'remove_words_in_parentheses', 
                  'replace_hyphen_by_space',
                  'remove_punctuation_except_dot']
result = simple_cleaner_obj.apply_cleaning_rules(text=text_to_clean, lst_rules=selected_rules)
print(result)

Star.COM Group Serviços   Irmãos sociedade anônima   S.R.L          


As the result above shows, the cleaning process can leave extra spaces behind. Use the **remove_extra_spaces()** method to tidy up the final result.

In [7]:
simple_cleaner_obj.remove_extra_spaces(result)

'Star.COM Group Serviços Irmãos sociedade anônima S.R.L'
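For readers who want to see what this kind of whitespace normalization involves, here is a minimal sketch using only the Python standard library. It is not the library's implementation; `collapse_spaces` is a hypothetical helper name used purely for illustration.

```python
import re


def collapse_spaces(text: str) -> str:
    """Collapse each run of whitespace into one space and trim both ends.

    Hypothetical helper: an illustration of the idea, not the code used
    by SimpleCleaner.remove_extra_spaces().
    """
    return re.sub(r"\s+", " ", text).strip()


# Same shape as the intermediate result above: runs of spaces and a trailing pad
raw = "Star.COM Group Serviços   Irmãos sociedade anônima   S.R.L          "
print(collapse_spaces(raw))
# Star.COM Group Serviços Irmãos sociedade anônima S.R.L
```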

To remove all spaces instead, use the **remove_all_spaces()** method:

In [8]:
simple_cleaner_obj.remove_all_spaces(result)

'Star.COMGroupServiçosIrmãossociedadeanônimaS.R.L'

The **remove_accents()** method removes all accents from a text:

In [9]:
result = simple_cleaner_obj.remove_extra_spaces(result)
simple_cleaner_obj.remove_accents(result).upper()

'STAR.COM GROUP SERVICOS IRMAOS SOCIEDADE ANONIMA S.R.L'
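A common way to strip accents in plain Python is to decompose each character with `unicodedata` and drop the combining marks. The sketch below shows that technique under the assumption that only Latin diacritics matter; `strip_accents` is a hypothetical name, and the library may well do this differently.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks.

    Hypothetical helper: illustrates a standard-library technique, not the
    code behind SimpleCleaner.remove_accents().
    """
    # NFD splits 'ç' into 'c' plus a combining cedilla, 'ã' into 'a' plus a tilde, etc.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Serviços Irmãos anônima").upper())
# SERVICOS IRMAOS ANONIMA
```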

By default, when applying rules that remove characters from the string, **SimpleCleaner()** replaces the unwanted symbols with a space. If you would rather drop the symbols entirely, pass an empty string as the replacement:

In [10]:
result = simple_cleaner_obj.apply_cleaning_rules(text=text_to_clean, 
                                                 lst_rules=['remove_all_punctuation'],
                                                 replacement="")
print(result)

StarCOM Group Serviços  Irmãos sociedade anônima CAT SRL PT   admstarcompt
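The difference between the two replacement modes can be reproduced with a small standard-library sketch. It assumes "punctuation" means the ASCII set in `string.punctuation`, which may not match the library's own rule exactly; `strip_punctuation` is a hypothetical helper, not part of financial-entity-cleaner.

```python
import re
import string

# Character class matching any ASCII punctuation symbol (an assumption about
# what the rule covers; the library's own pattern may differ)
PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "]")


def strip_punctuation(text: str, replacement: str = " ") -> str:
    """Replace every punctuation symbol with `replacement` (a space by default)."""
    return PUNCTUATION.sub(replacement, text)


sample = "Serviços - Irmãos S.R.L"
print(repr(strip_punctuation(sample)))       # 'Serviços   Irmãos S R L'
print(repr(strip_punctuation(sample, "")))   # 'Serviços  Irmãos SRL'
```

Even with an empty replacement, the spaces that originally surrounded the hyphen stay in place, which is why a run of spaces can survive in the output. Collapsing whitespace afterwards takes care of it.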
