# How to clean up a .csv file with AutoCleaner

This notebook shows how to use the **AutoCleaner()** class to clean up .csv files. To do so, you specify the cleaning properties in a settings file written in JSON or YAML.
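Because the cleaning is driven entirely by the settings file, it can help to confirm that the file exists and parses before starting a run. The sketch below covers only the JSON case and makes no assumption about which keys the settings must contain; `load_json_settings` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path


def load_json_settings(path):
    """Parse a JSON settings file, failing early if it is missing or malformed.

    Hypothetical helper for illustration; it does not validate the keys
    that AutoCleaner expects, only that the file is readable JSON.
    """
    settings_path = Path(path)
    if not settings_path.is_file():
        raise FileNotFoundError(f"Settings file not found: {settings_path}")
    with settings_path.open(encoding="utf-8") as handle:
        # json.load raises json.JSONDecodeError on malformed content
        return json.load(handle)
```

Calling this on the path you intend to assign to `settings_file` surfaces a missing or broken file before any cleaning starts.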

In [1]:
# Add the location of the API, relative to this notebook, to the import path
import sys
sys.path.append('../../')

In [2]:
# Import the pandas library to read data from a .csv file
import pandas as pd

In [3]:
# Import the module for AutoCleaner
from financial_entity_cleaner.batch import AutoCleaner

In [4]:
# Create the AutoCleaner object reused by both examples below
auto_cleaner_obj = AutoCleaner()

## Example 01: cleaning country & company name

In [13]:
auto_cleaner_obj.input_filename = "../../tests/data/test_cleaner_company.csv"
auto_cleaner_obj.output_filename = "../../tests/data/test_cleaner_company_CLEANED.csv"
auto_cleaner_obj.settings_file = "../../tests/data/test_cleaner_company.json"
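Both examples in this notebook name the output file by appending `_CLEANED` to the input file name. If you prefer not to type the output path by hand, it can be derived from the input path. This is a minimal sketch; `cleaned_path` is a hypothetical helper, not something the package provides.

```python
from pathlib import Path


def cleaned_path(input_filename, suffix="_CLEANED"):
    """Insert `suffix` before the file extension, keeping the directory unchanged."""
    source = Path(input_filename)
    return source.with_name(f"{source.stem}{suffix}{source.suffix}").as_posix()
```

The returned string could then be assigned to `auto_cleaner_obj.output_filename` in place of the literal path.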

In [14]:
result = auto_cleaner_obj.clean_file()

Cleaning file:../../tests/data/test_cleaner_company.csv
Executing automatic cleaning by location


Normalizing countries...100%|██████████████████████████████████████████████████| 14/14 [00:00<00:00, 1556.20it/s]

Executing automatic cleaning by text name
Saving output file at ../../tests/data/test_cleaner_company_CLEANED.csv





In [15]:
df_cleaned = pd.read_csv(auto_cleaner_obj.output_filename)
df_cleaned

Unnamed: 0,COMPANY_NAME,COUNTRY_ALPHA2,COUNTRY_ALPHA2_NAME_CLEAN,COUNTRY_ALPHA2_CLEAN,COMPANY_NAME_CLEAN
0,Central PERK,us,UNITED STATES,US,CENTRAL PERK
1,dunder-mifflin,us,UNITED STATES,US,DUNDER MIFFLIN
2,HOnEYDUKES,us,UNITED STATES,US,HONEYDUKES
3,STARCOURT; MALL,us,UNITED STATES,US,STARCOURT MALL
4,RR DINER,us,UNITED STATES,US,RR DINER
5,\tSterling Cooper,us,UNITED STATES,US,STERLING COOPER
6,\tWONKA INDUSTRIES!,us,UNITED STATES,US,WONKA INDUSTRIES
7,[]\tGRAND BUDAPEST HOTEL \t,us,UNITED STATES,US,GRAND BUDAPEST HOTEL
8,(text)\tMONSTERS INC.\t,us,UNITED STATES,US,MONSTERS INCORPORATED
9,Krusty Krab,us,UNITED STATES,US,KRUSTY KRAB
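
To make the `COMPANY_NAME` to `COMPANY_NAME_CLEAN` transformation above more concrete, here is a rough approximation of the kind of normalization visible in the table: bracketed fragments are dropped, punctuation and tabs become spaces, the text is upper-cased, and legal abbreviations are expanded. This is not the library's implementation. The function name and the tiny abbreviation table are assumptions for illustration only, and the sketch handles ASCII names only.

```python
import re

# Hypothetical abbreviation table for illustration; the package uses its own dictionaries.
LEGAL_TERMS = {"INC": "INCORPORATED"}


def rough_clean_name(raw):
    """Approximate the name cleaning seen in Example 01 (ASCII-only sketch)."""
    # Drop fragments enclosed in () or [], such as "(text)" or "[]"
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", raw)
    # Turn every run of punctuation or whitespace (including tabs) into one space
    text = re.sub(r"[^0-9A-Za-z]+", " ", text)
    # Upper-case, then expand known legal abbreviations word by word
    words = text.upper().split()
    return " ".join(LEGAL_TERMS.get(word, word) for word in words)


print(rough_clean_name("(text)\tMONSTERS INC.\t"))  # MONSTERS INCORPORATED
```

The sketch reproduces the cleaned names shown for Example 01, but it is far simpler than what the package does; the Example 02 names, for instance, involve accent removal and many more legal terms.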


## Example 02: cleaning country, company name and several IDs

In [16]:
auto_cleaner_obj.input_filename = "../../tests/data/test_cleaner.csv"
auto_cleaner_obj.output_filename = "../../tests/data/test_cleaner_CLEANED.csv"
auto_cleaner_obj.settings_file = "../../tests/data/test_cleaner.json"

In [17]:
result = auto_cleaner_obj.clean_file()

Cleaning file:../../tests/data/test_cleaner.csv
Executing automatic cleaning by location


Normalizing countries...100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 611.46it/s]


Executing automatic cleaning by id


Normalizing IDs...100%|██████████████████████████████████████████████████| 11/11 [00:00<00:00, 333.55it/s]


Executing automatic cleaning by text name


Cleaning company name...100%|██████████████████████████████████████████████████| 9/9 [00:00<00:00, 101.18it/s]
Cleaning company name...100%|██████████████████████████████████████████████████| 9/9 [00:00<00:00, 113.99it/s]

Saving output file at ../../tests/data/test_cleaner_CLEANED.csv
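
Note that the next cell reads this output with `sep=";"`, whereas Example 01 used the pandas default (a comma). If you are unsure which delimiter a given file uses, the standard library can make a guess from a sample of the file. This is a minimal sketch; `guess_delimiter` is a hypothetical helper, and the guess should be checked by eye for unusual files.

```python
import csv


def guess_delimiter(path, candidates=",;"):
    """Guess the field delimiter of a CSV file from its first few kilobytes."""
    with open(path, newline="", encoding="utf-8") as handle:
        sample = handle.read(4096)
    # csv.Sniffer raises csv.Error if it cannot decide between the candidates
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
```

The result can be passed straight to `pd.read_csv(..., sep=guess_delimiter(path))`.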





In [18]:
df_cleaned = pd.read_csv(auto_cleaner_obj.output_filename, sep=";")
df_cleaned

Unnamed: 0,COMPANY_NAME,COUNTRY_INCORP,COUNTRY_BUSINESS,LEI,ISIN,SEDOL,SIRET,SIREN,COUNTRY_INCORP_NAME_CLEAN,COUNTRY_INCORP_ALPHA2_CLEAN,...,ISIN_CLEAN,ISIN_VALID,SEDOL_CLEAN,SEDOL_VALID,SIRET_CLEAN,SIRET_VALID,SIREN_CLEAN,SIREN_VALID,COMPANY_NAME_INCORP_CLEAN,COMPANY_NAME_BUSINESS_CLEAN
0,Bechel *Australia (Services) Pty Ltd,au,au,98 4500 e8da1de9a0d939,,,533 602 942 R.C.S. Nanterre,733602966.0,AUSTRALIA,AU,...,,,,,533602942R.C.S.NANTERRE,0.0,733602966.0,0.0,BECHEL AUSTRALIA LIMITED PROPRIETARY COMPANY,BECHEL AUSTRALIA LIMITED PROPRIETARY COMPANY
1,NRI - KELLY's MERCHANDISE (AUST) PTY LTD,Australia,au,,au0000036949,,440 558 062,440558062.0,AUSTRALIA,AU,...,AU0000036949,1.0,,,440558062,0.0,440558062.0,1.0,NRI KELLY MERCHANDISE LIMITED PROPRIETARY COMPANY,NRI KELLY MERCHANDISE LIMITED PROPRIETARY COMPANY
2,meo - serviços de comunicação e multimedia SA...,PT,PT,5493001MT6YISZH3YV05,4ABL286115,,784-00-03-412,784000341.0,PORTUGAL,PT,...,4ABL286115,0.0,,,784-00-03-412,0.0,784000341.0,1.0,MEO SERVICOS DE COMUNICACAO E MULTIMEDIA SOCIE...,MEO SERVICOS DE COMUNICACAO E MULTIMEDIA SOCIE...
3,"Glass Coatings & Concepts ""CBG"" LLC",United States,us,34123456,4AAAA255044,,be 0413458441,41345844.0,UNITED STATES,US,...,4AAAA255044,0.0,,,BE0413458441,0.0,41345844.0,1.0,GLASS COATINGS AND CONCEPTS CBG LIMITED LIABIL...,GLASS COATINGS AND CONCEPTS CBG LIMITED LIABIL...
4,"Brault Loisirs, Orl. SARL",FR,France,,FR0000036196,552111809,fr24313140642,243131406.0,FRANCE,FR,...,FR0000036196,1.0,552111809,0.0,FR24313140642,0.0,243131406.0,1.0,BRAULT LOISIRS ORL SOCIETE A RESPONSABILITE LI...,BRAULT LOISIRS ORL SOCIETE A RESPONSABILITE LI...
5,"Cole & Brothers Fabric, Services LLC.",uk,uk,213800UHTTV6EGY74X82,,,333 171 544 00013,333171544.0,UNITED KINGDOM,GB,...,,,,,33317154400013,1.0,333171544.0,1.0,COLE AND BROTHERS FABRIC SERVICES LIMITED LIAB...,COLE AND BROTHERS FABRIC SERVICES LIMITED LIAB...
6,StarCOM Group Servizi **CAT** SRL,italy,ita,,,,,,ITALY,IT,...,,,,,,,,,STARCOM GROUP SERVIZI SOCIETA A RESPONSABILITA...,STARCOM GROUP SERVIZI SOCIETA A RESPONSABILITA...
7,Wolbeck (Archer Daniels) *Unified* GmbH,de,de,,,HPA 706689,,,GERMANY,DE,...,,,HPA706689,0.0,,,,,WOLBECK GESELLSCHAFT MIT BESCHRANKTER HAFTUNG,WOLBECK GESELLSCHAFT MIT BESCHRANKTER HAFTUNG
8,"Anheuser-BUSCH, Brothers (food services), LLC",gb,gb,US4567899888,4567788,,,,UNITED KINGDOM,GB,...,4567788,0.0,,,,,,,ANHEUSER BUSCH BROTHERS LIMITED LIABILITY COMPANY,ANHEUSER BUSCH BROTHERS LIMITED LIABILITY COMPANY
9,"Susamar-Patino, colectores (adm) SA",ES,spain,,ES0176252718,A7830 4516,,,SPAIN,ES,...,ES0176252718,1.0,A78304516,0.0,,,,,SUSAMAR PATINO COLECTORES SOCIEDAD ANONIMA,SUSAMAR PATINO COLECTORES SOCIEDAD ANONIMA
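
The `*_VALID` columns flag whether a cleaned identifier passes validation (1.0) or not (0.0). The notebook does not show how the package performs this check, so the sketch below is an independent illustration based on the publicly documented rules: an ISIN is two letters, nine alphanumeric characters and one check digit verified with the Luhn algorithm after letters are expanded to numbers (A=10 ... Z=35), and a SIREN is nine digits that pass the Luhn checksum. The function names are assumptions.

```python
def luhn_ok(digits):
    """Return True if a string of decimal digits passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def isin_ok(isin):
    """Check the ISIN shape and its Luhn check digit."""
    isin = isin.strip().upper()
    if len(isin) != 12 or not (isin.isascii() and isin.isalnum()):
        return False
    if not isin[:2].isalpha() or not isin[-1].isdigit():
        return False
    # Expand each character to its base-36 value: digits stay, A=10 ... Z=35
    expanded = "".join(str(int(char, 36)) for char in isin)
    return luhn_ok(expanded)


def siren_ok(siren):
    """A SIREN is nine digits that pass the Luhn checksum."""
    return len(siren) == 9 and siren.isascii() and siren.isdigit() and luhn_ok(siren)


print(isin_ok("AU0000036949"), isin_ok("4ABL286115"))  # True False
print(siren_ok("440558062"), siren_ok("733602966"))    # True False
```

For the values tried here, the results agree with the `ISIN_VALID` and `SIREN_VALID` flags in the table above, for example `AU0000036949` and `440558062` pass while `4ABL286115` and `733602966` fail.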
