This code was used to move a database in Word files into a more structured form. It has functions that look for the specific pattern and apply a cleanup flow.
- Pandas
- docx2txt
- re
Initially there was a database of users in many .docx files. The records had the following pattern:
Claim Number: S0M3C0D3C3Details
Claim Number Cross Reference:
Name: PEDRO I PEREZ
Birth Date: 05/24/1949
Date of Death:
Sex: M
Address: 12345 112TH ST
JAMAICA, NY 12345-6789
Most recent State: NY (33)
Most recent County: QUEENS (590)
In the main function you enter the path to a folder containing all .docx files.
The main() function applies the workflow for reading all documents in path. Here, the function leer_documento() saves the information of the .docx file in a string and looks for a regex pattern to save the information in dictionary groups and converts it into a Pandas DataFrame.
With all records stored in a Pandas DataFrame, a cleanup workflow is executed. This consists of:
- Removes blanks at the beginning and end of each record.
- Separate the ClaimNumber columns into two desired codes: Code1 and Code2.
- Separate the Address column into Address, county, state and zip code.
- Remove unwanted strings written by people in .docx files as: "Paso", "Roll", etc
With our database in a Pandas Dataframe, it is saved in a .csv file and in a .xlsx file with guardar() function.
In minutes, we converted a customer database distributed in 206 .docx files to +60k clean and structured records. After using this code, it was easier to enrich the database and store it in MySQL to optimize the company's workflows.