# Data Cleaning

This notebook presents the whole data cleaning process: extracting new tables and relations, and removing dirty tuples and values from the existing files.

The clean data files are saved in `.csv` format and will be used to load the data into the database.

In [9]:
# Import packages
import pandas as pd
import os

# Hide warnings
#import warnings
#warnings.filterwarnings('ignore')

## Data loading

We first import all the `.csv` files into `pandas` DataFrames.

_Note_: some lines are ill-formed; we choose to ignore them.

In [10]:
# Root of the data files
PATH = os.path.join('..', 'data', 'original')

# Dict: name -> dataframe
dataframes = {}

# Get all the original files
for file in os.listdir(PATH):
    name = file.split('.')[0]
    # Note: some lines are ill-formed, we ignore them
    # (pandas >= 1.3 replaces `error_bad_lines=False` with `on_bad_lines='skip'`)
    dataframes[name] = pd.read_csv(os.path.join(PATH, file), error_bad_lines=False)

b'Skipping line 6010: expected 16 fields, saw 29\nSkipping line 14907: expected 16 fields, saw 29\n'
b'Skipping line 142947: expected 16 fields, saw 27\n'
b'Skipping line 379178: expected 16 fields, saw 27\n'
b'Skipping line 411105: expected 16 fields, saw 18\n'
b'Skipping line 532885: expected 16 fields, saw 27\n'
b'Skipping line 625092: expected 16 fields, saw 27\n'
b'Skipping line 743624: expected 16 fields, saw 27\n'
b'Skipping line 786680: expected 16 fields, saw 27\n'
b'Skipping line 892034: expected 16 fields, saw 29\n'
b'Skipping line 1157212: expected 16 fields, saw 29\nSkipping line 1157657: expected 16 fields, saw 29\n'
b'Skipping line 1265876: expected 16 fields, saw 29\n'
b'Skipping line 1460021: expected 16 fields, saw 22\n'
b'Skipping line 1591431: expected 16 fields, saw 29\n'
b'Skipping line 1615982: expected 16 fields, saw 29\nSkipping line 1623916: expected 16 fields, saw 23\n'
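The skipping behaviour above can be reproduced on a small in-memory file. This is only a sketch: the helper name `read_csv_skipping_bad_lines` and the sample data are hypothetical, and the fallback assumes that an older pandas rejects the newer `on_bad_lines` keyword with a `TypeError`.

```python
import io

import pandas as pd


def read_csv_skipping_bad_lines(source):
    """Read a CSV and drop rows that have too many fields (hypothetical helper)."""
    try:
        # pandas >= 1.3
        return pd.read_csv(source, on_bad_lines='skip')
    except TypeError:
        # Older pandas: the keyword used in this notebook
        return pd.read_csv(source, error_bad_lines=False)


# Three columns are expected; the second data row has five fields
raw = "a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n"
sample = read_csv_skipping_bad_lines(io.StringIO(raw))
print(sample)
```

Only the two well-formed rows remain in `sample`; the row with extra fields is dropped rather than raising a parsing error.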


## Notes table

This part extracts the notes from each table that has a `notes` attribute. The notes are loaded into a new table and replaced by foreign keys in the original tables.


We first concatenate all the notes from all the dataframes.

In [3]:
# Collect the non-null notes of every dataframe that has a `notes` attribute
collected = []
for df in dataframes.values():
    if 'notes' in df.columns:
        collected.append(df['notes'].dropna())

# `Series.append` was removed in pandas 2.0; `pd.concat` does the same job
notes = pd.concat(collected, ignore_index=True)

# Keep unique notes
notes = notes.drop_duplicates(keep='first').reset_index(drop=True)
# Shift to start with ID 1
notes.index = notes.index + 1
notes.head()

1    Entire book available from gutenberg.org at ht...
2    Auction at cqout.com in August 2007 states "De...
3    Theatrical giveaway depicting a small portion ...
4                               Title from title page.
5    Best-known Wilhelm Busch book; The Katzenjamme...
dtype: object
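The deduplication and ID assignment can be checked on a toy example. The two tables below are hypothetical stand-ins for the original files, not real data from the dataset.

```python
import pandas as pd

# Toy stand-ins for two original tables (hypothetical data)
tables = {
    'story': pd.DataFrame({'notes': ['Title from title page.', None, 'Cover is blank.']}),
    'issue': pd.DataFrame({'notes': ['Cover is blank.', 'Reprint.']}),
}

# Concatenate the non-null notes of every table
all_notes = pd.concat([df['notes'].dropna() for df in tables.values()],
                      ignore_index=True)

# Keep the first occurrence of each note, then number them from 1
unique_notes = all_notes.drop_duplicates(keep='first').reset_index(drop=True)
unique_notes.index = unique_notes.index + 1
print(unique_notes)
```

The note shared by both tables appears once, and the index of `unique_notes` can serve directly as the note ID.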

Create the new notes dataframe:

In [4]:
# Form a DataFrame from the note Series
notes_df = notes.to_frame()
notes_df.columns = ['notes']
notes_df['id'] = notes_df.index

dataframes['notes'] = notes_df[['id', 'notes']]
notes_df.head()

| | notes | id |
|---|---|---|
| 1 | Entire book available from gutenberg.org at ht... | 1 |
| 2 | Auction at cqout.com in August 2007 states "De... | 2 |
| 3 | Theatrical giveaway depicting a small portion ... | 3 |
| 4 | Title from title page. | 4 |
| 5 | Best-known Wilhelm Busch book; The Katzenjamme... | 5 |


Replace the notes with their IDs in the original tables:

In [5]:
# Series mapping each note to its unique ID
notes_mapper = pd.Series(notes.index, index=notes)

for name, df in dataframes.items():
    # Skip the notes dataframe itself
    if name == 'notes':
        continue
    
    try:
        # Map notes to their IDs
        df['notes'] = df['notes'].map(notes_mapper)
    except KeyError:
        continue
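One side effect of this mapping is worth keeping in mind: rows without a note map to `NaN`, so the resulting column is stored as floats rather than integers. The sketch below uses hypothetical toy data; the conversion to the nullable `Int64` dtype is an optional alternative, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical notes table: ID (index) -> note text
notes_demo = pd.Series(['Title from title page.', 'Cover is blank.'], index=[1, 2])
# Reverse lookup: note text -> ID
mapper = pd.Series(notes_demo.index, index=notes_demo.values)

# Hypothetical original table with one missing note
stories = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'notes': ['Cover is blank.', None, 'Title from title page.'],
})

# Missing notes become NaN, which forces a float column
mapped = stories['notes'].map(mapper)
print(mapped.dtype)

# Optional: nullable integers keep the IDs integral and the gaps missing
ids = mapped.astype('Int64')
print(ids)
```

Keeping the float column also works as long as the values are written without decimals when the table is saved.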

## Artists table

In our design, we chose to create a new _Artists_ entity in order to abstract artist names from the original tables. We also create several relations to preserve the artists' roles in the different stories.

In [6]:
# TODO

## Individual files cleaning

This part cleans each `.csv` file individually, removing dirty rows and fixing values that need special treatment.

In [7]:
# TODO

## Saving files

We can now save our clean and new tables, ready for database loading.

In [8]:
# Code to save data in csv files for later use
#dataframes['notes'].to_csv('test.csv', index=False, float_format='%.0f')
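A minimal sketch of how every table could be saved in one pass, following the commented line above. The helper name `save_dataframes`, the toy tables and the temporary output directory are assumptions made for illustration; the real output location is not specified in this notebook.

```python
import os
import tempfile

import pandas as pd


def save_dataframes(dataframes, out_dir):
    """Write each dataframe to <out_dir>/<name>.csv (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    for name, df in dataframes.items():
        path = os.path.join(out_dir, name + '.csv')
        # '%.0f' writes float-encoded IDs such as 2.0 as "2"
        df.to_csv(path, index=False, float_format='%.0f')


# Toy tables standing in for the cleaned data (hypothetical)
demo = {
    'notes': pd.DataFrame({'id': [1, 2], 'notes': ['a', 'b']}),
    'story': pd.DataFrame({'title': ['A', 'B'], 'notes': [2.0, float('nan')]}),
}

out_dir = tempfile.mkdtemp()
save_dataframes(demo, out_dir)
print(sorted(os.listdir(out_dir)))
```

Note that `float_format='%.0f'` applies to every float column of a table, so it is only safe for tables whose float columns hold IDs and not genuine decimal values.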