# Data Cleaning

This notebook presents the whole data cleaning process, which consists of extracting new tables and relations and removing dirty tuples and values from the existing files.

The cleaned data files are saved in `.csv` format and will be used to load the data into the database.

In [1]:
# Import packages
import pandas as pd
import os

## Data loading

We first import all the `.csv` files into `pandas` DataFrames.

_Note_: some lines are ill-formed; we choose to ignore them.

In [2]:
# Root of the data files
PATH = os.path.join('..', 'data', 'original')

# Dict: table name -> DataFrame
dataframes = {}

# Get all the original files
for file in os.listdir(PATH):
    name = file.split('.')[0]
    # Note: some lines are ill-formed; warn about them and skip them
    # (`on_bad_lines` replaces the removed `error_bad_lines` flag, pandas >= 1.3)
    dataframes[name] = pd.read_csv(os.path.join(PATH, file), on_bad_lines='warn')

b'Skipping line 6010: expected 16 fields, saw 29\nSkipping line 14907: expected 16 fields, saw 29\n'
b'Skipping line 142947: expected 16 fields, saw 27\n'
b'Skipping line 379178: expected 16 fields, saw 27\n'
b'Skipping line 411105: expected 16 fields, saw 18\n'
b'Skipping line 532885: expected 16 fields, saw 27\n'
b'Skipping line 625092: expected 16 fields, saw 27\n'
b'Skipping line 743624: expected 16 fields, saw 27\n'
b'Skipping line 786680: expected 16 fields, saw 27\n'
b'Skipping line 892034: expected 16 fields, saw 29\n'
b'Skipping line 1157212: expected 16 fields, saw 29\nSkipping line 1157657: expected 16 fields, saw 29\n'
b'Skipping line 1265876: expected 16 fields, saw 29\n'
b'Skipping line 1460021: expected 16 fields, saw 22\n'
b'Skipping line 1591431: expected 16 fields, saw 29\n'
b'Skipping line 1615982: expected 16 fields, saw 29\nSkipping line
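
To illustrate how ill-formed lines are dropped, here is a minimal sketch on a made-up in-memory CSV (the `raw` string and its columns are invented for the example, not taken from the data set). It assumes pandas 1.3 or later, where `on_bad_lines` is the supported way to skip rows with too many fields.

```python
import io

import pandas as pd

# Hypothetical sample: the second data row has 4 fields instead of 2
raw = "id,name\n1,Alpha\n2,Beta,extra,fields\n3,Gamma\n"

# 'skip' silently drops rows with too many fields; 'warn' also reports them
clean = pd.read_csv(io.StringIO(raw), on_bad_lines='skip')

print(clean['id'].tolist())  # [1, 3]
```

Only rows with too many fields are treated as bad lines; rows with too few fields are padded with `NaN` and kept, so they still need to be checked afterwards.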

## Notes table

This part extracts the notes from each table that has a `notes` attribute. Notes are loaded into a new table and replaced by foreign keys in the original tables.


We first concatenate all the notes from all the dataframes.

In [3]:
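# Collect the non-null notes of every table that has a `notes` column
# (`Series.append` was removed in pandas 2.0, so we use `pd.concat`)
notes = pd.concat(
    [df['notes'].dropna() for df in dataframes.values() if 'notes' in df.columns],
    ignore_index=True,
)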

# Keep unique notes
notes = notes.drop_duplicates(keep='first').reset_index(drop=True)
# Shift to start with ID 1
notes.index = notes.index + 1
notes.head()

1    The company was founded in 1865 by a bookselle...
2    The Graphic Office 190 Strand London W.CAddres...
3    Star published a line of coloring books in com...
4    Cupples & Leon was founded in 1902 by Victor I...
5    Intended to be used for books that contain no ...
dtype: object

Create the new notes dataframe:

In [4]:
# Form a DataFrame from the note Series
notes_df = notes.to_frame()
notes_df.columns = ['notes']
notes_df['id'] = notes_df.index

dataframes['notes'] = notes_df[['id', 'notes']]
notes_df.head()

                                               notes  id
1  The company was founded in 1865 by a bookselle...   1
2  The Graphic Office 190 Strand London W.CAddres...   2
3  Star published a line of coloring books in com...   3
4  Cupples & Leon was founded in 1902 by Victor I...   4
5  Intended to be used for books that contain no ...   5


Replace the notes with their IDs in the original tables:

In [5]:
# Series from notes to uniqueID
notes_mapper = pd.Series(notes.index, index=notes)

for name, df in dataframes.items():
    # Skip the notes dataframe obviously
    if name == 'notes':
        continue
    
    if 'notes' in df.columns:
        # Map notes to their IDs
        df['notes'] = df['notes'].map(notes_mapper)
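
The whole extraction can be followed end to end on a toy example. The sketch below uses two invented tables (`publisher` and `brand`, with made-up notes) and mirrors the steps above: collect the notes, deduplicate them, number them from 1, then map each note to its ID.

```python
import pandas as pd

# Hypothetical toy tables; the values are invented for illustration
publisher = pd.DataFrame({'id': [1, 2, 3],
                          'notes': ['founded 1902', None, 'see brand']})
brand = pd.DataFrame({'id': [10, 11],
                      'notes': ['see brand', 'logo changed']})
tables = {'publisher': publisher, 'brand': brand}

# Collect, deduplicate, and number the notes starting at 1
notes = pd.concat([df['notes'].dropna() for df in tables.values()],
                  ignore_index=True)
notes = notes.drop_duplicates(keep='first').reset_index(drop=True)
notes.index = notes.index + 1

# Lookup Series: note text -> note ID
note_to_id = pd.Series(notes.index, index=notes.values)

# Replace each note with its ID; rows without a note become NaN
for df in tables.values():
    df['notes'] = df['notes'].map(note_to_id)

print(publisher['notes'].tolist())  # [1.0, nan, 2.0]
print(brand['notes'].tolist())      # [2, 3]
```

Note that a column containing missing notes becomes `float64` after the mapping, because `NaN` cannot be stored in a plain integer column. This matters when the IDs are later written to `.csv`: they should be formatted without a decimal part.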

## Artists table

In our design, we chose to create a new _Artists_ entity in order to abstract artist names from the original tables. We also create different relations to preserve the artists' roles in the different stories.

In [6]:
# TODO

## Individual files cleaning

This part aims to clean each `.csv` file individually in order to remove dirty rows and clear values that need some special treatment.

### Country

Browsing the country data reveals one invalid row, with ID 248. The cell below shows that, in `publisher` for example, no row references this ID. It is therefore very likely dirty data, and we can safely remove it.

In [26]:
pub = dataframes['publisher']
print('Number of publisher with country_id 248: {}.'.format(len(pub[pub['country_id'] == 248])))

# Look for NaN values
print('NaN values: ')
df = dataframes['country']
df.isnull().sum()

Number of publisher with country_id 248: 0.
NaN values: 


id      0
code    0
name    0
dtype: int64

In [27]:
# Remove the desired row
dataframes['country'] = df[df['id'] != 248]
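
The check above only looks at `publisher`. A possible generalization, sketched below, counts the references in every table that has the relevant foreign key column. The helper name `count_references` and the toy tables are assumptions made for this example, not part of the original notebook.

```python
import pandas as pd


def count_references(tables, column, value):
    """Return {table name: number of rows whose `column` equals `value`}.

    Tables that do not have `column` are ignored.
    """
    return {name: int((df[column] == value).sum())
            for name, df in tables.items()
            if column in df.columns}


# Hypothetical toy tables; the values are invented for illustration
tables = {
    'publisher': pd.DataFrame({'id': [1, 2], 'country_id': [225, 75]}),
    'indicia_publisher': pd.DataFrame({'id': [1], 'country_id': [248]}),
    'story_type': pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']}),
}

refs = count_references(tables, 'country_id', 248)
print(refs)  # {'publisher': 0, 'indicia_publisher': 1}
```

A row is only safe to remove when every count is zero; otherwise the referencing rows have to be handled first.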

### Story Reprint

The story reprint table must contain no missing values, since we do not accept _NULL_ foreign keys in this case. The cell below shows that the table has no empty cells.

In [31]:
dataframes['story_reprint'].isnull().sum()

id           0
origin_id    0
target_id    0
dtype: int64
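
A non-null foreign key can still point to a story that does not exist. As a complementary check that is not part of the original notebook, the sketch below looks for such dangling references with `isin`. The toy `story` and `story_reprint` tables are invented; only the column names follow the output above.

```python
import pandas as pd

# Hypothetical toy tables; the values are invented for illustration
story = pd.DataFrame({'id': [1, 2, 3]})
story_reprint = pd.DataFrame({'id': [1, 2],
                              'origin_id': [1, 2],
                              'target_id': [3, 99]})

valid_ids = set(story['id'])

# A reprint is valid only if both ends reference an existing story
is_valid = (story_reprint['origin_id'].isin(valid_ids)
            & story_reprint['target_id'].isin(valid_ids))
dangling = story_reprint[~is_valid]

print(dangling['id'].tolist())  # [2]
```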

### Story Type

By looking at the story types we see that the third row is problematic:

In [34]:
df = dataframes['story_type']
df.iloc[2]

id                                             3
name    (backcovers) *do not use* / *please fix*
Name: 2, dtype: object

We check if any story contains a reference to this row:

In [41]:
stories = dataframes['story']
print('Number of stories referencing ID 3: {}.'.format(len(stories[stories['type_id'] == 3])))

Number of stories referencing ID 3: 0.


We can safely remove it:

In [44]:
dataframes['story_type'] = df[df['id'] != 3]

## Saving files

We can now save our clean and new tables, ready for database loading.

In [7]:
# Code to save data in csv files for later use
#dataframes['notes'].to_csv('test.csv', index=False, float_format='%.0f')
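
The commented line above saves a single table. One possible way to save every table is sketched below. The helper name `save_tables` and the output directory are assumptions for the example; the `float_format='%.0f'` option is taken from the commented line and writes float-typed ID columns without a decimal part.

```python
import os
import tempfile

import pandas as pd


def save_tables(tables, out_dir):
    """Write each DataFrame to `<out_dir>/<name>.csv` without the index."""
    os.makedirs(out_dir, exist_ok=True)
    for name, df in tables.items():
        # '%.0f' prints float IDs such as 2.0 as '2'; NaN stays an empty cell
        df.to_csv(os.path.join(out_dir, name + '.csv'),
                  index=False, float_format='%.0f')


# Demo on a hypothetical table, written to a temporary directory
tables = {'publisher': pd.DataFrame({'id': [1, 2], 'notes': [2.0, None]})}
with tempfile.TemporaryDirectory() as tmp:
    save_tables(tables, tmp)
    with open(os.path.join(tmp, 'publisher.csv')) as f:
        content = f.read()

print(content)
```

Be aware that `float_format='%.0f'` applies to every float column, so it is only appropriate when all float columns hold IDs. Tables with genuine decimal values would need a different format or a conversion of the ID columns to a nullable integer type beforehand.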