# Data Cleaning

This notebook presents the whole data cleaning process, which consists of extracting new tables and relations and removing dirty tuples and values from the existing files.

The clean data files are saved in `.csv` format and will be used to load the data into the database.

In [279]:
# Import packages
import pandas as pd
import os
import numpy as np
import csv
import json

## Data loading

We first import all the `.csv` files into `pandas` DataFrames.

_Note_: some lines are ill-formed; we choose to ignore them.

In [242]:
# Root of the data files
PATH = os.path.join('..', 'data', 'original')

# Dic: name -> dataframe
dataframes = {}

# Get all the original files
for file in os.listdir(PATH):
    # Skip hidden files
    if file.startswith('.'):
        continue

    name = file.split('.')[0]
    # Note: some lines are ill-formed, we ignore them
    dataframes[name] = pd.read_csv(os.path.join(PATH, file), encoding='utf-8', quoting=csv.QUOTE_NONE)
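
As a minimal sketch of how ill-formed lines can be skipped explicitly, the toy example below reads an in-memory CSV whose third line has one field too many. It assumes pandas 1.3 or later, where `read_csv` accepts `on_bad_lines='skip'`; the sample content and the `toy` name are made up for illustration and are not taken from our files.

```python
import csv
import io

import pandas as pd

# Toy CSV content (made up): the third line has one field too many
raw = "id,name\n1,Alice\n2,Bob,unexpected\n3,Carol\n"

# on_bad_lines='skip' drops lines with too many fields instead of raising
toy = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE, on_bad_lines='skip')
print(toy)
```

Only the rows with IDs 1 and 3 survive; the same argument could be passed to the `read_csv` call of the loop above if the installed pandas version supports it.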



## Notes table

This part extracts the notes from each table containing a `notes` attribute. The notes are stored in a new table and replaced by foreign keys in the original tables.


We first concatenate all the notes from all the dataframes.

In [243]:
# Collect the non-null notes of every table that has a `notes` column
note_series = [df['notes'].dropna() for df in dataframes.values() if 'notes' in df.columns]
notes = pd.concat(note_series, ignore_index=True)

# Keep unique notes
notes = notes.drop_duplicates(keep='first').reset_index(drop=True)
# Shift to start with ID 1
notes.index = notes.index + 1
notes.head()

1    Lettering credit from Dick Ayers via Mike Quil...
2    Lettering credit from Dick Ayers via Mike Quil...
3                                Single panel cartoons
4    Victorian style comic (no word balloons; brief...
5                                single panel cartoons
dtype: object

Create the new notes dataframe:

In [244]:
# Form a DataFrame from the note Series
notes_df = notes.to_frame()
notes_df.columns = ['notes']
notes_df['id'] = notes_df.index

dataframes['notes'] = notes_df[['id', 'notes']]
notes_df.head()

| | notes | id |
|---|---|---|
| 1 | Lettering credit from Dick Ayers via Mike Quil... | 1 |
| 2 | Lettering credit from Dick Ayers via Mike Quil... | 2 |
| 3 | Single panel cartoons | 3 |
| 4 | Victorian style comic (no word balloons; brief... | 4 |
| 5 | single panel cartoons | 5 |


Replace the notes by the IDs in the original tables:

In [245]:
# Series from notes to uniqueID
notes_mapper = pd.Series(notes.index, index=notes)

for name, df in dataframes.items():
    # Skip the notes dataframe obviously
    if name == 'notes':
        continue
    
    if 'notes' in df.columns:
        # Map notes to their IDs
        df['notes'] = df['notes'].map(notes_mapper)
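
The three steps above (collect the notes, number them, map them back) can be sketched on a toy table. The table content and the names `toy`, `unique_notes` and `mapper` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy table (made up) with a free-text `notes` column
toy = pd.DataFrame({'id': [10, 11, 12, 13],
                    'notes': ['reprint', None, 'cover only', 'reprint']})

# Unique, non-null notes, numbered from 1
unique_notes = toy['notes'].dropna().drop_duplicates().reset_index(drop=True)
unique_notes.index = unique_notes.index + 1

# Lookup from note text to note ID, then replace the text by the ID
mapper = pd.Series(unique_notes.index, index=unique_notes)
toy['notes'] = toy['notes'].map(mapper)
print(toy)
```

Rows without a note keep a missing value, which is why the mapped column holds floats. Note also that `drop_duplicates` is case-sensitive, which is why `Single panel cartoons` and `single panel cartoons` both appear in the notes listed above.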

## First/Last issue Relation

We noticed a cyclic dependency between the tables _Issues_ and _Series_: each issue belongs to a series, and each series references its first and last issue. Such cyclic relations between tables are generally a bad idea (and impractical to load). We therefore create a new relation *First_last_issue* that links each series to its first and last issue, and remove the references to _Issues_ from _Series_.

In [246]:
# Extract relation
first_last_issue = dataframes['series'][['id', 'first_issue_id', 'last_issue_id']]

# Rename the columns
first_last_issue.columns = ['serie_id', 'first_issue_id', 'last_issue_id']

# Remove rows if first_issue_id and last_issue_id are both NULL
first_last_issue = first_last_issue.dropna(subset=['first_issue_id', 'last_issue_id'], how='all')

# Save the new relation
dataframes['first_last_issue'] = first_last_issue

first_last_issue.head()

| | serie_id | first_issue_id | last_issue_id |
|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 |
| 1 | 2 | 2.0 | 2.0 |
| 2 | 3 | 3.0 | 3.0 |
| 3 | 4 | 6.0 | 6.0 |
| 4 | 5 | 4.0 | 4.0 |


We can now drop the *first_issue_id* and *last_issue_id* columns of _Series_:

In [247]:
dataframes['series'] = dataframes['series'].drop(['first_issue_id', 'last_issue_id'], axis=1)
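
The issue IDs of the extracted relation are displayed as floats (`1.0`, `6.0`) because pandas stores integer columns that contain missing values as floats. As a possible variant, not used in this notebook, the sketch below builds the same kind of relation on made-up data and converts it to the nullable `Int64` dtype, assuming a pandas version that supports it (0.24 or later).

```python
import numpy as np
import pandas as pd

# Toy series table (made up): series 2 has neither a first nor a last issue
series = pd.DataFrame({'id': [1, 2, 3],
                       'first_issue_id': [1.0, np.nan, 3.0],
                       'last_issue_id': [1.0, np.nan, np.nan]})

relation = (series.rename(columns={'id': 'serie_id'})
                  # Drop rows where both issue references are missing
                  .dropna(subset=['first_issue_id', 'last_issue_id'], how='all')
                  # Nullable integers keep the IDs integral despite missing values
                  .astype('Int64'))
print(relation)
```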

## Artists table

We first scan through all the categories of artists, clean the data, and then store all artists in a single table, as described in our ER diagram.

In [248]:
# Dictionary to store the artists of each category
artists = {}
categories = ['script', 'pencils', 'inks', 'colors', 'letters']
story_df = dataframes['story']

for category in categories:
    # Remove everything in () or [] and any '?', strip leading/trailing
    # semicolons and spaces, and set empty cells to NaN
    story_df[category] = (story_df[category]
                          .str.replace(r"(\(.*\))|(\[.*\])|\?", "", regex=True)
                          .str.strip('; ')
                          .replace('', np.nan))

    # Index the artists of this category by story ID and drop missing values
    artists[category] = story_df.set_index('id')[category].dropna()

# Gather the artists of all categories in one Series
all_artists = pd.concat(list(artists.values()), ignore_index=True)

    
# Keep unique artists
all_artists = all_artists.drop_duplicates(keep='first').reset_index(drop=True)
# Shift to start with ID 1
all_artists.index = all_artists.index + 1
all_artists.head()


1                   Gustave Doré
2    Wilhelm Busch; Harry Rogers
3         The Donaldson Brothers
4                  Wilhelm Busch
5                  Richard Doyle
dtype: object
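
To make the effect of the cleaning pattern concrete, here is a small sketch that applies the same regular expression to a few made-up credit strings (the strings and the names `raw` and `cleaned` are hypothetical). It passes `regex=True` explicitly, which recent pandas versions require for pattern replacement.

```python
import numpy as np
import pandas as pd

# Made-up credit strings, for illustration only
raw = pd.Series(['Gustave Doré (signed)',
                 '? [unknown];',
                 'Joe Simon (pencils); Jack Kirby (inks)',
                 '?'])

cleaned = (raw.str.replace(r"(\(.*\))|(\[.*\])|\?", "", regex=True)
              .str.strip('; ')        # strip leading/trailing ';' and spaces
              .replace('', np.nan))   # empty strings become missing values
print(cleaned)
```

The third string is reduced to `Joe Simon`: the greedy `.*` matches from the first `(` to the last `)`, so a name located between two parenthesised remarks is removed as well. A non-greedy pattern such as `\(.*?\)` would be one way to avoid this, at the cost of changing the results shown in this notebook.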

Here, we build one table per artist category, linking each story ID to the ID of the corresponding entry in the artists table. We also store these tables, together with the global artists table, in `dataframes`.

In [249]:
# Series from artists to uniqueID
artist_mapper = pd.Series(all_artists.index, index=all_artists)

for category in categories:
    artists[category] = artists[category].to_frame() 
    artists[category]['story_id'] = artists[category].index
    artists[category].columns = ['artist_id','story_id']
    artists[category]['artist_id'] = artists[category]['artist_id'].map(artist_mapper)
    
    dataframes[category] = artists[category]
    

df = all_artists.to_frame()
df.columns = ['name']
df['id'] = df.index
dataframes['artists'] = df
dataframes['artists'].head()

| | name | id |
|---|---|---|
| 1 | Gustave Doré | 1 |
| 2 | Wilhelm Busch; Harry Rogers | 2 |
| 3 | The Donaldson Brothers | 3 |
| 4 | Wilhelm Busch | 4 |
| 5 | Richard Doyle | 5 |


In [250]:
# Drop the artist columns from the story table
story_df = story_df.drop(categories, axis=1)
dataframes['story'] = story_df


## Individual files cleaning

This part aims to clean each `.csv` file individually in order to remove dirty rows and clear values that need some special treatment.

### Country

By browsing the country data, we see that the row with ID 248 is not valid. The cell below shows that no row of `publisher`, for example, references this ID, so it is most likely dirty data and we can safely remove it.

In [251]:
pub = dataframes['publisher']
print('Number of publisher with country_id 248: {}.'.format(len(pub[pub['country_id'] == 248])))

# Look for NaN values
print('NaN values: ')
df = dataframes['country']
df.isnull().sum()

Number of publisher with country_id 248: 0.
NaN values: 


id      0
code    0
name    0
dtype: int64

In [252]:
# Remove the desired row
dataframes['country'] = df[df['id'] != 248]
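
The check above only looks at `publisher`, but other tables (for instance `indicia_publisher`) also have a `country_id` column. As a hedged sketch, the hypothetical helper `count_references` below counts the references to a given ID in every table that has the column; the toy tables and their values are made up.

```python
import pandas as pd

def count_references(tables, column, target_id):
    """Count, per table, the rows whose `column` equals `target_id`."""
    return {name: int((df[column] == target_id).sum())
            for name, df in tables.items() if column in df.columns}

# Toy tables (made up); `language` has no `country_id` column and is skipped
toy_tables = {
    'publisher': pd.DataFrame({'id': [1, 2], 'country_id': [225, 75]}),
    'indicia_publisher': pd.DataFrame({'id': [1], 'country_id': [225]}),
    'language': pd.DataFrame({'id': [1], 'code': ['en']}),
}

counts = count_references(toy_tables, 'country_id', 248)
print(counts)
```

Called as `count_references(dataframes, 'country_id', 248)`, it would confirm that no table references the row before it is removed.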

### Story Reprint

The story reprint table must not contain missing values, as we don't accept _NULL_ foreign keys in this case. The cell below shows that there are no empty cells in the table.

In [253]:
dataframes['story_reprint'].isnull().sum()

id           0
origin_id    0
target_id    0
dtype: int64
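
Reading the null counts by eye works for a notebook, but the same check can be turned into an assertion that fails loudly if the data changes. A minimal sketch, where the helper `assert_no_nulls` and the toy table are hypothetical:

```python
import pandas as pd

def assert_no_nulls(df, columns):
    """Raise an AssertionError naming the columns that contain missing values."""
    null_counts = df[columns].isnull().sum()
    offending = null_counts[null_counts > 0]
    assert offending.empty, 'Missing values found: {}'.format(offending.to_dict())

# Toy reprint table (made up) with complete foreign keys
toy_reprint = pd.DataFrame({'id': [1, 2],
                            'origin_id': [10, 11],
                            'target_id': [20, 21]})

assert_no_nulls(toy_reprint, ['origin_id', 'target_id'])
```

The same helper could be applied to the other tables whose foreign keys must be filled, such as the issue reprint table or the *publisher_id* of the indicia publishers.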

### Story Type

By looking at the story types we see that the third row is problematic:

In [254]:
df = dataframes['story_type']
df.iloc[2]

id                                             3
name    (backcovers) *do not use* / *please fix*
Name: 2, dtype: object

We check if any story contains a reference to this row:

In [255]:
stories = dataframes['story']
print('Number of stories referencing ID 3: {}.'.format(len(stories[stories['type_id'] == 3])))

Number of stories referencing ID 3: 0.


We can safely remove it:

In [256]:
dataframes['story_type'] = df[df['id'] != 3]

### Language

Looking at the language file, all the rows are clean and it's safe to keep them as they are.

In [257]:
dataframes['language'].isnull().sum()

id      0
code    0
name    0
dtype: int64

### Brand group

In [258]:
dataframes['brand_group'].isnull().sum()

id                 0
name               0
year_began      2938
year_ended      3857
notes           4615
url             4701
publisher_id       0
dtype: int64

As we can see, the essential attributes don't have missing values.

In [259]:
dataframes['brand_group']['name'].value_counts().head()

Marvel                  17
DC                      10
Dargaud                  8
A                        7
Classics Illustrated     6
Name: name, dtype: int64

However, we see that quite a few names appear several times. The cell below shows that rows sharing a name have different *publisher_id*s (for example, the 17 `Marvel` rows all reference a different publisher), so it makes sense to keep these duplicates.

In [260]:
dataframes['brand_group'][dataframes['brand_group']['name'] == 'Marvel']['publisher_id'].values

array([2105,  613, 3434,   78, 4720, 3174, 4437,  592, 3029, 8492, 7151,
       2195, 5905, 1798, 1977, 3655, 6917])
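
The cell above inspects a single name. To check all names at once, one option is to look for duplicates on the pair (*name*, *publisher_id*) rather than on the name alone. The sketch below does this on a made-up table; the rows and the names `duplicated_names` and `duplicated_pairs` are hypothetical.

```python
import pandas as pd

# Toy brand group table (made up): 'Marvel' appears for two different publishers
toy_brand_group = pd.DataFrame({'id': [1, 2, 3],
                                'name': ['Marvel', 'Marvel', 'DC'],
                                'publisher_id': [78, 613, 54]})

# Rows whose name appears more than once
duplicated_names = toy_brand_group['name'].duplicated(keep=False)
# Rows whose (name, publisher_id) pair appears more than once
duplicated_pairs = toy_brand_group.duplicated(subset=['name', 'publisher_id'], keep=False)

print(duplicated_names.sum(), duplicated_pairs.sum())
```

If `duplicated_pairs` contains no `True` value on the real table, every repeated name belongs to a different publisher and the duplicates are legitimate.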

### Series Publication types
This table only contains three rows, all of which are valid.

In [261]:
dataframes['series_publication_type'].head()

| | id | name |
|---|---|---|
| 0 | 1 | book |
| 1 | 2 | magazine |
| 2 | 3 | album |


### Issue Reprint

We make sure there are no null values in the issue reprint table:

In [262]:
dataframes['issue_reprint'].isnull().sum()

id                 0
origin_issue_id    0
target_issue_id    0
dtype: int64

### Indicia Publisher

For this table we need to make sure the *publisher_id* attribute is not null, which is the case:

In [263]:
dataframes['indicia_publisher'].isnull().sum()

id                 0
name               0
publisher_id       0
country_id         0
year_began      2612
year_ended      3563
is_surrogate       0
notes           4282
url             4711
dtype: int64

### Publisher

In [264]:
# TODO

### Stories

In [265]:
# TODO

### Issues

In [266]:
# TODO

### Series

In [267]:
# TODO

## Saving files

We can now save our clean and new tables, ready for database loading.

In [277]:
OUTPUT_PATH = os.path.join('..', 'data', 'clean')

#for name, df in dataframes.items():
#    df.to_csv(name.title() + '.csv', index=False, float_format='%.0f')
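
As a hedged sketch of what the saving step does, the example below writes a made-up table to a temporary directory, which stands in for `OUTPUT_PATH`, and reads the file back. ID columns that contain missing values are stored as floats by pandas, and `float_format='%.0f'` writes them without the decimal part.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy table (made up): the `notes` foreign key is a float column because of the NaN
tables = {'brand_group': pd.DataFrame({'id': [1, 2], 'notes': [1.0, np.nan]})}

# A temporary directory stands in for OUTPUT_PATH in this sketch
with tempfile.TemporaryDirectory() as output_path:
    for name, df in tables.items():
        target = os.path.join(output_path, name.title() + '.csv')
        df.to_csv(target, index=False, float_format='%.0f')

    with open(os.path.join(output_path, 'Brand_Group.csv')) as f:
        saved = f.read()

print(saved)
```

The foreign key `1.0` is written as `1`, and the missing value becomes an empty field.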

We also collect the minimum, maximum and average length of the string values in each column of each table, to help us choose appropriate string lengths for the database:

In [301]:
lengths = {}
for name, df in dataframes.items():
    tmp = {}
    for col in df.columns:
        values = df[col].dropna()
        # Only consider non-empty columns that hold strings
        if values.empty or not isinstance(values.iloc[0], str):
            continue

        strs = values.str.len()
        tmp[col] = {'min': int(strs.min()),
                    'max': int(strs.max()),
                    'ave': int(strs.mean())}
    if tmp:
        lengths[name] = tmp

In [302]:
with open(os.path.join(OUTPUT_PATH, 'lengths.json'), 'w') as file:
    json.dump([lengths], file)
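
To illustrate the layout of `lengths.json`, the sketch below computes the same statistics for one made-up column and serialises them the way the cell above does, as a one-element list wrapping the dictionary. The toy table and the names `name_lengths`, `stats` and `payload` are hypothetical.

```python
import json

import pandas as pd

# Toy table (made up) with one string column
toy = pd.DataFrame({'id': [1, 2, 3], 'name': ['DC', 'Marvel', 'Dargaud']})

# Length statistics of the string column, truncated to integers
name_lengths = toy['name'].dropna().str.len()
stats = {'min': int(name_lengths.min()),
         'max': int(name_lengths.max()),
         'ave': int(name_lengths.mean())}

# Same structure as the saved file: [{table: {column: stats}}]
payload = json.dumps([{'brand_group': {'name': stats}}])
print(payload)
```

Here the longest name has 7 characters, so a string column of at least that length would be needed for this toy table.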