# Data Cleaning

This notebook presents the whole data cleaning process, which consists of extracting new tables and relations and removing dirty tuples and values from the existing files.

The cleaned data files are saved in `.csv` format and will be used to load the data into the database.

In [143]:
# Import packages
import pandas as pd
import os
import numpy as np
import csv
import json
import utils

## Data loading

We first import all the `.csv` files into `pandas` DataFrames.

_Note_: some lines are ill-formed; we choose to ignore them.

In [144]:
# Root of the data files
PATH = os.path.join('..', 'data', 'original')

# Dic: name -> dataframe
dataframes = {}

# Get all the original files
for file in os.listdir(PATH):
    # Skip hidden files
    if (file.startswith('.')):
        continue
        
    name = file.split('.')[0]
    # Note: some lines are ill-formed, we ignore them
    dataframes[name] = pd.read_csv(os.path.join(PATH, file), encoding='utf-8', quoting=csv.QUOTE_NONE)
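
The loading cell above does not show how the ill-formed lines are skipped. A minimal sketch of one way to do it, assuming a recent pandas (1.3 or later) where `read_csv` accepts `on_bad_lines='skip'`; the inline CSV text is made up for the illustration:

```python
import csv
import io

import pandas as pd

# Made-up CSV content: the third line has one field too many
raw = "id,name\n1,Alpha\n2,Beta,extra\n3,Gamma\n"

# on_bad_lines='skip' drops lines with too many fields instead of raising
sample = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE, on_bad_lines='skip')
print(sample['id'].tolist())
```

Older pandas versions exposed the same behaviour through `error_bad_lines=False`.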


Here we remove all stories in which every attribute is null apart from *type_id*, *issue_id* and *id*: `thresh=4` keeps only the rows with at least four non-null values.

In [145]:
dataframes['story'] = dataframes['story'].dropna(thresh=4)
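
As a small illustration of how `thresh` behaves, here is a sketch on a toy frame (the values and the `title` column are made up and do not come from the real `story` table):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the story table (hypothetical values)
toy = pd.DataFrame({
    'id': [1, 2],
    'issue_id': [10, 11],
    'type_id': [19, 19],
    'title': [np.nan, 'Example title'],
})

# Row 0 has only 3 non-null values, so it is dropped;
# row 1 has 4 non-null values and is kept
kept = toy.dropna(thresh=4)
print(kept['id'].tolist())
```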

## Notes table

This part extracts the notes from each table containing a `notes` (or `reprint_notes`) attribute. Notes are loaded into a new table and replaced by foreign keys in the original tables.


We first concatenate the notes from all the dataframes and create the new notes dataframe:

In [146]:
notes = pd.Series(dtype=object)

# Get all the notes from all the dataframes containing notes
for _, df in dataframes.items():
    if 'notes' in df.columns:
        # pd.concat replaces Series.append, which was removed in pandas 2.0
        notes = pd.concat([notes, df['notes'].dropna()], ignore_index=True)

    if 'reprint_notes' in df.columns:
        notes = pd.concat([notes, df['reprint_notes'].dropna()], ignore_index=True)

notes_df = utils.extract_table(notes, 'notes')
dataframes['notes'] = notes_df
notes_df.head()


Unnamed: 0,id,notes
1,1,Used for the MLJ superheroes that DC licensed ...
2,2,An imprint of HarperCollins Publishers; URL li...
3,3,Lancé en 2008 le label Fusion Comics était com...
4,4,"a highly-stylized ""A"" with the crossbar formed..."
5,5,Letter B is larger than rest of text.
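
`utils.extract_table` is not shown in this notebook. Judging from the output above, it plausibly removes duplicates and assigns consecutive IDs starting at 1. A hypothetical stand-in (the name `extract_table_sketch` and its details are assumptions, not the actual helper) could look like this:

```python
import pandas as pd


def extract_table_sketch(values, column):
    """Hypothetical stand-in for utils.extract_table (assumed behaviour)."""
    # Keep each distinct non-null value once, in order of first appearance
    unique = pd.Series(values).dropna().drop_duplicates().reset_index(drop=True)
    # Assign consecutive IDs starting at 1
    return pd.DataFrame({
        'id': list(range(1, len(unique) + 1)),
        column: unique.tolist(),
    })


demo_notes = extract_table_sketch(pd.Series(['a', 'b', 'a', None, 'c']), 'notes')
print(demo_notes)
```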


Replace the notes by the IDs in the original tables:

In [147]:
for name, df in dataframes.items():
    # Skip the notes dataframe obviously
    if name == 'notes':
        continue
    
    if 'notes' in df.columns:
        # Map notes to their IDs
        df['notes_id'] = utils.map_column(df['notes'], dataframes['notes'], 'id', 'notes')
        df.drop('notes', axis=1, inplace=True)
        
    if 'reprint_notes' in df.columns:
        df['reprint_notes_id'] = utils.map_column(df['reprint_notes'], dataframes['notes'], 'id', 'notes')
        df.drop('reprint_notes', axis=1, inplace=True)
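
`utils.map_column` is not shown either. From the way it is called, it presumably looks up each value in the `notes` column of the new table and returns the matching `id`. A minimal sketch under that assumption (`map_column_sketch` is a hypothetical name):

```python
import pandas as pd


def map_column_sketch(column, table, id_col, value_col):
    """Hypothetical stand-in for utils.map_column (assumed behaviour)."""
    # Build a value -> id lookup from the extracted table
    lookup = dict(zip(table[value_col], table[id_col]))
    # Values without a match (including nulls) become NaN
    return column.map(lookup)


table = pd.DataFrame({'id': [1, 2], 'notes': ['a', 'b']})
mapped = map_column_sketch(pd.Series(['b', None, 'a']), table, 'id', 'notes')
print(mapped)
```

Because unmatched values become `NaN`, the resulting ID column is stored as floats, which would explain the `float64` dtype of the outputs in this section.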

In [148]:
dataframes['story']['notes_id'].head()

0        NaN
1    86501.0
2    86502.0
3    86502.0
4        NaN
Name: notes_id, dtype: float64

In [149]:
dataframes['story']['reprint_notes_id'].head(10)

0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
5    377020.0
6         NaN
7    377021.0
8    377021.0
9         NaN
Name: reprint_notes_id, dtype: float64

## First/Last issue Relation

We noticed a cyclic dependency between the tables _Issues_ and _Series_: issues belong to a series, and series have a first and a last issue. Such cyclic references between tables are generally a bad idea and are hard to handle in practice, so we create a new relation *First_last_issue* that links each series to its first and last issue, and we remove the references to _Issues_ from _Series_.

In [150]:
# Extract relation
first_last_issue = dataframes['series'][['id', 'first_issue_id', 'last_issue_id']]

# Rename the columns
first_last_issue.columns = ['serie_id', 'first_issue_id', 'last_issue_id']

# Remove rows if first_issue_id and last_issue_id are both NULL
first_last_issue = first_last_issue.dropna(subset=['first_issue_id', 'last_issue_id'], how='all')

# Save the new relation
dataframes['first_last_issue'] = first_last_issue

first_last_issue.head()

Unnamed: 0,serie_id,first_issue_id,last_issue_id
0,1,1.0,1.0
1,2,2.0,2.0
2,3,3.0,3.0
3,4,6.0,6.0
4,5,4.0,4.0


We can now drop the *first_issue_id* and *last_issue_id* columns from _Series_:

In [151]:
dataframes['series'] = dataframes['series'].drop(['first_issue_id', 'last_issue_id'], axis=1)

## Artists table

We first scan through all the different categories of artists, clean the data, and then store all the artists in a single table, as described in our ER diagram.

In [152]:
# Series to store the list of all artists
all_artists = pd.Series(dtype=object)
# Dictionary to store all artists of one category
artists = {}
categories = ['script', 'pencils', 'inks', 'colors', 'letters']

for category in categories:
    # Unpack the artists lists so we have all artists for every story
    unpacked = utils.unpack_column(dataframes['story'], 'id', category)
    
    # Clean the unpacked elements 
    unpacked[category] = utils.clean_column(unpacked[category])
    
    # We have now our relation with story IDs and artists names
    artists[category] = unpacked.dropna(how='any')
    
    # Add artists to the global artist list
    all_artists = pd.concat([all_artists, artists[category][category]], ignore_index=True)
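
`utils.unpack_column` is not shown in this notebook. A hypothetical sketch of the unpacking step, assuming that multiple values in a cell are separated by `;` (the separator, the function name and the toy data are all assumptions) and that a pandas version providing `DataFrame.explode` is available:

```python
import pandas as pd


def unpack_column_sketch(df, id_col, list_col, sep=';'):
    """Hypothetical stand-in for utils.unpack_column (assumed behaviour)."""
    out = df[[id_col, list_col]].copy()
    # Split each cell into a list of values; null cells stay null
    out[list_col] = out[list_col].str.split(sep)
    # Produce one row per (id, value) pair
    out = out.explode(list_col)
    # Remove the whitespace left around each value
    out[list_col] = out[list_col].str.strip()
    return out.reset_index(drop=True)


toy_story = pd.DataFrame({'id': [1, 2, 3], 'script': ['A; B', 'C', None]})
unpacked_demo = unpack_column_sketch(toy_story, 'id', 'script')
print(unpacked_demo)
```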

We can now extract our new artists table from the whole list of artists:

In [153]:
dataframes['artists'] = utils.extract_table(all_artists, 'name')
dataframes['artists'].head()

Unnamed: 0,id,name
1,1,Gustave Doré
2,2,Harry Rogers
3,3,Wilhelm Busch
4,4,The Donaldson Brothers
5,5,Richard Doyle


Now, for each artist relation, we map the names to the IDs:

In [154]:
for category in categories:
    # Work on a copy to avoid pandas' SettingWithCopyWarning
    relation = artists[category].copy()
    relation.columns = ['story_id', 'artist_id']
    relation['artist_id'] = utils.map_column(relation['artist_id'], dataframes['artists'], 'id', 'name')
    dataframes[category] = relation


In [155]:
dataframes['script'].head()

Unnamed: 0,story_id,artist_id
7,13.0,1
8,14.0,1
9,15.0,1
10,16.0,1
11,17.0,1


We can now drop the different artists columns from the original Story dataframe:

In [156]:
dataframes['story'] = dataframes['story'].drop(categories, axis=1)

## Characters table

We are now interested in extracting the characters and building the corresponding relations between _Stories_ and _Characters_. We consider the _characters_ and _feature_ attributes of _Stories_ to be characters, but we build different relationships to keep the original meaning.

Note that some cells contain multiple values, so we need to unpack them, as we did for _Artists_.

In [157]:
all_characters = pd.Series(dtype=object)
char_types = ['feature', 'characters']
char_relations = {}

for c_type in char_types:
    # Extract relation and unpack the lists
    unpacked = utils.unpack_column(dataframes['story'][['id', c_type]], 'id', c_type)
    
    # Clean values
    unpacked[c_type] = utils.clean_column(unpacked[c_type])
    unpacked = unpacked.dropna(how='any')
    
    # We got our clean and unpacked relation for each type
    char_relations[c_type] = unpacked
    
    # Accumulate characters
    all_characters = pd.concat([all_characters, unpacked[c_type]], ignore_index=True)

In [158]:
char_relations['characters'].head()

Unnamed: 0,id,characters
39,45.0,John Mishler
1875358,60.0,Maurice
54,60.0,Max
55,61.0,Max
1875359,61.0,Maurice


In [159]:
# Build the Characters table
dataframes['characters'] = utils.extract_table(all_characters, 'name')
dataframes['characters'].head()

Unnamed: 0,id,name
1,1,Rawhide Kid
2,2,Max and Maurice
3,3,Brown Jones and Robinson
4,4,Plish and Plum
5,5,Daral


We can now replace values by IDs in the relations:

In [160]:
for c_type in char_types:
    relation = char_relations[c_type]
    relation.columns = ['story_id', 'character_id']
    relation['character_id'] = utils.map_column(relation['character_id'], dataframes['characters'], 'id', 'name')
    
    if c_type == 'feature':
        name = 'stories_features'
    elif c_type == 'characters':
        name = 'stories_characters'
        
    dataframes[name] = relation

In [161]:
dataframes['stories_characters'].head()

Unnamed: 0,story_id,character_id
39,45.0,85178
1875358,60.0,85179
54,60.0,34106
55,61.0,34106
1875359,61.0,85179


In [162]:
dataframes['stories_features'].head()

Unnamed: 0,story_id,character_id
0,6.0,1
1,7.0,1
2,8.0,1
3,9.0,1
54,60.0,2


We can now delete the *feature* and *characters* columns from the story dataframe:

In [None]:
dataframes['story'] = dataframes['story'].drop(char_types, axis=1)

## Editors table

We want to create a separate table for all the editors of both the _Stories_ and the _Issues_. We will, however, create two relations, one for each table. There can be multiple editors per item, so we need to unpack the columns.

In [None]:
all_editors = pd.Series(dtype=object)
dfs = ['story','issue']
editors_relations = {}

for df in dfs:
    # Extract relation and unpack the lists
    unpacked = utils.unpack_column(dataframes[df][['id', 'editing']], 'id', 'editing')
    
    # Clean values
    unpacked['editing'] = utils.clean_column(unpacked['editing'])
    unpacked = unpacked.dropna(how='any')
    
    # We got our clean and unpacked relation for each type
    editors_relations[df] = unpacked
    
    # Accumulate editors
    all_editors = pd.concat([all_editors, unpacked['editing']], ignore_index=True)

In [None]:
editors_relations['story'].head()

In [None]:
# Build the Editors table
dataframes['editors'] = utils.extract_table(all_editors, 'name')
dataframes['editors'].head()

We now map the names in the editors relations to the IDs of the editors table:

In [None]:
for df in dfs:
    relation = editors_relations[df]
    relation.columns = [df+'_id', 'editor_id']
    relation['editor_id'] = utils.map_column(relation['editor_id'], dataframes['editors'], 'id', 'name')
    
    if df == 'story':
        name = 'stories_editing'
    elif df == 'issue':
        name = 'issues_editing'
        
    dataframes[name] = relation

In [None]:
dataframes['stories_editing'].head()

In [None]:
dataframes['issues_editing'].head()

We can now delete the *editing* column from both the issue and story dataframes:

In [None]:
dataframes['story'] = dataframes['story'].drop('editing', axis=1)
dataframes['issue'] = dataframes['issue'].drop('editing', axis=1)

## Individual files cleaning

This part aims to clean each `.csv` file individually in order to remove dirty rows and clear values that need some special treatment.

### Country

By browsing the country data, we see that one row, with ID 248, is not valid. The cell below shows that no row of `publisher`, for example, references this ID. It is most likely dirty data, so we can safely remove it.

In [None]:
pub = dataframes['publisher']
print('Number of publisher with country_id 248: {}.'.format(len(pub[pub['country_id'] == 248])))

# Look for NaN values
print('NaN values: ')
df = dataframes['country']
df.isnull().sum()

In [None]:
# Remove the desired row
dataframes['country'] = df[df['id'] != 248]
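
The check above only looks at `publisher`. A minimal sketch of how the same check could be run over every table that carries the foreign key, using made-up toy tables (the helper name `count_references` and the toy data are assumptions):

```python
import pandas as pd


def count_references(tables, fk_col, key):
    """Count, per table, the rows whose foreign key equals the given key."""
    return {
        name: int((df[fk_col] == key).sum())
        for name, df in tables.items()
        if fk_col in df.columns
    }


# Toy tables standing in for the real dataframes
toy_tables = {
    'publisher': pd.DataFrame({'id': [1, 2], 'country_id': [225, 75]}),
    'series': pd.DataFrame({'id': [1], 'country_id': [225]}),
    'language': pd.DataFrame({'id': [1], 'code': ['en']}),
}

refs = count_references(toy_tables, 'country_id', 248)
print(refs)
```

A row is safe to delete only if every count is zero.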

### Story Reprint

The story reprint table needs to be full, as we don't accept _NULL_ foreign keys in this case. We see in the cell below that there are no empty cells in the table.

In [None]:
dataframes['story_reprint'].isnull().sum()

### Story Type

By looking at the story types we see that the third row is problematic:

In [None]:
df = dataframes['story_type']
df.iloc[2]

We check if any story contains a reference to this row:

In [None]:
stories = dataframes['story']
print('Number of stories referencing ID 3: {}.'.format(len(stories[stories['type_id'] == 3])))

We can safely remove it:

In [None]:
dataframes['story_type'] = df[df['id'] != 3]

### Language

Looking at the language file, all the rows are clean and it is safe to keep them as they are.

In [None]:
dataframes['language'].isnull().sum()

### Brand group

In [None]:
dataframes['brand_group'].isnull().sum()

As we can see, the essential attributes don't have missing values.

In [None]:
dataframes['brand_group']['name'].value_counts().head()

However, we see that there are quite a lot of duplicates in the names. As the cell below shows for 'Marvel', rows sharing the same name have different *publisher_id*s, so it makes sense to keep these duplicates.

In [None]:
dataframes['brand_group'][dataframes['brand_group']['name'] == 'Marvel']['publisher_id'].values

### Series Publication types

This table needs no cleaning:

In [None]:
dataframes['series_publication_type'].head()

### Issue Reprint

We make sure there are no null values in the reprint table:

In [None]:
dataframes['issue_reprint'].isnull().sum()

### Indicia Publisher

For this table we need to make sure the *publisher_id* attribute is not null, which is the case:

In [None]:
dataframes['indicia_publisher'].isnull().sum()

### Publisher

We need to make sure that every publisher has a name, which is the case:

In [None]:
dataframes['publisher'].isnull().sum()

### Stories

We already did some cleaning at the beginning, removing the stories that contained no information.

### Issues

In [None]:
# TODO

### Series

In [None]:
# TODO

## Saving files

We can now save our clean and new tables, ready for database loading.

In [None]:
OUTPUT_PATH = os.path.join('..', 'data', 'clean')

#for name, df in dataframes.items():
#    df.to_csv(os.path.join(OUTPUT_PATH, name.title() + '.csv'), index=False, float_format='%.0f')


We also collect the minimum, maximum and average length of the string attributes for each column of each table, to help us choose the right string lengths for the database:

In [None]:
lengths = {}
for name, df in dataframes.items():
    tmp = {}
    for col in df.columns:
        col_type = df[col].dtype

        if (col_type == np.dtype('O') and type(df[col].dropna().iloc[0]) == str) or col_type == np.dtype(str):
            strs = df[col].dropna().str.len()
            tmp[col] = {'min': int(min(strs)),
                        'max': int(max(strs)),
                        'ave': int(sum(strs) / len(strs))}
    if len(tmp) > 0:        
        lengths[name] = tmp

In [None]:
with open(os.path.join(OUTPUT_PATH, 'lengths.json'), 'w') as file:
    json.dump([lengths], file)
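
One possible way to turn these statistics into column sizes is to add some headroom to the observed maximum. The sketch below is only an illustration: the 25% headroom, the helper name `suggest_sizes` and the toy numbers are assumptions, not choices made in this project.

```python
import math


def suggest_sizes(lengths, headroom=1.25):
    """Suggest a string size per column from the observed maximum length."""
    return {
        table: {col: math.ceil(stats['max'] * headroom) for col, stats in cols.items()}
        for table, cols in lengths.items()
    }


# Toy statistics in the same shape as the `lengths` dictionary
toy_lengths = {'publisher': {'name': {'min': 2, 'max': 40, 'ave': 15}}}

sizes = suggest_sizes(toy_lengths)
print(sizes)
```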