# Data Cleaning

This notebook presents the whole data cleaning process: extracting new tables and relations, and removing dirty tuples and values from the existing files.

The cleaned data files are saved in `.csv` format and are later used to load the database.

In [33]:
# Import packages
import pandas as pd
import os
import numpy as np
import csv
import json
import utils

## Data loading

We first import all the `.csv` files into `pandas` DataFrames.

_Note_: some lines are ill-formed; we choose to ignore them.

In [34]:
# Root of the data files
PATH = os.path.join('..', 'data', 'original')

# Dic: name -> dataframe
dataframes = {}

# Get all the original files
for file in os.listdir(PATH):
    # Skip hidden files
    if (file.startswith('.')):
        continue
        
    name = file.split('.')[0]
    # Note: some lines are ill-formed, we ignore them
    dataframes[name] = pd.read_csv(os.path.join(PATH, file), encoding='utf-8', quoting=csv.QUOTE_NONE)
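
As a minimal sketch of how ill-formed lines can be skipped explicitly, the example below parses a small in-memory sample. It assumes pandas 1.3 or later, where `read_csv` accepts `on_bad_lines='skip'`; the sample text is hypothetical and not taken from the project files.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the third line has one field too many
raw = "id,name\n1,first\n2,second,extra\n3,third\n"

# on_bad_lines='skip' (pandas >= 1.3) drops lines with too many fields
sample = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE, on_bad_lines='skip')
print(sample)
```

Only lines with too many fields count as bad here; lines with too few fields are padded with `NaN` instead.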



## Notes table

This part extracts the notes from each table containing a `notes` attribute. Notes are loaded into a new table and replaced by foreign keys in the original tables.

We first concatenate the notes from all the dataframes and create the new notes dataframe:

In [35]:
# Collect the notes from all the dataframes containing notes
note_columns = []
for _, df in dataframes.items():
    for column in ('notes', 'reprint_notes'):
        if column in df.columns:
            note_columns.append(df[column].dropna())

# Series.append was removed in pandas 2.0; pd.concat replaces it
notes = pd.concat(note_columns, ignore_index=True)

notes_df = utils.extract_table(notes, 'notes')
dataframes['notes'] = notes_df
notes_df.head()


| | id | notes |
|---|---|---|
| 1 | 1 | Entire book available from gutenberg.org at ht... |
| 2 | 2 | Auction at cqout.com in August 2007 states "De... |
| 3 | 3 | Theatrical giveaway depicting a small portion ... |
| 4 | 4 | Title from title page. |
| 5 | 5 | Best-known Wilhelm Busch book; The Katzenjamme... |
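
The implementation of `utils.extract_table` is not shown in this notebook. As a rough sketch, assuming it keeps one row per distinct value and numbers the rows from 1, it could look like the hypothetical `extract_table_sketch` below:

```python
import pandas as pd

def extract_table_sketch(values: pd.Series, column: str) -> pd.DataFrame:
    """Hypothetical stand-in for utils.extract_table: one row per distinct value."""
    unique = values.dropna().drop_duplicates().to_numpy()
    # Assumption: IDs start at 1 and double as the row index
    ids = list(range(1, len(unique) + 1))
    return pd.DataFrame({'id': ids, column: unique}, index=ids)

sample_notes = pd.Series(['Title from title page.', 'Reprint', 'Title from title page.'])
sample_table = extract_table_sketch(sample_notes, 'notes')
print(sample_table)
```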


We then replace the notes with their IDs in the original tables:

In [36]:
for name, df in dataframes.items():
    # Skip the notes dataframe obviously
    if name == 'notes':
        continue
    
    if 'notes' in df.columns:
        # Map notes to their IDs
        df['notes_id'] = utils.map_column(df['notes'], dataframes['notes'], 'id', 'notes')
        df.drop('notes', axis=1, inplace=True)
        
    if 'reprint_notes' in df.columns:
        df['reprint_notes_id'] = utils.map_column(df['reprint_notes'], dataframes['notes'], 'id', 'notes')
        df.drop('reprint_notes', axis=1, inplace=True)
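
`utils.map_column` is not shown either. A minimal sketch, assuming it looks each value up in the target table and returns the matching ID, is the hypothetical `map_column_sketch` below. Rows without a note become `NaN`, which is why the resulting ID columns have a `float64` dtype.

```python
import pandas as pd

def map_column_sketch(column, table, id_col, value_col):
    """Hypothetical stand-in for utils.map_column: replace values by their IDs."""
    lookup = table.drop_duplicates(subset=value_col).set_index(value_col)[id_col]
    # Values missing from the lookup (including NaN) become NaN
    return column.map(lookup)

notes_table = pd.DataFrame({'id': [1, 2],
                            'notes': ['Title from title page.', 'Reprint']})
story_notes = pd.Series(['Reprint', None, 'Title from title page.'])
notes_ids = map_column_sketch(story_notes, notes_table, 'id', 'notes')
print(notes_ids)
```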

In [37]:
dataframes['story']['notes_id'].head()

0        NaN
1    86501.0
2    86502.0
3    86502.0
4        NaN
Name: notes_id, dtype: float64

In [39]:
dataframes['story']['reprint_notes_id'].head(10)

0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
5    377020.0
6         NaN
7    377021.0
8    377021.0
9         NaN
Name: reprint_notes_id, dtype: float64

## First/Last issue Relation

We noticed a cyclic dependency between the tables _Issues_ and _Series_: issues belong to a series, and each series has a first and a last issue. Such cyclic foreign keys are generally a bad idea and awkward in practice, because neither table can be loaded first without referencing rows that do not exist yet. We therefore create a new relation *First_last_issue* that links each series with its first and last issue, and remove the references to _Issues_ from _Series_.

In [6]:
# Extract relation
first_last_issue = dataframes['series'][['id', 'first_issue_id', 'last_issue_id']]

# Rename the columns
first_last_issue.columns = ['serie_id', 'first_issue_id', 'last_issue_id']

# Remove rows if first_issue_id and last_issue_id are both NULL
first_last_issue = first_last_issue.dropna(subset=['first_issue_id', 'last_issue_id'], how='all')

# Save the new relation
dataframes['first_last_issue'] = first_last_issue

first_last_issue.head()

| | serie_id | first_issue_id | last_issue_id |
|---|---|---|---|
| 0 | 1 | 1.0 | 1.0 |
| 1 | 2 | 2.0 | 2.0 |
| 2 | 3 | 3.0 | 3.0 |
| 3 | 4 | 6.0 | 6.0 |
| 4 | 5 | 4.0 | 4.0 |


We can now drop the *first_issue_id* and *last_issue_id* columns of _Series_:

In [7]:
dataframes['series'] = dataframes['series'].drop(['first_issue_id', 'last_issue_id'], axis=1)

## Artists table

We first scan through all the different categories of artists, clean the data, and then store all artists in a single table, as described in our ER diagram.

In [8]:
# Per-category artist names, concatenated into one list after the loop
artist_names = []
# Dictionary to store all artists of one category
artists = {}
categories = ['script', 'pencils', 'inks', 'colors', 'letters']

for category in categories:
    # Unpack the artists lists so we have all artists for every story
    unpacked = utils.unpack_column(dataframes['story'], 'id', category)
    
    # Clean the unpacked elements 
    unpacked[category] = utils.clean_column(unpacked[category])
    
    # We have now our relation with story IDs and artists names
    artists[category] = unpacked.dropna(how='any')
    
    # Add artists to the global artist list
    artist_names.append(artists[category][category])

# Series.append was removed in pandas 2.0; pd.concat replaces it
all_artists = pd.concat(artist_names, ignore_index=True)
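
The helper `utils.unpack_column` is not shown in this notebook. As a sketch only, assuming the artist names of a story are stored in one string separated by `;` (an assumption, not confirmed by the notebook), the hypothetical `unpack_column_sketch` below produces one row per story and artist:

```python
import pandas as pd

def unpack_column_sketch(df, id_col, list_col, sep=';'):
    """Hypothetical stand-in for utils.unpack_column: one row per (id, element)."""
    pairs = df[[id_col, list_col]].copy()
    # Split each string into a list, then give every element its own row
    pairs[list_col] = pairs[list_col].str.split(sep)
    pairs = pairs.explode(list_col)
    # Remove the whitespace left around the separator
    pairs[list_col] = pairs[list_col].str.strip()
    return pairs.reset_index(drop=True)

stories_sample = pd.DataFrame({'id': [13, 14],
                               'script': ['Gustave Doré; Harry Rogers', None]})
unpacked_sample = unpack_column_sketch(stories_sample, 'id', 'script')
print(unpacked_sample)
```

Stories without any artist keep a single row with a missing value, which a later `dropna` removes.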

We can now extract our new artists table from the whole list of artists:

In [9]:
dataframes['artists'] = utils.extract_table(all_artists, 'name')
dataframes['artists'].head()

| | id | name |
|---|---|---|
| 1 | 1 | Gustave Doré |
| 2 | 2 | Harry Rogers |
| 3 | 3 | Wilhelm Busch |
| 4 | 4 | The Donaldson Brothers |
| 5 | 5 | Richard Doyle |


Now, for each artist relation, we map the names to the IDs:

In [10]:
for category in categories:
    # Work on an explicit copy to avoid pandas' SettingWithCopyWarning
    artists[category] = artists[category].copy()
    artists[category].columns = ['story_id', 'artist_id']
    artists[category]['artist_id'] = utils.map_column(artists[category]['artist_id'], dataframes['artists'], 'id', 'name')
    dataframes[category] = artists[category]


In [11]:
dataframes['script'].head()

| | story_id | artist_id |
|---|---|---|
| 7 | 13 | 1 |
| 8 | 14 | 1 |
| 9 | 15 | 1 |
| 10 | 16 | 1 |
| 11 | 17 | 1 |


We can now drop the different artists columns from the original Story dataframe:

In [12]:
dataframes['story'] = dataframes['story'].drop(categories, axis=1)

## Individual files cleaning

This part aims to clean each `.csv` file individually in order to remove dirty rows and clear values that need some special treatment.

### Country

By browsing the country data, we see that one row, with ID 248, is not valid. The cell below shows that no row of `publisher`, for example, references this ID. It is most probably dirty data, so we can safely remove it.

In [13]:
pub = dataframes['publisher']
print('Number of publisher with country_id 248: {}.'.format(len(pub[pub['country_id'] == 248])))

# Look for NaN values
print('NaN values: ')
df = dataframes['country']
df.isnull().sum()

Number of publisher with country_id 248: 0.
NaN values: 


id      0
code    0
name    0
dtype: int64

In [14]:
# Remove the desired row
dataframes['country'] = df[df['id'] != 248]
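
The check above only looks at `publisher`. A minimal sketch of a broader check, using a hypothetical helper `count_references` and toy dataframes, counts the references in every table that has the foreign key column:

```python
import pandas as pd

def count_references(frames, column, value):
    """Count rows equal to `value` in every dataframe that has `column`."""
    return {name: int((df[column] == value).sum())
            for name, df in frames.items()
            if column in df.columns}

# Toy data for illustration only, not the real tables
toy_frames = {
    'publisher': pd.DataFrame({'id': [1, 2], 'country_id': [225, 75]}),
    'indicia_publisher': pd.DataFrame({'id': [1], 'country_id': [225]}),
    'language': pd.DataFrame({'id': [1], 'code': ['en']}),
}
references = count_references(toy_frames, 'country_id', 248)
print(references)
```

A row can be dropped safely only if every count is zero.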

### Story Reprint

The story reprint table must be complete, since we do not accept _NULL_ foreign keys here. The cell below shows that the table contains no empty cells.

In [15]:
dataframes['story_reprint'].isnull().sum()

id           0
origin_id    0
target_id    0
dtype: int64

### Story Type

By looking at the story types we see that the third row is problematic:

In [16]:
df = dataframes['story_type']
# .ix was removed in pandas 1.0; .iloc selects the third row by position
df.iloc[2]

id                                             3
name    (backcovers) *do not use* / *please fix*
Name: 2, dtype: object

We check if any story contains a reference to this row:

In [17]:
stories = dataframes['story']
print('Number of stories referencing ID 3: {}.'.format(len(stories[stories['type_id'] == 3])))

Number of stories referencing ID 3: 0.


We can safely remove it:

In [18]:
dataframes['story_type'] = df[df['id'] != 3]

### Language

Looking at the language file, all the rows are clean and it is safe to keep them as they are.

In [19]:
dataframes['language'].isnull().sum()

id      0
code    0
name    0
dtype: int64

### Brand group

In [20]:
dataframes['brand_group'].isnull().sum()

id                 0
name               0
year_began      2938
year_ended      3857
notes_id        4615
url             4701
publisher_id       0
dtype: int64

As we can see, the essential attributes don't have missing values.

In [21]:
dataframes['brand_group']['name'].value_counts().head()

Marvel                  17
DC                      10
Dargaud                  8
A                        7
Classics Illustrated     6
Name: name, dtype: int64

However, many names are duplicated. The cell below shows that rows sharing the same name have different *publisher_id*s, so it makes sense to keep these duplicates.

In [22]:
dataframes['brand_group'][dataframes['brand_group']['name'] == 'Marvel']['publisher_id'].values

array([2105,  613, 3434,   78, 4720, 3174, 4437,  592, 3029, 8492, 7151,
       2195, 5905, 1798, 1977, 3655, 6917])
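
To check this for every name at once rather than for one name, we could count duplicates on the pair (*name*, *publisher_id*). The sketch below uses toy data with made-up IDs, including one true duplicate, to show the difference between the two counts:

```python
import pandas as pd

# Toy data for illustration only: the last row repeats the first one
brand_sample = pd.DataFrame({
    'name': ['Marvel', 'Marvel', 'DC', 'Marvel'],
    'publisher_id': [78, 613, 54, 78],
})

# Rows whose name appears more than once
name_dupes = int(brand_sample.duplicated(subset=['name'], keep=False).sum())
# Rows whose (name, publisher_id) pair appears more than once
pair_dupes = int(brand_sample.duplicated(subset=['name', 'publisher_id'], keep=False).sum())
print(name_dupes, pair_dupes)
```

If the second count is zero on the real table, every duplicated name belongs to a different publisher.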

### Series Publication types
This table contains only three publication types and needs no cleaning.

In [23]:
dataframes['series_publication_type'].head()

| | id | name |
|---|---|---|
| 0 | 1 | book |
| 1 | 2 | magazine |
| 2 | 3 | album |


### Issue Reprint

We make sure there are no null values in the reprint table:

In [24]:
dataframes['issue_reprint'].isnull().sum()

id                 0
origin_issue_id    0
target_issue_id    0
dtype: int64

### Indicia Publisher

For this table we need to make sure the *publisher_id* attribute is not null, which is the case:

In [25]:
dataframes['indicia_publisher'].isnull().sum()

id                 0
name               0
publisher_id       0
country_id         0
year_began      2612
year_ended      3563
is_surrogate       0
notes_id        4282
url             4711
dtype: int64

### Publisher

In [26]:
# TODO

### Stories

In [27]:
# TODO

### Issues

In [28]:
# TODO

### Series

In [29]:
# TODO

## Saving files

We can now save our clean and new tables, ready for database loading.

In [30]:
OUTPUT_PATH = os.path.join('..', 'data', 'clean')

#for name, df in dataframes.items():
#    df.to_csv(name.title() + '.csv', index=False, float_format='%.0f')
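
The saving loop is commented out above and does not use `OUTPUT_PATH`. As a sketch of one way to write the files into an output directory, the hypothetical helper `save_tables` below creates the directory if needed; the example writes a toy dataframe to a temporary directory rather than to the project folder.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def save_tables(frames, output_dir):
    """Write each dataframe to <output_dir>/<Name>.csv."""
    os.makedirs(output_dir, exist_ok=True)
    for name, df in frames.items():
        path = os.path.join(output_dir, name.title() + '.csv')
        # '%.0f' writes float ID columns without a decimal part
        df.to_csv(path, index=False, float_format='%.0f')

# Toy data for illustration only
toy_tables = {'story': pd.DataFrame({'id': [1, 2], 'notes_id': [86501.0, np.nan]})}
out_dir = os.path.join(tempfile.mkdtemp(), 'clean')
save_tables(toy_tables, out_dir)

with open(os.path.join(out_dir, 'Story.csv')) as f:
    saved_text = f.read()
print(saved_text)
```

Missing IDs are written as empty fields, which can then be loaded as _NULL_ values.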


We also collect the minimum, maximum and average length of the string attributes in each column of each table, to help us choose the right string lengths for the database:

In [31]:
lengths = {}
for name, df in dataframes.items():
    tmp = {}
    for col in df.columns:
        col_type = df[col].dtype

        if (col_type == np.dtype('O') and type(df[col].dropna().iloc[0]) == str) or col_type == np.dtype(str):
            strs = df[col].dropna().str.len()
            tmp[col] = {'min': int(min(strs)),
                        'max': int(max(strs)),
                        'ave': int(sum(strs) / len(strs))}
    if len(tmp) > 0:        
        lengths[name] = tmp
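
The per-column computation can also be isolated in a small function. The hypothetical helper `string_length_stats` below is a sketch of the same statistics that additionally guards against columns with no values at all, where `iloc[0]` would raise an `IndexError`:

```python
import pandas as pd

def string_length_stats(column):
    """Return min/max/average string length, or None for non-string columns."""
    values = column.dropna()
    if values.empty or not isinstance(values.iloc[0], str):
        return None
    lengths = values.str.len()
    return {'min': int(lengths.min()),
            'max': int(lengths.max()),
            'ave': int(lengths.mean())}

stats = string_length_stats(pd.Series(['ab', 'abcd', None]))
empty_stats = string_length_stats(pd.Series([None, None], dtype=object))
print(stats, empty_stats)
```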

In [32]:
with open(os.path.join(OUTPUT_PATH, 'lengths.json'), 'w') as file:
    json.dump([lengths], file)