# Check & Validate Data

### - How much data do I have? Has all the data been entered, saved, and documented correctly? 
### - Is anything missing? Are there any mix-ups? Did any values come back null and get stored as `-1` instead?
- As I gathered the data, I repeatedly checked for any errors using the methods below. Remnants of fixes to missing data can be found in [`clean_json.py`](../01_data_collection/02_clean_json.py).

In [1]:
from glob import glob
import pandas as pd
import pickle
import json
import re

In [2]:
def get_id(index_url):
    return re.findall(r'(\d+)',str(index_url))[0]
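
As a quick sanity check on `get_id`, the sketch below applies it to a made-up URL. The `/index.php/book/...` path is an assumption for illustration; the real book URLs may look different. It shows that the helper returns the first run of digits as a string (hence the `int(...)` cast in the missing-books check) and that input without digits raises an `IndexError`.

```python
import re

def get_id(index_url):
    # Return the first run of digits found in the URL, as a string.
    return re.findall(r'(\d+)', str(index_url))[0]

# Hypothetical URL, for illustration only.
example_url = '/index.php/book/12345'
book_id = get_id(example_url)
print(book_id, type(book_id).__name__)

# A string without digits raises IndexError, because findall returns [].
try:
    get_id('/index.php/about')
except IndexError:
    print('no id found')
```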

In [3]:
with open('../../data/all_books_per_category_saved.json') as f:
    books_dict = json.load(f)

In [4]:
def load_books():
    """Load every per-book JSON file and cache the combined list as a pickle."""
    all_books = []
    for f_name in glob('../../data/*/*.json'):
        with open(f_name) as f:
            try:
                all_books.append(json.load(f))
            except ValueError:
                # Covers invalid JSON (json.JSONDecodeError) and bad encodings.
                print(f_name, " failed to load.")

    with open('../../assets/all_books.pkl', 'wb') as f:
        pickle.dump(all_books, f)

    return all_books

all_books = load_books()

## How many books in total? - `251`

In [5]:
len(all_books)

251

In [6]:
df = pd.DataFrame(all_books)

## Correct types? - `Yup`

In [7]:
df.dtypes

author_dd       int64
author_id       int64
author_name    object
book_id         int64
category_id     int64
pages           int64
text           object
dtype: object
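
Reading the `df.dtypes` output by eye works for seven columns, but the same check can be automated. The sketch below is one possible approach, not part of the original notebook: `dtype_mismatches` and the toy frame are hypothetical, and only the integer columns from the output above are checked.

```python
import pandas as pd

# Integer columns, copied from the df.dtypes output.
EXPECTED_INT_COLUMNS = ['author_dd', 'author_id', 'book_id', 'category_id', 'pages']

def dtype_mismatches(frame, columns, expected='int64'):
    # Return {column: actual dtype} for every column that is missing
    # or whose dtype is not `expected`.
    actual = frame.dtypes.astype(str).to_dict()
    return {col: actual.get(col) for col in columns if actual.get(col) != expected}

# Toy frame, invented for illustration: 'pages' was left as text.
toy = pd.DataFrame({
    'author_dd': [1111], 'author_id': [7], 'book_id': [100],
    'category_id': [134], 'pages': ['250'],
})
bad = dtype_mismatches(toy, EXPECTED_INT_COLUMNS)
print(sorted(bad))
```

On the real data, `dtype_mismatches(df, EXPECTED_INT_COLUMNS)` should return an empty dict if the types are as shown.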

## Any missing or corrupted data? - `Nope`

In [8]:
numeric_cols = ['pages', 'author_dd', 'author_id', 'category_id', 'book_id']
df.loc[(df[numeric_cols] == -1).any(axis=1)]

Unnamed: 0,author_dd,author_id,author_name,book_id,category_id,pages,text


In [9]:
df[df['author_dd'] == df['author_id']]

Unnamed: 0,author_dd,author_id,author_name,book_id,category_id,pages,text
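
The two checks above (a `-1` sentinel in any numeric column, and `author_dd` equal to `author_id`) can be folded into one helper. This is a minimal sketch on invented toy data; `find_suspect_rows` is a hypothetical name, and the real frame would be the `df` built from `all_books`.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'author_dd': [1111, -1, 1350],
    'author_id': [7, 8, 1350],
    'book_id': [100, 101, 102],
    'category_id': [134, 135, 136],
    'pages': [250, 300, -1],
})

def find_suspect_rows(frame, sentinel=-1):
    # Rows where any numeric column holds the sentinel value.
    numeric = frame.select_dtypes('number')
    has_sentinel = (numeric == sentinel).any(axis=1)
    # Rows where author_dd equals author_id (the mix-up check).
    mixed_up = frame['author_dd'] == frame['author_id']
    return frame[has_sentinel | mixed_up]

suspects = find_suspect_rows(toy)
print(suspects.index.tolist())
```

An empty result on the real data would match the two empty outputs shown above.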


## How many pages in total? - Nearly `500K`!

In [10]:
df['pages'].sum()

477048

## Any missing books? - `Nope`

In [11]:
known_ids = set(df['book_id'])  # build once for fast membership checks
for x in ['134','135','136','137']:
    print('\n#####\n#{}#\n#####\n'.format(x))
    for i, book in enumerate(books_dict['/index.php/category/'+x]):
        if int(get_id(book[1])) not in known_ids:
            print(x, ' - ', i, ' - ', get_id(book[1]))


#####
#134#
#####


#####
#135#
#####


#####
#136#
#####


#####
#137#
#####
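
The same comparison can also be expressed as a set difference that returns the gaps instead of printing them. The sketch below uses invented stand-ins for `books_dict` and the downloaded ids. It assumes each listing entry is a `[title, url]` pair, which is a guess beyond the fact that `book[1]` holds the URL, and `missing_by_category` is a hypothetical helper.

```python
import re

def get_id(index_url):
    # Return the first run of digits found in the URL, as a string.
    return re.findall(r'(\d+)', str(index_url))[0]

# Toy stand-ins, invented for illustration: each entry is [title, url].
books_dict = {
    '/index.php/category/134': [['Book A', '/index.php/book/1'],
                                ['Book B', '/index.php/book/2']],
    '/index.php/category/135': [['Book C', '/index.php/book/3']],
}
downloaded_ids = {1, 3}

def missing_by_category(listing, have_ids):
    # Map category id -> sorted list of listed book ids not in have_ids.
    missing = {}
    for category_url, books in listing.items():
        listed = {int(get_id(book[1])) for book in books}
        gap = sorted(listed - have_ids)
        if gap:
            missing[get_id(category_url)] = gap
    return missing

print(missing_by_category(books_dict, downloaded_ids))
```

With the real data, passing `set(df['book_id'])` as `have_ids` should give an empty dict when no books are missing.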

