# GoodSciFi - Data Cleaning

## TODO:

1. Get images of all sci-fi
2. Get lists of all "classic,greatest"
3. Use lists to "label" the associated posters/covers as "good"

## Data 
Data cleanup and overview of all data gathered

### Data Review
Some cleanup was already performed during web scraping, such as replacing spaces with dashes. That work should be double-checked here, along with null or None fields and characters that may cause issues downstream. Duplicate entries in the lists are one of the most important sources of downstream problems, so they are dealt with here as well.

1. Get list of dataset names (.json)
2. Sample "normal" list vs list with image references
3. Determine columns and what to drop
4. Copy all lists into dataframes with appropriate columns
 - Remove duplicates at the same time if possible
5. In each dataframe, remove duplicate rows based on a chosen subset of columns
6. Review null/None column entries, drop where appropriate
7. Once completed, save (pickle?) dataframes into processed folder
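
The dedupe and null-handling steps above can be sketched on a toy dataframe. This is only an illustration: the column names and rows are made up, and the real lists may need different rules.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one scraped list; columns and values are assumptions.
raw = pd.DataFrame({
    "title": ["Dune", "dune", "  ", "Solaris", None],
    "year": [1965, 1965, None, 1961, None],
})

# Treat whitespace-only strings as missing, then normalise case.
clean = raw.replace(r"^\s*$", np.nan, regex=True)
clean["title"] = clean["title"].str.lower()

# Drop rows without a title, then rows that repeat a title.
clean = clean.dropna(subset=["title"]).drop_duplicates(subset=["title"])
clean = clean.reset_index(drop=True)
print(clean)
```

Lower-casing before `drop_duplicates` matters here: "Dune" and "dune" would otherwise count as two different titles.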

In [2]:
%matplotlib inline
import os, json, re, math
import numpy as np
import pandas as pd
import pickle
from glob import glob
from PIL import Image
from matplotlib import pyplot as plt
from shutil import copyfile
import utils 

In [3]:
# Get current top level directory
%cd ..
current_dir = os.getcwd()

/home/jason/DeepLearning/github/goodscifi/development


In [15]:
# Set up data folder paths
data_dir = current_dir+'/data/'
dataset_dir = data_dir+'dataset/'
data_raw = current_dir+'/data/raw/'
data_processed = current_dir+'/data/processed/'

In [None]:
# Take a peek into all json files
[x.split('/')[-1] for x in glob(data_raw+'*.json')]

The list above shows all scraped files. The notes below describe the naming convention and the data found within each file category.

**Books**

Filenames containing "all_books" hold the base list of all science fiction books written (to the knowledge of the source). There are two such lists; we'll only use the "wwend_all_books.json" file, as the other references smaller image sizes.

The "top" lists of books use the name format "books_" followed by the source. For example, the file books_goodreads.json contains data on science fiction books from the website goodreads.com.

**Movies and TV Shows**

Movie and TV show data can be found together in the lists prefixed with "movies_". Some sources only cover one of the two: the website On DVD Releases only provides sci-fi movies on DVD, so the list movies_ondvd.json only contains movies. In contrast, the file "movies_imdb.json" contains data for both movies and TV shows.
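
As a rough illustration of the naming convention described above, the scraped files could be grouped by prefix. The helper name `group_by_prefix` is hypothetical and not part of the notebook or of `utils`.

```python
from collections import defaultdict

def group_by_prefix(filenames):
    """Group scraped JSON filenames by the naming convention described above."""
    groups = defaultdict(list)
    for name in filenames:
        if "all_books" in name:
            groups["all_books"].append(name)   # base list of all books
        elif name.startswith("books_"):
            groups["books"].append(name)       # "top" book lists
        elif name.startswith("movies_"):
            groups["movies"].append(name)      # movie and TV show lists
        else:
            groups["other"].append(name)
    return dict(groups)

example = ["wwend_all_books.json", "books_goodreads.json",
           "movies_imdb.json", "movies_ondvd.json"]
grouped = group_by_prefix(example)
print(grouped)
```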


**Source of Data**

A brief explanation of the data sources. In some cases the source is straightforward, but others, such as the data in books_goodreads.json, are a bit mysterious. The website Goodreads allows users to create lists, vote on and rank entries. I reviewed a number of science fiction lists and picked three of the most updated, voted on, commented on and "appropriate" (e.g. all-time vs. 90s) lists available.


**Base / All Images**

The main sources for the images (posters and covers) are The Movie Database and World Without End. Below is a snapshot of what we can find in the raw output paths.

In [None]:
# Now let's look at our base images (posters and covers)
# These will later be 'labelled' as good if found in the lists above
g = glob(data_raw+'/tmdb_all_movies/*.jpg')
shuf = np.random.permutation(g)
[x.split('/')[-1] for x in shuf[100:110]]

In [None]:
g = glob(data_raw+'/tmdb_all_tvshows/*.jpg')
shuf = np.random.permutation(g)
[x.split('/')[-1] for x in shuf[100:110]]

In [None]:
# Notice for books, we will need to stitch some data together in order to associate title with image
# Associated title info can be found in wwend_all_books.json
g = glob(data_raw+'/wwend_all_books/full/*.jpg')
shuf = np.random.permutation(g)
[x.split('/')[-1] for x in shuf[100:110]]

### Preprocessing

**Review of the Data**

In [None]:
# List of json files
list_of_lists = [x.split('/')[-1] for x in glob(data_raw+'*.json')]; list_of_lists

In [None]:
# Review a file - 'year' can be empty
df = pd.read_json(data_raw+list_of_lists[0]); df.tail()

In [None]:
# ISFDB (and TMDB) contains images to be referenced
df = pd.read_json(data_raw+list_of_lists[1]); df.tail()

In [None]:
# Here you can see 'images' is a list which contains the path to the image
df.loc[494, 'images']

In [None]:
# Load image from path
image_path = df.loc[494, 'images'][0]['path']
img = Image.open(data_raw+'books_isfdb/'+image_path)

# Some re-sizing
width, height = img.size
new_width  = 200
new_height = int(new_width * height / width)
size = new_width, new_height

# Display image
img.resize(size, Image.LANCZOS)  # ANTIALIAS was an alias of LANCZOS and was removed in Pillow 10

**Import the Lists of Movies, TV Shows and Books**

In [None]:
# Load list of books and movies/tvshows and concat into two dataframes (books and movies)
list_of_books = ['books_goodreads.json', 'books_isfdb.json', 'books_wwe.json']
list_of_movies = ['movies_ign.json', 'movies_imdb.json', 'movies_ondvd.json', 
               'movies_ranker.json', 'movies_rt.json', 'movies_sff.json']

lists_of_dfs = []
for book_list in list_of_books:
    df = pd.read_json(data_raw+book_list)
    lists_of_dfs.append(df)

df_books = pd.concat(lists_of_dfs, ignore_index=True)

lists_of_dfs = []
for movie_list in list_of_movies:
    df = pd.read_json(data_raw+movie_list)
    lists_of_dfs.append(df)
    
df_movies = pd.concat(lists_of_dfs, ignore_index=True)

In [None]:
df_books.head()

In [None]:
df_movies.tail()

Let's start cleaning up.

In [None]:
# Some fields are whitespace-only, 'None' or 'NaN' strings - drop cols and na
# 'year' only really needed for reboots with same name
cols_to_drop = ['image_urls', 'images', 'year']
books = df_books.drop(cols_to_drop, axis=1)
movies = df_movies.drop(cols_to_drop, axis=1)

# Anchored pattern: only cells that are entirely blank, 'None' or 'NaN' become NaN
# (an unanchored '\s+' or '' can also match ordinary titles that contain spaces)
empty_cell = r'^\s*(?:None|NaN)?\s*$'
books = books.replace(empty_cell, np.nan, regex=True).dropna(how='all')
movies = movies.replace(empty_cell, np.nan, regex=True).dropna(how='all')

books.loc[:,'title'] = books.title.str.lower()
movies.loc[:,'title'] = movies.title.str.lower()

# Count duplicate titles, then remove duplicate rows ('year' was dropped above)
print('Total number of duplicate book titles: {}'.format(sum(books.title.duplicated())))
print('Total number of duplicate movie+tv titles: {}'.format(sum(movies.title.duplicated())))

books.drop_duplicates(inplace=True)
movies.drop_duplicates(inplace=True)

In [None]:
movies.reset_index(drop=True, inplace=True)
books.reset_index(drop=True, inplace=True)
print('Total number of "good" books: {}'.format(books.shape[0]))
print('Total number of "good" movies+tv shows: {}'.format(movies.shape[0]))

In [None]:
movies.info()

In [None]:
books.info()

In [None]:
# Save with pandas' own pickling (no dump helper is defined in this notebook)
books.to_pickle(data_processed+'books.pickle')
movies.to_pickle(data_processed+'movies.pickle')

**Use ISFDB, IMDB and WWEND Title Info to Rename Image Files**

In [None]:
# Lists with image references - used to update image file names
list_of_lists = ['books_isfdb.json', 'wwend_all_books.json', 'movies_imdb.json']

for json_list in list_of_lists:
    df = pd.read_json(data_raw+json_list)
    folder_name = json_list.split('.')[0]
    # Make sure the destination folder exists before copying
    os.makedirs(data_processed+folder_name, exist_ok=True)
    # TODO: sanitize title before using as filename
    for i in range(df.shape[0]):
        src_path = data_raw+folder_name+'/'+df.loc[i,'images'][0]['path']
        new_filename = df.loc[i,'title'].replace('/','-').lower() +'.jpg'

        copyfile(src_path, data_processed+folder_name+'/'+new_filename)

In [None]:
# Copy files from tmdb tv and movies into processed

Gather all renamed images into two folders (movies and books). Files with duplicate names are kept, since most are different covers or posters for the same title.

In [None]:
# BOOKS - move all book covers into same folder keeping dups (some have diff covers, same book)
list_of_books = ['books_isfdb', 'wwend_all_books']
for book_list in list_of_books:
    g = glob(data_processed+'/'+book_list+'/*.jpg')
    for i in range(len(g)):
        file_name = g[i].split('/')[-1].split('.jpg')[0]+'_'+str(i)+'.jpg'
        copyfile(g[i], dataset_dir+'books/train/'+file_name)
    

# MOVIES/TV SHOWS - move posters into same folder keeping dups (some have diff posters, same show)
list_of_movies_shows = ['tmdb_all_tvshows', 'tmdb_all_movies', 'movies_imdb']
for movie_list in list_of_movies_shows:
    g = glob(data_processed+'/'+movie_list+'/*.jpg')
    for i in range(len(g)):
        file_name = g[i].split('/')[-1].split('.jpg')[0]+'_'+str(i)+'.jpg'
        copyfile(g[i], dataset_dir+'movies/train/'+file_name)

In [None]:
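# Clean all filenames
def clean_title(title):
    """Normalise a title so it can be used as a filename and as a merge key."""
    # Non-string values (None, NaN) cannot be cleaned
    if not isinstance(title, str): return None

    title = re.sub('[/:+=,]','-',title)
    # lstrip takes a set of characters, not a comma-separated list
    title = title.lstrip('@%+-#!,')
    title = title.strip()
    title = title.replace(' ', '-')
    return title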

In [None]:
g = glob(dataset_dir+'books/train/*.jpg')
for i in range(len(g)):
    file_name = clean_title(g[i].split('/')[-1]).lower()
    os.rename(g[i], dataset_dir+'books/train/'+file_name)

## Label Data

**Reload:** Movie and Book dataframes

In [31]:
b_df = pd.read_pickle(data_processed+'books.pickle')
m_df = pd.read_pickle(data_processed+'movies.pickle')

In [None]:
print(b_df.shape)
print(m_df.shape)

In [None]:
m_df['title'] = m_df['title'].apply(clean_title)
b_df['title'] = b_df['title'].apply(clean_title)

In [None]:
# Glob file paths, convert to dataframe and add title (filename) column
# Then merge dataframes with "good" list
g = glob(dataset_dir+'movies/train/*.jpg')

# Convert list to dataframe
file_paths = np.array(g)
col_names = ['path']

all_movies_df = pd.DataFrame(file_paths, columns=col_names)

In [None]:
# To match both df with 'title' need to clean up title_{year}_{tmdb_id}_index.jpg
all_movies_df['title'] = all_movies_df.loc[:,'path'].apply(lambda x: re.split('_', re.split('/', x)[-1])[0])
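
The title extraction used in the cell above can be written as a small standalone function. This is a sketch with a hypothetical helper name (`title_from_path`) and a made-up example path; it assumes, as the notebook does, that the title itself contains no underscore.

```python
import os

def title_from_path(path):
    """Return the part of the filename that precedes the first underscore."""
    filename = os.path.basename(path)        # e.g. 'the-thing_1982_1091_42.jpg'
    stem = os.path.splitext(filename)[0]     # drop the '.jpg' extension
    return stem.split("_")[0]

# Made-up path following the title_{year}_{tmdb_id}_index.jpg pattern
example = "/data/dataset/movies/train/the-thing_1982_1091_42.jpg"
extracted = title_from_path(example)
print(extracted)
```

A title that contains an underscore would be truncated at that point, so such titles would silently fail to match in the merge.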

In [None]:
all_movies_df.loc[1000:1010]

In [None]:
m_df.head()

In [None]:
all_movies_df.loc[all_movies_df.loc[:,'title'] == '2001--a-space-odyssey']

In [None]:
# merge m_df with all_movies_df keeping "good" list and adding path
goodscifi_movies_df = pd.merge(m_df, all_movies_df, on='title')
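
`pd.merge` defaults to an inner join, so "good" titles with no matching image are dropped without notice. A minimal sketch of how that could be checked, using toy dataframes with made-up titles and paths:

```python
import pandas as pd

# Toy stand-ins for the "good" list and the globbed image files.
good = pd.DataFrame({"title": ["the-thing", "solaris", "dune"]})
files = pd.DataFrame({
    "path": ["a/the-thing_0.jpg", "a/the-thing_1.jpg", "a/alien_2.jpg"],
    "title": ["the-thing", "the-thing", "alien"],
})

# Inner merge (the default): one row per matching (title, path) pair.
matched = pd.merge(good, files, on="title")

# Left merge with indicator=True flags "good" titles that found no image.
check = pd.merge(good, files, on="title", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "title"].tolist()

print(matched)
print(unmatched)
```

Note that a title with several image files produces several rows, which is why the merged dataframe can be larger than the number of distinct "good" titles it contains.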

In [None]:
goodscifi_movies_df.shape

In [None]:
goodscifi_movies_df.iloc[0:10]

In [None]:
# Just to make sure these are different files
goodscifi_movies_df.loc[goodscifi_movies_df['title'] == 'the-thing', 'path'].values

In [None]:
# Repeat but for books
g = glob(dataset_dir+'books/train/*.jpg')
file_paths = np.array(g)
col_names = ['path']

all_books_df = pd.DataFrame(file_paths, columns=col_names)

# To match both df with 'title' need to strip the trailing index from title_index.jpg
all_books_df['title'] = all_books_df.loc[:,'path'].apply(lambda x: re.split('_', re.split('/', x)[-1])[0])

In [None]:
all_books_df.iloc[995:1000]

In [None]:
# merge b_df with all_books_df keeping "good" list and adding path
goodscifi_books_df = pd.merge(b_df, all_books_df, on='title')

In [None]:
goodscifi_books_df.head()

In [None]:
goodscifi_movies_df.to_pickle(data_processed+'goodscifi_movies.pickle')
goodscifi_books_df.to_pickle(data_processed+'goodscifi_books.pickle')

**Reload:** Goodscifi Movie and Book dataframes

In [6]:
goodscifi_books_df = pd.read_pickle(data_processed+'goodscifi_books.pickle')
goodscifi_movies_df = pd.read_pickle(data_processed+'goodscifi_movies.pickle')

In [9]:
print(goodscifi_books_df.shape)
print(goodscifi_movies_df.shape)

(1114, 2)
(1084, 2)


In [17]:
# Books
for i in range(goodscifi_books_df.shape[0]):
    file_to_be_moved = goodscifi_books_df.iloc[i]['path']
    file_name = file_to_be_moved.split('/')[-1]
    os.rename(file_to_be_moved, dataset_dir+'books/train/good/'+file_name)

g = glob(dataset_dir+'/books/train/*.jpg')
for i in range(len(g)):
    file_name = g[i].split('/')[-1]
    os.rename(g[i], dataset_dir+'books/train/not_good/'+file_name)
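
The labelling step above moves matched files into `good/` and everything left over into `not_good/`. A self-contained sketch of the same idea on a temporary folder follows. The helper name `label_by_move` is hypothetical, and the `set()` call is an added safeguard, assuming the merged dataframe could list the same path more than once.

```python
import os
import shutil
import tempfile

def label_by_move(train_dir, good_paths):
    """Move good_paths into train_dir/good and remaining .jpg files into not_good."""
    good_dir = os.path.join(train_dir, "good")
    not_good_dir = os.path.join(train_dir, "not_good")
    os.makedirs(good_dir, exist_ok=True)
    os.makedirs(not_good_dir, exist_ok=True)

    # set() avoids moving the same file twice if a path is repeated
    for path in set(good_paths):
        shutil.move(path, os.path.join(good_dir, os.path.basename(path)))

    # Whatever .jpg files remain at the top level were not on any "good" list
    for name in os.listdir(train_dir):
        src = os.path.join(train_dir, name)
        if os.path.isfile(src) and name.endswith(".jpg"):
            shutil.move(src, os.path.join(not_good_dir, name))

# Demo on empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ["dune_0.jpg", "dune_1.jpg", "other_2.jpg"]:
        open(os.path.join(tmp, name), "w").close()
    good = [os.path.join(tmp, "dune_0.jpg"), os.path.join(tmp, "dune_1.jpg"),
            os.path.join(tmp, "dune_0.jpg")]  # repeated path on purpose
    label_by_move(tmp, good)
    good_files = sorted(os.listdir(os.path.join(tmp, "good")))
    not_good_files = sorted(os.listdir(os.path.join(tmp, "not_good")))

print(good_files, not_good_files)
```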

In [36]:
# Movies
for i in range(goodscifi_movies_df.shape[0]):
    file_to_be_moved = goodscifi_movies_df.iloc[i]['path']
    file_name = file_to_be_moved.split('/')[-1]
    os.rename(file_to_be_moved, dataset_dir+'movies/train/good/'+file_name)

g = glob(dataset_dir+'/movies/train/*.jpg')
for i in range(len(g)):
    file_name = g[i].split('/')[-1]
    os.rename(g[i], dataset_dir+'movies/train/not_good/'+file_name)

In [1]:
# List good and not_good files; the counts are used to size the validation and test sets
from glob import glob  # glob must be imported in the current session
movie_data_good = glob(dataset_dir+'/movies/train/good/*.jpg')
movie_data_not_good = glob(dataset_dir+'/movies/train/not_good/*.jpg')
book_data_good = glob(dataset_dir+'/books/train/good/*.jpg')
book_data_not_good = glob(dataset_dir+'/books/train/not_good/*.jpg')

In [41]:
m = (len(movie_data_good), len(movie_data_not_good))
b = (len(book_data_good), len(book_data_not_good))

print(m, b)

(1084, 6161) (1114, 7054)


In [None]:
# glob all training data
# randomly shuffle files (set seed)
# set ratio (e.g. 60% of total training and 1/6th of good data)
# loop over and move data to valid and test
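
The plan in the comments above can be sketched as two small functions: one for the count arithmetic and one for a seeded shuffle. The names `split_counts` and `pick_holdout` are hypothetical. Unlike the cells that follow, this sketch shuffles once and slices disjoint validation and test sets instead of re-globbing after each move.

```python
import math
import numpy as np

def split_counts(n_good, n_not_good, holdout=0.40, good_divisor=6):
    """Return (good, not_good) file counts for EACH of valid and test."""
    total_take = (n_good + n_not_good) * holdout   # 40% leaves 60% for training
    total_good = total_take / good_divisor         # 1/6th of the holdout is "good"
    total_not_good = total_take - total_good
    # Halve the totals: one half for valid, one half for test
    return math.floor(total_good / 2), math.floor(total_not_good / 2)

def pick_holdout(paths, n_each, seed=42):
    """Shuffle once with a fixed seed and return disjoint (valid, test) lists."""
    rng = np.random.RandomState(seed)
    shuffled = [str(p) for p in rng.permutation(sorted(paths))]
    return shuffled[:n_each], shuffled[n_each:2 * n_each]

movie_counts = split_counts(1084, 6161)
book_counts = split_counts(1114, 7054)
valid, test = pick_holdout(["f{}.jpg".format(i) for i in range(10)], 3)
print(movie_counts, book_counts)
```

Sorting the paths before shuffling keeps the result independent of the order in which `glob` happens to return files.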

In [53]:
# Calculate how many files to move into valid and test (60:20:20 split)
m_total_take = (m[0]+m[1]) * 0.40 
m_total_good = m_total_take * 1/6
m_total_not_good = m_total_take - m_total_good

# Split up totals between valid and test (0.5)
m_good = math.floor(m_total_good / 2)
m_not_good = math.floor(m_total_not_good / 2)

print(m_total_take, m_total_good, m_total_not_good)
print(m_good, m_not_good)

2898.0 483.0 2415.0
241 1207


In [58]:
np.random.seed(42)

In [60]:
# Movies: Good Data

# Valid
g = glob(dataset_dir+'movies/train/good/*.jpg')
shuf = np.random.permutation(g)
for i in range(m_good):
    os.rename(shuf[i], dataset_dir+'movies/valid/good/'+shuf[i].split('/')[-1])

# Test
g = glob(dataset_dir+'movies/train/good/*.jpg')
shuf = np.random.permutation(g)
for i in range(m_good):
    os.rename(shuf[i], dataset_dir+'movies/test/good/'+shuf[i].split('/')[-1])

In [61]:
# Movies: Not Good Data

# Valid
g = glob(dataset_dir+'movies/train/not_good/*.jpg')
shuf = np.random.permutation(g)
for i in range(m_not_good):
    os.rename(shuf[i], dataset_dir+'movies/valid/not_good/'+shuf[i].split('/')[-1])

# Test
g = glob(dataset_dir+'movies/train/not_good/*.jpg')
shuf = np.random.permutation(g)
for i in range(m_not_good):
    os.rename(shuf[i], dataset_dir+'movies/test/not_good/'+shuf[i].split('/')[-1])

In [54]:
# Calculate how many files to move into valid and test (60:20:20 split)
b_total_take = (b[0]+b[1]) * 0.40 
b_total_good = b_total_take * 1/6
b_total_not_good = b_total_take - b_total_good

# Split up totals between valid and test (0.5)
b_good = math.floor(b_total_good / 2)
b_not_good = math.floor(b_total_not_good / 2)

print(b_total_take, b_total_good, b_total_not_good)
print(b_good, b_not_good)

3267.2000000000003 544.5333333333334 2722.666666666667
272 1361


In [64]:
# Books: Good Data

# Valid
g = glob(dataset_dir+'books/train/good/*.jpg')
shuf = np.random.permutation(g)
for i in range(b_good):
    os.rename(shuf[i], dataset_dir+'books/valid/good/'+shuf[i].split('/')[-1])

# Test
g = glob(dataset_dir+'books/train/good/*.jpg')
shuf = np.random.permutation(g)
for i in range(b_good):
    os.rename(shuf[i], dataset_dir+'books/test/good/'+shuf[i].split('/')[-1])

In [65]:
# Books: Not Good Data

# Valid
g = glob(dataset_dir+'books/train/not_good/*.jpg')
shuf = np.random.permutation(g)
for i in range(b_not_good):
    os.rename(shuf[i], dataset_dir+'books/valid/not_good/'+shuf[i].split('/')[-1])

# Test
g = glob(dataset_dir+'books/train/not_good/*.jpg')
shuf = np.random.permutation(g)
for i in range(b_not_good):
    os.rename(shuf[i], dataset_dir+'books/test/not_good/'+shuf[i].split('/')[-1])