# Data Files

A notebook that provides a quick overview of the data files in the project.

## Notebook setup

Ensure that the necessary libraries are installed and imported into the workspace.

In [1]:
# set up a relative path so the local module can be imported (needed when run under Conda); see https://stackoverflow.com/questions/34478398
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
import visualising_poetry.data as vpd

# import libraries
from IPython.display import display

# get data and process (if necessary)
vpd.setup_if_needed()

## Excel 'source' files

The project has the following source (Excel) files:

In [2]:
source_files_df = vpd.source_files_info_as_df()
display(source_files_df.sort_values('Source Files'))

| | Source Files | Rows | Columns |
|---|---|---|---|
| 21 | British Magazine 1746-1751 1.11.19.xlsx | 692 | 31 |
| 10 | Common Sense - Live.xlsx | 40 | 40 |
| 6 | Craftsman - live.xlsx | 18 | 40 |
| 0 | Daily Gazetteer 30.9.19.xlsx | 129 | 40 |
| 11 | Dublin Journal 29.10.19.xlsx | 171 | 40 |
| 12 | General Evening Post 29.10.19.xlsx | 128 | 40 |
| 14 | Gentleman's Magazine 1731-1800 29.10.19.xlsx | 1437 | 34 |
| 18 | LDPGA.xlsx | 40 | 40 |
| 4 | Ladies Magazine 1749-1753 29.10.19.xlsx | 733 | 31 |
| 9 | London Magazine 1732-1785 28.10.19.xlsx | 3238 | 31 |


## Pickle 'preprocessed' files 

The project has the following 'preprocessed' pickle files, each holding a Pandas data frame of the 'poem data' sheet from the corresponding Excel source file.
The data is cleaned when the pickle files are generated, for example by stripping extra whitespace and normalising case.

The row count might differ slightly from the source file because empty rows are deleted when the pickle files are generated (for example, Gentleman's Magazine goes from 1437 rows to 1436).
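
The cleaning described above can be illustrated with a minimal sketch. It is not the project's actual preprocessing code: the function name `clean_poem_data`, the `lowercase_columns` parameter and the sample column names are assumptions made for this example.

```python
import pandas as pd


def clean_poem_data(df, lowercase_columns=()):
    """Hypothetical helper: drop empty rows, tidy whitespace, lower-case chosen columns."""
    # Remove rows in which every cell is missing
    cleaned = df.dropna(how="all").copy()
    for column in cleaned.columns:
        # Collapse runs of whitespace and trim the ends; leave non-strings untouched
        cleaned[column] = cleaned[column].map(
            lambda v: " ".join(v.split()) if isinstance(v, str) else v
        )
    for column in lowercase_columns:
        # Normalise case only in the columns the caller names
        cleaned[column] = cleaned[column].map(
            lambda v: v.lower() if isinstance(v, str) else v
        )
    return cleaned.reset_index(drop=True)


# Sample data with one completely empty row (column names are illustrative)
raw = pd.DataFrame({
    "title": ["  An  Ode ", None, "ELEGY"],
    "year": [1746, None, 1750],
})
cleaned = clean_poem_data(raw, lowercase_columns=["title"])
print(cleaned)
```

In this sketch the empty middle row is removed, which is the kind of change that would make a pickle file's row count one lower than its source file's.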

Two additional 'computed' columns are added when the pickle files are generated:

 * 'printed': a NumPy datetime64 value constructed from other columns. Magazines with no day given are moved to the 
   1st of the following month to represent their likely publication and/or distribution date.
 * 'printed string': a string representation of the 'printed' field in the format YYYY-MM-DD.
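
As a rough illustration of how the two computed columns could be derived, the sketch below builds a date from separate year, month and day values and shifts rows with no day to the 1st of the following month. The input column names `year`, `month` and `day` and the function `add_printed_columns` are assumptions for this example, not names taken from the project.

```python
import pandas as pd


def add_printed_columns(df):
    """Hypothetical helper: add 'printed' and 'printed string' columns."""
    result = df.copy()
    has_day = result["day"].notna()
    # Use the given day where there is one; otherwise start from the 1st of the month
    parts = pd.DataFrame({
        "year": result["year"].astype(int),
        "month": result["month"].astype(int),
        "day": result["day"].fillna(1).astype(int),
    })
    printed = pd.to_datetime(parts)
    # Rows with no day move to the 1st of the following month
    printed = printed.where(has_day, printed + pd.DateOffset(months=1))
    result["printed"] = printed
    result["printed string"] = printed.dt.strftime("%Y-%m-%d")
    return result


# Assumed input columns: 'year', 'month', 'day' (day may be missing)
poems = pd.DataFrame({
    "year": [1746, 1746, 1750],
    "month": [3, 3, 12],
    "day": [15, None, None],
})
dated = add_printed_columns(poems)
print(dated[["printed", "printed string"]])
```

Using a month offset rather than adding a fixed number of days means a December issue with no day rolls over to 1 January of the next year.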

In [3]:
preprocessed_file_df = vpd.preprocessed_files_info_as_df()
display(preprocessed_file_df.sort_values('Preprocessed Files'))

Unnamed: 0,Preprocessed Files,Rows,Columns
3,British Magazine 1746-1751 1.11.19.pickle,692,35
15,Common Sense - Live.pickle,40,44
6,Craftsman - live.pickle,18,44
13,Daily Gazetteer 30.9.19.pickle,129,44
4,Dublin Journal 29.10.19.pickle,171,44
18,General Evening Post 29.10.19.pickle,128,44
9,Gentleman's Magazine 1731-1800 29.10.19.pickle,1436,38
5,LDPGA.pickle,40,44
19,Ladies Magazine 1749-1753 29.10.19.pickle,733,35
12,London Magazine 1732-1785 28.10.19.pickle,3238,35


## Complete dataset

The 'complete_dataset()' function returns the contents of all of the pickle files as a single Pandas data frame.

Only poems whose year is less than or equal to a maximum year (MAX_YEAR in settings.py) are returned.
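
A minimal sketch of the combine-and-filter idea follows. It is not the implementation of `complete_dataset()`: the function `combine_and_filter`, the `year` column name and the `MAX_YEAR` value of 1800 are all assumptions for illustration (the real value is defined in settings.py).

```python
import pandas as pd

MAX_YEAR = 1800  # assumed value for this example only


def combine_and_filter(frames, max_year=MAX_YEAR, year_column="year"):
    """Hypothetical helper: concatenate frames and keep rows up to max_year."""
    # pd.concat keeps the union of all columns, filling gaps with NaN
    combined = pd.concat(frames, ignore_index=True, sort=False)
    # Keep only rows whose year does not exceed the maximum
    return combined[combined[year_column] <= max_year].reset_index(drop=True)


# Two small frames with different columns (names are illustrative)
magazine = pd.DataFrame({"year": [1746, 1801], "title": ["an ode", "a song"]})
newspaper = pd.DataFrame({"year": [1739], "title": ["epigram"], "page": [3]})
dataset = combine_and_filter([magazine, newspaper])
print(dataset.shape)
```

Because concatenation keeps the union of columns, a combined frame built this way can have more columns than any single input frame.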

In [4]:
# get data
df = vpd.complete_dataset()

# report the number of rows and columns
print(f"Complete dataset has {df.shape[0]} rows and {df.shape[1]} columns")

Complete dataset has 10245 rows and 48 columns
