# Data Files

A notebook that provides a quick overview of the data files in the project.

## Notebook setup

Ensure that the necessary libraries are installed and imported into the workspace.

In [17]:
# install libraries
!pip install -r ../requirements.txt

# import libraries
import visualising_poetry.data as vpd
from IPython.display import display

# get data if necessary
vpd.setup_if_needed()



## Excel 'source' files

The project has the following source (Excel) files:

In [18]:
source_files_df = vpd.source_files_info_as_df()
display(source_files_df.sort_values('Source Files'))

| | Source Files | Rows | Columns |
|---|---|---|---|
| 21 | British Magazine 1746-1751.xlsx | 692 | 31 |
| 10 | Common Sense - Live.xlsx | 40 | 40 |
| 6 | Craftsman - live.xlsx | 18 | 40 |
| 0 | Daily Gazetteer 30.9.19.xlsx | 129 | 40 |
| 11 | Dublin Journal - live.xlsx | 171 | 40 |
| 12 | General Evening Post - live.xlsx | 128 | 40 |
| 14 | Gentleman's Magazine 1731-1800.xlsx | 1437 | 34 |
| 18 | LDPGA.xlsx | 40 | 40 |
| 4 | Ladies Magazine 1749-1753 - live.xlsx | 733 | 31 |
| 9 | London Magazine 1732-1785.xlsx | 3238 | 31 |


## Pickle 'preprocessed' files 

The project has the following 'preprocessed' pickle files, each holding a pandas data frame of the 'poem data' sheet from the corresponding Excel source file.
The data is cleaned when the pickle files are generated, for example by stripping extra whitespace and normalising case.

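The row count might differ slightly from the source file, since any empty rows are deleted when the pickle files are generated. For example, Gentleman's Magazine 1731-1800 has 1437 rows in its Excel file but 1436 rows in its pickle file.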
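The cleaning described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name `clean_frame`, the choice of lower-casing as the case normalisation, and the sample column names are all assumptions.

```python
import pandas as pd


def _tidy(value):
    """Collapse whitespace and lower-case strings; leave other values untouched."""
    if isinstance(value, str):
        tidy = " ".join(value.split()).lower()
        # Whitespace-only cells become missing so they count as empty
        return tidy or None
    return value


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: tidy every cell, then drop fully empty rows."""
    cleaned = df.apply(lambda col: col.map(_tidy))
    return cleaned.dropna(how="all").reset_index(drop=True)


# Illustrative data only; these are not the real column names
raw = pd.DataFrame({
    "title": ["  An ODE  ", None, "   "],
    "author": ["Anon ", None, None],
})
cleaned = clean_frame(raw)
print(len(raw), "rows before,", len(cleaned), "rows after")
```

Rows that contain only missing or whitespace-only cells are removed, which is the kind of step that would make a pickle file slightly shorter than its Excel source.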

Two additional 'computed' columns are created when the pickle files are created, which is why each pickle file has two more columns than its Excel source:

 * 'printed': a numpy datetime64 value constructed from other columns. Magazines with no day given are moved to the 
   1st of the following month to represent their likely publication and/or distribution date.
 * 'printed string': a string representation of the 'printed' field in the format YYYY-MM-DD.
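The rule for the two computed columns can be sketched as below. This is only an illustration under stated assumptions: the source columns are assumed to be called `year`, `month` and `day`, and the helper `add_printed_columns` is hypothetical rather than part of `visualising_poetry`.

```python
import pandas as pd


def add_printed_columns(df, year="year", month="month", day="day"):
    """Hypothetical helper: derive 'printed' and 'printed string' columns."""
    out = df.copy()
    has_day = out[day].notna()

    # Build a date from the parts, temporarily using day 1 where none is given
    parts = pd.DataFrame({
        "year": out[year].astype(int),
        "month": out[month].astype(int),
        "day": out[day].fillna(1).astype(int),
    })
    printed = pd.to_datetime(parts)

    # Entries with no day are moved to the 1st of the following month
    printed = printed.where(has_day, printed + pd.DateOffset(months=1))

    out["printed"] = printed
    out["printed string"] = printed.dt.strftime("%Y-%m-%d")
    return out


# Illustrative data: one dated entry and one monthly issue with no day
sample = pd.DataFrame({"year": [1746, 1750], "month": [3, 12], "day": [15, None]})
result = add_printed_columns(sample)
print(result[["printed", "printed string"]])
```

In this sketch the December 1750 entry with no day rolls over to 1751-01-01, while the dated entry keeps its own day.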

In [19]:
preprocessed_file_df = vpd.preprocessed_files_info_as_df()
display(preprocessed_file_df.sort_values('Preprocessed Files'))

| | Preprocessed Files | Rows | Columns |
|---|---|---|---|
| 0 | British Magazine 1746-1751.pickle | 692 | 33 |
| 14 | Common Sense - Live.pickle | 40 | 42 |
| 5 | Craftsman - live.pickle | 18 | 42 |
| 11 | Daily Gazetteer 30.9.19.pickle | 129 | 42 |
| 18 | Dublin Journal - live.pickle | 171 | 42 |
| 16 | General Evening Post - live.pickle | 128 | 42 |
| 10 | Gentleman's Magazine 1731-1800.pickle | 1436 | 36 |
| 3 | LDPGA.pickle | 40 | 42 |
| 15 | Ladies Magazine 1749-1753 - live.pickle | 733 | 33 |
| 8 | London Magazine 1732-1785.pickle | 3238 | 33 |


## Complete dataset

Using the 'complete_dataset()' function we can load all of the pickle files as a single pandas data frame.

In [21]:
# get data
df = vpd.complete_dataset()

# report the number of rows and columns
print("Complete dataset has {} rows and {} columns".format(df.shape[0], df.shape[1]))

Complete dataset has 10245 rows and 46 columns
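How `complete_dataset()` combines the files is not shown here. One plausible approach, sketched below, is to read every pickle file and concatenate the frames, keeping the union of their columns. The helper `combine_pickles` and the sample file and column names are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_pickles(directory):
    """Hypothetical helper: stack every *.pickle frame in a directory."""
    frames = [pd.read_pickle(path) for path in sorted(Path(directory).glob("*.pickle"))]
    # concat keeps the union of columns; missing cells are filled with NaN
    return pd.concat(frames, ignore_index=True, sort=False)


# Illustrative data: two small frames with partly different columns
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"title": ["a"], "page": [1]}).to_pickle(Path(tmp) / "one.pickle")
    pd.DataFrame({"title": ["b", "c"], "notes": ["x", "y"]}).to_pickle(Path(tmp) / "two.pickle")
    combined = combine_pickles(tmp)

print("Combined frame has {} rows and {} columns".format(*combined.shape))
```

A union of columns would be consistent with the complete dataset having 46 columns, more than the 42 of the widest file listed above, though the files not shown in the table may also account for the difference.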
