# Papers Past Genre Classification
# Notebook 1: Preprocessing
---

## Processing the raw data

This notebook processes raw Papers Past METS/ALTO XML files (stored as tar.gz archives, one per newspaper and year) and returns a pandas dataframe that is saved for later use. Newspaper issues are selected at random, with the option of setting a seed for reproducibility.

The final dataframe columns are:
* date (date of publication - datetime)
* newspaper_id (the code used to identify each newspaper in the Papers Past filename structure - string)
* newspaper (the full newspaper name - string)
* article_id (the article's relative position in the newspaper issue - integer)
* Numeric features extracted from the ALTO XML files: the average, minimum, maximum and range of line width, and the average, maximum and minimum offset of lines from their text block (floats)
* title (the title of the article - string)
* text (the full text of the article - string)

The code for processing the tar.gz files and extracting data from the METS/ALTO files is based on code developed by Joshua Black as part of a previous DATA601 project using the Papers Past open data (https://github.com/JoshuaDavidBlack/newspaper-philosophy-methods/blob/main/Notebooks/Preprocessing%20Stage.ipynb). 

In [1]:
# Import necessary libraries

import re
import os
import statistics
import tarfile
import random
import numpy as np
import pandas as pd
import xml.etree.ElementTree as ET
import pickle
import time
from datetime import date
from datetime import datetime

In [2]:
# Set seed for reproducibility
random.seed(a=14)
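
As a small illustration of what the seed does, the sketch below draws from a list twice after re-seeding and gets the same sequence both times. The file names are hypothetical placeholders, not real Papers Past archives.

```python
import random

# Hypothetical archive names, used only for illustration.
files = ['a.tar.gz', 'b.tar.gz', 'c.tar.gz', 'd.tar.gz']

random.seed(14)
first_run = [random.choice(files) for _ in range(5)]

random.seed(14)  # re-seeding restarts the same pseudo-random sequence
second_run = [random.choice(files) for _ in range(5)]

print(first_run == second_run)
```

Note that the seed only makes a run reproducible if the list of candidate files (and its order) is also the same between runs.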

In [3]:
# Load mets namespace
# http://www.loc.gov/standards/mets/namespace.html

NS = {'mets':'http://www.loc.gov/METS/'}

### Functions

Run the following cells to define the functions that will be used in generating the final output of this notebook.

In [4]:
def process_tarball(filepath):
    """
    Given path to tarball, open it, select one issue at random and return
    a dictionary with article codes as keys and, as values, tuples of
    (title, text, line_widths, line_hpos, line_offsets).
    """
    # Open in a with-block so the archive is closed even if parsing fails.
    with tarfile.open(filepath, mode='r') as newspaper_year:

        # Return the members of the archive as a list of TarInfo objects.
        # The list has the same order as the members in the archive.
        # https://docs.python.org/3/library/tarfile.html
        files = newspaper_year.getmembers()

        issues = collect_issues(files)
        selected_issue = select_random_issue(issues)
        # The METS file is expected to be the last XML file listed for an issue.
        mets_tarinfo = issues[selected_issue][-1]
        pages_tarinfo = issues[selected_issue][0:-1]
        article_codes = mets2codes(mets_tarinfo, newspaper_year)
        articles = codes2texts(article_codes, pages_tarinfo, newspaper_year, selected_issue)

    return articles

In [5]:
def collect_issues(files):
    """
    Given list of files in tarball, return a dictionary keyed
    by the issue code with list of xml files of form [0001.xml, ..., mets.xml]
    as values.
    """
    issues = {}
    issue_code = ''

    for file in files:
        # Raw string so that \d is passed to the regex engine unchanged
        match = re.search(r"[A-Z]*_\d{8}$", file.name)
        if match:
            issue_code = match.group(0)
        if file.name.endswith('.xml'):
            xml_list = issues.get(issue_code, [])
            xml_list.append(file)
            issues[issue_code] = xml_list

    return issues
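
The grouping logic can be tried without a real archive. The sketch below uses a stand-in for `tarfile.TarInfo` and made-up member names; the directory layout shown is an assumption for illustration and may not match the real Papers Past archives exactly.

```python
import re
from collections import namedtuple

# Stand-in for tarfile.TarInfo; only the .name attribute is used here.
Member = namedtuple('Member', 'name')

# Hypothetical member names (assumed layout: issue directory, then its XML files).
members = [
    Member('DEMO_1895/DEMO_18950803'),
    Member('DEMO_1895/DEMO_18950803/0001.xml'),
    Member('DEMO_1895/DEMO_18950803/mets.xml'),
    Member('DEMO_1895/DEMO_18950810'),
    Member('DEMO_1895/DEMO_18950810/0001.xml'),
    Member('DEMO_1895/DEMO_18950810/mets.xml'),
]

issues = {}
issue_code = ''
for member in members:
    # A name ending in CODE_YYYYMMDD marks the start of a new issue.
    match = re.search(r"[A-Z]*_\d{8}$", member.name)
    if match:
        issue_code = match.group(0)
    # XML files are filed under the most recently seen issue code.
    if member.name.endswith('.xml'):
        issues.setdefault(issue_code, []).append(member.name)

print(issues)
```

The approach relies on the issue directory entry appearing in the archive before the XML files it contains.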

In [6]:
def select_random_issue(issues):
    """
    Select a random issue from a dictionary of all of a newspaper's issues for one year
    (with the issue code as the key and the list of XML files as the value).
    """
    selected_issue = random.choice(list(issues))

    return selected_issue 

In [7]:
def mets2codes(mets_tarinfo, newspaper_year):
    """
    Given mets as tarinfo, return text block codes for articles
    contained in mets file. Edited for processing with tarfile
    object newspaper_year.

    Returns dictionary of article codes as keys,
    with a 2-tuple containing the article title
    and a list of corresponding text block codes as values.
    """
    with newspaper_year.extractfile(mets_tarinfo) as file:
        text = file.read()

    art_dict = mets2codes_inner(text, newspaper_year)

    return art_dict

In [8]:
def mets2codes_inner(text, newspaper_year):
    """
    Given METS file as text string, return a dictionary of
    articles, with article codes as keys and, as values, tuples containing
    the corresponding article title and a list of the IDs of the text
    blocks in the corresponding ALTO files.
    """
    # Load the mets xml file (which comes into the function as a string)
    mets_root = ET.fromstring(text) 
    
    # Find the "logical structure" part of the file, which lists all the articles and the blocks they contain.
    logical_structure = mets_root.find("./mets:structMap[@LABEL='Logical Structure']", NS) 
    
    # This returns all of the "div" elements in the logical structure part of the xml with the attribute "TYPE='ARTICLE'". 
    # This is where we lose the display advertisements.
    articles_div = logical_structure.findall(".//mets:div[@TYPE='ARTICLE']", NS) 

    # This is an empty dictionary which will collect what we need from the mets file. 
    # It will have articles ids as keys and have the ids of the text blocks which are part of the article as values.
    art_dict = {} 
    for article in articles_div:
        
        attributes = article.attrib
        article_id = attributes['DMDID']
        article_title = attributes.get('LABEL', 'UNTITLED')

        text_blocks = article.findall(".//mets:div[@TYPE='TEXT']", NS)
        block_ids = []
        for block in text_blocks:
            try:
                areas = block.findall(".//mets:area", NS)
                for area in areas:
                    block_id = area.attrib['BEGIN']
                    block_ids.append(block_id)
            except AttributeError:
                print(f'Error in {newspaper_year}')
        
        art_dict[article_id] = (article_title, block_ids)

    mets_root.clear() # When processing lots of these, we want to free up memory.

    # print(art_dict)
    return art_dict
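
To make the METS lookup concrete, here is a minimal sketch run against a hand-written fragment. The fragment is hypothetical and heavily simplified: the element nesting, the `MODSMD_ARTICLE1` identifier and the block ID are assumptions chosen to resemble the values seen elsewhere in this notebook, not a copy of a real file.

```python
import xml.etree.ElementTree as ET

NS = {'mets': 'http://www.loc.gov/METS/'}

# Hypothetical, simplified METS fragment (real files contain far more).
METS_SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap LABEL="Logical Structure">
    <mets:div TYPE="NEWSPAPER">
      <mets:div TYPE="ARTICLE" DMDID="MODSMD_ARTICLE1" LABEL="PORT OF TIMARU.">
        <mets:div TYPE="TEXT">
          <mets:fptr><mets:area BEGIN="P1_TB00001"/></mets:fptr>
        </mets:div>
      </mets:div>
      <mets:div TYPE="ADVERTISEMENT" DMDID="MODSMD_ADVERTISEMENT1">
        <mets:div TYPE="TEXT">
          <mets:fptr><mets:area BEGIN="P1_TB00002"/></mets:fptr>
        </mets:div>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

root = ET.fromstring(METS_SAMPLE)
logical = root.find("./mets:structMap[@LABEL='Logical Structure']", NS)

art_dict = {}
# Only divs of TYPE 'ARTICLE' are kept, so the advertisement is skipped.
for article in logical.findall(".//mets:div[@TYPE='ARTICLE']", NS):
    block_ids = []
    for block in article.findall(".//mets:div[@TYPE='TEXT']", NS):
        for area in block.findall(".//mets:area", NS):
            block_ids.append(area.attrib['BEGIN'])
    title = article.attrib.get('LABEL', 'UNTITLED')
    art_dict[article.attrib['DMDID']] = (title, block_ids)

print(art_dict)
```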

In [9]:
def codes2texts(article_codes, pages_tarinfo, newspaper_year, selected_issue):
    """
    Given article codes, the issue pages as TarInfo objects,
    the open tarfile and the issue code, return a dictionary
    with issue-level article codes as keys and, as values, tuples of
    (title, text, line_widths, line_hpos, line_offsets).
    """
    page_roots = parse_pages_tar(pages_tarinfo, newspaper_year)
    # page_roots returns a dictionary with pages numbers (of form 'P1'
    # etc...) as keys and the XML roots of the pages as values.

    texts_dict = codes2texts_inner(article_codes, page_roots, selected_issue)

    # Clear roots.
    for i in range(len(page_roots)):
        k, v = page_roots.popitem()
        v.clear()

    return texts_dict

In [10]:
def parse_pages_tar(pages, newspaper_year):
    """
    Given the page files of one issue as TarInfo objects (in page order)
    and the open tarfile, return a dictionary with 'P1', 'P2', etc
    as keys, and the root element of each page as values.
    """
    page_roots = {}
    for i, page in enumerate(pages):
        with newspaper_year.extractfile(page) as f:
            text = f.read()
        root = ET.fromstring(text)
        page_roots[f'P{i+1}'] = root

    return page_roots
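
The same read-and-parse pattern can be exercised on a tiny archive built in memory. This is only a sketch: the member names and the XML bodies are invented for the example.

```python
import io
import tarfile
import xml.etree.ElementTree as ET

def make_demo_tar():
    """Build a small gzip-compressed tarball in memory (hypothetical contents)."""
    buffer = io.BytesIO()
    pages = [
        ('DEMO_18950803/0001.xml', b'<alto><Page ID="P1"/></alto>'),
        ('DEMO_18950803/0002.xml', b'<alto><Page ID="P2"/></alto>'),
    ]
    with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
        for name, body in pages:
            info = tarfile.TarInfo(name=name)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
    buffer.seek(0)
    return buffer

page_roots = {}
with tarfile.open(fileobj=make_demo_tar(), mode='r') as tar:
    # Members come back in archive order, so page keys follow that order.
    for i, member in enumerate(tar.getmembers()):
        with tar.extractfile(member) as f:
            page_roots[f'P{i+1}'] = ET.fromstring(f.read())

print(sorted(page_roots))
```

Because the keys are assigned by position, the mapping is only correct when the page files are listed in page order within the archive.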

In [11]:
def codes2texts_inner(article_codes, page_roots, selected_issue):
    """
    Given XML roots of ALTO pages and collection of article codes
    and corresponding blocks, return a dictionary with article codes
    as keys and, as values, tuples of (title, text, line_widths,
    line_hpos, line_offsets).
    """
    texts_dict = {}  
    
    for article_id in article_codes.keys():
        title, blocks = article_codes[article_id]
        text = []
        line_widths = []
        line_hpos = []
        line_offsets = []
        
        for block in blocks:

            # The block ids have page numbers as part. We collect the page number.
            end_loc = block.find('_')
            page_no = block[0:end_loc]

            # Collect the relevant page (the alto file) for the block.
            page_root = page_roots[page_no]

            # Collect the specific block from the page and identify the desired elements 
            # (strings, lines, horizontal position etc.)
            xml_block = page_root.find(f".//TextBlock[@ID='{block}']")

            block_strings = xml_block.findall('.//String')
            block_lines = xml_block.findall('.//TextLine')
            block_hpos = int(xml_block.get("HPOS"))

            # Collect the information we want from the blocks.
            block_as_string = process_block(block_strings)
            block_line_widths = process_block_lines(block_lines)
            block_line_hpos = process_lines_hpos(block_lines)
        
            block_line_offsets = [hpos - block_hpos for hpos in block_line_hpos]
            
            text.append(block_as_string)
            line_widths.extend(block_line_widths)
            line_hpos.extend(block_line_hpos)
            line_offsets.extend(block_line_offsets)

        text = ' '.join(text)
        issue_article_id = selected_issue + '_' + article_id[7:]
        # texts_dict[issue_article_id] = (title, text, line_widths, line_hpos)
        texts_dict[issue_article_id] = (title, text, line_widths, line_hpos, line_offsets)

    
    return texts_dict
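
The per-block extraction can be sketched on a hand-written ALTO fragment. The fragment below is hypothetical and simplified (no namespace, two lines, invented coordinates); it only mirrors the elements and attributes that the function above reads.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified ALTO page with a single text block.
ALTO_SAMPLE = """<alto>
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="P1_TB00001" HPOS="86">
      <TextLine HPOS="115" WIDTH="506">
        <String CONTENT="The"/><String CONTENT="grain"/><String CONTENT="market"/>
      </TextLine>
      <TextLine HPOS="90" WIDTH="531">
        <String CONTENT="continues"/><String CONTENT="quiet."/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

block_id = 'P1_TB00001'
# The page key is the part of the block ID before the first underscore.
page_no = block_id[:block_id.find('_')]
page_roots = {page_no: ET.fromstring(ALTO_SAMPLE)}

xml_block = page_roots[page_no].find(f".//TextBlock[@ID='{block_id}']")
block_hpos = int(xml_block.get('HPOS'))
lines = xml_block.findall('.//TextLine')

text = ' '.join(s.attrib['CONTENT'] for s in xml_block.findall('.//String'))
line_widths = [int(line.attrib['WIDTH']) for line in lines]
# Offset = how far each line starts to the right of its block's left edge.
line_offsets = [int(line.attrib['HPOS']) - block_hpos for line in lines]

print(text, line_widths, line_offsets)
```

An indented first line shows up as a larger first offset, which is the kind of layout signal these features are meant to capture.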

In [12]:
def process_block(block_strings):
    """
    Given xml String elements from text block, return whole block
    as single string.
    """
    words = []
    for s in block_strings:
        words.append(s.attrib['CONTENT'])
    total_string = ' '.join(words)

    return total_string

In [13]:
def process_block_lines(block_lines):
    """
    Given xml TextLine elements from text block, return a list of the widths.
    """
    line_w = []
    
    for line in block_lines:
        line_w.append(int(line.attrib['WIDTH']))

    return line_w

In [14]:
def process_lines_hpos(block_lines):
    """
    Given xml TextLine elements from text block, return a list of the
    horizontal starting position of each line.
    """
    line_hpos = []
    
    for line in block_lines:
        line_hpos.append(int(line.attrib['HPOS']))

    return line_hpos

In [15]:
def process_and_collect(filepath):
    """
    Return dataframe for the selected newspaper/year.
    """
    # print(f'Processing {path}')
    try:
        articles = process_tarball(filepath)
        dataframe = pd.DataFrame.from_dict(
            articles,
            orient='index',
            dtype = object,
            # columns=['title', 'text', 'line_widths', 'line_hpos']
            columns=['title', 'text', 'line_widths', 'line_hpos', 'line_offsets']
            )
    except Exception as err:
        # Report the failure and carry on; pd.concat later drops None entries.
        print(f'Problem with {filepath}: {err}')
        dataframe = None
        
    return dataframe

In [16]:
# Code source: https://thispointer.com/python-how-to-get-list-of-files-in-directory-and-sub-directories/

def get_files(dir_name):
    """
    For the given path, get a list of all files in the directory tree
    """
    # Create a list of files and sub directories 
    files_dir = os.listdir(dir_name)
    file_list = list()
    
    # Iterate over all the entries
    for item in files_dir:
        
        # Create full path
        full_path = os.path.join(dir_name, item)
        
        # If entry is a directory then get the list of files in this directory 
        if os.path.isdir(full_path):
            # Recurse into the sub directory (not dir_name, which would never terminate)
            file_list = file_list + get_files(full_path)
        else:
            file_list.append(full_path)
                
    return file_list
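
As an alternative to the recursive function, the standard library's `os.walk` visits every sub directory for you. The sketch below is not the notebook's method; `get_files_walk` is a hypothetical helper, demonstrated on a temporary directory with invented file names.

```python
import os
import tempfile

def get_files_walk(dir_name):
    """Return the full paths of all files below dir_name (hypothetical helper)."""
    file_list = []
    for root, _dirs, files in os.walk(dir_name):
        for name in files:
            file_list.append(os.path.join(root, name))
    return file_list

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny tree: one file at the top level, one in a sub directory.
    os.makedirs(os.path.join(tmp, 'DEMO'))
    for rel in ('top.tar.gz', os.path.join('DEMO', 'DEMO_1895.tar.gz')):
        with open(os.path.join(tmp, rel), 'w') as fh:
            fh.write('')
    found = sorted(os.path.relpath(p, tmp) for p in get_files_walk(tmp))

print(found)
```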


In [17]:
def select_and_create(dir_name, num_issues):
    """
    Randomly select a given number of issues
    and return a dataframe.
    """

    file_list = get_files(dir_name)
    file_paths = []
    single_dfs = [] # A list of the dataframes created for each newspaper issue

    for random_selection in range(0, num_issues):
        
        selected_tar = random.choice(file_list)
        file_paths.append(selected_tar)  
    
    for filepath in file_paths:
        single_df = process_and_collect(filepath)
        single_dfs.append(single_df)
    
    final_df = pd.concat(single_dfs, axis = 0)
    final_df.reset_index(drop=False, inplace=True, col_level=0)
        
    return final_df
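
Because `random.choice` is called once per iteration, the same tarball can be drawn more than once (sampling with replacement). A second draw may still land on a different issue inside that tarball, so repeated articles only occur when the same issue is picked again. If distinct tarballs were wanted, `random.sample` is one option. The sketch below contrasts the two with hypothetical file names; it is an alternative, not what the function above does.

```python
import random

# Hypothetical newspaper-year archives.
file_list = [f'DEMO_{year}.tar.gz' for year in range(1890, 1896)]

random.seed(14)
# One independent draw per iteration: a file may appear more than once.
with_replacement = [random.choice(file_list) for _ in range(4)]
# A single draw of four distinct files: no file can repeat.
without_replacement = random.sample(file_list, k=4)

print(with_replacement)
print(without_replacement)
```

`random.sample` raises a `ValueError` if more items are requested than the list contains, so it only suits cases where the number of issues does not exceed the number of archives.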

In [18]:
def produce_df(dir_name, num_issues):
    """
    Given a directory for the Papers Past 
    open data (with newspaper-year files 
    in tar.gz format) and a number of issues 
    to randomly select, return a dataframe 
    of articles and features.
    """
                
    final_df = select_and_create(dir_name, num_issues)
    
    # Calculate features from the lists of line widths and positions.
    # Each list becomes one row of a temporary dataframe (shorter lists are padded with NaN),
    # so row-wise statistics give one value per article.
    line_widths = pd.DataFrame(final_df['line_widths'].values.tolist())
    final_df['avg_line_width'] = line_widths.mean(axis=1)
    final_df['max_line_width'] = line_widths.max(axis=1)
    final_df['min_line_width'] = line_widths.min(axis=1)
    final_df['line_width_range'] = final_df['max_line_width'] - final_df['min_line_width']
    
    # A line offset is the difference between the horizontal starting position of a line and that of its block
    line_offsets = pd.DataFrame(final_df['line_offsets'].values.tolist())
    final_df['avg_line_offset'] = line_offsets.mean(axis=1)
    final_df['max_line_offset'] = line_offsets.max(axis=1)
    final_df['min_line_offset'] = line_offsets.min(axis=1)
       
    return final_df
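
For a single article, the same features can be worked out by hand. The sketch below uses only the standard library and a hypothetical helper, `line_features`; it returns `None` where the pandas version would give `NaN` (an article with no lines). The input lists are the nine-line 'COMMERCIAL.' article shown in the output further down.

```python
import statistics

FEATURE_NAMES = ['avg_line_width', 'max_line_width', 'min_line_width', 'line_width_range',
                 'avg_line_offset', 'max_line_offset', 'min_line_offset']

def line_features(line_widths, line_offsets):
    """Hypothetical helper: summary features for one article's lines."""
    if not line_widths:
        # No lines were extracted, so no feature is defined.
        return dict.fromkeys(FEATURE_NAMES)
    return {
        'avg_line_width': statistics.mean(line_widths),
        'max_line_width': max(line_widths),
        'min_line_width': min(line_widths),
        'line_width_range': max(line_widths) - min(line_widths),
        'avg_line_offset': statistics.mean(line_offsets),
        'max_line_offset': max(line_offsets),
        'min_line_offset': min(line_offsets),
    }

widths = [506, 531, 534, 533, 534, 532, 532, 533, 108]
offsets = [29, 4, 1, 2, 1, 3, 2, 2, 1]
features = line_features(widths, offsets)
empty = line_features([], [])

print(features)
```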

### Extracting a random selection of issues and loading them into a dataframe

* Provide the filepath for the newspaper-year tar.gz files.
* Select the number of issues to sample.
* Return the dataframe.

In [19]:
# Set the top level directory to sample from
dir_name = 'T:/projects/2021/Papers/raw_data'

# Set the number of issues to be selected
num_issues = 25

# Create the final dataframe and measure time to load
t1 = time.perf_counter()

final_df = produce_df(dir_name, num_issues)

t2 = time.perf_counter()

print(f"Returned dataframe in {t2 - t1:0.4f} seconds")

Returned dataframe in 311.0030 seconds


In [20]:
# Display all the rows of the final dataframe with a vertical scrollbar

pd.set_option('display.max_rows', None)
display(final_df)

Unnamed: 0,index,title,text,line_widths,line_hpos,line_offsets,avg_line_width,max_line_width,min_line_width,line_width_range,avg_line_offset,max_line_offset,min_line_offset
0,THD_18950803_ARTICLE1,PORT OF TIMARU.,The flagstaff of Timaru is situated m 171deg 1...,"[508, 534, 366, 118, 50, 95, 506, 264, 275, 27...","[106, 79, 78, 288, 108, 300, 108, 82, 224, 110...","[32, 5, 4, 214, 34, 226, 34, 8, 150, 36, 36, 9...",430.25,534.0,50.0,484.0,34.333333,226.0,4.0
1,THD_18950803_ARTICLE2,HIGH WATER AT TIMARU.,,[],[],[],,,,,,,
2,THD_18950803_ARTICLE3,COMMERCIAL.,The grain market continues quiet. Oats have re...,"[506, 531, 534, 533, 534, 532, 532, 533, 108]","[115, 90, 87, 88, 87, 89, 88, 88, 87]","[29, 4, 1, 2, 1, 3, 2, 2, 1]",482.555556,534.0,108.0,426.0,5.0,29.0,1.0
3,THD_18950803_ARTICLE4,MAIL NOTICES.,Subject to any necessary alterations mails wil...,"[505, 368, 319, 504, 332, 505, 532, 533, 223, ...","[115, 88, 190, 115, 87, 115, 87, 87, 87, 113, ...","[28, 1, 1, 29, 1, 29, 1, 1, 1, 26, 1, 124, 31,...",430.321429,533.0,151.0,382.0,16.035714,124.0,1.0
4,THD_18950803_ARTICLE5,OCEAN MAIL SERVICES.,Via Brindisi and Napies to London. Leave Melbo...,"[476, 227, 186, 298, 282, 283, 281, 146, 164, ...","[112, 120, 407, 90, 106, 105, 107, 443, 425, 4...","[1, 1, 1, 0, 16, 15, 17, 19, 1, 56, 56, 28, 2,...",349.444444,532.0,109.0,423.0,30.833333,270.0,0.0
5,THD_18950803_ARTICLE6,ABSTRACT OF SALES BY AUCTION.,;[SKE ADVERTISEMENTS.] This Day. By C.F.C. Ass...,"[315, 136, 533, 507, 46, 532, 506, 506, 507, 3...","[2974, 3071, 2871, 2897, 2897, 2871, 2896, 289...","[0, 1, 1, 27, 27, 1, 26, 28, 27, 27, 1, 26, 27]",393.615385,533.0,46.0,487.0,16.846154,28.0,0.0
6,THD_18950803_ARTICLE7,The Timaru Herald.,"SATURDAY, AUGUST 3, 1895. With the exception o...","[438, 530, 531, 532, 530, 530, 531, 530, 531, ...","[2917, 2871, 2870, 2869, 2871, 2871, 2870, 286...","[1, 11, 10, 9, 11, 11, 10, 9, 10, 9, 10, 9, 8,...",530.449664,535.0,438.0,97.0,8.04698,17.0,1.0
7,THD_18950803_ARTICLE8,TOWN & COUNTRY.,The Geraldine Farmers' Club meet on Monday eve...,"[502, 384, 504, 529, 256, 504, 531, 532, 531, ...","[3455, 3429, 3453, 3428, 3427, 3453, 3426, 342...","[26, 0, 26, 1, 0, 28, 1, 0, 1, 2, 26, 0, 0, 1,...",508.458101,538.0,56.0,482.0,6.480447,38.0,0.0
8,THD_18950803_ARTICLE9,BRITISH & FOREIGN.,"Per Electric Telegraph— Copyright, Peb Peess A...","[520, 311, 422, 235, 505, 531, 533, 533, 533, ...","[3975, 4071, 4021, 4234, 3994, 3968, 3966, 396...","[0, 1, 0, 1, 30, 4, 2, 2, 2, 0, 3, 2, 1, 1, 3,...",472.416667,549.0,133.0,416.0,32.791667,265.0,0.0
9,THD_18950803_ARTICLE10,COMMERCIAL.,Pbr Electric Telegraph— Copyright. Per Press A...,"[527, 311, 232, 507, 534, 534, 536, 536, 531, ...","[77, 182, 341, 103, 77, 77, 76, 76, 76, 77, 77...","[1, 0, 1, 28, 2, 2, 1, 1, 1, 1, 1, 0, 1, 25, 0...",462.923077,537.0,75.0,462.0,10.820513,211.0,0.0


In [21]:
len(final_df)

771

In [22]:
# Check for and remove any duplicate articles

final_df = final_df.drop_duplicates(subset='index', keep="first")

In [23]:
len(final_df)

771

In [24]:
final_df['text'].values[8]

"Per Electric Telegraph— Copyright, Peb Peess Association ANTARCTIC EXPLORATION. London, August 1. At the Geographical Conference Mr Borchgrevinck read a paper on Antarctic exploration. He recommended Cape Adah* as a base for south polar expedi tions, whence dogs and sledges could reach beyond the gouth magnetic pblfe. He offered to lead a landing party to CapeAdair. He estimated the extent of Victoria Land at eight million square miles. A committee of the conffereQ<s passed a resolution that Antarctic exploration was the greatest piece of ex ploration which still remained to be undertaken, and recommended the scientific societies of the world to take it m hand before the close of the century. Mr Borchgrevinck's paper was favourably discussed. Mr Harmsworth, who is financing Jackson's north polar expedi tion, is willing to contribute to Antarctic research, and he believes that Australia will subscribe largely. The newspapers generally contain complimentary notices on Herr Borch grevinc

### Data wrangling

* Extract individual data items (newspaper ID, date, article ID) from the index column
* Remove index column
* Convert date column to date format
* Add full newspaper names based on newspaper ID
* Reorder columns
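
The first step, splitting the index value into its parts, can be sketched with the standard `re` module applied to one value. `split_index` is a hypothetical helper written for this illustration; it uses the same three patterns as the pandas version below.

```python
import re
from datetime import datetime

def split_index(index_value):
    """Hypothetical helper: split 'CODE_YYYYMMDD_ARTICLEn' into its parts."""
    # Everything before the first underscore
    newspaper_id = re.search(r"([^_]*)", index_value).group(1)
    # The shortest run of characters enclosed by two underscores
    date_text = re.search(r"(?<=\_)(.*?)(?=\_)", index_value).group(1)
    # The last run of digits in the string
    article_no = re.search(r"(\d+)(?!.*\d)", index_value).group(1)
    issue_date = datetime.strptime(date_text, '%Y%m%d').date()
    return newspaper_id, issue_date, int(article_no)

parts = split_index('THD_18950803_ARTICLE10')
print(parts)
```

The negative lookahead in the last pattern is what skips the eight date digits and keeps only the trailing article number.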

The dictionary mapping newspaper codes to newspaper names that is used in the following function was created by:

* Downloading the Papers Past open data csv file: https://natlib.govt.nz/files/paperspast/NLNZ_newspaperData.csv and saving it with the new name 'PP_Codes2Newspaper.csv'
* Using the remove duplicates function in Excel on the newspaper name column to leave only one instance of each paper (79 unique newspapers)
* Deleting the year, region, issues, pages, downloadSize, and link columns
* Deleting the header row
* Deleting the year and file details from the elements in the first column to leave only the newspaper ID code

In [25]:
def wrangled_df(final_df):
    """
    Given the combined final dataframe of Papers Past articles, 
    rename and reorder columns, and add the full newspaper name from
    a given dictionary supplied as a csv file.
    """
    # A dictionary of newspaper codes mapped to newspaper names is loaded
    codes2newspaper = pd.read_csv('PP_Codes2Newspaper.csv', 
                                  header=None, 
                                  dtype={0: str}).set_index(0).squeeze().to_dict()
    
    # Separate features are extracted from the 'index' column
    
    # Extract the letters before the first underscore as Newspaper ID
    final_df['newspaper_id'] = final_df["index"].str.extract(r"([^_]*)") 
    
    # Extract the numbers between the underscores as date
    final_df['date'] = final_df["index"].str.extract(r"(?<=\_)(.*?)(?=\_)") 
    
    # Extract the numeric portion of the article ID
    final_df['article_id'] = final_df["index"].str.extract(r"(\d+)(?!.*\d)") 
    
    # Drop the index column
    final_df.drop('index', inplace=True, axis=1) 
    
    # The Northern Advocate's code is NA, which pandas reads as a missing value (NaN).
    # The NaN key is removed and the correct 'NA' key is then added to the dictionary.
    codes2newspaper = {key: value for key, value in codes2newspaper.items() if pd.notna(key)}
    codes2newspaper['NA'] = 'Northern Advocate' 
    final_df['newspaper'] = final_df['newspaper_id'].map(codes2newspaper)
    
    # The data types of the columns are updated
    final_df['date'] = pd.to_datetime(final_df['date'], format='%Y%m%d') 
    final_df['article_id'] = (final_df['article_id']).astype(int)
    final_df['text'] = (final_df['text']).astype('string')
    final_df['title'] = (final_df['title']).astype('string')
    
    final_df['newspaper_id'] = (final_df['newspaper_id']).astype('string')
    final_df['newspaper'] = (final_df['newspaper']).astype('string')
    
    # Columns are reordered
    new_order = ["date", 
                 "newspaper_id", 
                 "newspaper", 
                 "article_id", 
                 "avg_line_width", 
                 "min_line_width", 
                 "max_line_width", 
                 "line_width_range", 
                 "avg_line_offset", 
                 "max_line_offset", 
                 "min_line_offset", 
                 "title", 
                 "text"]
    clean_df = final_df.reindex(columns = new_order)
    
    return clean_df

In [26]:
clean_df = wrangled_df(final_df)
display(clean_df)

Unnamed: 0,date,newspaper_id,newspaper,article_id,avg_line_width,min_line_width,max_line_width,line_width_range,avg_line_offset,max_line_offset,min_line_offset,title,text
0,1895-08-03,THD,Timaru Herald,1,430.25,50.0,534.0,484.0,34.333333,226.0,4.0,PORT OF TIMARU.,The flagstaff of Timaru is situated m 171deg 1...
1,1895-08-03,THD,Timaru Herald,2,,,,,,,,HIGH WATER AT TIMARU.,
2,1895-08-03,THD,Timaru Herald,3,482.555556,108.0,534.0,426.0,5.0,29.0,1.0,COMMERCIAL.,The grain market continues quiet. Oats have re...
3,1895-08-03,THD,Timaru Herald,4,430.321429,151.0,533.0,382.0,16.035714,124.0,1.0,MAIL NOTICES.,Subject to any necessary alterations mails wil...
4,1895-08-03,THD,Timaru Herald,5,349.444444,109.0,532.0,423.0,30.833333,270.0,0.0,OCEAN MAIL SERVICES.,Via Brindisi and Napies to London. Leave Melbo...
5,1895-08-03,THD,Timaru Herald,6,393.615385,46.0,533.0,487.0,16.846154,28.0,0.0,ABSTRACT OF SALES BY AUCTION.,;[SKE ADVERTISEMENTS.] This Day. By C.F.C. Ass...
6,1895-08-03,THD,Timaru Herald,7,530.449664,438.0,535.0,97.0,8.04698,17.0,1.0,The Timaru Herald.,"SATURDAY, AUGUST 3, 1895. With the exception o..."
7,1895-08-03,THD,Timaru Herald,8,508.458101,56.0,538.0,482.0,6.480447,38.0,0.0,TOWN & COUNTRY.,The Geraldine Farmers' Club meet on Monday eve...
8,1895-08-03,THD,Timaru Herald,9,472.416667,133.0,549.0,416.0,32.791667,265.0,0.0,BRITISH & FOREIGN.,"Per Electric Telegraph— Copyright, Peb Peess A..."
9,1895-08-03,THD,Timaru Herald,10,462.923077,75.0,537.0,462.0,10.820513,211.0,0.0,COMMERCIAL.,Pbr Electric Telegraph— Copyright. Per Press A...


In [27]:
display(clean_df['newspaper'].unique())

<StringArray>
[                         'Timaru Herald',
                        'Lyttelton Times',
                          'Auckland Star',
                        'Southland Times',
                        'Wanganui Herald',
                          'Waikato Times',
                        'Taranaki Herald',
 "Pelorus Guardian and Miners' Advocate.",
                       'Grey River Argus',
                      'Lake Wakatip Mail',
                        'Inangahua Times',
                        'Manawatu Herald',
                          'Otago Witness',
                           'Western Star',
                                  'Press',
                      'New Zealand Times',
                             'Patea Mail',
                     'Ashburton Guardian',
                       'West Coast Times',
                   'Daily Southern Cross']
Length: 20, dtype: string

In [28]:
display(clean_df.groupby(['newspaper'])['newspaper'].count())

newspaper
Ashburton Guardian                         34
Auckland Star                              73
Daily Southern Cross                       22
Grey River Argus                           18
Inangahua Times                            19
Lake Wakatip Mail                          26
Lyttelton Times                           114
Manawatu Herald                            13
New Zealand Times                          63
Otago Witness                             139
Patea Mail                                 16
Pelorus Guardian and Miners' Advocate.     12
Press                                      14
Southland Times                            25
Taranaki Herald                            15
Timaru Herald                              27
Waikato Times                              56
Wanganui Herald                            47
West Coast Times                           20
Western Star                               18
Name: newspaper, dtype: int64

In [29]:
# Check data types of the columns
clean_df.dtypes

date                datetime64[ns]
newspaper_id                string
newspaper                   string
article_id                   int32
avg_line_width             float64
min_line_width             float64
max_line_width             float64
line_width_range           float64
avg_line_offset            float64
max_line_offset            float64
min_line_offset            float64
title                       string
text                        string
dtype: object

## Save dataframe for later use

In [30]:
# Save dataframe for later use 
# https://stackoverflow.com/questions/17098654/how-to-reversibly-store-and-load-a-pandas-dataframe-to-from-disk

# -----------------------------------------------------
# Uncomment below to use date and time for filename

# time_now = datetime.now()
# file_date = time_now.strftime("%Y%m%d_%H%M%S")
# clean_df.to_pickle(f"{file_date}_PP_df.pkl")


# -----------------------------------------------------
# Use a custom filename (active by default; comment out if using the option above)

pkl_filename = '20220105_PP_25Issues_df' # Change filename here
clean_df.to_pickle(f"{pkl_filename}.pkl")