# Papers Past Genre Classification
# End-to-End Notebook: Poetry
---

## Logistic Regression (excluding TF-IDF features)

* This notebook processes the raw Papers Past METS/ALTO XML files (saved in tar.gz format by newspaper and year) to return a Pandas dataframe of articles sorted by the probability that they are poetry. 
* The final dataframe is exported as a csv file with links to the online Papers Past newspaper issue for each article. This allows you to view the scanned image of the original article.   
* The specified number of newspaper issues is randomly selected, with the option to set a seed for reproducibility. 
* This version of the notebook excludes TF-IDF feature extraction, which is computationally expensive and may result in out-of-memory errors for larger sample sizes. 
* The model is less precise than the version that incorporates TF-IDF features, so more false positives can be expected in the results. However, it still performs well in terms of recall: most of the poetry in the sample should appear near the top of the final dataframe when it is sorted by probability.
* Note that the sampling and feature extraction processes can take quite a while to run. For example, a sample of 300 newspaper issues could take up to an hour on an average laptop and 6,000 issues (about 2% of the entire dataset) could take 20+ hours. You should change your computer's settings to prevent it going to sleep or powering off while the notebook is running.  
* Dataframes are saved in pickle format following sampling and feature extraction so that if an error occurs or the notebook doesn't run all the way through for some reason, the saved dataframe can be loaded and you can restart the process from that point. These points in the notebook are highlighted in orange. 
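
The checkpointing idea in the last bullet can be sketched as a small "load if saved, otherwise compute and save" helper. This is only an illustration: `load_or_compute`, `expensive_step` and the file name are hypothetical and are not part of the notebook, which saves and reloads its dataframes in explicit cells instead.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Return the object pickled at `path`; if absent, compute, save and return it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []  # records how many times the slow step really ran


def expensive_step():
    # Stand-in for sampling or feature extraction (hypothetical).
    calls.append(1)
    return {"articles": ["A1", "A2"]}


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "example_SAMPLE.pkl"
    first = load_or_compute(checkpoint, expensive_step)   # computes and saves
    second = load_or_compute(checkpoint, expensive_step)  # loads from disk
```

Only load pickle files that you created yourself, because unpickling can execute arbitrary code.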

In [1]:
# Import necessary libraries

import re
import os
import statistics
import tarfile
import random
import numpy as np
import pandas as pd
import xml.etree.ElementTree as ET
import pickle
import time
from datetime import date
from datetime import datetime

import spacy
import math
import textstat
import textfeatures as tf

from collections import Counter

# Features
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Classifiers
from sklearn.linear_model import LogisticRegression


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Owner\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# This cell defines the METS namespace; there is no need to change anything
# http://www.loc.gov/standards/mets/namespace.html

NS = {'mets':'http://www.loc.gov/METS/'}

<div class="alert alert-success">
    <h3>Set the variables in the cell below before running all cells</h3>
    <ul>
        <li>Select a random seed for reproducibility.</li>
        <li>Provide the filepath for the newspaper-year tar.gz files.</li>
        <li>Select the number of issues to sample.</li>
        <li>Provide a filename for saving dataframes and exporting the final csv file.</li>
    </ul>
</div>

In [49]:
# Enter a random seed for reproducibility of results
random.seed(a=6)

# Set the top level directory to sample from
dir_name = 'E:/PapersPast_OpenData'

# Set the number of newspaper issues to be sampled
num_issues = 6000

# Filename for saved files 
# Do not include filetype suffix such as .csv - this is added as part of the code
export_filename = "20220219_NewPoetry_6000issues_seed6_df_exclTFIDF"

### Load the saved model and feature set

* The following cells load the saved model and the relevant feature set. 
* Make sure that the model file is saved in the same location as this notebook. 

In [4]:
# Filename of the saved scikit-learn model (should not need to be changed)
filename = 'pp_logistic_reg_poetry_noTFIDF.sav'

In [5]:
# Feature set (should not need to be changed)
# 'all_features_excl_penn' set from binary trials notebook (excl TFIDF)
features = ["propn_freq", 
            "verb_freq", 
            "noun_freq", 
            "adj_freq",
            "nums_freq", 
            "pron_freq", 
            "stopword_freq", 
            "avg_line_offset", 
            "max_line_offset", 
            "avg_line_width", 
            "min_line_width", 
            "max_line_width", 
            "line_width_range", 
            "polysyll_freq", 
            "monosyll_freq", 
            "sentence_count", 
            "word_count", 
            "avg_word_length", 
            "char_count"
           ]

### Data extraction and page layout features

The following cells extract the article details and page layout features from the tar.gz files and return a Pandas dataframe. 

In [6]:
def process_tarball(filepath):
    """
    Given path to tarball, open and return dictionary containing 
    article codes as keys and texts (as list of strings for each block) as values.
    """
    newspaper_year = tarfile.open(filepath, mode='r')

    # Next, return the members of the archive as a list of TarInfo objects. 
    # The list has the same order as the members in the archive.
    # https://docs.python.org/3/library/tarfile.html
    files = newspaper_year.getmembers() 

    issues = collect_issues(files)
    selected_issue = select_random_issue(issues)
    mets_tarinfo = issues[selected_issue][-1]
    pages_tarinfo = issues[selected_issue][0:-1]
    article_codes = mets2codes(mets_tarinfo, newspaper_year)
    articles = codes2texts(article_codes, pages_tarinfo, newspaper_year, selected_issue)

    return articles

In [7]:
def collect_issues(files):
    """
    Given list of files in tarball, return a dictionary keyed
    by the issue code with list of xml files of form [0001.xml, ..., mets.xml]
    as values.
    """
    issues = {}
    issue_code = ''

    for file in files:
        match = re.search(r"[A-Z]*_\d{8}$", file.name)  # raw string avoids an invalid-escape warning
        if match:
            issue_code = match.group(0)
        if file.name.endswith('.xml'):
            xml_list = issues.get(issue_code, [])
            xml_list.append(file)
            issues[issue_code] = xml_list

    return issues
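
To make the grouping logic concrete, here is a small sketch that runs the same regular expression over made-up archive member names. The directory layout and the `SimpleNamespace` stand-ins for `TarInfo` objects are assumptions for illustration; the real archives may be organised differently.

```python
import re
from types import SimpleNamespace

# Hypothetical member names in archive order: each issue directory is
# listed before the XML files it contains.
names = [
    "WSTAR_1891",
    "WSTAR_1891/WSTAR_18911223",
    "WSTAR_1891/WSTAR_18911223/0001.xml",
    "WSTAR_1891/WSTAR_18911223/0002.xml",
    "WSTAR_1891/WSTAR_18911223/mets.xml",
    "WSTAR_1891/WSTAR_18911226",
    "WSTAR_1891/WSTAR_18911226/0001.xml",
    "WSTAR_1891/WSTAR_18911226/mets.xml",
]
members = [SimpleNamespace(name=n) for n in names]  # stand-ins for TarInfo

issue_pattern = re.compile(r"[A-Z]*_\d{8}$")
issues = {}
issue_code = ""
for member in members:
    match = issue_pattern.search(member.name)
    if match:
        # A name ending in CODE_YYYYMMDD starts a new issue.
        issue_code = match.group(0)
    if member.name.endswith(".xml"):
        issues.setdefault(issue_code, []).append(member.name)
```

Under this layout `mets.xml` is the last entry for each issue, which is what `process_tarball` relies on when it takes `issues[selected_issue][-1]` as the METS file.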

In [8]:
def select_random_issue(issues):
    """
    Select a random issue from a dictionary of all of a newspaper's issues for one year 
    (with the issue code as the key and the list of XML files as the value).
    """
    selected_issue = random.choice(list(issues))

    return selected_issue 

In [9]:
def mets2codes(mets_tarinfo, newspaper_year):
    """
    Given mets as tarinfo, return text block codes for articles
    contained in mets file. Edited for processing with tarfile
    object newspaper_year.

    Returns dictionary of article codes as keys,
    with a 2-tuple containing the article title
    and a list of corresponding text block codes as values.
    """
    with newspaper_year.extractfile(mets_tarinfo) as file:
        text = file.read()

    art_dict = mets2codes_inner(text, newspaper_year)

    return art_dict

In [10]:
def mets2codes_inner(text, newspaper_year):
    """
    Given a METS file as a text string, return a dictionary of
    articles with article codes as keys and, as values, tuples containing
    the corresponding article title and a list of text block IDs from the
    corresponding ALTO files.
    """
    mets_root = ET.fromstring(text) # Loads the mets xml file (which comes into the function as a string)
    logical_structure = mets_root.find("./mets:structMap[@LABEL='Logical Structure']", NS) # Finds the "logical structure" part of the file, which lists all the articles and the blocks they contain.
    articles_div = logical_structure.findall(".//mets:div[@TYPE='ARTICLE']", NS) # This returns all of the "div" elements in the logical structure part of the xml that have the attribute "TYPE='ARTICLE'". This is where we lose the advertisements.

    art_dict = {} # This is an empty dictionary which will collect what we need from the mets file. It will have article IDs as keys and (title, list of text block IDs) tuples as values.
    for article in articles_div:
        
        attributes = article.attrib
        article_id = attributes['DMDID']
        article_title = attributes.get('LABEL', 'UNTITLED')

        text_blocks = article.findall(".//mets:div[@TYPE='TEXT']", NS)
        block_ids = []
        for block in text_blocks:
            try:
                areas = block.findall(".//mets:area", NS)
                for area in areas:
                    block_id = area.attrib['BEGIN']
                    block_ids.append(block_id)
            except AttributeError:
                print(f'Error in {newspaper_year}')
        
        art_dict[article_id] = (article_title, block_ids)

    mets_root.clear() # When processing lots of these, we want to free up memory.

    # print(art_dict)
    return art_dict
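
The lookup performed by `mets2codes_inner` can be tried on a tiny, made-up METS fragment. The XML below is heavily trimmed and invented for illustration (real Papers Past METS files contain far more structure), but it uses the same namespace, attributes and XPath expressions as the function above.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Made-up METS fragment: one article with two text blocks, one advertisement.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap LABEL="Logical Structure">
    <mets:div TYPE="NEWSPAPER">
      <mets:div TYPE="ARTICLE" DMDID="MODSMD_ARTICLE1" LABEL="Shipping.">
        <mets:div TYPE="TEXT">
          <mets:fptr><mets:area BEGIN="P1_TB00001"/></mets:fptr>
        </mets:div>
        <mets:div TYPE="TEXT">
          <mets:fptr><mets:area BEGIN="P1_TB00002"/></mets:fptr>
        </mets:div>
      </mets:div>
      <mets:div TYPE="ADVERTISEMENT" DMDID="MODSMD_ADVERT1">
        <mets:div TYPE="TEXT">
          <mets:fptr><mets:area BEGIN="P2_TB00001"/></mets:fptr>
        </mets:div>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""

root = ET.fromstring(SAMPLE_METS.strip())
logical = root.find("./mets:structMap[@LABEL='Logical Structure']", NS)

art_dict = {}
for article in logical.findall(".//mets:div[@TYPE='ARTICLE']", NS):
    block_ids = []
    for block in article.findall(".//mets:div[@TYPE='TEXT']", NS):
        for area in block.findall(".//mets:area", NS):
            block_ids.append(area.attrib["BEGIN"])
    art_dict[article.attrib["DMDID"]] = (article.get("LABEL", "UNTITLED"), block_ids)
```

The advertisement is skipped because only `div` elements with `TYPE='ARTICLE'` are selected.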

In [11]:
def codes2texts(article_codes, pages_tarinfo, newspaper_year, selected_issue):
    """
    Given article codes, the issue pages as tar info objects, 
    the newspaper year and the issue code, return a dictionary
    with article codes as keys and a list of text blocks as
    strings as values.
    """
    page_roots = parse_pages_tar(pages_tarinfo, newspaper_year)
    # page_roots is a dictionary with page numbers (of form 'P1'
    # etc...) as keys and the XML roots of the pages as values.

    texts_dict = codes2texts_inner(article_codes, page_roots, selected_issue)

    # Clear roots.
    for i in range(len(page_roots)):
        k, v = page_roots.popitem()
        v.clear()

    return texts_dict

In [12]:
def parse_pages_tar(pages, newspaper_year):
    """
    Given an iterable of TarInfo objects for the page files and the
    open tarfile, return a dictionary with 'P1', 'P2', etc as keys,
    and the root element of each page as values.
    """
    page_roots = {}
    for i, page in enumerate(pages):
        with newspaper_year.extractfile(page) as f:
            text = f.read()
        root = ET.fromstring(text)
        page_roots[f'P{i+1}'] = root

    return page_roots

In [13]:
def codes2texts_inner(article_codes, page_roots, selected_issue):
    """
    Given XML roots of ALTO pages and collection of article codes
    and corresponding blocks, return a dictionary with article codes
    as keys and a list of text blocks as strings as values.
    """
    texts_dict = {}  
    
    for article_id in article_codes.keys():
        title, blocks = article_codes[article_id]
        text = []
        line_widths = []
        line_hpos = []
        line_offsets = []
        
        for block in blocks:

            # The block ids have page numbers as part. We collect the page number.
            end_loc = block.find('_')
            page_no = block[0:end_loc]

            # Collect the relevant page (the alto file) for the block.
            page_root = page_roots[page_no]

            # Collect the specific block from the page and identify the desired elements 
            # (strings, lines, horizontal position etc.)
            xml_block = page_root.find(f".//TextBlock[@ID='{block}']")

            block_strings = xml_block.findall('.//String')
            block_lines = xml_block.findall('.//TextLine')
            block_hpos = int(xml_block.get("HPOS"))

            # Collect the information we want from the blocks.
            block_as_string = process_block(block_strings)
            block_line_widths = process_block_lines(block_lines)
            block_line_hpos = process_lines_hpos(block_lines)
        
            block_line_offsets = [hpos - block_hpos for hpos in block_line_hpos]
            
            text.append(block_as_string)
            line_widths.extend(block_line_widths)
            line_hpos.extend(block_line_hpos)
            line_offsets.extend(block_line_offsets)

        text = ' '.join(text)
        issue_article_id = selected_issue + '_' + article_id[7:]
        # texts_dict[issue_article_id] = (title, text, line_widths, line_hpos)
        texts_dict[issue_article_id] = (title, text, line_widths, line_hpos, line_offsets)

    
    return texts_dict
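
The layout features come from the `HPOS` and `WIDTH` attributes of each `TextLine`. The sketch below applies the same steps to a made-up ALTO fragment; the element names match those queried above, but the sample values and the absence of an XML namespace are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO fragment: one text block whose second line is indented.
SAMPLE_PAGE = """
<alto>
  <Layout><Page><PrintSpace>
    <TextBlock ID="P1_TB00001" HPOS="100" WIDTH="500">
      <TextLine HPOS="100" WIDTH="480">
        <String CONTENT="First"/><String CONTENT="line"/>
      </TextLine>
      <TextLine HPOS="140" WIDTH="300">
        <String CONTENT="indented"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>
"""

page_root = ET.fromstring(SAMPLE_PAGE.strip())

block_id = "P1_TB00001"
page_no = block_id[:block_id.find("_")]  # page key is the part before the underscore

xml_block = page_root.find(f".//TextBlock[@ID='{block_id}']")
block_hpos = int(xml_block.get("HPOS"))
lines = xml_block.findall(".//TextLine")

line_widths = [int(line.attrib["WIDTH"]) for line in lines]
# Offset = how far each line starts to the right of its block's left edge.
line_offsets = [int(line.attrib["HPOS"]) - block_hpos for line in lines]
text = " ".join(s.attrib["CONTENT"] for s in xml_block.findall(".//String"))
```

Indented or centred lines, which are common in verse, show up as larger offsets and a wider range of line widths.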

In [14]:
def process_block(block_strings):
    """
    Given xml String elements from text block, return whole block
    as single string.
    """
    words = []
    for s in block_strings:
        words.append(s.attrib['CONTENT'])
    total_string = ' '.join(words)

    return total_string

In [15]:
def process_block_lines(block_lines):
    """
    Given xml TextLine elements from text block, return a list of the widths.
    """
    line_w = []
    
    for line in block_lines:
        line_w.append(int(line.attrib['WIDTH']))

    return line_w

In [16]:
def process_lines_hpos(block_lines):
    """
    Given xml TextLine elements from text block, return a list of the
    horizontal starting position of each line.
    """
    line_hpos = []
    
    for line in block_lines:
        line_hpos.append(int(line.attrib['HPOS']))

    return line_hpos

In [17]:
def process_and_collect(filepath):
    """
    Return dataframe for the selected newspaper/year.
    """
    # print(f'Processing {path}')
    try:
        articles = process_tarball(filepath)
        dataframe = pd.DataFrame.from_dict(
            articles,
            orient='index',
            dtype = object,
            # columns=['title', 'text', 'line_widths', 'line_hpos']
            columns=['title', 'text', 'line_widths', 'line_hpos', 'line_offsets']
            )
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print(f'Problem with {filepath}')
        dataframe = None
        
    return dataframe

In [18]:
# Code source: https://thispointer.com/python-how-to-get-list-of-files-in-directory-and-sub-directories/

def get_files(dir_name):
    """
    For the given path, get a list of all files in the directory tree
    """
    # Create a list of files and sub directories 
    files_dir = os.listdir(dir_name)
    file_list = list()
    
    # Iterate over all the entries
    for item in files_dir:
        
        # Create full path
        full_path = os.path.join(dir_name, item)
        
        # If entry is a directory then get the list of files in this directory 
        if os.path.isdir(full_path):
            file_list = file_list + get_files(full_path)  # recurse into the subdirectory, not the parent
        else:
            file_list.append(full_path)
                
    return file_list
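
As an alternative, the same recursive listing can be sketched with the standard library's `os.walk`, which handles the descent into subdirectories itself. `list_files` is a hypothetical name and the temporary files exist only to demonstrate the behaviour.

```python
import os
import tempfile


def list_files(top):
    """Return the full path of every file under `top`, including subdirectories."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny directory tree with one file at the top and one nested.
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("LT_1890.tar.gz", os.path.join("sub", "LT_1891.tar.gz")):
        open(os.path.join(tmp, rel), "w").close()
    found = [os.path.relpath(p, tmp) for p in list_files(tmp)]
```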


In [19]:
def select_and_create(dir_name, num_issues):
    """
    Randomly select a given number of issues
    and return a dataframe.
    """

    file_list = get_files(dir_name)
    file_paths = []
    single_dfs = [] # A list of the dataframes created for each newspaper issue

    for random_selection in range(0, num_issues):
        
        selected_tar = random.choice(file_list)
        file_paths.append(selected_tar)  
    
    for filepath in file_paths:
        single_df = process_and_collect(filepath)
        single_dfs.append(single_df)
    
    final_df = pd.concat(single_dfs, axis = 0)
    final_df.reset_index(drop=False, inplace=True, col_level=0)
        
    return final_df
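
Because each draw uses `random.choice` on the full file list, sampling is with replacement: the same newspaper-year archive can be picked many times, which is why the same file can appear repeatedly in the log output and why duplicate articles are dropped later. A minimal sketch of this behaviour, using hypothetical file names and a local `random.Random` generator rather than the notebook's global `random.seed`:

```python
import random
from collections import Counter

# Hypothetical archive names standing in for the real file list.
file_list = ["LT_1890.tar.gz", "LT_1891.tar.gz", "ST_1899.tar.gz"]


def draw(seed, num_issues):
    """Pick `num_issues` archives with replacement using a seeded generator."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    return [rng.choice(file_list) for _ in range(num_issues)]


first = draw(6, 10)
second = draw(6, 10)     # same seed, so the same sequence of picks
repeats = Counter(first)  # how often each archive was chosen
```

With 10 draws over 3 archives, at least one archive must be chosen four or more times.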

In [20]:
def produce_df(dir_name, num_issues):
    """
    Given a directory for the Papers Past 
    open data (with newspaper-year files 
    in tar.gz format) and a number of issues 
    to randomly select, return a dataframe 
    of articles and features.
    """
                
    final_df = select_and_create(dir_name, num_issues)
    
    # Calculate features from the lists of line widths and positions
    final_df['avg_line_width'] = pd.DataFrame(final_df['line_widths'].values.tolist()).mean(1)
    final_df['max_line_width'] = pd.DataFrame(final_df['line_widths'].values.tolist()).max(1)
    final_df['min_line_width'] = pd.DataFrame(final_df['line_widths'].values.tolist()).min(1)
    final_df['line_width_range'] = final_df['max_line_width'] - final_df['min_line_width']
    
    # Line offsets relate to the difference between the starting horizontal position of each line compared to the block
    final_df['avg_line_offset'] = pd.DataFrame(final_df['line_offsets'].values.tolist()).mean(1)
    final_df['max_line_offset'] = pd.DataFrame(final_df['line_offsets'].values.tolist()).max(1)
    final_df['min_line_offset'] = pd.DataFrame(final_df['line_offsets'].values.tolist()).min(1)
       
    return final_df
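
For a single article, the layout features above reduce to simple summary statistics over its lists of line widths and offsets. The plain-Python sketch below uses invented numbers to show what each column holds; the notebook itself computes them row-wise with pandas, where shorter lists are padded with missing values that `mean`, `max` and `min` skip by default.

```python
from statistics import mean

# Invented measurements for one article (pixel units as in the ALTO files).
line_widths = [496, 480, 71, 490]
line_offsets = [0, 12, 171, 4]

layout_features = {
    "avg_line_width": mean(line_widths),
    "max_line_width": max(line_widths),
    "min_line_width": min(line_widths),
    "line_width_range": max(line_widths) - min(line_widths),
    "avg_line_offset": mean(line_offsets),
    "max_line_offset": max(line_offsets),
    "min_line_offset": min(line_offsets),
}
```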

In [21]:
# Create the final dataframe and measure time to load
t1 = time.perf_counter()

final_df = produce_df(dir_name, num_issues)

t2 = time.perf_counter()

print(f"Returned dataframe in {t2 - t1:0.4f} seconds")

Problem with E:/PapersPast_OpenData\LT_1891.tar.gz
Problem with E:/PapersPast_OpenData\LT_1891.tar.gz
Problem with E:/PapersPast_OpenData\LT_1890.tar.gz
Problem with E:/PapersPast_OpenData\LT_1891.tar.gz
Problem with E:/PapersPast_OpenData\LT_1890.tar.gz
Problem with E:/PapersPast_OpenData\LT_1890.tar.gz
Problem with E:/PapersPast_OpenData\LT_1891.tar.gz
Problem with E:/PapersPast_OpenData\LT_1890.tar.gz
Returned dataframe in 62180.5564 seconds


In [22]:
len(final_df)

159417

In [23]:
# Check for and remove any duplicate articles

final_df = final_df.drop_duplicates(subset='index', keep="first")

In [24]:
def wrangled_df(final_df):
    """
    Given the combined final dataframe of Papers Past articles, 
    rename and reorder columns, and add the full newspaper name from
    a given dictionary supplied as a csv file.
    """
    # A dictionary of newspaper codes mapped to newspaper name and region is loaded
    codes2newspaper = pd.read_csv('PP_Codes2Newspaper.csv', header=None, dtype={0: str}).set_index(0).squeeze().to_dict()
    
    # Separate features are extracted from the 'index' column
    final_df['newspaper_id'] = final_df["index"].str.extract(r"([^_]*)") # Extract the letters before the first underscore as Newspaper ID
    final_df['date'] = final_df["index"].str.extract(r"(?<=\_)(.*?)(?=\_)") # Extract the numbers between the underscores as date
    final_df['article_id'] = final_df["index"].str.extract(r"(\d+)(?!.*\d)") # Extract the numeric portion of the article ID
    final_df.drop('index', inplace=True, axis=1) # Drop the index column
    
    # The Northern Advocate's code is NA so it comes through as nan. This is deleted and then replaced correctly in the dictionary
    codes2newspaper = {key: value for key, value in codes2newspaper.items() if pd.notna(key)}
    codes2newspaper['NA'] = 'Northern Advocate' 
    final_df['newspaper'] = final_df['newspaper_id'].map(codes2newspaper)
    
    # The data types of the columns are updated
    final_df['date'] = pd.to_datetime(final_df['date'], format='%Y%m%d') 
    final_df['article_id'] = (final_df['article_id']).astype(int)
    final_df['text'] = (final_df['text']).astype('string')
    final_df['title'] = (final_df['title']).astype('string')
    
    final_df['newspaper_id'] = (final_df['newspaper_id']).astype('string')
    final_df['newspaper'] = (final_df['newspaper']).astype('string')
    
    # Columns are reordered
    new_order = ["date", "newspaper_id", "newspaper", "article_id", 
                 "avg_line_width", "min_line_width", "max_line_width", 
                 "line_width_range", "avg_line_offset", "max_line_offset", 
                 "min_line_offset", "title", "text"]
    clean_df = final_df.reindex(columns = new_order)
    
    return clean_df
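
The three regular expressions used on the `index` column can be checked on a single value with the standard `re` module. The sample identifier below is an assumption about the identifier format, inferred from how `codes2texts_inner` builds it (issue code, underscore, then the article ID with its first seven characters removed).

```python
import re
from datetime import datetime

index_value = "WSTAR_18911223_ARTICLE12"  # assumed format, for illustration

# Everything before the first underscore is the newspaper code.
newspaper_id = re.search(r"([^_]*)", index_value).group(1)

# The text between the first two underscores is the issue date.
date_text = re.search(r"(?<=_)(.*?)(?=_)", index_value).group(1)

# The last run of digits in the string is the article number.
article_id = int(re.search(r"(\d+)(?!.*\d)", index_value).group(1))

issue_date = datetime.strptime(date_text, "%Y%m%d").date()
```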

In [25]:
clean_df = wrangled_df(final_df)
display(clean_df)

Unnamed: 0,date,newspaper_id,newspaper,article_id,avg_line_width,min_line_width,max_line_width,line_width_range,avg_line_offset,max_line_offset,min_line_offset,title,text
0,1891-12-23,WSTAR,Western Star,1,473.923529,71.0,496.0,425.0,5.252941,171.0,0.0,A Fool and A Woman.,Bayminster was in a flutter of excite ment. Ch...
1,1891-12-23,WSTAR,Western Star,2,325.100000,80.0,500.0,420.0,53.100000,216.0,4.0,Shipping.,THE WEATHER. Saturday—Showery. Sunday—Fair. Mo...
2,1891-12-23,WSTAR,Western Star,3,410.500000,214.0,503.0,289.0,29.500000,40.0,0.0,MARRIAGE.,"Haywood—Hitchcock.—On Dec. 16th, at the Manse,..."
3,1891-12-23,WSTAR,Western Star,4,444.000000,367.0,501.0,134.0,26.000000,39.0,1.0,DEATH.,"CAMPDKM,. —At Ermdaloon the 2flth Dec., Isabel..."
4,1891-12-23,WSTAR,Western Star,5,496.663462,253.0,525.0,272.0,19.788462,30.0,0.0,The Western Star. (PUBLISHED BI-WEEKLY.) WEDNE...,"Thio whirligig of time rolls oh, am once again..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
159412,1899-02-14,ST,Southland Times,19,528.000000,76.0,565.0,489.0,22.809859,258.0,-1.0,ALLEGED INDECENT ASSAULT.,William James Staunton was charged with having...
159413,1899-02-14,ST,Southland Times,20,519.036364,114.0,563.0,449.0,41.236364,335.0,25.0,Education Board Election.,"TO THE EDITOR. Sib, — Mr I. W. Raymond, in hie..."
159414,1899-02-14,ST,Southland Times,21,509.117647,193.0,551.0,358.0,29.264706,238.0,12.0,The Gore Prohibition Meeting.,"TO THE EDITOR. Sib, — In your issue of 7th ins..."
159415,1899-02-14,ST,Southland Times,22,530.781250,230.0,565.0,335.0,14.375000,43.0,-1.0,Just a Minute Please!,"Tired men, whether suffering from phy sical or..."


In [51]:
clean_df.to_pickle(f"{export_filename}_SAMPLE.pkl")

<div class="alert alert-warning">
    <h3>Reload saved dataframe (if required)</h3>
    <p>The previously saved dataframe of sample results and article layout features can be reloaded by uncommenting and running the following cell.</p>
</div>

In [None]:
# clean_df = pd.read_pickle(f"{export_filename}_SAMPLE.pkl")

### Feature extraction: linguistic features and text statistics

The following cells extract parts-of-speech and text statistic features and add them to the dataframe.

In [26]:
nlp = spacy.load('en_core_web_lg')

In [27]:
def cleaner(df, column_name):
    """
    Remove unnecessary symbols to create a clean text column from the original dataframe column using a regex.
    """
    # A column of sentence count is added to the dataframe before punctuation is removed.
    df['sentence_count'] = df[column_name].apply(lambda x: textstat.sentence_count(x))

    # Regex pattern for only alphanumeric, hyphenated text
    pattern = re.compile(r"[A-Za-z0-9\-]{1,50}")
    df['clean_text'] = df[column_name].str.findall(pattern).str.join(' ')
    
    return df
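
The effect of the cleaning pattern is easiest to see on a single string. This sketch applies the same regular expression with `re.findall` to an invented sentence: runs of letters, digits and hyphens are kept, and everything else is replaced by a single space between tokens.

```python
import re

pattern = re.compile(r"[A-Za-z0-9\-]{1,50}")

raw = "Well-known poem, 'Spring'; 2nd ed. (1891)"  # invented example text
clean = " ".join(pattern.findall(raw))
```

Sentence counts are taken before this step because the full stops that mark sentence boundaries are removed here.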

In [28]:
clean_df = cleaner(clean_df, 'text')

In [29]:
def count_propn_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: proper nouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_propn = 0
    # propn_list = []

    for token in doc:
        if token.pos_ == 'PROPN':
            count_propn += 1
        
    return count_propn 


def count_verb_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: verbs.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_verb = 0
    # verb_list = []

    for token in doc:
        if token.pos_ == 'VERB':
            count_verb += 1

            # verb_list.append(token)
    # print(verb_list)

    return count_verb


def count_noun_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: nouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_noun = 0
    # noun_list = []

    for token in doc:
        if token.pos_ == 'NOUN':
            count_noun += 1

            # noun_list.append(token)
    # print(noun_list)
        
    return count_noun


def count_adj_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: adjectives.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_adj = 0
    # adj_list = []

    for token in doc:
        if token.pos_ == 'ADJ':
            count_adj += 1

            # adj_list.append(token)
    # print(adj_list)
        
    return count_adj


def count_nums_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: numbers.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """ 
    count_nums = 0
    # nums_list = []

    for token in doc:
        if token.pos_ == 'NUM':
            count_nums += 1

            # nums_list.append(token)
    # print(nums_list)
        
    return count_nums


def count_pron_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: pronouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_pron = 0
    # pron_list = []
    
    for token in doc:
        if token.pos_ == 'PRON':
            count_pron += 1

            # pron_list.append(token)
    # print(pron_list)
        
    return count_pron


def count_nnps_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: plural proper nouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_nnps = 0
    # nnps_list = []

    for token in doc:
        if token.tag_ == 'NNPS':
            count_nnps += 1
        
    return count_nnps


def count_vb_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: base form verbs.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_vb = 0
    # vb_list = []

    for token in doc:
        if token.tag_ == 'VB':
            count_vb += 1

            # vb_list.append(token)
    # print(vb_list)

    return count_vb


def count_nn_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: singular or mass nouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_nn = 0
    # nn_list = []

    for token in doc:
        if token.tag_ == 'NN':
            count_nn += 1

            # nn_list.append(token)
    # print(nn_list)
        
    return count_nn


def count_jj_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: adjectives.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_jj = 0
    # jj_list = []

    for token in doc:
        if token.tag_ == 'JJ':
            count_jj += 1

            # jj_list.append(token)
    # print(jj_list)
        
    return count_jj


def count_cd_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: cardinal numbers.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """ 
    count_cd = 0
    # cd_list = []

    for token in doc:
        if token.tag_ == 'CD':
            count_cd += 1

            # cd_list.append(token)
    # print(cd_list)
        
    return count_cd


def count_prp_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: personal pronouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_prp = 0
    # prp_list = []
    
    for token in doc:
        if token.tag_ == 'PRP':
            count_prp += 1

            # prp_list.append(token)
    # print(prp_list)
        
    return count_prp


def count_rb_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: adverbs.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_rb = 0
    # rb_list = []
    
    for token in doc:
        if token.tag_ == 'RB':
            count_rb += 1

            # rb_list.append(token)
    # print(rb_list)
        
    return count_rb


def count_cc_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: coordinating conjunctions.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_cc = 0
    # cc_list = []
    
    for token in doc:
        if token.tag_ == 'CC':
            count_cc += 1

            # cc_list.append(token)
    # print(cc_list)
        
    return count_cc


def count_nnp_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: singular proper nouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_nnp = 0
    # nnp_list = []
    
    for token in doc:
        if token.tag_ == 'NNP':
            count_nnp += 1

            # nnp_list.append(token)
    # print(nnp_list)
        
    return count_nnp


def count_vbd_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: past tense verbs.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_vbd = 0
    # vbd_list = []
    
    for token in doc:
        if token.tag_ == 'VBD':
            count_vbd += 1

            # vbd_list.append(token)
    # print(vbd_list)
        
    return count_vbd


def count_vbz_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: third-person singular present verbs.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_vbz = 0
    # vbz_list = []
    
    for token in doc:
        if token.tag_ == 'VBZ':
            count_vbz += 1

            # vbz_list.append(token)
    # print(vbz_list)
        
    return count_vbz
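
The counting functions above all follow the same pattern, so the tallies could also be gathered in one pass with `collections.Counter`. This is a sketch of that alternative, not the notebook's method: `count_tags` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for spaCy tokens so the example runs without loading a language model.

```python
from collections import Counter
from types import SimpleNamespace


def count_tags(doc):
    """Tally coarse (pos_) and fine-grained (tag_) labels for any iterable of tokens."""
    pos_counts = Counter(token.pos_ for token in doc)
    tag_counts = Counter(token.tag_ for token in doc)
    return pos_counts, tag_counts


# Stand-ins for spaCy tokens: only the two attributes used above are provided.
fake_doc = [
    SimpleNamespace(pos_="PROPN", tag_="NNP"),
    SimpleNamespace(pos_="VERB", tag_="VBD"),
    SimpleNamespace(pos_="NOUN", tag_="NN"),
    SimpleNamespace(pos_="NOUN", tag_="NNS"),
]

pos_counts, tag_counts = count_tags(fake_doc)
noun_count = pos_counts["NOUN"]  # a Counter returns 0 for labels never seen
adj_count = pos_counts["ADJ"]
```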

In [30]:
def text_features_pipe(text_col, df):
    """
    Process given text column of a dataframe to 
    extract linguistic features and add them to
    the dataframe. Return the updated dataframe.
    """
    
    input_col = df[text_col]  
    
    propn_count = []
    verb_count = []
    noun_count = []
    adj_count = []
    nums_count = []
    pron_count = []
    
    nnps_count = []
    vb_count = []
    nn_count = []
    jj_count = []
    cd_count = []
    prp_count = []
    rb_count = []
    cc_count = []
    nnp_count = []
    vbd_count = []
    vbz_count = []
    
    # spaCy processing pipeline
    nlp_text_pipe = nlp.pipe(input_col, batch_size=20)
    
    for doc in nlp_text_pipe:
        
        # POS tags
        # Universal POS Tags
        # http://universaldependencies.org/u/pos/
        
        # Count proper nouns (PROPN)
        propn_count.append(count_propn_spacy(doc))

        # Count verbs (VERB)
        verb_count.append(count_verb_spacy(doc))

        # Count nouns (NOUN)
        noun_count.append(count_noun_spacy(doc))

        # Count adjectives (ADJ)
        adj_count.append(count_adj_spacy(doc))

        # Count numbers (NUM)
        nums_count.append(count_nums_spacy(doc))

        # Count pronouns (PRON)
        pron_count.append(count_pron_spacy(doc))

        # POS tags (English)
        # OntoNotes 5 / Penn Treebank
        # https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

        # Count plural proper nouns (NNPS)
        nnps_count.append(count_nnps_spacy(doc))

        # Count base form verbs (VB)
        vb_count.append(count_vb_spacy(doc))

        # Count singular or mass nouns (NN)
        nn_count.append(count_nn_spacy(doc))

        # Count adjectives (JJ)
        jj_count.append(count_jj_spacy(doc))

        # Count cardinal numbers (CD)
        cd_count.append(count_cd_spacy(doc))

        # Count personal pronouns (PRP)
        prp_count.append(count_prp_spacy(doc))

        # Count adverbs (RB)
        rb_count.append(count_rb_spacy(doc))

        # Count coordinating conjunctions (CC)
        cc_count.append(count_cc_spacy(doc))

        # Count singular proper nouns (NNP)
        nnp_count.append(count_nnp_spacy(doc))

        # Count past tense verbs (VBD)
        vbd_count.append(count_vbd_spacy(doc))

        # Count third-person singular present verbs (VBZ)
        vbz_count.append(count_vbz_spacy(doc))
        
    # Add features using the textstat library to the dataframe
    # https://pypi.org/project/textstat/
    df['word_count'] = input_col.apply(lambda x: textstat.lexicon_count(x, removepunct=True)) 
    df['syll_count'] = input_col.apply(lambda x: textstat.syllable_count(x))
    df['polysyll_count'] = input_col.apply(lambda x: textstat.polysyllabcount(x)) # Returns the number of words with a syllable count greater than or equal to 3.
    df['monosyll_count'] = input_col.apply(lambda x: textstat.monosyllabcount(x)) # Returns the number of words with a syllable count equal to one.
    
    # Add features using the textfeatures library to the dataframe
    # https://towardsdatascience.com/textfeatures-library-for-extracting-basic-features-from-text-data-f98ba90e3932
    tf.stopwords_count(df,text_col,'stopwords_count')
    # tf.stopwords(df,text_col,'stopwords')  # Include a column that lists the stopwords found in the text
    
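    # Fall back to 0 if the textfeatures call fails.
    # Catch Exception rather than using a bare except, so that
    # KeyboardInterrupt and SystemExit are not swallowed.
    try:
        tf.avg_word_length(df,text_col,'avg_word_length')
    except Exception:
        df['avg_word_length'] = 0
    
    try:
        tf.char_count(df,text_col,'char_count')
    except Exception:
        df['char_count'] = 0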
    
    # Add features based on the spaCy pipeline to the dataframe
    df['propn_count'] = propn_count
    df['verb_count'] = verb_count
    df['noun_count'] = noun_count
    df['adj_count'] = adj_count
    df['nums_count'] = nums_count
    df['pron_count'] = pron_count
    
    df['nnps_count'] = nnps_count
    df['vb_count'] = vb_count
    df['nn_count'] = nn_count
    df['jj_count'] = jj_count
    df['cd_count'] = cd_count
    df['prp_count'] = prp_count
    df['rb_count'] = rb_count
    df['cc_count'] = cc_count
    df['nnp_count'] = nnp_count
    df['vbd_count'] = vbd_count
    df['vbz_count'] = vbz_count
    
    # Add frequency columns
    
    df['propn_freq'] = df['propn_count']/df['word_count']
    df['verb_freq'] = df['verb_count']/df['word_count']
    df['noun_freq'] = df['noun_count']/df['word_count']
    df['adj_freq'] = df['adj_count']/df['word_count']
    df['nums_freq'] = df['nums_count']/df['word_count']
    df['pron_freq'] = df['pron_count']/df['word_count']
    
    df['nnps_freq'] = df['nnps_count']/df['word_count']
    df['vb_freq'] = df['vb_count']/df['word_count']
    df['nn_freq'] = df['nn_count']/df['word_count']
    df['jj_freq'] = df['jj_count']/df['word_count']
    df['cd_freq'] = df['cd_count']/df['word_count']
    df['prp_freq'] = df['prp_count']/df['word_count']
    df['rb_freq'] = df['rb_count']/df['word_count']
    df['cc_freq'] = df['cc_count']/df['word_count']
    df['nnp_freq'] = df['nnp_count']/df['word_count']
    df['vbd_freq'] = df['vbd_count']/df['word_count']
    df['vbz_freq'] = df['vbz_count']/df['word_count']
    
    df['polysyll_freq'] = df['polysyll_count']/df['word_count']
    df['monosyll_freq'] = df['monosyll_count']/df['word_count']
    df['stopword_freq'] = df['stopwords_count']/df['word_count']
    
    return df 
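
The frequency columns above divide by `word_count`, which could in principle be zero for an article with no recognisable words. In pandas, a non-zero count divided by zero gives `inf` rather than `NaN`, and `dropna()` does not remove `inf` by default. The following is a minimal sketch of one way to guard the division; `safe_freq` and the `demo` dataframe are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def safe_freq(counts: pd.Series, totals: pd.Series) -> pd.Series:
    """Divide counts by totals, giving NaN wherever the total is zero."""
    # Replacing 0 with NaN in the denominator avoids inf results,
    # so a later dropna() removes the affected rows.
    return counts / totals.replace(0, np.nan)

# Toy data: the second and third rows have a word count of zero
demo = pd.DataFrame({"noun_count": [3, 2, 0], "word_count": [12, 0, 0]})
demo["noun_freq"] = safe_freq(demo["noun_count"], demo["word_count"])
```

With this helper, a line such as `df['noun_freq'] = df['noun_count']/df['word_count']` could be written as `df['noun_freq'] = safe_freq(df['noun_count'], df['word_count'])`.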

In [47]:
text_col = 'clean_text'  # The name of the dataframe column containing the text to be processed
features_df = text_features_pipe(text_col, clean_df)

In [50]:
features_df.to_pickle(f"{export_filename}_FEATURES.pkl")

<div class="alert alert-warning">
    <h3>Reload saved dataframe (if required)</h3>
    <p>The previously saved dataframe of articles including all features can be reloaded by uncommenting and running the following cell.</p>
</div>

In [None]:
# features_df = pd.read_pickle(f"{export_filename}_FEATURES.pkl")
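
The save and reload cells above can also be combined into a single checkpoint step, so that re-running the notebook skips the expensive feature extraction when a saved dataframe already exists. This is a sketch only: `load_or_compute` and `expensive_step` are hypothetical names, and a temporary directory stands in for the real export location.

```python
import os
import tempfile
import pandas as pd

def load_or_compute(path, compute_fn):
    """Return the dataframe pickled at `path`; compute and save it if absent."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = compute_fn()
    df.to_pickle(path)
    return df

calls = []  # records how many times the expensive step actually runs

def expensive_step():
    # Stand-in for a slow call such as feature extraction
    calls.append(1)
    return pd.DataFrame({"article_id": [1, 2], "word_count": [120, 45]})

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "demo_FEATURES.pkl")
    first = load_or_compute(checkpoint, expensive_step)   # computes and saves
    second = load_or_compute(checkpoint, expensive_step)  # loads from disk
```

In this notebook the equivalent call would be something like `load_or_compute(f"{export_filename}_FEATURES.pkl", lambda: text_features_pipe(text_col, clean_df))`. Delete the pickle file if you want to force the features to be recomputed.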

### Apply the saved model to the final dataset and export a csv file of results

The following cells remove any rows with missing values, apply the saved model to the dataset, and display and export the final results in a dataframe sorted by probability.

In [52]:
features_df = features_df.dropna()

In [53]:
features_df.reset_index(drop=True, inplace=True) # reset the index

In [54]:
display(features_df)

Unnamed: 0,date,newspaper_id,newspaper,article_id,avg_line_width,min_line_width,max_line_width,line_width_range,avg_line_offset,max_line_offset,...,cd_freq,prp_freq,rb_freq,cc_freq,nnp_freq,vbd_freq,vbz_freq,polysyll_freq,monosyll_freq,stopword_freq
0,1891-12-23,WSTAR,Western Star,1,473.923529,71.0,496.0,425.0,5.252941,171.0,...,0.007833,0.019147,0.040905,0.033072,0.084421,0.062663,0.007833,0.086162,0.728460,0.406440
1,1891-12-23,WSTAR,Western Star,2,325.100000,80.0,500.0,420.0,53.100000,216.0,...,0.047619,0.011905,0.011905,0.035714,0.547619,0.023810,0.035714,0.011905,0.642857,0.178571
2,1891-12-23,WSTAR,Western Star,3,410.500000,214.0,503.0,289.0,29.500000,40.0,...,0.000000,0.000000,0.000000,0.000000,0.615385,0.000000,0.000000,0.000000,0.653846,0.230769
3,1891-12-23,WSTAR,Western Star,4,444.000000,367.0,501.0,134.0,26.000000,39.0,...,0.105263,0.000000,0.000000,0.000000,0.368421,0.052632,0.000000,0.157895,0.578947,0.105263
4,1891-12-23,WSTAR,Western Star,5,496.663462,253.0,525.0,272.0,19.788462,30.0,...,0.006319,0.020537,0.071090,0.053712,0.033175,0.006319,0.036335,0.097946,0.703002,0.443918
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
154117,1899-02-14,ST,Southland Times,19,528.000000,76.0,565.0,489.0,22.809859,258.0,...,0.022923,0.033429,0.025788,0.037249,0.167144,0.087870,0.004776,0.082139,0.762178,0.382044
154118,1899-02-14,ST,Southland Times,20,519.036364,114.0,563.0,449.0,41.236364,335.0,...,0.010230,0.040921,0.040921,0.028133,0.109974,0.025575,0.028133,0.120205,0.693095,0.445013
154119,1899-02-14,ST,Southland Times,21,509.117647,193.0,551.0,358.0,29.264706,238.0,...,0.004115,0.057613,0.049383,0.037037,0.090535,0.041152,0.028807,0.074074,0.748971,0.415638
154120,1899-02-14,ST,Southland Times,22,530.781250,230.0,565.0,335.0,14.375000,43.0,...,0.008969,0.031390,0.026906,0.053812,0.174888,0.004484,0.035874,0.112108,0.654709,0.390135


In [55]:
def genres_binary(df, model):
    """
    Run the model, and return the dataframe
    with appended predictions.
    """

    X = df.filter(features, axis=1)
    indices = df.index.values

    y_pred = model.predict(X)
    y_prob = model.predict_proba(X)
    

    # Add the predictions to a copy of the original dataframe
    df_new = df.copy()
    df_new.loc[indices,'pred'] = y_pred
    df_new.loc[indices,'prob_0'] = y_prob[:,0]
    df_new.loc[indices,'prob_1'] = y_prob[:,1]
    
    return df_new
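
`genres_binary` takes column 1 of `predict_proba` as the probability of the positive class. For scikit-learn classifiers the columns of `predict_proba` follow the order of `model.classes_`, so this holds when the labels are 0 and 1. The sketch below looks the column up explicitly instead of relying on position. `StubModel` is a hypothetical stand-in that only mimics the attributes used here, and its probabilities are made up for illustration.

```python
import numpy as np
import pandas as pd

class StubModel:
    """Hypothetical stand-in exposing the classifier attributes used below."""
    classes_ = np.array([0, 1])

    def predict_proba(self, X):
        # Made-up rule: narrow lines get a high probability for class 1
        p1 = (X["avg_line_width"] < 300).astype(float).to_numpy() * 0.8 + 0.1
        return np.column_stack([1 - p1, p1])

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]

def positive_class_prob(model, X, positive_label=1):
    """Return the predicted probability of `positive_label` for each row."""
    col = list(model.classes_).index(positive_label)
    return model.predict_proba(X)[:, col]

X = pd.DataFrame({"avg_line_width": [250.0, 480.0]})
probs = positive_class_prob(StubModel(), X)
```

The same helper should work with the loaded model, assuming it is a fitted scikit-learn classifier whose `classes_` contains the label 1.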

In [56]:
# Load the model from disk, closing the file once it has been read
with open(filename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)

In [57]:
preds_df = genres_binary(features_df, loaded_model)

In [58]:
final_df = preds_df.filter(["date", "newspaper_id", "newspaper", "article_id", "title", "text", "pred", "prob_1"], axis=1)
final_df["newspaper_web"] = final_df["date"].astype('string')
final_df["newspaper_web"] = final_df["newspaper_web"].str.replace('-','/')
final_df["newspaper_web"] = "https://paperspast.natlib.govt.nz/newspapers/" \
                            + final_df["newspaper"].str.replace(' ','-', regex = False).str.lower().str.replace("'", "", regex = False).str.replace(".", "", regex = False) \
                            + "/" \
                            + final_df["newspaper_web"]

# This line of code will provide a dataframe with only the articles classified as the given genre
# final_df = final_df[final_df["pred"] == 1].sort_values(by="prob_1", ascending=False)

# This line of code will provide a dataframe with all articles - ranked by probability of being the given genre
final_df = final_df.sort_values(by="prob_1", ascending=False)
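
The link construction in the cell above can be checked on a single row with a plain Python function that applies the same string steps. `papers_past_issue_url` is a hypothetical helper, not part of the notebook, and the sketch does not check that the resulting page exists.

```python
def papers_past_issue_url(newspaper: str, date: str) -> str:
    """Build an issue URL from a newspaper title and a YYYY-MM-DD date."""
    base = "https://paperspast.natlib.govt.nz/newspapers/"
    # Same steps as the dataframe version: spaces become hyphens,
    # the title is lowercased, and apostrophes and full stops are removed.
    slug = newspaper.replace(" ", "-").lower().replace("'", "").replace(".", "")
    return base + slug + "/" + date.replace("-", "/")

url = papers_past_issue_url("Western Star", "1891-12-23")
```

Spot-checking a few titles this way is a quick test of whether the slug rule covers every newspaper in the sample.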


In [None]:
# Uncomment the line below to view the final dataframe if required

# display(final_df)

In [59]:
# Export the dataframe of results to a CSV file 
final_df.to_csv(f"{export_filename}_FINAL.csv")
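
The export above writes the dataframe index as the first, unnamed column of the CSV. If the file is later read back with pandas, that column appears as `Unnamed: 0` unless it is restored as the index. The sketch below shows this round trip with a toy dataframe and an in-memory buffer in place of a file; the data values are made up.

```python
import io
import pandas as pd

# Toy results with a non-sequential index, as left by sort_values()
results = pd.DataFrame({"article_id": [7, 3], "prob_1": [0.91, 0.12]}, index=[41, 5])

buffer = io.StringIO()
results.to_csv(buffer)  # index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # index comes back as a column called "Unnamed: 0"

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # first column restored as the index
```

Passing `index=False` to `to_csv` drops the index from the export altogether, if the original row positions are not needed.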