# Papers Past Genre Classification
# Notebook 3B: Linguistic Features and Text Statistics (excluding TF-IDF)
---

This notebook reads in a pandas DataFrame saved as a pickle (the output of Notebook 2: Labelling) and uses spaCy, textstat, and textfeatures to add columns of linguistic features. This version of the notebook omits the TF-IDF feature, which gives a more precise model but is computationally expensive.

In [1]:
from datetime import date
from datetime import datetime

import re
import pandas as pd
import numpy as np
import json
import pickle
import spacy
import math
import textstat
import textfeatures as tf

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Owner\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
nlp = spacy.load('en_core_web_lg')

In [3]:
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [4]:
filepath = '20220110_PP_4628articles_labelled.pkl'
input_df = pd.read_pickle(filepath)

The genre label 'Other' covers anything that did not fit easily into one of the other genre categories, including articles that could not be classified because of a high number of OCR errors. For efficiency, articles labelled 'Other' are removed before linguistic feature extraction.

In [5]:
# Take an explicit copy so that adding columns later does not raise SettingWithCopyWarning
input_df = input_df[input_df.genre != "Other"].copy()
len(input_df)

3518

The text column contains stray symbols produced by OCR errors. The cleaner function uses a regular expression to keep only runs of alphanumeric characters and hyphens (so that hyphenated words survive) and joins those runs with single spaces. The code is based on: https://prrao87.github.io/blog/spacy/nlp/performance/2020/05/02/spacy-multiprocess.html.

In [6]:
def cleaner(df, column_name):
    """
    Remove unnecessary symbols to create a clean text column from the original dataframe column using a regex.
    """
    # A column of sentence count is added to the dataframe before punctuation is removed.
    df['sentence_count'] = df[column_name].apply(lambda x: textstat.sentence_count(x))

    # Regex pattern for only alphanumeric, hyphenated text
    pattern = re.compile(r"[A-Za-z0-9\-]{1,50}")
    df['clean_text'] = df[column_name].str.findall(pattern).str.join(' ')
    
    return df
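
To make the effect of the pattern concrete, the short sketch below applies the same regular expression to a string modelled on the visible part of one OCR-damaged `text` value from this dataset. The names `sample`, `cleaned`, `hyphen_sample`, and `hyphen_cleaned` are illustrative only and are not part of the notebook.

```python
import re

# Same pattern as in cleaner(): runs of 1-50 letters, digits, or hyphens
pattern = re.compile(r"[A-Za-z0-9\-]{1,50}")

# Visible start of an OCR-damaged article (illustrative input)
sample = ": [PRESS AGENCY.] i : ;; BtUPP, October 25. /"
cleaned = " ".join(pattern.findall(sample))
print(cleaned)  # PRESS AGENCY i BtUPP October 25

# Hyphenated words are kept whole, and so is a hyphen standing alone
hyphen_sample = "a well-known . - fact"
hyphen_cleaned = " ".join(pattern.findall(hyphen_sample))
print(hyphen_cleaned)  # a well-known - fact
```

A hyphen standing on its own also matches the pattern, which is why isolated `-` tokens appear in some `clean_text` values.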

In [7]:
clean_df = cleaner(input_df, 'text')

In [8]:
pd.set_option('display.max_columns', None)
display(clean_df)

Unnamed: 0,date,newspaper_id,newspaper,article_id,avg_line_width,min_line_width,max_line_width,line_width_range,avg_line_offset,max_line_offset,min_line_offset,title,text,genre,sentence_count,clean_text
0,1878-10-26,KUMAT,Kumara Times,1,452.272727,282.0,512.0,230.0,33.090909,174.0,0.0,MAILS CLOSE,"For the United Kingdom, Continent of Europe, a...",Notice,3,For the United Kingdom Continent of Europe and...
2,1878-10-26,KUMAT,Kumara Times,3,429.500000,95.0,515.0,420.0,22.000000,104.0,0.0,LATEST TELEGRAMS.,": [PRESS AGENCY.] i : ;; BtUPP, October 25. / ...",FamilyNotice,5,PRESS AGENCY i BtUPP October 25 Arrived Stella...
3,1878-10-26,KUMAT,Kumara Times,4,469.765854,64.0,589.0,525.0,22.575610,378.0,0.0,GENERAL ASSEMBLY.,Continued from 4th page. [press agency.] .;-•'...,Report,32,Continued from 4th page press agency - J LEGIS...
4,1878-10-26,KUMAT,Kumara Times,5,480.044444,109.0,599.0,490.0,18.183333,244.0,0.0,ADDITIONAL NEWS BY THE SAN FRANCISCO MAIL.,■ f > • ; •* .■<■ j.f> '■' EUROPEAN ;/ ■ LoNi>...,News,42,f j f EUROPEAN LoNi oii Septemb r 30i v - In C...
5,1878-10-26,KUMAT,Kumara Times,6,461.769231,51.0,531.0,480.0,22.745562,363.0,0.0,GENERAL ASSEMBLY.,'pPKSSS AGENCY.] LEGISLATIVE COUNCIL. Wellingt...,Report,39,pPKSSS AGENCY LEGISLATIVE COUNCIL Wellington O...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4613,1858-10-15,DSC,Daily Southern Cross,8,587.000000,508.0,666.0,158.0,12.000000,24.0,0.0,BIRTH,"At Wesley College. Auckland, on the 12th Octob...",FamilyNotice,3,At Wesley College Auckland on the 12th October...
4614,1858-10-15,DSC,Daily Southern Cross,9,536.000000,404.0,668.0,264.0,12.500000,25.0,0.0,DIED.,"On Monday, the 11th instant, Mary, second daug...",FamilyNotice,1,On Monday the 11th instant Mary second daughte...
4615,1858-10-15,DSC,Daily Southern Cross,10,656.030075,18.0,704.0,686.0,35.338346,368.0,0.0,THE COUNCIL.,"We have always maintained, that of the two bra...",Opinion,36,We have always maintained that of the two bran...
4617,1858-10-15,DSC,Daily Southern Cross,12,612.764706,96.0,704.0,608.0,21.176471,67.0,0.0,PRESBYTERY OF AUCKLAND,", The principal meeting for the year of this C...",Report,22,The principal meeting for the year of this Cou...


### Functions

Run the following cells to define the functions that will be used to extract features from the text. 

In [9]:
def count_propn_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: proper nouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_propn = 0
    # propn_list = []

    for token in doc:
        if token.pos_ == 'PROPN':
            count_propn += 1

            # propn_list.append(token)
    # print(propn_list)

    return count_propn

In [10]:
def count_verb_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: verbs.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_verb = 0
    # verb_list = []

    for token in doc:
        if token.pos_ == 'VERB':
            count_verb += 1

            # verb_list.append(token)
    # print(verb_list)

    return count_verb

In [11]:
def count_noun_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: nouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_noun = 0
    # noun_list = []

    for token in doc:
        if token.pos_ == 'NOUN':
            count_noun += 1

            # noun_list.append(token)
    # print(noun_list)
        
    return count_noun

In [12]:
def count_adj_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: adjectives.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_adj = 0
    # adj_list = []

    for token in doc:
        if token.pos_ == 'ADJ':
            count_adj += 1

            # adj_list.append(token)
    # print(adj_list)
        
    return count_adj

In [13]:
def count_nums_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: numbers.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """ 
    count_nums = 0
    # nums_list = []

    for token in doc:
        if token.pos_ == 'NUM':
            count_nums += 1

            # nums_list.append(token)
    # print(nums_list)
        
    return count_nums

In [14]:
def count_pron_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: pronouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_pron = 0
    # pron_list = []
    
    for token in doc:
        if token.pos_ == 'PRON':
            count_pron += 1

            # pron_list.append(token)
    # print(pron_list)
        
    return count_pron

In [15]:
def count_nnps_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: plural proper nouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_nnps = 0
    # nnps_list = []

    for token in doc:
        if token.tag_ == 'NNPS':
            count_nnps += 1

            # nnps_list.append(token)
    # print(nnps_list)

    return count_nnps

In [16]:
def count_vb_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: base form verbs.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_vb = 0
    # vb_list = []

    for token in doc:
        if token.tag_ == 'VB':
            count_vb += 1

            # vb_list.append(token)
    # print(vb_list)

    return count_vb

In [17]:
def count_nn_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: singular or mass nouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_nn = 0
    # nn_list = []

    for token in doc:
        if token.tag_ == 'NN':
            count_nn += 1

            # nn_list.append(token)
    # print(nn_list)
        
    return count_nn

In [18]:
def count_jj_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: adjectives.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_jj = 0
    # jj_list = []

    for token in doc:
        if token.tag_ == 'JJ':
            count_jj += 1

            # jj_list.append(token)
    # print(jj_list)
        
    return count_jj

In [19]:
def count_cd_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: cardinal numbers.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """ 
    count_cd = 0
    # cd_list = []

    for token in doc:
        if token.tag_ == 'CD':
            count_cd += 1

            # cd_list.append(token)
    # print(cd_list)
        
    return count_cd

In [20]:
def count_prp_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: personal pronouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_prp = 0
    # prp_list = []
    
    for token in doc:
        if token.tag_ == 'PRP':
            count_prp += 1

            # prp_list.append(token)
    # print(prp_list)
        
    return count_prp

In [21]:
def count_rb_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: adverbs.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_rb = 0
    # rb_list = []
    
    for token in doc:
        if token.tag_ == 'RB':
            count_rb += 1

            # rb_list.append(token)
    # print(rb_list)
        
    return count_rb

In [22]:
def count_cc_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: coordinating conjunctions.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_cc = 0
    # cc_list = []
    
    for token in doc:
        if token.tag_ == 'CC':
            count_cc += 1

            # cc_list.append(token)
    # print(cc_list)
        
    return count_cc

In [23]:
def count_nnp_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: singular proper nouns.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_nnp = 0
    # nnp_list = []
    
    for token in doc:
        if token.tag_ == 'NNP':
            count_nnp += 1

            # nnp_list.append(token)
    # print(nnp_list)
        
    return count_nnp

In [24]:
def count_vbd_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: past tense verbs.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_vbd = 0
    # vbd_list = []
    
    for token in doc:
        if token.tag_ == 'VBD':
            count_vbd += 1

            # vbd_list.append(token)
    # print(vbd_list)
        
    return count_vbd

In [25]:
def count_vbz_spacy(doc):
    """
    Given a Spacy doc object return counts of the 
    following parts-of-speech: third-person singular present verbs.
    
    Optional: uncomment the code lines to collect and 
    print a list of the tagged words.
    """
    count_vbz = 0
    # vbz_list = []
    
    for token in doc:
        if token.tag_ == 'VBZ':
            count_vbz += 1

            # vbz_list.append(token)
    # print(vbz_list)
        
    return count_vbz
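
The counting functions above all follow the same loop and differ only in the label they test. As a possible alternative, a single pass with `collections.Counter` can tally every coarse (`pos_`) and fine-grained (`tag_`) label at once. The sketch below is not the notebook's method: `count_labels` and the `Token` stand-in are hypothetical names, and the stand-in tuples only mimic the two spaCy token attributes so that the example runs without loading a model.

```python
from collections import Counter, namedtuple


def count_labels(doc, pos_labels, tag_labels):
    """Return a dict of counts for the requested POS and tag labels."""
    pos_counts = Counter(token.pos_ for token in doc)
    tag_counts = Counter(token.tag_ for token in doc)
    # A Counter returns 0 for labels that never occurred
    counts = {f"{label.lower()}_count": pos_counts[label] for label in pos_labels}
    counts.update({f"{label.lower()}_count": tag_counts[label] for label in tag_labels})
    return counts


# Stand-in for spaCy tokens (assumption: only .pos_ and .tag_ are needed)
Token = namedtuple("Token", ["text", "pos_", "tag_"])
toy_doc = [
    Token("Mary", "PROPN", "NNP"),
    Token("arrived", "VERB", "VBD"),
    Token("on", "ADP", "IN"),
    Token("Monday", "PROPN", "NNP"),
]

result = count_labels(
    toy_doc,
    pos_labels=["PROPN", "VERB", "NOUN"],
    tag_labels=["NNP", "VBD"],
)
print(result)
```

The generated keys are derived from the labels, so they would not match every column name used in this notebook (for example `nums_count`); a rename step would be needed if the two were combined.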

In [26]:
def text_features_pipe(text_col, df):
    """
    Process given text column of a dataframe to 
    extract linguistic features and add them to
    the dataframe. Return the updated dataframe.
    """
    
    input_col = df[text_col]  
    
    propn_count = []
    verb_count = []
    noun_count = []
    adj_count = []
    nums_count = []
    pron_count = []
    
    nnps_count = []
    vb_count = []
    nn_count = []
    jj_count = []
    cd_count = []
    prp_count = []
    rb_count = []
    cc_count = []
    nnp_count = []
    vbd_count = []
    vbz_count = []
    
    # spaCy processing pipeline
    nlp_text_pipe = nlp.pipe(input_col, batch_size=20)
    
    for doc in nlp_text_pipe:
        
        # POS tags
        # Universal POS Tags
        # http://universaldependencies.org/u/pos/
        
        # Count proper nouns
        propn_count.append(count_propn_spacy(doc))

        # Count verbs
        verb_count.append(count_verb_spacy(doc))

        # Count nouns
        noun_count.append(count_noun_spacy(doc))

        # Count adjectives
        adj_count.append(count_adj_spacy(doc))

        # Count numbers
        nums_count.append(count_nums_spacy(doc))

        # Count pronouns
        pron_count.append(count_pron_spacy(doc))

        # POS tags (English)
        # OntoNotes 5 / Penn Treebank
        # https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

        # Count plural proper nouns
        nnps_count.append(count_nnps_spacy(doc))

        # Count base form verbs
        vb_count.append(count_vb_spacy(doc))

        # Count singular or mass nouns
        nn_count.append(count_nn_spacy(doc))

        # Count adjectives
        jj_count.append(count_jj_spacy(doc))

        # Count cardinal numbers
        cd_count.append(count_cd_spacy(doc))

        # Count personal pronouns
        prp_count.append(count_prp_spacy(doc))

        # Count adverbs
        rb_count.append(count_rb_spacy(doc))

        # Count coordinating conjunctions
        cc_count.append(count_cc_spacy(doc))

        # Count singular proper nouns
        nnp_count.append(count_nnp_spacy(doc))

        # Count past tense verbs
        vbd_count.append(count_vbd_spacy(doc))

        # Count third-person singular present verbs
        vbz_count.append(count_vbz_spacy(doc))
        
    # Add features using the textstat library to the dataframe
    # https://pypi.org/project/textstat/
    df['word_count'] = input_col.apply(lambda x: textstat.lexicon_count(x, removepunct=True)) 
    df['syll_count'] = input_col.apply(lambda x: textstat.syllable_count(x))
    df['polysyll_count'] = input_col.apply(lambda x: textstat.polysyllabcount(x)) # Returns the number of words with a syllable count greater than or equal to 3.
    df['monosyll_count'] = input_col.apply(lambda x: textstat.monosyllabcount(x)) # Returns the number of words with a syllable count equal to one.
    
    # Add features using the textfeatures library to the dataframe
    # https://towardsdatascience.com/textfeatures-library-for-extracting-basic-features-from-text-data-f98ba90e3932
    tf.stopwords_count(df,text_col,'stopwords_count')
    # tf.stopwords(df,text_col,'stopwords')  # Include a column that lists the stopwords found in the text

    # Fall back to a column of zeros if textfeatures raises an error
    try:
        tf.avg_word_length(df,text_col,'avg_word_length')
    except Exception:
        df['avg_word_length'] = 0
    
    try:
        tf.char_count(df,text_col,'char_count')
    except Exception:
        df['char_count'] = 0
    
    # Add features based on the spaCy pipeline to the dataframe
    df['propn_count'] = propn_count
    df['verb_count'] = verb_count
    df['noun_count'] = noun_count
    df['adj_count'] = adj_count
    df['nums_count'] = nums_count
    df['pron_count'] = pron_count
    
    df['nnps_count'] = nnps_count
    df['vb_count'] = vb_count
    df['nn_count'] = nn_count
    df['jj_count'] = jj_count
    df['cd_count'] = cd_count
    df['prp_count'] = prp_count
    df['rb_count'] = rb_count
    df['cc_count'] = cc_count
    df['nnp_count'] = nnp_count
    df['vbd_count'] = vbd_count
    df['vbz_count'] = vbz_count
    
    # Add frequency columns
    
    df['propn_freq'] = df['propn_count']/df['word_count']
    df['verb_freq'] = df['verb_count']/df['word_count']
    df['noun_freq'] = df['noun_count']/df['word_count']
    df['adj_freq'] = df['adj_count']/df['word_count']
    df['nums_freq'] = df['nums_count']/df['word_count']
    df['pron_freq'] = df['pron_count']/df['word_count']
    
    df['nnps_freq'] = df['nnps_count']/df['word_count']
    df['vb_freq'] = df['vb_count']/df['word_count']
    df['nn_freq'] = df['nn_count']/df['word_count']
    df['jj_freq'] = df['jj_count']/df['word_count']
    df['cd_freq'] = df['cd_count']/df['word_count']
    df['prp_freq'] = df['prp_count']/df['word_count']
    df['rb_freq'] = df['rb_count']/df['word_count']
    df['cc_freq'] = df['cc_count']/df['word_count']
    df['nnp_freq'] = df['nnp_count']/df['word_count']
    df['vbd_freq'] = df['vbd_count']/df['word_count']
    df['vbz_freq'] = df['vbz_count']/df['word_count']
    
    df['polysyll_freq'] = df['polysyll_count']/df['word_count']
    df['monosyll_freq'] = df['monosyll_count']/df['word_count']
    df['stopword_freq'] = df['stopwords_count']/df['word_count']
    
    return df 
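
The frequency columns divide by `word_count`, so an article whose cleaned text is empty would yield NaN (0/0) or infinite (n/0) values. Whether such rows occur in this dataset is not shown here. As a precaution, one option is to mask zero word counts before dividing. The sketch below uses an invented toy frame; `toy` and `safe_word_count` are hypothetical names and this guard is not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy data: the second row stands for an article with no words left after cleaning
toy = pd.DataFrame({"verb_count": [4, 0], "word_count": [57, 0]})

# Replace zero denominators with NaN so the division cannot produce inf
safe_word_count = toy["word_count"].replace(0, np.nan)

# Rows with no words get a frequency of 0.0 instead of NaN
toy["verb_freq"] = (toy["verb_count"] / safe_word_count).fillna(0.0)
print(toy)
```

Filling with 0.0 is a design choice; dropping such rows instead may be more appropriate if empty articles should not reach the classifier.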

### Run the text feature extraction pipeline

* Provide the dataframe and the name of the text column to extract features from.
* The function returns the dataframe with the feature columns added.

In [27]:
text_col = 'clean_text'  # The name of dataframe column containing the text to be processed
features_df = text_features_pipe(text_col, clean_df)

In [28]:
features_df.reset_index(drop=True, inplace=True) # reset the index

In [29]:
# Drop rows for empty articles, identified by a NaN value in the avg_line_width column
features_df.dropna(subset = ["avg_line_width"], inplace=True) 
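
Because the index is reset before rows are dropped, any removed row leaves a gap in the index. If a contiguous index is wanted, one option is to drop first and reset afterwards. The following is a minimal sketch on an invented frame (`toy` is a hypothetical name), not a change to the cells above.

```python
import numpy as np
import pandas as pd

# Toy frame with a non-contiguous index and one empty article (NaN line width)
toy = pd.DataFrame(
    {
        "avg_line_width": [452.3, np.nan, 469.8],
        "genre": ["Notice", "News", "Report"],
    },
    index=[0, 2, 3],
)

# Drop the empty article first, then renumber the remaining rows from 0
toy = toy.dropna(subset=["avg_line_width"]).reset_index(drop=True)
print(toy)
```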

In [30]:
pd.set_option('display.max_columns', None)
display(features_df)

Unnamed: 0,date,newspaper_id,newspaper,article_id,avg_line_width,min_line_width,max_line_width,line_width_range,avg_line_offset,max_line_offset,min_line_offset,title,text,genre,sentence_count,clean_text,word_count,syll_count,polysyll_count,monosyll_count,stopwords_count,avg_word_length,char_count,propn_count,verb_count,noun_count,adj_count,nums_count,pron_count,nnps_count,vb_count,nn_count,jj_count,cd_count,prp_count,rb_count,cc_count,nnp_count,vbd_count,vbz_count,propn_freq,verb_freq,noun_freq,adj_freq,nums_freq,pron_freq,nnps_freq,vb_freq,nn_freq,jj_freq,cd_freq,prp_freq,rb_freq,cc_freq,nnp_freq,vbd_freq,vbz_freq,polysyll_freq,monosyll_freq,stopword_freq
0,1878-10-26,KUMAT,Kumara Times,1,452.272727,282.0,512.0,230.0,33.090909,174.0,0.0,MAILS CLOSE,"For the United Kingdom, Continent of Europe, a...",Notice,3,For the United Kingdom Continent of Europe and...,57,87,7,35,15,4.620690,325,25,4,7,1,3,0,2,2,5,1,3,0,1,3,23,1,0,0.438596,0.070175,0.122807,0.017544,0.052632,0.000000,0.035088,0.035088,0.087719,0.017544,0.052632,0.000000,0.017544,0.052632,0.403509,0.017544,0.000000,0.122807,0.614035,0.263158
1,1878-10-26,KUMAT,Kumara Times,3,429.500000,95.0,515.0,420.0,22.000000,104.0,0.0,LATEST TELEGRAMS.,": [PRESS AGENCY.] i : ;; BtUPP, October 25. / ...",FamilyNotice,5,PRESS AGENCY i BtUPP October 25 Arrived Stella...,46,72,5,30,10,4.098039,259,16,3,12,1,2,1,0,0,12,1,2,1,1,0,16,2,1,0.347826,0.065217,0.260870,0.021739,0.043478,0.021739,0.000000,0.000000,0.260870,0.021739,0.043478,0.021739,0.021739,0.000000,0.347826,0.043478,0.021739,0.108696,0.652174,0.217391
2,1878-10-26,KUMAT,Kumara Times,4,469.765854,64.0,589.0,525.0,22.575610,378.0,0.0,GENERAL ASSEMBLY.,Continued from 4th page. [press agency.] .;-•'...,Report,32,Continued from 4th page press agency - J LEGIS...,1131,1628,104,794,337,4.469317,6327,298,123,235,40,36,27,7,44,185,35,36,22,16,22,292,51,5,0.263484,0.108753,0.207781,0.035367,0.031830,0.023873,0.006189,0.038904,0.163572,0.030946,0.031830,0.019452,0.014147,0.019452,0.258179,0.045093,0.004421,0.091954,0.702034,0.297966
3,1878-10-26,KUMAT,Kumara Times,5,480.044444,109.0,599.0,490.0,18.183333,244.0,0.0,ADDITIONAL NEWS BY THE SAN FRANCISCO MAIL.,■ f > • ; •* .■<■ j.f> '■' EUROPEAN ;/ ■ LoNi>...,News,42,f j f EUROPEAN LoNi oii Septemb r 30i v - In C...,1079,1576,103,744,370,4.538462,6119,255,111,219,56,29,40,8,12,174,55,29,21,14,37,247,46,27,0.236330,0.102873,0.202966,0.051900,0.026877,0.037071,0.007414,0.011121,0.161260,0.050973,0.026877,0.019462,0.012975,0.034291,0.228916,0.042632,0.025023,0.095459,0.689527,0.342910
4,1878-10-26,KUMAT,Kumara Times,6,461.769231,51.0,531.0,480.0,22.745562,363.0,0.0,GENERAL ASSEMBLY.,'pPKSSS AGENCY.] LEGISLATIVE COUNCIL. Wellingt...,Report,39,pPKSSS AGENCY LEGISLATIVE COUNCIL Wellington O...,1043,1518,115,711,373,4.614573,5855,254,121,175,37,25,35,9,39,132,37,25,26,17,23,245,55,2,0.243528,0.116012,0.167785,0.035475,0.023969,0.033557,0.008629,0.037392,0.126558,0.035475,0.023969,0.024928,0.016299,0.022052,0.234899,0.052733,0.001918,0.110259,0.681687,0.357622
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3513,1858-10-15,DSC,Daily Southern Cross,8,587.000000,508.0,666.0,158.0,12.000000,24.0,0.0,BIRTH,"At Wesley College. Auckland, on the 12th Octob...",FamilyNotice,3,At Wesley College Auckland on the 12th October...,18,24,1,13,6,3.944444,88,7,0,3,1,0,0,0,0,3,1,0,0,0,0,7,0,0,0.388889,0.000000,0.166667,0.055556,0.000000,0.000000,0.000000,0.000000,0.166667,0.055556,0.000000,0.000000,0.000000,0.000000,0.388889,0.000000,0.000000,0.055556,0.722222,0.333333
3514,1858-10-15,DSC,Daily Southern Cross,9,536.000000,404.0,668.0,264.0,12.500000,25.0,0.0,DIED.,"On Monday, the 11th instant, Mary, second daug...",FamilyNotice,1,On Monday the 11th instant Mary second daughte...,14,18,0,10,2,4.571429,77,4,1,3,2,1,0,0,0,2,2,1,0,0,0,4,0,0,0.285714,0.071429,0.214286,0.142857,0.071429,0.000000,0.000000,0.000000,0.142857,0.142857,0.071429,0.000000,0.000000,0.000000,0.285714,0.000000,0.000000,0.000000,0.714286,0.142857
3515,1858-10-15,DSC,Daily Southern Cross,10,656.030075,18.0,704.0,686.0,35.338346,368.0,0.0,THE COUNCIL.,"We have always maintained, that of the two bra...",Opinion,36,We have always maintained that of the two bran...,1088,1580,103,750,536,4.573921,6069,69,123,203,77,6,97,4,54,163,70,6,53,55,38,65,27,33,0.063419,0.113051,0.186581,0.070772,0.005515,0.089154,0.003676,0.049632,0.149816,0.064338,0.005515,0.048713,0.050551,0.034926,0.059743,0.024816,0.030331,0.094669,0.689338,0.492647
3516,1858-10-15,DSC,Daily Southern Cross,12,612.764706,96.0,704.0,608.0,21.176471,67.0,0.0,PRESBYTERY OF AUCKLAND,", The principal meeting for the year of this C...",Report,22,The principal meeting for the year of this Cou...,575,845,74,400,267,4.680556,3271,91,64,92,26,7,28,4,14,66,25,7,15,10,14,87,29,3,0.158261,0.111304,0.160000,0.045217,0.012174,0.048696,0.006957,0.024348,0.114783,0.043478,0.012174,0.026087,0.017391,0.024348,0.151304,0.050435,0.005217,0.128696,0.695652,0.464348


## Save dataframe for later use

In [31]:
# Save dataframe for later use 
# https://stackoverflow.com/questions/17098654/how-to-reversibly-store-and-load-a-pandas-dataframe-to-from-disk

# -----------------------------------------------------
# Uncomment below to use date and time for filename

# time_now = datetime.now()
# file_date = time_now.strftime("%Y%m%d_%H%M%S")
# features_df.to_pickle(f"{file_date}_PP_df.pkl")


# -----------------------------------------------------
# Use a custom filename (active by default)

pkl_filename = '20220219_PP_3518articles_features_exclTFIDF' # Change filename here
features_df.to_pickle(f"{pkl_filename}.pkl")
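
As a quick check that a pickled dataframe can be read back unchanged, the sketch below writes a small invented frame to a temporary directory with a timestamped filename and reloads it. The frame `toy`, the filename suffix, and the use of `tempfile` are assumptions made for illustration; the notebook itself writes to the working directory.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd

# Toy frame standing in for features_df
toy = pd.DataFrame({"genre": ["Notice", "Report"], "word_count": [57, 1131]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Timestamped filename, in the same format as the commented-out option
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(tmp_dir) / f"{stamp}_toy_features.pkl"

    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# True if values, dtypes, and index all survived the round trip
print(restored.equals(toy))
```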