***Description***

This notebook shows the pre-processing steps of NYTAC corpus, which were previously extracted into .csv file. The processing steps include filtering 'Types of Material' (ed, op-ed, none (a.k.a. news)) and 'Word Count' (> 50 words), and remove mentions of news piece which often appeared in ed and op-ed articles. Then, each article is assigned to a topic depending on keyword in *key_dict* in *filter_col*. If an article has more than one keyword from different topic, I assigned it to the topic belonged to the keyword that appears first. Last, I report the number and ratio of news and editorial articles in each topic, and save everything as .csv file.

In [1]:
# import libraries
import nltk
import numpy as np
import pandas as pd
import re
from collections import Counter, OrderedDict
from operator import itemgetter
from nltk.corpus import stopwords
nltk.download('stopwords')
stops = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/users/rldall/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Processing data

In [2]:
# set-up
# specify required fields
fields = ['Body', 'Descriptors', 'General Online Descriptors', 'Lead Paragraph',
          'News Desk', 'Online Section', 'Types Of Material','Word Count']

# interested materials
material_list = ['editorial','op-ed','letter', 'none']

# filter for interested topics
key_dict = {
    'law' : ['law','right','court'],
    'politics' : ['politics','relation','international','regional'],
    'medicine' : ['medicine','health','disease'],
    'finance' : ['finances','business'],
    'military' : ['defense','armament','military'],
    'education' : ['education','school','teacher']
}

# select lists of keywords
l = [key_dict.get(k) for k in list(key_dict)]
flatten_l = [item for sublist in l for item in sublist]

# column used for filtering
filter_col = 'Descriptors'
s = '|'.join([item for item in flatten_l])

# import data
# train and val data
df1 = pd.read_csv('/data/RAW/nyt1996.csv',encoding='latin-1',usecols=fields)
df2 = pd.read_csv('/data/RAW/nyt2005.csv',encoding='latin-1',usecols=fields)
train_df = pd.concat([df1,df2])

# test data
test_df = pd.read_csv('/data/RAW/nyt1986.csv',encoding='latin-1',usecols=fields)

# Helper functions

In [3]:
# Helper function for cleaning dataframes
def filter_explore(raw):
    print('Current no. rows:', len(raw))
    # delete the Lead Paragraph from the Body 
    raw['Body'] = raw.apply(lambda row : str(row['Body']).replace(str(row['Lead Paragraph']), ''), axis=1)
    # selecting columns
    filtered_df = raw[['Body', 'Descriptors', 'General Online Descriptors'#, 'News Desk', 'Online Section'
                      ,'Types Of Material','Word Count']]
    filtered_df = filtered_df.drop_duplicates(subset=['Body'])
    # filter Word Count
    filtered_df = filtered_df[filtered_df['Word Count'] > 50]
    # fill NaN value with 'None'
    filtered_df = filtered_df.fillna('None')
    # filter Types Of Material
    filtered_df['Types Of Material'] = filtered_df['Types Of Material'].str.lower()
    filtered_df = filtered_df[filtered_df['Types Of Material'].isin(material_list)]
    print('\nFiltered Material: done')
    print('\ncurrent no. rows:', len(filtered_df))
    print('\nExploring Types of Material')
    # exploring types of materials
    for m in material_list:
        temp = filtered_df[filtered_df['Types Of Material']==m].sort_values('Body')
        print('For {}, there are {} articles in total. {} articles lack Descriptors. {} articles lack General Online Descriptors.\
              {} articles does not have any descriptors at all.'.format(\
            m, len(temp), 
            len(temp[temp['Descriptors']=='None']), len(temp[temp['General Online Descriptors']=='None']),
            len(temp[(temp['Descriptors']=='None') & (temp['General Online Descriptors']=='None')]))
             )
    ### many editorials lack topic labels
    return filtered_df

# Helper function for filtering keywords
def filter_clean(filtered_df, filter_col, s):
    filtered_df[filter_col] = filtered_df[filter_col].str.lower()
    filtered_topics = filtered_df[filtered_df[filter_col].str.contains(s)==True]
    for row in filtered_topics.iterrows():
        body = row[1]['Body'].lstrip()
        get = re.findall("\([^\(]*\.\s\d{2}\)", body)
        if get:        
            body = re.sub("\([^\(]*\.\s\d{2}\)",'', body)
        body_after = re.sub('To the Editor:','', body)    
        filtered_topics._set_value(row[0],'Body',body_after)
    print('\nFiltered Topics: done')
    print('\nCurrent no. rows:', len(filtered_topics))
    # split Descriptors into list
    filtered_topics['Descriptors'] = filtered_topics['Descriptors'].str.split('|')
    filtered_topics['General Online Descriptors'] = filtered_topics['General Online Descriptors'].str.split('|')
    return filtered_topics

In [4]:
# Helper functions to assign topic to the articles

# define the main topic by the keyword in descriptors and add Topic column
def match_string(list_string,search_string):
    result = [re.search(i, search_string).group() for i in list_string if re.search(i, search_string) is not None]
    if len(result) > 0:
        return result[0]
    
def match_key(dictionary, search_string):
    match_list = [key for key,val in dictionary.items() if any(search_string in s for s in val)]
    if len(match_list) > 0:
        return match_list[0]

def match_topic(df, key_dict):
    # add empty column to df
    df.insert(0,"Topic", "None")
    for row in df.iterrows():
        topic = ''                
        if row[1]['Topic'] in list(key_dict):
            pass
        else:
            for des in row[1]['Descriptors']:
                match_res = match_string(flatten_l,des)
                if match_res:
                    topic = match_key(key_dict,match_res)
                    if topic:
                        df._set_value(row[0],'Topic', topic)
    return df

In [5]:
# Helper functions for count and save

def count_result(df):
    # at some point have to change Types of Material to 'news' and 'editorials'
    df.loc[df["Types Of Material"] == 'letter', "Types Of Material"] = 'editorial'
    df.loc[df["Types Of Material"] == 'op-ed', "Types Of Material"] = 'editorial'
    df.loc[df["Types Of Material"] == 'none', "Types Of Material"] = 'news'
    count_df = df.groupby(["Topic", "Types Of Material"]).size().reset_index(name="Count")
    count_df = count_df.pivot('Topic','Types Of Material','Count').reset_index()
    count_df['ratio'] = count_df['editorial']/(count_df['editorial']+count_df['news'])
    for row in count_df.iterrows():
        print ('On the topic {}, we have {} news articles and {} editorials, or {} % editorial'\
               .format(row[1]['Topic'], row[1]['news'], row[1]['editorial'], round(row[1]['ratio'],4)*100))

def save_topic_csv(df, data_type, key_dict):
    df = df[['Topic','Types Of Material','Body']]
    for k in list(key_dict):
        save = df[df['Topic']==k]
        save[['Types Of Material','Body']].to_csv('/data/ProcessedNYT/'+data_type+'_'+str(k)+'.txt',
                                                  sep='\t', header=False, index=False)
        print ('saved topic_txt:', k)

# Main Function 

In [7]:
import warnings
warnings.filterwarnings('ignore')

In [6]:
def main_func(input_df, data_type):

    # cleaning
    filtered_df = filter_explore(input_df)
    filtered_topics = filter_clean(filtered_df, filter_col, s)

    #assign topic
    return_df = match_topic(filtered_topics, key_dict)

    # once again drop duplicated bodies
    return_df = return_df.drop_duplicates(subset='Body')
    print('Remove Duplicates: done')

    # check dataset size
    print('\nCurrent no. rows:',len(return_df))
    print('Topics size:\n{}'.format(return_df['Topic'].value_counts()))

    # count stat
    print('\nData Summary:')
    count_result(return_df)
    
    # save .txt
    save_topic_csv(return_df,data_type, key_dict)
    
    return return_df

# Processing

In [9]:
clean_train_df = main_func(train_df, 'train')

Current no. rows: 169081

Filtered Material: done

current no. rows: 106286

Exploring Types of Material
For editorial, there are 2801 articles in total. 578 articles lack Descriptors. 209 articles lack General Online Descriptors.              171 articles does not have any descriptors at all.
For op-ed, there are 2601 articles in total. 626 articles lack Descriptors. 222 articles lack General Online Descriptors.              212 articles does not have any descriptors at all.
For letter, there are 12418 articles in total. 4845 articles lack Descriptors. 4866 articles lack General Online Descriptors.              4839 articles does not have any descriptors at all.
For none, there are 88466 articles in total. 24116 articles lack Descriptors. 8700 articles lack General Online Descriptors.              7780 articles does not have any descriptors at all.

Filtered Topics: done

Current no. rows: 17538
Remove Duplicates: done

Current no. rows: 16678
Topics size:
politics     5883
law       

In [8]:
clean_test_df = main_func(test_df, 'test')

Current no. rows: 26128

Filtered Material: done

current no. rows: 17549

Exploring Types of Material
For editorial, there are 370 articles in total. 3 articles lack Descriptors. 18 articles lack General Online Descriptors.              3 articles does not have any descriptors at all.
For op-ed, there are 221 articles in total. 12 articles lack Descriptors. 12 articles lack General Online Descriptors.              12 articles does not have any descriptors at all.
For letter, there are 1247 articles in total. 71 articles lack Descriptors. 102 articles lack General Online Descriptors.              71 articles does not have any descriptors at all.
For none, there are 15711 articles in total. 356 articles lack Descriptors. 857 articles lack General Online Descriptors.              356 articles does not have any descriptors at all.

Filtered Topics: done

Current no. rows: 3700
Remove Duplicates: done

Current no. rows: 3700
Topics size:
politics     1374
law           610
military      60

In [12]:
clean_all_df = main_func(pd.concat([train_df,test_df]), 'all')

Current no. rows: 195209

Filtered Material: done

current no. rows: 123835

Exploring Types of Material
For editorial, there are 3171 articles in total. 581 articles lack Descriptors. 227 articles lack General Online Descriptors.              174 articles does not have any descriptors at all.
For op-ed, there are 2822 articles in total. 638 articles lack Descriptors. 234 articles lack General Online Descriptors.              224 articles does not have any descriptors at all.
For letter, there are 13665 articles in total. 4916 articles lack Descriptors. 4968 articles lack General Online Descriptors.              4910 articles does not have any descriptors at all.
For none, there are 104177 articles in total. 24472 articles lack Descriptors. 9557 articles lack General Online Descriptors.              8136 articles does not have any descriptors at all.

Filtered Topics: done

Current no. rows: 21238
Remove Duplicates: done

Current no. rows: 19623
Topics size:
politics     7009
law      