In [1]:
import pandas as pd
import os
import numpy as np

<b> 1: Read the files </b>

I used the ExcelFile function because the input is an xlsx workbook with multiple sheets.

In [2]:
# the workbook has multiple sheets, so read each one through ExcelFile
def read_data(file_name):
    xl = pd.ExcelFile(file_name)
    data_dic = dict()
    for sheet_name in xl.sheet_names:
        data_dic[sheet_name] = pd.read_excel(xl, sheet_name)
    return data_dic

file_name = 'Interview Dataset.xlsx'
data_dic = read_data(file_name)

In [3]:
for key in data_dic.keys():
    print(len(data_dic[key]))

15
40
8


There are 63 articles from 3 different databases. Before starting any kind of analysis, let's examine the data for missing values and other inconsistencies.

In [4]:
null_data = dict()
for key in data_dic.keys():
    null_data[key] = data_dic[key][data_dic[key].isnull().any(axis=1)]
    print(key, data_dic[key].columns[data_dic[key].isnull().any()])
    print(key, null_data[key].index)

NYT ProQuest Index(['author', 'body'], dtype='object')
NYT ProQuest Int64Index([5, 7, 10, 12], dtype='int64')
NYT Factiva Index(['author', 'LP', 'TD'], dtype='object')
NYT Factiva Int64Index([7, 27, 30, 32, 37], dtype='int64')
WP ProQuest Index(['author'], dtype='object')
WP ProQuest Int64Index([0, 2, 3, 5], dtype='int64')


Several rows have missing values, in the columns author, body, LP and TD. For author, I simply fill in the value 'Unknown', since it won't affect the analysis greatly. The body, however, is necessary for the analysis, so I drop the rows where body is empty from the main dataset and save them separately so that they can be corrected later or discarded. I follow the same procedure for the rows where both LP and TD are empty. Had there been rows where only one of LP or TD was empty, I would have replaced the NaN with a blank value (' ') so that the other column's text is not lost when the two are combined. That is not the case in this dataset, so I joined them as they are.
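The fallback described above, for rows where only one of LP or TD is missing, could look like the following minimal sketch. The toy frame and the helper name `combine_text` are hypothetical; only the column names LP and TD come from the dataset. Unlike the combine function used in this notebook, this version also puts a single space between the two parts, which is an assumption about the desired output.

```python
import numpy as np
import pandas as pd

def combine_text(data, column1, column2):
    """Join two text columns, treating NaN as an empty string (hypothetical helper)."""
    left = data[column1].fillna("")
    right = data[column2].fillna("")
    # one space between the parts, then trim the space that is
    # left over when one side was empty
    return (left + " " + right).str.strip()

# toy data, not taken from the real dataset
toy = pd.DataFrame({
    "LP": ["Lead paragraph.", np.nan, "Only a lead."],
    "TD": ["Rest of the text.", "Only the rest.", np.nan],
})
toy["body"] = combine_text(toy, "LP", "TD")
print(toy["body"].tolist())
```

With this approach a row where both columns are NaN ends up as an empty string rather than NaN, so such rows would have to be detected with a length check instead of isnull().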

<b>a.	For the Factiva data, the article text is split into the “LP” (lead paragraph) and “TD” (text after lead paragraph) columns and will need to be combined. </b>

In [5]:
#combine data and make header same
def combine(data, column1, column2):
    data[column1 + column2] = data[column1].astype(str) + data[column2]
combine(data_dic['NYT Factiva'], 'LP', 'TD')
data_dic['NYT Factiva'].drop('LP', axis = 1, inplace = True)
data_dic['NYT Factiva'].drop('TD', axis = 1, inplace = True)

This works, and vectorized string concatenation is a fast way to combine text held in pandas columns. Next, I renamed and reordered the columns to match the other databases and then checked that the column names are identical. If all is well, the following cell will print True.

In [6]:
#change formatting to match with other datasets 
data_dic['NYT Factiva'].rename(columns = {'pub': 'publication', 'title': 'headline', 'LPTD':'body'}, inplace = True)
columns = ['access_num', 'publication', 'dataset', 'headline', 'author', 'pubdate', 'body', 'wordcount']
data_dic['NYT Factiva'] = data_dic['NYT Factiva'][columns]
all(data_dic['NYT Factiva'].columns == data_dic['WP ProQuest'].columns)
#the headers match! Yay!

True

Great, it's True. Next, I fill in the NaN values for author and drop the rows with empty body.

In [7]:
db1, db2, db3 = data_dic.keys()
merged_data = pd.concat([data_dic[db1], data_dic[db2], data_dic[db3]])
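# assign the result back; calling replace(..., inplace=True) on a selected column is unreliable in newer pandas
merged_data['author'] = merged_data['author'].fillna('Unknown')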
missing_rows_backed_up = merged_data[merged_data['body'].isnull()]
merged_data = merged_data[~merged_data['body'].isnull()]
merged_data.reset_index(drop=True, inplace = True)
merged_data.head()

Unnamed: 0,access_num,publication,dataset,headline,author,pubdate,body,wordcount
0,I5U9AAMC,The New York Times,ProQuest,Personality: The Kelly Girls Are Two Men: They...,"COOPER, RICHARD",1963-04-14,﻿Personality: The Kelly Girls Are Two Men: The...,1519
1,94SF4ZHL,The New York Times,ProQuest,Computers Are Getting Ideas From Women: I.B.M....,"VARTAN, VARTANIG G.",1964-03-12,﻿Computers Are Getting Ideas From Women: I.B.M...,901
2,N4SG5ZZT,The New York Times,ProQuest,Big Business Discovers The Business of Charm,"COOK, JOAN",1966-01-28,﻿Big Business Discovers The Business of Charm\...,781
3,NGLR7NTL,The New York Times,ProQuest,"Woman, 48, Is Slain During L.i. Robbery","Times, Special to The New York",1967-12-30,"﻿WOMAN, 48, IS SLAIN DURING L.I. ROBBERY\nSpec...",281
4,QPYGDSN9,The New York Times,ProQuest,Door to Executive Suite Opens Wider to Working...,"BENDER, MARYLIN",1969-01-06,﻿Door to Executive Suite Opens Wider to Workin...,1248


In [8]:
len(merged_data)

61

In [9]:
merged_data['body'].iloc[0]

'\ufeffPersonality: The Kelly Girls Are Two Men: They Head Supplier of <span ...\nBy RICHARD COOPER\nNew York Times (1923-Current file); Apr 14, 1963; ProQuest Historical Newspapers: The New York Times pg. 133\nThe Kelly Girls Are Two Men\nThey Head Supplier I of Women Who Do I\n\' ■\t’\t\' I\nPart-Time Work\n*/\'\n\nCompany President] Destroyed Red Tape During the War\ni\nBy RICHARD COOPER\nwnnani nusseii Keuy is a man wrho has made a full-time career out of part-time work. He and his brother Richard are the Kellys in Kelly Girl Service, Inc.\nWhat is Kelly Girl? It is the nation’s leading supplier of women who work on a temporary basis in a score of jobs, ranging from clerical operations to complex office procedures; from hostesses at conventions and shopping center openings to tasting and sampling cookies and other goodies.\nIt is a company that last yeap billed Its clients for $26,900,000 for services and supplied jobs for more than 75,000 employes. This compares with $92,000 in sa

In [10]:
merged_data['body'].iloc[60]

'Women May Work Overtime In District, Legal Aide Rules: District Overtime Law Ruled Discrimina By Richard Prince; The , Times Herald (1959-1973); Apr 3, 1970; : The .-C1__________________________________________________________________________ Women May Work Overtime In District, Legal Aide Rules By Richard Prince Sta/t Writer Washington\'s chief legal officer has ruled that women may work long hours and long weeks, the same as men. Corporation Counsel Charles T. Duncan declared that a 1914 law limiting women in the city t& eight-hour days and six-day weeks in certain industries conflicted with sex discrimina tion provisions of the Civil Rights Act of 1964. The old statute applies to women "employed in any manufacturing, mechanical or mercantile establishment, laundry, hotel or restaurant, or telegraph or telephone establishment or office, or by any express or transportation company in the District of Columbia." "Regardless of the protective purpose for which the \'female 8-hour law* m

In [11]:
merged_data['body'].nunique()

58

There are only 58 unique bodies among the 61 rows, which means 3 rows are exact duplicates. We can remove these exact duplicates in this step and then further de-duplicate the dataset for fuzzy matches. On much larger datasets this would also speed up the fuzzy matching, since fewer pairs need to be compared.

In [12]:
full_data = merged_data.copy(deep=True)
merged_data.drop_duplicates(subset = 'body', inplace = True)
merged_data.reset_index(drop=True, inplace = True)
len(full_data)

61
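Before dropping exact duplicates it can be useful to look at which rows are involved. A minimal sketch on toy data (the frame below is made up; only the column names access_num and body follow the dataset) uses `duplicated` with `keep=False` to flag every member of a duplicate group:

```python
import pandas as pd

# toy data, not taken from the real dataset
toy = pd.DataFrame({
    "access_num": ["A1", "B2", "C3", "D4"],
    "body": ["same text", "other text", "same text", "third text"],
})

# keep=False marks all rows of a duplicate group, not only the repeats
dupe_mask = toy.duplicated(subset="body", keep=False)
exact_dupes = toy[dupe_mask]
print(exact_dupes)

# same call as in the notebook: keep the first row of each group
deduped = toy.drop_duplicates(subset="body").reset_index(drop=True)
print(len(toy), "->", len(deduped))
```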

<b>Pre-processing body data</b>

Upon examining a few rows of the body column, I noticed that the data is not plain text: it contains encoding debris and control characters such as \ufeff (a byte order mark), \t and \n, as well as HTML remnants such as an opening span tag. The block above also has several typos and some strange words, probably from imprecise OCR. My next step is to remove all special characters and convert all the text to lowercase. I use the apply function to do this, as it is more concise and usually faster than iterating through the dataframe row by row. All the pre-processed text is stored in a new column, 'pp_body'.

In [13]:
import re
def remove_special (line):
    # raw string so that \W is passed to the regex engine unchanged
    return re.sub(r'\W+', ' ', line.lower())
merged_data['pp_body'] = merged_data['body'].apply(remove_special)

In [14]:
merged_data['pp_body'].iloc[0]

' personality the kelly girls are two men they head supplier of span by richard cooper new york times 1923 current file apr 14 1963 proquest historical newspapers the new york times pg 133 the kelly girls are two men they head supplier i of women who do i i part time work company president destroyed red tape during the war i by richard cooper wnnani nusseii keuy is a man wrho has made a full time career out of part time work he and his brother richard are the kellys in kelly girl service inc what is kelly girl it is the nation s leading supplier of women who work on a temporary basis in a score of jobs ranging from clerical operations to complex office procedures from hostesses at conventions and shopping center openings to tasting and sampling cookies and other goodies it is a company that last yeap billed its clients for 26 900 000 for services and supplied jobs for more than 75 000 employes this compares with 92 000 in sales in 1947 kelly girl s first full year of operation the man 

In [15]:
merged_data['pp_body'].iloc[-1]

'women may work overtime in district legal aide rules district overtime law ruled discrimina by richard prince the times herald 1959 1973 apr 3 1970 the c1__________________________________________________________________________ women may work overtime in district legal aide rules by richard prince sta t writer washington s chief legal officer has ruled that women may work long hours and long weeks the same as men corporation counsel charles t duncan declared that a 1914 law limiting women in the city t eight hour days and six day weeks in certain industries conflicted with sex discrimina tion provisions of the civil rights act of 1964 the old statute applies to women employed in any manufacturing mechanical or mercantile establishment laundry hotel or restaurant or telegraph or telephone establishment or office or by any express or transportation company in the district of columbia regardless of the protective purpose for which the female 8 hour law may have been originally enacted d

The data still has lots of underscores, because \W does not match the underscore (it counts as a word character), and some extra spaces. I use the next function to deal with that.

In [16]:
def remove_repeated(line):
    line = re.sub('_+', ' ', line)
    return re.sub(r'\s+', ' ', line)
merged_data['pp_body'] = merged_data['pp_body'].apply(remove_repeated)

In [17]:
merged_data['pp_body'].iloc[0]

' personality the kelly girls are two men they head supplier of span by richard cooper new york times 1923 current file apr 14 1963 proquest historical newspapers the new york times pg 133 the kelly girls are two men they head supplier i of women who do i i part time work company president destroyed red tape during the war i by richard cooper wnnani nusseii keuy is a man wrho has made a full time career out of part time work he and his brother richard are the kellys in kelly girl service inc what is kelly girl it is the nation s leading supplier of women who work on a temporary basis in a score of jobs ranging from clerical operations to complex office procedures from hostesses at conventions and shopping center openings to tasting and sampling cookies and other goodies it is a company that last yeap billed its clients for 26 900 000 for services and supplied jobs for more than 75 000 employes this compares with 92 000 in sales in 1947 kelly girl s first full year of operation the man 

This looks decent, but many inconsistencies remain, for example OCR misreadings such as 'wnnani nusseii keuy' and 'yeap'. Some could be fixed with further regex rules; others would require redoing the OCR or manual data entry on the original newspapers.
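The two cleaning passes above could also be folded into a single one. The sketch below is an alternative, not the code used in this notebook: the function name `clean_text` and the sample string are made up, and it additionally strips leading and trailing spaces, which the original functions do not do.

```python
import re

# [\W_] matches every non-word character plus the underscore,
# which \W alone leaves in place
_NON_WORD = re.compile(r"[\W_]+")

def clean_text(line):
    """Lowercase, replace runs of non-alphanumeric characters with one space, trim."""
    return _NON_WORD.sub(" ", line.lower()).strip()

sample = "\ufeffWomen May Work\tOvertime .-C1____ In District"
print(clean_text(sample))
```

Compiling the pattern once avoids re-parsing it for every row when the function is passed to apply.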

In [18]:
merged_data

Unnamed: 0,access_num,publication,dataset,headline,author,pubdate,body,wordcount,pp_body
0,I5U9AAMC,The New York Times,ProQuest,Personality: The Kelly Girls Are Two Men: They...,"COOPER, RICHARD",1963-04-14,﻿Personality: The Kelly Girls Are Two Men: The...,1519,personality the kelly girls are two men they ...
1,94SF4ZHL,The New York Times,ProQuest,Computers Are Getting Ideas From Women: I.B.M....,"VARTAN, VARTANIG G.",1964-03-12,﻿Computers Are Getting Ideas From Women: I.B.M...,901,computers are getting ideas from women i b m ...
2,N4SG5ZZT,The New York Times,ProQuest,Big Business Discovers The Business of Charm,"COOK, JOAN",1966-01-28,﻿Big Business Discovers The Business of Charm\...,781,big business discovers the business of charm ...
3,NGLR7NTL,The New York Times,ProQuest,"Woman, 48, Is Slain During L.i. Robbery","Times, Special to The New York",1967-12-30,"﻿WOMAN, 48, IS SLAIN DURING L.I. ROBBERY\nSpec...",281,woman 48 is slain during l i robbery special ...
4,QPYGDSN9,The New York Times,ProQuest,Door to Executive Suite Opens Wider to Working...,"BENDER, MARYLIN",1969-01-06,﻿Door to Executive Suite Opens Wider to Workin...,1248,door to executive suite opens wider to workin...
5,GRUFAYXH,The New York Times,ProQuest,Togo Women Give Economy Vitality,Unknown,1970-01-30,﻿Togo Women Give Economy Vitality\nNew York Ti...,375,togo women give economy vitality new york tim...
6,GY6IPTEH,The New York Times,ProQuest,A Woman's Place May Be on the Production Line:...,"GRAHAM, -- FRED P.",1971-01-31,﻿A Woman's Place May Be on the Production Line...,760,a woman s place may be on the production line...
7,LT3446BU,The New York Times,ProQuest,A Bay State Woman Prefers Atlantic Flights Solo,"Times, Special to The New York",1973-01-21,﻿A Bay State Woman Prefers Atlantic Flights So...,931,a bay state woman prefers atlantic flights so...
8,J7CKEXA8,The New York Times,ProQuest,Kirby at Westinghouse; Coldwell for Fed: Peopl...,"Cray, Douglas W.",1974-09-27,﻿People and Business: Strauss Cites Optimism o...,874,people and business strauss cites optimism on...
9,QEUXWMMK,The New York Times,ProQuest,Dismissal of a Clerk For Long Hair Held Sex Di...,Unknown,1975-10-25,﻿Dismissal of a Clerk For Long Hair Held Sex D...,278,dismissal of a clerk for long hair held sex d...


<b> 2.	De-duplicate the dataset. Some articles may be almost the same but not a 100% match. We want to de-duplicate those as well. </b>

To de-duplicate I looked into several packages. I found this implementation of shingling to be the easiest to implement and understand, but there are packages such as dedupe that use machine learning and might be faster than this method.

In [19]:
#Source for this code: https://mattilyra.github.io/2017/05/23/document-deduplication-with-lsh.html

import itertools
# from lsh import lsh, minhash # https://github.com/mattilyra/lsh

# a pure python shingling function that will be used in comparing
# LSH to true Jaccard similarities
def get_shingles(text, char_ngram=5):
    """Create a set of overlapping character n-grams.
    
    Only full length character n-grams are created, that is the first character
    n-gram is the first `char_ngram` characters from text, no padding is applied.

    Each n-gram is spaced exactly one character apart.

    Parameters
    ----------

    text: str
        The string from which the character n-grams are created.

    char_ngram: int (default 5)
        Length of each character n-gram.
    """
    return set(text[head:head + char_ngram] for head in range(0, len(text) - char_ngram))


def jaccard(set_a, set_b):
    """Jaccard similarity of two sets.
    
    The Jaccard similarity is defined as the size of the intersection divided by
    the size of the union of the two sets.

    Parameters
    ---------
    set_a: set
        Set of arbitrary objects.

    set_b: set
        Set of arbitrary objects.
    """
    intersection = set_a & set_b
    union = set_a | set_b
    return len(intersection) / len(union)
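To make the two helpers above concrete, here is a small worked example on two short strings. It is a self-contained sketch: `char_shingles` is a hypothetical stand-in for get_shingles that uses 3-character n-grams for readability and includes the final n-gram of the text (as written, the range in get_shingles ends at len(text) - char_ngram, so the last n-gram is left out there).

```python
def char_shingles(text, n=3):
    """Set of all overlapping character n-grams, including the last one."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    return len(set_a & set_b) / len(set_a | set_b)

a = char_shingles("women at work")
b = char_shingles("woman at work")

shared = a & b
print(sorted(shared))          # n-grams present in both strings
print(len(a), len(b), len(shared))
sim = jaccard_similarity(a, b)
print(round(sim, 4))
```

Each string yields 11 distinct 3-grams and the two share 8 of them, so the union has 14 elements and the similarity is 8/14, roughly 0.57. A single changed letter affects three 3-grams in each string, which is why short strings score much lower than long articles with the same number of edits.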

I wrote a function to perform the de-duplication using the Jaccard index. It compares every pair of documents and flags a pair as duplicates when their similarity is at or above the threshold, which can be raised or lowered based on the needs of the researcher. To test the function, I took a block of text from body, randomly removed markup flags, special characters and spaces from it, and compared the result with the original; the function returned a similarity of 0.96 between them.

In [20]:
#de-dupe test
def dedupe(data, threshold = 0.75):
    shingles = []
    for article_text in data:
        shingles.append(get_shingles(article_text.lower()))
    duplicates = []
    drop_rows = []
    for i_doc in range(len(shingles)):
        for j_doc in range(i_doc + 1, len(shingles)):
            jaccard_similarity = jaccard(shingles[i_doc], shingles[j_doc])
            is_duplicate = jaccard_similarity >= threshold
            if is_duplicate:
                print(i_doc, j_doc)
                duplicates.append((i_doc, j_doc, jaccard_similarity))
                drop_rows.append(j_doc)
    return(duplicates, drop_rows)

#test
s1 = merged_data['body'].iloc[8]
s2 = 'People and Business: Strauss Cites Optimism on World Trade New York Times (1923-Current file); Sep 16, 1977; ProQuest Historical Newspapers: The New York Times pg. 88 People and Business Strauss Cites Optimism on World Trade Robert S. Strauss, President Carter’s special representative to the international trade negotiations in Geneva, told a meeting of 600 world business leaders in San Francisco that there is “reason for cautious optimism” on the international trade front, but added that some short-term domestic interests must be cast aside in. order to achieve present goals. Declaring that such an action would be "economic cowardice,” Mr. Strauss emphasized that the United States has no intention of leaving the bargaining tables in Geneva. He was the keynote speaker at the concluding session of the International Industrial Conference, a week-long meeting. The conference is co-sponsored by the Conference Board of New York and the Stanford Research Institute of Menlo Park, Calif.The United States does not seem to be heading toward another recession, according to Alan Greenspan, chairman of the Council of Economic Advisers under President Nixon and Ford, and now president of the economic consulting firm of Townsend-Greenspan & Company.Speaking at a meeting here of the Financial Women’s Association of New York yesterday Mr. Greenspan predicted that the inflation rate next year will be about 6 percent. The nation’s gross national product, adjusted for inflation, he said, will rise about 4.7 percent. “I can’t find the evidence” to support the view that the nation is sliding toward another recession.” he said. There is n state lethargy developing” but the reduced rate of inflation so far is encouraging, he added.  Urging Government leaders to contain the money supply, Mr. Greenspan declared that he could see no reason why the inflation rate could not be lowered to around 3 percent or less within two years. 
Meantime in Zurich, Leland Prussia, executive vice president of the Bank of America, said he thinks United States prime rates may have peaked with the increase to 7*4 percent by the Chase Manhattan Bank earlier this week. He said he did not feel that this increase would be followed across the boards by other banks in the nation. David M. Culver, 52 years old, has been elected president of Alcan Aluminium Ltd., one of the world’s largest aluminum producers, it was announced yesterday by Nathanael V. Davis, chairman and chief executive officer. Mr. Culver succeeds Paul H. Leman, who was elected vice chairman of the board. Both appointments are effective Oct. 1. Effective Jan. 1, 1978, Patrick J. J. Rich, 46, will become regional executive vice president, Western Hemisphere. As such, he will oversee Alcan operations and area general managers in Canada, the United States and the Caribbean, succeeding Mr. Culver in this role. He will retain his present responsibilities for Latin America. Burke Knapp, senior vice president, operations, of the World Bank, who will become 65 in 1978, will leave the bank at the end of its fiscal year on June 30, 1978, after 25 years service. Effective July 1, 1978, I. T. M. Cargill of Britain, will be appointed senior vice president, finance. In this capacity he will continue his present responsibilities for the bank\'s financial activities and will serve as acting president and chairman of the board in the absence of Robert S. McNamara, president and chairman. On the same day, July 1, 1978, Ernest Stern, an American citizen, will be appointed vice president of operations.\nThis Oct. 1, two other officers will be promoted. Eugene Rotberg, an American who is presently treasurer of the bank, will become vice president and treasurer, and Purviz Damry, a citizen of India, presently secretary of the bank, will become vice president and secretary.\nHoward M. 
Peck has been named by the board of Seatrain Lines Inc., operator of ocean-going vessels and tankers, as vice chairman of the board and chairman of the executive committee. The board also appointed Stephen Russell as president, the post formerly held by Mr. Pack. The new president had been executive vice president and chief operating officer. He will continue in the latter capacity.\n•\nRoger C. Greene, 60 years old, will retire as vice chairman of the Peavey Company, Minneapolis, on March 1. The chairman and chief executive officer of the company, Fritz Corrigan, had previously said he would retire on Jan. 1, 1978. Peavey is a major producer of flour and flour products.\nWilliam G. Stocks, successor to Mr. Corrigan, already has realigned the duties for Peavey executive vice presidents George K. Gosko and Frank T. Heffelfinger. They, with Mr. Stocks, will comprise a new three-man president’s office.\n•\nJOBS: Kenneth W. Maxfield has been appointed president and chief operating officer of North American Van Lines Inc., the nation’s sixth-largest motor carrier, Fort Wayne, Ind.... Edward J. Wilsmann has been appointed president and chief operating officer of the Olsten Corporation, He succeeds as president William Olsten, who continues\' as chairman and chief executive officer of the company .... Vico E. Henriques has been named president of the Computer and Business Equipment Manufacturers Association,, effective Oct. 1. succeeding Peter F. McCloskey. The latter has been appointed president of the Electronics Industry Association. . . . Donald R. Whalen has been appointed staff vice president, operations analysis for the RCA Corporation.\tJAMES J. NAGLE\nReproduced with permission of the copyright owner. Further reproduction prohibited without permission.'
dedupe([s1, s2])

0 1


([(0, 1, 0.96175799086758)], [1])

In [21]:
#another test (DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat instead)
test = pd.DataFrame()
test = test.append(merged_data.iloc[0])
test = test.append(merged_data.iloc[0])
test = test.append(merged_data.iloc[0])
test = test.reset_index(drop=True)
_, drop_rows = dedupe(test['pp_body'])
print(drop_rows)
test.drop(drop_rows, axis = 0, inplace = True)
test

0 1
0 2
1 2
[1, 2, 2]


Unnamed: 0,access_num,author,body,dataset,headline,pp_body,pubdate,publication,wordcount
0,I5U9AAMC,"COOPER, RICHARD",﻿Personality: The Kelly Girls Are Two Men: The...,ProQuest,Personality: The Kelly Girls Are Two Men: They...,personality the kelly girls are two men they ...,1963-04-14,The New York Times,1519.0


Once I confirmed it was working as expected, I ran it on the entire dataset. No near matches were found. 

In [22]:
duplicates, drop_rows = dedupe(merged_data['pp_body'])
print("Found duplicates with index:", duplicates)
print("Deleting rows: ", drop_rows)
print(duplicates, drop_rows )

Found duplicates with index: []
Deleting rows:  []
[] []


Because the exact duplicates were already removed and there are no near duplicates, the function found nothing to report. To verify that it works as intended, I ran it on the backup of the data taken before the duplicates were dropped. As seen below, it finds three duplicate pairs, matching the three exact duplicates removed earlier.

In [23]:
f_dup, f_drop = dedupe(full_data['body'])
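# report the results of this run (f_dup, f_drop), not the variables left over from In [22]
print("Found duplicates with index:", f_dup)
print("Deleting rows: ", f_drop)
print(f_dup, f_drop)

8 11
22 23
33 34
Found duplicates with index: [(8, 11, 1.0), (22, 23, 1.0), (33, 34, 1.0)]
Deleting rows:  [11, 23, 34]
[(8, 11, 1.0), (22, 23, 1.0), (33, 34, 1.0)] [11, 23, 34]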


In [24]:
merged_data.drop(drop_rows, inplace = True)
merged_data.reset_index(drop = True, inplace = True)
len(merged_data)

58

In [25]:
merged_data

Unnamed: 0,access_num,publication,dataset,headline,author,pubdate,body,wordcount,pp_body
0,I5U9AAMC,The New York Times,ProQuest,Personality: The Kelly Girls Are Two Men: They...,"COOPER, RICHARD",1963-04-14,﻿Personality: The Kelly Girls Are Two Men: The...,1519,personality the kelly girls are two men they ...
1,94SF4ZHL,The New York Times,ProQuest,Computers Are Getting Ideas From Women: I.B.M....,"VARTAN, VARTANIG G.",1964-03-12,﻿Computers Are Getting Ideas From Women: I.B.M...,901,computers are getting ideas from women i b m ...
2,N4SG5ZZT,The New York Times,ProQuest,Big Business Discovers The Business of Charm,"COOK, JOAN",1966-01-28,﻿Big Business Discovers The Business of Charm\...,781,big business discovers the business of charm ...
3,NGLR7NTL,The New York Times,ProQuest,"Woman, 48, Is Slain During L.i. Robbery","Times, Special to The New York",1967-12-30,"﻿WOMAN, 48, IS SLAIN DURING L.I. ROBBERY\nSpec...",281,woman 48 is slain during l i robbery special ...
4,QPYGDSN9,The New York Times,ProQuest,Door to Executive Suite Opens Wider to Working...,"BENDER, MARYLIN",1969-01-06,﻿Door to Executive Suite Opens Wider to Workin...,1248,door to executive suite opens wider to workin...
5,GRUFAYXH,The New York Times,ProQuest,Togo Women Give Economy Vitality,Unknown,1970-01-30,﻿Togo Women Give Economy Vitality\nNew York Ti...,375,togo women give economy vitality new york tim...
6,GY6IPTEH,The New York Times,ProQuest,A Woman's Place May Be on the Production Line:...,"GRAHAM, -- FRED P.",1971-01-31,﻿A Woman's Place May Be on the Production Line...,760,a woman s place may be on the production line...
7,LT3446BU,The New York Times,ProQuest,A Bay State Woman Prefers Atlantic Flights Solo,"Times, Special to The New York",1973-01-21,﻿A Bay State Woman Prefers Atlantic Flights So...,931,a bay state woman prefers atlantic flights so...
8,J7CKEXA8,The New York Times,ProQuest,Kirby at Westinghouse; Coldwell for Fed: Peopl...,"Cray, Douglas W.",1974-09-27,﻿People and Business: Strauss Cites Optimism o...,874,people and business strauss cites optimism on...
9,QEUXWMMK,The New York Times,ProQuest,Dismissal of a Clerk For Long Hair Held Sex Di...,Unknown,1975-10-25,﻿Dismissal of a Clerk For Long Hair Held Sex D...,278,dismissal of a clerk for long hair held sex d...


<b>3.	Only keep articles that contain at least two occurrences of the keywords in these two lists: [women, woman], [employer, employee, employment]. Plural versions or suffixes are included such as “employers” or “women’s”
a.	Example: “the woman found employment” – invalid, does not meet threshold 
b.	Example: “the employer talked with women about hiring more woman employees" – valid, meets threshold </b>

To accomplish this I thought of two methods:
- Add each keyword to the list along with its "apostrophe s" and "s" variants. If the list of keywords stays this small and we are sure that those are the only variants, this would be the simplest and fastest method. Since I wasn't sure that those were the only variants, I proceeded with a different method.
- Use WordNetLemmatizer and word_tokenize from the nltk package to normalize the text. This ensures that inflected forms such as "employers" are reduced to their base form ("employer"). I then used the Counter collection to count the words in the body of each row and, for each keyword list, checked whether the total count of its keywords was < 2. Rows with a total < 2 for either list were dropped.

I chose lemmatization over stemming for this reason: "Lemmatization is a more effective option than stemming because it converts the word into its root word, rather than just stripping the suffixes. It makes use of the vocabulary and does a morphological analysis to obtain the root word. Therefore, we usually prefer using lemmatization over stemming."
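For comparison, the first method (listing the variants explicitly) can be sketched with the standard library alone. This is not the approach used in the notebook; the function names are made up, and the assumption that a plain "s" suffix covers all variants is exactly the uncertainty mentioned above. Possessives such as "women's" are handled because the tokenizer splits on the apostrophe.

```python
import re

KEYWORD_LISTS = [["women", "woman"], ["employer", "employee", "employment"]]

def keyword_counts(text, keyword_lists=KEYWORD_LISTS):
    """Count, per keyword list, tokens equal to a keyword or keyword + 's'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = []
    for keywords in keyword_lists:
        allowed = set()
        for keyword in keywords:
            allowed.update({keyword, keyword + "s"})
        counts.append(sum(token in allowed for token in tokens))
    return counts

def meets_threshold(text, minimum=2):
    """True when every keyword list reaches the minimum count."""
    return all(count >= minimum for count in keyword_counts(text))

invalid = "the woman found employment"
valid = "the employer talked with women about hiring more woman employees"
print(keyword_counts(invalid), meets_threshold(invalid))
print(keyword_counts(valid), meets_threshold(valid))
```

On the two examples from the task statement, the first sentence scores one hit per list and is rejected, while the second scores two hits per list and is kept.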


In [26]:
keyword_lists = [['women', 'woman'], ['employer', 'employee', 'employment']]

In [27]:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter
wnl = WordNetLemmatizer()
drop_rows = []
for i in range(len(merged_data)):
    # lemmatize every token so that inflected forms map to their base form
    tokens = word_tokenize(str(merged_data.iloc[i]['pp_body']).lower())
    tokenized_body = [wnl.lemmatize(token) for token in tokens]
    word_count = Counter(tokenized_body)
    for keyword_list in keyword_lists:
        # total occurrences of any keyword from this list
        sum_ = sum(word_count[keyword] for keyword in keyword_list)
        # a row that fails both lists is appended twice
        if sum_ < 2:
            drop_rows.append(i)

In [28]:
#all these rows do not meet the 2 keyword expectation
print(drop_rows)

[1, 2, 3, 7, 8, 8, 9, 11, 12, 13, 16, 20, 22, 23, 24, 26, 27, 29, 30, 31, 32, 34, 35, 37, 38, 43, 44, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 55, 56]


In [29]:
#test to see if the counter worked as expected
tokenized_body = [wnl.lemmatize(i) for i in word_tokenize(str(merged_data.iloc[50]['pp_body']).lower())]
print(Counter(tokenized_body)['woman'])
print(Counter(tokenized_body)['women'])
print(Counter(tokenized_body)['employer'])
print(Counter(tokenized_body)['employee'])
print(Counter(tokenized_body)['employment'])

9
0
0
0
0


In [30]:
merged_data.drop(drop_rows, axis = 0, inplace = True)
merged_data

Unnamed: 0,access_num,publication,dataset,headline,author,pubdate,body,wordcount,pp_body
0,I5U9AAMC,The New York Times,ProQuest,Personality: The Kelly Girls Are Two Men: They...,"COOPER, RICHARD",1963-04-14,﻿Personality: The Kelly Girls Are Two Men: The...,1519,personality the kelly girls are two men they ...
4,QPYGDSN9,The New York Times,ProQuest,Door to Executive Suite Opens Wider to Working...,"BENDER, MARYLIN",1969-01-06,﻿Door to Executive Suite Opens Wider to Workin...,1248,door to executive suite opens wider to workin...
5,GRUFAYXH,The New York Times,ProQuest,Togo Women Give Economy Vitality,Unknown,1970-01-30,﻿Togo Women Give Economy Vitality\nNew York Ti...,375,togo women give economy vitality new york tim...
6,GY6IPTEH,The New York Times,ProQuest,A Woman's Place May Be on the Production Line:...,"GRAHAM, -- FRED P.",1971-01-31,﻿A Woman's Place May Be on the Production Line...,760,a woman s place may be on the production line...
10,Y6NGNI3Q,The New York Times,ProQuest,Washington and Business: Job-Equality Complain...,"Times, ERNEST HOLSENDOLPH Special to The New York",1976-06-14,﻿Washington and Business: Job-Equality Complai...,1273,washington and business job equality complain...
14,NYTF000020050518ddc60071d,The New York Times,Factiva,THE CONTROVERSY OVER INFANT FORMULA,By Stephen Solomon; Stephen Solomon is a contr...,1981-12-06,There is no third-world poverty in Coopersto...,4991,there is no third world poverty in cooperstow...
15,NYTF000020050513de1a00202,The New York Times,Factiva,Economic Affairs; AN AFFIRMATIVE LOOK AT HIRIN...,By BARBARA R. BERGMANN; Barbara R. Bergmann is...,1982-01-10,IS affirmative action a burden on American b...,931,is affirmative action a burden on american bu...
17,NYTF000020050506dh3o00k58,The New York Times,Factiva,"THE MARKET; FOR THE WOMEN, STILL A LONG WAY TO GO",By Peggy Schmidt; Peggy Schmidt is working on ...,1985-03-24,Although a gowing number of young women are ...,1468,although a gowing number of young women are s...
18,NYTF000020050502di1v00acd,The New York Times,Factiva,"FOR COLLEGE STUDENTS, JOB HUNTING WITH A TWIST",By MICHAEL E. ROSS,1986-01-31,Ninety-three young men and women arrived in ...,804,ninety three young men and women arrived in m...
19,NYTF000020050429dj3d00s4c,The New York Times,Factiva,FOSTERING WOMEN'S CAREERS,By CAROL LAWSON,1987-03-13,Now that women are entering the work force i...,765,now that women are entering the work force in...


<b>4.	How many total articles were published each month </b>

I'm assuming that only articles related to women's employment should be counted here.

In [31]:
merged_data['pubdate'].groupby([merged_data.pubdate.dt.month]).agg('count')

pubdate
1     6
3     5
4     4
5     1
6     1
8     1
9     1
11    1
12    1
Name: pubdate, dtype: int64

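Most articles were published at the start of the year: January (6), March (5) and April (4) account for 15 of the 21 articles. Note that this groups by calendar month across all years, not by individual year and month.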
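If "each month" is instead read as each individual year and month, the grouping key changes from the month number to a monthly period. The sketch below shows both readings side by side on the seven publication dates visible in the table above; it is an illustration of the difference, not a rerun on the full dataset.

```python
import pandas as pd

# publication dates of the seven rows shown in the table above
dates = pd.Series(pd.to_datetime([
    "1963-04-14", "1969-01-06", "1970-01-30", "1971-01-31",
    "1976-06-14", "1981-12-06", "1982-01-10",
]), name="pubdate")

# reading 1: calendar month, pooled across years (as in the notebook)
by_month_of_year = dates.groupby(dates.dt.month).count()
print(by_month_of_year)

# reading 2: one bucket per year-month
by_year_month = dates.groupby(dates.dt.to_period("M")).count()
print(by_year_month)
```

On this sample the first reading puts four articles into January, while the second yields seven separate buckets with one article each.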

In [32]:
merged_data['pubdate'].groupby([merged_data.publication]).agg('count')

publication
The New York Times     20
The Washington Post     1
Name: pubdate, dtype: int64

<b>5.	Save the final articles to disk for working with it in the future. The saved data should be ready to analyze or use in a machine learning model. </b>

In [33]:
def update_wordlen(line):
    return len(line.split(' '))
merged_data['wordcount'] = merged_data['pp_body'].apply(update_wordlen)
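
One thing worth checking in `update_wordlen`: `split(' ')` splits on every single space, so a leading space or a run of consecutive spaces produces empty strings that get counted as words, while `split()` with no argument splits on any run of whitespace. The recomputed value for the infant-formula article in the saved table (10294) is roughly double its earlier `wordcount` (4991), which would be consistent with empty tokens being counted, though that is only a guess. The sketch below uses a made-up string, not text from the dataset, to show the difference:

```python
# Illustrative string with a leading space and a double space
sample = " door to  executive suite"

tokens_single_space = sample.split(" ")   # keeps empty strings
tokens_whitespace = sample.split()        # drops empty strings

print(tokens_single_space)
print(tokens_whitespace)
print(len(tokens_single_space), len(tokens_whitespace))
```

Whether this matters for `pp_body` depends on how the text was preprocessed upstream.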

In [34]:
merged_data.to_csv('final.csv', encoding = 'utf-8-sig', index = False)

In [35]:
test = pd.read_csv('final.csv')
test

Unnamed: 0,access_num,publication,dataset,headline,author,pubdate,body,wordcount,pp_body
0,I5U9AAMC,The New York Times,ProQuest,Personality: The Kelly Girls Are Two Men: They...,"COOPER, RICHARD",1963-04-14,﻿Personality: The Kelly Girls Are Two Men: The...,1550,personality the kelly girls are two men they ...
1,QPYGDSN9,The New York Times,ProQuest,Door to Executive Suite Opens Wider to Working...,"BENDER, MARYLIN",1969-01-06,﻿Door to Executive Suite Opens Wider to Workin...,1322,door to executive suite opens wider to workin...
2,GRUFAYXH,The New York Times,ProQuest,Togo Women Give Economy Vitality,Unknown,1970-01-30,﻿Togo Women Give Economy Vitality\nNew York Ti...,411,togo women give economy vitality new york tim...
3,GY6IPTEH,The New York Times,ProQuest,A Woman's Place May Be on the Production Line:...,"GRAHAM, -- FRED P.",1971-01-31,﻿A Woman's Place May Be on the Production Line...,797,a woman s place may be on the production line...
4,Y6NGNI3Q,The New York Times,ProQuest,Washington and Business: Job-Equality Complain...,"Times, ERNEST HOLSENDOLPH Special to The New York",1976-06-14,﻿Washington and Business: Job-Equality Complai...,1334,washington and business job equality complain...
5,NYTF000020050518ddc60071d,The New York Times,Factiva,THE CONTROVERSY OVER INFANT FORMULA,By Stephen Solomon; Stephen Solomon is a contr...,1981-12-06,There is no third-world poverty in Coopersto...,10294,there is no third world poverty in cooperstow...
6,NYTF000020050513de1a00202,The New York Times,Factiva,Economic Affairs; AN AFFIRMATIVE LOOK AT HIRIN...,By BARBARA R. BERGMANN; Barbara R. Bergmann is...,1982-01-10,IS affirmative action a burden on American b...,1005,is affirmative action a burden on american bu...
7,NYTF000020050506dh3o00k58,The New York Times,Factiva,"THE MARKET; FOR THE WOMEN, STILL A LONG WAY TO GO",By Peggy Schmidt; Peggy Schmidt is working on ...,1985-03-24,Although a gowing number of young women are ...,1554,although a gowing number of young women are s...
8,NYTF000020050502di1v00acd,The New York Times,Factiva,"FOR COLLEGE STUDENTS, JOB HUNTING WITH A TWIST",By MICHAEL E. ROSS,1986-01-31,Ninety-three young men and women arrived in ...,903,ninety three young men and women arrived in m...
9,NYTF000020050429dj3d00s4c,The New York Times,Factiva,FOSTERING WOMEN'S CAREERS,By CAROL LAWSON,1987-03-13,Now that women are entering the work force i...,855,now that women are entering the work force in...
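
`read_csv` does not restore dtypes on its own, so `pubdate` comes back as plain strings after the round trip. A small sketch, using an in-memory buffer and made-up rows in place of `final.csv`, of how the date column could be parsed on load and checked:

```python
import io
import pandas as pd

# Toy frame standing in for merged_data (values are illustrative)
df = pd.DataFrame({
    "access_num": ["A1", "B2"],
    "pubdate": pd.to_datetime(["1963-04-14", "1969-01-06"]),
    "wordcount": [1550, 1322],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Without parse_dates the column is read back as strings
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates the datetime dtype is restored
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["pubdate"])

print(plain["pubdate"].dtype)
print(parsed["pubdate"].dtype)
```

With the real file, the equivalent call would be `pd.read_csv('final.csv', parse_dates=['pubdate'])`.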


In [36]:
merged_data['body'].iloc[5]

"  There is no third-world poverty in Cooperstown, N.Y. The big homes and manicured lawns give the aura of a town happily stalled in the security of the 1950's.  But Allan Cunningham, who had been a pediatrician for a tribe of Sioux Indians before moving to Cooperstown, became aware of something odd about his new patients at Mary Imogene Bassett Hospital: Nearly all of the sick in-fants he treated were formula-fed.  Dr. Cunningham's subsequent investigation, published as two studies in The Journal of Pediatrics showed that illness occurred twice as often among babies who were not breast-fed; in the first two months of life, the difference was 16-fold.  Dana Raphael was one of the first scientists to hold the formula companies responsible for high infant mortality, but then she decided it wasn't quite that simple. An anthropologist who heads the Human Lactation Center Ltd., in Westport, Conn., Mrs. Raphael changed her mind after a study team she led spent two years observing how women i

<b> Bonus: interesting patterns in the data </b>
- The data is still quite messy and could be cleaned further: correcting spelling errors, converting numbers to text, and removing single-character words other than "a" and "i". I could also remove stop words from the text, but I chose not to at this stage.
- January has more articles than any other month (6 of 21). This could be because jobs are typically added to the market in January, so there is more public interest in these kinds of articles.
- NYT tends to write much longer articles than WaPo. Some data points, such as row 5 shown above, contain the necessary keywords but aren't really related to women's employment. Since NYT articles tend to be longer, they are more likely to contain the keywords, so I think the probability of an NYT article being filtered out is lower than for WaPo. Different keyword rules may be required for the two publications.
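
The further cleaning mentioned in the first bullet could look roughly like the sketch below. The helper name `clean_tokens` and the tiny stop-word set are hypothetical placeholders, not part of the notebook above; a real run would more likely use a full stop-word list from a library.

```python
# Hypothetical helper: drops single-character tokens (except "a" and "i")
# and, optionally, a small illustrative set of stop words.
KEEP_SINGLE = {"a", "i"}
STOP_WORDS = {"the", "of", "to", "in", "and", "for", "is"}  # illustrative only

def clean_tokens(text, remove_stop_words=False):
    tokens = text.split()
    # Keep multi-character tokens plus the allowed single-character words
    tokens = [t for t in tokens if len(t) > 1 or t in KEEP_SINGLE]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

sample = "a woman s place may be on the production line"
print(clean_tokens(sample))
print(clean_tokens(sample, remove_stop_words=True))
```

Something like `test['pp_body'].apply(clean_tokens)` would apply it to the saved data.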


In [37]:
# FULL is the entire dataset as it was given; I examined the counts for publication, dataset, etc. here
FULL = pd.concat([data_dic[db1], data_dic[db2], data_dic[db3]])
FULL['pubdate'].groupby([FULL.publication]).agg('count')

publication
The New York Times     55
The Washington Post     8
Name: pubdate, dtype: int64

In [38]:
FULL['pubdate'].groupby([FULL.dataset]).agg('count')

dataset
Factiva     40
ProQuest    23
Name: pubdate, dtype: int64

In [39]:
FULL['wordcount'].groupby([FULL.publication]).agg('mean')

publication
The New York Times     1183.836364
The Washington Post     389.000000
Name: wordcount, dtype: float64

In [40]:
# Count the purely numeric tokens in each preprocessed article
test['pp_body'].apply(lambda text: sum(token.isdigit() for token in text.split()))

0      47
1      29
2      10
3       7
4      47
5     142
6       5
7      33
8      14
9      19
10     19
11     23
12     23
13      1
14      8
15      6
16     32
17     17
18     14
19     59
20     15
Name: pp_body, dtype: int64

Bonus: TF-IDF. Excluding stop words, the most frequent words in the data are women, business and executive.

In [41]:
# Term frequencies for a single article (row 1 of the saved data)
tf1 = (test['pp_body'][1:2]).apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis = 0).reset_index()
tf1.columns = ['words','tf']
# idf = log(N / number of rows matching the word)
# Note: str.contains treats `word` as a regex and matches substrings, not whole words
for i,word in enumerate(tf1['words']):
    tf1.loc[i, 'idf'] = np.log(test.shape[0]/(len(test[test['pp_body'].str.contains(word)])))
tf1

Unnamed: 0,words,tf,idf
0,the,75,0.000000
1,of,45,0.000000
2,women,34,0.000000
3,to,33,0.000000
4,in,30,0.000000
5,a,26,0.000000
6,for,22,0.000000
7,and,22,0.000000
8,s,17,0.000000
9,is,16,0.000000
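
An `idf` of 0 means the pattern matched every row. Because `str.contains(word)` matches substrings, a token like `s` matches any text containing that letter, so the values above may understate how distinctive some words are. A minimal sketch of TF-IDF on whole tokens, using three made-up documents and plain Python (the function name `tf_idf` is hypothetical, and the unsmoothed `log(N / df)` follows the formula used in the cell above):

```python
import math
from collections import Counter

# Toy documents (illustrative only, not taken from the dataset)
docs = [
    "women in business",
    "women executive suite",
    "business news for women",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each whole token
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

def tf_idf(tokens):
    """Return {token: tf * idf} for one tokenized document."""
    term_freq = Counter(tokens)
    return {
        token: count * math.log(n_docs / doc_freq[token])
        for token, count in term_freq.items()
    }

scores = tf_idf(tokenized[1])
for token, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(token, round(score, 3))
```

Here `women` appears in every toy document and so scores 0, while `executive` and `suite` appear in only one and score highest.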
