# The Initial Text Cleaning
Online supplementary material to "The Evolution of Work in the United States" by Enghin Atalay, Phai Phongthiengtham, Sebastian Sotelo and Daniel Tannenbaum. 

* [Project data library](https://occupationdata.github.io) 

* [GitHub repository](https://github.com/phaiptt125/newspaper_project)

***

This IPython notebook demonstrates the initial processing of the raw text provided by ProQuest. The main components of this step are retrieving document metadata, removing markup from the newspaper text, and performing an initial spell-check of the text.

<b> Due to copyright restrictions, we are not authorized to publish a large body of newspaper text. </b>
***

## List of auxiliary files (see project data library or GitHub repository)

* *extract_information.py* : This Python code removes markup and extracts relevant information.
* *edit_distance.py* : This Python code computes string edit distance, used in the spelling correction procedure.
* *OCRcorrect_enchant.py* : This Python code performs basic word-by-word spelling error correction.
* *PWL.txt* : This file contains words, such as software and state names, that are not in the dictionary provided by Python's enchant module.
***
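
For readers unfamiliar with string edit distance, here is a minimal sketch of the standard Levenshtein distance. It is a generic textbook version and not necessarily what *edit_distance.py* implements; the function name `levenshtein` is an assumption made for this illustration.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("eara", "years"))  # 2
```

A small distance between an OCR token and a dictionary word suggests the dictionary word is a plausible correction.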

## Import python modules

In [1]:
import os
import re
import sys
import enchant #spelling correction module

sys.path.append('./auxiliary files')

from extract_information import *
from edit_distance import *
from OCRcorrect_enchant import *

## Import raw text file

ProQuest has provided us with text files that were transcribed from scanned images of newspaper pages. The file 'ad_sample.txt', shown below, is one of these text files. ProQuest told us only that this file belongs to a page of the Wall Street Journal.

In [2]:
# input files
input_file = 'ad_sample.txt'

# read the raw ad text (the record declares UTF-8 encoding)
with open(input_file, encoding='utf-8') as f:
    raw_ad = f.read()
print(raw_ad)

<?xml version="1.0" encoding="UTF-8"?><record> <version>  TDM_Record_v1.0.xsd </version> <recordid>  4a667155d557ab68c878224bc3de0979 </recordid> <recordtitle>  Classified Ad 45 -- No Title </recordtitle> <alphapubdate>  Sep 12, 1978 </alphapubdate> <numericpubdate>  19780912 </numericpubdate> <objecttype>  classified_ad </objecttype> <objecttype>  Classified Advertisement </objecttype> <objecttype>  Advertisement </objecttype> <copyright>  Copyright Dow Jones &amp; Company Inc Sep 12, 1978 </copyright> <languagecode>  English </languagecode> <groupid>  506733 </groupid> <pubid>  45441 </pubid> <pubtitle>  Wall Street Journal  (1923 - Current file) </pubtitle> <fulltext>  &amp;lt;html&amp;gt;             &amp;lt;head&amp;gt;               &amp;lt;meta name=&amp;quot;ValidationSchema&amp;quot; content=&amp;quot;http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd&amp;quot;/&amp;gt;               &amp;lt;title/&amp;gt;             &amp;lt;/head&amp;gt;             &amp;lt;body&amp;gt;      

***
The relevant information to extract is:

1. publication date - "19780912" (September 12, 1978)
2. page title - "Classified Ad 45" (classified ad, page 45)
3. content - all text in the "fulltext" field

Fortunately, job ads appear only on "Display Ad" or "Classified Ad" pages. As such, this step only needs to include pages of those two types.
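
As a rough illustration of how these fields could be pulled out of a record, here is a minimal sketch using regular expressions. The helper `extract_element` is hypothetical and is not the `ExtractElement` function from *extract_information.py*; the shortened `sample` string is built from the record shown above.

```python
import re


def extract_element(record, tag):
    """Return the stripped text of every <tag>...</tag> element in record.
    Hypothetical helper for illustration only."""
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(tag))
    return [m.strip() for m in re.findall(pattern, record, flags=re.DOTALL)]


# shortened copy of the sample record
sample = (
    "<record> <recordtitle>  Classified Ad 45 -- No Title </recordtitle> "
    "<numericpubdate>  19780912 </numericpubdate> "
    "<objecttype>  classified_ad </objecttype> </record>"
)

pub_date = extract_element(sample, "numericpubdate")[0]
title = extract_element(sample, "recordtitle")[0]

# keep only pages whose title marks them as display or classified ads
is_ad_page = title.startswith(("Display Ad", "Classified Ad"))

print(pub_date, "|", title, "|", is_ad_page)
```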

However, not all "Display Ad" or "Classified Ad" pages are job ads. The next step, demonstrated in the next IPython notebook [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb), uses a Latent Dirichlet Allocation (LDA) procedure to identify which pages are job ads.

## Assign unique page identifier
* Assign a unique identifier to each newspaper page that is either Display Ad or Classified Ad.

In [3]:
page_identifier = AssignPageIdentifier(raw_ad, 'WSJ') # see extract_information.py
print(page_identifier)

WSJ_classifiedad_19780912_45


The value "WSJ_classifiedad_19780912_45" refers to the 45th page of classified ads in the September 12, 1978 edition of the Wall Street Journal.
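
One way such an identifier could be composed from the record title and the numeric publication date is sketched below. The function `build_page_identifier` is hypothetical and may differ from `AssignPageIdentifier` in *extract_information.py*. In particular, the `displayad` spelling for display-ad pages and the `None` return for other page types are assumptions.

```python
import re


def build_page_identifier(newspaper, record_title, numeric_pub_date):
    """Compose an identifier such as 'WSJ_classifiedad_19780912_45'.
    Returns None for pages that are neither Display Ad nor Classified Ad
    (an assumption made for this sketch)."""
    match = re.match(r"(Display Ad|Classified Ad) (\d+)", record_title)
    if match is None:
        return None
    page_type = match.group(1).replace(" ", "").lower()  # e.g. 'classifiedad'
    page_number = match.group(2)                          # e.g. '45'
    return "_".join([newspaper, page_type, numeric_pub_date, page_number])


print(build_page_identifier("WSJ", "Classified Ad 45 -- No Title", "19780912"))
# WSJ_classifiedad_19780912_45
```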

## Extract posting and remove markup

In [4]:
# extract <fulltext> field 
fulltext = ExtractElement(raw_ad,'fulltext') # see extract_information.py
# remove XML markup
posting = CleanXML(fulltext) # see extract_information.py
print(posting)


 Singer has long been one of the world s gr ' pacesetters in volume manufacturing of intricate , 
 
 precision machines that achieve extreme reliability and durability . Our sewing machines are in use around the globe in every kind of climate . As pioneers in electronic sewing machines , we have again set new standards . 
 
 ELECTROMECHANICAL ENGINEERS , 
 
 Minimum of 6 eara experience in developing of electromechanical consumer or atmlar products . BSME or BSEE degree required , 
 
 advanced degree preferred . 
 
 ELECTRONIC ENGINEERS MECHANICAL ENGINEERS 
 
 A least 2+ years is needed in one of 2+ years of practical in mechanisms the following areas: and machine design analysis . Working know ! edga of computers as a design tool would be 
 
 1 ) Analog and digital industrial electron helpful . Experience in sophisticated , with microprocessor and CAD knowl chanical products . Background should include edge desirable ; mechanism or gear or machine design 2 ) Analog sad digital circu
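
The "fulltext" field in the sample record is escaped twice (for example, `&amp;lt;html&amp;gt;` stands for `<html>`). The sketch below shows one way markup like this could be removed with Python's standard library. It is not the `CleanXML` function from *extract_information.py*. The function name `strip_markup`, the `<p>` tags in the sample string, and the choice to turn each tag into a line break are assumptions made for illustration.

```python
import html
import re


def strip_markup(fulltext):
    """Undo two levels of HTML escaping, drop tags, and return clean lines.
    Hypothetical helper for illustration only."""
    # the sample record is escaped twice, so unescape twice
    text = html.unescape(html.unescape(fulltext))
    # replace each tag with a line break so block boundaries survive
    text = re.sub(r"<[^>]+>", "\n", text)
    # drop empty lines and surrounding white space
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    return "\n".join(lines)


sample = (
    "&amp;lt;p&amp;gt;ELECTRONIC ENGINEERS&amp;lt;/p&amp;gt; "
    "&amp;lt;p&amp;gt;advanced degree preferred .&amp;lt;/p&amp;gt;"
)
print(strip_markup(sample))
```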

## Perform basic spelling error correction, remove extra spaces and empty lines 

In [5]:
posting_by_line = [w for w in re.split('\n',posting) if len(w)>0] 
clean_posting_by_line = list()
        
for line in posting_by_line:
    clean_line = line
    # spelling error correction
    clean_line = EnchantErrorCorrection(clean_line, 'PWL.txt')
    # remove extra white spaces
    clean_line = ' '.join([w for w in re.split(' ',clean_line) if not w==''])  
    clean_posting_by_line.append(clean_line)

# remove empty lines
clean_posting_by_line = [w for w in clean_posting_by_line if not w=='']

# print final output of this step
print('\n'.join(clean_posting_by_line))

Singer has long been one of the world s gr ' pacesetters in volume manufacturing of intricate ,
precision machines that achieve extreme reliability and durability . Our sewing machines are in use around the globe in every kind of climate . As pioneers in electronic sewing machines , we have again set new standards .
ELECTROMECHANICAL ENGINEERS ,
Minimum of 6 Meara experience in developing of electromechanical consumer or atmlar products . BSME or B SEE degree required ,
advanced degree preferred .
ELECTRONIC ENGINEERS MECHANICAL ENGINEERS
A least 2+ years is needed in one of 2+ years of practical in mechanisms the following areas and machine design analysis . Working know ! Edgar of computers as a design tool would be
1 ) Analog and digital industrial electron helpful . Experience in sophisticated , with microprocessor and CAD kn owl mechanical products . Background should include edge desirable ; mechanism or gear or machine design 2 ) Analog sad digital circuitry , logic de and analy
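
To make the word-by-word idea concrete without requiring the enchant module, here is a minimal sketch that uses `difflib.get_close_matches` from the standard library as a stand-in for a spell-checker's suggestions. It is not the `EnchantErrorCorrection` function from *OCRcorrect_enchant.py*. The function `correct_line`, the tiny `dictionary` and `personal_words` lists (the latter playing the role of *PWL.txt*), and the 0.8 similarity cutoff are all assumptions.

```python
import difflib


def correct_line(line, dictionary, personal_words, cutoff=0.8):
    """Replace unknown alphabetic tokens with their closest known word.
    Hypothetical stand-in for a dictionary-based spell-checker."""
    known = sorted({w.lower() for w in dictionary} |
                   {w.lower() for w in personal_words})
    corrected = []
    for token in line.split():  # split() also collapses repeated spaces
        if not token.isalpha() or token.lower() in known:
            corrected.append(token)  # punctuation, numbers, known words
            continue
        matches = difflib.get_close_matches(token.lower(), known,
                                            n=1, cutoff=cutoff)
        # keep the original token when nothing is similar enough
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


dictionary = ["degree", "required", "engineers", "experience"]
personal_words = ["BSME"]  # plays the role of PWL.txt
print(correct_line("BSME  degre   requird ,", dictionary, personal_words))
# BSME degree required ,
```

As the notebook output above shows, word-by-word correction can also introduce errors (for example, "edga" becomes "Edgar"), so a conservative cutoff is a reasonable design choice in a sketch like this.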

The final output of this step is the variable "clean_posting_by_line". The next step, demonstrated in the next IPython notebook [here](https://github.com/phaiptt125/newspaper_project/blob/master/data_cleaning/LDA.ipynb), uses a Latent Dirichlet Allocation (LDA) procedure to identify which pages are job ads.