# The Initial Text Cleaning
Supplementary material for "The Evolving U.S. Occupational Structure: A Textual Analysis."  
***

This IPython notebook demonstrates the initial processing of the raw text provided by ProQuest. This step has three main components: retrieving document metadata, removing markup from the newspaper text, and performing an initial spell-check of the text.

* [Project data library](http://ssc.wisc.edu/~eatalay/occupation_data.html) 

* [The most recent version of the paper](http://ssc.wisc.edu/~eatalay/skills.pdf)


<div class="span5 alert alert-info">
<b> Due to copyright restrictions, we are not authorized to publish a large body of unprocessed text files. </b>
</div>

### List of auxiliary files (see project data library or GitHub profile)

* *compute_spelling.py* : This Python function computes the ratio of correctly spelled words and records all correctly spelled words.
* *edit_distance.py* : This Python function computes the string edit distance.
* *OCRcorrect_enchant.py* : This Python function performs basic spelling error correction using Python's enchant module.
* *OCRcorrect_hyphen.py* : This Python function corrects hyphenated words.
* *OCRcorrect_substitute.py* : This Python function corrects frequently occurring spelling errors.
* *PWL.txt* : This file contains words, such as software and state names, that are not in the dictionary provided by Python's enchant module.
***
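
The auxiliary file *edit_distance.py* is not reproduced in this notebook. As a rough illustration of what a string edit distance measures, the sketch below implements the standard Levenshtein distance. The function name `levenshtein` is an assumption made for this example and may not match the interface of the auxiliary file.

```python
def levenshtein(a, b):
    """Return the Levenshtein edit distance between strings a and b."""
    # prev[j] is the distance between the previously processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from the first i characters of a to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein('kitten', 'sitting'))  # 3
print(levenshtein('eara', 'years'))      # 2
```

A small distance between an OCR token and a dictionary word, as with `eara` and `years`, is the kind of signal a spelling corrector can use to rank candidate replacements.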
### Import necessary modules

In [1]:
import os
import re

# spelling correction-related modules 
from compute_spelling import *
from edit_distance import *
from OCRcorrect_enchant import *
from OCRcorrect_hyphen import *

### Define functions for initial text cleaning

In [2]:
def RemoveCharacters(text):
    # This function removes some non-grammatical characters 
    # and adds extra spaces around punctuation marks in order to 
    # facilitate spelling error correction.
    output = text
    output = output.replace('"','')
    output = output.replace('.', ' . ')
    output = output.replace(',', ' , ')
    output = output.replace('?', ' ? ')
    output = output.replace('(', ' ( ')
    output = output.replace(')', ' ) ')
    output = output.replace('$', ' $ ')
    output = output.replace(';',' ; ')
    output = output.replace('!',' ! ')
    output = output.replace('}','')
    output = output.replace('{','')
    output = output.replace('/',' ')
    output = output.replace('_',' ')
    output = output.replace('*','')
    return output

def CleanXML(text):
    # This function removes markup
    
    output = text #initialize output
    
    # '&amp;lt;/p&amp;gt;' and '&amp;lt;p&amp;gt;' are line-breaks
    NewlinePattern = re.compile( re.escape('&amp;lt;/p&amp;gt;') 
                                + '|' 
                                + re.escape('&amp;lt;p&amp;gt;') )
    
    output = re.sub(NewlinePattern,'\n',output)
    
    # replace all other markups

    XMLmarkups = ['name=&amp;quot;ValidationSchema&amp;quot;', 
                   'content=&amp;quot;', 
                   '&amp;quot;/&amp;gt;', 
                   '&amp;lt;meta']
    
    for pattern in XMLmarkups: 
        # flags must be passed by keyword: the fourth positional argument of re.sub is count
        output = re.sub(re.escape(pattern),'',output,flags=re.IGNORECASE)
        
    html_header = re.compile(re.escape('&amp;lt;') 
                             + '/?html/?' 
                             + re.escape('&amp;gt;'))
    
    output = re.sub(html_header,'',output)
    
    body_header = re.compile(re.escape('&amp;lt;') 
                             + '/?body/?' 
                             + re.escape('&amp;gt;'))
    
    output = re.sub(body_header,'',output)
    
    title_header = re.compile(re.escape('&amp;lt;') 
                              + '/?title/?' 
                              + re.escape('&amp;gt;'))
    
    output = re.sub(title_header,'',output)
    
    head_header = re.compile(re.escape('&amp;lt;') 
                             + '/?head/?' 
                             + re.escape('&amp;gt;'))
    
    output = re.sub(head_header,'',output)

    HTTPpattern = re.compile( re.escape('http://') + r'\S*'  
                             + re.escape('.xsd') )

    output = re.sub(HTTPpattern,'',output)
    output = re.sub(re.escape('&amp;quot;'),'"',output)
    output = re.sub(re.escape('&amp;apos;'),"'",output)
    output = re.sub(re.escape('&amp;amp;'),"&",output)
    output = re.sub(re.escape('&amp'),'',output)
    output = re.sub(re.escape('&lt;'),'',output)
    output = re.sub(re.escape('&gt;'),'',output)
    output = RemoveCharacters(output)
        
    return ' '.join([w for w in re.split(' ',output) if not w==''])

def ExtractElement(text,field):
    # This function takes an input string (text) and looks for markup.
    # The input "field" is the specific element that the code looks for.
    # For example, the page title can be located in the text as:   
    # <recordtitle> Display Ad 33 -- No Title </recordtitle>
    # Here, the "field" variable is "recordtitle".
    # (Note: all searches are case-insensitive.)

    beginMarkup = '<' + field + '>' #example: <recordtitle> 
    endMarkup = '</' + field + '>' #example: </recordtitle>

    # Delete line breaks. Windows uses '\r\n' and Linux uses '\n';
    # '\r\n' must be tried first so that no stray '\r' is left behind.
    textNoLineBreak = re.sub(r'\r\n|\n','',text)

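    # Non-greedy '.*?' stops at the first closing tag, so a repeated field
    # (e.g. <objecttype>) yields its first occurrence only
    ElementPattern = re.compile( re.escape(beginMarkup) + '.*?' + re.escape(endMarkup), re.IGNORECASE )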
    ElementMarkup = re.compile( re.escape(beginMarkup) + '|' + re.escape(endMarkup), re.IGNORECASE)

    DetectElement = re.findall(ElementPattern,textNoLineBreak)
    
    #strip markup
    Content = str(re.sub(ElementMarkup,'',str(DetectElement[0]))) 
    
    #reset space
    Content = ' '.join([w for w in re.split(' ',Content) if not w=='']) 
    
    return Content

def AssignAdIdentifier(text, journal):
    # This function assigns an ad identifier.
    # For example, 'WSJ_classifiedad_19780912_45'
    # 'WSJ' is the journal name specified by the user
    # 'classifiedad' refers to Classified Ad page
    # '19780912' is the publication date
    # '45' is the page number
    
    recordtitle = ExtractElement(text,'recordtitle')
    Match = re.findall(r'Ad \d+ -- No Title',recordtitle,re.IGNORECASE)
    
    if re.findall('Display Ad',recordtitle,re.IGNORECASE):
        ad_type = 'displayad'
    elif re.findall('Classified Ad',recordtitle,re.IGNORECASE): 
        ad_type = 'classifiedad'
    else:
        ad_type = None # neither a Display Ad nor a Classified Ad
    
    if Match and ad_type:
        
        ad_number = re.findall(r'\d+',recordtitle)[0]
        
        numericpubdate = ExtractElement(text,'numericpubdate')
        pub_date = re.findall(r'\d{8}',numericpubdate)[0]
        
        output = '_'.join([journal,ad_type,pub_date,ad_number])
    else:
        output = None
        
    return output
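
The entity handling in `CleanXML` can also be approximated with the standard library's `html.unescape`. The sketch below is an alternative shown for illustration, not the method used in this notebook. The function name `decode_markup` and the sample string are made up for the example, and it assumes the markup is escaped exactly twice, as in the sample record.

```python
import html
import re

def decode_markup(text):
    """Decode doubly escaped HTML and return its plain text."""
    # Two passes: '&amp;lt;' -> '&lt;' -> '<'
    decoded = html.unescape(html.unescape(text))
    # Paragraph tags mark line breaks
    decoded = re.sub(r'</?p>', '\n', decoded, flags=re.IGNORECASE)
    # Drop any remaining tags, e.g. <html>, <body>, <meta .../>
    decoded = re.sub(r'<[^>]+>', '', decoded)
    # Collapse runs of spaces and tabs, then trim
    return re.sub(r'[ \t]+', ' ', decoded).strip()

sample = '&amp;lt;p&amp;gt;BSME or BSEE degree required&amp;lt;/p&amp;gt;'
print(decode_markup(sample))  # BSME or BSEE degree required
```

One trade-off: the second unescape pass also decodes entities that were meant literally, and the generic tag pattern removes any text that happens to sit between `<` and `>`, so the explicit replacements in `CleanXML` are the more conservative choice.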

### Import raw text file
Note: This IPython notebook provides an example of the 45th page of classified ads in the September 12, 1978 edition of the Wall Street Journal.

In [3]:
# input file
input_file = 'ad_sample.txt'

# bring in raw ads 
with open(input_file) as f:
    raw_ad = f.read()
print(raw_ad)

<?xml version="1.0" encoding="UTF-8"?><record> <version>  TDM_Record_v1.0.xsd </version> <recordid>  4a667155d557ab68c878224bc3de0979 </recordid> <recordtitle>  Classified Ad 45 -- No Title </recordtitle> <alphapubdate>  Sep 12, 1978 </alphapubdate> <numericpubdate>  19780912 </numericpubdate> <objecttype>  classified_ad </objecttype> <objecttype>  Classified Advertisement </objecttype> <objecttype>  Advertisement </objecttype> <copyright>  Copyright Dow Jones &amp; Company Inc Sep 12, 1978 </copyright> <languagecode>  English </languagecode> <groupid>  506733 </groupid> <pubid>  45441 </pubid> <pubtitle>  Wall Street Journal  (1923 - Current file) </pubtitle> <fulltext>  &amp;lt;html&amp;gt;             &amp;lt;head&amp;gt;               &amp;lt;meta name=&amp;quot;ValidationSchema&amp;quot; content=&amp;quot;http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd&amp;quot;/&amp;gt;               &amp;lt;title/&amp;gt;             &amp;lt;/head&amp;gt;             &amp;lt;body&amp;gt;      

### Extract relevant information
* Assign a unique identifier to each newspaper page that is either a Display Ad or a Classified Ad.

In [4]:
ad_identifier = AssignAdIdentifier(raw_ad, 'WSJ')
print(ad_identifier)

WSJ_classifiedad_19780912_45
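
When a record is well-formed XML, the same metadata can be read with an XML parser instead of regular expressions. The sketch below uses the standard library's `xml.etree.ElementTree` on a shortened, hypothetical record. `build_identifier` and `sample_record` are names invented for this example, and the approach assumes the full ProQuest record parses cleanly, which is not verified here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, shortened record that mimics the two fields used above
sample_record = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<record>'
    '<recordtitle>  Classified Ad 45 -- No Title </recordtitle>'
    '<numericpubdate>  19780912 </numericpubdate>'
    '</record>'
)

def build_identifier(xml_text, journal):
    """Return e.g. 'WSJ_classifiedad_19780912_45', or None if the page is not an ad."""
    root = ET.fromstring(xml_text.encode('utf-8'))
    title = (root.findtext('recordtitle') or '').strip()
    date = (root.findtext('numericpubdate') or '').strip()
    match = re.match(r'(Display|Classified) Ad (\d+) -- No Title',
                     title, flags=re.IGNORECASE)
    if match is None or not re.fullmatch(r'\d{8}', date):
        return None
    ad_type = match.group(1).lower() + 'ad'  # 'displayad' or 'classifiedad'
    return '_'.join([journal, ad_type, date, match.group(2)])

print(build_identifier(sample_record, 'WSJ'))  # WSJ_classifiedad_19780912_45
```

Unlike the regular-expression approach, a parser raises an error on malformed input, which may or may not be desirable for noisy source files.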


* Extract the actual posting and remove XML markup

In [5]:
# extract <fulltext> field 
fulltext = ExtractElement(raw_ad,'fulltext')
# remove XML markup
posting = CleanXML(fulltext)
print(posting)


 Singer has long been one of the world s gr ' pacesetters in volume manufacturing of intricate , 
 
 precision machines that achieve extreme reliability and durability . Our sewing machines are in use around the globe in every kind of climate . As pioneers in electronic sewing machines , we have again set new standards . 
 
 ELECTROMECHANICAL ENGINEERS , 
 
 Minimum of 6 eara experience in developing of electromechanical consumer or atmlar products . BSME or BSEE degree required , 
 
 advanced degree preferred . 
 
 ELECTRONIC ENGINEERS MECHANICAL ENGINEERS 
 
 A least 2+ years is needed in one of 2+ years of practical in mechanisms the following areas: and machine design analysis . Working know ! edga of computers as a design tool would be 
 
 1 ) Analog and digital industrial electron helpful . Experience in sophisticated , with microprocessor and CAD knowl chanical products . Background should include edge desirable ; mechanism or gear or machine design 2 ) Analog sad digital circu

* Correct hyphenated words and spelling errors, line by line

In [6]:
posting_by_line = [w for w in re.split('\n',posting) if len(w)>0] 
clean_posting_by_line = list()
        
for line in posting_by_line:
    clean_line = line
    # spelling error correction
    clean_line = CorrectHyphenated(clean_line, 'PWL.txt')
    clean_line = EnchantErrorCorrection(clean_line, 'PWL.txt')
    clean_line = ' '.join([w for w in re.split(' ',clean_line) if not w==''])  
    clean_posting_by_line.append(clean_line)

# remove empty lines
clean_posting_by_line = [w for w in clean_posting_by_line if not w=='']

# print final output of this step
print('\n'.join(clean_posting_by_line))

Singer has long been one of the world s gr ' pacesetters in volume manufacturing of intricate ,
precision machines that achieve extreme reliability and durability . Our sewing machines are in use around the globe in every kind of climate . As pioneers in electronic sewing machines , we have again set new standards .
ELECTROMECHANICAL ENGINEERS ,
Minimum of 6 Meara experience in developing of electromechanical consumer or atmlar products . BSME or B SEE degree required ,
advanced degree preferred .
ELECTRONIC ENGINEERS MECHANICAL ENGINEERS
A least 2+ years is needed in one of 2+ years of practical in mechanisms the following areas and machine design analysis . Working know ! Edgar of computers as a design tool would be
1 ) Analog and digital industrial electron helpful . Experience in sophisticated , with microprocessor and CAD kn owl mechanical products . Background should include edge desirable ; mechanism or gear or machine design 2 ) Analog sad digital circuitry , logic de and analy
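
The output above still contains OCR errors, and one way to gauge how many is the share of tokens found in a dictionary. The sketch below illustrates that idea with a small hand-made vocabulary in place of the enchant dictionary. `spelling_ratio` and `toy_vocabulary` are hypothetical names, and this is not the implementation in *compute_spelling.py*.

```python
import re

def spelling_ratio(line, vocabulary):
    """Return (share of alphabetic tokens found in vocabulary, list of those tokens)."""
    # Only purely alphabetic tokens are scored; numbers and punctuation are skipped
    tokens = [w for w in line.split() if re.fullmatch(r'[A-Za-z]+', w)]
    known = [w for w in tokens if w.lower() in vocabulary]
    ratio = len(known) / len(tokens) if tokens else 0.0
    return ratio, known

# Toy vocabulary standing in for a real dictionary
toy_vocabulary = {'advanced', 'degree', 'preferred', 'minimum', 'of', 'experience'}

print(spelling_ratio('advanced degree preferred .', toy_vocabulary))
print(spelling_ratio('Minimum of 6 eara experience', toy_vocabulary))
```

The first line scores 1.0 because all three words are in the toy vocabulary. The second scores 0.75 because `eara` is not.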