### Purpose

This module transforms the MODSIM papers and loads them into a working data set.

**Inputs**

The inputs for this module include:

* A CSV file of key information about each MODSIM paper, extracted from HTML parsed with BeautifulSoup
* Text files containing the full text of each MODSIM paper (2014 to 2018)
  * Downloaded as PDFs from the MODSIM website using the Google Chrono Sniffer extension
  * Converted to text using a macOS Automator workflow
  * Assumes a file structure `./data/<year>/` exists for years 2014-2018
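
Because the rest of the notebook relies on the `./data/<year>/` folders, it can help to confirm they exist before running anything else. The following is a minimal sketch, not part of the original workflow; the helper name `missing_year_dirs` and its default arguments are assumptions made for illustration.

```python
from pathlib import Path


def missing_year_dirs(root='./data', years=range(2014, 2019)):
    """Return the expected year folders under `root` that do not exist."""
    root = Path(root)
    # Hypothetical helper: one folder per year, named with the four-digit year
    return [str(root / str(year)) for year in years
            if not (root / str(year)).is_dir()]
```

An empty list from `missing_year_dirs()` means every year folder from 2014 to 2018 is in place.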
  
**Output**

The result of this module is a folder called `./data/abstracts/` containing `.txt` files, each with a custom label and holding the abstract extracted from one MODSIM paper.
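
The cells in this notebook write into `./data/abstracts/` but never create it, so the folder has to exist before the extraction step runs. One way to guarantee that is sketched below; the function name `ensure_output_dir` is an assumption introduced here, while `os.makedirs` with `exist_ok=True` is standard library behavior.

```python
import os


def ensure_output_dir(path='./data/abstracts'):
    """Create the output folder if needed and return its path."""
    # exist_ok=True makes the call safe to repeat on later runs
    os.makedirs(path, exist_ok=True)
    return path
```

Calling `ensure_output_dir()` once near the top of the notebook would be enough.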

In [1]:
import numpy as np
import pandas as pd
import re

In [2]:
# toy_html_parse.csv serves as a toy example of the expected input
df = pd.read_csv('./data/toy_html_parse.csv')
df

Unnamed: 0,track,filename,author_id,title,year
0,Analytics and Decision Making,file1.txt,Bockelman,Practical Human-Systems Integration Methods fo...,2014
1,Science and Engineering,file2.txt,Drucker,An Adaptive Planning Tool for Ship Constructio...,2015
2,Training and Education,file3.txt,Schoenbaum,Animal Disease Spread Modeling for Epidemiolog...,2016
3,Visualization and Gamification,file4.txt,Zhou,A Simulation on the Effect of a Major World Wa...,2017
4,Training and Education,file5.txt,Axdahl,Shifting Data Collection from a Fixed to an Ad...,2018


In [3]:
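# Store year as text so it can be concatenated into the label
df['year'] = df['year'].astype(str)
# Map each full track name to its two-letter code (nested dict: column -> {old: new})
mapped_sub = {'track': {'Training and Education': 'TE',
                        'Analytics and Decision Making': 'AT',
                        'Science and Engineering': 'SE',
                        'Visualization and Gamification': 'VG'}}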
df_mod = df.replace(to_replace=mapped_sub)
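
`DataFrame.replace` leaves any value that is missing from the mapping unchanged, so an unexpected track name would pass silently into the label. The sketch below is an alternative under that concern, not the notebook's method: it uses `Series.map`, which yields `NaN` for unmapped values, and raises on them. The names `TRACK_CODES` and `abbreviate_tracks` are assumptions.

```python
import pandas as pd

# Same abbreviations as the notebook's mapping
TRACK_CODES = {
    'Training and Education': 'TE',
    'Analytics and Decision Making': 'AT',
    'Science and Engineering': 'SE',
    'Visualization and Gamification': 'VG',
}


def abbreviate_tracks(df):
    """Return a copy of df with track names abbreviated; fail on unknown names."""
    codes = df['track'].map(TRACK_CODES)  # unmapped names become NaN
    unknown = df.loc[codes.isna(), 'track'].unique().tolist()
    if unknown:
        raise ValueError(f'Unmapped track names: {unknown}')
    return df.assign(track=codes)
```

Failing early here keeps malformed labels out of `./data/abstracts/`.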

In [4]:
# Build the output path in track-year-author format
df_mod['label'] = './data/abstracts/' + df_mod['track'] + '-' + \
                df_mod['year'] + '-' + df_mod['author_id'] + '.txt'
df_mod

Unnamed: 0,track,filename,author_id,title,year,label
0,AT,file1.txt,Bockelman,Practical Human-Systems Integration Methods fo...,2014,./data/abstracts/AT-2014-Bockelman.txt
1,SE,file2.txt,Drucker,An Adaptive Planning Tool for Ship Constructio...,2015,./data/abstracts/SE-2015-Drucker.txt
2,TE,file3.txt,Schoenbaum,Animal Disease Spread Modeling for Epidemiolog...,2016,./data/abstracts/TE-2016-Schoenbaum.txt
3,VG,file4.txt,Zhou,A Simulation on the Effect of a Major World Wa...,2017,./data/abstracts/VG-2017-Zhou.txt
4,TE,file5.txt,Axdahl,Shifting Data Collection from a Fixed to an Ad...,2018,./data/abstracts/TE-2018-Axdahl.txt
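
Since each label encodes the track, year, and author, those fields can be recovered from the label without relying on fixed character positions. The following sketch assumes labels always follow the `<track>-<year>-<author>.txt` pattern shown in the table above, with a two-letter track code; `LABEL_PATTERN` and `parse_label` are hypothetical names.

```python
import re
from pathlib import PurePosixPath

# Assumed label shape: two capital letters, four-digit year, author id, ".txt"
LABEL_PATTERN = re.compile(r'^(?P<track>[A-Z]{2})-(?P<year>\d{4})-(?P<author>.+)\.txt$')


def parse_label(label):
    """Split a label path into (track, year, author)."""
    name = PurePosixPath(label).name  # drop the './data/abstracts/' prefix
    match = LABEL_PATTERN.match(name)
    if match is None:
        raise ValueError(f'Unexpected label format: {label}')
    return match.group('track'), match.group('year'), match.group('author')


print(parse_label('./data/abstracts/AT-2014-Bockelman.txt'))
# → ('AT', '2014', 'Bockelman')
```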


Zip the `filename` and `label` columns into `(filename, label)` tuples, then iterate over them, reading each paper's text file and writing its abstract to a text file named by the label.

In [5]:
# Pair each source filename with its output label (from the parsed-HTML CSV)
tup = zip(df_mod['filename'], df_mod['label'])

for file, label in tup:
    # Build the path to the paper's text file; label[20:24] is the four-digit
    # year in './data/abstracts/<track>-<year>-<author>.txt'
    filename = './data/' + label[20:24] + '/' + file

    with open(filename, 'r') as f:
        raw = f.read()
    print(f'Opening {file} and extracting abstract')  # output to show progress

    begin = re.search('ABSTRACT\n', raw)  # flag for beginning of abstract
    stop = re.search('ABOUT THE', raw)    # flag for end of abstract
    # TODO: run a separate routine beforehand to list the files needing repair;
    # until then, skip any file missing either flag instead of crashing
    if begin is None or stop is None:
        print(f'Skipping {file}: begin or stop flag not found\n')
        continue
    # The abstract runs from the end of the begin flag to the start of the stop flag
    abstract = raw[begin.end():stop.start()]

    # Write the extracted abstract to its labeled text file
    with open(label, 'w') as text:
        text.write(abstract)
    print(f'Writing extracted abstract as file {label}\n')  # output to show progress

Opening file1.txt and extracting abstract
Writing extracted abstract as file ./data/abstracts/AT-2014-Bockelman.txt

Opening file2.txt and extracting abstract
Writing extracted abstract as file ./data/abstracts/SE-2015-Drucker.txt

Opening file3.txt and extracting abstract
Writing extracted abstract as file ./data/abstracts/TE-2016-Schoenbaum.txt

Opening file4.txt and extracting abstract
Writing extracted abstract as file ./data/abstracts/VG-2017-Zhou.txt

Opening file5.txt and extracting abstract
Writing extracted abstract as file ./data/abstracts/TE-2018-Axdahl.txt
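
The extraction loop depends on every paper containing both the `ABSTRACT` and `ABOUT THE` flags. A pre-check along the following lines could list the files that need repair before the loop runs. This is a sketch under assumptions: the names `find_abstract_span` and `files_to_repair` are hypothetical, and unlike the loop above it searches for the stop flag only after the begin flag.

```python
import re

BEGIN_FLAG = re.compile(r'ABSTRACT\n')  # same begin flag as the extraction loop
STOP_FLAG = re.compile(r'ABOUT THE')    # same stop flag as the extraction loop


def find_abstract_span(raw):
    """Return (start, end) indices of the abstract, or None if a flag is missing."""
    begin = BEGIN_FLAG.search(raw)
    if begin is None:
        return None
    # Assumption: only a stop flag that follows the begin flag is usable
    stop = STOP_FLAG.search(raw, begin.end())
    if stop is None:
        return None
    return begin.end(), stop.start()


def files_to_repair(paths):
    """Return the paths whose text lacks a usable begin or stop flag."""
    bad = []
    for path in paths:
        with open(path, 'r') as f:
            if find_abstract_span(f.read()) is None:
                bad.append(path)
    return bad
```

Passing the full list of input paths to `files_to_repair` gives the set of papers to fix by hand before extraction.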

