## Extract Information from Documents

#### Part 1: Extract information from Microsoft Word Documents

In [15]:
from docx import Document

def extract_text_from_docx(docx_path):
    # Load the Word document
    doc = Document(docx_path)
    
    # List to hold all the text in the document
    full_text = []

    # Iterate over each paragraph in the document
    for para in doc.paragraphs:
        full_text.append(para.text)

    # Join all the text separated by a newline
    return '\n'.join(full_text)

def extract_tables_from_docx(docx_path):
    # Load the Word document
    doc = Document(docx_path)
    
    # List to hold all tables data
    tables_data = []

    # Iterate over each table in the document
    for table in doc.tables:
        table_text = []
        for row in table.rows:
            row_data = []
            for cell in row.cells:
                row_data.append(cell.text.strip())
            table_text.append(row_data)
        tables_data.append(table_text)
    
    return tables_data

def extract_hyperlinks(docx_path):
    doc = Document(docx_path)
    hyperlinks = []

    # Namespace prefixes for WordprocessingML elements and relationship attributes
    w_namespace = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    r_namespace = '{http://schemas.openxmlformats.org/officeDocument/2006/relationships}'

    # Helper function to process hyperlink elements
    def process_hyperlink(hyperlink_el, rels):
        r_id = hyperlink_el.get(r_namespace + 'id')
        if r_id and r_id in rels:
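            # target_ref is the public accessor for the relationship target (the URL)
            link = rels[r_id].target_ref
            # <w:t> text nodes sit inside <w:r> runs, so search all descendants
            text = ''.join(node.text or '' for node in hyperlink_el.iter(w_namespace + 't'))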
            hyperlinks.append({'text': text, 'url': link})

    # Get the relationships dictionary
    rels = doc.part.rels

    # Iterate through the document elements to find hyperlinks
    for el in doc.element.body.iter():
        if el.tag == w_namespace + 'hyperlink':
            process_hyperlink(el, rels)

    return hyperlinks


inP = "C:\\Users\\jiay\\EconS524\\EconS524 Syllabus.docx"
text = extract_text_from_docx(inP)
print(text)

tables = extract_tables_from_docx(inP)

# Print all extracted table data
for i, table in enumerate(tables, start=1):
    print(f"Table {i}:")
    for row in table:
        print(row)

hyperlinks = extract_hyperlinks(inP)

for hyperlink in hyperlinks:
    print(f"Text: {hyperlink['text']}, URL: {hyperlink['url']}")


EconS 524, Spring 2024
Applied Machine Learning for Economics
Time: Mon. and Wed. 4:10pm-5:25pm
Location: Hulbert 23
Credit 3 

Jia Yan
Office Hours (Hulbert 301E): Tuesday 2:00pm – 4:00pm
E-mail: jiay@wsu.edu

Class Website: canvas
You will have to enter your WSU username and password to access the course materials.

Class Description
This course provides a comprehensive introduction to fundamental concepts and algorithms in machine learning, highlighting their relevance and application in econometrics. Initially, the course will explore a range of machine learning methods for regression and classification. Following this, it will delve into how these methodologies can be effectively applied within the field of economics, demonstrating their practical utility and impact.

Readings:

Required Text book: T. Hastie, R. Tibshirani and J. Friedman, , 
 
QuantEcon Data Science 

Course Objectives: This course is designed for Master and PhD students with interest in econometrics.  This cours
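
The hyperlink extraction above relies on python-docx to resolve relationship ids. As a cross-check, the same lookup can be sketched with the standard library alone, since a .docx file is a zip archive of XML parts. The sketch assumes the main body is stored at `word/document.xml` and its relationships at `word/_rels/document.xml.rels`, which is the usual layout for files saved by Word. The names `extract_hyperlinks_stdlib` and `make_demo_docx` are hypothetical, the URL is a placeholder, and the demo archive is a stripped-down stand-in rather than a valid Word file.

```python
import io
import zipfile
from xml.etree import ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def extract_hyperlinks_stdlib(docx_file):
    """Return [{'text': ..., 'url': ...}] for hyperlinks in the main body.

    docx_file may be a path or a binary file-like object.
    Assumes the usual part names inside the archive (see the lead-in).
    """
    with zipfile.ZipFile(docx_file) as zf:
        body = ET.fromstring(zf.read("word/document.xml"))
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))

    # Map relationship ids (e.g. "rId1") to their targets (the URLs)
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG_NS}}}Relationship")
    }

    links = []
    for link in body.iter(f"{{{W_NS}}}hyperlink"):
        r_id = link.get(f"{{{R_NS}}}id")
        if r_id in targets:
            # Link text lives in <w:t> nodes nested inside <w:r> runs
            text = "".join(t.text or "" for t in link.iter(f"{{{W_NS}}}t"))
            links.append({"text": text, "url": targets[r_id]})
    return links


def make_demo_docx():
    """Build an in-memory archive holding only the two parts read above (demo only)."""
    document = (
        f'<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}"><w:body><w:p>'
        '<w:hyperlink r:id="rId1"><w:r><w:t>QuantEcon </w:t></w:r>'
        '<w:r><w:t>Data Science</w:t></w:r></w:hyperlink>'
        '</w:p></w:body></w:document>'
    )
    rels = (
        f'<Relationships xmlns="{PKG_NS}">'
        '<Relationship Id="rId1" TargetMode="External" '
        'Target="https://example.com/datascience"/>'  # placeholder URL
        '</Relationships>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("word/document.xml", document)
        zf.writestr("word/_rels/document.xml.rels", rels)
    buffer.seek(0)
    return buffer


for item in extract_hyperlinks_stdlib(make_demo_docx()):
    print(f"Text: {item['text']}, URL: {item['url']}")
```

Note that the link text is split across two runs in the demo, which is why the text is gathered from all `<w:t>` descendants rather than from the direct children of the hyperlink element.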

#### Part 2: Extract from PDF

In [4]:
import tabula
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

def extract_text_from_pdf(path):
    '''
    Extract the text of every page in a PDF file using pdfminer.

    Steps:
    1. Open the PDF file in binary read mode.
    2. Create a PDF parser for the file and a PDF document object from the parser.
    3. Initialize a resource manager and a text converter with layout parameters.
    4. Set up a PDF page interpreter.
    5. Process each page so the converter writes its text into a StringIO buffer.
    6. Read the buffer's contents, close it, and return the text.
    '''
    output_string = StringIO()
    with open(path, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    text = output_string.getvalue()
    output_string.close()
    return text
   
def extract_tables_from_pdf(path):
    '''
    Return a list of pandas DataFrames, one per table found in the PDF.

    Tabula runs on Java, so Java must be installed on your system.
    '''
    tables = tabula.read_pdf(path, pages='all', multiple_tables=True)
    return tables

inP = "C:\\Users\\jiay\\EconS524\\EconS524 Syllabus.pdf"
pdf_text = extract_text_from_pdf(inP)
print(pdf_text)

tables = extract_tables_from_pdf(inP)
for i, table in enumerate(tables):
    print(f"Table {i}:")
    print(table)
    # You can also export to CSV, Excel, etc.
    # table.to_csv(f"table_{i}.csv")


EconS 524, Spring 2024 

Applied Machine Learning for Economics 
Time: Mon. and Wed. 4:10pm-5:25pm 

Location: Hulbert 23 

Credit 3  

 
Jia Yan 
Office Hours (Hulbert 301E): Tuesday 2:00pm – 4:00pm 
E-mail: jiay@wsu.edu 
 
Class Website: canvas 
You will have to enter your WSU username and password to access the course materials. 
 
Class Description 
This course provides  a  comprehensive  introduction to  fundamental  concepts  and  algorithms  in 
machine learning, highlighting their relevance and application in econometrics. Initially, the course 
will explore a range of machine learning methods for regression and classification. Following this, 
it will delve into how these methodologies can be effectively applied within the field of economics, 
demonstrating their practical utility and impact. 
 
Readings: 
 

1.  Required Text book: T. Hastie, R. Tibshirani and J. Friedman, The Elements of 

Statistical Learning: Data Mining, Inference, and Prediction (2nd edition), 
http://we


Table 0:
    Week                                               Plan
0      1  Introduction to statistical learning and Pytho...
1    2-6  Parametric Supervised Learning Methods\r1.\rLi...
2    7-9  Non-parametric Supervised Learning Methods\r1....
3     10  Un-supervised learning\r1.\rClustering\r2.Dime...
4  11-12            Model ensembling: Boosting and Stacking
5  13-14                    Introduction to neural networks
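
The PDF output above keeps the layout of the page: lines are hard-wrapped, justified text carries runs of extra spaces, and table cells contain `\r` characters. A minimal cleanup sketch follows, assuming that a blank line marks a paragraph break, which is a heuristic and will not hold for every PDF. The name `normalize_pdf_text` is hypothetical and uses only the standard library.

```python
import re


def normalize_pdf_text(raw):
    """Return a list of paragraphs with layout whitespace collapsed.

    Assumption: blank (or whitespace-only) lines separate paragraphs.
    """
    paragraphs = []
    current = []
    # splitlines() breaks on "\n", "\r\n" and bare "\r"
    for line in raw.splitlines():
        # Collapse runs of spaces and tabs left over from justified text
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            current.append(line)
        elif current:
            # A blank line ends the current paragraph
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = (
    "This course provides  a  comprehensive  introduction to  fundamental  concepts  and  algorithms  in \n"
    "machine learning, highlighting their relevance and application in econometrics. \n"
    " \n"
    "Readings: \n"
)
for paragraph in normalize_pdf_text(sample):
    print(paragraph)
```

Because `splitlines()` also splits on a bare `\r`, the same function flattens a table cell such as `"Un-supervised learning\r1.\rClustering"` into a single line.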


#### Part 3: Extract from Image

In [7]:
import cv2
import pytesseract
import platform

def image_to_text(image_file_path):
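    if platform.system() == "Windows":
        # On Windows, point pytesseract at the Tesseract executable
        # (adjust the path to your installation); elsewhere it is looked up on the PATH
        pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'

    # Load the image; cv2.imread returns None if the file cannot be read
    image = cv2.imread(image_file_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_file_path}")

    # Convert the image to grayscale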
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
   
    # Use OpenCV's threshold function to create a binary image
    # You can experiment with different thresholding methods and values
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # Optionally apply some noise reduction
    denoise = cv2.medianBlur(thresh, 5)
        
    # Use Tesseract to extract text from the image
    text = pytesseract.image_to_string(denoise, lang='eng')
    # Join lines with spaces so words on adjacent lines do not run together
    text = text.strip().replace("\n", " ")
    return text
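
The `cv2.THRESH_OTSU` flag used above asks OpenCV to choose the threshold automatically instead of using the `0` passed in. The idea behind Otsu's method can be sketched in pure Python: try every gray level as a cut and keep the one that maximizes the between-class variance of the two resulting pixel groups. This is an illustration of the technique, not OpenCV's implementation, and the names `otsu_threshold` and `binarize` are hypothetical. When several thresholds tie, this sketch keeps the lowest one, which may differ from the value OpenCV reports.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels at or below the candidate threshold
    sum_low = 0  # sum of their gray values
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue  # no pixels in the lower class yet
        w_high = total - w_low
        if w_high == 0:
            break  # every pixel is already in the lower class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        # Between-class variance, up to a constant factor of 1 / total**2
        between = w_low * w_high * (mean_low - mean_high) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, t, maxval=255):
    """Mimic THRESH_BINARY: maxval where the pixel exceeds t, otherwise 0."""
    return [maxval if p > t else 0 for p in pixels]


# Toy "image": dark text pixels (10) on a bright background (200)
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
print(t, binarize([10, 200], t))
```

On real scans the histogram is rarely this clean, so it is worth comparing the Otsu result with a fixed or adaptive threshold before running OCR.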
