# Extracting Text Data

## Overview:
- In this lesson, we will practice extracting text data from various documents such as PDF, DOCX, and JSON files.
- Then, we will clean the extracted text using regular expressions.
- The exercises require knowledge of Python programming and libraries: `PyPDF2`, `docx`, `json`, and `re`.


## Question 1: Extracting Data from a PDF File

Using the `PyPDF2` library, write a Python script to extract the entire text from a PDF file. Ensure that you handle cases where the PDF file has multiple pages.

In [1]:
# Installing the PyPDF2 Library
%pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [1]:
# Import the library
import PyPDF2

Task Completion
- Find a PDF File with More Than 20,000 Words
- Read the Content and Page Information
- Store the Content in a String Variable

+ Step 1: Define a function `extract_text_from_pdf` to keep the extraction logic in one place.

+ Step 2: Find one PDF with more than 20,000 words and one with fewer than 20,000 words for testing.

+ Step 3: Call the function, which uses the `PyPDF2` library, on the chosen file.

+ Step 4: Check whether the word count exceeds 20,000 and, if so, print the content.

In [None]:
#### YOUR CODE HERE ####
def extract_text_from_pdf(file_path):
    # open the PDF file
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        all_text = ""
        
        # extract text from each page
        for page in reader.pages:
            all_text += page.extract_text()
    
    return all_text

# pdf_file_path = "part05-empirical-methods.pdf"  # test for fewer than 20000 words
pdf_file_path = "Programming_Languages_Principles_and_Paradigms.pdf" 

# call the function to extract text
pdf_text = extract_text_from_pdf(pdf_file_path)

# count words in the pdf
word_count = len(pdf_text.split())
print(f"Word count: {word_count}")

# check whether the word count exceeds 20,000
if word_count > 20000:
    print("The pdf file contains more than 20,000 words.")
    print("The content in the string variable is: ")
    print(pdf_text)
else:
    print("The file contains fewer than 20,000 words.")
#### END YOUR CODE #####

Word count: 168952
The pdf file contains more than 20,000 words.
The content in the string variable is: 
Undergraduate Topics in Computer ScienceUndergraduate Topics in Computer Science (UTiCS) delivers high-quality instruc-
tional content for undergraduates studying in all areas of computing and informationscience. From core foundational and theoretical material to ﬁnal-year topics and ap-
plications, UTiCS books take fresh, concise, and modern approach and are ideal
for self-study or for a one- or two-semester course. The texts are all authored byestablished experts in their ﬁelds, reviewed by an international advisory board, andcontain numerous examples and problems. Many include fully worked solutions.
For further volumes:
http://www.springer.com/series/7592Maurizio Gabbrielli and Simone Martini
Programming
Languages:Principlesand ParadigmsProf. Dr. Maurizio Gabbrielli
Università di Bologna
Bologna
ItalyProf. Dr. Simone MartiniUniversità di Bologna
Bologna
Italy
Series editorIan Ma
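
The sample output above shows some words fused together (for example `7592Maurizio`). One possible cause, stated here as an assumption rather than a diagnosis, is that the text of consecutive pages is concatenated with no separator. A minimal sketch of a join helper follows; `join_page_texts` is a hypothetical name that is not part of `PyPDF2`, and it works on plain strings so it can be tried without a PDF at hand.

```python
def join_page_texts(page_texts, separator="\n"):
    """Join per-page strings; a missing value (None) is treated as empty.

    The None handling is a defensive assumption, not documented behavior.
    """
    return separator.join(text or "" for text in page_texts)


# Stand-in for [page.extract_text() for page in reader.pages]
sample_pages = ["series/7592", "Maurizio Gabbrielli", None]
joined = join_page_texts(sample_pages)
print(joined.split())
```

With a separator between pages, `split()` sees `series/7592` and `Maurizio` as separate tokens, which also affects the word count.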

## Question 2: Extracting Data from a DOCX File


In [4]:
# Installing the docx Library
%pip install python-docx

Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.1.2
Note: you may need to restart the kernel to use updated packages.


In [5]:
# Import the library
from docx import Document

Task Completion
- Find a DOCX File with More Than 20,000 Words
- Read the Content and Page Information
- Store the Content in a String Variable

+ Step 1: Define a function `extract_text_from_docx` to keep the extraction logic in one place.

+ Step 2: Find one DOCX file with more than 20,000 words and one with fewer than 20,000 words for testing.

+ Step 3: Call the function, which uses the `docx` library, on the chosen file.

+ Step 4: Check whether the word count exceeds 20,000 and, if so, print the content.

In [10]:
#### YOUR CODE HERE ####
def extract_text_from_docx(file_path):
    # open the DOCX file
    doc = Document(file_path)

    content = ""
    word_count = 0

    # append the text of each paragraph and count its words
    for paragraph in doc.paragraphs:
        content += paragraph.text
        word_count += len(paragraph.text.split())

    return content, word_count

# file_path = "Calculus-Report-1.docx"   # test for fewer than 20000 words
file_path = "Report Talkshow.docx"

docx_text, word_count = extract_text_from_docx(file_path)

if word_count > 20000:
    print("The docx file contains more than 20,000 words.")
    print("The content in the string variable is: ")
    print(docx_text)
else:
    print("The file contains fewer than 20,000 words")

#### END YOUR CODE #####

The docx file contains more than 20,000 words.
The content in the string variable is: 
ĐẠI HỌC QUỐC GIA TP HỒ CHÍ MINH	TRƯỜNG ĐẠI HỌC BÁCH KHOA		KHOA…	REPORT TALKSHOW WITH PROF. WILMS………………………………………………………………………………………………………………………………………………………………………….Giảng viên hướng dẫn: Nguyễn Thanh Bình  Mai Đức TrungThành phố Hồ Chí Minh – 2022 HELPFUL ADVICE FROM PROF. WILMSThrough the talk show on December 1, 2022. Mr. Wilms shared with us a lot of information about the field as well as how to study well, how to develop yourself in the best way. He briefly and methodically introduced the development process of computers. The process by which people learn and invent new and more modern devices, helping society more. He told us about computers: IBM PC was introduced in 1981, first GUI in 85. Internet became publicly available (dialup) in 1992. First smartphone (iPhone) was released in 2007. Computer in the 70s : Mainframes ( Punched card, Cobol / Fortran, some C), Some character-based home / game comp

## Question 3: Extracting Data from a JSON File

In [8]:
# Import the library
import json

Task Completion
- Find a JSON File with More Than 20,000 Words
- Store the Content in a String Variable
- Then concatenate the results from the previous questions with this content into a single string variable, with each result on a new line.

+ Step 1: Define a function `extract_text_from_json` to keep the extraction logic in one place.

+ Step 2: Find one JSON file with more than 20,000 words and one with fewer than 20,000 words for testing.

+ Step 3: Call the function, which uses the `json` library, on the chosen file.

+ Step 4: Check whether the word count exceeds 20,000 and, if so, print the content.

+ Step 5: Concatenate the result with those of the previous two questions and print the combined content.

In [11]:
#### YOUR CODE HERE ####
def extract_text_from_json(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        json_content = json.load(file)
        # convert JSON to string
        json_string = json.dumps(json_content, ensure_ascii=False, indent=4)
        return json_string

# count words in the JSON content
def count_words(text):
    return len(text.split())

# json_file_path = "small_json_file.json"  # test for fewer than 20000 words
json_file_path = "large_json_file.json"

json_text = extract_text_from_json(json_file_path)

json_word_count = count_words(json_text)
print(f"Total words in JSON file: {json_word_count}")

if json_word_count > 20_000:
    print("The JSON file contains more than 20,000 words.")
    print("The content in the string variable is:")
    print(json_text)
else:
    print("The JSON file contains fewer than 20,000 words.")

# concatenate the three extracted strings, one per line

concatenate = pdf_text + '\n' + docx_text + '\n' + json_text + '\n'

print("The concatenated string: ")
print(concatenate)

#### END YOUR CODE #####

Total words in JSON file: 20034
The JSON file contains more than 20,000 words.
The content in the string variable is:
{
    "title": "Sample JSON with More Than 20,000 Words",
    "content": "word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word wor

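Note that the string produced by `json.dumps` includes keys, braces, and quotes, so `split()` counts those tokens as words too. If only the textual values matter, one option is to walk the parsed structure and collect its strings. The sketch below rests on that assumption about what "text" means here; `collect_strings` and the sample data are hypothetical and not part of the exercise.

```python
import json


def collect_strings(node):
    """Recursively gather every string value in parsed JSON data."""
    if isinstance(node, str):
        return [node]
    if isinstance(node, dict):
        return [s for value in node.values() for s in collect_strings(value)]
    if isinstance(node, list):
        return [s for item in node for s in collect_strings(item)]
    return []  # numbers, booleans, and null carry no text


# Invented sample standing in for the contents of a JSON file
sample = json.loads(
    '{"title": "Sample JSON", "content": ["word word", {"note": "more words"}], "n": 3}'
)
values_text = "\n".join(collect_strings(sample))
print(len(values_text.split()))  # counts words in string values only
```

Keys such as `"title"` are deliberately skipped; whether they should count as text depends on the data set.
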
## Question 4: Processing the Extracted Data

### Question 4.1: From the data extracted in Questions 1, 2, and 3, concatenate them into a single string variable.







+ Concatenate the strings from the three previous questions into a single string.

In [12]:
#### YOUR CODE HERE ####
concatenate = pdf_text + '\n' + docx_text + '\n' + json_text +'\n'

print("The concatenated string: ")
print(concatenate)
#### END YOUR CODE #####

The concatenated string: 
Undergraduate Topics in Computer ScienceUndergraduate Topics in Computer Science (UTiCS) delivers high-quality instruc-
tional content for undergraduates studying in all areas of computing and informationscience. From core foundational and theoretical material to ﬁnal-year topics and ap-
plications, UTiCS books take fresh, concise, and modern approach and are ideal
for self-study or for a one- or two-semester course. The texts are all authored byestablished experts in their ﬁelds, reviewed by an international advisory board, andcontain numerous examples and problems. Many include fully worked solutions.
For further volumes:
http://www.springer.com/series/7592Maurizio Gabbrielli and Simone Martini
Programming
Languages:Principlesand ParadigmsProf. Dr. Maurizio Gabbrielli
Università di Bologna
Bologna
ItalyProf. Dr. Simone MartiniUniversità di Bologna
Bologna
Italy
Series editorIan Mackie
Advisory board
Samson Abramsky, University of Oxford, UKChris Hankin, Imper

### Question 4.2: Complete the String Processing Function

Description of the function: This function takes a string as input and returns a processed version of the string. The main tasks performed in the function are as follows:

- Replace every character matching the regex ``[^A-Za-z0-9(),!?'`]`` (anything outside the listed set) with a space (" ").
- Replace `'s` with ` 's`.
- Replace `'ve` with ` 've`.
- Replace `n't` with ` n't`.
- Replace `'re` with ` 're`.
- Replace `'d` with ` 'd`.
- Replace `'ll` with ` 'll`.
- Replace `,` with ` , `.
- Replace `!` with ` ! `.
- Replace `(` (regex `\(`) with ` ( `.
- Replace `)` (regex `\)`) with ` ) `.
- Replace `?` (regex `\?`) with ` ? `.
- Replace multiple spaces (`\s{2,}`) with a single space.
- Trim leading and trailing spaces.
- Convert the text to lowercase.
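
When implementing the list above, it may help to keep in mind that escapes such as `\(` and `\?` are regular-expression syntax. `re.sub` interprets them as a literal parenthesis or question mark, whereas `str.replace` searches for the backslash itself. A small sketch of the difference, using an invented sample sentence:

```python
import re

sample = "Is it (really) done?"

# str.replace looks for the two characters backslash + "(" and finds nothing
literal_result = sample.replace(r"\(", " ( ")

# re.sub treats \( as an escaped parenthesis and matches "("
regex_result = re.sub(r"\(", " ( ", sample)

print(literal_result)
print(regex_result)
```

A plain `sample.replace("(", " ( ")` with no backslash gives the same result as the `re.sub` call.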


In [15]:
#### YOUR CODE HERE ####
import re

def process_string(input_string):
    # Replace every character outside A-Za-z0-9(),!?'` with a space (" ")
    input_string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", input_string)

    # Insert a space before common contraction suffixes
    # (str.replace matches literally, so no backslash escapes are needed)
    input_string = input_string.replace("'s", " 's")
    input_string = input_string.replace("'ve", " 've")
    input_string = input_string.replace("n't", " n't")
    input_string = input_string.replace("'re", " 're")
    input_string = input_string.replace("'d", " 'd")
    input_string = input_string.replace("'ll", " 'll")

    # Surround punctuation with spaces
    input_string = input_string.replace(",", " , ")
    input_string = input_string.replace("!", " ! ")
    input_string = input_string.replace("(", " ( ")
    input_string = input_string.replace(")", " ) ")
    input_string = input_string.replace("?", " ? ")

    # Replace multiple spaces with a single space
    input_string = re.sub(r'\s{2,}', ' ', input_string)

    # Trim leading and trailing spaces
    input_string = input_string.strip()

    # Convert the text to lowercase
    input_string = input_string.lower()

    return input_string

#### END YOUR CODE #####

Check the results by applying the function to the extracted data.


In [14]:
#### YOUR CODE HERE ####
preprocess_str = process_string(concatenate)

print("The result after preprocessing: ")
print(preprocess_str)
#### END YOUR CODE #####

The result after preprocessing: 
undergraduate topics in computer scienceundergraduate topics in computer science ( utics ) delivers high quality instruc tional content for undergraduates studying in all areas of computing and informationscience from core foundational and theoretical material to nal year topics and ap plications , utics books take fresh , concise , and modern approach and are ideal for self study or for a one or two semester course the texts are all authored byestablished experts in their elds , reviewed by an international advisory board , andcontain numerous examples and problems many include fully worked solutions for further volumes http www springer com series 7592maurizio gabbrielli and simone martini programming languages principlesand paradigmsprof dr maurizio gabbrielli universit di bologna bologna italyprof dr simone martiniuniversit di bologna bologna italy series editorian mackie advisory board samson abramsky , university of oxford , ukchris hankin , imperia