# Working with Text and PDF Files

This notebook presents the basic Python commands for opening and handling files and the text in them.

Overview of contents:

1. Working with Text Strings
    - 1.1 f-Strings
    - 1.2 Minimum Widths, Alignment and Padding
    - 1.3 Date Formatting
2. Working with Text Files
    - 2.1 Create a File with Magic Commands
    - 2.2 Opening and Handling Text Files
    - 2.3 Writing to Files
    - 2.4 Appending to a File
    - 2.5 Context Managers
3. Working with PDF Files
    - 3.1 Opening PDFs
    - 3.2 Adding to PDFs
    - 3.3 Example: Extracting Text from PDFs

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided through a download link; I haven't found a repository to fork from.*

## 1. Working with Text Strings

### 1.1 f-Strings

In [7]:
# Variable string
name = 'Jose'

# Using the old .format() method
print('His name is {}.'.format(name))

# Using f-strings (since Python 3.6)
print(f'His name is {name}.')

His name is Jose.
His name is Jose.


In [8]:
# We can evaluate expressions inside the curly braces: dict lookups, list indexing, etc.
# but the f-string must be wrapped in "" if we use '' inside the braces
d = {'a':123,'b':456}
print(f"Address: {d['a']} Main Street")

Address: 123 Main Street
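
As a small additional sketch, the braces accept any expression, including method calls and function calls, and a format spec after a colon. The variable names and values below are illustrative only and are not part of the course material.

```python
# Illustrative values (assumptions, not from the course notebook)
name = 'Jose'
items = ['pen', 'book', 'lamp']
price = 1234.5

# Method calls, function calls and indexing are evaluated inside the braces;
# !r inserts the repr() of the value, so the string keeps its quotes
greeting = f'{name.upper()} owns {len(items)} items; the first is {items[0]!r}.'
print(greeting)

# A format spec after the colon controls number formatting:
# thousands separator and two decimals
total = f'Total: {price:,.2f}'
print(total)
```
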


### 1.2 Minimum Widths, Alignment and Padding

In [15]:
# Tuples which represent table rows
library = [('Author', 'Topic', 'Pages'),
           ('Twain', 'Rafting', 601),
           ('Feynman', 'Physics', 95),
           ('Hamilton', 'Mythology', 144)]

# We print with f-strings and tuple unpacking,
# setting a minimum width with :{width}.
# A <, > or ^ before the width sets left, right or center alignment,
# and a character placed before that symbol is used as padding.
for author, topic, pages in library:
    print(f'{author:{10}} {topic:{12}} {pages:.>{12}}')

Author     Topic        .......Pages
Twain      Rafting      .........601
Feynman    Physics      ..........95
Hamilton   Mythology    .........144
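
The three alignment symbols can be compared side by side in a minimal sketch. The word and the width of 9 are arbitrary choices for illustration.

```python
word = 'NLP'

# '<' left-aligns, '>' right-aligns and '^' centers within the given width
left = f'{word:<9}|'
right = f'{word:>9}|'
# A character placed before the alignment symbol is used as padding
center = f'{word:*^9}|'

print(left)
print(right)
print(center)
```

Strings are left-aligned by default and numbers are right-aligned, which is why only the `Pages` column in the table above needed an explicit `>`.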


### 1.3 Date Formatting

In [17]:
# Datetime object
from datetime import datetime
today = datetime(year=2018, month=1, day=27)

In [20]:
# Print with native formatting
print(f'{today}')

2018-01-27 00:00:00


In [21]:
# We can format a datetime however we like using strftime format codes.
# The codes are listed on this page:
# https://strftime.org/
print(f'{today:%B %d, %Y}')

January 27, 2018
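
The format codes used inside the f-string are the same ones accepted by `strftime()` and `strptime()`. The sketch below uses numeric codes only; the input string `'27/01/2018'` is a made-up example.

```python
from datetime import datetime

today = datetime(year=2018, month=1, day=27)

# The f-string format spec and strftime() produce the same text
iso_from_fstring = f'{today:%Y-%m-%d}'
iso_from_strftime = today.strftime('%Y-%m-%d')
print(iso_from_fstring, iso_from_strftime)

# strptime() goes the other way: it parses text into a datetime
parsed = datetime.strptime('27/01/2018', '%d/%m/%Y')
print(parsed == today)
```
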


## 2. Working with Text Files

### 2.1 Create a File with Magic Commands

In [25]:
# This creates a text file in Jupyter:
# magic command %%writefile + filename + contents

In [26]:
%%writefile test.txt
Hello, this is a quick test file.
This is the second line of the file.

Overwriting test.txt


### 2.2 Opening and Handling Text Files

In [30]:
pwd

'/Users/mxagar/nexo/git_repositories/nlp_guide/01_Python_Text_Basics'

In [49]:
# Open the test.txt file we created earlier: it is loaded as a file object
my_file = open('test.txt')

In [50]:
# We can now read the COMPLETE file: the content is returned as a string.
# After that, the reading cursor is at the end, so another read() returns an empty string.
my_file.read()

'Hello, this is a quick test file.\nThis is the second line of the file.\n'

In [51]:
# Set the cursor to index position 0 = start
# After that, we can read() the complete text again
my_file.seek(0)

0
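
The cursor behavior can be checked in isolation with a short sketch. It writes a hypothetical `demo.txt` into a temporary folder so that `test.txt` is left untouched.

```python
import os
import tempfile

# 'demo.txt' is a hypothetical file created in a temporary folder
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')
    with open(path, 'w') as f:
        f.write('first line\nsecond line\n')

    f = open(path)
    first_read = f.read()    # whole file; the cursor is now at the end
    second_read = f.read()   # nothing left to read: empty string
    f.seek(0)                # move the cursor back to the start
    third_read = f.read()    # whole file again
    f.close()

print(repr(second_read))
print(third_read == first_read)
```
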

In [52]:
# readlines() returns a list of the lines in the file (each keeps its trailing '\n'): very practical!
my_file.seek(0)
mylines = my_file.readlines()
for line in mylines:
    print(line)

Hello, this is a quick test file.

This is the second line of the file.



In [53]:
# Always close opened files;
# otherwise, a file that is held open by several processes can cause problems
my_file.close()
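
The blank lines in the `readlines()` output above appear because every line keeps its trailing `'\n'` and `print()` adds another one. A minimal sketch of one way to strip them follows; `io.StringIO` stands in for an opened text file here, which is an assumption made only to keep the example self-contained.

```python
import io

# StringIO behaves like an opened text file (stand-in for test.txt)
fake_file = io.StringIO('Hello, this is a quick test file.\n'
                        'This is the second line of the file.\n')

raw_lines = fake_file.readlines()
# Remove the trailing line break from each line
clean_lines = [line.rstrip('\n') for line in raw_lines]

for line in clean_lines:
    print(line)
```
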

### 2.3 Writing to Files

In [72]:
# The second argument of open() is the mode: 'w' stands for write.
# Passing 'w+' lets us both write (w) and read (+) the file.
# Use Shift+TAB in Jupyter to see the docstring of open()
# HOWEVER: 'w' and 'w+' erase any existing content as soon as the file is opened!
my_file = open('test.txt','w+')

In [73]:
# Write to the file
my_file.write('This is a new first line')

24

In [74]:
# Don't forget we have a cursor!
my_file.seek(0)

0

In [75]:
# Check the content was overwritten
my_file.read()

'This is a new first line'

In [76]:
my_file.close()
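
To make the truncation visible, the sketch below checks the file size right after opening with `'w+'`. It works on a hypothetical `demo.txt` in a temporary folder rather than on `test.txt`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name
    with open(path, 'w') as f:
        f.write('old content that will be lost')

    f = open(path, 'w+')
    # The file is emptied at open time, before anything is written
    size_after_open = os.path.getsize(path)
    # write() returns the number of characters written
    n_written = f.write('This is a new first line')
    f.seek(0)
    new_content = f.read()
    f.close()

print(size_after_open, n_written)
print(new_content)
```
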

### 2.4 Appending to a File

In [77]:
# Passing 'a+' lets us read (+) and append (a) to the file.
# The cursor is set at the end of the file
my_file = open('test.txt','a+')
# We need to add line breaks manually!
my_file.write('\nThis line is being appended to test.txt')
my_file.write('\nAnd another line here.')

23

In [78]:
# Don't forget to set the cursor where we want it!
my_file.seek(0)
print(my_file.read())

This is a new first line
This line is being appended to test.txt
And another line here.


In [79]:
my_file.close()

In [80]:
# Append with magic commands

In [82]:
%%writefile -a test.txt
This is more text being appended to test.txt
And another line here.

Appending to test.txt
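
The starting cursor position in `'a+'` mode can be confirmed with a short sketch: reading immediately after opening returns nothing, while the existing content is preserved. The file `demo.txt` is hypothetical and lives in a temporary folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name
    with open(path, 'w') as f:
        f.write('line 1')

    f = open(path, 'a+')
    read_at_open = f.read()   # empty: the cursor starts at the end of the file
    f.write('\nline 2')       # the line break is added by hand
    f.seek(0)
    content = f.read()        # the old content is still there
    f.close()

print(repr(read_at_open))
print(content)
```
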


### 2.5 Context Managers

In [84]:
# We define a context in which the file object is bound to a variable name.
# When we exit the with block, the file is closed automatically.
# The variable still exists but refers to a closed file, so we store what we read in another object.
with open('test.txt','r') as txt:
    first_line = txt.readlines()[0]
print(first_line)

This is a new first line



In [85]:
# Iterating through a file
with open('test.txt','r') as txt:
    for line in txt:
        print(line, end='')  # the end='' argument removes extra linebreaks

This is a new first line
This line is being appended to test.txt
And another line here.
This is more text being appended to test.txt
And another line here.
This is more text being appended to test.txt
And another line here.
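
A minimal sketch of what happens at the end of a `with` block: the file object reports itself as closed, and further reads raise a `ValueError`. The file `demo.txt` is a hypothetical file in a temporary folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name
    with open(path, 'w') as f:
        f.write('line 1\nline 2\n')

    with open(path) as txt:
        closed_inside = txt.closed    # False while we are inside the block
        lines = txt.readlines()       # keep the content in another object
    closed_outside = txt.closed       # True once the block has ended

    # Reading from the closed file object is an error
    try:
        txt.read()
        error_message = None
    except ValueError as err:
        error_message = str(err)

print(closed_inside, closed_outside)
print(lines)
print(error_message)
```
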


## 3. Working with PDF Files

### 3.1 Opening PDFs

We can use the PyPDF2 library to open PDF files; however, text extraction does not work for every PDF.

```
pip install PyPDF2
```

In [99]:
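# PyPDF2 opens PDF files, although not every PDF yields usable text.
# To install it: pip install PyPDF2
# Note: this notebook uses the older PyPDF2 names (PdfFileReader, numPages, getPage, extractText).
# PyPDF2 3.0 removed them in favor of PdfReader, len(reader.pages), reader.pages[i] and extract_text().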
import PyPDF2

In [100]:
# Notice that we open the file in binary read mode: 'rb' (r = read, b = binary)
f = open('US_Declaration.pdf','rb')

In [101]:
# We instantiate our PDF reader
pdf_reader = PyPDF2.PdfFileReader(f)

In [102]:
# Type . and press TAB to see all available methods, attributes, etc.
pdf_reader.numPages

5

In [103]:
# Get page 1 (page indices start at 0)
page_one = pdf_reader.getPage(0)

In [104]:
# Extract text from a page
page_one_text = page_one.extractText()

In [105]:
print(page_one_text)

Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the
political bands which have connected them with another, and to assume among the powers of the
earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle

them, a decent respect to the opinions of mankind requires that they should declare the causes

which impel them to the separation. 
We hold these truths to be self-evident, that all men are created equal, that they are endowed by

their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit
of Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving

their just powers from the consent of the governed,ŠThat whenever any Form of Government
becomes destructive of these ends, it is the Right of the People to alter or to abolish it,

In [106]:
# Always close the file!
f.close()

### 3.2 Adding to PDFs

In [107]:
# We can only copy pages and append them to the end
f = open('US_Declaration.pdf','rb')
pdf_reader = PyPDF2.PdfFileReader(f)

In [108]:
first_page = pdf_reader.getPage(0)

In [114]:
# We create a PDF writer and add to it the page we extracted
pdf_writer = PyPDF2.PdfFileWriter()

In [115]:
pdf_writer.addPage(first_page)

In [119]:
# New file: wb = write binary
pdf_output = open("Some_New_Doc.pdf","wb")

In [120]:
# Write contents
pdf_writer.write(pdf_output)

In [122]:
# Close both files
# After that, we can open the new PDF and check it
pdf_output.close()
f.close()

### 3.3 Example: Extracting Text from PDFs

In [123]:
f = open('US_Declaration.pdf','rb')
# List of every page's text.
# The index will correspond to the page number.
pdf_text = [0]  # zero is a placeholder to make page 1 = index 1
# Create PDF reader
pdf_reader = PyPDF2.PdfFileReader(f)
# Extract text page by page
for p in range(pdf_reader.numPages):
    page = pdf_reader.getPage(p)
    pdf_text.append(page.extractText())
# Close file
f.close()

In [125]:
len(pdf_text)

6
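
The placeholder at index 0 is why the list has 6 entries for a 5-page document. As an alternative sketch, a dictionary keyed by page number avoids the placeholder. The `extracted` list below is made-up stand-in data, not real PyPDF2 output.

```python
# Stand-in for the text extracted from each page (hypothetical data)
extracted = ['text of page one', 'text of page two', 'text of page three']

# enumerate(start=1) numbers the pages from 1, so no placeholder is needed
pages_by_number = {number: text
                   for number, text in enumerate(extracted, start=1)}

print(len(pages_by_number))
print(pages_by_number[1])
```

With this layout, `len()` matches the real page count, at the cost of losing list slicing.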

In [127]:
# Print page 1
print(pdf_text[1])

Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the
political bands which have connected them with another, and to assume among the powers of the
earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle

them, a decent respect to the opinions of mankind requires that they should declare the causes

which impel them to the separation. 
We hold these truths to be self-evident, that all men are created equal, that they are endowed by

their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit
of Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving

their just powers from the consent of the governed,ŠThat whenever any Form of Government
becomes destructive of these ends, it is the Right of the People to alter or to abolish it,
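
The extracted text contains hard line breaks, blank lines and a stray `Š` character. The sketch below shows one possible clean-up using only the standard library. It assumes that `Š` stands in for a dash in the original document, which is a guess based on the output above and may not hold for other PDFs.

```python
import re

# A short excerpt copied from the extracted text above
raw = ('of Happiness.ŠThat to secure these rights, Governments are '
       'instituted among Men, deriving\n\n'
       'their just powers from the consent of the governed,ŠThat whenever')

# Assumption: the stray character replaces a dash in the source PDF
with_dashes = raw.replace('Š', ' - ')
# Collapse line breaks and repeated spaces into single spaces
cleaned = re.sub(r'\s+', ' ', with_dashes).strip()

print(cleaned)
```
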