# ch13 Working with PDF and Word Documents

## Working with PDF and Word Documents

- PyPDF2
- Python-Docx

## PDF Documents

- PDF: Portable Document Format

### The Problematic PDF Format

- Text cannot always be extracted accurately as plaintext.
- Some PDF files cannot be opened at all.

### Extracting Text from PDFs

In [6]:
import PyPDF2

In [9]:
pdfFileObj = open('src/meetingminutes.pdf', 'rb')

In [11]:
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

In [12]:
pdfReader.numPages

19

In [13]:
!open src/meetingminutes.pdf

In [14]:
pageObj = pdfReader.getPage(0)

In [16]:
print(pageObj.extractText())

OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS   Meeting of March 7, 2014        
     The Board of Elementary and Secondary Education shall provide leadership and create policies for education that expand opportunities for children, empower families and communities, and advance Louisiana in an increasingly competitive global market. BOARD  of ELEMENTARY and  SECONDARY EDUCATION  
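
The cells above read only page 0. A minimal sketch of pulling the text of every page follows; `extract_all_text` is a hypothetical helper name, and `reader` is assumed to behave like the `PdfFileReader` used above (it only needs `numPages` and `getPage()`). The doubled letters in the output above show that the extracted text is not always clean, so treat the result as approximate.

```python
def extract_all_text(reader):
    """Return the text of every page, separated by blank lines.

    `reader` is assumed to expose numPages and getPage(), like the
    PdfFileReader objects used in these notes.
    """
    pages = []
    for page_num in range(reader.numPages):
        # Page indexes are zero-based, so range(numPages) covers them all.
        pages.append(reader.getPage(page_num).extractText())
    return '\n\n'.join(pages)
```

With the reader from the cells above this would be called as `extract_all_text(pdfReader)` while the file is still open.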


### Decrypting PDFs

In [1]:
import PyPDF2

In [2]:
pdfReader = PyPDF2.PdfFileReader(open('src/encrypted.pdf', 'rb'))

In [3]:
pdfReader.isEncrypted

True

In [5]:
!open src/encrypted.pdf

In [4]:
pdfReader.decrypt('rosebud')

1

In [5]:
pdfReader.numPages

19

In [6]:
pdfReader.isEncrypted

True

In [8]:
# Why IndexError?
# Calling getPage() before decrypt() raises an error.
# It looks like the file has to be reopened after that.
pageObj = pdfReader.getPage(0)

In [9]:
pageObj.extractText()

u'OOFFFFIICCIIAALL  BBOOAARRDD  MMIINNUUTTEESS   Meeting of March 7, 2014        \n     The Board of Elementary and Secondary Education shall provide leadership and create policies for education that expand opportunities for children, empower families and communities, and advance Louisiana in an increasingly competitive global market. BOARD  of ELEMENTARY and  SECONDARY EDUCATION  '
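
Since reading a page before decrypting fails, it can help to decrypt right after opening. The sketch below is one way to do that under stated assumptions: `decrypt_or_fail` is a hypothetical helper, and `reader` is assumed to behave like the `PdfFileReader` above, with `isEncrypted` and a `decrypt()` that returns 0 for a wrong password. As the output above shows, `isEncrypted` stays `True` even after a successful `decrypt()`, so it cannot be used to check whether decryption worked.

```python
def decrypt_or_fail(reader, password):
    """Decrypt `reader` if it is encrypted; raise ValueError on a wrong password."""
    if reader.isEncrypted:
        # decrypt() returns 0 when the password does not match.
        if reader.decrypt(password) == 0:
            raise ValueError('wrong password')
    return reader
```

Calling this immediately after creating the reader, and before any `getPage()` call, avoids the error noted above.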

### Creating PDFs

- PyPDF2 cannot write arbitrary text to a PDF.
- It is limited to copying pages from other PDFs, rotating pages, overlaying pages, and encrypting files.

### General Approach

1. Open one or more existing PDFs (the source PDFs) into PdfFileReader objects.
2. Create a new PdfFileWriter object.
3. Copy pages from the PdfFileReader objects into the PdfFileWriter object.
4. Finally, use the PdfFileWriter object to write the output PDF.


- Open the output file with open() in 'wb' (write-binary) mode.
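
The four steps can be sketched as two small functions. This is an illustration, not the only way to structure it: `copy_pages` and `merge_pdfs` are hypothetical names, and the code assumes the PyPDF2 1.x API used throughout these notes (`PdfFileReader`, `PdfFileWriter`, `numPages`, `getPage()`, `addPage()`, `write()`). The source files are kept open until the output has been written, matching the order of the `close()` calls in the cells below.

```python
def copy_pages(reader, writer, start=0):
    """Append pages start..last of `reader` to `writer`; return how many were copied."""
    copied = 0
    for page_num in range(start, reader.numPages):
        writer.addPage(reader.getPage(page_num))
        copied += 1
    return copied


def merge_pdfs(source_paths, output_path, start=0):
    """Steps 1-4: open the sources, create a writer, copy pages, write the output."""
    import PyPDF2  # imported here so copy_pages() can be used on its own

    writer = PyPDF2.PdfFileWriter()
    open_files = []
    try:
        for path in source_paths:
            source = open(path, 'rb')
            open_files.append(source)
            copy_pages(PyPDF2.PdfFileReader(source), writer, start)
        # The output file must be opened in write-binary mode.
        with open(output_path, 'wb') as output:
            writer.write(output)
    finally:
        # Close the sources only after the output has been written.
        for source in open_files:
            source.close()
```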

### Copying pages

In [10]:
import PyPDF2

In [11]:
pdf1File = open('src/meetingminutes.pdf', 'rb')

In [12]:
pdf2File = open('src/meetingminutes2.pdf', 'rb')

In [13]:
pdf1Reader = PyPDF2.PdfFileReader(pdf1File)

In [14]:
pdf2Reader = PyPDF2.PdfFileReader(pdf2File)

In [15]:
pdfWriter = PyPDF2.PdfFileWriter()

In [17]:
for pageNum in range(pdf1Reader.numPages):
    pageObj = pdf1Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

In [18]:
for pageNum in range(pdf2Reader.numPages):
    pageObj = pdf2Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

In [19]:
pdfOutputFile = open('combinedminutes.pdf', 'wb')

In [20]:
pdfWriter.write(pdfOutputFile)

In [21]:
pdfOutputFile.close()

In [22]:
pdf1File.close()
pdf2File.close()

In [23]:
!open combinedminutes.pdf

#### Notes on using addPage()

- addPage() always appends to the end. It cannot place a page in the middle of a PdfFileWriter, so control the order in which pages are added.
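
One way to control the order is to work out the full page sequence first and only then call `addPage()` in that sequence. The sketch below is plain Python with no PDF library involved; `splice_order` and the `'main'` / `'extra'` labels are hypothetical names chosen for illustration.

```python
def splice_order(main_count, extra_count, position):
    """Return (source, page_index) pairs with the extra pages placed at `position`.

    The result lists pages in the order they should be passed to addPage().
    """
    main = [('main', i) for i in range(main_count)]
    extra = [('extra', i) for i in range(extra_count)]
    return main[:position] + extra + main[position:]


# Put a 2-page document after the first page of a 3-page document.
order = splice_order(3, 2, 1)
```

Each pair in `order` then tells the copying loop which reader to take the next page from.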

### Rotating Pages

In [26]:
import PyPDF2

In [27]:
minutesFile = open('src/meetingminutes.pdf', 'rb')

In [28]:
pdfReader = PyPDF2.PdfFileReader(minutesFile)

In [29]:
page = pdfReader.getPage(0)

In [30]:
page.rotateClockwise(90)

{'/Contents': [IndirectObject(961, 0),
  IndirectObject(962, 0),
  IndirectObject(963, 0),
  IndirectObject(964, 0),
  IndirectObject(965, 0),
  IndirectObject(966, 0),
  IndirectObject(967, 0),
  IndirectObject(968, 0)],
 '/CropBox': [0, 0, 612, 792],
 '/MediaBox': [0, 0, 612, 792],
 '/Parent': {'/Count': 9,
  '/Kids': [IndirectObject(959, 0),
   IndirectObject(1, 0),
   IndirectObject(11, 0),
   IndirectObject(13, 0),
   IndirectObject(15, 0),
   IndirectObject(17, 0),
   IndirectObject(19, 0),
   IndirectObject(24, 0),
   IndirectObject(26, 0)],
  '/Parent': {'/Count': 19,
   '/Kids': [IndirectObject(953, 0),
    IndirectObject(954, 0),
    IndirectObject(955, 0)],
   '/Type': '/Pages'},
  '/Type': '/Pages'},
 '/Resources': {'/ColorSpace': {'/CS0': ['/ICCBased', IndirectObject(969, 0)],
   '/CS1': ['/ICCBased', IndirectObject(970, 0)],
   '/CS2': ['/ICCBased', IndirectObject(970, 0)]},
  '/ExtGState': {'/GS0': {'/AIS': <PyPDF2.generic.BooleanObject at 0x1053d5f10>,
    '/BM': '/Norm

In [31]:
pdfWriter = PyPDF2.PdfFileWriter()

In [32]:
pdfWriter.addPage(page)

In [33]:
resultPdfFile = open('rotatePage.pdf', 'wb')

In [34]:
pdfWriter.write(resultPdfFile)

In [35]:
resultPdfFile.close()
minutesFile.close()

In [36]:
!open rotatePage.pdf

### Overlaying Pages

- Adding a logo, timestamp, or watermark to a page

In [37]:
import PyPDF2

In [40]:
minutesFile = open('src/meetingminutes.pdf', 'rb')

In [41]:
pdfReader = PyPDF2.PdfFileReader(minutesFile)

In [42]:
minutesFirstPage = pdfReader.getPage(0)

In [43]:
pdfWatermarkReader = PyPDF2.PdfFileReader(open('src/watermark.pdf', 'rb'))

In [44]:
minutesFirstPage.mergePage(pdfWatermarkReader.getPage(0))

In [45]:
pdfWriter = PyPDF2.PdfFileWriter()

In [46]:
pdfWriter.addPage(minutesFirstPage)

In [47]:
for pageNum in range(1, pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

In [48]:
resultPdfFile = open('watermarkedCover.pdf', 'wb')

In [49]:
pdfWriter.write(resultPdfFile)

In [50]:
minutesFile.close()
resultPdfFile.close()

In [3]:
!open watermarkedCover.pdf

In [2]:
!open src/watermark.pdf

### Encrypting PDFs

In [53]:
import PyPDF2

In [54]:
pdfFile = open('src/meetingminutes.pdf', 'rb')

In [55]:
pdfReader = PyPDF2.PdfFileReader(pdfFile)

In [56]:
pdfWriter = PyPDF2.PdfFileWriter()

In [57]:
for pageNum in range(pdfReader.numPages):
    pdfWriter.addPage(pdfReader.getPage(pageNum))

In [58]:
pdfWriter.encrypt('swordfish')

In [59]:
resultPdf = open('encyptedminutes.pdf', 'wb')

In [60]:
pdfWriter.write(resultPdf)

In [61]:
pdfFile.close()
resultPdf.close()

In [62]:
!open encyptedminutes.pdf

## Project: Combining Select Pages from Many PDFs

### High level

- Find all PDF files in the current working directory
- Sort the filenames so the PDFs are added in order.
- Write each page, excluding the first page, of each PDF to the output file.

### What the Python Code Needs to Do

- Call os.listdir() to find all the files in the working directory and remove any non-PDF files.
- Call Python's sort() list method to alphabetize the filenames.
- Create a PdfFileWriter object for the output PDF.
- Loop over each PDF file, creating a PdfFileReader object for it.
- Loop over each page (except the first) in each PDF file.
- Add the pages to the output PDF.
- Write the output PDF to a file named allminutes.pdf

### Step 1: Find All PDF Files

In [64]:
l = [1, 5, 3, 9, 4]

In [78]:
import os
pdfFiles = [filename for filename in os.listdir('.') if filename.endswith('.pdf')]
pdfFiles

['combinedminutes.pdf',
 'encyptedminutes.pdf',
 'rotatePage.pdf',
 'watermarkedCover.pdf']

In [75]:
pdfFiles.sort(reverse=True)
pdfFiles

['watermarkedCover.pdf',
 'rotatePage.pdf',
 'encyptedminutes.pdf',
 'combinedminutes.pdf']

In [77]:
pdfFiles.sort(key=str.lower)
pdfFiles

['combinedminutes.pdf',
 'encyptedminutes.pdf',
 'rotatePage.pdf',
 'watermarkedCover.pdf']
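
The filenames above happen to sort the same way with or without `key=str.lower`. The difference shows up once some names start with capital letters, because uppercase letters sort before lowercase ones by default. A small sketch with made-up filenames follows; `list_pdfs` is a hypothetical helper, and matching the extension case-insensitively is an added assumption that the cells above do not make.

```python
def list_pdfs(filenames):
    """Keep only PDF filenames and sort them without regard to case."""
    pdfs = [name for name in filenames if name.lower().endswith('.pdf')]
    return sorted(pdfs, key=str.lower)


# Made-up filenames, used only to show the ordering.
names = ['minutes.pdf', 'Agenda.pdf', 'budget.pdf', 'Zoning.PDF', 'notes.txt']

default_order = sorted(n for n in names if n.lower().endswith('.pdf'))
# -> ['Agenda.pdf', 'Zoning.PDF', 'budget.pdf', 'minutes.pdf']

case_insensitive = list_pdfs(names)
# -> ['Agenda.pdf', 'budget.pdf', 'minutes.pdf', 'Zoning.PDF']
```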

In [79]:
import os
import PyPDF2

# Get all the PDF filenames.

pdfFiles = [filename for filename in os.listdir('.') 
            if filename.endswith('.pdf')]
pdfFiles.sort(key=str.lower)

pdfWriter = PyPDF2.PdfFileWriter()

# TODO: Loop through all the PDF files.

# TODO: Loop through all the pages (except the first) and add them.

# TODO: Save the resulting PDF to a file.

### Step 2: Open Each PDF

In [None]:
import os
import PyPDF2

# Get all the PDF filenames.

pdfFiles = [filename for filename in os.listdir('.') 
            if filename.endswith('.pdf')]
pdfFiles.sort(key=str.lower)

pdfWriter = PyPDF2.PdfFileWriter()

# TODO: Loop through all the PDF files.
for filename in pdfFiles:
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# TODO: Loop through all the pages (except the first) and add them.

# TODO: Save the resulting PDF to a file.

### Step 3: Add Each Page

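#### Notes on zero-based indexing

- Page indexes start at 0.
- Starting the range at 1 therefore skips the first page.
- Using pdfReader.numPages as the end of the range already covers the last page, so there is no need to add 1. This is easy to get confused about, so be careful.


- **Exclude unwanted files as early as possible.** If they are filtered out of pdfFiles up front, there is no need to decide later whether each one should be included or skipped.
- Take input from the command line at the very start. Taking it partway through means editing the source in the middle, which is a hassle.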
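
Both notes can be checked with a few lines of plain Python. The helper names below (`pages_after_first`, `excluded_names`, `select_files`) are hypothetical, and reading the names to skip from the command line is only one possible convention.

```python
def pages_after_first(num_pages):
    """Zero-based indexes of every page except the first."""
    # range() stops before num_pages, so the last index is num_pages - 1,
    # which is the last page. No "+ 1" is needed.
    return list(range(1, num_pages))


def excluded_names(argv):
    """Filenames to skip, taken from the command line (argv[0] is the script name)."""
    return set(argv[1:])


def select_files(filenames, excluded):
    """Drop excluded files up front so later steps never see them."""
    return [name for name in filenames if name not in excluded]


# In a script this would typically be: excluded = excluded_names(sys.argv)
excluded = excluded_names(['combine.py', 'allminutes.pdf'])
selected = select_files(['a.pdf', 'allminutes.pdf', 'b.pdf'], excluded)
```

For a 19-page PDF, `pages_after_first(19)` gives the indexes 1 through 18, which are pages 2 through 19.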

In [97]:
import os
import PyPDF2

# Get all the PDF filenames.

# Skip the encrypted PDF and any earlier output file.
pdfFiles = [filename for filename in os.listdir('.')
            if filename.endswith('.pdf')
            and 'encyptedminutes.pdf' not in filename
            and 'allminutes.pdf' not in filename]
pdfFiles.sort(key=str.lower)

pdfWriter = PyPDF2.PdfFileWriter()

# TODO: Loop through all the PDF files.
for filename in pdfFiles:
    pdfFileObj = open(filename, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#     print(filename, pdfReader.numPages)

    # TODO: Loop through all the pages (except the first) and add them.
    for pageNum in range(1, pdfReader.numPages):
#         print pageNum
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)

# TODO: Save the resulting PDF to a file.
pdfOutput = open('allminutes.pdf', 'wb')
pdfWriter.write(pdfOutput)
pdfOutput.close()

### Ideas for Similar Programs

- Cut out specific pages from PDFs.
- Reorder pages in a PDF.
- Create a PDF from only those pages that have some specific text, identified by extractText()
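
As a starting point for the last idea, the sketch below finds which pages mention a given string. `pages_containing` is a hypothetical helper, and `reader` is assumed to behave like the `PdfFileReader` used above. Because extractText() is imperfect, some matches may be missed.

```python
def pages_containing(reader, needle):
    """Return the zero-based indexes of pages whose extracted text contains `needle`."""
    matches = []
    for page_num in range(reader.numPages):
        if needle in reader.getPage(page_num).extractText():
            matches.append(page_num)
    return matches
```

The returned indexes can then be fed to `getPage()` and `addPage()` to build the filtered PDF.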

## Word Documents

> pip install python-docx

```python
import docx
```

- Install the package as python-docx, but import it as docx.
- Document: represents the entire document.
- Paragraph: a Document object contains multiple Paragraph objects.

### Reading Word Documents

In [12]:
import docx

In [28]:
doc = docx.Document('src/demo.docx')

In [29]:
len(doc.paragraphs)

7

In [30]:
doc.paragraphs[0].text

u'Document Title'

In [31]:
len(doc.paragraphs[0].runs)

1

In [32]:
doc.paragraphs[1].text

u'A plain paragraph with some bold and some italic'

In [33]:
len(doc.paragraphs[1].runs)

5

In [34]:
for i in doc.paragraphs[1].runs:
    print(i.text)

A plain paragraph with
 some 
bold
 and some 
italic


In [35]:
doc.paragraphs[1].runs[0].text

'A plain paragraph with'

### Getting the Full Text from a .docx File

In [13]:
%%writefile readDocx.py
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = [' ' + para.text for para in doc.paragraphs]
    return '\n\n'.join(fullText)

Overwriting readDocx.py


In [1]:
import readDocx

In [2]:
print(readDocx.getText('src/demo.docx'))

 Document Title

 A plain paragraph with some bold and some italic

 Heading, level 1

 Intense quote

 first item in unordered list

 first item in ordered list

 



### Styling Paragraph and Run Objects

In [3]:
!pip freeze | grep python-docx

python-docx==0.8.5


- Paragraph styles -> Paragraph objects
- Character styles -> Run objects
- Linked styles -> both
- Do not put spaces in style names; names containing spaces are not recognized.
- When applying a linked style to a Run object, append 'Char' to the style name.
``` python
paragraphObj.style = 'Quote'
runObj.style = 'QuoteChar'
```

### Creating Word Documents with Nondefault Styles

### Run Attributes

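- True: the attribute is always enabled, whatever other styles apply to the run.
- False: the attribute is always disabled.
- None: the run falls back to whatever its style specifies.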
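
The three values amount to an override rule: an explicit True or False on the run wins, and None defers to the style. A plain-Python sketch of that rule follows; `effective_setting` is a hypothetical function written for illustration and is not part of python-docx.

```python
def effective_setting(run_value, style_value):
    """Resolve a tri-state run attribute such as bold or italic.

    run_value:   True, False, or None (the value set on the run itself)
    style_value: what the run's style specifies
    """
    if run_value is None:
        # Nothing set on the run, so the style decides.
        return style_value
    return run_value
```

For example, a run with `bold = None` inside a bold style is shown bold, while `bold = False` turns bold off even if the style asks for it.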

#### Table 13-1. Run Object text Attributes

Attribute | Description
--- | ---
bold | The text appears in bold.
italic | The text appears in italic.
underline | The text is underlined.
strike | The text appears with strikethrough.
double_strike | The text appears with double strikethrough.
all_caps | The text appears in capital letters.
small_caps | The text appears in capital letters, with lowercase letters two points smaller.
shadow | The text appears with a shadow.
outline | The text appears outlined rather than solid.
rtl | The text is written right-to-left.
imprint | The text appears pressed into the page.
emboss | The text appears raised off the page in relief.

In [4]:
import docx

In [5]:
doc = docx.Document('src/demo.docx')

In [6]:
doc.paragraphs[0].text

u'Document Title'

In [7]:
doc.paragraphs[0].style

_ParagraphStyle('Title') id: 4500860560

In [8]:
doc.paragraphs[0].style = 'Normal'

In [9]:
doc.paragraphs[1].text

u'A plain paragraph with some bold and some italic'

In [11]:
tuple(run.text for run in doc.paragraphs[1].runs)

('A plain paragraph with', ' some ', 'bold', ' and some ', 'italic')

In [18]:
document = docx.Document()

In [12]:
doc.paragraphs[1].runs[0].style = 'QuoteChar'



In [30]:
document.styles

<docx.styles.styles.Styles at 0x109ec0b90>

In [31]:
doc.paragraphs[1].runs[0].style = document.styles['Heading1Char']

In [32]:
doc.paragraphs[1].runs[0].underline = True

In [33]:
doc.paragraphs[1].runs[3].underline = True

In [34]:
doc.save('restyled.docx')

### Writing Word Documents

In [35]:
import docx

In [36]:
doc = docx.Document()

In [37]:
doc.add_paragraph('Hello world!')

<docx.text.paragraph.Paragraph at 0x109ec0b50>

In [38]:
doc.save('helloworld.docx')

#### add_run()

In [39]:
import docx

In [40]:
doc = docx.Document()

In [41]:
doc.add_paragraph('Hello world!')

<docx.text.paragraph.Paragraph at 0x10a196190>

In [42]:
paraObj1 = doc.add_paragraph('This is a second paragraph.')

In [43]:
paraObj2 = doc.add_paragraph('This is a yet another paragraph.')

In [46]:
help(paraObj1.add_run)

Help on method add_run in module docx.text.paragraph:

add_run(self, text=None, style=None) method of docx.text.paragraph.Paragraph instance
    Append a run to this paragraph containing *text* and having character
    style identified by style ID *style*. *text* can contain tab
    (``\t``) characters, which are converted to the appropriate XML form
    for a tab. *text* can also include newline (``\n``) or carriage
    return (``\r``) characters, each of which is converted to a line
    break.



In [44]:
# This appends a Run object to the paragraph.
# A run stays the same Run object until the text formatting changes.
paraObj1.add_run(' This text is being added to the second paragraph.')

<docx.text.run.Run at 0x10a196450>
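
The idea that a run lasts until the formatting changes can be shown without python-docx. The sketch below groups consecutive characters that share a style into runs; `split_into_runs` and `styled` are hypothetical helpers, and the sample text only imitates the demo paragraph read earlier in these notes.

```python
from itertools import groupby


def styled(text, style):
    """Tag every character of `text` with the same style label."""
    return [(ch, style) for ch in text]


def split_into_runs(chars):
    """Group consecutive (character, style) pairs with the same style into runs."""
    runs = []
    for style, group in groupby(chars, key=lambda pair: pair[1]):
        runs.append((''.join(ch for ch, _ in group), style))
    return runs


chars = (styled('A plain paragraph with some ', None)
         + styled('bold', 'bold')
         + styled(' and some ', None)
         + styled('italic', 'italic'))
runs = split_into_runs(chars)
```

Here `runs` has four entries: plain text, the bold word, plain text again, and the italic word. A real document can contain more runs than this, because an editor may split text with identical formatting into several runs.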

In [47]:
doc.add_paragraph('Hello world!', 'Title')

<docx.text.paragraph.Paragraph at 0x10a1965d0>

In [48]:
doc.save('multipleParagraphs.docx')

### Adding Headings

In [49]:
doc = docx.Document()

In [50]:
doc.add_heading('Header 0', 0)

<docx.text.paragraph.Paragraph at 0x10a19f5d0>

In [51]:
doc.add_heading('Header 1', 1)

<docx.text.paragraph.Paragraph at 0x10a19f690>

In [52]:
doc.add_heading('Header 2', 2)

<docx.text.paragraph.Paragraph at 0x10a19f850>

In [53]:
doc.add_heading('Header 3', 3)

<docx.text.paragraph.Paragraph at 0x10a19fad0>

In [54]:
doc.add_heading('Header 4', 4)

<docx.text.paragraph.Paragraph at 0x10a19fb10>

In [55]:
doc.save('headings.docx')

### Adding Line and Page Breaks

In [56]:
doc = docx.Document()

In [57]:
doc.add_paragraph('This is on the first page!')

<docx.text.paragraph.Paragraph at 0x10a196550>

In [60]:
doc.add_page_break()

<docx.text.paragraph.Paragraph at 0x10a1857d0>

In [61]:
# doc.paragraphs[0].runs[0].add_break(docx.text.WD_BREAK.PAGE)

In [62]:
doc.add_paragraph('This is on the second page!')

<docx.text.paragraph.Paragraph at 0x10a185890>

In [63]:
doc.save('twoPage.docx')

### Adding Pictures

In [64]:
doc.add_picture('src/zophie.png', 
                width=docx.shared.Inches(1), 
                height=docx.shared.Cm(4))

<docx.shape.InlineShape at 0x10a185310>

In [65]:
doc.save('twoPage_pic.docx')

## Summary

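- Text information does not live only in plaintext files.
- Working with PDF and Word documents comes up fairly often.
- Text cannot be extracted perfectly from PDFs, so be careful.
- Word documents are made up of Paragraph and Run objects.
- You can add paragraphs, headings, breaks, pictures, and more.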

## Practice Projects

### PDF Paranoia

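- Walk through every folder and encrypt each PDF with a password provided on the command line.
- Save each encrypted PDF with an '\_encrypted.pdf' suffix, then delete the original. Before deleting it, confirm that the encrypted file really can be decrypted.
- Then find every encrypted PDF and create a decrypted copy using the provided password. If the password does not match, print a message for the user and move on to the next PDF.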
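
The file-handling part of this project can be sketched with the standard library alone. `find_pdfs` and `encrypted_name` are hypothetical helper names, and matching the extension case-insensitively is an assumption. The encryption itself would follow the Encrypting PDFs cells earlier in these notes.

```python
import os


def find_pdfs(root):
    """Return the paths of all .pdf files under `root`, including subfolders."""
    found = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith('.pdf'):
                found.append(os.path.join(folder, name))
    return sorted(found)


def encrypted_name(path):
    """Insert the _encrypted suffix before the file extension."""
    base, ext = os.path.splitext(path)
    return base + '_encrypted' + ext
```

A script would loop over `find_pdfs('.')`, write each encrypted copy to `encrypted_name(path)`, and remove the original only after the copy has been opened and decrypted successfully.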

### Custom Invitations as Word Documents

- Send invitations to five people.
- Create a blank Word file and set up the styles in it first.
- Add each guest, followed by a page break added with add_break.
- Read the guest names from the text file.

In [66]:
!cat src/guests.txt

Prof. Plum
Miss Scarlet
Col. Mustard
Al Sweigart
Robocop
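
A rough sketch of the project follows. `read_guests` and `write_invitations` are hypothetical names, the invitation wording is a placeholder, and no custom styles are applied. It uses `add_page_break()`, as in the page break cells above, to start each invitation on a new page.

```python
def read_guests(lines):
    """Return the non-empty guest names from an iterable of text lines."""
    return [line.strip() for line in lines if line.strip()]


def write_invitations(guests, output_path):
    """Write one invitation per page to a .docx file."""
    import docx  # python-docx; imported here so read_guests() works without it

    doc = docx.Document()
    for index, guest in enumerate(guests):
        # Placeholder wording, not taken from the notes.
        doc.add_paragraph('It would be a pleasure to have the company of')
        doc.add_paragraph(guest)
        if index < len(guests) - 1:
            # No page break after the last invitation.
            doc.add_page_break()
    doc.save(output_path)
```

With the file shown above, the names could be loaded with `read_guests(open('src/guests.txt'))` and passed to `write_invitations()`.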

### Brute-Force PDF Password Breaker

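- Call decrypt() with each of the roughly 44,000 English words and check what value it returns.
- A return value of 1 means the PDF was decrypted; a return value of 0 means the password was wrong.
- Both the uppercase and lowercase form of each word must be tried, so that is about 44,000 * 2 = 88,000 attempts.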
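
The search loop can be sketched independently of any PDF library. `candidates` and `find_password` are hypothetical names, and `try_password` is assumed to be any callable that returns 0 for a wrong password, such as the `decrypt` method of a PdfFileReader.

```python
def candidates(words):
    """Yield the uppercase and lowercase form of every non-empty word."""
    for word in words:
        word = word.strip()
        if not word:
            continue
        yield word.upper()
        yield word.lower()


def find_password(try_password, words):
    """Return the first candidate that try_password accepts, or None."""
    for attempt in candidates(words):
        # A return value of 0 means the password was wrong.
        if try_password(attempt) != 0:
            return attempt
    return None
```

With a real file this could be called as `find_password(pdfReader.decrypt, open('src/dictionary.txt'))`, where `pdfReader` is a reader for the encrypted PDF.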

In [68]:
!head src/dictionary.txt

AARHUS
AARON
ABABA
ABACK
ABAFT
ABANDON
ABANDONED
ABANDONING
ABANDONMENT
ABANDONS


In [67]:
!wc -l src/dictionary.txt

   45332 src/dictionary.txt
