# Reading other kinds of files

## Word Documents

The best package for reading the contents of modern Word documents (i.e., files with a `.docx` extension) is `docx2txt`. It returns the full text, stripping out all formatting information.

To install (from within a notebook):
~~~~
    %conda install -c conda-forge docx2txt
~~~~

In [3]:
import docx2txt

In [4]:
text = docx2txt.process('data/pandas_wiki.docx')

In [5]:
text

'In\xa0computer programming,\xa0pandas\xa0is a\xa0software library\xa0written for the\xa0Python programming language\xa0for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and\xa0time series. It is\xa0free software\xa0released under the\xa0three-clause BSD license.\xa0The name is derived from the term "panel data", an\xa0econometrics\xa0term for data sets that include observations over multiple time periods for the same individuals.\n\n\n\nLibrary features\n\nDataFrame object for data manipulation with integrated indexing.\n\nTools for reading and writing data between in-memory data structures and different file formats.\n\nData alignment and integrated handling of missing data.\n\nReshaping and pivoting of data sets.\n\nLabel-based slicing, fancy indexing, and subsetting of large data sets.\n\nData structure column insertion and deletion.\n\nGroup by engine allowing split-apply-combine operations on data sets.\n

In [6]:
print(text)

In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.



Library features

DataFrame object for data manipulation with integrated indexing.

Tools for reading and writing data between in-memory data structures and different file formats.

Data alignment and integrated handling of missing data.

Reshaping and pivoting of data sets.

Label-based slicing, fancy indexing, and subsetting of large data sets.

Data structure column insertion and deletion.

Group by engine allowing split-apply-combine operations on data sets.

Data set merging and joining.

Hierarchical axis indexing to 
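
The raw string returned by `docx2txt` still contains non-breaking spaces (`\xa0`) and runs of blank lines, as the output above shows. A minimal sketch of one way to tidy it into a list of paragraphs follows. The helper name `split_paragraphs` is made up for this example and is not part of `docx2txt`, and the sample string is a shortened version of the text above.

```python
def split_paragraphs(text):
    '''Split raw document text into a list of clean paragraphs.'''
    # Replace non-breaking spaces with ordinary spaces
    text = text.replace('\xa0', ' ')
    # Split on newlines, then drop blank lines and surrounding whitespace
    return [line.strip() for line in text.split('\n') if line.strip()]


# Shortened sample of the kind of string docx2txt.process() returns
sample = ('In\xa0computer programming,\xa0pandas\xa0is a\xa0software library.'
          '\n\n\n\nLibrary features\n\nReshaping and pivoting of data sets.\n')
paragraphs = split_paragraphs(sample)
print(paragraphs)
```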

## PDF Documents

The best package for reading the contents of PDF documents is `PyPDF2`. It can read and write PDF documents. It can extract text, but only when the text is stored in the document itself, as is usually the case when a PDF is created on a computer from the original document. It cannot extract text from scanned documents, because there the text is stored as an image.

To install (from within a notebook):
~~~~
    %conda install -c conda-forge pypdf2
~~~~

In [7]:
import PyPDF2

Extracting the text of a PDF is more complicated than for a Word document. After the file is opened, it needs to be parsed, and then the text is extracted from each page.

In [9]:
pdfFileObj = open('data/l09r01.pdf', 'rb')

In [10]:
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

In [11]:
print(pdfReader.numPages)

32


In [15]:
first_page_text = pdfReader.getPage(0).extractText()

print(first_page_text[:100])

 
GE.15
-
21932(E)
 
*1521932*
 
 
 
Conference of the Parties
 
Twenty
-
first session
 
Paris, 30 


If you want all of the text from a particular PDF, you can iterate over each page. In this case, I add a line break (`\n`) between pages.

In [19]:
full_text = ''

for page_number in range(0, pdfReader.numPages):
    new_page_text = pdfReader.getPage(page_number).extractText()
    full_text = full_text +  '\n' + new_page_text
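
Repeated string concatenation works, but collecting the page texts in a list and joining them once is a common alternative. The sketch below uses a plain list of strings (`page_texts`, invented here) in place of real PDF pages so that it runs without a PDF file. Note one small difference: `'\n'.join(...)` puts a line break between pages only, whereas the loop above also puts one before the first page.

```python
# Stand-in for the text of each page; in the notebook these would come
# from pdfReader.getPage(n).extractText()
page_texts = ['text of page one', 'text of page two', 'text of page three']

# Version used above: prepend a line break to every page
concatenated = ''
for page_text in page_texts:
    concatenated = concatenated + '\n' + page_text

# Alternative: a single join, with line breaks only between pages
joined = '\n'.join(page_texts)

print(repr(concatenated))
print(repr(joined))
```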

In [20]:
len(full_text)

116390

In [21]:
print(full_text[5135:6135])

equests 
Parties to provide 
notification of any such provisional 
application to the Depositary;
 

FCCC/CP/2015/L.9
/Rev.1
 
 
3
 
6.
 
Notes
 
that the work of the Ad Hoc Working Group on the Durban Platform for 
Enhanced Action, in accordance with decision 1/CP.17, paragraph 4, has been completed;
 
7.
 
Decides
 
to establish the Ad Hoc Working Group on the Paris Agreement under the 
same arrangement, mutatis mutandis, as those concerning the election of officers to the 
Bureau of the 
Ad Hoc Working Group on the Durban Platform for Enhanced Action
;
1
 
8.
 
Also
 
decides
 
that the Ad Hoc Working Group on the Paris Agreement shall prepare 
for the entry into force of the Agreement and for the convening of the first session of the 
Conference of the Parties serving as the meeting of the Parties to the Paris Agreement;
 
9.
 
Furthe
r
 
decides
 
to oversee the implementation of the work programme resulting 
from the relevant requests contained in this decision;
 
10.
 
Requests
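
As the excerpt shows, `extractText()` often breaks lines in the middle of sentences and sometimes inside words. One rough clean-up, sketched here with a hypothetical helper named `collapse_whitespace`, is to replace every run of whitespace with a single space. This is lossy: it discards paragraph breaks and cannot repair words that were split, such as `Furthe` / `r`.

```python
import re


def collapse_whitespace(text):
    '''Replace each run of whitespace (spaces, newlines) with one space.'''
    return re.sub(r'\s+', ' ', text).strip()


# Short sample copied from the extracted text above
raw = ('7.\n \nDecides\n \nto establish the Ad Hoc Working Group on the '
       'Paris Agreement under the \nsame arrangement')
print(collapse_whitespace(raw))
```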


The extraction steps could all be bundled into a function:

In [26]:
def extract_text(pdf_file_name):
    '''Extract text contents from a PDF file'''

    # Use `with` so the file is closed once every page has been read
    with open(pdf_file_name, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

        full_text = ''

        for page_number in range(0, pdfReader.numPages):
            new_page_text = pdfReader.getPage(page_number).extractText()
            full_text = full_text + '\n' + new_page_text

    return full_text
    

In [27]:
text = extract_text('data/l09r01.pdf')

In [29]:
print(text[53454:53654])

o
-
benefits of 
policies, practices and actions for enhancing mitigation ambition, as well as on options 
for 
supporting their implementation, information on which should be made available in a user
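
The method names used in this section (`PdfFileReader`, `numPages`, `getPage`, `extractText`) come from older releases of PyPDF2. In PyPDF2 3.0 and in its successor package `pypdf`, the equivalent names are `PdfReader`, `reader.pages` and `page.extract_text()`. The following is a sketch of the same function against the newer interface, assuming `pypdf` is installed. The function names `join_pages` and `extract_text_new_api` are invented here, and the import sits inside the function so the page-joining helper can be tried on its own.

```python
def join_pages(reader):
    '''Join the text of every page of a reader, one line break per page.'''
    full_text = ''
    for page in reader.pages:
        # Fall back to an empty string if a page yields no text
        full_text = full_text + '\n' + (page.extract_text() or '')
    return full_text


def extract_text_new_api(pdf_file_name):
    '''Extract text contents from a PDF file using the pypdf interface.'''
    # Assumption: the pypdf package is installed (PyPDF2 3.x exposes the
    # same PdfReader name)
    from pypdf import PdfReader

    with open(pdf_file_name, 'rb') as pdf_file:
        return join_pages(PdfReader(pdf_file))
```

Calling `extract_text_new_api('data/l09r01.pdf')` should then behave much like `extract_text` above, though the exact whitespace in the output can differ between versions.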


Alternatively, if you want to preserve the pages as individual documents, you can store one row per page in a DataFrame:

In [36]:
import pandas as pd
pd.set_option('display.max_colwidth', 200)


def extract_page(pdfReader, page_number):
    '''Extract the text of one page, reporting 1-based page numbers'''
    text = pdfReader.getPage(page_number).extractText()
    return {'page' : page_number + 1,
            'text' : text}

def extract_pages(pdf_file_name):
    '''Extract text contents from a PDF file, one row per page'''

    with open(pdf_file_name, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

        pages = []

        for page_number in range(0, pdfReader.numPages):
            # Pass the reader explicitly rather than relying on a global
            new_page_dict = extract_page(pdfReader, page_number)
            pages.append(new_page_dict)

    df = pd.DataFrame(pages)
    df['source'] = pdf_file_name

    return df
    

In [37]:
extract_pages('data/l09r01.pdf')

Unnamed: 0,page,text,source
0,1,"\nGE.15\n-\n21932(E)\n \n*1521932*\n \n \n \nConference of the Parties\n \nTwenty\n-\nfirst session\n \nParis, 30 November \nto \n11\n \nDecember 201\n5\n \nAgenda item \n4\n(\nb\n)\n \nDurban Pl...",data/l09r01.pdf
1,2,"FCCC/CP/2015/L.9\n/Rev.1\n \n2\n \n \nlocal communities, migrants, children, persons with disabilities and people in vulnerable \nsituations and the right to development, as well as gender equalit...",data/l09r01.pdf
2,3,"FCCC/CP/2015/L.9\n/Rev.1\n \n \n3\n \n6.\n \nNotes\n \nthat the work of the Ad Hoc Working Group on the Durban Platform for \nEnhanced Action, in accordance with decision 1/CP.17, paragraph 4, has...",data/l09r01.pdf
3,4,"FCCC/CP/2015/L.9\n/Rev.1\n \n4\n \n \n\n-\nindustrial levels by reducing to \na level to be identified in the special\n \nreport referred to in paragraph 21 below;\n \n18.\n \nAlso notes, in this ...",data/l09r01.pdf
4,5,"FCCC/CP/2015/L.9\n/Rev.1\n \n \n5\n \nemissions and, as appropriate, removals, and how the Party considers that its nationally \ndetermined contribution is fair and ambitious, in \nthe \nlight of ...",data/l09r01.pdf
5,6,"FCCC/CP/2015/L.9\n/Rev.1\n \n6\n \n \nand the excha\nnge of information, experiences, and best practices amongst Parties to raise \ntheir resilience to these impacts\n;\n*\n \n36.\n \nInvites\n \n...",data/l09r01.pdf
6,7,FCCC/CP/2015/L.9\n/Rev.1\n \n \n7\n \n42.\n \nRequests\n \nthe Adaptation Committee and the Least Developed Countries Expert \nGroup to jointly develop modalities to recognize the adaptation effor...,data/l09r01.pdf
7,8,"FCCC/CP/2015/L.9\n/Rev.1\n \n8\n \n \ndevelop recommendations for integrated approa\nches to avert, minimize and address \ndisplacement related to the adverse impacts of climate change;\n \n51.\n ...",data/l09r01.pdf
8,9,"FCCC/CP/2015/L.9\n/Rev.1\n \n \n9\n \n60.\n \nRecognizes\n \nthat the Adaptation Fund may serve the Agreement, subject to relevant \ndecisions by the Conference of the Parties serving as the meeti...",data/l09r01.pdf
9,10,FCCC/CP/2015/L.9\n/Rev.1\n \n10\n \n \n(c)\n \nThe assessment of technologies that are ready for transfer;\n \n(d)\n \nThe enhancement of enabling environments for and the addressing of barriers \...,data/l09r01.pdf
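
Because text can only be extracted when it is stored in the PDF, a page that yields no text is a hint that it may be a scanned image. As a rough heuristic, the sketch below flags such pages in a list of page dictionaries shaped like the ones `extract_page` returns. The function name `pages_without_text` and the sample data are invented for illustration.

```python
def pages_without_text(pages):
    '''Return the page numbers whose extracted text is empty or whitespace.'''
    return [page['page'] for page in pages if not page['text'].strip()]


# Made-up sample in the same shape as the output of extract_page
sample_pages = [
    {'page': 1, 'text': 'Conference of the Parties'},
    {'page': 2, 'text': ' \n \n'},
    {'page': 3, 'text': ''},
]
print(pages_without_text(sample_pages))
```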


## Plain text files

Plain text files can be opened and their contents stored in a string.

In [54]:
filename = 'lyrics/ani-difranco_grand-canyon.txt'

In [58]:
text = open(filename,'r').read()

In [56]:
text[:250]

'I love my country\nBy which I mean\nI am indebted joyfully\nTo all the people throughout its history\nWho have fought the government to make right\nWhere so many cunning sons and daughters\nOur fore mothers and forefathers\nCame singing through slaughter\nCa'

The above method does not scale well, because the file stays open until it is explicitly closed or garbage-collected. Every open file uses an operating-system file handle, and these can run out when you open hundreds or thousands of files, so the preferred way to open files is with a `with` block, which closes the file automatically:

In [59]:
with open(filename, 'r') as infile:
    text = infile.read()

In [60]:
text[:250]

'I love my country\nBy which I mean\nI am indebted joyfully\nTo all the people throughout its history\nWho have fought the government to make right\nWhere so many cunning sons and daughters\nOur fore mothers and forefathers\nCame singing through slaughter\nCa'
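
To see what `with` does, the short sketch below writes a throwaway file to a temporary directory (so it does not depend on the lyrics folder) and checks the file object's `closed` attribute after the block ends. Passing `encoding='utf-8'` is an assumption about how the files were saved; without it, Python falls back to a platform-dependent default.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small throwaway file to read back
    tmp_path = os.path.join(tmp_dir, 'example.txt')
    with open(tmp_path, 'w', encoding='utf-8') as outfile:
        outfile.write('I love my country\nBy which I mean\n')

    with open(tmp_path, 'r', encoding='utf-8') as infile:
        text = infile.read()
        print(infile.closed)   # still open inside the block

    print(infile.closed)       # closed automatically once the block ends

print(text[:17])
```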

## Loading multiple files

In [38]:
from glob import glob

In [44]:
glob('lyrics/*.txt')

['lyrics/the-byrds_dolphin-s-smile.txt',
 'lyrics/barclay-james-harvest_in-memory-of-the-martyrs.txt',
 'lyrics/five-man-electrical-band_you-and-i.txt',
 'lyrics/cd9_amiga.txt',
 'lyrics/carnage_rari.txt',
 'lyrics/adiva_letting-go.txt',
 'lyrics/aygun-kaza-mova_sevdi-urek.txt',
 'lyrics/die-toten-hosen_kanzler-sein.txt',
 'lyrics/esham_hoe-role.txt',
 'lyrics/delgadillo-fernando_buenas-intenciones.txt',
 'lyrics/ani-difranco_grand-canyon.txt',
 'lyrics/einstuerzende-neubauten_installation-no1.txt',
 'lyrics/britny-fox_midnight-moses.txt',
 'lyrics/emf_ballad-o-the-bishop.txt',
 'lyrics/game_cats-and-dogs.txt',
 'lyrics/dave-mason_what-do-we-got-here.txt',
 'lyrics/architects_gone-with-the-wind.txt',
 'lyrics/the-atlicic_skin.txt',
 'lyrics/amos-lee_wait-up-for-me.txt',
 'lyrics/elevation-worship_undivided.txt',
 'lyrics/damnwells_she-goes-down.txt',
 'lyrics/cassius_cassius-1999.txt',
 'lyrics/fool-s-gold_balmy.txt',
 'lyrics/chamillionaire_slow-city-don.txt',
 'lyrics/blood-on-the-da

In [62]:
filename_list = glob('lyrics/*')

In [63]:
lyrics_list = []
for filename in filename_list:
    with open(filename, 'r') as infile:
        text = infile.read()
        
    info = {'filename' : filename,
            'lyrics'   : text}
    
    lyrics_list.append(info)

In [66]:
lyrics_df = pd.DataFrame(lyrics_list)

lyrics_df.head()

Unnamed: 0,filename,lyrics
0,lyrics/the-byrds_dolphin-s-smile.txt,Out at sea for a year\nFloating free from all fear\nEvery day blowin' spray\nIn a dolphin's smile\nWind-taut line split the sky\nCurlin' crest rollin' by\nFloating free aimlessly\nIn a dolphin's s...
1,lyrics/barclay-james-harvest_in-memory-of-the-martyrs.txt,Life is like a tall ship\nDrifting gently from the shore\nTime is like a fair wind\nWith a lifetime to explore.\nThe beauty that surrounds you\nWas meant to be adored\nThe problems that surround y...
2,lyrics/five-man-electrical-band_you-and-i.txt,"CHORUS:\nWhen a butterfly flies by my window,\nAnd I can see which way the wind blows.\nOnly you and I know,\nYou and I, butterfly.\nGot a pillow under my mind yeah.\nBelieve that woman to be a co..."
3,lyrics/cd9_amiga.txt,
4,lyrics/carnage_rari.txt,"[Hook: Famous Dex]\nI just bought me a Ferrari\nI just bought me a Ferrari\nI just bought me a Ferrari\nI just pulled up in a Rari\nFucked a little bitch in a Rari\nMy diamonds they blind you, I'm..."
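
The file names in this folder appear to follow an `artist_song-title.txt` pattern, so the artist and title can be recovered from the `filename` column. The sketch below assumes that pattern holds and that the first underscore separates the two parts; `parse_lyrics_filename` is a name invented for this example.

```python
from pathlib import Path


def parse_lyrics_filename(filename):
    '''Split a path like "lyrics/artist_song-title.txt" into its parts.'''
    stem = Path(filename).stem             # file name without folder or .txt
    artist, _, song = stem.partition('_')  # split at the first underscore
    return {'artist': artist, 'song': song}


print(parse_lyrics_filename('lyrics/ani-difranco_grand-canyon.txt'))
```

With the DataFrame above, something like `lyrics_df['filename'].apply(parse_lyrics_filename)` would give one dictionary per row.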


If you have subdirectories, you can modify your glob pattern to use `**` together with `recursive=True`.

In [71]:
glob('**/*.txt', recursive=True)

['lyrics/the-byrds_dolphin-s-smile.txt',
 'lyrics/barclay-james-harvest_in-memory-of-the-martyrs.txt',
 'lyrics/five-man-electrical-band_you-and-i.txt',
 'lyrics/cd9_amiga.txt',
 'lyrics/carnage_rari.txt',
 'lyrics/adiva_letting-go.txt',
 'lyrics/aygun-kaza-mova_sevdi-urek.txt',
 'lyrics/die-toten-hosen_kanzler-sein.txt',
 'lyrics/esham_hoe-role.txt',
 'lyrics/delgadillo-fernando_buenas-intenciones.txt',
 'lyrics/ani-difranco_grand-canyon.txt',
 'lyrics/einstuerzende-neubauten_installation-no1.txt',
 'lyrics/britny-fox_midnight-moses.txt',
 'lyrics/emf_ballad-o-the-bishop.txt',
 'lyrics/game_cats-and-dogs.txt',
 'lyrics/dave-mason_what-do-we-got-here.txt',
 'lyrics/architects_gone-with-the-wind.txt',
 'lyrics/the-atlicic_skin.txt',
 'lyrics/amos-lee_wait-up-for-me.txt',
 'lyrics/elevation-worship_undivided.txt',
 'lyrics/damnwells_she-goes-down.txt',
 'lyrics/cassius_cassius-1999.txt',
 'lyrics/fool-s-gold_balmy.txt',
 'lyrics/chamillionaire_slow-city-don.txt',
 'lyrics/blood-on-the-da
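
The standard library's `pathlib` module offers an equivalent recursive search through `Path.rglob`. The sketch below builds a small temporary folder tree (the folder and file names are invented) so it can run anywhere.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    # Build a tiny tree: one file at the top level, one in a subdirectory
    (root / 'lyrics' / 'more').mkdir(parents=True)
    (root / 'lyrics' / 'a.txt').write_text('first')
    (root / 'lyrics' / 'more' / 'b.txt').write_text('second')
    (root / 'lyrics' / 'notes.md').write_text('not a text file')

    # rglob('*.txt') matches .txt files at any depth below root
    found = sorted(path.relative_to(root).as_posix()
                   for path in root.rglob('*.txt'))

print(found)
```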