## Working with Text Files

Text files are probably the most basic type of file that you are going to encounter in your NLP endeavors. In this section, we will see how to read from a text file in Python, create a text file, and write data to the text file.

### Reading a Text File

In [0]:
myfile = open("text_file.txt")

In [2]:
myfile

<_io.TextIOWrapper name='text_file.txt' mode='r' encoding='UTF-8'>

In [3]:
myfile.read()

'Welcome to Natural Language Processing\nIt is one of the most exciting research areas as of today\nWe will see how Python can be used to work with text files.\n'

Now if you call the read method again, you will see that it returns an empty string:

In [4]:
myfile.read()

''

This is because once you call the read method, the cursor is moved to the end of the text. Therefore, when you call read again, an empty string is returned since there is no more text to read.

To read the contents again, call the seek() method with 0 as the argument after calling the read() method. This moves the cursor back to the start of the text file. Look at the following script to see how this works:

In [9]:
myfile = open("text_file.txt")
print(myfile.read())
myfile.seek(0)
print(myfile.read())

Welcome to Natural Language Processing
It is one of the most exciting research areas as of today
We will see how Python can be used to work with text files.

Welcome to Natural Language Processing
It is one of the most exciting research areas as of today
We will see how Python can be used to work with text files.



In [0]:
myfile.close()
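
The cursor behaviour described above can also be inspected with the tell() method, which reports the current position in the file. The following is a minimal sketch, not part of the original notebook: it assumes a small throwaway file (the name sample.txt and its contents are made up for illustration) created in a temporary directory so that it runs on its own.

```python
import os
import tempfile

# Hypothetical sample file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

f = open(path, encoding="utf-8")
start = f.tell()         # cursor position right after opening the file
first_read = f.read()    # reads everything and leaves the cursor at the end
second_read = f.read()   # nothing is left to read, so this is ''
f.seek(0)                # move the cursor back to the start of the file
third_read = f.read()    # the full contents again
f.close()

print(start, repr(second_read), first_read == third_read)
```

If you only need the contents once, a single read() is enough; seek(0) is useful when the same open file has to be read more than once.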

**Reading a File Line by Line**

Instead of treating the contents of the file as one long string, we can also work with the file contents line by line. To do so, we can call the readlines() method, which reads the whole file and returns each line in the text file as a list item.

In [11]:
myfile = open("text_file.txt")
print(myfile.readlines())

['Welcome to Natural Language Processing\n', 'It is one of the most exciting research areas as of today\n', 'We will see how Python can be used to work with text files.\n']


In many cases this makes the text easier to work with. For example, we can now easily iterate through each line and print the first word in the line.

In [12]:
myfile = open("text_file.txt")
for line in myfile:
    print(line.split()[0])

Welcome
It
We
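
One thing to watch for in the loop above is that split()[0] raises an IndexError on a blank line, because split() returns an empty list there. The sketch below shows one way to guard against that. The file name and contents are hypothetical and are created in a temporary directory only to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical sample file containing a blank line in the middle.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Welcome to NLP\n\nIt is exciting\n")

first_words = []
with open(path, encoding="utf-8") as f:
    for line in f:               # the file object yields one line at a time
        words = line.split()     # split() also discards the trailing '\n'
        if words:                # skip blank lines, where words == []
            first_words.append(words[0])

print(first_words)  # ['Welcome', 'It']
```

Iterating over the file object directly reads one line at a time, whereas readlines() loads every line into a list first. For small files the difference is negligible.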


### Writing to a Text File

To write to a text file, you simply have to open a file with mode set to w or w+. The former opens a file in the write mode, while the latter opens the file in both read and write mode. If the file doesn't exist, it will be created. It is important to mention that if you open a file that already contains some text with w or w+ mode, all the existing file contents will be removed, as shown below:

In [13]:
myfile = open("text_file.txt", 'w+')
print(myfile.read())




In [14]:
myfile = open("text_file.txt", 'w+')
print(myfile.read())
myfile.write("The file has been rewritten")
myfile.seek(0)
print(myfile.read())


The file has been rewritten
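
To make the truncation explicit, the sketch below checks the file size before and after opening an existing file in w+ mode. It is an illustration under assumptions: the file name and its original contents are invented, and the file lives in a temporary directory.

```python
import os
import tempfile

# Hypothetical file with some existing contents.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("original contents\n")
size_before = os.path.getsize(path)

f = open(path, "w+", encoding="utf-8")   # opening in w+ empties the file
size_after_open = os.path.getsize(path)  # 0, even before anything is written

f.write("The file has been rewritten")
f.seek(0)                                # go back to the start before reading
contents = f.read()
f.close()

print(size_before > 0, size_after_open, contents)
```

The existing contents are gone as soon as open() returns, not when write() is first called, so a w or w+ open on the wrong path is destructive even if the script fails afterwards.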


Often, you don't want to wipe out the existing contents of the file. Rather, you may need to add new content at the end of the file.

To do so, you need to open the file in a+ mode, which refers to append plus read.

In [15]:
myfile = open("text_file.txt", 'a+')
myfile.seek(0)
print(myfile.read())

Welcome to Natural Language Processing
It is one of the most exciting research areas as of today
We will see how Python can be used to work with text files.



In [16]:
myfile.write("\nThis is a new line")

19

In [17]:
myfile.seek(0)
print(myfile.read())

Welcome to Natural Language Processing
It is one of the most exciting research areas as of today
We will see how Python can be used to work with text files.

This is a new line
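
The seek(0) calls in the cells above are needed because a file opened in a+ mode starts with the cursor at the end of the file. The following sketch illustrates this with a hypothetical throwaway file; the name and contents are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical file with one existing line.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("line one\n")

with open(path, "a+", encoding="utf-8") as f:
    before_seek = f.read()    # '' because the cursor starts at the end
    f.write("line two\n")     # appended after the existing contents
    f.seek(0)                 # rewind so the whole file can be read
    contents = f.read()

print(repr(before_seek))
print(contents)
```

Unlike w+, the a+ mode keeps whatever the file already contains, which makes it the safer choice when you only want to add to a file.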


Using the with keyword, as shown below, you don't need to explicitly close the file. Rather, the script below opens the file, reads its contents, and then closes it automatically.

In [18]:
with open("text_file.txt") as myfile:
    print(myfile.read())

Welcome to Natural Language Processing
It is one of the most exciting research areas as of today
We will see how Python can be used to work with text files.

This is a new line
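
You can confirm that the file is closed by checking its closed attribute after the with block. The sketch below also shows a roughly equivalent try/finally version for comparison. The file name and contents are hypothetical and are created in a temporary directory for the example.

```python
import os
import tempfile

# Hypothetical sample file, created so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("some text\n")

with open(path, encoding="utf-8") as myfile:
    text = myfile.read()
    closed_inside = myfile.closed   # False: still open inside the block
closed_after = myfile.closed        # True: closed when the block ended

# Roughly what the with statement does for you behind the scenes.
other = open(path, encoding="utf-8")
try:
    text_again = other.read()
finally:
    other.close()                   # runs even if read() raises an exception

print(closed_inside, closed_after, text == text_again)
```

The with form is usually preferred because the file is closed even if an error occurs while reading or writing.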


## Working with PDF Files

In [19]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
Building wheels for collected packages: PyPDF2
  Building wheel for PyPDF2 (setup.py) ... done
  Created wheel for PyPDF2: filename=PyPDF2-1.26.0-cp36-none-any.whl size=61086 sha256=2ffe138f7b34da3824e5025945a95c4b84145f6cbe92a341779c87d45366862f
  Stored in directory: /root

In [0]:
import PyPDF2
mypdf = open('Lorem-Ipsum.pdf', mode='rb')

In [0]:
pdf_document = PyPDF2.PdfFileReader(mypdf)

In [23]:
pdf_document.numPages

1

In [24]:
first_page = pdf_document.getPage(0)

print(first_page.extractText())

Lorem Ipsum
is simply dummy text of the printing and typesetting 
industry. Lorem Ipsum has been the industry's standard dummy text ever 
since the 1500s, when an unknown printer took a galley of type and 
scrambled it to make a type specimen book. It has survived not only five 
centuries, but also the leap into electronic typesetting, remaining essentially 
unchanged. It was popularised in the 1960s with the release of Letraset 
sheets containing Lorem Ipsum passages, and more recently with desktop 
publishing software like Aldus PageMaker including versions of Lorem Ipsum.It is a long established fact that a reader will be distracted by the readable 
content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to 
using 'Content here, content here', making it look like readable English. 
Many desktop publishing packages and web page editors now use Lorem 
Ipsum as their default model text, and a 

### Writing to a PDF Document

It is not possible to directly write Python strings to a PDF document using the PyPDF2 library due to fonts and other constraints. However, for the sake of demonstration, we will read a page from our PDF document and then write that page to a new PDF file that we will create.

In [0]:
import PyPDF2

mypdf = open('Lorem-Ipsum.pdf', mode='rb')
pdf_document = PyPDF2.PdfFileReader(mypdf)
pdf_document.numPages

page_one = pdf_document.getPage(0)

In [0]:
pdf_document_writer = PyPDF2.PdfFileWriter()

In [0]:
pdf_document_writer.addPage(page_one)

In [0]:
pdf_output_file = open('new_pdf_file.pdf', 'wb')

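In [0]:
pdf_document_writer.write(pdf_output_file)
pdf_output_file.close()  # close the file so the written PDF is flushed to disk before it is read back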

In [0]:
import PyPDF2

mypdf = open(r'new_pdf_file.pdf', mode='rb')

pdf_document = PyPDF2.PdfFileReader(mypdf)
pdf_document.numPages
page_one = pdf_document.getPage(0)

print(page_one.extractText())

### Let's now work with a bigger PDF file

In [31]:
import PyPDF2

mypdf = open(r'lipsum.pdf', mode='rb')
pdf_document = PyPDF2.PdfFileReader(mypdf)
pdf_document.numPages

87

In [32]:
import PyPDF2

mypdf = open(r'lipsum.pdf', mode='rb')
pdf_document = PyPDF2.PdfFileReader(mypdf)

for i in range(pdf_document.numPages):
    page_to_print = pdf_document.getPage(i)
    print(page_to_print.extractText())

Streaming output truncated to the last 5000 lines.
vestibulumrisus,vitaemattismetusnequenonpede.}
1988
%{123}
1989
\NewLipsumPar{Suspendissemolliseratetrisus.Vestibulum
1990
etodioeunislmalesuadadapibus.Morbiactortoretmagna
1991
tinciduntullamcorper.Utpellentesquefermentummi.Etiamsedneque
1992
sitametleoconsectetuersagittis.Nullafacilisi.Sedlobortis
1993
eratvitaenulla.Duisbibendumipsumetmiscelerisquedapibus.
1994
Fuscenonummyvestibulumorci.Donecanisl.Integeracnibh.
1995
Pellentesquehabitantmorbitristiquesenectusetnetusetmalesuada
1996
famesacturpisegestas.Aeneannecnuncsedduilobortis
47

1997
vestibulum.Praesentmetusligula,auctorvitae,laciniased,
1998
hendrerita,felis.Etiamsapien.Proinetsemvitaedolorsodales
1999
venenatis.Integerluctusaliquamrisus.}
2000
%{124}
2001
\NewLipsumPar{Maecenasmimassa,fermentumeu,venenatis
2002
et,cursusid,ipsum.Morbivehiculajustofaucibusmauris.Donec
2003
nonneque.Fusceidmiutnequetinciduntposuere.Suspendissequis
2004
enim.Crasporttitor.Sedquisv