# Basic Text Handling in Python

In [51]:
person = 'Test'

In [52]:
print('name is {}'.format(person))

name is Test


In [53]:
mylist = [0,1,2]

In [54]:
print('number is {}'.format(mylist[0]))

number is 0


### Formatted String Literals (f-strings)
 * Working with f-strings (formatted string literals) to format printed text
 * Working with Files - opening, reading, writing and appending text files


<strong>f-strings</strong> offer several benefits over the older `.format()` string method. <br>For one, you can bring outside variables directly into the string rather than passing them through as positional or keyword arguments:

In [55]:
print(f'number is {mylist[0]}')

number is 0
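As a minimal side-by-side sketch of the two styles, the snippet below builds the same sentence with `.format()` and with an f-string. The variable names `with_format` and `with_fstring` are made up for this illustration.

```python
person = 'Test'
mylist = [0, 1, 2]

# .format(): values are passed in as arguments and referenced by name.
with_format = 'name is {name}, first number is {num}'.format(name=person, num=mylist[0])

# f-string: expressions are evaluated in place, inside the braces.
with_fstring = f'name is {person}, first number is {mylist[0]}'

print(with_format)
print(with_fstring)
print(f'sum of the list is {sum(mylist)}')  # any expression is allowed in the braces
```

Both strings come out identical; the f-string simply keeps the value next to the place where it is used.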


In [56]:
library = [('Author', 'Topic', 'Pages'), ('Twain', 'Rafting', 601), ('Feynman', 'Physics', 95), ('Hamilton', 'Mythology', 144)]

In [57]:
library

[('Author', 'Topic', 'Pages'),
 ('Twain', 'Rafting', 601),
 ('Feynman', 'Physics', 95),
 ('Hamilton', 'Mythology', 144)]

### Minimum Widths, Alignment and Padding
You can pass arguments inside a nested set of curly braces to set a minimum width for the field, the alignment and even padding characters.

In [58]:
for author,topic,pages in library:
    print(f'{author:{10}} {topic:{30}} {pages:.>{10}}')

Author     Topic                          .....Pages
Twain      Rafting                        .......601
Feynman    Physics                        ........95
Hamilton   Mythology                      .......144
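To make the individual pieces of the format specifier easier to see, here is a small sketch that applies each alignment option to one sample word. The names `word` and `width` are chosen only for this example.

```python
word = 'Twain'

left = f'{word:<10}|'     # left-align in a field of width 10 (the default for strings)
right = f'{word:>10}|'    # right-align
center = f'{word:^9}|'    # center in a field of width 9
padded = f'{word:*^11}|'  # center, padding with '*' instead of spaces

# Nested braces are evaluated first, so the width can come from a variable.
width = 10
dynamic = f'{word:>{width}}|'

for line in (left, right, center, padded, dynamic):
    print(line)
```

The general shape is `{value:fill align width}`, where the fill character and alignment are optional.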


In [59]:
from datetime import datetime

In [60]:
today = datetime(year=2019,month=2,day=28)

In [61]:
print(f'({today:%B %d, %Y})')

(February 28, 2019)


In [62]:
today

datetime.datetime(2019, 2, 28, 0, 0)
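The part after the colon in a date f-string uses the same codes as `strftime()`. A short sketch with numeric codes only (so the result does not depend on locale); the variable names are illustrative:

```python
from datetime import datetime

today = datetime(year=2019, month=2, day=28)

iso_style = f'{today:%Y-%m-%d}'             # 4-digit year, 2-digit month and day
via_strftime = today.strftime('%Y-%m-%d')   # the same codes work with strftime()
with_time = f'{today:%H:%M}'                # no time was given, so it is midnight

print(iso_style, via_strftime, with_time)
```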

# Files

Python uses file objects to interact with external files on your computer. These file objects can be any sort of file you have on your computer, whether it be an audio file, a text file, emails, Excel documents, etc. Note: You will probably need to install certain libraries or modules to interact with those various file types, but they are easily available. (We will cover downloading modules later on in the course).

Python has a built-in open function that allows us to open and play with basic file types. First we will need a file though. We're going to use some IPython magic to create a text file!

## Creating a File with IPython

In [63]:
%%writefile test.txt
Hello, This is a quick test file.
2nd line of the file.

Overwriting test.txt


In [64]:
myfile = open('test.txt')

In [65]:
myfile

<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'>

In [66]:
myfile.read()

'Hello, This is a quick test file.\n2nd line of the file.\n'

In [67]:
myfile.read()

''

In [68]:
myfile.seek(0)

0

In [69]:
content = myfile.read()

In [70]:
print(content)

Hello, This is a quick test file.
2nd line of the file.



In [71]:
myfile.close()
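The empty string from the second `read()` comes from the file cursor sitting at the end of the file. The sketch below reproduces that behaviour on a throwaway file (the file name `cursor_demo.txt` and the temporary directory are assumptions for this example, so `test.txt` is left untouched).

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'cursor_demo.txt')  # hypothetical file
with open(path, 'w') as f:
    f.write('Hello\nWorld\n')

f = open(path)
first = f.read()    # reads everything; the cursor is now at the end
second = f.read()   # nothing left to read, so this is an empty string
f.seek(0)           # move the cursor back to the start
third = f.read()    # the full content is available again
f.close()

print(repr(first), repr(second), repr(third))
```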

In [72]:
myfile = open('test.txt')

### .readlines()
You can read a file line by line using the readlines method. Use caution with large files, since everything will be held in memory. We will learn how to iterate over large files later in the course.

In [73]:
mylines = myfile.readlines()

In [74]:
for line in mylines:
    print(line.split()[0])

Hello,
2nd
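For large files, one common alternative to `readlines()` is to loop over the file object itself, which yields one line at a time. A minimal sketch, using a temporary file with a made-up name so it is self-contained:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'lines_demo.txt')  # hypothetical file
with open(path, 'w') as f:
    f.write('Hello, This is a quick test file.\n2nd line of the file.\n')

first_words = []
with open(path) as f:
    for line in f:  # one line at a time, never the whole file at once
        first_words.append(line.split()[0])

print(first_words)
```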


## Writing to a File

By default, the `open()` function only allows us to read the file. To write, we pass a mode argument: `'w'` opens the file for writing and overwrites any existing content, while `'w+'` does the same but also allows reading. For example:

In [75]:
myfile = open('test.txt','w+')

In [76]:
myfile.read()

''

In [77]:
myfile.write('A new text')

10

In [78]:
myfile.seek(0)

0

In [79]:
myfile.read()

'A new text'
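The sketch below shows why the first `read()` after opening with `'w+'` returns an empty string: the file is emptied as soon as it is opened. The file name and temporary directory are assumptions for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'write_demo.txt')  # hypothetical file

with open(path, 'w') as f:
    f.write('original content')

f = open(path, 'w+')              # 'w' and 'w+' empty the file on open
before = f.read()                 # '' because the old content is already gone
written = f.write('A new text')   # write() returns the number of characters written
f.seek(0)                         # go back to the start before reading
after = f.read()
f.close()

print(repr(before), written, repr(after))
```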

## Appending to a File
Passing the argument `'a'` opens the file and puts the pointer at the end, so anything written is appended. Like `'w+'`, `'a+'` lets us read and write to a file. If the file does not exist, one will be created.

In [80]:
myfile = open('test.txt','a+')

#### Instead of raising an error, `'a+'` creates the file if it does not exist

In [81]:
myfile = open('textssss.txt','a+')

In [82]:
myfile.write('1st Line in A+')

14

In [83]:
myfile.close()

In [84]:
newfile = open('textssss.txt')

In [85]:
newfile.read()

'1st Line in A+Added lines, because using append argument\nAdded Newline1st Line in A+1st Line in A+'

In [86]:
newfile.write('try to write something with')

UnsupportedOperation: not writable

In [None]:
newfile.close()

#### The error above occurs because the file was opened in the default read-only mode, without an append argument

In [87]:
newfile = open('textssss.txt','a+')

In [88]:
newfile.write('Added lines, because using append argument')

42

In [89]:
newfile.seek(0)

0

In [90]:
newfile.read()

'1st Line in A+Added lines, because using append argument\nAdded Newline1st Line in A+1st Line in A+Added lines, because using append argument'

In [91]:
newfile.write('\nAdded Newline')

14

In [92]:
newfile.seek(0)

0

In [93]:
print(newfile.read())

1st Line in A+Added lines, because using append argument
Added Newline1st Line in A+1st Line in A+Added lines, because using append argument
Added Newline


In [94]:
newfile.close()
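The append behaviour and the read-only error can be reproduced in one small, self-contained sketch. The file name is made up, and catching `io.UnsupportedOperation` is just one way to handle the error.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'append_demo.txt')  # hypothetical file

with open(path, 'a+') as f:   # the file does not exist yet, so it is created
    f.write('first')

with open(path, 'a+') as f:   # existing content is kept; writes go to the end
    f.write(' second')
    f.seek(0)
    content = f.read()

error_message = None
with open(path) as f:         # default mode 'r' is read-only
    try:
        f.write('nope')
    except io.UnsupportedOperation as exc:
        error_message = str(exc)

print(content)
print(error_message)
```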

#### Automatic closing with the `with open()` statement

In [95]:
with open('textssss.txt','r') as mynewfile:
    myvars = mynewfile.readlines()

In [96]:
myvars

['1st Line in A+Added lines, because using append argument\n',
 'Added Newline1st Line in A+1st Line in A+Added lines, because using append argument\n',
 'Added Newline']
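A quick way to confirm that the `with` block closes the file is to check the `closed` attribute inside and outside the block. This is a small sketch on a temporary file with a hypothetical name.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')  # hypothetical file

with open(path, 'w') as f:
    f.write('line one\nline two\n')
    inside = f.closed    # still open inside the block
outside = f.closed       # closed automatically once the block ends

with open(path) as f:
    lines = [line.rstrip('\n') for line in f]  # drop the trailing newlines

print(inside, outside, lines)
```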

#### Working with PDF

PyPDF2 can only read the text from a PDF document; it won't be able to grab images or other media files from a PDF.

In [97]:
import PyPDF2

In [98]:
myfile = open('US_Declaration.pdf',mode='rb')

In [99]:
pdf_reader = PyPDF2.PdfFileReader(myfile)

In [100]:
pdf_reader.numPages

5

In [101]:
page_one = pdf_reader.getPage(0)

In [102]:
print(page_one.extractText())

Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the
political bands which have connected them with another, and to assume among the powers of the
earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle

them, a decent respect to the opinions of mankind requires that they should declare the causes

which impel them to the separation. 
We hold these truths to be self-evident, that all men are created equal, that they are endowed by

their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit
of Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving

their just powers from the consent of the governed,ŠThat whenever any Form of Government
becomes destructive of these ends, it is the Right of the People to alter or to abolish it,

In [103]:
myfile.close()

### Adding a Page to a New PDF

In [104]:
f = open('US_Declaration.pdf','rb')

In [105]:
pdf_reader = PyPDF2.PdfFileReader(f)

In [106]:
first_page = pdf_reader.getPage(0)

In [107]:
pdf_writer = PyPDF2.PdfFileWriter()

In [108]:
pdf_writer.addPage(first_page)

In [109]:
pdf_output = open('MyNewPDF.pdf','wb')

In [110]:
pdf_writer.write(pdf_output)

In [111]:
pdf_output.close()

In [112]:
f.close()

### Opening the New PDF

In [113]:
brand_new = open('MyNewPDF.pdf','rb')

In [114]:
pdf_reader = PyPDF2.PdfFileReader(brand_new)

In [115]:
pdf_reader.numPages

1

In [116]:
f = open('US_Declaration.pdf','rb')

In [117]:
pdf_text = [0]  # placeholder at index 0, so that page n is stored at pdf_text[n]
pdf_reader = PyPDF2.PdfFileReader(f)

for p in range(pdf_reader.numPages):
    page = pdf_reader.getPage(p)
    pdf_text.append(page.extractText())

In [118]:
f.close()

In [119]:
len(pdf_text)

6

In [120]:
for page in pdf_text:
    print(page)
    print('\n')
    print('\n')
    print('\n')

0






Declaration of IndependenceIN CONGRESS, July 4, 1776. The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the
political bands which have connected them with another, and to assume among the powers of the
earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle

them, a decent respect to the opinions of mankind requires that they should declare the causes

which impel them to the separation. 
We hold these truths to be self-evident, that all men are created equal, that they are endowed by

their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit
of Happiness.ŠThat to secure these rights, Governments are instituted among Men, deriving

their just powers from the consent of the governed,ŠThat whenever any Form of Government
becomes destructive of these ends, it is the Right of the People to alter or to abo

### Searching for Basic Patterns

Let's imagine that we have the following string:

In [121]:
listword = 'a b c'
listword.split()

['a', 'b', 'c']

In [122]:
'a' in listword

True

### Regular Expressions

Regular Expressions (sometimes called regex for short) allow a user to search for strings using almost any sort of rule they can come up with. For example, finding all capital letters in a string, or finding a phone number in a document.

In [123]:
import re

In [124]:
pattern = 'phone'

In [125]:
text = 'The phone number of the agent is 408-555-1234. Call soon!'

In [126]:
my_match = re.search(pattern,text)

In [127]:
my_match.span()

(4, 9)

In [128]:
my_match.start()

4

In [129]:
my_match.end()

9

#### The match spans indices 4 to 9 of the string (start = 4, end = 9, where the end index is exclusive)
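Because the end index is exclusive, the span can be used directly as a slice. A minimal sketch:

```python
import re

text = 'The phone number of the agent is 408-555-1234. Call soon!'
m = re.search('phone', text)

start, end = m.span()
sliced = text[start:end]   # slicing with the span gives back the matched text

print(start, end, sliced, m.group())
```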

In [130]:
text = 'my phone is new phone'

In [131]:
match = re.search(pattern,text)

In [132]:
match.span()

(3, 8)

#### `re.search()` returns only the first match, not all of them. To get every match, use the `findall` or `finditer` method.

In [133]:
all_match = re.findall('phone',text)

In [134]:
all_match

['phone', 'phone']

In [135]:
len(all_match)

2

In [136]:
for match in re.finditer('phone',text):
    print(match.span())

(3, 8)
(16, 21)
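One detail worth guarding against: `re.search()` returns `None` when nothing matches, so calling `.span()` on the result would fail. A short sketch (the pattern `'tablet'` is just an example of something absent from the text):

```python
import re

text = 'my phone is new phone'

no_match = re.search('tablet', text)   # this word does not occur in the text
if no_match is None:
    print('no match found')

# finditer yields match objects, so positions can be collected in one pass.
spans = [m.span() for m in re.finditer('phone', text)]
print(spans)
```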


### Patterns

So far we've learned how to search for a basic string. What about more complex examples, such as trying to find a telephone number or an email address in a large string of text?

We could just use the search method if we knew the exact phone number or email address, but what if we don't? We may know the general format, and we can use that along with regular expressions to search the document for strings that match a particular pattern.

### Identifiers for Characters in Patterns

Character types such as a digit or a word character have different codes that represent them. You can use these to build up a pattern string. Notice how these make heavy use of the backslash `\`. Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the `r` in front of the string tells Python to treat it as a raw string, so the `\` characters in the pattern are not interpreted as escape sequences.
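The difference is easy to see with `\b`, which is a backspace character in a normal string but a word boundary in a regex. A small sketch (variable names are illustrative):

```python
import re

plain = '\bcat\b'    # here '\b' becomes the backspace character
raw = r'\bcat\b'     # here '\b' stays as backslash + 'b', the regex word boundary

print(len(plain), len(raw))

plain_hits = re.findall(plain, 'a cat sat')   # looks for backspace characters
raw_hits = re.findall(raw, 'a cat sat')       # looks for the whole word 'cat'
print(plain_hits, raw_hits)
```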

Below you can find a table of the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
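The pattern and match pairs from the table above can be checked with `re.fullmatch()`, which succeeds only if the whole string fits the pattern. A minimal sketch:

```python
import re

# (pattern, example match) pairs taken from the table above
examples = [
    (r'file_\d\d', 'file_25'),
    (r'\w-\w\w\w', 'A-b_1'),
    (r'a\sb\sc', 'a b c'),
    (r'\D\D\D', 'ABC'),
    (r'\W\W\W\W\W', '*-+=)'),
    (r'\S\S\S\S', 'Yoyo'),
]

results = {pattern: bool(re.fullmatch(pattern, s)) for pattern, s in examples}
for pattern, ok in results.items():
    print(pattern, ok)
```

Note that `\w` also matches the underscore, which is why `A-b_1` fits `\w-\w\w\w`.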

### Defining a Pattern

For example, suppose we want to find a phone number within a string. Using the identifiers from the table, we can build a pattern that matches any phone number with the same digit counts and dash separators.

In [137]:
text = 'The phone number of the agent is 408-555-1234. Call soon!'

In [138]:
pattern = r'\d\d\d-\d\d\d-\d\d\d\d'

In [139]:
phone_number = re.search(pattern,text)

In [140]:
phone_number

<_sre.SRE_Match object; span=(33, 45), match='408-555-1234'>

In [141]:
phone_number.group()

'408-555-1234'

Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. Let's explore the possible quantifiers.

### Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>	Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>

In [142]:
pattern = r'\d{3}-\d{3}-\d{4}'

In [143]:
phone_number = re.search(pattern,text)

In [144]:
phone_number.group()

'408-555-1234'
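As a quick check of the quantifier table, the sketch below runs each pattern against its example match, then shows that quantifiers are greedy by default (they take as many characters as the limits allow).

```python
import re

# (pattern, example match) pairs taken from the quantifier table
examples = [
    (r'Version \w-\w+', 'Version A-b1_1'),
    (r'\D{3}', 'abc'),
    (r'\d{2,4}', '123'),
    (r'\w{3,}', 'anycharacters'),
    (r'A*B*C*', 'AAACC'),
    (r'plurals?', 'plural'),
]
results = [bool(re.fullmatch(pattern, s)) for pattern, s in examples]
print(results)

# Greedy matching: {2,4} takes four digits when four are available.
greedy = re.search(r'\d{2,4}', '123456').group()
print(greedy)
```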

### Groups

What if we wanted to do two tasks: find phone numbers, but also be able to quickly extract their area code (the first three digits)? We can use groups for any general task that involves grouping together regular expressions.

In [145]:
phone_pattern = r'(\d{3})-(\d{3})-(\d{4})'

In [146]:
phone_number = re.search(phone_pattern,text)

In [147]:
phone_number.group()

'408-555-1234'

In [148]:
phone_number.group(1)

'408'

In [149]:
phone_number.group(2)

'555'

In [150]:
phone_number.group(3)

'1234'
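Two related conveniences are `groups()`, which returns all groups at once, and named groups, which replace the numeric indices with labels. In the sketch below the group names `area`, `exchange` and `line` are made up for illustration.

```python
import re

text = 'The phone number of the agent is 408-555-1234. Call soon!'

m = re.search(r'(\d{3})-(\d{3})-(\d{4})', text)
all_groups = m.groups()        # every group as a tuple

# (?P<name>...) gives a group a name in addition to its number.
named = re.search(r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})', text)
area_code = named.group('area')
as_dict = named.groupdict()

print(all_groups, area_code, as_dict)
```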

### Searching with the Or Operator `|`

In [151]:
re.search(r'man|woman','This woman was here')

<_sre.SRE_Match object; span=(5, 10), match='woman'>
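The alternatives are tried from left to right at each position, which matters when one alternative is a prefix of another. A small sketch with made-up sample sentences:

```python
import re

both = re.findall(r'man|woman', 'This woman met a man')
print(both)

# When alternatives share a prefix, the first one that matches wins.
short_first = re.search(r'cat|catalog', 'catalog').group()
long_first = re.search(r'catalog|cat', 'catalog').group()
print(short_first, long_first)
```

Putting the longer alternative first is one simple way to prefer the longer match.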

### Wildcard character

The wildcard character `.` is a placeholder that matches any single character at that position in the string (except a newline, by default).

In [152]:
re.findall(r'.at','The cat in the hat sat')

['cat', 'hat', 'sat']
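Since the wildcard matches any character, it also matches inside longer words, and a literal dot has to be escaped. The sketch below illustrates both points with sample strings chosen for this example.

```python
import re

sentence = 'The cat in the hat sat flat'

loose = re.findall(r'.at', sentence)        # also picks up 'lat' inside 'flat'
whole = re.findall(r'\b.at\b', sentence)    # \b restricts to whole 3-letter words
print(loose, whole)

# An unescaped dot matches anything; \. matches only a literal dot.
any_char = re.findall(r'\d.\d', 'v1.5 and 1x5')
literal_dot = re.findall(r'\d\.\d', 'v1.5 and 1x5')
print(any_char, literal_dot)
```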

#### `$` anchors a match to the end of the string, while `^` anchors it to the beginning (here, to find a digit at either end)

In [153]:
re.findall(r'\d$','A number was 1234')

['4']

In [154]:
re.findall(r'^\d','1234 is the number')

['1']
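Anchors combine naturally with quantifiers, and they produce no match when the anchored position holds something else. A minimal sketch:

```python
import re

ends_with = re.findall(r'\d+$', 'A number was 1234')      # all trailing digits
starts_with = re.findall(r'^\d+', '1234 is the number')   # all leading digits
no_hit = re.findall(r'\d$', '1234 is the number')         # string ends with a letter

print(ends_with, starts_with, no_hit)
```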

#### Exclusion

To match any character except a list of excluded characters, put the excluded characters between `[^` and `]`. The caret `^` must immediately follow the `[`, or else it stands for just itself.

<table ><tr><th>Regex</th><th>Description</th></tr>

<tr ><td><span >[^\d]</span></td><td>Any character except a digit</td></tr>

<tr ><td><span >[^a]</span></td><td>Any character except a</td></tr>

<tr ><td><span >[^a-z]</span></td><td>Any character except a lowercase letter</td></tr></table>

In [155]:
re.findall(r'[^\d]+','There are 2 numbers 35 in this 9 sentence.')

['There are ', ' numbers ', ' in this ', ' sentence.']

#### Combining Exclusion with Quantifiers

The square brackets list the punctuation characters to exclude, and the `+` means one or more occurrences of any other character.

In [156]:
mylist = re.findall(r'[^!?.,]+','This is a String!! but it has punctuation, How to remove it??.')

In [157]:
''.join(mylist)

'This is a String but it has punctuation How to remove it'
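An alternative way to get the same result is `re.sub()`, which replaces every match of a pattern. Here the pattern matches the punctuation itself rather than everything around it. This is a sketch of one option, not the only way to do it.

```python
import re

sentence = 'This is a String!! but it has punctuation, How to remove it??.'

# Keep everything that is not punctuation, then glue the pieces together.
joined = ''.join(re.findall(r'[^!?.,]+', sentence))

# Or delete the punctuation directly by replacing it with an empty string.
substituted = re.sub(r'[!?.,]+', '', sentence)

print(joined)
print(substituted)
```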

In [159]:
text = 'Only find the hyphen-word. Were are the long-ish dash words?'

This regex looks for one or more word characters, followed by a hyphen, followed by one or more word characters.

In [162]:
re.findall(r'[\w]+-[\w]+',text)

['hyphen-word', 'long-ish']