# Lecture 10—Text Processing

In this lecture, we will explore reading and writing plain text files.

We will learn:

- [Processing text with regular expressions](#Process-Text-with-Regular-Expression)
- [Combining DataFrames and regular expressions](#Combining-DataFrame-and-Regex)

## Process Text with Regular Expression

We can use regular expressions to find and extract the parts of a text that match a pattern.

In [13]:
with open("sample-years.txt") as file_obj:
    lines = file_obj.read().splitlines()
    
lines

['Steven Hawking was born in 1942.',
 'Albert Einstein was born in 1879',
 'Albert Einstein won Nobel Prize in 1921.',
 'Stephen Curry wear No. 30.',
 'Stephen Curry went into NBA in 2009',
 'Stephen Curry won NBA MVP in 2015 and 2016.',
 'Micheal Jordan was born in 1963.']

The following code finds all the years in the text document.

In [14]:
import re

for line in lines:
    pattern = r'\d{4}'  # raw string, so \d reaches the regex engine unchanged
    print(re.findall(pattern, line))


['1942']
['1879']
['1921']
[]
['2009']
['2015', '2016']
['1963']
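One caveat worth checking: `\d{4}` matches any run of four digits, including the first four digits of a longer number. The sketch below uses a made-up sample string (it is not in `sample-years.txt`) to show the difference when the pattern is wrapped in `\b` word boundaries.

```python
import re

# Hypothetical sample text (not from sample-years.txt) with a six-digit number.
sample = "Order 123456 was shipped in 1999."

# Plain pattern: also picks up the first four digits of 123456.
loose = re.findall(r"\d{4}", sample)

# With word boundaries: only standalone four-digit numbers match.
strict = re.findall(r"\b\d{4}\b", sample)

print(loose)   # ['1234', '1999']
print(strict)  # ['1999']
```

For the sample sentences in this lecture both patterns give the same result, so the simpler one is used throughout.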


What if we only want the first year found on each line?

Let’s try indexing the result with `[0]` to get the first match for each line. This raises an error:

In [18]:
import re

for line in lines:
    pattern = r'\d{4}'
    print(re.findall(pattern, line)[0])


1942
1879
1921


IndexError: list index out of range

The error occurs because one line (`'Stephen Curry wear No. 30.'`) contains no four-digit number. For that line `re.findall` returns an empty list, so `[0]` has nothing to index.
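To see the cause directly, the sketch below runs the same search on the line without a year and guards the `[0]` access. The variable names are illustrative only.

```python
import re

line = "Stephen Curry wear No. 30."

matches = re.findall(r"\d{4}", line)
print(matches)  # [] because "30" has only two digits

# Indexing an empty list raises IndexError, so check before using [0].
first_year = matches[0] if matches else None
print(first_year)  # None
```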

We can guarantee at least one match by also matching the end of the line with `$`. The `$` alternative matches the empty string at the end of the line, so every result list gains an extra `''`:

In [20]:
import re

for line in lines:
    pattern = r'\d{4}|$'  # a four-digit number, or the empty string at the end of the line
    print(re.findall(pattern, line))


['1942', '']
['1879', '']
['1921', '']
['']
['2009', '']
['2015', '2016', '']
['1963', '']


The extra empty match is useful when we always take the first result: `[0]` now returns the first year if there is one, and an empty string otherwise.

In [21]:
import re

for line in lines:
    pattern = r'\d{4}|$'
    print(re.findall(pattern, line)[0])


1942
1879
1921

2009
2015
1963
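An alternative worth knowing, sketched here rather than taken from the lecture code, is `re.search`. It stops at the first match and returns `None` when nothing is found. The helper name `first_year` is hypothetical.

```python
import re

def first_year(line):
    """Return the first four-digit number in line, or '' if there is none."""
    match = re.search(r"\d{4}", line)
    return match.group(0) if match else ""

print(first_year("Stephen Curry won NBA MVP in 2015 and 2016."))  # 2015
print(repr(first_year("Stephen Curry wear No. 30.")))             # ''
```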


The following code finds all the names in the text document. The pattern looks for two adjacent words that each start with a capital letter:

In [15]:
import re

for line in lines:
    pattern = '[A-Z][a-z]* [A-Z][a-z]*'
    print(re.findall(pattern, line))


['Steven Hawking']
['Albert Einstein']
['Albert Einstein', 'Nobel Prize']
['Stephen Curry']
['Stephen Curry']
['Stephen Curry', 'A M']
['Micheal Jordan']


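The `*` quantifier allows zero lowercase letters, which is why `'A M'` (from `NBA MVP`) was matched above. Using `+` instead requires at least one lowercase letter after each capital:

In [16]: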
import re

for line in lines:
    pattern = '[A-Z][a-z]+ [A-Z][a-z]+'
    print(re.findall(pattern, line))


['Steven Hawking']
['Albert Einstein']
['Albert Einstein', 'Nobel Prize']
['Stephen Curry']
['Stephen Curry']
['Stephen Curry']
['Micheal Jordan']


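`'Nobel Prize'` still matches because it also looks like two capitalized words. Since every name in this file appears at the start of its line, anchoring the pattern with `^` keeps only that leading match:

In [17]: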
import re

for line in lines:
    pattern = '^[A-Z][a-z]+ [A-Z][a-z]+'
    print(re.findall(pattern, line))


['Steven Hawking']
['Albert Einstein']
['Albert Einstein']
['Stephen Curry']
['Stephen Curry']
['Stephen Curry']
['Micheal Jordan']
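To compare the three patterns side by side, the sketch below applies each of them to two of the sample lines. The labels `star`, `plus`, and `anchored` and the helper `compare` are made up for this illustration.

```python
import re

patterns = {
    "star": "[A-Z][a-z]* [A-Z][a-z]*",       # lowercase letters are optional
    "plus": "[A-Z][a-z]+ [A-Z][a-z]+",       # at least one lowercase letter
    "anchored": "^[A-Z][a-z]+ [A-Z][a-z]+",  # must begin at the start of the line
}

def compare(line):
    """Return the findall result of every pattern for one line."""
    return {label: re.findall(pattern, line) for label, pattern in patterns.items()}

curry = compare("Stephen Curry won NBA MVP in 2015 and 2016.")
einstein = compare("Albert Einstein won Nobel Prize in 1921.")

print(curry["star"])         # ['Stephen Curry', 'A M']
print(curry["plus"])         # ['Stephen Curry']
print(einstein["plus"])      # ['Albert Einstein', 'Nobel Prize']
print(einstein["anchored"])  # ['Albert Einstein']
```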


You can find more examples of [using regular expressions on Programiz.com](https://www.programiz.com/python-programming/regex).

## Combining DataFrame and Regex

We can combine a pandas DataFrame with regular expressions to apply the same operation to every row of a column at once.

In [3]:
import pandas as pd

df = pd.read_csv('sample-years.txt', header=None, names=['Original Text'])

df

|   | Original Text |
|---|---|
| 0 | Steven Hawking was born in 1942. |
| 1 | Albert Einstein was born in 1879 |
| 2 | Albert Einstein won Nobel Prize in 1921. |
| 3 | Stephen Curry wear No. 30. |
| 4 | Stephen Curry went into NBA in 2009 |
| 5 | Stephen Curry won NBA MVP in 2015 and 2016. |
| 6 | Micheal Jordan was born in 1963. |
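`read_csv` splits each line on commas, so loading the file this way works only because none of the sample sentences contains one. As a sketch of a comma-safe alternative, the lines can be placed into a DataFrame directly. The in-memory `lines` list below stands in for the file contents, and its second sentence is made up to include a comma.

```python
import pandas as pd

# Stand-in for the lines read from a text file; the second line is
# hypothetical and contains a comma on purpose.
lines = [
    "Steven Hawking was born in 1942.",
    "In 1921, Albert Einstein won Nobel Prize.",
]

# One row per line, regardless of any commas in the text.
df = pd.DataFrame({"Original Text": lines})
print(df.shape)  # (2, 1)
```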


Now that the text is loaded into a column, we can create new columns by applying our own transformation functions to it.

We define two functions that take a string and return the first year and the leading name found in it.

In [4]:
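import re

def find_first_year(string):
    # Return the first four-digit number, or '' when there is none.
    pattern = r'\d{4}|$'
    return re.findall(pattern, string)[0]

def find_first_name(string):
    # Return the two capitalized words that start the string, or '' when absent.
    pattern = '^[A-Z][a-z]+ [A-Z][a-z]+|$'
    return re.findall(pattern, string)[0]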

In [5]:
import re

df["Years"] = df['Original Text'].apply(find_first_year)
df["Name"] = df['Original Text'].apply(find_first_name)

df

|   | Original Text | Years | Name |
|---|---|---|---|
| 0 | Steven Hawking was born in 1942. | 1942 | Steven Hawking |
| 1 | Albert Einstein was born in 1879 | 1879 | Albert Einstein |
| 2 | Albert Einstein won Nobel Prize in 1921. | 1921 | Albert Einstein |
| 3 | Stephen Curry wear No. 30. |  | Stephen Curry |
| 4 | Stephen Curry went into NBA in 2009 | 2009 | Stephen Curry |
| 5 | Stephen Curry won NBA MVP in 2015 and 2016. | 2015 | Stephen Curry |
| 6 | Micheal Jordan was born in 1963. | 1963 | Micheal Jordan |
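pandas also offers vectorized string methods that avoid a separate helper function. The sketch below uses `Series.str.extract` on a small stand-in DataFrame. Unlike the `|$` trick, rows without a match come back as missing values (`NaN`) rather than empty strings.

```python
import pandas as pd

# Stand-in data: three of the sample sentences.
df = pd.DataFrame({"Original Text": [
    "Steven Hawking was born in 1942.",
    "Stephen Curry wear No. 30.",
    "Stephen Curry won NBA MVP in 2015 and 2016.",
]})

# The parentheses form a capture group; expand=False returns a Series.
df["Years"] = df["Original Text"].str.extract(r"(\d{4})", expand=False)
df["Name"] = df["Original Text"].str.extract(r"^([A-Z][a-z]+ [A-Z][a-z]+)", expand=False)

print(df[["Years", "Name"]])
```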


With the new columns in place, we can sort the rows by year or by name:

In [6]:
df.sort_values(by="Years")

|   | Original Text | Years | Name |
|---|---|---|---|
| 3 | Stephen Curry wear No. 30. |  | Stephen Curry |
| 1 | Albert Einstein was born in 1879 | 1879 | Albert Einstein |
| 2 | Albert Einstein won Nobel Prize in 1921. | 1921 | Albert Einstein |
| 0 | Steven Hawking was born in 1942. | 1942 | Steven Hawking |
| 6 | Micheal Jordan was born in 1963. | 1963 | Micheal Jordan |
| 4 | Stephen Curry went into NBA in 2009 | 2009 | Stephen Curry |
| 5 | Stephen Curry won NBA MVP in 2015 and 2016. | 2015 | Stephen Curry |
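Because `find_first_year` returns strings, this sort compares text, and the empty string sorts before any year. If numeric ordering is needed, one option is to convert the column first. A minimal sketch on stand-in data, where the column name `Year Number` is made up for illustration:

```python
import pandas as pd

# Stand-in data with the same kind of string values as the Years column.
df = pd.DataFrame({
    "Name": ["Stephen Curry", "Steven Hawking", "Albert Einstein"],
    "Years": ["", "1942", "1879"],
})

# errors="coerce" turns values that cannot be parsed, such as "", into NaN.
df["Year Number"] = pd.to_numeric(df["Years"], errors="coerce")

# sort_values places NaN last by default (na_position="last").
sorted_df = df.sort_values(by="Year Number")
print(sorted_df)
```

With this conversion the row without a year moves to the end instead of the beginning.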


In [7]:
df.sort_values(by="Name")

|   | Original Text | Years | Name |
|---|---|---|---|
| 1 | Albert Einstein was born in 1879 | 1879 | Albert Einstein |
| 2 | Albert Einstein won Nobel Prize in 1921. | 1921 | Albert Einstein |
| 6 | Micheal Jordan was born in 1963. | 1963 | Micheal Jordan |
| 3 | Stephen Curry wear No. 30. |  | Stephen Curry |
| 4 | Stephen Curry went into NBA in 2009 | 2009 | Stephen Curry |
| 5 | Stephen Curry won NBA MVP in 2015 and 2016. | 2015 | Stephen Curry |
| 0 | Steven Hawking was born in 1942. | 1942 | Steven Hawking |


See the pandas documentation for more on [sort_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html).