# Manipulating Text

- Text manipulation warrants its own lecture for two reasons:
    - There are lots of different operations you can do on text
    - When doing web-scraping, image analysis, or machine learning, much of the time you'll get a block of text to clean

## Strings and Manipulations

- Strings are a class!
- They come with a set of operations (methods) that you can use on them
- Before going into string data, it's useful to know what you can do to the string itself

In [1]:
type('hello')

str

In [2]:
# All ways of defining strings

my_string = "Hello World"
print(my_string)
my_string = 'Hello World'
print(my_string)
my_string = """Hello World"""
print(my_string)

Hello World
Hello World
Hello World


In [3]:
# Different quote styles let you put other quotation marks inside the string
my_string = 'Hello " World'
print(my_string)
my_string = "Hello ' World"
print(my_string)
my_string = """Hello "'
World"""

print(my_string)

Hello " World
Hello ' World
Hello "'
World


In [4]:
# What happens when you don't do that

"Hello " World"

SyntaxError: invalid syntax (2743816767.py, line 3)

In [5]:
"Hello \" World"

'Hello " World'

## F-strings and Raw Strings

- These are ways to tell Python that you want the string to be understood differently by the interpreter

In [6]:
x = 1
f"Hello {x} World"

'Hello 1 World'

### Raw Strings

- Raw strings tell Python not to interpret escape sequences in your text, like newlines (`\n`) or tabs (`\t`). The backslashes are kept in the string as-is.

In [7]:
my_string = r"Hello \" World"
print(my_string)
my_string = "Hello\n\tWorld"
print(my_string)
my_string = r"Hello\n\tWorld"
print(my_string)

Hello \" World
Hello
	World
Hello\n\tWorld


### F-strings

- F-strings let you embed a variable, or any expression involving it, directly in the text; Python evaluates it and inserts the result.

In [8]:
var = 'World'

my_string = f"Hello {var}"
print(my_string)

x = 1

my_string = f"Hello World {x + 1}"
print(my_string)

# Also allows you to change formatting

y = 1098907976097698.234354565645454634543

print(f"this is a number: {y}")
print(f"this is a number: {y:0,.3f}")

Hello World
Hello World 2
this is a number: 1098907976097698.2
this is a number: 1,098,907,976,097,698.250


- We can also use them in functions (or lambda functions) as a nice way to print dynamic output

In [9]:
def print_progress(i):
    print(f"{i} is done!")

for i in range(10):
    print_progress(i)

0 is done!
1 is done!
2 is done!
3 is done!
4 is done!
5 is done!
6 is done!
7 is done!
8 is done!
9 is done!


## Strings as a list 

- Strings are iterable in Python
- Meaning that when you make one into a list or loop over it, you get its characters one at a time

In [10]:
my_string = "Hello World"

print(list(my_string))

for i in my_string:
    print(i)

['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']
H
e
l
l
o
 
W
o
r
l
d


## Overloaded Multiplication and Addition

- You can use `+` and `*` with strings 



In [11]:
print("Hello" + "World")
# print("Hello"*3)
print("Hello" + " " + "World")
'\t'.join(["Hello", 'World', 'I'])

my_string = "I am the World"

my_string.split("a")

HelloWorld
Hello World


['I ', 'm the World']
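
The `*` operator is commented out in the cell above, so here is a minimal sketch of both overloaded operators side by side. The variable names are made up for illustration.

```python
# Concatenation with + builds a new string from its operands
greeting = "Hello" + " " + "World"

# Repetition with * repeats the string an integer number of times
repeated = "Hello" * 3

# A common use of *: building a divider line for printed output
divider = "-" * 20

print(greeting)
print(repeated)
print(divider)
```

Note that `+` only works between two strings; mixing in a number requires converting it first (for example with `str()` or an f-string).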

- Strings are accessed by index, so you can easily reverse a string if needed:

In [12]:
my_string[-1]

'd'

In [13]:
my_string[::-1]

'dlroW eht ma I'

## Replacing, Containing, and Finding

- The basic python string has some helpful utilities for manipulating it.


In [14]:
my_string = " Hello World 23 "

print(my_string.replace("World", "Me"))

print('23' in my_string)

# my_string.__contains__("23")

# # find the last index where a substring occurs (`find` gives the first)
print(my_string.rfind("23"))

print(my_string[13])
print(my_string.strip())

# # IMPORTANT: lowercase, uppercase, titlecase
# # Particularly important for string processing

print(my_string.lower())
print(my_string.upper())
print(my_string.title())

# # is every character a letter or a digit? (False here because of the spaces)
print(my_string.isalnum())

# my_string[13]

 Hello Me 23 
True
13
2
Hello World 23
 hello world 23 
 HELLO WORLD 23 
 Hello World 23 
False
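
As a small sketch of two distinctions that are easy to miss in the cell above: `find` and `rfind` search from opposite ends, and `isalnum` is a different check from `isdigit`. The string is the same one used above; the other variable names are made up for illustration.

```python
my_string = " Hello World 23 "

# find returns the lowest index of the substring, rfind the highest
first_l = my_string.find("l")
last_l = my_string.rfind("l")

# Both return -1 (rather than raising an error) when the substring is absent
missing = my_string.find("xyz")

# isalnum: are ALL characters letters or digits? Spaces make this False
alnum_raw = my_string.isalnum()
alnum_word = "Hello23".isalnum()

# isdigit is closer to "can this be cast as a whole number?"
number_text = "23"
if number_text.isdigit():
    number = int(number_text)

print(first_l, last_l, missing)
print(alnum_raw, alnum_word, number)
```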


## Joining and Splitting

- You can split strings by some delimiter and then join them by another!

In [15]:
my_string.split(' ')

','.join(my_string)

' ,H,e,l,l,o, ,W,o,r,l,d, ,2,3, '
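
Note that the cell above joins the characters of the string itself, not the result of `split`. A minimal sketch of the split-then-join round trip, assuming the same `my_string` as before (the other names are made up for illustration):

```python
my_string = " Hello World 23 "

# With no argument, split breaks on runs of whitespace and drops empty pieces
words = my_string.split()

# With an explicit delimiter, empty strings at the edges are kept
pieces = my_string.split(" ")

# join takes any iterable of strings and glues it together with the delimiter
joined = ",".join(words)

print(words)
print(pieces)
print(joined)
```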

## For everything else, there's regex

- Regex is an EXTREMELY powerful way to do string search
- I'm going to ATTEMPT to teach you the beginnings of it, but it's just the tip of the iceberg.
- Note that Python string methods do not use regex: `replace` and `find` just look for the exact substring
    - We use the built-in library `re` for regex. No need to install anything


In [16]:
import re

- Let's start with a block of text and try to parse out the email address and phone number
- and then further, the sender and hostname, and the area code and phone number

In [17]:
long_string = "For more details, please contact, am2497@cornell.edu or call us at (718)-555-9987"


- We could use `find` to know where the email *might* occur, but we can't extract it that way and won't be able to do any fine-tuning
- I'm going to give you the finished regex string and then we'll break it down

In [19]:
find_email_phone = re.compile(r".*?(\S+)@(\S+).*?([\(\)\-\d]+)")
# type(find_email_phone)
# NOTE: raw string

# find_email_phone.findall(long_string)  # flags belong in re.compile, not in the compiled pattern's findall

# re.findall(r".*?(\S+)@(\S+).*?([\(\)\-\d]+)", long_string)

- Let's break this down and see what's going on:
- First, note that parentheses matter here because they give us the ability to extract pieces from the string:
    - This is called a *capture group*: we are telling the expression that we want whatever satisfies the conditions inside the parentheses to be extracted
- A useful way to think about this conceptually is that each expression is like a conditional: we are telling Python to break the text down into a set of tokens with that particular structure and to split it up based on those conditions.
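
To make the effect of capture groups concrete, here is a small sketch of how `re.findall` changes its return value depending on how many groups the pattern has. The sample sentence is made up for illustration.

```python
import re

text = "contact am2497@cornell.edu today"

# No capture groups: findall returns the whole match
no_groups = re.findall(r"\S+@\S+", text)

# One capture group: findall returns only what the group captured
one_group = re.findall(r"(\S+)@\S+", text)

# Two capture groups: findall returns one tuple per match
two_groups = re.findall(r"(\S+)@(\S+)", text)

print(no_groups)
print(one_group)
print(two_groups)
```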

`.*?(\S+)@(\S+).*?([\(\)\-0-9]+)`

`. => ANYTHING that isn't a newline (\n)`

- `.` here matches ANY character. Since this is a one-line string, it can match basically everything

`* => 0 or more occurrences`

- In this case, `*` modifies the expression before it and says "give me 0 or more occurrences of EVERYTHING"

`? => The lazy quantifier`

- To illustrate what this is, suppose we didn't use the lazy quantifier after each occurrence of `.*`
- `.*(\S+)@(\S+).*([\(\)\-0-9]+)`
- Our output would be: `[('7', 'cornell.edu', '7')]`
- Without the `?`, `.*` is "greedy": it takes up as much text as it can, and only gives characters back when the rest of the pattern could not otherwise match.
- In this case, we don't get what we want, which is the username of the email address
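
A stripped-down sketch of greedy versus lazy matching, using a made-up sentence with two quoted words so the difference is easy to see:

```python
import re

text = 'say "hi" and "bye"'

# Greedy: .* runs to the LAST closing quote it can reach
greedy = re.findall(r'".*"', text)

# Lazy: .*? stops at the FIRST closing quote that lets the pattern match
lazy = re.findall(r'".*?"', text)

print(greedy)
print(lazy)
```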

`\S => any character that's not whitespace`

- The opposite of this is `\s`, any whitespace
- since we know that email addresses cannot include whitespace, `\S` is really useful here.

`+ => 1 or more occurrences`

- This is similar to `*` but means that we need at least 1 occurrence of this
- In this case, `*` would have worked too, but this works better for readability.
- So now we have this expression: `\S+` which means "at least one occurrence of a non-whitespace character"
- We also put this in a capture group because we want to extract it.

`@ => the @ sign for emails`

- This is a literal `@`: it tells the capture group before it to stop when it reaches `@`

`(\S+) => capture 1 or more occurrences of a non-whitespace character`

- Now do the same thing but for AFTER `@` to get the hostname

`.*? => lazy ANYTHING`

- Then a lazy ANYTHING to skip over everything between the email and the phone number

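`\( , \) and \- => escaped parentheses and dash`

- Since parentheses (`(`, `)`) and the dash (`-`) have special meaning in regex syntax, we "escape" them by putting a backslash `\` in front of them so that they are matched literally.
- Inside square brackets the parentheses would be literal even without the backslash, but escaping them is harmless.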

`[<stuff>] => character set`

- Square brackets are a way to group together different characters that you want to match interchangeably, so in this case:

```
[\(\)\-\d]
```

The square brackets denote "any parenthesis, dash, or digit".

`\d => a digit`

- Or can also be `[0-9]`
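
A short sketch of character sets in action, assuming a made-up sentence with two differently formatted phone numbers:

```python
import re

text = "Call (718)-555-9987 or 212.555.0000"

# \d and [0-9] are interchangeable here
digits_shorthand = re.findall(r"\d+", text)
digits_explicit = re.findall(r"[0-9]+", text)

# A set can mix literal characters and shorthands; ( and ) need no escaping inside [...]
phone_chars = re.findall(r"[()\-\d]+", text)

print(digits_shorthand)
print(phone_chars)
```

The second number is split into three pieces because `.` is not part of the character set.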


Many of the reserved characters are shorthand for a character set. For example, `.` is roughly equivalent to the following (plus spaces and any other non-newline character):

`` . => [a-zA-Z0-9!\@#\$%\^&\*\(\)_+\{\}\|:"<>\?\-=\[\]\\\;',\./`~] ``


So finally, when we use `findall` with our capture groups in the right places, we get our desired result.
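
For reference, running the compiled pattern from earlier on the original `long_string` can be sketched as follows. The unpacked names `user`, `host`, and `phone` are made up for illustration.

```python
import re

long_string = "For more details, please contact, am2497@cornell.edu or call us at (718)-555-9987"

find_email_phone = re.compile(r".*?(\S+)@(\S+).*?([\(\)\-\d]+)")

# Three capture groups, so findall returns a list of 3-tuples
matches = find_email_phone.findall(long_string)

# Unpack the first (and only) match into its three captured pieces
user, host, phone = matches[0]

print(matches)
print(user, host, phone)
```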


- What is this?!
- This is horrible
- How do we parse this out?

![](images/image.svg)

## But wait, there's more!

- Seldom in life will you be confronted with text where everything is easy to extract
- If you have text DATA, there's no guarantee that the email will ALWAYS come before the phone number, for instance.
- Every regular expression makes assumptions about the input text, but it's our job to make it robust enough to extract as much information as we can.

In [20]:
long_string = "For more details, please call us at (718)-555-9987 or email at am2497@cornell.edu"

find_email_phone = re.compile(r".*?(\S+)@(\S+).*?([\(\)\-\d]+)")

# NOTE: raw string

find_email_phone.findall(long_string)

[]

- In this case, simply switching the order of the email and phone number leads to an empty list: the pattern expects the phone number to come after the email
- There are two ways around this:
    1. break up the regex into two parts and extract each separately (easier to code but probably slower)
    2. Make a more robust regex to extract both regardless of their position (more complicated regex, but can be done in one operation)
    
Which one you choose will depend on your data size and your patience for regex...



### Breaking up the Regex

-  This is pretty simple, but may not be perfect...



In [21]:
email_regex = re.compile(r".*?(\S+)@(\S+).*?")
phone_regex = re.compile(r".*?([\(\)\-\d]+)")

print(email_regex.findall(long_string))
print(phone_regex.findall(long_string))

[('am2497', 'cornell.edu')]
['(718)-555-9987', '2497']


- Simply copy-pasting `phone_regex` was not enough as it also picks up numbers from the email address
- You can modify it by using the structure of the phone number:



In [34]:
phone_regex = re.compile(r".*?(\(*\d{3}\)*\-*\s*\d{3}\-*\s*\d{4}).*") 
# {3} means exactly 3 occurrences
# the stars make the parentheses and dashes optional, so you don't NEED them to capture the number
# \s* allows for a phone number that is broken up by spaces
# surrounding it with .*? and .* lets it appear anywhere in the string
phone_regex.findall(long_string)


['(718)-555-9987']

### Creating a More Robust Regex

- In order to be robust to changes of position or all-around weirdness in the text, we need a way for the regex to "search" and find the occurrence of the email or phone number.
- For this we can use the pipe (`|`) to look for *either* an email or a phone number


In [36]:
robust_regex = re.compile(r"(\S+@\S+|[\(\)\-\d]+)")
robust_regex.findall(long_string)

['(718)-555-9987', 'am2497@cornell.edu']

## Other `re` functions

- `re` isn't just about `findall`; it has many functions that share the same pattern syntax but are used for different things:

https://docs.python.org/3/library/re.html#functions
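
As a brief sketch of three of those functions (`re.search`, `re.sub`, and `re.split`), using a made-up sentence and deliberately simple patterns:

```python
import re

text = "call us at (718)-555-9987 or 212-555-0000"

# search: the first match anywhere in the string, as a match object (or None)
first = re.search(r"\d{3}-\d{4}", text)
first_text = first.group() if first else None

# sub: replace every match, here masking each digit
masked = re.sub(r"\d", "#", text)

# split: like str.split, but the delimiter is a pattern
parts = re.split(r"\s+or\s+", text)

print(first_text)
print(masked)
print(parts)
```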

## An Example Courtesy of ChatGPT

- ChatGPT was able to write some code that generates a random list of texts in which the emails and phone numbers vary.
- As long as we're on the topic, make sure to be aware of the ChatGPT policy!

In [46]:
import random

def generate_example():
    templates = [
        "You can contact me at my email address, {} or call my phone number, {}.",
        "For any questions, please reach out to us via email at {} or call our hotline at {}.",
        "To get in touch, you can email me at {} or call me at {}.",
        "Please provide your contact information, including your email ({}) and your phone number ({}).",
        "If you need assistance, feel free to call us at {} or send an email to {}."
    ]

    email_formats = [
        "user{}@example.com",
        "contact.{}@domain.net",
        "{}@email-provider.org",
        "{}123@gmail.com",
        "info_{}@company-website.com"
    ]

    phone_formats = [
        "({}) {}-{}",
        "{}-{}-{}",
        "+1 ({}) {}-{}",
        "+44 {} {} {}",
        "{}.{}.{}"
    ]

    email = random.choice(email_formats).format(random.randint(100, 999))
    phone = random.choice(phone_formats).format(
        random.randint(100, 999),
        random.randint(100, 999),
        random.randint(1000, 9999)
    )

    template = random.choice(templates)
    if random.random() < 0.5:
        sentence = template.format(email, phone)
    else:
        sentence = template.format(phone, email)

    return sentence

string_list = []
# Generate and print five random sentences
for _ in range(5):
    example = generate_example()
    print(example)
    string_list.append(example)

For any questions, please reach out to us via email at +44 828 955 1438 or call our hotline at contact.804@domain.net.
To get in touch, you can email me at 151.866.8439 or call me at user219@example.com.
You can contact me at my email address, +1 (326) 630-5168 or call my phone number, user347@example.com.
If you need assistance, feel free to call us at 564123@gmail.com or send an email to 955-164-8874.
Please provide your contact information, including your email (464@email-provider.org) and your phone number ((640) 277-4255).


## Processing Text Data with Pandas

- Now that we have our regex expression and our example data, let's see how we can process this all in one go with `pandas`
- What we can do is create a dataframe out of this data

In [47]:
import pandas as pd

In [49]:
text_series = pd.Series(
    data=string_list,
    name= 'text',
)

In [50]:
text_series

0    For any questions, please reach out to us via ...
1    To get in touch, you can email me at 151.866.8...
2    You can contact me at my email address, +1 (32...
3    If you need assistance, feel free to call us a...
4    Please provide your contact information, inclu...
Name: text, dtype: object

- Since this is a single list, we create a `Series` out of it rather than a dataframe
- We can then use `to_frame` to make it a dataframe if need be
- In pandas we can use the `str` accessor to do everything we described above, applied to every row of text at once.
- The `str` accessor mirrors many of the methods of Python's `str` and the functions in `re`

In [52]:
text_series.to_frame()

Unnamed: 0,text
0,"For any questions, please reach out to us via ..."
1,"To get in touch, you can email me at 151.866.8..."
2,"You can contact me at my email address, +1 (32..."
3,"If you need assistance, feel free to call us a..."
4,"Please provide your contact information, inclu..."


In [56]:
text_df = (
    text_series
    .to_frame()
    .assign(text_upper = lambda df: df['text'].str.upper(),
            text_lower = lambda df: df['text'].str.lower(),
            text_contains_num_2 = lambda df: df['text'].str.contains('2') # same as `in`
            )
    )

text_df

# '2' in text => True/False

Unnamed: 0,text,text_upper,text_lower,text_contains_num_2
0,"For any questions, please reach out to us via ...","FOR ANY QUESTIONS, PLEASE REACH OUT TO US VIA ...","for any questions, please reach out to us via ...",True
1,"To get in touch, you can email me at 151.866.8...","TO GET IN TOUCH, YOU CAN EMAIL ME AT 151.866.8...","to get in touch, you can email me at 151.866.8...",True
2,"You can contact me at my email address, +1 (32...","YOU CAN CONTACT ME AT MY EMAIL ADDRESS, +1 (32...","you can contact me at my email address, +1 (32...",True
3,"If you need assistance, feel free to call us a...","IF YOU NEED ASSISTANCE, FEEL FREE TO CALL US A...","if you need assistance, feel free to call us a...",True
4,"Please provide your contact information, inclu...","PLEASE PROVIDE YOUR CONTACT INFORMATION, INCLU...","please provide your contact information, inclu...",True


- For extracting, we just use `findall` as above
- For this one, though, let's inspect the data and make sure we're getting it right

In [59]:
string_list

['For any questions, please reach out to us via email at +44 828 955 1438 or call our hotline at contact.804@domain.net.',
 'To get in touch, you can email me at 151.866.8439 or call me at user219@example.com.',
 'You can contact me at my email address, +1 (326) 630-5168 or call my phone number, user347@example.com.',
 'If you need assistance, feel free to call us at 564123@gmail.com or send an email to 955-164-8874.',
 'Please provide your contact information, including your email (464@email-provider.org) and your phone number ((640) 277-4255).']

- In this case, we can see that some numbers are international or have `+1` in front of them.
- The emails have numbers, periods, or no letters at all in the username
- Let's see what happens when we run this with our existing piped regex

In [60]:


(
    text_df
    .assign(find_groups = lambda df: df['text'].str.findall(robust_regex),
            )
#     .find_groups.iloc[0]
    )

Unnamed: 0,text,text_upper,text_lower,text_contains_num_2,find_groups
0,"For any questions, please reach out to us via ...","FOR ANY QUESTIONS, PLEASE REACH OUT TO US VIA ...","for any questions, please reach out to us via ...",True,"[44, 828, 955, 1438, contact.804@domain.net.]"
1,"To get in touch, you can email me at 151.866.8...","TO GET IN TOUCH, YOU CAN EMAIL ME AT 151.866.8...","to get in touch, you can email me at 151.866.8...",True,"[151, 866, 8439, user219@example.com.]"
2,"You can contact me at my email address, +1 (32...","YOU CAN CONTACT ME AT MY EMAIL ADDRESS, +1 (32...","you can contact me at my email address, +1 (32...",True,"[1, (326), 630-5168, user347@example.com.]"
3,"If you need assistance, feel free to call us a...","IF YOU NEED ASSISTANCE, FEEL FREE TO CALL US A...","if you need assistance, feel free to call us a...",True,"[564123@gmail.com, 955-164-8874]"
4,"Please provide your contact information, inclu...","PLEASE PROVIDE YOUR CONTACT INFORMATION, INCLU...","please provide your contact information, inclu...",True,"[(464@email-provider.org), ((640), 277-4255)]"


- We can see here that some emails pick up surrounding parentheses or a trailing period, and the phone numbers are split up
- We can always do some extra processing if need be, but let's see if we can get it right in one go.
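
The "extra processing" route can be sketched with plain `str.strip`. The list below copies three of the raw matches from the output above, and the sketch assumes that a valid email never starts or ends with a parenthesis, period, or comma.

```python
# Raw matches copied from the findall output above
raw_emails = [
    "contact.804@domain.net.",
    "(464@email-provider.org)",
    "564123@gmail.com",
]

# strip removes any of the listed characters from BOTH ends, but never from the middle
cleaned = [email.strip("().,") for email in raw_emails]

print(cleaned)
```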

In [78]:
(
    text_series
    .to_frame()
    .assign(find_groups = lambda df: df['text'].str.findall(r'(\S+@\S+)|([\+\(\)\-\d\s]+)'),
            )
#     .find_groups#.iloc[0]
    )

Unnamed: 0,text,find_groups
0,"For any questions, please reach out to us via ...","[(, ), (, ), (, ), (, ), (, ), (, ), (, ..."
1,"To get in touch, you can email me at 151.866.8...","[(, ), (, ), (, ), (, ), (, ), (, ), (, ..."
2,"You can contact me at my email address, +1 (32...","[(, ), (, ), (, ), (, ), (, ), (, ), (, ..."
3,"If you need assistance, feel free to call us a...","[(, ), (, ), (, ), (, ), (, ), (, ), (, ..."
4,"Please provide your contact information, inclu...","[(, ), (, ), (, ), (, ), (, ), (, ), (, ..."


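- With two capture groups, `findall` returns a tuple per match, and because the character set now includes `\s`, every single space counts as a "phone" match, so the result is mostly noise
- This is tough... maybe we should just do each separately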

In [92]:
text_found = (
    text_series
    .to_frame()
    .assign(
        # first email-like token in each row; could be cleaned with .str.replace(r"\(|\)", '', regex=True)
        find_email = lambda df: df['text'].str.extract(r'(\S+@\S+)', expand=False),
        # first phone-like token in each row
        find_phone = lambda df: df['text'].str.extract(r"(\+*\d*[\s\-\.]+\(*\d*\)*[\s\-\.]+\d*\s*\-*\.*\d*)", expand=False),
    )
)

# When expand=True, `extract` returns a dataframe with one column per capture group
# So we set expand=False, so that it returns the capture as a Series (only the first match per row)

text_found


- But now that we have figured OUT each separately, we can do it in one go

In [90]:
text_series.str.extractall(r"(?P<email>\S+@\S+)|(?P<phone>\+*\d*[\s\-]+\(*\d*\)*[\s\-]+\d*\s*\-*\d*)")


Unnamed: 0_level_0,Unnamed: 1_level_0,email,phone
Unnamed: 0_level_1,match,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,,+44 828 955 1438
0,1,contact.804@domain.net.,
1,0,user219@example.com.,
2,0,,+1 (326) 630-5168
2,1,user347@example.com.,
3,0,564123@gmail.com,
3,1,,955-164-8874
4,0,(464@email-provider.org),
4,1,,((640) 277-4255


## Homework 1

- Due on **OCT 17**
- https://www.nytimes.com/interactive/2016/01/07/us/drug-overdose-deaths-in-the-us.html
- Recreate the maps in the figure as closely as you can using geopandas
- Extra Credit: Make it interactive using plotly or folium (will discuss later)
- Come to office hours if you're struggling!

## Exercises

1. For each text in `text_series`, output a boolean (True/False) series indicating whether the word `email` appears
2. Extract just the username from each email address (you can just use the `find_email` column in `text_found`)
3. How long is each text string in `text_series`?
    - Hint: check the `str` accessor functions in `pandas`: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.len.html