**Regular Expression**: a string of characters that defines a search pattern.

Example: "MM/DD/YYYY" is a string of characters that defines a pattern for entering dates.
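As a minimal sketch of how such a date format could be written as a regular expression, the pattern below matches two digits, a slash, two digits, a slash, and four digits. The sample text and variable names are illustrative assumptions, and the pattern checks only the shape of the date, not whether the month or day is valid.

```python
import re

# Hypothetical sample text; not taken from the contacts data.
text = "Joined on 03/15/2021, renewed on 3/15/22."

# Two digits, a slash, two digits, a slash, four digits (MM/DD/YYYY shape).
date_pattern = r'\d{2}/\d{2}/\d{4}'

# Only the first date has the full MM/DD/YYYY shape.
dates = re.findall(date_pattern, text)
print(dates)
```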

## Example: Convert Single Quotes to Double Quotes

Trying to convert the data string with single quotes around keys results in an error, as seen below.

In [11]:
# Import the json module.
import json
# Assign the string data to a variable. 
data = "{'contact_id': 4661, 'name': 'Cecilia Velasco', 'email': 'cecilia.velasco@rodrigues.fr'}"

# Convert the string data to a dictionary.
converted_data = json.loads(data)
# Iterate through the dictionary (row) and get the values.
row_values = [v for k, v in converted_data.items()]
print(row_values)

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

The above error occurred because JSON requires double quotes around the keys and string values. An easy solution is to replace each single quote with a double quote by adding the following code:

data = data.replace("'", '"')
print(data)

See the revised code below.

In [17]:
# Import the json module.
import json
# Assign the string data to a variable. 
data = "{'contact_id': 4661, 'name': 'Cecilia Velasco', 'email': 'cecilia.velasco@rodrigues.fr'}"
data = data.replace("'", '"')
print(data)

# Convert the string data to a dictionary.
converted_data = json.loads(data)
# Iterate through the dictionary (row) and get the values.
row_values = [v for k, v in converted_data.items()]
print(row_values)

{"contact_id": 4661, "name": "Cecilia Velasco", "email": "cecilia.velasco@rodrigues.fr"}
[4661, 'Cecilia Velasco', 'cecilia.velasco@rodrigues.fr']
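One caveat with replacing every single quote: an apostrophe inside a value (for example, a name like O'Brien) would also be replaced, and the JSON would no longer parse. As a hedged alternative sketch, a string in this form is also valid Python dictionary syntax, so the standard library's ast.literal_eval can parse it without changing any quotes. The row with an apostrophe is a made-up illustration, not part of the dataset.

```python
import ast

# Hypothetical row whose name contains an apostrophe (not from the dataset).
data = "{'contact_id': 1234, 'name': \"Ann O'Brien\", 'email': 'ann@example.com'}"

# literal_eval evaluates Python literals such as dicts, lists, strings, and numbers.
converted_data = ast.literal_eval(data)

# Get the values from the dictionary (row).
row_values = list(converted_data.values())
print(row_values)
```

This only applies when the string is a valid Python literal; data that is already valid JSON should still go through json.loads.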


## **Skill Drill** 
### Convert the following updated string data to a dictionary, and then print the value of each key:

In [3]:
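# Import the json module.
import json
# Assign the updated string data (double quotes around the keys) to a variable.
updated_data = '{"contact_id": 4661, "name": "Cecilia Velasco", "email": "cecilia.velasco@rodrigues.fr"}'
# Convert the string data to a dictionary.
dict_data = json.loads(updated_data)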

# Iterate through the dictionary (row) and get the values.
row_values = [v for k, v in dict_data.items()]
print(row_values)

[4661, 'Cecilia Velasco', 'cecilia.velasco@rodrigues.fr']


## Example: Finding Substrings Without Punctuation as a Guide

Use regular expressions and the **findall** function to extract the strings that we need. The **findall** function searches for all the strings that match a specific pattern.

The findall syntax used: re.findall(pattern, string)

Pattern used: r'(\d{4})'
- The open and close parentheses define the **capture group**, which captures (extracts) the desired substring from the variable, string_data.
- **"\d"** matches a numerical digit.
- **"{4}"** says to match a numerical digit exactly four times.
- Regular expressions use the backslash (\) to mark special sequences such as "\d", and Python string literals also use the backslash for escape sequences.
- To keep Python from interpreting those backslashes, we tell it to treat our regular expression as a raw string by placing the letter **r** before the '(\d{4})' pattern.
- This results in the **r'(\d{4})'** that we used.
- It is good practice to use a raw string every time that we create a regular expression string.
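To illustrate why the raw string matters, here is a small sketch comparing a regular string with a raw string. It uses the "\b" sequence, which Python reads as a backspace character in a regular string but which the regular expression engine reads as a word boundary when the backslash arrives intact. The example pattern is an illustrative assumption.

```python
import re

string_data = "contact_id 4661 name Cecilia Velasco email cecilia.velasco@rodrigues.fr"

# Without the r prefix, Python turns "\b" into one backspace character.
# With the r prefix, the backslash and the letter b are both kept.
print(len('\b'), len(r'\b'))

# The regular string searches for literal backspace characters, so nothing matches.
not_raw = re.findall('\bname\b', string_data)
# The raw string keeps the backslashes, so "\b" means "word boundary".
raw = re.findall(r'\bname\b', string_data)
print(not_raw, raw)
```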

In [4]:
# Import the regular expression module.
import re
# Assign the string data to a variable. 
string_data = "contact_id 4661 name Cecilia Velasco email cecilia.velasco@rodrigues.fr"
# Extract the four digit number.
contact_id = re.findall(r'(\d{4})', string_data)
print(contact_id)

['4661']
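Note that '(\d{4})' matches any run of four digits, including the first four digits of a longer number. If the IDs are always exactly four digits, one possible refinement (an assumption about the data, not something stated in these notes) is to add "\b" word boundaries around the digits. The five-digit sample string is hypothetical.

```python
import re

# Hypothetical string with a five-digit number, to show the difference.
long_id_data = "contact_id 46610 name Cecilia Velasco"

# Matches the first four digits of the longer number.
loose = re.findall(r'(\d{4})', long_id_data)
# Requires a word boundary on both sides, so the five-digit number is skipped.
strict = re.findall(r'\b(\d{4})\b', long_id_data)
print(loose, strict)
```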


## Example: Finding Substrings in Multiple Rows

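Use the Pandas **str.extract** function, which applies a regular expression to every row of a column and returns the capture group from the first match in each row.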

In [6]:
# Import the Pandas dependency.
import pandas as pd


# Read the contacts string data into a Pandas DataFrame
contacts_string_df = pd.read_csv("Resources/contacts_string_data.csv")
contacts_string_df.head()

Unnamed: 0,contact_info
0,contact_id 4661 name Cecilia Velasco email cec...
1,contact_id 3765 name Mariana Ellis email maria...
2,contact_id 4187 name Sofie Woods email sofie.w...
3,contact_id 4941 name Jeanette Iannotti email j...
4,contact_id 2199 name Samuel Sorgatz email samu...


To extract the contact_id from every row, the **str.extract** function is used with the capture group **r'(\d{4})'** as the parameter.

In [18]:
# Create new column named contact_id and extract the four-digit contact ID number.
contacts_string_df['contact_id'] = contacts_string_df['contact_info'].str.extract(r'(\d{4})')
contacts_string_df.head()

Unnamed: 0,contact_info,contact_id
0,contact_id 4661 name Cecilia Velasco email cec...,4661
1,contact_id 3765 name Mariana Ellis email maria...,3765
2,contact_id 4187 name Sofie Woods email sofie.w...,4187
3,contact_id 4941 name Jeanette Iannotti email j...,4941
4,contact_id 2199 name Samuel Sorgatz email samu...,2199
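The extracted contact_id values are strings, not numbers. A minimal sketch, assuming the IDs should be stored as integers: passing expand=False makes str.extract return a Series for a single capture group, which can then be converted with astype. The two-row DataFrame below is built in memory from rows in the sample data so that the example runs without the CSV file; sample_df and ids are illustrative names.

```python
import pandas as pd

# Small stand-in for contacts_string_df, built from two rows of the sample data.
sample_df = pd.DataFrame({
    "contact_info": [
        "contact_id 4661 name Cecilia Velasco email cecilia.velasco@rodrigues.fr",
        "contact_id 3765 name Mariana Ellis email mariana.ellis@rossi.org",
    ]
})

# expand=False returns a Series (one capture group) instead of a one-column DataFrame.
ids = sample_df["contact_info"].str.extract(r'(\d{4})', expand=False)
# Convert the extracted strings to integers.
sample_df["contact_id"] = ids.astype(int)
print(sample_df["contact_id"].tolist())
```

If some rows might not contain a four-digit number, the extraction yields missing values for them, and those would need handling before converting to integers.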


In [20]:
# Extract the first and last name after the word "name". 
name = re.findall(r'([^name\s+][A-Za-z]+\s+[A-Za-z]+)', string_data)
name

['Cecilia Velasco', 'il cecilia']

In [21]:
# Add "i" and "l" to the excluded characters so that a match cannot start inside the word "email".
name = re.findall(r'([^nameil\s+][A-Za-z]+\s+[A-Za-z]+)', string_data)
name

['Cecilia Velasco']
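The square brackets in the patterns above define a set of single characters, so [^nameil\s+] means "one character that is not n, a, m, e, i, l, whitespace, or a plus sign", not "skip the word name". As an alternative sketch (not the approach used in these notes), the literal word name can be placed outside the capture group so that only the two words following it are captured.

```python
import re

string_data = "contact_id 4661 name Cecilia Velasco email cecilia.velasco@rodrigues.fr"

# Match the literal word "name" and the whitespace after it,
# then capture the next two words as the first and last name.
name = re.findall(r'\bname\s+([A-Za-z]+\s+[A-Za-z]+)', string_data)
print(name)
```

This version does not depend on which letter the first name starts with, whereas the character-set approach would mishandle a name beginning with a lowercase n, a, m, e, i, or l.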

In [22]:
# Create new column named name and extract the name.
contacts_string_df['name'] = contacts_string_df['contact_info'].str.extract(r'([^nameil\s+][A-Za-z]+\s+[A-Za-z]+)')
contacts_string_df.head()

Unnamed: 0,contact_info,contact_id,name
0,contact_id 4661 name Cecilia Velasco email cec...,4661,Cecilia Velasco
1,contact_id 3765 name Mariana Ellis email maria...,3765,Mariana Ellis
2,contact_id 4187 name Sofie Woods email sofie.w...,4187,Sofie Woods
3,contact_id 4941 name Jeanette Iannotti email j...,4941,Jeanette Iannotti
4,contact_id 2199 name Samuel Sorgatz email samu...,2199,Samuel Sorgatz


In [23]:
# Extract the email address using a regular expression pattern. 
email_address = re.findall(r'(\S+@\S+)', string_data)
email_address

['cecilia.velasco@rodrigues.fr']
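The pattern '(\S+@\S+)' captures every non-whitespace character around the @ sign, so punctuation that directly follows an address would be captured too. A somewhat stricter sketch is shown below; the character sets are a simplifying assumption and are not a complete email validator. The sample sentence is hypothetical.

```python
import re

# Hypothetical sentence where the address is followed by a period.
text = "Please write to cecilia.velasco@rodrigues.fr."

# Any non-whitespace run around "@" also picks up the trailing period.
loose = re.findall(r'(\S+@\S+)', text)
# Local part, "@", then a domain made of dot-separated labels.
strict = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)
print(loose, strict)
```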

In [24]:
# Create a new column named email and extract the email address.
contacts_string_df['email'] = contacts_string_df['contact_info'].str.extract(r'(\S+@\S+)')
contacts_string_df.head()

Unnamed: 0,contact_info,contact_id,name,email
0,contact_id 4661 name Cecilia Velasco email cec...,4661,Cecilia Velasco,cecilia.velasco@rodrigues.fr
1,contact_id 3765 name Mariana Ellis email maria...,3765,Mariana Ellis,mariana.ellis@rossi.org
2,contact_id 4187 name Sofie Woods email sofie.w...,4187,Sofie Woods,sofie.woods@riviere.com
3,contact_id 4941 name Jeanette Iannotti email j...,4941,Jeanette Iannotti,jeanette.iannotti@yahoo.com
4,contact_id 2199 name Samuel Sorgatz email samu...,2199,Samuel Sorgatz,samuel.sorgatz@gmail.com
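As a closing sketch, the three extractions could also be done in one pass with named capture groups, which str.extract turns into column names. This assumes every row follows the "contact_id ... name ... email ..." layout seen in the sample rows; the in-memory DataFrame stands in for the CSV file, and sample_df, pattern, and extracted are illustrative names.

```python
import pandas as pd

# Small stand-in for contacts_string_df, built from two rows of the sample data.
sample_df = pd.DataFrame({
    "contact_info": [
        "contact_id 4661 name Cecilia Velasco email cecilia.velasco@rodrigues.fr",
        "contact_id 2199 name Samuel Sorgatz email samuel.sorgatz@gmail.com",
    ]
})

# Each (?P<label>...) group becomes a column named after its label.
pattern = (
    r'contact_id\s+(?P<contact_id>\d{4})\s+'
    r'name\s+(?P<name>[A-Za-z]+\s+[A-Za-z]+)\s+'
    r'email\s+(?P<email>\S+@\S+)'
)

# One call returns a DataFrame with the columns contact_id, name, and email.
extracted = sample_df["contact_info"].str.extract(pattern)
print(extracted)
```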
