# Module 5 - Simple Regular Expressions (regex) (35 minutes)

A regular expression (regex) is a sequence of characters that defines a search pattern. In practice, **regex** is used to search for strings in data. This practice is called pattern matching (or string matching). Regex is a vast topic, so we only cover the very basics in this module. The best way to learn regex is to practice. So, let's get to it!

Here are a couple of good references:

https://www.w3schools.com/python/python_regex.asp

https://www.computerhope.com/unix/regex-quickref.htm

* Please feel free to ask questions at any time!

In [None]:
!pip install pymongo

In [None]:
from pymongo import MongoClient

client = MongoClient('localhost', port=27017)
db = client.test

# Create a regex

1. Identify the pattern you wish to match
2. Write a regex to accommodate the pattern
3. Use the **'\\$regex'** operator to define the pattern
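The three steps above can be tried locally with Python's built-in `re` module before touching the database. This is only a minimal sketch: the sample titles are made up, and `build_filter` is a hypothetical helper, not part of pymongo. Simple patterns like this one behave the same in Python and in MongoDB, but more exotic regex syntax may differ between the two engines.

```python
import re

# Step 1: the pattern we want is "a four-digit number in parentheses"
# Step 2: write the regex
pattern = r'\(\d{4}\)'

# made-up sample titles, used only to test the pattern locally
sample_titles = ['Alpha (1995)', 'Untitled Project', 'Beta Two (2000)']
matches = [t for t in sample_titles if re.search(pattern, t)]
print(matches)

# Step 3: wrap the pattern in a '$regex' filter document
def build_filter(field, pattern):
    """Hypothetical helper: return a filter dict that could be passed to find()."""
    return {field: {'$regex': pattern}}

title_filter = build_filter('title', pattern)
print(title_filter)
```

The resulting dictionary has the same shape as the filters passed to `find()` throughout this module.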

# Let's explore the 'movies' collection

* The 'movies' collection contains 3,883 movie titles
* It is available as a JSON file
* So, let's load it and create the 'movies' collection

In [None]:
# read JSON into memory

import json

def read_json(path):
    # use a separate name for the file handle so it does not shadow the path
    with open(path) as f:
        return json.load(f)

json_file = 'data/movies.json'  # path to the JSON file
movies_data_from_json = read_json(json_file)

In [None]:
# establish the DB instance

movies = db.movies
movies.drop()  # good idea when learning ...

In [None]:
# create MongoDB collection

for i, row in enumerate(movies_data_from_json):
    row['_id'] = i
    movies.insert_one(row)  # create a new document for every record

In [None]:
# find number of documents

len(list(movies.find()))

In [None]:
# find features

docs = list(movies.find())  # avoid naming a variable 'dict', which shadows the built-in type
docs[0].keys()

* Why do we have to use index '0' with 'docs'?

## Query all 'Comedy' movies

In [None]:
# query all 'comedy' movies

q_regex = movies.find({'genres': {'$regex': 'Comedy'}})

# get the number of documents returned

len(list(q_regex))  # convert the query to a list and then find its length!

So, our query returned 1163 documents whose 'genres' field contains the string 'Comedy'. But a query returns a cursor, <strong><font color=red>not</font></strong> a collection, so we converted it to a list to count the documents. However, iterating over a cursor **consumes** it, which means that nothing is left to read afterward. So, we must rerun the query.
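A cursor behaves like any other Python iterator in this respect. The sketch below uses a plain generator as a stand-in for the cursor, an assumption made so that the example runs without a database. With a live collection, pymongo's `count_documents` method counts matches without creating a cursor at all; it appears here only as a comment.

```python
def fake_cursor():
    """Stand-in for a query cursor: yields documents one at a time."""
    for doc in [{'title': 'A'}, {'title': 'B'}, {'title': 'C'}]:
        yield doc

cursor = fake_cursor()
first_pass = len(list(cursor))   # iterating to the end consumes the generator
second_pass = len(list(cursor))  # nothing is left to iterate over
print(first_pass, second_pass)

# With a live collection, the count can be obtained without a cursor:
# movies.count_documents({'genres': {'$regex': 'Comedy'}})
```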

In [None]:
# rerun the query that returns all 'comedy' movies

q_regex = movies.find({'genres': {'$regex': 'Comedy'}})

[(row['title'], row['genres'])  for i, row in enumerate(q_regex) if i < 5]

Although we returned 'Comedy' movies, we didn't return purely comedy ones!

## Query only 'Comedy' movies released in 2000

In [None]:
q_regex = movies.find({
    'title': {'$regex':'2000'},
    'genres': {'$regex': '^Comedy$'}},
{'_id':0, 'title':1, 'genres':1})  # explicitly project 'title' and 'genres'

[(row['title'], row['genres']) for i, row in enumerate(q_regex) if i < 5]

The **first** condition matches titles that contain the string '2000', which is how we pick out movies released in that year. The **second** condition matches genres that consist of exactly the string 'Comedy'.

For the second condition, the **'^'** anchors the match to the start of the string. So, the string must begin with an uppercase 'C'. The **'\\$'** anchors the match to the end of the string. So, the string must end with a lowercase 'y'. Together, the two anchors require the whole value to be 'Comedy'. You can also project the fields you want to view by passing a second argument to the query!
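The effect of each anchor is easy to check with Python's `re` module. This is a small sketch on made-up genre strings, which are not taken from the 'movies' collection.

```python
import re

# made-up genre strings, for illustration only
genres = ['Comedy', 'Comedy|Romance', 'Action|Comedy', 'Dark Comedy']

def matching(pattern):
    """Return the sample strings in which the pattern is found."""
    return [g for g in genres if re.search(pattern, g)]

contains = matching('Comedy')   # no anchors: substring match
starts = matching('^Comedy')    # must begin with 'Comedy'
ends = matching('Comedy$')      # must end with 'Comedy'
exact = matching('^Comedy$')    # the whole string is 'Comedy'

print(contains, starts, ends, exact, sep='\n')
```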

Also, notice that each condition is <strong><font color=blue>automatically</font></strong> an 'AND'. So, each condition we add must be met!

## Query movies for children released in 2000

In [None]:
q_regex = movies.find({
    'title': {'$regex':'2000'},
    'genres': {'$regex':'Children'}
    })

[(row['title'], row['genres']) for i, row in enumerate(q_regex) if i < 5]

Notice that we only include the word 'Children' in the regular expression for 'genres'.

## Query 'Drama' movies released in 1998 with 'Angel' in the title 

In [None]:
q_regex = movies.find({
    '$and':[
        {'title': {'$regex':'Angel'}},
        {'title': {'$regex':'1998'}}],
    'genres': {'$regex': '^Drama$'}
    })

[(row['title'], row['genres']) for i, row in enumerate(q_regex)]

* We had to use two 'title' expressions to get what we want! A Python dictionary cannot hold the same key twice, so we wrapped both expressions in the '\\$and' operator to make the query work as desired.
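The reason for '$and' shows up in plain Python, with no database involved. In the sketch below, the variable names `broken` and `fixed` are only illustrative.

```python
# A dict literal with a repeated key silently keeps only the last value
broken = {
    'title': {'$regex': 'Angel'},
    'title': {'$regex': '1998'},  # overwrites the entry above
}
print(broken)

# Wrapping each condition in its own dict inside '$and' keeps both
fixed = {
    '$and': [
        {'title': {'$regex': 'Angel'}},
        {'title': {'$regex': '1998'}},
    ]
}
print(len(fixed['$and']))
```

Python raises no error for the repeated key, so this kind of mistake is easy to miss.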

## Query 'Comedy' movies with number titles released in the 1990s 

In [None]:
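q_regex = movies.find({
    '$and':[
        {'title': {'$regex':'^[0-9]'}},     # title starts with a digit
        {'title': {'$regex':'199[0-9]'}}],  # title contains a year from 1990 to 1999
    'genres': {'$regex': '^Comedy$'}
    })

[(row['title'], row['genres']) for i, row in enumerate(q_regex)]

We can match any digit with **'\[0-9\]'**. The pattern **'199\[0-9\]'** matches '199' followed by one more digit, which covers the years 1990 through 1999. Note that **'*'** is not a wildcard in regex. It repeats the preceding character zero or more times, so a pattern like '19*' would match any title that contains a '1'.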
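The patterns '19*' and '199[0-9]' look similar but behave very differently. Here is a quick local check with Python's `re` module, using made-up titles that are not taken from the collection.

```python
import re

# made-up titles, for illustration only
titles = ['101 Things (1985)', '2 Days (1996)', '12 Steps (2001)']

# '19*' means: a '1' followed by zero or more '9' characters
star = [t for t in titles if re.search('19*', t)]

# '199[0-9]' means: '199' followed by exactly one digit
decade = [t for t in titles if re.search('199[0-9]', t)]

print(star)
print(decade)
```

Every sample title contains a '1', so the first pattern matches all of them, while the second pattern keeps only the title with a year in the 1990s.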

# Let's explore the 'sales' collection

* The 'sales' collection contains 100 documents
* It is available as a CSV file
* So, let's load it and create the 'sales' collection

In [None]:
import pandas as pd

f = 'data/sales.csv'
df = pd.read_csv(f)
df[['Country', 'Item Type']].head(3)

In [None]:
# establish the DB instance

sales = db.sales
sales.drop()  # good idea when learning ...

In [None]:
# create a list of dictionary elements for MongoDB consumption

data = df.to_dict('records')

In [None]:
# create MongoDB collection

for i, row in enumerate(data):
    row['_id'] = i
    sales.insert_one(row)  # create a new document for every record

In [None]:
# find number of documents

len(list(sales.find()))

In [None]:
# find features

docs = list(sales.find())  # avoid naming a variable 'dict', which shadows the built-in type
docs[0].keys()

## Query countries starting with 'A' with ship dates in the 2000s

In [None]:
q_regex = sales.find({
    'Country': {'$regex':'^A'},
    'Ship Date': {'$regex':'/20[0-9]{2}$'}},  # year 20xx; assumes dates look like M/D/YYYY
{'_id':0, 'Country':1, 'Ship Date':1, 'Region':1})  # explicitly project some features

[(row['Country'], row['Ship Date'], row['Region'])
 for i, row in enumerate(q_regex) if i < 5]

## Query order dates in February of 2015

In [None]:
# return all orders placed in February 2015

q_regex = sales.find({
    '$and':[
        {'Order Date': {'$regex':'^2/'}},  # February is month '2'
        {'Order Date': {'$regex':'2015'}},
    ]
})

[(row['Country'], row['Order Date']) for i, row in enumerate(q_regex)]
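The two date conditions can be checked locally as well. The sketch below assumes the dates are stored as strings in M/D/YYYY form, which is inferred from the '^2/' pattern in the query; the sample dates themselves are made up.

```python
import re

# made-up date strings in the assumed M/D/YYYY format
dates = ['2/14/2015', '12/2/2015', '2/3/2012', '10/20/2015']

# both conditions together: starts with '2/' and contains '2015'
feb_2015 = [d for d in dates
            if re.search('^2/', d) and re.search('2015', d)]

# without the '^' anchor, '2/' is also found inside '12/2/2015'
no_anchor = [d for d in dates if re.search('2/', d)]

print(feb_2015)
print(no_anchor)
```

The '^' anchor is what keeps December and day-of-month matches out of the February result.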

# Finally, let's explore the familiar 'cars' collection

* The 'cars' collection contains 406 documents 

In [None]:
# establish the DB instance

cars = db.cars

* No need to drop the collection because we are not creating it from scratch!

In [None]:
# find number of documents

len(list(cars.find()))

In [None]:
# find features

docs = list(cars.find())  # avoid naming a variable 'dict', which shadows the built-in type
docs[0].keys()

## Let's perform a really complex one!

In [None]:
q = cars.find({
    'HP': {'$gt':113},
    'Weight': {'$gt':2000},
    'Cylinders':{'$in':[4, 6]},
    'Origin':{'$in':['Europe', 'Japan']},
    'Car':{'$regex':'^s', '$options':'i'}
    },
{'_id':0, 'Car':1, 'Origin':1, 'MPG':1})

[(row['Car'], row['Origin'], row['MPG']) for row in q]

We logically 'AND' 5 conditions, namely, 'HP' greater than 113, 'Weight' greater than 2000, 'Cylinders' of either 4 or 6, 'Origin' either Europe or Japan, and 'Car' starting with the letter 's'. Setting '\$options' to **'i'** makes the match case-insensitive, so names starting with either 's' or 'S' qualify. Whew!
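Case-insensitive matching can be previewed with Python's `re` module, where the `re.IGNORECASE` flag plays a role similar to the 'i' option. This is an analogy on made-up names, not a call to the MongoDB API.

```python
import re

# made-up car names, for illustration only
names = ['Saab 99', 'subaru', 'Toyota']

# case-sensitive: only a lowercase 's' at the start matches
sensitive = [n for n in names if re.search('^s', n)]

# case-insensitive: 's' or 'S' at the start matches
insensitive = [n for n in names if re.search('^s', n, re.IGNORECASE)]

print(sensitive)
print(insensitive)
```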

## Query cars with 'fury' in the name not from Japan

In [None]:
q = cars.find({
    'Car': {'$regex': 'fury', '$options':'i'},
    'Origin': {'$nin':['Japan']}},
    {'_id':1, 'Car':1, 'Origin':1}
    )

[(row['_id'], row['Car'], row['Origin']) for row in q]

The '\\$nin' ('not in') operator excludes documents whose field value appears in the given list, which makes it a handy complement to '\\$in'.

## Query US cars with name ending in 'duster'

In [None]:
q = cars.find({
    'Car': {'$regex': 'duster$', '$options':'i'},
    'Origin': {'$in':['US']}},
    {'_id':0, 'Car':1, 'Origin':1})

[(row['Car'], row['Origin']) for row in q]

## Finally, return Japanese cars with name starting with 'm' and containing 'rx', and sort asc

In [None]:
# return Japanese cars starting with 'm' and containing 'rx', sorted by 'HP' ascending

q = cars.find({
    '$and': [  # a dict cannot hold two 'Car' keys, so combine both patterns with '$and'
        {'Car': {'$regex': '^m', '$options':'i'}},
        {'Car': {'$regex': 'rx', '$options':'i'}}],
    'Origin': {'$in':['Japan']}},
    {'_id':0, 'Car':1, 'Origin':1, 'HP':1}).sort('HP', 1)

[row for row in q]

# Module 5 Exercise

Create a regex query with the following specifications:

* Work with the **'sales'** collection
* Return only 'African' sales documents
* Return only countries starting with 'r'
* Project only country, total profit, and item type
* Include the '_id' key

# Our solution

* Begin with a plan. Create one query condition at a time and test. In this case, we only have two conditions and one projection. Let's begin by limiting the query to only 'Africa':

In [None]:
q = sales.find({
    'Region': {'$regex': 'africa', '$options':'i'}
},
{'Region':1})

[(row['_id'], row['Region']) for i,row in enumerate(q) if i < 3]

Let's continue with countries starting with 'r':

In [None]:
q = sales.find({
    'Region': {'$regex': 'africa', '$options':'i'},
    'Country': {'$regex': '^r', '$options':'i'}
},
{'_id':1, 'Region':1, 'Country':1})

[(row['_id'], row['Country'], row['Region'])
 for i,row in enumerate(q) if i < 10]

Now, all we have to do is project:

In [None]:
q = sales.find({
    'Region': {'$regex': 'africa', '$options':'i'},
    'Country': {'$regex': '^r', '$options':'i'}
},
{'_id':1, 'Country':1, 'Item Type':1, 'Total Profit':1})

data = [(row['_id'], row['Country'], row['Item Type'],
         '${:,.2f}'.format(row['Total Profit'])) for row in q]

data

We went a bit crazy and formatted total profit!
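The format string used for total profit can be tried on its own. In this small sketch the profit value is made up; ',' adds thousands separators and '.2f' keeps two decimal places.

```python
profit = 1234567.891  # made-up value, for illustration only

formatted = '${:,.2f}'.format(profit)
print(formatted)  # → $1,234,567.89

# an f-string with the same format spec gives the same result
same = f'${profit:,.2f}'
print(same)
```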

# What did we learn?

1. we retrieved the number of documents in a collection
2. we retrieved the features (keys) from a collection
3. we created several regex queries to identify patterns from three collections
4. we sharpened our regex querying skills with an exercise

## Questions?

# <font color=red>5 minute break</font>