
# Regular Expressions




Regular expressions (regexes or re’s) constitute an extremely powerful, flexible and concise language for matching elements in text ranging from a few characters to complex patterns. While mastering the syntax of the regular expression language does require climbing a learning curve, this learning curve is not particularly steep, and a newcomer can find herself performing useful tasks with regular expressions almost immediately. Efforts spent learning regular expressions quickly pay off--tasks that are well suited for regular expressions abound. Indeed, regular expressions are one of the most useful computer skills, and an absolutely critical tool for data scientists. 

This document will present basic regular expression syntax and cover common use cases for regular expressions. 

In [1]:
# The code below is written in Python to replicate the 
# behavior of grep, the UNIX utility
# We will examine the details of how the code works in a subsequent notebook.
# For now, just execute the code, and use the function 
# grep(regex_expression, name_list) as-is

import re

def printMatches(text, regex_expression):
  BACKGROUND_YELLOW = '\x1b[43m'
  COLOR_RESET  = "\x1b[0m"
  regex= re.compile(regex_expression)
  matches = regex.finditer(text)
  for m in matches:
    highlighted  = text[:m.start()] # the string before the regex match
    highlighted += BACKGROUND_YELLOW + text[m.start():m.end()] + COLOR_RESET 
    highlighted += text[m.end():] # the string after the regex match
    print(highlighted)

def grep(regex_expression, name_list):
  for line in name_list:
    printMatches(line, regex_expression)

### NYC Restaurant Names Data

In this notebook, we will demonstrate the various regular expressions using the set of restaurant names from `/data/uniquenames.txt`.

In [2]:
!pip install -U -q PyMySQL sqlalchemy

from sqlalchemy import create_engine
import pandas as pd

conn_string = 'mysql+pymysql://{user}:{password}@{host}/{db}?charset=utf8mb4'.format(
    host = 'db.ipeirotis.org', 
    user = 'student',
    password = 'dwdstudent2015', 
    db = 'doh_restaurants',
    encoding = 'utf8mb4')

with create_engine(conn_string).connect() as mysql_conn:
  # This query returns back the restaurants in the DOH database
  sql = 'SELECT DISTINCT UPPER(DBA) AS DBA FROM restaurants WHERE DBA IS NOT NULL'
  uniquenames = pd.read_sql(sql, con=mysql_conn)
  uniquenames = uniquenames.DBA.values

print(f"Found {len(uniquenames)} unique restaurants names")

Found 14851 unique restaurants names


Let's take a peek at the entries of the `uniquenames` list:

In [3]:
uniquenames[:5]

array(['TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST',
       "WENDY'S", 'DJ REYNOLDS PUB AND RESTAURANT', 'RIVIERA CATERERS',
       "WILKEN'S FINE FOOD"], dtype=object)

In [4]:
uniquenames[-5:]

array(['PHO HOANG', 'CHICK ROCKS', 'NEW DOUBLE DRAGON',
       'ADONAY RESTAURANT', 'OH! BAGEL'], dtype=object)

Now, let's see if there are any restaurants with the string 'PANO' in them:

In [5]:
grep('PANO', uniquenames)

LA CANDELA ES[43mPANO[0mLA
[43mPANO[0mRAMA OF MY SILENCE-HEART
GOOGLE [43mPANO[0mRAMA
CENTRO ES[43mPANO[0mL
[43mPANO[0mRAMA STEAKHOUSE
SABOR HIS[43mPANO[0m
[43mPANO[0mRAMA


What can we do if we want to search for something more complex than a fixed string? Regular expressions solve exactly this problem.

### The atoms

The simplest regular expressions are a sequence of `atoms`. An atom can be any of the following:
* a single character,
* a dot,
* a bracket expression, 
* an anchor.

#### Single character atom

A single character atom matches itself.

#### The `.` character atom

A dot atom matches any single character (except for the newline character `\n`).

Example: Using single character atoms, and the `.` atom, let's find all restaurant names that contain the characters `AB`, followed by any character (`.`) and then the character `D`:

In [6]:
grep('AB.D', uniquenames)

P[43mABAD[0mE BAKERY & CAFE
CR[43mAB D[0mU JOUR XPRESS


#### Bracket expression atom

A bracket expression (defined by square brackets `[]`) defines a set of characters. It matches exactly one character, which can be any of the characters in the set. Example: `[ABL]` matches either A, B, or L.

Now, let's use a bracket expression: We want to find restaurants that contain one of the letters A,B,C,X,Y,Z followed by a digit. We specify the set of letters as `[ABCXYZ]` and the set of digits as `[0123456789]`.  

In [31]:
grep('[ABCXYZ][0123456789]', uniquenames)

[43mB6[0m6 CLUB
[43mB2[0m HARLEM
WORLD BEAN, VELOCITY BAR (E[43mC2[0m)
BARCLAYS UPPER SUITE STOLI BAR AND STORAGE ROOM 5[43mC2[0m9.03
COTTO MARKET-GATE [43mC3[0m0
GARDEN MARKET, STREET/HEALTH (F[43mA6[0m090)
BAR AT THE GARDEN (B[43mA6[0m110)
EVENT LEVEL CLUB (DELTA SKY 360 BAR) B[43mA5[0m075
HOT DOG CONCESSION (F[43mC6[0m100)
FOUR SEASONS HOTEL EMPLOYEE CAFETERI[43mA4[0m
[43mA1[0m JAMAICA BREEZE
SI[43mX2[0m RESTAURANT & BAKERY
SK[43mY5[0m5 BAR AND GRILL


##### Brackets and ranges

Instead of typing long lists of characters in a bracket expression, we can use the range character `-`: `[0-9]` is equivalent to `[0123456789]`. Similarly, `[A-Z]` is equivalent to `[ABCDEFGHIJKLMNOPQRSTUVWXYZ]`, and `[D-T]` is equivalent to `[DEFGHIJKLMNOPQRST]`. (You get the idea.) You can also combine multiple ranges: `[a-e1-9]` is equivalent to `[abcde123456789]`. Finally, you can specify characters to be excluded from the set by placing the character `^` immediately after the opening bracket. For example, `[^0-9]` matches any character other than a digit.
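
As a quick check of these equivalences, the sketch below runs a few bracket expressions through Python's `re.findall`. The sample string and the variable names are made up for illustration and are not part of the restaurant data.

```python
import re

# Made-up sample text, used only for this illustration
sample = "Room 5C29, Gate c3"

# [0-9] and [0123456789] describe the same set of characters
digits_range = re.findall('[0-9]', sample)
digits_list = re.findall('[0123456789]', sample)

# Several ranges can share one pair of brackets
letters_or_digits = re.findall('[A-Za-z0-9]', sample)

# A leading ^ negates the set: anything that is not a letter or a digit
others = re.findall('[^A-Za-z0-9]', sample)

print(digits_range == digits_list)
print(''.join(letters_or_digits))
print(others)
```

Each call returns the list of single characters that belong to the corresponding set, in the order they appear in the string.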

For example, let's find restaurants that contain a letter, followed by a digit, and then followed by a character that is not a digit:

In [8]:
grep('[A-Z][0-9][^0-9]', uniquenames)

[43mB2 [0mHARLEM
WORLD BEAN, VELOCITY BAR (E[43mC2)[0m
TW[43mO8T[0mWO BAR & BURGER
B[43mT4 [0mAISLE B
B[43mT3 [0mAISLE A
OPPA , [43mW4 [0mPIZZA
[43mA1 [0mJAMAICA BREEZE
SI[43mX2 [0mRESTAURANT & BAKERY
TEAMWORKO[43mN3 [0mJUICES
[43mO2 [0mK-BBQ


Hm, we do not want to get results that have a space after the number, so let's also exclude the space character:

In [9]:
grep('[A-Z][0-9][^0-9 ]', uniquenames) 

WORLD BEAN, VELOCITY BAR (E[43mC2)[0m
TW[43mO8T[0mWO BAR & BURGER


In [10]:
# A digit, then a character that is not a letter, digit, or space, then a digit
grep('[0-9][^A-Z0-9 ][0-9]', uniquenames) 

10[43m4-0[0m1 FOSTER AVENUE COFFEE SHOP(UPS)
SUNSWICK 3[43m5/3[0m5
THE BEST $[43m1.0[0m0 PIZZA
BARCLAYS UPPER SUITE STOLI BAR AND STORAGE ROOM 5C2[43m9.0[0m3
[43m1.5[0m DAK GALBI
CHAIKHANA [43m7:4[0m0


In [11]:
# Restaurants with five consecutive digits in their name
grep('[0-9][0-9][0-9][0-9][0-9]', uniquenames) 

STARBUCKS COFFEE # [43m26528[0m
STARBUCKS COFFEE #[43m22716[0m
STARBUCKS COFFEE #[43m23591[0m
SUBWAY #[43m47857[0m
STARBUCKS COFFEE #[43m29856[0m
STARBUCKS COFFEE COMPANY #[43m29897[0m
STARBUCKS COFFEE #[43m48170[0m
STARBUCKS (STORE #[43m50483[0m)
STARBUCKS COFFEE #[43m49952[0m
STARBUCKS #[43m50611[0m
STARBUCKS COFFEE # [43m49196[0m
STARBUCKS #[43m48990[0m
STARBUCKS COFFEE #[43m49550[0m
MCDONALD'S #[43m11542[0m
MCDONALD'S #[43m13068[0m
MCDONALD'S #[43m23105[0m
MCDONALDS # [43m18093[0m
MCDONALDS [43m14520[0m
SUBWAY (#[43m29887[0m)
STARBUCKS COFFEE #[43m50622[0m
SUBWAY STORE #[43m30214[0m
CARVEL # [43m10222[0m4
STARBUCKS COFFEE #[43m29719[0m
STARBUCKS COFFEE #[43m53473[0m
STARBUCKS COFFEE #[43m49450[0m
STARBUCKS #[43m54446[0m
STARBUCKS COFFE #[43m55085[0m
TACO BELL CANTINA [43m03464[0m6
STARBUCKS COFFEE#[43m54771[0m
CARVEL [43m10266[0m4
STARBUCKS COFFEE #[43m52530[0m
SUBWAY # [43m30658[0m
TACO BELL # [43m35457[0m
STARBUCKS

#### Anchor

Anchor atoms are special characters, used to define the location of a regex within a line. 

The anchor `^` specifies the *beginning of a line*, the anchor `$` specifies the end of a line. The anchor `\b` specifies the word boundary.

Example: Find restaurant names that start with the characters `BAL`

In [12]:
grep('^BAL', uniquenames)

[43mBAL[0mTHAZAR RESTAURANT
[43mBAL[0mTHAZAR BAKERY
[43mBAL[0mLATO'S RESTAURANT
[43mBAL[0mDOR SPECIALTY FOODS
[43mBAL[0mBOA RESTAURANT
[43mBAL[0mADE
[43mBAL[0mIMAYA RESTAURANT
[43mBAL[0mABOOSTA
[43mBAL[0mZEM
[43mBAL[0mVANERA
[43mBAL[0mADE EASTERN MEDITERRANEAN
[43mBAL[0mANCERO


Example: Find restaurant names that end with the characters `SQUARE`

In [13]:
grep('SQUARE$', uniquenames)

MADISON [43mSQUARE[0m
MERRION [43mSQUARE[0m
TONIC TIMES [43mSQUARE[0m
NORTH [43mSQUARE[0m
HOLIDAY INN EXPRESS NYC TIMES [43mSQUARE[0m
HAMPTON INN-HERALD [43mSQUARE[0m
RESIDENCE INN TIMES [43mSQUARE[0m
THE MANHATTAN AT TIMES [43mSQUARE[0m
BEAN [43mSQUARE[0m
BEST WESTERN PREMIER HERALD [43mSQUARE[0m
HAMPTON INN TIMES [43mSQUARE[0m
ELEMENT HOTEL TIME [43mSQUARE[0m
TACOS TIME [43mSQUARE[0m
HOMEWOOD SUITES BY HILTON NEW YORK MIDTOWN MANHATTAN TIMES [43mSQUARE[0m
HOTEL RIU PLAZA NEW YORK TIMES [43mSQUARE[0m
MIAS BAKERY TIMES [43mSQUARE[0m
ICHIRAN TIMES [43mSQUARE[0m
PINNACLE BAGELS ON THE [43mSQUARE[0m


In [14]:
# All restaurants that end with 4 digits
grep('[0-9][0-9][0-9][0-9]$', uniquenames)

STARBUCKS #[43m7277[0m
STARBUCKS COFFEE #[43m7358[0m
STARBUCKS #[43m7378[0m
GALLAGHER'S [43m2000[0m
CARVEL [43m2848[0m
KAFFE [43m1668[0m
THE ORIGINAL VINCENT'S ESTABLISH [43m1904[0m
CHIPTOLE MEXICAN GRILL #[43m2407[0m
EVENT LEVEL CLUB (DELTA SKY 360 BAR) BA[43m5075[0m
SUITE 200, [43m1879[0m
STARBUCKS COFFEE # 2[43m6528[0m
STARBUCKS COFFEE #2[43m2716[0m
STARBUCKS COFFEE #2[43m3591[0m
LN [43m1380[0m
SUBWAY #4[43m7857[0m
CARVEL [43m1939[0m
CHIPOTLE MEXICAN GRILL #[43m2308[0m
CHIPOTLE MEXICAN GRILL #[43m2254[0m
PANDA EXPRESS #[43m2679[0m
PANDA EXPRESS #[43m2633[0m
CHIPOTLE MEXCIAN GRILL # [43m2760[0m
PANDA EXPRESS [43m2614[0m
STARBUCKS COFFEE #2[43m9856[0m
CHIPOTLE MEXICAN GRILL #[43m2834[0m
MEXICO [43m2000[0m
CHIPOTLE MEXICAN GRILL #[43m2838[0m
STARBUCKS COFFEE COMPANY #2[43m9897[0m
NEW GREAT WALL [43m1419[0m
CHIPOTLE MEXICAN GRILL #[43m2570[0m
CHIPOTLE MEXICAN GRILL #[43m2879[0m
STARBUCKS COFFEE #4[43m8170[0m
STARBUCKS COFFEE 

Example: Let's try to find restaurants containing the word `MEXICO`:

In [15]:
# Note that we capture also words like 'TULCIMEXICO' and 'MEXICOCIANA' 
grep('MEXICO', uniquenames)

[43mMEXICO[0m LINDO RESTAURANT
PIAXTLA ES [43mMEXICO[0m DELI
TACOS [43mMEXICO[0m
NEW [43mMEXICO[0m PLACE
SABOR A [43mMEXICO[0m II
EL SOL DE [43mMEXICO[0m DELI GROCERY
LAS MARAVILLAS DE [43mMEXICO[0m RESTAURANT
NUEVO [43mMEXICO[0m MEXICAN RESTAURANT
SABOR A [43mMEXICO[0m TAQUERIA
TACOS Y QUESADILLAS [43mMEXICO[0m
TAQUITOS [43mMEXICO[0m RESTAURANT
MADE IN [43mMEXICO[0m
[43mMEXICO[0m EN LA PIEL
[43mMEXICO[0m EL SALVADOR INC
MANJARES [43mMEXICO[0m
[43mMEXICO[0m 2000
VIVA [43mMEXICO[0m MEXICAN CUISINE
[43mMEXICO[0mCIANA
CON SABOR A [43mMEXICO[0m
LA ADELITA, EL CORAZON DE [43mMEXICO[0m
TULCI[43mMEXICO[0m RESTAURANT
EL RINCON DE [43mMEXICO[0m


In [16]:
# Notice that adding spaces is not sufficient: we miss names that start or end with MEXICO
grep(' MEXICO ', uniquenames)

PIAXTLA ES[43m MEXICO [0mDELI
NEW[43m MEXICO [0mPLACE
SABOR A[43m MEXICO [0mII
EL SOL DE[43m MEXICO [0mDELI GROCERY
LAS MARAVILLAS DE[43m MEXICO [0mRESTAURANT
NUEVO[43m MEXICO [0mMEXICAN RESTAURANT
SABOR A[43m MEXICO [0mTAQUERIA
TAQUITOS[43m MEXICO [0mRESTAURANT
VIVA[43m MEXICO [0mMEXICAN CUISINE


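The plain `MEXICO` search also returned TULCIMEXICO and MEXICOCIANA, which we _may_ not want, while padding the pattern with spaces dropped the names that start or end with MEXICO. If we want only the word `MEXICO`, we add the word anchors: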

In [17]:
# The r'....' is a "raw" string, and allows us to enter
# backslash without having to "escape" the backslash.
# Otherwise Python will interpret \b as a single special
# character, and not as two characters \b that are part of the regex
grep(r'\bMEXICO\b', uniquenames)

[43mMEXICO[0m LINDO RESTAURANT
PIAXTLA ES [43mMEXICO[0m DELI
TACOS [43mMEXICO[0m
NEW [43mMEXICO[0m PLACE
SABOR A [43mMEXICO[0m II
EL SOL DE [43mMEXICO[0m DELI GROCERY
LAS MARAVILLAS DE [43mMEXICO[0m RESTAURANT
NUEVO [43mMEXICO[0m MEXICAN RESTAURANT
SABOR A [43mMEXICO[0m TAQUERIA
TACOS Y QUESADILLAS [43mMEXICO[0m
TAQUITOS [43mMEXICO[0m RESTAURANT
MADE IN [43mMEXICO[0m
[43mMEXICO[0m EN LA PIEL
[43mMEXICO[0m EL SALVADOR INC
MANJARES [43mMEXICO[0m
[43mMEXICO[0m 2000
VIVA [43mMEXICO[0m MEXICAN CUISINE
CON SABOR A [43mMEXICO[0m
LA ADELITA, EL CORAZON DE [43mMEXICO[0m
EL RINCON DE [43mMEXICO[0m
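
The effect of the `r` prefix described in the comment above can be checked directly. The sketch below (variable names are only illustrative) compares the two spellings of the same pattern:

```python
import re

plain = '\bMEXICO\b'   # Python turns each \b into one backspace character
raw = r'\bMEXICO\b'    # backslash and b stay separate, so the regex engine sees \b

print(len(plain), len(raw))

name = 'TACOS MEXICO'
plain_match = re.search(plain, name)   # looks for literal backspace characters
raw_match = re.search(raw, name)       # looks for MEXICO as a whole word

print(plain_match)
print(raw_match.span())
```

The non-raw version is two characters shorter and finds nothing, because the restaurant name contains no backspace characters.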


#### Basic Patterns

* `a, X, 9, ....`: -- ordinary characters just match themselves exactly. 
* `. ^ \$ * + ? { [ ] \ | ( )`: The **meta-characters** which do not match themselves because they have special meanings (more info below)
* `.` (a period) -- matches any single character except newline '\n'
* `\t, \n, \r`: Special characters, tab, newline, return
* `^` = start, `$` = end -- match the start or end of the string
* `\`: inhibit the "specialness" of a character. So, for example, use `\.` to match a period or `\\` to match a backslash. If you are unsure whether a punctuation character, such as '@', has a special meaning, you can put a backslash in front of it, `\@`, to make sure it is treated just as a character.
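
To make the escaping rule concrete, here is a minimal sketch. The short list of names is an assumption for illustration: only the first entry appears in the restaurant data, the other two are invented. It also uses `re.escape`, a standard-library helper that adds the backslashes for a string that should be matched literally.

```python
import re

# Illustrative names; 'CAFE 1X00' and 'STAND #1' are invented for this example
names = ['THE BEST $1.00 PIZZA', 'CAFE 1X00', 'STAND #1']

# Unescaped, the dot matches any character, so '1.00' also matches '1X00'
loose = [n for n in names if re.search(r'1.00', n)]

# Escaped, \. matches only a literal period
strict = [n for n in names if re.search(r'1\.00', n)]

# re.escape escapes every metacharacter in a literal string ($ and . here)
literal_price = re.escape('$1.00')
exact = [n for n in names if re.search(literal_price, n)]

print(loose)
print(strict)
print(exact)
```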

#### Shortcuts

A few of the bracket expressions that we discussed above occur very often. For this reason, we have shortcuts for them:

* `\d`: matches the digits: `[0-9]`.
* `\D`: matches anything but `\d`: `[^0-9]`.
* `\w`: matches any alphanumeric character plus underscore: `[A-Za-z0-9_]`.
* `\W`: matches anything but `\w`: `[^A-Za-z0-9_]`
* `\s`: matches any "whitespace" character (space, tab, newline, etc): `[ \t\n\r\f\v]`.
* `\S`: matches anything but `\s`: `[^ \t\n\r\f\v]` .
* `\b`: matches the empty string at a word boundary, that is, the position between a `\w` character and a `\W` character (or the start or end of the string). Useful for ensuring that what you match is actually a word.
* `\B`: matches the empty string at any position that is not a word boundary. Useful for ensuring your match is in the middle of a word.
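
The shortcuts can be tried out on a single string. The name below is made up for illustration. Note also that, for ordinary Python 3 strings, `\d`, `\w` and `\s` match non-ASCII digits, letters and whitespace as well, unless the `re.ASCII` flag is passed. The sample here is plain ASCII, so the bracket equivalents listed above apply.

```python
import re

# Invented name, chosen to contain digits, an underscore and a '#'
name = 'SKY55 BAR_AND AREPA #7'

numbers = re.findall(r'\d+', name)    # runs of digits
words = re.findall(r'\w+', name)      # runs of letters, digits and underscores
pieces = re.split(r'\s+', name)       # split on whitespace

# \b requires a word boundary before the A; the underscore counts as a
# word character, so the A of 'AND' in 'BAR_AND' does not qualify
starts_with_a = re.findall(r'\bA\w*', name)

# \B requires the opposite: an A that is NOT at a word boundary
inner_a = re.findall(r'\BA\w*', name)

print(numbers, words, pieces, starts_with_a, inner_a)
```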



#### In class exercises

Write a regular expression for:

* Match any character
* Match the end of line
* Match any digit
* Find all characters that are not digits
* Find all words with four letters
* Find every line that starts with a digit
* Find all empty lines
* Find all lines with 4 characters


### Regular Expressions: Operators

#### Alternation |

The alternation operator `|` separates two or more alternative regular expressions. A string matches if any one of the alternatives matches.

For example, if we are looking for names that contain any of the strings `GREEK`, `RUSSIAN`, or `FRENCH`, we issue the following command: 

In [18]:
grep('GREEK|RUSSIAN|FRENCH', uniquenames)

SYMPOSIUM [43mGREEK[0m RESTAURANT
[43mRUSSIAN[0m TURKISH BATHS
[43mRUSSIAN[0m SAMOVAR
[43mFRENCH[0m ROAST
[43mRUSSIAN[0m VODKA ROOM
JEAN DANET [43mFRENCH[0m PASTRY
THE [43mGREEK[0m KITCHEN
[43mRUSSIAN[0m BATHS
[43mGREEK[0m ISLANDS
AVLI THE LITTLE [43mGREEK[0m TAVERN
[43mGREEK[0m EXPRESS
SOMETHIN[43mGREEK[0m
VILLAGE TAVERNA [43mGREEK[0m GRILL
JEAN CLAUDE [43mFRENCH[0m BISTRO
GRK FRESH [43mGREEK[0m
AVLEE [43mGREEK[0m KITCHEN
[43mGREEK[0m XPRESS
[43mFRENCH[0m LOUIE
[43mFRENCH[0m DINER
DIRTY [43mFRENCH[0m
MARATHI [43mGREEK[0m BISTRO
[43mGREEK[0m GRILL
[43mGREEK[0m EATS
EXCUSE MY [43mFRENCH[0m
3 [43mGREEK[0mS GRILL
AVLI LITTLE [43mGREEK[0m KAFE
KUZINA THE [43mGREEK[0m KITCHEN
[43mGREEK[0m FELLAS
LE [43mFRENCH[0m TART DELI
[43mFRENCH[0mETTE
[43mFRENCH[0mY COFFEE NYC
EONS [43mGREEK[0m FOOD FOR LIFE
[43mGREEK[0m FROM GREECE
LALGEROISE [43mFRENCH[0m BAKERY
SIMPLY [43mGREEK[0m
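
The output above includes names such as SOMETHINGREEK and FRENCHETTE, because each alternative is matched as a substring. As a small sketch, and assuming we want whole words only, each alternative can carry its own `\b` anchors. The four names are copied from the output above and the variable names are illustrative.

```python
import re

names = ['GREEK EXPRESS', 'SOMETHINGREEK', 'FRENCHETTE', 'DIRTY FRENCH']

# Each alternative is a complete regular expression with its own anchors
pattern = re.compile(r'\bGREEK\b|\bFRENCH\b')

whole_word = [n for n in names if pattern.search(n)]
print(whole_word)
```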


#### Repetition {m,n}

A repetition operator specifies how many times the atom or expression immediately before it may be repeated: `{m,n}` means at least `m` and at most `n` times. For example, if we are looking for restaurants that contain the letter I three to five times in a row:

In [19]:
grep('I{3,5}', uniquenames)

EL CHIVITO D'ORO [43mIII[0m
KNAPP PIZZA [43mIII[0m
BARZOLA'S RESTAURANT [43mIII[0m
LOS POLLITOS [43mIII[0m
NEW WIN HING [43mIII[0m CHINESE RESTAURANT
BAGEL EXPRESS [43mIII[0m
LITTLE ITALY PIZZA [43mIII[0m
ROCCO PIZZA [43mIII[0m
MIRACALI [43mIII[0m
EL POLLO [43mIII[0m
CESTRAS PIZZA [43mIII[0m
CHINA WOK [43mIII[0m
EL NUEVO VALLE [43mIII[0m
PHO RAINBOW [43mIII[0m
RICO POLLO [43mIII[0m
AVOCADO SUSHI [43mIIII[0m
PHO BEST [43mIII[0m


Now, let's find all the restaurants that have a name length from 50 to 55 characters:

In [20]:
grep('^.{50,55}$', uniquenames)

[43mVINNY'S OF CARROLL GARDEN RESTAURANT & LUNCHEONETT[0m
[43mIFH EL BUFFET RESTAURANT | ALBERTO'S MOFONGO HOUSE[0m
[43mMARRIOTT MARQUIS - MAIN KITCHEN/5TH FLOOR EMPLYEE CAFE[0m
[43mADVENTURES AMUSEMENTS PARK (ICE CREAM, SWEETS STAND)[0m
[43mGREEN AND ACKERMAN KOSHER DAIRY RESTAURANT & PIZZA[0m
[43mST JOHN'S UNIVERSITY LIBRARY CAFE  (ST.AUGUSTINE HALL)[0m
[43mCARIBBEAN CONNECTION CATERING SERVICES INC RESTAURANT[0m
[43mTAKE AWAY CAFE IN REBEKAH REHAB EXTENDED CARE CENTER[0m
[43mVISTA SKY LOUNGE & CATERING (SHERATON FOUR POINTS)[0m
[43mRESORTS WORLD CASINO GROUND LEVEL ( EMPLOYEE DINING)[0m
[43mTHE BIG APPLE (RITZ CARLTON HOTEL EMPLOYEE CAFETERIA)[0m
[43mBARCLAYS LOWER SUITES TANDUAY BAR SOUTH CLUB LOUNGE[0m
[43mBARCLAYS UPPER SUITE STOLI BAR AND STORAGE ROOM 5C29.03[0m
[43mMANDARIN ORIENTAL NEW YORK- LOBBY LOUNGE 35TH FLOOR[0m
[43mTHE BISTRO AT THE COURTYARD & RESIDENCE INN BY MARRIOTT[0m
[43mCOLUMBIA UNIVERSITY BAKER ATHLETICS COMPLEX, STAND #1[0m
[43m

In the repetition operator `{m,n}`, we can omit the upper limit `n` to mean "`m` or more repetitions". For example, let's find all the restaurants whose names are 60 characters or longer:

In [21]:
grep('^.{60,}$', uniquenames)

[43mTEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST[0m
[43mSNACK BAR (LOCATED BETWEEN A-B BETWEEN FANCY FOOD AND MASTERS)[0m
[43mMAIMONIDES PARK STAND # 110, TIKI BAR/MR SOFTEE/MARTY'S BURGER[0m
[43mMAIMONIDES PARK STAND #120, PREMIO SAUSAGE/ARANCINI/CHEESESTEAKS/MR. SOFTEE[0m
[43mFASHION INSTITUTE OF TECHNOLOGY DAVID DUBINSKY STUDENT CENTER[0m
[43mWORLD ICE CAFE AT FLUSHING MEADOWS CORONA PARK AQUATIC CENTER[0m
[43mRED STORM DINER @ ST. VINCENT'S HALL OF ST. JOHN'S UNIVERSITY[0m
[43mDELTA SKY CLUB (BARTENDER SERVICE TERMINAL D DELTA DEPARTURE)[0m
[43mINTERCONTINENTAL NEW YORK TIMES SQUARE, TODD ENGLISH THE STINGER[0m
[43mMARLIN BAR AT TOMMY BAHAMA AND TOMMY BAHAMA RESTAURANT AND B[0m
[43mTHE PENINSULA NEW YORK, CLEMENT/ THE GOTHAM LOUNGE/ SALON DE NING[0m
[43mUNION SQUARE SPORTS & ENTERTAINMENT AT THEATRE FOR A NEW AUDIENCE[0m
[43mFAIRFIELD INN & SUITES NEW YORK MANHATTAN FINANCIAL DISTRICT[0m
[43mMISTER DIPS (GARDEN

##### Repetition shortcuts (very common!): 

* `* = {0,}`. The `*` character means match the previous atom zero or more times
* `+ = {1,}`. The `+` character means match the previous atom one or more times
* `? = {0,1}`. The `?` character means match the previous atom zero or one times
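
The three shortcuts can be combined in one pattern. The sketch below is only an illustration: the names are shortened variants of entries in the data, and the pattern looks for a store number written with an optional space, a hash sign, optional spaces, and digits.

```python
import re

# Shortened, illustrative variants of names in the data
names = ['STARBUCKS', 'STARBUCKS #847', 'STARBUCKS # 26528', 'STARBUCKS#54771']

# ' ?'  : an optional space       (same as ' {0,1}')
# '#'   : a literal hash sign
# ' *'  : zero or more spaces     (same as ' {0,}')
# '\d+' : one or more digits      (same as '\d{1,}')
pattern = re.compile(r' ?# *\d+')

found = []
for n in names:
    m = pattern.search(n)
    found.append(m.group() if m else None)

print(found)
```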






Find all restaurants that start with one or more digits, followed by a space.

In [22]:
grep('^[0-9]+ ', uniquenames)

[43m1 [0mEAST 66TH STREET KITCHEN
[43m21 [0mCLUB
[43m5 [0mBURRO CAFE
[43m3 [0mGUYS
[43m1020 [0mBAR
[43m810 [0mDELI & CAFE
[43m101 [0mDELI
[43m3 [0mDELI & GRILL
[43m15 [0mEAST RESTAURANT
[43m44 [0m& X HELL'S KITCHEN
[43m7 [0mSTARS RESTAURANT
[43m3 [0mSISTERS' & SHANTA'S RESTAURANT & BAKERY
[43m68 [0mJAY STREET BAR
[43m3 [0mSTAR JUICE CENTER
[43m3 [0mWAY RESTAURANT
[43m5 [0mESTRELLA BAKERY
[43m169 [0mBAR
[43m809 [0mGRILL & BAR RESTAURANT
[43m230 [0mFIFTH
[43m11 [0mSTREET CAFE
[43m234 [0mCHINA CITY
[43m27 [0mSPORTS BAR & CAFE
[43m535 [0mMADISON CAFE
[43m1 [0mOAK
[43m86 [0mNOODLES
[43m95 [0mSOUTH
[43m1 [0mBANANA QUEEN
[43m33 [0mGOURMET
[43m55 [0mBAR
[43m2 [0mBROS PIZZA
[43m67 [0mORANGE STREET
[43m773 [0mLOUNGE
[43m18 [0mBAKERY
[43m1893 [0mSPORTS BAR
[43m212 [0mHISAE'S
[43m69 [0mBAR LOUNGE
[43m1 [0mSTOP PATTY SHOP
[43m620 [0mON CATON
[43m1001 [0mNIGHTS CAFE
[43m101 [0mCAFE
[43m5 [0mNAPKIN BURGER
[43m2 [0mBRO

Find all restaurants that start with a letter, followed by one or more digits, followed by a space.

In [23]:
grep('^[A-Z][0-9]+ ', uniquenames)

[43mB66 [0mCLUB
[43mB2 [0mHARLEM
[43mA1 [0mJAMAICA BREEZE
[43mO2 [0mK-BBQ


In [24]:
# Find all restaurants
# beginning with one or more letters // ^[A-Z]+
# followed by one or more digits // [0-9]+
# followed by any number of characters // .*
# and ending with BAR  // BAR$
grep('^[A-Z]+[0-9]+.*BAR$', uniquenames)

Find all restaurants that contain the word STARBUCKS, followed by any number of characters and then one or more digits.

In [25]:
grep('STARBUCKS.*[0-9]+', uniquenames)

[43mSTARBUCKS #847[0m
[43mSTARBUCKS #7277[0m
[43mSTARBUCKS COFFEE #7358[0m
[43mSTARBUCKS #7378[0m
[43mSTARBUCKS-ENTRANCE 11 WEST 33[0mRD ST
[43mSTARBUCKS COFFEE # 26528[0m
[43mSTARBUCKS COFFEE #22716[0m
[43mSTARBUCKS COFFEE #23591[0m
[43mSTARBUCKS COFFEE #29856[0m
[43mSTARBUCKS COFFEE COMPANY #29897[0m
[43mSTARBUCKS COFFEE #48170[0m
[43mSTARBUCKS (STORE #50483[0m)
[43mSTARBUCKS COFFEE #49952[0m
[43mSTARBUCKS #50611[0m
[43mSTARBUCKS COFFEE # 49196[0m
[43mSTARBUCKS #48990[0m
[43mSTARBUCKS COFFEE #49550[0m
[43mSTARBUCKS COFFEE #50622[0m
[43mSTARBUCKS COFFEE #29719[0m
[43mSTARBUCKS COFFEE #53473[0m
[43mSTARBUCKS COFFEE #49450[0m
[43mSTARBUCKS #54446[0m
[43mSTARBUCKS COFFEE #823[0m
[43mSTARBUCKS COFFE #55085[0m
[43mSTARBUCKS COFFEE#54771[0m
[43mSTARBUCKS COFFEE #52530[0m
[43mSTARBUCKS COFFEE #58310[0m
[43mSTARBUCKS COFFEE #56451[0m
[43mSTARBUCKS COFFEE #57601[0m
[43mSTARBUCKS COFFEE #678[0m HUDSON
[43mSTARBUCKS COFFEE #14090[0m


#### Grouping ()

Parentheses group a sequence of atoms into a single unit, so that the operator that follows the closing parenthesis applies to the whole group and not only to the atom immediately before it.
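
A minimal sketch of the difference that the parentheses make (the test string is invented for illustration, loosely based on the name PIO PIO from the data):

```python
import re

text = 'PIO PIO PIO'

# Without parentheses, + applies only to the space right before it
no_group = re.search(r'PIO +', text).group()

# With parentheses, + repeats the whole 'PIO ' unit
with_group = re.search(r'(PIO )+', text).group()

print(repr(no_group))
print(repr(with_group))
```

The grouped version stops after the second `PIO `, since the third `PIO` is not followed by a space.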

For example: Find all the restaurants that start (`^`) with 8 or more repetitions (`{8,}`) of the `\w+ ` pattern (alphanumeric characters followed by space):

In [26]:
grep(r'^(\w+ ){8,}', uniquenames)

[43mTEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST [0mTEST
[43mWORLD ICE CAFE AT FLUSHING MEADOWS CORONA PARK AQUATIC [0mCENTER
[43mTAKE AWAY CAFE IN REBEKAH REHAB EXTENDED CARE [0mCENTER
[43mBARCLAYS UPPER SUITE STOLI BAR AND STORAGE ROOM [0m5C29.03
[43mMARLIN BAR AT TOMMY BAHAMA AND TOMMY BAHAMA RESTAURANT AND [0mB
[43mHOMEWOOD SUITES BY HILTON NEW YORK MIDTOWN MANHATTAN TIMES [0mSQUARE


#### In class exercises

What do these regular expressions match?

* b (cd)*
* h (d)+
* j? k+
* (cd){2,5}
* o(pre){3,}
* Panos|Ipeirotis

#### In class exercises (advanced)

Write down the regular expressions for the following:

* A telephone number (e.g., 212-555-0921)
* A zip+4 code (e.g., 10012-1809)
* A floating-point number (e.g., +12.34 or -1.457 or 1023.4568)
* A dollar amount with optional cents (e.g., \$0.33, \$784)
* A time of day (e.g., 12:15am, 3:34pm)
* URLs only of the form http://www.alphanumeric.com
* An email of the form username@domain (assume that the domain might be in the form alphanumeric.alphanumeric, or alphanumeric.alphanumeric.alphanumeric)



### Group references

Sometimes it is handy to be able to refer to a match that was made earlier in a regex. This is done with **backreferences**, which refer to groups. `\k` is the backreference specifier, where `k` is a number, which refers to the `k`-th regular expression *that was enclosed in parentheses*.

For example, find the names whose first three or more characters are repeated at the end of the line:


In [27]:
grep(r'^(.{3,}).*\1$', uniquenames)

[43mTEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST TEST[0m
[43mZOO BREWS-BRONX ZOO[0m
[43mARRIBA ARRIBA[0m
[43mPIO PIO[0m
[43mBARRACUDA BAR[0m
[43mKENNEDY FRIED CHICKEN[0m
[43mHARLEM BAR-B-Q, NABE HARLEM[0m
[43mSEBA SEBA[0m
[43mRUBY'S & LITTLE RUBY'S[0m
[43mETCETERA ETCETERA[0m
[43mBARCELONA BAR[0m
[43mLAS RAMBLAS[0m
[43mSEBA-SEBA[0m
[43mTETE-A-TETE[0m
[43mANTONIO'S RESTAURANT[0m
[43mLOS 3 POTRILLOS[0m
[43mPIO-PIO[0m
[43mONE AND ONE[0m
[43mCHOP CHOP[0m
[43mBOULUD SUD & EPICERIE BOULUD[0m
[43mCHEEBURGER CHEEBURGER[0m
[43mGONZALEZ Y GONZALEZ[0m
[43mSHIMBASHI SUSHI[0m
[43mBARCLAYS NORTH SUITE STOLI BAR[0m
[43mBARCLAYS 40, 40 CLUB BAR[0m
[43mLOS TRES POTRILLOS[0m
[43mYUMMY YUMMY[0m
[43mBERON BERON[0m
[43mPIL PIL[0m
[43mVIS-A-VIS[0m
[43mCHEN'S KITCHEN[0m
[43mBARRY'S BOOTCAMP FUEL BAR[0m
[43mMANGO MANGO[0m
[43mTUK TUK[0m
[43mCHA CHA MATCHA[0m
[43mYIA YIA[0m
[43mKENNEDY CHICKEN[0m


Or find all the restaurant names that consist of a sequence of letters repeated twice and nothing else:

In [28]:
grep(r'^([A-Z]+)\1$', uniquenames)

[43mNONO[0m
[43mMIMI[0m
[43mMEME[0m
[43mWIKIWIKI[0m
[43mMAKIMAKI[0m
[43mCOCO[0m
[43mMAMA[0m
[43mZIZI[0m


Or find the restaurants where the first character is the same as the last, and the second character is the same as the penultimate character.

In [35]:
grep(r'^(.)(.).*\2\1$', uniquenames)

Find all names that contain the same digit three times in a row:

In [29]:
grep(r'([0-9])\1\1', uniquenames)

GALLAGHER'S 2[43m000[0m
LEGENDS [43m000[0m
[43m888[0m KITCHEN
[43m333[0m LOUNGE
MEXICO 2[43m000[0m
GOOD TASTE [43m666[0m
CARVEL # 10[43m222[0m4
STARBUCKS #5[43m444[0m6
NEW HONG KONG 3[43m888[0m INC
SHAXIAN [43m888[0m
CHECKERS STORE #[43m333[0m2
CAFE 2[43m000[0m


As we are going to see, these backreferences will also be of tremendous use for extraction purposes.

In [30]:
#### Naming groups
# The group that follows the term "DOUBLE" is named "doublewhat". In Python we can refer to it
# as (?P=doublewhat) inside the pattern, or as \g<doublewhat> in a replacement string
grep(r'DOUBLE (?P<doublewhat>\w+)', uniquenames)


[43mDOUBLE DOWN[0m SALOON
[43mDOUBLE DRAGON[0m RESTAURANT
[43mDOUBLE WINDSOR[0m
NEW [43mDOUBLE CHINESE[0m RESTAURANT
[43mDOUBLE RAINBOW[0m
BEST [43mDOUBLE DRAGON[0m RESTAURANT
[43mDOUBLE DUTCH[0m ESPRESSO
[43mDOUBLE TOP[0m CHINA & TORTILLA TACO
[43mDOUBLE DRAGON[0m
[43mDOUBLE DRAGON[0m 88
[43mDOUBLE ZERO[0m
108 [43mDOUBLE CHINESE[0m RESTAURANT
[43mDOUBLE BEN[0m CAFE
[43mDOUBLE CRISPY[0m BAKERY I
A & A BAKE AND [43mDOUBLE SHOP[0m
[43mDOUBLE RED[0m INC
[43mDOUBLE TWISTER[0m ICE CREAM AND COFFEE SHOP
NEW [43mDOUBLE DRAGON[0m
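
The `grep` helper only highlights the full match, so the named group is not visible in the output above. As a sketch of how a named group can be used with Python's `re` module directly (the `repeated` pattern and the replacement text are illustrative assumptions):

```python
import re

m = re.search(r'DOUBLE (?P<doublewhat>\w+)', 'NEW DOUBLE DRAGON')
by_name = m.group('doublewhat')   # access the group by its name
by_number = m.group(1)            # the same group, by its position

# Inside a pattern, (?P=name) is the backreference to a named group
repeated = re.compile(r'^(?P<word>\w+) (?P=word)$')
chop_chop = bool(repeated.search('CHOP CHOP'))
chop_shop = bool(repeated.search('CHOP SHOP'))

# In a replacement string, \g<name> inserts the text of the named group
swapped = re.sub(r'DOUBLE (?P<doublewhat>\w+)', r'\g<doublewhat> x2', 'DOUBLE RAINBOW')

print(by_name, by_number, chop_chop, chop_shop, swapped)
```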


#### In class exercise (advanced)

Say that you have a file with telephone numbers written in a variety of forms: 

* 679-397-5255
* 2126660921
* 212-998-0902
* 888-888-2222
* 800-555-1211
* 800 555 1212
* 800.555.1213
* (800) 555-1214
* 1-800-555-1215
* 1(800)555-1216
* 800-555-1212-1234
* 800-555-1212x1234
* 800-555-1212 ext. 1234
* work 1-(800) 555.1212 #1234

The task is to standardize everything in the form (xxx)-xxx-xxxx.


To make the process interactive, go to http://regex101.com/?#python, copy and paste the numbers above in the textarea called "Text String", and then try to write the regular expression. (As a side note, the website provides excellent explanations about the meaning of the regular expression that you write down.) Remember to put the `"g"` character in the small textfield next to the regex: that `g` means "find all occurrences of the regex", not just the first one.

###  Exercise 2

Retrieve news articles using the NewsAPI. Identify articles that mention a dollar amount (i.e., the `$` symbol, followed by numbers)

### Additional Regex Resources

* [Visual Regular Expression Tester](http://www.debuggex.com/?flavor=pcre)
* [Test Python Regular Expressions Online](http://www.pyregex.com/)
* [Regular Expressions 101](http://regex101.com/?)
* [Python's re Library Official Documentation](http://docs.python.org/2/library/re.html)
* [Regular expression reference at W3schools](http://www.w3schools.com/jsref/jsref_obj_regexp.asp)
* [Parsing phone numbers using Python and regular expressions](http://www.diveintopython.net/regular_expressions/phone_numbers.html)

### More Advanced Regular Expressions

And the ones below get a little bit more advanced:

* `*?`, `+?`, `??`: ordinarily, `*`, `+` and `?` are **greedy**. This means they match the longest possible string that satisfies the regular expression. Adding a `?` after any of these makes it non-greedy, so it matches the shortest possible string instead.
* `(?: )`: A non-capturing group. This works just as `()`, but doesn’t hold on to the matched contents, so it cannot be used in a backreference.
* `(?<=x)`: A positive lookbehind. It matches at a position only if that position is immediately preceded by x, without including x in the match. In Python's `re` module, x must be a pattern that matches strings of a fixed length.
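
A short sketch of the three constructs, using Python's `re` module. The sample string is an assumption for illustration: it extends one name from the data with a second parenthesized part.

```python
import re

# 'STARBUCKS (STORE #50483)' is in the data; '(MIDTOWN)' is added for illustration
text = 'STARBUCKS (STORE #50483) (MIDTOWN)'

# Greedy: .* runs to the LAST closing parenthesis
greedy = re.search(r'\(.*\)', text).group()
# Non-greedy: .*? stops at the FIRST closing parenthesis
lazy = re.search(r'\(.*?\)', text).group()

# findall returns the captured groups, so the kind of group matters
capturing = re.findall(r'(STORE )?#(\d+)', text)
non_capturing = re.findall(r'(?:STORE )?#(\d+)', text)

# Lookbehind: digits that come right after '#', without matching the '#'
after_hash = re.findall(r'(?<=#)\d+', text)

print(greedy)
print(lazy)
print(capturing, non_capturing, after_hash)
```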
