Regular Expressions
-------------------

Regular expressions (regexes or re’s) constitute an extremely powerful, flexible and concise language for matching elements in text ranging from a few characters to complex patterns. While mastering the syntax of the regular expression language does require climbing a learning curve, this learning curve is not particularly steep, and a newcomer can find herself performing useful tasks with regular expressions almost immediately. Efforts spent learning regular expressions quickly pay off--tasks that are well suited for regular expressions abound. Indeed, regular expressions are one of the most useful computer skills, and an absolutely critical tool for data scientists. 

This document will present basic regular expression syntax and cover common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement. We will present examples using grep, which we covered previously. (In case you forgot, we used grep to find lines of a text file with a given string in them.) 

In [1]:
# The code below is written in Python to replicate the behavior of grep, the UNIX utility
# We will examine the details of how the code works in a subsequent notebook.
# For now, just execute the code, and use the function grep(regex_expression, file_name) as-is

import re

def printMatches(text, regex_expression):
    BACKGROUND_YELLOW = '\x1b[43m'
    COLOR_RESET  = "\x1b[0m"
    regex= re.compile(regex_expression)
    matches = regex.finditer(text)
    for m in matches:
        highlighted  = text[:m.start()] # the string before the regex match
        highlighted += BACKGROUND_YELLOW + text[m.start():m.end()] + COLOR_RESET 
        highlighted += text[m.end():] # the string after the regex match
        print(highlighted)

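def grep(regex_expression, file_name):
    # Read the whole file; the "with" block closes it automatically
    with open(file_name, "r") as f:
        content = f.read()
    # Highlight the matches one line at a time
    for line in content.split("\n"):
        printMatches(line, regex_expression)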

### NYC Restaurant Names Data

In this notebook, we will demonstrate the various regular expressions using the set of restaurant names from `/data/uniquenames.txt`.

Let's take a peek at the contents using the `head` and `tail` commands:

In [2]:
!head -10 /data/uniquenames.txt
!echo '........' # The "echo" command just prints in the output the string that follows the command (in this case "......")
!tail -10 /data/uniquenames.txt

#1 GARDEN CHINESE
#1 ME. NICK'S
#1 SABOR LATINO RESTAURANT
$1.25 PIZZA
''U'' LIKE CHINESE RESTAURANT
''W'' CAFE
'WICHCRAFT
(LEWIS DRUG STORE) LOCANDA VINI E OLII
(LIBRARY)  FOUR & TWENTY BLACKBIRDS
(PUBLIC FARE) 81ST STREET AND CENTRAL PARK WEST (DELACORTE THEATRE)
........
ZUCKER'S BAGELS AND SMOKED FISH
ZUM SCHNEIDER
ZUM STAMMTISCH
ZUMA JAPANESE RESTAURANT NEW YORK
ZUMBA RESTAURANT
ZUTTO
ZUZU RAMEN
ZYMI BAR & GRILL
ZZ CLAM BAR
ZZ'S PIZZA & GRILL


Now, let's see if there are any restaurants with the string 'PANO' in them:

In [3]:
grep('PANO', "/data/uniquenames.txt")

BUFFALO WILD WINGS,PEETS COOFEE &TEA, [43mPANO[0mPOLIS BAKERY & CAFE
CAFE ES[43mPANO[0mL
EL CHARRO ES[43mPANO[0mL
EL POTE ES[43mPANO[0mL
LA CANDELA ES[43mPANO[0mLA
PAM[43mPANO[0m
[43mPANO[0mRAMA OF MY SILENCE-HEART
[43mPANO[0mRAMA RESTAURANT
TIGIN IRISH PUB,PEETS COFFEE&TEA,[43mPANO[0mPOLIS BAKERY&CAFE


What can we do if we want to search for something more complex than a fixed string? Regular expressions solve exactly this problem. 

### The atoms

The simplest regular expressions are a sequence of `atoms`. An atom can be any of the following:
* a single character, 
* a dot,
* a bracket expression, 
* an anchor.

#### Single character atom

A single character atom matches itself.

#### The `.` character atom

A dot atom matches any single character (except for a new line character `\n`).

Example: Using single character atoms, and the `.` atom, let's find all restaurant names that contain the characters `AB`, followed by any character (`.`) and then the character `D`:

In [4]:
grep('AB.D', '/data/uniquenames.txt')

[43mABID[0mE BROOKLYN PITA
JJ PE[43mABOD[0mY'S
L[43mABAD[0mEE MANOIR
NEW KAB[43mAB D[0mINER
RESTAURANT [43mABID[0mJAN


#### Bracket expression atom

A bracket expression (defined by square brackets []) defines a set of characters. It matches exactly one character, which can be any of the characters in the set. Example: [ABL] matches either A, B, or L.

Now, let's use a bracket expression: We want to find restaurants that contain one of the letters A,B,C,X,Y,Z followed by two digits. We specify the set of letters as `[ABCXYZ]` and the set of digits as `[0123456789]`.  

In [5]:
grep('[ABCXYZ][0123456789][0123456789]', '/data/uniquenames.txt')

[43mB66[0m CLUB
B[43mA10[0m02 BAR
B[43mA10[0m19 BAR
B[43mA61[0m10 BAR
B[43mC81[0m40 BAR AT THE GARDEN
C[43mB80[0m30 SAUSAGE CONCESSION
CIBO MARKET (GATE [43mC65[0m)
COTTO MARKET-GATE [43mC30[0m
F[43mA80[0m70 HOT DOG CONCESSION
F[43mB10[0m14 HOT DOG CONCESSION
F[43mB80[0m20 PIZZA CONCESSION
F[43mB90[0m90 HOT DOG CONCESSION
F[43mB91[0m10 HOT DOG CONCESSION
F[43mB91[0m20 HOT DOG CONCESSION
HOT DOG CONCESSION [43mA80[0m3-1
JFK FUEL BAR [43mB27[0m
MADISON CLUB (B[43mB71[0m84)
RUNWA[43mY69[0m
YOGURT [43mY23[0m INC


##### Brackets and ranges

Instead of typing long lists of characters in a bracket expression, we can use the range character (-): [0-9] is equivalent to [0123456789]. Similarly [A-Z] is equivalent to [ABCDEFGHIJKLMNOPQRSTUVWXYZ]. And [D-T] is equivalent to [DEFGHIJKLMNOPQRST]. (You get the idea.) You can also combine multiple ranges: [a-e1-9] is equivalent to [abcde123456789]. Finally, you can specify characters to be excluded from the set by placing the character (^) right after the opening bracket. For example, [^0-9] matches any character other than a digit.
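
The equivalences above can be checked directly with Python's `re` module. This is a minimal sketch; the sample string is made up for illustration and does not come from the restaurant file.

```python
import re

# Made-up sample text, used only to illustrate the bracket expressions
sample = "Gate C30, Room b7, Ext. 2-9"

# A range is shorthand for listing every character in the set
digits_long = re.findall(r'[0123456789]', sample)
digits_range = re.findall(r'[0-9]', sample)

# Two ranges combined in one set: lowercase a-e or digits 1-9
combined = re.findall(r'[a-e1-9]', sample)

# A negated set: every character that is not a digit
non_digits = re.findall(r'[^0-9]', sample)

print(digits_long == digits_range)
print(combined)
print("".join(non_digits))
```

Every character of the sample lands in exactly one of `digits_range` and `non_digits`, which is a quick way to convince yourself that `[^0-9]` is the complement of `[0-9]`.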

For example, let's find restaurants that contain a letter, followed by a digit, and then followed by a character that is not a digit:

In [6]:
grep('[A-Z][0-9][^0-9]', '/data/uniquenames.txt')

[43mA1 [0mOCHA SUSHI
A[43mH2 [0mICE TEA
[43mB4 [0mNYC
B[43mT3 [0mBAR
B[43mT4 [0mBAR
[43mC2 [0mCAFE
CAF[43mE1 [0m& CAFE 4 (AMERICAN MUSEUM OF NATURAL HISTORY)
[43mF1 [0mLOUNGE AND GRILL
ILLY/VELOCITY BAR (E[43mC2)[0m
[43mJ4 [0mHOOKAH LOUNGE
JUIC[43mE4U[0m
[43mM1-[0m5
[43mM2M[0m MART
[43mM2N[0m BUFFET
NINET[43mY9 [0m& UP DINER
N[43mO1 [0mCHINESE RESTAURANT
[43mQ2 [0mTHAI RESTAURANT
[43mT2 [0m- GO
TERMINA[43mL1 [0mEMPLOYEE CAFETERIA
THE NEW YORK PALACE HOTEL ([43mC1 [0mLEVEL CAFETERIA)
TW[43mO8T[0mWO BAR & BURGER
US FRIED CHICKEN & [43mP1Z[0mZA


Hm, we do not want to get results that have a space after the number, so let's also exclude the space character:

In [7]:
grep('[A-Z][0-9][^0-9 ]', '/data/uniquenames.txt') 

ILLY/VELOCITY BAR (E[43mC2)[0m
JUIC[43mE4U[0m
[43mM1-[0m5
[43mM2M[0m MART
[43mM2N[0m BUFFET
TW[43mO8T[0mWO BAR & BURGER
US FRIED CHICKEN & [43mP1Z[0mZA


In [8]:
# A digit, then a character that is not an uppercase letter, a digit, or a space, then a digit
grep('[0-9][^A-Z0-9 ][0-9]', '/data/uniquenames.txt') 

$[43m1.2[0m5 PIZZA
[43m1.5[0m GALBI CORP
10[43m4-0[0m1 FOSTER AVENUE COFFEE SHOP(UPS)
3[43m6-0[0m2 DITMARS COFFEE CORP.
4[43m0/4[0m0 CLUB
4[43m0/4[0m0 CLUB BAR
44 [43m1/2[0m CAFE
83 [43m1/2[0m
BRASSERIE 8 [43m1/2[0m
FOOD DEPOT 1[43m2*4[0m
HOT DOG CONCESSION A80[43m3-1[0m
M[43m1-5[0m
PRB 2[43m4-7[0m
THE BEST $[43m1.0[0m0 PIZZA


In [9]:
# Restaurants with five digits
grep('[0-9][0-9][0-9][0-9][0-9]', '/data/uniquenames.txt') 

CAFE [43m11231[0m
COFFEE [43m11238[0m
MCDONALDS (#[43m11542[0m)
MCDONALDS [43m17754[0m
PIZZA HUT  # [43m29782[0m
PIZZA HUT #[43m29773[0m
PIZZA HUT [43m29531[0m
PIZZA HUT# [43m28256[0m
STARBUCKS # [43m14840[0m
STARBUCKS (STORE [43m16628[0m)
STARBUCKS [43m22420[0m
STARBUCKS COFFEE  #[43m16608[0m
STARBUCKS COFFEE # [43m15440[0m
STARBUCKS COFFEE #[43m14240[0m
STARBUCKS COFFEE #[43m18509[0m
STARBUCKS COFFEE #[43m20679[0m
STARBUCKS COFFEE #[43m21514[0m
STARBUCKS COFFEE #[43m22596[0m
STARBUCKS COFFEE #[43m23266[0m
STARBUCKS COFFEE #[43m23267[0m
STARBUCKS COFFEE (#[43m19890[0m)
STARBUCKS COFFEE (STORE #[43m13539[0m)
STARBUCKS COFFEE (STORE [43m17478[0m)
STARBUCKS COFFEE (STORE#[43m11650[0m)
STARBUCKS COFFEE (STORE#[43m20161[0m)
STARBUCKS COFFEE COMPANY #[43m22560[0m
SUBWAY (STORE #[43m27610[0m)
SUBWAY (STORE #[43m38550[0m)
SUBWAY STORE [43m46555[0m
SUBWAY#[43m50497[0m (CARDINAL HAYES HIGH SCHOOL)
TEAVANA #[43m22994[0m
TEAVANA#[43m2

#### Anchor

Anchor atoms are used to define the location of a regex within a line. 

The anchor `^` specifies the *beginning of a line*, the anchor `$` specifies the end of a line. The anchor `\b` specifies the word boundary.

Example: Find restaurant names that start with the characters `BAL`

In [10]:
grep('^BAL', '/data/uniquenames.txt')

[43mBAL[0mABOOSTA
[43mBAL[0mADE
[43mBAL[0mBOA RESTAURANT.
[43mBAL[0mCON QUITENO RESTAURANT
[43mBAL[0mDOR SPECIALTY FOODS
[43mBAL[0mDUCCI'S
[43mBAL[0mI NUSA INDONESIAN RESTAURANT
[43mBAL[0mILO DELI
[43mBAL[0mIMAYA RESTAURANT
[43mBAL[0mKANIKA
[43mBAL[0mKH SHISH KABAB HOUSE
[43mBAL[0mL PARK HOT DOG
[43mBAL[0mLARO
[43mBAL[0mLATO'S RESTAURANT
[43mBAL[0mLFIELDS CAFE
[43mBAL[0mLI DELI & SALAD BAR
[43mBAL[0mLY TOTAL FITNESS
[43mBAL[0mLY'S SPORT CLUB
[43mBAL[0mNDIE'S PLACE, INC
[43mBAL[0mON
[43mBAL[0mTHAZAR BAKERY
[43mBAL[0mTHAZAR RESTAURANT
[43mBAL[0mUCHI
[43mBAL[0mUCHI'S
[43mBAL[0mUCHI'S FRESH
[43mBAL[0mUCHI'S INDIAN FOOD
[43mBAL[0mVANERA
[43mBAL[0mZEM


Example: Find restaurant names that end with the characters `NORTH`

In [11]:
grep('NORTH$', '/data/uniquenames.txt')

AQUEDUCT [43mNORTH[0m
BOURGEOIS PIG [43mNORTH[0m
PRATT INSTITUTE [43mNORTH[0m


In [12]:
# All restaurants that end with 4 digits
grep('[0-9][0-9][0-9][0-9]$', '/data/uniquenames.txt')

CAFE 1[43m1231[0m
CAFE [43m1853[0m
CANTINA [43m1436[0m
CBRE-[43m1540[0m
CHIPOTLE MEXICAN GRILL # [43m2135[0m
CHIPOTLE MEXICAN GRILL #[43m1394[0m
CHIPOTLE MEXICAN GRILL #[43m1962[0m
CHIPOTLE MEXICAN GRILL #[43m1968[0m
CHIPOTLE MEXICAN GRILL #[43m2090[0m
CHIPOTLE MEXICAN GRILL #[43m2123[0m
CHIPOTLE MEXICAN GRILL#[43m1766[0m
COFFEE 1[43m1238[0m
DOMINO'S PIZZA #[43m3647[0m
DOMINO'S PIZZA [43m3537[0m
DOMINO'S PIZZA [43m3657[0m
DOMINOS PIZZA # [43m3448[0m
EMPIRE RESTAURANT OF [43m1635[0m
GALLAGHER'S [43m2000[0m
JACQUES [43m1534[0m
KAFFE [43m1668[0m
LABETTI'S POST # [43m2159[0m
LONGHORN STEAKHOUSE #[43m5453[0m
MCDONALD'S RESTAURANT #[43m3391[0m
MCDONALDS 1[43m7754[0m
MIDTOWN [43m1015[0m
OUTBACK STEAKHOUSE [43m3330[0m
OUTBACK STEAKHOUSE [43m3332[0m
PANDA EXPRESS #[43m2634[0m
PANDA RESTAURANT [43m2807[0m
PETER'S SINCE [43m1969[0m
PIZZA HUT  # 2[43m9782[0m
PIZZA HUT #2[43m9773[0m
PIZZA HUT 2[43m9531[0m
PIZZA HUT# 2[43m8256[0m
RE

Example: Let's try to find restaurants containing the word `COLUMBIA`:

In [14]:
# Notice that adding space is not sufficient
grep(' COLUMBIA ', '/data/uniquenames.txt')

THE SCHOOL AT[43m COLUMBIA [0mUNIVERSITY


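Hm, something is wrong. Padding the word with spaces misses the names in which COLUMBIA sits at the start or end of the name or next to punctuation, while searching for the bare string `COLUMBIA` would also return COLUMBIANO, COLUMBIANAS, and other words. We want only the word COLUMBIA, wherever it appears, so we add the word anchors: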

In [15]:
# The r'....' is a "raw" string, and allows us to enter
# backslash without having to "escape" the backslash.
# Otherwise Python will interpret \b as a single special
# character, and not as two characters \b that are part of the regex
grep(r'\bCOLUMBIA\b', '/data/uniquenames.txt')

BROWNIE'S CAFE AT [43mCOLUMBIA[0m
CAFE 212/[43mCOLUMBIA[0m CATERING KITCHEN - ALFRED LERNER HALL
[43mCOLUMBIA[0m UNIVERSITY MEDICAL CENTER BOOKSTORE CAFE
THE FACULTY CLUB ([43mCOLUMBIA[0m UNIVERSITY)
THE SCHOOL AT [43mCOLUMBIA[0m UNIVERSITY
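
The difference between the three ways of searching for COLUMBIA can be reproduced on a handful of names. This is a small sketch: the first three names are taken from the output above, while `CAFE COLUMBIANO` is a hypothetical name added only to show the over-matching case.

```python
import re

names = [
    "BROWNIE'S CAFE AT COLUMBIA",              # word at the end of the name
    "THE SCHOOL AT COLUMBIA UNIVERSITY",       # word surrounded by spaces
    "THE FACULTY CLUB (COLUMBIA UNIVERSITY)",  # word right after a parenthesis
    "CAFE COLUMBIANO",                         # hypothetical: COLUMBIA inside a longer word
]

patterns = {
    "plain": r'COLUMBIA',
    "spaces": r' COLUMBIA ',
    "word": r'\bCOLUMBIA\b',
}

# For each pattern, keep the names in which it finds a match
results = {label: [n for n in names if re.search(p, n)]
           for label, p in patterns.items()}

for label, matched in results.items():
    print(label, len(matched), matched)
```

The plain string matches too much, the space-padded string matches too little, and only the `\b` version keeps exactly the names that contain COLUMBIA as a separate word.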


#### In class exercises

Write a regular expression for:

* Match any character
* Match the end of line
* Match any digit
* Find all characters that are not digits
* Find all words with four letters
* Find every line that starts with a digit
* Find all empty lines
* Find all lines with 4 characters


### Regular Expressions: Operators

#### Alternation |

The alternation operator `|` separates two or more alternative regular expressions; a string matches if it matches at least one of the alternatives. 

For example, if we are looking for names that contain the word `GREEK`, the word `RUSSIAN`, or the word `FRENCH`, we issue the following command: 

In [16]:
grep('GREEK|RUSSIAN|FRENCH', '/data/uniquenames.txt')

ANTHI'S [43mGREEK[0m FOOD
AVLEE  [43mGREEK[0m KITCHEN
AVLEE [43mGREEK[0m KITCHEN
AVLI THE LITTLE [43mGREEK[0m TAVERN
BREEZE THAI-[43mFRENCH[0m KITCHEN
BY SUZETTE [43mFRENCH[0m CREPES
DIRTY [43mFRENCH[0m
ETHOS [43mGREEK[0m CUISINE
[43mFRENCH[0m CAFE GOURMAND
[43mFRENCH[0m DINER
[43mFRENCH[0m LOUIE
[43mFRENCH[0m ROAST
[43mGREEK[0m EXPRESS
[43mGREEK[0m FAMILY KITCHEN
[43mGREEK[0m GARDENS GRILL
[43mGREEK[0m GRILL
[43mGREEK[0m ISLANDS
GRK FRESH [43mGREEK[0m
GYRO [43mGREEK[0m STYLE
JEAN CLAUDE [43mFRENCH[0m BISTRO
JEAN DANET [43mFRENCH[0m PASTRY
JENNY [43mFRENCH[0m TOAST COFFEE SHOP RESTAURANT
MEDITERRANEAN GRILL [43mGREEK[0m TARVERNA
OKEANOS [43mGREEK[0m SEAFOOD
OPA! [43mGREEK[0m RESTAURANT
PIZZA AND [43mFRENCH[0m TASTE PIZZERIA
RAFINA [43mGREEK[0m CUISINE
[43mRUSSIAN[0m BATHS
[43mRUSSIAN[0m SAMOVAR
[43mRUSSIAN[0m TURKISH BATHS
SOMETHING [43mGREEK[0m
SYMPOSIUM [43mGREEK[0m RESTAURANT
THE [43mGREEK[0m
THE [43mGREEK[0m CORNER

#### Repetition {m,n}

A repetition operator specifies that the atom or expression immediately before it may be repeated: `{m,n}` means at least `m` and at most `n` times. For example, if we are looking for restaurants that contain the letter I two to three times in a row:  

In [21]:
grep('I{2,3}', '/data/uniquenames.txt')

(LEWIS DRUG STORE) LOCANDA VINI E OL[43mII[0m
A. KAWA[43mII[0m JAPANESE RESTAURANT
ACQUISTA FOOD SERVICE [43mII[0m INC.
ALDO'S [43mII[0m PIZZA AND RESTAURANT
AMBROSINO'S [43mII[0m
AMICI [43mII[0m
ANTIQUE CAFE & BAKERY [43mIII[0m INC
AROME CAFE [43mII[0m
ASHIYA [43mII[0m SUSHI
AVENUE PIZZA [43mII[0m
AZOGUENITA BAKERY & RESTAURANT [43mIII[0m
B & B RESTAURANT [43mII[0m
BAGEL EXPRESS [43mIII[0m
BAMBINO PIZZA [43mII[0m
BARZOLA'S RESTAURANT [43mIII[0m
BEET [43mII[0m
BONA [43mII[0m PIZZA
BREAD BROTHERS [43mIII[0m
BROOKLYN PIZZA [43mII[0m
BROTHERS PIZZA [43mII[0m
C & J [43mII[0m JAMAICAN RESTAURANT & BAKERY
CAFE BORGIA [43mII[0m
CAFE CON PAN BAKERY [43mII[0m
CAFE RUSTICO [43mII[0m
CAPRI [43mII[0m PIZZA
CARROT TOP PASTRIES [43mII[0m
CESTRA'S PIZZA [43mIII[0m
CESTRAS [43mII[0m PIZZA
CHINA GARDEN [43mII[0m
CHINA WOK [43mII[0m
CIRCLE LINE X[43mII[0m
CIRCLE LINE XV[43mII[0m
CLAUDIO'S CAFE [43mII[0m
CLIPPERS [43mII[0m
COLOMBIA FAMA 

Now, let's find all the restaurants that have a name length from 50 to 55 characters:

In [24]:
grep('^.{50,55}$', '/data/uniquenames.txt')

[43mBRASSIERIE 1605/BROADWAY 49 BAR & LOUNGE (MAIN KITCHEN)[0m
[43mBROOKLYN CHILDREN'S MUSEUM CAFE/FOREST CITY RATNER CAFE[0m
[43mCAFE 212/COLUMBIA CATERING KITCHEN - ALFRED LERNER HALL[0m
[43mCAFE1 & CAFE 4 (AMERICAN MUSEUM OF NATURAL HISTORY)[0m
[43mCARIBBEAN CONNECTION CATERING SERVICES INC RESTAURANT[0m
[43mCHARTWELLS AT COLLEGE OF MOUNT ST. VINCENT-BENEDICT[0m
[43mCOURTYARD & RESIDENCE INN BY MARRIOTT CENTRAL PARK[0m
[43mFORDHAM UNIVERSITY/MCGINLEY CENTER/RAMSKELLER KITCHEN[0m
[43mGREEN AND ACKERMAN KOSHER DAIRY RESTAURANT & PIZZA[0m
[43mHOMESTYLE FOOD SERVICES (ST. BARNABAS HIGH SCHOOL)[0m
[43mLOBBY LOUNGE AND TROUBLE'S TRUST @ THE PALACE HOTEL[0m
[43mNATURAL TOFU & NOODLES RESTAURANT (BOOK CHANG DONG)[0m
[43mNEW YORK BOTANICAL GARDENS TERRACE CAFE ( GARDEN CAFE )[0m
[43mNEW YORK UNIVERSITY - KIMMEL STUDENT CENTER CAFETERIA[0m
[43mPYRAMID COFFEE COMPANY HOSPITAL FOR SPECIAL SURGERY[0m
[43mQ.B.COMM.COLLEGE-MAIN KITCHEN/TIGER BITES PIZZA SECTION[0m


In the repetition operator {m,n}, we can skip putting the upper limit if we want to say, "anything with m matches and above". For example, let's find all the restaurants that have a name length 60 characters and above:

In [25]:
grep('^.{60,}$', '/data/uniquenames.txt')

[43m(PUBLIC FARE) 81ST STREET AND CENTRAL PARK WEST (DELACORTE THEATRE)[0m
[43mBUFFALO WILD WINGS,PEETS COOFEE &TEA, PANOPOLIS BAKERY & CAFE[0m
[43mCENTER PLATE- CONCOURSE CAFE-JACOB K JAVITS CONVENTION CENTER[0m
[43mCENTERPLATE-EMPLOYEE CAFETERIA-JACOB K JAVITS CONVENTION CENTER[0m
[43mCENTRA`L MARKET ALL AMERICAN GRILL ( STATEN ISLAND FERRY TERMINAL)[0m
[43mDELTA SKY CLUB (BARTENDER SERVICE TERMINAL D DELTA DEPARTURE)[0m
[43mDUNKIN DONUTS (INSIDE GULF GAS STATION ON NORTH SIDE OF MAJ. DEEGAN EXWY- AFTER EXIT 13 - 233 ST.)[0m
[43mFASHION INSTITUTE OF TECHNOLOGY DAVID DUBINSKY STUDENT CENTER[0m
[43mGREATER NEW YORK SOCIAL AND HEALTH ADULT DAY CARE CENTER LLC[0m
[43mHOMEWOOD SUITES BY HILTON NEW YORK MIDTOWN MANHATTAN TIMES SQUARE[0m
[43mHONG KONG CAFE / FRESH SANDWICH BAKERY (BASEMENT FOOD COURT RESTAURANT & 1ST FL BAKERY)[0m
[43mMARLIN BAR AT TOMMY BAHAMA AND TOMMY BAHAMA RESTAURANT AND B[0m
[43mNEW WAI LING CHINESE RESTAURANT/NEW FRESCO TORTILLAS II TACO[0m


##### Repetition shortcuts (very common!): 

* `* = {0,}`. The `*` character means match the previous atom zero or more times
* `+ = {1,}`. The `+` character means match the previous atom one or more times
* `? = {0,1}`. The `?` character means match the previous atom zero or one times
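
A quick way to convince yourself of these equivalences is to compare each shortcut with its `{m,n}` form. The sketch below uses `re.fullmatch`, which succeeds only if the entire string matches; the candidate strings and the helper name `full_matches` are made up for illustration.

```python
import re

# Strings with zero, one, two, and three copies of the letter A
candidates = ["", "A", "AA", "AAA"]

def full_matches(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

for shortcut, explicit in [("A*", "A{0,}"), ("A+", "A{1,}"), ("A?", "A{0,1}")]:
    same = full_matches(shortcut) == full_matches(explicit)
    print(shortcut, full_matches(shortcut), "same as", explicit, ":", same)
```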






Find all restaurants that start with one or more digits, followed by a space.

In [26]:
grep('^[0-9]+ ', '/data/uniquenames.txt')

[43m002 [0mMERCURY TACOS LLC
[43m1 [0m2 3 BURGER SHOT BEER
[43m1 [0mBANANA QUEEN
[43m1 [0mBUEN SABOR
[43m1 [0mDARBAR
[43m1 [0mEAST 66TH STREET KITCHEN
[43m1 [0mOAK
[43m1 [0mOR 8
[43m1 [0mSTOP PATTY SHOP
[43m10 [0mDEVOE
[43m10 [0mPOINTS KTV
[43m100 [0mFUN
[43m1001 [0mNIGHTS
[43m1001 [0mNIGHTS CAFE
[43m1005 [0mCATERING
[43m101 [0mCAFE
[43m101 [0mDELI
[43m101 [0mRESTAURANT AND BAR
[43m102 [0mNOODLES TOWN RESTAURANT
[43m1020 [0mBAR
[43m1028 [0mBAR & RESTAURANT EL SALVADORENO 
[43m1061 [0mCATERING LLC
[43m107 [0mWEST RESTAURANT
[43m108 [0mFAST FOOD CORP
[43m108 [0mLOUNGE - CLUB 108
[43m1081 [0mFULTON
[43m11 [0mSTREET CAFE
[43m111 [0mRESTAURANT
[43m1174 [0mFULTON CUISINE, HALAL FOOD
[43m12 [0mCHAIRS
[43m12 [0mCHAIRS CAFE
[43m12 [0mCORAZONES RESTAURANT & BAR
[43m12 [0mCORNERS
[43m12 [0mCORNERS COFFEE INC
[43m12 [0mSTREET ALE HOUSE
[43m120 [0mBAY CAFE
[43m1200 [0mMILES
[43m121 [0mFULTON STREET
[43m123 [0mNIKKO
[43m1

Find all restaurants that start with a letter, followed by one or more digits, followed by a space.

In [27]:
grep('^[A-Z][0-9]+ ', '/data/uniquenames.txt')

[43mA1 [0mOCHA SUSHI
[43mB4 [0mNYC
[43mB66 [0mCLUB
[43mC2 [0mCAFE
[43mF1 [0mLOUNGE AND GRILL
[43mH20 [0mLOUNGE AND RESTAURANT
[43mJ4 [0mHOOKAH LOUNGE
[43mQ2 [0mTHAI RESTAURANT
[43mT2 [0m- GO
[43mT49 [0mCAFE


In [28]:
# Find all restaurants
# beginning with one or more letters // ^[A-Z]+
# followed by one or more digits // [0-9]+
# followed by any number of characters // .*
# and ending with BAR  // BAR$
grep('^[A-Z]+[0-9]+.*BAR$', '/data/uniquenames.txt')

[43mBA1002 BAR[0m
[43mBA1019 BAR[0m
[43mBA6110 BAR[0m
[43mBT3 BAR[0m
[43mBT4 BAR[0m


Find all restaurants that contain the word STARBUCKS, followed by any number of characters and then zero or more digits. (Since `[0-9]*` also accepts zero digits, names without any digit match too.)

In [34]:
grep('STARBUCKS.*[0-9]*', '/data/uniquenames.txt')

@ THE SQUARE([43mSTARBUCKS)[0m
D'ANGELO CENT/[43mSTARBUCKS COFFEE[0m
HERALD SQUARE CAFE ([43mSTARBUCKS)[0m
HUNTER COLLEGE PROUDLY BREW [43mSTARBUCKS[0m
JOSE O SHEA'S/[43mSTARBUCKS[0m
PRATT DESIGN CENTER/[43mSTARBUCKS[0m
[43mSTARBUCKS[0m
[43mSTARBUCKS # 14840[0m
[43mSTARBUCKS (JFK TERMINAL 5-POST SECURITY DEPARTURE)[0m
[43mSTARBUCKS (STORE 16628)[0m
[43mSTARBUCKS 22420[0m
[43mSTARBUCKS COFFEE[0m
[43mSTARBUCKS COFFEE  #16608[0m
[43mSTARBUCKS COFFEE # 15440[0m
[43mSTARBUCKS COFFEE # 7463[0m
[43mSTARBUCKS COFFEE # 7540[0m
[43mSTARBUCKS COFFEE #14240[0m
[43mSTARBUCKS COFFEE #18509[0m
[43mSTARBUCKS COFFEE #20679[0m
[43mSTARBUCKS COFFEE #21514[0m
[43mSTARBUCKS COFFEE #22596[0m
[43mSTARBUCKS COFFEE #23266[0m
[43mSTARBUCKS COFFEE #23267[0m
[43mSTARBUCKS COFFEE #3438[0m
[43mSTARBUCKS COFFEE #7344[0m
[43mSTARBUCKS COFFEE #7358[0m
[43mSTARBUCKS COFFEE #7416[0m
[43mSTARBUCKS COFFEE #7682[0m
[43mSTARBUCKS COFFEE #7826[0m
[43mSTARBUCKS COFFEE
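
Names without a digit show up in this output because `*` allows zero repetitions. The sketch below compares the pattern as written with a stricter variant that anchors the name at STARBUCKS and requires at least one digit. The stricter pattern is an illustrative assumption, not part of the original notebook; the three names are taken from the output above.

```python
import re

names = [
    "STARBUCKS",
    "STARBUCKS COFFEE #7344",
    "HERALD SQUARE CAFE (STARBUCKS)",
]

# As written: no ^ anchor, and [0-9]* is satisfied by zero digits
as_written = [n for n in names if re.search(r'STARBUCKS.*[0-9]*', n)]

# Stricter variant: the name must start with STARBUCKS and contain a digit later on
stricter = [n for n in names if re.search(r'^STARBUCKS.*[0-9]', n)]

print(as_written)
print(stricter)
```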

#### Grouping ()

With the grouping operator, a sequence of characters enclosed in parentheses is treated as a single unit: the operator that follows applies to the whole group, not only to the character immediately before it. For example, find all restaurant names that contain BA two times in a row:

In [38]:
grep('(BA){2}', '/data/uniquenames.txt')

ALI [43mBABA[0m
ALI [43mBABA[0m RESTAURANT
ALI [43mBABA[0m'S
ALI[43mBABA[0m
ALI[43mBABA[0m EXPRESS
ALI[43mBABA[0m GRILL
[43mBABA[0m COOL
[43mBABA[0m GHANOUGE
[43mBABA[0m'S PIEROGIES
[43mBABA[0mGHANOUSH
[43mBABA[0mLU
SA[43mBABA[0m LOUNGE
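
To see what the parentheses change, compare `(BA){2}` with `BA{2}`: without the group, `{2}` applies only to the `A`. This is a small sketch; `BAAL CAFE` is a hypothetical name made up for the comparison, while the other names appear in the outputs of this notebook.

```python
import re

names = ["ALI BABA", "SABABA LOUNGE", "BAAL CAFE", "BALTHAZAR BAKERY"]

# With the group: the two-character unit BA, twice in a row (BABA)
grouped = [n for n in names if re.search(r'(BA){2}', n)]

# Without the group: a B followed by two As (BAA)
ungrouped = [n for n in names if re.search(r'BA{2}', n)]

print(grouped)
print(ungrouped)
```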


#### In class exercises

What do these regular expressions match?

* b (cd)*
* h (d)+
* j? k+
* (cd){2,5}
* o(pre){3,}
* Panos|Ipeirotis

#### In class exercises (advanced)

Write down the regular expressions for the following:

* A telephone number (e.g., 212-555-0921)
* A zip+4 code (e.g., 10012-1809)
* A floating-point number (e.g., +12.34 or -1.457 or 1023.4568)
* A dollar amount with optional cents (e.g., \$0.33, \$784)
* A time of day (e.g., 12:15am, 3:34pm)
* URLs only of the form http://www.alphanumeric.com
* An email of the form username@domain (assume that the domain might be in the form alphanumeric.alphanumeric, or alphanumeric.alphanumeric.alphanumeric)



In [36]:
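# Telephone number
grep(r'\d{3}-\d{3}-\d{4}', '/data/uniquenames.txt')
# Telephone number whose area code does not start with 0 or 1
grep(r'[2-9]\d{2}-\d{3}-\d{4}', '/data/uniquenames.txt')

# Time of day (the leading zero of the hour is optional)
time_regex = r'(0?[1-9]|1[0-2]):[0-5][0-9](am|pm)'

# Zip+4 code
zip_regex = r'\d{5}-\d{4}'

# URL with exactly three dot-separated parts
url_regex = r'http://\w+\.\w+\.\w+'
# URL with one or more "name." groups followed by a final alphabetic part
url_regex_general = r'http://(\w+\.)+[A-Za-z]+'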
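
One way to check candidate answers is to run them against the examples given in the exercise with `re.fullmatch`, which succeeds only when the whole string matches. The patterns below are one possible set of answers, not the only correct ones; the email addresses are hypothetical sample values, and the helper name `all_match` is made up for this sketch.

```python
import re

# One candidate pattern per exercise item, paired with the examples from the list
checks = {
    "telephone": (r'\d{3}-\d{3}-\d{4}', ["212-555-0921"]),
    "zip+4": (r'\d{5}-\d{4}', ["10012-1809"]),
    "float": (r'[+-]?\d+\.\d+', ["+12.34", "-1.457", "1023.4568"]),
    "dollar": (r'\$\d+(\.\d{2})?', ["$0.33", "$784"]),
    "time": (r'(0?[1-9]|1[0-2]):[0-5][0-9](am|pm)', ["12:15am", "3:34pm"]),
    "url": (r'http://www\.[A-Za-z0-9]+\.com', ["http://www.alphanumeric.com"]),
    # Hypothetical addresses: a two-part and a three-part domain
    "email": (r'\w+@[A-Za-z0-9]+(\.[A-Za-z0-9]+){1,2}',
              ["user@example.com", "user@mail.example.com"]),
}

def all_match(pattern, examples):
    """True if the pattern matches every example in its entirety."""
    return all(re.fullmatch(pattern, e) for e in examples)

for name, (pattern, examples) in checks.items():
    print(name, all_match(pattern, examples))
```

Checking a few strings that should be rejected (for instance an hour above 12) is just as useful as checking the ones that should be accepted.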

### Backreferences

Sometimes it is handy to be able to refer to a match that was made earlier in a regex. This is done with backreferences. `\k` is the backreference specifier, where `k` is a number that refers to the text matched by the `k`-th group *enclosed in parentheses*.

For example, find the lines whose first three (or more) characters are repeated at the end of the line:


In [None]:
grep(r'^(.{3,}).*\1$', '/data/uniquenames.txt')

Or find all the restaurant names that consist of a sequence of letters repeated twice:

In [None]:
grep(r'^([A-Z]+)\1$', '/data/uniquenames.txt')

Find all names that contain the same digit three times in a row:

In [None]:
grep(r'([0-9])\1\1', '/data/uniquenames.txt')

As we are going to see, these backreferences will also be of tremendous use for extraction purposes.
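
Since the three cells above are shown without output, here is a self-contained sketch of the same patterns applied to a few short strings. `BAR FOO BAR`, `BABA` and `BABAS` are made-up test strings; the other names appear in the outputs of this notebook. The last lines use `re.sub`, where `\1` and `\2` in the replacement string refer to the captured groups.

```python
import re

# The first three (or more) characters reappear at the end of the line
same_ends = re.compile(r'^(.{3,}).*\1$')
# The whole line is one run of letters repeated twice
doubled = re.compile(r'^([A-Z]+)\1$')
# The same digit three times in a row
triple_digit = re.compile(r'([0-9])\1\1')

print(same_ends.search("BAR FOO BAR").group(1))
print(bool(doubled.search("BABA")), bool(doubled.search("BABAS")))
print(bool(triple_digit.search("111 RESTAURANT")),
      bool(triple_digit.search("1001 NIGHTS")))

# In a replacement string, backreferences reuse the captured text
swapped = re.sub(r'^(\w+) (\w+)$', r'\2 \1', "PIZZA HUT")
print(swapped)
```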

#### In class exercise (advanced)

Say that you have a file with telephone numbers written in a variety of forms: 

* 679-397-5255
* 2126660921
* 212-998-0902
* 888-888-2222
* 800-555-1211
* 800 555 1212
* 800.555.1213
* (800) 555-1214
* 1-800-555-1215
* 1(800)555-1216
* 800-555-1212-1234
* 800-555-1212x1234
* 800-555-1212 ext. 1234
* work 1-(800) 555.1212 #1234

The task is to standardize everything in the form (xxx)-xxx-xxxx.


To make the process interactive, go to http://regex101.com/?#python, copy and paste the numbers above in the textarea called "Text String", and then try to write the regular expression for the task above. (As a side note, the website provides excellent explanations about the meaning of the regular expression that you write down.) Remember to put the "g" character in the small textfield next to the regex: this has the same meaning as in sed, and it means "find globally" the regex, not just the first occurrence.


If you manage to deal with that task, consider the case of also having international country calling codes (e.g., +1 for US, +44 for UK, +7 for Russia, +30 for Greece, +354 for Iceland etc), and also standardizing the extensions.
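
As a possible starting point in Python, the sketch below captures the three digit groups and rewrites them with `re.sub`, producing output shaped like `(679)-397-5255`. It is deliberately partial: the pattern and the helper name `standardize` are assumptions made for illustration, and it does not yet deal with the leading `1`, extensions, or country codes.

```python
import re

# Optional parentheses around the area code, optional space/dot/dash separators
phone = re.compile(r'\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})')

def standardize(text):
    """Rewrite every simple phone number found in text in one common shape."""
    return phone.sub(r'(\1)-\2-\3', text)

samples = [
    "679-397-5255",
    "2126660921",
    "800 555 1212",
    "800.555.1213",
    "(800) 555-1214",
]

for s in samples:
    print(s, "->", standardize(s))
```

The remaining forms in the list (prefixes such as `1-` and the various extension styles) need extra optional groups, which is where the exercise gets interesting.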

### Additional Regex Resources

* [Visual Regular Expression Tester](http://www.debuggex.com/?flavor=pcre)
* [Test Python Regular Expressions Online](http://www.pyregex.com/)
* [Regular Expressions 101](http://regex101.com/?)
* [Python's re Library Official Documentation](http://docs.python.org/2/library/re.html)
* [Regular expression reference at W3schools](http://www.w3schools.com/jsref/jsref_obj_regexp.asp)
* [Parsing phone numbers using Python and regular expressions](http://www.diveintopython.net/regular_expressions/phone_numbers.html)

### Additional Regular Expressions

While we have not used these before, they are commonly used shortcuts to simplify the construction of regular expressions:

* `\d`: matches the digits, 0-9.
* `\D`: matches anything but `\d`.
* `\w`: matches any alphanumeric character plus underscore: `[A-Za-z0-9_]`.
* `\W`: matches anything but `\w`.
* `\s`: matches any "whitespace" character (space, tab, newline, etc): `[ \t\n\r\f\v]`.
* `\S`: matches anything but `\s`.
* `\b`: matches the empty string at the boundary between a `\w` and a `\W` character (or at the start or end of the string, next to a `\w` character). Useful for ensuring that what you match is actually a word.
* `\B`: matches the empty string at any position that is not a word boundary. Useful for ensuring your match is in the middle of a word.
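
A short sketch of these shortcuts with Python's `re` module. Both sample strings are made up for illustration.

```python
import re

# Made-up sample: a word with an underscore, two spaces, and a tab
text = "Table_7 seats 12\tguests"

digits = re.findall(r'\d+', text)   # runs of digits
words = re.findall(r'\w+', text)    # runs of letters, digits, and underscores
spaces = re.findall(r'\s', text)    # individual whitespace characters

# \b lets CAFE match only as a whole word
whole = re.findall(r'\bCAFE\b', "CAFE CAFETERIA")
# \B on both sides requires AFE to sit strictly inside a longer word
inner = re.findall(r'\BAFE\B', "CAFE CAFETERIA")

print(digits, words, spaces)
print(whole, inner)
```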

And the ones below get a little bit more advanced:

* `*?`, `+?`, `??`: ordinarily, `*`, `+` and `?` are **greedy**: they match the longest possible string that satisfies the regular expression. Adding a `?` after any of them makes it non-greedy, so that it matches the shortest possible string instead. 
* `(?: )`: A non-capturing group. This works just as `()`, but doesn’t hold on to the matched contents.
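* `(?<=x)`: A positive lookbehind. It matches at a position only if that position is preceded by x, without making x part of the match. In Python's `re` module, x must be a pattern that matches strings of a fixed length.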
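
The three constructs can be tried out as follows. This is a minimal sketch: the first sample string is made up (loosely modeled on a name from the file), `ALIBABA` and `PIZZA HUT #29773` appear in the outputs of this notebook.

```python
import re

# Made-up sample with two parenthesized parts
text = "(LIBRARY) FOUR & TWENTY (BLACKBIRDS)"

# Greedy: .* runs to the last closing parenthesis
greedy = re.findall(r'\(.*\)', text)
# Non-greedy: .*? stops at the first closing parenthesis it can
lazy = re.findall(r'\(.*?\)', text)

# With a capturing group, findall returns the group's content;
# with a non-capturing group, it returns the whole match
capturing = re.findall(r'(BA)+', "ALIBABA")
non_capturing = re.findall(r'(?:BA)+', "ALIBABA")

# Lookbehind: digits that come right after a #, without including the #
store_numbers = re.findall(r'(?<=#)\d+', "PIZZA HUT #29773")

print(greedy)
print(lazy)
print(capturing, non_capturing)
print(store_numbers)
```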
