# Python Regular Expressions

Cleaning a data set often means searching and modifying text strings. Python's built-in string methods, such as index(), replace(), lower(), and split() (among many others), cover many of these tasks.

However, an analyst often needs to parse strings that follow a particular pattern rather than a fixed substring; this is where regular expressions come in. Python's re module lets you write such patterns directly in Python code and extract exactly the text you are looking for. The resulting code is often easier to read than the equivalent chain of built-in string methods.
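To make the contrast concrete, here is a minimal sketch that applies a few built-in string methods and one regular expression to a single station name taken from the data shown further down. The variable names are only for illustration.

```python
import re

name = "W 20 Street & 8 Avenue"

# Built-in string methods operate on fixed substrings.
lowered = name.lower()                    # 'w 20 street & 8 avenue'
parts = name.split(" & ")                 # ['W 20 Street', '8 Avenue']
replaced = name.replace("Street", "St")   # 'W 20 St & 8 Avenue'

# A regular expression describes a pattern: here, every run of digits.
numbers = re.findall(r"\d+", name)        # ['20', '8']

print(lowered, parts, replaced, numbers)
```

The built-in methods need to know the exact text in advance, while the pattern \d+ finds the numbers without knowing which ones appear.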

In [1]:
import re
import pandas as pd
import random

In [2]:
data=pd.read_csv("citi_bike_subset1.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,0,2110,2017/8/1 0:00,2017/8/1 0:35,470,W 20 Street & 8 Avenue,40.743453,-74.00004,3289,W 90 St & Amsterdam Ave,40.790179,-73.972889,20954,Subscriber,1978.0,2
1,1,160,2017/8/1 0:00,2017/8/1 0:02,348,W Broadway & Spring St,40.72491,-74.001547,151,Cleveland Pl & Spring St,40.722104,-73.997249,15164,Subscriber,1978.0,1
2,2,1644,2017/8/1 0:00,2017/8/1 0:27,3165,Central Park West & W 72 St,40.775794,-73.976206,3320,Central Park West & W 100 St,40.793393,-73.963556,17540,Subscriber,1962.0,2
3,3,323,2017/8/1 0:00,2017/8/1 0:05,389,Broadway & Berry Street,40.710446,-73.965251,3073,Division Ave & Hooper St,40.706913,-73.954417,18705,Subscriber,1990.0,1
4,4,109,2017/8/1 0:00,2017/8/1 0:02,3145,E 84 Street & Park Avenue,40.778627,-73.957721,3147,E 85 St & 3 Ave,40.778012,-73.954071,27975,Subscriber,1983.0,1


We look directly at the "start station name" column.

In [3]:
data = data.rename(columns={'start station name':"SSN"}) # rename the column for simplification

In [4]:
data.SSN.head(10)

0         W 20 Street & 8 Avenue
1         W Broadway & Spring St
2    Central Park West & W 72 St
3        Broadway & Berry Street
4      E 84 Street & Park Avenue
5            3 Street & 3 Avenue
6         Hanson Pl & Ashland Pl
7               W 47 St & 10 Ave
8                W 54 St & 9 Ave
9           Vernon Blvd & 50 Ave
Name: SSN, dtype: object

In [5]:
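# unify the abbreviations with re.sub()

def abbr(x):
    """Abbreviate 'Street' to 'St' and 'Avenue' to 'Ave' in a station name."""
    # \b restricts each match to the whole word
    x = re.sub(r'\bStreet\b', 'St', x)
    x = re.sub(r'\bAvenue\b', 'Ave', x)
    return x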

In [6]:
# check the function

data.loc[:,'SSN'].apply(abbr).head(10)

0                W 20 St & 8 Ave
1         W Broadway & Spring St
2    Central Park West & W 72 St
3            Broadway & Berry St
4             E 84 St & Park Ave
5                   3 St & 3 Ave
6         Hanson Pl & Ashland Pl
7               W 47 St & 10 Ave
8                W 54 St & 9 Ave
9           Vernon Blvd & 50 Ave
Name: SSN, dtype: object
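Word boundaries (\b) keep a substitution like this from touching longer words that merely contain "Street". The sketch below compares a pattern with and without them; the first station name is made up for illustration and is not taken from the data set.

```python
import re

# The first name is hypothetical; the second comes from the data above.
names = ["Streeter Dr & Grand Avenue", "W 20 Street & 8 Avenue"]

# Without word boundaries, "Street" also matches inside "Streeter".
loose = [re.sub("Street", "St", n) for n in names]

# With \b on both sides, only the whole word "Street" is replaced.
strict = [re.sub(r"\bStreet\b", "St", n) for n in names]

print(loose)   # ['Ster Dr & Grand Avenue', 'W 20 St & 8 Avenue']
print(strict)  # ['Streeter Dr & Grand Avenue', 'W 20 St & 8 Avenue']
```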

In [7]:
# modify the data

data['SSN'] = data.loc[:,'SSN'].apply(abbr)

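Next, we want to find every station on a numbered street from 30th to 49th (West or East) and store the matched street in a new column.

We use the re.findall() function, which returns a list of all matches in a string, and keep the first match.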

In [8]:
# use re.findall()

def find_st(st):
    # W or E, a space, a two-digit number from 30 to 49, a space, then "St"
    pattern = r'[WE]\s[34]\d\sSt'
    res = re.findall(pattern, st)
    if len(res) > 0:
        return res[0]
    else:
        return None

In [9]:
# check the function

data.loc[:,'SSN'].apply(find_st).head(10)

0       None
1       None
2       None
3       None
4       None
5       None
6       None
7    W 47 St
8       None
9       None
Name: SSN, dtype: object
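One detail of character classes is worth checking: inside square brackets, | is a literal character rather than "or", so a class such as [W|E] accepts a pipe as well as W and E. The sketch below compares the two spellings; the second and third sample strings are made up to expose the difference.

```python
import re

with_pipe = r"[W|E]\s[3|4]\d\sSt"    # the | is matched literally
without_pipe = r"[WE]\s[34]\d\sSt"   # only W or E, then 3 or 4

# The first sample comes from the data; the other two are hypothetical.
samples = ["W 47 St & 10 Ave", "| 47 St", "E |5 St"]

results_with_pipe = [re.findall(with_pipe, s) for s in samples]
results_without_pipe = [re.findall(without_pipe, s) for s in samples]

print(results_with_pipe)     # [['W 47 St'], ['| 47 St'], ['E |5 St']]
print(results_without_pipe)  # [['W 47 St'], [], []]
```

On clean station names both patterns return the same matches, but the version without | says exactly what is intended.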

In [10]:
# create a new column containing all qualified streets
data["Street"] = data.loc[:,'SSN'].apply(find_st)

In [12]:
data["Street"].head(10)

0       None
1       None
2       None
3       None
4       None
5       None
6       None
7    W 47 St
8       None
9       None
Name: Street, dtype: object
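Because only the first match is kept, the same lookup can also be written with re.search(), which stops at the first match and returns None when there is none. This is an alternative sketch, not the approach used in the cells above; the names STREET_PATTERN and find_st_search are hypothetical, and the pattern assumes the character classes are written without |.

```python
import re

# Compile once so the pattern is not re-parsed for every row.
STREET_PATTERN = re.compile(r"[WE]\s[34]\d\sSt")

def find_st_search(name):
    """Return the first W/E 30-49 street in a station name, or None."""
    match = STREET_PATTERN.search(name)
    return match.group(0) if match else None

print(find_st_search("W 47 St & 10 Ave"))  # W 47 St
print(find_st_search("W 54 St & 9 Ave"))   # None
```

It can be applied to the column in the same way, for example data['SSN'].apply(find_st_search).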
