## Using Regular Expressions

Here's a problem you might encounter as a data scientist.  You have a bunch of file names, and you need to find all the files whose names follow a certain pattern.

I've taken this list of files from a repo for one of my classes. It's not cleaned up - there are files and directories mixed in here.

In [1]:
import pandas as pd
files = pd.read_csv('file_list.csv', header=None).rename({0:'name'}, axis=1)

In [2]:
files.head(20)

Unnamed: 0,name
0,additional_resources
1,lab_1
2,homework
3,live_session
4,./additional_resources:
5,README.md
6,kmart_OH_notes
7,./additional_resources/kmart_OH_notes:
8,IntroToR
9,wk03


Let's say that you need to answer a question about my live session plans.  I happen to know that they are located in files with names like unit_04_ls.Rmd.

This is a great place to use a tool called regular expressions.

The first step is to define a pattern.

In [3]:
pattern = "unit.+Rmd"

This is what we call a regular expression.  I put in two special symbols.  The dot is a special character that matches any single character except a line break.  The + means one or more of whatever came before it.  Together, .+ means one or more of any character.
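
To see these two symbols in isolation, here's a minimal sketch that uses Python's built-in re module instead of pandas. The file names in the list are made up for illustration; they are not taken from the repo.

```python
import re

pattern = "unit.+Rmd"

# Hypothetical file names, invented for this illustration
names = ["unit_04_ls.Rmd", "unit_01_homework.Rmd", "unitRmd", "README.md"]

# re.fullmatch succeeds only when the whole string fits the pattern
matches = [n for n in names if re.fullmatch(pattern, n)]
print(matches)  # ['unit_04_ls.Rmd', 'unit_01_homework.Rmd']
```

Notice that "unitRmd" is rejected: the + requires at least one character between "unit" and "Rmd".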

In [4]:
files[files.name.str.fullmatch(pattern)]

Unnamed: 0,name
115,unit_01_homework.Rmd
118,unit_02_homework.Rmd
121,unit_3_homework.Rmd
124,unit_4_hw.Rmd
127,unit_05_homework.Rmd
130,unit_07_homework.Rmd
135,unit_09.Rmd
159,unit_10_homework.Rmd
173,unit_01.Rmd
178,unit_02.Rmd


There's a small problem here.  We're picking up live session plans, but also homework files.  Let's try to fix that.  I really want the file name to end with ls.Rmd.  But a . is a special character, so to match a literal dot I need to escape it with a backslash.

In [6]:
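# The r prefix makes this a raw string, so Python passes the backslash through to the regex unchanged
pattern = r"unit.+ls\.Rmd"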
files[files.name.str.fullmatch(pattern)]

Unnamed: 0,name
188,unit_03_extra_materials.Rmd
212,unit_6_ls.Rmd
218,unit_7_ls.Rmd
222,unit_08_ls.Rmd
232,unit_09_ls.Rmd


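Almost there.  We still pick up unit_03_extra_materials.Rmd, because "materials.Rmd" also ends in ls.Rmd.  Requiring an underscore right before ls takes care of that.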

In [7]:
pattern = r"unit.+_ls\.Rmd"
files[files.name.str.fullmatch(pattern)]

Unnamed: 0,name
212,unit_6_ls.Rmd
218,unit_7_ls.Rmd
222,unit_08_ls.Rmd
232,unit_09_ls.Rmd
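
Here's a small sketch of why escaping the dot matters, again using the built-in re module. Both file names are hypothetical; the second one is deliberately odd so that the difference shows up.

```python
import re

# Hypothetical names; the second has an "X" where the dot should be
names = ["unit_6_ls.Rmd", "unit_6_lsXRmd"]

unescaped = r"unit.+_ls.Rmd"   # "." matches any single character
escaped = r"unit.+_ls\.Rmd"    # "\." matches only a literal dot

loose = [n for n in names if re.fullmatch(unescaped, n)]
strict = [n for n in names if re.fullmatch(escaped, n)]

print(loose)   # ['unit_6_ls.Rmd', 'unit_6_lsXRmd']
print(strict)  # ['unit_6_ls.Rmd']
```

On a clean list of file names the two patterns usually return the same rows, but the escaped version says exactly what we mean.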


Here's another example.  Let's say that I want to create a column with the unit number.  I need another pattern.

In [8]:
pattern = r"_(\d+)_"
files.assign(unit = files.name.str.extract(pattern)).tail(20)

Unnamed: 0,name,unit
223,fishing_expedition.nb.html,
224,unit_08_ls.html,8.0
225,hprice1.RData,
226,unit_08_ls.nb.html,8.0
227,images,
228,./live_session/unit_08/images:,
229,linear_regression.png,
230,./live_session/unit_09:,
231,hprice1.RData,
232,unit_09_ls.Rmd,9.0


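In this pattern, \d represents any digit.  Again, the + means that we are looking for one or more of these in a row.  The parentheses mark the part of the match that extract pulls out, so the surrounding underscores are not included.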
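
As a rough sketch of what the parentheses do, here is the same pattern applied to a single string with re.search from the built-in re module. The pandas str.extract method applies the same idea to every row and leaves a missing value where nothing matches.

```python
import re

pattern = r"_(\d+)_"

# group(1) is the text captured by the first pair of parentheses
m = re.search(pattern, "unit_08_ls.Rmd")
unit = m.group(1)
print(unit)       # 08
print(int(unit))  # 8

# With no digits between underscores, there is no match at all
no_match = re.search(pattern, "README.md")
print(no_match)   # None
```

The captured text is a string, so convert it with int() if you need a number.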

You can also use regular expressions to replace substrings.

In [9]:
pattern = "_ls"
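# regex=True asks pandas to treat the pattern as a regular expression rather than literal text
files.assign(long_name = files.name.str.replace(pattern, "_live_session", regex=True)).tail(20)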

Unnamed: 0,name,long_name
223,fishing_expedition.nb.html,fishing_expedition.nb.html
224,unit_08_ls.html,unit_08_live_session.html
225,hprice1.RData,hprice1.RData
226,unit_08_ls.nb.html,unit_08_live_session.nb.html
227,images,images
228,./live_session/unit_08/images:,./live_session/unit_08/images:
229,linear_regression.png,linear_regression.png
230,./live_session/unit_09:,./live_session/unit_09:
231,hprice1.RData,hprice1.RData
232,unit_09_ls.Rmd,unit_09_live_session.Rmd
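
The plain-Python counterpart is re.sub. The sketch below is a variation on the cell above, not the pattern used there: it adds \b (a word boundary) so that _ls is replaced only when ls ends a word. The name tools_lsa.txt is hypothetical, included to show a case that should be left alone.

```python
import re

# The first two names appear in the listing above; the third is made up
names = ["unit_08_ls.html", "unit_08_ls.nb.html", "tools_lsa.txt"]

# \b matches between a word character and a non-word character,
# so "_ls." qualifies but "_lsa" does not
pattern = r"_ls\b"

long_names = [re.sub(pattern, "_live_session", n) for n in names]
print(long_names)
# ['unit_08_live_session.html', 'unit_08_live_session.nb.html', 'tools_lsa.txt']
```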


There's a lot more you can do with regular expressions.  And they're not just for Python.  You can use them from the command line (if you've heard of the grep command, it uses regular expressions) and from many other languages.  Hopefully that gives you an idea of how useful they are.  Next, we'll learn a bit of regular expression syntax.



