# Data Wrangling with Pandas Tutorial 

In [1]:
import pandas as pd
import numpy as np

## Combining and Merging Data Sets 

### Database-style DataFrame Merges

The pandas function *merge* combines data sets by linking rows using one or more keys.

- **many-to-one merge situation**:

In [2]:
df1 = pd.DataFrame({'key':list('bbacaab'), 'data1':range(7)})

In [3]:
df2 = pd.DataFrame({'key':list('abd'), 'data2':range(3)})

In [4]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [5]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2


df1 has multiple entries for the same key, and df2 has a unique entry for every key.

If no key is specified, the data sets are merged on the columns they share:

In [6]:
pd.merge(df1, df2)

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


We can also specify the key column explicitly with the argument *on*:

In [7]:
pd.merge(df1, df2, on='key')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0
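
The many-to-one assumption can also be checked by pandas itself. The following is a minimal sketch using the `validate` argument of `pd.merge`; it rebuilds the same `df1` and `df2` as above, and the `try`/`except` block is only there to illustrate the failure case.

```python
import pandas as pd

df1 = pd.DataFrame({'key': list('bbacaab'), 'data1': range(7)})
df2 = pd.DataFrame({'key': list('abd'), 'data2': range(3)})

# 'many_to_one' asserts that the keys are unique in the right dataframe
merged = pd.merge(df1, df2, on='key', validate='many_to_one')

# Swapping the frames breaks the assumption, because df1 has repeated keys
try:
    pd.merge(df2, df1, on='key', validate='many_to_one')
    swapped_ok = True
except pd.errors.MergeError:
    swapped_ok = False
```

If the check fails, pandas raises a `MergeError` instead of silently producing extra rows.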


If the column names are different, you can specify the columns to join on using *left_on* for the left dataframe and *right_on* for the right one:

In [8]:
df3 = pd.DataFrame({'lkey':list('bbacaab'), 'data1':range(7)})

In [9]:
df4 = pd.DataFrame({'rkey':list('abd'), 'data2':range(3)})

In [10]:
df3

Unnamed: 0,lkey,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [11]:
df4

Unnamed: 0,rkey,data2
0,a,0
1,b,1
2,d,2


In [13]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0


Note that **by default pd.merge does an inner join**: only keys that appear in both dataframes appear in the merged dataframe. That is why keys 'c' and 'd' are missing from the previous table.

You can specify the type of join using the argument *how*:

- If we do a left join, all the keys from the left dataframe will appear with the corresponding added columns of the right dataframe. However, if a key is in the right dataframe and not in the left one, that key will not appear.

In [14]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey', how='left')

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1.0
1,b,1,b,1.0
2,a,2,a,0.0
3,c,3,,
4,a,4,a,0.0
5,a,5,a,0.0
6,b,6,b,1.0


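- The same applies to a right join: all the keys from the right dataframe appear, and keys found only in the left dataframe are dropped: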

In [15]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey', how='right')

Unnamed: 0,lkey,data1,rkey,data2
0,a,2.0,a,0
1,a,4.0,a,0
2,a,5.0,a,0
3,b,0.0,b,1
4,b,1.0,b,1
5,b,6.0,b,1
6,,,d,2


- The outer join takes the union of the keys:

In [16]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey', how='outer')

Unnamed: 0,lkey,data1,rkey,data2
0,b,0.0,b,1.0
1,b,1.0,b,1.0
2,b,6.0,b,1.0
3,a,2.0,a,0.0
4,a,4.0,a,0.0
5,a,5.0,a,0.0
6,c,3.0,,
7,,,d,2.0
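
To see which side each row of a join came from, one option is the `indicator` argument of `pd.merge`. The sketch below rebuilds `df3` and `df4` as above; `outer` and `origin_counts` are names chosen here for illustration.

```python
import pandas as pd

df3 = pd.DataFrame({'lkey': list('bbacaab'), 'data1': range(7)})
df4 = pd.DataFrame({'rkey': list('abd'), 'data2': range(3)})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only' or 'both'
outer = pd.merge(df3, df4, left_on='lkey', right_on='rkey',
                 how='outer', indicator=True)

# Count how many rows came from each side
origin_counts = outer['_merge'].value_counts().to_dict()
```

Here the row with key 'c' is marked `left_only`, the row with key 'd' is marked `right_only`, and the remaining rows are marked `both`.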


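- **many-to-many merge situation**: the merge takes the Cartesian product of the matching rows. That is, if a key appears m times in the left dataframe and n times in the right one, the result has m × n rows for that key.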

In [17]:
df1 = pd.DataFrame({'key': list('bbacab'),
'data1': range(6)})

In [23]:
df2 = pd.DataFrame({'key': list('ababd'),
'data2': range(5)})

In [20]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [24]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,a,2
3,b,3
4,d,4


In this case 'a' appears 2 times in df1 and 2 times in df2, 'b' appears 3 times in df1 and 2 times in df2. Therefore in the inner join merge we will have 4 entries with key 'a' and 6 entries with key 'b' (and none with keys 'c' and 'd'):

In [25]:
pd.merge(df1, df2, how='inner')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,0,3
2,b,1,1
3,b,1,3
4,b,5,1
5,b,5,3
6,a,2,0
7,a,2,2
8,a,4,0
9,a,4,2
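
The row count of a many-to-many merge can be predicted from the key counts on each side. The following is a small sketch of that arithmetic, rebuilding the same `df1` and `df2`; the helper names `left_counts`, `right_counts` and `rows_per_key` are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})
df2 = pd.DataFrame({'key': list('ababd'), 'data2': range(5)})

# Occurrences of each key on each side
left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()

# Product per key; keys missing on one side become NaN and are dropped
rows_per_key = (left_counts * right_counts).dropna().astype(int)

inner = pd.merge(df1, df2, on='key', how='inner')
```

The sum of `rows_per_key` (4 for 'a' plus 6 for 'b') equals the number of rows of the inner join.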


If we do a left join, then we will have the same entries with keys 'a' and 'b', and we will have one extra entry with key 'c' and NaN in the data2 column:

In [26]:
pd.merge(df1, df2, how='left')

Unnamed: 0,key,data1,data2
0,b,0,1.0
1,b,0,3.0
2,b,1,1.0
3,b,1,3.0
4,a,2,0.0
5,a,2,2.0
6,c,3,
7,a,4,0.0
8,a,4,2.0
9,b,5,1.0
10,b,5,3.0


To merge with multiple keys, pass a list of the keys in the argument *on* or *left_on*/*right_on*:

In [27]:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
'key2': ['one', 'two', 'one'],
'lval': [1, 2, 3]})

In [29]:
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
'key2': ['one', 'one', 'one', 'two'],
'rval': [4, 5, 6, 7]})

In [30]:
left

Unnamed: 0,key1,key2,lval
0,foo,one,1
1,foo,two,2
2,bar,one,3


In [31]:
right

Unnamed: 0,key1,key2,rval
0,foo,one,4
1,foo,one,5
2,bar,one,6
3,bar,two,7


In the left dataframe the pair [key1, key2] forms a unique key. However, in the right dataframe the key ['foo', 'one'] has two entries. Therefore, this is a one-to-many situation (unique keys on the left, repeated keys on the right).

Inner join:

In [33]:
pd.merge(left, right, on=['key1', 'key2'])

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1,4
1,foo,one,1,5
2,bar,one,3,6


Outer join:

In [34]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0
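
A related detail appears when only some of the shared columns are used as keys: the remaining overlapping columns receive suffixes. The following is a minimal sketch reusing the `left` and `right` frames above; the suffix strings `'_left'` and `'_right'` are arbitrary choices for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Merging on key1 only leaves key2 in both frames; pandas appends
# '_x' and '_y' by default to tell the two columns apart
default_cols = pd.merge(left, right, on='key1').columns.tolist()

# The suffixes argument replaces the defaults
custom_cols = pd.merge(left, right, on='key1',
                       suffixes=('_left', '_right')).columns.tolist()
```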


### Merging on index

- If the key to join on is found in the index, you must specify left_index=True or right_index=True (or both):

In [3]:
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
'value': range(6)})

In [5]:
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

In [6]:
left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [7]:
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [9]:
pd.merge(left1, right1, right_index=True, left_on='key', how='left')

Unnamed: 0,key,value,group_val
0,a,0,3.5
1,b,1,7.0
2,a,2,3.5
3,a,3,3.5
4,b,4,7.0
5,c,5,
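
The same column-against-index merge can also be written with `DataFrame.join` and its `on` argument. The sketch below rebuilds `left1` and `right1` and compares both spellings; `merged` and `joined` are names chosen here.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Column 'key' of left1 matched against the index of right1
merged = pd.merge(left1, right1, left_on='key', right_index=True, how='left')

# DataFrame.join does a left join by default; on='key' names the
# column of left1 to match against the index of right1
joined = left1.join(right1, on='key')
```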


- Same goes for hierarchically-indexed data:

In [17]:
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'key2': [2000, 2001, 2002, 2001, 2002],
'data': np.arange(5.)})

righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
[2001, 2000, 2000, 2000, 2001, 2002]],
columns=['event1', 'event2'])

In [18]:
lefth

Unnamed: 0,key1,key2,data
0,Ohio,2000,0.0
1,Ohio,2001,1.0
2,Ohio,2002,2.0
3,Nevada,2001,3.0
4,Nevada,2002,4.0


In [19]:
righth

Unnamed: 0,Unnamed: 1,event1,event2
Nevada,2001,0,1
Nevada,2000,2,3
Ohio,2000,4,5
Ohio,2000,6,7
Ohio,2001,8,9
Ohio,2002,10,11


In [21]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4,5
0,Ohio,2000,0.0,6,7
1,Ohio,2001,1.0,8,9
2,Ohio,2002,2.0,10,11
3,Nevada,2001,3.0,0,1


- Using indices on both sides of the merge:

In [11]:
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'],
columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])

In [12]:
left2

Unnamed: 0,Ohio,Nevada
a,1.0,2.0
c,3.0,4.0
e,5.0,6.0


In [13]:
right2

Unnamed: 0,Missouri,Alabama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [23]:
pd.merge(left2, right2, left_index=True, right_index=True, how='left')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
c,3.0,4.0,9.0,10.0
e,5.0,6.0,13.0,14.0


#### Merging on index with .join

- df1.join(df2) joins on the index and does a left join of df1 with df2 by default (other join types, such as outer, can be requested with *how*)

In [24]:
left2.join(right2, how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0
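
As a quick check of the default behaviour, the following sketch compares `join` without arguments against the equivalent `pd.merge` call on both indexes. It rebuilds `left2` and `right2` as above; `left_joined` and `left_merged` are names introduced here.

```python
import pandas as pd

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])

# Default join: how='left', aligned on the index
left_joined = left2.join(right2)

# The same result expressed with pd.merge
left_merged = pd.merge(left2, right2, left_index=True, right_index=True,
                       how='left')
```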


- df1.join([df2, df3]) does a left join of df1 with df2 and df3

In [27]:
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
index=['a', 'c', 'e', 'f'], columns=['New York', 'Oregon'])

In [26]:
another

Unnamed: 0,New York,Oregon
a,7.0,8.0
c,9.0,10.0
e,11.0,12.0
f,16.0,17.0


In [28]:
left2.join([right2, another])

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0


### Concatenating Along Axis

#### NumPy concatenate

In [3]:
arr = np.arange(12).reshape((3, 4))

In [4]:
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [5]:
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
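
For comparison, here is a short sketch of the same call along both axes. The names `stacked_rows` and `stacked_cols` are chosen here; `axis=0` is the default of `np.concatenate`.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

# axis=0 (the default) stacks the arrays on top of each other
stacked_rows = np.concatenate([arr, arr], axis=0)

# axis=1 places them side by side
stacked_cols = np.concatenate([arr, arr], axis=1)
```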

#### pd.concat Series

- pd.concat([s1,s2]): by default it returns a new Series where the indexes are stacked together (axis=0)

In [6]:
s1 = pd.Series([0,1], index=['a','b'])

In [7]:
s2 = pd.Series([2,3,4], index=['c','d','e'])

In [8]:
s3 = pd.Series([5,6], index=['f','g'])

In [9]:
pd.concat([s1,s2,s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

- pd.concat([s1,s2], axis=1): returns a new DataFrame with the union of the indexes and a column for each of the Series

*If the indexes don't overlap, each Series has NaN values in the rows that belong to the other Series*:

In [10]:
pd.concat([s1,s2,s3],axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


*If there is overlap:*

In [11]:
s4 = pd.concat([s1*5, s3])

In [12]:
s4

a    0
b    5
f    5
g    6
dtype: int64

In [13]:
pd.concat([s1,s4], axis=1)

Unnamed: 0,0,1
a,0.0,0
b,1.0,5
f,,5
g,,6


- You can also intersect the indexes instead of taking the union with **join='inner'**:

In [14]:
pd.concat([s1,s4], axis=1, join='inner')

Unnamed: 0,0,1
a,0,0
b,1,5


- If you want to be able to identify the concatenated pieces, you can create a hierarchical index by specifying keys=['one', 'two']:

In [15]:
result = pd.concat([s1,s2,s3], keys=['one', 'two', 'three'])

In [16]:
result

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64

*Then you can unstack the hierarchically-indexed Series in a DataFrame*:

In [17]:
result.unstack()

Unnamed: 0,a,b,c,d,e,f,g
one,0.0,1.0,,,,,
two,,,2.0,3.0,4.0,,
three,,,,,,5.0,6.0


- If you concat on axis=1, then the *keys* argument refers to the columns' headers:

In [19]:
pd.concat([s1,s2,s3], axis=1, keys=['Q', 'W', 'T'])

Unnamed: 0,Q,W,T
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


#### pd.concat DataFrame 

In [41]:
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
columns=['one', 'two'])

df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
columns=['three', 'four'])

In [35]:
df1

Unnamed: 0,one,two
a,0,1
b,2,3
c,4,5


In [36]:
df2

Unnamed: 0,three,four
a,5,6
c,7,8


- pd.concat(): returns a DataFrame where the indexes are stacked together and the columns are the union of the columns

In [37]:
pd.concat([df1, df2])

Unnamed: 0,one,two,three,four
a,0.0,1.0,,
b,2.0,3.0,,
c,4.0,5.0,,
a,,,5.0,6.0
c,,,7.0,8.0


*To get a hierarchically-indexed DataFrame:*

In [43]:
pd.concat([df1, df2], keys=['one', 'two'])

Unnamed: 0,Unnamed: 1,one,two,three,four
one,a,0.0,1.0,,
one,b,2.0,3.0,,
one,c,4.0,5.0,,
two,a,,,5.0,6.0
two,c,,,7.0,8.0


- pd.concat(axis=1): returns a DataFrame where the index is the union of the indexes and the columns are stacked together

In [42]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


*To get levels for columns:*

In [45]:
pd.concat([df1, df2], axis=1,
keys=['level1', 'level2'])

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


*You can also name the different levels:*

In [46]:
pd.concat([df1, df2], axis=1,
keys=['level1', 'level2'], names=['upper', 'lower'])

upper,level1,level1,level2,level2
lower,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


- Finally, if you want to concatenate along the index but the index labels are not important, you can pass the argument **ignore_index=True**:

In [47]:
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

In [48]:
df1

Unnamed: 0,a,b,c,d
0,-0.305549,-0.434518,0.147186,-1.449942
1,-0.444368,-1.035094,-0.639284,1.348064
2,1.398562,-1.273002,0.727595,-0.93244


In [49]:
df2

Unnamed: 0,b,d,a
0,-1.665583,0.188099,-0.638184
1,0.045783,-2.245093,1.076758


In [51]:
pd.concat([df1, df2], ignore_index=True)

Unnamed: 0,a,b,c,d
0,-0.305549,-0.434518,0.147186,-1.449942
1,-0.444368,-1.035094,-0.639284,1.348064
2,1.398562,-1.273002,0.727595,-0.93244
3,-0.638184,-1.665583,,0.188099
4,1.076758,0.045783,,-2.245093
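
To make the effect of `ignore_index` easier to check, the following sketch uses small deterministic frames instead of random data. The frames `top` and `bottom` are hypothetical stand-ins with the same column layout as `df1` and `df2`.

```python
import numpy as np
import pandas as pd

top = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'])
bottom = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['b', 'd', 'a'])

# Without ignore_index the original row labels are kept, so 0 and 1 repeat
kept = pd.concat([top, bottom])

# With ignore_index=True the result gets a fresh 0..n-1 index
renumbered = pd.concat([top, bottom], ignore_index=True)
```

In both cases the columns are aligned by name, so column 'c' is NaN for the rows that come from `bottom`.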


### Combining Data with Overlap 

This section deals with combining Series or DataFrames whose indexes overlap.

#### np.where

- np.where(condition, a, b): condition is a boolean array, a and b are arrays. If condition[i] is True, it picks the element from a; otherwise it picks the element from b

In [54]:
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
index=['f', 'e', 'd', 'c', 'b', 'a'])
b.iloc[-1] = np.nan

In [55]:
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [56]:
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

In [57]:
cond = pd.isnull(a)

In [58]:
cond

f     True
e    False
d     True
c    False
b    False
a     True
dtype: bool

In [60]:
np.where(cond, b, a)

array([0. , 2.5, 2. , 3.5, 4.5, nan])
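
Note that `np.where` returns a plain NumPy array, so the index is lost. The following sketch shows one way to reattach it, and compares the result with `Series.fillna`, which stays in pandas. The names `patched` and `filled` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b.iloc[-1] = np.nan

# Rebuild a Series from the array returned by np.where
patched = pd.Series(np.where(a.isna(), b, a), index=a.index)

# Same values: missing entries of a are filled from b, aligned by label
filled = a.fillna(b)
```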

#### Series .combine_first

- s1.combine_first(s2) returns a Series indexed by the union of both indexes. Each value is taken from s1 unless it is missing there (NaN or label absent), in which case the value from s2 is used

In [62]:
b = b[:-2]

In [63]:
a = a[2:]

In [64]:
b

f    0.0
e    1.0
d    2.0
c    3.0
dtype: float64

In [65]:
a

d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [66]:
b.combine_first(a)

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64

#### DataFrames .combine_first()

For DataFrames, df1.combine_first(df2) does the same thing for shared columns:

In [68]:
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
'b': [np.nan, 2., np.nan, 6.],
'c': range(2, 18, 4)})

df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
'b': [np.nan, 3., 4., 6., 8.]})

In [69]:
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [70]:
df2

Unnamed: 0,a,b
0,5.0,
1,4.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0


In [71]:
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


## Reshaping and Pivoting

## Data Transformation 

## String Manipulation

### String Object Methods 

- *split* breaks a string with chosen separator:

In [75]:
fruit = 'apple, pear,melon'

In [76]:
fruit.split(',')

['apple', ' pear', 'melon']

- *strip* removes leading and trailing characters (whitespace by default, or any of the characters passed as argument):

In [77]:
item = ' apple'

In [78]:
item.strip(' ')

'apple'
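
A few more cases, as a small sketch of how the *strip* family behaves. The strings and variable names are made up for illustration.

```python
padded = '  apple \n'

# No argument: whitespace is removed from both ends
default_stripped = padded.strip()

# With an argument: every leading/trailing character in the set is removed
chars_stripped = 'xxapplexx'.strip('x')

# lstrip and rstrip only touch one side
left_only = padded.lstrip()
right_only = padded.rstrip()

# Characters in the middle of the string are never touched
inner_kept = ' a p p l e '.strip()
```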

*Added all together*:

In [79]:
pieces = [item.strip(' ') for item in fruit.split(',')]

In [80]:
pieces

['apple', 'pear', 'melon']

- Concatenate strings with +:

In [81]:
first, second, third = pieces

In [82]:
first + ':' + second + ':' + third

'apple:pear:melon'

- *join* concatenates strings in a list with a separator 'sep'.join([str1, str2]):

In [83]:
'-'.join(pieces)

'apple-pear-melon'

- Locating substrings:

In [85]:
'ple' in first

True

- *find* returns starting index of located substring (*returns -1 if not found*)

In [86]:
first.find('ple')

2

In [87]:
first.find('pea')

-1

- *count* returns the number of occurrences of a particular substring:

In [88]:
string = 'massachussetts'

In [89]:
string.count('ss')

2

- *replace* replaces substring with another substring:

In [90]:
string.replace('ss', 's')

'masachusetts'

- *lower*, *upper* converts to lower or upper case:

In [91]:
country = 'Catalonia'

In [92]:
country.lower()

'catalonia'

- *endswith*, *startswith*:

In [93]:
country.endswith('nia')

True

In [94]:
country.startswith('esp')

False

### Regular Expressions 

#### Intro

Regular expressions provide a flexible way to search or match string patterns in text.

The Python module to manage regular expressions is called **re**:

In [2]:
import re

A **regex** is a string formed according to the regular expression language.

Common special sequences:

- **\s**: any whitespace character (space, \t, \n, \r, \f, \v)

- **\S**: any non-whitespace character

- **\d**: any digit character [0-9]

- **\D**: any character that is not a digit

- **\w**: any word character (a-z, A-Z, 0-9, _ )

- **\W**: any non-word character

In [97]:
text = "foo     bar\t baz   \tqux"

In [98]:
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

- re.compile(r'\s+'): compiles a regular expression into an object that can be reused as many times as you want

In [99]:
regex = re.compile(r'\s+')

- *split*: split using regex as separator

In [100]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

- *findall*: find all the substrings matching the regular expression regex

In [101]:
regex.findall(text)

['     ', '\t ', '   \t']

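- *search*: scans the whole string and returns only the first match

In [102]:
regex.search(text)

<re.Match object; span=(3, 8), match='     '>

- *match*: only matches at the beginning of the string

In [3]:
regex = re.compile(r'\d\w{4}\.')

In [6]:
regex.match('wer4wwww.qwe')

The call returns None (so nothing is displayed), because the string does not start with a digit.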
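
The difference between *search* and *match*, and the contents of a match object, can be sketched as follows. The names `ws` and `sample` are chosen here so that `regex` and `text` are left untouched.

```python
import re

ws = re.compile(r'\s+')
sample = "foo     bar\t baz   \tqux"

# search scans the string and stops at the first run of whitespace
first = ws.search(sample)

# match only tries position 0, where sample has 'f', so it returns None
at_start = ws.match(sample)

# A match object exposes the matched text and its position
matched_text = first.group()
start, end = first.span()
```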

#### regex syntax

- special characters:

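    - . (dot): matches any character except newline. Inside a set such as [e.] the dot is literal, so the pattern below does not match '344er' and *match* returns None:

In [40]:
regex = re.compile(r'\d{3}[e.]{2}')

In [41]:
text = '344er'
regex.match(text)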

    - ^ : matches the start of the string

In [12]:
regex = re.compile(r"^\d\w")

In [13]:
text = '4m-dia internacional \n 5m- dia mundial'
regex.findall(text)

['4m']

    - $ : matches the end of the string or just before the newline at the end of the string

In [26]:
regex = re.compile(r"\.es$")

In [25]:
text = 'toni@gmail.com, pere@gmail.es\n'
regex.findall(text)

['.es']

    - * : causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. 

In [111]:
## ab* means match character a and 0 or as many
## repetitions as possible of b
regex = re.compile('ab*')

In [108]:
regex.findall('abbbb')

['abbbb']

In [114]:
regex.findall('maaarianeboabbbba')

['a', 'a', 'a', 'a', 'abbbb', 'a']

    - +: causes the resulting RE to match 1 or more repetitions of the preceding RE (ab+ matches abb but not a)

In [118]:
regex = re.compile('ab+')

In [119]:
regex.findall('acababbb')

['ab', 'abbb']

    - ? : causes the resulting regex to match 0 or 1 repetitions of the preceding regex (ab? matches a and ab, but not abab as a single match)

In [116]:
regex = re.compile('ab?')

In [117]:
regex.findall('abb')

['ab']

    - {m} : matches m copies of RE (a{6} matches exactly six 'a' characters, but not 5)

    - {m,n}: matches from m to n repetitions of RE (attempting to match as many as possible). Omitting m takes lower bound 0 and omitting n takes upper bound infinity

    - {m,n}?: matches from m to n repetitions of RE (attempting to match as few as possible)

- []: indicate a set of characters

    - [amk]: lists characters 'a', 'm', 'k' individually

    - [a-z]: any lowercase ASCII letter

    - [0-5][0-9]: all the two-digit numbers from 00 to 59

    - if - is escaped ([a\-z]) or placed as first or last character ([-a], [a-]) it will match a literal -

    - define a set by its complement using ^: [^5] matches any character except 5

    - Special characters lose their special meaning inside sets ([.] represents a literal dot)

- | : A|B matches either A or B
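
The quantifier, set and alternation rules above can be tried out with a few one-liners. This is a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

# Greedy {m,n} takes as many repetitions as possible, lazy {m,n}? as few
greedy = re.findall(r'a{2,4}', 'aaaaaa')
lazy = re.findall(r'a{2,4}?', 'aaaaaa')

# Two-digit numbers from 00 to 59
minutes = re.findall(r'[0-5][0-9]', '07 59 60 99')

# Complement of a set: everything except '5'
not_five = re.findall(r'[^5]', '1525')

# An escaped '-' and a '.' inside a set are literal characters
literal_dash = re.findall(r'[a\-z]', 'a-z b')
literal_dot = re.findall(r'[.]', 'a.b')

# Alternation: either 'cat' or 'dog'
either = re.findall(r'cat|dog', 'cat, dog, bird')
```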

#### examples

- finding emails in a text:

In [126]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.es
"""

In [127]:
email = re.compile(r'[a-zA-Z0-9]*@[a-z]*\.[a-z]{2,4}')

In [128]:
email.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.es']
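
If the individual parts of each address are needed, capture groups can be added to the pattern. The following is a sketch under two assumptions: the pattern `email_parts` is a variation introduced here, and it uses `+` instead of `*` so that empty user names or domains are not accepted.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.es
"""

# Parentheses define capture groups: user name, domain name, suffix
email_parts = re.compile(r'([a-zA-Z0-9]+)@([a-z]+)\.([a-z]{2,4})')

# With groups in the pattern, findall returns one tuple per match
parts = email_parts.findall(text)

# On a single match object, groups() returns the same pieces
first_groups = email_parts.search(text).groups()
```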

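- finding groups of digits separated by single non-digit characters (two digits, two digits, four digits):

In [134]: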
Regex_Pattern = r"\d{2}\D\d{2}\D\d{4}"

In [135]:
regex = re.compile(Regex_Pattern)

In [136]:
text = '13X45X5567sdfafd'
regex.findall(text)

['13X45X5567']

### Vectorized string functions in pandas 