Breaking up a string into columns using Regex in pandas

https://chrisalbon.com/python/data_wrangling/pandas_regex_to_create_columns/

#### Import modules

In [4]:
import re
import pandas as pd

Create df of raw strings

In [5]:
# Create a dataframe with a single column of strings
data = {'raw': ['Arizona 1 2014-12-23       3242.0',
                'Iowa 1 2010-02-23       3453.7',
                'Oregon 0 2014-06-20       2123.0',
                'Maryland 0 2014-03-14       1123.6',
                'Florida 1 2013-01-15       2134.0',
                'Georgia 0 2012-07-14       2345.6']}
df = pd.DataFrame(data, columns = ['raw'])
df

                            raw
0   Arizona 1 2014-12-23 3242.0
1      Iowa 1 2010-02-23 3453.7
2    Oregon 0 2014-06-20 2123.0
3  Maryland 0 2014-03-14 1123.6
4   Florida 1 2013-01-15 2134.0
5   Georgia 0 2012-07-14 2345.6


Search column of strings for a pattern

In [6]:
# Which rows of df['raw'] contain a date of the form YYYY-MM-DD?
df['raw'].str.contains(r'\d{4}-\d{2}-\d{2}', regex=True)

0    True
1    True
2    True
3    True
4    True
5    True
Name: raw, dtype: bool
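
A pattern built from `.` wildcards, such as `....-..-..`, accepts any characters around the hyphens, whereas `\d{4}-\d{2}-\d{2}` requires digits. The sketch below compares the two on a made-up series (`samples` is hypothetical and not part of the original data):

```python
import pandas as pd

# Hypothetical sample strings, chosen to show the difference
samples = pd.Series(['Arizona 1 2014-12-23',
                     'call 555x-ab-cd now',
                     'no date here'])

# '.' matches any character, so only the hyphen positions are checked
loose = samples.str.contains(r'....-..-..', regex=True)

# '\d' matches digits only, so the second string is rejected
strict = samples.str.contains(r'\d{4}-\d{2}-\d{2}', regex=True)

print(loose.tolist())
print(strict.tolist())
```

On the six rows of `df['raw']` both patterns give the same result, because every row contains a well-formed date.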

Extract column of single digits

In [8]:
# In the column 'raw', extract the first single digit in each string
df['female'] = df['raw'].str.extract(r'(\d)', expand=True)
df['female']

0    1
1    1
2    0
3    0
4    1
5    0
Name: female, dtype: object
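
The extracted values are strings rather than integers. As a minimal sketch, assuming a shortened two-row copy of the data (the names `as_frame`, `as_series`, and `female` are illustrative), this shows what the `expand` flag changes and how the result could be cast to integers:

```python
import pandas as pd

raw = pd.Series(['Arizona 1 2014-12-23       3242.0',
                 'Oregon 0 2014-06-20       2123.0'], name='raw')

# expand=True returns a DataFrame with one column per capture group
as_frame = raw.str.extract(r'(\d)', expand=True)

# expand=False returns a Series when the pattern has a single group
as_series = raw.str.extract(r'(\d)', expand=False)

# The matches are strings; cast them if numeric values are needed
female = as_series.astype(int)
print(female.tolist())
```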

Extract the column of dates

In [9]:
# In the column 'raw', extract the date (YYYY-MM-DD) from each string
df['date'] = df['raw'].str.extract(r'(\d{4}-\d{2}-\d{2})', expand=True)
df['date']

0    2014-12-23
1    2010-02-23
2    2014-06-20
3    2014-03-14
4    2013-01-15
5    2012-07-14
Name: date, dtype: object
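
The extracted dates are still plain strings. One possible follow-up, sketched here on a standalone series (`dates`, `parsed`, and `years` are illustrative names), is to parse them with `pd.to_datetime` so that date components become available:

```python
import pandas as pd

dates = pd.Series(['2014-12-23', '2010-02-23', '2014-06-20'])

# Parse the strings into datetime values using an explicit format
parsed = pd.to_datetime(dates, format='%Y-%m-%d')

# Datetime columns expose components through the .dt accessor
years = parsed.dt.year
print(years.tolist())
```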

Extract the column of scores

In [14]:
# In the column 'raw', extract four digits, a decimal point, and one digit
df['score'] = df['raw'].str.extract(r'(\d{4}\.\d)', expand=True)
df['score']

0    3242.0
1    3453.7
2    2123.0
3    1123.6
4    2134.0
5    2345.6
Name: score, dtype: object

Extract the column of words

In [19]:
# In the column 'raw', extract the first word that starts with a capital letter
df['state'] = df['raw'].str.extract(r'([A-Z]\w*)', expand=True)
df['state']

0     Arizona
1        Iowa
2      Oregon
3    Maryland
4     Florida
5     Georgia
Name: state, dtype: object
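
A pattern that captures a single capitalised word works for this data because every state name is one word. The sketch below uses a hypothetical row with a two-word name to show where it falls short, together with one possible alternative that captures everything up to the first standalone digit (the second pattern is an assumption, not part of the original):

```python
import pandas as pd

# Hypothetical rows; 'New York' does not appear in the original data
raw = pd.Series(['New York 1 2014-12-23       3242.0',
                 'Iowa 1 2010-02-23       3453.7'])

# Single-word pattern: stops at the first space
one_word = raw.str.extract(r'([A-Z]\w*)', expand=False)

# Alternative: letters and spaces, as few as possible, followed by a digit
multi_word = raw.str.extract(r'^([A-Za-z ]+?)\s+\d', expand=False)

print(one_word.tolist())
print(multi_word.tolist())
```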

View final dataframe

In [20]:
df

                            raw  female        date   score     state
0   Arizona 1 2014-12-23 3242.0       1  2014-12-23  3242.0   Arizona
1      Iowa 1 2010-02-23 3453.7       1  2010-02-23  3453.7      Iowa
2    Oregon 0 2014-06-20 2123.0       0  2014-06-20  2123.0    Oregon
3  Maryland 0 2014-03-14 1123.6       0  2014-03-14  1123.6  Maryland
4   Florida 1 2013-01-15 2134.0       1  2013-01-15  2134.0   Florida
5   Georgia 0 2012-07-14 2345.6       0  2012-07-14  2345.6   Georgia
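
The same columns can also be produced in a single call. This is a sketch rather than the approach used above: it assumes the fields always appear in the order state, flag, date, score, and it relies on named capture groups, which `str.extract` turns into column names. The names `pattern` and `parsed` are illustrative.

```python
import pandas as pd

raw = pd.Series(['Arizona 1 2014-12-23       3242.0',
                 'Iowa 1 2010-02-23       3453.7'], name='raw')

# One named group per field; \s+ absorbs the variable spacing between fields
pattern = (r'(?P<state>[A-Z]\w*)\s+'
           r'(?P<female>\d)\s+'
           r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
           r'(?P<score>\d+\.\d+)')

# Each named group becomes a column of the resulting DataFrame
parsed = raw.str.extract(pattern)

# Extracted values are strings; convert the score to a float
parsed['score'] = parsed['score'].astype(float)
print(parsed)
```

A row that does not match the whole pattern yields missing values in every column, so this variant is stricter than extracting each field separately.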
