# Pandas: Working with Text Data

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. 

See the pandas user guide: [Working with Text Data](http://pandas.pydata.org/pandas-docs/stable/user_guide/text.html?highlight=str)

In [5]:
import pandas as pd
import numpy as np

## lower/upper/len

In [7]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

In [8]:
s.str.upper()

0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
dtype: object

In [9]:
s.str.len()

0    1.0
1    1.0
2    1.0
3    4.0
4    4.0
5    NaN
6    4.0
7    3.0
8    3.0
dtype: float64

## strip/lstrip/rstrip

In [10]:
idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])

In [11]:
idx.str.strip()

Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

In [12]:
idx.str.lstrip()

Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

In [13]:
idx.str.rstrip()

Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')

## Splitting and Replacing Strings

**These string methods can be used to clean up column labels as needed. Here we strip leading and trailing whitespace, lowercase all names, and replace any remaining spaces with underscores:**

In [14]:
df = pd.DataFrame(np.random.randn(3, 2),
                  columns=[' Column A ', ' Column B '], index=range(3))

In [15]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ','_')

In [16]:
df.columns

Index(['column_a', 'column_b'], dtype='object')

In [19]:
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])

In [25]:
# Elements in the split lists can be accessed using get or [] notation:
s2.str.split('_').str.get(1)

0      b
1      d
2    NaN
3      g
dtype: object
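
The `[]` notation mentioned above is not shown in the cell, so here is a minimal sketch of it next to `get`. It reuses the same `s2` data; the variable names `parts`, `second_via_get`, and `second_via_brackets` are only illustrative.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])

# Split each string into a list of pieces; NaN stays NaN.
parts = s2.str.split('_')

# Two equivalent ways to pick the second piece of every list.
second_via_get = parts.str.get(1)    # explicit method call
second_via_brackets = parts.str[1]   # [] notation on the .str accessor

print(second_via_brackets)
```

Both forms should give the same result, with the missing value carried through as `NaN`.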

In [27]:
# It is easy to expand this to return a DataFrame using expand:
s2.str.split('_', expand=True)

     0    1    2
0    a    b    c
1    c    d    e
2  NaN  NaN  NaN
3    f    g    h


In [32]:
# It is also possible to limit the number of splits:
s2.str.split('_', expand=True, n=1)

     0    1
0    a  b_c
1    c  d_e
2  NaN  NaN
3    f  g_h


In [36]:
# rsplit is similar to split except it works in the reverse direction,
# i.e., from the end of the string to the beginning of the string:
s2.str.rsplit('_', expand=True, n=1)

     0    1
0  a_b    c
1  c_d    e
2  NaN  NaN
3  f_g    h


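**In pandas versions before 2.0, `str.replace` treats the pattern as a regular expression by default. Since 2.0 the default is `regex=False`, so pass `regex=True` explicitly when a regular expression is intended.**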
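
A minimal sketch of literal versus regex replacement with `str.replace`. The `regex` argument is passed explicitly in both calls so the result does not depend on the installed pandas version. The `prices` series is made-up sample data, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample data for illustration.
prices = pd.Series(['$10', '$25', 'n/a'])

# Literal replacement: '$' is treated as a plain character,
# not as the regex end-of-string anchor.
literal = prices.str.replace('$', '', regex=False)

# Regex replacement: remove every non-digit character.
digits_only = prices.str.replace(r'\D', '', regex=True)

print(literal)
print(digits_only)
```

Spelling out `regex=` is a cheap way to keep a notebook reproducible across pandas versions, since characters such as `$` or `.` mean very different things in the two modes.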

## Concatenation

In [42]:
s = pd.Series(['a', 'b', 'c', 'd'])
t = pd.Series(['a', 'b', np.nan, 'd'])
df = pd.DataFrame({'s':s,'t':t})
df

   s    t
0  a    a
1  b    b
2  c  NaN
3  d    d


In [45]:
df.s.str.cat(t,sep='-',na_rep='AHA')

0      a-a
1      b-b
2    c-AHA
3      d-d
Name: s, dtype: object

In [46]:
d = pd.concat([t, s], axis=1)

In [48]:
df.s.str.cat(d,na_rep='_')

0    aaa
1    bbb
2    c_c
3    ddd
Name: s, dtype: object

## Testing for Strings that Match or Contain a Pattern


In [49]:
pattern = r'[0-9][a-z]'
pd.Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern)

0    False
1    False
2     True
3     True
4     True
dtype: bool

In [50]:
pd.Series(['1', '2', '3a', '3b', '03c']).str.match(pattern)

0    False
1    False
2     True
3     True
4    False
dtype: bool
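
The two outputs above differ only for `'03c'`: `contains` finds the pattern anywhere in the string, while `match` requires it at the start. The sketch below puts the results side by side and, as an addition not in the original notebook, includes `str.fullmatch` (available in recent pandas versions), which requires the entire string to match. The extra value `'3ab'` is added purely to make that third case visible.

```python
import pandas as pd

pattern = r'[0-9][a-z]'
# Same values as above, plus '3ab' (added for illustration).
values = pd.Series(['1', '2', '3a', '3b', '03c', '3ab'])

comparison = pd.DataFrame({
    'value': values,
    'contains': values.str.contains(pattern),    # match anywhere, like re.search
    'match': values.str.match(pattern),          # match at the start, like re.match
    'fullmatch': values.str.fullmatch(pattern),  # whole string, like re.fullmatch
})

print(comparison)
```

Here `'03c'` is `True` only for `contains`, and `'3ab'` is `False` only for `fullmatch`.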

## Creating Indicator Variables

In [51]:
s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
s

0      a
1    a|b
2    NaN
3    a|c
dtype: object

In [52]:
s.str.get_dummies(sep='|')

   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1
