## Text Methods
---

A normal Python string has a variety of methods available; see `help(str)` for the full list.

In [4]:
mystring = 'hello'

In [5]:
mystring.capitalize()

'Hello'

In [6]:
mystring.isdigit()

False
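As a lighter alternative to reading the full `help(str)` output, the public method names can be listed with `dir`. This is only a minimal sketch; the variable name `public_methods` is illustrative and not part of the original notebook.

```python
# List the public (non-underscore) attribute names of the built-in str type.
public_methods = [name for name in dir(str) if not name.startswith('_')]

print(len(public_methods))   # how many public methods str exposes
print(public_methods)        # includes 'capitalize' and 'isdigit' used above
```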

### Pandas and Text

Pandas offers many more string methods than the ones shown here; see the [pandas text documentation](https://pandas.pydata.org/docs/user_guide/text.html).

In [7]:
import pandas as pd

In [8]:
names = pd.Series(['andrew', 'bobo', 'claire', 'david', '4'])
names

0    andrew
1      bobo
2    claire
3     david
4         4
dtype: object

In [9]:
names.str.capitalize()

0    Andrew
1      Bobo
2    Claire
3     David
4         4
dtype: object

In [10]:
names.str.isdigit()

0    False
1    False
2    False
3    False
4     True
dtype: bool
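The `.str` accessor exposes many other string methods in the same element-wise style. The sketch below applies three of them (`upper`, `len`, and `contains`) to the same `names` Series; the variable names `upper`, `lengths`, and `has_a` are illustrative choices, not from the original notebook.

```python
import pandas as pd

names = pd.Series(['andrew', 'bobo', 'claire', 'david', '4'])

upper = names.str.upper()          # upper-case every element
lengths = names.str.len()          # number of characters in each element
has_a = names.str.contains('a')    # True where the element contains 'a'

print(upper.tolist())
print(lengths.tolist())
print(has_a.tolist())
```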

#### Splitting, Grabbing, and Expanding

In [14]:
tech_finance = ['GOOG, APPL, AMZN', 'JPM, BAC,GS']
len(tech_finance)

2

In [16]:
tickers = pd.Series(tech_finance)
tickers

0    GOOG, APPL, AMZN
1         JPM, BAC,GS
dtype: object

In [17]:
tickers.str.split(',')

0    [GOOG,  APPL,  AMZN]
1         [JPM,  BAC, GS]
dtype: object

In [18]:
tickers.str.split(',').str[0]

0    GOOG
1     JPM
dtype: object

In [19]:
tickers.str.split(',', expand=True)

      0     1     2
0  GOOG  APPL  AMZN
1   JPM   BAC    GS
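Splitting on `','` alone keeps the space that follows each comma, so values such as `' APPL'` still carry leading whitespace. One possible way to tidy this up, sketched here under the assumption that surrounding whitespace is never meaningful in a ticker, is to strip each resulting column. The name `split_clean` is hypothetical.

```python
import pandas as pd

tickers = pd.Series(['GOOG, APPL, AMZN', 'JPM, BAC,GS'])

# Split on commas into columns, then strip surrounding whitespace column by column.
split_clean = tickers.str.split(',', expand=True).apply(lambda col: col.str.strip())

print(split_clean)
```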


#### Cleaning or Editing Strings

In [21]:
messy_names = pd.Series(["andrew  ","bo;bo","  claire  "])
messy_names

0      andrew  
1         bo;bo
2      claire  
dtype: object

In [22]:
messy_names.str.replace(";","")

0      andrew  
1          bobo
2      claire  
dtype: object

In [23]:
messy_names.str.strip()

0    andrew
1     bo;bo
2    claire
dtype: object

In [24]:
messy_names.str.replace(";","").str.strip()

0    andrew
1      bobo
2    claire
dtype: object

In [25]:
messy_names.str.replace(";","").str.strip().str.capitalize()

0    Andrew
1      Bobo
2    Claire
dtype: object

#### Alternative with Custom apply() call

In [26]:
def cleanup(name):
    name = name.replace(";","")
    name = name.strip()
    name = name.capitalize()

    return name

In [27]:
messy_names

0      andrew  
1         bo;bo
2      claire  
dtype: object

In [28]:
messy_names.apply(cleanup)

0    Andrew
1      Bobo
2    Claire
dtype: object

#### Which one is more efficient?

In [29]:
import timeit 
  
# code snippet to be executed only once
setup = '''
import pandas as pd
import numpy as np
messy_names = pd.Series(["andrew  ","bo;bo","  claire  "])
def cleanup(name):
    name = name.replace(";","")
    name = name.strip()
    name = name.capitalize()
    return name
'''
  
# code snippet whose execution time is to be measured 
stmt_pandas_str = ''' 
messy_names.str.replace(";","").str.strip().str.capitalize()
'''

stmt_pandas_apply = '''
messy_names.apply(cleanup)
'''

stmt_pandas_vectorize='''
np.vectorize(cleanup)(messy_names)
'''

In [30]:
pandas_str = timeit.timeit(setup=setup, stmt=stmt_pandas_str, number=10000)
pandas_apply = timeit.timeit(setup=setup, stmt=stmt_pandas_apply, number=10000)
pandas_vectorize = timeit.timeit(setup=setup, stmt=stmt_pandas_vectorize, number=10000)

print(f'pandas_str:{pandas_str} secs, pandas_apply:{pandas_apply} secs, pandas_vectorize:{pandas_vectorize} secs')

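pandas_str:3.3668107120001878 secs, pandas_apply:1.0311018669999612 secs, pandas_vectorize:0.1870466740001575 secs

In this run, on a three-element Series, `np.vectorize` was the fastest (about 0.19 s for 10,000 calls), `apply` came next (about 1.03 s), and the chained `.str` methods were the slowest (about 3.37 s). These timings come from a very small input and will vary with the machine and the size of the Series.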
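Because a three-element Series mostly measures per-call overhead, it can be worth repeating the comparison on a larger input. The sketch below is one way to do that; `big_names` and the `approaches` dictionary are hypothetical names introduced for illustration, and the block makes no claim about which approach wins on any particular machine. It also checks that all three approaches return the same cleaned values.

```python
import timeit

import numpy as np
import pandas as pd


def cleanup(name):
    """Remove semicolons, strip whitespace, and capitalize one string."""
    return name.replace(";", "").strip().capitalize()


# Hypothetical larger input: the three original names repeated many times.
big_names = pd.Series(["andrew  ", "bo;bo", "  claire  "] * 10_000)

approaches = {
    "str accessor": lambda: big_names.str.replace(";", "").str.strip().str.capitalize(),
    "apply": lambda: big_names.apply(cleanup),
    "np.vectorize": lambda: np.vectorize(cleanup)(big_names),
}

# Confirm that all three approaches produce the same cleaned values.
results = {label: list(func()) for label, func in approaches.items()}
assert results["str accessor"] == results["apply"] == results["np.vectorize"]

# Time each approach; the numbers depend on machine, library versions, and input size.
for label, func in approaches.items():
    seconds = timeit.timeit(func, number=10)
    print(f"{label}: {seconds:.4f} secs for 10 runs")
```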


---