
# Text Methods

A normal Python string has a variety of method calls available:

In [1]:
mystring = 'hello'

In [2]:
mystring.capitalize()

'Hello'

In [3]:
mystring.isdigit()

False
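To see every public string method at a glance, one option is to filter the names returned by dir(). This is only a small sketch; the variable names public_methods and shouted are ours and do not appear elsewhere in this notebook.

```python
# List the public (non-underscore) methods available on str.
public_methods = [name for name in dir(str) if not name.startswith('_')]

print(len(public_methods))
print(public_methods[:5])

# Each method returns a new string, so calls can be chained.
shouted = '  hello  '.strip().capitalize().center(11, '*')
print(shouted)
```

Chaining works because strings are immutable: every call hands back a fresh string for the next call to act on.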

In [4]:
help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |  
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __format__(self, format_spec, /)
 |      Return a formatted version of the string as described by format_spec.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  

# Pandas and Text

Pandas can do a lot more than what we show here. Full online documentation on things like advanced string indexing and regular expressions with pandas can be found here: https://pandas.pydata.org/docs/user_guide/text.html

## Text Methods on Pandas String Column

In [5]:
import pandas as pd

In [6]:
names = pd.Series(['andrew','bobo','claire','david','4'])

In [7]:
names

0    andrew
1      bobo
2    claire
3     david
4         4
dtype: object

In [8]:
names.str.capitalize()

0    Andrew
1      Bobo
2    Claire
3     David
4         4
dtype: object

In [9]:
names.str.isdigit()

0    False
1    False
2    False
3    False
4     True
dtype: bool
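As a rough illustration of what the .str accessor is doing, the sketch below compares it with a plain Python loop over the same values, and shows that the accessor is not available on a numeric Series. The names via_accessor, via_loop, and numbers are introduced here for illustration only.

```python
import pandas as pd

names = pd.Series(['andrew', 'bobo', 'claire', 'david', '4'])

# The .str accessor applies the string method to each element.
via_accessor = names.str.capitalize()

# Equivalent plain-Python loop, shown only for comparison.
via_loop = pd.Series([name.capitalize() for name in names])
same_result = via_accessor.tolist() == via_loop.tolist()
print(same_result)

# On a numeric Series the accessor is unavailable and raises AttributeError.
numbers = pd.Series([1, 2, 3])
try:
    numbers.str.capitalize()
    accessor_failed = False
except AttributeError as err:
    accessor_failed = True
    print(err)
```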

## Splitting, Grabbing, and Expanding

In [10]:
tech_finance = ['GOOG,APPL,AMZN','JPM,BAC,GS']

In [11]:
len(tech_finance)

2

In [12]:
tickers = pd.Series(tech_finance)

In [13]:
tickers

0    GOOG,APPL,AMZN
1        JPM,BAC,GS
dtype: object

In [14]:
tickers.str.split(',')

0    [GOOG, APPL, AMZN]
1        [JPM, BAC, GS]
dtype: object

In [15]:
tickers.str.split(',').str[0]

0    GOOG
1     JPM
dtype: object

In [16]:
tickers.str.split(',',expand=True)

      0     1     2
0  GOOG  APPL  AMZN
1   JPM   BAC    GS
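A few variations on the split calls above may be useful in practice. The sketch below names the expanded columns, limits the number of splits with n, and uses .str.get() as an alternative to .str[...]. The column names first, second, and third are hypothetical labels chosen for this example.

```python
import pandas as pd

tickers = pd.Series(['GOOG,APPL,AMZN', 'JPM,BAC,GS'])

# Expand into columns, then give the columns readable (hypothetical) names.
expanded = tickers.str.split(',', expand=True)
expanded.columns = ['first', 'second', 'third']

# n=1 stops after the first split, leaving the remainder in one piece.
head_tail = tickers.str.split(',', n=1, expand=True)

# .str.get(i) mirrors .str[i]; a position past the end yields a missing value.
missing = tickers.str.split(',').str.get(5)

print(expanded)
print(head_tail)
print(missing.isna().all())
```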


## Cleaning or Editing Strings

In [17]:
messy_names = pd.Series(["andrew  ","bo;bo","  claire  "])

In [18]:
# Notice the "mis-alignment" on the right hand side due to spacing in "andrew  " and "  claire  "
messy_names

0      andrew  
1         bo;bo
2      claire  
dtype: object

In [19]:
messy_names.str.replace(";","")

0      andrew  
1          bobo
2      claire  
dtype: object

In [20]:
messy_names.str.strip()

0    andrew
1     bo;bo
2    claire
dtype: object

In [21]:
messy_names.str.replace(";","").str.strip()

0    andrew
1      bobo
2    claire
dtype: object

In [22]:
messy_names.str.replace(";","").str.strip().str.capitalize()

0    Andrew
1      Bobo
2    Claire
dtype: object
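One detail worth knowing about .str.replace() is its regex parameter, whose default has differed across pandas versions, so passing it explicitly avoids surprises. The sketch below is an illustration under that assumption; the Series messy and its values are made up for this example.

```python
import pandas as pd

messy = pd.Series(['bo;bo', 'cl.ai.re', '  da;vi.d  '])

# regex=False: the pattern is a literal string, so '.' matches only a dot.
no_dots = messy.str.replace('.', '', regex=False)

# regex=True: a character class removes ';' and '.' in a single pass.
cleaned = (
    messy.str.replace(r'[;.]', '', regex=True)
    .str.strip()
    .str.capitalize()
)

print(no_dots.tolist())
print(cleaned.tolist())
```

Using one regular expression for several unwanted characters keeps the chain shorter than calling replace once per character.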

## Alternative with Custom apply() call

In [23]:
def cleanup(name):
    name = name.replace(";","")
    name = name.strip()
    name = name.capitalize()
    return name

In [24]:
messy_names

0      andrew  
1         bo;bo
2      claire  
dtype: object

In [25]:
messy_names.apply(cleanup)

0    Andrew
1      Bobo
2    Claire
dtype: object
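A custom function passed to apply() receives each value as-is, so a missing value would make cleanup() fail because it is not a string. A minimal sketch of one way to guard against that follows; cleanup_safe and with_gap are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd

def cleanup_safe(name):
    # Hypothetical variant of cleanup(): pass non-strings (e.g. NaN) through.
    if not isinstance(name, str):
        return name
    return name.replace(';', '').strip().capitalize()

# One entry is missing on purpose.
with_gap = pd.Series(['andrew  ', None, 'bo;bo'])
result = with_gap.apply(cleanup_safe)
print(result)
```

The .str methods skip missing values on their own, which is one reason to prefer them when the data is not fully clean.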

## Which one is more efficient?

In [26]:
import timeit 
  
# code snippet to be executed only once 
setup = '''
import pandas as pd
import numpy as np
messy_names = pd.Series(["andrew  ","bo;bo","  claire  "])
def cleanup(name):
    name = name.replace(";","")
    name = name.strip()
    name = name.capitalize()
    return name
'''
  
# code snippet whose execution time is to be measured 
stmt_pandas_str = ''' 
messy_names.str.replace(";","").str.strip().str.capitalize()
'''

stmt_pandas_apply = '''
messy_names.apply(cleanup)
'''

stmt_pandas_vectorize='''
np.vectorize(cleanup)(messy_names)
'''

In [27]:
timeit.timeit(setup=setup,
              stmt=stmt_pandas_str,
              number=10000)

9.318778199999997

In [28]:
timeit.timeit(setup=setup,
              stmt=stmt_pandas_apply,
              number=10000)

2.5986859999999297

In [29]:
timeit.timeit(setup=setup,
              stmt=stmt_pandas_vectorize,
              number=10000)

0.930248000000006

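Wow! While the .str methods can be extremely convenient, when it comes to performance, don't forget about np.vectorize()! In the runs above, on a three-element Series repeated 10,000 times, the .str chain took about 9.3 seconds, apply() about 2.6 seconds, and np.vectorize() about 0.9 seconds. Relative timings can change with larger Series and with other pandas versions, so it is worth measuring on your own data. Review the "Useful Methods" lecture for a deeper discussion of np.vectorize().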
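Because the benchmark in this section uses only three names, per-call overhead makes up much of the measured time. The sketch below repeats the comparison on a larger Series; the size (15,000 rows), the repeat count, and the names big_names, candidates, and results are assumptions chosen for illustration. No timings are quoted here, since they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

def cleanup(name):
    name = name.replace(';', '')
    name = name.strip()
    name = name.capitalize()
    return name

# Assumption: 15,000 rows built by repeating the three sample names.
big_names = pd.Series(['andrew  ', 'bo;bo', '  claire  '] * 5000)

candidates = {
    'str accessor': lambda: (
        big_names.str.replace(';', '', regex=False).str.strip().str.capitalize()
    ),
    'apply': lambda: big_names.apply(cleanup),
    'np.vectorize': lambda: np.vectorize(cleanup)(big_names),
}

# All three approaches should agree before their speed is compared.
results = {label: [str(v) for v in func()] for label, func in candidates.items()}

for label, func in candidates.items():
    seconds = timeit.timeit(func, number=20)
    print(f'{label}: {seconds:.4f} s for 20 runs')
```

Passing a callable to timeit.timeit() avoids the string-based setup used earlier and keeps the timed code in the same namespace as the data.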