# 03.10 Working with Strings

Vectorized operations simplify the syntax of operating on arrays of data, but NumPy offers this convenience for numerical arrays only. For arrays of strings, NumPy does not provide such simple access, so you are left with a more verbose loop syntax:

In [1]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
# a missing value (None) in the list would make this raise an AttributeError

['Peter', 'Paul', 'Mary', 'Guido']
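To make the failure mode concrete, here is a minimal sketch of the same list comprehension run on data that contains a `None`. The guarded version at the end is only one possible workaround, shown for comparison.

```python
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# The plain comprehension fails as soon as it reaches the None entry
try:
    [s.capitalize() for s in data]
except AttributeError as err:
    message = str(err)
    print("AttributeError:", message)

# One workaround: only call the method on real strings, pass everything else through
capitalized = [s.capitalize() if isinstance(s, str) else s for s in data]
print(capitalized)
```

The explicit `isinstance` check works, but it has to be repeated for every string operation, which is the boilerplate the pandas `str` accessor removes.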

Pandas includes features that address both needs: vectorized string operations and correct handling of missing data. Both are available through the str attribute of pandas Series and Index objects containing strings.

In [1]:
import pandas as pd
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

In [5]:
names.str.capitalize()
# the str attribute exposes all of the vectorized string methods

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
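As a small sketch of how the missing value propagates through other `str` methods, the block below applies `len()` and `upper()` to the same data and also uses the accessor on an Index. The variable names are illustrative only.

```python
import pandas as pd

names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# Each .str method skips the missing entry instead of raising an error
lengths = names.str.len()      # the missing entry stays missing
upper = names.str.upper()
is_missing = names.isna()      # boolean mask marking the missing entry

# The same accessor is available on Index objects
index_caps = pd.Index(['peter', 'Paul']).str.capitalize()

print(lengths)
print(upper)
print(index_caps)
```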

## Tables of Pandas String methods

### Methods similar to Python string methods

The following Pandas str methods mirror Python string methods:

    len()	lower()	translate()	islower()
    ljust()	upper()	startswith()	isupper()
    rjust()	find()	endswith()	isnumeric()
    center()	rfind()	isalnum()	isdecimal()
    zfill()	index()	isalpha()	split()
    strip()	rindex()	isdigit()	rsplit()
    rstrip()	capitalize()	isspace()	partition()
    lstrip()	swapcase()	istitle()	rpartition()


In [6]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

In [13]:
monte.str.split(' ', expand=True)

         0        1
0   Graham  Chapman
1     John   Cleese
2    Terry  Gilliam
3     Eric     Idle
4    Terry    Jones
5  Michael    Palin
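A few more of the mirrored methods from the table, as a short sketch on the same Series. Each call returns a new Series aligned with the original index.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

lowered = monte.str.lower()                 # Series of strings
lengths = monte.str.len()                   # Series of integers
starts_with_t = monte.str.startswith('T')   # Series of booleans

print(lowered)
print(lengths)
print(starts_with_t)
```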


### Methods using regular expressions

Several methods accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in re module:

    Method	Description
    match()	Call re.match() on each element, returning a boolean.
    extract()	Call re.search() on each element, returning the groups of the first match as strings.
    findall()	Call re.findall() on each element
    replace()	Replace occurrences of pattern with some other string
    contains()	Call re.search() on each element, returning a boolean
    count()	Count occurrences of pattern
    split()	Equivalent to str.split(), but accepts regexps
    rsplit()	Equivalent to str.rsplit(), but accepts regexps

In [25]:
# extract the first contiguous group of letters (here, the first name)
monte.str.extract(r'([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

In [27]:
# find names that start (^) and end ($) with a consonant
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
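To connect these methods back to the `re` module, the sketch below compares `str.findall` with an explicit loop over `re.findall`, and contrasts `contains()` (searches anywhere in the string) with `match()` (anchored at the start). The variable names are illustrative only.

```python
import re
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

pattern = r'^[^AEIOU].*[^aeiou]$'

# Element-by-element loop with the re module ...
by_hand = [re.findall(pattern, name) for name in monte]
# ... and the vectorized pandas equivalent
vectorized = monte.str.findall(pattern).tolist()

# contains() finds 'Jones' anywhere; match() only succeeds at the start of the string
contains_jones = monte.str.contains('Jones')
match_jones = monte.str.match('Jones')
match_terry = monte.str.match('Terry')

print(by_hand == vectorized)
print(contains_jones.tolist(), match_jones.tolist())
```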

### Miscellaneous methods

Finally, there are some miscellaneous methods that enable other convenient operations:

    Method	Description
    get()	Index each element
    slice()	Slice each element
    slice_replace()	Replace slice in each element with passed value
    cat()	Concatenate strings
    repeat()	Repeat values
    normalize()	Return Unicode form of string
    pad()	Add whitespace to left, right, or both sides of strings
    wrap()	Split long strings into lines with length less than a given width
    join()	Join strings in each element of the Series with passed separator
    get_dummies()	Extract dummy variables as a DataFrame


In [30]:
monte.str.slice(0,3)

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

In [35]:
monte.str.split().str.get(0)  # first name

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

In [36]:
monte.str.split().str.get(1)  # last name

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
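The `get()` and `slice()` calls also have an indexing shorthand: the `str` accessor supports Python's square-bracket syntax. A brief sketch, using a negative index to pick the last word of each name:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

first_three = monte.str[0:3]              # same as monte.str.slice(0, 3)
last_names = monte.str.split().str[-1]    # same as monte.str.split().str.get(-1)

print(first_three)
print(last_names)
```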

In [38]:
# a column may hold coded indicators, here letters separated by '|'
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

             name   info
0  Graham Chapman  B|C|D
1     John Cleese    B|D
2   Terry Gilliam    A|C
3       Eric Idle    B|D
4     Terry Jones    B|C
5   Michael Palin  B|C|D


In [41]:
full_monte['info'].str.get_dummies('|')

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
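To show what `get_dummies()` computes, the sketch below rebuilds the same indicator table with plain Python and then uses one indicator column as a filter. The manual version is only an illustration of the idea, not how pandas implements it.

```python
import pandas as pd

info = pd.Series(['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D'])

# Vectorized version: one 0/1 column per distinct code
dummies = info.str.get_dummies('|')

# Manual equivalent: collect the distinct codes, then test membership row by row
codes = sorted({code for entry in info for code in entry.split('|')})
manual = [[int(code in entry.split('|')) for code in codes] for entry in info]

# An indicator column can be used directly as a row filter
rows_with_c = info[dummies['C'] == 1]

print(codes)
print(rows_with_c)
```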


In [3]:
# the recipe database is not available here because of a proxy, so use a downloaded sample
try:
    recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
    print("ValueError:", e)

ValueError: Trailing data


Each line of this file is itself valid JSON, but the file as a whole is not, which is why pd.read_json raises "Trailing data". One workaround is to join the lines into a single JSON array:

In [6]:
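from io import StringIO

# build a single JSON array string from the line-delimited file
with open('recipeitems-latest.json', 'r') as f:
    # strip surrounding whitespace from each line (lazily, via a generator)
    data = (line.strip() for line in f)
    # join the lines with commas and wrap them in [ ] to form one JSON array
    data_json = "[{0}]".format(','.join(data))
# parse the result; wrapping in StringIO avoids the deprecated literal-string input
recipes = pd.read_json(StringIO(data_json))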
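pandas can also read line-delimited JSON directly through the `lines=True` option of `pd.read_json`. The sketch below assumes a hypothetical two-record file written to a temporary directory; the file name `sample.json` and the records are made up for illustration and are not part of the recipe database.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical records standing in for the recipe file (one JSON object per line)
records = [{"name": "Toast", "ingredients": "bread\nbutter"},
           {"name": "Tea", "ingredients": "water\ntea leaves"}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.json")
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # lines=True tells pandas to parse each line as a separate JSON object
    sample = pd.read_json(path, lines=True)

print(sample.shape)
```

With this option there is no need to build the intermediate JSON array string by hand.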

In [7]:
recipes.shape

(1042, 9)

In [8]:
recipes.iloc[0]

name                                      Easter Leftover Sandwich
ingredients      12 whole Hard Boiled Eggs\n1/2 cup Mayonnaise\...
url              http://thepioneerwoman.com/cooking/2013/04/eas...
image            http://static.thepioneerwoman.com/cooking/file...
cookTime                                                        PT
recipeYield                                                      8
datePublished                                           2013-04-01
prepTime                                                     PT15M
description      Got leftover Easter eggs?    Got leftover East...
Name: 0, dtype: object

In [11]:
recipes.description.str.contains(r'[Bb]reakfast').sum()

11

### A simple recipe recommender

Given a list of ingredients, find a recipe that uses all those ingredients.

In [12]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

In [None]:
import re

In [22]:
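# NOTE: the second positional argument of str.contains is `case`, not `flags`,
# so this match is still case-sensitive; pass flags=re.IGNORECASE to ignore case
print('tarragon', recipes.ingredients.str.contains('tarragon', re.IGNORECASE))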

tarragon 0       False
1       False
2       False
3       False
4       False
        ...  
1037    False
1038    False
1039    False
1040    False
1041    False
Name: ingredients, Length: 1042, dtype: bool
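Because `str.contains` takes `case` as its second parameter and `flags` as its third, a case-insensitive match is safest when requested by keyword. The sketch below uses a small made-up Series (not the recipe data) to compare the default behavior with the two keyword forms.

```python
import re
import pandas as pd

# Hypothetical ingredient strings, for illustration only
ingredients = pd.Series(['1 tsp Tarragon', 'fresh tarragon', 'salt'])

case_sensitive = ingredients.str.contains('tarragon')                    # default: case matters
by_flags = ingredients.str.contains('tarragon', flags=re.IGNORECASE)     # regex flag by keyword
by_case = ingredients.str.contains('tarragon', case=False)               # pandas-level switch

print(case_sensitive.tolist())
print(by_flags.tolist())
print(by_case.tolist())
```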


In [14]:

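# one boolean column per spice; re.IGNORECASE is passed positionally here, so it
# lands in `case` rather than `flags` (use flags=re.IGNORECASE to ignore case)
spice_df = pd.DataFrame({spice: recipes.ingredients.str.contains(spice, re.IGNORECASE)
                         for spice in spice_list})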

In [15]:
spice_df

       salt  pepper  oregano   sage  parsley  rosemary  tarragon  thyme  paprika  cumin
0     False   False    False  False    False     False     False  False    False  False
1     False   False    False  False    False     False     False  False    False  False
2     False   False    False  False    False     False     False  False    False  False
3     False   False    False  False    False     False     False  False    False  False
4     False   False    False  False    False     False     False  False    False  False
...     ...     ...      ...    ...      ...       ...       ...    ...      ...    ...
1037   True   False    False  False    False     False     False  False    False  False
1038   True   False    False  False    False     False     False  False    False  False
1039  False    True    False  False    False     False     False  False    False  False
1040   True   False    False  False    False     False     False  False    False  False


In [23]:
selection = spice_df.query('parsley | paprika')
len(selection)

31
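The query above uses `|`, which selects recipes containing either spice. To require every listed spice, `&` or a row-wise `all()` can be used instead. The sketch below runs on a tiny made-up table; the recipe names and boolean values are hypothetical.

```python
import pandas as pd

# Hypothetical recipe names and spice indicators, for illustration only
names = pd.Series(['Herb Chicken', 'Plain Rice', 'Paprika Potatoes', 'Green Salad'])
spice_df = pd.DataFrame({'parsley': [True, False, False, True],
                         'paprika': [True, False, True, False]})

either = spice_df.query('parsley | paprika')    # at least one of the spices
both = spice_df.query('parsley & paprika')      # every listed spice

# Equivalent to the & query, but scales to a longer list of columns
wanted = ['parsley', 'paprika']
all_mask = spice_df[wanted].all(axis=1)

# The selection keeps the original index, so it can look up the recipe names
names_both = names[both.index].tolist()

print(len(either), len(both))
print(names_both)
```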

In [31]:
# recipes.name[selection.index]
# recipes.loc[selection.index,'name']

# 03.11 Working with Time Series

## Dates and Times in Python

In [1]:
from datetime import datetime
datetime(year=2019, month=7, day=4)

datetime.datetime(2019, 7, 4, 0, 0)

In [5]:
from dateutil import parser
date = parser.parse('4th of july, 2019')
date.strftime('%A')

'Thursday'
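`strftime` accepts other standard format codes besides `%A`, and `strptime` performs the reverse conversion when the format is known in advance. A minimal sketch using only the standard library:

```python
from datetime import datetime

date = datetime(year=2019, month=7, day=4)

weekday = date.strftime('%A')          # full weekday name
iso_day = date.strftime('%Y-%m-%d')    # year-month-day

# strptime parses a string back into a datetime, given an explicit format
parsed = datetime.strptime('2019-07-04', '%Y-%m-%d')

print(weekday, iso_day, parsed == date)
```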

In [15]:
import numpy as np
import pandas as pd

In [9]:
# a plain datetime object cannot be offset by an array of integers
date + np.arange(12)

TypeError: unsupported operand type(s) for +: 'datetime.datetime' and 'int'
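With the standard library alone, a run of consecutive dates has to be built element by element with `timedelta`. A short sketch of that loop, for comparison with the vectorized approaches:

```python
from datetime import datetime, timedelta

date = datetime(year=2019, month=7, day=4)

# datetime supports + with timedelta, but only one value at a time
dates = [date + timedelta(days=i) for i in range(12)]

print(dates[0], dates[-1])
```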

In [12]:
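# a datetime64 built from a datetime has microsecond resolution,
# so the integers are added as microseconds, not days
np.datetime64(date) + np.arange(12)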

array(['2019-07-04T00:00:00.000000', '2019-07-04T00:00:00.000001',
       '2019-07-04T00:00:00.000002', '2019-07-04T00:00:00.000003',
       '2019-07-04T00:00:00.000004', '2019-07-04T00:00:00.000005',
       '2019-07-04T00:00:00.000006', '2019-07-04T00:00:00.000007',
       '2019-07-04T00:00:00.000008', '2019-07-04T00:00:00.000009',
       '2019-07-04T00:00:00.000010', '2019-07-04T00:00:00.000011'],
      dtype='datetime64[us]')
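The integers are interpreted in the unit of the `datetime64` value, which is why the result above steps by microseconds. The sketch below shows two ways to step by days instead: start from a day-resolution value, or multiply the integers by an explicit `timedelta64`.

```python
from datetime import datetime

import numpy as np

# Option 1: a date-only string gives day resolution, so integers mean days
days = np.datetime64('2019-07-04') + np.arange(12)

# Option 2: keep the microsecond-resolution value and add explicit day offsets
stamp = np.datetime64(datetime(2019, 7, 4))
days_from_stamp = stamp + np.arange(12) * np.timedelta64(1, 'D')

print(days.dtype, days[-1])
print(days_from_stamp.dtype, days_from_stamp[-1])
```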

In [18]:
date + pd.to_timedelta(np.arange(12), 'D')

DatetimeIndex(['2019-07-04', '2019-07-05', '2019-07-06', '2019-07-07',
               '2019-07-08', '2019-07-09', '2019-07-10', '2019-07-11',
               '2019-07-12', '2019-07-13', '2019-07-14', '2019-07-15'],
              dtype='datetime64[ns]', freq=None)
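For regularly spaced dates like these, `pd.date_range` is a more direct way to build the index. A minimal sketch producing the same twelve days:

```python
import pandas as pd

# Twelve consecutive days starting on 2019-07-04
idx = pd.date_range('2019-07-04', periods=12, freq='D')

print(idx)
```

Unlike the timedelta addition, the resulting index records its frequency, which later resampling and shifting operations can make use of.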