---
title: String in Pandas Example
tags: [jupyter]
keywords: pandas
summary: "Example of manipulating strings in pandas."
mlType: dataFrame
infoType: pandas
sidebar: pandas_sidebar
permalink: __AutoGenThis__
notebookfilename:  __AutoGenThis__
---

This is an overview of the [string](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html) manipulations you can do in pandas. The example is adapted from the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html).

In [1]:
import sys

sys.path.append("../")

In [36]:
import pandas as pd
import numpy as np
from pprint import pprint

# Pandas Options

In [17]:
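# Use the full option name; the short form 'max_rows' can be ambiguous in newer pandas versions
pd.set_option('display.max_rows', 20)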

# I/O

In [None]:
C:\Cerebral\_MarioSandBox\Python\projectPage_pythonPlayground\git_projectCodes\gitRepo\notebooks

In [7]:
jsonLocation = '../../../../../DB/mlPlayground/recipeitems-latest.json'

In [12]:
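import io

# Read the entire file into one JSON array string.
# Calling pd.read_json on the file directly raises a ValueError,
# apparently because the file holds one JSON object per line rather than a single document.
with open(jsonLocation, 'r', encoding="utf8") as f:
    # Strip the whitespace around each line
    data = (line.strip() for line in f)
    # Join the lines so that each line becomes one element of a JSON list
    data_json = "[{0}]".format(','.join(data))
# Parse the result as JSON (wrapped in StringIO because newer pandas
# versions deprecate passing a literal JSON string)
recipes = pd.read_json(io.StringIO(data_json))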
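
As an alternative to joining the lines by hand, `pd.read_json` has a `lines=True` option for files that hold one JSON object per line. The following is a minimal sketch on a small in-memory sample; the two records are made up for illustration, and whether this option handles the full recipe file was not checked here.

```python
import io
import pandas as pd

# Two hypothetical records, one JSON object per line
sample = io.StringIO(
    '{"name": "Toast", "ingredients": "bread\\nbutter"}\n'
    '{"name": "Tea", "ingredients": "water\\ntea leaves\\nmilk"}\n'
)

# lines=True tells read_json to parse every line as a separate JSON object
sample_recipes = pd.read_json(sample, lines=True)
print(sample_recipes.shape)
```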

In [13]:
recipes.shape

(173278, 17)

There are 173278 recipes with 17 different attributes.

In [14]:
recipes.columns

Index(['_id', 'name', 'ingredients', 'url', 'image', 'ts', 'cookTime',
       'source', 'recipeYield', 'datePublished', 'prepTime', 'description',
       'totalTime', 'creator', 'recipeCategory', 'dateModified',
       'recipeInstructions'],
      dtype='object')

In [18]:
recipes.iloc[0]

_id                                {'$oid': '5160756b96cc62079cc2db15'}
name                                    Drop Biscuits and Sausage Gravy
ingredients           Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
url                   http://thepioneerwoman.com/cooking/2013/03/dro...
image                 http://static.thepioneerwoman.com/cooking/file...
ts                                             {'$date': 1365276011104}
cookTime                                                          PT30M
source                                                  thepioneerwoman
recipeYield                                                          12
datePublished                                                2013-03-11
prepTime                                                          PT10M
description           Late Saturday afternoon, after Marlboro Man ha...
totalTime                                                           NaN
creator                                                         

## Simple Recipe Complex Identifier

This looks messy, but it gives us some idea of what the dataset contains. For instance, we can treat the number of ingredients as a measure of a recipe's complexity. Let's use what we learned to parse each recipe and add a new column to the DF holding its number of ingredients.

In [22]:
recipes.dtypes

_id                   object
name                  object
ingredients           object
url                   object
image                 object
ts                    object
cookTime              object
source                object
recipeYield           object
datePublished         object
prepTime              object
description           object
totalTime             object
creator               object
recipeCategory        object
dateModified          object
recipeInstructions    object
dtype: object

These are the steps to identify the number of ingredients used:

- convert the column to string
- using string methods, split the ingredients on '\n', assuming that each line contains one ingredient
- apply a function to the resulting Series of lists to obtain the length of each list
- add the result to recipes as a new column called **numIngredients**

Once the number of ingredients is known, we can identify the recipe with the largest number of ingredients.
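
The steps can be tried out on a small, made-up DataFrame first. This is only a sketch: the `toy` data is hypothetical, and it uses `.str.len()` where the notebook cell uses `.apply` with `len`.

```python
import pandas as pd

# Hypothetical mini data set; each ingredient sits on its own line
toy = pd.DataFrame({
    "name": ["Toast", "Tea"],
    "ingredients": ["bread\nbutter", "water\ntea leaves\nmilk"],
})

# 1. make sure the column holds strings
as_text = toy["ingredients"].astype("str")
# 2. split every entry on the newline character, giving one list per recipe
parts = as_text.str.split("\n")
# 3. and 4. store the length of each list in a new column
toy["numIngredients"] = parts.str.len()
print(toy["numIngredients"].tolist())  # [2, 3]
```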

In [35]:
recipes['numIngredients']=recipes.ingredients.astype('str').str.split('\n').apply(lambda x: len(x))

In [37]:
recipes.name[np.argmax(recipes.numIngredients)]

'Tiffin: a selection of Indian street food'

In [38]:
recipes.numIngredients[np.argmax(recipes.numIngredients)]

129
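
Note that `np.argmax` returns a position, which coincides with the row label here only because `recipes` has the default 0..n-1 index. A label-based alternative is `Series.idxmax()`. A minimal sketch on a made-up Series with non-default labels:

```python
import pandas as pd

# Hypothetical ingredient counts with a non-default index
counts = pd.Series([4, 12, 7], index=[10, 20, 30])

label = counts.idxmax()        # label of the row holding the largest value
print(label, counts[label])    # 20 12
```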

## Simple Recipe Recommender

Let's say we want recipes that contain the following spices:

```
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
```

There are several ways to do this. One option, using what we have learned so far, is to add two columns to the DF: `used`, which is true if a recipe contains any of the spices, and `numUsed`, an int holding how many of the listed spices the recipe contains. Both can be filled in with the ```.apply``` method, which applies a function to each row.

The other option is the string route: create a DF with the same number of rows as recipes but with one column per spice of interest. This gives us a DF in which we can search the ingredients quickly and efficiently using the ```query()``` command. Both options are easy, so we do both to see if we get the same result.

### Method 1

In [39]:
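def getNumUsed(ingredientString, spice_list=None):
    """Count how many spices from spice_list occur in ingredientString."""
    if spice_list is None:
        spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
                      'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

    # Case-sensitive substring test; each spice counts at most once
    count = 0
    for spice in spice_list:
        if spice in ingredientString:
            count += 1

    return count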

In [41]:
def getUsed(ingredientString, spice_list=None):
    """Return True if at least one spice from spice_list occurs in ingredientString."""
    if spice_list is None:
        spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
                      'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

    # any() stops at the first spice that is found
    return any(spice in ingredientString for spice in spice_list)

In [46]:
dfSpiceList = recipes.copy()
spiceList = ['parsley','paprika','tarragon']
# 'used' and 'numUsed' are computed against the default ten-spice list
dfSpiceList['used'] = recipes.ingredients.apply(getUsed)
dfSpiceList['numUsed'] = recipes.ingredients.apply(getNumUsed)

In [62]:
# A recipe is 'found' only if it contains all three spices in spiceList
dfSpiceList['found'] = recipes.ingredients.apply(lambda x: getNumUsed(x, spice_list=spiceList) == len(spiceList))

In [64]:
dfSpiceList.name[dfSpiceList.found]

2069      All cremat with a Little Gem, dandelion and wa...
74964                         Lobster with Thermidor butter
93768      Burton's Southern Fried Chicken with White Gravy
113926                     Mijo's Slow Cooker Shredded Beef
137686                     Asparagus Soup with Poached Eggs
140530                                 Fried Oyster Po’boys
158475                Lamb shank tagine with herb tabbouleh
158486                 Southern fried chicken in buttermilk
163175            Fried Chicken Sliders with Pickles + Slaw
165243                        Bar Tartine Cauliflower Salad
Name: name, dtype: object

### Method 2

In [67]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

Apply a regular expression to check whether each spice is contained in the ingredients column.

In [68]:
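import re
# One boolean column per spice: True where the ingredients text contains it.
# Note: the second positional argument of str.contains is `case`, not `flags`,
# so re.IGNORECASE is read as case=True here and the match stays case-sensitive.
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list))
spice_df.head()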

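|   | salt | pepper | oregano | sage | parsley | rosemary | tarragon | thyme | paprika | cumin |
|---|------|--------|---------|------|---------|----------|----------|-------|---------|-------|
| 0 | False | False | False | True | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False |
| 2 | True | True | False | False | False | False | False | False | False | True |
| 3 | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False |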
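
Two things are worth keeping in mind when matching this way: `str.contains` looks for substrings (so 'sage' also matches 'sausage'), and it is case-sensitive unless told otherwise. The sketch below uses made-up ingredient strings to compare the plain match with `case=False` and with a word-boundary pattern; it is an illustration, not the approach used for the outputs in this notebook.

```python
import pandas as pd

# Hypothetical ingredient strings
ingredients = pd.Series(["1 lb breakfast sausage", "2 tsp Sage", "fresh sage leaves"])

plain = ingredients.str.contains("sage")                         # substring, case-sensitive
no_case = ingredients.str.contains("sage", case=False)           # substring, any case
whole_word = ingredients.str.contains(r"\bsage\b", case=False)   # whole word only, any case

print(plain.tolist())       # [True, False, True]
print(no_case.tolist())     # [True, True, True]
print(whole_word.tolist())  # [False, True, True]
```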


Use **query()** or **eval()** to select the rows where all three spices are present:

In [69]:
selection = spice_df.query('parsley & paprika & tarragon')
len(selection)

10
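
To see what the query expression does, here is a small sketch on a made-up frame of boolean flags (the `flags` data is hypothetical). It shows that `query()` with `&` between boolean columns selects the same rows as an explicit boolean mask.

```python
import pandas as pd

# Hypothetical boolean flags, one column per spice
flags = pd.DataFrame({
    "parsley":  [True, True, False],
    "paprika":  [True, False, False],
    "tarragon": [True, True, True],
})

# Column names can be used directly inside the query string
by_query = flags.query("parsley & paprika & tarragon")
# Equivalent selection written as a boolean mask
by_mask = flags[flags["parsley"] & flags["paprika"] & flags["tarragon"]]

print(by_query.index.tolist())  # [0]
```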

In [70]:
recipes.name[selection.index]

2069      All cremat with a Little Gem, dandelion and wa...
74964                         Lobster with Thermidor butter
93768      Burton's Southern Fried Chicken with White Gravy
113926                     Mijo's Slow Cooker Shredded Beef
137686                     Asparagus Soup with Poached Eggs
140530                                 Fried Oyster Po’boys
158475                Lamb shank tagine with herb tabbouleh
158486                 Southern fried chicken in buttermilk
163175            Fried Chicken Sliders with Pickles + Slaw
165243                        Bar Tartine Cauliflower Salad
Name: name, dtype: object

Notice that both methods retrieve exactly the same recipes.
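
The agreement can also be checked in code rather than by eye. A minimal sketch on made-up ingredient strings, assuming the same three spices; it uses `all(...)` in place of the `getNumUsed(...) == 3` test for Method 1 and `DataFrame.all(axis=1)` in place of `query()` for Method 2.

```python
import pandas as pd

wanted = ["parsley", "paprika", "tarragon"]
# Hypothetical ingredient strings
toy = pd.Series([
    "parsley\npaprika\ntarragon\nsalt",
    "parsley\nsalt",
    "tarragon, paprika and parsley",
])

# Method 1 style: plain Python substring test per row
method1 = toy.apply(lambda text: all(spice in text for spice in wanted))

# Method 2 style: one boolean column per spice, then require all of them
spice_flags = pd.DataFrame({spice: toy.str.contains(spice) for spice in wanted})
method2 = spice_flags.all(axis=1)

print(method1.tolist() == method2.tolist())  # True
```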