# String Manipulation, Regex, and Lambda Functions

## Python native string operations 

String manipulation is a common task in data analysis. It includes parsing strings, splitting them apart, searching them, and substituting parts of them. Python has built-in tools for basic string manipulation, some of which we've used in previous modules.

Let's bring in a string-heavy dataset to demonstrate with. We will use the file `most_backed.csv`, which contains data on the most-backed campaigns on the crowdfunding platform Kickstarter, found [here](https://www.kaggle.com/datasets/socathie/kickstarter-project-statistics).

In [1]:
import pandas as pd

kst = pd.read_csv('most_backed_edit.csv', nrows = 100)
kst.drop(columns = 'Unnamed: 0', inplace = True)
kst.head(3)

Unnamed: 0,amt.pledged,blurb,by,category,currency,goal,location,num.backers,num.backers.tier,pledge.tier,title,url
0,8782571.0,\nThis is a card game for people who are into ...,Elan Lee,Tabletop Games,usd,10000.0,"Los Angeles, CA",219382,"[15505, 202934, 200, 5]","[20.0, 35.0, 100.0, 500.0]",Exploding Kittens,/projects/elanlee/exploding-kittens
1,6465690.0,"\nAn unusually addicting, high-quality desk to...",Matthew and Mark McLachlan,Product Design,usd,15000.0,"Denver, CO",154926,"[788, 250, 43073, 21796, 41727, 21627, 12215, ...","[1.0, 14.0, 19.0, 19.0, 35.0, 35.0, 79.0, 79.0...",Fidget Cube: A Vinyl Desk Toy,/projects/antsylabs/fidget-cube-a-vinyl-desk-toy
2,5408916.0,\nBring Reading Rainbow’s library of interacti...,LeVar Burton & Reading Rainbow,Web,usd,1000000.0,"Los Angeles, CA",105857,"[19639, 14343, 9136, 2259, 5666, 24512, 4957, ...","[5.0, 10.0, 25.0, 30.0, 35.0, 50.0, 75.0, 100....","Bring Reading Rainbow Back for Every Child, Ev...",/projects/readingrainbow/bring-reading-rainbow...


You should already be familiar with string concatenation, where the `+` operator joins characters or strings together. For example, we can alter the `url` column to hold the full URL rather than the path alone, so that `/projects/elanlee/exploding-kittens` becomes `www.kickstarter.com/projects/elanlee/exploding-kittens`.

We can create a string that contains the domain, and add that to the path to create a string with the full url. For a single string it looks like this: 

In [2]:
domain = 'www.kickstarter.com'
path = kst.url[0]
domain+path

'www.kickstarter.com/projects/elanlee/exploding-kittens'

To apply this to a column in a pandas dataframe, we apply the same operation to the column as a **series**. A string operation applied to a series acts on every element in the column. We can then replace the existing column with the altered series.

In [3]:
kst['url'] = domain + kst.url
kst.head(3)  

Unnamed: 0,amt.pledged,blurb,by,category,currency,goal,location,num.backers,num.backers.tier,pledge.tier,title,url
0,8782571.0,\nThis is a card game for people who are into ...,Elan Lee,Tabletop Games,usd,10000.0,"Los Angeles, CA",219382,"[15505, 202934, 200, 5]","[20.0, 35.0, 100.0, 500.0]",Exploding Kittens,www.kickstarter.com/projects/elanlee/exploding...
1,6465690.0,"\nAn unusually addicting, high-quality desk to...",Matthew and Mark McLachlan,Product Design,usd,15000.0,"Denver, CO",154926,"[788, 250, 43073, 21796, 41727, 21627, 12215, ...","[1.0, 14.0, 19.0, 19.0, 35.0, 35.0, 79.0, 79.0...",Fidget Cube: A Vinyl Desk Toy,www.kickstarter.com/projects/antsylabs/fidget-...
2,5408916.0,\nBring Reading Rainbow’s library of interacti...,LeVar Burton & Reading Rainbow,Web,usd,1000000.0,"Los Angeles, CA",105857,"[19639, 14343, 9136, 2259, 5666, 24512, 4957, ...","[5.0, 10.0, 25.0, 30.0, 35.0, 50.0, 75.0, 100....","Bring Reading Rainbow Back for Every Child, Ev...",www.kickstarter.com/projects/readingrainbow/br...


String methods such as `.strip()` and `.split()` can be applied to a series as well, through the `.str` accessor. To strip a native Python string `s` you'd use `s.strip()`, whereas with a series `ser` you would use `ser.str.strip()`.
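
The difference between the two calls can be sketched as follows. The values in `s` and `ser` are hypothetical stand-ins, not rows from `kst`; the point is only that the `.str` accessor is required on a series.

```python
import pandas as pd

# Plain Python string: call the method directly
s = '\n  Exploding Kittens  '
print(repr(s.strip()))

# pandas series (hypothetical values): go through the .str accessor
ser = pd.Series(['\nA card game', '  A desk toy  '])
stripped = ser.str.strip()      # strips every element at once
print(stripped.tolist())

# Calling the string method directly on the series raises an error
try:
    ser.strip()
except AttributeError as err:
    print('AttributeError:', err)
```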

For example, let's strip the newline `\n` from the `blurb` column. Note that we did this in sprint 1 practice 2 using lists, where we used a for loop to access each element in the list. When using a pandas series, you can bypass the for loop and complete the task in a single line! 

In [4]:
kst.blurb = kst.blurb.str.strip()
kst.blurb

0     This is a card game for people who are into ki...
1     An unusually addicting, high-quality desk toy ...
2     Bring Reading Rainbow’s library of interactive...
3     UPDATED: This is it. We're making a Veronica M...
4     An adventure game from Tim Schafer, Double Fin...
                            ...                        
95    The greatest work IN English literature, now i...
96    Enjoy real butter straight from the fridge. Th...
97    A minimalist wallet that holds everything you ...
98    A real-time, class-based strategy game, set in...
99    Reaper Miniatures Bones II was a project to co...
Name: blurb, Length: 100, dtype: object

<hr style="border:2px solid gray"> </hr>

### Now you try! 

Look at the values of `kst.category`. Remove the whitespace that occurs before some of the values, e.g. `kst.category[4]`, and replace the column in our `kst` dataframe. 

In [5]:
### BEGIN SOLUTION 

kst.category = kst.category.str.strip()
kst.category

### END SOLUTION 

0     Tabletop Games
1     Product Design
2                Web
3     Narrative Film
4        Video Games
           ...      
95        Publishing
96    Product Design
97           Fashion
98       Video Games
99    Tabletop Games
Name: category, Length: 100, dtype: object

<hr style="border:2px solid gray"> </hr>

To split strings, use `str.split()`, which breaks each string into a list at a given separator. Say we want to split the `location` column into two columns, one containing the city and the other the state. Splitting on `', '` returns a series in which each element is a **list** of the split pieces.

In [6]:
kst.location = kst.location.str.split(', ')
kst.location

0       [Los Angeles, CA]
1            [Denver, CO]
2       [Los Angeles, CA]
3         [San Diego, CA]
4     [San Francisco, CA]
             ...         
95      [Toronto, Canada]
96           [Sydney, AU]
97            [Ogden, UT]
98         [Brisbane, AU]
99           [Denton, TX]
Name: location, Length: 100, dtype: object

To turn this series into two columns, we add two new columns to `kst` and fill them from a new dataframe built from our list of [city, state] lists, `kst.location.tolist()`, using the index from `kst`.

In [7]:
kst[['city','state']] = pd.DataFrame(kst.location.tolist(), index= kst.index)
kst.head(3)

Unnamed: 0,amt.pledged,blurb,by,category,currency,goal,location,num.backers,num.backers.tier,pledge.tier,title,url,city,state
0,8782571.0,This is a card game for people who are into ki...,Elan Lee,Tabletop Games,usd,10000.0,"[Los Angeles, CA]",219382,"[15505, 202934, 200, 5]","[20.0, 35.0, 100.0, 500.0]",Exploding Kittens,www.kickstarter.com/projects/elanlee/exploding...,Los Angeles,CA
1,6465690.0,"An unusually addicting, high-quality desk toy ...",Matthew and Mark McLachlan,Product Design,usd,15000.0,"[Denver, CO]",154926,"[788, 250, 43073, 21796, 41727, 21627, 12215, ...","[1.0, 14.0, 19.0, 19.0, 35.0, 35.0, 79.0, 79.0...",Fidget Cube: A Vinyl Desk Toy,www.kickstarter.com/projects/antsylabs/fidget-...,Denver,CO
2,5408916.0,Bring Reading Rainbow’s library of interactive...,LeVar Burton & Reading Rainbow,Web,usd,1000000.0,"[Los Angeles, CA]",105857,"[19639, 14343, 9136, 2259, 5666, 24512, 4957, ...","[5.0, 10.0, 25.0, 30.0, 35.0, 50.0, 75.0, 100....","Bring Reading Rainbow Back for Every Child, Ev...",www.kickstarter.com/projects/readingrainbow/br...,Los Angeles,CA
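
As an alternative sketch, `str.split()` also accepts `expand=True`, which returns a dataframe of the pieces directly instead of a series of lists. The `locations` series below is a small hypothetical stand-in for the original, unsplit `location` column.

```python
import pandas as pd

# Hypothetical stand-in for the unsplit `location` column
locations = pd.Series(['Los Angeles, CA', 'Denver, CO', 'Toronto, Canada'])

# expand=True returns one dataframe column per split piece
parts = locations.str.split(', ', expand=True)
parts.columns = ['city', 'state']
print(parts)
```

This skips the intermediate list-of-lists step, at the cost of not keeping the split lists themselves in a column.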


We can search for a substring in each element using `str.find()` on a series. For every element it returns the index of the first occurrence of the substring, or -1 if the substring is not present.

In [8]:
kst.by.str.find('&')

0     -1
1     -1
2     13
3     -1
4     -1
      ..
95    -1
96    -1
97    -1
98    -1
99    -1
Name: by, Length: 100, dtype: int64
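
Note that -1 is a "not found" marker rather than a boolean; in Python, `bool(-1)` is `True`. When a true/false answer is what you need, for example to filter rows, `str.contains()` is one option. A minimal sketch on a hypothetical three-element stand-in for the `by` column:

```python
import pandas as pd

# Hypothetical stand-in for a few values of the `by` column
by = pd.Series(['Elan Lee', 'LeVar Burton & Reading Rainbow', 'Rob Thomas'])

positions = by.str.find('&')                   # index of first '&', or -1
has_amp = by.str.contains('&', regex=False)    # boolean mask instead of positions

print(positions.tolist())
print(by[has_amp].tolist())                    # keep only the rows containing '&'
```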

What if we wanted to replace all instances of 'and' in the `by` column with an ampersand to standardize it? We can use `str.replace()` on our series to accomplish this. Be mindful of spaces! If we do not include the leading and trailing spaces in our ` and ` input, and the column includes a name like "De**and**ra", we will end up with an ampersand in that person's name: "De&ra".

In [10]:
kst.by = kst.by.str.replace(' and ', ' & ')
kst.by

0                               Elan Lee
1               Matthew & Mark McLachlan
2         LeVar Burton & Reading Rainbow
3                             Rob Thomas
4     Double Fine & 2 Player Productions
                     ...                
95                            Ryan North
96                        DM Initiatives
97                         Ryan Crabtree
98                       5 Lives Studios
99                     Reaper Miniatures
Name: by, Length: 100, dtype: object
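
The pitfall with spaces can be seen in a small sketch. The name "Deandra Smith" is invented for illustration and does not come from the dataset; `regex=False` is passed so the pattern is treated as literal text.

```python
import pandas as pd

# Hypothetical names; 'Deandra Smith' is made up to show the pitfall
names = pd.Series(['Matthew and Mark McLachlan', 'Deandra Smith'])

# Without surrounding spaces, the 'and' inside a name is replaced too
careless = names.str.replace('and', '&', regex=False)

# With surrounding spaces, only the standalone word is replaced
careful = names.str.replace(' and ', ' & ', regex=False)

print(careless.tolist())
print(careful.tolist())
```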

## Interpolating strings

A more advanced string technique is **interpolating** or **formatting** strings. This is the process of constructing a new string in which certain placeholders are filled with variables. While there are [several ways](https://towardsdatascience.com/python-string-interpolation-829e14e1fc75#:~:text=String%20interpolation%20is%20a%20process,ways%20to%20format%20string%20literals.) to accomplish this, a common one is the `.format()` method.

With this method we construct a string and add `{}` wherever we want a variable to fill the placeholder. We then call `.format()` on the string, passing as arguments the variables that should replace the placeholders, in the appropriate order.

Let's create a loop that passes through the first few rows of our dataframe and, for each row, uses the `title`, `goal`, and `by` columns to construct and print a statement.

In [11]:
for i in range(0, 10):
    title = kst.title[i]
    by = kst.by[i]
    goal = kst.goal[i]
    print("The '{}' campaign by {} has a goal of ${}".format(title, by, goal))

The 'Exploding Kittens' campaign by Elan Lee has a goal of $10000.0
The 'Fidget Cube: A Vinyl Desk Toy' campaign by Matthew & Mark McLachlan has a goal of $15000.0
The 'Bring Reading Rainbow Back for Every Child, Everywhere!' campaign by LeVar Burton & Reading Rainbow has a goal of $1000000.0
The 'The Veronica Mars Movie Project' campaign by Rob Thomas has a goal of $2000000.0
The 'Double Fine Adventure' campaign by Double Fine & 2 Player Productions has a goal of $400000.0
The 'Pebble Time - Awesome Smartwatch, No Compromises' campaign by Pebble Technology has a goal of $500000.0
The 'Torment: Tides of Numenera' campaign by inXile entertainment has a goal of $900000.0
The 'Pillars of Eternity (formerly Project Eternity)' campaign by Obsidian Entertainment has a goal of $1100000.0
The 'Yooka-Laylee - A 3D Platformer Rare-vival!' campaign by Playtonic Games has a goal of $175000.0
The 'ZNAPS - Connection is just a snap away' campaign by ZNAPS has a goal of $120000.0


Something similar can be accomplished using **f-strings**, where the prefix `f` marks a string for "literal string interpolation". In this format the variable names are placed directly inside the `{}`.

In [12]:
i = 50
title = kst.title[i]
by = kst.by[i]
goal = kst.goal[i]
print(f'The {title} campaign by {by} has a goal of ${goal}')

The Shroud of the Avatar: Forsaken Virtues campaign by Portalarium, Inc. has a goal of $1000000.0
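
Both approaches also accept a format specification after a colon inside the braces, which can tidy up how numbers such as the goal are printed. A minimal sketch, using values copied from the first row shown earlier; the spec `,.0f` (thousands separator, no decimal places) is just one choice among many.

```python
# Values copied from the first row of the dataframe output above
title = 'Exploding Kittens'
by = 'Elan Lee'
goal = 10000.0

# ':,.0f' adds a thousands separator and drops the decimal places
msg_format = "The '{}' campaign by {} has a goal of ${:,.0f}".format(title, by, goal)
msg_fstring = f"The '{title}' campaign by {by} has a goal of ${goal:,.0f}"

print(msg_format)
print(msg_fstring)
```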


<hr style="border:2px solid gray"> </hr>

### Now you try! 

Use either string interpolation method to construct a statement using the `title` and `num.backers` for any entry in the dataframe. 

In [13]:
i = 27
title = kst.title[i]
num_backers = kst['num.backers'][i]
print(f'The {title} campaign has {num_backers} backers who have donated.')

The Shadowrun Returns campaign has 36276 backers who have donated.


<hr style="border:2px solid gray"> </hr>

## Using regex to identify characters

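**Regex** (short for *regular expression*) is a sequence of characters that defines a search pattern. Python's built-in `re` module provides regex support.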

In [None]:
import re
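
As a minimal sketch of identifying characters with `re`, the example below looks for digits and punctuation in one campaign title taken from the output above. The patterns are illustrative choices, not the only way to write them.

```python
import re

# A title copied from the campaign output shown earlier
title = 'Yooka-Laylee - A 3D Platformer Rare-vival!'

# \d matches any digit
digits = re.findall(r'\d', title)

# [^\w\s] matches anything that is neither a word character nor whitespace
punctuation = re.findall(r'[^\w\s]', title)

# re.search returns a match object if the pattern occurs anywhere, else None
has_digit = re.search(r'\d', title) is not None

print(digits, punctuation, has_digit)
```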



## Using regex to group patterns  
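
As a minimal sketch of grouping, parentheses in a pattern capture the part of the string they match. The example below pulls two pieces out of a URL path from the dataset; reading the first piece as the creator and the second as the project is an assumption based on how the paths look.

```python
import re

# A path copied from the `url` column shown earlier
path = '/projects/elanlee/exploding-kittens'

# Each (...) is a capture group; [^/]+ matches one or more non-slash characters
pattern = r'/projects/([^/]+)/([^/]+)'

match = re.match(pattern, path)
creator, project = match.groups()   # assumed meaning of the two path segments

print(creator)
print(project)
```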

## Lambda functions
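
As a minimal sketch, a lambda is a small anonymous function written in a single expression, and it pairs naturally with `Series.apply()`. The `goals` series below is a hypothetical stand-in holding the first three values of the `goal` column.

```python
import pandas as pd

# Hypothetical stand-in for the first three values of the `goal` column
goals = pd.Series([10000.0, 15000.0, 1000000.0])

# lambda g: ... defines a one-line function of g without a def statement
labels = goals.apply(lambda g: f'${g:,.0f}')

print(labels.tolist())
```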