# String Manipulation, Regex, and Lambda Functions

## Python native string operations 

String manipulation is a common task in data analysis. It can include parsing strings, splitting them apart, searching, or substituting. Python has built-in tools for basic string manipulation, some of which we've used in previous modules.

Let's bring in a string-heavy dataset to demonstrate with. We will be using data from `most_backed.csv`, which covers the most backed campaigns on the crowdfunding platform "Kickstarter", found [here](https://www.kaggle.com/datasets/socathie/kickstarter-project-statistics). The cell below loads the first 100 rows from `most_backed_edit.csv`.

In [None]:
import pandas as pd

kst = pd.read_csv('most_backed_edit.csv', nrows = 100)
kst.drop(columns = 'Unnamed: 0', inplace = True)
kst.head(3)

You should already be familiar with string operations such as concatenation, where we add characters/strings together with `+`. For example, we can alter the `url` column to include the full URL, not just the path alone - e.g. `/projects/elanlee/exploding-kittens` becomes `www.kickstarter.com/projects/elanlee/exploding-kittens`.

We can create a string that contains the domain, and add that to the path to create a string with the full url. For a single string it looks like this: 

In [None]:
domain = 'www.kickstarter.com'
path = kst.url[0]
domain+path

To apply this to a column in a pandas dataframe, we apply the same operation to the column as a **series**. String operations on a series act on every element in the column. We can then replace the existing column with the altered series.

In [None]:
kst['url'] = domain + kst.url
kst.head(3)  

Methods such as `.strip()` and `.split()` can be applied to a series as well, through the `.str` accessor. When using native Python to strip a string `s` you'd use `s.strip()`, whereas with a series `ser` you would use `ser.str.strip()`.

For example, let's strip the newline `\n` from the `blurb` column. Note that we did this in sprint 1 practice 2 using lists, where we used a for loop to access each element in the list. When using a pandas series, you can bypass the for loop and complete the task in a single line! 

In [None]:
kst.blurb = kst.blurb.str.strip()
kst.blurb
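
To make the difference between the native call and the series call concrete, here is a minimal sketch on toy data. The strings below are made-up stand-ins for values like those in `kst.blurb`, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy values standing in for a column such as kst.blurb
s = '  Exploding Kittens\n'
ser = pd.Series(['  Exploding Kittens\n', 'Pebble Time  ', ' Coolest Cooler'])

# Native Python: the method is called directly on a single string
native = s.strip()

# pandas: the same method is reached through the .str accessor
# and is applied to every element of the series, with no for loop
stripped = ser.str.strip()

print(repr(native))
print(stripped.tolist())
```

Note that `strip()` with no arguments removes leading and trailing whitespace of any kind, including spaces, tabs, and newlines.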

<hr style="border:2px solid gray"> </hr>

### Now you try! 

Look at the values of `kst.category`. Remove the whitespace that occurs before some of the values, e.g. `kst.category[4]`, and replace the column in our `kst` dataframe. 

In [None]:
### BEGIN SOLUTION 

kst.category = kst.category.str.strip()
kst.category

### END SOLUTION 

<hr style="border:2px solid gray"> </hr>

For splitting, you'd use `str.split()`, which splits a string into a list based on a particular separator. Say we wanted to split the `location` column into two columns, where one contains the city and the other contains the state name. Splitting returns a series where each element is now a **list** containing the split elements.

In [None]:
kst.location = kst.location.str.split(', ')
kst.location

To turn this series into two columns, we add two new columns to `kst` and fill them from a new dataframe built from our list of [city, state] lists, `kst.location.tolist()`, reusing the index from `kst` so the rows line up.

In [None]:
kst[['city','state']] = pd.DataFrame(kst.location.tolist(), index= kst.index)
kst.head(3)
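
As an alternative sketch, `str.split()` can return the pieces as columns directly when given `expand=True`, which skips the intermediate series of lists. The toy dataframe below is hypothetical and assumes locations shaped like `'City, ST'`; it works on the original unsplit strings, not on a column that has already been converted to lists.

```python
import pandas as pd

# Hypothetical toy frame; real location values are assumed to look like 'City, ST'
toy = pd.DataFrame({'location': ['Austin, TX', 'Palo Alto, CA', 'Portland, OR']})

# expand=True returns a dataframe with one column per split piece;
# n=1 splits only on the first separator, so we always get two pieces
toy[['city', 'state']] = toy['location'].str.split(', ', n=1, expand=True)

print(toy)
```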

We can find certain characters in a string using `str.find()` on a series. For each element, this returns the index of the first occurrence of that character if it is present, and -1 if it is not.

In [None]:
kst.by.str.find('&')
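
A small sketch of what `str.find()` returns, using made-up creator names in place of the real `by` column. Because -1 is a sentinel value, not a boolean, it has to be compared explicitly to get a True/False mask.

```python
import pandas as pd

# Hypothetical creator names, not taken from the dataset
by = pd.Series(['Elan Lee', 'Ryan & Sam', 'A&B Games'])

# Index of the first '&' in each element, or -1 when there is none
positions = by.str.find('&')

# -1 is a sentinel, not False, so compare explicitly to build a boolean mask
has_amp = positions != -1

print(positions.tolist())
print(by[has_amp].tolist())
```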

What if we wanted to replace all instances of 'and' in the `by` column with an ampersand to standardize it? We can use `str.replace()` on our series to accomplish this. Be mindful of spaces! If we do not include the leading and trailing spaces in our ` and ` input, and the column includes a name like "De**and**ra", we will end up with an ampersand in that person's name: "De&ra".

In [None]:
kst.by = kst.by.str.replace(' and ', ' & ')
kst.by
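
The pitfall with spaces can be seen side by side in a short sketch. The names are invented for illustration, and `regex=False` is passed so the pattern is treated as plain text regardless of pandas version.

```python
import pandas as pd

# Hypothetical names chosen because they contain 'and' inside a word
by = pd.Series(['Deandra and Sam', 'Sandy and Co'])

# Without the surrounding spaces, 'and' inside a name is replaced too
careless = by.str.replace('and', '&', regex=False)

# With the surrounding spaces, only the standalone word is replaced
careful = by.str.replace(' and ', ' & ', regex=False)

print(careless.tolist())
print(careful.tolist())
```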

## Interpolating strings

A more advanced string technique is **interpolating** or **formatting** strings. This is the process of constructing a new string in which certain placeholders are filled with variables. While there are [several ways](https://towardsdatascience.com/python-string-interpolation-829e14e1fc75#:~:text=String%20interpolation%20is%20a%20process,ways%20to%20format%20string%20literals.) to accomplish this, a common one is the `.format()` method.

With this method we construct a string and add `{}` wherever we want a variable to fill the placeholder. We then call `.format()` on the string, passing the variables that should replace the placeholders, in the appropriate order.

Let's create a loop that passes through the first few rows of our dataframe and, for each row, uses the `title`, `goal`, and `by` columns to construct and print a string statement.

In [None]:
for i in range(0, 10):
    title = kst.title[i]
    by = kst.by[i]
    goal = kst.goal[i]
    print("The '{}' campaign by {} has a goal of ${}".format(title, by, goal))

Something similar can be accomplished using **f-strings**, where the prefix `f` marks a formatted string literal (sometimes called "literal string interpolation"). In this format the variable names are put directly into the `{}`.

In [None]:
i = 50
title = kst.title[i]
by = kst.by[i]
goal = kst.goal[i]
print(f'The {title} campaign by {by} has a goal of ${goal}')
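
To confirm that the two approaches produce the same result, here is a sketch with made-up values in place of a dataframe row. The last line also shows a format specifier (`:,`), which adds thousands separators and works the same way in both styles.

```python
# Hypothetical values standing in for one row of the dataframe
title = 'Exploding Kittens'
by = 'Elan Lee'
goal = 10000

# .format(): placeholders are filled in positional order
with_format = "The '{}' campaign by {} has a goal of ${}".format(title, by, goal)

# f-string: variable names go directly inside the braces
with_fstring = f"The '{title}' campaign by {by} has a goal of ${goal}"

print(with_format == with_fstring)

# A format specifier after a colon controls how the value is rendered
pretty_goal = f'${goal:,}'
print(pretty_goal)
```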

<hr style="border:2px solid gray"> </hr>

### Now you try! 

Use either string interpolation method to construct a statement using the `title` and `num.backers` for any entry in the dataframe. 

In [None]:
i = 27
title = kst.title[i]
num_backers = kst['num.backers'][i]
print(f'The {title} campaign has {num_backers} backers who have donated.')

<hr style="border:2px solid gray"> </hr>

## Using regex to identify characters

**Regex** (regular expressions) is a more advanced tool for string operations.

In [None]:
import re
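
As a minimal sketch of identifying characters with the `re` module, the example below searches a made-up sentence (not taken from the dataset) for digits and capital letters. `\d` matches a single digit, `+` means "one or more", and `[A-Z]` is a character class matching any one uppercase letter.

```python
import re

# Hypothetical text, not taken from the dataset
blurb = 'Backed by 219,382 people in 30 days!'

# findall returns every non-overlapping match as a list of strings
digit_runs = re.findall(r'\d+', blurb)
capitals = re.findall(r'[A-Z]', blurb)

# search returns a match object for the first match only (or None)
first = re.search(r'\d+', blurb)

print(digit_runs)
print(capitals)
print(first.group(), first.start())
```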



## Using regex to group patterns  
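
One way to illustrate grouping is with parentheses, which capture parts of a match so they can be pulled out separately. The sketch below assumes a location string shaped like `'City, ST'`; both the sample value and the pattern are illustrative assumptions.

```python
import re

# Hypothetical location string in 'City, ST' form
location = 'San Francisco, CA'

# Two capture groups: everything before ', ' and a two-letter state code
pattern = r'(.+), ([A-Z]{2})'

match = re.match(pattern, location)

# group(0) is the whole match; group(1) and group(2) are the captured parts
city = match.group(1)
state = match.group(2)

print(match.groups())
```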

## Lambda functions
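
A lambda is a small anonymous function written inline with the `lambda` keyword. As a hedged sketch, the example below counts the words in some made-up campaign titles twice, once with a named function and once with a lambda passed to `Series.apply()`, to show that the two are equivalent.

```python
import pandas as pd

# Hypothetical titles, not taken from the dataset
titles = pd.Series(['Exploding Kittens', 'The Coolest Cooler'])

# A named function that counts the words in one string
def count_words(text):
    return len(text.split())

via_def = titles.apply(count_words)

# The same logic as a lambda: lambda <argument>: <expression>
via_lambda = titles.apply(lambda text: len(text.split()))

print(via_lambda.tolist())
```

Lambdas suit short, one-off transformations like this one; anything longer is usually easier to read as a named function.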