# Vectorized String Methods
Just as NumPy vectorizes some math functions, pandas vectorizes some string functions.

Data Source: https://www.icpsr.umich.edu/icpsrweb/NACJD/studies/35509

In [None]:
#An example of vectorization
import numpy as np
ones = np.ones(10)
ones

In [None]:
ones + 1
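
For contrast, here is a minimal sketch of what the same kind of operator does on a plain Python list. The `values` list is made up for illustration.

```python
import numpy as np

values = [1, 2, 3]  # made-up example data

# On a plain list, * means repetition, not element-wise multiplication.
repeated = values * 2

# On a NumPy array, the same operator is applied to every element.
doubled = np.array(values) * 2

print(repeated)  # [1, 2, 3, 1, 2, 3]
print(doubled)   # [2 4 6]
```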

What if we have a series of strings?

In [None]:
import pandas as pd
friends_ser = pd.Series(['Jason','Alex','Fayzan','Ethan'])
friends_ser

Let's look at the `lower` method on a string.

In [None]:
name = 'Lucas'
name

In [None]:
name.lower()

It converts every character to lowercase. Just as we couldn't multiply each element of a list by 2 without NumPy, we can't call `lower` on a whole list of strings at once. We need pandas. Is there a `pd.lower()`, the way there is an `np.exp`?

In [None]:
# This raises an AttributeError: pandas has no top-level lower function
pd.lower(friends_ser)

Unfortunately not. However, there is something you can do.

## Introducing `<SERIES>.str`

In [None]:
friends_ser

In [None]:
#Calling <SERIES>.str lets us access all of the string methods
friends_ser.str

In [None]:
friends_ser.str.lower()
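
To make the comparison with plain Python concrete, here is a small sketch that lowercases the same names with a list comprehension and with `.str.lower()`. The missing value is added only for illustration; it shows that the `.str` version passes missing entries through instead of raising an error.

```python
import pandas as pd

# Same idea as friends_ser, with one missing value added for illustration
friends = pd.Series(['Jason', 'Alex', None, 'Ethan'])

# Plain Python: loop explicitly and guard against the missing value
lowered_loop = [name.lower() if isinstance(name, str) else name
                for name in friends]

# pandas: one call, and the missing value stays missing
lowered_str = friends.str.lower()

print(lowered_loop)
print(lowered_str)
```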

Most string methods that you can think of can be accessed by calling `<SERIES>.str.<METHOD>`.

A full list is here: https://pandas.pydata.org/pandas-docs/stable/text.html

## Return types

Certain string methods, like `lower`, return a Series of strings.

In [None]:
friends_ser.str.lower()

In [None]:
# Another example: slicing isn't exactly a method, but it also works through .str
friends_ser.str[1:3]
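
As a quick sketch, the bracket syntax should give the same result as the explicit `slice` method, and a single integer picks out one character from each string.

```python
import pandas as pd

friends = pd.Series(['Jason', 'Alex', 'Fayzan', 'Ethan'])

# Slice notation and the slice method are two spellings of the same operation
by_index = friends.str[1:3]
by_method = friends.str.slice(1, 3)

# A single position returns one character per string
first_letter = friends.str[0]

print(by_index.tolist())      # ['as', 'le', 'ay', 'th']
print(first_letter.tolist())  # ['J', 'A', 'F', 'E']
```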

Certain string methods may return integers.

In [None]:
friends_ser.str.len()

Perhaps most interesting (for now) are the ones that return booleans.

(We will soon talk about ones that return lists which may be even more interesting)

In [None]:
friends_ser.str.contains('a')
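
One detail worth a small sketch: `contains` is case-sensitive by default, so "Alex" does not match a lowercase `'a'`. Passing `case=False` is one way to ignore case.

```python
import pandas as pd

friends = pd.Series(['Jason', 'Alex', 'Fayzan', 'Ethan'])

# Default: case-sensitive, so the capital A in 'Alex' does not count
sensitive = friends.str.contains('a')

# case=False ignores case, so every name here matches
insensitive = friends.str.contains('a', case=False)

print(sensitive.tolist())    # [True, False, True, True]
print(insensitive.tolist())  # [True, True, True, True]
```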

Why is this the most interesting? Because we can use these booleans for logical indexing. Let's see an example.

In [None]:
# Load the data the same way as in lectures 1 and 2
data = pd.read_csv('data/drugs.tsv',delimiter='\t',index_col='QUESTID2')

Let's say we only want to select the columns that tell us how many days a person has used each drug. How can we do that using vectorized string operations?

In [None]:
#get the series of columns
data.columns

Note: A pandas Index is very similar to a Series, so it also has a `.str` accessor.

Now, we can use the `endswith` string method and see if the given column name ends with `DAYS`.

In [None]:

data.columns.str.endswith('DAYS')

Now we can index using this.

In [None]:
data.loc[:,data.columns.str.endswith('DAYS')]
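
If the drugs file is not at hand, the same pattern can be tried on a tiny made-up frame. The column names and values below are hypothetical and only mimic the naming style of the real data.

```python
import pandas as pd

# Hypothetical toy data; not taken from the real dataset
toy = pd.DataFrame({
    'AGE': [23, 41],
    'MJDAYS': [0, 12],
    'COCDAYS': [0, 3],
    'EMP': [1, 2],
})

# Boolean array with one entry per column
is_days = toy.columns.str.endswith('DAYS')

# Keep all rows, but only the columns where the mask is True
days_only = toy.loc[:, is_days]

print(is_days)
print(days_only)
```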

## Methods that return lists

There are multiple methods that do this, but we'll discuss the `split` method.

In [None]:
# Let's apply the transformation to EMP that we used in lectures 1 and 2
emp_dict = {1:'Full Time',2:'Part Time',3:'Unemployed',4:'Other',99:'Child'}
data.EMP = data.EMP.apply(lambda x: emp_dict[x])

Let's look at the `split` method on the EMP column.

In [None]:
data.EMP.str.split()
#returns a series of lists

By default, `split` splits on whitespace. We could also split on something else, such as commas, by using the `pat` argument. It would look like this.

In [None]:
data.EMP.str.split(pat=',')
# There are no commas, so each value comes back as a one-element list.
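
To see `pat` actually do something, here is a sketch on a made-up Series that does contain commas. The `records` values are invented for illustration.

```python
import pandas as pd

# Made-up comma-separated values
records = pd.Series(['red,green', 'blue', 'cyan,magenta,yellow'])

# Each string becomes a list of the pieces between commas
pieces = records.str.split(pat=',')

# .str.len() also works on lists, giving the number of pieces per row
n_pieces = pieces.str.len()

print(pieces.tolist())
print(n_pieces.tolist())  # [2, 1, 3]
```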

It's unclear how we would work with this Series of lists. But we can do something else: we can turn it into a DataFrame by using the `expand` argument.

In [None]:
#no longer split by comma
data.EMP.str.split()

In [None]:
data.EMP.str.split(expand=True)
#What is going on with the NaNs?
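
A short sketch of where the missing values come from: rows that split into fewer pieces than the longest row are padded so that every row has the same number of columns. The values below are three of the EMP labels from above.

```python
import pandas as pd

emp = pd.Series(['Full Time', 'Part Time', 'Unemployed'])

# 'Unemployed' has only one word, so its second column is filled as missing
expanded = emp.str.split(expand=True)

print(expanded)
print(expanded[1].isna().tolist())  # [False, False, True]
```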

Here is one use for this. Suppose you have the basketball dataset with a column of players' full names. Split on whitespace with `expand=True` and you get a column of first names and a column of last names. You can then add these columns to your dataset and drop the original names column.
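
A minimal sketch of that workflow, assuming a frame with a `name` column. The names, the `points` column, and the new column names are all made up, and `n=1` is one possible choice so that anything after the first space stays together as the last name.

```python
import pandas as pd

# Hypothetical stand-in for the basketball data
players = pd.DataFrame({
    'name': ['Ada Lovelace', 'Alan Turing', 'Mary Ann Evans'],
    'points': [12, 30, 8],
})

# Split at most once on whitespace and expand into two columns
parts = players['name'].str.split(n=1, expand=True)
parts.columns = ['first_name', 'last_name']

# Attach the new columns and drop the original full-name column
players = pd.concat([players, parts], axis=1).drop(columns='name')

print(players)
```

Without `n=1`, a three-word name would produce a third column and the two-word names would get a missing value in it.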

## Methods that use regular expressions

I don't want to cover regular expression notation here, but know that many `.str` methods accept regular expressions, including `contains`, `match`, `extract`, `findall`, and `replace`.
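
Without going into the notation, here is a small taste as a sketch. The column names are hypothetical, and the two patterns are just one way to express "ends with DAYS" and "whatever comes before DAYS".

```python
import pandas as pd

# Hypothetical column names in the style of the drugs data
cols = pd.Series(['MJDAYS', 'COCDAYS', 'AGE2', 'EMP'])

# contains treats its pattern as a regular expression by default;
# $ anchors the match to the end of the string
ends_days = cols.str.contains(r'DAYS$')

# extract returns the text captured by the parenthesized group,
# and a missing value where the pattern does not match
prefix = cols.str.extract(r'^(\w+?)DAYS$', expand=False)

print(ends_days.tolist())  # [True, True, False, False]
print(prefix)
```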

## Further reference:
https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html