Again in this notebook, we will ask you to find some Pandas functions that we haven't taught you yet. This is a purposeful move to start you on your path towards Googling glory. It is an important part of coding: nobody has everything memorised, but we are all experts at quickly Googling the information we need. Search around: read the Pandas docs, Stack Overflow, Medium, and anywhere else you find information. If you're totally stuck, feel free to ask your "colleagues" (fellow students), your instructor, or one of our junior instructors. Most of the time, though, we'll have taught you the answer in one of the previous notebooks, so please check back in them too; they will be a great source of information throughout this bootcamp.

# Pandas and Text Methods

We will again use the `people` dataframe, with some more people and columns:

In [1]:
import pandas as pd

names = ["Erika Schumacher", "Javi López", "Maria Rovira", "Ana Garamond", 
         "Shekhar Biswas", "Muriel Adams", "Saira Polom", "Alex Edwin", 
         "Kit Ching", "Dog Woof"]
ages = [22, 50, 23, 29, 44, 30, 25, 71, 35, 2]
nations = ["DE", "ES", "ES", "ES", "IN", "DE", "IN", "UK", "UK", "XX"]
sibilings = [2, 0, 4, 1, 1, 2, 3, 7, 0, 9]
colors = ["Red", "Yellow", "Yellow", "Blue", "Red", "Yellow", "Blue", "Blue", "Red", "Gray"]



people = pd.DataFrame({"name":names,
                       "age":ages,
                       "country":nations,
                       "sibilings":sibilings,
                       "favourite_color":colors
                      })

people.head()

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
1,Javi López,50,ES,0,Yellow
2,Maria Rovira,23,ES,4,Yellow
3,Ana Garamond,29,ES,1,Blue
4,Shekhar Biswas,44,IN,1,Red


In [2]:
people

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
1,Javi López,50,ES,0,Yellow
2,Maria Rovira,23,ES,4,Yellow
3,Ana Garamond,29,ES,1,Blue
4,Shekhar Biswas,44,IN,1,Red
5,Muriel Adams,30,DE,2,Yellow
6,Saira Polom,25,IN,3,Blue
7,Alex Edwin,71,UK,7,Blue
8,Kit Ching,35,UK,0,Red
9,Dog Woof,2,XX,9,Gray


## Filtering data based on conditions

Let's say we want to select only rows for people whose favourite color is "Yellow".

If we just type the condition (`favourite_color=="Yellow"`), we will create a Pandas Series of boolean values of the same length as the rows in the dataframe. It holds `True` for rows where the condition is met, and `False` otherwise:

In [3]:
people.favourite_color=="Yellow"

0    False
1     True
2     True
3    False
4    False
5     True
6    False
7    False
8    False
9    False
Name: favourite_color, dtype: bool

> Note: a Pandas Series is like a list, but it has an index and all of its elements must share the same data type. You can think of it as a "single column dataframe".
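As a small illustration of the note above, the sketch below builds a Series by hand and inspects its index and data type. The values and labels are made up for the example and are not part of the `people` dataframe.

```python
import pandas as pd

# A hand-made Series: values plus an explicit index (labels are illustrative)
ages = pd.Series([22, 50, 23], index=["erika", "javi", "maria"])

print(ages.index.tolist())  # the labels attached to each value
print(ages.dtype)           # one data type shared by every element

# Comparing a Series with a value returns a boolean Series with the same index
is_over_30 = ages > 30
print(is_over_30.tolist())  # [False, True, False]
```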

We can use this boolean Series inside the `loc[]` indexer we learned earlier to select only the rows that correspond to the `True` values:

In [4]:
people.loc[people.favourite_color=="Yellow"]

Unnamed: 0,name,age,country,sibilings,favourite_color
1,Javi López,50,ES,0,Yellow
2,Maria Rovira,23,ES,4,Yellow
5,Muriel Adams,30,DE,2,Yellow


In [5]:
people[people['favourite_color']=="Yellow"]

Unnamed: 0,name,age,country,sibilings,favourite_color
1,Javi López,50,ES,0,Yellow
2,Maria Rovira,23,ES,4,Yellow
5,Muriel Adams,30,DE,2,Yellow


In [6]:
people[people.favourite_color=="Yellow"]

Unnamed: 0,name,age,country,sibilings,favourite_color
1,Javi López,50,ES,0,Yellow
2,Maria Rovira,23,ES,4,Yellow
5,Muriel Adams,30,DE,2,Yellow


###### **Exercise 1:**
filter the `people` dataframe and keep only people from the UK.

In [7]:
people[people.country=='UK']

Unnamed: 0,name,age,country,sibilings,favourite_color
7,Alex Edwin,71,UK,7,Blue
8,Kit Ching,35,UK,0,Red


###### **Exercise 2:** 
filter the `people` dataframe and keep only people from either the UK or Germany (the country code for Germany is "DE"). 

> Tip: To use two conditions inside of `loc[]`, wrap each condition in parentheses and separate them using logical operators
- `&` if you need both conditions to be met
- `|` if meeting one of the conditions is enough
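
The parentheses in the tip matter because, in Python, `&` and `|` bind more tightly than comparisons such as `==` or `<`, so an unparenthesised expression is not evaluated the way it reads. Here is a minimal sketch on a small made-up dataframe (not the `people` data), which also shows `~` for negating a condition:

```python
import pandas as pd

# Tiny illustrative dataframe; the values are invented for this example
df = pd.DataFrame({"age": [22, 50, 23, 71],
                   "color": ["Red", "Yellow", "Yellow", "Blue"]})

# Both conditions must hold: younger than 30 AND favourite colour Yellow
both = df.loc[(df.age < 30) & (df.color == "Yellow")]

# One condition is enough: older than 60 OR favourite colour Red
either = df.loc[(df.age > 60) | (df.color == "Red")]

# ~ negates a condition: everyone whose favourite colour is NOT Yellow
not_yellow = df.loc[~(df.color == "Yellow")]

print(both.index.tolist(), either.index.tolist(), not_yellow.index.tolist())
# [2] [0, 3] [0, 3]
```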

In [8]:
people.loc[(people.country=='UK') | (people.country=='DE')]

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
5,Muriel Adams,30,DE,2,Yellow
7,Alex Edwin,71,UK,7,Blue
8,Kit Ching,35,UK,0,Red


In [9]:
people[people.country.isin(['DE','UK'])]

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
5,Muriel Adams,30,DE,2,Yellow
7,Alex Edwin,71,UK,7,Blue
8,Kit Ching,35,UK,0,Red


In [10]:
people[people.age.between(15,30)]

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
2,Maria Rovira,23,ES,4,Yellow
3,Ana Garamond,29,ES,1,Blue
5,Muriel Adams,30,DE,2,Yellow
6,Saira Polom,25,IN,3,Blue


###### **Exercise 3:**
filter the `people` dataframe and keep only:

- people from either the UK or Germany (the country code for Germany is "DE"), who have 2 or more siblings

In [11]:
people[people.country.isin(['UK','DE']) & (people.sibilings >= 2) ]

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
5,Muriel Adams,30,DE,2,Yellow
7,Alex Edwin,71,UK,7,Blue


In [12]:
people.loc[((people.country == 'UK') | (people.country == 'DE')) & (people.sibilings >= 2)]

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
5,Muriel Adams,30,DE,2,Yellow
7,Alex Edwin,71,UK,7,Blue


## String Operations

The previous exercises could be solved by combining simple conditions based on equalities `==` or comparisons `>`, `<`. But when it comes to text data, sometimes the conditions are more complex. How would we select all the people whose name starts with a certain letter? 

This is where Pandas String Operations are really helpful. Go through [this user guide](https://pandas.pydata.org/docs/user_guide/text.html#string-methods) from Pandas' documentation; it's a good introduction. Here are some examples:

Filtering rows with name starting with A:

- first we generate the boolean expression

In [13]:
# https://pandas.pydata.org/docs/reference/api/pandas.Series.str.startswith.html

people.name.str.startswith("A")

0    False
1    False
2    False
3     True
4    False
5    False
6    False
7     True
8    False
9    False
Name: name, dtype: bool

- and then pass it to `loc[]`

In [14]:
people.loc[people.name.str.startswith("A")]

Unnamed: 0,name,age,country,sibilings,favourite_color
3,Ana Garamond,29,ES,1,Blue
7,Alex Edwin,71,UK,7,Blue


String methods can also change text:

In [15]:
# https://pandas.pydata.org/docs/reference/api/pandas.Series.str.lower.html
people.name.str.lower()

0    erika schumacher
1          javi lópez
2        maria rovira
3        ana garamond
4      shekhar biswas
5        muriel adams
6         saira polom
7          alex edwin
8           kit ching
9            dog woof
Name: name, dtype: object

In [16]:
people

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
1,Javi López,50,ES,0,Yellow
2,Maria Rovira,23,ES,4,Yellow
3,Ana Garamond,29,ES,1,Blue
4,Shekhar Biswas,44,IN,1,Red
5,Muriel Adams,30,DE,2,Yellow
6,Saira Polom,25,IN,3,Blue
7,Alex Edwin,71,UK,7,Blue
8,Kit Ching,35,UK,0,Red
9,Dog Woof,2,XX,9,Gray


Note that we have just outputted these names, but we have not changed the original dataframe:

In [17]:
people.head(2)

Unnamed: 0,name,age,country,sibilings,favourite_color
0,Erika Schumacher,22,DE,2,Red
1,Javi López,50,ES,0,Yellow


Pandas will not make changes to the original data unless you explicitly tell it to do so. If we wanted to change the original dataframe, we would have to assign this output (the names in lower case) to the column in the dataframe we want to change. When doing that, it is important that you select that column using `loc[]`, and not simply `DataFrame.column`:

In [18]:
people.loc[:,"name"] = people.name.str.lower()

In [19]:
# now the original dataframe has been modified:
people.head(2)

Unnamed: 0,name,age,country,sibilings,favourite_color
0,erika schumacher,22,DE,2,Red
1,javi lópez,50,ES,0,Yellow
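
One reason to avoid attribute-style assignment (an observation about Pandas behaviour, not something this notebook relies on) is that `df.some_name = ...` only updates a column that already exists. If the name is not yet a column, Pandas just sets a plain Python attribute and no column is created. A minimal sketch on a made-up dataframe, where `name_upper` and `name_lower` are hypothetical column names:

```python
import warnings
import pandas as pd

df = pd.DataFrame({"name": ["Ana Garamond", "Alex Edwin"]})

# Attribute-style assignment to a name that is not a column yet:
# pandas warns and stores a plain attribute instead of adding a column.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the UserWarning for this demo
    df.name_upper = df["name"].str.upper()

print("name_upper" in df.columns)  # False

# Bracket (or loc) assignment creates a real column
df["name_lower"] = df["name"].str.lower()

print(df.columns.tolist())  # ['name', 'name_lower']
```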


###### **Exercise 4:**
select all people whose name contains (either in the first name or the surname) the letter `p`.

In [20]:
people[people.name.str.contains('p',case=False)]

Unnamed: 0,name,age,country,sibilings,favourite_color
1,javi lópez,50,ES,0,Yellow
6,saira polom,25,IN,3,Blue


###### **Exercise 5:**
select all people whose full name (first name + surname) has more than 12 characters.

In [21]:
people[people.name.str.len() > 12]

Unnamed: 0,name,age,country,sibilings,favourite_color
0,erika schumacher,22,DE,2,Red
4,shekhar biswas,44,IN,1,Red


###### **Exercise 6:**
select all people whose surname starts with the letter `e`:

In [22]:
people.loc[people.name.str.contains(' e',case=False)]

Unnamed: 0,name,age,country,sibilings,favourite_color
7,alex edwin,71,UK,7,Blue


In [23]:
people['last_name'] = people.name.str.split(' ',expand=True)[1]
people.drop(['last_name'],axis=1,inplace=True)

In [24]:
# the names were lowercased in In [18], so we match a lowercase 'e'
people[people.name.str.split(' ',expand=True)[1].str.startswith('e')]

Unnamed: 0,name,age,country,sibilings,favourite_color
7,alex edwin,71,UK,7,Blue


###### **Exercise 7:**
Create a new dataframe, `people_names`, where the first name and the last name are split into two different columns, `first_name` and `last_name`. The first row of the new dataframe should look like this:

`name           	first_name	last_name	age	country 	sibilings	favourite_color`

`erika schumacher	erika    	schumacher	22	DE      	2       	Red`

In [25]:
people.name.str.split(' ',expand=True)

Unnamed: 0,0,1
0,erika,schumacher
1,javi,lópez
2,maria,rovira
3,ana,garamond
4,shekhar,biswas
5,muriel,adams
6,saira,polom
7,alex,edwin
8,kit,ching
9,dog,woof


In [26]:
people[['first_name','last_name']] = people.name.str.split(' ',expand=True)
people

Unnamed: 0,name,age,country,sibilings,favourite_color,first_name,last_name
0,erika schumacher,22,DE,2,Red,erika,schumacher
1,javi lópez,50,ES,0,Yellow,javi,lópez
2,maria rovira,23,ES,4,Yellow,maria,rovira
3,ana garamond,29,ES,1,Blue,ana,garamond
4,shekhar biswas,44,IN,1,Red,shekhar,biswas
5,muriel adams,30,DE,2,Yellow,muriel,adams
6,saira polom,25,IN,3,Blue,saira,polom
7,alex edwin,71,UK,7,Blue,alex,edwin
8,kit ching,35,UK,0,Red,kit,ching
9,dog woof,2,XX,9,Gray,dog,woof


In [27]:
# create an independent copy of the dataframe
peoples = people.copy()

In [28]:
peoples[['first_name','last_name']] = peoples.name.str.split(pat=' ',expand=True)
peoples

Unnamed: 0,name,age,country,sibilings,favourite_color,first_name,last_name
0,erika schumacher,22,DE,2,Red,erika,schumacher
1,javi lópez,50,ES,0,Yellow,javi,lópez
2,maria rovira,23,ES,4,Yellow,maria,rovira
3,ana garamond,29,ES,1,Blue,ana,garamond
4,shekhar biswas,44,IN,1,Red,shekhar,biswas
5,muriel adams,30,DE,2,Yellow,muriel,adams
6,saira polom,25,IN,3,Blue,saira,polom
7,alex edwin,71,UK,7,Blue,alex,edwin
8,kit ching,35,UK,0,Red,kit,ching
9,dog woof,2,XX,9,Gray,dog,woof


In [29]:
# reorder the columns to match the layout requested in the exercise
peoples = peoples[['name','first_name','last_name','age','country','sibilings','favourite_color']]

In [30]:
peoples

Unnamed: 0,name,first_name,last_name,age,country,sibilings,favourite_color
0,erika schumacher,erika,schumacher,22,DE,2,Red
1,javi lópez,javi,lópez,50,ES,0,Yellow
2,maria rovira,maria,rovira,23,ES,4,Yellow
3,ana garamond,ana,garamond,29,ES,1,Blue
4,shekhar biswas,shekhar,biswas,44,IN,1,Red
5,muriel adams,muriel,adams,30,DE,2,Yellow
6,saira polom,saira,polom,25,IN,3,Blue
7,alex edwin,alex,edwin,71,UK,7,Blue
8,kit ching,kit,ching,35,UK,0,Red
9,dog woof,dog,woof,2,XX,9,Gray


## Cars challenges

###### **Exercise 8:**
read `vehicles.csv` and assign it to a variable called `cars`. We will use it for some extra challenges.

In [36]:
url = 'https://drive.google.com/file/d/18zYGrzRhn_mz1HJLXxSO_MwR0_nWBS3K/view?usp=sharing' # vehicles.csv
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
cars = pd.read_csv(path)

###### **Exercise 9:**
create a column called `Auto` filled with either `True` or `False` depending on whether the transmission is Automatic or not.

In [41]:
cars['Transmission'].unique()

array(['Automatic 3-spd', 'Automatic 4-spd', 'Manual 5-spd',
       'Automatic (S5)', 'Manual 6-spd', 'Automatic 5-spd', 'Auto(AM8)',
       'Auto(AM-S8)', 'Auto(AV-S7)', 'Automatic (S6)', 'Automatic (S9)',
       'Automatic (S4)', 'Auto(AM-S9)', 'Automatic (S7)', 'Auto(AM7)',
       'Auto(AM-S7)', 'Auto(AM6)', 'Automatic 6-spd', 'Manual 4-spd',
       'Automatic (S8)', 'Manual(M7)', 'Auto(AM-S6)',
       'Automatic (variable gear ratios)', 'Automatic (AV)',
       'Auto(AV-S8)', 'Automatic (AM6)', 'Automatic 8-spd', 'Auto(A1)',
       'Automatic (A1)', 'Automatic (A6)', 'Auto(AV-S6)', 'Manual 3-spd',
       'Manual 7-spd', 'Automatic 9-spd', 'Auto (AV)', 'Automatic 6spd',
       'Auto(L4)', 'Auto(L3)', 'Auto (AV-S6)', 'Auto (AV-S8)',
       'Automatic (AV-S6)', 'Automatic 7-spd', 'Manual 5 spd',
       'Auto(AM5)', 'Automatic (AM5)'], dtype=object)

In [43]:
cars["Auto"] = cars.Transmission.str.contains("Auto")
cars.head(15)

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year,Auto
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950,True
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550,True
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.4375,2100,True
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550,True
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.4375,2550,True
5,Acura,2.2CL/3.0CL,1997,2.2,4.0,Automatic 4-spd,Front-Wheel Drive,Subcompact Cars,Regular,14.982273,20,26,22,403.954545,1500,True
6,Acura,2.2CL/3.0CL,1997,2.2,4.0,Manual 5-spd,Front-Wheel Drive,Subcompact Cars,Regular,13.73375,22,28,24,370.291667,1400,False
7,Acura,2.2CL/3.0CL,1997,3.0,6.0,Automatic 4-spd,Front-Wheel Drive,Subcompact Cars,Regular,16.4805,18,26,20,444.35,1650,True
8,Acura,2.3CL/3.0CL,1998,2.3,4.0,Automatic 4-spd,Front-Wheel Drive,Subcompact Cars,Regular,14.982273,19,27,22,403.954545,1500,True
9,Acura,2.3CL/3.0CL,1998,2.3,4.0,Manual 5-spd,Front-Wheel Drive,Subcompact Cars,Regular,13.73375,21,29,24,370.291667,1400,False


###### **Exercise 10:**
create a column called `Speeds` that contains the number of speeds each transmission has, based on the number that appears in the column `Transmission`. For example, a transmission named "Automatic 4-spd" has 4 speeds, and one named "Auto (AM6)" has 6 speeds. If you find edge cases (e.g. numbers that do not make sense, no number at all...), use your own judgement to assign values to them.

Note: you will most likely need to use something called a "Regular Expression" or "regex" inside of the string method. Regular expressions are sequences of characters designed to match patterns. They can become really complex (to match complex patterns), but for this case, a simple [5 minute tutorial](https://www.youtube.com/watch?v=UQQsYXa1EHs&ab_channel=Kite) or a quick Google search should be enough. Whenever you see people writing regex in plain Python, remember that you can use the same regular expressions inside the Pandas `str` methods that accept a pattern, such as `str.contains()`, `str.extract()` or `str.replace()`. In the example below, we use the regular expression `"[v-z]"`, which means "match any lowercase letter between v and z (alphabetically)", in combination with the string method `str.contains()`:

In [33]:
people.name.str.contains("[v-z]")

0    False
1     True
2     True
3    False
4     True
5    False
6    False
7     True
8    False
9     True
Name: name, dtype: bool
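
To show a couple more regex building blocks in the same spirit, here is a minimal sketch on a short made-up Series of names. It uses `^` to anchor a pattern at the start of the text and a capture group `( )` with `str.extract()` to pull out the part that matched:

```python
import pandas as pd

# Three names taken as plain strings for illustration
names = pd.Series(["erika schumacher", "javi lópez", "kit ching"])

# ^ anchors the pattern at the start: names beginning with a letter from a to j
starts_a_to_j = names.str.contains(r"^[a-j]")

# \s matches a whitespace character and (\w+) captures the word after it
surnames = names.str.extract(r"\s(\w+)", expand=False)

print(starts_a_to_j.tolist())  # [True, True, False]
print(surnames.tolist())       # ['schumacher', 'lópez', 'ching']
```

Writing the pattern as a raw string (`r"..."`) keeps Python from interpreting backslashes such as `\s` or `\w` before the regex engine sees them.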

In [50]:
# raw string so that \d reaches the regex engine unchanged
cars["Speeds"] = cars["Transmission"].str.extract(r"(\d)")

In [57]:
# assign the result back instead of using inplace=True on a selected column
cars["Speeds"] = cars["Speeds"].fillna(0)

In [59]:
cars["Speeds"].unique()

array(['3', '4', '5', '6', '8', '7', '9', 0, '1'], dtype=object)
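
The result above mixes text digits (`'3'`) with the integer `0`. If a purely numeric column is preferred, one possible approach is to convert the extracted text with `pd.to_numeric()` before filling the gaps. The sketch below uses three transmission names from the list above as a stand-in for the real column, and keeps the same choice of `0` for transmissions without a number:

```python
import pandas as pd

# Stand-in for cars["Transmission"], using three values from the dataset
transmission = pd.Series(["Automatic 4-spd", "Auto(AM-S8)", "Automatic (AV)"])

# \d+ captures one or more digits; rows without any digit become missing
speeds = transmission.str.extract(r"(\d+)", expand=False)

# Convert the text digits to numbers, fill the gaps with 0, then cast to int
speeds = pd.to_numeric(speeds).fillna(0).astype(int)

print(speeds.tolist())  # [4, 8, 0]
```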

###### **Exercise 11:**
remove non-alphanumeric characters from the "Drivetrain" and the "Make" column

In [70]:
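# regex=True makes pandas treat "\W" as a pattern rather than literal text
cars['Make_raw'] = cars.Make.str.replace(r"\W", "", regex=True)
cars['Drivetrain_raw'] = cars.Drivetrain.str.replace(r"\W", "", regex=True)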
cars[['Make_raw','Drivetrain_raw']]

Unnamed: 0,Make_raw,Drivetrain_raw
0,AMGeneral,2WheelDrive
1,AMGeneral,2WheelDrive
2,AMGeneral,RearWheelDrive
3,AMGeneral,RearWheelDrive
4,ASCIncorporated,RearWheelDrive
...,...,...
35947,smart,RearWheelDrive
35948,smart,RearWheelDrive
35949,smart,RearWheelDrive
35950,smart,RearWheelDrive
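
One detail worth knowing about `\W`: it matches everything except letters, digits and the underscore, so underscores survive the replacement. If underscores should go too, an explicit character class is one option. A minimal sketch on a made-up Series (the value `"Rear_Wheel Drive"` is invented to show the difference):

```python
import pandas as pd

# Illustrative values; "Rear_Wheel Drive" is not taken from the dataset
s = pd.Series(["2-Wheel Drive", "Rear_Wheel Drive", "Mercedes-Benz"])

# \W removes everything except letters, digits and underscores
no_nonword = s.str.replace(r"\W", "", regex=True)

# An explicit character class removes underscores as well
only_alnum = s.str.replace(r"[^A-Za-z0-9]", "", regex=True)

print(no_nonword.tolist())  # ['2WheelDrive', 'Rear_WheelDrive', 'MercedesBenz']
print(only_alnum.tolist())  # ['2WheelDrive', 'RearWheelDrive', 'MercedesBenz']
```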
