# Filtering and splitting data
---



A dataframe contains rows and columns.  Most operations need only specific columns and specific rows.  This worksheet focuses on isolating the rows and columns you need to work with.  You will select subsets of columns, select rows by position (head, tail, iloc) and filter rows by given criteria.
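
As a quick illustration of selecting rows by position, here is a minimal sketch. It uses a small made-up table (the names and ages are hypothetical, not taken from the Titanic file) so that it runs without a download.

```python
import pandas as pd

# Hypothetical mini-table standing in for a real data set
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee", "Eve", "Fay"],
    "Age": [22, 38, 26, 35, 54, 2],
})

first_two = df.head(2)     # first 2 rows
last_two = df.tail(2)      # last 2 rows
middle = df.iloc[2:4]      # rows at positions 2 and 3 (the end position is excluded)
one_cell = df.iloc[0, 1]   # row position 0, column position 1

print(first_two, last_two, middle, one_cell, sep="\n\n")
```

`head` and `tail` take a number of rows, while `iloc` takes positions counted from 0, in the same way as Python list slicing.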

To start the first set of exercises on this sheet, read the Titanic data set in the CSV file at this URL: https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv

You will need to run this cell each time you come back to this worksheet.

Read the data from the file into a dataframe called **titanic**

In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"

titanic = pd.read_csv(url)

# Splitting across columns and rows
---

For further reference:  [How do I select a subset of a Dataframe - Pandas Getting Started Tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/03_subset_data.html)

### Exercise 1 - create a dataframe containing a subset of columns
---

Create a new dataframe called **survival** which contains just the `Name`, `Sex`, `Age` and `Survived` columns.  Display the first 5 rows of the new dataframe.

(*Reminder: use [ ] to specify a column or a set of columns.  Where there is a set of columns, these should be included in a list inside the main square brackets e.g. df[ [ item1, item2, item3 ] ] so that there is only ever ONE item in the outer brackets*)
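
The difference between one pair of brackets and two can be seen in a short sketch. The `people` table below is made up for illustration; only the bracket syntax matters.

```python
import pandas as pd

# Hypothetical data, used only to show the bracket syntax
people = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Sex": ["female", "male", "male"],
    "Age": [22.0, 38.0, 26.0],
})

ages = people["Age"]                 # one label -> a Series (a single column)
name_age = people[["Name", "Age"]]   # a list of labels -> a DataFrame

print(type(ages).__name__)
print(type(name_age).__name__)
print(name_age.shape)
```

In both cases the outer brackets hold exactly one item: either a single column label or a single list of labels.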

**Test output**:  
	Name	Sex	Age	Survived  
0	Braund, Mr. Owen Harris	male	22.0	0  
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1  
2	Heikkinen, Miss. Laina	female	26.0	1  
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1  
4	Allen, Mr. William Henry	male	35.0	0  

In [3]:
survival = titanic[["Name", "Sex", "Age", "Survived"]]
#create new df containing columns listed
display (survival.head(5))
#display first 5 rows of DF 

Unnamed: 0,Name,Sex,Age,Survived
0,"Braund, Mr. Owen Harris",male,22.0,0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
2,"Heikkinen, Miss. Laina",female,26.0,1
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
4,"Allen, Mr. William Henry",male,35.0,0


### Exercise 2 - and another subset of columns
---

From the original dataframe, create a new dataframe called **fares** which contains the columns `Pclass`, `Cabin`, `Ticket` and `Fare`.  Display the final 8 rows.

**Test output**:   
	Pclass	Cabin	Ticket	Fare  
883	2	NaN	C.A./SOTON 34068	10.500  
884	3	NaN	SOTON/OQ 392076	7.050  
885	3	NaN	382652	29.125  
886	2	NaN	211536	13.000  
887	1	B42	112053	30.000  
888	3	NaN	W./C. 6607	23.450  
889	1	C148	111369	30.000  
890	3	NaN	370376	7.750  


In [6]:
fares = titanic[["Pclass", "Cabin", "Ticket", "Fare"]]
#create new df containing columns listed
display (fares.tail(8))
#display last 8 rows of DF 

Unnamed: 0,Pclass,Cabin,Ticket,Fare
883,2,,C.A./SOTON 34068,10.5
884,3,,SOTON/OQ 392076,7.05
885,3,,382652,29.125
886,2,,211536,13.0
887,1,B42,112053,30.0
888,3,,W./C. 6607,23.45
889,1,C148,111369,30.0
890,3,,370376,7.75


# Filtering rows according to given criteria
---

To select records according to given criteria, specify the criteria in the [ ] after the dataframe.  There may be one criterion or a set of criteria.  Where there are several, enclose each criterion in brackets ( ) and combine them with the logical operators & (and), | (or) and ~ (not).  Each criterion is usually built with a comparison operator (e.g. ==, !=, <, >).
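
A minimal sketch of how these criteria behave, assuming a small made-up table with the same column names as the Titanic data (the values are hypothetical). Each criterion produces a column of True/False values, and only the rows marked True are kept.

```python
import pandas as pd

# Hypothetical rows; column names follow the Titanic data set
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "female"],
    "Age": [22.0, 38.0, 45.0, None, 14.0],
    "Survived": [1, 0, 1, 1, 0],
})

is_female = df["Sex"] == "female"   # Series of True/False, one value per row
is_young = df["Age"] < 30           # a missing Age compares as False

young_females = df[is_female & is_young]     # both criteria must be True
female_or_young = df[is_female | is_young]   # at least one criterion is True
not_female = df[~is_female]                  # ~ inverts the True/False values

print(young_females)
print(female_or_young)
print(not_female)
```

The brackets around each criterion matter when the comparisons are written inline, because `&` and `|` bind more tightly than `==` and `<` in Python.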

**Example 1**
The following will create a new dataframe called survivors which contains only the records of those who survived the sinking.  

`survivors = titanic[titanic['Survived'] == 1]`

The first five records of the `survivors` dataframe will be passengers with the ids 2, 3, 4, 9 and 10 and the shape of `survivors` will be `(342, 12)`  

**Example 2**
The following code will create a new dataframe called **young_females** which contains only the records of women under the age of 30 (whether or not they survived).  

`young_females = titanic[(titanic['Sex'] == 'female') & (titanic['Age'] < 30)]`

The last five records of the `young_females` dataframe will be passengers with the ids 875, 876, 881, 883 and 888 and the shape of `young_females` will be (147, 12)  

Try this code out in the cell below and get it to display the last 5 records.


In [8]:
survivors = titanic[titanic['Survived'] == 1]
#df containing data for survivors (Survived == 1)
young_females = titanic[(titanic['Sex'] == 'female') & (titanic['Age'] < 30)]
#df containing data for females under age 30

display (survivors.head(5), young_females.tail(5))
#display first 5 rows of survivors, then last 5 rows of young_females






Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0,,C
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.225,,C
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S


### Exercise 3 - find the third class passengers
---

Create a new dataframe called **third_class_passengers** which contains only the records for passengers who travelled in passenger class 3.  Display the first 20 records

**Test output**:  
shape = (491, 12)  
Indexes - 0,2,4,5,7,8,10,12,13,14,16,18,19,22,24,25,26,28,29,32   
PassengerIds - 1,3,5,6,8,9,11,13,14,15,17,19,20,23,25,26,27,29,30,33  



In [13]:
third_class_passengers = titanic[titanic["Pclass"] == 3]
#df containing data for 3rd class passengers

display (third_class_passengers.head(20))


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.05,,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S


### Exercise 4 - female 1st class passengers who survived
---

Create a new dataframe called **female_1st_class_survivors** which contains only the records for female passengers who travelled in passenger class 1 and who survived.  Display the last 10 records

**Test output**:  
(91, 12)  
Indexes - 829, 835, 842, 849, 853, 856, 862, 871, 879, 887  
PassengerIds - 830, 836, 843, 850, 854, 857, 863, 872, 880, 888  


In [22]:
female_1st_class_survivors = titanic[(titanic["Sex"] == "female") & (titanic["Pclass"] == 1) & (titanic["Survived"] == 1)]
#DF of female, 1st class survivors
display (female_1st_class_survivors.tail(10))
#display last 10 rows


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,
835,836,1,1,"Compton, Miss. Sara Rebecca",female,39.0,1,1,PC 17756,83.1583,E49,C
842,843,1,1,"Serepeca, Miss. Augusta",female,30.0,0,0,113798,31.0,,C
849,850,1,1,"Goldenberg, Mrs. Samuel L (Edwiga Grabowska)",female,,1,0,17453,89.1042,C92,C
853,854,1,1,"Lines, Miss. Mary Conover",female,16.0,0,1,PC 17592,39.4,D28,S
856,857,1,1,"Wick, Mrs. George Dennick (Mary Hitchcock)",female,45.0,1,1,36928,164.8667,,S
862,863,1,1,"Swift, Mrs. Frederick Joel (Margaret Welles Ba...",female,48.0,0,0,17466,25.9292,D17,S
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S


# Filtered and split
---

When selecting on criteria and by column, specify the criteria first, then specify the columns.  Example - display the name and passenger class for female passengers under the age of 30:  

`young_females = titanic[(titanic['Sex'] == 'female') & (titanic['Age'] < 30)][['Name','Pclass']]`
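
The criteria-then-columns pattern can be sketched on a small made-up table (hypothetical values, Titanic-style column names). The sketch also shows `.loc`, which pandas provides for choosing rows and columns in one step; it is offered here as an alternative, not as something this worksheet requires.

```python
import pandas as pd

# Hypothetical rows; column names follow the Titanic data set
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee"],
    "Sex": ["female", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 45.0],
    "Pclass": [3, 1, 2, 1],
})

mask = (df["Sex"] == "female") & (df["Age"] < 30)

chained = df[mask][["Name", "Pclass"]]        # filter rows, then pick columns
with_loc = df.loc[mask, ["Name", "Pclass"]]   # same rows and columns in one step

print(chained)
print(chained.equals(with_loc))
```

Storing the criteria in a variable such as `mask` keeps the brackets easier to balance than writing everything on one line.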




### Exercise 5 - name and passenger id for passengers who embarked at port C
---

Create a new dataframe called **port_embarkation_list** which contains only the records for passengers who embarked at port C.  Display the `Name` and `PassengerId` only for all records

**Test output**:  
(168, 2)  
Indexes shown - 1, 9, 19, 26, 30, ... 866, 874, 875, 879, 889  


In [4]:
port_embarkation_list = titanic[titanic["Embarked"]== "C"]
#DF  containing data for passengers who embarked at port C.
display (port_embarkation_list[["Name", "PassengerId"]])
#display listed columns for DF.

Unnamed: 0,Name,PassengerId
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",2
9,"Nasser, Mrs. Nicholas (Adele Achem)",10
19,"Masselmani, Mrs. Fatima",20
26,"Emir, Mr. Farred Chehab",27
30,"Uruchurtu, Don. Manuel E",31
...,...,...
866,"Duran y More, Miss. Asuncion",867
874,"Abelson, Mrs. Samuel (Hannah Wizosky)",875
875,"Najib, Miss. Adele Kiamie ""Jane""",876
879,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",880


### Exercise 6 - passenger id and age for all surviving passengers over 50
---

Create a dataframe called **older_survivors** which contains only the records for passengers who survived and who are older than 50.  Display the `PassengerId` and `Age` only for the last 15 records.  

**Test output**:  
shape = (22, 2)  
Indexes = 483, 496, 513, 570, 571, 587, 591, 630, 647, 765, 774, 820, 829, 857, 879  
PassengerIds = 484, 497, 514, 571, 572, 588, 592, 631, 648, 766, 775, 821, 830, 858, 880  


In [10]:
older_survivors = titanic[(titanic["Survived"] == 1) & (titanic["Age"] > 50)]
#DF containing data for passengers over 50 who survived
display (older_survivors[["PassengerId", "Age"]].tail(15))
#display listed columns for the last 15 records


Unnamed: 0,PassengerId,Age
...,...,...
820,821,52.0
829,830,62.0
857,858,51.0
879,880,56.0


### Exercise 7 - display the Name and Age of the first male, 2nd class passenger who embarked at port Q
---


Create a dataframe called **male_2nd_Q** which contains only the records for passengers who embarked at port Q, travelled second class and were male.  Display the `Name` and `Age` of the first (and only) passenger in this list.

**Test output**:  
shape = (1, 2)  
Name = Kirkland, Rev. Charles Leonard  
Age = 57.0  

In [15]:
male_2nd_Q = titanic[(titanic["Embarked"] == "Q") & (titanic["Pclass"] == 2) & (titanic["Sex"] == "male")]
#create DF containing data for 2nd class males, embarked at port  Q
display (male_2nd_Q[["Name", "Age"]])
#display name and age columns


Unnamed: 0,Name,Age
626,"Kirkland, Rev. Charles Leonard",57.0


### Exercise 8 - summarise data on who survived in each passenger class
---


Create three dataframes to hold the records for passengers in each of the three passenger classes who survived.  Display the description of each set of `PassengerIds` for all survivors.  

*To print the description as a string use:*  
```
print('First class survivors') 
print(first_class.describe().to_string())
```
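
To see what `describe()` returns before applying it to the Titanic data, here is a small sketch on a made-up column of ids (the values are hypothetical). The result is itself a Series, so individual statistics can be read out by label.

```python
import pandas as pd

# Hypothetical id values, used only to show the shape of the result
ids = pd.Series([2, 10, 24, 56], name="PassengerId")

summary = ids.describe()      # count, mean, std, min, quartiles and max
print(summary.to_string())    # plain text, without the Name/dtype footer

print(summary["count"], summary["mean"])   # pick out single statistics by label
```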

**Test output**:  
first_class =  count 136.000000 ...  
second_class =  count 87.000000 ...  
third_class =  count 119.000000 ...  

In [45]:
first_class_survivors = titanic[(titanic["Pclass"] == 1) & (titanic["Survived"] == 1)]
second_class_survivors = titanic[(titanic["Pclass"] == 2) & (titanic["Survived"] == 1)]
third_class_survivors = titanic[(titanic["Pclass"] == 3) & (titanic["Survived"] == 1)]
#Separate DFs for 1st to 3rd class survivors
print('First class survivors') 
print(first_class_survivors["PassengerId"].describe().to_string())

print('Second class survivors') 
print(second_class_survivors["PassengerId"].describe().to_string())

print('Third class survivors') 
print(third_class_survivors["PassengerId"].describe().to_string())

#Display the description of each set of PassengerIds for survivors

First class survivors
count    136.000000
mean     491.772059
std      239.006988
min        2.000000
25%      307.750000
50%      510.500000
75%      693.500000
max      890.000000
Second class survivors
count     87.000000
mean     439.080460
std      244.211937
min       10.000000
25%      254.000000
50%      441.000000
75%      612.500000
max      881.000000
Third class survivors
count    119.000000
mean     394.058824
std      264.680245
min        3.000000
25%      169.500000
50%      359.000000
75%      633.500000
max      876.000000


### Exercise 9 - summarise data on young males who survived and all males
---

Create two dataframes: one to hold the records for all male passengers and the other to hold the records for all males who survived.  Display the description of `Age` for the records in each.    

**Test output**:  
all_males = Age count 453.000000 mean 30.726645 ...  
young_males = Age count 93.000000 mean 27.276022 ...


In [37]:
all_males =  titanic[(titanic["Sex"] == "male")]
#DF containing data for all male passengers
all_males_survived = titanic[(titanic["Sex"] == "male") & (titanic["Survived"] == 1)]
#DF containing data for all male passengers who survived

print('All Males') 
print(all_males["Age"].describe().to_string())
print('All Male Survivors') 
print(all_males_survived["Age"].describe().to_string())
#Display the description of the age data for each DF.


All Males
count    453.000000
mean      30.726645
std       14.678201
min        0.420000
25%       21.000000
50%       29.000000
75%       39.000000
max       80.000000
All Male Survivors
count    93.000000
mean     27.276022
std      16.504803
min       0.420000
25%      18.000000
50%      28.000000
75%      36.000000
max      80.000000


### What does this tell us about a possible link between age and survival?

Answer:  The mean age is lower for the survivors (27.3 compared with 30.7 for all males), which suggests that younger males were more likely to survive.  

---


### Exercise 10 - challenge
---

Create a set of code cells each with some code that shows the means or the counts for interesting data sets. Add a text cell before each set of code cells to explain what you are showing in the following set of cells. 

*To do this you can add the function to the end of the selection code as shown below*:  
```
males_avg_age = titanic[(titanic['Sex'] == 'male')]['Age'].mean()
print(males_avg_age)
```

An example might be that you are going to select passengers who embarked at port C and who paid a fare over 50, and you are going to count the `PassengerId` and `Cabin` values for these passengers (*you will apply the count to the selected columns rather than the whole data table*):
```
embarked_passengers = titanic[(titanic['Embarked'] == 'C')&(titanic['Fare'] > 50)][['PassengerId','Cabin']].count()
print(embarked_passengers)
```
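
When reading counts like these, it helps to know that `count()` counts only the values that are present, so columns with missing entries give smaller numbers than the number of rows. A minimal sketch with made-up rows (hypothetical values, Titanic-style column names):

```python
import pandas as pd

# Hypothetical rows with some missing Cabin and Age values
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Cabin": ["C85", None, None, "B42"],
    "Age": [22.0, None, 26.0, 35.0],
})

counts = df[["PassengerId", "Cabin"]].count()   # non-missing values per column
mean_age = df["Age"].mean()                     # missing ages are skipped by default
rows = len(df)                                  # number of rows, missing or not

print(counts)
print(mean_age)
print(rows)
```

This is why a `Cabin` count can be much lower than the `PassengerId` count for the same selection of passengers.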



This code will show the mean age of males travelling in 3rd class who survived.

In [40]:
males_survived_3rd = titanic[(titanic["Sex"] == "male") & (titanic["Survived"] == 1) & (titanic["Pclass"] == 3)]["Age"].mean()
print (males_survived_3rd)
#mean age of 3rd class males who survived.

22.274210526315787


This code will show the count of the Passenger Ids and cabins for passengers travelling in 3rd class whose fare was less than £10.

In [41]:
passengers_fare_3rd = titanic[(titanic["Pclass"] == 3) & (titanic["Fare"] < 10)][["PassengerId", "Cabin"]].count()
print(passengers_fare_3rd)
#Passenger Id and cabin counts for 3rd class passengers who paid <10.

PassengerId    324
Cabin            5
dtype: int64


This code will show the count of the Passenger Ids and cabins for passengers travelling in 1st class who embarked from Port Q.

In [43]:
passengers_port_Q = titanic[(titanic["Embarked"] == "Q") & (titanic["Pclass"] == 1)][["PassengerId", "Cabin"]].count()
print(passengers_port_Q)
#Prints passenger Id and cabin counts for first class passengers who embarked at port Q.

PassengerId    2
Cabin          2
dtype: int64


# Reflection
----

## What skills have you demonstrated in completing this notebook?

Your answer: Creating a dataframe from a subset of columns and filtering the rows of a dataframe based on desired criteria. Summarising the data and displaying basic statistics (mean and count) for interesting datasets.

## What caused you the most difficulty?

Your answer: Structuring the code used to filter out rows based on multiple criteria, ensuring correct placement of brackets etc.