[👈 Chapter 18](18-pandas-importing-data.ipynb) -
[🏠 To index](README.md) -
[👉 Chapter 20](20-pandas-transform-apply.ipynb)

# 19 - Pandas: filtering and sorting data

In [2]:
# Let's filter and sort the data from the children's names database.
# First, let's import and inspect the data using the read_csv function from chapter 18
import pandas as pd
df = pd.read_csv("svb-names-2014.csv")
df.head()

Unnamed: 0,name,number,gender
0,Sophie,836,1
1,Daan,751,0
2,Emma,728,1
3,Bram,727,0
4,Milan,700,0


In [3]:
# As you can see, this list is ordered by the number of times a name was given
# The 'gender' column indicates whether the child was a boy (0) or a girl (1)
df["gender"].value_counts()

1    13402
0    11657
Name: gender, dtype: int64

In [4]:
# Pandas uses the bracket syntax ([]) to filter the data as well. For example, this
# is how we get only the top 3 boys' names
df[df["gender"] == 0].head(3) # Remember that head(n) returns the first n rows

Unnamed: 0,name,number,gender
1,Daan,751,0
3,Bram,727,0
4,Milan,700,0
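
The filter above works in two steps: the comparison builds a column of True/False values, and the brackets keep the rows where that value is True. A minimal sketch, assuming a hypothetical five-row `sample` frame copied from the output above rather than the full CSV file:

```python
import pandas as pd

# Hypothetical miniature of the names table (values copied from the output above).
sample = pd.DataFrame({
    "name": ["Sophie", "Daan", "Emma", "Bram", "Milan"],
    "number": [836, 751, 728, 727, 700],
    "gender": [1, 0, 1, 0, 0],
})

# The comparison produces one True/False value per row.
mask = sample["gender"] == 0
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
# The original row labels (1, 3, 4) are preserved.
boys = sample[mask]
print(boys)
```

Storing the mask in a variable first can make longer filters easier to read and reuse.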


In [5]:
# And here are the girls' names
df[df["gender"] == 1].head(3)

Unnamed: 0,name,number,gender
0,Sophie,836,1
2,Emma,728,1
8,Julia,676,1


In [6]:
# Pandas DataFrames also have a query() method that might be more intuitive
df.query("gender == 1").head(3)

Unnamed: 0,name,number,gender
0,Sophie,836,1
2,Emma,728,1
8,Julia,676,1
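
Inside a `query()` string, a Python variable can be referenced by prefixing its name with `@`. The sketch below assumes a hypothetical five-row `sample` frame and a made-up variable name, `wanted_gender`, and checks that `query()` and the bracket syntax select the same rows:

```python
import pandas as pd

# Hypothetical miniature of the names table (values copied from the output above).
sample = pd.DataFrame({
    "name": ["Sophie", "Daan", "Emma", "Bram", "Milan"],
    "number": [836, 751, 728, 727, 700],
    "gender": [1, 0, 1, 0, 0],
})

wanted_gender = 1  # hypothetical variable, referenced as @wanted_gender below

# query() takes the condition as a string; @ pulls in a local variable.
girls = sample.query("gender == @wanted_gender")

# The same selection written with the bracket syntax.
same = sample[sample["gender"] == wanted_gender]

print(girls)
print(girls.equals(same))
```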


In [7]:
# We can also add a new column to the dataset, simply by assigning to it and specifying the
# values. Note that the operation is applied to every row of the dataset at once, so the
# result is the same as writing a for loop over all rows yourself.
# Let's add a column that holds the length of each name
df["name_length"] = df["name"].str.len()
df.head()

Unnamed: 0,name,number,gender,name_length
0,Sophie,836,1,6
1,Daan,751,0,4
2,Emma,728,1,4
3,Bram,727,0,4
4,Milan,700,0,5
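
To make the comparison with a for loop concrete, here is a small sketch on a hypothetical three-row `sample` frame. The `year` column is an illustrative addition that is not part of the original dataset; it shows that assigning a single value repeats it for every row.

```python
import pandas as pd

# Hypothetical three-row sample; the names come from the table above.
sample = pd.DataFrame({"name": ["Sophie", "Daan", "Emma"]})

# Column-wise: one expression computes the length of every name.
sample["name_length"] = sample["name"].str.len()

# Roughly equivalent explicit loop, shown only for comparison.
looped = [len(name) for name in sample["name"]]

# Assigning a single value repeats it for every row (illustrative column).
sample["year"] = 2014

print(sample)
print(looped)
```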


In [8]:
# And now, let's sort the list by this new column, so we can see the longest name
# given to a child born in 2014
df.sort_values("name_length").head()

Unnamed: 0,name,number,gender,name_length
9264,Al,1,0,2
13128,Li,1,0,2
24308,To,1,1,2
12463,Ji,1,0,2
16220,Yk,1,0,2


In [9]:
# Ah! We get the shortest names instead, because sort_values sorts ascending (low to high)
# by default. Let's fix that
df.sort_values("name_length", ascending=False).head()

Unnamed: 0,name,number,gender,name_length
24495,Vlinder-Jezzebelle,1,1,18
10114,D'Angelo-Valentino,1,0,18
11434,Ghëarmiangelijanno,1,0,18
16848,Angenieta-Jantina,1,1,17
10669,Djievano-Sonnekay,1,0,17
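
Several names share the same length, so it can help to sort on a second column to break ties. `sort_values` accepts a list of columns and a matching list of directions. A minimal sketch, assuming a hypothetical four-row `sample` frame built from names in the outputs above:

```python
import pandas as pd

# Hypothetical sample using names that appear in the outputs above.
sample = pd.DataFrame({
    "name": ["Al", "Vlinder-Jezzebelle", "Angenieta-Jantina", "D'Angelo-Valentino"],
})
sample["name_length"] = sample["name"].str.len()

# Longest names first; names of equal length are ordered alphabetically.
ordered = sample.sort_values(["name_length", "name"], ascending=[False, True])
print(ordered)
```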


In [10]:
# You can also do string operations, for example, 
# let's see the most popular double names (names with a dash)
df[df["name"].str.contains("-")].sort_values("number", ascending=False).head()

Unnamed: 0,name,number,gender,name_length
1711,Anne-Sophie,14,1,11
1911,Jazz-Lynn,12,1,9
2027,Anne-Fleur,11,1,10
2143,Jay-Jay,10,0,7
2183,Anne-Lynn,10,1,9


In [11]:
# We can filter on multiple conditions as well.
# Let's find all girls' names that start with 'Piet'.
# Note the & operator and the parentheses around each condition: & binds more tightly
# than ==, so the comparison must be wrapped in parentheses
df[(df["name"].str.startswith("Piet")) & (df["gender"] == 1)]

Unnamed: 0,name,number,gender,name_length
1458,Pieternella,17,1,11
5769,Pietertje,3,1,9
8630,Pieternel,2,1,9
22779,Pieta,1,1,5
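
Besides `&` (and), conditions can be combined with `|` (or) and inverted with `~` (not). A short sketch, assuming a hypothetical four-row `sample` frame; the row for "Pieter" is made up for illustration:

```python
import pandas as pd

# Hypothetical sample; "Pieter" is an illustrative row, not taken from the dataset.
sample = pd.DataFrame({
    "name": ["Pieternella", "Pieter", "Pieta", "Emma"],
    "gender": [1, 0, 1, 1],
})

starts_with_piet = sample["name"].str.startswith("Piet")
is_girl = sample["gender"] == 1

both = sample[starts_with_piet & is_girl]        # and: both conditions hold
either = sample[starts_with_piet | is_girl]      # or: at least one holds
other_girls = sample[~starts_with_piet & is_girl]  # not: invert the first mask

print(both["name"].tolist())
print(either["name"].tolist())
print(other_girls["name"].tolist())
```

Naming each mask first, as done here, avoids the need for extra parentheses in the final expression.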


In [12]:
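# The 'groupby' method is handy for applying an operation to each group of rows. For example,
# to sum the numeric columns for each gender, you can try this. Selecting the numeric columns
# explicitly keeps recent pandas versions from trying to sum the text column as well
df.groupby("gender")[["number", "name_length"]].sum()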

Unnamed: 0_level_0,number,name_length
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
0,88765,70761
1,84851,86063
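
A sum is not meaningful for every column (the total of all name lengths says little). With `agg`, each output column can get its own operation. A minimal sketch on a hypothetical five-row `sample` frame; the output column names `total` and `longest` are made up for illustration:

```python
import pandas as pd

# Hypothetical miniature of the names table (values copied from the outputs above).
sample = pd.DataFrame({
    "name": ["Sophie", "Daan", "Emma", "Bram", "Milan"],
    "number": [836, 751, 728, 727, 700],
    "gender": [1, 0, 1, 0, 0],
})
sample["name_length"] = sample["name"].str.len()

# One aggregation per output column: new_name=(source_column, operation).
summary = sample.groupby("gender").agg(
    total=("number", "sum"),
    longest=("name_length", "max"),
)
print(summary)
```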


In [13]:
# Combining a couple of the methods we've learned, let's do something more complex.
# Let's find out the most popular initial letter for names.
# First we need an extra column that just has the first letter of the name in the row
df["first"] = df["name"].str[0]

# Now we group by this letter, sum all the values in the 'number' column, sort them with the
# highest value first and then show the first five results
df.groupby("first")["number"].sum().sort_values(ascending=False).head()

first
J    18913
L    16394
M    16183
S    15491
A    11697
Name: number, dtype: int64
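
The chain above is easier to follow on a few rows where the totals can be checked by hand. The sketch below assumes a hypothetical five-row `sample` frame; the counts for "Sem" and "Eva" are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical sample; the counts for "Sem" and "Eva" are illustrative only.
sample = pd.DataFrame({
    "name": ["Sophie", "Daan", "Emma", "Sem", "Eva"],
    "number": [836, 751, 728, 600, 500],
})

# Step 1: take the first letter of every name.
sample["first"] = sample["name"].str[0]

# Step 2: add up the counts per letter.
per_letter = sample.groupby("first")["number"].sum()

# Step 3: put the most common letter first.
totals = per_letter.sort_values(ascending=False)
print(totals)
```

Here S adds up to 836 + 600 and E to 728 + 500, so S comes first, followed by E and D.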
