# Section 5: DataFrames in Depth

In [None]:
import numpy as np
import pandas as pd

## Introducing a New Dataset - Soccer

In this section we will be working with a DataFrame containing English Premier League soccer players. There are over 400 players and 17 attributes!

In [None]:
data_url = 'https://andybek.com/pandas-soccer'

Read in the data

In [None]:
players = pd.read_csv(data_url)

Let's take a look at the DataFrame info. It looks like we have quite a few numeric columns as well as some object columns.

In [None]:
players.info(verbose=False, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465 entries, 0 to 464
Columns: 17 entries, name to new_signing
dtypes: float64(2), int64(10), object(5)
memory usage: 190.7 KB


We can take a higher-level look at the column types by calling the `value_counts()` method on the `dtypes` attribute.

In [None]:
players.dtypes.value_counts()

int64      10
object      5
float64     2
dtype: int64

Take one last peek at the DataFrame structure to ensure we're ready to go.

In [None]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0


## Quick Review - Indexing with Boolean Masks

Recall the general approach for boolean indexing:
1. Generate a sequence of booleans ("Trues" and "Falses")
2. Use that boolean sequence with either square brackets [ ] or `.loc[]` to make the selection.
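
The two steps above can be sketched on a tiny, made-up frame. The `demo` DataFrame and its values are hypothetical and are not taken from the soccer dataset; the point is only that square brackets and `.loc[]` select the same rows when given the same mask.

```python
import pandas as pd

# Hypothetical three-player frame; names and values are made up for illustration.
demo = pd.DataFrame(
    {"name": ["A", "B", "C"], "market_value": [65.0, 7.0, 50.0]}
)

# Step 1: build the boolean sequence (one True/False per row).
mask = demo["market_value"] > 40

# Step 2: pass it to square brackets or .loc[]; both keep the rows where mask is True.
via_brackets = demo[mask]
via_loc = demo.loc[mask]

print(via_brackets)
print(via_brackets.equals(via_loc))
```

Storing the mask in a variable first is optional, but it makes the two steps easier to inspect separately.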

Say we're interested in learning which players have a market value exceeding 40 million dollars. We start by using the attribute accessor or selection brackets with a comparison operator to create the boolean sequence.

In [None]:
players["market_value"] > 40

0       True
1       True
2      False
3      False
4      False
       ...  
460    False
461    False
462    False
463    False
464    False
Name: market_value, Length: 465, dtype: bool

Now, to select just the players with a market value over 40 million, simply pass this expression into a set of selection brackets on the DataFrame itself.

In [None]:
players[players["market_value"] > 40]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
96,Eden Hazard,Chelsea,26,LW,1,75.0,4220,10.5,2.30%,224,2,Belgium,0,3,5,1,0
97,Diego Costa,Chelsea,28,CF,1,50.0,4454,10.0,3.00%,196,2,Spain,0,4,5,1,0
108,N%27Golo Kante,Chelsea,26,DM,2,50.0,4042,5.0,13.80%,83,2,France,0,3,5,1,1
218,Philippe Coutinho,Liverpool,25,AM,1,45.0,2958,9.0,30.80%,171,3,Brazil,0,3,10,1,0
244,Kevin De Bruyne,Manchester+City,26,AM,1,65.0,2252,10.0,17.50%,199,2,Belgium,0,3,11,1,0
245,Sergio Aguero,Manchester+City,29,CF,1,65.0,4046,11.5,9.70%,175,3,Argentina,0,4,11,1,0
246,Raheem Sterling,Manchester+City,22,LW,1,45.0,2074,8.0,3.80%,149,1,England,0,2,11,1,0
264,Romelu Lukaku,Manchester+United,24,CF,1,50.0,3727,11.5,45.00%,221,2,Belgium,0,2,12,1,0


Indeed, there are 13 players who meet this criterion.

In [None]:
players[players["market_value"] > 40].shape

(13, 17)
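
As a side note, the number of matches can also be read straight off the mask, because `True` counts as 1 and `False` as 0 when summed. The sketch below uses made-up values rather than the real `players` data.

```python
import pandas as pd

# Made-up market values, not the real dataset.
values = pd.Series([65.0, 7.0, 50.0, 20.0], name="market_value")
mask = values > 40

n_matches = int(mask.sum())   # number of True entries
share = float(mask.mean())    # fraction of True entries

print(n_matches, share)
```

This avoids building the filtered DataFrame when only the count is needed.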

## More Approaches to Boolean Masking: `isin()`, `le()`, `between()`

Say we're interested in looking at just the *Defenders* (soccer position) in the DataFrame. Defenders, or backs, can be left, center, or right. Let's see what the position codes are in our DataFrame by examining the `position` column and using the `unique()` method.


In [None]:
players['position'].unique()

array(['LW', 'AM', 'GK', 'RW', 'CB', 'RB', 'CF', 'LB', 'DM', 'RM', 'CM',
       nan, 'SS', 'LM'], dtype=object)

Out of those, we are interested in the defenders: LB, CB, and RB. Thus, we want to extract rows from our DataFrame on the condition that the position is one of these three. This is a great use case for the `isin()` method, which creates a boolean Series indicating whether each entry is in the list of values passed in.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html
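
Before running this on the real data, here is a minimal sketch on a made-up `positions` Series (hypothetical values). It also shows that a missing position, like the `nan` seen in the `unique()` output, simply yields `False`.

```python
import numpy as np
import pandas as pd

# Hypothetical position codes, including one missing value.
positions = pd.Series(["LW", "CB", np.nan, "RB", "GK"], name="position")

# True where the entry is one of the listed codes; the missing entry is False.
is_defender = positions.isin(["LB", "CB", "RB"])

print(is_defender.tolist())
```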

In [None]:
players['position'].isin(['LB', 'CB', 'RB'])

0      False
1      False
2      False
3      False
4       True
       ...  
460    False
461     True
462     True
463    False
464    False
Name: position, Length: 465, dtype: bool

We can now apply the boolean mask to the DataFrame using `loc[]` to select just the Defenders. We see that 157 players in this list are defenders.

In [None]:
players.loc[players['position'].isin(['LB', 'CB', 'RB'])]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0
7,Nacho Monreal,Arsenal,31,LB,3,13.0,555,5.5,4.70%,115,2,Spain,0,4,1,1,0
8,Shkodran Mustafi,Arsenal,25,CB,3,30.0,1877,5.5,4.00%,90,2,Germany,0,3,1,1,1
17,Gabriel Paulista,Arsenal,26,CB,3,13.0,552,5.0,0.10%,45,3,Brazil,0,3,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
455,Aaron Cresswell,West+Ham,27,LB,3,12.0,380,5.0,1.30%,60,1,England,0,3,20,0,0
458,Angelo Ogbonna,West+Ham,29,CB,3,9.0,247,4.5,1.10%,45,2,Italy,0,4,20,0,0
459,Pablo Zabaleta,West+Ham,32,RB,3,7.0,698,5.0,2.70%,45,3,Argentina,0,5,20,0,0
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1


Let's look at ranges of values. We can use the familiar `between()` method for this.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.between.html
* Remember that by default, the edge values are inclusive. This can be changed using the `inclusive` parameter, which in pandas 1.3 and later takes `"both"`, `"neither"`, `"left"`, or `"right"`.
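
To see how the edges are treated, the sketch below runs `between()` on three made-up values that sit exactly on and between the bounds. It assumes pandas 1.3 or later, where `inclusive` takes the string options; the `results` dictionary is just a helper for this illustration.

```python
import pandas as pd

# Made-up values: lower bound, midpoint, upper bound.
values = pd.Series([40.0, 45.0, 50.0])

# Collect the mask produced by each string option of `inclusive`.
results = {
    option: values.between(40, 50, inclusive=option).tolist()
    for option in ("both", "neither", "left", "right")
}

for option, mask in results.items():
    print(option, mask)
```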

Let's assess players that are between market values of 40 and 50 million dollars.

In [None]:
# "neither" excludes both edges (boolean values for `inclusive` were removed in pandas 2.0)
players.market_value.between(40, 50, inclusive="neither")

0      False
1      False
2      False
3      False
4      False
       ...  
460    False
461    False
462    False
463    False
464    False
Name: market_value, Length: 465, dtype: bool

As before, we can now apply this boolean mask to the original DataFrame. It appears there are only three players with a market value above 40 million but below 50 million.

In [None]:
# "neither" excludes both edges (boolean values for `inclusive` were removed in pandas 2.0)
players[players.market_value.between(40, 50, inclusive="neither")]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
218,Philippe Coutinho,Liverpool,25,AM,1,45.0,2958,9.0,30.80%,171,3,Brazil,0,3,10,1,0
246,Raheem Sterling,Manchester+City,22,LW,1,45.0,2074,8.0,3.80%,149,1,England,0,2,11,1,0
380,Dele Alli,Tottenham,21,CM,2,45.0,4626,9.5,38.60%,225,1,England,0,1,17,1,0


If the edge values are inclusive, we see that the number of players is substantially higher.

In [None]:
# "both" includes both edges and is the default
players[players.market_value.between(40, 50, inclusive="both")]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
31,Alexandre Lacazette,Arsenal,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
97,Diego Costa,Chelsea,28,CF,1,50.0,4454,10.0,3.00%,196,2,Spain,0,4,5,1,0
102,Thibaut Courtois,Chelsea,25,GK,4,40.0,1260,5.5,18.50%,141,2,Belgium,0,3,5,1,0
108,N%27Golo Kante,Chelsea,26,DM,2,50.0,4042,5.0,13.80%,83,2,France,0,3,5,1,1
218,Philippe Coutinho,Liverpool,25,AM,1,45.0,2958,9.0,30.80%,171,3,Brazil,0,3,10,1,0
219,Sadio Mane,Liverpool,25,LW,1,40.0,3219,9.5,5.30%,156,4,Senegal,0,3,10,1,1
246,Raheem Sterling,Manchester+City,22,LW,1,45.0,2074,8.0,3.80%,149,1,England,0,2,11,1,0
263,Bernardo Silva,Manchester+City,22,RW,1,40.0,1098,8.0,4.60%,0,2,Portugal,1,2,11,1,0
264,Romelu Lukaku,Manchester+United,24,CF,1,50.0,3727,11.5,45.00%,221,2,Belgium,0,2,12,1,0


How else can we create boolean Series on the fly? One way is to use an old-fashioned comparison operator. For instance, if we want to look at players aged 25 or younger:

In [None]:
players['age'] <= 25

0      False
1      False
2      False
3      False
4      False
       ...  
460     True
461     True
462     True
463     True
464    False
Name: age, Length: 465, dtype: bool

In [None]:
players[players['age'] <= 25]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0
8,Shkodran Mustafi,Arsenal,25,CB,3,30.0,1877,5.5,4.00%,90,2,Germany,0,3,1,1,1
9,Alex Iwobi,Arsenal,21,LW,1,10.0,1812,5.5,1.00%,89,4,Nigeria,0,1,1,1,0
10,Granit Xhaka,Arsenal,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
11,Granit Xhaka,Arsenal,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456,Pedro Obiang,West+Ham,25,CM,2,9.0,286,4.5,0.30%,55,2,Spain,0,3,20,0,0
460,Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
462,Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0


However, Pandas also has a method for accomplishing this directly. Instead of the `<=` operator, we can use the `.le()` method, which stands for "less than or equal to".

In [None]:
players[players['age'].le(25)]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0
8,Shkodran Mustafi,Arsenal,25,CB,3,30.0,1877,5.5,4.00%,90,2,Germany,0,3,1,1,1
9,Alex Iwobi,Arsenal,21,LW,1,10.0,1812,5.5,1.00%,89,4,Nigeria,0,1,1,1,0
10,Granit Xhaka,Arsenal,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
11,Granit Xhaka,Arsenal,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456,Pedro Obiang,West+Ham,25,CM,2,9.0,286,4.5,0.30%,55,2,Spain,0,3,20,0,0
460,Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
462,Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0


The result is equivalent to using comparison operators. To prove this, let's use the `equals()` method to compare them.

In [None]:
players.age.le(25).equals(players.age <= 25)

True

There are multiple comparison methods at your disposal:
* `le()`: less than or equal to (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.le.html)
* `gt()`: greater than (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.gt.html)
* `lt()`: less than (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.lt.html)
* `ge()`: greater than or equal to (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ge.html)
* `ne()`: not equal to (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ne.html)
* `eq()`: equal to (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.eq.html)

The main difference between using comparators and using these comparison methods is that the methods have a `fill_value` parameter that can be used to substitute in missing data.
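
A minimal sketch of that difference, using a made-up `ages` Series with one missing value (hypothetical data, not the `players` column). With the plain operator the missing entry compares as `False`; with `fill_value=0` the gap is treated as 0 before the comparison is made.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
ages = pd.Series([22.0, np.nan, 31.0])

# Operator form: NaN <= 25 evaluates to False.
with_operator = ages <= 25

# Method form: the NaN is replaced by 0 first, so 0 <= 25 evaluates to True.
with_fill = ages.le(25, fill_value=0)

print(with_operator.tolist())
print(with_fill.tolist())
```

Whether 0 is a sensible stand-in depends on the column; the value here is chosen only to make the difference visible.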

## Binary Operators with Booleans

Before we get to combining conditions, let's look at how booleans behave in isolation.

**Binary operators** are similar to other operators, but they work on the binary representation of the values, that is, on the individual bits. They let us perform comparisons or complements on a bit-by-bit basis, which is why they are also known as bitwise operators.

The most useful binary operators are `OR` and `AND`.

Let's first examine `OR`. What would we expect if we evaluated `False OR True`, with OR represented by the pipe `|`?

We expect to get True, because an OR comparison resolves to True as long as at least one of the conditions is True. Think of the `OR` operator as "searching for True".

In [22]:
True | False

True

Let's try some other combinations.

In [23]:
False | False

False

In [24]:
True | True

True

Let's now look at the `AND` operator, represented by the ampersand `&`. Think of the AND operator as always True unless there is a False.
* A single `False` is enough to trigger a "False".

In [26]:
False & True

False

In [27]:
True & True

True

In [28]:
False & False

False

In [29]:
True & False

False

In [30]:
True & True & False & True & True

False
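
The individual results above can be collected into a single truth table. The sketch below is only a compact restatement of those cells; the `rows` list is a helper name introduced for this illustration.

```python
from itertools import product

# Every pairing of True/False, with the OR and AND result for each.
rows = [(a, b, a | b, a & b) for a, b in product([True, False], repeat=2)]

for a, b, either, both in rows:
    print(f"{a!s:5} {b!s:5} | OR={either!s:5} AND={both!s:5}")
```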

How do Pandas Series combine using booleans? Let's start by making a single-element Series containing a single False value.

In [31]:
f = pd.Series(False)

Let's also make a single-element Series that contains True.

In [32]:
t = pd.Series(True)

Let's now use the AND and OR operators on them. What we get is a new Series holding the result of combining the boolean values of the two Series element by element.

In [33]:
t & f

0    False
dtype: bool

In [34]:
t | f

0    True
dtype: bool

That was simple, but where this becomes very powerful is with long Series of booleans.

In [35]:
# i % 2 == 0 is already a boolean, so no if/else is needed
t = pd.Series([i % 2 == 0 for i in range(10)])

In [36]:
t

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool

In [37]:
# ten False values
f = pd.Series([False] * 10)

In [38]:
f

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

What happens when we combine these together using the AND and OR operators? 

The `&` operator, as expected, returns a Series of all False values, since the `f` Series consists of all False values.

In [39]:
t & f

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

With the `|` (OR) operator, the result alternates between `True` and `False`, as we would expect, since the `t` Series contains alternating True and False values.

In [40]:
t | f

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool

The important thing to note is that when we combine two Pandas Series, the comparison is done label-to-label. It is NOT done based on position. Consider this example:

In [44]:
f = pd.Series(data = [False, True, True], index = ['c', 'b', 'a'])
t = pd.Series(data = [True, False, False], index = ['a','b','c'])

In [45]:
f

c    False
b     True
a     True
dtype: bool

In [46]:
t

a     True
b    False
c    False
dtype: bool

Now let's use the `&` operator. If the comparison were based on order of the values within the individual Series, we would expect `[False, False, False]`. 

In [47]:
f & t

a     True
b    False
c    False
dtype: bool

We did not get that! Instead, we got `True` for `a` and `False` for `b` and `c`. That's because the comparison was done by the index labels `a`, `b`, and `c` and NOT the positions of the values.
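
If a position-by-position result is ever what you want, one possible approach is to drop the labels first, for example by converting both Series to NumPy arrays. The sketch below rebuilds the same `f` and `t` and compares the two behaviours; `by_label` and `by_position` are names introduced only for this illustration.

```python
import pandas as pd

f = pd.Series([False, True, True], index=["c", "b", "a"])
t = pd.Series([True, False, False], index=["a", "b", "c"])

# Series & Series lines values up by index label before combining.
by_label = f & t

# NumPy arrays carry no labels, so values are combined in positional order.
by_position = f.to_numpy() & t.to_numpy()

print(by_label.to_dict())
print(by_position.tolist())
```

The positional result is a plain array, so the index labels are lost along the way.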

This will help illustrate what happens behind the scenes when we combine booleans.