# Section 5: DataFrames in Depth

In [1]:
import numpy as np
import pandas as pd

## Introducing a New Dataset - Soccer

In this section we will be working with a DataFrame containing English Premier League soccer players. There are over 400 players and 17 attributes!

In [2]:
data_url = 'https://andybek.com/pandas-soccer'

Read in the data

In [3]:
players = pd.read_csv(data_url)

Let's take a look at the DataFrame info. Looks like we have quite a few numeric columns as well as object columns.

In [4]:
players.info(verbose=False, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465 entries, 0 to 464
Columns: 17 entries, name to new_signing
dtypes: float64(2), int64(10), object(5)
memory usage: 190.7 KB


We can take a higher level look at this using the `dtypes` attribute and calling the `value_counts()` method.

In [5]:
players.dtypes.value_counts()

int64      10
object      5
float64     2
dtype: int64

Take one last peek at the DataFrame structure to ensure we're ready to go.

In [6]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0


## Quick Review - Indexing with Boolean Masks

Recall the general approach for boolean indexing:
1. Generate a sequence of booleans ("Trues" and "Falses")
2. Use that boolean sequence with either square brackets [ ] or `.loc[]` to make the selection.

Say we're interested in learning which players have a market value exceeding 40 million dollars. We start by using the attribute accessor or selection brackets with a comparison operator to create the boolean sequence.

In [7]:
players["market_value"] > 40

0       True
1       True
2      False
3      False
4      False
       ...  
460    False
461    False
462    False
463    False
464    False
Name: market_value, Length: 465, dtype: bool

Now, to select just the players with a market value over 40 million, simply pass this expression into a set of selection brackets for the DataFrame itself.

In [8]:
players[players["market_value"] > 40]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
96,Eden Hazard,Chelsea,26,LW,1,75.0,4220,10.5,2.30%,224,2,Belgium,0,3,5,1,0
97,Diego Costa,Chelsea,28,CF,1,50.0,4454,10.0,3.00%,196,2,Spain,0,4,5,1,0
108,N%27Golo Kante,Chelsea,26,DM,2,50.0,4042,5.0,13.80%,83,2,France,0,3,5,1,1
218,Philippe Coutinho,Liverpool,25,AM,1,45.0,2958,9.0,30.80%,171,3,Brazil,0,3,10,1,0
244,Kevin De Bruyne,Manchester+City,26,AM,1,65.0,2252,10.0,17.50%,199,2,Belgium,0,3,11,1,0
245,Sergio Aguero,Manchester+City,29,CF,1,65.0,4046,11.5,9.70%,175,3,Argentina,0,4,11,1,0
246,Raheem Sterling,Manchester+City,22,LW,1,45.0,2074,8.0,3.80%,149,1,England,0,2,11,1,0
264,Romelu Lukaku,Manchester+United,24,CF,1,50.0,3727,11.5,45.00%,221,2,Belgium,0,2,12,1,0


Indeed, there are 13 players who meet this criterion.

In [9]:
players[players["market_value"] > 40].shape

(13, 17)

## More Approaches to Boolean Masking: `isin()`, `lt()`, `between()`

Say we're interested in looking at just the *Defenders* (soccer position) in the DataFrame. Defenders, or backs, can be left, center, or right. Let's see what the position codes are in our DataFrame by examining the `position` column and using the `unique()` method.


In [10]:
players['position'].unique()

array(['LW', 'AM', 'GK', 'RW', 'CB', 'RB', 'CF', 'LB', 'DM', 'RM', 'CM',
       nan, 'SS', 'LM'], dtype=object)

Out of those, we are interested in the defenders: LB, CB, and RB. Thus, we want to extract rows from our DataFrame on the condition that the position is one of these three. This is a great use case for the `isin()` method, which creates a boolean Series indicating whether each entry is in the list of values passed in.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html

In [11]:
players['position'].isin(['LB', 'CB', 'RB'])

0      False
1      False
2      False
3      False
4       True
       ...  
460    False
461     True
462     True
463    False
464    False
Name: position, Length: 465, dtype: bool

We can now apply the boolean mask to the DataFrame using `loc[]` to select just the Defenders. We see that 157 players in this list are defenders.

In [12]:
players.loc[players['position'].isin(['LB', 'CB', 'RB'])]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0
7,Nacho Monreal,Arsenal,31,LB,3,13.0,555,5.5,4.70%,115,2,Spain,0,4,1,1,0
8,Shkodran Mustafi,Arsenal,25,CB,3,30.0,1877,5.5,4.00%,90,2,Germany,0,3,1,1,1
17,Gabriel Paulista,Arsenal,26,CB,3,13.0,552,5.0,0.10%,45,3,Brazil,0,3,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
455,Aaron Cresswell,West+Ham,27,LB,3,12.0,380,5.0,1.30%,60,1,England,0,3,20,0,0
458,Angelo Ogbonna,West+Ham,29,CB,3,9.0,247,4.5,1.10%,45,2,Italy,0,4,20,0,0
459,Pablo Zabaleta,West+Ham,32,RB,3,7.0,698,5.0,2.70%,45,3,Argentina,0,5,20,0,0
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1


Let's look at ranges of values. We can use the familiar `between()` method for this.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.between.html
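* Remember that by default, both edge values are inclusive. This can be changed using the `inclusive` parameter, which accepts `"both"`, `"neither"`, `"left"`, or `"right"` in pandas 1.3 and later (older versions took a boolean).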
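As a minimal sketch of how the `inclusive` options behave, the toy Series below (made-up values, not the soccer data) compares `between()` with the equivalent pair of comparisons. The string option assumes pandas 1.3 or later.

```python
import pandas as pd

# Toy market values (hypothetical, not taken from the players dataset)
values = pd.Series([40.0, 45.0, 50.0, 65.0])

# Default: both edges are included
both = values.between(40, 50)

# Assumes pandas >= 1.3, where `inclusive` takes a string
neither = values.between(40, 50, inclusive="neither")

# The same masks written with comparison operators
both_manual = (values >= 40) & (values <= 50)
neither_manual = (values > 40) & (values < 50)

print(both.tolist())     # [True, True, True, False]
print(neither.tolist())  # [False, True, False, False]
```

Writing the range as two comparisons is more verbose, but it does not depend on which form of `inclusive` the installed pandas version expects.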

Let's assess players that are between market values of 40 and 50 million dollars.

In [13]:
players.market_value.between(40, 50, inclusive="neither")  # pandas >= 1.3; older versions used inclusive=False

0      False
1      False
2      False
3      False
4      False
       ...  
460    False
461    False
462    False
463    False
464    False
Name: market_value, Length: 465, dtype: bool

As before, we can now apply this boolean mask to the original DataFrame. It appears there are only three players with a market value above 40 million but below 50 million.

In [14]:
players[players.market_value.between(40, 50, inclusive="neither")]  # pandas >= 1.3; older versions used inclusive=False

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
218,Philippe Coutinho,Liverpool,25,AM,1,45.0,2958,9.0,30.80%,171,3,Brazil,0,3,10,1,0
246,Raheem Sterling,Manchester+City,22,LW,1,45.0,2074,8.0,3.80%,149,1,England,0,2,11,1,0
380,Dele Alli,Tottenham,21,CM,2,45.0,4626,9.5,38.60%,225,1,England,0,1,17,1,0


If the edge values are inclusive, we see that the number of players is substantially higher.

In [15]:
players[players.market_value.between(40, 50, inclusive="both")]  # pandas >= 1.3; older versions used inclusive=True

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
31,Alexandre Lacazette,Arsenal,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
97,Diego Costa,Chelsea,28,CF,1,50.0,4454,10.0,3.00%,196,2,Spain,0,4,5,1,0
102,Thibaut Courtois,Chelsea,25,GK,4,40.0,1260,5.5,18.50%,141,2,Belgium,0,3,5,1,0
108,N%27Golo Kante,Chelsea,26,DM,2,50.0,4042,5.0,13.80%,83,2,France,0,3,5,1,1
218,Philippe Coutinho,Liverpool,25,AM,1,45.0,2958,9.0,30.80%,171,3,Brazil,0,3,10,1,0
219,Sadio Mane,Liverpool,25,LW,1,40.0,3219,9.5,5.30%,156,4,Senegal,0,3,10,1,1
246,Raheem Sterling,Manchester+City,22,LW,1,45.0,2074,8.0,3.80%,149,1,England,0,2,11,1,0
263,Bernardo Silva,Manchester+City,22,RW,1,40.0,1098,8.0,4.60%,0,2,Portugal,1,2,11,1,0
264,Romelu Lukaku,Manchester+United,24,CF,1,50.0,3727,11.5,45.00%,221,2,Belgium,0,2,12,1,0


How else can we create boolean Series on the fly? One way is to use an old-fashioned comparison operator. For instance, if we want to look at players age 25 or younger:

In [16]:
players['age'] <= 25

0      False
1      False
2      False
3      False
4      False
       ...  
460     True
461     True
462     True
463     True
464    False
Name: age, Length: 465, dtype: bool

In [17]:
players[players['age'] <= 25]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0
8,Shkodran Mustafi,Arsenal,25,CB,3,30.0,1877,5.5,4.00%,90,2,Germany,0,3,1,1,1
9,Alex Iwobi,Arsenal,21,LW,1,10.0,1812,5.5,1.00%,89,4,Nigeria,0,1,1,1,0
10,Granit Xhaka,Arsenal,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
11,Granit Xhaka,Arsenal,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456,Pedro Obiang,West+Ham,25,CM,2,9.0,286,4.5,0.30%,55,2,Spain,0,3,20,0,0
460,Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
462,Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0


However, Pandas also has a method for accomplishing this directly. Instead of the `<=` operator, we can use the `.le()` method, which stands for "less than or equal to".

In [18]:
players[players['age'].le(25)]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0
8,Shkodran Mustafi,Arsenal,25,CB,3,30.0,1877,5.5,4.00%,90,2,Germany,0,3,1,1,1
9,Alex Iwobi,Arsenal,21,LW,1,10.0,1812,5.5,1.00%,89,4,Nigeria,0,1,1,1,0
10,Granit Xhaka,Arsenal,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
11,Granit Xhaka,Arsenal,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456,Pedro Obiang,West+Ham,25,CM,2,9.0,286,4.5,0.30%,55,2,Spain,0,3,20,0,0
460,Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
462,Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0


The result is equivalent to using comparison operators. To prove this, let's use the `equals()` method to compare them.

In [19]:
players.age.le(25).equals(players.age <= 25)

True

There are multiple comparison methods at your disposal:
* `le()`: less than or equal to (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.le.html)
* `gt()`: greater than (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.gt.html)
* `lt()`: less than (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.lt.html)
* `ge()`: greater than or equal to (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ge.html)
* `ne()`: not equal to (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ne.html)
* `eq()`: equal to (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.eq.html)

The main difference between using comparison operators and using these comparison methods is that the methods have a `fill_value` parameter, which substitutes a value for missing data before the comparison is made.
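To make the `fill_value` behavior concrete, here is a small sketch on a toy Series with one missing entry. The ages and the fill value of 0 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing entry (hypothetical data)
ages = pd.Series([22.0, np.nan, 31.0])

# Plain operator: a comparison against NaN is False
with_operator = ages <= 25

# Method with fill_value: NaN is replaced by 0 before comparing
with_method = ages.le(25, fill_value=0)

print(with_operator.tolist())  # [True, False, False]
print(with_method.tolist())    # [True, True, False]
```

Whether a missing value should count as passing the test depends on the question being asked, so the fill value deserves a deliberate choice.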

## Binary Operators with Booleans

Before we get to combining conditions, let's look at how booleans behave in isolation.

**Binary operators** are similar to other operators, but they work on the binary representation of the values, that is, the individual bits. They allow us to perform operations such as comparisons or complements on a bit-by-bit basis. They are also known as bit-wise operators.
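A quick sketch of the bit-by-bit idea on plain Python integers, before we move on to booleans. The numbers 5 and 3 are arbitrary picks for illustration.

```python
# Bitwise operators act on each bit position of the integers
a = 0b101  # 5
b = 0b011  # 3

print(format(a & b, "03b"))  # 001 -> bits set in both
print(format(a | b, "03b"))  # 111 -> bits set in either
print(format(a ^ b, "03b"))  # 110 -> bits set in exactly one

# Python's bool is a subclass of int, so True and False act as 1 and 0
print(True & False)  # False
```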

The most useful binary operators are `OR` and `AND`.

Let's first examine `OR`. What would we expect from False OR True (with OR represented by the pipe `|`)?

We expect to get True, because an OR comparison will resolve to True as long as at least one of the conditions is True. Think of the `OR` operator as "searching for True"

In [20]:
True | False

True

Let's try some others

In [21]:
False | False

False

In [22]:
True | True

True

Let's now look at the `AND` operator, represented by the ampersand `&`. Think of the AND operator as always True unless there is a False.
* A single `False` is enough to trigger a "False".

In [23]:
False & True

False

In [24]:
True & True

True

In [25]:
False & False

False

In [26]:
True & False

False

In [27]:
True & True & False & True & True

False

How do Pandas Series combine using booleans? Let's start by making a single-element Series containing a single False value.

In [28]:
f = pd.Series(False)

Let's also make a single-value Series that contains True

In [29]:
t = pd.Series(True)

Let's now use the AND and OR operators on them. What we get is a combined series that compares the boolean values of each Series

In [30]:
t & f

0    False
dtype: bool

In [31]:
t | f

0    True
dtype: bool

That was simple, but where this becomes very powerful is with long series of booleans.

In [32]:
t = pd.Series([i % 2 == 0 for i in range(10)])

In [33]:
t

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool

In [34]:
f = pd.Series([False] * 10)

In [35]:
f

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

What happens when we combine these together using the AND and OR operators? 

The `&` operator, as expected, returns a series of all False values since the `f` series consists of all False values.

In [36]:
t & f

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

With the `|` OR operator, the result will be alternating `True` and `False`, as we would expect since the `t` Series contains alternating True and False values.

In [37]:
t | f

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool

The important thing to note is that when we combine two Pandas series, the comparison is done **label-to-label**. It is NOT done based on the order. Consider this example:

In [38]:
f = pd.Series(data = [False, True, True], index = ['c', 'b', 'a'])
t = pd.Series(data = [True, False, False], index = ['a','b','c'])

In [39]:
f

c    False
b     True
a     True
dtype: bool

In [40]:
t

a     True
b    False
c    False
dtype: bool

Now let's use the `&` operator. If the comparison were based on order of the values within the individual Series, we would expect `[False, False, False]`. 

In [41]:
f & t

a     True
b    False
c    False
dtype: bool

We did not get that! Instead, we got `True` for `a` and `False` for `b` and `c`. That's because the comparison was done by the index labels `a`, `b`, and `c` and NOT the positions of the values.
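If a position-by-position comparison is what we actually want, one option is to drop the labels first. The sketch below rebuilds the same two Series and converts them to NumPy arrays before combining them.

```python
import pandas as pd

f = pd.Series([False, True, True], index=["c", "b", "a"])
t = pd.Series([True, False, False], index=["a", "b", "c"])

# Label-aligned (the pandas default): 'a' pairs with 'a', and so on
by_label = f & t

# Position-based: drop the labels by converting to NumPy arrays first
by_position = f.to_numpy() & t.to_numpy()

print(by_label["a"])         # True
print(by_position.tolist())  # [False, False, False]
```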

This will help illustrate what happens behind the scenes when we combine booleans. Onward and upward!

## BONUS - XOR and Complement Binary Ops

The binary OR and AND operators are without a doubt the most frequently used ones. However, there are several others available. This lecture will cover two of them.

XOR stands for "exclusive or". As the name suggests, it is exclusive. That means it resolves to *True* when the inputs are different, and to *False* if they are alike.
* In Python the XOR operator is represented by `^`

In [42]:
True ^ False

True

In [43]:
False ^ False

False

In [44]:
True ^ True

False

In [45]:
True ^ (False | False & True) | False

True

XOR can be used to combine boolean series in DataFrame indexing, such as when we want one condition but not the other. We'll see examples of this later.
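As a small preview, here is a sketch of XOR applied to two boolean Series. The mask names and values are hypothetical.

```python
import pandas as pd

# Toy masks (hypothetical): is the player young? is the player expensive?
young = pd.Series([True, True, False, False])
expensive = pd.Series([True, False, True, False])

# XOR keeps positions where exactly one condition holds
one_not_both = young ^ expensive

# For boolean Series, "the two values differ" expresses the same idea
differs = young != expensive

print(one_not_both.tolist())  # [False, True, True, False]
```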


Another binary operator is the **complement operator**, represented by the tilde `~`.
* In Pandas and NumPy it is extremely useful for negating boolean Series.

In [46]:
~False

-1

In [47]:
~True

-2

What's going on here? It has to do with how computers represent integers. "Two's complement" is the most common scheme for representing signed integers in binary.
Remember that computers only understand 0's and 1's. The `~` operator inverts every bit of a number's binary representation. Under two's complement, flipping all the bits of an integer `x` yields `-x - 1`, which is why we see negative numbers here.
* https://en.wikipedia.org/wiki/Two%27s_complement
* https://www.cs.cornell.edu/~tomf/notes/cps104/twoscomp.html
* Simple tutorial on binary numbers: http://www.steves-internet-guide.com/binary-numbers-explained/

Going back to our previous example, we know that the integer value of False is 0 while the integer value of True is 1. Therefore:
* `~True = ~1 = inversion(00000001) = 11111110 = -2`
* `~False = ~0 = inversion(00000000) = 11111111 = -1`
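The same arithmetic can be checked in plain Python. This sketch uses the integers 0 and 1 in place of `False` and `True`, and masks the result to 8 bits purely for display.

```python
# For Python integers, ~x flips every bit, which works out to -x - 1
for x in (0, 1, 5):
    assert ~x == -x - 1

# Masking to 8 bits shows the flipped patterns from the bullets above
print(format(~1 & 0xFF, "08b"))  # 11111110
print(format(~0 & 0xFF, "08b"))  # 11111111

# For a plain Python bool, `not` gives logical negation instead
print(not True)  # False
```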

Going back to Pandas, let's use the complement operator to negate a boolean Series. It will turn our Trues to Falses and vice versa.

In [48]:
t = pd.Series([True, True, False])

In [49]:
t

0     True
1     True
2    False
dtype: bool

In [50]:
~t

0    False
1    False
2     True
dtype: bool

The result of using the `~` here is a negation of each of our boolean values. This feature will be very useful for defining negative conditions with DataFrames, for example, selecting all soccer players that are NOT defenders.
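A minimal sketch of that "not defenders" selection, using a toy stand-in for the players DataFrame (the rows are made up):

```python
import pandas as pd

# Toy stand-in for the players DataFrame (hypothetical rows)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "position": ["CB", "GK", "LB", "CF"],
})

is_defender = toy["position"].isin(["LB", "CB", "RB"])

# ~ flips the mask, so only the non-defenders remain
non_defenders = toy[~is_defender]

print(non_defenders["name"].tolist())  # ['B', 'D']
```

Note that `isin()` returns False for missing values, so a player with no recorded position would end up among the non-defenders after negation.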

In [51]:
players

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
462,Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0
463,Ashley Fletcher,West+Ham,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


## Combining Conditions - Indexing with Multiple Conditions

Quick refresher: let's select all of the LBs (left backs).

In [52]:
players[players.position == 'LB']

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
7,Nacho Monreal,Arsenal,31,LB,3,13.0,555,5.5,4.70%,115,2,Spain,0,4,1,1,0
18,Kieran Gibbs,Arsenal,27,LB,3,10.0,489,5.0,0.50%,45,1,England,0,3,1,1,0
29,Sead Kolasinac,Arsenal,24,LB,3,15.0,618,6.0,6.90%,0,2,Bosnia,1,2,1,1,0
34,Charlie Daniels,Bournemouth,30,LB,3,3.0,185,5.0,19.80%,134,1,England,0,4,2,0,0
54,Brad Smith,Bournemouth,23,LB,3,2.0,297,4.0,3.30%,4,4,Australia,0,2,2,0,0
62,Gaetan Bong,Brighton+and+Hove,29,LB,3,1.5,97,4.5,0.20%,0,4,Cameroon,0,4,3,0,0
65,Markus Suttner,Brighton+and+Hove,30,LB,3,2.0,23,4.5,0.20%,0,2,Austria,0,4,3,0,0
82,Stephen Ward,Burnley,31,LB,3,1.5,152,4.5,2.50%,91,2,Ireland,0,4,4,0,0
99,Marcos Alonso Mendoza,Chelsea,26,LB,3,25.0,3069,7.0,12.40%,177,2,Spain,0,3,5,1,1
112,Kenedy,Chelsea,21,LB,3,7.0,566,5.0,0.10%,3,3,Brazil,0,1,5,1,0


Now, what if we want to restrict by yet another condition? Simple! Just use the binary operators that we have learned.
For instance, say we want all of the left backs who are 25 years old or younger. We can place both conditions within the square selection brackets and combine them using the `&` operator.
* **IMPORTANT NOTE**: Always wrap each individual condition in parentheses. Failure to do so will cause Python to misinterpret your conditions, because `&` and `|` are evaluated before comparison operators such as `==` and `<=`.
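To see why the parentheses matter, here is a sketch on a toy DataFrame with made-up numbers. The unparenthesized version is wrapped in `try`/`except` so the failure can be observed safely.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({"age": [22, 30, 24], "market_value": [15, 20, 5]})

# With parentheses: each comparison is evaluated first, then combined
ok = (toy["age"] <= 25) & (toy["market_value"] >= 10)
print(ok.tolist())  # [True, False, False]

# Without parentheses, & binds tighter than <= and >=, so Python reads
# toy["age"] <= (25 & toy["market_value"]) >= 10, a chained comparison
try:
    toy["age"] <= 25 & toy["market_value"] >= 10
    failed = False
except ValueError:
    # Chained comparisons call bool() on a Series, which is ambiguous
    failed = True

print(failed)  # True
```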

In [53]:
players[(players.position == 'LB') & (players.age <= 25)]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
29,Sead Kolasinac,Arsenal,24,LB,3,15.0,618,6.0,6.90%,0,2,Bosnia,1,2,1,1,0
54,Brad Smith,Bournemouth,23,LB,3,2.0,297,4.0,3.30%,4,4,Australia,0,2,2,0,0
112,Kenedy,Chelsea,21,LB,3,7.0,566,5.0,0.10%,3,3,Brazil,0,1,5,1,0
128,Jeffrey Schlupp,Crystal+Palace,24,LB,3,8.0,385,5.0,0.30%,47,4,Ghana,0,2,6,0,0
212,Ben Chilwell,Leicester+City,20,LB,3,2.5,288,4.5,0.80%,19,1,England,0,1,9,0,0
236,Alberto Moreno,Liverpool,25,LB,3,10.0,397,4.5,0.30%,8,2,Spain,0,3,10,1,0
281,Luke Shaw,Manchester+United,22,LB,3,20.0,947,5.0,0.40%,45,1,England,0,2,12,1,0
294,Paul Dummett,Newcastle+United,25,LB,3,3.5,177,4.5,1.00%,0,2,Wales,0,3,13,0,0
298,Massadio Haidara,Newcastle+United,24,LB,3,1.5,114,4.0,0.50%,0,2,France,0,2,13,0,0
328,Matt Targett,Southampton,21,LB,3,3.0,110,4.5,0.20%,12,1,England,0,1,14,0,0


You could even add more conditions! Let's continue filtering for LBs age 25 or younger that have a market value of at least $10 million.

In [54]:
players[(players.position == 'LB') & (players.age <= 25) & (players.market_value >= 10)]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
29,Sead Kolasinac,Arsenal,24,LB,3,15.0,618,6.0,6.90%,0,2,Bosnia,1,2,1,1,0
236,Alberto Moreno,Liverpool,25,LB,3,10.0,397,4.5,0.30%,8,2,Spain,0,3,10,1,0
281,Luke Shaw,Manchester+United,22,LB,3,20.0,947,5.0,0.40%,45,1,England,0,2,12,1,0
389,Ben Davies,Tottenham,24,LB,3,12.0,396,5.5,1.80%,90,2,Wales,0,2,17,1,0


Very cool. We can also combine different binary operators. For example, let's narrow this subset to left backs who are not from Arsenal or Tottenham. One way to do this is to use the `isin()` method, passing in "Arsenal" and "Tottenham", and then negate it using the tilde `~`.

In [55]:
players[(players.position == 'LB') & (players.age <= 25) & (players.market_value >= 10) & (~players.club.isin(['Tottenham', 'Arsenal']))]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
236,Alberto Moreno,Liverpool,25,LB,3,10.0,397,4.5,0.30%,8,2,Spain,0,3,10,1,0
281,Luke Shaw,Manchester+United,22,LB,3,20.0,947,5.0,0.40%,45,1,England,0,2,12,1,0


## Conditions As Variables

The long indexing that combines multiple conditions can get very ugly to look at very quickly. One way to make the code more readable (aside from breaking into separate lines) is to use *conditions as variables*. 

This essentially means refactoring your conditions into standalone variables.

Suppose we are interested in right backs (RBs) from Arsenal and goalkeepers from Chelsea. Start by creating a variable that holds a boolean mask for Arsenal players.

In [56]:
arsenal_player = players.club == 'Arsenal'

In [57]:
arsenal_player

0       True
1       True
2       True
3       True
4       True
       ...  
460    False
461    False
462    False
463    False
464    False
Name: club, Length: 465, dtype: bool

Next, let's create a mask for right backs, regardless of club.

In [58]:
right_back = players.position == 'RB'

In [59]:
right_back

0      False
1      False
2      False
3      False
4      False
       ...  
460    False
461    False
462     True
463    False
464    False
Name: position, Length: 465, dtype: bool

For the Chelsea part, we'll combine both conditions in one go.

In [60]:
chelsea_goalkeepers = (players.club == 'Chelsea') & (players.position == 'GK')

In [61]:
chelsea_goalkeepers

0      False
1      False
2      False
3      False
4      False
       ...  
460    False
461    False
462    False
463    False
464    False
Length: 465, dtype: bool

Finally, let's index the whole DataFrame based on the conditions we have defined in these variables. To switch things up a bit we'll use `.loc[]`

In [62]:
players.loc[(arsenal_player & right_back) | chelsea_goalkeepers]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0
27,Carl Jenkinson,Arsenal,25,RB,3,5.0,561,4.5,0.40%,2,1,England,0,3,1,1,0
102,Thibaut Courtois,Chelsea,25,GK,4,40.0,1260,5.5,18.50%,141,2,Belgium,0,3,5,1,0
109,Willy Caballero,Chelsea,35,GK,4,1.5,542,5.0,0.20%,64,3,Argentina,0,6,5,1,0


## Skill Challenge
Identify the subset of players that meets all of the following criteria:
1. English `nationality`
2. Market value is more than twice the average market value in the league (`market_value`)
3. More than 4000 page views (`page_views`) OR are a new signing (`new_signing`), but not both.

Let's handle this by creating variables for conditions. Let's start with English-ness

In [63]:
english = players.nationality == "England"

Next let's handle market value. First we'll calculate the average market value for all players, and then create a condition where only players that are more than twice the average market value are selected.

In [64]:
avg_market = players['market_value'].mean()

In [65]:
avg_market

11.125649350649349

In [66]:
twice_avg = players.market_value > avg_market * 2

In [67]:
twice_avg

0       True
1       True
2      False
3      False
4      False
       ...  
460    False
461    False
462    False
463    False
464    False
Name: market_value, Length: 465, dtype: bool

Finally, let's handle the page views and signing condition. Since we want players that either have >4000 page views OR are a new_signing, but not both, our selections must reflect that.
That is, our 4000 views selector must exclude new_signing, and our signing selector must exclude players greater than 4000 views

In [68]:
fourK_views = players['page_views'] > 4000

In [69]:
new_signing = players['new_signing'] == 1

We now combine all of the expressions together. The third condition (views and signing) is a great opportunity to use the XOR operator `^` when combining them. Remember that this will only resolve to True if exactly one of the conditions is True, but not both. That's exactly what we want.

In [70]:
players[english & twice_avg & (fourK_views ^ new_signing)]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
256,John Stones,Manchester+City,23,CB,3,35.0,1078,5.5,2.30%,59,1,England,0,2,11,1,1
380,Dele Alli,Tottenham,21,CM,2,45.0,4626,9.5,38.60%,225,1,England,0,1,17,1,0
381,Harry Kane,Tottenham,23,CF,1,60.0,4161,12.5,35.10%,224,1,England,0,2,17,1,0


## Two-Dimensional Indexing: Selecting Columns


So far, all of our data extractions have operated only on the index axis - that is, selecting *rows* that we want.
However, remember that DataFrames are two-dimensional data structures. Thus, we can index by columns as well.

Say we want to select Chelsea players that are age 23 and under. We already know how to do this.

In [71]:
chelsea_23under = (players.club == "Chelsea") & (players.age.le(23))

In [72]:
players.loc[chelsea_23under]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
110,Michy Batshuayi,Chelsea,23,CF,1,25.0,1162,8.5,1.60%,48,2,Belgium,0,2,5,1,1
111,Kurt Zouma,Chelsea,22,CB,3,15.0,723,5.5,0.80%,15,2,France,0,2,5,1,0
112,Kenedy,Chelsea,21,LB,3,7.0,566,5.0,0.10%,3,3,Brazil,0,1,5,1,0
115,Tiemoue Bakayoko,Chelsea,22,DM,2,16.0,1011,5.0,1.60%,0,2,France,1,2,5,1,0


But notice that these types of selections return ALL columns. What if we were only interested in one or a few columns, but not all of them?

Using the `loc[]` indexer, we can pass in a second argument that represents our columns that we want to select. 
* https://datagy.io/pandas-select-columns/#loc-select-columns

This is perhaps the most intuitive way to do it, but it is not the most flexible. For example, it won't allow you to select all columns that start with a particular letter.


In [73]:
players.loc[chelsea_23under, ['position', 'market_value']]

Unnamed: 0,position,market_value
110,CF,25.0
111,CB,15.0
112,LB,7.0
115,DM,16.0


Let's actually try that. Let's select all columns that begin with a 'p'. To do this, we can use the `startswith` string method on the columns attribute for the DataFrame, which generates a boolean mask for the columns! 

**Important note**: just like selecting rows, selecting columns with a boolean mask requires the boolean series to be of the same length as the column axis. 
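As a quick sketch of that length requirement, the toy frame below (hypothetical columns) builds a column mask and checks that it has one entry per column.

```python
import pandas as pd

# Toy frame (hypothetical columns)
toy = pd.DataFrame({"name": ["A"], "position": ["CB"], "page_views": [100]})

# One boolean per column, in column order
starts_with_p = toy.columns.str.startswith("p")
print(len(starts_with_p) == toy.shape[1])  # True

# A bare slice keeps every row; the mask picks the columns
print(toy.loc[:, starts_with_p].columns.tolist())  # ['position', 'page_views']
```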

In [74]:
startswith_p = players.columns.str.startswith('p')

Now we can use that boolean mask as the second argument for `loc[]`. Voilà, we get just the columns that start with `p`.

In [75]:
players.loc[chelsea_23under, startswith_p]

Unnamed: 0,position,position_cat,page_views
110,CF,1,1162
111,CB,3,723
112,LB,3,566
115,DM,2,1011


It is also common to chain two square brackets together. The instructor suggests avoiding doing this.

In [76]:
players[chelsea_23under]['position']

110    CF
111    CB
112    LB
115    DM
Name: position, dtype: object

What you are essentially doing above is subsetting your DataFrame by the first condition, and then selecting a single column from that.

This is NOT the same as the following:

In [77]:
players.loc[chelsea_23under, 'position']

110    CF
111    CB
112    LB
115    DM
Name: position, dtype: object

Although the output is the same, behind the scenes the chained square brackets are a slower process. This is because the `__getitem__` method gets called twice under the hood with bracket chaining: once to filter the rows and once to select the column.
Therefore, the instructor recommends avoiding bracket chaining and instead using `loc[]` with a second argument.
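The sketch below contrasts the two styles on a toy DataFrame with made-up rows. It also shows a second reason commonly given for preferring `loc[]`: the pandas documentation warns that assigning through chained brackets may act on a temporary copy, while a single `loc[]` call writes to the original.

```python
import pandas as pd

toy = pd.DataFrame({
    "club": ["Chelsea", "Arsenal", "Chelsea"],
    "age": [22, 28, 30],
    "position": ["CB", "LW", "GK"],
})
young_chelsea = (toy["club"] == "Chelsea") & (toy["age"] <= 23)

# Two steps: filter rows into an intermediate DataFrame, then pick the column
chained = toy[young_chelsea]["position"]

# One step: rows and column resolved in a single .loc call
direct = toy.loc[young_chelsea, "position"]

print(chained.equals(direct))  # True

# Assignment through a single .loc call writes to the original DataFrame
toy.loc[young_chelsea, "position"] = "DEF"
print(toy["position"].tolist())  # ['DEF', 'LW', 'GK']
```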

## Fancy Indexing with `lookup()`

"Fancy" indexing simply refers to passing multiple labels at once. It's very similar to basic indexing, but instead of using single labels, we specify a list or tuple of labels. Easy!

The `lookup()` method is another way to achieve fancy indexing. It allows us to pass a list of row labels and a list of column labels and find the value at each (row, column) pair. This will return a NumPy array, not a single value.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.lookup.html
* Note that this is deprecated since Pandas version 1.2.0 and was removed in version 2.0.

In [78]:
players.lookup([450], ['age'])

array([30])

Let's compare fancy indexing with `loc[]`, where we want two rows and two columns. This will return a DataFrame slice.

In [79]:
players.loc[[0, 132], ('name', 'market_value')]

Unnamed: 0,name,market_value
0,Alexis Sanchez,65.0
132,Connor Wickham,6.0


Comparing to the `lookup()` method:

In [80]:
players.lookup([0, 132], ['name', 'market_value'])

array(['Alexis Sanchez', 6.0], dtype=object)

Again, notice that what gets returned is a NumPy array holding one value per (row, column) pair: the name of the player at row 0, and the market value of the player at row 132.

What happens if we swap the column labels around?


In [81]:
players.lookup([0, 132], [ 'market_value', 'name'])

array([65.0, 'Connor Wickham'], dtype=object)

This time, our NumPy array begins with the market value of the player at row 0 and ends with the name of the player at row 132.

In reality, the `lookup()` method is most useful when we already have a collection of labels that we want to use to make our selections. Say we have three players and want to source specific attributes.


In [82]:
names = ['Petr Cech', 'Mesut Ozil', 'Alexis Sanchez']

In [83]:
attributes = ['age', 'market_value', 'page_views']

To look this up, you cannot simply pass in the names as the index labels. Why not? Because the current index labels are numbers, not player names. The player names are stored in one of the DataFrame's columns.

In order to get this to work, we have to **set the index** to player names. It is recommended NOT to do this `inplace` so that you don't accidentally manipulate the DataFrame in an undesirable way.

In [84]:
players.set_index('name')

Unnamed: 0_level_0,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0
Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0
Ashley Fletcher,West+Ham,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


Now let's chain on the `lookup()` that we want to do. Note that `lookup()` now requires a unique index and unique columns, whereas our DataFrame contains duplicate names by design. The lecture needs to be re-recorded.

In any case, we will use the `duplicated()` method and logical negation to remove duplicated players from the DataFrame, keeping only the first occurrence of each name.

In [85]:
players_by_name = players.set_index('name') 
dupes = players_by_name.index.duplicated() 
players_by_name[~dupes].lookup(names, attributes)

array([  35.,   50., 4329.])

Can we do something similar using `.loc[]`? Let's find out.

In [86]:
players.set_index('name').loc[names, attributes]

Unnamed: 0_level_0,age,market_value,page_views
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Petr Cech,35,7.0,1529
Mesut Ozil,28,50.0,4395
Alexis Sanchez,28,65.0,4329


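Yes we can! And honestly it's easier, even if it may be more computationally intensive. Note the difference in what comes back: `loc[]` returns the full grid of three players by three attributes, whereas `lookup()` returns only one value per (name, attribute) pair.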
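
Because `lookup()` is deprecated, it can help to see how the same pairwise selection might be written with tools that remain available. The sketch below is one possible approach, not the method from the lecture: `lookup_pairs` is a hypothetical helper name, and `toy` is a three-row stand-in built from the values shown earlier. It assumes the index and columns are unique.

```python
import pandas as pd

def lookup_pairs(df, row_labels, col_labels):
    """Hypothetical helper: values at (row, column) label pairs."""
    # Translate labels into integer positions (-1 means "not found").
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    if (rows < 0).any() or (cols < 0).any():
        raise KeyError('one or more labels were not found')
    # NumPy pairs the two position arrays element by element.
    return df.to_numpy()[rows, cols]

# Small stand-in for players.set_index('name').
toy = pd.DataFrame(
    {'age': [28, 28, 35],
     'market_value': [65.0, 50.0, 7.0],
     'page_views': [4329, 4395, 1529]},
    index=['Alexis Sanchez', 'Mesut Ozil', 'Petr Cech'],
)

names = ['Petr Cech', 'Mesut Ozil', 'Alexis Sanchez']
attributes = ['age', 'market_value', 'page_views']

pairs = lookup_pairs(toy, names, attributes)   # one value per pair
grid = toy.loc[names, attributes]              # every combination

print(pairs)
print(grid.shape)
```

The pairwise result has three values, while the `loc[]` result is a 3 by 3 DataFrame, which is the distinction between the two approaches.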

## Sorting by Index or Column - the `sort_values()` and `sort_index()` Methods

Recall that we previously learned how to sort by column values in ascending or descending order.

In [87]:
players.sort_values(by = 'market_value', ascending = False)

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
96,Eden Hazard,Chelsea,26,LW,1,75.00,4220,10.5,2.30%,224,2,Belgium,0,3,5,1,0
267,Paul Pogba,Manchester+United,24,CM,2,75.00,7435,8.0,19.50%,115,2,France,0,2,12,1,1
0,Alexis Sanchez,Arsenal,28,LW,1,65.00,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
244,Kevin De Bruyne,Manchester+City,26,AM,1,65.00,2252,10.0,17.50%,199,2,Belgium,0,3,11,1,0
245,Sergio Aguero,Manchester+City,29,CF,1,65.00,4046,11.5,9.70%,175,3,Argentina,0,4,11,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
287,Joel Castro Pereira,Manchester+United,21,GK,4,0.10,395,4.0,1.00%,6,2,Portugal,0,1,12,1,0
113,Eduardo Carvalho,Chelsea,34,LW,1,0.05,467,5.0,0.10%,0,2,Portugal,0,6,5,1,1
30,Granit Xhaka,Arsenal,24,,2,,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
192,Steve Mounie,Huddersfield,22,CF,1,,56,6.0,0.60%,0,2,Benin,0,2,8,0,0


Sometimes we also want to sort by index values, and this usually happens for indexes that we create (as opposed to Pandas creating them).

For instance, the index for the current `players` DataFrame is an integer index that's not particularly interesting or useful. Nothing about that numeric index label tells you anything about the data in those rows. But let's say we switch the index to player names. This can be achieved with the `set_index()` method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html

In [88]:
players.set_index('name', inplace = True)

In [89]:
players

Unnamed: 0_level_0,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0
Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0
Ashley Fletcher,West+Ham,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


Now we have something interesting to work with! The index now contains strings and is of dtype `object`.

In [90]:
players.index

Index(['Alexis Sanchez', 'Mesut Ozil', 'Petr Cech', 'Theo Walcott',
       'Laurent Koscielny', 'Hector Bellerin', 'Olivier Giroud',
       'Nacho Monreal', 'Shkodran Mustafi', 'Alex Iwobi',
       ...
       'Aaron Cresswell', 'Pedro Obiang', 'Sofiane Feghouli', 'Angelo Ogbonna',
       'Pablo Zabaleta', 'Edimilson Fernandes', 'Arthur Masuaku', 'Sam Byram',
       'Ashley Fletcher', 'Diafra Sakho'],
      dtype='object', name='name', length=465)

But now our index is not sorted in any way. It is good practice to keep the index ordered. We can do this with the `sort_index()` method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html
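
One practical payoff of a sorted index is label slicing. The sketch below uses a small invented frame (`toy`, `ordered`, and `aarons` are illustrative names) to show how to check whether an index is ordered and how a sorted string index can be sliced with bounds that are not themselves labels.

```python
import pandas as pd

toy = pd.DataFrame(
    {'age': [28, 35, 26, 30]},
    index=pd.Index(
        ['Mesut Ozil', 'Petr Cech', 'Aaron Ramsey', 'Aaron Lennon'],
        name='name',
    ),
)

# The index as loaded is not in alphabetical order.
print(toy.index.is_monotonic_increasing)

# sort_index() returns a new DataFrame with the rows ordered by label.
ordered = toy.sort_index()
print(ordered.index.is_monotonic_increasing)

# With a sorted index, a label slice selects everything between the bounds,
# even though neither 'Aaron' nor 'Ab' is an actual label.
aarons = ordered.loc['Aaron':'Ab']
print(aarons)
```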

In [91]:
players.sort_index().head(10)

Unnamed: 0_level_0,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Aaron Cresswell,West+Ham,27,LB,3,12.0,380,5.0,1.30%,60,1,England,0,3,20,0,0
Aaron Lennon,Everton,30,RW,1,5.0,504,5.5,0.20%,22,1,England,0,4,7,0,0
Aaron Mooy,Huddersfield,26,CM,2,5.0,588,5.5,2.50%,0,4,Australia,0,3,8,0,0
Aaron Ramsey,Arsenal,26,CM,2,35.0,1040,7.0,5.10%,56,1,Wales,0,3,1,1,0
Abdoulaye Doucoure,Watford,24,CM,2,6.0,124,5.0,0.00%,38,2,France,0,2,18,0,0
Adam Federici,Bournemouth,32,GK,4,1.0,126,4.0,1.50%,8,4,Australia,0,5,2,0,0
Adam Lallana,Liverpool,29,AM,1,25.0,1808,7.5,6.40%,139,1,England,0,4,10,1,0
Adam Smith,Bournemouth,26,RB,3,5.0,200,5.0,0.90%,104,1,England,0,3,2,0,0
Ademola Lookman,Everton,19,LW,1,5.0,1387,5.5,0.30%,16,1,England,0,1,7,0,0
Adrian,West+Ham,30,GK,4,8.0,266,4.5,0.80%,64,2,Spain,0,4,20,0,0


Let's set `inplace` to True so that the sorting sticks.

In [92]:
players.sort_index(inplace =  True)

In [93]:
players.head()

Unnamed: 0_level_0,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Aaron Cresswell,West+Ham,27,LB,3,12.0,380,5.0,1.30%,60,1,England,0,3,20,0,0
Aaron Lennon,Everton,30,RW,1,5.0,504,5.5,0.20%,22,1,England,0,4,7,0,0
Aaron Mooy,Huddersfield,26,CM,2,5.0,588,5.5,2.50%,0,4,Australia,0,3,8,0,0
Aaron Ramsey,Arsenal,26,CM,2,35.0,1040,7.0,5.10%,56,1,Wales,0,3,1,1,0
Abdoulaye Doucoure,Watford,24,CM,2,6.0,124,5.0,0.00%,38,2,France,0,2,18,0,0


This method also helps us sort our columns. Suppose we want to sort the column axis (that is, the column labels). All we need to change is the `axis` parameter of the `sort_index()` method.

This can be helpful if you want to look through the column labels by name.

In [94]:
players.sort_index(axis=1)

Unnamed: 0_level_0,age,age_cat,big_club,club,club_id,fpl_points,fpl_sel,fpl_value,market_value,nationality,new_foreign,new_signing,page_views,position,position_cat,region
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Aaron Cresswell,27,3,0,West+Ham,20,60,1.30%,5.0,12.0,England,0,0,380,LB,3,1
Aaron Lennon,30,4,0,Everton,7,22,0.20%,5.5,5.0,England,0,0,504,RW,1,1
Aaron Mooy,26,3,0,Huddersfield,8,0,2.50%,5.5,5.0,Australia,0,0,588,CM,2,4
Aaron Ramsey,26,3,1,Arsenal,1,56,5.10%,7.0,35.0,Wales,0,0,1040,CM,2,1
Abdoulaye Doucoure,24,2,0,Watford,18,38,0.00%,5.0,6.0,France,0,0,124,CM,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Yohan Cabaye,31,4,0,Crystal+Palace,6,91,1.40%,5.5,15.0,France,0,0,456,CM,2,2
YounÃ¨s Kaboul,31,4,0,Watford,18,57,0.10%,4.5,2.5,France,0,1,263,CB,3,2
Ã‰tienne Capoue,29,4,0,Watford,18,131,8.00%,5.5,9.0,France,0,0,412,DM,2,2
Ã€ngel Rangel,34,6,0,Swansea,16,26,18.80%,4.0,1.0,Spain,0,0,137,RB,3,2


Let's reset the index so that we are on the same page as the instructor for the next lecture. This can be done using the `reset_index()` method. It resets the index to the default one. By default, the old index gets inserted into the DataFrame as a column. 
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
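
As a side note on the default behavior just described, the following sketch contrasts it with the `drop` parameter of `reset_index()`. The two-row `toy` frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {'club': ['Arsenal', 'Everton']},
    index=pd.Index(['Aaron Ramsey', 'Aaron Lennon'], name='name'),
)

# Default: the old index is kept and becomes a regular column.
kept = toy.reset_index()

# drop=True: the old index is discarded instead of being inserted.
dropped = toy.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
```

In this notebook we want the default, because the player names need to come back as a column.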

In [95]:
players.reset_index(inplace = True)

In [96]:
players.head(10)

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Aaron Cresswell,West+Ham,27,LB,3,12.0,380,5.0,1.30%,60,1,England,0,3,20,0,0
1,Aaron Lennon,Everton,30,RW,1,5.0,504,5.5,0.20%,22,1,England,0,4,7,0,0
2,Aaron Mooy,Huddersfield,26,CM,2,5.0,588,5.5,2.50%,0,4,Australia,0,3,8,0,0
3,Aaron Ramsey,Arsenal,26,CM,2,35.0,1040,7.0,5.10%,56,1,Wales,0,3,1,1,0
4,Abdoulaye Doucoure,Watford,24,CM,2,6.0,124,5.0,0.00%,38,2,France,0,2,18,0,0
5,Adam Federici,Bournemouth,32,GK,4,1.0,126,4.0,1.50%,8,4,Australia,0,5,2,0,0
6,Adam Lallana,Liverpool,29,AM,1,25.0,1808,7.5,6.40%,139,1,England,0,4,10,1,0
7,Adam Smith,Bournemouth,26,RB,3,5.0,200,5.0,0.90%,104,1,England,0,3,2,0,0
8,Ademola Lookman,Everton,19,LW,1,5.0,1387,5.5,0.30%,16,1,England,0,1,7,0,0
9,Adrian,West+Ham,30,GK,4,8.0,266,4.5,0.80%,64,2,Spain,0,4,20,0,0


## Sorting vs Reordering - the `reindex()` Method

You can customize a lot of features with the sort methods that we use, like `sort_index()` and `sort_values()`. But at the end of the day, we are only sorting in ascending or descending order.

What if we wanted to more precisely reorder the rows according to some very specific order? To do this, we can use the `reindex()` method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

Let's start by isolating a small chunk of our DataFrame using `iloc[]`.

In [97]:
players_lite = players.iloc[:4, :4]

In [98]:
players_lite

Unnamed: 0,name,club,age,position
0,Aaron Cresswell,West+Ham,27,LB
1,Aaron Lennon,Everton,30,RW
2,Aaron Mooy,Huddersfield,26,CM
3,Aaron Ramsey,Arsenal,26,CM


Now, say we want to reorder the rows based on a very specific requirement. Suppose we're working for a soccer scout who wants these four players, with the relevant data displayed in a very specific format:

* Row order needs to be: 2, 1, 3, 0
* Column order needs to be: age, name, position, club

The sort methods we already know cannot produce an arbitrary order like this, but it is VERY easy to do using `reindex()`. All we need to do is pass the `index` parameter the row labels in the exact order that we want them, and do the same for the column labels with the `columns` parameter.

In [99]:
players_lite.reindex(index = [2, 1, 3, 0], columns = ['age', 'name', 'position', 'club'])

Unnamed: 0,age,name,position,club
2,26,Aaron Mooy,CM,Huddersfield
1,30,Aaron Lennon,RW,Everton
3,26,Aaron Ramsey,CM,Arsenal
0,27,Aaron Cresswell,LB,West+Ham


Here we have subset our DataFrame and applied the `reindex()` method to it. We can achieve this same result by working directly on our large DataFrame. This works because we have explicitly called out the index labels and columns that we want to display. This means we can **use the `reindex()` method to carve out slices of our DataFrame, and they don't even need to be consecutive slices.**

HOWEVER, it is *NOT* the most effective way to do selections. The instructor recommends using `loc[]` and `iloc[]` as much as possible.

In [100]:
players.reindex(index = [2, 1, 3, 0], columns = ['age', 'name', 'position', 'club'])

Unnamed: 0,age,name,position,club
2,26,Aaron Mooy,CM,Huddersfield
1,30,Aaron Lennon,RW,Everton
3,26,Aaron Ramsey,CM,Arsenal
0,27,Aaron Cresswell,LB,West+Ham
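
One behavioral difference worth knowing when choosing between `reindex()` and `loc[]` is how each treats labels that do not exist. The sketch below uses an invented two-row frame, and the label `99` is deliberately missing.

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Aaron Cresswell', 'Aaron Lennon'],
                    'age': [27, 30]})

# reindex(): a missing label silently produces a row filled with NaN.
padded = toy.reindex(index=[1, 99])
print(padded)

# loc[]: asking for the same missing label raises a KeyError instead.
try:
    toy.loc[[1, 99]]
    raised = False
except KeyError:
    raised = True
print(raised)
```

The error from `loc[]` makes typos in labels visible immediately, which is one reason to prefer it for plain selections.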


Let's say our scout has changed her mind and she wants ALL of the data columns for these four players, in standard alphabetical order. 

There are several options. Perhaps the easiest is to chain on the `sort_index()` method and set `axis = 1`

In [101]:
players.reindex(index = [2, 1, 3, 0]).sort_index(axis = 1)

Unnamed: 0,age,age_cat,big_club,club,club_id,fpl_points,fpl_sel,fpl_value,market_value,name,nationality,new_foreign,new_signing,page_views,position,position_cat,region
2,26,3,0,Huddersfield,8,0,2.50%,5.5,5.0,Aaron Mooy,Australia,0,0,588,CM,2,4
1,30,4,0,Everton,7,22,0.20%,5.5,5.0,Aaron Lennon,England,0,0,504,RW,1,1
3,26,3,1,Arsenal,1,56,5.10%,7.0,35.0,Aaron Ramsey,Wales,0,0,1040,CM,2,1
0,27,3,0,West+Ham,20,60,1.30%,5.0,12.0,Aaron Cresswell,England,0,0,380,LB,3,1


But we can also avoid method chaining and rely entirely on the `reindex()` method. All we have to do is pass to the `columns` parameter a list of the column labels that has already been sorted in alphabetical order.

We can get exactly that from the `columns` attribute of the DataFrame, which returns an `Index` object that we can sort using its `sort_values()` method.

In [102]:
players.reindex(index = [2, 1, 3, 0], columns = players.columns.sort_values())

Unnamed: 0,age,age_cat,big_club,club,club_id,fpl_points,fpl_sel,fpl_value,market_value,name,nationality,new_foreign,new_signing,page_views,position,position_cat,region
2,26,3,0,Huddersfield,8,0,2.50%,5.5,5.0,Aaron Mooy,Australia,0,0,588,CM,2,4
1,30,4,0,Everton,7,22,0.20%,5.5,5.0,Aaron Lennon,England,0,0,504,RW,1,1
3,26,3,1,Arsenal,1,56,5.10%,7.0,35.0,Aaron Ramsey,Wales,0,0,1040,CM,2,1
0,27,3,0,West+Ham,20,60,1.30%,5.0,12.0,Aaron Cresswell,England,0,0,380,LB,3,1


Another way we can do this is to use Python's built-in `sorted()` function. All we need to do is pass our columns to it, and we'll get a sorted list of the labels back! It accomplishes the exact same objective.

In [103]:
players.reindex(index = [2, 1, 3, 0], columns = sorted(players.columns))

Unnamed: 0,age,age_cat,big_club,club,club_id,fpl_points,fpl_sel,fpl_value,market_value,name,nationality,new_foreign,new_signing,page_views,position,position_cat,region
2,26,3,0,Huddersfield,8,0,2.50%,5.5,5.0,Aaron Mooy,Australia,0,0,588,CM,2,4
1,30,4,0,Everton,7,22,0.20%,5.5,5.0,Aaron Lennon,England,0,0,504,RW,1,1
3,26,3,1,Arsenal,1,56,5.10%,7.0,35.0,Aaron Ramsey,Wales,0,0,1040,CM,2,1
0,27,3,0,West+Ham,20,60,1.30%,5.0,12.0,Aaron Cresswell,England,0,0,380,LB,3,1
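
As a quick sanity check, the three approaches to alphabetical column order can be compared side by side. This is a minimal sketch on an invented one-row frame. The names `a`, `b`, and `c` are just labels for the three results.

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Aaron Mooy'],
                    'club': ['Huddersfield'],
                    'age': [26]})

a = toy.sort_index(axis=1)                           # sort the column axis
b = toy.reindex(columns=toy.columns.sort_values())   # sorted Index object
c = toy.reindex(columns=sorted(toy.columns))         # sorted plain list

print(list(a.columns))
print(list(b.columns))
print(list(c.columns))
```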


## What NOT to Do When Sorting: Transposing the DataFrame Twice

When sorting columns alphabetically, some people change the shape of the DataFrame by transposing it, sorting the rows, and transposing back. One way to transpose is with the `swapaxes()` method (deprecated in recent pandas releases in favor of `transpose()`).

In [104]:
df = players.iloc[:6, :6]

In [105]:
df

Unnamed: 0,name,club,age,position,position_cat,market_value
0,Aaron Cresswell,West+Ham,27,LB,3,12.0
1,Aaron Lennon,Everton,30,RW,1,5.0
2,Aaron Mooy,Huddersfield,26,CM,2,5.0
3,Aaron Ramsey,Arsenal,26,CM,2,35.0
4,Abdoulaye Doucoure,Watford,24,CM,2,6.0
5,Adam Federici,Bournemouth,32,GK,4,1.0


In [106]:
df.swapaxes(1, 0)

Unnamed: 0,0,1,2,3,4,5
name,Aaron Cresswell,Aaron Lennon,Aaron Mooy,Aaron Ramsey,Abdoulaye Doucoure,Adam Federici
club,West+Ham,Everton,Huddersfield,Arsenal,Watford,Bournemouth
age,27,30,26,26,24,32
position,LB,RW,CM,CM,CM,GK
position_cat,3,1,2,2,2,4
market_value,12,5,5,35,6,1


Another way to transpose is to use the `T` attribute.

In [107]:
df.T

Unnamed: 0,0,1,2,3,4,5
name,Aaron Cresswell,Aaron Lennon,Aaron Mooy,Aaron Ramsey,Abdoulaye Doucoure,Adam Federici
club,West+Ham,Everton,Huddersfield,Arsenal,Watford,Bournemouth
age,27,30,26,26,24,32
position,LB,RW,CM,CM,CM,GK
position_cat,3,1,2,2,2,4
market_value,12,5,5,35,6,1


This gives us our column names as our indexes, and we can now sort them in alphabetical order.

In [108]:
df.T.sort_index()

Unnamed: 0,0,1,2,3,4,5
age,27,30,26,26,24,32
club,West+Ham,Everton,Huddersfield,Arsenal,Watford,Bournemouth
market_value,12,5,5,35,6,1
name,Aaron Cresswell,Aaron Lennon,Aaron Mooy,Aaron Ramsey,Abdoulaye Doucoure,Adam Federici
position,LB,RW,CM,CM,CM,GK
position_cat,3,1,2,2,2,4


Finally, we re-transpose back into the original form of the DataFrame, in which the column names will now be sorted.

In [109]:
df.T.sort_index().T

Unnamed: 0,age,club,market_value,name,position,position_cat
0,27,West+Ham,12,Aaron Cresswell,LB,3
1,30,Everton,5,Aaron Lennon,RW,1
2,26,Huddersfield,5,Aaron Mooy,CM,2
3,26,Arsenal,35,Aaron Ramsey,CM,2
4,24,Watford,6,Abdoulaye Doucoure,CM,2
5,32,Bournemouth,1,Adam Federici,GK,4


**DO NOT do this!** Why not?
* The code is ugly to read and can be confusing
* It is indirect and uses a lot of temporary outputs
* There's no reason to do this when you can accomplish the same thing using just the `sort_index()` method.
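
There is also a practical side effect that the outputs above hint at: after transposing, `market_value` prints as `12` rather than `12.0`. Transposing a frame with mixed column types forces everything into a common `object` dtype, and the original dtypes are not restored on the way back. The sketch below illustrates this on an invented two-row frame.

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Aaron Mooy', 'Aaron Lennon'],
                    'age': [26, 30],
                    'market_value': [5.0, 5.0]})

# Transpose, sort, transpose back: columns are sorted but dtypes are lost.
roundtrip = toy.T.sort_index().T

# Sorting the column axis directly keeps each column's dtype.
direct = toy.sort_index(axis=1)

print(roundtrip.dtypes)
print(direct.dtypes)
```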

## Skill Challenge
1. Sort the `players` DataFrame by age in ascending order. Who is the youngest footballer in the EPL?
2. Set the *club* column as the index of the DataFrame. Then sort the DataFrame index in alphabetical order. Make sure these changes are applied to the underlying DataFrame and carry over to the next question.
3. Sort the DataFrame values by `club` and `market_value`, where the club is alphabetical (Arsenal first) and the market value is in descending order within each team, with the most valuable players first.

For Part 1, we can simply use the `sort_values()` method and pass "age" to the `by` parameter. Optionally you can set `ascending` to `True`, but that is the behavior by default.

In [110]:
players.sort_values(by = 'age')

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
53,Ben Woodburn,Liverpool,17,LW,1,1.50,1241,4.5,0.10%,5,1,Wales,0,1,10,1,0
217,Jonathan Leko,West+Brom,18,RW,1,1.50,169,4.5,0.20%,12,1,England,0,1,19,0,0
434,Trent Alexander-Arnold,Liverpool,18,RB,3,1.50,327,4.5,0.30%,15,2,England,0,1,10,1,0
229,Josh Tymon,Stoke+City,18,LB,3,1.00,120,4.5,0.10%,9,1,England,0,1,15,0,0
45,Axel Tuanzebe,Manchester+United,19,CB,3,1.00,279,4.0,1.70%,14,1,England,0,1,12,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142,Gareth Barry,Everton,36,DM,2,1.50,1331,4.5,1.70%,68,1,England,0,6,7,0,0
90,Damien Delaney,Crystal+Palace,36,CB,3,1.00,195,4.5,0.60%,51,2,Ireland,0,6,6,0,0
38,Artur Boruc,Bournemouth,37,GK,4,1.00,436,4.5,6.90%,120,2,Poland,0,6,2,0,0
143,Gareth McAuley,West+Brom,37,CB,3,1.00,458,5.0,11.80%,131,2,Northern Ireland,0,6,19,0,0


The youngest player in the EPL is Ben Woodburn. We could explicitly grab that value by using the `iloc[]` indexer.

In [111]:
players.sort_values(by = 'age').iloc[0,0]

'Ben Woodburn'

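Another way to do this is to use the `idxmin()` method on the age column, which finds the smallest age and returns the index label of that row. Passing that result to `iloc[]` then gives us all of the column values for that player. This works here only because the default integer index makes labels and positions identical; with any other index, `loc[]` is the matching indexer, since `idxmin()` returns a label.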

In [112]:
players.iloc[players.age.idxmin()]

name            Ben Woodburn
club               Liverpool
age                       17
position                  LW
position_cat               1
market_value             1.5
page_views              1241
fpl_value                4.5
fpl_sel                0.10%
fpl_points                 5
region                     1
nationality            Wales
new_foreign                0
age_cat                    1
club_id                   10
big_club                   1
new_signing                0
Name: 53, dtype: object
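
To see why the label versus position distinction matters, consider a frame whose index is not the default 0, 1, 2. The sketch below uses an invented three-row frame with the deliberately unusual index labels 10, 20, and 30.

```python
import pandas as pd

toy = pd.DataFrame(
    {'name': ['Gareth Barry', 'Ben Woodburn', 'Josh Tymon'],
     'age': [36, 17, 18]},
    index=[10, 20, 30],   # labels deliberately differ from positions
)

# idxmin() returns the index LABEL of the smallest age.
label = toy['age'].idxmin()

# loc[] is label-based, so it pairs naturally with idxmin().
youngest = toy.loc[label]

print(label)
print(youngest['name'])
```

Here `toy.iloc[label]` would fail, because there is no row at position 20 in a three-row frame.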

For Part 2, we will use `set_index()` to set `club` as the index. Then we will sort that index in alphabetical order. We can do this all in one go!

In [115]:
players = players.set_index('club').sort_index()

In [116]:
players

Unnamed: 0_level_0,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
club,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Arsenal,David Ospina,28,GK,4,7.0,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
Arsenal,Alexandre Lacazette,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
Arsenal,Alexis Sanchez,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
Arsenal,Laurent Koscielny,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
Arsenal,Mesut Ozil,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West+Ham,Mark Noble,30,CM,2,7.0,425,5.5,0.10%,71,1,England,0,4,20,0,0
West+Ham,Michail Antonio,27,RW,1,18.0,1142,7.5,0.50%,132,1,England,0,3,20,0,0
West+Ham,Robert Snodgrass,29,RW,1,8.0,1210,6.0,6.50%,133,2,Scotland,0,4,20,0,0
West+Ham,Ashley Fletcher,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


For Part 3, we will sort this club-indexed DataFrame by both club and market value. Although *club* is now the index rather than a column, `sort_values()` accepts index level names in `by`. We can achieve the different sort directions for *club* and *market_value* in one call of `sort_values()` by passing a list of booleans to the `ascending` parameter.

Remember, we want the clubs in alphabetical ascending order and the market values in descending order.

In [119]:
players.sort_values(by = ['club', 'market_value'], ascending = [True, False])

Unnamed: 0_level_0,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
club,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Arsenal,Alexis Sanchez,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
Arsenal,Mesut Ozil,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
Arsenal,Alexandre Lacazette,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West+Ham,Edimilson Fernandes,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
West+Ham,Sam Byram,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0
West+Ham,Darren Randolph,30,GK,4,2.5,459,4.5,0.40%,69,2,Ireland,0,4,20,0,0
West+Ham,James Collins,33,CB,3,2.0,187,4.5,0.90%,69,2,Wales,0,5,20,0,0
