# Section 5: DataFrames in Depth

In [1]:
import numpy as np
import pandas as pd

## Introducing a New Dataset - Soccer

In this section we will be working with a DataFrame containing English Premier League soccer players. There are over 400 players and 17 attributes!

In [2]:
data_url = 'https://andybek.com/pandas-soccer'

Read in the data

In [3]:
players = pd.read_csv(data_url)

Let's take a look at the DataFrame info. Looks like we have quite a few numeric columns as well as object columns.

In [4]:
players.info(verbose=False, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465 entries, 0 to 464
Columns: 17 entries, name to new_signing
dtypes: float64(2), int64(10), object(5)
memory usage: 190.7 KB


We can take a higher level look at this using the `dtypes` attribute and calling the `value_counts()` method.

In [5]:
players.dtypes.value_counts()

int64      10
object      5
float64     2
dtype: int64

Take one last peek at the DataFrame structure to ensure we're ready to go.

In [6]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0


## Quick Review - Indexing with boolean Masks

Recall the general approach for boolean indexing:
1. Generate a sequence of booleans ("Trues" and "Falses")
2. Use that boolean sequence with either square brackets [ ] or `.loc[]` to make the selection.

Say we're interested in learning which players have a market value exceeding 40 million dollars. We start by using the attribute accessor or selection brackets with a comparison operator to create the boolean sequence.

In [7]:
players["market_value"] > 40

0       True
1       True
2      False
3      False
4      False
       ...  
460    False
461    False
462    False
463    False
464    False
Name: market_value, Length: 465, dtype: bool

Now, to select just the players with a market value over 40 million, simply pass this expression into a set of selection brackets for the DataFrame itself.

In [8]:
players[players["market_value"] > 40]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
96,Eden Hazard,Chelsea,26,LW,1,75.0,4220,10.5,2.30%,224,2,Belgium,0,3,5,1,0
97,Diego Costa,Chelsea,28,CF,1,50.0,4454,10.0,3.00%,196,2,Spain,0,4,5,1,0
108,N%27Golo Kante,Chelsea,26,DM,2,50.0,4042,5.0,13.80%,83,2,France,0,3,5,1,1
218,Philippe Coutinho,Liverpool,25,AM,1,45.0,2958,9.0,30.80%,171,3,Brazil,0,3,10,1,0
244,Kevin De Bruyne,Manchester+City,26,AM,1,65.0,2252,10.0,17.50%,199,2,Belgium,0,3,11,1,0
245,Sergio Aguero,Manchester+City,29,CF,1,65.0,4046,11.5,9.70%,175,3,Argentina,0,4,11,1,0
246,Raheem Sterling,Manchester+City,22,LW,1,45.0,2074,8.0,3.80%,149,1,England,0,2,11,1,0
264,Romelu Lukaku,Manchester+United,24,CF,1,50.0,3727,11.5,45.00%,221,2,Belgium,0,2,12,1,0


Indeed, there are 13 players who meet this criterion.

In [9]:
players[players["market_value"] > 40].shape

(13, 17)

## More Approaches to Boolean Masking: `isin()`, `lt()`, `between()`

Say we're interested in looking at just the *Defenders* (soccer position) in the DataFrame. Defenders, or backs, can be left, center, or right. Let's see what the position codes are in our DataFrame by examining the `position` column and using the `unique()` method.


In [10]:
players['position'].unique()

array(['LW', 'AM', 'GK', 'RW', 'CB', 'RB', 'CF', 'LB', 'DM', 'RM', 'CM',
       nan, 'SS', 'LM'], dtype=object)

Out of those, we are interested in the defenders: LB, CB, and RB. Thus, we want to extract rows from our DataFrame on the condition that the position is one of these three. This is a great use case for the `isin()` method, which creates a boolean Series indicating whether each entry is in the list of values passed in.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html

In [11]:
players['position'].isin(['LB', 'CB', 'RB'])

0      False
1      False
2      False
3      False
4       True
       ...  
460    False
461     True
462     True
463    False
464    False
Name: position, Length: 465, dtype: bool

We can now apply the boolean mask to the DataFrame using `loc[]` to select just the Defenders. We see that 157 players in this list are defenders.

In [12]:
players.loc[players['position'].isin(['LB', 'CB', 'RB'])]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0
7,Nacho Monreal,Arsenal,31,LB,3,13.0,555,5.5,4.70%,115,2,Spain,0,4,1,1,0
8,Shkodran Mustafi,Arsenal,25,CB,3,30.0,1877,5.5,4.00%,90,2,Germany,0,3,1,1,1
17,Gabriel Paulista,Arsenal,26,CB,3,13.0,552,5.0,0.10%,45,3,Brazil,0,3,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
455,Aaron Cresswell,West+Ham,27,LB,3,12.0,380,5.0,1.30%,60,1,England,0,3,20,0,0
458,Angelo Ogbonna,West+Ham,29,CB,3,9.0,247,4.5,1.10%,45,2,Italy,0,4,20,0,0
459,Pablo Zabaleta,West+Ham,32,RB,3,7.0,698,5.0,2.70%,45,3,Argentina,0,5,20,0,0
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1


Let's look at ranges of values. We can use the familiar `between()` method for this.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.between.html
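* Remember that by default, the edge values are inclusive. This can be changed using the `inclusive` parameter. The boolean form used below (`inclusive = False`) works on older Pandas versions; from version 1.3 onward the parameter takes the strings `"both"`, `"neither"`, `"left"`, or `"right"`, and boolean values were removed in version 2.0.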

Let's assess players that are between market values of 40 and 50 million dollars.

In [13]:
players.market_value.between(40, 50, inclusive = False)

0      False
1      False
2      False
3      False
4      False
       ...  
460    False
461    False
462    False
463    False
464    False
Name: market_value, Length: 465, dtype: bool

As before, we can now apply this boolean mask to the original DataFrame. It appears there are only three players with a market value above 40 million but below 50 million.

In [14]:
players[players.market_value.between(40, 50, inclusive = False)]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
218,Philippe Coutinho,Liverpool,25,AM,1,45.0,2958,9.0,30.80%,171,3,Brazil,0,3,10,1,0
246,Raheem Sterling,Manchester+City,22,LW,1,45.0,2074,8.0,3.80%,149,1,England,0,2,11,1,0
380,Dele Alli,Tottenham,21,CM,2,45.0,4626,9.5,38.60%,225,1,England,0,1,17,1,0


If the edge values are inclusive, we see that the number of players is substantially higher.

In [15]:
players[players.market_value.between(40, 50, inclusive = True)]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
31,Alexandre Lacazette,Arsenal,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
97,Diego Costa,Chelsea,28,CF,1,50.0,4454,10.0,3.00%,196,2,Spain,0,4,5,1,0
102,Thibaut Courtois,Chelsea,25,GK,4,40.0,1260,5.5,18.50%,141,2,Belgium,0,3,5,1,0
108,N%27Golo Kante,Chelsea,26,DM,2,50.0,4042,5.0,13.80%,83,2,France,0,3,5,1,1
218,Philippe Coutinho,Liverpool,25,AM,1,45.0,2958,9.0,30.80%,171,3,Brazil,0,3,10,1,0
219,Sadio Mane,Liverpool,25,LW,1,40.0,3219,9.5,5.30%,156,4,Senegal,0,3,10,1,1
246,Raheem Sterling,Manchester+City,22,LW,1,45.0,2074,8.0,3.80%,149,1,England,0,2,11,1,0
263,Bernardo Silva,Manchester+City,22,RW,1,40.0,1098,8.0,4.60%,0,2,Portugal,1,2,11,1,0
264,Romelu Lukaku,Manchester+United,24,CF,1,50.0,3727,11.5,45.00%,221,2,Belgium,0,2,12,1,0
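
The effect of the `inclusive` options can be sketched on a toy Series. This is a minimal sketch, assuming Pandas 1.3 or newer (where `inclusive` accepts strings); the values are made up and are not taken from the players dataset.

```python
import pandas as pd

# Toy market values (in millions); hypothetical data, not the players dataset.
values = pd.Series([40.0, 45.0, 50.0, 65.0])

# String options for `inclusive` (assumes pandas 1.3 or newer).
both = values.between(40, 50, inclusive="both")        # 40 <= x <= 50
neither = values.between(40, 50, inclusive="neither")  # 40 <  x <  50
left = values.between(40, 50, inclusive="left")        # 40 <= x <  50

print(both.tolist())     # [True, True, True, False]
print(neither.tolist())  # [False, True, False, False]
print(left.tolist())     # [True, True, False, False]
```

On such versions, `inclusive="neither"` plays the role of `inclusive = False` and `inclusive="both"` plays the role of `inclusive = True`.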


How else can we create boolean Series on the fly? One way is to use an old-fashioned comparison operator. For instance, if we want to look at players age 25 or younger:

In [16]:
players['age'] <= 25

0      False
1      False
2      False
3      False
4      False
       ...  
460     True
461     True
462     True
463     True
464    False
Name: age, Length: 465, dtype: bool

In [17]:
players[players['age'] <= 25]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0
8,Shkodran Mustafi,Arsenal,25,CB,3,30.0,1877,5.5,4.00%,90,2,Germany,0,3,1,1,1
9,Alex Iwobi,Arsenal,21,LW,1,10.0,1812,5.5,1.00%,89,4,Nigeria,0,1,1,1,0
10,Granit Xhaka,Arsenal,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
11,Granit Xhaka,Arsenal,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456,Pedro Obiang,West+Ham,25,CM,2,9.0,286,4.5,0.30%,55,2,Spain,0,3,20,0,0
460,Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
462,Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0


However, Pandas also has a method for accomplishing this directly. Instead of the `<=` operator, we can use the `.le()` method, which stands for "less than or equal to".

In [18]:
players[players['age'].le(25)]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0
8,Shkodran Mustafi,Arsenal,25,CB,3,30.0,1877,5.5,4.00%,90,2,Germany,0,3,1,1,1
9,Alex Iwobi,Arsenal,21,LW,1,10.0,1812,5.5,1.00%,89,4,Nigeria,0,1,1,1,0
10,Granit Xhaka,Arsenal,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
11,Granit Xhaka,Arsenal,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456,Pedro Obiang,West+Ham,25,CM,2,9.0,286,4.5,0.30%,55,2,Spain,0,3,20,0,0
460,Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
462,Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0


The result is equivalent to using comparison operators. To prove this, let's use the `equals()` method to compare them.

In [19]:
players.age.le(25).equals(players.age <= 25)

True

There are multiple comparison methods at your disposal:
* `le()`: less than or equal to (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.le.html)
* `gt()`: greater than (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.gt.html)
* `lt()`: less than (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.lt.html)
* `ge()`: greater than or equal to (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ge.html)
* `ne()`: not equal to (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ne.html)
* `eq()`: equal to (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.eq.html)

The main difference between using comparators and using these comparison methods is that the methods have a `fill_value` parameter that can be used to substitute in missing data.
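
To make the `fill_value` difference concrete, here is a small sketch on a toy Series with one missing value. The data and the choice of `0` as the fill value are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing value; hypothetical data.
ages = pd.Series([22.0, np.nan, 31.0])

# The operator treats the missing age as "not <= 25".
with_operator = ages <= 25

# The method can substitute a value for the missing age before comparing.
with_fill = ages.le(25, fill_value=0)

print(with_operator.tolist())  # [True, False, False]
print(with_fill.tolist())      # [True, True, False]
```

Whether a missing value should count as a match depends on the question being asked, so the fill value deserves a deliberate choice.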

## Binary Operators with Booleans

Before we get to combining conditions, let's look at how booleans behave in isolation.

**Binary operators** are similar to other operators, but they work on the binary representation of the values, that is, the individual bits. They allow us to perform operations, comparisons, or complements on a bit-by-bit basis. They are also known as bit-wise operators.

The most useful binary operators are `OR` and `AND`.

Let's first examine `OR`. What would we expect if we performed a binary operation of False OR True (with OR represented by the pipe `|`)?

We expect to get True, because an OR comparison will resolve to True as long as at least one of the conditions is True. Think of the `OR` operator as "searching for True"

In [20]:
True | False

True

Let's try some others

In [21]:
False | False

False

In [22]:
True | True

True

Let's now look at the `AND` operator, represented by the ampersand `&`. Think of the AND operator as always True unless there is a False.
* A single `False` is enough to trigger a "False".

In [23]:
False & True

False

In [24]:
True & True

True

In [25]:
False & False

False

In [26]:
True & False

False

In [27]:
True & True & False & True & True

False

How do Pandas Series combine using booleans? Let's start by making a single-element series containing a single False value

In [28]:
f = pd.Series(False)

Let's also make a single-value Series that contains True

In [29]:
t = pd.Series(True)

Let's now use the AND and OR operators on them. What we get is a combined series that compares the boolean values of each Series

In [30]:
t & f

0    False
dtype: bool

In [31]:
t | f

0    True
dtype: bool

That was simple, but where this becomes very powerful is with long series of booleans.

In [32]:
t = pd.Series([True if i % 2 == 0 else False for i in range(10)])

In [33]:
t

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool

In [34]:
f = pd.Series([False for i in range(10)])

In [35]:
f

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

What happens when we combine these together using the AND and OR operators? 

The `&` operator, as expected, returns a series of all False values since the `f` series consists of all False values.

In [36]:
t & f

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

With the `|` OR operator, the result will be alternating `True` and `False`, as we would expect since the `t` Series contains alternating True and False values

In [37]:
t | f

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool

The important thing to note is that when we combine two Pandas series, the comparison is done **label-to-label**. It is NOT done based on the order. Consider this example:

In [38]:
f = pd.Series(data = [False, True, True], index = ['c', 'b', 'a'])
t = pd.Series(data = [True, False, False], index = ['a','b','c'])

In [39]:
f

c    False
b     True
a     True
dtype: bool

In [40]:
t

a     True
b    False
c    False
dtype: bool

Now let's use the `&` operator. If the comparison were based on order of the values within the individual Series, we would expect `[False, False, False]`. 

In [41]:
f & t

a     True
b    False
c    False
dtype: bool

We did not get that! Instead, we got `True` for `a` and `False` for `b` and `c`. That's because the comparison was done by the index labels `a`, `b`, and `c` and NOT the positions of the values.
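
If a position-by-position comparison is what you actually want, one option is to drop the labels first. The sketch below reuses the `f` and `t` Series from above and relies on `to_numpy()`, which returns the underlying values without an index.

```python
import pandas as pd

f = pd.Series(data=[False, True, True], index=['c', 'b', 'a'])
t = pd.Series(data=[True, False, False], index=['a', 'b', 'c'])

# Dropping the index with to_numpy() leaves plain arrays, so `&` pairs
# the values by position instead of by label.
by_position = f.to_numpy() & t.to_numpy()

print(by_position.tolist())  # [False, False, False]
```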

This will help illustrate what happens behind the scenes when we combine booleans. Onward and upward!

## BONUS - XOR and Complement Binary Ops

The binary OR and AND operators are without a doubt the most frequently used ones. However, there are several others available. This lecture will cover two of them.

XOR stands for "exclusive or". As the name suggests, it is exclusive. That means it resolves to *True* when the inputs are different, and to *False* if they are alike.
* In Python the XOR operator is represented by `^`

In [42]:
True ^ False

True

In [43]:
False ^ False

False

In [44]:
True ^ True

False

In [45]:
True ^ (False | False & True) | False

True

XOR can be used to combine boolean series in DataFrame indexing, such as when we want one condition but not the other. We'll see examples of this later.


Another binary operator is the **complement operator**, represented by the tilde `~`.
* In Pandas and Numpy it is extremely useful for negating boolean series.

In [46]:
~False

-1

In [47]:
~True

-2

What's going on here? It has to do with how computers represent integers. The "two's complement" system is a scheme for deriving binary representations of signed integers, and it is the most common method computers use for them.
Remember that computers only understand 0's and 1's. How then do we work with negative integers? In two's complement, the negative of a number is obtained by inverting every bit of its binary representation and then adding 1. The `~` operator performs only the inversion step, which is why `~x` always equals `-x - 1`.
* https://en.wikipedia.org/wiki/Two%27s_complement
* https://www.cs.cornell.edu/~tomf/notes/cps104/twoscomp.html
* Simple tutorial on binary numbers: http://www.steves-internet-guide.com/binary-numbers-explained/

Going back to our previous example, we know that the integer value of False is 0 while the integer value of True is 1. Therefore:
* `~True = ~1 = inversion(00000001) = 11111110 = -2`
* `~False = ~0 = inversion(00000000) = 11111111 = -1`
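
The relationship between `~` and negative numbers can be checked directly. This short sketch uses plain integers and a NumPy array; it converts `True` with `int()` first because newer Python versions warn about applying `~` to a bare `bool`.

```python
import numpy as np

# On Python integers, ~ flips every bit, which works out to -x - 1.
for x in (0, 1, 5, -3):
    assert ~x == -x - 1

# int(True) is 1, so inverting it gives -2 rather than False.
inverted_int = ~int(True)

# Logical negation of a plain Python bool uses `not`.
negated = not True

# On a NumPy boolean array, ~ is an element-wise logical NOT.
flipped = ~np.array([True, False])

print(inverted_int)      # -2
print(negated)           # False
print(flipped.tolist())  # [False, True]
```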

Going back to Pandas, let's use the complement operator to negate a boolean Series. It will turn our Trues to Falses and vice versa.

In [48]:
t = pd.Series([True, True, False])

In [49]:
t

0     True
1     True
2    False
dtype: bool

In [50]:
~t

0    False
1    False
2     True
dtype: bool

The result of using the `~` here is a negation of each of our boolean values. This feature will be very useful for defining negative conditions with DataFrames, for example when we want to select all soccer players that are NOT defenders.

In [51]:
players

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
462,Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0
463,Ashley Fletcher,West+Ham,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1
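
A "NOT defenders" selection could look like the sketch below. It uses a small made-up DataFrame in place of `players`, so the names and positions are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the players DataFrame; names and values are hypothetical.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "position": ["CB", "GK", np.nan, "LB"],
})

is_defender = toy["position"].isin(["LB", "CB", "RB"])

# ~ flips the mask, keeping every row that is NOT a defender.
# Rows with a missing position are not in the list, so they are kept too.
not_defenders = toy.loc[~is_defender]

print(not_defenders["name"].tolist())  # ['B', 'C']
```

Note that a missing `position` counts as "not a defender" here, which may or may not be what you want when the column contains `nan`.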


## Combining Conditions - Indexing with Multiple Conditions

Quick refresher, let's select all of the LBs (left backs)

In [52]:
players[players.position == 'LB']

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
7,Nacho Monreal,Arsenal,31,LB,3,13.0,555,5.5,4.70%,115,2,Spain,0,4,1,1,0
18,Kieran Gibbs,Arsenal,27,LB,3,10.0,489,5.0,0.50%,45,1,England,0,3,1,1,0
29,Sead Kolasinac,Arsenal,24,LB,3,15.0,618,6.0,6.90%,0,2,Bosnia,1,2,1,1,0
34,Charlie Daniels,Bournemouth,30,LB,3,3.0,185,5.0,19.80%,134,1,England,0,4,2,0,0
54,Brad Smith,Bournemouth,23,LB,3,2.0,297,4.0,3.30%,4,4,Australia,0,2,2,0,0
62,Gaetan Bong,Brighton+and+Hove,29,LB,3,1.5,97,4.5,0.20%,0,4,Cameroon,0,4,3,0,0
65,Markus Suttner,Brighton+and+Hove,30,LB,3,2.0,23,4.5,0.20%,0,2,Austria,0,4,3,0,0
82,Stephen Ward,Burnley,31,LB,3,1.5,152,4.5,2.50%,91,2,Ireland,0,4,4,0,0
99,Marcos Alonso Mendoza,Chelsea,26,LB,3,25.0,3069,7.0,12.40%,177,2,Spain,0,3,5,1,1
112,Kenedy,Chelsea,21,LB,3,7.0,566,5.0,0.10%,3,3,Brazil,0,1,5,1,0


Now, what if we want to restrict by yet another condition? Simple! Just use the binary operators that we have learned.
For instance, say we want all of the left backs who are 25 years old or younger. We can place both conditionals within the square selection brackets and combine them using the `&` operator.
* **IMPORTANT NOTE**: Always wrap your individual conditions within parentheses. Failure to do so will cause Python to misinterpret your conditions due to order of operations, since `&` and `|` bind more tightly than comparison operators such as `==` and `<=`.

In [53]:
players[(players.position == 'LB') & (players.age <= 25)]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
29,Sead Kolasinac,Arsenal,24,LB,3,15.0,618,6.0,6.90%,0,2,Bosnia,1,2,1,1,0
54,Brad Smith,Bournemouth,23,LB,3,2.0,297,4.0,3.30%,4,4,Australia,0,2,2,0,0
112,Kenedy,Chelsea,21,LB,3,7.0,566,5.0,0.10%,3,3,Brazil,0,1,5,1,0
128,Jeffrey Schlupp,Crystal+Palace,24,LB,3,8.0,385,5.0,0.30%,47,4,Ghana,0,2,6,0,0
212,Ben Chilwell,Leicester+City,20,LB,3,2.5,288,4.5,0.80%,19,1,England,0,1,9,0,0
236,Alberto Moreno,Liverpool,25,LB,3,10.0,397,4.5,0.30%,8,2,Spain,0,3,10,1,0
281,Luke Shaw,Manchester+United,22,LB,3,20.0,947,5.0,0.40%,45,1,England,0,2,12,1,0
294,Paul Dummett,Newcastle+United,25,LB,3,3.5,177,4.5,1.00%,0,2,Wales,0,3,13,0,0
298,Massadio Haidara,Newcastle+United,24,LB,3,1.5,114,4.0,0.50%,0,2,France,0,2,13,0,0
328,Matt Targett,Southampton,21,LB,3,3.0,110,4.5,0.20%,12,1,England,0,1,14,0,0
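
To see why each condition needs its own wrapping, here is a small sketch on a toy Series of ages (made-up values). It compares the wrapped form with the unwrapped one.

```python
import pandas as pd

ages = pd.Series([21, 24, 31])

# With parentheses, each comparison is evaluated first, then combined.
mask = (ages > 20) & (ages < 30)
print(mask.tolist())  # [True, True, False]

# Without them, Python reads `ages > 20 & ages < 30` as the chained
# comparison `ages > (20 & ages) < 30`, which needs a single True/False
# from a whole Series and therefore raises a ValueError.
try:
    ages > 20 & ages < 30
    raised = False
except ValueError:
    raised = True

print(raised)  # True
```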


You could even add more conditions! Let's continue filtering for LBs age 25 or younger that have a market value of at least $10 million

In [54]:
players[(players.position == 'LB') & (players.age <= 25) & (players.market_value >= 10)]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
29,Sead Kolasinac,Arsenal,24,LB,3,15.0,618,6.0,6.90%,0,2,Bosnia,1,2,1,1,0
236,Alberto Moreno,Liverpool,25,LB,3,10.0,397,4.5,0.30%,8,2,Spain,0,3,10,1,0
281,Luke Shaw,Manchester+United,22,LB,3,20.0,947,5.0,0.40%,45,1,England,0,2,12,1,0
389,Ben Davies,Tottenham,24,LB,3,12.0,396,5.5,1.80%,90,2,Wales,0,2,17,1,0


Very cool. We can also combine different binary operators. For example, let's narrow this subset down to left backs who are not from Arsenal or Tottenham. One way to do this is to use the `isin()` method, pass in "Arsenal" and "Tottenham", and then negate the result using the tilde `~`.

In [55]:
players[(players.position == 'LB') & (players.age <= 25) & (players.market_value >= 10) & (~players.club.isin(['Tottenham', 'Arsenal']))]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
236,Alberto Moreno,Liverpool,25,LB,3,10.0,397,4.5,0.30%,8,2,Spain,0,3,10,1,0
281,Luke Shaw,Manchester+United,22,LB,3,20.0,947,5.0,0.40%,45,1,England,0,2,12,1,0


## Conditions As Variables

The long indexing that combines multiple conditions can get very ugly to look at very quickly. One way to make the code more readable (aside from breaking into separate lines) is to use *conditions as variables*. 

This essentially means refactoring your conditions into standalone variables.

Suppose we are interested in right backs (RBs) from Arsenal and goalkeepers from Chelsea. Start by creating a variable generating a boolean mask for Arsenal players

In [56]:
arsenal_player = players.club == 'Arsenal'

In [57]:
arsenal_player

0       True
1       True
2       True
3       True
4       True
       ...  
460    False
461    False
462    False
463    False
464    False
Name: club, Length: 465, dtype: bool

Next, let's create a mask for the right backs.

In [58]:
right_back = players.position == 'RB'

In [59]:
right_back

0      False
1      False
2      False
3      False
4      False
       ...  
460    False
461    False
462     True
463    False
464    False
Name: position, Length: 465, dtype: bool

For the Chelsea part, we'll combine both conditions in one go.

In [60]:
chelsea_goalkeepers = (players.club == 'Chelsea') & (players.position == 'GK')

In [61]:
chelsea_goalkeepers

0      False
1      False
2      False
3      False
4      False
       ...  
460    False
461    False
462    False
463    False
464    False
Length: 465, dtype: bool

Finally, let's index the whole DataFrame based on the conditions we have defined in these variables. To switch things up a bit we'll use `.loc[]`

In [62]:
players.loc[(arsenal_player & right_back) | chelsea_goalkeepers]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0
27,Carl Jenkinson,Arsenal,25,RB,3,5.0,561,4.5,0.40%,2,1,England,0,3,1,1,0
102,Thibaut Courtois,Chelsea,25,GK,4,40.0,1260,5.5,18.50%,141,2,Belgium,0,3,5,1,0
109,Willy Caballero,Chelsea,35,GK,4,1.5,542,5.0,0.20%,64,3,Argentina,0,6,5,1,0


## Skill Challenge
Identify the subset of players that meets all of the following criteria:
1. English `nationality`
2. Market value is more than twice the average market value in the league (`market_value`)
3. More than 4000 page views (`page_views`) OR are a new signing (`new_signing`), but not both.

Let's handle this by creating variables for conditions. Let's start with English-ness

In [63]:
english = players.nationality == "England"

Next let's handle market value. First we'll calculate the average market value for all players, and then create a condition where only players that are more than twice the average market value are selected.

In [64]:
avg_market = players['market_value'].mean()

In [65]:
avg_market

11.125649350649349

In [66]:
twice_avg = players.market_value > avg_market * 2

In [67]:
twice_avg

0       True
1       True
2      False
3      False
4      False
       ...  
460    False
461    False
462    False
463    False
464    False
Name: market_value, Length: 465, dtype: bool

Finally, let's handle the page views and signing condition. Since we want players that either have >4000 page views OR are a new_signing, but not both, our selections must reflect that.
That is, our 4000 views selector must exclude new_signing, and our signing selector must exclude players greater than 4000 views

In [68]:
fourK_views = players['page_views'] > 4000

In [69]:
new_signing = players['new_signing'] == 1

We now combine all of the expressions together. The third condition (views and signing) is a great opportunity to use the XOR operator `^` when combining them. Remember that this will only resolve to True if either one of the conditions is True but not both. That's exactly what we want.

In [70]:
players[english & twice_avg & (fourK_views ^ new_signing)]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
256,John Stones,Manchester+City,23,CB,3,35.0,1078,5.5,2.30%,59,1,England,0,2,11,1,1
380,Dele Alli,Tottenham,21,CM,2,45.0,4626,9.5,38.60%,225,1,England,0,1,17,1,0
381,Harry Kane,Tottenham,23,CF,1,60.0,4161,12.5,35.10%,224,1,England,0,2,17,1,0
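
As a sanity check on the XOR step, the sketch below shows that `^` gives the same mask as spelling out "one or the other, but not both" with `|`, `&`, and `~`. The two toy Series are hypothetical and simply cover all four True/False combinations.

```python
import pandas as pd

# Toy masks covering all four combinations; hypothetical data.
views = pd.Series([True, True, False, False])
signing = pd.Series([True, False, True, False])

# XOR in one step.
xor_mask = views ^ signing

# The same logic spelled out: either condition, but not both at once.
spelled_out = (views | signing) & ~(views & signing)

print(xor_mask.tolist())             # [False, True, True, False]
print(xor_mask.equals(spelled_out))  # True
```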


## Two-Dimensional Indexing: Selecting Columns


So far, all of our data extractions have operated only on the index axis - that is, selecting *rows* that we want.
However, remember that DataFrames are two-dimensional data structures. Thus, we can index by columns as well.

Say we want to select Chelsea players that are age 23 and under. We already know how to do this.

In [71]:
chelsea_23under = (players.club == "Chelsea") & (players.age.le(23))

In [72]:
players.loc[chelsea_23under]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
110,Michy Batshuayi,Chelsea,23,CF,1,25.0,1162,8.5,1.60%,48,2,Belgium,0,2,5,1,1
111,Kurt Zouma,Chelsea,22,CB,3,15.0,723,5.5,0.80%,15,2,France,0,2,5,1,0
112,Kenedy,Chelsea,21,LB,3,7.0,566,5.0,0.10%,3,3,Brazil,0,1,5,1,0
115,Tiemoue Bakayoko,Chelsea,22,DM,2,16.0,1011,5.0,1.60%,0,2,France,1,2,5,1,0


But notice that these types of selections return ALL columns. What if we were only interested in one or a few columns, but not all of them?

Using the `loc[]` indexer, we can pass in a second argument that represents our columns that we want to select. 
* https://datagy.io/pandas-select-columns/#loc-select-columns

This is perhaps the most intuitive way to do it, but it is not the most flexible. For example, it won't allow you to select all columns that start with a particular letter.


In [73]:
players.loc[chelsea_23under, ['position', 'market_value']]

Unnamed: 0,position,market_value
110,CF,25.0
111,CB,15.0
112,LB,7.0
115,DM,16.0


Let's actually try that. Let's select all columns that begin with a 'p'. To do this, we can use the `startswith` string method on the columns attribute for the DataFrame, which generates a boolean mask for the columns! 

**Important note**: just like selecting rows, selecting columns with a boolean mask requires the boolean series to be of the same length as the column axis. 

In [74]:
startswith_p = players.columns.str.startswith('p')

Now we can use that boolean mask as the second argument for `loc[]`. Voilà, we get just the columns that start with `p`.

In [75]:
players.loc[chelsea_23under, startswith_p]

Unnamed: 0,position,position_cat,page_views
110,CF,1,1162
111,CB,3,723
112,LB,3,566
115,DM,2,1011
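
The length requirement for a column mask can be seen on a toy DataFrame. This is a sketch with made-up columns and values rather than the real dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B"],
    "position": ["CB", "GK"],
    "page_views": [912, 1529],
    "age": [31, 35],
})

# One boolean per column, in column order.
starts_with_p = toy.columns.str.startswith("p")
print(starts_with_p.tolist())  # [False, True, True, False]

# The mask length matches the number of columns, so loc accepts it.
assert len(starts_with_p) == len(toy.columns)
selected = toy.loc[:, starts_with_p]
print(selected.columns.tolist())  # ['position', 'page_views']
```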


It is also common to chain two square brackets together. The instructor suggests avoiding doing this.

In [76]:
players[chelsea_23under]['position']

110    CF
111    CB
112    LB
115    DM
Name: position, dtype: object

What you are essentially doing above is subsetting your DataFrame by the first condition, and then selecting a single column from that.

This is NOT the same as the following:

In [77]:
players.loc[chelsea_23under, 'position']

110    CF
111    CB
112    LB
115    DM
Name: position, dtype: object

Although the output is the same, behind the scenes the chained square brackets are a slower process. This is because the `__getitem__` method gets called twice under the hood with bracket chaining, whereas `loc[]` makes the selection in a single call.
Therefore, the instructor recommends avoiding bracket chaining and instead using `loc[]` with a second argument.
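
Speed aside, a single `loc[]` call is also the dependable form when assigning values, since assignment through chained brackets may not reach the original DataFrame. The sketch below uses a toy DataFrame (hypothetical data) and only demonstrates the `loc[]` side.

```python
import pandas as pd

toy = pd.DataFrame({
    "club": ["Chelsea", "Arsenal", "Chelsea"],
    "age": [22, 28, 30],
    "position": ["CB", "AM", "GK"],
})
young_chelsea = (toy["club"] == "Chelsea") & (toy["age"] <= 23)

# Reading: both forms return the same values.
chained = toy[young_chelsea]["position"]
single = toy.loc[young_chelsea, "position"]
print(chained.equals(single))  # True

# Writing: a single loc call targets the original DataFrame directly.
toy.loc[young_chelsea, "position"] = "DF"
print(toy["position"].tolist())  # ['DF', 'AM', 'GK']
```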

## Fancy Indexing with `lookup()`

"Fancy" indexing simply refers to passing multiple labels at once. It's very similar to basic indexing, but instead of using single labels, we specify a list or tuple of labels. Easy!

The `lookup()` method is another way to achieve fancy indexing. It allows us to pick a list of row labels and a matching list of column labels and find the value at each corresponding position. This will return a numpy array, not a single value.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.lookup.html
* Note that this is deprecated since Pandas version 1.2.0 and was removed in version 2.0.

In [78]:
players.lookup([450], ['age'])

array([30])

Let's compare fancy indexing with `loc[]`, where we want two rows and two columns. This will return a DataFrame slice.

In [79]:
players.loc[[0, 132], ('name', 'market_value')]

Unnamed: 0,name,market_value
0,Alexis Sanchez,65.0
132,Connor Wickham,6.0


Comparing to the `lookup()` method:

In [80]:
players.lookup([0, 132], ['name', 'market_value'])

array(['Alexis Sanchez', 6.0], dtype=object)

Again, notice that what gets returned is a numpy array holding one value per (row, column) pair: the name of the player at row 0 and the market value of the player at row 132.

What happens if we swap the column labels around?


In [81]:
players.lookup([0, 132], [ 'market_value', 'name'])

array([65.0, 'Connor Wickham'], dtype=object)

This time, our NumPy array begins with the market value of the player at row 0 and ends with the name of the player at row 132.
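
Since `lookup()` is no longer available in recent pandas versions, the same pairwise selection can be sketched with plain NumPy indexing. This is one possible replacement, shown on a hypothetical two-row toy DataFrame; the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data using the same labels as the example above.
toy = pd.DataFrame(
    {'name': ['Alexis Sanchez', 'Connor Wickham'],
     'market_value': [65.0, 6.0]},
    index=[0, 132],
)

row_labels = [0, 132]
col_labels = ['name', 'market_value']

# Translate the labels into integer positions.
row_pos = toy.index.get_indexer(row_labels)
col_pos = toy.columns.get_indexer(col_labels)

# NumPy pairs the i-th row position with the i-th column position.
paired = toy.to_numpy()[row_pos, col_pos]
print(paired)
```

Swapping the order of `col_labels` swaps which attribute is picked for each row, just like in the `lookup()` calls above.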

In reality, the `lookup()` method is most useful when we already have a collection of labels that we want to use to make our selections. Say we have three players and want to source specific attributes.


In [82]:
names = ['Petr Cech', 'Mesut Ozil', 'Alexis Sanchez']

In [83]:
attributes = ['age', 'market_value', 'page_views']

To look this up, you cannot simply pass in the names as the index labels. Why not? Because the current index labels are numbers, not player names. The player names are stored in one of the columns of the DataFrame.

In order to get this to work, we have to **set the index** to player names. It is recommended NOT to do this `inplace` so that you don't accidentally manipulate the DataFrame in an undesirable way.

In [84]:
players.set_index('name')

Unnamed: 0_level_0,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0
Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0
Ashley Fletcher,West+Ham,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


Now let's chain on the `lookup()` that we want to do. Note that `lookup()` now requires unique index and column labels, whereas our DataFrame contains duplicate names by design. The lecture needs to be re-recorded.

In any case, we will use the `duplicated()` method and logical negation to remove duplicated players from the DataFrame.

In [85]:
players_by_name = players.set_index('name') 
dupes = players_by_name.index.duplicated() 
players_by_name[~dupes].lookup(names, attributes)

array([  35.,   50., 4329.])

Can we do something similar using `.loc[]`? Let's find out.

In [86]:
players.set_index('name').loc[names, attributes]

Unnamed: 0_level_0,age,market_value,page_views
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Petr Cech,35,7.0,1529
Mesut Ozil,28,50.0,4395
Alexis Sanchez,28,65.0,4329


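Yes we can! Note that `.loc[]` returns every requested row crossed with every requested column, whereas `lookup()` returns only one value per (row, column) pair. Honestly, `.loc[]` is easier to work with, even if it retrieves more values than we strictly need.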
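
The relationship between the two results can be sketched as follows: the paired values are the diagonal of the grid that `.loc[]` returns. This uses a hypothetical three-player toy DataFrame with values copied from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the three players from the example.
toy = pd.DataFrame({
    'name': ['Alexis Sanchez', 'Mesut Ozil', 'Petr Cech'],
    'age': [28, 28, 35],
    'market_value': [65.0, 50.0, 7.0],
    'page_views': [4329, 4395, 1529],
}).set_index('name')

names = ['Petr Cech', 'Mesut Ozil', 'Alexis Sanchez']
attributes = ['age', 'market_value', 'page_views']

grid = toy.loc[names, attributes]       # 3 rows x 3 columns
paired = np.diag(grid.to_numpy())       # one value per (name, attribute) pair
print(paired)
```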

## Sorting by Index or Column - the `sort_values()` and `sort_index()` Methods

Recall that we previously learned how to sort by column values in ascending or descending order.

In [87]:
players.sort_values(by = 'market_value', ascending = False)

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
96,Eden Hazard,Chelsea,26,LW,1,75.00,4220,10.5,2.30%,224,2,Belgium,0,3,5,1,0
267,Paul Pogba,Manchester+United,24,CM,2,75.00,7435,8.0,19.50%,115,2,France,0,2,12,1,1
0,Alexis Sanchez,Arsenal,28,LW,1,65.00,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
244,Kevin De Bruyne,Manchester+City,26,AM,1,65.00,2252,10.0,17.50%,199,2,Belgium,0,3,11,1,0
245,Sergio Aguero,Manchester+City,29,CF,1,65.00,4046,11.5,9.70%,175,3,Argentina,0,4,11,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
287,Joel Castro Pereira,Manchester+United,21,GK,4,0.10,395,4.0,1.00%,6,2,Portugal,0,1,12,1,0
113,Eduardo Carvalho,Chelsea,34,LW,1,0.05,467,5.0,0.10%,0,2,Portugal,0,6,5,1,1
30,Granit Xhaka,Arsenal,24,,2,,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
192,Steve Mounie,Huddersfield,22,CF,1,,56,6.0,0.60%,0,2,Benin,0,2,8,0,0


Sometimes we also want to sort by index values, and this usually happens for indexes that we create (as opposed to Pandas creating them).

For instance, the index for the current `players` DataFrame is an integer index that's not particularly interesting or useful. Nothing about that numeric index label tells you anything about the data in those rows. But let's say we switch the index to player names. This can be achieved with the `set_index()` method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html

In [88]:
players.set_index('name', inplace = True)

In [89]:
players

Unnamed: 0_level_0,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0
Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0
Ashley Fletcher,West+Ham,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


Now we have something interesting to work with! The index now contains strings and is of dtype `object`.

In [90]:
players.index

Index(['Alexis Sanchez', 'Mesut Ozil', 'Petr Cech', 'Theo Walcott',
       'Laurent Koscielny', 'Hector Bellerin', 'Olivier Giroud',
       'Nacho Monreal', 'Shkodran Mustafi', 'Alex Iwobi',
       ...
       'Aaron Cresswell', 'Pedro Obiang', 'Sofiane Feghouli', 'Angelo Ogbonna',
       'Pablo Zabaleta', 'Edimilson Fernandes', 'Arthur Masuaku', 'Sam Byram',
       'Ashley Fletcher', 'Diafra Sakho'],
      dtype='object', name='name', length=465)

But now our index is not sorted in any way. It's a good practice to make an ordered index. We can do this with the `sort_index` method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html

In [91]:
players.sort_index().head(10)

Unnamed: 0_level_0,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Aaron Cresswell,West+Ham,27,LB,3,12.0,380,5.0,1.30%,60,1,England,0,3,20,0,0
Aaron Lennon,Everton,30,RW,1,5.0,504,5.5,0.20%,22,1,England,0,4,7,0,0
Aaron Mooy,Huddersfield,26,CM,2,5.0,588,5.5,2.50%,0,4,Australia,0,3,8,0,0
Aaron Ramsey,Arsenal,26,CM,2,35.0,1040,7.0,5.10%,56,1,Wales,0,3,1,1,0
Abdoulaye Doucoure,Watford,24,CM,2,6.0,124,5.0,0.00%,38,2,France,0,2,18,0,0
Adam Federici,Bournemouth,32,GK,4,1.0,126,4.0,1.50%,8,4,Australia,0,5,2,0,0
Adam Lallana,Liverpool,29,AM,1,25.0,1808,7.5,6.40%,139,1,England,0,4,10,1,0
Adam Smith,Bournemouth,26,RB,3,5.0,200,5.0,0.90%,104,1,England,0,3,2,0,0
Ademola Lookman,Everton,19,LW,1,5.0,1387,5.5,0.30%,16,1,England,0,1,7,0,0
Adrian,West+Ham,30,GK,4,8.0,266,4.5,0.80%,64,2,Spain,0,4,20,0,0


Let's set `inplace` to True so that the sorting sticks.

In [92]:
players.sort_index(inplace =  True)

In [93]:
players.head()

Unnamed: 0_level_0,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Aaron Cresswell,West+Ham,27,LB,3,12.0,380,5.0,1.30%,60,1,England,0,3,20,0,0
Aaron Lennon,Everton,30,RW,1,5.0,504,5.5,0.20%,22,1,England,0,4,7,0,0
Aaron Mooy,Huddersfield,26,CM,2,5.0,588,5.5,2.50%,0,4,Australia,0,3,8,0,0
Aaron Ramsey,Arsenal,26,CM,2,35.0,1040,7.0,5.10%,56,1,Wales,0,3,1,1,0
Abdoulaye Doucoure,Watford,24,CM,2,6.0,124,5.0,0.00%,38,2,France,0,2,18,0,0


This method also helps us sort our columns. Suppose we want to sort the column axis (that is, the column labels). All we need to change is the `axis` parameter of the `sort_index()` method.

This can be helpful if you want to look through the column labels by name.
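
A minimal sketch of the two axes, on a hypothetical two-row toy DataFrame:

```python
import pandas as pd

# Hypothetical toy data indexed by player name.
toy = pd.DataFrame(
    {'club': ['Arsenal', 'Everton'], 'age': [28, 30]},
    index=['Mesut Ozil', 'Aaron Lennon'],
)

by_rows = toy.sort_index()          # axis=0 (default): sorts the row labels
by_cols = toy.sort_index(axis=1)    # axis=1: sorts the column labels

print(by_rows.index.tolist())
print(by_cols.columns.tolist())
```

Neither call changes `toy` itself, because `inplace` defaults to `False`.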

In [94]:
players.sort_index(axis=1)

Unnamed: 0_level_0,age,age_cat,big_club,club,club_id,fpl_points,fpl_sel,fpl_value,market_value,nationality,new_foreign,new_signing,page_views,position,position_cat,region
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Aaron Cresswell,27,3,0,West+Ham,20,60,1.30%,5.0,12.0,England,0,0,380,LB,3,1
Aaron Lennon,30,4,0,Everton,7,22,0.20%,5.5,5.0,England,0,0,504,RW,1,1
Aaron Mooy,26,3,0,Huddersfield,8,0,2.50%,5.5,5.0,Australia,0,0,588,CM,2,4
Aaron Ramsey,26,3,1,Arsenal,1,56,5.10%,7.0,35.0,Wales,0,0,1040,CM,2,1
Abdoulaye Doucoure,24,2,0,Watford,18,38,0.00%,5.0,6.0,France,0,0,124,CM,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Yohan Cabaye,31,4,0,Crystal+Palace,6,91,1.40%,5.5,15.0,France,0,0,456,CM,2,2
YounÃ¨s Kaboul,31,4,0,Watford,18,57,0.10%,4.5,2.5,France,0,1,263,CB,3,2
Ã‰tienne Capoue,29,4,0,Watford,18,131,8.00%,5.5,9.0,France,0,0,412,DM,2,2
Ã€ngel Rangel,34,6,0,Swansea,16,26,18.80%,4.0,1.0,Spain,0,0,137,RB,3,2


Let's reset the index so that we are on the same page as the instructor for the next lecture. This can be done using the `reset_index()` method. It resets the index to the default one. By default, the old index gets inserted into the DataFrame as a column. 
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html

In [95]:
players.reset_index(inplace = True)

In [96]:
players.head(10)

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Aaron Cresswell,West+Ham,27,LB,3,12.0,380,5.0,1.30%,60,1,England,0,3,20,0,0
1,Aaron Lennon,Everton,30,RW,1,5.0,504,5.5,0.20%,22,1,England,0,4,7,0,0
2,Aaron Mooy,Huddersfield,26,CM,2,5.0,588,5.5,2.50%,0,4,Australia,0,3,8,0,0
3,Aaron Ramsey,Arsenal,26,CM,2,35.0,1040,7.0,5.10%,56,1,Wales,0,3,1,1,0
4,Abdoulaye Doucoure,Watford,24,CM,2,6.0,124,5.0,0.00%,38,2,France,0,2,18,0,0
5,Adam Federici,Bournemouth,32,GK,4,1.0,126,4.0,1.50%,8,4,Australia,0,5,2,0,0
6,Adam Lallana,Liverpool,29,AM,1,25.0,1808,7.5,6.40%,139,1,England,0,4,10,1,0
7,Adam Smith,Bournemouth,26,RB,3,5.0,200,5.0,0.90%,104,1,England,0,3,2,0,0
8,Ademola Lookman,Everton,19,LW,1,5.0,1387,5.5,0.30%,16,1,England,0,1,7,0,0
9,Adrian,West+Ham,30,GK,4,8.0,266,4.5,0.80%,64,2,Spain,0,4,20,0,0
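
The round trip between `set_index()` and `reset_index()` can be sketched on a hypothetical toy DataFrame. The `drop=True` variant is included to show what happens when we do not want the old index back as a column.

```python
import pandas as pd

# Hypothetical toy data with a default integer index.
toy = pd.DataFrame({'name': ['Aaron Mooy', 'Aaron Ramsey'], 'age': [26, 26]})

indexed = toy.set_index('name')              # name becomes the index
restored = indexed.reset_index()             # name comes back as a column
dropped = indexed.reset_index(drop=True)     # old index is discarded instead

print(restored.columns.tolist())
print(dropped.columns.tolist())
```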


## Sorting vs Reordering - the `reindex()` Method

You can customize a lot of features with the sort methods that we use, like `sort_index()` and `sort_values()`. But at the end of the day, we are only sorting in ascending or descending order.

What if we wanted to more precisely reorder the rows according to some very specific order? To do this, we can use the `reindex()` method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

Let's start by isolating a small chunk of our DataFrame using `iloc[]`.

In [97]:
players_lite = players.iloc[:4, :4]

In [98]:
players_lite

Unnamed: 0,name,club,age,position
0,Aaron Cresswell,West+Ham,27,LB
1,Aaron Lennon,Everton,30,RW
2,Aaron Mooy,Huddersfield,26,CM
3,Aaron Ramsey,Arsenal,26,CM


Now, say we want to reorder the rows based on a very specific requirement. Suppose we're working for a soccer scout who wants these four players displayed in a very specific format.

* Row order needs to be: 2, 1, 3, 0
* Column order needs to be: age, name, position, club

The sort methods cannot produce an arbitrary order like this. But it is VERY easy to do using `reindex()`. All we need to do is pass the index labels, in the exact order that we want them, to the `index` parameter, and do the same for the column labels with the `columns` parameter.

In [99]:
players_lite.reindex(index = [2, 1, 3, 0], columns = ['age', 'name', 'position', 'club'])

Unnamed: 0,age,name,position,club
2,26,Aaron Mooy,CM,Huddersfield
1,30,Aaron Lennon,RW,Everton
3,26,Aaron Ramsey,CM,Arsenal
0,27,Aaron Cresswell,LB,West+Ham


Here we have subset our DataFrame and applied the `reindex()` method to it. We can achieve this same result by working directly on our large DataFrame. This works because we have explicitly called out the index labels and columns that we want to display. This means we can **use the `reindex()` method to carve out slices of our DataFrame, and they don't even need to be consecutive slices.**

HOWEVER, it is *NOT* the most effective way to do selections. The instructor recommends using `loc[]` and `iloc[]` as much as possible.

In [100]:
players.reindex(index = [2, 1, 3, 0], columns = ['age', 'name', 'position', 'club'])

Unnamed: 0,age,name,position,club
2,26,Aaron Mooy,CM,Huddersfield
1,30,Aaron Lennon,RW,Everton
3,26,Aaron Ramsey,CM,Arsenal
0,27,Aaron Cresswell,LB,West+Ham
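
One practical difference from `loc[]` is worth knowing: `reindex()` does not complain about labels that are missing from the DataFrame. It inserts a row of `NaN` instead, whereas `loc[]` raises a `KeyError`. A minimal sketch on a hypothetical toy DataFrame (the label 99 is deliberately absent):

```python
import pandas as pd

# Hypothetical toy data with index labels 0 and 1 only.
toy = pd.DataFrame({'name': ['Aaron Cresswell', 'Aaron Lennon'], 'age': [27, 30]})

# reindex() accepts the unknown label 99 and fills that row with NaN.
reindexed = toy.reindex(index=[1, 0, 99])
print(reindexed)

# loc[] refuses to select a label that does not exist.
try:
    toy.loc[[1, 0, 99]]
    loc_raised = False
except KeyError:
    loc_raised = True
print(loc_raised)
```

This silent filling is one reason to prefer `loc[]` for plain selections: a typo in a label shows up as an error rather than as missing data.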


Let's say our scout has changed her mind and she wants ALL of the data columns for these four players, in standard alphabetical order. 

There are several options. Perhaps the easiest is to chain on the `sort_index()` method and set `axis = 1`

In [101]:
players.reindex(index = [2, 1, 3, 0]).sort_index(axis = 1)

Unnamed: 0,age,age_cat,big_club,club,club_id,fpl_points,fpl_sel,fpl_value,market_value,name,nationality,new_foreign,new_signing,page_views,position,position_cat,region
2,26,3,0,Huddersfield,8,0,2.50%,5.5,5.0,Aaron Mooy,Australia,0,0,588,CM,2,4
1,30,4,0,Everton,7,22,0.20%,5.5,5.0,Aaron Lennon,England,0,0,504,RW,1,1
3,26,3,1,Arsenal,1,56,5.10%,7.0,35.0,Aaron Ramsey,Wales,0,0,1040,CM,2,1
0,27,3,0,West+Ham,20,60,1.30%,5.0,12.0,Aaron Cresswell,England,0,0,380,LB,3,1


But we can also avoid method chaining and rely entirely on the `reindex()` method. All we have to do is pass to the `columns` parameter a list of the column labels that has already been sorted in alphabetical order.

We can get exactly that just by using the `columns` attribute of the DataFrame, which returns an `Index` object that we can sort using `sort_values()`.

In [102]:
players.reindex(index = [2, 1, 3, 0], columns = players.columns.sort_values())

Unnamed: 0,age,age_cat,big_club,club,club_id,fpl_points,fpl_sel,fpl_value,market_value,name,nationality,new_foreign,new_signing,page_views,position,position_cat,region
2,26,3,0,Huddersfield,8,0,2.50%,5.5,5.0,Aaron Mooy,Australia,0,0,588,CM,2,4
1,30,4,0,Everton,7,22,0.20%,5.5,5.0,Aaron Lennon,England,0,0,504,RW,1,1
3,26,3,1,Arsenal,1,56,5.10%,7.0,35.0,Aaron Ramsey,Wales,0,0,1040,CM,2,1
0,27,3,0,West+Ham,20,60,1.30%,5.0,12.0,Aaron Cresswell,England,0,0,380,LB,3,1


Another way we can do this is to use Python's built-in `sorted()` function. All we need to do is pass our columns to this function, and we'll get a sorted list back! It accomplishes the exact same objective.

In [103]:
players.reindex(index = [2, 1, 3, 0], columns = sorted(players.columns))

Unnamed: 0,age,age_cat,big_club,club,club_id,fpl_points,fpl_sel,fpl_value,market_value,name,nationality,new_foreign,new_signing,page_views,position,position_cat,region
2,26,3,0,Huddersfield,8,0,2.50%,5.5,5.0,Aaron Mooy,Australia,0,0,588,CM,2,4
1,30,4,0,Everton,7,22,0.20%,5.5,5.0,Aaron Lennon,England,0,0,504,RW,1,1
3,26,3,1,Arsenal,1,56,5.10%,7.0,35.0,Aaron Ramsey,Wales,0,0,1040,CM,2,1
0,27,3,0,West+Ham,20,60,1.30%,5.0,12.0,Aaron Cresswell,England,0,0,380,LB,3,1


## What NOT to Do When Sorting: Transposing the DataFrame Twice

Suppose you want to sort the columns alphabetically. What some people do is change the shape of the DataFrame by transposing it. One way to do this is with the `swapaxes()` method (note that it has been deprecated since pandas 2.1 in favor of `transpose()`).

In [104]:
df = players.iloc[:6, :6]

In [105]:
df

Unnamed: 0,name,club,age,position,position_cat,market_value
0,Aaron Cresswell,West+Ham,27,LB,3,12.0
1,Aaron Lennon,Everton,30,RW,1,5.0
2,Aaron Mooy,Huddersfield,26,CM,2,5.0
3,Aaron Ramsey,Arsenal,26,CM,2,35.0
4,Abdoulaye Doucoure,Watford,24,CM,2,6.0
5,Adam Federici,Bournemouth,32,GK,4,1.0


In [106]:
df.swapaxes(1, 0)

Unnamed: 0,0,1,2,3,4,5
name,Aaron Cresswell,Aaron Lennon,Aaron Mooy,Aaron Ramsey,Abdoulaye Doucoure,Adam Federici
club,West+Ham,Everton,Huddersfield,Arsenal,Watford,Bournemouth
age,27,30,26,26,24,32
position,LB,RW,CM,CM,CM,GK
position_cat,3,1,2,2,2,4
market_value,12,5,5,35,6,1


Another way to transpose is to use the `T` attribute.

In [107]:
df.T

Unnamed: 0,0,1,2,3,4,5
name,Aaron Cresswell,Aaron Lennon,Aaron Mooy,Aaron Ramsey,Abdoulaye Doucoure,Adam Federici
club,West+Ham,Everton,Huddersfield,Arsenal,Watford,Bournemouth
age,27,30,26,26,24,32
position,LB,RW,CM,CM,CM,GK
position_cat,3,1,2,2,2,4
market_value,12,5,5,35,6,1


This gives us our column names as the index labels, and we can now sort them in alphabetical order.

In [108]:
df.T.sort_index()

Unnamed: 0,0,1,2,3,4,5
age,27,30,26,26,24,32
club,West+Ham,Everton,Huddersfield,Arsenal,Watford,Bournemouth
market_value,12,5,5,35,6,1
name,Aaron Cresswell,Aaron Lennon,Aaron Mooy,Aaron Ramsey,Abdoulaye Doucoure,Adam Federici
position,LB,RW,CM,CM,CM,GK
position_cat,3,1,2,2,2,4


Finally, we re-transpose back into the original form of the DataFrame, in which the column names will now be sorted.

In [109]:
df.T.sort_index().T

Unnamed: 0,age,club,market_value,name,position,position_cat
0,27,West+Ham,12,Aaron Cresswell,LB,3
1,30,Everton,5,Aaron Lennon,RW,1
2,26,Huddersfield,5,Aaron Mooy,CM,2
3,26,Arsenal,35,Aaron Ramsey,CM,2
4,24,Watford,6,Abdoulaye Doucoure,CM,2
5,32,Bournemouth,1,Adam Federici,GK,4


**DO NOT do this!** Why not?
* The code is ugly to read and can be confusing
* It is indirect and uses a lot of temporary outputs
* There's no reason to do this when you can accomplish the same thing using just the `sort_index()` method.
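
There is one more side effect worth seeing. When a DataFrame mixes strings and numbers, transposing forces every value into a common `object` dtype, and transposing back does not restore the original dtypes. A minimal sketch on a hypothetical toy DataFrame:

```python
import pandas as pd

# Hypothetical toy data mixing text, integer, and float columns.
toy = pd.DataFrame({
    'name': ['Aaron Mooy', 'Aaron Ramsey'],
    'age': [26, 26],
    'market_value': [5.0, 35.0],
})

round_trip = toy.T.sort_index().T     # the double-transpose approach
direct = toy.sort_index(axis=1)       # the recommended approach

print(round_trip.dtypes)   # every column ends up as object
print(direct.dtypes)       # numeric dtypes are preserved
```

The column order is the same in both results, but only `sort_index(axis=1)` keeps the numeric columns numeric.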

## Skill Challenge
1. Sort the `players` DataFrame by age in ascending order. Who is the youngest footballer in the EPL?
2. Set the *club* column as the index of the DataFrame. Then sort the DataFrame index in alphabetical order. Make sure these changes are applied to the underlying DataFrame and carry over to the next question.
3. Sort the DataFrame values by `club` and `market_value`, where the club is alphabetical (Arsenal first) and the market value is in descending order within each team, with the most valuable players first.

For Part 1, we can simply use the `sort_values()` method and pass "age" to the `by` parameter. Optionally you can set `ascending` to `True`, but that is the behavior by default.

In [110]:
players.sort_values(by = 'age')

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
53,Ben Woodburn,Liverpool,17,LW,1,1.50,1241,4.5,0.10%,5,1,Wales,0,1,10,1,0
217,Jonathan Leko,West+Brom,18,RW,1,1.50,169,4.5,0.20%,12,1,England,0,1,19,0,0
434,Trent Alexander-Arnold,Liverpool,18,RB,3,1.50,327,4.5,0.30%,15,2,England,0,1,10,1,0
229,Josh Tymon,Stoke+City,18,LB,3,1.00,120,4.5,0.10%,9,1,England,0,1,15,0,0
45,Axel Tuanzebe,Manchester+United,19,CB,3,1.00,279,4.0,1.70%,14,1,England,0,1,12,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142,Gareth Barry,Everton,36,DM,2,1.50,1331,4.5,1.70%,68,1,England,0,6,7,0,0
90,Damien Delaney,Crystal+Palace,36,CB,3,1.00,195,4.5,0.60%,51,2,Ireland,0,6,6,0,0
38,Artur Boruc,Bournemouth,37,GK,4,1.00,436,4.5,6.90%,120,2,Poland,0,6,2,0,0
143,Gareth McAuley,West+Brom,37,CB,3,1.00,458,5.0,11.80%,131,2,Northern Ireland,0,6,19,0,0


The youngest player in the EPL is Ben Woodburn. We could explicitly grab that value by using the `iloc[]` indexer.

In [111]:
players.sort_values(by = 'age').iloc[0,0]

'Ben Woodburn'

Another way to do this is to use the `idxmin()` method on the age column, which finds the smallest age and returns the index label of that row. Passing that label to `loc[]` then gives us all of the columns for that player. (With the default integer index, `iloc[]` would return the same row, because labels and positions coincide.)

In [112]:
players.loc[players.age.idxmin()]

name            Ben Woodburn
club               Liverpool
age                       17
position                  LW
position_cat               1
market_value             1.5
page_views              1241
fpl_value                4.5
fpl_sel                0.10%
fpl_points                 5
region                     1
nationality            Wales
new_foreign                0
age_cat                    1
club_id                   10
big_club                   1
new_signing                0
Name: 53, dtype: object
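
Because `idxmin()` returns an index label rather than a position, labels and positions only line up on a default integer index. A minimal sketch with a hypothetical toy DataFrame whose labels are deliberately not 0, 1, 2:

```python
import pandas as pd

# Hypothetical toy data with non-default index labels.
toy = pd.DataFrame(
    {'name': ['Gareth Barry', 'Ben Woodburn', 'Josh Tymon'],
     'age': [36, 17, 18]},
    index=[10, 20, 30],
)

label = toy['age'].idxmin()                  # index label of the youngest player
youngest = toy.loc[label]                    # label-based lookup

position = toy['age'].to_numpy().argmin()    # integer position of the youngest player
same_row = toy.iloc[position]                # position-based lookup

print(label, position)
```

Here `toy.iloc[label]` would fail, since there is no row at position 20.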

For Part 2, we will use `set_index()` to set `club` as the index. Then we will sort that index in alphabetical order. We can do this all in one go!

In [113]:
players = players.set_index('club').sort_index()

In [114]:
players

Unnamed: 0_level_0,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
club,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Arsenal,David Ospina,28,GK,4,7.0,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
Arsenal,Alexandre Lacazette,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
Arsenal,Alexis Sanchez,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
Arsenal,Laurent Koscielny,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
Arsenal,Mesut Ozil,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West+Ham,Mark Noble,30,CM,2,7.0,425,5.5,0.10%,71,1,England,0,4,20,0,0
West+Ham,Michail Antonio,27,RW,1,18.0,1142,7.5,0.50%,132,1,England,0,3,20,0,0
West+Ham,Robert Snodgrass,29,RW,1,8.0,1210,6.0,6.50%,133,2,Scotland,0,4,20,0,0
West+Ham,Ashley Fletcher,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


For Part 3, we will sort this club-indexed DataFrame by both club and market value. We can achieve the differential sorting for *club* and *market_value* using one call of `sort_values()` by passing a list of booleans to the `ascending` parameter.

Remember, we want the clubs in alphabetical ascending order and the market values in descending order.

In [115]:
players.sort_values(by = ['club', 'market_value'], ascending = [True, False])

Unnamed: 0_level_0,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
club,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Arsenal,Alexis Sanchez,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
Arsenal,Mesut Ozil,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
Arsenal,Alexandre Lacazette,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
West+Ham,Edimilson Fernandes,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
West+Ham,Sam Byram,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0
West+Ham,Darren Randolph,30,GK,4,2.5,459,4.5,0.40%,69,2,Ireland,0,4,20,0,0
West+Ham,James Collins,33,CB,3,2.0,187,4.5,0.90%,69,2,Wales,0,5,20,0,0
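
The same idea on a hypothetical four-row toy DataFrame, small enough to check by eye: the i-th boolean in `ascending` applies to the i-th label in `by`.

```python
import pandas as pd

# Hypothetical toy data with two clubs and two players each.
toy = pd.DataFrame({
    'club': ['West+Ham', 'Arsenal', 'Arsenal', 'West+Ham'],
    'name': ['Sam Byram', 'Mesut Ozil', 'Alexis Sanchez', 'Mark Noble'],
    'market_value': [4.5, 50.0, 65.0, 7.0],
})

# club ascending (A to Z), market_value descending within each club.
ordered = toy.sort_values(by=['club', 'market_value'], ascending=[True, False])
print(ordered['name'].tolist())
```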


Let's reset the index so that `club` goes back to being a regular column and the DataFrame gets a default integer index again.

In [116]:
players.reset_index(inplace=True)

In [117]:
players.head()

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Arsenal,David Ospina,28,GK,4,7.0,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
1,Arsenal,Alexandre Lacazette,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
2,Arsenal,Alexis Sanchez,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
3,Arsenal,Laurent Koscielny,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
4,Arsenal,Mesut Ozil,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0


## Identifying Duplicates: The `duplicated()` Method

In the real world, the data you work with is rarely perfect, and sometimes it is not even good enough to use as-is. You'll spend a lot of time manipulating and cleaning it.

Case in point: our *players* DataFrame contains several players whose rows appear more than once with identical values. This can adversely impact your analysis, for example if you tried to calculate statistics based on these entries.

The easiest way to identify duplicate rows in Pandas is to use the `duplicated()` method. It generates a boolean Series that tells you whether each row is a duplicate (True) or not (False).
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html

In [118]:
players.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
460    False
461    False
462    False
463    False
464    False
Length: 465, dtype: bool

In [119]:
players.loc[players.duplicated()]

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
14,Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
17,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0
24,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0


But what really constitutes a duplicate? When used with default parameters, a row is flagged as a duplicate if and only if its values across all columns exactly match those of an earlier row.

But what if we have nearly duplicate entries that are only slightly different? We can loosen the constraints by using the `subset` parameter to choose which columns to scrutinize for duplicates. Columns that are not included in the subset are ignored in the comparison.

In this example, we'll look for duplicates only within the club, age, position, and market_value columns. With this condition, the duplicate search becomes more "lax", allowing more players to be considered duplicates because we're comparing along fewer columns.

In [120]:
players.loc[players.duplicated(subset = ['club','age','position','market_value'])]

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
14,Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
17,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0
24,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0
65,Brighton+and+Hove,Shane Duffy,25,CB,3,5.0,243,4.5,0.60%,0,2,Ireland,0,3,3,0,0
254,Manchester+City,Fernandinho,32,DM,2,18.0,595,5.0,0.80%,78,3,Brazil,0,5,11,1,0
266,Manchester+United,Marcos Rojo,27,CB,3,18.0,1063,5.5,0.10%,77,3,Argentina,0,3,12,1,0
301,Newcastle+United,Lascelles,27,CB,3,5.0,400,4.5,3.60%,0,1,England,0,3,13,0,0
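
The effect of `subset` can be sketched on a hypothetical three-row toy DataFrame, where two different players happen to share a club, age, and position:

```python
import pandas as pd

# Hypothetical toy data: two distinct players who match on several columns.
toy = pd.DataFrame({
    'club': ['Manchester+City', 'Manchester+City', 'Arsenal'],
    'name': ['Fernando', 'Fernandinho', 'Mesut Ozil'],
    'age': [32, 32, 28],
    'position': ['DM', 'DM', 'AM'],
})

strict = toy.duplicated()                                     # compares all columns
lax = toy.duplicated(subset=['club', 'age', 'position'])      # ignores the name column

print(strict.tolist())
print(lax.tolist())
```

With all columns compared, nobody is flagged. Once `name` is left out of the comparison, the second row counts as a duplicate of the first.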


So, which entries are the original, and which are the duplicates? Most often (and as is the default in Pandas), the first occurrence is treated as the original, and the ones that follow are duplicates. However, this is somewhat arbitrary.

Using the `keep` parameter allows us to determine which occurrence will be considered the "original", with all other instances being flagged as duplicates.
* `'first'` treats the first instance as the original, with the others considered duplicates
* `'last'` treats the last instance as the original, with the others considered duplicates
* `False` (the boolean, not a string) flags ALL occurrences as True (i.e. duplicates). That means no entry will be considered an original.

In [121]:
players.loc[players.duplicated(subset = ['club','age','position','market_value'], keep='first')]

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
14,Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
17,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0
24,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0
65,Brighton+and+Hove,Shane Duffy,25,CB,3,5.0,243,4.5,0.60%,0,2,Ireland,0,3,3,0,0
254,Manchester+City,Fernandinho,32,DM,2,18.0,595,5.0,0.80%,78,3,Brazil,0,5,11,1,0
266,Manchester+United,Marcos Rojo,27,CB,3,18.0,1063,5.5,0.10%,77,3,Argentina,0,3,12,1,0
301,Newcastle+United,Lascelles,27,CB,3,5.0,400,4.5,3.60%,0,1,England,0,3,13,0,0


In [122]:
players.loc[players.duplicated(subset = ['club','age','position','market_value'], keep='last')]

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
6,Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
16,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0
17,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0
63,Brighton+and+Hove,Lewis Dunk,25,CB,3,5.0,140,4.5,4.10%,0,1,England,0,3,3,0,0
251,Manchester+City,Fernando,32,DM,2,18.0,338,4.5,0.40%,18,3,Brazil,0,5,11,1,0
265,Manchester+United,Chris Smalling,27,CB,3,18.0,834,5.5,1.30%,52,1,England,0,3,12,1,0
295,Newcastle+United,Ciaran Clark,27,CB,3,5.0,273,4.5,0.90%,0,2,Ireland,0,3,13,0,0


In [123]:
players.loc[players.duplicated(subset = ['club','age','position','market_value'], keep=False)]

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
6,Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
14,Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
16,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0
17,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0
24,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0
63,Brighton+and+Hove,Lewis Dunk,25,CB,3,5.0,140,4.5,4.10%,0,1,England,0,3,3,0,0
65,Brighton+and+Hove,Shane Duffy,25,CB,3,5.0,243,4.5,0.60%,0,2,Ireland,0,3,3,0,0
251,Manchester+City,Fernando,32,DM,2,18.0,338,4.5,0.40%,18,3,Brazil,0,5,11,1,0
254,Manchester+City,Fernandinho,32,DM,2,18.0,595,5.0,0.80%,78,3,Brazil,0,5,11,1,0
265,Manchester+United,Chris Smalling,27,CB,3,18.0,834,5.5,1.30%,52,1,England,0,3,12,1,0


## Removing Duplicates: The drop_duplicates() Method

The existence of duplicates does not necessarily mean that the duplicated data is wrong or irrelevant. In our case, however, it does: there is no good reason for a player to appear multiple times in this dataset. In fact, repeated records may lead us to calculate some statistics incorrectly.

In [124]:
players[players.duplicated()]

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
14,Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
17,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0
24,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0


Case in point: both Granit Xhaka and Alex Oxlade-Chamberlain have been identified as duplicates, and both have market values that are far above the mean. Counting them more than once biases the average upward.

In [125]:
players.market_value.mean()

11.125649350649349

To remedy this, we have to somehow exclude the duplicate values. 

One way to do this is to use the `drop_duplicates()` method, which returns a copy of the DataFrame with the duplicate rows removed.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

In [126]:
players.drop_duplicates(keep = 'first')

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Arsenal,David Ospina,28,GK,4,7.0,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
1,Arsenal,Alexandre Lacazette,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
2,Arsenal,Alexis Sanchez,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
3,Arsenal,Laurent Koscielny,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
4,Arsenal,Mesut Ozil,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,West+Ham,Mark Noble,30,CM,2,7.0,425,5.5,0.10%,71,1,England,0,4,20,0,0
461,West+Ham,Michail Antonio,27,RW,1,18.0,1142,7.5,0.50%,132,1,England,0,3,20,0,0
462,West+Ham,Robert Snodgrass,29,RW,1,8.0,1210,6.0,6.50%,133,2,Scotland,0,4,20,0,0
463,West+Ham,Ashley Fletcher,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


Now let's recalculate the mean based on this de-duplicated DataFrame.

In [127]:
players.drop_duplicates(keep = 'first').market_value.mean()

11.026252723311545

In this case the change is noticeable but small (from about 11.13 to about 11.03). It is still important to check your data for duplicates and, where they turn out to be legitimate, to understand why they are there.

There are times when duplicates will have meaning, such as a database of chess moves during a match or census data.
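
To make the effect on the mean concrete, here is a minimal sketch on a small, made-up dataframe (the `toy` data and its values are hypothetical, not taken from the soccer dataset). It also shows the `subset` and `keep` parameters, which work the same way as they do in `duplicated()`.

```python
import pandas as pd

# Hypothetical toy data: "Xhaka" appears twice with identical values
toy = pd.DataFrame({
    "name": ["Ospina", "Xhaka", "Xhaka", "Noble"],
    "club": ["Arsenal", "Arsenal", "Arsenal", "West+Ham"],
    "market_value": [7.0, 35.0, 35.0, 7.0],
})

# Mean with the duplicate row counted twice
mean_with_dupes = toy["market_value"].mean()

# drop_duplicates() returns a new DataFrame; toy itself is unchanged
deduped = toy.drop_duplicates(keep="first")
mean_deduped = deduped["market_value"].mean()

# subset= restricts the comparison to the listed columns only;
# keep="last" retains the last row of each group of duplicates
one_per_club = toy.drop_duplicates(subset=["club"], keep="last")

print(mean_with_dupes, mean_deduped)
print(one_per_club)
```

Because the repeated row carries a high market value, the mean computed on `deduped` comes out lower than the mean computed on `toy`.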

## Removing DataFrame Rows - the `drop()` Method

We previously saw how to remove duplicated rows in one go using the `drop_duplicates()` method. Another approach is to identify the records that we want to remove and then exclude some or all of them separately.

Consider our **players** DataFrame with the records that are complete duplicates:

In [128]:
players[players.duplicated()]

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
14,Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
17,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0
24,Arsenal,Alex Oxlade-Chamberlain,23,RM,2,22.0,1519,6.0,1.80%,83,1,England,0,2,1,1,0


Suppose we wanted to drop only the row with index label 17, but keep the other two. Using `drop_duplicates()` would not help here because it would simply drop all of these duplicates.

Instead, we can use the more generic `drop()` method. This method can be used to drop specified rows or columns by specifying the axis. 
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
* By default, the method creates a copy of the underlying DataFrame. You can use the `inplace` parameter to alter the DataFrame directly.



In [129]:
players.drop(labels = 17, axis = 0)

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Arsenal,David Ospina,28,GK,4,7.0,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
1,Arsenal,Alexandre Lacazette,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
2,Arsenal,Alexis Sanchez,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
3,Arsenal,Laurent Koscielny,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
4,Arsenal,Mesut Ozil,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,West+Ham,Mark Noble,30,CM,2,7.0,425,5.5,0.10%,71,1,England,0,4,20,0,0
461,West+Ham,Michail Antonio,27,RW,1,18.0,1142,7.5,0.50%,132,1,England,0,3,20,0,0
462,West+Ham,Robert Snodgrass,29,RW,1,8.0,1210,6.0,6.50%,133,2,Scotland,0,4,20,0,0
463,West+Ham,Ashley Fletcher,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


What gets returned is a copy of the DataFrame with index 17 removed.

Another way to produce the exact same result is to use the `index` parameter directly. In this case, we won't need to specify the axis.


In [130]:
players.drop(index = 17).head(20)

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Arsenal,David Ospina,28,GK,4,7.0,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
1,Arsenal,Alexandre Lacazette,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
2,Arsenal,Alexis Sanchez,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
3,Arsenal,Laurent Koscielny,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
4,Arsenal,Mesut Ozil,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
5,Arsenal,Santi Cazorla,32,CM,2,12.0,943,7.0,0.10%,38,2,Spain,0,5,1,1,0
6,Arsenal,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
7,Arsenal,Lucas Perez,28,CF,1,15.0,2055,7.5,0.10%,20,2,Spain,0,4,1,1,1
8,Arsenal,Kieran Gibbs,27,LB,3,10.0,489,5.0,0.50%,45,1,England,0,3,1,1,0
9,Arsenal,Granit Xhaka,24,,2,,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0


You can drop multiple rows or columns by passing in a Python list of labels.

In [131]:
players.drop(index = [19, 20, 21, 231, 10])

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Arsenal,David Ospina,28,GK,4,7.0,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
1,Arsenal,Alexandre Lacazette,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
2,Arsenal,Alexis Sanchez,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
3,Arsenal,Laurent Koscielny,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
4,Arsenal,Mesut Ozil,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,West+Ham,Mark Noble,30,CM,2,7.0,425,5.5,0.10%,71,1,England,0,4,20,0,0
461,West+Ham,Michail Antonio,27,RW,1,18.0,1142,7.5,0.50%,132,1,England,0,3,20,0,0
462,West+Ham,Robert Snodgrass,29,RW,1,8.0,1210,6.0,6.50%,133,2,Scotland,0,4,20,0,0
463,West+Ham,Ashley Fletcher,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


In [132]:
players.shape

(465, 17)
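
The unchanged shape above reflects that `drop()` works on a copy. A minimal sketch on a hypothetical toy dataframe shows the same thing, along with the `errors` parameter, which controls what happens when a label is not found:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"name": ["A", "B", "C", "D"], "age": [28, 26, 31, 24]},
    index=[0, 1, 2, 3],
)

dropped = toy.drop(index=[1, 3])

print(toy.shape)      # original keeps all four rows
print(dropped.shape)  # the copy has two rows fewer

# Dropping a label that does not exist raises a KeyError by default
try:
    toy.drop(index=99)
except KeyError as err:
    print("KeyError:", err)

# errors="ignore" skips labels that are not present
same_rows = toy.drop(index=[1, 99], errors="ignore")
print(list(same_rows.index))
```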

## Removing Columns Using `drop()`

Removing columns using the `drop()` method works similarly to removing rows. You've got two options:
* Set the drop `axis` parameter to `1` and use the `labels` parameter as before, only this time identifying the names of the columns that you want to drop.
* Use the `columns` parameter and pass in the name or list of names of columns you want to drop

In [133]:
players.head()

Unnamed: 0,club,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Arsenal,David Ospina,28,GK,4,7.0,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
1,Arsenal,Alexandre Lacazette,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
2,Arsenal,Alexis Sanchez,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
3,Arsenal,Laurent Koscielny,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
4,Arsenal,Mesut Ozil,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0


In [134]:
players.drop(labels = ["age","market_value"], axis = 1).head()

Unnamed: 0,club,name,position,position_cat,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Arsenal,David Ospina,GK,4,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
1,Arsenal,Alexandre Lacazette,CF,1,1183,10.5,26.50%,0,2,France,1,3,1,1,0
2,Arsenal,Alexis Sanchez,LW,1,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
3,Arsenal,Laurent Koscielny,CB,3,912,6.0,0.70%,121,2,France,0,4,1,1,0
4,Arsenal,Mesut Ozil,AM,1,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0


In [135]:
players.drop(columns = ["age", "market_value"]).head()

Unnamed: 0,club,name,position,position_cat,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Arsenal,David Ospina,GK,4,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
1,Arsenal,Alexandre Lacazette,CF,1,1183,10.5,26.50%,0,2,France,1,3,1,1,0
2,Arsenal,Alexis Sanchez,LW,1,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
3,Arsenal,Laurent Koscielny,CB,3,912,6.0,0.70%,121,2,France,0,4,1,1,0
4,Arsenal,Mesut Ozil,AM,1,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0


## The `pop()` Method

The `pop()` method is used to remove columns one at a time by simply passing in the name of the column that we want to remove. 
* The popped column is returned by the `pop()` method as a Series
* This method operates in place, modifying the underlying DataFrame. So be careful when using it!
* You can only remove one column at a time using `pop()`. If you need to remove multiple columns, `drop()` is more convenient
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html

In [136]:
players2 = players

In [137]:
players2.pop('club')

0       Arsenal
1       Arsenal
2       Arsenal
3       Arsenal
4       Arsenal
         ...   
460    West+Ham
461    West+Ham
462    West+Ham
463    West+Ham
464    West+Ham
Name: club, Length: 465, dtype: object
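
One thing to watch for: an assignment such as `players2 = players` does not copy the dataframe, it only gives the same object a second name. The sketch below uses a hypothetical toy dataframe to show the difference between an alias and a real copy made with `copy()`.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"club": ["Arsenal", "West+Ham"], "age": [28, 30]})

alias = toy               # same object, NOT a copy
independent = toy.copy()  # separate DataFrame with its own data

# pop() removes "club" in place and returns it as a Series
popped = alias.pop("club")

print(alias is toy)                   # both names refer to one DataFrame
print("club" in toy.columns)          # the original lost the column too
print("club" in independent.columns)  # the copy is unaffected
```

If the original dataframe needs to stay intact, calling `pop()` on a copy is the safer choice.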

## Dropping Rows and Columns with `reindex()`

This is a less popular alternative to "dropping" rows and columns.

Recall that the `reindex()` method crafts a new dataframe that only reflects the labels that we specify.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

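If we call `reindex()` on the `players` dataframe with no arguments, we'll get a copy of the dataframe back. (The `club` column is missing from the output because `players2` was only another name for `players`, so the earlier `pop()` call removed it from `players` as well.)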

In [138]:
players.reindex()

Unnamed: 0,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,David Ospina,28,GK,4,7.0,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
1,Alexandre Lacazette,26,CF,1,40.0,1183,10.5,26.50%,0,2,France,1,3,1,1,0
2,Alexis Sanchez,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
3,Laurent Koscielny,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
4,Mesut Ozil,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,Mark Noble,30,CM,2,7.0,425,5.5,0.10%,71,1,England,0,4,20,0,0
461,Michail Antonio,27,RW,1,18.0,1142,7.5,0.50%,132,1,England,0,3,20,0,0
462,Robert Snodgrass,29,RW,1,8.0,1210,6.0,6.50%,133,2,Scotland,0,4,20,0,0
463,Ashley Fletcher,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


If we reindex with specific indices specified, we'll get just those indices (rows) back.

In [139]:
players.reindex(index = [0, 3, 9])

Unnamed: 0,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,David Ospina,28,GK,4,7.0,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
3,Laurent Koscielny,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
9,Granit Xhaka,24,,2,,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0


We can also identify the indices that we don't want, then use that list to exclude them. We will do this by building a new collection of labels that consists of all of the indices from `players` *except* those in `unwanted_rows`.
* This is known as taking a **set difference**. The cells below do it by converting the index to a Python `set` and calling its `difference()` method; pandas also provides an equivalent `Index.difference()` method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.difference.html?highlight=difference#pandas.Index.difference


In [140]:
unwanted_rows = [1, 2, 3, 4]

In [141]:
players.reindex(index = set(players.index).difference(unwanted_rows))

Unnamed: 0,name,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,David Ospina,28,GK,4,7.0,544,5.0,0.20%,2,3,Colombia,0,4,1,1,0
5,Santi Cazorla,32,CM,2,12.0,943,7.0,0.10%,38,2,Spain,0,5,1,1,0
6,Granit Xhaka,24,DM,2,35.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
7,Lucas Perez,28,CF,1,15.0,2055,7.5,0.10%,20,2,Spain,0,4,1,1,1
8,Kieran Gibbs,27,LB,3,10.0,489,5.0,0.50%,45,1,England,0,3,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,Mark Noble,30,CM,2,7.0,425,5.5,0.10%,71,1,England,0,4,20,0,0
461,Michail Antonio,27,RW,1,18.0,1142,7.5,0.50%,132,1,England,0,3,20,0,0
462,Robert Snodgrass,29,RW,1,8.0,1210,6.0,6.50%,133,2,Scotland,0,4,20,0,0
463,Ashley Fletcher,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


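The same thing can be done with columns. Because a Python `set` is unordered, the columns in the result may come back in an arbitrary order.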

In [142]:
unwanted_columns = ['name','position','position_cat']

In [143]:
players.reindex(
    index = set(players.index).difference(unwanted_rows),
    columns = set(players.columns).difference(unwanted_columns)
)

Unnamed: 0,page_views,new_signing,age,new_foreign,age_cat,club_id,region,big_club,fpl_sel,market_value,nationality,fpl_value,fpl_points
0,544,0,28,0,4,1,3,1,0.20%,7.0,Colombia,5.0,2
5,943,0,32,0,5,1,2,1,0.10%,12.0,Spain,7.0,38
6,1815,0,24,0,2,1,2,1,2.00%,35.0,Switzerland,5.5,85
7,2055,1,28,0,4,1,2,1,0.10%,15.0,Spain,7.5,20
8,489,0,27,0,3,1,1,1,0.50%,10.0,England,5.0,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,425,0,30,0,4,20,1,0,0.10%,7.0,England,5.5,71
461,1142,0,27,0,3,20,1,0,0.50%,18.0,England,7.5,132
462,1210,0,29,0,4,20,2,0,6.50%,8.0,Scotland,6.0,133
463,412,1,21,0,1,20,1,0,5.90%,1.0,England,4.5,16
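
The set-based selections above can also be written with the `Index.difference()` method from the linked documentation. This is a minimal sketch on a hypothetical toy dataframe; by default the method attempts to sort the resulting labels, so the column order is predictable.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"name": ["A", "B", "C", "D"], "club_id": [1, 1, 2, 2], "age": [28, 26, 31, 24]}
)

unwanted_rows = [1, 2]
unwanted_columns = ["name"]

# Index.difference() returns a new Index holding the labels not in the argument
keep_rows = toy.index.difference(unwanted_rows)
keep_cols = toy.columns.difference(unwanted_columns)

result = toy.reindex(index=keep_rows, columns=keep_cols)
print(result)
```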


## Null Values in DataFrames

Let's start by reading the `players` dataframe in fresh.

In [144]:
players = pd.read_csv('https://andybek.com/pandas-soccer')

In this lecture we'll discuss Null markers, or `NaN`. We've already seen this in Series. Let's refresh our memory with Pandas series. We'll start by taking a column from the `players` dataframe as a series.

In [145]:
ages = players.age

In [146]:
type(players.age)

pandas.core.series.Series

To check for null values, we can simply chain the `isna()` method to it. What is returned is a series of booleans that indicates whether each entry in the series is an N/A value.

In [147]:
ages.isna()

0      False
1      False
2      False
3      False
4      False
       ...  
460    False
461    False
462    False
463    False
464    False
Name: age, Length: 465, dtype: bool

Recall that this boolean series is useful because we can use it as an input for an indexing function, like so:

In [148]:
ages.loc[ages.isna()]

Series([], Name: age, dtype: int64)

Turns out there are no N/A values in our `ages` series, so we got an empty series back.

Now, what if we want to do this for the *entire* dataframe? It turns out that the `isna()` method also works for dataframes. What we'll get back is a new dataframe consisting entirely of booleans, which indicates whether each value in the dataframe is N/A or not.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html

In [149]:
players.isna()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
461,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
462,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
463,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


As we can see, the output is not attractive at all. But it *is* useful.

We can quickly count the number of N/A values in the dataframe by using the numpy `count_nonzero()` function.
* https://numpy.org/doc/stable/reference/generated/numpy.count_nonzero.html

In [150]:
np.count_nonzero(players.isna())

4

Thus, there are a total of 4 N/A values in our dataframe. Great! But what we *really* want to know is which records those are. How do we do that?

Can we pass `players.isna()` to an indexer to select the values that are NA, as we did with series?



In [151]:
players[players.isna()]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,,,,,,,,,,,,,,,,,
461,,,,,,,,,,,,,,,,,
462,,,,,,,,,,,,,,,,,
463,,,,,,,,,,,,,,,,,


As you can see, the output is not what we want. Indexing with a boolean dataframe keeps the original shape and shows a value only where the mask is `True`; since those positions are exactly the missing values, everything displays as empty. Instead we first need to extract the boolean values into an array structure.

In [152]:
players.isna().values

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

We can now use this array structure to select the dataframe rows that contain Null values!

In [153]:
players[players.isna().values]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
30,Granit Xhaka,Arsenal,24,,2,,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
30,Granit Xhaka,Arsenal,24,,2,,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
192,Steve Mounie,Huddersfield,22,CF,1,,56,6.0,0.60%,0,2,Benin,0,2,8,0,0
195,Kasper Schmeichel,Leicester+City,30,GK,4,,1601,5.0,2.40%,109,2,Denmark,0,4,9,0,0


Very nice. A few things to note about this:
* This shows us which records have one or more columns with a Null value. This does NOT tell us which columns have the Null values.
* In this particular example, there are only 3 unique players. Granit Xhaka (row 30) is listed twice because that row contains two Null values and is selected once for each of them.

We can get rid of the repeated entry by chaining on the `drop_duplicates()` method. 
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
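
As an alternative sketch (not the approach used in the cells above), a row mask built with `isna().any(axis=1)` selects each affected row exactly once, and `isna().sum()` answers the question of which columns hold the missing values. The `toy` dataframe below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "position": ["GK", np.nan, "CF"],
    "market_value": [7.0, np.nan, np.nan],
})

# True for every row that has at least one missing value
row_has_na = toy.isna().any(axis=1)
rows_with_na = toy[row_has_na]   # each matching row appears exactly once

# Number of missing values in each column
na_per_column = toy.isna().sum()

print(rows_with_na)
print(na_per_column)
```

Summing `na_per_column` gives the total number of missing values, which should agree with the `np.count_nonzero()` approach shown earlier.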

## Dropping and Filling DataFrame NAs: `fillna()` and `dropna()`

What do we do with null values in dataframes? One option is to replace them with something that is more meaningful. 

We can achieve this using the `fillna()` method. It works much like the method of the same name for series.
* In the simplest incarnation, we just specify the value that we want to replace our NAs with.
* By default, the method creates a new copy of the dataframe; the original is unaltered unless the `inplace` parameter is set to `True`
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

In [154]:
players.fillna('replacement value')

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20,2393,7.5,1.50%,122,1,England,0,4,1,1,0
4,Laurent Koscielny,Arsenal,31,CB,3,22,912,6.0,0.70%,121,2,France,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,Edimilson Fernandes,West+Ham,21,CM,2,5,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
461,Arthur Masuaku,West+Ham,23,LB,3,7,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
462,Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0
463,Ashley Fletcher,West+Ham,21,CF,1,1,412,4.5,5.90%,16,1,England,0,1,20,0,1


That wasn't too helpful because we have no idea where the replacements occurred in this huge dataframe. Thankfully we can find them using the `isna()` method and the selection brackets, like we did in the previous lecture.

In [155]:
players[players.isna().values]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
30,Granit Xhaka,Arsenal,24,,2,,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
30,Granit Xhaka,Arsenal,24,,2,,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
192,Steve Mounie,Huddersfield,22,CF,1,,56,6.0,0.60%,0,2,Benin,0,2,8,0,0
195,Kasper Schmeichel,Leicester+City,30,GK,4,,1601,5.0,2.40%,109,2,Denmark,0,4,9,0,0


We see the NA values were in rows 30, 192, and 195, within the "position" and "market_value" columns. Now let's use `fillna()` again and then use `loc` to see those replacements.

In [156]:
players.fillna('replacement value').loc[[30, 192, 195]]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
30,Granit Xhaka,Arsenal,24,replacement value,2,replacement value,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
192,Steve Mounie,Huddersfield,22,CF,1,replacement value,56,6.0,0.60%,0,2,Benin,0,2,8,0,0
195,Kasper Schmeichel,Leicester+City,30,GK,4,replacement value,1601,5.0,2.40%,109,2,Denmark,0,4,9,0,0


But what if we don't want to replace EVERY NA with the same value? Maybe we need to do something more meaningful.

We can do this using the **dictionary** syntax of the `fillna()` method. We pass in a dictionary with the column labels as keys and the replacement values as values.
* Note that in this example we're still limited to replacing every NA value within the specified column with a single value.

In [157]:
players.fillna({
    'market_value': 100,
    'position':'RM'
}).loc[[30, 192, 195]]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
30,Granit Xhaka,Arsenal,24,RM,2,100.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
192,Steve Mounie,Huddersfield,22,CF,1,100.0,56,6.0,0.60%,0,2,Benin,0,2,8,0,0
195,Kasper Schmeichel,Leicester+City,30,GK,4,100.0,1601,5.0,2.40%,109,2,Denmark,0,4,9,0,0


We could get even more dynamic here. For instance, we can fill missing market values with the mean market value of all players.

In [158]:
players.fillna({
    'market_value': players.market_value.mean(),
    'position':'RM'
}).loc[[30, 192, 195]]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
30,Granit Xhaka,Arsenal,24,RM,2,11.125649,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
192,Steve Mounie,Huddersfield,22,CF,1,11.125649,56,6.0,0.60%,0,2,Benin,0,2,8,0,0
195,Kasper Schmeichel,Leicester+City,30,GK,4,11.125649,1601,5.0,2.40%,109,2,Denmark,0,4,9,0,0
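
A small sketch on hypothetical toy data makes the mean-based fill easier to verify by hand. It relies on the fact that `Series.mean()` skips missing values by default, so the fill value is computed from the known entries only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "position": ["GK", np.nan, "CF"],
    "market_value": [10.0, np.nan, 20.0],
})

# Series.mean() skips NaN by default, so the fill value is (10 + 20) / 2
fill_value = toy["market_value"].mean()

# One replacement value per column; columns not listed are left as they are
filled = toy.fillna({"market_value": fill_value, "position": "unknown"})

print(fill_value)
print(filled)
print(toy["market_value"].isna().sum())  # the original still has its NaN
```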


Another option is to simply drop the records that contain NA values. The simplest way to do this is by using the `dropna()` method, which we've explored earlier.
* Keep in mind that by default this method drops any rows that contain one or more **NA** values.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html


In [159]:
# players.dropna().loc[[30, 192,195]]
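
The cell above is left commented out. A likely reason (this is our assumption, since the notebook does not say) is that `dropna()` removes rows 30, 192, and 195 entirely, so asking `.loc` for those labels would raise a `KeyError`. A minimal sketch on a hypothetical toy dataframe, which also shows the `how` and `subset` parameters:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the index labels are chosen only for illustration
toy = pd.DataFrame(
    {
        "position": ["GK", np.nan, "CF", np.nan],
        "market_value": [7.0, np.nan, np.nan, 5.0],
    },
    index=[29, 30, 192, 195],
)

cleaned = toy.dropna()          # drops every row with at least one NA
print(list(cleaned.index))      # only label 29 survives

# Selecting dropped labels with .loc raises a KeyError
try:
    cleaned.loc[[30, 192]]
except KeyError as err:
    print("KeyError:", err)

# how="all" drops a row only when every value in it is NA
rows_not_all_na = toy.dropna(how="all")
print(list(rows_not_all_na.index))

# subset= limits the NA check to the listed columns
rows_with_position = toy.dropna(subset=["position"])
print(list(rows_with_position.index))
```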

We can modify the behavior of the method by switching the `axis` to `1`. This will allow us to drop columns based on the same criteria.

In [160]:
players.dropna(axis = 1).loc[[30, 192,195]]

Unnamed: 0,name,club,age,position_cat,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
30,Granit Xhaka,Arsenal,24,2,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
192,Steve Mounie,Huddersfield,22,1,56,6.0,0.60%,0,2,Benin,0,2,8,0,0
195,Kasper Schmeichel,Leicester+City,30,4,1601,5.0,2.40%,109,2,Denmark,0,4,9,0,0


Notice how the `market_value` and `position` columns are now gone from the dataframe, since they had one or more NA values.

## Methods and Axes with `fillna()`

The `fillna()` method comes jam packed with a bunch of really cool parameters that we can use.

First off, in the previous lecture we used `fillna()` without modifying the underlying dataframe - everything was done on a copy. That means we still have our original dataframe.

In [161]:
players[players.isna().values].drop_duplicates()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
30,Granit Xhaka,Arsenal,24,,2,,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
192,Steve Mounie,Huddersfield,22,CF,1,,56,6.0,0.60%,0,2,Benin,0,2,8,0,0
195,Kasper Schmeichel,Leicester+City,30,GK,4,,1601,5.0,2.40%,109,2,Denmark,0,4,9,0,0


Previously, we used `fillna()` with strings or dictionaries to replace NAs with new values that we specify.
An alternative is to not provide any values at all, and instead use the `method` parameter.

In this example, the `ffill` method is used to "forward fill" the data, that is, to propagate the last valid observation forward into the missing entries that follow it.

In [162]:
players.fillna(method = 'ffill')

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1
462,Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0
463,Ashley Fletcher,West+Ham,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1


Let's isolate the labels that we know contain NAs.

In [163]:
players.fillna(method = 'ffill').loc[[30, 192, 195]]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
30,Granit Xhaka,Arsenal,24,LB,2,15.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
192,Steve Mounie,Huddersfield,22,CF,1,3.0,56,6.0,0.60%,0,2,Benin,0,2,8,0,0
195,Kasper Schmeichel,Leicester+City,30,GK,4,30.0,1601,5.0,2.40%,109,2,Denmark,0,4,9,0,0


The NAs have disappeared and they are replaced by what appear to be reasonable values. With the `ffill` keyword for the `method` parameter, Pandas takes the most recent valid value and propagates it into the entry that is missing.
* `pad` is an alias for `ffill` and can be used in its place

To confirm this, let's take a look at rows 29, 191, and 194, which should propagate forward to fill the NAs in rows 30, 192, and 195, respectively. Before the "ffill" we have:

In [164]:
players.loc[[29, 30, 191, 192, 194, 195]]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
29,Sead Kolasinac,Arsenal,24,LB,3,15.0,618,6.0,6.90%,0,2,Bosnia,1,2,1,1,0
30,Granit Xhaka,Arsenal,24,,2,,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
191,Laurent Depoitre,Huddersfield,28,CF,1,3.0,212,5.5,0.30%,0,2,Belgium,0,4,8,0,0
192,Steve Mounie,Huddersfield,22,CF,1,,56,6.0,0.60%,0,2,Benin,0,2,8,0,0
194,Riyad Mahrez,Leicester+City,26,RW,1,30.0,1753,8.5,1.70%,120,4,Algeria,0,3,9,0,0
195,Kasper Schmeichel,Leicester+City,30,GK,4,,1601,5.0,2.40%,109,2,Denmark,0,4,9,0,0


After using `ffill`, we have:

In [165]:
players.fillna(method = 'ffill').loc[[29, 30, 191, 192, 194, 195]]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
29,Sead Kolasinac,Arsenal,24,LB,3,15.0,618,6.0,6.90%,0,2,Bosnia,1,2,1,1,0
30,Granit Xhaka,Arsenal,24,LB,2,15.0,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
191,Laurent Depoitre,Huddersfield,28,CF,1,3.0,212,5.5,0.30%,0,2,Belgium,0,4,8,0,0
192,Steve Mounie,Huddersfield,22,CF,1,3.0,56,6.0,0.60%,0,2,Benin,0,2,8,0,0
194,Riyad Mahrez,Leicester+City,26,RW,1,30.0,1753,8.5,1.70%,120,4,Algeria,0,3,9,0,0
195,Kasper Schmeichel,Leicester+City,30,GK,4,30.0,1601,5.0,2.40%,109,2,Denmark,0,4,9,0,0


We did not set the `axis` parameter, so it took its default of 0, which refers to the rows. In other words, the filling occurs vertically, down each column, from the preceding row record. 
* Think of filling as going parallel to the axis that we identify.

What happens if we change `axis` to 1?

In [166]:
players.loc[[30, 192, 195]]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
30,Granit Xhaka,Arsenal,24,,2,,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
192,Steve Mounie,Huddersfield,22,CF,1,,56,6.0,0.60%,0,2,Benin,0,2,8,0,0
195,Kasper Schmeichel,Leicester+City,30,GK,4,,1601,5.0,2.40%,109,2,Denmark,0,4,9,0,0


In [167]:
players.fillna(method = 'ffill', axis = 1).loc[[30, 192, 195]]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
30,Granit Xhaka,Arsenal,24,24,2,2,1815,5.5,2.00%,85,2,Switzerland,0,2,1,1,0
192,Steve Mounie,Huddersfield,22,CF,1,1,56,6.0,0.60%,0,2,Benin,0,2,8,0,0
195,Kasper Schmeichel,Leicester+City,30,GK,4,4,1601,5.0,2.40%,109,2,Denmark,0,4,9,0,0


Here we see that the forward fill happened "horizontally" - that is, the most recent valid value from a column on the left is forward filled into the column containing the NA. 

Horizontal fill oftentimes does not make much sense (as in our example dataset), because columns typically describe completely different types of data.

The `method` parameter can also use `bfill` or "backfill", which is similar to `ffill` but works in the opposite direction.
* `backfill` is an alias for `bfill` and can be used in its place
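
As a rough illustration of fill direction, here is a sketch on a tiny made-up frame (the `toy` name and its values are assumptions, not part of the players dataset). It uses the dedicated `ffill()` and `bfill()` methods; as far as we are aware, recent pandas releases deprecate the `method` argument of `fillna()` in favor of these.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with a few missing values
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, 5.0, np.nan],
})

# Forward fill down the rows (axis=0 is the default)
filled_down = toy.ffill()

# Forward fill across the columns, left to right
filled_across = toy.ffill(axis=1)

# Backward fill: take the next valid value from the rows below
filled_back = toy.bfill()

print(filled_down)
print(filled_across)
print(filled_back)
```

Values with nothing valid "behind" them in the fill direction (for example, the first row of column `b` when forward filling) stay NA.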

## Skill Challenge


#### 1. From the *players* dataframe, remove the rows labeled 2, 10, 21 and the *market_value* column. Do not modify the underlying dataframe. Assign this new dataframe to the variable `df2`

Start by verifying the current size of the dataframe.

In [168]:
players.shape

(465, 17)

Drop the instructed rows and column. This can be done by chaining two `drop()` method calls together.

In [169]:
df2 = players.drop(index = [2, 10, 21]).drop(columns = 'market_value')

Note that we could have also just combined the index and columns drop calls into one `drop()` call. 
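
A minimal sketch of that single-call version, using a small made-up frame (the `toy` name and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical stand-in for the players dataframe
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'market_value': [65.0, 50.0, 7.0, 20.0],
    'age': [28, 28, 35, 28],
})

# Two chained drop() calls, as in the solution above
chained = toy.drop(index=[0, 2]).drop(columns='market_value')

# One drop() call that removes rows and columns together
combined = toy.drop(index=[0, 2], columns='market_value')

print(combined.shape)
print(toy.shape)  # the original frame is left untouched
```

Both versions return a new dataframe, so the original is not modified either way.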

Now let's confirm the reshaped dataframe.

In [170]:
df2.shape

(462, 16)

#### 2. Does the nationality column in *df2* contain any NA values? How many unique nationalities are there?

We can call the `isna()` method on the *nationality* column and chain on `value_counts()` to determine if there are any NA values in that column.

In [171]:
df2['nationality'].isna().value_counts()

False    462
Name: nationality, dtype: int64

Looks like there are no **NA** values in the *nationality* column, since `False` was returned for every entry in that column. We can count the number of unique nationalities using the `unique()` method to generate an array of unique country names, then use the `size` attribute to return the length of that array (i.e. the number of unique countries).

In [172]:
df2['nationality'].unique().size

61

Thus, there are 61 unique values in the *nationality* column.

An alternative approach would be to use `drop_duplicates()` on the *nationality* series and then tag on the `size` attribute.
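
The approaches can be compared side by side on a small made-up series (the values below are assumptions, not the real column). The sketch also includes `nunique()`, which counts unique values directly and ignores NA values by default.

```python
import pandas as pd

# Hypothetical nationality values with some repeats
nat = pd.Series(['Spain', 'France', 'Spain', 'Chile', 'France'])

# Any missing values at all?
has_na = nat.isna().any()

# Three ways to count the distinct values
via_unique = nat.unique().size
via_drop_duplicates = nat.drop_duplicates().size
via_nunique = nat.nunique()

print(has_na, via_unique, via_drop_duplicates, via_nunique)
```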

#### 3. Starting from *df2*, isolate a dataframe slice that contains only the unique age-position combinations for each club. Do not include the club column itself.

We start by using the `loc` selector to select on the *age*, *position*, and *club* columns (as well as all rows)

In [173]:
df2.loc[:, ['age','position','club']]

Unnamed: 0,age,position,club
0,28,LW,Arsenal
1,28,AM,Arsenal
3,28,RW,Arsenal
4,31,CB,Arsenal
5,22,RB,Arsenal
...,...,...,...
460,21,CM,West+Ham
461,23,LB,West+Ham
462,23,RB,West+Ham
463,21,CF,West+Ham


Remember that we want unique age-position combos within each club. To get this, we'll want to use the `drop_duplicates` method, which we can chain on to the selector.

In [174]:
df2.loc[:, ['age','position','club']].drop_duplicates()

Unnamed: 0,age,position,club
0,28,LW,Arsenal
1,28,AM,Arsenal
3,28,RW,Arsenal
4,31,CB,Arsenal
5,22,RB,Arsenal
...,...,...,...
460,21,CM,West+Ham
461,23,LB,West+Ham
462,23,RB,West+Ham
463,21,CF,West+Ham


Finally, we drop the "club" column. We are now left with 433 rows of unique age-position combinations for each club.

In [175]:
df2.loc[:, ['age','position','club']].drop_duplicates().drop(columns = 'club')

Unnamed: 0,age,position
0,28,LW
1,28,AM
3,28,RW
4,31,CB
5,22,RB
...,...,...
460,21,CM
461,23,LB
462,23,RB
463,21,CF


## Calculating Aggregates with `agg()`

One of the most powerful utilities in Pandas is the ability to group your values along multiple attributes.

The `agg()` method applies a function that you specify along an axis and collapses that axis into a single value per column (or per row).
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html

For example, let's try applying the function "mean" to our entire *players* dataframe.

In [176]:
players.agg('mean')

age              26.776344
position_cat      2.178495
market_value     11.125649
page_views      771.546237
fpl_value         5.450538
fpl_points       57.544086
region            1.989247
new_foreign       0.034409
age_cat           3.195699
club_id          10.253763
big_club          0.309677
new_signing       0.144086
dtype: float64

What we get back is a Series of the mean (average) values for each of the columns that contained numeric data. Let's confirm this by testing one of the columns:

In [177]:
players.new_signing.mean()

0.14408602150537633
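
One caveat, stated with some hedging: the output above silently skipped the string columns, but to the best of our knowledge newer pandas releases (2.0 and later) raise a `TypeError` instead when a "mean" is requested over string columns. A sketch that should behave the same across versions is to restrict the frame to numeric columns first. The `toy` frame below is a made-up stand-in.

```python
import pandas as pd

# Hypothetical frame mixing a string column with numeric ones
toy = pd.DataFrame({
    'name': ['Petr Cech', 'Theo Walcott', 'Sam Byram'],
    'age': [35, 28, 23],
    'fpl_value': [5.5, 7.5, 4.5],
})

# Option 1: keep only numeric columns, then aggregate
means_selected = toy.select_dtypes('number').agg('mean')

# Option 2: ask mean() itself to skip non-numeric columns
means_numeric_only = toy.mean(numeric_only=True)

print(means_selected)
print(means_numeric_only)
```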

The `agg()` method can use many other functions, such as "min":

In [178]:
players.agg('min')

name            Aaron Cresswell
club                    Arsenal
age                          17
position_cat                  1
market_value               0.05
page_views                    3
fpl_value                     4
fpl_sel                   0.00%
fpl_points                    0
region                        1
nationality             Algeria
new_foreign                   0
age_cat                       1
club_id                       1
big_club                      0
new_signing                   0
dtype: object

Note that the "min" function returned some strings as the "minimum" values. That is because you can technically apply comparison operators to strings, which works by comparing the code points of their characters, one position at a time.
* Each string character corresponds to an integer value, known as its Unicode code point. 

In [179]:
ord('a')

97

In [180]:
ord('b')

98

In [181]:
'a' < 'b'

True

In [182]:
ls = ['a', 'b', 'c', 'day']

In [183]:
min(ls)

'a'

You will run into a `TypeError` if you attempt to run a comparison between a string and a number (integer or float). This is because comparison operators are not supported between instances of 'float' and 'str'.

In [184]:
ls = ['a', 'b', 'c', 'day', 19.1]

In [185]:
# min(ls)
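
If we want to see the error without halting the notebook, one option is to wrap the call in a `try`/`except` block. This is only a sketch; the `mixed` and `strings_only` names are introduced here for illustration.

```python
# A list that mixes strings with a float
mixed = ['a', 'b', 'c', 'day', 19.1]

try:
    min(mixed)
    message = None
except TypeError as err:
    # Capture the error text instead of stopping execution
    message = str(err)

print(message)

# Filtering down to a single type makes the comparison valid again
strings_only = [value for value in mixed if isinstance(value, str)]
smallest = min(strings_only)
print(smallest)
```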

What if we don't want to include these string comparisons in our `agg()` method? We can exclude them by pre-filtering the columns by type. We can accomplish this using the `select_dtypes()` method, which returns a subset of the dataframe's columns based on the datatypes that you select.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
* In this example, we will select columns that contain numbers

In [186]:
players.select_dtypes(np.number).agg('min')

age             17.00
position_cat     1.00
market_value     0.05
page_views       3.00
fpl_value        4.00
fpl_points       0.00
region           1.00
new_foreign      0.00
age_cat          1.00
club_id          1.00
big_club         0.00
new_signing      0.00
dtype: float64

A final cool thing about `agg()` is that we can actually pass in a **list of functions** and it will work! This will return a dataframe of results (one row per function) instead of just a single series. This is pretty neat given the small amount of code that we wrote.

In [187]:
players.select_dtypes(np.number).agg(['min', 'max', 'mean'])

Unnamed: 0,age,position_cat,market_value,page_views,fpl_value,fpl_points,region,new_foreign,age_cat,club_id,big_club,new_signing
min,17.0,1.0,0.05,3.0,4.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
max,38.0,4.0,75.0,7664.0,12.5,264.0,4.0,1.0,6.0,20.0,1.0,1.0
mean,26.776344,2.178495,11.125649,771.546237,5.450538,57.544086,1.989247,0.034409,3.195699,10.253763,0.309677,0.144086


An important final note: `agg()` *reshapes* our data. That is, we are applying a function to our dataframe, and a new data structure with the aggregate output is returned.
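
To make the reshaping concrete, the sketch below compares shapes before and after aggregating a small made-up frame (`toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical numeric frame: 4 rows, 2 columns
toy = pd.DataFrame({
    'age': [28, 35, 22, 31],
    'fpl_value': [12.0, 5.5, 6.0, 6.0],
})

# A single function collapses the rows: one value per column (a Series)
single = toy.agg('min')

# A list of functions returns a dataframe with one row per function
several = toy.agg(['min', 'max', 'mean'])

print(toy.shape)      # shape of the original frame
print(single.shape)   # shape of the Series result
print(several.shape)  # shape of the dataframe result
```

In both cases the result no longer has the original row index, which is what distinguishes an aggregation from a same-shape transform.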

## Same-shape Transforms: the `transform()` method

Pandas has a built-in method called `transform()` that applies a function to the dataframe and returns a result with the same shape as the input. Unlike the `agg()` method, `transform()` does not reshape the dataframe; if the function would produce an aggregated result, `transform()` raises an error instead.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html

In [188]:
players.head(3)

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0


Suppose we want to convert *market_value* and *fpl_value* from US Dollars to Euros. In order to do this, we need a conversion rate. For this example, assume that 1 USD is equal to 0.91 Euro.

The first thing we will want to do is index for the two columns that we want to convert to Euros.

In [189]:
players.loc[:, ['market_value', 'fpl_value']]

Unnamed: 0,market_value,fpl_value
0,65.0,12.0
1,50.0,9.5
2,7.0,5.5
3,20.0,7.5
4,22.0,6.0
...,...,...
460,5.0,4.5
461,7.0,4.5
462,4.5,4.5
463,1.0,4.5


Next, let's use an inline lambda function to transform all of these values from USD to Euro. Think of *x* as a stand-in for each of the selected columns that will be acted on by the `transform()` method. The `transform()` method will actually assign each column to *x* and then execute the function on it.

In [190]:
players.loc[:, ['market_value', 'fpl_value']].transform(lambda x: x * 0.91)

Unnamed: 0,market_value,fpl_value
0,59.150,10.920
1,45.500,8.645
2,6.370,5.005
3,18.200,6.825
4,20.020,5.460
...,...,...
460,4.550,4.095
461,6.370,4.095
462,4.095,4.095
463,0.910,4.095


Note that with this simple example, we could have also just performed multiplication on the columns of interest. It produces the exact same result, and probably utilizes fewer resources.

In [191]:
players.loc[:, ['market_value', 'fpl_value']] * 0.91

Unnamed: 0,market_value,fpl_value
0,59.150,10.920
1,45.500,8.645
2,6.370,5.005
3,18.200,6.825
4,20.020,5.460
...,...,...
460,4.550,4.095
461,6.370,4.095
462,4.095,4.095
463,0.910,4.095
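
As a quick check of that claim, the sketch below runs both versions on a small made-up frame and compares them (`toy` and `usd_to_eur` are names introduced here for illustration).

```python
import pandas as pd

# Hypothetical subset of the two value columns
toy = pd.DataFrame({
    'market_value': [65.0, 50.0, 7.0],
    'fpl_value': [12.0, 9.5, 5.5],
})

usd_to_eur = 0.91  # assumed conversion rate from the example above

# Same-shape transform with a lambda
via_transform = toy.transform(lambda x: x * usd_to_eur)

# Plain multiplication on the frame
via_multiply = toy * usd_to_eur

print(via_transform.equals(via_multiply))
print(via_transform.shape == toy.shape)
```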


Let's try something more complicated. Suppose we want to apply a string transformation to all string columns, but we don't want to specify what the transformation will be. 

Before we start that, let's review the `choice()` function from the `random` module, which allows us to select a value at random from a sequence of values.

In [192]:
from random import choice

In [193]:
names = ['Bud', 'Brooke', "Paleo"]

In [194]:
for num in range(1, 10):
  print(num)
  print(choice(names))

1
Paleo
2
Paleo
3
Brooke
4
Bud
5
Bud
6
Paleo
7
Paleo
8
Paleo
9
Brooke


We also want to introduce the built-in Pandas string methods. In pure Python, we have string methods that allow us to do things like capitalize strings.

In [195]:
'Andy'.upper()

'ANDY'

However, if we want to do this with a Pandas series, we encounter errors because Series does not have such methods.

In [196]:
ser = pd.Series(['Bud',' Brooke', 'Paleo'])

In [197]:
# This will not work
# ser.upper()

To do this in a Pandas context, we simply add the **string accessor** `str` between the series and the method. We see that it now works!

In [198]:
ser.str.upper()

0        BUD
1     BROOKE
2      PALEO
dtype: object

This webpage has a nice list of many Python string methods that can be used on Pandas series: https://www.w3schools.com/python/python_ref_string.asp
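
Because each `str` method returns a new series, the calls can be chained. The sketch below reuses the series from above, whose second entry happens to carry a leading space, and cleans it up before capitalizing (the `cleaned` and `lengths` names are introduced here for illustration).

```python
import pandas as pd

ser = pd.Series(['Bud', ' Brooke', 'Paleo'])

# Character counts reveal the stray leading space in the second entry
lengths = ser.str.len()

# Strip surrounding whitespace, then uppercase
cleaned = ser.str.strip().str.upper()

print(lengths.tolist())
print(cleaned.tolist())
```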

Back to our string transformation. Our function should:
* apply a random string capitalization method
* apply this method to a sequence of values
* return the transformed sequence

In [199]:
# The function takes a single argument x, which will be a Series (one column of strings)
def random_case(x):
  # Since we want to pick a capitalization method at random, we first collect several candidates. We can do this by creating a list of bound method references.
  funcs = [x.str.swapcase, x.str.lower, x.str.title, x.str.upper]
  # Now let's randomly select one of these methods and invoke it in order to apply the transformation.
  return choice(funcs)()
  

Also note that we only want to apply this function to string columns. Thus, we should only include columns with *object* datatypes.

In [200]:
players.select_dtypes(include = object).transform(random_case)

Unnamed: 0,name,club,position,fpl_sel,nationality
0,Alexis Sanchez,ARSENAL,Lw,17.10%,cHILE
1,Mesut Ozil,ARSENAL,Am,5.60%,gERMANY
2,Petr Cech,ARSENAL,Gk,5.90%,cZECH rEPUBLIC
3,Theo Walcott,ARSENAL,Rw,1.50%,eNGLAND
4,Laurent Koscielny,ARSENAL,Cb,0.70%,fRANCE
...,...,...,...,...,...
460,Edimilson Fernandes,WEST+HAM,Cm,0.40%,sWITZERLAND
461,Arthur Masuaku,WEST+HAM,Lb,0.20%,cONGO dr
462,Sam Byram,WEST+HAM,Rb,0.30%,eNGLAND
463,Ashley Fletcher,WEST+HAM,Cf,5.90%,eNGLAND


As you can see, the columns are affected by different methods at random.
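
Since the method is chosen at random, rerunning the cell will usually give a different result. If repeatable output is wanted, one option is to seed Python's random number generator before each call. The sketch below uses a small made-up frame (`toy` is an assumption) and only checks that two seeded runs agree.

```python
from random import choice, seed

import pandas as pd


def random_case(x):
    # x is a Series of strings; pick one capitalization method at random
    funcs = [x.str.swapcase, x.str.lower, x.str.title, x.str.upper]
    return choice(funcs)()


# Hypothetical string-only frame
toy = pd.DataFrame({
    'name': ['Petr Cech', 'Sam Byram'],
    'club': ['Arsenal', 'West+Ham'],
})

seed(0)  # fix the random state
first = toy.transform(random_case)

seed(0)  # reset to the same state
second = toy.transform(random_case)

print(first.equals(second))
```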

## More Flexibility with the `apply()` method

We previously discussed `apply()` in the Series section, but we should explore how this method is applied in a two-dimensional context.

We would use `apply()` whenever we want to apply a given function to an entire row *or* column.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

In [201]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0


As a demonstration, suppose we want to modify all of our floating point columns and round them to the closest integer without actually converting them to integers. 

Let's start by defining a function that will perform the rounding. This function will accept a value (or rather, a column or row of values, such that **x** will be a series when the function executes) and then test whether it is of the `dtype` `np.float64`. If it is, it will return the rounded value of that float. If not, it will return the original value.

In [202]:
def round_floats(x):
  if x.dtype == np.float64:
    return round(x)

  return x

Now let's apply our function to the *players* dataframe.

In [203]:
players.apply(round_floats)

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,10.0,5.60%,167,2,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,6.0,5.90%,134,2,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,8.0,1.50%,122,1,England,0,4,1,1,0
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.0,0.40%,38,2,Switzerland,0,1,20,0,1
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.0,0.20%,34,4,Congo DR,0,2,20,0,1
462,Sam Byram,West+Ham,23,RB,3,4.0,198,4.0,0.30%,29,1,England,0,2,20,0,0
463,Ashley Fletcher,West+Ham,21,CF,1,1.0,412,4.0,5.90%,16,1,England,0,1,20,0,1


It worked, but heck if we can tell by just looking at it. Let's narrow down our view of the dataframe to just columns with floating point values.

In [204]:
players.select_dtypes(np.float64).head()

Unnamed: 0,market_value,fpl_value
0,65.0,12.0
1,50.0,9.5
2,7.0,5.5
3,20.0,7.5
4,22.0,6.0


Ah, now we see that only two of our columns actually contained floating point values. Applying our `round_floats` function to this subset will more clearly show the difference.

In [205]:
players.select_dtypes(np.float64).apply(round_floats).head()

Unnamed: 0,market_value,fpl_value
0,65.0,12.0
1,50.0,10.0
2,7.0,6.0
3,20.0,8.0
4,22.0,6.0


Notice that this is awfully similar to the `transform()` method that we discussed in the previous lecture. In fact, we can directly swap out `apply()` for `transform()` and it will work exactly the same way.

In [206]:
players.select_dtypes(np.float64).transform(round_floats).head()

Unnamed: 0,market_value,fpl_value
0,65.0,12.0
1,50.0,10.0
2,7.0,6.0
3,20.0,8.0
4,22.0,6.0


So then, why would we want to use `apply()` if we already have transform? The two functions are not aliases of each other, so they must have at least some difference.

The answer is that `apply()` has a bit more flexibility and a more general scope. It supports both aggregations and same shape transforms. Think of `apply()` as a combination of `transform()` and `agg()` in that a single function supports both types of operations.

Let's check out `apply()` as an aggregate function. Remember that we previously used `agg()` to apply a function to every column in the dataframe (or at least every column that could be acted on by the function):

In [207]:
players.agg('mean')

age              26.776344
position_cat      2.178495
market_value     11.125649
page_views      771.546237
fpl_value         5.450538
fpl_points       57.544086
region            1.989247
new_foreign       0.034409
age_cat           3.195699
club_id          10.253763
big_club          0.309677
new_signing       0.144086
dtype: float64

But we now know that `apply()` can handle aggregations. We'll get the same output!

In [208]:
players.apply('mean')

age              26.776344
position_cat      2.178495
market_value     11.125649
page_views      771.546237
fpl_value         5.450538
fpl_points       57.544086
region            1.989247
new_foreign       0.034409
age_cat           3.195699
club_id          10.253763
big_club          0.309677
new_signing       0.144086
dtype: float64

We would not be able to use `transform()` to perform aggregates. Simply put, transforms cannot produce aggregate results.
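
A small sketch of the difference, on a made-up numeric frame (`toy` is an assumption). We catch the error from `transform()` rather than let it stop the notebook; in the pandas versions we are familiar with it is a `ValueError`.

```python
import pandas as pd

# Hypothetical numeric frame
toy = pd.DataFrame({
    'age': [28, 35, 22],
    'fpl_value': [12.0, 5.5, 6.0],
})

# apply() as a same-shape transform
same_shape = toy.apply(lambda col: col * 2)

# apply() as an aggregation: one value per column
aggregated = toy.apply('mean')

# transform() refuses to aggregate
try:
    toy.transform('mean')
    transform_error = None
except ValueError as err:
    transform_error = err

print(same_shape.shape, aggregated.shape)
print(repr(transform_error))
```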

We can also control how we want the `apply()` method to be applied by utilizing the `axis` parameter. Previously, when we ran our `round_floats` function on the float-only columns using `apply()`, the choice of axis would not have changed the result, because the function rounds every value and returns output of the same shape as its input. However, when performing aggregation-like calculations, flipping the axes produces fundamentally different results.

In the following code, we are telling the method to aggregate along the index, or aggregate vertically. In other words, apply the function to each column.


In [209]:
players.apply('mean', axis = 0)

age              26.776344
position_cat      2.178495
market_value     11.125649
page_views      771.546237
fpl_value         5.450538
fpl_points       57.544086
region            1.989247
new_foreign       0.034409
age_cat           3.195699
club_id          10.253763
big_club          0.309677
new_signing       0.144086
dtype: float64

Let's try flipping the axis so that the aggregation occurs along the columns, or horizontally.

In [210]:
players.apply('mean', axis = 1)

0      392.333333
1      388.208333
2      143.708333
3      214.875000
4       91.916667
          ...    
460     31.875000
461     24.791667
462     23.750000
463     39.875000
464     24.708333
Length: 465, dtype: float64

This gives us a completely different result. In this case, we've calculated the averages over values that don't really make sense to aggregate, such as age and market value. But for the sake of illustration, let's go ahead and verify the calculation for one of the rows (players).

In [211]:
players.loc[460, :]

name            Edimilson Fernandes
club                       West+Ham
age                              21
position                         CM
position_cat                      2
market_value                      5
page_views                      288
fpl_value                       4.5
fpl_sel                       0.40%
fpl_points                       38
region                            2
nationality             Switzerland
new_foreign                       0
age_cat                           1
club_id                          20
big_club                          0
new_signing                       1
Name: 460, dtype: object

We can't just apply the `mean()` function here; we'll get a `TypeError` due to the mixed datatypes.

In [212]:
# Produces a TypeError
# players.loc[460, :].mean()

Instead, we can select only the columns containing numeric data that we can aggregate over to calculate the mean. Let's do this with a list comprehension, where we iterate over a list of all column dtypes and use a comparator to select only columns that are numeric.

In [213]:
for dtype in players.dtypes:
  print(dtype != object)

False
False
True
False
True
True
True
True
False
True
True
False
True
True
True
True
True


In [214]:
players.loc[460, [dtype != object for dtype in players.dtypes]]

age              21
position_cat      2
market_value      5
page_views      288
fpl_value       4.5
fpl_points       38
region            2
new_foreign       0
age_cat           1
club_id          20
big_club          0
new_signing       1
Name: 460, dtype: object

Now we have only the numeric values. Let's go ahead and calculate the mean, which we will see is the same as the value calculated above with the `apply()` function across the column axis.

In [215]:
players.loc[460, [dtype != object for dtype in players.dtypes]].mean()

31.875
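
An alternative to the list comprehension is the `select_dtypes()` method that we used earlier. The sketch below filters a small made-up frame (`toy` is an assumption) down to its numeric columns before averaging across each row. Filtering first is probably the safer habit in any case, since newer pandas releases may raise an error when asked to average rows that still contain strings.

```python
import pandas as pd

# Hypothetical frame mixing string and numeric columns
toy = pd.DataFrame({
    'name': ['Player A', 'Player B'],
    'age': [21, 23],
    'market_value': [5.0, 7.0],
    'page_views': [288, 199],
})

# Keep only the numeric columns
numeric = toy.select_dtypes('number')

# Aggregate horizontally: one mean per row
row_means = numeric.apply('mean', axis=1)

# Manual check for the first row
manual_first = (21 + 5.0 + 288) / 3

print(row_means)
print(manual_first)
```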

## Element-Wise Operations with `applymap()`

The methods that we discussed over the past few lectures (e.g. `agg()`, `transform()`, and `apply()`) pass an entire column or row to the function at once, which lets them take advantage of **vectorized operations**, a NumPy feature that allows for massive performance gains by applying a given operation on a set of values all at once.

This set or group of values is called a **vector**.

This is made possible by **single instruction, multiple data** or **SIMD**, which allows a processing unit (CPU or GPU) to perform a single operation on multiple datapoints in a single processing cycle or step:
* https://en.wikipedia.org/wiki/SIMD

However, not all functions can be vectorized - some functions only operate on individual values. In such cases, Pandas offers the `applymap()` method, which applies the function to a dataframe element-wise.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html
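
A version note, offered with some hedging: as far as we know, pandas 2.1 introduced `DataFrame.map()` as the new name for this element-wise method and deprecated `applymap()`. The sketch below picks whichever one is available and runs a function that only makes sense on single values. The `toy` frame, the `describe` function, and the threshold of 10 are all made up for illustration.

```python
import pandas as pd

# Hypothetical numeric frame
toy = pd.DataFrame({
    'market_value': [65.0, 7.0],
    'fpl_value': [12.0, 5.5],
})


def describe(value):
    # A plain if/else works on one number, not on a whole column
    return 'high' if value > 10 else 'low'


# Use DataFrame.map where it exists, otherwise fall back to applymap
elementwise = toy.map if hasattr(toy, 'map') else toy.applymap
labels = elementwise(describe)

print(labels)
```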

In [216]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0


Suppose we want to adjust the player market value and fantasy player value (fpl_value) for inflation, which is a general increase in the price of things. We'll start by setting the inflation value and then subsetting the dataframe for the columns of interest.

In [217]:
inflation = 1.02

In [218]:
mini_df = players.loc[:, ['market_value','fpl_value']]

One thing we can do is simply multiply our `mini_df` by the inflation variable. This is a highly efficient vectorized operation - no function needed.

In [219]:
mini_df * inflation

Unnamed: 0,market_value,fpl_value
0,66.30,12.24
1,51.00,9.69
2,7.14,5.61
3,20.40,7.65
4,22.44,6.12
...,...,...
460,5.10,4.59
461,7.14,4.59
462,4.59,4.59
463,1.02,4.59


Suppose we now want to be notified each time another 100 values have been adjusted for inflation, along with a timestamp of when that adjustment took place.

In [220]:
# Import the datetime module
from datetime import datetime

# Set a counter that tracks how many values have been adjusted.
counter = 0


def log_and_transform(x):
  # Declare counter as a global variable
  global counter
  # Increment counter by 1
  counter += 1
  if counter % 100 == 0:
    print(f"It's {datetime.now()} and I just adjusted the {counter}th value.")

  return x * inflation


With our function defined, let's now apply it to our subsetted mini dataframe.

In [221]:
mini_df.applymap(log_and_transform)

It's 2021-09-30 03:03:24.794059 and I just adjusted the 100th value.
It's 2021-09-30 03:03:24.794209 and I just adjusted the 200th value.
It's 2021-09-30 03:03:24.794302 and I just adjusted the 300th value.
It's 2021-09-30 03:03:24.794383 and I just adjusted the 400th value.
It's 2021-09-30 03:03:24.794859 and I just adjusted the 500th value.
It's 2021-09-30 03:03:24.795217 and I just adjusted the 600th value.
It's 2021-09-30 03:03:24.795314 and I just adjusted the 700th value.
It's 2021-09-30 03:03:24.795412 and I just adjusted the 800th value.
It's 2021-09-30 03:03:24.795492 and I just adjusted the 900th value.


Unnamed: 0,market_value,fpl_value
0,66.30,12.24
1,51.00,9.69
2,7.14,5.61
3,20.40,7.65
4,22.44,6.12
...,...,...
460,5.10,4.59
461,7.14,4.59
462,4.59,4.59
463,1.02,4.59


We now get a little notification for every 100th value that is adjusted. And remember, we're applying this element-wise, so every single value is passed into our `log_and_transform` function.

We cannot vectorize our `log_and_transform` function without losing some of its intended functionality. For example, we cannot properly log every 100th value modified if we used a vectorized operation, since that log depends on element-wise operation. Let's visualize this by running our function through `apply()`. Remember that `apply()` passes the entire column at once, and so the parameter **x** becomes an entire column. 

In [222]:
mini_df.apply(log_and_transform)

Unnamed: 0,market_value,fpl_value
0,66.30,12.24
1,51.00,9.69
2,7.14,5.61
3,20.40,7.65
4,22.44,6.12
...,...,...
460,5.10,4.59
461,7.14,4.59
462,4.59,4.59
463,1.02,4.59


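Note that our counter never hits a value that is evenly divisible by 100, and thus we don't get the *log* reports. With `apply()`, the function is only called once per column, so the counter is incremented just a handful of times.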
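
To see the difference in how often the function is called, the sketch below records every call made by each approach on a small made-up frame. All names here (`toy`, `record_element`, `record_column`) are introduced for illustration, and the element-wise method is chosen with the same `map`/`applymap` fallback as a hedge against version differences.

```python
import pandas as pd

# Hypothetical frame: 3 rows x 2 columns = 6 values
toy = pd.DataFrame({
    'market_value': [65.0, 50.0, 7.0],
    'fpl_value': [12.0, 9.5, 5.5],
})

element_calls = []  # one entry per call made element-wise
column_calls = []   # one entry per call made by apply()


def record_element(x):
    # Note whether we were handed a whole Series or a single value
    element_calls.append(isinstance(x, pd.Series))
    return x * 1.02


def record_column(x):
    column_calls.append(isinstance(x, pd.Series))
    return x * 1.02


elementwise = toy.map if hasattr(toy, 'map') else toy.applymap
by_element = elementwise(record_element)
by_column = toy.apply(record_column)

print(len(element_calls), any(element_calls))
print(len(column_calls), all(column_calls))
```

The element-wise version receives individual values, while `apply()` receives whole columns, which is why a per-value counter behaves so differently between the two.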

When possible, we should ALWAYS prefer vectorized operations (e.g. using `apply()`) over element-wise operations due to the performance gains. However, as we've seen, sometimes your hands are tied and you are forced to use `applymap()` for element-wise operations. 

## Skill Challenge

#### 1. Create a standalone function that
* accepts a single parameter `x`
* returns the string 'relatively unknown' if x < 200
* returns the string "kind of popular" if 200 <= x < 600
* returns the string "popular" if 600 <= x < 2000
* otherwise returns "super popular"



Here is our function, which we shall call `popularity`

In [223]:
def popularity(x):
  if x < 200:
    return 'relatively unknown'
  elif x < 600:
    return 'kind of popular'
  elif x < 2000:
    return 'popular'
  else:
    return 'super popular'

#### 2. Apply the function from step 1 above to the players' `page_views` column. Use a method that supports vectorized operations such as `apply()` or `transform()`

In [224]:
players.page_views.apply(popularity)

0           super popular
1           super popular
2                 popular
3           super popular
4                 popular
              ...        
460       kind of popular
461    relatively unknown
462    relatively unknown
463       kind of popular
464       kind of popular
Name: page_views, Length: 465, dtype: object
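
As an aside, the same bucketing can be sketched without calling a Python function per value, using `pd.cut()`. This is an alternative technique, not the challenge's intended solution; the bin edges mirror the thresholds in `popularity`, and the sample page view counts are made up.

```python
import numpy as np
import pandas as pd


def popularity(x):
    if x < 200:
        return 'relatively unknown'
    elif x < 600:
        return 'kind of popular'
    elif x < 2000:
        return 'popular'
    else:
        return 'super popular'


# Hypothetical page view counts, including values right on the thresholds
page_views = pd.Series([150, 200, 599, 600, 1999, 2000, 4329])

bins = [-np.inf, 200, 600, 2000, np.inf]
labels = ['relatively unknown', 'kind of popular', 'popular', 'super popular']

# right=False makes each bin include its left edge, e.g. [200, 600)
binned = pd.cut(page_views, bins=bins, labels=labels, right=False)

# Compare against the function-based approach
via_apply = page_views.apply(popularity)
print((binned.astype(str) == via_apply).all())
```

Note that `pd.cut()` returns a categorical series rather than plain strings, which is why the comparison converts it with `astype(str)`.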

#### 3. Add the output from the step above as a new column to the `players` dataframe. Name the column `popularity`.

In [225]:
players['popularity'] = players.page_views.apply(popularity)

In [226]:
players

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0,popular
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
460,Edimilson Fernandes,West+Ham,21,CM,2,5.0,288,4.5,0.40%,38,2,Switzerland,0,1,20,0,1,kind of popular
461,Arthur Masuaku,West+Ham,23,LB,3,7.0,199,4.5,0.20%,34,4,Congo DR,0,2,20,0,1,relatively unknown
462,Sam Byram,West+Ham,23,RB,3,4.5,198,4.5,0.30%,29,1,England,0,2,20,0,0,relatively unknown
463,Ashley Fletcher,West+Ham,21,CF,1,1.0,412,4.5,5.90%,16,1,England,0,1,20,0,1,kind of popular


#### 4. How many 'super popular' players are there?

The number of 'super popular' players can be counted using the `value_counts()` method.

In [227]:
players.popularity.value_counts()

kind of popular       185
popular               143
relatively unknown    100
super popular          37
Name: popularity, dtype: int64

Thus, there are 37 super popular players. If we're wondering who they are, we can subset the dataframe by super popular players.

In [228]:
players[players['popularity'] == "super popular"]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular
6,Olivier Giroud,Arsenal,30,CF,1,22.0,2230,8.5,2.50%,116,2,France,0,4,1,1,0,super popular
24,Lucas Perez,Arsenal,28,CF,1,15.0,2055,7.5,0.10%,20,2,Spain,0,4,1,1,1,super popular
33,Jermain Defoe,Bournemouth,34,CF,1,5.0,3213,8.0,15.00%,166,1,England,0,6,2,0,0,super popular
96,Eden Hazard,Chelsea,26,LW,1,75.0,4220,10.5,2.30%,224,2,Belgium,0,3,5,1,0,super popular
97,Diego Costa,Chelsea,28,CF,1,50.0,4454,10.0,3.00%,196,2,Spain,0,4,5,1,0,super popular
99,Marcos Alonso Mendoza,Chelsea,26,LB,3,25.0,3069,7.0,12.40%,177,2,Spain,0,3,5,1,1,super popular
103,David Luiz,Chelsea,30,CB,3,30.0,2745,6.0,20.30%,132,3,Brazil,0,4,5,1,0,super popular


Let's make sure there are only 37 entries.

In [229]:
players.loc[players['popularity'] == "super popular"].shape

(37, 18)

Looks good!
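
Another quick way to get the count is to sum the boolean mask directly, since each `True` counts as 1. A sketch on a small made-up series (the values are assumptions):

```python
import pandas as pd

# Hypothetical popularity labels
popularity_col = pd.Series(
    ['super popular', 'popular', 'super popular', 'kind of popular'],
    name='popularity',
)

# Boolean mask: True where the label matches
mask = popularity_col == 'super popular'

# Summing the mask counts the True values
n_super = int(mask.sum())

# Same number, read off value_counts()
via_counts = int(popularity_col.value_counts()['super popular'])

print(n_super, via_counts)
```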

## Setting data in DataFrames using `.at` and `.iat`

Sometimes we don't want to apply a function to an entire column, dataframe, or series. Instead, we just want to pick a specific value and change it. 

In [230]:
players.head(10)

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0,popular
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0,popular
6,Olivier Giroud,Arsenal,30,CF,1,22.0,2230,8.5,2.50%,116,2,France,0,4,1,1,0,super popular
7,Nacho Monreal,Arsenal,31,LB,3,13.0,555,5.5,4.70%,115,2,Spain,0,4,1,1,0,kind of popular
8,Shkodran Mustafi,Arsenal,25,CB,3,30.0,1877,5.5,4.00%,90,2,Germany,0,3,1,1,1,popular
9,Alex Iwobi,Arsenal,21,LW,1,10.0,1812,5.5,1.00%,89,4,Nigeria,0,1,1,1,0,popular


Suppose we want to change Theo Walcott's position from right winger (RW) to center midfielder (CM).

Since we know the index location and column name, one way we can do this is using the `loc[]` indexer.

In [231]:
players.loc[3, 'position']

'RW'

We can then use the assignment operator:

In [232]:
players.loc[3, 'position'] = "CM"

We see that the value has been changed.

In [233]:
players.head(10)

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular
3,Theo Walcott,Arsenal,28,CM,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0,popular
5,Hector Bellerin,Arsenal,22,RB,3,30.0,1675,6.0,13.70%,119,2,Spain,0,2,1,1,0,popular
6,Olivier Giroud,Arsenal,30,CF,1,22.0,2230,8.5,2.50%,116,2,France,0,4,1,1,0,super popular
7,Nacho Monreal,Arsenal,31,LB,3,13.0,555,5.5,4.70%,115,2,Spain,0,4,1,1,0,kind of popular
8,Shkodran Mustafi,Arsenal,25,CB,3,30.0,1877,5.5,4.00%,90,2,Germany,0,3,1,1,1,popular
9,Alex Iwobi,Arsenal,21,LW,1,10.0,1812,5.5,1.00%,89,4,Nigeria,0,1,1,1,0,popular


We can also use the position-based indexer `iloc[]` to make changes based on the integer position of the row and column.

In [234]:
players.iloc[3, 3] = 'RW'

In [235]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0,popular


However, two other indexers are preferred for single-value assignment because they are much faster.

`at[]` is a label-based indexer that is similar to `loc[]`, but works on single values only.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html

In [236]:
players.at[3, 'position'] = 'CM'

In [237]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular
3,Theo Walcott,Arsenal,28,CM,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0,popular


`iat[]` is an integer position-based indexer that is similar to `iloc[]`, but likewise works on single values only. We could use this one as well.

In [238]:
players.iat[3, 3] = 'RW'

In [239]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0,popular
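
Hard-coding the column position (3) is easy to get wrong if the column order ever changes. One way around this, sketched below on a small stand-in dataframe (`toy` and its values are made up for illustration, not the course dataset), is to look the position up with `Index.get_loc()` before calling `iat[]`.

```python
import pandas as pd

# Hypothetical stand-in for the players dataframe (illustration only).
toy = pd.DataFrame({
    'name': ['Alexis Sanchez', 'Mesut Ozil', 'Petr Cech', 'Theo Walcott'],
    'club': ['Arsenal'] * 4,
    'age': [28, 28, 35, 28],
    'position': ['LW', 'AM', 'GK', 'RW'],
})

# Translate labels into integer positions instead of hard-coding them.
row_pos = toy.index.get_loc(3)              # position of index label 3
col_pos = toy.columns.get_loc('position')   # position of the 'position' column

toy.iat[row_pos, col_pos] = 'CM'

# at[] reaches the same cell by label.
print(toy.at[3, 'position'])
```

When the labels are already known, `at[]` is usually the more readable choice; the lookup is mainly useful when code has to work positionally.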


Unlike `loc[]` and `iloc[]`, which accept many kinds of input (labels, lists, slices, boolean masks), `at[]` and `iat[]` support only ONE type of syntax: a single row and a single column. That makes them less flexible, but it also means they skip most of the input-handling overhead, so they run much faster.

How much faster, you ask? Let's find out!

In [240]:
%%timeit
players.iloc[3, 3] = 'RW'

1000 loops, best of 5: 304 µs per loop


In [241]:
%%timeit
players.iat[3, 3] = 'RW'

The slowest run took 8.65 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 26 µs per loop


## The `SettingWithCopy` Warning

When you try to set values in a dataframe, Pandas might throw a warning at you. Suppose we want to change the page views for Petr Cech.

In [242]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0,popular


One thing we could do is select the value using chained bracket indexing, and then assign the value we want.

In [243]:
players['page_views'][2] = 2001

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Check out that warning - a value is trying to be set on a copy of a slice from the original dataframe. Because we're using *chained indexing* (one indexing operation, `players['page_views']`, followed by a second, `[2]`), Pandas does not know whether the intermediate object is a copy of the data or a view of the underlying dataframe, so it cannot guarantee that our assignment reaches the underlying dataframe. More on this in the next lecture.

If the intermediate object is a **view**, our change is applied to the underlying dataframe; if it is a **copy**, the change is silently lost. Which one we get depends on how the memory for the dataset is laid out.

Let's see if it worked here!

In [244]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular
2,Petr Cech,Arsenal,35,GK,4,7.0,2001,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0,popular


In this case, given the structure of the dataframe, our selection created a *view*. Then when we set the value of Petr Cech's page views, we managed to change the underlying dataframe.

However, this is not guaranteed. Let's try a method that we know returns a copy of our dataframe.

In [245]:
players.drop_duplicates()['page_views'][2] = 3000

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  exec(code_obj, self.user_global_ns, self.user_ns)


In [246]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular
2,Petr Cech,Arsenal,35,GK,4,7.0,2001,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0,popular


Notice here how Petr Cech's `page_views` value did not change! That's because we applied it to a copy of the dataframe, not the dataframe itself.
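
The same behaviour can be reproduced without relying on a warning. The sketch below uses a small made-up dataframe (`toy`) and an explicit `.copy()` in place of `drop_duplicates()`, so that it is unambiguous which object is the copy: edits to the copy never reach the original, while a single `loc[]` assignment on the original does.

```python
import pandas as pd

# Made-up rows for illustration (not loaded from the course dataset).
toy = pd.DataFrame({
    'name': ['Alexis Sanchez', 'Mesut Ozil', 'Petr Cech'],
    'page_views': [4329, 4395, 1529],
})

# An explicit copy is a separate object: editing it leaves `toy` alone.
snapshot = toy.copy()
snapshot.loc[2, 'page_views'] = 3000
print(toy.loc[2, 'page_views'])   # still the original value

# A single indexing step on the original dataframe updates it directly.
toy.loc[2, 'page_views'] = 2001
print(toy.loc[2, 'page_views'])
```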

This warning can be disabled within pandas options by setting `pd.options.mode.chained_assignment = None` (the Python object `None`, not the string `'None'`). But the instructor recommends we keep the warning on, as it can be very useful. If we're running an assignment, we want to know for a fact that we're working on a view to the underlying dataframe.

The next lecture will discuss the difference between views and copies, and we'll explore general rules to avoid seeing the `SettingWithCopy` warning.

## View vs Copy

Whenever we index, extract slices, or apply a method to a dataframe, we run into the issue of whether we are working with a **view** or with a **copy** of the data.
* A **view** can be thought of as a "window" into the data. If we modify a view, we will modify the underlying dataframe.
* A **copy** is exactly as it sounds, a distinct object in memory that may be exactly the same size, shape, and contain the same data as the reference dataframe, but is nevertheless its own entity.

How do we know if Pandas is returning to us a copy or a view? Most cases are covered by a two-point rule.
* Pandas loves to give us copies when we execute methods, and it is generally safe to assume that we are getting a copy. In fact, any method with an `inplace` parameter returns a copy by default unless `inplace` is set to True.
* However, if we assign through `loc[]/iloc[]` or `at[]/iat[]` called directly on the dataframe, the assignment is GUARANTEED to act on that dataframe, just as if we were working with a view.
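
The first point is easy to check for ourselves. A minimal sketch, using a made-up two-row dataframe and `rename()` as one example of a method with an `inplace` parameter:

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Petr Cech', 'Alex Iwobi'], 'age': [35, 21]})

# Without inplace=True, the method hands back a new dataframe...
renamed = toy.rename(columns={'age': 'age_years'})

# ...and the original keeps its old column labels.
print(list(toy.columns))
print(list(renamed.columns))
print(renamed is toy)
```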

How do we guarantee that we're working with a view, such that we're changing the underlying values and avoiding the SettingWithCopyWarning? One way is to use `loc[]/iloc[]` or `at[]/iat[]`.

In [247]:
players.loc[0:3]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular
2,Petr Cech,Arsenal,35,GK,4,7.0,2001,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular


In [248]:
players.loc[0:3,'position'] = ['CM','RW','CB','GK']

In [249]:
players.head(4)

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
0,Alexis Sanchez,Arsenal,28,CM,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular
1,Mesut Ozil,Arsenal,28,RW,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular
2,Petr Cech,Arsenal,35,CB,4,7.0,2001,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular
3,Theo Walcott,Arsenal,28,GK,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular
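
Notice that the label slice `0:3` selected four rows, which is why the list above needed four values. A short sketch of the difference between label slices and positional slices, on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'position': ['LW', 'AM', 'GK', 'RW', 'CB']})

by_label = toy.loc[0:3]      # label slice: includes the end label 3
by_position = toy.iloc[0:3]  # positional slice: stops before position 3

print(len(by_label), len(by_position))

# When assigning a list, its length must match the number of selected rows.
toy.loc[0:3, 'position'] = ['CM', 'RW', 'CB', 'GK']
print(toy['position'].tolist())
```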


Let's try another example where we replace the name of every player whose first name is "Aaron" with a nickname. We start by finding how many Aarons there are, using a boolean mask to grab the relevant players:

In [250]:
players.loc[players.name.str.startswith('Aaron')]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
15,Aaron Ramsey,Arsenal,26,CM,2,35.0,1040,7.0,5.10%,56,1,Wales,0,3,1,1,0,popular
157,Aaron Lennon,Everton,30,RW,1,5.0,504,5.5,0.20%,22,1,England,0,4,7,0,0,kind of popular
176,Aaron Mooy,Huddersfield,26,CM,2,5.0,588,5.5,2.50%,0,4,Australia,0,3,8,0,0,kind of popular
455,Aaron Cresswell,West+Ham,27,LB,3,12.0,380,5.0,1.30%,60,1,England,0,3,20,0,0,kind of popular


Now let's grab that slice of the dataframe using a boolean mask and change the name of all of these Aarons to "Ronny". 

In [251]:
players.loc[players.name.str.startswith('Aaron'), 'name'] = "Ronny"

Did it work? Let's check:

In [252]:
players.loc[[15, 157, 176, 455], :]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity
15,Ronny,Arsenal,26,CM,2,35.0,1040,7.0,5.10%,56,1,Wales,0,3,1,1,0,popular
157,Ronny,Everton,30,RW,1,5.0,504,5.5,0.20%,22,1,England,0,4,7,0,0,kind of popular
176,Ronny,Huddersfield,26,CM,2,5.0,588,5.5,2.50%,0,4,Australia,0,3,8,0,0,kind of popular
455,Ronny,West+Ham,27,LB,3,12.0,380,5.0,1.30%,60,1,England,0,3,20,0,0,kind of popular
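
Assigning a single scalar overwrote the whole name, so all four players are now just "Ronny". If the goal were to swap only the first name, one possible approach (a sketch on made-up rows, not part of the original lecture) is to compute the replacement from the selected values themselves:

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Aaron Ramsey', 'Theo Walcott', 'Aaron Lennon'],
    'club': ['Arsenal', 'Arsenal', 'Everton'],
})

is_aaron = toy['name'].str.startswith('Aaron')

# Replace only the leading "Aaron" and keep the surname.
toy.loc[is_aaron, 'name'] = (
    toy.loc[is_aaron, 'name'].str.replace('Aaron', 'Ronny', n=1, regex=False)
)

print(toy['name'].tolist())
```

The assignment still goes through a single `loc[]` call on the dataframe, so it follows the same rule as the scalar version.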


Importantly, this guarantee only holds if we use the indexers directly on the original dataframe. We can easily break the rule by chaining one of these indexers onto a selection that is NOT guaranteed to return a view. Consider the following example:

In [253]:
# do not do this ever!
players['age'].iloc[1] = 12

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
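
A safer equivalent, sketched here on a made-up dataframe, is to express the same selection as one indexing call on the dataframe itself. The row is still chosen by position (via `toy.index[1]`) and the column by label:

```python
import pandas as pd

toy = pd.DataFrame(
    {'name': ['Alexis Sanchez', 'Mesut Ozil'], 'age': [28, 28]},
    index=[10, 11],   # non-default labels, to show the position lookup
)

# Second row by position, 'age' column by label, in one loc[] call.
toy.loc[toy.index[1], 'age'] = 12

print(toy['age'].tolist())
```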


Finally, let's reset the dataframe to get rid of all of these silly edits.

In [254]:
players = pd.read_csv(data_url)

In [255]:
players.head(4)

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0


In [256]:
players.loc[[15, 157, 176, 455], :]

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing
15,Aaron Ramsey,Arsenal,26,CM,2,35.0,1040,7.0,5.10%,56,1,Wales,0,3,1,1,0
157,Aaron Lennon,Everton,30,RW,1,5.0,504,5.5,0.20%,22,1,England,0,4,7,0,0
176,Aaron Mooy,Huddersfield,26,CM,2,5.0,588,5.5,2.50%,0,4,Australia,0,3,8,0,0
455,Aaron Cresswell,West+Ham,27,LB,3,12.0,380,5.0,1.30%,60,1,England,0,3,20,0,0


## Adding DataFrame Columns with `insert()` and `assign()`

Sometimes you need to expand your dataset by adding new attributes (columns). We've done this before using bracket assignment. Let's add back the popularity column from the skill challenge above.

In [257]:
players['popularity'] = players.page_views.apply(popularity)

In [258]:
'popularity' in players

True

Let's add a new column called `MVtoFPL` to the dataframe and initialize its values to 1.0. We will eventually use it to hold the ratio of market value to FPL value.

In [259]:
players['MVtoFPL'] = 1.0

In [260]:
'MVtoFPL' in players

True

In [261]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity,MVtoFPL
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular,1.0
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular,1.0
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular,1.0
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular,1.0
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0,popular,1.0


Now let's recode this new column by actually performing the calculation to get the ratio of MV/FPL.

In [262]:
players['MVtoFPL'] = players['market_value'] / players['fpl_value']

In [263]:
players.head()

Unnamed: 0,name,club,age,position,position_cat,market_value,page_views,fpl_value,fpl_sel,fpl_points,region,nationality,new_foreign,age_cat,club_id,big_club,new_signing,popularity,MVtoFPL
0,Alexis Sanchez,Arsenal,28,LW,1,65.0,4329,12.0,17.10%,264,3,Chile,0,4,1,1,0,super popular,5.416667
1,Mesut Ozil,Arsenal,28,AM,1,50.0,4395,9.5,5.60%,167,2,Germany,0,4,1,1,0,super popular,5.263158
2,Petr Cech,Arsenal,35,GK,4,7.0,1529,5.5,5.90%,134,2,Czech Republic,0,6,1,1,0,popular,1.272727
3,Theo Walcott,Arsenal,28,RW,1,20.0,2393,7.5,1.50%,122,1,England,0,4,1,1,0,super popular,2.666667
4,Laurent Koscielny,Arsenal,31,CB,3,22.0,912,6.0,0.70%,121,2,France,0,4,1,1,0,popular,3.666667


There are also methods that offer alternative ways to add columns. The first is the `insert()` method.

For this example, let's grab a slice of the dataframe and assign it to a variable and then work with it.

In [264]:
df_mini = players.iloc[:4, 1:5]

In [265]:
df_mini

Unnamed: 0,club,age,position,position_cat
0,Arsenal,28,LW,1
1,Arsenal,28,AM,1
2,Arsenal,35,GK,4
3,Arsenal,28,RW,1


Suppose we want to add a new column with player names. We start by creating a series that contains the names.

In [266]:
player_names = pd.Series(['Bronson', 'Bradley','Ronald','Ronnie'])

In [267]:
player_names

0    Bronson
1    Bradley
2     Ronald
3     Ronnie
dtype: object

Now we use `insert()` to insert our new series at column index 0 with the label "nicknames". The beauty of this method is that it allows you to control the spot at which the series is inserted as well as choose a column name for that series.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html

In [268]:
df_mini.insert(0, 'nicknames', player_names)

In [269]:
df_mini

Unnamed: 0,nicknames,club,age,position,position_cat
0,Bronson,Arsenal,28,LW,1
1,Bradley,Arsenal,28,AM,1
2,Ronald,Arsenal,35,GK,4
3,Ronnie,Arsenal,28,RW,1
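
One caveat worth knowing: when the value passed to `insert()` is a Series, pandas aligns it on index labels rather than on position. This worked above because `df_mini` and `player_names` both use the labels 0 to 3. The sketch below (made-up data) shows what happens when the labels differ, and one way to insert by position instead.

```python
import pandas as pd

# A slice whose index labels start at 10 rather than 0 (illustrative values).
toy = pd.DataFrame({'club': ['Arsenal', 'Arsenal']}, index=[10, 11])
nicknames = pd.Series(['Bronson', 'Bradley'])   # default labels 0 and 1

# Labels 10 and 11 are not in `nicknames`, so alignment yields missing values.
toy.insert(0, 'aligned', nicknames)

# Passing the raw values skips alignment and inserts by position.
toy.insert(1, 'by_position', nicknames.to_numpy())

print(toy)
```

Passing a plain list has the same effect as `to_numpy()` here, since a list carries no index to align on.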


Another method that we can use is the `assign()` method. It leaves the existing dataframe untouched and *returns a new copy* of the dataframe with the column added. The method takes no positional parameters; instead, you specify the column names and values as keyword arguments.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html
* `assign()` allows you to add multiple columns at once, something you cannot do with the `insert()` method.

Suppose we want to add a new column with the number of goals each player has scored in his career. We can do the following:

In [270]:
df_mini.assign(career_goals = [12, 67, 179, 49])

Unnamed: 0,nicknames,club,age,position,position_cat,career_goals
0,Bronson,Arsenal,28,LW,1,12
1,Bradley,Arsenal,28,AM,1,67
2,Ronald,Arsenal,35,GK,4,179
3,Ronnie,Arsenal,28,RW,1,49


Remember that our original underlying dataframe is unaffected - the `assign` method returns a copy. Let's try adding two columns at the same time!

In [271]:
df_mini.assign(career_goals = [12, 67, 179, 49], nationality = ['American','British','Turkish','Indian'])

Unnamed: 0,nicknames,club,age,position,position_cat,career_goals,nationality
0,Bronson,Arsenal,28,LW,1,12,American
1,Bradley,Arsenal,28,AM,1,67,British
2,Ronald,Arsenal,35,GK,4,179,Turkish
3,Ronnie,Arsenal,28,RW,1,49,Indian


In [272]:
df_mini

Unnamed: 0,nicknames,club,age,position,position_cat
0,Bronson,Arsenal,28,LW,1
1,Bradley,Arsenal,28,AM,1
2,Ronald,Arsenal,35,GK,4
3,Ronnie,Arsenal,28,RW,1
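
Two details of `assign()` are easy to miss: the result has to be captured if we want to keep it, and a column's value may be given as a function of the dataframe. A minimal sketch on made-up data (the `ratio` column name is only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'market_value': [65.0, 50.0, 7.0],
    'fpl_value': [12.0, 9.5, 5.5],
})

# The callable receives the dataframe and returns the new column's values.
with_ratio = toy.assign(ratio=lambda d: d['market_value'] / d['fpl_value'])

print('ratio' in toy)         # original is untouched
print('ratio' in with_ratio)  # the returned copy carries the new column
```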


When we cover merging and joining dataframes later on, we will learn and practice much more efficient ways to do this at scale.

## Adding DataFrame Rows with `append()`

Can we add new rows to our dataframe in a similar manner as adding columns? Yes we can. Suppose we wanted to add a new player to our *df_mini* dataframe.

The first approach is using the `append()` method. It accepts a dataframe, a series, or a list of dataframes or series. It returns a *new* modified dataframe - the underlying original dataframe is unchanged.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html

In this first example we'll use a single series to add a player. We will build this Series using a dictionary, where the keys are the column names and the values are the entries for the new row. We will also name this series **4** (we'll see why momentarily, but notice that 4 would be the next index to add).

In [273]:
cristiano = pd.Series({
    'nicknames':'Christiano',
    'age': 32,
    'position': 'RW',
    'club': "Juventus",
    'position_cat': 1
}, name = 4)

In [274]:
cristiano

nicknames       Christiano
age                     32
position                RW
club              Juventus
position_cat             1
Name: 4, dtype: object

Now all we gotta do is call the `append()` method on `df_mini`.

In [275]:
df_mini.append(cristiano)

Unnamed: 0,nicknames,club,age,position,position_cat
0,Bronson,Arsenal,28,LW,1
1,Bradley,Arsenal,28,AM,1
2,Ronald,Arsenal,35,GK,4
3,Ronnie,Arsenal,28,RW,1
4,Christiano,Juventus,32,RW,1


Notice that the name of the series became the index label on the new record for the new row that was added. Very nice.

What if we wanted to add multiple players? All we need to do is create a series for each player, and then pass a list of those series to the `append()` method. Here is some commented non-functional code that illustrates how we would do this.

In [276]:
# df_mini.append([player_1, player_2, player_3...])

Another approach is to build a new dataframe that contains the new rows! We can do this by using the `pd.DataFrame` constructor. Note that we're assigning the index values as well.

In [277]:
other_players = pd.DataFrame({
    'nicknames' : ['Gianluigi', 'Lionel'],
    'age': [37, 32],
    'club': ["Juventus", "Barcelona"],
    'position' : ['GK', 'CF'],
    'position_cat' : [4, 2]
}, index = [5, 6])

In [278]:
other_players

Unnamed: 0,nicknames,age,club,position,position_cat
5,Gianluigi,37,Juventus,GK,4
6,Lionel,32,Barcelona,CF,2


Now append that sucker to `df_mini`!

In [279]:
df_mini.append(other_players)

Unnamed: 0,nicknames,club,age,position,position_cat
0,Bronson,Arsenal,28,LW,1
1,Bradley,Arsenal,28,AM,1
2,Ronald,Arsenal,35,GK,4
3,Ronnie,Arsenal,28,RW,1
5,Gianluigi,Juventus,37,GK,4
6,Lionel,Barcelona,32,CF,2
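
A version note: `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the calls above raise an `AttributeError` on recent versions. `pd.concat()` covers the same use cases. The sketch below rebuilds small stand-ins for `df_mini` and the new rows (names and values abbreviated for illustration):

```python
import pandas as pd

mini = pd.DataFrame({
    'nicknames': ['Bronson', 'Bradley'],
    'club': ['Arsenal', 'Arsenal'],
    'age': [28, 28],
})

# A single new row as a Series; its name becomes the index label.
new_row = pd.Series({'nicknames': 'Christiano', 'club': 'Juventus', 'age': 32},
                    name=2)

# Several new rows as a dataframe with explicit index labels.
more_rows = pd.DataFrame({
    'nicknames': ['Gianluigi', 'Lionel'],
    'club': ['Juventus', 'Barcelona'],
    'age': [37, 32],
}, index=[3, 4])

# to_frame().T turns the Series into a one-row dataframe before concatenating.
combined = pd.concat([mini, new_row.to_frame().T, more_rows])

print(combined)
```

Like `append()`, `pd.concat()` returns a new dataframe and leaves its inputs unchanged. Because the transposed Series holds mixed types, the `age` column of the result may come back with `object` dtype.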


One thing you will likely encounter when working with code written by other Pandas users is *setting with enlargement*. If you assign to a row label that doesn't exist yet using `loc[]`, Pandas adds that row. (This works with `loc[]` only; `iloc[]` raises an `IndexError` if the position is out of bounds.)

First, remember that `append()` does not work in place. In order to update the original dataframe, we must assign the returned dataframe back to the original variable name.


In [280]:
df_mini = df_mini.append(cristiano)

In [281]:
df_mini = df_mini.append(other_players)

In [282]:
df_mini

Unnamed: 0,nicknames,club,age,position,position_cat
0,Bronson,Arsenal,28,LW,1
1,Bradley,Arsenal,28,AM,1
2,Ronald,Arsenal,35,GK,4
3,Ronnie,Arsenal,28,RW,1
4,Christiano,Juventus,32,RW,1
5,Gianluigi,Juventus,37,GK,4
6,Lionel,Barcelona,32,CF,2


Let's now try to assign a new row to an index that doesn't exist.

In [283]:
df_mini.loc[9] = 'some row value'

In [284]:
df_mini

Unnamed: 0,nicknames,club,age,position,position_cat
0,Bronson,Arsenal,28,LW,1
1,Bradley,Arsenal,28,AM,1
2,Ronald,Arsenal,35,GK,4
3,Ronnie,Arsenal,28,RW,1
4,Christiano,Juventus,32,RW,1
5,Gianluigi,Juventus,37,GK,4
6,Lionel,Barcelona,32,CF,2
9,some row value,some row value,some row value,some row value,some row value


Check that out - we assigned a value to a new row at index 9 that didn't exist via setting with enlargement. However, we should strive to avoid this. It is computationally inefficient - `append()` is a much better method. 

That said, regardless of how you do it, adding new rows to dataframes is computationally expensive because of how Pandas stores dataframes in memory. Be mindful of this when working with Pandas. 
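
A common way to sidestep the repeated cost is to collect the new rows in an ordinary Python list and build the dataframe once at the end. This is a sketch with made-up records; the `incoming` list stands in for whatever produces the rows:

```python
import pandas as pd

# Made-up records standing in for rows that arrive one at a time.
incoming = [
    ('Christiano', 'Juventus', 32),
    ('Gianluigi', 'Juventus', 37),
    ('Lionel', 'Barcelona', 32),
]

rows = []
for nickname, club, age in incoming:
    # Appending to a Python list is cheap; no dataframe is copied here.
    rows.append({'nicknames': nickname, 'club': club, 'age': age})

# Build the dataframe once, after all rows have been collected.
new_players = pd.DataFrame(rows)

print(new_players)
```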

## How DataFrames are Stored in Memory

We learned above that appending rows is very inefficient in Pandas, regardless of the approach. That's because, due to how dataframes are stored in memory, each time we do this append operation, the entire dataframe is copied over.


In [285]:
players.info(verbose = False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465 entries, 0 to 464
Columns: 19 entries, name to MVtoFPL
dtypes: float64(3), int64(10), object(6)
memory usage: 69.1+ KB


When we look at our dataframe, we see that it is split up into three contiguous chunks based on the datatype (float, int, or object). Notice something? This is exactly how columns are classified as well! That means Pandas memory storage is heavily column-driven. At this level, there is no concept of a dataframe row. 

That means that every time we append a row, Pandas needs to reach back into memory, extract the relevant values from each and every chunk, and reconstruct a row object by modifying all of these chunks. This is orchestrated by what's called the **block manager** class in Pandas.

By contrast, when we add a column, Pandas determines the type of data it is, and then adds it to the data type block (chunk), essentially modifying that block only. Thus, adding a column is much less computationally expensive than adding a row.
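
The column-oriented layout is visible from the public API as well. The sketch below (made-up data) groups column names by dtype, which mirrors the `dtypes` line in the `info()` output. It does not inspect the block manager itself, which is internal to pandas.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Petr Cech', 'Alex Iwobi'],
    'club': ['Arsenal', 'Arsenal'],
    'age': [35, 21],
    'page_views': [1529, 1812],
    'market_value': [7.0, 10.0],
})

# Collect column names under the dtype they are stored as.
columns_by_dtype = {}
for column, dtype in toy.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

for dtype, columns in columns_by_dtype.items():
    print(dtype, columns)
```

The exact dtype names printed depend on the pandas version and platform, so the output is not shown here.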

## Skill Challenge

#### 1. From the *players* dataframe, select 4 columns and 4 rows of no particular order. Assign the resulting 4x4 dataframe to df_random.

We will keep it simple by slicing a 4x4 dataframe using `iloc`.

In [337]:
df_random = players.iloc[10:14, 8:12]

In [338]:
df_random

Unnamed: 0,fpl_sel,fpl_points,region,nationality
10,2.00%,85,2,Switzerland
11,2.00%,85,2,Switzerland
12,1.80%,83,1,England
13,1.80%,83,1,England


An alternative would have been to randomize our slice by using the `.sample()` method.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
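
A minimal sketch of the `.sample()` approach, using a made-up 6x6 dataframe in place of `players`. The `random_state` value is arbitrary and only there to make the draw repeatable:

```python
import numpy as np
import pandas as pd

# Made-up 6x6 dataframe; any dataframe with at least 4 rows and columns works.
toy = pd.DataFrame(np.arange(36).reshape(6, 6), columns=list('abcdef'))

# Sample 4 rows, then 4 columns; random_state makes the draw repeatable.
df_sampled = toy.sample(n=4, random_state=1).sample(n=4, axis=1, random_state=1)

print(df_sampled.shape)
```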


#### 2. Extend df_random 1) vertically by adding a new row, and 2) horizontally by adding a new column. Do this as two separate operations.

Let's start by adding a new row, which we can do using the series constructor and the `append()` method. Remember to name the series.

In [296]:
new_row = pd.Series({
    'fpl_sel': "1.60%",
    'fpl_points': 81,
    'region': 2,
    'nationality': "Spain"
}, name = 14)

In [297]:
new_row

fpl_sel        1.60%
fpl_points        81
region             2
nationality    Spain
Name: 14, dtype: object

In [301]:
df_random.append(new_row)

Unnamed: 0,fpl_sel,fpl_points,region,nationality
10,2.00%,85,2,Switzerland
11,2.00%,85,2,Switzerland
12,1.80%,83,1,England
13,1.80%,83,1,England
14,1.60%,81,2,Spain


Now let's add a new column called "height". Let's first do this using the `insert()` method.

In [319]:
heights = [158, 162, 187, 170]

In [320]:
df_random.insert(4, "height", heights)

In [321]:
df_random

Unnamed: 0,fpl_sel,fpl_points,region,nationality,height
10,2.00%,85,2,Switzerland,158
11,2.00%,85,2,Switzerland,162
12,1.80%,83,1,England,187
13,1.80%,83,1,England,170


Now let's drop this column and re-add it using the `assign()` method.

In [322]:
df_random.drop(columns = "height", inplace=True)

In [323]:
df_random.assign(height = heights)

Unnamed: 0,fpl_sel,fpl_points,region,nationality,height
10,2.00%,85,2,Switzerland,158
11,2.00%,85,2,Switzerland,162
12,1.80%,83,1,England,187
13,1.80%,83,1,England,170


#### 3. Compare the relative performance of these operations. Is adding a row or a column faster? Is there a significant difference?

Let's run a `%%timeit` magic function on these commands and see! First reset the df_random:

In [359]:
df_random = players.iloc[10:14, 8:12]

In [360]:
df_random

Unnamed: 0,fpl_sel,fpl_points,region,nationality
10,2.00%,85,2,Switzerland
11,2.00%,85,2,Switzerland
12,1.80%,83,1,England
13,1.80%,83,1,England


In [361]:
%%timeit
df_random.append(new_row)

100 loops, best of 5: 2.94 ms per loop


In [362]:
df_random

Unnamed: 0,fpl_sel,fpl_points,region,nationality
10,2.00%,85,2,Switzerland
11,2.00%,85,2,Switzerland
12,1.80%,83,1,England
13,1.80%,83,1,England


In [364]:
%%timeit
df_random.assign(height = heights)

The slowest run took 5.03 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 462 µs per loop


It looks like adding a column with `assign()` is about 6 times faster than adding a row with `append()`.

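Note that `insert()` doesn't work well here because it modifies the dataframe in place. `%%timeit` runs the cell many times, and from the second run onward the `height` column already exists, so `insert()` raises a `ValueError`.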
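
If we still want a rough number for `insert()`, one workaround is to run it on a fresh copy each time, so the column never already exists. This is only a sketch: it uses the standard-library `timeit` module instead of the `%%timeit` magic, a made-up dataframe, and a helper name (`insert_on_copy`) chosen for illustration. The measured time includes the cost of `copy()`, so it overstates the cost of `insert()` itself.

```python
import timeit

import pandas as pd

toy = pd.DataFrame({
    'fpl_points': [85, 85, 83, 83],
    'region': [2, 2, 1, 1],
})
heights = [158, 162, 187, 170]


def insert_on_copy():
    # Work on a throwaway copy so `toy` never gains the column.
    scratch = toy.copy()
    scratch.insert(len(scratch.columns), 'height', heights)
    return scratch


runs = 200
seconds = timeit.timeit(insert_on_copy, number=runs)
print(f'{seconds / runs * 1e6:.1f} microseconds per call (copy + insert)')
```

Timings vary by machine and pandas version, so treat the result as a relative guide only.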