# Accessing Data

In [1]:
import pandas as pd

Let's create some data to play around with.

In [2]:
df = pd.DataFrame(data = {'team'  : ['RealMadrid','FCBarcelona', 'Sevilla', 'FCBarcelona','AtleticoMadrid',
                                     'RealMadrid', 'FCBarcelona','AtleticoMadrid', 'RealMadrid'],
                          'year' : [2020, 2020, 2020, 2019, 2019, 2019, 2018, 2018, 2018],
                          'wins'  : [26,25,19,26,22,21,28,23,22],
                          'draws' : [3,6,6,9,10,5,9,10,10],
                          'losses': [3,6,6,3,6,12,1,5,6]})
df

Unnamed: 0,team,year,wins,draws,losses
0,RealMadrid,2020,26,3,3
1,FCBarcelona,2020,25,6,6
2,Sevilla,2020,19,6,6
3,FCBarcelona,2019,26,9,3
4,AtleticoMadrid,2019,22,10,6
5,RealMadrid,2019,21,5,12
6,FCBarcelona,2018,28,9,1
7,AtleticoMadrid,2018,23,10,5
8,RealMadrid,2018,22,10,6


There are a multitude of ways to retrieve information from a DataFrame. This can be **very confusing** if you start searching the internet, so consulting the following section first will save you **a lot** of headaches. Let's remind ourselves of the data we have in the DataFrame.

In [3]:
df

Unnamed: 0,team,year,wins,draws,losses
0,RealMadrid,2020,26,3,3
1,FCBarcelona,2020,25,6,6
2,Sevilla,2020,19,6,6
3,FCBarcelona,2019,26,9,3
4,AtleticoMadrid,2019,22,10,6
5,RealMadrid,2019,21,5,12
6,FCBarcelona,2018,28,9,1
7,AtleticoMadrid,2018,23,10,5
8,RealMadrid,2018,22,10,6


### Selecting Columns (Only)

In [4]:
# retrieving a single column returns a Series; df[['year']] returns a DataFrame instead
df['year']

0    2020
1    2020
2    2020
3    2019
4    2019
5    2019
6    2018
7    2018
8    2018
Name: year, dtype: int64
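
The difference between single and double brackets noted in the comment above can be checked directly. The following is a minimal sketch on a three-row stand-in for the DataFrame (the names `single` and `nested` are only illustrative):

```python
import pandas as pd

# A small stand-in for the DataFrame used in this section
df = pd.DataFrame({'team': ['RealMadrid', 'FCBarcelona', 'Sevilla'],
                   'year': [2020, 2020, 2020]})

single = df['year']      # a column name -> Series
nested = df[['year']]    # a list with one column name -> one-column DataFrame

print(type(single).__name__)       # Series
print(type(nested).__name__)       # DataFrame
print(single.shape, nested.shape)  # (3,) (3, 1)
```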

Depending on the name of the column, you may be able to access it by writing a period followed by the column name right after the name of the DataFrame. However, this will not work if the column name contains a space, e.g. 'league wins'.

In [5]:
# retrieving a single column
df.year

0    2020
1    2020
2    2020
3    2019
4    2019
5    2019
6    2018
7    2018
8    2018
Name: year, dtype: int64
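
To illustrate where attribute access breaks down, here is a small sketch with a hypothetical DataFrame `stats`. Its column names are made up for the example: one contains a space, and the other, `count`, happens to share its name with a DataFrame method, which is a second case where the period notation does not return the column.

```python
import pandas as pd

# Hypothetical columns chosen to illustrate the two failure modes
stats = pd.DataFrame({'league wins': [26, 25],
                      'count': [38, 38]})

# A name with a space is not a valid Python attribute, so use brackets
league_wins = stats['league wins']
print(league_wins.tolist())     # [26, 25]

# 'count' is also a DataFrame method, so stats.count is the method,
# not the column
print(callable(stats.count))    # True
print(stats['count'].tolist())  # [38, 38]
```

Bracket notation works for every column name, so it is the safer default.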

To retrieve more than one entire column, you need to supply a list with the names of the columns you are interested in.

In [6]:
# retrieving more than one column: year, team
df[['year', 'team']] 

Unnamed: 0,year,team
0,2020,RealMadrid
1,2020,FCBarcelona
2,2020,Sevilla
3,2019,FCBarcelona
4,2019,AtleticoMadrid
5,2019,RealMadrid
6,2018,FCBarcelona
7,2018,AtleticoMadrid
8,2018,RealMadrid


## Slicing a DataFrame

Whenever you want to go beyond retrieving **entire** columns, Pandas offers two main ways to access information within a DataFrame:

- Use `.loc` when you want to access data by referring to the names of the columns and the index (= **LABELS**).

- Use `.iloc` when you want to access data by referring to its **POSITION** in the DataFrame, similar to indexing a numpy array.

### Using labels: `.loc[]`


In [7]:
# return column 'team', similar to df['team'] or df.team
df.loc[:, 'team']

0        RealMadrid
1       FCBarcelona
2           Sevilla
3       FCBarcelona
4    AtleticoMadrid
5        RealMadrid
6       FCBarcelona
7    AtleticoMadrid
8        RealMadrid
Name: team, dtype: object

Above, we use `:` to tell Pandas we want **all** rows returned for the column *team*.

In [8]:
# Return team in third row
df.loc[2, 'team']

'Sevilla'

Above, we returned the name of the *team* in the row with an *index* value of 2. Next, we want to return the name of the *team* and the *draws* for that same *index*.

In [9]:
# Return team and draws in third row
df.loc[2, ['team', 'draws']]

team     Sevilla
draws          6
Name: 2, dtype: object

Now, we will return all rows for the contiguous columns *year* to *draws* including both ends.

In [10]:
# Return all rows for the columns 'year' through 'draws' (both ends included)
df.loc[:, 'year':'draws']

Unnamed: 0,year,wins,draws
0,2020,26,3
1,2020,25,6
2,2020,19,6
3,2019,26,9
4,2019,22,10
5,2019,21,5
6,2018,28,9
7,2018,23,10
8,2018,22,10


Return all rows for columns: *year*, *wins*, *team*

In [11]:
# Return all rows showing years, teams and wins
df.loc[:, ['year', 'wins', 'team'] ]

Unnamed: 0,year,wins,team
0,2020,26,RealMadrid
1,2020,25,FCBarcelona
2,2020,19,Sevilla
3,2019,26,FCBarcelona
4,2019,22,AtleticoMadrid
5,2019,21,RealMadrid
6,2018,28,FCBarcelona
7,2018,23,AtleticoMadrid
8,2018,22,RealMadrid


Return entries in rows with indices 1 through 4 (both included) and columns: *year*, *team*, *losses*, in that order.

In [12]:
df.loc[1:4,['year', 'team', 'losses']]

Unnamed: 0,year,team,losses
1,2020,FCBarcelona,6
2,2020,Sevilla,6
3,2019,FCBarcelona,3
4,2019,AtleticoMadrid,6


Return entries for rows with indices: 1, 0, 2 (in that order) and columns: *year*, *team*, *losses*, in that order. 

In [13]:
# Returning a 'block' within the DataFrame.
# Return rows with index 1, 0, 2 (in that order) and columns 'year', 'team', 'losses'
df.loc[[1,0,2],['year', 'team', 'losses']]

Unnamed: 0,year,team,losses
1,2020,FCBarcelona,6
0,2020,RealMadrid,3
2,2020,Sevilla,6


### Using position: `.iloc[]`

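Return all entries in rows 2 to 4 and columns 0 to 3 (by position). As with Python lists, the end of a positional slice is excluded, which is why the slices below stop at 5 and 4.

In [14]:
# return block: rows at positions 2 to 4 and columns at positions 0 to 3 (slice ends 5 and 4 are excluded)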
df.iloc[2:5, 0:4]

Unnamed: 0,team,year,wins,draws
2,Sevilla,2020,19,6
3,FCBarcelona,2019,26,9
4,AtleticoMadrid,2019,22,10
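
The slices above behave differently at their end points: `.loc` includes the end label, while `.iloc` excludes the end position. A minimal sketch on a five-row stand-in for the DataFrame (the variable names are only illustrative):

```python
import pandas as pd

# First five rows of the example data, with the default integer index
df = pd.DataFrame({'team': ['RealMadrid', 'FCBarcelona', 'Sevilla',
                            'FCBarcelona', 'AtleticoMadrid'],
                   'year': [2020, 2020, 2020, 2019, 2019]})

by_label = df.loc[1:3]      # labels 1, 2 and 3: the end label is included
by_position = df.iloc[1:3]  # positions 1 and 2: the end position is excluded

print(by_label.index.tolist())     # [1, 2, 3]
print(by_position.index.tolist())  # [1, 2]
```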


Return the last three rows.

In [15]:
# this is the same as df.iloc[-3:,:]
df.iloc[-3:]

Unnamed: 0,team,year,wins,draws,losses
6,FCBarcelona,2018,28,9,1
7,AtleticoMadrid,2018,23,10,5
8,RealMadrid,2018,22,10,6


```{caution}
By default, Pandas creates an integer range index. Hence, it can be confusing to tell `.iloc[1,..]` and `.loc[1,..]` apart. In the first case, we are accessing the **second row** by position, while in the second case, we are accessing the row whose *index* label is equal to 1. With the default index these are the same row, but they differ as soon as the rows are reordered or filtered.
```
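
The distinction in the caution above becomes visible once the index no longer matches the row positions. The sketch below assumes a three-row stand-in for the DataFrame and uses sorting as one way to reorder the rows (the name `by_wins` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'team': ['RealMadrid', 'FCBarcelona', 'Sevilla'],
                   'wins': [26, 25, 19]})

# Sorting reorders the rows but keeps the original index labels
by_wins = df.sort_values('wins')
print(by_wins.index.tolist())   # [2, 1, 0]

print(by_wins.loc[0, 'team'])   # RealMadrid: the row labelled 0
print(by_wins.iloc[0]['team'])  # Sevilla: the first row by position
```

If the old labels are no longer needed, `by_wins.reset_index(drop=True)` restores a default range index so that labels and positions agree again.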