# Accessing Data within Pandas - Lab

## Introduction

In this lab, we'll look at a data set that contains information about World Cup matches. Let's use the pandas commands learned in the previous lecture to learn more about our data!

## Objectives
You will be able to:
* Understand and explain some key Pandas methods
* Access DataFrame data by using the label
* Perform boolean indexing on both Series and DataFrames
* Use simple selectors for Series
* Set new Series and DataFrame inputs

## Load the data

Load the file `WorldCupMatches.csv` as a DataFrame in Pandas.

In [1]:
import pandas as pd

df = pd.read_csv("WorldCupMatches.csv")


## Common methods and attributes

Use the correct method to look at the first 7 rows of the data set.

In [8]:
df.head(7)

Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,Attendance,Half-time Home Goals,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials
0,1930,13 Jul 1930 - 15:00,Group 1,Pocitos,Montevideo,France,4,1,Mexico,,4444.0,3,0,LOMBARDI Domingo (URU),CRISTOPHE Henry (BEL),REGO Gilberto (BRA),201,1096,FRA,MEX
1,1930,13 Jul 1930 - 15:00,Group 4,Parque Central,Montevideo,USA,3,0,Belgium,,18346.0,2,0,MACIAS Jose (ARG),MATEUCCI Francisco (URU),WARNKEN Alberto (CHI),201,1090,USA,BEL
2,1930,14 Jul 1930 - 12:45,Group 2,Parque Central,Montevideo,Yugoslavia,2,1,Brazil,,24059.0,2,0,TEJADA Anibal (URU),VALLARINO Ricardo (URU),BALWAY Thomas (FRA),201,1093,YUG,BRA
3,1930,14 Jul 1930 - 14:50,Group 3,Pocitos,Montevideo,Romania,3,1,Peru,,2549.0,1,0,WARNKEN Alberto (CHI),LANGENUS Jean (BEL),MATEUCCI Francisco (URU),201,1098,ROU,PER
4,1930,15 Jul 1930 - 16:00,Group 1,Parque Central,Montevideo,Argentina,1,0,France,,23409.0,0,0,REGO Gilberto (BRA),SAUCEDO Ulises (BOL),RADULESCU Constantin (ROU),201,1085,ARG,FRA
5,1930,16 Jul 1930 - 14:45,Group 1,Parque Central,Montevideo,Chile,3,0,Mexico,,9249.0,1,0,CRISTOPHE Henry (BEL),APHESTEGUY Martin (URU),LANGENUS Jean (BEL),201,1095,CHI,MEX
6,1930,17 Jul 1930 - 12:45,Group 2,Parque Central,Montevideo,Yugoslavia,4,0,Bolivia,,18306.0,0,0,MATEUCCI Francisco (URU),LOMBARDI Domingo (URU),WARNKEN Alberto (CHI),201,1092,YUG,BOL


Look at the last 3 rows of the data set.

In [9]:
df.tail(3)

Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,Attendance,Half-time Home Goals,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials
849,2014,09 Jul 2014 - 17:00,Semi-finals,Arena de Sao Paulo,Sao Paulo,Netherlands,0,0,Argentina,Argentina win on penalties (2 - 4),63267.0,0,0,Cüneyt ÇAKIR (TUR),DURAN Bahattin (TUR),ONGUN Tarik (TUR),255955,300186490,NED,ARG
850,2014,12 Jul 2014 - 17:00,Play-off for third place,Estadio Nacional,Brasilia,Brazil,0,3,Netherlands,,68034.0,0,2,HAIMOUDI Djamel (ALG),ACHIK Redouane (MAR),ETCHIALI Abdelhak (ALG),255957,300186502,BRA,NED
851,2014,13 Jul 2014 - 16:00,Final,Estadio do Maracana,Rio De Janeiro,Germany,1,0,Argentina,Germany win after extra time,74738.0,0,0,Nicola RIZZOLI (ITA),Renato FAVERANI (ITA),Andrea STEFANI (ITA),255959,300186501,GER,ARG


Get a concise summary of your data using `.info()`

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 852 entries, 0 to 851
Data columns (total 20 columns):
Year                    852 non-null int64
Datetime                852 non-null object
Stage                   852 non-null object
Stadium                 852 non-null object
City                    852 non-null object
Home Team Name          852 non-null object
Home Team Goals         852 non-null int64
Away Team Goals         852 non-null int64
Away Team Name          852 non-null object
Win conditions          852 non-null object
Attendance              850 non-null float64
Half-time Home Goals    852 non-null int64
Half-time Away Goals    852 non-null int64
Referee                 852 non-null object
Assistant 1             852 non-null object
Assistant 2             852 non-null object
RoundID                 852 non-null int64
MatchID                 852 non-null int64
Home Team Initials      852 non-null object
Away Team Initials      852 non-null object
dtypes: float64(1), i

Obtain a tuple representing the number of rows and number of columns

In [12]:
df.shape

(852, 20)

Use the appropriate attribute to get the column names

In [13]:
df.columns

Index(['Year', 'Datetime', 'Stage', 'Stadium', 'City', 'Home Team Name',
       'Home Team Goals', 'Away Team Goals', 'Away Team Name',
       'Win conditions', 'Attendance', 'Half-time Home Goals',
       'Half-time Away Goals', 'Referee', 'Assistant 1', 'Assistant 2',
       'RoundID', 'MatchID', 'Home Team Initials', 'Away Team Initials'],
      dtype='object')
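
As a quick recap of the inspection tools used in this section, here is a minimal sketch on a small, made-up DataFrame. The name `toy` and its values are illustrative assumptions, not rows from `WorldCupMatches.csv`. Note that `.head()` and `.tail()` are methods, while `.shape` and `.columns` are attributes and take no parentheses.

```python
import pandas as pd

# Hypothetical three-match frame; values are illustrative, not taken from the CSV.
toy = pd.DataFrame({
    "Year": [1930, 1930, 1934],
    "Home Team Name": ["France", "USA", "Italy"],
    "Home Team Goals": [4, 3, 7],
})

first_two = toy.head(2)           # method: first 2 rows
last_one = toy.tail(1)            # method: last row
n_rows, n_cols = toy.shape        # attribute: (rows, columns) tuple
column_names = list(toy.columns)  # attribute: Index of column labels

print(first_two)
print(last_one)
print(n_rows, n_cols)
print(column_names)
```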

## Selecting dataframe information

When looking at the dataframe's `.head()`, you might have noticed that the games are structured chronologically in the dataframe.

Use the right selection method to print all the information from the 3rd to the 5th game.

In [14]:
df[2:5]

Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,Attendance,Half-time Home Goals,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials
2,1930,14 Jul 1930 - 12:45,Group 2,Parque Central,Montevideo,Yugoslavia,2,1,Brazil,,24059.0,2,0,TEJADA Anibal (URU),VALLARINO Ricardo (URU),BALWAY Thomas (FRA),201,1093,YUG,BRA
3,1930,14 Jul 1930 - 14:50,Group 3,Pocitos,Montevideo,Romania,3,1,Peru,,2549.0,1,0,WARNKEN Alberto (CHI),LANGENUS Jean (BEL),MATEUCCI Francisco (URU),201,1098,ROU,PER
4,1930,15 Jul 1930 - 16:00,Group 1,Parque Central,Montevideo,Argentina,1,0,France,,23409.0,0,0,REGO Gilberto (BRA),SAUCEDO Ulises (BOL),RADULESCU Constantin (ROU),201,1085,ARG,FRA
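
Because row positions start at 0, the 3rd to 5th games sit at positions 2, 3 and 4. The sketch below, on a made-up frame (`toy` is an assumption for illustration), suggests that a plain integer slice in square brackets selects rows by position and excludes the stop value, just like `.iloc`.

```python
import pandas as pd

# Hypothetical frame with six "games" labelled a-f.
toy = pd.DataFrame({"match": ["a", "b", "c", "d", "e", "f"]})

by_brackets = toy[2:5]     # positions 2, 3, 4 (the stop position 5 is excluded)
by_iloc = toy.iloc[2:5]    # the explicit positional equivalent

print(by_brackets)
print(by_brackets.equals(by_iloc))
```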


Now select games 5 through 10 by index label, printing only the "Home Team Name" and "Away Team Name" columns.

In [19]:
df.loc[5:10,["Home Team Name","Away Team Name"]]

Unnamed: 0,Home Team Name,Away Team Name
5,Chile,Mexico
6,Yugoslavia,Bolivia
7,USA,Paraguay
8,Uruguay,Peru
9,Chile,France
10,Argentina,Mexico
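
One detail worth keeping in mind: `.loc` slices by label and includes both endpoints, whereas `.iloc` slices by position and excludes the stop. A minimal sketch on a made-up frame (the column names `home`, `away` and `goals` are hypothetical):

```python
import pandas as pd

# Hypothetical eight-row frame with a default 0..7 index.
toy = pd.DataFrame({
    "home": list("abcdefgh"),
    "away": list("ijklmnop"),
    "goals": range(8),
})

label_slice = toy.loc[2:4, ["home", "away"]]       # labels 2, 3 and 4
position_slice = toy.iloc[2:4][["home", "away"]]   # positions 2 and 3 only

print(label_slice)
print(position_slice)
```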


Next, we'd like the information on all the games played in Group 3 of the 1950 World Cup. For now, we only want to print out the attendance column for those Group 3 games.

In [24]:
df.loc[(df.Year == 1950) & (df.Stage == "Group 3"), "Attendance"]

56    36502.0
61     7903.0
65    25811.0
Name: Attendance, dtype: float64

You can combine conditions like this:

`df[(condition1) | (condition2)]`  -> Returns rows where either condition is true

`df[(condition1) & (condition2)]`  -> Returns rows where both conditions are true
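
A minimal sketch of both combinations on a made-up frame follows (`toy` and its attendance figures are illustrative assumptions). The parentheses around each condition matter, because `&` and `|` bind more tightly than `==` in Python.

```python
import pandas as pd

# Hypothetical matches; attendance values are invented for illustration.
toy = pd.DataFrame({
    "Year": [1950, 1950, 1954, 1950],
    "Stage": ["Group 3", "Group 1", "Group 3", "Group 3"],
    "Attendance": [100.0, 200.0, 300.0, 400.0],
})

is_1950 = toy["Year"] == 1950
is_group_3 = toy["Stage"] == "Group 3"

both = toy[is_1950 & is_group_3]      # rows where both conditions hold
either = toy[is_1950 | is_group_3]    # rows where at least one holds

# Passing a column label to .loc returns only that column for the matching rows.
attendance_only = toy.loc[is_1950 & is_group_3, "Attendance"]

print(both)
print(either)
print(attendance_only)
```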

In [26]:
df.head(1)

Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,Attendance,Half-time Home Goals,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials
0,1930,13 Jul 1930 - 15:00,Group 1,Pocitos,Montevideo,France,4,1,Mexico,,4444.0,3,0,LOMBARDI Domingo (URU),CRISTOPHE Henry (BEL),REGO Gilberto (BRA),201,1096,FRA,MEX


Throughout the entire history of the World Cup, how many home games were played by the Netherlands?

In [76]:
df[df["Home Team Name"] == "Netherlands"]["Home Team Name"]


244    Netherlands
258    Netherlands
265    Netherlands
269    Netherlands
277    Netherlands
285    Netherlands
295    Netherlands
302    Netherlands
422    Netherlands
472    Netherlands
504    Netherlands
509    Netherlands
525    Netherlands
542    Netherlands
557    Netherlands
569    Netherlands
574    Netherlands
578    Netherlands
665    Netherlands
682    Netherlands
716    Netherlands
731    Netherlands
760    Netherlands
764    Netherlands
771    Netherlands
805    Netherlands
829    Netherlands
830    Netherlands
832    Netherlands
838    Netherlands
847    Netherlands
849    Netherlands
Name: Home Team Name, dtype: object
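
Rather than counting the printed rows by hand, one option is to sum the boolean mask, since `True` counts as 1. This is a sketch on made-up data (the `toy` frame is an assumption, not part of the data set):

```python
import pandas as pd

# Hypothetical fixtures for illustration only.
toy = pd.DataFrame({
    "Home Team Name": ["Netherlands", "Brazil", "Netherlands", "Chile"],
    "Away Team Name": ["Spain", "Netherlands", "Chile", "Spain"],
})

is_home = toy["Home Team Name"] == "Netherlands"
is_away = toy["Away Team Name"] == "Netherlands"

home_games = int(is_home.sum())               # matches played at home
total_games = int((is_home | is_away).sum())  # matches played home or away

print(home_games, total_games)
```

Applied to the real `df`, the same pattern yields a single number instead of a Series of names or a per-column count.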

How many games were played by the Netherlands in total?

In [79]:
df[(df["Home Team Name"] == "Netherlands") | (df["Away Team Name"] == "Netherlands")].count()

Year                    54
Datetime                54
Stage                   54
Stadium                 54
City                    54
Home Team Name          54
Home Team Goals         54
Away Team Goals         54
Away Team Name          54
Win conditions          54
Attendance              54
Half-time Home Goals    54
Half-time Away Goals    54
Referee                 54
Assistant 1             54
Assistant 2             54
RoundID                 54
MatchID                 54
Home Team Initials      54
Away Team Initials      54
dtype: int64

In [62]:
df.Attendance

0        4444.0
1       18346.0
2       24059.0
3        2549.0
4       23409.0
5        9249.0
6       18306.0
7       18306.0
8       57735.0
9        2000.0
10      42100.0
11      25466.0
12      12000.0
13      70022.0
14      41459.0
15      72886.0
16      79867.0
17      68346.0
18      16000.0
19       9000.0
20      33000.0
21      14000.0
22       8000.0
23      21000.0
24      25000.0
25       9000.0
26      12000.0
27       3000.0
28      35000.0
29      23000.0
30      43000.0
31      35000.0
32      15000.0
33       7000.0
34      55000.0
35      27152.0
36       9000.0
37      30454.0
38       7000.0
39      19000.0
40      13452.0
41      11000.0
42       8000.0
43      20025.0
44      22021.0
45      15000.0
46       7000.0
47      58455.0
48      18141.0
49      20000.0
50      33000.0
51      12000.0
52      45000.0
53      81649.0
54      29703.0
55       9511.0
56      36502.0
57       7336.0
58      42032.0
59      11078.0
60      19790.0
61       7903.0
62      

Next, let's try to figure out how many games the USA played in the 2014 World Cup.

In [83]:
df[((df["Home Team Name"] == "USA") | (df["Away Team Name"] == "USA")) & (df.Year == 2014)].count()

Year                    5
Datetime                5
Stage                   5
Stadium                 5
City                    5
Home Team Name          5
Home Team Goals         5
Away Team Goals         5
Away Team Name          5
Win conditions          5
Attendance              5
Half-time Home Goals    5
Half-time Away Goals    5
Referee                 5
Assistant 1             5
Assistant 2             5
RoundID                 5
MatchID                 5
Home Team Initials      5
Away Team Initials      5
dtype: int64

Now, let's try to find out how many countries participated in the 1986 World Cup.

Hint 1: as a first step, create a new data set that only contains games from that year.

Hint 2: You can use `.unique()` to make sure you don't end up with duplicate country names.

In [84]:
df1986 = df[df.Year == 1986]

In [92]:
pd.concat([df1986["Away Team Name"], df1986["Home Team Name"]]).nunique()

24
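
The two hints can be sketched on a made-up frame as follows (`toy` and its fixtures are illustrative assumptions). `pd.concat` stacks the two name columns into one Series; it is used here because `Series.append` was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical fixtures; not rows from the real data set.
toy = pd.DataFrame({
    "Year": [1986, 1986, 1990],
    "Home Team Name": ["Italy", "Mexico", "Italy"],
    "Away Team Name": ["Bulgaria", "Italy", "Austria"],
})

# Hint 1: keep only the games from one year.
toy_1986 = toy[toy["Year"] == 1986]

# Hint 2: stack home and away names, then drop duplicates.
teams = pd.concat([toy_1986["Home Team Name"], toy_1986["Away Team Name"]])
team_names = sorted(teams.unique())
n_teams = teams.nunique()

print(team_names, n_teams)
```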

In World Cup history, how many matches had more than 5 goals in total?

In [101]:
df.head(0)

Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,...,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials,Total_Goals,Half-time Goals


In [95]:
df["Total_Goals"] = df["Home Team Goals"] + df["Away Team Goals"]

In [97]:
df.Total_Goals[lambda count : count>5].count()

74
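
An alternative to selecting with a lambda and calling `.count()` is to sum the boolean mask directly. The sketch below uses made-up scores (the `toy` frame is an assumption) and checks that both approaches agree when the column has no missing values.

```python
import pandas as pd

# Hypothetical scores for four matches.
toy = pd.DataFrame({
    "Home Team Goals": [4, 3, 2, 7],
    "Away Team Goals": [1, 3, 1, 0],
})
toy["Total_Goals"] = toy["Home Team Goals"] + toy["Away Team Goals"]

# True counts as 1, so summing the mask counts the matching rows.
high_scoring = int((toy["Total_Goals"] > 5).sum())

# The callable-selector form used in the lab, for comparison.
via_lambda = int(toy["Total_Goals"][lambda goals: goals > 5].count())

print(high_scoring, via_lambda)
```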

## Changing values and creating new columns

With the information you currently have in your `df`, create a new column "Half-time Goals".

In [2]:
df["Half-time Goals"] = df["Half-time Home Goals"]  + df["Half-time Away Goals"]

In [3]:
df[df["Half-time Goals"]!=0].loc[:,["Half-time Goals","Half-time Home Goals","Half-time Away Goals"]]

Unnamed: 0,Half-time Goals,Half-time Home Goals,Half-time Away Goals
0,3,3,0
1,2,2,0
2,2,2,0
3,1,1,0
5,1,1,0
7,2,2,0
10,4,3,1
11,1,1,0
12,1,1,0
13,4,4,0


Run the code below. You'll notice that for Korea, there are records for both North Korea (Korea DPR) and South Korea (Korea Republic).

In [4]:
df.loc[df["Home Team Name"].str.contains('Korea'), "Home Team Name" ]

179         Korea DPR
187         Korea DPR
374    Korea Republic
386    Korea Republic
434    Korea Republic
444    Korea Republic
480    Korea Republic
524    Korea Republic
593    Korea Republic
609    Korea Republic
635    Korea Republic
642    Korea Republic
655    Korea Republic
710    Korea Republic
753         Korea DPR
802    Korea Republic
818    Korea Republic
Name: Home Team Name, dtype: object

Imagine that, for some reason, we simply want Korea listed as one entry, so we want to replace every "Home Team Name" and "Away Team Name" entry that contains "Korea" with simply "Korea". In the same way, we want to change the values "KOR" and "PRK" in the columns "Home Team Initials" and "Away Team Initials" to "NSK" (North & South Korea).

In [6]:
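def to_korea(to_change):
    # Intended for single values, e.g. df["Home Team Name"].apply(to_korea).
    # Python strings have no .contains() method, so use the `in` operator.
    string = str(to_change)
    if "Korea" in string:
        return "Korea"
    # Leave every other value unchanged instead of returning None.
    return to_change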

Make sure to verify your answer!

In [13]:
df["Home Team Name"][df["Home Team Name"].str.contains("Korea")].map(lambda a : "Korea")

179    Korea
187    Korea
374    Korea
386    Korea
434    Korea
444    Korea
480    Korea
524    Korea
593    Korea
609    Korea
635    Korea
642    Korea
655    Korea
710    Korea
753    Korea
802    Korea
818    Korea
Name: Home Team Name, dtype: object
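
The `.map` call above only displays the replaced values; it does not store them in `df`. One way to write the changes back is to assign through `.loc` with a boolean mask, and to use `Series.replace` for the initials. This is a sketch on a made-up frame (`toy` is an illustrative assumption), not necessarily the intended solution:

```python
import pandas as pd

# Hypothetical fixtures using the same column names as the lab's data set.
toy = pd.DataFrame({
    "Home Team Name": ["Korea DPR", "Italy", "Korea Republic"],
    "Away Team Name": ["Chile", "Korea Republic", "Spain"],
    "Home Team Initials": ["PRK", "ITA", "KOR"],
    "Away Team Initials": ["CHI", "KOR", "ESP"],
})

# Overwrite every team name containing "Korea" with the single label "Korea".
for col in ["Home Team Name", "Away Team Name"]:
    mask = toy[col].str.contains("Korea")
    toy.loc[mask, col] = "Korea"

# Map both sets of initials onto "NSK"; other initials are left untouched.
for col in ["Home Team Initials", "Away Team Initials"]:
    toy[col] = toy[col].replace({"KOR": "NSK", "PRK": "NSK"})

print(toy)
```

Assigning through `.loc` modifies the DataFrame itself, which makes it straightforward to verify the result afterwards with the same `str.contains` filter.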

In [16]:
# Store the filtered names under a new name; reassigning `df` would replace the DataFrame with a Series.
korea_home = df["Home Team Name"][df["Home Team Name"].str.contains("Korea")]

In [9]:
df.apply(to_korea)[df['Home Team Name'].str.contains("Korea")]

Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,...,Half-time Home Goals,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials,Half-time Goals
179,1966,15 Jul 1966 - 19:30,Group 4,Ayresome Park,Middlesbrough,Korea DPR,1,1,Chile,,...,0,1,KANDIL Aly Hussein (EGY),CRAWFORD William (SCO),FINNEY Jim (ENG),238,1609,PRK,CHI,1
187,1966,19 Jul 1966 - 19:30,Group 4,Ayresome Park,Middlesbrough,Korea DPR,1,0,Italy,,...,1,0,SCHWINTE Pierre (FRA),ADAIR John (NIR),TAYLOR John (ENG),238,1679,PRK,ITA,1
374,1986,05 Jun 1986 - 16:00,Group A,Estadio Olímpico Universitario,Mexico City,Korea Republic,1,1,Bulgaria,,...,0,1,AL SHANAR Fallaj Khuzam (KSA),IGNA Ioan (ROU),BUTENKO Valeri (RUS),308,460,KOR,BUL,1
386,1986,10 Jun 1986 - 12:00,Group A,Cuauhtemoc,Puebla,Korea Republic,2,3,Italy,,...,0,1,SOCHA David (USA),URREA Joaquin (MEX),AL SHARIF Jamal (SYR),308,643,KOR,ITA,1
434,1990,17 Jun 1990 - 21:00,Group E,Dacia Arena,Udine,Korea Republic,1,3,Spain,,...,1,1,JACOME GUERRERO Elias V. (ECU),MAGNI Pierluigi (ITA),LOUSTAU Juan (ARG),322,175,KOR,ESP,2
444,1990,21 Jun 1990 - 17:00,Group E,Friuli,Udine,Korea Republic,0,1,Uruguay,,...,0,0,LANESE Tullio (ITA),DIRAMBA Jean Fidele (GAB),JOUINI Neji (TUN),322,290,KOR,URU,0
480,1994,23 Jun 1994 - 19:30,Group C,Foxboro Stadium,Boston,Korea Republic,0,0,Bolivia,,...,0,0,MOTTRAM Leslie (SCO),MATTHYS Luc (BEL),EVERSTIG Mikael (SWE),337,3065,KOR,BOL,0
524,1998,13 Jun 1998 - 17:30,Group E,Stade de Gerland,Lyon,Korea Republic,1,3,Mexico,,...,1,0,BENKO Gunter (AUT),FRED Lencie (VAN),SCHNEIDER Erich (GER),1014,8732,KOR,MEX,1
593,2002,04 Jun 2002 - 20:30,Group D,Busan Asiad Main Stadium,Busan,Korea Republic,2,0,Poland,,...,1,0,RUIZ Oscar (COL),DORIRI Elise (VAN),LINDBERG Leif (SWE),43950100,43950014,KOR,POL,1
609,2002,10 Jun 2002 - 15:30,Group D,Daegu World Cup Stadium,Daegu,Korea Republic,1,1,USA,,...,0,1,MEIER Urs (SUI),BEREUTER Egon (AUT),TOMUSANGE Ali (UGA),43950100,43950030,KOR,USA,1


## Summary

In this lab, you learned how to access data within Pandas!