# Problem 13: Soccer (a.k.a., the Real Football) Guru

_Version 1.5_

Soccer season is on and teams need to start preparing for the World Cup 2022. We need your help as a **Soccer Guru** to analyse different statistics and come up with insights to help the teams prepare better.

This problem tests your understanding of Pandas and SQL concepts.

**Important note.** Due to a limitation in Vocareum's software stack, this notebook is set to use the Python 3.5 kernel (rather than a more up-to-date 3.6 or 3.7 kernel). If you are developing on your local machine and are using a different version of Python, you may need to adapt your solution before submitting to the autograder.


**Exercise 0** (0 points). Run the code cell below to load the data, which is a SQLite3 database containing results and fixtures of various soccer matches that have been played around the globe since 1980.

Observe that the code loads all rows from the table, `soccer_results`, contained in the database file, `prob0.db`.

> You do not need to do anything for this problem other than run the next two code cells and familiarize yourself with the resulting dataframe, which is stored in the variable `df`.

In [87]:
import sqlite3 as db
import pandas as pd
from datetime import datetime
from collections import defaultdict
disk_engine = db.connect('file:resource/asnlib/publicdata/prob0.db?mode=ro', uri=True)

def load_data():
    df = pd.read_sql_query("SELECT * FROM soccer_results", disk_engine) 
    return df

In [88]:
# Test: Exercise 0 (exposed)
df = load_data()
assert df.shape[0] == 22851, "Row counts do not match. Try loading the data again"
assert df.shape[1] == 9, "You don't have all the columns. Try loading the data again"
print("\n(Passed!)")
df.head()


(Passed!)


Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1994-01-02,Barbados,Grenada,0,0,Friendly,Bridgetown,Barbados,False
1,1994-01-02,Ghana,Egypt,2,1,Friendly,Accra,Ghana,False
2,1994-01-05,Mali,Burkina Faso,1,1,Friendly,Bamako,Mali,False
3,1994-01-09,Mauritania,Mali,1,3,Friendly,Nouakchott,Mauritania,False
4,1994-01-11,Thailand,Nigeria,1,1,Friendly,Bangkok,Thailand,False


In [89]:
df.info

<bound method DataFrame.info of              date   home_team     away_team  home_score  away_score  \
0      1994-01-02    Barbados       Grenada           0           0   
1      1994-01-02       Ghana         Egypt           2           1   
2      1994-01-05        Mali  Burkina Faso           1           1   
3      1994-01-09  Mauritania          Mali           1           3   
4      1994-01-11    Thailand       Nigeria           1           1   
...           ...         ...           ...         ...         ...   
22846  2019-09-10      France       Andorra           3           0   
22847  2019-09-10     Moldova        Turkey           0           4   
22848  2019-09-18    DR Congo        Rwanda           2           3   
22849  2019-09-29  Bangladesh        Bhutan           4           1   
22850  2019-09-30    Botswana       Liberia           0           0   

                    tournament        city     country neutral  
0                     Friendly  Bridgetown    Barb

Each row of this dataframe is a game, which is played between a "home team" (column `home_team`) and an "away team" (`away_team`). The number of goals scored by each team appears in the `home_score` and `away_score` columns, respectively.

**Exercise 1** (1 point): Write an **SQL query** find the ten (10) teams that have the highest average away-scores since the year 2000. Your query should satisfy the following criteria:

- It should return two columns:
    * `team`: The name of the team
    * `ave_goals`: The team's average number of goals **in "away" games.** An "away game" is one in which the team's name appars in `away_team` **and** the game takes place at a "non-neutral site" (`neutral` value equals `FALSE`).
- It should only include teams that have played **at least 30 away matches**.
- It should round the average goals value (`ave_goals`) to three decimal places.
- It should only return the top 10 teams in descending order by average away-goals.
- It should only consider games played since 2000 (including the year 2000).

Store your query string as the variable, `query_top10_away`, below. The test cell will run this query string against the input dataframe, `df`, defined above and return the result in a dataframe named `offensive_teams`. (See the test cell.)

> **Note.** The following exercises have hidden test cases and you'll be awarded full points for passing both the exposed and hidden test cases.

In [90]:
# query_top10_away = '''SELECT away_team, COUNT(away_team) AS games_played, AVG(away_score) AS avg_goals FROM soccer_results
#                         WHERE strftime('%Y',date)>= 2000 AND neutral = "FALSE"
#                         GROUP BY 1 HAVING COUNT(away_team) >= 30 ORDER BY 2 ASC'''

query_top10_away = '''SELECT away_team AS team, ROUND(AVG(away_score),3) AS ave_goals FROM soccer_results
                        WHERE date >= '2000-01-01' AND neutral = "FALSE"
                        GROUP BY 1 HAVING COUNT(away_team) >= 30 ORDER BY 2 DESC LIMIT 10'''

###
df = pd.read_sql_query(query_top10_away, disk_engine)
display(df)
###

print(query_top10_away)

Unnamed: 0,team,ave_goals
0,Germany,2.17
1,Brazil,2.01
2,Spain,1.927
3,England,1.763
4,Netherlands,1.742
5,France,1.639
6,Portugal,1.579
7,Argentina,1.56
8,Saudi Arabia,1.54
9,Denmark,1.534


SELECT away_team AS team, ROUND(AVG(away_score),3) AS ave_goals FROM soccer_results
                        WHERE date >= '2000-01-01' AND neutral = "FALSE"
                        GROUP BY 1 HAVING COUNT(away_team) >= 30 ORDER BY 2 DESC LIMIT 10


In [91]:
# Test: Exercise 1 (exposed)
offensive_teams = pd.read_sql_query(query_top10_away, disk_engine)
df_cols = offensive_teams.columns.tolist()
df_cols.sort()
desired_cols = ['team', 'ave_goals']
desired_cols.sort()
print(offensive_teams.head(10))
assert offensive_teams.shape[0] == 10, "Expected 10 rows but returned dataframe has {}".format(offensive_teams.shape[0])
assert offensive_teams.shape[1] == 2, "Expected 2 columns but returned dataframe has {}".format(offensive_teams.shape[1])
assert df_cols == desired_cols, "Column names should be: {}. Returned dataframe has: {}".format(desired_cols, df_cols)

tolerance = .001
team_4 = offensive_teams.iloc[3].team
team_4_ave = offensive_teams.iloc[3].ave_goals
desired_team_4_ave = 1.763
assert (team_4 == "England" and abs(team_4_ave - 1.763) <= .001), "Fourth entry is {} with average of {}. Got {} with average of {}".format("England", 1.76, team_4, team_4_ave)

print("\n(Passed!)")

           team  ave_goals
0       Germany      2.170
1        Brazil      2.010
2         Spain      1.927
3       England      1.763
4   Netherlands      1.742
5        France      1.639
6      Portugal      1.579
7     Argentina      1.560
8  Saudi Arabia      1.540
9       Denmark      1.534

(Passed!)


In [92]:
# Hidden test cell: exercise1_hidden

print("""
In addition to the tests above, this cell will include some hidden tests.
You will only know the result when you submit your solution to the
autograder.
""")

###
### AUTOGRADER TEST - DO NOT REMOVE
###



In addition to the tests above, this cell will include some hidden tests.
You will only know the result when you submit your solution to the
autograder.



**Exercise 2** (2 points): Suppose we are now interested in the top 10 teams having the best goal **differential**, between the years 2012 and 2018 (both inclusive). A team's goal differential is the difference between the total number of goals it scored and the total number it conceded across all games (in the requested years).

Complete the function, `best_goal_differential()`, below, so that it returns a pandas dataframe containing the top 10 teams by goal differential, sorted in descending order of differential. The dataframe should have two columns: `team`, which holds the team's name, and `differential`, which holds its overall goal differential.

> As a sanity check, you should find the Brazil is the number one team, with a differential of 152 during the selected time period of 2012-2018 (inclusive). It should be the first row of the returned dataframe.

In [93]:
# query = '''SELECT home_team, away_team, SUM((home_score - away_score)) AS best_goal_differential,
#             CASE 
#                WHEN home_score > away_score THEN 'home win'
#                WHEN home_score = away_score THEN 'tie'
#                ELSE 'home lost'
#            END AS win_loss             
#            FROM soccer_results
#              WHERE date BETWEEN "2012-01-01" AND "2018-12-31
#            GROUP BY 1, 2
#            ORDER by 3 DESC'''

In [94]:
query = '''SELECT team, SUM(best_goal_differential) AS best_goal_differential FROM
            (SELECT date, home_team AS team, (home_score - away_score) AS best_goal_differential FROM soccer_results 
            UNION 
            SELECT date, away_team as team, (away_score - home_score) AS best_goal_differential FROM soccer_results)
            WHERE date BETWEEN "2012-01-01" AND "2018-12-31"
            GROUP BY team
            ORDER BY 2 DESC
            LIMIT 10
            '''
df = pd.read_sql_query(query, disk_engine)
display(df)

Unnamed: 0,team,best_goal_differential
0,Brazil,152
1,Spain,147
2,Belgium,119
3,Germany,113
4,France,98
5,Iran,91
6,England,90
7,Portugal,87
8,Argentina,86
9,Japan,81


In [95]:
def best_goal_differential():
    query = '''SELECT team, SUM(best_goal_differential) AS differential FROM
            (SELECT date, home_team AS team, (home_score - away_score) AS best_goal_differential FROM soccer_results 
            UNION 
            SELECT date, away_team as team, (away_score - home_score) AS best_goal_differential FROM soccer_results)
            WHERE date BETWEEN "2012-01-01" AND "2018-12-31"
            GROUP BY team
            ORDER BY 2 DESC
            LIMIT 10
            '''
    df = pd.read_sql_query(query, disk_engine)
    return df


In [96]:
# Test: Exercise 2 (exposed)

diff_df = best_goal_differential()
df_cols = diff_df.columns.tolist()
df_cols.sort()
desired_cols = ['team', 'differential']
desired_cols.sort()

assert isinstance(diff_df, pd.DataFrame), "Dataframe object not returned"
assert diff_df.shape[0] == 10, "Expected 10 rows but returned dataframe has {}".format(diff_df.shape[0])
assert diff_df.shape[1] == 2, "Expected 2 columns but returned dataframe has {}".format(diff_df.shape[1])
assert df_cols == desired_cols, "Column names should be: {}. Returned dataframe has: {}".format(desired_cols, df_cols)

best_team = diff_df.iloc[0].team
best_diff = diff_df.iloc[0].differential
assert (best_team == "Brazil" and best_diff == 152), "{} has best differential of {}. Got team {} having best differential of {}".format("Brazil", 152, best_team, best_diff)

print("\n(Passed!)")


(Passed!)


In [97]:
# Hidden test cell: exercise2_hidden

print("""
In addition to the tests above, this cell will include some hidden tests.
You will only know the result when you submit your solution to the
autograder.
""")

###
### AUTOGRADER TEST - DO NOT REMOVE
###



In addition to the tests above, this cell will include some hidden tests.
You will only know the result when you submit your solution to the
autograder.



**Exercise 3** (1 point). Complete the function, `determine_winners(game_df)`, below. It should determine the winner of each soccer game.

In particular, the function should take in a dataframe like `df` from above. It should return a new dataframe consisting of all the columns from that dataframe plus a new columnn called **`winner`**, holding the name of the winning team. If there is no winner for a particular game (i.e., the score is tied), then the `winner` column should containing the string, `'Draw'`. Lastly, the rows of the output should be in the same order as the input dataframe.

You can use any dataframe manipulation techniques you want for this question _(i.e., pandas methods or SQL queries, as you prefer)._

> You'll need the output dataframe from this exercise for the subsequent exercies, so don't skip this one!

In [98]:
query = '''SELECT *, 
                CASE
                    WHEN home_score > away_score THEN home_team
                    WHEN home_score = away_score THEN 'Draw'
                    ELSE away_team
                END AS winner
            FROM soccer_results
'''
df = pd.read_sql_query(query, disk_engine)
display(df)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,winner
0,1994-01-02,Barbados,Grenada,0,0,Friendly,Bridgetown,Barbados,FALSE,Draw
1,1994-01-02,Ghana,Egypt,2,1,Friendly,Accra,Ghana,FALSE,Ghana
2,1994-01-05,Mali,Burkina Faso,1,1,Friendly,Bamako,Mali,FALSE,Draw
3,1994-01-09,Mauritania,Mali,1,3,Friendly,Nouakchott,Mauritania,FALSE,Mali
4,1994-01-11,Thailand,Nigeria,1,1,Friendly,Bangkok,Thailand,FALSE,Draw
...,...,...,...,...,...,...,...,...,...,...
22846,2019-09-10,France,Andorra,3,0,UEFA Euro qualification,Paris,France,FALSE,France
22847,2019-09-10,Moldova,Turkey,0,4,UEFA Euro qualification,Chișinău,Moldova,FALSE,Turkey
22848,2019-09-18,DR Congo,Rwanda,2,3,Friendly,Kinshasa,DR Congo,TRUE,Rwanda
22849,2019-09-29,Bangladesh,Bhutan,4,1,Friendly,Dhaka,Bangladesh,FALSE,Bangladesh


In [99]:
def determine_winners(game_df):
    query = '''SELECT *, 
                    CASE
                        WHEN home_score > away_score THEN home_team
                        WHEN home_score = away_score THEN 'Draw'
                        ELSE away_team
                    END AS winner
                FROM soccer_results
    '''
    df = pd.read_sql_query(query, disk_engine)
    return df


In [100]:
# Test: Exercise 3 (exposed)

game_df = pd.read_sql_query("SELECT * FROM soccer_results", disk_engine)
winners_df = determine_winners(game_df)

game_winner = winners_df.iloc[1].winner
assert game_winner == "Ghana", "Expected Ghana to be winner. Got {}".format(game_winner)

game_winner = winners_df.iloc[2].winner
assert game_winner == "Draw", "Match was Draw. Got {}".format(game_winner)

game_winner = winners_df.iloc[3].winner
assert game_winner == "Mali", "Expected Mali to be winner. Got {}".format(game_winner)

print("\n(Passed!)")


(Passed!)


In [101]:
# Hidden test cell: exercise3_hidden

print("""
In addition to the tests above, this cell will include some hidden tests.
You will only know the result when you submit your solution to the
autograder.
""")

###
### AUTOGRADER TEST - DO NOT REMOVE
###



In addition to the tests above, this cell will include some hidden tests.
You will only know the result when you submit your solution to the
autograder.



**Exercise 4** (3 points): Given a team, its _home advantage ratio_ is the number of home games it has won divided by the number of home games it has played. For this exercise, we'll try to answer the question, how important is the home advantage in soccer? It's importance is factored into draws for competitions, for example, teams wanting to play at home the second leg of the matches of great importance such as tournament knockouts. (_This exercise has a pre-requisite of finishing Exercise 3 as we'll be using the results of the dataframe from that exercise in this one._)

Complete the function, `calc_home_advantage(winners_df)`, below, so that it returns the top 5 countries, among those that have played at least 50 **home** games, having the highest home advantage ratio. It should return a dataframe with two columns, **`team`** and **`ratio`**, holding the name of the team and its home advantage ratio, respectively. The ratio should be rounded to three decimal places. The rows should be sorted in descending order of ratio. If there are two teams with the same winning ratio, the teams should appear in alphabetical order by name.

> **Note 0.** As with our definition of away-games, a team plays a home game if it is the home team (`home_team`) **and** the field is non-neutral (i.e., `neutral` is `FALSE`).
>
> **Note 1.** You should find, for example, that Brazil is the number two team, with a home advantage ratio of 0.773.

In [102]:
display(winners_df[['home_team', 'winner']])

Unnamed: 0,home_team,winner
0,Barbados,Draw
1,Ghana,Ghana
2,Mali,Draw
3,Mauritania,Mali
4,Thailand,Draw
...,...,...
22846,France,France
22847,Moldova,Turkey
22848,DR Congo,Rwanda
22849,Bangladesh,Bangladesh


In [103]:
# Filter the DataFrame to include only rows where 'neutral' is False
filtered_df = winners_df[winners_df['neutral'] == 'FALSE']

# Group by 'home_team'
grouped = filtered_df.groupby('home_team')

# Calculate total_games as the size of each group
total_games = grouped.size()

# Filter for teams with 50 or more games
total_games = total_games[total_games >= 50]

# Calculate total_wins as the count of rows where 'home_team' equals 'winner'
total_wins = filtered_df[filtered_df['home_team'] == filtered_df['winner']].groupby('home_team').size()

# Combine total_games and total_wins into a new DataFrame
result_df = pd.DataFrame({
    'total_games': total_games,
    'total_wins': total_wins
}).reset_index()

# Fill NaN values in 'total_wins' with 0 (for teams that have no wins)
result_df['total_wins'] = result_df['total_wins'].fillna(0).astype(int)

# Calculate win rate as total_wins divided by total_games
# Use np.where to avoid division by zero
result_df['win_rate'] = (result_df['total_wins'] / result_df['total_games']).round(3)

# Optionally, to handle potential division by zero, you could do:
# import numpy as np
# result_df['win_rate'] = np.where(result_df['total_games'] > 0, (result_df['total_wins'] / result_df['total_games']).round(3), 0)

df = result_df.sort_values(by='win_rate', ascending=False).head().rename(columns={'home_team': 'team', 'win_rate': 'ratio'})
df = df[['team', 'ratio']]
print(df)

         team  ratio
207     Spain  0.800
30     Brazil  0.773
106      Iran  0.742
39   Cameroon  0.739
68      Egypt  0.724


In [104]:
def calc_home_advantage(winners_df):
    # Filter the DataFrame to include only rows where 'neutral' is False
    filtered_df = winners_df[winners_df['neutral'] == 'FALSE']

    # Group by 'home_team'
    grouped = filtered_df.groupby('home_team')

    # Calculate total_games as the size of each group
    total_games = grouped.size()

    # Filter for teams with 50 or more games
    total_games = total_games[total_games >= 50]

    # Calculate total_wins as the count of rows where 'home_team' equals 'winner'
    total_wins = filtered_df[filtered_df['home_team'] == filtered_df['winner']].groupby('home_team').size()

    # Combine total_games and total_wins into a new DataFrame
    result_df = pd.DataFrame({
        'total_games': total_games,
        'total_wins': total_wins
    }).reset_index()

    # Fill NaN values in 'total_wins' with 0 (for teams that have no wins)
    result_df['total_wins'] = result_df['total_wins'].fillna(0).astype(int)

    # Calculate win rate as total_wins divided by total_games
    # Use np.where to avoid division by zero
    result_df['win_rate'] = (result_df['total_wins'] / result_df['total_games']).round(3)

    # Optionally, to handle potential division by zero, you could do:
    # import numpy as np
    # result_df['win_rate'] = np.where(result_df['total_games'] > 0, (result_df['total_wins'] / result_df['total_games']).round(3), 0)

    df = result_df.sort_values(by='win_rate', ascending=False).head().rename(columns={'home_team': 'team', 'win_rate': 'ratio'})
    df = df[['team', 'ratio']]
    return df


In [105]:
# Test: Exercise 4 (exposed)
from IPython.display import display

win_perc = calc_home_advantage(winners_df)

print("The solution, according to you:")
display(win_perc)

df_cols = win_perc.columns.tolist()
df_cols.sort()
desired_cols = ['team', 'ratio']
desired_cols.sort()

assert win_perc.shape[0] == 5, "Expected 5 rows, got {}".format(win_perc.shape[0])
assert win_perc.shape[1] == 2, "Expected 2 columns, got {}".format(win_perc.shape[1])
assert df_cols == desired_cols, "Expected {} columns but got {} columns".format(desired_cols, df_cols)

tolerance = .001
sec_team = win_perc.iloc[1].team
sec_perc = win_perc.iloc[1].ratio

assert (sec_team == "Brazil" and abs(sec_perc - .773) <= tolerance), "Second team should be {} with ratio of {}. \
Got {} with ratio of {}".format("Brazil", .773, sec_team, sec_perc)

print("\n(Passed!)")

The solution, according to you:


Unnamed: 0,team,ratio
207,Spain,0.8
30,Brazil,0.773
106,Iran,0.742
39,Cameroon,0.739
68,Egypt,0.724



(Passed!)


In [106]:
# Hidden test cell: exercise4_hidden

print("""
In addition to the tests above, this cell will include some hidden tests.
You will only know the result when you submit your solution to the
autograder.
""")

###
### AUTOGRADER TEST - DO NOT REMOVE
###



In addition to the tests above, this cell will include some hidden tests.
You will only know the result when you submit your solution to the
autograder.



**Exercise 5** (3 points) Now, we've seen how much the home advantage plays in, let us see how the results have looked 
like in the previous tournaments, for the specific case of the FIFA World Cup matches.

In particular, complete the function, `points_table(winners_df, wc_year)`, below, so that it does the following:
- It should take as input a dataframe, `winners_df`, having a "winner" column like that produced in Exercise 3, as well as a target year, `wc_year`.
- It should consider only games in the given target year. Furthermore, it should only consider games where the `tournament` column has the value `"FIFA World Cup"`.
- It should construct and return a "points table". This table should have two columns, **`team`**, containing the team name, and **`points`**, containing a points tally has defined below.
- To compute the points, give the team 3 points for every win, 1 point for every draw, and 0 points (no points) for a loss.
- In case of a tie in the points, sort the teams alphabetically

As an example output, for the 1998 FIFA World Cup, the points table is:

| team        | points |
|-------------|--------|
| France      | 19     |
| Croatia     | 15     |
| Brazil      | 13     |
| Netherlands | 12     |
| Italy       | 11     |

In [107]:
import numpy as np
display(winners_df)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,winner
0,1994-01-02,Barbados,Grenada,0,0,Friendly,Bridgetown,Barbados,FALSE,Draw
1,1994-01-02,Ghana,Egypt,2,1,Friendly,Accra,Ghana,FALSE,Ghana
2,1994-01-05,Mali,Burkina Faso,1,1,Friendly,Bamako,Mali,FALSE,Draw
3,1994-01-09,Mauritania,Mali,1,3,Friendly,Nouakchott,Mauritania,FALSE,Mali
4,1994-01-11,Thailand,Nigeria,1,1,Friendly,Bangkok,Thailand,FALSE,Draw
...,...,...,...,...,...,...,...,...,...,...
22846,2019-09-10,France,Andorra,3,0,UEFA Euro qualification,Paris,France,FALSE,France
22847,2019-09-10,Moldova,Turkey,0,4,UEFA Euro qualification,Chișinău,Moldova,FALSE,Turkey
22848,2019-09-18,DR Congo,Rwanda,2,3,Friendly,Kinshasa,DR Congo,TRUE,Rwanda
22849,2019-09-29,Bangladesh,Bhutan,4,1,Friendly,Dhaka,Bangladesh,FALSE,Bangladesh


In [112]:
display(winners_df[(winners_df['home_team'] == 'Croatia') & (winners_df['date'].str.contains('1998'))])

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,winner
3100,1998-04-22,Croatia,Poland,4,1,Friendly,Osijek,Croatia,False,Croatia
3177,1998-06-03,Croatia,Iran,2,0,Friendly,Rijeka,Croatia,False,Croatia
3193,1998-06-06,Croatia,Australia,7,0,Friendly,Zagreb,Croatia,False,Croatia
3509,1998-10-14,Croatia,North Macedonia,3,2,UEFA Euro qualification,Zagreb,Croatia,False,Croatia


In [108]:
import pandas as pd
import numpy as np

# # Convert 'date' column to datetime format
# winners_df['date'] = pd.to_datetime(winners_df['date'])

# Filter the DataFrame to include only rows from the year 1998
winners_df_copy = winners_df[(winners_df['date'].str.contains('1998')) & (winners_df['tournament'] == 'FIFA World Cup')].copy()

# Define the conditions for determining points
conditions = [
    winners_df_copy['winner'] == winners_df_copy['home_team'],  # Home team wins
    winners_df_copy['winner'] == 'Draw',  # Match is a draw
    winners_df_copy['winner'] != winners_df_copy['home_team']  # Home team loses or away team wins
]

# Define the points associated with each condition
choices = [3, 1, 0]

# Assign points based on the conditions
winners_df_copy['points'] = np.select(conditions, choices, default=0)

# Display the modified DataFrame
df = winners_df_copy[['home_team', 'points']].rename(columns={'home_team': 'team'}).groupby('team')['points'].sum().reset_index().sort_values(by='points', ascending=False)
display(df)

Unnamed: 0,team,points
7,France,19
2,Brazil,13
13,Netherlands,11
0,Argentina,10
8,Germany,10
9,Italy,10
16,Romania,7
22,Spain,4
5,Colombia,3
6,England,3


In [85]:
def points_table(winners_df, wc_year):
    
    # Convert 'date' column to datetime format
    winners_df['date'] = pd.to_datetime(winners_df['date'])

    # Filter the DataFrame to include only rows from the year, wc_year
    winners_df_copy = winners_df[(winners_df['date'].dt.year == wc_year) & (winners_df['tournament'] == 'FIFA World Cup')].copy()

    # Define the conditions for determining points
    conditions = [
        winners_df_copy['winner'] == winners_df_copy['home_team'],  # Home team wins
        winners_df_copy['winner'] == 'Draw',  # Match is a draw
        winners_df_copy['winner'] != winners_df_copy['home_team']  # Home team loses or away team wins
    ]

    # Define the points associated with each condition
    choices = [3, 1, 0]

    # Assign points based on the conditions
    winners_df_copy['points'] = np.select(conditions, choices, default=0)

    # Display the modified DataFrame
    df = winners_df_copy[['home_team', 'points']].rename(columns={'home_team': 'team'}).groupby('team')['points'].sum().reset_index().sort_values(by='points', ascending=False)
    return df


In [86]:
# Test: Exercise 5 (exposed)


tbl_1998 = points_table(winners_df, 1998)

assert tbl_1998.iloc[0].team == "France"
assert tbl_1998.iloc[0].points == 19
assert tbl_1998.iloc[1].team == "Croatia"
assert tbl_1998.iloc[1].points == 15
assert tbl_1998.iloc[2].team == "Brazil"
assert tbl_1998.iloc[2].points == 13
assert tbl_1998.iloc[3].team == "Netherlands"
assert tbl_1998.iloc[3].points == 12
assert tbl_1998.iloc[4].team == "Italy"
assert tbl_1998.iloc[4].points == 11

print("\n(Passed!)")


AssertionError: 

In [None]:
# Hidden test cell: exercise5_hidden

print("""
In addition to the tests above, this cell will include some hidden tests.
You will only know the result when you submit your solution to the
autograder.
""")

###
### AUTOGRADER TEST - DO NOT REMOVE
###


**Fin!** You’ve reached the end of this part. Don’t forget to restart and run all cells again to make sure it’s all working when run in sequence; and make sure your work passes the submission process. Good luck!