# Python goals... exploring a football data-set using Pandas

In [None]:
# Import any packages
import pandas as pd
import matplotlib

# Make the plots appear inline
%matplotlib inline

## How many International matches took place each year?

Read in the `results.csv` file which is saved in the same directory as this notebook and save it to a Pandas DataFrame named `results_df`.

In [None]:
results_df = pd.read_csv('results.csv')

Look at the first 5 records of `results_df` by using the Pandas `head()` method.

In [None]:
results_df.head()

Each row represents a single International football match.

Look at the last 5 records of `results_df` by using the Pandas `tail()` method.

In [None]:
results_df.tail()

Familiarise ourselves with the data-set by checking the number of rows and columns using the Pandas `shape` attribute.

In [None]:
results_df.shape

Check the datatypes of the columns using the Pandas `dtypes` attribute.

In [None]:
results_df.dtypes

The `date` column is of type `object`, which isn't very useful because it does not make use of the fact that the column actually holds dates. We can select the column as a Pandas Series to inspect it.

In [None]:
results_df['date']

Now use `pd.to_datetime()` on the `date` column in `results_df` to convert to a `datetime` datatype Series.

In [None]:
pd.to_datetime(results_df['date'])

Assign the Series to a new column in `results_df` called `date_time`.

In [None]:
results_df['date_time'] = pd.to_datetime(results_df['date'])

Check that the new column has the correct datatype.

In [None]:
results_df.dtypes
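
As a minimal sketch of the same conversion on a tiny data-set, the cell below builds a hypothetical three-row CSV in memory (the rows and the names `csv_text`, `toy_df` and `parsed_df` are made up for illustration and are not part of `results.csv`). It also shows an alternative that is not used in this notebook: asking `pd.read_csv()` to parse the column while loading via its `parse_dates` parameter.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for results.csv (illustrative values only)
csv_text = (
    "date,home_score,away_score\n"
    "1872-11-30,0,0\n"
    "1873-03-08,4,2\n"
    "2021-06-11,3,0\n"
)

# Option 1 (as in this notebook): read first, then convert the text column
toy_df = pd.read_csv(io.StringIO(csv_text))
toy_df['date_time'] = pd.to_datetime(toy_df['date'])

# Option 2 (an alternative): parse the column while reading the file
parsed_df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

# Both routes should produce a datetime column
converted_ok = pd.api.types.is_datetime64_any_dtype(toy_df['date_time'])
parsed_ok = pd.api.types.is_datetime64_any_dtype(parsed_df['date'])
print(converted_ok, parsed_ok)
```

Keeping the original `date` column and adding `date_time` next to it, as done above, makes it easy to compare the two while exploring.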

Through the use of the `dt` namespace, convenient attributes such as `year` can be extracted.

In [None]:
results_df['date_time'].dt.year

Count the occurrences of each `year` by using Pandas `value_counts()`.

In [None]:
results_df['date_time'].dt.year.value_counts()

The index is the `year` and the value is the count relating to it. By default the result is sorted by count in descending order.

To plot the years in chronological order we must first sort by the index using Pandas `sort_index()`.

In [None]:
results_df['date_time'].dt.year.value_counts().sort_index()
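
To make the difference between the two orderings concrete, here is a small sketch on a hypothetical Series of years (the values and the names `years`, `counts` and `by_year` are made up for illustration).

```python
import pandas as pd

# Hypothetical years, one entry per match
years = pd.Series([2019, 2021, 2019, 2020, 2019, 2021])

# value_counts() orders by count, largest first: 2019 (3), 2021 (2), 2020 (1)
counts = years.value_counts()

# sort_index() reorders by the index, i.e. chronologically by year
by_year = counts.sort_index()

print(counts.index.tolist())   # [2019, 2021, 2020]
print(by_year.index.tolist())  # [2019, 2020, 2021]
```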

By simply using Pandas `plot()` we can visualise the number of international matches each year (the default plot type is a line).

Although the plot could certainly use some visual enhancements, a simple Pandas one-liner is able to convey the number of matches that took place each year!

In [None]:
results_df['date_time'].dt.year.value_counts().sort_index().plot()

## Filtering the time-series for more fine-grained detail

It would also be insightful to take a look at a narrower range of International matches. 

Create a boolean Series from the `date_time` column.

In [None]:
results_df['date_time'].dt.year >= 2014

Slice the Series and only keep rows that are `True`.

In [None]:
results_df['date_time'].dt.year[results_df['date_time'].dt.year >= 2014]

As previously, use Pandas `value_counts()` to count the occurrences each year.

In [None]:
results_df['date_time'].dt.year[results_df['date_time'].dt.year >= 2014].value_counts()

Then sort it by the index using Pandas `sort_index()`.

In [None]:
results_df['date_time'].dt.year[results_df['date_time'].dt.year >= 2014].value_counts().sort_index()

Now use Pandas `plot()` to visualise the Series.

In [None]:
results_df['date_time'].dt.year[results_df['date_time'].dt.year >= 2014].value_counts().sort_index().plot()
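
The chained expression above repeats `results_df['date_time'].dt.year` twice. One possible tidy-up, shown here as a sketch on a hypothetical four-row DataFrame (the data and the names `toy_df`, `years` and `recent_mask` are assumptions for illustration), is to name the intermediate Series and the boolean mask before filtering.

```python
import pandas as pd

# Hypothetical match dates (illustrative values only)
toy_df = pd.DataFrame({
    'date_time': pd.to_datetime(
        ['2013-05-01', '2014-06-12', '2014-07-13', '2016-06-10']
    )
})

# Name the intermediate pieces instead of repeating the expression
years = toy_df['date_time'].dt.year
recent_mask = years >= 2014  # boolean Series, True where the year is 2014 or later

# Keep only the True rows, then count matches per year in chronological order
recent_counts = years[recent_mask].value_counts().sort_index()
print(recent_counts)
```

The result is the same as the one-liner; the named steps are simply easier to read and to reuse with a different cut-off year.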

There is a large drop in the final two years. 2020 is low mainly due to COVID, and 2021 is low partly due to COVID and partly because the year is unfinished in the data-set.

We can see where the data-set ends by using Pandas `max()` on the `date_time` column.

In [None]:
results_df['date_time'].max()

## Which Tournament type is the most exciting?

The most exciting International tournament could be defined in many ways, but one way to define it may be to investigate the goals scored! 

To begin with, create a new column `total_score` which calculates the `home_score` + `away_score` to give the total number of goals in each International match.

In [None]:
results_df['total_score'] = results_df['home_score'] + results_df['away_score']

Now check a random 5 records from the DataFrame to validate the calculation.

In [None]:
results_df[['total_score', 'home_score', 'away_score']].sample(5)

For each tournament we can use Pandas `groupby()` to calculate the average number of match goals and determine the most exciting tournament based on this metric!

Firstly we will filter to tournaments with >= 100 matches in the DataFrame to ensure that there is a reasonable sample size.

Use Pandas `value_counts()` on the `tournament` column to return a Series indexed by `tournament` with the value being the number of matches in the tournament.

In [None]:
results_df['tournament'].value_counts()

Assign this to a variable called `tournament_count`.

In [None]:
tournament_count = results_df['tournament'].value_counts()

See which tournaments have at least 100 matches by using boolean indexing as previously.

In [None]:
tournament_count[tournament_count >= 100]

Now assign the index of this Series to a variable called `most_common_tournaments`.

In [None]:
most_common_tournaments = tournament_count[tournament_count >= 100].index

Check that `most_common_tournaments` is what we expect.

In [None]:
most_common_tournaments

Filter `results_df` using Pandas `isin()` to retain rows in `most_common_tournaments` using boolean indexing and assign to a variable called `most_common_tournaments_df`.

In [None]:
most_common_tournaments_df = results_df[results_df.tournament.isin(most_common_tournaments)]

Check the `shape` of this to confirm we have subsetted the data-set.

In [None]:
most_common_tournaments_df.shape
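
The count, threshold and `isin()` steps above can be sketched end to end on a hypothetical six-row DataFrame. The tournament names and the threshold of 2 are chosen purely so the toy example is small; they are not taken from `results.csv`.

```python
import pandas as pd

# Hypothetical data: 3 matches, 2 matches and 1 match per tournament
toy_df = pd.DataFrame({
    'tournament': ['Friendly'] * 3 + ['UEFA Euro'] * 2 + ['One-off Cup']
})

# Number of matches per tournament
counts = toy_df['tournament'].value_counts()

# Names of tournaments that meet the (toy) threshold of 2 matches
common = counts[counts >= 2].index

# Keep only rows whose tournament is in that list
filtered = toy_df[toy_df['tournament'].isin(common)]
print(filtered.shape)  # (5, 1)
```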

Now use Pandas `groupby()` in conjunction with Pandas `agg()` to find the match count, mean and median of `total_score` per `tournament`.

In [None]:
most_common_tournaments_df.groupby('tournament')['total_score'].agg(['count', 'mean', 'median'])

Sort it by `mean` using Pandas `sort_values()` (ascending by default) and use Pandas `tail()` to view the 5 tournaments with the highest mean.

In [None]:
most_common_tournaments_df.groupby('tournament')['total_score'].agg(['count', 'mean', 'median']).sort_values('mean').tail()
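
As a small worked example of the aggregation and sort, the sketch below uses a hypothetical DataFrame with three tournaments labelled `A`, `B` and `C` (made-up data). It also shows an equivalent way to get the top rows, which this notebook does not use: sorting in descending order and taking `head()`.

```python
import pandas as pd

# Hypothetical goal totals: A averages 2, B averages 5, C averages 1
toy_df = pd.DataFrame({
    'tournament': ['A', 'A', 'B', 'B', 'C', 'C'],
    'total_score': [1, 3, 4, 6, 0, 2],
})

# One row per tournament with count, mean and median, lowest mean first
summary = (
    toy_df.groupby('tournament')['total_score']
    .agg(['count', 'mean', 'median'])
    .sort_values('mean')
)

# Ascending sort + tail() and descending sort + head() pick the same rows
top_via_tail = summary.tail(2).index.tolist()
top_via_head = summary.sort_values('mean', ascending=False).head(2).index.tolist()
print(top_via_tail)  # ['A', 'B']
print(top_via_head)  # ['B', 'A']
```

The ascending order with `tail()` is convenient for a horizontal bar chart, because the largest value then ends up as the top bar.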

Plot this as a horizontal bar chart using Pandas `plot(kind='barh')`. Notice that we tweak the code slightly to use Pandas `mean()` instead of Pandas `agg()` to make it easier to just plot the mean.

In [None]:
most_common_tournaments_df.groupby('tournament')['total_score'].mean().sort_values().tail().plot(kind='barh')

## What scores are most common at the Euros?

Let's look at the score distribution of matches in the Euros!

Start by filtering `results_df` using boolean indexing.

In [None]:
euro_df = results_df[results_df['tournament'] == 'UEFA Euro']

Check we have the correct data by using Pandas `head()` and `tail()`.

In [None]:
euro_df.head()

In [None]:
euro_df.tail()

Perfect. The Euros normally take place every 4 years, but last year's tournament is missing due to COVID.

Let's check if the mean score has changed over the years. We use similar syntax to before, but make sure we group by the year rather than the whole `date_time`!

In [None]:
euro_df.groupby(euro_df['date_time'].dt.year)['total_score'].mean().plot()
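
Note that `groupby()` accepts a Series as the grouping key, not just a column name, which is what makes grouping by the derived year possible. A minimal sketch on hypothetical data (dates and scores are made up for illustration):

```python
import pandas as pd

# Hypothetical matches from two tournament years
toy_df = pd.DataFrame({
    'date_time': pd.to_datetime(
        ['2012-06-08', '2012-07-01', '2016-06-10', '2016-07-10']
    ),
    'total_score': [2, 4, 3, 1],
})

# Group by the year extracted from date_time rather than by each full date
mean_by_year = toy_df.groupby(toy_df['date_time'].dt.year)['total_score'].mean()
print(mean_by_year)
```

Grouping by the full `date_time` instead would create one group per match day, which is not what we want here.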

Looks like there was quite a bit of variation prior to 1980, so we will filter out those earlier matches using boolean indexing and create a new Pandas DataFrame called `recent_euro_df`.

In [None]:
recent_euro_df = euro_df[euro_df['date_time'].dt.year >= 1980]

Again check using `head()` and `tail()` that we have the correct data.

In [None]:
recent_euro_df.head()

In [None]:
recent_euro_df.tail()

Looks great, now we can use Pandas `groupby()` on `home_score` and `away_score` with Pandas `size()` to get the number of matches for each score.

In [None]:
recent_euro_df.groupby(['home_score', 'away_score']).size()

We can convert this into something more aesthetically pleasing by using Pandas `unstack(fill_value=0)` to pivot the inner index level (`away_score`) into columns and fill the empty cells with 0.

In [None]:
recent_euro_df.groupby(['home_score', 'away_score']).size().unstack(fill_value=0)
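
To see what the reshaping does, here is a sketch on five hypothetical scorelines (made-up data). The grouped Series has a two-level index of (`home_score`, `away_score`); `unstack()` turns the `away_score` level into columns, and `fill_value=0` fills in score combinations that never occurred.

```python
import pandas as pd

# Hypothetical scorelines: 1-0 twice, and 2-1, 0-0, 1-1 once each
toy_df = pd.DataFrame({
    'home_score': [1, 1, 2, 0, 1],
    'away_score': [0, 0, 1, 0, 1],
})

# Number of matches per (home_score, away_score) pair
score_counts = toy_df.groupby(['home_score', 'away_score']).size()

# Rows are home_score, columns are away_score, missing pairs become 0
score_table = score_counts.unstack(fill_value=0)
print(score_table)
```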

Gradient fill the background based on the number of matches using `style` and `background_gradient(axis=None)` to fill based on both axes at once.

In [None]:
recent_euro_df.groupby(['home_score', 'away_score']).size().unstack(fill_value=0).style.background_gradient(axis=None)