## Challenge

As mentioned in the Compass, we will be using data from international football (soccer :)) matches that took place between 1872 and 2019 (148 years). You can download the dataset from [**this link**](https://drive.google.com/file/d/1cCn5botBKzh1XZOvrxpcLle-Ua7Fh9BR/view?usp=sharing) and find more information about it on [**Kaggle**](https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017).

Before starting the tasks, we need to understand every variable in the dataset and the information it stores. Meaningful visualizations depend on knowing what each column represents.

> #### Instruction
> Use visualizations to answer the following questions. Try different Python packages.

In [120]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [216]:
df = pd.read_csv("results.csv")

In [131]:
df

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False
...,...,...,...,...,...,...,...,...,...
41581,2020-01-10,Barbados,Canada,1,4,Friendly,Irvine,United States,True
41582,2020-01-12,Kosovo,Sweden,0,1,Friendly,Doha,Qatar,True
41583,2020-01-15,Canada,Iceland,0,1,Friendly,Irvine,United States,True
41584,2020-01-19,El Salvador,Iceland,0,1,Friendly,Carson,United States,True
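
One quick way to get familiar with the columns is to check their types, missing values, and distinct values. The sketch below does this on the first three rows copied from the table above; the `sample` name is only for illustration, and the same calls would apply to the full `df`.

```python
import pandas as pd

# First three rows copied from the table above (a subset of the columns).
sample = pd.DataFrame({
    "date": ["1872-11-30", "1873-03-08", "1874-03-07"],
    "home_team": ["Scotland", "England", "Scotland"],
    "away_team": ["England", "Scotland", "England"],
    "home_score": [0, 4, 2],
    "away_score": [0, 2, 1],
    "tournament": ["Friendly"] * 3,
    "neutral": [False, False, False],
})

print(sample.dtypes)        # storage type of each column
print(sample.isna().sum())  # number of missing values per column
print(sample.nunique())     # number of distinct values per column
```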


## Task
Which teams scored the largest number of goals in the FIFA World Cup?

In [15]:
df["tournament"].value_counts()  # confirm that "FIFA World Cup" appears among the tournament names

Friendly                                17029
FIFA World Cup qualification             7236
UEFA Euro qualification                  2582
African Cup of Nations qualification     1672
FIFA World Cup                            900
                                        ...  
Dragon Cup                                  4
World Unity Cup                             4
Copa América qualification                  2
AFF Championship qualification              2
Atlantic Heritage Cup                       2
Name: tournament, Length: 112, dtype: int64

In [21]:
# keep only the matches played in the FIFA World Cup
df_fifa = df[df["tournament"] == "FIFA World Cup"]
df_fifa

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,total_score
1304,1930-07-13,Belgium,United States,0,3,FIFA World Cup,Montevideo,Uruguay,True,3
1305,1930-07-13,France,Mexico,4,1,FIFA World Cup,Montevideo,Uruguay,True,5
1306,1930-07-14,Brazil,Yugoslavia,1,2,FIFA World Cup,Montevideo,Uruguay,True,3
1307,1930-07-14,Peru,Romania,1,3,FIFA World Cup,Montevideo,Uruguay,True,4
1308,1930-07-15,Argentina,France,1,0,FIFA World Cup,Montevideo,Uruguay,True,1
...,...,...,...,...,...,...,...,...,...,...
39945,2018-07-07,Russia,Croatia,2,2,FIFA World Cup,Sochi,Russia,False,4
39946,2018-07-10,France,Belgium,1,0,FIFA World Cup,St. Petersburg,Russia,True,1
39947,2018-07-11,Croatia,England,2,1,FIFA World Cup,Moscow,Russia,True,3
39948,2018-07-14,Belgium,England,2,0,FIFA World Cup,St. Petersburg,Russia,True,2


In [30]:
df_fifa_home = df_fifa[["home_team","home_score","tournament"]]
df_fifa_away = df_fifa[["away_team","away_score","tournament"]]

In [40]:
df_fifa_home = df_fifa_home.sort_values(by="home_team")
df_fifa_away = df_fifa_away.sort_values(by="away_team")

In [62]:
# count() tallies rows per team, i.e. away matches played, not goals scored
fifa_away_count = df_fifa_away.groupby("away_team").count()
fifa_away_count

away_team,away_score,tournament
Algeria,7,7
Angola,2,2
Argentina,21,21
Australia,10,10
Austria,9,9
...,...,...
United Arab Emirates,2,2
United States,20,20
Uruguay,37,37
Wales,5,5


In [63]:
fifa_home_count = df_fifa_home.groupby("home_team").count()
fifa_home_count.reset_index(inplace=True)
fifa_home_count.keys()

Index(['home_team', 'home_score', 'tournament'], dtype='object')

In [67]:
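# With no `on=` argument, merge joins on the only shared column name, `tournament`
# (a per-team match count here), rather than on the team name.
result = pd.merge(fifa_home_count, fifa_away_count)
# home_score and away_score come from count(), so this adds match counts, not goals
result["total_score"] = result["home_score"] + result["away_score"]
final_fifa = result[["home_team","total_score"]].sort_values(by="total_score", ascending=False)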

In [81]:
final_fifa.head(15)

Unnamed: 0,home_team,total_score
74,Belgium,52
75,Belgium,52
76,Mexico,52
77,Mexico,52
283,Sweden,50
282,Sweden,50
110,Chile,42
73,South Korea,40
72,Austria,40
284,Uruguay,38
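
The cells above call `count()`, which tallies matches rather than goals, and the merge joins on the shared `tournament` column, which is why teams repeat in the ranking. A minimal sketch of one alternative is to stack home and away appearances into one long table and sum the goals per team. It uses the five 1930 rows shown in the `df_fifa` output above; the names `sample`, `goals_long`, and `goals_per_team` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Five 1930 World Cup rows copied from the df_fifa output above.
sample = pd.DataFrame({
    "home_team": ["Belgium", "France", "Brazil", "Peru", "Argentina"],
    "away_team": ["United States", "Mexico", "Yugoslavia", "Romania", "France"],
    "home_score": [0, 4, 1, 1, 1],
    "away_score": [3, 1, 2, 3, 0],
})

# Stack home and away appearances into one long table of (team, goals).
home = sample[["home_team", "home_score"]].set_axis(["team", "goals"], axis=1)
away = sample[["away_team", "away_score"]].set_axis(["team", "goals"], axis=1)
goals_long = pd.concat([home, away], ignore_index=True)

# Sum (not count) the goals per team, then rank from most to fewest.
goals_per_team = goals_long.groupby("team")["goals"].sum().sort_values(ascending=False)
print(goals_per_team.head())
```

Applied to the full `df_fifa`, the same steps should give one row per team, with goals scored at home and away combined.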


## Task
What is the number of matches played in each tournament throughout history?

In [90]:
df[["tournament","date"]].groupby("tournament").count().sort_values(by="date", ascending=False)

tournament,date
Friendly,17029
FIFA World Cup qualification,7236
UEFA Euro qualification,2582
African Cup of Nations qualification,1672
FIFA World Cup,900
...,...
Dragon Cup,4
World Unity Cup,4
Copa América qualification,2
Atlantic Heritage Cup,2


## Task 
Show the trend in number of matches per year.

In [255]:
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

In [256]:
df['date'].value_counts()

2012-02-29    66
2016-03-29    63
2008-03-26    60
2014-03-05    59
2012-11-14    56
              ..
1973-08-11     1
1973-08-14     1
1973-08-16     1
1973-08-17     1
2020-02-01     1
Name: date, Length: 15196, dtype: int64

In [141]:
# Count matches per calendar year ('Y' is the year-end frequency alias)
yearly_games_trend = df[['date', 'city']].groupby(pd.Grouper(key='date', freq='Y')).count()
yearly_games_trend.reset_index(inplace=True)

In [152]:
# create the data
import plotly.graph_objs as go
data = [go.Bar(
    x=yearly_games_trend["date"],
    y=yearly_games_trend["city"]
)]

# create the layout
layout = go.Layout(
    title='Yearly Soccer games trend'
)

# create the figure
fig = go.Figure(data=data, layout=layout)
fig.show()
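
As an alternative to `pd.Grouper`, the yearly counts can also be taken from the year part of the date. The sketch below assumes a short, partly made-up list of dates in place of `df['date']`; `size()` counts rows directly, so no helper column such as `city` is needed. One difference to keep in mind: years with no matches are simply absent here instead of appearing as zero.

```python
import pandas as pd

# Hypothetical stand-in for the `date` column of results.csv.
dates = pd.to_datetime(pd.Series(
    ["1872-11-30", "1873-03-08", "2019-10-10", "2019-11-14", "2019-11-19"]
))

# Group by the year component and count the rows in each group.
matches_per_year = dates.groupby(dates.dt.year).size()
print(matches_per_year)
```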

## Task
Which teams are the most successful ones? (winning percentage)

In [170]:
# plan: count games played and wins per team, then divide wins by games played
df.keys()

Index(['date', 'home_team', 'away_team', 'home_score', 'away_score',
       'tournament', 'city', 'country', 'neutral'],
      dtype='object')

In [180]:
len(df)

41586

In [236]:
# total games played
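# Series.append was removed in pandas 2.0; pd.concat stacks the two columns instead
concatenated = pd.concat([df["home_team"], df["away_team"]])
total_games_played = concatenated.value_counts().reset_index()
total_games_played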

Unnamed: 0,index,0
0,Sweden,1018
1,England,1009
2,Brazil,981
3,Argentina,980
4,Germany,949
...,...,...
307,Niue,2
308,Palau,2
309,Asturias,1
310,Madrid,1


In [237]:
total_games_played = total_games_played.set_axis(['country', 'played'], axis=1)

In [239]:
total_games_played # DONE

Unnamed: 0,country,played
0,Sweden,1018
1,England,1009
2,Brazil,981
3,Argentina,980
4,Germany,949
...,...,...
307,Niue,2
308,Palau,2
309,Asturias,1
310,Madrid,1


In [220]:
# Label each match with the winning team's name, or "draw" when the scores are level
df["results"] = np.where(df["home_score"] > df["away_score"], df['home_team'], df["away_team"])
df["results"] = np.where(df["home_score"] == df["away_score"], "draw", df["results"])

In [226]:
# df = df[df["results"] != "draw"]

wins_df = pd.DataFrame(df["results"].value_counts().rename_axis("country").reset_index(name="wins"))
wins_df = wins_df[wins_df.country != "draw"].reset_index(drop=True)
wins_df

Unnamed: 0,country,wins
0,Brazil,625
1,England,572
2,Germany,555
3,Argentina,526
4,Sweden,500
...,...,...
295,Kabylia,1
296,Timor-Leste,1
297,Gozo,1
298,Central Spain,1


In [240]:
# Aligns on the integer row index, not on country, so the two rankings are paired by position
wins_df["total_played"] = total_games_played["played"]

In [247]:
wins_df["win%"] = wins_df["wins"]/wins_df["total_played"]

In [251]:
wins_df.sort_values(by="win%", ascending=False).head(10)

Unnamed: 0,country,wins,total_played,win%
0,Brazil,625,1018,0.613949
1,England,572,1009,0.566898
2,Germany,555,981,0.565749
3,Argentina,526,980,0.536735
4,Sweden,500,949,0.52687
7,Hungary,434,869,0.499425
10,Spain,406,815,0.49816
6,Mexico,439,884,0.496606
9,France,417,840,0.496429
11,Netherlands,398,804,0.495025
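
Note that the `total_played` assignment above pairs rows by position: Brazil is shown with 1018 games, which is Sweden's total in the earlier table. A minimal sketch of joining the two tables on the team name instead follows. The three teams "A", "B", and "C" and their results are made up for illustration; a left join is assumed so that teams without a single win are kept with zero wins.

```python
import pandas as pd

# Made-up results for three hypothetical teams "A", "B" and "C".
matches = pd.DataFrame({
    "home_team": ["A", "B", "A", "C"],
    "away_team": ["B", "C", "C", "A"],
    "home_score": [2, 1, 0, 1],
    "away_score": [0, 1, 3, 2],
})

# Games played: every appearance, whether home or away.
played = (
    pd.concat([matches["home_team"], matches["away_team"]])
    .value_counts()
    .rename_axis("country")
    .reset_index(name="played")
)

# Winner of each match; draws stay NaN and are skipped by value_counts().
home_win = matches["home_score"] > matches["away_score"]
away_win = matches["home_score"] < matches["away_score"]
winner = matches["home_team"].where(home_win, matches["away_team"].where(away_win))
wins = winner.value_counts().rename_axis("country").reset_index(name="wins")

# Join on the team name; the left join keeps teams that never won.
table = played.merge(wins, on="country", how="left").fillna({"wins": 0})
table["win%"] = table["wins"] / table["played"]
print(table.sort_values("win%", ascending=False))
```

Because the join key is the name rather than the row position, each team's wins are divided by its own number of games.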


## Task
Which teams are the least successful ones? (winning percentage)

In [250]:
wins_df.sort_values(by="win%").head(10)

Unnamed: 0,country,wins,total_played,win%
288,Chagos Islands,1,7,0.142857
286,San Marino,1,7,0.142857
287,Raetia,1,7,0.142857
291,Republic of St. Pauli,1,6,0.166667
290,Romani people,1,6,0.166667
289,Saarland,1,6,0.166667
296,Timor-Leste,1,5,0.2
295,Kabylia,1,5,0.2
294,Two Sicilies,1,5,0.2
293,Asturias,1,5,0.2


## Task
Which months have had the most matches throughout history? Is it June, July, or another month? Does the number of matches change from month to month?

In [268]:
# One bucket per month of each year (1872-11, 1872-12, ...), not per month of the year
monthly_games_trend = df[['date', 'city']].groupby(pd.Grouper(key='date', freq='M')).count()
monthly_games_trend.reset_index(inplace=True)

In [None]:
# need to group months!

In [271]:
monthly_games_trend

Unnamed: 0,date,city
0,1872-11-30,1
1,1872-12-31,0
2,1873-01-31,0
3,1873-02-28,0
4,1873-03-31,1
...,...,...
1763,2019-10-31,183
1764,2019-11-30,195
1765,2019-12-31,13
1766,2020-01-31,6


In [270]:
# create the data
import plotly.graph_objs as go
data = [go.Bar(
    x=monthly_games_trend["date"],
    y=monthly_games_trend["city"]
)]

# create the layout
layout = go.Layout(
    title='Monthly Soccer games trend'
)

# create the figure
fig = go.Figure(data=data, layout=layout)
fig.show()
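
The chart above has one bar per month of every year, so it does not yet compare June with July across history. One possible way to group the months, sketched here on a handful of made-up dates standing in for `df['date']`, is to count by the month number so that every June is pooled regardless of year.

```python
import pandas as pd

# Hypothetical match dates standing in for df["date"].
dates = pd.to_datetime(pd.Series(
    ["2018-06-14", "2018-06-15", "2014-06-12", "2018-07-15", "2019-11-14"]
))

# Count by month number (1-12); reindex so months without matches show as 0.
matches_by_month = (
    dates.dt.month.value_counts().reindex(range(1, 13), fill_value=0)
)
print(matches_by_month)
```

The resulting twelve values could then be passed to the same `go.Bar` call used above, with the month numbers on the x axis.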

## Task
Which teams played against each other the most?

In [175]:
# Counts ordered (home, away) pairs, so Argentina-Uruguay and Uruguay-Argentina are tallied separately
df[["home_team","away_team"]].value_counts()

home_team  away_team 
Argentina  Uruguay       96
Uruguay    Argentina     80
Austria    Hungary       68
Hungary    Austria       65
Kenya      Uganda        65
                         ..
Palestine  Maldives       1
Gabon      Botswana       1
           Brazil         1
           Cape Verde     1
Abkhazia   Artsakh        1
Length: 10976, dtype: int64
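
In the output above, Argentina vs. Uruguay appears twice (96 and 80) because home and away order is kept. A minimal sketch of counting each pairing once, whoever hosted, is to sort the two team names within each row before counting. The `fixtures` rows are hypothetical, and the column names `team_1` and `team_2` are introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical fixtures; only the team names matter here.
fixtures = pd.DataFrame({
    "home_team": ["Argentina", "Uruguay", "Argentina", "Austria", "Hungary"],
    "away_team": ["Uruguay", "Argentina", "Uruguay", "Hungary", "Austria"],
})

# Sort the two names in each row alphabetically so A-B and B-A share one key.
pairs = pd.DataFrame(
    np.sort(fixtures[["home_team", "away_team"]].to_numpy(), axis=1),
    columns=["team_1", "team_2"],
)

# Count how often each unordered pairing occurs.
pair_counts = pairs.value_counts()
print(pair_counts)
```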

## Task
Apply your creativity to show some additional insights from the data.

## Task (Stretch)
Create these graphs in Tableau as well.