# Project: Exploring a Soccer Match Database

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

I have chosen to analyze the [European Soccer Database](https://www.kaggle.com/hugomathien/soccer). The database includes approximately 26,000 records of soccer matches, including team layout and game outcome. There is also data regarding player and team attributes (taken from the FIFA video games), and values for betting odds on each game across several online gambling platforms.

My focus will be on the players making up each team, team formations, and their respective impacts on team performance. Through this project, I will attempt to draw conclusions about factors influencing team performance based on the available data.

I chose this database, and the focus on performance, because of its connection to my work background. I am interested in improving the performance of a manufacturing organization which produces only custom made-to-order products. I would like to draw a parallel between analyzing team sport performance and team business performance.

### Importing

First I import the packages relevant for this project:

In [1]:
import sqlite3                      # to read the raw database file in .sqlite format
import pandas as pd                 # for creating and modifying dataframes
from matplotlib.pyplot import plot  # for data visualization
import seaborn as sb                # to clean up visualizations
import os                           # to locate files within the directory
import numpy as np                  # for numerical helpers such as NaN checks

# visualizations will render in-browser
%matplotlib inline                  

<a id='wrangling'></a>
## Data Wrangling

The file comes in .sqlite format, so it must be unpacked and imported into dataframes to be manipulated. Below, I will check out what data is included with each table and begin to shape the data so that it is useful for my analysis.

### General Properties

In [2]:
# Locate the database file, connect to it, and list the tables it contains.

path = os.getcwd()                                # ensures that the full path is being used
database = os.path.join(path, 'database.sqlite')  # portable path; the file sits in the same folder

con = sqlite3.connect(database)        # establish a connection with the database
tables = pd.read_sql(                  # write a query to see all tables
    """
    SELECT * FROM sqlite_master
    WHERE type='table';
    """,con=con)
tables

,type,name,tbl_name,rootpage,sql
0,table,sqlite_sequence,sqlite_sequence,4,"CREATE TABLE sqlite_sequence(name,seq)"
1,table,Player_Attributes,Player_Attributes,11,"CREATE TABLE ""Player_Attributes"" (\n\t`id`\tIN..."
2,table,Player,Player,14,CREATE TABLE `Player` (\n\t`id`\tINTEGER PRIMA...
3,table,Match,Match,18,CREATE TABLE `Match` (\n\t`id`\tINTEGER PRIMAR...
4,table,League,League,24,CREATE TABLE `League` (\n\t`id`\tINTEGER PRIMA...
5,table,Country,Country,26,CREATE TABLE `Country` (\n\t`id`\tINTEGER PRIM...
6,table,Team,Team,29,"CREATE TABLE ""Team"" (\n\t`id`\tINTEGER PRIMARY..."
7,table,Team_Attributes,Team_Attributes,2,CREATE TABLE `Team_Attributes` (\n\t`id`\tINTE...
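
The table listing above can be reproduced on a throwaway database. This is only a minimal sketch: the in-memory database and its two empty tables are stand-ins I made up for illustration, not the real `database.sqlite` schema.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for database.sqlite
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE Team (id INTEGER PRIMARY KEY, team_long_name TEXT)")
demo_con.execute("CREATE TABLE League (id INTEGER PRIMARY KEY, name TEXT)")

# sqlite_master holds one row per schema object; keep only the tables
demo_tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;",
    con=demo_con,
)
table_names = demo_tables["name"].tolist()
print(table_names)

demo_con.close()
```

Selecting only the `name` column keeps the listing short when the `sql` column with the full `CREATE TABLE` statements is not needed.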


### Tables
I know from reading the documentation provided with the database that the two `_Attributes` tables (`Player_Attributes` and `Team_Attributes`) are based on data from the FIFA video games. I will not need them to support my analysis, so they will not be brought into dataframes.

The `Player`, `League`, and `Team` tables are each read into their own dataframe below.

In [3]:
# write query to import player table
player_df = pd.read_sql( 
    """
    SELECT player_api_id as id, player_name, birthday, height, weight 
    FROM Player;
    """, con=con, index_col='id', parse_dates=['birthday'])

# observe column names and datatypes
player_df.info()         

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11060 entries, 505942 to 39902
Data columns (total 4 columns):
player_name    11060 non-null object
birthday       11060 non-null datetime64[ns]
height         11060 non-null float64
weight         11060 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 432.0+ KB


In [4]:
# query to import league table
league_df = pd.read_sql(
    """
    SELECT * FROM League;
    """, con=con, index_col='id')

# observe column names and datatypes
league_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11 entries, 1 to 24558
Data columns (total 2 columns):
country_id    11 non-null int64
name          11 non-null object
dtypes: int64(1), object(1)
memory usage: 264.0+ bytes


In [5]:
# query to import team table
team_df = pd.read_sql(
    """
    SELECT * FROM Team;
    """, con=con, index_col='team_api_id')

# observe column names and datatypes
team_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 299 entries, 9987 to 7896
Data columns (total 4 columns):
id                  299 non-null int64
team_fifa_api_id    288 non-null float64
team_long_name      299 non-null object
team_short_name     299 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 11.7+ KB


Because the `Match` table contains the majority of the data pertinent to my analysis, I will try to `JOIN` the other tables as early as possible so that I can drop everything that I don't need and avoid running operations on unused data.

First I pull the `Country` and `League` names in directly, using an `INNER JOIN` (the default behavior of a bare `JOIN`) on their respective IDs.

Because this table is so large, I also call the `%%time` magic to keep track of how long operations take, so that I can attempt to remove unnecessary steps at the end.
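
When only part of a cell needs timing, or when the code runs outside a notebook, a similar measurement can be taken with the standard library. This is a rough sketch; the summation is just a stand-in for a slow step such as a large query.

```python
import time

start = time.perf_counter()            # high-resolution clock for measuring intervals
total = sum(range(1_000_000))          # stand-in for a slow operation
elapsed = time.perf_counter() - start  # seconds spent on the step above

print(f"step took {elapsed:.3f} s")
```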

In [6]:
match_df = pd.read_sql(
    """
    SELECT Country.name as country, League.name as league, match.*
    FROM match
    JOIN Country ON Country.id = match.country_id
    JOIN League ON League.id = match.league_id;
    """, con=con, index_col='id', parse_dates=['date'])
match_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25979 entries, 1 to 25979
Columns: 116 entries, country to BSA
dtypes: datetime64[ns](1), float64(96), int64(8), object(11)
memory usage: 23.2+ MB
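
To make the effect of the join easier to see, here is a small sketch of the same query shape on made-up data. The in-memory tables, their rows, and the alias `m` are assumptions for illustration only; the real tables have many more columns.

```python
import sqlite3
import pandas as pd

con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
    CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE League (id INTEGER PRIMARY KEY, country_id INTEGER, name TEXT);
    CREATE TABLE "Match" (id INTEGER PRIMARY KEY, country_id INTEGER,
                          league_id INTEGER, date TEXT);
    -- made-up rows, not taken from the real database
    INSERT INTO Country VALUES (1, 'Country A');
    INSERT INTO League VALUES (1, 1, 'League A');
    INSERT INTO "Match" VALUES (1, 1, 1, '2008-08-17 00:00:00');
    INSERT INTO "Match" VALUES (2, 1, 1, '2008-08-16 00:00:00');
""")

# Same shape as the query above: names first, then every Match column
demo_match = pd.read_sql(
    """
    SELECT Country.name AS country, League.name AS league, m.*
    FROM "Match" AS m
    JOIN Country ON Country.id = m.country_id
    JOIN League ON League.id = m.league_id;
    """,
    con=con_demo, index_col="id", parse_dates=["date"],
)
print(demo_match.dtypes)

con_demo.close()
```

The ID columns are still present after the join, which is why they have to be dropped separately later on.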


One of the first issues I will need to tackle is reducing the number of columns in this table. The above output shows that there are 116 columns, too many for `.info()` to list individually. I will need to print a list of the columns to take a look.

In [7]:
for num, column in enumerate(list(match_df)):
    print('%i: %s' % (num,column))

0: country
1: league
2: country_id
3: league_id
4: season
5: stage
6: date
7: match_api_id
8: home_team_api_id
9: away_team_api_id
10: home_team_goal
11: away_team_goal
12: home_player_X1
13: home_player_X2
14: home_player_X3
15: home_player_X4
16: home_player_X5
17: home_player_X6
18: home_player_X7
19: home_player_X8
20: home_player_X9
21: home_player_X10
22: home_player_X11
23: away_player_X1
24: away_player_X2
25: away_player_X3
26: away_player_X4
27: away_player_X5
28: away_player_X6
29: away_player_X7
30: away_player_X8
31: away_player_X9
32: away_player_X10
33: away_player_X11
34: home_player_Y1
35: home_player_Y2
36: home_player_Y3
37: home_player_Y4
38: home_player_Y5
39: home_player_Y6
40: home_player_Y7
41: home_player_Y8
42: home_player_Y9
43: home_player_Y10
44: home_player_Y11
45: away_player_Y1
46: away_player_Y2
47: away_player_Y3
48: away_player_Y4
49: away_player_Y5
50: away_player_Y6
51: away_player_Y7
52: away_player_Y8
53: away_player_Y9
54: away_player_Y10
55: awa

It looks like I will want to drop all of the columns after index 77, because they contain metrics that I am not interested in for this analysis. In this case, I find it much easier to drop the columns within Pandas rather than SQL, because I want to use a numbered range.

In [8]:
match_df = match_df[match_df.columns[:78]]
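
Because Python slices exclude the end position, `[:78]` keeps positions 0 through 77. A minimal sketch on a toy dataframe (the column names are placeholders) shows the same idea, along with `.iloc` as an equivalent way to select by position:

```python
import pandas as pd

demo = pd.DataFrame([[1, 2, 3, 4, 5]], columns=["a", "b", "c", "d", "e"])

# Both forms keep the columns at positions 0, 1 and 2; position 3 is excluded
by_name_slice = demo[demo.columns[:3]]
by_position = demo.iloc[:, :3]

print(list(by_position.columns))
```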

Now I will convert the remaining ID columns to their respective string values. This was already handled for the 'Country' and 'League' columns by the SQL `JOIN` earlier, so only the home and away team names remain. After that, all I need to do is drop the ID columns.

In [9]:
# build a lookup of team_api_id -> team_long_name
values_dict = team_df['team_long_name'].to_dict()

# add a name column for each side by looking up its team ID
match_df['home_team_api_id'] = match_df['home_team_api_id'].astype('int')
match_df['home_team_name'] = match_df['home_team_api_id'].replace(values_dict)
match_df['away_team_api_id'] = match_df['away_team_api_id'].astype('int')
match_df['away_team_name'] = match_df['away_team_api_id'].replace(values_dict)
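
An alternative lookup, sketched here on made-up IDs and names, is `Series.map`. I have not compared the two on the full table; the point of the sketch is only that `.map` turns an ID with no match into `NaN`, which makes lookup failures easy to count.

```python
import pandas as pd

# Hypothetical teams and matches; ID 303 deliberately has no team entry
demo_teams = pd.DataFrame({"team_long_name": ["Team A", "Team B"]}, index=[101, 202])
demo_matches = pd.DataFrame({"home_team_api_id": [101, 202, 303]})

id_to_name = demo_teams["team_long_name"].to_dict()
mapped = demo_matches["home_team_api_id"].map(id_to_name)

unmatched = int(mapped.isna().sum())  # IDs that had no entry in the lookup
print(mapped.tolist(), unmatched)
```

With `.replace`, by contrast, an ID without a match stays in the column as a number, so a failed lookup is harder to spot.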

In [10]:
match_df.drop(['country_id', 'league_id', 'home_team_api_id', 'away_team_api_id'], axis=1, inplace=True)

In [11]:
match_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25979 entries, 1 to 25979
Data columns (total 76 columns):
country            25979 non-null object
league             25979 non-null object
season             25979 non-null object
stage              25979 non-null int64
date               25979 non-null datetime64[ns]
match_api_id       25979 non-null int64
home_team_goal     25979 non-null int64
away_team_goal     25979 non-null int64
home_player_X1     24158 non-null float64
home_player_X2     24158 non-null float64
home_player_X3     24147 non-null float64
home_player_X4     24147 non-null float64
home_player_X5     24147 non-null float64
home_player_X6     24147 non-null float64
home_player_X7     24147 non-null float64
home_player_X8     24147 non-null float64
home_player_X9     24147 non-null float64
home_player_X10    24147 non-null float64
home_player_X11    24147 non-null float64
away_player_X1     24147 non-null float64
away_player_X2     24147 non-null float64
away_player_X

Now I have gotten from the original 116 columns down to 76, which is still not easily readable. This is partially because both teams have 11 players, each of which has 3 dedicated columns in this table:
- `home/away_player_N` - the API ID of the player in that position
- `home/away_player_XN`- the 'X' coordinate position of the player on the field
- `home/away_player_YN`- the 'Y' coordinate position of the player on the field

To me, it makes more sense to condense these 3 fields for each player into a dictionary which specifies the `X,Y` coordinate set of the player's location:
```
{player: (x_coord, y_coord)}
```

To begin, I make a list of all of the player-related columns (they all have the word 'player' in the name).

In [12]:
# a list comprehension of all the column names with the word 'player'
player_cols = [col for col in match_df.columns if 'player' in col]

In [13]:
# drop every match (row) that is missing any value, such as a player ID or coordinate
match_df.dropna(how='any', inplace=True)
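
Before dropping rows it can help to see where the gaps are and how many rows will be lost. The following is a small sketch on a made-up three-row frame; the column names mirror the real ones but the values are invented.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "home_team_goal": [1, 0, 2],
    "home_player_X1": [1.0, np.nan, 1.0],
    "home_player_1": [11.0, 12.0, np.nan],
})

missing_per_column = demo.isna().sum()   # count of NaN values in each column
cleaned = demo.dropna(how="any")         # drop a row if any of its values is missing
rows_dropped = len(demo) - len(cleaned)

print(missing_per_column)
print(f"{rows_dropped} of {len(demo)} rows dropped")
```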

In [14]:
match_df.info()
#for col in player_cols:
#    print(match_df[np.isnan(match_df[col])])

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21361 entries, 146 to 25979
Data columns (total 76 columns):
country            21361 non-null object
league             21361 non-null object
season             21361 non-null object
stage              21361 non-null int64
date               21361 non-null datetime64[ns]
match_api_id       21361 non-null int64
home_team_goal     21361 non-null int64
away_team_goal     21361 non-null int64
home_player_X1     21361 non-null float64
home_player_X2     21361 non-null float64
home_player_X3     21361 non-null float64
home_player_X4     21361 non-null float64
home_player_X5     21361 non-null float64
home_player_X6     21361 non-null float64
home_player_X7     21361 non-null float64
home_player_X8     21361 non-null float64
home_player_X9     21361 non-null float64
home_player_X10    21361 non-null float64
home_player_X11    21361 non-null float64
away_player_X1     21361 non-null float64
away_player_X2     21361 non-null float64
away_player

In [15]:
#for col in player_cols:
#    match_df[col].replace(np.nan, 0, inplace=True)

In [16]:
# map each player ID column to the names of its X and Y coordinate columns
players = {}
home_str = 'home_player_'
away_str = 'away_player_'

for i in range(1, 12):  # players are numbered 1 through 11
    players[home_str+str(i)] = (home_str+'X'+str(i), home_str+'Y'+str(i))
    players[away_str+str(i)] = (away_str+'X'+str(i), away_str+'Y'+str(i))

In [17]:
# build a lookup of player_api_id -> player_name
values_dict = player_df['player_name'].to_dict()

def posit_dict(name, x, y):
    """Return {player name: (x, y)} with integer coordinates."""
    return {name: (int(x), int(y))}

for player, (player_x, player_y) in players.items():
    coords_col = player + '_coords'
    # convert the player ID to a name
    match_df[player] = match_df[player].astype('int')
    match_df[player] = match_df[player].replace(values_dict)
    # condense name, X and Y into a single dictionary column
    match_df[coords_col] = match_df.apply(lambda row: posit_dict(row[player], row[player_x], row[player_y]), axis=1)
    # the three source columns are no longer needed
    match_df.drop([player, player_x, player_y], axis=1, inplace=True)
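
To show what the loop produces for a single player slot, here is a sketch on a made-up two-row frame. It builds the dictionaries with `zip` instead of a row-wise `apply`; that is a swapped-in technique for illustration, and I have not measured it against the cell above on the real table.

```python
import pandas as pd

# Invented names and coordinates for one player slot
demo = pd.DataFrame({
    "home_player_1": ["Player A", "Player B"],
    "home_player_X1": [1.0, 1.0],
    "home_player_Y1": [1.0, 1.0],
})

# Walk the three columns in parallel and build one dictionary per row
demo["home_player_1_coords"] = [
    {name: (int(x), int(y))}
    for name, x, y in zip(demo["home_player_1"],
                          demo["home_player_X1"],
                          demo["home_player_Y1"])
]
demo = demo.drop(columns=["home_player_1", "home_player_X1", "home_player_Y1"])

coords = demo["home_player_1_coords"].tolist()
print(coords)
```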

In [18]:
list(match_df)

['country',
 'league',
 'season',
 'stage',
 'date',
 'match_api_id',
 'home_team_goal',
 'away_team_goal',
 'home_team_name',
 'away_team_name',
 'home_player_1_coords',
 'away_player_1_coords',
 'home_player_2_coords',
 'away_player_2_coords',
 'home_player_3_coords',
 'away_player_3_coords',
 'home_player_4_coords',
 'away_player_4_coords',
 'home_player_5_coords',
 'away_player_5_coords',
 'home_player_6_coords',
 'away_player_6_coords',
 'home_player_7_coords',
 'away_player_7_coords',
 'home_player_8_coords',
 'away_player_8_coords',
 'home_player_9_coords',
 'away_player_9_coords',
 'home_player_10_coords',
 'away_player_10_coords',
 'home_player_11_coords',
 'away_player_11_coords']

In [19]:
match_df.tail()

id,country,league,season,stage,date,match_api_id,home_team_goal,away_team_goal,home_team_name,away_team_name,...,home_player_7_coords,away_player_7_coords,home_player_8_coords,away_player_8_coords,home_player_9_coords,away_player_9_coords,home_player_10_coords,away_player_10_coords,home_player_11_coords,away_player_11_coords
25973,Switzerland,Switzerland Super League,2015/2016,8,2015-09-13,1992089,3,3,FC Zürich,FC Thun,...,"{'Cabral': (4, 7)}","{'Sandro Wieser': (6, 6)}","{'Sangone Sarr': (6, 7)}","{'Gonzalo Zarate': (3, 8)}","{'Oliver Buff': (8, 7)}","{'Gianluca Frontino': (5, 8)}","{'Franck Etoundi': (4, 10)}","{'Nelson Ferreira': (7, 8)}","{'Davide Chiumiento': (6, 10)}","{'Roman Buess': (5, 11)}"
25975,Switzerland,Switzerland Super League,2015/2016,9,2015-09-22,1992091,1,0,FC St. Gallen,FC Thun,...,"{'Everton': (6, 6)}","{'Sandro Wieser': (6, 6)}","{'Geoffrey Treand': (3, 8)}","{'Gonzalo Zarate': (3, 8)}","{'Danijel Aleksic': (5, 8)}","{'Gianluca Frontino': (5, 8)}","{'Yannis Tafer': (7, 8)}","{'Simone Rapp': (7, 8)}","{'Sandro Gotal': (5, 11)}","{'Roman Buess': (5, 11)}"
25976,Switzerland,Switzerland Super League,2015/2016,9,2015-09-23,1992092,1,2,FC Vaduz,FC Luzern,...,"{'Moreno Costanzo': (6, 7)}","{'Hekuran Kryeziu': (5, 7)}","{'Joel Untersee': (8, 7)}","{'Remo Freuler': (7, 7)}","{'Markus Neumayr': (5, 9)}","{'Jahmir Hyka': (3, 10)}","{'Robin Kamber': (4, 11)}","{'Marco Schneuwly': (5, 10)}","{'Franz Burgmeier': (6, 11)}","{'Jakob Jantscher': (7, 10)}"
25977,Switzerland,Switzerland Super League,2015/2016,9,2015-09-23,1992093,2,0,Grasshopper Club Zürich,FC Sion,...,"{'Marko Basic': (6, 6)}","{'Veroljub Salatic': (6, 6)}","{'Yoric Ravet': (3, 8)}","{'Ebenezer Assifuah': (3, 8)}","{'Shani Tarashaj': (5, 8)}","{'Carlitos': (5, 8)}","{'Caio': (7, 8)}","{'Edmilson Fernandes': (7, 8)}","{'Munas Dabbur': (5, 11)}","{'Moussa Konate': (5, 11)}"
25979,Switzerland,Switzerland Super League,2015/2016,9,2015-09-23,1992095,4,3,BSC Young Boys,FC Basel,...,"{'Denis Zakaria': (4, 7)}","{'Zdravko Kuzmanovic': (6, 6)}","{'Alain Rochat': (6, 7)}","{'Birkir Bjarnason': (3, 8)}","{'Miralem Sulejmani': (8, 7)}","{'Matias Emilio Delgado': (5, 8)}","{'Yuya Kubo': (4, 10)}","{'Shkelzen Gashi': (7, 8)}","{'Alexander Gerndt': (6, 10)}","{'Breel Embolo': (5, 11)}"



### Data Cleaning (Replace this with more specific notes!)

In [20]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.


<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1 (Replace this header name!)

In [21]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [22]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

