# Fiba Europe Machine Learning Project

## Part 1: Acquiring Data


There are two sites which make raw play-by-play data for Fiba basketball games available for download:

1. Fiba Europe
Ex: http://live.fibaeurope.com/www/playbyplay.aspx?gameid=60636&from=1
Data Format: xml

2. Fiba Live Stats 
Ex: http://www.fibalivestats.com/data/885490/data.json
Data Format: json


Fiba Europe seems to focus primarily on European games (as the name suggests), while Fiba Live Stats appears to include both European and "World" games. Many of these "World" leagues/competitions are quite small indeed (think "local amateur tournament in a high school gym in the Philippines"). As such, this workbook will focus on ***Fiba Europe***.


### Important Note:

*As of November 2019, live.fibaeurope.com appears to be down, or perhaps defunct (fibalivestats.com appears to still be working).*

*I have left in the parts below relating to downloading a fiba basketball play-by-play xml file, in case the site ever comes back. In the meantime, to help demonstrate the data acquisition and processing steps, I've included some sample play-by-play files in the directory `fiba_europe_example`*


### Downloading Some Matches

Provided that you have a match ID for a given match, it is a simple thing to retrieve the XML file by plugging in the match ID in the "gameid=" parameter:

(*Note that I use "match id" and "game id" interchangeably throughout this process*)

So for example, if we are interested in match "60636", the link would look like so:

http://live.fibaeurope.com/www/playbyplay.aspx?acc=1&gameid=60636&from=1



In [35]:
"""
Downloading a match
"""
import requests
import os 


# Make a directory for our example downloads
example_local_destination_directory = 'fiba_europe_example'

if not os.path.exists(example_local_destination_directory):
    os.makedirs(example_local_destination_directory)    
    print("created directory: " + example_local_destination_directory)
    
# now let's download a match
example_match_id = 60636
url = 'http://live.fibaeurope.com/www/playbyplay.aspx?acc=1&gameid={matchid}&from=1'.format(matchid=str(example_match_id))
response = requests.get(url)

# check to see if response is valid
if response.status_code != 200:
    print("Unable to download match {example_match_id} ".format(example_match_id=str(example_match_id)))
    print("Error code: {status_code}".format(status_code=str(response.status_code)))
    print("Reason: {reason}".format(reason=str(response.reason)))
else:
    # join directory and file name so the file lands inside the example directory
    destination_path = os.path.join(example_local_destination_directory, str(example_match_id) + '.xml')
    with open(destination_path, 'w') as f:
        f.write(response.text)
    print("1 match downloaded")


Unable to download match 60636 
Error code: 404
Reason: Not Found


In [36]:
# in case the download failed, I have added some example play by play files here
os.listdir(example_local_destination_directory) 

['61267.xml', '60680.xml', '60636.xml']

In [1]:
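# let's see a sample
from itertools import islice

example_match_id = '60636'

# use a context manager so the file is closed after reading
with open("fiba_europe_example/" + str(example_match_id) + ".xml", "r") as f:
    flines = f.readlines()

# each line keeps its trailing newline, so print() yields the double-spaced output below
for line in islice(flines, 40):
    print(line)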
        


<FE>

  <HEADER competition="EuroChallenge" round="Last Sixteen, Group L" quarter="4" time="FINAL" logo="http://live.fibaeurope.com/www/gallery/C61B71F3-EC5A-4315-A20B-3AAEA3751891.jpg" duration="15,6432">

    <TEAM name="TS Medical Park" logo="http://www.fibaeurope.com/files/{6A8BCF6E-9977-4706-989E-B582089D3D40}logo_big.gif" pts="76" fouls="7" />

    <TEAM name="Mons-Hainaut" logo="http://www.fibaeurope.com/files/{ECF8A606-44AF-4D55-B2B6-91F54B3977F8}logo_big.gif" pts="68" fouls="4" />

    <QUARTERS>

      <QUARTER n="1" scoreA="23" scoreB="12" time="100" />

      <QUARTER n="2" scoreA="21" scoreB="23" time="100" />

      <QUARTER n="3" scoreA="14" scoreB="10" time="100" />

      <QUARTER n="4" scoreA="18" scoreB="23" time="100,00" />

    </QUARTERS>

  </HEADER>

  <TICKER text="J. Love [MON] - 10 rebounds" duration="0">

    <ITEM text="TS Medical Park - 47,6% FT (10/21)" />

  </TICKER>

  <OVERVIEW duration="46,8509" />

  <PLAYBYPLAY homeTeamImg="http://www.fibaeurope.co


### Immediate Issues

So (hopefully...) we are able to access play-by-play xml files. Great! However, we are then presented with a few challenges. 

**1. As far as I can tell, there is no single repository of fiba gameids**

They can be retrieved using the link structure above, but it is necessary to know the gameid beforehand to access the data.

[Here is a project with some match ids and leagues](https://github.com/bziarkowski/euRobasket). It's nice, and also provides some hard-to-find metadata about the matches (as well as accompanying R analysis, etc), but for the purposes of machine learning in bulk, we'll need far more matches.

I found that the best (if slow and extremely annoying) solution was to scrape the site slowly over the course of several weeks, blindly downloading as many matches as I could.

The end result was a collection of about 40,000 matches. 
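
The blind scrape described above could be sketched roughly as follows. This is an illustration only, not the script that built the collection: the names `fetch_match` and `scrape_match_range`, the two-second delay, and the idea of simply trying integer ids in sequence are all assumptions. It also uses the standard library's `urllib` in place of `requests` so that it has no dependencies, and it stores the raw bytes to avoid guessing at the file encoding.

```python
import os
import time
import urllib.request

# Same link structure as above; only the gameid changes (assumed to be an integer).
URL_TEMPLATE = 'http://live.fibaeurope.com/www/playbyplay.aspx?acc=1&gameid={matchid}&from=1'


def fetch_match(match_id):
    """Return the raw play-by-play bytes for match_id, or None if the request fails."""
    url = URL_TEMPLATE.format(matchid=match_id)
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read()
    except OSError:
        # HTTP errors such as 404, timeouts and connection errors are all OSError subclasses
        return None


def scrape_match_range(match_ids, destination_directory, fetch=fetch_match, delay_seconds=2.0):
    """Try each candidate id, save the hits, and return the ids that were saved."""
    os.makedirs(destination_directory, exist_ok=True)
    saved = []
    for match_id in match_ids:
        path = os.path.join(destination_directory, '{}.xml'.format(match_id))
        if os.path.exists(path):
            continue  # already downloaded on an earlier run
        payload = fetch(match_id)
        if payload is not None:
            with open(path, 'wb') as f:
                f.write(payload)
            saved.append(match_id)
        time.sleep(delay_seconds)  # pause between requests to go easy on the server
    return saved
```

Because files that already exist are skipped, a run such as `scrape_match_range(range(60000, 61000), 'fiba_europe_example')` (a hypothetical id range) can be interrupted and restarted without downloading the same matches twice.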

**2. While the raw match files contain some metadata like competition name, location of the match and team names, there are NO DATES associated with these matches.**

So, when did they occur, and in what sequence? This is not strictly necessary to create algorithms that rely on in-game information to predict outcomes, but I found it an annoyance and sought some solutions, which can be found in [Part 3: "Adding Metadata"](fiba_part3_finding_additional_metadata.ipynb).

**3. There is no clean metadata regarding the age group and sex of the players. Matches include all age groups, from U14 to professional, for both male and female athletes.**

I was hoping to split out the algorithms by age and sex, so this too was frustrating. I also sought solutions to this issue. These too can be found in [Part 3: "Adding Metadata"](fiba_part3_finding_additional_metadata.ipynb).



#### Next Steps:

* Now that it is (or at least, was) possible to download raw play-by-play files, they must be processed into a form more conducive to data analysis. 

* This involves extracting what metadata there is in the file (there is some--team names, competition name, etc), then iterating through the play-by-play data and "flattening" the events occurring therein.

* The result is a big flat dataframe with each aspect of each event split out into its own column.
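
As a small taste of the metadata-extraction step, here is a minimal sketch that pulls the `<HEADER>` fields visible in the sample above into one flat dict. It is not the Part 2 code: the function names and output keys are hypothetical, and it assumes that the first `<TEAM>` element corresponds to `scoreA` and the second to `scoreB` (in the sample, the quarter scores do sum to each team's `pts` under that assumption). Note that the files use decimal commas (`duration="15,6432"`), so those values need converting before they can be treated as numbers.

```python
import xml.etree.ElementTree as ET

# Header of the sample match shown earlier, trimmed of its logo attributes.
SAMPLE_XML = """
<FE>
  <HEADER competition="EuroChallenge" round="Last Sixteen, Group L" quarter="4" time="FINAL" duration="15,6432">
    <TEAM name="TS Medical Park" pts="76" fouls="7" />
    <TEAM name="Mons-Hainaut" pts="68" fouls="4" />
    <QUARTERS>
      <QUARTER n="1" scoreA="23" scoreB="12" time="100" />
      <QUARTER n="2" scoreA="21" scoreB="23" time="100" />
      <QUARTER n="3" scoreA="14" scoreB="10" time="100" />
      <QUARTER n="4" scoreA="18" scoreB="23" time="100,00" />
    </QUARTERS>
  </HEADER>
</FE>
"""


def parse_decimal(value):
    """Convert a decimal-comma string such as '15,6432' to a float."""
    return float(value.replace(',', '.'))


def extract_header_metadata(xml_text):
    """Return a flat dict of the metadata found in the <HEADER> element."""
    root = ET.fromstring(xml_text)
    header = root.find('HEADER')
    teams = header.findall('TEAM')  # assumed order: team A first, team B second
    row = {
        'competition': header.get('competition'),
        'round': header.get('round'),
        'duration': parse_decimal(header.get('duration')),
        'team_a': teams[0].get('name'),
        'team_b': teams[1].get('name'),
        'pts_a': int(teams[0].get('pts')),
        'pts_b': int(teams[1].get('pts')),
    }
    # one pair of columns per quarter, e.g. q1_score_a / q1_score_b
    for quarter in header.iter('QUARTER'):
        n = quarter.get('n')
        row['q{}_score_a'.format(n)] = int(quarter.get('scoreA'))
        row['q{}_score_b'.format(n)] = int(quarter.get('scoreB'))
    return row


row = extract_header_metadata(SAMPLE_XML)
```

Collecting one such dict per match into a list and passing it to `pandas.DataFrame` would give one metadata row per match; the play-by-play events themselves need the fuller treatment described in Part 2.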

**The data processing steps are in [Part 2: "Processing Data"](fiba_part2_process_data.ipynb)**
