# Notebook

This notebook explores how to parse XML data from a string with the ElementTree module from Python's standard library (xml.etree.ElementTree).

# Prep

## A - Imports

In [None]:
import sqlite3
import os
import pandas as pd
import xml.etree.ElementTree as ET

## B - DB path

In [None]:
# reads the project folder from the FOOTBALL_ANALYTICS environment variable, if you have set one up
working_dir = os.environ.get("FOOTBALL_ANALYTICS")
db_file = os.path.join(working_dir, "db", "football_database.sqlite")

# or use this and input the path where your version of the DB file is stored
# db_file = "C:/.../filepath.sqlite"

## C - Query

In [None]:
# retrieving Real Madrid matches
get_rma_matches = """
    SELECT
        m.id, 
        m.country_id, 
        m.league_id,
        m.season,
        m.stage,
        m.date,
        m.match_api_id,
        m.home_team_api_id,
        ht.team_long_name AS home_team_name,
        ht.team_short_name AS home_team_acronym,
        m.away_team_api_id,
        at.team_long_name AS away_team_name,
        at.team_short_name AS away_team_acronym,
        m.home_team_goal,
        m.away_team_goal,
        m.goal,
        m.shoton,
        m.shotoff,
        m.foulcommit,
        m.card,
        m.cross,
        m.corner,
        m.possession
    FROM Match m
    LEFT JOIN Team AS ht
    ON ht.team_api_id = m.home_team_api_id
    LEFT JOIN Team AS at
    ON at.team_api_id = m.away_team_api_id
    WHERE (
        m.home_team_api_id = 8633
        OR m.away_team_api_id = 8633
    )
"""

## D - Data retrieval

In [None]:
# connect to DB
conn = sqlite3.connect(db_file)

# create a cursor object
cursor = conn.cursor()

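# run the query and build the dataframe from the rows and column names
query_output = cursor.execute(get_rma_matches).fetchall()
column_names = [row[0] for row in cursor.description]
df = pd.DataFrame(data=query_output, columns=column_names)

# close the connection now that the data is loaded
conn.close()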
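
As a possible alternative to the cursor-based retrieval above, pandas can run the query and build the dataframe in one call with read_sql_query. The sketch below is only an illustration: it uses an in-memory database and a made-up, simplified Match table rather than the real football database.

```python
import sqlite3
import pandas as pd

# in-memory database with a made-up, simplified table (illustration only)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE Match (id INTEGER, home_team_api_id INTEGER, goal TEXT)"
)
demo_conn.executemany(
    "INSERT INTO Match VALUES (?, ?, ?)",
    [(1, 8633, "<goal />"), (2, 9999, None)],
)

# read_sql_query runs the query and returns a dataframe, column names included
demo_df = pd.read_sql_query(
    "SELECT id, goal FROM Match WHERE home_team_api_id = ?",
    demo_conn,
    params=(8633,),
)

# closing the connection once the data is loaded
demo_conn.close()

print(demo_df)
```

Passing the team ID through params keeps the value out of the SQL string, which is handy if we later want to run the same query for another team.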

# Using XML ElementTree

## 1 - Getting started

First, we retrieve the XML data as a string (xml_string).
Second, we parse it to get the root element, then read the root's tag (the name between angle brackets, e.g. goal in <goal>) and its attributes.

Any element, the root or otherwise, is characterized by a tag and a dictionary of attributes (key-value pairs).

In [None]:
# xml string = value of the "goal" column for the row with index label 4 (the 5th row)
xml_string = df.loc[4, "goal"]

# getting the root element
root = ET.fromstring(xml_string)

# getting the tag
root_tag = root.tag

# getting the attributes, if any
root_attrib = root.attrib
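
To make the tag / attributes distinction concrete, here is a minimal sketch on a made-up XML string (the "source" attribute and the values are invented for illustration, they do not come from the database).

```python
import xml.etree.ElementTree as ET

# made-up XML string: a root with one attribute and one child
demo_xml = '<goal source="demo"><value><elapsed>22</elapsed></value></goal>'
demo_root = ET.fromstring(demo_xml)

print(demo_root.tag)     # goal
print(demo_root.attrib)  # {'source': 'demo'}
print(len(demo_root))    # 1 (number of direct children)
```

An element without attributes simply returns an empty dictionary for attrib.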

## 2 - Leveraging XML ETree functions

### A - Getting children elements

The root element by itself is of no use in our case, so we will explore what we can get from its child elements, i.e. the elements directly below the root.
Note that the game we are analyzing had 4 goal events, so we expect to find 4 "value" child elements.

In [None]:
# looping over each child in root
for child in root:
    print(child.tag, child.attrib)

### B - Finding meaningful data with XML ETree

In our case, the root is the goal element, its children are the value elements, and the data we want to fetch is one level below.
<br>We are typically looking for elements such as "elapsed" (when a goal was scored), "subtype" (the kind of goal scored), "player1" (the player who scored), etc.

### C - Accessing an element by its tag

If we know what we are looking for, e.g. elements with the tag "player1", we can use the root.iter() function, which walks the whole tree below the root and yields every element with that tag.

In [None]:
for element in root.iter("player1"):
    print(element)
    print(element.tag, element.text)
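
The same idea can be tried on a made-up XML string (the player IDs and minutes below are sample values chosen for illustration). This sketch collects the texts into a list instead of printing them.

```python
import xml.etree.ElementTree as ET

# made-up sample mimicking the structure of the "goal" column
demo_xml = (
    "<goal>"
    "<value><elapsed>22</elapsed><player1>37576</player1></value>"
    "<value><elapsed>57</elapsed><player1>30834</player1></value>"
    "</goal>"
)
demo_root = ET.fromstring(demo_xml)

# iter() walks the whole tree, so it finds "player1" at any depth
scorers = [element.text for element in demo_root.iter("player1")]
print(scorers)  # ['37576', '30834']
```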

### D - Accessing elements by their location / index

Let's say we want to analyze the data related to a given goal, e.g. the 1st goal of every game.
<br>We would use something like root[i] to navigate our way, or root[i][j] to navigate even further. Indexes start at 0, so root[0] is the first goal event.

In [None]:
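# children are nested: we use indexes (root[1] is the second child, as indexes start at 0)
print(root[1].tag)

# if we want the grandchildren, we can use another index
print(root[1][1].tag)

# if we want the great-grandchildren, we can use another index, etc.
# (this raises an IndexError if the element has no children)
print(root[1][1][0].tag)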
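
Here is a small sketch of index-based navigation on a made-up XML string (sample values only). It assumes each value element lists elapsed first and player1 second, which is a property of this sample rather than a guarantee about the real data.

```python
import xml.etree.ElementTree as ET

# made-up sample mimicking the structure of the "goal" column
demo_xml = (
    "<goal>"
    "<value><elapsed>22</elapsed><player1>37576</player1></value>"
    "<value><elapsed>57</elapsed><player1>30834</player1></value>"
    "</goal>"
)
demo_root = ET.fromstring(demo_xml)

# indexes are zero-based: demo_root[0] is the first goal event
first_goal = demo_root[0]
print(first_goal.tag)                         # value
print(first_goal[0].tag, first_goal[0].text)  # elapsed 22

# an out-of-range index raises IndexError, so checking len() first is safer
if len(demo_root) > 1:
    print(demo_root[1][1].text)               # 30834
```

Because positions can differ from one event to the next, looking elements up by tag is usually more robust than relying on indexes.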

### E - Accessing elements by their text / values

Let's say you want to analyze the goals of a specific player, e.g. "37576", who scored 2 goals in the game we are analyzing.
<br>Note: player "37576" is Javier Saviola, a former Argentina international who played for Real Madrid from 2007 to 2009.

In [None]:
searched_text = "37576"

for element in root.iter():
    if element.text == searched_text:
        print(element.tag, element.text)
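
If the goal is to count goals per player rather than to print matches, one option is to combine iter() with collections.Counter. The sketch below runs on a made-up XML string in which player "37576" appears twice (sample values for illustration only).

```python
from collections import Counter
import xml.etree.ElementTree as ET

# made-up sample: three goal events, two of them by player "37576"
demo_xml = (
    "<goal>"
    "<value><elapsed>22</elapsed><player1>37576</player1></value>"
    "<value><elapsed>57</elapsed><player1>30834</player1></value>"
    "<value><elapsed>80</elapsed><player1>37576</player1></value>"
    "</goal>"
)
demo_root = ET.fromstring(demo_xml)

# counting how many times each player ID appears as goal scorer
goals_per_scorer = Counter(e.text for e in demo_root.iter("player1"))
print(goals_per_scorer["37576"])  # 2
```

Restricting the search to "player1" avoids false positives, since comparing the text of every element could also match another element (an event id, for instance) that happens to hold the same number.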

### F - A note on find and findall

find and findall do the same thing, searching for elements among the direct children of the current element, but find returns only the first match while findall returns a list of all matches.
<br>While there were 4 goals scored, find outputs only 1 "value" element, while findall outputs all 4 "value" elements.

In [None]:
# searching for the number of events to analyze

# outputs a single value (the first found)
print(root.find("value"))

# outputs all existing values (total of 4 in our case)
print(root.findall("value"))
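
One detail worth illustrating: with a bare tag name, find and findall only look at direct children. The sketch below, on a made-up XML string (sample values), shows that "player1" is not found from the root unless we give a path to it.

```python
import xml.etree.ElementTree as ET

# made-up sample mimicking the structure of the "goal" column
demo_xml = (
    "<goal>"
    "<value><elapsed>22</elapsed><player1>37576</player1></value>"
    "<value><elapsed>57</elapsed><player1>30834</player1></value>"
    "</goal>"
)
demo_root = ET.fromstring(demo_xml)

first_value = demo_root.find("value")     # first match only
all_values = demo_root.findall("value")   # list of all matches
print(len(all_values))                    # 2

# "player1" is a grandchild of the root, so a bare tag name finds nothing
direct = demo_root.find("player1")
print(direct)                             # None

# a path reaches it: ".//" searches at any depth below the current element
nested = demo_root.find(".//player1")
print(nested.text)                        # 37576

# findtext returns the text of the first match directly
first_scorer = demo_root.findtext("value/player1")
print(first_scorer)                       # 37576
```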

## 3 - Leveraging XPath expressions

XPath expressions are, like regular expressions (regex) for text, a dedicated syntax for selecting content from a tree structure.
<br>When we pass text to the find and findall functions, it is actually an XPath expression that we are passing as an argument. A simple one, but an XPath expression nonetheless. Note that ElementTree supports only a subset of the XPath syntax.
<br>For more details on XPath syntax, check [this W3Schools resource](https://www.w3schools.com/xml/xpath_syntax.asp)!

In [None]:
# All "value" children of the root element
# a bare tag name selects direct children of the current node (here: root)
# "." selects the current node, "//" selects matching elements at any depth below it
for element in root.findall("value"):
    print(element.tag, element.text)

# Getting the goal scorer (player1) for all goals
for element in root.findall("value//player1"):
    print(element.tag, element.text)

# Getting the goal scorer (player1) for the first goal only
for element in root.findall(".//value[1]//player1"):
    print(element.tag, element.text)
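
ElementTree's XPath subset also supports predicates, which can filter elements by the text of a child. Here is a sketch on a made-up XML string (sample values only) that selects the goal events of one scorer and then the last goal event.

```python
import xml.etree.ElementTree as ET

# made-up sample: three goal events, two of them by player "37576"
demo_xml = (
    "<goal>"
    "<value><elapsed>22</elapsed><player1>37576</player1></value>"
    "<value><elapsed>57</elapsed><player1>30834</player1></value>"
    "<value><elapsed>80</elapsed><player1>37576</player1></value>"
    "</goal>"
)
demo_root = ET.fromstring(demo_xml)

# [player1='37576'] keeps only "value" elements whose player1 child has that text
scorer_goals = demo_root.findall("./value[player1='37576']")
minutes = [value.findtext("elapsed") for value in scorer_goals]
print(minutes)  # ['22', '80']

# [last()] selects the last "value" child of the root
last_goal = demo_root.find("./value[last()]")
print(last_goal.findtext("elapsed"))  # 80
```

Compared with looping over every element and comparing texts, the predicate returns the whole goal event, so the related data (minute, team, etc.) stays within reach.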

## 4 - Extracting meaningful data from goal events and storing these in a dataframe

Now that we have reviewed how to use XML ElementTree and its XPath capabilities, here is one way to extract meaningful data out of our SQL table's column made of XML strings.
<br>This will iterate over each row (i.e. over each match) and parse each goal event into a dictionary, with descriptive field names as keys and the child elements' texts as values, along with the match ID taken from the match_api_id column (which is not stored as XML).
<br>These dictionaries are appended to a list, which is then transformed into a Pandas dataframe.

<br>Note on retrieving the match ID: for data modelling purposes, we want to store the match ID as well. It will be useful when we create events tables, so we can have a 1-N relationship between match (1) and goal events (N).

In [None]:
# creating an empty list to host event dictionaries
goal_events_dicts_list = []

# iterate over each row / match, pairing every XML string with its match ID
for match_id, xml_string in zip(df["match_api_id"], df["goal"]):

    # skip matches without goal data (e.g. NULL in the database)
    if not isinstance(xml_string, str):
        continue

    root = ET.fromstring(xml_string)

    # extracting elements below value
    for element in root.findall("./value"):

        # Access specific child elements within each 'value' element
        # (findtext returns None when the child is missing)
        elapsed_time = element.findtext("elapsed")
        elapsed_time_plus = element.findtext("elapsed_plus")
        team = element.findtext("team")
        goal_scorer = element.findtext("player1")
        assist_player = element.findtext("player2")
        event_type_key = element.findtext("event_incident_typefk")
        event_type = element.findtext("type")  # renamed to avoid shadowing the built-in type
        sub_type = element.findtext("subtype")
        event_id = element.findtext("id")

        # appending the list with dictionaries
        goal_events_dicts_list.append({
            "match_id": match_id, # add current match ID to the dictionary
            "event_id": event_id,
            "event_type": event_type,
            "event_sub_type": sub_type,
            "team": team,
            "goal_scorer": goal_scorer,
            "assist_player": assist_player,
            "elapsed_time": elapsed_time,
            "elapsed_additional_time": elapsed_time_plus,
            "event_type_key": event_type_key
        })

goal_events_df = pd.DataFrame(goal_events_dicts_list)

In [None]:
goal_events_df
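
The same loop could be generalized to the other XML columns retrieved by the query (shoton, card, corner, etc.). The sketch below is one possible approach, not a tested solution for the real data: parse_events is a hypothetical helper, the dataframe is made up, and it assumes every event sits in a "value" element directly below the root, as in the goal column.

```python
import xml.etree.ElementTree as ET
import pandas as pd


def parse_events(match_id, xml_string):
    """Return one dictionary per <value> element (hypothetical helper)."""
    # missing data (None / NaN) yields no events
    if not isinstance(xml_string, str):
        return []

    events = []
    for value in ET.fromstring(xml_string).findall("./value"):
        event = {"match_id": match_id}
        # keep only leaf children, i.e. those without nested elements
        for child in value:
            if len(child) == 0:
                event[child.tag] = child.text
        events.append(event)
    return events


# made-up dataframe: one match with two goals, one without data, one without goals
demo_df = pd.DataFrame({
    "match_api_id": [1, 2, 3],
    "goal": [
        "<goal>"
        "<value><elapsed>22</elapsed><player1>37576</player1></value>"
        "<value><elapsed>57</elapsed><player1>30834</player1></value>"
        "</goal>",
        None,
        "<goal />",
    ],
})

rows = []
for match_id, xml_string in zip(demo_df["match_api_id"], demo_df["goal"]):
    rows.extend(parse_events(match_id, xml_string))

demo_events_df = pd.DataFrame(rows)
print(demo_events_df)
```

Unlike the explicit version above, this keeps the XML tag names as column names, so a renaming step would still be needed to get descriptive columns such as goal_scorer.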