## Trying to Download HTML ##

Let's try to scrape goal-scoring data for the English Premier League directly from its website.

In [None]:
import urllib.request, urllib.error, urllib.parse

url = 'https://www.premierleague.com/stats/top/players/goals'

response = urllib.request.urlopen(url)
content = response.read().decode('UTF-8')

print(content[:500])

In [2]:
if "Shearer" in content:
    index = content.index("Shearer")
    print(content[index-100:index+100])
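
The search above prints nothing when the name is missing. The slice `content[index-100:index+100]` also behaves unexpectedly when the match falls within the first 100 characters, because a negative start index counts from the end of the string. A minimal sketch of a safer lookup follows; `snippet_around` is a hypothetical helper name and the sample string is made up for illustration.

```python
def snippet_around(text, needle, width=100):
    """Return the text surrounding the first occurrence of needle, or None."""
    index = text.find(needle)
    if index == -1:
        return None  # report a miss explicitly instead of printing nothing
    start = max(0, index - width)  # clamp so a negative start cannot wrap around
    return text[start:index + len(needle) + width]

# Made-up sample standing in for the downloaded HTML
sample = "<td>Alan Shearer</td><td>260</td>"
print(snippet_around(sample, "Shearer", width=5))
print(snippet_around(sample, "Haaland"))
```

With the real page, `snippet_around(content, "Shearer")` would return `None` if the name is not present in the downloaded HTML.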

## Using the Football API ##

* [SOAP vs REST vs JSON - a 2021 Comparison by RAYGUN.com](https://raygun.com/blog/soap-vs-rest-vs-json/)
* [Premier League Football Website](https://www.premierleague.com)
* [API Documentation at API-Football](https://www.api-football.com/documentation-v3)
* [API Documentation at RapidAPI](https://rapidapi.com/api-sports/api/api-football/)

Let's begin by making a basic query and then exploring the results. Keep in mind that this code was written against the version of the Football API that was current in 2021. Newer versions of the API may exist, although this original code probably still works.

In [None]:
# Tutorial Example from https://rapidapi.com/api-sports/api/api-football/

import requests

url = "https://api-football-v1.p.rapidapi.com/v3/leagues"

headers = {
    "X-RapidAPI-Host": "api-football-v1.p.rapidapi.com",
    "X-RapidAPI-Key": "f1a103386cmsh746f31f1a902b78p17e200jsn9a405529f646"
}

response = requests.request("GET", url, headers=headers)

print(response.text)
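
The tutorial example places the API key directly in the notebook. One common alternative, sketched here under assumptions, is to read the key from an environment variable so it never appears in shared code. The variable name `RAPIDAPI_KEY` and the helper `build_headers` are hypothetical and not part of the RapidAPI tutorial.

```python
import os

def build_headers(api_key=None, host="api-football-v1.p.rapidapi.com"):
    """Build the RapidAPI request headers, reading the key from the environment if none is given."""
    # RAPIDAPI_KEY is an assumed variable name; use whatever name you export in your shell
    key = api_key or os.environ.get("RAPIDAPI_KEY")
    if not key:
        raise RuntimeError("Set the RAPIDAPI_KEY environment variable first")
    return {"X-RapidAPI-Host": host, "X-RapidAPI-Key": key}

# Passing a placeholder key explicitly so the example runs without any setup
headers = build_headers(api_key="example-key")
print(sorted(headers))
```

The resulting dictionary can be passed as `headers=headers` in the same `requests.request` call used above.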

In [51]:
data = response.json()

In [None]:
data.keys()

In [None]:
data['get']

In [None]:
data['parameters']

In [None]:
data['errors']

In [None]:
data['paging']

In [None]:
data['results']

In [None]:
data['response']

In [None]:
data['response'][0]['league']

In [None]:
for entry in data['response'][:10]:
    league_name = entry['league']['name']
    league_ctry = entry['country']['name']
    league_seasons = [season['year'] for season in entry['seasons']]
    print(f"{league_name} ({league_ctry}) - {league_seasons}")
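
The cells above inspect the top-level keys one at a time. As a rough sketch, the same fields can be summarized in one line before digging into `response`. This assumes that an empty `errors` value means the query succeeded; `summarize_envelope` is a hypothetical helper and both sample payloads, including the error message, are made up.

```python
def summarize_envelope(payload):
    """Describe the top-level fields of an API response in one line."""
    errors = payload.get("errors")
    if errors:  # assumption: an empty list or dict means the query succeeded
        return f"{payload.get('get')}: failed with {errors}"
    paging = payload.get("paging", {})
    return (f"{payload.get('get')}: {payload.get('results')} results, "
            f"page {paging.get('current')} of {paging.get('total')}")

# Made-up payloads that mimic the keys explored above
ok = {"get": "leagues", "parameters": [], "errors": [], "results": 2,
      "paging": {"current": 1, "total": 1}, "response": [{}, {}]}
bad = {"get": "teams", "parameters": {}, "errors": {"league": "example error message"},
       "results": 0, "paging": {"current": 1, "total": 1}, "response": []}

print(summarize_envelope(ok))
print(summarize_envelope(bad))
```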

## Helper Functions ##

Let's make some helper functions to download various queries and save the results to disk. We want to save the information since we are on a limited request count (100/day). We'll save the JSON to disk and then load the saved version in order to work with it.

In [26]:
import json
import requests

def save_json_data(filename, data):
    with open(filename, 'w') as fout:
        json_string_data = json.dumps(data)
        fout.write(json_string_data)
        
def load_json_data(filename):
    with open(filename) as fin:
        json_data = json.load(fin)
        return json_data

def download_json_data(filename, url, querystring):
    # References:
    #    https://rapidapi.com/api-sports/api/api-football/
    #    https://www.api-football.com/documentation-v3
    headers = {
        "X-RapidAPI-Host": "api-football-v1.p.rapidapi.com",
        "X-RapidAPI-Key": "f1a103386cmsh746f31f1a902b78p17e200jsn9a405529f646"
    }
    response = requests.request("GET", url, headers=headers, params=querystring)

    json_data = response.json()
    save_json_data(filename, json_data)
    pages_left = json_data['paging']['total'] - json_data['paging']['current']
    result = { 'get':json_data['get'], 'parameters':json_data['parameters'], 
               'errors':json_data['errors'], 'results':json_data['results'],
               'pages_remaining':pages_left,
               'response':json_data['response'] }
    return result
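
`download_json_data` reports `pages_remaining` but fetches only one page. The sketch below shows one way to walk through every page, assuming the endpoint accepts a `page` query parameter and returns the `paging` block seen earlier. `fetch_all_pages` and `fake_fetch` are hypothetical names; the fetch function is passed in so the loop can be tried without spending any API requests.

```python
def fetch_all_pages(fetch_page, querystring=None):
    """Collect the 'response' entries from every page of a paged query.

    fetch_page is any callable that takes a params dict and returns decoded JSON.
    """
    params = dict(querystring or {})  # copy so the caller's dict is not modified
    entries = []
    page = 1
    while True:
        params["page"] = page  # assumption: the endpoint accepts a 'page' parameter
        payload = fetch_page(params)
        entries.extend(payload["response"])
        if payload["paging"]["current"] >= payload["paging"]["total"]:
            break
        page += 1
    return entries

# Stand-in for a real download: two made-up pages of results
def fake_fetch(params):
    pages = {1: ["a", "b"], 2: ["c"]}
    return {"paging": {"current": params["page"], "total": 2},
            "response": pages[params["page"]]}

print(fetch_all_pages(fake_fetch, {"league": "39"}))
```

Each page counts as a separate request against the daily limit, so it is worth checking `pages_remaining` before looping over a large query.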

## Downloading and Saving JSON ##

Remember, the helper function `download_json_data` saves the raw JSON to disk so that we can examine it later. We must be careful not to download the same data over and over again. The API allows only so many free queries, and past that point the website will start charging us.

Feel free to explore the data at this point, although it might be better to investigate the saved data in a text editor before doing much programming.
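
One way to avoid repeated downloads is to check for the saved file first and only call the API when it is missing. The following is a minimal sketch of that idea, not part of the helpers above: `load_or_download` is a hypothetical name, and the download step is passed in as a function so the example runs without network access.

```python
import json
import os
import tempfile

def load_or_download(filename, download):
    """Return the saved JSON if filename exists; otherwise call download() and save the result."""
    if os.path.exists(filename):
        with open(filename) as fin:
            return json.load(fin)
    data = download()
    with open(filename, "w") as fout:
        json.dump(data, fout)
    return data

# Stand-in for a real API call; it records how many times it was invoked
calls = []
def fake_download():
    calls.append(1)
    return {"get": "leagues", "response": []}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "leagues.json")
    first = load_or_download(path, fake_download)   # downloads and saves
    second = load_or_download(path, fake_download)  # reads the saved file

print(len(calls))
```

In the notebook, the `download` argument could be a `lambda` that wraps `download_json_data` with the desired URL and query string.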

In [None]:
# All leagues
url = "https://api-football-v1.p.rapidapi.com/v3/leagues"
querystring = None
filename = 'leagues.json'
data = download_json_data(filename, url, querystring)
print(f"Downloaded {data['results']} results with {data['pages_remaining']} pages remaining")

# I explored the data before writing this line of code
print(f"{data['response'][0]['league']}")
print(f"...")

In [None]:
# Teams in a certain league
url = "https://api-football-v1.p.rapidapi.com/v3/teams"
querystring = {"league":"39","season":"2021"}
filename = 'premier_teams.json'
data = download_json_data(filename, url, querystring)
print(f"Downloaded {data['results']} results with {data['pages_remaining']} pages remaining")

# I explored the data before writing this line of code
print(f"{data['response'][0]['team']['name']} ({data['response'][0]['team']['id']}) - {data['response'][0]['team']['logo']}")
print(f"...")

In [None]:
# Top Scorers in a certain league by season
url = "https://api-football-v1.p.rapidapi.com/v3/players/topscorers"
querystring = {"league":"39","season":"2021"}
filename = 'premier_top_scorers_2021.json'
data = download_json_data(filename, url, querystring)
print(f"Downloaded {data['results']} results with {data['pages_remaining']} pages remaining")

# I explored the data before writing this line of code
print(f"{data['response'][0]['player']['name']} ({data['response'][0]['player']['id']}): {data['response'][0]['statistics'][0]['goals']['total']} goals")
print(f"...")


## Working with Saved JSON Data ##

Now that we've saved a few JSON files, let's explore them and try to extract the information we really need. After all, the JSON data contains *way* more data than we know what to do with.

In [None]:
def lookup_league_id(json_data, league_name='Premier League', league_country='England'):

    if json_data['get'] != 'leagues':
        print(f"Invalid JSON Data: expected 'leagues' but received '{json_data['get']}'")
    
    matches = []
    fuzzy_matches = []
    for entry in json_data['response']:
        if entry['league']['name'] == league_name and entry['country']['name'] == league_country:
            matches.append(entry['league']['id'])
        elif entry['league']['name'].startswith(league_name):
            fuzzy_matches.append(entry['league']['id'])
    
    if len(matches) > 0:
        return matches
    else:
        return fuzzy_matches
    
json_data = load_json_data('leagues.json')
print(lookup_league_id(json_data, 'Premier League', 'England'))
print(lookup_league_id(json_data, 'Bundesliga', 'Germany'))

In [None]:
def convert_league_to_csv(json_data):
    
    if json_data['get'] != 'leagues':
        print(f"Invalid JSON Data: expected 'leagues' but received '{json_data['get']}'")
    
    rows = ['id, name, country, type, first_season, last_season, logo']
    for entry in json_data['response']:
        first_season = min([season['year'] for season in entry['seasons']])
        last_season = max([season['year'] for season in entry['seasons']])                            
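        # Include the country so each row matches the seven-column header above
        line = (f"{entry['league']['id']}, {entry['league']['name']}, {entry['country']['name']}, " +
                f"{entry['league']['type']}, {first_season}, {last_season}, {entry['league']['logo']}")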
        rows.append(line)
    
    return rows

convert_league_to_csv(json_data)

In [None]:
def convert_teams_to_csv(json_data):
    
    if json_data['get'] != 'teams':
        print(f"Invalid JSON Data: expected 'teams' but received '{json_data['get']}'")
    
    rows = ['league, season, id, name, code, country, founded, national, stadium, city, surface, logo']
    for entry in json_data['response']:
        line = (f"{json_data['parameters']['league']}, {json_data['parameters']['season']}, " +
                f"{entry['team']['id']}, {entry['team']['name']}, {entry['team']['code']}, " +
                f"{entry['team']['country']}, {entry['team']['founded']}, {entry['team']['national']}, " +
                f"{entry['venue']['name']}, {entry['venue']['city']}, {entry['venue']['surface']}, {entry['team']['logo']}")
        rows.append(line)
    
    return rows


json_data = load_json_data('premier_teams.json')
csv_rows = convert_teams_to_csv(json_data)
csv_rows
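
The converters above join values with `", "`, which works until a value itself contains a comma (a venue or city name, for example). As a sketch of an alternative, Python's standard `csv` module quotes such values automatically. `rows_to_csv_text` is a hypothetical helper and the sample row is made up.

```python
import csv
import io

def rows_to_csv_text(header, records):
    """Render a header and a list of row lists as CSV text, quoting values where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(records)
    return buffer.getvalue()

# Made-up row whose last value contains a comma
text = rows_to_csv_text(["id", "name", "city"],
                        [[1, "Example FC", "Leeds, West Yorkshire"]])
print(text)

# Reading the text back recovers the original three columns
parsed = list(csv.reader(io.StringIO(text)))
print(parsed[1])
```

The same `csv.writer` can write to a file opened with `open(filename, 'w', newline='')` instead of an in-memory buffer.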

In [None]:
json_data = load_json_data('premier_top_scorers_2021.json')

print(json_data['response'][0].keys())
print(json_data['response'][0]['player'])
print(json_data['response'][0]['statistics'])

In [None]:
for scorer in json_data['response']:
    print(f"{scorer['player']['name']} ({scorer['player']['nationality']}) ({scorer['player']['id']}): " +
          f"{scorer['player']['height']} / {scorer['player']['weight']}")
    for stats in scorer['statistics']:
        print(f"  {stats['league']['season']} {stats['team']['name']}: Games: {stats['games']['appearences']} " +
              f"Minutes: {stats['games']['minutes']} " + 
              f"Shots On: {stats['shots']['on']}/{stats['shots']['total']} " +
              f"Goals: {stats['goals']['total']} Assists: {stats['goals']['assists']}")
    print()
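
The loop above prints raw totals. To illustrate pulling just the needed numbers out of a `statistics` entry, here is a small sketch that derives goals per 90 minutes from the same `goals` and `games` fields. `goals_per_90` is a hypothetical helper, the sample values are made up, and treating missing values as zero is an assumption about how gaps appear in the data.

```python
def goals_per_90(stats):
    """Compute goals scored per 90 minutes played from one statistics entry."""
    goals = stats["goals"]["total"] or 0      # assumption: a missing value arrives as None
    minutes = stats["games"]["minutes"] or 0
    if minutes == 0:
        return 0.0  # avoid dividing by zero for players with no recorded minutes
    return 90 * goals / minutes

# Made-up entries using the same keys as the saved top-scorer data
sample = {"goals": {"total": 10}, "games": {"minutes": 1800}}
empty = {"goals": {"total": None}, "games": {"minutes": None}}

print(goals_per_90(sample))
print(goals_per_90(empty))
```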

In [None]:
url = 'https://api-football-v1.p.rapidapi.com/v3/players/squads'
querystring = {"team":"63"}
download_json_data('leeds_players.json', url, querystring)

In [None]:
def convert_roster_to_csv(json_data):
    
    if json_data['get'] != 'players/squads':
        print(f"Invalid JSON Data: expected 'players/squads' but received '{json_data['get']}'")
    
    rows = ['team_id, team_name, id, name, age, number, position, photo']
    for entry in json_data['response']:
        for player in entry['players']:
            line = (f"{entry['team']['id']}, {entry['team']['name']}, " +
                    f"{player['id']}, {player['name']}, {player['age']}, " +
                    f"{player['number']}, {player['position']}, {player['photo']}")
            rows.append(line)
    
    return rows

json_data = load_json_data('leeds_players.json')
csv_rows = convert_roster_to_csv(json_data)
csv_rows

In [None]:
url = 'https://api-football-v1.p.rapidapi.com/v3/players'
querystring = {"id":"19134", "season":"2021"}
download_json_data('p_bamford_attacker_leads.json', url, querystring)

In [None]:
def convert_player_to_csv(json_data):
    
    if json_data['get'] != 'players':
        print(f"Invalid JSON Data: expected 'players' but received '{json_data['get']}'")
    
    header = 'id, name, age, height, weight, photo, injured, team, league, season, games, position, minutes, ' + \
    'rating, captain, shots_taken, shots_on, goals, assists, passes, accuracy, tackles, blocks, interceptions, ' + \
    'duels, duels_won, dribble_attempts, dribble_success, fouls_drawn, fouls_committed, penalty_scored, penalty_missed'
    rows = [header]
    for entry in json_data['response']:
        for stats in entry['statistics']:
            line = (f"{entry['player']['id']}, {entry['player']['name']}, {entry['player']['age']}, {entry['player']['height']}, " +
                    f"{entry['player']['weight']}, {entry['player']['photo']}, {entry['player']['injured']}, " +
                    f"{stats['team']['name']}, {stats['league']['name']}, {stats['league']['season']}, " +
                    f"{stats['games']['appearences']}, {stats['games']['position']}, {stats['games']['minutes']}, " +
                    f"{stats['games']['rating']}, {stats['games']['captain']}, " +
                    f"{stats['shots']['total']}, {stats['shots']['on']}, " +
                    f"{stats['goals']['total']}, {stats['goals']['assists']}, " +
                    f"{stats['passes']['total']}, {stats['passes']['accuracy']}, " +
                    f"{stats['tackles']['total']}, {stats['tackles']['blocks']}, {stats['tackles']['interceptions']}, " +
                    f"{stats['duels']['total']}, {stats['duels']['won']}, " +
                    f"{stats['dribbles']['attempts']}, {stats['dribbles']['success']}, " +
                    f"{stats['fouls']['drawn']}, {stats['fouls']['committed']}, " +
                    f"{stats['penalty']['scored']}, {stats['penalty']['missed']}")
            rows.append(line)
    
    return rows

json_data = load_json_data('p_bamford_attacker_leads.json')
csv_rows = convert_player_to_csv(json_data)
csv_rows