# 02. Data Collection

## NBA SQLite database ([the NBA Database](https://github.com/wyattowalsh/nba-db))

This data has already been collected; see [01_problem_definition.ipynb](01_problem_definition.ipynb).

## The NBA PostgreSQL database (collected over API)

Create a PostgreSQL database using DBeaver and name it "nba".

Then look at the API description to understand which tables we should create to store the data:

API description: https://www.balldontlie.io/home.html?shell#get-all-stats

Example request:

`curl "https://www.balldontlie.io/api/v1/stats"`

The above command returns JSON structured like this:
```
{
  "data": [
    {
      "id":29,
      "ast":2,
      "blk":2,
      "dreb":8,
      "fg3_pct":0.25,
      "fg3a":4,
      "fg3m":1,
      "fg_pct":0.429,
      "fga":21,
      "fgm":9,
      "ft_pct":0.8,
      "fta":5,
      "ftm":4,
      "game":{
        "id":1,
        "date":"2018-10-16T00:00:00.000Z",
        "home_team_id":2,
        "home_team_score":105,
        "season":2018,
        "visitor_team_id":23,
        "visitor_team_score":87
      },
      "min":"36:49",
      "oreb":2,
      "pf":3,
      "player":{
        "id":145,
        "first_name":"Joel",
        "last_name":"Embiid",
        "position":"F-C",
        "team_id":23
      },
      "pts":23,
      "reb":10,
      "stl":1,
      "team":{
        "id":23,
        "abbreviation":"PHI",
        "city":"Philadelphia",
        "conference":"East",
        "division":"Atlantic",
        "full_name":"Philadelphia 76ers",
        "name":"76ers"
      },
      "turnover":5
    },
    ...
  ],
  "meta": {
    "total_pages": 2042,
    "current_page": 1,
    "next_page": 2,
    "per_page": 25,
    "total_count": 51045
  }
}
```
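
The `meta` block is what drives pagination. The following is a minimal sketch of walking all pages by following `next_page`, assuming it is `null` (`None` in Python) on the last page. `iter_records` and `fetch_page` are hypothetical names, and `fetch_page` is replaced by an in-memory stand-in so the example runs without network access:

```python
from typing import Callable, Iterator


def iter_records(fetch_page: Callable[[int], dict], start_page: int = 1) -> Iterator[dict]:
    """Yield every item of "data", following meta["next_page"] until it is None."""
    page = start_page
    while page is not None:
        payload = fetch_page(page)
        yield from payload["data"]
        # Assumption: the last page reports next_page as null/None.
        page = payload["meta"].get("next_page")


# Hypothetical in-memory stand-in for the API: two pages of fake records.
FAKE_PAGES = {
    1: {"data": [{"id": 1}, {"id": 2}], "meta": {"current_page": 1, "next_page": 2}},
    2: {"data": [{"id": 3}], "meta": {"current_page": 2, "next_page": None}},
}

record_ids = [record["id"] for record in iter_records(FAKE_PAGES.__getitem__)]
print(record_ids)  # [1, 2, 3]
```

In real use, `fetch_page` would wrap the HTTP request for a given page number and return the decoded JSON.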

The API properties:
- No email required
- No API key required
- Contains data from seasons 1946-current
- Live(ish) game stats are available (updated every ~10 minutes)
- Rate limit of 60 requests per minute
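
The limit of 60 requests per minute works out to one request per second on average. The following is a minimal sketch of a client-side throttle that enforces that spacing. The `Throttle` class and its injectable `clock` and `sleep` parameters are illustrative choices, not part of the project code:

```python
import time


class Throttle:
    """Space calls at least 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Demonstration with a fake clock so the example finishes instantly.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(max_per_minute=60, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # call this right before each request
print(slept)  # [1.0, 1.0]
```

With the default arguments the same object uses the real clock, so calling `throttle.wait()` before each request keeps the client at or below the limit.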
  
Each record in the JSON response contains four groups of fields: game, player, team, and the statistics themselves. We will therefore create the games, players, teams, and stats tables in the nba database.

It is important to pay attention to the following points:
- The "player" dictionary inside each item of the "data" list contains a team_id key. A player may play for several teams in a single season, but the API reports only one team there, so in some cases the team_id in the "player" dictionary does not equal the id in the "team" dictionary. This can violate the foreign key constraint that will be created between the players and teams tables. We will therefore use the id from the "team" dictionary in both cases. This is appropriate because we are primarily interested in the team the player belonged to when the game took place.
- The team id appears in the "team" dictionary (as id), the "game" dictionary (as home_team_id and visitor_team_id), and the "player" dictionary (as team_id). The stats table will also include team_id, so the teams table will be linked to all other tables. If the teams table does not contain every team from the beginning, we may run into foreign key violations. Therefore, we will populate the teams table first and the other tables afterwards.
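
The two points above can be sketched as a small function that splits one record into rows for the four tables and takes the player's team from the "team" dictionary. This is only an illustration: `split_record` is a hypothetical name, the `game_id` and `player_id` column names are assumptions, and the mismatching `team_id` of 99 in the sample record is made up for the demonstration:

```python
def split_record(record):
    """Split one item of "data" into rows for the four tables.

    The player's team_id is taken from record["team"]["id"], not from
    record["player"]["team_id"], so it always matches the team of the game.
    """
    team_id = record["team"]["id"]
    team_row = dict(record["team"])
    game_row = dict(record["game"])
    player_row = {**record["player"], "team_id": team_id}
    # Everything that is not a nested dictionary is a plain statistic.
    stats_row = {k: v for k, v in record.items() if not isinstance(v, dict)}
    # Assumed foreign key column names for the stats table.
    stats_row.update(game_id=game_row["id"], player_id=player_row["id"], team_id=team_id)
    # Keys are ordered so that teams come first, as required by the foreign keys.
    return {"teams": team_row, "games": game_row, "players": player_row, "stats": stats_row}


# Trimmed record in which the player's team_id disagrees with the team id.
record = {
    "id": 29,
    "pts": 23,
    "game": {"id": 1, "home_team_id": 2, "visitor_team_id": 23},
    "player": {"id": 145, "last_name": "Embiid", "team_id": 99},
    "team": {"id": 23, "abbreviation": "PHI"},
}
rows = split_record(record)
print(rows["players"]["team_id"])  # 23
print(rows["stats"])
```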


Import libraries and set environment:

In [12]:
# Import the standard libraries.
import os
import datetime

# Import the third party libraries.
from dotenv import load_dotenv
import psycopg

# Import the local/project packages, modules, and functions.
from utils.data_collection import (
    create_tables,
    get_table_names,
    get_data,
    insert_data_to_teams,
    insert_data,
    fetch_all_data,
    prepare_df
)
# Set environment.
load_dotenv()

True

Establish a connection to our database and create a cursor:

In [2]:
# Connect to the PostgreSQL DB.
conn = psycopg.connect(
    dbname="nba",
    user=os.environ.get("user"),
    password=os.environ.get("password"),
    host=os.environ.get("host"),
    port=os.environ.get("port")
)
# Create a cursor object.
cur = conn.cursor()

Create the tables in the nba database ([see the create_tables function](utils/data_collection.py)) and save changes:

In [3]:
create_tables(cur)
conn.commit()

Output the names of created tables ([see the get_table_names function](utils/data_collection.py)):

In [4]:
get_table_names(cur)

2023-08-10 19:06:49,947 | utils.data_collection | INFO | The following tables have been created: ['teams', 'games', 'players', 'stats'].


Use DBeaver to generate an ER diagram for the nba PostgreSQL database:

<div>
    <img src="figures/2.1 ER diagram.png" alt="Fig. 2.1. The nba PostgreSQL ER diagram." style="display: block; margin: 0 auto;">
    <p style="text-align: center;">Fig. 2.1. The nba PostgreSQL ER diagram.</p>
</div>

Insert the data into the teams table and save the changes in the nba database ([see the insert_data_to_teams function](utils/data_collection.py)).

In [5]:
insert_data_to_teams(cur, conn)

2023-08-10 19:06:56,934 | utils.data_collection | INFO | Successful API call to the teams endpoint.
2023-08-10 19:06:56,944 | utils.data_collection | INFO | The data have been inserted to the teams tables.
2023-08-10 19:06:56,948 | utils.data_collection | INFO | The changes in the DB have been saved.


Insert the data into the games, players, and stats tables and save the changes in the nba database. The data are requested season by season and inserted in batches ([see the get_data function](utils/data_collection.py)). The cell below covers the seasons 1946 through 2022 (the upper bound of `range` is exclusive), that is, games played from 1946 to 2023. It takes several hours. Note: the remote service sometimes closes the connection even if you have not reached the rate limit. If this happens, change the starting year and run the cell again.

In [None]:
seasons = list(range(1946, 2023))
# Collect the seasons one by one; each call inserts the data and commits.
for season in seasons:
    get_data(cur, conn, season, start_page=1)
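
Because the service may drop the connection part-way through, it can help to record where the run stopped. The following is a minimal sketch, assuming the failure surfaces as a Python exception. `collect_seasons` and the stand-in `flaky_fetch` are hypothetical and only illustrate the restart logic; in the notebook, a call to `get_data` would take the place of `fetch`:

```python
def collect_seasons(fetch, seasons):
    """Call fetch(season) in order; return the first season that failed, or None."""
    for season in seasons:
        try:
            fetch(season)
        except Exception as exc:  # e.g. the remote side closed the connection
            print(f"Season {season} failed: {exc!r}")
            return season
    return None


# Hypothetical stand-in that fails once, on season 1948.
already_failed = set()


def flaky_fetch(season):
    if season == 1948 and season not in already_failed:
        already_failed.add(season)
        raise ConnectionError("connection closed by remote host")


seasons = list(range(1946, 1951))
failed = collect_seasons(flaky_fetch, seasons)
# Restart from the season that failed instead of from the beginning.
second_run = collect_seasons(flaky_fetch, [s for s in seasons if s >= failed])
print(failed, second_run)  # 1948 None
```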

In [21]:
# Close the cursor and connection.
cur.close()
conn.close()

Uncomment and run if you want to test the connection to the service:

In [19]:
# import requests
# response = requests.get(
#     url="https://www.balldontlie.io/api/v1/stats",
#     params={"page": 1, "per_page": 100, "seasons[]": 2022}
# )
# data = response.json()

In [None]:
# data

Create a backup (dump) in DBeaver.

## The csv-files (collected using scraping)

The two databases above do not contain enough information about the players, especially their birthdays. We will fetch this data from [basketball-reference.com](https://www.basketball-reference.com/) and save it as csv-files.

Fetch the player information from basketball-reference.com ([see the fetch_all_data function](utils/data_collection.py)):

In [2]:
data = fetch_all_data()

2023-08-23 09:29:52,282 | utils.data_collection | INFO | Fetch data for letter: a. Waiting for 60 s ...
2023-08-23 09:31:15,863 | utils.data_collection | INFO | Fetch data for letter: b. Waiting for 81 s ...
2023-08-23 09:32:31,004 | utils.data_collection | INFO | Fetch data for letter: c. Waiting for 73 s ...
2023-08-23 09:33:53,159 | utils.data_collection | INFO | Fetch data for letter: d. Waiting for 80 s ...
2023-08-23 09:35:31,288 | utils.data_collection | INFO | Fetch data for letter: e. Waiting for 96 s ...
2023-08-23 09:37:14,541 | utils.data_collection | INFO | Fetch data for letter: f. Waiting for 100 s ...
2023-08-23 09:39:09,802 | utils.data_collection | INFO | Fetch data for letter: g. Waiting for 112 s ...
2023-08-23 09:40:42,343 | utils.data_collection | INFO | Fetch data for letter: h. Waiting for 90 s ...
2023-08-23 09:42:08,419 | utils.data_collection | INFO | Fetch data for letter: i. Waiting for 84 s ...
2023-08-23 09:43:40,883 | utils.data_collection | INFO | Fetch

In [4]:
data

Unnamed: 0,name,from_year,to_year,pos,height,weight,birth_date,college
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240,"June 24, 1968",Duke
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235,"April 7, 1946",Iowa State
2,Kareem Abdul-Jabbar*,1970,1989,C,7-2,225,"April 16, 1947",UCLA
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162,"March 9, 1969",LSU
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223,"November 3, 1974","Michigan, San Jose State"
...,...,...,...,...,...,...,...,...
5103,Ante Žižić,2018,2020,F-C,6-10,266,"January 4, 1997",
5104,Jim Zoet,1983,1983,C,7-1,240,"December 20, 1953",Kent State University
5105,Bill Zopf,1971,1971,G,6-1,170,"June 7, 1948",Duquesne
5106,Ivica Zubac,2017,2023,C,7-0,240,"March 18, 1997",


Save the raw data dataframe as a csv-file:

In [17]:
current_date = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')  
filename = f"csv/basketball_reference_com/original_player_info_{current_date}.csv"
data.to_csv(filename, index=False)

Convert the birth_date column to the timestamp type ([see the prepare_df function](utils/data_collection.py)):

In [7]:
df_player_info = prepare_df(data)
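
The scraped birth_date values are strings such as "June 24, 1968". The following is a minimal standard-library sketch of the conversion, assuming the format `%B %d, %Y` and an English locale. It is not the body of `prepare_df`, which may do this differently (for example with pandas); `parse_birth_date` is a hypothetical helper:

```python
from datetime import datetime

BIRTH_DATE_FORMAT = "%B %d, %Y"  # assumed format, e.g. "June 24, 1968"


def parse_birth_date(text):
    """Parse a scraped birth date string; return None for missing values."""
    if not text:
        return None
    return datetime.strptime(text.strip(), BIRTH_DATE_FORMAT)


parsed = [parse_birth_date(s) for s in ["June 24, 1968", "April 7, 1946", ""]]
iso_dates = [d.date().isoformat() if d else None for d in parsed]
print(iso_dates)  # ['1968-06-24', '1946-04-07', None]
```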

In [8]:
df_player_info

Unnamed: 0,name,from_year,to_year,pos,height,weight,birth_date,college
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240,1968-06-24,Duke
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235,1946-04-07,Iowa State
2,Kareem Abdul-Jabbar*,1970,1989,C,7-2,225,1947-04-16,UCLA
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162,1969-03-09,LSU
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223,1974-11-03,"Michigan, San Jose State"
...,...,...,...,...,...,...,...,...
5103,Ante Žižić,2018,2020,F-C,6-10,266,1997-01-04,
5104,Jim Zoet,1983,1983,C,7-1,240,1953-12-20,Kent State University
5105,Bill Zopf,1971,1971,G,6-1,170,1948-06-07,Duquesne
5106,Ivica Zubac,2017,2023,C,7-0,240,1997-03-18,


Save the df_player_info dataframe as a csv-file:

In [18]:
current_date = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')  
filename = f"csv/basketball_reference_com/df_player_info_{current_date}.csv"
df_player_info.to_csv(filename, index=False)