In [1]:
# Start JupyterLab from the command line with the following parameter:
# jupyter-lab --ServerApp.iopub_msg_rate_limit=1.0e10
# Raising this limit lets JupyterLab render the ydata-profiling output.
# https://stackoverflow.com/questions/43288550/iopub-data-rate-exceeded-in-jupyter-notebook-when-viewing-image

# 03. Data Exploration of the NBA SQLite database

Use DBeaver to generate an ER diagram for the NBA SQLite database:

<figure>
    <img src="figures/3.1 ER diagram.png" alt="Fig. 3.1. The NBA database ER diagram.">
    <figcaption style="text-align:center;">Fig. 3.1. The NBA database ER diagram.</figcaption>
</figure>

The ER diagram shows that the database has:
- 16 tables;
- between 4 and 55 columns per table;
- columns declared as TEXT, REAL, INTEGER, or TIMESTAMP (TIMESTAMP is not a native SQLite type; the values are stored as TEXT);
- no views;
- no direct relationships between the tables.
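
The points above can also be checked from Python instead of being read off the diagram. The following is a minimal sketch, not the notebook's own code: `summarize_schema` is a hypothetical helper, the in-memory database with two toy tables stands in for the real file, and "direct relationships" is interpreted here as declared foreign keys, which is an assumption.

```python
import sqlite3


def summarize_schema(conn: sqlite3.Connection) -> dict:
    """Count tables, views, and declared foreign keys in a SQLite database."""
    cur = conn.cursor()
    # sqlite_master lists every schema object; skip SQLite's internal tables.
    cur.execute(
        "SELECT name, type FROM sqlite_master "
        "WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%'"
    )
    objects = cur.fetchall()
    tables = [name for name, kind in objects if kind == "table"]
    views = [name for name, kind in objects if kind == "view"]

    # PRAGMA foreign_key_list returns one row per declared foreign key column.
    foreign_keys = 0
    for table in tables:
        # The names come from sqlite_master itself, so quoting them is enough here.
        foreign_keys += len(cur.execute(f'PRAGMA foreign_key_list("{table}")').fetchall())
    cur.close()
    return {"tables": len(tables), "views": len(views), "foreign_keys": foreign_keys}


# Toy stand-in for the real database (hypothetical, for illustration only).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE team (id TEXT, full_name TEXT)")
demo.execute("CREATE TABLE game (game_id TEXT, game_date TIMESTAMP)")
demo.execute("INSERT INTO game VALUES ('0001', '2023-07-13 00:00:00')")

schema_summary = summarize_schema(demo)
# typeof() reports the storage class SQLite actually used for the value.
date_storage_class = demo.execute("SELECT typeof(game_date) FROM game").fetchone()[0]
print(schema_summary, date_storage_class)
demo.close()
```

In the toy database, a date string written to a column declared as TIMESTAMP is stored with the `text` storage class, which matches the note about TIMESTAMP in the list.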

Import libraries and auxiliary functions, configure pandas, and set up the logger:

In [2]:
import sqlite3

from utils.data_exploration_p1 import (
    get_db_info,
    plot_mpl_bars,
    print_db_info,
    create_reports
)



Establish a connection to our database and create a cursor.

In [3]:
conn = sqlite3.connect("/Users/lex/Sync/AI/DB/NBA/nba.sqlite")
cur = conn.cursor()
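
Because this notebook only reads from the database, the connection could also be opened in read-only mode. `sqlite3.connect` with a plain path silently creates an empty file when the path is wrong, while a read-only URI fails instead. The sketch below is an optional variant under that assumption; `connect_read_only` is a hypothetical helper, and the temporary file stands in for the real database.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def connect_read_only(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file read-only; fail if the file does not exist."""
    # Path.as_uri() builds a file: URI; mode=ro forbids writes and file creation.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small throwaway database to connect to.
    demo_path = os.path.join(tmp_dir, "demo.sqlite")
    writer = sqlite3.connect(demo_path)
    writer.execute("CREATE TABLE team (id TEXT)")
    writer.commit()
    writer.close()

    reader = connect_read_only(demo_path)
    try:
        reader.execute("INSERT INTO team VALUES ('1')")
        write_blocked = False
    except sqlite3.OperationalError:
        # SQLite refuses to write through a read-only connection.
        write_blocked = True
    finally:
        reader.close()

print(write_blocked)
```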

Get db_info, a list of dictionaries describing every table in the database, and num_rows, the number of rows in each table.

In [4]:
db_info, num_rows = get_db_info(cur)

2023-07-13 22:04:31,995 | utils.data_exploration | INFO | The list of dictionaries db_info has been created.
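
The real get_db_info lives in utils/data_exploration_p1.py and is not reproduced here. As a rough idea of how such a helper could collect this information, here is a minimal sketch under stated assumptions: `sketch_db_info` is a hypothetical name, it returns plain Python lists rather than a pandas DataFrame, and the dictionary keys are made up for illustration.

```python
import sqlite3


def sketch_db_info(cur: sqlite3.Cursor):
    """Collect column metadata and row counts for every user table."""
    cur.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
    table_names = [row[0] for row in cur.fetchall()]

    db_info, row_counts = [], []
    for table in table_names:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        columns = cur.execute(f'PRAGMA table_info("{table}")').fetchall()
        db_info.append({"table_name": table, "columns": columns})
        count = cur.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        row_counts.append((table, count))
    return db_info, row_counts


# Toy in-memory database (hypothetical) instead of the real NBA file.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE team (id TEXT, full_name TEXT)")
demo.executemany("INSERT INTO team VALUES (?, ?)", [("1", "A"), ("2", "B")])
demo.execute("CREATE TABLE team_info_common (team_id TEXT)")

demo_cur = demo.cursor()
demo_info, demo_counts = sketch_db_info(demo_cur)
print(demo_counts)
demo_cur.close()
demo.close()
```

The six fields returned by `PRAGMA table_info` correspond to the ID, Column Name, Type, NOTNULL, DFLT_VALUE, and PK columns that print_db_info displays further down.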


num_rows is a DataFrame with two columns, "Table Name" and "A num of rows". Let's print and visualize it ([see Python code](utils/data_exploration_p1.py)):

In [5]:
print(num_rows)

             Table Name  A num of rows
0                  game          65698
1          game_summary          58110
2           other_stats          28271
3             officials          70971
4      inactive_players         110191
5             game_info          58053
6            line_score          58053
7          play_by_play       13592899
8                player           4815
9                  team             30
10   common_player_info           3632
11         team_details             27
12         team_history             50
13  draft_combine_stats           1633
14        draft_history           8257
15     team_info_common              0


In [6]:
plot_mpl_bars(num_rows)

<figure>
    <img src="figures/3.2 A num of rows in tables.png" alt="Fig. 3.2. A num of rows in tables.">
    <figcaption style="text-align:center;">Fig. 3.2. A num of rows in tables.</figcaption>
</figure>

As the chart shows, the play_by_play table has far more rows than any other table: more than 13 million. Running pandas DataFrame.describe() or ydata-profiling on the full table can take several hours on an average computer.

Print each table's name and the main data about its columns ([see Python code](utils/data_exploration_p1.py)):

In [7]:
print_db_info(db_info)

Table name: game
Table columns:
                Column Name       Type  NOTNULL DFLT_VALUE   PK
ID                                                            
0                season_id       TEXT        0        None   0
1             team_id_home       TEXT        0        None   0
2   team_abbreviation_home       TEXT        0        None   0
3           team_name_home       TEXT        0        None   0
4                  game_id       TEXT        0        None   0
5                game_date  TIMESTAMP        0        None   0
6             matchup_home       TEXT        0        None   0
7                  wl_home       TEXT        0        None   0
8                      min    INTEGER        0        None   0
9                 fgm_home       REAL        0        None   0
10                fga_home       REAL        0        None   0
11             fg_pct_home       REAL        0        None   0
12               fg3m_home       REAL        0        None   0
13               fg3a_

Generate a report for each table using the ydata-profiling library. We pass a list of "excluded" tables to the create_reports function, which can be found [here](utils/data_exploration_p1.py). These are "big" tables such as play_by_play. Instead of profiling them in full, the function randomly selects 100,000 rows (by default) from each of them for initial exploration. The reports are saved in the reports directory and also rendered as widgets in the JupyterLab notebook. The saved reports are more comfortable to explore in a browser.

In [8]:
create_reports(db_info, conn, excluded_tables=["play_by_play"])

2023-07-13 22:04:33,006 | utils.data_exploration | INFO | Generating the profile report for table game...


2023-07-13 22:04:56,892 | utils.data_exploration | INFO | Generating the profile report for table game_summary...


2023-07-13 22:05:06,695 | utils.data_exploration | INFO | Generating the profile report for table other_stats...


2023-07-13 22:05:18,213 | utils.data_exploration | INFO | Generating the profile report for table officials...


2023-07-13 22:05:23,612 | utils.data_exploration | INFO | Generating the profile report for table inactive_players...


2023-07-13 22:05:32,602 | utils.data_exploration | INFO | Generating the profile report for table game_info...


2023-07-13 22:05:36,392 | utils.data_exploration | INFO | Generating the profile report for table line_score...


2023-07-13 22:06:03,104 | utils.data_exploration | INFO | Generating the profile report for table play_by_play...


2023-07-13 22:06:31,069 | utils.data_exploration | INFO | Generating the profile report for table player...


2023-07-13 22:06:35,586 | utils.data_exploration | INFO | Generating the profile report for table team...


2023-07-13 22:06:40,480 | utils.data_exploration | INFO | Generating the profile report for table common_player_info...


2023-07-13 22:07:05,455 | utils.data_exploration | INFO | Generating the profile report for table team_details...


2023-07-13 22:07:16,657 | utils.data_exploration | INFO | Generating the profile report for table team_history...


2023-07-13 22:07:20,186 | utils.data_exploration | INFO | Generating the profile report for table draft_combine_stats...


2023-07-13 22:07:48,733 | utils.data_exploration | INFO | Generating the profile report for table draft_history...


2023-07-13 22:07:58,175 | utils.data_exploration | INFO | The profile reports have been generated.
The following tables have been excluded (several million lines take a few hours to create the report):
['play_by_play']
The empty dataframes for tables:
['team_info_common']
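
The output above reflects two special cases: large tables are sampled at random, and empty tables produce no report. The sketch below shows one way such logic could be written in plain SQLite. It is an assumption and not the code of create_reports: `fetch_rows`, the toy tables, and the sample size of 100 are hypothetical.

```python
import sqlite3


def fetch_rows(conn, table, max_rows=None):
    """Return all rows of a table, or a random sample of at most max_rows rows."""
    if max_rows is None:
        return conn.execute(f'SELECT * FROM "{table}"').fetchall()
    # Shuffle only the rowids, keep max_rows of them, then fetch those rows.
    query = (
        f'SELECT * FROM "{table}" WHERE rowid IN '
        f'(SELECT rowid FROM "{table}" ORDER BY RANDOM() LIMIT ?)'
    )
    return conn.execute(query, (max_rows,)).fetchall()


# Toy in-memory data: one "big" table and one empty table (hypothetical).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE play_by_play (event_num INTEGER)")
demo.executemany("INSERT INTO play_by_play VALUES (?)", [(i,) for i in range(1000)])
demo.execute("CREATE TABLE team_info_common (team_id TEXT)")

sampled_tables = ["play_by_play"]
sample, empty_tables = [], []
for table in ("play_by_play", "team_info_common"):
    rows = fetch_rows(demo, table, max_rows=100 if table in sampled_tables else None)
    if not rows:
        empty_tables.append(table)  # nothing to profile for this table
    elif table in sampled_tables:
        sample = rows
demo.close()

print(len(sample), empty_tables)
```

Note that `ORDER BY RANDOM()` still has to visit every rowid, so on a table with millions of rows the sampling query itself takes noticeable time, although far less than profiling the full table.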


Close the cursor and the connection:

In [9]:
cur.close()
conn.close()
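
If an earlier cell raises an exception, the explicit close calls above may never run. One common alternative is `contextlib.closing`, which closes both objects when the block exits. This is a minimal sketch with an in-memory database and hypothetical variable names; note that using a sqlite3 connection directly in a `with` statement only manages transactions and does not close it.

```python
import sqlite3
from contextlib import closing

# closing() calls .close() on exit, even if the body raises an exception.
with closing(sqlite3.connect(":memory:")) as demo_conn, closing(demo_conn.cursor()) as demo_cur:
    demo_cur.execute("SELECT 1")
    value = demo_cur.fetchone()[0]

# After the block the connection is closed, so further use raises an error.
try:
    demo_conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(value, still_open)
```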