## **Machine Learning - WNBA Playoffs Prediction**
This notebook focuses on data preparation. We store the data in SQLite because it is a lightweight, file-based database and because the dataset follows a relational schema.

https://docs.python.org/3/library/sqlite3.html

We import `sqlite3` and connect to the database file.

### **Imports**

In [1]:
import pandas as pd
import sqlite3
from prep_utils import (
    check_missing_values,
    parse_columns_type,
    calculate_summary_statistics,
    get_table_attributes,
    get_db_tables,
)

### **Database Connection Setup**

In [2]:
# sqlite3 is already imported above; open the database file and create a cursor
db = sqlite3.connect("db/ac.db")
db_cur = db.cursor()
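The source of `prep_utils` is not shown in this notebook, so the following is only a minimal sketch of how a helper like `get_db_tables` could list the tables through a cursor. The function name `list_tables` and the in-memory demo database are assumptions made for illustration; they are not the project's actual helper or data.

```python
import sqlite3


def list_tables(cursor):
    """Return the names of all user tables, sorted alphabetically."""
    rows = cursor.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Demo on a throwaway in-memory database (not the project's db/ac.db).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE coaches (coachID TEXT, year INTEGER)")
demo.execute("CREATE TABLE players (bioID TEXT, pos TEXT)")
demo_tables = list_tables(demo.cursor())
print(demo_tables)
demo.close()
```

Querying `sqlite_master` keeps the lookup inside SQLite itself, and the `NOT LIKE 'sqlite_%'` filter skips SQLite's internal bookkeeping tables.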

### **Descriptive Statistics**

Descriptive statistics (count, mean, standard deviation, minimum, and maximum) summarize each numeric column and help us spot outliers or implausible values before modeling.

In [3]:
tables = get_db_tables(db_cur)

for table in tables:
    print(f"\033[1m{table}\033[0m")
    num, non_num = parse_columns_type(db_cur, table)
    calculate_summary_statistics(db_cur, table, num)
    print('\n')


awards_players
+-------------+---------+---------+-----------------+-------+-------+
| Attribute   |   Count |    Mean |   Std Deviation |   Min |   Max |
| year        |      95 | 5.78947 |            7.55 |     1 |    10 |
+-------------+---------+---------+-----------------+-------+-------+


coaches
+-------------+---------+-----------+-----------------+-------+-------+
| Attribute   |   Count |      Mean |   Std Deviation |   Min |   Max |
| year        |     162 |  5.31481  |            8.39 |     1 |    10 |
+-------------+---------+-----------+-----------------+-------+-------+
| stint       |     162 |  0.364198 |            0.48 |     0 |     2 |
+-------------+---------+-----------+-----------------+-------+-------+
| won         |     162 | 14.6728   |           41    |     0 |    28 |
+-------------+---------+-----------+-----------------+-------+-------+
| lost        |     162 | 14.6235   |           32.25 |     2 |    30 |
+-------------+---------+------
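Several values in the Std Deviation column above are larger than half of the Min to Max range (for example 7.55 for `year`, which only spans 1 to 10). That is not possible for a standard deviation at these sample sizes, so it may be worth checking whether `calculate_summary_statistics` reports the variance instead. As a cross-check, here is a minimal sketch that computes the same statistics directly in SQL. The function `summarize_column` and the demo data are hypothetical and do not come from `prep_utils`.

```python
import math
import sqlite3


def summarize_column(cursor, table, column):
    """Return count, mean, sample std deviation, min and max of a numeric column."""
    # Table and column names cannot be bound as SQL parameters, so they are
    # interpolated here; only pass trusted identifiers.
    count, mean, mean_sq, low, high = cursor.execute(
        f'SELECT COUNT("{column}"), AVG("{column}"), '
        f'AVG(1.0 * "{column}" * "{column}"), MIN("{column}"), MAX("{column}") '
        f'FROM "{table}"'
    ).fetchone()
    if count < 2:
        std = None  # the sample standard deviation needs at least two values
    else:
        # Population variance is E[x^2] - E[x]^2; rescale by n / (n - 1)
        # to obtain the sample variance, then take the square root.
        population_var = max(mean_sq - mean * mean, 0.0)
        std = math.sqrt(population_var * count / (count - 1))
    return {"count": count, "mean": mean, "std": std, "min": low, "max": high}


# Demo with made-up values in an in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE coaches (won INTEGER)")
demo.executemany("INSERT INTO coaches VALUES (?)", [(1,), (2,), (3,), (4,)])
stats = summarize_column(demo.cursor(), "coaches", "won")
print(stats)
demo.close()
```

For the demo values 1 to 4 the standard deviation is about 1.29, comfortably below half of the range, which is the sanity check the table above fails.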

### **Checking for missing values (N/A)**

It's normal for large datasets to have missing values, which need to be handled early on.

In [4]:
for table in tables:
    print(f"\033[1m{table} - Missing Values:\033[0m")
    check_missing_values(db_cur, table)
    print('\n')


awards_players - Missing Values:
Column 'playerID' has missing values: False
Column 'award' has missing values: False
Column 'year' has missing values: False
Column 'lgID' has missing values: False


coaches - Missing Values:
Column 'coachID' has missing values: False
Column 'year' has missing values: False
Column 'tmID' has missing values: False
Column 'lgID' has missing values: False
Column 'stint' has missing values: False
Column 'won' has missing values: False
Column 'lost' has missing values: False
Column 'post_wins' has missing values: False
Column 'post_losses' has missing values: False


players - Missing Values:
Column 'bioID' has missing values: False
Column 'pos' has missing values: True
Column 'firstseason' has missing values: False
Column 'lastseason' has missing values: False
Column 'height' has missing values: False
Column 'weight' has missing values: False
Column 'college' has missing values: True
Column 'collegeOther' has missing values: True
Co
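The True/False flags above say whether a column has missing values but not how many. A per-column count can help decide between dropping and imputing. The sketch below counts both `NULL` values and empty strings for every column of a table. How `check_missing_values` defines "missing" is not shown here, so the function `count_missing` and its definition are assumptions, and the demo rows are made up.

```python
import sqlite3


def count_missing(cursor, table):
    """Map each column of `table` to its number of NULL or empty-string values."""
    # PRAGMA table_info returns one row per column; index 1 holds the column name.
    columns = [row[1] for row in cursor.execute(f'PRAGMA table_info("{table}")')]
    missing = {}
    for column in columns:
        query = (
            f'SELECT COUNT(*) FROM "{table}" '
            f'WHERE "{column}" IS NULL OR TRIM("{column}") = ?'
        )
        (n,) = cursor.execute(query, ("",)).fetchone()
        missing[column] = n
    return missing


# Demo with made-up rows in an in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE players (bioID TEXT, pos TEXT, college TEXT)")
demo.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("a01", "C", "George Washington"), ("b02", "", None), ("c03", None, "Duke")],
)
missing_counts = count_missing(demo.cursor(), "players")
print(missing_counts)
demo.close()
```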

In [5]:
#coaches = db_cur.execute("SELECT * FROM coaches")
#coaches.fetchall()

In [6]:
# Get first player from query
players = db_cur.execute("SELECT * FROM players")
players.fetchone()

('abrahta01w',
 'C',
 0,
 0,
 74.0,
 190,
 'George Washington',
 '',
 '1975-09-27',
 '0000-00-00')
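The row above contains an empty string and the date `'0000-00-00'`, which look like placeholders rather than real values. A minimal sketch of one way to normalize such values to `None` after fetching is shown below. The placeholder set is an assumption based only on this single row, and whether the `0` values should also count as missing is a judgment call that this sketch leaves alone.

```python
# Assumption: these strings stand for "no value" in the dataset.
PLACEHOLDERS = {"", "0000-00-00"}


def normalize_row(row):
    """Replace placeholder strings in a fetched row with None."""
    return tuple(
        None if isinstance(value, str) and value.strip() in PLACEHOLDERS else value
        for value in row
    )


# The row returned by players.fetchone() above.
sample = ('abrahta01w', 'C', 0, 0, 74.0, 190,
          'George Washington', '', '1975-09-27', '0000-00-00')
cleaned = normalize_row(sample)
print(cleaned)
```

Converting placeholders to `None` means they appear as `NULL` if the rows are written back to SQLite, so later missing-value checks treat them consistently.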

In [7]:
#db.close()
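The connection should be closed once the notebook is done with it. A minimal sketch of doing this automatically with `contextlib.closing` follows; it uses an in-memory database and a hypothetical table `t` rather than `db/ac.db`. Note that using a `sqlite3` connection directly in a `with` statement only commits or rolls back the transaction and does not close the connection.

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close() is called when the block exits.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO t VALUES (1)")
    row_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

print(row_count)
```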