# Assignment #7 - Data Gathering and Warehousing - DSSA-5102

Instructor: Melissa Laurino<br>
Spring 2025<br>

Name: Joe D'Agostino
<br>
Date: 3/27/25
<br>
<br>
**At this time in the semester:** <br>
- We have explored a dataset. <br>
- We have cleaned our dataset. <br>
- We created a GitHub account with a repository for this class and included a metadata README file about our data. <br>
- We introduced general SQL syntax, queries, and applications in Python.<br>
- We created our own databases from scratch using MySQL Workbench and Python with SQLAlchemy/MySQL Connector, both on our local MySQL server and locally on our machine.
<br>

Now we will create and populate **all** tables for our dataset in our database and finalize our EER diagram.<br>

We created a database three different ways in our previous assignment: one database on our local MySQL server, one test database stored locally that integrates with MySQL, and one test database stored only locally as a .db file on your machine. Now we will create all tables and populate them with the data from your dataset. (Feel free to practice with all methods, but it is encouraged to use the first method, which will allow you to create your schema diagram.) After populating your database, create a visual database schema diagram in MySQL Workbench. <br>
<br>
Be sure to comment all code. Include a .png image of your database schema from MySQL Workbench in your Blackboard submission or GitHub repository.

In [1]:
# Load necessary packages:
from sqlalchemy import create_engine, Column, String, Integer, Boolean, BigInteger, Float, text # Database navigation
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy import insert
import mysql.connector
import pandas as pd # Python data manipulation

In [2]:
#define mysql connection variables
conn = mysql.connector.connect(
        host="localhost", # This is my local instance number when you open MySQL Workbench.
        user="root", # This is my username for MySQL Workbench
        password="karateChop") # We wrote this password down in our first class!

# Create a cursor object using the cursor() method
cursor = conn.cursor()

# CREATE DATABASE if it does not already exist for assignment 7
cursor.execute("CREATE DATABASE IF NOT EXISTS usa_olympic_athletes_7_2")

In [3]:
# Time to connect to the database using SQL Alchemy:
DATABASE_URL = "mysql+mysqlconnector://root:karateChop@localhost/usa_olympic_athletes_7_2" # Use MySQL Connector to connect to the database
engine = create_engine(DATABASE_URL) # Creates a connection to the MySQL database

print("Connected to MySQL database successfully!")
# I'm really not sure why I picked karateChop as a password, must've been something I saw that day. Glad I wrote it down!

Connected to MySQL database successfully!
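
The connection cells above write the password directly into the notebook, which is then easy to push to GitHub by accident. A minimal sketch of one alternative, assuming the credentials are kept in environment variables: the variable names `MYSQL_USER`, `MYSQL_PASSWORD`, and `MYSQL_HOST` and the helper `build_database_url` are hypothetical and not part of the assignment.

```python
import os
from urllib.parse import quote_plus


def build_database_url(database: str) -> str:
    """Assemble a SQLAlchemy URL for MySQL Connector from environment variables."""
    # The environment variable names below are assumptions made for this sketch.
    user = os.environ.get("MYSQL_USER", "root")
    password = os.environ.get("MYSQL_PASSWORD", "")
    host = os.environ.get("MYSQL_HOST", "localhost")
    # quote_plus escapes characters such as '@' or '/' that would otherwise
    # be read as part of the URL structure.
    return (
        f"mysql+mysqlconnector://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )
```

The returned string could then be passed to `create_engine` in place of the hard-coded `DATABASE_URL`.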


In [4]:
# Read the .csv file (using pandas) that we will use to populate our database.
usa_olympic_athletes_df = pd.read_csv('data/usa_olympic_data.csv') # load the USA Olympic Athlete data

In [5]:
# Preview the dataframe by looking at the first five rows.
usa_olympic_athletes_df.head()

Unnamed: 0,id,name,sex,age,height,weight,team,games,year,season,city,sport,event,medal
0,6,Per Knut Aaland,M,31.0,188.0,75.0,United States,1992 Winter,1992,Winter,Albertville,Cross Country Skiing,Cross Country Skiing Men's 10 kilometres,
1,6,Per Knut Aaland,M,31.0,188.0,75.0,United States,1992 Winter,1992,Winter,Albertville,Cross Country Skiing,Cross Country Skiing Men's 50 kilometres,
2,6,Per Knut Aaland,M,31.0,188.0,75.0,United States,1992 Winter,1992,Winter,Albertville,Cross Country Skiing,Cross Country Skiing Men's 10/15 kilometres Pu...,
3,6,Per Knut Aaland,M,31.0,188.0,75.0,United States,1992 Winter,1992,Winter,Albertville,Cross Country Skiing,Cross Country Skiing Men's 4 x 10 kilometres R...,
4,6,Per Knut Aaland,M,33.0,188.0,75.0,United States,1994 Winter,1994,Winter,Lillehammer,Cross Country Skiing,Cross Country Skiing Men's 10 kilometres,


In [6]:
# Check for NaN values in the dataframe by getting the sum of NaN by column
print(usa_olympic_athletes_df.isna().sum())

id            0
name          0
sex           0
age         285
height     3790
weight     4318
team          0
games         0
year          0
season        0
city          0
sport         0
event         0
medal     12967
dtype: int64


In [6]:
# I have some NaN values in my dataset, converting them to 0's for the database values using a cleaner approach than from Assignment 6
usa_olympic_athletes_df['age'] = usa_olympic_athletes_df['age'].fillna(0)
usa_olympic_athletes_df['height'] = usa_olympic_athletes_df['height'].fillna(0)
usa_olympic_athletes_df['weight'] = usa_olympic_athletes_df['weight'].fillna(0)
usa_olympic_athletes_df['medal'] = usa_olympic_athletes_df['medal'].fillna(0)

In [7]:
# Let's check for NaN values in the dataframe by getting the sum of NaN by column again and see if we got them all
print(usa_olympic_athletes_df.isna().sum())

id        0
name      0
sex       0
age       0
height    0
weight    0
team      0
games     0
year      0
season    0
city      0
sport     0
event     0
medal     0
dtype: int64
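
Filling with 0 clears every NaN, but it also makes a missing age, height, or weight look like a real measurement of zero, and it places the integer 0 in the otherwise text-valued medal column. A minimal sketch of one alternative, assuming the goal is to store SQL NULL instead: convert each NaN to `None` just before binding it to a query. The helper `nan_to_none` and the sample row are made up for illustration.

```python
import math


def nan_to_none(value):
    """Return None for a float NaN so the driver binds SQL NULL; pass other values through."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


# A made-up row with two missing values, standing in for one dataframe row.
example_row = {"name": "Example Athlete", "age": 31.0,
               "height": float("nan"), "medal": float("nan")}
cleaned_row = {column: nan_to_none(value) for column, value in example_row.items()}
print(cleaned_row)
```

With this approach the NaN counts in the dataframe stay as they are, and the missing values show up as NULL in the database rather than as 0.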


In [10]:
# let's create our sport table
create_sport_table_query = """CREATE TABLE IF NOT EXISTS sport (
                                sport_id INT AUTO_INCREMENT PRIMARY KEY,
                                sport_name VARCHAR(255)
                           );     
                           """
# Execute the create_sport_table_query
with engine.connect() as connection:
    connection.execute(text(create_sport_table_query))

print("create_sport_table_query table created successfully!") # output this string if the query worked

create_sport_table_query table created successfully!


In [11]:
# let's create our event table
create_event_table_query = """CREATE TABLE IF NOT EXISTS event (
                                event_id INT AUTO_INCREMENT PRIMARY KEY,
                                event_name VARCHAR(255)
                           );     
                           """
# Execute the create_event_table_query
with engine.connect() as connection:
    connection.execute(text(create_event_table_query))

print("create_event_table_query table created successfully!") # output this string if the query worked

create_event_table_query table created successfully!


In [12]:
# let's create our medal table
create_medal_table_query = """CREATE TABLE IF NOT EXISTS medal (
                                medal_id INT AUTO_INCREMENT PRIMARY KEY,
                                medal_name VARCHAR(25)
                           );     
                           """
# Execute the create_medal_table_query
with engine.connect() as connection:
    connection.execute(text(create_medal_table_query))

print("create_medal_table_query table created successfully!") # output this string if the query worked

create_medal_table_query table created successfully!


In [13]:
# We want our table column names to match what is in the .csv file
create_athlete_table_query = """CREATE TABLE IF NOT EXISTS athlete (
                            athlete_id INT AUTO_INCREMENT PRIMARY KEY,
                            name VARCHAR(100),
                            sex CHAR(1),
                            age FLOAT(4,1),
                            height FLOAT(4,1),
                            weight FLOAT(4,1),
                            team VARCHAR(255),
                            year INT(4),
                            season VARCHAR(10),
                            city VARCHAR(100),
                            sport_id INT,
                            event_id INT,
                            medal_id INT,

                            FOREIGN KEY (sport_id) REFERENCES sport(sport_id),
                            FOREIGN KEY (event_id) REFERENCES event(event_id),
                            FOREIGN KEY (medal_id) REFERENCES medal(medal_id)
                    );"""
# Execute the create_athlete_table_query
with engine.connect() as connection:
    connection.execute(text(create_athlete_table_query))

print("create_athlete_table_query table created successfully!") # output this string if the query worked

create_athlete_table_query table created successfully!


In [8]:
# Let's get and assign the unique sport column values from the dataset
unique_sports = tuple(usa_olympic_athletes_df['sport'].unique())

In [9]:
# Let's get and assign the unique event column values from the dataset
unique_events = tuple(usa_olympic_athletes_df['event'].unique())

In [10]:
# Let's get and assign the unique medal column values from the dataset
unique_medals = tuple(usa_olympic_athletes_df['medal'].unique())
unique_medals

(0, 'Silver', 'Bronze', 'Gold')

In [37]:
sport_count = 0

# INSERT our unique_sports into the sports table
cursor.execute("USE usa_olympic_athletes_7_2;")  # specify the usa_olympic_athletes_7_2 database
for sport in unique_sports:
    # INSERT into sport table if it doesn't exist
    cursor.execute("""INSERT IGNORE INTO sport (sport_name)
                      VALUES (%s)
                   """, (sport,))  # Pass from the unique_sports as a tuple
    sport_count += 1

# Commit the transaction
conn.commit()

print(f"sport table is populated - count {sport_count}")

sport table is populated - count 58
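
One caveat about the cell above: `INSERT IGNORE` only skips a row when the insert would violate a unique key, and `sport_name` was created without one, so re-running the cell would add all 58 sports a second time. The sketch below shows the idea with Python's built-in `sqlite3` as a stand-in so it runs without a server; SQLite spells the statement `INSERT OR IGNORE`, while in MySQL the equivalent would be a `UNIQUE` constraint on `sport_name` combined with `INSERT IGNORE`. The two sport names are sample values.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the MySQL server.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE sport ("
    "sport_id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "sport_name TEXT UNIQUE)"  # the UNIQUE constraint is what makes IGNORE skip duplicates
)

# Run the same insert twice to imitate re-running the notebook cell.
for _ in range(2):
    demo_conn.executemany(
        "INSERT OR IGNORE INTO sport (sport_name) VALUES (?)",
        [("Swimming",), ("Fencing",)],
    )
demo_conn.commit()

sport_rows = demo_conn.execute(
    "SELECT sport_name FROM sport ORDER BY sport_id"
).fetchall()
print(sport_rows)
```

Each sport appears once even though the insert ran twice.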


In [39]:
# INSERT our unique_events into the event table
cursor.execute("USE usa_olympic_athletes_7_2;")  # specify the usa_olympic_athletes_7_2 database
for event in unique_events:
    # INSERT each unique event name into the event table
    cursor.execute("""INSERT INTO event (event_name)
                      VALUES (%s)
                   """, (event,))  # Pass from the unique_events as a tuple

# Commit the transaction
conn.commit()

In [42]:
# INSERT our unique_medals into the medal table
cursor.execute("USE usa_olympic_athletes_7_2;")  # specify the usa_olympic_athletes_7_2 database
for medal in unique_medals:
    # INSERT each unique medal value into the medal table
    cursor.execute("""INSERT INTO medal (medal_name)
                      VALUES (%s)
                   """, (medal,))  # Pass from the unique_medals as a tuple

# Commit the transaction
conn.commit()

In [14]:
cursor.execute("USE usa_olympic_athletes_7_2;")  # specify the usa_olympic_athletes_7_2 database

# Loop through each row in the usa_olympic_athletes_df dataframe
for _, row in usa_olympic_athletes_df.iterrows():

    # get the sport_id from the sport table if it matches the sport value in the usa_olympic_athletes_df dataframe
    current_sport = row['sport'] # set the current sport to current_sport
    # select the sport_id of the current_sport match in the sport table
    cursor.execute("""SELECT sport_id
                      FROM sport
                      WHERE sport_name = (%s)
                   """, [current_sport])  # select the sport_id from the sport table where the sport_name is equal to current_sport
    sport_id = cursor.fetchone() # output the sport sport_id to the sport_id variable
    cursor.fetchall()  # fetch all remaining results to clear the result set

    # get the event_id from the event table if it matches the event value in the usa_olympic_athletes_df dataframe
    current_event = row['event'] # set the current event to current_event
    # select the event_id of the current_event match in the event table
    cursor.execute("""SELECT event_id
                      FROM event
                      WHERE event_name = (%s)
                   """, [current_event]) #select the event_id from the event table where the event_name is equal to current_event
    event_id = cursor.fetchone() # output the event event_id to the event_id variable
    cursor.fetchall()  # fetch all remaining results to clear the result set

    # get the medal_id from the medal table if it matches the medal value in the usa_olympic_athletes_df dataframe
    current_medal = row['medal'] # set the current medal to current_medal
    # select the medal_id of the current_medal match in the medal table
    cursor.execute("""SELECT medal_id
                      FROM medal
                      WHERE medal_name = (%s)
                   """, [current_medal]) #select the medal_id from the medal table where the medal_name is equal to current_medal
    medal_id = cursor.fetchone() # output the medal medal_id to the medal_id variable
    cursor.fetchall()  # fetch all remaining results to clear the result set

    # insert the current row's values along with the sport_id, event_id, and medal_id looked up above
    cursor.execute("""INSERT INTO athlete (
                                name, sex, age, height, weight, team, year, season, city, sport_id, event_id, medal_id
                           ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s) 
                           """, (
                                row['name'],
                                row['sex'],
                                row['age'],
                                row['height'],
                                row['weight'],
                                row['team'],
                                row['year'],
                                row['season'],
                                row['city'],
                                sport_id[0],
                                event_id[0],
                                medal_id[0]
            ))
    conn.commit()

    # print(f"sport_id: {sport_id[0]}, event_id: {event_id[0]}, medal_id: {medal_id[0]}") # commented out testing the values before committing to db
    

Success!!!
===========
I'm not sure if this was the most efficient way to insert into the database with all the selects in the for loop, but I'm running on my laptop and not paying for a cloud service so I'll take the victory for now!
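
On the efficiency question, one common alternative is to read each lookup table once into a Python dictionary, resolve the ids in memory, and send all rows in a single batch with one commit. The sketch below is a hedged illustration rather than the notebook's actual method: it uses the built-in `sqlite3` as a stand-in for MySQL, only a `sport` lookup, and made-up athlete rows. With `mysql.connector` the placeholders would be `%s` instead of `?`.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the MySQL server.
demo_db = sqlite3.connect(":memory:")
demo_db.executescript("""
    CREATE TABLE sport (sport_id INTEGER PRIMARY KEY, sport_name TEXT UNIQUE);
    CREATE TABLE athlete (
        athlete_id INTEGER PRIMARY KEY,
        name TEXT,
        sport_id INTEGER REFERENCES sport(sport_id)
    );
""")

# Made-up (name, sport) pairs standing in for the dataframe rows.
source_rows = [("Athlete A", "Swimming"), ("Athlete B", "Fencing"), ("Athlete C", "Swimming")]

# Populate the lookup table once; duplicates are skipped by the UNIQUE constraint.
demo_db.executemany(
    "INSERT OR IGNORE INTO sport (sport_name) VALUES (?)",
    [(sport,) for _, sport in source_rows],
)

# One SELECT builds the name -> id mapping, replacing a SELECT per dataframe row.
sport_ids = {
    sport_name: sport_id
    for sport_id, sport_name in demo_db.execute("SELECT sport_id, sport_name FROM sport")
}

# Resolve ids in memory, insert every row in one batch, and commit once at the end.
demo_db.executemany(
    "INSERT INTO athlete (name, sport_id) VALUES (?, ?)",
    [(name, sport_ids[sport]) for name, sport in source_rows],
)
demo_db.commit()

loaded_rows = demo_db.execute(
    "SELECT a.name, s.sport_name FROM athlete a "
    "JOIN sport s ON s.sport_id = a.sport_id ORDER BY a.athlete_id"
).fetchall()
print(loaded_rows)
```

The same pattern would extend to the `event` and `medal` lookups by building one dictionary per table.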

In [None]:
#Close the database connection :)
cursor.close()
conn.close()

**MySQL Workbench**<br>
To export your database schema as a .PNG:<br>
->Go to your EER Diagram<br>
->File<br>
->Export<br>
->Export as .PNG

<img src="https://raw.githubusercontent.com/joedag32/DSSA-5102_Spring2025/refs/heads/main/Assignments/images/assignment7_schema.png">