# Assignment #6 - Data Gathering and Warehousing - DSSA-5102

Instructor: Melissa Laurino<br>
Spring 2025<br>

Name: Thinh Le<br>
Date: March 11, 2025<br>

**At this time in the semester:** <br>
- We have explored a dataset. <br>
- We have cleaned our dataset. <br>
- We created a GitHub account with a repository for this class and included a metadata README file about our data. <br>
- We introduced general SQL syntax, queries, and applications in Python.<br>

Now we will start the process of uploading our dataset into a database. There are many different ways to upload your .csv data into a database (.db file). Databases can be created in many open source applications and in MySQL Workbench, and even some websites can load your .csv data into a database...for a small fee. Instead of using an application, we are going to first create our database for our dataset from scratch in Python. On a much larger scale, data may be automatically uploaded to a database once it is acquired.<br>

## Assignment #6 Objectives

We will use the Python packages SQL Alchemy and SQLite to create three separate databases for practice. 
- Create one database on our MySQL server (10)
  - Create and populate our first table with appropriate data types
  - View the MySQL workbench schema to see the table you created
- Create one test database locally that we can still use with MySQL (3)
- Create one test database locally as a .db file. (2) <br>

Follow the instructions below to complete the assignment. For submission, please include your .ipynb file with output cells (or a link to GitHub), and the screenshot of your first database table in MySQL Workbench. Answer any questions in markdown cell boxes. Be sure to comment all code in your own words.


## Creating our database from scratch to integrate with MySQL Workbench in Python<br>

**BEFORE YOU BEGIN!**<br>
Is your MySQL Server running on your local machine?<br>
**Start the server** if it is not running already.

We need the MySQL connector to work with Python since we are using SQLAlchemy with MySQL Workbench. Let's install the MySQL driver. Run the following code in a terminal window to install the MySQL connector:

```$ pip install mysql-connector-python```
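
Before moving on, it can help to confirm that the driver is importable from the same Python environment the notebook uses. The sketch below is one possible check, not part of the assignment; the helper name `driver_available` is hypothetical.

```python
import importlib.util


def driver_available(module_name: str) -> bool:
    """Return True if the named module can be found in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names when the parent package is not installed.
        return False


# "mysql.connector" is the module installed by mysql-connector-python.
print("MySQL driver found:", driver_available("mysql.connector"))
```

If this prints `False`, the package was probably installed into a different environment than the one running the notebook kernel.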

### Creating a database from scratch in Python using SQL Alchemy<br>
Additional sources: <br>
-- https://medium.com/@sandyjtech/creating-a-database-using-python-and-sqlalchemy-422b7ba39d7e <br>
-- https://www.youtube.com/watch?v=xr7vDSFXjW0 <br>
-- https://www.geeksforgeeks.org/how-to-design-a-database-for-spotify/ (My specific inspiration for understanding a Spotify schema)

In [12]:
# Load necessary packages:
from sqlalchemy import create_engine, Column, String, Integer, Boolean, BigInteger, Float, text # Database navigation
from sqlalchemy.orm import declarative_base, sessionmaker # declarative_base lives in sqlalchemy.orm since SQLAlchemy 1.4
import mysql.connector
import sqlite3 # A second option for working with databases
import pandas as pd # Python data manipulation

Open MySQL Workbench.
- Click on Local Instance (its tile shows the host and port number, if needed)

In [13]:
# Connect to the MySQL server 
# Define our variables. We set these during our first class in our technology set up. 
# If you are unsure of these variables, do not guess. 
# Visit MySQL Workbench for the port number, host and user.
host = '127.0.0.1'
port = 3306
user = 'root'
password = '123456'

mysql_connection = mysql.connector.connect(
    # Server address
    host=host,
    # Port number
    port=port,
    # Username
    user=user,
    # Password
    password=password)

# Create a cursor - an object that can execute operations such as SQL statements
cursor = mysql_connection.cursor()

# CREATE DATABASE (SQL command) if it does not already exist
# Ref: https://dev.mysql.com/doc/refman/8.4/en/create-database.html
database_name = "thinh_db"
cursor.execute(f"CREATE DATABASE IF NOT EXISTS {database_name}")
# thinh_db will be the name of the database when it is created.

print("Database created successfully in MySQL Workbench! Go check it out.")

Database created successfully in MySQL Workbench! Go check it out.
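
The cell above writes the password directly into the notebook, which becomes visible to anyone who can see the repository. A minimal sketch of one alternative is to read the settings from environment variables. The variable names (`MYSQL_HOST`, `MYSQL_PORT`, `MYSQL_USER`, `MYSQL_PASSWORD`) and the helper `load_mysql_settings` are assumptions made for illustration, not something the assignment defines.

```python
import os


def load_mysql_settings(env=os.environ):
    """Read connection settings from environment variables (hypothetical names)."""
    return {
        "host": env.get("MYSQL_HOST", "127.0.0.1"),
        "port": int(env.get("MYSQL_PORT", "3306")),
        "user": env.get("MYSQL_USER", "root"),
        # No default for the password: fail loudly instead of guessing.
        "password": env["MYSQL_PASSWORD"],
    }


# A plain dict stands in for the real environment so the example runs anywhere.
settings = load_mysql_settings({"MYSQL_PASSWORD": "example-only"})
print(settings["host"], settings["port"], settings["user"])
```

The resulting dictionary uses the same keyword names as the connection call above, so it could be passed as `mysql.connector.connect(**settings)`.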


**STOP**

Confirm your database was created before continuing. [✅]<br>
Open MySQL Workbench.<br>
Under MySQL Connections, click Local Instance<br>
Click the Schemas tab

**You should now see a new (empty) database that you created**<br>
If it does not show up right away, hit refresh (The circular arrows)

In [14]:
# Time to connect to the database using SQL Alchemy:
# Ref:
#   https://dev.mysql.com/doc/refman/8.4/en/connecting-using-uri-or-key-value-pairs.html#connecting-using-uri
#   [scheme://][user[:[password]]@]host[:port][/schema][?attribute1=value1&attribute2=value2...]
#   https://docs.sqlalchemy.org/en/20/dialects/mysql.html#module-sqlalchemy.dialects.mysql.mysqlconnector
database_url = f"mysql+mysqlconnector://{user}:{password}@{host}:{port}/{database_name}"

# Creates a connection to the MySQL database
engine = create_engine(database_url)

print("Connected to MySQL database successfully!")

Connected to MySQL database successfully!
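
The connection URL follows the `user:password@host:port/schema` pattern, so a password containing characters such as `@` or `/` would be misread as part of the URL structure. A minimal sketch of one way to guard against that is to percent-encode the credentials before building the string. The helper name `build_mysql_url` and the sample password are hypothetical.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, database):
    """Build a MySQL connection URL with the user and password percent-encoded."""
    return (
        f"mysql+mysqlconnector://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# The '@' and '/' in this made-up password are encoded as %40 and %2F.
url = build_mysql_url("root", "p@ss/word", "127.0.0.1", 3306, "thinh_db")
print(url)
```

SQLAlchemy also offers `sqlalchemy.engine.URL.create`, which assembles the URL from separate fields and may be preferable when the package is already imported.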


In [15]:
# Read in the CLEAN .csv file (Using pandas) we will use to populate our database. This is the same dataset that you cleaned for Assignment #2!
df = pd.read_csv('laptop_prices.csv')

In [16]:
# Preview the dataframe by looking at the first five rows.
df.head()

Unnamed: 0,brand,processor,ram_gb,storage,gpu,screen_size_inch,resolution,battery_life_hours,weight_kg,operating_system,price
0,Apple,AMD Ryzen 3,64.0,512GB SSD,Nvidia GTX 1650,17.3,2560x1440,8.9,1.42,FreeDOS,3997.07
1,Razer,AMD Ryzen 7,4.0,1TB SSD,Nvidia RTX 3080,14.0,1366x768,9.4,2.57,Linux,1355.78
2,Asus,Intel i5,32.0,2TB SSD,Nvidia RTX 3060,13.3,3840x2160,8.5,1.74,FreeDOS,2673.07
3,Lenovo,Intel i5,4.0,256GB SSD,Nvidia RTX 3080,13.3,1366x768,10.5,3.1,Windows,751.17
4,Razer,Intel i3,4.0,256GB SSD,AMD Radeon RX 6600,16.0,3840x2160,5.7,3.38,Linux,2059.83


In [17]:
# What are all of the column names and data types for our dataset?
df.info()
# It is important to know the column names from the .csv because these are the field names we will want to use for our first table.
# Remember, the field names represent the column names of the csv/table.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11768 entries, 0 to 11767
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   brand               11768 non-null  object 
 1   processor           11768 non-null  object 
 2   ram_gb              11768 non-null  float64
 3   storage             11768 non-null  object 
 4   gpu                 11768 non-null  object 
 5   screen_size_inch    11768 non-null  float64
 6   resolution          11768 non-null  object 
 7   battery_life_hours  11768 non-null  float64
 8   weight_kg           11768 non-null  float64
 9   operating_system    11768 non-null  object 
 10  price               11768 non-null  float64
dtypes: float64(5), object(6)
memory usage: 1011.4+ KB


If you are an experienced Python user, you can create a base Python class for all of our tables before populating them and use built in SQLAlchemy features. <br>
To practice SQL, we will create our database from scratch using SQL commands in Python instead.

We can use a new SQL statement CREATE TABLE to create our first table in our new database by writing a query.<br>
Everyone's data is different! Choose the SQL data types that fit YOUR data needs!<br>
SQL Data Types: https://www.w3schools.com/sql/sql_datatypes.asp
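
One way to choose a `VARCHAR` length is to measure the longest value in each text column first. The sketch below does this with the standard library on a tiny sample copied from the preview above; the sample string and the helper name `max_text_lengths` are assumptions for illustration, and the real check should run over the full .csv file.

```python
import csv
import io

# Hypothetical sample: a few rows and columns copied from the preview above.
SAMPLE_CSV = """brand,processor,gpu,ram_gb
Apple,AMD Ryzen 3,Nvidia GTX 1650,64.0
Razer,AMD Ryzen 7,Nvidia RTX 3080,4.0
Razer,Intel i3,AMD Radeon RX 6600,4.0
"""


def max_text_lengths(rows, text_columns):
    """Return the length of the longest value seen in each text column."""
    longest = {column: 0 for column in text_columns}
    for row in rows:
        for column in text_columns:
            longest[column] = max(longest[column], len(row[column]))
    return longest


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
lengths = max_text_lengths(rows, ["brand", "processor", "gpu"])
print(lengths)
```

With the full dataset already loaded in pandas, `df["gpu"].str.len().max()` gives the same figure for a single column.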

In [18]:
# Create our first table in the database file using SQL statements:
# We want our table column names to match what is in the .csv file
# Ref:
#   https://dev.mysql.com/doc/refman/8.4/en/create-table.html
#   https://dev.mysql.com/doc/refman/8.4/en/numeric-types.html
table_name = "laptop_prices"
first_table_query = f"""CREATE TABLE IF NOT EXISTS {table_name} (
    id MEDIUMINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    brand VARCHAR(20),
    processor VARCHAR(20),
    ram_gb TINYINT,
    storage VARCHAR(20),
    gpu VARCHAR(20),
    screen_size_inch FLOAT,
    resolution VARCHAR(20),
    battery_life_hours FLOAT,
    weight_kg FLOAT,
    operating_system VARCHAR(20),
    price FLOAT
)"""
# Note that the primary key for this table is a column/field "id"
# This is not a field that existed previously. AUTO_INCREMENT automatically generates a unique value for each new row added to the table. 
# Each new value is one greater than the previous value. None of the existing columns is guaranteed to be unique (brand, for example, repeats), so none of them can be the primary key.

In [19]:
# Execute the query:
with engine.connect() as connection:
    connection.execute(text(first_table_query))

print("First table created successfully!")

First table created successfully!


Define your SQL data types for your first table: <br>

```
id MEDIUMINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
brand VARCHAR(20),
processor VARCHAR(20),
ram_gb TINYINT,
storage VARCHAR(20),
gpu VARCHAR(20),
screen_size_inch FLOAT,
resolution VARCHAR(20),
battery_life_hours FLOAT,
weight_kg FLOAT,
operating_system VARCHAR(20),
price FLOAT
```


Why did you choose these values to make up your first database table? What did you choose for your primary key and why?

1. I need an ID column to identify each row uniquely; this column is used as the primary key, with the auto-increment feature. I chose `MEDIUMINT` because my dataset has 11,768 rows, and a signed `MEDIUMINT` can hold values up to 8,388,607, which leaves plenty of headroom.
2. All the text columns use the `VARCHAR` type, with up to 20 characters allowed. I have checked that no value in these columns is longer than 20 characters.
3. The `ram_gb` column has small integer values, so I assign it the `TINYINT` type.
4. All columns with decimal values are assigned the `FLOAT` type.
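
The integer choices above can be sanity-checked with a little arithmetic: a signed MySQL integer type stored in n bytes holds values up to 2^(8n-1) - 1. The sketch below, with hypothetical helper names `signed_max` and `smallest_int_type`, picks the smallest signed type that fits a given maximum value. The byte sizes are the ones given in the MySQL numeric types reference linked earlier.

```python
# Storage size in bytes for each MySQL integer type, smallest first.
MYSQL_INT_BYTES = {"TINYINT": 1, "SMALLINT": 2, "MEDIUMINT": 3, "INT": 4, "BIGINT": 8}


def signed_max(num_bytes):
    """Largest value a signed integer of the given byte size can hold."""
    return 2 ** (8 * num_bytes - 1) - 1


def smallest_int_type(max_value):
    """Return the smallest signed MySQL integer type that can store max_value."""
    for name, size in MYSQL_INT_BYTES.items():
        if max_value <= signed_max(size):
            return name
    raise ValueError(f"{max_value} does not fit in any MySQL integer type")


print(smallest_int_type(64))     # largest ram_gb value in the preview
print(smallest_int_type(11768))  # number of rows in the dataset
print(signed_max(3))             # upper limit of a signed MEDIUMINT
```

For the 11,768 rows in this dataset a `SMALLINT` id would already be enough; `MEDIUMINT` simply leaves more room for the table to grow.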

In [20]:
# There are multiple ways to populate the fields of the table. 
# Another option is to put a subset of the data into a DataFrame first, and then populate the database table from it.
# Please feel free to change or alter the code below.
# This example uses the SQLAlchemy engine, which talks to MySQL through the MySQL connector driver.

# Named placeholders (:brand, :price, ...) let the driver quote each value safely,
# so a value containing a quote character cannot break the statement.
insert_query = text(f"""INSERT INTO {table_name} (
                        brand, processor, ram_gb,
                        storage, gpu, screen_size_inch,
                        resolution, battery_life_hours, weight_kg,
                        operating_system, price
                    ) VALUES (
                        :brand, :processor, :ram_gb,
                        :storage, :gpu, :screen_size_inch,
                        :resolution, :battery_life_hours, :weight_kg,
                        :operating_system, :price
                    )""")

with engine.connect() as connection:
    # Iterate through DataFrame rows
    for index, row in df.iterrows():
        # Execute the INSERT statement with this row's values as parameters
        connection.execute(insert_query, row.to_dict())

    # Make the inserted rows permanent
    connection.commit()

![Alt text](data-insert-result.png)
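
Bound parameters are one way to pass values to an `INSERT` without pasting them into the SQL string, which matters as soon as a value contains a quote character. The sketch below illustrates the idea with the standard-library `sqlite3` module and an in-memory database instead of the MySQL server, an assumption made so the example runs anywhere. The `?` placeholder style belongs to `sqlite3`, and the second row is a made-up brand containing an apostrophe.

```python
import sqlite3

rows = [
    ("Apple", "AMD Ryzen 3", 3997.07),
    ("O'Brien Labs", "Intel i5", 751.17),  # hypothetical brand containing a quote
]

# An in-memory database disappears when the connection is closed.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE laptop_prices ("
    "id INTEGER PRIMARY KEY, brand TEXT, processor TEXT, price REAL)"
)

# executemany() runs the same statement once per row, binding each tuple to the ? marks.
connection.executemany(
    "INSERT INTO laptop_prices (brand, processor, price) VALUES (?, ?, ?)",
    rows,
)
connection.commit()

stored = connection.execute("SELECT brand FROM laptop_prices ORDER BY id").fetchall()
print(stored)
connection.close()
```

Building the same statement by string formatting would produce invalid SQL for the second row, because the apostrophe would end the string literal early.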

**STOP**<br><br>
In MySQL Workbench, you should see your new table that you have created and populated.<br>
You can now run a SQL query directly in MySQL Workbench!<br>
You can also run a query below to test it:

In [21]:
# Now that we have populated our table, let's try out a query.

with engine.connect() as connection:  # Establish a connection
    practice_query = text(f"""SELECT * FROM {table_name}
                           WHERE brand = 'Apple'
                           """) # Define the query - text() wraps the string so SQLAlchemy treats it as a SQL statement
    practice_results = pd.read_sql(practice_query, connection) # Use pandas to run the query over the connection and return a DataFrame
    
# Display the results
practice_results

Unnamed: 0,id,brand,processor,ram_gb,storage,gpu,screen_size_inch,resolution,battery_life_hours,weight_kg,operating_system,price
0,1,Apple,AMD Ryzen 3,64,512GB SSD,Nvidia GTX 1650,17.3,2560x1440,8.9,1.42,FreeDOS,3997.07
1,9,Apple,Intel i5,64,256GB SSD,Nvidia RTX 2060,15.6,3840x2160,11.5,1.48,Linux,6409.03
2,25,Apple,Intel i9,64,256GB SSD,Integrated,17.3,1920x1080,6.9,2.55,Windows,4428.68
3,44,Apple,Intel i9,32,2TB SSD,Nvidia GTX 1650,17.3,2560x1440,5.7,1.83,Windows,5373.59
4,65,Apple,Intel i5,64,512GB SSD,Nvidia RTX 3080,14.0,3840x2160,10.9,2.25,Windows,5768.75
...,...,...,...,...,...,...,...,...,...,...,...,...
1257,11721,Apple,Intel i5,64,256GB SSD,Nvidia RTX 3060,16.0,3840x2160,9.2,2.40,macOS,6402.40
1258,11722,Apple,AMD Ryzen 7,64,1TB SSD,Integrated,15.6,1920x1080,6.6,1.82,Windows,3216.15
1259,11729,Apple,Intel i7,16,2TB SSD,AMD Radeon RX 6800,15.6,1920x1080,11.9,3.19,macOS,2354.89
1260,11752,Apple,Intel i7,4,2TB SSD,AMD Radeon RX 6600,16.0,1366x768,6.0,3.48,Linux,1504.18


**STOP**<br>
Next, create a schema diagram for your new database (Even though it only has one table...it's good practice!):<br>
Open MySQL Workbench again<br>
Click Home<br>
Click the Models icon<br>
Click the > icon to the right of "Models"<br>
Choose “Create EER Model from Database” <br>
The Reverse Engineer Database Wizard starts and will walk you through your first database schema diagram.<br>
Save your model. <br>
You can now add relationships and/or modify tables...but for this assignment, all we need is that first table. <br>

**Add a screenshot of your first schema diagram (The table) to your repository/Blackboard submission.**

![Table diagram](table-diagram.png)

In [22]:
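# Close the cursor and the MySQL connector connection :)
cursor.close()
mysql_connection.close()
# The SQLAlchemy connections above were opened in `with` blocks, so they are already closed.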

### Creating a local database from scratch

#### Creating a local database from scratch in Python using SQL Alchemy for MySQL Workbench:<br>
Another example: https://blog.sqlitecloud.io/sqlite-python-sqlalchemy

In [23]:
# BEFORE YOU BEGIN!
# Is your MySQL Server running on your local machine?
# Doesn't matter this time, please continue! :)
from sqlalchemy import create_engine

In [26]:
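# Defines an engine for a local SQLite database file in the SAME directory as this document.
# create_engine() does not connect yet; the file is only written once a connection is opened.
database_name = "thinhle_local"
engine = create_engine(f"sqlite:///{database_name}.db")
# NOTE: No host is given below, so the MySQL connector uses its default (the local server at 127.0.0.1, port 3306)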

user = 'root'
password = '123456'

mysql_connection = mysql.connector.connect(
    # Username
    user=user,
    # Password
    password=password)

# Create a cursor - an object that can execute operations such as SQL statements
cursor = mysql_connection.cursor()

cursor.execute(f"CREATE DATABASE IF NOT EXISTS {database_name}")

In [27]:
# Close the cursor connection :)
cursor.close()

True

**STOP HERE**<br>
Before moving on, it is **important** to understand what we have just completed. The SQLAlchemy engine points at a local SQLite file, while the `CREATE DATABASE` statement ran through the MySQL connector. Notice we did not specify a host, so the connector fell back to its default (the local server at 127.0.0.1, port 3306), BUT we did specify a user and password! This means the new database was created LOCALLY and we can access it in MySQL Workbench if we choose.

#### Creating a local database (.db file) from scratch in Python using SQLite:<br>


In [28]:
# Load necessary packages:
from sqlalchemy import create_engine, inspect, text # Database navigation
import sqlite3 # A second option for working with databases
import pandas as pd # Python data manipulation

In [29]:
database_name = "local_db"
engine = create_engine(f"sqlite:///{database_name}.db")

# Connect to the database - this action creates the empty file
connection = engine.connect()

# Store the dataframe in the database as a single table for quick practice (Never recommended, especially for large data sets)
# Ref: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html
# The first argument is the TABLE name; here the table reuses the database name "local_db"
# if_exists='replace': drop the old table (if any) and re-create it
# index=False: do not write the DataFrame index as an extra column
df.to_sql(database_name, con=connection, if_exists="replace", index=False)

**STOP HERE**<br>
This method creates a database as a file on our local machine. The .db file is created in the same location or working directory you are currently in (Go check!). If you did not specify a working directory, the .db file is created where this .ipynb is located. 
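
Besides looking for the file in a file browser, the check can be done from Python. The sketch below uses the standard-library `sqlite3` module to confirm that a .db file exists and to list the tables inside it. It builds a throwaway file in a temporary folder so it runs on its own; the file name `example.db`, the table `laptops`, and the helper `list_tables` are all hypothetical.

```python
import os
import sqlite3
import tempfile


def list_tables(db_path):
    """Return the names of all tables stored in a SQLite .db file."""
    connection = sqlite3.connect(db_path)
    try:
        cursor = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        return [name for (name,) in cursor]
    finally:
        connection.close()


# Build a throwaway .db file in a temporary directory, then inspect it.
with tempfile.TemporaryDirectory() as folder:
    db_path = os.path.join(folder, "example.db")
    connection = sqlite3.connect(db_path)
    connection.execute("CREATE TABLE laptops (brand TEXT, price REAL)")
    connection.commit()
    connection.close()

    file_exists = os.path.exists(db_path)
    tables = list_tables(db_path)

print(file_exists, tables)
```

To inspect the file from this assignment, call `list_tables` with the path to your own .db file instead of the temporary one.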

In [30]:
#Close the database connection :)
connection.close()