<a href="https://colab.research.google.com/github/ipeirotis/introduction-to-databases/blob/master/session1/A5-Inserting_Data_in_MySQL_using_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
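# The engine.execute() calls used below were removed in SQLAlchemy 2.0, so we pin the 1.x series
!sudo pip3 install -U -q PyMySQL "sqlalchemy<2.0" sql_magic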


## Inserting data in MySQL using Python

Let's start with a basic piece of code that fetches the data that we want to insert into the database. For our example, we will get the data about the Citibike stations, using the corresponding API call provided by the Citibike website:

In [1]:
import requests
import uuid
from datetime import date, datetime, timedelta

In [2]:
# Let's get the data from the Citibike API
url = "https://gbfs.citibikenyc.com/gbfs/en/station_information.json"
results = requests.get(url).json()

In [3]:
# We only need a subset of the data in the JSON returned by the Citibike API, so we keep only what we need
data = results["data"]["stations"]

In [10]:
data[1]

{'capacity': 33,
 'lon': -74.00666661,
 'station_type': 'classic',
 'rental_methods': ['KEY', 'CREDITCARD'],
 'external_id': '66db269c-0aca-11e7-82f6-3863bb44ef7c',
 'station_id': '79',
 'rental_uris': {'ios': 'https://bkn.lft.to/lastmile_qr_scan',
  'android': 'https://bkn.lft.to/lastmile_qr_scan'},
 'eightd_station_services': [],
 'legacy_id': '79',
 'lat': 40.71911552,
 'electric_bike_surcharge_waiver': False,
 'has_kiosk': True,
 'name': 'Franklin St & W Broadway',
 'short_name': '5430.08',
 'eightd_has_key_dispenser': False,
 'region_id': '71'}

In [4]:
len(data)

1708
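
Each station record carries many fields, but the table used later in this notebook stores only three of them. A minimal sketch of pulling out just those fields, assuming the records look like the one printed above; the helper name `station_rows` and the second sample record are hypothetical and only serve the illustration:

```python
def station_rows(stations):
    """Yield (station_id, name, capacity) tuples from station records."""
    for s in stations:
        # .get() tolerates records that lack a field; missing values become None
        yield (s.get("station_id"), s.get("name"), s.get("capacity"))


sample = [
    # Fields copied from the record shown above
    {"station_id": "79", "name": "Franklin St & W Broadway", "capacity": 33},
    # Hypothetical record with no capacity, to show the None fallback
    {"station_id": "hypothetical-1", "name": "Example St"},
]

rows = list(station_rows(sample))
print(rows)
```

Using `.get()` rather than `[...]` is a design choice for this sketch: a record with a missing field yields `None` (stored as `NULL`) instead of raising a `KeyError` halfway through the import.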

In [7]:
from sqlalchemy import create_engine

conn_string = "mysql+pymysql://{user}:{password}@{host}/".format(
    host="db.ipeirotis.org", user="student", password="dwdstudent2015"
)

engine = create_engine(conn_string)
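
The connection string above is a URL, so a password that contains characters such as `@`, `:` or `/` would be misread as URL delimiters. A small sketch of escaping the credentials first, assuming a hypothetical helper `build_conn_string` and made-up credentials and host (not the ones used in this notebook):

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host):
    # quote_plus percent-encodes characters such as '@', ':' and '/'
    # so that they are not parsed as parts of the URL
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}/"


# Made-up credentials and host, for illustration only
example = build_conn_string("student", "p@ss:word/1", "db.example.org")
print(example)
```

The resulting string can be passed to `create_engine` in the same way as `conn_string`.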

With the engine in place (it connects lazily, when the first query runs), we create our database:

In [8]:
# Query to create a database
# In this example, we will try to create the (existing) database "public"
# But in general, we can give any name to the database
db_name = "public"
create_db_query = (
    f"CREATE DATABASE IF NOT EXISTS {db_name} DEFAULT CHARACTER SET 'utf8'"
)

# Create a database
engine.execute(create_db_query)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f84bb81db50>

Then we create the table where we will store our data. For our example, we will import just three fields into the database: `station_id`, `station_name`, and `capacity`:

In [9]:
# To avoid conflicts between people writing in the same database, we add a random suffix to the table names
# We create the variable only once per notebook session, so re-running this cell keeps the same suffix
if "suffix" not in globals():
    suffix = str(uuid.uuid4())[:8]
print(suffix)

3a14552e


In [18]:
table_name = f"Docks_{suffix}"
# Create a table
create_table_query = f"""CREATE TABLE IF NOT EXISTS {db_name}.{table_name} 
                                (station_id int, 
                                station_name varchar(250), 
                                capacity int,
                                PRIMARY KEY(station_id)
                                )"""
engine.execute(create_table_query)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f84ba6177d0>
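
The database and table names are placed into the SQL text with an f-string, because identifiers cannot be passed as query parameters. A possible safeguard, sketched under the assumption that names are limited to letters, digits and underscores; the helper `safe_identifier` is hypothetical and not part of any library:

```python
import re

# Assumption: identifiers consist only of letters, digits and underscores
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_identifier(name):
    """Return name unchanged if it is a plain identifier, else raise ValueError."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


# The suffix printed earlier in this notebook
checked_name = safe_identifier("Docks_3a14552e")
print(checked_name)
```

Since the suffix is the first eight characters of a UUID, which are hexadecimal digits, the generated table names pass this check.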

Finally, we import the data into our table, using the INSERT command. (_Note: The `INSERT IGNORE` directs the database to ignore attempts to insert another tuple with the same primary key. In our case, we do not want to allow two entries for the same `station_id`._)

In [19]:
query_template = f"""
                    INSERT IGNORE INTO 
                    {db_name}.{table_name}(station_id,  station_name,  capacity) 
                    VALUES (%s, %s, %s)
                  """

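# THIS IS PROHIBITED: building the query by concatenating values into the SQL
# string breaks on names that contain quotes and opens the door to SQL injection
# query = "INSERT INTO citibike.Docks(station_id, station_name, number_of_docks) " + \
#         "VALUES ("+entry["id"]+", "+entry["stationName"]+", "+entry["totalDocks"]+")"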


for entry in data:
    dockid = entry["station_id"]
    addr = entry["name"]
    docks = entry["capacity"]

    print("Inserting station", dockid, "at", addr, "with", docks, "docks")
    query_parameters = (dockid, addr, docks)
    engine.execute(query_template, query_parameters)

Inserting station 72 at W 52 St & 11 Ave with 20 docks
Inserting station 79 at Franklin St & W Broadway with 33 docks
Inserting station 82 at St James Pl & Pearl St with 27 docks
Inserting station 83 at Atlantic Ave & Fort Greene Pl with 62 docks
Inserting station 116 at W 17 St & 8 Ave with 74 docks
Inserting station 119 at Park Ave & St Edwards St with 53 docks
Inserting station 120 at Lexington Ave & Classon Ave with 19 docks
Inserting station 127 at Barrow St & Hudson St with 31 docks
Inserting station 128 at MacDougal St & Prince St with 56 docks
Inserting station 143 at Clinton St & Joralemon St with 50 docks
Inserting station 144 at Nassau St & Navy St with 58 docks
Inserting station 146 at Hudson St & Reade St with 55 docks
Inserting station 150 at E 2 St & Avenue C with 56 docks
Inserting station 151 at Cleveland Pl & Spring St with 33 docks
Inserting station 152 at Warren St & W Broadway with 49 docks
Inserting station 153 at E 40 St & 5 Ave with 63 docks
Inserting station 15
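
To see the effect of parameter binding and duplicate-key handling without a MySQL server, here is a self-contained sketch that swaps in Python's built-in `sqlite3` module. This is a stand-in, not the setup used in this notebook: SQLite spells the statement `INSERT OR IGNORE` and uses `?` placeholders instead of `%s`. The last row is a hypothetical station added to show a name that contains a quote.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL server
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Docks (station_id INTEGER PRIMARY KEY, station_name TEXT, capacity INTEGER)"
)

rows = [
    (79, "Franklin St & W Broadway", 33),
    (82, "St James Pl & Pearl St", 27),
    (79, "Franklin St & W Broadway", 33),  # duplicate primary key: skipped
    (999999, "O'Example St", 10),  # hypothetical station; the quote needs no manual escaping
]

# executemany binds each tuple to the placeholders, one statement per row
conn.executemany(
    "INSERT OR IGNORE INTO Docks (station_id, station_name, capacity) VALUES (?, ?, ?)",
    rows,
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM Docks").fetchone()[0]
names = [r[0] for r in conn.execute("SELECT station_name FROM Docks ORDER BY station_id")]
print(count, names)
conn.close()
```

Four rows are submitted but only three are stored, because the second entry for station 79 is ignored. The name with the apostrophe is stored intact, which is exactly what string concatenation would get wrong.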

Now let's see how to query the database:

In [14]:
results = engine.execute(f"SELECT * FROM {db_name}.{table_name}")
rows = results.fetchall()
results.close()

In [15]:
for row in rows:
    print("Station ID:", row["station_id"])
    print("Station Name:", row["station_name"])
    print("Number of Docks:", row["capacity"])
    print("=============================================")

Streaming output truncated to the last 5000 lines.
Station ID: 3318
Station Name: 2 Ave & E 96 St
Number of Docks: 45
Station ID: 3319
Station Name: 14 St & 5 Ave
Number of Docks: 21
Station ID: 3321
Station Name: Clinton St & Union St
Number of Docks: 38
Station ID: 3322
Station Name: 12 St & 4 Ave
Number of Docks: 25
Station ID: 3323
Station Name: W 106 St & Central Park West
Number of Docks: 59
Station ID: 3324
Station Name: 3 Ave & 14 St
Number of Docks: 0
Station ID: 3325
Station Name: E 95 St & 3 Ave
Number of Docks: 31
Station ID: 3326
Station Name: Clinton St & Centre St
Number of Docks: 21
Station ID: 3327
Station Name: 3 Ave & E 100 St
Number of Docks: 27
Station ID: 3328
Station Name: W 100 St & Manhattan Ave
Number of Docks: 39
Station ID: 3329
Station Name: Degraw St & Smith St
Number of Docks: 27
Station ID: 3330
Station Name: Henry St & Bay St
Number of Docks: 19
Station ID: 3331
Station Name: Riverside Dr & W 104 St
Number of Docks: 58
Station ID: 3332
Sta

Finally, let's clean up by dropping the table that we created:

In [16]:
drop_table_query = f"DROP TABLE IF EXISTS {db_name}.{table_name}"
engine.execute(drop_table_query)

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f84bb0cb3d0>

## Exercise

At `https://gbfs.citibikenyc.com/gbfs/en/station_status.json` we can access the live status of all the stations (e.g., the number of bikes available). Using the approach outlined above, create a table in the database (reusing the table suffix that we generated above) and store the status data in it.