## Inserting and Reading Data from MySQL using Pandas

First, let's start with a basic piece of code that fetches the data we want to insert into the database. For our example, we will get data about the Citibike stations, using the corresponding API call provided by the Citibike website:

In [1]:
# !sudo pip3 install -U -q PyMySQL sqlalchemy sql_magic

In [2]:
import requests

In [3]:
# Let's get the data from the Citibike API

# This gives information for each station that remains stable over time
url_stations = "https://gbfs.citibikenyc.com/gbfs/en/station_information.json"
# This gives the live status of all the stations (e.g., bikes available etc)
url_status = "https://gbfs.citibikenyc.com/gbfs/en/station_status.json"

# We fetch for now just the time-invariant data
results = requests.get(url_stations).json()

In [4]:
# We only need a subset of the data in the JSON returned by the Citibike API, so we keep only what we need
data = results["data"]["stations"]

In [5]:
len(data)

1393
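
The station list sits two levels deep in the response, under `data` and then `stations`. The sketch below illustrates that shape with a small hand-written payload (not a live API response; the real feed has many more fields per station) and a hypothetical helper, `extract_stations`, that fails with a clear message if the expected keys are missing.

```python
# Hand-written stand-in for the JSON returned by the station_information feed.
# Field values are illustrative only.
sample_payload = {
    "data": {
        "stations": [
            {"station_id": "72", "name": "W 52 St & 11 Ave", "capacity": 55},
            {"station_id": "79", "name": "Franklin St & W Broadway", "capacity": 33},
        ]
    }
}


def extract_stations(payload):
    """Return the list of station records from a GBFS-style payload (hypothetical helper)."""
    try:
        stations = payload["data"]["stations"]
    except (KeyError, TypeError) as exc:
        raise ValueError("payload has no data -> stations list") from exc
    if not isinstance(stations, list):
        raise ValueError("data -> stations is not a list")
    return stations


stations = extract_stations(sample_payload)
print(len(stations), "stations; first one:", stations[0]["name"])
```

With the live data, the same helper could be called as `extract_stations(results)` in place of indexing `results` directly.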

In [6]:
import pandas as pd

df = pd.DataFrame(data)
df.head(5)

Unnamed: 0,external_id,eightd_has_key_dispenser,capacity,short_name,name,rental_methods,lon,has_kiosk,legacy_id,station_type,rental_uris,electric_bike_surcharge_waiver,station_id,lat,region_id,eightd_station_services
0,66db237e-0aca-11e7-82f6-3863bb44ef7c,False,55,6926.01,W 52 St & 11 Ave,"[CREDITCARD, KEY]",-73.993929,True,72,classic,{'android': 'https://bkn.lft.to/lastmile_qr_sc...,False,72,40.767272,71,[]
1,66db269c-0aca-11e7-82f6-3863bb44ef7c,False,33,5430.08,Franklin St & W Broadway,"[CREDITCARD, KEY]",-74.006667,True,79,classic,{'android': 'https://bkn.lft.to/lastmile_qr_sc...,False,79,40.719116,71,[]
2,66db277a-0aca-11e7-82f6-3863bb44ef7c,False,27,5167.06,St James Pl & Pearl St,"[CREDITCARD, KEY]",-74.000165,True,82,classic,{'android': 'https://bkn.lft.to/lastmile_qr_sc...,False,82,40.711174,71,[]
3,66db281e-0aca-11e7-82f6-3863bb44ef7c,False,62,4354.07,Atlantic Ave & Fort Greene Pl,"[CREDITCARD, KEY]",-73.976323,True,83,classic,{'android': 'https://bkn.lft.to/lastmile_qr_sc...,False,83,40.683826,71,[]
4,66db28b5-0aca-11e7-82f6-3863bb44ef7c,False,50,6148.02,W 17 St & 8 Ave,"[CREDITCARD, KEY]",-74.001497,True,116,classic,{'android': 'https://bkn.lft.to/lastmile_qr_sc...,False,116,40.741776,71,[]


In [7]:
# We drop the 'rental_methods', 'eightd_station_services', and 'rental_uris'
# columns, as they contain lists or dictionaries and
# we cannot insert such values in a database cell.
df.drop(
    ["rental_methods", "eightd_station_services", "rental_uris"],
    axis="columns",
    inplace=True,
)
df.head(5)

Unnamed: 0,external_id,eightd_has_key_dispenser,capacity,short_name,name,lon,has_kiosk,legacy_id,station_type,electric_bike_surcharge_waiver,station_id,lat,region_id
0,66db237e-0aca-11e7-82f6-3863bb44ef7c,False,55,6926.01,W 52 St & 11 Ave,-73.993929,True,72,classic,False,72,40.767272,71
1,66db269c-0aca-11e7-82f6-3863bb44ef7c,False,33,5430.08,Franklin St & W Broadway,-74.006667,True,79,classic,False,79,40.719116,71
2,66db277a-0aca-11e7-82f6-3863bb44ef7c,False,27,5167.06,St James Pl & Pearl St,-74.000165,True,82,classic,False,82,40.711174,71
3,66db281e-0aca-11e7-82f6-3863bb44ef7c,False,62,4354.07,Atlantic Ave & Fort Greene Pl,-73.976323,True,83,classic,False,83,40.683826,71
4,66db28b5-0aca-11e7-82f6-3863bb44ef7c,False,50,6148.02,W 17 St & 8 Ave,-74.001497,True,116,classic,False,116,40.741776,71
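
Rather than spotting the multi-valued columns by eye, they can be found programmatically in the raw records. The following is a minimal sketch using only the standard library; `non_scalar_keys` is a hypothetical helper and the two sample records are made up to resemble the station data.

```python
def non_scalar_keys(records):
    """Return, in first-seen order, the keys whose value is a list or dict in any record."""
    found = []
    for record in records:
        for key, value in record.items():
            if isinstance(value, (list, dict)) and key not in found:
                found.append(key)
    return found


# Made-up records with the same kinds of fields as the station data
sample = [
    {
        "station_id": "72",
        "capacity": 55,
        "rental_methods": ["CREDITCARD", "KEY"],
        "rental_uris": {"android": "https://example.com/a"},
        "eightd_station_services": [],
    },
    {
        "station_id": "79",
        "capacity": 33,
        "rental_methods": ["KEY"],
        "rental_uris": {},
        "eightd_station_services": [],
    },
]

to_drop = non_scalar_keys(sample)
print(to_drop)
# ['rental_methods', 'rental_uris', 'eightd_station_services']
```

The resulting list could then be passed to `df.drop(to_drop, axis="columns")` instead of typing the column names by hand.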


### Writing a Pandas DataFrame to a MySQL Table

In [8]:
import sqlalchemy
from sqlalchemy import create_engine

conn_string = "mysql+pymysql://{user}:{password}@{host}/".format(
    host="db.ipeirotis.org", user="student", password="dwdstudent2015"
)

engine = create_engine(conn_string)
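
One caveat when building the connection string with `format`: if the password contains characters such as `@` or `/`, they need to be URL-encoded, otherwise the URL will be parsed incorrectly. Below is a minimal sketch using the standard library's `urllib.parse.quote_plus`; the helper name `build_conn_string`, the host `db.example.org`, and the credentials are made up for illustration.

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host, database=""):
    """Build a mysql+pymysql URL, escaping special characters in the credentials."""
    return "mysql+pymysql://{user}:{password}@{host}/{database}".format(
        user=quote_plus(user),
        password=quote_plus(password),
        host=host,
        database=database,
    )


# Hypothetical credentials; note how '@' and '/' in the password are escaped
conn_string_example = build_conn_string("student", "p@ss/word", "db.example.org")
print(conn_string_example)
# mysql+pymysql://student:p%40ss%2Fword@db.example.org/
```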

Once we have connected successfully, we need to create our database:

In [9]:
# Query to create a database
# In this example, we will try to create the (existing) database "public"
# But in general, we can give any name to the database
db_name = "public"
create_db_query = (
    f"CREATE DATABASE IF NOT EXISTS {db_name} DEFAULT CHARACTER SET 'utf8'"
)

# Create a database
engine.execute(create_db_query)

# And let's switch to the database
engine.execute(f"USE {db_name}")

<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fde5a50c190>
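
The database name is interpolated directly into the SQL text with an f-string, because identifiers such as database and table names cannot be passed as bound query parameters. That is fine for a hard-coded name like `public`, but if the name ever came from user input it would be worth validating first. Here is a minimal sketch, assuming names are restricted to letters, digits, and underscores (a stricter rule than MySQL itself imposes); `quote_identifier` is a hypothetical helper, not part of SQLAlchemy.

```python
import re

# Assumption: only letters, digits, and underscores are allowed in names
_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def quote_identifier(name):
    """Validate a database or table name and wrap it in MySQL backticks."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return f"`{name}`"


safe_query = (
    f"CREATE DATABASE IF NOT EXISTS {quote_identifier('public')} "
    "DEFAULT CHARACTER SET 'utf8'"
)
print(safe_query)
```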

In [10]:
# To avoid conflicts between people writing in the same database, we add a random suffix to the table names
# We only create the variable once while running the notebook
import uuid

if "suffix" not in globals():
    suffix = str(uuid.uuid4())[:8]
print(suffix)

c519975f


### Create Table and Store Data in Database using the `to_sql` command

Then we create the table where we will store our data. Since we already have the data in a Pandas DataFrame, it is very easy to put the data in a database.

In [11]:
table_name = f"Stations_{suffix}"
# Create a table
# See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html for the documentation

# This step is optional, but it is good practice to explicitly define the
# data types before storing things in a database. In many cases, this can be omitted, though.

dtype = {
    "station_id": sqlalchemy.types.SMALLINT(),
    "external_id": sqlalchemy.types.VARCHAR(50),
    "name": sqlalchemy.types.VARCHAR(60),
    "short_name": sqlalchemy.types.VARCHAR(10),
    "lat": sqlalchemy.types.Float,
    "lon": sqlalchemy.types.Float,
    "region_id": sqlalchemy.types.VARCHAR(5),
    "capacity": sqlalchemy.types.SMALLINT(),
    "electric_bike_surcharge_waiver": sqlalchemy.types.BOOLEAN,
    "eightd_has_key_dispenser": sqlalchemy.types.BOOLEAN,
    "has_kiosk": sqlalchemy.types.BOOLEAN,
}


df.to_sql(
    name=table_name,
    schema=db_name,
    con=engine,
    if_exists="replace",
    index=False,
    dtype=dtype,
)
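
Conceptually, `to_sql` with `if_exists="replace"` drops any existing table of that name, creates a new one with the requested column types, and inserts the rows. The sketch below spells out that sequence by hand, using Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server. It illustrates the idea only and is not how pandas implements it; the table name `stations_demo` and the two rows are made up.

```python
import sqlite3

# Made-up rows resembling the station data
rows = [
    {"station_id": 72, "name": "W 52 St & 11 Ave", "capacity": 55, "has_kiosk": True},
    {"station_id": 79, "name": "Franklin St & W Broadway", "capacity": 33, "has_kiosk": True},
]


def replace_table(conn, rows):
    """Drop, recreate, and refill the demo table (roughly what 'replace' means)."""
    conn.execute("DROP TABLE IF EXISTS stations_demo")
    conn.execute(
        "CREATE TABLE stations_demo ("
        " station_id SMALLINT, name VARCHAR(60), capacity SMALLINT, has_kiosk BOOLEAN)"
    )
    conn.executemany(
        "INSERT INTO stations_demo VALUES (:station_id, :name, :capacity, :has_kiosk)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
replace_table(conn, rows)
replace_table(conn, rows)  # running it a second time does not duplicate the rows
row_count = conn.execute("SELECT COUNT(*) FROM stations_demo").fetchone()[0]
print(row_count)
# 2
```

By contrast, `if_exists="append"` would keep the existing table and add the rows to it, and `if_exists="fail"` would raise an error if the table already exists.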

In [12]:
# Once we have the data in the table, we also specify a primary key
# If we had FOREIGN KEYS, we could add them in the same way
add_key_query = f"ALTER TABLE {table_name} ADD PRIMARY KEY(station_id)"
print(add_key_query)
engine.execute(add_key_query)

ALTER TABLE Stations_c519975f ADD PRIMARY KEY(station_id)


<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7fde5a4f9e50>
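
`ALTER TABLE ... ADD PRIMARY KEY` fails if the chosen column contains duplicate or NULL values, so it can help to check the data before writing it. The following is a minimal sketch on plain Python records; `primary_key_problems` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter


def primary_key_problems(records, key):
    """Return (number of missing values, sorted list of duplicated values) for `key`."""
    values = [record.get(key) for record in records]
    missing = sum(1 for value in values if value is None)
    counts = Counter(value for value in values if value is not None)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicates


good = [{"station_id": 72}, {"station_id": 79}, {"station_id": 82}]
bad = [{"station_id": 72}, {"station_id": 72}, {"station_id": None}]

print(primary_key_problems(good, "station_id"))  # (0, [])
print(primary_key_problems(bad, "station_id"))   # (1, [72])
```

With a DataFrame, `df["station_id"].is_unique` and `df["station_id"].isna().any()` give the same information.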

### Reading from a SQL Database in Python using the `read_sql` command in Pandas

We can similarly read from the database using Pandas:

In [13]:
query = f"SELECT * FROM {db_name}.{table_name}"
print(query)

SELECT * FROM public.Stations_c519975f


In [14]:
df2 = pd.read_sql(query, con=engine)
df2.head(5)

Unnamed: 0,external_id,eightd_has_key_dispenser,capacity,short_name,name,lon,has_kiosk,legacy_id,station_type,electric_bike_surcharge_waiver,station_id,lat,region_id
0,66db237e-0aca-11e7-82f6-3863bb44ef7c,0,55,6926.01,W 52 St & 11 Ave,-73.9939,1,72,classic,0,72,40.7673,71
1,66db269c-0aca-11e7-82f6-3863bb44ef7c,0,33,5430.08,Franklin St & W Broadway,-74.0067,1,79,classic,0,79,40.7191,71
2,66db277a-0aca-11e7-82f6-3863bb44ef7c,0,27,5167.06,St James Pl & Pearl St,-74.0002,1,82,classic,0,82,40.7112,71
3,66db281e-0aca-11e7-82f6-3863bb44ef7c,0,62,4354.07,Atlantic Ave & Fort Greene Pl,-73.9763,1,83,classic,0,83,40.6838,71
4,66db28b5-0aca-11e7-82f6-3863bb44ef7c,0,50,6148.02,W 17 St & 8 Ave,-74.0015,1,116,classic,0,116,40.7418,71
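
Notice that the Boolean columns come back as 0 and 1 rather than `False` and `True`, because MySQL stores `BOOLEAN` as a small integer. The sketch below reproduces the round trip on a small made-up DataFrame and converts the flags back with `astype(bool)`. It assumes an in-memory SQLite database (through the standard `sqlite3` module) as a stand-in for MySQL, so that it runs without a server; the table name `stations_demo` is hypothetical.

```python
import sqlite3

import pandas as pd

# Made-up data resembling the station table
demo = pd.DataFrame(
    {
        "station_id": [72, 79],
        "name": ["W 52 St & 11 Ave", "Franklin St & W Broadway"],
        "has_kiosk": [True, False],
    }
)

# Write to, and read back from, an in-memory SQLite database
conn = sqlite3.connect(":memory:")
demo.to_sql("stations_demo", con=conn, if_exists="replace", index=False)
restored = pd.read_sql("SELECT * FROM stations_demo", con=conn)

# The flags arrive as integers; convert them back to Booleans
bool_columns = ["has_kiosk"]
restored[bool_columns] = restored[bool_columns].astype(bool)
print(restored.dtypes)
```

The same `astype(bool)` step could be applied to `df2` for the `eightd_has_key_dispenser`, `has_kiosk`, and `electric_bike_surcharge_waiver` columns.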


### Export Data from Database to CSV or Excel

And remember that from Pandas it is also possible to export to other formats, such as Excel or CSV.

In [15]:
# The necessary library to write Excel (.xlsx) files
# !sudo pip3 install -U openpyxl

In [16]:
df2.to_excel("citibike.xlsx")
df2.to_csv("citibike.csv")

### Cleanup

Finally, let's clean up and delete the table that we created:

In [None]:
drop_table_query = f"DROP TABLE IF EXISTS {db_name}.{table_name}"
print(drop_table_query)
engine.execute(drop_table_query)

### Exercise

The `url_status = 'https://gbfs.citibikenyc.com/gbfs/en/station_status.json'` URL contains the status of the stations. Write code that reads the results from that API call, and then stores the data in a separate table. Add a "foreign key" constraint from the Status table to the Stations table that we created above.