# Databases

In this activity, we will consider how to access and query data from a database using Structured Query Language (SQL). To access this data we will be using a tool called SQLite in the background.

We can do this using the pandas [`read_sql` function](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html), but it takes a few more steps than reading data from CSV and Excel files.

The format of this function will be:

```
read_sql(sql, con, ...)

where

sql : SQL query or table name (can be a str)
con : connectable (can be a str), way to connect to the database
```

and we will talk below about how to construct these inputs for the function.

### Solar panel database

The database we will be looking at is the [UKPVGeo database](https://github.com/openclimatefix/solar-power-mapping-data) which contains data on solar panels and solar farms in the UK. This combines data from multiple datasets (OpenStreetMap geo data and Renewable Energy Planning Database) alongside additional research to produce data on location and capacity for these sites.

The full article on this dataset is available here: https://www.nature.com/articles/s41597-020-00739-0

### Defining our database

We will start by defining the `con` input and what that means.

First, we need to define *what* we want to use to interpret our database, known as a *database service*. A specific database service will have been used to create the database initially and so also needs to be used to read and interpret the database.

The *database service* we are using for this database is called *SQLite*, but there are other similar services such as *MySQL* or *PostgreSQL* which you may come across. All these options use a form of SQL to access data.

In [17]:
database_service = "sqlite"

Second, we need to define *where* our database is stored.

We have a local copy of our SQLite database in the "data" folder called `ukpvgeo.db`. If we were accessing a database on an online server instead, we could use a URL here (complete with login information if needed).

In [18]:
database = "data/ukpvgeo.db"

The underlying Python library which understands how to access SQL databases is called `SQLAlchemy`, but we don't need to use it directly because pandas handles that for us. However, we do need to construct a string (a database URL, which we will refer to here as our connectable) that this library understands, so that it can load and access our database. We can do so using the inputs we defined above:

In [19]:
connectable = f"{database_service}:///{database}"
print(f"Our connectable for our database is {connectable}")

Our connectable for our database is sqlite:///data/ukpvgeo.db


Don't worry too much about the details of this for now but you can find more information on the format of these inputs in the `SQLAlchemy` documentation on [Database URLs](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) ([SQLite](https://docs.sqlalchemy.org/en/13/core/engines.html#sqlite) specifically).
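
As a rough illustration of how these strings are put together, here is a minimal sketch that builds SQLite URLs from file paths. The helper name `sqlite_url` and the absolute path are made up for this example (they are not part of pandas or `SQLAlchemy`), and POSIX-style paths are assumed. In the `SQLAlchemy` URL format, three slashes are followed by a relative path, so an absolute path ends up with four slashes in a row.

```python
from pathlib import PurePosixPath


def sqlite_url(path: str) -> str:
    """Build an SQLite database URL from a POSIX-style path (hypothetical helper)."""
    return f"sqlite:///{PurePosixPath(path)}"


# Relative path: three slashes, then the path
relative_url = sqlite_url("data/ukpvgeo.db")

# Absolute path (illustrative location): the leading "/" gives four slashes
absolute_url = sqlite_url("/home/user/data/ukpvgeo.db")

print(relative_url)
print(absolute_url)
```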

### Defining our query

One of the benefits of using SQL-type databases is that we can use SQL queries to grab only the data we need. Often these types of databases contain lots of data (hundreds of megabytes up to terabytes), which would be impractical to load fully into memory as we did when reading CSV files with pandas.

SQL is a language for selecting and filtering data which can be used to construct a database query.

For our "ukpvgeo.db" data we need additional information about the tables contained within this SQLite database and the columns within those tables. These details can be found within the accompanying data dictionary file: ["README_ukpvgeo.txt"](data/README_ukpvgeo.txt) (stored in the "data" directory).

From this file, we can see that the "ukpvgeo.db" database contains one table:
- "pv"

This table contains lots of columns, but some we may want to pick out are:
- "latitude"
- "longitude"
- "capacity_repd_MWp"

The final column listed above is defined as "Renewable Energy Planning Database Megawatt estimated peak capacity of PV panel" and so contains information on the peak energy capacity for each solar panel site.
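
If no data dictionary were available, the table and column names could also be read from the database itself. The sketch below uses Python's built-in `sqlite3` module on a small in-memory database that stands in for "ukpvgeo.db"; the table is created here with only three of the columns, purely for illustration.

```python
import sqlite3

# Toy in-memory database standing in for "data/ukpvgeo.db" (illustrative schema only)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pv (latitude REAL, longitude REAL, capacity_repd_MWp REAL)"
)

# sqlite_master lists the tables stored in an SQLite database
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# PRAGMA table_info returns one row per column; the column name is the second field
columns = [row[1] for row in conn.execute("PRAGMA table_info(pv)")]

print(tables)
print(columns)
conn.close()
```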

When we construct an SQL query we use a set of keywords. Here are a few keywords we often need:
- SELECT - which columns to select from the table
- FROM - which table we want the data from (in this case "pv")
- WHERE - any other conditions we want to apply e.g. values above a certain latitude

We can put these all together into a single string, in the right order, to create our SQL query. Here is an example of one query we could create (as a string):

In [20]:
query = "SELECT latitude, longitude, capacity_repd_MWp FROM pv WHERE latitude > 50"

This would *select* the columns "latitude", "longitude" and "capacity_repd_MWp" *from* the "pv" table *where* the latitude values are greater than 50.
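
To see what such a query returns without needing the real database, here is a minimal sketch that runs the same query with the built-in `sqlite3` module. The three rows are made-up values in an in-memory table, not data from UKPVGeo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pv (latitude REAL, longitude REAL, capacity_repd_MWp REAL)"
)
# Made-up rows for illustration; None is stored as NULL (a missing value)
conn.executemany(
    "INSERT INTO pv VALUES (?, ?, ?)",
    [
        (50.9, -0.3, 10.0),
        (49.9, -5.2, 2.5),
        (51.5, -2.6, None),
    ],
)

query = "SELECT latitude, longitude, capacity_repd_MWp FROM pv WHERE latitude > 50"
rows = conn.execute(query).fetchall()

# Only the rows with latitude above 50 are returned
for row in rows:
    print(row)
conn.close()
```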

You can also include multiple conditions using boolean operators including AND and OR:

In [22]:
# The \ here is just to let us split the string across multiple lines for visual clarity
query = "SELECT latitude, longitude, capacity_repd_MWp FROM pv \
         WHERE latitude > 50 AND latitude < 51 AND longitude > -1"
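
The sketch below shows AND and OR conditions on a toy in-memory table (made-up coordinates), using the built-in `sqlite3` module. It also shows two optional alternatives to the approach above: adjacent string literals inside parentheses instead of a backslash, and `?` placeholders, which are the `sqlite3` module's way of passing values separately from the query text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pv (latitude REAL, longitude REAL)")
# Made-up coordinates for illustration
conn.executemany(
    "INSERT INTO pv VALUES (?, ?)",
    [(50.5, -0.5), (50.5, -3.0), (52.0, 0.1), (49.5, -0.2)],
)

# Adjacent string literals inside parentheses are joined by Python,
# so the query can span several lines without a backslash
query_and = (
    "SELECT latitude, longitude FROM pv "
    "WHERE latitude > ? AND latitude < ? AND longitude > ?"
)
band = conn.execute(query_and, (50, 51, -1)).fetchall()

# OR keeps rows that satisfy either condition
query_or = "SELECT latitude, longitude FROM pv WHERE latitude < ? OR latitude > ?"
outside = conn.execute(query_or, (50, 51)).fetchall()

print(band)
print(outside)
conn.close()
```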

SQL querying is very powerful and contains scope for performing complex operations and selections from databases. We have considered some of the essentials above but see this [SQL Cheatsheet](https://www.sqltutorial.org/sql-cheat-sheet/) and this specific [SQLite tutorial](https://www.sqlitetutorial.net/sqlite-select/) for more examples.

### Accessing the data

We can now use this query string along with the database (connectable) string we defined previously to access the data from the database and view this as a pandas DataFrame:

In [23]:
import pandas as pd

ukpvgeo_selected = pd.read_sql(query, connectable)
ukpvgeo_selected

|  | latitude | longitude | capacity_repd_MWp |
| --- | --- | --- | --- |
| 0 | 50.876279 | 0.457144 |  |
| 1 | 50.873649 | 0.451525 |  |
| 2 | 50.874459 | 0.453338 |  |
| 3 | 50.874683 | 0.452878 | 4.0 |
| 4 | 50.945229 | -0.333093 | 10.0 |
| ... | ... | ... | ... |
| 2573 | 50.828018 | -0.555577 | 3.0 |
| 2574 | 50.834428 | -0.606416 | 5.0 |
| 2575 | 50.846932 | -0.743131 | 7.5 |
| 2576 | 50.938196 | 0.071365 |  |
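
When even the filtered result is too large to hold in memory at once, `read_sql` accepts a `chunksize` argument and then returns an iterator of smaller DataFrames. The sketch below is one way this could look, assuming a toy in-memory table with generated values; it passes a `sqlite3` connection as `con` (which pandas also accepts) rather than the connectable string used above.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pv (latitude REAL, capacity_repd_MWp REAL)")
# Ten generated rows for illustration: latitudes 50.0 to 50.9, capacities 0 to 9
conn.executemany(
    "INSERT INTO pv VALUES (?, ?)",
    [(50.0 + i / 10, float(i)) for i in range(10)],
)

query = "SELECT latitude, capacity_repd_MWp FROM pv WHERE latitude > 50"

n_rows = 0
total_capacity = 0.0
# With chunksize set, read_sql yields DataFrames of at most 4 rows each
for chunk in pd.read_sql(query, conn, chunksize=4):
    n_rows += len(chunk)
    total_capacity += chunk["capacity_repd_MWp"].sum()

print(n_rows, total_capacity)
conn.close()
```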


## Saving to a database

When we have manipulated and subsetted data we may want to save this to a new database for use later on. From the pandas DataFrame format we could choose to save this data to whatever format seems appropriate.

If we wanted to save to an SQLite (or other SQL-based) database, this is very similar to the process of reading the data. We need to provide a name for our output table and specify where to save the file (as a connectable, in the same format as above).

In [24]:
table_new = "pv"
connectable_new = "sqlite:///data/ukpvgeo_subset.db"

ukpvgeo_selected.to_sql(table_new, connectable_new, if_exists="replace")

2578

*Note: The `if_exists="replace"` keyword is included so that this notebook (and this cell) can be run multiple times without producing a ValueError. It is not needed when the table is being created for the first time.*
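
The behaviour of the `if_exists` options can be sketched on a toy in-memory database as follows. The DataFrame values are made up, a `sqlite3` connection is used in place of a connectable string, and `index=False` is an extra choice made for this example: by default `to_sql` also writes the DataFrame index as a column.

```python
import sqlite3

import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({"latitude": [50.9, 51.5], "longitude": [-0.3, -2.6]})
conn = sqlite3.connect(":memory:")

# First write: the table does not exist yet, so the default settings work
df.to_sql("pv", conn, index=False)

# Second write with the default if_exists="fail" raises a ValueError
try:
    df.to_sql("pv", conn, index=False)
    raised = False
except ValueError:
    raised = True

# "replace" drops and recreates the table; "append" adds rows to it
df.to_sql("pv", conn, index=False, if_exists="replace")
df.to_sql("pv", conn, index=False, if_exists="append")

row_count = conn.execute("SELECT COUNT(*) FROM pv").fetchone()[0]
saved_columns = [row[1] for row in conn.execute("PRAGMA table_info(pv)")]
print(raised, row_count, saved_columns)
conn.close()
```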

---

### Exercise A

From the UKPVGeo database, we want to find a rough value for the number of solar panels/farms in the Bristol area which are listed as operational.

*You can complete this task following the steps below, or approach this in a different way if you prefer. However, you should aim to use SQL queries as part of your solution.*

Data dictionary: [data/README_ukpvgeo.txt](data/README_ukpvgeo.txt)

1) Consider which columns for this database would provide information about the status of the PV panel and whether this is operational. Based on the SQL query defined above (`query`), create and run a new SQL query which also extracts additional column(s) from the database.

In [46]:
new_query = "SELECT latitude, longitude, repd_status, capacity_repd_MWp FROM pv"

2) When querying the database, what would be a useful filter to include to only select a rough area around Bristol (e.g. ~20km)? Based on the SQL query defined in the previous question, create and run an SQL query which only grabs the data for an area around Bristol.

*Hint: central Bristol is at roughly 51.455795, -2.583467 degrees (0.1 degrees is ~11km)*

In [47]:
new_query = "SELECT latitude, longitude, repd_status, capacity_repd_MWp FROM pv WHERE latitude > 51.2739768 AND latitude < 51.637613 AND longitude > -2.765285 AND longitude < -2.401649"
ukpvgeo_bristol = pd.read_sql(new_query, connectable)
ukpvgeo_bristol

|  | latitude | longitude | repd_status | capacity_repd_MWp |
| --- | --- | --- | --- | --- |
| 0 | 51.545819 | -2.579605 | Operational | 8.0 |
| 1 | 51.527398 | -2.444215 | Operational | 19.8 |
| 2 | 51.527292 | -2.446618 | Operational |  |
| 3 | 51.542598 | -2.529121 | Operational | 21.0 |
| 4 | 51.537894 | -2.534388 | Operational | 15.0 |
| ... | ... | ... | ... | ... |
| 1694 | 51.504026 | -2.506577 |  |  |
| 1695 | 51.503972 | -2.506754 |  |  |
| 1696 | 51.324821 | -2.696806 | Awaiting Construction | 1.0 |
| 1697 | 51.533551 | -2.675828 | Operational | 1.8 |
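
The latitude and longitude limits in this query can be derived from the hint. The sketch below is one possible way to do so, using a hypothetical helper `bounding_box` and the hint's rough conversion (0.1 degrees is ~11 km) for both axes. This is only an approximation: at UK latitudes a degree of longitude covers less distance than a degree of latitude, so the box is narrower east to west than 20 km either side.

```python
def bounding_box(lat, lon, radius_km, km_per_degree=110.0):
    """Return (lat_min, lat_max, lon_min, lon_max) for a box around a point.

    Hypothetical helper: applies the hint's rough conversion
    (0.1 degrees is ~11 km) to both latitude and longitude.
    """
    half_width = radius_km / km_per_degree
    return lat - half_width, lat + half_width, lon - half_width, lon + half_width


# Central Bristol, taken from the hint
lat_min, lat_max, lon_min, lon_max = bounding_box(51.455795, -2.583467, radius_km=20)

# Build the query text from the computed limits, rounded to 6 decimal places
bristol_query = (
    "SELECT latitude, longitude, repd_status, capacity_repd_MWp FROM pv "
    f"WHERE latitude > {lat_min:.6f} AND latitude < {lat_max:.6f} "
    f"AND longitude > {lon_min:.6f} AND longitude < {lon_max:.6f}"
)
print(bristol_query)
```

The resulting string could then be passed to `pd.read_sql` together with the connectable, in the same way as `new_query`.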


3) For the DataFrame you have created, use pandas to filter this to only include entries where the site is listed as operational.

In [48]:
ukpv_oper = ukpvgeo_bristol[ukpvgeo_bristol["repd_status"] == "Operational"]
ukpv_oper

|  | latitude | longitude | repd_status | capacity_repd_MWp |
| --- | --- | --- | --- | --- |
| 0 | 51.545819 | -2.579605 | Operational | 8.0 |
| 1 | 51.527398 | -2.444215 | Operational | 19.8 |
| 2 | 51.527292 | -2.446618 | Operational |  |
| 3 | 51.542598 | -2.529121 | Operational | 21.0 |
| 4 | 51.537894 | -2.534388 | Operational | 15.0 |
| ... | ... | ... | ... | ... |
| 1667 | 51.533577 | -2.675853 | Operational |  |
| 1668 | 51.558873 | -2.655957 | Operational |  |
| 1669 | 51.559925 | -2.655834 | Operational |  |
| 1670 | 51.545219 | -2.579614 | Operational |  |


4) For this filtered DataFrame, find the total number of solar panels/farms.

In [49]:
ukpv_oper.shape[0]

100
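
As an alternative approach, the status filter and the count could both be done inside the SQL query. The sketch below shows the idea with the built-in `sqlite3` module on a toy in-memory table; the four rows are made up, so the count here is not the Bristol result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pv (latitude REAL, repd_status TEXT)")
# Made-up rows for illustration; None is stored as NULL (no status recorded)
conn.executemany(
    "INSERT INTO pv VALUES (?, ?)",
    [
        (51.5, "Operational"),
        (51.4, None),
        (51.3, "Awaiting Construction"),
        (51.6, "Operational"),
    ],
)

# COUNT(*) counts the rows left after the WHERE filter
operational_count = conn.execute(
    "SELECT COUNT(*) FROM pv WHERE repd_status = ?", ("Operational",)
).fetchone()[0]

print(operational_count)
conn.close()
```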

Extra: What is the approximate average capacity of these solar panels/farms?

In [59]:
# Select the capacity column first, then drop only its missing values
capacity = ukpv_oper["capacity_repd_MWp"].dropna()
capacity.mean()

6.405263157894737
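
The same kind of average can also be computed in SQL, where `AVG` skips NULL (missing) values in much the same way as dropping them in pandas. Here is a minimal sketch on a toy in-memory table with made-up capacities, so the numbers do not correspond to the result above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pv (repd_status TEXT, capacity_repd_MWp REAL)")
# Made-up rows for illustration; None is stored as NULL
conn.executemany(
    "INSERT INTO pv VALUES (?, ?)",
    [
        ("Operational", 8.0),
        ("Operational", None),
        ("Operational", 21.0),
        ("Awaiting Construction", 1.0),
    ],
)

# AVG and COUNT(column) both ignore NULL values in that column
mean_capacity, n_with_capacity = conn.execute(
    "SELECT AVG(capacity_repd_MWp), COUNT(capacity_repd_MWp) "
    "FROM pv WHERE repd_status = 'Operational'"
).fetchone()

print(mean_capacity, n_with_capacity)
conn.close()
```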

---