- Download and install Firefox.
- Install the SQLite Manager Add-on for Firefox.
- Download the dataset and associated files and unzip it in a convenient location.
Previously, we used Excel and OpenRefine to organize our data into computer-readable data. Now we will move to the next part of the data workflow, which involves reading in our data and using it for analysis and visualization.
The data we will be using is a time-series for a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots.
This is a real dataset that has been used in over 100 publications. It's been simplified just a little bit for the workshop, but you can download the full dataset and work with it using exactly the same tools we’ll learn about today.
- Understand the benefits of using a relational database
- Set up a small database from
.csv
files using SQLite - Understand SQLite data types
- Databases are similar to spreadsheets, but have key differences.
- Spreadsheet
- Directly edit cells
- Use formulas based on other cells
- Databases
- Send commands (called queries) to a database manager (we will use sqlite3).
- The database manager manipulates data for us
- manager does lookups, calculations, and queries
- Returns results in a tabular form
- Send commands (called queries) to a database manager (we will use sqlite3).
- So why go through the trouble of using a database?
- Keep data separate from analysis.
- Easy to reproduce results.
- No risk of accidentally changing data while running analysis.
- Fast (even for large amounts of data)
- Improves quality control of data entry by forcing data types
- Keep data separate from analysis.
Let's begin by looking at the data. Open the following files in a spreadsheet program: surveys.csv
, species.csv
, and plots.csv
.
Question: What files and information would I need for the following questions:
- How has the hindfoot length and weight of the animals belonging to the Dipodomys genus changed over time?
- What is the average weight of each species per year? What operations would I need to perform if I wre doing these analyses by hand?
Answer: Here are some of the basic data operations we will need:
- select subsets of the data (rows and columns)
- group subsets of data
- do math and other caculations
- combine data across spreadsheets
The strength of a database is that it will automate these tasks for us in a way that we can replicate in case our data changes later.
Let's begin by looking at a pre-made database from these .csv
files.
- Open
portal_mamals.sqlite
in the SQLite 3 manager. - Databases contain tables (see left-hand menu).
- To see the contents of a table, click on the table name and click the "Browse & Search" tab.
- On the face of it, a database simply feels like a collection of tables.
- A database also allows us to create relationship between different entries in different tables.
- Rows are refered to as records.
- Each record has a field (columns).
- Each record has a unique identifier, called a primary key.
- Records may have keys that refer to data in other tables.
- This allows us to create relationship between different keys in different tables.
- The "Structure" tab tells us information about the structure of the table itself.
- We see the Structured Query Language statement that would create the original table under "Create statement."
- We have information about the columns of the table under the "Columns" section.
- The "Execute SQL" tab is where we can input SQL commands and queries.
- We can change general database settings in the "DB Settings" tab.
-
Every record's fields should have atomic values (can't be broken down any further)
- See month, day, and year in the surveys table
- In general it is easier to put fields together than to split them apart.
-
There should be one field per type of information.
-
Split data into separate tables with each table corresponding to a different class of information.
-
In order to relate information between tables, there should be a shared column (surveys and species share species_id, plots and surveys share plot_id).
-
Recall that each record has a primary key which serves as its "address." (plot_id in plots)
-
Fields that refer to primary keys in other tables are called foreign keys. (plot_id in surveys)
Let's see how to recreate this database from scratch using the original .csv
files.
- Database -> New Database
- Database -> Import
- Select the
surveys.csv
file to import. - Check "First row contains column names".
- Make sure that "Ignore Trailing Separator/Delimiter" is unchecked.
- When asked if you want to modify the table click OK.
- Change the data types for each column according the table in the handout.
- Repeat this process to create tables for
plots.csv
andsurveys.csv
.
Notice that I can add new records by going to the "Browse & Search" tab and using the "Add" button at the upper-right-hand corner.
- Write and build queries
- Filter data given various criteria
- Sort the results of a query
Let's begin by using the surveys table. Click on the "Execute SQL" tab.
SELECT year FROM surveys;
The result is all the entries from the year field in the surveys table.
Every SQL command must end with ;
. SQL will not consider the command completed until it finds ;
. That means we can write the same command as follows:
SELECT year
FROM surveys;
Therefore, as our queries become more complex, we can spread them accross several lines to make them more readable.
It is also good practice to capitalize the key words in a SQL command so that it is easier to read.
We can get more information about the date by adding more columns to our query.
SELECT year, month, day
FROM surveys;
Finally, I can use the wildcard *
to select all the columns from a table.
SELECT *
FROM surveys;
We can specify only unique values by using the UNIQUE
keyword.
SELECT species_id
FROM surveys;
SELECT DISTINCT species_id
FROM surveys;
If we select more than one column, then the distinct pairs are returned.
SELECT DISTINCT year, species_id
FROM surveys;
Question: Find the distinct years in which measurements were taken for the surveys. Answer:
SELECT DISTINCT year FROM surveys;
We can also do calculations with the values in a query.
SELECT year, month, day, weight/1000
FROM surveys;
We now have weight in kilograms instead of grams.
We can select data based on certain criteria. For example, if we only want data for the species Dipodomys merriami, we can use the WHERE
key word.
SELECT *
FROM surveys
WHERE species_id='DM';
We can set conditions based on numeric data that meets a certain condition.
SELECT * FROM surveys
WHERE year >= 2000;
Notice, we have all the columns for the years both including and after 2000.
SQL also allows us to filter criteria using the key words AND
and OR
.
SELECT *
FROM surveys
WHERE (year >= 2000) AND (species_id = 'DM');
This query gives us all the columns from the surveys
table where the year is 2000 or later and the species is Dipodomys merriami.
If we want a query that returns information for any of the Dipodomys genus, we can use all three species codes in the following query.
SELECT *
FROM surveys
WHERE (species_id = 'DM') OR (species_id = 'DO') OR (species_id = 'DS');
Question: Write a query that returns the day, month, and species_id for individuals caught on Plot 1 that weigh more than 75 grams.
Answer:
SELECT day, month, species_id FROM surveys WHERE (plot_id=1) AND (weight>75);
The above query asking for three different species is somewhat cumbersome. We can simplify it if we have a list of criteria by using the IN
statement.
Before:
SELECT *
FROM surveys
WHERE (species_id = 'DM') OR (species_id = 'DO') OR (species_id = 'DS');
After:
SELECT *
FROM surveys
WHERE (year >= 2000) AND (species_id IN ('DM', 'DO', 'DS'));
When writing more complex queries, it is helpful to start simple and slowly add more elements to the query making sure that each part works.
It can also helpful to build your queries using a smaller subset of the data, making sure everything works before using your code on the full database.
Finally, making human-readable comments is also helpful.
-- Get post 2000 data on Dipodomys' species
-- These are in the surveys table, and we are interested in all columns
SELECT * FROM surveys
-- Sampling year is in the column `year`, and we want to include 2000
WHERE (year >= 2000)
-- Dipodomys' species have the `species_id` DM, DO, and DS
AND (species_id IN ('DM', 'DO', 'DS'));
We can sort our results by using the ORDER BY
key words. Let's look at our species table.
SELECT *
FROM species;
We can order our results by taxa.
SELECT *
FROM species
ORDER BY taxa ASC;
Notice the keyword ASC
which is short for ascending. We can also display results in descending order.
SELECT *
FROM species
ORDER BY taxa DESC;
We could also order by genus and then species.
SELECT *
FROM species
ORDER BY genus ASC, species ASC;
Question: Write a query that returns year, species_id, and weight in kilograms from the surveys table, sorted with the largest weights at the top.
Answer:
SELECT year, species_id, weight/1000 FROM surveys ORDER BY weight DESC;
We can even sort by a column that does not appear in the final results.
SELECT genus, species
FROM species
WHERE taxa = 'Bird'
ORDER BY species_id ASC;
The WHERE
and ORDER BY
statements are executed before the SELECT
statement, allowing us to sort by fields that don't even appear in the final results.
- Apply aggregation to group records in SQL
- Filter and order results of a query based on aggregate functions
- Save a query to make new tables
- Apply filters to find missing values in SQL
Aggregate functions combine results from multiple values to produce a single piece of data.
Some of the aggregate functions SQL has available:
COUNT
- counts a set of recordsSUM
- sumMAX
- maximumMIN
- minimumAVG
- average
Let's count the number of records in surveys
SELECT COUNT(*)
FROM surveys;
By using *
, we are telling the COUNT()
function to count the total number of records.
Let's find out how much all of those individuals weight.
SELECT COUNT(*), SUM(weight)
FROM surveys;
Question: Write a query that returns: total weight, average weight, and the min and max weights for all animals caught over the duration of the survey. Answer:
SELECT SUM(weight), AVG(weight), MIN(weight), MAX(weight) FROM surveys;
Question: Can you modify it so that it outputs these values only for weights between 5 and 10? Answer:
SELECT SUM(weight), AVG(weight), MIN(weight), MAX(weight)
FROM surveys WHERE (weight>=5) AND (weight<=10);
We can group our data by a specific category before executing an aggregate command. For example, if we want to count the number of individuals captured by the species_id, we could use the GROUP BY
command.
SELECT species_id, COUNT(*)
FROM surveys
GROUP BY species_id;
We can also sort results by using the keyword HAVING
in conjunction with an aggregate command.
SELECT species_id, COUNT(species_id)
FROM surveys
GROUP BY species_id
HAVING COUNT(species_id) > 10;
HAVING
works exactly the same as WHERE
, but instead of using fields, we use aggregate functions.
Question: Write a query that returns the number of genus in each taxa from the species table. Return only the taxa with more than 10 genus.
Answer:
SELECT taxa, COUNT(genus) FROM species GROUP BY taxa HAVING COUNT(genus) > 10;
We can save a query for future use by using the CREATE VIEW
command. If we wanted to look at data from May to September during the year 2000, we could use the following query.
SELECT *
FROM surveys
WHERE year = 2000 AND (month > 4 AND month < 10);
We can save this query for later as a view called summer_2000.
CREATE VIEW summer_2000 AS
SELECT *
FROM surveys
WHERE year = 2000 AND (month > 4 AND month < 10);
We can now directly reference the view as if it were its own table.
SELECT *
FROM summer_2000;
Values that are NULL
are SQL's way of representing an unknown value. The default in the SQLite database manager is to highlight these cells in pink. Let's examine how the database manager deals with NULL
values by counting the number of male and female individuals.
SELECT COUNT(*)
FROM summer_2000
WHERE sex != 'F';
We see there are 375 male individuals.
SELECT COUNT(*)
FROM summer_2000
WHERE sex != 'M';
This query tells us there are 360 female individuals. Counting the total, we should have 735 individuals total.
Let's write a query to check ourselves.
SELECT COUNT(*)
FROM summer_2000;
Why do we end up with a count of 781 individuals total. This is because there were 51 individuals who's sex was not recorded and so these values were NULL
in the database. Since NULL
is the database's version of "I don't know," it could not be determined whether or not these records matched the search criteria and so they were not included in the results.
Things to keep in mind about NULL
values:
- Any basic math operation in which one of the records is
NULL
will always return an answer ofNULL
. - Aggregate functions ignore
NULL
values.
- Employ joins to combine data from two tables
- Apply functions to manipulate individual values
- Employ aliases to assign new names to items in a query
The JOIN
command allows us to combine information from two different tables. By default the JOIN
command produces a cross product.
SELECT *
FROM plots
JOIN species;
This is typically not very useful, and so we need to specify how we want the information joined using ON
.
SELECT *
FROM surveys
JOIN species
ON surveys.species_id = species.species_id;
Now we have joined information from the surveys tables and the species table so that records that have the same species_id are matched together. Notice that we can use "dot" notation to specify a field within a specific table.
All the fields from the first table are listed, and then all the fields from the second table are listed. This means that species_id gets listed twice. To avoid this, we can use the USING
command.
SELECT *
FROM surveys
JOIN species
USING (species_id);
We might not want all the fields from both backgrounds. For example, we might want information on when an individual was captured, but instead of the species_id, we want the actual genus and species.
SELECT surveys.year, surveys.month, surveys.day, species.genus, species.species
FROM surveys
JOIN species
ON surveys.species_id = species.species_id;
Let's put all this together. Suppose we want the average mass for individuals on each different kind of plot.
SELECT plots.plot_type, AVG(surveys.weight)
FROM surveys
JOIN plots
ON surveys.plot_id = plots.plot_id
GROUP BY plots.plot_type;
SQL has many functions that allow us to manipulate data that we query. We've already seen several aggregate functions like SUM
and COUNT
. Another useful function is IFNULL
.
SELECT species_id, sex, IFNULL(sex, 'U')
FROM surveys;
Question: Write a query that returns 30 instead of
NULL
for values in thehindfoot_length
column. Answer:SELECT IFNULL(hindfoot_length, 30) FROM surveys;
The opposite of IFNULL
is NULLIF
. This function returns NULL
if the first argument is equal to the second argument.
We can "null out" plot 7:
SELECT species_id, plot_id, NULLIF(plot_id, 7)
FROM surveys;
Aliases are a nice tool for making the labels of our columns easier to read. To change the label of a column, we use the AS
keyword.
Recall our earlier query where we changed all the unknown sexes to 'U'.
SELECT species_id, sex, IFNULL(sex, 'U')
FROM surveys;
Notice the third column is difficult to read. We can change our query as follows to relabel this column.
SELECT species_id, sex, IFNULL(sex, 'U') AS non_null_sex
FROM surveys;
Question: The purpose of SQL is to be able to ask specific questions about our data. With a partner, translate the following questions into a SQL query.
- How many of each plot type is there?
- What is the average weight of each taxa?
- How many specimens of each sex are there for each year?
How many of each plot type is there?
SELECT plot_type, COUNT(*) AS num_plots FROM plots GROUP BY plot_type ORDER BY num_plots DESC;
What is the average weight of each taxa?
SELECT species.taxa AS taxa, AVG(surveys.weight) AS average_weight FROM surveys JOIN species ON species.species_id = surveys.species_id GROUP BY taxa;
How many specimens of each sex are there for each year?
SELECT year, IFNULL(sex, "unknown") AS sex, COUNT(*) AS num_animals FROM surveys GROUP BY year, sex;