Structured Query Language, or SQL, is more than forty years old, and it is one of the most popular technologies used by data professionals, including data analysts, data scientists, and data engineers. Understanding the fundamentals of a more general-purpose language like Python or R is critical for working with data, but knowing SQL helps data professionals do more with their data. And if working with R or Python is one of your goals, SQL can help gather insights from data.

Here are a few key reasons why learning SQL will help anybody interested in working with data.

**SQL is everywhere.**

Almost all of the biggest names in tech use SQL (pronounced either “sequel” or “S.Q.L.”). Companies like Facebook, Google, and Amazon have built their own high-performance database systems, but even their data teams use SQL to query data and perform data analysis. And it’s not just tech companies: companies big and small around the world use SQL.

**SQL enables us to pull data from many sources.**

In many real-world situations, data is distributed across many sources. SQL allows us to select specific data and transform it to fit our needs. For example, working with spreadsheets can be difficult if the data we need to answer our question is distributed across many files. SQL allows us to structure our data in a way that makes it accessible from one place.

![sql.png](attachment:sql.png)

SQL data is structured into multiple, connected tables.

**SQL is here to stay.**

The Stack Overflow annual Developer Survey, which is the largest and most comprehensive survey of programmers around the world, consistently reveals that SQL is one of the most popular technologies used today.


So, let's learn about the language itself and how you can use it to query data.


### Introduction to databases

A database structures data much like a spreadsheet: it organizes data into tables, which are composed of rows and columns.
 
 A database can store much more data more securely than a spreadsheet or a text file. Unlike simply opening a spreadsheet, we actually have to "ask" for data from the database.

We primarily interact with a database through a database management system (DBMS), a computer program that takes our instructions and retrieves or modifies the stored data for us.

We'll begin learning SQL with the DBMS SQLite. SQLite is a lightweight DBMS, and it is often cited as the most widely deployed database engine in the world.
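Python's standard library includes a `sqlite3` module, so the idea of "asking" a SQLite database for data can be sketched without installing anything. The in-memory database and the two-row `majors` table below are hypothetical stand-ins, assumed only so the snippet runs on its own; they are not part of the course files.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM.
# Passing a file path instead would open (or create) a database file.
conn = sqlite3.connect(":memory:")

# Hypothetical stand-in table, just so there is something to query.
conn.execute("CREATE TABLE majors (Major TEXT, Median INTEGER)")
conn.executemany(
    "INSERT INTO majors VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", 110000),
        ("MINING AND MINERAL ENGINEERING", 75000),
    ],
)

# Unlike opening a spreadsheet, we send the DBMS an instruction
# and it hands back the rows we asked for.
rows = conn.execute("SELECT Major, Median FROM majors").fetchall()
print(rows)

conn.close()
```

The rest of this notebook sends its queries through the `%%sql` cell magic instead, which keeps the SQL free of Python boilerplate.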

## Set up

In [1]:
# !pip install ipython-sql

In [2]:
# !pip install SQLAlchemy
# https://ifeadewumi.medium.com/how-to-run-sql-queries-from-jupyter-a1bf2d040c83

In [3]:
%%capture
%load_ext sql
%sql sqlite:///jobs.db

## The first query

In this course, we'll explore data from the American Community Survey on job outcome statistics based on college majors that we loaded into a SQLite database.

Here are its first three rows:

![rows.png](attachment:rows.png)

In this table, each row represents a major, and each column gives us some information about that major. Head to the dataset page to become familiar with what each column represents.

We provide a database, jobs.db, loaded with this data into a single table named recent_grads. (In the next course, we'll learn how to work with a database containing multiple tables.)

![table.png](attachment:table.png)

In this screen's exercise, we'll ask you to submit the SQL instruction (usually called a query) below. This query selects all columns from the recent_grads table; the `LIMIT 5` on its last line restricts the output to the first five rows.

In [19]:
%%sql

SELECT *
  FROM recent_grads
  LIMIT 5;

 * sqlite:///jobs.db
Done.


index,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564344,1976,1849,270,1207,37,0.018380527,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.1018518519999999,640,556,170,388,85,0.117241379,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037383,648,558,133,340,16,0.024096386,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,16,1123,135,0.107313196,758,1069,150,692,40,0.050125313,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,Engineering,32260,289,21239,11021,0.341630502,25694,23170,5180,16697,1672,0.061097712,65000,50000,75000,18314,4440,972


The order of the words in this query and the whitespace separating SELECT, *, FROM, and recent_grads are crucial features of SQL syntax. If we don't follow the syntax, the database will return an error instead of the information we want.

The ; character signals the end of the query. It isn't mandatory when we run a single query, but it is needed to separate one query from the next.

Here's a visual breakdown of the different components of the query:

![component.png](attachment:component.png)

You may have noticed that SELECT and FROM use uppercase letters. This isn't required, but it makes your code easier to read.

A couple of other elements that aren't required are the line change and indentation right before FROM. The reason why we changed lines and indented this query is the same as above: stylistic conventions.
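The points above about keyword case, line breaks, and the optional semicolon can be checked with a short sketch. It assumes a two-row in-memory table (values copied from the output above) in place of jobs.db, and uses Python's `sqlite3` module rather than the `%%sql` magic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Median INTEGER)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("PETROLEUM ENGINEERING", 110000), ("METALLURGICAL ENGINEERING", 73000)],
)

# Same query written two ways: styled, and lowercase on one line
# with no semicolon.
styled = "SELECT *\n  FROM recent_grads;"
compact = "select * from recent_grads"

styled_rows = conn.execute(styled).fetchall()
compact_rows = conn.execute(compact).fetchall()
print(styled_rows == compact_rows)

# Changing the word order is different: SQLite rejects the query.
try:
    conn.execute("FROM recent_grads SELECT *")
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)
print(error)

conn.close()
```

Style choices leave the result unchanged, while a query that breaks the required word order fails with a syntax error.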

### The LIMIT Clause

The LIMIT clause restricts the number of rows a query returns. Think of a clause as an optional part of a query, introduced by a reserved word, that doesn't need to be in the code for the query to execute successfully.

Here's how we can use it to retrieve the first three rows (that we saw in a previous screen).

In [5]:
%%sql

SELECT *
  FROM recent_grads
  LIMIT 3

 * sqlite:///jobs.db
Done.


index,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564344,1976,1849,270,1207,37,0.018380527,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.1018518519999999,640,556,170,388,85,0.117241379,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037383,648,558,133,340,16,0.024096386,73000,50000,105000,456,176,0
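As a rough sketch of how LIMIT behaves, the snippet below runs the same query with and without it. It assumes a five-row in-memory table (values copied from the first output in this notebook) rather than the real jobs.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Median INTEGER)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", 110000),
        ("MINING AND MINERAL ENGINEERING", 75000),
        ("METALLURGICAL ENGINEERING", 73000),
        ("NAVAL ARCHITECTURE AND MARINE ENGINEERING", 70000),
        ("CHEMICAL ENGINEERING", 65000),
    ],
)

all_rows = conn.execute("SELECT * FROM recent_grads").fetchall()
first_three = conn.execute("SELECT * FROM recent_grads LIMIT 3").fetchall()
# Asking for more rows than exist is not an error; we get what is there.
generous = conn.execute("SELECT * FROM recent_grads LIMIT 100").fetchall()

print(len(all_rows), len(first_three), len(generous))  # 5 3 5

conn.close()
```

LIMIT is an upper bound on the row count, which makes it a cheap way to peek at a large table.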


## Selecting specific columns

Often, we'll only want to look at data from specific columns. To return only the Major column, we need to add the specific column name in the SELECT statement part of the query (instead of using the * character to return all columns):

In [6]:
%%sql

SELECT Major
  FROM recent_grads
  LIMIT 5;

 * sqlite:///jobs.db
Done.


Major
PETROLEUM ENGINEERING
MINING AND MINERAL ENGINEERING
METALLURGICAL ENGINEERING
NAVAL ARCHITECTURE AND MARINE ENGINEERING
CHEMICAL ENGINEERING


We can specify multiple columns this way, separated by commas, and the results table will list the columns in the order we name them:

In [7]:
%%sql

SELECT Major, Major_category
  FROM recent_grads
  LIMIT 5;

 * sqlite:///jobs.db
Done.


Major,Major_category
PETROLEUM ENGINEERING,Engineering
MINING AND MINERAL ENGINEERING,Engineering
METALLURGICAL ENGINEERING,Engineering
NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
CHEMICAL ENGINEERING,Engineering
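To see that the result follows the order in the SELECT list rather than the order in the table, here is a small sketch. It assumes a one-row in-memory table in place of jobs.db and reads the result's column names from the cursor's `description` attribute in Python's `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The table stores Major first and Major_category second.
conn.execute("CREATE TABLE recent_grads (Major TEXT, Major_category TEXT)")
conn.execute(
    "INSERT INTO recent_grads VALUES ('PETROLEUM ENGINEERING', 'Engineering')"
)

def result_columns(query):
    """Return the column names of a query's result, left to right."""
    cursor = conn.execute(query)
    return [column[0] for column in cursor.description]

table_order = result_columns("SELECT * FROM recent_grads")
swapped = result_columns("SELECT Major_category, Major FROM recent_grads")

print(table_order)  # ['Major', 'Major_category']
print(swapped)      # ['Major_category', 'Major']

conn.close()
```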


## Filtering rows using WHERE

Next, let's use SQL to answer a more specific question: which majors had students who were mostly women?

The SQL workflow translates the question we want to answer to the subset of data we want from the database. To determine which majors had mostly students who were women, we want the following subset:

- Only the Major column
- Only the rows where ShareWomen is greater than 0.5 (corresponding to 50%)

To filter rows by specific criteria, we can use the **WHERE clause**. A WHERE clause commonly uses three things:

- The column we want the database to filter on: `ShareWomen`
- A comparison operator that specifies how we want to compare a value in a column: `>=`
- The value against which we want the database to compare each value: `0.5`

Here are the comparison operators we can use:

- Less than: `<`
- Less than or equal to: `<=`
- Greater than: `>`
- Greater than or equal to: `>=`
- Equal to: `=`
- Not equal to: `!=` or `<>`
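The operators in the list above can be compared side by side with a quick sketch. It assumes a five-row in-memory table holding the Median values from the first output in this notebook, and an arbitrary threshold of 73000 chosen so that one row matches it exactly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Median INTEGER)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", 110000),
        ("MINING AND MINERAL ENGINEERING", 75000),
        ("METALLURGICAL ENGINEERING", 73000),
        ("NAVAL ARCHITECTURE AND MARINE ENGINEERING", 70000),
        ("CHEMICAL ENGINEERING", 65000),
    ],
)

# Count how many rows each operator keeps when compared with 73000.
counts = {}
for op in ["<", "<=", ">", ">=", "=", "!=", "<>"]:
    query = f"SELECT Major FROM recent_grads WHERE Median {op} 73000"
    counts[op] = len(conn.execute(query).fetchall())

for op, count in counts.items():
    print(f"Median {op} 73000 keeps {count} row(s)")

conn.close()
```

The row with a median of exactly 73000 is what separates `<` from `<=` and `>` from `>=`, and `!=` and `<>` always agree.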

To return only the rows where ShareWomen is greater than or equal to 0.5, we can use the following WHERE clause. Because it uses `>=` rather than `>`, a major with exactly 50% women would also be returned:

In [8]:
%%sql

SELECT Major 
  FROM recent_grads
 WHERE ShareWomen >= 0.5
 LIMIT 5;

 * sqlite:///jobs.db
Done.


Major
ACTUARIAL SCIENCE
COMPUTER SCIENCE
ENVIRONMENTAL ENGINEERING
NURSING
INDUSTRIAL PRODUCTION TECHNOLOGIES


Here's a breakdown of the different components:

![where.png](attachment:where.png)

We express the specific column we want in the SELECT part of the query and the specific rows we want in the WHERE part. Note that most database systems require that the SELECT and FROM clauses come first, before WHERE or any other clauses.

*Write a SQL query that returns the majors in the recent_grads table where students who were men outnumbered students who were women.*

In [9]:
%%sql

SELECT Major, ShareWomen
  FROM recent_grads
  WHERE ShareWomen < 0.5
  LIMIT 5;

 * sqlite:///jobs.db
Done.


Major,ShareWomen
PETROLEUM ENGINEERING,0.120564344
MINING AND MINERAL ENGINEERING,0.1018518519999999
METALLURGICAL ENGINEERING,0.153037383
NAVAL ARCHITECTURE AND MARINE ENGINEERING,0.107313196
CHEMICAL ENGINEERING,0.341630502


## Expressing multiple filter criteria using 'AND'

In the previous exercise, we wrote a query that used the condition `ShareWomen < 0.5` to return majors where students who were men outnumbered students who were women.

The comparison value after the `<` operator must be either text or a number, depending on the field. Because ShareWomen is a numeric column, we just write the number 0.5.

For text values, we need to enclose the value in quotes. For example, if we wanted to select only the rows where the Major_category equaled Engineering, we would write the following:

In [10]:
%%sql

SELECT Major
  FROM recent_grads
  WHERE Major_category = 'Engineering'
  LIMIT 5;

 * sqlite:///jobs.db
Done.


Major
PETROLEUM ENGINEERING
MINING AND MINERAL ENGINEERING
METALLURGICAL ENGINEERING
NAVAL ARCHITECTURE AND MARINE ENGINEERING
CHEMICAL ENGINEERING
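A short sketch of why the quotes matter, assuming a three-row in-memory table (values copied from outputs in this notebook) in place of jobs.db. The helper `majors_where` is a hypothetical convenience function, not part of any library. Note that in SQLite, `=` on text is case-sensitive by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Major_category TEXT)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", "Engineering"),
        ("ACTUARIAL SCIENCE", "Business"),
        ("COMPUTER SCIENCE", "Computers & Mathematics"),
    ],
)

def majors_where(condition):
    """Return the Major values for rows matching a WHERE condition."""
    query = f"SELECT Major FROM recent_grads WHERE {condition}"
    return [row[0] for row in conn.execute(query)]

quoted = majors_where("Major_category = 'Engineering'")
wrong_case = majors_where("Major_category = 'engineering'")

# Without quotes, SQLite reads Engineering as a column name.
try:
    majors_where("Major_category = Engineering")
    unquoted_error = None
except sqlite3.OperationalError as exc:
    unquoted_error = str(exc)

print(quoted)          # ['PETROLEUM ENGINEERING']
print(wrong_case)      # []
print(unquoted_error)

conn.close()
```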


We can also use the `AND` operator to combine multiple filter criteria. For example, to determine which engineering majors had a majority of female students, we specify two filtering criteria:

In [11]:
%%sql

SELECT Major
  FROM recent_grads
  WHERE Major_category = 'Engineering' AND ShareWomen > 0.5
  LIMIT 5;

 * sqlite:///jobs.db
Done.


Major
ENVIRONMENTAL ENGINEERING
INDUSTRIAL PRODUCTION TECHNOLOGIES


*Write a SQL query that returns all majors that had a majority of female students and a median salary greater than `50000`.*

In [12]:
%%sql

SELECT Major, Major_category, Median, ShareWomen
  FROM recent_grads
  WHERE ShareWomen > 0.5 AND Median > 50000
  LIMIT 5;

 * sqlite:///jobs.db
Done.


Major,Major_category,Median,ShareWomen
ACTUARIAL SCIENCE,Business,62000,0.535714286
COMPUTER SCIENCE,Computers & Mathematics,53000,0.578766338


## Returning one of several conditions with OR

We used the AND operator to specify that our filter needs to pass two Boolean conditions. Both of the conditions had to evaluate to True for the record to appear in the result set. If we wanted to specify a filter that meets either of the conditions instead, we would use the OR operator.

```
SELECT [column1, column2,...] 
  FROM [table1]
 WHERE [condition1] 
    OR [condition2];
```
    
We won't go into more detail regarding OR because we use the OR and AND operators in similar ways.
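To make the difference between AND and OR concrete, here is a minimal sketch that applies the same two conditions with each operator. It assumes a four-row in-memory table (values copied from outputs in this notebook); the two conditions are arbitrary and chosen only so that the results differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Median INTEGER, Women INTEGER)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("PETROLEUM ENGINEERING", 110000, 282),
        ("CHEMICAL ENGINEERING", 65000, 11021),
        ("ACTUARIAL SCIENCE", 62000, 960),
        ("ASTRONOMY AND ASTROPHYSICS", 62000, 1667),
    ],
)

def majors_where(condition):
    """Return the set of Major values matching a WHERE condition."""
    query = f"SELECT Major FROM recent_grads WHERE {condition}"
    return {row[0] for row in conn.execute(query)}

# AND keeps rows that pass both conditions; OR keeps rows that pass either.
both = majors_where("Median >= 65000 AND Women > 1000")
either = majors_where("Median >= 65000 OR Women > 1000")

print(sorted(both))
print(sorted(either))

conn.close()
```

Every row returned by the AND query also appears in the OR query, but not the other way around.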

Another important feature is that we aren't limited to comparing a column with a value; we can also compare columns to other columns.

For example, we've been using the condition WHERE ShareWomen > 0.5. We can obtain an equivalent condition by using WHERE Men < Women.
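The equivalence described above can be checked on a few rows. This sketch assumes a three-row in-memory table whose Men, Women, and ShareWomen values are copied from outputs in this notebook; it is not a proof for the whole dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Men INTEGER, Women INTEGER, ShareWomen REAL)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?, ?)",
    [
        ("PETROLEUM ENGINEERING", 2057, 282, 0.120564344),
        ("CHEMICAL ENGINEERING", 21239, 11021, 0.341630502),
        ("ACTUARIAL SCIENCE", 832, 960, 0.535714286),
    ],
)

# Column compared with a value.
by_share = conn.execute(
    "SELECT Major FROM recent_grads WHERE ShareWomen > 0.5"
).fetchall()

# Column compared with another column.
by_counts = conn.execute(
    "SELECT Major FROM recent_grads WHERE Men < Women"
).fetchall()

print(by_share)   # [('ACTUARIAL SCIENCE',)]
print(by_counts)  # [('ACTUARIAL SCIENCE',)]

conn.close()
```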


**Exercise**

*Write a SQL query that returns the first 20 majors that either:*
- *Have a Median salary greater than or equal to 10,000, or*
- *Have more men than women*

In [13]:
%%sql

SELECT Major, Median, Men, Women
  FROM recent_grads
  WHERE Median >= 10000 OR Men > Women
  LIMIT 20;

 * sqlite:///jobs.db
Done.


Major,Median,Men,Women
PETROLEUM ENGINEERING,110000,2057,282
MINING AND MINERAL ENGINEERING,75000,679,77
METALLURGICAL ENGINEERING,73000,725,131
NAVAL ARCHITECTURE AND MARINE ENGINEERING,70000,1123,135
CHEMICAL ENGINEERING,65000,21239,11021
NUCLEAR ENGINEERING,65000,2200,373
ACTUARIAL SCIENCE,62000,832,960
ASTRONOMY AND ASTROPHYSICS,62000,2110,1667
MECHANICAL ENGINEERING,60000,12953,2105
ELECTRICAL ENGINEERING,60000,8407,6548


## Grouping operators with parentheses

There's a certain class of questions that we can't answer using only the techniques we learned so far. For example, if we wanted to write a query that returned all Engineering majors that either had mostly female graduates or an unemployment rate below 5.1%, we would need to use parentheses to express this more complex logic.

The three raw conditions we'll need are the following:

```
Major_category = 'Engineering'
ShareWomen > 0.5
Unemployment_rate < 0.051
``` 

Here's what the SQL query looks like using parentheses:

In [14]:
%%sql

SELECT Major, Major_category, ShareWomen, Unemployment_rate
  FROM recent_grads
 WHERE (Major_category = 'Engineering') 
   AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051);

 * sqlite:///jobs.db
Done.


Major,Major_category,ShareWomen,Unemployment_rate
PETROLEUM ENGINEERING,Engineering,0.120564344,0.018380527
METALLURGICAL ENGINEERING,Engineering,0.153037383,0.024096386
NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,0.107313196,0.050125313
MATERIALS SCIENCE,Engineering,0.310820285,0.023042836
ENGINEERING MECHANICS PHYSICS AND SCIENCE,Engineering,0.183985189,0.006334343
INDUSTRIAL AND MANUFACTURING ENGINEERING,Engineering,0.3434732179999999,0.042875544
MATERIALS ENGINEERING AND MATERIALS SCIENCE,Engineering,0.292607004,0.027788805
ENVIRONMENTAL ENGINEERING,Engineering,0.558548009,0.093588575
INDUSTRIAL PRODUCTION TECHNOLOGIES,Engineering,0.75047259,0.028308097
ENGINEERING AND INDUSTRIAL MANAGEMENT,Engineering,0.174122505,0.03365166


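You may notice that we have enclosed the logic we want to evaluate together in parentheses. This is very similar to how we group mathematical calculations together in a particular order. Without the parentheses around the OR condition, SQL would evaluate AND before OR, and the query would also return non-Engineering majors with an unemployment rate below 5.1%.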
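The effect of the parentheses can be seen by running the same conditions with and without them. This sketch assumes a four-row in-memory table built from values shown in this notebook; the ShareWomen value for ASTRONOMY AND ASTROPHYSICS is not shown anywhere above, so it is computed here from that major's Men and Women counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads "
    "(Major TEXT, Major_category TEXT, ShareWomen REAL, Unemployment_rate REAL)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?, ?)",
    [
        ("PETROLEUM ENGINEERING", "Engineering", 0.120564344, 0.018380527),
        ("MINING AND MINERAL ENGINEERING", "Engineering", 0.1018518519999999, 0.117241379),
        ("ENVIRONMENTAL ENGINEERING", "Engineering", 0.558548009, 0.093588575),
        # ShareWomen derived as Women / (Men + Women) = 1667 / (2110 + 1667).
        ("ASTRONOMY AND ASTROPHYSICS", "Physical Sciences", 1667 / (2110 + 1667), 0.021167415),
    ],
)

def majors_where(condition):
    """Return the set of Major values matching a WHERE condition."""
    query = f"SELECT Major FROM recent_grads WHERE {condition}"
    return {row[0] for row in conn.execute(query)}

grouped = majors_where(
    "(Major_category = 'Engineering') "
    "AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)"
)
# Without parentheses, AND binds tighter than OR, so this reads as
# (Engineering AND ShareWomen > 0.5) OR Unemployment_rate < 0.051.
ungrouped = majors_where(
    "Major_category = 'Engineering' "
    "AND ShareWomen > 0.5 OR Unemployment_rate < 0.051"
)

print(sorted(grouped))
print(sorted(ungrouped))
print(sorted(ungrouped - grouped))  # rows that slip in without parentheses

conn.close()
```

In this sample, the ungrouped version lets a Physical Sciences major through because its unemployment rate alone satisfies the OR branch.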

## Ordering results using ORDER BY


As the questions we want to answer get more complex, we want more control over the ordering of the results. We can specify the order using the ORDER BY clause, which sorts in ascending order by default. For example, we may want to understand which majors that met the criteria in the WHERE clause had the lowest unemployment rate:

In [15]:
%%sql

SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
  FROM recent_grads
  WHERE (Major_category = 'Engineering') 
         AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
  ORDER BY Unemployment_rate
  LIMIT 10;

 * sqlite:///jobs.db
Done.


Rank,Major,Major_category,ShareWomen,Unemployment_rate
15,ENGINEERING MECHANICS PHYSICS AND SCIENCE,Engineering,0.183985189,0.006334343
1,PETROLEUM ENGINEERING,Engineering,0.120564344,0.018380527
14,MATERIALS SCIENCE,Engineering,0.310820285,0.023042836
3,METALLURGICAL ENGINEERING,Engineering,0.153037383,0.024096386
24,MATERIALS ENGINEERING AND MATERIALS SCIENCE,Engineering,0.292607004,0.027788805
39,INDUSTRIAL PRODUCTION TECHNOLOGIES,Engineering,0.75047259,0.028308097
51,ENGINEERING AND INDUSTRIAL MANAGEMENT,Engineering,0.174122505,0.03365166
17,INDUSTRIAL AND MANUFACTURING ENGINEERING,Engineering,0.3434732179999999,0.042875544
4,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,0.107313196,0.050125313
31,ENVIRONMENTAL ENGINEERING,Engineering,0.558548009,0.093588575


If we instead want the results ordered by the same column but in descending order, we can add the DESC keyword:

In [16]:
%%sql

SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
    FROM recent_grads
   WHERE (Major_category = 'Engineering') 
     AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
   ORDER BY Unemployment_rate DESC
   LIMIT 5;

 * sqlite:///jobs.db
Done.


Rank,Major,Major_category,ShareWomen,Unemployment_rate
31,ENVIRONMENTAL ENGINEERING,Engineering,0.558548009,0.093588575
4,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,0.107313196,0.050125313
17,INDUSTRIAL AND MANUFACTURING ENGINEERING,Engineering,0.3434732179999999,0.042875544
51,ENGINEERING AND INDUSTRIAL MANAGEMENT,Engineering,0.174122505,0.03365166
39,INDUSTRIAL PRODUCTION TECHNOLOGIES,Engineering,0.75047259,0.028308097
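A compact sketch of the two sort directions, assuming a four-row in-memory table (values copied from the outputs above) that is deliberately inserted out of order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Unemployment_rate REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", 0.018380527),
        ("ENVIRONMENTAL ENGINEERING", 0.093588575),
        ("ENGINEERING MECHANICS PHYSICS AND SCIENCE", 0.006334343),
        ("NAVAL ARCHITECTURE AND MARINE ENGINEERING", 0.050125313),
    ],
)

def majors(query):
    """Return the Major values of a query's result, in result order."""
    return [row[0] for row in conn.execute(query)]

ascending = majors("SELECT Major FROM recent_grads ORDER BY Unemployment_rate")
descending = majors("SELECT Major FROM recent_grads ORDER BY Unemployment_rate DESC")
# ORDER BY runs before LIMIT, so this returns the single highest rate.
highest = majors(
    "SELECT Major FROM recent_grads ORDER BY Unemployment_rate DESC LIMIT 1"
)

print(ascending)
print(descending)
print(highest)  # ['ENVIRONMENTAL ENGINEERING']

conn.close()
```

Because the rates here are all distinct, the descending result is exactly the ascending result reversed.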


*Write a query that returns all majors that meet the following criteria:*

- *ShareWomen is greater than 0.3*
- *And Unemployment_rate is less than .1*
- *Order the results in descending order by the ShareWomen column.*

In [17]:
%%sql

SELECT Major, ShareWomen, Unemployment_rate
  FROM recent_grads
  WHERE ShareWomen > 0.3 AND Unemployment_rate < 0.1
  ORDER BY ShareWomen DESC
  LIMIT 5;

 * sqlite:///jobs.db
Done.


Major,ShareWomen,Unemployment_rate
EARLY CHILDHOOD EDUCATION,0.967998119,0.040104981
MATHEMATICS AND COMPUTER SCIENCE,0.927807246,0.0
ELEMENTARY EDUCATION,0.923745479,0.046585715
ANIMAL SCIENCES,0.91093257,0.050862499
PHYSIOLOGY,0.906677337,0.0691628


## Practice

*Write a query that returns the major category, major, and unemployment rate for majors in the Engineering or Physical Sciences categories, listed in ascending order of unemployment rate.*

In [18]:
%%sql

SELECT Major_category, Major, Unemployment_rate
 FROM recent_grads
 WHERE Major_category = 'Engineering' OR Major_category = 'Physical Sciences'
 ORDER BY Unemployment_rate
 LIMIT 10;

 * sqlite:///jobs.db
Done.


Major_category,Major,Unemployment_rate
Engineering,ENGINEERING MECHANICS PHYSICS AND SCIENCE,0.006334343
Engineering,PETROLEUM ENGINEERING,0.018380527
Physical Sciences,ASTRONOMY AND ASTROPHYSICS,0.021167415
Physical Sciences,ATMOSPHERIC SCIENCES AND METEOROLOGY,0.022228555
Engineering,MATERIALS SCIENCE,0.023042836
Engineering,METALLURGICAL ENGINEERING,0.024096386
Physical Sciences,GEOSCIENCES,0.024373731
Engineering,MATERIALS ENGINEERING AND MATERIALS SCIENCE,0.027788805
Engineering,INDUSTRIAL PRODUCTION TECHNOLOGIES,0.028308097
Engineering,ENGINEERING AND INDUSTRIAL MANAGEMENT,0.03365166
