# 1. Intro to Databases and SQL

In previous missions, we worked with data sets stored in a single file, which was usually a CSV file. While CSV files are easy to interface with, they have a lot of limitations. It's difficult to load large CSV files into a computer's memory, which is where tools like pandas work with data. CSV files also fall short when it comes to providing security for production applications (imagine if companies like Google or Facebook used CSV files to store and access data).

In addition, CSV files are optimized for static representation. If your data changes quickly, which is true for most technology companies, then you'll need to adopt a different method.

A database is a repository designed for storing, querying, and processing data. Databases store the data we want, and expose an interface for interacting with it. Most technology companies use databases to structure the data they generate and query specific subsets of it later on in order to answer questions or make updates.

Database systems come with administrative software for configuring settings, controlling security and access, and generating reports. They also include a language for interfacing with the database.

In this course, we'll be focusing on a language called SQL, or Structured Query Language. We use SQL to query, update, and modify the data in a database.

SQL is the most common language for working with databases, and an important tool in any data professional's toolkit. While SQL is a language, it's quite different from languages like Python or R. Its creators built it specifically for querying and interacting with databases, so it lacks much of the functionality you'd expect from a general-purpose programming language. Because SQL is a declarative language, the user focuses on expressing what they want, and the computer focuses on figuring out how to perform the computation.
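To make the declarative idea concrete, here is a minimal sketch that uses Python's standard-library sqlite3 module. The in-memory table is a hypothetical stand-in for illustration (its three rows reuse values that appear in the outputs later in this lesson), and the query uses a WHERE filter, which this lesson covers in a later section.

```python
import sqlite3

# Hypothetical in-memory table for illustration; not the course database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Rank INTEGER, Major TEXT, Total INTEGER)")
rows = [
    (1, "PETROLEUM ENGINEERING", 2339),
    (5, "CHEMICAL ENGINEERING", 32260),
    (9, "MECHANICAL ENGINEERING", 91227),
]
conn.executemany("INSERT INTO recent_grads VALUES (?, ?, ?)", rows)

# Imperative style: we spell out how to visit each row and test it.
imperative = []
for rank, major, total in rows:
    if total > 10000:
        imperative.append(major)

# Declarative style: we state what we want and let the database decide how.
cursor = conn.execute("SELECT Major FROM recent_grads WHERE Total > 10000;")
declarative = [major for (major,) in cursor]

print(sorted(imperative) == sorted(declarative))
conn.close()
```

Both approaches produce the same majors; the difference is that the SQL version never says how to loop over the rows.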

Before diving into SQL syntax, we'll introduce a few database concepts so you're aware of how databases represent data, and why SQL makes it easy to work with that data.

# 2. Querying Databases with SQL

Writing a SQL query is the primary way to interact with a database. A SQL query has to adhere to a defined structure and vocabulary that we use to specify what we want the database to do. The SQL language has a set of general statements that we combine with specific logic to express the intent of our query.

The first and most basic statement in SQL is a SELECT statement. To specify that we want to return 10 specific columns for all of the rows in a certain table, we use the SELECT keyword, along with the names of the 10 columns. We use a SELECT statement whenever we want to return specific data from the database without editing or modifying it.

Let's explore the basic syntax for the SELECT statement.

    SELECT [columnA, columnB, ...]
    FROM tableName;
    
The SQL syntax reads more like English than a programming language like Python. The database converts our query to lower-level logic and returns the results to us. Now let's see what an actual SQL query looks like. The following query selects the Rank and Major columns from the table recent_grads, which represents the information from recent-grads.csv as a table in the database:

    SELECT Rank,Major
    FROM recent_grads;
    
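The semicolon (;) at the end of the query specifies where the query ends, which allows us to write a query on one line or across multiple lines. Some interfaces, such as Python's sqlite3 module when it runs a single query, also accept a query without the trailing semicolon.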
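As a rough sketch of how this query could be run outside the course interface, the example below uses Python's standard-library sqlite3 module. The in-memory table is a hypothetical stand-in for recent_grads, filled with three rows taken from the outputs shown later in this lesson.

```python
import sqlite3

# Hypothetical stand-in for the course database: an in-memory table
# holding three rows that appear elsewhere in this lesson.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Rank INTEGER, Major_code INTEGER, Major TEXT)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        (1, 2419, "PETROLEUM ENGINEERING"),
        (2, 2416, "MINING AND MINERAL ENGINEERING"),
        (3, 2415, "METALLURGICAL ENGINEERING"),
    ],
)

# The query spans several lines; the semicolon marks where it ends.
query = """
SELECT Rank, Major
FROM recent_grads;
"""
selected = conn.execute(query).fetchall()
print(selected)
conn.close()
```

Although the table has three columns, each returned row contains only the two columns named after SELECT.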

# 3. Querying a SQLite Database
We'll be working with SQLite, a lightweight database that's ideal for exploring and learning SQL. We'll explore how SQLite works under the hood in a later mission. For now, we've taken care of setting up and populating the database for our next exercise.

Writing and running SQL queries in our interface is similar to writing and running Python code. Write the query in the code cell, and then click Check to execute the query against the database. The interface returns the results as a list of lists, where each inner list represents the values in a row. (Python's sqlite3 module, used in the notebook cells later in this lesson, returns each row as a tuple instead.) If you write multiple queries in a code cell, only the last query's results are displayed.

Here's a preview of the results that SQLite returns:

    [[1, "PETROLEUM ENGINEERING"], [2, "MINING AND MINERAL ENGINEERING"], [3, "METALLURGICAL ENGINEERING"], [4, "NAVAL ARCHITECTURE AND MARINE ENGINEERING"], [5, "CHEMICAL ENGINEERING"],...
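The preview above is a list of lists. Python's standard-library sqlite3 module returns each row as a tuple, so a small conversion is needed to get the same shape. The sketch below assumes a hypothetical in-memory table containing two of the rows from the preview.

```python
import sqlite3

# Hypothetical in-memory table holding two rows from the preview.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Rank INTEGER, Major TEXT)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [(1, "PETROLEUM ENGINEERING"), (2, "MINING AND MINERAL ENGINEERING")],
)

# fetchall() returns a list of tuples, one tuple per row.
rows = conn.execute("SELECT Rank, Major FROM recent_grads;").fetchall()

# Convert each tuple to a list to mirror the list-of-lists preview.
as_lists = [list(row) for row in rows]
print(as_lists)
conn.close()
```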

# 4. Specifying Column Order for Our Results
SQL allows us to specify the column order for the results in the SELECT statement. Try swapping the order of the columns we specified in the previous query, and click Check to see the results.

    SELECT Major,Rank FROM recent_grads;
    
When we used Major,Rank instead of Rank,Major in the SELECT statement from the previous step, the first value in each list in our results was the major, while the second value was the rank.
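The effect of swapping the columns can be checked with a short sketch in Python's standard-library sqlite3 module. The in-memory table here is a hypothetical stand-in with two rows, not the course database.

```python
import sqlite3

# Hypothetical in-memory stand-in for the recent_grads table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Rank INTEGER, Major TEXT)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [(1, "PETROLEUM ENGINEERING"), (2, "MINING AND MINERAL ENGINEERING")],
)

# The order of the names after SELECT sets the order of values in each row.
rank_first = conn.execute("SELECT Rank, Major FROM recent_grads;").fetchall()
major_first = conn.execute("SELECT Major, Rank FROM recent_grads;").fetchall()
print(rank_first)
print(major_first)
conn.close()
```

The column order in the result follows the SELECT list, not the order in which the columns were defined in the table.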

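The same rule applies to longer column lists. The following query returns five columns in the order listed:

    SELECT Rank,Major_code,Major,Major_category,Total FROM recent_grads;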

In [4]:

    import sqlite3

    # Connect to the SQLite database file that holds the recent_grads table.
    jobs = sqlite3.connect('jobs.db')

In [20]:

    # Create a cursor and print the first five majors, one row per line.
    c = jobs.cursor()
    for row in c.execute('''SELECT Major FROM recent_grads LIMIT 5;'''):
        print(row)

    ('PETROLEUM ENGINEERING',)
    ('MINING AND MINERAL ENGINEERING',)
    ('METALLURGICAL ENGINEERING',)
    ('NAVAL ARCHITECTURE AND MARINE ENGINEERING',)
    ('CHEMICAL ENGINEERING',)


# 6. Practice: Selecting Columns with SELECT

Now it's your turn to write a SQL query from scratch.

**Instructions**

Write a query that returns the following five columns from recent_grads, in the same order:
- Rank
- Major_code
- Major
- Major_category
- Total

In [26]:

    # Print every row for the five requested columns, then inspect the column metadata.
    for row in c.execute('''SELECT Rank,Major_code,Major,Major_category,Total
    FROM recent_grads;'''):
        print(row)
    c.description

    (1, 2419, 'PETROLEUM ENGINEERING', 'Engineering', 2339)
    (2, 2416, 'MINING AND MINERAL ENGINEERING', 'Engineering', 756)
    (3, 2415, 'METALLURGICAL ENGINEERING', 'Engineering', 856)
    (4, 2417, 'NAVAL ARCHITECTURE AND MARINE ENGINEERING', 'Engineering', 1258)
    (5, 2405, 'CHEMICAL ENGINEERING', 'Engineering', 32260)
    (6, 2418, 'NUCLEAR ENGINEERING', 'Engineering', 2573)
    (7, 6202, 'ACTUARIAL SCIENCE', 'Business', 3777)
    (8, 5001, 'ASTRONOMY AND ASTROPHYSICS', 'Physical Sciences', 1792)
    (9, 2414, 'MECHANICAL ENGINEERING', 'Engineering', 91227)
    (10, 2408, 'ELECTRICAL ENGINEERING', 'Engineering', 81527)
    (11, 2407, 'COMPUTER ENGINEERING', 'Engineering', 41542)
    (12, 2401, 'AEROSPACE ENGINEERING', 'Engineering', 15058)
    (13, 2404, 'BIOMEDICAL ENGINEERING', 'Engineering', 14955)
    (14, 5008, 'MATERIALS SCIENCE', 'Engineering', 4279)
    (15, 2409, 'ENGINEERING MECHANICS PHYSICS AND SCIENCE', 'Engineering', 4321)
    (16, 2402, 'BIOLOGICAL ENGINEERING', 'Engineering', 8925)
    (17, 2412, 'INDUSTRIAL AND MANUFACTURING

    (('Rank', None, None, None, None, None, None),
     ('Major_code', None, None, None, None, None, None),
     ('Major', None, None, None, None, None, None),
     ('Major_category', None, None, None, None, None, None),
     ('Total', None, None, None, None, None, None))
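The c.description output above contains one 7-item tuple per selected column. In Python's sqlite3 module the first item is the column name and the other six are always None. The sketch below shows one way to pull out the names and label a row with them; the in-memory table is a hypothetical stand-in that holds a single row copied from the output above.

```python
import sqlite3

# Hypothetical in-memory stand-in holding one row from the output above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads "
    "(Rank INTEGER, Major_code INTEGER, Major TEXT, Major_category TEXT, Total INTEGER)"
)
conn.execute(
    "INSERT INTO recent_grads VALUES (?, ?, ?, ?, ?)",
    (1, 2419, "PETROLEUM ENGINEERING", "Engineering", 2339),
)

cur = conn.execute(
    "SELECT Rank, Major_code, Major, Major_category, Total FROM recent_grads;"
)
# The first item of each description entry is the column name.
column_names = [entry[0] for entry in cur.description]
# Pair each column name with its value to get a labeled record.
record = dict(zip(column_names, cur.fetchone()))
print(column_names)
print(record)
conn.close()
```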

# 7. Filtering With the WHERE Statement

So far, we've been writing queries that return all of the rows from the table, but only specific columns. If we wanted to figure out which majors had more female graduates than male graduates (where ShareWomen is larger than 0.5), we would need a way to constrain the rows the query returns.

To filter rows by specific criteria, we need to use the WHERE statement. The WHERE statement requires three things:

- The column we want the database to filter on: ShareWomen
- A comparison operator that specifies how we want to compare a value in a column: >
- The value we want the database to compare each value to: 0.5

In the query below, we:

- Use SELECT to specify the column filtering criteria: Major and ShareWomen
- Use FROM to specify the table we want to query: recent_grads
- Use WHERE to specify the row filtering criteria: ShareWomen > 0.5

    SELECT Major,ShareWomen
    FROM recent_grads
    WHERE ShareWomen > 0.5;

Here are the comparison operators we can use:

- Less than: <
- Less than or equal to: <=
- Greater than: >
- Greater than or equal to: >=
- Equal to: =
- Not equal to: !=

The comparison value after the operator must be either text or a number, depending on the column. Because ShareWomen is a numeric column, we don't need to enclose the number 0.5 in quotes; a text value would need to be enclosed in quotes. **Finally, most database systems require that the SELECT and FROM statements come first, before any WHERE or other statements.**
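The following sketch tries a numeric comparison and two text comparisons using Python's standard-library sqlite3 module. It assumes a hypothetical in-memory table and filters on Total and Major_category rather than ShareWomen, because those values appear in the outputs earlier in this lesson. The majors helper is introduced here only for illustration.

```python
import sqlite3

# Hypothetical in-memory stand-in; rows reuse values shown in this lesson.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Major_category TEXT, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("PETROLEUM ENGINEERING", "Engineering", 2339),
        ("CHEMICAL ENGINEERING", "Engineering", 32260),
        ("ACTUARIAL SCIENCE", "Business", 3777),
        ("ASTRONOMY AND ASTROPHYSICS", "Physical Sciences", 1792),
        ("MECHANICAL ENGINEERING", "Engineering", 91227),
    ],
)


def majors(query):
    """Run a one-column query and return the values as a set."""
    return {major for (major,) in conn.execute(query)}


# Numeric comparison: the value is written without quotes.
large = majors("SELECT Major FROM recent_grads WHERE Total > 10000;")
# Text comparison: the value is enclosed in single quotes.
business = majors("SELECT Major FROM recent_grads WHERE Major_category = 'Business';")
# Not equal to: every row whose category differs from 'Engineering'.
non_engineering = majors(
    "SELECT Major FROM recent_grads WHERE Major_category != 'Engineering';"
)
print(large, business, non_engineering)
conn.close()
```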

**Instructions**

Run the query we explored above that returns the Major and ShareWomen values for all rows where ShareWomen exceeded 0.5.
- Ensure that all of the values for ShareWomen (the second value in each inner list) are greater than 0.5.

In [29]:

    # Keep only rows where ShareWomen exceeds 0.5 and show the first five.
    for row in c.execute('''SELECT Major,ShareWomen
    FROM recent_grads
    WHERE ShareWomen>0.5
    LIMIT 5'''):
        print(row)
    c.description

    ('ACTUARIAL SCIENCE', 0.535714286)
    ('COMPUTER SCIENCE', 0.578766338)
    ('ENVIRONMENTAL ENGINEERING', 0.558548009)
    ('NURSING', 0.896018988)
    ('INDUSTRIAL PRODUCTION TECHNOLOGIES', 0.75047259)

    (('Major', None, None, None, None, None, None),
     ('ShareWomen', None, None, None, None, None, None))

# 8. Practice: Filtering with WHERE statements

Now it's your turn to write a SQL query that uses the WHERE statement to filter the results.

**Instructions**

Write a SQL query that returns all majors with more than 10000 employed graduates.
- In the SELECT statement, specify that we only want the values from the Major and Employed columns (in that order).

In [32]:

    # Keep only majors with more than 10000 employed graduates and show the first five.
    for row in c.execute('''SELECT Major,Employed
    FROM recent_grads
    WHERE Employed>10000
    LIMIT 5'''):
        print(row)

    ('CHEMICAL ENGINEERING', 25694)
    ('MECHANICAL ENGINEERING', 76442)
    ('ELECTRICAL ENGINEERING', 61928)
    ('COMPUTER ENGINEERING', 32506)
    ('AEROSPACE ENGINEERING', 11391)


# 9. Limiting the Number of Results

Many queries return a large number of results, which can be cumbersome to work with. SQL comes with a statement called LIMIT that takes an integer value specifying the maximum number of results we'd like the database to return.

The following query returns the first five values in the Major column:

    SELECT Major FROM recent_grads LIMIT 5;

Here's the result of that query:

    [["PETROLEUM ENGINEERING"], ["MINING AND MINERAL ENGINEERING"], ["METALLURGICAL ENGINEERING"], ["NAVAL ARCHITECTURE AND MARINE ENGINEERING"], ["CHEMICAL ENGINEERING"]]
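A minimal sketch of LIMIT in Python's standard-library sqlite3 module follows. The in-memory table is a hypothetical stand-in holding seven majors taken from the outputs earlier in this lesson.

```python
import sqlite3

# Hypothetical in-memory stand-in holding seven majors from this lesson.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Rank INTEGER, Major TEXT)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        (1, "PETROLEUM ENGINEERING"),
        (2, "MINING AND MINERAL ENGINEERING"),
        (3, "METALLURGICAL ENGINEERING"),
        (4, "NAVAL ARCHITECTURE AND MARINE ENGINEERING"),
        (5, "CHEMICAL ENGINEERING"),
        (6, "NUCLEAR ENGINEERING"),
        (7, "ACTUARIAL SCIENCE"),
    ],
)

# LIMIT caps the number of rows returned.
first_five = conn.execute("SELECT Major FROM recent_grads LIMIT 5;").fetchall()

# A limit larger than the table simply returns every row.
everything = conn.execute("SELECT Major FROM recent_grads LIMIT 100;").fetchall()

print(len(first_five), len(everything))
conn.close()
```

LIMIT sets an upper bound rather than an exact count, so a query never fails because fewer rows are available.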

**Instructions**

Write a query that returns:
- The Major column
- Where Employed exceeds 10000
- Only the first 10 results

In [34]:

    # Return only the Major column for majors with more than 10000
    # employed graduates, limited to the first 10 results.
    for row in c.execute('''SELECT Major
    FROM recent_grads
    WHERE Employed>10000
    LIMIT 10'''):
        print(row)


# 10. Next Steps
We've covered the basics of databases and SQL syntax in this lesson, and learned that SQL is an expressive language for working with data. In the next lesson, we'll learn how to combine multiple filtering criteria together to express more complex logic in SQL.