# How can we optimize our sales of financial products?

## Goals

By the end of this case, you will be familiar with databases. Specifically, you will learn the differences among the major types of databases and the different database management systems available. Basic SQL queries will also be introduced.

You will also be exposed to the technical jargon of databases. While you probably will not use these terms on a daily basis, they will give you a more holistic understanding of the data engineering discipline and facilitate conversations between you and data engineers.

## Introduction

**Business Context.** You are a data analyst at a large financial services firm that sells a diverse set of products. In order to make these sales, the firm relies on a call center where sales agents make calls to current as well as prospective customers. The company would like you to dive into their data to devise strategies to increase their revenue or reduce their costs. Specifically, they would like to double down on their most reliable customers, and to cut out sales agents who are not producing outcomes.

**Business Problem.** The business would like to answer the following questions: **"What types of customers are most likely to buy our product? And which of our sales agents are the most/least productive?"**

**Analytical Context.** The data is split across 3 tables: [`agent.xlsx`](data/agent.xlsx), [`call.xlsx`](data/call.xlsx), and [`customer.xlsx`](data/customer.xlsx).

The case is sequenced as follows: you will (1) learn the fundamentals of databases and SQL; (2) use SQL `SELECT` statements to identify potentially interesting customers; and (3) use SQL aggregation functions to compute summary statistics on your agents and identify the most/least productive ones.

## Why databases?

While we have dealt with quite a number of data files in Excel, this solution is not very convenient for an organization with very large and complex datasets. First, Excel worksheets [have](https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3) a maximum size of 1,048,576 rows by 16,384 columns, and in practice the limit is much smaller than that because many personal computers can't handle that amount of data along with Excel's user interface at the same time. Furthermore, while having a single very large table may not be so bad, the situation becomes more complicated when you have to look up the data in that table in another very large table, which in turn references another table, etc. After a bunch of `VLOOKUP`s across several large tables, not only might Excel crash, but you also risk losing your data.

![Spreadsheet icon](data/images/spreadsheet.png)

On top of that, collaboration is difficult in Excel - how would you allow your colleagues to work together on the same spreadsheet as you? One option could be to use Excel Online or Google Sheets, but although these tools are excellent for collaboration on small datasets, they are not optimized for tasks like the one we are describing.

Add the need to input data from online forms and use it in online dashboards that can be accessed by many people at once, and you quickly realize that this is definitely not a job for Excel anymore. You need a [**database engine**](https://en.wikipedia.org/wiki/Database_engine) instead. This is a piece of software which is purpose-built to do all the things we have just mentioned in a very efficient and secure way.

![Database icon](data/images/database_icon.png)

The [**database**](https://en.wikipedia.org/wiki/Database) is the classic location where modern organizations have chosen to store their data for professional use. An informal way to understand databases is as "spreadsheets of spreadsheets", that is, spreadsheets that link other (potentially many) spreadsheets. Their advantage over Excel is that databases enforce very strict rules on what is and is not allowed, which prevents chaos and data loss, and they perform extremely well on large datasets. Some popular database engines are:

1. Microsoft SQL Server
2. Oracle Database
3. MySQL (open-source)
4. PostgreSQL (open-source)
5. SQLite (open-source)

You might have noticed that the letters "SQL" appear in several of the names of these products. That is because just as in Excel we have formulas, we can have formulas in databases, and the language we use to write them is called [**SQL (Structured Query Language)**](https://en.wikipedia.org/wiki/SQL). Some people pronounce it by spelling it out as "S-Q-L", while others prefer to make it sound like the word "sequel". Either way is fine.

For the remainder of this case, we will be using the **[SQLite](https://www.sqlite.org/index.html)** database engine because it is one of the easiest to use. SQL syntax and functions are standardized across database engines for the most part, so it will not be too difficult for you to switch from SQLite to other packages in the future if needed.

![SQLite logo](data/images/sqlite_logo.gif)

### Inspecting our tables

Let's take a quick peek at what's inside our tables. The `agent.xlsx` table has the names of the call center agents (`Name`) and an ID column with a unique number for each agent (`AgentID`):

![Agents table](data/images/agents.png)

The `customer.xlsx` table has the customer list with several columns of personal data and a unique identifier for each customer (`CustomerID`):

![Customer](data/images/customer.png)

We'll leave the `call.xlsx` file for a bit later. Let's now see how these tables can be retrieved using SQLite.

## SQLite databases

SQLite stores all the tables of a database in a single file, which commonly has the extension `.db`. In our case, our database file is called [`call_center_database.db`](call_center_database.db) (we preloaded the Excel files into it). You won't be able to see its contents with Excel or with a plain text editor; instead, we will access the database from this notebook. To connect this notebook to the `call_center_database.db` file, run the following cell (don't worry about learning this code, it is not SQL yet!):

In [1]:
%%capture
!pip install ipython-sql sqlalchemy
import sqlalchemy
sqlalchemy.create_engine("sqlite:///call_center_database.db")
%load_ext sql
%sql sqlite:///call_center_database.db

Let's now write some actual SQL. Hereafter, every time we run some SQL code, we will start the cell with the **`%%sql`** line, which allows us to execute SQL from within this notebook.

### Visualizing a single table

In SQL, the command we use to see the contents of a table is the [**`SELECT`**](https://www.w3schools.com/sql/sql_select.asp) command. In Excel you simply open the spreadsheet and the data appears immediately before your eyes. With SQL, it is a bit different. You have to "call" or "query" the data with `SELECT`. The syntax is as follows:

~~~sql
SELECT column_name(s) FROM table_name
~~~

For instance, if we wanted to retrieve the `Name` column from the `agent` table, this would be the code:

In [2]:
%%sql

SELECT Name FROM agent

 * sqlite:///call_center_database.db
Done.


Name
Michele Williams
Jocelyn Parker
Christopher Moreno
Todd Morrow
Randy Moore
Paul Nunez
Gloria Singh
Angel Briggs
Lisa Cordova
Dana Hardy


### Exercise 1

Write code to retrieve the `AgentID` column from the `agent` table.

**Answer.**

-------

You can also retrieve several columns. For instance, to visualize the `AgentID` and `Name` columns at the same time, we separate the column names with a comma like this:

In [3]:
%%sql

SELECT AgentID, Name FROM agent

 * sqlite:///call_center_database.db
Done.


AgentID,Name
0,Michele Williams
1,Jocelyn Parker
2,Christopher Moreno
3,Todd Morrow
4,Randy Moore
5,Paul Nunez
6,Gloria Singh
7,Angel Briggs
8,Lisa Cordova
9,Dana Hardy


Or if you need to see all the columns of a table, you can use the **`*`** wildcard character:

In [4]:
%%sql

SELECT * FROM agent

 * sqlite:///call_center_database.db
Done.


AgentID,Name
0,Michele Williams
1,Jocelyn Parker
2,Christopher Moreno
3,Todd Morrow
4,Randy Moore
5,Paul Nunez
6,Gloria Singh
7,Angel Briggs
8,Lisa Cordova
9,Dana Hardy


If your table has many rows, you can retrieve only a sample of them with the [**`LIMIT`**](https://www.techonthenet.com/sql/select_limit.php) command. Here we only want to see 4 rows out of the 11 rows in the `agent` table:

In [5]:
%%sql

SELECT * FROM agent LIMIT 4

 * sqlite:///call_center_database.db
Done.


AgentID,Name
0,Michele Williams
1,Jocelyn Parker
2,Christopher Moreno
3,Todd Morrow
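
For readers who want to try these statements outside the notebook, here is a minimal sketch that runs the same kinds of `SELECT` queries through Python's built-in `sqlite3` module. It assumes an in-memory database with a four-row stand-in for the `agent` table (the rows are copied from the output above), rather than the real `call_center_database.db` file.

```python
import sqlite3

# In-memory database with a small stand-in for the `agent` table (an assumption;
# the case itself uses call_center_database.db through the %%sql magic).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agent (AgentID INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO agent VALUES (?, ?)",
    [(0, "Michele Williams"), (1, "Jocelyn Parker"),
     (2, "Christopher Moreno"), (3, "Todd Morrow")],
)

# One column, every row.
names = conn.execute("SELECT Name FROM agent").fetchall()

# All columns, but at most two rows.
first_two = conn.execute("SELECT * FROM agent LIMIT 2").fetchall()

print(names)
print(first_two)
```

Each result row comes back as a Python tuple with one entry per selected column, so `SELECT Name` yields one-element tuples and `SELECT *` yields two-element tuples here.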


### Exercise 2

#### 2.1

Retrieve all the columns of the `customer` table, but get only 15 rows.

**Answer.**

In [6]:
%%sql
SELECT * FROM customer LIMIT 15

 * sqlite:///call_center_database.db
Done.


CustomerID,Name,Occupation,Email,Company,PhoneNumber,Age
0,David Melton,Unemployed,DMelton@zoho.com,"Morris, Winters and Ramirez",409-093-0748,16
1,Michael Gonzalez,Student,Gonzalez_Michael@yahoo.com,Hernandez and Sons,231-845-0673,19
2,Amanda Wilson,Student,Amanda.Wilson75@verizon.com,"Mooney, West and Hansen",844-276-4552,18
3,Robert Thomas,"Engineer, structural",RThomas@xfinity.com,Johnson-Gordon,410-404-8000,25
4,Eddie Hall,Surgeon,EddieHall@outlook.com,Dawson LLC,872-287-2196,30
5,Charles Cruz DDS,Unemployed,CharlesDDS55@hotmail.com,Mitchell and Sons,744-564-0382,16
6,Maria Johnson,"Engineer, aeronautical",MJohnson@aol.com,Gibbs-Avery,448-258-9852,22
7,Michael Vaughn,"Librarian, public",MichaelVaughn@zoho.com,Rice LLC,275-669-5217,30
8,Emily Anderson DDS,Solicitor,Emily_DDS@yahoo.com,Little and Sons,334-290-7258,28
9,Travis Jensen,Unemployed,Travis.Jensen@hotmail.com,Edwards-Collins,106-848-8870,13


-------

#### 2.2

Retrieve only the `Name`, `Occupation`, and `Email` columns, still showing only 15 rows:

**Answer.**

In [7]:
%%sql

SELECT Name, Occupation, Email FROM customer LIMIT 15

 * sqlite:///call_center_database.db
Done.


Name,Occupation,Email
David Melton,Unemployed,DMelton@zoho.com
Michael Gonzalez,Student,Gonzalez_Michael@yahoo.com
Amanda Wilson,Student,Amanda.Wilson75@verizon.com
Robert Thomas,"Engineer, structural",RThomas@xfinity.com
Eddie Hall,Surgeon,EddieHall@outlook.com
Charles Cruz DDS,Unemployed,CharlesDDS55@hotmail.com
Maria Johnson,"Engineer, aeronautical",MJohnson@aol.com
Michael Vaughn,"Librarian, public",MichaelVaughn@zoho.com
Emily Anderson DDS,Solicitor,Emily_DDS@yahoo.com
Travis Jensen,Unemployed,Travis.Jensen@hotmail.com


-------

### Sorting and filtering tables

One of the most convenient functionalities of Excel is that it allows you to sort tables by a given column. You can also do this in SQL by adding the [**`ORDER BY`**](https://www.w3schools.com/sql/sql_orderby.asp) keyword to your `SELECT` statement along with the column you want to sort by. You can also specify if you want the sort to be A-Z (`ASC`) or Z-A (`DESC`). Your `ORDER BY` must come before any `LIMIT` constraints in order for the code to work:

In [8]:
%%sql

SELECT * FROM customer
ORDER BY Name ASC
LIMIT 15

 * sqlite:///call_center_database.db
Done.


CustomerID,Name,Occupation,Email,Company,PhoneNumber,Age
900,Aaron Gutierrez,IT consultant,Gutierrez.Aaron@protonmail.com,Crawford-Gross,188-915-8192,39
461,Aaron Hendrix,Unemployed,Aaron_H@verizon.com,"Collier, Flynn and Gonzalez",330-908-6683,8
145,Aaron Mcintyre,Unemployed,Mcintyre.Aaron@aol.com,Phillips-Scott,911-295-7056,14
622,Aaron Rose,"Engineer, production",Rose.Aaron@yahoo.com,Carr Ltd,192-727-2376,28
65,Adam Jimenez,Unemployed,Jimenez.Adam47@outlook.com,Clark-Cook,966-848-9733,14
958,Adam Leonard,Unemployed,Adam.L@att.com,"Weber, Ray and Knapp",389-176-8899,11
226,Adam Ward,Police officer,Ward_Adam@yahoo.com,Soto-Hobbs,877-739-8417,28
202,Adrian Aguilar,Unemployed,Adrian_A@protonmail.com,Parker PLC,504-961-9686,1
786,Alan Chambers,Administrator,Alan.C@xfinity.com,Davis Group,662-193-0632,24
985,Alan Mitchell,"Engineer, electrical",Mitchell.Alan77@hotmail.com,Wagner Inc,454-752-1489,46


### Question 1

Can you explain in plain English what the above code is doing?

### Exercise 3

Write a query that gets all the data from the `agent` table, sorted Z-A by `Name`.

**Answer.**

In [9]:
%%sql

SELECT * FROM agent
ORDER BY Name DESC

 * sqlite:///call_center_database.db
Done.


AgentID,Name
3,Todd Morrow
4,Randy Moore
5,Paul Nunez
0,Michele Williams
8,Lisa Cordova
1,Jocelyn Parker
6,Gloria Singh
9,Dana Hardy
2,Christopher Moreno
7,Angel Briggs


-------
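
The rule that `ORDER BY` must come before `LIMIT` can be checked on a toy table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database and four hypothetical customers; it is an illustration, not part of the case's dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (CustomerID INTEGER, Name TEXT)")
# Hypothetical rows, inserted in non-alphabetical order.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(0, "David Melton"), (1, "Aaron Rose"), (2, "Zachary Howe"), (3, "Adam Ward")],
)

# The rows are sorted first, then the first two sorted rows are kept.
top_two = conn.execute(
    "SELECT Name FROM customer ORDER BY Name ASC LIMIT 2"
).fetchall()
print(top_two)  # [('Aaron Rose',), ('Adam Ward',)]

# Swapping the two clauses is rejected by SQLite as a syntax error.
try:
    conn.execute("SELECT Name FROM customer LIMIT 2 ORDER BY Name ASC")
    swapped_ok = True
except sqlite3.OperationalError:
    swapped_ok = False
print(swapped_ok)  # False
```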

If you want to apply filters to your table, you have to use the [**`WHERE`**](https://www.w3schools.com/sql/sql_where.asp) keyword. This keyword establishes the conditions that you want to filter by. For instance, if you want to see only the customers that are unemployed, the condition would be

~~~sql
WHERE Occupation = 'Unemployed'
~~~

In context:

In [10]:
%%sql

SELECT * FROM customer
WHERE Occupation = 'Unemployed'
LIMIT 15

 * sqlite:///call_center_database.db
Done.


CustomerID,Name,Occupation,Email,Company,PhoneNumber,Age
0,David Melton,Unemployed,DMelton@zoho.com,"Morris, Winters and Ramirez",409-093-0748,16
5,Charles Cruz DDS,Unemployed,CharlesDDS55@hotmail.com,Mitchell and Sons,744-564-0382,16
9,Travis Jensen,Unemployed,Travis.Jensen@hotmail.com,Edwards-Collins,106-848-8870,13
12,Valerie Moore,Unemployed,Moore_Valerie@verizon.com,Choi and Sons,288-680-8457,16
13,Tina Cox,Unemployed,Tina_Cox@protonmail.com,"Giles, Harris and Sparks",535-922-1854,16
15,Grace Pearson,Unemployed,Grace_P@yandex.com,Taylor-Walker,973-809-0260,14
18,Zachary Howe,Unemployed,Howe.Zachary@xfinity.com,"Mullins, Dawson and Cross",556-773-8367,0
19,Elizabeth Harris,Unemployed,Elizabeth_Harris36@hotmail.com,"Grant, Bowman and Sawyer",177-551-8499,11
20,Angela Myers,Unemployed,Angela.Myers79@xfinity.com,Valentine-Jenkins,507-312-2781,4
21,Bridget Turner,Unemployed,Turner_Bridget66@verizon.com,Henderson LLC,578-620-4478,10


### Exercise 4

#### 4.1

Filter the `customer` table to only show the people who are related to the `Kelly Inc` company.

**Answer.**

In [11]:
%%sql

SELECT * FROM customer
WHERE Company = 'Kelly Inc'

 * sqlite:///call_center_database.db
Done.


CustomerID,Name,Occupation,Email,Company,PhoneNumber,Age
273,Lisa Edwards,Unemployed,Edwards.Lisa@comcast.net,Kelly Inc,908-636-0957,16
467,Michael Meyers,"Engineer, communications",Meyers_Michael@zoho.com,Kelly Inc,463-026-2476,35
762,Karen Wilson,Airline pilot,Wilson.Karen86@mail.com,Kelly Inc,841-118-9812,22


-------

#### 4.2

Filter the `customer` table to only show the people whose occupation is `Surgeon`. Sort by `Name` in ascending order. Only include the `Name`, `Occupation`, and `Email` columns.

**Hint:** The `ORDER BY` keyword must come after the `WHERE` condition.

**Answer.**

In [12]:
%%sql

SELECT Name, Occupation, Email
FROM Customer
WHERE Occupation = 'Surgeon'
ORDER BY Name ASC

 * sqlite:///call_center_database.db
Done.


Name,Occupation,Email
Brian Williams MD,Surgeon,MD_Brian@protonmail.com
David Powell,Surgeon,DavidPowell@comcast.net
Eddie Hall,Surgeon,EddieHall@outlook.com
Grant Alvarez,Surgeon,GrantAlvarez@att.com
Haley White,Surgeon,White.Haley14@hotmail.com
Natalie Jones,Surgeon,NJones73@xfinity.com


-------

## Finding potentially interesting customer cohorts

Since the firm wants to dig deeper into its customers, let's start by pulling some of their data out of our files; namely, information about customers who are not unemployed (and therefore are more likely to buy from us).

### Exercise 5

Write a query that selects the customer ID, name, and occupation from the `Customer` table, only showing results for customers who are *not* unemployed.

**Hint:** The symbol to check that two values are not equal is `!=` (so use that instead of `=` in your `WHERE` condition).

**Answer.**

In [13]:
%%sql
SELECT CustomerID, Name, Occupation
FROM Customer
WHERE Occupation != 'Unemployed'

 * sqlite:///call_center_database.db
Done.


CustomerID,Name,Occupation
1,Michael Gonzalez,Student
2,Amanda Wilson,Student
3,Robert Thomas,"Engineer, structural"
4,Eddie Hall,Surgeon
6,Maria Johnson,"Engineer, aeronautical"
7,Michael Vaughn,"Librarian, public"
8,Emily Anderson DDS,Solicitor
10,Ryan Banks,Student
11,Brandon Alexander,Chemical engineer
14,Michelle Reyes,"Engineer, drilling"


-------
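
One subtlety of `=` and `!=` is worth knowing: a row whose value is missing (`NULL`) satisfies neither condition, so it silently drops out of both filters. The sketch below illustrates this with Python's built-in `sqlite3` module on a hypothetical four-row table; whether the real `customer` table contains any missing occupations is not something this example assumes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (CustomerID INTEGER, Occupation TEXT)")
# Hypothetical rows; customer 2 has no recorded occupation (NULL).
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(0, "Unemployed"), (1, "Surgeon"), (2, None), (3, "Student")],
)

unemployed = conn.execute(
    "SELECT CustomerID FROM customer "
    "WHERE Occupation = 'Unemployed' ORDER BY CustomerID"
).fetchall()
employed = conn.execute(
    "SELECT CustomerID FROM customer "
    "WHERE Occupation != 'Unemployed' ORDER BY CustomerID"
).fetchall()

print(unemployed)  # [(0,)]
print(employed)    # [(1,), (3,)]  (the NULL row matches neither filter)
```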

This is a great first step; however, while producing the list of customers that are not unemployed, you inevitably spend a lot of time looking at the different professions your customers have and realize how often engineers appear in your database. You know that engineering jobs tend to command higher salaries these days, so you decide to try to extract a list of all the unique types of engineering jobs that are represented in your database.

### Example 1

Write a query which produces a list (in alphabetical order) of all distinct occupations in the `Customer` table that contain the word "Engineer".

To ensure that you don't get duplicate job titles in your query results, you'll need to write the keyword [**`DISTINCT`**](https://www.w3schools.com/sql/sql_distinct.asp) immediately after `SELECT` in your query. `SELECT DISTINCT` is your SQL way to remove duplicates from a query.

**Hint:** The [**`LIKE`**](https://www.w3schools.com/sql/sql_like.asp) operator can be used when you want to look for values similar to a particular phrase (in this case, "Engineer"). It is included as part of a `WHERE` clause. It needs to be complemented with the `%` symbol, which is a wildcard that represents zero, one, or multiple characters. For example, one valid `WHERE` clause utilizing the `LIKE` operator is `WHERE Name LIKE 'Mary%'`, which would return any results where the person's name starts with the word "Mary"; e.g. "Mary" or "Mary Sue" or "Mary Montes", etc.
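
To get a feel for where the `%` wildcard goes, the following sketch compares two patterns on a hypothetical list of occupations, using Python's built-in `sqlite3` module and an in-memory table. Note that in SQLite, `LIKE` is case-insensitive for ASCII letters by default, so a pattern written with a capital "E" also matches a lowercase "engineer".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (Occupation TEXT)")
# Hypothetical occupations, including a duplicate and a lowercase "engineer".
conn.executemany(
    "INSERT INTO customer VALUES (?)",
    [("Engineer, civil",), ("Chemical engineer",), ("Engineer, civil",),
     ("Surgeon",), ("Student",)],
)

# '%Engineer%' matches the word anywhere in the value.
anywhere = conn.execute(
    "SELECT DISTINCT Occupation FROM customer "
    "WHERE Occupation LIKE '%Engineer%' ORDER BY Occupation"
).fetchall()

# 'Engineer%' matches only values that start with the word.
prefix = conn.execute(
    "SELECT DISTINCT Occupation FROM customer "
    "WHERE Occupation LIKE 'Engineer%' ORDER BY Occupation"
).fetchall()

print(anywhere)  # [('Chemical engineer',), ('Engineer, civil',)]
print(prefix)    # [('Engineer, civil',)]
```

`DISTINCT` is what collapses the two "Engineer, civil" rows into a single result.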

**Answer.** One possible solution is given below:

In [14]:
%%sql
SELECT DISTINCT Occupation
FROM Customer
WHERE Occupation LIKE '%Engineer%'
ORDER BY Occupation

 * sqlite:///call_center_database.db
Done.


Occupation
Chemical engineer
Electrical engineer
"Engineer, aeronautical"
"Engineer, agricultural"
"Engineer, automotive"
"Engineer, biomedical"
"Engineer, broadcasting (operations)"
"Engineer, building services"
"Engineer, civil (consulting)"
"Engineer, civil (contracting)"


Now, several of your marketing colleagues tell you that people who are 30 or older will have a higher probability of buying your product (presumably because by that point they have more disposable income and savings). You don't want to take your colleagues' word for it, so you decide not to ignore people under 30 completely. Instead, you will add age information to the report so that the agent making the subsequent call can decide how to use it. However, due to privacy concerns, you cannot share the person's exact age.

### Example 2

Write a query that returns the customer ID, their name, and a column containing "Yes" if the customer is 30 years of age or older and "No" if not. Limit the results to 20.

**Hint:** You will need to use the [**`CASE...END`**](https://www.w3schools.com/sql/sql_case.asp) clause. The `CASE...END` clause can be used to evaluate conditional statements and returns a value once a condition is met. If no conditions are true, it returns the value in the `ELSE` clause (or `NULL` if there is no `ELSE` statement). For example:

~~~sql
CASE
    WHEN Name = 'Mary' THEN 'Yes'
    WHEN Name = 'Mary Montes' THEN 'Maybe'
    ELSE 'No'
END
~~~

The snippet above will output a column whose values will be "Yes" where the name is "Mary", "Maybe" when the name is "Mary Montes", and "No" otherwise.

**Answer.** One possible solution is given below:

In [15]:
%%sql

SELECT CustomerID, Name, Age,
    CASE
        WHEN Age >= 30 THEN 'Yes'
        WHEN Age <  30 THEN 'No'
        ELSE 'Missing Data'
    END
FROM Customer
ORDER BY Name DESC
LIMIT 20

 * sqlite:///call_center_database.db
Done.


CustomerID,Name,Age,CASE  WHEN Age >= 30 THEN 'Yes'  WHEN Age < 30 THEN 'No'  ELSE 'Missing Data'  END
392,Zachary Wilson,32,Yes
986,Zachary Stevenson,5,Yes
421,Zachary Ruiz,31,Yes
18,Zachary Howe,0,No
883,Zachary Anderson,15,No
952,Yolanda White,25,No
715,Yesenia Wright,27,No
699,Willie Greene,40,Yes
860,William Thompson,16,No
289,William Scott,16,No


However, we can see that some customers who are younger than 30 were marked with a "Yes" in the last column (for example, Zachary Stevenson, age 5). This is wrong. The cause is that the `Age` column is stored as text rather than as numbers in this database, so SQLite compares the values character by character, and the text `'5'` sorts after the text `'30'`. To guarantee that the engine handles the column as numeric, you should type `CAST(Age as integer)` instead of simply `Age` inside your `CASE...END` (the [**`CAST`**](https://www.sqlite.org/lang_expr.html#castexpr) expression converts a value from one data type to another):

In [16]:
%%sql

SELECT CustomerID, Name, Age,
    CASE
        WHEN CAST(Age as integer) >= 30 THEN 'Yes'
        WHEN CAST(Age as integer) <  30 THEN 'No'
        ELSE 'Missing Data'
    END
FROM Customer
ORDER BY Name DESC
LIMIT 20

 * sqlite:///call_center_database.db
Done.


CustomerID,Name,Age,CASE  WHEN CAST(Age as integer) >= 30 THEN 'Yes'  WHEN CAST(Age as integer) < 30 THEN 'No'  ELSE 'Missing Data'  END
392,Zachary Wilson,32,Yes
986,Zachary Stevenson,5,No
421,Zachary Ruiz,31,Yes
18,Zachary Howe,0,No
883,Zachary Anderson,15,No
952,Yolanda White,25,No
715,Yesenia Wright,27,No
699,Willie Greene,40,Yes
860,William Thompson,16,No
289,William Scott,16,No


The name of the new column is not very informative, as you can see. To assign a new name to a column in a query, we use the [**`AS`**](https://www.w3schools.com/sql/sql_ref_as.asp) keyword. The above query will look like this with a new, more useful column name:

In [17]:
%%sql

SELECT CustomerID, Name, Age,
    CASE
        WHEN CAST(Age as integer) >= 30 THEN 'Yes'
        WHEN CAST(Age as integer) <  30 THEN 'No'
        ELSE 'Missing Data'
    END AS Over30
FROM Customer
ORDER BY Name DESC
LIMIT 20

 * sqlite:///call_center_database.db
Done.


CustomerID,Name,Age,Over30
392,Zachary Wilson,32,Yes
986,Zachary Stevenson,5,No
421,Zachary Ruiz,31,Yes
18,Zachary Howe,0,No
883,Zachary Anderson,15,No
952,Yolanda White,25,No
715,Yesenia Wright,27,No
699,Willie Greene,40,Yes
860,William Thompson,16,No
289,William Scott,16,No
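
The effect of `CAST` can be reproduced on a toy table. In the sketch below, which uses Python's built-in `sqlite3` module, the `Age` column is deliberately declared as `TEXT` to mimic the behavior seen above; the table and its three rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Age is declared TEXT on purpose, so comparisons on the bare column are textual.
conn.execute("CREATE TABLE customer (Name TEXT, Age TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("Zachary Stevenson", "5"), ("Zachary Ruiz", "31"), ("Yolanda White", "25")],
)

# Without CAST: '5' is compared with '30' as text, and '5' sorts after '3'.
as_text = conn.execute(
    "SELECT Name, CASE WHEN Age >= 30 THEN 'Yes' ELSE 'No' END AS Over30 "
    "FROM customer ORDER BY Name DESC"
).fetchall()

# With CAST: the values are compared as integers.
as_int = conn.execute(
    "SELECT Name, CASE WHEN CAST(Age AS integer) >= 30 THEN 'Yes' ELSE 'No' END AS Over30 "
    "FROM customer ORDER BY Name DESC"
).fetchall()

print(as_text)  # [('Zachary Stevenson', 'Yes'), ('Zachary Ruiz', 'Yes'), ('Yolanda White', 'No')]
print(as_int)   # [('Zachary Stevenson', 'No'), ('Zachary Ruiz', 'Yes'), ('Yolanda White', 'No')]
```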


### Exercise 6

Modify the above query so that the query only returns customers who work in an engineering profession:

**Answer.**

In [18]:
%%sql
SELECT CustomerID, Name, Age,
    CASE
        WHEN CAST(Age as integer) >= 30 THEN 'Yes'
        WHEN CAST(Age as integer) <  30 THEN 'No'
        ELSE 'Missing Data'
    END AS Over30
FROM Customer
WHERE Occupation LIKE '%Engineer%'
ORDER BY Name DESC

 * sqlite:///call_center_database.db
Done.


CustomerID,Name,Age,Over30
421,Zachary Ruiz,31,Yes
952,Yolanda White,25,No
699,Willie Greene,40,Yes
973,William Jackson,35,Yes
966,William Garcia,29,No
179,William Davis,28,No
139,William Adams,44,Yes
893,Wendy Thornton,40,Yes
31,Victoria Gibson,38,Yes
112,Victoria Becker,24,No


-------

## Investigating customer conversion rates

The last few queries were based on certain assumptions about customers' buying patterns. In order to validate whether our hypotheses about engineers and age are true (e.g. that engineers exhibit higher product sales conversion rates, and perhaps that engineers age 30 or older tend to exhibit an even higher conversion rate), we will need to use two tables: `Call` and `Customer`. This is because the column `ProductSold` (whether the call resulted in a sale or not) lies only in the `Call` table, yet information about customer occupations and ages only lies in the `Customer` table.

This is a sample of the `call.xlsx` table:

![Call table](data/images/call.png)

The other columns are `CallID` (the unique identifier of the call), `AgentID` (the agent identifier), `CustomerID` (the customer identifier), `PickedUp` (1 if the customer picked up the phone, 0 otherwise), and `Duration` (in seconds).

`SELECT` commands are not restricted to a single table. Theoretically, there is no limit to the number of tables that you can extract data from in a single SQL query. Let's introduce some new concepts that are relevant once we go beyond a single table.

**Primary and foreign keys** are very important concepts that need to be understood by any database professional. Primary keys:

1. Uniquely identify a record in the table. Their name usually includes the word "ID". For example, `CustomerID` is the primary key of the `Customer` table, `AgentID` is the primary key of the `Agent` table, and `CallID` is the primary key of the `Call` table.
2. Do not accept null values (they shouldn't, because they are being used to identify the record)
3. Are limited to one per table.

On the other hand, foreign keys:

1. Are a field in the table that is the primary key in another table
2. Can accept null values
3. Are not limited in number per table. For example, the `Call` table has 2 foreign keys: `AgentID` and `CustomerID`, pointing to the `Agent` and `Customer` tables, respectively.
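
The key rules above can be seen in action with a small sketch. It uses Python's built-in `sqlite3` module and declares two simplified, hypothetical tables with explicit key constraints; we do not know whether the tables in `call_center_database.db` were created with such constraints. One SQLite-specific detail: foreign keys are only enforced after running `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE agent (AgentID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute(
    "CREATE TABLE call ("
    "CallID INTEGER PRIMARY KEY, "
    "AgentID INTEGER REFERENCES agent(AgentID))"
)
conn.execute("INSERT INTO agent VALUES (0, 'Michele Williams')")
conn.execute("INSERT INTO call VALUES (1, 0)")     # valid: agent 0 exists
conn.execute("INSERT INTO call VALUES (2, NULL)")  # a foreign key may be NULL


def is_rejected(statement):
    """Return True if SQLite refuses the statement as a constraint violation."""
    try:
        conn.execute(statement)
        return False
    except sqlite3.IntegrityError:
        return True


# A second agent with the same primary key is refused.
duplicate_key = is_rejected("INSERT INTO agent VALUES (0, 'Someone Else')")
# A call pointing to an agent that does not exist is refused.
unknown_agent = is_rejected("INSERT INTO call VALUES (3, 99)")
print(duplicate_key, unknown_agent)  # True True
```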

### Extracting call data for customers working in engineering professions

Let's first extract the relevant data so we can perform this analysis. Here, a [**`JOIN`**](https://www.w3schools.com/sql/sql_join.asp) clause will come in handy. 

`JOIN` clauses are used to combine data from two or more tables in the same query. For example, in the current scenario, we need to get the name of the agent involved in a call. The `Call` table contains only the `AgentID` and not the name of the agent. `JOIN` becomes useful here so we can match up the `Call` table with the `Agent` table, which does contain the name information.

Here's a diagram showing how `JOIN` (specifically, the [**`INNER JOIN`**](https://www.w3schools.com/sql/sql_join_inner.asp), which is the default version and the only one you will need to worry about in this case) works. This is a simplified example on subsets of the tables. Notice that only the rows with `AgentID` of 0 and 2 are extracted because those are the only two `id`s which show up in both tables:

![Join](data/images/inner_join_illustration.svg)
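
The same idea can be sketched in code. The example below uses Python's built-in `sqlite3` module with hypothetical subsets of the `agent` and `call` tables (the call rows and IDs are made up): agent 1 has no calls, and one call refers to an agent who is missing from the agent subset, so neither appears in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agent (AgentID INTEGER, Name TEXT)")
conn.execute("CREATE TABLE call (CallID INTEGER, AgentID INTEGER, ProductSold INTEGER)")
# Hypothetical subsets: agent 1 has no calls, and call 12 points to agent 5,
# who is not present in the agent subset.
conn.executemany(
    "INSERT INTO agent VALUES (?, ?)",
    [(0, "Michele Williams"), (1, "Jocelyn Parker"), (2, "Christopher Moreno")],
)
conn.executemany(
    "INSERT INTO call VALUES (?, ?, ?)",
    [(10, 0, 1), (11, 2, 0), (12, 5, 0)],
)

# Only rows whose AgentID appears in both tables survive the inner join.
rows = conn.execute(
    "SELECT agent.AgentID, agent.Name, call.CallID "
    "FROM agent JOIN call ON agent.AgentID = call.AgentID "
    "ORDER BY agent.AgentID"
).fetchall()
print(rows)  # [(0, 'Michele Williams', 10), (2, 'Christopher Moreno', 11)]
```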

### Example 3

Write a `JOIN` that shows the names of the agents and whether they made a sale in a particular call. The columns to include should be `AgentID`, `Name`, `ProductSold`, and `CallID`.

**Hints:** A `JOIN` clause has the following syntax:

~~~sql
JOIN right_table_name ON join_condition
~~~

Since we now have more than one table, we need to select columns without ambiguity (think of the `AgentID` column, which is in both `call` and `agent`). To do so, we specify which table we are referring to by prefixing the column with the table name and a dot, like this: `call.AgentID`.

**Answer.**

In [19]:
%%sql
SELECT agent.AgentID, agent.Name, call.ProductSold, call.CallID
FROM agent
JOIN call ON agent.AgentID = call.AgentID
ORDER BY agent.Name DESC

 * sqlite:///call_center_database.db
Done.


AgentID,Name,ProductSold,CallID
3,Todd Morrow,0,1001
3,Todd Morrow,1,1006
3,Todd Morrow,1,101
3,Todd Morrow,0,1021
3,Todd Morrow,0,1026
3,Todd Morrow,0,1029
3,Todd Morrow,0,1038
3,Todd Morrow,1,1041
3,Todd Morrow,0,1044
3,Todd Morrow,0,1049


### Exercise 7 (optional)

Write a query which returns all calls made to customers in the engineering profession, and shows whether they are over or under 30 as well as whether they ended up purchasing the product from that call.

**Answer.**

In [20]:
%%sql
SELECT call.CallID, customer.CustomerID, customer.Name, call.ProductSold,
    CASE
        WHEN CAST(Age as integer) >= 30 THEN 'Yes'
        WHEN CAST(Age as integer) <  30 THEN 'No'
        ELSE 'Missing Data'
    END AS Over30
FROM customer
JOIN call ON call.CustomerID = customer.CustomerID
WHERE customer.Occupation LIKE '%Engineer%'
ORDER BY customer.Name DESC

 * sqlite:///call_center_database.db
Done.


CallID,CustomerID,Name,ProductSold,Over30
2049,421,Zachary Ruiz,0,Yes
2960,421,Zachary Ruiz,0,Yes
3365,421,Zachary Ruiz,0,Yes
3386,421,Zachary Ruiz,1,Yes
4332,421,Zachary Ruiz,0,Yes
5017,421,Zachary Ruiz,0,Yes
6029,421,Zachary Ruiz,0,Yes
7459,421,Zachary Ruiz,0,Yes
7661,421,Zachary Ruiz,0,Yes
9856,421,Zachary Ruiz,0,Yes


-------

## Analyzing the call conversion data

Now that we've extracted the required information, we can proceed to test whether our desired cohort exhibits a higher sales conversion rate compared to the overall population of customers. A reasonable way to do this is to count the total number of calls to this cohort which resulted in a sale, and divide that by the total number of calls to this cohort (whether or not they resulted in a sale) to get a percentage, and then compare that with the percentage we compute from the `Call` table overall.

However, to compute these figures, we'll need to learn a bit about [**aggregation functions**](https://mode.com/sql-tutorial/sql-aggregate-functions/). An aggregation function allows you to perform a calculation on a set of values to return a single value - essentially computing some sort of summary statistic.

The following are the most commonly used SQL aggregation functions:

1. **`AVG()`** – calculates the average of a set of values
2. **`COUNT()`** – counts rows in a specified table or view
3. **`MIN()`** – gets the minimum value in a set of values
4. **`MAX()`** – gets the maximum value in a set of values
5. **`SUM()`** – calculates the sum of values
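
As a quick illustration of these five functions, the sketch below runs them all in one query against a hypothetical four-row `call` table, using Python's built-in `sqlite3` module and an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE call (CallID INTEGER, Duration INTEGER, ProductSold INTEGER)")
# Hypothetical calls: durations in seconds, ProductSold is 1 for a sale.
conn.executemany(
    "INSERT INTO call VALUES (?, ?, ?)",
    [(1, 100, 0), (2, 200, 1), (3, 300, 1), (4, 120, 0)],
)

# Each aggregation function collapses the four rows into a single value.
summary = conn.execute(
    "SELECT COUNT(*), MIN(Duration), MAX(Duration), AVG(Duration), SUM(ProductSold) "
    "FROM call"
).fetchone()
print(summary)  # (4, 100, 300, 180.0, 2)
```

Because `ProductSold` holds only 0s and 1s, `SUM(ProductSold)` is simply the number of calls that ended in a sale.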

### Example 4

Write two queries - one that computes the total sales and total calls made to customers in the engineering profession, and one that computes the same metrics for the entire customer base. What can you conclude regarding the conversion rate within the engineering customers vs. the overall customer base?

**Answer.** The first query:

In [21]:
%%sql
SELECT SUM(call.ProductSold), COUNT(*)
FROM customer
JOIN call ON call.CustomerID = customer.CustomerID
WHERE customer.Occupation LIKE '%Engineer%'

 * sqlite:///call_center_database.db
Done.


SUM(call.ProductSold),COUNT(*)
760,3619


The second query:

In [22]:
%%sql
SELECT SUM(call.ProductSold), COUNT(*)
FROM customer
JOIN call ON call.CustomerID = customer.CustomerID

 * sqlite:///call_center_database.db
Done.


SUM(call.ProductSold),COUNT(*)
2084,9925


The conversion rate for both groups is ~21%, suggesting that engineers are not more likely to purchase our products than the overall population.
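
The ~21% figure comes from dividing the two numbers in each result. A conversion rate can also be computed inside the query itself, with one caveat: in SQLite, dividing one integer by another performs integer division. The sketch below, which uses Python's built-in `sqlite3` module and a hypothetical five-row table, shows the pitfall and one common workaround (multiplying by `1.0`), and then repeats the arithmetic for the figures reported above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE call (CallID INTEGER, ProductSold INTEGER)")
# Hypothetical calls: 1 sale out of 5.
conn.executemany(
    "INSERT INTO call VALUES (?, ?)",
    [(1, 0), (2, 1), (3, 0), (4, 0), (5, 0)],
)

# Both operands are integers, so SQLite performs integer division.
truncated = conn.execute(
    "SELECT SUM(ProductSold) / COUNT(*) FROM call"
).fetchone()[0]

# Multiplying by 1.0 first forces floating-point division.
rate = conn.execute(
    "SELECT 1.0 * SUM(ProductSold) / COUNT(*) FROM call"
).fetchone()[0]
print(truncated, rate)  # 0 0.2

# The same arithmetic on the figures returned by the two queries above.
engineers = 760 / 3619
everyone = 2084 / 9925
print(round(engineers, 3), round(everyone, 3))  # 0.21 0.21
```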

### Exercise 8

Modify the first query from the previous example to include only people over 30 (regardless of their occupation). Is the conversion rate much different from 21%?

**Answer.**

In [23]:
%%sql
SELECT SUM(call.ProductSold), COUNT(*)
FROM customer
JOIN call ON call.CustomerID = customer.CustomerID
WHERE CAST(customer.Age as integer) > 30

 * sqlite:///call_center_database.db
Done.


SUM(call.ProductSold),COUNT(*)
601,2804

The conversion rate for this group is 601 / 2804 ≈ 21.4%, essentially the same as the ~21% observed for the overall customer base, so age alone does not appear to make a customer more likely to buy.


-------

## Evaluating our agents' performance

Recall the second part of our business question - we need to figure out which of our agents are the most and least productive. To do this, it makes sense to determine which metrics could be related to productivity. Looking at the features present, the following seem to be reasonable:

1. The number of calls an agent made
2. The lengths of calls an agent made
3. The total number of products an agent sold

### Question 2

For any given agent, would extracting this info be a good way of quickly analyzing their productivity? Why or why not?

While the above metrics are useful, some of them are also too numerous to be easily analyzed. Specifically, the lengths of the calls an agent made form a dataset that is as large as the number of calls the agent made. If the agent made many calls, looking at the entire set of call lengths will not be informative. Instead, we ought to compute some summary statistics of this metric; namely, the minimum, maximum, and mean lengths seem reasonable.

### Using `GROUP BY`

One last important thing to mention is that you can compute aggregations not only on an entire query as we have done so far, but also on *subsets* of the query. This is done with the [**`GROUP BY`**](https://www.w3schools.com/sql/sql_groupby.asp) keyword, followed by a column to group by. When we include it in our `SELECT` statements, the engine first partitions the output into subsets and then calculates the aggregation function for each subset.

For instance, if we add this line to our query:

~~~sql
GROUP BY agent.Name
~~~

SQLite executes the `SELECT` query as usual, but before showing us the result, it splits the rows into subsets so that each subset corresponds to one and only one agent name. If our query has 13 agent names, then SQLite creates 13 subsets, one for each agent. Then, it computes the aggregation function for each subset and finally shows us one row of results per subset.

This is better understood with an example.
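
Before the full example, here is a toy sketch of the partition-then-aggregate behavior, written with Python's built-in `sqlite3` module. The `sales_call` table, the agent names "Ana" and "Ben", and all values are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_call (AgentName TEXT, Duration INTEGER, ProductSold INTEGER)"
)
conn.executemany(
    "INSERT INTO sales_call VALUES (?, ?, ?)",
    [
        ("Ana", 100, 1),
        ("Ana", 200, 0),
        ("Ben", 150, 1),
        ("Ben", 250, 1),
        ("Ben", 80, 0),
    ],
)

# Without GROUP BY: a single row that summarizes every call.
overall = conn.execute(
    "SELECT COUNT(*), SUM(ProductSold) FROM sales_call"
).fetchall()

# With GROUP BY: one row per agent, each aggregate computed within that agent's subset.
per_agent = conn.execute(
    """
    SELECT AgentName, COUNT(*) AS NCalls, AVG(Duration) AS AvgDuration,
           SUM(ProductSold) AS TotalSales
    FROM sales_call
    GROUP BY AgentName
    ORDER BY AgentName
    """
).fetchall()

print(overall)    # [(5, 3)]
print(per_agent)  # [('Ana', 2, 150.0, 1), ('Ben', 3, 160.0, 2)]
```

The same aggregation functions appear in both queries; only the presence of `GROUP BY` changes whether we get one summary row or one row per agent.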

### Example 5

Write a query that returns, *for each agent*, the agent's name, number of calls, shortest and longest call lengths, average call length, and total number of products sold. Name the columns `AgentName`, `NCalls`, `Shortest`, `Longest`, `AvgDuration`, and `TotalSales`, and order the table by `AgentName` alphabetically. Make sure to include the `WHERE PickedUp = 1` clause so that the statistics are calculated only across the calls that were picked up.

**Answer.** One possible solution is given below:

In [24]:
%%sql
SELECT
    agent.Name AS AgentName,
    COUNT(*) AS NCalls,
    MIN(call.Duration) AS Shortest,
    MAX(call.Duration) AS Longest,
    AVG(call.Duration) AS AvgDuration,
    SUM(call.ProductSold) AS TotalSales
FROM call
JOIN agent ON call.AgentID = agent.AgentID
WHERE call.PickedUp = 1
GROUP BY agent.Name
ORDER BY agent.Name

 * sqlite:///call_center_database.db
Done.


AgentName,NCalls,Shortest,Longest,AvgDuration,TotalSales
Agent X,640,101,98,180.975,194
Angel Briggs,591,100,99,181.08121827411168,157
Christopher Moreno,649,100,98,177.979969183359,189
Dana Hardy,554,101,99,177.20397111913357,182
Gloria Singh,662,100,99,182.17522658610272,209
Jocelyn Parker,621,100,96,180.3268921095008,184
Lisa Cordova,639,102,99,179.21439749608763,201
Michele Williams,685,100,99,177.88029197080292,198
Paul Nunez,648,-5,99,181.070987654321,194
Randy Moore,600,101,99,178.595,177
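
One detail in this output deserves a second look: for every agent except Paul Nunez, `Shortest` (100 to 102) is larger than `Longest` (96 to 99), and Paul Nunez has a shortest call of -5. That pattern is what we would expect if `Duration` were stored as text, because SQLite then compares values character by character rather than numerically. This is an assumption about how the column is stored, not something the output proves. The sketch below uses Python's built-in `sqlite3` module and a small made-up table (not the case data) to show how `CAST(... AS INTEGER)`, which we already used for `customer.Age`, changes the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption for illustration: Duration is declared as TEXT,
# as can happen when data is imported from a spreadsheet.
conn.execute("CREATE TABLE call (Duration TEXT)")
conn.executemany("INSERT INTO call VALUES (?)", [("100",), ("250",), ("98",)])

# Text comparison: '100' sorts before '250', which sorts before '98'.
as_text = conn.execute("SELECT MIN(Duration), MAX(Duration) FROM call").fetchone()

# Numeric comparison after casting each value to an integer.
as_int = conn.execute(
    "SELECT MIN(CAST(Duration AS INTEGER)), MAX(CAST(Duration AS INTEGER)) FROM call"
).fetchone()

print(as_text)  # ('100', '98')
print(as_int)   # (98, 250)
```

If the same holds for the real `call` table, wrapping `call.Duration` in a cast inside `MIN` and `MAX` would give numeric extremes. The negative duration would still need to be checked as a possible data-entry error.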


## Conclusions

In this case, you learned the basics of SQL and used it to optimize the sales operations of a financial services firm. We narrowed down our set of potentially interesting customer cohorts and were able to compute summary statistics on the sales conversion rates of those cohorts, particularly versus the mean. In particular, we learned that some of our "no-brainer" hypotheses did not pan out, which illustrates the importance of always investigating the data to validate our thoughts. We also looked at sales agent performance and were able to find the ones that were most/least productive on particular metrics.

## Takeaways

In this case, we learned the differences between spreadsheets and databases. We also built a foundation of basic SQL commands to extract data from a database. Specifically, we:

1. Performed `SELECT...FROM` queries
2. Learned the `WHERE`, `ORDER BY`, `AS`, `DISTINCT`, `LIKE`, `CASE...END`, and `JOIN` keywords
3. Used basic aggregation functions like `SUM`, `MIN`, `MAX`, `COUNT`, and `AVG`, both standalone and in conjunction with `GROUP BY`.

When working with large datasets, SQL is a powerful tool that can help us navigate and understand data in ways that Excel cannot. Sometimes, it can even serve as the first stage of an exploratory data analysis and can help us answer questions all by itself. Furthermore, SQL is the means through which we can create and store data in databases for future, large-scale use.

## Appendix: SQL Cheat Sheet

**SELECT**

```SQL
SELECT * FROM table_name; -- Select all columns from a table
SELECT column_name(s) FROM table_name; -- Select some columns from a table
SELECT DISTINCT column_name(s) FROM table_name; -- Select only the different values
SELECT column_name(s) FROM table_name -- Select data filtered with the WHERE clause
  WHERE condition;
SELECT column_name(s) FROM table_name -- Order data by multiple columns. DESC for descending
  ORDER BY column_1, column_2 DESC, column_3 ASC; -- and ASC (optional) for ascending order
```

**Operators**
- `<` - Less than
- `>` - Greater than
- `<=` - Less than or equal
- `>=` - Greater than or equal
- `<>` - Not equal
- `=` - Equal
- `BETWEEN v1 AND v2` - Within a specified range (both endpoints included)
- `LIKE` - Search pattern. Use `%` as a wildcard. E.g., `%o%` matches "o", "bob", "blob", etc.
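
As a quick check of the last two operators, here is a minimal sketch using Python's built-in `sqlite3` module. The `person` table and its rows are hypothetical and chosen only to exercise `LIKE` and `BETWEEN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (Name TEXT, Age INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("bob", 30), ("blob", 41), ("ann", 35), ("o", 50)],
)

# LIKE with % on both sides: any name containing the letter "o".
like_rows = conn.execute(
    "SELECT Name FROM person WHERE Name LIKE '%o%' ORDER BY Name"
).fetchall()

# BETWEEN includes both endpoints, so ages 30 and 41 both qualify.
between_rows = conn.execute(
    "SELECT Name FROM person WHERE Age BETWEEN 30 AND 41 ORDER BY Name"
).fetchall()

print(like_rows)     # [('blob',), ('bob',), ('o',)]
print(between_rows)  # [('ann',), ('blob',), ('bob',)]
```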

**Aggregate Functions**
- `AVG(column)` - Returns the average value of a column
- `COUNT(column)` - Returns the number of rows in which the column is not NULL (`COUNT(*)` counts all rows)
- `MAX(column)` - Returns the maximum value of a column
- `MIN(column)` - Returns the minimum value of a column
- `SUM(column)` - Returns the sum of a column
```SQL
SELECT AVG(column_name), MIN(column_name), MAX(column_name) FROM table_name
```
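
To see how these functions treat missing values, here is a minimal sketch using Python's built-in `sqlite3` module. The table `t` and its three rows (one of them NULL) are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
# Python's None is stored as SQL NULL.
conn.executemany("INSERT INTO t VALUES (?)", [(10,), (None,), (30,)])

# COUNT(*) counts every row; COUNT(x), SUM(x), and AVG(x) skip the NULL.
row = conn.execute("SELECT COUNT(*), COUNT(x), SUM(x), AVG(x) FROM t").fetchone()

print(row)  # (3, 2, 40, 20.0)
```

Note that the average is 40 divided by the 2 non-NULL rows, not by all 3 rows.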
 
**Miscellaneous**
- `CASE...END` - Used in `SELECT` queries to compute a new value from a column based on conditions. E.g.
```SQL
SELECT column_name,
    CASE
        WHEN column_name >= 0 THEN 'POSITIVE'
        ELSE 'NEGATIVE'
    END
FROM table_name
```
- `AS` - Used to rename a variable. E.g.
```SQL
SELECT SUM(column_name) AS total_column_name FROM table_name
```
- `GROUP BY` - Used to group rows that share the same value(s) in particular column(s). It is mostly used along with aggregation functions
- `ORDER BY` - Determines the order in which the rows are returned by an SQL query