# SQL for Data Analyst
- [Basic][1]
- [Join][2]
- [Aggregation][3]
- [Subqueries][4]
- [Cleaning][5]
[1]:http://127.0.0.1:8888/notebooks/SQL4DA/Basic.ipynb
[2]:http://127.0.0.1:8888/notebooks/SQL4DA/Join.ipynb
[3]:http://127.0.0.1:8888/notebooks/SQL4DA/Aggregation.ipynb
[4]:http://127.0.0.1:8888/notebooks/SQL4DA/Subqueries.ipynb
[5]:http://127.0.0.1:8888/notebooks/SQL4DA/Cleaning.ipynb

# Data Cleaning

* [Subqueries][1]
    - [Exercise][2]
* [WITH CTE][3]
    - [Exercise][4] 
[1]:#Subqueries
[2]:#Exercise
[3]:#WITH-(-Common-Table-Expression，CTE-)
[4]:#Exercise-2

In [2]:
%load_ext autoreload
%autoreload 2
from Query import *
import seaborn as sns
import matplotlib.pyplot as plt

In [6]:
database = 'test'
exercise = 'cleaning.txt'

In [30]:
Query = query(database)
#Query.get_table()
#Query.sql2db('parch-and-posey.sql')
Query.connect()

## LEFT, RIGHT & LENGTH  
PostgreSQL  
Three new functions:
1. `LEFT`:  
`LEFT` pulls a specified number of characters for each row in a specified column, starting <mark>at the beginning</mark> (from the left). For example, `LEFT(phone_number, 3)` pulls the first three characters of a phone number.
2. `RIGHT`:  
`RIGHT` pulls a specified number of characters for each row in a specified column, starting <mark>at the end</mark> (from the right). For example, `RIGHT(phone_number, 8)` pulls the last eight characters of a phone number.
3. `LENGTH`:  
`LENGTH` returns the number of characters in each row of a specified column. For example, `LENGTH(phone_number)` gives the length of each phone number.

SQLite  
SQLite has no `LEFT` or `RIGHT` function; `substr` covers both:
```SQL
substr(data,start,length)
```
example:

| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| s | u | b | s | t | r |   | e | x | a | m | p | l | e |
|-14|-13|-12|-11|-10|-9|-8|-7|-6|-5|-4|-3|-2|-1|

```SQL
substr('substr example',4,10)
```
returns ```str exampl``` (the 10 characters starting at position 4)

```SQL
substr('substr example',-7)
```
returns ```example``` (the last 7 characters)
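The mapping from `LEFT`, `RIGHT` and `LENGTH` to SQLite's `substr` and `length` can be checked with a short sketch. It assumes Python's standard `sqlite3` module and an in-memory database; the phone number is made up for illustration.

```python
import sqlite3

# In-memory database: nothing is read from or written to disk.
conn = sqlite3.connect(":memory:")

phone = "415-555-0199"  # made-up value, not taken from the course data

row = conn.execute(
    """
    SELECT substr(?, 1, 3) AS left_3,   -- like LEFT(phone, 3)
           substr(?, -8)   AS right_8,  -- like RIGHT(phone, 8)
           length(?)       AS n_chars   -- like LENGTH(phone)
    """,
    (phone, phone, phone),
).fetchone()
conn.close()

print(row)
```

A negative start position counts from the end of the string, which is why `substr(x, -8)` stands in for `RIGHT(x, 8)`.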

## Exercise 
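1. In the accounts table, there is a column holding the website for each company. The last three characters specify what type of web address they are using. A list of extensions (and pricing) is provided here. Pull these extensions and provide how many of each website type exist in the accounts table.
```SQL
-- Group by the extension so that each website type gets its own count.
-- Without GROUP BY, COUNT(substr(website,-3)) collapses to one total (351 rows).
SELECT substr(a.website,-3) AS extension, COUNT(*) AS num
FROM accounts a
GROUP BY 1
ORDER BY 2 DESC;
```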

In [12]:
com = readcom(exercise)
result = Query.execute(com)
result

SELECT DISTINCT COUNT(substr(website,-3))
FROM accounts a


| | COUNT(substr(website,-3)) |
|:--:|:--:|
| 0 | 351 |
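A single number like 351 is what `COUNT` returns when there is no `GROUP BY`: it counts every row with a non-NULL value. The sketch below contrasts that with a grouped count. It assumes Python's standard `sqlite3` module, and the four rows are made up rather than taken from the accounts table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, website TEXT)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("A", "www.a.com"), ("B", "www.b.com"), ("C", "www.c.org"), ("D", "www.d.net")],
)

# Ungrouped: one row holding the total number of non-NULL extensions.
total = conn.execute(
    "SELECT COUNT(substr(website, -3)) FROM accounts"
).fetchall()

# Grouped: one row per extension with its own count.
by_ext = conn.execute(
    """
    SELECT substr(website, -3) AS ext, COUNT(*) AS num
    FROM accounts
    GROUP BY ext
    ORDER BY num DESC, ext
    """
).fetchall()
conn.close()

print(total)
print(by_ext)
```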


2. There is much debate about how much the name (or even the first letter of a company name) matters. Use the accounts table to pull the first letter of each company name to see the distribution of company names that begin with each letter (or number).
```SQL
SELECT substr(name,1,1) type, COUNT(*)
FROM accounts a
GROUP BY 1
ORDER BY 2 DESC;
```

In [14]:
com = readcom(exercise)
result = Query.execute(com)
result.head()

SELECT substr(name,1,1) type, COUNT(*)
FROM accounts a
GROUP BY 1
ORDER BY 2 DESC


| | type | COUNT(*) |
|:--:|:--:|:--:|
| 0 | C | 37 |
| 1 | A | 37 |
| 2 | P | 27 |
| 3 | M | 22 |
| 4 | T | 17 |


3. Use the accounts table and a CASE statement to create two groups: one group of company names that start with a number and a second group of those company names that start with a letter. What proportion of company names start with a letter?
```SQL
WITH g AS
(SELECT a.name,
    CASE 
    WHEN a.name REGEXP('^[0-9]') THEN 'numbers'
    WHEN a.name REGEXP('^[a-zA-Z]') THEN 'letters'
    ELSE 'others'
    END AS 'name_group'
FROM accounts a)

SELECT name_group, COUNT(*) AS num
FROM g
GROUP BY 1;
```

In [43]:
com = readcom(exercise)
result = Query.execute(com)
result

WITH g AS
(SELECT a.name,
    CASE 
    WHEN a.name REGEXP('^[0-9]') THEN 'numbers'
    WHEN a.name REGEXP('^[a-zA-Z]') THEN 'letters'
    ELSE 'others'
    END AS 'name_group'
FROM accounts a)

SELECT name_group, COUNT(*) AS num
FROM g
GROUP BY 1


| | name_group | num |
|:--:|:--:|:--:|
| 0 | letters | 350 |
| 1 | numbers | 1 |
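SQLite parses the `REGEXP` operator but ships no default implementation for it, so the query above only runs once the application registers a function named `regexp`. How the `Query` helper does that is not shown in this notebook. The sketch below is one possible setup, assuming Python's standard `sqlite3` and `re` modules and made-up company names; it also adds a percentage column, since the exercise asks for a proportion.

```python
import re
import sqlite3


def regexp(pattern, value):
    # SQLite rewrites `X REGEXP Y` as regexp(Y, X): the pattern arrives first.
    return value is not None and re.search(pattern, value) is not None


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)

conn.execute("CREATE TABLE accounts (name TEXT)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO accounts VALUES (?)",
    [("3M",), ("Apple",), ("Walmart",), ("eBay",)],
)

rows = conn.execute(
    """
    SELECT CASE
             WHEN name REGEXP '^[0-9]'    THEN 'numbers'
             WHEN name REGEXP '^[a-zA-Z]' THEN 'letters'
             ELSE 'others'
           END AS name_group,
           COUNT(*) AS num,
           ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM accounts), 1) AS pct
    FROM accounts
    GROUP BY name_group
    ORDER BY name_group
    """
).fetchall()
conn.close()

print(rows)
```

Applied to the counts recorded above, the same arithmetic gives 350 / 351, or about 99.7% of company names starting with a letter.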


4. Consider vowels as a, e, i, o, and u. What proportion of company names start with a vowel, and what percent start with anything else?
```SQL
WITH g AS
(SELECT a.name,
    CASE 
    WHEN a.name REGEXP('^[aeiouAEIOU]') THEN 'vowels'
    ELSE 'consonants'
    END AS 'name_group'
FROM accounts a)

SELECT name_group, COUNT(*) AS num
FROM g
GROUP BY 1;
```

In [45]:
com = readcom(exercise)
result = Query.execute(com)
result

WITH g AS
(SELECT a.name,
    CASE 
    WHEN a.name REGEXP('^[aeiouAEIOU]') THEN 'vowels'
    ELSE 'consonants'
    END AS 'name_group'
FROM accounts a)

SELECT name_group, COUNT(*) AS num
FROM g
GROUP BY 1


| | name_group | num |
|:--:|:--:|:--:|
| 0 | consonants | 271 |
| 1 | vowels | 80 |


## Index of character
### SQLite
#### `instr`
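The syntax is `instr(X, Y)`, where `X` is the string to search and `Y` is the substring to look for. It returns the 1-based position of the first occurrence of `Y` in `X`: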
```SQL
SELECT instr('Black cat', 'a');
```
Result:
```
3
```
Note: if the string does not contain the substring, `instr` returns 0 for that row.
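Because `instr` returns 0 when the substring is missing, a split such as `substr(name, 1, instr(name, ' ') - 1)` stops behaving as intended for a name without a space. The sketch below shows one way to guard against that with `CASE`. It assumes Python's standard `sqlite3` module; "Cher" is a hypothetical single-word name, not a value from the tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Split on the first space, falling back to the whole string when there is none.
sql = """
    SELECT instr(:name, ' ') AS space_pos,
           CASE WHEN instr(:name, ' ') > 0
                THEN substr(:name, 1, instr(:name, ' ') - 1)
                ELSE :name
           END AS first_name,
           CASE WHEN instr(:name, ' ') > 0
                THEN substr(:name, instr(:name, ' ') + 1)
                ELSE ''
           END AS last_name
"""

names = ["Tamara Tuma", "Cher"]  # second name is hypothetical
split = [conn.execute(sql, {"name": n}).fetchone() for n in names]
conn.close()

for row in split:
    print(row)
```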

1. Use the accounts table to create first and last name columns that hold the first and last names for the primary_poc.
```SQL
SELECT
    a.primary_poc,
    substr(a.primary_poc, 1, instr(a.primary_poc,' ')-1) AS first_name,
    substr(a.primary_poc, instr(a.primary_poc,' ')+1) AS last_name
FROM accounts a;
```

In [56]:
com = readcom(exercise)
result = Query.execute(com)
result.head()

SELECT
    a.primary_poc,
    substr(a.primary_poc, 1, instr(a.primary_poc,' ')-1) AS first_name,
    substr(a.primary_poc, instr(a.primary_poc,' ')+1) AS last_name
FROM accounts a


| | primary_poc | first_name | last_name |
|:--:|:--:|:--:|:--:|
| 0 | Tamara Tuma | Tamara | Tuma |
| 1 | Sung Shields | Sung | Shields |
| 2 | Jodee Lupo | Jodee | Lupo |
| 3 | Serafina Banda | Serafina | Banda |
| 4 | Angeles Crusoe | Angeles | Crusoe |


2. Now see if you can do the same thing for every rep name in the sales_reps table. Again provide first and last name columns.
```SQL
SELECT
    s.name,
    substr(s.name, 1, instr(s.name,' ')-1) AS first_name,
    substr(s.name, instr(s.name,' ')+1) AS last_name
FROM sales_reps s;
```

In [54]:
com = readcom(exercise)
result = Query.execute(com)
result.head()

SELECT
    s.name,
    substr(s.name, 1, instr(s.name,' ')-1) AS first_name,
    substr(s.name, instr(s.name,' ')+1) AS last_name
FROM sales_reps s


| | name | first_name | last_name |
|:--:|:--:|:--:|:--:|
| 0 | Samuel Racine | Samuel | Racine |
| 1 | Eugena Esser | Eugena | Esser |
| 2 | Michel Averette | Michel | Averette |
| 3 | Renetta Carew | Renetta | Carew |
| 4 | Cara Clarke | Cara | Clarke |


## CONCAT
1. CONCAT
2. Piping `||`  

Both <mark>combine values from several columns into a single string within each row</mark>. For example, first and last names stored in separate columns can be combined into a full name with `CONCAT(first_name, ' ', last_name)` or, with piping, `first_name || ' ' || last_name`. The SQLite queries in the exercises below use `||`.
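One detail worth knowing when piping in SQLite: if any operand of `||` is NULL, the whole result is NULL. The sketch below shows this and one way to guard against it with `coalesce`. It assumes Python's standard `sqlite3` module and uses literal values rather than table columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

full_name, with_null, guarded = conn.execute(
    """
    SELECT 'Tamara' || ' ' || 'Tuma',              -- plain concatenation
           'Tamara' || ' ' || NULL,                -- NULL swallows the result
           'Tamara' || ' ' || coalesce(NULL, '')   -- NULL replaced by ''
    """
).fetchone()
conn.close()

print(full_name, with_null, repr(guarded))
```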

## Exercise
1. Each company in the accounts table wants to create an email address for each primary_poc. The email address should have the form first name `.` last name `@` company name `.com`, using the first and last name of the primary_poc.
```SQL
WITH split AS (
    SELECT
        a.name AS company,
        substr(a.primary_poc,1,instr(a.primary_poc,' ')-1) AS first_name,
        substr(a.primary_poc,instr(a.primary_poc,' ')+1) AS last_name
    FROM accounts a)
SELECT 
    first_name ||'.'||last_name||'@'||LOWER(company)||'.com' AS email_address
FROM split;
```

In [69]:
com = readcom(exercise)
result = Query.execute(com)
result.head()

WITH split AS (
    SELECT
        a.name AS company,
        substr(a.primary_poc,1,instr(a.primary_poc,' ')-1) AS first_name,
        substr(a.primary_poc,instr(a.primary_poc,' ')+1) AS last_name
    FROM accounts a)
SELECT 
    first_name ||'.'||last_name||'@'||LOWER(company)||'.com' AS email_address
FROM split


| | email_address |
|:--:|:--:|
| 0 | Tamara.Tuma@walmart.com |
| 1 | Sung.Shields@exxon mobil.com |
| 2 | Jodee.Lupo@apple.com |
| 3 | Serafina.Banda@berkshire hathaway.com |
| 4 | Angeles.Crusoe@mckesson.com |


2. You may have noticed that in the previous solution some of the company names include spaces, which will certainly not work in an email address. See if you can create an email address that will work by removing all of the spaces in the account name, but otherwise your solution should be just as in question 1. Some helpful documentation is here.
```SQL
WITH split AS (
    SELECT
        a.name AS company,
        substr(a.primary_poc,1,instr(a.primary_poc,' ')-1) AS first_name,
        substr(a.primary_poc,instr(a.primary_poc,' ')+1) AS last_name
    FROM accounts a)
SELECT 
    first_name ||'.'||last_name||'@'||REPLACE(LOWER(company),' ','')||'.com' AS email_address
FROM split;
```

In [70]:
com = readcom(exercise)
result = Query.execute(com)
result.head()

WITH split AS (
    SELECT
        a.name AS company,
        substr(a.primary_poc,1,instr(a.primary_poc,' ')-1) AS first_name,
        substr(a.primary_poc,instr(a.primary_poc,' ')+1) AS last_name
    FROM accounts a)
SELECT 
    first_name ||'.'||last_name||'@'||REPLACE(LOWER(company),' ','')||'.com' AS email_address
FROM split


| | email_address |
|:--:|:--:|
| 0 | Tamara.Tuma@walmart.com |
| 1 | Sung.Shields@exxonmobil.com |
| 2 | Jodee.Lupo@apple.com |
| 3 | Serafina.Banda@berkshirehathaway.com |
| 4 | Angeles.Crusoe@mckesson.com |


3. We would also like to create an initial password, which they will change after their first log in. The first password will be the first letter of the primary_poc's first name (lowercase), then the last letter of their first name (lowercase), the first letter of their last name (lowercase), the last letter of their last name (lowercase), the number of letters in their first name, the number of letters in their last name, and then the name of the company they are working with, all capitalized with no spaces.
```SQL
WITH split AS (
    SELECT
        UPPER(REPLACE(a.name,' ','')) AS company,
        LOWER(substr(a.primary_poc,1,instr(a.primary_poc,' ')-1)) AS first_name,
        LOWER(substr(a.primary_poc,instr(a.primary_poc,' ')+1)) AS last_name
    FROM accounts a),
code AS (
    SELECT
        company,
        substr(first_name,1,1) AS c1,
        substr(first_name,-1,1) AS c2,
        substr(last_name,1,1) AS c3,
        substr(last_name,-1,1) AS c4,
        LENGTH(first_name) AS len1,
        LENGTH(last_name) AS len2
    FROM split)
SELECT c1||c2||c3||c4||len1||len2||company AS password
FROM code;
```

In [74]:
com = readcom(exercise)
result = Query.execute(com)
result.head()

WITH split AS (
    SELECT
        UPPER(REPLACE(a.name,' ','')) AS company,
        LOWER(substr(a.primary_poc,1,instr(a.primary_poc,' ')-1)) AS first_name,
        LOWER(substr(a.primary_poc,instr(a.primary_poc,' ')+1)) AS last_name
    FROM accounts a),
code AS (
    SELECT
        company,
        substr(first_name,1,1) AS c1,
        substr(first_name,-1,1) AS c2,
        substr(last_name,1,1) AS c3,
        substr(last_name,-1,1) AS c4,
        LENGTH(first_name) AS len1,
        LENGTH(last_name) AS len2
    FROM split)
SELECT c1||c2||c3||c4||len1||len2||company AS password
FROM code


| | password |
|:--:|:--:|
| 0 | tata64WALMART |
| 1 | sgss47EXXONMOBIL |
| 2 | jelo54APPLE |
| 3 | saba85BERKSHIREHATHAWAY |
| 4 | asce76MCKESSON |
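As a cross-check on the SQL, the password rule can also be written in plain Python and compared with the rows above. This is only a sketch: the function name `initial_password` is hypothetical, and the company spellings are inferred from the email outputs earlier in this notebook.

```python
def initial_password(primary_poc: str, company: str) -> str:
    """Build the initial password from a 'First Last' contact and a company name."""
    # Split on the first space only, mirroring the instr/substr logic in the SQL.
    first, last = primary_poc.lower().split(" ", 1)
    return (
        first[0] + first[-1]                  # first and last letter of first name
        + last[0] + last[-1]                  # first and last letter of last name
        + str(len(first)) + str(len(last))    # name lengths
        + company.replace(" ", "").upper()    # company, no spaces, upper case
    )


examples = [
    ("Tamara Tuma", "Walmart"),
    ("Sung Shields", "Exxon Mobil"),
    ("Serafina Banda", "Berkshire Hathaway"),
]
passwords = [initial_password(poc, company) for poc, company in examples]
print(passwords)
```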


## Working with NULL
The SQLite `ifnull()` function allows you to replace NULL values with another value.  
It takes two arguments, and it returns a copy of its first non-NULL argument, or NULL if both arguments are NULL.  
The `ifnull()` function is equivalent to `coalesce()` with two arguments.  



|ProductId   |ProductName    |Price|SalesPrice|
|:--:|:--:|:--:|:--:|
|1           |Widget Holder  |139.5|121.3|
|2           |Widget Stick   |89.75|78.65|
|3           |Foo Cap        |11.99|8.99|
|4           |Free Widget    |0.0|0.0|
|5           |Free Foobar    |   |1.0|
|6           |Free Beer      |   |   |



### `ifnull()`
```SQL
SELECT
  ProductId,
  ProductName,
  ifnull(Price, 0.0) AS Price,
  SalesPrice
FROM Products;
```

|ProductId   |ProductName    |Price|SalesPrice|
|:--:|:--:|:--:|:--:|
|1           |Widget Holder  |139.5|121.3|
|2           |Widget Stick   |89.75|78.65|
|3           |Foo Cap        |11.99|8.99|
|4           |Free Widget    |0.0|0.0|
|5           |Free Foobar    |0.0|1.0|
|6           |Free Beer      |0.0|   |


### `coalesce(X,Y,...)`
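`coalesce(X,Y,...)` returns a copy of its first non-NULL argument, or NULL if all arguments are NULL. For example:
```SQL
SELECT
  ProductId,
  ProductName,
  coalesce(Price, SalesPrice, 0.0) AS Price,
  SalesPrice
FROM Products;
```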
|ProductId   |ProductName    |Price|SalesPrice|
|:--:|:--:|:--:|:--:|
|1           |Widget Holder  |139.5|121.3|
|2           |Widget Stick   |89.75|78.65|
|3           |Foo Cap        |11.99|8.99|
|4           |Free Widget    |0.0|0.0|
|5           |Free Foobar    |1.0|1.0|
|6           |Free Beer      |0.0|   |
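The two result tables above can be reproduced with a short sketch that loads the same six products into an in-memory database. It assumes Python's standard `sqlite3` module and column types chosen here for illustration (`REAL` for both price columns).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products ("
    "ProductId INTEGER, ProductName TEXT, Price REAL, SalesPrice REAL)"
)
# Same rows as the Products table above; None is stored as NULL.
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?, ?)",
    [
        (1, "Widget Holder", 139.5, 121.3),
        (2, "Widget Stick", 89.75, 78.65),
        (3, "Foo Cap", 11.99, 8.99),
        (4, "Free Widget", 0.0, 0.0),
        (5, "Free Foobar", None, 1.0),
        (6, "Free Beer", None, None),
    ],
)

rows = conn.execute(
    """
    SELECT ProductId,
           ifnull(Price, 0.0)               AS price_ifnull,
           coalesce(Price, SalesPrice, 0.0) AS price_coalesce
    FROM Products
    WHERE ProductId >= 4
    ORDER BY ProductId
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Product 5 is where the two differ: `ifnull` falls straight back to `0.0`, while `coalesce` first tries `SalesPrice` and returns `1.0`.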

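`IS NULL` also helps locate rows with missing data, for example accounts that have no matching orders after a `LEFT JOIN`:
```SQL
SELECT *
FROM accounts a
LEFT JOIN orders o ON o.account_id = a.id
WHERE o.total IS NULL;
```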

In [83]:
com = readcom(exercise)
result = Query.execute(com)
result.head()

SELECT *
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL


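| | id | name | website | lat | long | primary_poc | sales_rep_id | id.1 | account_id | occurred_at | standard_qty | gloss_qty | poster_qty | total | standard_amt_usd | gloss_amt_usd | poster_amt_usd | total_amt_usd |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 0 | 1731 | Goldman Sachs Group | www.gs.com | 40.757444 | -73.967309 | Loris Manfredi | 321690 | | | | | | | | | | | |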
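The empty order columns in the row above are NULLs produced by the `LEFT JOIN` for an account with no orders. The sketch below shows how `COUNT` and `coalesce` handle such a row when aggregating. It assumes Python's standard `sqlite3` module; the two-table schema is cut down to a few columns and the order rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, account_id INTEGER, total INTEGER);
    -- Made-up rows: the second account has no orders at all.
    INSERT INTO accounts VALUES (1, 'Walmart'), (2, 'Goldman Sachs Group');
    INSERT INTO orders VALUES (10, 1, 169), (11, 1, 288);
    """
)

rows = conn.execute(
    """
    SELECT a.name,
           COUNT(o.id)                AS n_orders,   -- NULL ids are not counted
           coalesce(SUM(o.total), 0)  AS total_qty   -- SUM over only NULLs is NULL
    FROM accounts a
    LEFT JOIN orders o ON o.account_id = a.id
    GROUP BY a.id, a.name
    ORDER BY a.id
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```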


MORE:  
<mark>Memorizing all of this functionality isn't necessary</mark>, but you do need to be able to follow documentation, and learn from what you have done in solving previous problems to solve new problems.

There are a few other functions that work similarly. You can read more about those [here][SQL NULL Functions]. You can also get a walk through of many of the functions you have seen throughout this lesson [here][Using SQL String Functions to Clean Data].

[SQL NULL Functions]:https://www.w3schools.com/sql/sql_isnull.asp
[Using SQL String Functions to Clean Data]:https://mode.com/sql-tutorial/sql-string-functions-for-cleaning/