<h1>SQL for Data Analysis</h1>
<hr>

<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Cleaning" data-toc-modified-id="Data-Cleaning-1">Data Cleaning</a></span></li><li><span><a href="#LEFT,-RIGHT,-and-LENGTH-Functions" data-toc-modified-id="LEFT,-RIGHT,-and-LENGTH-Functions-2">LEFT, RIGHT, and LENGTH Functions</a></span><ul class="toc-item"><li><span><a href="#Practice-Questions" data-toc-modified-id="Practice-Questions-2.1">Practice Questions</a></span></li></ul></li><li><span><a href="#POSITION,-STRPOS,-LOWER,-and-UPPER-Functions" data-toc-modified-id="POSITION,-STRPOS,-LOWER,-and-UPPER-Functions-3">POSITION, STRPOS, LOWER, and UPPER Functions</a></span><ul class="toc-item"><li><span><a href="#Practice-Questions" data-toc-modified-id="Practice-Questions-3.1">Practice Questions</a></span></li></ul></li><li><span><a href="#CONCAT" data-toc-modified-id="CONCAT-4">CONCAT</a></span><ul class="toc-item"><li><span><a href="#Practice-Questions" data-toc-modified-id="Practice-Questions-4.1">Practice Questions</a></span></li></ul></li><li><span><a href="#TO_DATE-and-CAST-Functions" data-toc-modified-id="TO_DATE-and-CAST-Functions-5">TO_DATE and CAST Functions</a></span><ul class="toc-item"><li><span><a href="#Practice-Question" data-toc-modified-id="Practice-Question-5.1">Practice Question</a></span></li></ul></li><li><span><a href="#COALESCE-Function" data-toc-modified-id="COALESCE-Function-6">COALESCE Function</a></span></li><li><span><a href="#Recap" data-toc-modified-id="Recap-7">Recap</a></span></li></ul></div>

## Data Cleaning

## LEFT, RIGHT, and LENGTH Functions

>**LEFT** pulls a specified number of characters for each row in a specified column starting at the beginning (or from the left).<br><br>
>**RIGHT** pulls a specified number of characters for each row in a specified column starting at the end (or from the right).<br><br>
>**LENGTH** provides the number of characters for each row of a specified column.

**Example:**
```sql
SELECT first_name,
       last_name,
       phone_number,
       LEFT(phone_number, 3) AS area_code,
       RIGHT(phone_number, 8) AS phone_number_only,
       /* Alternate method and to demo use of LENGTH function: */
       RIGHT(phone_number, LENGTH(phone_number) - 4) AS phone_number_alt 
FROM customer_data
```

phone_number | area_code | phone_number_only | phone_number_alt
:------------|:----------|:------------------|:----------------
399-751-5387 | 399       | 751-5387          | 751-5387
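
The `LENGTH`-based alternative above can be checked without the `customer_data` table. The following is a minimal sketch, assuming PostgreSQL; the `sample` CTE is a hypothetical one-row stand-in, and the negative-count form of `RIGHT` is a PostgreSQL behavior that other dialects may not support.

```sql
-- Hypothetical one-row input; no real table is required.
WITH sample(phone_number) AS (
  VALUES ('399-751-5387')
)
SELECT phone_number,
       LENGTH(phone_number)                          AS total_length,       -- 12
       RIGHT(phone_number, LENGTH(phone_number) - 4) AS via_length,         -- 751-5387
       /* In PostgreSQL, a negative count drops that many leading characters */
       RIGHT(phone_number, -4)                       AS via_negative_count  -- 751-5387
FROM sample;
```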

### Practice Questions

1. In the accounts table, there is a column holding the website for each company. The last three characters specify what type of web address the company is using. A list of extensions (and pricing) is provided <a href='https://iwantmyname.com/domains/domain-name-registration-list-of-extensions'>here</a>. Pull these extensions and count how many of each website type exist in the accounts table.
```sql
SELECT RIGHT(website, 3) AS extension,
       COUNT(*) AS num_of_companies
FROM accounts
GROUP BY 1
ORDER BY 2 DESC;
```
2. There is much debate about how much the name (or even the first letter of a company name) matters. Use the accounts table to pull the first letter of each company name to see the distribution of company names that begin with each letter (or number).
```sql
SELECT LEFT(UPPER(name), 1) AS first_letter, COUNT(*) num_companies
FROM accounts
GROUP BY 1
ORDER BY 2 DESC;
```
3. Use the accounts table and a CASE statement to create two groups: one group of company names that start with a number and a second group of those company names that start with a letter. What proportion of company names start with a letter?
```sql
WITH t1 AS
(SELECT name, 
       CASE WHEN LEFT(UPPER(name), 1) IN ('0','1','2','3','4','5','6','7','8','9')
       THEN 1 ELSE 0 END AS num,
       CASE WHEN LEFT(UPPER(name), 1) IN ('0','1','2','3','4','5','6','7','8','9')
       THEN 0 ELSE 1 END AS letter
FROM accounts)
SELECT SUM(num) nums, SUM(letter) letters
FROM t1
```
<i>There are 350 company names that start with a letter and 1 that starts with a number. Therefore, 350/351, or 99.7%, of company names start with a letter.</i>

    
4. Consider vowels as a, e, i, o, and u. What proportion of company names start with a vowel, and what percent start with anything else?
```sql
WITH t1 AS (
SELECT name,
       CASE WHEN LEFT(UPPER(name), 1) IN ('A','E','I','O','U')
       THEN 1 END AS vowel,
       CASE WHEN LEFT(UPPER(name), 1) NOT IN ('A','E','I','O','U')
       THEN 1 END AS other
FROM accounts)
SELECT COUNT(vowel) vowel, COUNT(other) other
FROM t1
```
<i>There are 80 company names that start with a vowel and 271 that start with other characters. Therefore, 80/351, or 22.8%, of company names start with a vowel, and the remaining 77.2% do not.</i>
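
The two answers above derive the proportions by hand from the counts. As a minimal sketch, the percentages can also be computed in the query itself by averaging a 0/1 flag. The `sample` CTE and its four names are hypothetical stand-ins for the `accounts` table, and, as in the solutions above, any name that does not start with a digit is treated as starting with a letter.

```sql
-- Hypothetical sample names standing in for accounts.name.
WITH sample(name) AS (
  VALUES ('Walmart'), ('Exxon Mobil'), ('Apple'), ('3M')
)
SELECT ROUND(100.0 * AVG(CASE WHEN LEFT(name, 1) IN ('0','1','2','3','4','5','6','7','8','9')
                              THEN 0 ELSE 1 END), 1) AS pct_letter,  -- 75.0
       ROUND(100.0 * AVG(CASE WHEN LEFT(UPPER(name), 1) IN ('A','E','I','O','U')
                              THEN 1 ELSE 0 END), 1) AS pct_vowel    -- 50.0
FROM sample;
```

Swapping `sample` for `accounts` should give the percentages directly, without a second manual step.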

## POSITION, STRPOS, LOWER, and UPPER Functions

> **POSITION** returns the position of a character or substring within a string, counting from the left (the first character is position 1).<br><br>
> **STRPOS** returns the same position but uses a different syntax: `STRPOS(string, substring)` instead of `POSITION(substring IN string)`.<br><br>
> **POSITION** and **STRPOS** are case sensitive.<br><br>
> **LOWER** and **UPPER** convert all characters of a string to lowercase or uppercase.

**Example:**
```sql
SELECT first_name,
       last_name,
       city_state,
       POSITION(',' IN city_state) AS comma_position,
       STRPOS(city_state, ',') AS substr_comma_position,
       LOWER(city_state) AS lowercase,
       UPPER(city_state) AS uppercase,
       LEFT(city_state, POSITION(',' IN city_state) - 1) AS city
FROM customer_data
```

city_state | comma_position | substr_comma_position | lowercase | uppercase | city
:----------|:---------------|:----------------------|:----------|:----------|:----
Cincinnati, OH | 11         | 11                    |cincinnati, oh | CINCINNATI, OH | Cincinnati
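
Because both functions are case sensitive, a search can silently miss a match. The following is a minimal sketch, assuming PostgreSQL, where a failed search returns 0; the `sample` CTE is a hypothetical one-row stand-in for `customer_data`.

```sql
-- Hypothetical one-row input reusing the value from the table above.
WITH sample(city_state) AS (
  VALUES ('Cincinnati, OH')
)
SELECT STRPOS(city_state, 'oh')        AS case_sensitive_pos,    -- 0: no lowercase 'oh' in the value
       STRPOS(LOWER(city_state), 'oh') AS case_insensitive_pos,  -- 13
       POSITION('OH' IN city_state)    AS exact_case_pos         -- 13
FROM sample;
```

Lowercasing (or uppercasing) the column before searching is one common way to make the match case insensitive.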

### Practice Questions

1. Use the `accounts` table to create first and last name columns that hold the first and last names for the `primary_poc`. 
```sql
SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') -1 ) first_name, 
RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name
FROM accounts;
```
2. Now see if you can do the same thing for every rep `name` in the `sales_reps` table. Again provide **first** and **last** name columns.
```sql
SELECT LEFT(name, STRPOS(name, ' ') -1 ) first_name, 
       RIGHT(name, LENGTH(name) - STRPOS(name, ' ')) last_name
FROM sales_reps;
```
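
Both solutions assume every name contains a space. When it does not, `STRPOS` returns 0, and in PostgreSQL `LEFT(name, -1)` drops the last character rather than returning the whole name. The following is a minimal sketch of one way to guard against that; the `sample` CTE and its names are hypothetical.

```sql
-- Hypothetical names: one with a space, one without.
WITH sample(name) AS (
  VALUES ('Tamara Tuma'), ('Cher')
)
SELECT name,
       LEFT(name, STRPOS(name, ' ') - 1) AS naive_first_name,  -- 'Tamara', 'Che'
       CASE WHEN STRPOS(name, ' ') = 0 THEN name
            ELSE LEFT(name, STRPOS(name, ' ') - 1)
       END AS first_name,                                      -- 'Tamara', 'Cher'
       CASE WHEN STRPOS(name, ' ') = 0 THEN NULL
            ELSE RIGHT(name, LENGTH(name) - STRPOS(name, ' '))
       END AS last_name                                        -- 'Tuma', NULL
FROM sample;
```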

## CONCAT

> **CONCAT** combines values from several columns into one column.<br><br>
> Both **CONCAT** and **'||'** can be used to concatenate strings.

**Example:**
```sql
SELECT first_name,
       last_name,
       CONCAT(first_name,' ',last_name) AS full_name,
       /* Alternative method */
       first_name || ' ' || last_name AS full_name_alt
FROM customer_data
```
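
The two methods differ when a value is NULL. The following is a minimal sketch, assuming PostgreSQL, where `CONCAT` skips NULL arguments while `||` returns NULL if any operand is NULL; the `sample` CTE and its names are hypothetical.

```sql
-- Hypothetical rows: the second has no last name.
WITH sample(first_name, last_name) AS (
  VALUES ('Ada', 'Lovelace'), ('Cher', NULL)
)
SELECT CONCAT(first_name, ' ', last_name) AS via_concat,  -- 'Ada Lovelace', 'Cher ' (trailing space)
       first_name || ' ' || last_name     AS via_pipes    -- 'Ada Lovelace', NULL
FROM sample;
```

If NULLs are possible, wrapping each operand in `COALESCE` (covered later in this lesson) is one way to keep `||` from returning NULL.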

### Practice Questions

1. Each company in the `accounts` table wants to create an email address for each `primary_poc`. The email address should be the first name of the `primary_poc`, then `.`, then the last name of the `primary_poc`, then `@`, then the company name, then `.com`.
```sql
WITH t1 AS (
 SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') -1 ) first_name,  
     RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name, name
 FROM accounts)
SELECT first_name, last_name, CONCAT(first_name, '.', last_name, '@', name, '.com')
FROM t1;
```
2. You may have noticed that in the previous solution some of the company names include spaces, which will certainly not work in an email address. See if you can create an email address that will work by removing all of the spaces in the `account name`, but otherwise your solution should be just as in question `1`. Some helpful documentation is <a href='https://www.postgresql.org/docs/8.1/static/functions-string.html'>here</a>.
```sql
WITH t1 AS (
 SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') -1 ) first_name,  
     RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name, name
 FROM accounts)
SELECT first_name, last_name, CONCAT(first_name, '.', last_name, '@', REPLACE(name, ' ', ''), '.com')
FROM  t1;
```
3. We would also like to create an initial password, which they will change after their first log in. The first password will be the first letter of the `primary_poc`'s first name (lowercase), then the last letter of their first name (lowercase), the first letter of their last name (lowercase), the last letter of their last name (lowercase), the number of letters in their first name, the number of letters in their last name, and then the name of the company they are working with, all capitalized with no spaces.
```sql
WITH t1 AS (
 SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') - 1) first_name,  
     RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name, name
 FROM accounts)
SELECT first_name,
       last_name,
       /* Remove spaces from the company name, as in question 2 */
       CONCAT(first_name, '.', last_name, '@', REPLACE(name, ' ', ''), '.com') AS email,
       LEFT(LOWER(first_name), 1) || RIGHT(LOWER(first_name), 1)
         || LEFT(LOWER(last_name), 1) || RIGHT(LOWER(last_name), 1)
         || LENGTH(first_name) || LENGTH(last_name)
         || REPLACE(UPPER(name), ' ', '') AS password
FROM t1;
```

## TO_DATE and CAST Functions

>Date values in SQL use the format **YYYY-MM-DD**<br><br>
>**DATE_PART('month', TO_DATE(month, 'month'))** is used to change a month name into the number associated with that particular month.

>**CAST** allows us to change columns from one data type to another. Both 'CAST' and '::' allow for the converting of one data type to another.

**Example:**
```sql
SELECT *,
  DATE_PART('month', TO_DATE(month, 'month')) AS clean_month,
  year || '-' || DATE_PART('month', TO_DATE(month, 'month')) || '-' || day AS concatenated_date,
  CAST(year || '-' || DATE_PART('month', TO_DATE(month, 'month')) || '-' || day AS date) AS formatted_date,
  /* Alternative method to CAST */
  (year || '-' || DATE_PART('month', TO_DATE(month, 'month')) || '-' || day)::date AS formatted_date_alt
FROM ad_clicks
```

month | day | year | clean_month | concatenated_date | formatted_date
:-----|:----|:-----|:------------|:------------------|:--------------
January | 1 | 2014 | 1           | 2014-1-1          | 2014-01-01

<i>Note: some table columns are not shown</i> 

**Expert Tip**

Most of the functions presented in this lesson operate on strings and are not defined for dates, integers, or floating-point numbers. Some SQL environments convert such values to text automatically when a string function is applied; in others, including recent versions of PostgreSQL, you may need to cast the value explicitly first (for example, with `CAST(value AS text)` or `value::text`).

**LEFT**, **RIGHT**, and **TRIM** all select only certain parts of a string; when applied to a number or date, the value is handled as a string for the purpose of the function. Although this lesson did not cover **TRIM** explicitly, it removes characters from the beginning and end of a string. This is useful for stripping unwanted leading or trailing spaces, which often appear when data is moved from Excel or other storage systems.
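
The following is a minimal sketch of both points, assuming PostgreSQL. The literal values are hypothetical, and the explicit cast is shown because PostgreSQL does not generally accept a bare integer as the first argument of `LEFT` or `RIGHT`.

```sql
SELECT LEFT(CAST(20140131 AS text), 4)      AS year_from_integer,  -- '2014'
       RIGHT((20140131)::text, 2)           AS day_from_integer,   -- '31'
       /* TRIM with no other arguments strips leading and trailing spaces */
       TRIM('   Cincinnati, OH  ')          AS trimmed,            -- 'Cincinnati, OH'
       LENGTH(TRIM('   Cincinnati, OH  '))  AS trimmed_length;     -- 14
```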

### Practice Question

Write a query to change the date into the correct SQL date format. You will need to use at least **SUBSTR** and **CONCAT** to perform this operation. Once you have created a column in the correct format, use either `CAST` or `::` to convert this to a date.
```sql
SELECT date AS orig_date,
       (SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) || '-' || SUBSTR(date, 4, 2)) AS concat_date,
       (SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) || '-' || SUBSTR(date, 4, 2))::DATE AS new_date
FROM sf_crime_data;
```

orig_date | concat_date | new_date
:---------|:------------|:--------
01/31/2014 08:00:00 AM +0000 | 2014-01-31 | 2014-01-31T00:00:00.000Z

## COALESCE Function

> **COALESCE** returns the first non-null value among its arguments, evaluated for each row.

**Example:**
```sql
SELECT *,
  /* Use COALESCE to replace NULL values */
  COALESCE(primary_poc, 'no POC') AS primary_poc_modified
FROM accounts
WHERE primary_poc IS NULL
```

id | name | primary_poc | primary_poc_modified
:--|:-----|:------------|:--------------------
1501 | Intel |          | no POC
2131 | USAA  |          | no POC

This is most valuable when working with a function that treats nulls differently from zero such as a count or an average.
```sql
SELECT COUNT(primary_poc) AS regular_count,
       COUNT(COALESCE(primary_poc, 'no POC')) AS modified_count
FROM accounts
```

regular_count | modified_count
:-------------|:--------------
 345          | 354


<i>Using **COALESCE**, we filled the null values and now get a value in every cell and therefore a higher count</i>.
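
The same effect applies to averages. The following is a minimal sketch with a hypothetical three-row `sample` CTE, showing that `AVG` skips NULLs unless they are first replaced with a value such as zero.

```sql
-- Hypothetical amounts: one of the three values is missing.
WITH sample(amount) AS (
  VALUES (10), (NULL), (20)
)
SELECT COUNT(amount)              AS non_null_count,      -- 2
       COUNT(COALESCE(amount, 0)) AS filled_count,        -- 3
       AVG(amount)                AS avg_ignoring_nulls,  -- 15
       AVG(COALESCE(amount, 0))   AS avg_nulls_as_zero    -- 10
FROM sample;
```

Whether a NULL should count as zero depends on what the missing value means in the data, so the replacement is a modeling choice rather than a default.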

## Recap

A few other functions work similarly to **COALESCE**. You can read more about those <a href='https://www.w3schools.com/sql/sql_isnull.asp' target='_blank'>here</a>. You can also get a walkthrough of many of the functions you have seen throughout this lesson <a href='https://mode.com/resources/sql-tutorial/sql-string-functions-for-cleaning' target='_blank'>here</a>.