## Data Types and Casting

Let's talk about data types. Databases differ in which types they support: SQLite, for example, has no dedicated date or timestamp type and stores such values as text or numbers. The database we are currently using, Postgres, supports many types, from timestamps to varchar arrays.

This exercise will get you up to speed on understanding data types and learning how to cast from one type to another.

In [None]:
# Load the SQL magic extension
%load_ext sql
# Connect to the default database (using SQLAlchemy)
%sql postgresql://localhost/postgres
# Truncate output of your queries so that it's not blowing up the notebook
%config SqlMagic.displaylimit = 10

### Overview of Data Types

There are several different types of data. A few important ones are numbers, text, and collections. In database terms, these are referred to as integers and doubles for numbers, strings for text, and arrays for collections. Most of the time, when you look at a column of data, you can probably guess its data type.

We'll go over a quick example of how to cast data types to different data types before going into a few specific data types (strings and dates in particular). 

Take a quick look at the `actor` table.

In [None]:
%%sql
-- What are the data types for actor_id and first_name?
select * from actor limit 5;

To double-check, we can confirm our suspicions by querying `information_schema.columns`. Some systems also offer a user interface layered on top of the database (e.g. Toad for Oracle) that makes this information easier to browse.

In [None]:
%%sql
select 
  column_name, 
  data_type 
from 
  information_schema.columns
where 
  table_name = 'actor';

### Casting

Data types matter whenever we compare two values (e.g. checking equality or ordering). In some systems, you can compare integers to strings because the backend automatically converts the numbers to strings.

However, in other systems (Postgres is generally one of them), we have to cast values to the appropriate data type before we can perform conditional logic on them.

See the example below: 

In [None]:
%%sql
-- Try checking if last_update is greater than last_name 
-- (weird comparison, but for the sake of this example let's do it)
-- Enter the query here! What kind of error do you get?

To make the above query work, we need the function [`cast()`](https://docs.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql?view=sql-server-2017) along with the data type we want the value converted to. (The link points to SQL Server documentation; the `cast(value as type)` syntax is the same in Postgres.)

In [None]:
%%sql
-- To cast a variable to another type, you need to use the cast() function
-- But note that comparing last_update to last_name might not make sense!
select * from actor where cast(last_update as varchar) < last_name limit 5;
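The same idea can be sketched in Python, the notebook's host language. This is only an analogy, not how Postgres works internally: comparing a timestamp with a string fails until the timestamp is cast to text. The sample values and the function names `compare_without_cast` and `compare_with_cast` are made up for illustration.

```python
from datetime import datetime

# Hypothetical sample values standing in for one row of the actor table.
last_update = datetime(2006, 2, 15, 4, 34, 33)
last_name = "GUINESS"


def compare_without_cast():
    """Comparing a timestamp with a string directly is a type error."""
    try:
        return last_update < last_name
    except TypeError as exc:
        return f"error: {exc}"


def compare_with_cast():
    """Casting the timestamp to text first makes the comparison legal."""
    return str(last_update) < last_name


print(compare_without_cast())  # reports the type mismatch
print(compare_with_cast())     # compares '2006-02-15 04:34:33' with 'GUINESS' as text
```

After the cast, the comparison is purely textual, which is why the result is legal but rarely meaningful.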

### Strings
Strings in SQL work much like strings in other programming languages. A string is a sequence of characters, either as a constant or as a variable. Strings are used to represent text, but values that look like numbers or dates may still be stored as string types.
     
**Example Strings:** "String", "Str!nG2!", "34578", "U84-32-12-44", "2018-12-10", "Employees of CDL have the best socks."

You can hardcode any string as a field:

In [None]:
%%sql
-- How do you manually code a string as a field?
-- Note that in Postgres, string literals use single quotes; double quotes denote identifiers,
-- so this query fails until you edit it accordingly
select
  "banana" as banana_column
from
  actor
limit 5

### Substrings
When working with strings, you may want to investigate certain portions of strings in your table (AKA substrings). For example, **“Cascade”** is a substring of **“Cascade Data Labs”**.  
  
There are many functions in SQL used to pull substrings. The most common are:

[`left()`](https://www.techonthenet.com/sql_server/functions/left.php) pulls a substring starting from the left side <br>
[`right()`](https://www.techonthenet.com/sql_server/functions/right.php) pulls a substring starting from the right side <br>
[`substring()`](https://www.techonthenet.com/sql_server/functions/substring.php) pulls a substring starting from anywhere (as long as you specify the start position and the length) <br>
[`concat()`](https://www.techonthenet.com/sql_server/functions/concat.php) combines any number of strings in the order that they are passed into the function

In [None]:
%%sql
-- Transform the actor's names into billing and hipster names
-- 1. First initial and last name ex. J. Zhu
-- 2. First name and last name initial ex. Julie Z.
select
  concat(left(first_name, 1), '. ', last_name) as billing_name,
  concat(first_name, ' ', left(last_name, 1), '.') as hipster_name,
  first_name,
  last_name
from
  actor
limit 5;

In [None]:
%%sql
-- Try using `substring()` to take only the 3rd and 4th characters of first_name
-- The arguments are (string, start position, length); positions start at 1
-- Note what it does to shorter strings like 'Ed'
select
  substring(first_name, 3, 2) as third_and_fourth,
  first_name
from
  actor
limit 5
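As a rough mental model, the three position-based functions can be mimicked with Python slicing. The helper names `sql_left`, `sql_right`, and `sql_substring` are hypothetical, and the sketch assumes non-negative counts and a start position of at least 1; it does not reproduce Postgres's handling of negative or zero arguments.

```python
def sql_left(text, n):
    """First n characters of text (assumes n >= 0)."""
    return text[:n]


def sql_right(text, n):
    """Last n characters of text (assumes n >= 0)."""
    return text[-n:] if n > 0 else ""


def sql_substring(text, start, length):
    """`length` characters beginning at 1-based position `start` (assumes start >= 1)."""
    begin = start - 1  # SQL positions are 1-based, Python indexes are 0-based
    return text[begin:begin + length]


for name in ["Julie", "Ed"]:
    print(
        name,
        repr(sql_left(name, 1)),
        repr(sql_right(name, 2)),
        repr(sql_substring(name, 3, 2)),
    )
```

A name shorter than the start position, such as 'Ed', yields an empty string rather than an error.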

Substring magic will be very useful when you have coded campaign IDs running through a system, each with multiple pieces of information encoded in it.

### String Comparisons

If you want to compare two strings, how do you know which string is greater? Run the following query and make a guess as to how strings are compared. Is it by length or by character?

In [None]:
%%sql
select
  'a' > 'b' as ab_comp,
  'b' > '0' as b0_comp,
  '9' > '0' as num_comp,
  'z9sfw' > '4afes' as z4_comp
from
  actor
limit 5;
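One way to check your guess is to spell the comparison out by hand. The sketch below (with the hypothetical helper `compare_strings`) compares character by character using code points. Postgres actually orders strings according to the database collation, so this corresponds most closely to the "C" collation; other collations can order mixed-case or accented text differently.

```python
def compare_strings(a, b):
    """Return True if a > b, comparing one character at a time by code point."""
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            # The first differing position decides the result.
            return ord(ch_a) > ord(ch_b)
    # All shared positions are equal, so the longer string is the greater one.
    return len(a) > len(b)


pairs = [("a", "b"), ("b", "0"), ("9", "0"), ("z9sfw", "4afes")]
for left_value, right_value in pairs:
    print(left_value, ">", right_value, "is", compare_strings(left_value, right_value))
```

Length only matters as a tie-breaker when one string is a prefix of the other.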

### String Matching

When working with data, you often want to filter your output to certain instances of strings (once you get to joins, you'll see why string matching is super important for join keys too). 

In the case of this DVD rental store, we want to filter our current film catalog to retrieve certain pieces of information for the customer (Karent prefers paper catalogs, but we are an eco-friendly DVD rental store with databases and technology). We will investigate the different methods of string matching through the `film` table.

Common string cleaning functions: <br>
[`upper()`](https://www.techonthenet.com/sql_server/functions/upper.php) takes a string and makes all the letters uppercase. <br>
[`lower()`](https://www.techonthenet.com/sql_server/functions/lower.php) takes a string and makes all of the letters lowercase. <br>
[`trim()`](https://www.techonthenet.com/sql_server/functions/trim.php) takes a string and removes leading and trailing spaces. <br>

In [None]:
%%sql
-- Transform actor first_name to be all uppercase in one column and all lowercase in another column
-- Enter your query here!
select
  upper(first_name) as upper_first_name,
  lower(first_name) as lower_first_name
from
  actor
limit 5;

Note that whitespace in strings affects the outcome of any string comparison. This kind of messy data is very common in production environments, so be sure to trim strings before checking for equality or other conditions.

In [None]:
%%sql
select
  ' banana' = 'banana' as banana_truths,
  trim(' banana') = 'banana' as trimmed_bananas
from
  actor
limit 1

In [None]:
%%sql
-- Select the actor with the name Laura Verhulst 
-- Laura likes her last name capitalized in a certain way...but you can't ask her. 
-- You gotta use string cleaning functions!
select
  *
from
  actor
where
  upper(first_name) = 'LAURA' and
  upper(last_name) = 'VERHULST'
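The cleaning steps combine naturally into one normalization step applied before comparing. Here is a minimal Python sketch of that idea, assuming a hypothetical helper `normalize` and made-up sample values; `strip(" ")` is used because Postgres's `trim()` removes only spaces by default.

```python
def normalize(value):
    """Rough equivalent of upper(trim(value)): strip spaces, then uppercase."""
    return value.strip(" ").upper()


# Hypothetical messy inputs, not rows from the actor table.
stored = ["Laura", " laura ", "LAURA", "Lauren"]
matches = [value for value in stored if normalize(value) == "LAURA"]
print(matches)
```

Normalizing the stored value rather than guessing its capitalization means one comparison covers every variant.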

### Wildcards and Pattern Matching

What happens when you are trying to look for a pattern in a string but it's all over the place and not in a set position? It will be difficult to use position-based functions like `right()`, `left()`, and `substring()`. Two solutions here:

* Wildcards
* Regular Expressions

In general, wildcards are more readable and easier to understand in code. Regular expressions are a clean way to express convoluted matching logic in SQL, but they require the reader to have a good command of regex. We will go through both in this section.

**Wildcards and Like Clauses**  

Wildcards are useful when you are searching for a particular keyword or phrase. For example, a customer wants to know all of the films related to Africa. In this case, a quick search through the database for the word "Africa" would lead to some good preliminary results. However, with a strict equality such as `title = 'Africa'`, we would miss every title that contains the word alongside other text.

In order to get all films with the word "Africa" in it, we need the `like` clause, which allows for pattern matching. The `like` clause uses wildcards to identify patterns. There are two wildcards used in conjunction with the `like` operator:   

1. `%` The percent sign represents zero, one, or multiple characters
2. `_` The underscore represents a single character
  
**Tip**: When pattern matching, make your search pattern as specific as possible. The more general your pattern is, the more likely it will return unexpected outputs.

In [None]:
%%sql
select film_id, title from film where title = 'Africa';

In [None]:
%%sql
select film_id, title from film where title like '%africa%'

In [None]:
%%sql
select film_id, title from film where lower(title) like '%africa%'
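To see why the three queries above behave differently, it can help to model `like` in Python. The sketch below translates a pattern into an anchored, case-sensitive regular expression. The names `like_to_regex` and `sql_like` are hypothetical, the titles are made-up sample values, and escape characters in patterns are not handled.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern into a compiled regex (no escape-character support)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # zero or more characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def sql_like(value, pattern):
    """LIKE must match the whole value and is case-sensitive."""
    return like_to_regex(pattern).fullmatch(value) is not None


title = "AFRICAN EGG"  # hypothetical sample title
print(sql_like(title, "Africa"))             # plain equality-style pattern
print(sql_like(title, "%africa%"))           # wrong case
print(sql_like(title.lower(), "%africa%"))   # lowercased first
```

Only the last call matches: the pattern has to cover the whole value, and the case has to agree, which is why the final query wraps the column in `lower()`.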

**Regular Expressions**

Regular expressions are another way to specify patterns in strings. They allow a greater range of patterns than the `like` operator.

For example:

`t(a|i)n`

matches a 't', followed by either 'a' or 'i', followed by an 'n'. The only strings that match the pattern in full are 'tan' and 'tin'; used as an unanchored search, it also matches any longer string that contains one of them. If you would like more resources on writing regular expressions, [click here](https://docs.oracle.com/cd/B13789_01/appdev.101/b10795/adfns_re.htm).

Let's go through a small example. 

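`regexp_matches()` returns the substrings that matched the pattern as a text array (and no rows at all when nothing matches). If you only need a boolean, use the `~` operator instead, e.g. `'abcAccABC' ~ '[abc]{3}'`.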

In [None]:
%%sql
-- Pick a string to match in the first argument
-- Write your regular expression
select regexp_matches('abcAccABC', '[abc]{3}');
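If you want to experiment with these patterns outside the database, Python's `re` module is a convenient scratchpad. Its regex flavor is not identical to Postgres's, but the simple patterns used here should behave the same way in both. The word list is made up for illustration.

```python
import re

words = ["tan", "tin", "ton", "satin"]

# Whole-string match: only 'tan' and 'tin' satisfy t(a|i)n exactly.
exact = [w for w in words if re.fullmatch(r"t(a|i)n", w)]

# Unanchored search: 'satin' also qualifies because it contains 'tin'.
contains = [w for w in words if re.search(r"t(a|i)n", w)]

# Three consecutive lowercase characters from the set a, b, c.
first = re.search(r"[abc]{3}", "abcAccABC")

# Ignoring case lets every three-character chunk of the sample match.
any_case = re.findall(r"[abc]{3}", "abcAccABC", re.IGNORECASE)

print(exact)
print(contains)
print(first.group(0))
print(any_case)
```

Comparing `exact` with `contains` shows how much anchoring changes what a pattern accepts.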