## Converting Data

- Convert data stored in a `PostgreSQL` database from one data type to another. 
- Explore the expressions needed to convert `text` to `numeric` types and how to format `strings` for `temporal` data.



In [391]:
cursor.execute("""SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'""")
print('Table in Database:\n')
for table in cursor.fetchall():
    print(table)

Table in Database:

('film_permit',)
('parking_violation',)


### Type conversion with a CASE clause
- Each `parking_violation` record includes the vehicle's location relative to the street address of the violation.
    - An `F` value in the `violation_in_front_of_or_opposite` column indicates the vehicle was in front of the recorded address.
    - An `O` value indicates the vehicle was on the opposite side of the street.
    - The column stores these values as `TEXT`.
    - The same information could be captured with a `BOOLEAN` (true/false) value, which takes less space.

- **Task:** convert `violation_in_front_of_or_opposite` to a `BOOLEAN` column named `is_violation_in_front` using a `CASE` clause.
    - `true` for records that occur in front of the recorded address,
    - `false` for records that occur opposite the recorded address, and
    - `NULL` for any other value.

In [392]:
%%sql
SELECT
    CASE WHEN
            -- Use true when column value is 'F'
            violation_in_front_of_or_opposite = 'F' THEN true
         WHEN
            -- Use false when column value is 'O'
            violation_in_front_of_or_opposite = 'O' THEN false
         ELSE
            NULL
    END AS is_violation_in_front
FROM
  parking_violation
LIMIT 10;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
10 rows affected.


is_violation_in_front
True
True
True
True
True
True
False
True
False
True
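
The same `CASE` mapping can be tried without the full dataset. The sketch below is a minimal illustration that uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The sample rows are hypothetical, and because SQLite has no `BOOLEAN` type, `1` and `0` stand in for `true` and `false`. It shows that an unexpected code and a `NULL` input both fall through to the `ELSE` branch.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parking_violation (violation_in_front_of_or_opposite TEXT)"
)
# Hypothetical sample values: two valid codes, one unexpected code, one NULL.
conn.executemany(
    "INSERT INTO parking_violation VALUES (?)",
    [("F",), ("O",), ("X",), (None,)],
)

rows = conn.execute(
    """
    SELECT
        violation_in_front_of_or_opposite,
        CASE
            WHEN violation_in_front_of_or_opposite = 'F' THEN 1  -- stands in for true
            WHEN violation_in_front_of_or_opposite = 'O' THEN 0  -- stands in for false
            ELSE NULL
        END AS is_violation_in_front
    FROM parking_violation
    ORDER BY rowid
    """
).fetchall()
conn.close()

for code, flag in rows:
    print(code, flag)
```

Keeping the explicit `ELSE NULL` makes it visible that values other than `F` and `O` are not silently treated as `false`.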


### Applying aggregate functions to converted values
- Converting a column's values from `TEXT` to a numeric type allows calculations with aggregate functions.
- The `summons_number` column is of type `TEXT` in the `parking_violation` dataset.
- The maximum (`MAX(summons_number)`) and minimum (`MIN(summons_number)`) of the `TEXT` representation can be calculated, although `TEXT` values are compared as strings rather than as numbers.
- However, the size of the range (max - min) of `summons_number` values cannot be calculated, because subtraction is not defined for `TEXT` types.

> => Converting `summons_number` to `BIGINT` resolves this problem.



In [393]:

%%sql
SELECT
        MAX(summons_number::BIGINT) - MIN(summons_number::BIGINT) AS range_size
FROM
      parking_violation;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
1 rows affected.


range_size
2954656568
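
To see why the cast matters, the following sketch compares string-based and number-based aggregation on a few hypothetical summons numbers. It again uses Python's `sqlite3` module as a stand-in for PostgreSQL, so the cast is written as `CAST(... AS INTEGER)` instead of `::BIGINT`. The value `'999'` is invented for illustration, to show how a shorter string sorts after longer ones.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parking_violation (summons_number TEXT)")
# Hypothetical values; '999' is deliberately shorter than the others.
conn.executemany(
    "INSERT INTO parking_violation VALUES (?)",
    [("999",), ("1447152396",), ("1447153352",)],
)

text_max, text_min, range_size = conn.execute(
    """
    SELECT
        MAX(summons_number),                       -- string comparison
        MIN(summons_number),                       -- string comparison
        MAX(CAST(summons_number AS INTEGER))
            - MIN(CAST(summons_number AS INTEGER)) -- numeric comparison
    FROM parking_violation
    """
).fetchone()
conn.close()

print("text max:", text_max)
print("text min:", text_min)
print("numeric range:", range_size)
```

With string comparison, `'999'` is the maximum because `'9'` sorts after `'1'`; after the cast, it becomes the minimum and the range is computed numerically.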


### Date parsing and formatting
#### Cleaning invalid dates
- The `date_first_observed` column in the `parking_violation` dataset represents the date when the parking violation was first observed by the individual recording the violation. 
- Not all `date_first_observed` values were recorded properly. Some records contain `0` in this column, which cannot be interpreted as a `DATE` automatically because its meaning in this context is ambiguous. These values must be cleaned before the column can be converted to a proper `DATE` column.

 > => Convert each `0` in `date_first_observed` to `NULL` with `NULLIF` before converting the column to `DATE`.

In [394]:
%%sql
SELECT date_first_observed, count(*)
FROM parking_violation
GROUP BY date_first_observed
ORDER BY count DESC

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
33 rows affected.


date_first_observed,count
0,4919
20190707,7
20190625,6
20190702,6
20190621,6
20190701,4
20190614,4
20190628,3
20190706,3
20190630,3


In [395]:
%%sql
WITH sub AS (
    SELECT
        -- NULLIF turns the '0' placeholder into NULL; DATE() parses the remaining values
        DATE(NULLIF(date_first_observed, '0')) AS date_first_observed
    FROM
        parking_violation
)

SELECT date_first_observed, COUNT(*) AS count 
FROM sub 
GROUP BY date_first_observed
ORDER BY count DESC


 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
33 rows affected.


date_first_observed,count
,4919
2019-07-07,7
2019-07-02,6
2019-06-25,6
2019-06-21,6
2019-06-14,4
2019-07-01,4
2019-06-26,3
2019-06-30,3
2019-06-28,3
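
The cleaning step (treat `0` as missing, parse everything else as a `YYYYMMDD` date) can be sketched in plain Python. This is only an analogue of `DATE(NULLIF(date_first_observed, '0'))`, not the database's own implementation; the helper name `parse_observed_date` is hypothetical.

```python
from datetime import date, datetime
from typing import Optional


def parse_observed_date(raw: Optional[str]) -> Optional[date]:
    """Return None for a missing value or the '0' placeholder, else parse YYYYMMDD."""
    if raw is None or raw == "0":
        return None  # plays the role of NULLIF(raw, '0')
    return datetime.strptime(raw, "%Y%m%d").date()  # plays the role of DATE(...)


# Sample values taken from the query output above.
samples = ["0", "20190707", "20190625"]
parsed = [parse_observed_date(s) for s in samples]
print(parsed)
```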


In [396]:
%%sql
SELECT
    summons_number,
    DATE(issue_date) AS issue_date,
    DATE(NULLIF(date_first_observed,'0')) AS date_first_observed
FROM
  parking_violation
LIMIT 10;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
10 rows affected.


summons_number,issue_date,date_first_observed
1447152396,2019-06-28,
1447152402,2019-06-28,
1447152554,2019-06-16,
1447152580,2019-06-24,
1447152724,2019-07-06,
1447152992,2019-06-14,
1447153315,2019-06-14,
1447153327,2019-06-14,
1447153340,2019-06-28,
1447153352,2019-07-06,


In [397]:
%%sql
SELECT
  summons_number,
    TO_CHAR(issue_date, 'YYYYMMDD') AS issue_date,
    TO_CHAR(date_first_observed, 'YYYYMMDD') AS date_first_observed
FROM (
  SELECT
    summons_number,
    DATE(issue_date) AS issue_date,
    DATE(NULLIF(date_first_observed,'0')) AS date_first_observed
  FROM
    parking_violation
) sub

LIMIT 10;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
10 rows affected.


summons_number,issue_date,date_first_observed
1447152396,20190628,
1447152402,20190628,
1447152554,20190616,
1447152580,20190624,
1447152724,20190706,
1447152992,20190614,
1447153315,20190614,
1447153327,20190614,
1447153340,20190628,
1447153352,20190706,


### Timestamp parsing and formatting


In [398]:
%%sql
SELECT
    summons_number,
    violation_time,
    -- Rebuild the value as 12:MM PM, keeping the minutes (characters 3-4)
    CONCAT(12, substr(violation_time, 3, 2), 'P') AS violation_time2
FROM parking_violation
-- Keep only times whose first digit is 2
WHERE violation_time SIMILAR TO '2%'
ORDER BY summons_number
LIMIT 10;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
1 rows affected.


summons_number,violation_time,violation_time2
1309081189,2450A,1250P


In [405]:
%%sql
 WITH sub AS (SELECT 
    summons_number, violation_time,
    CONCAT(12,substr(violation_time,3,2),'P') AS violation_time1,
    CONCAT(12,substr(violation_time,3,2),'P') AS violation_time2
FROM parking_violation
WHERE --violation_time NOT SIMILAR TO '00%P'  AND
      violation_time SIMILAR TO '00%A'
      --violation_time SIMILAR TO '2%'
             )

UPDATE parking_violation pv
SET 
    violation_time = violation_time1
    --violation_time = violation_time2
FROM sub

WHERE pv.summons_number = sub.summons_number;

SELECT summons_number, violation_time
FROM parking_violation
WHERE summons_number LIKE '1309081189'
LIMIT 10;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
45 rows affected.
1 rows affected.


summons_number,violation_time
1309081189,1250P


In [406]:
%%sql
SELECT 
    violation_time,
    violation_time1
FROM
        (SELECT 
            summons_number, violation_time,
            CONCAT(violation_time,'M') AS violation_time1
        FROM parking_violation
        WHERE violation_time SIMILAR TO '%[A-Z]' 
            
        )sub
        
 WHERE violation_time1 SIMILAR TO '00%'
    OR violation_time SIMILAR TO '1[3-9]%'
    OR violation_time SIMILAR TO '2[0-4]%';

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
3 rows affected.


violation_time,violation_time1
0059P,0059PM
1955P,1955PM
1634P,1634PM


In [407]:
%%sql
SELECT
    violation_time,
    -- Parse 12-hour 'HHMM' followed by 'AM'/'PM', then keep only the time of day
    TO_TIMESTAMP(violation_time1, 'HH12MIPM')::TIME AS violation_time1
FROM   (SELECT 
            summons_number,
            violation_time,
            CONCAT(violation_time,'M') AS violation_time1
        FROM 
            (SELECT 
                summons_number,
                -- Hand-correct the three malformed values identified earlier
                CASE 
                    WHEN violation_time ='0059P' THEN '1259P'
                    WHEN violation_time ='1955P' THEN '0755P'
                    WHEN violation_time ='1634P' THEN '0434P'
                    ELSE violation_time END AS violation_time
            FROM 
               parking_violation)sub1
        WHERE violation_time SIMILAR TO '%[A-Z]' 
              AND violation_time IS NOT NULL) sub2
WHERE violation_time1 IS NOT NULL
LIMIT 20;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
20 rows affected.


violation_time,violation_time1
1000A,10:00:00
1011A,10:11:00
0107A,01:07:00
0300A,03:00:00
0653A,06:53:00
0515P,17:15:00
0524P,17:24:00
0601P,18:01:00
0935A,09:35:00
1217P,12:17:00
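
The parsing step can be mirrored in Python to check individual values by hand. The sketch below is an analogue of `TO_TIMESTAMP(..., 'HH12MIPM')::TIME`, not PostgreSQL's own logic. The helper name `parse_violation_time` is hypothetical, and `%p` is assumed to recognise `AM`/`PM`, which holds for the default C and English locales.

```python
from datetime import datetime, time


def parse_violation_time(raw: str) -> time:
    """Parse 'HHMM' followed by 'A' or 'P' (e.g. '0515P') into a time of day."""
    # Append 'M' so the suffix reads 'AM' or 'PM', as CONCAT does in the query.
    return datetime.strptime(raw + "M", "%I%M%p").time()


# Sample values taken from the query output above.
for raw in ["1000A", "0515P", "1217P"]:
    print(raw, parse_violation_time(raw))

# A 12-hour format rejects hour values outside 01-12, which is why values
# such as '0059P' have to be corrected before parsing.
try:
    parse_violation_time("0059P")
except ValueError as exc:
    print("rejected:", exc)
```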


In [411]:
%%sql
SELECT
  -- Populate column with violation_time hours
  EXTRACT('hour' FROM violation_time) AS hour,
  COUNT(*)
FROM (
    SELECT
      TO_TIMESTAMP(violation_time1, 'HH12MIPM')::TIME as violation_time
    FROM
      (SELECT 
            summons_number,
            violation_time,
            CONCAT(violation_time,'M') AS violation_time1
        FROM 
            (SELECT 
                summons_number,
                -- Hand-correct the three malformed values identified earlier
                CASE 
                    WHEN violation_time ='0059P' THEN '1259P'
                    WHEN violation_time ='1955P' THEN '0755P'
                    WHEN violation_time ='1634P' THEN '0434P'
                    ELSE violation_time END AS violation_time
            FROM 
               parking_violation)sub1
        WHERE violation_time SIMILAR TO '%[A-Z]' 
              AND violation_time IS NOT NULL) sub2
    WHERE
      violation_time IS NOT NULL
) sub
GROUP BY
  hour
ORDER BY
  hour

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
24 rows affected.


hour,count
0.0,136
1.0,242
2.0,214
3.0,149
4.0,150
5.0,122
6.0,107
7.0,145
8.0,270
9.0,319
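
Grouping by the extracted hour amounts to counting how many times fall into each hour of the day. A minimal Python analogue of `EXTRACT('hour' FROM ...)` with `GROUP BY hour`, using a handful of hypothetical times rather than the real dataset, might look like this:

```python
from collections import Counter
from datetime import time

# Hypothetical parsed violation times (not taken from the dataset).
times = [time(0, 15), time(8, 30), time(8, 45), time(17, 15)]

# t.hour plays the role of EXTRACT('hour' FROM violation_time).
hour_counts = Counter(t.hour for t in times)

# Sorting the keys mirrors ORDER BY hour.
for hour in sorted(hour_counts):
    print(hour, hour_counts[hour])
```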
