## Missing, Duplicate, and Invalid Data

- Write queries to solve common problems of missing, duplicate, and invalid data in the context of PostgreSQL database tables.


In [179]:
cursor.execute("""SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'""")
print('Table in Database:\n')
for table in cursor.fetchall():
       print(table)

Table in Database:

('film_permit',)
('parking_violation',)


### Quantifying completeness
- The records for parking violations stored in the `parking_violation` table contain missing values for the `vehicle_body_type` column. 
- Assume this data is missing completely at random (MCAR) due to human error. 
- **Task**
     - quantify how many records are missing a `vehicle_body_type`, and
     - analyze the data to choose an appropriate fill-in value to replace the missing values.





In [186]:
%%sql
SELECT count(*) from parking_violation 
WHERE vehicle_body_type IS NULL

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
1 rows affected.


count
179
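
To put a raw count like this in context, it helps to express it as a share of all rows. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption, so the example runs without a database server) and a tiny hypothetical table. It relies on `COUNT(column)` skipping NULLs while `COUNT(*)` counts every row.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parking_violation (summons_number INTEGER, vehicle_body_type TEXT)"
)
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?)",
    [(1, "SDN"), (2, None), (3, "SUBN"), (4, None), (5, "SDN")],  # hypothetical rows
)

# COUNT(*) counts all rows; COUNT(vehicle_body_type) counts only non-NULL values,
# so their difference is the number of missing values.
total, missing = conn.execute(
    """
    SELECT COUNT(*), COUNT(*) - COUNT(vehicle_body_type)
    FROM parking_violation
    """
).fetchone()

pct_missing = 100.0 * missing / total
print(f"{missing} of {total} rows are missing a body type ({pct_missing:.1f}%)")
```

For the table in this notebook, the `UPDATE` in the next section reports 5000 rows, so 179 missing values would be roughly 3.6% of the records.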


### Using a fill-in value

The sedan body type (`SDN`) is the most frequently occurring `vehicle_body_type` in the sample of parking violations. 
- One option is to change all NULL-valued `vehicle_body_type` records in the `parking_violation` table to `SDN`. 
- However, the decision here is not to use one of the existing values as a fill-in value. 
- The actual body type can be determined by looking up the vehicle using its license plate number, which is present in most `parking_violation` records. So rather than using the most frequent value to replace NULL `vehicle_body_type` values, a placeholder value of `Unknown` will be used. 
- The actual body type will be updated as license plate lookup data is gathered.

In [188]:
%%sql

UPDATE
  parking_violation
SET
  -- Replace NULL vehicle_body_type values with `Unknown`
    vehicle_body_type = COALESCE(vehicle_body_type, 'Unknown');

SELECT COUNT(*) FROM parking_violation WHERE vehicle_body_type = 'Unknown';

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
5000 rows affected.
1 rows affected.


count
179
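
The `UPDATE` above rewrites every row (hence "5000 rows affected"), because `COALESCE` simply returns the existing value for rows that are not NULL. A variant with `WHERE vehicle_body_type IS NULL` should leave the table in the same final state while touching only the missing rows. The sketch below compares the two on a small hypothetical table, again using `sqlite3` as a stand-in for PostgreSQL (an assumption); the helper `make_table` is made up for this example.

```python
import sqlite3


def make_table():
    """Build a small hypothetical parking_violation table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE parking_violation (summons_number INTEGER, vehicle_body_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO parking_violation VALUES (?, ?)",
        [(1, "SDN"), (2, None), (3, "SUBN"), (4, None)],
    )
    return conn


# Variant 1: COALESCE over all rows, as in the cell above.
conn_all = make_table()
cur = conn_all.execute(
    "UPDATE parking_violation "
    "SET vehicle_body_type = COALESCE(vehicle_body_type, 'Unknown')"
)
rows_touched_all = cur.rowcount

# Variant 2: restrict the update to rows where the value is missing.
conn_null = make_table()
cur = conn_null.execute(
    "UPDATE parking_violation "
    "SET vehicle_body_type = 'Unknown' "
    "WHERE vehicle_body_type IS NULL"
)
rows_touched_null = cur.rowcount

query = "SELECT summons_number, vehicle_body_type FROM parking_violation ORDER BY summons_number"
result_all = conn_all.execute(query).fetchall()
result_null = conn_null.execute(query).fetchall()
print(rows_touched_all, rows_touched_null, result_all == result_null)
```

Both variants end with the same data; the second only modifies the rows that actually needed a fill-in value.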


### Analyzing incomplete records

- Write a query that outputs each issuing agency along with the number of its records that have a missing `vehicle_body_type`. Because the `NULL` values were replaced in the previous step, the query filters on the `Unknown` placeholder. The results are listed in descending order of `num_missing`.
- The first row of the result is the `issuing_agency` with the most missing values.

In [189]:
%%sql
SELECT
      issuing_agency,
      COUNT(*) AS num_missing
FROM
  parking_violation
WHERE
     vehicle_body_type ='Unknown'
GROUP BY issuing_agency
ORDER BY num_missing DESC;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
4 rows affected.


issuing_agency,num_missing
P,144
X,25
K,6
S,4


### Handling duplicated data
#### Duplicate parking violations
There have been a number of complaints indicating that some New York residents have been receiving multiple parking tickets for a single violation. As a result, the affected residents incur additional legal fees for a single incident. There is justifiable anger about this situation. 

- Identify records that reflect this duplication of violations
     - by using `ROW_NUMBER()`, 
     - partitioned by `plate_id`, `issue_date`, `violation_time`, `house_number`, and `street_name`. Rows that share all five values indicate that multiple tickets were issued for the same violation.


In [190]:
%%sql
SELECT
    summons_number,
    -- Use ROW_NUMBER() to define duplicate window
    ROW_NUMBER() OVER(
        PARTITION BY
            plate_id,
            issue_date,
            violation_time,
            house_number,
            street_name
    -- Modify ROW_NUMBER() value to define duplicate column
    ) - 1 AS duplicate,
    plate_id,
    issue_date,
    violation_time,
    house_number,
    street_name
FROM
    parking_violation
ORDER BY house_number
LIMIT 20;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
20 rows affected.


summons_number,duplicate,plate_id,issue_date,violation_time,house_number,street_name
1451266200,0,KAKU08,06/13/2019,1050P,348,LINDEN RD
1449795158,0,JHN9965,06/21/2019,1230P,5,ALICE COURT
1449157865,0,GZN2983,06/23/2019,0818P,1,ORCHARD BEACH ROAD
1449011287,0,GTP2239,06/24/2019,0240P,1,RICHMOND TERR
1449229451,0,GPL7412,07/04/2019,0834P,1,ORCHARD BEACH RD
1449011240,0,GZM7128,06/24/2019,0215P,1,RICHMOND TERR
1449154554,0,GVD9888,06/15/2019,0808P,1,ORCHARD BEACH ROAD
1449201532,0,GEE7227,07/04/2019,0350P,1,ORCHARD BEACH RD
1440680577,0,GSE2927,07/04/2019,0825P,1,ORCHARD BEACH RD
1454168924,0,GKL6232,06/20/2019,0904A,1,EAST LOOP RD


In [191]:
%%sql
SELECT
    *
FROM (
    SELECT
        summons_number,
        ROW_NUMBER() OVER(
            PARTITION BY
                plate_id,
                issue_date,
                violation_time,
                house_number,
                street_name
        ) - 1 AS duplicate,
        plate_id,
        issue_date,
        violation_time,
        house_number,
        street_name
    FROM
        parking_violation
) sub
WHERE
    -- Only return records where duplicate is 1 or more
    duplicate != 0;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
52 rows affected.


summons_number,duplicate,plate_id,issue_date,violation_time,house_number,street_name
1448411579,1,GEW9007,06/30/2019,0258P,172-61,BAISLEY BLVD
1410920458,1,GEX3870,06/20/2019,1030P,1520,GRAND CONCOURSE
1446413147,1,GFD4777,06/30/2019,1214P,3543,WAYNE AVE
1448947807,1,GKX9331,06/29/2019,1030P,,S/W C/O W 45 ST
1452062286,1,GR8C1VIC,06/14/2019,0315P,,RIVERBANK STATE PARK
1449470646,1,GUC5106,07/03/2019,1035P,1060,BEACH AVE
1451262127,1,GWC4311,06/30/2019,0728P,,SURF AVE
1452186870,1,GXL4110,06/30/2019,0548P,170 01,118 RD
1449142588,1,HAT3306,06/26/2019,0938A,811,E 219 ST
1454273847,1,HDC7519,07/02/2019,0447A,811,HICKS ST
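
The window-function approach above can be tried on a few rows without the full dataset. The sketch below uses `sqlite3` as a stand-in for PostgreSQL (an assumption; window functions need SQLite 3.25 or newer) and three hypothetical tickets, two of which describe the same violation. The `ORDER BY summons_number` inside the window is an addition not present in the notebook's query; it makes the choice of which ticket counts as the original deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE parking_violation (
        summons_number INTEGER, plate_id TEXT, issue_date TEXT,
        violation_time TEXT, house_number TEXT, street_name TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?, ?, ?, ?, ?)",
    [
        (101, "AAA1111", "06/30/2019", "0258P", "10", "MAIN ST"),
        (102, "AAA1111", "06/30/2019", "0258P", "10", "MAIN ST"),  # second ticket, same violation
        (103, "BBB2222", "06/30/2019", "0900A", "22", "OAK AVE"),
    ],
)

# Number the rows inside each group of identical violations, starting at 0.
# Any row numbered 1 or higher is an extra ticket for the same violation.
duplicates = conn.execute(
    """
    SELECT summons_number, duplicate, plate_id
    FROM (
        SELECT
            summons_number,
            plate_id,
            ROW_NUMBER() OVER (
                PARTITION BY plate_id, issue_date, violation_time,
                             house_number, street_name
                ORDER BY summons_number
            ) - 1 AS duplicate
        FROM parking_violation
    ) AS sub
    WHERE duplicate > 0
    """
).fetchall()
print(duplicates)
```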


In [192]:
%%sql
ALTER TABLE parking_violation   
ADD COLUMN fee DECIMAL(5,2);

UPDATE parking_violation    
SET fee = CAST(floor(random()*(115-35+1))+35 AS DECIMAL(5,2));

INSERT INTO parking_violation    


 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
(psycopg2.errors.DuplicateColumn) column "fee" of relation "parking_violation" already exists

[SQL: ALTER TABLE parking_violation ADD COLUMN fee DECIMAL(5,2);]
(Background on this error at: http://sqlalche.me/e/f405)


In [193]:
%%sql
SELECT * FROM parking_violation ORDER BY RANDOM() LIMIT 3; 

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
3 rows affected.


summons_number,plate_id,registration_state,plate_type,issue_date,violation_code,vehicle_body_type,vehicle_make,issuing_agency,street_code1,street_code2,street_code3,vehicle_expiration_date,violation_location,violation_precinct,issuer_precint,issuer_code,issuer_command,issuer_squad,violation_time,time_first_observed,violation_county,violation_in_front_of_or_opposite,house_number,street_name,intersecting_street,date_first_observed,law_section,sub_division,violation_legal_code,days_parking_in_effect,from_hours_in_effect,to_hours_in_effect,vehicle_color,unregistred_vehicle,vehicle_year,meter_number,feet_from_curb,violation_post_code,violation_description,no_standing_or_stopping_violation,hydrant_violation,double_parking_violation,fee
1452165002,KNWS57,FL,PAS,06/26/2019,85,DELV,INTER,P,0,0,0,20191231,113,113,113,952687,113,0,0925P,0605P,Q,,,C/O 120 AVENUE,142 STREET,0,408,5,,BBBBBBB,ALL,ALL,WH,0,2011,-,0,,,,,,78.0
1442764338,HFL6978,NY,PAS,07/06/2019,48,SUBN,HONDA,P,36420,51120,27630,20200419,52,52,52,961494,52,0,0708P,,BX,F,2894.0,GRAND CONCOURSE,,0,408,E9,,BBBBBBB,ALL,ALL,WH,0,2012,-,0,,,,,,92.0
1446813319,GXS6806,NY,PAS,06/12/2019,19,SUBN,LEXUS,X,50750,49135,60570,20210406,43,43,0,161,0,0,1110P,,BX,F,1590.0,METROPOLITAN AVE,,0,408,F1,,BBBBBBB,ALL,ALL,WH,0,2008,-,0,,,,,,93.0


### Resolving partial duplicates
- The `parking_violation` dataset has been modified to include a `fee` column indicating the fee for the violation. This column would be useful for keeping track of New York City parking ticket revenue. However, because of duplicated violations, revenue calculations based on the dataset would not be accurate. 
- These duplicate records differ only in the value of the `fee` column. All other column values are shared in the duplicated records. A decision has been made to use the minimum fee to resolve the ambiguity created by these duplicates.

    - Identify the 3 duplicated `parking_violation` records and 
    - use the `MIN()` function to determine the fee that will be used after removing the duplicate records.

In [194]:
%%sql
SELECT 
    summons_number, 
    MIN(fee) AS fee
FROM 
    parking_violation 
GROUP BY
    summons_number 
HAVING 
    -- Restrict to summons numbers with count greater than 1
    COUNT(summons_number) > 1;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
0 rows affected.


summons_number,fee
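
The query returns no rows here because no `summons_number` occurs more than once in this copy of the table. To see what the `GROUP BY ... HAVING` pattern does when partial duplicates exist, here is a small sketch with hypothetical rows, using `sqlite3` as a stand-in for PostgreSQL (an assumption).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parking_violation (summons_number INTEGER, plate_id TEXT, fee REAL)"
)
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?, ?)",
    [
        (201, "AAA1111", 65.0),
        (201, "AAA1111", 45.0),   # same summons, different fee
        (202, "BBB2222", 115.0),  # not duplicated
        (203, "CCC3333", 35.0),
        (203, "CCC3333", 95.0),   # same summons, different fee
    ],
)

# Keep only summons numbers that occur more than once and
# report the minimum fee for each of them.
resolved = conn.execute(
    """
    SELECT summons_number, MIN(fee) AS fee
    FROM parking_violation
    GROUP BY summons_number
    HAVING COUNT(*) > 1
    ORDER BY summons_number
    """
).fetchall()
print(resolved)
```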


### Detecting invalid values
#### Detecting invalid values with regular expressions

- `c{n}` matches strings which contain the character c repeated n times. 
    - `x{4}` would match the pattern `xxxx`. 
- `c+` matches strings which contain the character c repeated one or more times. 
    - `x+` would match strings including `xxxx` as well as `x` and `xx`.
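
The two quantifiers can be checked quickly outside the database. The sketch below uses Python's `re` module rather than PostgreSQL (an assumption made for convenience); `fullmatch` is used because `SIMILAR TO` compares the pattern against the entire value. The sample strings are made up.

```python
import re

four_x = re.compile(r"x{4}")          # exactly four x characters
one_or_more_x = re.compile(r"x+")     # one or more x characters
two_uppercase = re.compile(r"[A-Z]{2}")  # exactly two uppercase letters

exactly_four = [s for s in ["xxx", "xxxx", "xxxxx"] if four_x.fullmatch(s)]
one_or_more = [s for s in ["", "x", "xx", "xxxx", "xy"] if one_or_more_x.fullmatch(s)]

# Values that would be flagged as invalid state codes.
invalid_states = [s for s in ["NY", "FL", "99", "ny"] if not two_uppercase.fullmatch(s)]

print(exactly_four, one_or_more, invalid_states)
```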

**Return records with a `registration_state` that does not match two consecutive uppercase letters.**

In [195]:
%%sql
SELECT
    summons_number,
    plate_id,
    registration_state
FROM
    parking_violation
WHERE
    -- Flag values that are not exactly two uppercase letters
    registration_state NOT SIMILAR TO '[A-Z]{2}'
LIMIT 20;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
20 rows affected.


summons_number,plate_id,registration_state
1448253469,KNA7381,99
1416382203,ICH2244,99
1448001973,JFM9753,99
1448004755,GPV9990,99
1448705721,GEE5941,99
1449229517,HPX3035,99
1451942450,JRX9181,99
1449470622,GUC5106,99
1449031237,K487440,99
1451377022,GGF7727,99


**Return records with a `plate_type` that does not match three consecutive *uppercase letters*.**

In [196]:
%%sql
SELECT
    summons_number,
    plate_id,
    plate_type
FROM parking_violation
WHERE
  -- Flag values that are not exactly three uppercase letters
    plate_type NOT SIMILAR TO '[A-Z]{3}';

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
19 rows affected.


summons_number,plate_id,plate_type
1399542539,JKA8821,999
1447683225,JFF2969,999
1446727464,HRC3599,999
1449031237,K487440,999
1447673712,JFN9352,999
1427619566,HKM7133,999
1449185770,JJM3360,999
1454157537,GWY8476,999
1451354812,MR2NIT,999
1452205899,JAN6540,999


In [197]:
%%sql

SELECT
  summons_number,
  plate_id,
  vehicle_make
FROM
  parking_violation
WHERE
  -- Define the pattern to use for matching
  vehicle_make NOT SIMILAR TO '[A_Z][/][\s]'
LIMIT 20;


 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
20 rows affected.


summons_number,plate_id,vehicle_make
1449316372,JGZ2243,HONDA
1427964830,KVP7902,NISSA
1449331087,HXV3146,HONDA
1449331130,JHA6583,FORD
1449334544,HJV3747,HONDA
1449335329,KYS2247,HONDA
1449339888,LBC2385,HONDA
1449341287,HZT3578,TOYOT
1449342127,JAL3068,MERCU
1447440547,JJA3864,HYUND
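
In the output above, ordinary makes such as `HONDA` and `FORD` are returned. That is consistent with the pattern as written: `[A_Z][/][\s]` describes a three-character string (one of `A`, `_`, `Z`, then `/`, then a whitespace character), and `SIMILAR TO` must match the whole value, so `NOT SIMILAR TO` returns nearly every row. If the intent was to flag makes containing anything other than uppercase letters, `/`, or whitespace (an assumption, since the cell has no description), a single character class with a quantifier would express that. The sketch below compares the two patterns using Python's `re` module rather than PostgreSQL; the sample makes `ME/BE`, `H0NDA`, and `toyota` are hypothetical.

```python
import re

# Pattern as written in the cell: three characters, [A_Z] then "/" then whitespace.
as_written = re.compile(r"[A_Z][/][\s]")
# Possible intended pattern: only uppercase letters, "/" or whitespace, one or more times.
char_class = re.compile(r"[A-Z/\s]+")

makes = ["HONDA", "FORD", "ME/BE", "H0NDA", "toyota"]

# fullmatch imitates SIMILAR TO, which has to match the entire value.
flagged_as_written = [m for m in makes if not as_written.fullmatch(m)]
flagged_char_class = [m for m in makes if not char_class.fullmatch(m)]

print(flagged_as_written)
print(flagged_char_class)
```

With the character-class pattern, only the makes containing a digit or lowercase letters are flagged.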


### Identifying out-of-range vehicle model years

- `Type constraints` are **useful for restricting the type of data** that can be stored in a table column. However, there are limitations to how thoroughly these constraints can prevent invalid data from entering the column. 
- `Range constraints` are useful when the goal is **to identify column values** that are included in a range of values or excluded from a range of values. 
- Using type constraints when defining a table, followed by checking column values with range constraints, is a powerful approach to ensuring the integrity of data.
     - Ex: use a `BETWEEN` clause to build a range constraint to identify invalid vehicle model years in the `parking_violation` table. Valid vehicle model years for this dataset are considered to be between 1970 and 2021.
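
Two details of `BETWEEN` are easy to overlook: both endpoints are included in the range, and rows where the column is NULL are not returned by either `BETWEEN` or `NOT BETWEEN`. The sketch below shows this on a hypothetical table, using `sqlite3` as a stand-in for PostgreSQL (an assumption).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parking_violation (summons_number INTEGER, vehicle_year INTEGER)")
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?)",
    [(1, 0), (2, 1969), (3, 1970), (4, 2021), (5, 2022), (6, None)],  # hypothetical rows
)

# 1970 and 2021 are inside the range, so they are not reported as invalid.
# The NULL year is not reported either, because the comparison yields NULL.
invalid_years = [
    row[0]
    for row in conn.execute(
        """
        SELECT vehicle_year
        FROM parking_violation
        WHERE vehicle_year NOT BETWEEN 1970 AND 2021
        ORDER BY vehicle_year
        """
    )
]
print(invalid_years)
```

If missing years should also count as invalid, an extra `OR vehicle_year IS NULL` condition would be needed.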

In [198]:
%%sql
SELECT
    -- Define the columns to return from the query
    summons_number,
    plate_id,
    vehicle_year
FROM
  parking_violation
WHERE
  -- Define the range constraint for invalid vehicle years
    vehicle_year NOT BETWEEN 1970 AND 2021
LIMIT 20;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
20 rows affected.


summons_number,plate_id,vehicle_year
1427964830,KVP7902,0
1449335329,KYS2247,0
1449339888,LBC2385,0
1447512777,HBF1760,0
1440219230,KMN0697,0
1449350288,ICWN41,0
1449433315,HYT4003,0
1447647075,GJS3798,0
1441840965,JJC4850,0
1447006940,HZE5092,0


### Identifying invalid parking violations
- The `parking_violation` table has three columns populated by related time values. The `from_hours_in_effect` column indicates the start time when parking restrictions are enforced at the location where the violation occurred. 
- The `to_hours_in_effect` column indicates the ending time for enforcement of parking restrictions. 
- The `violation_time` indicates the time at which the violation was recorded. 

- **Task**: use the parking restriction time range defined by `from_hours_in_effect` and `to_hours_in_effect` to identify parking tickets with an invalid `violation_time`.

In [199]:
%%sql
SELECT from_hours_in_effect FROM parking_violation
WHERE from_hours_in_effect NOT LIKE '%A' AND from_hours_in_effect NOT LIKE '%P'
    AND from_hours_in_effect != 'ALL'


 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
6 rows affected.


from_hours_in_effect
700
800
800
0
800
1130


In [200]:
%%sql

WITH sub AS (SELECT 
    summons_number, 
    REPLACE(violation_time, ' ','') AS violation_time, 
    REPLACE(from_hours_in_effect,' ','') AS from_hours_in_effect, 
    REPLACE(to_hours_in_effect,' ','') AS to_hours_in_effect,
    CASE 
        WHEN violation_time LIKE '%P' AND from_hours_in_effect LIKE '%A' THEN 'True'
        WHEN violation_time LIKE '%A' AND from_hours_in_effect LIKE '%P' THEN 'False'
        ELSE violation_time > from_hours_in_effect
        END AS violation_time_GREATER_from_hours,
    CASE 
        WHEN violation_time LIKE '%P' AND to_hours_in_effect LIKE '%A' THEN 'True'
        WHEN violation_time LIKE '%A' AND to_hours_in_effect LIKE '%P' THEN 'False'
        ELSE violation_time > to_hours_in_effect
        END AS violation_time_GREATER_to_hours,
    CASE 
        WHEN from_hours_in_effect  LIKE '%P' AND to_hours_in_effect LIKE '%A' THEN True
        WHEN from_hours_in_effect LIKE '%A' AND to_hours_in_effect LIKE '%P' THEN False
        ELSE from_hours_in_effect > to_hours_in_effect
        END AS from_hours_GREATER_to_hours
FROM 
  parking_violation)

SELECT     
    summons_number, 
    violation_time, 
    from_hours_in_effect, 
    to_hours_in_effect
FROM sub
WHERE 
    violation_time SIMILAR TO '%[A-Z]' AND
    from_hours_in_effect SIMILAR TO '%[A-Z]' AND
    to_hours_in_effect SIMILAR TO '%[A-Z]' AND
    from_hours_in_effect != to_hours_in_effect AND
    from_hours_greater_to_hours IS False AND
    violation_time != from_hours_in_effect AND
    violation_time != to_hours_in_effect AND
    ( violation_time_GREATER_from_hours IS False OR
    violation_time_GREATER_to_hours IS True)


LIMIT 100;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
48 rows affected.


summons_number,violation_time,from_hours_in_effect,to_hours_in_effect
1447676051,1204P,1100A,0200P
1447676075,1211P,1100A,0200P
1433689406,1212P,1130A,0100P
1448661419,1210P,1130A,0100P
1448774720,1146A,1200A,0500P
1448774731,1135A,1200A,0500P
1454093274,1212P,1130A,0100P
1451597095,1211P,1130A,0100P
1452144692,1229P,1130A,0100P
1452144722,1235P,1130A,0100P
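
Comparing these values as text has a pitfall around the 12 o'clock hour: `'1204P'` sorts after `'0200P'` even though 12:04 PM comes before 2:00 PM, and `'1200A'` (midnight) sorts after `'1146A'`. That would explain why rows such as `1204P` with a restriction from `1100A` to `0200P` appear in the output above although the time lies inside the window. The sketch below converts the strings to minutes since midnight before comparing. It assumes every value has the form `HHMM` followed by `A` or `P`, as in the output; the function names are made up for this example.

```python
def to_minutes(value: str) -> int:
    """Convert a string like '1204P' or '0830A' to minutes since midnight."""
    text = value.replace(" ", "")
    hour, minute, suffix = int(text[:2]), int(text[2:4]), text[4].upper()
    hour %= 12           # 12xxA becomes 00xx; 12xxP becomes 12xx after the shift below
    if suffix == "P":
        hour += 12
    return hour * 60 + minute


def outside_daytime_window(violation: str, start: str, end: str) -> bool:
    """True when the ticket time is outside a same-day restriction window."""
    v, s, e = to_minutes(violation), to_minutes(start), to_minutes(end)
    return not (s <= v <= e)


# Rows taken from the output above.
print(outside_daytime_window("1204P", "1100A", "0200P"))
print(outside_daytime_window("1146A", "1200A", "0500P"))
# Hypothetical ticket written well after the restriction ended.
print(outside_daytime_window("0900P", "1100A", "0200P"))
```

By this calculation the two rows from the output fall inside their restriction windows, so they would not count as invalid tickets.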


### Invalid violations with overnight parking restrictions

In [201]:
%%sql

WITH sub AS (SELECT 
    summons_number, 
    REPLACE(violation_time, ' ','') AS violation_time, 
    REPLACE(from_hours_in_effect,' ','') AS from_hours_in_effect, 
    REPLACE(to_hours_in_effect,' ','') AS to_hours_in_effect,
    CASE 
        WHEN violation_time LIKE '%P' AND from_hours_in_effect LIKE '%A' THEN 'True'
        WHEN violation_time LIKE '%A' AND from_hours_in_effect LIKE '%P' THEN 'False'
        ELSE violation_time > from_hours_in_effect
        END AS violation_time_GREATER_from_hours,
    CASE 
        WHEN violation_time LIKE '%P' AND to_hours_in_effect LIKE '%A' THEN 'True'
        WHEN violation_time LIKE '%A' AND to_hours_in_effect LIKE '%P' THEN 'False'
        ELSE violation_time > to_hours_in_effect
        END AS violation_time_GREATER_to_hours,
    CASE 
        WHEN from_hours_in_effect  LIKE '%P' AND to_hours_in_effect LIKE '%A' THEN True
        WHEN from_hours_in_effect LIKE '%A' AND to_hours_in_effect LIKE '%P' THEN False
        ELSE from_hours_in_effect > to_hours_in_effect
        END AS from_hours_GREATER_to_hours
FROM 
  parking_violation)

SELECT     
    summons_number, 
    violation_time, 
    from_hours_in_effect, 
    to_hours_in_effect
FROM sub
WHERE 
    violation_time SIMILAR TO '%[A-Z]' AND
    from_hours_in_effect SIMILAR TO '%[A-Z]' AND
    to_hours_in_effect SIMILAR TO '%[A-Z]' AND
    
    from_hours_greater_to_hours IS True AND
    from_hours_in_effect != to_hours_in_effect AND
    
    violation_time != from_hours_in_effect AND
    violation_time != to_hours_in_effect AND
    (violation_time_GREATER_from_hours IS False AND
    violation_time_GREATER_to_hours IS True)
    

LIMIT 100;


 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
4 rows affected.


summons_number,violation_time,from_hours_in_effect,to_hours_in_effect
1454169692,0309P,0700P,0700A
1448533995,1209A,1100P,0600A
1449035772,1230A,1000P,0600A
1451849140,1209A,0800P,0600A
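
For an overnight restriction the start time is later in the day than the end time, so a ticket is outside the window only when its time falls strictly between the end and the start. The sketch below applies that rule to the four rows above after converting each value to minutes since midnight. It assumes the `HHMM` plus `A`/`P` format seen in the output; the function names are made up for this example.

```python
def to_minutes(value: str) -> int:
    """Convert a string like '1209A' or '0700P' to minutes since midnight."""
    text = value.replace(" ", "")
    hour, minute, suffix = int(text[:2]), int(text[2:4]), text[4].upper()
    hour %= 12           # 12xxA becomes 00xx; 12xxP becomes 12xx after the shift below
    if suffix == "P":
        hour += 12
    return hour * 60 + minute


def outside_overnight_window(violation: str, start: str, end: str) -> bool:
    """True when the ticket time is outside a window that wraps past midnight."""
    v, s, e = to_minutes(violation), to_minutes(start), to_minutes(end)
    # The restriction runs from s to midnight and from midnight to e,
    # so the unrestricted part of the day is the gap between e and s.
    return e < v < s


rows = [
    ("0309P", "0700P", "0700A"),
    ("1209A", "1100P", "0600A"),
    ("1230A", "1000P", "0600A"),
    ("1209A", "0800P", "0600A"),
]
flags = [outside_overnight_window(*row) for row in rows]
print(flags)
```

Under these assumptions only the first row is outside its restriction window. The three tickets written shortly after midnight fall inside theirs, which suggests the text comparison in the query mishandles `12xxA` times.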
