## Missing, Duplicate, and Invalid Data

- Write queries to solve common problems of missing, duplicate, and invalid data in PostgreSQL database tables.


In [112]:
cursor.execute("""SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'""")
print('Table in Database:\n')
for table in cursor.fetchall():
       print(table)

Table in Database:

('film_permit',)
('parking_violation',)


In [113]:
%%sql
SELECT * FROM parking_violation LIMIT 10;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
10 rows affected.


summons_number,plate_id,registration_state,plate_type,issue_date,violation_code,vehicle_body_type,vehicle_make,issuing_agency,street_code1,street_code2,street_code3,vehicle_expiration_date,violation_location,violation_precinct,issuer_precint,issuer_code,issuer_command,issuer_squad,violation_time,time_first_observed,violation_county,violation_in_front_of_or_opposite,house_number,street_name,intersecting_street,date_first_observed,law_section,sub_division,violation_legal_code,days_parking_in_effect,from_hours_in_effect,to_hours_in_effect,vehicle_color,unregistred_vehicle,vehicle_year,meter_number,feet_from_curb,violation_post_code,violation_description,no_standing_or_stopping_violation,hydrant_violation,double_parking_violation,fee
4011361940,HZT3945,NY,PAS,06/01/2019,5,SUBN,GMC,V,0,0,0,0,,0,0,0,,,0837P,,BX,,,NB WEBSTER AVE @ E F,ORDHAM RD,0,1111,C,T,,,,GR,,2003,,0,,BUS LANE VIOLATION,,,,96.0
4011361975,GWG9276,NY,PAS,06/01/2019,5,4DSD,SUBAR,V,0,0,0,0,,0,0,0,,,0848P,,MN,,,NB 1ST AVE @ ST MARK,S PL,0,1111,C,T,,,,GY,,2015,,0,,BUS LANE VIOLATION,,,,81.0
4011362827,LFK452,SC,PAS,06/02/2019,5,SU,DODGE,V,0,0,0,0,,0,0,0,,,1105A,,BK,,,WB KINGS HIGHWAY @ M,CDONALD AVE,0,1111,C,T,,,,,,2018,,0,,BUS LANE VIOLATION,,,,107.0
1451305655,KTG3761,PA,PAS,06/15/2019,98,SUBN,NISSA,P,9780,5680,5780,0,68.0,68,68,963437,68.0,0.0,1011A,,K,F,666,64 ST,,0,408,C,,BBBBBBB,ALL,ALL,,0.0,0,-,0,,,,,,36.0
4011363534,GWB6339,NY,PAS,06/02/2019,5,SUBN,VOLVO,V,0,0,0,0,,0,0,0,,,0224P,,MN,,,NB 1ST AVE @ E 106TH,ST,0,1111,C,T,,,,GY,,2018,,0,,BUS LANE VIOLATION,,,,79.0
4011363601,L27KUU,NJ,PAS,06/02/2019,5,WAGO,INFIN,V,0,0,0,0,,0,0,0,,,0233P,,BX,,,EB E 163RD ST @ FOX,ST,0,1111,C,T,,,,BK,,2017,,0,,BUS LANE VIOLATION,,,,77.0
4011363625,JHA6490,NY,PAS,06/02/2019,5,SUBN,FORD,V,0,0,0,0,,0,0,0,,,0235P,,BX,,,EB E 163RD ST @ SIMP,SON ST,0,1111,C,T,,,,WH,,2010,,0,,BUS LANE VIOLATION,,,,94.0
1451917521,JSTY51,FL,PAS,07/03/2019,40,SDN,HYUND,P,19290,57790,8440,0,115.0,115,115,966675,115.0,0.0,0618A,,Q,F,33 31,100 ST,,0,408,C,,BBBBBBB,ALL,ALL,,0.0,0,-,0,,,,,,63.0
1454274001,HYU6693,NY,PAS,06/19/2019,14,,HONDA,P,49630,25370,23830,20200311,,0,801,959518,801.0,0.0,0427A,,K,F,811,HICKS ST,,0,408,C,,BBBBBBB,ALL,ALL,BLK,0.0,2018,-,0,,,,,,56.0
4011288408,GRK1901,NY,PAS,05/23/2019,5,4DSD,TOYOT,V,0,0,0,0,,0,0,0,,,0712A,,BK,,,NB ROGERS AVE @ ERAS,MUS ST,0,1111,C,T,,,,RD,,2009,,0,,BUS LANE VIOLATION,,,,84.0


### Quantifying completeness
- The records for parking violations stored in the `parking_violation` table contain missing values for the `vehicle_body_type` column. 
- Assume this data is missing completely at random (MCAR) due to human error. 
- **Task**
     - quantify how many records are missing a `vehicle_body_type` and
     - analyze which fill-in value is appropriate for replacing the missing values.





In [114]:
%%sql
SELECT COUNT(*) FROM parking_violation
WHERE vehicle_body_type IS NULL;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
1 rows affected.


count
179


### Using a fill-in value

The sedan body type (`SDN`) is the most frequently occurring `vehicle_body_type` in the sample parking violations, so one option is to change every `NULL` `vehicle_body_type` in the `parking_violation` table to `SDN`.
- However, an existing value will not be used as the fill-in value.
- The actual body type can be determined by looking up the vehicle using its license plate number, which is present in most `parking_violation` records. Rather than using the most frequent value to replace `NULL` `vehicle_body_type` values, a placeholder value of `Unknown` will be used.
- The actual body type will be filled in as license plate lookup data is gathered.

In [115]:
%%sql

UPDATE
  parking_violation
SET
  -- Replace NULL vehicle_body_type values with `Unknown`
    vehicle_body_type = COALESCE(vehicle_body_type, 'Unknown');

SELECT COUNT(*) FROM parking_violation WHERE vehicle_body_type = 'Unknown';

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
5000 rows affected.
1 rows affected.


count
179
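
The `UPDATE` above has no `WHERE` clause, so it rewrites all 5000 rows even though only 179 values change. A minimal sketch of limiting the update to the missing rows follows. It uses Python's built-in `sqlite3` module with an in-memory table as a stand-in for the PostgreSQL database, and the five sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the rows are made-up samples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parking_violation (summons_number INTEGER, vehicle_body_type TEXT)"
)
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?)",
    [(1, "SDN"), (2, None), (3, "SUBN"), (4, None), (5, "SDN")],
)

# Count the missing values before changing anything.
missing = conn.execute(
    "SELECT COUNT(*) FROM parking_violation WHERE vehicle_body_type IS NULL"
).fetchone()[0]

# Restrict the update to the rows that actually need the placeholder.
cur = conn.execute(
    "UPDATE parking_violation SET vehicle_body_type = 'Unknown' "
    "WHERE vehicle_body_type IS NULL"
)
updated = cur.rowcount

# Confirm that no NULL values remain.
remaining = conn.execute(
    "SELECT COUNT(*) FROM parking_violation WHERE vehicle_body_type IS NULL"
).fetchone()[0]

print(missing, updated, remaining)
```

With the `WHERE` clause, the number of rows reported as affected equals the number of missing values, which doubles as a quick sanity check.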


### Analyzing incomplete records

- Write a query that outputs each issuing agency along with the number of its records that are missing a `vehicle_body_type`, listed in descending order of that count. Because the `NULL` values were replaced in the previous step, the query filters on the `Unknown` placeholder.
- The first row of the result identifies the `issuing_agency` with the largest `num_missing`.

In [116]:
%%sql
SELECT
      issuing_agency,
      COUNT(*) AS num_missing
FROM
  parking_violation
WHERE
     vehicle_body_type ='Unknown'
GROUP BY issuing_agency
ORDER BY num_missing DESC;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
4 rows affected.


issuing_agency,num_missing
P,144
X,25
K,6
S,4


### Handling duplicated data
#### Duplicate parking violations
There have been a number of complaints that some New York residents are receiving multiple parking tickets for a single violation. As a result, the affected residents incur additional legal fees for a single incident, and there is justifiable anger about the situation.

- Identify records that reflect this duplication of violations.
     - Use `ROW_NUMBER()` to number the rows within each group of matching records.
     - Rows that share the same `plate_id`, `issue_date`, `violation_time`, `house_number`, and `street_name` indicate that multiple tickets were issued for the same violation.


In [117]:
%%sql
SELECT
      summons_number,
    -- Use ROW_NUMBER() to define duplicate window
      ROW_NUMBER() OVER(
        PARTITION BY
            plate_id, 
            issue_date, 
            violation_time, 
            house_number, 
            street_name
    -- Modify ROW_NUMBER() value to define duplicate column
      ) - 1 AS duplicate, 
    plate_id, 
    issue_date, 
    violation_time, 
    house_number, 
    street_name
FROM 
    parking_violation
ORDER BY house_number
LIMIT 20;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
20 rows affected.


summons_number,duplicate,plate_id,issue_date,violation_time,house_number,street_name
1451266200,0,KAKU08,06/13/2019,1050P,348,LINDEN RD
1449795158,0,JHN9965,06/21/2019,1230P,5,ALICE COURT
1449157865,0,GZN2983,06/23/2019,0818P,1,ORCHARD BEACH ROAD
1449011287,0,GTP2239,06/24/2019,0240P,1,RICHMOND TERR
1449229451,0,GPL7412,07/04/2019,0834P,1,ORCHARD BEACH RD
1449011240,0,GZM7128,06/24/2019,0215P,1,RICHMOND TERR
1449154554,0,GVD9888,06/15/2019,0808P,1,ORCHARD BEACH ROAD
1449201532,0,GEE7227,07/04/2019,0350P,1,ORCHARD BEACH RD
1440680577,0,GSE2927,07/04/2019,0825P,1,ORCHARD BEACH RD
1454168924,0,GKL6232,06/20/2019,0904A,1,EAST LOOP RD


In [118]:
%%sql
SELECT 
    *
FROM (
    SELECT
        summons_number,
        ROW_NUMBER() OVER(
            PARTITION BY 
                plate_id, 
                    issue_date, 
                    violation_time, 
                    house_number, 
                    street_name
            ) - 1 AS duplicate, 
            plate_id, 
            issue_date, 
            violation_time, 
            house_number, 
            street_name 
    FROM 
        parking_violation
) sub
WHERE
    -- Only return records where duplicate is 1 or more
    duplicate != 0;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
52 rows affected.


summons_number,duplicate,plate_id,issue_date,violation_time,house_number,street_name
1448411579,1,GEW9007,06/30/2019,0258P,172-61,BAISLEY BLVD
1410920458,1,GEX3870,06/20/2019,1030P,1520,GRAND CONCOURSE
1446413147,1,GFD4777,06/30/2019,1214P,3543,WAYNE AVE
1448947790,1,GKX9331,06/29/2019,1030P,,S/W C/O W 45 ST
1452062158,1,GR8C1VIC,06/14/2019,0315P,,RIVERBANK STATE PARK
1449470646,1,GUC5106,07/03/2019,1035P,1060,BEACH AVE
1451262115,1,GWC4311,06/30/2019,0728P,,SURF AVE
1452186870,1,GXL4110,06/30/2019,0548P,170 01,118 RD
1449142590,1,HAT3306,06/26/2019,0938A,811,E 219 ST
1454273847,1,HDC7519,07/02/2019,0447A,811,HICKS ST
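
To make the numbering concrete, the sketch below mimics in plain Python what `ROW_NUMBER() OVER (PARTITION BY ...) - 1` computes: each row receives its zero-based position within the group of rows that share the same partition columns. The sample rows and the helper name `number_duplicates` are invented for illustration.

```python
from collections import defaultdict

# Made-up rows: (summons_number, plate_id, issue_date, violation_time,
#                house_number, street_name)
rows = [
    (101, "ABC123", "06/30/2019", "0258P", "172-61", "BAISLEY BLVD"),
    (102, "ABC123", "06/30/2019", "0258P", "172-61", "BAISLEY BLVD"),
    (103, "XYZ789", "06/20/2019", "1030P", "1520", "GRAND CONCOURSE"),
    (104, "ABC123", "06/30/2019", "0258P", "172-61", "BAISLEY BLVD"),
]


def number_duplicates(rows):
    """Return (summons_number, duplicate) pairs, where duplicate is the
    zero-based position of the row within its partition."""
    seen = defaultdict(int)  # partition key -> rows seen so far
    numbered = []
    for summons_number, *key in rows:
        partition = tuple(key)
        numbered.append((summons_number, seen[partition]))
        seen[partition] += 1
    return numbered


numbered = number_duplicates(rows)
# Keep only the rows that repeat an earlier one, like `WHERE duplicate != 0`.
duplicates = [summons for summons, dup in numbered if dup > 0]

print(numbered)
print(duplicates)
```

The SQL window has no `ORDER BY`, so which row of a group is numbered 0 is not guaranteed. The sketch numbers rows in input order only because a Python list has one.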


In [119]:
%%sql
ALTER TABLE parking_violation   
ADD COLUMN fee DECIMAL(5,2);

UPDATE parking_violation    
SET fee = CAST(floor(random()*(115-35+1))+35 AS DECIMAL(5,2));

INSERT INTO parking_violation    


 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
(psycopg2.errors.DuplicateColumn) column "fee" of relation "parking_violation" already exists

[SQL: ALTER TABLE parking_violation ADD COLUMN fee DECIMAL(5,2);]
(Background on this error at: http://sqlalche.me/e/f405)


In [120]:
%%sql
SELECT * FROM parking_violation ORDER BY RANDOM() LIMIT 3; 

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
3 rows affected.


summons_number,plate_id,registration_state,plate_type,issue_date,violation_code,vehicle_body_type,vehicle_make,issuing_agency,street_code1,street_code2,street_code3,vehicle_expiration_date,violation_location,violation_precinct,issuer_precint,issuer_code,issuer_command,issuer_squad,violation_time,time_first_observed,violation_county,violation_in_front_of_or_opposite,house_number,street_name,intersecting_street,date_first_observed,law_section,sub_division,violation_legal_code,days_parking_in_effect,from_hours_in_effect,to_hours_in_effect,vehicle_color,unregistred_vehicle,vehicle_year,meter_number,feet_from_curb,violation_post_code,violation_description,no_standing_or_stopping_violation,hydrant_violation,double_parking_violation,fee
1403960537,HSW9600,NY,PAS,06/24/2019,24,SDN,JAGUA,O,47150,41311,46850,20190821,122,122,992,1674,992,0,0458P,,,F,777.0,SEAVIEW AVE,,0,408,J7,,BBBBBBB,ALL,ALL,BLUE,0,2010,-,0,,,,,,102.0
1449253040,GRT8494,NY,PAS,07/16/2019,50,SUBN,JEEP,P,27060,62720,0,20200609,46,46,46,963310,46,0,1040P,,BX,F,,E 180 ST,RYER AVE,0,408,F2,,BBBBBBB,ALL,ALL,SILVE,0,2012,-,0,,,,,,65.0
1449621314,JGM6049,NY,PAS,06/16/2019,46,SDN,INFIN,P,68930,75130,50830,20210317,73,73,73,954919,73,0,1207A,,K,O,2169.0,PACIFIC STREET,,0,408,F1,,BBBBBBB,ALL,ALL,GRY,0,2009,-,0,,,,,,76.0


### Resolving partial duplicates
- The `parking_violation` dataset has been modified to include a `fee` column indicating the fee for the violation. This column would be useful for keeping track of New York City parking ticket revenue. However, because of duplicated violations, revenue calculations based on the dataset would not be accurate.
- The duplicate records differ only in the value of the `fee` column; all other column values are shared. A decision has been made to use the minimum fee to resolve the ambiguity created by these duplicates.

    - Identify the 3 duplicated `parking_violation` records and
    - use the `MIN()` function to determine the fee that will be used after removing the duplicate records.

In [121]:
%%sql
SELECT 
    summons_number, 
    MIN(fee) AS fee
FROM parking_violation 
GROUP BY summons_number 
HAVING 
    COUNT(*) > 1;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
0 rows affected.


summons_number,fee
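
The query returns no rows here, which suggests that this copy of the table holds no repeated `summons_number` values. To show what the query does when such duplicates exist, here is a minimal sketch on invented sample rows, again using Python's built-in `sqlite3` as a stand-in for PostgreSQL.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the rows are made-up samples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parking_violation (summons_number INTEGER, plate_id TEXT, fee REAL)"
)
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?, ?)",
    [
        (1001, "AAA111", 65.0),
        (1001, "AAA111", 80.0),   # same ticket, different fee
        (1002, "BBB222", 45.0),   # not duplicated
        (1003, "CCC333", 115.0),
        (1003, "CCC333", 95.0),   # same ticket, different fee
    ],
)

# One row per duplicated summons number, carrying the smallest fee.
resolved = conn.execute(
    """
    SELECT summons_number, MIN(fee) AS fee
    FROM parking_violation
    GROUP BY summons_number
    HAVING COUNT(*) > 1
    ORDER BY summons_number
    """
).fetchall()

print(resolved)
```

Only the duplicated tickets appear in the result; the non-duplicated ticket 1002 is filtered out by `HAVING COUNT(*) > 1`.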


### Detecting invalid values
#### Detecting invalid values with regular expressions

- `c{n}` matches the character `c` repeated exactly `n` times.
    - `x{4}` matches `xxxx`.
- `c+` matches the character `c` repeated one or more times.
    - `x+` matches `x`, `xx`, and `xxxx`.
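
A `SIMILAR TO` pattern has to match the whole value, not just part of it. For these two quantifiers Python's `re.fullmatch` behaves the same way, so it is used below as a stand-in to try the patterns out. The helper name `matches` is invented for this sketch.

```python
import re


def matches(pattern, value):
    """True when the whole value matches the pattern (as with SIMILAR TO)."""
    return re.fullmatch(pattern, value) is not None


print(matches(r"x{4}", "xxxx"))      # exactly four x characters
print(matches(r"x{4}", "xxx"))       # one x short
print(matches(r"x+", "x"))           # one or more x characters
print(matches(r"x+", "xxxx"))
print(matches(r"[A-Z]{2}", "NY"))    # two uppercase letters
print(matches(r"[A-Z]{2}", "99"))    # digits are not uppercase letters
```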

**Return records whose `registration_state` does not match two consecutive uppercase letters.**

In [122]:
%%sql
SELECT
    summons_number,
    plate_id,
    registration_state
FROM
    parking_violation
WHERE
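    -- Define the pattern to use for matching.
    -- Note: the trailing % also accepts longer values that merely start with
    -- two uppercase letters; '[A-Z]{2}' would require exactly two.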
    registration_state NOT SIMILAR TO '[A-Z][A-Z]%'
LIMIT 20;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
20 rows affected.


summons_number,plate_id,registration_state
1447153789,JFW5006,99
1447156195,HMP3112,99
1447156201,JFG3676,99
1449253829,HTE1978,99
1449253878,GTL7675,99
1449296245,K76LCW,99
1449330186,JHE1659,99
1447508040,HTJ3728,99
1447578831,HNC1285,99
1449353186,HMU7178,99


**Return records whose `plate_type` does not match three consecutive *uppercase letters*.**

In [123]:
%%sql
SELECT
    summons_number,
    plate_id,
    plate_type
FROM parking_violation
WHERE
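  -- Define the pattern to use for matching.
  -- Note: '[A-Z]+%' accepts any value that starts with an uppercase letter;
  -- '[A-Z]{3}' would require exactly three uppercase letters.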
    plate_type  NOT SIMILAR TO '[A-Z]+%';

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
19 rows affected.


summons_number,plate_id,plate_type
1449399691,JGX9658,999
1399542539,JKA8821,999
1447673712,JFN9352,999
1447683225,JFF2969,999
1427619566,HKM7133,999
1438238885,JHT9520,999
1448286682,HYB4468,999
1449031237,K487440,999
1446646427,HVN2475,999
1446727464,HRC3599,999


In [124]:
%%sql

SELECT
  summons_number,
  plate_id,
  vehicle_make
FROM
  parking_violation
WHERE
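  -- Define the pattern to use for matching.
  -- Note: '[A_Z]' is the literal set A, _, Z (not the range A-Z), and the
  -- pattern as a whole matches only three-character values, so nearly every
  -- make is returned, including valid ones. A single character class such as
  -- '[A-Z/ ]+' would instead accept uppercase letters, slashes, and spaces.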
  vehicle_make NOT SIMILAR TO '[A_Z][/][\s]'
LIMIT 20;


 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
20 rows affected.


summons_number,plate_id,vehicle_make
1447153649,JCA5331,ACURA
1447153789,JFW5006,HONDA
1447153790,HGR2634,ACURA
1447153819,GYM7645,NISSA
1447153820,KHW5523,DODGE
1447153844,JGX7169,ME/BE
1447154198,JGL4948,TOYOT
1447154411,JKSN62,KIA
1447155294,GMC1999,DODGE
1447155427,JEL8372,HONDA


### Identifying out-of-range vehicle model years

- `Type constraints` are **useful for restricting the type of data** that can be stored in a table column. However, there are limits to how thoroughly they can prevent invalid data from entering the column.
- `Range constraints` are useful when the goal is **to identify column values** that fall inside or outside a given range of values.
- Using type constraints when defining a table, followed by checking column values with range constraints, is a powerful approach to ensuring data integrity.
     - Ex: use a `BETWEEN` clause to build a range constraint that identifies invalid vehicle model years in the `parking_violation` table. Valid vehicle model years for this dataset are considered to be between 1970 and 2021.

In [125]:
%%sql
SELECT
    -- Define the columns to return from the query
    summons_number,
    plate_id,
    vehicle_year
FROM
  parking_violation
WHERE
  -- Define the range constraint for invalid vehicle years
    vehicle_year NOT BETWEEN 1970 AND 2021
LIMIT 20;

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
20 rows affected.


summons_number,plate_id,vehicle_year
1447153649,JCA5331,0
1447153789,JFW5006,0
1447153790,HGR2634,0
1447153820,KHW5523,0
1447154411,JKSN62,0
1447155427,JEL8372,0
1447155750,GHP8968,0
1447155877,KHG6053,0
1447157000,HPA3075,0
1447170593,HZU1090,0
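
One detail worth keeping in mind with `NOT BETWEEN`: a `NULL` `vehicle_year` is neither inside nor outside the range, so such rows are not returned by the range check and need a separate `IS NULL` test. The sketch below illustrates this on invented sample rows, using Python's built-in `sqlite3` as a stand-in for PostgreSQL (both follow the same three-valued logic for `NULL` comparisons).

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the rows are made-up samples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parking_violation (summons_number INTEGER, vehicle_year INTEGER)"
)
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?)",
    [(1, 2015), (2, 0), (3, 1969), (4, 2021), (5, None), (6, 2022)],
)

# Rows whose year falls outside the valid range (boundaries are inclusive).
invalid = [
    row[0]
    for row in conn.execute(
        "SELECT summons_number FROM parking_violation "
        "WHERE vehicle_year NOT BETWEEN 1970 AND 2021 "
        "ORDER BY summons_number"
    )
]

# Rows with no year at all must be found separately.
unknown = [
    row[0]
    for row in conn.execute(
        "SELECT summons_number FROM parking_violation "
        "WHERE vehicle_year IS NULL ORDER BY summons_number"
    )
]

print(invalid, unknown)
```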


### Identifying invalid parking violations
- The `parking_violation` table has three columns populated by related time values. The `from_hours_in_effect` column indicates the start time when parking restrictions are enforced at the location where the violation occurred. 
- The `to_hours_in_effect` column indicates the ending time for enforcement of parking restrictions. 
- The `violation_time` column indicates the time at which the violation was recorded.

- Use the parking restriction time range defined by `from_hours_in_effect` and `to_hours_in_effect` to identify parking tickets with an invalid `violation_time`.

In [126]:
%%sql

SELECT 
  summons_number, 
  violation_time, 
  from_hours_in_effect, 
  to_hours_in_effect 
FROM 
  parking_violation 
WHERE 
  -- Exclude results with overnight restrictions 
  from_hours_in_effect < to_hours_in_effect AND 
  violation_time NOT BETWEEN from_hours_in_effect AND to_hours_in_effect;
    

 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
118 rows affected.


summons_number,violation_time,from_hours_in_effect,to_hours_in_effect
1447199789,0608P,0700A,0700P
1447199807,0430P,0700A,0700P
1447209096,0748A,0700A,0700P
1447209102,1120A,0700A,0700P
1447209114,1040A,0700A,0700P
1447209138,1143A,0700A,0700P
1447209140,1138A,0700A,0700P
1447209667,1101A,0700A,0700P
1447209758,0936A,0700A,0700P
1447209795,0405P,0700A,0700P
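
The values in the output are text such as `0748A`, so `BETWEEN` compares them character by character. As text, `0748A` sorts after `0700P`, which is why a 7:48 AM ticket is flagged even though it falls inside a 7:00 AM to 7:00 PM window. A minimal sketch of converting the values to real times before comparing follows. It assumes the `HHMM` plus `A`/`P` format seen in the output, and the helper names `parse_time` and `outside_daytime_window` are invented for illustration.

```python
import re
from datetime import time

_TIME_RE = re.compile(r"(\d{2})(\d{2})([AP])")


def parse_time(text):
    """Convert 'HHMM' + 'A'/'P' (e.g. '0608P') to datetime.time.

    Returns None for values that do not fit the format, such as 'ALL'.
    """
    m = _TIME_RE.fullmatch(text or "")
    if m is None:
        return None
    hour, minute = int(m.group(1)), int(m.group(2))
    if hour > 12 or minute > 59:
        return None
    # 12-hour to 24-hour clock: 12 AM becomes 0, 12 PM stays 12.
    hour = hour % 12 + (12 if m.group(3) == "P" else 0)
    return time(hour, minute)


def outside_daytime_window(violation, start, end):
    """True if the violation lies outside a same-day window.

    Returns None when a value cannot be parsed or the window is not same-day.
    """
    v, s, e = parse_time(violation), parse_time(start), parse_time(end)
    if None in (v, s, e) or not s < e:
        return None
    return not (s <= v <= e)


print(outside_daytime_window("0748A", "0700A", "0700P"))  # inside the window
print(outside_daytime_window("0800P", "0700A", "0700P"))  # after the window
print(outside_daytime_window("1011A", "ALL", "ALL"))      # not parseable
```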


### Invalid violations with overnight parking restrictions

In [127]:
%%sql
SELECT
  summons_number,
  violation_time,
  from_hours_in_effect,
  to_hours_in_effect
FROM
  parking_violation
WHERE
  -- Ensure from hours greater than to hours
  from_hours_in_effect > to_hours_in_effect AND
  -- Ensure violation_time less than from hours
  violation_time < from_hours_in_effect AND
  -- Ensure violation_time greater than to hours
  violation_time > to_hours_in_effect;




 * postgresql://postgres:***@localhost:5432/NYC_Open_Data
4 rows affected.


summons_number,violation_time,from_hours_in_effect,to_hours_in_effect
1448772825,1136A,1200A,0500P
1448774720,1146A,1200A,0500P
1448774731,1135A,1200A,0500P
1448774767,1142A,1200A,0500P
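
In the rows above, `1200A` is 12:00 AM, so a restriction from `1200A` to `0500P` is not an overnight window; it is selected only because the text `1200A` sorts after `0500P`. The sketch below shows the overnight check on minutes after midnight instead. It assumes well-formed `HHMM` plus `A`/`P` values, and the helper names `to_minutes` and `outside_window` are invented for illustration.

```python
def to_minutes(text):
    """'HHMM' + 'A'/'P' -> minutes after midnight (assumes well-formed input)."""
    hour, minute, half = int(text[:2]), int(text[2:4]), text[4]
    return (hour % 12 + (12 if half == "P" else 0)) * 60 + minute


def outside_window(violation, start, end):
    """True if the violation time lies outside the restriction window."""
    v, s, e = map(to_minutes, (violation, start, end))
    if s <= e:
        # Same-day window, e.g. 7:00 AM to 7:00 PM.
        return not (s <= v <= e)
    # Overnight window wraps past midnight, e.g. 10:00 PM to 6:00 AM:
    # the only times outside it are strictly between the end and the start.
    return e < v < s


# Row from the output above: 11:36 AM within 12:00 AM to 5:00 PM.
print(outside_window("1136A", "1200A", "0500P"))
# Hypothetical overnight restriction from 10:00 PM to 6:00 AM.
print(outside_window("0200P", "1000P", "0600A"))  # mid-afternoon
print(outside_window("1130P", "1000P", "0600A"))  # late evening
```

Under this interpretation the four rows listed above fall inside their restriction window, so they would not count as invalid.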
