### Checking your data quality

This notebook walks through several checks you can run to assess the quality of your data. We will look at five dimensions:

- Accuracy: How well does a piece of information reflect reality?
- Completeness: Does it fulfill users’ expectations as to how fully it represents the truth?
- Timeliness: Is your information available when users need it?
- Validity: Is the information in the expected format, or is it in an unusable one?
- Uniqueness: Is this the only instance in which this information appears in the database?

In [3]:
import pandas as pd

Load the data

In [5]:
election_data = pd.read_csv("../data/city council (nyccfb.info).csv")

election_data.head()

Unnamed: 0,UNIQUEID,ELECTION,OFFICECD,RECIPID,CANCLASS,RECIPNAME,COMMITTEE,FILING,SCHEDULE,PAGENO,...,INTEMPSTNM,INTEMPCITY,INTEMPST,INTOCCUPA,PURPOSECD,EXEMPTCD,ADJTYPECD,RR_IND,SEG_IND,INT_C_CODE
0,91CB90C2,2023,5,433,P,"Schulman, Lynn",L,6,ABC,,...,,,,,,,,N,N,
1,FFACDF77,2023,5,1159,P,"Narcisse, Mercedes",K,6,ABC,,...,,,,,,,,N,N,
2,BDEB84A7,2023,5,1984,P,"Velazquez, Marjorie",K,7,ABC,,...,,,,,,,,N,N,
3,69252E17,2023,5,2756,P,"Salaam, Yusef",H,10,ABC,,...,,,,,,,,N,N,
4,D3C8E6EE,2023,5,2349,P,"De La Rosa, Carmen N",J,6,ABC,,...,,,,,,,,N,N,


#### Accuracy

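We will inspect the values in the `CITY` column by counting them with `value_counts()` and then sorting the tallies alphabetically by city name. Because the sort is case-sensitive, uppercase spellings come first and lowercase ones last, which makes inconsistent entries such as `ASTORIA` and `Astoria`, or `new york` and `newyork`, easy to spot.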

In [8]:
city_tallies = election_data["CITY"].value_counts()

city_tallies.head()

CITY
New York         669
Brooklyn         246
Staten Island     65
Flushing          59
Jamaica           52
Name: count, dtype: int64

In [23]:
city_tallies.reset_index().sort_values(by="CITY")

Unnamed: 0,CITY,count
39,ASTORIA,2
33,Addisleigh Park,3
8,Astoria,25
5,BOWLING GREEN,43
69,BROOKLYN,1
...,...,...
60,far rockaway,1
51,new york,1
45,newyork,2
37,staten island,3
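
The sorted tallies suggest that several spellings refer to the same city. One way to merge case and whitespace variants is sketched below. The sample values and the `corrections` mapping are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample standing in for election_data["CITY"]
cities = pd.Series(
    ["New York", "new york", "NEW YORK ", "newyork", "Astoria", "ASTORIA", "Brooklyn"],
    name="CITY",
)

# Strip surrounding whitespace, collapse repeated spaces, and title-case each value
normalized = (
    cities.str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)

# Spellings that differ by more than case need an explicit mapping (hypothetical)
corrections = {"Newyork": "New York"}
normalized = normalized.replace(corrections)

# Re-count after normalization: variants now fall under one label
clean_tallies = normalized.value_counts()
print(clean_tallies)
```

On the real data the same chain could be applied to `election_data["CITY"]`, though any correction mapping should be reviewed by hand before the column is overwritten.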


#### Completeness

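Often you want to know how many null values a column contains. The cells below display the rows where a given column is null; an empty result, as for `UNIQUEID`, means that column has no missing values.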

In [26]:
# Show the rows where the value for "UNIQUEID" is null.
election_data[election_data["UNIQUEID"].isna()]

Unnamed: 0,UNIQUEID,ELECTION,OFFICECD,RECIPID,CANCLASS,RECIPNAME,COMMITTEE,FILING,SCHEDULE,PAGENO,...,INTEMPSTNM,INTEMPCITY,INTEMPST,INTOCCUPA,PURPOSECD,EXEMPTCD,ADJTYPECD,RR_IND,SEG_IND,INT_C_CODE


In [28]:
# Show the rows where the value for "OCCUPATION" is null.

election_data[election_data["OCCUPATION"].isna()]

Unnamed: 0,UNIQUEID,ELECTION,OFFICECD,RECIPID,CANCLASS,RECIPNAME,COMMITTEE,FILING,SCHEDULE,PAGENO,...,INTEMPSTNM,INTEMPCITY,INTEMPST,INTOCCUPA,PURPOSECD,EXEMPTCD,ADJTYPECD,RR_IND,SEG_IND,INT_C_CODE
0,91CB90C2,2023,5,433,P,"Schulman, Lynn",L,6,ABC,,...,,,,,,,,N,N,
1,FFACDF77,2023,5,1159,P,"Narcisse, Mercedes",K,6,ABC,,...,,,,,,,,N,N,
2,BDEB84A7,2023,5,1984,P,"Velazquez, Marjorie",K,7,ABC,,...,,,,,,,,N,N,
3,69252E17,2023,5,2756,P,"Salaam, Yusef",H,10,ABC,,...,,,,,,,,N,N,
4,D3C8E6EE,2023,5,2349,P,"De La Rosa, Carmen N",J,6,ABC,,...,,,,,,,,N,N,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1401,84F3BC1F,2023,5,2063,NP,"Adams, Adrienne",L,2,ABC,,...,,,,,,,,N,N,
1423,CFCE5684,2023,5,2343,P,"Ung, Sandra",J,2,ABC,,...,West 56th Street,New York,NY,Attorney,,,,N,N,IND
1426,8DD1C23F,2023,5,1529,P,"Menin, Julie",K,9,M,,...,,,,,,,2.0,N,N,
1430,8D55B131,2023,5,399,P,"Brewer, Gale",Q,9,M,,...,,,,,,,2.0,N,N,
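
Filtering one column at a time shows the affected rows, but a per-column summary is often a quicker first look. The sketch below counts nulls and their share for every column at once. The small frame is a hypothetical stand-in for `election_data`.

```python
import pandas as pd

# Hypothetical sample; the real frame in this notebook is election_data
df = pd.DataFrame(
    {
        "UNIQUEID": ["A1", "B2", "C3", "D4"],
        "OCCUPATION": ["Attorney", None, None, "Teacher"],
        "INTOCCUPA": [None, None, None, "Attorney"],
    }
)

# isna() gives a boolean frame; sum() counts the True values per column
null_counts = df.isna().sum()

# mean() of booleans gives the fraction of rows that are null
null_share = df.isna().mean()

# Combine both measures and list the emptiest columns first
summary = pd.DataFrame({"null_count": null_counts, "null_share": null_share})
summary = summary.sort_values("null_count", ascending=False)
print(summary)
```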


#### Timeliness

We first check the dtype of the `DATE` column, cast it to `datetime64[ns]` so that pandas interprets it as a date, and then summarize its range with the `.describe()` method. In the dtype output, `<M8[ns]` is NumPy's shorthand for `datetime64[ns]`.

In [30]:
election_data["DATE"].dtype

dtype('<M8[ns]')

In [31]:
election_data["DATE"] = election_data["DATE"].astype("datetime64[ns]")

In [32]:
election_data["DATE"].describe()

count                             1460
mean     2023-03-04 16:54:54.246575616
min                2021-11-13 00:00:00
25%                2022-12-13 00:00:00
50%                2023-03-05 12:00:00
75%                2023-06-23 00:00:00
max                2023-12-08 00:00:00
Name: DATE, dtype: object
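
A cast with `astype("datetime64[ns]")` generally raises an error if a value cannot be parsed. A more forgiving sketch, assuming the dates arrive as strings, uses `pd.to_datetime` with `errors="coerce"` and then flags rows outside an expected window. The sample values and the window bounds are hypothetical.

```python
import pandas as pd

# Hypothetical sample standing in for election_data["DATE"]
df = pd.DataFrame({"DATE": ["2023-03-05", "2021-11-13", "not a date", "2023-12-08"]})

# errors="coerce" turns unparseable strings into NaT instead of raising
df["DATE"] = pd.to_datetime(df["DATE"], errors="coerce")

# Number of values that could not be parsed as dates
unparsed = df["DATE"].isna().sum()

# Hypothetical window in which records are expected to fall
window_start = pd.Timestamp("2022-01-01")
window_end = pd.Timestamp("2023-12-31")

# Rows dated before or after the window (NaT rows compare as False and are excluded)
outside_window = df[(df["DATE"] < window_start) | (df["DATE"] > window_end)]
print(unparsed)
print(outside_window)
```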

#### Validity

In the next cells we check the validity of the `ZIP` column by converting it to strings and measuring the length of each value; a five-digit ZIP code should have length 5.

In [36]:
election_data["ZIP"] = election_data["ZIP"].astype("str")

In [45]:
# Create a column holding the length of each value in the ZIP column
election_data["zip_len"] = election_data["ZIP"].str.len()

election_data.head()

Unnamed: 0,UNIQUEID,ELECTION,OFFICECD,RECIPID,CANCLASS,RECIPNAME,COMMITTEE,FILING,SCHEDULE,PAGENO,...,INTEMPCITY,INTEMPST,INTOCCUPA,PURPOSECD,EXEMPTCD,ADJTYPECD,RR_IND,SEG_IND,INT_C_CODE,zip_len
0,91CB90C2,2023,5,433,P,"Schulman, Lynn",L,6,ABC,,...,,,,,,,N,N,,5
1,FFACDF77,2023,5,1159,P,"Narcisse, Mercedes",K,6,ABC,,...,,,,,,,N,N,,5
2,BDEB84A7,2023,5,1984,P,"Velazquez, Marjorie",K,7,ABC,,...,,,,,,,N,N,,5
3,69252E17,2023,5,2756,P,"Salaam, Yusef",H,10,ABC,,...,,,,,,,N,N,,5
4,D3C8E6EE,2023,5,2349,P,"De La Rosa, Carmen N",J,6,ABC,,...,,,,,,,N,N,,5


In [46]:
election_data[election_data["zip_len"] < 5]

Unnamed: 0,UNIQUEID,ELECTION,OFFICECD,RECIPID,CANCLASS,RECIPNAME,COMMITTEE,FILING,SCHEDULE,PAGENO,...,INTEMPCITY,INTEMPST,INTOCCUPA,PURPOSECD,EXEMPTCD,ADJTYPECD,RR_IND,SEG_IND,INT_C_CODE,zip_len
10,57406798,2023,5,380,P,"Moya, Francisco P",M,5,ABC,,...,,,,,,,N,N,,4
696,F8A17C16,2023,5,1529,P,"Menin, Julie",K,2,ABC,,...,,,,,,,N,N,,4
906,F953A326,2023,5,1990,P,"Brannan, Justin",L,7,ABC,,...,,,,,,,N,N,,3
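
A length check catches values that are too short, but it misses values that are too long or contain non-digits. A stricter sketch, assuming a valid ZIP is exactly five digits, is shown below with hypothetical sample values. Two caveats are worth keeping in mind: depending on the pandas version, `astype("str")` may turn missing values into the literal string `nan` (length 3), and four-character values may be ZIP codes that lost a leading zero if the column was read as a number. Neither is confirmed for this dataset.

```python
import pandas as pd

# Hypothetical sample standing in for election_data["ZIP"]
df = pd.DataFrame({"ZIP": ["10001", "7030", "11201-1234", None, "1O001"]})

# The nullable "string" dtype keeps missing values as <NA> rather than text
zip_codes = df["ZIP"].astype("string").str.strip()

# Valid means the whole value is exactly five digits; missing values count as invalid
df["zip_valid"] = zip_codes.str.fullmatch(r"\d{5}").fillna(False).astype(bool)

# Rows that fail the format check
invalid_rows = df[~df["zip_valid"]]
print(invalid_rows)
```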


#### Uniqueness

We will use the `duplicated()` method to see which rows are duplicates.

In [48]:
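# duplicated() flags each repeat of a row after its first occurrence (keep="first" is the default),
# so the line below should show the two rows that are duplicates and follow the first occurrence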
election_data[election_data.duplicated()]

Unnamed: 0,UNIQUEID,ELECTION,OFFICECD,RECIPID,CANCLASS,RECIPNAME,COMMITTEE,FILING,SCHEDULE,PAGENO,...,INTEMPCITY,INTEMPST,INTOCCUPA,PURPOSECD,EXEMPTCD,ADJTYPECD,RR_IND,SEG_IND,INT_C_CODE,zip_len
49,E64E955F,2023,5,2762,P,"Chan, Wai Yee",H,6,ABC,,...,,,,,,,N,N,,5
1325,E64E955F,2023,5,2762,P,"Chan, Wai Yee",H,6,ABC,,...,,,,,,,N,N,,5
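
Because `duplicated()` hides the first occurrence by default, it can help to view every copy side by side, or to check a single key column. The sketch below illustrates both options and the matching clean-up step on a hypothetical frame; the column values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample with one record appearing three times
df = pd.DataFrame(
    {
        "UNIQUEID": ["AAAA0001", "BBBB0002", "AAAA0001", "AAAA0001"],
        "RECIPNAME": ["Candidate A", "Candidate B", "Candidate A", "Candidate A"],
    }
)

# Default keep="first": only the repeats after the first occurrence are flagged
later_copies = df[df.duplicated()]

# keep=False flags every member of a duplicate group, including the first
all_copies = df[df.duplicated(keep=False)]

# subset restricts the comparison to a key column instead of the whole row
repeated_ids = df[df.duplicated(subset=["UNIQUEID"], keep=False)]

# drop_duplicates keeps one row per duplicate group
deduplicated = df.drop_duplicates()
print(all_copies)
```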
