# Advanced Data Standardization
One of the challenges in analyzing police data is that different agencies use different column names for the same data and different codes and terms for the values in those columns. Particularly when you are working with multiple datasets, it is valuable for the data to be standardized so that you know in advance what key columns will be called and what values they will contain. To provide more consistent column names and data, OpenPoliceData provides tools that automatically standardize both column names and the data in those columns.

## Standardizable Columns
Columns that OpenPoliceData can standardize are:

In [1]:
import sys
sys.path.insert(0, "../../../../../openpolicedata/")
import openpolicedata as opd
print("Left column is attribute of opd.defs.columns (e.g. opd.defs.columns.AGE_OFFICER)")
print('Middle column is the corresponding standardized column name (e.g. "OFFICER_AGE")')
opd.defs.columns

Left column is attribute of opd.defs.columns (e.g. opd.defs.columns.AGE_OFFICER)
Middle column is the corresponding standardized column name (e.g. "OFFICER_AGE")


| Attribute | Column Name | Definition |
|---|---|---|
| AGE_OFFICER | OFFICER_AGE | Age of officer |
| AGE_OFFICER_SUBJECT | OFFICER/SUBJECT_AGE | Age of either an officer or subject (depending on column "SUBJECT_OR_OFFICER") |
| AGE_SUBJECT | SUBJECT_AGE | Age of subject |
| AGE_RANGE_OFFICER | OFFICER_AGE_RANGE | Age Range of officer |
| AGE_RANGE_OFFICER_SUBJECT | OFFICER/SUBJECT_AGE_RANGE | Age Range of either an officer or subject (depending on column "SUBJECT_OR_OFFICER") |
| AGE_RANGE_SUBJECT | SUBJECT_AGE_RANGE | Age Range of subject |
| AGENCY | AGENCY | Agency |
| DATE | DATE | Date. Some agencies only provide the period. In these cases, the date will be the 1st date of the period (i.e. Jan. 1 for years and the 1st of the month for months). |
| DATETIME | DATETIME | Combination of date and time when both columns are found (not generated when detected date column contains datetime values) |
| ETHNICITY_OFFICER_SUBJECT | OFFICER/SUBJECT_ETHNICITY | Ethnicity of either an officer or subject (depending on column "SUBJECT_OR_OFFICER") |
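
The `DATE` definition says that when an agency only provides a period, the date is the first date of that period. A minimal sketch of that rule in plain Python is shown here. `period_start` is a hypothetical helper written for illustration and is not part of OpenPoliceData.

```python
from datetime import date
from typing import Optional


def period_start(year: int, month: Optional[int] = None) -> date:
    """Return the first date of a yearly or monthly reporting period."""
    # Hypothetical helper: a yearly period (no month given) maps to Jan. 1,
    # a monthly period maps to the 1st of that month.
    return date(year, month if month is not None else 1, 1)


print(period_start(2022))     # yearly period -> 2022-01-01
print(period_start(2022, 7))  # monthly period -> 2022-07-01
```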


## Basic Standardization
Basic standardization only requires calling `standardize` on a `Table` object.

In [2]:
src = opd.Source("Virginia")
tbl = src.load_from_url(2022, table_type="STOPS", agency="Arlington County Police Department")
tbl.standardize()

print(f'The columns after standardization are {tbl.table.columns}')

The columns after standardization are Index(['DATE', 'SUBJECT_RACE', 'SUBJECT_AGE', 'SUBJECT_GENDER', 'AGENCY',
       'jurisdiction', 'reason_for_stop', 'person_type', 'english_speaking',
       'action_taken', 'specific_violation', 'virginia_crime_code',
       'person_searched', 'vehicle_searched', 'physical_force_by_officer',
       'physical_force_by_subject', 'residency', 'SUBJECT_ETHNICITY',
       'SUBJECT_RACE_ONLY', 'RAW_incident_date', 'RAW_agency_name', 'RAW_race',
       'RAW_ethnicity', 'RAW_age', 'RAW_gender'],
      dtype='object')
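
In this output, all-caps names are standardized columns and names starting with `RAW_` are the preserved originals. The sketch below groups column names using only that naming convention. `group_columns` is a hypothetical helper, and the column list is abbreviated from the output above.

```python
def group_columns(columns):
    """Split column names into standardized, raw, and untouched groups."""
    groups = {"standardized": [], "raw": [], "untouched": []}
    for name in columns:
        if name.startswith("RAW_"):
            # Original column that was kept alongside a standardized version
            groups["raw"].append(name)
        elif name.isupper():
            # Assumption: all-caps names are standardized columns
            groups["standardized"].append(name)
        else:
            # Column that was not standardized
            groups["untouched"].append(name)
    return groups


# Abbreviated copy of the columns printed above
columns = [
    "DATE", "SUBJECT_RACE", "AGENCY", "jurisdiction",
    "reason_for_stop", "RAW_incident_date", "RAW_agency_name", "RAW_race",
]
groups = group_columns(columns)
print(groups["standardized"])
print(groups["raw"])
```

This heuristic only works here because the original columns in this dataset are lowercase. A dataset whose original columns are uppercase would need a different check.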


The columns in all caps are standardized versions of original columns. The original columns are retained with "RAW_" prepended to their names (*The original columns were lowercase in this dataset. OPD does not alter the original columns or make them lowercase*). With standardized names, the user does not need to know the exact name that a particular dataset uses for a column (they vary greatly) and can check whether a column exists by:

In [6]:
if opd.defs.columns.RACE_SUBJECT in tbl.table: # Alternatively: if "SUBJECT_RACE" in tbl.table
    race_col = tbl.table[opd.defs.columns.RACE_SUBJECT]  # Alternatively: tbl.table["SUBJECT_RACE"]
    print(f"The values in the raw subject race column are {tbl.table['RAW_race'].unique()}\n")
    print(f"The values in the standardized subject race column are {race_col.unique()}")
else:
    print("There is no subject race column")

The values in the raw subject race column are ['WHITE' 'BLACK OR AFRICAN AMERICAN' 'AMERICAN INDIAN OR ALASKA NATIVE'
 'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER' 'UNKNOWN']

The values in the standardized subject race column are ['HISPANIC/LATINO' 'BLACK' 'INDIGENOUS' 'WHITE' 'ASIAN / PACIFIC ISLANDER'
 'UNKNOWN']
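
Because the standardized name is the same in every dataset, the same existence check can be applied across several tables. The sketch below is illustrative only: `race_counts` is a hypothetical helper, and the tables are plain dicts of lists with made-up values standing in for `tbl.table`, so the example runs without downloading any data.

```python
from collections import Counter

SUBJECT_RACE = "SUBJECT_RACE"  # standardized name used in the check above


def race_counts(tables):
    """Tally standardized subject race values across several tables.

    Each table only needs to support `name in table` and `table[name]`.
    """
    counts = Counter()
    skipped = []
    for label, table in tables.items():
        if SUBJECT_RACE not in table:
            # This dataset has no subject race column
            skipped.append(label)
            continue
        counts.update(table[SUBJECT_RACE])
    return counts, skipped


# Made-up stand-ins for standardized tables from three agencies
tables = {
    "agency_a": {"SUBJECT_RACE": ["BLACK", "WHITE", "UNKNOWN"]},
    "agency_b": {"SUBJECT_RACE": ["WHITE", "BLACK"]},
    "agency_c": {"DATE": ["2022-01-01"]},  # no race column
}
counts, skipped = race_counts(tables)
print(counts)
print(skipped)
```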


The values in each standardized column (such as the race column) have been converted from the raw values to standardized ones. Thus, in the standardized race column, Black individuals will always be labeled as BLACK, even though different datasets use many different encodings (including B, African American, Black or African American, and various shortened versions and typos). Data types are also consistent. For example, the standardized date column has a datetime type:

In [8]:
tbl.table["DATE"].dtype

dtype('<M8[ns]')
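
To illustrate the value standardization described earlier, here is a minimal sketch that maps several raw encodings to one standardized label. The lookup table and `standardize_race` are hypothetical and much simpler than what OpenPoliceData does internally. Only encodings mentioned in the text or visible in the outputs are included, and leaving unmapped values unchanged is an assumption of this sketch.

```python
# Hypothetical lookup: raw encodings (lowercased) -> standardized label
RACE_LOOKUP = {
    "b": "BLACK",
    "black": "BLACK",
    "african american": "BLACK",
    "black or african american": "BLACK",
    "white": "WHITE",
}


def standardize_race(raw_value: str) -> str:
    """Map a raw race value to a standardized label if one is known."""
    key = raw_value.strip().lower()
    # Assumption for this sketch: values without a mapping pass through as-is
    return RACE_LOOKUP.get(key, raw_value)


raw_values = ["B", "African American", "BLACK OR AFRICAN AMERICAN", "WHITE"]
standardized = [standardize_race(value) for value in raw_values]
print(standardized)
```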