# Assignment 1
### Understanding Uncertainty
### Due 9/5

1. Create a new public repo on GitHub under your account. Include a README file.
2. Clone it to your machine. Put this file into that repo.
3. Use the following function to download the example data for the course:

In [1]:
def download_data(force=False):
    """Download and extract course data from Zenodo."""
    import urllib.request, zipfile, os
    
    zip_path = 'data.zip'
    data_dir = 'data'
    
    if not os.path.exists(zip_path) or force:
        print("Downloading course data")
        urllib.request.urlretrieve(
            'https://zenodo.org/records/16954427/files/data.zip?download=1',
            zip_path
        )
        print("Download complete")
    else:
        print("Download file already exists")
        
    if not os.path.exists(data_dir) or force:
        print("Extracting data files...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(data_dir)
        print("Data extracted")
    else:
        print("Data directory already exists")

download_data()

Downloading course data
Download complete
Extracting data files...
Data extracted


4. Open one of the datasets using Pandas:
    1. `ames_prices.csv`: Housing characteristics and prices
    2. `college_completion.csv`: Public, nonprofit, and for-profit educational institutions, graduation rates, and financial aid
    3. `ForeignGifts_edu.csv`: Monetary and in-kind transfers from foreign entities to U.S. educational institutions
    4. `iowa.csv`: Liquor sales in Iowa, at the transaction level
    5. `metabric.csv`: Cancer patient and outcome data
    6. `mn_police_use_of_force.csv`: Records of physical altercations between Minnesota police and private citizens
    7. `nhanes_data_17_18.csv`: National Health and Nutrition Examination Survey
    8. `tuna.csv`: Yellowfin Tuna Genome (I don't recommend this one; it's just a sequence of G, C, A, T)
    9. `va_procurement.csv`: Public spending by the state of Virginia

In [2]:
# Import packages
import pandas as pd
import numpy as np
# Read in data
data = pd.read_csv("data/mn_police_use_of_force.csv")

5. Pick two or three variables and briefly analyze them
    - Is it a categorical or numeric variable?
    - How many missing values are there? (`df['var'].isna()` and `np.sum()`)
    - If categorical, tabulate the values (`df['var'].value_counts()`) and if numeric, get a summary (`df['var'].describe()`)
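The three checks above can be bundled into one helper. This is only a minimal sketch: `summarize_variable` and the `toy` frame are hypothetical names made up for illustration, and the dtype test is a rough heuristic (a column such as `precinct` is stored as a number but is really a label, so it still needs a human judgment).

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_variable(df, var):
    """Return (kind, number of missing values, summary) for one column."""
    col = df[var]
    n_missing = int(col.isna().sum())
    if is_numeric_dtype(col):
        # Numeric: count, mean, std, min, quartiles, max
        kind = "numeric"
        summary = col.describe()
    else:
        # Categorical: frequency of each value (missing values are excluded)
        kind = "categorical"
        summary = col.value_counts()
    return kind, n_missing, summary


# Made-up data for illustration only; not the course data
toy = pd.DataFrame({
    "race": ["Black", "White", None, "Black"],
    "age": [20.0, 27.0, None, 31.0],
})

race_kind, race_missing, race_summary = summarize_variable(toy, "race")
age_kind, age_missing, age_summary = summarize_variable(toy, "age")
print(race_kind, race_missing)
print(age_kind, age_missing)
```

With the real data, the same call would be `summarize_variable(data, 'race')`.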

In [3]:
# Display the data to get a sense of its structure
data

Unnamed: 0,response_datetime,problem,is_911_call,primary_offense,subject_injury,force_type,force_type_action,race,sex,age,type_resistance,precinct,neighborhood
0,2016/01/01 00:47:36,Assault in Progress,Yes,DASLT1,,Bodily Force,Body Weight to Pin,Black,Male,20.0,Tensed,1,Downtown East
1,2016/01/01 02:19:34,Fight,No,DISCON,,Chemical Irritant,Personal Mace,Black,Female,27.0,Verbal Non-Compliance,1,Downtown West
2,2016/01/01 02:19:34,Fight,No,DISCON,,Chemical Irritant,Personal Mace,White,Female,23.0,Verbal Non-Compliance,1,Downtown West
3,2016/01/01 02:28:48,Fight,No,PRIORI,,Chemical Irritant,Crowd Control Mace,Black,Male,20.0,Commission of Crime,1,Downtown West
4,2016/01/01 02:28:48,Fight,No,PRIORI,,Chemical Irritant,Crowd Control Mace,Black,Male,20.0,Commission of Crime,1,Downtown West
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12920,2021/08/30 21:38:46,Assault in Progress,Yes,ASLT5,,Bodily Force,Joint Lock,White,Female,69.0,,1,Loring Park
12921,2021/08/30 22:32:22,Unwanted Person,Yes,CIC,,Bodily Force,Joint Lock,,,,,1,Cedar Riverside
12922,2021/08/31 12:03:08,Overdose w/All,Yes,FORCE,,Bodily Force,Body Weight Pin,Black,Male,,,3,Seward
12923,2021/08/31 12:52:52,Attempt Pick-Up,No,WT,,Bodily Force,Body Weight Pin,Black,Male,31.0,,4,Camden Industrial
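The `Unnamed: 0` column in the display above is most likely a row index that was saved into the CSV without a header. One way to avoid carrying it around as a data column is the `index_col` argument of `pd.read_csv`. The sketch below uses a small made-up CSV string in place of the real file, so the column names and values are illustrative assumptions.

```python
import io

import pandas as pd

# Made-up CSV text whose first header cell is blank, like a saved index
csv_text = """,problem,age
0,Fight,27.0
1,Fight,23.0
"""

# Default read: the blank header becomes a column named 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(with_index.columns))
```

For the course file this would be `pd.read_csv("data/mn_police_use_of_force.csv", index_col=0)`.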


Variables to investigate: **race**, **age**, **neighborhood**. 

**Race** is a categorical variable that records the race of the person subjected to the officer's use of force.

**Age** is a numeric variable that records the age of that person.

**Neighborhood** is a categorical variable that records the Minnesota neighborhood where the event occurred.

In [4]:
# race
race_nas = data['race'].isna().sum()
# age
age_nas = data['age'].isna().sum()
# neighborhood
neighborhood_nas = data['neighborhood'].isna().sum()
# Print all
print(f"There are {race_nas} missing values in the race column.")
print(f"There are {age_nas} missing values in the age column.")
print(f"There are {neighborhood_nas} missing values in the neighborhood column.")

There are 1024 missing values in the race column.
There are 1066 missing values in the age column.
There are 4 missing values in the neighborhood column.
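Raw counts are easier to compare when expressed as a share of all rows. Because `isna()` returns booleans, taking the mean gives the proportion missing directly. A minimal sketch on a made-up frame (the values are illustrative, not taken from the course data):

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "race": ["Black", None, "White", "Black"],
    "age": [20.0, None, None, 31.0],
    "neighborhood": ["Downtown West", "Seward", "Jordan", "Hawthorne"],
})

# The mean of a boolean column is the fraction of True values
missing_share = toy[["race", "age", "neighborhood"]].isna().mean()
print(missing_share)
```

On the real data, `data[['race', 'age', 'neighborhood']].isna().mean()` would give the corresponding shares.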


In [5]:
data['race'].value_counts(), data['race'].value_counts(normalize=True)
# normalize=True returns proportions instead of raw counts

(race
 Black                 7648
 White                 3129
 Native American        784
 Other / Mixed Race     205
 Asian                  129
 Pacific Islander         6
 Name: count, dtype: int64,
 race
 Black                 0.642635
 White                 0.262919
 Native American       0.065877
 Other / Mixed Race    0.017225
 Asian                 0.010839
 Pacific Islander      0.000504
 Name: proportion, dtype: float64)

Black people account for about 64% of the records with a recorded race, far more than any other group, even though they make up only about 7% of the state's population (via [Minnesota.gov in 2022](https://mn.gov/deed/newscenter/publications/trends/september-2022/disparities.jsp)). This gap may indicate racial bias in policing.
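The size of the gap can be expressed as a simple ratio of the two shares. This is a rough sketch, not a formal disparity measure: it uses the proportion printed above and the statewide 7% figure cited in the text, ignores the records with missing race, and assumes the statewide population is the right comparison group, which may not hold if the incidents are concentrated in one city.

```python
# Proportion of records with race recorded as Black (from the output above)
share_in_records = 0.642635

# Statewide population share cited in the text (approximate)
share_in_population = 0.07

# How many times larger the share in the records is than the population share
representation_ratio = share_in_records / share_in_population
print(f"Representation ratio: {representation_ratio:.1f}")
```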

In [6]:
data['age'].describe()

count    11859.000000
mean        29.484527
std         10.987780
min          0.000000
25%         22.000000
50%         28.000000
75%         35.000000
max         82.000000
Name: age, dtype: float64

Based on the distribution of ages, many of these interactions involve people roughly 20 to 40 years old: the mean is about 29.5, the median is 28, and the first and third quartiles are 22 and 35.
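The summary also shows a minimum age of 0, which may point to placeholder or mistyped entries rather than real ages. A quick way to see how many rows are affected is to filter on a plausibility threshold. The sketch below uses made-up ages, and `MIN_PLAUSIBLE_AGE` is a hypothetical cutoff chosen only for illustration.

```python
import pandas as pd

# Made-up ages for illustration only, including one missing value
ages = pd.Series([0.0, 20.0, 27.0, None, 82.0], name="age")

# Hypothetical cutoff; the appropriate value is a judgment call
MIN_PLAUSIBLE_AGE = 1

# Comparisons with NaN are False, so missing ages are not flagged here
suspect = ages[ages < MIN_PLAUSIBLE_AGE]
n_suspect = len(suspect)
print(f"{n_suspect} suspect age value(s)")
```

On the real data, the equivalent filter would be `data[data['age'] < MIN_PLAUSIBLE_AGE]`.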

In [7]:
data['neighborhood'].value_counts(), data['neighborhood'].value_counts(normalize=True)
# normalize=True returns proportions instead of raw counts

(neighborhood
 Downtown West           2928
 Near - North             562
 Lowry Hill East          554
 Hawthorne                502
 Jordan                   479
                         ... 
 Cedar - Isles - Dean       6
 Page                       5
 Camden Industrial          3
 Hale                       2
 Kenny                      1
 Name: count, Length: 86, dtype: int64,
 neighborhood
 Downtown West           0.226608
 Near - North            0.043495
 Lowry Hill East         0.042876
 Hawthorne               0.038851
 Jordan                  0.037071
                           ...   
 Cedar - Isles - Dean    0.000464
 Page                    0.000387
 Camden Industrial       0.000232
 Hale                    0.000155
 Kenny                   0.000077
 Name: proportion, Length: 86, dtype: float64)

Downtown West accounts for the largest share of incidents (about 23%), far more than any other neighborhood. This could indicate overpolicing, since the population of Downtown West was listed at under 9,000 people ([2020 U.S. Census](https://www.census.gov/)). It is also possible that more crimes are simply committed in the area.
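To make the comparison with population concrete, the count can be turned into a rate per 1,000 residents. This is a back-of-the-envelope sketch: it takes the Downtown West count from the output above and uses 9,000 as an upper bound on the population, so the result is a lower bound on the rate over the whole period covered by the data. Resident population may also be a poor denominator for a downtown area if many of the people present there do not live there.

```python
# Incident count for Downtown West (from the output above)
incidents = 2928

# Upper bound on the resident population cited in the text
population_upper_bound = 9000

# Incidents per 1,000 residents over the full period covered by the data
rate_per_1000 = incidents / population_upper_bound * 1000
print(f"At least {rate_per_1000:.0f} incidents per 1,000 residents")
```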

6. What are some questions and prediction tools you could create using these data? Who would the stakeholder be for that prediction tool? What practical or ethical questions would it create? What other data would you want, that are not available in your data?

There are many ethical concerns surrounding these data, which contain sensitive information: demographic details about individuals, in records that document instances of police force. The main stakeholders would be the Minnesota police force and social justice organizations, who could use the data to better inform policing practices. One prediction tool could classify the likely police response to a given type of event. The data could also be used to investigate questions about policing habits, such as: Has there been overpolicing of certain neighborhoods, races, or genders? Under what circumstances were particular types of force used? Additional data that would help include response times and the specific officer involved in each incident. Together, these could improve policing by increasing efficiency or by identifying habits, practices, or officers that have been producing harmful results.

7. Commit your work to the repo (`git commit -am 'Finish assignment'` at the command line, or use the Git panel in VS Code). Push your work back to GitHub and submit the link on Canvas in the assignment tab.