# 0. Problem set 1 (42 points)

This problem set will:

- Make sure your Jupyter Notebook installation is working
- Give you practice with the following basic Python concepts covered in lecture (and optionally on DataCamp):
  - Python lists
  - Basic list comprehension
- Give you practice loading in and interpreting the Cook County, Illinois (which contains Chicago) sentencing dataset. 
  - This dataset reports the sentence given to defendants convicted of different crimes. 
  - [The data codebook is available here](https://datacatalog.cookcountyil.gov/api/views/tg8v-tm6u/files/8597cdda-f7e1-44d1-b0ce-0a4e43f8c980?download=true&filename=CCSAO%20Data%20Glossary.pdf). For the latest on this data (for future reference), [see the official website](https://datacatalog.cookcountyil.gov/Courts/Sentencing/tg8v-tm6u).
- Give you practice wrangling that data using Pandas (groupby, filtering, apply, etc.)

# 1. Python practice (total 14 points)

## 1.1 Practice with lists

### 1.1.1 List addition (2 points)
We provide you with a list of our names below.

- Use list addition to add your name to the list
- Store this as a new list called `instructor_my_name`
- Print that list

In [None]:
## list of instructor names
instructor_names = ['Eunice', 'Jaren', 'Ramsey']

In [None]:
## your code here to store and print list

### 1.1.2 List indexing (2 points)

- Use the `.index()` method to get the index of the professor's name (Jaren)
- Use that index to extract that name
- Store it in a variable called `prof` and print it

In [None]:
## your code here to get index of prof's name

In [None]:
## your code here to extract the name and print

### 1.1.3 Indexing lists of lists (2 points)

We provide you with the below list of lists (`roles_listoflists`).

- Use subsetting to pull out the role of Jaren (can just hard code the relevant indices)
- Pull out and print the type of the element at index 1 (Eunice's name and role)

In [None]:
## list of names and roles
roles_listoflists = [['Jaren', 'Prof'], ['Eunice', 'TA'], ['Ramsey', 'Tutor']]

In [None]:
## your code here to pull out role of jaren

In [None]:
## your code here to pull out and print the type at index 1

### 1.1.4 Edit list elements (2 points)

- In `roles_listoflists`, replace the role 'TA' with the role 'Teaching Assistant' (fine to do in two lines of code)
- Print the updated list

In [None]:
## your code here to replace role text and print updated list

## 1.2. Practice with list comprehension

Here, we provide you with a list containing the course codes for a few Dartmouth College Fall 2022 courses.

In [None]:
## course code list 
course_codes = ["QSS 20", "QSS 17", "COSC 1", "GOV 10"]

### 1.2.1 Using list comprehension to keep all elements / transform them (2 points)

- Create a new list, `course_codes_ns`, that removes the spaces in each course code: e.g., 'QSS 20' should become 'QSS20'
- Print `course_codes_ns`

*Hint*: If you're new to regular expressions, then use a built-in string method: `str.replace()`. 

In [None]:
## your code here to create list without spaces and print

### 1.2.2 Using list comprehension to subset a list (2 points)

- Using `course_codes_ns`, create a new list just with courses with 'QSS' in the name; store it as `course_codes_qss`

*Hint*: Use an 'if' statement in the list comprehension to implement the condition.
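As a rough illustration of the filtering form of a list comprehension, here is a sketch on an unrelated toy list (the names `words` and `a_words` are hypothetical and not part of the assignment). The `if` at the end decides which elements are kept; the elements themselves are unchanged.

```python
## toy example: keep only the words that contain the substring "an"
words = ["banana", "cherry", "mango", "plum"]
a_words = [w for w in words if "an" in w]
print(a_words)
```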

In [None]:
## your code here to create and store new QSS-specific list

### 1.2.3 Using list comprehension to conditionally change a list's elements (2 points)

- Using `course_codes_ns`, add the string prefix "coding_" to COSC 1 and QSS 20, and "stats_" to GOV 10 and QSS 17
- Store as `course_codes_detail` and print

*Hint*: You can implement an if-else statement in a list comprehension ("conditional transformation") with this syntax:
```python
[elem_transform1 if condition else elem_transform2 for elem in lst]
```
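A minimal sketch of this syntax on an unrelated toy list (the names `temps_f` and `labels` are hypothetical). Unlike the filtering form, every element is kept; the `if`/`else` only decides which transformation each element receives.

```python
## toy example: prefix each temperature with "warm_" or "cold_"
temps_f = [72, 41, 88, 55]
labels = ["warm_" + str(t) if t >= 60 else "cold_" + str(t) for t in temps_f]
print(labels)
```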

In [None]:
## your code here to add prefixes, store, and print

# 2. Cleaning and interpreting the sentencing dataset (total 28 points)

## 2.0 Import packages

In [None]:
## basic functionality
import pandas as pd
import numpy as np
import re

## for plotting; can also use seaborn
## note: for plotnine, you likely need to install using pip or conda
import plotnine
from plotnine import *
import matplotlib.pyplot as plt

## repeated printouts
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## datetime util
from dateutil.relativedelta import relativedelta

## 2.1 Load the data (0 points)

Use `pd.read_csv` to load the `sentencing_asof0405.csv` data. Make sure to unzip the `pset2_inputdata` folder first, and do not hard-code your user-specific path name.

*Note*: You may receive a warning about mixed data types upon import; feel free to ignore it, or pass `low_memory=False` as a parameter.
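The sketch below shows one way to build a file path from parts and pass `low_memory=False`, using a throwaway CSV written to a temporary folder. The file name `toy_inputdata.csv` and its contents are made up for illustration; in the assignment, the folder would be the unzipped data folder located relative to your notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    ## build the path from a folder and a file name instead of typing a full user-specific path
    toy_path = Path(tmp_dir) / "toy_inputdata.csv"
    toy_path.write_text("id,value\n1,a\n2,b\n")

    ## low_memory=False tells pandas not to parse the file in chunks,
    ## which avoids the mixed-dtype warning
    toy_df = pd.read_csv(toy_path, low_memory=False)

print(toy_df.shape)
```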

In [None]:
## basic functionality
import pandas as pd
import numpy as np
import re

## plotting
## note: you likely need to install this using
## pip or conda; you can delete this line
## if you're using matplotlib, seaborn, or other
## plotting pkg
import plotnine
from plotnine import *

## repeated printouts
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## datetime util
from dateutil.relativedelta import relativedelta

In [None]:
## your code here loading the data

## 2.2 Inspect the data (0 points)

Print the head, dimensions, and info for the data

In [None]:
## your code here inspecting the data

## 2.3: Understanding the unit of analysis (5 points)


### 2.3.1 Print the number of unique values for the following columns all at once (e.g., with `.apply()`), i.e. without copying/pasting code to do each one separately:

- Cases (CASE_ID)
- People (CASE_PARTICIPANT_ID)
- Charges (CHARGE_ID)

**Source for this question**: [slide 14 here on column-wise apply](https://github.com/jhaber-zz/QSS20_public/blob/main/slides/02_qss20_fa22_pandas.pdf)
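A minimal sketch of a column-wise `apply` on a toy DataFrame (the DataFrame `toy` and its columns are hypothetical). The function is called once per selected column, so nothing has to be copied and pasted per column.

```python
import pandas as pd

## toy data: orders, customers, and items instead of cases, people, and charges
toy = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "customer_id": [10, 10, 11, 12, 13],
    "item_id": [100, 101, 102, 103, 104],
})

cols = ["order_id", "customer_id", "item_id"]

## apply runs the lambda on each column and returns one number per column
n_unique = toy[cols].apply(lambda col: col.nunique())
print(n_unique)
```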

In [None]:
## your code here printing numbers of unique values

### 2.3.2 Cases and people

You might have noticed there are more unique people than unique cases and more unique charges than unique people. This is because the same case can have multiple people involved, and the same person can have multiple charges tied to a case. Illustrate this by showing:
   
- an example of a case involving multiple people
- an example of a person in a case involving multiple charges

**Resources**: groupby and agg covered in:
- [The in-class activity on data wrangling](https://github.com/jhaber-zz/QSS20_public/blob/main/activities/00_pandas_datacleaning_blank.ipynb) and [solutions](https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/00_pandas_datacleaning_solutions.ipynb)

- [These lecture slides on data wrangling](https://github.com/jhaber-zz/QSS20_public/blob/main/slides/02_qss20_fa22_pandas.pdf)
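One possible pattern for finding a group that contains several distinct members, sketched on toy data (all names here are hypothetical): count distinct values within each group, keep the groups with a count above one, then print the rows of one such group.

```python
import pandas as pd

toy = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "customer_id": [10, 10, 11, 12, 13],
})

## number of distinct customers within each order
cust_per_order = toy.groupby("order_id")["customer_id"].nunique()

## orders that involve more than one customer
multi_orders = cust_per_order[cust_per_order > 1]

## show the rows for the first such order as a concrete example
print(toy[toy["order_id"] == multi_orders.index[0]])
```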

In [None]:
## your code here showing a case with multiple people

In [None]:
## your code here showing a person in a case with multiple charges

### 2.3.3 Finding mean and median 

- Print the mean and median number of charges per `CASE_PARTICIPANT_ID`
- Print the mean and median number of participants per `CASE_ID`
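As a sketch of the general approach on toy data (names are hypothetical): first compute one count per group, then summarize those counts with `agg`.

```python
import pandas as pd

toy = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3, 3],
    "item_id": [100, 101, 102, 103, 104, 105],
})

## step 1: one number per order (distinct items in that order)
items_per_order = toy.groupby("order_id")["item_id"].nunique()

## step 2: summarize the per-order counts
summary = items_per_order.agg(["mean", "median"])
print(summary)
```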

In [None]:
## your code here finding mean and median

### 2.3.4 Does the data enable us to follow the same defendant across different cases they're charged in? Write 1 sentence in support of your conclusion.

In [None]:
## your code here checking for linkage of people across cases

(your text response here)

## 2.4 Which offense is final? (3 points)

First, read the data documentation ([link here](https://datacatalog.cookcountyil.gov/api/views/tg8v-tm6u/files/8597cdda-f7e1-44d1-b0ce-0a4e43f8c980?download=true&filename=CCSAO%20Data%20Glossary.pdf)) and summarize in your own words the differences between `OFFENSE_CATEGORY` and `UPDATED_OFFENSE_CATEGORY`.

(your text response here summarizing the differences)

Then construct an indicator `is_changed_offense` that's True for case-participant-charge observations (rows) where there's a difference between the `OFFENSE_CATEGORY` and the `UPDATED_OFFENSE_CATEGORY`. 

**Resources**: row subsetting, groupby/agg, and np.where covered in [lecture slides on data wrangling](https://github.com/jhaber-zz/QSS20_public/blob/main/slides/02_qss20_fa22_pandas.pdf)
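A minimal `np.where` sketch on toy data (column names are hypothetical) that flags rows where two columns disagree:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "label_original": ["A", "B", "C", "A"],
    "label_updated": ["A", "D", "C", "B"],
})

## True where the two labels differ, False where they match
toy["is_changed"] = np.where(toy["label_original"] != toy["label_updated"], True, False)
print(toy)
```

Note that a missing value never compares equal to anything, so rows with a missing label in either column would be flagged as changed under this comparison.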

In [None]:
## your code here constructing indicator

What are some of the more common changed offenses? Consider both:
  - The raw number of changed offenses that come from each `OFFENSE_CATEGORY` (e.g., using `value_counts()`). This should answer the question: What offenses contribute the most to the pool of changed offenses?
  - The proportion of each `OFFENSE_CATEGORY` that gets changed (can just compute mean and print result of `sort_values()`). This should answer the question: What offenses tend to get changed the most?
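The two views can rank categories differently. A toy sketch (all names hypothetical): category `A` contributes the most changed rows in raw terms, while category `B` has the highest share of its rows changed.

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["A", "A", "A", "A", "A", "B", "C"],
    "is_changed": [True, True, False, False, False, True, False],
})

## raw counts: among changed rows only, how many come from each category?
raw_counts = toy.loc[toy["is_changed"], "category"].value_counts()

## proportions: the mean of a True/False column is the share of True values
prop_changed = toy.groupby("category")["is_changed"].mean().sort_values(ascending=False)

print(raw_counts)
print(prop_changed)
```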

In [None]:
## your code here inspecting most common changed offenses

Print one example of a changed offense from one of these categories and comment on what the reason may be.

In [None]:
## your code here printing example

## 2.5 Simplifying the charges (5 points)

Using the field (`UPDATED_OFFENSE_CATEGORY`), create a new field, `simplified_offense_derived`, that simplifies the many offense categories into broader buckets using the following process:

First, create a new variable that strips "Aggravated" (capitalized) from the `UPDATED_OFFENSE_CATEGORY` (e.g., 'Aggravated Battery' just becomes 'Battery', 'Aggravated DUI' becomes 'DUI')

**Resources**: slide 19 of [the lecture on data wrangling with pandas](https://github.com/jhaber-zz/QSS20_public/blob/main/slides/02_qss20_fa22_pandas.pdf) has str.replace (example with stripping the name johnson from a last name)

In [None]:
## your code here stripping 'Aggravated'

Then:
- Combine all offenses with 'Arson' in the string into a single `Arson` category
- Combine all offenses with 'Homicide' in the string into a single `Homicide` category
- Combine all offenses with 'Vehic' in the string into a single `Vehicle-related` category
- Combine all offenses with 'Battery' in the string into a single `Battery` category
- Use the simplified offense variable created above (the one without 'Aggravated') as the fallback/default value (instead of 'other')

Do so efficiently, using `map()` with a dictionary or `np.select()` (or a similar procedure for systematic recoding) rather than a separate line for each recoded offense.

**Resources**:
- [Activity code](https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/00_pandas_datacleaning_solutions.ipynb) and [lecture on data wrangling](https://github.com/jhaber-zz/QSS20_public/blob/main/slides/02_qss20_fa22_pandas.pdf) cover `np.select` and `map()` with a dictionary (can use one or the other)
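A sketch of `np.select` with a column as the fallback, on toy data unrelated to the offense categories (the names `toy`, `fruit`, and `fruit_simple` are hypothetical). Conditions are checked in order, and rows matching none of them receive the value passed as `default`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"fruit": ["Green Apple", "Red Apple", "Blood Orange", "Kiwi"]})

## one condition per bucket; the first matching condition wins
conditions = [
    toy["fruit"].str.contains("Apple"),
    toy["fruit"].str.contains("Orange"),
]
choices = ["Apple", "Orange"]

## rows matching no condition keep their original value
toy["fruit_simple"] = np.select(conditions, choices, default=toy["fruit"])
print(toy)
```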

In [None]:
## your code here combining offenses

Print the difference between the # of unique offenses in the original `UPDATED_OFFENSE_CATEGORY` field and the # of unique offenses in your new `simplified_offense_derived` field. How many and which ones change?

*Hint*: You can turn unique values from a column into a list using `df[col].unique().tolist()` and get the difference between two lists using a list comprehension: `[elem for elem in list1 if elem not in list2]`. 
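A small sketch of the list-difference idiom on toy lists (names are hypothetical). The difference is directional, so it can be worth checking both ways.

```python
before = ["a", "b", "c", "d"]
after = ["a", "c", "e"]

## elements that appear in `before` but not in `after`
dropped = [elem for elem in before if elem not in after]

## elements that appear in `after` but not in `before`
added = [elem for elem in after if elem not in before]

print(len(before) - len(after))
print(dropped)
print(added)
```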

In [None]:
## your code here printing differences

## 2.6 Cleaning additional variables (10 points)

When cleaning the following variables, make sure to retain the original variable in the data. We ask you to use the `_derived` suffix so that it's easier to pull out these cleaned variables later.

**Resources**: `np.where` in [lecture on data wrangling](https://github.com/jhaber-zz/QSS20_public/blob/main/slides/02_qss20_fa22_pandas.pdf)

### 2.6.1: Race
Based on the `RACE` column, create True/False indicators for `is_black_derived` (Black only or `White/Black [Hispanic or Latino]`), `is_hisp_derived` (non-Black Hispanic: either Hispanic alone or White Hispanic), `is_white_derived` (White non-Hispanic), and `is_other_derived` (none of the above).

You can think of these indicators like this:

`is_black_derived`: True if {Black only, White/Black [Hispanic or Latino]}, else False <br/>
`is_hisp_derived`: True if {HISPANIC or White [Hispanic or Latino]}, else False <br/>
`is_white_derived`: True if White, else False <br/>
`is_other_derived`: True if is_black_derived == is_hisp_derived == is_white_derived == False, else False. In other words, this indicator should be True for all the races that were not included in any of the previous indicators; otherwise, it should be False.
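The same logic can be sketched on a toy column (the names and values below are hypothetical and are not the actual `RACE` categories): `isin` handles a set of values, `==` handles a single value, and the "other" flag is the negation of all the previous flags combined.

```python
import pandas as pd

toy = pd.DataFrame({"color": ["red", "dark red", "blue", "green", None]})

## True if the value is any member of the list
toy["is_red"] = toy["color"].isin(["red", "dark red"])

## True if the value equals a single category
toy["is_blue"] = toy["color"] == "blue"

## True only when none of the earlier indicators is True
toy["is_other"] = ~(toy["is_red"] | toy["is_blue"])
print(toy)
```

In this sketch a missing value is False for the first two indicators and therefore True for the "other" indicator.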

In [None]:
## your code here deriving race

### 2.6.2: Gender
Based on the `GENDER` column, create a Boolean (True/False) indicator `is_male_derived` (False for female, unknown, or other).

In [None]:
## your code here deriving gender

### 2.6.3: Age at incident
Looking at the `AGE_AT_INCIDENT` column, you may notice outliers like 130-year-olds. Recode the top 0.01% of values to be equal to the 99.99th percentile value (this is sometimes called *winsorizing*, but don't worry about the terminology). Name the result `age_derived`.

In [None]:
## your code here deriving age at incident

### 2.6.4: Sentencing date
Create `sentenceymd_derived`, a version of `SENTENCE_DATE` converted to datetime format. Also create a rounded version, `sentenceym_derived`, that's rounded down to the first day of the month (e.g., 01-05-2016 and 01-27-2016 each become 01-01-2016).

*Hints*: All timestamps are midnight, so you can strip the timestamp. Before converting, you'll notice that some of the years have been mistranscribed (e.g., 291X or 221X instead of 201X). Programmatically fix those (e.g., 2914 -> 2014). You can use this regex code to clean the dates or write your own pattern:

```python
## first, use regex to clean up the date column:
## if the year after a slash starts with 2 followed by a nonzero digit (e.g., 2914), rewrite it as 20XX
sentence['tmp_clnsdate'] = [re.sub(r'2[1-9]([0-9]+)', r"20\1", str(date))
                            if bool(re.search(r'/2[1-9][0-9]+', str(date))) else
                            str(date)
                            for date in
                            sentence.SENTENCE_DATE]
```

Even after cleaning, there will still be some dates after the year 2021, which we'll filter out later.

**Resources**:

- pd.to_datetime() used in [the data wrangling activity](https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/00_pandas_datacleaning_solutions.ipynb)
- extract the month and year from a datetime object using the dt accessor (similar syntax for year): https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.month.html 
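One way to convert strings to datetimes and round them down to the first of the month, sketched on toy data. The column names and the date format string are assumptions for illustration; check how dates actually appear in the data before choosing a `format`. The rounding here goes through a monthly period, which is an alternative to rebuilding the date from `dt.year` and `dt.month`.

```python
import pandas as pd

toy = pd.DataFrame({
    "event_date": [
        "01/05/2016 12:00:00 AM",
        "01/27/2016 12:00:00 AM",
        "03/14/2018 12:00:00 AM",
    ]
})

## parse the strings; the format below is an assumption about the toy data
toy["event_ymd"] = pd.to_datetime(toy["event_date"], format="%m/%d/%Y %I:%M:%S %p")

## convert to a monthly period, then back to a timestamp at the start of that month
toy["event_ym"] = toy["event_ymd"].dt.to_period("M").dt.to_timestamp()
print(toy)
```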


In [None]:
## your code here that creates datetime version of sentencing date

### 2.6.5: Sentencing judge

Create an identifier (`judgeid_derived`) for each unique judge (`SENTENCE_JUDGE`) structured as judge_1, judge_2, etc. 

When finding unique judges, there are various duplicates we could weed out. For this exercise, address only these sources of duplication/redundancy:
1. the different iterations of Doug/Douglas Simpson
2. the different iterations of Shelley Sutker (who appears both with her maiden name and her hyphenated married name) 

Note that you can do this more manually by creating a list with the different name variations and receive full credit (i.e., no need to use regular expressions).

*Hint 1*: Due to mixed types, you may need to cast the `SENTENCE_JUDGE` variable to a different type in order to be able to sort.

*Hint 2*: To assign identifiers to the judges, try grouping them with `ngroup()`. You can read about [the parameters in the documentation.](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.ngroup.html)
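A sketch of both steps on toy data (all names and values are hypothetical): collapse known name variants with a dictionary, then number the groups with `ngroup()`. Because `groupby` sorts the keys by default, the numbering follows alphabetical order of the cleaned names.

```python
import pandas as pd

toy = pd.DataFrame({"city": ["Boston", "Austin", "Beantown", "Chicago", "Austin"]})

## step 1: map known variants onto one canonical spelling, and cast to string
toy["city_clean"] = toy["city"].replace({"Beantown": "Boston"}).astype(str)

## step 2: ngroup() numbers the groups from 0, so add 1 to start at 1
toy["city_id"] = "city_" + (toy.groupby("city_clean").ngroup() + 1).astype(str)
print(toy)
```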

In [None]:
## your code here that creates unique judge identifier

Then print a random sample of 10 rows with the original and cleaned columns for the relevant variables

**Resources**: [sample command here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html)

In [None]:
## your code here that prints sample

## 2.7 Subsetting rows to analytic dataset (5 points)

Let's narrow down the above sentencing dataset in a few ways. First, subset to cases where only one participant is charged, since cases with >1 participant might have complications like plea bargains/informing from other participants affecting the sentencing of the focal participant.
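One possible way to keep only groups with a single distinct member is `groupby(...).transform("nunique")`, which attaches the group-level count to every row so it can be used as a row filter. A sketch on toy data (names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "team": ["red", "red", "blue", "green", "green"],
    "member": ["ann", "bob", "cy", "dee", "dee"],
})

## number of distinct members in each row's team, repeated on every row of that team
toy["n_members"] = toy.groupby("team")["member"].transform("nunique")

## keep rows belonging to teams with exactly one distinct member
solo_teams = toy[toy["n_members"] == 1].copy()
print(solo_teams)
```

The `green` team is kept even though it has two rows, because both rows refer to the same member.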

In [None]:
## your code here to limit to one participant

Next, let's go from a participant-case level dataset, where each participant is repeated across charges tied to the case, to a participant-level dataset, where each participant has one charge. To do this, let's subset to a participant's primary charge and their current sentence (`PRIMARY_CHARGE_FLAG` is True and `CURRENT_SENTENCE_FLAG` is True). Double check that this worked by confirming there are no longer multiple charges for the same case-participant.
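A sketch of filtering on a Boolean flag and then double-checking that each ID appears only once, on toy data (names are hypothetical). Two equivalent checks are shown: the largest group size and a direct test for repeated IDs.

```python
import pandas as pd

toy = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 3],
    "is_main_row": [True, False, True, True, False],
})

## keep only the flagged rows
main_rows = toy[toy["is_main_row"]].copy()

## check 1: the largest number of rows for any one person should be 1
max_rows_per_person = main_rows.groupby("person_id").size().max()

## check 2: no person_id should be repeated
has_repeats = main_rows["person_id"].duplicated().any()

print(max_rows_per_person, has_repeats)
```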

In [None]:
## your code here to subset to primary charge and current sentence

Finally, apply these two additional filters: 

- filter out observations where the judge is missing or nonsensical (i.e., the value is null, as flagged by `isnull()`, or equal to `FLOOD`)
- subset to sentencing date between `01-01-2012` and `04-05-2021` (inclusive)
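A sketch of combining several row filters on toy data (all names and values are hypothetical). Each condition is built as a separate Boolean Series and the conditions are combined with `&`; `between` includes both endpoints by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "reviewer": ["Ana", None, "UNKNOWN", "Ben", "Ana"],
    "review_date": pd.to_datetime(
        ["2011-12-31", "2015-06-01", "2016-01-01", "2021-04-05", "2021-04-06"]
    ),
})

## reviewer must be present and must not be the placeholder value
keep_reviewer = toy["reviewer"].notnull() & (toy["reviewer"] != "UNKNOWN")

## date must fall within the window, endpoints included
keep_date = toy["review_date"].between(pd.Timestamp("2012-01-01"), pd.Timestamp("2021-04-05"))

toy_filtered = toy[keep_reviewer & keep_date].copy()
print(toy_filtered)
```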

In [None]:
## your code here to apply remaining filters