# 0. Load imports 

In [23]:
## imports
import pandas as pd
import numpy as np
import re

## print multiple things from same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## load data on 2020 crimes in DC
df = dc_crim_2020 = pd.read_csv("https://opendata.arcgis.com/datasets/f516e0dd7b614b088ad781b0c4002331_2.csv")

# 1. Questions: list comprehension

- In the class example, why did we need the "courses" at the beginning of the list comprehension?
- How did the join syntax work in the example where we pasted together offenses from the same ward?
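To make the first question concrete, here is a minimal sketch that writes the same filter as a for loop and as a list comprehension. The list and variable names are hypothetical stand-ins, not the ones from class. The expression at the front of the comprehension is whatever gets stored in the new list, so it plays the role of the value passed to append in the loop version.

```python
# Hypothetical toy list, mirroring the course names used in this notebook
courses = ["QSS20", "QSS17", "GOV10"]

# Loop version: the value we keep is spelled out in append(...)
loop_result = []
for course in courses:
    if "QSS" in course:
        loop_result.append(course)

# Comprehension version: the leading `course` is what gets stored,
# just like the argument to append(...) above
comp_result = [course for course in courses if "QSS" in course]

# Changing the leading expression changes what is stored
lower_result = [course.lower() for course in courses if "QSS" in course]

print(loop_result)
print(comp_result)
print(lower_result)
```

Without the leading expression, Python would not know what to put in the new list for each element that passes the filter.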

In [24]:
## toy example

### pool of courses
all_courses = ["QSS20", "QSS17", "GOV10", "GOV4", "CSC1"]


## 1.1 Application 1: filtering to a smaller list

When we might use this: we have a dataframe with a lot of columns and want to filter to a smaller set that matches some pattern.

In [25]:
### pull out ones that contain GOV in the string
gov_c = [course for course in all_courses
        if "GOV" in course]
gov_c # result

['GOV10', 'GOV4']

In [4]:
### showing that "course" is just a placeholder/
### arbitrary iterator variable
gov_c_alt = [x for x in all_courses if "GOV" in x]

gov_c == gov_c_alt

True

## 1.2 Application 2: keep all objects in the list but apply some transformation

In [26]:
all_courses

## strip the numbers from the course names
courses_prefix = [x[:3] for x in all_courses]
courses_prefix # could then find unique elements


['QSS20', 'QSS17', 'GOV10', 'GOV4', 'CSC1']

['QSS', 'QSS', 'GOV', 'GOV', 'CSC']

In [17]:
# Join all together example
" #:)# ".join(courses_prefix)

'QSS #:)# QSS #:)# GOV #:)# GOV #:)# CSC'
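The same join idea can be applied within groups, which is roughly what the "paste together offenses from the same ward" question is about. The sketch below uses a small hypothetical dataframe rather than the DC crime data, and the helper name paste_values is made up for illustration. The separator string comes first, and join glues every element of the iterable together with that separator in between.

```python
import pandas as pd

# Hypothetical toy data; the real notebook uses the DC crime dataframe
toy = pd.DataFrame({
    "WARD": [1, 1, 2, 2, 2],
    "OFFENSE": ["THEFT", "ROBBERY", "THEFT", "ARSON", "THEFT"],
})

def paste_values(values):
    """Glue all strings in one group together, separated by '; '."""
    return "; ".join(values)

# For each ward, `values` is the Series of that ward's offenses
pasted = toy.groupby("WARD")["OFFENSE"].agg(paste_values)
print(pasted)

# Variation: keep each offense only once per ward (sorted for a stable order)
pasted_unique = toy.groupby("WARD")["OFFENSE"].agg(
    lambda values: "; ".join(sorted(set(values)))
)
print(pasted_unique)
```

Note that join only works on strings, so a numeric column would need to be converted first (for example with astype(str)).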

#### Your turn: Using the original list, add a "dartmouth_" prefix to each course name

## 1.3 Subsetting columns

Use list comprehension to filter to columns with "ID" in the name (the match is case-sensitive). Then, create a new dataframe called df1 that contains only those columns.

In [27]:
id_cols = [col for col in df.columns if "ID" in col]
id_cols

## Then, filter the data and store the result as df1
df1 = df[id_cols]
df1

['BID', 'OBJECTID', 'OCTO_RECORD_ID']

| | BID | OBJECTID | OCTO_RECORD_ID |
|---|---|---|---|
| 0 | | 499862475 | |
| 1 | | 499862478 | |
| 2 | | 499862479 | |
| 3 | | 499862481 | |
| 4 | | 499862484 | |
| ... | ... | ... | ... |
| 27927 | | 500401564 | |
| 27928 | | 500401587 | |
| 27929 | | 500401591 | |
| 27930 | | 500401595 | |
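As a point of comparison, pandas also has a built-in DataFrame.filter method that can select columns by substring. The sketch below runs both approaches on a small hypothetical dataframe (the column values are made up) and checks that they pick the same columns. It also shows a case-insensitive variant of the comprehension, which is an addition for illustration and not something the exercise asks for.

```python
import pandas as pd

# Hypothetical toy dataframe with a few column names like the ones above
toy = pd.DataFrame({
    "BID": [None, None],
    "OBJECTID": [499862475, 499862478],
    "WARD": [1, 2],
    "Offense_id": [10, 11],
})

# Approach 1: list comprehension over the column names (case-sensitive)
id_cols = [col for col in toy.columns if "ID" in col]
subset_comp = toy[id_cols]

# Approach 2: DataFrame.filter with like=, which also matches substrings
subset_filter = toy.filter(like="ID", axis=1)

# Case-insensitive variant: lowercase each name before checking
id_cols_any_case = [col for col in toy.columns if "id" in col.lower()]

print(id_cols)
print(list(subset_filter.columns))
print(id_cols_any_case)
```

The comprehension is more flexible, since any Python condition can go after the if, while filter is shorter for simple substring matches.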


## 1.4 Comprehension for numbers

Here we compare two ways of creating a list of even numbers.

In [17]:
num_list = np.arange(10000)
num_list

array([   0,    1,    2, ..., 9997, 9998, 9999])

In [18]:
%%time
even_nums = [i for i in num_list if (i % 2) == 0]

CPU times: user 10.2 ms, sys: 1.72 ms, total: 12 ms
Wall time: 12.6 ms


In [21]:
%%time
num_list[~(num_list % 2).astype(bool)]

CPU times: user 1.09 ms, sys: 318 µs, total: 1.41 ms
Wall time: 1.14 ms


array([   0,    2,    4, ..., 9994, 9996, 9998])
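A quick way to convince ourselves that the two approaches do the same job is to compare their results directly. The sketch below re-creates num_list so it runs on its own, and it uses a slightly more readable mask (num_list % 2 == 0) in place of the astype(bool) version above; both select the same elements.

```python
import numpy as np

num_list = np.arange(10000)

# Approach 1: list comprehension, checks one element at a time in Python
even_comp = [i for i in num_list if (i % 2) == 0]

# Approach 2: vectorized boolean mask, evaluated on the whole array at once
even_mask = num_list[num_list % 2 == 0]

# Same mask as in the timed cell above, written with astype(bool)
even_mask_alt = num_list[~(num_list % 2).astype(bool)]

# The values agree; only the container type differs (list vs. numpy array)
same_values = np.array_equal(np.array(even_comp), even_mask)
print(same_values)
print(type(even_comp), type(even_mask))
```

The timing gap shown above comes from where the loop runs: the comprehension loops in Python, while the mask is computed inside numpy.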

#### Your turn: Extract all numbers in num_list that end in 7

#### Your turn: Divide each number in num_list by 2

# 2. Questions: lambda functions

Two questions:

- General syntax (see here for a reference: https://www.w3schools.com/python/python_lambda.asp)
- How they work in the context of aggregations

How is a lambda function different from a "normal" user-defined function (one that has the syntax def func_name(arg): etc.)?

- Operates similarly to normal user-defined functions in that it can take any number of arguments
- Operates differently in that it's an "anonymous" function: it is written as a single expression and has no name of its own unless we assign it to a variable
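A small sketch of what "anonymous" means in practice. The function and list names here are hypothetical. Python records the name given in a def statement, while every lambda just reports the placeholder name <lambda>. Lambdas are most often written inline, right where a short function is needed, as in the sorted call below.

```python
def add_named(x, y):
    return x + y

# Assigning a lambda to a variable lets us reuse it,
# but the function object itself still has no name of its own
add_lambda = lambda x, y: x + y

print(add_named.__name__)   # add_named
print(add_lambda.__name__)  # <lambda>

# Typical use: pass a lambda inline without ever naming it.
# Here we sort courses by the number after the 3-letter prefix.
courses = ["QSS20", "GOV4", "CSC1", "GOV10"]
by_number = sorted(courses, key=lambda c: int(c[3:]))
print(by_number)
```

If the logic needs more than one expression or will be reused in many places, a regular def is usually the clearer choice.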

In [22]:
def f1(x,y):
    return x+y

f2 = lambda x, y: x+y

f1(2,1)
f2(2,1)

3

3

## 2.1 General syntax for lambda functions

In [36]:
### two pools of courses
socsci = ["QSS20", "QSS17", "GOV10"]
natsci = ["BIO2", "PHYS3"]


## generalize some of the steps
## above into a two-arg function
## that takes the course prefix
## and a list of all courses
def filter_courses(prefix, all_courses):
    rel_courses = [c for c in all_courses if prefix in c]
    return rel_courses

### a few applications 
filter_courses(prefix = "QSS", all_courses = socsci)
filter_courses(prefix = "QSS", all_courses = natsci)
filter_courses(prefix = "BIO", all_courses = natsci)

In [None]:
## what's the lambda function version of this
filter_courses_v2 = lambda prefix, all_courses: [c for c in all_courses if prefix in c]
filter_courses_v2(prefix = "BIO", all_courses = natsci)


## 2.2 Using lambda alongside agg

In [50]:
## use lambda to find the modal block in each ward: multiple ways
## (x.mode() returns a Series, so this assumes one modal block per ward)

### way 1: subsetting agg syntax
df.groupby("WARD")["BLOCK"].agg(lambda x: x.mode())

### way 2: dictionary agg syntax
df.groupby("WARD").agg({"BLOCK": lambda x: x.mode()})


WARD
1           3100 - 3299 BLOCK OF 14TH STREET NW
2    1300 - 1699 BLOCK OF CONNECTICUT AVENUE NW
3      5300 - 5399 BLOCK OF WISCONSIN AVENUE NW
4          100 - 199 BLOCK OF CARROLL STREET NW
5     900 - 999 BLOCK OF RHODE ISLAND AVENUE NE
6                600 - 699 BLOCK OF H STREET NE
7         934 - 1099 BLOCK OF EASTERN AVENUE NE
8        2300 - 2399 BLOCK OF GOOD HOPE ROAD SE
Name: BLOCK, dtype: object

| WARD | BLOCK |
|---|---|
| 1 | 3100 - 3299 BLOCK OF 14TH STREET NW |
| 2 | 1300 - 1699 BLOCK OF CONNECTICUT AVENUE NW |
| 3 | 5300 - 5399 BLOCK OF WISCONSIN AVENUE NW |
| 4 | 100 - 199 BLOCK OF CARROLL STREET NW |
| 5 | 900 - 999 BLOCK OF RHODE ISLAND AVENUE NE |
| 6 | 600 - 699 BLOCK OF H STREET NE |
| 7 | 934 - 1099 BLOCK OF EASTERN AVENUE NE |
| 8 | 2300 - 2399 BLOCK OF GOOD HOPE ROAD SE |
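One thing to watch for with mode inside agg: Series.mode returns a Series, and when two values are tied it contains both of them. The sketch below uses a hypothetical toy dataframe (not the DC data) where one ward has a tie, and picks the first mode with .iloc[0] so that each group reduces to a single value. Taking the first of the tied values is an arbitrary choice made for illustration.

```python
import pandas as pd

# Hypothetical toy data: ward 1 has a clear mode, ward 2 has a tie
toy = pd.DataFrame({
    "WARD": [1, 1, 1, 2, 2],
    "BLOCK": ["A ST", "A ST", "B ST", "C ST", "D ST"],
})

# mode() on the tied ward returns both values, in sorted order
ward2_modes = toy.loc[toy["WARD"] == 2, "BLOCK"].mode()
print(ward2_modes)

# Taking .iloc[0] guarantees one value per ward, even with ties
first_mode = toy.groupby("WARD")["BLOCK"].agg(lambda x: x.mode().iloc[0])
print(first_mode)
```

If ties matter for the analysis, an alternative is to join all tied values into one string instead of keeping only the first.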


#### Your turn: Group by WARD and get the mean and standard deviation (std) of X and Y