# Python For Data Analysis
## Class 2

The objectives of this class are for y'all to have:

1. Gained familiarity with the `pandas` API
2. Started exploring our 311 data set
3. Learned the split / apply / combine data munging paradigm
4. Learned some more visualization and interactive data analysis tricks

In [4]:
import pandas as pd
import matplotlib
%matplotlib inline

In [5]:
# Load the data
complaints = pd.read_csv('../pandas-cookbook/data/311-service-requests.csv', low_memory=False)

# Note: It's nice to do this in its own cell so we never have to re-run this costly line

In [None]:
# Let's clean up our data by doing a few things:
# 1) let's limit to a few columns we know are going to be interesting
# 2) let's clean the column names so we don't have to deal with spaces or capital letters

In [15]:
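complaints.columns
useful_cols = ['Created Date', 'Closed Date', 'Agency Name', 'Complaint Type', 'Borough', 'Status']
# .copy() gives us an independent frame, so adding columns later doesn't raise SettingWithCopyWarning
cleaned = complaints[useful_cols].copy()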

In [16]:
cleaned.head()

Unnamed: 0,Created Date,Closed Date,Agency Name,Complaint Type,Borough,Status
0,10/31/2013 02:08:41 AM,,New York City Police Department,Noise - Street/Sidewalk,QUEENS,Assigned
1,10/31/2013 02:01:04 AM,,New York City Police Department,Illegal Parking,QUEENS,Open
2,10/31/2013 02:00:24 AM,10/31/2013 02:40:32 AM,New York City Police Department,Noise - Commercial,MANHATTAN,Closed
3,10/31/2013 01:56:23 AM,10/31/2013 02:21:48 AM,New York City Police Department,Noise - Vehicle,MANHATTAN,Closed
4,10/31/2013 01:53:44 AM,,Department of Health and Mental Hygiene,Rodent,MANHATTAN,Pending


Exercise:
* Programmatically lowercase the column names and change the spaces to underscores
  * Try not to rely on the current ordering of the columns to do this

In [66]:
# One solution: rename() applies the function to each column name, so column order doesn't matter
cleaned = cleaned.rename(columns=lambda x: x.lower().replace(' ', '_'))

In [18]:
cleaned.head()

Unnamed: 0,created_date,closed_date,agency_name,complaint_type,borough,status
0,10/31/2013 02:08:41 AM,,New York City Police Department,Noise - Street/Sidewalk,QUEENS,Assigned
1,10/31/2013 02:01:04 AM,,New York City Police Department,Illegal Parking,QUEENS,Open
2,10/31/2013 02:00:24 AM,10/31/2013 02:40:32 AM,New York City Police Department,Noise - Commercial,MANHATTAN,Closed
3,10/31/2013 01:56:23 AM,10/31/2013 02:21:48 AM,New York City Police Department,Noise - Vehicle,MANHATTAN,Closed
4,10/31/2013 01:53:44 AM,,Department of Health and Mental Hygiene,Rodent,MANHATTAN,Pending


In [22]:
cleaned.complaint_type.unique()

array(['Noise - Street/Sidewalk', 'Illegal Parking', 'Noise - Commercial',
       'Noise - Vehicle', 'Rodent', 'Blocked Driveway',
       'Noise - House of Worship', 'Street Light Condition',
       'Harboring Bees/Wasps', 'Taxi Complaint', 'Homeless Encampment',
       'Traffic Signal Condition', 'Food Establishment', 'Noise - Park',
       'Broken Muni Meter', 'Benefit Card Replacement',
       'Sanitation Condition', 'ELECTRIC', 'PLUMBING', 'HEATING',
       'GENERAL CONSTRUCTION', 'Street Condition', 'Consumer Complaint',
       'Derelict Vehicles', 'Noise', 'Drinking', 'Indoor Air Quality',
       'Panhandling', 'Derelict Vehicle', 'Lead', 'Water System',
       'Noise - Helicopter', 'Homeless Person Assistance',
       'Root/Sewer/Sidewalk Condition', 'Sidewalk Condition', 'Graffiti',
       'DOF Literature Request', 'Animal in a Park',
       'Overgrown Tree/Branches', 'Air Quality', 'Dirty Conditions',
       'Water Quality', 'Other Enforcement', 'Collection Truck Noise',
     

In [36]:
# Let's figure out what the top complaints are
cleaned.groupby('complaint_type').size().sort_values(ascending=False).head()


complaint_type
HEATING                   14200
GENERAL CONSTRUCTION       7471
Street Light Condition     7117
DOF Literature Request     5797
PLUMBING                   5373
dtype: int64

In [76]:
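# assign() returns a new frame with the extra column, which sidesteps SettingWithCopyWarning
# (the old `cleaned.is_copy = False` workaround is not available in newer pandas versions)
cleaned = cleaned.assign(complaint_type_cln=cleaned['complaint_type'].str.lower())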

In [41]:
cleaned.head()

Unnamed: 0,created_date,closed_date,agency_name,complaint_type,borough,status,complaint_type_cln
0,10/31/2013 02:08:41 AM,,New York City Police Department,Noise - Street/Sidewalk,QUEENS,Assigned,noise - street/sidewalk
1,10/31/2013 02:01:04 AM,,New York City Police Department,Illegal Parking,QUEENS,Open,illegal parking
2,10/31/2013 02:00:24 AM,10/31/2013 02:40:32 AM,New York City Police Department,Noise - Commercial,MANHATTAN,Closed,noise - commercial
3,10/31/2013 01:56:23 AM,10/31/2013 02:21:48 AM,New York City Police Department,Noise - Vehicle,MANHATTAN,Closed,noise - vehicle
4,10/31/2013 01:53:44 AM,,Department of Health and Mental Hygiene,Rodent,MANHATTAN,Pending,rodent


In [42]:
cleaned.groupby('complaint_type_cln').size().sort_values(ascending=False).head()

complaint_type_cln
heating                   14200
general construction       7471
street light condition     7117
dof literature request     5797
plumbing                   5439
dtype: int64
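
The plumbing count rises from 5373 to 5439 after lower-casing, most likely because values that differ only in case are now counted as one category. Here is a minimal sketch of that effect on made-up data (the `toy` frame is hypothetical, not the 311 data):

```python
import pandas as pd

# Made-up data: the same category spelled with different capitalization
toy = pd.DataFrame({"complaint_type": ["PLUMBING", "PLUMBING", "Plumbing", "HEATING"]})

# Grouping on the raw values treats 'PLUMBING' and 'Plumbing' as separate categories
raw_counts = toy.groupby("complaint_type").size()

# Grouping on the lower-cased values merges them into one
clean_counts = toy.groupby(toy["complaint_type"].str.lower()).size()

print(raw_counts)
print(clean_counts)
```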

In [59]:
# Which rows are about rats or noisy vehicles?
cleaned["complaint_type"].isin(['Rodent', 'Noise - Vehicle'])

0         False
1         False
2         False
3          True
4          True
5         False
6         False
7         False
8         False
9         False
10        False
11        False
12        False
13         True
14         True
15        False
16        False
17        False
18        False
19        False
20        False
21        False
22         True
23        False
24        False
25        False
26        False
27        False
28        False
29        False
          ...  
111039    False
111040    False
111041    False
111042    False
111043    False
111044    False
111045    False
111046    False
111047    False
111048    False
111049    False
111050    False
111051    False
111052    False
111053    False
111054    False
111055    False
111056    False
111057    False
111058    False
111059    False
111060    False
111061    False
111062    False
111063    False
111064    False
111065    False
111066    False
111067    False
111068    False
Name: complaint_type, dt

In [77]:
# Replace some values
mask = cleaned["complaint_type"].isin(['Rodent', 'Noise - Vehicle'])
# Work on a copy so the replacement doesn't also modify (or warn about) the original column
new_series = cleaned['complaint_type'].copy()
new_series[mask] = 'rats or cars'



In [78]:
new_series.head()

0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3               rats or cars
4               rats or cars
Name: complaint_type, dtype: object
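
Another way to replace values, sketched here on made-up data, is `Series.mask`. It returns a new Series instead of assigning into an existing one, so the original column is left alone. The `toy` frame and the `replaced` name are just for illustration:

```python
import pandas as pd

# Made-up data standing in for the complaint_type column
toy = pd.DataFrame({"complaint_type": ["Noise - Commercial", "Noise - Vehicle", "Rodent"]})

mask = toy["complaint_type"].isin(["Rodent", "Noise - Vehicle"])

# Series.mask replaces the values where the condition is True and returns a new Series
replaced = toy["complaint_type"].mask(mask, "rats or cars")

print(replaced.tolist())
print(toy["complaint_type"].tolist())  # the original column is untouched
```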

Exercise:
* Write a function that takes a column name, a number n, and a dataframe as arguments, and returns a column with the top n categories kept and all other categories replaced by "other"
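
One possible solution, as a sketch rather than the only answer. The function name `top_n_categories` and the `other_label` parameter are made up for this example. Ties at the cutoff are broken in whatever order `value_counts` returns them, and missing values end up as "other":

```python
import pandas as pd

def top_n_categories(col, n, df, other_label="other"):
    """Return df[col] with every value outside the n most frequent replaced by other_label."""
    # value_counts() sorts by frequency, most common first, so head(n) gives the top n
    top = df[col].value_counts().head(n).index
    # where() keeps values where the condition is True and substitutes other_label elsewhere
    return df[col].where(df[col].isin(top), other_label)

# Made-up data standing in for the 311 complaints
toy = pd.DataFrame(
    {"complaint_type": ["HEATING", "HEATING", "HEATING", "Rodent", "Rodent", "Noise"]}
)
result = top_n_categories("complaint_type", 2, toy)
print(result.tolist())
```

On the real data this would be called as `top_n_categories('complaint_type_cln', 5, cleaned)`.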

ToDo:
* Introduce summary stats with group by
  * `reset_index` to convert the result back to a data frame
  * Introduce common aggregates
  * Exercise: find the most common hour for a complaint by borough
* Introduce transform
  * Fill in data with the mean
* Introduce lambdas and custom aggregators with GroupBy
  * Exercise: Center and scale some metric
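
A rough preview of the items above on made-up data. The `hours_open` metric is hypothetical (the 311 data doesn't have it as a column), and the date format string is an assumption based on the `created_date` values shown earlier:

```python
import numpy as np
import pandas as pd

# Made-up data: hours_open is a hypothetical metric with one missing value
toy = pd.DataFrame({
    "borough": ["QUEENS", "QUEENS", "QUEENS", "MANHATTAN", "MANHATTAN", "MANHATTAN"],
    "created_date": ["10/31/2013 02:08:41 AM", "10/31/2013 02:01:04 AM", "10/31/2013 01:15:00 AM",
                     "10/31/2013 01:56:23 AM", "10/31/2013 01:53:44 AM", "10/31/2013 02:00:24 AM"],
    "hours_open": [1.0, 3.0, np.nan, 2.0, 4.0, 6.0],
})

# Summary stats: split by borough, apply common aggregates, combine into one table.
# reset_index() turns the group labels back into a regular column.
summary = toy.groupby("borough")["hours_open"].agg(["count", "mean", "max"]).reset_index()

# Most common hour for a complaint by borough: parse the timestamps, then take the mode per group
toy["hour"] = pd.to_datetime(toy["created_date"], format="%m/%d/%Y %I:%M:%S %p").dt.hour
common_hour = toy.groupby("borough")["hour"].agg(lambda s: s.mode().iloc[0])

# transform returns a result aligned with the original rows, so it can fill gaps
# with each borough's own mean
group_mean = toy.groupby("borough")["hours_open"].transform("mean")
toy["hours_filled"] = toy["hours_open"].fillna(group_mean)

# Custom function with a lambda: center and scale the metric within each borough
toy["hours_z"] = toy.groupby("borough")["hours_filled"].transform(
    lambda s: (s - s.mean()) / s.std()
)

print(summary)
print(common_hour)
print(toy[["borough", "hours_filled", "hours_z"]])
```

The difference to keep in mind: `agg` gives one row per group, while `transform` gives one value per original row.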