# Python For Data Analysis
## Class 3

The objectives of this class are for y'all to have:

1. Learned a few more useful `pandas` patterns
2. Generated some plots with seaborn
3. Done some more exploratory data analysis on your own
4. Run a linear regression (?)

In [None]:
import pandas as pd # use 'as' keyword to namespace a package
import numpy as np
complaints = pd.read_csv('../pandas-cookbook/data/311-service-requests.csv', low_memory=False)


In [None]:
useful_cols = ['Created Date', 
               'Closed Date',
               'Due Date', 
               'Agency',
               'Facility Type',
               'Agency Name', 
               'Complaint Type', 
               'Borough', 
               'Status', 
               'Descriptor']
cleaned = complaints[useful_cols]
cleaned = cleaned.rename(columns=lambda x: x.lower().replace(' ','_'))

In [None]:
cleaned.head()

## Last Week's Exercise

Write a function that takes a column name, a number n, and a dataframe as arguments, and returns a column with the top n categories kept and all other categories replaced by "other".

In [None]:
def top_n(col_name, top_n, df):
    """Return df[col_name] with everything outside the top_n most common categories set to 'other'."""
    # Get the value counts of our column name from our DF
    value_counts = df[col_name].value_counts()
    if len(value_counts) <= top_n:
        # Use % formatting so the column name is actually substituted into the message
        print("WARNING: There are no more distinct categories in df[%s] than requested. "
              "No replacement performed" % col_name)
        return df[col_name] # Do no replacement, return the column
    keep_vals = list(value_counts.head(top_n).index) # Get the top_n categories from value_counts
    keep_mask = df[col_name].isin(keep_vals) # Identify the "keeper" rows
    new_x = df[col_name].copy() # Copy to prevent in-place editing
    new_x[~keep_mask] = 'other' # Negate the mask to identify those we want to replace
    replace_count = len(value_counts) - top_n # Number of categories folded into 'other'
    print("Replaced %s categories with 'other'" % replace_count)
    return new_x
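
For comparison, the same "keep the top n, lump the rest" idea can be written with `Series.where`, which keeps values where a condition is true and substitutes a replacement everywhere else. This is only a sketch on a made-up toy dataframe (the `toy` frame and its values are assumptions for illustration, not rows from the 311 data):

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({"agency": ["NYPD", "NYPD", "NYPD", "HPD", "HPD", "DOT", "DEP"]})

# The two most frequent categories
top_2 = toy["agency"].value_counts().head(2).index

# Keep values that are in the top 2; replace everything else with "other"
lumped = toy["agency"].where(toy["agency"].isin(top_2), "other")
print(lumped.value_counts())
```

Like the function above, this leaves the original column untouched and returns a new one.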

### Time to resolution

In [None]:
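### Creating a column with time-to-close
# pd.options.mode.chained_assignment = None
print(cleaned.dtypes) # Before conversion: the date columns are plain strings (object)
# Parse the date strings into datetimes so we can do arithmetic on them
cleaned['created_date'] = pd.to_datetime(cleaned['created_date'])
cleaned['closed_date'] = pd.to_datetime(cleaned['closed_date'])
cleaned.head()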

In [None]:
cleaned.dtypes

In [None]:
# Dividing a timedelta by one minute gives the elapsed time as a float number of minutes
cleaned['time_to_resolution'] = (cleaned['closed_date'] - cleaned['created_date']) / np.timedelta64(1, 'm')
cleaned.head(20)
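
To see what that division does, here is a small sketch on two made-up rows (the `toy` frame and its timestamps are assumptions for illustration). Subtracting two datetime columns gives a timedelta column; dividing by `np.timedelta64(1, 'm')` converts it to minutes, and rows with a missing close date come out as `NaN`:

```python
import numpy as np
import pandas as pd

# Two made-up requests: one closed after 45 minutes, one still open
toy = pd.DataFrame({
    "created_date": pd.to_datetime(["2013-10-31 01:00:00", "2013-10-31 02:00:00"]),
    "closed_date": pd.to_datetime(["2013-10-31 01:45:00", None]),
})

elapsed = toy["closed_date"] - toy["created_date"]  # timedelta64 column
minutes = elapsed / np.timedelta64(1, "m")          # float minutes, NaN where not closed

# An equivalent spelling using the .dt accessor
same_minutes = elapsed.dt.total_seconds() / 60
print(minutes)
```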

### Basic Aggregations

In [None]:
# group our data by complaint type
by_complaint = cleaned.groupby('complaint_type')

In [None]:
# average response time
by_complaint['time_to_resolution'].mean()

In [None]:
###cleaned.loc[cleaned['time_to_resolution']<0,:].head()

Exercise:
  * What's going on with negative time-to-resolution?
  * Determine which types of complaints are most often late (closed_date > due_date)
      * Which types of complaints have the highest *percentage* of late calls?
  * From which boroughs?


### More advanced Data Manipulations with Pandas

In pandas, the split-apply-combine pattern is one of the most powerful but least understood features of the tool. In fact, I don't even understand it very well, but we'll struggle through it together.

We'll cover a few operations *in brief* with specific emphasis on
* Indexes in pandas
* groupby objects
* unstack
* pivot_table


#### Indexes in pandas
Indexes are convenient ways to keep track of the *grain* (i.e., what defines a "row") in a dataframe. Dataframes can have multiple index levels, which allow for slicing-and-dicing in very sophisticated ways. Unfortunately, this also means there's a lot of complexity, which can be overwhelming for people who are new to the framework.

The thing to keep in mind is that indexes are **not** just columns like any other. They must be accessed (and managed) differently.
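
A minimal sketch of that difference, using a made-up toy dataframe (the `toy` frame and its counts are assumptions for illustration): once columns are moved into the index they no longer show up in `.columns`, they are selected with `.loc`, and `reset_index()` turns them back into ordinary columns.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "borough": ["BROOKLYN", "BROOKLYN", "QUEENS"],
    "status": ["Closed", "Open", "Open"],
    "n": [5, 3, 2],
})

# Move two columns into a two-level index: the grain is now (borough, status)
indexed = toy.set_index(["borough", "status"])
print(indexed.columns.tolist())  # only 'n' is still a column
print(list(indexed.index.names)) # the grain lives in the index

# Index levels are selected with .loc rather than indexed["borough"]
brooklyn = indexed.loc["BROOKLYN"]

# reset_index() turns the index levels back into ordinary columns
flat = indexed.reset_index()
```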

#### Groupby

As we saw above, we can use `groupby()` to summarize our data. The object returned by `groupby()` is not a dataframe -- in fact, it's more like 'instructions for grouping' than actual grouped data.

```python
grpd = complaints.groupby('Status')  # the raw frame still has the capitalized 'Status' column
# <pandas.core.groupby.DataFrameGroupBy object at 0x113aeada0>
```

Only when we apply some sort of function to perform an aggregation do we actually get results back:

```python
grpd['Status'].count()

Status
Assigned       6189
Closed        57165
Email Sent      129
Open          43972
Pending        3165
Started         447
Unassigned        2
Name: Status, dtype: int64
```

When we group data, the column we're grouping by becomes the index of the object we're returning (rather than a column of a table). Because we're now working explicitly with indexes (and sometimes multiple indexes!), it'll be helpful to look at some of the index-specific methods available to us.
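
Here is a small sketch of that behavior on made-up data (the `toy` frame is an assumption for illustration). The aggregated result is a Series whose index is the grouping column, and `reset_index` is one way to get a plain table back:

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "status": ["Open", "Closed", "Open", "Closed", "Closed"],
    "agency": ["NYPD", "HPD", "HPD", "NYPD", "DOT"],
})

# Aggregating the groupby object gives a Series indexed by status
counts = toy.groupby("status")["agency"].count()
print(counts.index.name)

# Look up a group by its index label
closed_count = counts.loc["Closed"]

# Turn the index back into a column to get a two-column table
as_table = counts.reset_index(name="n")
print(as_table)
```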


#### Unstack

If we group by multiple columns, we'll get data back with a multi-level index. We can "unstack" the innermost level of that index into columns to get more tabular data.

In [None]:
b_s = cleaned.groupby(['borough','status'])['status'].count()
print(b_s.head(20))
print("----------------------------")
print("Now Unstack!")
print("----------------------------")
print(b_s.unstack())

Our "unstacked" object now looks like tabular data, which is much easier to work with.
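
The same move on a tiny made-up dataframe (the `toy` frame is an assumption for illustration) makes the reshaping easier to follow. This sketch counts rows with `size()` instead of `count()` and passes `fill_value=0` so that combinations that never occur show up as 0 rather than `NaN`:

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "borough": ["BROOKLYN", "BROOKLYN", "BROOKLYN", "QUEENS"],
    "status": ["Open", "Open", "Closed", "Open"],
})

# One row per (borough, status) pair, stored in a two-level index
counts = toy.groupby(["borough", "status"]).size()
print(counts)

# Pivot the inner index level (status) into columns
wide = counts.unstack(fill_value=0)
print(wide)
```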

#### Pivot Table

"Pivot tables" are a powerful tool, very common in the world of spreadsheet-first data analytics. In fact, when analysts first make the move from Excel to Python or R, pivot tables are the feature they miss the most (and they generally find the in-code approximations of these tools overly burdensome). Pandas, nicely, has an API that feels familiar to this flavor of analysis.





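# Illustrative pattern only: `data` is not defined in this notebook (it appears to come from a flights dataset)
flights_by_carrier = data.pivot_table(index='flight_date', columns='unique_carrier', values='flight_num', aggfunc='count')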
flights_by_carrier.head()

In [None]:
status_by_borough = cleaned.pivot_table(index="status", columns="borough", values="created_date", aggfunc="count")
status_by_borough.head(30)
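
One way to think about `pivot_table` is as a groupby followed by an unstack. The sketch below checks that on a made-up toy dataframe (the `toy` frame and its dates are assumptions for illustration):

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "borough": ["BROOKLYN", "BROOKLYN", "QUEENS", "QUEENS", "QUEENS"],
    "status": ["Open", "Closed", "Open", "Open", "Closed"],
    "created_date": pd.to_datetime(["2013-10-01"] * 5),
})

# Rows are statuses, columns are boroughs, cells count non-null created_date values
pivoted = toy.pivot_table(index="status", columns="borough",
                          values="created_date", aggfunc="count")

# The same table built by hand: group, count, then unstack the borough level
grouped = toy.groupby(["status", "borough"])["created_date"].count().unstack()

print(pivoted)
print((pivoted.values == grouped.values).all())
```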

Exercise: 
* plot a line chart with complaints by day by borough (time on the x axis, one line per borough)

```python
# complaints[['Unique Key', 'Borough']].groupby([complaints.index.date, 'Borough']).count().unstack().plot()
```

## More Plotting with Seaborn

In [None]:
!pip install seaborn

In [None]:
%matplotlib inline
import seaborn as sns
pd.options.mode.chained_assignment = None

In [None]:
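# Do this part interactively
# .copy() gives us an independent dataframe, so adding a column below does not touch `cleaned`
bk_manh = cleaned.loc[cleaned['borough'].isin(['BROOKLYN', 'MANHATTAN'])].copy()
bk_manh['cleaned_agency'] = top_n('agency', 3, bk_manh)
# Keep only resolutions between 0 and 1000 minutes to drop negative and extreme values
bk_manh = bk_manh.loc[bk_manh['time_to_resolution'] >= 0]
bk_manh = bk_manh.loc[bk_manh['time_to_resolution'] <= 1000]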

# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="cleaned_agency", y="time_to_resolution", hue="borough", data=bk_manh, split=True,
               inner="quart", palette={"BROOKLYN": "b", "MANHATTAN": "y"})
sns.despine(left=True)

In [None]:
sns.set(style="whitegrid", color_codes=True)
sns.violinplot(x="cleaned_agency", y="time_to_resolution", hue="borough", data=bk_manh, split=True,
               inner="quart", palette={"BROOKLYN": "b", "MANHATTAN": "y"})
sns.despine(left=True)

In [None]:
# Navigate to http://seaborn.pydata.org/tutorial/categorical.html and try out some of the categorical 
# plotting options with seaborn

In [None]:
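# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
sns.histplot(cleaned['time_to_resolution'].dropna(), kde=True, stat="density")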

In [None]:
no_zeroes = bk_manh.loc[bk_manh['time_to_resolution'] > 0]
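# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
sns.histplot(no_zeroes['time_to_resolution'].dropna(), kde=True, stat="density")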

In [None]:
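import matplotlib.pyplot as plt

# One small histogram per (borough, agency) combination
g = sns.FacetGrid(no_zeroes, row='borough', col='cleaned_agency', margin_titles=True)
g.map(sns.histplot, "time_to_resolution", kde=True, stat="density")
plt.show() # sns.plt is not available in current seaborn releases; call matplotlib directly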

## Simple Linear Regression

In [31]:
# Find the coordinates of times square
# 40.7589° N, 73.9851° W
# Append column to df with distance to times square
# sqrt((x_2 - x_1)**2 + (y_2 - y_1)**2)
# Make a scatter plot
# Introduce simple linear regression
# Give intuitive explanation
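
The outline above can be sketched roughly as follows. Everything here is an assumption for illustration: the points are synthetic (not real 311 records), the `latitude` and `longitude` column names are hypothetical (the `cleaned` frame above does not carry coordinates), and `np.polyfit` with degree 1 is just one simple way to fit a least-squares line. The distance is the plain Euclidean formula from the outline, measured in degrees, so it is only a rough proxy for real distance.

```python
import numpy as np
import pandas as pd

# Times Square: 40.7589 N, 73.9851 W (west longitude is negative)
TIMES_SQ_LAT, TIMES_SQ_LON = 40.7589, -73.9851

# Synthetic points, made up for illustration; not real 311 records
toy = pd.DataFrame({
    "latitude": [40.7589, 40.7689, 40.7789, 40.7889],
    "longitude": [-73.9851, -73.9851, -73.9851, -73.9851],
    "time_to_resolution": [10.0, 30.0, 50.0, 70.0],
})

# sqrt((x_2 - x_1)**2 + (y_2 - y_1)**2), in degrees
toy["dist_to_times_sq"] = np.sqrt(
    (toy["latitude"] - TIMES_SQ_LAT) ** 2 + (toy["longitude"] - TIMES_SQ_LON) ** 2
)

# Simple linear regression: find the slope and intercept of the line
# time_to_resolution = intercept + slope * dist_to_times_sq
# that minimizes the squared vertical distances to the points
slope, intercept = np.polyfit(toy["dist_to_times_sq"], toy["time_to_resolution"], deg=1)
predicted = intercept + slope * toy["dist_to_times_sq"]
print(slope, intercept)
```

Intuitively, the slope says how much `time_to_resolution` changes per unit of distance, and the intercept is the predicted value at distance zero. A scatter plot of the two columns with the fitted line drawn on top is the usual way to check whether a straight line is a sensible description at all.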