# Groupby
In this lecture, we'll be talking about one of the most powerful tools in Pandas, the 'groupby' feature.

Before that, let's discuss some special methods in pandas that will also be useful when we use groupby.

Data Source: https://www.icpsr.umich.edu/icpsrweb/NACJD/studies/35509

## The delimiter in read_csv

What is a csv file? At its most basic, it is data separated by commas. csv even stands for "comma-separated values." Pandas assumes this is what separates our data when we call `pd.read_csv()`. But what if our data wasn't separated by commas?

We call this separator the delimiter. The most common delimiter is a comma, but you may also see data delimited by tabs. How can we tell Pandas that a tab separates our data? Do we even have to?

Take a look at `data/drugs.tsv`

In [None]:
import pandas as pd
data = pd.read_csv('data/drugs.tsv',delimiter='\t')

In [None]:
data

In [None]:
#Use the QUESTID2 column as the index
data = pd.read_csv('data/drugs.tsv',delimiter='\t',index_col='QUESTID2')
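To see what the delimiter changes without needing the file, here is a minimal sketch that parses a small tab-separated string. The two rows of values are invented for illustration; only the column names `QUESTID2` and `ALC_AGE` come from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the values are made up for illustration.
tsv_text = "QUESTID2\tALC_AGE\n101\t16\n102\t21\n"

# With the default comma delimiter, each line is read as one single column.
wrong = pd.read_csv(io.StringIO(tsv_text))

# Telling pandas the delimiter is a tab splits the columns correctly.
right = pd.read_csv(io.StringIO(tsv_text), delimiter='\t', index_col='QUESTID2')

print(list(wrong.columns))
print(list(right.columns))
```

So yes, we do have to tell Pandas: it does not guess the delimiter when we leave the default in place.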

## usecols

What if we only wanted to load certain columns, perhaps because our data is very large? First, how can we even see the column names when our data is large?

In [None]:
pd.read_csv('data/drugs.tsv',delimiter='\t',index_col='QUESTID2',nrows=10)

Now we can see the column names. What if we only wanted the ones on alcohol?

In [None]:
#This raises an error: index_col='QUESTID2' is not one of the usecols
pd.read_csv('data/drugs.tsv',delimiter='\t',index_col='QUESTID2',usecols=['ALC_EVER','ALC_AGE','ALC_DAYS'])

In [None]:
#Need to also load in our index column
pd.read_csv('data/drugs.tsv',delimiter='\t',index_col='QUESTID2',usecols=['QUESTID2','ALC_EVER','ALC_AGE','ALC_DAYS'])
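If typing out the alcohol columns by hand gets tedious, one possible approach is to read only the header and filter the names. This is a sketch on an invented in-memory sample, not the real file; it uses `nrows=0` (instead of the `nrows=10` used earlier) and assumes the alcohol columns all start with `ALC_`.

```python
import io
import pandas as pd

# Hypothetical sample with made-up values; column names mirror the dataset.
tsv_text = (
    "QUESTID2\tALC_EVER\tALC_AGE\tCOC_DAYS\n"
    "101\t1\t16\t0\n"
    "102\t1\t21\t3\n"
)

# nrows=0 parses only the header row, which is enough to list the column names.
header = pd.read_csv(io.StringIO(tsv_text), delimiter='\t', nrows=0)
alcohol_cols = [c for c in header.columns if c.startswith('ALC_')]

# The index column must be listed in usecols as well.
subset = pd.read_csv(
    io.StringIO(tsv_text),
    delimiter='\t',
    index_col='QUESTID2',
    usecols=['QUESTID2'] + alcohol_cols,
)
print(subset)
```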

## .unique

What if we wanted to know which values people reported for the number of days they used cocaine in the past 30 days?
We essentially want to find the distinct values in `data.COC_DAYS`.

In [None]:
data.COC_DAYS.unique()

## .value_counts
Similar to `.unique()`, what if we wanted to know how many people matched each unique value?

In [None]:
data.COC_DAYS.value_counts()

Note that NaN is ignored. If we want to know how many NaNs there are, we can fill them with some other value and then call `value_counts()`.
-1 is a good choice because we know -1 doesn't make sense as an answer to "How many days have you done cocaine in the past month?"

In [None]:
data.COC_DAYS.fillna(-1).value_counts()
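An alternative to filling with a sentinel is the `dropna=False` argument of `value_counts`, which keeps NaN as its own row. The sketch below uses an invented Series rather than the real `COC_DAYS` column.

```python
import numpy as np
import pandas as pd

# Hypothetical answers; NaN stands for people with no recorded value.
days = pd.Series([0, 0, 3, np.nan, np.nan, np.nan])

# Default behavior: NaN is left out of the counts.
counts_without_nan = days.value_counts()

# dropna=False adds a row that counts the NaNs.
counts_with_nan = days.value_counts(dropna=False)

# If we only need the number of missing values, isna().sum() is enough.
n_missing = days.isna().sum()

print(counts_with_nan)
print(n_missing)
```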

## .describe()
We can get all of the summary statistics for a series in one call.

In [None]:
data.COC_AGE.describe()
#count only counts the non-NaN values
#describe() returns a Series

## Describe on a dataframe

In [None]:
data.describe()

## Apply

What if we want to apply some function to each item in a series that doesn't have some easy numpy or pandas function?

In [None]:
#Multiply every element by 2
data.CIG_EVER.apply(lambda x: x*2)
#What is this notation?
#What is this doing?

In [None]:
#apply returned a new Series; the original column is unchanged
data.CIG_EVER

Here's a useful example of `apply`. It turns out that the numbers in the `EMP` column encode meaning, which can be summarized by this dictionary.

In [None]:
emp_dict = {1:'Full Time',2:'Part Time',3:'Unemployed',4:'Other',99:'Child'}
emp_dict

In [None]:
data.EMP

In [None]:
#Let's change our numbers to the actual meaning using apply
data.EMP.apply(lambda x: emp_dict[x])

This was shorthand. We could also write a function that takes one element and returns another, and pass that into apply.

In [None]:
#Longhand
def dict_apply(my_int_label):
    return emp_dict[my_int_label]

In [None]:
data.EMP.apply(dict_apply)

Let's actually change the EMP column.

In [None]:
data.EMP = data.EMP.apply(dict_apply)
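One thing to watch for with a dictionary lookup inside `apply` is a code that is missing from the dictionary. The sketch below uses an invented Series with a deliberately unknown code (7) to compare `Series.map`, which accepts a dictionary directly, with an `apply` that supplies a fallback. The label `'Unknown'` is an assumption made for illustration.

```python
import pandas as pd

emp_dict = {1: 'Full Time', 2: 'Part Time', 3: 'Unemployed', 4: 'Other', 99: 'Child'}

# Hypothetical codes; 7 is deliberately absent from emp_dict.
emp = pd.Series([1, 3, 99, 7])

# map accepts the dictionary itself; codes it cannot find become NaN.
mapped = emp.map(emp_dict)

# emp.apply(lambda x: emp_dict[x]) would raise a KeyError on 7,
# so dict.get with a default is one way to keep apply from failing.
safe = emp.apply(lambda x: emp_dict.get(x, 'Unknown'))

print(mapped)
print(safe)
```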

## Get_Dummies

Say we want to run an ML algorithm like Logistic Regression on this dataset to predict something. For example, suppose we wanted to use all other features to predict the number of days a person used cocaine.

In [None]:
#For example
y = data.COC_DAYS
X = data.drop('COC_DAYS', axis=1)

In [None]:
X

In [None]:
y

Ignore the obvious issues with NaNs. We may just want to fill those with 0s for the days columns anyway. For the age columns we can fill with 90 to say they didn't try it until "death" or something like that.
We have another issue. What is our ML model going to do with our text column 'EMP'? We cannot do math on text.

(In general, we can say that EMP is a categorical feature that describes something qualitative, whereas the other features are numerical features that describe something quantitative.)
(Also, it's somewhat up for debate whether the <>_EVER features are numeric or categorical.)

What should we do?

One idea may be to encode our text back into the numbers 1,2,3,4 and 99. Why is this a very bad idea? What could we do instead?

In [None]:
#One-hot (dummy) encoding: one indicator column per category
pd.get_dummies(data.EMP)
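The problem with the integer codes is that a model would read them as quantities: 'Child' (99) would look far larger than 'Other' (4), even though the labels have no order or distance. Here is a small sketch of dummy encoding on an invented Series. Passing `dtype=int` is a choice made here so the output shows 0s and 1s regardless of pandas version.

```python
import pandas as pd

# Hypothetical employment labels, already converted from codes to text.
emp = pd.Series(['Full Time', 'Child', 'Part Time', 'Full Time'], name='EMP')

# One indicator column per category, sorted alphabetically by label.
dummies = pd.get_dummies(emp, dtype=int)

print(dummies)

# Each row has exactly one 1, so no category is "bigger" than another.
print(dummies.sum(axis=1).tolist())
```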

In [None]:
#Store the dummies so they could replace data.EMP
temp = pd.get_dummies(data.EMP)

In [None]:
#You could do something like this
#Let's not do it though

#data = data.drop('EMP', axis=1)
#data = pd.concat([data, temp], axis=1)

In [None]:
data

# Groupby
Now we can finally talk about groupby. What is a groupby? Let's use an example.
What if we wanted to know some summary stats based on one's employment?

In [None]:
data.groupby('EMP')

What is this strange object? It is a grouped dataframe (a `DataFrameGroupBy`). It doesn't show any data by itself; we get results by asking it for summary statistics. Let's take a look.

In [None]:
#Get mean of all stats based on employment
data.groupby('EMP').mean()

In [None]:
#Max works too
data.groupby('EMP').max()

In [None]:
#We can describe
data.groupby('EMP').describe()
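To make it concrete what a groupby is doing, the sketch below computes a per-group mean by hand (filter the rows for each label, then average) and compares it with `groupby`. The tiny dataframe is invented for illustration.

```python
import pandas as pd

# Hypothetical data: two employment groups with made-up ages.
toy = pd.DataFrame({
    'EMP': ['Full Time', 'Part Time', 'Full Time', 'Part Time'],
    'ALC_AGE': [16, 18, 20, 22],
})

# By hand: split the rows by label, then take the mean of each piece.
manual = {
    label: toy.loc[toy.EMP == label, 'ALC_AGE'].mean()
    for label in toy.EMP.unique()
}

# groupby does the same split, apply, and combine in one step.
grouped = toy.groupby('EMP')['ALC_AGE'].mean()

print(manual)
print(grouped)
```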

We technically can group by other columns, but does that really make sense? We typically only use groupby on categorical variables.

## Apply with Groupby

We can technically use any aggregation function in our groupby. An aggregation function is one that takes in a series and returns one value.
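Let's say we wanted to get the entropy of the series.
For this lecture, we define entropy as:
$$\sum_i x_i \log_2(x_i)$$
where if $x_i=0$ we say $x_i \log_2(x_i)=0$. (The textbook Shannon entropy is $-\sum_i p_i \log_2(p_i)$ over probabilities $p_i$; here we apply the simpler unnormalized sum directly to the raw values.)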

Before we write a function, let's think about how we might do that.

In [None]:
#Take ALC_AGE as an example

In [None]:
import numpy as np
data.ALC_AGE*np.log2(data.ALC_AGE)

In [None]:
#You may think we'd have issues with 0s, but 0*log2(0) just evaluates to nan (with a warning),
#and summing a pandas Series skips nans, so they won't contribute to the sum
0*np.log2(0)

In [None]:
#So we can do
np.sum(data.ALC_AGE*np.log2(data.ALC_AGE))

Now, let's write this as a function:

In [None]:
def entropy(ser):
    return np.sum(ser*np.log2(ser))

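Now we can pass this function to our groupby. Since `entropy` takes a series and returns one value, we use `agg`, which calls it once per column within each group. (`apply` would instead pass each group's whole dataframe, and whether that includes the text column `EMP` depends on the pandas version.)

In [None]:
data.groupby('EMP').agg(entropy)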

We can also pass numpy functions to our groupby, but most of the time this is already built into pandas.

In [None]:
#For example, this:
data.groupby('EMP').mean()
#Can be done like this
data.groupby('EMP').agg(np.mean)
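To check the whole pattern on numbers small enough to verify by hand, here is a sketch on an invented dataframe. The group labels and values are made up, and the `np.errstate` wrapper is an addition here that only silences the warning from `log2(0)`; the formula is the same as in the lecture.

```python
import numpy as np
import pandas as pd

def entropy(ser):
    # Lecture formula: sum of x * log2(x). A 0 produces nan, which the
    # Series sum skips; errstate only hides the log2(0) warning.
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.sum(ser * np.log2(ser))

# Hypothetical groups, with powers of two so the logs are whole numbers.
toy = pd.DataFrame({
    'EMP': ['A', 'A', 'A', 'B', 'B'],
    'ALC_AGE': [0.0, 2.0, 4.0, 8.0, 1.0],
})

# Group A: 0 (skipped) + 2*1 + 4*2 = 10. Group B: 8*3 + 1*0 = 24.
result = toy.groupby('EMP')['ALC_AGE'].agg(entropy)
print(result)
```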
