# Advanced Data Wrangling III: Transformations and Groupby

Files needed = ('atussum_2017.dat', 'movies.csv')

Let's continue practicing techniques for manipulating data into forms that are amenable to analysis. In this section, we will cover: 

1. `.replace()` for recoding variables
2. `.map()` for working element-wise on DataFrames
3. String methods for working with strings in DataFrames
4. `.groupby()` for performing group-specific calculations

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os

# Set the working directory; change this path to match your own machine
os.chdir('/Users/jackson/Documents/ECON570')

### American Time Use Survey (ATUS)
Let's work with the ATUS data again. To refresh your memory, the Bureau of Labor Statistics oversees the [American Time Use Survey](https://www.bls.gov/tus/overview.htm), which asks a sample of Americans to complete detailed diaries keeping track of each minute of their day. 

Follow this link [www.bls.gov/tus/data/datafiles_2017.htm](https://www.bls.gov/tus/data/datafiles_2017.htm) to the page for the 2017 survey. Download the **ATUS 2017 Activity summary file (zip)** file located in the **2017 Basic ATUS Data Files** section of the page. Alternatively, download it directly [www.bls.gov/tus/datafiles/atussum_2017.zip](https://www.bls.gov/tus/datafiles/atussum_2017.zip). 

Unzip the file. We are looking for `atussum_2017.dat`. It is a comma-separated file (even though it has a '.dat' extension). Let's get it loaded.

### Variables

This data set has 421 variables! That's too many for us today. Let's just keep some demographic data and some data about working and sleeping. 

In [2]:
variables2 = {'TEAGE':'age', 'TESEX':'sex', 'PEEDUCA':'edu', 'GTMETSTA':'metro', 'TELFS':'employ', 
 'TUDIARYDAY':'day', 't050101':'work_main', 't050102':'work_other', 't010101':'sleep', 't050201':'work_soc', 't010102':'no_sleep'}

In [4]:
atus_small = pd.read_csv('data/atussum_2017.dat', usecols=variables2.keys())

In [5]:
atus_small.rename(columns=variables2, inplace=True)

In [6]:
atus_small.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10223 entries, 0 to 10222
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   age         10223 non-null  int64
 1   sex         10223 non-null  int64
 2   edu         10223 non-null  int64
 3   metro       10223 non-null  int64
 4   employ      10223 non-null  int64
 5   day         10223 non-null  int64
 6   sleep       10223 non-null  int64
 7   no_sleep    10223 non-null  int64
 8   work_main   10223 non-null  int64
 9   work_other  10223 non-null  int64
 10  work_soc    10223 non-null  int64
dtypes: int64(11)
memory usage: 878.7 KB


In [7]:
atus_small.head()

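|   | age | sex | edu | metro | employ | day | sleep | no_sleep | work_main | work_other | work_soc |
|---|-----|-----|-----|-------|--------|-----|-------|----------|-----------|------------|----------|
| 0 | 34 | 2 | 39 | 1 | 1 | 1 | 728 | 0 | 450 | 0 | 0 |
| 1 | 28 | 2 | 40 | 1 | 5 | 7 | 385 | 0 | 0 | 0 | 0 |
| 2 | 15 | 1 | 35 | 1 | 5 | 4 | 570 | 0 | 0 | 0 | 0 |
| 3 | 46 | 1 | 39 | 1 | 1 | 2 | 525 | 0 | 480 | 0 | 0 |
| 4 | 85 | 1 | 44 | 1 | 1 | 7 | 756 | 0 | 0 | 0 | 0 |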


Looks pretty good. 

## replace( ) 

The sex variable is coded 1 for male and 2 for female. I do not want to have to remember that!

The `replace()` method replaces one value with another. One syntax is 
```python
atus_small['sex'] = atus_small['sex'].replace(1, 'male')
```
but a more powerful one passes a dict or a list.
```python
atus_small['sex'] = atus_small['sex'].replace({1:'male', 2:'female'})
```

In [None]:
sex_codes = {1:'male', 2:'female'}
atus_small['sex'] = atus_small['sex'].replace(sex_codes)
atus_small.head()
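One detail worth knowing: `.replace()` leaves any value that is not in the dict untouched, while `.map()` with the same dict turns unlisted values into missing values. Here is a small sketch on toy data (the code `9` is a made-up "not reported" value, not something from the ATUS codebook).

```python
import pandas as pd

# Toy data: 9 is a hypothetical code that is NOT in our dictionary
codes = pd.Series([1, 2, 2, 9])
labels = {1: 'male', 2: 'female'}

replaced = codes.replace(labels)  # unlisted values (9) are left unchanged
mapped = codes.map(labels)        # unlisted values become NaN

print(replaced.tolist())
print(mapped.isna().tolist())
```

Which behavior you want depends on the task: `.replace()` is safer when you only want to recode a few values, and `.map()` makes unexpected codes easy to spot.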

Let's also recode `edu`, which holds the highest level of education obtained. What are the unique values?

In [None]:
atus_small['edu'].unique()

What do all these codes represent? I read the documentation...

In [None]:
# let's group all codes less than 39 as "less than high school"
atus_small.loc[atus_small['edu'] < 39, 'edu'] = 0
# define a dictionary
edu_codes = {0:'less than high school', 39:'high school', 40:'some college', 41:'associate', 42:'associate', 43:'bachelor', 
              44:'master', 45:'prof', 46:'phd'}
# replace
atus_small['edu'] = atus_small['edu'].replace(edu_codes)
atus_small.head()

## Apply a function to a Series or DataFrame: map()

We can apply functions to the individual elements in a Series or a DataFrame using `.map()`. These can be built-in functions, or user-defined functions (including lambda functions).

This is quite powerful. For illustration, let's create a new variable that is the log of `1 + 'sleep'`.

In [None]:
atus_small['log1p_sleep']=atus_small['sleep'].map(np.log1p)
atus_small.head()

Note that we could have just as easily used `np.log1p(atus_small['sleep'])` or `np.log(atus_small['sleep']+1)` in this case.
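To convince ourselves that these three approaches agree, here is a quick check on a toy Series (the values are stand-ins for the `sleep` column, not the real data).

```python
import numpy as np
import pandas as pd

minutes = pd.Series([0, 385, 728])  # toy stand-in for the 'sleep' column

via_map = minutes.map(np.log1p)   # function applied element by element
via_numpy = np.log1p(minutes)     # function applied to the whole column at once
via_log = np.log(minutes + 1)     # same calculation written out by hand

print(np.allclose(via_map, via_numpy), np.allclose(via_map, via_log))
```

When a vectorized numpy function exists, calling it on the whole column is usually the faster choice; `.map()` earns its keep when the function only works on one value at a time.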

*Side note*: Why is there a separate numpy function for `log(x+1)`? Let's check the [docs](https://numpy.org/doc/2.1/reference/generated/numpy.log1p.html). Looks like this is useful if `x` is very small.

In [None]:
# An aside about np.log1p
x=1e-50
same = 1+x == 1
print('same?',same)
print('np.log1p(x)=',np.log1p(x))
print('np.log(1+x)=',np.log(1+x))

We can also apply the same function to several columns of a DataFrame at once using `map()`. We do this below with a lambda function.

### Lambda functions

Python offers a simple way to create anonymous, single expression functions in a single line of code. These are called "lambda functions." Here's a simple example.

In [None]:
double_sum = lambda a, b: (a + b) * 2
print(double_sum(5,7))

The basic syntax for any lambda function is:

```python
function_name = lambda arguments : expression
```
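A lambda function is just a compact way of writing a function we could also define with `def`. The sketch below writes the same calculation both ways (the names `double_sum_lambda` and `double_sum_def` are only for illustration).

```python
# The same function written as a lambda and as a regular def
double_sum_lambda = lambda a, b: (a + b) * 2

def double_sum_def(a, b):
    """Return twice the sum of a and b."""
    return (a + b) * 2

print(double_sum_lambda(5, 7), double_sum_def(5, 7))
```

Lambdas are most useful when the function is short and we only need it once, for example as the argument to `.map()`.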
The following lambda function converts minutes to hours for all of the time variables in our ATUS DataFrame.

In [None]:
# We can map to several columns at once.
time_vars=['sleep', 'no_sleep','work_main','work_other','work_soc']
atus_small[time_vars] = atus_small[time_vars].map(lambda x: x / 60)
atus_small.head()
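Two side notes, sketched on toy data. First, for simple arithmetic we do not need `map()` at all: dividing the DataFrame by 60 gives the same answer. Second, `DataFrame.map()` was added in pandas 2.1; in earlier versions the same method is called `applymap()`, so the fallback below is only needed if you are on an older install.

```python
import pandas as pd

# Toy data in minutes (not the real ATUS values)
toy = pd.DataFrame({'sleep': [480, 540], 'work_main': [0, 450]})

# Vectorized arithmetic works on the whole DataFrame at once
hours_vectorized = toy / 60

# DataFrame.map exists in pandas 2.1+; older versions use applymap
elementwise = toy.map if hasattr(toy, 'map') else toy.applymap
hours_mapped = elementwise(lambda x: x / 60)

print(hours_vectorized.equals(hours_mapped))
```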

## String methods

Pandas string methods are analogous to the string methods in standard Python, but they are built to work on a whole column at once. These *vectorized string methods* work element-wise over an entire column. The method call looks like

```python
data['var'].str.method()
```

where `.str.method( )` is the method we are applying. A list of vectorized string methods is available in the [documentation here](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary). Below, we try a few out. 
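Before turning to real data, here is a small sketch on a made-up Series. It shows that `.str` methods can be chained and that missing values pass through as missing instead of raising an error.

```python
import pandas as pd

# Toy titles, including stray spaces and a missing value
titles = pd.Series(['  toy story ', 'Heat', None])

cleaned = titles.str.strip().str.title()  # chain two string methods
lengths = titles.str.len()                # number of characters in each string

print(cleaned.tolist())
print(lengths.tolist())
```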

### MovieLens data set

We are going to work with the [MovieLens](https://grouplens.org/datasets/movielens/) *ml-latest-small* dataset. The GroupLens organization released this data. It is meant to help build recommendation algorithms, like the ones you see on Netflix or Spotify. \[In 2006, [Netflix started a contest](https://en.wikipedia.org/wiki/Netflix_Prize), with a $1 million prize, for an algorithm that could beat their own.\] GroupLens has other ratings datasets, too, on music, jokes, and books.  

An observation is a movie.

In [None]:
movies = pd.read_csv('movies.csv')
movies.sample(10)

### str.contains( )
The genres are mixed together. Let's get all the comedies. The `.str.contains( )` method returns a boolean Series that is True for observations in which the string contains the search term.

In [None]:
movies['genres'].str.contains('Comedy')

In [None]:
print(movies.shape)
comedies = movies[movies['genres'].str.contains('Comedy')]
print(comedies.shape)
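A word of caution: by default, `.str.contains()` treats the search term as a regular expression. That matters for our genres, because `|` means "or" in a regular expression. The sketch below uses a toy Series (not the real MovieLens file) to show the difference that `regex=False` makes.

```python
import pandas as pd

genres = pd.Series(['Comedy|Romance', 'Romance|Drama', 'Action', 'Comedy'])

# Default: the pattern is a regex, so this means "Comedy OR Romance"
either = genres.str.contains('Comedy|Romance')

# regex=False: look for the literal text 'Comedy|Romance'
literal = genres.str.contains('Comedy|Romance', regex=False)

print(either.tolist())   # [True, True, False, True]
print(literal.tolist())  # [True, False, False, False]
```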

### str.split( )
This method splits each string at the delimiter that is passed to `.split( )`. It returns a list of the chunks that fall between the delimiters. 

This could be useful for processing name data that come in the form: last,first or city,state. 

In [None]:
# The movie genres are separated with the '|' character. 
# DataFrames can have columns of lists!
movies['genre_split'] = movies['genres'].str.split('|')
movies.head()
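For data like last,first we usually want separate columns, not a column of lists. Passing `expand=True` does that. Here is a minimal sketch with made-up names (the column labels `last` and `first` are our own choice).

```python
import pandas as pd

names = pd.Series(['Doe,Jane', 'Smith,John'])

# expand=True returns a DataFrame with one column per chunk
parts = names.str.split(',', expand=True)
parts.columns = ['last', 'first']

print(parts)
```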

### str.join( )
Put strings together. Separate the pieces with a delimiter of your choosing. 

In [None]:
movies['with_commas'] = movies['genre_split'].str.join(', ')
movies.sample(5)

## Practice:

Take a few minutes and try the following. Feel free to chat with those around you if you get stuck. I am here, too. 

Our data does not have a column for the year the movie was released. Let's create one. The year the movie was released is in the title string.    

0. Reload 'movies.csv'
1. Use `.str.strip()` ([docs](https://pandas.pydata.org/pandas-docs/version/0.24.2/reference/api/pandas.Series.str.strip.html)) to remove any leading or trailing spaces from 'title'.

2. Extract the four-digit year from the titles and put them into a new column named 'year'.  

Notice that the year, including the parentheses, is always the last 6 characters of the title. You might try `str.slice()` and work with negative indexes to count from the end of 'title' (with index -1 representing the last character).

If there is any extra space at the end of a title, it will mess up my algorithm! That's why we strip the extra spaces first. 

3. There are 12 movies that do not have a year in their title. Find them in your DataFrame. You might try the `str.isdigit()` [(doc)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.isdigit.html#pandas.Series.str.isdigit) method to see if the year you extracted in step 2. is numeric. 

## Groupby

We often want to know how groups differ. Do workers with econ degrees make more than workers with history degrees? Do men live longer than women? Does it matter how much education you have? 

Pandas provides the `groupby()` method to ease computing statistics by group ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)). This kind of method shows up in many data-oriented computing languages and packages. The idea is summed up as 

> split-apply-combine

Here is the canonical [illustration](https://www.oreilly.com/library/view/learning-pandas/9781783985128/ch09s02.html). The big idea is to 

1. **Split** the data up into groups. The groups are defined by *key* variables.
2. **Apply** some method or function to each group: mean, std, max, etc. This returns a smaller bit of data, often just one number.
3. **Combine** the results of the 'apply' from each group into a new data structure.
  
  
Split-apply-combine is an incredibly powerful feature of pandas. We will cover the basics here. 
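To make the three steps concrete, here is a sketch that does split, apply, and combine by hand on a toy DataFrame and then checks the result against `groupby()`. The data and the short `edu` labels are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'edu': ['hs', 'hs', 'ba', 'ba'],
                    'sleep': [8.0, 9.0, 7.0, 7.5]})

pieces = {}
for key in toy['edu'].unique():
    group = toy[toy['edu'] == key]       # split: rows belonging to this key
    pieces[key] = group['sleep'].mean()  # apply: one number per group

by_hand = pd.Series(pieces).sort_index()  # combine: collect the results

# The same calculation in one line
one_line = toy.groupby('edu')['sleep'].mean()

print(by_hand.tolist() == one_line.tolist())
```

The loop is only here to show what is going on under the hood. In practice we let `groupby()` do the work.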

### Split-apply-combine

Let's use our split-apply-combine method in one simple line of code:

1. Split: We pass `.groupby()` a *key* which tells the method which variable(s) to, well, group by. (We will group by `edu` for now.)
2. Apply and Combine: We want to compute some statistic by group. (We will use the `.mean()` for now.)

We typically also want to select certain columns to perform the calculations. (Here, we'll consider `sleep`, `no_sleep`, `work_main`, `work_other`, and `work_soc`. **We also have to keep our groupby variable(s). This means we have to keep `edu` as well!**)

Let's give it a try.

In [None]:
# Grab the cols we want from the df before using the groupby. Remember to keep the grouping variable, too.
timeuse_means = atus_small[['sleep','no_sleep','work_main','work_other','work_soc','edu']].groupby('edu').mean()
timeuse_means
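An alternative that some people find easier to read is to group first and select the columns afterward. The sketch below, on toy data, suggests the two orderings give the same answer. Either way, selecting columns matters: in recent pandas versions, asking for the mean of a text column such as `sex` raises an error.

```python
import pandas as pd

toy = pd.DataFrame({'edu': ['hs', 'hs', 'ba'],
                    'sex': ['male', 'female', 'male'],
                    'sleep': [8.0, 9.0, 7.0],
                    'work_main': [6.0, 0.0, 8.0]})

cols = ['sleep', 'work_main']

# Select columns (plus the key), then group
select_first = toy[cols + ['edu']].groupby('edu').mean()

# Group, then select the columns to aggregate
group_first = toy.groupby('edu')[cols].mean()

print(select_first.equals(group_first))
```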

### Aggregation methods

Some common aggregation methods include: `.mean()`, `.sum()`, `.std()`, `.describe()`, `.min()`, `.max()`, but there are many more. Any function that returns a scalar will work. 

### Several statistics at once
So far, once we have grouped our data, we have applied a method that computes a single statistic: `mean()`, `count()`,...

We now introduce the `.agg()` method, which lets us compute several statistics at once. You can even pass it a user-defined function or a lambda function.

In [None]:
# agg() lets us compute many stats at once
mult_stats = atus_small[['sleep','work_main','edu']].groupby('edu').agg(['count', 'mean', 'median', 'std', 'max'])
mult_stats

Now we have a MultiIndex on the columns.
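How do we pull results out of a column MultiIndex? Here is a small sketch on toy data: the outer label selects all the statistics for one variable, and a tuple selects a single column.

```python
import pandas as pd

toy = pd.DataFrame({'edu': ['hs', 'hs', 'ba', 'ba'],
                    'sleep': [8.0, 9.0, 7.0, 7.5],
                    'work_main': [6.0, 0.0, 8.0, 7.0]})

stats = toy.groupby('edu').agg(['mean', 'max'])

sleep_stats = stats['sleep']           # DataFrame with columns 'mean' and 'max'
mean_sleep = stats[('sleep', 'mean')]  # a single column, returned as a Series

print(stats.columns.tolist())
print(mean_sleep)
```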

### groupby( ) with many keys
Can we group by several keys? You know we can. Let's compute the means and the medians of the DataFrame this time.

In [None]:
ed_sex_stats = atus_small[['edu','sex','sleep','work_main']].groupby(['edu','sex']).agg(['mean','median'])
ed_sex_stats

Wow! The MultiIndex in the rows comes from the groupby keys, and the MultiIndex in the columns gives the two statistics we specified.
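Working with a MultiIndex in the rows follows the same logic. In this sketch on toy data, a tuple picks out one group, the outer key alone picks out everything under it, and `.reset_index()` turns the keys back into ordinary columns.

```python
import pandas as pd

toy = pd.DataFrame({'edu': ['hs', 'hs', 'ba', 'ba', 'ba'],
                    'sex': ['male', 'female', 'male', 'female', 'female'],
                    'sleep': [8.0, 9.0, 7.0, 7.5, 8.5]})

stats = toy.groupby(['edu', 'sex'])['sleep'].agg(['mean', 'median'])

one_group = stats.loc[('ba', 'female')]  # a single (edu, sex) pair
all_ba = stats.loc['ba']                 # every sex within edu == 'ba'
flat = stats.reset_index()               # keys become regular columns again

print(one_group['mean'])
print(flat.columns.tolist())
```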

## Practice:

The `.quantile()` method ([docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html)) computes quantiles from the data. (e.g., `.quantile(0.5)` computes the median, or the 50th percentile)

1. Let's look at a measure of variation in sleep time 
   
    A. Compute the 75th percentile (the 0.75 quantile) of 'sleep' for each edu category. Name the new DataFrame 'q75'.      
    B. Compute the 25th percentile (the 0.25 quantile) of 'sleep' for each edu category. Name the new DataFrame 'q25'.  
    C. For each edu category, compute the difference between the 75th percentile and the 25th percentile. 

This is sometimes called the *inter-quartile range*. It is a measure of the variability of a variable. It is less sensitive to outliers than the standard deviation. 

2. Calculate the mean, median, standard deviation, and inter-quartile range of sleep, grouped by edu and sex, in a single line of code. \[Hint: Use a lambda function!\]

### Try at home: 
How would you fix the name of your new column in the resulting DataFrame?