### Project Info
- Two Deliverables
    - 1,500-word report on your analysis. Not technical.
    - Well-documented Jupyter notebook with technical details
- Intermediate Deliverable
    - Project proposal: Clear statement of the question and a brief discussion of the data, with one figure or statistic that shows your team has loaded the data and is ready for analysis.
    - No sandboxed datasets, like from Kaggle

# Workflow: A simple example

files needed = ('atusresp_0323.dat', 'atussum_0323.dat')
**Note:** *These files are on the larger side, so you can just follow along if you don't want to download.*

In this notebook I am going to take you through a real-world example. Let's do our best to replicate some of the results from [this FRB article](https://www.philadelphiafed.org/-/media/frbp/assets/economy/articles/economic-insights/2023/q4/eiq423-time-use-before-during-and-after-the-pandemic.pdf) on time use before, during, and after the pandemic. 

The goal is to illustrate, from beginning to end, how one takes an idea for an economic analysis to graphical presentation. For brevity, given the limited class time we have, we will take a small bite out of this exercise and focus on producing a single time-series figure.

## 1. Pose a preliminary question(s)

How did time use evolve before, during, and after the pandemic? 

Some initial thoughts:

- Did time spent alone spike at the onset of the pandemic?
- What about time spent on childcare?
- How does the evolution of time use differ by subgroup? We might consider differences across age, gender, race, educational attainment, occupation, etc.


Let's start with a basic trend.

## 2. Find the appropriate data

This is often difficult. We might not find data that can answer our preliminary question(s) accurately or completely. In these cases, we have to go back and forth between steps 1 and 2 until we are *confident* that we can access the kind of data that can effectively answer our question.

In this case, detailed individual time use data are publicly available through the American Time Use Survey (ATUS). Data from 2003-2023 can be downloaded on the [ATUS website](https://www.bls.gov/tus/data.htm).

Let's first import the packages that we need to run the analysis, and then we'll read in the data.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt 
import seaborn as sns
import matplotlib.ticker as mtick

In [2]:
variables = {'TUCASEID':'id', 'TUYEAR':'year','TRTALONE':'alone',
             'TRTCCC':'cust', 'TRTCHILD':'childcare', 'TRTFAMILY':'family',
             'TRTFRIEND':'friends', 'TRTNOCHILD':'nonownchild'}

In [3]:
atus_resp = pd.read_csv('atusresp_0323.dat', usecols=variables.keys())
atus_resp.head()


## 3. Get the data into usable form 

The data contain many, many variables with mysterious names. The `variables` dictionary has two purposes. It selects the variables that I want to read in for my analysis, and it translates these variables to usable names. I constructed `variables` by reading the ATUS documentation.

> TRTALONE Total nonwork-related time respondent spent alone (in minutes)
> 
> TRTCCC Total nonwork-related time respondent spent with customers, clients, and co-workers (in minutes) 
> 
> TRTCHILD Total nonwork-related time respondent spent with household or nonhousehold children < 18 (in minutes)
> 
> TRTFAMILY Total nonwork-related time respondent spent with family members (in minutes)
> 
> TRTFRIEND Total nonwork-related time respondent spent with friends (in minutes)
> 
> TRTHHFAMILY Total nonwork-related time respondent spent with household family members (in minutes)
> 
> TRTNOCHILD Total nonwork-related time respondent spent with nonown children < 18 (in minutes)

In [None]:
atus_resp.rename(columns=variables, inplace=True)
atus_resp.head()
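
For anyone following along without the files, the select-then-rename pattern can be tried on a tiny in-memory table. This is a minimal sketch: the CSV text, the `EXTRA` column, and the names `fake_file`, `demo_vars`, and `demo` are made up for illustration and are not part of the ATUS data.

```python
import io
import pandas as pd

# Made-up stand-in for the real file: three columns, only two of which we want.
fake_file = io.StringIO(
    "TUCASEID,TUYEAR,EXTRA\n"
    "1,2019,7\n"
    "2,2020,8\n"
)

# Keys pick the columns to read; values are the friendlier names.
demo_vars = {'TUCASEID': 'id', 'TUYEAR': 'year'}

demo = pd.read_csv(fake_file, usecols=list(demo_vars.keys()))
demo = demo.rename(columns=demo_vars)
print(demo.columns.tolist())
```

The `EXTRA` column never enters memory, which is the main reason to pass `usecols` when the real files are large.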

Are the types correct? Do the variable values seem reasonable? What are the units?

In [None]:
atus_resp.dtypes

In [None]:
atus_resp.describe()

Let's calculate the percentage of all nonwork time spent alone.

In [None]:
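# Sum the components once so the long expression is easier to read
total = (atus_resp['alone'] + atus_resp['cust'] + atus_resp['childcare']
         + atus_resp['family'] + atus_resp['friends'] + atus_resp['nonownchild'])
atus_resp['perc_alone'] = 100 * atus_resp['alone'] / total
# Keep 'year' as a regular column: the groupby and merge steps below select it by name
atus_resp.head()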

Hmmm, let's check this denominator to make sure it's sensible.

In [None]:
atus_resp['nonwork'] = (atus_resp['alone']+atus_resp['cust']+atus_resp['childcare']+atus_resp['family']+atus_resp['friends']+atus_resp['nonownchild'])
atus_resp['nonwork'].describe()
# A total above 1440 minutes (24 hours) would signal double-counting

In [None]:
atus_resp[atus_resp['nonwork']>1440].count()
# Once the double-counting is fixed, totals above 24 hours should be rare outliers

The maximum value exceeds `1440 = 24*60`, suggesting that there is some double-counting here. One possible culprit is that time spent with children is also counted under the `family` variable. Let's try dropping `childcare` from the sum.

In [None]:
atus_resp['nonwork'] = (atus_resp['alone']+atus_resp['cust']+atus_resp['family']+atus_resp['friends']+atus_resp['nonownchild'])
atus_resp['nonwork'].describe()

In [None]:
atus_resp[atus_resp['nonwork']>1440].count()

The problem is far smaller now, and I have not found a better way to remove double-counting in the denominator. I'll proceed for now. Before publishing the report, I would check that the general takeaways are not sensitive to the inclusion of a small number of respondents with implausible values.
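
One way such a sensitivity check might look is sketched below. The numbers are made up, and `toy`, `plausible`, and `check` are hypothetical names; the idea is simply to compute the yearly means twice, once with everyone and once without the implausible rows, and compare.

```python
import pandas as pd

# Made-up data: the last respondent has an implausible nonwork total (> 1440 minutes).
toy = pd.DataFrame({
    'year':    [2019, 2019, 2020, 2020],
    'alone':   [300, 500, 400, 900],
    'nonwork': [600, 1000, 800, 1500],
})
toy['perc_alone'] = 100 * toy['alone'] / toy['nonwork']

# Yearly means with everyone included, and with implausible rows dropped.
mean_all = toy.groupby('year')['perc_alone'].mean()
plausible = toy[toy['nonwork'] <= 1440]
mean_plausible = plausible.groupby('year')['perc_alone'].mean()

# Side-by-side comparison; large gaps would be a warning sign.
check = pd.DataFrame({'all': mean_all, 'plausible_only': mean_plausible})
check['difference'] = check['all'] - check['plausible_only']
print(check)
```

With only four rows the gap in 2020 looks large. In the real data, where implausible rows are rare, the hope is that the `difference` column stays close to zero.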

## 4. Preliminary analysis

Calculate the average percent of nonwork time spent alone in each year.

In [None]:
atus_resp['perc_alone']=100*atus_resp['alone']/atus_resp['nonwork']
atus_perc = atus_resp[['year','perc_alone']].groupby('year').mean()
atus_perc.head()

Now, we can plot this basic trend over time.

In [None]:
# Convert the index to datetime
atus_perc.index = pd.to_datetime(atus_perc.index, format='%Y')
# Plot these as a function of time
fig, ax = plt.subplots(figsize=(10,5)) 
ax.plot(atus_perc.index, atus_perc['perc_alone'],          # line plot of perc_alone vs. time
        color='blue'                                       # set the line color to blue
       )  
#ax.set_xlabel('year')        # obvious
#ax.set_ylabel('percent')     # obvious with PercentFormatter                                
ax.set_title('Percent of Nonwork Time Spent Alone')
sns.despine(ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter()) # formats percentages

plt.show()

*Already, this figure tells a story.* Time spent alone was steadily increasing before the pandemic and increased sharply in 2020, at the onset of the pandemic.

Let's take a look at time spent on childcare.

In [None]:
atus_resp['perc_child']=100*atus_resp['childcare']/atus_resp['nonwork']
atus_perc_child = atus_resp[['year','perc_child']].groupby('year').mean()

In [None]:
# Convert the index to datetime
atus_perc_child.index = pd.to_datetime(atus_perc_child.index, format='%Y')
# Plot these as a function of time
fig, ax = plt.subplots(figsize=(10,5)) 
ax.plot(atus_perc_child.index, atus_perc_child['perc_child'],          # line plot of perc_child vs. time
        color='blue'                                       # set the line color to blue
       )  
#ax.set_xlabel('year')        # obvious
#ax.set_ylabel('percent')     # obvious with PercentFormatter                                
ax.set_title('Percent of Nonwork Time Spent on Childcare')
sns.despine(ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter()) # formats percentages

plt.show()

This steady drop is not what I expected. It would be interesting to break this down by mothers and fathers.

## 5. Revisit and revise preliminary questions

Let's focus on the alone time trend and dig deeper to see how it differs across subgroups. We will compare those with and without a Bachelor's degree. Let's grab some basic demographic data for respondents.

Notice that this information comes from a different dataset on the [ATUS website](https://www.bls.gov/tus/data.htm) with more mysterious names... **Reading the data documentation is crucial!!**

In [None]:
variables2 = {'TUCASEID':'id', 'TUYEAR':'year','TEAGE':'age', 'TESEX':'sex', 'PTDTRACE':'race', 'PEEDUCA':'edu'}
atus_sum = pd.read_csv('atussum_0323.dat', usecols=variables2.keys())
atus_sum.rename(columns=variables2, inplace=True)
atus_sum.head()

Looks reasonable. Now, we have to merge the demographic data with the time use data. We'll review merging in some detail soon for those of you who are a bit rusty.

In [None]:
# Merge respondent data with summary data on id and year
atus_comb = pd.merge(how='left', left=atus_resp, right=atus_sum,
                     left_on=['id', 'year'], right_on=['id', 'year'],
                     indicator=False)
atus_comb.head()
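
Before trusting a merge, it can help to confirm that the keys are unique and to see how many rows found a match. The sketch below uses made-up frames (`left`, `right`, and `checked` are hypothetical names) with the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Made-up stand-ins for the two ATUS files.
left = pd.DataFrame({'id': [1, 2, 3], 'year': [2019, 2019, 2020],
                     'alone': [300, 500, 400]})
right = pd.DataFrame({'id': [1, 2], 'year': [2019, 2019],
                      'age': [34, 61]})

# validate raises an error if (id, year) is duplicated on either side;
# indicator adds a _merge column recording where each row was found.
checked = pd.merge(left, right, how='left', on=['id', 'year'],
                   validate='one_to_one', indicator=True)
print(checked['_merge'].value_counts())
```

Rows marked `left_only` had no demographic record, so their `age` is missing. Whether the real files contain such rows is something to check rather than assume.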

## Sample selection and subgroup selection

We only want adults, and we want to split the analysis by education. The categories of 'edu' reveal that 'edu' >= 43 represents those with a bachelor's degree or higher.

In [None]:
# Subset the DataFrame to include only adults.
atus_comb = atus_comb[atus_comb['age'] >= 18]
# Create a Bachelor's degree indicator
atus_comb = atus_comb.assign(bach = atus_comb['edu'] >=43)
atus_comb.head()

In [None]:
# Re-calculate the means for adult subsample
atus_perc =atus_comb[['year','perc_alone']].groupby('year').mean()
atus_perc.head()

Let's use our demographic data to calculate means for different groups. In particular, we're grouping by the year and bachelor's degree.

In [None]:
# Split by bachelor's degree
atus_percbach = atus_comb[['year', 'bach','perc_alone']].groupby(['year', 'bach']).mean()
atus_percbach = atus_percbach.unstack()
atus_percbach.columns = atus_percbach.columns.droplevel(0)
atus_percbach = atus_percbach.rename_axis(None, axis=1)
renamer = {False:'perc_alone_nodeg', True:'perc_alone_deg'}
atus_percbach.rename(columns=renamer, inplace=True)
atus_percbach.head()
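
The group-then-unstack steps above can also be written with `pivot_table`, which some readers find easier to follow. This is a sketch on made-up numbers (`toy` and `wide` are hypothetical names), not a replacement for the cell above.

```python
import pandas as pd

# Made-up respondent-level data.
toy = pd.DataFrame({
    'year': [2019, 2019, 2019, 2020, 2020, 2020],
    'bach': [False, False, True, False, True, True],
    'perc_alone': [50.0, 54.0, 40.0, 60.0, 44.0, 48.0],
})

# One row per year, one column per value of bach, cells hold the group mean.
wide = toy.pivot_table(index='year', columns='bach',
                       values='perc_alone', aggfunc='mean')
wide = wide.rename(columns={False: 'perc_alone_nodeg', True: 'perc_alone_deg'})
wide = wide.rename_axis(None, axis=1)
print(wide)
```

Both routes should give the same table, so the choice is a matter of taste.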

## Plotting and interpreting our main results

Before we plot the results, there is one catch: data collection in 2020 was [briefly suspended](https://www.bls.gov/tus/data/datafiles-2020.htm) due to the pandemic! But this is a pivotal year for our analysis... How do we deal with this issue? Unfortunately, this is a common situation: social scientists (economists included!) often don't have access to perfectly curated data. It is worth referring back to [the FRB article](https://www.philadelphiafed.org/-/media/frbp/assets/economy/articles/economic-insights/2023/q4/eiq423-time-use-before-during-and-after-the-pandemic.pdf) for sensible applications of two common methods of dealing with missing data:

1. **Imputation:** Fill in the missing data with non-missing data.
2. **Estimation using another data source:** In this case, the author uses Google Trends Mobility Data.

In our calculations, we simply ignored the missing data and calculated percent of alone time in 2020 using the months in which data were available. You could argue that this is equivalent to "imputing" the share of time spent alone in the missing period with the average share in the non-missing period. In any case, let's recognize these limitations of how these results were constructed and simply move forward for the sake of time. We will discuss imputation in some more detail later this semester.
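
The claimed equivalence can be seen in a small numeric sketch. The monthly shares below are made up, and weighting each month equally is a simplifying assumption (the calculation in this notebook averages over respondents, not months).

```python
import numpy as np
import pandas as pd

# Made-up monthly shares for one year; two months were not collected.
monthly = pd.Series(
    [50.0, 51.0, 52.0, np.nan, np.nan, 56.0, 55.0, 54.0, 53.0, 53.0, 52.0, 52.0],
    index=range(1, 13), name='perc_alone')

# Option 1: ignore the missing months (pandas skips NaN by default).
mean_ignoring = monthly.mean()

# Option 2: fill the missing months with the mean of the observed months.
mean_imputed = monthly.fillna(monthly.mean()).mean()

print(mean_ignoring, mean_imputed)
```

The two numbers agree up to floating-point rounding, which is why ignoring the gap can be read as a simple form of mean imputation. It also shows the limitation: if the missing months were unusual, neither number would capture that.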

Because this is our final plot, let's take care to label and represent each time series clearly in one convenient plot.

In [None]:
# Convert the index to datetime
atus_perc.index = pd.to_datetime(atus_perc.index, format='%Y')
atus_percbach.index = pd.to_datetime(atus_percbach.index, format='%Y')

# Plot all three series as a function of time
# (I've decided that automating is more trouble than it's worth here.)
fig, ax = plt.subplots(figsize=(10,5)) 
ax.plot(atus_perc.index, atus_perc['perc_alone'],
        label='All Adults', color='black'
       )  
ax.plot(atus_percbach.index, atus_percbach['perc_alone_nodeg'],
        label="Adults without a Bachelor's degree", color='red'
       )  
ax.plot(atus_percbach.index, atus_percbach['perc_alone_deg'],
        label="Adults with a Bachelor's degree", color='blue'
       )

# Locate and label each series (x coordinate is a datetime object)
ax.text(dt.date(2022,9,1), 55.2, 'Adults without a Bachelor\'s degree')
ax.text(dt.date(2022,9,1), 52.7, 'All Adults')
ax.text(dt.date(2022,9,1), 50, 'Adults with a Bachelor\'s degree')

ax.set_title('Percent of Nonwork Time Spent Alone')
sns.despine(ax=ax)
ax.yaxis.set_major_formatter(mtick.PercentFormatter()) # formats percentages

plt.show()
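
If the hand-placed coordinates ever become a burden, one possible way to automate the labels is to anchor each one at the last point of its line. This is only a sketch on made-up numbers (`toy` and `labels` are hypothetical names), and labels placed this way can overlap when series end close together.

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly series standing in for the lines in the figure.
years = pd.to_datetime(['2021', '2022', '2023'], format='%Y')
toy = pd.DataFrame({'All Adults': [51.0, 51.5, 52.5],
                    "Adults with a Bachelor's degree": [48.0, 48.5, 50.0]},
                   index=years)

fig, ax = plt.subplots(figsize=(10, 5))
labels = []
for name in toy.columns:
    ax.plot(toy.index, toy[name])
    # Anchor the label at the last observation, nudged 5 points to the right.
    txt = ax.annotate(name, xy=(toy.index[-1], toy[name].iloc[-1]),
                      xytext=(5, 0), textcoords='offset points', va='center')
    labels.append(txt)
```

For a single final figure with three lines, typing the coordinates by hand as above is a perfectly reasonable choice.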

## 6. Final analysis

At this point, you would take time to enrich and formalize the analysis, and test robustness.

We might also want to construct other interesting measures of time use (work from home, childcare) and compare more subgroups (gender, race). With additional results, we might be able to tell a more complete story as in the FRB report.

## 7. Report your findings

Time to collect our findings: 

1. Alone time has increased considerably since 2003.
2. These increases were much greater for those without a college degree.
3. Alone time peaked in 2020 (coinciding with the first/second wave of the pandemic), stabilized during 2020-2022, and increased again in 2023.

Now it is time to choose which figure(s) we want to use and how we want to make them look.

1. **Message:** Share of time spent alone was already increasing in the lead-up to the pandemic. 
2. **Audience:** If the audience is non-experts, I would not focus too much on technical details.
3. **Medium:** Presentation or online distribution. A few colors are just fine. The figure is relatively sparse, so this could be augmented easily if the medium changes.
