## Summary notes

### About the project

- TidyTuesday reference
   - [2021-09-28, Economic records](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021)
- Data source
   - [NBER](https://www2.nber.org/RePEc/nbr/nberwo/)
- Website reference
   - [Female representation and collaboration at the NBER](https://bldavies.com/blog/female-representation-collaboration-nber/)
- Libraries used
   - *Ibis* (*SQLite3*)
   - *Seaborn*

### To-do list

- [X] DataIO
   - temporary database name = *nber*
   - data stored in *paper* table
- [ ] Data exploration
   - [X] preview
   - [X] shape
   - [X] info (col, dtype, *nmissing*, *%missing*)
   - [X] *nunique*, *%unique*
- [ ] Data preprocessing
- [ ] Data transformation
- [ ] Data visualisation

### History

- 2022-10-08
   - Project chosen
   - Initial exploration

## Dependencies

In [1]:
import ibis
from sqlalchemy import create_engine
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from seaborn import objects as so
import laughingrook as lr
from sqlalchemy.engine.base import Engine  # for function typing

In [2]:
%load_ext watermark
%watermark --iv

pandas      : 1.5.0
seaborn     : 0.12.0
laughingrook: 0.2.0
ibis        : 3.2.0
matplotlib  : 3.6.1



## Main

In [3]:
%precision 3
ibis.options.interactive = True
sns.set_theme()

### DataIO

In [4]:
url = ('https://raw.githubusercontent.com/rfordatascience/tidytuesday/'
       + 'master/data/2021/2021-09-28/combo_df.csv')
dbname = 'nber'  # no need for .db
table = 'paper'

# construct the path
dbpath = f'__cache/{dbname}.db'
# spin up a local db (or connect to it if it already exists)
engine = create_engine(f'sqlite:///{dbpath}', echo=False)
# cache the csv file
local_path = lr.dataio.cache_url(url)
# read the file into the database
try:
    (pd.read_csv(local_path)  # barebones read_csv call, may need other args
     .to_sql(table, engine, if_exists='fail', index=False)
    )
except ValueError:
    print(f'No data written because {table} exists in {dbname}')

No data written because paper exists in nber
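The `if_exists='fail'` argument above keeps the load idempotent by relying on the `ValueError` raised when the table already exists. A minimal sketch of the same idea using only the standard library, which checks `sqlite_master` up front instead of catching the exception. The helper name `load_once`, the in-memory database, and the two sample rows are illustrative assumptions and not part of the project.

```python
import csv
import io
import sqlite3


def load_once(conn, table, csv_text):
    """Create `table` from CSV text unless it already exists.

    Returns True if rows were written, False if the table was already there.
    """
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if exists:
        return False
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    # untyped columns are valid in SQLite; values are stored as text here
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', body)
    conn.commit()
    return True


# hypothetical two-row sample standing in for combo_df.csv
sample = "paper,year\nw0001,1973\nw0002,1973\n"
conn = sqlite3.connect(":memory:")
first = load_once(conn, "paper", sample)   # table is created
second = load_once(conn, "paper", sample)  # table exists, nothing written
n_rows = conn.execute("SELECT COUNT(*) FROM paper").fetchone()[0]
```

Checking first avoids parsing the CSV at all on repeat runs, whereas the `try`/`except` version reads the whole file before discovering that the table exists.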


Initialise the *ibis* connection and get a direct reference to the *paper* table.

In [5]:
nber = ibis.sqlite.connect(dbpath)
paper = nber.table('paper')

### Data exploration

In [6]:
paper.limit(8)

Unnamed: 0,paper,catalogue_group,year,month,title,author,name,user_nber,user_repec,program,program_desc,program_category
0,w0001,General,1973,6,"Education, Information, and Efficiency",w0001.1,Finis Welch,finis_welch,,,,
1,w0002,General,1973,6,Hospital Utilization: An Analysis of SMSA Diff...,w0002.1,Barry R Chiswick,barry_chiswick,pch425,,,
2,w0003,General,1973,6,Error Components Regression Models and Their A...,w0003.1,Swarnjit S Arora,swarnjit_arora,,,,
3,w0004,General,1973,7,Human Capital Life Cycle of Earnings Models: A...,w0004.1,Lee A Lillard,,pli669,,,
4,w0005,General,1973,7,A Life Cycle Family Model,w0005.1,James P Smith,james_smith,psm28,,,
5,w0006,General,1973,7,A Review of Cyclical Indicators for the United...,w0006.1,Victor Zarnowitz,victor_zarnowitz,,,,
6,w0007,General,1973,8,The Definition and Impact of College Quality,w0007.1,Lewis C Solmon,,,,,
7,w0008,General,1973,9,Multinational Firms and the Factor Intensity o...,w0008.1,Merle Yahr Weiss,,,,,


In [7]:
paper.count(), len(paper.columns)

(130081, 12)

In [8]:
paper.schema()

ibis.Schema {
  paper             string
  catalogue_group   string
  year              int64
  month             int64
  title             string
  author            string
  name              string
  user_nber         string
  user_repec        string
  program           string
  program_desc      string
  program_category  string
}

In [9]:
paper.info()

                        Summary of paper
                          130081 rows
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Name             ┃ Type                  ┃ # Nulls ┃ % Nulls ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ paper            │ String(nullable=True) │       0 │    0.00 │
│ catalogue_group  │ String(nullable=True) │       0 │    0.00 │
│ year             │ Int64(nullable=True)  │       0 │    0.00 │
│ month            │ Int64(nullable=True)  │       0 │    0.00 │
│ title            │ String(nullable=True) │       0 │    0.00 │
│ a

:::{.callout-note}
Potential recipe
:::

In [10]:
# nunique values by column
{col: paper[col].nunique().execute()
 for col in paper.columns}

{'paper': 29434,
 'catalogue_group': 3,
 'year': 49,
 'month': 12,
 'title': 29419,
 'author': 15437,
 'name': 15398,
 'user_nber': 14246,
 'user_repec': 5455,
 'program': 21,
 'program_desc': 21,
 'program_category': 3}

:::{.callout-note}
Potential recipe
:::

In [11]:
# %unique vals by column
{col: (paper[col].nunique() / paper.count()).mul(100).round(1)
 for col in paper.columns}

{'paper': 22.6,
 'catalogue_group': 0.0,
 'year': 0.0,
 'month': 0.0,
 'title': 22.6,
 'author': 11.9,
 'name': 11.8,
 'user_nber': 11.0,
 'user_repec': 4.2,
 'program': 0.0,
 'program_desc': 0.0,
 'program_category': 0.0}
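Both recipes issue one query per column. Because the backend is SQLite, the distinct counts and their percentages could also be gathered in a single pass with `COUNT(DISTINCT ...)`. The sketch below uses the standard-library `sqlite3` module on a small in-memory stand-in table. The rows and the helper name `unique_summary` are made up for illustration.

```python
import sqlite3


def unique_summary(conn, table, columns):
    """Return {column: (n_unique, pct_unique)} from one aggregate query."""
    select = ", ".join(f'COUNT(DISTINCT "{c}")' for c in columns)
    row = conn.execute(f'SELECT COUNT(*), {select} FROM "{table}"').fetchone()
    total, counts = row[0], row[1:]
    # percentage of rows that carry a distinct value, rounded to one decimal
    return {c: (n, round(100 * n / total, 1)) for c, n in zip(columns, counts)}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paper (paper TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO paper VALUES (?, ?)",
    [("w0001", 1973), ("w0001", 1973), ("w0002", 1973), ("w0003", 1974)],
)
summary = unique_summary(conn, "paper", ["paper", "year"])
```

Note that `COUNT(DISTINCT ...)` skips NULLs, so a column with many missing values reports fewer distinct values than a count that treats NULL as its own category.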

### Preprocessing

In [36]:
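def preprocess(tbl):
    """Return tbl with an extra decade column derived from year."""
    # bucket each year into a decade label; anything after 2019 is '2020s'
    decade = (
        ibis.case()
        .when(tbl['year'] <= 1979, '1970s')
        .when(tbl['year'] <= 1989, '1980s')
        .when(tbl['year'] <= 1999, '1990s')
        .when(tbl['year'] <= 2009, '2000s')
        .when(tbl['year'] <= 2019, '2010s')
        .else_('2020s')
        .end()
        .name('decade')
    )

    # keep every original column and append the new one
    return tbl[tbl.columns + [decade]]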


proc_paper = paper.pipe(preprocess)

29434
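The `case` ladder in `preprocess` needs a new branch every ten years. For years inside the range the data covers, the same label follows from integer division. Below is a plain-Python sketch of that arithmetic, checked against the ladder. The function names are hypothetical, and the 1973 to 2021 span is an assumption inferred from the first preview year and the 49 distinct years reported earlier.

```python
def decade_ladder(year):
    """Mirror of the case expression in preprocess()."""
    bounds = [(1979, "1970s"), (1989, "1980s"), (1999, "1990s"),
              (2009, "2000s"), (2019, "2010s")]
    for upper, label in bounds:
        if year <= upper:
            return label
    return "2020s"


def decade_arith(year):
    """Label a year by flooring it to the start of its decade."""
    return f"{year // 10 * 10}s"


years = range(1973, 2022)  # assumed span of the data
agree = all(decade_ladder(y) == decade_arith(y) for y in years)
```

The two versions differ outside that span: the ladder clamps earlier years to `'1970s'` and later years to `'2020s'`, whereas the arithmetic version produces whatever decade the year falls in.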

### Analysis

In [38]:
unique_papers = paper['paper'].nunique()
(proc_paper.group_by('decade')
 .aggregate(
     [proc_paper['paper'].nunique().name('#papers'),
      proc_paper['author'].nunique().name('#authors'),
      (proc_paper['paper'].nunique() / unique_papers).name('%papers')]
 )
)

AssertionError: num_froms == 2
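One plausible reading of this error is that `unique_papers` is an expression built on `paper`, while the aggregation runs on `proc_paper`, so the compiled query ends up with two source tables. Assuming that is the cause, a common workaround is to materialise the overall total as a plain number first and pass it into the grouped query as a constant. In ibis that would presumably mean calling `.execute()` on `paper['paper'].nunique()` before using it inside `aggregate`, which is untested here. The sketch below shows the two-step pattern with the standard-library `sqlite3` module on a made-up stand-in table. It also multiplies by 100, which the original expression does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paper (paper TEXT, author TEXT, decade TEXT)")
conn.executemany(
    "INSERT INTO paper VALUES (?, ?, ?)",
    [("w0001", "w0001.1", "1970s"),   # hypothetical rows, one per paper-author pair
     ("w0002", "w0002.1", "1970s"),
     ("w0002", "w0002.2", "1970s"),
     ("w0003", "w0003.1", "1980s")],
)

# step 1: materialise the overall total as a plain Python number
total = conn.execute("SELECT COUNT(DISTINCT paper) FROM paper").fetchone()[0]

# step 2: the grouped query touches a single table; the total is a bound constant
rows = conn.execute(
    """
    SELECT decade,
           COUNT(DISTINCT paper)  AS n_papers,
           COUNT(DISTINCT author) AS n_authors,
           COUNT(DISTINCT paper) * 100.0 / ? AS pct_papers
    FROM paper
    GROUP BY decade
    ORDER BY decade
    """,
    (total,),
).fetchall()
```

The cost is one extra round trip to the database, which is negligible for a scalar count.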