# Employment

## BLS: Current Population Survey (CPS)

**Survey Details from BLS:**   

The Current Population Survey (CPS) is a monthly survey of households conducted by the Bureau of Census for the Bureau of Labor Statistics. It provides a comprehensive body of data on the labor force, employment, unemployment, persons not in the labor force, hours of work, earnings, and other demographic and labor force characteristics.

More information about the survey is available [here](https://www.bls.gov/cps/), and its concepts and definitions are explained on [this page](https://www.bls.gov/cps/definitions.htm).

In [1]:
import os
import pandas as pd
import numpy as np

# path for the folder "project"
path = "C:\\Users\\pedro\\OneDrive\\NYU\\CSS\\II. Data Skills\\project"
os.chdir(path)

Because the objective is to import the whole survey, the BLS API, which serves requests for selected series, is not well suited to the task. This notebook shows how to import the .txt files directly from BLS.

As described in the survey webpage:

> Text files allow data users to retrieve large amounts of data with one selection. These datasets are highly suited for statistical software that manipulate large datasets. The files are arranged first on a broad industry base, followed by industry specific groups. In addition to data files, a series of mapping files are also available and provide a means for identifying all variables in the data files.


## Importing Data

Import the full `CPS (Household Survey)` data file directly from [BLS](https://download.bls.gov/pub/time.series/ln/) and save it as a .parquet file.

In [2]:
CPS = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.data.1.AllData", delimiter="\t",
                  dtype={3:str, 4:str})
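
If the direct `pd.read_csv` call on the URL is rejected, one possible workaround is to download the text with an explicit `User-Agent` header and parse it afterwards. This is a minimal sketch under that assumption: whether the BLS server requires such a header should be checked against current BLS guidance, and the helper names `fetch_text` and `parse_bls_tsv` are hypothetical, not part of the original notebook.

```python
import io
import urllib.request

import pandas as pd


def fetch_text(url, user_agent):
    """Download a text file, sending an explicit User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def parse_bls_tsv(text):
    """Parse tab-delimited BLS text, reading every column as a string first."""
    frame = pd.read_csv(io.StringIO(text), delimiter="\t", dtype=str)
    # Header names in the raw file are padded with spaces.
    frame.columns = frame.columns.str.strip()
    frame["year"] = frame["year"].astype(int)
    return frame


# Tiny sample in the same layout as ln.data.1.AllData (padded header and ids).
sample = (
    "series_id        \tyear\tperiod\t       value\tfootnote_codes\n"
    "LNS11000000      \t1948\tM01\t       60095\t\n"
    "LNS11000000      \t1948\tM02\t       60524\t\n"
)
sample_frame = parse_bls_tsv(sample)
```

With network access, the two helpers would be combined as `parse_bls_tsv(fetch_text(url, "your-name your-email"))`, where the contact string is a placeholder.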

In [3]:
CPS.info()
CPS.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7854264 entries, 0 to 7854263
Data columns (total 5 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   series_id          object
 1   year               int64 
 2   period             object
 3          value       object
 4   footnote_codes     object
dtypes: int64(1), object(4)
memory usage: 299.6+ MB


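|   | series_id | year | period | value | footnote_codes |
|---|-----------|------|--------|-------|----------------|
| 0 | LNS11000000 | 1948 | M01 | 60095 | |
| 1 | LNS11000000 | 1948 | M02 | 60524 | |
| 2 | LNS11000000 | 1948 | M03 | 60070 | |
| 3 | LNS11000000 | 1948 | M04 | 60677 | |
| 4 | LNS11000000 | 1948 | M05 | 59972 | |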


Next, the data is cleaned and reshaped:

In [4]:
# Eliminating empty spaces at the beginning and end of column names:
CPS.columns = CPS.columns.str.strip()

# The same, but now for the values of "series_id":
CPS["series_id"] = CPS["series_id"].str.strip()

# Changing the value column dtype:
CPS["value"] = pd.to_numeric(CPS["value"], errors='coerce')

# Keeping only monthly data, creating "date" column and dropping "footnote_codes"
months = ["M01", "M02","M03","M04","M05","M06","M07","M08","M09","M10","M11","M12"]
CPS = CPS[CPS["period"].isin(months)].drop(columns = "footnote_codes")
CPS["period"] = CPS["period"].str.replace("M","")
CPS["date"] = CPS["year"].astype(str)+"-"+CPS["period"]+"-1"
CPS["date"] = pd.to_datetime(CPS["date"])

# Keeping only the relevant variables, in the desired order:
CPS = CPS[["series_id","date","value"]]
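
The wrangling steps above can be tried on a small frame before running them on millions of rows. The sketch below is an illustration rather than the notebook's own code: `tidy_monthly` is a hypothetical helper, the sample rows are hand-built, and it swaps the explicit month list for a regular expression. Period `M13` is the code BLS uses for annual averages, which is why it is filtered out.

```python
import pandas as pd

# Hand-built sample: padded ids, one annual-average row (M13), one bad value.
raw = pd.DataFrame({
    "series_id": ["LNS11000000  ", "LNS11000000  ", "LNS11000000  ", "LNS11000000  "],
    "year": [1948, 1948, 1948, 1948],
    "period": ["M01", "M02", "M03", "M13"],
    "value": ["60095", "60524", "-", "60621"],
})


def tidy_monthly(frame):
    """Return series_id, date, value for monthly observations only."""
    out = frame.copy()
    out["series_id"] = out["series_id"].str.strip()
    # Non-numeric entries become NaN instead of raising.
    out["value"] = pd.to_numeric(out["value"], errors="coerce")
    # Keep M01..M12; any other period code is dropped.
    out = out[out["period"].str.fullmatch(r"M(0[1-9]|1[0-2])")].copy()
    out["date"] = pd.to_datetime(
        out["year"].astype(str) + "-" + out["period"].str[1:] + "-01",
        format="%Y-%m-%d",
    )
    return out[["series_id", "date", "value"]].reset_index(drop=True)


tidy = tidy_monthly(raw)
```

Passing `format` to `pd.to_datetime` avoids per-row format inference, which can matter on a frame of this size.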

In [6]:
CPS.info()
CPS.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5549966 entries, 0 to 7854262
Data columns (total 3 columns):
 #   Column     Dtype         
---  ------     -----         
 0   series_id  object        
 1   date       datetime64[ns]
 2   value      float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 169.4+ MB


|   | series_id | date | value |
|---|-----------|------|-------|
| 0 | LNS11000000 | 1948-01-01 | 60095.0 |
| 1 | LNS11000000 | 1948-02-01 | 60524.0 |
| 2 | LNS11000000 | 1948-03-01 | 60070.0 |
| 3 | LNS11000000 | 1948-04-01 | 60677.0 |
| 4 | LNS11000000 | 1948-05-01 | 59972.0 |


In [7]:
# saving as parquet file:
CPS.to_parquet("data\\employment\\data_bls_cps.parquet")
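
The hard-coded backslash path only works on Windows and assumes the `data\employment` folder already exists. A possible alternative, sketched here with a hypothetical helper named `output_path`, builds the path with `pathlib` and creates missing folders first.

```python
from pathlib import Path


def output_path(root, *parts):
    """Build root/parts... and make sure the parent folder exists."""
    target = Path(root).joinpath(*parts)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target
```

It could then be used as `CPS.to_parquet(output_path("data", "employment", "data_bls_cps.parquet"))`. Note that `to_parquet` needs a parquet engine such as `pyarrow` or `fastparquet` installed.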

### Building Dictionary

In [8]:
# importing dictionary
series = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.series", delimiter="\t")
series = series[series["periodicity_code"] == "M"].drop(columns = "periodicity_code")

# importing mapping files
# labor force status: in labor force, employed, unemployed...
lfst = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.lfst", delimiter="\t")
# absence: Paid absence, unpaid absence...
absn = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.absn", delimiter="\t")
# study activities: Enrolled in: School, High School...
activity = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.activity", delimiter="\t")
# age group:
ages = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.ages", delimiter="\t")
# year of birth:
born = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.born", delimiter="\t")
# with or without certification or license
cert = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.cert", delimiter="\t")
# Children status:
chld = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.chld", delimiter="\t")
# disability status: with or without a disability
disa = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.disa", delimiter="\t")
# duration in weeks
duration = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.duration", delimiter="\t")
# education level
education = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.education", delimiter="\t")
# reentrants/new entrants
entr = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.entr", delimiter="\t")
# experience
expr = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.expr", delimiter="\t")
# Family heads
hheader = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.hheader", delimiter="\t")
# hours worked
hour = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.hour", delimiter="\t")
# industry
indy = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.indy", delimiter="\t")
# want job?
jdes = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.jdes", delimiter="\t")
# reason why lost job
look = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.look", delimiter="\t")
# marital status
maried = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.mari", delimiter="\t")
# multiple jobholder status
mjhs = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.mjhs", delimiter="\t")
# occupation
occupation = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.occupation", delimiter="\t")
# Region origins: 
origin = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.orig", delimiter="\t")
# labor force status ratios: 
percent = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.pcts", delimiter="\t")
# Race
race = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.race", delimiter="\t")
# reasons why not working
reason_nw = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.rjnw", delimiter="\t")
# reasons why not in labor force
reason_nlf = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.rnlf", delimiter="\t")
# reasons why not searching for work
reason_unemp = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.rwns", delimiter="\t")
# looking for job?
seek = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.seek", delimiter="\t")
# gender
sex = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.sexs", delimiter="\t")
#data unit: Number in thousands, percent ...
tdat = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.tdat", delimiter="\t")
# veterans classification 
vets = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.vets", delimiter="\t")
#work status: full-time, part-time...
wkst = pd.read_csv("https://download.bls.gov/pub/time.series/ln/ln.wkst", delimiter="\t")
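
Since every mapping file follows the same `ln.<name>` URL pattern, the repeated calls above could also be written as a loop. The sketch below is one way to do it; `load_mappings`, `BASE_URL` and `MAPPING_NAMES` are hypothetical names, and the `reader` parameter exists only so the function can be exercised without network access.

```python
import pandas as pd

BASE_URL = "https://download.bls.gov/pub/time.series/ln/"

# A few of the mapping files used above; extend the list as needed.
MAPPING_NAMES = ["lfst", "absn", "activity", "ages"]


def load_mappings(names, reader=pd.read_csv):
    """Return {name: DataFrame} for each tab-delimited ln.<name> mapping file."""
    return {
        name: reader(f"{BASE_URL}ln.{name}", delimiter="\t")
        for name in names
    }
```

Calling `mappings = load_mappings(MAPPING_NAMES)` would download the files, and `mappings["lfst"]` would then hold the labor force status table.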

In [9]:
# class of worker:
clss = pd.DataFrame(
    {"class_code":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],
     "class_text":[np.nan,
                   "Wage and salary workers",
                   "Private wage and salary workers",
                   "Government wage and salary workers",
                   "Federal wage and salary workers",
                   "State wage and salary workers",
                   "Local wage and salary workers",
                   "State and local wage and salary workers",
                   "Self-employed workers, unincorporated",
                   "Unpaid family workers",
                   "All classes of workers (1, 8, and 9)",
                   "Nonagriculture government, self employed, and unpaid family worker (3, 8, and 9 above)",
                   "Self-employed unincorporated, and unpaid family workers (8 and 9)",
                   "Wage and salary and self-employed workers ('paid' workers-- 1 and 8)",
                   "Incorporated self-employed",
                   "Other",
                   "Wage and salary workers, excluding incorporated self employed",
                   "Private wage and salary workers, excluding incorporated self employed"]})

In [16]:
# Merge every mapping table into the series table. With no `on` argument,
# pd.merge joins on the columns the two frames share (each mapping's *_code).
mappings = [lfst, absn, activity, ages, cert, clss, duration, education,
            entr, expr, hheader, hour, indy, jdes, look, maried, mjhs,
            occupation, origin, percent, race, reason_nw, reason_nlf,
            reason_unemp, seek, sex, tdat, vets, wkst, born, chld, disa]
cps_dict = series
for mapping in mappings:
    cps_dict = pd.merge(cps_dict, mapping, how = "left")

# removing blank space of column names
cps_dict.columns = cps_dict.columns.str.strip()
cps_dict["series_id"] = cps_dict["series_id"].str.strip()
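
To see what these left merges do, here is a toy version with two mapping tables. The frames and their code values are made up for illustration (only the text labels come from the output shown in this notebook), and the `validate="many_to_one"` argument is an addition that makes pandas raise an error if a mapping table ever contains duplicate codes.

```python
import pandas as pd

# Toy stand-ins for ln.series and two mapping files; codes are illustrative.
toy_series = pd.DataFrame({
    "series_id": ["LNS11000000", "LNS11000001"],
    "lfst_code": [10, 10],
    "sexs_code": [0, 1],
})
toy_lfst = pd.DataFrame({
    "lfst_code": [10, 20],
    "lfst_text": ["Civilian labor force", "Employed"],
})
toy_sexs = pd.DataFrame({
    "sexs_code": [0, 1],
    "sexs_text": ["Both Sexes", "Men"],
})

merged = toy_series
for mapping in (toy_lfst, toy_sexs):
    # No `on` given: the join key is the column shared by both frames.
    merged = pd.merge(merged, mapping, how="left", validate="many_to_one")
```

Each series row keeps its place and gains one `*_text` column per mapping, which is the shape of the dictionary built above.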

In [17]:
# excluding codes.
cps_dict = cps_dict.drop(
    columns = ["lfst_code", 'absn_code', 'activity_code','ages_code',
                'cert_code', 'class_code', 'duration_code','education_code',
                'entr_code', 'expr_code', 'hheader_code', 'hour_code','indy_code',
                'jdes_code', 'look_code', 'mari_code', 'mjhs_code','occupation_code',
                'orig_code', 'pcts_code', 'race_code', 'rjnw_code','rnlf_code',
                'rwns_code', 'seek_code', 'sexs_code', 'tdat_code','vets_code',
                'wkst_code', 'born_code', 'chld_code', 'disa_code','footnote_codes',
                'begin_year', 'begin_period', 'end_year','end_period'])
cps_dict.head(2)

|   | series_id | series_title | seasonal | lfst_text | absn_text | activity_text | ages_text | cert_text | class_text | duration_text | ... | rnlf_text | rwns_text | seek_text | sexs_text | tdat_text | vets_text | wkst_text | born_text | chld_text | disa_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LNS11000000 | (Seas) Civilian Labor Force Level | S | Civilian labor force | | | 16 years and over | | | | ... | | | | Both Sexes | Number in thousands | | | | | |
| 1 | LNS11000001 | (Seas) Civilian Labor Force Level - Men | S | Civilian labor force | | | 16 years and over | | | | ... | | | | Men | Number in thousands | | | | | |


In [18]:
# saving dictionary as parquet file: 
cps_dict.to_parquet("data\\employment\\dict_bls_cps.parquet")
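
Once both files are saved, the dictionary can be used to find series ids by title and pull the matching observations from the data. The sketch below shows the idea on toy frames; `series_by_title` is a hypothetical helper, and the value for `LNS11000001` is a placeholder rather than a real observation.

```python
import pandas as pd

toy_dict = pd.DataFrame({
    "series_id": ["LNS11000000", "LNS11000001"],
    "series_title": [
        "(Seas) Civilian Labor Force Level",
        "(Seas) Civilian Labor Force Level - Men",
    ],
    "sexs_text": ["Both Sexes", "Men"],
})
toy_data = pd.DataFrame({
    "series_id": ["LNS11000000", "LNS11000000", "LNS11000001"],
    "date": pd.to_datetime(["1948-01-01", "1948-02-01", "1948-01-01"]),
    "value": [60095.0, 60524.0, 1.0],  # last value is a placeholder
})


def series_by_title(dictionary, data, pattern):
    """Return the observations whose series title contains `pattern`."""
    matches = dictionary["series_title"].str.contains(pattern, regex=False)
    ids = dictionary.loc[matches, "series_id"]
    return data[data["series_id"].isin(ids)]


men = series_by_title(toy_dict, toy_data, "- Men")
```

The same function would work on the full frames loaded back with `pd.read_parquet`.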