# Employment

## BLS: Current Employment Statistics (CES)

**Survey Details from BLS:**                                                                    
The Current Employment Statistics (CES) program provides estimates of employment, hours, and earnings information on a national basis and in considerable industry detail. The Bureau of Labor Statistics collects payroll data each month from a sample of business and government establishments in all nonfarm activities.

A sample of approximately 149,000 businesses and government agencies representing approximately 651,000 worksites throughout the United States is utilized for this monthly survey. The sample contains about 300,000 employer units.

Information about the survey can be found [here](https://download.bls.gov/pub/time.series/ce/ce.txt).

In [1]:
import os
import pandas as pd

# path for the folder "project"
path = "C:\\Users\\pedro\\OneDrive\\NYU\\CSS\\II. Data Skills\\project"
os.chdir(path)

Because the objective is to import the whole survey, the BLS API is not the right tool. Instead, this notebook imports the .txt files directly from BLS.

As described in the survey webpage:

> Text files allow data users to retrieve large amounts of data with one selection. These datasets are highly suited for statistical software that manipulate large datasets. The files are arranged first on a broad industry base, followed by industry specific groups. In addition to data files, a series of mapping files are also available and provide a means for identifying all variables in the data files.


## Importing Data

Importing the full CES data file (`ce.data.0.AllCESSeries`) directly from [BLS](https://download.bls.gov/pub/time.series/ce/), and saving it as a .parquet file:

In [2]:
CES = pd.read_csv("https://download.bls.gov/pub/time.series/ce/ce.data.0.AllCESSeries", delimiter="\t")

In [4]:
CES.info()
CES.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8185203 entries, 0 to 8185202
Data columns (total 5 columns):
 #   Column             Dtype  
---  ------             -----  
 0   series_id          object 
 1   year               int64  
 2   period             object 
 3          value       float64
 4   footnote_codes     object 
dtypes: float64(1), int64(1), object(3)
memory usage: 312.2+ MB


| | series_id | year | period | value | footnote_codes |
| --- | --- | --- | --- | --- | --- |
| 0 | CES0000000001 | 1939 | M01 | 29923.0 | |
| 1 | CES0000000001 | 1939 | M02 | 30100.0 | |
| 2 | CES0000000001 | 1939 | M03 | 30280.0 | |
| 3 | CES0000000001 | 1939 | M04 | 30094.0 | |
| 4 | CES0000000001 | 1939 | M05 | 30299.0 | |


Now, wrangling the data:

In [5]:
# Eliminating empty spaces at the beginning and end of column names:
CES.columns = CES.columns.str.strip()

# The same, but now for the values of "series_id":
CES["series_id"] = CES["series_id"].str.strip()

# Excluding the 'M13' period (annual average) and dropping the "footnote_codes" column:
CES = CES[CES["period"] != 'M13'].drop(columns = "footnote_codes")

# Building "date" column:
CES["period"] = CES["period"].str.replace("M","")
CES["date"] = CES["year"].astype(str) + "-" + CES["period"] + "-1"
CES["date"] = pd.to_datetime(CES["date"])

# Keeping only the relevant variables ("year" and "period" are no longer needed):
CES = CES[["series_id","date","value"]]
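
The wrangling steps above can be checked on a small in-memory sample before running them on the full file. The sketch below is only an illustration, assuming the raw file has the padded, tab-delimited layout suggested by the `info()` output. The helper name `tidy_ces` is hypothetical, the `M13` row is made up, and passing an explicit `format` to `pd.to_datetime` is a swapped-in choice rather than what the cell above does.

```python
import io

import pandas as pd

# Tiny stand-in for the BLS file: tab-delimited, with padded column names
# and padded series ids. The M13 value is invented for illustration.
SAMPLE = (
    "series_id        \tyear\tperiod\t       value\tfootnote_codes\n"
    "CES0000000001    \t1939\tM01\t29923\t\n"
    "CES0000000001    \t1939\tM02\t30100\t\n"
    "CES0000000001    \t1939\tM13\t30000\t\n"
)


def tidy_ces(raw: pd.DataFrame) -> pd.DataFrame:
    """Strip padding, drop M13 rows, and build a monthly date column."""
    df = raw.copy()
    df.columns = df.columns.str.strip()
    df["series_id"] = df["series_id"].str.strip()
    # M13 rows are not monthly observations, so they are excluded.
    df = df[df["period"] != "M13"].copy()
    # "M01" -> "01"; the explicit format avoids guessing how to parse the string.
    year_month = df["year"].astype(str) + "-" + df["period"].str[1:]
    df["date"] = pd.to_datetime(year_month, format="%Y-%m")
    return df[["series_id", "date", "value"]].reset_index(drop=True)


tidy = tidy_ces(pd.read_csv(io.StringIO(SAMPLE), delimiter="\t"))
print(tidy)
```

Wrapping the steps in a function keeps the raw frame untouched, which makes it easier to re-run the cleanup after changing a step.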

In [6]:
# saving as parquet file:
CES.to_parquet("data\\employment\\data_bls_ces.parquet")

### Building Dictionary

In [7]:
# importing data dictionary:
series = pd.read_csv("https://download.bls.gov/pub/time.series/ce/ce.series", delimiter="\t")

# importing mapping files:
data_type = pd.read_csv("https://download.bls.gov/pub/time.series/ce/ce.datatype", delimiter="\t")
sector = pd.read_csv("https://download.bls.gov/pub/time.series/ce/ce.supersector", delimiter="\t")
industry = pd.read_csv("https://download.bls.gov/pub/time.series/ce/ce.industry", delimiter="\t")

In [8]:
# removing leading and trailing blank spaces from column names
series.columns = series.columns.str.strip()
data_type.columns = data_type.columns.str.strip()
sector.columns = sector.columns.str.strip()
industry.columns = industry.columns.str.strip()
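
Since the same cleanup applies to every file, it could be factored into one helper. A minimal sketch, assuming a hypothetical function name `strip_whitespace` and a toy frame in place of the BLS files; it also strips string values, which goes slightly beyond the cell above:

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def strip_whitespace(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with padding removed from column names and text values."""
    out = df.copy()
    out.columns = out.columns.str.strip()
    for col in out.columns:
        # Only text columns are touched; this assumes object columns hold strings.
        if is_object_dtype(out[col]) or is_string_dtype(out[col]):
            out[col] = out[col].str.strip()
    return out


# Toy frame with padded names and values (padding added for illustration).
toy = pd.DataFrame(
    {
        " supersector_code ": [0, 5],
        "supersector_name   ": ["Total nonfarm  ", " Total private"],
    }
)
clean = strip_whitespace(toy)
print(clean.columns.tolist())
print(clean["supersector_name"].tolist())
```

The helper could then be applied in a loop over `series`, `data_type`, `sector` and `industry` instead of repeating the same line four times.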

**BLS Dictionary**: series IDs + variable codes

In [10]:
series.head(2)

| | series_id | supersector_code | industry_code | data_type_code | seasonal | series_title | footnote_codes | begin_year | begin_period | end_year | end_period |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CES0000000001 | 0 | 0 | 1 | S | All employees, thousands, total nonfarm, seaso... | | 1939 | M01 | 2022 | M11 |
| 1 | CES0000000010 | 0 | 0 | 10 | S | Women employees, thousands, total nonfarm, sea... | | 1964 | M01 | 2022 | M11 |


**Mapping Files: data_type**  
Employment, working hours, wages...

In [18]:
display(data_type.head(3))
print("\ndata_type_text:\n",data_type.data_type_text.unique())

| | data_type_code | data_type_text |
| --- | --- | --- |
| 0 | 1 | ALL EMPLOYEES, THOUSANDS |
| 1 | 2 | AVERAGE WEEKLY HOURS OF ALL EMPLOYEES |
| 2 | 3 | AVERAGE HOURLY EARNINGS OF ALL EMPLOYEES |



data_type_text:
 ['ALL EMPLOYEES, THOUSANDS' 'AVERAGE WEEKLY HOURS OF ALL EMPLOYEES'
 'AVERAGE HOURLY EARNINGS OF ALL EMPLOYEES'
 'AVERAGE WEEKLY OVERTIME HOURS OF ALL EMPLOYEES'
 'PRODUCTION AND NONSUPERVISORY EMPLOYEES, THOUSANDS'
 'AVERAGE WEEKLY HOURS OF PRODUCTION AND NONSUPERVISORY EMPLOYEES'
 'AVERAGE HOURLY EARNINGS OF PRODUCTION AND NONSUPERVISORY EMPLOYEES'
 'AVERAGE WEEKLY OVERTIME HOURS OF PRODUCTION AND NONSUPERVISORY EMPLOYEES'
 'WOMEN EMPLOYEES, THOUSANDS' 'AVERAGE WEEKLY EARNINGS OF ALL EMPLOYEES'
 'AVERAGE WEEKLY EARNINGS OF ALL EMPLOYEES, 1982-1984 DOLLARS'
 'AVERAGE HOURLY EARNINGS OF ALL EMPLOYEES, 1982-1984 DOLLARS'
 'AVERAGE HOURLY EARNINGS OF ALL EMPLOYEES, EXCLUDING OVERTIME'
 'INDEXES OF AGGREGATE WEEKLY HOURS OF ALL EMPLOYEES, 2007=100'
 'INDEXES OF AGGREGATE WEEKLY PAYROLLS OF ALL EMPLOYEES, 2007=100'
 'AVERAGE WEEKLY HOURS OF ALL EMPLOYEES, QUARTERLY AVERAGES, SEASONALLY ADJUSTED'
 'AVERAGE WEEKLY OVERTIME HOURS OF ALL EMPLOYEES, QUARTERLY AVERAGES, SEASONA

**Mapping Files: sector**  
(super)sector codes and names.

In [21]:
display(sector.head(3))
print("\nsupersector_name:\n",sector.supersector_name.unique())

| | supersector_code | supersector_name |
| --- | --- | --- |
| 0 | 0 | Total nonfarm |
| 1 | 5 | Total private |
| 2 | 6 | Goods-producing |



supersector_name:
 ['Total nonfarm' 'Total private' 'Goods-producing' 'Service-providing'
 'Private service-providing' 'Mining and logging' 'Construction'
 'Manufacturing' 'Durable Goods' 'Nondurable Goods'
 'Trade, transportation, and utilities' 'Wholesale trade' 'Retail trade'
 'Transportation and warehousing' 'Utilities' 'Information'
 'Financial activities' 'Professional and business services'
 'Education and health services' 'Leisure and hospitality'
 'Other services' 'Government']


**Mapping Files: industry**  
Industry names within each sector.  

`publishing_status`: maps to data_type (not important here; see the survey information link above).  
`display_level`, `selectable` and `sort_sequence`: variables that help to understand the industry hierarchy.

In [24]:
industry.head()

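| | industry_code | naics_code | publishing_status | industry_name | display_level | selectable | sort_sequence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | - | B | Total nonfarm | 0 | T | 1 |
| 1 | 5000000 | - | A | Total private | 1 | T | 2 |
| 2 | 6000000 | - | A | Goods-producing | 1 | T | 3 |
| 3 | 7000000 | - | B | Service-providing | 1 | T | 4 |
| 4 | 8000000 | - | A | Private service-providing | 1 | T | 5 |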
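
As a rough illustration of how the hierarchy columns can be used, the sketch below filters the five rows shown above by `display_level` and orders them by `sort_sequence`. It assumes that a lower `display_level` means a more aggregated industry, which matches these rows but is not confirmed here for the whole file; the helper name `top_levels` is hypothetical.

```python
import pandas as pd

# The first rows of the industry mapping file, copied from the output above.
industry_sample = pd.DataFrame(
    {
        "industry_code": [0, 5000000, 6000000, 7000000, 8000000],
        "industry_name": [
            "Total nonfarm",
            "Total private",
            "Goods-producing",
            "Service-providing",
            "Private service-providing",
        ],
        "display_level": [0, 1, 1, 1, 1],
        "selectable": ["T", "T", "T", "T", "T"],
        "sort_sequence": [1, 2, 3, 4, 5],
    }
)


def top_levels(industry: pd.DataFrame, max_level: int) -> pd.DataFrame:
    """Keep industries at or above a given aggregation level, in BLS sort order."""
    mask = industry["display_level"] <= max_level
    return industry.loc[mask].sort_values("sort_sequence").reset_index(drop=True)


print(top_levels(industry_sample, 0)["industry_name"].tolist())
print(top_levels(industry_sample, 1)["industry_name"].tolist())
```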


Merging the dictionary with the mapping files:

In [26]:
ces_dict = pd.merge(series, data_type, how = "left", on = "data_type_code")
ces_dict = pd.merge(ces_dict, sector, how = "left", on = "supersector_code")
ces_dict = pd.merge(ces_dict, industry, how = "left", on = "industry_code")
ces_dict["series_id"] = ces_dict["series_id"].str.strip()
ces_dict = ces_dict[["series_id","data_type_text","supersector_name","industry_name",
                     "display_level","seasonal","series_title"]]
ces_dict.head(2)

| | series_id | data_type_text | supersector_name | industry_name | display_level | seasonal | series_title |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CES0000000001 | ALL EMPLOYEES, THOUSANDS | Total nonfarm | Total nonfarm | 0 | S | All employees, thousands, total nonfarm, seaso... |
| 1 | CES0000000010 | WOMEN EMPLOYEES, THOUSANDS | Total nonfarm | Total nonfarm | 0 | S | Women employees, thousands, total nonfarm, sea... |
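
Left merges like the ones above duplicate rows without warning if a mapping file has repeated keys, and leave missing values where a code has no match. A minimal sketch of how this could be checked with the `validate` and `indicator` parameters of `pd.merge`, using toy frames: the first two rows are copied from the outputs above, while the series id ending in `99` and the code `99` are made up to show a failed lookup.

```python
import pandas as pd

series_toy = pd.DataFrame(
    {
        "series_id": ["CES0000000001", "CES0000000010", "CES0000000099"],
        "data_type_code": [1, 10, 99],  # 99 is a made-up code with no match
    }
)
data_type_toy = pd.DataFrame(
    {
        "data_type_code": [1, 10],
        "data_type_text": ["ALL EMPLOYEES, THOUSANDS", "WOMEN EMPLOYEES, THOUSANDS"],
    }
)

# validate="many_to_one" raises a MergeError if the mapping keys are not unique;
# indicator=True adds a "_merge" column recording where each row found a match.
merged = pd.merge(
    series_toy,
    data_type_toy,
    how="left",
    on="data_type_code",
    validate="many_to_one",
    indicator=True,
)

unmatched = merged.loc[merged["_merge"] == "left_only", "series_id"].tolist()
print(unmatched)
```

If `unmatched` is empty and the row count equals that of `series`, the merge neither dropped nor duplicated any series.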


Saving data dictionary as .parquet file:

In [28]:
ces_dict.to_parquet("data\\employment\\dict_bls_ces.parquet")
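
With both tables available, the dictionary can be used to look up a series by its description and pull the matching observations. A minimal sketch with toy stand-ins for the two tables (values copied from the outputs above); the helper name `select_series` and its parameters are hypothetical:

```python
import pandas as pd

# Toy stand-ins for the data and dictionary tables built above.
ces_toy = pd.DataFrame(
    {
        "series_id": ["CES0000000001", "CES0000000001"],
        "date": pd.to_datetime(["1939-01-01", "1939-02-01"]),
        "value": [29923.0, 30100.0],
    }
)
dict_toy = pd.DataFrame(
    {
        "series_id": ["CES0000000001", "CES0000000010"],
        "data_type_text": ["ALL EMPLOYEES, THOUSANDS", "WOMEN EMPLOYEES, THOUSANDS"],
        "industry_name": ["Total nonfarm", "Total nonfarm"],
        "seasonal": ["S", "S"],
    }
)


def select_series(data, dictionary, data_type, industry, seasonal="S"):
    """Return the values of the series matching the description, indexed by date."""
    match = (
        (dictionary["data_type_text"] == data_type)
        & (dictionary["industry_name"] == industry)
        & (dictionary["seasonal"] == seasonal)
    )
    ids = dictionary.loc[match, "series_id"]
    return data[data["series_id"].isin(ids)].set_index("date")["value"]


total_nonfarm = select_series(
    ces_toy, dict_toy, "ALL EMPLOYEES, THOUSANDS", "Total nonfarm"
)
print(total_nonfarm)
```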