# Labor Condition Applications (LCAs)
## Import raw data for 2020 onward
**Source:** [Department of Labor > Foreign Labor Certification > Performance Data](https://www.dol.gov/agencies/eta/foreign-labor/performance)  
The first step in the H-1B visa program is for employers to submit a Labor Condition Application (LCA) providing information on the type of visa they are seeking, the occupation they are hiring for, the number of workers, the intended pay range, etc.  

According to the DOL: "Beginning in Fiscal Year 2020, the Program Record Layouts associated with each program disclosure file are substantially different from prior fiscal years and include a number of additional data fields extracted based on new visa application forms, appendices, and addenda implemented by OFLC through the new [Foreign Labor Application Gateway (FLAG) System](https://flag.dol.gov/?_ga=2.116320975.836097246.1711548631-607582160.1711058977)."

In this script, we'll import the files for 2020 onward. Pre-2020 files are handled in another script.

The [DOL website](https://www.dol.gov/agencies/eta/foreign-labor/performance) also states that quarterly files are supposed to be cumulative within a fiscal year, i.e., the Q4 file should contain all cases from previous quarters in the same year.  
However, we won't import only the Q4 files, for a couple of reasons:
 - Cases can appear in multiple years if they are submitted in one year and an action is taken in a following year (e.g., certified in one year but withdrawn in the next).
 - Preliminary exploratory analysis showed that some cases that appear in the files for quarters 1-3 do not show up in the quarter 4 file (even though they are certified), contrary to what the website says.  

To address these issues, we'll import all files and deduplicate later.
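
The deduplication itself happens outside this script. As a rough sketch of one way it could work, assuming the most recent datafile holds the authoritative record for each case, we could keep the last appearance of every `CASE_NUMBER`. The helper name `keep_latest_record` and the sample rows are hypothetical. `DATAFILE_YEAR` and `DATAFILE_QUARTER` are the reference fields added during import.

```python
import pandas as pd

def keep_latest_record(df):
    """Keep one row per CASE_NUMBER, taken from the most recent datafile.

    Assumption: the latest (year, quarter) file reflects the current status.
    """
    ordered = df.sort_values(['DATAFILE_YEAR', 'DATAFILE_QUARTER'])
    deduped = ordered.drop_duplicates(subset='CASE_NUMBER', keep='last')
    return deduped.reset_index(drop=True)

# Made-up rows for illustration only
sample = pd.DataFrame({
    'CASE_NUMBER': ['A', 'A', 'B', 'A'],
    'CASE_STATUS': ['Certified', 'Certified', 'Certified', 'Certified - Withdrawn'],
    'DATAFILE_YEAR': [2020, 2020, 2020, 2021],
    'DATAFILE_QUARTER': [1, 2, 3, 1],
})
deduped = keep_latest_record(sample)
print(deduped)
```

Whether the latest record is the right one to keep depends on the analysis. For example, a study of certifications may prefer the row where the case was first certified.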

In [None]:
# Import packages
import pandas as pd

In [None]:
# Set up parameters
data_dir = '../../data/'
output_dir = data_dir + 'raw/'
output_filename_base = 'lca_raw_'
base_url = 'https://www.dol.gov/sites/dolgov/files/ETA/oflc/pdfs/LCA_Disclosure_Data_FY'
years = [2020, 2021, 2022, 2023]
quarters = [1,2,3,4]

# Standardize column names (some years use slightly different names)
col_renames = {
  'H-1B_DEPENDENT': 'H_1B_DEPENDENT',
  'H1B_DEPENDENT': 'H_1B_DEPENDENT'
  }

We'll import only the relevant fields provided by the DOL.

In [None]:
# LCA columns to import for 2020 and later
# We'll import only the columns we'll need
cols = [
  'CASE_NUMBER',
  'CASE_STATUS',
  'RECEIVED_DATE',
  'DECISION_DATE',
  'ORIGINAL_CERT_DATE',
  'VISA_CLASS',
  'JOB_TITLE',
  'SOC_CODE',
  'SOC_TITLE',
  'FULL_TIME_POSITION',
  'BEGIN_DATE',
  'END_DATE',
  'TOTAL_WORKER_POSITIONS',
  'NEW_EMPLOYMENT',
  'CONTINUED_EMPLOYMENT',
  'CHANGE_PREVIOUS_EMPLOYMENT',
  'NEW_CONCURRENT_EMPLOYMENT',
  'CHANGE_EMPLOYER',
  'AMENDED_PETITION',
  'EMPLOYER_NAME',
  'TRADE_NAME_DBA',
  'EMPLOYER_CITY',
  'EMPLOYER_STATE',
  'EMPLOYER_POSTAL_CODE',
  'EMPLOYER_COUNTRY',
  'EMPLOYER_PROVINCE',
  'NAICS_CODE',
  'WORKSITE_WORKERS',
  'SECONDARY_ENTITY',
  'SECONDARY_ENTITY_BUSINESS_NAME',
  'WORKSITE_CITY',
  'WORKSITE_COUNTY',
  'WORKSITE_STATE',
  'WORKSITE_POSTAL_CODE',
  'WAGE_RATE_OF_PAY_FROM',
  'WAGE_RATE_OF_PAY_TO',
  'WAGE_UNIT_OF_PAY',
  'PREVAILING_WAGE',
  'PW_UNIT_OF_PAY',
  'PW_TRACKING_NUMBER',
  'PW_WAGE_LEVEL',
  'PW_OES_YEAR',
  'PW_OTHER_SOURCE',
  'PW_OTHER_YEAR',
  'PW_SURVEY_PUBLISHER',
  'PW_SURVEY_NAME',
  'TOTAL_WORKSITE_LOCATIONS',
  # 'H_1B_DEPENDENT', # In some years this is H-1B_DEPENDENT. Adjust according to the year.
  'WILLFUL_VIOLATOR',
  'SUPPORT_H1B',
  'STATUTORY_BASIS',
]

In [None]:
int_cols = ['TOTAL_WORKER_POSITIONS', 'NEW_EMPLOYMENT', 'CONTINUED_EMPLOYMENT', 'CHANGE_PREVIOUS_EMPLOYMENT', 'NEW_CONCURRENT_EMPLOYMENT', 'CHANGE_EMPLOYER', 'AMENDED_PETITION']
float_cols = ['WAGE_RATE_OF_PAY_FROM', 'WAGE_RATE_OF_PAY_TO', 'PREVAILING_WAGE']

dtypes = {}
for col in cols:
  if col in int_cols:
    dtypes[col] = int
  elif col in float_cols:
    dtypes[col] = float
  else:
    dtypes[col] = str
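
One caveat about the integer columns: the plain `int` dtype cannot represent missing values, so a blank cell in any of the count columns would make the conversion fail. The sketch below illustrates that behavior on a made-up series and shows pandas' nullable `'Int64'` dtype as a possible alternative. It is not a change to the mapping above, and the helper name `to_nullable_int` is hypothetical.

```python
import pandas as pd

def to_nullable_int(series):
    """Convert a numeric series to pandas' nullable integer dtype."""
    return series.astype('Int64')

# Made-up counts with one missing value
counts = pd.Series([3.0, None])

# Casting to plain int fails when a value is missing
try:
    counts.astype(int)
    plain_int_failed = False
except ValueError:
    plain_int_failed = True

# The nullable dtype keeps the missing value as <NA>
nullable_counts = to_nullable_int(counts)
print(nullable_counts)
```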

Import the Excel files for each quarter. Note that the H-1B dependency column name can vary between files.  
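
As an alternative to hardcoding which file uses which name, the variant could be detected from each file's header. The following is a minimal sketch under that assumption. The helper `find_h1b_dep_column` and the constant `H1B_DEP_VARIANTS` are hypothetical names, and the three variants are the ones observed in the files covered by this script.

```python
# Known spellings of the H-1B dependency column in the FY2020-FY2023 files
H1B_DEP_VARIANTS = ['H_1B_DEPENDENT', 'H-1B_DEPENDENT', 'H1B_DEPENDENT']

def find_h1b_dep_column(columns):
    """Return the single H-1B dependency column name found in `columns`."""
    matches = [c for c in columns if c in H1B_DEP_VARIANTS]
    if len(matches) != 1:
        raise ValueError('Expected exactly one H-1B dependency column, found: ' + str(matches))
    return matches[0]

# Example with a made-up header
example_header = ['CASE_NUMBER', 'CASE_STATUS', 'H-1B_DEPENDENT']
detected = find_h1b_dep_column(example_header)
print(detected)

# With a real file, the header could be read first, e.g.:
#   header = pd.read_excel(url, nrows=0).columns
# Note that this means opening each workbook twice, which may be slow.
```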

In [None]:
# Import all data files
for year in years:
  dfyear = []
  for quarter in quarters:
    url = base_url + str(year) + '_Q' + str(quarter) + '.xlsx'
    print('Importing ' + url)
    
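    # Select the right format for the H-1B dependency column
    if year >= 2022:
      h1b_dep = 'H_1B_DEPENDENT'
    elif year == 2020 or (year == 2021 and quarter in [2,4]):
      h1b_dep = 'H-1B_DEPENDENT'
    elif year == 2021 and quarter in [1,3]:
      h1b_dep = 'H1B_DEPENDENT'
    else:
      # Fail loudly instead of reusing the value from a previous iteration
      raise ValueError('Unknown H-1B dependency column for FY' + str(year) + ' Q' + str(quarter))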
    
    # Update the columns to import and the data types
    usecols = cols + [h1b_dep]
    dtypes[h1b_dep] = str

    # Import data
    df = pd.read_excel(url, usecols=usecols, dtype=dtypes)

    # Add additional reference fields and standardize column names
    df['DATAFILE_YEAR'] = year
    df['DATAFILE_QUARTER'] = quarter
    df = df.rename(columns=col_renames)

    # Stack all files within the same year together
    dfyear.append(df)

  # Save data locally
  print('Saving data for ' + str(year))
  pd.concat(dfyear).to_csv(output_dir + output_filename_base + str(year) + '.csv', index=False)