# Labor Condition Applications (LCAs)
## Standardize employer names  

Because the data reflects exactly what petitioners submit on DOL Labor Condition Applications [forms ETA-9035 & 9035E](https://www.uscis.gov/i-129), manually entered company names can vary, e.g., "Google" and "Google Inc.", or "Facebook" and "Meta Platforms". A single company name often captures most of a company's petitions. To make sure all petitions are counted for some of the largest companies - and for ones that we'll look into specifically later on - I created regex patterns that search employer names, so that slight variations are also included in the total petition counts. This method might include some false positives, inflating the total petition count, but for the ~50 companies I checked, misidentified companies accounted for <1% of total petitions. The method may also miss some company name variations.  
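To make the matching idea concrete, here is a minimal sketch with made-up patterns. The patterns actually used in this analysis live in `employer_params.csv` and are not reproduced here; `example_patterns` and `match_employer` are hypothetical names used only for illustration.

```python
import re

# Hypothetical patterns for illustration only; not the ones used in this analysis
example_patterns = {
    'Google': r'\bgoogle\b',
    'Meta': r'\b(facebook|meta platforms)\b',
}

def match_employer(raw_name):
    """Return the first standardized name whose pattern matches, else the raw name."""
    for standardized_name, pattern in example_patterns.items():
        if re.search(pattern, raw_name, flags=re.IGNORECASE):
            return standardized_name
    return raw_name

for raw in ['Google Inc.', 'GOOGLE LLC', 'Facebook, Inc.', 'Meta Platforms', 'Metalworks Co']:
    print(f'{raw!r} -> {match_employer(raw)!r}')
```

The word boundaries (`\b`) are one way to limit false positives: a looser pattern such as `meta` would also match "Metalworks Co", while the pattern above leaves that name unchanged.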

In [1]:
# Import packages
import pandas as pd
import re
from functools import lru_cache

In [2]:
# Set up parameters
data_dir = '../../data/'
input_dir = data_dir + 'intermediate/'
input_filename = 'lca_deduped.csv'
output_dir = data_dir + 'intermediate/'
output_filename = 'lca_standardized_employer_names.csv'
parameters_dir = data_dir + 'parameters/'
employer_params_filename = 'employer_params.csv'

In [3]:
# Load LCA data, reading only the columns we'll need
lca = pd.read_csv(input_dir + input_filename, dtype=str, usecols=['EMPLOYER_NAME', 'CASE_NUMBER'])

In [10]:
# Load company-level parameters
# This includes the regex patterns for each employer name
employer_params = pd.read_csv(parameters_dir + employer_params_filename, index_col='employer')
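Since every pattern in the parameters file is later passed to `re.search`, it can help to confirm that each one compiles before running it against millions of records. The sketch below is one possible check, not part of the original workflow; `find_invalid_patterns` and `example_terms` are hypothetical, and the example patterns are invented.

```python
import re

def find_invalid_patterns(patterns):
    """Return {name: error message} for every pattern that fails to compile.

    `patterns` is anything with an .items() method that yields (name, pattern) pairs.
    """
    invalid = {}
    for name, pattern in patterns.items():
        try:
            re.compile(pattern)
        except re.error as err:
            invalid[name] = str(err)
    return invalid

# Hypothetical patterns; the last one has an unclosed group on purpose
example_terms = {
    'Google': r'\bgoogle\b',
    'Amazon': r'\bamazon(\.com)?\b',
    'Broken': r'(unclosed',
}
print(find_invalid_patterns(example_terms))
```

Because a pandas Series also has an `.items()` method, the same function should accept `employer_params['search_term']` directly.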

In [5]:
# Store the original employer name in a new column
lca['EMPLOYER_NAME_RAW'] = lca['EMPLOYER_NAME']

In [6]:
# Create a dataframe of unique employer names
# Store all case numbers for each employer as a list
# employer_names = lca[['EMPLOYER_NAME', 'CASE_NUMBER']]
# employer_names = employer_names.groupby('EMPLOYER_NAME', dropna=False)['CASE_NUMBER'].agg(list).reset_index()

In [12]:
# Define a function to standardize company names for the top petitioners
# The function compares the name provided on the form with each regex pattern and returns the standardized company name on the first match
# Results are cached so each unique employer name is only evaluated once
@lru_cache(maxsize=None)
def standardize_employer_name(employer_name):
  # Leave missing names untouched
  if pd.isna(employer_name):
    return employer_name
  for standardized_name, search_term in employer_params['search_term'].items():
    if re.search(search_term, employer_name, flags=re.IGNORECASE):
      return standardized_name
  # No pattern matched, so keep the name as submitted
  return employer_name

In [13]:
# Standardize the employer names
lca['EMPLOYER_NAME'] = lca['EMPLOYER_NAME_RAW'].apply(standardize_employer_name)

In [14]:
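# Count how many records had their employer name changed by the standardization
# Note: NaN != NaN evaluates to True, so any records with a missing employer name are also counted here
n_total = lca.shape[0]
n_standardized = lca[lca['EMPLOYER_NAME'] != lca['EMPLOYER_NAME_RAW']].shape[0]
share_standardized = round(100 * n_standardized / n_total, 1)
print(f'{n_standardized:,} of {n_total:,} records ({share_standardized}%) have been standardized')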

2,071,861 of 6,463,616 records (32.1%) have been standardized
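One way to spot-check for false positives is to list the raw names that were mapped to a given standardized name, along with how often each occurs. The sketch below assumes the two columns created above (`EMPLOYER_NAME` and `EMPLOYER_NAME_RAW`); `raw_variants` and the miniature `example_lca` dataframe are hypothetical and use invented rows, not actual LCA data.

```python
import pandas as pd

# Hypothetical miniature of the `lca` dataframe after standardization
example_lca = pd.DataFrame({
    'EMPLOYER_NAME_RAW': ['Google Inc.', 'Google Inc.', 'GOOGLE LLC', 'Acme Corp', None],
    'EMPLOYER_NAME': ['Google', 'Google', 'Google', 'Acme Corp', None],
})

def raw_variants(df, standardized_name):
    """Count the raw names that were mapped to one standardized name."""
    mask = df['EMPLOYER_NAME'] == standardized_name
    return df.loc[mask, 'EMPLOYER_NAME_RAW'].value_counts()

variants = raw_variants(example_lca, 'Google')
print(variants)
```

Scanning this list for names that clearly belong to a different company gives a rough sense of the false-positive share for that employer.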


In [15]:
# Save file
lca.to_csv(output_dir + output_filename, index=False)