# USCIS H-1B Data: Create analysis file
Clean up raw data imported from the [H-1B Data Hub](https://www.uscis.gov/tools/reports-and-studies/h-1b-employer-data-hub) and create an analysis file. Documentation on the data can be found [here](https://www.uscis.gov/tools/reports-and-studies/h-1b-employer-data-hub/h-1b-employer-data-hub-files).

In [1]:
# Import packages
import pandas as pd
import re

In [2]:
# Set up parameters
data_dir = '../../data/'
input_dir = data_dir + 'intermediate/'
output_dir = data_dir + 'analysis/'
input_filename = 'uscis_intermediate.csv'
output_filename = 'uscis.csv'
parameters_dir = data_dir + 'parameters/'
employer_params_filename = 'employer_params.csv'

In [3]:
# Load data
uscis = pd.read_csv(input_dir + input_filename, dtype=str)

### New petition approval and denial fields
Add columns for total approvals, total denials, and total petitions (approvals plus denials).

In [4]:
# Convert petition count columns from strings to integers
approval_cols = ['Initial Approval', 'Continuing Approval']
denial_cols = ['Initial Denial', 'Continuing Denial']

for col in approval_cols + denial_cols:
  uscis[col] = uscis[col].astype(int)
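Because the file is loaded with `dtype=str`, `astype(int)` raises a `ValueError` if any count contains a thousands separator or is blank. If that turns out to be the case, a more defensive conversion could look like the sketch below. The helper name `to_int_column` and the sample values are hypothetical, and treating blanks as zero is an assumption that should be checked against the data documentation.

```python
import pandas as pd

def to_int_column(series):
  """Convert a string column to integers, tolerating commas and blanks."""
  # Drop thousands separators and surrounding whitespace
  cleaned = series.str.replace(',', '', regex=False).str.strip()
  # Assumption: missing or unparseable values are counted as 0
  return pd.to_numeric(cleaned, errors='coerce').fillna(0).astype(int)

# Hypothetical sample values, not taken from the USCIS file
sample = pd.Series(['12', '1,234', None, ' 7 '])
converted = to_int_column(sample)
```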

In [5]:
uscis['Total Approval'] = uscis[approval_cols].sum(axis=1)
uscis['Total Denial'] = uscis[denial_cols].sum(axis=1)
uscis['Total Petitions'] = uscis[approval_cols + denial_cols].sum(axis=1)

### Standardize employer names and identify outsourcing companies
Because the data reflects exactly what the petitioner submits on [Form I-129](https://www.uscis.gov/i-129), the same company can appear under slightly different names, e.g., "Google" and "Google Inc.", or "Facebook" and "Meta Platforms". Usually a single name captures most of a company's petitions, but to make sure the rest are counted, I wrote a regex search term for each employer so that name variations are included in its total petition count. This method can produce false positives, which would inflate the total petition count, but for the ~50 companies I checked, misidentified companies accounted for <1% of total petitions. It may also miss some name variations.

I also flag companies that have outsourcing business models. These companies are either taken directly from [previous](https://www.epi.org/blog/tech-and-outsourcing-companies-continue-to-exploit-the-h-1b-visa-program-at-a-time-of-mass-layoffs-the-top-30-h-1b-employers-hired-34000-new-h-1b-workers-in-2022-and-laid-off-at-least-85000-workers/) [analyses](https://www.nytimes.com/interactive/2015/11/06/us/outsourcing-companies-dominate-h1b-visas.html?smid=tw-share) or identified manually using the same criteria as past reports.

Both the regex search terms for standardized names and the outsourcing flags are stored in the parameters file `employer_params.csv`.
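To illustrate the false-positive risk described above, here is a minimal sketch comparing a loose search term with a stricter, anchored one. The employer names and both patterns are hypothetical and are not taken from the parameters file.

```python
import re

# Hypothetical employer names as they might appear on a form
names = ['Google Inc.', 'GOOGLE LLC', 'Googleplex Diner LLC']

loose_pattern = r'google'      # matches anywhere in the name
strict_pattern = r'^google\b'  # must start the name and end at a word boundary

loose_matches = [n for n in names if re.search(loose_pattern, n, flags=re.IGNORECASE)]
strict_matches = [n for n in names if re.search(strict_pattern, n, flags=re.IGNORECASE)]
```

The loose pattern picks up all three names, including the unrelated diner, while the anchored pattern keeps only the first two. Stricter patterns reduce false positives at the cost of possibly missing unusual name variations.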

In [6]:
# Load company-level parameters
# This includes the regex patterns for each employer name
employer_params = pd.read_csv(parameters_dir + employer_params_filename, index_col='employer')

In [7]:
# Define a function to standardize company names for the top petitioners
# The function tests the name provided on the form against each employer's regex search term and returns the standardized company name on the first match
# Names that match no search term, and missing names, are returned unchanged
def standardize_employer_name(employer_name):
  # Missing names cannot be searched, so return them as-is
  if pd.isna(employer_name):
    return employer_name
  for standardized_name, search_term in employer_params['search_term'].items():
    if re.search(search_term, employer_name, flags=re.IGNORECASE):
      return standardized_name
  return employer_name

In [8]:
# Standardize employer names
uscis['Employer (original name)'] = uscis['Employer']
uscis['Employer'] = uscis['Employer (original name)'].apply(standardize_employer_name)
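One way to spot-check the standardization for false positives is to list every original name that was mapped to a standardized name, along with its petition count. The sketch below uses a small hypothetical frame in place of `uscis`. The column names follow this notebook, but the rows are made up.

```python
import pandas as pd

# Hypothetical rows standing in for the standardized uscis frame
example = pd.DataFrame({
  'Employer': ['Google', 'Google', 'Google', 'Acme Corp'],
  'Employer (original name)': ['GOOGLE LLC', 'GOOGLE INC', 'GOOGLE LLC', 'Acme Corp'],
  'Total Petitions': [100, 5, 20, 3],
})

# Keep only rows whose name was changed by standardization
changed = example[example['Employer'] != example['Employer (original name)']]

# Petitions contributed by each original name, largest first
audit = (
  changed
    .groupby(['Employer', 'Employer (original name)'])['Total Petitions']
    .sum()
    .sort_values(ascending=False)
)
```

Scanning `audit` for original names that do not belong to the standardized employer gives a rough estimate of how many petitions are misattributed.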

In [9]:
# Flag outsourcing companies (only for top petitioners)
# NOTE: this step is currently commented out, so the flags are not added to the output file
# employer_params = (
#   employer_params
#     .loc[slice(None), ['outsourcing', 'top_outsourcing']]
#     .rename(columns={
#       'outsourcing': 'Outsourcing', 
#       'top_outsourcing': 'Top Outsourcing'
#     })
# )
# uscis = uscis.set_index('Employer').join(
#   employer_params,
#   how='left',
#   validate='m:1'
#   ).reset_index()

In [10]:
# Save file
uscis.to_csv(output_dir + output_filename, index=False)