# H-1B Visas Article

Results are in the [H-1B Visa Datasheet](https://docs.google.com/spreadsheets/d/1kO5vxEbIkr2X15clL980FAXCyMWD5yVGwkebyssDDa0/edit?gid=0#gid=0).

## Script Setup

In [1]:
# Load packages
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

In [2]:
# Define functions
def value_dist(series, dropna=False):
  counts = series.value_counts(dropna=dropna)
  shares = series.value_counts(dropna=dropna, normalize=True)
  result = pd.DataFrame({'count': counts, 'share': shares})
  return result
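
As a quick illustration of what `value_dist` returns, here is a minimal sketch on a made-up Series (the sample values are hypothetical, not taken from the LCA or USCIS data). With `dropna=False`, missing values get their own row.

```python
import pandas as pd

def value_dist(series, dropna=False):
    # Counts and shares side by side, one row per distinct value
    counts = series.value_counts(dropna=dropna)
    shares = series.value_counts(dropna=dropna, normalize=True)
    return pd.DataFrame({'count': counts, 'share': shares})

# Hypothetical values, including one missing entry
sample = pd.Series(['Y', 'N', 'N', None])
dist = value_dist(sample)
print(dist)
```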

In [6]:
# Filenames and parameters
data_dir = '../../data/analysis/'
parameters_dir = '../../data/parameters/'
uscis_filename = 'uscis.csv'
lca_filename = 'lca.csv'

h1b_cap = 85000
industry_flags = ['tech', 'semiconductor', 'ai', 'it_services', 'professional_services', 'software',]
company_flags = industry_flags + ['outsourcing']

In [7]:
# Variables for aggregation
approval_cols = ['Initial Approval', 'Continuing Approval', 'Total Approval']
denial_cols = ['Initial Denial', 'Continuing Denial', 'Total Denial']
petition_cols = approval_cols + denial_cols + ['Total Petitions']

def create_agg(cols, agg):
  # Map every column name to the same aggregation
  # (avoids naming a variable `dict`, which would shadow the built-in)
  return {col: agg for col in cols}

def create_sum(cols):
  return create_agg(cols, 'sum')

approval_sum = create_sum(approval_cols)
denial_sum = create_sum(denial_cols)
petition_sum = create_sum(petition_cols)
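
The helpers above simply build a dictionary that maps each column to `'sum'`, which `groupby(...).agg(...)` accepts directly. A small sketch on a hypothetical two-year table (the numbers are invented for illustration):

```python
import pandas as pd

approval_cols = ['Initial Approval', 'Continuing Approval', 'Total Approval']
# Same mapping that create_sum(approval_cols) produces
approval_sum = {col: 'sum' for col in approval_cols}

# Invented rows: two employers in 2020, one in 2021
toy = pd.DataFrame({
    'Fiscal Year': [2020, 2020, 2021],
    'Initial Approval': [1, 2, 3],
    'Continuing Approval': [4, 5, 6],
    'Total Approval': [5, 7, 9],
})
by_year = toy.groupby('Fiscal Year').agg(approval_sum)
print(by_year)
```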

## Load Data

### Employer Characteristics

In [8]:
# Company data from CompaniesMarketCap
marketcap = pd.read_csv(
  parameters_dir + 'CompaniesMarketCap - All Companies.csv',
  usecols = ['name', 'country', 'employee_count'] + industry_flags
)

In [9]:
# Company data assembled manually, including outsourcing flag
employer_params = pd.read_csv(parameters_dir + 'employer_params.csv', usecols=['employer', 'outsourcing', 'employee_count', 'in_market_cap_data', 'tech'])

In [None]:
# Create a single dataset with employer characteristics
employers = (
  employer_params.set_index('employer')
    .join(
      other = marketcap.rename(columns={'name':'employer'}).set_index('employer')
      , how = 'outer'
      , lsuffix = '_params'
      , rsuffix = '_marketcap'
      , validate = '1:1'
    )
)

# Create the final tech flag. This flags a few additional companies as tech which are not in the CompaniesMarketCap data (e.g. LinkedIn, Yahoo)
employers['tech'] = employers['tech_marketcap'].combine_first(employers['tech_params'])

# Create the final employee_count field. The _params field contains data from CompaniesMarketCap as well as employee counts gathered from other sources (e.g. Wikipedia)
employers['employee_count'] = employers['employee_count_params'].combine_first(employers['employee_count_marketcap'])

# Drop columns
employers.drop(['tech_params', 'tech_marketcap', 'employee_count_params', 'employee_count_marketcap'], axis=1, inplace=True)
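
`combine_first` keeps the caller's value where it is present and falls back to the argument only where the caller is missing. A minimal sketch with hypothetical employers A, B and C (the flag values are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        'tech_marketcap': [1.0, np.nan, np.nan],
        'tech_params': [0.0, 1.0, np.nan],
    },
    index=['A', 'B', 'C'],
)

# A: marketcap value wins; B: falls back to params; C: missing in both
toy['tech'] = toy['tech_marketcap'].combine_first(toy['tech_params'])
print(toy)
```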

In [None]:
# Remove intermediate DataFrames that are no longer needed
del marketcap, employer_params

### LCA: Labor Condition Applications

In [None]:
lca_cols = {
  'CASE_NUMBER': 'str',
  'DATAFILE_YEAR': 'int',
  'EMPLOYER_NAME': 'str',
  'H_1B_DEPENDENT': 'str',
  # 'JOB_TITLE': 'str',
  # 'SOC_CODE': 'str',
  # 'SOC_TITLE': 'str',
  'TOTAL_WORKER_POSITIONS': 'float',
  'NEW_EMPLOYMENT': 'float',
  'CONTINUED_EMPLOYMENT': 'float',
  'CHANGE_PREVIOUS_EMPLOYMENT': 'float',
  'NEW_CONCURRENT_EMPLOYMENT': 'float',
  'CHANGE_EMPLOYER': 'float',
  'AMENDED_PETITION': 'float',
  'FULL_TIME_POSITION': 'str',
  'VISA_CLASS': 'str',
  'NAICS_CODE': 'str',
  'WAGE_ANNUAL_FROM': 'float',
  'WAGE_ANNUAL_TO': 'float',
  'PW_ANNUAL': 'float',
  'PW_WAGE_LEVEL': 'str'
}

In [None]:
# Load LCA data
lca = pd.read_csv(data_dir + lca_filename, usecols=lca_cols.keys(), dtype=lca_cols)

In [None]:
# Amend LCA data with employer characteristics
lca = (
  lca.set_index('EMPLOYER_NAME')
    .join(
      other = employers.loc[:, ['country', 'employee_count'] + company_flags],
      how = 'left',
      validate = 'm:1'
    )
  .reset_index()
  .set_index('CASE_NUMBER')
)

### USCIS: US Citizenship and Immigration Services

In [None]:
# Load USCIS data
uscis = pd.read_csv(data_dir + uscis_filename)

In [None]:
# Amend USCIS dataset with employer characteristics
uscis = (
  uscis.set_index('Employer')
    .join(
      other = employers.loc[:, ['country', 'employee_count'] + company_flags],
      how = 'left',
      validate = 'm:1'
    )
    .reset_index()
)

IBM is categorized as both a tech company and an outsourcing company. To simplify our analysis, we'll categorize it only as an outsourcing company. That way it doesn't appear in both tech and outsourcing categories when we compare the two groups. This is consistent with previous analyses from other sources (see reports from the [Economic Policy Institute](https://www.epi.org/publication/h-1b-visas-and-prevailing-wage-levels/) and the [New York Times](https://www.nytimes.com/interactive/2015/11/06/us/outsourcing-companies-dominate-h1b-visas.html)).

In [None]:
uscis.loc[uscis['Employer'] == 'IBM', 'tech'] = 0

In [None]:
lca.loc[lca['EMPLOYER_NAME'] == 'IBM', 'tech'] = 0

Similarly, Nvidia is categorized as both a semiconductor company and an AI company. To simplify the analysis, we'll take the stance that it's not an AI company per se but a semiconductor company that is a supplier to the AI industry.

In [None]:
uscis.loc[uscis['Employer'] == 'Nvidia', 'ai'] = 0

In [None]:
lca.loc[lca['EMPLOYER_NAME'] == 'Nvidia', 'ai'] = 0

### H-1B Dependent Companies

In [None]:
# Dataset will contain one row per company-year and a flag for whether the company declared H-1B dependency in that year
# Note that H-1B dependency status is only available for 2015 onward
h1b_dependency = (
  lca
    .loc[:, ['EMPLOYER_NAME', 'DATAFILE_YEAR', 'H_1B_DEPENDENT']]
    .drop_duplicates()
    .rename(columns={'EMPLOYER_NAME':'Employer', 'DATAFILE_YEAR': 'Fiscal Year', 'H_1B_DEPENDENT': 'dependency_vals'})
    .reset_index(drop=True)
    .groupby(['Employer', 'Fiscal Year'])
    ['dependency_vals'].apply(set)
    .to_frame()
    .reset_index()
)

def get_h1b_dependency(x):
  if 'Y' in x:
    return 1
  elif 'N' in x:
    return 0
  else:
    return pd.NA

h1b_dependency['H-1B Dependent'] = h1b_dependency['dependency_vals'].apply(get_h1b_dependency)
h1b_dependency.drop('dependency_vals', axis=1, inplace=True)
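
To make the company-year rule concrete: a company counts as dependent if any of its filings in that year says `'Y'`, as not dependent if it only ever says `'N'`, and as unknown otherwise. A sketch on hypothetical employers A, B and C:

```python
import numpy as np
import pandas as pd

def get_h1b_dependency(x):
    if 'Y' in x:
        return 1
    elif 'N' in x:
        return 0
    else:
        return pd.NA

# Invented filings: A reports both values, B only 'N', C is missing
toy = pd.DataFrame({
    'Employer': ['A', 'A', 'B', 'C'],
    'Fiscal Year': [2020, 2020, 2020, 2020],
    'H_1B_DEPENDENT': ['Y', 'N', 'N', np.nan],
})

# Collect the distinct values per company-year, then collapse to one flag
vals = toy.groupby(['Employer', 'Fiscal Year'])['H_1B_DEPENDENT'].apply(set)
flags = vals.apply(get_h1b_dependency)
print(flags)
```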

## Analysis

USCIS petition approvals and denials by year

In [None]:
temp = uscis.groupby('Fiscal Year').agg(petition_sum)
temp['Initial Denial Rate'] = temp['Initial Denial'] / (temp['Initial Approval'] + temp['Initial Denial'])
temp['Continuing Denial Rate'] = temp['Continuing Denial'] / (temp['Continuing Approval'] + temp['Continuing Denial'])
temp['Total Denial Rate'] = temp['Total Denial'] / (temp['Total Approval'] + temp['Total Denial'])
temp.to_clipboard()
temp

USCIS petition approvals and denials by employer type: Tech and Outsourcing

In [None]:
temp = (
  uscis
    .fillna(0)
    .groupby(['tech', 'outsourcing'], dropna=False, as_index=False)
    .agg(petition_sum)
)
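temp['Employer Type'] = np.select(
  [
    (temp['tech'] == 0) & (temp['outsourcing'] == 0),
    (temp['tech'] == 1) & (temp['outsourcing'] == 0),
    (temp['tech'] == 0) & (temp['outsourcing'] == 1),
    (temp['tech'] == 1) & (temp['outsourcing'] == 1)
  ],
  ['Other', 'Tech', 'Outsourcing', 'Both'],
  # A string default keeps the dtype consistent; recent NumPy versions reject the implicit integer default (0) alongside string choices
  default = 'Other'
)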
temp.drop(['tech', 'outsourcing'], axis=1, inplace=True)

temp.to_clipboard(index=False)
temp
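
The flag-to-label mapping can be checked on a tiny table. This is a sketch with invented rows; the last row has missing flags, which `fillna(0)` turns into "Other".

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'tech': [0, 1, 0, 1, np.nan],
    'outsourcing': [0, 0, 1, 1, np.nan],
}).fillna(0)

# Conditions are checked in order; the first match decides the label
toy['Employer Type'] = np.select(
    [
        (toy['tech'] == 0) & (toy['outsourcing'] == 0),
        (toy['tech'] == 1) & (toy['outsourcing'] == 0),
        (toy['tech'] == 0) & (toy['outsourcing'] == 1),
        (toy['tech'] == 1) & (toy['outsourcing'] == 1),
    ],
    ['Other', 'Tech', 'Outsourcing', 'Both'],
    default='Other',
)
print(toy)
```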

USCIS petition approvals and denials over time by employer type: Tech and Outsourcing

In [None]:
temp = (
  uscis
    .fillna(0)
    .groupby(['tech', 'outsourcing', 'Fiscal Year'], dropna=False, as_index=False)
    .agg(petition_sum)
)
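temp['Employer Type'] = np.select(
  [
    (temp['tech'] == 0) & (temp['outsourcing'] == 0),
    (temp['tech'] == 1) & (temp['outsourcing'] == 0),
    (temp['tech'] == 0) & (temp['outsourcing'] == 1),
    (temp['tech'] == 1) & (temp['outsourcing'] == 1)
  ],
  ['Other', 'Tech', 'Outsourcing', 'Both'],
  # A string default keeps the dtype consistent; recent NumPy versions reject the implicit integer default (0) alongside string choices
  default = 'Other'
)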
temp.drop(['tech', 'outsourcing'], axis=1, inplace=True)

temp.to_clipboard(index=False)
temp

USCIS petitions by company

In [None]:
# Which companies use the H-1B program the most (from 2009-2023)
# Limit to the top 200 employers
nyears = uscis['Fiscal Year'].drop_duplicates().shape[0]

temp = (
  uscis
    .groupby(['Employer'] + company_flags, dropna=False)
    .agg(approval_sum)
    .sort_values('Total Approval', ascending=False)
    # .head(200)
    .reset_index()
    .fillna(0)
)
# temp['Share of Cap'] = temp['Initial Approval'] / (h1b_cap * nyears)
# temp['Share of Cap Sum'] = temp['Share of Cap'].cumsum()
temp['Share of Total Approval'] = temp['Total Approval'] / temp['Total Approval'].sum()
temp['Share of Total Approval Sum'] = temp['Share of Total Approval'].cumsum()
temp['Tech or Outsourcing'] = temp.apply(lambda row: 1 if row['tech']==1 or row['outsourcing'] == 1 else 0, axis=1)
temp.head(200).to_clipboard(index=False)
temp.head(200)

USCIS petition approvals by company-year

In [None]:
# Limit the list to the 100 biggest users of the program
top_employers = (
  uscis
    .groupby('Employer')
    .agg(petition_sum)
    .sort_values('Total Approval', ascending=False)
    .head(100)
    .index
    .tolist()
)

In [None]:
temp = (
  uscis
    .loc[uscis['Employer'].isin(top_employers), :]
    .groupby(['Employer', 'Fiscal Year'] + company_flags)
    .agg(approval_sum)
    .reset_index()
)
temp['Share of Cap'] = temp['Initial Approval'] / h1b_cap
temp['Tech or Outsourcing'] = temp.apply(lambda row: 1 if row['tech']==1 or row['outsourcing'] == 1 else 0, axis=1)
temp = temp.sort_values(['Fiscal Year', 'Total Approval'], ascending=[True, False])
temp.to_clipboard(index=False)
temp

USCIS petition approvals and denials over time by employer type: AI and Semiconductors

In [None]:
temp = (
  uscis
    .fillna(0)
    # Each comparison needs its own parentheses because | binds tighter than ==
    .loc[(uscis['ai'] == 1) | (uscis['semiconductor'] == 1)]
    .groupby(['ai', 'semiconductor', 'Fiscal Year'], dropna=False, as_index=False)
    .agg(petition_sum)
)

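temp['Employer Type'] = np.select(
  [
    (temp['ai'] == 0) & (temp['semiconductor'] == 0),
    (temp['ai'] == 1) & (temp['semiconductor'] == 0),
    (temp['ai'] == 0) & (temp['semiconductor'] == 1),
    (temp['ai'] == 1) & (temp['semiconductor'] == 1)
  ],
  ['Neither', 'AI', 'Semiconductor', 'Both'],
  # A string default keeps the dtype consistent; recent NumPy versions reject the implicit integer default (0) alongside string choices
  default = 'Neither'
)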
temp.drop(['ai', 'semiconductor'], axis=1, inplace=True)

temp.to_clipboard(index=False)
temp
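
A side note on filters that combine two flag comparisons with `|`: in Python, `|` binds more tightly than `==`, so each comparison needs its own parentheses. The sketch below uses the standard-library `ast` module to show how the two spellings are parsed; the variable names are placeholders and nothing is evaluated.

```python
import ast

unparenthesized = ast.parse("(ai == 1) | semiconductor == 1", mode="eval").body
parenthesized = ast.parse("(ai == 1) | (semiconductor == 1)", mode="eval").body

# Parsed as ((ai == 1) | semiconductor) == 1: the top-level node is a comparison
print(type(unparenthesized).__name__)

# Parsed as the bitwise OR of two comparisons, which is the intended filter
print(type(parenthesized).__name__)
```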

AI companies

In [None]:
temp = (
  uscis
    .loc[uscis['ai'] == 1, :]
    .groupby(['Employer', 'Fiscal Year'] + company_flags)
    .agg(approval_sum)
    .reset_index()
)
temp = temp.sort_values(['Employer', 'Fiscal Year'], ascending=[True, False])

temp = (
  temp
    .groupby('Employer')
    .agg(approval_sum)
    .loc[:, ['Total Approval']]
    .rank(ascending=False)
    .rename(columns={'Total Approval': 'Rank'})
    .join(temp.set_index('Employer'))
    .sort_values(['Rank', 'Fiscal Year'])
)

temp.to_clipboard()
temp

Software companies

In [None]:
temp = (
  uscis
    .fillna(0)
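    # Rows are limited to AI or semiconductor employers, then split by the software flag
    # Each comparison needs its own parentheses because | binds tighter than ==
    .loc[(uscis['ai'] == 1) | (uscis['semiconductor'] == 1)]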
    .groupby(['software','Fiscal Year'], dropna=False, as_index=False)
    .agg(petition_sum)
)

temp.to_clipboard(index=False)
temp

USCIS petition denial rates by year

In [None]:
# Calculating denial rates
def calc_denials(df):
  d = df.groupby('Fiscal Year').agg(petition_sum)
  d['Denial Rate'] = d['Total Denial'] / d['Total Petitions']
  d['Initial Denial Rate'] = d['Initial Denial'] / (d['Initial Approval'] + d['Initial Denial'])
  d['Continuing Denial Rate'] = d['Continuing Denial'] / (d['Continuing Approval'] + d['Continuing Denial'])
  return d
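
A worked example of the three rates may help when reading the output. The figures below are invented; `petition_sum` is rebuilt inline so the sketch runs on its own.

```python
import pandas as pd

petition_cols = [
    'Initial Approval', 'Continuing Approval', 'Total Approval',
    'Initial Denial', 'Continuing Denial', 'Total Denial', 'Total Petitions',
]
petition_sum = {col: 'sum' for col in petition_cols}

def calc_denials(df):
    d = df.groupby('Fiscal Year').agg(petition_sum)
    d['Denial Rate'] = d['Total Denial'] / d['Total Petitions']
    d['Initial Denial Rate'] = d['Initial Denial'] / (d['Initial Approval'] + d['Initial Denial'])
    d['Continuing Denial Rate'] = d['Continuing Denial'] / (d['Continuing Approval'] + d['Continuing Denial'])
    return d

# Invented rows: two employers in 2020, one in 2021
toy = pd.DataFrame({
    'Fiscal Year': [2020, 2020, 2021],
    'Initial Approval': [80, 10, 50],
    'Initial Denial': [5, 5, 50],
    'Continuing Approval': [100, 0, 90],
    'Continuing Denial': [0, 0, 10],
    'Total Approval': [180, 10, 140],
    'Total Denial': [5, 5, 60],
    'Total Petitions': [185, 15, 200],
})
rates = calc_denials(toy)
print(rates[['Denial Rate', 'Initial Denial Rate', 'Continuing Denial Rate']])
```

For 2020, the initial denial rate is 10 / (90 + 10) = 0.10 and the overall denial rate is 10 / 200 = 0.05.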

In [None]:
# Denial rate by year
denials = calc_denials(uscis)
denials_outsourcing = calc_denials(uscis[uscis['outsourcing'] == 1])
denials_tech = calc_denials(uscis[uscis['tech'] == 1])
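# Employers with missing flags count as "Other", matching the fillna(0) used in the employer-type tables
denials_other = calc_denials(uscis[(uscis['tech'].fillna(0) == 0) & (uscis['outsourcing'].fillna(0) == 0)])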

denials = (
  denials
    .join(denials_outsourcing, rsuffix=' - Outsourcing')
    .join(denials_tech, rsuffix=' - Tech')
    .join(denials_other, rsuffix=' - Other')
    .sort_index()
)
# denials[['Denial Rate', 'Denial Rate - Outsourcing', 'Denial Rate - Tech']]

denials.to_clipboard()
denials

H-1B dependent companies

Number of H-1B dependent organizations each year

In [None]:
temp = h1b_dependency.groupby(['Fiscal Year', 'H-1B Dependent'], as_index=False).agg({'Employer': 'count'})
temp['Employers'] = temp.groupby('Fiscal Year')['Employer'].transform('sum')
temp = temp.loc[temp['H-1B Dependent'] == 1, :]
temp['Share H-1B Dependent'] = temp['Employer'] / temp['Employers']
temp.rename(columns={'Employer': 'H-1B Dependent Employers', 'Employers': 'All Employers'}, inplace=True)
temp.drop('H-1B Dependent', axis=1, inplace=True)
temp.to_clipboard(index=False)
temp

H-1B dependency status by company and company type

In [None]:
# Join H-1B dependency status with USCIS data
h1b_dependency['employer_join'] = h1b_dependency['Employer'].str.lower().str.replace('[^a-z0-9]', '', regex=True)
# h1b_join.drop('Employer', axis=1, inplace=True)

uscis_h1b_dependency = uscis.groupby(['Employer', 'Fiscal Year', 'tech', 'outsourcing'], as_index=False).agg(petition_sum)
uscis_h1b_dependency['employer_join'] = uscis_h1b_dependency['Employer'].str.lower().str.replace('[^a-z0-9]', '', regex=True)

uscis_h1b_dependency = (
  uscis_h1b_dependency
  .set_index(['employer_join', 'Fiscal Year'])
  .join(h1b_dependency.set_index(['employer_join', 'Fiscal Year']), how='left', rsuffix='_RIGHT')
  .reset_index()
  .drop(['employer_join', 'Employer_RIGHT'], axis=1)
)

# Filter to outsourcing and tech companies
temp = (
  uscis_h1b_dependency
    .loc[
      (
        (uscis_h1b_dependency['tech'] == 1) | 
        (uscis_h1b_dependency['outsourcing'] == 1)
      ) & 
      (uscis_h1b_dependency['Fiscal Year'] >= 2015),
    :]
)

temp.to_clipboard(index=False)
temp
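
The join key lowercases each name and removes everything except letters and digits, so small spelling differences between the LCA and USCIS files do not block a match. A sketch with hypothetical name variants:

```python
import pandas as pd

# Invented spellings of the same employer
names = pd.Series(['Acme, Inc.', 'ACME INC', 'Acme-Inc'])

# Lowercase, then strip anything that is not a-z or 0-9
keys = names.str.lower().str.replace('[^a-z0-9]', '', regex=True)
print(keys.tolist())
```

The trade-off is that two distinct employers whose names differ only in punctuation or spacing would also share a key.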

Wages

In [None]:
# Prevailing wage levels for outsourcing vs tech
# temp = lca.loc[(lca['outsourcing'] == 1) | (lca['tech'] == 1), :]
temp = lca.copy()
# temp.loc[:, ['group']] = pd.NA
# temp.loc[temp['outsourcing'] == 1, 'group'] = 'Outsourcing'
# temp.loc[temp['tech'] == 1, 'group'] = 'Tech'

temp['group'] = np.select(
  condlist = [
    (temp['tech'] == 0) & (temp['outsourcing'] == 0),
    (temp['tech'] == 1) & (temp['outsourcing'] == 0),
    (temp['tech'] == 0) & (temp['outsourcing'] == 1),
    (temp['tech'] == 1) & (temp['outsourcing'] == 1)
  ],
  choicelist = ['Other', 'Tech', 'Outsourcing', 'Both'],
  default = 'Other'
)

temp = (
  temp
    .groupby(['group', 'DATAFILE_YEAR', 'PW_WAGE_LEVEL'])
    .agg({'TOTAL_WORKER_POSITIONS': 'sum', 'PW_ANNUAL': 'median', 'WAGE_ANNUAL_FROM': 'median'})
    .reset_index()
)
temp['group_workers'] = temp['TOTAL_WORKER_POSITIONS'] / temp.groupby(['group', 'DATAFILE_YEAR'])['TOTAL_WORKER_POSITIONS'].transform('sum')

temp.to_clipboard(index=False)
temp
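
The `group_workers` column is each wage level's share of worker positions within its group and year. A minimal sketch of the `transform('sum')` pattern on invented counts:

```python
import pandas as pd

toy = pd.DataFrame({
    'group': ['Tech', 'Tech', 'Outsourcing', 'Outsourcing'],
    'DATAFILE_YEAR': [2020, 2020, 2020, 2020],
    'PW_WAGE_LEVEL': ['I', 'II', 'I', 'II'],
    'TOTAL_WORKER_POSITIONS': [10.0, 30.0, 60.0, 20.0],
})

# transform('sum') broadcasts each group-year total back onto its rows
totals = toy.groupby(['group', 'DATAFILE_YEAR'])['TOTAL_WORKER_POSITIONS'].transform('sum')
toy['group_workers'] = toy['TOTAL_WORKER_POSITIONS'] / totals
print(toy)
```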

Number of employers  
We'll estimate the number of employers that petitioned for and received at least one H-1B visa by counting the unique employer names in each year. We use the employer name as recorded in the original USCIS data, not the standardized name created during the ingestion process.  
This is an estimate, since the same employer is often recorded under slightly different names. However, we're more interested in the change over time than in the raw number.  
We'll calculate it two ways: (1) counting unique employer names, and (2) counting unique combinations of employer name and tax ID.

In [None]:
uscis.columns

In [None]:
temp = uscis.groupby('Fiscal Year').agg(Employers = ('Employer (original name)', 'nunique'))

temp.to_clipboard()
temp

Repeat using unique employer names and tax IDs

In [None]:
temp = uscis.loc[:, ['Fiscal Year', 'Employer (original name)', 'Tax ID']].drop_duplicates().groupby('Fiscal Year').size()

temp.to_clipboard()
temp
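
The two counting methods can differ when one name appears with more than one tax ID. A sketch on invented rows (the names and IDs are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'Fiscal Year': [2020, 2020, 2020, 2021, 2021],
    'Employer (original name)': ['ACME INC', 'ACME INC', 'BETA LLC', 'ACME INC', 'ACME INC'],
    'Tax ID': ['11', '22', '33', '11', '11'],
})

# Method 1: unique names per year
by_name = toy.groupby('Fiscal Year').agg(Employers=('Employer (original name)', 'nunique'))

# Method 2: unique (name, tax ID) pairs per year
by_name_and_id = (
    toy.loc[:, ['Fiscal Year', 'Employer (original name)', 'Tax ID']]
    .drop_duplicates()
    .groupby('Fiscal Year')
    .size()
)
print(by_name)
print(by_name_and_id)
```

In 2020 the name count is 2 while the name-and-ID count is 3, because ACME INC appears under two tax IDs.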