# Methodology Summary
For the full methodology, see _[Analyzing H-1B Visa Data](https://www.nikhilgahlawat.com/projects/h1b-tech-methodology/)_.

In [1]:
# Load packages
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

In [59]:
# Filenames and parameters
data_dir = '../../data/analysis/'
parameters_dir = '../../data/parameters/'
uscis_filename = 'uscis.csv'
lca_filename = 'lca.csv'

industry_flags = ['tech', 'semiconductor', 'ai', 'it_services', 'professional_services', 'software']

In [55]:
lca_cols = {
  'CASE_NUMBER': 'str',
  'DATAFILE_YEAR': 'int',
  'EMPLOYER_NAME': 'str',
  'H_1B_DEPENDENT': 'str',
  # Text columns left out of the load; uncomment to include them
  # 'JOB_TITLE': 'str',
  # 'SOC_CODE': 'str',
  # 'SOC_TITLE': 'str',
  'TOTAL_WORKER_POSITIONS': 'float',
  'NEW_EMPLOYMENT': 'float',
  'CONTINUED_EMPLOYMENT': 'float',
  'CHANGE_PREVIOUS_EMPLOYMENT': 'float',
  'NEW_CONCURRENT_EMPLOYMENT': 'float',
  'CHANGE_EMPLOYER': 'float',
  'AMENDED_PETITION': 'float',
  'FULL_TIME_POSITION': 'str',
  'VISA_CLASS': 'str',
  'NAICS_CODE': 'str',
  'WAGE_ANNUAL_FROM': 'float',
  'WAGE_ANNUAL_TO': 'float',
  'PW_ANNUAL': 'float',
  'PW_WAGE_LEVEL': 'str'
}

In [56]:
lca = pd.read_csv(data_dir + lca_filename, usecols=lca_cols.keys(), dtype=lca_cols)

In [None]:
uscis = pd.read_csv(data_dir + uscis_filename)

In [60]:
marketcap = pd.read_csv(
  parameters_dir + 'CompaniesMarketCap - All Companies.csv',
  usecols = ['name', 'country', 'employee_count'] + industry_flags
)

In [90]:
employer_params = pd.read_csv(
  parameters_dir + 'employer_params.csv',
  usecols=['employer', 'outsourcing', 'employee_count', 'in_market_cap_data', 'tech']
)

## Employer standardization

What share of total petitions comes from the top 250 employers?

In [83]:
col = 'Total Petitions'
petitions_by_employer = (
  uscis
    .groupby('Employer')
    .agg({col: 'sum'}) 
    .sort_values(col, ascending=False)
)
petitions_by_employer['Rank'] = petitions_by_employer[col].rank(ascending=False)
petitions_by_employer['Share'] = petitions_by_employer[col] / petitions_by_employer[col].sum()
petitions_by_employer.loc[petitions_by_employer['Rank'] <= 250, 'Share'].sum()

0.4123680052592457
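
The same calculation can be checked on a toy table. The sketch below is illustrative only: the employers and petition counts are made up, and `top_n_share` is a hypothetical helper, not part of the original notebook. It uses `nlargest` instead of `rank`, which always selects exactly `n` employers; the two approaches can differ only when employers tie at the cutoff.

```python
import pandas as pd

def top_n_share(df, n, employer_col='Employer', value_col='Total Petitions'):
    """Share of value_col held by the n largest employers (hypothetical helper)."""
    totals = df.groupby(employer_col)[value_col].sum()
    return totals.nlargest(n).sum() / totals.sum()

# Made-up data for illustration only
toy = pd.DataFrame({
    'Employer': ['A', 'A', 'B', 'C', 'D'],
    'Total Petitions': [30, 20, 25, 15, 10],
})

# Employer totals are A=50, B=25, C=15, D=10, so the top 2 hold 75 of 100
toy_share = top_n_share(toy, n=2)
print(toy_share)  # 0.75
```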

## NAICS Codes

The [North American Industry Classification System (NAICS)](https://www.census.gov/naics/) is the standard used by federal statistical agencies in the US for classifying business establishments. When an employer files an LCA with the Department of Labor or a petition with USCIS, it provides a NAICS code for its business. In principle this is a convenient source of industry classification, but the codes employers provide are often inconsistent or unexpected.

The top 100 H-1B employers have each listed about 4 different NAICS codes on average.

In [53]:
top_employers = (
  uscis
    .groupby('Employer')
    .agg({'Total Petitions': 'sum'}) 
    .sort_values('Total Petitions', ascending=False)
    .head(100)
)

(
  top_employers
    .join(uscis.set_index('Employer'), how='inner', validate='1:m', lsuffix='_left')
    .groupby('Employer')
    .agg({'NAICS Code': 'nunique'})
    .agg({'NAICS Code': 'mean'})
)

NAICS Code    4.22
dtype: float64
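
An equivalent way to get this figure, without the join, is to filter to the top employers and count distinct codes per group. This is a minimal sketch on made-up rows; only the column names `Employer`, `NAICS Code`, and `Total Petitions` come from the notebook.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Employer': ['A', 'A', 'A', 'B', 'B', 'C'],
    'NAICS Code': ['51', '54', '51', '52', '52', '31-33'],
    'Total Petitions': [10, 5, 3, 8, 2, 4],
})

# Names of the 2 largest employers by total petitions (A=18, B=10)
top_names = toy.groupby('Employer')['Total Petitions'].sum().nlargest(2).index

# Distinct NAICS codes listed by each of those employers
codes_per_employer = (
    toy[toy['Employer'].isin(top_names)]
    .groupby('Employer')['NAICS Code']
    .nunique()
)
mean_codes = codes_per_employer.mean()
print(mean_codes)  # 1.5
```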

Some companies have also provided unexpected NAICS codes. For instance, Apple is most commonly listed under "Manufacturing", JPMorgan Chase under "Management of Companies and Enterprises", and Microsoft under "Information".

In [10]:
uscis.loc[uscis['Employer'] == 'Apple', 'NAICS Code'].value_counts()

NAICS Code
31-33 - Manufacturing                                    88
54 - Professional, Scientific, and Technical Services     5
51 - Information                                          3
Name: count, dtype: int64

In [18]:
uscis.loc[uscis['Employer'] == 'JPMorgan Chase', 'NAICS Code'].value_counts()

NAICS Code
55 - Management of Companies and Enterprises                                     82
51 - Information                                                                  8
52 - Finance and Insurance                                                        6
54 - Professional, Scientific, and Technical Services                             3
11 - Agriculture, Forestry, Fishing and Hunting                                   1
56 - Administrative and Support and Waste Management and Remediation Services     1
Name: count, dtype: int64

In [19]:
uscis.loc[uscis['Employer'] == 'Microsoft', 'NAICS Code'].value_counts()

NAICS Code
51 - Information                                                                 128
54 - Professional, Scientific, and Technical Services                             33
61 - Educational Services                                                          9
52 - Finance and Insurance                                                         5
56 - Administrative and Support and Waste Management and Remediation Services      2
31-33 - Manufacturing                                                              2
21 - Mining, Quarrying, and Oil and Gas Extraction                                 2
11 - Agriculture, Forestry, Fishing and Hunting                                    1
62 - Health Care and Social Assistance                                             1
Name: count, dtype: int64
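
The per-employer breakdowns above can be condensed into one row per employer: the most frequently listed code and the share of rows that carry it. The sketch below shows one way to do that on made-up data. It counts rows, as `value_counts()` does, rather than petitions.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Employer': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'NAICS Code': ['31-33', '31-33', '31-33', '54', '52', '52', '51'],
})

# Row count for every (employer, code) pair
counts = toy.groupby(['Employer', 'NAICS Code']).size().rename('n').reset_index()

# Share of each employer's rows that carry a given code
counts['share'] = counts['n'] / counts.groupby('Employer')['n'].transform('sum')

# Keep the most frequent code per employer
modal = (
    counts
    .sort_values('n', ascending=False)
    .drop_duplicates('Employer')
    .set_index('Employer')
)
print(modal[['NAICS Code', 'share']])
```

A low modal share signals that no single code dominates an employer's filings.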

A better approach may be to use industry groupings that are (1) consistent and (2) closer to how each company is commonly classified.

## Industry categories

Industry categories come from [CompaniesMarketCap](https://companiesmarketcap.com/).

How many technology companies are identified by CompaniesMarketCap?

In [62]:
marketcap.agg({'tech': 'sum'})

tech    953
dtype: int64

How many of those are in the USCIS H-1B data?

In [103]:
(
  marketcap
    .loc[marketcap['tech'] == 1, :]
    .rename(columns={'name': 'Employer'})
    .merge(
      uscis.groupby('Employer').agg({'Total Petitions': 'sum'}),
      on='Employer',
      validate='1:1'
    )
    .agg({
      'tech': 'sum',
      'Total Petitions': 'sum'
    })
)

tech                  136
Total Petitions    581718
dtype: int64
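
Because the merge above matches on exact employer names, it silently drops any company whose name is spelled differently in the two sources. One way to see how many names fail to match is a left merge with `indicator=True`. This is a sketch on made-up tables, not the notebook's data.

```python
import pandas as pd

# Made-up tables for illustration only
companies = pd.DataFrame({'Employer': ['A', 'B', 'C'], 'tech': [1, 1, 1]})
petitions = pd.DataFrame({
    'Employer': ['A', 'A', 'C', 'D'],
    'Total Petitions': [5, 7, 3, 9],
})

# One row per employer so that the merge is one-to-one
totals = petitions.groupby('Employer', as_index=False)['Total Petitions'].sum()

# Left merge keeps unmatched companies; '_merge' records where each row came from
matched = companies.merge(
    totals, on='Employer', how='left', validate='1:1', indicator=True
)
match_counts = matched['_merge'].value_counts()
print(match_counts)
```

Here `both` counts companies found in the petition data and `left_only` counts companies with no exact name match.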

The list of outsourcing companies is taken from [previous](https://www.epi.org/blog/tech-and-outsourcing-companies-continue-to-exploit-the-h-1b-visa-program-at-a-time-of-mass-layoffs-the-top-30-h-1b-employers-hired-34000-new-h-1b-workers-in-2022-and-laid-off-at-least-85000-workers/) [analyses](https://www.nytimes.com/interactive/2015/11/06/us/outsourcing-companies-dominate-h1b-visas.html?smid=tw-share).

How many outsourcing companies are identified in the data?

In [104]:
(
  employer_params
    .loc[employer_params['outsourcing'] == 1, :]
    .rename(columns={'employer': 'Employer'})
    .merge(
      uscis.groupby('Employer').agg({'Total Petitions': 'sum'}),
      on='Employer',
      validate='1:1'
    )
    .agg({
      'outsourcing': 'sum',
      'Total Petitions': 'sum'
    })
)

outsourcing             31
Total Petitions    1144522
dtype: int64

How many outsourcing companies are in the "IT Services" or "Professional Services" industries?

In [114]:
(
  employer_params
    .loc[employer_params['outsourcing'] == 1, :]
    .merge(marketcap, left_on='employer', right_on='name', how='inner', validate='1:1')
    .loc[: , ['it_services', 'professional_services']]
    .value_counts(dropna=False)
)

it_services  professional_services
0            1                        7
1            1                        7
0            0                        2
1            0                        1
Name: count, dtype: int64

IBM is in both the "tech" and "outsourcing" categories. To remain consistent with previous analysis, IBM is categorized as an outsourcing company and not a tech company.

In [116]:
marketcap[marketcap['name'] == 'IBM']

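|    | name | country | employee_count | tech | semiconductor | it_services | software | professional_services | ai |
|----|------|---------|----------------|------|---------------|-------------|----------|-----------------------|----|
| 58 | IBM | United States | 288300 | 1 | 0 | 0 | 1 | 0 | 1 |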
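
The precedence rule described for IBM can be expressed as a single category column in which the outsourcing flag overrides the tech flag. The sketch below is an assumption about how such a rule could be coded, not the notebook's actual implementation. The `category` column and the two `Example...` employers are hypothetical, and the flag values are illustrative.

```python
import pandas as pd

# Illustrative flags; the two "Example..." employers are hypothetical
employers = pd.DataFrame({
    'employer': ['IBM', 'ExampleSoft', 'ExampleConsult'],
    'tech': [1, 1, 0],
    'outsourcing': [1, 0, 1],
})

# Assign tech first, then let outsourcing overwrite it, so that an employer
# flagged as both ends up in the outsourcing category
employers['category'] = 'other'
employers.loc[employers['tech'] == 1, 'category'] = 'tech'
employers.loc[employers['outsourcing'] == 1, 'category'] = 'outsourcing'

print(employers[['employer', 'category']])
```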


For more data analysis, see the [H-1B Tech notebook](https://github.com/nikhilgahlawat/immigration-h1b/blob/main/src/analysis/h1b_tech.ipynb).