# USCIS H-1B Data: Import raw data from the data hub
**Source:** [H-1B Employer Data Hub](https://www.uscis.gov/tools/reports-and-studies/h-1b-employer-data-hub)  
**Documentation:** [Understanding Our H-1B Employer Data Hub](https://www.uscis.gov/tools/reports-and-studies/h-1b-employer-data-hub/understanding-our-h-1b-employer-data-hub)  

Before running this script, visit the [H-1B Employer Data Hub](https://www.uscis.gov/tools/reports-and-studies/h-1b-employer-data-hub) and download the CSV file for each fiscal year you want to explore, one file per year. Save the files in `/data/raw/` (referenced in the script as `../../data/raw/`). The files download with the name `Employer Information <YYYY>.csv`.
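
Before loading, it can help to confirm that a file exists for every year in the range. The following is a minimal sketch, not part of the original script: the helper name `missing_year_files` is hypothetical, and the directory and filename pattern are assumed to match the parameters defined further down.

```python
from pathlib import Path

def missing_year_files(input_dir, years, filename_base='Employer Information '):
  """Return the years whose CSV file is not present in input_dir."""
  input_dir = Path(input_dir)
  return [
    year for year in years
    if not (input_dir / f'{filename_base}{year}.csv').is_file()
  ]

# Report any downloads still needed (path assumed to match the script's input_dir)
missing = missing_year_files('../../data/raw/', range(2009, 2023 + 1))
if missing:
  print('Missing files for fiscal years:', missing)
```

An empty list means every expected file is in place and the load step should not fail on a missing path.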

Notes:  
 - The [H-1B Employer Data Hub Files](https://www.uscis.gov/tools/reports-and-studies/h-1b-employer-data-hub/h-1b-employer-data-hub-files) page states that its files were last updated 2021-01-08, so they do not contain the most up-to-date data. The most recent data is on the [H-1B Employer Data Hub](https://www.uscis.gov/tools/reports-and-studies/h-1b-employer-data-hub).
 - From the documentation: "The counts of initial approval, initial denial, continuing approval, and continuing denial are aggregated by distinct completion fiscal year, two digit NAICS code, tax ID, state, city, and ZIP code. For example, one employer with multiple addresses in a given fiscal year will have multiple rows in the data. The most common spelling of employer name per unique tax ID is used." See the documentation for more details.
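
Because one employer can appear on several rows in a fiscal year, employer-level totals require summing across those rows. The sketch below illustrates this on a toy table with made-up values; grouping by `Fiscal Year` and `Tax ID` is an assumption about the desired level of aggregation, so check the documentation on how the tax ID is reported before treating it as a unique employer key.

```python
import pandas as pd

# Toy rows (made-up values): tax ID '1234' appears at two addresses in 2020
toy = pd.DataFrame({
  'Fiscal Year': [2020, 2020, 2020],
  'Tax ID': ['1234', '1234', '5678'],
  'Petitioner City': ['AUSTIN', 'DALLAS', 'BOSTON'],
  'Initial Approval': [3, 2, 1],
  'Initial Denial': [0, 1, 0],
  'Continuing Approval': [5, 0, 2],
  'Continuing Denial': [1, 0, 0],
})

count_cols = [
  'Initial Approval', 'Initial Denial',
  'Continuing Approval', 'Continuing Denial'
]

# Collapse the address-level rows into one row per fiscal year and tax ID
employer_totals = toy.groupby(['Fiscal Year', 'Tax ID'], as_index=False)[count_cols].sum()
print(employer_totals)
```

Here the two rows for tax ID `1234` collapse into a single row whose counts are the sums across both addresses.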

In [1]:
# Import packages
import pandas as pd

In [4]:
# Define parameters
data_dir = '../../data/'
input_dir = data_dir + 'raw/'
input_filename_base = 'Employer Information '
output_dir = data_dir + 'intermediate/'
output_filename = 'uscis_intermediate.csv'

years = list(range(2009, 2023+1))

dtypes = {
  'Line by line': int,
  'Fiscal Year': int,
  'Employer (Petitioner) Name': str,
  'Tax ID': str,
  'Industry (NAICS) Code': str,
  'Petitioner City': str,
  'Petitioner State': str,
  'Petitioner Zip Code': str,
  'Initial Approval': int,
  'Initial Denial':  int,
  'Continuing Approval': int,
  'Continuing Denial': int
}

column_rename = {
  'Fiscal Year   ': 'Fiscal Year', 
  'Employer (Petitioner) Name': 'Employer',
  'Industry (NAICS) Code': 'NAICS Code'
  }

In [5]:
# Load USCIS data
dfs = []
for year in years:
  df = pd.read_table(
    filepath_or_buffer = input_dir + input_filename_base + str(year) + '.csv',
    sep = '\t',
    encoding = 'utf-16',
    dtype = dtypes,
    thousands = ','
  )

  # Drop the index column from the table
  df.drop('Line by line', axis=1, inplace=True)

  # Standardize column names
  df = df.rename(columns=column_rename)

  # Check that file contains data for only the specified fiscal year
  found_years = df['Fiscal Year'].unique().tolist()
  if found_years != [year]:
    raise ValueError(f'Expected only fiscal year {year} in file, found {found_years}')
  
  dfs.append(df)

# Combine all years into one table with a fresh row index
uscis = pd.concat(dfs, ignore_index=True)

In [13]:
# Save file
uscis.to_csv(output_dir + output_filename, index=False)
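
When the intermediate file is read back later, the identifier columns are best read as strings again, since CSV does not store dtypes. The sketch below uses a small in-memory stand-in with made-up values instead of the real file; the column names follow the renamed output, and which columns to treat as strings is an assumption based on the `dtypes` mapping above.

```python
import io
import pandas as pd

# In-memory stand-in for the saved file (made-up values)
csv_text = (
  'Fiscal Year,Tax ID,NAICS Code,Petitioner Zip Code,Initial Approval\n'
  '2020,0042,54,02139,3\n'
)

string_cols = {'Tax ID': str, 'NAICS Code': str, 'Petitioner Zip Code': str}

# Without dtype, pandas infers integers and drops leading zeros
as_inferred = pd.read_csv(io.StringIO(csv_text))
# With dtype, the identifiers keep their original text
as_strings = pd.read_csv(io.StringIO(csv_text), dtype=string_cols)

print(as_inferred['Petitioner Zip Code'].iloc[0])
print(as_strings['Petitioner Zip Code'].iloc[0])
```

The same `dtype` mapping can be passed when reading `uscis_intermediate.csv` in downstream steps.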