# Import and Tidy Data

## Import

In this section, the original visas data set is imported from a csv file. As visible from the head of the data set, there are split columns containing the same information which need to be merged (country_of_citizenship vs. country_of_citzenship), as well as inconsistencies in NaN values. Less obviously, there are issues in the numeric columns that prevent them from being read in with a float dtype; some contain commas, while others do not.

In [1]:
import pandas as pd
import numpy as np

In [2]:
visas = pd.read_csv('/data/markellekelly/jboysen/us-perm-visas/us_perm_visas.csv', 
                    usecols=['add_these_pw_job_title_9089','case_no','case_number',
                             'case_status','class_of_admission',
                             'country_of_citizenship','country_of_citzenship','decision_date',
                             'employer_name','employer_num_employees',
                             'foreign_worker_info_birth_country',
                             'foreign_worker_info_education','foreign_worker_info_inst',
                             'foreign_worker_info_major','fw_info_birth_country',
                             'ji_live_in_domestic_service','job_info_education','job_info_experience',
                             'job_info_experience_num_months', 'job_info_foreign_ed',
                             'job_info_foreign_lang_req','job_info_job_title','job_info_major',
                             'job_info_training_field','job_info_work_city',
                             'job_info_work_postal_code','job_info_work_state','naics_2007_us_code',
                             'naics_2007_us_title','naics_code','naics_title','naics_us_code',
                             'naics_us_code_2007','naics_us_title','naics_us_title_2007',
                             'pw_level_9089','pw_soc_title',
                             'recr_info_coll_univ_teacher',
                             'ri_employer_web_post_from', 'ri_employer_web_post_to',
                             'recr_info_professional_occ','ri_layoff_in_past_six_months',
                             'us_economic_sector','wage_offer_from_9089',
                             'wage_offer_to_9089','wage_offer_unit_of_pay_9089',
                             'wage_offered_from_9089','wage_offered_to_9089',
                             'wage_offered_unit_of_pay_9089'],
                    parse_dates=['decision_date','ri_employer_web_post_from',
                                'ri_employer_web_post_to'],
                    dtype=str)

In [3]:
visas.head()

Unnamed: 0,add_these_pw_job_title_9089,case_no,case_number,case_status,class_of_admission,country_of_citizenship,country_of_citzenship,decision_date,employer_name,employer_num_employees,...,ri_employer_web_post_from,ri_employer_web_post_to,ri_layoff_in_past_six_months,us_economic_sector,wage_offer_from_9089,wage_offer_to_9089,wage_offer_unit_of_pay_9089,wage_offered_from_9089,wage_offered_to_9089,wage_offered_unit_of_pay_9089
0,,A-07323-97014,,Certified,J-1,,ARMENIA,2012-02-01,NETSOFT USA INC.,,...,,NaT,,IT,75629.0,,yr,,,
1,,A-07332-99439,,Denied,B-2,,POLAND,2011-12-21,PINNACLE ENVIRONEMNTAL CORP,,...,,NaT,,Other Economic Sector,37024.0,,yr,,,
2,,A-07333-99643,,Certified,H-1B,,INDIA,2011-12-01,"SCHNABEL ENGINEERING, INC.",,...,,NaT,,Aerospace,47923.0,,yr,,,
3,,A-07339-01930,,Certified,B-2,,SOUTH KOREA,2011-12-01,EBENEZER MISSION CHURCH,,...,,NaT,,Other Economic Sector,10.97,,hr,,,
4,,A-07345-03565,,Certified,L-1,,CANADA,2012-01-26,ALBANY INTERNATIONAL CORP.,,...,,NaT,,Advanced Mfg,100000.0,,yr,,,


Because this project aims to predict whether a visa request was approved or denied, I removed observations where the application was withdrawn or the result is unknown, limiting the data set to only the observations of interest, and then reset the index.

In [4]:
visas = visas[(visas['case_status'] != 'Withdrawn') & (pd.notnull(visas['case_status']))]

In [5]:
visas.reset_index(drop=True,inplace=True)

## Initial Fixes and Tests

To fix the initial problems I found, I wrote functions to remove commas from numeric values, fix NaN values that were input as a string rather than a float, and convert numeric variables to the correct dtype (ignoring errors for now). I then applied these functions to the columns that needed them, coming back from time to time to add new columns.

In [6]:
def remove_commas(x):
    '''remove commas from numbers in a column'''
    x= str(x)
    newstr = ""
    for letter in x:
        if letter != ',':
            newstr = newstr + letter
    return newstr

In [7]:
def fix_nan(x):
    '''changes the text "nan" in columns into np.nan'''
    if x=='nan':
        x=np.nan
    return x

In [8]:
commas_list = ['employer_num_employees','wage_offer_from_9089','employer_name',
              'wage_offer_to_9089','wage_offered_from_9089','wage_offered_to_9089']
for x in commas_list:
    visas[x] = visas[x].apply(remove_commas)

In [9]:
numeric_list = ['wage_offer_from_9089','wage_offer_to_9089','wage_offered_from_9089',
      'wage_offered_to_9089']
for x in numeric_list:
    visas[x] = visas[x].apply(pd.to_numeric,errors='ignore')

In [10]:
fix_nan_list=['wage_offered_from_9089','wage_offer_from_9089','wage_offer_to_9089',
            'wage_offered_to_9089','wage_offer_unit_of_pay_9089','employer_name',
            'wage_offered_unit_of_pay_9089','employer_num_employees',
            'add_these_pw_job_title_9089','ri_employer_web_post_from',
            'ri_employer_web_post_to']
for x in fix_nan_list:
    visas[x] = visas[x].apply(fix_nan)
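
The same cleanup can also be written with pandas' vectorized string methods. The following is a minimal sketch on a made-up toy column (the `toy` frame and its values are illustrative, not taken from the visa data). Because `.str.replace` skips missing values instead of passing them through `str()`, NaN never turns into the text "nan", so a separate repair step is not needed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a wage column read with dtype=str
toy = pd.DataFrame({'wage': ['75,629.00', '37024', np.nan, '10.97']})

# Strip commas as a literal character; missing values are left untouched
no_commas = toy['wage'].str.replace(',', '', regex=False)

# With the commas gone, the strings parse cleanly as floats
toy['wage'] = no_commas.astype(float)
print(toy['wage'].tolist())
```

This only works once every remaining value is a valid number, so columns with other placeholders still need the later fixes.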

## Preliminary Data Overview

Before going further, I made a list of columns in the data set, researching what exactly they contained (this was not specified on Kaggle, so I had to dig through different data set descriptions on the Department of Labor website), deleting ones that were empty or otherwise not useful, and making a note of columns that needed to be merged (denoted in <b>bold</b>).

- add_these_pw_job_title_9089: Job Title
- <b>case_no,case_number</b>: Unique case identifier
- case_status: Whether the applications were confirmed or denied (target variable)
- class_of_admission: Type of visa
- <b>country_of_citizenship,country_of_citzenship</b>: Current country of citizenship
- decision_date: Date of decision
- employer_name: Name of employer
- employer_num_employees: Number of employees of the employer
- <b>foreign_worker_info_birth_country,fw_info_birth_country</b>: Country of birth
- foreign_worker_info_education: Level of education (Bachelor's, Master's, etc.)
- foreign_worker_info_inst: School attended
- foreign_worker_info_major: Major
- ji_live_in_domestic_service: Whether the application is for a live-in domestic service worker (Y/N)
- job_info_education: Minimum level of education required for the job
- job_info_experience: Whether experience is required for the job
- job_info_experience_num_months: Months of experience required for the job
- job_info_foreign_ed: Whether foreign education was accepted for the job
- job_info_foreign_lang_req: Whether knowledge of a foreign language was a job requirement
- job_info_job_title: Job title (generalized)
- job_info_major: Major specified for the job
- job_info_training_field: The field of training for the job, if applicable
- job_info_work_city: The city where the immigrant is applying to work
- job_info_work_postal_code: The postal code where the immigrant is applying to work
- job_info_work_state: The state where the immigrant is applying to work
- <b>naics_2007_us_code, naics_code, naics_us_code, naics_us_code_2007</b>: Employer's industry code as classified by the North American Industry Classification System
- <b>naics_2007_us_title,naics_title, naics_us_title, naics_us_title_2007</b>: Title associated with employer's NAICS industry code
- pw_level_9089: Level of the prevailing wage determination (a Department of Labor classification)
- pw_soc_title: Title associated with the occupational code as classified by the Standard Occupational Classification (SOC) System.
- ri_employer_web_post_from: Date the job posting was posted on the employer's website
- ri_employer_web_post_to: Date the job posting was removed from the employer's website
- <i>recr_info_coll_univ_teacher,recr_info_professional_occ </i>: Whether or not the job is a college or university teacher; whether or not the job is a professional occupation, other than a college or university professor (will combine into one column of professional job vs. not professional job)
- ri_layoff_in_past_six_months: Whether the employer had a layoff in the field within the last six months
- us_economic_sector: Major economic sector associated with the NAICS code of the employer
- <b>wage_offer_from_9089,wage_offered_from_9089</b>: Lower range of the wage offer
- <b>wage_offer_to_9089,wage_offered_to_9089</b>: Upper range of the wage offer
- <b>wage_offer_unit_of_pay_9089,wage_offered_unit_of_pay_9089</b>: Unit of pay of the wage offer (e.g. Hr/Yr)

## Column Merges

I believed that the columns I designated to be merged (in bold above) did not overlap, since they contained the same information, just stored in different columns depending on the year. To double-check this, I wrote a function to assert that the columns do not overlap before merging them.

In [11]:
def non_repeating(col1, col2):
    '''test that two columns do not have overlapping values'''
    test_data = visas[(pd.notnull(visas[col1])) 
              & (pd.notnull(visas[col2]))]
    return (len(test_data) == 0)

In [12]:
naics_codes = ['naics_code','naics_us_code','naics_us_code_2007','naics_2007_us_code']
naics_titles = ['naics_2007_us_title','naics_title', 'naics_us_title', 'naics_us_title_2007']
for i in [naics_codes,naics_titles]:
    for j in range(4):
        for k in range(4):
            if j!=k:
                assert non_repeating(i[j],i[k])
assert non_repeating('wage_offer_from_9089','wage_offered_from_9089')
assert non_repeating('wage_offer_to_9089','wage_offered_to_9089')
assert non_repeating('wage_offer_unit_of_pay_9089','wage_offered_unit_of_pay_9089')
assert non_repeating('case_no','case_number')
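
As a side note, the pairwise checks can be sketched more compactly with `itertools.combinations`, which yields each unordered pair once and so avoids the `j != k` bookkeeping. The toy frame, its column names, and the helper `no_overlap` below are hypothetical and only illustrate the idea.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical toy frame: each row has a value in at most one column
toy = pd.DataFrame({
    'code_a': ['54', np.nan, np.nan],
    'code_b': [np.nan, '56', np.nan],
    'code_c': [np.nan, np.nan, '81'],
})

def no_overlap(df, col1, col2):
    '''True when no row has non-null values in both columns'''
    return not (df[col1].notna() & df[col2].notna()).any()

# Check every unordered pair of columns exactly once
results = {pair: no_overlap(toy, *pair) for pair in combinations(toy.columns, 2)}
print(results)
```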

Since these tests were successful, I merged the columns. I used the .fillna() method to accomplish this, filling the null values of the first column with values from the second column.

In [13]:
def merge_cols(cols):
    '''adds the values of col2 to col1 and then deletes col2'''
    col1,col2 = cols[0],cols[1]
    visas[col1] = visas[col1].fillna(visas[col2])
    del visas[col2]

In [14]:
merge_list = [('country_of_citizenship','country_of_citzenship'), 
              ('foreign_worker_info_birth_country','fw_info_birth_country'),
              ('wage_offer_from_9089','wage_offered_from_9089'),
              ('wage_offer_to_9089','wage_offered_to_9089'),
              ('wage_offer_unit_of_pay_9089','wage_offered_unit_of_pay_9089'),
              ('case_no','case_number')]
for i in merge_list:
    merge_cols(i)
for j in ['naics_2007_us_code', 'naics_us_code','naics_us_code_2007']:
    merge_cols(('naics_code',j))
for k in ['naics_2007_us_title','naics_us_title', 'naics_us_title_2007']:
    merge_cols(('naics_title',k))
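
For groups of more than two columns, another way to coalesce them is to back-fill across each row and keep the first column, which picks the first non-null value per row. This is only a sketch on invented toy data, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the split NAICS code columns
toy = pd.DataFrame({
    'naics_code': ['54', np.nan, np.nan],
    'naics_us_code': [np.nan, '56', np.nan],
    'naics_us_code_2007': [np.nan, np.nan, '81'],
})
cols = ['naics_code', 'naics_us_code', 'naics_us_code_2007']

# Back-fill along each row, then take the first column:
# this is the first non-null value among the listed columns
merged = toy[cols].bfill(axis=1).iloc[:, 0]

# Keep only the merged column
toy = toy.drop(columns=cols[1:]).assign(naics_code=merged)
print(toy)
```

The column order in `cols` sets the priority if two columns ever did overlap, which is why the non-overlap assertions are still worth running first.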

## Column Tests

After fixing preliminary issues and merging columns, I wrote a series of tests for each type of column (text-based, categorical, postal codes, general numeric, dates, and visa case numbers) and functions to fix issues that came up while testing.

### Text-Based Columns

I tested that text columns are in the desired format, ensuring NaN value consistencies, checking for extra commas at the end of values, and confirming that values were strings (or null). I wrote functions converting some columns to uppercase, and fixing a problem where some applications had street addresses, rather than city names, in the city column.

In [15]:
def test_text_cols(col_name):
    '''tests columns with textual data (not nominal data) to ensure NaNs are not strings,
    there are no commas at the end of values, and the values are either strings or NaN'''
    for x in visas[col_name]:
        assert x != 'nan'
        if pd.notnull(x):
            assert x[-1] != ','
        assert type(x) == str or (type(x) == float and pd.isnull(x))

In [16]:
def fix_upper_case(x):
    '''make text columns all upper case'''
    if pd.notnull(x):
        return x.upper()
    return x

In [17]:
def fix_city_names(x):
    '''Set addresses in city name columns to NaN'''
    if pd.notnull(x) and any(char.isdigit() for char in x):
        return np.nan
    return x

In [18]:
visas['job_info_work_city'] = visas['job_info_work_city'].apply(fix_upper_case).apply(fix_city_names)
visas['employer_name'] = visas['employer_name'].apply(fix_upper_case)

In [19]:
for column in visas: 
    visas[column] = visas[column].apply(remove_commas).apply(fix_nan)

In [20]:
text_cols = ['add_these_pw_job_title_9089','country_of_citizenship',
             'employer_name','foreign_worker_info_birth_country',
            'foreign_worker_info_inst','foreign_worker_info_major','job_info_education',
             'job_info_job_title', 'job_info_major','job_info_training_field','job_info_work_city',
             'job_info_work_state','naics_title','pw_soc_title','us_economic_sector']
for column in text_cols:
    test_text_cols(column)

### Categorical Columns

I wrote a function to check that categorical columns contain only the correct categories, which are passed in as a second parameter. Where columns failed this check, I wrote functions to make their values conform to the expected categories.

In [21]:
def test_nominal(col_name,possible_values):
    '''checks that each value is either NaN or within the specified list
    of acceptable values '''
    for x in visas[col_name]:
        assert (x in possible_values) or (pd.isnull(x))

Since whether the visa has expired is not relevant to my prediction of whether it was initially approved or not, I removed the extension "-Expired" from all visa status values. 

In [22]:
def fix_visa_status(x):
    '''remove "Expired" from visa status to group all approvals together'''
    if x=='Certified-Expired':
        x='Certified'
    return x

In [23]:
visas['case_status'] = visas['case_status'].apply(fix_visa_status)

To ensure that all values of the class_of_admission column were real visa types, I printed the unique values of the column, researched them, and wrote a function to combine or adjust them where necessary.

In [24]:
print(visas['class_of_admission'].unique())

['J-1' 'B-2' 'H-1B' 'L-1' 'EWI' 'E-2' nan 'E-1' 'H-2B' 'TPS' 'F-1' 'B-1'
 'C-1' 'Not in USA' 'TN' 'H-4' 'O-1' 'R-1' 'L-2' 'Q' 'F-2' 'H-1B1'
 'Parolee' 'G-5' 'E-3' 'H-2A' 'VWT' 'P-1' 'A1/A2' 'D-1' 'A-3' 'R-2' 'H-1C'
 'H-3' 'J-2' 'P-4' 'I' 'H-1A' 'G-1' 'VWB' 'G-4' 'P-3' 'AOS/H-1B' 'O-3'
 'Parol' 'O-2' 'H1B' 'N' 'T-1' 'TD' 'M-1' 'U-1' 'AOS' 'P-2' 'C-3' 'K-1'
 'V-2' 'M-2']


In [25]:
types_of_visas = ['A1/A2','A-3','AOS','B-1','B-2','C-1','C-3','D-1','E-1','E-2','E-3',
                  'EWI','F-1','F-2','G-1','G-4','G-5','H-1A','H-1B','H-1C','H-2A','H-2B',
                  'H-3','H-4','I','J-1','J-2','K-1','L-1','L-2','M-1','M-2','N','O-1',
                  'O-2','O-3','P-1','P-2','P-3','P-4','Q-1','R-1','R-2','T-1','TD','TN',
                  'TPS','U-1','V-2','VWB','VWT','Not in USA','Parole']

In [26]:
def fix_visa_type(x):
    '''adjusts some inconsistencies in the visa type column'''
    if x == 'Parolee' or x=='Parol':
        x='Parole'
    elif x == 'Q':
        x = 'Q-1'
    elif x == 'H-1B1' or x == 'H1B':
        x = 'H-1B'
    elif x == 'AOS/H-1B':
        x = 'AOS'
    if x not in types_of_visas:
        x = np.nan
    return x

In [27]:
visas['class_of_admission'] = visas['class_of_admission'].apply(fix_visa_type)
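
The same normalization can be sketched with a lookup dictionary plus `Series.replace` and `Series.where`. The shortened `valid` list and the toy values below are illustrative assumptions; the alias pairs mirror the ones handled in `fix_visa_type`.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative list of accepted categories
valid = ['H-1B', 'Q-1', 'AOS', 'Parole']

# Variant spellings mapped onto their canonical category
aliases = {'Parolee': 'Parole', 'Parol': 'Parole', 'Q': 'Q-1',
           'H-1B1': 'H-1B', 'H1B': 'H-1B', 'AOS/H-1B': 'AOS'}

toy = pd.Series(['H1B', 'Parolee', 'Q', 'XYZ', np.nan, 'AOS/H-1B'])

# First rewrite known variants, then blank out anything still unrecognized
normalized = toy.replace(aliases)
cleaned = normalized.where(normalized.isin(valid))
print(cleaned.tolist())
```

Keeping the aliases in a dictionary makes it easy to add a new variant without touching the control flow.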

Since some state values were input as abbreviations and others included the whole word, I used a function and dictionary to convert the spelled out states to their abbreviations.

In [28]:
states = ['AL', 'AK', 'AZ', 'AR', 'BC', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'FM', 'GA',
          'GU', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI',
          'MN', 'MP', 'MH', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC',
          'ND', 'OH', 'OK', 'OR', 'PA', 'PR', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VI',
          'VT', 'VA', 'WA', 'WV', 'WI', 'WY']

In [29]:
states_dict1 = {'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas' : 'AR',
                'British Columbia' : 'BC', 'California': 'CA', 'Colorado': 'CO',
                'Connecticut': 'CT', 'District of Columbia' : 'DC', 'Delaware': 'DE',
                'Florida': 'FL', 'Federated States of Micronesia' : 'FM', 'Georgia': 'GA',
                'Guam': 'GU','Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL',
                'Indiana': 'IN', 'Iowa': 'IA','Kansas': 'KS', 'Kentucky': 'KY', 
                'Louisiana': 'LA', 'Maine': 'ME', 'Marshall Islands': 'MH','Maryland': 'MD', 
                'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN',
                'Mississippi': 'MS', 'Missouri': 'MO', 'Montana': 'MT', 'Nebraska': 'NE',
                'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ',
                'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC', 
                'North Dakota': 'ND', 'Northern Mariana Islands': 'MP', 'Ohio': 'OH',
                'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA','Puerto Rico': 'PR',
                'Rhode Island': 'RI', 'South Carolina': 'SC', 'South Dakota': 'SD',
                'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
                'Virginia': 'VA', 'Virgin Islands': 'VI', 'Washington': 'WA',
                'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY'}
states_dict = {a.upper():b for a,b in states_dict1.items()}

In [30]:
def fix_states(x):
    '''Replace full word states with their abbreviation when necessary'''
    if x not in states:
        if x in states_dict.keys():
            x = states_dict[x]
    return x

In [31]:
visas['job_info_work_state'] = visas['job_info_work_state'].apply(fix_states)

Similarly to state values, some pay unit values were saved as abbreviations (yr, hr) while others were spelled out (Year, Hour), so I wrote a function to convert the abbreviations to the full values and applied it to the pay unit column.

In [32]:
y = visas['wage_offer_unit_of_pay_9089'].unique()
print(y)

['yr' 'hr' 'mth' 'wk' 'bi' nan 'Year' 'Hour' 'Week' 'Month' 'Bi-Weekly']


In [33]:
correct_pay_units = ['Year','Hour','Week','Month','Bi-Weekly']

In [34]:
pay_units_dict = {'yr':'Year','hr':'Hour','mth':'Month','wk':'Week','bi':'Bi-Weekly'}

In [35]:
def fix_pay_units(x):
    '''Replace pay unit abbreviations with the full word when necessary'''
    if x not in correct_pay_units:
        if x in pay_units_dict.keys():
            x = pay_units_dict[x]
    return x

In [36]:
visas['wage_offer_unit_of_pay_9089'] = visas['wage_offer_unit_of_pay_9089'].apply(fix_pay_units)

One column included NAICS codes, which represent the specific industry of the job, and can have several different lengths, depending on how specific they are. Since many applications only included the first two digits (representing the general industry), and in order to more effectively use the column as a categorical variable, I wrote a function to convert longer codes to their first two digits, and check that these shorter codes are legitimate NAICS codes.

In [37]:
naics_codes = [11,21,22,23,31,42,44,48,51,52,53,54,55,56,61,62,71,72,81,92]

In [38]:
def fix_naics_codes(x):
    '''Simplify all NAICS codes to their first 2 digits, which represent the general industry,
    and ensure they are within the range of possible values'''
    if pd.isnull(x):
        return x
    if len(x) > 2:
        x = x[:2]
    x = int(x)
    if x == 32 or x == 33:
        x=31
    elif x == 45:
        x=44
    elif x == 49:
        x=48
    if x not in naics_codes:
        print(x)
    assert x in naics_codes
    return x

In [39]:
visas['naics_code'] = visas['naics_code'].apply(fix_naics_codes)
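
To make the truncation concrete, here is a small worked example on invented codes. It slices each code to two characters and folds the sectors that span a range (31-33, 44-45, 48-49) onto the first code of the range, as `fix_naics_codes` does. The toy values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Sectors that span a range are folded onto the first code of that range
range_aliases = {'32': '31', '33': '31', '45': '44', '49': '48'}

toy = pd.Series(['541511', '33', '452990', np.nan, '81'])

# Keep the first two characters; missing values stay missing
two_digit = toy.str[:2]

# Fold range codes, then convert to float so NaN can be represented
sector = two_digit.replace(range_aliases).astype(float)
print(sector.tolist())
```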

Finally, I used my test_nominal function to ensure all categorical columns contained only the values they should.

In [40]:
test_nominal('case_status',['Certified','Denied'])
test_nominal('class_of_admission',types_of_visas)
test_nominal('job_info_work_state',states)
education = ['foreign_worker_info_education', 'job_info_education']
for h in education:
    test_nominal(h,['High School',"Associate's", "Bachelor's","Master's",'Doctorate',
                    "None","Other"])
yes_no = ['ji_live_in_domestic_service',
         'job_info_experience','job_info_foreign_ed','job_info_foreign_lang_req',
         'recr_info_coll_univ_teacher',
         'recr_info_professional_occ','ri_layoff_in_past_six_months']
for y in yes_no:
    test_nominal(y,['Y','N'])
test_nominal('pw_level_9089',['Level I','Level II','Level III','Level IV'])
test_nominal('wage_offer_unit_of_pay_9089', correct_pay_units)
test_nominal('naics_code',naics_codes)

### Postal Codes

Since some applications included the additional four digits at the end of the postal code (ZIP+4 codes), and these don't provide any significant useful information for my data analysis and modeling, I wrote a function to remove these extensions, putting all postal codes in the same format. Additionally, postal codes from outside the U.S. were switched to the string "Not in US". I then tested that all postal codes in the column were either "Not in US", null, or five-character strings.

In [41]:
def test_postal_codes(col_name):
    '''Tests whether postal codes are of the type str and of length 5'''
    for x in visas[col_name]:
        if x != "Not in US":
            assert (pd.isnull(x) or len(x) == 5)
            assert (pd.isnull(x) or type(x) == str)

In [42]:
def fix_postal_codes(x):
    '''If postal codes include the four-digit extension, remove it; if they omit the 0 at
    the beginning, add it; if they are out of US zip codes (length != 5), mark them as out of US'''
    if pd.isnull(x):
        return x
    if (len(x) == 9 or len(x) == 10) and x[-5] == '-':
        x = x.split('-')[0]
    if len(x) == 4:
        x = '0' + x
    if len(x) != 5:
        x = "Not in US"
    return x

In [43]:
visas['job_info_work_postal_code'] = visas['job_info_work_postal_code'].apply(fix_postal_codes)
test_postal_codes('job_info_work_postal_code')
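
A regular expression is another way to express the ZIP+4 rule. The sketch below uses a hypothetical helper, `normalize_zip`, on invented values. It is slightly stricter than `fix_postal_codes` in that it requires the extension to be four digits.

```python
import re

import numpy as np
import pandas as pd

# Four or five digits, a hyphen, then a four-digit extension
ZIP_PLUS_4 = re.compile(r'^(\d{4,5})-\d{4}$')

def normalize_zip(value):
    '''Drop a ZIP+4 extension, restore a lost leading zero,
    and flag anything that is still not five characters long'''
    if pd.isnull(value):
        return value
    match = ZIP_PLUS_4.match(value)
    if match:
        value = match.group(1)
    if len(value) == 4:
        value = '0' + value
    return value if len(value) == 5 else 'Not in US'

toy = pd.Series(['10001-1234', '2139', 'V6B 1A1', np.nan, '94105'])
cleaned = toy.apply(normalize_zip)
print(cleaned.tolist())
```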

### General Numeric

I tested that values of numeric columns are of the correct type and greater than or equal to 0 (since negative values do not make sense for any of the columns). For some reason, wage values were sometimes input as "#############" if missing, so I wrote a function to convert these values (along with wages of zero or less) to NaN, and was then able to convert the numeric columns to the correct type without ignoring errors (as was the case in the preliminary look at the data set).

In [44]:
def test_numerical(col_name):
    '''Test that numerical values are floats that are NaN or >= 0'''
    for x in visas[col_name]:
        assert (pd.isnull(x) or (type(x)==float and x >= 0))

In [45]:
def fix_wage_offers(x):
    '''fix inconsistencies in wage offer values'''
    if x == '#############' or float(x) <=0:
        x= np.nan
    return x

In [46]:
visas['wage_offer_from_9089'] = visas['wage_offer_from_9089'].apply(fix_wage_offers)
visas['wage_offer_to_9089'] = visas['wage_offer_to_9089'].apply(fix_wage_offers)

In [47]:
convert_floats = ['employer_num_employees','job_info_experience_num_months',
                  'wage_offer_from_9089', 'wage_offer_to_9089']
for col in convert_floats:
    visas[col] = visas[col].astype(float)

In [48]:
numeric_cols = ['employer_num_employees','job_info_experience_num_months','naics_code',
               'wage_offer_from_9089', 'wage_offer_to_9089']
for col in numeric_cols:
    test_numerical(col)
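
An alternative sketch for the placeholder problem uses `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable into NaN. The toy values are invented. Note the trade-off: coercion silently hides every bad value, not just the "#############" placeholder, so it is worth inspecting the column before relying on it.

```python
import numpy as np
import pandas as pd

toy = pd.Series(['75629.0', '#############', '0', np.nan, '10.97'])

# Unparseable strings such as the placeholder become NaN
wages = pd.to_numeric(toy, errors='coerce')

# Wages of zero or less are treated as missing as well
wages = wages.where(wages > 0)
print(wages.tolist())
```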

### Dates

I wrote a function to fix an error where the year was input as "0214" instead of 2014, and to ensure no date values corresponded to a date after the data set was published (since these would be errors). I then converted date values to the datetime type and tested that they are reasonable dates, including a test ensuring that the date the job posting was posted on the employer's web site is before the date it was removed.

In [49]:
def test_dates(col_name):
    '''Test that dates are within the appropriate time frame'''
    for x in visas[col_name]:
        if pd.notnull(x):
            assert x.year > 1995 and x.year < 2017
            assert x.month < 13 and x.month > 0
            assert x.day < 32 and x.day > 0

In [50]:
def fix_dates(x):
    '''Fix inconsistencies in date values'''
    if x=='0214-04-22':
        x='2014-04-22'
    if pd.notnull(x) and (int(str(x)[:4]) > 2017):
        x=np.nan
    return x

In [51]:
visas['ri_employer_web_post_from'] = visas['ri_employer_web_post_from'].apply(fix_dates)

In [52]:
date_cols = ['decision_date', 'ri_employer_web_post_from','ri_employer_web_post_to']
for col in date_cols:
    visas[col] = pd.to_datetime(visas[col])

In [53]:
for col in date_cols:
    test_dates(col)
for x,y in zip(visas['ri_employer_web_post_from'],visas['ri_employer_web_post_to']):
    if (pd.notnull(x) and pd.notnull(y)):
        assert x <= y
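
The ordering check can also be done without a Python loop by comparing the two columns directly on the rows where both dates are present. This is a sketch on made-up dates with hypothetical column names; the known typo is repaired as a string before parsing.

```python
import pandas as pd

toy = pd.DataFrame({
    'post_from': ['0214-04-22', '2013-01-05', None],
    'post_to': ['2014-05-22', '2013-02-04', '2013-03-01'],
})

# Repair the known typo while the values are still strings
toy['post_from'] = toy['post_from'].replace({'0214-04-22': '2014-04-22'})

for col in ['post_from', 'post_to']:
    toy[col] = pd.to_datetime(toy[col], format='%Y-%m-%d')

# Compare only rows where both dates exist
both = toy['post_from'].notna() & toy['post_to'].notna()
ordered = (toy.loc[both, 'post_from'] <= toy.loc[both, 'post_to']).all()
print(ordered)
```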

### Case Numbers

I tested that all values in the case number column are real visa application case number values. 

In [54]:
for x in visas['case_no']:
    a,b,c = x.split('-')
    assert a == 'A' or a == 'C'
    assert len(b) == 5
    test1 = int(b)
    assert len(c) == 5
    test2 = int(c)

## Transformation of Columns

After testing the columns, I decided to create some new columns that would be more useful for further analysis:<br> First, since some applications included one value for wage offer while others had a range of wages, I created a new column containing a single value for wage: the wage specified for applications with one value, and the mean of the two values for applications with a range. <br>Second, since some wage values were already in terms of annual salary, while others were hourly rates (with a few weekly, bi-weekly, or monthly values), I created a column that contained an estimated annual wage for all observations, assuming that applicants work 40 hours a week.<br> Third, I combined two columns regarding whether the job was professional, one about whether it was a professional job other than professor, and the other about whether the job was a professor position. <br> Fourth, I combined the columns containing the date the job was posted on the employer website and the date it was removed, creating a column of information on the total length of time the job posting was on the employer's web site.<br> Finally, I combined all text columns containing information about the applicant's job or industry into one column, which is more practical for topic modeling.

In summary, the new columns to create are:
- center of min wage offer and max wage offer (overall_wage_offer)
- overall_wage_offer converted to annual wages (annual_wage)
- combine university professor with other professional careers (professional_occupation)
- length of time the job posting was on the employer's web site (employer_desperation)
- all of the text columns containing job or industry info, combined (all_text)

In [55]:
visas['employer_desperation'] = visas['ri_employer_web_post_to'] - visas['ri_employer_web_post_from']
visas['employer_desperation'] = [x.days for x in visas['employer_desperation']]
del visas['ri_employer_web_post_from'], visas['ri_employer_web_post_to']
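
For reference, the day count can also be pulled out with the `.dt.days` accessor rather than a list comprehension. A small sketch on invented dates, with hypothetical column names:

```python
import pandas as pd

toy = pd.DataFrame({
    'post_from': pd.to_datetime(['2013-01-05', None]),
    'post_to': pd.to_datetime(['2013-02-04', '2013-03-01']),
})

# Subtracting datetimes gives timedeltas; .dt.days extracts whole days,
# and rows with a missing date come out as NaN
days = (toy['post_to'] - toy['post_from']).dt.days
print(days.tolist())
```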

In [56]:
def determine_wage(x,y):
    '''Returns the midpoint of the minimum and maximum wage if both specified;
    otherwise, returns the single wage specified'''
    if pd.notnull(x) and pd.notnull(y):
        return (x+y)/2
    elif pd.notnull(y):
        return y
    return x

In [57]:
overall_wage_list = []
for x,y in zip(visas['wage_offer_from_9089'],visas['wage_offer_to_9089']):
    overall_wage_list.append(determine_wage(x,y))
visas['overall_wage_offer'] = overall_wage_list
del visas['wage_offer_from_9089'],visas['wage_offer_to_9089']
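
Because `DataFrame.mean` skips missing values by default, a row-wise mean reproduces the same rule: the midpoint when both bounds exist, the single value when only one does, and NaN when neither is given. A sketch on invented wages with hypothetical column names:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'wage_from': [50000.0, 40000.0, np.nan, np.nan],
    'wage_to': [70000.0, np.nan, 30000.0, np.nan],
})

# Row-wise mean ignoring NaN: midpoint, single value, or NaN
toy['overall'] = toy[['wage_from', 'wage_to']].mean(axis=1)
print(toy['overall'].tolist())
```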

In [58]:
def return_annual_wage(x,y):
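    '''If unit of pay is specified, use that to convert to annual, with two overrides:
    values under 30 are treated as hourly, and "Year" values under 2500 are treated as
    bi-weekly. If unspecified, make a guess based on <=400: hourly, >400 & <=2500: weekly,
    >2500 & <=5000: biweekly, >5000 & <=15000: monthly, >15000: annual'''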
    if pd.isnull(x):
        return x
    if pd.notnull(y):
        if x < 30 or y=='Hour':
            return x*40*52
        if y== 'Year':
            if x < 2500:
                return x*26
            return x
        elif y=='Month':
            return x*12
        elif y=='Bi-Weekly':
            return x*26
        else:
            assert y=='Week'
            return x*52
    if x > 15000:
        return x
    elif x <= 15000 and x > 5000:
        return x*12
    elif x <=5000 and x > 2500:
        return x*26
    elif x <=2500 and x > 400:
        return x*52
    else:
        assert x <=400
        return x*40*52

In [59]:
annual_wage_list = []
for x,y in zip(visas['overall_wage_offer'],visas['wage_offer_unit_of_pay_9089']):
    annual_wage_list.append(return_annual_wage(x,y))
visas['annual_wage'] = annual_wage_list
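
The core arithmetic of the conversion, leaving out the sanity-check overrides and the guessing for unspecified units, comes down to a multiplier per pay unit. The sketch below uses a hypothetical helper, `annualize`, and the same 40-hours-per-week, 52-weeks-per-year assumption.

```python
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52

# Number of pay periods in a year for each unit
periods_per_year = {
    'Hour': HOURS_PER_WEEK * WEEKS_PER_YEAR,
    'Week': WEEKS_PER_YEAR,
    'Bi-Weekly': 26,
    'Month': 12,
    'Year': 1,
}

def annualize(amount, unit):
    '''Scale a wage to an annual figure using its stated pay unit'''
    return amount * periods_per_year[unit]

print(annualize(10.97, 'Hour'))   # hourly wage from the sample rows
print(annualize(100.5, 'Week'))   # weekly wage from the low-wage check
```

An hourly wage of 10.97 comes out at roughly 22817.6 per year, and a weekly wage of 100.5 at 5226.0, which are the values that appear in this notebook's output.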

In [60]:
def is_it_professional(x,y):
    '''Combines two columns regarding professional occupation into one'''
    if y == 'Y':
        return 'Y'
    return x

In [61]:
professional_list = []
for x,y in zip(visas['recr_info_coll_univ_teacher'],visas['recr_info_professional_occ']):
    professional_list.append(is_it_professional(x,y))
visas['professional_occupation'] = professional_list
del visas['recr_info_coll_univ_teacher'],visas['recr_info_professional_occ']
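
The same combination rule can be sketched in vectorized form with `Series.where`: keep the second column where it is 'Y', and fall back to the first column everywhere else. The toy values and variable names are invented.

```python
import numpy as np
import pandas as pd

teacher = pd.Series(['Y', 'N', np.nan, 'N'])        # college/university teacher
professional = pd.Series(['N', 'Y', 'N', np.nan])   # other professional occupation

# Where professional is 'Y' keep it; otherwise take the teacher value
combined = professional.where(professional.eq('Y'), teacher)
print(combined.tolist())
```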

In [62]:
visas['all_text'] = (visas['add_these_pw_job_title_9089'].fillna('') 
    + ' ' + visas['foreign_worker_info_major'].fillna('')
    + ' ' + visas['job_info_job_title'].fillna('') + ' ' + visas['job_info_major'].fillna('')
    + ' ' + visas['job_info_training_field'].fillna('') 
    + ' ' + visas['naics_title'].fillna('') + ' ' + visas['pw_soc_title'].fillna('')
    + ' ' + visas['us_economic_sector'].fillna(''))
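
With many columns, the chained concatenation can be sketched more compactly by filling the missing values and joining each row. The toy frame and its column names are illustrative. As in the cell above, empty fields still contribute a separating space.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'job_title': ['Software Engineer', np.nan],
    'major': [np.nan, 'Civil Engineering'],
    'sector': ['IT', 'Aerospace'],
})
text_cols = ['job_title', 'major', 'sector']

# Replace NaN with empty strings, then join each row's values with spaces
toy['all_text'] = toy[text_cols].fillna('').apply(' '.join, axis=1)
print(toy['all_text'].tolist())
```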

In [63]:
del (visas['add_these_pw_job_title_9089'],visas['foreign_worker_info_major'],
    visas['job_info_job_title'],visas['job_info_major'],visas['job_info_training_field'],
    visas['naics_title'],visas['pw_soc_title'],visas['us_economic_sector'])

## New Column Tests

Finally, I tested the new columns I created, and converted the employer desperation column, which was a pandas Timedelta, to a float containing the number of days, which is much easier to work with.

In [64]:
def test_time_delta(col_name):
    '''test that time deltas converted successfully to number of days'''
    for x in visas[col_name]:
        if pd.notnull(x):
            assert x<10000
            assert x >= 0
            assert type(x) == float

In [65]:
test_time_delta('employer_desperation')
visas['employer_desperation'] = visas['employer_desperation'].apply(pd.to_numeric)

In [66]:
test_numerical('overall_wage_offer')
test_numerical('annual_wage')
test_nominal('professional_occupation',['Y','N'])

In [67]:
for x,y,z,w in zip(visas['annual_wage'],visas['overall_wage_offer'],visas['wage_offer_unit_of_pay_9089'],visas['all_text']):
    if pd.notnull(x):
        if x < 9000:
            print(x,y,z,w)
        assert x > 9000 or x==5226.0

5226.0 100.5 Week   HOUSEKEEPER/BABY SITTER     


In [68]:
for x in visas['all_text']:
    assert type(x) == str or pd.isnull(x)

## Final Column Overview

In [70]:
visas.head()

Unnamed: 0,case_no,case_status,class_of_admission,country_of_citizenship,decision_date,employer_name,employer_num_employees,foreign_worker_info_birth_country,foreign_worker_info_education,foreign_worker_info_inst,...,job_info_work_state,naics_code,pw_level_9089,ri_layoff_in_past_six_months,wage_offer_unit_of_pay_9089,employer_desperation,overall_wage_offer,annual_wage,professional_occupation,all_text
0,A-07323-97014,Certified,J-1,ARMENIA,2012-02-01,NETSOFT USA INC.,,,,,...,NY,54.0,Level II,,Year,,75629.0,75629.0,,Computer Systems Design Services Computer...
1,A-07332-99439,Denied,B-2,POLAND,2011-12-21,PINNACLE ENVIRONEMNTAL CORP,,,,,...,NY,56.0,Level I,,Year,,37024.0,37024.0,,Hazardous Waste Treatment and Disposal Ha...
2,A-07333-99643,Certified,H-1B,INDIA,2011-12-01,SCHNABEL ENGINEERING INC.,,,,,...,MD,54.0,Level I,,Year,,47923.0,47923.0,,Engineering Services Civil Engineers Aero...
3,A-07339-01930,Certified,B-2,SOUTH KOREA,2011-12-01,EBENEZER MISSION CHURCH,,,,,...,NY,81.0,Level II,,Hour,,10.97,22817.6,,Religious Organizations File Clerks Other...
4,A-07345-03565,Certified,L-1,CANADA,2012-01-26,ALBANY INTERNATIONAL CORP.,,,,,...,NY,31.0,Level IV,,Year,,100000.0,100000.0,,Paper Industry Machinery Manufacturing Sa...


All Columns: (<i> New Column </i>)
- case_no: Unique case identifier
- case_status: Target Variable (whether they were confirmed or denied)
- class_of_admission: Type of Visa
- country_of_citizenship: Current country of citizenship
- decision_date: Date of decision
- employer_name: Name of Employer
- employer_num_employees: Number of Employees at the Employer
- foreign_worker_info_birth_country: Country of birth
- foreign_worker_info_education: Level of education (Bachelor's, Master's, etc.)
- foreign_worker_info_inst: School attended
- ji_live_in_domestic_service: whether the application is for a live-in domestic service worker (Y/N)
- job_info_education: Minimum level of education required for the job
- job_info_experience: Whether experience is required for the job
- job_info_experience_num_months: Months of experience required for the job
- job_info_foreign_ed: Whether foreign education was accepted for the job
- job_info_foreign_lang_req: Whether knowledge of a foreign language was a job requirement
- job_info_work_city: The city where the immigrant is working
- job_info_work_postal_code: The postal code where the immigrant is working
- job_info_work_state: The state where the immigrant is working
- naics_code: Employer's industry code as classified by the North American Industry Classification System
- pw_level_9089: Level of the prevailing wage determination
- ri_layoff_in_past_six_months: Whether the employer had a layoff in the field within the last six months
- wage_offer_unit_of_pay_9089: Unit of pay of the wage offer (e.g. Hr/Yr)
- <i>employer_desperation</i>: How long the job posting was on the employer's website
- <i>overall_wage_offer</i>: Wage offered to the employee (if a range given, the average of this range)
- <i>annual_wage</i>: The overall wage offer in terms of annual salary
- <i>professional_occupation</i>: Whether or not the job was classified as professional (educational or otherwise)
- <i>all_text</i>: Combination of all job-descriptive text columns, including job titles, majors, training fields, industry titles, and economic sectors

## Export

I exported the finalized visa dataset to csv for use in the rest of the notebooks.

In [69]:
visas.to_csv('/data/markellekelly/visas.csv',encoding = "utf8")

Next: [Exploratory Data Analysis](03-Exploratory.ipynb)