# Lecture 1, Part 2: A Brief Foray into Modeling

##### To get a sense of what is coming, let's build a quick model. 


## Nearest neighbors: a cheap intro to data science modeling.
##### Let's run a `knn` model to find out what some similar jobs might be. 
![NN](https://media.giphy.com/media/XMPFBfeB2tgK4/giphy-downsized-large.gif)

But first, let's pick up where we left off by reloading and straightening out some of the data.



In [119]:
import pandas as pd
import numpy as np
import ipywidgets as widgets

from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize
from sklearn.neighbors import KNeighborsRegressor


First, we reload the data and take a quick look to confirm all is as planned.

In [120]:
raw_df = pd.read_csv('https://grantmlong.com/data/H-1B_Disclosure_Data_FY2019_slim.csv')

raw_df.sample(5)

Unnamed: 0,WORKSITE_CITY,WORKSITE_COUNTY,WORKSITE_STATE,WORKSITE_POSTAL_CODE,EMPLOYER_NAME,EMPLOYER_DBA,WAGE_RATE_OF_PAY_FROM,WAGE_UNIT_OF_PAY,NAICS_CODE,SOC_CODE,JOB_TITLE,CASE_STATUS,CASE_SUBMITTED
59387,NASHVILLE,DAVIDSON,TN,37214,CAPGEMINI AMERICA INC,,79100.0,Year,5416.0,15-1131,PROGRAMMER/ DEVELOPER 3,CERTIFIED,2019-02-11
250228,WEST PALM BEACH,PALM BEACH,FL,33401,"SOFTVISION, LLC",,81016.0,Year,541990.0,15-1132,ASSOCIATE,CERTIFIED,2019-03-18
1990,MENLO PARK,SAN MATEO,CA,94025,"FACEBOOK, INC.",,142500.0,Year,518112.0,15-1132,SOFTWARE ENGINEER,CERTIFIED,2019-01-08
222492,FRAMINGHAM,MIDDLESEX,MA,1702,TATA CONSULTANCY SERVICES LIMITED,,71614.0,Year,541511.0,15-1199,TECHNICAL LEAD,CERTIFIED,2019-03-14
297184,NEW BRUNSWICK,MIDDLESEX,NJ,8901,PARAMOUNT GLOBAL SOLUTIONS INC,,96366.0,Year,541511.0,15-1132,SOFTWARE DEVELOPER,CERTIFIED,2019-03-23


## Suppose we want to find similar roles to one we've been offered. How might we do this?

#### We might look at jobs ... 
* in the same industry
* in the same geographic area
* in a similar occupation

#### First, let's narrow down a few things:
* Let's only look at visas that have been certified. 
* Let's keep only wages quoted as annual salaries.
* Let's filter out null values.
* We can track occupation with the `SOC_CODE` column, which reflects a government-defined [Standard Occupational Classification code](https://www.bls.gov/soc/).
* We can track industry with the `NAICS_CODE` column, which reflects the [North American Industry Classification System](https://www.census.gov/eos/www/naics/).
* However, both of these latter columns will need cleaning. 



In [121]:
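# keep certified cases with annual wages, a worksite zip code,
# and a well-formed SOC code such as '15-1132'
raw_df = raw_df.loc[
    (raw_df['WAGE_UNIT_OF_PAY']=='Year') & 
    (raw_df['CASE_STATUS']=='CERTIFIED') & 
    (raw_df['WORKSITE_POSTAL_CODE'].notnull()) & 
    (raw_df['SOC_CODE'].str.len()==7)
].reset_index()

# drop the hyphen from the SOC code to get a numeric occupation code
raw_df['OCC_CODE'] = (
    raw_df['SOC_CODE'].str.slice(0,2) + 
    raw_df['SOC_CODE'].str.slice(3,7)
).astype(int)

# pad shorter NAICS codes with trailing zeros so every industry code has six digits
raw_df['NAICS_POWER'] = raw_df['NAICS_CODE'].astype(int).astype(str).str.len()
raw_df['IND_CODE'] = (raw_df['NAICS_CODE'] * (10 ** (6-raw_df['NAICS_POWER'])))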
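
To make the two cleaning steps concrete, here is a small per-value sketch of what the vectorized code above does. The helper names `soc_to_occ_code` and `naics_to_ind_code` are hypothetical and are not part of the notebook; they only mirror the string slicing and the padding to six digits.

```python
def soc_to_occ_code(soc_code):
    """Drop the hyphen from a 7-character SOC code such as '15-1132'."""
    # Characters 0-1 are the major group; characters 3-6 are the detail digits.
    return int(soc_code[0:2] + soc_code[3:7])


def naics_to_ind_code(naics_code):
    """Right-pad a NAICS code with zeros so every code has six digits."""
    n_digits = len(str(int(naics_code)))
    return naics_code * 10 ** (6 - n_digits)


print(soc_to_occ_code('15-1132'))   # 151132
print(naics_to_ind_code(5416.0))    # 541600.0
print(naics_to_ind_code(541511.0))  # 541511.0
```

The padding puts a four-digit industry group like 5416 on the same numeric scale as the six-digit codes that begin with 5416, so related industries end up close together when we measure distance.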


#### Let's start with geography. Let's convert worksite zip codes to latitude and longitude.

In [122]:
# let's read a file with a list of latitude and longitude for each zip code in the country
zip_df = pd.read_csv('https://grantmlong.com/data/ZipsLatLon.txt')

# we'll need to clean this file up a bit to play with it
zip_df['ZIP'] = zip_df.ZIP.astype(int).astype(str).str.zfill(5)
zip_df = zip_df.set_index('ZIP')

# having done this, we can add this to our data
df = raw_df.merge(
    zip_df,
    left_on='WORKSITE_POSTAL_CODE',
    right_index=True,
    how='left'
)

# finally, let's make sure none of our values are null
print(df['LNG'].isnull().sum(), df['LNG'].notnull().sum())

10469 257384
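
If you want to see which rows failed to match rather than just how many, one option is the `indicator` argument of `merge`. The sketch below uses made-up toy tables (`jobs` and `zips` are hypothetical names, and the coordinates are only illustrative) and assumes both sides store zip codes as five-character strings.

```python
import pandas as pd

# Toy stand-ins for the notebook's tables; values are made up for illustration.
jobs = pd.DataFrame({
    'JOB_TITLE': ['ANALYST', 'ENGINEER', 'DEVELOPER'],
    'WORKSITE_POSTAL_CODE': ['10011', '01702', '99999'],
})
zips = pd.DataFrame({
    'ZIP': [10011, 1702],
    'LAT': [40.74, 42.30],
    'LNG': [-74.00, -71.44],
})

# Restore leading zeros that are lost when zip codes are read as integers.
zips['ZIP'] = zips['ZIP'].astype(int).astype(str).str.zfill(5)
zips = zips.set_index('ZIP')

merged = jobs.merge(
    zips,
    left_on='WORKSITE_POSTAL_CODE',
    right_index=True,
    how='left',
    indicator=True,   # adds a `_merge` column: 'both' or 'left_only'
)

# Rows flagged 'left_only' found no matching zip code.
n_unmatched = (merged['_merge'] == 'left_only').sum()
print(n_unmatched, len(merged) - n_unmatched)   # 1 2
```

Filtering on `merged['_merge'] == 'left_only'` then shows exactly which postal codes have no coordinates.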


Let's just drop the null values so we have a full data set.

In [123]:
df = df.dropna(subset=['LAT', 'LNG'])

df.shape

(257384, 19)

#### Now we have a data set we can have a little fun with!

In [124]:
df.sample(4)

Unnamed: 0,index,WORKSITE_CITY,WORKSITE_COUNTY,WORKSITE_STATE,WORKSITE_POSTAL_CODE,EMPLOYER_NAME,EMPLOYER_DBA,WAGE_RATE_OF_PAY_FROM,WAGE_UNIT_OF_PAY,NAICS_CODE,SOC_CODE,JOB_TITLE,CASE_STATUS,CASE_SUBMITTED,OCC_CODE,NAICS_POWER,IND_CODE,LAT,LNG
27407,30523,NEW YORK,NEW YORK,NY,10011,"LYFT, INC.",,145500.0,Year,541511.0,15-1132,ANDROID ENGINEER,CERTIFIED,2019-01-26,151132,6,541511.0,40.742039,-74.00062
35678,39623,JERSEY CITY,HUDSON,NJ,7310,LARSEN & TOUBRO INFOTECH LIMITED,,96366.0,Year,541511.0,15-1132,"SOFTWARE DEVELOPERS, APPLICATIONS",CERTIFIED,2019-01-31,151132,6,541511.0,40.730133,-74.036816
20344,22715,SANTA CLARA,SANTA CLARA,CA,95054,INTELLISWIFT SOFTWARE INC,,109242.0,Year,541511.0,15-1132,SOFTWARE DEVELOPER,CERTIFIED,2019-01-22,151132,6,541511.0,37.393491,-121.96467
100324,111752,PALO ALTO,SANTA CLARA,CA,94301,"AMAZON.COM SERVICES, INC.",,125000.0,Year,454111.0,15-2031,DATA SCIENTIST I,CERTIFIED,2019-02-27,152031,6,454111.0,37.444123,-122.149911


# Nearest Neighbors in a Nutshell

The nearest neighbors algorithm will take all our points, and find the closest `k` points in space to our starting point.  
![NN](https://media.giphy.com/media/qofOXXchB3EnS/giphy.gif)
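
As a rough illustration of the idea, here is a brute-force sketch in plain `numpy`. The function name `k_nearest` and the toy points are made up for this example, and it assumes Euclidean distance; it is not how `sklearn` implements the search, which can use faster tree-based structures.

```python
import numpy as np


def k_nearest(X, query, k):
    """Return indices and distances of the k rows of X closest to query."""
    # Euclidean distance from the query to every row of X.
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances, closest first.
    order = np.argsort(dists)[:k]
    return order, dists[order]


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
idx, dist = k_nearest(points, np.array([0.9, 0.1]), k=2)
print(idx)    # [1 0]
print(dist)
```

Computing every distance like this costs one pass over the whole data set per query, which is fine for a toy example but is the reason libraries offer smarter search structures.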


The `sklearn` package makes building this model a breeze. We'll gloss over what's going on here, but there's plenty of [excellent documentation here](https://scikit-learn.org/stable/modules/neighbors.html). 

Given that all of our values are numeric, this should be easy! However, we still have a little bit of prep to do. Let's create two different sets of variables: our `vals`, which we'll run our algorithm on, and our `cols`, which we'll look at to describe our data. 

In [125]:
# columns we'll use to evaluate our output
cols = ['JOB_TITLE', 'EMPLOYER_NAME', 'WORKSITE_CITY', 
        'WORKSITE_STATE', 'WORKSITE_POSTAL_CODE', 'WAGE_RATE_OF_PAY_FROM']

# values we'll use to build our algorithm
vals = ['LAT', 'LNG', 'IND_CODE', 'OCC_CODE']


#### Sampling, Standardization, and Normalization
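We need to put all of our features on comparable scales so that no single one dominates the distance calculation. Left as-is, the industry and occupation codes (in the hundreds of thousands) would swamp latitude and longitude (at most a few hundred in magnitude). 

Let's create a `sample_df` with all of the rows of data we want, and a matrix `X` with all the data we need for the model.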

In [126]:
down_sample = True

if down_sample:
    sample_df = df[vals + cols].sample(20000, random_state=20190905).reset_index()
    X_raw = sample_df[vals].values
else:
    sample_df = df[vals + cols].reset_index()
    X_raw = sample_df[vals].values

X, norms = normalize(X_raw, axis=0, return_norm=True)
pd.DataFrame(X).describe()

Unnamed: 0,0,1,2,3
count,20000.0,20000.0,20000.0,20000.0
mean,0.00702,-0.006924,0.006994,0.006911
std,0.000847,0.001434,0.001043,0.001497
min,0.002479,-0.011649,0.001504,0.004849
25%,0.006287,-0.008679,0.007087,0.006601
50%,0.007157,-0.006476,0.007325,0.006601
75%,0.007516,-0.005686,0.007325,0.006604
max,0.011262,0.010744,0.012541,0.022584
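
With `axis=0` and the default L2 norm, `normalize` divides each column by its Euclidean length. The sketch below reproduces that arithmetic in plain `numpy` on a made-up three-row matrix (`X_toy`, `col_norms`, and `new_obs` are hypothetical names), assuming this is the only transformation applied.

```python
import numpy as np

# Toy feature matrix: rows are observations, columns are features
# on very different scales.
X_toy = np.array([
    [40.7, -74.0, 541511.0],
    [37.4, -122.0, 454111.0],
    [33.0, -96.7, 541380.0],
])

# Euclidean (L2) norm of each column.
col_norms = np.sqrt((X_toy ** 2).sum(axis=0))
X_scaled = X_toy / col_norms

# Every column now has unit length.
print(np.sqrt((X_scaled ** 2).sum(axis=0)))

# A new observation must be divided by the same norms before it can be
# compared with the scaled data.
new_obs = np.array([40.0, -75.0, 541511.0])
print(new_obs / col_norms)
```

This also explains the magnitudes in the summary table: with 20,000 rows of broadly similar size in a column, each scaled value lands near 1/sqrt(20000), or about 0.007. Note that this scaling neither centers the columns nor equalizes their spread.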


#### Now we're ready to build our model

We can train the model in just two lines of code - it's not the hard part!

In [127]:
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
indices                                           


array([[17514,     0, 13827, 10309,  2655],
       [18781, 16648,  5584, 18438,  2335],
       [    2,  3803, 10663,  9959,  6023],
       ...,
       [19997, 15803, 16819, 13126,   561],
       [17164,  8935,  5650, 11568, 11399],
       [14304, 16252, 19999, 17213, 13657]])
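
One detail worth knowing when reading this output: because we query with the same `X` we fit on, each row is typically returned as its own neighbor at distance zero, and exact duplicates can tie with it or even crowd it out. The sketch below uses a made-up set of distinct points (`X_toy` and `nn` are hypothetical names) to show the difference between querying with the training data and calling `kneighbors()` with no argument.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data with distinct points, so there are no ties at distance zero.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [6.0, 6.0]])
nn = NearestNeighbors(n_neighbors=2).fit(X_toy)

# Querying with the training data: each point is returned as its own neighbor.
dist_with_self, idx_with_self = nn.kneighbors(X_toy)
print(idx_with_self[:, 0])   # [0 1 2 3]

# Calling kneighbors() with no argument excludes each point from its own results.
dist_no_self, idx_no_self = nn.kneighbors()
print(idx_no_self[:, 0])     # [1 0 0 2]
```

If you only care about *other* similar jobs, the no-argument form saves you from filtering out the query row yourself.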

#### Hurray! That was pretty easy, right? 

Now let's display the nearest neighbors to a randomly selected observation to see what our model is doing.

In [128]:
i = sample_df.sample().index.values[0]
sample_df.loc[sample_df.index.isin([i] + list(indices[i,:])), cols + vals].transpose()

Unnamed: 0,2485,8978,16241,17523,18804
JOB_TITLE,ENGINEER,ELECTRICAL ENGINEER,DESIGN ENGINEER,RAN TEST ENGINEER,SR. SOFTWARE ENGINEER
EMPLOYER_NAME,"INTERTEK TESTING SERVICES NA, INC.",SYSPLUS TECHNOLOGY SOLUTIONS,"VERISILICON, INC.",ADI WORLDLINK LLC,"SAMSUNG RESEARCH AMERICA, INC."
WORKSITE_CITY,PLANO,PLANO,PLANO,PLANO,RICHARDSON
WORKSITE_STATE,TX,TX,TX,TX,TX
WORKSITE_POSTAL_CODE,75074,75074,75074,75023,75082
WAGE_RATE_OF_PAY_FROM,62850,89461,85000,106000,105518
LAT,33.0316,33.0316,33.0316,33.0568,32.9916
LNG,-96.6732,-96.6732,-96.6732,-96.7308,-96.6631
IND_CODE,541380,541511,541330,541511,541710
OCC_CODE,172111,172071,172072,172071,172071


# Building a Prediction Model

It's great to find similar jobs given one that we already know about, but what about predicting the salary for any job we come up with? 

This will require a prediction model, or a *supervised* machine learning model. We'll talk more about this later in the semester, but for now, let's explore just how simple it is to build one of these models. 

### First, let's set up a few widgets to make evaluating our work a little easier

I've done some work behind the scenes here to make this easier, but essentially we are just making it easier to look up location, occupation code, and industry code for any job we might want to predict. I've created some files with the top codes and locations and their associated labels. 

In [129]:
occ_names = pd.read_csv('https://grantmlong.com/data/OCC_NAMES2.csv', index_col=0)
w = widgets.Dropdown(
    options=[name[0] for name in occ_names.values],
    value=occ_names.values[0][0],
    description='Occupation:',
)

ind_names = pd.read_csv('https://grantmlong.com/data/IND_NAMES.csv', index_col=0)
y = widgets.Dropdown(
    options=[name[0] for name in ind_names.values],
    value=ind_names.values[0][0],
    description='Industry:',
)

loc_names = pd.read_csv('https://grantmlong.com/data/LOC_NAMES.csv', index_col=0)
z = widgets.Dropdown(
    options=list(loc_names.index.values),
    value=list(loc_names.index.values)[0],
    description='Location:'
)



### Let's Model!

We'll use a `KNeighborsRegressor` to find the five closest observations to what we'd like to predict, fitting the model on the `sample_df` we created above. 

Again, while the process includes a lot of math and computation, Python makes this super easy for us, and it takes just two lines of code to build our model. 

In [130]:
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X, sample_df['WAGE_RATE_OF_PAY_FROM'].values) 


KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=5, p=2,
          weights='uniform')
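
With `weights='uniform'`, as shown in the output above, the prediction is simply the average target value of the `k` nearest neighbors. Here is a minimal `numpy` sketch of that idea, assuming Euclidean distance; `knn_predict` and the toy data are made up for illustration and are not the `sklearn` implementation.

```python
import numpy as np


def knn_predict(X_train, y_train, query, k=5):
    """Predict by averaging the targets of the k nearest training rows."""
    # Euclidean distance from the query to every training row.
    dists = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k closest rows.
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()


# Made-up one-feature data: wages rise with the feature value.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_train = np.array([50000.0, 60000.0, 70000.0, 150000.0, 160000.0])

print(knn_predict(X_train, y_train, np.array([2.1]), k=3))   # 60000.0
```

The three rows closest to 2.1 have wages of 60,000, 70,000, and 50,000, so the prediction is their mean.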

### Tying it all together

Finally, we'll need to display our widgets and use them to create a prediction. 

In [131]:
def get_prediction(model=model):
    obs_raw = [
        loc_names.loc[z.value].values[0],
        loc_names.loc[z.value].values[1],
        ind_names.loc[ind_names.Label==y.value].index.values[0],
        (occ_names.loc[occ_names.SOC_NAME==w.value].index.values[0]),
    ]

    # scale the new observation with the same column norms used for X
    obs = (np.array(obs_raw) / norms).reshape(1, -1)
    
    return model.predict(obs)[0]


widgets.VBox([w, y, z])

VBox(children=(Dropdown(description='Occupation:', options=('SOFTWARE DEVELOPERS, APPLICATIONS', 'COMPUTER OCC…

In [132]:
get_prediction()

149400.0