# Project 4: Web Scraping Job Postings

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

---

## Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

# Scraping the Data from Indeed

In [1]:
# Load required scripts
from bs4 import BeautifulSoup as Soup
import urllib, requests, re, pandas as pd

In [63]:
# List various salary bandings for data scientist jobs in California
# Salary banding according to the suggested estimates by Indeed
base_url_ca_60 = 'https://www.indeed.com/jobs?q=Data+Scientist+$60,000&l=California&radius=50&jt=fulltime&sort='
base_url_ca_80 = 'https://www.indeed.com/jobs?q=Data+Scientist+$80,000&l=California&radius=50&jt=fulltime&sort='
base_url_ca_95 = 'https://www.indeed.com/jobs?q=Data+Scientist+$95,000&l=California&radius=50&jt=fulltime&sort='
base_url_ca_110 = 'https://www.indeed.com/jobs?q=Data+Scientist+$110,000&l=California&radius=50&jt=fulltime&sort='

# Sort data by date and by start page number (to append later)
sort_by = 'date'          
start_from = '&start='    

# Remove the column limit for pandas
pd.set_option('max_colwidth',500)   

# Pre-establish the database
df = pd.DataFrame()

In [64]:
def scrape(df, base_url, salary_estimate) :
    """ Takes in a dataframe and then scrapes Indeed.com for jobs using the base URL
    provided by the user. For user's own reference, include the salary estimate parameter
    used for the job search. """

    # Scrape page 1 to 100 (last accessible page is 100)
    for page in range(1,101):
        
        # Multiple by 10 as the numbers follow number of jobs listed per page
        page = (page-1) * 10  
        
        # Create full URL
        url = "%s%s%s%d" % (base_url, sort_by, start_from, page)
        
        # Scrape
        target = Soup(urllib.urlopen(url), "lxml") 

        # Get a job from each row
        targetElements = target.findAll('div', attrs={'class':" row result"})
    
        # Try to get each specific job information
        for elem in targetElements: 
            try:
                comp_name = elem.find('span', attrs={'class':'company'}).getText().strip()
            except: 
                comp_name = None
                
            try:
                job_title = elem.find('a', attrs={'class':'turnstileLink'}).attrs['title']
            except:
                job_title = None
            
            try:
                listed_job_salary = elem.find('span', attrs={'class': "no-wrap"}).getText()
            except:
                listed_job_salary = None
            
            try:
                job_addr = elem.find('span', attrs={'class':'location'}).getText()
            except:
                job_addr = None
            
            try:
                job_summary = elem.find('span', attrs={'class': 'summary'}).getText()
            except:
                job_summary = None


            # Add job info to the data frame
            df = df.append({'comp_name': comp_name, 'job_title': job_title, 
                            'salary_estimated': salary_estimate,'job_summary' : job_summary,
                            'job_location': job_addr, 'listed_job_salary' : listed_job_salary
                           }, ignore_index=True)
    return df

In [None]:
# Scrape from the estimated $60,000 band
df = scrape(df, base_url_ca_60, 60000)

In [None]:
# Scrape from the estimated $80,000 band
df = scrape(df, base_url_ca_80, 80000)

In [None]:
# Scrape from the estimated $95,000 band
df = scrape(df, base_url_ca_95, 95000)

In [None]:
# Scrape from the estimated $110,000 band
df = scrape(df, base_url_ca_110, 110000)

In [None]:
# Remove the new line spacing from the dataframe
df = df.replace('\n','', regex=True)

In [None]:
# Save the result to CSV
df.to_csv('../indeed-results.csv', encoding='utf-8')

## Read in CSV so that we don't have to scrape again

In [4]:
df_read = pd.read_csv('../indeed-results.csv')

In [5]:
# Here, we take a look at the number of listed salaries we have
df_read["listed_job_salary"].value_counts()

$150,000 a year                                 17
$140,000 - $165,000 a year                      13
$120,000 - $150,000 a year                      12
$150,000 - $180,000 a year                      11
$180,000 a year                                 10
$100,000 - $180,000 a year                       9
$130,000 - $150,000 a year                       9
$150,000 - $200,000 a year                       8
$160,000 - $170,000 a year                       7
$140,000 - $160,000 a year                       7
$140,000 - $200,000 a year                       6
$100,000 - $160,000 a year                       6
$140,000 - $180,000 a year                       6
$125,000 - $155,000 a year                       5
$180,000 - $250,000 a year                       5
$180,000 - $200,000 a year                       5
$120,000 - $140,000 a year                       5
$130,000 - $195,000 a year                       5
$180,000 - $210,000 a year                       5
$160,000 - $180,000 a year     

<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

Do some light cleaning here. Since we don't have a lot of jobs that have the listed salary (as expected), we will remove that column. We also suspect duplicates in jobs, so taking the job_summary column, we drop any duplicates found as it is unlikely that two different jobs will have identical, word-for-word descriptions.

In [6]:
df_read.drop(["Unnamed: 0", "listed_job_salary"], axis=1, inplace=True)

In [7]:
df_read.head()

Unnamed: 0,comp_name,job_location,job_summary,job_title,salary_estimated
0,Walmart eCommerce,"San Bruno, CA 94066","Data scientists, front and back-end engineers,...","Director, Retail Learning & Development",60000.0
1,Facebook,"Menlo Park, CA","Work with engineering, data science, and desig...","Product Manager, Advanced Network Planning",60000.0
2,Kaiser Permanente,"Oakland, CA","He or she will manage a team of analysts, data...",Senior Manager Decision Support,60000.0
3,PaxVax,"San Diego, CA",Good data anaylsis skills. Is seeking a Scient...,"Scientist, Process Development - Upstream",60000.0
4,The Aerospace Corporation,"El Segundo, CA 90245",Our state-of-the-art laboratory facilities are...,Associate Software Engineer,60000.0


<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

Here we drop the duplicate job listings. We can determine this by checking for complete
duplicates on the job descriptions because it is unlikely that two different jobs
will have the exact same description.

In [8]:
df_read.drop_duplicates(subset='job_summary', inplace=True)

In [9]:
df_read.shape

(1527, 5)

In [10]:
df_read.salary_estimated.value_counts()

60000.0     725
110000.0    294
80000.0     286
95000.0     222
Name: salary_estimated, dtype: int64

# Question 1

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to able to extrapolate or predict the expected salaries for these listings.

## City Count Vectorizer

<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

First, we will utilize random forest classifiers to determine if a variable or combination of variables have a significant impact on whether a job's salary will be high or low. 

Here, we will start with the city variable.

In [52]:
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import numpy as np

In [53]:
# 60k and 80k will be considered low salary
# 95k and 110k will be considered high salary

df_read['high_salary'] = [1 if a > 80000 else 0 for a in df_read.salary_estimated]

In [54]:
# Reasonably balanced distribution of high and low salary
df_read.high_salary.value_counts()

0    1011
1     516
Name: high_salary, dtype: int64

In [55]:
# Remove extraneous information. Some of the cities included what part of the city and 
# the postal codes. Because this information is inconsistent, this will help to standardize
# the location data.
loc_clean = [a[0:a.find(', CA')] for a in df_read.job_location]
df_read["job_location"] = loc_clean

# In some cases, only "California" is listed as a city of work. Due to the lack of specificity,
# we are dropping those rows.
cali_index = df_read[df_read["job_location"] == "California"].index
df_read.drop(df_read.index[cali_index], inplace=True)
df_read.reset_index(drop=True)

Unnamed: 0,comp_name,job_location,job_summary,job_title,salary_estimated,high_salary
0,Walmart eCommerce,San Brun,"Data scientists, front and back-end engineers,...","Director, Retail Learning & Development",60000.0,0
1,Facebook,Menlo Par,"Work with engineering, data science, and desig...","Product Manager, Advanced Network Planning",60000.0,0
2,Kaiser Permanente,Oaklan,"He or she will manage a team of analysts, data...",Senior Manager Decision Support,60000.0,0
3,PaxVax,San Dieg,Good data anaylsis skills. Is seeking a Scient...,"Scientist, Process Development - Upstream",60000.0,0
4,The Aerospace Corporation,El Segund,Our state-of-the-art laboratory facilities are...,Associate Software Engineer,60000.0,0
5,Walmart eCommerce,San Brun,We constantly improve our products based on us...,Senior Android Engineer,60000.0,0
6,BeiGene,San Francisc,"Collects and researches data; Review, query, a...","Director, Clinical Science",60000.0,0
7,Ascent Services Group,South San Francisc,Interpret data and communicate results in depa...,Research Associate I,60000.0,0
8,Remind,San Francisc,Analyze data to identify trends and opportunit...,Data Scientist,60000.0,0
9,First American,Agoura Hill,Knowledge and experience with diverse statisti...,Data Scientist,60000.0,0


In [56]:
# Get dummy variables and encode 
city_dummies = pd.get_dummies(df_read.job_location)

X_city = city_dummies
y_city = df_read.high_salary

In [57]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_city, y_city, test_size=0.3)

In [59]:
# Random forest classifier parameters and fit
rfc = RandomForestClassifier(n_estimators=300)
rfc.fit(X_train, y_train)

# Predict test results and look at accuracy
rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print "Accuracy Score:", acc.round(3)

# Cross-validate on original data
s = cross_val_score(rfc, X_city, y_city, cv=4, n_jobs=-1)
print "Cross Validation Score:\t{:0.3} ± {:0.3}".format(s.mean().round(3), s.std().round(3))

Accuracy Score: 0.647
Cross Validation Score:	0.646 ± 0.01


In [60]:
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X_city.columns).reset_index()
feature_importances.columns = ['feature', 'importance']

feature_medians = []
for i in X_city.columns:
    feature_medians.append(np.median(df_read[df_read.job_location == i].salary_estimated))

feature_importances['median_salary'] = feature_medians
feature_importances['high_or_low'] = [1 if i > 80000 else 0 for i in feature_importances.median_salary]

feature_importances.sort_values('importance', ascending=False).head(20)

Unnamed: 0,feature,importance,median_salary,high_or_low
98,San Francisc,0.079411,80000.0,0
62,Mountain Vie,0.051302,80000.0,0
25,Cupertin,0.048274,110000.0,1
29,El Segund,0.028959,60000.0,0
130,Woodlan,0.027742,102500.0,1
100,San Jos,0.026564,80000.0,0
70,Orang,0.024382,95000.0,1
34,Fremon,0.022744,80000.0,0
74,Palo Alt,0.022337,80000.0,0
104,San Ramo,0.021695,70000.0,0


<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

Cupertino seems to have the biggest impact on whether a salary is high, while San Francisco and Mountain View seem to influence the lower salary bandings.

In [61]:
feature_importances.shape

(132, 4)

## Summary Count Vectorizer

<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">


<p>
Due to the number of words present for job summaries, we need something a bit more heavy duty than dummy coding. For the following sections, we 
utilize <b>CountVectorizer</b> to look at word frequencies. We do this for 
job titles as well afterwards.
</div>

In [62]:
from sklearn.feature_extraction.text import CountVectorizer

In [63]:
salaries_w_desc = df_read.copy(deep=False)

X_summ = salaries_w_desc['job_summary']
y_summ = salaries_w_desc['high_salary']

In [64]:
# Exclude standard stop words from sk-learn
cv = CountVectorizer(stop_words="english")
cv.fit(X_summ)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [65]:
# Densify the table otherwise we would be looking at many, many features
X_summ_trans = pd.DataFrame(cv.transform(X_summ).todense(), columns=cv.get_feature_names())

In [66]:
X_train, X_test, y_train, y_test = train_test_split(np.asmatrix(X_summ_trans), y_summ, test_size=0.3,
                                                    stratify=y_summ)

In [67]:
word_counts = X_summ_trans.sum(axis=0)
word_counts.sort_values(ascending = False).head(25)

data           2540
scientists      539
scientist       431
experience      328
team            258
analysis        227
engineers       205
learning        190
science         178
work            175
machine         173
product         147
research        131
analytics       130
read            122
looking         116
working         108
large           104
big             104
design          104
senior          103
development      93
sets             86
management       86
algorithms       82
dtype: int64

In [68]:
rfc = RandomForestClassifier(300)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print "Accuracy Score:", acc.round(3)

s = cross_val_score(rfc, X_summ_trans.as_matrix(), y_summ.as_matrix(), cv=4, n_jobs=-1)
print "Cross Validation Score: {:0.3} ± {:0.3}".format(s.mean().round(3), s.std().round(3))

Accuracy Score: 0.636
Cross Validation Score: 0.659 ± 0.029


<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

Overall, not bad. Remember we are looking at a baseline of 50% (high or low salary).

In [69]:
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X_summ_trans.columns).reset_index()
feature_importances.columns = ['feature', 'importance']

feature_medians = []
feature_means = []
for i in X_summ_trans.columns:
    feature_medians.append(np.median(salaries_w_desc[salaries_w_desc['job_summary'].str.lower().str.contains(i)].salary_estimated))
    feature_means.append(np.mean(salaries_w_desc[salaries_w_desc['job_summary'].str.lower().str.contains(i)].salary_estimated))


feature_importances['median_salary'] = feature_medians
feature_importances['mean_salary'] = feature_means
feature_importances['over_or_under'] = [1 if i > 80000 else 0 for i in feature_importances['median_salary']]

feature_importances.sort_values('importance', ascending=False).head(20)

  out=out, **kwargs)
  ret = ret.dtype.type(ret / rcount)


Unnamed: 0,feature,importance,median_salary,mean_salary,over_or_under
2436,read,0.019338,80000.0,85664.0625,0
779,data,0.012187,80000.0,78854.166667,0
2630,science,0.007611,80000.0,81445.783133,0
2959,team,0.006767,80000.0,80300.0,0
2639,scientists,0.006333,80000.0,79591.633466,0
1122,experience,0.005689,80000.0,80729.483283,0
2635,scientist,0.005051,80000.0,78763.570567,0
412,building,0.005001,95000.0,91818.181818,1
2913,supporting,0.00498,95000.0,94090.909091,1
417,business,0.004896,80000.0,83716.216216,0


In [70]:
feature_importances.shape

(3231, 5)

## Title Count Vectorizer

In [71]:
salaries_w_desc = df_read.copy(deep=False)

X_title = salaries_w_desc['job_title']
y_title = salaries_w_desc['high_salary']

In [72]:
cv = CountVectorizer(stop_words="english")
cv.fit(X_title)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [73]:
X_title_trans = pd.DataFrame(cv.transform(X_title).todense(), columns=cv.get_feature_names())

In [74]:
X_train, X_test, y_train, y_test = train_test_split(X_title_trans, y_title, test_size=0.3)

In [75]:
rfc = RandomForestClassifier(300)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print "Accuracy Score:", acc.round(3)

s = cross_val_score(rfc, X_title_trans.as_matrix(), y_title.as_matrix(), cv=4, n_jobs=-1)
print "Cross Validation Score: {:0.3} ± {:0.3}".format(s.mean().round(3), s.std().round(3))

Accuracy Score: 0.606
Cross Validation Score: 0.626 ± 0.024


In [76]:
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X_title_trans.columns).reset_index()
feature_importances.columns = ['feature', 'importance']

feature_medians = []
feature_means = []
for i in X_title_trans.columns:
    feature_medians.append(np.median(salaries_w_desc[salaries_w_desc["job_title"].str.lower().str.contains(i)].salary_estimated))
    feature_means.append(np.mean(salaries_w_desc[salaries_w_desc["job_title"].str.lower().str.contains(i)].salary_estimated))


feature_importances['median_salary'] = feature_medians
feature_importances['mean_salary'] = feature_means
feature_importances['over_or_under'] = [1 if i > 80000 else 0 for i in feature_importances.median_salary]

feature_importances.sort_values('importance', ascending=False).head(20)

Unnamed: 0,feature,importance,median_salary,mean_salary,over_or_under
227,data,0.036019,80000.0,82282.157676,0
765,senior,0.029814,80000.0,80948.275862,0
301,engineer,0.024205,80000.0,82097.560976,0
751,scientist,0.023599,80000.0,78033.613445,0
489,learning,0.020889,95000.0,84803.370787,1
813,sr,0.020122,80000.0,81170.212766,0
503,machine,0.019854,95000.0,85343.75,1
725,research,0.018788,60000.0,74444.444444,0
791,software,0.0177,80000.0,82544.91018,0
669,principal,0.013455,80000.0,81515.151515,0


In [77]:
feature_importances.shape

(930, 5)

## Combining Title CV, Summary CV, and Location CV

In [78]:
salaries_w_desc = df_read.copy(deep=False).reset_index(drop=True)
city_dummies = pd.get_dummies(df_read.job_location)

In [79]:
X = pd.concat([city_dummies.reset_index(drop=True), 
               X_title_trans.reset_index(drop=True), 
               X_summ_trans.reset_index(drop=True)], axis=1)
y = salaries_w_desc.high_salary

In [80]:
print X.shape
print y.shape

(1527, 4293)
(1527,)


In [81]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [82]:
rfc = RandomForestClassifier(300)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print "Accuracy Score:", acc.round(6)

s = cross_val_score(rfc, X, y, cv=4, n_jobs=-1)
print "Cross Validation Score: {:0.3} ± {:0.3}".format(s.mean().round(3), s.std().round(3))

Accuracy Score: 0.675381
Cross Validation Score: 0.675 ± 0.02


<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

So many calculations for something that's 17% better than chance? Maybe we should try looking at something else...

In [83]:
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X.columns).reset_index()
feature_importances.columns = ['feature', 'importance']
df_read.reset_index(drop=True, inplace=True)
salaries_w_desc.reset_index(drop=True, inplace=True)
feature_importances

Unnamed: 0,feature,importance
0,Agoura Hill,1.206920e-05
1,Alameda Harbo,3.207228e-04
2,Alhambr,0.000000e+00
3,Aliso Viej,2.396266e-04
4,Bakersfiel,3.150329e-06
5,Belmon,3.838838e-07
6,Berkele,5.064478e-05
7,Beverly Hill,0.000000e+00
8,Bre,6.172078e-06
9,Brisban,1.672762e-05


In [84]:
feature_medians = []
for i in city_dummies.columns:
    feature_medians.append(np.median(salaries_w_desc[salaries_w_desc.job_location == i].salary_estimated))
    
for i in X_title_trans.columns:
    feature_medians.append(np.median(salaries_w_desc[salaries_w_desc["job_title"].str.lower().str.contains(i)].salary_estimated))

for i in X_summ_trans.columns:
    feature_medians.append(np.median(salaries_w_desc[salaries_w_desc['job_summary'].str.lower().str.contains(i)].salary_estimated))

feature_importances['median_salary'] = feature_medians
feature_importances['over_or_under'] = [1 if i > 80000 else 0 for i in feature_importances.median_salary]

feature_importances.sort_values('importance', ascending=False).head(20)

Unnamed: 0,feature,importance,median_salary,over_or_under
3498,read,0.024689,80000.0,0
1841,data,0.01061,80000.0,0
359,data,0.00933,80000.0,0
98,San Francisc,0.007722,80000.0,0
635,machine,0.006542,95000.0,1
433,engineer,0.005656,80000.0,0
621,learning,0.005554,95000.0,1
3692,science,0.004637,80000.0,0
2740,large,0.004566,80000.0,0
883,scientist,0.004448,80000.0,0


## Logistic Regression Model

<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

We repeat the same basic cleaning as above for a fresh copy of the dataset.

In [208]:
df_read.head()

Unnamed: 0,comp_name,job_location,job_summary,job_title,salary_estimated,high_salary
0,Walmart eCommerce,San Bruno,"Data scientists, front and back-end engineers, product managers, and web and UX/UI teams collaborate alongside e-commerce experts to envision, prototype, and...","Director, Retail Learning & Development",60000.0,0
1,Facebook,Menlo Park,"Work with engineering, data science, and design teams to build innovative data products and supporting infrastructure....","Product Manager, Advanced Network Planning",60000.0,0
2,Kaiser Permanente,Oakland,"He or she will manage a team of analysts, data scientists and statisticians, have understanding and appreciation of analytical processes and business process...",Senior Manager Decision Support,60000.0,0
3,PaxVax,San Diego,Good data anaylsis skills. Is seeking a Scientist to join the Process Development and Clinical Production teams focused on vaccine development from preclinical...,"Scientist, Process Development - Upstream",60000.0,0
4,The Aerospace Corporation,El Segundo,"Our state-of-the-art laboratory facilities are staffed by some of the leading scientists in the world. Bring fresh ideas from all areas, including distributed...",Associate Software Engineer,60000.0,0


In [12]:
df_rmodel = df_read[["comp_name", "job_location", "salary_estimated", "high_salary"]]
df_rmodel.reset_index(drop=True, inplace=True)

In [13]:
df_rmodel.shape

(1527, 4)

In [15]:
cali_index = df_rmodel[df_rmodel["job_location"] == "California"].index
df_rmodel.drop(df_rmodel.index[cali_index], inplace=True)
df_rmodel.reset_index(drop=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,comp_name,job_location,salary_estimated,high_salary
0,Walmart eCommerce,"San Bruno, CA 94066",60000.0,0
1,Facebook,"Menlo Park, CA",60000.0,0
2,Kaiser Permanente,"Oakland, CA",60000.0,0
3,PaxVax,"San Diego, CA",60000.0,0
4,The Aerospace Corporation,"El Segundo, CA 90245",60000.0,0
5,Walmart eCommerce,"San Bruno, CA 94066",60000.0,0
6,BeiGene,"San Francisco, CA",60000.0,0
7,Ascent Services Group,"South San Francisco, CA",60000.0,0
8,Remind,"San Francisco, CA",60000.0,0
9,First American,"Agoura Hills, CA 91301",60000.0,0


In [16]:
loc_clean = [a[0:a.find(', CA')] for a in df_rmodel.job_location]

In [17]:
df_rmodel["job_location"] = loc_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [18]:
df_rmodel.head()

Unnamed: 0,comp_name,job_location,salary_estimated,high_salary
0,Walmart eCommerce,San Bruno,60000.0,0
1,Facebook,Menlo Park,60000.0,0
2,Kaiser Permanente,Oakland,60000.0,0
3,PaxVax,San Diego,60000.0,0
4,The Aerospace Corporation,El Segundo,60000.0,0


In [19]:
df_rmodel["job_location"].value_counts()

San Francisco             337
San Diego                 112
San Jose                   87
Mountain View              69
Santa Clara                66
Palo Alto                  64
South San Francisco        60
Sunnyvale                  58
Los Angeles                56
El Segundo                 48
Redwood City               44
Irvine                     32
Menlo Park                 27
Emeryville                 25
San Mateo                  21
Livermore                  19
San Bruno                  19
Fremont                    16
Pleasanton                 16
Los Gatos                  15
San Ramon                  14
Foster City                13
Thousand Oaks              13
Santa Monica               13
San Francisco Bay Area     10
Pasadena                   10
Stanford                    9
Oakland                     9
Cupertino                   9
Hayward                     8
                         ... 
San Bernardino              1
Hollywood                   1
Redding   

In [20]:
X = df_rmodel[["comp_name", "job_location"]]
y = df_rmodel["high_salary"]

In [21]:
X = pd.get_dummies(X, drop_first=True)

In [22]:
# Scale the data
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
Xs = ss.fit_transform(X)

In [23]:
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.3)

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [17]:
# Establish parameters for gridsearch on logistic regression
gs_params = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,0,100)
}

In [18]:
lr = LogisticRegression()

lr_gridsearch = GridSearchCV(lr, gs_params, cv=5, verbose=1, n_jobs=-1)

In [27]:
lr_gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:    7.3s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   14.8s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:   29.2s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.12332e-05, ...,   8.90215e-01,   1.00000e+00]), 'solver': ['liblinear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [28]:
print lr_gridsearch.best_score_

0.671388101983


In [29]:
lr_gridsearch.best_params_

{'C': 0.030538555088334154, 'penalty': 'l1', 'solver': 'liblinear'}

In [30]:
best_gs = lr_gridsearch.best_estimator_
best_gs.score(X_test, y_test)

0.66740088105726869

<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">
Remember that the baseline is 50%, and our random forest gave us an accuracy of 67.53%.
How does the model perform when given salary bands instead of high or low?

In [26]:
y = df_rmodel["salary_estimated"]

In [27]:
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.3)

In [28]:
lr_gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    4.9s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   12.6s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   36.2s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  1.4min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.12332e-05, ...,   8.90215e-01,   1.00000e+00]), 'solver': ['liblinear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [29]:
print lr_gridsearch.best_score_

0.470254957507


In [30]:
lr_gridsearch.best_params_

{'C': 1.0000000000000001e-05, 'penalty': 'l1', 'solver': 'liblinear'}

In [31]:
best_gs = lr_gridsearch.best_estimator_
best_gs.score(X_test, y_test)

0.48458149779735682

<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

Given that there are 4 salary bands, the base is 25%. This is not bad.

# Question 2

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.

## Second round of Scraping

<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

For this round of scraping, we used an exact search for "Data Scientist" with quotation marks, as opposed to the inital scraping which pulled any mention of "data" or "scientist" regardless of order.

Since we are interested in what factors affect job title, this precision is more necessary.

Otherwise, the analysis is basically identical to that of the previous section.

In [61]:
# List various salary bandings for exact match "data scientist" jobs in California
url_ca_exact_60 = 'https://www.indeed.com/jobs?q=%22Data+Scientist%22+$60,000&l=California&radius=50&jt=fulltime&sort='
url_ca_exact_80 = 'https://www.indeed.com/jobs?q=%22Data+Scientist%22+$80,000&l=California&radius=50&jt=fulltime&sort='
url_ca_exact_95 = 'https://www.indeed.com/jobs?q=%22Data+Scientist%22+$95,000&l=California&radius=50&jt=fulltime&sort='
url_ca_exact_110 = 'https://www.indeed.com/jobs?q=%22Data+Scientist%22+$110,000&l=California&radius=50&jt=fulltime&sort='

# Pre-establish the database
df2 = pd.DataFrame()

In [67]:
df2

Unnamed: 0,comp_name,job_location,job_summary,job_title,listed_job_salary,salary_estimated
0,CLARA analytics,"Santa Clara, CA","\nThe Data Scientist role involves working on all stages of the data science pipeline, from acquiring and assessing data, selecting appropriate models and...",Data Scientist,,60000.0
1,Shutterfly,"Redwood City, CA",\nThe Data Scientist will be responsible for designing and directing experiments and observational studies to optimize our customer acquisition and engagement...,Data Scientist,,60000.0
2,Lam Research,"Fremont, CA 94538 (Irvington area)","\nDefine data structures, evaluate data quality, perform appropriate data analyses using software such as Python and MATLAB....",Data Scientist 4,,60000.0
3,"GRAIL, Inc.","Menlo Park, CA",\nParticipate in data quality review activities and efforts to resolve data quality issues. Develop and manage interactive data visualization and analytic tools...,Senior Staff Clinical Data Scientist,,60000.0
4,Chatham Group,"San Francisco, CA 94104 (Financial District area)","\nDo you like data? Experience in data analysis, gaming, mobile applications, consulting, or business/financial analysis....",Senior Data Scientist,"\n$70,000 a year",60000.0
5,Petco,"San Diego, CA",\nThis role will supervise the assistant merchant managers and work very closely with the customer data scientist and manager business analytics....,Services Manager,,60000.0
6,Workbridge Associates,"San Jose, CA 95113 (Downtown area)","\nA well-established retail company located in Silicon Valley is looking for a contract Senior Data Scientist to take on a role providing modeling, analysis, and...",Senior Data Scientist (Contract),,60000.0
7,Jobspring Partners,"Palo Alto, CA",\nA Series C Healthcare Startup Located in Palo Alto is on the seeking for a bold Mid-Level Data Scientist to join to the team....,Mid-level Data Scientist,"\n$130,000 - $165,000 a year",60000.0
8,Remind,"San Francisco, CA","\nAnalyze data to identify trends and opportunities, surface actionable insights, and help teams set goals, forecasts and prioritization of initiatives....",Data Scientist,,60000.0
9,FullDeck,"Los Angeles, CA","\n2+ years of experience in data mining and/or data science. A flourishing digital media agency with high profile entertainment clients, has an immediate need for...",Data Scientist,,60000.0


In [65]:
df2 = scrape(df2, url_ca_exact_60, 60000)

In [68]:
df2 = scrape(df2, url_ca_exact_80, 80000)

In [69]:
df2 = scrape(df2, url_ca_exact_95, 95000)

In [70]:
df2 = scrape(df2, url_ca_exact_110, 110000)

In [71]:
df2 = df2.replace('\n','', regex=True)

In [72]:
# Save the result to CSV
df2.to_csv('../indeed-results-exact-ds.csv', encoding='utf-8')

## Read in again

<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

Same initial cleaning, duplicate dropping, and location fixing.

In [21]:
df_exact = pd.read_csv('../indeed-results-exact-ds.csv')

In [22]:
df_exact.drop(["Unnamed: 0", "listed_job_salary"], axis=1, inplace=True)

In [23]:
df_exact.drop_duplicates(subset='job_summary', inplace=True)

In [24]:
df_exact.head()

Unnamed: 0,comp_name,job_location,job_summary,job_title,salary_estimated
0,CLARA analytics,"Santa Clara, CA",The Data Scientist role involves working on al...,Data Scientist,60000.0
1,Shutterfly,"Redwood City, CA",The Data Scientist will be responsible for des...,Data Scientist,60000.0
2,Lam Research,"Fremont, CA 94538 (Irvington area)","Define data structures, evaluate data quality,...",Data Scientist 4,60000.0
3,"GRAIL, Inc.","Menlo Park, CA",Participate in data quality review activities ...,Senior Staff Clinical Data Scientist,60000.0
4,Chatham Group,"San Francisco, CA 94104 (Financial District area)","Do you like data? Experience in data analysis,...",Senior Data Scientist,60000.0


In [77]:
df_exact.shape

(851, 5)

In [25]:
# Changed all the job titles to lower case so that our search is not case-sensitive
df_exact["job_title"] = df_exact.job_title.str.lower()

In [26]:
df_exact.head()

Unnamed: 0,comp_name,job_location,job_summary,job_title,salary_estimated
0,CLARA analytics,"Santa Clara, CA",The Data Scientist role involves working on al...,data scientist,60000.0
1,Shutterfly,"Redwood City, CA",The Data Scientist will be responsible for des...,data scientist,60000.0
2,Lam Research,"Fremont, CA 94538 (Irvington area)","Define data structures, evaluate data quality,...",data scientist 4,60000.0
3,"GRAIL, Inc.","Menlo Park, CA",Participate in data quality review activities ...,senior staff clinical data scientist,60000.0
4,Chatham Group,"San Francisco, CA 94104 (Financial District area)","Do you like data? Experience in data analysis,...",senior data scientist,60000.0


In [29]:
df_exact.reset_index(drop=True, inplace=True)

<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

We only considered data with any mention of "Data Scientist". We excluded any roles that only mention data scientist in the job summary, or if the title was something along the lines of "data analyst".

In [30]:
jobs = []

for a in range(len(df_exact["job_title"])) :
    if "data scientist" in df_exact["job_title"][a] :
        jobs.append(a)

In [31]:
df_ds = df_exact.iloc[jobs]

In [32]:
df_ds.head()

Unnamed: 0,comp_name,job_location,job_summary,job_title,salary_estimated
0,CLARA analytics,"Santa Clara, CA",The Data Scientist role involves working on al...,data scientist,60000.0
1,Shutterfly,"Redwood City, CA",The Data Scientist will be responsible for des...,data scientist,60000.0
2,Lam Research,"Fremont, CA 94538 (Irvington area)","Define data structures, evaluate data quality,...",data scientist 4,60000.0
3,"GRAIL, Inc.","Menlo Park, CA",Participate in data quality review activities ...,senior staff clinical data scientist,60000.0
4,Chatham Group,"San Francisco, CA 94104 (Financial District area)","Do you like data? Experience in data analysis,...",senior data scientist,60000.0


In [33]:
df_ds.shape

(691, 5)

<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

We use specific terms to isolate which positions are considered "senior" positions. Then, we assign the rows a value of "Senior" or "Not Senior" using 0 and 1 values.

In [34]:
searchfor = ['senior', 'lead', 'sr', 'principal']

In [35]:
tf_vector = df_ds.job_title.str.contains('|'.join(searchfor))

senior_not_senior = [1 if a == True else 0 for a in tf_vector]
df_ds["senior_not_senior"] = senior_not_senior

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


In [36]:
df_ds.head()

Unnamed: 0,comp_name,job_location,job_summary,job_title,salary_estimated,senior_not_senior
0,CLARA analytics,"Santa Clara, CA",The Data Scientist role involves working on al...,data scientist,60000.0,0
1,Shutterfly,"Redwood City, CA",The Data Scientist will be responsible for des...,data scientist,60000.0,0
2,Lam Research,"Fremont, CA 94538 (Irvington area)","Define data structures, evaluate data quality,...",data scientist 4,60000.0,0
3,"GRAIL, Inc.","Menlo Park, CA",Participate in data quality review activities ...,senior staff clinical data scientist,60000.0,1
4,Chatham Group,"San Francisco, CA 94104 (Financial District area)","Do you like data? Experience in data analysis,...",senior data scientist,60000.0,1


In [179]:
df_ds.shape

(691, 6)

In [182]:
# Pretty well balanced
df_ds["senior_not_senior"].value_counts()

0    413
1    278
Name: senior_not_senior, dtype: int64

In [37]:
loc_clean = [a[0:a.find(', CA')] for a in df_ds.job_location]
df_ds["job_location"] = loc_clean

cali_index = df_ds[df_ds["job_location"] == "California"].index
df_ds.drop(df_ds.index[cali_index], inplace=True)
df_ds.reset_index(drop=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


Unnamed: 0,comp_name,job_location,job_summary,job_title,salary_estimated,senior_not_senior
0,CLARA analytics,Santa Clara,The Data Scientist role involves working on al...,data scientist,60000.0,0
1,Shutterfly,Redwood City,The Data Scientist will be responsible for des...,data scientist,60000.0,0
2,Lam Research,Fremont,"Define data structures, evaluate data quality,...",data scientist 4,60000.0,0
3,"GRAIL, Inc.",Menlo Park,Participate in data quality review activities ...,senior staff clinical data scientist,60000.0,1
4,Chatham Group,San Francisco,"Do you like data? Experience in data analysis,...",senior data scientist,60000.0,1
5,Workbridge Associates,San Jose,A well-established retail company located in S...,senior data scientist (contract),60000.0,1
6,Jobspring Partners,Palo Alto,A Series C Healthcare Startup Located in Palo ...,mid-level data scientist,60000.0,0
7,Remind,San Francisco,Analyze data to identify trends and opportunit...,data scientist,60000.0,0
8,FullDeck,Los Angeles,2+ years of experience in data mining and/or d...,data scientist,60000.0,0
9,Brainworks,Oakland,"Our client, a fast-growing, well-funded, ecomm...",data scientist,60000.0,0


## Vectorizing by City again

In [210]:
city_dummies = pd.get_dummies(df_ds.job_location)

X_city = city_dummies
y_city = df_ds["senior_not_senior"]

In [211]:
X_train, X_test, y_train, y_test = train_test_split(X_city, y_city, test_size=0.3)

In [212]:
rfc = RandomForestClassifier(n_estimators=300)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print "Accuracy Score:", acc.round(3)

s = cross_val_score(rfc, X_city, y_city, cv=10, n_jobs=-1)
print "Cross Validation Score:\t{:0.3} ± {:0.3}".format(s.mean().round(3), s.std().round(3))

Accuracy Score: 0.577
Cross Validation Score:	0.627 ± 0.041


<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

Considering a base of 50%, this is not very good, but almost expected considering a position's location should not have impact on the seniority of the position.

In [214]:
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X_city.columns).reset_index()
feature_importances.columns = ['feature', 'importance']

feature_importances.sort_values('importance', ascending=False).head(20)

Unnamed: 0,feature,importance
24,Los Gatos,0.0869
45,San Francisco Bay Area,0.074265
23,Los Angeles,0.047961
35,Pasadena,0.043167
27,Menlo Park,0.040599
59,Tustin,0.029254
4,Burbank,0.029103
60,Venice,0.028669
6,Californi,0.0284
61,Westlake Village,0.028388


## Summary Vectorizer again

In [216]:
salaries_w_desc = df_ds.copy(deep=False)

X_summ = salaries_w_desc['job_summary']
y_summ = salaries_w_desc['senior_not_senior']

In [217]:
cv = CountVectorizer(stop_words="english")
cv.fit(X_summ)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [218]:
X_summ_trans = pd.DataFrame(cv.transform(X_summ).todense(), columns=cv.get_feature_names())

In [219]:
X_train, X_test, y_train, y_test = train_test_split(np.asmatrix(X_summ_trans), y_summ, test_size=0.3,
                                                    stratify=y_summ)

In [220]:
word_counts = X_summ_trans.sum(axis=0)
word_counts.sort_values(ascending = False).head(25)

data          1596
scientist      504
read           224
experience     185
science        156
team           133
learning       107
looking        103
senior         100
analysis        95
analytics       89
machine         85
large           84
sets            78
join            62
big             62
work            56
years           52
modeling        52
role            51
help            50
working         48
mining          48
lead            46
algorithms      44
dtype: int64

In [221]:
rfc = RandomForestClassifier(300)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print "Accuracy Score:", acc.round(3)

s = cross_val_score(rfc, X_summ_trans.as_matrix(), y_summ.as_matrix(), cv=10, n_jobs=-1)
print "Cross Validation Score: {:0.3} ± {:0.3}".format(s.mean().round(3), s.std().round(3))

Accuracy Score: 0.851
Cross Validation Score: 0.882 ± 0.051


<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

This is much more promising because it shows that the job summaries seem to be accurately reflecting the seniority of a position, allowing more accurate predictions to be made. 

In [222]:
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X_summ_trans.columns).reset_index()
feature_importances.columns = ['feature', 'importance']

feature_importances.sort_values('importance', ascending=False).head(20)

Unnamed: 0,feature,importance
1263,senior,0.099406
1076,principal,0.022764
796,lead,0.022202
349,data,0.0114
1238,scientist,0.010638
1148,read,0.010269
513,experience,0.009751
1412,team,0.008452
192,candidate,0.006131
522,exploration,0.005955


## Salary Count Vectorizer

In [224]:
salary_dummies = pd.get_dummies(df_ds.salary_estimated)

X_sal = salary_dummies
y_sal = df_ds["senior_not_senior"]

In [225]:
X_train, X_test, y_train, y_test = train_test_split(X_sal, y_sal, test_size=0.3)

In [226]:
rfc = RandomForestClassifier(n_estimators=300)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print "Accuracy Score:", acc.round(3)

s = cross_val_score(rfc, X_city, y_city, cv=10, n_jobs=-1)
print "Cross Validation Score:\t{:0.3} ± {:0.3}".format(s.mean().round(3), s.std().round(3))

Accuracy Score: 0.582
Cross Validation Score:	0.632 ± 0.047


<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

Does not perform well, but again, this is expected, as different cities in California should have different expected expenses, and salaries will be adjusted according to living costs without huge emphasis on seniority of the title.

In [227]:
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X_sal.columns).reset_index()
feature_importances.columns = ['feature', 'importance']

feature_importances.sort_values('importance', ascending=False).head(20)

Unnamed: 0,feature,importance
0,60000.0,0.26942
3,110000.0,0.249994
1,80000.0,0.244458
2,95000.0,0.236128


## Combine city, summary, and salary vectorizers

In [231]:
salaries_w_desc = df_ds.copy(deep=False).reset_index(drop=True)
city_dummies = pd.get_dummies(df_ds.job_location)

X = pd.concat([city_dummies.reset_index(drop=True), 
               X_sal.reset_index(drop=True), 
               X_summ_trans.reset_index(drop=True)], axis=1)
y = salaries_w_desc["senior_not_senior"]

In [233]:
print X.shape
print y.shape

(691, 1636)
(691,)


In [234]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [235]:
rfc = RandomForestClassifier(300)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print "Accuracy Score:", acc.round(6)

s = cross_val_score(rfc, X, y, cv=10, n_jobs=-1)
print "Cross Validation Score: {:0.3} ± {:0.3}".format(s.mean().round(3), s.std().round(3))

Accuracy Score: 0.870192
Cross Validation Score: 0.867 ± 0.067


<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

Thumbs up.

In [236]:
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X.columns).reset_index()
feature_importances.columns = ['feature', 'importance']

feature_importances.sort_values('importance', ascending=False).head(20)

Unnamed: 0,feature,importance
1332,senior,0.082236
865,lead,0.019881
1145,principal,0.013036
418,data,0.01056
24,Los Gatos,0.00895
1217,read,0.00846
1344,services,0.008439
1307,scientist,0.008055
45,San Francisco Bay Area,0.00665
582,experience,0.006357


## Regression Model

In [38]:
df_rmodel2 = df_ds[["comp_name", "job_location", "salary_estimated", "senior_not_senior"]]
df_rmodel2.reset_index(drop=True, inplace=True)

In [39]:
X = df_rmodel2[["comp_name", "job_location"]]
y = df_rmodel2["senior_not_senior"]

In [40]:
X = pd.get_dummies(X, drop_first=True)

In [43]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [44]:
ss = StandardScaler()
Xs = ss.fit_transform(X)

In [45]:
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.3)

In [46]:
gs_params = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,0,100)
}

lr = LogisticRegression()

lr_gridsearch = GridSearchCV(lr, gs_params, cv=5, verbose=1, n_jobs=-1)

In [47]:
lr_gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 593 tasks      | elapsed:    3.0s
[Parallel(n_jobs=-1)]: Done 985 out of 1000 | elapsed:    6.3s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:    6.3s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.12332e-05, ...,   8.90215e-01,   1.00000e+00]), 'solver': ['liblinear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [48]:
print lr_gridsearch.best_score_

0.737060041408


In [49]:
lr_gridsearch.best_params_

{'C': 0.10974987654930568, 'penalty': 'l1', 'solver': 'liblinear'}

In [50]:
best_gs = lr_gridsearch.best_estimator_
best_gs.score(X_test, y_test)

0.76923076923076927

<div style="width:800px;background:#ffff00;border:1px solid black;text-align:left;padding:8px;">

WIth a random forest accuracy of around 87%, the logistic regression model does not perform as well.