# Web Scraping for Indeed.com & Predicting Salaries

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary classifier.

We are going to collect salary information on data science jobs in a variety of markets. Then, using the location, title, and summary of each job, we will attempt to predict its salary. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use a random forest classifier, as well as another classifier of your choice: logistic regression, SVM, or KNN.

- **Question**: Why would we want this to be a classification problem?
- **Answer**: While more precision may be better, there is a fair amount of natural variance in job salaries, so predicting a range may be useful.

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.
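
To make the regression-to-classification conversion concrete, here is a minimal sketch of a median split on a handful of salaries. The numbers are made up for illustration and are not scraped data; the rule (below the median is 0, otherwise 1) mirrors the one applied later in this notebook.

```python
from statistics import median

# Hypothetical annual salaries in USD, invented for this illustration.
salaries = [70000, 85000, 100000, 120000, 150000]

cutoff = median(salaries)

# 0 = below the median ("low"), 1 = at or above the median ("high")
labels = [0 if s < cutoff else 1 for s in salaries]
```

A median cutoff gives roughly balanced classes, which keeps plain accuracy a meaningful score.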

### Scraping job listings from Indeed.com

In [None]:
url = "http://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY&start=1"
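
The query string above can also be built from readable values instead of being escaped by hand. The following is a small sketch in Python 3 using only the standard library; `build_search_url` is a hypothetical helper name, and the parameter names `q`, `l`, and `start` are simply read off the URL above.

```python
from urllib.parse import urlencode

def build_search_url(query, location, start=0):
    """Assemble a search URL; urlencode escapes spaces as '+' and ',' as '%2C'."""
    params = {"q": query, "l": location, "start": start}
    return "http://www.indeed.com/jobs?" + urlencode(params)

example_url = build_search_url("data scientist", "New York, NY", start=1)
```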

In [None]:
import requests
import bs4
from bs4 import BeautifulSoup

In [None]:
# try the scraping logic on a single page to check that it extracts the right fields

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')  # name the parser explicitly
jobs = []
for post in soup.find_all('div', {'class':' row result'}):
    job = {}
    job['title'] = post.find('a').get('title')
    try:
        job['company'] = post.find('span', {'itemprop':'name'}).getText()
    except AttributeError:  # find() returned None: no company listed
        job['company'] = None
    job['location'] = post.find('span', {'class':'location'}).getText()
    job['summary'] = post.find('span', {'class':'summary'}).getText()
    try:
        # getText() returns text rather than the raw bytes from renderContents()
        job['salary'] = post.find('td', {'class':'snip'}).find('nobr').getText()
    except AttributeError:  # most postings carry no salary
        job['salary'] = None
    jobs.append(job)
jobs

In [None]:
# try to convert contents into dataframe to take a look
import pandas as pd
test = pd.DataFrame(jobs)
test

In [None]:
# create a function to automatically go through pages and scrape results for more cities

url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 5000 # Set this to a high-value (5000) to generate more results. 

results = []

for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', 'Virginia']):
    for start in range(0, max_results_per_city, 10):
        # Grab the results from the request (as above)
        r = requests.get(url_template.format(city, start))
        soup = BeautifulSoup(r.content, 'html.parser')
        for post in soup.find_all('div', {'class':' row result'}):
            result = {}
            try:
                result['company'] = post.find('span', {'itemprop':'name'}).getText()
            except AttributeError:  # no company listed
                result['company'] = None
            result['title'] = post.find('a').get('title')
            result['location'] = post.find('span', {'class':'location'}).getText()
            result['summary'] = post.find('span', {'class':'summary'}).getText()
            try:
                result['salary'] = post.find('td', {'class':'snip'}).find('nobr').getText()
            except AttributeError:  # no salary listed
                result['salary'] = None
            results.append(result)
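
The paging logic in the loop above can be separated from the parsing, which also makes it easy to pause between requests. This is only a sketch in Python 3: `page_urls` and `crawl` are hypothetical helpers, `fetch` stands in for whatever function downloads a page (for example `requests.get`), and the one-second default delay is an arbitrary assumption.

```python
import time

URL_TEMPLATE = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"

def page_urls(city, max_results, page_size=10):
    """Yield one search URL per results page for a city."""
    for start in range(0, max_results, page_size):
        yield URL_TEMPLATE.format(city, start)

def crawl(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between requests."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(delay_seconds)
    return pages

austin_urls = list(page_urls("Austin", max_results=30))
```

With `max_results=30` and ten results per page, this yields three URLs with `start` equal to 0, 10, and 20.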
       

In [None]:
# convert results to dataframe
df = pd.DataFrame(results)

In [None]:
# save raw data to a file, writing UTF-8 explicitly instead of changing the interpreter's default encoding

df.to_csv('~/desktop/dsjobs.csv', encoding='utf-8')

In [None]:
# Keep only postings with an annual salary: drop rows with a missing value in any column,
# remove duplicate entries, and filter out salaries quoted per hour, day, week, or month.

df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
df = df[~df.salary.str.contains('hour|day|week|month')]

In [None]:
# clean strings
df.summary = df.summary.apply(lambda x: x.strip())
df.company = df.company.apply(lambda x: x.strip())

In [None]:
# Convert salary string to number, and average a salary range

df.salary = df.salary.apply(lambda x: x.replace(' a year', ''))
df.salary = df.salary.apply(lambda x: x.replace('$', ''))
df.salary = df.salary.apply(lambda x: x.replace(',', ''))

In [None]:
def number(x):
    # average the two ends of a range such as '100000 - 120000'
    if '-' in x:
        parts = x.split('-')
        return (int(parts[0]) + int(parts[1])) / 2.0
    else:
        return int(x)
df.salary = df.salary.apply(number)
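
The filtering and conversion steps above can also be expressed as a single function on the raw string. The sketch below is an alternative, not the notebook's method: `parse_annual_salary` is a hypothetical name, and the sample strings are assumed formats based on the `' a year'`, `'$'`, and `','` replacements used above.

```python
import re

def parse_annual_salary(text):
    """Return the mean of the dollar amounts in a yearly salary string, else None."""
    if text is None or "year" not in text:
        return None
    amounts = [int(m.replace(",", "")) for m in re.findall(r"\$([\d,]+)", text)]
    if not amounts:
        return None
    return sum(amounts) / float(len(amounts))

example_range = parse_annual_salary("$100,000 - $120,000 a year")
example_hourly = parse_annual_salary("$40 an hour")
```

Here a range is averaged to its midpoint, and any string that does not mention a year is rejected.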

### Save your results as a CSV

In [None]:
df.to_csv('~/desktop/cleandata.csv', index=False)

## Predicting salaries using Random Forests + Another Classifier

In [None]:
data = pd.read_csv('~/desktop/cleandata.csv')

In [None]:
# create binary dependent variable

import numpy as np
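median_salary = data.salary.median()  # compute the median once, not once per row
data['high_salary'] = (data.salary >= median_salary).astype(int)
# rename 'company', presumably so it does not clash with the TF-IDF term 'company' used as a feature later
data.rename(columns={'company': 'employer'}, inplace=True)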

In [None]:
# check the baseline: the share of postings labelled high salary

float(data.high_salary.sum())/len(data.high_salary)
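
The share computed above is useful because it fixes the accuracy a model must beat: always predicting the more common class already scores that class's share. A minimal sketch with made-up labels (`baseline_accuracy` is a hypothetical helper):

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / float(len(labels))

toy_labels = [0, 0, 1, 1, 1]  # invented labels, not the scraped data
print(baseline_accuracy(toy_labels))  # prints 0.6
```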

In [None]:
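# normalize location to 'City, ST' (drop zip codes and neighborhoods)

def place(x):
    parts = x.split(',')
    if len(parts) < 2:  # no state given, keep the city name only
        return parts[0].strip()
    return parts[0].strip() + ', ' + parts[1].strip()[:2]
data.location = data.location.apply(place)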

#### Create a Random Forest model to predict High/Low salary. Start by ONLY using the location as a feature. 

In [None]:
# label categorical feature

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
rf = RandomForestClassifier()
le = LabelEncoder()
data['location_num'] = le.fit_transform(data.location)
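
For intuition, label encoding amounts to sorting the distinct values and replacing each one with its position in that sorted list. The sketch below reproduces that idea in plain Python; `label_encode` is a hypothetical helper and the locations are illustrative.

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values], classes

codes, classes = label_encode(["New York, NY", "Austin, TX", "New York, NY", "Chicago, IL"])
```

The resulting integers imply an ordering between cities that has no real meaning. Tree-based models can split around that, but distance-based or linear models generally need one-hot encoding instead.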

In [None]:
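X = data.location_num.values.reshape(-1, 1)  # Series has no reshape; go through the NumPy array
y = data.high_salary
rf.fit(X, y)
print(rf.score(X, y))  # accuracy on the training data, so an optimistic estimate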

#### Create a few new variables in your dataframe to represent interesting features of a job title.

In [None]:
# create a categorical feature from 'title'

def position(x):
    if 'Manager' in x:
        return 3
    elif 'Principal' in x:
        return 2
    elif 'Senior' in x:
        return 1
    else:
        return 0
data['high_position'] = data.title.apply(position)

In [None]:
data.high_position.value_counts()

In [None]:
X = data[['location_num', 'high_position']]
rf.fit(X, y)
print(rf.score(X, y))  # again training accuracy, not a held-out score

### Try NLP on the job summary

In [None]:
# create text matrix

from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words='english')
tvec.fit(data.summary)
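# get_feature_names() was removed in newer scikit-learn; get_feature_names_out() needs version 1.0 or later
words = pd.DataFrame(tvec.transform(data.summary).toarray(), columns=tvec.get_feature_names_out())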
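
To show what the text matrix contains, here is a small pure-Python sketch of TF-IDF weighting. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is what scikit-learn documents as its default; it skips stop-word removal, and the two documents are invented. `tfidf` is a hypothetical helper name.

```python
import math
import re
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1.0 + n) / (1.0 + c)) + 1.0 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

toy_rows = tfidf(["python data analysis", "data team"])
```

In the first document, `python` receives a higher weight than `data` because `data` appears in both documents and is therefore less distinctive.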

In [None]:
# merge text matrix back to main dataframe

df = pd.concat([data, words], axis=1)

In [None]:
# utilize random forest to select features

# sklearn.cross_validation and sklearn.grid_search were merged into sklearn.model_selection
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold, GridSearchCV
X = df.iloc[:,6:]
y = df.high_salary
rf.fit(X, y)

In [None]:
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index=X.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances.head(20)

In [None]:
X = df[['location_num', 'high_position', 'scientists', 'data', 'big', 'scientist', 'team', 'analysis', 'analytics', 
       'large', 'looking', 'responsible', 'company', 'python', 'experience']]

In [None]:
# utilize grid search to optimize parameters

cv = KFold(n_splits=5, shuffle=True)
# 'auto' meant 'sqrt' for classifiers and is no longer accepted by newer scikit-learn
rf_params = {'n_estimators': [5,10,15,20], 'criterion': ['gini', 'entropy'], 'max_features': ['sqrt', 'log2']}
rfgs = GridSearchCV(rf, rf_params, cv=cv)
rfgs.fit(X, y)
print(rfgs.best_params_)
print(rfgs.best_score_)

In [None]:
# use the best model from grid search to check the cross validation score

rf = RandomForestClassifier(max_features='sqrt', n_estimators=10, criterion='entropy')
rfscore = cross_val_score(rf, X, y, cv=cv, n_jobs=-1).mean()
rf_pred = cross_val_predict(rf, X, y, cv=cv, n_jobs=-1)
print(rfscore)
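
For readers new to cross-validation, the sketch below shows what a shuffled 5-fold split does with row indices: every row lands in exactly one test fold, and the model is trained on the other four folds each time. This is an illustration only; `kfold_indices` is a hypothetical helper and it does not assign rows to folds in the same way scikit-learn does.

```python
import random

def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train, test) index lists; each index is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

toy_splits = list(kfold_indices(10, n_splits=5))
```

The cross-validated score is then the mean of the per-fold test scores, which is what `.mean()` computes above.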

In [None]:
# print confusion matrix

from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
import matplotlib.pyplot as plt
%matplotlib inline
conmat = np.array(confusion_matrix(y, rf_pred, labels=[1,0]))
confusion = pd.DataFrame(conmat, index=['high', 'low'], columns=['pred high', 'pred low'])
confusion

In [None]:
print(classification_report(y, rf_pred))
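
The precision, recall, and F1 columns of the report follow directly from the confusion matrix counts. A minimal sketch for one class, using invented counts rather than this model's results (`precision_recall_f1` is a hypothetical helper):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for the "high" class: 40 correct, 10 false alarms, 20 missed.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
```

With these counts, precision is 40/50 and recall is 40/60.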

In [None]:
# plot the roc curve and calculate auc

from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split  # moved out of sklearn.cross_validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)
rf.fit(X_train, y_train)
y_prob = rf.predict_proba(X_test)[:,1]

In [None]:
fpr = dict()
tpr = dict()
roc_auc=dict()
fpr[1], tpr[1], _ = roc_curve(y_test, y_prob)
roc_auc[1] = auc(fpr[1], tpr[1])
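
The area under the curve is obtained by applying the trapezoidal rule to the (FPR, TPR) points. The sketch below does that by hand on invented points, as an illustration of the calculation rather than a replacement for `auc`; `trapezoid_auc` is a hypothetical helper.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a piecewise-linear curve through the (fpr, tpr) points."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2.0
        area += width * mean_height
    return area

toy_auc = trapezoid_auc([0.0, 0.0, 0.5, 1.0], [0.0, 0.5, 1.0, 1.0])
chance_auc = trapezoid_auc([0.0, 1.0], [0.0, 1.0])
```

The diagonal line gives 0.5, the score of random guessing, which is why it is drawn as the dashed reference in the plot.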

In [None]:
plt.figure(figsize=[11,9])
plt.plot(fpr[1], tpr[1], label='ROC curve (area = %0.2f)' % roc_auc[1], linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Receiver operating characteristic for high salary', fontsize=18)
plt.legend(loc="lower right")
plt.show()

In [None]:
# try boosting to see if the accuracy score improves

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
# 'auto' is dropped because newer scikit-learn no longer accepts it
gbc_params = {'n_estimators': [100, 200, 250, 300], 'max_features': ['sqrt', 'log2', None]}
gbcgs = GridSearchCV(gbc, gbc_params)
gbcgs.fit(X, y)

In [None]:
print(gbcgs.best_params_)
print(gbcgs.best_score_)

In [None]:
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold
cv = KFold(n_splits=5, shuffle=True)
gbc = GradientBoostingClassifier(max_features='sqrt', n_estimators=100)
gbcscore = cross_val_score(gbc, X, y, cv=cv, n_jobs=1).mean()
gbc_pred = cross_val_predict(gbc, X, y, cv=cv, n_jobs=1)
print(gbcscore)

#### SVM

In [None]:
# create dummies for categorical features

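from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# OneHotEncoder no longer takes categorical_features; pick the columns with a ColumnTransformer.
# The two integer-coded columns are one-hot encoded and the TF-IDF columns pass through unchanged.
enc = ColumnTransformer([('onehot', OneHotEncoder(), ['location_num', 'high_position'])],
                        remainder='passthrough')
X = enc.fit_transform(X)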
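
One-hot encoding replaces each integer code with a row of indicator columns, one per category, so that the SVM does not treat the codes as magnitudes. A plain-Python sketch of the idea, with illustrative codes (`one_hot` is a hypothetical helper):

```python
def one_hot(codes):
    """Turn a list of category codes into rows of 0/1 indicators, one column per category."""
    categories = sorted(set(codes))
    position = {c: i for i, c in enumerate(categories)}
    rows = []
    for c in codes:
        row = [0] * len(categories)
        row[position[c]] = 1
        rows.append(row)
    return rows

toy_dummies = one_hot([2, 0, 2, 1])
```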

In [None]:
# grid search

from sklearn import svm
clf = svm.SVC()
clf_params = {'kernel': ['rbf', 'sigmoid', 'linear', 'poly'], 'C': 10.**np.arange(-2,3), 
              'gamma': 10.**np.arange(-5,2), 'degree': [2,3,4]}
clfgs = GridSearchCV(clf, clf_params)
clfgs.fit(X, y)

In [None]:
print(clfgs.best_params_)
print(clfgs.best_score_)

In [None]:
# cross validation

clf = svm.SVC(kernel='linear', C=100.0, probability=True)
clfscore = cross_val_score(clf, X, y, cv=cv, n_jobs=-1).mean()
clf_pred = cross_val_predict(clf, X, y, cv=cv, n_jobs=-1)
print(clfscore)

In [None]:
# print evaluation metrics

clfcm = np.array(confusion_matrix(y, clf_pred, labels=[1,0]))
clf_confusion = pd.DataFrame(clfcm, index=['high', 'low'], columns=['pred high', 'pred low'])
clf_confusion

In [None]:
print(classification_report(y, clf_pred))

In [None]:
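# re-split after one-hot encoding so the SVM is fit on the same features used in cross-validation
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=41)
clf.fit(X_train, y_train)
clf_prob = clf.predict_proba(X_test)[:,1]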
clffpr = dict()
clftpr = dict()
clfroc_auc=dict()
clffpr[1], clftpr[1], _ = roc_curve(y_test, clf_prob)
clfroc_auc[1] = auc(clffpr[1], clftpr[1])
plt.figure(figsize=[11,9])
plt.plot(clffpr[1], clftpr[1], label='ROC curve (area = %0.2f)' % clfroc_auc[1], linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Receiver operating characteristic for high salary (SVM)', fontsize=18)
plt.legend(loc="lower right")
plt.show()