# Web Scraping for Indeed.com & Predicting Salaries

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary classifier.

We are going to collect salary information on data science jobs in a variety of markets. Then, using the location, title, and summary of the job, we will attempt to predict the salary of the job. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use a random forest classifier, as well as another classifier of your choice: logistic regression, SVM, or KNN.

- **Question**: Why would we want this to be a classification problem?
- **Answer**: While more precision may be better, there is a fair amount of natural variance in job salaries, so predicting a range may be useful.

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10

Notice that each job listing sits inside a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those.
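
As a minimal sketch of that extraction, the snippet below parses a small made-up HTML fragment rather than a live Indeed page. The fragment, the company names, and the variable names are assumptions for illustration only; the real page markup may differ.

```python
from bs4 import BeautifulSoup

# A made-up fragment that imitates the structure described above
sample_html = """
<div id="results">
  <div class="row result"><span class="company">Acme</span></div>
  <div class="row result"><span class="company">Globex</span></div>
  <div class="sidebar">not a job listing</div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_="result" matches any div whose class list contains "result"
listings = sample_soup.find_all("div", class_="result")
companies = [div.find("span", class_="company").text for div in listings]
print(companies)
```

The same `find_all` call should work on a parsed results page, as long as each listing still carries the `result` class.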

#### Set up a request (using requests) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for div tags with class name result)
The URL here has several query parameters:
- q for the job search
- The search text ends with "$20,000" (URL-encoded as `%2420%2C000`) to return results with salaries, or expected salaries, above $20,000
- l for a location
- start for what result number to start on
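
One way to avoid hand-encoding these parameters is to let the standard library build the query string. This is only a sketch: `build_search_url` and `BASE_URL` are hypothetical names, not part of the assignment.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

BASE_URL = "http://www.indeed.com/jobs"

def build_search_url(query, location, start=0):
    """Return a search URL for one page of results."""
    # A list of pairs keeps the parameters in a fixed order
    params = [("q", query), ("l", location), ("start", start)]
    return BASE_URL + "?" + urlencode(params)

search_url = build_search_url("data scientist $20,000", "New York", start=10)
print(search_url)
```

`urlencode` turns spaces into `+`, `$` into `%24`, and `,` into `%2C`, which reproduces the URL used in the next cell.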

In [284]:
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

In [285]:
import requests
import bs4
from bs4 import BeautifulSoup

In [286]:

url = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&fromage=last&start="
x=0
url_start = url+str(x) # if we set the start variable in the URL to 0 to begin with, it will first pull up results 1-10

page = requests.get(url_start).content
soup = BeautifulSoup(page,'lxml')

## Write 4 functions to extract each item: location, company, job, and salary.

In [287]:

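# Each function takes the tag for one listing and returns None when the field is missing

def get_location(webpage):
    tag = webpage.find('span', attrs={'class':'location'})
    return tag.text if tag is not None else None

def get_company(webpage):
    tag = webpage.find('span', attrs={'class':'company'})
    return tag.text.strip('\n') if tag is not None else None

def get_job(webpage):
    # title=True only matches <a> tags that have a title attribute
    tag = webpage.find('a', title=True, attrs={'data-tn-element':'jobTitle'})
    return tag['title'] if tag is not None else None

def get_salary(webpage):
    try:
        return webpage.find('table').tr.td.nobr.renderContents()
    except AttributeError:
        # one of the nested tags was not found, so there is no salary
        return None

def get_description(webpage):
    description = webpage.find('span', attrs={'itemprop':"description"})
    return description.text.strip('\n') if description is not None else None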

##### Complete the following code to collect results from multiple cities and starting points.
- Enter your city below to add it to the search
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different

In [288]:
# YOUR_CITY = ''

#### I am not scraping for a particular list of cities, just all cities in general

#### Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.

In [289]:
### I chose to write a function to combine some of the steps
### I also chose not to clean the salaries each time, but instead
### will just clean them all at once, when I import the data from the csvs


## define two functions that will be used in the main scraping function

def numbers_commas_to_int(string):
    # given a string of a number with commas (e.g. '1,234'), convert it to a float
    # (despite the name, locale.atof returns a float; callers cast to int if needed)
    import locale
    locale.setlocale(locale.LC_ALL, 'en_US.UTF-8') # for American comma notation
    # if European comma notation is needed, change the 2nd parameter to 'fr_FR'
    return locale.atof(string)


## import the results that have been previously exported
def compile_files():
    import glob
    import pandas as pd
    columns = ['job','company','location','salary','description']
    indeed_csvs = '/Users/jennydoyle/Desktop/dsi/indeed/'
    files = glob.glob(indeed_csvs + '*.csv')
    # the csvs are written without a header row, so supply the column names here
    frames = [pd.read_csv(f, names=columns, low_memory=False) for f in files]
    if not frames:
        return pd.DataFrame(columns=columns)
    # pd.concat replaces the repeated DataFrame.append calls
    indeed_final = pd.concat(frames)
    indeed_final.drop_duplicates(inplace=True)
    return indeed_final


######################################################
######################################################
######################################################

def scrape_indeed():
    import datetime
    import re
    import time
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    columns = ['job', 'company', 'location', 'salary', 'description']

    ## compile previously scraped results to see if there are new jobs to add
    indeed = compile_files()
    base = len(indeed)
    start = datetime.datetime.now()
    print('Start time: {}'.format(start.strftime("%Y-%m-%d %H:%M:%S")))
    print('Base file has {} records'.format(base))

    # the start parameter is appended to this URL; start=0 pulls up results 1-10
    url = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&fromage=last&start="

    # read the total number of results from the 'Jobs x to y of z' line on the first page
    page = requests.get(url + '0').content
    soup = BeautifulSoup(page, 'lxml')
    print('Page scraped & souped')
    count = soup.find('div', attrs={'id':'searchCount'}).text.split()
    total = int(numbers_commas_to_int(count[-1]))   # z has commas when it is > 999

    rows = []   # DataFrame.append does not modify in place, so collect new rows in a list
    x = 0
    while x <= total:
        page = requests.get(url + str(x)).content
        soup = BeautifulSoup(page, 'lxml')

        # y in 'Jobs x to y of z': how many listings have been covered so far
        num_listings = soup.find('div', attrs={'id':'searchCount'}).text.split()[3]

        main = soup.find('td', {'id':'resultsCol'})   # limit our searching to solely the results portion of the page
        results = main.find_all('div', {'class': re.compile("result$")})   # one div per listing on this page

        for result in results:
            rows.append([get_job(result), get_company(result), get_location(result),
                         get_salary(result), get_description(result)])

        x += 10
        elapsed = datetime.datetime.now() - start
        est_pages = max(total - x, 0) // 10
        print('Added {} jobs-- scraped {} of {} listings in {}; {} pages remaining'.format(
            len(rows), num_listings, total, elapsed, est_pages))

        time.sleep(0.5)   # pause between requests

    finish = datetime.datetime.now()
    now = finish.strftime("%Y-%m-%d %H:%M:%S")
    print('Finish time: {}'.format(now))
    print('Elapsed: {}'.format(finish - start))

    # combine the old and new rows, then save everything to a timestamped csv
    indeed = pd.concat([indeed, pd.DataFrame(rows, columns=columns)], ignore_index=True)
    indeed.to_csv('/Users/jennydoyle/Desktop/dsi/indeed/'+now+'.csv',sep=',', encoding='utf-8',header=False,index=False)
    return indeed


Lastly, we need to clean up salary data. 

1. Only a small number of the scraped results have salary information - only these will be used for modeling.
1. Some of the salaries are not yearly but hourly or weekly; these will not be useful to us for now.
1. Some of the entries may be duplicated
1. The salaries are given as text and usually with ranges.

#### Find the entries with annual salaries by filtering out the entries without salaries and those whose salaries are not yearly (those that refer to hour or week). Also, remove duplicate entries

In [290]:
## YOUR CODE HERE

## I've decided to clean data after re-importing all data
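
For reference, the filtering step described in the heading could look roughly like the sketch below. The `sample` frame is made-up data in the same shape as the scraped results, not real output, and the column names are assumed to match those used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped data
sample = pd.DataFrame({
    "job": ["Data Scientist", "Data Scientist", "Analyst", "Intern", "Engineer"],
    "salary": ["$100,000 a year", "$100,000 a year", None,
               "$20 an hour", "$1,500 a week"],
})

has_salary = sample.salary.notnull()
# na=False treats missing salaries as "no match" instead of NaN
not_annual = sample.salary.str.contains("hour|week", na=False)

# Keep rows with a yearly salary, then drop exact duplicates
annual = sample[has_salary & ~not_annual].drop_duplicates()
print(annual)
```

Only the first row survives here: its duplicate, the missing salary, and the hourly and weekly entries are all removed.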

#### Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary

In [291]:
## YOUR CODE HERE

#numbers_commas_to_int defined with main scrape_indeed function
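
As an alternative to the locale-based helper, a regular expression can pull the dollar amounts out of a salary string and average them. This is a sketch under the assumption that salaries look like `$50,000 - $60,000 a year`; `salary_to_number` is a hypothetical name.

```python
import re

def salary_to_number(salary):
    """Convert a salary string to a float, averaging the ends of a range."""
    # '$50,000 - $60,000 a year' -> ['50,000', '60,000']
    amounts = re.findall(r"\$([\d,]+(?:\.\d+)?)", salary)
    if not amounts:
        return None   # no dollar amount found
    values = [float(a.replace(",", "")) for a in amounts]
    return sum(values) / len(values)

print(salary_to_number("$50,000 - $60,000 a year"))
print(salary_to_number("$120,000 a year"))
```

A single amount is returned unchanged, because averaging a one-element list gives back that element.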

### Save your results as a CSV

In [292]:
# Export to csv

## in main function scrape_indeed()

## Predicting salaries using Random Forests + Another Classifier

#### Load in the data of scraped salaries

In [324]:
## YOUR CODE HERE

indeed = compile_files()

indeed.head()
# indeed.reset_index(drop=True, inplace=True)



|  | job | company | location | salary | description |
|---|---|---|---|---|---|
| 0.0 | Data Scientist | Novetta | Crystal City, VA |  |  |
| 1.0 | Data Scientist | Syntelli Solutions, Inc | Charlotte, NC 28277 |  |  |
| 2.0 | Software Engineer (Data and Analytics) | The Advisory Board Company | Richmond, VA |  |  |
| 3.0 | Data Scientist | TechStratium Inc. | McLean, VA |  | TechStratium is hiring Data Scientists to join... |
| 4.0 | Advanced Analytics Data Scientist | IBM | Springfield, VA |  | As an Advanced Analytics Data Scientist, you'l... |


In [325]:
df = indeed[indeed.salary.notnull()].copy()   # copy so later assignments change df itself, not a view of indeed
df.salary = df.salary.astype(str)

In [326]:
import numpy as np
# split each salary string into a list of tokens
df['salary_list'] = df.salary.str.split()

# for ranges, the first token is the low end and the third token is the high end
mask = df.salary.str.contains('-')
df['low_end'] = df.salary_list.str[0].where(mask)
df['high_end'] = df.salary_list.str[2].where(mask)
df['salary_clean'] = np.nan

# period is the multiplier that converts the quoted figure to a yearly one;
# rows with any other pay period stay NaN
df['period'] = np.nan
df.loc[df.salary.str.contains('year'), 'period'] = 1
df.loc[df.salary.str.contains('month'), 'period'] = 12
df.loc[df.salary.str.contains('hour'), 'period'] = 2080   # 40 hours x 52 weeks


In [327]:
# clear both ends of a range where the high end was recorded as 1
bad = df.high_end == 1
df.loc[bad, ['low_end', 'high_end']] = np.nan

# first token of each salary string, without the dollar sign
df['salary_clean'] = df.salary_list.str[0].str.strip('$')
df['low_end'] = df.low_end.str.strip('$')
df['high_end'] = df.high_end.str.strip('$')

# ranges are averaged in the next cell; also drop first tokens that are the word 'salary'
df.loc[df.low_end.notnull() & df.high_end.notnull(), 'salary_clean'] = np.nan
df.loc[df.salary_clean == 'salary', 'salary_clean'] = np.nan

# convert the remaining strings to numbers, leaving missing values alone
for col in ['salary_clean', 'low_end', 'high_end']:
    df[col] = df[col].map(numbers_commas_to_int, na_action='ignore')



In [328]:
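# where there is no single salary, use the midpoint of the range
midpoint = (df.low_end + df.high_end) / 2
df['salary_clean'] = df.salary_clean.where(df.salary_clean.notnull(), midpoint)
# convert to a yearly figure wherever both a salary and a period multiplier exist
annual = df.period * df.salary_clean
df['salary_clean'] = df.salary_clean.where(annual.isnull(), annual)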


#### We want to predict a binary variable - whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median)

We could also perform Linear Regression (or any regression) to predict the salary value here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW salary.

While performing regression may be better, performing classification may help remove some of the noise of the extreme salaries. We don't _have_ to choose the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of salaries.
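
As a rough illustration of splitting on points other than the median, the sketch below assigns each salary to one of several bands using quantile cut points. The salary figures and the function names are made up for the example.

```python
from bisect import bisect_right

def quantile_cutoffs(values, n_levels):
    """Return n_levels - 1 cut points that split the values into roughly equal groups."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_levels] for k in range(1, n_levels)]

def salary_level(salary, cutoffs):
    """Return 0 for the lowest band, 1 for the next, and so on."""
    return bisect_right(cutoffs, salary)

salaries = [50000, 55000, 80000, 95000, 120000, 130000, 160000, 200000]
cutoffs = quantile_cutoffs(salaries, 4)
levels = [salary_level(s, cutoffs) for s in salaries]
print(cutoffs)
print(levels)
```

With `n_levels=2` this produces a single cut point near the median. Note that a salary equal to a cut point falls in the upper band here, whereas the exercise below treats a salary equal to the median as low.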

In [329]:
## YOUR CODE HERE
df = df[df.salary_clean.notnull()].copy()
median_salary = np.median(df.salary_clean)
# True when the salary is above the median, False otherwise
df['high_salary'] = df.salary_clean > median_salary



#### Thought experiment: What is the baseline accuracy for this model?
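
One common definition of baseline accuracy is the score obtained by always predicting the most frequent class. The sketch below computes that for a made-up list of labels; `baseline_accuracy` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / float(len(labels))

labels = [True, False, False, True, False]
print(baseline_accuracy(labels))
```

Because the classes here come from a median split, they should be close to balanced, so the baseline is expected to sit near 0.5.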

In [330]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


dummies = pd.get_dummies(df[['job','company','location']])

# note: salary_clean is the column that high_salary was derived from
X = pd.concat([dummies, df['salary_clean']], axis=1)
y = df.high_salary

cross_val_score(LogisticRegression(), X, y)

array([ 0.49295775,  0.5       ,  0.5       ])

#### Create a Random Forest model to predict High/Low salary using Sklearn. Start by ONLY using the location as a feature. 

In [331]:
## clean up locations
# remove areas in parentheses
df.location = df.location.str.replace(r'\((.*?)\)', '', regex=True)
df.location = df.location.str.strip()
# remove zip codes
df.location = df.location.str.replace(r'(\d{5}(\-\d{4})?)$', '', regex=True)
df.location = df.location.str.strip()
# create feature with states
# str.findall returns a list of matches per row, so join each list back into a plain string
df['state'] = df.location.str.findall(r',\s(\D{2})$').str.join('')
# remove state from location
df.location = df.location.str.replace(r'(,\s\D{2})$', '', regex=True)


In [332]:
import pandas as pd
# get dummies!!
X = pd.get_dummies(df[['location','state']])

In [333]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

def classify(Classifier, X, y, weight):
    # Cross-validate a classifier class and print its mean accuracy and standard deviation
    name = str(Classifier)
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)
    if weight != '':
        dt = Classifier(class_weight=weight)
    else:
        dt = Classifier()
    s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
    print("{} Score:\t{:0.3} ± {:0.3}".format(name, s.mean().round(3), s.std().round(3)))


classify(RandomForestClassifier, X, y, 'balanced')

<class 'sklearn.ensemble.forest.RandomForestClassifier'> Score:	0.645 ± 0.035


#### Create a few new variables in your dataframe to represent interesting features of a job title.
- For example, create a feature that represents whether 'Senior' is in the title 
- or whether 'Manager' is in the title. 
- Then build a new Random Forest with these features. Do they add any value? 
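
A small sketch of the idea, independent of the dataframe: the function below turns one job title into a dictionary of 0/1 flags using regular expressions. The pattern list and the names `TITLE_PATTERNS` and `title_features` are assumptions for illustration, not the features used in the cell that follows.

```python
import re

# Hypothetical keyword patterns; \b marks a word boundary
TITLE_PATTERNS = {
    "senior": r"\b(SENIOR|SR|LEAD|PRINCIPAL|DIRECTOR)\b",
    "manager": r"\bMANAGER\b",
    "analyst": r"\bANALY",
}

def title_features(title):
    """Return a dict of 0/1 flags, one per pattern, for a job title."""
    upper = title.upper()
    return {name: int(bool(re.search(pattern, upper)))
            for name, pattern in TITLE_PATTERNS.items()}

print(title_features("Sr. Data Analyst"))
print(title_features("Data Science Manager"))
```

Word boundaries keep `LEAD` from matching inside a word such as "Leadership", which a plain substring check would not.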


In [334]:
## YOUR CODE HERE

df.job = df.job.str.upper()

# role keywords in the title; each column is 0 when the keyword is absent
df['analyst'] = df.job.str.contains('ANALY', na=False).astype(int)
df['statistician'] = df.job.str.contains('STATISTIC', na=False).astype(int) * 2
df['machine_learning'] = df.job.str.contains('MACHINE', na=False).astype(int) * 2
df['research'] = df.job.str.contains('RESEARCH', na=False).astype(int)
df['engineer'] = df.job.str.contains('ENGIN', na=False).astype(int) * 2

# seniority keywords; \W matches a non-word character, so \WII\W finds a standalone 'II'
mid = r'MANAGER|MID_LEVEL|\WII\W|\WII$|2|ASSISTANT'
entry = r'\WI\W|\WI$|ENTRY_LEVEL|1|PART-TIME|INTERN'
senior = r'\WIII\W|\WIII$|3|SR\W|SENIOR|LEAD|PRINCIPAL|DIRECTOR'

df['mid_level'] = df.job.str.contains(mid, na=False).astype(int) * 2
df['entry_level'] = df.job.str.contains(entry, na=False).astype(int)
df['senior_level'] = df.job.str.contains(senior, na=False).astype(int) * 3





#### Rebuild this model with the new variables
- You can either create the dummy features manually or use the `dmatrix` function from `patsy`
- Remember to scale the feature variables as well!


In [386]:
from sklearn.model_selection import train_test_split

dummies = pd.get_dummies(df.location)
df_final = pd.concat([dummies, df[['high_salary','salary_clean','analyst','statistician','engineer','machine_learning','research','mid_level','entry_level','senior_level']]], axis=1)
X = pd.concat([dummies, df[['analyst','statistician','engineer','machine_learning','research','mid_level','entry_level','senior_level']]], axis=1)
features = X.columns
y = list(df.high_salary.values)

# hold out a third of the rows to score the model on data it was not trained on
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


model = RandomForestClassifier().fit(X_train,y_train)
model.score(X_test,y_test)

0.77142857142857146

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model. 

In [387]:
cross_val_score(RandomForestClassifier(), X, y)

array([ 0.81690141,  0.82857143,  0.72857143])

#### Continue to incorporate other text features from the title or summary that you believe will predict the salary and examine their coefficients

In [388]:
sorted(zip(model.feature_importances_,features), key=lambda pair: pair[0], reverse=True)

[(0.15873364358172534, 'research'),
 (0.10900826024688226, 'analyst'),
 (0.065341740529850187, 'statistician'),
 (0.036614210819368755, 'Reston'),
 (0.034678687062958208, 'New York'),
 (0.031305129733286134, 'engineer'),
 (0.029896077624091699, 'Los Angeles'),
 (0.025327156776843579, 'Md City'),
 (0.0238659465421961, 'entry_level'),
 (0.023835508037411135, 'senior_level'),
 (0.02317681929165059, 'Jacksonville'),
 (0.022903099087745685, 'Iowa City'),
 (0.020236994633370569, 'Groton'),
 (0.019679664481310481, 'Washington'),
 (0.019101052694672407, 'Boston'),
 (0.018935812666367222, 'Richmond'),
 (0.0184801016371611, 'Queens'),
 (0.017111144992004366, 'Reno'),
 (0.017013001621228634, 'mid_level'),
 (0.015614114385209821, 'Chantilly'),
 (0.01465731428719442, 'Austin'),
 (0.013900159555269681, 'machine_learning'),
 (0.012006565212418571, 'Columbus'),
 (0.0118011777464536, 'Wake County'),
 (0.011689665950791173, 'Olympia'),
 (0.011330058743268843, 'Chicago'),
 (0.010569019765853402, 'Charlot

#### Take ~100 scraped entries with salaries. Convert them to use with your model and predict the salary - which entries have the highest predicted salaries?

In [389]:
df_final.head()


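|  | Albany | Albuquerque | Alexandria | Anchorage | Arlington | Atlanta | Austin | Baltimore | Bellevue | Berkeley Heights | ... | high_salary | salary_clean | analyst | statistician | engineer | machine_learning | research | mid_level | entry_level | senior_level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | False | 55000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 36.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | True | 160000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 54.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | True | 120000 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| 56.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | True | 200000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 94.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | False | 50000 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |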


### BONUS 

#### Bonus: Use Count Vectorizer from scikit-learn to create features from the text summaries. 
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
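
A minimal sketch of the vectorizing step, using three made-up summaries in place of the scraped descriptions. The variable names are assumptions; `binary=True` records whether a word appears rather than how many times.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical job summaries standing in for the scraped descriptions
summaries = [
    "Build machine learning models in Python",
    "Analyze data and build reports in Excel",
    "Research statistics and machine learning",
]

# binary=True gives 0/1 features; drop it to get raw word counts instead
vectorizer = CountVectorizer(binary=True, stop_words="english")
X_text = vectorizer.fit_transform(summaries)

# vocabulary_ maps each kept word to its column index in X_text
vocab = sorted(vectorizer.vocabulary_)
print(X_text.shape)
print(vocab)
```

The resulting sparse matrix could be passed to the classifiers above, either on its own or stacked alongside the location and title features.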

In [440]:
## YOUR CODE HERE

