# Web Scraping for Indeed.com & Predicting Salaries

# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 3: Web Scraping & Logistic Regression

### Description

This week, we learned about web scraping and we have already seen classification models.  Now, we're going to put both of these skills to the test!

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal thinks the best way to gauge salary amounts is to take a look at what industry factors influence the pay scale for these professionals.

Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries. Your job is to understand what factors most directly impact data science salaries and effectively, accurately find appropriate data science related jobs in your metro region.

#### Project Summary

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary predictor with Logistic Regression.

We are going to collect salary information on data science jobs in a variety of markets. Then, using the **location, title, and summary of the job**, we will attempt to predict a corresponding salary for that job. While most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict the expected salaries for other listings will be extremely useful for negotiations :)

Normally we could use regression for this task; however, instead we will convert this into a classification problem and use Logistic Regression.

- **Question**: Why would we want this to be a classification problem?
- **Answer**: While more precision may be better, there is a fair amount of natural variance in job salaries; therefore, predicting a range may be useful.

The first part of the assignment will be focused on scraping [Indeed.com](https://www.indeed.com) or another job aggregator, and the second will be focused on using the listings with salary information to build a model and predict salaries.

Your job is to:

1. Collect data from [Indeed.com](https://www.indeed.com) or another job aggregator on data science salary trends for your analysis.
  - Select and parse data from at least 1000 postings for jobs, potentially from multiple location searches.
2. Find out what factors most directly impact salaries (title, location, department, etc.). In this case, we do not want to predict mean salary as would be done in a regression. Your boss believes that salary is better represented in categories than as a continuous value.
  - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Author a report to your Principal detailing your analysis.

**BONUS PROBLEMS:**
1. Your boss would rather tell a client incorrectly that they would get a lower salary job than tell a client incorrectly that they would get a high salary job. Adjust one of your logistic regression models to ease his mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.
2. Text variables and regularization:
  - **Part 1**: Job descriptions contain more potentially useful information you could leverage. Use the job summary to find words you think would be important and add them as predictors to a model.
  - **Part 2**: Gridsearch parameters for Ridge and Lasso for this model and report the best model.


**Goal:** Scrape & clean data, run logistic regression, derive insights, present findings.

---

### Requirements

- Scrape and prepare your data using BeautifulSoup.
- A Jupyter Notebook with your regression analysis for a peer audience of data scientists.
- A written report directed to your (non-technical!) Principal

 **Pro Tip:** You can find a good example report [here](https://www.dlsweb.rmit.edu.au/lsu/content/2_assessmenttasks/assess_tuts/reports_ll/report.pdf).

---

### Necessary Deliverables / Submission

- Materials must be in a clearly labeled Jupyter notebook.
- Materials must be pushed to the student's GitHub master branch.
- Materials must be submitted by the beginning of Week 6.

---

### Dataset

1. We'll be utilizing a dataset derived from live web data: [Indeed.com](https://www.indeed.com)

2. To get the data, we will use the requests library and BeautifulSoup to scrape the webpage.

---

### Suggested Ways to Get Started

- Read the docs for whatever technologies you use. Most of the time, there is a tutorial that you can follow, but not always, and learning to read documentation is crucial to your success!
- Document **everything**.
- Look up sample executive summaries online.

### Additional Resources
- [Advice on How to Write for a Non-Technical Audience](http://programmers.stackexchange.com/questions/11523/explaining-technical-things-to-non-technical-people)
- [Documentation for BeautifulSoup can be found here](http://www.crummy.com/software/BeautifulSoup/).


In this project, we will practice two major skills: collecting data by scraping a website and then building a binary predictor with Logistic Regression.

We are going to collect salary information on data science jobs in a variety of markets. Then, using the **location, title, and summary of the job**, we will attempt to predict the salary of the job. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use Logistic Regression.

- Question: Why would we want this to be a classification problem?
- Answer: While more precision may be better, there is a fair amount of natural variance in job salaries, so predicting a range may be useful.

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10)

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those. 

#### Set up a request (using `requests`) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: look for `div` tags with class name `result`)

The URL here has several query parameters:

- `q` for the job search
- The search term is followed by "+$20,000" (URL-encoded as `+%2420%2C000`) to return results with salaries (or expected salaries) above $20,000
- `l` for a location
- `start` for what result number to start on
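As a small illustration of how these parameters combine, the sketch below rebuilds the same URL with the standard library instead of hand-encoding it. The helper name `build_search_url` is hypothetical, and the sketch targets Python 3 (`urllib.parse`); under Python 2, `urlencode` lives in `urllib` instead.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.indeed.com/jobs"


def build_search_url(query, location, start=0):
    """Build a search URL from the three query parameters described above."""
    # urlencode turns spaces into '+', '$' into %24 and ',' into %2C
    params = {"q": query, "l": location, "start": start}
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("data scientist $20,000", "New York", start=10)
```

Letting `urlencode` do the escaping avoids mistakes when the search term or city contains spaces or punctuation.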

In [1]:
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

In [2]:
import requests
import bs4
import pandas as pd
from bs4 import BeautifulSoup

Let's look at one result more closely. A single `result` looks like

```
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
Conduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The salary is available in a `nobr` element inside of a `td` element with `class='snip'`.
- The title of a job is in a link with `data-tn-element="jobTitle"`, inside an `h2` with class `jobtitle`.
- The location is set in a `span` with `class='location'`. 
- The company is set in a `span` with `class='company'`. 

### Write 4 functions to extract each item: location, company, job, and salary.

example: 
```python
def extract_location_from_result(result):
    return result.find ...
```


- **Make sure these functions are robust and can handle cases where the data/field may not be available.**
    - Remember to check if a field is empty or `None` before attempting to call methods on it
    - Remember to use `try/except` if you anticipate errors
- **Test** the functions on the results above and simple examples
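
One possible shape for these helpers is sketched below. It assumes each `result` is a BeautifulSoup `Tag` for a single `div` with class `result` and relies on the class names listed above. The helper `_clean_text` is a hypothetical name, and the fallback to a `span` with class `no-wrap` for salaries is an assumption about pages that do not use `nobr`.

```python
def _clean_text(tag):
    """Return the stripped text of a tag, or None if the tag is missing or empty."""
    if tag is None:
        return None
    text = tag.get_text(strip=True)
    return text if text else None


def extract_location_from_result(result):
    return _clean_text(result.find("span", class_="location"))


def extract_company_from_result(result):
    return _clean_text(result.find("span", class_="company"))


def extract_title_from_result(result):
    return _clean_text(result.find("h2", class_="jobtitle"))


def extract_salary_from_result(result):
    snip = result.find("td", class_="snip")
    if snip is None:
        return None
    salary_tag = snip.find("nobr")
    if salary_tag is None:
        # Assumption: some pages wrap the salary in <span class="no-wrap"> instead.
        salary_tag = snip.find("span", class_="no-wrap")
    return _clean_text(salary_tag)
```

Calling all four functions on the same `result` keeps the fields aligned row by row, even when a listing lacks a salary or location.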

In [3]:
html = requests.get('http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10').text

In [4]:
soup = BeautifulSoup(html, 'lxml')

In [5]:
sl = soup.find_all('td', 'snip')

In [6]:
job_title = soup.find_all('h2', 'jobtitle')
for title in job_title:
    n = title.text
print len(job_title)

10


In [9]:
def extract_location_from_result(pages):
    # Despite its name, this scrapes title, location, company and salary
    # from every results page whose start offset is listed in `pages`.
    jobs_info = []
    location_info = []
    company_info = []
    sal = []
    df = {}
    # iteration over pages
    for num in pages:
        print num
        html = requests.get('https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start='+str(num)).text
        soup = BeautifulSoup(html, 'lxml')
        # extracting job title
        job_title = soup.find_all('h2', 'jobtitle')
        # NOTE: padding with the int 0 makes the `.text` call below fail
        # whenever a page returns fewer than 10 results
        while len(job_title) < 10:
            job_title.append(0)
        for title in job_title:
            n = title.text
            replace = n.replace('\n','')
            jobs_info.append(replace)
        # extracting location
        location = soup.find_all('span', 'location')
        while len(location) < 10:
            location.append(0)
        for l in location:
            loc = l.text
            replace_loc = loc.replace('\n','')
            location_info.append(replace_loc)
        # extracting company name
        company = soup.find_all('span', 'company')
        while len(company) < 10:
            company.append(0)
        for c in company:
            co = c.text
            replace_co = co.replace('\n','')
            company_info.append(replace_co)
        # extracting salary
        salary = soup.find_all('td', 'snip')
        while len(salary) < 10:
            salary.append(0)
        for s in salary:
            te = s.find_all('span','no-wrap')
            sal.append(te+[None])

    df['title'] = jobs_info
    df['location'] = location_info
    df['company'] = company_info
    df['salary'] = sal
    return df

In [10]:
pages = range(0, 500, 10)

In [12]:
ny = extract_location_from_result(pages)

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320


AttributeError: 'int' object has no attribute 'text'

In [19]:
html2 = requests.get('https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=330').text

In [20]:
soup2 = BeautifulSoup(html2, 'lxml')

In [21]:
companies = soup2.find_all('span', 'company')
for t in companies:
    print t.text



    CBS Television Network


    Gartner


    SG CIB

    DEPT OF CITYWIDE ADMIN SVCS


    MDRC


    Memorial Sloan Kettering


    Capital One


    Brookhaven National Laboratory


    Showtime


    VillageCare


Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.

- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here we can alter to collect more results, the `l=New+York` and the `start=10`. The first controls the location of the results (so we can try a different city). The second controls where in the results to start and gives 10 results (thus, we can keep incrementing by 10 to go further in the list).

#### Complete the following code to collect results from multiple cities and starting points. 
- Enter your city below to add it to the search
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different

In [None]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 300 # Set this to a high value (5000) to generate more results.
# Crawling more results will also take much longer. First test your code on a small number of results and then expand.

results = []

for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami']):
    for start in range(0, max_results_per_city, 10):
        # Grab the results from the request (as above)
        # Append to the full set of results
        pass
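
One way to fill in the loop above is sketched here. To keep the example runnable without network access, the page download is passed in as a `fetch` callable (for instance `lambda url: requests.get(url).text`). The function name `collect_pages`, the `fetch` parameter, and the one-second default delay between requests are all assumptions made for illustration.

```python
import time

URL_TEMPLATE = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"


def collect_pages(cities, max_results_per_city, fetch, delay=1.0):
    """Download every results page for every city as (city, start, html) tuples."""
    pages = []
    for city in cities:
        for start in range(0, max_results_per_city, 10):
            url = URL_TEMPLATE.format(city, start)
            try:
                html = fetch(url)
            except Exception as err:
                # A single failed request should not abort the whole crawl.
                print("skipping {}: {}".format(url, err))
                continue
            pages.append((city, start, html))
            time.sleep(delay)  # pause between requests to be polite to the server
    return pages
```

Keeping the raw HTML per page separates downloading from parsing, so the parsing functions can be re-run and debugged without crawling again.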

In [127]:
cities = ['New+York', 'Chicago', 'San+Francisco','Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas','Denver', 'Houston']
def extract_location_from_result(city_list):
    # Despite its name, this collects title, location, company and salary
    # for the first 20 result pages of every city in `city_list`.
    jobs_info = []
    location_info = []
    company_info = []
    sal = []
    dic = {}
    # iteration over cities and pages
    for city in city_list:
        print city
        for num in range(0, 200, 10):
            print num
            html = requests.get('https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l='+str(city)+'&start='+str(num)).text
            soup = BeautifulSoup(html, 'lxml')
            # extracting job title
            job_title = soup.find_all('h2', 'jobtitle')
            # NOTE: the '0' padding strings have no `.text` attribute, so a page
            # with fewer than 10 results would still raise AttributeError
            while len(job_title) < 10:
                job_title.append('0')
            for title in job_title:
                n = title.text
                replace = n.replace('\n','')
                jobs_info.append(replace)
            # extracting location
            location = soup.find_all('span', 'location')
            while len(location) < 10:
                location.append('0')
            for l in location:
                loc = l.text
                replace_loc = loc.replace('\n','')
                location_info.append(replace_loc)
            # extracting company name
            company = soup.find_all('span', 'company')
            while len(company) < 10:
                company.append('0')
            for c in company:
                co = c.text
                replace_co = co.replace('\n','')
                company_info.append(replace_co)
            # extracting salary
            salary = soup.find_all('td', 'snip')
            while len(salary) < 10:
                salary.append('0')
            for s in salary:
                te = s.find_all('span','no-wrap')
                sal.append(te+['0'])

    dic['title'] = jobs_info
    dic['location'] = location_info
    dic['company'] = company_info
    dic['salary'] = sal
    return dic


In [160]:
data_sc = extract_location_from_result(cities)

New+York
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
Chicago
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
San+Francisco
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
Austin
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
Seattle
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
Los+Angeles
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
Philadelphia
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
Atlanta
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
Dallas
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
Denver
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
Houston
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190


#### Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.

In [194]:
indeed_data = pd.DataFrame.from_dict(data_sc)
indeed_data.head()


|   | company | location | salary | title |
|---|---|---|---|---|
| 0 | AIG | New York, NY | [0] | Statistical Machine Learning Scientist |
| 1 | Butterfly Network | New York, NY | [0] | Machine Learning Research Scientist |
| 2 | Mount Sinai Health System | New York, NY 10029 (Yorkville area) | [0] | Data Science Intern |
| 3 | Morgan Stanley | New York, NY 10032 (Washington Heights area) | [0] | Quantitative Analyst |
| 4 | Morgan Stanley | New York, NY 10032 (Washington Heights area) | [0] | BRM Securities Lending Quantitative Analyst |


In [195]:
indeed_data.shape

(2200, 4)

Lastly, we need to clean up salary data. 

1. Only a small number of the scraped results have salary information; only these will be used for modeling.
1. Some of the salaries are not yearly but hourly or weekly; these will not be useful to us for now.
1. Some of the entries may be duplicated.
1. The salaries are given as text, usually as ranges.

#### Find the entries with annual salaries by filtering out entries without salaries or with salaries that are not yearly (those that refer to an hour or week). Also, remove duplicate entries

In [196]:
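# removing duplicates
# NOTE: this line leaves the data unchanged: DataFrame.apply passes each column
# as a Series (never a list), so the lambda returns it as-is and no rows are
# dropped, as the unchanged shape below shows
indeed = indeed_data.apply(lambda x: tuple(x) if type(x) is list else x)
indeed.shape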

(2200, 4)
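
The filtering this step asks for can be sketched in plain Python. The sketch assumes each listing is a dict with `title`, `company`, `location`, and `salary` text keys, which is a hypothetical layout rather than the DataFrame used in this notebook, and the function name `keep_annual_unique` is likewise made up for illustration.

```python
def keep_annual_unique(records):
    """Keep listings whose salary text is yearly, dropping exact duplicates."""
    seen = set()
    kept = []
    for rec in records:
        salary = rec.get("salary") or ""
        if "year" not in salary.lower():
            continue  # no salary listed, or pay quoted per hour/week/month
        key = (rec.get("title"), rec.get("company"), rec.get("location"), salary)
        if key in seen:
            continue  # same listing already kept
        seen.add(key)
        kept.append(rec)
    return kept
```

With pandas, the duplicate-removal half of this is what `DataFrame.drop_duplicates()` does, provided every column holds hashable values such as strings rather than lists.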

#### Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary
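
One way to approach this is sketched below with a regular expression that pulls every dollar amount out of the text and averages them. The function name `parse_salary` is hypothetical, and returning `None` for text without any number is a design choice of this sketch.

```python
import re

# One amount: digits with optional thousands separators and optional decimals.
_AMOUNT = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_salary(text):
    """Convert text like '$130,000 - $158,000 a year' to a float, averaging ranges."""
    if not text:
        return None
    amounts = [float(match.replace(",", "")) for match in _AMOUNT.findall(text)]
    if not amounts:
        return None
    return sum(amounts) / len(amounts)
```

Because decimals are kept as decimals, an hourly rate such as `$14.93` stays 14.93 rather than becoming 1493, and the pay period can be handled in a separate step.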

In [197]:
salary_test = []
for element in indeed.salary:
    if len(element) == 2:
        salary_test.append(element[0])
    else:
        salary_test.append(element)
salary_test


[['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 <span class="no-wrap">\n                $130,000 - $158,000 a year</span>,
 ['0'],
 ['0'],
 ['0'],
 <span class="no-wrap">\n                $120,000 - $153,000 a year (Indeed est.)</span>,
 <span class="no-wrap">\n                $174,000 - $221,000 a year (Indeed est.)</span>,
 ['0'],
 ['0'],
 ['0'],
 <span class="no-wrap">\n                $98,000 - $125,000 a year (Indeed est.)</span>,
 ['0'],
 <span class="no-wrap">\n                $122,000 - $156,000 a year (Indeed est.)</span>,
 <span class="no-wrap">\n                $50,000 a year</span>,
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 ['0'],
 <span class=

In [198]:
indeed.insert(loc=2, column='Salary',value=salary_test)
indeed= indeed.drop('salary', axis=1)

In [199]:
beta = []
for e in indeed['Salary']:
    for sublist in e:
        beta.append(sublist)

In [200]:
indeed.insert(loc=3, column='salary',value=beta)
indeed = indeed.drop('Salary', axis=1)

In [201]:
indeed

|   | company | location | salary | title |
|---|---|---|---|---|
| 0 | AIG | New York, NY | 0 | Statistical Machine Learning Scientist |
| 1 | Butterfly Network | New York, NY | 0 | Machine Learning Research Scientist |
| 2 | Mount Sinai Health System | New York, NY 10029 (Yorkville area) | 0 | Data Science Intern |
| 3 | Morgan Stanley | New York, NY 10032 (Washington Heights area) | 0 | Quantitative Analyst |
| 4 | Morgan Stanley | New York, NY 10032 (Washington Heights area) | 0 | BRM Securities Lending Quantitative Analyst |
| 5 | Clarifai | New York, NY | 0 | Internship: Applied Machine Learning |
| 6 | NBA | New York, NY | 0 | Senior Data Scientist |
| 7 | State Street | New York, NY | $130,000 - $158,000 a year | Quantitative Analyst |
| 8 | Mount Sinai Health System | New York, NY 10029 (Yorkville area) | 0 | Data Science Analyst I - Mount Sinai Health Pa... |
| 9 | QBE | New York State | 0 | Data Science Lead (Credit Modeler) |


In [202]:
for r in indeed.salary:
    if len(r) != 1:
        spl = r.split()
        print spl

[u'$130,000', u'-', u'$158,000', u'a', u'year']
[u'$120,000', u'-', u'$153,000', u'a', u'year', u'(Indeed', u'est.)']
[u'$174,000', u'-', u'$221,000', u'a', u'year', u'(Indeed', u'est.)']
[u'$98,000', u'-', u'$125,000', u'a', u'year', u'(Indeed', u'est.)']
[u'$122,000', u'-', u'$156,000', u'a', u'year', u'(Indeed', u'est.)']
[u'$50,000', u'a', u'year']
[u'$12', u'-', u'$15', u'an', u'hour']
[u'$70,286', u'-', u'$83,784', u'a', u'year']
[u'$59,708', u'-', u'$65,678', u'a', u'year']
[u'$14.93', u'-', u'$23.73', u'an', u'hour']
[u'$59,708', u'-', u'$65,678', u'a', u'year']
[u'$14.93', u'-', u'$23.73', u'an', u'hour']
[u'$70,286', u'-', u'$84,753', u'a', u'year']
[u'$10', u'an', u'hour']
[u'$50,000', u'-', u'$75,000', u'a', u'year']
[u'$50,000', u'-', u'$75,000', u'a', u'year']
[u'$55,100', u'a', u'year']
[u'$200,000', u'-', u'$300,000', u'a', u'year']
[u'$54,000', u'-', u'$58,000', u'a', u'year']
[u'$112,684', u'-', u'$141,752', u'a', u'year']
[u'$27.78', u'-', u'$33.09', u'an', u'hour']


In [203]:
sala = []
for r in indeed['salary']:
    if len(r) != 1:
        spl = r.split()
        if len(spl) == 3:
            sala.append(spl[0])
        else:
            sala.append(spl[0:3])
    else:
        sala.append(r)
sala

['0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 [u'$130,000', u'-', u'$158,000'],
 '0',
 '0',
 '0',
 [u'$120,000', u'-', u'$153,000'],
 [u'$174,000', u'-', u'$221,000'],
 '0',
 '0',
 '0',
 [u'$98,000', u'-', u'$125,000'],
 '0',
 [u'$122,000', u'-', u'$156,000'],
 u'$50,000',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 [u'$12', u'-', u'$15'],
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 [u'$70,286', u'-', u'$83,784'],
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 '0',
 [u'$59,708', u'-', u'$65,678'],
 [u'$14.93', u'-', u'$23.73'],
 '0',
 [u'$59,708', u'-', u'$65,678'],
 [u'$14.93', u'-', u'$23.73'],
 '0',
 '0',
 '0',
 '0

In [204]:
indeed.insert(loc=2, column='salario',value=sala)

In [205]:
indeed = indeed.drop('salary',axis=1)

In [209]:
indeed

|   | company | location | salario | title |
|---|---|---|---|---|
| 0 | AIG | New York, NY | 0 | Statistical Machine Learning Scientist |
| 1 | Butterfly Network | New York, NY | 0 | Machine Learning Research Scientist |
| 2 | Mount Sinai Health System | New York, NY 10029 (Yorkville area) | 0 | Data Science Intern |
| 3 | Morgan Stanley | New York, NY 10032 (Washington Heights area) | 0 | Quantitative Analyst |
| 4 | Morgan Stanley | New York, NY 10032 (Washington Heights area) | 0 | BRM Securities Lending Quantitative Analyst |
| 5 | Clarifai | New York, NY | 0 | Internship: Applied Machine Learning |
| 6 | NBA | New York, NY | 0 | Senior Data Scientist |
| 7 | State Street | New York, NY | [$130,000, -, $158,000] | Quantitative Analyst |
| 8 | Mount Sinai Health System | New York, NY 10029 (Yorkville area) | 0 | Data Science Analyst I - Mount Sinai Health Pa... |
| 9 | QBE | New York State | 0 | Data Science Lead (Credit Modeler) |


In [210]:
for k,i in enumerate(indeed.salario):
    if len(i) == 3:
        print k
        print i
        print len(i)

7
[u'$130,000', u'-', u'$158,000']
3
11
[u'$120,000', u'-', u'$153,000']
3
12
[u'$174,000', u'-', u'$221,000']
3
16
[u'$98,000', u'-', u'$125,000']
3
18
[u'$122,000', u'-', u'$156,000']
3
67
[u'$12', u'-', u'$15']
3
97
[u'$70,286', u'-', u'$83,784']
3
108
[u'$59,708', u'-', u'$65,678']
3
109
[u'$14.93', u'-', u'$23.73']
3
111
[u'$59,708', u'-', u'$65,678']
3
112
[u'$14.93', u'-', u'$23.73']
3
126
[u'$70,286', u'-', u'$84,753']
3
291
$10
3
378
[u'$50,000', u'-', u'$75,000']
3
385
[u'$50,000', u'-', u'$75,000']
3
475
[u'$200,000', u'-', u'$300,000']
3
506
[u'$54,000', u'-', u'$58,000']
3
593
[u'$112,684', u'-', u'$141,752']
3
595
[u'$27.78', u'-', u'$33.09']
3
616
[u'$6,666', u'-', u'$8,333']
3
630
[u'$32,345', u'-', u'$33,545']
3
631
[u'$5,259', u'-', u'$6,941']
3
644
[u'$3,100', u'-', u'$5,200']
3
652
[u'$94,000', u'-', u'$120,000']
3
653
[u'$94,000', u'-', u'$120,000']
3
654
[u'$121,000', u'-', u'$155,000']
3
658
[u'$4,599', u'-', u'$6,066']
3
664
[u'$4,599', u'-', u'$6,066']
3
679
[u

In [211]:
indeed = indeed.drop([291,2161],axis=0)

In [212]:
indeed.shape

(2198, 4)

In [213]:
for k,i in enumerate(indeed.salario):
    if len(i) == 3:
        print k
        print i
        print len(i)

7
[u'$130,000', u'-', u'$158,000']
3
11
[u'$120,000', u'-', u'$153,000']
3
12
[u'$174,000', u'-', u'$221,000']
3
16
[u'$98,000', u'-', u'$125,000']
3
18
[u'$122,000', u'-', u'$156,000']
3
67
[u'$12', u'-', u'$15']
3
97
[u'$70,286', u'-', u'$83,784']
3
108
[u'$59,708', u'-', u'$65,678']
3
109
[u'$14.93', u'-', u'$23.73']
3
111
[u'$59,708', u'-', u'$65,678']
3
112
[u'$14.93', u'-', u'$23.73']
3
126
[u'$70,286', u'-', u'$84,753']
3
377
[u'$50,000', u'-', u'$75,000']
3
384
[u'$50,000', u'-', u'$75,000']
3
474
[u'$200,000', u'-', u'$300,000']
3
505
[u'$54,000', u'-', u'$58,000']
3
592
[u'$112,684', u'-', u'$141,752']
3
594
[u'$27.78', u'-', u'$33.09']
3
615
[u'$6,666', u'-', u'$8,333']
3
629
[u'$32,345', u'-', u'$33,545']
3
630
[u'$5,259', u'-', u'$6,941']
3
643
[u'$3,100', u'-', u'$5,200']
3
651
[u'$94,000', u'-', u'$120,000']
3
652
[u'$94,000', u'-', u'$120,000']
3
653
[u'$121,000', u'-', u'$155,000']
3
657
[u'$4,599', u'-', u'$6,066']
3
663
[u'$4,599', u'-', u'$6,066']
3
678
[u'$8,333', 

In [215]:
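salari = []
for i in indeed.salario:
    if len(i) == 1:
        # '0' placeholder: no salary listed
        salari.append(i)
    elif len(i) == 3:
        # [low, '-', high] range: strip the formatting and average both ends
        # NOTE: removing '.' turns hourly rates such as $14.93 into 1493,
        # and `/` on two ints is integer division in Python 2
        first_value = i[0]
        repla = first_value.replace('$','')
        repl = repla.replace(',','')
        last = repl.replace('.','')
        first_num = int(last)
        second_value = i[2]
        pl = second_value.replace('$','')
        lp = pl.replace(',','')
        ult = lp.replace('.','')
        second_num = int(ult)
        math = (first_num+second_num)/2
        salari.append(float(math))
    else:
        # single amounts such as u'$50,000' are cleaned up in later cells
        salari.append(i)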

In [216]:
indeed.insert(loc=2, column='salari',value=salari)

In [217]:
indeed

|   | company | location | salari | salario | title |
|---|---|---|---|---|---|
| 0 | AIG | New York, NY | 0 | 0 | Statistical Machine Learning Scientist |
| 1 | Butterfly Network | New York, NY | 0 | 0 | Machine Learning Research Scientist |
| 2 | Mount Sinai Health System | New York, NY 10029 (Yorkville area) | 0 | 0 | Data Science Intern |
| 3 | Morgan Stanley | New York, NY 10032 (Washington Heights area) | 0 | 0 | Quantitative Analyst |
| 4 | Morgan Stanley | New York, NY 10032 (Washington Heights area) | 0 | 0 | BRM Securities Lending Quantitative Analyst |
| 5 | Clarifai | New York, NY | 0 | 0 | Internship: Applied Machine Learning |
| 6 | NBA | New York, NY | 0 | 0 | Senior Data Scientist |
| 7 | State Street | New York, NY | 144000 | [$130,000, -, $158,000] | Quantitative Analyst |
| 8 | Mount Sinai Health System | New York, NY 10029 (Yorkville area) | 0 | 0 | Data Science Analyst I - Mount Sinai Health Pa... |
| 9 | QBE | New York State | 0 | 0 | Data Science Lead (Credit Modeler) |


In [218]:
indeed = indeed.drop('salario',axis=1)

In [219]:
indeed['salari'] = indeed['salari'].astype(str)

In [220]:
indeed['salari'] = indeed['salari'].apply(lambda x: x.replace('$',''))

In [221]:
indeed['salari'] = indeed['salari'].apply(lambda x: x.replace(',',''))

In [222]:
indeed['salari'] = indeed['salari'].astype(float)

In [223]:
indeed

|   | company | location | salari | title |
|---|---|---|---|---|
| 0 | AIG | New York, NY | 0.0 | Statistical Machine Learning Scientist |
| 1 | Butterfly Network | New York, NY | 0.0 | Machine Learning Research Scientist |
| 2 | Mount Sinai Health System | New York, NY 10029 (Yorkville area) | 0.0 | Data Science Intern |
| 3 | Morgan Stanley | New York, NY 10032 (Washington Heights area) | 0.0 | Quantitative Analyst |
| 4 | Morgan Stanley | New York, NY 10032 (Washington Heights area) | 0.0 | BRM Securities Lending Quantitative Analyst |
| 5 | Clarifai | New York, NY | 0.0 | Internship: Applied Machine Learning |
| 6 | NBA | New York, NY | 0.0 | Senior Data Scientist |
| 7 | State Street | New York, NY | 144000.0 | Quantitative Analyst |
| 8 | Mount Sinai Health System | New York, NY 10029 (Yorkville area) | 0.0 | Data Science Analyst I - Mount Sinai Health Pa... |
| 9 | QBE | New York State | 0.0 | Data Science Lead (Credit Modeler) |


### Save your results as a CSV

In [1]:
indeed.to_csv('Indeed_job_offers.csv', header=True, encoding='UTF-8', index=False)

NameError: name 'indeed' is not defined

## Predicting salaries using Logistic Regression

#### Load in the data of scraped salaries

In [224]:
import numpy as np

In [225]:
indeed.head()

|   | company | location | salari | title |
|---|---|---|---|---|
| 0 | AIG | New York, NY | 0.0 | Statistical Machine Learning Scientist |
| 1 | Butterfly Network | New York, NY | 0.0 | Machine Learning Research Scientist |
| 2 | Mount Sinai Health System | New York, NY 10029 (Yorkville area) | 0.0 | Data Science Intern |
| 3 | Morgan Stanley | New York, NY 10032 (Washington Heights area) | 0.0 | Quantitative Analyst |
| 4 | Morgan Stanley | New York, NY 10032 (Washington Heights area) | 0.0 | BRM Securities Lending Quantitative Analyst |


#### We want to predict a binary variable - whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median)

We could also perform Linear Regression (or any regression) to predict the salary value here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW salary.

While performing regression may be better, performing classification may help remove some of the noise of the extreme salaries. We don't have to choose the `median` as the splitting point; we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of salaries.
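
A minimal sketch of the median split, using only the standard library, is shown below. The function name `label_high_low` and its optional `threshold` argument are hypothetical; passing a different threshold (for example a 75th-percentile value) changes the split point without touching the rest of the code.

```python
from statistics import median


def label_high_low(salaries, threshold=None):
    """Return (labels, threshold): 1 for salaries above the threshold, else 0."""
    if threshold is None:
        threshold = median(salaries)  # default split point: the median
    labels = [1 if salary > threshold else 0 for salary in salaries]
    return labels, threshold


labels, cutoff = label_high_low([50000, 62693, 77035, 111500, 136500, 144000])
```

Salaries exactly equal to the threshold are labeled 0 (low), which matches a strict "above the median" definition of a high salary.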

In [275]:
#since our analysis is focused on the salary we don't need those jobs without salary
indeed_df = indeed[indeed['salari'] != 0]

In [276]:
indeed_df

|   | company | location | salari | title | level |
|---|---|---|---|---|---|
| 7 | State Street | New York, NY | 144000.0 | Quantitative Analyst | High |
| 11 | QBE | New York State | 136500.0 | Data Science Lead (Credit Modeler) | High |
| 12 | Goldman Sachs | New York, NY 10282 (Tribeca area) | 197500.0 | Engineering - GSAM Tech - Data Science and Mac... | High |
| 16 | SiriusXM | New York, NY 10104 (Midtown area) | 111500.0 | Intern, Data Science - Engineering - Part-Time | High |
| 18 | McKinsey & Company | New York, NY 10022 (Midtown area) | 139000.0 | Data Science Translator - TMT Sector | High |
| 19 | Oracle | New York, NY | 50000.0 | Student / Intern | Low |
| 67 | State University of New York at Albany Res... | New York, NY | 13.0 | Statistical Intern - Charter Schools Institute... | Low |
| 97 | ADMIN FOR CHILDREN'S SVCS | Manhattan, NY | 77035.0 | Data Analyst | Low |
| 108 | DEPT OF HEALTH/MENTAL HYGIENE | Queens, NY | 62693.0 | Quality Assurance Analyst, Bureau of STD Preve... | Low |
| 109 | POLICE DEPARTMENT | New York, NY | 1933.0 | SUMMER GRADUATE INTERN | Low |


In [288]:
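# calculating the annual salary for all jobs
salary = []
for i in indeed_df.salari:
    if i < 100:
        # values under 100 are treated as hourly rates:
        # 8 hours/day, 5 days/week, 4.3 weeks/month, 12 months/year
        day = i * 8
        week = day * 5
        month = week * 4.3
        year = month * 12
        salary.append(year)
    elif 100 <= i < 10000:
        # values from 100 up to 10,000 are treated as monthly pay
        annual = i * 12
        salary.append(annual)
    else:
        # anything larger is assumed to be yearly already
        salary.append(i)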
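
The thresholds above infer the pay period from the size of the number. An alternative sketch, assuming the original salary text (for example `"$12 - $15 an hour"`) is still available next to the parsed amount, reads the period from the text itself. The function name `annualize` and the conversion factors (40 hours a week, 52 weeks a year) are assumptions, not values taken from Indeed.

```python
# Assumed conversion factors for full-time work: 40 hours/week, 52 weeks/year.
PERIODS_PER_YEAR = {"hour": 40 * 52, "week": 52, "month": 12, "year": 1}


def annualize(amount, salary_text):
    """Scale an amount to a yearly figure based on the period named in the text."""
    text = salary_text.lower()
    for unit, factor in PERIODS_PER_YEAR.items():
        if unit in text:
            return amount * factor
    return None  # unknown pay period: leave it for manual review


yearly = annualize(13.0, "$12 - $15 an hour")
```

The 2,080 hours per year used here is close to the 2,064 hours implied by the cell above (8 x 5 x 4.3 x 12).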

In [289]:
indeed_df.insert(column='Salary', loc=2 ,value= salary)
indeed_df = indeed_df.drop('salari',axis=1)

In [290]:
median = np.median(indeed_df.Salary)
median

84239.5

In [293]:
level = []
for e in indeed_df.Salary:
    if e > median:
        level.append('High')
    else:
        level.append('Low')

In [294]:
indeed_df.insert(loc=4, column='level',value=level)
indeed_df = indeed_df.drop('Level',axis=1)

#### Thought experiment: What is the baseline accuracy for this model?

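Because the classes are split at the median salary ($84,239.50), half of the listings are labeled High and half Low. Always predicting a single class is therefore right about 50% of the time, so the baseline accuracy is roughly 0.5. A useful model needs to beat that.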
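
One way to check the baseline is to compute the share of the most common class. The sketch below uses a small made-up `level` series as a stand-in for `indeed_df['level']`.

```python
import pandas as pd

# Hypothetical labels standing in for indeed_df['level'].
level = pd.Series(["High", "Low", "High", "Low", "Low", "High", "High", "Low"])

# Share of each class; the largest share is the majority-class baseline.
class_share = level.value_counts(normalize=True)
baseline_accuracy = class_share.max()

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```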

#### Create a Logistic Regression model to predict High/Low salary using `statsmodels`. Start by ONLY using the location as a feature. Display the coefficients and write a short summary of what they mean.

In [None]:
cities = ['New York', 'Chicago', 'San Francisco','Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas','Denver', 'Houston']

In [315]:
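# Encode each listing's state as a numeric code: 0 = NY (and anything unrecognized),
# 1 = TX, 2 = GA, 3 = IL, 4 = CA, 5 = CO, 6 = WA, 7 = PA.
state = []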
for i in indeed_df.location:
    if 'NY' in i:
        state.append('0')
    elif 'TX' in i:
        state.append('1')
    elif 'New York' in i:
        state.append('0')
    elif 'GA' in i:
        state.append('2')
    elif 'IL' in i:
        state.append('3')
    elif 'CA' in i:
        state.append('4')
    elif 'CO' in i:
        state.append('5')
    elif 'WA' in i:
        state.append('6')
    elif 'PA' in i:
        state.append('7')
    else:
        state.append('0')

In [316]:
indeed_df.insert(loc=4, column='state',value=state)

In [319]:
from sklearn.linear_model import LogisticRegression

In [324]:
lg = LogisticRegression()
lg.fit(indeed_df[['state']], indeed_df.level)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [326]:
score = lg.score(indeed_df[['state']], indeed_df.level)
score

0.4930555555555556
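
A single numeric state code treats the states as if they were ordered (NY < TX < GA ...), which may be part of why the score sits near the 50% baseline. A possible alternative is one dummy column per state, which also gives one readable coefficient per state. The sketch below uses scikit-learn rather than `statsmodels`, and the `toy` frame is invented for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical listings; the real data would come from indeed_df.
toy = pd.DataFrame({
    "state": ["NY", "NY", "NY", "TX", "TX", "TX", "CA", "CA", "CA"],
    "level": ["High", "Low", "High", "Low", "Low", "High", "High", "High", "Low"],
})

# One 0/1 column per state; drop_first leaves CA as the reference category.
X = pd.get_dummies(toy["state"], prefix="state", drop_first=True, dtype=float)
y = (toy["level"] == "High").astype(int)  # 1 = High, 0 = Low

model = LogisticRegression().fit(X, y)

# Each coefficient is the change in log-odds of "High" relative to CA.
coefs = pd.Series(model.coef_[0], index=X.columns)
print(coefs)
```

A negative coefficient means listings in that state are less likely to be "High" than listings in the reference state.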

#### Create a few new variables in your dataframe to represent interesting features of a job title.
- For example, create a feature that represents whether 'Senior' is in the title 
- or whether 'Manager' is in the title. 
- Then build a new Logistic Regression model with these features. Do they add any value? 


In [333]:
position = []
for i in indeed_df.title:
    if 'Intern' in i:
        position.append(0)
    elif 'Junior' in i:
        position.append(1)
    elif 'Senior' in i:
        position.append(2)
    elif 'Manager' in i:
        position.append(3)
    else:
        # no seniority keyword: keep the raw title so the row can be filtered out later
        position.append(i)

In [334]:
indeed_df.insert(loc=4, column='position',value=position)
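
An alternative to a single `position` code is one 0/1 indicator per keyword, which keeps every row and lets a title match several keywords at once. The sketch below is a minimal version of that idea; the `titles` series and the `is_*` column names are hypothetical.

```python
import pandas as pd

# Hypothetical titles standing in for indeed_df['title'].
titles = pd.Series([
    "Senior Data Engineer",
    "Student / Intern",
    "Senior Product Manager",
    "Data Analyst",
    "Internal Audit Analyst",
])

keywords = ["intern", "senior", "manager"]

# \b word boundaries stop "intern" from matching "Internal".
features = pd.DataFrame({
    f"is_{kw}": titles.str.contains(rf"\b{kw}\b", case=False, regex=True).astype(int)
    for kw in keywords
})

print(features)
```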

In [357]:
position_df = indeed_df[indeed_df['position'].isin([0, 1, 2, 3])]
position_df.shape

(27, 7)

In [354]:
lg = LogisticRegression()
lg.fit(position_df[['position']], position_df.level)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [358]:
score = lg.score(position_df[['position']], position_df.level)
score

0.5925925925925926

In [360]:
position_df

Unnamed: 0,company,location,Salary,title,position,state,level
16,SiriusXM,"New York, NY 10104 (Midtown area)",111500.0,"Intern, Data Science - Engineering - Part-Time",0,0,High
19,Oracle,"New York, NY",50000.0,Student / Intern,0,0,Low
67,State University of New York at Albany Res...,"New York, NY",26832.0,Statistical Intern - Charter Schools Institute...,0,0,Low
126,DEPT OF HEALTH/MENTAL HYGIENE,"Queens, NY",77519.0,"Senior Data Analyst, Bureau of Environmental S...",2,0,Low
475,Blue Owl,"San Francisco, CA",250000.0,Senior Data Engineer,2,4,High
631,Health & Human Services Comm,"Austin, TX",73200.0,Senior Statistical Analyst,2,1,Low
722,Wunderman,"Austin, TX",70500.0,Senior Accountant,2,1,Low
724,City of Austin,"Austin, TX 78702 (Rosewood area)",30960.0,Intern -Environmental Scientist Associate,0,1,Low
729,Indeed,"Austin, TX 78731",127000.0,Senior Product Manager,2,1,High
744,University of Texas at Austin,"Austin, TX",60000.0,Research Engineering/ Scientist Associate IV -...,2,1,Low


#### Rebuild this model with scikit-learn.
- You can either create the dummy features manually or use the `dmatrix` function from `patsy`
- Remember to scale the feature variables as well!


In [361]:
from sklearn import linear_model, metrics, model_selection

In [362]:
lm = linear_model.LinearRegression()

In [364]:
lm.fit(X=position_df[['position','state']], y=position_df['Salary'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [367]:
lm.score(X=position_df[['position','state']], y=position_df['Salary'])

0.15802787055225964
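
The cells above fit a `LinearRegression` on the raw salary, so the score is an R² rather than a classification accuracy. A classification version with dummy features and scaling, as the prompt suggests, might look like the sketch below. The `toy` frame is invented; with the real data, `position_df` would take its place.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical rows with the same columns as position_df.
toy = pd.DataFrame({
    "position": [0, 0, 0, 2, 2, 2, 2, 3, 3, 0],
    "state":    [0, 0, 1, 0, 4, 1, 1, 4, 0, 1],
    "level":    ["Low", "Low", "Low", "Low", "High", "Low", "High", "High", "High", "Low"],
})

# Treat both codes as categories and expand them into 0/1 dummy columns.
X = pd.get_dummies(toy[["position", "state"]].astype(str), dtype=float)
y = toy["level"]

# Scale the features, then fit the classifier, in one pipeline.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)

train_accuracy = clf.score(X, y)
print(f"Training accuracy: {train_accuracy:.2f}")
```

Training accuracy is optimistic; cross-validation gives a fairer estimate.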

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy, AUC, precision and recall of the model. 
- Discuss the differences and explain when you want a high-recall or a high-precision model in this scenario.

In [None]:
## YOUR CODE HERE
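
One possible way to get all four metrics at once is `cross_validate` with a list of scorers. The sketch below runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), with class 1 playing the role of "High".

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the job features; y == 1 plays the role of "High".
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)

# Scaling lives inside the pipeline so it is re-fit on each training fold.
clf = make_pipeline(StandardScaler(), LogisticRegression())

metrics_to_report = ["accuracy", "roc_auc", "precision", "recall"]
cv_results = cross_validate(clf, X, y, cv=5, scoring=metrics_to_report)

# Average each metric over the 5 folds.
mean_scores = {m: cv_results[f"test_{m}"].mean() for m in metrics_to_report}
for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```

In this scenario, high recall on "High" matters when missing a well-paid listing is costly, while high precision matters when chasing a listing wrongly flagged as "High" wastes effort.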

#### Compare L1 and L2 regularization for this logistic regression model. What effect does this have on the coefficients learned?

In [None]:
## YOUR CODE HERE
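
A simple way to compare the two penalties is to fit the same model twice and count the coefficients that end up exactly zero. The sketch below uses synthetic data and an arbitrary `C=0.05`; both are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 3 of which carry signal.
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# Same regularization strength, different penalty; liblinear supports both.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X, y)

# L1 tends to push weak coefficients to exactly zero; L2 only shrinks them.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("zero coefficients with L1:", n_zero_l1)
print("zero coefficients with L2:", n_zero_l2)
```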

In [None]:
## YOUR CODE HERE

#### Continue to incorporate other text features from the title or summary that you believe will predict the salary and examine their coefficients

#### Take ~100 scraped entries with salaries. Convert them to use with your model and predict the salary - which entries have the highest predicted salaries?

### BONUS 

#### Bonus: Use `CountVectorizer` from scikit-learn to create features from the text summaries. 
- Examine using count or binary features in the model
- Re-evaluate the logistic regression model using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [None]:
## YOUR CODE HERE
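
The difference between count and binary features can be seen on a few made-up summaries. In the sketch below, the `summaries` list is hypothetical; `binary=True` and `stop_words="english"` are standard `CountVectorizer` options.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical job summaries standing in for the scraped text.
summaries = [
    "Build machine learning models in Python and SQL",
    "Analyze data with SQL and SQL reporting tools",
    "Lead a team of machine learning engineers",
]

# Count features: how many times each word appears in a summary.
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(summaries)

# Binary features: 1 if the word appears at all, 0 otherwise.
binary_vec = CountVectorizer(stop_words="english", binary=True)
flags = binary_vec.fit_transform(summaries)

sql_col = count_vec.vocabulary_["sql"]
print("count of 'sql' in summary 2:", counts[1, sql_col])
print("flag for 'sql' in summary 2:", flags[1, binary_vec.vocabulary_["sql"]])
```

Either matrix can be passed directly to `LogisticRegression.fit` as the feature matrix.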

In [None]:
## YOUR CODE HERE

#### Re-test L1 and L2 regularization. You can use `LogisticRegressionCV` to find the optimal regularization parameters. 
- Re-test what text features are most valuable.  
- How do L1 and L2 change the coefficients?

In [None]:
## YOUR CODE HERE
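
A minimal sketch of the search, assuming synthetic data in place of the vectorized summaries: `LogisticRegressionCV` tries a grid of `C` values with cross-validation and keeps the best one. Swapping `penalty="l1"` for `"l2"` repeats the search for the other penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for the text feature matrix.
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# Try 10 values of C with 5-fold cross-validation under an L1 penalty.
search = LogisticRegressionCV(Cs=10, cv=5, penalty="l1",
                              solver="liblinear", scoring="accuracy")
search.fit(X, y)

best_C = float(np.ravel(search.C_)[0])
n_nonzero = int(np.sum(search.coef_ != 0))

print("best C:", best_C)
print("non-zero coefficients:", n_nonzero)
```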