# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Webscraping Project 4 Lab

Week 4 | Day 4

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary predictor with Logistic Regression.

We are going to collect salary information on data science jobs in a variety of markets. Then using the location, title and summary of the job we will attempt to predict the salary of the job. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being to able extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use Logistic Regression.

- Question: Why would we want this to be a classification problem?
- Answer: While more precision may be better, there is a fair amount of natural variance in job salaries - predicting a range be may be useful.

Therefore, the first part of the assignment will be focused on scraping Indeed.com (or other sites at your team's discretion). In the second part, the focus is on using listings with salary information to build a model and predict high or low salaries and what features are predictive of that result.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10")

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those. 

#### Setup a request (using `requests`) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for div tags with class name result)

In [385]:
URL = "http://www.indeed.com/jobs?q=data+scientist&l=Tucson%2C+AZ"

In [2]:
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re

In [386]:
## YOUR CODE HERE

r = requests.get(URL)

In [388]:
r

<Response [200]>

In [387]:
page = r.content

In [389]:
soup = BeautifulSoup(page, 'lxml')

Let's look at one result more closely. A single `result` looks like

```
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
C onduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The salary is available in a `nobr` element inside of a `td` element with `class='snip`.
- The title of a job is in a link with class set to `jobtitle` and a `data-tn-element="jobTitle`.  
- The location is set in a `span` with `class='location'`. 
- The company is set in a `span` with `class='company'`. 

In [7]:
print soup.prettify()

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <!-- pll -->
  <script src="/s/af4f8a6/en_US.js" type="text/javascript">
  </script>
  <link href="/s/6aa5893/jobsearch_all.css" rel="stylesheet" type="text/css"/>
  <link href="http://rss.indeed.com/rss?q=data+scientist+%2420%2C000&amp;l=New+York" rel="alternate" title="Data Scientist $20,000 Jobs, Employment in New York, NY" type="application/rss+xml"/>
  <link href="/m/jobs?q=data+scientist+%2420%2C000&amp;l=New+York" media="handheld" rel="alternate"/>
  <script type="text/javascript">
   window['closureReadyCallbacks'] = [];

    function call_when_jsall_loaded(cb) {
        if (window['closureReady']) {
            cb();
        } else {
            window['closureReadyCallbacks'].push(cb);
        }
    }
  </script>
  <script src="/s/954c65a/jobsearch-all-compiled.js" type="text/javascript">
  </script>
  <script type="text/javascript">
   var pingUrlsForGA = [];

var

### Write 4 functions to extract each item: location, company, job, and salary.

example: 
```python
def extract_location_from_result(result):
    return result.find ...
```


- Make sure these functions are robust and can handle cases where the data/field may not be available.
- Test the functions on the results above
- Include any other features you may want to use later (e.g. summary, #of reviews...)

In [492]:
## YOUR CODE HERE
import pandas as pd

def get_location(result):
    
    loc = []
    
    for d in result.findAll('div', class_ = "  row  result"):

        try:
            a = d.find('span', class_ = 'location')
            loc.append(a.text.strip())
        except:
            loc.append("None")

    for d in soup.findAll('div', class_ = "lastRow  row  result"):

        try:
            a = d.find('span', class_ = 'location')
            loc.append(a.text.strip())
        except:
            loc.append("None")
    
    return pd.Series(loc)

In [493]:
print get_location(soup)

0                               Tucson, AZ
1    Tucson, AZ 85712 (Glenn Heights area)
2                         Tucson, AZ 85750
3                               Tucson, AZ
4                               Tucson, AZ
5                               Tucson, AZ
6                               Tucson, AZ
7                               Tucson, AZ
8                               Tucson, AZ
9                               Tucson, AZ
dtype: object


In [490]:
def get_company(result):
    
    company = []
    
    for d in result.findAll('div', class_ = "  row  result"):

        try:
            a = d.find('span', class_ = 'company')
            company.append(a.text.strip())
        except:
            company.append("None")

    for d in soup.findAll('div', class_ = "lastRow  row  result"):

        try:
            a = d.find('span', class_ = 'company')
            company.append(a.text.strip())
        except:
            company.append("None")

    return pd.Series(company)

In [491]:
print get_company(soup)

0    University of Arizona
1     liver institute PLLC
2                Honeywell
3    University of Arizona
4                 Raytheon
5    University of Arizona
6    University of Arizona
7                 Raytheon
8    University of Arizona
9    University of Arizona
dtype: object


In [480]:
def get_job(result):
    
    job = []
    
    for d in result.findAll('div', class_ = "  row  result"):

        try:
            a = d.find(class_ = 'jobtitle')
            job.append(a.find('a').text.strip())
        except:
            job.append("None")

    for d in soup.findAll('div', class_ = "lastRow  row  result"):

        try:
            a = d.find(class_ = 'jobtitle')
            job.append(a.find('a').text.strip())
        except:
            job.append("None")
        
    return pd.Series(job)



In [571]:
print get_job(soup)

0                        Director, RWI Data Science
1               Clinical Data Sciences Program Lead
2                Data Scientist Intern, Summer 2017
3                               Statistical Analyst
4                          Data Scientist-Analytics
5              Research Scientist/ Bioinformatician
6                  Software Product Manager - Forms
7           Research Scientist I- Mass Spectrometry
8    2016 National Black MBA Association Conference
9               Senior Manager Predictive Analytics
dtype: object


In [430]:
def get_salary(result):
    sal = []
    
    for d in result.findAll('div', class_ = "  row  result"):

        try:
            a = d.findAll('td', class_ = 'snip')
            sal.append((str(d.nobr)).strip())
        except:
            sal.append("None")

    for d in soup.findAll('div', class_ = "lastRow  row  result"):

        try:
            a = d.findAll('td', class_ = 'snip')
            sal.append((str(d.nobr)).strip())
        except:
            sal.append("None")
    
    return pd.Series(sal)

In [431]:
print get_salary(soup)

0                                    None
1                <nobr>$10 an hour</nobr>
2                                    None
3                                    None
4                                    None
5                                    None
6                                    None
7                                    None
8             <nobr>$60,000 a year</nobr>
9    <nobr>$10.82 - $13.53 an hour</nobr>
dtype: object


In [426]:
def get_review(result):
    rev = []
    
    for d in result.findAll('div', class_ = "  row  result"):

        try:
            a = d.find('span', class_ = 'slNoUnderline').text
            rev.append(str(a).strip())
        except:
            rev.append("None")

    for d in soup.findAll('div', class_ = "lastRow  row  result"):

        try:
            a = d.find('span', class_ = 'slNoUnderline').text
            rev.append(str(a).strip())

        except:
            rev.append("None")

    return pd.Series(rev)
    
    

In [427]:
print get_review(soup)

0      236 reviews
1             None
2    3,474 reviews
3      236 reviews
4    1,247 reviews
5      236 reviews
6      236 reviews
7    1,247 reviews
8      236 reviews
9      236 reviews
dtype: object


In [519]:
# Extract ratings from string

def reg_rate(x):
    matchObj = re.findall( r'(\d+\.\d)', x, re.I)
    if matchObj:
        return float(matchObj[0])
    else:
        return 0.

In [524]:
def get_rating(result):
    
    rate = []
    
    for d in result.findAll('div', class_ = "  row  result"):

        try:
            a = d.find('span', class_ = 'rating')['style']
            rate.append(reg_rate(str(a).strip()))
        except:
            rate.append("None")

    for d in soup.findAll('div', class_ = "lastRow  row  result"):

        try:
            a = d.find('span', class_ = 'rating')['style']
            rate.append(reg_rate(str(a).strip()))

        except:
            rate.append("None")

    return pd.Series(rate)
    
    

In [527]:
print get_rating(soup)

0    53.4
1    None
2    44.4
3    53.4
4      51
5    53.4
6    53.4
7      51
8    53.4
9    53.4
dtype: object


In [670]:
a = '<span class="summary" itemprop="description">Develop <b>data</b> mining and/or <b>data</b> modeling plans to support RWI&amp;A projects and programs. Manage, mentor and develop a high performing <b>data</b> <b>scientists</b> team through...</span>'

In [672]:
print a.replace()

<span class="summary" itemprop="description">Develop data</b> mining and/or data</b> modeling plans to support RWI&amp;A projects and programs. Manage, mentor and develop a high performing data</b> scientists</b> team through...</span>


In [678]:
def reg_sum(x):
    
    y = x[45:(len(x) - 7)]
    y = y.replace('<b>', '')
    y = y.replace('</b>', '')
    
    return y.strip()

In [679]:
def get_summary(result):
    
    summ = []
    
    for d in result.findAll('div', class_ = "  row  result"):

        try:
            a = d.find('span', class_ = 'summary')
            summ.append(reg_sum(str(a)))
        except:
            summ.append("None")

    for d in soup.findAll('div', class_ = "lastRow  row  result"):

        try:
            a = d.find('span', class_ = 'summary')
            summ.append(reg_sum(str(a)))
        except:
            summ.append("None")

    return pd.Series(summ)
    
    

In [680]:
print get_summary(soup)

0    Develop data mining and/or data modeling plans...
1    Clinical Data Scientist:. Clinical Data Scienc...
2    Modeling historical data to forecast demand, p...
3    Present data findings to management and senior...
4    3+ years' experience analyzing large data sets...
5    Collaborate with information technology on int...
6    From data scientists to sales and operations e...
7    This candidate is expected to have industry ex...
8    Integrity is imperative for all financial data...
9    The Data Scientist has a passion for analyzing...
dtype: object


Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.

- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here we can alter to collect more results, the `l=New+York` and the `start=10`. The first controls the location of the results (so we can try a different city). The second controls where in the results to start and gives 10 results (thus, we can keep incrementing by 10 to go further in the list).

### Complete the following code to collect results from multiple cities and starting points. 
- Indeed.com only has salary information for an estimated 20% of job postings. You may want to add other cities to the list below to gather more data. 
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different

Source
[Best_cities_for_data_scientists]('https://infogr.am/30f00c18-997e-4a9d-b903-191899a53890')

In [358]:
# Importing a df of cities best for data science

df_cities = pd.read_csv('ds_cities.csv')
df_cities.head()

Unnamed: 0,City,State
0,Morrisville,NC
1,Palo Alto,CA
2,Redmond,WA
3,Mountain View,CA
4,El Segundo,CA


In [371]:
# Indeed.com's url format of city, state is like this 'Tucson%2C+AZ'

cities_list = []

for i in range(len(df_cities)):
    x = df_cities.iloc[i, 0] + "%2C+" + df_cities.iloc[i, 1]
    cities_list.append(x)

In [372]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={&start={"

In [528]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={&start={"
url_list = url_template.split('{')
max_results_per_city = 100

results = []

for city in set(cities_list):
    
    for start in range(0, max_results_per_city, 10):
        
        url = url_list[0] + city + url_list[1] + str(start)
        
        r = requests.get(url)
        if str(r) == "<Response [200]>":
            
            page = r.content
            soup = BeautifulSoup(page, 'lxml')
            
            results.append(get_job(soup))
            results.append(get_location(soup))
            results.append(get_company(soup))
            results.append(get_salary(soup))
            results.append(get_review(soup))
            results.append(get_rating(soup))
            results.append(get_summary(soup))
            
        else:
            print "This url is not responding favorably."
            print "url =", url
            print "response =", r
            
        # Grab the results from the request (as above)
        # Append to the full set of results

#### Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.

In [529]:
# Initializing the dataframe

df = pd.DataFrame(results[0:7]).T
print len(df)
df.head()


10


Unnamed: 0,0,1,2,3,4,5
0,Operations Research Analyst,"Hampton, VA",AECOM,,"1,806 reviews",43.2
1,Market Research Analyst II,"Newport News, VA 23606",Canon,,431 reviews,43.2
2,Research Analyst,"Hampton, VA",Thomas Nelson Comm College,"<nobr>$55,000 - $64,260 a year</nobr>",2 reviews,54.0
3,"PRINCIPAL OPERATIONS RESEARCH ANALYST, NORFOLK...","Norfolk, VA",CACI International Inc,,815 reviews,44.4
4,I&O Analyst -- Reporting & Metrics Analysis,"Norfolk, VA 23510",ADP,,"1,655 reviews",44.4


In [531]:
# Number of columns scrapped
m = 7

for n in range(m, (len(results) - m), m):
    
    # Creates an interim df for each page scrapped
    df1 = pd.DataFrame(results[n:(n+m)]).T
    
    # Concatenate column-wise with df
    df = pd.concat([df,df1])

# Labelling each coloumn
df.columns = ['Job Title', 'Location', 'Company', 'Salary', 'No. of Reviews', 'Company Ratings', 'Job Summary']

# Checking df
print len(df)
df.head()

31295


Unnamed: 0,Job Title,Location,Company,Salary,No. of Reviews,Company Ratings
0,Operations Research Analyst,"Hampton, VA",AECOM,,"1,806 reviews",43.2
1,Market Research Analyst II,"Newport News, VA 23606",Canon,,431 reviews,43.2
2,Research Analyst,"Hampton, VA",Thomas Nelson Comm College,"<nobr>$55,000 - $64,260 a year</nobr>",2 reviews,54.0
3,"PRINCIPAL OPERATIONS RESEARCH ANALYST, NORFOLK...","Norfolk, VA",CACI International Inc,,815 reviews,44.4
4,I&O Analyst -- Reporting & Metrics Analysis,"Norfolk, VA 23510",ADP,,"1,655 reviews",44.4


In [174]:
df['Salary'].unique()

array(['None', '<nobr>$55,000 - $64,260 a year</nobr>',
       '<nobr>$94,963 a year</nobr>',
       '<nobr>$33,770 - $55,060 a year</nobr>',
       '<nobr>$187,460 a year</nobr>', '<nobr>$40,861 a year</nobr>',
       '<nobr>$90,000 - $127,000 a year</nobr>',
       '<nobr>$48,000 a year</nobr>',
       '<nobr>$80,000 - $125,000 a year</nobr>',
       '<nobr>$100,000 - $171,000 a year</nobr>',
       '<nobr>$100,000 - $170,000 a year</nobr>',
       '<nobr>$37.08 an hour</nobr>',
       '<nobr>$88,305 - $114,802 a year</nobr>',
       '<nobr>$104,349 - $135,656 a year</nobr>',
       '<nobr>$130,000 a year</nobr>', '<nobr>$80,000 a year</nobr>',
       '<nobr>$30,000 - $32,000 a year</nobr>',
       '<nobr>$90,000 - $102,000 a year</nobr>',
       '<nobr>$150,000 - $205,000 a year</nobr>',
       '<nobr>$100,000 - $125,000 a year</nobr>',
       '<nobr>$130,000 - $180,000 a year</nobr>',
       '<nobr>$63,863 a year</nobr>',
       '<nobr>$65,572 - $84,611 a year</nobr>',
       '<nob

In [532]:
# Convert money in string form of $100,000 or $20 into float values

def conv_money(x):
    
    if ',' in x:
        x = x[1:].split(',')
        y = float(x[0] + x[1])
    else:
        y = float(x[1:])
    
    return y

In [533]:
# Find salary values

def reg_sal_low(x):
    matchObj = re.findall( r'(\$\d+\W?\d*\s)', x, re.I)
    if matchObj:
        return conv_money(matchObj[0])
    else:
        return 0.

In [534]:
def reg_sal_high(x):
    matchObj = re.findall( r'(\$\d+\W?\d*\s)', x, re.I)
    if matchObj:
        try:
            return conv_money(matchObj[1])
        except:
            return 0.
    else:
        return 0.


In [543]:
def reg_sal_freq(x):
    
    matchObj = re.findall( r'(year|month|hour|day)', x, re.I)
    if matchObj:
        return matchObj[0]
    else:
        return 'None'

In [536]:
df['Salary Lower'] = df['Salary'].apply(reg_sal_low)

In [537]:
df['Salary Higher'] = df['Salary'].apply(reg_sal_high)

In [544]:
df['Salary Frequency'] = df['Salary'].apply(reg_sal_freq)

In [545]:
df.head()

Unnamed: 0,Job Title,Location,Company,Salary,No. of Reviews,Company Ratings,Salary Lower,Salary Higher,Salary Frequency
0,Operations Research Analyst,"Hampton, VA",AECOM,,"1,806 reviews",43.2,0.0,0.0,
1,Market Research Analyst II,"Newport News, VA 23606",Canon,,431 reviews,43.2,0.0,0.0,
2,Research Analyst,"Hampton, VA",Thomas Nelson Comm College,"<nobr>$55,000 - $64,260 a year</nobr>",2 reviews,54.0,55000.0,64260.0,year
3,"PRINCIPAL OPERATIONS RESEARCH ANALYST, NORFOLK...","Norfolk, VA",CACI International Inc,,815 reviews,44.4,0.0,0.0,
4,I&O Analyst -- Reporting & Metrics Analysis,"Norfolk, VA 23510",ADP,,"1,655 reviews",44.4,0.0,0.0,


In [539]:
len(df[df['Salary'] != 'None'])

2272

In [547]:
df_sal = df[df['Salary'] != 'None']
df_sal.head()

Unnamed: 0,Job Title,Location,Company,Salary,No. of Reviews,Company Ratings,Salary Lower,Salary Higher,Salary Frequency
2,Research Analyst,"Hampton, VA",Thomas Nelson Comm College,"<nobr>$55,000 - $64,260 a year</nobr>",2 reviews,54.0,55000.0,64260.0,year
3,Sr. Socio-behavioral Scientist,"Norfolk, VA",Eastern Virginia Medical School,"<nobr>$94,963 a year</nobr>",35 reviews,42.0,94963.0,0.0,year
0,Research Analyst,"Norfolk, VA","City of Norfolk, VA","<nobr>$33,770 - $55,060 a year</nobr>",,,33770.0,55060.0,year
2,"Executive Director of the Virginia Modeling, A...","Norfolk, VA",Old Dominion University,"<nobr>$187,460 a year</nobr>",148 reviews,53.4,187460.0,0.0,year
2,"Executive Director of the Virginia Modeling, A...","Norfolk, VA",Old Dominion University,"<nobr>$187,460 a year</nobr>",148 reviews,53.4,187460.0,0.0,year


In [548]:
df_sal.loc[:,'Salary'] = df_sal['Salary'].apply(lambda x: x[6:(len(x) - 7)])
df_sal.head()

Unnamed: 0,Job Title,Location,Company,Salary,No. of Reviews,Company Ratings,Salary Lower,Salary Higher,Salary Frequency
2,Research Analyst,"Hampton, VA",Thomas Nelson Comm College,"$55,000 - $64,260 a year",2 reviews,54.0,55000.0,64260.0,year
3,Sr. Socio-behavioral Scientist,"Norfolk, VA",Eastern Virginia Medical School,"$94,963 a year",35 reviews,42.0,94963.0,0.0,year
0,Research Analyst,"Norfolk, VA","City of Norfolk, VA","$33,770 - $55,060 a year",,,33770.0,55060.0,year
2,"Executive Director of the Virginia Modeling, A...","Norfolk, VA",Old Dominion University,"$187,460 a year",148 reviews,53.4,187460.0,0.0,year
2,"Executive Director of the Virginia Modeling, A...","Norfolk, VA",Old Dominion University,"$187,460 a year",148 reviews,53.4,187460.0,0.0,year


Lastly, we need to clean up salary data. 
1. Some of the salaries are not yearly but hourly, these will be useful to us for now
2. The salaries are given as text and usually with ranges.

#### Filter out the salaries that are not yearly (filter those that refer to hour)

In [556]:
## YOUR CODE HERE

df_sal_hr = df_sal[df_sal['Salary Frequency'] == 'hour']
len(df_sal_hr)

494

In [561]:
df_sal_year = df_sal[(df_sal['Salary Frequency'] != 'hour') & (df_sal['Salary Frequency'] != 'day')]
len(df_sal_year)

1764

In [562]:
df_sal_year.loc[df_sal_year['Salary Frequency'] == 'month', ['Salary Lower', 'Salary Higher']] = \
df_sal_year.loc[df_sal_year['Salary Frequency'] == 'month', ['Salary Lower', 'Salary Higher']].applymap(lambda x: x*12)

In [243]:
df_sal_year[df_sal_year['Salary Frequency'] == 'month']

Unnamed: 0,Job Title,Location,Company,Salary,Salary Lower,Salary Higher,Salary Frequency
10,Research Analyst,"West Valley City, UT",West Valley College,"$6,327 - $6,930 a month",75924.0,83160.0,month
5,SENIOR DASHBOARD DATA SCIENTIST (160050),"Los Angeles, CA",California State University,"$6,249 - $10,857 a month",74988.0,130284.0,month
6,SERVICE DESK ANALYST,"Seattle, WA",University of Washington,"$3,635 - $4,888 a month",43620.0,58656.0,month
11,Forensic Scientist - DNA - Portland,"Portland, OR",State of Oregon,"$3,904 - $6,869 a month",46848.0,82428.0,month
8,Research Analyst,"West Valley City, UT",West Valley College,"$6,327 - $6,930 a month",75924.0,83160.0,month
7,Fintech Data Scientist,"Rye Brook, NY",ACP Investment Group,"$4,100 a month",49200.0,0.0,month
10,Associate Legislative Research Analyst 1,"Nashville, TN 37243",TN Comptroller of the Treasury,"$3,674 a month",44088.0,0.0,month
4,Research and Systems Technology Analyst,"Oakland, CA",Peralta Community College District,"$6,730 - $7,395 a month",80760.0,88740.0,month
7,SERVICE DESK ANALYST,"Seattle, WA",University of Washington,"$3,635 - $4,888 a month",43620.0,58656.0,month
8,Fintech Data Scientist,"Rye Brook, NY",ACP Investment Group,"$4,100 a month",49200.0,0.0,month


In [245]:
df_sal_year.loc[:, "Salary"] = (df_sal_year['Salary Lower'] + df_sal_year['Salary Higher']) / 2.0

df_sal_year.loc[df_sal_year['Salary Higher'] != 0, ['Salary Lower', 'Salary Higher']] = \
df_sal_year.loc[df_sal_year['Salary Higher'] != 0, ['Salary Lower', 'Salary Higher']].applymap(lambda x: x*12)

In [341]:
def get_state(x):
    matchObj = re.search( r'([A-Z][A-Z]$)|([A-Z][A-Z]\s)', x)
    if matchObj:
        return str(matchObj.group()).strip()
    else:
        return 'None'


In [309]:
df['Location'].unique()

array([u'United States', u'Hampton, VA', u'Newport News, VA 23606', ...,
       u'Tucson, AZ 85718', u'Tucson, AZ 85756',
       u'Tucson, AZ 85702 (Barrio Viejo area)'], dtype=object)

In [356]:
a = 'United States'
print get_state(a)

None


In [343]:
df_sal_year.loc[:,'State'] = df_sal_year["Location"].apply(get_state)
len(df_sal_year[df_sal_year['State'] == 'None'])

0

In [314]:
# Import full list of states to check extraction of states

states = pd.read_csv('states.csv')

In [347]:
st = df_sal_year['State'].unique()

In [348]:
# Checking to see if any non-state two-letter abbreviation was extracted

for i in st:
    if i not in list(states['State']):
        print "These are not actual states:", i

In [349]:
df_sal_year.head()

Unnamed: 0,Job Title,Location,Company,Salary,Salary Lower,Salary Higher,Salary Frequency,State
5,Research Analyst,"Hampton, VA",Thomas Nelson Comm College,59630.0,55000.0,64260.0,year,VA
3,Sr. Socio-behavioral Scientist,"Norfolk, VA",Eastern Virginia Medical School,47481.5,94963.0,0.0,year,VA
10,Research Analyst,"Norfolk, VA","City of Norfolk, VA",44415.0,33770.0,55060.0,year,VA
8,"Executive Director of the Virginia Modeling, A...","Norfolk, VA",Old Dominion University,93730.0,187460.0,0.0,year,VA
8,"Executive Director of the Virginia Modeling, A...","Norfolk, VA",Old Dominion University,93730.0,187460.0,0.0,year,VA


UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 17: ordinal not in range(128)

#### Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary

In [23]:
## YOUR CODE HERE

### Save your results as a CSV

In [24]:
## YOUR CODE HERE