# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Web Scraping & Logistic Regression


1. Collect data on data science salary trends from a job listings aggregator for your analysis.
  - Select and parse data from at least 1,000 job postings, potentially from multiple location searches.
2. Find out which factors most directly impact salaries (title, location, department, etc.). In this case, we do not want to predict mean salary as a regression would. Your boss believes that salary is better represented in categories than as a continuous value.
  - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Prepare a presentation for your Principal detailing your analysis.
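
To make the "categories rather than continuous" idea concrete, here is a minimal sketch that turns a list of salaries into a binary high/low label. It assumes a median cutoff (the notebook below uses the mean instead), and the salary figures are made up for illustration.

```python
from statistics import median

def to_salary_category(salaries):
    """Label each salary 1 (above the median) or 0 (at or below it)."""
    cutoff = median(salaries)
    return [1 if s > cutoff else 0 for s in salaries], cutoff

# Made-up annual salaries, for illustration only.
example_salaries = [60000, 85000, 101553, 120000, 150000]
labels, cutoff = to_salary_category(example_salaries)
print(cutoff)  # 101553
print(labels)  # [0, 0, 0, 1, 1]
```

A classifier such as logistic regression can then be trained on these labels instead of the raw salary values.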

**BONUS PROBLEMS:**
1. Your boss would rather tell a client incorrectly that they would get a lower salary job than tell a client incorrectly that they would get a high salary job. Adjust one of your logistic regression models to ease her mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.
2. Text variables and regularization:
  - **Part 1**: Job descriptions contain more potentially useful information you could leverage. Use the job summary to find words you think would be important and add them as predictors to a model.
  - **Part 2**: Gridsearch parameters for Ridge and Lasso for this model and report the best model.
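
One possible way to approach Part 1 is to count which terms appear across job summaries and turn a chosen few into 0/1 predictor columns. The sketch below uses only the standard library; the tiny `STOPWORDS` set, the helper names, and the sample summaries are assumptions for illustration (the notebook imports nltk's stopword list instead).

```python
import re
from collections import Counter

# A tiny hand-written stopword list, used here so the sketch has no dependencies.
STOPWORDS = {"the", "and", "or", "of", "in", "to", "a", "with", "for"}

def tokenize(text):
    """Lowercase a summary and split it into alphabetic tokens, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def top_terms(summaries, n):
    """Count how many summaries each term appears in and return the n most common."""
    counts = Counter()
    for summary in summaries:
        counts.update(set(tokenize(summary)))
    return [term for term, _ in counts.most_common(n)]

def keyword_features(summary, terms):
    """One 0/1 indicator per term: does the summary mention it?"""
    tokens = set(tokenize(summary))
    return [1 if term in tokens else 0 for term in terms]

summaries = [
    "Machine learning and statistics experience required",
    "PhD in statistics or machine learning",
    "SQL Server knowledge for developing queries",
]
terms = ["machine", "statistics", "sql"]
rows = [keyword_features(s, terms) for s in summaries]
print(rows)  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```

The resulting indicator columns can be appended to the other predictors before fitting a regularized model, which is where Part 2 comes in.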


**Goal:** Scrape & clean data, run logistic regression, derive insights, present findings.

In [24]:
## Import libraries

import requests
import bs4
from bs4 import BeautifulSoup
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium import webdriver

import pandas as pd
import numpy as np
import datetime
import urllib
import urllib2
import re

from time import sleep # To prevent overwhelming the server between connections
from collections import Counter # Keep track of our term counts
from nltk.corpus import stopwords # Filter out stopwords, such as 'the', 'or', 'and'

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, auc, roc_auc_score, roc_curve
# sklearn.cross_validation and sklearn.grid_search were merged into sklearn.model_selection (scikit-learn 0.18+)
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

# Scrape Data - Indeed.com

In [3]:
################################################################
############### EDIT THESE CONSTANTS
############### 
###############        EDIT THESE CONSTANTS

MAX_RESULTS_PER_CITY = 1000      ### DO NOT SET MORE THAN 1000
URL_SEARCH_TERM = 'Data Scientist' ### DO NOT SET MORE THAN SINGLE SEARCH TERM (TITLE)
CITY_SET = ['New York', 'Chicago', 'San Francisco', 'Austin', 'Atlanta', '', 'Boston', 'Seattle',
            'Los Angeles', 'Washington, DC', 'San Jose', 'Denver', 'Houston',
            'Dallas', 'Nashville', 'San Diego', 'Cleveland', 'Minneapolis', 'Baltimore', 'Philadelphia', 'Detroit']
###############
################################################################


def extract_location_from_resultRow(result):
    try:
        location = (result.find(class_='location').text.strip())
    except:
        location = ''
    return location

def extract_company_from_resultRow(result):
    try:
        company = (result.find(class_='company').text.strip())
    except:
        company = ''
    return company

def extract_jkid_from_resultRow(result):
    # The job key is read from the 'data-jk' attribute of the result row itself
    try:
        jkid = result['data-jk']
    except: 
        jkid = ''
    return jkid

def extract_title_from_resultRow(result):
    try:
        title = (result.find(class_='turnstileLink'))
        title_text = title.text
    except: 
        title_text = ''
    return title_text

def extract_salary_from_resultRow(result):
    try:
        salary = (result.find(class_='snip').find('nobr').text)
    except:
        salary = ''
    salary_text = salary
    return salary_text

def extract_reviews_from_resultRow(result):
    try:
        reviews = (result.find(class_='slNoUnderline').text.strip().strip(' reviews').replace(',',''))
    except: 
        reviews = ''
    return reviews

def extract_stars_from_resultRow(result):
    try: 
        stars = (result.find(class_='rating')['style']).split(';background-position:')[1].split(':')[1].split('px')[0].strip()
    except: 
        stars = ''
    return stars

def extract_date_from_resultRow(result):
    try: 
        date = (result.find(class_='date').text.strip(' ago').strip())
    except: 
        date = ''
    return date

def extract_summary_from_resultRow(result):
    try: 
        summary = (result.find("span", {"itemprop" : "description"}).text.strip())
    except: 
        summary = ''
    return summary

In [4]:
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36")

driver = webdriver.PhantomJS(executable_path='/Users/kristensu/Downloads/phantomjs-2.1.1-macosx/bin/phantomJS', desired_capabilities=dcap)
driver.set_window_size(1024, 768) 

for city in CITY_SET:
    job_dict = []
    now = datetime.datetime.now()
    for start in range(0, MAX_RESULTS_PER_CITY, 10):

        URL = "http://www.indeed.com/jobs?q="+urllib.quote(URL_SEARCH_TERM)+"&l="+urllib.quote(city)+"&start="+str(start)
        driver.get(URL)
        soup = BeautifulSoup(driver.page_source, "lxml")

        for i in soup.findAll("div", {"data-tn-component" : "organicJob"}):

            location = extract_location_from_resultRow(i)
            company = extract_company_from_resultRow(i)
            summary = extract_summary_from_resultRow(i)
            jkid = extract_jkid_from_resultRow(i)
            title = extract_title_from_resultRow(i)
            salary = extract_salary_from_resultRow(i)
            reviews = extract_reviews_from_resultRow(i)
            stars = extract_stars_from_resultRow(i)
            post_date = extract_date_from_resultRow(i)

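            job_dict.append([location, company, summary, jkid, title, salary, stars, reviews, post_date, now])

        sleep(1)  # pause between result pages so the server is not overwhelmed

    # Build the DataFrame once per city, after all result pages have been parsed
    job_df = pd.DataFrame(job_dict, columns=['location', 'company', 'summary', 'jkid', 'title', 'salary', 'stars', 'reviews', 'post_date', 'pull_date'])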

    job_df.to_csv('scrape'+city+'_'+str(MAX_RESULTS_PER_CITY)+'.csv', encoding='utf-8')
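
The import cell brings in `sleep` to avoid overwhelming the server between connections. A minimal Python 3 sketch of building the paged search URLs and pausing between requests could look like the following. The helper names `build_search_urls` and `fetch_all` are hypothetical, and the stand-in `fetch` function replaces the PhantomJS driver so the sketch runs without network access.

```python
import time
from urllib.parse import quote

BASE_URL = "http://www.indeed.com/jobs"

def build_search_urls(term, city, max_results, page_size=10):
    """Return one search URL per results page for a term and city."""
    return [
        "{}?q={}&l={}&start={}".format(BASE_URL, quote(term), quote(city), start)
        for start in range(0, max_results, page_size)
    ]

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)
        pages.append(fetch(url))
    return pages

urls = build_search_urls("Data Scientist", "New York", max_results=30)
print(urls[1])
# http://www.indeed.com/jobs?q=Data%20Scientist&l=New%20York&start=10

# A stand-in fetch function; a real run would call driver.get(url) and read page_source.
pages = fetch_all(urls, fetch=lambda url: "<html>" + url + "</html>", delay_seconds=0)
print(len(pages))  # 3
```

Passing the fetch function in as an argument keeps the paging and throttling logic testable independently of the browser driver.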
        

# Scrape Data - glassdoor.com

In [None]:
### Insert Amish's code here

# Wrangle Data

In [None]:
### Merge df's here

# Import master.csv

In [31]:
master_df = pd.read_csv('master-3.csv')

# DELETE ANY HEADER ROWS LEFT OVER FROM CSV MERGE
try: master_df = master_df[master_df['reviews'] != 'reviews'] 
except: pass



# Clean Data

In [6]:
###### REVIEWS CLEAN TO FLOAT
######

master_df['reviews'] = master_df['reviews'].fillna(0)

def indeed_review_cleanup(review): 
    # Turn values such as '1,768 reviews' into 1768.0; leave anything unparseable unchanged
    try:
        review = str(review).replace(',','')
        review = review.replace('reviews','').replace('review','')
        review = review.strip()
        review = float(review)
    except:
        #print review
        pass
    return review

master_df['clean_review'] = master_df[['reviews']].applymap(lambda x:indeed_review_cleanup(x))

master_df['clean_review'].sort_values().unique()

master_df['clean_review'] = master_df['clean_review'].astype(float)
master_df['reviews'] = master_df['clean_review']
master_df.drop('clean_review', axis=1, inplace=True)

#########  END CLEAN REVIEWS
###################

In [7]:
###### POST_DATE CLEAN TO FLOAT
######

try:
    master_df['clean_post_date'] = master_df['post_date']
except: pass


def post_date_to_day_float(dateValue):
    # Convert strings such as '3 days' or '5 hours' to a number of days; '30+ days' is coded as 45
    try:
        temp = dateValue
        if 'day' in dateValue:
            temp = dateValue.split()[0]
        elif 'hour' in dateValue:
            temp = dateValue.split()[0]
            temp = float(temp)/24
        elif 'minute' in dateValue:
            temp = dateValue.split()[0]
            temp = float(temp)/24/60
        if '+' in dateValue:
            temp = 45           
    except: 
        pass
    return temp

master_df['clean_post_date'] = master_df[['clean_post_date']].applymap(lambda x: post_date_to_day_float(x))

master_df['clean_post_date'].sort_values().unique()

master_df['clean_post_date'] = master_df['clean_post_date'].astype(float)
master_df['post_date'] = master_df['clean_post_date']
master_df.drop('clean_post_date', axis=1, inplace=True)
master_df.rename(columns = {'post_date':'post_date_daysAgo'}, inplace=True)

#########  END CLEAN POST_DATE
###################

In [8]:
###### STARS CLEAN TO FLOAT
######


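# The scraped value is a pixel measure from the rating's style attribute;
# floor(value / 6) / 2 maps it to half-star steps (e.g. 51.0 -> 4.0). Rows without a rating stay NaN.
master_df['clean_stars'] = master_df['stars'].astype(float).apply(lambda x: x//6/2)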


master_df['stars'] = master_df['clean_stars']
master_df.drop('clean_stars', axis=1, inplace=True)


#########  END CLEAN STARS
###################

In [9]:
#####Create JOB_LINK column from JKID
#####

master_df['job_link'] = master_df[['jkid']].applymap(lambda x: 'http://www.indeed.com/rc/clk?jk='+x)

#########  END JOB_LINK COLUMN
###################

In [10]:
##### Location Cleanup
#####

def location_cleanup(location):
    temp = location
    temp_city = location.split(',')[0]
    try:
        temp_state = location.split(',')[1].split()[0]
    except: 
        temp_state = ''
    return temp_city+", "+temp_state
    
master_df['location_clean'] = master_df[['location']].applymap(lambda x: location_cleanup(x))
master_df['location_clean'].sort_values().unique()

master_df['location'] = master_df['location_clean']
master_df.drop('location_clean', axis=1, inplace=True)

#########  END LOCATION CLEANUP COLUMN
###################

In [11]:
##### Salary Cleanup
#####

master_df['salary'] = master_df['salary'].fillna(0)

def cleanup_salary(salary):
    # Return (low, high, average) as annual figures; (0, 0, 0) when no salary is listed
    if "year" in str(salary):
        temp = salary.replace("a year","")
        temp = temp.split('-')
        low_range = int(temp[0].strip().replace("$","").replace(",",""))
        high_range = int(temp[-1].strip().replace("$","").replace(",",""))
        avg = (low_range + high_range) / 2
    elif "month" in str(salary):
        temp = salary.replace("a month","")
        temp = temp.split('-')
        low_range = int(temp[0].replace("$","").replace(",",""))*12
        high_range = int(temp[-1].replace("$","").replace(",",""))*12
        avg = (low_range + high_range) / 2
    elif "hour" in str(salary):
        # 2080 = 40 hours a week x 52 weeks
        temp = salary.replace("an hour","")
        temp = temp.split('-')
        low_range = float(temp[0].replace("$","").replace(",",""))*2080
        high_range = float(temp[-1].replace("$","").replace(",",""))*2080
        avg = (low_range + high_range) / 2
    else:
        low_range = 0
        high_range = 0
        avg = 0
        
    return low_range, high_range, avg
master_df['salary_clean'] = master_df[['salary']].applymap(lambda x: cleanup_salary(x))

master_df['salary'] = master_df['salary_clean']

master_df['sal_low'] = master_df['salary'].apply(lambda x: x[0])
master_df['sal_high'] = master_df['salary'].apply(lambda x: x[1])
master_df['sal_avg'] = master_df['salary'].apply(lambda x: x[2])

master_df.drop('salary_clean', axis=1, inplace=True)
master_df.drop('salary', axis=1, inplace=True)



#########  END SALARY CLEANUP COLUMN
###################

In [None]:
# has_salary = master_df[master_df['salary'] != (0,0,0)].shape[0]
# all_records = master_df.shape[0]
# print "Job listings with salary info:", has_salary
# print "Total job listings: ", all_records
# print "Salaried listings / Total listings", round((float(has_salary) / all_records) * 100, 3), '%'
# master_df.head(5)
# master_df['title'].sort_values().unique()

# stacked = pd.DataFrame(master_df['summary'].str.split().tolist()).stack()
# final = pd.DataFrame(stacked.value_counts())
# final.reset_index(inplace=True)
# final['unique'] = final['index'].sort_values().unique()
# final['unique']
# import nltk
# final['tagged'] = final[['index']].applymap(lambda x: nltk.pos_tag(x.strip()))
# final.info()
master_df.shape

In [12]:
URL = 'https://www.expatistan.com/cost-of-living/index/north-america' 
driver = webdriver.PhantomJS(executable_path='/Users/kristensu/Downloads/phantomjs-2.1.1-macosx/bin/phantomjs')
driver.set_window_size(1024, 768) 
driver.get(URL)
soup = BeautifulSoup(driver.page_source,'lxml')


In [13]:
c = []
r = []
for table in soup.findAll('table',class_="city-index"):
    # Search within each matching table so rows are not collected more than once
    for cities in table.findAll('td',class_='city-name'):
        c.append(cities.text)
    for rank in table.findAll('td',class_='price-index'):
        r.append(rank.text)

In [None]:
len(c)

In [14]:
COL = pd.DataFrame([c, r]).T
COL = COL.rename(columns = {0:'Cities',1:'price_index'})
COL['Cities'] = COL['Cities'].apply(lambda x: str(x).split(',')).apply(lambda x: x[0]).apply(lambda x: str(x).replace(' (United States)',''))
COL['price_index'] = COL['price_index'].astype(int)
new_base  = COL.loc[0,'price_index']
COL['COL_new'] = COL['price_index'].apply(lambda x: float(new_base)/x)
COL.loc[16,'Cities'] = 'Minneapolis'
COL.loc[2,'Cities'] = 'Washington, DC'
COL.loc[0,'Cities'] = 'New York'
print master_df.shape
master_df = pd.merge(master_df,COL.iloc[:,[0,-1]], left_on='search_city',right_on='Cities',how='left')
del master_df['Cities']

master_df

(13138, 16)


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,location,company,summary,jkid,title,stars,reviews,post_date_daysAgo,pull_date,search_city,job_link,sal_low,sal_high,sal_avg,COL_new
0,0.0,0,"Atlanta, GA",KPMG,"Machine learning, data visualization, statisti...",53b7f855d4891e19,Data Scientist,4.0,1768.0,2.000000,2016-10-17 00:20:29.507371,Atlanta,http://www.indeed.com/rc/clk?jk=53b7f855d4891e19,0.0,0.0,0.0,1.502674
1,1.0,1,"Atlanta, GA",ASSURANT,"3+ years of relevant experience in analytics, ...",9ecd8095dd0355f8,Data Scientist,3.5,1107.0,10.000000,2016-10-17 00:20:29.507371,Atlanta,http://www.indeed.com/rc/clk?jk=9ecd8095dd0355f8,0.0,0.0,0.0,1.502674
2,2.0,2,"Atlanta, GA",360i,The Associate Data Scientist will be mentored ...,c2b6dcbcb0895072,Associate Data Scientist,4.0,9.0,11.000000,2016-10-17 00:20:29.507371,Atlanta,http://www.indeed.com/rc/clk?jk=c2b6dcbcb0895072,0.0,0.0,0.0,1.502674
3,3.0,3,"Atlanta, GA",Centers for Disease Control and Prevention,Whether we are protecting the American people ...,40d8215afa28f4bb,HEALTH SCIENTIST,4.5,64.0,1.000000,2016-10-17 00:20:29.507371,Atlanta,http://www.indeed.com/rc/clk?jk=40d8215afa28f4bb,88305.0,114802.0,101553.0,1.502674
4,4.0,4,"Atlanta, GA",Vesta Corporation,"Or PhD in Computer Science, Statistics, Applie...",24cec20de39398ca,Senior Data Scientist,3.0,31.0,9.000000,2016-10-17 00:20:29.507371,Atlanta,http://www.indeed.com/rc/clk?jk=24cec20de39398ca,0.0,0.0,0.0,1.502674
5,5.0,5,"Atlanta, GA",Cox Automotive,Interprets problems and develops solutions to ...,4dd0428a36b610d7,Data Scientist,3.0,42.0,12.000000,2016-10-17 00:20:29.507371,Atlanta,http://www.indeed.com/rc/clk?jk=4dd0428a36b610d7,0.0,0.0,0.0,1.502674
6,6.0,6,"Atlanta, GA",ASSURANT,The primary objective of this position is to e...,ed7467a761020f51,Sr Data Scientist - Fraud,3.5,1107.0,1.000000,2016-10-17 00:20:29.507371,Atlanta,http://www.indeed.com/rc/clk?jk=ed7467a761020f51,0.0,0.0,0.0,1.502674
7,7.0,7,"Atlanta, GA",Ga. Dept. of Admin. Services,SQL Server knowledge for developing queries an...,134c301ec3ba87c4,Statistical Data Analyst,,0.0,5.000000,2016-10-17 00:20:29.507371,Atlanta,http://www.indeed.com/rc/clk?jk=134c301ec3ba87c4,0.0,0.0,0.0,1.502674
8,8.0,8,"Atlanta, GA",State Farm Mutual Automobile Insurance Company,Academic background in quantitative discipline...,594e039f37d4ec5d,Research Statistician,4.0,3358.0,3.000000,2016-10-17 00:20:29.507371,Atlanta,http://www.indeed.com/rc/clk?jk=594e039f37d4ec5d,0.0,0.0,0.0,1.502674
9,9.0,9,"Atlanta, GA",Honeywell,"Data modeling, mining, pattern analysis, data ...",9b253c3eeae5f37a,Analytics Data Scientist,3.5,3474.0,10.000000,2016-10-17 00:20:29.507371,Atlanta,http://www.indeed.com/rc/clk?jk=9b253c3eeae5f37a,0.0,0.0,0.0,1.502674


In [35]:
#need to add more to filter out the titles from mid to senior or entry

senior = ['sr','senior','lead','instructor','principal', 'director','manager','consultant']
mid = []
entry = ['associate','Associate','intern','junior','-1']
senior_bin = []
mid_bin = []
entry_bin = []
for x in master_df['title']:
    if any(word in x.lower() for word in senior):
        senior_bin.append(1)
        mid_bin.append(0)
        entry_bin.append(0)
    elif any(word in x.lower() for word in entry):
        senior_bin.append(0)
        mid_bin.append(0)
        entry_bin.append(1)
    else:
        senior_bin.append(0)
        mid_bin.append(1)
        entry_bin.append(0)

In [36]:
# Assign the lists directly so values are matched by position rather than by index label
master_df['senior_bin'] = senior_bin
master_df['mid_bin'] = mid_bin
master_df['entry_bin'] = entry_bin
# for x in master_df[master_df['mid_bin'] == 1]['title'].unique():
#     print x

In [38]:
senior = ['sr','senior','lead','instructor','principal', 'director','manager','consultant','chief']
mid = ['data','scientist','analyst','analytics','statistician',"statistical",'machine learning']
entry = ['associate','Associate','intern','junior','-1']
senior_bin = []
mid_bin = []
entry_bin = []
other_bin = []
for x in master_df['title']:
    if any(word in x.lower() for word in senior):
        senior_bin.append(1)
        mid_bin.append(0)
        entry_bin.append(0)
        other_bin.append(0)
    elif any(word in x.lower() for word in entry):
        senior_bin.append(0)
        mid_bin.append(0)
        entry_bin.append(1)
        other_bin.append(0)        
    elif any(word in x.lower() for word in mid):
        senior_bin.append(0)
        mid_bin.append(1)
        entry_bin.append(0)
        other_bin.append(0)        
    else:
        senior_bin.append(0)
        mid_bin.append(0)
        entry_bin.append(0)
        other_bin.append(1)        
# Assign the lists directly so values are matched by position rather than by index label
master_df['senior_bin'] = senior_bin
master_df['mid_bin'] = mid_bin
master_df['entry_bin'] = entry_bin
master_df['other_bin'] = other_bin

In [39]:
# Mean salary across listings that report one (the median is an alternative cutoff)
mean_sal = master_df[master_df['sal_avg'] > 0]['sal_avg'].mean()
mean_sal

KeyError: 'sal_avg'

In [19]:
#expand on features, this creates the X and y
master_df['sal_bin'] = 0
master_df.loc[master_df['sal_avg'] > mean_sal,'sal_bin'] = 1
features = master_df[(master_df['sal_avg'] > 0) & (master_df['search_city'].notnull())]

X = features.loc[:,['search_city','senior_bin','mid_bin','entry_bin','COL_new']]
X = pd.get_dummies(X,columns = ['search_city'])
y = features.loc[:,'sal_bin']
X = X.astype(int)  # note: this also truncates the fractional COL_new values (e.g. 1.50 -> 1)


In [20]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.33, random_state = 77) ## create train-test out of the data given

In [21]:
# this is the code from the lab, using gridsearchCV
#need to expand on C_vals

logreg = LogisticRegression(solver='liblinear')
C_vals = [.1,.5,1]
#C_vals = np.linspace(.33,.66,50)
penalties = ['l1','l2']

gs = GridSearchCV(logreg, {'penalty': penalties, 'C': C_vals}, verbose=False, cv=3)
gs.fit(X_train, Y_train)

print gs.best_params_
logreg = LogisticRegression(C=gs.best_params_['C'], penalty=gs.best_params_['penalty'])
#logreg = LogisticRegression(C=.55, penalty=gs.best_params_['penalty'])
cv_model = logreg.fit(X_train, Y_train)
cv_pred = cv_model.predict(X_train)  # predictions on the training split, not X_test

{'penalty': 'l2', 'C': 0.5}
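
The comment in the cell above notes that `C_vals` needs expanding. A minimal sketch of a wider, log-spaced grid over both penalties is shown below. It assumes a recent scikit-learn (where `GridSearchCV` lives in `sklearn.model_selection`) and uses synthetic data from `make_classification` as a stand-in for the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the notebook would use its own X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=77)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=77)

param_grid = {
    "penalty": ["l1", "l2"],      # Lasso-style and Ridge-style penalties
    "C": np.logspace(-3, 2, 6),   # 0.001, 0.01, 0.1, 1, 10, 100
}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=3)
search.fit(X_tr, y_tr)

best_model = search.best_estimator_  # refit on the whole training split by default
test_accuracy = best_model.score(X_te, y_te)
print(search.best_params_)
print(test_accuracy)
```

Using `search.best_estimator_` avoids refitting a second model by hand, and scoring on the held-out split gives a less optimistic estimate than scoring on the training data.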


In [22]:
cv_model.coef_
x_list = X.columns.tolist()

# need to get the coefficients into a list so they can be paired with the column names in a DataFrame


#coef_list = cv_model.coef_.tolist(
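
One way to pair coefficients with column names, sketched with plain lists. The feature names and coefficient values below are hypothetical stand-ins for `X.columns.tolist()` and `cv_model.coef_.tolist()`; they are not results from this notebook.

```python
def rank_coefficients(feature_names, coef_rows):
    """Pair each feature with its coefficient, largest magnitude first."""
    coefs = coef_rows[0]  # a binary logistic regression has a single row of coefficients
    if len(coefs) != len(feature_names):
        raise ValueError("expected one coefficient per feature")
    return sorted(zip(feature_names, coefs), key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical values, for illustration only.
x_list_demo = ["senior_bin", "mid_bin", "entry_bin", "COL_new"]
coef_list_demo = [[1.2, 0.1, -0.9, -0.4]]

ranked = rank_coefficients(x_list_demo, coef_list_demo)
for name, coef in ranked:
    print("%-12s % .2f" % (name, coef))
```

Sorting by absolute value puts the most influential predictors first, whichever direction they push the prediction.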

In [25]:
y_score = cv_model.decision_function(X_train)
conmat = np.array(confusion_matrix(Y_train, cv_pred, labels=[1,0]))
confusion = pd.DataFrame(conmat,index=['over_mean', 'under_mean'],
                         columns=['predicted_over_mean','predicted_under_mean'])
print confusion
# Used to verify the confusion matrix
# #confusion
# pred_series = pd.Series(cv_pred).to_frame()
# y_check = Y_train.to_frame()
# y_check.reset_index(inplace = True,drop= True)

# conmat_check = pd.concat([y_check,pred_series],axis = 1)
# conmat_check[conmat_check['sal_bin']==1][0].sum()
# # sub_yscore = y_score_sub.reshape((len(y_score_sub),1))
# # sub_yscore.shape

print classification_report(Y_train,cv_pred)
roc_auc_score(Y_train, y_score)

            predicted_over_mean  predicted_under_mean
over_mean                    87                    59
under_mean                   30                   193
             precision    recall  f1-score   support

          0       0.77      0.87      0.81       223
          1       0.74      0.60      0.66       146

avg / total       0.76      0.76      0.75       369



0.82634068431721852
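
For the first bonus problem, one common approach is to raise the probability threshold for predicting the high-salary class, which trades missed high-salary jobs for fewer false "high" predictions. The sketch below illustrates that tradeoff on made-up labels and probabilities; `confusion_counts` is a hypothetical helper, not part of scikit-learn.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, fn, tn) when predicting 1 for probabilities >= threshold."""
    tp = fp = fn = tn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Made-up labels and predicted probabilities of the high-salary class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.95, 0.80, 0.65, 0.40, 0.75, 0.55, 0.30, 0.10]

print(confusion_counts(y_true, y_prob, 0.5))  # (3, 2, 1, 2)
print(confusion_counts(y_true, y_prob, 0.7))  # (2, 1, 2, 3)
```

Moving the threshold from 0.5 to 0.7 halves the false positives here (2 to 1) at the cost of one more false negative. With a fitted scikit-learn model, the probabilities would come from `predict_proba(X)[:, 1]`.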

In [26]:
glassdoor_df = pd.read_csv('glassdoor_df.csv')

In [27]:
glassdoor_df.head(3)

Unnamed: 0.1,Unnamed: 0,company,title,meanPay,City,low_sal,high_sal,sal_avg,entry_bin,mid_bin,senior_bin
0,0,careerbuilder,Data Scientist,86172,atlanta,86172,86172,86172,0,1,0
1,1,ncr,Data Scientist,86000 - 94000,atlanta,86000,94000,90000,0,1,0
2,2,the home depot,Data Scientist,90000 - 104000,atlanta,90000,104000,97000,0,1,0


In [33]:
master_df.head(100)

Unnamed: 0.1,Unnamed: 0,location,company,summary,jkid,title,salary,stars,reviews,post_date,pull_date,search_city
0,0,"Atlanta, GA 30338",KPMG,"Machine learning, data visualization, statisti...",53b7f855d4891e19,Data Scientist,,51.000000,1768.0,3 days,2016-10-17 23:45:52.664015,Atlanta
1,1,"Atlanta, GA 30339",ASSURANT,"3+ years of relevant experience in analytics, ...",9ecd8095dd0355f8,Data Scientist,,43.200000,1108.0,11 days,2016-10-17 23:45:52.664015,Atlanta
2,2,"Atlanta, GA 30306 (Virginia-Highland area)",360i,The Associate Data Scientist will be mentored ...,c2b6dcbcb0895072,Associate Data Scientist,,51.000000,9.0,12 days,2016-10-17 23:45:52.664015,Atlanta
3,3,"Atlanta, GA",Ga. Dept. of Admin. Services,SQL Server knowledge for developing queries an...,70f23d3b618d4b8e,Statistical Data Analyst,"$36,000 - $46,000 a year",,,6 days,2016-10-17 23:45:52.664015,Atlanta
4,4,"Atlanta, GA",Honeywell,"Data modeling, mining, pattern analysis, data ...",9b253c3eeae5f37a,Analytics Data Scientist,,44.400000,3477.0,11 days,2016-10-17 23:45:52.664015,Atlanta
5,5,"Atlanta, GA 30339",ASSURANT,The primary objective of this position is to e...,ed7467a761020f51,Sr Data Scientist - Fraud,,43.200000,1108.0,2 days,2016-10-17 23:45:52.664015,Atlanta
6,6,"Atlanta, GA 30303 (Five Points area)",Vesta Corporation,"Or PhD in Computer Science, Statistics, Applie...",24cec20de39398ca,Senior Data Scientist,,40.200000,31.0,10 days,2016-10-17 23:45:52.664015,Atlanta
7,7,"Atlanta, GA",Centers for Disease Control and Prevention,Whether we are protecting the American people ...,40d8215afa28f4bb,HEALTH SCIENTIST,"$88,305 - $114,802 a year",54.000000,64.0,2 days,2016-10-17 23:45:52.664015,Atlanta
8,8,"Atlanta, GA",Travelport,Developing and implementing advanced analytics...,324aa9db249c02a7,Data Scientist,,51.600000,52.0,30+ days,2016-10-17 23:45:52.664015,Atlanta
9,9,"Atlanta, GA 30308 (Old Fourth Ward area)",Stackfolio,A competitive full-time salary as well as a gr...,be3a7444db34e9a3,Lead Data Scientist,"$80,000 a year",,,30+ days,2016-10-17 23:45:52.664015,Atlanta


Lastly, we need to clean up salary data. 
1. Some of the salaries are hourly rather than yearly; these will not be useful to us for now.
2. The salaries are given as text, usually as ranges.

#### Filter out the salaries that are not yearly (filter those that refer to hour)

In [None]:
## YOUR CODE HERE
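
A minimal sketch of the filtering step, assuming salary strings in the formats handled earlier in this notebook ("a year", "a month", "an hour"). The helper name `is_yearly` and the sample values are illustrative.

```python
def is_yearly(salary_text):
    """True when the posting quotes an annual salary."""
    return isinstance(salary_text, str) and "a year" in salary_text

# Sample strings in the format of the scraped 'salary' column; None stands for a missing value.
raw_salaries = ["$36,000 - $46,000 a year", "$25 an hour", None, "$80,000 a year", "$4,000 a month"]
yearly_only = [s for s in raw_salaries if is_yearly(s)]
print(yearly_only)  # ['$36,000 - $46,000 a year', '$80,000 a year']
```

On a DataFrame, the same filter can be written as `df[df['salary'].str.contains('a year', na=False)]`.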

#### Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary

In [None]:
## YOUR CODE HERE
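
A possible regex-based version of this function, sketched for Python 3. It assumes every dollar amount in the string belongs to the salary and averages them when a range is given; `parse_yearly_salary` is a hypothetical name.

```python
import re

def parse_yearly_salary(salary_text):
    """Convert a salary string to a float, averaging the endpoints of a range."""
    amounts = [
        float(amount.replace(",", ""))
        for amount in re.findall(r"\$([\d,]+(?:\.\d+)?)", salary_text)
    ]
    if not amounts:
        return None  # no dollar amount found
    return sum(amounts) / len(amounts)

print(parse_yearly_salary("$36,000 - $46,000 a year"))   # 41000.0
print(parse_yearly_salary("$80,000 a year"))             # 80000.0
print(parse_yearly_salary("$88,305 - $114,802 a year"))  # 101553.5
```

The last value differs slightly from the 101553 that `cleanup_salary` produces, because that function averages two integers under Python 2, where `/` is integer division.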

### Save your results as a CSV

In [None]:
## YOUR CODE HERE