# Web Scraping for Indeed.com & Predicting Salaries

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary predictor with Logistic Regression.

We are going to collect salary information on data science jobs in a variety of markets. Then using the location, title and summary of the job we will attempt to predict the salary of the job. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being to able extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use Logistic Regression.

- Question: Why would we want this to be a classification problem?
- Answer: While more precision may be better, there is a fair amount of natural variance in job salaries - predicting a range be may be useful.

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10")

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those. 

#### Setup a request (using `requests`) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for div tags with class name result)

In [1]:
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=30"

In [2]:
import requests
import bs4 as soup
from bs4 import BeautifulSoup
from IPython.core.display import HTML, Image
import pandas as pd

In [3]:
# use requests to call hte url
r = requests.get(URL)

In [4]:
# test request
r.request.headers

for k, v in r.request.headers.items():
    print(k + ':', v)
    
HTML(r.content.decode('utf-8'))

('Connection:', 'keep-alive')
('Cookie:', 'INDEED_CSRF_TOKEN=IjeSHLdwBphJ9ujYos6ziFyNofR7DIaF; ctkgen=1; BIGipServerjob=!tdnKZgIjhTHa5oo/0rl98CFw+W0yBrCAPtPhwJAi3n4B4OQ1SKQQ1QxGE+SB1g8fvIe3ZQfewJiOhBE=; JSESSIONID=6F8FFFA049B6DDF65E559E24F45EE6FC.jasxA_ord-job5; CTK=1an0s3la6afbbdo2')
('Accept-Encoding:', 'gzip, deflate')
('Accept:', '*/*')
('User-Agent:', 'python-requests/2.9.1')


0,1
Find JobsFind ResumesEmployers / Post Job,"function isUserOptionsOpen() { return document.getElementById('userOptions').className == 'open'; } function hideUserOptions(label, options) { options.className = ''; label.className = 'navBi'; } function toggleUserOptions(e) { var options = document.getElementById('userOptions'); var label = document.getElementById('userOptionsLabel'); if (isUserOptionsOpen()) { hideUserOptions(label, options); if (!e.keyCode) { label.blur(); } } else { options.className = 'open'; label.className = 'navBi active'; document.onclick = function() { hideUserOptions(label, options); document.onclick = function() { }; }; if (e.keyCode && e.keyCode == 13) { var fL = gbid('userOptions').getElementsByTagName('a')[0]; if (fL) { fL.focus(); } } else { label.blur(); } } stopPropagation(e); } function stopPropagation(e) { var e = e || window.event; e.stopPropagation ? e.stopPropagation() : e.cancelBubble = true; } function regExpEscape(s) {  return String(s).replace(/([-()\[\]{}+?*.$\^|,:#<!\\])/g, '\\$1').  replace(/\x08/g, '\\x08'); } function appendParamsOnce(url, params) {  var useParams = params.replace(/^(\?|\&)/, '');  if (url.match(new RegExp('[\\?|\\&]' + regExpEscape(useParams))) == null) {  return url += (url.indexOf('?') > 0 ? '&' : '?' ) + useParams;  }  return url; } Upload your resume Sign in"

0
"What: Where: Advanced Job Search  job title, keywords or company city, state, or zip"

0,1
,"What: Where: Advanced Job Search  job title, keywords or company city, state, or zip"

0,1,2,3
What:,Where:,Where:,Where:
,,,Advanced Job Search
,,,
,,,
"job title, keywords or company","city, state, or zip","city, state, or zip","city, state, or zip"

0
"window['ree'] = ""pdsssps""; window['jas'] = ""hxTdWyJLG""; data scientist $20,000 jobs in New York State  call_when_jsall_loaded(function() {  var recJobLink = new RecJobLink(""Recommended Jobs"", ""recPromoDisplay"", ""1an0s3la6afbbdo2"", """",  ""US"", ""en"", """",  """", null, true);  recJobLink.onLoad();  });  Sort by: relevance -  date You refined by: $20,000+ (undo) Salaries estimated if unavailable Job Type Full-time (2076) Temporary (70) Contract (58) Part-time (44) Internship (11) Commission (3) Location New York, NY (1662) Queens, NY (79) Manhattan, NY (42) Buffalo, NY (34) Albany, NY (33) Brooklyn, NY (31) Rochester, NY (30) New York State (27) Upton, NY (26) Bronx, NY (26) Syracuse, NY (19) Rensselaer, NY (15) Brookhaven, NY (13) Tarrytown, NY (13) Grand Island, NY (12) Data Scientist $20,000 jobs nationwide more » Company NYU Langone Medical Center (323) Mount Sinai Health System (129) DEPT OF HEALTH/MENTAL HYGIENE (105) Selby Jennings (79) Weill Cornell Medical College (78) Brookhaven National Laboratory (26) Analytic Recruiting (26) Columbia University (25) Elevate Recruiting Group (24) Albert Einstein College of Medicine (22) AMRI (21) Amneal Pharmaceuticals (19) JPMorgan Chase (18) Averity (16) Xaxis (15) more » function setJaPromoCookie() { var expires = new Date(); expires.setTime(expires.getTime() + (5 * 365 * 24 * 60 * 60 * 1000)); setCookie(""showJaPromo"", ""1"", expires); } function setRefineByCookie(refineByTypes) { var expires = new Date(); expires.setTime(expires.getTime() + (10 * 1000)); refineByTypes.forEach(function(type) { setCookie(type, ""1"", expires); }); } Jobs 31 to 40 of 2,260  Upload your resume - Let employers find you window['sjl'] = ""mJwkvTJk3bq""; Junior Data Scientist  UncommonGoods  - 4 reviews  - New York, NY We’re looking for a passionate Data Analyst that can rigorously analyze data and translate it into actionable insight that informs product, marketing, and... 30+ days ago - emailwindow['sj_result_76b88c1526cc5a7d'] = {""showSource"": false, ""source"": ""UncommonGoods"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""30+ days ago"",""jobKey"": ""76b88c1526cc5a7d"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": false, ""resultNumber"": 10, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": false, ""currentPage"" : ""serp"", ""sponsored"" : true,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Sponsored Lead Data Scientist  ASCAP  - 25 reviews  - New York, NY The Lead Data Scientist will work collaboratively to design the next generation of unique data capabilities on behalf of ASCAP members, licensees and executives...  Easily apply 14 days ago - emailwindow['sj_result_47b551541fd08be1'] = {""showSource"": false, ""source"": ""ASCAP"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""14 days ago"",""jobKey"": ""47b551541fd08be1"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": false, ""resultNumber"": 11, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": false, ""currentPage"" : ""serp"", ""sponsored"" : true,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Sponsored Research Data Associate  NYU Langone Medical Center  - 174 reviews  - New York, NY 10016 (Gramercy area) Assists with study site recruitment, data collection, participant tracking, intervention delivery, protocol compliance, and accurate data entry.... 12 hours ago - save job - email - more...window['result_a7f03fccc9708692'] = {""showSource"": false, ""source"": ""NYU Langone Medical Center"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""12 hours ago"",""jobKey"": ""a7f03fccc9708692"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 0, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Team Lead, Data Analysis  WebMD  - 77 reviews  - New York, NY Provide data analysis and facilitate comprehension of data to the Business Intelligence user community. This position will work closely with other departments... 15 days ago - save job - email - more...window['result_b7e956e001f81d97'] = {""showSource"": false, ""source"": ""WebMD"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""15 days ago"",""jobKey"": ""b7e956e001f81d97"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 1, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Senior Data Scientist ($150k)  Averity  - New York, NY $150,000 a year Are you a Senior Data Scientist excited about the thought of starting a Data Science group in the arts and entertainment world?...  Easily apply 13 days ago - save job - email - more...window['result_b58e966f014fea01'] = {""showSource"": false, ""source"": ""Averity"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""13 days ago"",""jobKey"": ""b58e966f014fea01"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 2, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Data Engineer  PlaceIQ  - 3 reviews  - New York, NY Experience with Hadoop/Big Data paradigms. From building data pipelines to regression models/classification algorithms, complex data visualizations to... 21 days ago - save job - email - more...window['result_853d7dc97eddcd44'] = {""showSource"": false, ""source"": ""PlaceIQ"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""21 days ago"",""jobKey"": ""853d7dc97eddcd44"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 3, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Data Analyst  NYU Langone Medical Center  - 174 reviews  - New York, NY 10016 (Gramercy area) The individual is responsibile for data collection Will be solely responsible for exporting images and raw image data for further processing.... 21 hours ago - save job - email - more...window['result_c7d0b3cd5f05b892'] = {""showSource"": false, ""source"": ""NYU Langone Medical Center"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""21 hours ago"",""jobKey"": ""c7d0b3cd5f05b892"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 4, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Researcher (Full Time)  Weill Cornell Medical School  - New York, NY Qualified researchers will work with an array of experts in multiple disciplines including physicians, computer scientists, medical imagers, statisticians,... Nature Jobs - 5 days ago - save job - email - more...window['result_1c1d5a362b44989a'] = {""showSource"": true, ""source"": ""Nature Jobs"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""5 days ago"",""jobKey"": ""1c1d5a362b44989a"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 5, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Python Developer / Data Scientist  Smith & Keller  - Manhattan, NY Our client is looking for a talented Data Scientist to join our rapidly growing team in our New York City headquarters.... 10 days ago - save job - email - more...window['result_39fe45ef5c1350bb'] = {""showSource"": false, ""source"": ""Smith \x26 Keller"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""10 days ago"",""jobKey"": ""39fe45ef5c1350bb"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 6, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Sr. Data Scientist - User & Product Analytics  Tumblr  - 5 reviews  - New York, NY We are seeking a veteran Data Scientist, well-versed in data analysis and algorithm implementation, ready to be let loose on Tumblr’s many terabytes of data....  Easily apply 30+ days ago - save job - email - more...window['result_6bee383df83e2492'] = {""showSource"": false, ""source"": ""Tumblr"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""30+ days ago"",""jobKey"": ""6bee383df83e2492"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 7, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Senior Data Scientist  Xaxis  - 6 reviews  - New York, NY Mentor less-experienced members of the team toward becoming better data scientists. The Xaxis Product and Engineering organization is seeking an experienced... 20 days ago - save job - email - more...window['result_552ea7e3f66a1a71'] = {""showSource"": false, ""source"": ""Xaxis"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""20 days ago"",""jobKey"": ""552ea7e3f66a1a71"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 8, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Lead Data Scientist  ASCAP  - 25 reviews  - New York, NY The Lead Data Scientist will work collaboratively to design the next generation of unique data capabilities on behalf of ASCAP members, licensees and executives...  Easily apply 14 days ago - save job - email - more...window['result_47b551541fd08be1'] = {""showSource"": false, ""source"": ""ASCAP"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""14 days ago"",""jobKey"": ""47b551541fd08be1"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 9, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Data Scientist  AbilTo, Inc  - New York, NY Data & Analytics Team. Data & Analytics team culture:. As a Data Scientist at AbilTo, you will work in a uniquely cross-functional capacity, helping teams... 30+ days ago - emailwindow['sj_result_134f86fee2270420'] = {""showSource"": false, ""source"": ""AbilTo, Inc"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""30+ days ago"",""jobKey"": ""134f86fee2270420"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": false, ""resultNumber"": 12, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": false, ""currentPage"" : ""serp"", ""sponsored"" : true,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Sponsored Senior Scientist  Cipla  - Hauppauge, NY Executes batches as per the protocols, analyze data and prepare summary reports. The ideal candidate will identify and evaluate the critical formulation factors...  Easily apply 29 days ago - emailwindow['sj_result_d0c64eb2c2175d60'] = {""showSource"": false, ""source"": ""Indeed"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""29 days ago"",""jobKey"": ""d0c64eb2c2175d60"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": false, ""resultNumber"": 13, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": false, ""currentPage"" : ""serp"", ""sponsored"" : true,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Sponsored Senior Epidemiologist, Bureau Systems Strengthening and Acce...  DEPT OF HEALTH/MENTAL HYGIENE  - 4 reviews  - Queens, NY $85,211 - $97,993 a year Develop new surveillance and data analysis methods. Two years as a City Research Scientist Level I can be substituted for the experience required in ""1"" and ""2""... 30+ days ago - emailwindow['sj_result_2b825dd33945fec8'] = {""showSource"": false, ""source"": ""NYC Careers"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""30+ days ago"",""jobKey"": ""2b825dd33945fec8"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": false, ""resultNumber"": 14, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": false, ""currentPage"" : ""serp"", ""sponsored"" : true,""sponsorName"" : ""NYC Careers"",""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Sponsored by NYC Careers Get email updates for the latest data scientist $20,000 jobs in New York My email: Also get an email with jobs recommended just for me You can cancel email alerts at any time. function ptk(st,p) { document.cookie = 'PTK=""tk=&type=jobsearch&subtype=' + st + (p ? '&' + p : '')  + (st == 'pagination' ? '&fp=4' : '') +'""; path=/'; } function pclk(event) { var evt = event || window.event; var target = evt.target || evt.srcElement; var el = target.nodeType == 1 ? target : target.parentNode; var tag = el.tagName.toLowerCase(); if (tag == 'span' || tag == 'a') { ptk('pagination'); } return true; } Results Page: « Previous 1 2 3 4 5 6 7 8 Next » Get new jobs for this search by email My email: Also get an email with jobs recommended just for me You can cancel email alerts at any time. Company with data scientist $20,000 jobs  UncommonGoods  Founded in 1999 and headquartered in NYC, UncommonGoods is a catalog and online retailer of creatively designed, high-quality products.  var ind_nr = true;  var ind_pub = '8772657697788355';  var ind_el = 'indJobContent';  var ind_pf = '';  var ind_q = '';  var ind_fcckey = 'bdb3f0ff52963b2e';  var ind_l = 'New York';  var ind_chnl = 'UncommonGoods';  var ind_n = 3;  var ind_d = '';  var ind_t = 60;  var ind_c = 30;  var ind_rq = 'data scientist $20,000';  window.indeedJobroll.origJobsCallback = window.indeedJobroll.jobsCallback;  window.indeedJobroll.jobsCallback=function(contentId, content) {  var sjDiv = document.getElementById('featemp_sj');  if (content.length <= 33 && sjDiv) {  sjDiv.style.display = 'none';  } else {  if (sjDiv) { sjDiv.style.display = 'block'; }  window.indeedJobroll.origJobsCallback(contentId, content);  }  };  Jobs (7)  Reviews (4) var focusHandlers = []; var linkHighlighter = new LinkHighlighter(); focusHandlers.push(googBind(linkHighlighter.fadeToOriginalColor, linkHighlighter)); var lostFocusHandlers = []; lostFocusHandlers.push(googBind(linkHighlighter.clickedAway, linkHighlighter, ""#551a8b"")); var didYouApplyPrompt = new DidYouApplyPrompt('1an0s3lbkafbbaek', 60, 'serp', false); focusHandlers.push(googBind(didYouApplyPrompt.returnedToPage, didYouApplyPrompt)); lostFocusHandlers.push(googBind(didYouApplyPrompt.leftPage, didYouApplyPrompt)); didYouApplyPrompt.dyaChangeFromCookie(); var clickTime = new ClickTime(window.tk, 'serp', 'jobtitle', focusHandlers, lostFocusHandlers); enableAdometry(); Indeed helps people get jobs: Over 2.8 million stories sharedJobs - Browse Companies - Salaries - Trends - Forums - var jobsProductLink = document.getElementById('jobs_product_link'); document.getElementById('salaries_product_link').onclick = function() { if ( !document.js ) { return; } var q = document.js.q.value; var l = document.js.l.value; if ( q || l ) { window.location = '/salary?q1=' + urlencode( q ) + '&l1=' + urlencode( l ); return false; } };document.getElementById('trends_product_link').onclick = function() { if ( !document.js ) { return; } var q = document.js.q.value; if ( q ) { window.location = '/jobtrends?q=' + urlencode( q ) + '&l=' + urlencode( document.js.l.value ); return false; } };document.getElementById('forums_product_link').onclick = function() { if ( !document.js ) { return; } var q = document.js.q.value; if ( q ) { window.location = '/forum/?q=' + urlencode( q ) + '&l=' + urlencode( document.js.l.value ); return false; } };document.getElementById('companies_product_link').onclick = function() { window.location = '/Best-Places-to-Work?campaignid=jobs'; return false;}; Browse Jobs - Tools - Work at Indeed - API - About - Help Center ©2016 Indeed - Cookies, Privacy and Terms"

0,1,2
"data scientist $20,000 jobs in New York State  call_when_jsall_loaded(function() {  var recJobLink = new RecJobLink(""Recommended Jobs"", ""recPromoDisplay"", ""1an0s3la6afbbdo2"", """",  ""US"", ""en"", """",  """", null, true);  recJobLink.onLoad();  });  Sort by: relevance -  date You refined by: $20,000+ (undo) Salaries estimated if unavailable Job Type Full-time (2076) Temporary (70) Contract (58) Part-time (44) Internship (11) Commission (3) Location New York, NY (1662) Queens, NY (79) Manhattan, NY (42) Buffalo, NY (34) Albany, NY (33) Brooklyn, NY (31) Rochester, NY (30) New York State (27) Upton, NY (26) Bronx, NY (26) Syracuse, NY (19) Rensselaer, NY (15) Brookhaven, NY (13) Tarrytown, NY (13) Grand Island, NY (12) Data Scientist $20,000 jobs nationwide more » Company NYU Langone Medical Center (323) Mount Sinai Health System (129) DEPT OF HEALTH/MENTAL HYGIENE (105) Selby Jennings (79) Weill Cornell Medical College (78) Brookhaven National Laboratory (26) Analytic Recruiting (26) Columbia University (25) Elevate Recruiting Group (24) Albert Einstein College of Medicine (22) AMRI (21) Amneal Pharmaceuticals (19) JPMorgan Chase (18) Averity (16) Xaxis (15) more »","function setJaPromoCookie() { var expires = new Date(); expires.setTime(expires.getTime() + (5 * 365 * 24 * 60 * 60 * 1000)); setCookie(""showJaPromo"", ""1"", expires); } function setRefineByCookie(refineByTypes) { var expires = new Date(); expires.setTime(expires.getTime() + (10 * 1000)); refineByTypes.forEach(function(type) { setCookie(type, ""1"", expires); }); } Jobs 31 to 40 of 2,260  Upload your resume - Let employers find you window['sjl'] = ""mJwkvTJk3bq""; Junior Data Scientist  UncommonGoods  - 4 reviews  - New York, NY We’re looking for a passionate Data Analyst that can rigorously analyze data and translate it into actionable insight that informs product, marketing, and... 30+ days ago - emailwindow['sj_result_76b88c1526cc5a7d'] = {""showSource"": false, ""source"": ""UncommonGoods"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""30+ days ago"",""jobKey"": ""76b88c1526cc5a7d"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": false, ""resultNumber"": 10, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": false, ""currentPage"" : ""serp"", ""sponsored"" : true,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Sponsored Lead Data Scientist  ASCAP  - 25 reviews  - New York, NY The Lead Data Scientist will work collaboratively to design the next generation of unique data capabilities on behalf of ASCAP members, licensees and executives...  Easily apply 14 days ago - emailwindow['sj_result_47b551541fd08be1'] = {""showSource"": false, ""source"": ""ASCAP"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""14 days ago"",""jobKey"": ""47b551541fd08be1"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": false, ""resultNumber"": 11, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": false, ""currentPage"" : ""serp"", ""sponsored"" : true,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Sponsored Research Data Associate  NYU Langone Medical Center  - 174 reviews  - New York, NY 10016 (Gramercy area) Assists with study site recruitment, data collection, participant tracking, intervention delivery, protocol compliance, and accurate data entry.... 12 hours ago - save job - email - more...window['result_a7f03fccc9708692'] = {""showSource"": false, ""source"": ""NYU Langone Medical Center"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""12 hours ago"",""jobKey"": ""a7f03fccc9708692"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 0, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Team Lead, Data Analysis  WebMD  - 77 reviews  - New York, NY Provide data analysis and facilitate comprehension of data to the Business Intelligence user community. This position will work closely with other departments... 15 days ago - save job - email - more...window['result_b7e956e001f81d97'] = {""showSource"": false, ""source"": ""WebMD"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""15 days ago"",""jobKey"": ""b7e956e001f81d97"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 1, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Senior Data Scientist ($150k)  Averity  - New York, NY $150,000 a year Are you a Senior Data Scientist excited about the thought of starting a Data Science group in the arts and entertainment world?...  Easily apply 13 days ago - save job - email - more...window['result_b58e966f014fea01'] = {""showSource"": false, ""source"": ""Averity"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""13 days ago"",""jobKey"": ""b58e966f014fea01"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 2, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Data Engineer  PlaceIQ  - 3 reviews  - New York, NY Experience with Hadoop/Big Data paradigms. From building data pipelines to regression models/classification algorithms, complex data visualizations to... 21 days ago - save job - email - more...window['result_853d7dc97eddcd44'] = {""showSource"": false, ""source"": ""PlaceIQ"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""21 days ago"",""jobKey"": ""853d7dc97eddcd44"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 3, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Data Analyst  NYU Langone Medical Center  - 174 reviews  - New York, NY 10016 (Gramercy area) The individual is responsibile for data collection Will be solely responsible for exporting images and raw image data for further processing.... 21 hours ago - save job - email - more...window['result_c7d0b3cd5f05b892'] = {""showSource"": false, ""source"": ""NYU Langone Medical Center"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""21 hours ago"",""jobKey"": ""c7d0b3cd5f05b892"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 4, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Researcher (Full Time)  Weill Cornell Medical School  - New York, NY Qualified researchers will work with an array of experts in multiple disciplines including physicians, computer scientists, medical imagers, statisticians,... Nature Jobs - 5 days ago - save job - email - more...window['result_1c1d5a362b44989a'] = {""showSource"": true, ""source"": ""Nature Jobs"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""5 days ago"",""jobKey"": ""1c1d5a362b44989a"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 5, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Python Developer / Data Scientist  Smith & Keller  - Manhattan, NY Our client is looking for a talented Data Scientist to join our rapidly growing team in our New York City headquarters.... 10 days ago - save job - email - more...window['result_39fe45ef5c1350bb'] = {""showSource"": false, ""source"": ""Smith \x26 Keller"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""10 days ago"",""jobKey"": ""39fe45ef5c1350bb"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 6, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Sr. Data Scientist - User & Product Analytics  Tumblr  - 5 reviews  - New York, NY We are seeking a veteran Data Scientist, well-versed in data analysis and algorithm implementation, ready to be let loose on Tumblr’s many terabytes of data....  Easily apply 30+ days ago - save job - email - more...window['result_6bee383df83e2492'] = {""showSource"": false, ""source"": ""Tumblr"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""30+ days ago"",""jobKey"": ""6bee383df83e2492"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 7, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Senior Data Scientist  Xaxis  - 6 reviews  - New York, NY Mentor less-experienced members of the team toward becoming better data scientists. The Xaxis Product and Engineering organization is seeking an experienced... 20 days ago - save job - email - more...window['result_552ea7e3f66a1a71'] = {""showSource"": false, ""source"": ""Xaxis"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""20 days ago"",""jobKey"": ""552ea7e3f66a1a71"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 8, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Lead Data Scientist  ASCAP  - 25 reviews  - New York, NY The Lead Data Scientist will work collaboratively to design the next generation of unique data capabilities on behalf of ASCAP members, licensees and executives...  Easily apply 14 days ago - save job - email - more...window['result_47b551541fd08be1'] = {""showSource"": false, ""source"": ""ASCAP"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""14 days ago"",""jobKey"": ""47b551541fd08be1"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 9, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Data Scientist  AbilTo, Inc  - New York, NY Data & Analytics Team. Data & Analytics team culture:. As a Data Scientist at AbilTo, you will work in a uniquely cross-functional capacity, helping teams... 30+ days ago - emailwindow['sj_result_134f86fee2270420'] = {""showSource"": false, ""source"": ""AbilTo, Inc"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""30+ days ago"",""jobKey"": ""134f86fee2270420"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": false, ""resultNumber"": 12, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": false, ""currentPage"" : ""serp"", ""sponsored"" : true,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Sponsored Senior Scientist  Cipla  - Hauppauge, NY Executes batches as per the protocols, analyze data and prepare summary reports. The ideal candidate will identify and evaluate the critical formulation factors...  Easily apply 29 days ago - emailwindow['sj_result_d0c64eb2c2175d60'] = {""showSource"": false, ""source"": ""Indeed"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""29 days ago"",""jobKey"": ""d0c64eb2c2175d60"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": false, ""resultNumber"": 13, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": false, ""currentPage"" : ""serp"", ""sponsored"" : true,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Sponsored Senior Epidemiologist, Bureau Systems Strengthening and Acce...  DEPT OF HEALTH/MENTAL HYGIENE  - 4 reviews  - Queens, NY $85,211 - $97,993 a year Develop new surveillance and data analysis methods. Two years as a City Research Scientist Level I can be substituted for the experience required in ""1"" and ""2""... 30+ days ago - emailwindow['sj_result_2b825dd33945fec8'] = {""showSource"": false, ""source"": ""NYC Careers"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""30+ days ago"",""jobKey"": ""2b825dd33945fec8"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": false, ""resultNumber"": 14, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": false, ""currentPage"" : ""serp"", ""sponsored"" : true,""sponsorName"" : ""NYC Careers"",""reportJobButtonEnabled"": false, ""showMyJobsHired"": false}; Sponsored by NYC Careers Get email updates for the latest data scientist $20,000 jobs in New York My email: Also get an email with jobs recommended just for me You can cancel email alerts at any time. function ptk(st,p) { document.cookie = 'PTK=""tk=&type=jobsearch&subtype=' + st + (p ? '&' + p : '')  + (st == 'pagination' ? '&fp=4' : '') +'""; path=/'; } function pclk(event) { var evt = event || window.event; var target = evt.target || evt.srcElement; var el = target.nodeType == 1 ? target : target.parentNode; var tag = el.tagName.toLowerCase(); if (tag == 'span' || tag == 'a') { ptk('pagination'); } return true; } Results Page: « Previous 1 2 3 4 5 6 7 8 Next »","Get new jobs for this search by email My email: Also get an email with jobs recommended just for me You can cancel email alerts at any time. Company with data scientist $20,000 jobs  UncommonGoods  Founded in 1999 and headquartered in NYC, UncommonGoods is a catalog and online retailer of creatively designed, high-quality products.  var ind_nr = true;  var ind_pub = '8772657697788355';  var ind_el = 'indJobContent';  var ind_pf = '';  var ind_q = '';  var ind_fcckey = 'bdb3f0ff52963b2e';  var ind_l = 'New York';  var ind_chnl = 'UncommonGoods';  var ind_n = 3;  var ind_d = '';  var ind_t = 60;  var ind_c = 30;  var ind_rq = 'data scientist $20,000';  window.indeedJobroll.origJobsCallback = window.indeedJobroll.jobsCallback;  window.indeedJobroll.jobsCallback=function(contentId, content) {  var sjDiv = document.getElementById('featemp_sj');  if (content.length <= 33 && sjDiv) {  sjDiv.style.display = 'none';  } else {  if (sjDiv) { sjDiv.style.display = 'block'; }  window.indeedJobroll.origJobsCallback(contentId, content);  }  };  Jobs (7)  Reviews (4)"

0
"We’re looking for a passionate Data Analyst that can rigorously analyze data and translate it into actionable insight that informs product, marketing, and..."

0
"The Lead Data Scientist will work collaboratively to design the next generation of unique data capabilities on behalf of ASCAP members, licensees and executives..."

0
"Assists with study site recruitment, data collection, participant tracking, intervention delivery, protocol compliance, and accurate data entry.... 12 hours ago - save job - email - more...window['result_a7f03fccc9708692'] = {""showSource"": false, ""source"": ""NYU Langone Medical Center"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""12 hours ago"",""jobKey"": ""a7f03fccc9708692"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 0, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false};"

0
"Provide data analysis and facilitate comprehension of data to the Business Intelligence user community. This position will work closely with other departments... 15 days ago - save job - email - more...window['result_b7e956e001f81d97'] = {""showSource"": false, ""source"": ""WebMD"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""15 days ago"",""jobKey"": ""b7e956e001f81d97"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 1, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false};"

0
"$150,000 a year Are you a Senior Data Scientist excited about the thought of starting a Data Science group in the arts and entertainment world?...  Easily apply 13 days ago - save job - email - more...window['result_b58e966f014fea01'] = {""showSource"": false, ""source"": ""Averity"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""13 days ago"",""jobKey"": ""b58e966f014fea01"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 2, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false};"

0
"Experience with Hadoop/Big Data paradigms. From building data pipelines to regression models/classification algorithms, complex data visualizations to... 21 days ago - save job - email - more...window['result_853d7dc97eddcd44'] = {""showSource"": false, ""source"": ""PlaceIQ"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""21 days ago"",""jobKey"": ""853d7dc97eddcd44"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 3, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false};"

0
"The individual is responsibile for data collection Will be solely responsible for exporting images and raw image data for further processing.... 21 hours ago - save job - email - more...window['result_c7d0b3cd5f05b892'] = {""showSource"": false, ""source"": ""NYU Langone Medical Center"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""21 hours ago"",""jobKey"": ""c7d0b3cd5f05b892"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 4, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false};"

0
"Qualified researchers will work with an array of experts in multiple disciplines including physicians, computer scientists, medical imagers, statisticians,... Nature Jobs - 5 days ago - save job - email - more...window['result_1c1d5a362b44989a'] = {""showSource"": true, ""source"": ""Nature Jobs"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""5 days ago"",""jobKey"": ""1c1d5a362b44989a"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 5, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false};"

0
"Our client is looking for a talented Data Scientist to join our rapidly growing team in our New York City headquarters.... 10 days ago - save job - email - more...window['result_39fe45ef5c1350bb'] = {""showSource"": false, ""source"": ""Smith \x26 Keller"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""10 days ago"",""jobKey"": ""39fe45ef5c1350bb"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 6, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false};"

0
"We are seeking a veteran Data Scientist, well-versed in data analysis and algorithm implementation, ready to be let loose on Tumblr’s many terabytes of data....  Easily apply 30+ days ago - save job - email - more...window['result_6bee383df83e2492'] = {""showSource"": false, ""source"": ""Tumblr"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""30+ days ago"",""jobKey"": ""6bee383df83e2492"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 7, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false};"

0
"Mentor less-experienced members of the team toward becoming better data scientists. The Xaxis Product and Engineering organization is seeking an experienced... 20 days ago - save job - email - more...window['result_552ea7e3f66a1a71'] = {""showSource"": false, ""source"": ""Xaxis"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""20 days ago"",""jobKey"": ""552ea7e3f66a1a71"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 8, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false};"

0
"The Lead Data Scientist will work collaboratively to design the next generation of unique data capabilities on behalf of ASCAP members, licensees and executives...  Easily apply 14 days ago - save job - email - more...window['result_47b551541fd08be1'] = {""showSource"": false, ""source"": ""ASCAP"", ""loggedIn"": false, ""showMyJobsLinks"": true,""undoAction"": ""unsave"",""relativeJobAge"": ""14 days ago"",""jobKey"": ""47b551541fd08be1"", ""myIndeedAvailable"": true, ""tellAFriendEnabled"": true, ""showMoreActionsLink"": true, ""resultNumber"": 9, ""jobStateChangedToSaved"": false, ""searchState"": ""q=data scientist $20,000&amp;l=New+York&amp;start=30"", ""basicPermaLink"": ""http://www.indeed.com"", ""saveJobFailed"": false, ""removeJobFailed"": false, ""requestPending"": false, ""notesEnabled"": true, ""currentPage"" : ""serp"", ""sponsored"" : false,""reportJobButtonEnabled"": false, ""showMyJobsHired"": false};"

0
"Data & Analytics Team. Data & Analytics team culture:. As a Data Scientist at AbilTo, you will work in a uniquely cross-functional capacity, helping teams..."

0
"Executes batches as per the protocols, analyze data and prepare summary reports. The ideal candidate will identify and evaluate the critical formulation factors..."

0
"Develop new surveillance and data analysis methods. Two years as a City Research Scientist Level I can be substituted for the experience required in ""1"" and ""2""..."


In [None]:
# test printing out the title 
soup = BeautifulSoup(r.content)
soup.title.text

# test to call company names
for x in soup.findAll('span', class_ = 'company'):
    print x.text

In [532]:
# read needed info into dataframe
df = pd.DataFrame()
city = ['New+York', 'Chicago', 'San+Francisco', 'Austin']
for c in city:
    # there are only 100 pages of results available
    for p in range(1,100): 
        url = 'http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l='
        r = requests.get(url + c+'&start=' + str(p*10))
        soup = BeautifulSoup(r.content)
        for x in soup.findAll('div', {'class' : ' row result'}):
            try:
                company = x.find('span', {'itemprop':'name'}).getText().strip() # company name 
                title = x.find('a', {'itemprop':'title'}).getText().strip() # job title
                location = x.find('span', {'itemprop':'addressLocality'}).getText().strip() # location
                description = x.find('span', {'itemprop':'description'}).getText().strip() # abbreviated description
                date = x.find('span', {'class':'date'}).getText().strip()
                salary = x.find('nobr')
                df = df.append({'company': company, 'title': title, 'post_date': date, 'location': location, 'description': description, 'salary':salary}, ignore_index=True)
            except:
                pass

In [533]:
# check rows we received with salary info 
sum(df.salary.value_counts())

359

In [21]:
df

Unnamed: 0,company,description,location,post_date,salary,title
0,Spreemo,"As a Senior Data Scientist at Spreemo, you wil...","New York, NY 10012 (Little Italy area)",1 day ago,,Senior Data Scientist
1,Paperless Post,Decision science and data analytics. You’ll le...,"New York, NY",5 days ago,,Director of Data Science and Analytics
2,Dia&Co,You are a data science Ph.D. Can write code fo...,"New York, NY",29 days ago,,Data Scientist
3,Tapad,Tapad is looking for an experienced Data Scien...,"New York, NY",16 days ago,,Data Scientist
4,S&P Global Ratings,Research and resolve data maintenance requests...,"New York, NY 10002 (Lower East Side area)",11 days ago,,Statistician
5,Oliver James Associates,We are currently looking for an brilliant data...,"New York, NY",11 days ago,"<nobr>$170,000 - $200,000 a year</nobr>",Data Scientist
6,Barclays,"Use knowledge of stochastic calculus, probabil...","New York, NY",15 days ago,,VP - Quantitative Analyst
7,SecurityScorecard,You're a seasoned Data Scientist who loves dif...,"New York, NY",28 days ago,,Data Scientist
8,Crisis Text Line,Chief Data Scientist. The Data Scientist's rol...,"New York, NY",12 days ago,,Data Scientist
9,Tilting Point,The Data Scientist will also work to ensure ac...,"New York, NY",30+ days ago,,Data Scientist


In [398]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4791 entries, 0 to 4790
Data columns (total 6 columns):
city             4791 non-null object
company          4791 non-null object
salary           4791 non-null object
summary          4791 non-null object
title            4791 non-null object
parsed_salary    4789 non-null float64
dtypes: float64(1), object(5)
memory usage: 224.6+ KB


In [396]:
# new dataframe with any cities, fulltime job only
ft = pd.DataFrame()

for p in range(1,100): 
    url = 'http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&jt=fulltime&start='
    r = requests.get(url + str(p*10))
    soup = BeautifulSoup(r.content)
    for x in soup.findAll('div', {'class' : ' row result'}):
        try:
            company = x.find('span', {'itemprop':'name'}).getText().strip() # company name 
            title = x.find('a', {'itemprop':'title'}).getText().strip() # job title
            location = x.find('span', {'itemprop':'addressLocality'}).getText().strip() # location
            description = x.find('span', {'itemprop':'description'}).getText().strip() # abbreviated description
            date = x.find('span', {'class':'date'}).getText().strip()
            salary = x.find('nobr')
            ft = ft.append({'company': company, 'title': title, 'post_date': date, 'location': location, 'description': description, 'salary':salary}, ignore_index=True)
        except:
            pass

In [397]:
ft.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 6 columns):
company        891 non-null object
description    891 non-null object
location       891 non-null object
post_date      891 non-null object
salary         104 non-null object
title          891 non-null object
dtypes: object(6)
memory usage: 48.7+ KB


Let's look at one result more closely. A single `result` looks like

```
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
C onduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The salary is available in a `nobr` element inside of a `td` element with `class='snip`.
- The title of a job is in a link with class set to `jobtitle` and a `data-tn-element="jobTitle`.  
- The location is set in a `span` with `class='location'`. 
- The company is set in a `span` with `class='company'`. 

### Write 4 functions to extract each item: location, company, job, and salary.

example: 
```python
def extract_location_from_result(result):
    return result.find ...
```


- Make sure these functions are robust and can handle cases where the data/field may not be available.
- Test the functions on the results above

In [25]:
## alternatively using function to extract items


Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.

- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here we can alter to collect more results, the `l=New+York` and the `start=10`. The first controls the location of the results (so we can try a different city). The second controls where in the results to start and gives 10 results (thus, we can keep incrementing by 10 to go further in the list).

#### Complete the following code to collect results from multiple cities and starting points. 
- Enter your city below to add it to the search
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different

In [26]:
YOUR_CITY = ''

In [27]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 100

results = []

for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', YOUR_CITY]):
    for start in range(0, max_results_per_city, 10):
        # Grab the results from the request (as above)
        # Append to the full set of results
        pass

#### Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.

In [28]:
## YOUR CODE HERE

Lastly, we need to clean up salary data. 
1. Some of the salaries are not yearly but hourly, these will be useful to us for now
2. The salaries are given as text and usually with ranges.

#### Filter out the salaries that are not yearly (filter those that refer to hour)

In [29]:
## YOUR CODE HERE


#### Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary

In [30]:
## YOUR CODE HERE

# Cleaning data!

In [399]:
import numpy as np

# creating temp columns to split up the addresses in to city and states
ft['temp'] = [x.split(',') for x in ft.location]

# assign cities into a new column
ft['city'] = [x[0] for x in ft.temp] 
ft['temp1'] = [x[1:] for x in ft.temp] 

# assign states into a new column
states = []
for x in ft.temp1:
    try: 
        states.append(str(x)[4] + str(x)[5]) 
    except: 
        states.append(np.nan)
ft['state'] = states

In [400]:
# check out suspecious looking cities
ft.city.unique()
ft.city.value_counts()
temp = ft.ix[(ft['city'] == 'United States') | (ft['city'] =='Job') | (ft['city'] == 'Remote') | (ft['city'] =='New Jersey')]

In [262]:
temp

Unnamed: 0,company,description,location,post_date,salary,title,temp,city,temp1,state
23,iCube CSI,"Databases, data mapping, data validation, and ...","Job, WV",2 days ago,,Data Scientist,"[Job, WV]",Job,[ WV],WV
28,Avalance,Background in mining large sets of data. Inter...,Remote,8 days ago,,Data Scientist,[Remote],Remote,[],
31,Talent Solvers,The applications apply advanced machine learni...,United States,6 days ago,,Data Scientist,[United States],United States,[],
81,Quintiles Transnational,Whether you are beginning or continuing your c...,United States,11 hours ago,,Statistical Programmer 1 - Officebased Overlan...,[United States],United States,[],
100,PAREXEL International,The Senior Statistical Programmer is an excell...,United States,4 days ago,,Senior Statistical Programmer (Home Based),[United States],United States,[],
198,Adelphic Mobile,This individual works closely with data scient...,United States,30+ days ago,,Data Analyst,[United States],United States,[],
338,J & J Consumer Inc.,The Senior Scientist will:. The Senior Scienti...,New Jersey,30+ days ago,,"Senior Scientist, Data Sciences",[New Jersey],New Jersey,[],
550,RAND Corporation,These scientists rarely conduct theoretical re...,United States,30+ days ago,,Information Scientist,[United States],United States,[],
554,Riverland Community College,Demonstrated commitment to data integrity. Abi...,United States,3 days ago,"<nobr>$3,055 a month</nobr>",Research Analyst Intermediate,[United States],United States,[],
616,RAND Corporation,"Clean and analyze survey data using SAS, Stata...",United States,5 days ago,,"Statistical Programmer, Level II",[United States],United States,[],


In [263]:
# get rid off wrong city
ft.city.replace('Remote',np.nan, inplace =True)    
ft.city.replace('Job',np.nan, inplace =True) 
ft.city.replace('United States',np.nan, inplace =True) 

In [264]:
# Reassign city 'New Jersey' to state
ft.state[ft['city'] == 'New Jersey'] = 'NJ'
ft.city[ft['city'] == 'New Jersey'] = np.nan
# check if it's been reassigned
ft[ft['company'] == 'J & J Consumer Inc.']

Unnamed: 0,company,description,location,post_date,salary,title,temp,city,temp1,state
338,J & J Consumer Inc.,The Senior Scientist will:. The Senior Scienti...,New Jersey,30+ days ago,,"Senior Scientist, Data Sciences",[New Jersey],,[],NJ


In [265]:
# check state column 
ft.state.value_counts()

CA    180
NY    176
MA     51
IL     39
TX     39
MD     39
PA     36
VA     34
NJ     30
FL     22
WA     21
GA     17
MO     16
DC     15
MI     15
CT     13
NC     13
OH     12
CO     10
WI      9
MN      8
AZ      8
UT      7
TN      6
KS      4
NE      4
KY      4
IA      4
DE      3
NM      3
RI      3
HI      3
ME      2
VT      2
AR      2
NV      2
IN      2
AL      2
NH      2
WV      1
OR      1
LA      1
ID      1
MT      1
AK      1
SC      1
OK      1
Name: state, dtype: int64

In [266]:
# drop the temp columns 
ft = ft.drop(['temp','temp1'], axis = 1)

In [267]:
# check salary info
ft.salary.value_counts()

<nobr>$150,000 a year</nobr>               5
<nobr>$90,000 a year</nobr>                3
<nobr>$160,000 a year</nobr>               3
<nobr>$81,878 - $121,525 a year</nobr>     2
<nobr>$76,000 - $98,000 a year</nobr>      2
<nobr>$130,000 - $175,000 a year</nobr>    2
<nobr>$115,000 - $150,000 a year</nobr>    2
<nobr>$120,000 - $160,000 a year</nobr>    2
<nobr>$200,000 a year</nobr>               2
<nobr>$120,000 a year</nobr>               2
<nobr>$75,000 a year</nobr>                2
<nobr>$85,211 - $110,522 a year</nobr>     2
<nobr>$130,000 a year</nobr>               2
<nobr>$40,000 a year</nobr>                2
<nobr>$105,000 a year</nobr>               2
<nobr>$77,490 - $100,736 a year</nobr>     2
<nobr>$41,057 - $61,669 a year</nobr>      1
<nobr>$160,000 - $225,000 a year</nobr>    1
<nobr>$59,966 a year</nobr>                1
<nobr>$92,145 - $141,555 a year</nobr>     1
<nobr>$5,325 - $6,347 a month</nobr>       1
<nobr>$150,000 - $300,000 a year</nobr>    1
<nobr>$120

In [None]:
# change salary to string 
def to_string(x):
    try:
        y = str(x)
        return y
    except:
        pass  

ft.salary = ft.salary.apply(to_string)

# check data type 
type(ft.salary[16])

In [None]:
# create new columns for different types of salary
ft.loc[ft.salary.str.contains('year'), 'yearly_salary'] = ft['salary']
ft.loc[ft.salary.str.contains('month'), 'monthly_salary'] = ft['salary']

In [280]:
# seperating yearly salary column 
for x in ft.yearly_salary: 
    try:
        print x
        print x.split()
        print x.split()[0]
        print x.split()[2]
        print ''
    except:
        pass

nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
<nobr>$70,000 - $90,000 a year</nobr>
['<nobr>$70,000', '-', '$90,000', 'a', 'year</nobr>']
<nobr>$70,000
$90,000

<nobr>$120,000 a year</nobr>
['<nobr>$120,000', 'a', 'year</nobr>']
<nobr>$120,000
year</nobr>

nan
nan
nan
nan
nan
<nobr>$71,282 a year</nobr>
['<nobr>$71,282', 'a', 'year</nobr>']
<nobr>$71,282
year</nobr>

nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
<nobr>$170,000 - $200,000 a year</nobr>
['<nobr>$170,000', '-', '$200,000', 'a', 'year</nobr>']
<nobr>$170,000
$200,000

nan
nan
nan
nan
nan
nan
nan
nan
<nobr>$59,966 a year</nobr>
['<nobr>$59,966', 'a', 'year</nobr>']
<nobr>$59,966
year</nobr>

nan
nan
nan
nan
nan
<nobr>$80,000 - $85,000 a year</nobr>
['<nobr>$80,000', '-', '$85,000', 'a', 'year</nobr>']
<nobr>$80,000
$85,000

nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
<nobr>$90,000 a year</nobr>
['<nobr>$90,000', 'a', 'year</nobr>

In [321]:
# create new column for high and low range
range_l = []
range_h = []
for x in ft['yearly_salary']: 
    try:
        range_l.append(x.split()[0])
        range_h.append(x.split()[2])
    except:
        range_l.append(np.nan)
        range_h.append(np.nan)
ft['range_low'] = range_l
ft['range_high'] = range_h
ft

Unnamed: 0,company,description,location,post_date,salary,title,city,state,yearly_salary,monthly_salary,range_low,range_high
0,Memorial Sloan Kettering Cancer Center,"Working together with members in the group, th...","New York, NY 10065 (Upper East Side area)",9 days ago,,Bioinformatics Data Scientist - Cancer Genomic...,New York,NY,,,,
1,Bevi,We are seeking a Data Scientist to charter Bev...,"Boston, MA",1 day ago,,Data Scientist,Boston,MA,,,,
2,Google,From creating experiments and prototyping impl...,"New York, NY 10011 (Chelsea area)",1 day ago,,"Research Scientist, Machine Learning and Intel...",New York,NY,,,,
3,bnchmrk,Bnchmrk is seeking a talented Data Scientist t...,"Edgewater, NJ",13 days ago,,Data Scientist,Edgewater,NJ,,,,
4,Merck,He/She will collaborate across disciplines to ...,"Kenilworth, NJ",1 day ago,,Senior Scientist Job,Kenilworth,NJ,,,,
5,Twitter,Ability to navigate large sets of data to tell...,"San Francisco, CA 94103 (South Of Market area)",2 days ago,,Research Analyst,San Francisco,CA,,,,
6,Career Path Group,Insurance Company is looking for a talented an...,"Manhattan, NY",19 days ago,,Data Scientist and Analytics Developer - Insur...,Manhattan,NY,,,,
7,Xpandit,"But, more than just a “Buzzword Guy”, we seek ...","Lisbon, ME",30 days ago,,Big Data Engineer,Lisbon,ME,,,,
8,PulsePoint,"3+ Years as a Data Scientist, preferable in th...","New York, NY",14 days ago,,Sr. Data Scientist,New York,NY,,,,
9,"Data Management Services, Inc.",Perform standard descriptive and inferential d...,"Frederick, MD 21701",6 days ago,,R Statistician/Data Scientist,Frederick,MD,,,,


In [328]:
# clean up range low 
def clean_h(x):
    try: 
        y = x.replace('$','')
        z = y.replace(',','')
        return z
    except: 
        pass
ft['range_high']= ft['range_high'].apply(clean_h)

ft['range_high'].replace('year</nobr>',np.nan, inplace =True)    

In [329]:
ft['range_high'].value_counts()

200000    3
160000    3
80000     3
110000    2
90000     2
100736    2
165000    2
110522    2
98000     2
175000    2
120000    2
300000    2
121525    2
150000    2
120187    1
150202    1
62000     1
84044     1
100000    1
140000    1
145000    1
70000     1
155000    1
141555    1
61669     1
60715     1
61441     1
170000    1
99243     1
96538     1
225000    1
96004     1
48000     1
85000     1
250000    1
133444    1
180000    1
Name: range_high, dtype: int64

In [324]:
# clean up range low 
def clean_l(x):
    try: 
        y = x.replace('<nobr>$','')
        z = y.replace(',','')
        return z
    except: 
        pass
ft['range_low'] = ft['range_low'].apply(clean_l)

ft['range_low'].value_counts()

### Save your results as a CSV

In [405]:
## YOUR CODE HERE
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# write to file 
ft.to_csv('clean_ft.csv')

## Predicting salaries using Logistic Regression

#### Load in the the data of scraped salaries

In [406]:
ndf = pd.read_csv('../../../project_data/indeed-scraped-job-postings.csv')
ndf 

Unnamed: 0,city,company,salary,summary,title,parsed_salary
0,San+Francisco,MarkMonitor,"$180,000 a year","Data skills (SQL, Hive, Pig). Applying machine...",Data Scientist,180000.0
1,San+Francisco,Workbridge Associates,"$130,000 - $180,000 a year",3+ years of industry experience in a data scie...,Senior Data Scientist,155000.0
2,San+Francisco,Mines.io,"$80,000 - $120,000 a year",We are looking for a data scientist/developer ...,Full-Stack Data Scientist,100000.0
3,San+Francisco,Workbridge Associates,"$150,000 - $180,000 a year",In this position you will share programming an...,Data Scientist,165000.0
4,San+Francisco,Smith Hanley Associates,"$140,000 a year","This person will recruit, build and lead a tea...",Data Scientist,140000.0
5,San+Francisco,HSF Consulting,"$300,000 a year",Teams included Data Services(including data en...,VP of Data Services,300000.0
6,San+Francisco,All-In Analytics,"$100,000 - $150,000 a year",Fraud Data Scientist. Seeking someone with ski...,Fraud Data Scientist,125000.0
7,San+Francisco,Brilent,"$130,000 a year","Perform large-scale data analysis, find intere...",Senior Data Scientist,130000.0
8,San+Francisco,HSF Consulting,"$160,000 a year",More data- they simply have more data than the...,Senior Data Scientist,160000.0
9,San+Francisco,All-In Analytics,"$100,000 - $150,000 a year",Fraud Data Scientist. Seeking someone with ski...,Fraud Data Scientist,125000.0


In [407]:
def clean_c(x):
    y = x.replace('+',' ')
    return y
ndf['city'] = ndf['city'].apply(clean_c)

In [411]:
ndf.loc[ndf.salary.str.contains('month'), 'salary_type'] = 'month'
ndf.loc[ndf.salary.str.contains('year'), 'salary_type'] = 'year'

In [413]:
ndf['salary'] = ndf['parsed_salary']

In [414]:
ndf = ndf.drop(['parsed_salary'], axis=1)

In [415]:
ndf

Unnamed: 0,city,company,salary,summary,title,salary_type
0,San Francisco,MarkMonitor,180000.0,"Data skills (SQL, Hive, Pig). Applying machine...",Data Scientist,year
1,San Francisco,Workbridge Associates,155000.0,3+ years of industry experience in a data scie...,Senior Data Scientist,year
2,San Francisco,Mines.io,100000.0,We are looking for a data scientist/developer ...,Full-Stack Data Scientist,year
3,San Francisco,Workbridge Associates,165000.0,In this position you will share programming an...,Data Scientist,year
4,San Francisco,Smith Hanley Associates,140000.0,"This person will recruit, build and lead a tea...",Data Scientist,year
5,San Francisco,HSF Consulting,300000.0,Teams included Data Services(including data en...,VP of Data Services,year
6,San Francisco,All-In Analytics,125000.0,Fraud Data Scientist. Seeking someone with ski...,Fraud Data Scientist,year
7,San Francisco,Brilent,130000.0,"Perform large-scale data analysis, find intere...",Senior Data Scientist,year
8,San Francisco,HSF Consulting,160000.0,More data- they simply have more data than the...,Senior Data Scientist,year
9,San Francisco,All-In Analytics,125000.0,Fraud Data Scientist. Seeking someone with ski...,Fraud Data Scientist,year


#### We want to predict a binary variable - whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median)

In [416]:
## YOUR CODE HERE
ndf['salary_v'] = ndf['salary'].map(lambda x: 'high' if x > ndf['salary'].mean() else 'low')

#### Thought experiment: What is the baseline accuracy for this model?

The average salary plays a big role in this model because if the average salary of this sample cannot accurately represent the average of this population, this model cannot accurately predict for the other sets of data. Baseline is going to be the variabl that the model is comparing everything else to, in this case, one of those cities that will be dropped after the dummy variables are created. It also avoids dummy variable trap.

#### Create a Logistic Regression model to predict High/Low salary using statsmodel. Start by ONLY using the location as a feature. Display the coefficients and write a short summary of what they mean.

In [417]:
# create target variable
ndf['salary_b'] = ndf['salary_v'].map(lambda x: 1 if x == 'high' else 0)

In [418]:
# create dummy variables
dummies = pd.get_dummies(ndf['city'])

In [419]:
ndf

Unnamed: 0,city,company,salary,summary,title,salary_type,salary_v,salary_b
0,San Francisco,MarkMonitor,180000.0,"Data skills (SQL, Hive, Pig). Applying machine...",Data Scientist,year,high,1
1,San Francisco,Workbridge Associates,155000.0,3+ years of industry experience in a data scie...,Senior Data Scientist,year,high,1
2,San Francisco,Mines.io,100000.0,We are looking for a data scientist/developer ...,Full-Stack Data Scientist,year,high,1
3,San Francisco,Workbridge Associates,165000.0,In this position you will share programming an...,Data Scientist,year,high,1
4,San Francisco,Smith Hanley Associates,140000.0,"This person will recruit, build and lead a tea...",Data Scientist,year,high,1
5,San Francisco,HSF Consulting,300000.0,Teams included Data Services(including data en...,VP of Data Services,year,high,1
6,San Francisco,All-In Analytics,125000.0,Fraud Data Scientist. Seeking someone with ski...,Fraud Data Scientist,year,high,1
7,San Francisco,Brilent,130000.0,"Perform large-scale data analysis, find intere...",Senior Data Scientist,year,high,1
8,San Francisco,HSF Consulting,160000.0,More data- they simply have more data than the...,Senior Data Scientist,year,high,1
9,San Francisco,All-In Analytics,125000.0,Fraud Data Scientist. Seeking someone with ski...,Fraud Data Scientist,year,high,1


In [474]:
# create a new dataframe that contains both target and data
city = pd.concat([ndf['salary_b'], ndf['salary_type'],dummies.iloc[:,1:]], axis=1)

In [480]:
# seperate year and month - see if it makes a difference
city_y = city[city['salary_type']=='year']
city_m = city[city['salary_type']=='month']

In [489]:
import statsmodels.api as sm

# set data and target, then fit model 
# (having 2 salary types doesn't change the outcome compare to when there is only yearly salary/monthly salary)
data = city[list(city.columns[2:])]
target = city["salary_b"]

x = data.as_matrix(columns=None)
y = target.as_matrix(columns=None)

logit = sm.Logit(y, x)
# fit the model
result = logit.fit()

result.summary()

0,1,2,3
Dep. Variable:,y,No. Observations:,4791.0
Model:,Logit,Df Residuals:,4787.0
Method:,MLE,Df Model:,3.0
Date:,"Thu, 07 Jul 2016",Pseudo R-squ.:,0.3837
Time:,06:37:55,Log-Likelihood:,-2046.7
converged:,True,LL-Null:,-3320.9
,,LLR p-value:,0.0

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
x1,7.0682,1.000,7.065,0.000,5.107 9.029
x2,2.7702,0.202,13.702,0.000,2.374 3.166
x3,4.0405,0.381,10.597,0.000,3.293 4.788
x4,-1.3041,0.259,-5.041,0.000,-1.811 -0.797


In [490]:
# get AIC
result.aic

4101.4270108659312

In [493]:
# get BIC
result.bic

4127.3249886134645

In [495]:
result.conf_int()

array([[ 5.10737346,  9.02897054],
       [ 2.37394349,  3.16642048],
       [ 3.29325895,  4.78782476],
       [-1.81106714, -0.79704539]])

In [502]:
# ods ratio 
np.exp(result.params)

array([  1.17400000e+03,   1.59615385e+01,   5.68571429e+01,
         2.71428571e-01])

In [492]:
# Marginal Effect 
result_margeff = result.get_margeff()
result_margeff.summary() 

0,1
Dep. Variable:,y
Method:,dydx
At:,overall

Unnamed: 0,dy/dx,std err,z,P>|z|,[95.0% Conf. Int.]
x1,1.0586,0.149,7.125,0.0,0.767 1.350
x2,0.4149,0.028,14.908,0.0,0.360 0.469
x3,0.6051,0.055,10.977,0.0,0.497 0.713
x4,-0.1953,0.038,-5.117,0.0,-0.270 -0.121


In [503]:
city

Unnamed: 0,salary_b,salary_type,Chicago,New York,San Francisco,Seattle
0,1,year,0.0,0.0,1.0,0.0
1,1,year,0.0,0.0,1.0,0.0
2,1,year,0.0,0.0,1.0,0.0
3,1,year,0.0,0.0,1.0,0.0
4,1,year,0.0,0.0,1.0,0.0
5,1,year,0.0,0.0,1.0,0.0
6,1,year,0.0,0.0,1.0,0.0
7,1,year,0.0,0.0,1.0,0.0
8,1,year,0.0,0.0,1.0,0.0
9,1,year,0.0,0.0,1.0,0.0


In [506]:
city['salary_b_pred'] = result.predict(x)

In [517]:
city.salary_b_pred.value_counts()

0.500000    2681
0.999149    1175
0.941043     441
0.982716     405
0.213483      89
Name: salary_b_pred, dtype: int64

In [521]:
city.Chicago.value_counts()

0.0    3616
1.0    1175
Name: Chicago, dtype: int64

In [522]:
city.Seattle.value_counts()

0.0    4702
1.0      89
Name: Seattle, dtype: int64

The coefficient shows whether  the salary is going to be high in certain cities compare to Austin. For example, Chicago, New York, San Francisco all have higher salary than the baseline Austin, but Seattle has lower salary than Austin. We can also tell from the predicted probability that, Chicago's od of having a high salary is 99.9% , vs. Seattle is only 21.3% posible of being high compared to Austin. 

#### Create a few new variables in your dataframe to represent interesting features of a job title.
- For example, create a feature that represents whether 'Senior' is in the title 
- or whether 'Manager' is in the title. 
- Then build a new Logistic Regression model with these features. Do they add any value? 


In [523]:
ndf

Unnamed: 0,city,company,salary,summary,title,salary_type,salary_v,salary_b
0,San Francisco,MarkMonitor,180000.0,"Data skills (SQL, Hive, Pig). Applying machine...",Data Scientist,year,high,1
1,San Francisco,Workbridge Associates,155000.0,3+ years of industry experience in a data scie...,Senior Data Scientist,year,high,1
2,San Francisco,Mines.io,100000.0,We are looking for a data scientist/developer ...,Full-Stack Data Scientist,year,high,1
3,San Francisco,Workbridge Associates,165000.0,In this position you will share programming an...,Data Scientist,year,high,1
4,San Francisco,Smith Hanley Associates,140000.0,"This person will recruit, build and lead a tea...",Data Scientist,year,high,1
5,San Francisco,HSF Consulting,300000.0,Teams included Data Services(including data en...,VP of Data Services,year,high,1
6,San Francisco,All-In Analytics,125000.0,Fraud Data Scientist. Seeking someone with ski...,Fraud Data Scientist,year,high,1
7,San Francisco,Brilent,130000.0,"Perform large-scale data analysis, find intere...",Senior Data Scientist,year,high,1
8,San Francisco,HSF Consulting,160000.0,More data- they simply have more data than the...,Senior Data Scientist,year,high,1
9,San Francisco,All-In Analytics,125000.0,Fraud Data Scientist. Seeking someone with ski...,Fraud Data Scientist,year,high,1


In [531]:
ndf['title_s'] = ndf['title'].map(lambda x: 1 if ('Senior'|'VP'|'Chief') in x else 0)

TypeError: unsupported operand type(s) for |: 'str' and 'str'

In [530]:
ndf

Unnamed: 0,city,company,salary,summary,title,salary_type,salary_v,salary_b,title_s
0,San Francisco,MarkMonitor,180000.0,"Data skills (SQL, Hive, Pig). Applying machine...",Data Scientist,year,high,1,0
1,San Francisco,Workbridge Associates,155000.0,3+ years of industry experience in a data scie...,Senior Data Scientist,year,high,1,1
2,San Francisco,Mines.io,100000.0,We are looking for a data scientist/developer ...,Full-Stack Data Scientist,year,high,1,0
3,San Francisco,Workbridge Associates,165000.0,In this position you will share programming an...,Data Scientist,year,high,1,0
4,San Francisco,Smith Hanley Associates,140000.0,"This person will recruit, build and lead a tea...",Data Scientist,year,high,1,0
5,San Francisco,HSF Consulting,300000.0,Teams included Data Services(including data en...,VP of Data Services,year,high,1,0
6,San Francisco,All-In Analytics,125000.0,Fraud Data Scientist. Seeking someone with ski...,Fraud Data Scientist,year,high,1,0
7,San Francisco,Brilent,130000.0,"Perform large-scale data analysis, find intere...",Senior Data Scientist,year,high,1,1
8,San Francisco,HSF Consulting,160000.0,More data- they simply have more data than the...,Senior Data Scientist,year,high,1,1
9,San Francisco,All-In Analytics,125000.0,Fraud Data Scientist. Seeking someone with ski...,Fraud Data Scientist,year,high,1,0


In [None]:
## YOUR CODE HERE

#### Rebuild this model with scikit-learn.
- You can either create the dummy features manually or use the `dmatrix` function from `patsy`
- Remember to scale the feature variables as well!


In [None]:
## YOUR CODE HERE

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy, AUC, precision and recall of the model. 
- Discuss the differences and explain when you want a high-recall or a high-precision model in this scenario.

In [None]:
## YOUR CODE HERE

#### Compare L1 and L2 regularization for this logistic regression model. What effect does this have on the coefficients learned?

In [None]:
## YOUR CODE HERE

In [None]:
## YOUR CODE HERE

#### Continue to incorporate other text features from the title or summary that you believe will predict the salary and examine their coefficients

#### Take ~100 scraped entries with salaries. Convert them to use with your model and predict the salary - which entries have the highest predicted salaries?

### BONUS 

#### Bonus: Use Count Vectorizer from scikit-learn to create features from the text summaries. 
- Examine using count or binary features in the model
- Re-evaluate the logistic regression model using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [None]:
## YOUR CODE HERE

In [None]:
## YOUR CODE HERE

#### Re-test L1 and L2 regularization. You can use LogisticRegressionCV to find the optimal reguarlization parameters. 
- Re-test what text features are most valuable.  
- How do L1 and L2 change the coefficients?

In [None]:
## YOUR CODE HERE