# Web scraping for Data Scientist jobs in CO (9 points total)



In this exercise we will scrape the web for **Data Scientist jobs in CO**.


Here is the link to the search query

https://www.indeed.com/jobs?q=data+scientist&l=CO

As you can see at the bottom of the page, there are links to several pages of results for this search.
If you click on the second page, the search URL changes to

https://www.indeed.com/jobs?q=data+scientist&l=CO&start=10

If you click on the third page, the URL changes to

https://www.indeed.com/jobs?q=data+scientist&l=CO&start=20

Hence, to reach more pages we can format the search string (**change the start=?? part**) for **requests.get in a loop**.


# Q1 (5 points = 4 for non-indicator columns + 1 for indicator columns) Please complete the following task

- Scrape 10 pages (**the URL of the last (10th) page will be https://www.indeed.com/jobs?q=data+scientist&l=CO&start=90**) and build a pandas DataFrame containing the following information
    + **job title, name of the company, location, summary of job description**
    + **Indicator columns (with value True/False) for the keywords Python, SQL, AWS, RESTFUL, Machine learning, Deep Learning, Text Mining, NLP, SAS, Tableau, Sagemaker, TensorFlow, Spark**

Note:
- Make sure that you do a case-insensitive search for keywords when filling in (True/False) the indicator columns.
- You need to go to the detail page of each job posting for the keyword search. The main search results only contain a summary of the job description. Build the URL of each detail page from the scraped search results.
- If you run into difficulties that you cannot overcome, skip this question and import the DataFrame from the provided pickle file instead.
- If you find this entire homework too difficult at your current level of expertise, please feel free to complete AlternateHwk5 instead.
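
The case-insensitive keyword search from the notes can be sketched as below. This is only an illustration: the helper name `keyword_flags` and the sample sentence are made up, and the word-boundary matching (`\b`) is an assumption that goes beyond what the assignment asks for. It is shown because a plain substring search for a short keyword such as SAS would also match inside a longer word such as "Kansas".

```python
import re

KEYWORDS = ['Python', 'SQL', 'AWS', 'RESTFUL', 'Machine Learning', 'Deep Learning',
            'Text Mining', 'NLP', 'SAS', 'Tableau', 'Sagemaker', 'TensorFlow', 'Spark']


def keyword_flags(description, keywords=KEYWORDS):
    """Map each keyword to True/False for one job description (hypothetical helper)."""
    flags = {}
    for keyword in keywords:
        # re.escape guards against regex metacharacters inside a keyword;
        # \b keeps 'SAS' from matching inside a longer word such as 'Kansas'
        pattern = rf"\b{re.escape(keyword)}\b"
        flags[keyword] = bool(re.search(pattern, description, re.IGNORECASE))
    return flags


# made-up sample text, not a real job posting
sample = "Based in Kansas. Requires python, SQL, and Machine learning."
flags = keyword_flags(sample)
print(flags['Python'], flags['SQL'], flags['Machine Learning'], flags['SAS'])
```

Whether word boundaries are desirable depends on how the indicator columns will be graded; dropping the two `\b` markers gives the plain case-insensitive search.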

In [7]:
# import the necessary packages
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Indeed Page Convention
<Indeed.com> pages follow the convention: 

https://www.indeed.com/jobs?q=data+scientist&l=CO&start=#

where # is an integer that starts at 0 (the first page) and increases by 10 for each subsequent page (0, 10, 20, etc.)
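
As a minimal sketch of this convention, the ten search URLs can be generated from the `start` offset. The helper name `build_search_urls` and its parameters are illustrative assumptions, not part of the assignment.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_search_urls(query="data scientist", location="CO", pages=10, step=10):
    """Return one search URL per results page (hypothetical helper)."""
    urls = []
    for page_number in range(pages):
        params = {"q": query, "l": location}
        if page_number > 0:
            # the first page needs no start offset; later pages use 10, 20, ...
            params["start"] = page_number * step
        # urlencode turns the space in the query into '+'
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


search_urls = build_search_urls()
print(search_urls[0])
print(search_urls[-1])
```

Building the query string with `urlencode` avoids hand-escaping the search terms if the query or location changes.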

In [8]:
page = "https://www.indeed.com/jobs?q=data+scientist&l=CO"
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.78'}
response = requests.get(page, headers=header)
    
# code 403 means forbidden/blocked: server understands the request but refuses to authorize it
print(response.status_code, page)

print('\n', response.headers)

403 https://www.indeed.com/jobs?q=data+scientist&l=CO

 {'Date': 'Thu, 09 Feb 2023 03:07:04 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'close', 'Permissions-Policy': 'accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()', 'Referrer-Policy': 'same-origin', 'X-Frame-Options': 'SAMEORIGIN', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Set-Cookie': '__cf_bm=mdurmS3hSfti4paeZ6JJ_99vYNwEQWfwGfxx0bU7IcM-1675912024-0-Ab21SM/LM4v92Wximo2XP7hX6daOj/EkhWHldUfnNC/lU/8siItfvPK43HivrPogSsl+MCxieUOyeF+yqnkbfKE=; path=/; expires=Thu, 09-Feb-23 03:37:04 GMT; domain=.indeed.com; HttpOnly; Secure; SameSite=None', 'Server-Timing': 'cf-q-config;du

#### <indeed.com> uses Cloudflare, which appears to be blocking bot activity
"The site owner may have set restrictions that prevent you from accessing the site."

In [9]:
response.text



In [10]:
ten_indeed_pages = []
for i in range(0, 91, 10):
    page_archetype = f"https://www.indeed.com/jobs?q=data+scientist&l=CO&start={i}"
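    # send the browser-like User-Agent defined above (a bare second argument would be treated as query params)
    response = requests.get(page_archetype, headers=header)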
    ten_indeed_pages.append(page_archetype)
    
    # code 403 means forbidden/blocked: server understands the request but refuses to authorize it
    print(response.status_code, page_archetype)

403 https://www.indeed.com/jobs?q=data+scientist&l=CO&start=0
403 https://www.indeed.com/jobs?q=data+scientist&l=CO&start=10
403 https://www.indeed.com/jobs?q=data+scientist&l=CO&start=20
403 https://www.indeed.com/jobs?q=data+scientist&l=CO&start=30
403 https://www.indeed.com/jobs?q=data+scientist&l=CO&start=40
403 https://www.indeed.com/jobs?q=data+scientist&l=CO&start=50
403 https://www.indeed.com/jobs?q=data+scientist&l=CO&start=60
403 https://www.indeed.com/jobs?q=data+scientist&l=CO&start=70
403 https://www.indeed.com/jobs?q=data+scientist&l=CO&start=80
403 https://www.indeed.com/jobs?q=data+scientist&l=CO&start=90


In [5]:
print(ten_indeed_pages)

['https://www.indeed.com/jobs?q=data+scientist&l=CO&start=0', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=10', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=20', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=30', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=40', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=50', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=60', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=70', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=80', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=90']


In [11]:
ten_indeed_pages[0] = 'https://www.indeed.com/jobs?q=data+scientist&l=CO'   # first page doesn't need '&start=0'
print(ten_indeed_pages)

['https://www.indeed.com/jobs?q=data+scientist&l=CO', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=10', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=20', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=30', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=40', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=50', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=60', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=70', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=80', 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=90']


## Proceed to using Selenium

### Steps to Read a Job Description:
#### For each of the 10 pages of search results:

*Loop:* All 10 pages

A) Grab job title, company, and location

*Loop:*

1) Click on the respective job box

2) From "*jobsearch-ViewJobPaneWrapper*", scrape the job description and parse it for keywords:

    A) Get the unabbreviated job summary

    B) Fill the indicator columns: if a keyword is present in the summary, record 1/True, otherwise 0/False:

        * Python, SQL, AWS, RESTFUL, Machine learning, Deep Learning, Text Mining, NLP, SAS, Tableau, Sagemaker, TensorFlow, Spark

#### There are at most 15 jobs per page

#### NOTES:  
1) Not every <Indeed.com> results page will necessarily be full, since the number of postings depends on the job market.
    
    * If this happens, it would be on the last page.
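
The steps above can also be organised as one record per job rather than as many parallel lists, so that a description that fails to load cannot leave the columns with different lengths. The sketch below is an assumption about how that might look; `make_job_record` is a hypothetical helper and the two sample jobs are made up.

```python
def make_job_record(title, company, location, description, keywords):
    """Bundle one job's fields and keyword flags into a single dict (hypothetical helper)."""
    record = {
        'job_title': title,
        'company': company,
        'location': location,
        'job_description': description,
    }
    lowered = description.lower()
    for keyword in keywords:
        # case-insensitive substring check
        record[keyword] = keyword.lower() in lowered
    return record


keywords = ['Python', 'SQL', 'Spark']
records = [
    make_job_record('Data Scientist', 'Example Co', 'Denver, CO',
                    'Build models in python and sql.', keywords),
    # an empty description (e.g. the pane did not load) still yields a complete row
    make_job_record('Data Analyst', 'Example Co', 'Boulder, CO', '', keywords),
]
print(len(records), records[0]['Python'], records[1]['Python'])
```

A list of dicts like `records` can be passed straight to `pd.DataFrame(records)` to get one row per job.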

In [13]:
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

import time
from bs4 import BeautifulSoup
import pandas as pd
import re
import json
import os

In [14]:
# set pandas display options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

## Instantiate Chrome Webdriver & Navigate to 1st Target Page

In [3]:
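chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument("--start-maximized")
# Selenium 4 takes the driver path via a Service object and the options via `options=`;
# the older `executable_path=` and `chrome_options=` keywords are deprecated
driver = Chrome(
    service=Service(r'C:\Users\Joel\Desktop\DU\CS4447-DS Tools 1\Assignments\4\chromedriver_win32\chromedriver.exe'),
    options=chromeOptions
)
time.sleep(4)   # give the browser a moment to start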


# Fill Data Dictionary

In [18]:
start = time.time()

# construct data dictionary to hold elements
# idd meaning indeed dictionary
idd = {'job_title': [], 'company': [], 'location': [], 'job_description': [],
              'Python': [], 'SQL': [], 'AWS': [], 'RESTFUL': [], 'Machine Learning': [], 
               'Deep Learning': [], 'Text Mining': [], 'NLP': [], 'SAS': [], 'Tableau': [], 
               'Sagemaker': [], 'TensorFlow': [], 'Spark': []
              }

# Create list of key words we will parse our job description for
# keywords to search for in job description: keywords are same as dictionary keys above
keywords = ['Python', 'SQL', 'AWS', 'RESTFUL', 'Machine Learning', 'Deep Learning', 'Text Mining', 
            'NLP', 'SAS', 'Tableau', 'Sagemaker', 'TensorFlow', 'Spark']


total_jobs_scraped = 0
# begin to go through job pages
for index, indeed_page in enumerate(ten_indeed_pages):
    driver.get(indeed_page)
    driver.implicitly_wait(10)   # element lookups will wait up to 10 seconds for elements to appear
    print(f"Page {index + 1}: {indeed_page}\n'{driver.title}'")

    # get .page_source of the driver for the current webpage
    ps = driver.page_source
    
    # feed page source into Beautiful Soup and collect one result cell per job
    soup = BeautifulSoup(ps, "html.parser")
    rows = soup.find_all('td', class_='resultContent')
    
    # Get job titles, companies and locations
    for row in rows:
        idd['job_title'].append(row.span.text)
        idd['company'].append(row.find('span', class_='companyName').text)
        idd['location'].append(row.find('div', class_='companyLocation').text)


    ## click on each job to get the unabbreviated job description

    # use find_elements() to count all job tile links
    job_tile_count = len(driver.find_elements(By.CSS_SELECTOR,'[class="slider_container css-g7s71f eu4oa1w0"]'))
    total_jobs_scraped += job_tile_count
    print(f"number of jobs on this page = {job_tile_count}")
    # wrap .find_elements in a function so the tiles are re-queried on every pass,
    # which helps avoid DOM/driver errors from references that have gone stale
    all_job_tiles = lambda: driver.find_elements(By.CSS_SELECTOR,'[class="slider_container css-g7s71f eu4oa1w0"]')

    # click job tile links 1-by-1 & get full job description
    i = 0
    while True:
        job_tiles = all_job_tiles()

        # once every job tile has been visited, terminate the while loop
        if i > job_tile_count - 1:
            break
        
        # click on job tile to get the unabbreviated summary info
        ActionChains(driver).move_to_element(job_tiles[i]).click(job_tiles[i]).perform()
        # the implicit wait set when the page was opened still applies, so it is not reset here
        time.sleep(4)    # pause so the detail pane can finish loading before the page source is read

        # get new page source for lengthy summary info
        ps = driver.page_source
        
        # feed page source into Beautiful Soup
        soup = BeautifulSoup(ps, "html.parser")

        time.sleep(4)    # extra pause between jobs
        
        # get updated job description; fall back to an empty string if the pane did not load,
        # so that every list in idd keeps the same length
        desc_div = soup.find('div', attrs={'class': 'jobsearch-jobDescriptionText'})
        desc = desc_div.text if desc_div is not None else ''
        idd['job_description'].append(desc)

        # record True/False for each keyword (case-insensitive search)
        for keyword in keywords:
            idd[keyword].append(bool(re.search(re.escape(keyword), desc, re.IGNORECASE)))


        i += 1
        
        
    print('-' * 120)

print('=' * 120, '\n', f"total jobs scraped for 'data scientist' + 'CO': {total_jobs_scraped}")
        
driver.quit()   # quit() closes every window and ends the WebDriver session

stop = time.time()

runtime = stop - start
print(f"runtime: {runtime / 60} mins")

Page 1: https://www.indeed.com/jobs?q=data+scientist&l=CO
'Data Scientist Jobs, Employment in Colorado | Indeed.com'
number of jobs on this page = 15
------------------------------------------------------------------------------------------------------------------------
Page 2: https://www.indeed.com/jobs?q=data+scientist&l=CO&start=10
'Data Scientist Jobs, Employment in Colorado | Indeed.com'
number of jobs on this page = 15
------------------------------------------------------------------------------------------------------------------------
Page 3: https://www.indeed.com/jobs?q=data+scientist&l=CO&start=20
'Data Scientist Jobs, Employment in Colorado | Indeed.com'
number of jobs on this page = 15
------------------------------------------------------------------------------------------------------------------------
Page 4: https://www.indeed.com/jobs?q=data+scientist&l=CO&start=30
'Data Scientist Jobs, Employment in Colorado | Indeed.com'
number of jobs on this page = 15
----------

### NOTE: the above code cell requires a reasonably fast and consistent internet connection to run effectively.
Sleep time has been built into the code in multiple places on purpose, which adds to the runtime. This keeps a page that loads slightly too slowly from stopping the code.
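
Fixed sleeps always cost their full duration, even when the page is ready early. A common alternative is to poll for a condition and stop as soon as it holds. Selenium offers this through `WebDriverWait` together with `expected_conditions`; the library-free sketch below only illustrates the polling idea. `wait_until` and `pane_loaded` are hypothetical names, and `pane_loaded` is a stand-in for a real check such as "the job description element is present".

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None on timeout (hypothetical helper).
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


# stand-in for "has the job description pane loaded yet?"
attempts = {'count': 0}


def pane_loaded():
    attempts['count'] += 1
    return 'description text' if attempts['count'] >= 3 else None


result = wait_until(pane_loaded, timeout=2.0, poll_interval=0.01)
print(result, attempts['count'])
```

With this pattern the timeout only bounds the worst case, so a generous value does not slow down pages that load quickly.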


In [21]:
for k, v in idd.items():
    print(f"{k}: {v}\n")

job_title: ['Sr. Data Scientist - VIRTUAL', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist, Staff', 'Data Scientist', 'Analyst - Junior Data Scientist', 'Data Scientist', 'Senior Data Scientist', 'Data Scientist', 'Data Scientist', 'Data Scientist, TIFIN Wealth', 'Data Scientist', 'Manufacturing Data Scientist', 'Data Scientist', 'Data Scientist II', 'Data Scientist', 'AIML - Analytics Engineer, Data & ML Innovation', 'Data Scientist or Sr. Data Scientist', 'Head of Data Science', 'Statistical Analyst I', 'Data Scientist', 'Data Scientist', 'Postdoc in Educational Data Mining/Learning Analytics', 'Senior Data Scientist', 'Data Scientist 3', 'Content Expert- Data Science', 'Natural Language Processing Specialist', 'Data Scientist', 'Software Developer – Computer Vision and Machine Learning', 'Data Scientist (Mid)', 'Associate Scientist II/III - NCAR Data Stewardship Coordinator', 'Data Science Research Institute Systems Administrator', 'Data Scie

In [20]:
df = pd.DataFrame(idd)
df

Unnamed: 0,job_title,company,location,job_description,Python,SQL,AWS,RESTFUL,Machine Learning,Deep Learning,Text Mining,NLP,SAS,Tableau,Sagemaker,TensorFlow,Spark
0,Sr. Data Scientist - VIRTUAL,Comcast,"Centennial, CO 80112",Comcast brings together the best in media and ...,True,True,False,False,True,False,False,False,False,True,False,False,False
1,Data Scientist,VeriCour,"Remote in Denver, CO",Data Scientist\nThe Data Scientist will be res...,True,True,False,False,False,False,False,False,False,False,False,False,False
2,Data Scientist,CyberCoders,"Remote in Denver, CO 80238",\n Data Scientist \n \nJob Title: Data Scienti...,True,True,False,False,True,False,False,False,True,True,False,False,False
3,Data Scientist,LogRhythm,"Remote in Denver, CO",\nAs a Data Scientist on our BI & Analytics te...,True,False,False,False,True,False,True,False,False,False,False,False,True
4,Data Scientist,Permian Resources,"Denver, CO 80202 (Lodo area)",\n\n Permian Resources (NYSE: PR) is currentl...,True,True,False,False,False,False,False,False,False,False,False,False,False
5,"Data Scientist, Staff",LOCKHEED MARTIN CORPORATION,"Aurora, CO 80011 (Norfolk Glen area)",\nThe coolest jobs on this planet… or any othe...,True,True,False,False,False,False,False,False,False,False,False,False,False
6,Data Scientist,Infosys,"Denver, CO",\n Infosys is seeking \n Data Scientists with ...,True,True,True,False,True,True,False,False,False,True,False,False,True
7,Analyst - Junior Data Scientist,YES Communities,"Greenwood Village, CO","About YES \nYES Communities, founded in 2008, ...",True,True,False,False,False,False,False,False,False,False,False,False,False
8,Data Scientist,"City of Grand Junction, Colorado","Grand Junction, CO 81506",\n\n\nDescription\n\n\n\n\nDATA SCIENTIST\n\n\...,False,False,False,False,True,False,False,False,False,False,False,False,False
9,Senior Data Scientist,The Trade Desk,"Boulder, CO",\n\n The Trade Desk is a global technology co...,False,False,False,False,False,False,False,False,False,False,False,False,False


# Q2 (1 point) Save your DataFrame to a pickle file named *indeed_job_co.pkl*. 
   Load this pkl file into a DataFrame and use this DataFrame to answer the following questions.

   <font color='red'>Upload the pickle file (indeed_job_co.pkl) along with the solution notebook to Canvas</font>

In [22]:
# pickle our dataframe object
df.to_pickle('indeed_search_dataScientist_Colorado_2.8.2023.pkl')

In [23]:
# load the pickled DataFrame back in
import pickle

with open('indeed_search_dataScientist_Colorado_2.8.2023.pkl', 'rb') as f:
    job_df = pickle.load(f)
job_df.head()

Unnamed: 0,job_title,company,location,job_description,Python,SQL,AWS,RESTFUL,Machine Learning,Deep Learning,Text Mining,NLP,SAS,Tableau,Sagemaker,TensorFlow,Spark
0,Sr. Data Scientist - VIRTUAL,Comcast,"Centennial, CO 80112",Comcast brings together the best in media and ...,True,True,False,False,True,False,False,False,False,True,False,False,False
1,Data Scientist,VeriCour,"Remote in Denver, CO",Data Scientist\nThe Data Scientist will be res...,True,True,False,False,False,False,False,False,False,False,False,False,False
2,Data Scientist,CyberCoders,"Remote in Denver, CO 80238",\n Data Scientist \n \nJob Title: Data Scienti...,True,True,False,False,True,False,False,False,True,True,False,False,False
3,Data Scientist,LogRhythm,"Remote in Denver, CO",\nAs a Data Scientist on our BI & Analytics te...,True,False,False,False,True,False,True,False,False,False,False,False,True
4,Data Scientist,Permian Resources,"Denver, CO 80202 (Lodo area)",\n\n Permian Resources (NYSE: PR) is currentl...,True,True,False,False,False,False,False,False,False,False,False,False,False


In [None]:
# assignment prompt provided unpickling code to get DF

# with open('indeed_job_co.pkl', 'rb') as f:
#     job_df = pickle.loads(f.read())
# job_df.head()

<font size = "6" color='red'> Use pandas functionality to answer question 3</font>
# Q3 a (1 point) Which city has the maximum number of job postings?
Interpretation: Which city [in Colorado] has the most job postings?



In [24]:
job_df['location'].value_counts()

Denver, CO                                                      19
Colorado Springs, CO                                            10
Remote in Denver, CO                                             8
Aurora, CO                                                       5
Denver, CO 80202 (Union Station area)                            5
Remote in Boulder, CO 80305                                      4
Aurora, CO 80011 (Norfolk Glen area)                             4
Boulder, CO                                                      4
Centennial, CO 80112                                             3
Colorado Springs, CO 80912                                       3
Colorado Springs, CO 80916 (Southeast Colorado Springs area)     3
Denver, CO 80202 (Central Business District area)                3
Falcon, CO                                                       2
Fort Collins, CO                                                 2
Littleton, CO 80120                                           

In [25]:
locations = job_df['location'].tolist()

# We see above that some cities are split further by a more specific locale, e.g. "(East Colorado Springs area)";
# we need to account for this in our frequency count of cities
loc_count, denver_count, colorado_springs_count = 0, 0, 0
for loc in locations:
    if "Denver" in loc:
        print(loc)
        denver_count += 1
    if "Colorado Springs" in loc:
        print(loc)
        colorado_springs_count += 1
    
    loc_count += 1
print('=' * 120, f"total locations/line count:\t{loc_count}", sep='\n')
print(f"'Denver' count:\t{denver_count}")
print(f"'Colorado Springs' count:\t{colorado_springs_count}")

Remote in Denver, CO
Remote in Denver, CO 80238
Remote in Denver, CO
Denver, CO 80202 (Lodo area)
Denver, CO
Colorado Springs, CO 80916 (Southeast Colorado Springs area)
Remote in Denver, CO
Remote in Denver, CO
Remote in Denver, CO
Denver, CO 80202 (Union Station area)
Hybrid remote in Denver, CO 80202
Remote in Colorado Springs, CO 80916
Denver, CO
Colorado Springs, CO 80919 (Northwest Colorado Springs area)
Colorado Springs, CO
Denver, CO
Remote in Denver, CO
Colorado Springs, CO
Remote in Denver, CO 80239
Denver, CO
Remote in Denver, CO 80202
Denver, CO 80202 (Central Business District area)
Denver, CO
Denver, CO 80202 (Central Business District area)
Denver, CO 80202 (Central Business District area)
Remote in Denver, CO 80201
Denver, CO
Colorado Springs, CO 80909 (East Colorado Springs area)
Colorado Springs, CO 80903 (East Colorado Springs area)
Remote in Denver, CO
Denver, CO
Denver, CO 80202 (Union Station area)
Denver, CO
Colorado Springs, CO 80903 (Central Colorado Springs ar

In [None]:
# uncomment to get full job locations list
# print("=======Full list of job locations=======")
# for loc in locations:
#     print(loc)

In [26]:
# frequency count of cities using regular expressions
freq = {'Denver': 0, 
        'Colorado Springs': 0, 
        'Golden': 0,
        'Boulder': 0,
        'Aurora': 0,
        'remote': 0, 
        'hybrid': 0, 
        'Colorado': 0,
        'remote + Denver': 0, 
        'remote + Colorado Springs': 0, 
        'remote + Golden': 0,
        'remote + Colorado': 0,
        'remote + Boulder': 0,
        'remote + Aurora': 0
       }

# search only for the base phrases; the 'remote + ...' keys just hold combined counts
# (looping over every key would also treat e.g. 'remote + Denver' as a regex)
base_patterns = [key for key in freq if not key.startswith('remote + ')]

for location in locations:
    for pattern in base_patterns:
        if re.findall(f".*({pattern}).+", location, re.IGNORECASE):
            freq[pattern] += 1
        if pattern == 'remote' or pattern == 'hybrid':
            pass
        else:
            if re.findall(f".*(remote).*({pattern}).*|.*({pattern}).*.*(remote)", location, re.IGNORECASE):
                freq[f"remote + {pattern}"] += 1


# Denver and Colorado Springs have the most job postings and are in contention for the max per city.
# Subtract the jobs that name a city but are also marked remote, so that they are counted
# as strictly remote for city frequency count purposes.
freq['Denver'] = freq['Denver'] - freq['remote + Denver']
freq['Colorado Springs'] = freq['Colorado Springs'] - freq['remote + Colorado Springs']
freq['Golden'] = freq['Golden'] - freq['remote + Golden']
freq['Boulder'] = freq['Boulder'] - freq['remote + Boulder']
freq['Aurora'] = freq['Aurora'] - freq['remote + Aurora']

# Note: listings such as "Hybrid remote in Denver, CO 80202" also contain "remote", so they are
# subtracted above as well, even though hybrid jobs are in the city at least part-time.

print("Job Frequencies by Location Search Phrase\n(Case Insensitive)\n", '=' * 42, sep='')
for k, v in sorted(freq.items(), key=lambda kv: kv[1], reverse=True):    # sort search by dictionary values, descending
    print(f"{k:.<40}{v:.>2}")
print('=' * 42, f"total number of jobs:\t{loc_count}", sep='\n')

Job Frequencies by Location Search Phrase
(Case Insensitive)
Denver..................................31
remote..................................30
Colorado................................29
Colorado Springs........................24
remote + Denver.........................15
Aurora..................................12
Boulder..................................8
hybrid...................................5
remote + Colorado........................5
remote + Boulder.........................5
remote + Colorado Springs................4
Golden...................................2
remote + Golden..........................0
remote + Aurora..........................0
total number of jobs:	137


We can see from our job location analysis above that **Denver** is the single city with the most job postings: 31 listings name 'Denver' in the job location field once the listings that are also marked remote have been subtracted. If we were to count the entire Denver Metro area, as defined by the Census Bureau or by our own distance cutoff for neighboring cities, the count would likely be much higher.
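
An alternative to counting search phrases is to split each location string into a city name and a work mode first. The sketch below assumes the location formats seen in the output above ("Denver, CO 80202 (Lodo area)", "Remote in Denver, CO", "Hybrid remote in Denver, CO 80202"); `parse_location` is a hypothetical helper, and strings that do not follow these formats (for example a bare "Remote") would need extra handling.

```python
import re
from collections import Counter

# optional "Remote in" / "Hybrid remote in" prefix, then the city name up to the first comma
LOCATION_RE = re.compile(
    r"^\s*(?:(?P<mode>hybrid remote|remote) in )?(?P<city>[^,]+)",
    re.IGNORECASE,
)


def parse_location(location):
    """Return (city, work_mode) for one location string (hypothetical helper)."""
    match = LOCATION_RE.match(location)
    if match is None:
        return None, None
    mode = (match.group('mode') or 'on-site').lower()
    return match.group('city').strip(), mode


sample_locations = [
    'Denver, CO 80202 (Lodo area)',
    'Remote in Denver, CO',
    'Hybrid remote in Denver, CO 80202',
    'Colorado Springs, CO 80916 (Southeast Colorado Springs area)',
]
parsed = [parse_location(loc) for loc in sample_locations]
on_site_counts = Counter(city for city, mode in parsed if mode == 'on-site')
print(parsed)
print(on_site_counts)
```

Applied to the full `location` column, the resulting city names could be counted directly, with remote and hybrid listings kept as separate categories.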

# Q3 b (1.5 points) Top 3 most in-demand skills (like Python, AWS, SQL ...)
Interpretation: What are the top 3 most in-demand skills of those captured above in the problem statement?



In [27]:
# keep only the boolean indicator columns
keywords = job_df.drop(['job_title', 'company', 'location', 'job_description'], axis=1)

print("keyword totals\n", '=' * 25, sep='')
print(keywords.sum().sort_values(ascending=False))

# NOTE: 142 is hard-coded, while the location analysis above counted 137 rows;
# len(job_df) may be the safer denominator for the percentages
total_jobs_scraped = 142
print("\n\nKeyword found in all Job Listings\n(Percentage of Total Jobs)\n", '=' * 35, sep='')
print(keywords.sum().sort_values(ascending=False).div(total_jobs_scraped).mul(100).round(2))

keyword totals
Python              92
Machine Learning    76
SQL                 58
AWS                 46
Deep Learning       27
Spark               27
Tableau             26
TensorFlow          22
NLP                 14
SAS                  9
Text Mining          4
Sagemaker            2
RESTFUL              0
dtype: int64


Keyword found in all Job Listings
(Percentage of Total Jobs)
Python              64.79
Machine Learning    53.52
SQL                 40.85
AWS                 32.39
Deep Learning       19.01
Spark               19.01
Tableau             18.31
TensorFlow          15.49
NLP                  9.86
SAS                  6.34
Text Mining          2.82
Sagemaker            1.41
RESTFUL              0.00
dtype: float64


### The top 3 most in-demand skills per our <indeed.com> results are: 
1) Python

2) Machine Learning

3) SQL
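
The ranking can be reproduced from the keyword totals alone. The sketch below copies the totals from the printed output above into a `Counter` and is only an illustration; on the DataFrame itself, `keywords.sum().nlargest(3)` should give the same three names.

```python
from collections import Counter

# keyword totals copied from the printed output above
keyword_totals = Counter({
    'Python': 92, 'Machine Learning': 76, 'SQL': 58, 'AWS': 46,
    'Deep Learning': 27, 'Spark': 27, 'Tableau': 26, 'TensorFlow': 22,
    'NLP': 14, 'SAS': 9, 'Text Mining': 4, 'Sagemaker': 2, 'RESTFUL': 0,
})

# most_common(3) returns the three highest counts in descending order
top_three = keyword_totals.most_common(3)
for skill, count in top_three:
    print(f"{skill}: {count}")
```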

# Q3 c (0.5 points) What other questions would you like to ask based on the Indeed data?

This is a free response question.

### I would also like to have data on and ask questions about:
* estimated job salary information (min/max & therefore range)
* corporate rating information (is company peer-reviewed as a good company to work for?)
* date of job posting (to know how old a job is, and therefore how relevant it is)
* if we counted job postings around a given base city, how many jobs in neighboring cities within a certain number of miles would also be included?
    * Indeed job postings now appear to have a map showing the geographic location of each job; can we scrape coordinates?
* what is the total number of jobs any employer has open at a given time (hiring trends, etc.)?