# Exercise - Word Frequency Counter (Web Crawler Edition)

*Based on `PythonReview\EXERCISES\Exercise - How to Build A Web Crawler`*

## Objective: Crawl [Indeed.com](https://www.indeed.com) for the most common words in software developer job descriptions in Dallas, TX
Starting URL: https://www.indeed.com/jobs?q=software%20developer&l=Dallas,%20TX&start=0  *(`start` increases by 10 for each results page)*
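
The paging rule above can be sketched as a small helper. This is a minimal sketch rather than the notebook's own code: `build_search_urls` and its parameters are hypothetical names, and it assumes Indeed keeps the `q`, `l`, and `start` query parameters seen in the starting URL.

```python
from urllib.parse import quote

BASE_URL = "https://www.indeed.com/jobs"  # taken from the starting URL above


def build_search_urls(query, location, num_pages, page_size=10):
    """Return one search URL per results page (hypothetical helper).

    Assumes `start` is a zero-based result offset that grows by
    `page_size` per page, as in the starting URL.
    """
    urls = []
    for page_index in range(num_pages):
        start = page_index * page_size
        # safe="," keeps the comma literal, matching the starting URL
        url = (
            f"{BASE_URL}?q={quote(query)}"
            f"&l={quote(location, safe=',')}"
            f"&start={start}"
        )
        urls.append(url)
    return urls


urls = build_search_urls("software developer", "Dallas, TX", num_pages=3)
for url in urls:
    print(url)
```

The first URL printed is the starting URL itself; the next two end in `start=10` and `start=20`.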

*Note: Sending too many scraping requests, especially one right after another, may get you blocked by the web server.*

Workarounds after getting blocked:
* Clear Indeed's cookies and cache in the web browser.
* In Command Prompt, run `ipconfig /release`, then `ipconfig /renew`, to get assigned a new IP address via DHCP.
* Send the requests through a VPN and/or a proxy.
* Include the header info shown in the `get_job_posting_links` function below.

__Whichever workaround you use, do NOT send too many requests or you will get blocked again. Wait a few minutes before sending another request.__
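
One way to avoid sending requests back to back is to pause between them. The sketch below is an assumption-laden illustration, not part of the original notebook: `fetch_all_politely`, its parameters, and the 5-second default are hypothetical, and the right delay for Indeed is not known. The fetcher is passed in as a callable so the demo runs without touching the network.

```python
import time


def fetch_all_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive calls.

    `fetch` is any callable that takes a URL and returns a result, for
    example a thin wrapper around requests.get. `sleep` is injectable so
    the pause can be replaced in tests.
    """
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Demo with a stand-in fetcher and a recording sleep, so no real request is sent
recorded_pauses = []
pages = fetch_all_politely(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=5.0,
    sleep=recorded_pauses.append,
)
print(len(pages), recorded_pauses)
```

Three URLs produce two pauses, because the first request goes out immediately.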

### Expanded Project Idea (separate repository): Find the most common programming languages across all job descriptions by matching words against a separate, all-lowercase list of programming languages
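
A minimal sketch of that idea, assuming the job description words have already been lowercased and cleaned. The language list, the sample words, and the `count_languages` name are all hypothetical placeholders, not data from the crawl.

```python
from collections import Counter

# Hypothetical, deliberately short list; a real one would be much longer
PROGRAMMING_LANGUAGES = {"python", "java", "javascript", "c#", "c++", "sql", "go"}


def count_languages(words, languages=PROGRAMMING_LANGUAGES):
    """Count only the words that appear in the lowercase language list."""
    return Counter(word for word in words if word in languages)


# Made-up sample words standing in for scraped, lowercased job descriptions
sample_words = ["java", "and", "python", "experience", "java", "sql", "c#"]
language_counts = count_languages(sample_words)
for language, count in language_counts.most_common():
    print(language, "---", count)
```

Matching whole tokens like this would miss multi-word names and any name whose punctuation is stripped during cleanup, so the list and the cleanup step would need to agree.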

In [5]:
import requests # one of the ways to connect to websites via Python
from bs4 import BeautifulSoup # allows you to go through page source and get data
import operator # will help with counting words

- - - -

### Dynamic Web Crawler

In [12]:
# max_pages is the largest `start` offset to crawl (0 = first page only, 10 = first two pages, ...)
def get_job_posting_links(max_pages):
    page = 0  # increment by 10
    website_links = list()
    word_list = list()

    while page <= max_pages:
         

        url = "https://www.indeed.com/jobs?q=software%20developer&l=Dallas,%20TX&start="+str(page)
        
        # Avoid getting blocked in case multiple requests are sent
        header = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0",
            'referer':'https://www.indeed.com/'
        }

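        # Example proxy. requests chooses the proxy by URL scheme, so this 'http' entry
        # is not applied to the https:// URL above; that would need an 'https' entry.
        proxy = {'http': 'http://12.231.44.251:3128'}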

        # GET request; the response object holds the page's HTML source
        source_code = requests.get(url, headers=header, proxies=proxy)

        # the response body (the raw HTML) as a string; nothing is parsed yet
        plain_text = source_code.text

        # parse the HTML into a searchable tree; naming the parser avoids bs4's "no parser specified" warning
        soup = BeautifulSoup(plain_text, "html.parser")

        
        # Collect this page's job posting links (inspect element to find the unique class/id names)
        page_links = list()
        for link in soup.find_all('a', {'class': 'jcs-JobTitle'}): # equivalent to the CSS selector a.jcs-JobTitle
            job_href = "https://www.indeed.com" + link.get('href')
            page_links.append(job_href)
        website_links.extend(page_links)

        # Scrape only the links found on this page, so postings from earlier pages are not counted twice
        for job_link in page_links:
            # indiv_job_posting_info (next cell) returns the posting's words as a list
            word_list.extend(indiv_job_posting_info(job_link))

        # Increment pages by 10
        page += 10


    # Run once, after every page has been crawled, so the counts are printed a single time
    word_list_cleaned = clean_up_list(word_list)
    wordfrequency(word_list_cleaned)

In [13]:
def indiv_job_posting_info(job_url):
    # GET request; the response object holds the posting's HTML source
    source_code = requests.get(job_url)

    # the response body (the raw HTML) as a string
    plain_text = source_code.text

    # parse the HTML into a searchable tree
    soup = BeautifulSoup(plain_text, "html.parser")

    # find the job description
    jobdesc = soup.find('div', {'id': 'jobDescriptionText'})

    # no description found (e.g. a blocked request or a changed page layout): contribute no words
    if jobdesc is None:
        return []

    # strip the HTML tags, lowercase the text, and split it into a list of words
    job_desc_list = jobdesc.get_text().lower().split()

    return job_desc_list
        
    

In [14]:
# remove unwanted symbols and digits from each word
def clean_up_list(word_ls):
    symbols_to_remove = '".&:,\'()$1234567890'
    clean_word_list = []

    for word in word_ls:
        for symbol in symbols_to_remove:
            word = word.replace(symbol, "")

        # keep the word only if something is left after stripping
        if len(word) > 0:
            clean_word_list.append(word)

    return clean_word_list
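
As an alternative sketch, the same character stripping can be done in one pass per word with `str.translate`. `clean_up_list_translate` is a hypothetical name, and the sample words are illustrative only.

```python
SYMBOLS_TO_REMOVE = '".&:,\'()$1234567890'  # same characters as in clean_up_list

# With a third argument, str.maketrans maps each listed character to None (deleted)
REMOVAL_TABLE = str.maketrans("", "", SYMBOLS_TO_REMOVE)


def clean_up_list_translate(word_ls):
    """Hypothetical variant of clean_up_list built on str.translate."""
    stripped = (word.translate(REMOVAL_TABLE) for word in word_ls)
    return [word for word in stripped if word]  # drop words that became empty


cleaned = clean_up_list_translate(["(2012", "-", "2021),", "team.", "c++", "$40"])
print(cleaned)  # ['-', 'team', 'c++']
```

Words made up entirely of removed characters, such as `(2012`, disappear, while a lone `-` survives because the hyphen is not in the removal set.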

In [15]:
def wordfrequency(word_ls):
    counts = dict()


    for word in word_ls:
        counts[word] = counts.get(word, 0) + 1

    # Sort the dictionary by count, highest first (using the operator module)
    for key, val in sorted(counts.items(), key=operator.itemgetter(1), reverse=True): # 0 for key, 1 for value
        print(key, "---", val)
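
The counting and sorting steps can also be sketched with `collections.Counter` from the standard library. `word_frequency_counter` and its `top_n` parameter are hypothetical names, shown only as an alternative to the dictionary-plus-`operator` approach.

```python
from collections import Counter


def word_frequency_counter(word_ls, top_n=None):
    """Hypothetical variant of wordfrequency using collections.Counter.

    Returns the (word, count) pairs, highest count first, and prints them
    in the same "word --- count" format.
    """
    ranked = Counter(word_ls).most_common(top_n)  # top_n=None keeps every word
    for word, count in ranked:
        print(word, "---", count)
    return ranked


ranked = word_frequency_counter(["and", "to", "and", "software", "to", "and"], top_n=2)
```

Returning the ranked pairs as well as printing them makes the result reusable, for example to keep only the top few words.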

### MOST COMMON WORDS:

In [16]:
get_job_posting_links(0)

and --- 206
to --- 157
the --- 125
of --- 82
with --- 74
a --- 72
in --- 67
software --- 51
work --- 46
or --- 46
on --- 44
we --- 42
experience --- 42
for --- 41
you --- 39
are --- 36
is --- 35
your --- 32
will --- 31
our --- 30
ability --- 27
as --- 26
preferred --- 25
insurance --- 24
skills --- 24
an --- 23
job --- 22
one --- 21
- --- 21
at --- 21
team --- 21
development --- 20
location --- 20
have --- 17
us --- 17
developer --- 16
design --- 16
year --- 16
that --- 16
all --- 15
from --- 15
technology --- 15
training --- 15
requirements --- 14
not --- 14
benefits --- 14
opportunity --- 14
learn --- 13
systems --- 13
information --- 13
pay --- 13
time --- 13
engineering --- 13
engineer --- 13
develop --- 13
be --- 12
full-time --- 12
assistance --- 12
this --- 12
knowledge --- 11
responsibilities --- 11
required --- 11
per --- 11
computer --- 11
other --- 10
technical --- 10
program --- 10
schedule --- 10
health --- 10
paid --- 10
java --- 10
their --- 10
career --- 10
relocation -

### FULL LIST OF KEYWORDS:

In [8]:
# FULL LIST

get_job_posting_links(0)

['marketscout', 'is', 'a', 'progressive', 'company', 'focusing', 'on', 'innovation', 'and', 'creative', 'concepts', 'in', 'the', 'insurance', 'and', 'financial', 'industries.', 'marketscout,', 'named', 'one', 'of', 'the', 'best', 'places', 'to', 'work', 'in', 'insurance', 'for', 'the', 'past', 'nine', 'consecutive', 'years', '(2012', '-', '2021),', 'owns', 'and', 'operates', 'the', 'marketscout', 'exchange', 'at', 'www.marketscout.com,', 'as', 'well', 'as', 'over', '40', 'other', 'online', 'and', 'traditional', 'underwriting', 'and', 'distribution', 'venues.', 'marketscout', 'has', 'offices', 'in', 'alabama,', 'arkansas,', 'florida,', 'illinois,', 'nebraska,', 'pennsylvania,', 'south', 'carolina,', 'tennessee,', 'texas,', 'and', 'washington,', 'dc.', 'position', 'summary:', 'we', 'are', 'seeking', 'an', 'enthusiastic', 'junior', 'software', 'developer', 'to', 'join', 'our', 'software', 'design', 'team.', 'you', 'will', 'report', 'directly', 'to', 'the', 'development', 'manager', 'and',