# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Web Scraping Job Postings

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Seek.com](https://www.seek.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like seek.com in order to collect the data to best answer these two questions.

---

## Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Seek.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).
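
A minimal sketch of the classification framing described above, assuming the salaries have already been parsed into numbers. The figures and the `high_salary` name are illustrative and are not taken from the scraped data.

```python
import pandas as pd

# Hypothetical parsed annual salaries (illustrative values only).
salaries = pd.Series([85000, 110000, 135000, 97000, 150000], name="salary")

# Use the median as the cut-off between "low" and "high".
threshold = salaries.median()

# 1 = above the median (high), 0 = at or below it (low).
high_salary = (salaries > threshold).astype(int)

print(threshold)
print(high_salary.tolist())
```

A median split keeps the two classes roughly balanced, so the baseline accuracy to beat is about 50%.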

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.

### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.
 
## Requirements

1. Scrape and prepare your own data.

2. **Create and compare at least two models for each section**. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression of your choosing (e.g. Ridge, logistic regression, KNN, SVM, etc).
   - Section 1: Job Salary Trends
   - Section 2: Job Category Factors

3. Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists. 
   - Make sure to clearly describe and label each section.
   - Comment on your code so that others could, in theory, replicate your work.

4. A brief writeup in an executive summary, written for a non-technical audience.
   - Writeups should be 500-1000 words, defining any technical terms and explaining your approach as well as any risks and limitations.
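
Requirement 2 could be approached along these lines. This is a sketch on synthetic data from `make_classification`, not on the scraped postings; the two model choices, their settings, and the 5-fold accuracy comparison are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a feature matrix and a high/low salary label.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# One ensemble model and one linear classifier, as the requirement asks.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

# Majority-class share: the accuracy of always guessing the commonest label.
baseline = max(y.mean(), 1 - y.mean())

for name, score in scores.items():
    print(f"{name}: {score:.3f} (baseline {baseline:.3f})")
```

Reporting each score next to the majority-class baseline makes it clear whether a model is learning anything beyond the class balance.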

 

## Suggestions for Getting Started

1. Collect data from [seek.com](https://www.seek.com) (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
  - Select and parse data from *at least 1000 postings* for jobs, potentially from multiple location searches.
2. Find out what factors most directly impact salaries (e.g. title, location, department, etc).
  - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Discover which features have the greatest importance when determining a low vs. high paying job.
  - Your Boss is interested in what overall features hold the greatest significance.
  - HR is interested in which SKILLS and KEY WORDS hold the greatest significance.   
4. Author an executive summary that details the highlights of your analysis for a non-technical audience.
 
---

## Useful Resources

- Scraping is one of the most fun, useful and interesting skills out there. Don’t lose out by copying someone else's code!
- [Here is some advice on how to write for a non-technical audience](http://programmers.stackexchange.com/questions/11523/explaining-technical-things-to-non-technical-people)
- [Documentation for BeautifulSoup can be found here](http://www.crummy.com/software/BeautifulSoup/).

---

### Project Feedback + Evaluation

For all projects, students will be evaluated on a simple 4 point scale (0, 1, 2 or 3). Instructors will use this rubric when scoring student performance on each of the core project **requirements:** 

 Score | Expectations
 ----- | ------------
 **0** | _Did not complete. Try again._
 **1** | _Does not meet expectations. Try again._
 **2** | _Meets expectations._
 **3** | _Surpasses expectations. Brilliant!_
 
# Project 4 feedback

| Requirement | Rubric |
|------|------|
| Scrape and prepare your own data | |
| Create and compare at least two models, one a decision tree or ensemble and the other a classifier or regression, for Section 1: Job Salary Trends and Section 2: Job Category Factors (so at least 4 models in total) | |
| Polished Jupyter notebook with your analysis annotated for a peer audience of data scientists | |
| Executive summary at the beginning of your notebook, written for your superiors to use to make business decisions. Make sure that it includes the ‘So what…’ regarding your analysis, risks and limitations. | |

__Qualitative feedback:__


In [1]:
import requests

import numpy as np
from scrapy.selector import Selector
from bs4 import BeautifulSoup # From Sri Class on Sat 15/01/2021
import pandas as pd
import matplotlib.pyplot as plt

Lesson where this was touched on in Unit 3:

http://localhost:8889/notebooks/KrisdansRepository01/Unit-3/dsi-unit-3.29-python-practice_xpath_webscraping-lesson/Srikanta%20-%20XPath%20Scrapy%20Selector.ipynb

In [2]:
# url = "https://www.seek.com.au/data-scientist-jobs?page=%d" % page
# url = "https://www.seek.com.au/data-scientist-jobs"
# base_url = "https://www.seek.com.au"
# url = "https://www.seek.com.au/data-scientist-jobs"

In [3]:
# Response200 = requests.get(url)
# Response200

In [4]:
# TheSoup = BeautifulSoup(requests.get(url).content,'lxml')
# TheSoup

In [5]:
# # Found the DIV Class in Seek.com
# "rX6MwqN"

In [6]:
# TheSoup_rX6MwqN = TheSoup.find_all('div',{'class': 'rX6MwqN'})
# TheSoup_rX6MwqN

In [7]:
# ListJobs_TheSoup_rX6MwqN = []

# for x in TheSoup_rX6MwqN[0].find_all('div'):
#     ListJobs_TheSoup_rX6MwqN.append(x)

# ListJobs_TheSoup_rX6MwqN
# len(ListJobs_TheSoup_rX6MwqN)

In [8]:
#Going down the tree

# ListJobs_TheSoup_rX6MwqN[0].find('article').find('span').find('div').attrs['href']

In [9]:
# for n in ListJobs_TheSoup_rX6MwqN:
#     n.find('article').find('span').find('div').attrs['href']

# AttributeError: 'NoneType' object has no attribute 'find'
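
The `AttributeError` noted above is what BeautifulSoup produces when `find` returns `None` for a block that lacks the expected tag and the next `.find` is then called on that `None`. A minimal sketch of a guarded loop follows; the HTML snippet and the `results` class are made up for illustration and are not Seek's real markup.

```python
from bs4 import BeautifulSoup

# Made-up markup: the middle block has no <article>, like an ad or spacer div.
html = """
<div class="results">
  <div><article><a href="/job/1">Data Scientist</a></article></div>
  <div><p>Sponsored banner with no article</p></div>
  <div><article><a href="/job/2">Data Analyst</a></article></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"class": "results"})

links = []
for block in container.find_all("div", recursive=False):
    article = block.find("article")
    if article is None:
        # Skip blocks that are not job cards instead of chaining .find on None.
        continue
    anchor = article.find("a", href=True)
    if anchor is not None:
        links.append(anchor["href"])

print(links)
```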

In [10]:
# "https://www.seek.com.au/data-scientist-jobs?page=%d" % page

In [11]:
# import scrapy

# '''
# Anita Liu 8/01/2021   17:14
# '''

# class AuthorSpider(scrapy.Spider):
#     name = 'author'

#     start_urls = ['http://quotes.toscrape.com/']

#     def parse(self, response):
#         author_page_links = response.css('.author + a')
#         yield from response.follow_all(author_page_links, self.parse_author)

#         pagination_links = response.css('li.next a')
#         yield from response.follow_all(pagination_links, self.parse)

#     def parse_author(self, response):
#         def extract_with_css(query):
#             return response.css(query).get(default='').strip()

#         yield {
#             'name': extract_with_css('h3.author-title::text'),
#             'birthdate': extract_with_css('.author-born-date::text'),
#             'bio': extract_with_css('.author-description::text'),
#         }

Anita Liu 8/01/2021   17:15

https://docs.scrapy.org/en/latest/intro/tutorial.html

Anita Liu  8/01/2021 17:31

https://www.youtube.com/watch?v=wYJAtx4HL6U&list=PLhTjy8cBISEqkN-5Ku_kXG4QW33sxQo0t&index=2 

this playlist helped me get through project 4

In [12]:
# Anita suggests cleaning with a stemmer

# Text cleaning

#     Stemmer - ing, y
#     tokenisation - 
#     countvectorise - Done
#     snowballstemmer
#     Gensim
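
A minimal sketch of the tokenise-then-stem idea from the notes above, assuming NLTK is installed. The `tokenise` and `stem_tokens` helpers and the sample sentence are hypothetical and are not part of the notebook.

```python
import re

from nltk.stem.snowball import SnowballStemmer

# The English Snowball stemmer needs no extra corpus downloads.
stemmer = SnowballStemmer("english")


def tokenise(text):
    """Lowercase the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())


def stem_tokens(text):
    """Tokenise, then reduce each token to its stem."""
    return [stemmer.stem(token) for token in tokenise(text)]


sample = "Running models and analysing jobs"
stems = stem_tokens(sample)
print(stems)
```

The stemmed tokens can be re-joined with spaces and passed to `CountVectorizer`, so that variants such as "job" and "jobs" share one column.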



In [13]:
# requests.get(url)

#### Project 4 steps:

https://deepnote.com/project/Week-16-ID14VL4iQdeu_wJWLCnY2A/%2Fproject-4%2FProject%204%20-%20Review.ipynb/#00cc0266-d2ba-410a-a946-60c513844499

1. gather data via web scraping

2. apply stemming and lemmatization for text cleaning

3. apply feature engineering techniques

4. train the built model 

5. evaluate the model's performance

6. make appropriate changes in the model

7. deploy the model 


In [14]:
# https://deepnote.com/project/Week-16-ID14VL4iQdeu_wJWLCnY2A/%2Fproject-4%2FProject%204%20-%20Review.ipynb/#00006-d2bfc205-5db4-4d7e-bdc3-122917abf6a6

## Python web requests:

# import requests

# url = ""

# r = requests.get(url)

# if r.status_code == 200:
#     print(r.text)

In [15]:
# https://deepnote.com/project/Week-16-ID14VL4iQdeu_wJWLCnY2A/%2Fproject-4%2FProject%204%20-%20Review.ipynb/#00006-d2bfc205-5db4-4d7e-bdc3-122917abf6a6

## The primary library in Python for parsing HTML is BeautifulSoup

# from bs4 import BeautifulSoup

# url = ''

# r = requests.get(url)
# if r.status_code == 200:
#    soup = BeautifulSoup(r.text)
#    print(soup)

In [16]:
# https://deepnote.com/project/Week-16-ID14VL4iQdeu_wJWLCnY2A/%2Fproject-4%2FProject%204%20-%20Review.ipynb/#00006-d2bfc205-5db4-4d7e-bdc3-122917abf6a6

##  Scrapy

# import scrapy


# class QuotesSpider(scrapy.Spider):
#     name = "quotes"

#     def start_requests(self):
#         url = 'http://quotes.toscrape.com/'
#         tag = getattr(self, 'tag', None)
#         if tag is not None:
#             url = url + 'tag/' + tag
#         yield scrapy.Request(url, self.parse)

#     def parse(self, response):
#         for quote in response.css('div.quote'):
#             yield {
#                 'text': quote.css('span.text::text').get(),
#                 'author': quote.css('small.author::text').get(),
#             }

#         next_page = response.css('li.next a::attr(href)').get()
#         if next_page is not None:
#             yield response.follow(next_page, self.parse)


# # https://docs.scrapy.org/en/latest/intro/tutorial.html

In [17]:
# class SeekSpider(scrapy.Spider):
#     name = "quotes"

#     def start_requests(self):
#         url = 'https://www.seek.com.au/data-scientist-jobs?page=2'
#         tag = getattr(self, 'tag', None)
#         if tag is not None:
#             url = url + 'tag/' + tag
#         yield scrapy.Request(url, self.parse)

#     def parse(self, response):
#         for quote in response.css('div.quote'):
#             yield {
#                 'text': quote.css('span.text::text').get(),
#                 'author': quote.css('small.author::text').get(),
#             }

#         next_page = response.css('li.next a::attr(href)').get()
#         if next_page is not None:
#             yield response.follow(next_page, self.parse)

In [18]:
# import scrapy
# from scrapy.crawler import CrawlerProcess

# '''
# Given by Sri in class on Sat 15/01/2022

# "This is the framework, you don't have to try to write a lot"

# '''

# class JobsSpider(scrapy.Spider):
#     name = "jobseek"
#     allowed_domains = ['seek.com.au']
#     start_urls = ['https://www.seek.com.au/data-scientist-jobs'] # Can make list long if needed
#     def parse(self, response):
#         #this line gives "50" URLS
#         urls = response.xpath('//a[@class="rX6MwqN _wCZyZJ"]/@data-search-sol-meta/article/div/div/@href').extract() # Can choose to change 'response.xpath' to 'response.css'
#         for url in urls: #keep calling parse_details
#             url = response.urljoin(url)
#             yield scrapy.Request(url = url, callback=self.parse_details) #callback calls the next function def parse_details
#         next_page = response.xpath('//a[@class="_24YOjgT"]/@href').extract_first()
#         if next_page is not None:
#             next_page = response.urljoin(next_page)   #joins full url with the next page number
#             yield scrapy.Request(next_page, callback = self.parse)
#     def parse_details(self, response): #response - what you get from the query
#         yield {
#             'Job Title': response.xpath('//div[@class="FYwKg _6Gmbl_4"]/h1/text()').extract()
            
#         }



In [19]:
# Review lesson
# Scrapy Tutorial
#     Create new project
#     It'll create folders
#     You can then make a .py file
#     Then you can run the cmd once in that folder
# Console needs to be within specific folder


In [20]:
# # Help with Sri on Sat 22/01/2022

# import requests
# from bs4 import BeautifulSoup

# url = 'https://www.seek.com.au/data-science-jobs/in-sydney'

# # Help with Sri on Sat 22/01/2022

# response = requests.get(url)

# soup = BeautifulSoup(response.text, 'html.parser')
# print(soup.title)
# # soup.find_all('div', {'class':'h3f08hf _14uh9945e _14uh994j _14uh994k _14uh994l _14uh994m'}).attrs['href']
# ['https://www.seek.com.au' + x.attrs['href']  
#  for x in soup.find_all
#  ('div', {'class':'h3f08hf _14uh9945e _14uh994j _14uh994k _14uh994l _14uh994m'})]

# # Help with Sri on Sat 22/01/2022

In [21]:
# https://www.youtube.com/watch?v=INm8yR4aYjk

In [22]:
# from scrapy.crawler import CrawlerProcess  # run Scrapy in a Jupyter notebook
# import scrapy


# class SeekSpider(scrapy.Spider):
#     name = "job_search"                 #spider name
#     allowed_domains = ["seek.com.au"]   #web page
    
#     #seed with the seek landing site for all data jobs in AUS
#     start_urls = ['https://www.seek.com.au/data-jobs/in-All-Australia?salaryrange=0-70000&salarytype=annual']
    
#     custom_settings = {
#         # FEEDS replaces the deprecated FEED_FORMAT / FEED_URI pair (Scrapy 2.1+)
#         'FEEDS' : {'seek_test09.csv': {'format': 'csv'}}
#         }
    
#     def parse(self, response):
        
#         print("-----------------")
#         print("I just visited :" + response.url)
#         print("-----------------")
        
#         #grab list of all job ad links from search page
#         urls = response.css('h1 > a::attr(href)').extract()
        
#         #loop through list of links and visit each to grab the details
#         for url in urls:
#             # join the partial url with the domain to ensure correct operation
#             url = response.urljoin(url)
#             yield scrapy.Request(url = url, callback =self.parse_details)
            
#         # grab the link for the next page
#         next_page = response.css('a[data-automation="page-next"]::attr(href)').extract_first()
        
#         # if 'next page' link present, go to and rinse and repeat
#         if next_page is not None:
#             # join next page link with Domain to ensure correct operation
#             next_page = response.urljoin(next_page)
#             yield scrapy.Request(next_page, callback = self.parse)
    
#     def parse_details(self, response):
#         yield {
#             # grab and assign the job title
# #             'title': response.xpath('//div[@class="FYwKg _6Gmbl_4"]/h1/text()')[0].extract()
#              'title': response.xpath('//div[@data-automation="searchResults"]/h1/text()')[0].extract() # This might need to get updated
            
#         }
        
# process = CrawlerProcess()
# process.crawl(SeekSpider)
# process.start()

In [23]:
data = pd.read_csv('try/search.csv')
data

Unnamed: 0,title,date,company,location,salary,description
0,Data Scientist - Consulting,,The Onset,Sydney,Salary: $130-150k base,Great opportunity to form and be part of the A...
1,Senior Data Engineer - Data Platform,,SEEK Limited,Melbourne,,The Senior Data Engineer is a vital part of th...
2,Data Scientist,6h ago,Scentre Group,Sydney,,Support our growth and strengthen our strategi...
3,Junior Data Scientist - Property and Retail,4d ago,Talent Insights Group Pty Ltd,Sydney,Salary: $100K-$120K Including Super + Bonus,We are hiring a Junior data scientist for a ma...
4,Data Scientist,1h ago,Suncorp,Sydney,Salary: Great salary and supportive culture,Utilise your Insurance Analytics experience an...
...,...,...,...,...,...,...
545,Biostatistician,6d ago,The University of Sydney,Sydney,"Salary: Base Salary $97,975 -$106,738 + 17% su...",Exciting opportunity for a Biostatistician to ...
546,Assistant Director – Data Access and Assets,10d ago,"Department of Education, Skills and Employment",Brisbane,"Salary: $110,517 - $122,145",Assist a data driven organisation to mature ou...
547,Data Analytics Engineer,24d ago,Network 10,Sydney,,"Permanent Role, WFH/Remote Flexibility Availab..."
548,"Data, Analytics & Ai Manager",15d ago,Coles,Melbourne,,"Join Coles as an experienced Data, Analytics &..."


In [24]:
data2 = pd.read_csv('try/search02.csv')
data2

Unnamed: 0,title,date,company,location,salary,description
0,Senior Data Engineer - Data Platform,,SEEK Limited,Melbourne,,The Senior Data Engineer is a vital part of th...
1,Instructional Designer,,Hays Technology,Melbourne,,Incredible opportunity to work within the Vict...
2,Data Analyst,5h ago,AIA Australia Limited,Melbourne,,An exciting opportunity to work alongside the ...
3,Data Analyst,6h ago,Reserve Bank of Australia,Sydney,Salary: Access to a wide-range of staff benefits,Work on a highly visible program delivering a ...
4,Data Analyst,3d ago,CoAct,Brisbane,,The Data Analyst position will play a key role...
...,...,...,...,...,...,...
545,Reporting Analyst,10d ago,Chandler Macleod Group,Melbourne,,Seeking an experienced Reporting Analyst for a...
546,eCommerce Analyst,10d ago,Aldi Stores,Sydney,"Salary: $124,900 - $142,500 (including super)",eCommerce Analyst. Data & performance analysis...
547,Technical Business Analyst - Settlements,2h ago,Macquarie Group Limited,Sydney,Salary: Salary commensurate with experience,Join a collaborative and high performing Techn...
548,Data Warehouse Solution Designer,5h ago,SEEK Limited,Melbourne,,The Data Warehouse Solution Designer is a vita...


In [103]:
data2.columns

Index(['title', 'date', 'company', 'location', 'salary', 'description'], dtype='object')

In [25]:
# data3 = data
# data3.append(data2)
# data3.join(data2, axis = 1)
# data3

In [26]:
display(
    data.shape,
    data2.shape
    )

(550, 6)

(550, 6)

In [27]:
data03 = pd.merge(data,data2,
        how = 'outer'
        )
data03.to_csv("data03.csv")
data03

Unnamed: 0,title,date,company,location,salary,description
0,Data Scientist - Consulting,,The Onset,Sydney,Salary: $130-150k base,Great opportunity to form and be part of the A...
1,Senior Data Engineer - Data Platform,,SEEK Limited,Melbourne,,The Senior Data Engineer is a vital part of th...
2,Senior Data Engineer - Data Platform,,SEEK Limited,Melbourne,,The Senior Data Engineer is a vital part of th...
3,Data Scientist,6h ago,Scentre Group,Sydney,,Support our growth and strengthen our strategi...
4,Junior Data Scientist - Property and Retail,4d ago,Talent Insights Group Pty Ltd,Sydney,Salary: $100K-$120K Including Super + Bonus,We are hiring a Junior data scientist for a ma...
...,...,...,...,...,...,...
1043,Reporting Analyst,10d ago,Chandler Macleod Group,Melbourne,,Seeking an experienced Reporting Analyst for a...
1044,eCommerce Analyst,10d ago,Aldi Stores,Sydney,"Salary: $124,900 - $142,500 (including super)",eCommerce Analyst. Data & performance analysis...
1045,Technical Business Analyst - Settlements,2h ago,Macquarie Group Limited,Sydney,Salary: Salary commensurate with experience,Join a collaborative and high performing Techn...
1046,Data Warehouse Solution Designer,5h ago,SEEK Limited,Melbourne,,The Data Warehouse Solution Designer is a vita...
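
An outer `pd.merge` with no `on` argument joins on every shared column, and rows that look identical can still survive, as rows 1 and 2 in the preview above suggest. One alternative is to stack the frames and drop repeats. The sketch below uses toy frames and assumes a posting is identified by title, company and location, which the scrape does not guarantee.

```python
import pandas as pd

# Toy stand-ins for the two scraped CSVs; one posting appears in both.
first = pd.DataFrame({
    "title": ["Data Scientist", "Senior Data Engineer"],
    "company": ["Scentre Group", "SEEK Limited"],
    "location": ["Sydney", "Melbourne"],
})
second = pd.DataFrame({
    "title": ["Senior Data Engineer", "Data Analyst"],
    "company": ["SEEK Limited", "CoAct"],
    "location": ["Melbourne", "Brisbane"],
})

# Stack the frames, then keep the first copy of each posting.
combined = (
    pd.concat([first, second], ignore_index=True)
    .drop_duplicates(subset=["title", "company", "location"])
    .reset_index(drop=True)
)

print(len(combined))
print(combined["title"].tolist())
```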


In [28]:
# pd.merge(data,data2,
#         how = 'inner'
#         )
# Don't need 'inner'

In [29]:
data03.description[0]

'Great opportunity to form and be part of the APAC Data Science vision for this Global Leader.'

In [30]:
data03.dtypes

title          object
date           object
company        object
location       object
salary         object
description    object
dtype: object

In [31]:
len(data03['salary'])

1048

In [32]:
data03['salary'].isnull().sum()

611

In [33]:
len(data03['salary']) - data03['salary'].isnull().sum()
# Amount of non-null values in Salary

437

In [34]:
data03['salary'].unique()
# Need to use Regex
# https://regex101.com/
# dsi-unit-3.33-regex

array(['Salary: $130-150k base', nan,
       'Salary: $100K-$120K Including Super + Bonus',
       'Salary: Great salary and supportive culture',
       'Salary: $neg + 13% Super and STI', 'Salary: $140k - $160k p.a.',
       'Salary: hourly pay- with competitive rates',
       'Salary: $150000.00 - $160000.00 p.a. incl super, 15% bonus',
       'Salary: $70-$80 per hour inc Super',
       'Salary: Generous Salary plus benefits',
       'Salary: up to $1100 p/d (incl super)',
       'Salary: Base + Super + Profit Share', 'Salary: Competitive',
       'Salary: salary packaging benefits available',
       'Salary: Open to Quote (Sydney or Canberra Location)',
       'Salary: Competitive Package',
       'Salary: A leader in sustainability and renewable resources',
       'Salary: $75,200 to $84,468 plus 15.4% super',
       'Salary: Negotiable depending on experience',
       'Salary: $125000 - $160000 per annum', 'Salary: 600-800 p.d.',
       'Salary: Competitive salary, flexible worki

In [35]:
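data03['salary02'] = data03['salary']
# Note: Series.replace swaps whole cell values only, and its result is not assigned,
# so the "Salary:" prefix is still present in the output below.
data03['salary02'].replace("Salary:", "")
data03['salary02']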

0                              Salary: $130-150k base
1                                                 NaN
2                                                 NaN
3                                                 NaN
4         Salary: $100K-$120K Including Super + Bonus
                            ...                      
1043                                              NaN
1044    Salary: $124,900 - $142,500 (including super)
1045      Salary: Salary commensurate with experience
1046                                              NaN
1047                                              NaN
Name: salary02, Length: 1048, dtype: object
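
A minimal sketch of the regex step flagged in the earlier "Need to use Regex" note: strip the prefix with `.str.replace`, then pull a low/high pair out of the free text. The `parse_range` helper and its rules are assumptions: skip percentages, skip hourly and daily rates, and treat a trailing `k` as thousands. It will not cover every format in the column.

```python
import re

import pandas as pd

# Illustrative values copied from the salary column's unique() output.
salary = pd.Series([
    "Salary: $130-150k base",
    None,
    "Salary: $100K-$120K Including Super + Bonus",
    "Salary: $75,200 to $84,468 plus 15.4% super",
    "Salary: $70-$80 per hour inc Super",
    "Salary: Competitive",
])

# .str.replace edits substrings; regex=False treats the prefix literally.
cleaned = salary.str.replace("Salary: ", "", regex=False)

NUMBER = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(k|%)?", re.IGNORECASE)
RATE = re.compile(r"hour|p\.d\.|p/d|per day|daily", re.IGNORECASE)


def parse_range(text):
    """Return (low, high) in dollars, or None when no annual figure is found."""
    if not isinstance(text, str) or RATE.search(text):
        return None
    # Drop percentages such as "15.4% super".
    found = [(d, s.lower()) for d, s in NUMBER.findall(text) if s != "%"]
    if not found:
        return None
    # "$130-150k" means 130k-150k, so scale small numbers when any "k" appears.
    in_thousands = any(suffix == "k" for _, suffix in found)
    values = []
    for digits, _ in found[:2]:
        value = float(digits.replace(",", ""))
        if in_thousands and value < 1000:
            value *= 1000
        values.append(value)
    return min(values), max(values)


parsed = cleaned.map(parse_range)
print(parsed.tolist())
```

The midpoint of each pair is one possible regression target. Hourly and daily rates would need their own conversion before they can be compared with annual figures.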

In [36]:
data03.columns

Index(['title', 'date', 'company', 'location', 'salary', 'description',
       'salary02'],
      dtype='object')

In [37]:
data03[['title']]

Unnamed: 0,title
0,Data Scientist - Consulting
1,Senior Data Engineer - Data Platform
2,Senior Data Engineer - Data Platform
3,Data Scientist
4,Junior Data Scientist - Property and Retail
...,...
1043,Reporting Analyst
1044,eCommerce Analyst
1045,Technical Business Analyst - Settlements
1046,Data Warehouse Solution Designer


In [38]:
data03['date']

0           NaN
1           NaN
2           NaN
3        6h ago
4        4d ago
         ...   
1043    10d ago
1044    10d ago
1045     2h ago
1046     5h ago
1047    11d ago
Name: date, Length: 1048, dtype: object

In [39]:
data04A = pd.read_csv('data-scientist-jobs.csv')
data04A 


Unnamed: 0.1,Unnamed: 0,Job Title,Company,multiple_details,Description
0,0,Data Scientist,Fundo Loans,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",We are scaling up! An exciting opportunity exi...
1,1,Data Scientist,Paxus,"['ACT', 'Information & Communication Technolog...",Skills RequiredContribute to improving reporti...
2,2,Data Scientist - Large Retail Group,Bluefin Resources Pty Limited,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",Data Scientist - Large Retail GroupThe Compan...
3,3,Data Scientist - Consulting,The Onset,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",Tier one consulting has not slown down through...
4,4,Data Scientist,Effective People,"['ACT', 'Information & Communication Technolog...","Data Scientist Role Baseline clearance, or ab..."
...,...,...,...,...,...
520,520,Advanced Analytics Engineer,Woolworths Group,"['Sydney', 'North West & Hills District', 'Inf...",Advanced Analytics Engineer Permanent role Ne...
521,521,Data Engineer,Diversity Talent Pty Ltd,"['Melbourne', 'CBD & Inner Suburbs', 'Informat...",Data Engineer Newly created role within the or...
522,522,Data Engineer - Private Health Insurance,GMHBA,"['Melbourne', 'Insurance & Superannuation', 'O...",GMHBA have an opening for a talented and passi...
523,523,Senior Data Engineer - Cloud/Tableau/Python,Alloc8,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...","Alloc8 - The Data Talent Specialists, present ..."
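
Because `multiple_details` was round-tripped through CSV, each cell comes back as the text of a Python list rather than as a list. A minimal sketch of turning one cell back into a list with `ast.literal_eval` follows; the cell value is made up, since the real cells are truncated in the preview above.

```python
import ast

# Made-up cell value in the same shape as the CSV column.
raw = "['Sydney', 'CBD, Inner West & Eastern Suburbs', 'Information & Communication Technology']"

# literal_eval parses Python literals only, so it is safer than eval().
details = ast.literal_eval(raw)

print(type(details).__name__, len(details))
print(details[0])
```

Applied to the whole column, this would look like `data04A['multiple_details'].apply(ast.literal_eval)`, assuming every cell is a well-formed list literal.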


In [40]:
data04A ['Description']

0      We are scaling up! An exciting opportunity exi...
1      Skills RequiredContribute to improving reporti...
2       Data Scientist - Large Retail GroupThe Compan...
3      Tier one consulting has not slown down through...
4       Data Scientist Role Baseline clearance, or ab...
                             ...                        
520    Advanced Analytics Engineer  Permanent role Ne...
521    Data Engineer Newly created role within the or...
522    GMHBA have an opening for a talented and passi...
523    Alloc8 - The Data Talent Specialists, present ...
524    The Country Fire Authority (CFA) is one of the...
Name: Description, Length: 525, dtype: object

In [41]:
data04A ['Description'][0]
import json 

In [42]:
# https://stackoverflow.com/questions/64916148/how-to-split-a-json-string-column-in-pandas-spark-dataframe


data_DA_json = pd.read_json('data-DA.json')
data_bi_json = pd.read_json('data-bi.json')
data_from_json = pd.concat([data_DA_json, data_bi_json])
# data_from_json.to_csv('kk_proj4_json.csv') # Shared with Geoff


In [43]:
data_from_json['multiple_details']

0      [Sydney, CBD, Inner West & Eastern Suburbs, In...
1      [Melbourne, CBD & Inner Suburbs, Banking & Fin...
2      [Sydney, CBD, Inner West & Eastern Suburbs, In...
3      [Sydney, CBD, Inner West & Eastern Suburbs, In...
4      [Sydney, CBD, Inner West & Eastern Suburbs, Co...
                             ...                        
534    [ACT, Information & Communication Technology, ...
535    [Sydney, CBD, Inner West & Eastern Suburbs, In...
536    [Melbourne, CBD & Inner Suburbs, Information &...
537    [Brisbane, CBD & Inner Suburbs, Information & ...
538    [Brisbane, CBD & Inner Suburbs, Sales, Analysi...
Name: multiple_details, Length: 1079, dtype: object

In [44]:
data_from_json.columns

Index(['Job Title', 'Company', 'multiple_details', 'Description'], dtype='object')

In [45]:
str(data_from_json['multiple_details'][4])

'4    [Sydney, CBD, Inner West & Eastern Suburbs, Co...\n4    [Melbourne, CBD & Inner Suburbs, Information &...\nName: multiple_details, dtype: object'

In [46]:
print(data_from_json['multiple_details'][4])

4    [Sydney, CBD, Inner West & Eastern Suburbs, Co...
4    [Melbourne, CBD & Inner Suburbs, Information &...
Name: multiple_details, dtype: object
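
Label `4` returns two rows here because `pd.concat` keeps each frame's original index, so every label from 0 upward appears once per source file. A minimal sketch with toy frames shows the effect of `ignore_index=True`:

```python
import pandas as pd

# Toy stand-ins for the two JSON files.
a = pd.DataFrame({"Job Title": ["Data Analyst", "BI Developer"]})
b = pd.DataFrame({"Job Title": ["BI Analyst", "Reporting Analyst"]})

# Default concat keeps both original indexes: 0, 1, 0, 1.
stacked = pd.concat([a, b])

# ignore_index=True renumbers the rows 0..n-1, so each label is unique.
reindexed = pd.concat([a, b], ignore_index=True)

print(stacked.index.tolist())
print(reindexed.index.tolist())
print(reindexed.loc[1, "Job Title"])
```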


In [47]:
pd.DataFrame.from_dict(data_from_json['multiple_details'])

Unnamed: 0,multiple_details
0,"[Sydney, CBD, Inner West & Eastern Suburbs, In..."
1,"[Melbourne, CBD & Inner Suburbs, Banking & Fin..."
2,"[Sydney, CBD, Inner West & Eastern Suburbs, In..."
3,"[Sydney, CBD, Inner West & Eastern Suburbs, In..."
4,"[Sydney, CBD, Inner West & Eastern Suburbs, Co..."
...,...
534,"[ACT, Information & Communication Technology, ..."
535,"[Sydney, CBD, Inner West & Eastern Suburbs, In..."
536,"[Melbourne, CBD & Inner Suburbs, Information &..."
537,"[Brisbane, CBD & Inner Suburbs, Information & ..."


In [48]:
# https://www.kaggle.com/jboysen/quick-tutorial-flatten-nested-json-in-pandas

# pd.json_normalize is the public entry point (pandas 1.0+), so the
# deprecated `from pandas.io.json import json_normalize` is not needed.
pd.json_normalize(data_from_json['multiple_details'])

Unnamed: 0,0,1,2,3,4,5
0,{},{},{},{},{},{}
1,{},{},{},{},{},
2,{},{},{},{},{},
3,{},{},{},{},{},
4,{},{},{},{},{},
...,...,...,...,...,...,...
1074,{},{},{},{},,
1075,{},{},{},{},{},{}
1076,{},{},{},{},{},{}
1077,{},{},{},{},{},
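
`json_normalize` expects dictionaries, so a column of plain lists comes back as empty `{}` cells. A minimal sketch of spreading a list column into one column per position is shown below; the values are made up and the `detail_` prefix is a hypothetical naming choice. Positions are not guaranteed to mean the same thing in every row, as the ACT rows above have no suburb entry.

```python
import pandas as pd

# Made-up values in the same shape as the multiple_details column.
details = pd.Series([
    ["Sydney", "CBD, Inner West & Eastern Suburbs", "Information & Communication Technology"],
    ["ACT", "Information & Communication Technology"],
])

# One column per list position; shorter lists are padded with missing values.
expanded = pd.DataFrame(details.tolist(), index=details.index)
expanded = expanded.add_prefix("detail_")

print(expanded.columns.tolist())
print(expanded.loc[0, "detail_0"])
```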


In [49]:
# import spark # ModuleNotFoundError: No module named 'spark'

# df = spark.createDataFrame(data_from_json['multiple_details'])
# json_schema = spark.read.json(df.rdd.map(lambda rec: rec.json_result)).schema
# df = df.withColumn('json', F.from_json(F.col('json_result'), json_schema)) \
#     .select("id", "name", "json.0._source.*")
# df.show()

In [50]:
data_cleaned = data04A

In [51]:
# data_cleaned.columns

# Index(['Unnamed: 0', 'Job Title', 'Company', 'multiple_details',
#        'Description'],
#       dtype='object')

In [52]:
# NLP Using a count vectorizer.  
from sklearn.feature_extraction.text import CountVectorizer

cvec01 = CountVectorizer()

In [53]:
cvec01.fit(data_cleaned['Description']) # dsi-unit-4.19-nlp-intro_to_nlp-lab

CountVectorizer()

In [54]:
display(
    len(cvec01.get_feature_names()), # Get individual words with .get_feature_names() - learnt from dsi-unit-4.19-nlp-intro_to_nlp-lab
    cvec01.get_feature_names()
    )



10481

['00',
 '000',
 '000my',
 '0012122473',
 '0012130425',
 '0018',
 '0032',
 '004',
 '008',
 '00am',
 '00pm',
 '01',
 '0114',
 '0195',
 '02',
 '020',
 '02014',
 '02026',
 '0233',
 '026',
 '0262010100',
 '03',
 '0302',
 '0386139999',
 '04',
 '0402',
 '0405',
 '0406384403',
 '0407',
 '0409321995',
 '0410',
 '0411382000',
 '0412',
 '0414',
 '0419',
 '0420',
 '0422015623',
 '0426',
 '0429',
 '0433',
 '0434',
 '0452089967',
 '0456',
 '04620',
 '0469',
 '0478',
 '0478697744',
 '0481',
 '0487848404',
 '0488',
 '0491011875',
 '050',
 '0614',
 '062',
 '06810',
 '07',
 '07151',
 '07197',
 '07288',
 '08',
 '0861',
 '09',
 '094',
 '097',
 '0971',
 '0experience',
 '0previous',
 '10',
 '100',
 '1000',
 '1006',
 '100k',
 '1013',
 '102',
 '103',
 '103k',
 '105',
 '1060',
 '107',
 '107773',
 '108475',
 '11',
 '110',
 '110k',
 '111',
 '111k',
 '112k',
 '113',
 '114',
 '115',
 '1150',
 '116',
 '118',
 '11th',
 '12',
 '120',
 '1200',
 '120k',
 '121',
 '122',
 '125',
 '126',
 '129390you',
 '12month',
 '13',
 

In [55]:
cvec02_stop_words_english = CountVectorizer(stop_words='english')
cvec02_stop_words_english

CountVectorizer(stop_words='english')

In [56]:
cvec02_stop_words_english.fit(data_cleaned['Description'])

CountVectorizer(stop_words='english')

In [57]:
display(
    len(cvec02_stop_words_english.get_feature_names()), # Get individual words with .get_feature_names() - learnt from dsi-unit-4.19-nlp-intro_to_nlp-lab
    cvec02_stop_words_english.get_feature_names()
    )

10227

['00',
 '000',
 '000my',
 '0012122473',
 '0012130425',
 '0018',
 '0032',
 '004',
 '008',
 '00am',
 '00pm',
 '01',
 '0114',
 '0195',
 '02',
 '020',
 '02014',
 '02026',
 '0233',
 '026',
 '0262010100',
 '03',
 '0302',
 '0386139999',
 '04',
 '0402',
 '0405',
 '0406384403',
 '0407',
 '0409321995',
 '0410',
 '0411382000',
 '0412',
 '0414',
 '0419',
 '0420',
 '0422015623',
 '0426',
 '0429',
 '0433',
 '0434',
 '0452089967',
 '0456',
 '04620',
 '0469',
 '0478',
 '0478697744',
 '0481',
 '0487848404',
 '0488',
 '0491011875',
 '050',
 '0614',
 '062',
 '06810',
 '07',
 '07151',
 '07197',
 '07288',
 '08',
 '0861',
 '09',
 '094',
 '097',
 '0971',
 '0experience',
 '0previous',
 '10',
 '100',
 '1000',
 '1006',
 '100k',
 '1013',
 '102',
 '103',
 '103k',
 '105',
 '1060',
 '107',
 '107773',
 '108475',
 '11',
 '110',
 '110k',
 '111',
 '111k',
 '112k',
 '113',
 '114',
 '115',
 '1150',
 '116',
 '118',
 '11th',
 '12',
 '120',
 '1200',
 '120k',
 '121',
 '122',
 '125',
 '126',
 '129390you',
 '12month',
 '13',
 

In [58]:
cvec02_stop_words_english.transform(data_cleaned['Description']) # dsi-unit-4.19-nlp-intro_to_nlp-lab

<525x10227 sparse matrix of type '<class 'numpy.int64'>'
	with 109343 stored elements in Compressed Sparse Row format>

In [59]:
cvec02_stop_words_english.transform(data_cleaned['Description']).todense() # dsi-unit-4.19-nlp-intro_to_nlp-lab

matrix([[0, 1, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 1, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [60]:
pd.DataFrame(cvec02_stop_words_english.transform(data_cleaned['Description'])) # dsi-unit-4.19-nlp-intro_to_nlp-lab

Unnamed: 0,0
0,"(0, 1)\t1\n (0, 176)\t1\n (0, 183)\t1\n (..."
1,"(0, 14)\t1\n (0, 167)\t1\n (0, 364)\t1\n ..."
2,"(0, 14)\t1\n (0, 233)\t1\n (0, 464)\t1\n ..."
3,"(0, 48)\t1\n (0, 67)\t1\n (0, 501)\t1\n (..."
4,"(0, 14)\t1\n (0, 61)\t1\n (0, 106)\t1\n (..."
...,...
520,"(0, 1)\t2\n (0, 67)\t1\n (0, 141)\t1\n (0..."
521,"(0, 553)\t1\n (0, 625)\t1\n (0, 743)\t1\n ..."
522,"(0, 1)\t1\n (0, 143)\t1\n (0, 273)\t1\n (..."
523,"(0, 94)\t2\n (0, 743)\t1\n (0, 814)\t2\n ..."


In [61]:
DF_from_cvec02_stop_words_english = pd.DataFrame(data = cvec02_stop_words_english.transform(data_cleaned['Description']).todense(),
             columns = cvec02_stop_words_english.get_feature_names()
    )
DF_from_cvec02_stop_words_english # dsi-unit-4.19-nlp-intro_to_nlp-lab



Unnamed: 0,00,000,000my,0012122473,0012130425,0018,0032,004,008,00am,...,yuen,yvonne,zap,zealand,zero,zhai,zone,zones,zoo,āhuatanga
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
520,0,2,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
521,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
522,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
523,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [62]:
DF_from_cvec02_stop_words_english.sum(axis = 0).sort_values(ascending = False).head(60) # dsi-unit-4.19-nlp-intro_to_nlp-lab

data             5887
experience       1943
work             1479
team             1365
business         1162
working          1136
role             1059
skills            877
development       782
analytics         762
people            730
solutions         696
support           651
including         618
technical         614
science           613
new               599
services          563
cloud             532
learning          532
design            529
strong            518
engineering       513
apply             488
using             484
knowledge         469
azure             442
sql               441
ability           430
environment       429
technology        424
python            422
australia         422
engineer          409
best              399
information       398
machine           388
management        388
provide           384
opportunity       377
platform          368
develop           367
tools             363
analysis          363
opportunities     361
based     

In [63]:
type(DF_from_cvec02_stop_words_english) # dsi-unit-4.19-nlp-intro_to_nlp-lab

pandas.core.frame.DataFrame

In [64]:
cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1 =  CountVectorizer(strip_accents='unicode', stop_words="english", ngram_range=(1,1)) # dsi-unit-4.22-naive-bayes-lab

In [65]:
# .fit_transform
cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1.fit_transform(data_cleaned['Description'])

<525x10223 sparse matrix of type '<class 'numpy.int64'>'
	with 109342 stored elements in Compressed Sparse Row format>

In [66]:
DF_from_cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1 = pd.DataFrame(data = cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1.transform(data_cleaned['Description']).todense(),
             columns = cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1.get_feature_names()
    )

DF_from_cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1



Unnamed: 0,00,000,000my,0012122473,0012130425,0018,0032,004,008,00am,...,yrs,yuen,yvonne,zap,zealand,zero,zhai,zone,zones,zoo
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
520,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0
521,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
522,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
523,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [67]:
# Check the most used words from cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1
DF_from_cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1.sum(axis = 0).sort_values(ascending = False).head(60)

data             5887
experience       1943
work             1479
team             1365
business         1162
working          1136
role             1059
skills            877
development       782
analytics         762
people            730
solutions         696
support           651
including         618
technical         614
science           613
new               599
services          563
learning          532
cloud             532
design            529
strong            518
engineering       513
apply             488
using             484
knowledge         469
azure             442
sql               441
ability           430
environment       429
technology        424
python            422
australia         422
engineer          409
best              399
information       398
machine           388
management        388
provide           384
opportunity       377
platform          368
develop           367
tools             363
analysis          363
opportunities     361
based     

In [68]:
def get_freq_words_4_22(sparse_counts, columns):
    '''
    dsi-unit-4.22-naive-bayes-lab 
    '''
    # sparse_counts is a sparse matrix, so sum() returns a 'matrix' datatype ...
    #   which we then convert into a 1-D ndarray for sorting
    word_counts = np.asarray(sparse_counts.sum(axis=0)).reshape(-1)

    # argsort() returns smallest first, so we reverse the result
    largest_count_indices = word_counts.argsort()[::-1]

    # columns may arrive as a plain list, so convert it before fancy indexing
    columns = np.asarray(columns)

    # pretty-print the results! Remember to always ask whether they make sense ...
    freq_words = pd.Series(word_counts[largest_count_indices], 
                           index=columns[largest_count_indices])

    return freq_words


def hist_counts_4_22(word_counts):
    '''
    dsi-unit-4.22-naive-bayes-lab    
    '''
    hist_counts = pd.Series(minmax_scale(word_counts), 
                            index=word_counts.index)
    
    # Overall graph is hard to understand, so let's break it into three graphs
    fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12,6))
    
    hist_counts.plot(kind="hist", bins=50, ax=axes[0], title="Histogram - All")
    
    # Many tokens have a scaled count below 0.1 -- drop them to see the frequent tokens
    hist_counts[hist_counts > .1].plot(kind="hist", bins=50, ax=axes[1], title="Histogram - Counts > .1")
    
    # Look at the rare tokens whose scaled count is below 0.01
    hist_counts[hist_counts < .01].plot(kind="hist", ax=axes[2], title="Histogram - Counts < .01")
    


In [69]:
cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1.get_feature_names()



['00',
 '000',
 '000my',
 '0012122473',
 '0012130425',
 '0018',
 '0032',
 '004',
 '008',
 '00am',
 '00pm',
 '01',
 '0114',
 '0195',
 '02',
 '020',
 '02014',
 '02026',
 '0233',
 '026',
 '0262010100',
 '03',
 '0302',
 '0386139999',
 '04',
 '0402',
 '0405',
 '0406384403',
 '0407',
 '0409321995',
 '0410',
 '0411382000',
 '0412',
 '0414',
 '0419',
 '0420',
 '0422015623',
 '0426',
 '0429',
 '0433',
 '0434',
 '0452089967',
 '0456',
 '04620',
 '0469',
 '0478',
 '0478697744',
 '0481',
 '0487848404',
 '0488',
 '0491011875',
 '050',
 '0614',
 '062',
 '06810',
 '07',
 '07151',
 '07197',
 '07288',
 '08',
 '0861',
 '09',
 '094',
 '097',
 '0971',
 '0experience',
 '0previous',
 '10',
 '100',
 '1000',
 '1006',
 '100k',
 '1013',
 '102',
 '103',
 '103k',
 '105',
 '1060',
 '107',
 '107773',
 '108475',
 '11',
 '110',
 '110k',
 '111',
 '111k',
 '112k',
 '113',
 '114',
 '115',
 '1150',
 '116',
 '118',
 '11th',
 '12',
 '120',
 '1200',
 '120k',
 '121',
 '122',
 '125',
 '126',
 '129390you',
 '12month',
 '13',
 

In [70]:
# get_freq_words_4_22(cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1.fit_transform(data_cleaned['Description']),
#                                                                                                  cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1.get_feature_names())
# # NameError: name 'X_all' is not defined

In [71]:
# hist_counts_4_22(DF_from_cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1)

In [72]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """    
    dsi-unit-4.22-naive-bayes-lab
    
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds; None uses
        scikit-learn's default (5-fold in current releases).
        Specific cross-validation objects can be passed, see
        sklearn.model_selection module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
        
    dsi-unit-4.22-naive-bayes-lab    
    """
    plt.figure()
    plt.title(title)
    
    if ylim is not None:
        plt.ylim(*ylim)
    
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

```python
# dsi-unit-4.24-nlp-topic_modeling_lda-lab
```

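The `.vocabulary_` attribute of a fitted vectorizer is a dictionary that maps each term to its column index. The vectorizer's `.get_feature_names()` method (renamed `.get_feature_names_out()` from scikit-learn 1.0 onwards) returns the terms in column order, which is what we use as column names.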

```python
# dsi-unit-4.24-nlp-topic_modeling_lda-lab
```
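To make the shape of that dictionary concrete, here is a minimal, dependency-free sketch. It does not call scikit-learn: the two documents are made up, and the regular expression only approximates `CountVectorizer`'s default tokenisation (lowercase, tokens of two or more word characters). The names `vocabulary` and `id2word` are illustrative.

```python
import re

# Two toy documents standing in for job descriptions (hypothetical data).
docs = ["Data scientist role", "Data analyst role in Sydney"]

# Rough stand-in for the default tokeniser: lowercase, then keep
# runs of two or more word characters; sort so indices are alphabetical.
tokens = sorted({tok for doc in docs for tok in re.findall(r"\b\w\w+\b", doc.lower())})

# term -> column index, the same shape as `.vocabulary_`
vocabulary = {term: idx for idx, term in enumerate(tokens)}

# column index -> term, the inverted form that gensim accepts as `id2word`
id2word = {idx: term for term, idx in vocabulary.items()}

print(vocabulary)
print(id2word[0])
```

Inverting the dictionary is only safe because every term has a unique column index, so no keys collide.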

In [73]:
# Check .vocabulary_  on cvt03
cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1.vocabulary_

# pd.DataFrame(cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1.vocabulary_
#              ValueError: If using all scalar values, you must pass an index

{'scaling': 8104,
 'exciting': 3683,
 'opportunity': 6424,
 'exists': 3709,
 'data': 2604,
 'scientist': 8137,
 'join': 5181,
 'innovative': 4930,
 'successful': 8981,
 'fintech': 3970,
 'team': 9179,
 'fundo': 4179,
 'loans': 5555,
 'based': 1345,
 'surry': 9050,
 'hills': 4545,
 'competitive': 2158,
 'remuneration': 7723,
 'recognised': 7554,
 'experience': 3725,
 'skillsawesome': 8449,
 'career': 1717,
 'progression': 7252,
 'opportunities': 6417,
 'make': 5663,
 'mark': 5733,
 'fulfillment': 4160,
 'rapidly': 7492,
 'growing': 4384,
 'business': 1639,
 'committed': 2099,
 'better': 1426,
 'dayflexible': 2681,
 'work': 10082,
 'environment': 3493,
 'supported': 9036,
 'flexible': 4005,
 'working': 10098,
 'conditions': 2229,
 'dedicated': 2728,
 'achieving': 572,
 'good': 4315,
 'life': 5481,
 'balancenew': 1312,
 'light': 5490,
 'filled': 3945,
 'centrally': 1787,
 'located': 5558,
 'office': 6352,
 'space': 8621,
 'mins': 5964,
 'walk': 9950,
 'central': 1785,
 'station': 8830,
 '

In [74]:

vocab_from_cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1 = {v: k for k, v in cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1.vocabulary_.items()}
vocab_from_cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1

# dsi-unit-4.24-nlp-topic_modeling_lda-lab

{8104: 'scaling',
 3683: 'exciting',
 6424: 'opportunity',
 3709: 'exists',
 2604: 'data',
 8137: 'scientist',
 5181: 'join',
 4930: 'innovative',
 8981: 'successful',
 3970: 'fintech',
 9179: 'team',
 4179: 'fundo',
 5555: 'loans',
 1345: 'based',
 9050: 'surry',
 4545: 'hills',
 2158: 'competitive',
 7723: 'remuneration',
 7554: 'recognised',
 3725: 'experience',
 8449: 'skillsawesome',
 1717: 'career',
 7252: 'progression',
 6417: 'opportunities',
 5663: 'make',
 5733: 'mark',
 4160: 'fulfillment',
 7492: 'rapidly',
 4384: 'growing',
 1639: 'business',
 2099: 'committed',
 1426: 'better',
 2681: 'dayflexible',
 10082: 'work',
 3493: 'environment',
 9036: 'supported',
 4005: 'flexible',
 10098: 'working',
 2229: 'conditions',
 2728: 'dedicated',
 572: 'achieving',
 4315: 'good',
 5481: 'life',
 1312: 'balancenew',
 5490: 'light',
 3945: 'filled',
 1787: 'centrally',
 5558: 'located',
 6352: 'office',
 8621: 'space',
 5964: 'mins',
 9950: 'walk',
 1785: 'central',
 8830: 'station',
 9

In [75]:
from collections import defaultdict # dsi-unit-4.24-nlp-topic_modeling_lda-lab

frequency4_24 = defaultdict(int)

for text in data_cleaned['Description']:
    for token in text.split():
        frequency4_24[token] += 1

frequency4_24


defaultdict(int,
            {'We': 914,
             'are': 1437,
             'scaling': 17,
             'up!': 2,
             'An': 54,
             'exciting': 123,
             'opportunity': 296,
             'exists': 27,
             'for': 2766,
             'a': 4522,
             'Data': 1665,
             'Scientist': 98,
             'to': 7957,
             'join': 237,
             'the': 6725,
             'innovative': 119,
             'and': 14424,
             'successful': 203,
             'FinTech': 4,
             'team': 993,
             'at': 692,
             'Fundo': 2,
             'Loans,': 2,
             'based': 216,
             'in': 4178,
             'Surry': 7,
             'Hills.': 2,
             'Why': 48,
             'Join': 59,
             'Us?': 5,
             'Competitive': 17,
             'remuneration': 18,
             '-': 443,
             'be': 1653,
             'recognised': 41,
             'your': 1053,
             'experi

In [76]:
type(frequency4_24)

collections.defaultdict

In [77]:
# pd.DataFrame(frequency4_24)
# ValueError: If using all scalar values, you must pass an index

# https://stackoverflow.com/questions/54122942/how-to-convert-a-defaultdictlist-to-pandas-dataframe

DF_from_frequency4_24 = pd.DataFrame(list(frequency4_24.items()),columns = ['get_feature_names','frequency']).sort_values(by='frequency', ascending = False)
DF_from_frequency4_24

Unnamed: 0,get_feature_names,frequency
16,and,14424
12,to,7957
14,the,6725
133,of,5891
9,a,4522
...,...,...
11820,"FastAPI,",1
11821,"spaCy,",1
11822,Beautiful,1
11823,"Soup,",1
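The whitespace split used for `frequency4_24` keeps case and punctuation, which is why tokens such as `Loans,` and `Soup,` show up separately from their bare forms and `Data` is counted apart from `data`. A possible alternative, sketched here with `collections.Counter` on made-up text (the `descriptions` list is hypothetical, not the scraped data), normalises tokens before counting:

```python
import re
from collections import Counter

# Hypothetical snippets standing in for data_cleaned['Description'].
descriptions = [
    "We are scaling up! Data Scientist at Fundo Loans, based in Surry Hills.",
    "Join the data team. Data skills and SQL required.",
]

frequency = Counter()
for text in descriptions:
    # Lowercase and keep alphanumeric runs, so 'Data' and 'data' merge
    # and trailing punctuation ('Loans,') is dropped.
    frequency.update(re.findall(r"[a-z0-9]+", text.lower()))

# Most frequent tokens first, as (token, count) pairs.
print(frequency.most_common(3))
```

`Counter.most_common()` returns pairs that can be passed straight to `pd.DataFrame(..., columns=[...])` in the same way as `list(frequency4_24.items())`.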


### 8. Set up the LDA model

We can create the gensim LDA model object like so:

```python
from gensim import models, matutils

# X is the sparse document-term matrix from the vectorizer;
# vocab maps each column index to its term.
lda = models.LdaModel(
    # supply our sparse predictor matrix wrapped in a matutils.Sparse2Corpus object
    matutils.Sparse2Corpus(X, documents_columns=False),
    # or alternatively use the corpus object created with the dictionary in the previous frame!
    # corpus,
    # The number of topics we want:
    num_topics  =  3,
    # how many passes over the corpus:
    passes      =  20,
    # The id2word vocabulary we made ourselves
    id2word     =  vocab
    # or use the gensim dictionary object!
    # id2word     =  dictionary
)
```
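gensim treats a corpus as one bag of words per document, each a list of `(term_id, count)` pairs, and `documents_columns=False` tells `Sparse2Corpus` that each row of the matrix is a document. The sketch below builds that representation by hand in plain Python to show what the model consumes. The count matrix and the `vocab` lookup are invented for illustration, and gensim itself is not called.

```python
# Hypothetical document-term counts: 2 documents (rows) x 4 terms (columns).
counts = [
    [2, 0, 1, 0],
    [0, 1, 0, 3],
]

# Hypothetical id -> term lookup, playing the role of `vocab` above.
vocab = {0: "data", 1: "cloud", 2: "python", 3: "sql"}

# One list of (term_id, count) pairs per document, skipping zero counts.
corpus = [
    [(term_id, n) for term_id, n in enumerate(row) if n > 0]
    for row in counts
]

# Translate ids back to terms to check that the lookup lines up with the columns.
for doc in corpus:
    print([(vocab[term_id], n) for term_id, n in doc])
```

If the ids in the corpus and the keys of `id2word` do not refer to the same columns, the topics will be printed with the wrong words, so it is worth checking a document or two like this before training.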

In [78]:
# lda = cvt03_strip_accents_unicode_stop_words_english_ngram_range_1_1.LdaModel(
#     matutils.Sparse2Corpus(X, documents_columns=False),
#     num_topics  =  3,
#     passes      =  20,
#     id2word     =  vocab
# )

# AttributeError: 'CountVectorizer' object has no attribute 'LdaModel'

In [79]:
import re

# https://regexone.com/

In [80]:
regex_test = pd.read_csv('regex_test.csv')
regex_test

Unnamed: 0,salary
0,Salary: $130-150k base
1,Salary: $100K-$120K Including Super + Bonus
2,Salary: Great salary and supportive culture
3,Salary: $neg + 13% Super and STI
4,Salary: $140k - $160k p.a.
...,...
304,Salary: Competitive remuneration package and b...
305,"Salary: TSSA General Stream Band 4 ($87,543 - ..."
306,"Salary: $45,000 - $59,999"
307,"Salary: $124,900 - $142,500 (including super)"


In [81]:
regex_test['salary'][0]

'Salary: $130-150k base'

In [82]:
re.findall(r'\d+',regex_test['salary'][0])

['130', '150']

In [83]:
re.findall(r'(\d[0-9]+(. [0-9]+)?)',regex_test['salary'][0])

[('130', ''), ('150', '')]
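The tuples in that output come from the capturing groups: when a pattern contains groups, `re.findall` returns one tuple per match with one item per group, and the empty string marks the optional inner group that did not participate. A small sketch of the difference, using the first salary string from `regex_test` and a non-capturing variant that is my own suggestion rather than part of the original work:

```python
import re

salary = "Salary: $130-150k base"

# Two capturing groups -> one (outer, inner) tuple per match.
with_groups = re.findall(r"(\d[0-9]+(. [0-9]+)?)", salary)
print(with_groups)   # [('130', ''), ('150', '')]

# Non-capturing group (?:...) -> the whole match comes back as a plain string.
without_groups = re.findall(r"\d+(?:\.\d+)?", salary)
print(without_groups)   # ['130', '150']
```

Plain strings are easier to pass on to `float()` than tuples, which is one reason to prefer `(?:...)` when the group is only there for optional matching.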

In [84]:
# re.findall('\d[0-9]',regex_test['salary'][0])
re.findall(r'\d+',regex_test['salary'][0])

['130', '150']

In [85]:
# len(regex_test['salary'])
# [x for x in range(len(regex_test['salary']))]

# [re.findall('(\d[0-9]+(. [0-9]+)?)',regex_test['salary'][x]) for x in range(len(regex_test['salary']))]

In [86]:
regex_test.shape

(309, 1)

In [87]:
regex_test['salary'].dtype

dtype('O')

In [88]:
# Working with Anita


def avg_salary(string_salary):
    # Pull out every run of digits, commas and full stops.
    # Note: a lone ',' or '.' (e.g. in 'p.a.') also matches and makes float() raise ValueError.
    matches = re.findall(r'[0-9,.]+', string_salary)
    if len(matches) == 1:
        return float(matches[0].replace(',', ''))
    if len(matches) == 2:
        values = [m.replace(',', '') for m in matches]
        return (float(values[0]) + float(values[1])) / 2
    # Sentinel for strings with no match or more than two matches
    return float(900000)
    

In [89]:
# Working with Anita


def salary_converter(salary):
    '''
    salary_converter
    '''
    try:
        if type(salary) == str:
    #         job_salary = str('90000')
            if len(re.findall(r'[0-9]+',salary)) >= 1:
                if 'year' in salary:
                    job_salary = average_value(salary)
                elif 'month' in salary:
                    job_salary = average_value(salary)*12
                elif 'week' in salary:
                    job_salary = average_value(salary)*52
                elif 'day' in salary:
                    job_salary = average_value(salary)*52*5
                elif 'hour' in salary:
                    job_salary = average_value(salary)*52*38
                else:
                    job_salary = average_value(salary)
            else:
                job_salary = float('90000')
        else:
            job_salary = float(salary)
        if job_salary < 300:
            job_salary = job_salary * 1000
        if job_salary < 10000:
            job_salary = 90000
        return job_salary
    except:
        return float(90000)



In [90]:
# Working with Anita
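# Every row comes out as 90k: salary_converter calls average_value, which is never defined
# (the helper above is named avg_salary), so the NameError is swallowed by the bare except
# and the 90000 fallback is returned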

regex_test['salary_conv'] = regex_test.salary.map(salary_converter)

display(
    regex_test['salary_conv'].value_counts(),
    len(regex_test['salary_conv'])
    )

90000.0    309
Name: salary_conv, dtype: int64

309

In [91]:
# # regex_test['salary02'] = regex_test.salary#map(avg_salary)
# regex_test.salary = regex_test.salary.map(avg_salary)
# regex_test

# # ValueError: could not convert string to float: ''

In [92]:
regex_test['re.findall01'] = [re.findall(r'\d+',regex_test['salary'][x]) for x in range(len(regex_test['salary']))]
# NOTE: the pattern below has a stray ']' (r'[0-9,.]+' was intended), so it looks for a digit,
# a comma, any character and then ']', which is why re.findall02 comes back empty
regex_test['re.findall02'] = [re.findall(r'[0-9],.]+',regex_test['salary'][x]) for x in range(len(regex_test['salary']))]
# regex_test['re.findall03'] = [avg_salary(x) for x in range(len(regex_test['salary']))]
# regex_test['re.findall03'] = [avg_salary(regex_test['salary'][x]) for x in range(len(regex_test['salary']))] # ValueError: could not convert string to float: ''

regex_test.to_csv('regex_test03.csv', index = False)

regex_test

# Might need a .map


Unnamed: 0,salary,salary_conv,re.findall01,re.findall02
0,Salary: $130-150k base,90000.0,"[130, 150]",[]
1,Salary: $100K-$120K Including Super + Bonus,90000.0,"[100, 120]",[]
2,Salary: Great salary and supportive culture,90000.0,[],[]
3,Salary: $neg + 13% Super and STI,90000.0,[13],[]
4,Salary: $140k - $160k p.a.,90000.0,"[140, 160]",[]
...,...,...,...,...
304,Salary: Competitive remuneration package and b...,90000.0,[],[]
305,"Salary: TSSA General Stream Band 4 ($87,543 - ...",90000.0,"[4, 87, 543, 91, 877]",[]
306,"Salary: $45,000 - $59,999",90000.0,"[45, 000, 59, 999]",[]
307,"Salary: $124,900 - $142,500 (including super)",90000.0,"[124, 900, 142, 500]",[]


In [93]:

# Clean out commas
# 
# Year / Month/ Week/ Day / Hour
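One way the plan above (strip commas, then annualise by pay period) could look is sketched below using only the standard library. This is an illustration under stated assumptions, not the cleaning that was finally applied: `parse_salary`, `PERIODS` and `AMOUNT` are hypothetical names, the 38-hour week and 5-day week mirror the multipliers in `salary_converter`, percentages are skipped, and unparseable strings return `None` instead of a fallback figure. Stray numbers such as "Band 4" would still need separate handling.

```python
import re

# Multipliers to annualise a figure (assumed: 38-hour week, 5-day week, 52 weeks).
PERIODS = {"hour": 52 * 38, "day": 52 * 5, "week": 52, "month": 12}

# A number with optional thousands separators or decimals, an optional 'k'
# suffix, and not followed by a percent sign.
AMOUNT = re.compile(r"(\d[\d,]*(?:\.\d+)?)(?![.,]?\d)\s*(k)?\b(?!\s*%)", re.IGNORECASE)


def parse_salary(text):
    """Return an annualised salary estimate, or None when no amount is found."""
    matches = AMOUNT.findall(text)
    if not matches:
        return None
    any_k = any(suffix for _, suffix in matches)
    values = []
    for number, suffix in matches[:2]:  # treat the first two amounts as a range
        value = float(number.replace(",", ""))
        # In '$130-150k' the 'k' is taken to apply to both ends of the range.
        if suffix or (any_k and value < 1000):
            value *= 1000
        values.append(value)
    midpoint = sum(values) / len(values)
    lowered = text.lower()
    for period, multiplier in PERIODS.items():
        if period in lowered:
            return midpoint * multiplier
    return midpoint


for example in ["Salary: $130-150k base",
                "Salary: $45,000 - $59,999",
                "Salary: $neg + 13% Super and STI",
                "$50 per hour"]:
    print(example, "->", parse_salary(example))
```

Returning `None` keeps missing salaries visible (they become `NaN` in pandas) so they can be dropped or imputed deliberately, instead of being hidden behind a constant that would distort the target distribution.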

In [94]:
# Created fake salary to build the models whilst still cleaning the data

fake_sal_data = pd.read_csv('kk_proj4_json02 - Added fake salaries.csv')

fake_sal_data = fake_sal_data.drop(columns='Unnamed: 0')

display(
    fake_sal_data,
    fake_sal_data.columns,
    fake_sal_data.dtypes
)

Unnamed: 0,Job Title,Company,multiple_details,Description,fake_salary
0,Data Scientist,Fundo Loans,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",We are scaling up! An exciting opportunity exi...,95000
1,Data Analyst,AIA Australia Limited,"['Melbourne', 'CBD & Inner Suburbs', 'Banking ...",The focus of this role is to provide support t...,80000
2,Junior Business Intelligence Analyst,Hays Talent Solutions,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...","Your new companyAt Hays, we are on a journey t...",75000
3,Data Analyst,Atos Australia,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",About Atos:Atos is a global leader in digital ...,75000
4,Data Analyst,Infrastructure Partnerships Australia,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",About Infrastructure Partnerships AustraliaInf...,140000
...,...,...,...,...,...
1074,Data Analyst,Aurec,"['ACT', 'Information & Communication Technolog...",We are looking to engage a skilled and enthusi...,100000
1075,Senior Data Analyst - NSW Government,Talenza,"['Sydney', 'CBD, Inner West & Eastern Suburbs'...",Senior Data Analyst - NSW GovernmentLocation: ...,95000
1076,Principal Engineer (Knowledge Graph),SEEK Limited,"['Melbourne', 'CBD & Inner Suburbs', 'Informat...",Company DescriptionAbout SEEKSEEK’s portfolio ...,130000
1077,Data Management Analyst,Humanised Group,"['Brisbane', 'CBD & Inner Suburbs', 'Informati...",About the role:In this role you will be requir...,95000


Index(['Job Title', 'Company', 'multiple_details', 'Description',
       'fake_salary'],
      dtype='object')

Job Title           object
Company             object
multiple_details    object
Description         object
fake_salary          int64
dtype: object

In [95]:
# Make cvec04 CountVectorizer()
# dsi-unit-4.19-nlp-intro_to_nlp-lab

model_cvec04_fake_sal_data = CountVectorizer()

jobtitle_cvec05_fake_sal_data = CountVectorizer()

In [96]:

# Create function for CountVectorizer()

def cvec_df(series):
    
    '''
    Applies CountVectorizer() onto an input series/column and returns a pandas dataframe    
    '''
    
    from sklearn.feature_extraction.text import CountVectorizer
    cvec = CountVectorizer()
    
    cvec.fit(series)
    
    import pandas as pd
    
    return pd.DataFrame(cvec.transform(series).todense(),
                       columns=cvec.get_feature_names_out())

    

In [97]:

# Test function and check output against existing DataFrame

cvec_df(fake_sal_data['Description'])

# It Works!


Unnamed: 0,00,000,0000,0000applications,000kg,000mw,000talent,001,0018,0032,...,zapier,zealand,zealandinfrastructure,zero,zinfra,zing,zondag,zone,zones,zoo
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1074,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1075,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1076,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1077,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [98]:
# Create table for just vectorised job titles

# cvec_df(fake_sal_data['Job Title'])

list_of_chosen_words_jobtitle_words = ['business', 'intelligence', 'analyst','analysts','analytics','science','scientist']
df_just_jobtitles_from_fake_sal_data = cvec_df(fake_sal_data['Job Title'])[list_of_chosen_words_jobtitle_words]
df_just_jobtitles_from_fake_sal_data.head() 

Unnamed: 0,business,intelligence,analyst,analysts,analytics,science,scientist
0,0,0,0,0,0,0,1
1,0,0,1,0,0,0,0
2,1,1,1,0,0,0,0
3,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0


In [99]:
# df_just_jobtitles_from_fake_sal_data['bool_business_intell_01'] = df_just_jobtitles_from_fake_sal_data['business'] + df_just_jobtitles_from_fake_sal_data['intelligence']
# df_just_jobtitles_from_fake_sal_data['bool_analyst_01'] = df_just_jobtitles_from_fake_sal_data['analyst'] + df_just_jobtitles_from_fake_sal_data['analysts'] + df_just_jobtitles_from_fake_sal_data['analytics']
# df_just_jobtitles_from_fake_sal_data['bool_scientist_01'] = df_just_jobtitles_from_fake_sal_data['science'] + df_just_jobtitles_from_fake_sal_data['scientist']
# df_just_jobtitles_from_fake_sal_data['bool_total_01'] = df_just_jobtitles_from_fake_sal_data['bool_business_intell_01'] + df_just_jobtitles_from_fake_sal_data['bool_analyst_01'] + df_just_jobtitles_from_fake_sal_data['bool_scientist_01']
df_just_jobtitles_from_fake_sal_data.head()


Unnamed: 0,business,intelligence,analyst,analysts,analytics,science,scientist
0,0,0,0,0,0,0,1
1,0,0,1,0,0,0,0
2,1,1,1,0,0,0,0
3,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0


In [100]:
def jobtitle_sort(df):
    if df['business'] + df['intelligence'] == 2:
        return 'business intelligence'
    elif df['analyst'] + df['analysts'] + df['analytics'] >= 1:
        return 'business analyst'
    elif df['science'] + df['scientist'] >= 1:
        return 'data scientist'

In [101]:
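# axis=1 passes each row to jobtitle_sort; the default (axis=0) passes each column,
# which raised KeyError: 'business'
df_just_jobtitles_from_fake_sal_data.apply(jobtitle_sort, axis=1)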

In [None]:
# df_just_jobtitles_from_fake_sal_data['presumed job title'] = map(jobtitle_sort,df_just_jobtitles_from_fake_sal_data)
# df_just_jobtitles_from_fake_sal_data['presumed job title']

# # TypeError: object of type 'map' has no len()

In [None]:
# # .fit()
# # dsi-unit-4.19-nlp-intro_to_nlp-lab

# model_cvec04_fake_sal_data.fit(fake_sal_data['Description'])

# # superseded by def cvec_df(series):

In [None]:
# # dsi-unit-4.19-nlp-intro_to_nlp-lab

# df_from_cvec04_fake_sal_data = pd.DataFrame(model_cvec04_fake_sal_data.transform(fake_sal_data['Description']).todense(),
#                        columns=model_cvec04_fake_sal_data.get_feature_names_out())

# # df_from_cvec04_fake_sal_data['fake_salary'] = fake_sal_data['fake_salary']
# # y = fake_sal_data['fake_salary']

# df_from_cvec04_fake_sal_data

# # superseded by def cvec_df(series):

In [None]:
# # df_from_cvec04_fake_sal_data.columns
# # df_from_cvec04_fake_sal_data.dtypes
# df_from_cvec04_fake_sal_data.shape

# # superseded by def cvec_df(series):

In [None]:
# df_from_cvec04_fake_sal_data

#     NLP
#     Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
#     Ensemble methods and decision tree models
#     SVM models

In [None]:
from sklearn.decomposition import PCA # dsi-unit-4.08-pca-intro-lesson
from sklearn.preprocessing import StandardScaler # dsi-unit-4.08-pca-intro-lesson

In [None]:
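# The cells that built df_from_cvec04_fake_sal_data are commented out above,
# so rebuild it with cvec_df before fitting PCA
df_from_cvec04_fake_sal_data = cvec_df(fake_sal_data['Description'])

model_pca_for_cvec04_fake_sal_data = PCA(n_components=3)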

In [None]:
fit_model_pca_for_cvec04_fake_sal_data = model_pca_for_cvec04_fake_sal_data.fit(df_from_cvec04_fake_sal_data).transform(df_from_cvec04_fake_sal_data)
fit_model_pca_for_cvec04_fake_sal_data

In [None]:
fit_model_pca_for_cvec04_fake_sal_data.shape

In [None]:
model_pca_for_cvec04_fake_sal_data.components_

In [None]:
model_pca_for_cvec04_fake_sal_data.components_.shape

In [None]:
np.cumsum(model_pca_for_cvec04_fake_sal_data.explained_variance_ratio_)

In [None]:
plt.plot(np.cumsum(model_pca_for_cvec04_fake_sal_data.explained_variance_ratio_))

In [None]:
X = df_from_cvec04_fake_sal_data
y = fake_sal_data['fake_salary']

target_names = y.unique()

plt.figure(figsize=(12,8))
pca = PCA(n_components=2)

# Fit once and keep the projected points; X_r is an (n_samples, 2) ndarray
X_r = pca.fit_transform(X)

xvector = pca.components_[0] 
yvector = pca.components_[1]

xs = X_r[:,0] 
ys = X_r[:,1]

# for i in range(1):

for i in range(len(xvector)):
    plt.arrow(0, 0, xvector[i]*max(xs), yvector[i]*max(ys),
              color='r', width=0.0005, head_width=0.0025)
    plt.text(xvector[i]*max(xs)*1.2, yvector[i]*max(ys)*1.2,
             list(X.columns)[i], color='r')

for target_name in target_names:
    # Boolean mask as an ndarray so it can index the projected points
    mask = (y == target_name).to_numpy()
    plt.scatter(X_r[mask, 0], X_r[mask, 1], alpha=.8, lw=1,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of vectorised job descriptions')
plt.show()


![image.png](attachment:image.png)