# Canadian Data Analyst Job Listing Analysis

## Table of Contents <a id='back'></a>
- [Project Introduction](#project-introduction)
    - [Analysis Outline](#analysis-outline)
    - [Results](#results)
- [Importing Libraries and Opening Data Files](#importing-libraries-and-opening-data-files)
- [Pre-Processing Data](#pre-processing-data)
    - [Duplicates](#duplicates)
    - [Missing Values](#missing-values)
    - [Removing Irrelevant Data](#removing-irrelevant-data)
    - [Data Structure Overhaul](#data-structure-overhaul)
        - [Header Style](#header-style)
        - [Formatting and Data Usage](#formatting-and-data-usage)
- [Exploratory Data Analysis](#exploratory-data-analysis)
    - [Job Characteristics & Availability](#)
        - [Total Available Jobs](#)
        - [Data Science Job Categories](#)
        - [Job Salaries by Type](#)
        - [Jobs Offered by Platform](#)
- [Conclusions and Recommendations](#conclusions-and-reccomendations)
- [Dataset Citation](#dataset-citation)

## Project Introduction

2024 has been a difficult year for entry-level data science jobs, so in this project I analyze the data science job market. I use a Kaggle-hosted dataset of Canadian job postings that were scraped from Indeed and Glassdoor with Selenium and BeautifulSoup. The dataset offers several insights into that market, such as in-demand technical skills, expected work experience, and salary ranges.

### Analysis Outline

[Analysis Outline]

### Results

[Results]


[Back to Table of Contents](#back)

## Importing Libraries and Opening Data Files

In [1]:
# Importing the needed libraries for this assignment
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import folium

In [2]:
# Importing file for assignment
try:
    df = pd.read_csv('Raw_Dataset.csv', sep=',')
except FileNotFoundError:
    # Falling back to the alternate dataset location
    df = pd.read_csv('/datasets/Raw_Dataset.csv', sep=',')

[Back to Table of Contents](#back)

## Pre-Processing Data

### Duplicates

In [3]:
# Checking for duplicates
df.duplicated().sum()

0

[Back to Table of Contents](#back)

### Missing Values

In [4]:
# Checking for null values
df.isna().sum()

Job ID                  0
Job Title               0
Company Name            0
Language and Tools    167
Job Salary            557
City                    0
Province              118
Job Link                0
dtype: int64

In [5]:
# Filling in null values
df.fillna({'Language and Tools': 'unknown',
           'Job Salary' : 'unknown',
           'Province' : 'unknown'}, inplace = True)
df.isna().sum()

Job ID                0
Job Title             0
Company Name          0
Language and Tools    0
Job Salary            0
City                  0
Province              0
Job Link              0
dtype: int64

[Back to Table of Contents](#back)

### Removing Irrelevant Data

In [6]:
# Removing columns we do not need for this analysis
df = df.drop(columns=['Job ID'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1796 entries, 0 to 1795
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Job Title           1796 non-null   object
 1   Company Name        1796 non-null   object
 2   Language and Tools  1796 non-null   object
 3   Job Salary          1796 non-null   object
 4   City                1796 non-null   object
 5   Province            1796 non-null   object
 6   Job Link            1796 non-null   object
dtypes: object(7)
memory usage: 98.3+ KB


[Back to Table of Contents](#back)

### Data Structure Overhaul

In [7]:
df.describe()

| | Job Title | Company Name | Language and Tools | Job Salary | City | Province | Job Link |
|---|---|---|---|---|---|---|---|
| count | 1796 | 1796 | 1796 | 1796 | 1796 | 1796 | 1796 |
| unique | 811 | 790 | 1057 | 855 | 172 | 14 | 1761 |
| top | Business Analyst | Scotiabank | unknown | unknown | Toronto | ON | https://www.glassdoor.ca/job-listing/business-... |
| freq | 90 | 24 | 167 | 557 | 426 | 949 | 3 |


#### Header Style

In [8]:
# Getting general information about the dataset
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1796 entries, 0 to 1795
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Job Title           1796 non-null   object
 1   Company Name        1796 non-null   object
 2   Language and Tools  1796 non-null   object
 3   Job Salary          1796 non-null   object
 4   City                1796 non-null   object
 5   Province            1796 non-null   object
 6   Job Link            1796 non-null   object
dtypes: object(7)
memory usage: 98.3+ KB


| | Job Title | Company Name | Language and Tools | Job Salary | City | Province | Job Link |
|---|---|---|---|---|---|---|---|
| 0 | Binance Accelerator Program - Data Analyst (Risk) | Binance | Python, Sql | unknown | Remote | unknown | https://ca.indeed.com/rc/clk?jk=9c7f38160c736c... |
| 1 | Business Analyst | Canadian Nuclear Laboratories | Power Bi, Power BI, Excel | unknown | Remote | unknown | https://ca.indeed.com/rc/clk?jk=0da15fed6a515f... |
| 2 | Geophysicist/Data Analyst | Sander Geophysics Limited | unknown | unknown | Ottawa | ON | https://ca.indeed.com/rc/clk?jk=2dc0470241aa60... |
| 3 | Business Intelligence Data Engineer | Maximus Services, LLC | Fabric, Power BI, Sql, Machine Learning, Genes... | 87,875Ã¢â‚¬â€œ$105,000 a year | Toronto | ON | https://ca.indeed.com/rc/clk?jk=cbbe0e29b236d2... |
| 4 | BUSINESS INTELLIGENCE SPECIALIST, FT | Niagara Health System | Azure, Power BI, SQL, Aws | 55.39Ã¢â‚¬â€œ$62.66 an hour | Niagara | ON | https://ca.indeed.com/rc/clk?jk=fe8ad423818b24... |
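
The salary values in the preview above contain the sequence `Ã¢â‚¬â€œ` where a dash would be expected. That pattern is consistent with an en dash whose UTF-8 bytes were decoded as Windows-1252 twice, although how the dataset was actually produced is an assumption here. The sketch below shows one way such damage could be reversed before any other cleaning; `repair_mojibake` is a hypothetical helper, not part of this notebook.

```python
def repair_mojibake(text, rounds=2):
    """Undo UTF-8 text that was mis-decoded as Windows-1252, up to `rounds` times."""
    for _ in range(rounds):
        try:
            text = text.encode('cp1252').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # the round trip failed, so the text is left as it is
    return text


# The garbled salary from the preview, written with escapes so it survives copy and paste
garbled = '87,875\u00c3\u00a2\u00e2\u201a\u00ac\u00e2\u20ac\u0153$105,000 a year'
print(repair_mojibake(garbled))    # 87,875–$105,000 a year
print(repair_mojibake('unknown'))  # plain ASCII passes through unchanged
```

A repair like this would have to run before the lowercasing step, because changing the case alters the garbled characters and the round trip no longer succeeds.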


In [9]:
# Checking whether the column names already use snake_case
df.columns

Index(['Job Title', 'Company Name', 'Language and Tools', 'Job Salary', 'City',
       'Province', 'Job Link'],
      dtype='object')

In [10]:
# Renaming column names to snake_case format
df = df.rename(columns={'Job Title': 'job_title',
                        'Company Name': 'employer_name',
                        'Language and Tools': 'tech_skills',
                        'Job Salary': 'salary',
                        'City': 'city',
                        'Province': 'province',
                        'Job Link': 'web_platform'})
df.columns

Index(['job_title', 'employer_name', 'tech_skills', 'salary', 'city',
       'province', 'web_platform'],
      dtype='object')

[Back to Table of Contents](#back)

#### Formatting and Data Usage

In [11]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1796 entries, 0 to 1795
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   job_title      1796 non-null   object
 1   employer_name  1796 non-null   object
 2   tech_skills    1796 non-null   object
 3   salary         1796 non-null   object
 4   city           1796 non-null   object
 5   province       1796 non-null   object
 6   web_platform   1796 non-null   object
dtypes: object(7)
memory usage: 98.3+ KB


| | job_title | employer_name | tech_skills | salary | city | province | web_platform |
|---|---|---|---|---|---|---|---|
| 0 | Binance Accelerator Program - Data Analyst (Risk) | Binance | Python, Sql | unknown | Remote | unknown | https://ca.indeed.com/rc/clk?jk=9c7f38160c736c... |
| 1 | Business Analyst | Canadian Nuclear Laboratories | Power Bi, Power BI, Excel | unknown | Remote | unknown | https://ca.indeed.com/rc/clk?jk=0da15fed6a515f... |
| 2 | Geophysicist/Data Analyst | Sander Geophysics Limited | unknown | unknown | Ottawa | ON | https://ca.indeed.com/rc/clk?jk=2dc0470241aa60... |
| 3 | Business Intelligence Data Engineer | Maximus Services, LLC | Fabric, Power BI, Sql, Machine Learning, Genes... | 87,875Ã¢â‚¬â€œ$105,000 a year | Toronto | ON | https://ca.indeed.com/rc/clk?jk=cbbe0e29b236d2... |
| 4 | BUSINESS INTELLIGENCE SPECIALIST, FT | Niagara Health System | Azure, Power BI, SQL, Aws | 55.39Ã¢â‚¬â€œ$62.66 an hour | Niagara | ON | https://ca.indeed.com/rc/clk?jk=fe8ad423818b24... |


In [12]:
# Lowercasing every value and replacing spaces with underscores (snake_case)
for column in df.columns:
    df[column] = df[column].str.lower()
    df[column] = df[column].str.replace(' ', '_')

df.head()

| | job_title | employer_name | tech_skills | salary | city | province | web_platform |
|---|---|---|---|---|---|---|---|
| 0 | binance_accelerator_program_-_data_analyst_(risk) | binance | python,_sql | unknown | remote | unknown | https://ca.indeed.com/rc/clk?jk=9c7f38160c736c... |
| 1 | business_analyst | canadian_nuclear_laboratories | power_bi,_power_bi,_excel | unknown | remote | unknown | https://ca.indeed.com/rc/clk?jk=0da15fed6a515f... |
| 2 | geophysicist/data_analyst | sander_geophysics_limited | unknown | unknown | ottawa | on | https://ca.indeed.com/rc/clk?jk=2dc0470241aa60... |
| 3 | business_intelligence_data_engineer | maximus_services,_llc | fabric,_power_bi,_sql,_machine_learning,_genes... | 87,875ã¢â‚¬â€œ$105,000_a_year | toronto | on | https://ca.indeed.com/rc/clk?jk=cbbe0e29b236d2... |
| 4 | business_intelligence_specialist,_ft | niagara_health_system | azure,_power_bi,_sql,_aws | 55.39ã¢â‚¬â€œ$62.66_an_hour | niagara | on | https://ca.indeed.com/rc/clk?jk=fe8ad423818b24... |


##### Experience Level

In [13]:
df['job_title'].unique()

array(['binance_accelerator_program_-_data_analyst_(risk)',
       'business_analyst', 'geophysicist/data_analyst',
       'business_intelligence_data_engineer',
       'business_intelligence_specialist,_ft',
       'continuous_improvement_analyst', 'it_business_process_analyst',
       'computer_programmer/analyst',
       'senior_developer,_business_intelligence',
       'opgt_mod_ã¢â‚¬â€œ_one_(1)_business_analyst_ã¢â‚¬â€œ_senior',
       'business_analyst/quality_assurance_analyst',
       'capital_&_maintenance_program_analyst', 'senior_policy_analyst',
       'business_operations_analyst_(1_year_contract)',
       'senior_business_systems_analyst',
       'research_analyst_-_translational_addiction_research_laboratory',
       'payroll_analyst', 'lead_business_analyst', 'data_analyst',
       'hr_technology_lead_and_data_analyst', 'business_data_analyst',
       '(data-driven)_marketing_analyst', 'quality_analyst',
       'technology_risk_analyst', 'technical_support_analyst_',
  

In [14]:
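# Stripping every character except letters, digits, and underscores.
# regex=True is passed explicitly because pandas 2.0+ treats the pattern as a literal string by default.
df['job_title'] = df['job_title'].str.replace('[^a-zA-Z0-9_]', '', regex=True)

# Assigning an experience level from keywords in the job title (the first match wins)
def exp_level(value):
    if 'senior' in value:
        return 'senior'
    
    elif 'sr' in value:
        return 'senior'
    
    elif 'lead' in value:
        return 'senior'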
    
    elif 'jr' in value:
        return 'middle'
    
    elif 'junior' in value:
        return 'middle'
    
    elif 'intermediate' in value:
        return 'middle'
    
    elif 'entry' in value:
        return 'entry-level'
    
    elif 'intern' in value:
        return 'internship'
    
    elif 'student' in value:
        return 'internship'
    
    else:
        return 'any'

df['experience_level'] = df['job_title'].apply(exp_level)
df['experience_level'].unique()

array(['any', 'senior', 'internship', 'middle', 'entry-level'],
      dtype=object)
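
The keyword checks above match substrings, so a keyword can also fire inside a longer word. The sketch below is a hedged alternative that compares whole underscore-separated tokens instead. `LEVEL_KEYWORDS` and `exp_level_by_token` are hypothetical names, the example titles are invented for illustration, and the category mapping simply mirrors the one used above.

```python
import re

# Hypothetical keyword table, checked in order; mirrors the categories used above
LEVEL_KEYWORDS = [
    ('senior', {'senior', 'sr', 'lead'}),
    ('middle', {'jr', 'junior', 'intermediate'}),
    ('entry-level', {'entry'}),
    ('internship', {'intern', 'internship', 'student'}),
]


def exp_level_by_token(title):
    """Return the first level whose keywords appear as whole tokens in the title."""
    tokens = set(re.split(r'[^a-z0-9]+', title.lower()))
    for level, keywords in LEVEL_KEYWORDS:
        if tokens & keywords:
            return level
    return 'any'


print(exp_level_by_token('senior_policy_analyst'))            # senior
print(exp_level_by_token('data_analyst_intern'))              # internship
print(exp_level_by_token('internal_audit_analyst'))           # any ('intern' is only a substring)
print(exp_level_by_token('leadership_development_analyst'))   # any ('lead' is only a substring)
```

The trade-off is that token matching misses variants such as plurals unless they are listed explicitly, so the keyword sets would need to be checked against the actual titles.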

##### Job Title

In [15]:
df['job_title'].unique()

array(['binance_accelerator_program__data_analyst_risk',
       'business_analyst', 'geophysicistdata_analyst',
       'business_intelligence_data_engineer',
       'business_intelligence_specialist_ft',
       'continuous_improvement_analyst', 'it_business_process_analyst',
       'computer_programmeranalyst',
       'senior_developer_business_intelligence',
       'opgt_mod__one_1_business_analyst__senior',
       'business_analystquality_assurance_analyst',
       'capital__maintenance_program_analyst', 'senior_policy_analyst',
       'business_operations_analyst_1_year_contract',
       'senior_business_systems_analyst',
       'research_analyst__translational_addiction_research_laboratory',
       'payroll_analyst', 'lead_business_analyst', 'data_analyst',
       'hr_technology_lead_and_data_analyst', 'business_data_analyst',
       'datadriven_marketing_analyst', 'quality_analyst',
       'technology_risk_analyst', 'technical_support_analyst_',
       'technical_support_analyst__

In [16]:
def clean_title(value):
    if 'data_analyst' in value:
        return 'data_analyst'
    
    elif 'scien' in value:
        return 'data_scientist'
    
    elif 'engineer' in value:
        return 'data_engineer'
    
    elif 'business_intelligence' in value:
        return 'business_intelligence_analyst'
    
    elif 'business_system' in value:
        return 'business_systems_analyst'
    
    elif 'business_analyst' in value:
        return 'business_analyst'
    
    elif 'research' in value:
        return 'research_analyst'
    
    elif 'quality' in value:
        return 'quality_analyst'
    
    elif 'marketing' in value:
        return 'marketing_analyst'
    
    elif 'risk' in value:
        return 'risk_analyst'
    
    elif 'investment' in value:
        return 'financial_analyst'
    
    elif 'asset' in value:
        return 'financial_analyst'
    
    elif 'bank' in value:
        return 'financial_analyst'
    
    elif 'sale' in value:
        return 'financial_analyst'
    
    else:
        return 'analyst'
        
df['job_title'] = df['job_title'].apply(clean_title)
df['job_title'].unique()

array(['data_analyst', 'business_analyst', 'data_engineer',
       'business_intelligence_analyst', 'analyst',
       'business_systems_analyst', 'research_analyst',
       'marketing_analyst', 'quality_analyst', 'risk_analyst',
       'financial_analyst', 'data_scientist'], dtype=object)

##### Company Name

In [17]:
df['employer_name'].unique()

array(['binance', 'canadian_nuclear_laboratories',
       'sander_geophysics_limited', 'maximus_services,_llc',
       'niagara_health_system', 'imp_group', 'ground_effects',
       'the_city_of_vancouver', 'ontario_health', 'softline_technology',
       'dll', 'toronto_hydro', 'insurance_council_of_bc',
       'tmx_group_limited', 'royal_bank_of_canada',
       'centre_for_addiction_and_mental_health', 'cenovus_energy',
       'toronto_transit_commission', 'city_of_barrie', 'accencis_group',
       'weyburn_credit_union', 'closing_the_gap_healthcare',
       'bmo_financial_group', 'banff_caribou_properties_ltd.', 'seequent',
       'university_of_alberta', 'mackenzie_financial_corporation',
       'cloudmd_software_&_services_inc_-_can', 'banque_laurentienne',
       'b.c._college_of_nurses_and_midwives',
       'canada_life_assurance_company', 'keewee', 'snaplii', 'cae',
       'electronic_arts', 'cnooc_international', 'appcast',
       'bridgenext,_inc', 'leonardo_drs', 'autodesk',


In [18]:
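# Stripping every character except letters, digits, and underscores (regex=True is required on pandas 2.0+)
df['employer_name'] = df['employer_name'].str.replace('[^a-zA-Z0-9_]', '', regex=True)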

def clean_employer_name(value):
    if 'financ' in value:
        return 'finance'
    
    elif 'invest' in value:
        return 'finance'
    
    elif 'capital' in value:
        return 'finance'

    elif 'wealth' in value:
        return 'finance'
    
    elif 'manage' in value:
        return 'finance'
    
    elif 'credit' in value:
        return 'finance'
    
    elif 'business' in value:
        return 'finance'
    
    elif 'college' in value:
        return 'education'
    
    elif 'university' in value:
        return 'education'
    
    elif 'school' in value:
        return 'education'
    
    elif 'edu' in value:
        return 'education'
    
    elif 'trans' in value:
        return 'transportation'
    
    elif 'express' in value:
        return 'transportation'
    
    elif 'rail' in value:
        return 'transportation'

    elif 'media' in value:
        return 'media'
    
    elif 'bank' in value:
        return 'banking'
    
    elif 'city' in value:
        return 'government'
    
    elif 'public' in value:
        return 'government'
    
    elif 'police' in value:
        return 'government'
    
    elif 'govern' in value:
        return 'government'
    
    elif 'energy' in value:
        return 'energy'

    elif 'nuclear' in value:
        return 'energy'
    
    elif 'elect' in value:
        return 'energy'
    
    elif 'spark' in value:
        return 'energy'
    
    elif 'insurance' in value:
        return 'insurance'
    
    elif 'health' in value:
        return 'healthcare'
    
    elif 'hospital' in value:
        return 'healthcare'
    
    elif 'medic' in value:
        return 'healthcare'
    
    elif 'pharma' in value:
        return 'healthcare'
    
    elif 'care' in value:
        return 'healthcare'
    
    elif 'farm' in value:
        return 'agriculture'
    
    elif 'metal' in value:
        return 'manufacturing'
    
    elif 'engineer' in value:
        return 'manufacturing'
    
    elif 'manufactur' in value:
        return 'manufacturing'
    
    elif 'machine' in value:
        return 'manufacturing'
    
    elif 'construction' in value:
        return 'construction'
    
    elif 'contracting' in value:
        return 'construction'
    
    elif 'tech' in value:
        return 'technology'
    
    elif 'web' in value:
        return 'technology'
    
    elif 'soft' in value:
        return 'technology'
    
    elif 'systems' in value:
        return 'technology'
    
    elif 'amazon' in value:
        return 'technology'
    
    elif 'estate' in value:
        return 'real_estate'

    elif 'properties' in value:
        return 'real_estate'
    
    elif 'property' in value:
        return 'real_estate'
    
    elif 'macdonald' in value:
        return 'real_estate'
    
    elif 'consult' in value:
        return 'consulting'
    
    elif 'communication' in value:
        return 'telecommunication'
    
    elif 'radio' in value:
        return 'telecommunication'
    
    elif 'food' in value:
        return 'retail'
    
    elif 'walmart' in value:
        return 'retail'
    
    elif 'pepsico' in value:
        return 'retail'
    
    elif 'supermarket' in value:
        return 'retail'
    
    elif 'resort' in value:
        return 'travel'
    
    elif 'travel' in value:
        return 'travel'
    
    elif 'air' in value:
        return 'aerospace'
    
    elif 'flight' in value:
        return 'aerospace'
    
    elif 'aviation' in value:
        return 'aerospace'
    
    elif 'aero' in value:
        return 'aerospace'
    
    elif 'auto' in value:
        return 'automobile'
    
    elif 'driv' in value:
        return 'automobile'
    
    elif 'service' in value:
        return 'service'
    
    else:
        return 'other'
        
df['industry'] = df['employer_name'].apply(clean_employer_name)
df['industry'].unique()

array(['other', 'energy', 'service', 'healthcare', 'government',
       'technology', 'insurance', 'banking', 'transportation', 'finance',
       'real_estate', 'education', 'automobile', 'aerospace', 'retail',
       'consulting', 'construction', 'manufacturing', 'media',
       'agriculture', 'telecommunication', 'travel'], dtype=object)
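
The long `elif` chain above applies one rule: the first keyword group that matches the employer name decides the industry. As a sketch of the same idea in table form, the block below uses an ordered list of groups. `INDUSTRY_KEYWORDS` and `industry_from_name` are hypothetical names, and only a subset of the groups is reproduced, so names that the full chain would classify elsewhere fall through to `'other'` here.

```python
# Illustrative subset of the keyword groups above, kept in the same priority order
INDUSTRY_KEYWORDS = [
    ('finance', ('financ', 'invest', 'capital', 'wealth', 'manage', 'credit', 'business')),
    ('education', ('college', 'university', 'school', 'edu')),
    ('banking', ('bank',)),
    ('energy', ('energy', 'nuclear', 'elect', 'spark')),
    ('healthcare', ('health', 'hospital', 'medic', 'pharma', 'care')),
]


def industry_from_name(name, table=INDUSTRY_KEYWORDS):
    """Return the industry of the first group with a keyword contained in the name."""
    for industry, keywords in table:
        if any(keyword in name for keyword in keywords):
            return industry
    return 'other'


print(industry_from_name('bmo_financial_group'))            # finance
print(industry_from_name('royal_bank_of_canada'))           # banking
print(industry_from_name('canadian_nuclear_laboratories'))  # energy
print(industry_from_name('binance'))                        # other
```

Keeping the groups in a list makes the priority order visible in one place and lets a keyword be added without touching the function.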

##### Work Location

In [19]:
df['city'].unique()

array(['remote', 'ottawa', 'toronto', 'niagara', 'abbotsford', 'windsor',
       'vancouver', 'burlington', 'calgary', 'barrie', 'richmond_hill',
       'weyburn', 'mississauga', 'banff', 'edmonton',
       'greater_toronto_area', 'montrãƒâ©al', 'remote_in_beauceville',
       'london', 'remote_in_charlottetown', 'saint-laurent',
       'fredericton', 'bedford', 'remote_in_toronto', 'brampton',
       'vaughan', 'surrey', 'red_lake', 'winnipeg', 'laval', 'halifax',
       'dieppe', 'vernon', 'dorval', 'bolton', 'sherbrooke', 'victoria',
       'north_york', 'oakville', 'richmond', 'burnaby',
       'metro_vancouver_regional_district', 'berwick',
       'remote_in_moncton', 'remote_in_mount_pearl', 'remote_in_milton',
       'remote_in_boucherville', 'remote_in_woodstock',
       'remote_in_ottawa', 'remote_in_kelowna', 'remote_in_lakeside',
       'remote_in_mississauga', 'remote_in_winnipeg', 'remote_in_regina',
       'remote_in_victoria', 'remote_in_port_coquitlam',
       'remote_i

In [20]:
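# Stripping every character except letters and underscores (regex=True is required on pandas 2.0+)
df['city'] = df['city'].str.replace('[^_a-zA-Z]', '', regex=True)

# Flagging postings whose location mentions "remote"
def work_location(value):
    if 'remote' in value:
        return 'remote'
    else:
        return 'on-site'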

df['work_location'] = df['city'].apply(work_location)
df['work_location'].unique()

array(['remote', 'on-site'], dtype=object)

In [21]:
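def clean_city(value):
    # 'remote_in_<city>' becomes '<city>' (the prefix is 10 characters long)
    if value.startswith('remote_in_'):
        return value[10:]
    
    # Collapsing metro-area variants such as 'greater_toronto_area' into the core city
    elif 'vancouver' in value:
        return 'vancouver'
    
    elif 'toronto' in value:
        return 'toronto'
    
    # Dropping a leftover leading underscore
    elif value.startswith('_'):
        return value[1:]
    else:
        return value

# The garbled accented character in 'montréal' was stripped above, leaving 'montral'
df['city'] = df['city'].str.replace('montral', 'montreal')
df['city'] = df['city'].apply(clean_city)
df['city'].unique()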

array(['remote', 'ottawa', 'toronto', 'niagara', 'abbotsford', 'windsor',
       'vancouver', 'burlington', 'calgary', 'barrie', 'richmond_hill',
       'weyburn', 'mississauga', 'banff', 'edmonton', 'montreal',
       'beauceville', 'london', 'charlottetown', 'saintlaurent',
       'fredericton', 'bedford', 'brampton', 'vaughan', 'surrey',
       'red_lake', 'winnipeg', 'laval', 'halifax', 'dieppe', 'vernon',
       'dorval', 'bolton', 'sherbrooke', 'victoria', 'north_york',
       'oakville', 'richmond', 'burnaby', 'berwick', 'moncton',
       'mount_pearl', 'milton', 'boucherville', 'woodstock', 'kelowna',
       'lakeside', 'regina', 'port_coquitlam', 'thunder_bay', 'squamish',
       'south_dundas', 'hamilton', 'waterloo', 'waterdown', 'saskatoon',
       'whitehorse', 'okotoks', 'thornhill', 'concord', 'sparwood',
       'yellowknife', 'markham', 'bradford', 'etobicoke', 'leduc',
       'rocky_view_county', 'st_catharines', 'lakeshore', 'st_thomas',
       'york', 'st_paul', 'lac

##### Province

In [22]:
df['province'].unique()

array(['unknown', 'on', 'bc', 'ab', 'sk', 'qc', 'pe', 'nb', 'ns', 'mb',
       'nl', 'yt', 'nt', 'nfl'], dtype=object)

In [23]:
df['province'] = df['province'].replace('unknown', 'unspecified')
df['province'].unique()

array(['unspecified', 'on', 'bc', 'ab', 'sk', 'qc', 'pe', 'nb', 'ns',
       'mb', 'nl', 'yt', 'nt', 'nfl'], dtype=object)

##### Salary

In [24]:
df['salary'].unique()

array(['unknown', '87,875ã¢â‚¬â€œ$105,000_a_year',
       '55.39ã¢â‚¬â€œ$62.66_an_hour',
       '43.82ã¢â‚¬â€œ$51.78_an_hour,_43.82_to_$,_51.78_per_hour',
       '75,898ã¢â‚¬â€œ$113,847_a_year,_75,898_-_$,_113,847_dll_is',
       '47.62ã¢â‚¬â€œ$56.27_an_hour,_47.62_to_$,_56.27_per_hour',
       '89,606ã¢â‚¬â€œ$128,809_a_year,_$89,606_-_$_128,809_with,_128,809_with_a,_112,008_per_annum',
       '27.48ã¢â‚¬â€œ$36.65_an_hour',
       '83,192ã¢â‚¬â€œ$104,013_a_year,_1_department_:_talent,_83,192.00_-_$,_104,013.00_pay_scale',
       '95,106.71_to_$,_127,429.57_hourly_pay,_$52.26_to_$70.02_benefits_:',
       '43.50_an_hour',
       '61,181ã¢â‚¬â€œ$76,458_a_year,_61,181_to_$,_76,458_based_on',
       '74,800ã¢â‚¬â€œ$138,600_a_year,_74,800.00_-_$,_138,600.00_pay_type',
       '54,764.33ã¢â‚¬â€œ$73,940.15_a_year,_54,764.33_to_$',
       'approximately_$252_billion_in_total',
       '45,000ã¢â‚¬â€œ$50,000_a_year',
       '104,423ã¢â‚¬â€œ$109,644_a_year,_104,423_ã¢â‚¬_â€œ,_109,644_annually_,',


In [25]:
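# Keeping only digits, 'k'/'K', '$', '.', and '-'; every other character is dropped
df['salary'] = df['salary'].str.replace('[^kK0-9$.-]', '', regex=True)
# Expanding the thousands suffix; this also turns the 'k' left over from 'unknown' into '000'
df['salary'] = df['salary'].str.replace('k', '000', regex=False)
df['salary'] = df['salary'].str.replace('K', '000', regex=False)
# '$' is a regex anchor, so it is replaced as a literal character
df['salary'] = df['salary'].str.replace('$', '-', regex=False)
df['salary'] = df['salary'].str.replace('--', '-', regex=False)
# Dropping '.00' cents
df['salary'] = df['salary'].str.replace(r'\.00', '', regex=True)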
#df['salary'] = df['salary'][df['salary'].str.len() < 20]
df['salary'].unique()

array(['000', '87875-105000', '55.39-62.66', '43.82-51.7843.82-51.78',
       '75898-11384775898-113847', '47.62-56.2747.62-56.27',
       '89606-128809-89606-128809128809112008', '27.48-36.65',
       '83192-104013183192-104013', '95106.71-127429.57-52.26-70.02',
       '43.50', '61181-7645861181-76458', '74800-13860074800-138600',
       '54764.33-73940.1554764.33-', '-252', '45000-50000',
       '104423-109644104423109644', '54500-81800-54500-81800.', '-250',
       '107600-147100-107600-147100147100', '2020', '83500-149300',
       '-271', '46', '-1.3-1.0-0.4-0.3-0.3-0.3.-59570-110630',
       '61181-7645861181-7645861181-76458', '105000', '50-5850-58',
       '76071.18-86658.4876071.18-86658.48', '11-75223.75223.96196.',
       '-100000', '69000.10-84021.082653.85-69000.10-2653.85-',
       '79500-10600079500-106000', '90000-11000090000-110000',
       '82800-103500', '51.89-61.93-51.89-61.9361.93', '38',
       '68000-85000-1500-68000-850008500068000-85000', '28.40',
       '1058

In [26]:
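def format_sal(value):
    # Stripping one stray separator from the start or the end of the string
    if str(value).startswith('-'):
        return value[1:]
    
    elif str(value).endswith('-'):
        return value[:-1]

    # NOTE: endswith() compares literal text, not a regex, so '\\.' looks for a
    # backslash followed by a dot. These two branches never match, and trailing
    # dots (e.g. '54500-81800-54500-81800.') pass through unchanged.
    elif str(value).endswith('\\.'):
        return value[:-1]
    
    elif str(value).endswith('\\.\\.'):
        return value[:-2]
    
    else:
        return value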
    
#df['salary'] = df['salary'].replace('000', '0')
#df['salary'] = df['salary'].replace('', '0')
#df['salary'].fillna('0', inplace=True)

df['salary'] = df['salary'].apply(format_sal)
df['salary'].unique()

array(['000', '87875-105000', '55.39-62.66', '43.82-51.7843.82-51.78',
       '75898-11384775898-113847', '47.62-56.2747.62-56.27',
       '89606-128809-89606-128809128809112008', '27.48-36.65',
       '83192-104013183192-104013', '95106.71-127429.57-52.26-70.02',
       '43.50', '61181-7645861181-76458', '74800-13860074800-138600',
       '54764.33-73940.1554764.33', '252', '45000-50000',
       '104423-109644104423109644', '54500-81800-54500-81800.', '250',
       '107600-147100-107600-147100147100', '2020', '83500-149300', '271',
       '46', '1.3-1.0-0.4-0.3-0.3-0.3.-59570-110630',
       '61181-7645861181-7645861181-76458', '105000', '50-5850-58',
       '76071.18-86658.4876071.18-86658.48', '11-75223.75223.96196.',
       '100000', '69000.10-84021.082653.85-69000.10-2653.85',
       '79500-10600079500-106000', '90000-11000090000-110000',
       '82800-103500', '51.89-61.93-51.89-61.9361.93', '38',
       '68000-85000-1500-68000-850008500068000-85000', '28.40',
       '105800-1388

In [27]:
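def repeating_sal_value(value):
    # Removing a salary figure or range that appears twice in a row by
    # comparing fixed-width slices of the string
    if len(str(value)) <= 1:
        return str(value)
    
    elif str(value)[0:5] == str(value)[5:10]:
        return str(value)[5:]
    
    elif str(value)[0:6] == str(value)[6:12]:
        return str(value)[6:]
    
    # NOTE: [0:7] and [6:14] overlap and differ in length, so this can only
    # match 13-character strings; [7:14] was probably intended
    elif str(value)[0:7] == str(value)[6:14]:
        return str(value)[7:]
    
    elif str(value)[0:11] == str(value)[11:22]:
        return str(value)[11:]

    elif str(value)[0:12] == str(value)[12:24]:
        return str(value)[12:]

    elif str(value)[0:12] == str(value)[13:25]:
        return str(value)[0:12]  
    
    elif str(value)[0:5] == str(value)[12:17]:
        return str(value)[12:]
    
    elif str(value)[0:6] == str(value)[14:20]:
        return str(value)[6:13]
    
    # NOTE: this compares 13 characters with at most 12, so it never matches
    # (and [13:0] would return an empty string if it did)
    elif str(value)[0:13] == str(value)[14:26]:
        return str(value)[13:0]

    else:
        return str(value)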
    
df['salary'] = df['salary'].apply(repeating_sal_value)
df['salary'].unique()

array(['000', '87875-105000', '55.39-62.66', '43.82-51.78',
       '75898-113847', '47.62-56.27', '89606-128809', '27.48-36.65',
       '83192-104013', '95106.71-127429.57-52.26-70.02', '43.50',
       '61181-76458', '74800-138600', '54764.33-73940.1554764.33', '252',
       '45000-50000', '104423-109644104423109644', '54500-81800.', '250',
       '-147100', '2020', '83500-149300', '271', '46',
       '1.3-1.0-0.4-0.3-0.3-0.3.-59570-110630', '61181-7645861181-76458',
       '105000', '50-58', '76071.18-86658.4876071.18-86658.48',
       '11-75223.75223.96196.', '100000',
       '69000.10-84021.082653.85-69000.10-2653.85', '79500-106000',
       '90000-110000', '82800-103500', '51.89-61.9361.93', '38',
       '68000-85000-1500-68000-850008500068000-85000', '28.40',
       '105800-13880', '47.20', '50000-70000', '70000-85000',
       '40.07-47.31', '65000-8500085000', '87000.', '42.04-49.65',
       '69760.70-79322.6969760.70-79322.6979322.69', '38.12-54.80',
       '7977-95937977-959395
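
The slice comparisons above depend on the exact width of each repeated figure. As a hedged alternative, the sketch below reads the first one or two numbers directly from the raw salary text with a regular expression, before any characters are stripped. `NUMBER` and `parse_salary_range` are hypothetical names, and the approach assumes the first two numbers in a posting are the minimum and maximum.

```python
import re

# A number with optional thousands separators and an optional decimal part
NUMBER = re.compile(r'\d[\d,]*(?:\.\d+)?')


def parse_salary_range(text):
    """Return (minimum, maximum) from the first one or two numbers, or None if there are none."""
    numbers = [float(match.replace(',', '')) for match in NUMBER.findall(text)]
    if not numbers:
        return None
    low = numbers[0]
    high = numbers[1] if len(numbers) > 1 else low
    return (low, high)


print(parse_salary_range('43.50_an_hour'))   # (43.5, 43.5)
print(parse_salary_range('unknown'))         # None
print(parse_salary_range(
    '95,106.71_to_$,_127,429.57_hourly_pay,_$52.26_to_$70.02_benefits_:'
))                                           # (95106.71, 127429.57)
```

Text that contains numbers but no salary, such as `approximately_$252_billion_in_total`, would still yield a value, so a plausibility filter remains necessary.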

In [None]:
def two_values(value):
    if '-' in value:
        return value
    
    elif value.startswith('0'):
        return value
    
    else:
        return (f'{value}-{value}')
    
df['salary'] = df['salary'].apply(two_values)
df['salary'].unique()

In [None]:
new = df['salary'].str.split('-', n=1, expand=True)

df['min_salary'] = new[0]
df['max_salary'] = new[1]
df.head()

In [None]:
df['min_salary'].unique()

In [None]:
df['min_salary'] = df['min_salary'].replace('000', '0')
df['min_salary'] = df['min_salary'].replace('', '0')

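# Redefining repeating_sal_value (this replaces the earlier version) with only the
# two checks needed for a single repeated figure
def repeating_sal_value(value):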
    if len(str(value)) <= 1:
        return str(value)
    
    if str(value)[0:5] == str(value)[5:10]:
        return str(value)[5:]
    
    elif str(value)[0:6] == str(value)[6:12]:
        return str(value)[6:]
    
    else:
        return str(value)

df['min_salary'] = df['min_salary'].apply(repeating_sal_value)
df['min_salary'].unique()

In [None]:
df['min_salary'] = df['min_salary'].astype('float')

def hourly_pay_to_annual(value):
    if value < 999:
        return (value * 2080)
    else:
        return value

df['min_salary'] = df['min_salary'].apply(hourly_pay_to_annual)
df['min_salary'].tail()
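
The factor 2080 used above is consistent with a 40-hour week over 52 weeks (40 × 52 = 2080), and the 999 cutoff treats any smaller figure as an hourly rate. Both readings are assumptions about the intent of this cell. A sketch with the constants named, using the hypothetical helper `to_annual`:

```python
HOURS_PER_WEEK = 40   # assumed full-time schedule
WEEKS_PER_YEAR = 52
HOURLY_CUTOFF = 999   # figures below this are treated as hourly rates


def to_annual(amount):
    """Scale an hourly rate to an annual figure; leave annual figures unchanged."""
    if amount < HOURLY_CUTOFF:
        return amount * HOURS_PER_WEEK * WEEKS_PER_YEAR
    return amount


print(to_annual(43.50))     # 90480.0
print(to_annual(105000.0))  # 105000.0
```

Any non-salary number below the cutoff, such as the `252` visible in the salary output earlier, would be scaled as well, which is one reason the later outlier step matters.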

In [None]:
df['max_salary'].unique()

In [None]:
df['max_salary'] = df['max_salary'].apply(repeating_sal_value)
df['max_salary'] = df['max_salary'].str.replace('-', '')
df['max_salary'] = df['max_salary'].str.replace('None', '0')
# Keeping only values shorter than 7 characters; longer strings become NaN
df['max_salary'] = df['max_salary'].where(df['max_salary'].str.len() < 7)
df['max_salary'].unique()

In [None]:
df['max_salary'] = df['max_salary'].astype('float')
df['max_salary'] = df['max_salary'].apply(hourly_pay_to_annual)
# Assigning the result back instead of calling fillna(inplace=True) on a column selection
df['max_salary'] = df['max_salary'].fillna(0)
df['max_salary'].unique()

In [None]:
med_fillin = df[(df['min_salary'] > 0) & (df['max_salary'] > 0)].groupby('job_title').agg({'min_salary': 'median',
                                                                                           'max_salary': 'median'})

med_fillin.columns = ['median_min_sal', 'median_max_sal']
med_fillin = med_fillin.round(2)
med_fillin

In [37]:
# Fill zero (missing) salaries with the median for the same job title.
# Medians are looked up by job-title label rather than by row position, so the
# fill does not depend on the order of the rows in med_fillin and covers every
# job title present in it.
for title, medians in med_fillin.iterrows():
    is_title = df['job_title'] == title
    df.loc[is_title & (df['min_salary'] == 0), 'min_salary'] = medians['median_min_sal']
    df.loc[is_title & (df['max_salary'] == 0), 'max_salary'] = medians['median_max_sal']
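
The per-title fill can be checked on a small, made-up frame. The sketch below is only an illustration: the toy data and the helper name `fill_zero_with_group_median` are assumptions, not part of the dataset. Zeros are treated as missing, the median is taken per job title over postings that list both salaries, and the result is mapped back onto the zero rows.

```python
import pandas as pd


def fill_zero_with_group_median(frame, group_col, value_cols):
    """Replace zeros in value_cols with the median of the same group.

    Medians are computed only from rows where every value column is positive.
    A group without any such row yields NaN for its zero entries.
    """
    result = frame.copy()
    complete = (result[value_cols] > 0).all(axis=1)
    medians = result[complete].groupby(group_col)[value_cols].median()
    for col in value_cols:
        fill_values = result[group_col].map(medians[col])
        is_zero = result[col] == 0
        result.loc[is_zero, col] = fill_values[is_zero]
    return result


# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'job_title': ['analyst', 'analyst', 'analyst', 'data_engineer', 'data_engineer'],
    'min_salary': [50000.0, 70000.0, 0.0, 90000.0, 0.0],
    'max_salary': [60000.0, 80000.0, 0.0, 110000.0, 120000.0],
})
filled = fill_zero_with_group_median(toy, 'job_title', ['min_salary', 'max_salary'])
print(filled)
```

Because the helper works on a copy, the original frame keeps its zeros, which makes it easy to compare the two side by side before committing to the fill.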

In [None]:
df['avg_salary'] = ((df['min_salary'] + df['max_salary']) / 2).round(2)
df = df.drop(columns=['salary'])
df['avg_salary'].isna().sum()

In [None]:
np.percentile(df['avg_salary'], np.arange(1, 101))

In [None]:
# Replace average salaries below 50,000 or above 200,000 with the overall median.
# The median is computed once, before any value is replaced.
median_avg_salary = df['avg_salary'].median()
is_outlier = (df['avg_salary'] < 50000) | (df['avg_salary'] > 200000)
df.loc[is_outlier, 'avg_salary'] = median_avg_salary
np.percentile(df['avg_salary'], np.arange(1, 101))
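
To see how many postings such a rule touches, the replacement can be tried on a handful of values first. This is a minimal sketch with made-up salaries; the helper name `replace_out_of_range` is hypothetical. As in the cell above, the median is taken over all values, outliers included.

```python
import pandas as pd


def replace_out_of_range(values, low, high):
    """Return (cleaned copy, number replaced) for values outside [low, high]."""
    out_of_range = (values < low) | (values > high)
    median = values.median()  # computed over the original values, outliers included
    return values.mask(out_of_range, median), int(out_of_range.sum())


# Toy salaries (hypothetical values, for illustration only)
toy_salaries = pd.Series([30000.0, 60000.0, 90000.0, 120000.0, 500000.0])
cleaned, n_replaced = replace_out_of_range(toy_salaries, 50000, 200000)
print(cleaned.tolist(), n_replaced)
```

Reporting the count alongside the cleaned values helps judge whether the 50,000 and 200,000 cut-offs are replacing a few stray entries or a meaningful share of the data.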

web_platform

In [None]:
df['web_platform'].unique()

In [None]:
def clean_web(value):
    # Map each posting URL to the platform it was scraped from
    if value.startswith('https://ca.indeed'):
        return 'indeed'
    elif value.startswith('https://www.glassdoor'):
        return 'glassdoor'
    # Any other URL maps to None
    return None

df['web_platform'] = df['web_platform'].apply(clean_web)
df['web_platform'].unique()
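
A slightly more tolerant variant is sketched below, assuming the column holds full posting URLs. It parses the host name with the standard library instead of matching a fixed prefix, and it labels anything unrecognised as `'unknown'` rather than `None`. The function name, the mapping, and the sample URLs are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical mapping from a host-name fragment to a platform label
PLATFORM_BY_HOST = {'indeed': 'indeed', 'glassdoor': 'glassdoor'}


def platform_from_url(url):
    """Return the platform label for a posting URL, or 'unknown' if none matches."""
    if not isinstance(url, str):
        return 'unknown'  # covers missing values such as None or NaN
    host = urlparse(url).netloc.lower()
    for fragment, label in PLATFORM_BY_HOST.items():
        if fragment in host:
            return label
    return 'unknown'


# Made-up URLs, for illustration only
print(platform_from_url('https://ca.indeed.com/viewjob?jk=abc'))
print(platform_from_url('https://www.glassdoor.ca/job-listing/example'))
print(platform_from_url('https://example.com/jobs/1'))
```

Matching on the host keeps the labelling stable if a posting uses a different subdomain or scheme than the two prefixes checked above.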

[Back to Table of Contents](#back)

## Exploratory Data Analysis

In [None]:
df.info()
df.head()

### Jobs Available

In [None]:
total_jobs = sns.countplot(data=df,
              y='job_title',
              order=df['job_title'].value_counts().index)

for rect in total_jobs.patches:
    y_value = rect.get_y() + (rect.get_height() + 0.2) / 2
    x_value = rect.get_width()
    label = '{:.0f}'.format(x_value)
    total_jobs.annotate(label,
                        (x_value, y_value),
                        xytext=(2, 1),
                        textcoords='offset points',
                        ha='left',
                        va='center')


plt.title('Total Available Job Positions')
plt.xlabel('Total Jobs Available')
plt.xlim([0,900])
plt.ylabel('Available Job Positions')

plt.grid(axis='x', linewidth=.35)
plt.show()

In [None]:
total_jobs = sns.countplot(data=df,
              y='industry',
              order=df['industry'].value_counts().index)

for rect in total_jobs.patches:
    y_value = rect.get_y() + (rect.get_height() + 0.2) / 2
    x_value = rect.get_width()
    label = '{:.0f}'.format(x_value)
    total_jobs.annotate(label,
                        (x_value, y_value),
                        xytext=(2, 1),
                        textcoords='offset points',
                        ha='left',
                        va='center')


plt.title('Total Available Jobs by Industry')
plt.xlabel('Total Jobs Available')
plt.xlim([0,1100])
plt.ylabel('Job Industry')

plt.grid(axis='x')
plt.show()

In [None]:
r = df[df['industry'] != 'other']['industry'].value_counts().head(10).plot(kind='bar',
                                                 color='slateblue',
                                                 edgecolor='black')

for rect in r.patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2
    space = 1
    label = '{:.0f}'.format(y_value)
    r.annotate(label, (x_value, y_value), xytext=(0, space), textcoords='offset points', ha='center', va='bottom')

plt.title('Top 10 Industries by Available Jobs (Excluding Other)')
plt.xlabel('Job Industry')
plt.ylabel('Total Jobs Available')

plt.grid(axis='y')
plt.show()
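
The three charts in this section repeat the same annotation loop. A possible shortcut, sketched here on made-up job titles, is matplotlib's `Axes.bar_label`, which is available from matplotlib 3.4. The helper name `labelled_count_barh` and the toy data are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def labelled_count_barh(values, ax=None):
    """Draw a horizontal bar chart of value counts and label each bar with its count."""
    counts = values.value_counts()
    if ax is None:
        _, ax = plt.subplots()
    bars = ax.barh(counts.index.astype(str), counts.to_numpy())
    ax.bar_label(bars, fmt='%d', padding=2)  # requires matplotlib >= 3.4
    ax.invert_yaxis()  # most frequent category on top
    ax.grid(axis='x', linewidth=.35)
    return ax


# Toy job titles (hypothetical values, for illustration only)
toy_titles = pd.Series(['data_analyst'] * 3 + ['data_engineer'] * 2 + ['analyst'])
ax = labelled_count_barh(toy_titles)
ax.set_xlabel('Total Jobs Available')
labels = [text.get_text() for text in ax.texts]
print(labels)
```

Calling `plt.show()` afterwards displays the figure. Returning the axes leaves titles and axis limits to the caller, so the same helper could serve both the job-title and the industry chart.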

### Job Salary by Position

In [None]:
plt.figure(figsize=[27, 7])
box = sns.boxplot(data=df,
                  x='job_title',
                  y='avg_salary')

plt.title('Average Annual Salary by Job Position')
plt.xlabel('Job Position')
plt.ylabel('Average Annual Salary')
plt.ylim(45000, 180000)
plt.grid(axis='y')
plt.show()

In [None]:
df[df['job_title'] == 'business_analyst']

### Jobs Available by Platform

In [None]:
plt.figure(figsize=[12, 5])

# Left panel: number of postings per platform
plt.subplot(1, 2, 1)
web_total = sns.countplot(data=df,
                          x='web_platform')

for total in web_total.patches:
    y_value = total.get_height()
    x_value = total.get_x() + total.get_width() / 2
    label = '{:.0f}'.format(y_value)
    web_total.annotate(label,
                       (x_value, y_value),
                       xytext=(0, 1),
                       textcoords='offset points',
                       ha='center',
                       va='bottom')

plt.title('Total Jobs Available by Web Platform')
plt.xlabel('Web Platform')
plt.ylabel('Total Available Jobs')
plt.grid(axis='y')

# Right panel: share of postings per platform
# (no plt.show() between the panels, so both land in the same figure)
plt.subplot(1, 2, 2)
df['web_platform'].value_counts().plot(kind='pie',
                                       fontsize=11,
                                       autopct='%1.1f%%')
plt.title('Jobs Available Ratio by Web Platform')
plt.ylabel('')
plt.tight_layout()
plt.show()

### Jobs by Geography

In [None]:
# value_counts() already sorts in descending order
df[df['city'] != 'remote']['city'].value_counts().head(20)

In [None]:
top_20_list = ['toronto',
               'mississauga',
               'vancouver',
               'montreal',
               'calgary',
               'edmonton',
               'markham',
               'ottawa',
               'brampton',
               'winnipeg',
               'richmond',
               'north_york',
               'victoria',
               'london',
               'surrey',
               'halifax',
               'burnaby',
               'waterloo',
               'etobicoke',
               'hamilton']

df[df['city'].isin(top_20_list)].groupby('city')['avg_salary'].median().round(2)
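
The list of cities can also be derived from the data, so that it stays in step with the counts if the dataset changes. The sketch below assumes the `city` and `avg_salary` columns used above. The helper name `top_city_summary` and the toy frame are hypothetical.

```python
import pandas as pd


def top_city_summary(frame, n=20, exclude=('remote',)):
    """Return posting counts and median avg_salary for the n cities with the most postings."""
    on_site = frame[~frame['city'].isin(exclude)]
    summary = (on_site.groupby('city')['avg_salary']
               .agg(job_openings='size', median_salary='median')
               .sort_values('job_openings', ascending=False)
               .head(n))
    summary['median_salary'] = summary['median_salary'].round(2)
    return summary


# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'city': ['toronto'] * 3 + ['vancouver'] * 2 + ['remote'] * 4 + ['halifax'],
    'avg_salary': [100000.0, 110000.0, 120000.0,
                   90000.0, 130000.0,
                   95000.0, 105000.0, 115000.0, 125000.0,
                   80000.0],
})
summary = top_city_summary(toy, n=2)
print(summary)
```

Using `'size'` counts every posting in a city, including rows where `avg_salary` is missing, which matches what `value_counts()` reports.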

In [None]:
# Create a map centered on Canada
canada_map = folium.Map(location=[56.1304, -106.3468],
                        zoom_start=5,
                        tiles='OpenStreetMap')
# Alternative basemap: tiles='cartodb positron'

# Locations of the top 20 hubs with the most (non-remote) job postings
top_20_cities = [['43.6532', '-79.3832', 'Toronto', '446 Job Openings', 'Median Salary: 110,000'],
                 ['43.5953', '-79.6405', 'Mississauga', '117 Job Openings', 'Median Salary: 124,750'],
                 ['49.2462', '-123.1162', 'Vancouver', '114 Job Openings', 'Median Salary: 121,000'],
                 ['45.5088', '-73.5616', 'Montreal', '104 Job Openings', 'Median Salary: 114,098'],
                 ['51.0499', '-114.0666', 'Calgary', '101 Job Openings', 'Median Salary: 132,500'],
                 ['53.6316', '-113.3239', 'Edmonton', '60 Job Openings', 'Median Salary: 119,750'],
                 ['43.8560', '-79.3370', 'Markham', '49 Job Openings', 'Median Salary: 115,211'],
                 ['45.4247', '-75.6950', 'Ottawa', '48 Job Openings', 'Median Salary: 105,000'],
                 ['43.7315', '-79.7666', 'Brampton', '38 Job Openings', 'Median Salary: 120,030'],
                 ['49.8950', '-97.1384', 'Winnipeg', '30 Job Openings', 'Median Salary: 87,000'],
                 ['49.1665', '-123.1335', 'Richmond', '30 Job Openings', 'Median Salary: 115,440'],
                 ['43.7615', '-79.4110', 'North York', '26 Job Openings', 'Median Salary: 135,002'],
                 ['48.4284', '-123.3656', 'Victoria', '23 Job Openings', 'Median Salary: 136,812'],
                 ['42.9849', '-81.2497', 'London', '22 Job Openings', 'Median Salary: 105,000'],
                 ['49.1913', '-122.8490', 'Surrey', '21 Job Openings', 'Median Salary: 127,920'],
                 ['44.8857', '-63.1005', 'Halifax', '20 Job Openings', 'Median Salary: 91,500'],
                 ['49.2488', '-122.9805', 'Burnaby', '19 Job Openings', 'Median Salary: 118,934'],
                 ['43.4643', '-80.5166', 'Waterloo', '17 Job Openings', 'Median Salary: 94,250'],
                 ['43.6205', '-79.5131', 'Etobicoke', '15 Job Openings', 'Median Salary: 111,000'],
                 ['43.2557', '-79.8711', 'Hamilton', '14 Job Openings', 'Median Salary: 120,120']]

# Loop for map markers
for row in top_20_cities:
    folium.Marker(location=[row[0], row[1]],
                  # Join the two labels into one string; <br> puts them on separate lines
                  tooltip=f'{row[3]}<br>{row[4]}',
                  popup=row[2],
                  icon=folium.Icon(color='red',
                                    icon='info-sign')).add_to(canada_map)
    
    folium.CircleMarker(location=[row[0], row[1]],
                        radius=10,
                        popup=row[2],
                        color='red',
                        fill=True,
                        fill_color='red').add_to(canada_map)
    
canada_map.save('canada_map.html')
canada_map
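
The marker labels above are typed by hand, so they can drift from the computed counts and medians. A small formatter, sketched below, would build the same strings from numbers. The function name `marker_text` is hypothetical, and the sketch assumes that folium renders tooltip text as HTML, so that `<br>` produces a line break.

```python
def marker_text(city, openings, median_salary):
    """Build the popup and tooltip strings for one city marker."""
    popup = city.replace('_', ' ').title()
    tooltip = f'{openings} Job Openings<br>Median Salary: {median_salary:,.0f}'
    return popup, tooltip


# Values taken from the North York row of the list above
popup, tooltip = marker_text('north_york', 26, 135002.0)
print(popup)
print(tooltip)
```

The two strings can then be passed as `popup=` and `tooltip=` when creating each `folium.Marker`.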

[Back to Table of Contents](#back)

## Conclusions and Recommendations <a id='conclusions-and-reccomendations'></a>

[Back to Table of Contents](#back)

## Dataset Citation

syntax:
[Dataset creator's name]. ([Year & Month of dataset creation]). [Name of the dataset], [Version of the dataset]. Retrieved [Date Retrieved] from [Kaggle](URL of the dataset).

example:
Tatman, R. (2017, November). R vs. Python: The Kitchen Gadget Test, Version 1. Retrieved December 20, 2017 from https://www.kaggle.com/rtatman/r-vs-python-the-kitchen-gadget-test.

[Back to Table of Contents](#back)