# Canadian Data Analyst Job Listing Analysis

## Table of Contents <a id='back'></a>
- [Project Introduction](#project-introduction)
    - [Analysis Outline](#analysis-outline)
    - [Results](#results)
- [Importing Libraries and Opening Data Files](#importing-libraries-and-opening-data-files)
- [Pre-Processing Data](#pre-processing-data)
    - [Duplicates](#duplicates)
    - [Missing Values](#missing-values)
    - [Removing Irrelevant Data](#removing-irrelevant-data)
    - [Data Structure Overhaul](#data-structure-overhaul)
        - [Header Style](#header-style)
        - [Formatting and Data Usage](#formatting-and-data-usage)
- [Exploratory Data Analysis](#exploratory-data-analysis)
- [Conclusions and Recommendations](#conclusions-and-reccomendations)
- [Dataset Citation](#dataset-citation)

## Project Introduction

[project intro]

### Analysis Outline

[Analysis Outline]

### Results

[Results]


[Back to Table of Contents](#back)

## Importing Libraries and Opening Data Files

In [1]:
# Importing the needed libraries for this assignment
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

In [2]:
# Importing file for assignment
try:
    df = pd.read_csv('Raw_Dataset.csv', sep=',')
except FileNotFoundError:
    # Falling back to the alternate dataset location if the local copy is missing
    df = pd.read_csv('/datasets/Raw_Dataset.csv', sep=',')

[Back to Table of Contents](#back)

## Pre-Processing Data

### Duplicates

In [3]:
# Checking for duplicates
df.duplicated().sum()

0

[Back to Table of Contents](#back)

### Missing Values

In [4]:
# Checking for null values
df.isna().sum()

Job ID                  0
Job Title               0
Company Name            0
Language and Tools    167
Job Salary            557
City                    0
Province              118
Job Link                0
dtype: int64

In [5]:
# Filling in null values
df.fillna({'Language and Tools': 'unknown',
           'Job Salary' : 'unknown',
           'Province' : 'unknown'}, inplace = True)
df.isna().sum()

Job ID                0
Job Title             0
Company Name          0
Language and Tools    0
Job Salary            0
City                  0
Province              0
Job Link              0
dtype: int64

[Back to Table of Contents](#back)

### Removing Irrelevant Data

In [6]:
# Removing columns we do not need for this analysis
df = df.drop(columns=['Job ID'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1796 entries, 0 to 1795
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Job Title           1796 non-null   object
 1   Company Name        1796 non-null   object
 2   Language and Tools  1796 non-null   object
 3   Job Salary          1796 non-null   object
 4   City                1796 non-null   object
 5   Province            1796 non-null   object
 6   Job Link            1796 non-null   object
dtypes: object(7)
memory usage: 98.3+ KB


[Back to Table of Contents](#back)

### Data Structure Overhaul

In [7]:
df.describe()

,Job Title,Company Name,Language and Tools,Job Salary,City,Province,Job Link
count,1796,1796,1796,1796,1796,1796,1796
unique,811,790,1057,855,172,14,1761
top,Business Analyst,Scotiabank,unknown,unknown,Toronto,ON,https://www.glassdoor.ca/job-listing/business-...
freq,90,24,167,557,426,949,3


#### Header Style

In [8]:
# Getting general information about the dataset
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1796 entries, 0 to 1795
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Job Title           1796 non-null   object
 1   Company Name        1796 non-null   object
 2   Language and Tools  1796 non-null   object
 3   Job Salary          1796 non-null   object
 4   City                1796 non-null   object
 5   Province            1796 non-null   object
 6   Job Link            1796 non-null   object
dtypes: object(7)
memory usage: 98.3+ KB


,Job Title,Company Name,Language and Tools,Job Salary,City,Province,Job Link
0,Binance Accelerator Program - Data Analyst (Risk),Binance,"Python, Sql",unknown,Remote,unknown,https://ca.indeed.com/rc/clk?jk=9c7f38160c736c...
1,Business Analyst,Canadian Nuclear Laboratories,"Power Bi, Power BI, Excel",unknown,Remote,unknown,https://ca.indeed.com/rc/clk?jk=0da15fed6a515f...
2,Geophysicist/Data Analyst,Sander Geophysics Limited,unknown,unknown,Ottawa,ON,https://ca.indeed.com/rc/clk?jk=2dc0470241aa60...
3,Business Intelligence Data Engineer,"Maximus Services, LLC","Fabric, Power BI, Sql, Machine Learning, Genes...","87,875Ã¢â‚¬â€œ$105,000 a year",Toronto,ON,https://ca.indeed.com/rc/clk?jk=cbbe0e29b236d2...
4,"BUSINESS INTELLIGENCE SPECIALIST, FT",Niagara Health System,"Azure, Power BI, SQL, Aws",55.39Ã¢â‚¬â€œ$62.66 an hour,Niagara,ON,https://ca.indeed.com/rc/clk?jk=fe8ad423818b24...
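The salary strings above contain the sequence `Ã¢â‚¬â€œ`, which looks like an en dash that was decoded with the wrong codec twice. Below is a minimal sketch of one way to undo that, assuming the damage came from UTF-8 bytes being read as Windows-1252 (cp1252). The helper name `fix_mojibake` is hypothetical and not part of the original notebook.

```python
def fix_mojibake(text, rounds=2):
    """Undo UTF-8 text that was mis-decoded as cp1252, up to `rounds` times."""
    for _ in range(rounds):
        try:
            text = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # The text no longer looks mis-decoded; stop and keep it as-is
            break
    return text

# The garbled separator from the salary column, written with escapes for clarity
garbled = "87,875\u00c3\u00a2\u00e2\u201a\u00ac\u00e2\u20ac\u0153$105,000 a year"
print(fix_mojibake(garbled))
```

If the assumption holds for the whole file, the helper could be applied to each text column with `Series.map` before any other cleaning. Plain ASCII values such as `unknown` pass through unchanged.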


In [9]:
# Checking whether the column names are in snake_case format
df.columns

Index(['Job Title', 'Company Name', 'Language and Tools', 'Job Salary', 'City',
       'Province', 'Job Link'],
      dtype='object')

In [10]:
# Renaming column names to snake_case format
df = df.rename(columns={'Job Title': 'job_title',
                        'Company Name': 'company_name',
                        'Language and Tools': 'tech_skills',
                        'Job Salary': 'salary',
                        'City': 'city',
                        'Province': 'province',
                        'Job Link': 'web_platform'})
df.columns

Index(['job_title', 'company_name', 'tech_skills', 'salary', 'city',
       'province', 'web_platform'],
      dtype='object')

[Back to Table of Contents](#back)

#### Formatting and Data Usage

In [11]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1796 entries, 0 to 1795
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   job_title     1796 non-null   object
 1   company_name  1796 non-null   object
 2   tech_skills   1796 non-null   object
 3   salary        1796 non-null   object
 4   city          1796 non-null   object
 5   province      1796 non-null   object
 6   web_platform  1796 non-null   object
dtypes: object(7)
memory usage: 98.3+ KB


,job_title,company_name,tech_skills,salary,city,province,web_platform
0,Binance Accelerator Program - Data Analyst (Risk),Binance,"Python, Sql",unknown,Remote,unknown,https://ca.indeed.com/rc/clk?jk=9c7f38160c736c...
1,Business Analyst,Canadian Nuclear Laboratories,"Power Bi, Power BI, Excel",unknown,Remote,unknown,https://ca.indeed.com/rc/clk?jk=0da15fed6a515f...
2,Geophysicist/Data Analyst,Sander Geophysics Limited,unknown,unknown,Ottawa,ON,https://ca.indeed.com/rc/clk?jk=2dc0470241aa60...
3,Business Intelligence Data Engineer,"Maximus Services, LLC","Fabric, Power BI, Sql, Machine Learning, Genes...","87,875Ã¢â‚¬â€œ$105,000 a year",Toronto,ON,https://ca.indeed.com/rc/clk?jk=cbbe0e29b236d2...
4,"BUSINESS INTELLIGENCE SPECIALIST, FT",Niagara Health System,"Azure, Power BI, SQL, Aws",55.39Ã¢â‚¬â€œ$62.66 an hour,Niagara,ON,https://ca.indeed.com/rc/clk?jk=fe8ad423818b24...


In [12]:
# Converting all values to snake_case (lowercase, underscores instead of spaces)
for column in df.columns:
    df[column] = df[column].str.lower()
    df[column] = df[column].str.replace(' ', '_', regex=False)

df.head()

,job_title,company_name,tech_skills,salary,city,province,web_platform
0,binance_accelerator_program_-_data_analyst_(risk),binance,"python,_sql",unknown,remote,unknown,https://ca.indeed.com/rc/clk?jk=9c7f38160c736c...
1,business_analyst,canadian_nuclear_laboratories,"power_bi,_power_bi,_excel",unknown,remote,unknown,https://ca.indeed.com/rc/clk?jk=0da15fed6a515f...
2,geophysicist/data_analyst,sander_geophysics_limited,unknown,unknown,ottawa,on,https://ca.indeed.com/rc/clk?jk=2dc0470241aa60...
3,business_intelligence_data_engineer,"maximus_services,_llc","fabric,_power_bi,_sql,_machine_learning,_genes...","87,875ã¢â‚¬â€œ$105,000_a_year",toronto,on,https://ca.indeed.com/rc/clk?jk=cbbe0e29b236d2...
4,"business_intelligence_specialist,_ft",niagara_health_system,"azure,_power_bi,_sql,_aws",55.39ã¢â‚¬â€œ$62.66_an_hour,niagara,on,https://ca.indeed.com/rc/clk?jk=fe8ad423818b24...


In [13]:
df['job_title'].unique()

array(['binance_accelerator_program_-_data_analyst_(risk)',
       'business_analyst', 'geophysicist/data_analyst',
       'business_intelligence_data_engineer',
       'business_intelligence_specialist,_ft',
       'continuous_improvement_analyst', 'it_business_process_analyst',
       'computer_programmer/analyst',
       'senior_developer,_business_intelligence',
       'opgt_mod_ã¢â‚¬â€œ_one_(1)_business_analyst_ã¢â‚¬â€œ_senior',
       'business_analyst/quality_assurance_analyst',
       'capital_&_maintenance_program_analyst', 'senior_policy_analyst',
       'business_operations_analyst_(1_year_contract)',
       'senior_business_systems_analyst',
       'research_analyst_-_translational_addiction_research_laboratory',
       'payroll_analyst', 'lead_business_analyst', 'data_analyst',
       'hr_technology_lead_and_data_analyst', 'business_data_analyst',
       '(data-driven)_marketing_analyst', 'quality_analyst',
       'technology_risk_analyst', 'technical_support_analyst_',
  

In [14]:
# Stripping every character that is not a letter, digit, or underscore
df['job_title'] = df['job_title'].str.replace('[^a-zA-Z0-9_]', '', regex=True)

def clean_title(value):
    if 'data_analyst' in value:
        return 'data_analyst'
    
    elif 'data' in value:
        return 'data_analyst'
    
    elif 'business_intelligence' in value:
        return 'business_intelligence_analyst'
    
    elif 'business_system' in value:
        return 'business_systems_analyst'
    
    elif 'business_analyst' in value:
        return 'business_analyst'
    
    elif 'research' in value:
        return 'research_analyst'
    
    elif 'quality' in value:
        return 'quality_analyst'
    
    elif 'marketing' in value:
        return 'marketing_analyst'
    
    elif 'risk' in value:
        return 'risk_analyst'
    
    elif 'investment' in value:
        return 'financial_analyst'
    
    elif 'asset' in value:
        return 'financial_analyst'
    
    elif 'bank' in value:
        return 'financial_analyst'
    
    elif 'sale' in value:
        return 'financial_analyst'
    
    else:
        return 'analyst'
        
df['job_title'] = df['job_title'].apply(clean_title)
df['job_title'].unique()

array(['data_analyst', 'business_analyst',
       'business_intelligence_analyst', 'analyst',
       'business_systems_analyst', 'research_analyst', 'quality_analyst',
       'risk_analyst', 'financial_analyst', 'marketing_analyst'],
      dtype=object)
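The `if`/`elif` chain in `clean_title` can also be written as an ordered lookup table, which keeps the keyword-to-category mapping in one place. This is a sketch only: `TITLE_KEYWORDS` and `categorize_title` are hypothetical names, and the pairs simply copy the keywords and order used in the cell above. The separate `'data_analyst'` check is omitted because any title containing it also contains `'data'`.

```python
# Ordered (keyword, category) pairs; the first matching keyword wins,
# mirroring the order of the if/elif chain
TITLE_KEYWORDS = [
    ("data", "data_analyst"),
    ("business_intelligence", "business_intelligence_analyst"),
    ("business_system", "business_systems_analyst"),
    ("business_analyst", "business_analyst"),
    ("research", "research_analyst"),
    ("quality", "quality_analyst"),
    ("marketing", "marketing_analyst"),
    ("risk", "risk_analyst"),
    ("investment", "financial_analyst"),
    ("asset", "financial_analyst"),
    ("bank", "financial_analyst"),
    ("sale", "financial_analyst"),
]

def categorize_title(title, default="analyst"):
    """Return the category of the first keyword found in `title`."""
    for keyword, category in TITLE_KEYWORDS:
        if keyword in title:
            return category
    return default

print(categorize_title("business_intelligence_data_engineer"))
print(categorize_title("senior_policy_analyst"))
```

Because order matters, a title such as `business_intelligence_data_engineer` is still labelled `data_analyst`, exactly as in the original chain. Reordering the list is the only change needed to alter that priority.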

In [15]:
df['web_platform'].unique()

array(['https://ca.indeed.com/rc/clk?jk=9c7f38160c736c78&bb=jmruzegvy_2zalhxc3miarbrfhvpjxoqvuat4gd16kuzbzqrzaincbr5w5gdefrfr5lvttdosyg3qraanlxcud9iaz4zngshhlixaeof9ezgpfkdnjt04v6cwqv09nog&xkcb=sobd67m39kh62wxbjh0lbzkdcdpp&fccid=ac2ee5578fa99fc9&vjs=3',
       'https://ca.indeed.com/rc/clk?jk=0da15fed6a515fe5&bb=jmruzegvy_2zalhxc3miapjgjclvcfqucxinmd0zx2fshgsye-wbutyscwefufuqm7kud9rgnijbjxw5y4k44be95otvsctpsv-_tqrmi8c76pdkhl3qnw%3d%3d&xkcb=sodp67m39kh62wxbjh0kbzkdcdpp&fccid=a0da53533519eae5&vjs=3',
       'https://ca.indeed.com/rc/clk?jk=2dc0470241aa6066&bb=jmruzegvy_2zalhxc3miag9ccuaulow1dzrh0spiltfpceyi0f2m_v0njmm3elktji6ouhs2jvoouz1hcgxjwqz2odtgrbbdrvebiynjejz0unmkp5mtk3blskfesl9l&xkcb=sob067m39kh62wxbjh0jbzkdcdpp&fccid=cf2319525eb667d8&cmp=sander-geophysics&ti=data+analyst&vjs=3',
       ...,
       'https://www.glassdoor.ca/job-listing/business-intelligence-analyst-clio-jv_ic2278756_ko0,29_ke30,34.htm?jl=1009218395921',
       'https://www.glassdoor.ca/job-listing/analyst-supply-c

In [16]:
def clean_web(value):
    # Labelling each listing by the job board its link points to
    if value.startswith('https://ca.indeed'):
        return 'indeed'

    elif value.startswith('https://www.glassdoor'):
        return 'glassdoor'

    else:
        return 'unknown'
        
df['web_platform'] = df['web_platform'].apply(clean_web)
df['web_platform'].unique()

array(['indeed', 'glassdoor'], dtype=object)
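Matching on a fixed URL prefix works for the two job boards in this dataset, but it depends on the exact scheme and subdomain. A slightly more tolerant sketch parses the host name instead. The function `platform_from_url`, the tuple `KNOWN_PLATFORMS`, and the sample URLs below are illustrative assumptions, not taken from the dataset.

```python
from urllib.parse import urlparse

KNOWN_PLATFORMS = ("indeed", "glassdoor")

def platform_from_url(url):
    """Return the job board named in the URL's host, or 'unknown'."""
    host = urlparse(url).netloc.lower()
    labels = host.split(".")  # e.g. ['ca', 'indeed', 'com']
    for platform in KNOWN_PLATFORMS:
        if platform in labels:
            return platform
    return "unknown"

# Made-up URLs shaped like the ones in the dataset
print(platform_from_url("https://ca.indeed.com/rc/clk?jk=example"))
print(platform_from_url("https://www.glassdoor.ca/job-listing/example.htm"))
```

This version returns `'unknown'` for anything it does not recognize, matching the placeholder used elsewhere in the notebook.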

In [17]:
# Keeping only letters, digits, "$", "." and "-" in the salary text
df['salary'] = df['salary'].str.replace('[^a-zA-Z0-9$.-]', '', regex=True)
df['salary'].unique()

array(['unknown', '87875$105000ayear', '55.39$62.66anhour',
       '43.82$51.78anhour43.82to$51.78perhour',
       '75898$113847ayear75898-$113847dllis',
       '47.62$56.27anhour47.62to$56.27perhour',
       '89606$128809ayear$89606-$128809with128809witha112008perannum',
       '27.48$36.65anhour',
       '83192$104013ayear1departmenttalent83192.00-$104013.00payscale',
       '95106.71to$127429.57hourlypay$52.26to$70.02benefits',
       '43.50anhour', '61181$76458ayear61181to$76458basedon',
       '74800$138600ayear74800.00-$138600.00paytype',
       '54764.33$73940.15ayear54764.33to$',
       'approximately$252billionintotal', '45000$50000ayear',
       '104423$109644ayear104423109644annually',
       '54500$81800ayearbetween$54500-$81800annually.',
       'upto$250perintern',
       '107600$147100ayear$107600-$147100can147100canannually',
       '20anhour20.00perhour',
       'teambellandweemployeereferralprogramadequateknowledgeof',
       '83500$149300ayear', 'approximately$271bil

In [18]:
# Removing estimate labels and pay-period words; "k" is expanded to "000".
# Note that this also turns 'unknown' into 'un000nown', which later reduces to '000'.
for old, repl in [('glassdoorest', ''), ('employerest', ''), ('k', '000'),
                  ('ayear', ''), ('perhour', ''), ('anhour', ''), ('peryear', '')]:
    df['salary'] = df['salary'].str.replace(old, repl, regex=False)

# Discarding long free-text entries (20+ characters) by setting them to NaN
df['salary'] = df['salary'].where(df['salary'].str.len() < 20)
# Keeping only digits and the characters "$", "." and "-"
df['salary'] = df['salary'].str.replace('[^0-9$.-]', '', regex=True)

# Using "-" as the single separator between range bounds
df['salary'] = df['salary'].str.replace('$', '-', regex=False)
df['salary'] = df['salary'].str.replace('--', '-', regex=False)
df['salary'].unique()

array(['000', '87875-105000', '55.39-62.66', nan, '27.48-36.65', '43.50',
       '45000-50000', '-250', '2020.00', '83500-149300', '46', '105000',
       '50-5850.00-58.00', '-100000', '82800-103500', '38', '28.40',
       '47.20', '8700087000.', '11028', '2500', '5000', '38-39', '-', '',
       '42.56', '82000-101000', '43.30', '44', '47.12',
       '47266.40-64725.92', '56000-84000', '47', '40.12-45.39', '128',
       '8000080000', '75715-100652', '70-80', '500', '41', '43',
       '37.18-40.00', '50000-60000', '45', '83416.00.', '43.27', '42-50',
       '43.35-57.96', '71563-100052', '77105-86637', '20', '48.50',
       '109000-159000', '43200-70800', '-27', '103042', '42', '4545.00',
       '64618.31-90053.44', '-5000', '79786.18-93866.14', '40-45', '1',
       '57200-78000', '67450.89-97437.66', '80-10080.00-100.00', '4',
       '65000-78000', '-1000', '-20.01-22.52', '1700215', '25-30', '2',
       '38000-480002000', '-3500', '95000-10500095-105', '71000-94000',
       '580005800
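Chained string replacements can leave fused values such as `'8700087000'` when a listing repeats its salary in free text. One alternative, sketched here under the assumption that the raw strings resemble the samples shown earlier (for example `87,875...$105,000 a year`), is to pull the numbers out directly with a regular expression and keep the first two. The names `NUMBER` and `parse_salary_range` are hypothetical.

```python
import re

# A number with optional thousands separators and decimals, plus an optional "k" suffix
NUMBER = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(k\b)?", re.IGNORECASE)

def parse_salary_range(text):
    """Return (low, high) as floats, or (None, None) when no number is found."""
    values = []
    for digits, k_suffix in NUMBER.findall(text):
        value = float(digits.replace(",", ""))
        if k_suffix:
            value *= 1000
        values.append(value)
    if not values:
        return (None, None)
    low = values[0]
    # A single figure is treated as both bounds (a choice made for this sketch)
    high = values[1] if len(values) > 1 else low
    return (low, high)

print(parse_salary_range("87,875\u2013$105,000 a year"))
print(parse_salary_range("43.50 an hour"))
print(parse_salary_range("unknown"))
```

Unlike the notebook's approach, this sketch returns `None` for missing salaries instead of `0`, and it would need to run on the raw `Job Salary` text before any characters are stripped.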

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1796 entries, 0 to 1795
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   job_title     1796 non-null   object
 1   company_name  1796 non-null   object
 2   tech_skills   1796 non-null   object
 3   salary        1515 non-null   object
 4   city          1796 non-null   object
 5   province      1796 non-null   object
 6   web_platform  1796 non-null   object
dtypes: object(7)
memory usage: 98.3+ KB


In [20]:
def format_sal(value):
    # Dropping a leading "-" left over from the "$" replacement
    if str(value).startswith('-'):
        return value[1:]
    else:
        return value
    
df['salary'] = df['salary'].apply(format_sal)
df['salary'].unique()

array(['000', '87875-105000', '55.39-62.66', nan, '27.48-36.65', '43.50',
       '45000-50000', '250', '2020.00', '83500-149300', '46', '105000',
       '50-5850.00-58.00', '100000', '82800-103500', '38', '28.40',
       '47.20', '8700087000.', '11028', '2500', '5000', '38-39', '',
       '42.56', '82000-101000', '43.30', '44', '47.12',
       '47266.40-64725.92', '56000-84000', '47', '40.12-45.39', '128',
       '8000080000', '75715-100652', '70-80', '500', '41', '43',
       '37.18-40.00', '50000-60000', '45', '83416.00.', '43.27', '42-50',
       '43.35-57.96', '71563-100052', '77105-86637', '20', '48.50',
       '109000-159000', '43200-70800', '27', '103042', '42', '4545.00',
       '64618.31-90053.44', '79786.18-93866.14', '40-45', '1',
       '57200-78000', '67450.89-97437.66', '80-10080.00-100.00', '4',
       '65000-78000', '1000', '20.01-22.52', '1700215', '25-30', '2',
       '38000-480002000', '3500', '95000-10500095-105', '71000-94000',
       '5800058000.00', '70000-80000'

In [21]:
def format_sal_2(value):
    # Dropping a trailing "." so the value can later be converted to a number
    if str(value).endswith('.'):
        return value[:-1]
    else:
        return value
    
df['salary'] = df['salary'].apply(format_sal_2)
df['salary'].unique()

array(['000', '87875-105000', '55.39-62.66', nan, '27.48-36.65', '43.50',
       '45000-50000', '250', '2020.00', '83500-149300', '46', '105000',
       '50-5850.00-58.00', '100000', '82800-103500', '38', '28.40',
       '47.20', '8700087000', '11028', '2500', '5000', '38-39', '',
       '42.56', '82000-101000', '43.30', '44', '47.12',
       '47266.40-64725.92', '56000-84000', '47', '40.12-45.39', '128',
       '8000080000', '75715-100652', '70-80', '500', '41', '43',
       '37.18-40.00', '50000-60000', '45', '83416.00', '43.27', '42-50',
       '43.35-57.96', '71563-100052', '77105-86637', '20', '48.50',
       '109000-159000', '43200-70800', '27', '103042', '42', '4545.00',
       '64618.31-90053.44', '79786.18-93866.14', '40-45', '1',
       '57200-78000', '67450.89-97437.66', '80-10080.00-100.00', '4',
       '65000-78000', '1000', '20.01-22.52', '1700215', '25-30', '2',
       '38000-480002000', '3500', '95000-10500095-105', '71000-94000',
       '5800058000.00', '70000-80000', 

In [22]:
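# Treating missing salaries as '0', then splitting each range into its two bounds
df['salary'] = df['salary'].fillna('0')
salary_parts = df['salary'].str.split('-', n=1, expand=True)

df['min_salary'] = salary_parts[0]
df['max_salary'] = salary_parts[1]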
df.head()

,job_title,company_name,tech_skills,salary,city,province,web_platform,min_salary,max_salary
0,data_analyst,binance,"python,_sql",000,remote,unknown,indeed,0.0,
1,business_analyst,canadian_nuclear_laboratories,"power_bi,_power_bi,_excel",000,remote,unknown,indeed,0.0,
2,data_analyst,sander_geophysics_limited,unknown,000,ottawa,on,indeed,0.0,
3,data_analyst,"maximus_services,_llc","fabric,_power_bi,_sql,_machine_learning,_genes...",87875-105000,toronto,on,indeed,87875.0,105000.0
4,business_intelligence_analyst,niagara_health_system,"azure,_power_bi,_sql,_aws",55.39-62.66,niagara,on,indeed,55.39,62.66


In [23]:
df['min_salary'].unique()

array(['000', '87875', '55.39', '0', '27.48', '43.50', '45000', '250',
       '2020.00', '83500', '46', '105000', '50', '100000', '82800', '38',
       '28.40', '47.20', '8700087000', '11028', '2500', '5000', '',
       '42.56', '82000', '43.30', '44', '47.12', '47266.40', '56000',
       '47', '40.12', '128', '8000080000', '75715', '70', '500', '41',
       '43', '37.18', '50000', '45', '83416.00', '43.27', '42', '43.35',
       '71563', '77105', '20', '48.50', '109000', '43200', '27', '103042',
       '4545.00', '64618.31', '79786.18', '40', '1', '57200', '67450.89',
       '80', '4', '65000', '1000', '20.01', '1700215', '25', '2', '38000',
       '3500', '95000', '71000', '5800058000.00', '70000', '45.64', '30',
       '54300', '2000', '42.05', '4800048000.00', '23.76', '1818.00', '3',
       '22', '52.09', '28.50', '41.50', '32.21', '75000', '86008',
       '77000', '72639', '32', '37', '140000140000.00', '55000', '53.50',
       '3358', '68994', '47.25', '26.50', '29.50', '36.5736

In [24]:
df['min_salary'] = df['min_salary'].replace('000', '0')
df['min_salary'] = df['min_salary'].replace('', '0')

def repeating_sal_value(value):
    # Some entries repeat the same figure twice (e.g. '8700087000');
    # when the first 5 or 6 characters repeat, keep only the second copy.
    text = str(value)
    if len(text) <= 1:
        return text

    if text[0:5] == text[5:10]:
        return text[5:]

    elif text[0:6] == text[6:12]:
        return text[6:]

    else:
        return text


df['min_salary'] = df['min_salary'].apply(repeating_sal_value)
df['min_salary'].unique()

array(['0', '87875', '55.39', '27.48', '43.50', '45000', '250', '2020.00',
       '83500', '46', '105000', '50', '100000', '82800', '38', '28.40',
       '47.20', '87000', '11028', '2500', '5000', '42.56', '82000',
       '43.30', '44', '47.12', '47266.40', '56000', '47', '40.12', '128',
       '80000', '75715', '70', '500', '41', '43', '37.18', '50000', '45',
       '83416.00', '43.27', '42', '43.35', '71563', '77105', '20',
       '48.50', '109000', '43200', '27', '103042', '4545.00', '64618.31',
       '79786.18', '40', '1', '57200', '67450.89', '80', '4', '65000',
       '1000', '20.01', '1700215', '25', '2', '38000', '3500', '95000',
       '71000', '58000.00', '70000', '45.64', '30', '54300', '2000',
       '42.05', '48000.00', '23.76', '1818.00', '3', '22', '52.09',
       '28.50', '41.50', '32.21', '75000', '86008', '77000', '72639',
       '32', '37', '140000.00', '55000', '53.50', '3358', '68994',
       '47.25', '26.50', '29.50', '36.57', '44200', '39', '68000',
       '1411

In [25]:
df['min_salary'] = df['min_salary'].astype('float')

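def hourly_pay_to_annual(value):
    # Values under 100 are treated as hourly rates and annualized
    # using 2,080 working hours per year (40 hours x 52 weeks)
    if value < 100:
        return (value * 2080)
    else:
        return value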

df['min_salary'] = df['min_salary'].apply(hourly_pay_to_annual).round(2)
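As a worked check of the conversion, the sketch below annualizes the hourly rate from the Niagara Health System sample row (55.39 to 62.66 an hour). It assumes a full-time schedule of 40 hours for 52 weeks. The name `to_annual` and its `hourly_cutoff` parameter are hypothetical and only make the threshold explicit.

```python
HOURS_PER_YEAR = 40 * 52  # 2,080 hours: an assumed full-time schedule

def to_annual(value, hourly_cutoff=100, hours_per_year=HOURS_PER_YEAR):
    """Annualize values below `hourly_cutoff`; return other values unchanged."""
    if value < hourly_cutoff:
        return round(value * hours_per_year, 2)
    return value

print(to_annual(55.39))   # hourly rate, annualized
print(to_annual(62.66))   # hourly rate, annualized
print(to_annual(87875))   # already annual, returned as-is
```

The cutoff is a heuristic. Values such as `250` or `500` in the cleaned column fall above it and are kept as annual figures even though they are unlikely to be yearly salaries, so they may deserve a separate check.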

In [26]:
df['max_salary'].unique()

array([None, '105000', '62.66', '36.65', '50000', '149300',
       '5850.00-58.00', '103500', '39', '101000', '64725.92', '84000',
       '45.39', '100652', '80', '40.00', '60000', '50', '57.96', '100052',
       '86637', '159000', '70800', '90053.44', '93866.14', '45', '78000',
       '97437.66', '10080.00-100.00', '22.52', '30', '480002000',
       '10500095-105', '94000', '80000', '65.60', '88300', '159000-',
       '9070.00-90.00', '25.90', '45.00', '110000', '96000', '94431',
       '3532-35', '75000', '5030.00-50.00', '44', '3530-35', '55300',
       '120000', '38.00', '113500', '55.01', '84450', '90000', '104000',
       '85000', '65000', '93000', '107000', '130000', '108000', '87000',
       '70000', '75.00', '86000', '82000', '76000', '77000', '45000',
       '111000', '74000', '79000', '43.00', '91000', '95000', '66000',
       '54.80', '49.65', '97000', '10000', '109000', '115000', '71000',
       '114000', '100000', '23.67', '69000', '43.40', '55.18', '99000',
       '55.00

In [27]:
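# Removing stray separators and discarding entries of 7+ characters as malformed
df['max_salary'] = df['max_salary'].str.replace('-', '', regex=False)
df['max_salary'] = df['max_salary'].where(df['max_salary'].str.len() < 7)
df['max_salary'] = df['max_salary'].fillna('0')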
df['max_salary'].unique()

array(['0', '105000', '62.66', '36.65', '50000', '149300', '103500', '39',
       '101000', '84000', '45.39', '100652', '80', '40.00', '60000', '50',
       '57.96', '100052', '86637', '159000', '70800', '45', '78000',
       '22.52', '30', '94000', '80000', '65.60', '88300', '25.90',
       '45.00', '110000', '96000', '94431', '353235', '75000', '44',
       '353035', '55300', '120000', '38.00', '113500', '55.01', '84450',
       '90000', '104000', '85000', '65000', '93000', '107000', '130000',
       '108000', '87000', '70000', '75.00', '86000', '82000', '76000',
       '77000', '45000', '111000', '74000', '79000', '43.00', '91000',
       '95000', '66000', '54.80', '49.65', '97000', '10000', '109000',
       '115000', '71000', '114000', '100000', '23.67', '69000', '43.40',
       '55.18', '99000', '55.00', '58000', '23.66', '124000', '81000',
       '147000', '68000', '112000', '123000', '85.00', '230000', '92000',
       '62000', '156000', '187000', '98000', '48.00', '113000', '530

In [28]:
df['max_salary'] = df['max_salary'].astype('float')
df['max_salary'] = df['max_salary'].apply(hourly_pay_to_annual).round(2)
df['max_salary'].unique()

array([0.000000e+00, 1.050000e+05, 1.303328e+05, 7.623200e+04,
       5.000000e+04, 1.493000e+05, 1.035000e+05, 8.112000e+04,
       1.010000e+05, 8.400000e+04, 9.441120e+04, 1.006520e+05,
       1.664000e+05, 8.320000e+04, 6.000000e+04, 1.040000e+05,
       1.205568e+05, 1.000520e+05, 8.663700e+04, 1.590000e+05,
       7.080000e+04, 9.360000e+04, 7.800000e+04, 4.684160e+04,
       6.240000e+04, 9.400000e+04, 8.000000e+04, 1.364480e+05,
       8.830000e+04, 5.387200e+04, 1.100000e+05, 9.600000e+04,
       9.443100e+04, 3.532350e+05, 7.500000e+04, 9.152000e+04,
       3.530350e+05, 5.530000e+04, 1.200000e+05, 7.904000e+04,
       1.135000e+05, 1.144208e+05, 8.445000e+04, 9.000000e+04,
       8.500000e+04, 6.500000e+04, 9.300000e+04, 1.070000e+05,
       1.300000e+05, 1.080000e+05, 8.700000e+04, 7.000000e+04,
       1.560000e+05, 8.600000e+04, 8.200000e+04, 7.600000e+04,
       7.700000e+04, 4.500000e+04, 1.110000e+05, 7.400000e+04,
       7.900000e+04, 8.944000e+04, 9.100000e+04, 9.5000

In [32]:
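# Averaging the two bounds. Rows with a single salary figure have max_salary == 0,
# so their average is half of min_salary.
df['avg_salary'] = ((df['max_salary'] + df['min_salary']) / 2).round(2)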
df = df.drop(columns=['salary'])
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1796 entries, 0 to 1795
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   job_title     1796 non-null   object 
 1   company_name  1796 non-null   object 
 2   tech_skills   1796 non-null   object 
 3   city          1796 non-null   object 
 4   province      1796 non-null   object 
 5   web_platform  1796 non-null   object 
 6   min_salary    1796 non-null   float64
 7   max_salary    1796 non-null   float64
 8   avg_salary    1796 non-null   float64
dtypes: float64(3), object(6)
memory usage: 126.4+ KB


,job_title,company_name,tech_skills,city,province,web_platform,min_salary,max_salary,avg_salary
0,data_analyst,binance,"python,_sql",remote,unknown,indeed,0.0,0.0,0.0
1,business_analyst,canadian_nuclear_laboratories,"power_bi,_power_bi,_excel",remote,unknown,indeed,0.0,0.0,0.0
2,data_analyst,sander_geophysics_limited,unknown,ottawa,on,indeed,0.0,0.0,0.0
3,data_analyst,"maximus_services,_llc","fabric,_power_bi,_sql,_machine_learning,_genes...",toronto,on,indeed,87875.0,105000.0,96437.5
4,business_intelligence_analyst,niagara_health_system,"azure,_power_bi,_sql,_aws",niagara,on,indeed,115211.2,130332.8,122772.0


In [30]:
avg_fillin = df.groupby('job_title').agg({'min_salary': 'mean',
                                          'max_salary': 'mean',
                                          'avg_salary': 'mean'})


avg_fillin.columns = ['avg_min_sal', 'avg_max_sal', 'avg_avg_sal']
avg_fillin

job_title,avg_min_sal,avg_max_sal,avg_avg_sal
analyst,30722.860899,31074.814052,30898.837467
business_analyst,35944.764986,32952.368661,34448.566838
business_intelligence_analyst,37955.054237,38641.233898,38298.144068
business_systems_analyst,45579.433663,26860.910891,36220.172277
data_analyst,49469.450354,39049.919553,44259.684953
financial_analyst,18370.956522,21304.347826,19837.652174
marketing_analyst,40233.76,20352.0,30292.88
quality_analyst,21222.222222,24222.222222,22722.222222
research_analyst,40561.15,39412.0,39986.575
risk_analyst,25480.0,36770.0,31125.0


In [35]:
avg_fillin.loc['analyst', 'avg_min_sal']

30722.860898692812

In [None]:
def sal_fillin(row):
    # Replacing a zero (unknown) minimum salary for 'analyst' rows with that title's mean
    if (row['job_title'] == 'analyst') and (row['min_salary'] == 0):
        return avg_fillin.loc['analyst', 'avg_min_sal']
    return row['min_salary']
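The unfinished `sal_fillin` cell appears to aim at replacing zero salaries with the mean for the same job title. A vectorized sketch of that idea follows, using a small made-up frame named `toy` (not the real dataset). It treats zeros as missing before averaging, on the assumption that a zero means "unknown". The means in `avg_fillin` above include those zeros, which pulls them down.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the cleaned frame; 0.0 marks an unknown salary
toy = pd.DataFrame({
    "job_title": ["analyst", "analyst", "analyst", "data_analyst", "data_analyst"],
    "min_salary": [0.0, 40000.0, 60000.0, 0.0, 80000.0],
})

# Treat zeros as missing so they do not drag the group means down
known = toy["min_salary"].replace(0, np.nan)

# Mean of the known salaries per job title, aligned to the original rows
group_mean = known.groupby(toy["job_title"]).transform("mean")

# Fill only the missing entries with their group's mean
toy["min_salary_filled"] = known.fillna(group_mean)
print(toy)
```

The same pattern could be repeated for `max_salary`. A job title with no known salary at all would stay `NaN` and need a separate fallback.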

[Back to Table of Contents](#back)

## Exploratory Data Analysis

[Back to Table of Contents](#back)

## Conclusions and Recommendations <a id='conclusions-and-reccomendations'></a>

[Back to Table of Contents](#back)

## Dataset Citation

syntax:
[Dataset creator's name]. ([Year & Month of dataset creation]). [Name of the dataset], [Version of the dataset]. Retrieved [Date Retrieved] from [Kaggle](URL of the dataset).

example:
Tatman, R. (2017, November). R vs. Python: The Kitchen Gadget Test, Version 1. Retrieved December 20, 2017 from https://www.kaggle.com/rtatman/r-vs-python-the-kitchen-gadget-test.

[Back to Table of Contents](#back)