# Dataset Storytelling

## Overview of data science problem

The purpose of this data science project is to build a predictive salary model for data scientists and related roles.

Using historical data scientist salaries as a business case, I can use this dataset to understand how salary relates to the other features of these roles, and to help others make future predictions more reliably than by looking at historical salaries alone. There is also little public knowledge of what a fair salary for a data scientist is and, in particular, of which roles are likely to earn more.

This project aims to build a predictive model for salary based on a number of features, or characteristics, of each role and employer. The model will be used to provide guidance on salaries and future hiring plans.

### Audience: Executive

**Goal:**
* Ask questions
* Explore data
* Investigate trends
* Review the resulting visualizations and conclusions

### Objectives
Before moving on, a few fundamental questions about the dataset should be checked against the project objectives.

* Do I have the data I need to tackle the data science problem?
* Have I identified the required target value?
* Do I have potentially useful features?
* Do I have any fundamental issues with the data?

### Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Load salary dataset

In [14]:
path = 'ds_salaries.csv'
df = pd.read_csv(path)
# Drop the first column (assumed to be a row index saved into the CSV)
df = df.iloc[:,1:]
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [15]:
# First we want to check the rows and columns
df.shape

(607, 11)

In [16]:
# Now we can check the data types of our columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           607 non-null    int64 
 1   experience_level    607 non-null    object
 2   employment_type     607 non-null    object
 3   job_title           607 non-null    object
 4   salary              607 non-null    int64 
 5   salary_currency     607 non-null    object
 6   salary_in_usd       607 non-null    int64 
 7   employee_residence  607 non-null    object
 8   remote_ratio        607 non-null    int64 
 9   company_location    607 non-null    object
 10  company_size        607 non-null    object
dtypes: int64(4), object(7)
memory usage: 52.3+ KB


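Both "salary" and "salary_in_usd" record pay, and pay is the quantity being modeled. Because "salary" is expressed in the currency given by salary_currency, salary_in_usd is the figure that is comparable across rows and so the more natural target. Each row is one observation, and the other columns are potential features. The info() output also shows that seven columns have the object dtype while the remaining four are int64.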
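
One way to see how the two salary columns relate is to compare them row by row. The sketch below rebuilds the five rows shown in the df.head() output; the names `sample` and `implied_rate` are illustrative and not part of the original notebook. It checks that the two columns agree on the USD rows and differ on the others.

```python
import pandas as pd

# Five rows copied from the df.head() output; `sample` is an illustrative name.
sample = pd.DataFrame(
    {
        "job_title": [
            "Data Scientist",
            "Machine Learning Scientist",
            "Big Data Engineer",
            "Product Data Analyst",
            "Machine Learning Engineer",
        ],
        "salary": [70000, 260000, 85000, 20000, 150000],
        "salary_currency": ["EUR", "USD", "GBP", "USD", "USD"],
        "salary_in_usd": [79833, 260000, 109024, 20000, 150000],
    }
)

# For rows paid in USD, the two salary columns should be identical.
is_usd = sample["salary_currency"] == "USD"
usd_rows_match = (
    sample.loc[is_usd, "salary"] == sample.loc[is_usd, "salary_in_usd"]
).all()

# Ratio of the USD figure to the local-currency figure (1.0 for USD rows).
sample["implied_rate"] = sample["salary_in_usd"] / sample["salary"]

print(usd_rows_match)
print(sample[["salary_currency", "implied_rate"]])
```

On the full dataset the same comparison could be run on `df` directly; this sketch only covers the five rows visible above.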

In [17]:
# Convert the integer year to a datetime (each value becomes January 1 of that year)
df['work_year'] = pd.to_datetime(df['work_year'], format='%Y')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   work_year           607 non-null    datetime64[ns]
 1   experience_level    607 non-null    object        
 2   employment_type     607 non-null    object        
 3   job_title           607 non-null    object        
 4   salary              607 non-null    int64         
 5   salary_currency     607 non-null    object        
 6   salary_in_usd       607 non-null    int64         
 7   employee_residence  607 non-null    object        
 8   remote_ratio        607 non-null    int64         
 9   company_location    607 non-null    object        
 10  company_size        607 non-null    object        
dtypes: datetime64[ns](1), int64(3), object(7)
memory usage: 52.3+ KB


In [18]:
df.work_year

0     2020-01-01
1     2020-01-01
2     2020-01-01
3     2020-01-01
4     2020-01-01
         ...    
602   2022-01-01
603   2022-01-01
604   2022-01-01
605   2022-01-01
606   2022-01-01
Name: work_year, Length: 607, dtype: datetime64[ns]
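
The output above shows that the conversion places every row on January 1 of its year. If the year is later needed as a plain number, for example for grouping, it can be read back with the `.dt` accessor. This is a minimal sketch on a three-value stand-in Series; `years`, `as_datetime`, and `year_numbers` are illustrative names.

```python
import pandas as pd

# Stand-in for the work_year column; both years appear in the output above.
years = pd.Series([2020, 2020, 2022], name="work_year")

# Same format string as in the notebook. The cast to str is a defensive
# choice for this sketch so the '%Y' format is applied to text values.
as_datetime = pd.to_datetime(years.astype(str), format="%Y")

# Recover the integer year from the datetime values.
year_numbers = as_datetime.dt.year

print(as_datetime.iloc[0])
print(year_numbers.tolist())
```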

In [19]:
df.describe()

Unnamed: 0,salary,salary_in_usd,remote_ratio
count,607.0,607.0,607.0
mean,324000.1,112297.869852,70.92257
std,1544357.0,70957.259411,40.70913
min,4000.0,2859.0,0.0
25%,70000.0,62726.0,50.0
50%,115000.0,101570.0,100.0
75%,165000.0,150000.0,100.0
max,30400000.0,600000.0,100.0


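Here we can see that the three numerical variables in the summary (work_year is now a datetime and no longer appears) are on very different scales, which makes sense: salary is in local currency, salary_in_usd is in US dollars, and remote_ratio runs from 0 to 100. The mixed currencies likely explain why salary has a maximum of 30,400,000 and a mean (about 324,000) far above its median (115,000).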
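
If a later model turns out to be sensitive to feature scale, one common option is z-score standardization: subtract each column's mean and divide by its standard deviation. This step is not part of the original notebook. The sketch below is one possible preprocessing choice, applied to two columns of the five rows from df.head(); `sample` and `standardized` are illustrative names.

```python
import pandas as pd

# Two numeric columns from the five rows shown in the df.head() output.
sample = pd.DataFrame(
    {
        "salary_in_usd": [79833, 260000, 109024, 20000, 150000],
        "remote_ratio": [0, 0, 50, 0, 50],
    }
)

# Z-score: (value - column mean) / column standard deviation.
# pandas uses the sample standard deviation (ddof=1) by default.
standardized = (sample - sample.mean()) / sample.std()

print(standardized.round(2))
```

After this transformation each column has mean 0 and standard deviation 1, so the columns can be compared on a common scale.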

In [20]:
# Double-check for missing (NA) values
df.isna().any()

work_year             False
experience_level      False
employment_type       False
job_title             False
salary                False
salary_currency       False
salary_in_usd         False
employee_residence    False
remote_ratio          False
company_location      False
company_size          False
dtype: bool
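
`isna().any()` only reports whether a column contains at least one missing value. To see how many values are missing per column, `isna().sum()` can be used instead. The sketch below uses a small stand-in frame with one deliberately missing entry; `toy` and its missing value are hypothetical and chosen only to make the output non-trivial.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with one missing salary value.
toy = pd.DataFrame(
    {
        "salary_in_usd": [79833, np.nan, 109024],
        "company_size": ["L", "S", "M"],
    }
)

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Total number of missing values in the whole frame.
total_missing = int(missing_counts.sum())

print(missing_counts)
print(total_missing)
```

For the salary dataset, where every column reports False above, the same call on `df` would be expected to return zero for each column.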