#### Does it make sense to include the other datasets collected by Foorila?

Apart from the remote salary dataset, Foorila also collects DevOps, AI, and InfoSec salary datasets with the same attributes. Because the remote salary dataset contains fewer than 2,000 rows, it would be useful to add these data points.

It turns out that the additional datasets consist mostly of duplicates of the remote dataset: only 17% (AI), 22% (InfoSec), and 12% (DevOps) of their rows are new. Because it is unclear how Foorila handles this internally, we will not include these additional data points in our further analysis.

Tim Weisbarth January 2022

In [8]:
# Import modules
import pandas as pd

# Load all datasets provided by Foorila
sal_devops = pd.read_csv('../data/salaries_devops.csv')
sal_infosec = pd.read_csv('../data/salaries_infosec.csv')
sal_ai = pd.read_csv('../data/salaries_ai.csv')
sal_remote = pd.read_csv('../data/salaries.csv')

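# Make the work_year attribute compatible: replace "2021e"/"2022e"
# with the plain year and cast the column to integers
for i in ["2021e", "2022e"]:
    sal_remote.loc[(sal_remote.work_year == i), "work_year"] = int(i[:-1])
# astype returns a new DataFrame, so the result must be assigned back
sal_remote = sal_remote.astype({'work_year': 'int64'})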

# Inner-merge each other dataset with the de-duplicated remote salaries,
# keeping only the rows that appear in both
attr = ['work_year', 'experience_level', 'employment_type', 'job_title', 'salary', 'salary_currency',
        'salary_in_usd', 'employee_residence', 'remote_ratio', 'company_location', 'company_size']

remote_unique = sal_remote.drop_duplicates()
sim_ai = pd.merge(sal_ai, remote_unique, on=attr)
sim_infosec = pd.merge(sal_infosec, remote_unique, on=attr)
sim_devops = pd.merge(sal_devops, remote_unique, on=attr)

# Print the fraction of rows in each other dataset
# that are not already present in the remote salary dataset
dic = {"ai_salaries":      1 - sim_ai.shape[0] / sal_ai.shape[0],
       "infosec_salaries": 1 - sim_infosec.shape[0] / sal_infosec.shape[0],
       "devops_salaries":  1 - sim_devops.shape[0] / sal_devops.shape[0]
      }
print(dic)


{'ai_salaries': 0.3906810035842294, 'infosec_salaries': 0.49663299663299665, 'devops_salaries': 0.37883008356545966}
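
As a cross-check on the overlap computation, the same idea can be sketched with a left merge and pandas' `indicator` flag, which marks every row of the other dataset as matched or unmatched. This is a minimal sketch on made-up toy data, not the Foorila files: the helper name `fraction_new_rows`, the three-column key list, and all values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-ins for the real CSV files (all values are invented for illustration)
remote = pd.DataFrame({
    "work_year": ["2020", "2021e", "2021e", "2022e"],
    "job_title": ["Data Engineer", "ML Engineer", "ML Engineer", "Security Analyst"],
    "salary_in_usd": [90000, 120000, 120000, 80000],
})
ai = pd.DataFrame({
    "work_year": [2021, 2021, 2022],
    "job_title": ["ML Engineer", "Research Scientist", "Data Scientist"],
    "salary_in_usd": [120000, 150000, 110000],
})

# Strip the trailing "e" suffix and cast to integers so both key columns share a dtype
remote["work_year"] = remote["work_year"].astype(str).str.rstrip("e").astype("int64")


def fraction_new_rows(other, reference, keys):
    """Fraction of rows in `other` with no exact match in `reference` on `keys`."""
    # De-duplicate the reference so each row of `other` matches at most once
    ref_unique = reference[keys].drop_duplicates()
    merged = other.merge(ref_unique, on=keys, how="left", indicator=True)
    # "_merge" is "left_only" for rows of `other` that are absent from the reference
    return (merged["_merge"] == "left_only").mean()


keys = ["work_year", "job_title", "salary_in_usd"]
print(fraction_new_rows(ai, remote, keys))
```

In this toy example only one of the three `ai` rows also appears in `remote`, so two thirds of the rows count as new. Because the left merge keeps every row of `other`, the denominator is always the size of the other dataset, regardless of duplicates in the reference.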
