Daily Challenge: Data Handling And Analysis In Python

Download and import the Data Science Job Salary dataset.
Normalize the ‘salary’ column using Min-Max normalization.
Implement dimensionality reduction on the dataset.
Aggregate data by ‘experience_level’, calculating average and median salaries.

In [3]:
import pandas as pd

datascience_salaries_df = pd.read_csv('datascience_salaries.csv')
datascience_salaries_df

,Unnamed: 0,job_title,job_type,experience_level,location,salary_currency,salary
0,0,Data scientist,Full Time,Senior,New York City,USD,149000
1,2,Data scientist,Full Time,Senior,Boston,USD,120000
2,3,Data scientist,Full Time,Senior,London,USD,68000
3,4,Data scientist,Full Time,Senior,Boston,USD,120000
4,5,Data scientist,Full Time,Senior,New York City,USD,149000
...,...,...,...,...,...,...,...
1166,2243,ML Ops,Full Time,Senior,Toronto,USD,228000
1167,2249,ML Ops,Full Time,Senior,Boston,USD,115000
1168,2250,ML Ops,Full Time,Senior,Delhi,USD,76000
1169,2255,ML Ops,Full Time,Senior,San Francisco,USD,68000
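
The `Unnamed: 0` column looks like a row index that an earlier export wrote into the CSV. A minimal sketch, assuming that is the case, of reading such a file with `index_col=0` so the column becomes the index rather than a feature. The inline CSV text is a hypothetical stand-in built from a few of the rows shown above, not the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for datascience_salaries.csv: the first, unnamed
# column is assumed to be a row index written by an earlier to_csv() call.
csv_text = """,job_title,job_type,experience_level,location,salary_currency,salary
0,Data scientist,Full Time,Senior,New York City,USD,149000
2,Data scientist,Full Time,Senior,Boston,USD,120000
3,Data scientist,Full Time,Senior,London,USD,68000
"""

# index_col=0 tells pandas to use the first column as the index,
# so no 'Unnamed: 0' feature column is created.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample_df.columns.tolist())
```

If the column does carry meaning in the real file, keeping it as a regular column (the notebook's current behavior) is the safer choice.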


In [4]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
datascience_salaries_df['salary_normalized'] = scaler.fit_transform(datascience_salaries_df[['salary']])
datascience_salaries_df

,Unnamed: 0,job_title,job_type,experience_level,location,salary_currency,salary,salary_normalized
0,0,Data scientist,Full Time,Senior,New York City,USD,149000,0.601010
1,2,Data scientist,Full Time,Senior,Boston,USD,120000,0.454545
2,3,Data scientist,Full Time,Senior,London,USD,68000,0.191919
3,4,Data scientist,Full Time,Senior,Boston,USD,120000,0.454545
4,5,Data scientist,Full Time,Senior,New York City,USD,149000,0.601010
...,...,...,...,...,...,...,...,...
1166,2243,ML Ops,Full Time,Senior,Toronto,USD,228000,1.000000
1167,2249,ML Ops,Full Time,Senior,Boston,USD,115000,0.429293
1168,2250,ML Ops,Full Time,Senior,Delhi,USD,76000,0.232323
1169,2255,ML Ops,Full Time,Senior,San Francisco,USD,68000,0.191919
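
With default settings, `MinMaxScaler` maps each value x to (x - min) / (max - min). The sketch below applies that formula by hand to a few salaries. The maximum of 228000 is the row that normalizes to 1.0 above; the minimum of 30000 is inferred from the displayed values and is an assumption, since the output does not show it directly.

```python
import pandas as pd

# Salaries from the rows shown above, plus 30000 as the assumed dataset minimum.
sample = pd.DataFrame({"salary": [30000, 68000, 120000, 149000, 228000]})

s_min = sample["salary"].min()
s_max = sample["salary"].max()

# Min-Max normalization: rescale linearly so that min -> 0 and max -> 1.
sample["salary_normalized"] = (sample["salary"] - s_min) / (s_max - s_min)
print(sample.round(6))
```

Under these assumed bounds, 149000 maps to about 0.601010 and 68000 to about 0.191919, which agrees with the `salary_normalized` column above.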


In [5]:
from sklearn.decomposition import PCA

# Split the columns into numeric and categorical (string) groups
numeric_cols = datascience_salaries_df.select_dtypes(include=['number']).columns
categorical_cols = datascience_salaries_df.select_dtypes(include=['object']).columns

# One-hot encode the categorical columns so PCA receives only numeric/boolean input
df_encoded = pd.get_dummies(datascience_salaries_df, columns=categorical_cols)
print(df_encoded.head())

# Reduce the encoded data to 2 principal components
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(df_encoded)

   Unnamed: 0  salary  salary_normalized  job_title_Big data  \
0           0  149000           0.601010               False   
1           2  120000           0.454545               False   
2           3   68000           0.191919               False   
3           4  120000           0.454545               False   
4           5  149000           0.601010               False   

   job_title_Data analyst  job_title_Data scientist  job_title_ML Ops  \
0                   False                      True             False   
1                   False                      True             False   
2                   False                      True             False   
3                   False                      True             False   
4                   False                      True             False   

   job_title_Machine learning  job_type_Full Time  job_type_Internship  ...  \
0                       False                True                False  ...   
1                 
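
PCA is sensitive to feature scale, so in the cell above the raw `salary` values and the index-like `Unnamed: 0` column are likely to dominate the components. A hedged sketch of one alternative: drop the index-like column, one-hot encode, standardize, then fit PCA and inspect `explained_variance_ratio_`. The `demo_df` frame is a stand-in built from rows displayed earlier, and the choice to standardize is an assumption, not part of the original notebook.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data: the nine rows displayed earlier in this notebook.
demo_df = pd.DataFrame({
    "Unnamed: 0": [0, 2, 3, 4, 5, 2243, 2249, 2250, 2255],
    "job_title": ["Data scientist"] * 5 + ["ML Ops"] * 4,
    "job_type": ["Full Time"] * 9,
    "experience_level": ["Senior"] * 9,
    "location": ["New York City", "Boston", "London", "Boston", "New York City",
                 "Toronto", "Boston", "Delhi", "San Francisco"],
    "salary_currency": ["USD"] * 9,
    "salary": [149000, 120000, 68000, 120000, 149000,
               228000, 115000, 76000, 68000],
})

# Drop the index-like column so it does not act as a feature.
features = demo_df.drop(columns=["Unnamed: 0"])

# One-hot encode every non-numeric column as 0.0/1.0 floats.
categorical_cols = features.select_dtypes(exclude=["number"]).columns
encoded = pd.get_dummies(features, columns=list(categorical_cols), dtype=float)

# Standardize so salary (tens of thousands) and the 0/1 dummies share a scale.
scaled = StandardScaler().fit_transform(encoded)

pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

print(reduced.shape)
print(pca.explained_variance_ratio_)
```

`salary_normalized` is left out here because it is a linear rescaling of `salary`; after standardization the two columns would be identical.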

In [6]:
agg_data = datascience_salaries_df.groupby('experience_level')['salary'].agg(['mean', 'median']).reset_index()
agg_data

,experience_level,mean,median
0,Entry,36111.111111,30000.0
1,Executive,76076.923077,46000.0
2,Mid,51786.885246,51000.0
3,Senior,75088.033012,68000.0
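
The Executive mean (about 76077) sits well above its median (46000), which suggests a skewed or small group. A minimal sketch of reporting the group size next to the two statistics with named aggregation. The `demo` frame and the output column names `mean_salary`, `median_salary`, and `n` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; not the real dataset.
demo = pd.DataFrame({
    "experience_level": ["Entry", "Entry", "Mid", "Mid",
                         "Senior", "Senior", "Senior"],
    "salary": [30000, 40000, 50000, 52000, 68000, 120000, 149000],
})

# Named aggregation: each keyword becomes an output column.
summary = (
    demo.groupby("experience_level")["salary"]
    .agg(mean_salary="mean", median_salary="median", n="count")
    .reset_index()
)
print(summary)
```

Seeing `n` alongside the mean and median makes it easier to judge how much weight a group's statistics deserve.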
