<img src="https://www.boston.com/wp-content/uploads/2017/03/1118800c-2a31-11e3-9705-17b312947581.jpg" width="900" align="center">

# <p style="background-color:#6b5b95; font-family:newtimeroman;color:#FFF9ED;font-size:220%; text-align:center; border-radius: 15px 55px;">💸DS Salaries | Advanced EDA & Prediction 💰</p>

<div style="border-radius:10px;
            border : black solid;
            background-color: #5b9aa0;
            font-size:100%;
            text-align: left">
    
<h2 style='; border:0; border-radius: 15px; font-weight: bold; font-size:220%; color:white'><center> ✍✍ Purpose of the Project ✍✍</center></h2>  

<b> Visual elements often tell a more meaningful story than raw numbers or words, which is why exploratory data analysis is one of the most important ways to examine a dataset. In this project, we explore the hidden patterns in the dataset and extract information from them. Instead of relying heavily on long-established libraries such as Matplotlib and Seaborn, we focus on interactive visualization tools such as Plotly and Altair. Now, let's dive in and uncover the secrets of Data Science Salaries! </b>

# <p style="background-color:#6b5b95; font-family:newtimeroman;color:#FFF9ED; font-size:150%; text-align:center; border-radius: 15px 50px;"> ✨ Business Problem ✨</p>

<strong> Before diving into technical aspects of ML projects, the first step is always defining the business (or scientific) problem related to data. </strong>

Here, we have a dataset that displays data science salaries with 11 columns. 

Our goal is to predict an employee's salary from the other variables in the dataset. To predict it well, we need some <b>domain knowledge</b>, which is covered in the following sections. 

### Additional Note: 

<blockquote> In data science, the term domain knowledge is used to refer to the general background knowledge of the field or environment to which the methods of data science are being applied. </blockquote>

Even though domain knowledge is a vital part of data science alongside computer science and machine learning/statistics, novices often overlook it simply because they are unaware of it. 

Imagine a data scientist employed in banking, defense, telecom or any other industry without knowing the bare bones of that field. How far do you think they could get in that position under these circumstances? Not far, I guess. 

As for how to obtain domain knowledge: choose a field you are interested in (for example sports analytics, neuroscience or autonomous systems), learn its basics gradually, and eventually develop projects in it.

<b> For further reading: </b> <br>
https://www.indeed.com/career-advice/career-development/what-is-domain-knowledge#:~:text=Domain%20knowledge%20is%20the%20understanding,or%20specializations%20in%20an%20industry.

# <p style="background-color:#6b5b95; font-family:newtimeroman;color:#FFF9ED; font-size:150%; text-align:center; border-radius: 15px 50px;"> ⚛ Dataset Explanation ⚛</p>

<div style="border-radius:10px;
            border : black solid;
            background-color: #5b9aa0;
            font-size:100%;
            text-align: left">
    
<h2 style='; border:0; border-radius: 15px; font-weight: bold; font-size:220%; color:white'><center> Explanation of the Variables </center></h2> 
    
* ****work_year:**** The year the salary was paid.
* ****experience_level:**** The experience level in the job during the year
* ****employment_type:**** The type of employment for the role
* ****job_title:**** The role worked in during the year.
* ****salary:**** The total gross salary amount paid.
* ****salary_currency:**** The currency of the salary paid as an ISO 4217 currency code.
* ****salary_in_usd:**** The salary in USD
* ****employee_residence:**** Employee's primary country of residence during the work year as an ISO 3166 country code.
* ****remote_ratio:**** The overall amount of work done remotely
* ****company_location:**** The country of the employer's main office or contracting branch
* ****company_size:**** The median number of people that worked for the company during the year

# Notebook Content

<a id = "1"></a>
# <p style="background-color:#6b5b95; font-family:newtimeroman;color:#FFF9ED; font-size:150%; text-align:center; border-radius: 15px 50px;"> ☀ Import Libraries ☀</p>

In [None]:
!pip install country_converter

In [None]:
# Classic Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Advanced Visualization Libraries
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected = True) #enables plotly plots to be displayed in notebook
cmap1 = "gist_gray"
import altair as alt
import country_converter as coco
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

#Models
from lightgbm import LGBMClassifier
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

#Metrics, Preprocessing and Tuning Tools
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
import missingno as msno
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

#Customization
import warnings
warnings.filterwarnings("ignore")

<a id = "2"></a>
# <p style="background-color:#6b5b95; font-family:newtimeroman;color:#FFF9ED; font-size:150%; text-align:center; border-radius: 15px 50px;"> ⇣ Load and Check Data ⇣</p>

In [None]:
salaries = pd.read_csv('/kaggle/input/data-science-salaries-2023/ds_salaries.csv')
df = salaries.copy()

In [None]:
df.head()

In [None]:
def check_data(df):
    print(80 * "*")
    print('DIMENSION: ({}, {})'.format(df.shape[0], df.shape[1]))
    print(80 * "*")
    print("COLUMNS:\n")
    print(df.columns.values)
    print(80 * "*")
    print("DATA INFO:\n")
    print(df.dtypes)
    print(80 * "*")
    print("MISSING VALUES:\n")
    print(df.isnull().sum())
    print(80 * "*")
    print("NUMBER OF UNIQUE VALUES:\n")
    print(df.nunique())
    
check_data(df)

In [None]:
def grab_col_names(dataframe, cat_th=10, car_th=20):
    
    # cat_cols, cat_but_car
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    # num_cols
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    print(" RESULT ".center(50, "-"))
    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f'cat_cols: {len(cat_cols)}')
    print(f'num_cols: {len(num_cols)}')
    print(f'cat_but_car: {len(cat_but_car)}')
    print(f'num_but_cat: {len(num_but_cat)}')
    print("".center(50, "-"))
    
    return cat_cols, num_cols, cat_but_car

cat_cols, num_cols, cat_but_car = grab_col_names(df)
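
To make the two thresholds easier to follow, here is a minimal sketch on a made-up four-column frame. The helper `split_columns` and the toy data are hypothetical and only mirror the rule above; as an assumption, the sketch treats every non-numeric column as text-like instead of checking for the `"O"` dtype, so that it does not depend on how a given pandas version stores strings.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def split_columns(frame, cat_th=10, car_th=20):
    """Hypothetical compact version of the column-typing rule used above."""
    is_text = {c: not is_numeric_dtype(frame[c]) for c in frame.columns}
    n_unique = frame.nunique()
    # Numeric columns with few distinct values behave like categories
    num_but_cat = [c for c in frame.columns if not is_text[c] and n_unique[c] < cat_th]
    # Text columns with many distinct values are high-cardinality
    cat_but_car = [c for c in frame.columns if is_text[c] and n_unique[c] > car_th]
    cat_cols = [c for c in frame.columns
                if (is_text[c] or c in num_but_cat) and c not in cat_but_car]
    num_cols = [c for c in frame.columns if not is_text[c] and c not in num_but_cat]
    return cat_cols, num_cols, cat_but_car

# Made-up data: 30 rows, one column for each of the four possible outcomes
toy = pd.DataFrame({
    "level": ["EN", "MI", "SE"] * 10,                 # text, 3 labels
    "year": [2021, 2022, 2023] * 10,                  # numeric, 3 values
    "title": [f"job_{i}" for i in range(30)],         # text, 30 labels
    "pay": [50_000 + 1_000 * i for i in range(30)],   # numeric, 30 values
})

cat_cols_toy, num_cols_toy, high_card_toy = split_columns(toy)
print("categorical:", cat_cols_toy)
print("numerical:", num_cols_toy)
print("high cardinality:", high_card_toy)
```

With the default thresholds, `year` ends up among the categorical columns even though it is stored as a number, while `title` is set aside as high-cardinality.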

In [None]:
def descriptive_stats(df):
    desc = df.describe().T  # one row per variable, one column per statistic
    f, ax = plt.subplots(figsize = (18, 8))
    sns.heatmap(desc,
               annot = True,
               cmap = cmap1,
               fmt = ".2f",
               ax = ax,
               linecolor = "black",
               linewidths = 1.5,
               cbar = False,
               annot_kws = {"size" : 15})
    plt.xticks(size = 15)
    plt.yticks(size = 15, rotation = 0)
    plt.title("Descriptive Statistics", size = 15)
    plt.show()
    
   
descriptive_stats(df[num_cols])

<div style="border-radius:10px;
            border : black solid;
            background-color: #5b9aa0;
            font-size:100%;
            text-align: left">
    
<h2 style='; border:0; border-radius: 15px; font-weight: bold; font-size:220%; color:white'><center> Summary of the Dataset </center></h2> 
    
 * <b> The dataset consists of 3755 rows and 11 columns </b>
 * <b> The target variable is salary_in_usd </b>
 * <b> We have 6 categorical and 2 numerical variables </b>
 * <b> We have 3 variables with high cardinality: they are technically categorical, but they have so many distinct labels that encoding them can increase computation time drastically </b>
 * <b> There are technically no missing values </b>
 * <b> Descriptive statistics show that some features have outliers </b>
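
One common way to tame high-cardinality variables is to merge rare labels into a single bucket before encoding. The sketch below is only an illustration of that idea, not something this notebook applies: the helper `group_rare_labels`, the 5% threshold, the `"Other"` label and the toy job titles are all assumptions.

```python
import pandas as pd

def group_rare_labels(series, min_share=0.05, other_label="Other"):
    """Replace labels whose relative frequency is below min_share."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep frequent labels as they are, overwrite the rare ones
    return series.where(~series.isin(rare), other_label)

# Made-up sample of 100 job titles, two of which appear only once
titles = pd.Series(
    ["Data Scientist"] * 50
    + ["Data Engineer"] * 40
    + ["Data Analyst"] * 8
    + ["ML Researcher"]
    + ["BI Developer"]
)

grouped = group_rare_labels(titles)
print(grouped.value_counts())
```

The threshold is a trade-off: a higher value gives fewer encoded columns but hides more of the detail in the rare job titles.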

<a id = "3"></a>
# <p style="background-color:#6b5b95; font-family:newtimeroman;color:#FFF9ED; font-size:150%; text-align:center; border-radius: 15px 50px;"> ⚡ Exploratory Data Analysis (EDA) ⚡</p>

In [None]:
def tar_var_summary(df, target):
    fig = go.Figure()
    fig.add_trace(go.Violin(x=df[target], line_color='#6C9BCF', name=target, y0=0))
    fig.update_traces(orientation='h', side='positive', meanline_visible=False)
    fig.update_layout(title={'text': "Distribution of the Target Variable",
                             'y':0.9,
                             'x':0.5,
                             'xanchor':'center',
                             'yanchor':'top'},
                             xaxis=dict(title=target),
                             template = 'plotly_dark')
    fig.show()
    
tar_var_summary(df, "salary_in_usd")

In [None]:
def num_var_summary(df, num_var):
    fig = make_subplots(rows = 1, cols = 2,
                       subplot_titles = ("Quantiles", "Distribution"))
    
    fig.add_trace(go.Box(y = df[num_var],
                         name = str(num_var),
                         showlegend = False,
                         marker_color = "#A6D0DD"), 
                         row = 1, col = 1)
    
    fig.add_trace(go.Histogram(x = df[num_var],
                               xbins = dict(start = df[num_var].min(),
                                            end = df[num_var].max()),
                               showlegend = False,
                               name = str(num_var),
                               marker=dict(color="#0A4D68",
                                           line = dict(color = '#DBE6EC',
                                                       width = 1))
                              ),
                  row = 1, col = 2)
    
    fig.update_layout(title={'text': num_var.capitalize(),
                         'y':0.9,
                         'x':0.5,
                         'xanchor': 'center',
                         'yanchor': 'top'},
                  template='plotly_dark')
    
    iplot(fig)

In [None]:
num_var_summary(df, "salary_in_usd")

In [None]:
def cat_var_summary(df, cat_var):
    colors = ['#a2b9bc', '#6b5b95', '#b2ad7f', '#feb236', '#b5e7a0', '#878f99',
              '#d64161', '#86af49', '#ff7b25']
    
    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=('Countplot', 'Percentages'),
                        specs=[[{"type": "xy"}, {'type': 'domain'}]])
    
    x = [str(i) for i in df[cat_var].value_counts().index]
    y = df[cat_var].value_counts().values.tolist()
    
    fig.add_trace(go.Bar(x = x, y = y, text = y, 
                         textposition = "auto",
                       showlegend = False,
                        marker=dict(color=colors,
                              line = dict(color = 'black',
                                          width = 2))), row=1, col=1)
    
    fig.add_trace(go.Pie(labels = df[cat_var].value_counts().keys(),
                         values = df[cat_var].value_counts().values, 
                         hoverinfo ='label',
                  textinfo ='percent',
                  textfont_size = 20,
                  textposition ='auto',
                  marker=dict(colors=colors,
                              line = dict(color = 'black',
                                          width = 2))), row=1, col=2)

    
    fig.update_layout(title={'text': cat_var,
                         'y':0.9,
                         'x':0.5,
                         'xanchor': 'center',
                         'yanchor': 'top'},
                  template='plotly_dark')
    
    iplot(fig)

In [None]:
cat_cols

In [None]:
for i in ["experience_level", "company_size", "work_year", "remote_ratio"]:
    cat_var_summary(df, i)

In [None]:
alt.Chart(df).mark_bar().encode(
    color=alt.Color('salary_in_usd', bin=True),
    x='count()',
    y='employment_type'
)

In [None]:
alt.Chart(df).mark_bar().encode(
    x='count()',
    y='salary_currency'
)

In [None]:
def df_corr(df):
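    plt.figure(figsize = (12,10))
    # Correlation is only defined for numeric columns, so drop the text columns first
    corr = df.select_dtypes(include = "number").corr()
    matrix = np.triu(corr)  # mask the upper triangle and the diagonal to avoid duplicate cells
    sns.heatmap(corr, annot = True, mask = matrix, cmap = "gist_gray")
    plt.show()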

In [None]:
df_corr(df)

In [None]:
def detect_outliers(df, num_var):
    
    trace0 = go.Box(
        y = df[num_var],
        name = "All Points",
        jitter = 0.3,
        pointpos = -1.8,
        boxpoints = 'all',
        marker = dict(
            color = '#a2b9bc'),
        line = dict(
            color = '#6b5b95')
    )

    trace1 = go.Box(
        y = df[num_var],
        name = "Only Whiskers",
        boxpoints = False,
        marker = dict(
            color = '#b2ad7f'),
        line = dict(
            color = '#feb236')
    )

    trace2 = go.Box(
        y = df[num_var],
        name = "Suspected Outliers",
        boxpoints = 'suspectedoutliers',
        marker = dict(
            color = '#b5e7a0',
            outliercolor = '#878f99',
            line = dict(
                outliercolor = '#d64161',
                outlierwidth = 2)),
        line = dict(
            color = '#86af49')
    )

    trace3 = go.Box(
        y = df[num_var],
        name = "Whiskers and Outliers",
        boxpoints = 'outliers',
        marker = dict(
            color = '#6b5b95'),
        line = dict(
            color = '#ff7b25')
    )

    data = [trace0,trace1,trace2,trace3]

    # Single layout: centered title, dark template, y-axis labelled with the variable
    layout = go.Layout(title={'text': "{} Outliers".format(num_var),
                         'y':0.9,
                         'x':0.5,
                         'xanchor':'center',
                         'yanchor':'top'},
                         yaxis=dict(title=num_var),
                         template = 'plotly_dark')

    fig = go.Figure(data=data,layout=layout)
    
    iplot(fig)

In [None]:
detect_outliers(df, "salary_in_usd")
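
The box plots flag outliers visually. To get the same idea as numbers, a minimal sketch of the usual 1.5 × IQR rule is shown below. The helper `iqr_bounds` and the toy salaries are made up for illustration, and the quartiles come from pandas' default interpolation, which may differ slightly from the quartile method Plotly uses for its whiskers.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the lower and upper fences of the k * IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up salaries in USD with one obviously extreme value
toy_salaries = pd.Series([40, 50, 60, 70, 80, 90, 100, 110, 400]) * 1000

low, high = iqr_bounds(toy_salaries)
outliers = toy_salaries[(toy_salaries < low) | (toy_salaries > high)]
print(f"fences: [{low:.0f}, {high:.0f}]")
print("outliers:", outliers.tolist())
```

On the real data the same function could be called as `iqr_bounds(df["salary_in_usd"])`; whether the flagged rows should be removed, capped or kept is a modelling decision that the rule itself does not answer.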

In [None]:
def display_topn_cat_val(df, feature, n=5):
    topn = df[feature].value_counts()[:n]
    fig = px.bar(y = topn.values, x = topn.index, text = topn.values)
    fig.update_layout(title={'text': 'Top {} values of {}'.format(n, feature),
                             'y':0.9,
                             'x':0.5,
                             'xanchor':'center',
                             'yanchor':'top'},
                             xaxis=dict(title=feature),
                             yaxis=dict(title='Count'),
                             template = 'plotly_dark')
    fig.show()

In [None]:
display_topn_cat_val(df, 'job_title', n=10)

In [None]:
display_topn_cat_val(df, 'salary_currency')

In [None]:
df['experience_level'] = df['experience_level'].replace('EN','Entry-level/Junior')
df['experience_level'] = df['experience_level'].replace('MI','Mid-level/Intermediate')
df['experience_level'] = df['experience_level'].replace('SE','Senior-level/Expert')
df['experience_level'] = df['experience_level'].replace('EX','Executive-level/Director')

ex_level = df['experience_level'].value_counts()
fig = px.treemap(ex_level, path = [ex_level.index], values = ex_level.values, 
                title = 'Experience Level')
fig.update_layout(title={'text': "Experience Level Frequency",
                             'y':0.9,
                             'x':0.5,
                             'xanchor':'center',
                             'yanchor':'top'},
                             barmode='overlay',
                             yaxis=dict(title='Count'),
                             template = 'plotly_dark')
fig.show()

In [None]:
remote_year = df.groupby(['work_year','remote_ratio']).size()
ratio_2020 = np.round(remote_year[2020].values/remote_year[2020].values.sum(),2)
ratio_2021 = np.round(remote_year[2021].values/remote_year[2021].values.sum(),2)
ratio_2022 = np.round(remote_year[2022].values/remote_year[2022].values.sum(),2)
ratio_2023 = np.round(remote_year[2023].values/remote_year[2023].values.sum(),2)

fig = go.Figure()
categories = ['No Remote Work', 'Partially Remote', 'Fully Remote']
fig.add_trace(go.Scatterpolar(
            r = ratio_2020, theta = categories, 
            fill = 'toself', name = '2020 remote ratio'))

fig.add_trace(go.Scatterpolar(
            r = ratio_2021, theta = categories,
            fill = 'toself', name = '2021 remote ratio'))

fig.add_trace(go.Scatterpolar(
            r = ratio_2022, theta = categories,
            fill = 'toself', name = '2022 remote ratio'))

fig.add_trace(go.Scatterpolar(
            r = ratio_2023, theta = categories,
            fill = 'toself', name = '2023 remote ratio'))
fig.update_layout(title={'text': "Remote Ratio by Work Year",
                             'y':0.9,
                             'x':0.5,
                             'xanchor':'center',
                             'yanchor':'top'},
                             barmode='overlay',
                             yaxis=dict(title='Count'),
                             template = 'plotly_dark')
fig.show()
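
The cell above relies on every year containing all three remote ratios in the same order. As a hedged alternative, the sketch below computes the same per-year shares with `pd.crosstab`, which keeps a column for each ratio even when a year has no rows for it. The toy frame is made up; only the column names follow the dataset.

```python
import pandas as pd

# Made-up rows: 2021 has no partially remote (50) employees
toy = pd.DataFrame({
    "work_year":    [2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021],
    "remote_ratio": [0,    50,   100,  100,  100,  100,  100,  0],
})

shares = (
    pd.crosstab(toy["work_year"], toy["remote_ratio"], normalize="index")
      .reindex(columns=[0, 50, 100], fill_value=0.0)  # fixed column order
      .round(2)
)
shares.columns = ["No Remote Work", "Partially Remote", "Fully Remote"]
print(shares)
```

Each row of `shares` can then be passed as `r` to `go.Scatterpolar`, with `shares.columns` as `theta`, so labels and values cannot drift apart.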

In [None]:
exp_level = df.groupby(['work_year','experience_level']).size()
ratio_2020 = np.round(exp_level[2020].values/exp_level[2020].values.sum(),2)
ratio_2021 = np.round(exp_level[2021].values/exp_level[2021].values.sum(),2)
ratio_2022 = np.round(exp_level[2022].values/exp_level[2022].values.sum(),2)
ratio_2023 = np.round(exp_level[2023].values/exp_level[2023].values.sum(),2)

fig = go.Figure()
# groupby sorts the labels alphabetically, so the axis labels must follow that order
categories = ['Entry-level/Junior', 'Executive-level/Director', 
              'Mid-level/Intermediate', 'Senior-level/Expert']
fig.add_trace(go.Scatterpolar(
            r = ratio_2020, theta = categories, 
            fill = 'toself', name = '2020 exp-level ratio'))

fig.add_trace(go.Scatterpolar(
            r = ratio_2021, theta = categories,
            fill = 'toself', name = '2021 exp-level ratio'))

fig.add_trace(go.Scatterpolar(
            r = ratio_2022, theta = categories,
            fill = 'toself', name = '2022 exp-level ratio'))

fig.add_trace(go.Scatterpolar(
            r = ratio_2023, theta = categories,
            fill = 'toself', name = '2023 exp-level ratio'))
fig.update_layout(title={'text': "Experience Level Ratio by Work Year",
                             'y':0.9,
                             'x':0.5,
                             'xanchor':'center',
                             'yanchor':'top'},
                             barmode='overlay',
                             yaxis=dict(title='Count'),
                             template = 'plotly_dark')
fig.show()

In [None]:
def disp_avg_score(df, feature, target):
    # Average the numeric feature within each group of the categorical variable
    # (selecting the column first avoids averaging non-numeric columns)
    exp_avg = df.groupby(target)[feature].mean().reset_index()

    trace1 = go.Bar(x = exp_avg[target],
                    y = exp_avg[feature],
                    name = 'Salary in USD',
                    marker = dict(color ='#A6D0DD',
                                  opacity = 0.7))

    layout = go.Layout(title = 'Average {} by {}'.format(feature, target),
                       barmode = 'stack',
                       xaxis = dict(title=target),
                       yaxis =dict(title='Salary ($)'),
                       template = 'plotly_dark')

    fig = go.Figure(data = [trace1], layout=layout)
    iplot(fig)
    
disp_avg_score(df, "salary_in_usd", "experience_level")

In [None]:
df['employment_type'] = df['employment_type'].replace('CT','Contractor')
df['employment_type'] = df['employment_type'].replace('FL','Freelancer')
df['employment_type'] = df['employment_type'].replace('FT','Full-time')
df['employment_type'] = df['employment_type'].replace('PT','Part-time')
disp_avg_score(df, "salary_in_usd", "employment_type")

In [None]:
df['company_size'] = df['company_size'].replace({
    'S': 'Small',
    'M': 'Medium',
    'L' : 'Large',
})
disp_avg_score(df, "salary_in_usd", "company_size")
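
Because salaries are skewed and contain outliers, the group mean can be pulled up by a few very high values. A small sketch that puts the mean next to the median and the group size is shown below; the toy numbers are made up and only the column names follow the dataset.

```python
import pandas as pd

# Made-up salaries: one extreme value in the "Large" group
toy = pd.DataFrame({
    "company_size": ["Small", "Small", "Small", "Large", "Large", "Large"],
    "salary_in_usd": [50_000, 60_000, 70_000, 90_000, 100_000, 410_000],
})

summary = (
    toy.groupby("company_size")["salary_in_usd"]
       .agg(["mean", "median", "count"])
       .sort_values("median", ascending=False)
)
print(summary)
```

When the mean and the median of a group differ a lot, as for `Large` here, the bar chart of averages should be read with care.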

In [None]:
country = coco.convert(names = df['employee_residence'], to = "ISO3")
df['employee_residence'] = country

residence = df['employee_residence'].value_counts()
fig = px.choropleth(locations = residence.index,
                    color = residence.values,
                    color_continuous_scale=px.colors.sequential.YlGn,
                    title = 'Employee Location On Map')
fig.show()

In [None]:
text = df['job_title'].values
text = ' '.join(text)

wc = WordCloud(background_color = "black", width = 1200, height = 600,
               contour_width = 0, contour_color = "#410F01", max_words = 1000,
               scale = 1, collocations = False, repeat = True, min_font_size = 1)

wc.generate(text)

plt.figure(figsize = [15, 7])
plt.title("Top Words in the Text")
plt.imshow(wc)
plt.axis("off")
plt.show()
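
A word cloud shows which words dominate, but not by how much. As a numeric companion, the sketch below counts words with `collections.Counter` from the standard library. The short list of job titles is made up; on the real data it could be replaced by `df['job_title']`.

```python
from collections import Counter

# Made-up job titles standing in for df['job_title']
job_titles = [
    "Data Scientist",
    "Data Engineer",
    "Data Analyst",
    "Machine Learning Engineer",
    "Data Engineer",
]

# Split every title on whitespace and count each word once per occurrence
word_counts = Counter(word for title in job_titles for word in title.split())

for word, count in word_counts.most_common(3):
    print(f"{word}: {count}")
```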