# Data Science Salary Analysis Project

This project analyzes a dataset of salaries in the data science and machine learning field. The dataset, *ds_salaries.csv*, describes the employment details of professionals, including job title, salary, employment type, experience level, and more.

The primary goal of this project is to extract valuable insights from the data, such as trends in salaries across different job titles, locations, experience levels, and company sizes. By examining key variables like salary in USD, remote working ratio, and employment type, this analysis aims to identify patterns and provide a comprehensive understanding of the factors influencing salaries in the data science and ML industry.


## About the Dataset

The *Data Science Job Salaries* dataset consists of 11 columns, each representing different aspects of employment and compensation. The columns are as follows:

- **work_year**: The year when the salary was paid.
- **experience_level**: The experience level of the employee in the role during the specified year.
- **employment_type**: The type of employment associated with the role.
- **job_title**: The job title or role the employee held during the year.
- **salary**: The total gross salary received by the employee.
- **salary_currency**: The currency in which the salary is paid, represented by an ISO 4217 currency code.
- **salary_in_usd**: The equivalent salary in USD.
- **employee_residence**: The country where the employee resides, represented by an ISO 3166 country code.
- **remote_ratio**: The proportion of work completed remotely during the year.
- **company_location**: The location of the employer's main office or contracting branch, represented by an ISO 3166 country code.
- **company_size**: The median number of employees at the company during the year.


In [29]:
# Installing the library quietly (without verbose output)
# Provides the ISO databases for the standards
!pip install pycountry -q 


In [2]:
# Importing the pandas library for data manipulation and analysis
import pandas as pd

# Importing libraries
import numpy as np
import pycountry
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [3]:
# Reading the dataset 'ds_salaries.csv' into a pandas DataFrame
data = pd.read_csv("ds_salaries.csv")


In [4]:
# Printing the number of columns and rows in the dataset
print(f"The data set has {data.shape[1]} columns and a total of {data.shape[0]} rows")


The data set has 11 columns and a total of 3755 rows


In [5]:
# Displaying the first five rows of the DataFrame to get an overview of the dataset
data.head()


| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | SE | FT | Principal Data Scientist | 80000 | EUR | 85847 | ES | 100 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30000 | USD | 30000 | US | 100 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25500 | USD | 25500 | US | 100 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |


In [6]:
# Dropping the 'salary' and 'salary_currency' columns from the DataFrame
# 'axis=1' specifies that columns (not rows) are being dropped
# 'inplace=True' applies the change directly to the original DataFrame
data.drop(["salary", "salary_currency"], axis=1, inplace=True)


In [7]:
# Checking for missing values in each column of the DataFrame
data.isnull().sum()


work_year             0
experience_level      0
employment_type       0
job_title             0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

In [8]:
# Identifying duplicate rows in the DataFrame
duplicate_rows = data[data.duplicated()]
num_duplicates = duplicate_rows.shape[0]

# Printing the number of duplicate rows
print(f'Number of duplicate rows: {num_duplicates}')


Number of duplicate rows: 1171
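
The count above is taken after `salary` and `salary_currency` were dropped, so any rows that differed only in those columns now compare as equal. A minimal sketch on made-up rows (the toy values are assumptions, not taken from *ds_salaries.csv*) of how `duplicated` counts repeats and how `drop_duplicates` would remove them if a deduplicated view were wanted:

```python
import pandas as pd

# Toy rows (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "ML Engineer", "Data Scientist"],
    "salary_in_usd": [120000, 120000, 30000, 120000],
    "company_size": ["M", "M", "S", "L"],
})

# duplicated() flags every repeat of an earlier row (the first occurrence is not flagged)
num_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a new DataFrame and leaves the original untouched
deduplicated = toy.drop_duplicates()

print(f"Number of duplicate rows: {num_duplicates}")
print(f"Rows before: {len(toy)}, rows after: {len(deduplicated)}")
```

Whether repeated rows in a salary survey are genuine duplicates or distinct people with identical attributes cannot be decided from the data alone, which is one reason to inspect them before dropping anything.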


In [9]:
# Generate descriptive statistics for the numeric 'salary_in_usd' column
salary_statistics = data['salary_in_usd'].describe()
round(salary_statistics, 2)


count      3755.00
mean     137570.39
std       63055.63
min        5132.00
25%       95000.00
50%      135000.00
75%      175000.00
max      450000.00
Name: salary_in_usd, dtype: float64

In [10]:
# Replacing abbreviated experience levels with full descriptions
data['experience_level'] = data['experience_level'].replace({
    'EN': 'Entry level',
    'MI': 'Mid level',
    'SE': 'Senior level',
    'EX': 'Executive'
})

# Replacing abbreviated employment types with full descriptions
data['employment_type'] = data['employment_type'].replace({
    'CT': 'Contract',
    'FL': 'Freelance',
    'FT': 'Full-time',
    'PT': 'Part-time'
})

# Converting 'remote_ratio' to string and replacing numeric values with descriptions
data['remote_ratio'] = data['remote_ratio'].astype(str).replace({
    '0': 'On-site',
    '50': 'Hybrid',
    '100': 'Remote'
})

# Replacing abbreviated company sizes with full descriptions
data['company_size'] = data['company_size'].replace({
    'S': 'Small',
    'M': 'Medium',
    'L': 'Large'
})


In [11]:
print(f"The data set has {data.shape[1]} columns and a total of {data.shape[0]} rows")
data.head()

The data set has 9 columns and a total of 3755 rows


| | work_year | experience_level | employment_type | job_title | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | Senior level | Full-time | Principal Data Scientist | 85847 | ES | Remote | ES | Large |
| 1 | 2023 | Mid level | Contract | ML Engineer | 30000 | US | Remote | US | Small |
| 2 | 2023 | Mid level | Contract | ML Engineer | 25500 | US | Remote | US | Small |
| 3 | 2023 | Senior level | Full-time | Data Scientist | 175000 | CA | Remote | CA | Medium |
| 4 | 2023 | Senior level | Full-time | Data Scientist | 120000 | CA | Remote | CA | Medium |


# Data Transformation Summary

- The dataset has been transformed to replace abbreviations with full descriptions for several columns:
  - **`experience_level`**: Replaced abbreviated levels (`EN`, `MI`, `SE`, `EX`) with full descriptions (e.g., "Mid level").
  - **`employment_type`**: Employment types (`CT`, `FL`, `FT`, `PT`) are now fully spelled out (e.g., "Full-time").
  - **`remote_ratio`**: Converted numeric values (`0`, `50`, `100`) into strings representing work environments (e.g., "Remote").
  - **`company_size`**: Abbreviations (`S`, `M`, `L`) replaced with full descriptions (e.g., "Large").

### Dataset Overview:
- **Rows and Columns**: The dataset contains `3755` rows and `9` columns.
- **Numerical Values**: `work_year` and `salary_in_usd`. The derived `adjusted_salary_usd` column is added in a later step.
- **Categorical Values**: `experience_level`, `employment_type`, `job_title`, `employee_residence`, `remote_ratio`, `company_location`, and `company_size`. The derived `job_category` column is added in a later step.
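
One property of `replace` worth keeping in mind is that any code missing from the mapping is left unchanged rather than flagged. A minimal sketch, on a made-up column, of a check that lists unmapped values before replacing (the `experience_map` name and the toy values are assumptions for illustration):

```python
import pandas as pd

experience_map = {
    "EN": "Entry level",
    "MI": "Mid level",
    "SE": "Senior level",
    "EX": "Executive",
}

# Toy column with one code ("XX") that the mapping does not cover
levels = pd.Series(["SE", "MI", "XX", "EN"])

# Codes present in the data but absent from the mapping
unmapped = sorted(set(levels.unique()) - set(experience_map))

# replace() keeps unmapped values as they are
replaced = levels.replace(experience_map)

print(unmapped)
print(replaced.tolist())
```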


In [12]:
# Function takes a job title as input and checks if it belongs to one of the predefined lists
def assign_broader_category(job_title):
    data_science = [
        "Principal Data Scientist", "Data Scientist", "Applied Data Scientist", 
        "Research Scientist", "Lead Data Scientist", "Staff Data Scientist", 
        "Data Science Lead", "Data Science Consultant", "Data Science Manager", 
        "Head of Data Science", "Data Scientist Lead", "Director of Data Science", 
        "Data Science Engineer", "Data Science Tech Lead", "Product Data Scientist"
    ]
    
    machine_learning_ai = [
        "ML Engineer", "Applied Scientist", "Machine Learning Engineer", 
        "Applied Machine Learning Engineer", "AI Developer", "Machine Learning Researcher", 
        "Machine Learning Scientist", "MLOps Engineer", "AI Scientist", "AI Programmer", 
        "Machine Learning Software Engineer", "Deep Learning Researcher", "Deep Learning Engineer", 
        "NLP Engineer", "Machine Learning Research Engineer", "Machine Learning Developer", 
        "Principal Machine Learning Engineer", "Lead Machine Learning Engineer", 
        "Head of Machine Learning", "Machine Learning Manager", "Applied Machine Learning Scientist"
    ]
    
    data_engineering_infrastructure = [
        "Data Engineer", "Data Modeler", "Analytics Engineer", "ETL Engineer", 
        "Data DevOps Engineer", "Big Data Engineer", "Lead Data Engineer", 
        "Cloud Data Engineer", "Cloud Database Engineer", "Software Data Engineer", 
        "Data Operations Engineer", "Data Infrastructure Engineer", "ETL Developer", 
        "Big Data Architect", "Cloud Data Architect", "Principal Data Engineer", 
        "Azure Data Engineer", "Marketing Data Engineer"
    ]
    
    analytics = [
        "Data Analyst", "Business Data Analyst", "Staff Data Analyst", "Data Quality Analyst", 
        "Compliance Data Analyst", "Data Analytics Manager", "Data Analytics Specialist", 
        "Data Analytics Engineer", "Data Analytics Consultant", "Insight Analyst", 
        "Data Analytics Lead", "Product Data Analyst", "Marketing Data Analyst", 
        "Financial Data Analyst", "Lead Data Analyst", "Principal Data Analyst", 
        "Data Operations Analyst"
    ]
    
    business_intelligence = [
        "BI Developer", "BI Data Engineer", "Business Intelligence Engineer", 
        "BI Analyst", "BI Data Analyst", "Power BI Developer"
    ]
    
    data_leadership_strategy = [
        "Data Strategist", "Head of Data", "Data Manager", "Manager Data Management", 
        "Data Lead", "Principal Data Architect", "Data Management Specialist"
    ]
    
    research_advanced_techniques = [
        "Research Engineer", "Autonomous Vehicle Technician", "Computer Vision Engineer", 
        "Computer Vision Software Engineer", "3D Computer Vision Researcher"
    ]
    
    others_specialized = [
        "Data Specialist", "Finance Data Analyst"
    ]

    if job_title in data_science:
        return "Data Science"
    elif job_title in machine_learning_ai:
        return "Machine Learning / AI"
    elif job_title in data_engineering_infrastructure:
        return "Data Engineering / Infrastructure"
    elif job_title in analytics:
        return "Analytics"
    elif job_title in business_intelligence:
        return "Business Intelligence (BI)"
    elif job_title in data_leadership_strategy:
        return "Data Leadership / Strategy"
    elif job_title in research_advanced_techniques:
        return "Research / Advanced Techniques"
    elif job_title in others_specialized:
        return "Others / Specialized"
    else:
        return "Other"

# Apply the function to the 'job_title' column and create a new column 'job_category'
data['job_category'] = data['job_title'].apply(assign_broader_category)



# Job Title Categorization

The `assign_broader_category` function is designed to take a specific job title as input and assign it to a broader, predefined job category. These categories include:

- **Data Science**
- **Machine Learning / AI**
- **Data Engineering / Infrastructure**
- **Analytics**
- **Business Intelligence (BI)**
- **Data Leadership / Strategy**
- **Research / Advanced Techniques**
- **Others / Specialized**

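Each category contains a list of job titles that fall under its umbrella. For example, roles such as "Principal Data Scientist" and "Data Science Manager" are categorized under **Data Science**, while titles like "ML Engineer" and "AI Developer" are assigned to the **Machine Learning / AI** category. The lists are checked in the order shown, so a title is assigned to the first category that contains it, and any title not found in these lists falls back to a ninth label, **Other**.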

By applying the function to the `job_title` column, the dataset is enriched with a new `job_category` column that standardizes diverse job titles into these broader categories. This approach simplifies the analysis and comparison of job roles across different fields within data-related professions.
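
The same lookup can also be expressed as a single dictionary built once, which avoids rebuilding the lists on every call. A minimal sketch with only a few titles per category (the `CATEGORY_TITLES`, `TITLE_TO_CATEGORY`, and `lookup_category` names are hypothetical, and the lists are shortened for illustration):

```python
# Shortened, illustrative subset of the category lists
CATEGORY_TITLES = {
    "Data Science": ["Principal Data Scientist", "Data Scientist", "Data Science Manager"],
    "Machine Learning / AI": ["ML Engineer", "AI Developer"],
    "Analytics": ["Data Analyst", "Business Data Analyst"],
}

# Invert to title -> category; setdefault keeps the first category that lists a title,
# matching the first-match behavior of an if/elif chain
TITLE_TO_CATEGORY = {}
for category, titles in CATEGORY_TITLES.items():
    for title in titles:
        TITLE_TO_CATEGORY.setdefault(title, category)


def lookup_category(job_title):
    # Titles missing from every list fall back to "Other"
    return TITLE_TO_CATEGORY.get(job_title, "Other")


print(lookup_category("ML Engineer"))
print(lookup_category("Quantum Data Wrangler"))
```

With a dictionary like this, the column could be filled with `data['job_title'].map(TITLE_TO_CATEGORY).fillna("Other")` instead of `apply`.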


In [13]:
# Function takes a country code (alpha-2) as input and returns the corresponding country name
def get_country_name(code):
    try:
        country = pycountry.countries.get(alpha_2=code.upper())
        return country.name if country else "Unknown code"
    except KeyError:
        return "Unknown code"

data['employee_residence'] = data['employee_residence'].apply(get_country_name)
data['company_location'] = data['company_location'].apply(get_country_name)


# Country Code to Country Name

The `get_country_name` function converts two-letter country codes (ISO alpha-2) into their corresponding country names. Using the `pycountry` library, the function takes an input country code, converts it to uppercase, and returns the full country name. If the code is invalid or unrecognized, it returns "Unknown code."

This function is applied to both the `employee_residence` and `company_location` columns in the dataset, replacing the country codes with readable country names. As a result, the dataset now contains full country names, making geographical data more intuitive and accessible for analysis.
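
Because the columns contain only a limited set of distinct codes, the lookup can be done once per unique code and then mapped back onto the column. A minimal sketch of that pattern follows; `toy_lookup` and `KNOWN_NAMES` are hypothetical stand-ins for `get_country_name` so the example runs without `pycountry`:

```python
import pandas as pd

# Hypothetical stand-in for get_country_name, covering three codes only
KNOWN_NAMES = {"ES": "Spain", "US": "United States", "CA": "Canada"}


def toy_lookup(code):
    return KNOWN_NAMES.get(code.upper(), "Unknown code")


codes = pd.Series(["ES", "US", "US", "CA", "zz", "US"])

# Resolve each distinct code once, then map the results back onto the column
name_by_code = {code: toy_lookup(code) for code in codes.unique()}
names = codes.map(name_by_code)

print(names.tolist())
```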


In [14]:
# inflation rates: https://www.in2013dollars.com/us/inflation/2017?amount=1
inflation_rates = {
    2020: 1.0123,
    2021: 1.047,
    2022: 1.08,
    2023: 1.0412 
}


# Function to adjust a single row's salary by the inflation factor of its work year
def adjust_for_inflation(row):
    year = row['work_year']
    inflation_rate = inflation_rates.get(year)
    if inflation_rate is None:
        raise ValueError(f"Inflation rate for year {year} is not defined.")
    return round(row['salary_in_usd'] * inflation_rate, 2)

# Apply the function row by row (axis=1)
data['adjusted_salary_usd'] = data.apply(adjust_for_inflation, axis=1)


# Salary Adjustment for Inflation

The notebook defines a function, `adjust_for_inflation`, that adjusts salaries using a per-year inflation factor. The factors for 2020, 2021, 2022, and 2023 are predefined and stored in the `inflation_rates` dictionary:

- **2020**: 1.0123
- **2021**: 1.047
- **2022**: 1.08
- **2023**: 1.0412

The function takes the year from each record (`work_year`) and multiplies the salary in USD (`salary_in_usd`) by the corresponding inflation factor, rounding the result to two decimals. If no factor is defined for a given year, it raises a `ValueError`. The result is a new column, `adjusted_salary_usd`. Each salary is scaled only by its own year's factor; the factors are not compounded to a common base year, so the adjusted values are an approximation rather than strict constant-dollar figures.
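
To make the arithmetic concrete: the first row's 2023 salary of 85,847 USD becomes 85,847 × 1.0412 ≈ 89,383.90. If figures in a single base year's prices were wanted instead, one option would be to compound the factors of the following years. The sketch below is an assumption about how that could be done (2023 salaries are left unchanged and earlier years are multiplied by the factor of every later year), not the method used in this notebook; `per_year_adjustment` and `cumulative_factor` are hypothetical names:

```python
inflation_rates = {2020: 1.0123, 2021: 1.047, 2022: 1.08, 2023: 1.0412}


def per_year_adjustment(salary, year):
    # The notebook's approach: scale by the factor of the salary's own year
    return round(salary * inflation_rates[year], 2)


def cumulative_factor(year, base_year=2023):
    # Hypothetical alternative: compound the factors of the years after `year`
    # up to and including `base_year`, so a base-year salary keeps factor 1.0
    factor = 1.0
    for later_year in range(year + 1, base_year + 1):
        factor *= inflation_rates[later_year]
    return factor


print(per_year_adjustment(85847, 2023))
for year in sorted(inflation_rates):
    print(year, round(cumulative_factor(year), 4))
```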


In [15]:
# Displaying the first five rows of the DataFrame to get an overview of the dataset
data.head()


| | work_year | experience_level | employment_type | job_title | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | job_category | adjusted_salary_usd |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | Senior level | Full-time | Principal Data Scientist | 85847 | Spain | Remote | Spain | Large | Data Science | 89383.9 |
| 1 | 2023 | Mid level | Contract | ML Engineer | 30000 | United States | Remote | United States | Small | Machine Learning / AI | 31236.0 |
| 2 | 2023 | Mid level | Contract | ML Engineer | 25500 | United States | Remote | United States | Small | Machine Learning / AI | 26550.6 |
| 3 | 2023 | Senior level | Full-time | Data Scientist | 175000 | Canada | Remote | Canada | Medium | Data Science | 182210.0 |
| 4 | 2023 | Senior level | Full-time | Data Scientist | 120000 | Canada | Remote | Canada | Medium | Data Science | 124944.0 |


In [16]:
# Calculate value counts in percentage and sort
value_counts = data['job_category'].value_counts(normalize=True) * 100
value_counts = value_counts.sort_values(ascending=True)


# Create a horizontal bar chart using Plotly Express
fig = px.bar(
    value_counts, 
    x=value_counts.values,
    y=value_counts.index, 
    orientation='h', 
    labels={'x': 'Percentage', 'index': 'Job Category'},
    title='Job Category Percentage'
)

# Update layout for better readability
fig.update_layout(
    xaxis_title='Percentage',
    yaxis_title='Job Category',
    title_x=0.5,  # Center the title
    height=500,   # Adjust height based on the number of categories
)

# Show the plot
fig.show()


## Job Category Distribution

The horizontal bar chart visualizes the distribution of job categories as percentages, highlighting key trends in the data-driven job market:

- **Data Engineering / Infrastructure** holds the largest share at **32.17%**, reflecting the crucial role of building and maintaining data systems.
- **Data Science** is the second most prominent category, making up **28.47%** of the total. This indicates a strong demand for professionals with advanced analytical capabilities.
- **Analytics**, with **18.46%**, demonstrates the importance of data analysis skills in supporting business decision-making processes.
- **Machine Learning / AI** contributes **13.40%**, showing the increasing reliance on artificial intelligence technologies across industries.
- Less common but still essential roles include **Research / Advanced Techniques** (**1.76%**), **Data Leadership / Strategy** (**1.23%**), and **Business Intelligence (BI)** (**1.15%**), which are focused on innovation, strategy, and operational insights.
- **Others / Specialized** roles account for a small **0.40%**, while the broader **Other** category contributes **2.98%**.

This distribution suggests that the bulk of the demand in data-related job categories lies in engineering, data science, and analytics, while AI and machine learning also play a growing role in the industry.
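
Percentages alone hide how many records stand behind each bar. A minimal sketch, on made-up labels, of computing counts and percentage shares side by side (the toy list is an assumption, not the dataset):

```python
import pandas as pd

# Toy category labels (hypothetical, for illustration only)
categories = pd.Series(
    ["Data Science"] * 5 + ["Analytics"] * 3 + ["Other"] * 2,
    name="job_category",
)

counts = categories.value_counts()
shares = (categories.value_counts(normalize=True) * 100).round(2)

# Combine both views into one table, smallest share first as in the chart
summary = pd.DataFrame({"count": counts, "percent": shares}).sort_values("percent")
print(summary)
```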


In [17]:
# Create a boxplot using Plotly Express
fig = px.box(
    data, 
    x='employment_type', 
    y='adjusted_salary_usd', 
    labels={'employment_type': 'Employment Type', 'adjusted_salary_usd': 'Adjusted Salary'},
    title='Salary Distribution Across Different Employment Types'
)

# Update layout for better readability
fig.update_layout(
    xaxis_title='Employment Type',
    yaxis_title='Adjusted Salary',
    title_x=0.5,  # Center the title
    height=500,   # Adjust height if necessary
)

# Show the plot
fig.show()

for trace in fig.data: 
    print(f"Trace name: {trace.name}") 
    print(f"X values: {trace.x}") 
    print(f"Y values: {trace.y}\n")


Trace name: 
X values: ['Full-time' 'Contract' 'Contract' ... 'Full-time' 'Contract' 'Full-time']
Y values: [ 89383.9   31236.    26550.6  ... 106291.5  101230.    99114.25]



In [18]:
# Create a histogram for Remote Ratio distribution
fig = px.histogram(
    data,
    x='remote_ratio',  # Categorical labels (On-site, Hybrid, Remote), so no bin count is needed
    title="Remote Ratio Distribution",
    labels={'remote_ratio': 'Remote Ratio'},
    color='remote_ratio',
    color_discrete_sequence=px.colors.qualitative.Plotly
)

# Update layout for better readability
fig.update_layout(
    xaxis_title = "Remote Type",
    title_x=0.5,
    height=500
)

# Show the plot
fig.show()



## Remote Work Distribution

The histogram shows the distribution of job roles based on their remote work ratio, highlighting three categories: **On-site (0)**, **Hybrid (50)**, and **Fully Remote (100)**:

- **On-site roles** make up the largest portion, with **1,923 jobs** (around 51%), showing that many companies still prefer or require employees to be physically present in the workplace.
- **Fully Remote roles** are the second most common, accounting for **1,643 jobs** (approximately 44%). This reflects the growing trend of remote work, as more organizations continue to adopt fully remote models.
- **Hybrid roles** are the least common, with only **189 jobs** (about 5%), indicating that while some companies offer a mix of remote and on-site work, it is less common than the fully remote or fully on-site models.

This distribution suggests that while on-site jobs remain the largest group, fully remote roles account for nearly as many positions, and hybrid arrangements are comparatively rare in this dataset.
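
The percentages quoted above follow directly from the counts. A quick check of the arithmetic, using only the figures reported in this section:

```python
# Counts as reported for the three remote work categories
remote_counts = {"On-site": 1923, "Remote": 1643, "Hybrid": 189}

total = sum(remote_counts.values())
shares = {label: round(count / total * 100, 1) for label, count in remote_counts.items()}

print(total)
print(shares)
```

The total matches the 3755 rows of the dataset, so the three categories cover every record.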


In [19]:
# create pivot table
pivot_table = data.pivot_table(values='adjusted_salary_usd', index='job_category', columns='work_year', aggfunc='median')

# Create Heatmap
fig = go.Figure(data=go.Heatmap(
    z=pivot_table.values,
    x=pivot_table.columns,
    y=pivot_table.index,
    colorscale='YlGnBu',
    text=pivot_table.values,
    texttemplate="%{text:.2f}",
    textfont={"size":12}
))

fig.update_layout(
    title='Median Salary by Year',
    xaxis_title='Year',
    yaxis_title='Job Category',
    title_x = 0.5,
    height=700,
   
)

fig.show()


### Median Salary by Year and Job Category
The analysis aims to visualize the median salary trends across various job categories over the years, providing insights into how compensation levels have changed from 2020 to 2023 for each job category.

Here are the key salary trends observed in the heatmap:
- **Analytics**: Salaries have shown a steady increase from 47,334 in 2020 to 111,200 in 2023.
- **Business Intelligence (BI)**: The salary fluctuated, with a dip in 2021 (47,774) and significant growth in 2023 (137,594).
- **Data Engineering / Infrastructure**: A noticeable increase from 73,021 in 2020 to 150,974 in 2023, indicating high demand.
- **Data Leadership / Strategy**: A sharp jump in 2021 (240,810) followed by a slight dip but remaining high in subsequent years.
- **Data Science**: Consistent growth from 79,359 in 2020 to 162,635 in 2023.
- **Machine Learning / AI**: A fluctuation, but overall increasing trend from 129,120 in 2020 to 159,720 in 2023.
- **Other**: Showed substantial increases over the years, reaching 170,496 in 2023.
- **Others / Specialized**: The salaries remained somewhat stable, with a slight decrease in 2023 (109,326).
- **Research / Advanced Techniques**: Highly volatile salaries, dropping to 25,128 in 2021 and recovering to 166,592 by 2023.
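
Medians for small groups can swing sharply from year to year, so it can help to look at how many records sit behind each cell. A minimal sketch on made-up rows (the values are assumptions) that builds the median table alongside a matching count table:

```python
import pandas as pd

# Toy records (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "job_category": ["Analytics", "Analytics", "Analytics", "Data Science", "Data Science"],
    "work_year": [2022, 2022, 2023, 2022, 2023],
    "adjusted_salary_usd": [90000.0, 110000.0, 120000.0, 150000.0, 160000.0],
})

# Median salary per (category, year) cell, as in the heatmap
median_table = toy.pivot_table(
    values="adjusted_salary_usd", index="job_category", columns="work_year", aggfunc="median"
)

# Number of records behind each cell
count_table = toy.pivot_table(
    values="adjusted_salary_usd", index="job_category", columns="work_year", aggfunc="count"
)

print(median_table)
print(count_table)
```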



In [20]:
# Create the scatter plot with Plotly
fig = px.scatter(
    data,
    x='employee_residence',
    y='company_location',
    size='adjusted_salary_usd',  # Bubble size based on adjusted salary
    color='adjusted_salary_usd',  # Color intensity based on adjusted salary
    title='Salary Comparison between Employee Residence and Company Location',
    labels={'employee_residence': 'Employee Residence', 'company_location': 'Company Location'},
    size_max=60,  # Max bubble size
    color_continuous_scale=px.colors.sequential.Viridis,  # Color scale for adjusted salary
)

# Update the layout
fig.update_layout(
    xaxis_tickangle=-90,
    coloraxis_colorbar=dict(title=dict(text="Adjusted Salary")),  # Continuous color uses a colorbar, not a legend
    title_x=0.5,
    height=700,
)

# Show the plot
fig.show()


## Salary Comparison between Employee Residence and Company Location
The goal is to compare salaries based on the **employee's residence** and the **company's location**, using a scatter plot where each point represents the interaction between the two. Salaries are represented by the size and color of the bubbles, giving a visual comparison of salary differences across different locations.

### Scatter Plot Findings
   - The bubbles vary significantly in size and color, indicating wide disparities in salaries based on different combinations of employee residence and company location.
   - **United States** is a prominent location on both axes, with bubbles indicating relatively high salaries (larger bubbles and brighter colors).
   - **India**, **Spain**, and other countries show smaller bubbles, indicating lower salaries in comparison.
   - Some bubbles have the same location for both employee residence and company location, representing employees who live and work in the same country.

### Insights
   - Salaries can vary even for the same company location depending on where the employee resides, and vice versa.
   - Countries like the United States consistently show higher salaries, as indicated by the larger, brighter bubbles.
   - The plot reveals the global nature of employment, where employees may work remotely, living in one country while being employed by a company in another country.
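
With thousands of rows, many bubbles land on the same residence/location pair and overlap. One way to summarize the same relationship is to aggregate each pair first. A minimal sketch on made-up rows (the values and the `median_salary`, `records`, and `same_country` column names are assumptions):

```python
import pandas as pd

# Toy records (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "employee_residence": ["Spain", "Spain", "India", "United States", "United States"],
    "company_location": ["Spain", "United States", "India", "United States", "United States"],
    "adjusted_salary_usd": [80000.0, 120000.0, 40000.0, 150000.0, 170000.0],
})

# One row per (residence, location) pair with the median salary and the number of records
pairs = (
    toy.groupby(["employee_residence", "company_location"], as_index=False)
    .agg(median_salary=("adjusted_salary_usd", "median"), records=("adjusted_salary_usd", "size"))
)

# Flag pairs where the employee lives in the company's country
pairs["same_country"] = pairs["employee_residence"] == pairs["company_location"]

print(pairs)
```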


In [21]:
# Average salary by company_location (Choropleth Map)
avg_salary_by_location = data.groupby('company_location', as_index=False)['adjusted_salary_usd'].mean()

fig1 = px.choropleth(
    avg_salary_by_location,
    locations='company_location',
    locationmode='country names',
    color='adjusted_salary_usd',
    hover_name='company_location',
    color_continuous_scale=px.colors.sequential.Plasma,
    title='Average Salary by Company Location',
    labels={'adjusted_salary_usd': 'Average Adjusted Salary'},
    projection='natural earth'
)

fig1.update_layout(
    height=500,
    title_x=0.5
)


fig1.show()

# Create the bar plot using Plotly
fig2 = px.bar(
    avg_salary_by_location,
    x='company_location',
    y='adjusted_salary_usd',
    title='Average Salary by Company Location (Yearly)',
    labels={'adjusted_salary_usd': 'Average Adjusted Salary (Yearly)', 'company_location': 'Company Location'},
    color='adjusted_salary_usd',  # Color by salary for visual effect
    color_continuous_scale=px.colors.sequential.Plasma
)

# Customize layout
fig2.update_layout(
    xaxis_tickangle=-90,
    height=500,
    title_x=0.5,
)


fig2.show()




In [22]:
# Average salary by employee_residence (Choropleth Map)
avg_salary_by_residence = data.groupby('employee_residence', as_index=False)['adjusted_salary_usd'].mean()

fig1 = px.choropleth(
    avg_salary_by_residence,
    locations='employee_residence',
    locationmode='country names',
    color='adjusted_salary_usd',
    hover_name='employee_residence',
    color_continuous_scale=px.colors.sequential.Plasma,
    title='Average Salary by Employee Residence',
    labels={'adjusted_salary_usd': 'Average Adjusted Salary'},
    projection='natural earth'
)

fig1.update_layout(
    height=500,
    title_x=0.5
)

fig1.show()


# Create the bar plot using Plotly
fig2 = px.bar(
    avg_salary_by_residence,
    x='employee_residence',
    y='adjusted_salary_usd',
    title='Average Salary by Employee Residence (Yearly)',
    labels={'adjusted_salary_usd': 'Average Adjusted Salary (Yearly)', 'employee_residence': 'Employee Residence'},
    color='adjusted_salary_usd',  # Color by salary for visual effect
    color_continuous_scale=px.colors.sequential.Plasma
)

# Customize layout (rotate x-axis labels for readability)
fig2.update_layout(
    xaxis_tickangle=-90,
    title_x=0.5,
    height=500,
)

# Show the plot
fig2.show()



In [23]:
# Filter for remote_ratio of 100 (Full-Remote)
remote = data[data['remote_ratio'] == 'Remote']

# Aggregate by company_location (country name)
country_counts = remote['company_location'].value_counts().reset_index()
country_counts.columns = ['country_name', 'count']

# Color by log10 of the count so the colorbar ticks below line up with 1, 10, 100, 1000
country_counts['log_count'] = np.log10(country_counts['count'])

# Create the choropleth map with a logarithmic color scale
fig = px.choropleth(
    country_counts,
    locations='country_name',
    locationmode='country names',
    color='log_count',
    hover_name='country_name',
    hover_data={'count': True, 'log_count': False},
    color_continuous_scale=px.colors.sequential.Plasma,
    title='Choropleth Map of Full-Remote Company Locations',
    projection='natural earth'
)

# Customize the colorbar to show the original count values
fig.update_coloraxes(colorbar=dict(title='Count (Log Scale)', tickvals=[0, 1, 2, 3], ticktext=['1', '10', '100', '1000']))

fig.update_layout(
    height=500,
    title_x=0.5
)

# Show the map
fig.show()

In [24]:
on_site = data[data['remote_ratio'] == 'On-site']

# Aggregate by company_location (country name)
country_counts = on_site['company_location'].value_counts().reset_index()
country_counts.columns = ['country_name', 'count']

# Color by log10 of the count so the colorbar ticks below line up with 1, 10, 100, 1000
country_counts['log_count'] = np.log10(country_counts['count'])

# Create the choropleth map with a logarithmic color scale
fig = px.choropleth(
    country_counts,
    locations='country_name',
    locationmode='country names',
    color='log_count',
    hover_name='country_name',
    hover_data={'count': True, 'log_count': False},
    color_continuous_scale=px.colors.sequential.Plasma,
    title='Choropleth Map of On-Site Company Locations',
    projection='natural earth'
)

# Customize the colorbar to show the original count values
fig.update_coloraxes(colorbar=dict(title='Count (Log Scale)', tickvals=[0, 1, 2, 3], ticktext=['1', '10', '100', '1000']))

fig.update_layout(
    height=500,
    title_x=0.5
)

# Show the map
fig.show()

In [25]:
# Dropping 'salary_in_usd' without inplace=True, so no error if re-run
altered_data = data.drop(["salary_in_usd"], axis=1)

# Convert categorical variables to numerical codes
categorical_columns = [
    'experience_level', 'employment_type', 'job_title', 'employee_residence',
    'remote_ratio', 'company_location', 'company_size', 'job_category'
]
for column in categorical_columns:
    altered_data[column] = altered_data[column].astype('category').cat.codes

# Calculating the correlation matrix
correlation_matrix = altered_data.corr()

# Creating the heatmap
fig = px.imshow(
    correlation_matrix,
    text_auto=".3f",
    aspect="auto", 
    color_continuous_scale='Viridis',
    title='Correlation Matrix'
)

fig.update_layout(
    height=600,
    width=1000,
    title_x=0.5,
    xaxis_tickangle=-90
)

# Displaying the heatmap
fig.show()


### Insights from the Correlation Matrix

1. **Salary and Job Categories**: There is a weak positive correlation between salary and job category, indicating that job category has a small influence on salary.

2. **Employee and Company Location**: Employee residence and company location have a very strong positive correlation, suggesting that employees tend to work in the same country as their employer's main office.

3. **Experience Level and Job Titles**: The correlation between experience level and job title is weak, implying that while experience may influence job titles, the relationship is not very strong.

4. **Remote Work**: The correlation between salary and remote work is very weak, suggesting that the amount of remote work does not significantly affect salary.

5. **Company Size**: Company size shows a moderate positive correlation with both company location and employee residence, indicating that larger companies may have a broader geographical presence.

In summary, salary is only weakly related to job category, and experience level is only weakly related to job title, while employee and company locations are strongly correlated. Remote work has minimal association with salary, and company size shows moderate relationships with geographical factors.
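
When reading these coefficients, note that `astype('category').cat.codes` numbers the categories in sorted (alphabetical) order, so the integer codes of nominal columns such as `job_title` or `company_location` carry no inherent ranking. For a column with a natural order, such as `experience_level`, the order can be stated explicitly. A minimal sketch on made-up values (the `seniority` list is an assumption about the intended ranking):

```python
import pandas as pd

levels = pd.Series(["Senior level", "Entry level", "Executive", "Mid level"])

# Default behavior: categories are sorted alphabetically before codes are assigned
default_codes = levels.astype("category").cat.codes.tolist()

# Explicit ordering: codes follow seniority instead
seniority = ["Entry level", "Mid level", "Senior level", "Executive"]
ordered_codes = pd.Categorical(levels, categories=seniority, ordered=True).codes.tolist()

print(default_codes)
print(ordered_codes)
```

With the default codes, "Executive" is numbered below "Mid level", which would distort any correlation that is meant to reflect seniority.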
