In [1]:
# Imports
import numpy as np
import pandas as pd
from IPython.display import HTML
from typing import Dict, List
import plotly.express as px

_COLOR_SCHEME = ["#adaee3", "#acbfef", "#a6d2f8", "#94e1ff",
                 "#8df1ff", "#b3deb5", "#cee6b3", "#e3eb90",
                 "#fff7ad", "#ffe597", "#ffd494", "#ffb8a2",
                 "#ffbcbc", "#ffc8dc", "#e8bbf0", "#dfb2ff",
                 "#cccccc"]


_CALLOUT_PREFIX = "<div class='callout green-callout'><h4 style='font-family: verdana; color: #444444; font-size:110%;'>"
# CSS Style
def css_styling():
    styles = """
    * {
        margin: 0;
        padding: 0;
        box-sizing: border-box;
    }

    .callout {
        width: 80%;
        margin: 20px auto;
        padding: 30px;
        position: relative;
        border-radius: 5px;
        box-shadow: 0 0 15px 5px rgba(255, 255, 255, 0.65);
    }
    .green-callout {
        background-color: #d3efdc;
        border-left: 5px solid #9dc2a9;
        border-right: 5px solid #9dc2a9;
    }
    """
    # HTML('<style>{}</style>'.format(styles))
    return HTML("<style>"+styles+"</style>")

css_styling()


In [2]:
# Global variables
_WIDTH = 900
_YEARS = [2018, 2019, 2020, 2021, 2022]
_FILEPATHS = ["/kaggle/input/kaggle-survey-2018/multipleChoiceResponses.csv",
              "/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv",
              "/kaggle/input/kaggle-survey-2020/kaggle_survey_2020_responses.csv",
              "/kaggle/input/kaggle-survey-2021/kaggle_survey_2021_responses.csv",
              "/kaggle/input/kaggle-survey-2022/kaggle_survey_2022_responses.csv"]
_COUNTRIES = ["Japan", "United States of America", "China", "India"]

In [3]:
df = pd.read_csv("/kaggle/input/kaggle-survey-2018/multipleChoiceResponses.csv")


In [4]:
df.head(2)

Unnamed: 0,Time from Start to Finish (seconds),Q1,Q1_OTHER_TEXT,Q2,Q3,Q4,Q5,Q6,Q6_OTHER_TEXT,Q7,...,Q49_OTHER_TEXT,Q50_Part_1,Q50_Part_2,Q50_Part_3,Q50_Part_4,Q50_Part_5,Q50_Part_6,Q50_Part_7,Q50_Part_8,Q50_OTHER_TEXT
0,Duration (in seconds),What is your gender? - Selected Choice,What is your gender? - Prefer to self-describe...,What is your age (# years)?,In which country do you currently reside?,What is the highest level of formal education ...,Which best describes your undergraduate major?...,Select the title most similar to your current ...,Select the title most similar to your current ...,In what industry is your current employer/cont...,...,What tools and methods do you use to make your...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...,What barriers prevent you from making your wor...
1,710,Female,-1,45-49,United States of America,Doctoral degree,Other,Consultant,-1,Other,...,-1,,,,,,,,,-1


In [5]:
def keep_employees_in_2022df(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df["Q5"] == "No"]
    df = df[~df["Q23"].str.contains(
            "not employed", case=False, na=False)]
    return df

In [6]:
_COUNTRIES

['Japan', 'United States of America', 'China', 'India']

In [7]:
title_prefix = "<h1 style = 'font-size: 3em'>"

title_str = "How is India doing? "

HTML(f"{title_prefix}{title_str}</h1>")

India has made significant progress in the field of data science in recent years. The country has a large pool of talented data scientists and data analysts who are working on solving complex problems using data-driven approaches.

India is also home to a number of leading educational institutions that offer data science courses and programs, including the Indian Institutes of Technology (IITs) and the Indian Institutes of Management (IIMs). Many of these institutions have also established research centers and collaborations with industry to foster innovation and drive advancements in data science.

The Indian government has also recognized the importance of data science and has launched initiatives to promote its growth, including the National Data and Analytics Platform (NDAP) and the National Programme on AI (NPAI).

Additionally, India has a thriving tech industry with a number of startups and established companies that are leveraging data science to develop innovative products and services. Many global tech giants, including Google, Microsoft, and Amazon, have also established research and development centers in India to tap into the country's talent pool and research capabilities.

Overall, India is making significant strides in the field of data science and is poised to become a major player in this rapidly growing industry.

In this notebook, we will explore data from yearly surveys, collected between 2018 and 2022 and answered by tech workers (e.g. software engineers, machine learning engineers, data scientists, statisticians, managers, etc.) working in selected countries, to understand the differences and similarities of these professionals across countries. Although this analysis covers only tech workers, we believe that an understanding of the tech landscape could shed some light on how Japan is doing in the innovation race.

We selected the United States of America, China, and India for comparison with Japan. These countries are leaders in software development [18] and in AI research and the AI market [12], are among the top economies in the world, or both.

Our notebook is divided into three sections, each analyzing one large area of Japan's tech landscape:
- **The tech working environment.** In this section, we will analyze diversity and fairness in the tech working environment.
- **The technical profile of people working in tech.** In this section, we will focus on the skills of the individuals who make up the tech workforce, the tools they frequently use, and the activities that are important in their jobs.
- **Market and investment.** In this section, we will analyze company profiles to better understand the tech market.

Before jumping into these sections, we should start with a few considerations.

In [8]:
"""
CSV reading and data processing functions to select only
respondents who are employees and who work in the selected countries
"""
# Data reading and cleaning functions
def keep_employees_in_2022df(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df["Q5"] == "No"]
    df = df[~df["Q23"].str.contains(
            "not employed", case=False, na=False)]
    return df


def keep_employees_in_2019_to_2021df(df: pd.DataFrame) -> pd.DataFrame:
    df = df[~df["Q5"].str.contains(
        "Student", case=False, na=False)]
    df = df[~df["Q5"].str.contains(
            "employed", case=False, na=False)]
    return df


def keep_employees_in_2018df(df: pd.DataFrame) -> pd.DataFrame:
    df = df[~df["Q6"].str.contains(
        "Student", case=False, na=False)]
    df = df[~df["Q6"].str.contains(
            "employed", case=False, na=False)]
    return df


def keep_employees_in_df(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """ Keeps only the data of respondents that are employees/retirees."""
    if year == 2022:
        return keep_employees_in_2022df(df)
    elif year in [2021, 2020, 2019]:
        return keep_employees_in_2019_to_2021df(df)
    elif year == 2018:
        return keep_employees_in_2018df(df)
    elif year == 2017:
        df = df[~(df["Student Status"] == "Yes")]
        df = df[~df["EmploymentStatus"].str.contains(
                "Not employed", case=False, na=False)]
        return df
    # Fail loudly instead of silently returning None for an unknown year.
    raise ValueError(f"Unsupported survey year: {year}")


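def filter_country_in_df(
    df: pd.DataFrame, year: int, country: str) -> pd.DataFrame:
    """ Keeps only the data of respondents in a specific country."""
    # The country question moved between survey editions.
    if year == 2022:
        return df[df["Q4"] == country]
    elif 2018 <= year <= 2021:
        return df[df["Q3"] == country]
    elif year == 2017:
        return df[df["Country"] == country]
    # Fail loudly instead of silently returning None for an unknown year.
    raise ValueError(f"Unsupported survey year: {year}")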


def read_year_country_data(
    filepath: str, year: int, country: str) -> pd.DataFrame:
    """ Reads the data in a csv file referring to a year of data and keeps
    only the data filled by employees/retirees in a specific country"""
    df = pd.read_csv(filepath, dtype=str)
    df = filter_country_in_df(df, year, country)
    df = keep_employees_in_df(df, year)
    return df


def read_data_from_all_years(
    years: List[int],
    filepaths: List[str],
    countries: List[str]) -> Dict[str, Dict[int, pd.DataFrame]]:
    """ Returns the survey yearly data of respondents of
    specific countries and that are employees/retirees."""
    all_years_data = {}

    for country in countries:
        all_years_data[country] = {}
        for year, filepath in zip(years, filepaths):
            country_year_data = read_year_country_data(filepath, year, country)
            assert country_year_data is not None
            all_years_data[country][year] = country_year_data
    return all_years_data
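
The filtering steps above can be illustrated on a toy frame. This is only a sketch: the rows below are made up, and the column layout (`Q3` for country, `Q5` for job title) is the 2019 to 2021 one assumed by `filter_country_in_df` and `keep_employees_in_2019_to_2021df`.

```python
import pandas as pd

# Made-up rows standing in for a 2019-2021 survey file:
# Q3 is the country, Q5 is the job title.
toy = pd.DataFrame({
    "Q3": ["India", "India", "India", "Japan", "India"],
    "Q5": ["Data Scientist", "Student", "Currently not employed",
           "Software Engineer", None],
})

# Step 1: keep one country (as in filter_country_in_df).
in_india = toy[toy["Q3"] == "India"]

# Step 2: drop students and unemployed respondents
# (as in keep_employees_in_2019_to_2021df).
is_student = in_india["Q5"].str.contains("Student", case=False, na=False)
is_unemployed = in_india["Q5"].str.contains("employed", case=False, na=False)
employees = in_india[~is_student & ~is_unemployed]

print(employees)
```

Because `na=False` treats a missing title as "no match", the row with no job title survives both filters. Whether such respondents should count as employees is a judgment call that the functions above leave implicit.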


In [9]:
"""
Data processing functions to get one analysis_df with all the data
used in our analysis with labels fixed
"""

# https://www.economist.com/big-mac-index
# https://github.com/TheEconomist/big-mac-data
bigmac_price = {
    "Japan": {
        2022: 4.33424940302147,
        2021: 4.52475306799529,
        2020: 4.4222753559572,
        2019: 4.26944490282339,
        2018: 4.34051401396391,
    },
    "United States of America": {
        2022: 4.54640118103503,
        2021: 4.75935662467434,
        2020: 4.71797911563537,
        2019: 4.51734673325535,
        2018: 4.57758988396032,
    },
    "China": {
        2022: 3.64104793756501,
        2021: 3.53487733533526,
        2020: 3.30759148628284,
        2019: 3.28136332239687,
        2018: 3.23090135161334,
    },
    "India": {
        2022: 3.42355999236184,
        2021: 3.21614517431256,
        2020: 2.96866603376469,
        2019: 2.99855627219012,
        2018: 2.94345521926975,
    },
}


def get_age_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns the column from a dataframe with age information."""
    if year in [2022, 2018]:
        age_data = df["Q2"]
    elif year in [2021, 2020, 2019]:
        age_data = df["Q1"]

    age_data = age_data.replace("80+", "70+")
    return age_data.replace("70-79", "70+")

def get_gender_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns the column from a dataframe with gender information."""
    # The gender question moved between survey editions.
    gender_column = {2022: "Q3", 2021: "Q2", 2020: "Q2", 2019: "Q2", 2018: "Q1"}
    if year not in gender_column:
        return None
    gender_data = df[gender_column[year]]
    # Harmonize the labels used in different years.
    gender_data = gender_data.replace("Male", "Man")
    gender_data = gender_data.replace("Female", "Woman")
    return gender_data

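def convert_salary_to_num_bigmacs(salary_data: pd.Series, year: int, country: str) -> pd.Series:
    """ Converts salaries in USD to whole Big Macs at that year's local price
    (floor division, so fractions of a Big Mac are discarded)."""
    return salary_data//bigmac_price[country][year]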

def convert_str_range_to_median(str_range: str) -> float:
    """ Converts a salary range given as a string to the midpoint
    of that range (open-ended ranges map to their single bound)."""
    if not pd.isnull(str_range):
        str_range = str_range.replace(",", "")
        str_range = str_range.replace("$", "")
        str_range = str_range.replace(">", "")
        str_range = str_range.replace("+", "")
        str_range = str_range.replace(" ", "")
        if "-" in str_range:
            index = str_range.find("-")
            smallest = float(str_range[:index])
            largest = float(str_range[(index+1):])
            result = (largest + smallest)/2
            return result
        else:
            result = float(str_range)
            return result
    else:
        return str_range

def process_2018_salary_ranges(str_range: str) -> str:
    """ Adds ,000 to the first number in the salary range."""
    if not pd.isnull(str_range):
        if "-" in str_range:
            index = str_range.find("-")
            smallest = str_range[:index]
            largest = str_range[(index+1):]
            return f"{smallest},000-{largest}"
        else:
            return str_range
    else:
        return str_range

def process_salary_data(salary_data: pd.Series, year: int, country: str) -> pd.Series:
    if year == 2018:
        salary_data = salary_data.apply(process_2018_salary_ranges)
    salary_data = salary_data.apply(convert_str_range_to_median)

    return convert_salary_to_num_bigmacs(salary_data, year, country)

def get_salary_data(df: pd.DataFrame, year: int, country: str) -> pd.Series:
    """ Returns the column from a dataframe with salary information."""
    salary_data_select = {
        2022: "Q29",
        2021: "Q25",
        2020: "Q24",
        2019: "Q10",
        2018: "Q9"
    }
    assert year in salary_data_select
    salary_data = df[salary_data_select[year]]
    salary_data = salary_data.replace("I do not wish to disclose my approximate yearly compensation", np.nan)
    salary_data = process_salary_data(salary_data, year, country)
    
    return salary_data

def get_title_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns the column from a dataframe with job title information."""
    if year==2022:
        title_data = df["Q23"]
    elif year in [2021, 2020, 2019]:
        title_data = df["Q5"]
    elif year == 2018:
        title_data = df["Q6"]
    
    title_data = title_data.replace("Data Analyst (Business, Marketing, Financial, Quantitative, etc)", "Data Analyst")
    title_data = title_data.replace("Manager (Program, Project, Operations, Executive-level, etc)", "Manager")
    return title_data

def get_formal_education_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns the column from a dataframe with the level of formal
        education. """

    if year == 2022:
        formal_education_data = df["Q8"]
    else:
        formal_education_data = df["Q4"]

    formal_education_data = formal_education_data.replace(
        "Professional doctorate", "Professional degree")
    formal_education_data = formal_education_data.replace(
        "Some college/university study without earning a bachelor’s degree",
        "Some college/university study without<br>earning a bachelor's degree"
    )
    formal_education_data = formal_education_data.replace(
        "Bachelor’s degree", "Bachelor's degree")
    formal_education_data = formal_education_data.replace(
        "Master’s degree", "Master's degree")

    return formal_education_data

def get_multiple_answers_data(df: pd.DataFrame, data_select: dict, year: int) -> pd.Series:

    columns = df.columns.values

    columns_to_select = []
    for column in columns:
        if data_select[year] in column:
            columns_to_select.append(column)
    
    return df[columns_to_select].apply(lambda row: ' -- '.join(row.dropna().values.astype(str)), axis=1)
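
Note that `data_select[year] in column` is a substring test, so a short key such as `"Q1"` would also pick up `"Q10_Part_1"`. None of the keys passed in this notebook appear to collide that way, but a stricter match is easy to sketch. The helper name `select_question_columns` and the toy columns below are hypothetical; only the `Qn_Part_m` naming pattern comes from the survey files.

```python
import pandas as pd

def select_question_columns(df: pd.DataFrame, question: str) -> list:
    """Hypothetical helper: columns named exactly `question`
    or starting with `question` followed by an underscore."""
    return [c for c in df.columns
            if c == question or c.startswith(question + "_")]

# Made-up multiple-answer columns for two different questions.
toy = pd.DataFrame({
    "Q1_Part_1": ["Python", None],
    "Q1_Part_2": ["R", "SQL"],
    "Q10_Part_1": ["Coursera", "edX"],
})

substring_match = [c for c in toy.columns if "Q1" in c]
prefix_match = select_question_columns(toy, "Q1")

# Same joining step as get_multiple_answers_data.
joined = toy[prefix_match].apply(
    lambda row: " -- ".join(row.dropna().values.astype(str)), axis=1)

print(substring_match)
print(prefix_match)
print(joined.tolist())
```

The prefix match still includes free-text columns such as `Q1_OTHER_TEXT` when they exist, so this sketch narrows the selection without fully cleaning it.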


def get_learning_platforms_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns the column from a dataframe with the learning platforms. """
    data_select = {
        2022: "Q6",
        2021: "Q40",
        2020: "Q37",
        2019: "Q13",
        2018: "Q36"
    }
    return get_multiple_answers_data(df, data_select, year)

def get_coding_experience_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns the column from a dataframe with the coding experience in
    years."""
    if year == 2022:
        experience_data = df["Q11"]
    elif year in [2021, 2020]:
        experience_data = df["Q6"]
    elif year == 2019:
        experience_data = df["Q15"]
    else:
        experience_data = df["Q24"]
        experience_data = experience_data.replace(
            "I have never written code and I do not want to learn",
            "I have never written code")
        experience_data = experience_data.replace(
            "I have never written code but I want to learn",
            "I have never written code")
        experience_data = experience_data.replace(
            "20-30 years",
            "20+ years")
        experience_data = experience_data.replace(
            "30-40 years",
            "20+ years")
        experience_data = experience_data.replace(
            "40+ years",
            "20+ years")
    
    experience_data = experience_data.replace("1-3 years", "1-5 years")
    experience_data = experience_data.replace("3-5 years", "1-5 years")
    experience_data = experience_data.replace("1-2 years", "1-5 years")
    experience_data = experience_data.replace("2-5 years", "1-5 years")
    experience_data = experience_data.replace("< 1 years", "< 1 year")
    return experience_data

def get_ML_experience_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns the column from a dataframe with the ML experience in
    years."""
    if year == 2022:
        experience_data = df["Q16"]
    elif year in [2021, 2020]:
        experience_data = df["Q15"]
    elif year == 2019:
        experience_data = df["Q23"]
        experience_data = experience_data.replace("10-15 years", "10-20 years")
    else:
        experience_data = df["Q25"]
        experience_data = experience_data.replace("10-15 years", "10-20 years")
        experience_data = experience_data.replace("I have never studied machine learning but plan to learn in the future", "I do not use machine<br>learning methods")
        experience_data = experience_data.replace("I have never studied machine learning and I do not plan to", "I do not use machine<br>learning methods")
    experience_data = experience_data.replace("< 1 years", "< 1 year")
    experience_data = experience_data.replace("Under 1 year", "< 1 year")
    experience_data = experience_data.replace("20 or more years", "20+ years")
    experience_data = experience_data.replace("I do not use machine learning methods", "I do not use machine<br>learning methods")
    return experience_data

def get_important_activities_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns the column of a dataframe with the data on important
        activities in the workplace."""
    data_select = {
        2022: "Q28",
        2021: "Q24",
        2020: "Q23",
        2019: "Q9",
        2018: "Q11"
    }
    return get_multiple_answers_data(df, data_select, year)


def get_language_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on the type of used programming language."""
    data_select = {
        2022: "Q12",
        2021: "Q7",
        2020: "Q7",
        2019: "Q18",
        2018: "Q16"
    }
    return get_multiple_answers_data(df, data_select, year)

def get_IDE_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on the type of used IDE."""
    data_select = {
        2022: "Q13",
        2021: "Q9",
        2020: "Q9",
        2019: "Q16",
        2018: "Q13"
    }
    return get_multiple_answers_data(df, data_select, year)

def get_notebook_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on the type of used notebook."""
    data_select = {
        2022: "Q14",
        2021: "Q10",
        2020: "Q10",
        2019: "Q17",
        2018: "Q14"
    }
    return get_multiple_answers_data(df, data_select, year)

def get_visualization_tools_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on the type of used visualization tools."""
    data_select = {
        2022: "Q15",
        2021: "Q14",
        2020: "Q14",
        2019: "Q20",
        2018: "Q21"
    }
    return get_multiple_answers_data(df, data_select, year)

def get_ML_frameworks_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on the type of used ML frameworks."""
    data_select = {
        2022: "Q17",
        2021: "Q16",
        2020: "Q16",
        2019: "Q28",
        2018: "Q19"
    }
    return get_multiple_answers_data(df, data_select, year)

def get_ML_algorithms_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on the type of used ML algorithms."""
    data_select = {
        2022: "Q18",
        2021: "Q17",
        2020: "Q17",
        2019: "Q24",
    }
    return get_multiple_answers_data(df, data_select, year)

def get_computer_vision_methods_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on the type of used computer vision methods."""
    data_select = {
        2022: "Q19",
        2021: "Q18",
        2020: "Q18",
        2019: "Q26",
    }
    return get_multiple_answers_data(df, data_select, year)

def get_NLP_methods_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on the type of used NLP methods."""
    data_select = {
        2022: "Q20",
        2021: "Q19",
        2020: "Q19",
        2019: "Q27",
    }
    return get_multiple_answers_data(df, data_select, year)

def get_ML_hardware_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on ML hardware usage."""
    data_select = {
        2022: "Q42",
        2021: "Q12",
        2020: "Q12",
        2019: "Q21",
    }
    return get_multiple_answers_data(df, data_select, year)

def get_TPU_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on TPU usage frequency."""
    data_select = {
        2022: "Q43",
        2021: "Q13",
        2020: "Q13",
        2019: "Q22",
    }
    tpu_data = df[data_select[year]]
    tpu_data = tpu_data.replace("More than 25 times", "> 25 times")
    tpu_data = tpu_data.replace("6-24 times", "6-25 times")
    return tpu_data

def get_managed_ML_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on managed ML tools usage."""
    data_select = {
        2022: "Q37",
        2021: "Q31_A", #
        2020: "Q28_A", #
        2018: "Q28" #
    }
    return get_multiple_answers_data(df, data_select, year)

def get_auto_ML_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on auto ML tools usage."""
    data_select = {
        2022: "Q38",
        2021: "Q37_A",
        2020: "Q34_A", #
        2019: "Q33", #
    }
    return get_multiple_answers_data(df, data_select, year)

def get_ML_research_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on ML research."""
    data_select = {
        2022: "Q10"
    }
    ML_research_data = get_multiple_answers_data(df, data_select, year)
    ML_research_data = ML_research_data.replace("Yes, the research made advances related to some novel machine learning method (theoretical research)", "Yes, theoretical research")
    ML_research_data = ML_research_data.replace("Yes, the research made use of machine learning as a tool (applied research)", "Yes, applied research")
    ML_research_data = ML_research_data.replace("Yes, the research made advances related to some novel machine learning method (theoretical research) -- Yes, the research made use of machine learning as a tool (applied research)", "Yes, both theoretical<br>and applied research")

    return ML_research_data

def get_ML_serve_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on ML serve products."""
    data_select = {
        2022: "Q39"
    }
    return get_multiple_answers_data(df, data_select, year)

def get_ML_monitor_tools_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on ML monitoring tools."""
    data_select = {
        2022: "Q40",
        2021: "Q38_A", #
        2020: "Q35_A", #
    }
    return get_multiple_answers_data(df, data_select, year)

def get_ethical_AI_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on ethical AI tools."""
    data_select = {
        2022: "Q41",
    }
    return get_multiple_answers_data(df, data_select, year)

def get_cloud_computing_tools_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on Cloud Computing tools."""
    data_select = {
        2022: "Q31",
        2021: "Q27_A", #
        2020: "Q26_A", #
        2019: "Q29", #
        2018: "Q15" #
    }
    return get_multiple_answers_data(df, data_select, year)

def get_cloud_computing_products_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on Cloud Computing products."""
    data_select = {
        2022: "Q33",
        2021: "Q29_A",
        2020: "Q27_A", #
        2019: "Q30", #
        2018: "Q27" #
    }
    return get_multiple_answers_data(df, data_select, year)

def get_data_storage_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on used data storage products."""
    data_select = {
        2022: "Q34",
        2021: "Q30_A", #
    }
    return get_multiple_answers_data(df, data_select, year)

def get_data_products_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on data products usage."""
    data_select = {
        2022: "Q35",
        2021: "Q32_A", #
        2020: "Q29_A", #
    }
    return get_multiple_answers_data(df, data_select, year)

def get_spent_money_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on money spent on cloud computing or ML."""
    data_select = {
        2022: "Q30",
        2021: "Q26",
        2020: "Q25",
        2019: "Q11",
    }
    return df[data_select[year]]

def get_industry_type_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on Industry type."""
    data_select = {
        2022: "Q24",
        2021: "Q20",
        2018: "Q7"
    }
    return df[data_select[year]]

def get_company_size_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on Company size."""
    data_select = {
        2022: "Q25",
        2021: "Q21",
        2020: "Q20",
        2019: "Q6",
    }
    company_size_data = df[data_select[year]]
    company_size_data = company_size_data.replace(
        "10,000 or more employees", "> 10,000 employees")
    return company_size_data

def get_num_data_scientists_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on the number of data scientists in the job."""
    data_select = {
        2022: "Q26",
        2021: "Q22",
        2020: "Q21",
        2019: "Q7",
    }
    return df[data_select[year]]

def get_ML_usage_in_business_data(df: pd.DataFrame, year: int) -> pd.Series:
    """ Returns data on how ML is being used in business."""
    data_select = {
        2022: "Q27",
        2021: "Q23",
        2020: "Q22",
        2019: "Q8",
        2018: "Q10"
    }
    return df[data_select[year]]

def get_df_for_year_and_country(df: pd.DataFrame, year: int, country: str) -> pd.DataFrame:
    """ Returns a dataframe with the demographic, salary, skills, tooling and
    company information extracted from one year of survey data. Questions
    that were not asked in a given year are filled with None."""
    new_df = pd.DataFrame()
    new_df["Age"] = get_age_data(df, year)
    new_df["Gender"] = get_gender_data(df, year)
    new_df["Salary"] = get_salary_data(df, year, country)
    new_df["Title"] = get_title_data(df, year)


    new_df["Formal Education"] = get_formal_education_data(df, year)
    new_df["Learning Platforms"] = get_learning_platforms_data(df, year)

    new_df["Coding Experience"] = get_coding_experience_data(df, year)
    new_df["ML Experience"] = get_ML_experience_data(df, year)
    new_df["Important Activities"] = get_important_activities_data(df, year)

    new_df["Languages"] = get_language_data(df, year)
    new_df["IDEs"] = get_IDE_data(df, year)
    new_df["Notebooks"] = get_notebook_data(df, year)
    new_df["Visualization Tools"] = get_visualization_tools_data(df, year)
    new_df["ML Frameworks"] = get_ML_frameworks_data(df, year)

    new_df["Cloud Computing Tools"] = get_cloud_computing_tools_data(df, year)
    new_df["Cloud Computing Products"] = get_cloud_computing_products_data(df, year)
    
    
    new_df["ML Usage in Business"] = get_ML_usage_in_business_data(df, year)

    if year == 2018:
        new_df["ML Algorithms"] = [None] * len(new_df["Age"])
        new_df["Computer Vision Methods"] = [None] * len(new_df["Age"])
        new_df["NLP Methods"] = [None] * len(new_df["Age"])
        new_df["ML Hardware"] = [None] * len(new_df["Age"])
        new_df["TPU"] = [None] * len(new_df["Age"])
        new_df["AutoML"] = [None] * len(new_df["Age"])
        new_df["Company Size"] = [None] * len(new_df["Age"])
        new_df["Number of Data Scientists"] = [None] * len(new_df["Age"])
        new_df["Money Spent"] = [None] * len(new_df["Age"])
    else:
        new_df["ML Algorithms"] = get_ML_algorithms_data(df, year)
        new_df["Computer Vision Methods"] = get_computer_vision_methods_data(df, year)
        new_df["NLP Methods"] = get_NLP_methods_data(df, year)
        new_df["ML Hardware"] = get_ML_hardware_data(df, year)
        new_df["TPU"] = get_TPU_data(df, year)
        new_df["AutoML"] = get_auto_ML_data(df, year)
        new_df["Company Size"] = get_company_size_data(df, year)
        new_df["Number of Data Scientists"]= get_num_data_scientists_data(df, year)
        new_df["Money Spent"] = get_spent_money_data(df, year)

    if year == 2019:
        new_df["Managed ML"] = [None] * len(new_df["Age"])
    else:
        new_df["Managed ML"] = get_managed_ML_data(df, year)

    if year == 2022:
        new_df["Published Papers"] = df["Q9"]
        new_df["ML Research"] = get_ML_research_data(df, year)
        new_df["ML Serve"] = get_ML_serve_data(df, year)
        new_df["Ethical AI tools"] = get_ethical_AI_data(df, year)
    else:
        new_df["Published Papers"] = [None] * len(new_df["Age"])
        new_df["ML Research"] = [None] * len(new_df["Age"])
        new_df["ML Serve"] = [None] * len(new_df["Age"])
        new_df["Ethical AI tools"] = [None] * len(new_df["Age"])

    if year in [2020, 2019]:
        new_df["Industry Type"] = [None] * len(new_df["Age"])
    else:
        new_df["Industry Type"] = get_industry_type_data(df, year)

    if year in [2022, 2021]:
        new_df["Data Storage Products"] = get_data_storage_data(df, year)
    else:
        new_df["Data Storage Products"] = [None] * len(new_df["Age"])

    if year in [2019, 2018]:
        new_df["Data Products"] = [None] * len(new_df["Age"])
        new_df["ML Monitor Tools"] = [None] * len(new_df["Age"])
    else:
        new_df["Data Products"] = get_data_products_data(df, year)
        new_df["ML Monitor Tools"] = get_ML_monitor_tools_data(df, year)

    return new_df

def get_data_for_analysis(
    all_years_data: Dict[str, Dict[int, pd.DataFrame]]) -> pd.DataFrame:

    diversity_data_df = pd.DataFrame()

    for country in all_years_data:
        for year in all_years_data[country]:
            df = (
                get_df_for_year_and_country(
                    df = all_years_data[country][year],
                    year = year,
                    country = country
                ))
            if country == "United States of America":
                df["Country"] = ["U.S.🇺🇸"]*len(df)
            elif country == "Japan":
                df["Country"] = ["Japan🇯🇵"]*len(df)
            elif country == "China":
                df["Country"] = ["China🇨🇳"]*len(df)
            elif country == "India":
                df["Country"] = ["India🇮🇳"]*len(df)
            df["Year"] = [year]*len(df)
            if len(diversity_data_df) == 0:
                diversity_data_df = df
            else:
                diversity_data_df = pd.concat([diversity_data_df, df])
            
    return diversity_data_df
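
To make the salary normalization concrete, here is a small worked sketch of the path from a salary-range string to a number of Big Macs. `range_midpoint` is a condensed restatement of `convert_str_range_to_median`, not a replacement for it, and the three salary strings are made-up examples. The Big Mac price is the India 2022 value from `bigmac_price`.

```python
import numpy as np
import pandas as pd

# India 2022 price in USD, copied from the bigmac_price table above.
INDIA_2022_BIGMAC_USD = 3.42355999236184

def range_midpoint(str_range):
    """Condensed restatement of convert_str_range_to_median."""
    if pd.isnull(str_range):
        return str_range
    cleaned = str_range
    for ch in ",$>+ ":
        cleaned = cleaned.replace(ch, "")
    if "-" in cleaned:
        low, high = cleaned.split("-", 1)
        return (float(low) + float(high)) / 2
    # Open-ended answers such as ">$1,000,000" keep their single bound.
    return float(cleaned)

# Made-up answers, including a missing one.
salaries = pd.Series(["10,000-14,999", "$0-999", ">$1,000,000", np.nan])

midpoints = salaries.apply(range_midpoint)
bigmacs = midpoints // INDIA_2022_BIGMAC_USD  # whole Big Macs only

print(pd.DataFrame({"range": salaries, "midpoint": midpoints, "bigmacs": bigmacs}))
```

Floor division discards fractions of a Big Mac, and a missing answer stays missing, so it is excluded later when salaries are averaged.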


In [10]:
"""
Functions to make calculations on the data from analysis_df
(salary averaged over certain groups, get percentage of respondents
that gave a certain answer to a question)
"""

def get_salary_avg_grouped_by_year_country_and_extra_variable(
    analysis_df: pd.DataFrame, extra_var: str) -> pd.DataFrame:

    # Work on a copy so that the helper column added below does not touch analysis_df
    avg_data = analysis_df[["Year", "Country", "Salary", extra_var]].copy()
    avg_data = avg_data.dropna(subset=["Year", "Country", "Salary", extra_var])

    tmp_df = avg_data.groupby(
        ["Year", "Country", extra_var]).size().reset_index(name="Count")
    tmp_df = tmp_df[tmp_df["Count"] >= 15] # consider only groups with at least 15 samples

    # Flag the rows that belong to one of the sufficiently large groups
    avg_data["Keep"] = False
    for c,year,extra_i in zip(tmp_df["Country"].values, tmp_df["Year"].values, tmp_df[extra_var].values):
        data_filter = (avg_data["Country"]==c) & (avg_data["Year"]==year) & (avg_data[extra_var]==extra_i)
        avg_data.loc[data_filter,"Keep"] = True

    # Keep only the flagged rows and drop the helper column before averaging
    avg_data = avg_data[avg_data["Keep"]].drop(columns=["Keep"])

    avg_data = avg_data.groupby(
        ["Year", "Country", extra_var], as_index=False).mean()
    return avg_data

def get_percentage_of_multiple_choices_item(
    analysis_df: pd.DataFrame, multiple_choice_column: str, choices:List[str],
    re_expressions:List[str]) -> pd.DataFrame:
    """ Returns a dataframe with the percentage of respondents for each
    option in the choices list by country and year"""
    # Sort so that the row order of the result is deterministic
    countries = sorted(set(analysis_df["Country"].values))
    years = sorted(set(analysis_df["Year"].values))

    percentage_df_rows = []

    for year in years:
        for country in countries:
            df = analysis_df[analysis_df["Year"]==year]
            df = df[df["Country"] == country]
            for re_expression, choice in zip(re_expressions, choices):
                # df[multiple_choice_column] = df[multiple_choice_column].replace(np.nan, "None") 
                
                # assert (df[multiple_choice_column].str.count(choice) < 2).all()
                # num_respondents = df[multiple_choice_column].str.count(choice).sum()
                
                num_respondents = df[multiple_choice_column].str.contains(
                    re_expression, regex=True, na=False, case=True).sum()

                percentage = round((num_respondents*100)/len(df), 1)

                percentage_df_rows.append({
                    "Year": year,
                    "Country": country,
                    multiple_choice_column: choice,
                    "Percentage of Respondents": percentage
                })
    

    return pd.DataFrame(percentage_df_rows)

In [11]:
all_years_data = read_data_from_all_years(years=_YEARS,
                                          filepaths=_FILEPATHS,
                                          countries=_COUNTRIES)
analysis_df = get_data_for_analysis(all_years_data)

In [12]:
title_prefix = "<h1 style = 'font-size: 2.5em'>"

title_str = "Initial considerations"

HTML(f"{title_prefix}{title_str}</h1>")

# Percentage of respondents
In many plots presented in this notebook, we show the percentage of respondents in each country and year. This percentage refers to the number of people, per year and per country, that gave a particular answer to a question in the survey (and its equivalent across the yearly surveys) divided by the total number of people that answered that question. When showing such percentages, we do not consider empty answers to the related question (i.e. `None` answers).

We do this so that we can compare the profiles of each country more easily. However, it is also important to know the absolute number of participants that are tech workers per country and year, as shown below.
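
As a rough illustration of this definition, the sketch below computes the percentage for a single country and year while skipping empty answers. The function name `percentage_of_respondents` and the sample answers are hypothetical and are not taken from the survey data.

```python
from typing import List, Optional


def percentage_of_respondents(answers: List[Optional[str]], choice: str) -> float:
    """Percentage of non-empty answers that are equal to `choice`."""
    # Empty answers (None) are excluded from the denominator
    valid_answers = [a for a in answers if a is not None]
    if not valid_answers:
        return 0.0
    matches = sum(1 for a in valid_answers if a == choice)
    return round(100 * matches / len(valid_answers), 1)


# Hypothetical answers for one country and one year
sample_answers = ["Python", None, "R", "Python", None, "Python"]
python_percentage = percentage_of_respondents(sample_answers, "Python")
print(python_percentage)
```

Here three of the four non-empty answers match, so the two `None` entries do not lower the percentage.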

In [13]:
fig = px.histogram(analysis_df,
                    x="Year", color="Country", barmode="group", histfunc="count",
                    width=_WIDTH, height=600,
                    color_discrete_sequence=_COLOR_SCHEME,
                    category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]},
                    title="Number of tech workers that participated in the data science surveys")
fig.update_xaxes(type='category')
fig.update_yaxes(title="Number of tech workers")
fig.show()

We can see that, from 2019 to 2022, China was the country with the fewest respondents, followed by Japan, and that, in 2018, Japan was the country with the fewest respondents. We also note that the number of participants in each country is not proportional to the population of that country. In 2021, for example, the population of China was about 1.412 billion, the population of India was about 1.393 billion, the population of Japan was about 125.7 million, and the population of the U.S. was about 331.9 million. This means that the tech workers that participated in the 2021 survey made up (2.88e-8)% of the Chinese population, (2.84e-6)% of the Indian population, (5.65e-6)% of the Japanese population, and (6.03e-6)% of the American population.
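
The arithmetic behind such a comparison can be sketched as follows. The population figures are the ones quoted above; the respondent count is a placeholder (in this notebook it would come from `analysis_df`), and the function name is hypothetical.

```python
def share_of_population_percent(num_respondents: int, population: float) -> float:
    """Respondents as a percentage of the country's population."""
    return 100 * num_respondents / population


# 2021 populations quoted in the text above
populations_2021 = {
    "China": 1.412e9,
    "India": 1.393e9,
    "Japan": 125.7e6,
    "U.S.": 331.9e6,
}

# Hypothetical respondent count, used only to show the arithmetic
example_respondents = 1000
for country, population in populations_2021.items():
    share = share_of_population_percent(example_respondents, population)
    print(f"{country}: {share:.2e}%")
```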

# Compensation comparison

In order to compare compensation across countries and years, we use the GDP-adjusted Big Mac index [14]. With this index and with the salary of tech workers given in US dollars, it is possible to express their salary as the number of Big Macs that they could have bought in the country where they work in a given year. Our reasoning in converting people's salaries from US dollars to number of Big Macs was to use a metric that considers the purchasing power of workers per country and year.
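
A minimal sketch of this conversion, assuming a lookup table of GDP-adjusted Big Mac prices in US dollars keyed by country and year. The prices and the name `BIG_MAC_PRICE_USD` are placeholders for illustration and are not values from [14].

```python
from typing import Dict, Tuple

# Placeholder prices (USD) keyed by (country, year); NOT real index values
BIG_MAC_PRICE_USD: Dict[Tuple[str, int], float] = {
    ("Japan", 2022): 4.0,
    ("India", 2022): 2.5,
}


def salary_in_big_macs(salary_usd: float, country: str, year: int) -> float:
    """Number of Big Macs a yearly salary in USD could buy in that country and year."""
    price = BIG_MAC_PRICE_USD[(country, year)]
    return salary_usd / price


big_macs = salary_in_big_macs(50_000, "Japan", 2022)
print(big_macs)
```

Dividing by a local price is what lets the same nominal salary translate into different purchasing power in different countries.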

# Datasets

In this notebook, we use the survey answers from the "Kaggle Machine Learning & Data Science Survey" for the years between 2018 and 2022. These surveys have different sets of questions and answers. We manually checked all of them and mapped, as best as we could, the questions from the 2022 survey to the questions from the surveys of past years. We have added a table with the mapping between the questions in each survey to our Appendix.

We have also used the GDP-adjusted Big Mac prices in US dollars per country and per year [14] to convert the compensation of workers from US dollars to number of Big Macs.

In [14]:
title_prefix = "<h1 style = 'font-size: 2.5em'>"

title_str = "The tech working environment"

HTML(f"{title_prefix}{title_str}</h1>")

The tech working environment in India is dynamic and rapidly evolving. India is known for its large pool of talented tech professionals, innovative startups, and established global tech companies. The country has a strong education system, with many leading engineering and technology institutions that produce highly skilled graduates.

In recent years, there has been a surge in the number of tech startups in India, particularly in cities like Bangalore, Hyderabad, and Delhi. These startups are leveraging cutting-edge technologies such as artificial intelligence, machine learning, and blockchain to develop innovative products and services. Many of these startups are funded by venture capital firms and angel investors, and are growing rapidly.

India is also home to several global tech giants, including Google, Microsoft, Amazon, and IBM, which have established research and development centers in the country. These companies leverage the highly skilled workforce and the large market opportunities in India to drive innovation and growth.

The tech working environment in India is highly competitive, with a focus on innovation and continuous learning. Many companies offer attractive compensation packages and benefits, as well as opportunities for career advancement and professional growth. However, there are also challenges such as intense competition for talent, high levels of work stress, and a lack of work-life balance in some cases.

Overall, the tech working environment in India is characterized by innovation, dynamism, and a focus on growth and learning.

# Diversity in tech

From the answers given to the Data Science surveys, we can analyze gender diversity in the tech workplace. The following histogram shows the gender distribution of tech workers based on the survey's answers.

In [15]:
fig = px.histogram(analysis_df,
                    x="Country", color="Gender", barmode="stack", histfunc="count",
                    barnorm="percent", animation_frame="Year",
                    width=_WIDTH, height=600,
                    color_discrete_sequence=_COLOR_SCHEME,
                    category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
                                      "Gender":["Man", "Woman", "Prefer not to say", "Nonbinary"],
                                      "Year": _YEARS},  # all survey years, including 2022
                    title="Gender distribution of people working in tech")
fig.update_xaxes(type='category')
fig.update_yaxes(title="Percentage of respondents (%)")
fig.show()

We can see that the percentage of women working in tech is much smaller than the percentage of men working in tech across countries and over the years. We can also note that, in all countries except India, the percentage of female tech workers did not change much over the years, with a change of at most 4.43 percentage points between consecutive years (see the change in China between 2021 and 2022). In India, we can see a modest trend towards an increase in the female workforce over the years.


In [16]:
callout_str = "Compared to India, China and the U.S., Japan consistently showed the largest gender imbalance in tech between 2018 and 2022."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

In [17]:
avg_data = analysis_df[analysis_df["Gender"].isin(["Man", "Woman"])]
avg_data = avg_data[["Country", "Gender", "Salary", "Year"]]
avg_data = avg_data.dropna(subset=["Gender", "Country", "Salary", "Year"])
avg_data = avg_data.groupby(["Country", "Gender", "Year"], as_index=False).mean()


fig = px.bar(avg_data,
             x="Country", y="Salary",
             color="Gender",
             animation_frame="Year",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
                              "Gender":["Man", "Woman"],
                              "Year": _YEARS},  # all survey years, including 2022
             title="Average yearly compensation of men and women")

fig.update_layout(yaxis_range=[0,38000])
fig.update_yaxes(title="Average yearly compensation (number of Big Macs)")
fig.update_xaxes(type='category')
fig.show()

We can see from the plot above that, in 2022, the average compensation of women in the U.S. was around 66.75% of the compensation of men. In Japan, this percentage was 67.66%.

India was the country that showed the smallest pay gap, with the average compensation of women equal to 93.44% of the compensation of men, and China was the country that showed the largest pay gap, where the average compensation of women was 58.88% of the average men's compensation in 2022.

However, we note that the gender pay gap in Japan seemed to be non-existent in 2020, with the average compensation of women being slightly larger than the average compensation of men. In 2021, the gender gap seemed to be back, with the average compensation of the female workforce equal to 65.68% of the compensation of the male workforce.

If we compare the data of 2021 with the data of 2022, there seems to be a slight trend towards a reduction in the gender pay gap in Japan, but this reduction is very small compared to the gender pay gap reduction in China and India between 2021 and 2022.
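
The percentages quoted above are ratios of group averages. A minimal sketch of that calculation follows; the function name and the two averages are hypothetical and only illustrate the arithmetic.

```python
def women_to_men_ratio_percent(avg_women: float, avg_men: float) -> float:
    """Average compensation of women as a percentage of the average for men."""
    return round(100 * avg_women / avg_men, 2)


# Hypothetical averages, in number of Big Macs
ratio = women_to_men_ratio_percent(avg_women=8_000, avg_men=10_000)
pay_gap = round(100 - ratio, 2)  # gap expressed in percentage points
print(ratio, pay_gap)
```

A ratio above 100% would mean that women earned more than men on average, as the text reports for Japan in 2020.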

In [18]:
callout_str = "The gender pay gap in Japan is very similar to the one in the U.S.<br>India is the country with the smallest pay gap, and China is the one with the largest."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

We can also use the data from the surveys to analyze age diversity in the tech workplace.

In [19]:
fig = px.histogram(analysis_df,
                    x="Country", color="Age", barmode="stack", histfunc="count",
                    barnorm="percent", animation_frame="Year",
                    width=_WIDTH, height=600,
                    color_discrete_sequence=_COLOR_SCHEME,
                    category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
                              "Age": ["18-21", "22-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-69", "70+"],
                              "Year": _YEARS},  # all survey years, including 2022
                    title="Age distribution of people working in tech")
fig.update_xaxes(type='category')
fig.update_yaxes(title="Percentage of respondents (%)")
fig.update_layout(legend_title="Age group (years)")
fig.show()

Three interesting things can be observed from the histogram above:
1. As the years go by, the percentage of younger participants decreases and the percentage of older participants increases in all countries.
2. The age distribution of tech workers in Japan and the U.S. is quite similar, and, if we consider the age groups of people between 25 and 54 years old, we also see a somewhat balanced distribution of people by age group in these two countries.
3. In 2022, tech workers that were up to 39 years old seemed to make up more than 80% of the tech workforce in China and India, while in Japan and the U.S., they made up less than 50%.
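
The third observation is a cumulative share over the youngest age groups. The sketch below shows one way to compute it from per-group counts; the counts and the function name are hypothetical.

```python
from typing import Dict

# Age groups that cover respondents up to 39 years old
YOUNG_AGE_GROUPS = ["18-21", "22-24", "25-29", "30-34", "35-39"]


def share_up_to_39(counts_by_age_group: Dict[str, int]) -> float:
    """Percentage of respondents that belong to the age groups up to 39 years."""
    total = sum(counts_by_age_group.values())
    young = sum(counts_by_age_group.get(group, 0) for group in YOUNG_AGE_GROUPS)
    return round(100 * young / total, 1)


# Hypothetical counts for one country and one year
example_counts = {"22-24": 30, "25-29": 40, "35-39": 10, "40-44": 15, "50-54": 5}
young_share = share_up_to_39(example_counts)
print(young_share)
```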

In [20]:
callout_str = "Japan and the U.S. show greater age diversity than China and India."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

# How attractive is working in tech?

Working in tech in India can be highly attractive for a number of reasons.

Firstly, India has a large and rapidly growing tech industry, with many global tech companies and startups establishing a presence in the country. This has created a wealth of job opportunities for tech professionals in a variety of fields, including software engineering, data science, cybersecurity, and more.

Secondly, the tech industry in India offers attractive compensation packages, with many companies offering competitive salaries, benefits, and other incentives to attract and retain top talent. This is particularly true for larger and more established companies, which have the financial resources to invest in their employees.

Thirdly, working in tech in India can offer a high degree of job satisfaction, as tech professionals are often at the forefront of innovation and are working on cutting-edge technologies and projects. This can be especially rewarding for those who are passionate about using technology to solve complex problems and drive progress.

Finally, India has a strong education system, with many leading engineering and technology institutions that produce highly skilled graduates. This has created a deep pool of talented tech professionals, making India an attractive destination for companies looking to build high-performing tech teams.

Despite these advantages, there are also challenges associated with working in tech in India, including intense competition for talent, high levels of work stress, and a lack of work-life balance in some cases. However, for those who are willing to take on these challenges, working in tech in India can be highly rewarding and offer many opportunities for career growth and advancement.

In [21]:
avg_data = analysis_df[["Year", "Country", "Salary"]]
avg_data = avg_data.groupby(["Year", "Country"], as_index=False).mean()

fig = px.scatter(avg_data, x="Year", y="Country",
                 size="Salary", color="Country",
                 hover_name="Country", size_max=60,
                 width=_WIDTH, height=600,
                 color_discrete_sequence=_COLOR_SCHEME,
                 title="Average yearly compensation (in number of Big Macs) per country and year")

fig.update_xaxes(type='category')
fig.show()

In [22]:
callout_str = "The average compensation in Japan was consistently higher than in India and China between 2018 and 2022. However, the percentage increase in compensation in Japan between 2021 and 2022 was the smallest among all countries."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

We should also analyze the yearly compensation by job title to determine possible discrepancies between different job titles in different countries. The following plot shows the average yearly compensation in 2022 by job title for tech workers in India, Japan and the U.S. We have disregarded China in this analysis since there were not many tech workers for each job title. In fact, we have disregarded any group that had fewer than 15 participants, so each bubble in the bubble plot below shows the average compensation of a group of people with the same job title working in the same country, where the group had at least 15 people.
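
The minimum group size rule can also be expressed with a grouped `transform`, as in the sketch below. This is an alternative formulation on toy data, not the helper used in this notebook; the function name and the toy values are hypothetical.

```python
import pandas as pd

MIN_GROUP_SIZE = 15  # same threshold as described in the text


def average_salary_for_large_groups(df: pd.DataFrame, group_cols: list,
                                    min_size: int = MIN_GROUP_SIZE) -> pd.DataFrame:
    """Average salary per group, keeping only groups with at least `min_size` rows."""
    df = df.dropna(subset=group_cols + ["Salary"])
    # Size of the group each row belongs to, aligned with the rows of df
    group_sizes = df.groupby(group_cols)["Salary"].transform("size")
    large_groups = df[group_sizes >= min_size]
    return large_groups.groupby(group_cols, as_index=False)["Salary"].mean()


# Toy data: one group with 15 rows and one group with only 2 rows
toy_df = pd.DataFrame({
    "Country": ["A"] * 15 + ["B"] * 2,
    "Title": ["Data Scientist"] * 17,
    "Salary": [100.0] * 15 + [500.0] * 2,
})
result = average_salary_for_large_groups(toy_df, ["Country", "Title"])
print(result)
```

Only country "A" survives the filter, because the group for country "B" has fewer than 15 rows.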

In [23]:
avg_data = get_salary_avg_grouped_by_year_country_and_extra_variable(
    analysis_df, "Title")
avg_data = avg_data[avg_data["Year"]==2022]
avg_data = avg_data[avg_data["Country"].isin(["India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"])]

fig = px.scatter(avg_data, x="Title", y="Country",
                 size="Salary", color="Country",
                 hover_name="Title", size_max=60,
                 width=_WIDTH, height=700,
                 color_discrete_sequence=_COLOR_SCHEME,
                 title="Average yearly compensation (in number of Big Macs) by job title in 2022")

fig.update_xaxes(type='category', title="Job Title")
fig.show()

We can see that Machine Learning Engineers and Managers are the ones with the highest compensation in the U.S. In Japan, the highest compensation is given to Data Engineers, and, in India, to Data Architects. One thing that we note is that, for every job title, the compensation in Japan is larger than in India, but smaller than in the U.S.

Data Architects have one of the highest salaries in the U.S. and the highest in India. This indicates that these professionals are probably esteemed and in demand. In Japan, there were not enough participants of that profession in the 2022 survey (fewer than 15), which is why no data is shown for them. This could be a sign that this type of worker is not in demand and/or is difficult to find in Japan.

We also note that, in Japan, the salary of a non-software Engineer is lower than the salary of a Software Engineer. This is surprising given the concern of specialists in the field regarding Japan investing more in manufacturing than software, and given the perception that working with software is considered "second class".

Another interesting fact about the plot above is that Japan is the only country where the salary of teachers and professors is higher than of Software Engineers, Machine Learning Engineers, Research Scientists, and Data Scientists. This could indicate a tendency of giving a higher value to people working in academia compared to people working in the industry in Japan.

In [24]:
callout_str = "The fact that the salary of a non-software Engineer is lower than the salary of a Software Engineer in Japan goes against the idea that software in Japan is viewed as being 'second class' and that Japan prioritizes manufacturing over software."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

In [25]:
callout_str = "There seems to be a tendency in Japan to give more value to people working in academia than to people working in industry. This is the opposite of the behaviour found in the U.S."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

In [26]:
avg_data = get_salary_avg_grouped_by_year_country_and_extra_variable(analysis_df, "Age")

fig = px.bar(avg_data,
             x="Country", y="Salary",
             color="Age",
             animation_frame="Year",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Average compensation of tech workers according to age, country and year",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
                              "Age": ["18-21", "22-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-69", "70+"],
                              "Year": _YEARS})  # all survey years, including 2022

fig.update_layout(yaxis_range=[0,39000], legend_title="Age group (years)")
fig.update_yaxes(title="Average Compensation (number of Big Macs)")
fig.update_xaxes(type='category')
fig.show()

In [27]:
avg_data = analysis_df[["Year", "Country", "Salary", "Age"]]
avg_data = avg_data[avg_data["Age"].isin(["18-21", "22-24", "25-29", "30-34", "35-39"])]
avg_data = avg_data.dropna(subset=["Year", "Country", "Salary"])
avg_data = avg_data.drop(columns=["Age"])

avg_data = avg_data.groupby(["Year", "Country"], as_index=False).mean()

fig = px.bar(avg_data,
             x="Year", y="Salary",
             color="Country",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Average compensation of workers between 18 and 39 years of age")

fig.update_layout(yaxis_range=[0,38000])
fig.update_yaxes(title="Average Compensation (number of Big Macs)")
fig.update_xaxes(type='category')
fig.show()

Considering the average compensation of tech workers in the last four years, India seems to be about as financially attractive as China, and less attractive than Japan and the U.S., to people between 18 and 39 years old.

We do not see strong trends of salary increase or decrease in Japan, China or India over the last four years (2019-2022), but we do note that the average salary of young workers was higher in Japan and India in 2018.

In [28]:
title_prefix = "<h1 style = 'font-size: 2.5em'>"

title_str = "The technical profile of people working in tech"

HTML(f"{title_prefix}{title_str}</h1>")

Another factor pointed out by experts as a potential cause for Japan's innovation and economic decline was the shortage of professionals with knowledge of cutting-edge technologies such as AI, and of professionals holding a university degree in Science, Technology, Engineering and Mathematics (STEM).

In this section, we will take a look at the technical background, skills and knowledge of tech workers in Japan.

# What is the academic background of tech workers?

We start by taking a look at the distribution of the formal education of tech workers per country and year.

In [29]:
fig = px.histogram(analysis_df,
                    x="Country", color="Formal Education", barmode="stack", histfunc="count",
                    barnorm="percent", animation_frame="Year",
                    width=_WIDTH, height=600,
                    color_discrete_sequence=_COLOR_SCHEME,
                    title="Distribution of formal education of tech workers",
                    category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
                    "Formal Education": ["I prefer not to answer",
                    "No formal education past high school",
                    "Some college/university study without<br>earning a bachelor's degree",
                    "Bachelor's degree", "Master's degree", "Doctoral degree",
                    "Professional degree"],
                    "Year": _YEARS})  # all survey years, including 2022
fig.update_xaxes(type='category')
fig.update_yaxes(title="Percentage of respondents (%)")
fig.show()

The technical profile of people working in tech in India is quite diverse and constantly evolving.

Overall, the technical profile of people working in tech in India is constantly evolving as new technologies and tools emerge. The ability to adapt to new tools and techniques is becoming increasingly important for tech professionals in India as the industry continues to evolve at a rapid pace.

In [30]:
percentage_df = analysis_df[["Year", "Country", "Formal Education"]]

choices = ["Master's degree", "Doctoral degree",
"Bachelor's degree", "Professional degree"]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = percentage_df,
        multiple_choice_column = "Formal Education",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df.groupby(["Country", "Year"], as_index=False).sum()

fig = px.bar(percentage_df,
             x="Year", y="Percentage of Respondents",
             color="Country",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Percentage of respondents with university degree",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,100])
fig.update_yaxes(title="Percentage of respondents (%)")
fig.update_xaxes(type='category')
fig.show()

We can see that the percentage of participants that hold a university degree decreased from 2021 to 2022 in all four countries. However, we also note that Japan is the country where the fewest participants have declared having a university degree in the last two years, followed by China.

In [31]:
choices = [
    "resulting in a university degree"]

re_expressions = [fr"\b{choices[0]}\b"]

percentage_df = analysis_df[["Year", "Country", "Learning Platforms"]]
percentage_df = percentage_df[percentage_df["Year"].isin([2019,2020,2021,2022])]
percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = percentage_df,
        multiple_choice_column = "Learning Platforms",
        choices=choices,
        re_expressions=re_expressions)

fig = px.bar(percentage_df,
             x="Year", y="Percentage of Respondents",
             color="Country",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Percentage of respondents with university degree in Data Science or related field",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,35])
fig.update_yaxes(title="Percentage of respondents (%)")
fig.update_xaxes(type='category')
fig.show()

From the plot above we can observe that India has a smaller percentage of respondents with a university degree in Data Science or related fields than the U.S. and China over the last 4 years. The U.S. is the country with the largest percentage in the same period, followed by China and India.

In [32]:
callout_str = "India is the country with the fewest tech workers with a university degree in Data Science or related fields."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

However, it seems that the proportion of tech workers who hold a university degree and have published academic papers is larger in Japan:

In [33]:
papers_df = analysis_df[analysis_df["Year"] == 2022]
papers_df = papers_df[["Country", "Published Papers"]]

fig = px.histogram(papers_df,
                    x="Country", color="Published Papers", barmode="group", histfunc="count",
                    barnorm="percent",
                    width=_WIDTH, height=600,
                    color_discrete_sequence=_COLOR_SCHEME,
                    title="Percentage of respondents with university degree that have/have not published papers",
                    category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})
fig.update_xaxes(type='category')
fig.update_layout(legend_title="Have they published<br>academic papers?")
fig.update_yaxes(title="Percentage of respondents (%)")
fig.show()

However, the majority of people that have published academic papers in India did not make use of machine learning in their research, while in other countries, the majority of people that published papers did:

In [34]:
papers_df = analysis_df[analysis_df["Year"] == 2022]
papers_df = papers_df[["Country", "ML Research"]]

papers_df = papers_df.dropna(subset=["ML Research"])
papers_df = papers_df[papers_df["ML Research"].isin(["No", "Yes, theoretical research", "Yes, applied research", "Yes, both theoretical<br>and applied research"])]

fig = px.histogram(papers_df,
                    x="Country", color="ML Research", barmode="stack", histfunc="count",
                    barnorm="percent",
                    width=_WIDTH, height=500,
                    color_discrete_sequence=_COLOR_SCHEME,
                    title="Did their research make use of machine learning?",
                    category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})
fig.update_xaxes(type='category')
fig.update_layout(yaxis_range=[0,100], legend_title="Respondents' answers")
fig.update_yaxes(title="Percentage of respondents (%)")
fig.show()

In [35]:
callout_str = "The proportion of respondents that published academic papers in India is the largest, but the majority of their research did not make use of machine learning."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

# How much coding experience do people working in tech have?

The coding experience of people working in tech in India varies depending on a number of factors such as their job role, industry, and level of experience. However, in general, India has a large and growing pool of skilled software developers who are proficient in a wide range of programming languages and frameworks.

One of the strengths of the Indian tech industry is the strong emphasis on education and training. Many tech professionals in India hold advanced degrees in computer science or related fields, and there is a large network of coding bootcamps, online courses, and training programs that help individuals develop their coding skills.

In terms of the coding experience itself, many tech companies in India provide a collaborative and supportive environment that encourages learning and development. Agile methodologies, such as Scrum, are commonly used in Indian tech companies to promote collaboration and teamwork among developers.

That said, like in any industry, there are also challenges that come with working as a software developer in India. Long hours, tight deadlines, and high stress levels are common in many tech companies, and there can be a high level of competition for jobs and promotions. However, with the right skills, training, and experience, many software developers in India are able to build rewarding and successful careers in the tech industry.

In [36]:
fig = px.histogram(analysis_df,
                    x="Country", color="Coding Experience", barmode="stack", histfunc="count",
                    barnorm="percent", animation_frame="Year",
                    width=_WIDTH, height=600,
                    color_discrete_sequence=_COLOR_SCHEME,
                    title="Coding experience groups distribution per country and year",
                    category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
                    "Coding Experience": ["I have never written code",
                    "< 1 year", "1-5 years",
                    "5-10 years", "10-20 years", "20+ years"],
                    "Year": _YEARS})  # all survey years, including 2022
fig.update_xaxes(type='category')
fig.update_yaxes(title="Percentage of respondents (%)")
fig.show()

We can see that, in 2019, less than 25% of the respondents in India had at least 5 years of coding experience. However, since 2020, this percentage has been around 50%, and the profile of tech workers in India in terms of coding experience seems to be somewhat similar to the one found in the U.S. In both countries, we see that more than 80% of the respondents have at least 1 year of coding experience, and that more than 20% of the respondents have more than 20 years of coding experience.

In [37]:
callout_str = "The proportion of respondents that are more experienced in coding is large in India compared to other countries."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

But is coding experience a valued asset in India? According to the following plot, this may not be the case.

In [38]:
avg_data = get_salary_avg_grouped_by_year_country_and_extra_variable(analysis_df, "Coding Experience")

fig = px.bar(avg_data,
             x="Country", y="Salary",
             color="Coding Experience",
             animation_frame="Year",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Average compensation of tech workers by coding experience, country and year",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
                              "Coding Experience": ["I have never written code", "< 1 year", "1-5 years", "5-10 years", "10-20 years", "20+ years"],
                              "Year": _YEARS})  # range(2018, 2022) stopped at 2021 and left out 2022

fig.update_layout(yaxis_range=[0,43000])
fig.update_yaxes(title="Average yearly compensation (number of Big Macs)")
fig.update_xaxes(type='category')
fig.show()

We see that, in the U.S. and China in 2021 and 2022, compensation seems to increase with coding experience. However, in Japan, this relation between compensation and experience does not seem to hold consistently. For example, in 2022, the salary of people with 5-10 years of experience was larger than the salary of people with 20+ years of experience, whereas, in 2020, compensation did seem to increase with the years of experience. Thus, for Japan, it is not possible to confirm or deny any relation that coding experience may have with compensation.
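Whether compensation rises with each experience group can be tested mechanically rather than by eye. Below is a minimal sketch with made-up averages. `is_non_decreasing` is a hypothetical helper, and the numbers are invented for illustration, not taken from the survey.

```python
# Experience groups in increasing order, as used in the plots.
EXPERIENCE_ORDER = ["I have never written code", "< 1 year", "1-5 years",
                    "5-10 years", "10-20 years", "20+ years"]

def is_non_decreasing(avg_salary_by_group: dict) -> bool:
    """True if the average salary never drops from one group to the next."""
    values = [avg_salary_by_group[group]
              for group in EXPERIENCE_ORDER if group in avg_salary_by_group]
    return all(a <= b for a, b in zip(values, values[1:]))

# Invented averages (in Big Macs) for two placeholder country/year pairs.
rising = {"< 1 year": 5000, "1-5 years": 9000, "5-10 years": 15000,
          "10-20 years": 21000, "20+ years": 26000}
mixed = {"< 1 year": 4000, "1-5 years": 7000, "5-10 years": 12000,
         "10-20 years": 11000, "20+ years": 9000}

print(is_non_decreasing(rising))  # True
print(is_non_decreasing(mixed))   # False
```

A check like this only tests whether the averages are ordered. It says nothing about proportionality or statistical significance, which would need the underlying sample sizes.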




In [39]:
# get only data for 2022
role_df = analysis_df[analysis_df["Year"] == 2022]
role_df = role_df[["Title", "Country", "Coding Experience"]]

# remove China because there are not many respondents per role
role_df = role_df[role_df["Country"].isin(["India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"])]
role_df = role_df.dropna(subset=["Title"])

# keep only the roles that have at least 15 respondents in a given country
tmp_df = role_df.groupby(
    ["Country", "Title"]).size().reset_index(name="Count")
tmp_df = tmp_df[tmp_df["Count"] >= 15] # drop (country, title) groups with fewer than 15 respondents

role_df["Keep"] = False
for c,title in zip(tmp_df["Country"].values, tmp_df["Title"].values):
    data_filter = (role_df["Country"]==c) & (role_df["Title"]==title)
    role_df.loc[data_filter,"Keep"] = True

role_df = role_df[role_df["Keep"] == True]

# remove the roles that have no respondents in some countries
role_df = role_df[role_df["Title"] != "Statistician"]
role_df = role_df[role_df["Title"] != "Data Architect"]
role_df = role_df[role_df["Title"] != "Data Administrator"]

title_values = list(set(role_df["Title"].values))
title_values = sorted(title_values)
fig = px.histogram(role_df,
                    x="Title", color="Coding Experience", barmode="stack", histfunc="count",
                    barnorm="percent", animation_frame="Country",
                    width=_WIDTH, height=800,
                    color_discrete_sequence=_COLOR_SCHEME,
                    title="Coding experience in the context of job titles in 2022",
                    category_orders={"Country": ["India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
                    "Coding Experience": ["I have never written code",
                    "< 1 year", "1-5 years",
                    "5-10 years", "10-20 years", "20+ years"],
                    "Title":title_values})
fig.update_xaxes(type='category')
fig.update_yaxes(title="Percentage of respondents (%)")
sliders = [dict(y=-0.2)]
updatemenus = [dict(y=-0.2)]
fig.update_layout(sliders=sliders, updatemenus=updatemenus)
fig.show()

In Japan, the group of workers with the largest percentage of people with more than 20 years of coding experience is teachers and professors, where they make up 50% of the group. In the U.S., it is Software Engineers, of whom 41.3% have more than 20 years of coding experience.

In the U.S., about 60% of Data Engineers, about 70% of Data Scientists, almost 80% of Machine Learning Engineers, about 70% of Research Scientists, and more than 80% of Software Engineers have at least 5 years of coding experience. In Japan, we see that the equivalent percentages are lower for all job titles and, in India, they are even lower.
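The minimum-group-size filter applied before this plot (keeping only country/title pairs with enough respondents) can also be written without an explicit loop. This is a sketch on made-up data, not the notebook's code: `toy_df` and `MIN_RESPONDENTS` are hypothetical, and the threshold is lowered to 2 so the toy example stays small.

```python
import pandas as pd

# Made-up respondents for two placeholder countries.
toy_df = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Title": ["Data Scientist", "Data Scientist", "Statistician",
              "Data Scientist", "Statistician"],
})

MIN_RESPONDENTS = 2  # the notebook uses 15

# transform("size") returns, for every row, the size of its (Country, Title) group.
group_size = toy_df.groupby(["Country", "Title"])["Title"].transform("size")
filtered = toy_df[group_size >= MIN_RESPONDENTS]
print(filtered)
```

Because the result is aligned with the original rows, no temporary `Keep` column is needed.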

In [40]:
callout_str = "In Japan, teachers and professors constitute the group of workers with the largest percentage of people with more than 20 years of coding experience. In the U.S., that would be the group of Software Engineers."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

# Are people with Machine Learning experience working in tech?

Another skill of tech workers that we should investigate is their experience with Machine Learning (ML), especially because the "Shortage of professionals with knowledge of cutting-edge technologies", such as AI, has been pointed out as one of the potential causes of Japan being behind in the innovation race.

We can see below that, over the last 3 years, the percentage of workers in Japan that have less than one year of experience with ML, or no experience with it, is not much different from that in the U.S. and China, and is smaller than in India. This suggests that professionals in Japan may have about as much ML experience as professionals working in AI-leading countries.

In [41]:
fig = px.histogram(analysis_df[analysis_df["Year"].isin([2022,2021,2020])],
                    x="Country", color="ML Experience", barmode="stack", histfunc="count",
                    barnorm="percent", animation_frame="Year",
                    width=_WIDTH, height=600,
                    color_discrete_sequence=_COLOR_SCHEME,
                    title="Years of experience with ML",
                    category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
                                    "ML Experience": ["I do not use machine<br>learning methods",
                                    "< 1 year", "1-2 years", "2-3 years", "3-4 years", "4-5 years",
                                    "5-10 years", "10-20 years", "20+ years"],
                                    "Year": _YEARS})  # range(2018, 2022) stopped at 2021 and left out 2022
fig.update_xaxes(type="category")
fig.update_yaxes(title="Percentage of respondents (%)")
fig.update_layout(legend_title="Experience with ML")
fig.show()

However, since we are comparing the profile of workers in terms of percentages of respondents in each country, this does not mean that Japan is not suffering from a shortage of professionals with knowledge of ML. It simply means that the tech workers who are already working in Japan are likely to have knowledge of and experience in ML.
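The distinction between shares and absolute numbers can be made concrete with a small example. The sketch below uses invented respondent lists. `experienced_share_and_count` is a hypothetical helper, and the two "countries" are placeholders.

```python
def experienced_share_and_count(labels, experienced_groups):
    """Return (percentage, absolute count) of respondents in the given groups."""
    count = sum(label in experienced_groups for label in labels)
    return 100 * count / len(labels), count

EXPERIENCED = {"1-2 years", "2-3 years", "5-10 years"}

# Invented data: both places have 60% experienced respondents,
# but one has ten times as many respondents as the other.
small_country = ["1-2 years"] * 6 + ["< 1 year"] * 4
large_country = ["1-2 years"] * 60 + ["< 1 year"] * 40

small_share, small_count = experienced_share_and_count(small_country, EXPERIENCED)
large_share, large_count = experienced_share_and_count(large_country, EXPERIENCED)

print(small_share, small_count)
print(large_share, large_count)
```

Identical percentages can therefore hide very different head counts, which is why a normalized chart alone cannot rule out a shortage.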

Assuming there is indeed a lack of workers with knowledge of ML, one way to attract such workers to Japan would be to raise their compensation as a way of valuing that knowledge. However, the following plot shows that, in Japan in 2022, the average compensation of people with no knowledge of ML is actually higher than that of people with experience in ML, regardless of the amount of experience. In the U.S. and India in 2022, people with no experience in ML earn more only than people with less than one year of ML experience. People with more than one year of experience earn more than people with none, and compensation tends to increase with experience.

In [42]:
avg_data = get_salary_avg_grouped_by_year_country_and_extra_variable(
    analysis_df, "ML Experience")

fig = px.bar(avg_data[avg_data["Year"].isin([2022,2021,2020])],
             x="Country", y="Salary",
             color="ML Experience",
             animation_frame="Year",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="How is experience in ML related to compensation?",
            category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
                            "ML Experience": ["I do not use machine<br>learning methods",
                            "< 1 year", "1-2 years", "2-3 years", "3-4 years", "4-5 years",
                            "5-10 years", "10-20 years", "20+ years"],
                            "Year": _YEARS})  # range(2018, 2022) stopped at 2021 and left out 2022

fig.update_layout(yaxis_range=[0,55000], legend_title="Experience in ML")
fig.update_yaxes(title="Average yearly compensation (number of Big Macs)")
fig.update_xaxes(type='category')
fig.show()

In [43]:
callout_str = "In the Japan of 2022, the salary of workers with no experience in ML is, on average, higher than the salary of workers with experience, regardless of the amount of experience."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

# What kind of activities are important in the job?

Since experience in ML does not seem to be valued in Japan, we would expect activities related to machine learning not to be considered important in the workplace. Let's take a look at the respondents' answers regarding the important activities in their workplace and find out.

In [44]:
def make_percentage_bar_plot(
    percentage_df: pd.DataFrame,
    color_column: str,
    yaxis_range: List[float],
    title: str,
    labels: Dict[str, str] = None,
    legend_title: str = None)->None:

    if labels is not None:
        category_orders = {
            "Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
            color_column: list(labels.values()),
            "Year": _YEARS}  # range(2018, 2022) stopped at 2021 and left out 2022
    else:
        category_orders = {
            "Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
            "Year": _YEARS}

    fig = px.bar(percentage_df,
             x="Country", y="Percentage of Respondents",
             color=color_column,
             animation_frame="Year",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title=title,
             category_orders=category_orders)

    fig.update_layout(yaxis_range=yaxis_range)
    if legend_title is not None:
        fig.update_layout(legend_title=legend_title)
    fig.update_yaxes(title="Percentage of respondents (%)")
    fig.update_xaxes(type='category')
    fig.show()

In [45]:
choices = [
    "Analyze and understand data to influence product or business decisions",
    "Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data",
    "Build prototypes to explore applying machine learning to new areas",
    "Build and/or run a machine learning service that operationally improves my product or workflows",
    "Experimentation and iteration to improve existing ML models",
    "Do research that advances the state of the art of machine learning",
    "None of these activities are an important part of my role at work",
    "Other"]

labels ={
    "Analyze and understand data to influence product or business decisions": "Analyze data to influence<br>product or business decisions",
    "Build and/or run the data infrastructure that my business uses for storing, analyzing, and operationalizing data": "Build and/or run the data<br>infrastructure that my business<br>uses",
    "Build prototypes to explore applying machine learning to new areas": "Build prototypes to explore<br>applying ML to new areas",
    "Build and/or run a machine learning service that operationally improves my product or workflows":"Build and/or run a ML service<br>that operationally improves<br>my product or workflows",
    "Experimentation and iteration to improve existing ML models": "Experimentation and iteration to<br>improve existing ML models",
    "Do research that advances the state of the art of machine learning": "Research to advance the<br>state of the art of ML",
    "None of these activities are an important part of my role at work": "None of the activities above",
    "Other": "Other"
}

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Important Activities"]],
        multiple_choice_column = "Important Activities",
        choices=choices,
        re_expressions=re_expressions)

for key, value in labels.items():
    percentage_df = percentage_df.replace(key, value)

make_percentage_bar_plot(
    percentage_df=percentage_df,
    color_column= "Important Activities",
    yaxis_range=[0,65],
    title="Activities that make up an important part of respondents' role at work",
    labels=labels,
    legend_title="Activities")

We can see that, over the years, Japan is the country where the largest percentage of participants answered that "None of the activities above" are important in their role at work. This indicates that a larger percentage of tech workers in Japan are not involved in activities related to data or machine learning compared to other countries, which may suggest a shortage of such professionals and/or a shortage of jobs for such professionals.

If we take a look at the percentages of each activity, we will see that, from 2019 to 2022, the percentage of respondents in Japan that answered "Analyze data to influence product or business decisions" or "Build and/or run the data infrastructure that my business uses" was consistently lower than in other countries. This suggests a lack of people working with data (e.g. Data Scientists, Data Analysts, etc) in Japan. In the same period, we see that the percentage of people in Japan that answered "Build and/or run a ML service that operationally improves my product or workflows" or "Experimentation and iteration to improve existing ML models" or "Research to advance the state of the art of ML" was also consistently lower compared to other countries. This suggests that, in Japan, there may be a lack of jobs to use ML to improve products and workflows in the companies, and that there may be a lack of people involved in discovering and improving ML methods.
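For readers who want to reproduce percentages like these, the sketch below shows one way to compute the share of respondents who selected each option of a multi-select question. It is not the notebook's `get_percentage_of_multiple_choices_item` helper: `percentage_per_choice` and `toy_df` are hypothetical, and the sketch assumes that all selected options are stored in one delimited string per respondent and that the denominator is every respondent in the (year, country) group.

```python
import re
import pandas as pd

# Made-up answers for two placeholder countries.
toy_df = pd.DataFrame({
    "Year": [2022, 2022, 2022, 2022],
    "Country": ["A", "A", "B", "B"],
    "Important Activities": [
        "Analyze and understand data;Other",
        "None of these activities",
        "Analyze and understand data",
        "Analyze and understand data;Other",
    ],
})

def percentage_per_choice(df: pd.DataFrame, column: str, choices: list) -> pd.DataFrame:
    """Percentage of respondents per (Year, Country) whose answer contains each choice."""
    rows = []
    for choice in choices:
        # re.escape keeps characters such as "(", ")" or "+" from acting as regex syntax.
        selected = df[column].fillna("").str.contains(re.escape(choice), regex=True)
        pct = selected.groupby([df["Year"], df["Country"]]).mean() * 100
        for (year, country), value in pct.items():
            rows.append({"Year": year, "Country": country,
                         column: choice, "Percentage of Respondents": value})
    return pd.DataFrame(rows)

result = percentage_per_choice(
    toy_df, "Important Activities", ["Analyze and understand data", "Other"])
print(result)
```

Because respondents can tick several options, the percentages for one country and year do not need to add up to 100.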

In [46]:
callout_str = "Compared to other countries, there seems to be a shortage of professionals that work with data and machine learning and/or a shortage of jobs for such professionals in Japan."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

# Are people writing code?

One of the most important skills of tech workers is arguably coding. Although we have seen that the majority of tech workers have coding experience, it is worth asking what percentage of these experienced tech workers actually code on a regular basis, and which programming languages they use. The following plot shows the percentage of tech workers with coding experience that **are not** coding on a regular basis, followed by a plot showing which languages are often used by the people that do code on a regular basis.

In [47]:
choices = ["None"]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Languages"]],
        multiple_choice_column = "Languages",
        choices=choices,
        re_expressions=re_expressions)

fig = px.bar(percentage_df,
             x="Year", y="Percentage of Respondents",
             color="Country",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Percentage of respondents with coding experience that do not code on a regular basis",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,3.5])
fig.update_yaxes(title="Percentage of respondents (%)")
fig.update_xaxes(type='category')
fig.show()

In [48]:
choices = [
        "Python",
        "R",
        "SQL",
        "C",
        "C++",
        "Java",
        "Javascript",
        "Julia",
        "Bash",
        "MATLAB",
        "None",
        "Other",
        "C#"]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

re_expressions[4] = fr"C\+\+"
re_expressions[-1] = fr"C#"

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Languages"]],
        multiple_choice_column = "Languages",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Languages"] != "None"]

make_percentage_bar_plot(
    percentage_df=percentage_df,
    color_column= "Languages",
    yaxis_range=[0,90],
    title="Programming languages that are used on a regular basis",
    legend_title="Programming languages")

There are a few things to note from the plots above:
- People with coding experience often code on a regular basis (the percentage of people with experience that do not code is fairly low across countries, being at most 3%). In Japan, from 2018 to 2022, less than 0.8% of people with coding experience did not code on a regular basis. This is much less than what is seen in the U.S., for example.
- In Japan, over the last 5 years, the percentage of people that use Python was larger than the percentage of people using the same programming language in India and the U.S. However, in these two countries, the percentage of participants that use SQL (and R, except for India in 2022) was larger than in Japan for the same period. This could be related to our previous finding that there is a lack of professionals working with data, and to our observation that Japan is the country with the fewest tech workers holding a university degree in data science or a related field.
- The percentage of participants that use C and C++ in Japan is consistently larger than in the U.S., and, from 2019 to 2022, the popularity of these languages kept increasing. The popularity of these languages, which are often used for hardware-related applications, could be an indication that there is still a focus on hardware as opposed to software in Japan.
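One caveat when reading the C and C++ numbers: a word-boundary pattern such as `\bC\b` also matches the "C" inside "C++" and "C#", because `+` and `#` are not word characters. The sketch below illustrates this on made-up answers. It assumes, without confirmation from the helper defined earlier in the notebook, that each pattern is applied as a regex search over a delimited answer string; `naive_c` and `strict_c` are hypothetical names.

```python
import re

# Made-up multi-select answers, one delimited string per respondent.
answers = ["Python;C++", "C;Python", "C#", "R;SQL"]

naive_c = re.compile(r"\bC\b")            # pattern built as fr"\b{choice}\b"
strict_c = re.compile(r"\bC\b(?![+#])")   # reject a "C" followed by "+" or "#"

naive_hits = [a for a in answers if naive_c.search(a)]
strict_hits = [a for a in answers if strict_c.search(a)]

print(naive_hits)   # includes the C++ and C# answers
print(strict_hits)  # only the answer that really lists C

# For literal option names, re.escape avoids hand-escaping special characters.
assert re.search(re.escape("C++"), "Python;C++") is not None
```

If the data is stored this way, the share of C users in the plot would include C++ and C# users, so it may be worth confirming how the helper matches options before drawing conclusions about C specifically.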

In [49]:
callout_str = "The low popularity of SQL and the increased popularity of C and C++ in Japan compared to other countries could be an indication that there is a preference to focus on hardware-related problems as opposed to data-related ones."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

In [50]:
choices = [
"Kaggle Notebooks",
"Colab Notebooks",
"None",
"Other"]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

re_expressions[0] = fr"\bKaggle Notebooks\b|\bKaggle Kernels\b"
re_expressions[1] = fr"\bColab Notebooks\b|\bGoogle Colab\b"

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Notebooks"]],
        multiple_choice_column = "Notebooks",
        choices=choices,
        re_expressions=re_expressions)

make_percentage_bar_plot(
    percentage_df=percentage_df,
    color_column= "Notebooks",
    yaxis_range=[0,50],
    title="Percentage of tech workers that use hosted notebook products")

In [51]:
callout_str = "The large percentage of Kaggle notebook users in Japan may be an indication of a strong Kaggle community in that country."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

# Are people visualizing data?

Besides coding, another skill that is valued in the tech workplace is data visualization. Although, so far, we have seen some factors that could indicate a less data-oriented tech community in Japan, the following plot shows that the majority of tech workers with coding experience in Japan have been using at least Matplotlib from 2018 to 2022.

In [52]:
choices = [
"Matplotlib",
"Seaborn",
"Plotly / Plotly Express",
"Ggplot / ggplot2",
"Shiny",
"D3 js",
"Altair",
"Bokeh",
"Geoplotlib",
"Leaflet / Folium",
# "Pygal",
# "Dygraphs",
# "Highcharter",
"None",
"Other"]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

re_expressions[2] = fr"\bPlotly\b"
re_expressions[3] = fr"\bggplot\b|ggplot2"
re_expressions[5] = fr"D3"
re_expressions[9] = fr"\bLeaflet\b|\bFolium\b"

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Visualization Tools"]],
        multiple_choice_column = "Visualization Tools",
        choices=choices,
        re_expressions=re_expressions)

make_percentage_bar_plot(
    percentage_df=percentage_df,
    color_column= "Visualization Tools",
    yaxis_range=[0,75],
    title="Visualization libraries used on a regular basis by tech workers with coding experience")

From the plot above, we also note that the percentage of people using Matplotlib or Seaborn in Japan is comparable to the corresponding percentages in India and the U.S. between 2018 and 2022. However, the percentage of people using tools such as Plotly, ggplot and Shiny is smaller in Japan than in these two countries for the same period. The lack of popularity of ggplot and Shiny in Japan is understandable given the lack of popularity of the R language itself, as we have seen in the previous analysis.

The lower popularity of Plotly in Japan compared to other countries could be due to several reasons, such as a possible resistance to adopting newer tools (note that Plotly was released in 2013, while Matplotlib has been around since 2003), a lack of interest in plot interactivity and aesthetics, and/or a lack of interest in diversifying and exploring new visualization tools. However, it is not possible to determine the reason for this lack of popularity based solely on the data that we have.

# Are people using Machine Learning methods?

Another important question to be asked is whether people are using machine learning methods, which type of methods they are using and which type of frameworks they use on a regular basis to define, train and evaluate their models and methods. We start by taking a look at the frameworks used by tech workers.

In [53]:
choices = [
"Scikit-learn",
"TensorFlow",
"Keras",
"PyTorch",
"Fast.ai",
"Xgboost",
"LightGBM",
"CatBoost",
"Caret",
"Tidymodels",
"JAX",
"PyTorch Lightning",
"Huggingface",
"None",
"Other"
]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

re_expressions[0] = fr"\bScikit-learn\b|\bScikit-Learn\b"
re_expressions[4] = fr"\bFastai\b|\bFast.ai\b"
re_expressions[6] = fr"\bLightGBM\b|\blightgbm\b"
re_expressions[7] = fr"\bCatBoost\b|\bcatboost\b"

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "ML Frameworks"]],
        multiple_choice_column = "ML Frameworks",
        choices=choices,
        re_expressions=re_expressions)

make_percentage_bar_plot(
    percentage_df=percentage_df,
    color_column= "ML Frameworks",
    yaxis_range=[0,65],
    title="Machine Learning frameworks that are used on a regular basis by tech workers")

We see that, in Japan, the percentage of people that do not use any machine learning framework is one of the lowest compared to other countries over the years. We also see that a large percentage of people use scikit-learn, TensorFlow, Keras, PyTorch, XGBoost and LightGBM compared to other countries from 2018 to 2022. This, combined with the previous analysis, could be an indication that, although workers in Japan do not analyze data or use ML methods in their jobs as much as in other countries, they do use ML methods in their free time. The fact that they tend to use Kaggle notebooks more than in other countries may be a sign that people working in Japan often spend their free time being active in the Kaggle community, and using ML models in the notebooks that they create in Kaggle.

In [54]:
callout_str = "There may be a tendency for tech workers in Japan to train ML models in their free time."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

In [55]:
choices = [
    "Linear or Logistic Regression",
    "Decision Trees or Random Forests",
    "Gradient Boosting Machines (xgboost, lightgbm, etc)",
    "Bayesian Approaches",
    "Evolutionary Approaches",
    "Dense Neural Networks (MLPs, etc)",
    "Convolutional Neural Networks",
    "Generative Adversarial Networks",
    "Recurrent Neural Networks",
    "Transformer Networks (BERT, gpt-3, etc)",
    "Autoencoder Networks (DAE, VAE, etc)",
    "Graph Neural Networks",
    "None",
    "Other",
]

labels ={
    "Linear or Logistic Regression":"Linear or Logistic Regression",
    "Decision Trees or Random Forests":"Decision Trees or Random<br>Forests",
    "Gradient Boosting Machines (xgboost, lightgbm, etc)":"Gradient Boosting Machines",
    "Bayesian Approaches":"Bayesian Approaches",
    "Evolutionary Approaches":"Evolutionary Approaches",
    "Dense Neural Networks (MLPs, etc)":"Dense Neural Networks",
    "Convolutional Neural Networks":"Convolutional Neural<br>Networks",
    "Generative Adversarial Networks":"Generative Adversarial<br>Networks",
    "Recurrent Neural Networks":"Recurrent Neural Networks",
    "Transformer Networks (BERT, gpt-3, etc)":"Transformer Networks",
    "Autoencoder Networks (DAE, VAE, etc)":"Autoencoder Networks",
    "Graph Neural Networks":"Graph Neural Networks",
    "None":"None",
    "Other":"Other",
}

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

re_expressions[2] = fr"\bGradient Boosting Machines\b"
re_expressions[5] = fr"\bDense Neural Networks\b"
re_expressions[9] = fr"\bTransformer Networks\b"
re_expressions[10] = fr"\bAutoencoder Networks\b"

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "ML Algorithms"]],
        multiple_choice_column = "ML Algorithms",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"] != 2018]

for key, value in labels.items():
    percentage_df = percentage_df.replace(key, value)

make_percentage_bar_plot(
    percentage_df=percentage_df,
    color_column= "ML Algorithms",
    yaxis_range=[0,65],
    title="Machine Learning algorithms that are used on a regular basis by tech workers",
    labels=labels)

"Linear or Logistic Regression", "Decision Trees or Random Forests" and "Bayesian Approaches" seem to be consistently less popular in Japan than in the U.S. and China between 2019 and 2022. However, methods such as "Dense Neural Networks", "Convolutional Neural Networks" and "Transformer Networks" seem to be consistently more popular in Japan than in these two countries over the same period.



In [56]:
callout_str = "The popularity of data-hungry algorithms and the lack of popularity of methods with fewer parameters in Japan, combined with our previous observations, could be an indication that workers in Japan are not using ML in the industry as much as workers in other countries, but they are researching and experimenting in their free time."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

# Are people using other Deep Learning methods?

We have seen that tech workers in Japan are using ML methods, in particular methods that often have a large number of parameters (e.g. neural networks) and that could be considered "deep learning". But are they also using other types of deep learning methods, such as Computer Vision and Natural Language Processing methods? We start by taking a look at Computer Vision methods. The choices given to the respondents (except "General purpose image/video tools", "None" and "Other") are all often implemented with deep learning methods and, in fact, the state of the art in all these method categories is made up of deep learning models.

In [57]:
choices = [
    "General purpose image/video tools (PIL, cv2, skimage, etc)",
    "Image segmentation methods (U-Net, Mask R-CNN, etc)",
    "Object detection methods (YOLOv6, RetinaNet, etc)",
    "Image classification and other general purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc)",
    "Vision transformer networks (ViT, DeiT, BiT, BEiT, Swin, etc)",
    "Generative Networks (GAN, VAE, etc)",
    "None",
    "Other"
]

labels = {
    "General purpose image/video tools (PIL, cv2, skimage, etc)":"General purpose image/video<br>tools (PIL, cv2, skimage, etc)",
    "Image segmentation methods (U-Net, Mask R-CNN, etc)":"Image segmentation methods",
    "Object detection methods (YOLOv6, RetinaNet, etc)":"Object detection methods",
    "Image classification and other general purpose networks (VGG, Inception, ResNet, ResNeXt, NASNet, EfficientNet, etc)":"Image classification and other<br>general purpose networks",
    "Vision transformer networks (ViT, DeiT, BiT, BEiT, Swin, etc)":"Vision transformer networks",
    "Generative Networks (GAN, VAE, etc)":"Generative Networks",
    "None":"None",
    "Other":"Other"
}

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

re_expressions[0] = fr"\bGeneral purpose image/video tools\b"
re_expressions[1] = fr"\bImage segmentation methods\b"
re_expressions[2] = fr"\bObject detection methods\b"
re_expressions[3] = fr"\bImage classification and other general purpose networks\b"
re_expressions[4] = fr"\bVision transformer networks\b"
re_expressions[5] = fr"\bGenerative Networks\b"

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Computer Vision Methods"]],
        multiple_choice_column = "Computer Vision Methods",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"] != 2018]

for key, value in labels.items():
    percentage_df = percentage_df.replace(key, value)

make_percentage_bar_plot(
    percentage_df=percentage_df,
    color_column= "Computer Vision Methods",
    yaxis_range=[0,28],
    title="Computer Vision methods that are used on a regular basis by tech workers",
    labels=labels)

In [58]:
callout_str = "Workers in Japan seem to be interested in Computer Vision methods and they tend to keep updated on the state-of-the-art in that field."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

The following plot shows the participants' answers regarding Natural Language Processing (NLP). In this case, all the options given to the participants, except "None" and "Other", are frequently implemented with deep learning models.

In [59]:
choices = [
    "Word embeddings/vectors (GLoVe, fastText, word2vec)",
    "Encoder-decoder models (seq2seq, vanilla transformers)",
    "Contextualized embeddings (ELMo, CoVe)",
    "Transformer language models (GPT-3, BERT, XLnet, etc)",
    "None",
    "Other",
]

labels = {
    "Word embeddings/vectors (GLoVe, fastText, word2vec)":"Word embeddings/vectors",
    "Encoder-decoder models (seq2seq, vanilla transformers)":"Encoder-decoder models",
    "Contextualized embeddings (ELMo, CoVe)":"Contextualized embeddings",
    "Transformer language models (GPT-3, BERT, XLnet, etc)":"Transformer language models",
    "None":"None",
    "Other":"Other",
}

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

re_expressions[0] = fr"\bWord embeddings/vectors\b"
re_expressions[1] = fr"\bEncoder-decoder models\b|\bEncoder-decorder models\b"
re_expressions[2] = fr"\bContextualized embeddings\b"
re_expressions[3] = fr"\bTransformer language models\b"

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "NLP Methods"]],
        multiple_choice_column = "NLP Methods",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"] != 2018]

for key, value in labels.items():
    percentage_df = percentage_df.replace(key, value)

make_percentage_bar_plot(
    percentage_df=percentage_df,
    color_column= "NLP Methods",
    yaxis_range=[0,20],
    title="Natural Language Processing methods that are used on a regular basis by tech workers",
    labels=labels)

In [60]:
callout_str = "The interest in transformer language models has been increasing since 2019 in Japan."
HTML(f"{_CALLOUT_PREFIX}<center>{callout_str}</center></h4></div>")

# What kind of hardware are people using to train Machine Learning models?

We have seen that, in India, workers are training ML and DL models, but what kind of hardware are they using for that? The next two plots show, respectively, the percentage of participants in each country that use GPUs and TPUs, and the number of times that these participants have used TPUs.
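The second plot stacks the TPU-usage categories and normalizes them to percentages within each country. As a rough illustration of that normalization outside of Plotly, the sketch below uses `pd.crosstab` on made-up toy rows (not survey results); only the column names are taken from the notebook:

```python
import pandas as pd

# Made-up toy data, not survey results; column names mirror the notebook.
toy_df = pd.DataFrame({
    "Country": ["India", "India", "India", "India", "Japan", "Japan"],
    "TPU": ["Never", "Never", "Once", "2-5 times", "Never", "Once"],
})

# Share of each TPU-usage category within each country, in percent.
tpu_share = pd.crosstab(toy_df["Country"], toy_df["TPU"], normalize="index") * 100
print(tpu_share.round(1))
```

Each row of `tpu_share` sums to 100, which is what a stacked bar with percentage normalization displays per country.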

In [61]:
choices = [
"GPUs",
"TPUs",
# "IPUs",
# "WSEs",
# "RDUs",
# "Trainium Chips",
# "Inferentia Chips",
"None",
"Other",
]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "ML Hardware"]],
        multiple_choice_column = "ML Hardware",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"] != 2018]

make_percentage_bar_plot(
    percentage_df=percentage_df,
    color_column= "ML Hardware",
    yaxis_range=[0,60],
    title="Type of hardware used to train ML models",
    legend_title="Hardware type")

In [62]:
fig = px.histogram(analysis_df,
                    x="Country", color="TPU", barmode="stack", histfunc="count",
                    barnorm="percent", animation_frame="Year",
                    width=_WIDTH, height=600,
                    color_discrete_sequence=_COLOR_SCHEME,
                    title="TPU usage",
                    category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
                    "TPU":["Never", "Once", "2-5 times", "6-25 times", "> 25 times"]})
fig.update_xaxes(type="category")
fig.update_yaxes(title="Percentage of respondents (%)")
fig.update_layout(legend_title="Number of times")
fig.show()

# What kind of products and tools related to ML are people using?

To help with training, monitoring and evaluating ML models, there are several tools that one can use. We now analyze whether workers in the selected countries use these tools and how their usage differs across countries and over the years. We start by showing the percentage of respondents that **do not** use any managed ML product (e.g. Amazon SageMaker, Azure Machine Learning Studio, Google Cloud Vertex AI, DataRobot, Databricks, or Dataiku).
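To make the quantity being plotted concrete, here is a minimal sketch of how a "percentage of respondents matching an option" could be computed per year and country. It is not the notebook's `get_percentage_of_multiple_choices_item`; the helper name `share_matching` and the toy rows are hypothetical:

```python
import pandas as pd

def share_matching(df: pd.DataFrame, column: str, pattern: str) -> pd.DataFrame:
    """Percentage of rows per (Year, Country) whose `column` matches `pattern`."""
    # na=False counts missing answers as non-matching.
    matched = df[column].str.contains(pattern, regex=True, na=False)
    return (
        matched.groupby([df["Year"], df["Country"]])
        .mean()
        .mul(100)
        .rename("Percentage of Respondents")
        .reset_index()
    )

# Made-up toy rows, not survey results.
toy_df = pd.DataFrame({
    "Year": [2022, 2022, 2022, 2022],
    "Country": ["India", "India", "Japan", "Japan"],
    "Managed ML": ["No / None", "Amazon SageMaker", "No / None", None],
})
result = share_matching(toy_df, "Managed ML", r"\bNo / None\b")
print(result)
```

Note that this sketch keeps respondents with a missing answer in the denominator; whether that is appropriate depends on how unanswered questions should be treated.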

In [63]:
choices = ["No / None"]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Managed ML"]],
        multiple_choice_column = "Managed ML",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"].isin([2021,2022])]

fig = px.bar(percentage_df,
             x="Year", y="Percentage of Respondents",
             color="Country",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Percentage of respondents that do not use managed ML products",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,60])
fig.update_yaxes(title="Percentage of respondents")
fig.update_xaxes(type='category')
fig.show()


In [64]:
choices = ["No / None"]

re_expressions = [fr"\b{choices[0]}\b|\bNone\b"]

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "AutoML"]],
        multiple_choice_column = "AutoML",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"] != 2018]

fig = px.bar(percentage_df,
             x="Year", y="Percentage of Respondents",
             color="Country",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Percentage of respondents that do not use AutoML products",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,50])
fig.update_yaxes(title="Percentage of respondents")
fig.update_xaxes(type='category')
fig.show()

In [65]:
choices = ["None"]

re_expressions = [fr"\b{choices[0]}\b"]

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "ML Serve"]],
        multiple_choice_column = "ML Serve",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"] == 2022]

fig = px.bar(percentage_df,
             x="Country", y="Percentage of Respondents",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Percentage of tech workers that do not use products to serve their ML models in 2022",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,50])
fig.update_yaxes(title="Percentage of respondents")
fig.update_xaxes(type='category')
fig.show()

In [66]:
choices = [
"TensorFlow Extended (TFX)",
"TorchServe",
"ONNX Runtime",
"Triton Inference Server",
"OpenVINO Model Server",
"KServe",
"BentoML",
"Multi Model Server (MMS)",
"Seldon Core",
"MLflow",
"Other",
]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

re_expressions[0] = fr"\bTensorFlow Extended\b"
re_expressions[7] = fr"\bMulti Model Server\b"

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "ML Serve"]],
        multiple_choice_column = "ML Serve",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"] == 2022]

fig = px.bar(percentage_df,
             x="Country", y="Percentage of Respondents",
             color="ML Serve",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Products used to serve ML models in 2022 by tech workers",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,9], legend_title="Products")
fig.update_yaxes(title="Percentage of respondents (%)")
fig.update_xaxes(type='category')
fig.show()

In [67]:
choices = ["No / None"]

re_expressions = [fr"\b{choices[0]}\b"]

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "ML Monitor Tools"]],
        multiple_choice_column = "ML Monitor Tools",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"].isin([2020,2021,2022])]

fig = px.bar(percentage_df,
             x="Year", y="Percentage of Respondents",
             color="Country",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Percentage of respondents that do not use tools to monitor ML experiments and models",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,50])
fig.update_yaxes(title="Percentage of respondents (%)")
fig.update_xaxes(type='category')
fig.show()

In [68]:
choices = ["None"]

re_expressions = [fr"\b{choices[0]}\b"]

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Ethical AI tools"]],
        multiple_choice_column = "Ethical AI tools",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"] == 2022]

fig = px.bar(percentage_df,
             x="Country", y="Percentage of Respondents",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Percentage of respondents that do not use Ethical AI products in 2022",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,55])
fig.update_yaxes(title="Percentage of respondents (%)")
fig.update_xaxes(type='category')
fig.show()

# What kind of cloud computing tools are people using?

Tech workers may also use cloud computing tools. The next plot shows which cloud computing tools tech workers used in the selected countries from 2018 to 2022.

In [69]:
choices = [
"Amazon Web Services (AWS)",
"Microsoft Azure",
"Google Cloud Platform (GCP)",
"IBM Cloud / Red Hat",
"Oracle Cloud",
"SAP Cloud",
"VMware Cloud",
"Alibaba Cloud",
"Tencent Cloud",
"Huawei Cloud",
"None",
"Other",
]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

re_expressions[0] = fr"\bAmazon Web Services\b"
re_expressions[2] = fr"\bGoogle Cloud Platform\b"
re_expressions[3] = fr"\bIBM Cloud\b|\bRed Hat\b"
re_expressions[-2] = fr"\bNone\b|\bI have not used any cloud providers\b"

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Cloud Computing Tools"]],
        multiple_choice_column = "Cloud Computing Tools",
        choices=choices,
        re_expressions=re_expressions)

fig = px.bar(percentage_df,
             x="Country", y="Percentage of Respondents",
             color="Cloud Computing Tools",
             animation_frame="Year",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Cloud Computing Tools usage",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,55])
fig.update_yaxes(title="Percentage of respondents")
fig.update_xaxes(type='category')
fig.show()

In [70]:
choices = [
"Amazon Elastic Compute Cloud (EC2)",
"Microsoft Azure Virtual Machines",
"Google Cloud Compute Engine",
"No / None",
"Other",
]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

re_expressions[0] = fr"\bAmazon Elastic Compute Cloud\b|EC2"
re_expressions[1] = fr"\bAzure Virtual Machines\b"
re_expressions[2] = fr"\bGoogle Compute Engine\b|\bGoogle Cloud Compute Engine\b"
re_expressions[-2] = fr"\bNo / None\b|\bNone\b"

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Cloud Computing Products"]],
        multiple_choice_column = "Cloud Computing Products",
        choices=choices,
        re_expressions=re_expressions)

fig = px.bar(percentage_df,
             x="Country", y="Percentage of Respondents",
             color="Cloud Computing Products",
             animation_frame="Year",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Cloud Computing Products usage",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,35])

fig.update_yaxes(title="Percentage of respondents")
fig.update_xaxes(type='category')
fig.show()

# What kind of data-related products are people using?

Finally, we take a look at the tech workers' profiles when it comes to the usage of data-related products. In particular, we are interested in data storage products and other data products, such as data lakes and databases.



In [71]:
# new_df["Data Storage Products"] # Q34

choices = [
"Amazon Simple Storage Service (S3)",
"Amazon Elastic File System (EFS)",
"Google Cloud Storage (GCS)",
"Google Cloud Filestore",
"Microsoft Azure Blob Storage",
"Microsoft Azure Files",
"No / None",
"Other",
]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

re_expressions[0] = fr"\bAmazon Simple Storage Service\b|S3"
re_expressions[1] = fr"\bAmazon Elastic File System\b|EFS"
re_expressions[2] = fr"\bGoogle Cloud Storage\b|GCS"
re_expressions[-2] = fr"\bNo / None\b|\bNone\b"

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Data Storage Products"]],
        multiple_choice_column = "Data Storage Products",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"].isin([2021,2022])]

fig = px.bar(percentage_df,
             x="Country", y="Percentage of Respondents",
             color="Data Storage Products",
             animation_frame="Year",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Data Storage Products usage",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,25])
fig.update_yaxes(title="Percentage of respondents")
fig.update_xaxes(type='category')
fig.show()

In [72]:
choices = [
"MySQL",
"PostgreSQL",
"SQLite",
"Oracle Database",
"MongoDB",
"Snowflake",
"IBM Db2",
"Microsoft SQL Server",
"Microsoft Azure SQL Database",
"Amazon Redshift",
"Amazon RDS",
"Amazon DynamoDB",
"Google Cloud BigQuery",
"Google Cloud SQL",
"None",
"Other",
]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Data Products"]],
        multiple_choice_column = "Data Products",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"].isin([2020,2021,2022])]

fig = px.bar(percentage_df,
             x="Country", y="Percentage of Respondents",
             color="Data Products",
             animation_frame="Year",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Data Products usage",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,40])
fig.update_yaxes(title="Percentage of respondents")
fig.update_xaxes(type='category')
fig.show()

In [73]:
title_prefix = "<h1 style = 'font-size: 2.5em'>"

title_str = "Market and investment"

HTML(f"{title_prefix}{title_str}</h1>")

So far, we have observed Japan's situation at the individual scale, by considering the tech workers' profiles and some aspects of their working environment. Although the individual often reflects the situation of a community, we should also try to draw a profile of the companies and businesses. In the remainder of this notebook, we take a look at some answers in the surveys that could help us do that.

# Tech companies profile



First, we take a look at the types of industry of the companies where the tech workers are employed. In the plot below, we only show the data from the surveys conducted in 2018, 2021 and 2022 due to a lack of the same question in the surveys from 2019 and 2020.

In [74]:
# new_df["Industry Type"] # Q24

choices = [
"Academics/Education",
"Accounting/Finance",
"Broadcasting/Communications",
"Computers/Technology",
"Energy/Mining",
"Government/Public",
"Service",
"Insurance/Risk Assessment",
"Online Service/Internet-based",
"Services",
"Marketing/CRM",
"Manufacturing/Fabrication",
"Medical/Pharmaceutical",
"Non-profit/Service",
"Retail/Sales",
"Shipping/Transportation",
"Other"
]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Industry Type"]],
        multiple_choice_column = "Industry Type",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"].isin([2018, 2021, 2022])]

fig = px.bar(percentage_df,
             x="Country", y="Percentage of Respondents",
             color="Industry Type",
             animation_frame="Year",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Industry Type",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,45])
fig.update_yaxes(title="Percentage of respondents")
fig.update_xaxes(type='category')
fig.show()

Another important aspect to consider is the size of the companies where the participants are employed, especially the percentage of people employed in start-ups, since a strong start-up community is one of the factors required to promote innovation in a country, as pointed out by experts.

Although not always the case, a start-up is typically a company in its early stages of development [16], and it is often considered a small business, which is defined by the US Small Business Administration as a business with at most 500 employees [17]. We will use this definition to analyze the start-up environment in Japan.
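As a small worked example of applying this definition to the survey's company-size options, the sketch below compares each option's range with the 500-employee threshold. The helper names are hypothetical, and the classification is only an illustration, not part of the notebook's processing:

```python
import re

SMALL_BUSINESS_MAX = 500  # threshold quoted in the text above

def bucket_bounds(label: str):
    """Return (low, high) employee counts for a size option; high is None if open-ended."""
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", label)]
    if label.strip().startswith(">"):
        return numbers[0], None
    return numbers[0], numbers[1]

def small_business_status(label: str) -> str:
    """Classify a size option relative to the small-business threshold."""
    low, high = bucket_bounds(label)
    if high is not None and high <= SMALL_BUSINESS_MAX:
        return "small"
    if low > SMALL_BUSINESS_MAX:
        return "not small"
    return "straddles threshold"

for label in ["0-49 employees", "50-249 employees", "250-999 employees",
              "1000-9,999 employees", "> 10,000 employees"]:
    print(label, "->", small_business_status(label))
```

The 250-999 option straddles the threshold, so only the first two options fall unambiguously under this definition of a small business.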

The following plot shows the percentage of people in each country who are employed in companies of various sizes. We see that, in Japan, the percentage of workers employed in businesses with 0-49 employees, which would correspond to a start-up in its early stages, was the smallest among all countries and over all years. The percentage of people working in companies with 50-249 employees, which would also correspond to a start-up environment, was the smallest in Japan in the last 3 years.

We also see that more workers in Japan are employed in companies with 1,000 to 9,999 employees than in companies of any other size. However, the other countries show a similar pattern, with a large percentage of respondents working in large companies.

In [75]:
choices = [
"0-49 employees",
"50-249 employees",
"250-999 employees",
"1000-9,999 employees",
"> 10,000 employees"
]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

re_expressions[-1] = fr"\b10,000 employees\b"

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Company Size"]],
        multiple_choice_column = "Company Size",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"] != 2018]

fig = px.bar(percentage_df,
             x="Country", y="Percentage of Respondents",
             color="Company Size",
             animation_frame="Year",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Company Size",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,40])
fig.update_yaxes(title="Percentage of respondents")
fig.update_xaxes(type='category')
fig.show()

In [76]:
choices = [
"0",
"1-2",
"3-4",
"5-9",
"10-14",
"15-19",
"20+",
]

re_expressions = []
for choice in choices:
    re_expressions.append(fr"\b{choice}\b")

percentage_df = get_percentage_of_multiple_choices_item(
        analysis_df = analysis_df[["Year", "Country", "Number of Data Scientists"]],
        multiple_choice_column = "Number of Data Scientists",
        choices=choices,
        re_expressions=re_expressions)

percentage_df = percentage_df[percentage_df["Year"] != 2018]

fig = px.bar(percentage_df,
             x="Country", y="Percentage of Respondents",
             color="Number of Data Scientists",
             animation_frame="Year",
             barmode="group",
             width=_WIDTH, height=600,
             color_discrete_sequence=_COLOR_SCHEME,
             title="Number of people responsible for data science workloads",
             category_orders={"Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"]})

fig.update_layout(yaxis_range=[0,40], legend_title="Number of people")
fig.update_yaxes(title="Percentage of respondents (%)")
fig.update_xaxes(type='category')
fig.show()

In [77]:
ml_usage_df = analysis_df[["Country", "Year", "ML Usage in Business"]]

labels = {
    "We are exploring ML methods (and may one day put a model into production)":"We are exploring ML methods",
    "We use ML methods for generating insights (but do not put working models into production)":"We use ML methods for generating insights",
    "We recently started using ML methods (i.e., models in production for less than 2 years)":"We recently started using ML methods",
    "We have well established ML methods (i.e., models in production for more than 2 years)":"We have well established ML methods",
    "No (we do not use ML methods)":"No",
    "I do not know":"I do not know",
}
for key, value in labels.items():
    ml_usage_df = ml_usage_df.replace(key, value)

fig = px.histogram(ml_usage_df,
                    x="Country", color="ML Usage in Business", barmode="stack", histfunc="count",
                    barnorm="percent", animation_frame="Year",
                    width=_WIDTH, height=600,
                    color_discrete_sequence=_COLOR_SCHEME,
                    title="Do employers incorporate machine learning methods into their business?",
                    category_orders={
                        "Country": ["China🇨🇳", "India🇮🇳", "Japan🇯🇵", "U.S.🇺🇸"],
                        "ML Usage in Business":[
                            "No",
                            "We are exploring ML methods",
                            "We use ML methods for generating insights",
                            "We recently started using ML methods",
                            "We have well established ML methods",
                            "I do not know",
                        ]})
fig.update_xaxes(type="category")
fig.update_yaxes(title="Percentage of respondents (%)")
fig.show()

In [78]:
title_prefix = "<h1 style = 'font-size: 2.5em'>"

title_str = "Conclusion"

HTML(f"{title_prefix}{title_str}</h1>")

Comparing the data science industry across the US, China, Japan, and India reveals some interesting insights into the state of the industry and the relative strengths and weaknesses of each country.

The US has long been a leader in the data science industry, with a large number of companies and startups specializing in data analysis and machine learning. It also has a well-established education system that produces a large number of skilled data scientists and researchers, and a thriving research community at the forefront of developing new techniques and technologies for data analysis.

China has emerged as a major player in the data science industry in recent years, with a large number of companies specializing in artificial intelligence and machine learning. China has also made significant investments in education and research in the field of data science, with a particular focus on developing advanced AI technologies.

Japan has a long history of innovation and technological excellence, and this extends to the field of data science as well. Japan is home to a number of companies and research institutions that specialize in data analysis and machine learning, and the Japanese government has made significant investments in the development of new technologies in this area.

India is also emerging as a major player in the data science industry, with a large number of companies specializing in data analysis, machine learning, and artificial intelligence. India also has a large pool of skilled software developers and data scientists, and a thriving startup ecosystem that is attracting significant investment from around the world. However, India still faces some challenges, particularly in terms of infrastructure and education, that can hinder its ability to compete with other major players in the industry.

In conclusion, India is rapidly emerging as a major player in the data science industry, with a large and growing pool of skilled professionals and a thriving startup ecosystem. While there are certainly challenges that need to be addressed, particularly in terms of infrastructure and education, India has the potential to become a leading center for data analysis and machine learning in the years to come.

# Appendix: Equivalence of Kaggle Machine Learning & Data Science Survey questions over the years

Below we present the question numbers of the surveys conducted from 2018 to 2021 that are equivalent to the questions `Q1`-`Q44` asked in the 2022 survey.

If there is no equivalent question in the past surveys, we indicate that with `-`. If there is a question in the past surveys that is similar to (but not exactly the same as) the one in the 2022 survey, we add the symbol `&` as a suffix to the question number. Moreover, if there is a question that has similar (but not exactly the same) answer options, we add the symbol `*` as a suffix to the question number.
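The suffix notation can be read mechanically. The following is a minimal sketch of a parser for a single table entry; the names `Equivalence` and `parse_cell` are hypothetical and are not part of the notebook's code:

```python
import re
from typing import NamedTuple, Optional

class Equivalence(NamedTuple):
    question: str           # e.g. "Q29-A"
    similar_question: bool  # "&" suffix present
    similar_options: bool   # "*" suffix present

def parse_cell(cell: str) -> Optional[Equivalence]:
    """Parse one table entry such as "Q13*&"; return None for "-" (no equivalent)."""
    cell = cell.strip()
    if cell == "-":
        return None
    match = re.fullmatch(r"(Q\d+(?:-[A-Z])?)([*&]*)", cell)
    if match is None:
        raise ValueError(f"unrecognized entry: {cell!r}")
    question, flags = match.groups()
    return Equivalence(question, "&" in flags, "*" in flags)

print(parse_cell("Q29-A&"))
print(parse_cell("Q24&*"))
print(parse_cell("-"))
```

The parser accepts the two suffixes in either order, since the entries use both `*&` and `&*`.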

We use the equivalence table below to guide the data processing stage of our code.

| 2022 | 2021 | 2020 | 2019 | 2018 |
| --- | --- | --- | --- | --- |
| Q1 | - | - | - | - |
| Q2 | Q1 | Q1 | Q1 | Q2 |
| Q3 | Q2 | Q2 | Q2* | Q1* |
| Q4 | Q3 | Q3 | Q3 | Q3 |
| Q5 | - | - | - | - |
| Q6 | Q40 | Q37 | Q13* | Q36* |
| Q7 | - | - | - | - |
| Q8 | Q4 | Q4 | Q4* | Q4* |
| Q9 | - | - | - | - |
| Q10 | - | - | - | - |
| Q11 | Q6 | Q6 | Q15& | Q24&* |
| Q12 | Q7* | Q7* | Q18* | Q16* |
| Q13 | Q9* | Q9* | Q16* | Q13*& |
| Q14 | Q10*& | Q10*& | Q17*& | Q14*& |
| Q15 | Q14* | Q14* | Q20* | Q21*& |
| Q16 | Q15 | Q15 | Q23 | Q25* |
| Q17 | Q16* | Q16* | Q28* | Q19*& |
| Q18 | Q17* | Q17* | Q24* | - |
| Q19 | Q18* | Q18* | Q26* | - |
| Q20 | Q19 | Q19 | Q27 | - |
| Q21 | - | - | - | - |
| Q22 | - | - | - | - |
| Q23 | Q5 | Q5* | Q5* | Q6* |
| Q24 | Q20* | - | - | Q7* |
| Q25 | Q21 | Q20 | Q6 | - |
| Q26 | Q22 | Q21 | Q7 | - |
| Q27 | Q23 | Q22 | Q8 | Q10 |
| Q28 | Q24 | Q23 | Q9 | Q11* |
| Q29 | Q25 | Q24 | Q10 | Q9 |
| Q30 | Q26 | Q25 | Q11 | - |
| Q31 | Q27-A* | Q26-A* | Q29*& | Q15*& |
| Q32 | Q28* | - | - | - |
| Q33 | Q29-A& | Q27-A*& | Q30*& | Q27*& |
| Q34 | Q30-A*& | - | - | - |
| Q35 | Q32-A*& | Q29-A*& | - | - |
| Q36 | Q34-A*& | Q31-A*& | - | - |
| Q37 | Q31-A*& | Q28-A*& | - | Q28*& |
| Q38 | Q37-A& | Q34-A*& | Q33*& | - |
| Q39 | - | - | - | - |
| Q40 | Q38-A*& | Q35-A*& | - | - |
| Q41 | - | - | - | - |
| Q42 | Q12*& | Q12*& | Q21*& | - |
| Q43 | Q13 | Q13 | Q22 | - |
| Q44 | Q42 | Q39 | Q12* | Q38* |
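As a rough illustration of how a table like the one above could drive column harmonization, the sketch below builds a per-year mapping from past question IDs to their 2022 equivalents. Only three exact-match rows are copied from the table; the names `EQUIVALENCE` and `rename_map` are hypothetical, and this is not necessarily how the notebook's processing is implemented:

```python
# Rows copied from the table above (2022 question -> equivalent per year); "-" means none.
EQUIVALENCE = {
    "Q2": {2021: "Q1", 2020: "Q1", 2019: "Q1", 2018: "Q2"},
    "Q4": {2021: "Q3", 2020: "Q3", 2019: "Q3", 2018: "Q3"},
    "Q25": {2021: "Q21", 2020: "Q20", 2019: "Q6", 2018: "-"},
}

def rename_map(year: int) -> dict:
    """Map a past survey's question IDs to the 2022 IDs, skipping missing ones."""
    return {
        past[year]: q2022
        for q2022, past in EQUIVALENCE.items()
        if past.get(year, "-") != "-"
    }

print(rename_map(2019))
print(rename_map(2018))
```

A mapping like this could be passed to `DataFrame.rename(columns=...)`, although multiple-choice questions are spread over several columns (for example `Q50_Part_1`) and would need extra handling.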