<font size="6"><b>15 factors for data science in your country!</b></font>
<h4>Comparative analysis of countries in terms of demography, education, technology and career factors of respondents participating in the "2022 Kaggle Machine Learning & Data Science Survey"</h4>
<h4><i> Author: Michał Bogacz (@michau96)</i></h4>

<h3><b>Table of contents</b></h3>
<br>

<b>1. </b>[**Introduction**](#section-1) <br>
<b>2. </b>[**Data cleansing and quality**](#section-2) <br>
<b>3. </b>[**Factors creation**](#section-3) <br>
<b>4. </b>[**Factors analysis**](#section-4) <br>
&nbsp;&nbsp;&nbsp;&nbsp;  <b>4.1. </b>[**Factors analysis in one dimension**](#section-41) <br>
&nbsp;&nbsp;&nbsp;&nbsp;  <b>4.2. </b>[**Factors analysis in two dimensions**](#section-42) <br>
&nbsp;&nbsp;&nbsp;&nbsp;  <b>4.3. </b>[**Clusterization of countries by factors**](#section-43) <br>
<b>5. </b>[**Summary and conclusions**](#section-5) <br>
<b>6. </b>[**Sources**](#section-6)

<a id="section-1"></a>
<div class="alert alert-light" role="alert">
<h3><b>1. Introduction</b></h3>
</div>

<div style="text-align: justify;">This is the 6th edition of the "Kaggle ML & DS Survey", a study in which users of the platform, as well as other people connected with Data Science and Machine Learning in the broad sense, answer a wide range of questions about technology, socio-demographic factors, work, software, hardware and more. <b>This year's edition was attended by 23,997 people from many countries</b>, and the result of the study is a dataset containing the answers to the questionnaire [1] [2].</div>

<div style="text-align: justify;">The data is very interesting and allows for advanced analysis. <b>The narrative of this notebook will focus on the countries of the respondents</b>, who stated in one of the first questions which country they currently reside in. The cross-country analysis is a comparative analysis based on more or less obvious factors that describe respondents from different countries. <b>The goals of the following analysis are: creating factors that distinguish countries from each other and comparative analysis (with an emphasis on the most atypical countries by characteristics, analysis of interdependencies between countries and creating separate groups of countries based on created factors)</b>. The results can be used to better understand where our country stands in the world of data science, which countries are similar to it and which differ completely in specific aspects, and what conclusions, advice and decisions can be drawn from this knowledge.</div>

In [None]:
remotes::install_github("hrbrmstr/ggchicklet")

library(tidyverse)
library(scales)
library(RColorBrewer)
library(ggthemes)
library(devtools)
library(ggchicklet)
library(ggrepel)
library(ggforce)
library(ggcorrplot)
library(ggdendro)
library(ggmap)

annotate <- ggplot2::annotate

theme_michau <- theme(axis.text = element_text(size = 21, colour = "gray25"), plot.caption = element_text(color = "gray65", face = "bold", size = 10), legend.position = "none", 
axis.title = element_text(size = 22.5, colour = "gray25"), axis.line = element_line(size = 0.4, colour = "gray35"), panel.background = element_rect(fill = "white"),
plot.background = element_rect(fill = "white"), plot.title = element_text(size = 25.8, colour = "gray35"), plot.subtitle = element_text(size = 23, colour = "gray62"),
strip.background = element_rect(fill = "white"), strip.text = element_text(size = 19, colour = "gray25", face = "bold"), panel.grid.major = element_line(colour = "white"))

options(repr.plot.width = 20.8, repr.plot.height = 15.2)
options(warn = -1)

<div style="text-align: justify;">Grouping, aggregation and analysis will cover all important questions from the survey, including social and demographic aspects, software, hardware, technology in the broad sense, ways of working, earnings and much more. The exact metrics we will rely on are specified in the following sections, as the data first requires preparation and quality control. To meet our goals, we will use the <b>R programming language</b>, which will allow us to transform the data and generate charts with appropriate descriptions. In addition to base R functionality, we will use packages from the <b>tidyverse</b>, which support data transformation and visualization of results. There are also some <b>additional packages that extend ggplot2</b> (ggthemes, ggrepel, ggdendro, RColorBrewer, ggmap, ggforce, ggcorrplot and ggchicklet, the last of which is currently not on CRAN but only on GitHub). More details and documentation for these libraries can be found in technical references [f]:[o]. We start the analysis by loading the packages, setting the plot size options and suppressing warnings when running the code. We also define a common theme that every chart will use to keep all our visualizations uniform.</div>

In [None]:
dataset <- read.csv('../input/kaggle-survey-2022/kaggle_survey_2022_responses.csv')
# Drop the first data row, which holds the question text rather than a response
dataset <- dataset[-1, ]

<div style="text-align: justify;">We load only one dataset, which is in .csv format and contains all the necessary data. <b>One row corresponds to one respondent completing the survey, while each column holds the answers to one question from the survey</b>. If a question is multiple choice, the database contains as many columns for that question as it has answer options. The questionnaire also used various filters, so all missing data are justified (e.g. people who do not work are not asked about the size of their team at work, etc.).</div>
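
<div style="text-align: justify;">As a rough illustration of this column layout, the sketch below builds a tiny made-up data frame (the <code>toy</code> table and its values are hypothetical, not survey data) with one column per answer option and flags the respondents who selected at least one option. It assumes that an option that was not selected is stored as an empty string.</div>

```r
# Toy stand-in for a multiple-choice question: one column per answer option.
# All values are invented for illustration; unselected options are empty strings.
toy <- data.frame(
  Q31_1 = c("Option A", "", ""),
  Q31_2 = c("", "", "Option B"),
  Q31_3 = c("Option C", "", ""),
  stringsAsFactors = FALSE
)

# Pick every column that belongs to the question by its shared prefix.
option_cols <- grep("^Q31_", names(toy), value = TRUE)

# A respondent counts as a user if any of the option columns is non-empty.
toy$AnySelected <- unname(rowSums(toy[, option_cols] != "") > 0)

print(toy$AnySelected)
```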

<a id="section-2"></a>
<div class="alert alert-light" role="alert">
<h3><b>2. Data cleansing and quality</b></h3>
</div>

<div style="text-align: justify;">We start by cleaning up the data, especially the country names. We <b>remove from the database rows in which respondents did not indicate which country they reside in or whose country was indicated by fewer than 50 respondents (after this filter, the number of respondents dropped to 22,525)</b>. Additionally, we shorten the official names of some countries to make the results easier to read on charts.</div>

In [None]:
dataset <- dataset %>%
    filter(!Q4 %in% c("", "Other", "I do not wish to disclose my location")) %>%
    mutate(Q4 = case_when(Q4 == "United States of America" ~ "United States",
                Q4 == "United Kingdom of Great Britain and Northern Ireland" ~ "United Kingdom",
                Q4 == "People 's Republic of China" ~ "China",
                Q4 ==  "Iran, Islamic Republic of..." ~ "Iran",
                Q4 == "Hong Kong (S.A.R.)" ~ "Hong Kong",
                TRUE ~ Q4))

<div style="text-align: justify;">The next stage, preceding the main analysis, is the data quality process. Before publication, the data had already been cleaned (as described in detail in the competition documentation [2]), e.g. of questionnaires completed in an extremely short time. To increase the data quality even further, <b>we add 3 cleaning filters to remove rows that raise doubts about whether the survey was filled in reliably</b>.</div>

In [None]:
# Q8 doctoral degree/professional doctorate and Q2 18-21 22-24
dataset <- dataset %>%
    filter(!(Q8 %in% c('Doctoral degree', 'Professional doctorate') & Q2 %in% c('18-21', '22-24')))

<div style="text-align: justify;">The first filter concerns extremely unlikely combinations of age and education. We remove rows in which the respondents declared their age to be 24 years or less (categories 18-21 and 22-24 in question Q2) and at the same time declared that they have a doctoral degree or a professional doctorate (in question Q8). It is extremely rare for someone to earn such a high degree at such a young age, so we remove these surveys from our scope, as the credibility of the other answers given by these people is doubtful.</div>

In [None]:
# Q11 20+ and Q2 18-21 22-24 25-29
dataset <- dataset %>%
    filter(!(Q11 == '20+ years' & Q2 %in% c('18-21', '22-24', '25-29')))

<div style="text-align: justify;">The second filter combines declared age with the number of years of programming experience. We remove surveys in which the respondent indicates in question Q2 that they are no more than 29 years old (the three lowest categories) and at the same time declares over 20 years of programming experience in question Q11. It is very unlikely that someone born in the '90s or '00s, when programming was not so popular, started programming before the age of 9.</div>

In [None]:
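# Q16 20+ and Q2 18-21 22-24 25-29
# The top Q16 bracket is spelled "20 or more years" in the factor code below, so both spellings are matched here
dataset <- dataset %>%
    filter(!(Q16 %in% c('20+ years', '20 or more years') & Q2 %in% c('18-21', '22-24', '25-29')))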

<div style="text-align: justify;">The third and last filter is very similar to the previous one. This time, we exclude people who are between 18 and 29 years old and declare that they have been using machine learning methods for at least 20 years. This would mean that they were already practitioners of such models at the age of 9 at most, which is extremely unlikely, especially among people born around the turn of the century.</div>

<div style="text-align: justify;">After applying these filters, <b>the number of respondents in the database dropped from 22,525 to 22,299 (by 1%)</b>; the first filter accounted for most of these 226 removed rows. Basic data cleansing is finished, so we can proceed to the transformation into our title factors!</div>
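
<div style="text-align: justify;">To see how much each rule contributes, the rules can be expressed as logical flags and counted before filtering. The sketch below is only an illustration on an invented five-row sample (the <code>toy</code> table and the flag names <code>young_doctorate</code> and <code>young_veteran</code> are hypothetical); it covers the first two rules, and the third would follow the same pattern.</div>

```r
library(dplyr)

# Hypothetical mini-sample; the answer labels mirror those used in the filters above.
toy <- data.frame(
  Q2  = c("18-21", "22-24", "30-34", "25-29", "40-44"),
  Q8  = c("Doctoral degree", "Other", "Doctoral degree", "Other", "Other"),
  Q11 = c("1-3 years", "3-5 years", "10-20 years", "20+ years", "20+ years"),
  stringsAsFactors = FALSE
)

# Each flag is TRUE for a row that looks implausible under that rule.
flags <- toy %>%
  mutate(
    young_doctorate = Q8 %in% c("Doctoral degree", "Professional doctorate") &
      Q2 %in% c("18-21", "22-24"),
    young_veteran = Q11 == "20+ years" &
      Q2 %in% c("18-21", "22-24", "25-29")
  )

# Rows flagged by each rule, and rows that survive all rules.
removed <- colSums(flags[, c("young_doctorate", "young_veteran")])
kept <- flags %>% filter(!young_doctorate, !young_veteran)

print(removed)
print(nrow(kept))
```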

<a id="section-3"></a>
<div class="alert alert-light" role="alert">
<h3><b>3. Factors creation</b></h3>
</div>

<div style="text-align: justify;">We come to the key stage that is included in the title of this notebook, which is the creation of data community coefficients for each country. In the beginning, for the results of these ratios to be reliable, <b>we use only those countries in which at least 80 respondents took part in the survey</b>. Such a filter will allow us to avoid a situation in which very large or small values of the factors would be strongly dependent on the sample size.</div>

In [None]:
CountryTable <- dataset %>%
    group_by(Q4) %>%
    summarise(NumberOfUsers = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    filter(NumberOfUsers >= 80)

<div style="text-align: justify;">After applying this filter, <b>we have 39 countries that meet this requirement. The final number of surveys that will contribute to the factors is 21,039</b>. Below is a list of all the factors on which we will base all charts and conclusions:</div>

1. Ratio of men
2. Ratio of people up to 29 years old
3. Ratio of people with education higher than bachelor's degree (held or planned within 2 years)
4. Ratio of students
5. Ratio of people who have published any academic research
6. Ratio of coders for at least 5 years
7. Ratio of people using ML for at least 5 years
8. Ratio of people who have ever used TPU
9. Ratio of people who regularly use at least one tool related to cloud computing
10. Ratio of people who spent at least 1 USD on ML or cloud computing in recent 5 years
11. Ratio of people who regularly use Python programming language
12. Ratio of people who regularly use R programming language
13. Ratio of people working in companies with at least 1,000 employees
14. Ratio of people working in companies with at least 5 data scientists
15. Ratio of people earning over 10,000 USD per year
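
<div style="text-align: justify;">Every factor in this list has the same shape: the share of a country's respondents who satisfy some condition. As a compact illustration of that idea (not the code used in this notebook), the sketch below wraps the pattern in a hypothetical helper, <code>country_ratio()</code>, and applies it to an invented five-row sample.</div>

```r
library(dplyr)

# Hypothetical helper: share of respondents per country that satisfy `condition`.
# `condition` is any logical expression over the survey columns.
country_ratio <- function(data, condition, ratio_name) {
  data %>%
    group_by(CountryName = Q4) %>%
    summarise("{ratio_name}" := mean({{ condition }}), .groups = "drop")
}

# Invented sample: country "A" has 2 men out of 3 respondents, "B" has 2 out of 2.
toy <- data.frame(
  Q4 = c("A", "A", "A", "B", "B"),
  Q3 = c("Man", "Woman", "Man", "Man", "Man"),
  stringsAsFactors = FALSE
)

man_ratio <- country_ratio(toy, Q3 == "Man", "ManRatio")
print(man_ratio)
```

<div style="text-align: justify;">Taking the mean of a logical condition gives the same value as dividing a filtered count by the country total. One difference in design is that a country in which nobody meets the condition stays in the result with a ratio of 0, whereas a filter followed by an inner join would drop it.</div>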

In [None]:
CountryTable <- dataset %>%
    filter(Q3 == "Man") %>%
    group_by(Q4) %>%
    summarise(NumberOfMan = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberOfMan) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(ManRatio = NumberOfMan/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q2 %in% c("18-21", "22-24", "25-29")) %>%
    group_by(Q4) %>%
    summarise(NumberOfLowerThan30 = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberOfLowerThan30) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(Lower30yoRatio = NumberOfLowerThan30/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q8 %in% c("Master’s degree", "Doctoral degree", "Professional doctorate")) %>%
    group_by(Q4) %>%
    summarise(NumberHigherThanBachelor = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberHigherThanBachelor) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(HigherThanBachelorRatio = NumberHigherThanBachelor/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q5 == "Yes") %>%
    group_by(Q4) %>%
    summarise(NumberStudent = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberStudent) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(StudentRatio = NumberStudent/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q9 == "Yes") %>%
    group_by(Q4) %>%
    summarise(NumberPublished = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberPublished) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(PublishedRatio = NumberPublished/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q11 %in% c("5-10 years", "10-20 years", "20+ years")) %>%
    group_by(Q4) %>%
    summarise(NumberOver5Coding = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberOver5Coding) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(Over5YearsCodingRatio = NumberOver5Coding/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q16 %in% c("5-10 years", "10-20 years", "20 or more years")) %>%
    group_by(Q4) %>%
    summarise(NumberOver5ML = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberOver5ML) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(Over5YearsMLRatio = NumberOver5ML/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q12_1 == "Python") %>%
    group_by(Q4) %>%
    summarise(NumberPython = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberPython) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(PythonRatio = NumberPython/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q12_2 == "R") %>%
    group_by(Q4) %>%
    summarise(NumberR = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberR) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(RRatio = NumberR/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q43 %in% c("Once", "2-5 times", "6-25 times", "More than 25 times")) %>%
    group_by(Q4) %>%
    summarise(NumberTPU = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberTPU) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(TPURatio = NumberTPU/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q25 %in% c("1000-9,999 employees", "10,000 or more employees")) %>%
    group_by(Q4) %>%
    summarise(Number1kWorkers = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, Number1kWorkers) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(Over1kWorkersRatio = Number1kWorkers/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q26 %in% c("5-9", "10-14", "15-19", "20+")) %>%
    group_by(Q4) %>%
    summarise(NumberDataScientst = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberDataScientst) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(DataScientistTeamsRatio = NumberDataScientst/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q29 %in% c("10,000-14,999", "15,000-19,999", "20,000-24,999", "25,000-29,999", "30,000-39,999", "40,000-49,999",
                      "50,000-59,999", "60,000-69,999", "70,000-79,999", "80,000-89,999", "90,000-99,999", "100,000-124,999",
                      "125,000-149,999", "150,000-199,999", "200,000-249,999", "250,000-299,999", "300,000-499,999",
                      "$500,000-999,999", " >$1,000,000")) %>%
    group_by(Q4) %>%
    summarise(NumberOver10k = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberOver10k) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(NumberOver10kRatio = NumberOver10k/NumberOfUsers)

CountryTable <- dataset %>%
    filter(Q30 %in% c("$1-$99", "$100-$999", "$1000-$9,999", "$10,000-$99,999", "$100,000 or more ($USD)")) %>%
    group_by(Q4) %>%
    summarise(NumberSpendOnCCML = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberSpendOnCCML) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(NumberSpendOnCCMLRatio = NumberSpendOnCCML/NumberOfUsers)

CountryTable <- dataset %>%
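    # paste() joins the 10 option columns with 9 single-space separators,
    # so a total length above 9 means at least one option is non-empty
    mutate(sumQ31 = nchar(paste(Q31_1, Q31_2, Q31_3, Q31_4, Q31_5, Q31_6, Q31_7, Q31_8, Q31_9, Q31_10))) %>%
    filter(sumQ31 > 9) %>%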
    group_by(Q4) %>%
    summarise(NumberUsingCloud = n(), .groups = 'drop') %>%
    rename(CountryName = Q4) %>%
    select(CountryName, NumberUsingCloud) %>%
inner_join(CountryTable, by = "CountryName") %>%
    mutate(NumberUsingCloudRatio = NumberUsingCloud/NumberOfUsers)


CountryTable <- CountryTable[,c('CountryName', 'NumberOfUsers', 'ManRatio', 'Lower30yoRatio', 'HigherThanBachelorRatio', 'StudentRatio', 
                                'PublishedRatio', 'Over5YearsCodingRatio', 'Over5YearsMLRatio', 'TPURatio', 'NumberUsingCloudRatio',
                                'NumberSpendOnCCMLRatio', 'PythonRatio', 'RRatio', 'Over1kWorkersRatio',  'DataScientistTeamsRatio', 
                                'NumberOver10kRatio')]

head(CountryTable)

<div style="text-align: justify;"><b>The data prepared in this way gave us a data frame with 39 rows (number of countries) and 15 factor columns (number of factors)</b>, alongside the country name and the number of respondents, for a total of 585 factor values. All values of the factors range from 0 to 1, where 1 means that all respondents in the country meet the requirement described in the factor and 0 means that no one meets it.</div>

<a id="section-4"></a>
<div class="alert alert-light" role="alert">
<h3><b>4. Factors analysis</b></h3>
</div>

<div style="text-align: justify;">We have all the data ready in its final form, so it's time to see what we can learn from it. <b>The analysis will be divided into three stages: the analysis of each of the factors separately (one dimension, bar charts), the analysis of two factors simultaneously (correlations, scatter plots) and the analysis of all factors at the same time (hierarchical clustering)</b>. Let's dive in!</div>

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(CountryName == "India" ~ "Fill", 
                             CountryName != "India" ~ "No fill"))

Color <- Chart %>%
    arrange(NumberOfUsers) %>%
    mutate(Color = case_when(Color == "Fill" ~ "#c35493",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)


ggplot(Chart, aes(reorder(CountryName, +NumberOfUsers), NumberOfUsers, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(data=subset(Chart, NumberOfUsers>400), aes(label = NumberOfUsers), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(breaks = seq(0, 8000, 1000))+
    scale_fill_manual(values = c("#c35493", "gray65"))+
    labs(x = "", y = "Number of responders", title = "Countries by number of responders", 
         subtitle = "In countries with at least 80 responders, after quality control", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">Before we move on to the factors, let's start with an important point that has a significant impact on the interpretation of the results. The chart above shows how many completed questionnaires (after the data quality process) were obtained from each country. As mentioned earlier, we have 39 countries, of which <span style="color:#c35493"><b>India has the highest number of respondents (8,690), which is 41% of the surveys qualifying for analysis</b></span>. The top three also include the United States and Japan, but they are far behind the leader. The last country to meet the requirement of at least 80 reliable answers is Saudi Arabia with 84 respondents. Because of their small number of questionnaires, countries such as Algeria, Belgium, Cameroon, Malaysia, Nepal, Singapore and Ukraine did not qualify for the analysis. The sample size for each country is worth keeping in mind when interpreting the charts, because the factors for countries with fewer responses are subject to relatively larger random error.</div>
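
<div style="text-align: justify;">The effect of sample size on random error can be made concrete with the usual standard error of a proportion, sqrt(p(1 - p) / n). The sketch below is a back-of-the-envelope illustration only: it treats respondents as a simple random sample, which a voluntary survey is not, and it assumes a ratio of p = 0.5 (the worst case) rather than any value from the data. The helper <code>se_prop()</code> is hypothetical.</div>

```r
# Approximate standard error of a proportion p estimated from n responses.
# p = 0.5 is an assumed worst case, not a survey result.
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

n_small <- 84    # Saudi Arabia, the smallest qualifying sample
n_large <- 8690  # India, the largest sample

se_small <- se_prop(0.5, n_small)
se_large <- se_prop(0.5, n_large)

# Standard errors for both sample sizes, and how many times larger the small-sample error is.
print(round(c(small = se_small, large = se_large), 4))
print(round(se_small / se_large, 1))
```

<div style="text-align: justify;">Under these assumptions, a factor for the smallest country carries roughly ten times the random error of the same factor for India, or about 5 percentage points compared with about half a percentage point.</div>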

<a id="section-41"></a>
<div class="alert alert-light" role="alert">
<h3><b>4.1. Factors analysis in one dimension</b></h3>
</div>

<div style="text-align: justify;">We start the main analysis by looking at each of the 15 factors separately. <b>For each factor, we will take the values for all countries and present them in the form of a bar chart</b> (with the country name on the y-axis for readability). This way we can easily see which countries rank high, medium or low in terms of factor values (which does not always mean better or worse). For each factor, we will look especially at the countries <span style="color:#005d89"><b>at the top</b></span> and <span style="color:#df9100"><b>at the bottom</b></span>, and at <span style="color:#A6A6A6"><b>certain interesting situations</b></span> in the middle of the ranking, if there are any. In the end, we will condense the information from the 15 charts into one to see which countries most often deviated from the middle of the ranking across all factors. Let's start our one-dimensional journey!</div>
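
<div style="text-align: justify;">The top and bottom groups can be marked by comparing each value with quantiles of the factor. The sketch below shows the general idea on ten invented countries. The cutoffs of 0.80 and 0.20 are assumptions chosen for the example; in the charts that follow, the probabilities are picked by hand for each factor.</div>

```r
library(dplyr)

# Ten invented countries with evenly spaced ratios from 0.50 to 0.95.
toy <- data.frame(
  CountryName = paste("Country", LETTERS[1:10]),
  Ratio = seq(50, 95, by = 5) / 100
)

# Assumed cutoffs for this example: top 20% and bottom 20% of the values.
upper <- quantile(toy$Ratio, probs = 0.80)
lower <- quantile(toy$Ratio, probs = 0.20)

# Label each country with the group used to pick the bar colour.
toy <- toy %>%
  mutate(Color = case_when(Ratio > upper ~ "1st fill",
                           Ratio < lower ~ "2nd fill",
                           TRUE ~ "No fill"))

print(table(toy$Color))
```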

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(between(ManRatio, quantile(ManRatio, probs = 0.88), quantile(ManRatio, probs = 0.99)) ~ "1st fill", 
                             ManRatio < quantile(ManRatio, probs = 0.04) ~ "2nd fill",
                             (ManRatio > quantile(ManRatio, probs = 0.99)) | (between(ManRatio, quantile(ManRatio, probs = 0.04), quantile(ManRatio, probs = 0.88))) ~ "No fill"))

Color <- Chart %>%
    arrange(ManRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +ManRatio), ManRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(ManRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,1.13), breaks = seq(0,1,0.2))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 36.4, y = 1.035, label = "4 of 5 countries with highest men\nratio are in South America", fontface = "bold", size = 5.9, colour = "#005d89")+
    annotate("text", x = 1.5, y = 0.815, label = "Only 2 countries have values under 70%", fontface = "bold", size = 5.9, colour = "#df9100")+
    annotate("text", x = 9.9, y = 1.00, label = "Russia has the highest feminization\n rate in the world and has the lowest\n men ratio of responders in Europe", fontface = "bold", size = 5.9, colour = "gray65")+
    annotate(geom = "curve", xend = 'Russia', y = 0.84, x = 'Russia', yend = 0.75, curvature = 0.2, arrow = arrow(length = unit(4.5, "mm")), colour = "gray65", size = 0.8)+
    labs(x = "", y = "Ratio of men among responders", title = "Ratio of men by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">The first factor shows how many people answered "Man" in question Q3 instead of the other possible options ("Woman", "Nonbinary", "Prefer not to say", "Prefer to self-describe"). In each country, the value of this ratio exceeds 50%. The country with the highest value is Japan, the only one where more than 90% of respondents identify themselves as men. The remaining <span style="color:#005d89"><b>4 out of 5 top places are filled by South American countries, with Chile in the lead</b></span> (which had 96% of men in last year's edition of the survey). An interesting fact is that <span style="color:#A6A6A6"><b>Russia, which has the highest feminization rate in the world in recent history (around 1.16)</b></span> [3], <span style="color:#A6A6A6"><b>has a large percentage of men (76%) that is nevertheless low compared to other European countries</b></span>. Only <span style="color:#df9100"><b>two countries have a men ratio below 70%. These are Iran (63%) and Tunisia (53%)</b></span>, both countries where Islam is the dominant religion [4].</div>

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(Lower30yoRatio > quantile(Lower30yoRatio, probs = 0.75) ~ "1st fill", 
                             Lower30yoRatio < quantile(Lower30yoRatio, probs = 0.07) ~ "2nd fill",
                             between(Lower30yoRatio, quantile(Lower30yoRatio, probs = 0.07), quantile(Lower30yoRatio, probs = 0.75)) ~ "No fill"))

Color <- Chart %>%
    arrange(Lower30yoRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +Lower30yoRatio), Lower30yoRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(Lower30yoRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,1.02), breaks = seq(0,0.8,0.2))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    geom_vline(xintercept = 26.5, size = 0.5, color = "gray30")+
    annotate("text", x = 27.3, y = 0.91, label = "'Young' countries", size = 6.3, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 25.8, y = 0.903, label = "'Old' countries", size = 6.3, colour = "#df9100", fontface = "bold")+
    annotate(geom = "curve", x = 26.8, y = 0.82, xend = 27.8, yend = 0.82, curvature = 0, size = 1.3, colour = "#005d89", arrow = arrow(length = unit(4.5, "mm")))+
    annotate(geom = "curve", x = 26.2, y = 0.82, xend = 25.2, yend = 0.82, curvature = 0, size = 1.3, colour = "#df9100", arrow = arrow(length = unit(4.5, "mm")))+
    annotate("text", x = 34.1, y = 0.895, label = "The top 10 countries with the\n highest rate are in Asia or North Africa", fontface = "bold", size = 5.9, colour = "#005d89")+
    annotate("text", x = 1.9, y = 0.36, label = "Last 3 countries are in Europe", fontface = "bold", size = 5.9, colour = "#df9100")+
    annotate("text", x = 3.9, y = 0.438, label = "Highest median age is in\n this country: 48.6 years old", fontface = "bold", size = 5.9, colour = "gray65")+
    annotate(geom = "curve", xend = 'Japan', y = 0.34, x = 'Japan', yend = 0.26, curvature = 0.2, arrow = arrow(length = unit(4.5, "mm")), colour = "gray65", size = 0.8)+
    labs(x = "", y = "Ratio of people under 30 years old among responders", title = "Ratio of people under 30 years old by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">The next factor is based on question Q2 about age ranges. For each country, we calculated the percentage of respondents who marked a range between 18 and 29 years of age. The densely populated Asian countries have the highest values of this factor, with more than 70% of respondents who have not yet entered their 4th decade of life. <span style="color:#005d89"><b>Among the countries with values above 60% are only Asian and African countries</b></span> with high fertility rates. Thus, the future and youth of the data world lie outside the most economically developed countries. The only European country with a value exceeding 50% is Russia (59%). The countries of both Americas also rank low, with Peru as their leader in "youth" (45% of respondents are under 30). An interesting case is <span style="color:#A6A6A6"><b>Japan, which has the second-highest median age in the world (after only Monaco), at almost 50 years</b></span> [5]. <span style="color:#A6A6A6"><b>Here too, when it comes to the percentage of young respondents, it ranks very low (in 36th place)</b></span>. European countries close the ranking: <span style="color:#df9100"><b>in the Netherlands, Portugal and Spain, fewer than one in four respondents is under 30 years of age</b></span>. The differences between countries are huge, reaching as much as fourfold between the extreme countries in terms of age.</div>

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(HigherThanBachelorRatio > quantile(HigherThanBachelorRatio, probs = 0.84) ~ "1st fill", 
                             HigherThanBachelorRatio < quantile(HigherThanBachelorRatio, probs = 0.36) ~ "2nd fill",
                             between(HigherThanBachelorRatio, quantile(HigherThanBachelorRatio, probs = 0.36), quantile(HigherThanBachelorRatio, probs = 0.84)) ~ "No fill"))

Color <- Chart %>%
    arrange(HigherThanBachelorRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +HigherThanBachelorRatio), HigherThanBachelorRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(HigherThanBachelorRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,1.04), breaks = seq(0,1,0.2))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 35.6, y = 0.94, label = "Top 7 countries are located\n in West-Central Europe", fontface = "bold", size = 5.9, colour = "#005d89")+
    annotate("text", x = 7.2, y = 0.622, label = "14 countries have a factor value below 50%,\n including only one European country - Russia", fontface = "bold", size = 5.9, colour = "#df9100")+
    annotate("text", x = 27.2, y = 0.83, label = "Canada: Most educated country\n in the world (in 2018)", fontface = "bold", size = 5.9, colour = "gray65")+
    annotate(geom = "curve", xend = 'Canada', y = 0.72, x = 'Canada', yend = 0.64, curvature = -0.2, arrow = arrow(length = unit(4.5, "mm")), colour = "gray65", size = 0.8)+
    labs(x = "", y = "Ratio of people with education level higher than bachelor among responders", title = "Ratio of people with education level higher than bachelor by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">The next factor relates to question Q8 about education. For each country, we count the percentage of people who hold (as their current highest level of education or as one they plan to obtain within 2 years) a master's degree, a doctorate or a professorship. <span style="color:#005d89"><b>The highest values of this ratio are achieved by European countries - all cases where the value exceeds 70% are on the old continent</b></span>. Other highly developed countries also have high values (Australia 69%, USA 64%, <span style="color:#A6A6A6"><b>Canada</b></span> 63%). The last of these three countries <span style="color:#A6A6A6"><b>leads the global statistics on the percentage of people with higher education</b></span>, which was over 56% in the global research 4 years ago [6]. <span style="color:#df9100"><b>One-third of the countries have a ratio of less than 50%</b></span>, which means that the majority of respondents there neither have nor plan to obtain in the near future a level of education above the bachelor's degree. These are mainly Asian countries, not only densely populated ones but also those with high GDP per capita such as Japan and South Korea. Some African and South American countries also have low values. For this factor, we can draw similar conclusions as in the case of young people: there are very large differences between countries, and a brighter Kaggle future awaits Asian countries, because people who are just starting their educational adventure in working with data give these countries an advantage.</div>
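
<div style="text-align: justify;">The blue and orange highlighting in the charts is driven by quantile cut-offs of the plotted factor. A minimal base R sketch of that classification is shown here, using the same cut-offs (0.84 and 0.36) as the education chart. The six values in <code>ratio</code> are hypothetical, and <code>ifelse</code> is used in place of <code>case_when</code> only to keep the sketch free of package dependencies.</div>

```r
# Hypothetical factor values for six countries (not survey results).
ratio <- c(A = 0.30, B = 0.45, C = 0.55, D = 0.62, E = 0.71, F = 0.80)

# Cut-offs: values above the 84th percentile and below the 36th percentile
# are highlighted; quantile() uses its default interpolation (type 7).
upper <- quantile(ratio, probs = 0.84)
lower <- quantile(ratio, probs = 0.36)

# Three-way classification matching the labels used in the chart code.
group <- ifelse(ratio > upper, "1st fill",
         ifelse(ratio < lower, "2nd fill", "No fill"))

print(group)
```

<div style="text-align: justify;">Because the cut-offs are interpolated quantiles, the number of highlighted countries depends on the chosen probabilities and on the number of countries, which is why each chart uses its own pair of values.</div>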

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(StudentRatio > quantile(StudentRatio, probs = 0.77) ~ "1st fill", 
                             StudentRatio < quantile(StudentRatio, probs = 0.35) ~ "2nd fill",
                             between(StudentRatio, quantile(StudentRatio, probs = 0.35), quantile(StudentRatio, probs = 0.77)) ~ "No fill"))

Color <- Chart %>%
    arrange(StudentRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +StudentRatio), StudentRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(StudentRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,0.92), breaks = seq(0,1,0.2))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 34.6, y = 0.795, label = "All of the top 9 countries\n are located in Asia or Africa", size = 5.9, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 7.2, y = 0.44, label = "Each country with less than\n 35% of students is one of the\n economically developed countries", size = 5.9, colour = "#df9100", fontface = "bold")+
    labs(x = "", y = "Ratio of students among responders", title = "Ratio of students by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">We move on to another factor that is closely related to age and the level of education - the share of students in each country, based on question Q5 and covering students at any level of education. <span style="color:#005d89"><b>The highest values are among Asian and African countries - the top 9 countries all come from these two continents</b></span> and have values between 61% and 76%. <span style="color:#df9100"><b>The lowest values, below 35%, belong without exception to countries with high GDP per capita, of which the Netherlands (12%) and Japan (15%) have by far the lowest values</b></span>. Among the countries generally considered to be developed, only Russia has a medium ratio of students, equal to exactly half of the respondents. The differences between countries are very large: the student ratio in Tunisia is almost five times higher than in the Netherlands. It is worth noting that there is a strong negative correlation between the two previous factors (age and level of education completed).</div>
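
<div style="text-align: justify;">A correlation such as the one mentioned above can be checked with <code>cor()</code>. The sketch below uses five hypothetical countries; <code>YoungRatio</code> is an assumed column name and all values are invented for illustration, so the printed coefficients say nothing about the real survey.</div>

```r
# Hypothetical factor values for five countries (not survey results).
factors <- data.frame(
  YoungRatio              = c(0.75, 0.68, 0.52, 0.31, 0.24),
  HigherThanBachelorRatio = c(0.38, 0.45, 0.55, 0.70, 0.78)
)

# Pearson correlation (the default) measures linear association.
pearson <- cor(factors$YoungRatio, factors$HigherThanBachelorRatio)

# Spearman correlation is rank-based and less sensitive to outlying countries.
spearman <- cor(factors$YoungRatio, factors$HigherThanBachelorRatio,
                method = "spearman")

print(round(c(pearson = pearson, spearman = spearman), 3))
```

<div style="text-align: justify;">With only a few dozen countries, a rank-based coefficient is a reasonable companion to Pearson's, since a single extreme country can noticeably shift the linear estimate.</div>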

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(PublishedRatio > quantile(PublishedRatio, probs = 0.88) ~ "1st fill", 
                             PublishedRatio < quantile(PublishedRatio, probs = 0.08) ~ "2nd fill",
                             between(PublishedRatio, quantile(PublishedRatio, probs = 0.08), quantile(PublishedRatio, probs = 0.88)) ~ "No fill"))

Color <- Chart %>%
    arrange(PublishedRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +PublishedRatio), PublishedRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(PublishedRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,0.68), breaks = seq(0,0.6,0.15))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 36.6, y = 0.59, label = "Top 5 countries are\n located in Western Europe", size = 5.9, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 2.4, y = 0.255, label = "Countries with values <15% are only the\n populous countries of Asia and Africa", size = 5.9, colour = "#df9100", fontface = "bold")+
    annotate("text", x = 32.1, y = 0.542, label = "Australia is the only country with scientific\n publications >4,000 per million people", size = 5.9, colour = "gray65", fontface = "bold")+
    annotate(geom = "curve", xend = 'Australia', y = 0.43, x = 'Australia', yend = 0.37, curvature = 0.2, arrow = arrow(length = unit(4.5, "mm")), colour = "gray65", size = 0.8)+
    labs(x = "", y = "Ratio of people with at least one published research among responders", title = "Ratio of people with at least one published research by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">The fifth factor is based on question Q9, in which all respondents were asked if they had ever published academic research (for example papers or conference proceedings). The rate is the percentage of people in each country who answered "yes" to this question. <span style="color:#005d89"><b>Only three of the surveyed countries have this factor above 40%: Germany, Portugal and the Netherlands, while the top five include only Western European countries</b></span>. It is interesting that in the United States, only 31% of people participating in the survey have published their research. <span style="color:#A6A6A6"><b>In the global statistics on the number of scientific publications per capita, Australia is the leader among our surveyed countries</b></span> [7], but it ranks only 8th among the respondents. <span style="color:#df9100"><b>Egypt closes the ranking with 11%, and the countries at the end of the ranking have lower than average GDP per capita and are located in Asia or Africa</b></span>. An interesting fact is that this question was asked for the first time; the previous 5 editions contain no data on this subject.</div>

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(Over5YearsCodingRatio > quantile(Over5YearsCodingRatio, probs = 0.85) ~ "1st fill", 
                             Over5YearsCodingRatio < quantile(Over5YearsCodingRatio, probs = 0.28) ~ "2nd fill",
                             between(Over5YearsCodingRatio, quantile(Over5YearsCodingRatio, probs = 0.28), quantile(Over5YearsCodingRatio, probs = 0.85)) ~ "No fill"))

Color <- Chart %>%
    arrange(Over5YearsCodingRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +Over5YearsCodingRatio), Over5YearsCodingRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(Over5YearsCodingRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,0.71), breaks = seq(0,0.6,0.15))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 36.1, y = 0.65, label = "Top countries are located\n in Western Europe", size = 5.9, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 6.3, y = 0.235, label = "Only eastern countries\n at the end of the ranking", size = 5.9, colour = "#df9100", fontface = "bold")+
    labs(x = "", y = "Ratio of people coding over 5 years among responders", title = "Ratio of people coding over 5 years by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">With the next factor, we move on to the experience of respondents. The first of these factors is based on question Q11, where respondents were asked how many years they have been coding or programming. For each country, we counted the percentage of people who declare that they have been doing it for at least 5 years (5-10 years, 10-20 years and 20+ years). Again, <span style="color:#005d89"><b>only representatives of Western Europe, led by the Netherlands (59%), are in the top six countries</b></span>. The only country outside Europe that exceeds 50% is Israel. <span style="color:#df9100"><b>The countries commonly regarded as "Western" are absent from the end of the ranking, which is closed by Nigeria, where 7% of respondents are experienced programmers</b></span>. This once again confirms the thesis that masses of new data scientists are currently being trained in developing countries, while in developed countries experienced people prevail. That is good from the current perspective, but it is a negative signal for future development.</div>

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(Over5YearsMLRatio > quantile(Over5YearsMLRatio, probs = 0.9) ~ "1st fill", 
                             Over5YearsMLRatio < quantile(Over5YearsMLRatio, probs = 0.1) ~ "2nd fill",
                             between(Over5YearsMLRatio, quantile(Over5YearsMLRatio, probs = 0.1), quantile(Over5YearsMLRatio, probs = 0.9)) ~ "No fill"))

Color <- Chart %>%
    arrange(Over5YearsMLRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +Over5YearsMLRatio), Over5YearsMLRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(data=subset(Chart,Over5YearsMLRatio>0.01), aes(label = paste(100*round(Over5YearsMLRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,0.35), breaks = seq(0,0.3,0.1))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 37.1, y = 0.29, label = "Top 4 is the same as for factor about\n programming experience (different order)", size = 5.9, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 2.4, y = 0.08, label = "There are 4 countries where less than 2% of\n respondents are highly experienced in ML", size = 5.9, colour = "#df9100", fontface = "bold")+
    annotate("text", x = 31.1, y = 0.233, label = "Country where most ML/DL\n models were published so far", size = 5.9, colour = "gray65", fontface = "bold")+
    annotate(geom = "curve", xend = 'United States', y = 0.195, x = 'United States', yend = 0.163, curvature = -0.2, arrow = arrow(length = unit(4.5, "mm")), colour = "gray65", size = 0.8)+
    labs(x = "", y = "Ratio of people using ML methods over 5 years among responders", title = "Ratio of people using ML methods over 5 years by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">The factor describing the number of years of experience with machine learning methods is built in a very similar way to the previous one: in Q14, the answers indicating 5+ years are counted in the same way as for programming. We can immediately see a very strong correlation between experience in programming and experience in using ML methods. <span style="color:#005d89"><b>The top 4 countries are the same in both factors: Germany, Portugal, France and the Netherlands (only the order is changed)</b></span>. This time, however, the values are much lower: Germany has only 29% of people who have been using machine learning methods for more than 5 years (half the share of people with more than 5 years of programming experience in this country). <span style="color:#A6A6A6"><b>The history of machine learning largely began in the USA</b></span>, the home of researchers who published foundational work, e.g. McCulloch and Pitts, who are among the fathers of artificial neural networks, or Breiman, who introduced the random forest in 2001 [8] [9]. <span style="color:#A6A6A6"><b>Currently, only one in six respondents in this country has been using ML methods for at least 5 years</b></span>, which gives the USA 9th place in the ranking despite the maturity of this field there. Only 13 countries have a value of this ratio above 10%, <span style="color:#df9100"><b>while there are four countries (Vietnam, Russia, Tunisia and Nigeria) where the value of this ratio does not exceed 2%</b></span>, which means that people who have been using machine learning methods for more than 5 years are extremely rare there.</div>
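
<div style="text-align: justify;">The observation that two factors share the same leaders in a different order can be verified programmatically. The sketch below is an assumption-laden illustration: the helper <code>top_n_countries</code> is hypothetical, and the six countries and their values are invented; only the column names follow those used in the charts.</div>

```r
# Hypothetical factor table (values are not survey results).
factors <- data.frame(
  CountryName           = c("A", "B", "C", "D", "E", "F"),
  Over5YearsCodingRatio = c(0.59, 0.55, 0.52, 0.51, 0.30, 0.07),
  Over5YearsMLRatio     = c(0.22, 0.29, 0.25, 0.21, 0.10, 0.01),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the n countries with the highest value
# in the given column, ordered from highest to lowest.
top_n_countries <- function(data, column, n) {
  data$CountryName[order(data[[column]], decreasing = TRUE)][seq_len(n)]
}

top_coding <- top_n_countries(factors, "Over5YearsCodingRatio", 4)
top_ml     <- top_n_countries(factors, "Over5YearsMLRatio", 4)

same_set   <- setequal(top_coding, top_ml)   # same countries, any order
same_order <- identical(top_coding, top_ml)  # same countries, same order

print(list(top_coding = top_coding, top_ml = top_ml,
           same_set = same_set, same_order = same_order))
```

<div style="text-align: justify;">Separating the set comparison from the order comparison mirrors the wording of the paragraph above: the leaders can coincide even when their ranking differs.</div>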

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(TPURatio > quantile(TPURatio, probs = 0.8) ~ "1st fill", 
                             TPURatio < quantile(TPURatio, probs = 0.1) ~ "2nd fill",
                             between(TPURatio, quantile(TPURatio, probs = 0.1), quantile(TPURatio, probs = 0.8)) ~ "No fill"))

Color <- Chart %>%
    arrange(TPURatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +TPURatio), TPURatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(TPURatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,0.26), breaks = seq(0,0.2,0.05))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 35.2, y = 0.22, label = "Economically highly developed countries\n have values > 15%", size = 5.9, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 2.1, y = 0.06, label = "10% of countries\n have values below 5%", size = 5.9, colour = "#df9100", fontface = "bold")+
    labs(x = "", y = "Ratio of people who have used TPU at least once among responders", title = "Ratio of people who have used TPU at least once by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">We move on to slightly different factors based on the software and hardware used by the respondents. We start with the Tensor Processing Unit (an AI accelerator application-specific integrated circuit developed by Google for neural network machine learning, using Google's own TensorFlow software) [10]. Respondents were asked about it in Q43, where most of them indicated how many times they have used this technology. On this basis, we calculated the percentage of people who indicated that they have used a TPU at least once in their life (answers: Once, 2-5 times, 6-25 times, more than 25 times). The values of this ratio are small for all countries - the only countries that exceed 20% are the Netherlands and Japan (both 21%). <span style="color:#005d89"><b>At the top (values above 15%), we can see European countries and other developed economies of the world</b></span> where such modern technologies are more readily available. Tunisia and Nigeria are at the bottom of the ranking with values of 3%. In total, <span style="color:#df9100"><b>4 of the 39 countries have values lower than 5%</b></span>. The values differ significantly between the top and the end of the ranking, but the most important takeaway is that nowhere in the world is the TPU yet the norm or a popular tool.</div>

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(NumberUsingCloudRatio > quantile(NumberUsingCloudRatio, probs = 0.95) ~ "1st fill", 
                             NumberUsingCloudRatio < quantile(NumberUsingCloudRatio, probs = 0.1) ~ "2nd fill",
                             between(NumberUsingCloudRatio, quantile(NumberUsingCloudRatio, probs = 0.1), quantile(NumberUsingCloudRatio, probs = 0.95)) ~ "No fill"))

Color <- Chart %>%
    arrange(NumberUsingCloudRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +NumberUsingCloudRatio), NumberUsingCloudRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(NumberUsingCloudRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,0.47), breaks = seq(0,1,0.1))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 38.4, y = 0.43, label = "The same top 2 countries\n as in the TPU factor", size = 5.9, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 2.3, y = 0.13, label = "The same last 4 countries\n as in TPU factor", size = 5.9, colour = "#df9100", fontface = "bold")+
    labs(x = "", y = "Ratio of people using at least one cloud platform among responders", title = "Ratio of people using at least one cloud platform by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">Moving on to other technologies that are part of data science development, we leave local computers behind and turn to the increasingly popular cloud. Based on question Q31, for each country we calculate the percentage of people who indicated that they use at least one cloud computing platform (for example AWS, Microsoft Azure, GCP). <span style="color:#005d89"><b>The two countries with the highest percentages are the same as for the TPU factor: the Netherlands (38%) followed by Japan (35%)</b></span>. At the forefront, we also see Spain and Poland, which also ranked very high in the previous factor. In total, 9 countries have a value of this ratio higher than 25%. At the bottom of the ranking, we also find an analogy to TPU - <span style="color:#df9100"><b>the final four countries are the same: Pakistan, Bangladesh, Russia and Tunisia with values below 7.2%</b></span>. Although cloud technologies are becoming more and more popular, it is worth noting that most of the respondents do not declare using any of the 10 most popular platforms listed in the possible answers to question Q31. In each of the 39 countries, less than half of the respondents declare regular use of at least one platform.</div>
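
<div style="text-align: justify;">Multiple-choice questions such as Q31 are typically stored as one column per answer option. Assuming that layout (columns named <code>Q31_1</code>, <code>Q31_2</code>, ... holding the option text when selected and <code>NA</code> otherwise), a "uses at least one platform" flag could be derived as sketched below. The toy data, the column names and the platform labels are assumptions for illustration; in the real file, a column for an answer like "None" would have to be excluded first.</div>

```r
# Toy data in an assumed one-column-per-option layout (not the survey file).
cloud <- data.frame(
  Country = c("A", "A", "A", "B", "B"),
  Q31_1   = c("Amazon Web Services (AWS)", NA, NA, NA, NA),
  Q31_2   = c(NA, "Microsoft Azure", NA, NA, NA),
  Q31_3   = c("Google Cloud Platform (GCP)", NA, NA, NA,
              "Google Cloud Platform (GCP)"),
  stringsAsFactors = FALSE
)

# Select every column that belongs to the multiple-choice question.
platform_cols <- grep("^Q31_", names(cloud), value = TRUE)

# A respondent uses the cloud if at least one option column is filled in.
cloud$UsesCloud <- rowSums(!is.na(cloud[platform_cols])) > 0

# Share of cloud users per country.
cloud_ratio <- tapply(cloud$UsesCloud, cloud$Country, mean)

print(round(cloud_ratio, 2))
```

<div style="text-align: justify;">The same pattern would apply to the programming-language question, where selecting a single option column (instead of "any of them") gives the share of users of one specific language.</div>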

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(NumberSpendOnCCMLRatio > quantile(NumberSpendOnCCMLRatio, probs = 0.99) ~ "1st fill", 
                             NumberSpendOnCCMLRatio < quantile(NumberSpendOnCCMLRatio, probs = 0.1) ~ "2nd fill",
                             between(NumberSpendOnCCMLRatio, quantile(NumberSpendOnCCMLRatio, probs = 0.1), quantile(NumberSpendOnCCMLRatio, probs = 0.99)) ~ "No fill"))

Color <- Chart %>%
    arrange(NumberSpendOnCCMLRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +NumberSpendOnCCMLRatio), NumberSpendOnCCMLRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(NumberSpendOnCCMLRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,0.63), breaks = seq(0,1,0.15))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 39, y = 0.58, label = "Only Japan has value > 50%", size = 5.9, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 2.4, y = 0.223, label = "Same last 4 countries as in TPU \nand cloud usage factors (different order)", size = 5.9, colour = "#df9100", fontface = "bold")+
    labs(x = "", y = "Ratio of people who spent money on cloud computing or ML among responders", title = "Ratio of people who spent money on cloud or ML by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">Staying on the topic of the cloud, respondents were also asked in Q30 approximately how much money they have spent on cloud computing and / or machine learning in the last 5 years. For each country, we calculate the percentage of people who declare that they spent at least 1 USD on these topics. <span style="color:#005d89"><b>The only country where the majority of respondents declare such expenses is Japan with 51%</b></span>. The leaders are complemented by European countries: Poland, the Netherlands and Germany. In each country, at least 10% of people declare spending at least 1 USD, with the lowest value in Tunisia (11%). <span style="color:#df9100"><b>The last four are identical to the factor about the use of cloud computing (and TPU too)</b></span>. The currently low popularity of the cloud in these countries is probably the reason for the lack of spending on cloud or ML topics.</div>

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(PythonRatio > quantile(PythonRatio, probs = 0.92) ~ "1st fill", 
                             PythonRatio < quantile(PythonRatio, probs = 0.17) ~ "2nd fill",
                             between(PythonRatio, quantile(PythonRatio, probs = 0.17), quantile(PythonRatio, probs = 0.92)) ~ "No fill"))

Color <- Chart %>%
    arrange(PythonRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +PythonRatio), PythonRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(PythonRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,1.12), breaks = seq(0,1,0.2))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 37.6, y = 1.02, label = "European and Middle East\n countries in the lead", size = 5.9, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 30.3, y = 1.03, label = "The country where the creator\n of Python comes from", size = 5.9, colour = "gray65", fontface = "bold")+
    annotate("text", x = 3.7, y = 0.83, label = "Only African and Asian countries\n at the end of the ranking", size = 5.9, colour = "#df9100", fontface = "bold")+
    annotate(geom = "curve", xend = 'Netherlands', y = 0.93, x = 'Netherlands', yend = 0.85, curvature = -0.2, arrow = arrow(length = unit(4.5, "mm")), colour = "gray65", size = 0.8)+
    labs(x = "", y = "Ratio of people using Python among responders", title = "Ratio of people using Python by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">It is time for factors directly related to the software used by the respondents. In question Q12, the respondents were asked what programming languages they use regularly. The list has 14 languages and an option to enter another answer that is not listed, but we will focus only on the two most popular for data science and machine learning applications. Python (currently the most popular data science technology) goes first. In each country, at least 50% of respondents declare that they use the language regularly, and the values for the leaders are very high. <span style="color:#005d89"><b>The ranking is opened by France, where 89% of people declare that they use this technology</b></span> regularly, but the differences are small, as a ratio above 80% is found in 20 countries. <span style="color:#005d89"><b>The leading group includes countries located in Europe and the west of Asia</b></span> (Iran, Turkey and the United Kingdom). <span style="color:#A6A6A6"><b>An interesting fact is that 84% of respondents from the Netherlands, the country of origin of its creator, Guido van Rossum, declare using the technology he initiated</b></span> [11]. No European country has a value below 80%, but <span style="color:#df9100"><b>in the lower part of the ranking (where the factor values are the lowest), African and Asian countries are in the majority</b></span>.</div>

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(RRatio > quantile(RRatio, probs = 0.88) ~ "1st fill", 
                             RRatio < quantile(RRatio, probs = 0.1) ~ "2nd fill",
                             between(RRatio, quantile(RRatio, probs = 0.1), quantile(RRatio, probs = 0.88)) ~ "No fill"))

Color <- Chart %>%
    arrange(RRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +RRatio), RRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(RRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,0.455), breaks = seq(0,0.4,0.1))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 36.2, y = 0.41, label = "Only one European, African or\n Asian country in top 5", size = 5.9, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 2.1, y = 0.145, label = "The last 4 countries have\n values < 10 %", size = 5.9, colour = "#df9100", fontface = "bold")+
    annotate("text", x = 27.1, y = 0.346, label = "One of R's two 'fathers'\n is from this country", size = 5.9, colour = "gray65", fontface = "bold")+
    annotate(geom = "curve", xend = 'Canada', y = 0.31, x = 'Canada', yend = 0.263, curvature = -0.2, arrow = arrow(length = unit(4.5, "mm")), colour = "gray65", size = 0.8)+
    labs(x = "", y = "Ratio of people using R among responders", title = "Ratio of people using R by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">In the same way, referring to the same question, we check the percentage of people who regularly use the R language in each country. This time, no country has a value higher than 40%, because R is much less popular worldwide than Python. As we can easily see, <span style="color:#005d89"><b>the top five countries with the highest values of this factor include only one country located in Eurasia or Africa: Spain. There are three countries from the Americas (a region which, according to a study from previous years, is also keen on using R) and Australia</b></span>. Asian countries, where there are a lot of young students who do not yet have a master's degree or higher education, have a much lower percentage of people who use R regularly. This is because the Python boom has been going on for several years, and in developed countries, where data scientists are on average much older, they probably gained their first professional experience back in the days when Python's dominance over R was not so obvious. <span style="color:#A6A6A6"><b>The fathers of this language are considered to be Robert Gentleman and Ross Ihaka, who come from Canada and New Zealand. We do not have enough data from New Zealand, but today, more than 20 years after the creation of the R language, Canada is in the middle of the ranking, with one in four respondents using this software [12]</b></span>. Russia unexpectedly closes the ranking: less than 6% of respondents there regularly use the language in which this notebook is written. <span style="color:#df9100"><b>In total, there are 4 of 39 countries where less than 10% of respondents declared using the R language</b></span>.</div>

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(Over1kWorkersRatio > quantile(Over1kWorkersRatio, probs = 0.98) ~ "1st fill", 
                             Over1kWorkersRatio < quantile(Over1kWorkersRatio, probs = 0.07) ~ "2nd fill",
                             between(Over1kWorkersRatio, quantile(Over1kWorkersRatio, probs = 0.07), quantile(Over1kWorkersRatio, probs = 0.98)) ~ "No fill"))

Color <- Chart %>%
    arrange(Over1kWorkersRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +Over1kWorkersRatio), Over1kWorkersRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(Over1kWorkersRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,0.51), breaks = seq(0,0.4,0.1))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 38.8, y = 0.47, label = "Only Netherlands \nexceed 40%", size = 5.9, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 1.9, y = 0.122, label = "In the last 3 countries less than 5%\n of people work in big companies", size = 5.9, colour = "#df9100", fontface = "bold")+
    labs(x = "", y = "Ratio of people working in companies with >1,000 employees among respondents", title = "Ratio of people working in companies with >1,000 employees by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">The last type of factors are those related to professional work. We start with the size of the company in which the respondents work, based on Q25. We calculate the percentage of people who indicated that they work in companies with over 1,000 employees (the denominator also includes people who do not work at all). <span style="color:#005d89"><b>There is only one country with a value of more than 40% - the Netherlands</b></span>. The leaders are complemented by Japan, the United Arab Emirates and Saudi Arabia. High values of the index can also be found in European countries, and moderate values in South American countries. At the end of the ranking, the low-GDP countries of Asia and Africa prevail, probably because these countries have relatively few large companies that employ people in positions related to data processing. <span style="color:#df9100"><b>Bangladesh, Iran and Tunisia have only around 4% of people working in big companies</b></span>. This is additionally influenced by the large number of students and very young people in these countries, who focus on acquiring education instead of gaining professional experience, especially in large companies.</div>

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(DataScientistTeamsRatio > quantile(DataScientistTeamsRatio, probs = 0.98) ~ "1st fill", 
                             DataScientistTeamsRatio < quantile(DataScientistTeamsRatio, probs = 0.2) ~ "2nd fill",
                             between(DataScientistTeamsRatio, quantile(DataScientistTeamsRatio, probs = 0.2), quantile(DataScientistTeamsRatio, probs = 0.98)) ~ "No fill"))

Color <- Chart %>%
    arrange(DataScientistTeamsRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +DataScientistTeamsRatio), DataScientistTeamsRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(aes(label = paste(100*round(DataScientistTeamsRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,0.52), breaks = seq(0,1,0.1))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 37.2, y = 0.445, label = "Netherlands has 13 percentage point\n advantage over 2nd place", size = 5.9, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 4.4, y = 0.14, label = "No western countries\n with values < 10%", size = 5.9, colour = "#df9100", fontface = "bold")+
    labs(x = "", y = "Ratio of people whose companies have at least 5 data scientists among respondents", title = "Ratio of people whose companies have at least 5 data scientists by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">Another factor is based on Q26, where the working respondents were asked how many data scientists work in their company. On this basis, we calculate the percentage of people whose company employs at least five data scientists (options 5-9, 10-14, 15-19 and 20+ in the list of answers to this question). <span style="color:#005d89"><b>The Netherlands is again a clear leader, standing out even among the top countries, with a 48% share of people working in companies with at least 5 data scientists</b></span>. The advantage over Germany and Poland is 13 percentage points, and the top five includes only European countries. <span style="color:#df9100"><b>There are 7 countries with a factor value below 10% (only from Asia, Africa and one from South America)</b></span>, and Tunisia closes the ranking: only 4% of respondents there work in a company with at least 5 people in data science positions. The difference between the first and the last country in the ranking is huge - a factor of 12.</div>
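
<div style="text-align: justify;">The calculation described above can be sketched on toy data. This is only an illustration, not the notebook's actual factor-creation code: the data frame <code>survey</code>, its column names <code>Country</code> and <code>Q26</code> and the answers in it are hypothetical, and we assume (as for the company-size factor) that all respondents of a country stay in the denominator.</div>

```r
# Toy survey data; the names survey, Country and Q26 are assumptions for illustration
survey <- data.frame(
  Country = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Q26     = c("5-9", "20+", "1-2", NA, "0", "3-4", "10-14", NA, NA),
  stringsAsFactors = FALSE
)

# Answer options that mean "at least 5 data scientists in the company"
at_least_5 <- c("5-9", "10-14", "15-19", "20+")

# %in% returns FALSE for NA, so respondents without an answer stay in the denominator
survey$HasDSTeam <- survey$Q26 %in% at_least_5

# Share of such respondents in each country
team_ratio <- tapply(survey$HasDSTeam, survey$Country, mean)
print(round(team_ratio, 2))
```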

In [None]:
Chart <- CountryTable %>%
    mutate(Color = case_when(NumberOver10kRatio > quantile(NumberOver10kRatio, probs = 0.93) ~ "1st fill", 
                             NumberOver10kRatio < quantile(NumberOver10kRatio, probs = 0.24) ~ "2nd fill",
                             between(NumberOver10kRatio, quantile(NumberOver10kRatio, probs = 0.24), quantile(NumberOver10kRatio, probs = 0.93)) ~ "No fill"))

Color <- Chart %>%
    arrange(NumberOver10kRatio) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "#005d89",
                             Color == "2nd fill" ~ "#df9100",
                             Color == "No fill" ~ "gray65")) %>%
    select(Color)

ggplot(Chart, aes(reorder(CountryName, +NumberOver10kRatio), NumberOver10kRatio, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    geom_text(data=subset(Chart,NumberOver10kRatio>0.02), aes(label = paste(100*round(NumberOver10kRatio, 2), "%")), colour = "white", position = position_stack(vjust = 0.5), size = 4.8, fontface = "bold")+
    scale_y_continuous(labels = scales::percent, limits = c(0,0.735), breaks = seq(0,0.6,0.15))+
    scale_fill_manual(values = c("#005d89", "#df9100", "gray65"))+
    annotate("text", x = 37.9, y = 0.67, label = "All countries with value > 50% are\n in top 30 GDP per capita", size = 5.9, colour = "#005d89", fontface = "bold")+
    annotate("text", x = 5.9, y = 0.12, label = "No western countries\n with values < 10%", size = 5.9, colour = "#df9100", fontface = "bold")+
    annotate("text", x = 31.9, y = 0.585, label = "Highest GDP per capita\n from these 39 countries", size = 5.9, colour = "gray65", fontface = "bold")+
    annotate(geom = "curve", xend = 'United States', y = 0.52, x = 'United States', yend = 0.463, curvature = -0.2, arrow = arrow(length = unit(4.5, "mm")), colour = "gray65", size = 0.8)+
    annotate("text", x = 1.9, y = 0.145, label = "Lowest GDP per capita\n from these 39 countries", size = 5.9, colour = "gray65", fontface = "bold")+
    annotate(geom = "curve", xend = 'Nigeria', y = 0.08, x = 'Nigeria', yend = 0.029, curvature = -0.2, arrow = arrow(length = unit(4.5, "mm")), colour = "gray65", size = 0.8)+
    labs(x = "", y = "Ratio of people who earn over 10,000 USD per year among respondents", title = "Ratio of people who earn over 10,000 USD per year by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">The last factor is the earnings of the respondents. Based on question Q29, we flagged the people who indicated that their annual earnings exceed 10,000 USD and divided their number by the number of respondents in each country. For the third time in a row among the job-related factors, the Netherlands leads the way - 57% of respondents declare earnings above this level. <span style="color:#005d89"><b>There are three countries where the majority of respondents declare earnings above this amount: apart from the Netherlands, these are the United Kingdom and Germany. All of these countries have high GDP per capita and are in the top 30 according to the estimates of the International Monetary Fund (12th, 18th and 26th respectively)</b></span> [13]. Nigeria and Iran are at the bottom, where only 2% of people declare earnings above 10k USD in the last 12 months. <span style="color:#df9100"><b>Only Asian and African countries are among the cases where less than every tenth person declares high earnings</b></span>. In addition to low GDP per capita, the value of the ratio is also lowered by the large number of people who are not working yet because they are still studying or lack experience. <span style="color:#A6A6A6"><b>Looking at the data on the gross domestic product per capita for 2022, of the analyzed countries, the first place is taken by the United States (with a value of over 75,000 USD), which is high in our ranking and takes 8th place. The lowest per capita income measure is in Nigeria (approximately 5,800 USD), which ranks in the penultimate, 38th place in the ranking of earnings</b></span>.</div>

<div style="text-align: justify;">All 15 factors have been visualized in one-dimensional charts, so it's time for a short summary. Going through all 15 charts, it is easy to see that some countries clearly stand out from the rest on many factors. We will therefore present how far each country lies from the middle of the rankings. We will <b>create a special measure that, for each of the 15 factors, awards points for being "atypical"</b> (at the top or at the bottom of the ranking). In mathematical notation, the measure looks like this:</div>
<br>

\begin{align}
Measure = \sum_{k=1}^{n} \left| \text{position in ranking}_k - \frac{\text{number of countries}}{2} \right|
\end{align}

<br>
<div style="text-align: justify;">In the formula, <i>n</i> is the number of factors. <b>The further from the center of a ranking, the more points a country gets</b>. The point value is practically the same for the first and the last place, because most factors are neither stimulants nor destimulants, so we are interested in the strength of the deviation from the middle of the ranking, not its direction.</div>
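
<div style="text-align: justify;">To make the measure concrete, here is a small sketch on invented data. The table <code>toy</code>, its five countries and the two factors <code>FactorX</code> and <code>FactorY</code> are hypothetical. The sketch follows the calculation used in the notebook's code: the absolute distance of each rank from half the number of countries, summed over the factors.</div>

```r
# Toy factor table: 5 hypothetical countries, 2 hypothetical factors
toy <- data.frame(
  CountryName = c("A", "B", "C", "D", "E"),
  FactorX     = c(0.10, 0.40, 0.25, 0.90, 0.55),
  FactorY     = c(0.80, 0.30, 0.50, 0.10, 0.60)
)

middle  <- nrow(toy) / 2              # center of the ranking, here 2.5
factors <- c("FactorX", "FactorY")

# Distance of each country's rank from the center, one column per factor
dist_from_middle <- sapply(toy[factors], function(v) abs(rank(v) - middle))

# Summing over the factors gives the atypicality score
toy$Scoring <- rowSums(dist_from_middle)
print(toy[order(-toy$Scoring), c("CountryName", "Scoring")])
```

<div style="text-align: justify;">Countries A and D, which sit at opposite ends of both toy rankings, receive the highest scores, while B and C, which are close to the middle in both, receive the lowest.</div>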

In [None]:
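# Center of the ranking: half the number of countries
middle <- nrow(Chart)/2
Chart_Ranking <- data.frame(Chart$CountryName)

# For every factor column: rank the countries and keep the absolute
# distance of each rank from the center of the ranking
for (i in colnames(Chart %>% select(-c(NumberOfUsers, CountryName, Color)))){
    
    col_new_name <- paste0(i, "_rank")
    
    Chart_Ranking <- Chart_Ranking %>%
        mutate(Rank = rank(Chart[[i]])) %>%
        mutate(Rank = abs(Rank - middle)) %>%
        rename(!!col_new_name := Rank)
}

# Drop the country name so that only the distance columns are summed
Sums_Country <- Chart_Ranking %>%
    select(-Chart.CountryName) 

# Total score per country: sum of the distances over all factors
Ranking_DF <- data.frame('CountryName' = Chart$CountryName, 'Scoring' = rowSums(Sums_Country))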

<div style="text-align: justify;">After going through all 15 factors, we have the sum of points for each country, which reflects its general "atypicality": the higher the value, the more the country deviates from the middle of the rankings. We visualize the results on a simple bar chart in the same way as for each factor before.</div>

In [None]:
Ranking_DF <- Ranking_DF %>%
    mutate(Color = case_when(rank(Scoring) == nrow(Chart) ~ "1st fill", 
                             rank(Scoring) == nrow(Chart)-1 ~ "2nd fill", 
                             rank(Scoring) == nrow(Chart)-2 ~ "3rd fill", 
                             rank(Scoring) < nrow(Chart)-2 ~ "No fill"))

Color <- Ranking_DF %>%
    arrange(Scoring) %>%
    mutate(Color = case_when(Color == "1st fill" ~ "gold3",
                             Color == "2nd fill" ~ "azure4",
                             Color == "3rd fill" ~ "brown",
                             Color == "No fill" ~ "gray70")) %>%
    select(Color)

ggplot(Ranking_DF, aes(reorder(CountryName, +Scoring), Scoring, fill = Color))+
    geom_chicklet(col = "gray10", stat = "identity", width = 0.64, size = 0.35)+
    coord_flip()+
    scale_y_continuous(limits = c(0,260), breaks = seq(0,250,50))+
    scale_fill_manual(values = c("gold3", "azure4", "brown", "gray70"))+
    labs(x = "", y = "Scoring", title = "Ranking of the most atypical countries according to 15 factors", 
         subtitle = "Deviations from the average in the rankings, in countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(legend.position = "none", axis.text.y = element_text(colour = as.list(Color)$Color), axis.text.x = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">The <span style="color:#EEC900"><b>Netherlands</b></span> is the undisputed leader, with almost 250 points (an average of over 16 points per factor). Out of 15 factors, the Netherlands took first or last place 7 times, which undoubtedly contributed to such a high value of the measure. The second place is taken by <span style="color:#838B8B"><b>Nigeria</b></span>, which has the fourth-largest number of respondents in this survey, so we are sure that its unusualness is not due to a small sample. This country took extreme positions three times. <span style="color:#A52A2A"><b>Tunisia</b></span> completes the top three, with very little difference from 2nd place. Further down the leading group, we see various countries from different continents, so it is hard to find a relationship between economic or geographic factors and the atypicality of data scientists in relation to the world average. In the lower part of the ranking, a large group consists of South American countries which, as we have often seen in the analysis of factors, occupy positions near the middle of the rankings. The lowest score, i.e. the highest similarity to the average taking into account all factors at the same time, belongs to Turkey. Last year's data also indicated that Turkey is the most moderate country in terms of our 15-dimensional Data Science view of the country.</div>

<a id="section-42"></a>
<div class="alert alert-light" role="alert">
<h3><b>4.2. Factors analysis in two dimensions</b></h3>
</div>

<div style="text-align: justify;">We have finished the long part in which each factor was analyzed independently, so it's time to look at more than one factor at a time. This part consists of two smaller parts: <b>pairwise correlation analysis</b> (where we create a correlation matrix to see which factors are strongly correlated) and <b>analysis of the most interesting relationships in scatter plots</b>, where we place selected pairs of factors on the X and Y axes of a two-dimensional Cartesian coordinate system in which each country is shown as a point. Let's get to work!</div>


<div style="text-align: justify;">All factors are quantitative continuous variables, which makes it easier to choose a correlation method. However, we will not use the most popular Pearson's linear correlation coefficient: it works well for detecting linear relationships, but we have not yet studied what types of dependencies occur between the factors. Therefore, we will use <b>Spearman's rank correlation coefficient</b>, which can also handle non-linear (monotonic) dependencies. The formula for Spearman's rank correlation coefficient is shown below, where <i>d<sub>t</sub></i> is the difference between the two ranks of observation <i>t</i> (here, a country) and <i>n</i> is the number of observations [14] [15]. 
    
\begin{align}
r_s = 1 - \frac{6\sum_{t=1}^{n} d_t^2}{n(n^2-1)}
\end{align}
    
We have a total of 105 (15 × 14 × 0.5) pairs of factors, which we present in matrix form. The correlogram shows all pairwise correlations at once. </div>
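
<div style="text-align: justify;">As a quick check of the formula, the sketch below computes the coefficient by hand on two invented vectors and compares it with R's built-in <code>cor(..., method = "spearman")</code>. The vectors <code>x</code> and <code>y</code> are hypothetical and contain no ties, which is the case in which the simplified formula above is exact.</div>

```r
# Toy values of two hypothetical factors for 6 countries (no ties)
x <- c(0.12, 0.45, 0.33, 0.80, 0.27, 0.61)
y <- c(0.20, 0.35, 0.50, 0.90, 0.10, 0.70)

n <- length(x)
d <- rank(x) - rank(y)                        # rank differences d_t
r_manual <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Built-in Spearman correlation for comparison
r_builtin <- cor(x, y, method = "spearman")
print(c(manual = r_manual, builtin = r_builtin))

# Number of distinct pairs among 15 factors
n_pairs <- choose(15, 2)
print(n_pairs)
```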

In [None]:
Chart %>% 
    select(-c(NumberOfUsers, CountryName, Color)) %>%
    rename(`Man ratio` = ManRatio, 
           `Lower 30 yo ratio` = Lower30yoRatio,
           `Higher than BSc ratio` = HigherThanBachelorRatio,
           `Student ratio` = StudentRatio,
           `Publication ratio` = PublishedRatio,
           `> 5 years coding ratio` = Over5YearsCodingRatio,
           `> 5 years ML ratio` = Over5YearsMLRatio,
           `Used TPU ratio` = TPURatio,
           `Using cloud ratio` = NumberUsingCloudRatio,
           `Expense on cloud/ML ratio` = NumberSpendOnCCMLRatio,
           `Python ratio` = PythonRatio,
           `R ratio` = RRatio,
           `> 1k employees ratio` = Over1kWorkersRatio,
           `> 4 data scientist company ratio` = DataScientistTeamsRatio,
           `> 10k USD yearly ratio` = NumberOver10kRatio,
           ) %>% 
    as.matrix() %>%
    cor(., method = "spearman") %>%
ggcorrplot(., outline.col = "white", type = "full", lab = T, lab_size = 5.5,
           legend.title = "Strength of \ncorrelation", colors = c("#6D9EC1", "white", "#E46726"))+
  labs(y = "", x = "", title = "Values of Spearman's rank correlation coefficient between factors",
       subtitle = "Grouped by countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
  guides(fill = 'none')+
  theme_fivethirtyeight()+
  theme_michau+
  theme(legend.position = "right", legend.direction = "vertical", axis.text.x = element_text(angle = 30, hjust = 1), 
        axis.text = element_text(size = 19, colour = "gray25"))

<div style="text-align: justify;">Going through the factors from the beginning, we start with <b>the ratio of men, which is the factor least correlated with the others</b>: all values are between -0.4 and 0.4. As we suspected from a priori knowledge and the analysis of factors in one dimension, the factor of young people is strongly correlated with the factor of students. <b>The percentage of people with at least a master's degree (or planned within 2 years) is only moderately correlated with the student ratio</b>, but we can see high correlations (Spearman's rank correlation coefficient > 0.7) with the following factors: people with a scientific publication, coding for over 5 years and using ML methods for over 5 years. This factor has a correlation above 0.5 or below -0.5 with every other factor except the ratio of men. Due to the strong correlation between the student factor and the age factor, they form a specific group that is negatively correlated with the other factors (with moderate to high strength). <b>The correlation between the percentage of people who have been coding for more than 5 years and the percentage who have been using ML methods for more than 5 years is the second largest positive correlation - the value of the coefficient is over 0.92</b>. Both factors are also very strongly correlated with the percentage of people earning more than 10,000 USD a year (experience therefore goes hand in hand with finances). It is also not surprising that <b>the least correlated factors are those related to programming languages, with only 0.05 between the percentage of Python users and the percentage of people who regularly code in R</b>. We can also see a <b>very strong correlation between using TPUs and both using and spending funds on cloud technologies - a coefficient value of almost 0.9</b>. All three work-related factors, i.e. the size of the company, the number of people involved in data science in the company and earnings, have strong positive correlations with each other. 
<b>The strongest correlation of all is between the percentage of people using the cloud and the percentage of people who have spent money on cloud computing and machine learning in the last 5 years - 0.94</b>. Overall, positive correlations prevail, and the factors are strongly connected in the data aggregated by country.</div>
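
<div style="text-align: justify;">Reading the strongest and weakest pairs off a correlogram by eye gets tedious with 105 pairs. One possible way to list them, sketched on invented data, is to take the upper triangle of the correlation matrix and sort the pairs by absolute value. The table <code>toy</code> and its factors <code>F1</code> to <code>F4</code> are hypothetical, and this step is not part of the original notebook.</div>

```r
# Toy data: 6 hypothetical countries, 4 hypothetical factors
toy <- data.frame(
  F1 = c(1, 2, 3, 4, 5, 6),
  F2 = c(2, 1, 4, 3, 6, 5),
  F3 = c(6, 5, 4, 3, 2, 1),
  F4 = c(3, 6, 1, 5, 2, 4)
)
cm <- cor(as.matrix(toy), method = "spearman")

# Keep each pair only once by taking the upper triangle of the matrix
idx <- which(upper.tri(cm), arr.ind = TRUE)
pair_table <- data.frame(
  factor_a = rownames(cm)[idx[, "row"]],
  factor_b = colnames(cm)[idx[, "col"]],
  rho      = cm[idx]
)

# Strongest relationships (positive or negative) first
pair_table <- pair_table[order(-abs(pair_table$rho)), ]
print(pair_table)
```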

<div style="text-align: justify;">We now come to the scatter plots. Creating one for each of the 105 pairs would be overwhelming, so we choose the 6 most interesting cases. </div>

In [None]:
#ManRatio vs Lower30yoRatio
ggplot(Chart, aes(ManRatio, Lower30yoRatio))+
    geom_point(shape = 21, size = 5.5, fill = "gray10", alpha = 0.6)+
    geom_text_repel(aes(label = CountryName), size = 6.5, colour = "gray30", max.overlaps = 50, fontface = "bold")+    
    annotate("rect", xmin = 0.8, xmax = 0.99, ymin = 0.5, ymax = 0.85, alpha = .2, colour = "#005d89", fill = "#005d89")+
    annotate("text", x = 0.925, y = 0.81, label = "Young men dominance", size = 9.3, colour = "#005d89", fontface = "bold")+
    scale_fill_manual(values = c("#df9100", "#005d89"))+
    scale_y_continuous(labels = scales::percent)+
    scale_x_continuous(labels = scales::percent)+
    labs(x = "Ratio of men", y = "Ratio of people under 30 years old", title = "Ratio of men vs. people under 30 years old by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(axis.text = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">The first relationship concerns two factors: gender (on the X axis) and age (on the Y axis). These factors had a slight negative correlation (as the ratio of men increases, the share of people under 30 on average slightly decreases, and vice versa). The point most distant from the rest is Tunisia, which, as we saw in the first one-dimensional factor chart, has by far the smallest percentage of men and at the same time one of the highest percentages of people under 30. <span style="color:#005d89"><b>The group that goes against this dependence consists of countries with a very high percentage of both men and young people. This group includes 3 countries that are relatively close to each other: Bangladesh, Viet Nam and Pakistan</b></span>. For these 2 factors, it is visible that the difference between countries is much greater in age than in gender (a range of about 60 percentage points versus less than 40).</div>

In [None]:
# HigherThanBachelorRatio vs StudentRatio
ggplot(Chart, aes(StudentRatio, HigherThanBachelorRatio))+
    geom_point(shape = 21, size = 5.5, fill = "gray10", alpha = 0.6)+
    geom_text_repel(aes(label = CountryName), size = 6.5, colour = "gray30", max.overlaps = 50, fontface = "bold")+    
    annotate("rect", xmin = 0.05, xmax = 0.3, ymin = 0.65, ymax = 0.95, alpha = 0.2, colour = "#df9100", fill = "#df9100")+
    annotate("rect", xmin = 0.55, xmax = 0.8, ymin = 0.2, ymax = 0.5, alpha = 0.2, colour = "#005d89", fill = "#005d89")+
    annotate("text", x = 0.145, y = 0.91, label = "Master's degree+\n dominance", size = 9.3, colour = "#df9100", fontface = "bold")+
    annotate("text", x = 0.71, y = 0.245, label = "Students dominance", size = 9.3, colour = "#005d89", fontface = "bold")+
    scale_fill_manual(values = c("#df9100", "#005d89"))+
    scale_y_continuous(labels = scales::percent)+
    scale_x_continuous(labels = scales::percent)+
    labs(x = "Ratio of students", y = "Ratio of people with education level higher than bachelor", title = "Ratio of student vs. people with education level higher than bachelor by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(axis.text = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">In the second relationship, we focus on educational factors: the percentage of students (X-axis) and the percentage of people who have, or plan to obtain within 2 years, at least a master's degree (Y-axis). This pair has a moderate negative correlation (Spearman's rank correlation coefficient was -0.5), so, logically, a high student rate does not go hand in hand with a high education level already achieved (people who have completed the first stages of studies are less frequently students than people without a master's degree or higher). This creates two groups of countries. In the first of them <span style="color:#005d89"><b>there are countries where the percentage of students is high (> 55%) and at the same time the percentage of people with high education is low (< 50%). This group includes 6 countries: India, Indonesia, Nigeria, Bangladesh, Viet Nam and Egypt. We can presume that these are the countries where the data science boom is just beginning, since such a large percentage are people starting their adventure</b></span>. On the other hand, <span style="color:#df9100"><b>we have the opposite pole - countries with few students (< 30%) and a high proportion of highly educated people (> 65%). This group includes 4 countries: the Netherlands, Poland, Italy and France (so all of them are located in Central-Western Europe)</b></span>. The scattering of points representing countries confirms the existence of a negative correlation - it is not easy to find countries where the proportions of students and of educated people are both low or both high.</div>

In [None]:
# R Ratio vs Python Ratio
Chart2 <- CountryTable %>%
    mutate(Color = case_when(
            PythonRatio >= mean(CountryTable$PythonRatio) & RRatio < mean(CountryTable$RRatio) ~ "Python bigger, R less",
            PythonRatio < mean(CountryTable$PythonRatio) & RRatio >= mean(CountryTable$RRatio) ~ "R bigger, Python less"
    ))

ggplot(Chart2, aes(PythonRatio, RRatio))+
    geom_point(shape = 21, size = 5.5, fill = "gray10", alpha = 0.6)+
    geom_text_repel(aes(label = CountryName), size = 6.5, colour = "gray30", max.overlaps = 50, fontface = "bold")+       
    geom_mark_ellipse(aes(colour = Color, fill = Color, filter = Color %in% c('Python bigger, R less', 'R bigger, Python less')), expand = unit(9, "mm"), alpha = 0.2)+    
    annotate("text", x = 0.71, y = 0.34, label = "R countries", size = 9.5, colour = "#df9100", fontface = "bold")+
    annotate("text", x = 0.85, y = 0.1, label = "Python countries", size = 9.5, colour = "#005d89", fontface = "bold")+
    scale_fill_manual(values = c("#005d89", "#df9100"))+
    scale_colour_manual(values = c("#005d89", "#df9100"))+
    scale_y_continuous(labels = scales::percent)+
    scale_x_continuous(labels = scales::percent)+
    labs(x = "Ratio of people using Python", y = "Ratio of people using R", title = "Ratio of people using Python vs. people using R by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(axis.text = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">We move on to the pair of factors whose Spearman's rank correlation coefficient was closest to zero: the percentage of people using Python (on the X axis) and the percentage of people using R (on the Y axis). The lack of correlation can be seen in the scatter plot - it is difficult to fit any line that would show a trend. We decide to distinguish two groups of countries in the chart: <span style="color:#df9100"><b>those where R's popularity is above average and Python's popularity is below average (R countries)</b></span> and <span style="color:#005d89"><b>those where Python's popularity is above average and R's popularity is below average (Python countries)</b></span>. We start with countries where R's popularity is above the global average and Python's popularity is below it. <span style="color:#df9100"><b>There are 9 R countries: the USA, South Africa, the UAE and 6 countries from South and Central America</b></span>. Interestingly, we find hardly any European, Asian or African countries in this group, even though they make up the majority of the analyzed countries (Spain and Indonesia came closest). <span style="color:#005d89"><b>On the other side, we have Python-oriented countries. This group also includes 9 countries: from the Far East (Japan, China, South Korea), the Middle East (Iran, Turkey, Israel), as well as Russia, Portugal and Tunisia</b></span>. It is worth noting that the general dominance of Python (independent of the global average) is visible in every country - there is no country where R is used more than Python. Among the countries in which the popularity of both languages is above average, the following should be distinguished: the Netherlands, Canada, the United Kingdom, France and Australia. Countries where both languages are relatively less popular are the Philippines and Saudi Arabia.</div>
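
<div style="text-align: justify;">The grouping rule described above, including the two remaining quadrants mentioned at the end of the paragraph, can be sketched on invented data. The table <code>toy</code> and its values are hypothetical. As in the notebook's chart code, the "global average" is taken as the plain mean of the country-level ratios.</div>

```r
# Toy country table with hypothetical ratios
toy <- data.frame(
  CountryName = c("A", "B", "C", "D"),
  PythonRatio = c(0.90, 0.70, 0.85, 0.75),
  RRatio      = c(0.10, 0.30, 0.25, 0.15)
)

py_mean <- mean(toy$PythonRatio)   # average Python ratio across countries
r_mean  <- mean(toy$RRatio)        # average R ratio across countries

# Assign each country to one of the four quadrants around the two averages
toy$Group <- ifelse(toy$PythonRatio >= py_mean & toy$RRatio <  r_mean, "Python country",
             ifelse(toy$PythonRatio <  py_mean & toy$RRatio >= r_mean, "R country",
             ifelse(toy$PythonRatio >= py_mean & toy$RRatio >= r_mean, "Both above average",
                    "Both below average")))
print(toy)
```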

In [None]:
# Coding vs ML
ggplot(Chart, aes(Over5YearsCodingRatio, Over5YearsMLRatio))+
    geom_point(shape = 21, size = 5.5, fill = "gray10", alpha = 0.6)+
    geom_text_repel(aes(label = CountryName), size = 6.5, colour = "gray30", max.overlaps = 50, fontface = "bold")+      
    annotate("rect", xmin = 0.4, xmax = 0.6, ymin = 0.2, ymax = 0.3, alpha = 0.2, colour = "#df9100", fill = "#df9100")+
    annotate("text", x = 0.465, y = 0.275, label = "Experienced with\n coding and ML", size = 9.3, colour = "#df9100", fontface = "bold")+
    annotate("rect", xmin = 0.0, xmax = 0.20, ymin = 0.0, ymax = 0.1, alpha = 0.2, colour = "#005d89", fill = "#005d89")+
    annotate("text", x = 0.07, y = 0.075, label = "Inexperienced with\n coding and ML", size = 9.3, colour = "#005d89", fontface = "bold")+
    annotate("rect", xmin = 0.4, xmax = 0.6, ymin = 0.0, ymax = 0.1, alpha = 0.2, colour = "#c35493", fill = "#c35493")+
    annotate("text", x = 0.5, y = 0.025, label = "Experienced with coding,\n inexperienced in ML", size = 9.3, colour = "#c35493", fontface = "bold")+
    scale_y_continuous(labels = scales::percent)+
    scale_x_continuous(labels = scales::percent)+
    labs(x = "Ratio of people coding over 5 years", y = "Ratio of people using ML methods over 5 years", title = "Ratio of people coding > 5 years vs. people using ML methods > 5 years by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(axis.text = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">Going the other way, we now come to the second strongest relationship: the percentage of people coding for at least 5 years (X-axis) and the percentage of people using machine learning methods for at least 5 years (Y-axis). The value of Spearman's rank correlation coefficient here is 0.92, so the features are strongly positively correlated. In the chart, we distinguish 3 groups of countries according to the combination of these factors: low values in both factors, high values in both factors, and high values in coding but low in ML methods (the fourth option, high ML experience but low programming experience, does not occur in any country). <span style="color: #005d89"><b>The first group, which we call "starters", consists of countries where both the percentage of people coding for over 5 years (< 20%) and the percentage of people using ML methods for over 5 years (< 10%) are low. There are 12 such countries, located in Africa and Asia (Russia is the one exception)</b></span>. Nigeria is closest to zero in both factors at the same time. <span style = "color: #df9100"><b>On the other hand, we have countries with extensive experience in both topics: the proportion of people with long programming experience is relatively large (> 40%), as is the proportion of those experienced in ML (> 20%). This group includes 4 countries, all from Europe: Germany (at the forefront in ML), the Netherlands (at the forefront in programming), Portugal and France</b></span>. Israel is also very close to joining this group, with a good number of experienced programmers but just short of the 20% threshold for experienced ML users. <span style = "color: #c35493"><b>The last group combines two features of the previous ones: countries with a high percentage of experienced programmers (> 40%) but a low percentage of people experienced in using ML methods (< 20%). This group is made up of only one country - Japan, where almost 45% of respondents declare long programming experience, but fewer than 8% have been using machine learning methods for at least 5 years</b></span>. The trend in the data is not linear - the percentage of ML experts grows much more slowly than the percentage of programming experts.</div>
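
<div style="text-align: justify;">To make the two steps above concrete, here is a minimal sketch in base R. The data frame <code>toy</code>, its column names and its values are hypothetical and are not taken from the survey; only the thresholds (20%/10% and 40%/20%) come from the description above.</div>

```r
# Hypothetical toy data: one row per country, two factors on a 0-1 scale.
toy <- data.frame(
  country   = c("A", "B", "C", "D", "E"),
  coding_5y = c(0.10, 0.18, 0.30, 0.45, 0.44),
  ml_5y     = c(0.02, 0.05, 0.11, 0.22, 0.07)
)

# Spearman's coefficient is Pearson's correlation computed on ranks,
# so it measures monotonic (not only linear) association.
rho <- cor(toy$coding_5y, toy$ml_5y, method = "spearman")

# Assign each country to one of the groups described in the text.
toy$group <- with(toy,
  ifelse(coding_5y < 0.20 & ml_5y < 0.10, "starters",
  ifelse(coding_5y > 0.40 & ml_5y > 0.20, "experienced in both",
  ifelse(coding_5y > 0.40 & ml_5y < 0.20, "coding only", "other"))))

print(rho)
print(toy)
```

<div style="text-align: justify;">Because the coefficient is rank-based, it stays high even when the relationship is monotonic but not linear, which is the situation described for these two factors.</div>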

In [None]:
# Over1kWorkersRatio vs NumberOver10kRatio
ggplot(Chart, aes(Over1kWorkersRatio, NumberOver10kRatio))+
    geom_point(shape = 21, size = 5.5, fill = "gray10", alpha = 0.6)+
    geom_text_repel(aes(label = CountryName), size = 6.5, colour = "gray30", max.overlaps = 50, fontface = "bold")+       
    annotate("rect", xmin = 0.3, xmax = 0.45, ymin = 0.4, ymax = 0.65, alpha = 0.2, colour = "#df9100", fill = "#df9100")+
    annotate("text", x = 0.37, y = 0.618, label = "Big companies, big salaries", size = 9.3, colour = "#df9100", fontface = "bold")+
    annotate("rect", xmin = 0.00, xmax = 0.15, ymin = 0.00, ymax = 0.25, alpha = 0.2, colour = "#005d89", fill = "#005d89")+
    annotate("text", x = 0.075, y = 0.215, label = "Small companies, small salaries", size = 9.0, colour = "#005d89", fontface = "bold")+
    scale_fill_manual(values = c("#df9100", "#005d89"))+
    scale_y_continuous(labels = scales::percent)+
    scale_x_continuous(labels = scales::percent)+
    labs(x = "Ratio of people working in companies with more than 1,000 employees", y = "Ratio of people earning over 10,000 USD per year", title = "Ratio of people working with >1k employees vs. people earning >10k USD yearly by countries", 
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(axis.text = element_text(size = 18, colour = "gray45"))

<div style="text-align: justify;">Time for the last pair, where we look simultaneously at the factor showing the percentage of people working in large companies (X-axis) and the factor showing the percentage of people who have earned at least 10,000 USD in the last 12 months (Y-axis). There is a strong positive correlation here: the value of the correlation coefficient is 0.82. We will look at two groups here: countries near the origin of the coordinate system and countries far from it. <span style="color: #005d89"><b>The first group consists of countries where fewer than 15% of people work in large companies and, at the same time, fewer than one in four respondents had high earnings in the last 12 months. There are 15 countries in this group, with representatives of all continents except Europe and North America</b></span>. This group is also characterized by high density - its countries do not differ much in the values of both factors. <span style="color: #df9100"><b>The second group consists of countries where at least 30% of the respondents work with at least 1,000 other people and at least 40% of the respondents have earned over 10,000 USD in the last 12 months. Only 2 countries in this group stand out from all the rest: the Netherlands and Japan</b></span>. France and the United States are also relatively close to this group (they meet the earnings threshold, but too few people work for large companies).</div>
<br>
<div style="text-align: justify;">This pair of coefficients ends our analysis of the correlation between the factors - the remaining 99 pairs are certainly interesting to analyze, but they are already becoming material for a book instead of a notebook.</div>
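
<div style="text-align: justify;">For readers who want to explore the other pairs themselves, the sketch below ranks all pairs of factors by the absolute value of Spearman's coefficient. With 15 factors there are choose(15, 2) = 105 distinct pairs in total. The data frame <code>factors</code> is a randomly generated, hypothetical stand-in for the 15 factor columns, so the resulting coefficients carry no meaning.</div>

```r
set.seed(1)

# Hypothetical stand-in for the factor table: 39 countries x 15 factors in [0, 1].
factors <- as.data.frame(matrix(runif(39 * 15), nrow = 39))
names(factors) <- paste0("Factor", 1:15)

# Full Spearman correlation matrix between the factors.
rho <- cor(factors, method = "spearman")

# Keep each unordered pair once by taking the upper triangle only.
idx <- which(upper.tri(rho), arr.ind = TRUE)
pairs <- data.frame(
  x   = rownames(rho)[idx[, "row"]],
  y   = colnames(rho)[idx[, "col"]],
  rho = rho[idx]
)

# Strongest relationships (positive or negative) first.
pairs <- pairs[order(-abs(pairs$rho)), ]
head(pairs)
```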

<a id="section-43"></a>
<div class="alert alert-light" role="alert">
<h3><b>4.3. Clusterization of countries by factors</b></h3>
</div>

<div style="text-align: justify;">We have already used one and two dimensions, but what if we use all 15 factors simultaneously? In the third and final part of the factor analysis, we use all the information contained in the factors to divide the analyzed countries into several groups that are as internally similar as possible and as different from the others as possible. We will use the <b>hierarchical clustering method</b> and then describe the groups created in this way, looking for features common to all countries in a group and for differences between groups. Our current goal is to build a dendrogram based on Ward's method and the squared Euclidean distance. The theory, the many formulas explaining how the method works, and the reason why the squared Euclidean distance is required are quite complex, so the details can be found in the well-described scientific articles listed in the footnotes [16] [17] [18] [19] [20] [21].</div>
<br>
<div style="text-align: justify;">The input to the chart will be all 15 factors, not including information on the number of respondents in each country. Factors do not require standardization or any other form of normalization, because each of them takes values from 0 to 1. We also do not define in advance the number of groups into which we want to divide our population of countries as this will depend on the shape of the dendrogram.</div>
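
<div style="text-align: justify;">As a minimal sketch of the procedure described above, the block below runs Ward's hierarchical clustering on a small, hypothetical matrix (<code>toy</code>, 9 made-up "countries" with 3 factors) and cuts the tree into 3 groups with <code>cutree()</code>. The toy values are built as three well-separated groups, so this only illustrates the mechanics and says nothing about the survey data.</div>

```r
set.seed(42)

# Hypothetical toy factors: 9 "countries" x 3 factors, built as three loose groups.
toy <- rbind(
  matrix(runif(9, 0.05, 0.20), nrow = 3),
  matrix(runif(9, 0.40, 0.55), nrow = 3),
  matrix(runif(9, 0.80, 0.95), nrow = 3)
)
rownames(toy) <- paste0("Country", 1:9)

# "ward.D2" takes plain Euclidean distances and squares them internally
# before updating clusters, which implements Ward's criterion in stats::hclust.
hc <- hclust(dist(toy, method = "euclidean"), method = "ward.D2")

# The number of groups is chosen after looking at the dendrogram; here we use 3.
clusters <- cutree(hc, k = 3)
print(table(clusters))
```

<div style="text-align: justify;">Cutting the tree after inspecting it, instead of fixing the number of groups in advance, mirrors the approach taken in this section.</div>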

In [None]:
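# Keep country names so they can label the leaves of the dendrogram
row_names_chart <- Chart$CountryName

# Drop the non-factor columns; as.data.frame() so that row names can be set
hc_data <- Chart %>%
    select(-c(CountryName, Color, NumberOfUsers)) %>%
    as.data.frame()

rownames(hc_data) <- row_names_chart

# Ward's method ("ward.D2" squares the Euclidean distances internally)
hc_data <- hc_data %>%
    dist(., method = "euclidean") %>%
    hclust(., method = "ward.D2") %>%
    dendro_data(., type = "rectangle")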

ggplot() +
    geom_segment(data = segment(hc_data), aes(x = x, y = y, xend = xend, yend = yend)) +
    geom_text(data = label(hc_data), aes(x = x, y = y, label = label, hjust = 1.05), size = 6) +
    coord_flip()+
    ylim(-0.5, 3.1)+
    annotate("rect", xmin = 0, xmax = 14.5, ymin = -0.49, ymax = 3.05, alpha = 0.3, fill = "firebrick3")+
    annotate("rect", xmin = 14.5, xmax = 27.5, ymin = -0.49, ymax = 3.05, alpha = 0.3, fill = "skyblue3")+
    annotate("rect", xmin = 27.5, xmax = 40, ymin = -0.49, ymax = 3.05, alpha = 0.3, fill = "forestgreen")+
    labs(x = "", y = "", title = "Dendrogram based on 15 factors in countries with at least 80 respondents", 
         subtitle = "Based on the squared Euclidean distance and Ward's method", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(axis.text = element_blank(), axis.line = element_blank())

<div style="text-align: justify;">When analyzing the shape of the dendrogram, we decide to <b>distinguish 3 country clusters</b> and mark them on the chart with colors: green, blue and red group. The biggest difference is between the red section and the other two. Before we move on to the description of the clusters in terms of factors, we create a map in which we color the groups, taking into account the location of the countries that are part of the three groups.</div>

In [None]:
Segments <- rbind(data.frame(region = c('Russia', 'Egypt', 'Vietnam', 'India', 'China', 'Nigeria', 'Pakistan', 'Indonesia', 'Bangladesh', 'Morocco', 'Iran', 'Tunisia'),
                             fill = 'forestgreen'),
                  data.frame(region = c('Portugal', 'Israel', 'UK', 'Italy', 'Spain', 'Canada', 'Australia', 'USA', 'Poland', 'France', 'Germany', 'Netherlands', 'United Arab Emirates', 'Japan'),
                             fill = 'firebrick3'),
                  data.frame(region = c('Peru', 'Colombia', 'South Africa', 'Mexico', 'Brazil', 'Argentina', 'Chile', 'Turkey', 'Thailand', 'Taiwan', 'South Korea', 'Saudi Arabia', 'Philippines'),
                             fill = 'skyblue3'))

World <- map_data("world") %>%
    left_join(Segments, by = 'region')

World$fill <- factor(World$fill, levels = c("forestgreen", "skyblue3", "firebrick3"))

ggplot(World, aes(x = long, y = lat, group = group, fill = fill)) +
    geom_polygon(colour = 'gray20', size = 0.3, alpha = 0.8)+
    scale_fill_manual(values = c('forestgreen', 'skyblue3', 'firebrick3'), labels = c("Data science 2030", 'Worldwide balance', "Old guard", 'Under 80 responders'))+
    labs(x = "", y = "", title = "Groups of countries from hierarchical clustering on the map", fill = 'Cluster name:',
         subtitle = "In countries with at least 80 respondents", caption = "Data source: 2022 Kaggle Machine Learning & Data Science Survey \n© Made by Michau96/Kaggle")+
    theme_fivethirtyeight()+
    theme_michau+
    theme(axis.text = element_blank(), axis.line = element_blank(), legend.position = "bottom", legend.direction = "vertical",
          legend.background = element_rect(fill = "white"), legend.title = element_text(size = 19), legend.text = element_text(size = 17))

<div style="text-align: justify;">We show the clusters on the world map to better visualize the spatial distribution of the groups of countries. Each country is filled with the color of its group, while countries that were left out of the clusterization because too few of their residents completed the questionnaire are filled with gray.</div>
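
<div style="text-align: justify;">Because the map is colored through a join on country names, a name spelled differently in the two tables would silently stay gray. A quick way to check this is sketched below; the two vectors are hypothetical stand-ins for the region names in the cluster table and in the map data.</div>

```r
# Hypothetical stand-ins for Segments$region and unique(World$region).
segment_regions <- c("USA", "UK", "Viet Nam", "South Korea")
map_regions     <- c("USA", "UK", "Vietnam", "South Korea", "Japan")

# Names present in the cluster table but absent from the map data
# would not be colored after a left join on the region name.
unmatched <- setdiff(segment_regions, map_regions)
print(unmatched)
```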
<br>
<div style="text-align: justify;">Now let's take a closer look at each cluster separately: what makes the countries within it so similar and what makes the clusters so different from each other (which factors are responsible). For each group and each factor, we use the mean of the country-level factor values and relate it to the overall mean in the format: mean for the cluster (difference between the mean of the group and the mean of all countries). This will allow us to give a general description of each group against the background of the other groups and the entire population of 39 countries.</div>
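
<div style="text-align: justify;">The "mean for the cluster (difference from the mean of all countries)" format can be reproduced with a few lines of base R. This is only a sketch: the data frame <code>toy</code> and its values are hypothetical, and it assumes that both means are unweighted averages of country-level values.</div>

```r
# Hypothetical toy data: one factor for six countries in three clusters.
toy <- data.frame(
  cluster   = c("green", "green", "blue", "blue", "red", "red"),
  men_ratio = c(0.72, 0.76, 0.78, 0.80, 0.79, 0.83)
)

# Unweighted mean over all countries (assumption: each country counts once).
overall <- mean(toy$men_ratio)

# Mean per cluster and its difference from the overall mean in percentage points.
cluster_mean <- tapply(toy$men_ratio, toy$cluster, mean)
diff_pp <- round((cluster_mean - overall) * 100)

# Format as "cluster: mean% (difference%)" with an explicit sign.
summary_txt <- sprintf("%s: %.0f%% (%+d%%)", names(cluster_mean),
                       cluster_mean * 100, as.integer(diff_pp))
print(summary_txt)
```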

<h3 style="color:#4EA24E"><b>Green cluster: Data science 2030</b></h3>

<div style="text-align: justify;">We start with the green cluster, which includes 12 countries. These are mainly Asian and African countries, plus Russia, which lies on the border of 2 continents (Europe and Asia). Knowing the geography of this country, we would rather count it as European: about 80% of its population lives in the European part, so we can guess that most of the respondents also live on this continent [3]. We start by looking at the averages of the demographic factors.
<br><br>
<p style="color:#4EA24E; margin-bottom:0"><b>Men ratio: 74% (-4%)</b></p>
<p style="color:#4EA24E; margin-bottom:0"><b>Lower than 30 years old ratio: 69% (+25%)</b></p>
<br>
When it comes to gender, this cluster has a slightly higher share of women and of people declaring a different gender or declining to answer than the overall average; however, together they are still less than 26% of respondents in this group. There is a huge age difference: the country-level average share of respondents under 30 is 68.8%, which, with the overall average below 44%, makes this a really "young" cluster in relation to the rest.
<br><br>
<p style="color:#4EA24E; margin-bottom:0"><b>Higher than bachelor degree ratio: 48% (-8%)</b></p>
<p style="color:#4EA24E; margin-bottom:0"><b>Student ratio: 64% (+18%)</b></p>
<p style="color:#4EA24E; margin-bottom:0"><b>Published papers ratio: 18% (-9%)</b></p>
<br>
People with at least a master's degree (or planning to obtain one within 2 years) make up less than half of the respondents (48%), which is less than the overall average. The share of students is much higher than the average: the national average of people who declare that they are students is 64%. Given the lower-than-average level of education and the higher-than-average number of students, it is not surprising that the percentage of people with published scientific materials is only 18%, which is 9 percentage points lower than the overall average.
<br><br>
<p style="color:#4EA24E; margin-bottom:0"><b>Over 5 years coding ratio: 14% (-17%)</b></p>
<p style="color:#4EA24E; margin-bottom:0"><b>Over 5 years in ML methods ratio: 3% (-7%)</b></p>
<br>
Experience with coding and using ML methods is much lower than in other clusters. The national average of people who declare that they have been coding for at least 5 years is 14% (the average over all countries is 31%), and for people who have used machine learning methods for equally long it is 3% (while in all countries it is 10%). This means that this group is characterized by little experience compared to the average.
<br><br>
<p style="color:#4EA24E; margin-bottom:0"><b>TPU ratio: 6% (-4%)</b></p>
<p style="color:#4EA24E; margin-bottom:0"><b>Using cloud ratio: 9% (-11%)</b></p>
<p style="color:#4EA24E; margin-bottom:0"><b>Spend money on cloud or ML ratio: 16% (-12%)</b></p>
<br>
In terms of technology, TPU is not popular in this cluster: the national average of people who have used this solution at least once in their life is about 6% (lower than the overall value). The difference is even greater when it comes to the cloud: 11 percentage points fewer people regularly use at least one platform, and 12 percentage points fewer have spent any money on cloud or machine learning in the last 5 years.
<br><br>
<p style="color:#4EA24E; margin-bottom:0"><b>Python ratio: 78% (-1%)</b></p>
<p style="color:#4EA24E; margin-bottom:0"><b>R ratio: 14% (-7%)</b></p>
<br>
The percentage of people programming in Python is almost the same as the overall average, at 78% on average per country. The differences are visible in the R language - only one in seven respondents uses it, while across all countries, on average, one in five declares coding in R.
<br><br>
<p style="color:#4EA24E; margin-bottom:0"><b>Over 1000 employees in company ratio: 9% (-9%)</b></p>
<p style="color:#4EA24E; margin-bottom:0"><b>Over 5 data scientists in company ratio: 10% (-10%)</b></p>
<p style="color:#4EA24E; margin-bottom:0"><b>Over 10,000 USD salary ratio: 6% (-19%)</b></p>
<br>
Finally, we turn to work-related factors. The percentage of people who work in companies with at least 1,000 employees averages 9% in this cluster, while the overall average is 18%. The percentage of people who work in companies with at least 5 data scientists is also half the global average - in each country, on average, 10% of people declare that they work in such an environment, while the national average for all 39 countries is 20%. Earnings in this cluster are low - the national average of people earning over 10,000 USD is only 6%, over 4 times lower than the average calculated for all three groups together.
<br><br>
In summary, this cluster is characterized by young people who are still studying, without a large share of highly educated people, who relatively rarely use innovative technology, have short coding and ML experience and rarely work in large companies with high earnings and a large number of data scientists. <span style="color:#4EA24E"><b>We call this group "Data Science 2030" because, although the cluster is not the most innovative, profitable and developed at this point, its very young age and willingness to learn promise it a bright future</b></span>. People filling out the survey should be in a better place in a few years when it comes to education, earnings or technologies used, so there are many indications that the future of data science is already being written in these 12 countries.
</div>

<h3 style="color:#86B5D4"><b>Blue cluster: Worldwide balance</b></h3>

<div style="text-align: justify;">The countries marked in blue in the dendrogram form the second cluster. It is slightly larger than the green cluster (13 countries, 1 more), and its core consists of 6 countries of South and Central America. Besides these, we see countries from different regions of the world: South Africa, Turkey, Thailand, the Philippines, Taiwan, South Korea and Saudi Arabia. These are economically developed countries, but not among the very top. In this group we will not find any European or North American countries, but it is undoubtedly the most versatile and geographically interesting cluster.
<br><br>
<p style="color:#86B5D4; margin-bottom:0"><b>Men ratio: 79% (+1%)</b></p>
<p style="color:#86B5D4; margin-bottom:0"><b>Lower than 30 years old ratio: 39% (-5%)</b></p>
<br>
In this multi-continental group, the average percentage of respondents who declare that they are male is 79%, practically equal to the overall average. The proportion of people under the age of 30 for this group is 39%, so it is lower than the average (by 5 percentage points) and much lower than in the green cluster.
<br><br>
<p style="color:#86B5D4; margin-bottom:0"><b>Higher than bachelor degree ratio: 50% (-6%)</b></p>
<p style="color:#86B5D4; margin-bottom:0"><b>Student ratio: 46% (+1%)</b></p>
<p style="color:#86B5D4; margin-bottom:0"><b>Published papers ratio: 25% (-2%)</b></p>
<br>
Moving on to the factors related to education, we see that the share of people with a degree higher than a bachelor's (or planning to obtain one within 2 years) is 50%, slightly lower than the average, while the student ratio averages 46% across the countries forming the group. Every fourth respondent in this group declares having published scientific papers, which is slightly lower than the overall average. In general, the cluster is quite centered in terms of educational aspects, but a bit closer to the green cluster, which is dominated by students and people without advanced degrees and publications.
<br><br>
<p style="color:#86B5D4; margin-bottom:0"><b>Over 5 years coding ratio: 28% (-3%)</b></p>
<p style="color:#86B5D4; margin-bottom:0"><b>Over 5 years in ML methods ratio: 6% (-3%)</b></p>
<br>
In the blue countries, on average, 28% of respondents declare at least 5 years of experience in coding (which is 3 percentage points less than the average for all countries). We have the same absolute differences in the percentage of people with more than 5 years of experience in using machine learning methods, which is declared by an average of 6% of respondents from each country in this group (9% in the population). Again, the cluster is close to the global average, but with a slight shift to less experience.
<br><br>
<p style="color:#86B5D4; margin-bottom:0"><b>TPU ratio: 9% (-1%)</b></p>
<p style="color:#86B5D4; margin-bottom:0"><b>Using cloud ratio: 19% (-1%)</b></p>
<p style="color:#86B5D4; margin-bottom:0"><b>Spend money on cloud or ML ratio: 27% (-1%)</b></p>
<br>
Turning to the technological factors, the blue cluster is again close to the overall average. The TPU ratio in this cluster is 9% (compared with 10% in all countries). On average, fewer than one in five respondents declares using at least one cloud platform - about the same as across all groups taken together. Spending on cloud computing and machine learning is also at nearly the same level as the overall average (27% vs. 28% of people who spent at least 1 USD on these technologies).
<br><br>
<p style="color:#86B5D4; margin-bottom:0"><b>Python ratio: 75% (-4%)</b></p>
<p style="color:#86B5D4; margin-bottom:0"><b>R ratio: 22% (+2%)</b></p>
<br>
In programming languages, on average in the countries belonging to the second cluster, 75% of people declare using Python, which is 4 percentage points lower than the overall average. It is also the lowest value of all 3 groups. On the other hand, the R language has above-average popularity: its use is declared by an average of 22% of respondents in these countries, over one and a half times the share in the green cluster.
<br><br>
<p style="color:#86B5D4; margin-bottom:0"><b>Over 1000 employees in company ratio: 18% (0%)</b></p>
<p style="color:#86B5D4; margin-bottom:0"><b>Over 5 data scientists in company ratio: 18% (-2%)</b></p>
<p style="color:#86B5D4; margin-bottom:0"><b>Over 10,000 USD salary ratio: 21% (-4%)</b></p>
<br>
Finally, the work-related factors, the first of which is the size of the company. The percentage of respondents employed in companies with at least 1,000 employees is 18%, both in the countries of the analyzed group and in the entire population. Also, on average, 18% of people declare working in companies that employ at least 5 data scientists, but this time it is below the global average (which is 20%). Earnings are also somewhat below the global level - in the "blue" countries, on average, 21% of respondents earn more than 10,000 USD per year, while the country-level average in the entire population is 25%.
<br><br>
Taking into account all factors, we can see that for many of them the difference between the mean value of the factor in the cluster and in the population is very small. Differences, if present, are always in the single digits of percentage points. <span style="color:#86B5D4"><b>Due to the similarity to the overall average, we call this cluster "Worldwide balance", which on the one hand reflects the geographical spread of this group and, on the other hand, conveys the moderation in the values of almost all factors</b></span>. The respondents from countries in this cluster are mostly men of moderate age with higher education, but there are also students. They have relatively short experience in coding and ML, they are quite unwilling to use cloud and TPU resources, they commonly use Python and rarely work in large companies with high earnings, although such situations also occur in this versatile group of countries.
</div>


<h3 style="color:#D75151"><b>Red cluster: Old guard</b></h3>

<div style="text-align: justify;">We close the clusterization with the red group, which is the largest because it includes 14 countries. Europe has a geographic majority, with 8 countries. The list is completed by the countries of North America (USA, Canada), Australia, Israel, the United Arab Emirates and Japan. The a priori feature that connects these countries is high GDP per capita and a strongly developed economy.
<br><br>
<p style="color:#D75151; margin-bottom:0"><b>Men ratio: 80% (+2%)</b></p>
<p style="color:#D75151; margin-bottom:0"><b>Lower than 30 years old ratio: 27% (-17%)</b></p>
<br>
We start with demographic characteristics, the first of which is gender. On average, 80% of respondents from red countries declare that they are male, which is higher than the average for all 39 countries. This cluster is by far the oldest: on average, only 27% of its respondents are under 30 years of age, which is much lower than the overall average of 44%.
<br><br>
<p style="color:#D75151; margin-bottom:0"><b>Higher than bachelor degree ratio: 70% (+13%)</b></p>
<p style="color:#D75151; margin-bottom:0"><b>Student ratio: 29% (-17%)</b></p>
<p style="color:#D75151; margin-bottom:0"><b>Published papers ratio: 36% (+9%)</b></p>
<br>
Moving on to factors related to education, 70% of respondents in this cluster declare that they have, or intend to have within 2 years, a master's degree, PhD or professor degree, which is 13 percentage points higher than in the entire population. Strongly related to this is the percentage of students, which is low (29% in this group, 17 percentage points lower than in the entire population). On the other hand, the percentage of people who have published scientific materials is higher - more than one in three respondents from this cluster has done so, compared with one in four in the entire population.
<br><br>
<p style="color:#D75151; margin-bottom:0"><b>Over 5 years coding ratio: 49% (+18%)</b></p>
<p style="color:#D75151; margin-bottom:0"><b>Over 5 years in ML methods ratio: 17% (+8%)</b></p>
<br>
Experience in both coding and using ML methods is above average here. Almost half (49%) of the respondents, looking at values aggregated by country, declare at least 5 years of experience in writing code (in the entire population it is 31%), while the average share of people using machine learning methods for at least 5 years across the 14 countries making up this cluster is 17%, almost twice the population mean.
<br><br>
<p style="color:#D75151; margin-bottom:0"><b>TPU ratio: 15% (+5%)</b></p>
<p style="color:#D75151; margin-bottom:0"><b>Using cloud ratio: 29% (+10%) </b></p>
<p style="color:#D75151; margin-bottom:0"><b>Spend money on cloud or ML ratio: 39% (+11%)</b></p>
<br>
In terms of technology use, this cluster has the highest average percentage of people using TPU - in each country, on average, 15% of respondents say that they have used a Tensor Processing Unit at least once (in the entire population, the country-level average is 10%). This group also likes to use cloud solutions - 29% of respondents declare that they use at least one cloud platform (in the population this value is 10 percentage points lower), and 39% of people in red countries declare that they have spent money on cloud computing or machine learning.
<br><br>
<p style="color:#D75151; margin-bottom:0"><b>Python ratio: 83% (+5%)</b></p>
<p style="color:#D75151; margin-bottom:0"><b>R ratio: 25% (+4%)</b></p>
<br>
In this cluster, the popularity of programming languages is also above average. 83% of respondents declare that they use Python, and every fourth person says that they program in R. Both of these values are greater than in the population, but it is worth noting that, in contrast to the previous groups of factors, the difference here is not large (5 and 4 percentage points, respectively).
<br><br>
<p style="color:#D75151; margin-bottom:0"><b>Over 1000 employees in company ratio: 26% (+8%)</b></p>
<p style="color:#D75151; margin-bottom:0"><b>Over 5 data scientists in company ratio: 30% (+11%)</b></p>
<p style="color:#D75151; margin-bottom:0"><b>Over 10,000 USD salary ratio: 45% (+20%)</b></p>
<br>
Finally, in the work-related factors, the share of people employed in large companies is higher than in the population, as 26% of respondents from the analyzed countries declare work in a company with at least 1,000 employees (18% in the population). The number of data scientists in the company exceeds the average even more than company size does - 30% of respondents declare that their company employs at least 5 data scientists, which significantly exceeds the population value of 19%. We can also see a significant difference in earnings, where 45% of respondents declare an income of at least 10,000 USD within 12 months - this is 20 percentage points higher than in the population, where every fourth respondent declares such income in the last year.
<br><br>
This group stands out from the other two, especially when compared with the green cluster. <span style="color:#D75151"><b>We call this group "Old Guard" because of the large share of experienced people, often older and with higher education, who have solved many data problems in their careers. They are now leaders, but their future position as countries is at stake as more and more people in the other two clusters are taking their first steps in the world of data</b></span>. People who are part of this cluster are characterized by a more advanced average age and level of education, much longer experience in coding and using machine learning, above-average knowledge of cloud technologies and related expenses, more frequent use of both Python and R than in other countries, and frequent work in large companies with many people of a similar profile and higher earnings. These are values relative to the population average, which means that in these countries we also find students and people entering their careers without much experience, but it is easier to find such people in the other clusters.
</div>

<a id="section-5"></a>
<div class="alert alert-light" role="alert">
<h3><b>5. Summary and conclusions</b></h3>
</div>

<div style="text-align: justify;">The analysis has now been completed, so it's time to sum up. <b>Our goal was to create the determinants of respondents by country, analyze them in one and two dimensions and create groups of countries most similar to each other in relation to these factors using hierarchical grouping</b>. The goal was fully achieved and many interesting conclusions were drawn while achieving it. The key points from the analysis carried out are:</div>

1. In total, we managed to create 15 factors for respondents from 39 analyzed countries based on over 21,000 completed questionnaires.
2. 40% of the surveyed questionnaires were from one country (India) and 60% from 3 countries (India + USA + Japan).
3. The top countries with the highest masculinization rate of the respondents are mainly countries in South America.
4. The young respondents are mostly in Asia and Africa, and the older are in Europe.
5. Two educational groups of countries can be distinguished: students, without a large share of highly educated people and scientific publications (mainly African and Asian countries), and countries with a small number of students, but with a large number of educated people (leading world economies).
6. The coding and ML experience can be seen above all in countries with older and more educated communities.
7. Japan and the Netherlands lead the way in cloud and TPU technologies, with the highest percentage of people using such tools.
8. Python is a very popular language all over the world, while R is most popular in South America, but everywhere we can see the dominance of the "snake" language.
9. Work in large companies, with high salaries and many data scientists is the domain of western countries, led by the Netherlands, Germany and the UK.
10. The countries that differ most from the rest are the Netherlands, Nigeria and Tunisia.
11. The strongest correlations can be found between factors of the same category (education, technology and jobs).
12. The countries where young men are the most dominant are Bangladesh, Pakistan and Viet Nam.
13. Japan is unique in terms of experience: people often have a long experience in coding but very little in using ML methods compared to other countries.
14. The largest companies with the highest salaries at the same time are the domain of Western Europe, Japan and the USA.
15. Data science 2030 is a group of people from Asian and African countries full of students and people who enter the broadly understood world of data.
16. Worldwide balance is a group of people who belong to countries where most statistics are close to the global average (halfway between the "Data Science 2030" cluster and the "Old guard" group).
17. Old guard is a set of highly developed countries in which there are many people with extensive education and professional and technical experience.

<div style="text-align: justify;">The summary is ready, so, looking more broadly, we can ask what the analysis means for the entire community and how today's country-level picture of data science may shape the near future of the whole industry. The most important suggestions at this point are:</div><br>

<b>A. If you are a respondent from green cluster countries... you have a lot of competition around you but also a lot of potential!</b><br>
<div style="text-align: justify;">Green cluster countries are among the most densely populated places with a large number of young people (this is also visible in the number of respondents and in publicly available demographic data). For young data enthusiasts who are starting their career path, this is a great advantage and a disadvantage at the same time. On the one hand, a large number of students means that there are a whole lot of people of a similar age with whom you can cooperate and who can motivate you, and such a large share of young people is certainly a reason for optimism about the speed of market development and its potential. On the other hand, so many people mean a lot of competition, so it may be more difficult for people with no experience to land a dream job. Nevertheless, chin up - the future of the data world is already being born in India, Nigeria, Pakistan and Indonesia!</div><br>

<b>B. If you are an experienced ML engineer and you are not in a western country then... maybe it is worth expanding your contacts or pursuing your career in the red group countries!</b><br>
<div style="text-align: justify;">Although students are often still the majority in the red cluster countries, these countries have a much larger percentage of people from whom you can learn a lot (extensive work experience, high academic degrees). If you hold or are pursuing a doctorate, or have at least several years of experience working with ML methods, a trip to the red group countries (if you do not live in one yet) may be helpful, because you will work alongside people who are equally or even more experienced, which will surely push your skills and development forward!</div><br>

<b>C. If you don't use Python and cloud tools then... maybe it's worth changing that!</b><br>
<div style="text-align: justify;">Python is almost everywhere in data today and is the core technology for data scientists, no matter where you live. So it is worth having it in your technology stack, and if you already have it, developing it further. Cloud computing services are less popular, but very useful. It is worth noting that there is a very strong positive correlation between earnings and the use of the cloud at the country level, so learning at least one cloud technology gives you a good chance of translating it into higher earnings in the future!</div><br>

<b>D. If your company is looking for new data enthusiasts then... maybe it is worth opening an office in Asia!</b><br>
<div style="text-align: justify;">Large companies are often based in developed countries and face a shortage of workers there. If your company has adequate funds and lacks data scientists (especially less experienced ones), there will certainly be plenty of people in Asian and African countries who could potentially work with you. Of course, the reality is more complicated and opening a branch in India or Nigeria is neither trivial nor quick, but it is worth keeping in mind that there are currently a lot of young, talented and ready-to-learn people who want to work with data!</div><br>

<b>E. No matter where you are from... it is important that you are developing!</b>
<div style="text-align: justify;">And finally, the most important thing: we only talked about dependencies at the country level and general trends that do not describe any specific person. Remember that the place where you were born or where you currently live is usually important, but it is not everything. If you are from a country where the percentage of highly experienced and highly educated people is low, it does not mean that the same will be true of you. The most important things are progress, continuous learning and not falling into stereotypes. More than anything else, you decide what kind of data enthusiast you are and what kind you will be in the future. Chin up and good luck!</div>

<a id="section-6"></a>
<div class="alert alert-light" role="alert">
<h3><b>6. Sources</b></h3>
</div>

<b>Knowledge references</b>:

[1] [(2022 October). <i>2022 Kaggle DS & ML Survey: List of Questions and Answer Choices</i>](https://www.kaggle.com/competitions/kaggle-survey-2022/data)<br>
[2] [(2022 October). <i>2022 Kaggle DS & ML Survey: Methodology document</i>](https://www.kaggle.com/competitions/kaggle-survey-2022/data)<br>
[3] [Eberstadt, N., (2010 May). <i>Russia's Peacetime Demographic Crisis: Dimensions, Causes, Implications</i>, NBR Project Report](https://www.nbr.org/wp-content/uploads/pdfs/russia_pr_may2010.pdf)<br>
[4] [Visual Capitalist, (2022 February). <i>The global religious composition landscape</i>](https://www.visualcapitalist.com/wp-content/uploads/2022/02/Worlds-Major-Religions.html)<br>
[5] [World data, <i>Median age by country</i>](https://www.worlddata.info/average-age.php)<br>
[6] [Hess, A.J., (2018 February). <i>The 10 most educated countries in the world</i>, CNBC](https://www.cnbc.com/2018/02/07/the-10-most-educated-countries-in-the-world.html)<br>
[7] [<i>SJR - International Science Ranking</i>](https://www.scimagojr.com/countryrank.php?order=itp&ord=desc&year=2020)<br>
[8] [McCulloch W.S., Pitts W., (1943). <i>A logical calculus of the ideas immanent in nervous activity</i>, The bulletin of mathematical biophysics 5, p. 115–133](https://link.springer.com/article/10.1007/BF02478259)<br>
[9] [Breiman L., (2001). <i>Random Forests</i>, Machine Learning 45, p. 5–32](https://link.springer.com/article/10.1023/A:1010933404324)<br>
[10] [(2022 October). <i>Tensor Processing Unit</i>](https://cloud.google.com/tpu/docs/tpus)<br>
[11] [Hsu, H., (2018 April). <i>2018 Museum Fellow Guido van Rossum, Python Creator & Benevolent Dictator for Life</i>, Computer History Museum](https://computerhistory.org/blog/2018-chm-fellow-guido-van-rossum-python-creator-benevolent-dictator-for-life/?key=2018-chm-fellow-guido-van-rossum-python-creator-benevolent-dictator-for-life)<br>
[12] [<i>Microsoft R Application Network</i>](https://mran.microsoft.com/documents/what-is-r)<br>
[13] [International monetary fund, (2022 October). <i>World Economic Outlook database: October 2022</i>](https://www.imf.org/en/Publications/WEO/weo-database/2022/October/)<br>
[14] [Petrović, S., Lončarić, Z., Rebekić, A.E., Marić, S., (2015 December). <i>Pearson's or Spearman's correlation coefficient - Which one to use?</i>, Poljoprivreda 21(2), p. 47-54, doi:10.18047/poljo.21.2.8](https://www.researchgate.net/publication/288992202_Pearson's_or_spearman's_correlation_coefficient_-_Which_one_to_use)<br>
[15] [Wiśniewski, J.W., (2022 July). <i>The possibilities on the use of the Spearman correlation coefficient</i>, V Nr 1. 151-162.](https://www.researchgate.net/publication/362218857_THE_POSSIBILITIES_ON_THE_USE_OF_THE_SPEARMAN_CORRELATION_COEFFICIENT)<br>
[16] [Estivill-Castro, V., (2002 June). <i>Why so many clustering algorithms – a position paper</i>, ACM SIGKDD Explorations Newsletter, Volume 4, Issue 1](https://dl.acm.org/doi/10.1145/568574.568575)<br>
[17] [Nielsen, F., (2016 February). <i>Hierarchical clustering</i>, 10.1007/978-3-319-21903-5_8.](https://www.researchgate.net/publication/314700681_Hierarchical_Clustering)<br>
[18] [Penn State, STAT 505, <i>Ward’s Method</i>](https://online.stat.psu.edu/stat505/lesson/14/14.7)<br>
[19] [Murtagh, F., (2011 November). <i>Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm</i>](https://www.researchgate.net/publication/51962445_Ward's_Hierarchical_Clustering_Method_Clustering_Criterion_andAgglomerative_Algorithm)<br>
[20] [Liberti, L., Lavor, C., Maculan, N., Mucherino, A., (2012 May). <i>Euclidean Distance Geometry and Applications</i>](https://www.researchgate.net/publication/224904730_Euclidean_Distance_Geometry_and_Applications)<br>
[21] [Legendre, P., Murtagh, F., (2011 December). <i>Ward’s Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm</i>](https://www.researchgate.net/publication/51962445_Ward's_Hierarchical_Clustering_Method_Clustering_Criterion_andAgglomerative_Algorithm)<br>

<b>Technical references</b>:

&nbsp; [a] [2022 Kaggle Machine Learning & Data Science Survey data](https://www.kaggle.com/competitions/kaggle-survey-2022/data)<br>
&nbsp; [b] [Education level affects data analysis?](https://www.kaggle.com/code/michau96/education-level-affects-data-analysis/notebook)<br>
&nbsp; [c] [Bootstrap v5.2.2 documentation](https://getbootstrap.com/docs/4.0/components/alerts/)<br>
&nbsp; [d] [Data color picker](https://www.learnui.design/tools/data-color-picker.html#palette)<br>
&nbsp; [e] [Writing Mathematic Fomulars in Markdown](https://csrgxtu.github.io/2015/03/20/Writing-Mathematic-Fomulars-in-Markdown/)<br>
&nbsp; [f] [ggplot2 documentation](https://ggplot2.tidyverse.org/)<br>
&nbsp; [g] [ggplot2: Elegant Graphics for Data Analysis](https://ggplot2-book.org/index.html)<br>
&nbsp; [h] [dplyr documentation](https://dplyr.tidyverse.org/)<br>
&nbsp; [i] [ggchicklet documentation](https://github.com/hrbrmstr/ggchicklet)<br>
&nbsp; [j] [ggforce documentation](https://ggforce.data-imaginist.com/)<br>
&nbsp; [k] [ggrepel documentation](https://ggrepel.slowkow.com/index.html)<br>
&nbsp; [l] [ggcorrplot documentation](https://rpkgs.datanovia.com/ggcorrplot/index.html)<br>
&nbsp; [m] [ggmap documentation](https://github.com/dkahle/ggmap)<br>
&nbsp; [n] [ggdendro documentation](https://cran.r-project.org/web/packages/ggdendro/vignettes/ggdendro.html)<br>
&nbsp; [o] [Dendrograms in ggplot2](https://rpubs.com/TX-YXL/662586)