**About the Analysis:**

When I chose to become one of the respondents in this survey, the main thing on my mind was to learn about the characteristics of women in data. The survey asked a lot of relevant questions, and I am sure I will find some answers and be able to relate to one of the categories in this community. I am mainly focusing on women Kagglers (with no offence to any other category), and this is purely out of curiosity. I will compare the results with all other categories as well, but that will come towards the end of this analysis, where there are a lot of questions related to barriers faced while coding.

So, without further ado, let's try to make a fairly simple (read the visualizations) and impactful (I hope to find something interesting) analysis.

**We will start by loading the libraries and checking the data**

In [None]:
library('tidyverse') 
library('leaflet')
library('ggmap')
library('GGally')
library('viridis')
library('plotly')
library('IRdisplay')
library('ggrepel')
library('cowplot')

options(warn = -1)

list.files(path = "../input")

In [None]:
multichoice <- read_csv("../input/multipleChoiceResponses.csv")
#names(multichoice) <- as.data.frame(multichoice[1,])
#multichoice <- multichoice[-1,]
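
The commented-out lines above hint that the first row of this file is not a response. As a minimal sketch, assuming that row holds the question text for each column (an assumption drawn from those comments and from the `multichoice[-1,]` used later), the toy data frame `raw` below, which is hypothetical and not the survey data, shows one way to keep the questions as a lookup and the responses as data:

```r
# Toy stand-in for the survey file (hypothetical values, not the real data).
raw <- data.frame(
  Q1 = c("What is your gender?", "Female", "Male", "Male"),
  Q2 = c("What is your age?", "22-24", "25-29", "30-34"),
  stringsAsFactors = FALSE
)

# Assumption: row 1 holds the question text, the remaining rows are answers.
questions <- unlist(raw[1, ])   # named character vector: column name -> question
responses <- raw[-1, ]          # everything except the question row

questions[["Q2"]]               # look up the wording behind a column
nrow(responses)                 # number of actual responses
```

Keeping `questions` around makes it easier to check what a column such as `Q9` actually asked before plotting it.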

In [None]:
str(multichoice)

In [None]:
head(multichoice)

**1- Percentage of Wogglers**

Out of the total respondents in the survey, around **16.81%** are **female Kagglers**. Males form the larger part of the community with **81.44%**. A few respondents **preferred not to say (1.43%)** and a few preferred to **self-describe (0.33%)**.

In [None]:
theme1 <- theme_bw()+theme(text = element_text(size=20),
                           legend.position = 'none', plot.title = element_text(hjust = 0.5))

options(repr.plot.width=12, repr.plot.height=6)

multichoice[-1,] %>% group_by(Q1)%>%summarise(Count = length(Q1))%>%
mutate(pct = prop.table(Count)*100)%>%
    ggplot(aes(x = reorder(Q1, -pct), y = pct, fill = Q1)) + 
   geom_bar(stat = 'identity') + scale_fill_brewer(palette="Set1",  na.value = "gray")+
    geom_text(aes(label = sprintf("%.2f%%", pct)), hjust = 0.5,
            vjust = -0.5, size =5)+ theme1+  xlab("") + ylab("Percent")+
              ggtitle("Respondents Gender")
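
To make the percentage step concrete, here is a minimal base-R sketch of what `prop.table(Count)*100` does. The `gender` vector is made up for illustration (hypothetical values, not the survey responses).

```r
# Made-up answers (hypothetical, not the survey data).
gender <- c("Female", "Male", "Male", "Male", "Prefer not to say")

counts <- table(gender)            # how many times each answer occurs
pct <- prop.table(counts) * 100    # each count divided by the total, as a percent

round(pct, 2)
```

The percentages always sum to 100 because every count is divided by the same total.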

**2- What is her age? "Age is just a number for her"**

I am amazed by the variety of age groups in this section. Honestly, I was expecting the ranges to go up to 45, but we have a few women Kagglers who are **60+, 70+ and even 80+** years of age. This is so inspiring.

While **comparing** the age ranges of **female and male Kagglers**, I found that the **percentage** of **female** Kagglers is higher than that of male Kagglers in the age ranges **22-24 (by 5%)** and **25-29 (around 2%)**.

In [None]:
options(repr.plot.width=12, repr.plot.height=8)

multichoice$Q2 <- factor(multichoice$Q2, 
                         levels = c("18-21","22-24","25-29",
                                   "30-34","35-39","40-44",
                                   "45-49","50-54","55-59",
                                   "60-69","70-79","80+"))

multichoice %>% 
group_by(Q1,Q2)%>%
filter(Q1 == "Female"| Q1== "Male")%>%
summarise(Count = length(Q2))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = Q2, y = pct, fill = pct)) + 
   geom_bar(stat = 'identity') + scale_fill_viridis(direction = -1)+
    facet_grid(Q1~.)+
    geom_text(aes(label = sprintf("%.2f%%", pct)), hjust = 0.5,
            vjust = -0.1, size =4)+ theme1+  xlab("") + ylab("Percent")+
              ggtitle("Age Comparison")
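
The age plot compares percentages computed within each gender, not across the whole sample. A small base-R sketch of that idea follows, using made-up respondents (hypothetical values) and `prop.table()` with `margin = 1` so that each gender's row sums to 100:

```r
# Made-up respondents (hypothetical, not the survey data).
toy <- data.frame(
  gender = c("Female", "Female", "Female", "Male", "Male", "Male", "Male"),
  age    = c("22-24", "22-24", "25-29", "22-24", "25-29", "25-29", "25-29"),
  stringsAsFactors = FALSE
)

tab <- table(toy$gender, toy$age)                  # gender x age counts
pct_within_gender <- prop.table(tab, margin = 1) * 100  # divide by each row total

round(pct_within_gender, 1)
```

Normalising within gender is what makes the two panels comparable even though the number of male respondents is much larger.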

**3- What is her Country? "Ohh!! I miss you African Kagglers"**

This is quite intuitive: most of the women Kagglers are from the **USA** and **India**, followed by **China**. Great going. I am really happy for the women from India (you have overcome all the barriers and made me happy; I can feel your challenges). On the other hand, the **African population** is **less** represented on Kaggle in every category. Out of the many African countries, **Egypt, Nigeria and Kenya** are the only ones that appear in this survey.

Another stark difference here is that **female** Kagglers from the **USA** are far **more** numerous in percentage terms than **Indian female** Kagglers. However, this is **not the case** for their **male counterparts**: **male Kagglers** from **India** slightly **outnumber** **male Kagglers** from the **USA**.

Please read **"UK & NI"** as **United Kingdom of Great Britain and Northern Ireland**.

In [None]:
options(repr.plot.width=12, repr.plot.height=10)

#Replacing the long titles to make the plots more convenient to read
multichoice$Q3 <- 
str_replace(multichoice$Q3, "United Kingdom of Great Britain and Northern Ireland","UK & NI" )

multichoice$Q3 <- 
str_replace(multichoice$Q3, "I do not wish to disclose my location","Won't disclose" )

multichoice$Q3 <- 
str_replace(multichoice$Q3, "Iran, Islamic Republic of...","Iran" )


multichoice$Q3 <- 
str_replace(multichoice$Q3, "United States of America","USA" )

#Setting the theme for the plots
theme2 <- theme_bw()+theme(text = element_text(size=15),
                           axis.text.x = element_text(angle = 0, hjust = 1))+
theme(legend.position = 'none', plot.title = element_text(hjust = 0.5))

multichoice[-1,] %>% 
group_by(Q1,Q3)%>%
filter(Q1 == "Female"| Q1== "Male")%>%
summarise(Count = length(Q3))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q3,pct), y = pct, fill = pct)) + 
   geom_bar(stat = 'identity') + scale_fill_viridis(direction= -1)+
    facet_wrap(Q1~.)+
        theme2+  xlab("") + ylab("Percent")+coord_flip()+
              ggtitle("Country")
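
The four `str_replace()` calls above each rewrite one country label, and `str_replace()` treats its pattern as a regular expression by default. As an alternative sketch (not the notebook's method), a named lookup vector keeps all the short labels in one place and matches whole strings exactly, so characters such as `.` are not treated as wildcards. `short_names` and `shorten_country()` are hypothetical names introduced here, and the example uses base R only.

```r
# Long label -> short label (same pairs as in the cell above).
short_names <- c(
  "United Kingdom of Great Britain and Northern Ireland" = "UK & NI",
  "I do not wish to disclose my location" = "Won't disclose",
  "Iran, Islamic Republic of..." = "Iran",
  "United States of America" = "USA"
)

# Hypothetical helper: replace only values that exactly match a long label.
shorten_country <- function(x) {
  hit <- x %in% names(short_names)        # exact, whole-string matches only
  x[hit] <- unname(short_names[x[hit]])   # swap in the short label
  x                                       # everything else (including NA) is untouched
}

shorten_country(c("India", "United States of America",
                  "Iran, Islamic Republic of...", NA))
```

Adding another long label later then means adding one entry to `short_names` rather than another replacement call.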

**Country-Age-Gender Comparison:**

There are a few respondents who are 80+ years of age in both the male and female categories, which is very inspiring. I wanted to see which country they belong to and, no points for guessing, **most of the 80+** respondents are from the **USA**.

Another striking difference is that India and the USA (the top two countries by number of respondents on Kaggle) show completely different trends with respect to age. Each tile in the plot shows a country's share of the respondents within one gender and age group. In the **USA**, the **participation increases** with **increasing age**, whereas in **India** it is the other way round: the **participation decreases** with **increasing age**.

I have included all countries, so that kagglers will know about their respective countries. 

In [None]:
theme2.1 <- theme_bw()+theme(text = element_text(size=15),
                           axis.text.x = element_text(angle = 30, hjust = 1))+
theme(legend.position = 'none', plot.title = element_text(hjust = 0.5))


multichoice[] %>% 
group_by(Q1,Q2,Q3)%>%
filter(Q1 == "Female"| Q1== "Male")%>%
summarise(Count = length(Q2))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = Q2, y = Q3, fill = pct)) + 
   geom_tile() + scale_fill_viridis(direction= -1)+
    facet_wrap(Q1~.)+
     theme2.1+  xlab("") + ylab(" ")+
       ggtitle("Country-Age-Gender Comparison")
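
In the pipeline above, `summarise()` removes the last grouping variable (`Q3`), so the `prop.table()` that follows runs within each gender and age group: every tile is a country's share of that group. The base-R sketch below reproduces that denominator explicitly on a made-up data frame (hypothetical values), using `aggregate()` and `ave()` rather than dplyr.

```r
# Made-up respondents (hypothetical, not the survey data).
toy <- data.frame(
  Q1 = c("Female", "Female", "Female", "Male", "Male"),
  Q2 = c("22-24", "22-24", "25-29", "22-24", "22-24"),
  Q3 = c("India", "USA", "USA", "India", "India"),
  stringsAsFactors = FALSE
)
toy$n <- 1  # one row = one respondent

# Count respondents per gender x age x country.
counts <- aggregate(n ~ Q1 + Q2 + Q3, data = toy, FUN = sum)

# Divide each count by the total of its gender x age group.
group_total <- ave(counts$n, counts$Q1, counts$Q2, FUN = sum)
counts$pct <- counts$n / group_total * 100

counts
```

Read this way, a dark tile means a country dominates that gender and age group, not that the group itself is large.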

**4- How educated is she? "Well she is Jack of all trades and Master of all as well."**

A whopping **52%** of women on Kaggle have a **Master's degree**, which is around **7% higher** than their **male counterparts**. Higher education does matter. The same holds for Doctoral degree holders. This suggests that women with a Master's degree or above are more likely to come to this platform than their male counterparts.

But we should not miss the **0.41% of women** who have **no formal education** but are still coding. Now this is inspiring.

In [None]:
multichoice %>% 
group_by(Q1,Q4)%>%
filter(Q1 == "Female"|Q1=="Male")%>%
filter(!is.na(Q4))%>%
summarise(Count = length(Q4))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q4, -pct), y = pct, fill = Q4)) + 
   geom_bar(stat = 'identity') + scale_fill_brewer(palette="Set1",  na.value = "gray")+
    facet_grid(Q1~.)+
    geom_text(aes(label = sprintf("%.2f%%", pct)), hjust = 0.5,vjust = -0.5, size =4)+   
        scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+theme2+ 
            xlab("") + ylab("Percent")+
              ggtitle("Education")

**Country-Education-Gender Comparison**:

The dominance of **Master's degree** holders can be seen for most countries in both the male and female categories. Some countries where there are **more Bachelor's degree holders** than Master's degree holders are **India (can be seen in the male category), Vietnam, South Korea, Australia, Indonesia, Egypt, Kenya, Bangladesh and Nigeria**. The dominance of **Master's degree** holders still holds in most of **Europe and America**.

In [None]:
options(repr.plot.width=12, repr.plot.height=12)

multichoice[] %>% 
group_by(Q1,Q3,Q4)%>%
filter(Q1 == "Female"| Q1== "Male")%>%
filter(!is.na(Q4))%>%
summarise(Count = length(Q3))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = Q4, y = Q3, fill = pct)) + 
   geom_tile() + scale_fill_viridis(direction= -1)+
    facet_wrap(Q1~.)+
     #scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
      scale_y_discrete(labels = function(y) str_wrap(y, width = 20))+             
      theme2.1+  xlab("") + ylab(" ")+
       ggtitle("Country-Education-Gender Comparison")

**5- What is her Undergraduate Major? "Software we are Aware!!!!"**

While most of the respondents have a **software major**, we do have lots of other categories here (although smaller in percent), right from **'Non-computer major'** to **'Never declared a major'**.

Coming to the difference between females and males, it can be clearly seen that a higher share of female Kagglers come from a **Mathematics and Statistics** background than male Kagglers (the difference is around 5%). Similarly, **females** from **business, social science and medical backgrounds** make up a higher percentage than their male counterparts.

However, the share of females from a **non-computer-focused engineering** background is around **6% lower** than that of their male counterparts.

If you are interested, I belong to the third category from the left.

In [None]:
theme3 <- theme_bw()+theme(text = element_text(size=13),
                           axis.text.y = element_text(angle = 0, hjust = 1))+
theme(legend.position = 'none', plot.title = element_text(hjust = 0.5))

multichoice %>% 
group_by(Q1,Q5)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(!is.na(Q5))%>%
summarise(Count = length(Q5))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q5, -pct), y = pct, fill = pct)) + 
 geom_bar(stat = 'identity') + scale_fill_viridis(direction = -1)+
  facet_grid(Q1~.)+
   geom_text(aes(label = sprintf("%.2f%%", pct)), hjust = 0.5,vjust = -0.5, size =4)+
    scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
     theme3+xlab("") + ylab("Percent")+
      ggtitle("Undergraduate Major")

**6- What is her Current Role? "That nerd is a student"**

Most are students and I am happy to know that they have figured it out early. 

**More females** identify themselves as a **student (26.2%)** or **data analyst (10.85%)** when compared to their male counterparts.

In [None]:
multichoice %>% 
group_by(Q1, Q6)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(!is.na(Q6))%>%
summarise(Count = length(Q6))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q6, pct), y = pct, fill = pct)) + 
geom_bar(stat = 'identity') + scale_fill_viridis(direction= -1)+
facet_wrap(Q1~.)+coord_flip()+
geom_text(aes(label = sprintf("%.2f%%", pct)), hjust = -0.1,vjust = -0.5, size =3, angle = 0)+   
    scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
        theme3+ xlab("") + ylab("Percent")+
            ggtitle("Current Role")

**7- Which industry does she belong to? "Wait!!! She is yet to start. Go for it girl"**

While most of them are still students, the professionals from the software/computer industry are not far behind. Obviously. But don't miss the **other categories**, although their percentages are lower. Remember girls, you can apply this field in any industry. So keep learning.

In [None]:
theme4 <- theme_bw()+theme(text = element_text(size=13),
                           axis.text.x = element_text(angle = 30, hjust = 1))+
theme(legend.position = 'none', plot.title = element_text(hjust = 0.5))

multichoice %>% 
group_by(Q1,Q7)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(!is.na(Q7))%>%
summarise(Count = length(Q7))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q7, -pct), y = pct, fill = pct)) + 
geom_bar(stat = 'identity') + scale_fill_viridis(direction= -1)+
facet_grid(Q1~.)+
geom_text(aes(label = sprintf("%.2f%%", pct)), hjust = 0.5,vjust = -0.5, size =4)+   
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
theme4+xlab("") + ylab("Percent")+
ggtitle("Current Industry")

**8- What is her experience in her current role? "She is a newbie.. I told ya!!!"**

Pretty intuitive, since **most of them are students** and are here on Kaggle to build their experience and knowledge, of course. Keep going girls!!!

While making the comparison, I found that 14.16% of females and 11.74% of males did not answer the question. Apart from that, the share of students is higher among females than among males. In both categories, the 5-10 years range stands out with a comparatively high share of respondents.

In [None]:
multichoice$Q8 <- factor(multichoice$Q8, levels = c("0-1","1-2","2-3",
                                                   "3-4","4-5","5-10",
                                                   "10-15","15-20","20-25",
                                                   "25-30","30+"))

multichoice %>% 
group_by(Q1,Q8)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(!is.na(Q8))%>%
summarise(Count = length(Q8))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = Q8, y = pct, fill = Q8)) + 
geom_bar(stat = 'identity') + scale_fill_brewer(palette="Set3",  na.value = "gray")+
facet_grid(Q1~.)+
geom_text(aes(label = sprintf("%.2f%%", pct)), hjust = 0.5,vjust = -0.5, size =5)+   
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
theme1+xlab("") + ylab("Percent")+
ggtitle("Experience in Current Role")
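
Setting the levels by hand matters because these bins are stored as text, and text sorts character by character. A short base-R sketch with a few made-up answers (hypothetical values) shows the default order, the explicit order, and one side effect to keep in mind: any value that is missing from `levels` becomes `NA`.

```r
# Made-up experience answers (hypothetical, not the survey data).
exp_raw <- c("5-10", "0-1", "10-15", "0-1")

# Default: levels are sorted as text, so "10-15" lands before "5-10".
levels(factor(exp_raw))

# Explicit levels keep the bins in their natural order on the x axis.
bins <- c("0-1", "5-10", "10-15")
exp_ord <- factor(exp_raw, levels = bins)
table(exp_ord)

# A value not listed in `levels` is silently turned into NA.
unlisted <- factor("20+", levels = bins)
is.na(unlisted)
```

That last behaviour is why the cells here filter with `!is.na(...)` after building the factor: rows whose answer is not one of the listed bins drop out of the plot.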

**9- What is her yearly compensation? "She doesn't want you to know"**

**26%** of the ladies do not wish to disclose their compensation. The trend is more or less the same for both males and females, and refusing to disclose compensation is common in the other categories as well.

In [None]:
options(repr.plot.width=12, repr.plot.height=8)

theme2.2 <- theme_bw()+theme(text = element_text(size=15),
                           axis.text.x = element_text(angle = 30, hjust = 1))+
theme(plot.title = element_text(hjust = 0.5))


multichoice$Q9 <- 
str_replace(multichoice$Q9, 
            "I do not wish to disclose my approximate yearly compensation",
            "Won't disclose" )

multichoice$Q9 <- factor(multichoice$Q9, 
                         levels = c("Won't disclose",
                                   "0-10,000","10-20,000","20-30,000","30-40,000",
                                   "40-50,000","50-60,000","60-70,000","70-80,000",
                                   "80-90,000","90-100,000","100-125,000",
                                   "125-150,000","150-200,000","200-250,000",
                                   "250-300,000","300-400,000", "400-500,000","500,000+"))

multichoice %>% 
group_by(Q1,Q9)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(!is.na(Q9))%>%
summarise(Count = length(Q9))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = Q9, y = pct, fill = Q1)) + 
geom_col( position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
theme2.2+ xlab("Salary Ranges") + ylab("Percent")+
ggtitle("Yearly Salary")

In [None]:
options(repr.plot.width=12, repr.plot.height=12)

theme2.3 <- theme_bw()+theme(text = element_text(size=10),
                           axis.text.x = element_text(angle = 90, hjust = 1))+
theme(legend.position = "top",plot.title = element_text(hjust = 0.5))


multichoice[] %>% 
group_by(Q1,Q3,Q9)%>%
filter(Q1 == "Female"| Q1== "Male")%>%
filter(!is.na(Q3)) %>% filter(!is.na(Q9))%>%
filter(Q9 != "Won't disclose") %>%
filter(Q3 != "Won't disclose" & Q3 != "Other")%>%
summarise(Count = length(Q1))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = Q9, y= pct, fill = Q1)) + 
   geom_col(position = "dodge") + #scale_fill_viridis(direction= -1)+
    facet_wrap(Q3~.)+
     #scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
      # pct on the y axis is continuous, so no discrete y scale is needed here
      theme2.3+  xlab("") + ylab(" ")+#coord_flip()+
       ggtitle("Country-Salary-Gender Comparison")

**10- Does her employer use ML? "She is way ahead of the game"**

This is really interesting and relatable: around **60%** of the ladies said that their **employers** do **not** use **ML**, are only **planning to use ML**, or that they simply do not know whether ML is in use at their companies, yet these **60%** are **active** on Kaggle. This is an indication that ladies in data are a **self-starting** and **self-motivated** bunch, and I am not surprised at all.

In [None]:
options(repr.plot.width=12, repr.plot.height=10)

multichoice %>% 
group_by(Q1,Q10)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(!is.na(Q10))%>%
summarise(Count = length(Q10))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q10, -pct), y = pct, fill = pct)) + 
geom_bar(stat = 'identity') + scale_fill_viridis(direction= -1)+
facet_grid(Q1~.)+
geom_text(aes(label = sprintf("%.2f%%", pct)), hjust = 0.01,vjust = 0.05, size =5)+ 
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
theme1+ xlab("") + ylab("Percent")+
ggtitle("Employer uses ML or not")

**11(Part 1 to Part 7)- What is her day to day role? "Analysis is her function & passion"**

Around **29.19%** of the ladies conduct analysis to **understand data** and influence **business decisions**, which is around **4% higher** than their male counterparts. The rest of the results are more or less the same.

In [None]:
#I gathered all the columns related to Q11 from part 1 to part 7 and aggregated the values.

multichoice %>% 
select(Q1,Q11_Part_1,Q11_Part_2, Q11_Part_3,Q11_Part_4,Q11_Part_5,Q11_Part_6,Q11_Part_7)%>%
filter(Q1 == "Female"|Q1=="Male")%>%
gather(2:8, key = "questions", value = "Function")%>%
group_by(Q1,Function)%>%
filter(!is.na(Function))%>%
summarise(Count = length(Function))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Function, -percent), y = percent, fill = Function)) + 
geom_bar(stat = 'identity') + scale_fill_brewer(palette="Set1",  na.value = "gray")+
facet_grid(Q1~.)+
geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.01, size =5)+ 
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
theme1+ xlab("") + ylab("Percent")+
ggtitle("Day to Day function?")
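
Because Q11 is spread over several part columns and a respondent can fill more than one of them, the percentages in the plot above are shares of all selected options within each gender, so they sum to 100 across options. If the share of respondents picking each option is wanted instead, the denominator changes. A minimal base-R sketch on made-up answers (hypothetical column contents, not the survey data) contrasts the two:

```r
# Made-up multi-select answers (hypothetical, not the survey data).
toy <- data.frame(
  Q1 = c("Female", "Female", "Male"),
  Q11_Part_1 = c("Analyze data", NA, "Analyze data"),
  Q11_Part_2 = c("Build ML service", "Build ML service", NA),
  stringsAsFactors = FALSE
)

part_cols <- grep("^Q11_Part_", names(toy), value = TRUE)
female <- toy[toy$Q1 == "Female", part_cols]

# Number of female respondents who ticked each option.
picks <- colSums(!is.na(female))

# Denominator 1: all ticks made by women (what the plot above uses).
share_of_selections <- picks / sum(picks) * 100

# Denominator 2: number of women (an alternative reading).
share_of_respondents <- picks / nrow(female) * 100

round(rbind(share_of_selections, share_of_respondents), 1)
```

The second version can add up to more than 100 across options, since one person may tick several of them.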

**12- Which tool does she like to use to analyze data? "No points for guessing"**

**45%** of the **females** use local or hosted development environments such as **RStudio and JupyterLab**. 22.64% of females use basic statistical software such as Microsoft Excel and Google Sheets. I will compare these percentages with their male counterparts as well in the second plot under this section.

The overall trend is the same for the male counterparts, except that **females** use **Business Intelligence** tools slightly **more** (6.78%) than their male counterparts (5.75%). The use of statistical software such as **SPSS and SAS** is higher among **females** by **more** than **2%**. Also, the use of **cloud**-based software is **2% lower** among **females** than among their male counterparts.

In [None]:
multichoice %>% 
group_by(Q1, Q12_MULTIPLE_CHOICE)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(!is.na(Q12_MULTIPLE_CHOICE))%>%
summarise(Count = length(Q12_MULTIPLE_CHOICE))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q12_MULTIPLE_CHOICE, -pct), y = pct, fill = Q12_MULTIPLE_CHOICE)) + 
   geom_bar(stat = 'identity') + scale_fill_brewer(palette="Set1")+
    facet_grid(Q1~.)+
    geom_text(aes(label = sprintf("%.2f%%", pct)), hjust = 0.01,vjust = 0.01, size =5)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
         theme1+ xlab("") + ylab("Percent")+
              ggtitle("Choice of Tools for Analysis")

**13(Part-1 to Part-15)- IDE's used at school or college? "They are from Venus, but they like Jupyter a lot"**

We found that **most** of them have used **Jupyter/IPython** in their school or college, **followed by RStudio**.

In [None]:
multichoice %>% 
select(Q1,30:45)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:16, key = "questions", value = "Function")%>%
group_by(Q1,Function)%>%
filter(!is.na(Function))%>%
summarise(Count = length(Function))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Function, -percent), y = percent, fill = percent)) + 
   geom_bar(stat = 'identity') + scale_fill_viridis(direction = -1)+
    facet_grid(Q1~.)+
    geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.01, size =5)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("IDE's used at school or work")
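
The cell above picks the Q13 columns by position (`30:45`), which depends on the exact column order of the file. As a hedged alternative, assuming the option columns follow the `Q13_Part_N` naming pattern seen for other questions in this notebook (for example `Q11_Part_1`), they can be selected by name instead. The sketch uses base R on a made-up data frame; the column `Q13_OTHER_TEXT` and all values are hypothetical.

```r
# Made-up slice of the survey (hypothetical columns and values).
toy <- data.frame(
  Q1 = c("Female", "Male"),
  Q13_Part_1 = c("Jupyter/IPython", NA),
  Q13_Part_2 = c(NA, "RStudio"),
  Q13_OTHER_TEXT = c("-1", "-1"),
  stringsAsFactors = FALSE
)

# Keep only the numbered option columns, whatever their position is.
part_cols <- grep("^Q13_Part_[0-9]+$", names(toy), value = TRUE)
subset_q13 <- toy[, c("Q1", part_cols)]

names(subset_q13)
```

In the dplyr idiom used throughout this notebook, `select(Q1, starts_with("Q13_Part_"))` expresses the same idea.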

**14(Part 1- Part 11)- Hosted Notebooks used in the last 5 years?** 

Around 37% of the females have not used a hosted notebook in the last 5 years. **Kaggle kernels** are the most preferred hosted notebooks. 

In [None]:
multichoice %>% 
select(Q1,46:57)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:11, key = "questions", value = "Function")%>%
group_by(Q1,Function)%>%
filter(!is.na(Function))%>%
summarise(Count = length(Function))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Function, -percent), y = percent, fill = percent)) + 
   geom_bar(stat = 'identity') + scale_fill_viridis(direction = -1)+
    facet_grid(Q1~.)+
    geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.01, size =5)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("Hosted Notebooks used at school or work")

**15(Part 1- Part 7) Cloud services used in the last 5 years: "No they are not on cloud9"**

**41.77%** of **female** Kagglers have not used any cloud-based services, which is around **10%** **more** than their **male** counterparts. AWS comes next in rank.

In [None]:
multichoice %>% 
select(Q1,59:65)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:7, key = "questions", value = "Function")%>%
group_by(Q1,Function)%>%
filter(!is.na(Function))%>%
summarise(Count = length(Function))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Function, -percent), y = percent, fill = percent)) + 
   geom_bar(stat = 'identity') + scale_fill_viridis(direction = -1)+
    facet_grid(Q1~.)+
    geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.01, size =5)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("Cloud services used in the last 5 years")

**16(Part 1- 18) Programming languages used on a regular basis: "They are regulaR with R"**

Python tops the list of programming languages used on a regular basis for both genders. The difference lies in **SQL, R and MATLAB**, where **female Kagglers exceed** their male counterparts in percent (by between 2% and 4.5%).

I feel confident now...

In [None]:
options(repr.plot.width=10, repr.plot.height=6)


theme5 <- theme_bw()+theme(text = element_text(size=15),
                           axis.text.x = element_text(angle = 60, hjust = 1))+
theme(legend.position = 'top', plot.title = element_text(hjust = 0.5))

multichoice %>% 
select(Q1,66:84)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:18, key = "questions", value = "Programming_Language")%>%
group_by(Q1,Programming_Language)%>%
filter(!is.na(Programming_Language))%>%
filter(!is.na(Q1))%>%
summarise(Count = length(Programming_Language))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Programming_Language,-percent), y = percent, fill = Q1)) + 
geom_col( position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
   # geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Programming_Language used on a Regular Basis")

**17-Specific Programming language used most often:**

The programming language used most often is Python for both genders but, again, more female Kagglers prefer using R when compared to male Kagglers.

In [None]:
multichoice %>% 
select(Q1,Q17)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
#gather(2:18, key = "questions", value = "Programming_Language")%>%
group_by(Q1,Q17)%>%
filter(!is.na(Q17))%>%
#filter(!is.na(Q1))%>%
summarise(Count = length(Q17))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q17,-percent), y = percent, fill = Q1)) + 
geom_col( position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
    #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Specific Programming language used most often")

**18- Recommended Programming Language:**

Ladies are more likely to recommend R and SQL when compared to their male counterparts. The overall trend, however, remains in favour of Python as the recommended programming language.

In [None]:
multichoice %>% 
select(Q1,Q18)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
#gather(2:18, key = "questions", value = "Programming_Language")%>%
group_by(Q1,Q18)%>%
filter(!is.na(Q18))%>%
#filter(!is.na(Q1))%>%
summarise(Count = length(Q18))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q18,-percent), y = percent, fill = Q1)) + 
geom_col( position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
    #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Recommended Programming language")

**Profession-Programming Language-Gender Comparison: "They mean it when they say, R is a statistician's language"**

The plot below compares the profession of respondents and their language preference. It can be seen that **Statisticians** are more likely to recommend **R over Python** as the language to learn as a Data Scientist. I have always read that R is a statistician's language, and this data doesn't deny it either.

After Statisticians, **Marketing Analysts** are also more likely to **recommend R**.

In [None]:
multichoice %>% 
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(Q18 == "Python"|Q18 == "R")%>%
group_by(Q1,Q6,Q18)%>%
filter(!is.na(Q18))%>%
summarise(Count = length(Q18))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x =Q18 , y = Q6, fill = percent)) + 
   geom_tile() + scale_fill_viridis(direction= -1)+
    facet_wrap(Q1~.)+
     theme2.1+  xlab("") + ylab(" ")+
       ggtitle("Profession-Programming Language-Gender Comparison")

In [None]:
multichoice %>% 
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(Q18 == "Python"|Q18 == "R")%>%
group_by(Q1,Q5,Q18)%>%
filter(!is.na(Q18))%>%
filter(!is.na(Q5))%>%
summarise(Count = length(Q18))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x =Q18 , y = Q5, fill = percent)) + 
   geom_tile() + scale_fill_viridis(direction= -1)+
    facet_wrap(Q1~.)+
    scale_y_discrete(labels = function(y) str_wrap(y, width = 40))+
     theme2.1+  xlab("") + ylab(" ")+
       ggtitle("Undergraduate Major-Programming Language-Gender Comparison")

**19(Part 1- Part 19) & 20: ML Framework used in last 5 years and most often:**

**Scikit-Learn** remains the most popular in both categories, followed by TensorFlow. **Ladies** have a knack for **randomForest and Caret** as well. A higher percentage of ladies than male Kagglers report using Scikit-Learn, randomForest and Caret, both over the last 5 years and most often.

Apart from analyzing the difference between male and female Kagglers in this category, I also found that there are many interesting ML frameworks, such as **H2O, Prophet and Fastai**, that can be explored by Kagglers.

In [None]:
multichoice %>% 
select(Q1,Q19_Part_1:Q19_Part_19)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:19, key = "questions", value = "ML")%>%
group_by(Q1,ML)%>%
filter(!is.na(ML))%>%
filter(!is.na(Q1))%>%
summarise(Count = length(ML))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(ML,-percent), y = percent, fill = Q1)) + 
geom_col( position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
    #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("ML framework used in last 5 years")

In [None]:
multichoice %>% 
select(Q1,Q20)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q20)%>%
filter(!is.na(Q20))%>%
#filter(!is.na(Q1))%>%
summarise(Count = length(Q20))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q20,-percent), y = percent, fill = Q1)) + 
geom_col( position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
    #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("ML framework used most often")

**21- (Part 1- Part 13) Data Visualization libraries or tools used in the past 5 years: "Her Plots are Pretty as well"**

**22-Data Visualization libraries or tools used most often:**

Awesome!!! **Matplotlib** is used by both categories, but **male Kagglers** tend to use it more. On the other hand, **female** Kagglers use **ggplot** more often. In fact, when it comes to **R-based visualization** tools such as ggplot, shiny, lattice and plotly (available in both R and Python), **females** are more likely to report using them. Not to mention that R is popular for making data look pretty when compared to Python, although Python is catching up fast. This gels with our findings above, where we saw that **female Kagglers** are more likely to **recommend and use R** than male Kagglers.

In [None]:
multichoice %>% 
select(Q1,Q21_Part_1:Q21_Part_13)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:13, key = "questions", value = "viz")%>%
group_by(Q1,viz)%>%
filter(!is.na(viz))%>%
filter(!is.na(Q1))%>%
filter(viz != "None")%>%
summarise(Count = length(viz))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(viz,-percent), y = percent, fill = Q1)) + 
geom_col( position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
    #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Visualization Libraries used in last 5 years")

In [None]:
multichoice %>% 
select(Q1,Q22)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q22)%>%
filter(!is.na(Q22))%>%
summarise(Count = length(Q22))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q22,-percent), y = percent, fill = Q1)) + 
geom_col( position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
     scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Visualization Libraries used most often")

**23-Percent of your time at work or school is spent actively coding**

When it comes to time spent coding, Men and Women Kagglers are more or less alike. Most of them spend around **50% to 74%** of their **time coding** at school or work. But we should also notice the Kagglers who do not get a chance to code at school or work and are still motivated to join Kaggle and participate actively (at least in a survey); among them, the share of females is higher.

In [None]:
multichoice$Q23 <- factor(multichoice$Q23, levels = c("0% of my time",
                                                      "1% to 25% of my time",
                                                      "25% to 49% of my time",
                                                      "50% to 74% of my time",
                                                      "75% to 99% of my time",
                                                      "100% of my time"))


multichoice %>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q23)%>%
filter(!is.na(Q23))%>%
summarise(Count = length(Q23))%>%
mutate(Percent = prop.table(Count)*100)%>%
ggplot(aes(Q23, Percent, fill = Q1))+
geom_col(position ="dodge")+ 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
theme5+ xlab("") + ylab("Percent")+
ggtitle("Time spent actively coding at work or school")
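
Setting the factor levels explicitly is what keeps the x-axis in its natural order; left as plain text, the categories would be sorted alphabetically. A small sketch with hypothetical answers (the `answers` vector is made up for illustration):

```r
# Hypothetical answers, deliberately out of order.
answers <- c("50% to 74% of my time", "0% of my time", "100% of my time",
             "25% to 49% of my time", "1% to 25% of my time")

time_levels <- c("0% of my time", "1% to 25% of my time",
                 "25% to 49% of my time", "50% to 74% of my time",
                 "75% to 99% of my time", "100% of my time")

# Plain text sorts character by character, so "100%" lands before "25%".
alphabetical <- sort(unique(answers))

# With explicit levels, sorting follows the intended scale instead.
ordered_answers <- factor(answers, levels = time_levels)
by_scale <- as.character(sort(ordered_answers))

print(alphabetical)
print(by_scale)
```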

**24- Experience in coding to analyze data: "They want to learn, Encourage them"**

Most of the **Wogglers** have **less than 1 year** of experience, and around 6% of Wogglers have never written code but want to learn. That's awesome. The **percent** of Wogglers **decreases** as the **years** of **coding** experience **increase**.

In [None]:
theme6 <- theme_bw()+theme(text = element_text(size=15),
                           axis.text.x = element_text(angle = 0, hjust = 1))+
theme(legend.position = 'top', plot.title = element_text(hjust = 0.5))


multichoice$Q24 <- factor(multichoice$Q24,
                          levels = c("I have never written code and I do not want to learn",
                                     "I have never written code but I want to learn",
                                     "< 1 year","1-2 years","3-5 years","5-10 years",
                                     "10-20 years","20-30 years","30-40 years", "40+ years"))




multichoice %>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q24)%>%
filter(!is.na(Q24))%>%
summarise(Count = length(Q24))%>%
mutate(Percent = prop.table(Count)*100)%>%
ggplot(aes(Q24, Percent, fill = Q1))+
geom_col(position ="fill")+ 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 8))+
theme6+ xlab("") + ylab("Percent")+
ggtitle("Experience in coding to analyze data")
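
Note that `position = "fill"` stacks the two bars at each experience level and rescales the stack to 1, so the chart shows each gender's relative share of the within-gender percentages rather than the percentages themselves. The numbers it draws can be reproduced by hand; here is a sketch with made-up percentages (the `pct` values are hypothetical):

```r
library(dplyr)

# Hypothetical within-gender percentages for two experience levels.
pct <- tibble(
  Q1      = c("Female", "Male", "Female", "Male"),
  Q24     = c("< 1 year", "< 1 year", "10-20 years", "10-20 years"),
  Percent = c(30, 20, 4, 6)
)

# What position = "fill" plots: each bar's share of the stack at that x value.
fill_share <- pct %>%
  group_by(Q24) %>%
  mutate(Share = Percent / sum(Percent)) %>%
  ungroup()

print(fill_share)
```

In this toy table the female share falls from 0.6 to 0.4 between the two levels, which is the kind of decline the stacked chart is meant to show. The y-axis of such a chart is therefore a proportion between 0 and 1, not a percent.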

**Experience in coding for analysis-Current Role-Gender Comparison**

When comparing current role against experience in coding for analysis, I found that Kagglers who are sales professionals, managers or business analysts have mostly never coded for analysis but are willing to learn. I also found that a few Data Scientists (in both categories) and Data Analysts (only in the Male category) have never written code and do not want to learn either.

In [None]:
options(repr.plot.width=12, repr.plot.height=8)

theme7 <- theme_bw()+theme(text = element_text(size=15),
                           axis.text.x = element_text(angle = 60, hjust = 1))+
theme(legend.position = 'none', plot.title = element_text(hjust = 0.5))




multichoice %>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q6, Q24)%>%
filter(!is.na(Q24))%>%
filter(!is.na(Q6))%>%
summarise(Count = length(Q24))%>%
mutate(Percent = prop.table(Count)*100)%>%
ggplot(aes(x =Q24 , y = Q6, fill = Percent)) + 
   geom_tile() + scale_fill_viridis(direction= -1)+
    facet_wrap(Q1~.)+
    scale_x_discrete(labels = function(x) str_wrap(x, width = 30))+    
     theme7+  xlab("") + ylab(" ")+
       ggtitle("Experience in coding for analysis-Current Role-Gender Comparison")
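
In the heatmap code, `summarise()` drops the last grouping variable (`Q24`), so the following `mutate()` computes `Percent` within each gender and role: every row of tiles adds up to 100. A sketch on hypothetical rows that checks this (the `toy` data are made up):

```r
library(dplyr)

# Hypothetical respondents: gender, role, coding experience.
toy <- tibble(
  Q1  = c("Female", "Female", "Female", "Male", "Male", "Male", "Male"),
  Q6  = c("Student", "Student", "Manager", "Student", "Student", "Student", "Manager"),
  Q24 = c("< 1 year", "1-2 years", "< 1 year", "< 1 year", "< 1 year", "1-2 years", "1-2 years")
)

tiles <- toy %>%
  group_by(Q1, Q6, Q24) %>%
  summarise(Count = n(), .groups = "drop_last") %>%  # still grouped by Q1, Q6
  mutate(Percent = Count / sum(Count) * 100) %>%
  ungroup()

# Each gender-role combination should add up to 100.
row_totals <- tiles %>%
  group_by(Q1, Q6) %>%
  summarise(Total = sum(Percent), .groups = "drop")

print(tiles)
print(row_totals)
```

Read this way, a dark tile means "a large share of this role, within this gender, gave this answer", so roles with very few respondents can still produce strongly coloured tiles.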

**25-Machine Learning method used in school or work:They have just started**

Most of the Kagglers, be it Females or Males, have only recently started using Machine Learning: their experience is mostly **< 1 year or 1-2 years**. The **gap** between **1-2 years and 2-3 years** is around **15 percentage points** (in the case of Female Kagglers), and this gap keeps **increasing** as we move towards higher experience levels.

In [None]:
multichoice$Q25 <- factor(multichoice$Q25,
            levels = c("I have never studied machine learning and I do not plan to",
                       "I have never studied machine learning but plan to learn in the future",
                       "< 1 year","1-2 years","2-3 years","3-4 years",
                       "4-5 years","5-10 years","10-15 years"))


multichoice %>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q25)%>%
filter(!is.na(Q25))%>%
summarise(Count = length(Q25))%>%
mutate(Percent = prop.table(Count)*100)%>%
ggplot(aes(Q25, Percent, fill = Q1))+
geom_col(position ="dodge")+ 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 8))+
theme6+ xlab("") + ylab("Percent")+
ggtitle("ML Method used in School or Work")
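
The gap quoted above is a difference between two bars, that is, a number of percentage points. One way to read such gaps off the summarised table rather than off the chart is `lag()`; the percentages below are hypothetical placeholders, not the survey values:

```r
library(dplyr)

# Hypothetical within-gender percentages, already in level order.
ml_pct <- tibble(
  Q1      = "Female",
  Q25     = c("< 1 year", "1-2 years", "2-3 years", "3-4 years"),
  Percent = c(35, 25, 10, 5)
)

gaps <- ml_pct %>%
  group_by(Q1) %>%
  mutate(Drop = lag(Percent) - Percent) %>%  # percentage points lost vs. previous level
  ungroup()

print(gaps)
```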

**26- Consider themselves a Data Scientist:"Yes most of them do"**

Most of the Kagglers do consider themselves Data Scientists, and Wogglers are also quite confident on this point.

In [None]:
multichoice$Q26 <- factor(multichoice$Q26, levels = c("Definitely not", "Probably not",
                                                      "Maybe","Probably yes","Definitely yes"))



multichoice %>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q26)%>%
filter(!is.na(Q26))%>%
summarise(Count = length(Q26))%>%
mutate(Percent = prop.table(Count)*100)%>%
ggplot(aes(Q26, Percent, fill = Q1))+
geom_col(position ="dodge")+ 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 8))+
theme6+ xlab("") + ylab("Percent")+
ggtitle("Consider themselves a Data Scientist?")

**Current Industry-Consider Data Scientist-Gender Comparison**



In [None]:
multichoice %>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q7, Q26)%>%
filter(!is.na(Q26))%>%
filter(!is.na(Q7))%>%
summarise(Count = length(Q26))%>%
mutate(Percent = prop.table(Count)*100)%>%
ggplot(aes(x =Q26 , y = Q7, fill = Percent)) + 
   geom_tile() + scale_fill_viridis(direction= -1)+
    facet_wrap(Q1~.)+
    scale_x_discrete(labels = function(x) str_wrap(x, width = 30))+    
     theme7+  xlab("") + ylab(" ")+
       ggtitle("Current Industry-Consider Data Scientist-Gender Comparison")

**Undergrad Major-Consider Data Scientist-Gender Comparison**

A higher percentage of Kagglers with a Physics and Astronomy background consider themselves Data Scientists, followed by those with a Mathematics and Statistics background.

In [None]:
multichoice %>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q5, Q26)%>%
filter(!is.na(Q26))%>%
filter(!is.na(Q5))%>%
summarise(Count = length(Q26))%>%
mutate(Percent = prop.table(Count)*100)%>%
ggplot(aes(x =Q26 , y = Q5, fill = Percent)) + 
   geom_tile() + scale_fill_viridis(direction= -1)+
    facet_wrap(Q1~.)+
    scale_y_discrete(labels = function(y) str_wrap(y, width = 30))+    
     theme7+  xlab("") + ylab(" ")+
       ggtitle("Undergrad Major-Consider Data Scientist-Gender Comparison")

**Yearly Compensation-Consider Data Scientist-Gender Comparison:**

Is there a slight trend in this section? It is hard to say for now, but we do see that when the salary range is high, the confidence to consider oneself a Data Scientist is high too.

In [None]:
multichoice %>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q9, Q26)%>%
filter(!is.na(Q26))%>%
filter(!is.na(Q9))%>%
summarise(Count = length(Q26))%>%
mutate(Percent = prop.table(Count)*100)%>%
ggplot(aes(x =Q26 , y = Q9, fill = Percent)) + 
   geom_tile() + scale_fill_viridis(direction= -1)+
    facet_wrap(Q1~.)+
    scale_y_discrete(labels = function(y) str_wrap(y, width = 30))+    
     theme7+  xlab("") + ylab(" ")+
       ggtitle("Yearly Compensation-Consider Data Scientist-Gender Comparison")

**27(Part-1 to Part 20) Cloud Computing products used at work or school in the last 5 years:**

The percent of ladies who have used Cloud Computing products in the last 5 years is lower for almost all products. There are a few exceptions: for all of the IBM cloud products (4 out of 4 listed) and some Azure cloud products (3 out of 6 listed), the percent of ladies is slightly higher.

In [None]:
multichoice %>% 
select(Q1,Q27_Part_1:Q27_Part_20)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:21, key = "questions", value = "cloud")%>%
group_by(Q1,cloud)%>%
filter(!is.na(cloud))%>%
#filter(!is.na(Q1))%>%
summarise(Count = length(cloud))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(cloud,-percent), y = percent, fill = Q1)) + 
geom_col( position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
    #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 20))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Cloud Computing products used at work or school in the last 5 years ")

**28(Part-1 to Part-43)- ML_Products used at work or school in the last 5 years:"They are called SASsy for some reason"**

Well, a **higher percent of female** respondents have used **SAS** ML products in the last **five years** when compared to their Male counterparts. The story is similar for **Cloudera** products.

In [None]:
options(repr.plot.width=12, repr.plot.height=12)

multichoice %>% 
select(Q1,Q28_Part_1:Q28_Part_43)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:44, key = "questions", value = "ML_Products")%>%
group_by(Q1,ML_Products)%>%
filter(!is.na(ML_Products))%>%
#filter(!is.na(Q1))%>%
summarise(Count = length(ML_Products))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(ML_Products,percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 40))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("ML_Products used at work or school in the last 5 years ")

**29(Part-1 to Part-28) Relational Database Products used at work or school in the last 5 years:**

**MySQL** tops the list, and the percent of ladies who have used MySQL in the last 5 years is slightly **higher** than that of their male counterparts. Similarly, the percent of ladies using **Oracle Database** and **Microsoft Access** is a little higher than that of Male respondents.

In [None]:
multichoice %>% 
select(Q1,Q29_Part_1:Q29_Part_28)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:29, key = "questions", value = "RDB_Products")%>%
group_by(Q1,RDB_Products)%>%
filter(!is.na(RDB_Products))%>%
#filter(!is.na(Q1))%>%
summarise(Count = length(RDB_Products))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(RDB_Products,percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 40))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Relational Database Products used at work or school in the last 5 years ")

**30(Part-1 to Part-25)-Big data and Analytics products used at work or school in the last 5 years**

Around 40% of the respondents have not used any Big Data and Analytics products in the last 5 years. Among those who have, the top 5 products are Google BigQuery, AWS Redshift, Databricks, Teradata and AWS EMR.

In [None]:
multichoice %>% 
select(Q1,Q30_Part_1:Q30_Part_25)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:26, key = "questions", value = "BigData_Products")%>%
group_by(Q1,BigData_Products)%>%
filter(!is.na(BigData_Products))%>%
#filter(!is.na(Q1))%>%
summarise(Count = length(BigData_Products))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(BigData_Products,percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 40))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Big data and Analytics products used at work or school in the last 5 years ")

**31(Part-1 to Part-12) & 32- Data Types used most often:**

The **Female respondents** have used more traditional data types (I mean the usual **numeric, categorical and text data**), while the percent of **Male respondents** is **higher** for **tabular, image, sensor, audio, video data** etc.

In [None]:
multichoice %>% 
select(Q1,Q31_Part_1:Q31_Part_12)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:13, key = "questions", value = "DataType")%>%
group_by(Q1,DataType)%>%
filter(!is.na(DataType))%>%
summarise(Count = length(DataType))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(DataType,-percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+#coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 40))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Data Types used most at work or school")

In [None]:
multichoice %>% 
select(Q1,Q32)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q32)%>%
filter(!is.na(Q32))%>%
summarise(Count = length(Q32))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q32,-percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+#coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 40))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Data Types used most at work or school")

**33(Part-1 to Part-11) Data Source used for getting Public Data:**

Female Respondents are **more likely** to get data from **data aggregator platforms, government websites or university research group websites** than from other sources such as Google Search or web scraping.

In [None]:
multichoice %>% 
select(Q1,Q33_Part_1:Q33_Part_11)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:12, key = "questions", value = "DataSource")%>%
group_by(Q1,DataSource)%>%
filter(!is.na(DataSource))%>%
summarise(Count = length(DataSource))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(DataSource,-percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+#coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 30))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Data Source used for Public Data")

**34(Part-1 to Part-6) Percent of time spent on Gathering data, Cleaning data, Model Building, Model Deployment, Communicating with stakeholders**

Here I look at the median value for each activity; there are several outliers, though. When we compare the median percent of time spent on each activity for Female and Male respondents, we find that around **10%** of the time is spent on **gathering data**, **20%** on **cleaning data**, **10%** on **model building**, **20%** on **model deployment**, **5%** on **finding insights and communicating** with the stakeholders and **10%** on **other** tasks.

In [None]:
numcols <- c("Q34_Part_1","Q34_Part_2","Q34_Part_3","Q34_Part_4","Q34_Part_5","Q34_Part_6")

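# Convert the Q34 answers to numeric; lapply() returns a list of columns,
# which assigns cleanly into a data frame or tibble
multichoice[numcols] <- lapply(multichoice[numcols], as.numeric)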


a <- multichoice %>% 
select(Q1,Q34_Part_1)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(Q34_Part_1 < 50)%>%
ggplot(aes("",Q34_Part_1, fill = Q1))+geom_boxplot()+theme5+
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
xlab("") + ylab("Percent")+ ggtitle("Gathering Data")

b <- multichoice %>% 
select(Q1,Q34_Part_2)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(Q34_Part_2 < 60)%>%
ggplot(aes("", Q34_Part_2, fill = Q1))+geom_boxplot()+theme5+
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
xlab("") + ylab("Percent")+ggtitle("Cleaning Data")

c <- multichoice %>% 
select(Q1,Q34_Part_3)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(Q34_Part_3 < 50)%>%
ggplot(aes("", Q34_Part_3, fill = Q1))+geom_boxplot()+theme5+
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
xlab("") + ylab("Percent")+ggtitle("Model Building")

d <- multichoice %>% 
select(Q1,Q34_Part_4)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(Q34_Part_4 < 65)%>%
ggplot(aes("", Q34_Part_4, fill = Q1))+geom_boxplot()+theme5+
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
xlab("") + ylab("Percent")+ggtitle("Putting Model into Production")

e <- multichoice %>% 
select(Q1,Q34_Part_5)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(Q34_Part_5 < 25)%>%
ggplot(aes("", Q34_Part_5, fill = Q1))+geom_boxplot()+theme5+
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
xlab("") + ylab("Percent")+ggtitle("Finding insights & Communicating\nwith stakeholders")

f <- multichoice %>% 
select(Q1,Q34_Part_6)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(Q34_Part_6 < 50)%>%
ggplot(aes("", Q34_Part_6, fill = Q1))+geom_boxplot()+theme5+
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
xlab("") + ylab("Percent")+ggtitle("Other tasks")

plot_grid(a,b,c,d,e,f, hjust = 0, vjust = 1)
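
The boxplots above drop answers at or above a per-activity cutoff before plotting, which also shifts the medians they display. If the goal is only to zoom in, `coord_cartesian()` limits the view without discarding data, and the medians themselves can be tabulated directly. A sketch on hypothetical numbers (the `toy` values and the cutoff of 50 are illustrative only):

```r
library(dplyr)
library(tidyr)

# Hypothetical answers for two activities, including one extreme value.
toy <- tibble(
  Q1         = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Q34_Part_1 = c(10, 20, 90, 10, 15, 20),
  Q34_Part_2 = c(20, 25, 30, 15, 20, 40)
)

medians <- toy %>%
  pivot_longer(starts_with("Q34_Part_"),
               names_to = "activity", values_to = "pct") %>%
  group_by(Q1, activity) %>%
  summarise(median_all     = median(pct, na.rm = TRUE),           # every answer
            median_trimmed = median(pct[pct < 50], na.rm = TRUE), # after the cutoff
            .groups = "drop")

print(medians)
```

In the plotting code, replacing a step such as `filter(Q34_Part_1 < 50)` with `+ coord_cartesian(ylim = c(0, 50))` would keep the box statistics based on all answers while still hiding the extreme points.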
          

**35(Part-1 to Part-6)Percent of machine learning/data science training falls under Self Study, Online Courses, Work, University, Kaggle Competitions and Other:**

While the **median percent** of data science training that falls under Self Study, Online Courses and Work is **slightly higher** for **Male** respondents than for Female respondents, when it comes to **learning at University, Female** respondents have a **higher median percent** than Male respondents. Maybe this is because a lot of Female Kagglers are studying at university.

In [None]:
numcols1 <- c("Q35_Part_1","Q35_Part_2","Q35_Part_3","Q35_Part_4","Q35_Part_5","Q35_Part_6")

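# Convert the Q35 answers to numeric; lapply() returns a list of columns,
# which assigns cleanly into a data frame or tibble
multichoice[numcols1] <- lapply(multichoice[numcols1], as.numeric)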


a <- multichoice %>% 
select(Q1,Q35_Part_1)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
ggplot(aes("",Q35_Part_1, fill = Q1))+geom_boxplot()+theme5+
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
xlab("") + ylab("Percent")+ ggtitle("Self-taught")

b <- multichoice %>% 
select(Q1,Q35_Part_2)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
ggplot(aes("", Q35_Part_2, fill = Q1))+geom_boxplot()+theme5+
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
xlab("") + ylab("Percent")+ggtitle(" Online courses\n(Coursera, Udemy, edX, etc.)")

c <- multichoice %>% 
select(Q1,Q35_Part_3)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
#filter(Q35_Part_3 < 75)%>%
ggplot(aes("", Q35_Part_3, fill = Q1))+geom_boxplot()+theme5+
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
xlab("") + ylab("Percent")+ggtitle("Work")

d <- multichoice %>% 
select(Q1,Q35_Part_4)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
ggplot(aes("", Q35_Part_4, fill = Q1))+geom_boxplot()+theme5+
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
xlab("") + ylab("Percent")+ggtitle("University")

e <- multichoice %>% 
select(Q1,Q35_Part_5)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
#filter(Q35_Part_5 < 1)%>%
ggplot(aes("", Q35_Part_5, fill = Q1))+geom_boxplot()+theme5+
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
xlab("") + ylab("Percent")+ggtitle(" Kaggle competitions")

f <- multichoice %>% 
select(Q1,Q35_Part_6)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(Q35_Part_6 < 5)%>%
ggplot(aes("", Q35_Part_6, fill = Q1))+geom_boxplot()+theme5+
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
xlab("") + ylab("Percent")+ggtitle("Other")

plot_grid(a,b,c,d,e,f, hjust = 0, vjust = 1)
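
The six panels differ only in the column and the title, so they could also be generated from one helper. A sketch, assuming ggplot2 is installed; `make_box()` is a hypothetical helper name, `theme_bw()` stands in for the notebook's `theme5`, and the data are made up:

```r
library(ggplot2)

# Hypothetical helper: one gender-split boxplot for a given column.
make_box <- function(data, column, title) {
  ggplot(data, aes(x = "", y = .data[[column]], fill = Q1)) +
    geom_boxplot() +
    scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
    theme_bw() +
    xlab("") + ylab("Percent") + ggtitle(title)
}

# Made-up data with two of the Q35 columns.
toy <- data.frame(
  Q1         = rep(c("Female", "Male"), each = 4),
  Q35_Part_1 = c(20, 30, 40, 50, 25, 35, 45, 55),
  Q35_Part_2 = c(10, 20, 30, 40, 15, 25, 35, 45)
)

# Column name -> panel title; Map() builds one plot per pair.
titles <- c(Q35_Part_1 = "Self-taught", Q35_Part_2 = "Online courses")
panels <- Map(function(col, ttl) make_box(toy, col, ttl), names(titles), titles)

# cowplot::plot_grid(plotlist = panels) would then arrange them in a grid.
```

Per-panel differences such as the outlier cutoffs could be passed in as an extra argument, which keeps them visible in one place instead of scattered across six blocks.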

**36(Part-1 to Part-13)Online Platforms used for data science courses:**

**Coursera** is the preferred choice for both Female and Male Kagglers, followed by DataCamp and Udemy. One thing to notice here is that the percent of **Female Kagglers** on **DataCamp** is higher than that of their Male counterparts. Female Kagglers also prefer taking online university courses.

In [None]:
multichoice %>% 
select(Q1,Q36_Part_1:Q36_Part_13)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:14, key = "questions", value = "OnlinePlatform")%>%
group_by(Q1,OnlinePlatform)%>%
filter(!is.na(OnlinePlatform))%>%
summarise(Count = length(OnlinePlatform))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(OnlinePlatform,-percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+#coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 30))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Online Platform used for learning")

**37-Online Platform where most time is spent:**

**Coursera** remains the go-to platform for spending time learning in both categories. DataCamp is popular among Wogglers and Udemy among the Male Kagglers (w.r.t. percentages).

In [None]:
multichoice %>% 
select(Q1,Q37)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q37)%>%
filter(!is.na(Q37))%>%
summarise(Count = length(Q37))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q37,-percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+#coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 40))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Online Platform where most time is spent")

**38-Favorite Media Sources for data science topics:**

**Kaggle forums & Medium blog posts** are the favourites here, with **Female** Kagglers showing a slightly **higher** percentage than Male Kagglers. The same trend can be seen for **KDNuggets** and **Journal Publications** as well. However, the percent of **Male** Kagglers is higher for **ArXiv & Preprints, the Siraj Raval YouTube Channel** and **HackerNews**.

In [None]:
multichoice %>% 
select(Q1,Q38_Part_1:Q38_Part_22)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:23, key = "questions", value = "MediaSources")%>%
group_by(Q1,MediaSources)%>%
filter(!is.na(MediaSources))%>%
summarise(Count = length(MediaSources))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(MediaSources,-percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+#coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 30))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Favorite Media Sources for data science topics")

**39(Part-1)-Online learning platforms as compared to the quality of the education provided by traditional brick and mortar institutions:**

The **ladies** hold **slightly more neutral** views about **online learning platforms** than their Male counterparts. The **percent of ladies** who consider online platforms much **better** than learning in institutions is **lower** than the percent of ladies who think otherwise.

In [None]:
options(repr.plot.width=12, repr.plot.height=8)

multichoice$Q39_Part_1 <- factor(multichoice$Q39_Part_1,
                                 levels = c("Much worse","Slightly worse",
                                            "Neither better nor worse","Slightly better",
                                            "Much better","No opinion; I do not know"))



multichoice %>% 
select(Q1,Q39_Part_1)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q39_Part_1)%>%
filter(!is.na(Q39_Part_1))%>%
summarise(Count = length(Q39_Part_1))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = Q39_Part_1, y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
      scale_x_discrete(labels = function(x) str_wrap(x, width = 30))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("Online MOOCs Vs. Traditional Institution Learnings")

**39(Part-2)- In-person bootcamps as compared to the quality of the education provided by traditional brick and mortar institutions:**

Most of the Kagglers have no opinion on which one is better: in-person bootcamps or learning in institutions.

In [None]:

multichoice$Q39_Part_2 <- factor(multichoice$Q39_Part_2,
                                 levels = c("Much worse","Slightly worse",
                                            "Neither better nor worse","Slightly better",
                                            "Much better","No opinion; I do not know"))


multichoice %>% 
select(Q1,Q39_Part_2)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q39_Part_2)%>%
filter(!is.na(Q39_Part_2))%>%
summarise(Count = length(Q39_Part_2))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = Q39_Part_2, y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
      scale_x_discrete(labels = function(x) str_wrap(x, width = 30))+
         theme5+ xlab("") + ylab("Percent")+
              ggtitle("In-person Bootcamp Vs. Traditional Institution Learnings")

**40- Independent Projects Vs. Academic Achievements:**

Most of the respondents value independent projects on par with academic achievements. The percent of Female Kagglers who think independent projects are much more important than academic achievements is slightly lower than that of Male Kagglers.

In [None]:
multichoice$Q40 <- factor(multichoice$Q40,
        levels = c("Independent projects are much less important than academic achievements",
                   "Independent projects are slightly less important than academic achievements",
                   "Independent projects are equally important as academic achievements",
                   "Independent projects are slightly more important than academic achievements",
                   "Independent projects are much more important than academic achievements",
                   "No opinion; I do not know"))



multichoice %>% 
select(Q1,Q40)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q40)%>%
filter(!is.na(Q40))%>%
summarise(Count = length(Q40))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = Q40, y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
      scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("Academic Achievements Vs. Independent Projects")

**41(Part-1) Importance of Fairness and bias in ML algorithms:**

Most of the ladies understand the importance of fairness and bias in ML algorithms, and the percentage of Females who consider fairness and bias very important is higher than the percentage of Males who do.


In [None]:
multichoice$Q41_Part_1 <- factor(multichoice$Q41_Part_1,
                                 levels = c("Not at all important","Slightly important",
                                            "Very important","No opinion; I do not know"))


multichoice %>% 
select(Q1,Q41_Part_1)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q41_Part_1)%>%
filter(!is.na(Q41_Part_1))%>%
summarise(Count = length(Q41_Part_1))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = Q41_Part_1, y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
      scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("Importance of Fairness and bias in ML algorithms")

**41(Part-2) Ability to explain ML model outputs and/or predictions:**

More than 70% of the Female Kagglers consider the ability to explain ML model outputs and/or predictions very important.

In [None]:
multichoice$Q41_Part_2 <- factor(multichoice$Q41_Part_2,
                                 levels = c("Not at all important","Slightly important",
                                            "Very important","No opinion; I do not know"))


multichoice %>% 
select(Q1,Q41_Part_2)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q41_Part_2)%>%
filter(!is.na(Q41_Part_2))%>%
summarise(Count = length(Q41_Part_2))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = Q41_Part_2, y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
      scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("Ability to explain ML model outputs and/or predictions")

**41(Part-3) Reproducibility in data science:**

Close to 70% of the Kagglers consider reproducibility in data science very important.

In [None]:
multichoice$Q41_Part_3 <- factor(multichoice$Q41_Part_3,
                                 levels = c("Not at all important","Slightly important",
                                            "Very important","No opinion; I do not know"))


multichoice %>% 
select(Q1,Q41_Part_3)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q41_Part_3)%>%
filter(!is.na(Q41_Part_3))%>%
summarise(Count = length(Q41_Part_3))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = Q41_Part_3, y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
      scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("Reproducibility in data science")
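
The three Q41 parts share the same answer scale, so their levels could also be set in a single step. A base R sketch on a hypothetical stand-in for the three columns (the `toy` answers are made up):

```r
importance_levels <- c("Not at all important", "Slightly important",
                       "Very important", "No opinion; I do not know")

# Hypothetical stand-in for the three Q41 columns.
toy <- data.frame(
  Q41_Part_1 = c("Very important", "Slightly important"),
  Q41_Part_2 = c("No opinion; I do not know", "Very important"),
  Q41_Part_3 = c("Not at all important", "Very important"),
  stringsAsFactors = FALSE
)

# Apply the same ordered scale to every Q41 column at once.
q41_cols <- c("Q41_Part_1", "Q41_Part_2", "Q41_Part_3")
toy[q41_cols] <- lapply(toy[q41_cols], factor, levels = importance_levels)

str(toy)
```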

**42(Part-1 to Part-5) Metrics used by Organizations to determine Model's Success:**

For most of the organizations where Kagglers work, **accuracy and business goals** **matter most** when deciding the success of an ML model. The **female** percent is **slightly higher** in organizations that consider **bias as a metric**. Around 19% of Female Kagglers are not associated with any organization that builds ML models.

In [None]:
multichoice %>% 
select(Q1,Q42_Part_1:Q42_Part_5)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:6, key = "questions", value = "Metrics")%>%
group_by(Q1,Metrics)%>%
filter(!is.na(Metrics))%>%
summarise(Count = length(Metrics))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Metrics,-percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+#coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("Metrics used by Organizations to determine Model's Success")

**43- Percent of data projects that involved exploring unfair bias in the dataset and/or algorithm:**

As the metrics chart above already suggested, most organizations explore unfair bias in only a small share of their data projects.

In [None]:
#multichoice$Q43 <- factor(multichoice$Q43, 
                               #  level = c("0-10","20-30","30-40","40-50","50-60",
                                         #  "60-70","70-80","80-90","90-100"))


multichoice %>% 
select(Q1,Q43)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q43)%>%
filter(!is.na(Q43))%>%
summarise(Count = length(Q43))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = Q43, y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
      scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("Exploring unfair bias")
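
The commented-out `factor()` call at the top of this cell hints at fixing the order of the percentage ranges on the x-axis. Rather than typing every level by hand, where a range is easy to miss, the levels could be derived from the labels themselves. Here is a rough sketch in base R, assuming the answers look like `"0"`, `"0-10"`, `"10-20"` and so on. The `bins` vector is made up for illustration and `order_ranges` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: order range labels such as "0-10" by their numeric bounds
order_ranges <- function(x) {
  lvls <- unique(x[!is.na(x)])
  lower <- as.numeric(sub("-.*$", "", lvls))  # text before the dash
  upper <- as.numeric(sub("^.*-", "", lvls))  # text after the dash (or the whole label)
  factor(x, levels = lvls[order(lower, upper)])
}

# Made-up labels in arbitrary order, including a missing answer
bins <- c("20-30", "0-10", "0", "90-100", "10-20", NA)
ordered_bins <- order_ranges(bins)
print(levels(ordered_bins))
```

If used on the survey column, it would be safest to apply it after the `filter(Q1 == "Female"|Q1 == "Male")` step, so that only answer labels (and not the question-text row) reach the helper.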

**44(Part-1 to Part-6) Challenges in ensuring fairness and unbiased algorithm:**

There are many challenges in ensuring that a model is fair. Of these, difficulty in collecting enough data for the unfairly targeted groups tops the list.

In [None]:
multichoice %>% 
select(Q1,Q44_Part_1:Q44_Part_6)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:7, key = "questions", value = "Metrics")%>%
group_by(Q1,Metrics)%>%
filter(!is.na(Metrics))%>%
summarise(Count = length(Metrics))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Metrics,-percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+#coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("Challenges in ensuring fairness and unbiased algorithm")

**45(Part-1 to Part-6) Interpreting Model's insights and predictions:**

Respondents interpret model insights in different circumstances. Four stand out here: when exploring a new model or dataset, when the model is specifically designed to produce insights, when determining whether a model is worth putting into production, and for all models right before they go into production.

In [None]:
multichoice %>% 
select(Q1,Q45_Part_1:Q45_Part_6)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:7, key = "questions", value = "Metrics")%>%
group_by(Q1,Metrics)%>%
filter(!is.na(Metrics))%>%
summarise(Count = length(Metrics))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Metrics,-percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+#coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("Interpreting Model's insights and predictions")

**46- Percent of data projects involved in exploring model insights:**

Around 15-30% of the Wogglers do not spend much time exploring model insights.

In [None]:
multichoice %>% 
select(Q1,Q46)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q46)%>%
filter(!is.na(Q46))%>%
summarise(Count = length(Q46))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = Q46, y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
      scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("Percent of data projects involved in exploring model insights")

**47(Part-1 to Part-15) Preferred methods for explaining and/or interpreting decisions made by ML models:**

Females are more likely than males to examine feature correlations and individual model coefficients.

In [None]:
options(repr.plot.width=12, repr.plot.height=12)

multichoice %>% 
select(Q1,Q47_Part_1:Q47_Part_15)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:16, key = "questions", value = "Metrics")%>%
group_by(Q1,Metrics)%>%
filter(!is.na(Metrics))%>%
summarise(Count = length(Metrics))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Metrics,percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 60))+
       theme5+ xlab("") + ylab("Percent")+
        ggtitle("Preferred methods for explaining and/or interpreting decisions made by ML models")

**48- Consider ML models outputs difficult or impossible to explain:**

Around **45-48%** of kagglers feel that they can explain the output of ML models most of the time. However, around **10-12%** of respondents still **consider** ML models to be **"black boxes"**. The **percent of Wogglers** who **do not have an opinion** on this subject is **higher (around 12-13%)** than that of their male counterparts.

In [None]:
options(repr.plot.width=12, repr.plot.height=8)

multichoice %>% 
select(Q1,Q48)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
group_by(Q1,Q48)%>%
filter(!is.na(Q48))%>%
summarise(Count = length(Q48))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = Q48, y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+
      scale_x_discrete(labels = function(x) str_wrap(x, width = 20))+
         theme2+ xlab("") + ylab("Percent")+
              ggtitle("Consider ML models outputs difficult or impossible to explain")
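
The single-choice cells in this section (Q43, Q46 and Q48) repeat the same filter, count and percent steps before plotting. Those steps could be collected in one small function. The sketch below is only an illustration on a made-up toy table. `percent_by_gender` is a hypothetical helper name, and it uses `count()` in place of `summarise(Count = length(...))`, which gives the same counts.

```r
library(dplyr)

# Hypothetical helper: percent of each answer within Female and Male respondents
percent_by_gender <- function(data, question) {
  data %>%
    filter(Q1 %in% c("Female", "Male"), !is.na({{ question }})) %>%
    count(Q1, {{ question }}, name = "Count") %>%
    group_by(Q1) %>%
    mutate(percent = prop.table(Count) * 100) %>%
    ungroup()
}

# Made-up toy data with one missing answer and one respondent outside the two groups
toy <- tibble(
  Q1  = c("Female", "Female", "Male", "Male", "Male", "Prefer not to say"),
  Q48 = c("Yes", "No", "Yes", "Yes", NA, "Yes")
)

result <- percent_by_gender(toy, Q48)
print(result)
```

With the survey data, `percent_by_gender(multichoice, Q48)` could then be piped into the same `ggplot()` call used in the cell above.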

**49-Tools and Methods used to make work easy to reproduce:**

The percent of females who prefer to make their code well documented and human readable is higher than that of their male counterparts.

In [None]:
multichoice %>% 
select(Q1,Q49_Part_1:Q49_Part_12)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:13, key = "questions", value = "Tools")%>%
group_by(Q1,Tools)%>%
filter(!is.na(Tools))%>%
summarise(Count = length(Tools))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Tools,percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 60))+
       theme5+ xlab("") + ylab("Percent")+
        ggtitle("Tools and Methods used to make work easy to reproduce")

**50-Barriers in making work easier to reuse and reproduce:**

Most of the Kagglers think that sharing their work is too **time consuming** or that there are **not enough incentives** for it, with the percentage of males being slightly higher than that of females. Around **12-14%** of kagglers feel that **none of the reasons** are applicable to them. Requiring too much **technical knowledge** is also a barrier that **prevents** kagglers (a slightly **higher** percent in the case of **females**) from making their work easier to reuse or reproduce.

In [None]:
multichoice %>% 
select(Q1,Q50_Part_1:Q50_Part_8)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:9, key = "questions", value = "barriers")%>%
group_by(Q1,barriers)%>%
filter(!is.na(barriers))%>%
summarise(Count = length(barriers))%>%
mutate(percent =  prop.table(Count)*100)%>%
ggplot(aes(x = reorder(barriers,percent), y = percent, fill = Q1)) + 
geom_col(position = "dodge") + 
scale_fill_manual(values=c("#00AFBB", "#E7B800"))+coord_flip()+
   #geom_text(aes(label = sprintf("%.2f%%", percent)), hjust = 0.5,vjust = 0.1, size =3)+ 
      scale_x_discrete(labels = function(x) str_wrap(x, width = 60))+
       theme5+ xlab("") + ylab("Percent")+
        ggtitle("Barriers in making work easier to reuse and reproduce")

**CONCLUSION:**

After analyzing almost all the questions from the survey and comparing the female kagglers with their male counterparts, it is clear that the overall trend in the answers remains the same for both genders. However, there are areas where we can find some differences, either subtle or stark. Some of these areas are the use of R over Python, experience in coding to analyze data, yearly compensation, educational background, lower participation of females with non-computer science majors, use of ML tools and frameworks, and so on.

While it was fun to analyze this subset of the data and find some interesting insights, the basic question still lingers: "Why is the percent of females so low?" This question is not specific to Kaggle; the trend is seen everywhere, and we all know how to tackle it.

With 26% of female students already present on Kaggle, we can expect them to stay on this platform and actively take part to increase their domain knowledge. The 17% Data Scientists and 10% Data Analysts can actively participate in competitions, discussion forums etc. and can become mentors for the newbies, because ladies have very few role models in this field, and having one definitely inspires those who are just starting off.