This project visualises and analyses the gender gap in the tech industry, especially in the Data Science field. The dataset has only around 19,000 respondents, and this sample is not representative of the general population. Even so, it gives a rough picture of the wider population.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.offline as py
import math



In [2]:
# low_memory=False reads the file in one pass, which avoids the mixed-dtype warning on import
df = pd.read_csv('https://raw.githubusercontent.com/loycatherine/dataset/master/multiple_choice_responses.csv', low_memory=False)
# Drop the first row, which holds the question text rather than a response
df = df.drop(df.index[0])



##Data cleaning

In [3]:
df = df[['Q1','Q2','Q3','Q4','Q5','Q10']]
df.columns = ["age", "gender", "country", "education", "job title", "salary"]

In [4]:
df.dtypes

age          object
gender       object
country      object
education    object
job title    object
salary       object
dtype: object

In [5]:
df.loc[df.salary == '$0-999', 'salary'] = '0-999'
df.loc[df.salary == '> $500,000', 'salary'] = '> 500,000'

In [6]:
df

Unnamed: 0,age,gender,country,education,job title,salary
1,22-24,Male,France,Master’s degree,Software Engineer,"30,000-39,999"
2,40-44,Male,India,Professional degree,Software Engineer,"5,000-7,499"
3,55-59,Female,Germany,Professional degree,,
4,40-44,Male,Australia,Master’s degree,Other,"250,000-299,999"
5,22-24,Male,India,Bachelor’s degree,Other,"4,000-4,999"
...,...,...,...,...,...,...
19713,50-54,Male,Japan,,,
19714,18-21,Male,India,Bachelor’s degree,Other,0-999
19715,35-39,Male,India,Master’s degree,Student,
19716,25-29,Male,India,Master’s degree,Statistician,"1,000-1,999"


In [7]:
job_title = df.groupby("job title")

##Visualise Gender Distribution

In [8]:
df.gender.unique()

array(['Male', 'Female', 'Prefer to self-describe', 'Prefer not to say'],
      dtype=object)

In [9]:
gender = df['gender'].value_counts()
labels = gender.index
percentages = np.array((gender / gender.sum())*100)
color_palette_list = ['#A5C8E4', '#C0ECCC', '#F9F0C1', '#F6A8A6']
trace = go.Pie(labels=labels, 
               hoverinfo='label+percent', 
               values=percentages, 
               textposition='outside',
               marker=dict(colors=color_palette_list),
               rotation=90)

layout = go.Layout(
                    title="Gender Distribution",
                    font=dict(family='Arial', size=12, color='#909090'),
                    legend=dict(x=0.9, y=0.5)
                    )
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Gender Distribution')

As the chart shows, about 82% of the respondents are male and about 16% are female. This suggests that men predominate in the tech industry, at least among the people who answered this survey.
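
The shares read off the pie chart can also be recomputed directly from the raw counts. The following is a minimal sketch, assuming the counts printed by `df['gender'].value_counts()` in the job section of this notebook; they are hard-coded here so the snippet runs on its own.

```python
import pandas as pd

# Counts copied from the value_counts() output printed later in this notebook
gender_counts = pd.Series(
    {
        "Male": 16138,
        "Female": 3212,
        "Prefer not to say": 318,
        "Prefer to self-describe": 49,
    }
)

# Percentage of each category among respondents who answered the gender question
gender_share = (gender_counts / gender_counts.sum() * 100).round(1)
print(gender_share)
```

On the full dataframe the same figures should come from `df['gender'].value_counts(normalize=True) * 100`.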

##Visualise Age Distribution

In [10]:
age = df['age'].value_counts()
age.index

Index(['25-29', '22-24', '30-34', '18-21', '35-39', '40-44', '45-49', '50-54',
       '55-59', '60-69', '70+'],
      dtype='object')

In [11]:
np.array(age)

array([4458, 3610, 3120, 2502, 2087, 1439,  949,  692,  422,  338,  100])

In [12]:
age.index

Index(['25-29', '22-24', '30-34', '18-21', '35-39', '40-44', '45-49', '50-54',
       '55-59', '60-69', '70+'],
      dtype='object')

In [13]:
labels = age.index
percentages = np.array((age / age.sum())*100)
trace = go.Pie(labels=labels, 
               hoverinfo='label+percent', 
               values=percentages, 
               textposition='outside',
               rotation=90)

layout = go.Layout(
                    title="Age Distribution",
                    font=dict(family='Arial', size=12, color='#909090'),
                    legend=dict(x=0.9, y=0.5)
                    )
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Age Distribution')

From the pie chart, most of the respondents are quite young: the three largest groups are 25-29, 22-24, and 30-34. This could be due to the technological boom of recent years, which has encouraged more people to join the tech industry.
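
To put a number on how young the respondents are, the age bands can be put in chronological order and summed cumulatively. This is a sketch only: the counts are hard-coded from the `value_counts()` output above, and `age_order` is a hand-written list, not something the survey file provides.

```python
import pandas as pd

# Counts copied from the age value_counts() output above
age_counts = pd.Series(
    {
        "25-29": 4458, "22-24": 3610, "30-34": 3120, "18-21": 2502,
        "35-39": 2087, "40-44": 1439, "45-49": 949, "50-54": 692,
        "55-59": 422, "60-69": 338, "70+": 100,
    }
)

# value_counts() sorts by frequency, so reorder the bands from youngest to oldest
age_order = ["18-21", "22-24", "25-29", "30-34", "35-39", "40-44",
             "45-49", "50-54", "55-59", "60-69", "70+"]
age_sorted = age_counts.reindex(age_order)

# Cumulative percentage of respondents up to and including each band
cumulative_share = (age_sorted.cumsum() / age_sorted.sum() * 100).round(1)
print(cumulative_share)
```

The value at `"30-34"` is the share of respondents aged under 35.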

##Visualise Education Level Distribution

In [14]:
education = df['education'].value_counts()

colors_bar = ['#111258', '#4C1559','#B1235E', '#F0855F', '#F8C48A' ,'#F8F4A9', '#faf8dc']

trace = go.Bar(
            x=education.index,
            y=np.array(education),
            marker=dict(
            color=colors_bar))

data = [trace]

layout = go.Layout(
    title='Education Distribution',
    font=dict(color='#909090'),
    xaxis=dict(
        title='Education Level',
        titlefont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
        showticklabels=True,
        tickangle=-45,
        tickfont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
),
    yaxis=dict(
        title="No. of respondents with particular education level",
        titlefont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
        showticklabels=True,
        tickangle=0,
        tickfont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        )
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='education distribution')

Most of the respondents have at least a Bachelor's degree, and a Master's degree is the most common qualification.

##Visualise Education level distribution by Gender

In [15]:
education_gender = df.groupby(['gender', 'education']).agg({'education':'count'})
education_gender

Unnamed: 0_level_0,Unnamed: 1_level_0,education
gender,education,Unnamed: 2_level_1
Female,Bachelor’s degree,865
Female,Doctoral degree,521
Female,I prefer not to answer,50
Female,Master’s degree,1496
Female,No formal education past high school,26
Female,Professional degree,103
Female,Some college/university study without earning a bachelor’s degree,87
Male,Bachelor’s degree,5049
Male,Doctoral degree,2186
Male,I prefer not to answer,236


In [63]:
labels = education_gender.loc["Female"].index
percentages = np.array((education_gender.loc["Female"]["education"] / gender.Female.sum())*100)
# colors_bar = ['#111258', '#4C1559','#B1235E', '#F0855F', '#F8C48A' ,'#F8F4A9', '#faf8dc']
trace = go.Pie(labels=labels, 
               hoverinfo='label+percent', 
               values=percentages, 
               textposition='outside',
               rotation=90)

layout = go.Layout(
                    title="Education Distribution for Females",
                    font=dict(family='Arial', size=12, color='#909090'),
                    legend=dict(x=0.9, y=0.5)
                    )
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Education Distribution for Females')

In [62]:
labels = education_gender.loc["Male"].index
percentages = np.array((education_gender.loc["Male"]["education"] / gender.Male.sum())*100)
trace = go.Pie(labels=labels, 
               hoverinfo='label+percent', 
               values=percentages, 
               textposition='outside',
               rotation=90)

layout = go.Layout(
                    title="Education Distribution for Males",
                    font=dict(family='Arial', size=12, color='#909090'),
                    legend=dict(x=0.9, y=0.5)
                    )
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Education Distribution for Males')

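Generally, males and females have broadly similar education profiles. For example, from the counts above, doctoral degrees account for about 16% of female respondents and about 14% of male respondents.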
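
Comparing two pie charts by eye is error-prone, so it can help to put the within-gender percentages side by side in one table. The sketch below uses `pd.crosstab` with `normalize="index"`; the toy dataframe and its shortened education labels are hypothetical stand-ins for the survey columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey's gender and education columns
toy = pd.DataFrame(
    {
        "gender": ["Female"] * 4 + ["Male"] * 6,
        "education": [
            "Master", "Master", "Bachelor", "Doctoral",
            "Master", "Master", "Master", "Bachelor", "Bachelor", "Doctoral",
        ],
    }
)

# normalize="index" divides each row by its own total,
# so every gender's percentages sum to 100 regardless of group size
education_share = pd.crosstab(toy["gender"], toy["education"], normalize="index") * 100
print(education_share.round(1))
```

With the real data the call would be `pd.crosstab(df["gender"], df["education"], normalize="index")`, which makes it easy to compare each degree level across genders row by row.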

##Visualise Job title Distribution

In [17]:
job_title = df['job title'].value_counts()
job_title

Data Scientist             4085
Student                    4014
Software Engineer          2705
Other                      1690
Data Analyst               1598
Research Scientist         1470
Not employed                942
Business Analyst            778
Product/Project Manager     723
Data Engineer               624
Statistician                322
DBA/Database Engineer       156
Name: job title, dtype: int64

In [18]:
colors_bar = ['#9fcc47', '#4ac9e2' , '#e045c1' , '#efcb5d' , '#f4a69f' , '#63c92c' , '#7da6e8' , '#ffc9b2' ,'#f7b7cb', '#510d91', '#b5520c', '#021982']

trace = go.Bar(
            x=job_title.index,
            y=np.array(job_title),
            marker=dict(
            color=colors_bar))

data = [trace]

layout = go.Layout(
    title='Job Title Distribution',
    font=dict(color='#909090'),
    xaxis=dict(
        title='Job Titles',
        titlefont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
        showticklabels=True,
        tickangle=-45,
        tickfont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
  	),
    yaxis=dict(
        title="No. of respondents",
        titlefont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
        showticklabels=True,
        tickangle=0,
        tickfont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        )
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='job title distribution')

Most of the respondents are either Data Scientists or Students. This is expected as the dataset is from Kaggle, where many students and data scientists visit. 

##Visualise Gender distribution for each Job

In [19]:
job_gender = df.groupby(['gender', 'job title']).agg({'gender':'count'})

In [20]:
job_gender

Unnamed: 0_level_0,Unnamed: 1_level_0,gender
gender,job title,Unnamed: 2_level_1
Female,Business Analyst,132
Female,DBA/Database Engineer,17
Female,Data Analyst,347
Female,Data Engineer,76
Female,Data Scientist,606
Female,Not employed,214
Female,Other,258
Female,Product/Project Manager,83
Female,Research Scientist,241
Female,Software Engineer,313


In [21]:
gender = df['gender'].value_counts()
gender

Male                       16138
Female                      3212
Prefer not to say            318
Prefer to self-describe       49
Name: gender, dtype: int64

In [58]:
labels = job_gender.loc["Female"].index
percentages = np.array((job_gender.loc["Female"]["gender"] / gender.Female.sum())*100)
colors_bar = ['#9fcc47', '#4ac9e2' , '#e045c1' , '#efcb5d' , '#f4a69f' , '#63c92c' , '#7da6e8' , '#ffc9b2' ,'#f7b7cb', '#510d91', '#b5520c', '#021982']
trace = go.Pie(labels=labels, 
               hoverinfo='label+percent', 
               values=percentages, 
               textposition='outside',
               rotation=90)

layout = go.Layout(
                    title="Job Distribution for Females",
                    font=dict(family='Arial', size=12, color='#909090'),
                    legend=dict(x=0.9, y=0.5)
                    )
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Job Distribution for Females')

In [59]:
labels = job_gender.loc["Male"].index
percentages = np.array((job_gender.loc["Male"]["gender"] / gender.Male.sum())*100)
colors_bar = ['#9fcc47', '#4ac9e2' , '#e045c1' , '#efcb5d' , '#f4a69f' , '#63c92c' , '#7da6e8' , '#ffc9b2' ,'#f7b7cb', '#510d91', '#b5520c', '#021982']
trace = go.Pie(labels=labels, 
               hoverinfo='label+percent', 
               values=percentages, 
               textposition='outside',
               rotation=90)

layout = go.Layout(
                    title="Job Distribution for Males",
                    font=dict(family='Arial', size=12, color='#909090'),
                    legend=dict(x=0.9, y=0.5)
                    )
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Job Distribution for Males')

##Visualise Salary Distribution

In [24]:
salary = df["salary"].value_counts().reset_index()
salary.columns = ["salary", "count"]

# Lower bound of each salary band as an integer, used only to sort the bands
salary["int salary"] = [
    int(band.replace(',', '').replace('> ', '').split('-')[0])
    for band in salary["salary"]
]



In [25]:
salary

Unnamed: 0,salary,count,int salary
0,0-999,1513,0
1,"10,000-14,999",833,10000
2,"100,000-124,999",750,100000
3,"30,000-39,999",728,30000
4,"40,000-49,999",719,40000
5,"50,000-59,999",704,50000
6,"1,000-1,999",599,1000
7,"60,000-69,999",576,60000
8,"5,000-7,499",536,5000
9,"15,000-19,999",529,15000


In [26]:
sort_salary = salary.sort_values(by=['int salary'])
sort_salary.drop(columns=['int salary'], inplace=True)

In [27]:
sort_salary

Unnamed: 0,salary,count
0,0-999,1513
6,"1,000-1,999",599
17,"2,000-2,999",390
19,"3,000-3,999",305
20,"4,000-4,999",289
8,"5,000-7,499",536
15,"7,500-9,999",408
1,"10,000-14,999",833
9,"15,000-19,999",529
10,"20,000-24,999",526


In [28]:
trace = go.Bar(
            x=np.array(sort_salary["salary"]),
            y=np.array(sort_salary["count"]),
            marker=dict(
            color='#746bf4'))

data = [trace]

layout = go.Layout(
    title='Salary Distribution',
    font=dict(color='#909090'),
    xaxis=dict(
        title='Salary',
        titlefont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
        showticklabels=True,
        tickangle=-45,
        tickfont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
),
    yaxis=dict(
        title="No. of respondents with said salary",
        titlefont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
        showticklabels=True,
        tickangle=0,
        tickfont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        )
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='salary distribution')

Interestingly, the most common salary band is $0-999. This could be because many of the respondents are students, who are not earning an income.

In [29]:
# Rank each salary band by its position in the sorted table (0 = lowest band)
salary_map = {band: rank for rank, band in enumerate(sort_salary['salary'])}

In [30]:
salary_map

{'0-999': 0,
 '1,000-1,999': 1,
 '10,000-14,999': 7,
 '100,000-124,999': 18,
 '125,000-149,999': 19,
 '15,000-19,999': 8,
 '150,000-199,999': 20,
 '2,000-2,999': 2,
 '20,000-24,999': 9,
 '200,000-249,999': 21,
 '25,000-29,999': 10,
 '250,000-299,999': 22,
 '3,000-3,999': 3,
 '30,000-39,999': 11,
 '300,000-500,000': 23,
 '4,000-4,999': 4,
 '40,000-49,999': 12,
 '5,000-7,499': 5,
 '50,000-59,999': 13,
 '60,000-69,999': 14,
 '7,500-9,999': 6,
 '70,000-79,999': 15,
 '80,000-89,999': 16,
 '90,000-99,999': 17,
 '> 500,000': 24}

In [31]:
# Replace each salary band with its rank; missing salaries stay NaN
df["salary map"] = df['salary'].map(salary_map)

In [32]:
df

Unnamed: 0,age,gender,country,education,job title,salary,salary map
1,22-24,Male,France,Master’s degree,Software Engineer,"30,000-39,999",11.0
2,40-44,Male,India,Professional degree,Software Engineer,"5,000-7,499",5.0
3,55-59,Female,Germany,Professional degree,,,
4,40-44,Male,Australia,Master’s degree,Other,"250,000-299,999",22.0
5,22-24,Male,India,Bachelor’s degree,Other,"4,000-4,999",4.0
...,...,...,...,...,...,...,...
19713,50-54,Male,Japan,,,,
19714,18-21,Male,India,Bachelor’s degree,Other,0-999,0.0
19715,35-39,Male,India,Master’s degree,Student,,
19716,25-29,Male,India,Master’s degree,Statistician,"1,000-1,999",1.0


In [33]:
salary_job = df.groupby('job title').agg({'salary map':'median'})

In [34]:
salary_job

Unnamed: 0_level_0,salary map
job title,Unnamed: 1_level_1
Business Analyst,9.0
DBA/Database Engineer,11.0
Data Analyst,8.0
Data Engineer,11.0
Data Scientist,12.0
Not employed,
Other,10.0
Product/Project Manager,12.0
Research Scientist,10.0
Software Engineer,8.0


In [35]:
swap_salary_map = dict([(value, key) for key, value in salary_map.items()]) 

In [36]:
swap_salary_map

{0: '0-999',
 1: '1,000-1,999',
 2: '2,000-2,999',
 3: '3,000-3,999',
 4: '4,000-4,999',
 5: '5,000-7,499',
 6: '7,500-9,999',
 7: '10,000-14,999',
 8: '15,000-19,999',
 9: '20,000-24,999',
 10: '25,000-29,999',
 11: '30,000-39,999',
 12: '40,000-49,999',
 13: '50,000-59,999',
 14: '60,000-69,999',
 15: '70,000-79,999',
 16: '80,000-89,999',
 17: '90,000-99,999',
 18: '100,000-124,999',
 19: '125,000-149,999',
 20: '150,000-199,999',
 21: '200,000-249,999',
 22: '250,000-299,999',
 23: '300,000-500,000',
 24: '> 500,000'}

In [37]:
salary_job["median salary range"] = ""
salary_job["median salary range"] = salary_job['salary map'].map(swap_salary_map)

In [38]:
salary_job = salary_job[salary_job['median salary range'].notna()]

In [39]:
salary_job

Unnamed: 0_level_0,salary map,median salary range
job title,Unnamed: 1_level_1,Unnamed: 2_level_1
Business Analyst,9.0,"20,000-24,999"
DBA/Database Engineer,11.0,"30,000-39,999"
Data Analyst,8.0,"15,000-19,999"
Data Engineer,11.0,"30,000-39,999"
Data Scientist,12.0,"40,000-49,999"
Other,10.0,"25,000-29,999"
Product/Project Manager,12.0,"40,000-49,999"
Research Scientist,10.0,"25,000-29,999"
Software Engineer,8.0,"15,000-19,999"
Statistician,9.0,"20,000-24,999"


In [40]:
colors_bar = ['#9fcc47', '#4ac9e2' , '#e045c1' , '#efcb5d' , '#f4a69f' , '#63c92c' , '#7da6e8' , '#ffc9b2' ,'#f7b7cb', '#510d91', '#b5520c', '#021982']
trace = go.Bar(
            x=salary_job.index,
            y=np.array(salary_job["salary map"]),
            marker=dict(
            color=colors_bar))

data = [trace]

layout = go.Layout(
    title='Median Salary of Job',
    font=dict(color='#909090'),
    xaxis=dict(
        title='Job Title',
        titlefont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
        showticklabels=True,
        tickangle=-45,
        tickfont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
),
    yaxis=dict(
        title="Median Salary Range",
        titlefont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
        tickmode = 'array',
        tickvals = np.array(list(swap_salary_map.keys())),
        ticktext = np.array(list(swap_salary_map.values())),
        tickangle=0,
        tickfont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        )
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='salary by job')

From the bar graph above, we can see that Data Analysts and Software Engineers have the lowest median salary range, 15,000 to 19,999. On the other hand, Data Scientists and Product/Project Managers have the highest median salary range, 40,000 to 49,999.
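
Note that these are medians of band ranks, not of dollar amounts. The sketch below walks through the idea on hypothetical toy data with made-up job titles: rank the bands, take the median rank per job, and map the rank back to a band label. When a group has an even number of responses the median can land between two bands (for example 1.5), so this sketch rounds up to the higher band, which is an assumption rather than the only reasonable choice.

```python
import math
import pandas as pd

# Salary bands in ascending order; a band's position in the list is its rank
bands = ["0-999", "1,000-1,999", "2,000-2,999", "3,000-3,999"]
band_rank = {band: rank for rank, band in enumerate(bands)}
rank_band = dict(enumerate(bands))

# Hypothetical toy responses
toy = pd.DataFrame(
    {
        "job title": ["Analyst", "Analyst", "Analyst", "Engineer", "Engineer"],
        "salary": ["0-999", "1,000-1,999", "3,000-3,999",
                   "1,000-1,999", "2,000-2,999"],
    }
)
toy["rank"] = toy["salary"].map(band_rank)

# Median rank per job; Engineer has two responses, so its median is 1.5
median_rank = toy.groupby("job title")["rank"].median()

# Round half-values up to the next band, then translate ranks back to labels
median_band = median_rank.apply(math.ceil).map(rank_band)
print(median_band)
```

Because the bands have unequal widths, a one-rank difference near the top of the scale covers far more money than a one-rank difference near the bottom.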

##Visualise Salary distribution by Gender

In [41]:
salary_job_gender = df.groupby(['gender', 'job title']).agg({'salary map':'median'})

In [42]:
salary_job_gender

Unnamed: 0_level_0,Unnamed: 1_level_0,salary map
gender,job title,Unnamed: 2_level_1
Female,Business Analyst,7.0
Female,DBA/Database Engineer,10.0
Female,Data Analyst,7.0
Female,Data Engineer,7.0
Female,Data Scientist,11.0
Female,Not employed,
Female,Other,7.0
Female,Product/Project Manager,11.0
Female,Research Scientist,8.0
Female,Software Engineer,7.0


In [43]:
# A median over an even number of respondents can fall between two bands (4.5 or 18.5 here);
# round those half-values up so that every median maps to a band label
salary_job_gender["salary map"] = salary_job_gender["salary map"].replace({4.5: 5.0, 18.5: 19.0})
salary_job_gender["median salary range"] = salary_job_gender['salary map'].map(swap_salary_map)



In [44]:
salary_job_gender

Unnamed: 0_level_0,Unnamed: 1_level_0,salary map,median salary range
gender,job title,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,Business Analyst,7.0,"10,000-14,999"
Female,DBA/Database Engineer,10.0,"25,000-29,999"
Female,Data Analyst,7.0,"10,000-14,999"
Female,Data Engineer,7.0,"10,000-14,999"
Female,Data Scientist,11.0,"30,000-39,999"
Female,Not employed,,
Female,Other,7.0,"10,000-14,999"
Female,Product/Project Manager,11.0,"30,000-39,999"
Female,Research Scientist,8.0,"15,000-19,999"
Female,Software Engineer,7.0,"10,000-14,999"


In [45]:
salary_job_gender = salary_job_gender[salary_job_gender['median salary range'].notna()]

In [46]:
salary_job_gender

Unnamed: 0_level_0,Unnamed: 1_level_0,salary map,median salary range
gender,job title,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,Business Analyst,7.0,"10,000-14,999"
Female,DBA/Database Engineer,10.0,"25,000-29,999"
Female,Data Analyst,7.0,"10,000-14,999"
Female,Data Engineer,7.0,"10,000-14,999"
Female,Data Scientist,11.0,"30,000-39,999"
Female,Other,7.0,"10,000-14,999"
Female,Product/Project Manager,11.0,"30,000-39,999"
Female,Research Scientist,8.0,"15,000-19,999"
Female,Software Engineer,7.0,"10,000-14,999"
Female,Statistician,5.0,"5,000-7,499"


In [47]:
np.array(salary_job_gender.loc["Female"]['salary map'])

array([ 7., 10.,  7.,  7., 11.,  7., 11.,  8.,  7.,  5.])

In [48]:
salary_job_gender.loc["Female"]['salary map'].index

Index(['Business Analyst', 'DBA/Database Engineer', 'Data Analyst',
       'Data Engineer', 'Data Scientist', 'Other', 'Product/Project Manager',
       'Research Scientist', 'Software Engineer', 'Statistician'],
      dtype='object', name='job title')

In [49]:
color_palette_list = ['#A5C8E4', '#C0ECCC', '#F9F0C1', '#F6A8A6']

layout = go.Layout(
    title='Salary Distribution by Gender for each Job Title',
    font=dict(color='#909090'),
    xaxis=dict(
        title='Job Title',
        titlefont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
        showticklabels=True,
        tickangle=-45,
        tickfont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
  	),
    yaxis=dict(
        title="Median Salary Range",
        titlefont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        ),
        tickmode = 'array',
        tickvals = np.array(list(swap_salary_map.keys())),
        ticktext = np.array(list(swap_salary_map.values())),
        tickangle=0,
        tickfont=dict(
            family='Arial, sans-serif',
            size=12,
            color='#909090'
        )
    )
)

fig = go.Figure(layout=layout)
# One bar trace per gender. Each trace takes its x values from its own index,
# so a gender that lacks a job title cannot be mislabelled or misaligned.
gender_names = ['Male', 'Female', 'Prefer not to say', 'Prefer to self-describe']
for name, color in zip(gender_names, color_palette_list):
    medians = salary_job_gender.loc[name]['salary map']
    fig.add_trace(go.Bar(
        x=np.array(medians.index),
        y=np.array(medians),
        name=name,
        marker_color=color
    ))


# Here we modify the tickangle of the xaxis, resulting in rotated labels.
fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig.show()

As the chart shows, females have a lower median salary than males in every job. Interestingly, respondents labelled 'Prefer to self-describe' earn the most in some jobs, such as Business Analyst and DBA/Database Engineer. However, this is probably because the category is very small: for example, only 1 respondent in the 'Prefer to self-describe' category is a Business Analyst. With so few people, the median for such a group is unreliable.
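
One way to guard against tiny groups is to compute the group size next to each median and hide groups below a minimum size. The sketch below does this with named aggregation on hypothetical toy data; `MIN_RESPONDENTS` is an arbitrary threshold chosen for the toy example, not a value taken from the survey.

```python
import pandas as pd

MIN_RESPONDENTS = 3  # hypothetical cut-off, chosen only for this toy example

# Hypothetical toy data: salary ranks for one job title, split by gender
toy = pd.DataFrame(
    {
        "gender": ["Male"] * 4 + ["Female"] * 3 + ["Prefer to self-describe"],
        "job title": ["Business Analyst"] * 8,
        "salary map": [9, 10, 8, 11, 7, 7, 8, 20],
    }
)

# Median rank and number of respondents for every gender/job combination
summary = toy.groupby(["gender", "job title"]).agg(
    median_rank=("salary map", "median"),
    respondents=("salary map", "count"),
)

# Keep only groups large enough for the median to mean something
reliable = summary[summary["respondents"] >= MIN_RESPONDENTS]
print(reliable)
```

Here the single 'Prefer to self-describe' respondent is dropped, so one unusually high salary no longer dominates the comparison.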

##Visualise Gender Distribution in each Country

In [50]:
country_gender = df.groupby(['gender', 'country']).agg({'country':'count'})
country_gender

Unnamed: 0_level_0,Unnamed: 1_level_0,country
gender,country,Unnamed: 2_level_1
Female,Algeria,12
Female,Argentina,13
Female,Australia,44
Female,Austria,5
Female,Bangladesh,7
...,...,...
Prefer to self-describe,Russia,4
Prefer to self-describe,Spain,1
Prefer to self-describe,Turkey,1
Prefer to self-describe,United Kingdom of Great Britain and Northern Ireland,6


In [51]:
labels = country_gender.loc["Female"].index
percentages = np.array((country_gender.loc["Female"]["country"] / gender.Female.sum())*100)
colors_bar = ['#9fcc47', '#4ac9e2' , '#e045c1' , '#efcb5d' , '#f4a69f' , '#63c92c' , '#7da6e8' , '#ffc9b2' ,'#f7b7cb', '#510d91', '#b5520c', '#021982']
trace = go.Pie(labels=labels, 
               hoverinfo='label+percent', 
               values=percentages, 
               textposition='outside',
               marker=dict(colors=color_palette_list),
               rotation=90)

layout = go.Layout(
                    title="Country Distribution for Females",
                    font=dict(family='Arial', size=12, color='#909090'),
                    legend=dict(x=0.9, y=0.5)
                    )
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Country Distribution for Females')

In [52]:
labels = country_gender.loc["Male"].index
percentages = np.array((country_gender.loc["Male"]["country"] / gender.Male.sum())*100)
colors_bar = ['#9fcc47', '#4ac9e2' , '#e045c1' , '#efcb5d' , '#f4a69f' , '#63c92c' , '#7da6e8' , '#ffc9b2' ,'#f7b7cb', '#510d91', '#b5520c', '#021982']
trace = go.Pie(labels=labels, 
               hoverinfo='label+percent', 
               values=percentages, 
               textposition='outside',
               marker=dict(colors=color_palette_list),
               rotation=90)

layout = go.Layout(
                    title="Country Distribution for Males",
                    font=dict(family='Arial', size=12, color='#909090'),
                    legend=dict(x=0.9, y=0.5)
                    )
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='Country Distribution for Males')

For both males and females, a large percentage of the respondents came from either India or the United States of America.
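
The two pie charts show where each gender's respondents come from. A complementary view is the gender split inside each country, which is not computed in this notebook. A minimal sketch on hypothetical toy data, using `value_counts(normalize=True)` within each country group:

```python
import pandas as pd

# Hypothetical toy data standing in for the survey's country and gender columns
toy = pd.DataFrame(
    {
        "country": ["India"] * 4 + ["United States of America"] * 2,
        "gender": ["Male", "Male", "Male", "Female", "Male", "Female"],
    }
)

# Within each country, the percentage of respondents of each gender
gender_share_by_country = (
    toy.groupby("country")["gender"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
    * 100
)
print(gender_share_by_country)
```

Each row sums to 100, so countries with very different respondent counts can be compared directly.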

In conclusion, the survey points to a gender disparity in the tech industry: women are still under-represented in this field, and they report lower salaries than men. We should encourage more girls to join the tech industry, for example by setting up more programs run by women already in the industry to inspire girls to take up STEM jobs.