<a href="https://colab.research.google.com/github/iman-g/Pay-gap-Survey-kaggle/blob/main/Pay_Gap_Survey_Kaggle_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Table of contents**<a id='toc0_'></a>    
- [General Information](#toc1_)    
  - [Total Responders](#toc1_1_)    
  - [Number of Questions](#toc1_2_)    
  - [Questions With less than 20% null share](#toc1_3_)    
- [General Answers](#toc2_)    
  - [Gender](#toc2_1_)    
  - [Age distribution](#toc2_2_)    
  - [Countries with the most respondents](#toc2_3_)    
  - [Education](#toc2_4_)    
  - [Age & Experience](#toc2_5_)    
  - [Number of Software Languages Respondents Know](#toc2_6_)    
  - [Roles](#toc2_7_)    
- [Salary](#toc3_)    
  - [Experience and Salary](#toc3_1_)    
  - [Salary and Seniority](#toc3_2_)    
  - [Role and Salary](#toc3_3_)    
- [Conclusion](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # likely won't be used much, as I'm experimenting with plotly
import plotly.graph_objects as go  # you will be learning how go and px work along with me!
import plotly.express as px
from plotly.subplots import make_subplots

# pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [49]:
import plotly.io as pio
pio.renderers.default = "notebook+pdf"

In [None]:
# low_memory=False reads the file in one pass, which avoids the mixed-type DtypeWarning
df = pd.read_csv('kaggle_survey_2020_responses.csv', low_memory=False)

# The first data row holds the question text, so drop it; copy() lets us add columns later without warnings
df_final = df.iloc[1:].copy()



# <a id='toc1_'></a>[General Information](#toc0_)

## <a id='toc1_1_'></a>[Total Responders](#toc0_)

In [None]:
responders = df_final['Time from Start to Finish (seconds)'].count()

print(f'Total survey responders = {responders}')

Total survey responders = 20036


## <a id='toc1_2_'></a>[Number of Questions](#toc0_)

In [None]:
test = df_final.columns

unique_questions = set()

for col in test:
    if col.startswith('Q'):
        base_question = col.split('_')[0]
        unique_questions.add(base_question)

unique_question_count = len(unique_questions)
print("Number of Unique Questions:", unique_question_count)

Number of Unique Questions: 39


## <a id='toc1_3_'></a>[Questions With less than 20% null share](#toc0_)

In [None]:
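# Percentage of missing answers per column.
# NOTE: len(df) still counts the question-text row, so it is one more than len(df_final);
# dividing by len(df_final) would be exact, and the difference is negligible here.
missing_data = df_final.isnull().sum() * 100 / len(df)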

missing_data = missing_data.reset_index()
missing_data.columns = ['column','null_share']
missing_data[missing_data['null_share']<=20].sort_values('null_share')

| index | column | null_share |
|---|---|---|
| 0 | Time from Start to Finish (seconds) | 0.0 |
| 1 | Q1 | 0.0 |
| 2 | Q2 | 0.0 |
| 3 | Q3 | 0.0 |
| 4 | Q4 | 2.330688 |
| 5 | Q5 | 3.787992 |
| 6 | Q6 | 4.571543 |
| 20 | Q8 | 11.458801 |
| 52 | Q13 | 16.249938 |
| 47 | Q11 | 16.464541 |
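
The same null-share table can be built with `isnull().mean()`, which divides by the frame's own row count. This is a minimal sketch on a made-up frame; the `toy` data and its values are illustrative and are not survey results.

```python
import pandas as pd

# Toy frame standing in for df_final (hypothetical values)
toy = pd.DataFrame({
    'Q1': ['18-21', '22-24', '25-29', '30-34', '35-39'],
    'Q4': ['Bachelor', None, 'Master', 'Master', 'Doctoral'],
    'Q8': [None, None, 'Python', None, 'R'],
})

# isnull().mean() is the fraction of missing values per column; times 100 gives a percentage
null_share = (
    (toy.isnull().mean() * 100)
    .rename('null_share')
    .rename_axis('column')
    .reset_index()
)

# Keep only the columns with at most 20% missing answers
low_null = null_share[null_share['null_share'] <= 20].sort_values('null_share')
print(low_null)
```

Here `Q8` is filtered out because three of its five answers are missing.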


# <a id='toc2_'></a>[General Answers](#toc0_)

## <a id='toc2_1_'></a>[Gender](#toc0_)

In [None]:
Q2 = df_final.Q2.value_counts().reset_index()
fig = px.pie(Q2, values='count', names='Q2', title='Genders')
fig.show()

In [None]:
Questions = {}

qnums = list(dict.fromkeys([i.split('_')[0] for i in df_final.columns]))


for i in qnums:
    if i in ['Q1','Q2','Q3']:
        Questions[i] = df_final[i]
    else:
        Questions[i] = df_final[[q for q in df_final.columns if q.startswith(i)]]

In [None]:
Genders ={}

for i in df_final.Q2.unique():
    Genders[i] = df_final[df_final.Q2 == i]

## <a id='toc2_2_'></a>[Age distribution](#toc0_)

In [None]:
fig = make_subplots(rows=1, cols=3, subplot_titles=("All Respondents", "Men", "Women"))

fig.add_trace(go.Histogram(name = 'All', x = df_final['Q1'], histnorm="probability density"),row=1, col=1)

for i, n in zip(['Man', 'Woman'], [2, 3]):
    fig.add_trace(go.Histogram(name = i, x = df_final[df_final['Q2']==i]['Q1'], histnorm="probability density"),row=1, col=n)


fig.update_xaxes(categoryorder='array', categoryarray= ['18-21','22-24','25-29','30-34','35-39'
                                                  ,'40-44','45-49','50-54','55-59','60-69','70+'])

fig.update_layout(title_text="Age", showlegend=False)

fig.show()

In [None]:
total_age = df_final['Q1'].value_counts(normalize=True).reset_index()
women_age = df_final[df_final['Q2']=='Woman']['Q1'].value_counts(normalize=True).reset_index()
men_age = df_final[df_final['Q2']=='Man']['Q1'].value_counts(normalize=True).reset_index()


print(
"""Women who responded to this survey were noticeably younger than men. The proportion of women under 30 was {:.0f}%, compared to {:.0f}% for men."""
      .format(100*women_age[women_age['Q1'].isin(['18-21','22-24','25-29'])]['proportion'].sum()
              , 100*men_age[men_age['Q1'].isin(['18-21','22-24','25-29'])]['proportion'].sum()))


Women who responded to this survey were noticeably younger than men. The proportion of women under 30 was 64%, compared to 54% for men.
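
The under-30 comparison can also be computed in one pass by averaging a boolean flag within each gender. A minimal sketch on made-up rows follows; the `toy` data is illustrative only, while the column names `Q1` and `Q2` follow the survey.

```python
import pandas as pd

# Hypothetical respondents: gender (Q2) and age band (Q1)
toy = pd.DataFrame({
    'Q2': ['Man', 'Man', 'Man', 'Man', 'Woman', 'Woman', 'Woman', 'Woman'],
    'Q1': ['18-21', '30-34', '25-29', '40-44', '22-24', '25-29', '18-21', '35-39'],
})

under_30_bands = ['18-21', '22-24', '25-29']

# Flag each respondent as under 30, then take the mean of the flag per gender
under_30_share = toy['Q1'].isin(under_30_bands).groupby(toy['Q2']).mean() * 100
print(under_30_share)
```

The mean of a boolean column is the share of `True` values, so no separate `value_counts` call per gender is needed.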


## <a id='toc2_3_'></a>[Countries with the most respondents](#toc0_)

In [None]:
fig = make_subplots(rows=1, cols=3, subplot_titles=("All Respondents", "Men", "Women"))

fig.add_trace(go.Bar(name='Total', x = [(i[:15] + '...') if len(i) > 15 else i for i in df_final.Q3.value_counts().head(10).index]
                    , y = df_final.Q3.value_counts().head(10).values)
                    , row=1, col=1)


for gender, col in zip(['Man', 'Woman'], [2, 3]):
    # Top ten countries for this gender; long names are truncated for the axis labels
    top_countries = df_final[df_final['Q2']==gender].Q3.value_counts().head(10)
    fig.add_trace(go.Bar(name=gender, x = [(c[:15] + '...') if len(c) > 15 else c for c in top_countries.index]
                               , y = top_countries.values), row=1, col=col)

fig.update_layout(title_text="Top Ten Countries", showlegend=False)

fig.show()

In [None]:
print(
"""Although the top three countries of origin are identical for men and women, the ranking of the remaining countries differs between the two genders.
For example, Japan ranks fourth among male respondents, while the UK ranks fourth among female respondents.""")

Although the top three countries of origin are identical for men and women, the ranking of the remaining countries differs between the two genders.
For example, Japan ranks fourth among male respondents, while the UK ranks fourth among female respondents.
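
Rank differences like the one described above can be read off a table instead of the bar charts. The sketch below ranks countries within each gender on made-up counts; the `toy` frame is hypothetical and only mimics the shape of `Q2` and `Q3`.

```python
import pandas as pd

# Hypothetical respondents: 10 men and 10 women spread over four countries
toy = pd.DataFrame({
    'Q2': ['Man'] * 10 + ['Woman'] * 10,
    'Q3': (['India'] * 4 + ['USA'] * 3 + ['Japan'] * 2 + ['UK'] * 1
           + ['India'] * 4 + ['USA'] * 3 + ['UK'] * 2 + ['Japan'] * 1),
})

# Respondent counts per country (rows) and gender (columns)
counts = pd.crosstab(toy['Q3'], toy['Q2'])

# Rank 1 is the country with the most respondents within each gender
ranks = counts.rank(ascending=False, method='min').astype(int)

# Countries whose rank is not the same for men and women
rank_differs = ranks[ranks['Man'] != ranks['Woman']]
print(rank_differs)
```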


In [None]:

def plotly_choropleth_map(df, column, title, max_value):
    fig = px.choropleth(df,
                    locations = 'country',
                    color = column,
                    locationmode = 'country names',
                    color_continuous_scale = 'viridis',
                    title = title,
                    range_color = [0, max_value])
    fig.update(layout=dict(title=dict(x=0.5)))
    fig.show()


countries = df_final.Q3.value_counts().reset_index()
countries.columns = ['country','# of respondents']
countries = countries[countries['# of respondents']>100]
plotly_choropleth_map(countries,
                       '# of respondents',
                       'Total number of responses per country',
                        max_value = 700)

## <a id='toc2_4_'></a>[Education](#toc0_)

In [None]:
fig = make_subplots(rows=1, cols=3, subplot_titles=("All Respondents", "Men", "Women"), shared_yaxes=True)


fig.add_trace(go.Histogram(name = 'All', y = df_final['Q4'], histnorm="probability density"),row=1, col=1)

for i, n in zip(['Man', 'Woman'], [2, 3]):
    fig.add_trace(go.Histogram(name = i, y = df_final[df_final['Q2']==i]['Q4'], histnorm="probability density"),row=1, col=n)

# Q4 is plotted on the y-axis, so the category order has to be set on the y-axes
fig.update_yaxes(categoryorder='array', categoryarray= ['I prefer not to answer', 'No formal education past high school'
                                                        , 'Professional degree', 'Some college/university study without earning a bachelor’s degree'
                                                        , "Bachelor’s degree", "Master’s degree", 'Doctoral degree'])


fig.update_layout(title_text="Education Level", showlegend=False)


fig.show()

In [None]:
print(
"""Both women and men exhibit similar educational attainment patterns.""")

Both women and men exhibit similar educational attainment patterns.
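
One way to put numbers behind a "similar patterns" reading is a column-normalized cross-tabulation, which gives the share of each education level within each gender. This is a sketch on made-up rows; the shortened education labels and the `toy` frame are assumptions for illustration.

```python
import pandas as pd

# Hypothetical respondents with shortened education labels
toy = pd.DataFrame({
    'Q2': ['Man'] * 5 + ['Woman'] * 5,
    'Q4': ['Bachelor', 'Bachelor', 'Master', 'Master', 'Doctoral',
           'Bachelor', 'Master', 'Master', 'Master', 'Doctoral'],
})

# normalize='columns' makes each gender column sum to 1; times 100 gives percentages
education_share = pd.crosstab(toy['Q4'], toy['Q2'], normalize='columns') * 100

# Largest gap in percentage points between the two genders across education levels
max_gap = (education_share['Man'] - education_share['Woman']).abs().max()
print(education_share)
print(f'Largest gap: {max_gap:.0f} percentage points')
```

A small `max_gap` would support the claim of similar attainment; a large one would point to the level where the genders diverge.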


In [None]:
fig = make_subplots(rows=1, cols=3, subplot_titles=("All Respondents", "Men", "Women"), shared_yaxes=True)

fig.add_trace(go.Histogram(name = 'All', y = df_final['Q6'], histnorm="probability density"),row=1, col=1)

for i, n in zip(['Man', 'Woman'], [2, 3]):
    fig.add_trace(go.Histogram(name = i, y = df_final[df_final['Q2']==i]['Q6'], histnorm="probability density"),row=1, col=n)


fig.update_layout(title_text="Experience", showlegend=False)

# Q6 is on the y-axis; order the experience bands from least to most experience
fig.update_yaxes(categoryorder='array', categoryarray= ['I have never written code','< 1 years','1-2 years',
                                                  '3-5 years','5-10 years','10-20 years','20+ years'])

fig.show()

In [None]:
print(
"""Both men and women reported similar durations of coding experience. The majority of respondents had 1 to 5 years of coding experience.""")

Both men and women reported similar durations of coding experience. The majority of respondents had 1 to 5 years of coding experience.


## <a id='toc2_5_'></a>[Age & Experience](#toc0_)

In [None]:
fig = px.density_heatmap(df_final, x="Q1", y="Q6"
                         , category_orders={'Q1':['18-21','22-24','25-29','30-34','35-39'
                                                  ,'40-44','45-49','50-54','55-59','60-69','70+'],
                                            'Q6':['I have never written code','< 1 years','1-2 years',
                                                  '3-5 years','5-10 years','10-20 years','20+ years'][::-1]}
                         , title = 'Age and Experience'
                         , labels={"Q1": "Age", "Q6": "Experience"})
fig.show()

## <a id='toc2_6_'></a>[Number of Software Languages Respondents Know](#toc0_)

In [None]:
# Exclude Q7_Part_12 so it is not counted below
df_final = df_final.drop("Q7_Part_12", axis='columns')

# Each Q7_* column is expected to hold a single answer label (or NaN), so its mode is that label.
# Rename every non-empty Q7 column to the label it represents.
rename_map = {}
for col in df_final.columns:
    if col.startswith('Q7_') and df_final[col].notna().any():
        rename_map[col] = df_final[col].mode()[0]

df_final = df_final.rename(columns=rename_map)
q7_col = list(rename_map.values())

# Number of Q7 options each respondent selected
df_final['Number_of_Software'] = df_final[q7_col].notnull().sum(axis=1)

ah = q7_col + ['Q2']  # Q7 columns plus the gender column

In [None]:
fig = make_subplots(rows=1, cols=3, subplot_titles=("All Respondents", "Men", "Women"))


fig.add_trace(go.Histogram(name = 'All', x = df_final['Number_of_Software'], histnorm="probability density"),row=1, col=1)

for i, n in zip(['Man', 'Woman'], [2, 3]):
    fig.add_trace(go.Histogram(name = i, x = df_final[df_final['Q2']==i]['Number_of_Software'], histnorm="probability density"),row=1, col=n)

fig.update_layout(title_text="Number of Software Languages Respondents Know", showlegend=False)

fig.show()

In [None]:
print(
"""A larger share of women than men report knowing no software language.
Most respondents of both genders know one to three software languages.
""")

A larger share of women than men report knowing no software language.
Most respondents of both genders know one to three software languages.



In [None]:
software_count_all = df_final[q7_col].notnull().sum()

software_count_men = df_final[df_final['Q2'] == 'Man'][q7_col].notnull().sum()

software_count_women = df_final[df_final['Q2'] == 'Woman'][q7_col].notnull().sum()

fig = make_subplots(rows=1, cols=3, subplot_titles=("All Respondents", "Men", "Women"))

fig.add_trace(
    go.Bar(name='All', x=software_count_all.index, y=software_count_all.values),row=1, col=1)

# Men
fig.add_trace(
    go.Bar(name='Men', x=software_count_men.index, y=software_count_men.values),
    row=1, col=2
)


fig.add_trace(
    go.Bar(name='Women', x=software_count_women.index, y=software_count_women.values),
    row=1, col=3
)

fig.update_xaxes(categoryorder= "total descending")
fig.update_layout(title_text="Software Knowledge by Gender",
                  showlegend=False)

fig.show()


## <a id='toc2_7_'></a>[Roles](#toc0_)

In [None]:
fig = make_subplots(rows=1, cols=3, subplot_titles=("All Respondents", "Men", "Women"))


fig.add_trace(go.Histogram(name = 'All', y = df_final['Q5'], histnorm="probability density"),row=1, col=1).update_yaxes(categoryorder= "total descending")

for i, n in zip(['Man', 'Woman'], [2, 3]):
    fig.add_trace(go.Histogram(name = i, y = df_final[df_final['Q2']==i]['Q5'], histnorm="probability density"),row=1, col=n)


# fig.update_axes(categoryorder= "total descending")
fig.update_layout(title_text="Role", showlegend=False)


fig.show()

# <a id='toc3_'></a>[Salary](#toc0_)

In [None]:
def salary_midpoint(salary):
    """Convert a salary band label such as '1,000-1,999' to a single number."""
    if pd.isna(salary):
        return np.nan
    salary = salary.replace(',', '').replace('$', '')
    if '>' in salary:
        return float(salary.split(' ')[-1])  # Open-ended band ('> $500,000'): use its lower bound
    elif '-' in salary:
        lower, upper = salary.split('-')
        return (float(lower) + float(upper)) / 2  # Return midpoint of the range
    return np.nan

# Apply the function to create a new column with numeric values
df_final['Salary_Numeric'] = df_final['Q24'].apply(salary_midpoint)
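
A quick check of how the band-to-number conversion behaves on a few representative labels. The function is restated here only so the snippet runs on its own, and the `examples` series is made up for illustration.

```python
import numpy as np
import pandas as pd

def salary_midpoint(salary):
    """Same logic as the cell above, repeated so this snippet is self-contained."""
    if pd.isna(salary):
        return np.nan
    salary = salary.replace(',', '').replace('$', '')
    if '>' in salary:
        return float(salary.split(' ')[-1])   # open-ended band: use its lower bound
    elif '-' in salary:
        lower, upper = salary.split('-')
        return (float(lower) + float(upper)) / 2   # closed band: use the midpoint
    return np.nan

# Lowest band, a mid-range band, the open-ended top band, and a missing answer
examples = pd.Series(['$0-999', '100,000-124,999', '> $500,000', np.nan])
converted = examples.apply(salary_midpoint)
print(converted.tolist())  # [499.5, 112499.5, 500000.0, nan]
```

Note that the top band is coded as 500,000, so very high salaries are understated by this conversion.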

In [None]:
responses_in_order = ['$0-999', '1,000-1,999', '2,000-2,999', '3,000-3,999', '4,000-4,999', '5,000-7,499', '7,500-9,999', '10,000-14,999',
                      '15,000-19,999', '20,000-24,999', '25,000-29,999', '30,000-39,999', '40,000-49,999', '50,000-59,999','60,000-69,999',
                      '70,000-79,999', '80,000-89,999', '90,000-99,999', '100,000-124,999', '125,000-149,999', '150,000-199,999', '200,000-249,999',
                      '250,000-299,999', '300,000-500,000', '> $500,000']

fig = make_subplots(rows=1, cols=3, subplot_titles=("All Respondents", "Men", "Women"), shared_yaxes= True)


fig.add_trace(go.Histogram(name = 'All', y = df_final['Q24'], histnorm="probability density"),row=1, col=1)

for i, n in zip(['Man', 'Woman'], [2, 3]):
    fig.add_trace(go.Histogram(name = i, y = df_final[df_final['Q2']==i]['Q24'], histnorm="probability density"),row=1, col=n)



fig.update_yaxes(categoryorder='array', categoryarray= responses_in_order)
fig.update_layout(title_text="Salary", showlegend=False, height=600)


fig.show()

In [None]:
fig = go.Figure()

fig.add_trace(go.Violin(x=df_final[df_final['Q2'].isin(['Man','Woman'])]['Q2'],
                        y=df_final[df_final['Q2'].isin(['Man','Woman'])]['Salary_Numeric'],
                        line_color='blue')
             )


fig.update_layout(height = 600, title='Salary Distribution')
fig.show()

In [None]:
print(
"""Without controlling for other factors (such as education, experience, and role), women in this survey
tend to earn lower salaries than men.
""")

Without controlling for other factors (such as education, experience, and role), women in this survey
tend to earn lower salaries than men.
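
The raw (uncontrolled) gap can be summarized with per-gender medians, which are less sensitive to the long right tail of salaries than means. This is a sketch on made-up numbers; the `toy` values are hypothetical and `raw_gap` is an illustrative name.

```python
import pandas as pd

# Hypothetical salaries after the band-to-number conversion
toy = pd.DataFrame({
    'Q2': ['Man', 'Man', 'Man', 'Woman', 'Woman', 'Woman'],
    'Salary_Numeric': [20000, 40000, 60000, 10000, 30000, 50000],
})

# Count, median and mean salary per gender
summary = toy.groupby('Q2')['Salary_Numeric'].agg(['count', 'median', 'mean'])

# Raw gap: how far the median salary of women falls below that of men
raw_gap = 1 - summary.loc['Woman', 'median'] / summary.loc['Man', 'median']
print(summary)
print(f'Raw median gap: {raw_gap:.0%}')
```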



## <a id='toc3_1_'></a>[Experience and Salary](#toc0_)

In [None]:
fig = px.density_heatmap(df_final[df_final['Q2']=='Man'], y="Q6", x="Q24"
                         , category_orders={'Q24':responses_in_order,
                                            'Q6':['I have never written code','< 1 years','1-2 years',
                                                  '3-5 years','5-10 years','10-20 years','20+ years'][::-1]}
                         , title = 'Experience and Salary - Men'
                         , labels={"Q24": "Salary", "Q6": "Experience"})
fig.show()

In [None]:
print(
"""The correlation between salary and experience is stronger among men than women.
""")

The correlation between salary and experience is stronger among men than women.



In [None]:
fig = px.density_heatmap(df_final[df_final['Q2']=='Woman'], y="Q6", x="Q24"
                         , category_orders={'Q24':responses_in_order,
                                            'Q6':['I have never written code','< 1 years','1-2 years',
                                                  '3-5 years','5-10 years','10-20 years','20+ years'][::-1]}
                         , title = 'Experience and Salary - Women'
                         , labels={"Q24": "Salary", "Q6": "Experience"})
fig.show()

## <a id='toc3_2_'></a>[Salary and Seniority](#toc0_)

In [None]:
# Map each experience band in Q6 to an approximate number of years
# (band midpoint; 20 for the open-ended '20+ years' band)
conditions = [
    (df_final['Q6']=='I have never written code'),
    (df_final['Q6']=='< 1 years'),
    (df_final['Q6']=='1-2 years'),
    (df_final['Q6']=='3-5 years'),
    (df_final['Q6']=='5-10 years'),
    (df_final['Q6']=='10-20 years'),
    (df_final['Q6']=='20+ years'),
]


values = [0,0.5,1.5,4,7.5,15,20]

# Use NaN rather than a string as the default so the column stays numeric
df_final['seniority_Numeric'] = np.select(conditions, values, default=np.nan)
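
The band-to-years coding can also be written as a dictionary lookup with `Series.map`, which leaves unanswered rows as `NaN` and keeps the result numeric. A minimal sketch, where `band_to_years` and `toy_q6` are illustrative names rather than part of the notebook:

```python
import pandas as pd

# Same approximate years per band as in the cell above
band_to_years = {
    'I have never written code': 0,
    '< 1 years': 0.5,
    '1-2 years': 1.5,
    '3-5 years': 4,
    '5-10 years': 7.5,
    '10-20 years': 15,
    '20+ years': 20,
}

# Hypothetical Q6 answers, including one missing response
toy_q6 = pd.Series(['3-5 years', '< 1 years', None, '20+ years'])

# Values that are not keys of the dictionary (here: the missing answer) become NaN
seniority_years = toy_q6.map(band_to_years)
print(seniority_years.tolist())  # [4.0, 0.5, nan, 20.0]
```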

In [None]:
conditions = [
    (df_final['Q6'].isin(['I have never written code','< 1 years'])),
    (df_final['Q6'].isin(['1-2 years','3-5 years'])),
    (df_final['Q6'].isin(['5-10 years','10-20 years','20+ years']))
]


values = ['Junior', 'Senior', 'Expert']


df_final['Seniority'] = np.select(conditions, values, default='unknown')

In [None]:
fig = go.Figure()

fig.add_trace(go.Violin(x=df_final[df_final['Q2']=='Man']['Seniority'],
                        y=df_final[df_final['Q2']=='Man']['Salary_Numeric'],
                        name='Men',
                        line_color='blue',
                        side='negative')
             )

fig.add_trace(go.Violin(x=df_final[df_final['Q2']=='Woman']['Seniority'],
                        y=df_final[df_final['Q2']=='Woman']['Salary_Numeric'],
                        name='Women',
                        line_color='red',
                        side='positive')
             )

# Overlay the two half-violins so men (left) and women (right) share one violin per seniority group
fig.update_layout(height = 600, title='Salary Distribution by Seniority and Gender', violingap=0, violinmode='overlay')
fig.show()

In [None]:
# Coerce seniority to numbers so that any 'unknown' entries become NaN and are ignored by corr()
seniority = pd.to_numeric(df_final['seniority_Numeric'], errors='coerce')

for gender, label in [('Man', 'Men'), ('Woman', 'Women')]:
    is_gender = df_final['Q2'] == gender
    correlation = round(seniority[is_gender].corr(df_final.loc[is_gender, 'Salary_Numeric']), 2)
    print(f"Correlation between Seniority and Salary for {label}: {correlation}")

print(
"""Among respondents in the Expert group, women earn lower salaries than men.
""")

Correlation between Seniority and Salary for Men: 0.34
Correlation between Seniority and Salary for Women: 0.27
Among respondents in the Expert group, women earn lower salaries than men.
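
Since both seniority and salary are coded from ordered bands, a rank-based (Spearman) correlation is a reasonable cross-check on the Pearson values. This is a swapped-in technique, not something the notebook computes; the sketch below uses made-up numbers and gets Spearman by correlating the ranks.

```python
import pandas as pd

# Hypothetical data: salary rises with seniority, but not linearly
toy = pd.DataFrame({
    'seniority_Numeric': [0, 0.5, 1.5, 4, 7.5, 15, 20],
    'Salary_Numeric': [500, 1500, 2500, 12500, 45000, 112500, 400000],
})

# Pearson measures linear association between the raw values
pearson = toy['seniority_Numeric'].corr(toy['Salary_Numeric'])

# Spearman is the Pearson correlation of the ranks, so it measures monotonic association
spearman = toy['seniority_Numeric'].rank().corr(toy['Salary_Numeric'].rank())

print(f'Pearson: {pearson:.2f}, Spearman: {spearman:.2f}')
```

In this toy example the relationship is perfectly monotonic, so Spearman is 1 while Pearson is lower because of the curvature.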



## <a id='toc3_3_'></a>[Role and Salary](#toc0_)

In [None]:
# Role order for the y-axis (unanswered rows are dropped)
roles_order = df_final['Q5'].dropna().unique().tolist()

fig = px.density_heatmap(df_final[df_final['Q2']=='Man'], y="Q5", x="Q24"
                         , category_orders={'Q24':responses_in_order,
                                            'Q5':roles_order}
                         , title = 'Role and Salary - Men'
                         , labels={"Q24": "Salary", "Q5": "Role"})
fig.show()

In [None]:
fig = px.density_heatmap(df_final[df_final['Q2']=='Woman'], y="Q5", x="Q24"
                         , category_orders={'Q24':responses_in_order,
                                            'Q5':roles_order}
                         , title = 'Role and Salary - Women'
                         , labels={"Q24": "Salary", "Q5": "Role"})
fig.show()

In [None]:
from scipy.stats import ttest_ind

# Build the masks from df_final itself so their index matches the rows being filtered
male_salaries = df_final[df_final['Q2'] == 'Man']['Salary_Numeric']
female_salaries = df_final[df_final['Q2'] == 'Woman']['Salary_Numeric']

# equal_var=False gives Welch's t-test, which does not assume equal variances
t_stat, p_value = ttest_ind(male_salaries.dropna(), female_salaries.dropna(), equal_var=False)

print(f"""T-statistic: {t_stat}, P-value: {p_value}
We can reject the null hypothesis of equal mean salaries between genders.
""")

T-statistic: 8.682396092994427, P-value: 6.620262025871115e-18
We can reject the null hypothesis of equal mean salaries between genders.
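
To see what the statistic measures, here is a sketch of Welch's t computed by hand with NumPy. The helper `welch_t` and the two small samples are made up for illustration; the p-value is left out because it needs the t distribution.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean (sample variance uses ddof=1)
    se2_a = a.var(ddof=1) / len(a)
    se2_b = b.var(ddof=1) / len(b)
    # Difference in means, scaled by the combined standard error
    t = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1))
    return t, dof

# Hypothetical salaries in thousands
men = [30, 40, 50, 60, 70]
women = [20, 30, 40, 50, 60]

t_value, dof_value = welch_t(men, women)
print(t_value, dof_value)
```

With these toy samples the means differ by 10 and the combined standard error is 10, so t is 1. The survey's much larger t comes from a far larger sample rather than from this kind of toy spread.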



# <a id='toc4_'></a>[Conclusion](#toc0_)

In [None]:
print("""
The data indicates a statistically significant salary difference between men and women. However, to determine how much of this difference is attributable to gender, we must control for other factors
such as seniority, role, and country. Including these variables in a regression analysis lets us isolate the association between gender and salary. In the next notebook,
 I will fit a regression model to examine the gender pay gap. """)


The data indicates a statistically significant salary difference between men and women. However, to determine how much of this difference is attributable to gender, we must control for other factors
such as seniority, role, and country. Including these variables in a regression analysis lets us isolate the association between gender and salary. In the next notebook,
 I will fit a regression model to examine the gender pay gap. 
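
As a rough illustration of what "controlling for other factors" means, the sketch below fits an ordinary least squares model with NumPy on synthetic data. It is not the model from the follow-up notebook: the data is generated from known coefficients (an assumption made so the result can be checked), and the names `toy`, `is_woman` and `log_salary` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic respondents: the same seniority values for men (0) and women (1)
toy = pd.DataFrame({
    'seniority': [0, 0.5, 1.5, 4, 7.5, 15] * 2,
    'is_woman': [0] * 6 + [1] * 6,
})

# Log salary generated from known coefficients, without noise
toy['log_salary'] = 10 + 0.1 * toy['seniority'] - 0.2 * toy['is_woman']

# Design matrix: intercept, seniority, gender indicator
X = np.column_stack([np.ones(len(toy)), toy['seniority'], toy['is_woman']])
y = toy['log_salary'].to_numpy()

# Ordinary least squares fit
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, seniority_coef, gender_coef = beta

print(f'seniority: {seniority_coef:.3f}, gender: {gender_coef:.3f}')
```

The gender coefficient is the difference in log salary between women and men at the same seniority, which is the quantity a controlled comparison tries to estimate. On real data, role and country would enter as additional columns.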
