**<center><font size="3.5">Diversity and Inclusion Analysis - Silicon Valley Edition**

Diversity is critical in tech, as it enables companies to create better and safer products that take everyone into consideration, not just one section of society. Diverse companies perform better, hire better talent, have more engaged employees, and retain workers better than companies that do not focus on diversity and inclusion. 

In short, the technology industry has unresolved social issues. This notebook looks at one of them, the gender gap, by exploring 2016 workforce data from 23 top Silicon Valley companies.

**<font size="3.5">Imports**

In [22]:
pip install chart-studio

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.mode.chained_assignment = None
from IPython.display import HTML
from chart_studio import plotly as py
import plotly.graph_objs as go
from plotly import tools
from plotly.graph_objs import *
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()

**<font size="3.5">The Data**
    
There are six columns in this dataset:

**company**: Company name

**year**: For now, 2016 only

**race**: Possible values: "American_Indian_Alaskan_Native", "Asian", "Black_or_African_American", "Latino", "Native_Hawaiian_or_Pacific_Islander", "Two_or_more_races", "White", "Overall_totals"

**gender**: Possible values: "male", "female". Non-binary gender is not counted in EEO-1 reports.

**job_category**: Possible values: "Administrative support", "Craft workers", "Executive/Senior officials & Mgrs", "First/Mid officials & Mgrs", "laborers and helpers", "operatives", "Professionals", "Sales workers", "Service workers", "Technicians", "Previous_totals", "Totals"

**count**: Mostly integer values, but contains "na" where no data is available.

In [24]:
reveal = pd.read_csv('../input/silicon-valley-diversity-data/Reveal_EEO1_for_2016.csv')
print("Number of columns: ",len(reveal.axes[1]))
print("Column names are: ", reveal.columns)
print("\n")
print("Number of rows: ",len(reveal.axes[0]))
print("\n")
print("Unique companies are: ")
print(reveal.company.unique())
print("\n")
print("Unique races are: ")
print(reveal.race.unique())
print("\n")
print("Unique genders are: ")
print(reveal.gender.unique())
print("\n")
print("Unique job categories are: ")
print(reveal.job_category.unique())
print("\n")
reveal.head()

**<font size="3.5">Data Cleaning**
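
The cell that builds `reveal_modified` is not shown in this notebook, although every later cell depends on it. The following is only a sketch of what that step might look like, inferred from how the frame is used afterwards: counts are summed as integers, totals rows would double count if kept, and job categories are filtered with the short labels `Managers` and `Executives`. The function name `clean_eeo1`, the `JOB_RENAMES` mapping, and the toy rows are assumptions, not the original code.

```python
import pandas as pd

# Assumed mapping: later cells filter on 'Managers' and 'Executives',
# so the long EEO-1 labels are presumably shortened during cleaning.
JOB_RENAMES = {
    "Executive/Senior officials & Mgrs": "Executives",
    "First/Mid officials & Mgrs": "Managers",
}

def clean_eeo1(df):
    """Hypothetical cleaning step; not the author's original cell."""
    out = df.copy()
    # "na" marks missing data; coerce it to NaN so it can be filtered out
    out["count"] = pd.to_numeric(out["count"], errors="coerce")
    out["job_category"] = out["job_category"].replace(JOB_RENAMES)
    # Drop missing counts and aggregate rows so sums do not double count
    keep = (
        out["count"].notna()
        & ~out["job_category"].isin(["Totals", "Previous_totals"])
        & (out["race"] != "Overall_totals")
    )
    out = out.loc[keep].copy()
    out["count"] = out["count"].astype(int)
    return out.reset_index(drop=True)

# Toy rows (made up) that mimic the six-column schema described earlier
toy = pd.DataFrame({
    "company": ["A", "A", "A", "A"],
    "year": [2016] * 4,
    "race": ["White", "Asian", "Overall_totals", "White"],
    "gender": ["male", "female", "male", "female"],
    "job_category": ["Executive/Senior officials & Mgrs", "Professionals",
                     "Totals", "Technicians"],
    "count": ["12", "30", "42", "na"],
})
reveal_modified_toy = clean_eeo1(toy)
print(reveal_modified_toy)
```

On the real data this would be called as `reveal_modified = clean_eeo1(reveal)`, provided the assumptions above match the actual file.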

**<font size="3.5">Exploratory Data Analysis**
    
***1. Total Employee Count***

In [48]:
company_count = reveal_modified.groupby(['company']).agg({'count': lambda x: sum((x).astype(int))})
company_count

In [27]:
plt.figure(figsize=(10,8))
sns.barplot(x=company_count.index,y=company_count['count'])

plt.title('Silicon Valley Companies',size=25)
plt.ylabel('Number of employees',size=14)
plt.xlabel('Companies',size=14)

plt.xticks(size=14,rotation=90)
plt.show()

***2. Exploring Gender Data***

In [28]:
reveal_percent = pd.pivot_table(reveal_modified,index = 'company', columns = 'gender', values='count',aggfunc=sum)
reveal_percent['Percent of female'] = reveal_percent['female']/(reveal_percent['female']+reveal_percent['male'])
reveal_percent = reveal_percent.sort_values('Percent of female', ascending=False)
reveal_percent

In [29]:
labels = reveal_modified.groupby(['gender']).agg({'count':sum}).index
values = reveal_modified.groupby(['gender']).agg({'count':sum})['count'].values
colors = ['#a1d99b', '#deebf7']
trace = go.Pie(labels=labels, values=values,
               textinfo="label+percent",
               textfont=dict(size=20),
               hole=.3, pull=.03,
               marker=dict(colors=colors, 
                           line=dict(color='#ff7f00', width=.5)))
layout=go.Layout(title='Pie Chart of Female and Male Employees')
data=[trace]

fig = dict(data=data,layout=layout)
iplot(fig, filename='Pie Chart of Female and Male Employees')

In [30]:
x = list(reveal_percent.index)
fem = reveal_percent['female']
male = reveal_percent['male']
percfemale = fem/(fem+male)*100  # scale the fraction to a percentage to match the axis label

ind = np.arange(len(x))
width = 0.3
fig, ax = plt.subplots(figsize=(30,12))
plt.rcParams.update({'font.size':17})
rects1 = ax.bar(ind - width/2, percfemale, width, 
                color='DarkKhaki', label='Percent of Female Employees')
ax.set_ylabel('Percent of employees',)
ax.set_title('Employees by Gender')
ax.set_xticks(ind)
ax.set_xticklabels(x,rotation=45)
ax.legend()
plt.show()

In [31]:
%matplotlib inline

def get_attribute_count_by_value(attribute_name, value, df):
    return df.loc[df[attribute_name] == value, 'count'].sum()

plt.rc('figure', figsize=(50, 50))
font_options = {'family' : 'sans-serif','weight' : 'normal','size' : 36}
plt.rc('font', **font_options)

unique_companies = reveal_modified['company'].unique()
df2 = pd.DataFrame(columns=['percent of males', 'percent of females'], index=unique_companies)

for company in unique_companies:
    
    new_df = reveal_modified[reveal_modified['company'] == company]
    
    num_of_males = get_attribute_count_by_value('gender', 'male', new_df)
    num_of_females = get_attribute_count_by_value('gender', 'female', new_df)
    total = num_of_males + num_of_females
    
    percentage_of_males = float(num_of_males/total)*100
    percentage_of_females = float(num_of_females/total)*100
    
    df2.loc[company] = [percentage_of_males, percentage_of_females]
    
df2.sort_values(['percent of females'], axis=0, ascending=False, inplace=True)
ax = df2.plot.barh(alpha=0.75, colormap='Wistia')
ax.set(xlabel = "Percentage", ylabel = "Company", title = "Percentage of workforce by gender")
for p in ax.patches:
    ax.annotate(np.round(p.get_width(),decimals=2), \
                (p.get_x() + p.get_width(), p.get_y()), \
                ha='left', va='center', xytext=(0, 10), \
                textcoords='offset points') 
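
The per-company loop above can also be written as one grouped aggregation. This is a sketch rather than a drop-in replacement: it assumes a frame with `company`, `gender`, and numeric `count` columns, and both the helper name `gender_share` and the toy data are illustrative.

```python
import pandas as pd

def gender_share(df):
    """Percent of each company's head count by gender (each row sums to 100)."""
    counts = df.pivot_table(index="company", columns="gender",
                            values="count", aggfunc="sum", fill_value=0)
    # Divide every row by its own total, then scale to percent
    return counts.div(counts.sum(axis=1), axis=0) * 100

# Toy data (made up): company A has two male rows that must be summed
toy = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B"],
    "gender": ["male", "female", "male", "male", "female"],
    "count": [60, 30, 10, 50, 50],
})
shares = gender_share(toy).sort_values("female", ascending=False)
print(shares.round(2))
```

The result has one row per company and one column per gender, so it can be passed to `plot.barh` in the same way as `df2`.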

In [32]:
d=reveal_modified.groupby(['gender','company']).agg({'count':sum}).reset_index()
trace1 = go.Bar(
    x=d[d.gender=='male']['company'],
    y=d[d.gender=='male']['count'],
    name='Males',
    marker=dict(
        color='rgb(153,0,0)'
    )
)
trace2 = go.Bar(
    x=d[d.gender=='female']['company'],
    y=d[d.gender=='female']['count'],
    name='Females',
    marker=dict(
        color='rgb(255,255,204)'
    )
)
data = [trace1, trace2]
layout = go.Layout(template =  "plotly_dark",barmode='group',title='Distribution of Male and Female Employees by Company')


fig = dict(data=data, layout=layout)
iplot(fig, filename='Distribution of Male and Female Employees by Company')

In [33]:
logfem = np.log(fem)
logmale = np.log(male)
plt.rcParams.update({'font.size':15})
plt.figure(figsize=(20,10))
plt.scatter(logfem, logmale, alpha=0.5)
for i in range(len(x)):
    # fem and male are indexed by company name, so use positional access
    xy=(logfem.iloc[i],logmale.iloc[i])
    plt.annotate(x[i],xy)
plt.plot([0, 11], [0, 11], color = 'b')
plt.xlim(4,11)
plt.ylim(4,11) 
plt.ylabel("Male employees (natural log of head count)")
plt.xlabel("Female employees (natural log of head count)")
plt.title("Distribution of Male and Female Employees")
plt.show()

In [34]:
d=reveal_modified.groupby(['company','gender']).agg({'count':sum})
d=d.unstack()
d=d['count']
d=np.round(d.iloc[:,:].apply(lambda x: (x/x.sum())*100,axis=1))
d['Ratio']=np.round(d['male']/d['female'],2)
d.sort_values(by='Ratio',inplace=True,ascending=False)
d.columns=['Female %','Male %','Ratio']

trace1 = go.Bar(
    y=d['Female %'],
    x=d.index,
    text=d['Female %'],
    textposition='auto',
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)
    
trace2= go.Bar(
    y=d['Male %'],
    x=d.index,
    text=d['Male %'],
    textposition='auto',
    marker=dict(
        color='rgb(161,217,155)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data = [trace1, trace2]
#stacks the data on top of each other 
layout = go.Layout(
    barmode = 'stack'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename="Normalize Percentage of male to female data")

***3. Exploring Race Data***

In [35]:
totals = reveal[reveal['job_category']=='Totals'].copy()
totals['count'] = totals['count'].astype(dtype='int32')
# Average head count per race; selecting 'count' avoids averaging text columns,
# which raises an error in recent pandas versions
gg = (totals.groupby("race")['count'].mean()
            .drop("Overall_totals")   # drop the aggregate row
            .rename_axis("races")
            .reset_index())

fig,ax=plt.subplots(figsize=(10,5))
b = sns.barplot(x="races",y="count",data=gg,ax=ax, color = 'navy')
b.set_xlabel("races",fontsize=10)
b.set_ylabel("count",fontsize=10)
b.tick_params(labelsize=15, rotation=90)
plt.show()

In [36]:
%matplotlib inline
font_options = {'family' : 'sans-serif','weight' : 'normal','size' : 14}
plt.rc('font', **font_options)

unique_companies = reveal_modified['company'].unique()
unique_races = reveal_modified['race'].unique()

df3 = pd.DataFrame(columns=unique_races, index=unique_companies)

for company in unique_companies:
    company_df = reveal_modified[reveal_modified['company'] == company]
    total_for_company = company_df['count'].sum()
    
    race_percents = []
    for race in unique_races:
        percent = float(company_df.loc[company_df['race'] == race, 'count'].sum() / total_for_company)*100
        race_percents.append(percent)
    
    df3.loc[company] = race_percents

for index, row in df3.iterrows():
    
    ax = row.plot.barh(alpha=0.75,color=['DarkOrange', 'OrangeRed', 'Tomato', 'Coral', 'LightSalmon', 'Salmon','IndianRed'])
    ax.set(xlabel = "Percentage of workforce", \
           ylabel = "Races", \
           title = index)
    legend = ax.legend(loc='upper right', borderpad=1, labelspacing=1)
    legend = legend.remove()

    for p in ax.patches:
        ax.annotate(np.round(p.get_width(),decimals=2), \
                    (p.get_x() + p.get_width(), p.get_y()), \
                    ha='left', va='center', xytext=(0, 5), \
                    textcoords='offset points')
    plt.show()

In [37]:
data = reveal_modified.drop(columns=['company','year'])
data.head()

In [38]:
profession = ['Managers', 'Executives']
profession = data[data['job_category'].isin(profession)]
profession.head()

In [39]:
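# kind="count" tallies rows rather than employees, so sum the head counts first
race_totals = profession.groupby(['race', 'job_category'], as_index=False)['count'].sum()
sns.catplot(x="race", y="count", hue="job_category", kind="bar",
            palette="pastel", edgecolor=".6",
            data=race_totals, height=8, aspect=18/7)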

In [40]:
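# kind="count" tallies rows rather than employees, so sum the head counts first
gender_totals = profession.groupby(['gender', 'job_category'], as_index=False)['count'].sum()
sns.catplot(x="gender", y="count", hue="job_category", kind="bar",
            palette="pastel", edgecolor=".6",
            data=gender_totals)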

***4. Overall Race and Gender Numbers***

In [41]:
data = reveal_modified.drop(columns=['company','year'])
encoded_data = pd.get_dummies(data, dtype=int)  # 0/1 dummies so the [1] lookups below select the matching group
encoded_data.head()

In [47]:
total = encoded_data['count'].sum()
totalMales = encoded_data.groupby('gender_male')['count'].sum()[1]
totalFemales = encoded_data.groupby('gender_female')['count'].sum()[1]

totalExecs = encoded_data.groupby('job_category_Executives')['count'].sum()[1]
totalManagers = encoded_data.groupby('job_category_Managers')['count'].sum()[1]
totalSeniorPositions = totalExecs + totalManagers

maleExecs = encoded_data.groupby(['job_category_Executives','gender_male'])['count'].sum()[1][1]
maleManagers = encoded_data.groupby(['job_category_Managers','gender_male'])['count'].sum()[1][1]
totalMaleSeniorPositions = maleExecs + maleManagers

totalWhite = encoded_data.groupby('race_White')['count'].sum()[1]
totalNonWhite = total-totalWhite

whiteExecs = encoded_data.groupby(['job_category_Executives','race_White'])['count'].sum()[1][1]
whiteManagers = encoded_data.groupby(['job_category_Managers','race_White'])['count'].sum()[1][1]
whiteTop = whiteExecs + whiteManagers
whiteMaleExecs = encoded_data.groupby(['job_category_Executives','race_White','gender_male'])['count'].sum()[1][1][1]
whiteMaleManagers = encoded_data.groupby(['job_category_Managers','race_White','gender_male'])['count'].sum()[1][1][1]
whiteMaletop = whiteMaleExecs + whiteMaleManagers

print("Overall Statistics")
print("\nThere are a total of", total, "people working at the top 23 companies in Silicon Valley.\n\nOf these", totalMales,
      "or", (totalMales/total)*100,"% are male.\n\nAnd", totalFemales,"or", 
      (totalFemales/total)*100,"% are female\n\n")
print("Of the entire workforce", totalWhite, "or", (totalWhite/total)*100,"% are white\n\n")

print("Senior Position Distribution",
      "\n\nThere are",totalSeniorPositions,"Executive or Manager positions.\n\nOf these", 
      totalMaleSeniorPositions, "or", (totalMaleSeniorPositions/totalSeniorPositions)*100, "% are males.")

print("\n\nOf all senior positions", whiteTop, "or",(whiteTop/totalSeniorPositions)*100,"% are white"
      "\n\nand", whiteMaletop,"or",(whiteMaletop/totalSeniorPositions)*100,"% are both white and male") 

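The chained group-by lookups in the previous cell are easy to get wrong, because each statistic repeats several column names. One possible alternative, sketched here on made-up rows, is to work on the un-encoded frame with boolean masks. The helper `share` and the toy data are hypothetical; the column names follow the dataset description.

```python
import pandas as pd

def share(df, mask, within=None):
    """Percent of head count matching `mask`, optionally within a subset."""
    base = df if within is None else df[within]
    matched = df[mask] if within is None else df[mask & within]
    return matched["count"].sum() / base["count"].sum() * 100

# Toy rows (made up) with the same columns as the cleaned data
toy = pd.DataFrame({
    "race": ["White", "White", "Asian", "Asian", "White", "Asian"],
    "gender": ["male", "female", "male", "female", "male", "female"],
    "job_category": ["Executives", "Executives", "Managers",
                     "Managers", "Professionals", "Professionals"],
    "count": [6, 2, 1, 1, 40, 50],
})
senior = toy["job_category"].isin(["Executives", "Managers"])
white_male = (toy["race"] == "White") & (toy["gender"] == "male")

male_pct = share(toy, toy["gender"] == "male")               # males in the whole workforce
white_male_senior_pct = share(toy, white_male, within=senior)  # white males among senior roles
print(male_pct, white_male_senior_pct)
```

Each statistic is then a single call with a named mask, which makes it easier to check that the numerator and denominator refer to the intended groups.
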
**<font size="3.5">Conclusion**

In the past several years, Silicon Valley has begun to grapple with these problems, or at least to quantify them. In 2014, Google released data on the number of women and minorities it employed. Other companies followed, including LinkedIn, Yahoo, Facebook, Twitter, Pinterest, eBay, and Apple. The numbers were not good, and neither was the resulting news coverage, but the companies pledged to spend hundreds of millions of dollars changing their work climates, altering the composition of their leadership, and refining their hiring practices.