# Data Exploration


First we imported the data from our data collection notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from altair import *
%matplotlib inline
import warnings; warnings.simplefilter('ignore')

final = pd.read_csv("~/data301/share/Brooke_Michal/final_data.csv")
final.head()

We created a row in the data frame to represent the numeric data for all countries (using the placeholder string "Nan" for categorical variables) and then appended this row to the final data frame. We filled in this row with the mean of each column because all columns were either a percentage or a proportion, so the mean was the most appropriate statistic to represent each column across all countries.

In [None]:
final["Pop. Density (per sq. mi.)"] = final["Pop. Density (per sq. mi.)"].astype(float)
# Average every numeric column across countries; numeric_only skips the text columns
total = final.mean(numeric_only=True)
df = pd.DataFrame(total).transpose()
df["Country"] = "Total"
df["Government Type"] = "Nan"
df["Religion"] = "Nan"
df["Country Code"] = "Nan"
df["Region"] = "Nan"
df["IncomeGroup"] = "Nan"
final_df = pd.concat([final, df])
final_df = final_df.set_index("Country")
final_df.tail()
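
As a small illustration of the same idea, the sketch below builds a "Total" row on a toy data frame. The toy values and the names `toy`, `summary`, and `toy_with_total` are made up for this example and are not part of our data set. It also leaves the categorical column as a real missing value rather than the string "Nan", which is one possible alternative to what we did above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; values and country names are illustrative only.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Religion": ["X", "Y", "X"],
    "2016_P": [10.0, 20.0, np.nan],
    "2016_M": [30.0, np.nan, 50.0],
})

# Column means over numeric columns only; missing entries are skipped by default.
summary = toy.mean(numeric_only=True).to_frame().T
summary["Country"] = "Total"

# concat fills columns absent from `summary` (here "Religion") with NaN.
toy_with_total = pd.concat([toy, summary], ignore_index=True).set_index("Country")
print(toy_with_total)
```

With real missing values, `isna()` recognizes the categorical gaps in the "Total" row, whereas the string "Nan" has to be filtered out by comparison.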

**How are management and parliament related?**

In [None]:
Chart(final_df).mark_circle(clip=True).encode(
    x=X("2016_P", scale=Scale(domain=(0, 60))),
    y=Y("2016_M",scale=Scale(domain=(0, 50)))
)


The graph above represents the relationship between women in management and women in government, where each dot represents a country (only countries that had values for both variables appear on the plot). The scatter plot reveals that there is not a strong association between these variables. This was surprising to us, so we also wanted to look at how these variables are related in aggregate, using the new observation in our data set that represents the world.

In [None]:
dfnew = df.loc[:,"2000_P":"2016_P"].stack()
pdf = pd.DataFrame(dfnew)
dfnew2 = df.loc[:,"2000_M":"2016_M"].stack()
mdf = pd.DataFrame(dfnew2)

# Parliament goes on the x-axis and management on the y-axis, matching the axis labels below
plt.plot(pdf, mdf)
fig = plt.gcf()
ax = plt.gca()
fig.set_size_inches(12,6)
ax.set_ylabel("Women in Upper and Middle\n"
              "Level Management Positions", size = 15)
ax.set_xlabel("Women in Parliament", size = 15)
plt.title("Women Involved in Government vs\n "
          "Women Involved in the Private Sector:\n"
          "An Aggregate of All Countries", size= 20)
for txt in fig.texts:
    txt.set_visible(False)

This graph shows that, on average, there is a positive, roughly linear association between women in management and women in government. There are, however, many spikes in the graph, which could explain the weakness of the association in the scatter plot. We hypothesize that these spikes come from the changing climates of various countries, which affect the world as a whole at any given time.

**What is the relationship between all of our variables?**

In [None]:
final_df[['Government Type','Religion', 'Infant mortality (per 1000 births)', 
 'Literacy (%)','Birthrate', 'Deathrate', "2016_P", "2016_M"]].corr()

In [None]:
final_df["2016_P"].corr(final_df["2016_M"])

We wanted to highlight the correlation between women in management and women in parliament since we investigated this relationship in the graphs above. The correlation is very weak, which corresponds to what we saw in those graphs. This was actually the weakest relationship between any two variables. Now we are interested in seeing how the other variables affect women in management and government.
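
Because many countries are missing one of the two values, it can help to report how many countries actually contribute to a correlation. The sketch below uses made-up numbers (the frame `toy` is not our data) to show that `Series.corr` only uses rows where both values are present.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; not taken from our data set.
toy = pd.DataFrame({
    "2016_P": [10.0, 20.0, 30.0, 40.0, np.nan],
    "2016_M": [12.0, np.nan, 33.0, 41.0, 25.0],
})

# Series.corr drops any row where either value is missing.
r = toy["2016_P"].corr(toy["2016_M"])

# Number of rows with both values present, i.e. the effective sample size.
n_pairs = len(toy[["2016_P", "2016_M"]].dropna())
print(round(r, 3), n_pairs)
```

Reporting `n_pairs` next to `r` makes it clear when a correlation rests on only a handful of countries.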

**What is the relationship between the proportion of women in management and parliament, by religion of country?**

In [None]:
Chart(final_df[final_df["Religion"] != "Nan"]).mark_circle(clip=True).encode(
    x=X("2016_P", scale=Scale(domain=(0, 60))),
    y=Y("2016_M",scale=Scale(domain=(0, 50))),
    color="Religion"
)

This scatter plot shows how religion affects the relationship between women in management and women in parliament. We can see that the countries with the highest proportion of women in parliament are Christian countries, and the countries with the lowest proportion of women in government are Muslim and Buddhist countries. This gives us preliminary evidence that religion may be a useful feature in our model.

**What is the relationship between the proportion of women in management and parliament, by government type of country?**

In [None]:
gov_df = final_df[(
    final_df["Government Type"] == 'Constitutional Monarchy') | 
    (final_df["Government Type"] == "Absolute Monarchy") |
    (final_df["Government Type"] == "Republic")]
parliament = Chart(gov_df).mark_bar(size=10).encode(
    y="Government Type",
    x="2016_P"
)
management = Chart(gov_df).mark_bar(size=10).encode(
    y="Government Type",
    x="2016_M",
)
vconcat(parliament, management).configure_view(stroke='transparent')

This graph shows us how the government type of a country affects women's roles in that country. Republics have the most women in both management and government, followed by constitutional monarchies. Absolute monarchies have the fewest women in parliament and had no data on women in management. It is possible that this information on women in management positions is missing because absolute monarchies tend to have strict government policies and may have chosen not to release it. Therefore, we want to investigate the completeness of the management data further.
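
Depending on how the bars are aggregated, a group with more countries can look larger simply because it has more rows. As a hedged cross-check, the sketch below computes the mean and the number of non-missing values per government type on a toy frame. The numbers are invented for illustration and are not our results.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "Government Type": ["Republic", "Republic", "Republic",
                        "Constitutional Monarchy", "Absolute Monarchy"],
    "2016_P": [20.0, 30.0, 10.0, 25.0, 5.0],
    "2016_M": [30.0, np.nan, 40.0, 35.0, np.nan],
})

# Mean and count of non-missing values for each group and each variable.
summary = toy.groupby("Government Type")[["2016_P", "2016_M"]].agg(["mean", "count"])
print(summary)
```

A count of zero for a group, as for the absolute monarchy row in this toy example, signals that the group has no data at all rather than a low value.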

In [None]:
(len(final),
 len(final[final["2016_M"].isna()]), 
 len(final[final["latest_value_M"].isna()]),
 len(final[final["2018_P"].isna()]))

Looking at these numbers, we can see that out of 180 countries, 127 have no information about women in management in 2016 (the last year available in this data set). Using the latest available management value is slightly better than using only the year 2016, but a majority of countries are still missing information (104 missing instead of 127). Furthermore, the numbers reveal that we have no zeros in the management data set. The parliament data set, however, has only one country with a missing value in 2018, which makes us want to focus our exploration on parliament instead of management.
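
The same kind of completeness check can be written as a small table. The sketch below is a minimal example on a toy frame (the values are made up, and the names `toy` and `report` are ours for this illustration) that counts missing values, their percentage, and exact zeros for each column.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names mirror our data set.
toy = pd.DataFrame({
    "2016_M": [np.nan, np.nan, 30.0, np.nan],
    "2018_P": [10.0, 20.0, np.nan, 40.0],
})

# One row per column: number missing, percent missing, and number of exact zeros.
report = toy.isna().sum().to_frame("n_missing")
report["pct_missing"] = (100 * report["n_missing"] / len(toy)).round(1)
report["n_zero"] = (toy == 0).sum()
print(report)
```

Counting zeros separately helps distinguish "no women recorded" from "no data recorded".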

**We wanted to investigate the countries with the highest and lowest proportion of women in parliament to try to understand which factors contribute to this statistic.**

In [None]:
top_5 = list(final_df["2018_P"].sort_values(ascending = False).head(5).index)
top_5_df = final_df.loc[top_5]
top_5_df["2018_P"]

In [None]:
top_5_df

In [None]:
get_bottom = final_df[~final_df["2018_P"].isna()]
bottom_5 = list(get_bottom["2018_P"].sort_values().head(5).index)
bottom_5_df = final_df.loc[bottom_5]
bottom_5_df["2018_P"]

In [None]:
bottom_5_df

In [None]:
title = 'of Countries with Most and Least Women in National Parliament'
label_least='Countries with Least Women in Parliament'
label_most='Countries with Most Women in Parliament'

In [None]:
plt.bar(bottom_5_df.index, bottom_5_df["GDP ($ per capita)"])
plt.bar(top_5_df.index, top_5_df["GDP ($ per capita)"])

fig = plt.gcf()
ax = plt.gca()
fig.set_size_inches(16,6)
ax.set_ylabel("GDP", size = 15)
ax.set_xlabel("Country", size = 15)
plt.title("GDP " + title, size= 20)
blue_patch = mpatches.Patch(color='blue', label=label_least)
o_patch = mpatches.Patch(color='orange', label=label_most)
plt.legend(handles=[blue_patch, o_patch], fontsize=14)

plt.show()

This graph is interesting because initially we thought that the countries with the fewest women in parliament would have low GDP. However, this graph reveals that the individual climate of each country is very important. For example, Oman and Kuwait are both Middle Eastern countries with very high GDP because of their oil wealth, but they are also Muslim countries, which may explain why they have fewer women in government.

In [None]:
plt.bar(bottom_5_df.index, bottom_5_df["Literacy (%)"])
plt.bar(top_5_df.index, top_5_df["Literacy (%)"])

fig = plt.gcf()
ax = plt.gca()
fig.set_size_inches(16,6)
ax.set_ylabel("Literacy Rate", size = 15)
ax.set_xlabel("Country", size = 15)
plt.title("Literacy Rates " + title, size= 20)
blue_patch = mpatches.Patch(color='blue', label=label_least)
o_patch = mpatches.Patch(color='orange', label=label_most)
plt.legend(handles=[blue_patch, o_patch], fontsize=12)

plt.show()

While there is a fair amount of variation in the plot above, it does show that, on average, the countries with higher literacy rates have a higher proportion of women in government.

In [None]:
plt.bar(bottom_5_df.index, bottom_5_df["Infant mortality (per 1000 births)"])
plt.bar(top_5_df.index, top_5_df["Infant mortality (per 1000 births)"])

fig = plt.gcf()
ax = plt.gca()
fig.set_size_inches(16,6)
ax.set_ylabel("Infant Mortality (per 1000 births)", size = 15)
ax.set_xlabel("Country", size = 15)
plt.title("Infant Mortality " + title, size= 20)
blue_patch = mpatches.Patch(color='blue', label=label_least)
o_patch = mpatches.Patch(color='orange', label=label_most)
plt.legend(handles=[blue_patch, o_patch], fontsize=14)

plt.show()

Overall, the three bar plots above have a lot of variation and did not give us much insight into which factors contribute to countries having the largest and smallest proportions of women in government. What these graphs did show is that the individual climate of a country is very influential in its demographics and the proportion of women in government. For example, in the graph above we see that Rwanda has a very high infant mortality rate while also being a leading country for women in parliament. This suggests that Rwanda is a more progressive country in terms of gender equality but is still struggling to combat health epidemics, which have led to very high infant mortality.

**How does the U.S. compare to the leading 5 countries for proportion of women in parliament?**

In [None]:
final_df = final_df.reset_index()

In [None]:
us = final_df[final_df["Country"] == "United States"].set_index("Country")

In [None]:
top_and_us = pd.concat([top_5_df, us])
fig = top_and_us[["2018_P", "2017_P", "2016_P", "2015_P", "2014_P"]].plot.bar(
    figsize = (12, 6))
plt.title("Proportion of Women in Parliament of Top 5 Countries and U.S. Over the Last 5 Years", size=16)
fig.set_xlabel("Country", size=12)
fig.set_ylabel("Proportion of Women in Parliament", size=12)
for txt in fig.texts:
    txt.set_visible(False)

The graph above gives us insight into the trend countries have followed over the last 5 years for women's involvement in government. Specifically, the U.S. had its first slight increase in 2018 but overall has barely changed. However, Cuba, Mexico, and Grenada have made significant increases over the last 5 years. The other two countries already had equal representation in government, with at least half of their government being women.
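
To put a number on "barely changed" versus "significant increases", one option is to compute the change between the first and last year shown. The sketch below does this on invented values. The country labels and numbers are placeholders, not figures from our data.

```python
import pandas as pd

# Placeholder countries and values for illustration only.
toy = pd.DataFrame(
    {"2014_P": [40.0, 19.0, 60.0], "2018_P": [50.0, 19.5, 61.0]},
    index=["A", "B", "C"],
)

# Change in percentage points between 2014 and 2018, largest increase first.
change = (toy["2018_P"] - toy["2014_P"]).sort_values(ascending=False)
print(change)
```

Sorting by the change makes it easy to see which countries moved the most over the period.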

**Map of the world with color grid representing percentage of women in each country's government.**

In [None]:
import cartopy.io.shapereader as shpreader
import cartopy.crs as ccrs

shpfilename = shpreader.natural_earth(resolution='110m',
                                      category='cultural',
                                      name='admin_0_countries')
shp = shpreader.Reader(shpfilename)
world_df = pd.DataFrame(
    [record.attributes for record in shp.records()]
)
world_df = world_df.rename(columns = {"ADM0_A3":"Country Code"})

In [None]:
all_data = final.merge(world_df, 
                        how="inner", 
                        on="Country Code")
all_data.head()

In [None]:
fig, ax = plt.subplots(figsize  = (18, 6))

ax = plt.axes(
    projection=ccrs.Robinson()
)

import matplotlib as mpl
norm = mpl.colors.Normalize(vmin=all_data["2018_P"].min(), 
                            vmax=all_data["2018_P"].max())
cmap = plt.cm.YlGn

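# Look up each shape's value by country code so shapes and data rows stay aligned
values_by_code = all_data.drop_duplicates("Country Code").set_index("Country Code")["2018_P"]
for record in shp.records():
    value = values_by_code.get(record.attributes["ADM0_A3"])
    if value is not None and not pd.isnull(value):
        ax.add_geometries([record.geometry],
                          ccrs.PlateCarree(),
                          facecolor=cmap(norm(value)))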
ax.natural_earth_shp(name="lakes",
                     resolution="110m",
                     category="physical", 
                     facecolor="skyblue")
ax.natural_earth_shp(name="ocean",
                     resolution="110m",
                     category="physical", 
                     facecolor="skyblue")
ax.natural_earth_shp(name="admin_0_countries",
                     resolution="110m",
                     category="cultural", 
                     facecolor="None", edgecolor="black")

plt.title("Proportion of Women in Each Country's National Parliament (2018), By Country", size = 25)
dark = mpatches.Patch(color='darkgreen', label='Higher Proportion')
light = mpatches.Patch(color='palegoldenrod', label='Lower Proportion')
plt.legend(handles=[dark, light], fontsize=12)

plt.show()
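
The coloring step relies on two ideas: each shape has to be matched to its own country's value, and each value is scaled linearly between the minimum and maximum before being passed to the colormap. The sketch below illustrates both with plain Python. The country codes, placeholder geometries, and the helper `min_max` are hypothetical and only stand in for the real shapefile records and for matplotlib's `Normalize`.

```python
def min_max(value, vmin, vmax):
    """Scale value linearly so that vmin maps to 0.0 and vmax maps to 1.0."""
    return (value - vmin) / (vmax - vmin)

# Hypothetical shape records: (country code, placeholder geometry).
shapes = [("AAA", "geom_a"), ("BBB", "geom_b"), ("CCC", "geom_c"), ("DDD", "geom_d")]

# Hypothetical data keyed by country code, in a different order than the shapes.
# "BBB" has a missing value and "DDD" has no entry at all.
values = {"CCC": 40.0, "AAA": 10.0, "BBB": None}

present = [v for v in values.values() if v is not None]
vmin, vmax = min(present), max(present)

# Match by code rather than by position, and skip shapes without a value.
shaded = {}
for code, geometry in shapes:
    value = values.get(code)
    if value is None:
        continue
    shaded[geometry] = min_max(value, vmin, vmax)

print(shaded)
```

Matching by code keeps the result correct even when the data rows and the shapes are stored in different orders.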

In [None]:
fig, ax = plt.subplots(figsize  = (18, 6))

ax = plt.axes(
    projection=ccrs.Robinson()
)

import matplotlib as mpl
norm = mpl.colors.Normalize(vmin=all_data["latest_value_M"].min(), 
                            vmax=all_data["latest_value_M"].max())
cmap = plt.cm.RdPu

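# Look up each shape's value by country code so shapes and data rows stay aligned
values_by_code = all_data.drop_duplicates("Country Code").set_index("Country Code")["latest_value_M"]
for record in shp.records():
    value = values_by_code.get(record.attributes["ADM0_A3"])
    if value is not None and not pd.isnull(value):
        ax.add_geometries([record.geometry],
                          ccrs.PlateCarree(),
                          facecolor=cmap(norm(value)))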
ax.natural_earth_shp(name="lakes",
                     resolution="110m",
                     category="physical", 
                     facecolor="skyblue")
ax.natural_earth_shp(name="ocean",
                     resolution="110m",
                     category="physical", 
                     facecolor="skyblue")
ax.natural_earth_shp(name="admin_0_countries",
                     resolution="110m",
                     category="cultural", 
                     facecolor="None", edgecolor="black")

plt.title("Proportion of Women in Management Positions, By Country", size=25)

dark = mpatches.Patch(color='mediumvioletred', label='Higher Proportion')
light = mpatches.Patch(color='lightpink', label='Lower Proportion')
plt.legend(handles=[dark, light], fontsize=12)

plt.show()

In [None]:
all_data_bottom_5 = all_data.merge(bottom_5_df)

# Signed decimal degrees: southern latitudes and western longitudes are negative
lat = pd.DataFrame([18.9712, 29.3759, 23.5859, -9.4438, -17.7333])

long = pd.DataFrame([-72.2852, 47.9774, 58.4059, 147.1803, 168.3273])

all_data_bottom_5["latitude"] = lat
all_data_bottom_5["longitude"] = long
all_data_bottom_5[["longitude", "latitude"]]

In [None]:
# One of the original top 5 countries is missing from world_df, so we take the
# top 6 here and let the merge below drop the country that has no match
top_6 = list(final_df["2018_P"].sort_values(ascending = False).head(6).index)
top_6_df = final_df.loc[top_6]

In [None]:
all_data_top_5 = all_data.merge(top_6_df)
# Signed decimal degrees: southern latitudes and western longitudes are negative
lat = pd.DataFrame([-19.0196, 23.1136, 19.4326, -22.5609, -1.9706])

long = pd.DataFrame([-65.2620, -82.3666, -99.1332, 17.0658, 30.1044])

all_data_top_5["latitude"] = lat
all_data_top_5["longitude"] = long
all_data_top_5[["longitude", "latitude"]]
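
Coordinates are often published with a hemisphere letter (for example "23.1136° N, 82.3666° W"), while plotting code expects signed decimal degrees. The helper below, `signed_degrees`, is a hypothetical convenience function written for this illustration. It shows one way to apply the sign convention consistently when typing coordinates by hand.

```python
def signed_degrees(value, hemisphere):
    """Return decimal degrees with the sign implied by the hemisphere letter.

    North and East are positive; South and West are negative.
    """
    hemisphere = hemisphere.upper()
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError("hemisphere must be one of N, S, E, W")
    magnitude = abs(value)
    return -magnitude if hemisphere in {"S", "W"} else magnitude

# Example: a point at 23.1136 degrees North, 82.3666 degrees West.
latitude = signed_degrees(23.1136, "N")
longitude = signed_degrees(82.3666, "W")
print(latitude, longitude)
```

Passing the hemisphere explicitly makes a dropped minus sign much harder to miss than when the signs are typed directly.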

In [None]:
fig, ax = plt.subplots(figsize  = (18, 6))
ax = plt.axes(projection=ccrs.Robinson())
ax.stock_img()

all_data_top_5.plot.scatter(ax=ax,
                    x="longitude", y="latitude",
                    c="orange", s= 60,
                    transform=ccrs.Geodetic())
all_data_bottom_5.plot.scatter(ax=ax,
                    x="longitude", y="latitude",
                    c="fuchsia", s = 60,
                    transform=ccrs.Geodetic())

ax.natural_earth_shp(name="admin_0_countries",
                     resolution="110m",
                     category="cultural", 
                     facecolor="None", edgecolor="black")

plt.title("Countries with Most and Least Women in National Parliament", size=25)

dark = mpatches.Patch(color='orange', label='Higher Proportion')
light = mpatches.Patch(color='fuchsia', label='Lower Proportion')
plt.legend(handles=[dark, light], fontsize=12)

plt.show()