# Suicide Rates Overview 1985 to 2016

This dataset was posted on Kaggle.com and gives an overview of suicide rates per year and country. This notebook goes through data cleaning, exploratory analysis, and data visualization. Before we proceed, let's import the necessary libraries and load the dataset into a pandas DataFrame.

### Import Libraries and Load Dataset

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
suicide_df = pd.read_csv("master.csv")

# Dataset Description

To further understand the dataset, let's check its variables and number of observations by invoking the `info()` function.

In [None]:
suicide_df.info()

Upon checking the information above, there are **27820 observations** and **12 variables**. This is enough data for a meaningful analysis. Let's look at the first and last observations by invoking the `head()` and `tail()` methods.

In [None]:
suicide_df.head()

In [None]:
suicide_df.tail()

# Data Cleaning

We will now check for any null, erroneous, or outlying values within our dataset. This helps keep our results accurate. First, let's check for null values within each variable.

In [None]:
suicide_df.isnull().sum()
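
Raw null counts can be easier to judge as a share of all rows. The sketch below is one way to do that; `missing_share` is a hypothetical helper, and `sample` is a tiny made-up frame standing in for `suicide_df`, not real data.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of null values per column, largest first."""
    # isnull() gives booleans; their mean is the share of nulls per column
    return df.isnull().mean().sort_values(ascending=False)


# Made-up stand-in for suicide_df (illustrative values only)
sample = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "HDI for year": [None, 0.7, None, None],
})

shares = missing_share(sample)
print(shares)
```

On the real data, `missing_share(suicide_df)` would show how much of `HDI for year` is missing relative to the other columns.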

Upon checking, `HDI for year` is the only variable with a significant number of null values. Too many observations are affected to drop them without skewing the results, so we will leave the column alone for now. Next, let's check the unique values for each variable.

In [None]:
for column in suicide_df.columns:
    print("\nUnique values for column: '{}'".format(column))
    print(suicide_df[column].unique())

Upon checking the unique values per column, there seems to be no need to remap or drop any erroneous or odd values. Next, let's check whether there are any duplicate observations in the dataset.

In [None]:
suicide_df.duplicated(subset=None, keep='first').sum()

The code returned a value of 0, meaning that there are no duplicate observations in the dataset.

# Exploratory Data Analysis

Let's answer a few exploratory data analysis questions before proceeding to data visualizations.
- Which country had the highest number of suicides in a year?
- Which age group had the most suicides?

### Question 1

To answer the first question, we look at the `suicides_no`, `country`, and `year` variables. We locate the observation with the maximum `suicides_no` using `idxmax()` and inspect that row.

In [None]:
country_suicides = suicide_df.loc[suicide_df['suicides_no'].idxmax()]
country_suicides

Based on the Series returned, the largest single count belongs to the Russian Federation in 1994, among males aged 35-54. Each observation covers one sex and one age group, so this is the largest group-level count rather than a country-wide yearly total, but it suggests that the Russian Federation in 1994 is the answer.
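Because each row holds only one sex and age group, a country-wide answer needs the rows summed per country and year first. The sketch below shows one way to do this; `top_country_year` is a hypothetical helper and `sample` is made-up data chosen so that the largest single row and the largest yearly total differ.

```python
import pandas as pd


def top_country_year(df: pd.DataFrame):
    """Return (country, year, total) for the highest yearly total."""
    # Sum all sex/age rows belonging to the same country and year
    totals = df.groupby(["country", "year"])["suicides_no"].sum()
    # idxmax on a MultiIndex Series returns a (country, year) tuple
    country, year = totals.idxmax()
    return country, int(year), int(totals.max())


# Made-up stand-in for suicide_df (illustrative values only)
sample = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2000, 2000, 2000],
    "sex": ["male", "female", "male", "female"],
    "suicides_no": [10, 5, 12, 1],
})

result = top_country_year(sample)
print(result)
```

In this sample, country B has the largest single row, but country A has the largest yearly total, which is why the grouped version is the safer check.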

### Question 2

Let's answer the second question. We need to group our data by `age` and sum the `suicides_no` variable per age group.

In [None]:
age_group_suicides = suicide_df.groupby("age").suicides_no.sum()

ax = age_group_suicides.plot.bar(title="Number of Suicides per Age Group", figsize=(20,10))

plt.xticks(rotation=0)

# Label each bar with its total, as a whole number with thousands separators
for p in ax.patches:
    ax.annotate("{:,.0f}".format(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

Based on the bar chart above, the 35-54 age group has the highest number of suicides, with a total of 2,452,141. Most observations in this age group are labeled Generation X.
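One detail worth noting: `groupby("age")` sorts the age labels as strings, so a label starting with "5" lands after labels starting with "1", "2", or "3". The sketch below reorders the totals by age; the labels in `AGE_ORDER` are an assumption about how the dataset spells its age groups, and `suicides_by_age` and `sample` are hypothetical.

```python
import pandas as pd

# Assumed spelling of the age labels, youngest to oldest
AGE_ORDER = ["5-14 years", "15-24 years", "25-34 years",
             "35-54 years", "55-74 years", "75+ years"]


def suicides_by_age(df: pd.DataFrame, order=AGE_ORDER) -> pd.Series:
    """Sum suicides per age group and return them in the given order."""
    totals = df.groupby("age")["suicides_no"].sum()
    # reindex rearranges the rows; labels missing from the data become NaN
    return totals.reindex(order)


# Made-up stand-in for suicide_df (illustrative values only)
sample = pd.DataFrame({
    "age": ["5-14 years", "35-54 years", "15-24 years", "35-54 years"],
    "suicides_no": [1, 10, 4, 6],
})

ordered = suicides_by_age(sample, order=["5-14 years", "15-24 years", "35-54 years"])
print(ordered)
```

Calling `.plot.bar()` on the reordered Series would then draw the bars from youngest to oldest.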

***Note on an apparent discrepancy: the observation returned in Question 1 labels the 35-54 age group as Boomers, while the majority of other observations for the same age group label it Generation X. This is likely because `generation` depends on both the age group and the year, so the same age band maps to different generations over time.***
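To see how the generation label of one age group changes over time, the years covered by each label can be listed. This is a minimal sketch; `generation_span` is a hypothetical helper, and the years and labels in `sample` are made up for illustration rather than taken from the dataset.

```python
import pandas as pd


def generation_span(df: pd.DataFrame, age_group: str) -> pd.DataFrame:
    """Return the first and last year each generation label appears for one age group."""
    subset = df[df["age"] == age_group]
    return subset.groupby("generation")["year"].agg(["min", "max"])


# Made-up stand-in for suicide_df (illustrative values only)
sample = pd.DataFrame({
    "age": ["35-54 years"] * 4 + ["15-24 years"],
    "year": [1994, 2000, 2005, 2015, 2015],
    "generation": ["Boomers", "Boomers", "Generation X", "Generation X", "Millenials"],
})

span = generation_span(sample, "35-54 years")
print(span)
```

Running `generation_span(suicide_df, "35-54 years")` on the real data would show whether the Boomers and Generation X labels cover separate year ranges.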

# Data Visualization

Let's visualize the number of suicides per year in each generation. We want to analyze the trend over the span of 1985 to 2016. Since there are 6 generations, we will draw 6 lines on the same axes and plot them over time.

In [None]:
# Generation labels as spelled in the dataset
generations = ["Generation X", "Silent", "G.I. Generation", "Boomers", "Millenials", "Generation Z"]

# Sum suicides per year and generation; rows are years, columns are generations
data = (
    suicide_df.groupby(["year", "generation"])["suicides_no"]
    .sum()
    .unstack("generation")
    .reindex(columns=generations)
)

# The year index supplies the x-axis ticks, so no manual tick labels are needed
ax = data.plot.line(title="Number of Suicides per Generation", figsize=(20, 10))
ax.set_xlabel("Year")
ax.set_ylabel("Number of suicides")
ax.legend(title="Generation")

plt.show()

As seen from the graph above, suicides spiked among Boomers during the 1990s and 2000s, and among Generation X around 2015. Some generations, such as Generation Z, have counts too low for the graph to show a significant trend. A significant amount of data is also missing for some years, which produces NaN values for those parts of the trend and shows up as flat or missing segments in the plot.
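The gaps can be checked directly by listing the years in which a generation has no observations at all. This is a minimal sketch; `years_without_data` is a hypothetical helper and `sample` is made-up data.

```python
import pandas as pd


def years_without_data(df: pd.DataFrame) -> dict:
    """Map each generation to the years in which it has no observations."""
    table = (
        df.groupby(["year", "generation"])["suicides_no"]
        .sum()
        .unstack("generation")
    )
    # A NaN cell means no rows existed for that year and generation
    return {
        gen: table.index[table[gen].isna().to_numpy()].tolist()
        for gen in table.columns
    }


# Made-up stand-in for suicide_df (illustrative values only)
sample = pd.DataFrame({
    "year": [2000, 2001, 2001],
    "generation": ["Generation X", "Generation X", "Generation Z"],
    "suicides_no": [5, 7, 1],
})

gaps = years_without_data(sample)
print(gaps)
```

Only years that appear somewhere in the data are considered, so a year missing for every generation would not be reported by this check.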

# Conclusion

As seen from the graph above, changes in **socio-cultural norms** may have affected the number of suicides in each generation. Other factors, such as **socio-economic crises** (economic depression, war, etc.) during that period, may also have increased the number of suicides worldwide. One possible reading is that society before the 2000s was more unforgiving, as suggested by the sudden rise in suicides, which then declined and held at a steadier level after the turn of the millennium.