# Veteran Suicide Prevention - Preprocessing & Exploration Scratchpad

In [1]:
import pandas as pd
import numpy as np
import sklearn.preprocessing
import sklearn.model_selection
import sklearn.impute
from sklearn.model_selection import train_test_split

%matplotlib inline
from matplotlib import cm
import matplotlib.pyplot as plt
import seaborn as sns

import suicide_acquire


SyntaxError: invalid syntax (suicide_acquire.py, line 300)

In [None]:
age_adjusted_df = suicide_acquire.age_adjusted()

age_adjusted_df.head()

In [None]:
age_adjusted_df.info()

In [None]:
age_adjusted_df.index

In [None]:
age_adjusted_df = age_adjusted_df.set_index("year")

**Because we're dealing with discrete variables (definite numbers: no one counted could be considered 'partly' or 'in the process of' suicide; they either committed it or they didn't), we can take a look at the heatmap.**

In [None]:
nums = sns.relplot(x="total_vet_suicides", y="male_suicides", kind="line", data=age_adjusted_df);

nums

In [None]:
# adding column that gives percentage of total suicides that are male

age_adjusted_df["suicide_%_thats_male"] = (age_adjusted_df.male_suicides / age_adjusted_df.total_vet_suicides)*100

In [None]:
age_adjusted_df.head()

In [None]:
pct = sns.relplot(x="total_vet_suicides", y="suicide_%_thats_male", kind="line", data=age_adjusted_df);

pct

**^^It seems that even though male veteran suicide numbers increase as time moves forward, the male share of all veteran suicides fluctuates considerably.**

- So, what do the numbers vs the percentages look like?
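
One way to approach that question is to line up the raw counts and the male share by year and compare how each changes. The sketch below uses made-up numbers, not the project's data; the `toy` frame and its values are assumptions for illustration, and only the column names mirror `age_adjusted_df`.

```python
import pandas as pd

# Hypothetical toy data, NOT the project's figures; column names mirror age_adjusted_df.
toy = pd.DataFrame(
    {
        "year": [2005, 2006, 2007, 2008],
        "total_vet_suicides": [100, 110, 120, 125],
        "male_suicides": [96, 104, 116, 119],
    }
).set_index("year")

# Same calculation as the notebook cell above: male share of all veteran suicides.
toy["suicide_%_thats_male"] = toy["male_suicides"] / toy["total_vet_suicides"] * 100

# Year-over-year change for the count and for the percentage, side by side.
comparison = pd.DataFrame(
    {
        "male_suicides_change": toy["male_suicides"].diff(),
        "male_share_change_pts": toy["suicide_%_thats_male"].diff(),
    }
)
print(comparison.round(2))
```

In this toy example the count rises every year while the share moves both up and down, which is the pattern noted above.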

**Next, checking to see what female age-adjusted rate looks like:**

In [None]:
pd.crosstab(age_adjusted_df.female_age_adjusted_rate_per_100K, age_adjusted_df.index,
            margins=True).style.background_gradient(cmap="PuBuGn")

**^^Similarly, female age-adjusted rates increase over time; however, the rates for women seem to show more periodic reductions.**

In [None]:
sns.relplot(x="total_vet_suicides", y="female_suicides", kind="line", data=age_adjusted_df);

### Walkthrough of Age-Adjusted Suicide Rates

The age distribution of a population greatly affects its mortality rate.  In calculating the age-adjusted rate for this project, age groups (18-35, 36-55, etc.) were created and the total number of veterans in each age group was counted.

Within each age group, the number of suicides was counted, and to get the age-adjusted suicide rate, that number was divided by the total population of that group.

**For example:**

>You have 100 veterans between the ages of 20 and 25.  Of those 100 veterans, 3 committed suicide.
    
>The age-adjusted suicide rate for that group of 20-25 year-old veterans = 3 / 100 or 3%
    
Again, the purpose of the age-adjusted rate is to narrow down which specific age groups are most prone to committing suicide.  While the crude rate helps us understand suicide's devastating effects across the entire veteran population, crude-rate data has been excluded from our datasets in favor of the more specific age-adjusted rate.
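
The arithmetic in the walkthrough can be sketched in pandas. The numbers below are hypothetical and are not taken from the project's datasets. The first part reproduces the per-group division described above (3 / 100 = 3%). The last step is an assumption on my part: it shows direct standardization, the conventional way per-group rates are combined into a single age-adjusted figure using standard-population weights. The source does not state which weights, if any, were used.

```python
import pandas as pd

# Hypothetical numbers for illustration only; not the project's data.
groups = pd.DataFrame(
    {
        "age_group": ["18-35", "36-55", "56+"],
        "veterans": [100, 400, 500],          # population in each age group
        "suicides": [3, 4, 5],                # suicides counted in each age group
        "standard_weight": [0.3, 0.4, 0.3],   # assumed standard-population shares (sum to 1)
    }
).set_index("age_group")

# Per-group rate as described above: suicides divided by the group's population.
groups["rate"] = groups["suicides"] / groups["veterans"]
groups["rate_per_100K"] = groups["rate"] * 100_000

# Crude rate: all suicides over the whole population, ignoring age structure.
crude_per_100k = groups["suicides"].sum() / groups["veterans"].sum() * 100_000

# Direct standardization (assumed method): weight each group's rate by its
# share of a standard population, then sum.
adjusted_per_100k = (groups["rate"] * groups["standard_weight"]).sum() * 100_000

print(groups[["rate", "rate_per_100K"]])
print("crude per 100K:", round(crude_per_100k, 1))
print("age-adjusted per 100K:", round(adjusted_per_100k, 1))
```

The crude and adjusted figures differ here because the toy population is older than the assumed standard population, which is the effect age adjustment is meant to remove.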

In [None]:
age_group_df = suicide_acquire.age_group_df()

age_group_df.head()

In [None]:
age_group_df.info()

In [None]:
# age_group_df = age_group_df.apply(pd.to_numeric, errors='coerce')

In [None]:
age_group_df.info()
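
The commented-out cell above would coerce every column to a numeric dtype. Here is a minimal sketch of what `errors='coerce'` does, using a hypothetical frame. The column names and placeholder strings are assumptions, not the actual contents of `age_group_df`.

```python
import pandas as pd

# Hypothetical raw frame: numbers stored as strings, with a placeholder for missing data.
raw = pd.DataFrame(
    {
        "year": ["2005", "2006", "2007"],
        "suicides_18_34": ["1,023", "--", "987"],
    }
)

# Strip thousands separators first; otherwise "1,023" would be coerced to NaN as well.
cleaned = raw.apply(lambda col: col.str.replace(",", "", regex=False))

# errors="coerce" turns anything unparseable (here "--") into NaN instead of raising.
numeric = cleaned.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric.isna().sum())
```

Comparing `isna().sum()` before and after coercion is a quick way to see how many values were silently turned into `NaN`.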

In [None]:
recent_vha_user_df = suicide_acquire.recent_vha_user()

recent_vha_user_df.head()

In [None]:
recent_vha_user_df.info()

In [None]:
vha_by_age_group_df = suicide_acquire.vha_by_age_group()

vha_by_age_group_df.head()

In [None]:
vha_by_age_group_df.info()

In [None]:
non_vha_user_df = suicide_acquire.non_vha_user()

non_vha_user_df.head()

In [None]:
non_vha_user_df.info()

In [None]:
non_vha_by_age_df = suicide_acquire.non_vha_by_age()

non_vha_by_age_df.head()

In [None]:
non_vha_by_age_df.info()

In [None]:
# "year" became the index of age_adjusted_df earlier, so use .index rather than .year
pd.crosstab(non_vha_by_age_df.non_vha_veteran_crude_per_100K, age_adjusted_df.index,
            margins=True).style.background_gradient(cmap="PuBuGn")

### Preprocessing functions that follow:

- Because the datasets are divided in a similar way into train, validate, and test portions, the 'behind the scenes' work of preprocessing the datasets for exploration takes place in the accompanying '