<font size="5">**Data preprocessing**</font>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

In [None]:
df = pd.read_csv("../input/master.csv")
df.head()

In [None]:
df.shape

In [None]:
## Summarizing the data
df.describe()

<font size="2">There are missing values in our dataset (HDI for year). Let's check how many</font>

In [None]:
df.count()

<font size="2">Since HDI for year is continuous, we can fill those missing values with the column mean</font>
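
<font size="2">As a rough illustration of mean imputation, here is a minimal sketch on a made-up table. The `toy` frame and its `hdi` column are assumptions for the example, not part of the dataset.</font>

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration) with one gap in a continuous column
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "hdi": [0.70, np.nan, 0.90, 0.80],
})

# Number of missing values per column before filling
missing_before = toy.isna().sum()

# Fill gaps in numeric columns with the mean of each column
toy_filled = toy.fillna(toy.mean(numeric_only=True))
missing_after = toy_filled.isna().sum()

print(missing_before)
print(toy_filled)
```

<font size="2">Keep in mind that when most of a column is missing, mean imputation makes that column nearly constant, which weakens any correlation computed from it.</font>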

In [None]:
## Only numeric columns have a mean, so restrict the computation to them
df.fillna(df.mean(numeric_only=True), inplace=True)

## We don't need the column "country-year", so we'll just drop it
df.drop("country-year", axis=1, inplace=True)
df.head()

In [None]:
df.count()

<font size="2">As we can see, there are no missing values anymore. Now we need to check the type of our data</font>

In [None]:
df.dtypes

In [None]:
df.select_dtypes(include="object").columns

In [None]:
## Renaming some columns for better interpretation
df.rename(columns={" gdp_for_year ($) ": "gdp_for_year",
                   "gdp_per_capita ($)": "gdp_per_capita"}, inplace=True)
df.head()

In [None]:
df.shape

In [None]:
## Turning object types into category and integer types
df[["country","age","sex","generation"]] = df[["country","age","sex","generation"]].astype("category")
## Converting number strings with commas into integer
df['gdp_for_year'] = df['gdp_for_year'].str.replace(",", "").astype("int64")
df.info()

<font size="5">**Data visualization**, because a picture is worth a thousand words :)</font>

In [None]:
sns.set(style='whitegrid')
## Total number of suicides per year
ns = df.groupby('year')['suicides_no'].sum()
ns.plot(figsize=(10,8), linewidth=2, fontsize=15,color='black')
plt.xlabel('year', fontsize=15)
plt.ylabel('suicides_no',fontsize=15)

<font size="2">According to this plot, the number of suicides has been decreasing overall</font>
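
<font size="2">A per-year trend depends on how the rows are aggregated: `count()` returns the number of records per year, while `sum()` returns the total of the values. A small sketch on made-up numbers (the `toy` frame is an assumption for illustration):</font>

```python
import pandas as pd

# Made-up records: two rows per year
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "suicides_no": [10, 30, 5, 15],
})

# Number of rows per year (says nothing about the values themselves)
rows_per_year = toy.groupby("year")["suicides_no"].count()

# Total of suicides_no per year
total_per_year = toy.groupby("year")["suicides_no"].sum()

print(rows_per_year)
print(total_per_year)
```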

In [None]:
f,ax = plt.subplots(1,1,figsize=(13,6))
ax = sns.barplot(x='generation', y='suicides_no',
                 hue='sex', data=df, palette='bright')

<font size="2">This barplot shows that, in general, males are more likely to commit suicide than females</font>

In [None]:
f,ax = plt.subplots(1,1,figsize=(10,10))
## Correlations are only defined for the numeric columns
ax = sns.heatmap(df.select_dtypes(include="number").corr(), annot=True)

<font size="2">The correlations between the factors are low, except for the one between population and GDP for year</font>

In [None]:
sns.set(style='darkgrid')
data = df['suicides_no'].groupby(df.country).sum().sort_values(ascending=False)
f,ax = plt.subplots(1,1,figsize=(10,20))
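top20 = data.head(20)
## Plain string labels, so that unused country categories are not drawn
ax = sns.barplot(x=top20.values, y=top20.index.astype(str), palette='Reds')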

<font size="2">The highest number of suicides is in the Russian Federation</font>

In [None]:
data = df['suicides_no'].groupby(df.country).sum().sort_values(ascending=False)
f,ax = plt.subplots(1,1,figsize=(10,20))
bottom20 = data.tail(20)
ax = sns.barplot(x=bottom20.values, y=bottom20.index.astype(str), palette='Blues_r')

<font size="2">The lowest number of suicides is in San Marino</font>

In [None]:
f, ax = plt.subplots(1,1, figsize=(10,8))
ax = sns.scatterplot(x="gdp_for_year", y="suicides_no", hue="age", data=df)

<font size="2">The scatterplot shows no clear linear relationship between "gdp_for_year" and "suicides_no". This suggests that GDP on its own is not a strong predictor of the number of suicides</font>
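
<font size="2">The absence of a linear relationship does not by itself rule out a relationship. As a hedged sketch on synthetic values (not the suicide data), a rank-based correlation can pick up a monotonic trend that the linear (Pearson) correlation understates. The series `x` and `y` are made up for the example.</font>

```python
import numpy as np
import pandas as pd

# Synthetic, strictly increasing but strongly non-linear relationship
x = pd.Series(np.arange(1, 51), dtype="float64")
y = np.exp(x / 5.0)

# Pearson correlation measures linear association only
pearson = x.corr(y)

# Spearman correlation is the Pearson correlation of the ranks
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

<font size="2">Comparing both values is a cheap check before concluding that two variables are unrelated.</font>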

In [None]:
## Suicides by age and gender in the Russian Federation
f, ax = plt.subplots(1,1, figsize=(10,10))
ax = sns.boxplot(x='age', y='suicides_no', hue='sex',
                 data=df[df['country']=='Russian Federation'],
                 palette='Set1')

<font size='2'>In Russia, males aged 35 to 54 commit suicide more often than any other group</font>

<font size="5">Machine Learning</font>

In [None]:
## Using cat.codes method to convert category into numerical labels
columns = df.select_dtypes(['category']).columns
df[columns] = df[columns].apply(lambda fx: fx.cat.codes)
df.dtypes
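
<font size="2">To see what `cat.codes` produces, here is a small sketch on a made-up series of age labels (the `ages` series is an assumption for illustration). The codes follow the sorted order of the category labels, which for strings is lexicographic and not necessarily the natural order.</font>

```python
import pandas as pd

# Made-up age labels converted to the category dtype
ages = pd.Series(
    ["5-14 years", "75+ years", "15-24 years", "5-14 years"]
).astype("category")

# Categories are sorted as strings, so "15-24 years" comes before "5-14 years"
categories = list(ages.cat.categories)

# Each value is replaced by the position of its category
codes = ages.cat.codes.tolist()

print(categories)
print(codes)
```

<font size="2">Because the codes are plain integers, a distance-based model such as K-means treats them as if they were ordered quantities, which is worth remembering when interpreting the clusters.</font>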

<font size="3">*K-means Clustering*</font>

<font size='2'>The task is to cluster the countries into two groups: those with a high number of suicides and those with a low number of suicides. To do that, we drop the 'suicides_no' column so that the dataset is unlabeled</font>

In [None]:
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
## Features without the target column; the target is kept aside in y
x = df.drop('suicides_no', axis=1)
y = df['suicides_no']
## n_init and random_state are set explicitly so that the result is reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(x)
y_kmeans = kmeans.predict(x)
## Plot the first two feature columns, coloured by assigned cluster
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x.iloc[:, 0], x.iloc[:, 1], c=y_kmeans, cmap='cool')

In [None]:
from sklearn.metrics import silhouette_score
print(silhouette_score(x, y_kmeans))

<font size='2'>The silhouette score ranges from -1 to 1, and values close to 1 mean that the clusters are compact and well separated. It measures how well the clusters are separated, not how accurate they are. Note that this score is obtained without tweaking any parameters of the model and without scaling the values of the features</font>
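
<font size="2">Since K-means relies on Euclidean distances, features with large values can dominate the result when nothing is scaled. The following is a minimal sketch on synthetic data (all names and numbers are assumptions for illustration): two groups differ only in a small-scale feature, while a large-scale feature is pure noise.</font>

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: the groups differ in `small`; `large` carries no information
small = np.concatenate([rng.normal(0.0, 0.05, 100), rng.normal(1.0, 0.05, 100)])
large = rng.normal(0.0, 1e6, 200)
X = np.column_stack([small, large])
true_group = np.repeat([0, 1], 100)

def cluster(features):
    """Fit 2-means and return the labels with their silhouette score."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
    return labels, silhouette_score(features, labels)

def agreement(labels):
    """Share of points assigned to the right group, up to label swapping."""
    match = np.mean(labels == true_group)
    return max(match, 1 - match)

labels_raw, score_raw = cluster(X)
labels_scaled, score_scaled = cluster(StandardScaler().fit_transform(X))

print(f"raw:    agreement={agreement(labels_raw):.2f}, silhouette={score_raw:.2f}")
print(f"scaled: agreement={agreement(labels_scaled):.2f}, silhouette={score_scaled:.2f}")
```

<font size="2">In this sketch the unscaled run splits the points along the noisy large-scale feature, while the scaled run recovers the two groups. Whether scaling helps on the real dataset would have to be checked separately.</font>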

<font size='5'>**Conclusions**</font>


<font size='2'>Data cleaning is very important, as real-world data is usually messy. Visualizing the data is also an important step, because it makes it easier for many people to understand the data and to detect patterns, trends and outliers. The K-means clustering algorithm (which, judging by the silhouette score, found structure in our dataset) was easy to apply in this case: domain knowledge about the number of suicides in different countries suggested two groups, so choosing the number of clusters (k) was straightforward. That is not always the case.

As for suicides and the factors that influence them, age and gender appear to be among those factors, while GDP and HDI do not, because even in countries with high GDP and HDI many people commit suicide. Beyond that, there is not enough data for a better analysis, since other biological, psychological and social factors (race, ethnicity, social isolation, contagion, religion, etc.), as well as geographical ones (climate), may also contribute to suicides</font>
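
<font size="2">When domain knowledge does not suggest a value for k, one common heuristic is to compare silhouette scores over a few candidate values. A minimal sketch on synthetic blobs (the centers and the candidate range are assumptions for illustration, not taken from the dataset):</font>

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustration only)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest score
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.2f}")
print("best k:", best_k)
```

<font size="2">This is only a heuristic: on real data the scores are often close to each other, and the choice should still be checked against what the clusters mean.</font>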