<a href="https://colab.research.google.com/github/hegame1998/Suicide-Statistic/blob/main/WHO_Suicide_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Importing the modules & libraries**

In [None]:
# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# **Data Importing**

I load the data from my GitHub repository into a DataFrame variable so that it can be used in the analysis below.

In [None]:
# creating the suicide dataframe
suicide_df = pd.read_csv('https://raw.githubusercontent.com/hegame1998/Suicide-Statistic/main/who_suicide_statistics.csv')

# **Information about dataset**

In [None]:
# returning suicide dataframe values
suicide_df

Unnamed: 0,country,year,sex,age,suicides_no,population
0,Albania,1985,female,15-24 years,,277900.0
1,Albania,1985,female,25-34 years,,246800.0
2,Albania,1985,female,35-54 years,,267500.0
3,Albania,1985,female,5-14 years,,298300.0
4,Albania,1985,female,55-74 years,,138700.0
...,...,...,...,...,...,...
43771,Zimbabwe,1990,male,25-34 years,150.0,
43772,Zimbabwe,1990,male,35-54 years,132.0,
43773,Zimbabwe,1990,male,5-14 years,6.0,
43774,Zimbabwe,1990,male,55-74 years,74.0,


When the DataFrame is displayed in the notebook, the size of the data (number of rows and columns) appears at the end of the output, but we can also get it directly with `shape`:

In [None]:
suicide_df.shape

(43776, 6)

To return only the number of rows in the dataset:

In [None]:
# returning the number of rows in suicide dataframe
suicide_df.shape[0]

43776
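
Because `shape` is a `(rows, columns)` tuple, it can also be unpacked in one step, and `len()` gives the row count as well. A minimal sketch on a small made-up frame (`toy_df` and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy frame with hypothetical values, only to illustrate the calls
toy_df = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [1990, 1991, 1990],
    "suicides_no": [10.0, 12.0, 7.0],
})

n_rows, n_cols = toy_df.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(len(toy_df))              # len() also returns the number of rows
```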

The `head()` method retrieves the first rows of the dataset.
It returns 5 rows by default, but we can ask for a specific number of rows.
We ask for 8 here, so it returns the first 8 rows of the dataset.

In [None]:
# returning first n rows of suicide dataset
suicide_df.head(8)

Unnamed: 0,country,year,sex,age,suicides_no,population
0,Albania,1985,female,15-24 years,,277900.0
1,Albania,1985,female,25-34 years,,246800.0
2,Albania,1985,female,35-54 years,,267500.0
3,Albania,1985,female,5-14 years,,298300.0
4,Albania,1985,female,55-74 years,,138700.0
5,Albania,1985,female,75+ years,,34200.0
6,Albania,1985,male,15-24 years,,301400.0
7,Albania,1985,male,25-34 years,,264200.0


The `tail()` method works the same way as `head()`, but it returns the last rows of the dataset.

In [None]:
# returning last n rows of suicide dataset
suicide_df.tail(3)

Unnamed: 0,country,year,sex,age,suicides_no,population
43773,Zimbabwe,1990,male,5-14 years,6.0,
43774,Zimbabwe,1990,male,55-74 years,74.0,
43775,Zimbabwe,1990,male,75+ years,13.0,


The `info()` method provides a concise summary of the dataset, including the number of non-null entries and the data type of each column.<br>This is useful for understanding the data types in the dataset and identifying missing values, so we can clean the data if there are any empty cells.

In [None]:
# getting information about each column counts and datatype in suicide dataframe
suicide_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43776 entries, 0 to 43775
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   country      43776 non-null  object 
 1   year         43776 non-null  int64  
 2   sex          43776 non-null  object 
 3   age          43776 non-null  object 
 4   suicides_no  41520 non-null  float64
 5   population   38316 non-null  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 2.0+ MB


The `describe()` method generates descriptive statistics of the data.
It summarizes the central tendency, dispersion, and shape of the distribution. For each numeric column in the dataset, it calculates the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum.

In [None]:
# getting information like min,max and mean about numeric columns in suicide dataframe
suicide_df.describe()

Unnamed: 0,year,suicides_no,population
count,43776.0,41520.0,38316.0
mean,1998.502467,193.31539,1664091.0
std,10.338711,800.589926,3647231.0
min,1979.0,0.0,259.0
25%,1990.0,1.0,85112.75
50%,1999.0,14.0,380655.0
75%,2007.0,91.0,1305698.0
max,2016.0,22338.0,43805210.0
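
By default `describe()` covers only the numeric columns. As a rough illustration, passing `include="all"` adds rows such as `unique`, `top` and `freq` for text columns. The frame below is a hypothetical toy example, not the WHO data:

```python
import pandas as pd

# Hypothetical toy values, chosen only to make the summary easy to read
toy_df = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "suicides_no": [1.0, 14.0, 91.0, 0.0],
})

numeric_summary = toy_df.describe()            # numeric columns only
all_summary = toy_df.describe(include="all")   # also summarizes the text column

print(numeric_summary)
print(all_summary)
```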


# **Cleaning dataset**

Cleaning the dataset before analysis is a fundamental step in ensuring the quality, reliability, and integrity of the results. It makes the analysis more trustworthy and helps in drawing accurate and meaningful conclusions.


### **Calculate null cells**


I want to list the columns that contain null values.

In [None]:
# list the columns that contain null values
null_colls = [i for i in suicide_df.columns if suicide_df[i].isnull().any()]
null_colls

['suicides_no', 'population']

Now I want to calculate the number of null cells.

In [None]:
# calculate the number of null cells in each column
suicide_df.isnull().sum()

country           0
year              0
sex               0
age               0
suicides_no    2256
population     5460
dtype: int64
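
The raw counts are easier to judge next to the share of rows they represent. Since `isnull()` returns booleans, taking the mean gives that share directly. A small sketch on hypothetical values (`toy_df` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing cells
toy_df = pd.DataFrame({
    "suicides_no": [np.nan, 150.0, 6.0, 74.0],
    "population": [277900.0, np.nan, np.nan, 138700.0],
})

null_counts = toy_df.isnull().sum()          # missing cells per column
null_share = toy_df.isnull().mean() * 100    # the same, as a percentage of rows

print(null_counts)
print(null_share)
```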

### **Cleaning based on *'suicides_no'***

I remove the rows where `suicides_no` is null:

In [None]:
# removing rows containing null values of suicides number column
suicide_df.dropna(subset=['suicides_no'], inplace=True)

### **Cleaning based on *'population'***

I remove the rows where `population` is null:

In [None]:
# removing rows containing null values of population column
suicide_df.dropna(subset=['population'], inplace=True)
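
Both columns can also be handled in a single call by listing them in `subset`. The sketch below returns a new frame instead of modifying in place and resets the index afterwards; the toy data and the `cleaned` name are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two of the four rows have a missing value
toy_df = pd.DataFrame({
    "suicides_no": [np.nan, 150.0, 6.0, 74.0],
    "population": [277900.0, np.nan, 298300.0, 138700.0],
})

# Drop rows where either column is null, then renumber the rows from 0
cleaned = (
    toy_df
    .dropna(subset=["suicides_no", "population"])
    .reset_index(drop=True)
)

print(cleaned)
```

Returning a new frame keeps the original available for comparison, which can help when checking how many rows were dropped.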

Now I check that no null cells remain, because I want a clean dataset for my analysis.

In [None]:
# getting information about each column counts and datatype in suicide dataframe
suicide_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36060 entries, 24 to 43763
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   country      36060 non-null  object 
 1   year         36060 non-null  int64  
 2   sex          36060 non-null  object 
 3   age          36060 non-null  object 
 4   suicides_no  36060 non-null  float64
 5   population   36060 non-null  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 1.9+ MB


In [None]:
# calculate the number of null cells in each column
suicide_df.isnull().sum()

country        0
year           0
sex            0
age            0
suicides_no    0
population     0
dtype: int64

### **Converting value based on *'sex'***

Converting the values of sex from strings to integers

In [None]:
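# replacing string values of sex with integer values
# (the result is assigned back because inplace=True on a selected column is not reliable in recent pandas)
suicide_df['sex'] = suicide_df['sex'].map({'female': 1, 'male': 2})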
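
One way to guard this kind of recoding is to check that every label was covered by the dictionary. `Series.map` leaves labels missing from the dictionary as `NaN`, so a quick `notna()` check catches them. A minimal sketch on hypothetical values (`toy_df` and `sex_codes` are illustrative names):

```python
import pandas as pd

# Hypothetical toy column with the two labels used in the dataset
toy_df = pd.DataFrame({"sex": ["female", "male", "male", "female"]})

sex_codes = {"female": 1, "male": 2}          # same codes as in the notebook
toy_df["sex"] = toy_df["sex"].map(sex_codes)  # unmapped labels would become NaN

all_mapped = toy_df["sex"].notna().all()      # True when every label was covered
print(toy_df["sex"].tolist(), all_mapped)
```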

### **Converting value based on *'age'***

Converting the values of age from strings to integers. First I print all unique values:

In [None]:
# Print unique values of age
suicide_df["age"].unique()

array(['15-24 years', '25-34 years', '35-54 years', '5-14 years',
       '55-74 years', '75+ years'], dtype=object)

In [None]:
# replacing string values of age with integer values, ordered from the youngest to the oldest group
suicide_df['age'] = suicide_df['age'].map({'5-14 years': 1, '15-24 years': 2, '25-34 years': 3, '35-54 years': 4, '55-74 years': 5, '75+ years': 6})
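
Since the age groups have a natural order, an ordered categorical is an alternative way to get the same 1 to 6 codes while keeping the labels around. This is only a sketch of that alternative, not what the notebook does; `age_order`, `toy_age` and the sample values are assumptions:

```python
import pandas as pd

# Age groups listed from youngest to oldest (same labels as in the dataset)
age_order = ["5-14 years", "15-24 years", "25-34 years",
             "35-54 years", "55-74 years", "75+ years"]

# Hypothetical sample of age labels
toy_age = pd.Series(["15-24 years", "75+ years", "5-14 years"])

age_cat = pd.Categorical(toy_age, categories=age_order, ordered=True)
age_codes = pd.Series(age_cat.codes + 1)   # codes are 0-based; +1 matches the 1..6 scheme

print(age_codes.tolist())
```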

# **Exploration on clean dataset**

`corr()` is a pandas DataFrame method that computes the pairwise correlation of columns, excluding null values. <br> It returns the correlation matrix for the numerical columns in the DataFrame.
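
As a small illustration of the call, and assuming pandas 1.5 or later where `corr()` accepts `numeric_only`, the text columns can be skipped without removing them from the frame. The toy frame below is hypothetical:

```python
import pandas as pd

# Hypothetical toy frame: population is an exact multiple of suicides_no
toy_df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "suicides_no": [10.0, 20.0, 30.0, 40.0],
    "population": [100.0, 200.0, 300.0, 400.0],
})

# Pearson correlation (the default) over the numeric columns only
corr_matrix = toy_df.corr(numeric_only=True)

print(corr_matrix)
print(toy_df.columns.tolist())   # 'country' is still part of toy_df
```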

In [None]:
# finding correlation between different features
# the numeric columns go into a separate variable so that suicide_df keeps its 'country' column
numeric_df = suicide_df.select_dtypes(include='number')

numeric_df.corr()

Unnamed: 0,year,sex,age,suicides_no,population
year,1.0,-1.205969e-16,2.837968e-18,-0.011356,0.012601
sex,-1.205969e-16,1.0,1.478274e-18,0.136476,-0.010822
age,2.837968e-18,1.478274e-18,1.0,0.075336,-0.069339
suicides_no,-0.01135649,0.1364758,0.07533582,1.0,0.611406
population,0.01260078,-0.01082212,-0.06933916,0.611406,1.0


In [None]:
# total number of suicides by country, sorted from largest to smallest
country_wise = suicide_df["suicides_no"].groupby(suicide_df["country"]).sum().sort_values(ascending=False)

In [None]:
# number of suicides by country
suicider_by_country = suicide_df["suicides_no"].groupby(suicide_df["country"]).sum()
suicider_by_country = suicider_by_country.sort_values(ascending=False)
print(suicider_by_country)
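
The per-country totals that the cells above aim for can be sketched on a toy frame. The aggregation needs the `country` column to be present in the frame being grouped; the data and the `totals` name below are hypothetical:

```python
import pandas as pd

# Hypothetical toy frame that still contains the 'country' column
toy_df = pd.DataFrame({
    "country": ["A", "B", "A", "B", "C"],
    "suicides_no": [10.0, 5.0, 7.0, 1.0, 20.0],
})

# Sum suicides per country, then sort from largest to smallest
totals = (
    toy_df
    .groupby("country")["suicides_no"]
    .sum()
    .sort_values(ascending=False)
)

print(totals)
```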