<a href="https://colab.research.google.com/github/marcocolognesi/programming_class_final_project/blob/main/final_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data exploration & wrangling**



In [1]:
import pandas as pd
import matplotlib.pyplot as plt

## Initial data exploration (*importing files, renaming columns and dropping useless ones*)

In [2]:
original_suicide_df = pd.read_csv('master.csv')

In [3]:
suicide_df = original_suicide_df.copy()

In [4]:
suicide_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31756 entries, 0 to 31755
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   country             31756 non-null  object 
 1   year                31756 non-null  int64  
 2   sex                 31756 non-null  object 
 3   age                 31756 non-null  object 
 4   suicides_no         30556 non-null  float64
 5   population          31756 non-null  int64  
 6   suicides/100k pop   31756 non-null  float64
 7   country-year        31756 non-null  object 
 8   HDI for year        12300 non-null  float64
 9    gdp_for_year ($)   31756 non-null  object 
 10  gdp_per_capita ($)  31756 non-null  float64
 11  generation          31756 non-null  object 
dtypes: float64(4), int64(2), object(6)
memory usage: 2.9+ MB


A first look at the dataset info shows that some column names contain characters that are awkward to work with (spaces, `/`, `-`, `($)`), so we rename them:

In [5]:
suicide_df.columns = ['country', 'year', 'gender', 'age', 'suicides_no', 'population', 'suicides_100k_pop', 'country_year', 'hdi_for_year', 'gdp_for_year', 'gdp_per_capita', 'generation']
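Assigning a full list to `columns` relies on the column order staying exactly as it is in `master.csv`. A minimal sketch of an order-independent alternative is `DataFrame.rename` with an explicit mapping. The small `demo_df` below, including its values, is a made-up stand-in that only mimics a few of the original column names; it is not taken from the dataset.

```python
import pandas as pd

# Hypothetical two-row frame (made-up values) mimicking three original column names.
demo_df = pd.DataFrame({
    'sex': ['male', 'female'],
    'suicides/100k pop': [1.0, 2.0],
    ' gdp_for_year ($) ': ['1,000', '1,000'],
})

# Old name -> new name; columns not listed here would keep their names.
column_map = {
    'sex': 'gender',
    'suicides/100k pop': 'suicides_100k_pop',
    ' gdp_for_year ($) ': 'gdp_for_year',
}

# errors='raise' makes pandas raise a KeyError if a key in the mapping
# does not match any existing column (e.g. because of a typo).
renamed_df = demo_df.rename(columns=column_map, errors='raise')
print(list(renamed_df.columns))
```

The mapping also documents which original name each new name came from, which the positional list does not.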

In [None]:
# Alternative: build the new names with str.replace instead of typing them out.
# Note: this only swaps ' ', '/' and '-' for '_'. It does not, for example,
# rename 'sex' to 'gender' or strip the '($)' suffixes, so the result differs
# from the explicit list used above.
#
# new_columns = []
# for column_name in original_suicide_df.columns:
#     new_column_name = column_name.replace(' ', '_').replace('/', '_').replace('-', '_')
#     new_columns.append(new_column_name)
#
# suicide_df.columns = new_columns

In [6]:
suicide_df.columns

Index(['country', 'year', 'gender', 'age', 'suicides_no', 'population',
       'suicides_100k_pop', 'country_year', 'hdi_for_year', 'gdp_for_year',
       'gdp_per_capita', 'generation'],
      dtype='object')

Once the names are fixed, we see that there are a few columns we should drop:
*   **'country_year'**, as it is not needed for our analysis;
*   **'hdi_for_year'**, as it has too few values to work with (*only 12,300 non-null values out of 31,756 rows*);
*   **'generation'**, because some of its values are inaccurate.
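The case against **'hdi_for_year'** can be checked numerically rather than read off the `info()` output. The sketch below computes the share of filled cells per column on a made-up frame (`demo_df` and its values are illustrative only), then repeats the arithmetic for the counts reported above.

```python
import numpy as np
import pandas as pd

# Made-up frame: 'hdi_for_year' is mostly missing, 'population' is complete.
demo_df = pd.DataFrame({
    'population': [1000, 2000, 3000, 4000],
    'hdi_for_year': [0.7, np.nan, np.nan, np.nan],
})

# notna() marks each cell True/False; the column mean is the share of filled cells.
filled_share = demo_df.notna().mean()
print(filled_share)

# Share implied by the info() output: 12300 non-null values out of 31756 rows.
hdi_share = 12300 / 31756
print(round(hdi_share, 3))
```

On the real data, `suicide_df.notna().mean()` would give the same kind of per-column summary; by the counts above, fewer than 40% of the rows have a value for this column.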

In [7]:
suicide_df = suicide_df.drop(columns=['country_year', 'hdi_for_year', 'generation'])

In [8]:
suicide_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31756 entries, 0 to 31755
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   country            31756 non-null  object 
 1   year               31756 non-null  int64  
 2   gender             31756 non-null  object 
 3   age                31756 non-null  object 
 4   suicides_no        30556 non-null  float64
 5   population         31756 non-null  int64  
 6   suicides_100k_pop  31756 non-null  float64
 7   gdp_for_year       31756 non-null  object 
 8   gdp_per_capita     31756 non-null  float64
dtypes: float64(3), int64(2), object(4)
memory usage: 2.2+ MB


With the columns cleaned up, we can start inspecting each one:

In [9]:
suicide_df['country'].unique()

array(['Albania', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
       'Barbados', 'Belarus', 'Belgium', 'Belize',
       'Bosnia and Herzegovina', 'Brazil', 'Bulgaria', 'Cabo Verde',
       'Canada', 'Chile', 'Colombia', 'Costa Rica', 'Croatia', 'Cuba',
       'Cyprus', 'Czech Republic', 'Denmark', 'Dominica', 'Ecuador',
       'El Salvador', 'Estonia', 'Fiji', 'Finland', 'France', 'Georgia',
       'Germany', 'Greece', 'Grenada', 'Guatemala', 'Guyana', 'Hungary',
       'Iceland', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan',
       'Kazakhstan', 'Kiribati', 'Kuwait', 'Kyrgyzstan', 'Latvia',
       'Lithuania', 'Luxembourg', 'Macau', 'Maldives', 'Malta',
       'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Netherlands',
       'New Zealand', 'Nicaragua', 'Norway', 'Oman', 'Panama', 'Paraguay',
       'Philippines', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar',
       'Republic of Korea', 'Romania', '

Looking at the countries in our data (**'country'** column), we see that the United States is listed under two different names, so we unify them:

In [10]:
suicide_df.loc[suicide_df['country'] == 'United States', 'country'] = 'United States of America'
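The same unification can be written with `Series.replace` and a mapping, which scales more easily if further duplicate spellings turn up. This is a sketch on a made-up three-row frame (`demo_df` is hypothetical), not the notebook's own cell.

```python
import pandas as pd

# Hypothetical frame containing both spellings plus an unrelated country.
demo_df = pd.DataFrame({
    'country': ['United States', 'United States of America', 'Italy'],
})

# Whole values equal to a key are replaced; all other values are left untouched.
demo_df['country'] = demo_df['country'].replace(
    {'United States': 'United States of America'}
)

unified = list(demo_df['country'].unique())
print(unified)
```

Checking `unique()` again after the replacement is a quick way to confirm that only one spelling remains.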

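Looking at the **'age'** column, the labels are strings, so they sort character by character and '5-14 years' would end up after ranges that start with a smaller first digit. We zero-pad it so that it sorts first: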

In [11]:
suicide_df.loc[suicide_df['age'] == '5-14 years', 'age'] = '05-14 years'
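The effect of the zero-padding is easy to see on a plain list of strings. In the sketch below only '5-14 years' comes from the cell above; the other two labels are illustrative assumptions about what neighbouring age ranges might look like.

```python
# '5-14 years' is the label renamed above; the other two are illustrative only.
labels = ['5-14 years', '15-24 years', '25-34 years']

# Strings compare character by character, so '5...' lands after '1...' and '2...'.
sorted_raw = sorted(labels)
print(sorted_raw)

# Zero-padding the single-digit lower bound restores the intended order.
padded = ['05-14 years' if label == '5-14 years' else label for label in labels]
sorted_padded = sorted(padded)
print(sorted_padded)
```

An ordered categorical dtype would be another way to get the intended order without changing the label text; the rename is simply the lighter option.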