<a href="https://colab.research.google.com/github/ivyownn/Infant-Mortality-Rate-Project-Analysis/blob/main/Infant_Mortality_Rate_Project_Analysis_CY.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Infant Mortality Rate Project Analysis**
![Alt text](IMR_Image/IMR_CodeYou_CP.png)


This project analyzes child mortality rates worldwide using data for 1990, 2000, and 2021. The data comes from UNICEF's State of the World's Children 2023: Statistical Tables. The goal of this analysis is to better understand the health of populations, help improve healthcare systems, and support the implementation of effective interventions to reduce infant mortality and improve maternal and child health outcomes.

Python is used for data preprocessing, analysis, and modeling; Tableau, Matplotlib, and Plotly are used for data visualization.

Source of data:

https://data.unicef.org/resources/dataset/the-state-of-the-worlds-children-2023-statistical-tables/

**Resources:**

*   Data sets on child mortality
*   Git
*   Google Colab or Visual Studio Code
*   Tableau software for visualization
*   Data dictionary (included in the readme and project folder)

**Deliverables:**

*   Cleaned and preprocessed child mortality data
*   Python/Pandas scripts for data analysis and modeling
*   Tableau dashboards and visualizations
*   Project report

**Features:**
1. Data collection: read in two .csv data files
2. Data processing: cleaned and processed the data; combined the two .csv files with pandas (`pd.concat`)
    *   Checked for missing values, duplicates, data types, and basic statistics
    *   Replaced missing-value placeholders (`-`) with zeroes
    *   Printed out column names
    *   Transformed the data for viewing trends over time
3. Data visualization: presented the data using
    *   a Tableau dashboard
    *   3 Matplotlib visualizations
    *   Plotly
4. Best practices
5. Data dictionary
6. Data interpretation


**Best Practices:**
1. Open Git (for example, Git Bash) or another terminal
2. Clone the repository by typing:
        ~ git clone followed by the link you copied from GitHub
3. Create a virtual environment using venv
  a. venv is included with Python 3.3 and later, so it normally needs no separate installation. If you prefer the third-party virtualenv tool instead, install it by running this command in your terminal:
        ~ pip install virtualenv
  b. To use venv in your project, cd to the project folder in your terminal and run the following command:
        ~ python<version> -m venv venv
    The second "venv" is the <virtual-environment-name>
  c. Activate the virtual environment:
        ~ source venv/bin/activate      (macOS/Linux)
        ~ venv\Scripts\activate         (Windows)
  d. Then install the dependencies:
        ~ pip install -r requirements.txt
4. Run the project

**How to Deactivate a Virtual Environment:**
To deactivate your virtual environment, run the following command in the terminal:

        ~ deactivate

To run this project, you will need Git and either Visual Studio Code or Google Colab.

To run the project in Google Colab, upload the data files to your Google Drive.

**How to install:**
*  [Git](https://github.com/git-guides/install-git)

*  [VS Code](https://code.visualstudio.com/download)

*  [Google Colab](https://research.google.com/colaboratory/)


**Other resources:**

https://data.unicef.org/resources/dataset/the-state-of-the-worlds-children-2023-statistical-tables/



**Future plans:**




# **Import Libraries**

In [40]:
# Import the libraries used in this notebook
from google.colab import drive
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt

# Step 1: Mount Google Drive

To analyze the UNICEF child mortality dataset in Colab:
1. Mount Google Drive in Colab so the notebook can read the files stored there.
2. After you run `drive.mount('/content/drive')`, Colab asks you to authorize access to your Google Drive. Follow the instructions to allow access.

NOTE: For more information on mounting Google Drive locally see:
 [Mounting Google Drive locally](https://colab.research.google.com/notebooks/io.ipynb#scrollTo=u22w3BFiOveA)

In [41]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Step 2: Load the first data source

Assuming the data is stored in .csv files in your Google Drive, use pandas to read the files.
If the file location changes, replace the paths in the code below with the actual paths to the files in your Google Drive.
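
One way to keep the file location easy to change is to build every path from a single base folder. The sketch below is only an illustration: the `BASE_DIR` variable and the `load_project_csvs` helper are hypothetical names that are not part of the original notebook, and the base folder shown is the one used in the cell that follows.

```python
from pathlib import Path

import pandas as pd

# Hypothetical: the Drive folder that holds the project files.
BASE_DIR = Path('/content/drive/My Drive/Infant-Mortality-Rate-Project-Analysis')


def load_project_csvs(base_dir, file_names):
    """Read each CSV in file_names from base_dir; fail early if one is missing."""
    frames = []
    for name in file_names:
        path = Path(base_dir) / name
        if not path.exists():
            raise FileNotFoundError(f'Expected data file not found: {path}')
        frames.append(pd.read_csv(path))
    return frames
```

With Drive mounted, a call such as `df, df2 = load_project_csvs(BASE_DIR, ['IMR_One_output_file_1.csv', 'IMR_Two_output_file_2.csv'])` would load both files, and only `BASE_DIR` needs editing if the folder moves.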

In [42]:
# Load the data
df = pd.read_csv('/content/drive/My Drive/Infant-Mortality-Rate-Project-Analysis/IMR_One_output_file_1.csv')
df2 = pd.read_csv('/content/drive/My Drive/Infant-Mortality-Rate-Project-Analysis/IMR_Two_output_file_2.csv')
df.head()

| | Countries and areas | Under-five mortality rate 1990 | Under-five mortality rate 2000 | Under-five mortality rate 2021 | Annual rate of reduction in under-five mortality rate 2000-2021 | Under-five mortality rate Male 2021 | Under-five mortality rate Female 2021 | Infant mortality rate 1990 | Infant mortality rate 2021 | Neonatal mortality rate 1990 | ... | Mortality rate among children aged 5–14 years 1990 | Mortality rate among children aged 5–14 years 2021 | Stillbirth rate 2000 | Stillbirth rate 2021 | Annual rate of reduction in stillbirth rate 2000-2021 | Under-five deaths 2021 | Neonatal deaths 2021 | Neonatal deaths as a percentage of under-five deaths 2021 | Deaths among children aged 5–14 years 2021 | Stillbirths 2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 178 | 129 | 56 | 4.0 | 59 | 52 | 121 | 43 | 74 | ... | 19 | 4 | 35 | 26 | 1.5 | 77811 | 49061 | 63 | 4396 | 37980 |
| 1 | Albania | 41 | 27 | 9 | 5.0 | 10 | 9 | 35 | 8 | 13 | ... | 6 | 2 | 7 | 4 | 2.3 | 279 | 209 | 75 | 59 | 128 |
| 2 | Algeria | 52 | 42 | 22 | 3.0 | 24 | 21 | 44 | 19 | 24 | ... | 9 | 3 | 20 | 10 | 3.5 | 21567 | 14888 | 69 | 2475 | 9429 |
| 3 | Andorra | 13 | 8 | 3 | 4.8 | 3 | 2 | 9 | 3 | 7 | ... | 3 | 1 | 4 | 2 | 2.3 | 2 | 1 | 50 | 0 | 1 |
| 4 | Angola | 224 | 205 | 69 | 5.2 | 75 | 63 | 132 | 47 | 54 | ... | 55 | 16 | 28 | 19 | 1.8 | 89896 | 35644 | 40 | 15451 | 26351 |


# Step 3: Perform Basic Data Quality Checks

After loading the data, you should perform some basic data quality checks to understand its structure and identify any obvious issues.
Here are a few checks you could perform:

In [43]:
# Check for missing values
print(df.isnull().sum())

# Check data types of each column
print(df.dtypes)

# Basic statistics for numeric columns
print(df.describe())

# Check for duplicate rows
print(df.duplicated().sum())

Countries and areas                                                0
Under-five mortality rate 1990                                     0
Under-five mortality rate 2000                                     0
Under-five mortality rate 2021                                     0
Annual rate of reduction in under-five mortality rate 2000-2021    0
Under-five mortality rate Male 2021                                0
Under-five mortality rate Female 2021                              0
Infant mortality rate 1990                                         0
Infant mortality rate 2021                                         0
Neonatal mortality rate 1990                                       0
Neonatal mortality rate 2000                                       0
Neonatal mortality rate 2021                                       0
Mortality rate among children aged 5–14 years 1990                 0
Mortality rate among children aged 5–14 years 2021                 0
Stillbirth rate 2000              

These steps give you an initial understanding of the data's quality, including missing values, data types, basic statistics, and potential duplicate entries.
From here, you can bring in the second data set and combine the two in a new DataFrame.
You can then continue preprocessing and analyzing the data based on your findings and analysis goals.
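
The individual checks can also be gathered into one small table, which is easier to scan than four separate printouts. This is a minimal sketch on made-up data; the `quality_report` helper is a hypothetical name and is not part of the original notebook.

```python
import pandas as pd


def quality_report(frame):
    """Summarize dtype, missing values and distinct values for each column."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'missing': frame.isnull().sum(),
        'unique': frame.nunique(),
    })


# Toy data standing in for the UNICEF table (values are made up).
toy = pd.DataFrame({
    'Countries and areas': ['A', 'B', 'C'],
    'Under-five mortality rate 2021': [10, None, 30],
})
report = quality_report(toy)
print(report)
print('Duplicate rows:', toy.duplicated().sum())
```

Running `quality_report(df)` on the loaded data would give one row per column, so columns with missing values or an unexpected dtype stand out.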


# **Combine Data Sources**

In [44]:
# Concatenate the dataframes
combined_df = pd.concat([df, df2], ignore_index=True)

# (Optional) Save the combined DataFrame to a new CSV file
combined_csv_path = '/content/drive/MyDrive/Infant-Mortality-Rate-Project-Analysis.csv'
combined_df.to_csv(combined_csv_path, index=False)

# Display the last rows of the combined dataframe to verify
combined_df.tail()

| | Countries and areas | Under-five mortality rate 1990 | Under-five mortality rate 2000 | Under-five mortality rate 2021 | Annual rate of reduction in under-five mortality rate 2000-2021 | Under-five mortality rate Male 2021 | Under-five mortality rate Female 2021 | Infant mortality rate 1990 | Infant mortality rate 2021 | Neonatal mortality rate 1990 | ... | Mortality rate among children aged 5–14 years 1990 | Mortality rate among children aged 5–14 years 2021 | Stillbirth rate 2000 | Stillbirth rate 2021 | Annual rate of reduction in stillbirth rate 2000-2021 | Under-five deaths 2021 | Neonatal deaths 2021 | Neonatal deaths as a percentage of under-five deaths 2021 | Deaths among children aged 5–14 years 2021 | Stillbirths 2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 197 | Venezuela (Bolivarian Republic of) | 30 | 22 | 24 | -0.6 | 26 | 22 | 25 | 21 | 13 | ... | 4 | 4 | 10 | 11 | -0.5 | 11322 | 6779 | 60 | 2006 | 4882 |
| 198 | Viet Nam | 52 | 30 | 21 | 1.8 | 24 | 17 | 37 | 16 | 24 | ... | 10 | 3 | 11 | 8 | 1.6 | 30455 | 15404 | 51 | 3982 | 11822 |
| 199 | Yemen | 126 | 95 | 62 | 2.0 | 66 | 58 | 89 | 47 | 44 | ... | 18 | 7 | 24 | 23 | 0.1 | 61914 | 28554 | 46 | 5902 | 24195 |
| 200 | Zambia | 182 | 156 | 58 | 4.7 | 62 | 53 | 108 | 40 | 37 | ... | 27 | 10 | 22 | 14 | 2.1 | 37822 | 16492 | 44 | 5558 | 9703 |
| 201 | Zimbabwe | 80 | 96 | 50 | 3.2 | 54 | 45 | 51 | 36 | 23 | ... | 13 | 11 | 23 | 19 | 0.8 | 23960 | 12211 | 51 | 4684 | 9711 |
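
`pd.concat` stacks the rows of one DataFrame under the other; unlike `merge`, it does not match rows on a key. A quick sanity check after stacking is to confirm that the row counts add up and that no country appears twice. The sketch below uses made-up stand-ins for `df` and `df2`, and the variable names are illustrative only.

```python
import pandas as pd

# Toy stand-ins for df and df2 (values are made up).
part_one = pd.DataFrame({'Countries and areas': ['A', 'B'], 'Rate': [10, 20]})
part_two = pd.DataFrame({'Countries and areas': ['C', 'D'], 'Rate': [30, 40]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0, 1.
stacked = pd.concat([part_one, part_two], ignore_index=True)

# Row counts should add up, and no country should appear twice.
assert len(stacked) == len(part_one) + len(part_two)
repeated = stacked[stacked.duplicated('Countries and areas', keep=False)]
print(f'{len(stacked)} rows, {len(repeated)} repeated country rows')
```

If the two source files overlapped, `repeated` would list the affected rows so they could be reviewed before any analysis.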


# Basic Information about the DataFrame

*   First, get a basic understanding of the structure of your DataFrame.

In [45]:
# Display the first few rows of the DataFrame
print(combined_df.head())

# Display the last few rows of the DataFrame
print(combined_df.tail())

# Get a concise summary of the DataFrame
print(combined_df.info())

# Get the number of rows and columns
print(combined_df.shape)

  Countries and areas Under-five mortality rate 1990  \
0         Afghanistan                            178   
1             Albania                             41   
2             Algeria                             52   
3             Andorra                             13   
4              Angola                            224   

  Under-five mortality rate 2000 Under-five mortality rate 2021  \
0                            129                             56   
1                             27                              9   
2                             42                             22   
3                              8                              3   
4                            205                             69   

  Annual rate of reduction in under-five mortality rate 2000-2021  \
0                                                  4                
1                                                  5                
2                                                  3 

# Data Cleaning / Transformation

In [46]:
# Replace '-' with '0' in the entire DataFrame, ignoring column headers
combined_df = combined_df.replace('-', '0')
combined_df


| | Countries and areas | Under-five mortality rate 1990 | Under-five mortality rate 2000 | Under-five mortality rate 2021 | Annual rate of reduction in under-five mortality rate 2000-2021 | Under-five mortality rate Male 2021 | Under-five mortality rate Female 2021 | Infant mortality rate 1990 | Infant mortality rate 2021 | Neonatal mortality rate 1990 | ... | Mortality rate among children aged 5–14 years 1990 | Mortality rate among children aged 5–14 years 2021 | Stillbirth rate 2000 | Stillbirth rate 2021 | Annual rate of reduction in stillbirth rate 2000-2021 | Under-five deaths 2021 | Neonatal deaths 2021 | Neonatal deaths as a percentage of under-five deaths 2021 | Deaths among children aged 5–14 years 2021 | Stillbirths 2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 178 | 129 | 56 | 4 | 59 | 52 | 121 | 43 | 74 | ... | 19 | 4 | 35 | 26 | 1.5 | 77811 | 49061 | 63 | 4396 | 37980 |
| 1 | Albania | 41 | 27 | 9 | 5 | 10 | 9 | 35 | 8 | 13 | ... | 6 | 2 | 7 | 4 | 2.3 | 279 | 209 | 75 | 59 | 128 |
| 2 | Algeria | 52 | 42 | 22 | 3 | 24 | 21 | 44 | 19 | 24 | ... | 9 | 3 | 20 | 10 | 3.5 | 21567 | 14888 | 69 | 2475 | 9429 |
| 3 | Andorra | 13 | 8 | 3 | 4.8 | 3 | 2 | 9 | 3 | 7 | ... | 3 | 1 | 4 | 2 | 2.3 | 2 | 1 | 50 | 0 | 1 |
| 4 | Angola | 224 | 205 | 69 | 5.2 | 75 | 63 | 132 | 47 | 54 | ... | 55 | 16 | 28 | 19 | 1.8 | 89896 | 35644 | 40 | 15451 | 26351 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 197 | Venezuela (Bolivarian Republic of) | 30 | 22 | 24 | -0.6 | 26 | 22 | 25 | 21 | 13 | ... | 4 | 4 | 10 | 11 | -0.5 | 11322 | 6779 | 60 | 2006 | 4882 |
| 198 | Viet Nam | 52 | 30 | 21 | 1.8 | 24 | 17 | 37 | 16 | 24 | ... | 10 | 3 | 11 | 8 | 1.6 | 30455 | 15404 | 51 | 3982 | 11822 |
| 199 | Yemen | 126 | 95 | 62 | 2 | 66 | 58 | 89 | 47 | 44 | ... | 18 | 7 | 24 | 23 | 0.1 | 61914 | 28554 | 46 | 5902 | 24195 |
| 200 | Zambia | 182 | 156 | 58 | 4.7 | 62 | 53 | 108 | 40 | 37 | ... | 27 | 10 | 22 | 14 | 2.1 | 37822 | 16492 | 44 | 5558 | 9703 |
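
Replacing `-` with `0` keeps every cell numeric, but a zero is then treated as a real observation in later statistics. An alternative worth considering is to let `pd.to_numeric(..., errors='coerce')` turn the placeholder into `NaN`, which pandas skips by default. The sketch below compares the two on a made-up column; it is an illustration of the trade-off, not the notebook's method.

```python
import pandas as pd

# Toy column where '-' marks a missing value (values are made up).
toy = pd.DataFrame({'Rate': ['10', '-', '30']})

# Approach used in the cell above: the placeholder becomes the number 0.
as_zero = pd.to_numeric(toy['Rate'].replace('-', '0'))

# Alternative: anything that is not a number becomes NaN.
as_missing = pd.to_numeric(toy['Rate'], errors='coerce')

print(as_zero.mean())     # zero is counted as a real observation
print(as_missing.mean())  # the missing value is skipped
```

Which behavior is appropriate depends on the analysis; for rates such as mortality, a zero that really means "no data" can pull averages and rankings downward.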


In [47]:
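# Remove every character outside printable ASCII (0x20-0x7E) from the string columns
for col in combined_df.select_dtypes(include=['object']):  # assuming string columns are of type 'object'
    combined_df[col] = combined_df[col].str.replace(r'[^\x20-\x7E]', '', regex=True)

# Save to CSV with UTF-8 encoding ('utf-8-sig' writes a byte order mark, which helps Excel detect the encoding)
combined_df.to_csv('new_combined_utf8.csv', index=False, encoding='utf-8-sig')
combined_df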

| | Countries and areas | Under-five mortality rate 1990 | Under-five mortality rate 2000 | Under-five mortality rate 2021 | Annual rate of reduction in under-five mortality rate 2000-2021 | Under-five mortality rate Male 2021 | Under-five mortality rate Female 2021 | Infant mortality rate 1990 | Infant mortality rate 2021 | Neonatal mortality rate 1990 | ... | Mortality rate among children aged 5–14 years 1990 | Mortality rate among children aged 5–14 years 2021 | Stillbirth rate 2000 | Stillbirth rate 2021 | Annual rate of reduction in stillbirth rate 2000-2021 | Under-five deaths 2021 | Neonatal deaths 2021 | Neonatal deaths as a percentage of under-five deaths 2021 | Deaths among children aged 5–14 years 2021 | Stillbirths 2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 178 | 129 | 56 | 4 | 59 | 52 | 121 | 43 | 74 | ... | 19 | 4 | 35 | 26 | 1.5 | 77811 | 49061 | 63 | 4396 | 37980 |
| 1 | Albania | 41 | 27 | 9 | 5 | 10 | 9 | 35 | 8 | 13 | ... | 6 | 2 | 7 | 4 | 2.3 | 279 | 209 | 75 | 59 | 128 |
| 2 | Algeria | 52 | 42 | 22 | 3 | 24 | 21 | 44 | 19 | 24 | ... | 9 | 3 | 20 | 10 | 3.5 | 21567 | 14888 | 69 | 2475 | 9429 |
| 3 | Andorra | 13 | 8 | 3 | 4.8 | 3 | 2 | 9 | 3 | 7 | ... | 3 | 1 | 4 | 2 | 2.3 | 2 | 1 | 50 | 0 | 1 |
| 4 | Angola | 224 | 205 | 69 | 5.2 | 75 | 63 | 132 | 47 | 54 | ... | 55 | 16 | 28 | 19 | 1.8 | 89896 | 35644 | 40 | 15451 | 26351 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 197 | Venezuela (Bolivarian Republic of) | 30 | 22 | 24 | -0.6 | 26 | 22 | 25 | 21 | 13 | ... | 4 | 4 | 10 | 11 | -0.5 | 11322 | 6779 | 60 | 2006 | 4882 |
| 198 | Viet Nam | 52 | 30 | 21 | 1.8 | 24 | 17 | 37 | 16 | 24 | ... | 10 | 3 | 11 | 8 | 1.6 | 30455 | 15404 | 51 | 3982 | 11822 |
| 199 | Yemen | 126 | 95 | 62 | 2 | 66 | 58 | 89 | 47 | 44 | ... | 18 | 7 | 24 | 23 | 0.1 | 61914 | 28554 | 46 | 5902 | 24195 |
| 200 | Zambia | 182 | 156 | 58 | 4.7 | 62 | 53 | 108 | 40 | 37 | ... | 27 | 10 | 22 | 14 | 2.1 | 37822 | 16492 | 44 | 5558 | 9703 |
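
The pattern `[^\x20-\x7E]` deletes every character that is not printable ASCII, so an accented letter is dropped entirely rather than replaced by its base letter. The sketch below shows that effect on an illustrative name (not taken from the notebook output) next to one possible alternative that folds accents with the standard library's `unicodedata` module. Both helper functions are hypothetical names added for this example.

```python
import re
import unicodedata


def strip_non_ascii(text):
    """Same pattern as the cell above: delete anything outside printable ASCII."""
    return re.sub(r'[^\x20-\x7E]', '', text)


def fold_to_ascii(text):
    """Alternative: decompose accented letters, then drop the accent marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


name = "C\u00f4te d'Ivoire"  # "Côte d'Ivoire", written with an escape for clarity
print(strip_non_ascii(name))  # Cte d'Ivoire
print(fold_to_ascii(name))    # Cote d'Ivoire
```

Folding keeps names readable, although it still changes the official spelling; keeping the original UTF-8 text is also an option when downstream tools support it.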


In [48]:
# Write the df to a new CSV file
combined_df.to_csv('/content/drive/My Drive/Infant-Mortality-Rate-Project-Analysis/new_combined.csv')

# Data Transformation

In [49]:
# Print out column names
print(combined_df.columns)

Index(['Countries and areas', 'Under-five mortality rate 1990',
       'Under-five mortality rate 2000', 'Under-five mortality rate 2021',
       'Annual rate of reduction in under-five mortality rate 2000-2021',
       'Under-five mortality rate Male 2021',
       'Under-five mortality rate Female 2021', 'Infant mortality rate 1990',
       'Infant mortality rate 2021', 'Neonatal mortality rate 1990',
       'Neonatal mortality rate 2000', 'Neonatal mortality rate 2021',
       'Mortality rate among children aged 5–14 years 1990',
       'Mortality rate among children aged 5–14 years 2021',
       'Stillbirth rate 2000', 'Stillbirth rate 2021',
       'Annual rate of reduction in stillbirth rate 2000-2021',
       'Under-five deaths 2021', 'Neonatal deaths 2021',
       'Neonatal deaths as a percentage of under-five deaths 2021',
       'Deaths among children aged 5–14 years 2021', 'Stillbirths 2021'],
      dtype='object')


In [50]:
# Transforming the data for viewing trends over time
# Convert relevant columns to numeric, if not already
columns_to_convert = ['Under-five mortality rate 1990', 'Under-five mortality rate 2000', 'Under-five mortality rate 2021',
                      'Under-five mortality rate Male 2021', 'Under-five mortality rate Female 2021',
                      'Infant mortality rate 1990', 'Infant mortality rate 2021',
                      'Neonatal mortality rate 1990', 'Neonatal mortality rate 2000']
combined_df[columns_to_convert] = combined_df[columns_to_convert].apply(pd.to_numeric, errors='coerce')
# Melting the DataFrame to long format
melted_df = pd.melt(combined_df, id_vars=['Countries and areas'], value_vars=columns_to_convert,
                    var_name='Indicator and Year', value_name='Value')
# Splitting the 'Indicator and Year' column into separate 'Indicator' and 'Year' columns
melted_df[['Indicator', 'Year']] = melted_df['Indicator and Year'].str.rsplit(' ', n=1, expand=True)
# Drop the original 'Indicator and Year' column if it's no longer needed
melted_df = melted_df.drop('Indicator and Year', axis=1)
# Reordering columns for clarity
melted_df = melted_df[['Countries and areas', 'Year', 'Indicator', 'Value']]
melted_df.head()

| | Countries and areas | Year | Indicator | Value |
|---|---|---|---|---|
| 0 | Afghanistan | 1990 | Under-five mortality rate | 178 |
| 1 | Albania | 1990 | Under-five mortality rate | 41 |
| 2 | Algeria | 1990 | Under-five mortality rate | 52 |
| 3 | Andorra | 1990 | Under-five mortality rate | 13 |
| 4 | Angola | 1990 | Under-five mortality rate | 224 |
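
The wide-to-long step is easier to follow on a tiny table. The sketch below applies the same `pd.melt` and `str.rsplit` calls to made-up data with the same "indicator name, then year" column pattern. Converting `Year` to an integer is an extra step assumed here for sorting and plotting; the cell above leaves it as text.

```python
import pandas as pd

# Toy wide table with the same "<indicator> <year>" column pattern (values are made up).
wide = pd.DataFrame({
    'Countries and areas': ['A', 'B'],
    'Infant mortality rate 1990': [50, 30],
    'Infant mortality rate 2021': [20, 10],
})

# One row per country and column, with the old column name kept as a value.
tidy = pd.melt(wide, id_vars=['Countries and areas'],
               var_name='Indicator and Year', value_name='Value')

# Split on the last space only, so multi-word indicator names stay intact.
tidy[['Indicator', 'Year']] = tidy['Indicator and Year'].str.rsplit(' ', n=1, expand=True)
tidy['Year'] = tidy['Year'].astype(int)  # numeric years sort and plot in order
tidy = tidy.drop(columns='Indicator and Year')

print(tidy)
```

Two countries and two year columns produce four rows, one per country and year, which is the shape most plotting libraries expect for trends over time.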









In [51]:
melted_df.to_csv('/content/drive/My Drive/Infant-Mortality-Rate-Project-Analysis/transformed_data.csv')

# Examples of Visualization

In [55]:
# High-risk Group Identification: Top 10 Countries by Under-five Mortality Rate in 2021
high_risk_countries = combined_df.sort_values('Under-five mortality rate 2021', ascending=False).head(10)
fig = px.bar(high_risk_countries, x='Countries and areas', y='Under-five mortality rate 2021', title='Top 10 High-risk Countries for Under-five Mortality Rate in 2021')
fig.show()

In [53]:
# Impact Analysis: Correlation between rate of reduction and current mortality rate
# Note: Plotly doesn't directly compute correlation, so we'll just visualize the relationship
# Make sure the x-axis column is numeric; it was not part of the earlier numeric conversion
x_col = 'Annual rate of reduction in under-five mortality rate 2000-2021'
combined_df[x_col] = pd.to_numeric(combined_df[x_col], errors='coerce')
fig = px.scatter(combined_df, x=x_col, y='Under-five mortality rate 2021', title='Correlation between Rate of Reduction and Under-five Mortality Rate in 2021')
fig.show()
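
Since Plotly only draws the relationship, the correlation coefficient itself can be computed with pandas. The sketch below uses made-up values and a shortened column name in place of the two plotted columns; it assumes Pearson correlation is the measure of interest, which is the default for `Series.corr`.

```python
import pandas as pd

# Toy values standing in for the two plotted columns (made up).
toy = pd.DataFrame({
    'Annual rate of reduction': ['4', '5', '-', '2'],
    'Under-five mortality rate 2021': [56, 9, 22, 62],
})

# Coerce to numeric so placeholders such as '-' become NaN.
reduction = pd.to_numeric(toy['Annual rate of reduction'], errors='coerce')
rate = pd.to_numeric(toy['Under-five mortality rate 2021'], errors='coerce')

# Series.corr uses Pearson correlation by default and skips pairs containing NaN.
r = reduction.corr(rate)
print(f'Pearson r = {r:.2f}')
```

On the real data, the same two lines of conversion followed by `.corr()` would give a single number to report alongside the scatter plot.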