# **DATA-DRIVEN EXPLORATION OF GLOBAL HIV/AIDS TRENDS USING PYTHON**

## **Phase 1 - Data Loading and Initial Overview**

## **Project Overview**

**Domain** : Public Health Analytics – Global Disease Surveillance and Epidemiology



**Objective :**

*   Analyze the global status of HIV/AIDS using data from WHO and UNESCO sources.
*   Identify patterns and trends in HIV prevalence, AIDS-related deaths, and treatment coverage across countries.
*   Evaluate the effectiveness of ART coverage and prevention strategies over time.
*   Compare regional and country-wise disparities in HIV burden and access to healthcare services.
*   Derive actionable public health insights using exploratory data analysis (EDA) and data visualization techniques to support disease surveillance and health monitoring.


**Problem :**

**H**IV/AIDS remains one of the most critical global public health challenges, impacting millions of individuals across the world. Although significant progress has been made in medical treatment and prevention, the burden of HIV/AIDS is not evenly distributed. Some countries continue to experience high infection rates and mortality, while others show improvement due to better healthcare infrastructure and treatment access.

**A** major concern is the variation in access to preventive measures and antiretroviral therapy (ART). In several regions, limited healthcare resources, lack of awareness, and social barriers hinder effective disease control. As a result, understanding country-wise and regional differences in HIV prevalence, AIDS-related deaths, and treatment coverage is essential for informed decision-making and targeted interventions.

**T**he dataset used in this project provides detailed information on the number of people living with HIV, deaths due to HIV/AIDS, adult infection rates (ages 15–49), prevention of mother-to-child transmission estimates, and ART coverage among adults and children.

**T**here is a strong need to analyze this data systematically to identify trends, uncover hidden patterns, and detect gaps in treatment and prevention efforts. Through exploratory data analysis and effective visualizations, this project aims to highlight global and regional disparities and evaluate whether the HIV/AIDS situation is improving or worsening over time. The insights gained can assist healthcare organizations, policymakers, and stakeholders in strengthening disease surveillance, improving health monitoring, and supporting evidence-based public health strategies.

**Dataset :**

The dataset is sourced from authoritative organizations such as the World Health Organization (WHO) and UNESCO. It contains data on:

1. No. of people living with HIV/AIDS
2. No. of deaths due to HIV/AIDS
3. No. of cases among adults (15–49)
4. Prevention of mother-to-child transmission estimates
5. ART (antiretroviral therapy) coverage among people living with HIV estimates
6. ART (antiretroviral therapy) coverage among children estimates




## **Step 1: Data Loading and Initial Overview**

In this step, the dataset is loaded using the Pandas library to understand its basic structure. An initial overview is performed by examining the number of rows and columns, checking the data types of each feature, and reviewing sample records. This helps in getting familiar with the dataset and identifying any potential issues before proceeding with detailed analysis.

**1.1 Load All Files From ZIP**

In [1]:
from google.colab import files    # to upload files from the computer into the Colab environment.
import zipfile
import io

# Upload ZIP file
uploaded = files.upload()

# Get the uploaded ZIP filename
zip_filename = list(uploaded.keys())[0]

# Extract ZIP file
with zipfile.ZipFile(io.BytesIO(uploaded[zip_filename]), 'r') as zip_ref:
    zip_ref.extractall("data")

print("ZIP file uploaded and extracted successfully!")


Saving HIV_AIDS_MessyDATASET.zip to HIV_AIDS_MessyDATASET.zip
ZIP file uploaded and extracted successfully!
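
The upload cell above depends on Google Colab and extracts every member of the archive. As a minimal sketch of a variation that also runs outside Colab, the helper below extracts only the `.csv` members from raw ZIP bytes. The function name `extract_csvs` and the demo archive contents are hypothetical and are not part of the project data.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_csvs(zip_bytes: bytes, target_dir: str) -> list[str]:
    """Extract only the .csv members of a ZIP archive and return their names."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_members = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        archive.extractall(target_dir, members=csv_members)
    return sorted(csv_members)


# Demo archive built in memory; the member names are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("example.csv", "Country,Year\nAngola,2018\n")
    demo_zip.writestr("notes.txt", "not a dataset")

demo_dir = tempfile.mkdtemp()
extracted = extract_csvs(buffer.getvalue(), demo_dir)
print(extracted)
print(sorted(p.name for p in Path(demo_dir).iterdir()))
```

In the notebook, the same helper could be called as `extract_csvs(uploaded[zip_filename], "data")`, assuming `uploaded` and `zip_filename` come from the upload cell.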


**1.2 Display dataset dimensions**

The cell below:

*   Imports the pandas library, which is used for reading and analyzing datasets.
*   Imports the os module, which helps interact with the operating system (folders and files).
*   Stores the name of the folder where the extracted CSV files are located.
*   Loops through all files inside the data folder and keeps only those ending in .csv.
*   Reads each CSV file into a pandas DataFrame.
*   Prints the number of rows (df.shape[0]) and the number of columns (df.shape[1]) for each CSV file.

In [3]:
import pandas as pd
import os

data_path = "data"

for file in os.listdir(data_path):
    if file.endswith(".csv"):
        df = pd.read_csv(os.path.join(data_path, file))
        print(f"{file} → Rows: {df.shape[0]}, Columns: {df.shape[1]}")


no_of_deaths_by_country_clean.csv → Rows: 510, Columns: 7
art_pediatric_coverage_by_country_clean.csv → Rows: 170, Columns: 11
no_of_cases_adults_15_to_49_by_country_clean.csv → Rows: 680, Columns: 7
art_coverage_by_country_clean.csv → Rows: 170, Columns: 11
prevention_of_mother_to_child_transmission_by_country_clean.csv → Rows: 170, Columns: 11
no_of_people_living_with_hiv_by_country_clean.csv → Rows: 680, Columns: 7
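
Each later cell re-reads every CSV from disk, and `os.listdir` returns files in arbitrary order, which is why the listing above is not alphabetical. A minimal sketch of an alternative is to read each file once into a dictionary keyed by file name, in sorted order. The helper name `load_csv_folder` and the demo files are hypothetical.

```python
import os
import tempfile

import pandas as pd


def load_csv_folder(folder: str) -> dict[str, pd.DataFrame]:
    """Read every .csv file in `folder` into a DataFrame, keyed by file name."""
    frames = {}
    for name in sorted(os.listdir(folder)):  # sorted for a stable, repeatable order
        if name.endswith(".csv"):
            frames[name] = pd.read_csv(os.path.join(folder, name))
    return frames


# Demo with two tiny hypothetical files written to a temporary folder.
demo_dir = tempfile.mkdtemp()
pd.DataFrame({"Country": ["Angola", "Yemen"], "Year": [2018, 2018]}).to_csv(
    os.path.join(demo_dir, "b_demo.csv"), index=False
)
pd.DataFrame({"Country": ["Zambia"], "Year": [2000]}).to_csv(
    os.path.join(demo_dir, "a_demo.csv"), index=False
)

frames = load_csv_folder(demo_dir)
for name, frame in frames.items():
    print(f"{name} → Rows: {frame.shape[0]}, Columns: {frame.shape[1]}")
```

With the project data, `frames = load_csv_folder(data_path)` would let the later steps loop over `frames.items()` instead of calling `pd.read_csv` again in every cell.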


**1.3 Display the first 5 rows of the dataset**

We will examine the first few rows of the dataset to understand:

*   The structure and format of the data
*   Actual values in each column
*   Data types (numbers, text, etc.)
*   Any obvious data quality issues

The df.head() method displays the first 5 rows by default, giving a quick preview of the dataset; df.head(n) returns the first n rows, where n is an integer.

In [4]:
for file in os.listdir(data_path):
    if file.endswith(".csv"):
        df = pd.read_csv(os.path.join(data_path, file))
        print(f"\nFirst 5 rows of {file}:")
        display(df.head())



First 5 rows of no_of_deaths_by_country_clean.csv:


Unnamed: 0,Country,Year,Count,Count_median,Count_min,Count_max,WHO Region
0,Afghanistan,2018,500[200–610],500.0,200.0,610.0,Eastern Mediterranean
1,Albania,2018,na,,,,Europe
2,Algeria,2018,200[200–200],200.0,200.0,200.0,Africa
3,Angola,2018,14000[9500–18000],14000.0,9500.0,18000.0,Africa
4,Argentina,2018,1700[1300–2100],1700.0,1300.0,2100.0,Americas



First 5 rows of art_pediatric_coverage_by_country_clean.csv:


Unnamed: 0,Country,Reported number of children receiving ART,Estimated number of children needing ART based on WHO methods,Estimated ART coverage among children (%),Estimated number of children needing ART based on WHO methods_median,Estimated number of children needing ART based on WHO methods_min,Estimated number of children needing ART based on WHO methods_max,Estimated ART coverage among children (%)_median,Estimated ART coverage among children (%)_min,Estimated ART coverage among children (%)_max,WHO Region
0,Afghanistan,60,500[500-530],17[10-26],500.0,500.0,530.0,17.0,10.0,26.0,Eastern Mediterranean
1,Albania,20,Nodata,Nodata,,,,,,,Europe
2,Algeria,770,500[500-520],95[95-95],500.0,500.0,520.0,95.0,95.0,95.0,Africa
3,Angola,4800,38000[30000-47000],13[10-16],38000.0,30000.0,47000.0,13.0,10.0,16.0,Africa
4,Argentina,1700,1800[1600-2100],92[84-95],1800.0,1600.0,2100.0,92.0,84.0,95.0,Americas



First 5 rows of no_of_cases_adults_15_to_49_by_country_clean.csv:


Unnamed: 0,Country,Year,Count,Count_median,Count_min,Count_max,WHO Region
0,Afghanistan,2018,0.1[0.1–0.1],0.1,0.1,0.1,Eastern Mediterranean
1,Albania,2018,na,,,,Europe
2,Algeria,2018,0.1[0.1–0.1],0.1,0.1,0.1,Africa
3,Angola,2018,2.0[1.7–2.3],2.0,1.7,2.3,Africa
4,Argentina,2018,0.4[0.4–0.4],0.4,0.4,0.4,Americas



First 5 rows of art_coverage_by_country_clean.csv:


Unnamed: 0,Country,Reported number of people receiving ART,Estimated number of people living with HIV,Estimated ART coverage among people living with HIV (%),Estimated number of people living with HIV_median,Estimated number of people living with HIV_min,Estimated number of people living with HIV_max,Estimated ART coverage among people living with HIV (%)_median,Estimated ART coverage among people living with HIV (%)_min,Estimated ART coverage among people living with HIV (%)_max,WHO Region
0,Afghanistan,920,7200[4100–11000],13[7–20],7200.0,4100.0,11000.0,13.0,7.0,20.0,Eastern Mediterranean
1,Albania,580,Nodata,Nodata,,,,,,,Europe
2,Algeria,12800,16000[15000–17000],81[75–86],16000.0,15000.0,17000.0,81.0,75.0,86.0,Africa
3,Angola,88700,330000[290000–390000],27[23–31],330000.0,290000.0,390000.0,27.0,23.0,31.0,Africa
4,Argentina,85500,140000[130000–150000],61[55–67],140000.0,130000.0,150000.0,61.0,55.0,67.0,Americas



First 5 rows of prevention_of_mother_to_child_transmission_by_country_clean.csv:


Unnamed: 0,Country,Received Antiretrovirals,Needing antiretrovirals,Percentage Recieved,Needing antiretrovirals_median,Needing antiretrovirals_min,Needing antiretrovirals_max,Percentage Recieved_median,Percentage Recieved_min,Percentage Recieved_max,WHO Region
0,Afghanistan,20,200[100–500],11[7–18],200.0,100.0,500.0,11.0,7.0,18.0,Eastern Mediterranean
1,Albania,No data,Nodata,Nodata,,,,,,,Europe
2,Algeria,320,500[500–500],74[69–78],500.0,500.0,500.0,74.0,69.0,78.0,Africa
3,Angola,9600,25000[19000–32000],38[29–48],25000.0,19000.0,32000.0,38.0,29.0,48.0,Africa
4,Argentina,1800,1800[1600–2000],95[85–95],1800.0,1600.0,2000.0,95.0,85.0,95.0,Americas



First 5 rows of no_of_people_living_with_hiv_by_country_clean.csv:


Unnamed: 0,Country,Year,Count,Count_median,Count_min,Count_max,WHO Region
0,Afghanistan,2018,7200[4100–11000],7200.0,4100.0,11000.0,Eastern Mediterranean
1,Albania,2018,na,,,,Europe
2,Algeria,2018,16000[15000–17000],16000.0,15000.0,17000.0,Africa
3,Angola,2018,330000[290000–390000],330000.0,290000.0,390000.0,Africa
4,Argentina,2018,140000[130000–150000],140000.0,130000.0,150000.0,Americas
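
In the preview above, raw columns such as Count hold strings like `500[200–610]`, while the `_median`, `_min`, and `_max` columns hold the matching numbers. The sketch below shows one way such a string could be split into three floats. It is an assumption about how the derived columns relate to the raw text, not the method actually used to build the files, and `split_estimate` is a hypothetical name. The pattern accepts both the en dash and the plain hyphen because both appear in the files.

```python
import math
import re

# value[low–high] or value[low-high], with optional decimals
ESTIMATE_PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*\[\s*(\d+(?:\.\d+)?)\s*[–-]\s*(\d+(?:\.\d+)?)\s*\]\s*$"
)


def split_estimate(text):
    """Return (central, low, high) as floats, or three NaNs if the text does not match."""
    match = ESTIMATE_PATTERN.match(str(text))
    if match is None:  # e.g. "na" or "Nodata"
        return (math.nan, math.nan, math.nan)
    return tuple(float(group) for group in match.groups())


print(split_estimate("500[200–610]"))  # (500.0, 200.0, 610.0)
print(split_estimate("17[10-26]"))     # (17.0, 10.0, 26.0)
print(split_estimate("na"))            # (nan, nan, nan)
```

Comparing the parsed values against the existing `_median`, `_min`, and `_max` columns would be a quick consistency check on the derived columns.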


**1.4 Display last 5 rows of the dataset**

Checking the last few rows helps us:

*   Verify the entire file loaded completely
*   Check if data patterns change at the end
*   Ensure no corruption at the file's end

The df.tail() method displays the last 5 rows by default; df.tail(n) returns the last n rows.

In [5]:
for file in os.listdir(data_path):
    if file.endswith(".csv"):
        df = pd.read_csv(os.path.join(data_path, file))
        print(f"\nLast 5 rows of {file}:")
        display(df.tail())



Last 5 rows of no_of_deaths_by_country_clean.csv:


Unnamed: 0,Country,Year,Count,Count_median,Count_min,Count_max,WHO Region
505,Venezuela (Bolivarian Republic of),2000,na,,,,Americas
506,Viet Nam,2000,6100[4300–7800],6100.0,4300.0,7800.0,Western Pacific
507,Yemen,2000,100[100–200],100.0,100.0,200.0,Eastern Mediterranean
508,Zambia,2000,62000[49000–81000],62000.0,49000.0,81000.0,Africa
509,Zimbabwe,2000,120000[98000–150000],120000.0,98000.0,150000.0,Africa



Last 5 rows of art_pediatric_coverage_by_country_clean.csv:


Unnamed: 0,Country,Reported number of children receiving ART,Estimated number of children needing ART based on WHO methods,Estimated ART coverage among children (%),Estimated number of children needing ART based on WHO methods_median,Estimated number of children needing ART based on WHO methods_min,Estimated number of children needing ART based on WHO methods_max,Estimated ART coverage among children (%)_median,Estimated ART coverage among children (%)_min,Estimated ART coverage among children (%)_max,WHO Region
165,Venezuela (Bolivarian Republic of),No data,Nodata,Nodata,,,,,,,Americas
166,Viet Nam,4600,5000[4000-5900],92[74-95],5000.0,4000.0,5900.0,92.0,74.0,95.0,Western Pacific
167,Yemen,130,500[500-580],33[24-50],500.0,500.0,580.0,33.0,24.0,50.0,Eastern Mediterranean
168,Zambia,49 100,62000[52000-74000],79[65-93],62000.0,52000.0,74000.0,79.0,65.0,93.0,Africa
169,Zimbabwe,63 900,84000[65000-100000],76[59-93],84000.0,65000.0,100000.0,76.0,59.0,93.0,Africa



Last 5 rows of no_of_cases_adults_15_to_49_by_country_clean.csv:


Unnamed: 0,Country,Year,Count,Count_median,Count_min,Count_max,WHO Region
675,Venezuela (Bolivarian Republic of),2000,na,,,,Americas
676,Viet Nam,2000,0.3[0.2–0.3],0.3,0.2,0.3,Western Pacific
677,Yemen,2000,0.1[0.1–0.1],0.1,0.1,0.1,Eastern Mediterranean
678,Zambia,2000,16.2[14.3–18.2],16.2,14.3,18.2,Africa
679,Zimbabwe,2000,25.0[21.2–28.3],25.0,21.2,28.3,Africa



Last 5 rows of art_coverage_by_country_clean.csv:


Unnamed: 0,Country,Reported number of people receiving ART,Estimated number of people living with HIV,Estimated ART coverage among people living with HIV (%),Estimated number of people living with HIV_median,Estimated number of people living with HIV_min,Estimated number of people living with HIV_max,Estimated ART coverage among people living with HIV (%)_median,Estimated ART coverage among people living with HIV (%)_min,Estimated ART coverage among people living with HIV (%)_max,WHO Region
165,Venezuela (Bolivarian Republic of),Nodata,120000[100000–130000],Nodata,120000.0,100000.0,130000.0,,,,Americas
166,Viet Nam,150000,230000[200000–260000],65[57–73],230000.0,200000.0,260000.0,65.0,57.0,73.0,Western Pacific
167,Yemen,2200,11000[6500–18000],21[12–35],11000.0,6500.0,18000.0,21.0,12.0,35.0,Eastern Mediterranean
168,Zambia,965000,1200000[1100000–1400000],78[69–88],1200000.0,1100000.0,1400000.0,78.0,69.0,88.0,Africa
169,Zimbabwe,1151000,1300000[1100000–1500000],88[77–95],1300000.0,1100000.0,1500000.0,88.0,77.0,95.0,Africa



Last 5 rows of prevention_of_mother_to_child_transmission_by_country_clean.csv:


Unnamed: 0,Country,Received Antiretrovirals,Needing antiretrovirals,Percentage Recieved,Needing antiretrovirals_median,Needing antiretrovirals_min,Needing antiretrovirals_max,Percentage Recieved_median,Percentage Recieved_min,Percentage Recieved_max,WHO Region
165,Venezuela (Bolivarian Republic of),410,Nodata,Nodata,,,,,,,Americas
166,Viet Nam,1900,2400[2000–2800],81[69–95],2400.0,2000.0,2800.0,81.0,69.0,95.0,Western Pacific
167,Yemen,30,500[200–500],13[8–20],500.0,200.0,500.0,13.0,8.0,20.0,Eastern Mediterranean
168,Zambia,56 500,48000[38000–57000],95[94–95],48000.0,38000.0,57000.0,95.0,94.0,95.0,Africa
169,Zimbabwe,59 600,63000[48000–76000],94[71–95],63000.0,48000.0,76000.0,94.0,71.0,95.0,Africa



Last 5 rows of no_of_people_living_with_hiv_by_country_clean.csv:


Unnamed: 0,Country,Year,Count,Count_median,Count_min,Count_max,WHO Region
675,Venezuela (Bolivarian Republic of),2000,na,,,,Americas
676,Viet Nam,2000,120000[110000–130000],120000.0,110000.0,130000.0,Western Pacific
677,Yemen,2000,1100[680–2500],1100.0,680.0,2500.0,Eastern Mediterranean
678,Zambia,2000,890000[800000–1000000],890000.0,800000.0,1000000.0,Africa
679,Zimbabwe,2000,1600000[1400000–1900000],1600000.0,1400000.0,1900000.0,Africa
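
The last rows above show two quality issues in the reported-number columns: some values use a space as a thousands separator (for example `49 100`), and missing values appear as text placeholders (`na`, `Nodata`, `No data`). The sketch below shows one possible way to convert such a column to numbers. The helper name `to_number` and the placeholder set are assumptions based only on the values visible in this preview.

```python
import pandas as pd

# Placeholders as they look after whitespace is removed and text is lowercased
PLACEHOLDERS = {"na", "nodata"}


def to_number(series: pd.Series) -> pd.Series:
    """Convert text such as '49 100' to 49100.0 and placeholders to NaN."""
    compact = series.astype(str).str.replace(r"\s+", "", regex=True)  # "49 100" -> "49100"
    compact = compact.mask(compact.str.lower().isin(PLACEHOLDERS))    # placeholders -> NaN
    return pd.to_numeric(compact, errors="coerce")


reported = pd.Series(["4600", "130", "49 100", "No data", "Nodata"])
cleaned = to_number(reported)
print(cleaned.tolist())
```

Because `errors="coerce"` silently turns any unexpected text into NaN, it is worth comparing the count of missing values before and after the conversion.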


**1.5 Dataset Information and Data Types**

**Understanding Data Types is Crucial for:**

*   Finding columns that need to be changed to the correct data type
*   Choosing the right methods for analysis and visualization
*   Spotting errors, inconsistencies, or poor-quality data early
*   Knowing how much memory the dataset uses and managing it better

**The info() Method Provides:**

*  The total number of rows present in the dataset
*  Names of columns along with their data types
*   Count of non-missing values in each column, helping to find missing data
*   Information about the memory used by the dataset

In [6]:
for file in os.listdir(data_path):
    if file.endswith(".csv"):
        df = pd.read_csv(os.path.join(data_path, file))
        print(f"\nDataset Information for {file}:")
        df.info()



Dataset Information for no_of_deaths_by_country_clean.csv:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 510 entries, 0 to 509
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country       510 non-null    object 
 1   Year          510 non-null    int64  
 2   Count         510 non-null    object 
 3   Count_median  400 non-null    float64
 4   Count_min     400 non-null    float64
 5   Count_max     400 non-null    float64
 6   WHO Region    510 non-null    object 
dtypes: float64(3), int64(1), object(3)
memory usage: 28.0+ KB

Dataset Information for art_pediatric_coverage_by_country_clean.csv:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 11 columns):
 #   Column                                                                Non-Null Count  Dtype  
---  ------                                                                --------------  -----  
 0   Country 
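
In the `info()` output for the deaths file above, Count shows 510 non-null values while Count_median shows only 400. The raw column stores missing values as text such as `na`, so pandas does not count them as missing. The sketch below builds a compact per-column missing-value report; the helper name `missing_report` is hypothetical, and the demo rows are copied from the first rows displayed in section 1.3.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and missing percentage for each column."""
    report = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(1),
        }
    )
    return report.sort_values("missing", ascending=False)


demo = pd.DataFrame(
    {
        "Country": ["Afghanistan", "Albania", "Algeria"],
        "Count": ["500[200–610]", "na", "200[200–200]"],
        "Count_median": [500.0, None, 200.0],
    }
)
report = missing_report(demo)
print(report)
```

Here the text placeholder in Count is reported as zero missing values, while Count_median correctly reports one.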

**1.6 Statistical Summary of Numerical Features**

The describe() function provides key statistics for all numerical columns:

**Statistical Measures**:


*   count: Number of non-null values
*   mean: Average value
*   std: Standard deviation (variability in the data)
*   min: Minimum value
*   max: Maximum value
*   25%, 50%, 75%: First Quartile (Q1), Median, and Third Quartile (Q3)



**This helps to identify:**

*   Data distribution and spread
*   Potential outliers (unusually high or low values)
*   Missing data (if count is less than total rows)
*   Columns stored incorrectly as text (these won't appear in the summary)

In [8]:
for file in os.listdir(data_path):
    if file.endswith(".csv"):
        df = pd.read_csv(os.path.join(data_path, file))
        print(f"\nStatistical Summary for {file}:")

        # Generate statistical summary and round values
        summary = df.describe().round(2)
        display(summary)




Statistical Summary for no_of_deaths_by_country_clean.csv:


Unnamed: 0,Year,Count_median,Count_min,Count_max
count,510.0,400.0,400.0,400.0
mean,2009.33,6871.82,5232.48,9116.42
std,7.37,17748.91,13557.74,24101.58
min,2000.0,100.0,100.0,100.0
25%,2000.0,100.0,100.0,100.0
50%,2010.0,605.0,500.0,875.0
75%,2018.0,3825.0,3025.0,5025.0
max,2018.0,140000.0,110000.0,190000.0



Statistical Summary for art_pediatric_coverage_by_country_clean.csv:


Unnamed: 0,Estimated number of children needing ART based on WHO methods_median,Estimated number of children needing ART based on WHO methods_min,Estimated number of children needing ART based on WHO methods_max,Estimated ART coverage among children (%)_median,Estimated ART coverage among children (%)_min,Estimated ART coverage among children (%)_max
count,102.0,102.0,102.0,93.0,93.0,93.0
mean,15963.92,12347.94,20667.75,48.74,39.86,57.96
std,37717.28,28997.47,51232.66,27.15,24.98,28.96
min,100.0,100.0,100.0,5.0,4.0,6.0
25%,500.0,500.0,500.0,25.0,19.0,30.0
50%,1900.0,1600.0,2400.0,41.0,34.0,55.0
75%,11000.0,8350.0,13000.0,70.0,54.0,89.0
max,260000.0,200000.0,360000.0,95.0,95.0,95.0



Statistical Summary for no_of_cases_adults_15_to_49_by_country_clean.csv:


Unnamed: 0,Year,Count_median,Count_min,Count_max
count,680.0,556.0,556.0,556.0
mean,2008.25,2.03,1.76,2.3
std,6.65,4.58,4.12,4.98
min,2000.0,0.1,0.1,0.1
25%,2003.75,0.1,0.1,0.1
50%,2007.5,0.4,0.3,0.5
75%,2012.0,1.5,1.2,1.9
max,2018.0,27.4,25.2,29.3



Statistical Summary for art_coverage_by_country_clean.csv:


Unnamed: 0,Estimated number of people living with HIV_median,Estimated number of people living with HIV_min,Estimated number of people living with HIV_max,Estimated ART coverage among people living with HIV (%)_median,Estimated ART coverage among people living with HIV (%)_min,Estimated ART coverage among people living with HIV (%)_max
count,138.0,138.0,138.0,136.0,136.0,136.0
mean,227337.68,195366.01,263125.43,55.55,46.83,65.53
std,743297.75,668940.91,824131.12,20.1,18.39,21.83
min,200.0,100.0,500.0,9.0,7.0,11.0
25%,7250.0,6275.0,8325.0,41.0,34.75,49.75
50%,30500.0,25500.0,35500.0,56.0,47.0,67.0
75%,137500.0,117500.0,150000.0,71.25,60.0,85.0
max,7700000.0,7100000.0,8300000.0,92.0,84.0,95.0



Statistical Summary for prevention_of_mother_to_child_transmission_by_country_clean.csv:


Unnamed: 0,Needing antiretrovirals_median,Needing antiretrovirals_min,Needing antiretrovirals_max,Percentage Recieved_median,Percentage Recieved_min,Percentage Recieved_max
count,100.0,100.0,100.0,92.0,92.0,92.0
mean,12289.4,9163.5,15168.2,67.51,55.15,74.8
std,35475.13,25867.59,43534.54,27.79,24.85,26.8
min,100.0,100.0,100.0,5.0,2.0,9.0
25%,500.0,200.0,500.0,45.5,34.75,57.5
50%,1100.0,820.0,1400.0,79.0,59.5,94.5
75%,5525.0,4700.0,6650.0,93.0,71.0,95.0
max,290000.0,210000.0,350000.0,95.0,95.0,95.0



Statistical Summary for no_of_people_living_with_hiv_by_country_clean.csv:


Unnamed: 0,Year,Count_median,Count_min,Count_max
count,680.0,553.0,553.0,553.0
mean,2008.25,185791.83,158800.49,215200.83
std,6.65,575675.03,509382.27,643259.89
min,2000.0,100.0,100.0,100.0
25%,2003.75,3700.0,3100.0,4600.0
50%,2007.5,21000.0,18000.0,27000.0
75%,2012.0,110000.0,94000.0,130000.0
max,2018.0,7700000.0,7100000.0,8300000.0
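
The quartiles reported by `describe()` can be used to flag potential outliers. One common convention, sketched below, is the 1.5 × IQR rule: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are flagged. This is one of several possible rules and is not taken from the original analysis; the helper name `iqr_outliers` and the demo values are hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Hypothetical values chosen only to illustrate the rule.
demo_values = pd.Series([1, 2, 3, 4, 100])
flags = iqr_outliers(demo_values)
print(flags.tolist())  # [False, False, False, False, True]
```

The count columns here are strongly right-skewed. In the summary above, the median of Count_median for people living with HIV is 21,000 while the maximum is 7,700,000, so this rule would likely flag high-burden countries. Such values should be read as real extremes to investigate rather than as errors to drop.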
