
## Life Span Trends: Analyzing Longevity

### Project Description:
Our aim is to uncover what contributes to longer lifespans by analyzing economic factors (GDP), health factors (diseases), and healthcare spending across countries.

### Research Questions to Answer:
1. On average, do people in developed or developing countries live longer?
2. Do countries with a high GDP spend more on health (% expenditure)?
3. Do countries that spend more on health have fewer diseases?
    - vs. the average number of diseases (for each)
    - do the 5 countries with the fewest diseases have a higher life expectancy?

Dataset: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who/data

Note: Question 1 is answered by the script below. For Question 2 and Question 3, please review Lakshmi's and Jasleen's findings.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from pathlib import Path 
from scipy.stats import linregress
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

# The path to our CSV file
life_expectancy_data = Path("Life Expectancy Data.csv")

# Read our Life Expectancy data into pandas
life_expectancy_df = pd.read_csv(life_expectancy_data)

In [None]:
life_expectancy_df.info()

In [None]:
print(f"Number of countries in the final file is {life_expectancy_df['Country'].nunique()}")

## Data Cleaning

In [None]:
# life_expectancy_df holds the raw data loaded above
# Find the first year
first_year = life_expectancy_df['Year'].min()

# Find the last year
last_year = life_expectancy_df['Year'].max()

print(first_year)
print(last_year)

In [None]:
# List of the range of years for the analysis

years_list = list(range(first_year,last_year +1, 1))

years_list

In [None]:
# Get a list of all of our columns for easy reference
life_expectancy_df.columns

In [None]:
# Remove unnecessary columns
lifee_reduced = life_expectancy_df.drop(columns=['percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources'])

# Strip the trailing space from the 'Life expectancy ' column name
lifee_reduced.rename(columns={'Life expectancy ': 'Life expectancy'}, inplace=True)

lifee_reduced.head()

In [None]:
# Count the number of rows for each country
country_row_counts = lifee_reduced['Country'].value_counts()

# Print the result
print(country_row_counts.tail(15))

# Export the DataFrame to a CSV file to check just to make sure
# so we can decide if we need to drop any countries that do not have all the years
# country_row_counts.to_csv('country_row_counts.csv', index=False)

In [None]:
# Filter out countries with only 1 year of data
# Group the data by country and count the number of years for each country
country_counts = lifee_reduced.groupby('Country')['Year'].count()

# Then, filter out the countries with only one year of data
countries_with_multiple_years = country_counts[country_counts > 1].index.tolist()

# Finally, filter the original DataFrame to include only the countries with more than one year of data
filtered_df = lifee_reduced[lifee_reduced['Country'].isin(countries_with_multiple_years)]

filtered_df.head()


#### Population
'Population' has only 2,286 non-null values out of 2,938 rows.
Let's check which countries have null values.

In [None]:
# Filter the DataFrame to include only rows where Population is NaN
countries_with_nan_population = filtered_df[filtered_df['Population'].isna()]

# Export the DataFrame to a CSV file to check just to make sure
# so we can decide if we need to drop
# countries_with_nan_population.to_csv('countries_with_nan_population.csv', index=False)

# Get unique country names
unique_countries_with_nan_population = countries_with_nan_population['Country'].unique()

# Print the unique country names
unique_countries_with_nan_population



### Discussion - Population data
The Population series is missing data for 41 countries, including prominent nations like the United States, the United Kingdom and New Zealand, which is a significant limitation. Merging with additional datasets could fill this gap, but populations change constantly due to factors like immigration, which adds complexity and may distort any analysis of the relationship between population and life expectancy.

Option: find another dataset to merge with.

Example:
https://www.kaggle.com/datasets/iamsouravbanerjee/world-population-dataset

About this additional dataset, for further use:
It provides population counts at intervals of several years (e.g. 2015 Population, 2010 Population, 2000 Population) and a Growth Rate series.
Users can estimate the total population by country for each year from 2000 to 2015 by calculation.

Columns available: ['Rank', 'CCA3', 'Country/Territory', 'Capital', 'Continent', '2022 Population', '2020 Population', '2015 Population', '2010 Population', '2000 Population', '1990 Population', '1980 Population', '1970 Population', 'Area (km²)', 'Density (per km²)', 'Growth Rate', 'World Population Percentage']
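One possible way to do that calculation is straight-line interpolation between the known years. The sketch below is only an illustration: the population figures are made up for a single hypothetical country, and linear interpolation is an assumption, not a method taken from either dataset.

```python
import pandas as pd

# Hypothetical figures for one made-up country; real values would come from the
# '2000 Population', '2010 Population' and '2015 Population' columns.
known = pd.Series({2000: 1_000_000, 2010: 1_200_000, 2015: 1_350_000}, dtype="float64")

# Reindex to every year from 2000 to 2015, leaving the unknown years as NaN
yearly = known.reindex(range(2000, 2016))

# Fill the gaps with straight-line interpolation between the known points
estimated_population = yearly.interpolate(method="linear")

print(estimated_population.round(0))
```

Because the index covers every year with no gaps, `method="linear"` spaces the estimates evenly between the known counts. A compound growth model would be another reasonable choice.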

## Question 1: On average, do people in developed or developing countries live longer?

In [None]:
# Get the development status (Developed / Developing) of each country
per_country_status = life_expectancy_df.groupby("Country")["Status"].unique().str[0]
per_country_status.head()

In [None]:
# Calculate the average life expectancy (in years) for each country
per_country_life_ex = life_expectancy_df.groupby("Country")["Life expectancy "].mean()
per_country_life_ex.head()


In [None]:
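# Calculate the average population for each country
# Note: the formatting step turns the values into strings, so a missing average becomes the string 'nan'
per_country_avg_p = filtered_df.groupby("Country")["Population"].mean().map('{:,.0f}'.format)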
print(per_country_avg_p.tail(25))

In [None]:
# Calculate the average Adult Mortality 
# (Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population))

per_country_adult_mor = filtered_df.groupby("Country")["Adult Mortality"].mean()
per_country_adult_mor

In [None]:
# Calculate the average infant deaths
# infant deaths (Number of Infant Deaths per 1000 population)

per_country_infant_deaths = filtered_df.groupby("Country")["infant deaths"].mean()
per_country_infant_deaths

In [None]:
# Create a DataFrame called `per_country_summary` with columns for the calculations above.
per_country_summary = pd.DataFrame({"Country_status": per_country_status,
                                    "Avg. Population": per_country_avg_p,
                                    "Avg. Life Expectancy": per_country_life_ex,
                                    "Adult Mortality Rates": per_country_adult_mor,
                                    "Avg. Infant Deaths":per_country_infant_deaths})

per_country_summary = per_country_summary.sort_values(["Avg. Life Expectancy"],ascending=False)

drop_per_country_summary = per_country_summary.dropna(how="any")

drop_per_country_summary


In [None]:
status_counts = drop_per_country_summary['Country_status'].value_counts()
status_counts

In [None]:
developed_df = filtered_df.loc[:, ["Year", "Status", "Life expectancy"]]
developed_df = developed_df.dropna(how="any")

developed_group_df = developed_df.groupby(["Status", "Year"])
average = developed_group_df.mean().round(2)

# Define colors for each status
colors = {"Developed": "#0EA4D5", "Developing": "#AC2D47"}

# Plot the line graph with different colors for each status and smaller circle markers
ax = average["Life expectancy"].unstack(level=0).plot(kind="line", figsize=(10, 5), rot=90, color=colors, marker='o', markersize=3)

# Add data labels
for status in colors:
    for year, life_expectancy in average["Life expectancy"][status].items():
        ax.annotate(f'{life_expectancy}', xy=(year, life_expectancy), xytext=(-10, 5), textcoords='offset points')

# Add legend
ax.legend(title="Status")

# Set y-axis range from 60 to 90
ax.set_ylim(60, 90)

plt.title("Life Expectancy Growth by Year", fontweight="bold")
plt.ylabel("Age", fontweight="bold")
plt.xlabel("Year", fontweight="bold")
plt.tight_layout()
plt.show()



### Discussion - On average, do people in developed or developing countries live longer?
On average, developed countries have a higher life expectancy than developing countries, although both groups show an increase over the years.

The average life expectancy for developed countries ranges from 76.8 to 80.71 years.
The average life expectancy for developing countries ranges from 64.62 to 69.69 years.
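Ranges like these can be read directly from the grouped yearly averages rather than from the chart labels. The sketch below shows one way to do it, using a small made-up sample (`sample_df` is hypothetical and not taken from the WHO dataset) with the same `Year`, `Status` and `Life expectancy` columns.

```python
import pandas as pd

# Made-up sample rows for illustration only, NOT taken from the WHO dataset
sample_df = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001, 2000, 2000, 2001, 2001],
    "Status": ["Developed", "Developed", "Developed", "Developed",
               "Developing", "Developing", "Developing", "Developing"],
    "Life expectancy": [76.0, 78.0, 77.0, 79.0, 60.0, 64.0, 62.0, 66.0],
})

# Average life expectancy per status and year
yearly_avg = sample_df.groupby(["Status", "Year"])["Life expectancy"].mean()

# Lowest and highest yearly average for each status
status_range = yearly_avg.groupby(level="Status").agg(["min", "max"])

print(status_range)
```

Swapping `sample_df` for the notebook's own filtered data should give the minimum and maximum yearly averages per status in a single table.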




## Further Exploration of the Columns in the Dataset
Note: The material below will not make it into the main presentation, but it is good preparation for FAQs, as it takes a closer look at other factors in the dataset that may influence life expectancy.


### Schooling v.s. Life expectancy

In [None]:
# Schooling v.s. Life expectancy

y_data="Life expectancy "
x_data="Schooling"
xy_df=life_expectancy_df.loc[:,[x_data,y_data]]

xy_df=xy_df.dropna(how="any")
xy_df.plot(kind="scatter",x=x_data,y=y_data)

x=xy_df[x_data]
y=xy_df[y_data]

(slope, intercept, rvalue, pvalue, stderr) = linregress(x,y)
print(f"The r-value is: {rvalue}")
line_eq=f"y = {round(slope,2)}x + {round(intercept,2)}"
regress_values = x * slope + intercept

plt.plot(x,regress_values,"r-")
plt.annotate(line_eq, ((x.min(),y.max())), fontsize=15, color="red") 
plt.title(f"{x_data} V.S. {y_data}")

### Discussion - Schooling v.s. Life Expectancy
While there appears to be a strong positive correlation (r-value of 0.75) between schooling and life expectancy, this relationship can be interpreted in more than one way.

Improved education is often associated with greater awareness of health risks and better access to preventive healthcare knowledge, resulting in healthier behaviors and longer life expectancy.

However, the causation could also run the other way: longer life expectancy may lead to more years of schooling, as individuals have more time to pursue education before entering the workforce.
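The r-value itself cannot tell these two readings apart, because the correlation coefficient is symmetric in its two variables. The sketch below illustrates this with synthetic data (the numbers are generated for illustration only and are not from the dataset); `np.corrcoef` returns the same Pearson r that `linregress` reports as `rvalue`.

```python
import numpy as np

# Synthetic data for illustration only, NOT taken from the WHO dataset
rng = np.random.default_rng(seed=0)
schooling = rng.uniform(0, 20, size=200)
life_expectancy = 45 + 2 * schooling + rng.normal(0, 5, size=200)

# Pearson r with schooling as the first variable
r_forward = np.corrcoef(schooling, life_expectancy)[0, 1]

# Pearson r with the two variables swapped
r_reverse = np.corrcoef(life_expectancy, schooling)[0, 1]

print(f"r (schooling, life expectancy): {r_forward:.4f}")
print(f"r (life expectancy, schooling): {r_reverse:.4f}")
```

Both printed values are identical, so the correlation alone says nothing about which variable drives the other.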


### Alcohol v.s. Life expectancy

In [None]:
#Alcohol v.s. Life expectancy
y_data="Life expectancy "
x_data="Alcohol"
xy_df=life_expectancy_df.loc[:,[x_data,y_data]]

xy_df=xy_df.dropna(how="any")
xy_df.plot(kind="scatter",x=x_data,y=y_data)

x=xy_df[x_data]
y=xy_df[y_data]

(slope, intercept, rvalue, pvalue, stderr) = linregress(x,y)
print(f"The r-value is: {rvalue}")
line_eq=f"y = {round(slope,2)}x + {round(intercept,2)}"
regress_values = x * slope + intercept

plt.plot(x,regress_values,"r-")
plt.annotate(line_eq, ((x.min(),y.max())), fontsize=15, color="red") 
plt.title(f"{x_data} V.S. {y_data}")

### Discussion - Alcohol v.s. Life expectancy
RE: Topic "Dataset Limitation" in the main presentation

Even though the dataset provides per capita alcohol consumption for people aged 15+, its scope is limited to the volume of pure alcohol consumed. This restricts the ability to establish meaningful correlations between alcohol consumption and life expectancy. Further research incorporating variables such as the type of alcohol consumed (red wine, beer, hard liquor) or the context of consumption (social occasions, cultural practices) may yield better insights into this relationship.