# Part (d): Correlation
### UE22CS342AA2 - Data Analytics 

- `Correlation` is a measure of the strength and direction of the linear relationship between two random variables; in other words, it measures the association between two variables.
- Correlation is a descriptive statistic that lies in the range `[-1, 1]`.
- There are different types of correlation coefficients, based on the nature of the data being compared:
    - Between two continuous (interval, ratio) random variables - Pearson’s Product Moment Correlation Coefficient
    - Between two ordinal random variables - Spearman-Rank Correlation Coefficient
    - Between a continuous RV and a dichotomous RV - Point Bi-Serial Correlation Coefficient
    - Between two binary random variables - Phi Coefficient
    
    
- The assignment has the following problems:
    - Problem 1
    - Problem 2
    - Problem 3
    - Problem 4
    - Problem 5
    - Problem 6
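The four coefficients listed above can be sketched on toy data. This is a minimal illustration, assuming made-up vectors (`x_cont`, `y_cont`, `x_bin`, `y_bin` are not from the assignment dataset). It relies on the fact that the point-biserial and phi coefficients are Pearson's coefficient applied to 0/1-coded variables, and that Spearman's coefficient is Pearson's coefficient applied to ranks.

```r
# Toy data, invented purely for illustration (not from the assignment dataset).
x_cont <- c(2.1, 3.4, 1.8, 5.0, 4.2, 3.9, 2.7, 4.8)          # continuous
y_cont <- c(10.5, 14.1, 9.7, 19.8, 15.0, 15.6, 12.3, 18.9)   # continuous
x_bin  <- c(0, 1, 0, 1, 1, 1, 0, 1)                          # dichotomous
y_bin  <- c(0, 1, 0, 1, 0, 1, 0, 1)                          # dichotomous

# Pearson: two continuous variables.
pearson_r <- cor(x_cont, y_cont, method = "pearson")

# Spearman: Pearson applied to the ranks of each variable.
spearman_r <- cor(x_cont, y_cont, method = "spearman")

# Point-biserial: Pearson with one variable coded 0/1.
point_biserial_r <- cor(y_cont, x_bin)

# Phi: Pearson with both variables coded 0/1.
phi_r <- cor(x_bin, y_bin)

round(c(pearson = pearson_r, spearman = spearman_r,
        point_biserial = point_biserial_r, phi = phi_r), 3)
```

All four values fall in `[-1, 1]`; which coefficient is appropriate depends only on the measurement scale of the two variables.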

*Snippet to install a package cleanly*
```
if (!requireNamespace("tidyverse", quietly = TRUE)) {
    install.packages("tidyverse")
}
```
*Load a package*

```
library(tidyverse)
```


# About The Dataset

- The dataset is a summary of various health and disease related statistics of various countries and the effect the economy has on the health status (or maybe not). Below is the description of each column of the dataset.

1. **SlNo**: Serial Number – A unique identifier for each record.
2. **Country**: Name of the country where the data was collected.
3. **Year**: Year in which the data was recorded (Ranges from 2001 to 2004).
4. **Status**: Development status of the country (e.g., Developing, Developed).
5. **Life_Expectancy**: Average life expectancy at birth in years.
6. **Tuberculosis**: Number of tuberculosis cases per 100,000 people.
7. **Influenza**: Number of influenza cases per 100,000 people.
8. **Adult_Mortality**: Number of adult deaths (per 1,000) between ages 15 and 60.
9. **Infant_Deaths**: Number of infant deaths.
10. **Alcohol**: Per capita alcohol consumption (in liters).
11. **Percentage_Expenditure**: Expenditure on health as a percentage of Gross Domestic Product per capita (%).
12. **Hepatitis_B**: Number of Hepatitis B cases per 100,000 people.
13. **Measles**: Number of measles cases per 100,000 people.
14. **Under_Five_Deaths**: Number of under-five deaths per 1,000 population.
15. **Polio**: Number of polio cases per 100,000 people.
16. **Total_Expenditure**: General government expenditure on health as a percentage of total government expenditure (%).
17. **Diphtheria**: Percentage of children vaccinated against diphtheria.
18. **HIV_AIDS**: Number of HIV/AIDS cases per 100,000 people.
19. **GDP**: Gross Domestic Product (per capita).
20. **Population**: Total population of the country.
21. **Thinness_1_19_Years**: Prevalence of thinness among children and adolescents for Age 10 to 19 (%).
22. **Thinness_5_9_Years**: Prevalence of thinness among children for Age 5 to 9 (%).
23. **Income_Composition_Of_Resources**: Human Development Index in terms of income composition of resources (index ranging from 0 to 1).
24. **Schooling**: Number of years of schooling (years).

In [None]:
# Install necessary packages.

if (!requireNamespace("tidyverse", quietly = TRUE)) {
    install.packages("tidyverse")
}
if (!requireNamespace("moments", quietly = TRUE)) {
    install.packages("moments") 
}
if (!requireNamespace("readr", quietly = TRUE)) { 
    install.packages("readr")
}
if (!requireNamespace("ltm", quietly = TRUE)) { 
    install.packages("ltm")
}
if (!requireNamespace("psych", quietly = TRUE)) { 
    install.packages("psych")
}

In [None]:
# Load necessary packages
library(tidyverse)
library(moments) 
library(ggplot2)
library(readr)
library(dplyr)
library(ggpubr)
library(ltm)
library(psych)

In [None]:
# Read the data frame from the dataset.
data <- read.csv("/kaggle/input/health-dataset-correlation/health_dataset_final.csv")
head(data)

*Problem 1*

For the year 2002, find the total number of airborne disease cases in each country, sorted in descending order. (1 point)

In [None]:
data_2002 <- data %>%
  filter(Year == 2002)

# Calculate total airborne disease cases
data_2002 <- data_2002 %>%
  mutate(Total_Airborne_Diseases = Influenza + Measles + Polio)

# Sum cases for each country
result <- data_2002 %>%
  group_by(Country) %>%
  summarise(Total_Airborne_Diseases = sum(Total_Airborne_Diseases)) %>%
  arrange(desc(Total_Airborne_Diseases))

# Print the result
result

*Problem 2*

For the year 2001, what is the strength of the relation between alcohol consumption and life expectancy, and in what direction? State any issues with this, if present. How will you rectify them? Submit observations after resolution and justify the changes. Do analyse the scatter plot. (2 points)

In [None]:
# fetch the data pertaining to year 2001
data_2001 = subset(data, Year == 2001)
head(data_2001)

In [None]:
# Make a Scatter plot for the datapoints.
plot(data_2001$Alcohol, data_2001$Life_Expectancy,
     main = "Scatter Plot of Alcohol Consumption vs. Life Expectancy",
     xlab = "Alcohol Consumption (liters per capita consumption)",
     ylab = "Life Expectancy (years)",
     pch = 19, col = "maroon"
)

# Calculate the best-fit line
fit <- lm(Life_Expectancy ~ Alcohol, data = data_2001)

# Add the best-fit line to the plot
abline(fit, col = "red", lwd = 2)

# Find the correlation here (the data with outliers).
corr_with_out = cor(data_2001$Life_Expectancy, data_2001$Alcohol)
print(paste("The correlation is: ",corr_with_out))

In [None]:
# Filter out the rows to remove the outlier.
data_2001_wo_outlier = subset(data_2001,Life_Expectancy < 150 )
plot(data_2001_wo_outlier$Alcohol, data_2001_wo_outlier$Life_Expectancy,
     main = "Scatter Plot of Alcohol Consumption vs. Life Expectancy",
     xlab = "Alcohol Consumption (liters per capita consumption)",
     ylab = "Life Expectancy (years)",
     pch = 19, col = "maroon"
)

# Calculate the best-fit line
fit <- lm(Life_Expectancy ~ Alcohol, data = data_2001_wo_outlier)

# Add the best-fit line to the plot
abline(fit, col = "red", lwd = 2)

# Get the correlation after removing the outlier.
corr_final = cor(data_2001_wo_outlier$Alcohol, data_2001_wo_outlier$Life_Expectancy)
print(paste("The correlation is: ",corr_final))

In [None]:
data_2001_wo_outlier

In [None]:
# Scatter plot with a regression line and its confidence band; the annotated coefficient is Spearman's.
ggscatter(
    data_2001_wo_outlier, x='Alcohol', y='Life_Expectancy', add='reg.line', conf.int=TRUE, 
    cor.coef=TRUE, cor.method = 'spearman'
)

- In the first plot above, we observe a moderately negative correlation between alcohol consumption and life expectancy; however, there is an outlier.
- The correlation coefficient, which measures the strength of the relation between two columns, is sensitive to outliers, as shown above.
- The outlier has a life expectancy of 195.65 years, which is implausible for humans, so we treat it as an erroneous record.
- After removing this outlier, the correlation shifts from -0.63 to -0.88, indicating a strong negative correlation.
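The sensitivity to outliers described above can be reproduced on synthetic data. This is a minimal sketch, assuming invented values (`alcohol`, `life_exp` and the appended point are hypothetical and only mimic the shape of the problem). It compares Pearson's coefficient before and after adding one implausible record, and shows the rank-based Spearman coefficient for contrast.

```r
# Synthetic data, invented for illustration only (deterministic, no RNG).
alcohol  <- seq(0.5, 12, length.out = 24)
noise    <- rep(c(1, -1, 0.5, -0.5), times = 6)
life_exp <- 80 - 1.5 * alcohol + noise

# Pearson correlation on the clean data.
r_clean <- cor(alcohol, life_exp)

# Append one implausible point, similar in spirit to the 195.65-year record.
alcohol_out  <- c(alcohol, 6)
life_exp_out <- c(life_exp, 195)

# Pearson is pulled towards zero; Spearman only sees the point's rank.
r_pearson_out  <- cor(alcohol_out, life_exp_out)
r_spearman_out <- cor(alcohol_out, life_exp_out, method = "spearman")

c(clean = r_clean,
  pearson_with_outlier = r_pearson_out,
  spearman_with_outlier = r_spearman_out)
```

A single extreme value inflates the variance of one variable and weakens Pearson's coefficient, which matches the direction of the shift reported above once the outlier is removed.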

*Problem 3*

Convert the Tuberculosis column of countries in 2001 to a binary column based on the presence or absence of tuberculosis cases. Calculate the point-biserial correlation between the binary tuberculosis column and the life expectancy column. (2 points)

In [None]:
data_2001_wo_outlier$Tuberculosis_Binary <- ifelse(data_2001_wo_outlier$Tuberculosis == 0, 0, 1)
data_new = data_2001_wo_outlier

In [None]:
# performing the correlation test to find any relation.
correlation_test <- cor.test(data_new$Life_Expectancy, data_new$Tuberculosis_Binary, method = "pearson")

# Extract and print the correlation coefficient and p-value
correlation <- correlation_test$estimate
p_value <- correlation_test$p.value

# Print the results
cat("Point-Biserial Correlation Coefficient between Life Expectancy and Binary Tuberculosis:", correlation, "\n")
cat("p-value:", p_value, "\n")

- There is no significant correlation between life expectancy and the binary Tuberculosis column.
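Using `method = "pearson"` for a point-biserial correlation works because the point-biserial coefficient is Pearson's coefficient with one variable coded 0/1. The sketch below checks this on toy values (all numbers are invented; `r_pb_manual` is a hypothetical name) against the textbook form $r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{\frac{n_1 n_0}{n^2}}$, where $s_n$ is the standard deviation computed with divisor $n$.

```r
# Toy values, invented for illustration only.
life_exp  <- c(52.1, 60.3, 71.8, 79.4, 65.0, 81.2, 58.7, 74.5)
tb_cases  <- c(120, 85, 0, 0, 40, 0, 210, 0)
tb_binary <- ifelse(tb_cases == 0, 0, 1)   # 1 = tuberculosis present

n  <- length(life_exp)
n1 <- sum(tb_binary == 1)
n0 <- sum(tb_binary == 0)
m1 <- mean(life_exp[tb_binary == 1])       # mean of the "present" group
m0 <- mean(life_exp[tb_binary == 0])       # mean of the "absent" group

# Standard deviation with divisor n (not n - 1, which sd() uses).
s_n <- sqrt(mean((life_exp - mean(life_exp))^2))

r_pb_manual <- (m1 - m0) / s_n * sqrt(n1 * n0 / n^2)
r_pb_cor    <- cor(life_exp, tb_binary)

all.equal(r_pb_manual, r_pb_cor)
```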

*Problem 4*

Estimate the phi coefficient correlation between the occurrence of infant deaths and the presence of polio cases in 2001. Convert the two columns into a contingency table. (2 points)

In [None]:
data_new['Polio']

In [None]:
data_new['Infant_Deaths']

In [None]:
# Converting and storing the new columns in binary.
data_new$Polio_Binary <- ifelse(data_new$Polio == 0, 0, 1)
data_new$Infant_Deaths_Binary <- ifelse(data_new$Infant_Deaths == 0, 0, 1)
data_new

The Phi-coefficient is given by:
$$
\phi = \frac{n_{11} \cdot n_{00} - n_{10} \cdot n_{01}}{\sqrt{n_{X1} \cdot n_{X0} \cdot n_{Y1} \cdot n_{Y0}}}
$$
It is used to measure the strength of the relation between two binary variables, where:
- $n_{11}$: Number of cases where $X = 1$ and $Y = 1$.
- $n_{00}$: Number of cases where $X = 0$ and $Y = 0$.
- $n_{10}$: Number of cases where $X = 1$ and $Y = 0$.
- $n_{01}$: Number of cases where $X = 0$ and $Y = 1$.

- $n_{X1}$: Total number of cases where $X = 1$.
- $n_{X0}$: Total number of cases where $X = 0$.
- $n_{Y1}$: Total number of cases where $Y = 1$.
- $n_{Y0}$: Total number of cases where $Y = 0$.
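The formula can be checked by hand on a small example. This is a sketch on invented binary vectors (`x`, `y` are hypothetical, not the Polio or Infant_Deaths columns). It reads the cell and marginal counts from a 2x2 table, applies the formula above, and compares the result with Pearson's coefficient on the same 0/1 vectors, to which phi is equal.

```r
# Toy binary vectors, invented for illustration only.
x <- c(1, 1, 0, 1, 0, 0, 1, 1, 0, 1)
y <- c(1, 0, 0, 1, 0, 1, 1, 1, 0, 0)

# Rows are X = 0, 1; columns are Y = 0, 1.
tab <- table(X = x, Y = y)

# Cell counts.
n11 <- tab["1", "1"]; n00 <- tab["0", "0"]
n10 <- tab["1", "0"]; n01 <- tab["0", "1"]

# Marginal totals.
nX1 <- sum(tab["1", ]); nX0 <- sum(tab["0", ])
nY1 <- sum(tab[, "1"]); nY0 <- sum(tab[, "0"])

phi_manual <- as.numeric((n11 * n00 - n10 * n01) / sqrt(nX1 * nX0 * nY1 * nY0))
phi_cor    <- cor(x, y)

c(manual = phi_manual, pearson_on_binary = phi_cor)
```

Here the counts are $n_{11} = 4$, $n_{00} = 3$, $n_{10} = 2$, $n_{01} = 1$, so $\phi = 10 / \sqrt{600} \approx 0.408$.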


In [None]:
# Building the contingency table
contingency_table <- table(data_new[, c('Polio_Binary', 'Infant_Deaths_Binary')]) 
contingency_table

In [None]:
# Printing the phi coefficient.
phi(contingency_table)

*Problem 5*

Calculate the Spearman-Rank correlation coefficient between the two sets of ranks, Life Expectancy and GDP, to explore the relationship between the same during the year 2001. (Rank the countries based on life expectancy and GDP, assigning a rank of 1 to the country with the highest value in each category.)  Now, statistically infer whether this correlation is significant at `α = 0.05`, and test the hypothesis that the correlation coefficient is `at least 0.35.` (2 points)

In [None]:
# Creating new columns for the Ranks.
data_new$Life_Expectancy_Rank <- rank(desc(data_new$Life_Expectancy), ties.method='random')
data_new$GDP_Rank <- rank(desc(data_new$GDP), ties.method='random')

In [None]:
# Finding the correlation coefficient.
rs <- cor(data_new$Life_Expectancy_Rank, data_new$GDP_Rank, method='spearman')
rs

In [None]:
# Testing the same.
print(cor.test(data_new$GDP_Rank, data_new$Life_Expectancy_Rank, method='spearman'))

In [None]:
# scatter plot with the fit.
ggscatter(data_new, x='GDP_Rank', y='Life_Expectancy_Rank', add='reg.line', conf.int=TRUE,
cor.coef=TRUE, cor.method = 'spearman')

- The above scatter plot shows a strong positive correlation between the GDP ranks and the life-expectancy ranks. One plausible explanation is that a country with a high GDP can invest more in healthcare and in (medical) education, which may improve life expectancy.
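Ranking both columns in descending order, as the problem requires, does not change the coefficient: Spearman's coefficient on the raw values equals Pearson's coefficient on the ranks, and reversing the rank direction of both variables preserves the sign. A minimal sketch on invented values (`gdp` and `life_exp` here are toy vectors without ties, not the dataset columns):

```r
# Toy values, invented for illustration only (no ties).
gdp      <- c(1200, 45000, 8300, 560, 23000, 3100)
life_exp <- c(58.2, 80.1, 71.5, 62.0, 77.3, 66.4)

# Rank 1 = highest value, as in the problem statement.
gdp_rank  <- rank(-gdp)
life_rank <- rank(-life_exp)

# Spearman on raw values versus Pearson on the descending ranks.
rho_raw   <- cor(gdp, life_exp, method = "spearman")
rho_ranks <- cor(gdp_rank, life_rank, method = "pearson")

# Classic shortcut for untied data: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
d <- gdp_rank - life_rank
n <- length(d)
rho_shortcut <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

c(raw = rho_raw, ranks = rho_ranks, shortcut = rho_shortcut)
```

With ties, the shortcut formula no longer matches exactly, which is one reason to rely on `cor(..., method = "spearman")` rather than the hand formula.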

In [None]:
# testing the hypothesis.

- Null Hypothesis: `H0: ρs ≤ 0.35`
- Alternate Hypothesis: `HA: ρs > 0.35`

Check whether the correlation coefficient is at least 0.35. The t-statistic is given by

$$t = \frac{r_s - \rho_s}{\sqrt{\frac{1 - r_s^2}{n - 2}}}$$

where:
- t is the t-statistic,
- rs is the sample Spearman-Rank correlation coefficient,
- ρs is the value of the population correlation coefficient being tested against,
- n is the number of data points in the sample.
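The statistic above can be wrapped in a small function. This is a sketch only: `spearman_t_test` is a hypothetical helper (not from any package), it follows the formula as given in this notebook, and the inputs in the usage line are made up. Because the alternative hypothesis is one-sided (`ρs > 0.35`), it reports the upper-tail p-value.

```r
# Hypothetical helper, not part of any package.
# r_s:   sample Spearman correlation
# rho_0: hypothesised population value (0.35 in this problem)
# n:     number of observations
spearman_t_test <- function(r_s, rho_0, n) {
  df     <- n - 2
  t_stat <- (r_s - rho_0) / sqrt((1 - r_s^2) / df)
  # Upper-tail probability, matching the one-sided alternative rho_s > rho_0.
  p_upper <- pt(t_stat, df = df, lower.tail = FALSE)
  list(t = t_stat, df = df, p_value = p_upper)
}

# Made-up inputs for illustration only.
res <- spearman_t_test(r_s = 0.60, rho_0 = 0.35, n = 40)
res
```

The null hypothesis is rejected at `α = 0.05` when `res$p_value` is below 0.05.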



In [None]:
# calculating the t-statistic
degrees <- nrow(data_new) - 2
t_stat <- (rs - 0.35)/sqrt((1 - rs*rs)/(nrow(data_new) - 2)) 
t_stat

In [None]:
# One-tailed (upper-tail) p-value, matching the alternative ρs > 0.35.
pt(q=t_stat, df=degrees, lower.tail=FALSE)

Hence, we reject the null hypothesis in favour of the alternate hypothesis and conclude that the correlation is at least 0.35.

*Problem 6*

A report mentions a significant decline in the U.S. healthcare outcomes between the 1990s and 2010s, using data from life expectancy and disease prevalence. Contributors of the report utilized words such as “crisis” to describe the current state of the healthcare system at the time. It was an imminent call to action for reformation. What are your thoughts on this? Was it truly due to the decline in the quality of the healthcare system? (1 point)
- Your task is to explore answers given by different Large Language Models. Optionally, as a bonus, you can test open-source models, Small LMs, or quantised models via local inference; you can find a few on [huggingface](https://huggingface.co/models?sort=trending).
- Which gave the best results? Why do you think so?

In [None]:
# Answers given by different language models
# 1. ChatGPT (GPT-4o)

The report's characterization of the decline in U.S. healthcare outcomes as a "crisis" and its call for reformation likely stem from various interconnected factors. Whether this decline is due to a drop in the quality of the healthcare system itself or other influencing elements is a multifaceted question. Here are some potential considerations:

Healthcare System Quality: The quality of the healthcare system involves various components, including accessibility, affordability, medical technology, healthcare policies, and the efficiency of healthcare delivery. A decline in any of these areas could impact overall healthcare outcomes.

Economic Factors: Economic downturns or stagnation can affect healthcare outcomes. Economic instability often leads to reduced funding for public health initiatives and decreased access to healthcare services, particularly for low-income populations.

Social Determinants of Health: Factors such as education, employment, social support, and living conditions significantly impact health outcomes. Worsening social determinants can lead to poorer health outcomes, independent of the healthcare system's quality.

Chronic Disease Prevalence: An increase in chronic diseases such as diabetes, obesity, and heart disease can strain the healthcare system and negatively impact life expectancy and overall health outcomes.

Public Health Policy and Preventive Care: The effectiveness of public health policies and the emphasis on preventive care can play crucial roles. A lack of focus on preventive care and public health measures can lead to poorer health outcomes.

Healthcare Disparities: Persistent disparities in healthcare access and quality among different populations, particularly among racial and ethnic minorities, can contribute to overall declines in health outcomes.

External Factors: Other factors, such as environmental changes, lifestyle changes, and advances in medical science and technology, can also influence healthcare outcomes. For instance, improvements in technology and treatments may not be evenly distributed, leading to discrepancies in healthcare outcomes.

While the report highlights significant concerns, attributing the decline solely to the quality of the healthcare system may overlook these broader factors. It is likely a combination of the healthcare system's inefficiencies and various socio-economic factors contributing to the observed decline in healthcare outcomes during that period. Addressing these issues requires a holistic approach that considers improving healthcare quality, enhancing social determinants of health, and implementing effective public health policies.

In [None]:
# Gemini by Google.

The report you describe highlights a concerning trend in U.S. healthcare. Here's a breakdown of the situation:

The Decline:

- It's true that some health outcomes in the U.S. worsened between the 1990s and 2010s. This is reflected in metrics like life expectancy stalling or even declining, and a rise in certain chronic diseases.
- However, it's important to note that advancements were made in other areas. For example, cancer mortality rates did continue to improve even during this period.

Possible Causes (beyond healthcare quality):

- Social determinants of health: Factors like poverty, lack of access to healthy food and exercise, and social isolation can significantly impact health outcomes.
- Rise in chronic diseases: Obesity, diabetes, and addiction rates have increased, contributing to poorer health overall.
- Opioid epidemic: This public health crisis tragically led to many deaths.

Healthcare System as a Factor:

- The healthcare system likely played a role, but it's not the sole culprit.
- Issues like high costs, limited access to preventive care, and disparities in treatment based on race and income could be contributing factors.

So, was it a crisis?

- While the decline in some areas is concerning, "crisis" might be a strong term. Nonetheless, it was a clear signal for improvement.
- The report served as a wake-up call to address the complex issues plaguing U.S. healthcare.

Looking Forward:

- Efforts are underway to address these issues:
    - Affordable Care Act aimed to expand health insurance coverage.
    - Focus on preventive care and public health initiatives.
    - Efforts to address social determinants of health.

Overall:

The situation is multifaceted. While healthcare quality may be a factor, addressing the broader social and economic determinants of health is crucial for improvement. The call for reform in the report was a necessary step to address these issues.

Additional Points:

- You can find resources like the Commonwealth Fund and the Kaiser Family Foundation that provide data and analysis on U.S. healthcare.

I hope this helps! Let me know if you have any further questions.

In [None]:
# Local Inference (Optional)
# Llama3-8B-Instruct by Meta

As a data analyst, I must approach this topic with a critical eye. Upon reviewing the report, I noticed that the decline in healthcare outcomes is likely attributed to a combination of factors rather than a single cause.

One crucial aspect to consider is the improvement in disease diagnosis and reporting methods during the 1990s and early 2000s. The introduction of new technologies and advancements in medical research led to increased detection and reporting of previously underreported diseases, such as HIV/AIDS. This could have contributed to the perceived decline in healthcare outcomes.

Another significant factor is the growing prevalence of chronic diseases, such as obesity, diabetes, and hypertension. These conditions have become increasingly prevalent due to lifestyle changes and demographic shifts. As these conditions are often managed through long-term care and medication, they can have a significant impact on healthcare outcomes, even if the quality of care remains consistent.

Furthermore, it is essential to consider the potential impact of socio-economic factors, such as poverty, education, and access to healthcare. Disparities in these areas can influence healthcare outcomes, making it more challenging to accurately compare outcomes across different populations.

In addition, the report's description of the healthcare system as being in a "crisis" might be an exaggeration. While it is true that the healthcare system faced challenges during this period, it is essential to consider the broader context and the efforts made to address these issues.

In conclusion, as a data analyst, I would recommend a more nuanced approach to understanding the decline in healthcare outcomes. It is crucial to consider the interplay of various factors, including changes in disease diagnosis and reporting, the growing prevalence of chronic diseases, socio-economic disparities, and the efforts made to address these challenges. By doing so, we can gain a more accurate understanding of the situation and develop targeted solutions to improve healthcare outcomes.

What do you think about the report's findings, and would you like me to drill down further into any specific aspect of this topic?

In [None]:
# Local Inference (Optional)
# TinyLlama/TinyLlama-1.1B-Chat-v0.1 

[Check it out here!](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.1)

*fin*