Readme Section:

1. Describe the project

I am using a public health dataset from the City of Chicago Data Portal (data.cityofchicago.org). The dataset contains aggregated weekly respiratory virus laboratory data, which the CDPH actively uses to monitor influenza, COVID-19, RSV, and other respiratory virus activity in Chicago. It covers pathogen activity over multiple years up to the present and was last updated on February 16, 2024.
My project analyzes several aspects of this weekly updated dataset to better understand trends in respiratory virus activity in Chicago over recent years.

2. Explain in a few sentences why you selected this project, and if you learned what you had hoped to learn by doing this project

I selected this project because I was very interested in exploring the epidemiological side of disease tracking and tracing within public health. This was also one of my main areas of interest when I applied to the MPH program at UChicago.
For my final project in this class, I therefore wanted to take a healthcare dataset about infectious disease trends and analyze it to reach some interesting findings and observations. I learned a lot about analyzing health data trends through this project, but I want to keep expanding my knowledge of disease tracking and tracing analysis.

3. Describe the two major class themes selected (any concept from any lecture is fair game!), why you selected them, and how they are applied in the project. 

The first major theme I used throughout my project is the DataFrame, and in particular creating a DataFrame by importing a dataset. A DataFrame is a two-dimensional labeled data structure whose columns can hold different types of data. Working with DataFrames required Pandas, a Python library for data analysis. I selected this concept because it was essential for my project: I wanted to analyze a dataset imported from a CSV file. In my project I used Pandas to reduce the dataset to smaller subsets and to slice it by rows and columns.
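
As a rough illustration of the slicing described above, here is a minimal sketch on a tiny hand-made DataFrame. The rows are made up for illustration and are not values from the CDPH dataset; only the column names are borrowed from it.

```python
import pandas as pd

# Made-up rows for illustration only; these are not values from the CDPH dataset.
sample = pd.DataFrame(
    {
        "pathogen": ["Influenza", "SARS-CoV-2", "RSV", "Influenza"],
        "lab_tot_tested": [100, 200, 150, 120],
        "lab_tot_positive": [10, 30, 15, 6],
    }
)

# Positional slice: keep only the first two rows.
first_two = sample[:2]

# Column selection: keep only the numeric columns of interest.
counts = sample[["lab_tot_tested", "lab_tot_positive"]]

# Boolean filter: keep only the rows for one pathogen.
flu_only = sample[sample["pathogen"] == "Influenza"]

print(first_two.shape)          # (rows, columns)
print(counts.columns.tolist())
print(len(flu_only))
```

The same three operations (row slice, column selection, boolean filter) are the ones the project code applies to the full dataset.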

The second major theme I used in my project was data visualization. Visualization makes it easier to draw conclusions from data, because a graph lets you see trends quickly and back those conclusions up. I used Matplotlib for all the visualizations and graphs in my project. Matplotlib is a Python library that provides a lot of control over every aspect of a plot. I used it to create line plots, box plots, and bar charts. I was able to customize my graphs easily and create visualizations that gave me insight into the data.

4. What you would do differently if you were to have an opportunity to redo this project and why.

If I had the opportunity to redo this project, I would choose a dataset that includes Chicago zip codes or other geographic location information along with the infectious disease case rates.
I would then want to learn how to use that information for geospatial analysis in Python. I think it would be very cool to make heatmaps or plot the spatial distribution of virus cases across different geographic regions of Chicago.

5. How to run your project. If your project requires a dataset, please include it if possible.

The dataset for this project is available on the City of Chicago Data Portal:
https://data.cityofchicago.org/Health-Human-Services/Influenza-COVID-19-RSV-and-Other-Respiratory-Virus/qgdz-d5m4/about_data
I have also attached my copy of the dataset on Canvas because I renamed the file. The code reads it as Infectious_Disease_Surveillance_Data.csv.

6. Was the project challenging in the way you expected? What did you overcome?

One of the primary challenges I faced was getting stuck on small errors and not knowing how to troubleshoot them. This class was my first introduction to Python, so I had no prior experience debugging Python code. It was time consuming, but researching the errors and consulting relevant websites and tutorials helped me overcome this.


7. Cited sources, appropriate acknowledgements. Explain how each source applied to your project. (5 points)

I used this source for guidance when importing my CSV dataset into a DataFrame.
https://www.stratascratch.com/blog/how-to-import-pandas-as-pd-in-python/

I used this source for guidance on how to calculate summary statistics in Python (mean, min, max, and std).
https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html

I used this source for guidance on how to obtain correlation coefficients in Python, because I wanted to make a correlation matrix for the 3 variables I was working with.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html

I used this source for guidance because I did not know how to limit my Positivity Rate percentage value to 2 decimal places and display it with a % sign.
https://stackoverflow.com/questions/455612/limiting-floats-to-two-decimal-points

I used this source for guidance when plotting my bar plot, since I had not created this type of graph in Python before.
https://www.geeksforgeeks.org/bar-plot-in-matplotlib/

I used this source for guidance because I did not know how to remove the default title added by pandas on my boxplot.
https://stackoverflow.com/questions/23507229/set-no-title-for-pandas-boxplot-groupby 

8. If you attempted the extra credit, explain how you successfully met the criteria

In [None]:
#Final Project:

#First importing the dataset (which I have attached on Canvas when submitting the project)
import pandas as pd
import numpy as np

data = pd.read_csv('Infectious_Disease_Surveillance_Data.csv')
data.head()

In [None]:
# Saving only the first 40 rows of the data 
# This reduces the dataset to only show the weekly disease surveillance data for the first five weeks of 2024 

df2024 = data[:40]
df2024

In [None]:
#Saving the dataset to only show the weekly disease surveillance data for the entire year of 2023 

df2023 = data[40:456]
df2023
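
The two slices above depend on the row order of the CSV file. As an alternative, rows could be selected by year instead of by position. The sketch below assumes that `mmwr_week` is stored as an integer in YYYYWW form (for example 202340 for week 40 of 2023), which is what the season filter later in this notebook suggests. The rows are made up, so the assumption should be checked against the real column before use.

```python
import pandas as pd

# Made-up rows; mmwr_week is ASSUMED to be an integer in YYYYWW form.
sample = pd.DataFrame(
    {
        "mmwr_week": [202405, 202401, 202352, 202301, 202252],
        "pathogen": ["Influenza"] * 5,
        "lab_tot_positive": [5, 8, 12, 3, 9],
    }
)

# Integer division by 100 drops the two week digits and leaves the year.
year = sample["mmwr_week"] // 100

rows_2024 = sample[year == 2024]
rows_2023 = sample[year == 2023]

print(len(rows_2024), len(rows_2023))
```

Selecting by value this way keeps working if the portal later adds new weeks to the top of the file, whereas fixed row positions would shift.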

In [None]:
#Extracting relevant columns from the 2 smaller datasets to be able to calculate summary statistics (mean, min, max, std)
#lab_tot_tested: Total number of specimens tested for the specified pathogen as reported by participating laboratories
#lab_tot_positive: Total number of specimens that were positive for the specified pathogen as reported by participating laboratories
#lab_pct_positive: Percentage of specimens that were positive for the specified pathogen as reported by participating laboratories

relevant_columns = ['lab_tot_tested', 'lab_tot_positive', 'lab_pct_positive']
relevant_data2024 = df2024[relevant_columns]
relevant_data2023 = df2023[relevant_columns]

summary_stats1 = relevant_data2024.describe()
summary_stats2 = relevant_data2023.describe()

print("Summary Statistics for Chicago Infectious Disease Data (Jan1-Present):")
print(summary_stats1)

print("Summary Statistics for Chicago Infectious Disease Data (entire year of 2023):")
print(summary_stats2)
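
`describe()` reports more than the four statistics named above (it also includes the count and quartiles). If only the mean, min, max, and std are wanted, one option is `DataFrame.agg` with a list of statistic names. This is a minimal sketch on made-up numbers, not the project's actual output.

```python
import pandas as pd

# Made-up counts for illustration only.
sample = pd.DataFrame(
    {
        "lab_tot_tested": [100, 200, 300],
        "lab_tot_positive": [10, 20, 60],
    }
)

# One row per statistic, one column per variable.
stats = sample.agg(["mean", "min", "max", "std"])
print(stats)
```

Note that pandas computes `std` as the sample standard deviation (dividing by n - 1) by default.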

In [None]:
#Want to further investigate if there are any correlations between the variables: number of specimens tested, number of positive cases, and the percentage of specimens positive. 
#Using the relevant columns from the 2 smaller datasets, I want to calculate the correlation coefficients

correlation_coeff1 = relevant_data2024.corr()
correlation_coeff2 = relevant_data2023.corr()

print("Correlation Matrix for Chicago Infectious Disease Data (Jan1-Present):")
print(correlation_coeff1)

print("Correlation Matrix for Chicago Infectious Disease Data (entire year of 2023):")
print(correlation_coeff2)
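
To make the correlation matrix easier to read, here is a small sketch on made-up numbers where the answer is easy to reason about. `DataFrame.corr()` computes pairwise Pearson correlation by default, so two columns that rise together in a perfectly straight line get a coefficient of 1.

```python
import pandas as pd

# Made-up counts for illustration only.
sample = pd.DataFrame(
    {
        "lab_tot_tested": [100, 200, 300, 400],
        "lab_tot_positive": [10, 30, 60, 100],
    }
)

# Percent positive derived from the two count columns: 10, 15, 20, 25.
sample["lab_pct_positive"] = sample["lab_tot_positive"] / sample["lab_tot_tested"] * 100

corr = sample.corr()  # Pearson correlation by default
print(corr.round(3))
```

In this toy data the percent positive increases by the same amount for every extra 100 specimens tested, so those two columns are perfectly correlated. Real surveillance data will rarely be this tidy.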

In [None]:
#Only focusing on 2 specific pathogens of interest: influenza and COVID-19 (listed as SARS-CoV-2 in the dataset)
#I now want to analyze the pathogens' prevalence and positivity rates in both 2023 and 2024
#First let's do 2024

pathogens_of_interest = ['Influenza', 'SARS-CoV-2']

filtered_data_2024 = df2024[df2024['pathogen'].isin(pathogens_of_interest)]
filtered_data_2023 = df2023[df2023['pathogen'].isin(pathogens_of_interest)]


# Calculate prevalence statistics for each pathogen
# Prevalence here means the total number of positive cases

print("Specific pathogens of interest: Influenza and COVID-19. The pathogens' prevalence and positivity rates in 2024:")

for pathogen in pathogens_of_interest:
    pathogen_data_2024 = filtered_data_2024[filtered_data_2024['pathogen'] == pathogen]

    total_positive_cases2024 = pathogen_data_2024['lab_tot_positive'].sum()
    print(f"{pathogen} Total Positive Cases: {total_positive_cases2024}")

    total_specimens_tested2024 = pathogen_data_2024['lab_tot_tested'].sum()

    # Positivity rate is the percentage of positive cases (Total positive cases/Total specimens tested)
    positivity_rate2024 = (total_positive_cases2024 / total_specimens_tested2024) * 100
    print(f"{pathogen} Positivity Rate: {positivity_rate2024:.2f}%\n")
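
The positivity rate calculation and its two-decimal formatting can also be sketched as a small helper. The function name `positivity_rate` is hypothetical (it is not part of the project code), and the counts are made up. The helper returns `None` when nothing was tested so that an empty filter result does not cause a division by zero.

```python
def positivity_rate(total_positive, total_tested):
    """Return positives as a percentage of specimens tested, or None if nothing was tested."""
    if total_tested == 0:
        return None
    return total_positive / total_tested * 100


rate = positivity_rate(25, 200)  # made-up counts

# ":.2f" rounds to two decimal places; the trailing "%" is plain text in the string.
label = f"{rate:.2f}%"
print(label)  # prints 12.50%
```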

In [None]:
#Continued calculations for the year 2023

print("Specific pathogens of interest: Influenza and COVID-19. The pathogens' prevalence and positivity rates in 2023:")
for pathogen in pathogens_of_interest:
    pathogen_data_2023 = filtered_data_2023[filtered_data_2023['pathogen'] == pathogen]

    total_positive_cases2023 = pathogen_data_2023['lab_tot_positive'].sum()
    print(f"{pathogen} Total Positive Cases: {total_positive_cases2023}")

    total_specimens_tested2023 = pathogen_data_2023['lab_tot_tested'].sum()

    # Positivity rate is the percentage of positive cases (Total positive cases/Total specimens tested)
    positivity_rate2023 = (total_positive_cases2023 / total_specimens_tested2023) * 100
    print(f"{pathogen} Positivity Rate: {positivity_rate2023:.2f}%\n")

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

#Making a line plot of the weekly positive cases of influenza and COVID-19 over the first 5 weeks of 2024
#This type of plot helps visualize the recent positivity trends for the 2 pathogens and gives a rough sense of where they are heading

influenza_data2024 = df2024[df2024['pathogen'] == 'Influenza']
covid_data2024 = df2024[df2024['pathogen'] == 'SARS-CoV-2']

# Grouping the data by week and calculating total positive cases for each pathogen
# mmwr_week: A weekly counting system within a calendar year standardized by the U.S. CDC

influenza_weekly_cases24 = influenza_data2024.groupby('mmwr_week')['lab_tot_positive'].sum()
covid_weekly_cases24 = covid_data2024.groupby('mmwr_week')['lab_tot_positive'].sum()

# Plotting the weekly positive cases for both pathogens
plt.figure(figsize=(10, 6))
plt.plot(influenza_weekly_cases24.index, influenza_weekly_cases24.values, label='Influenza')
plt.plot(covid_weekly_cases24.index, covid_weekly_cases24.values, label='COVID-19')

plt.xlabel('MMWR Week')
plt.ylabel('Weekly Positive Cases')
plt.title('Weekly Positive Cases of Influenza and COVID-19 this year 2024')
plt.legend()

plt.grid(True)
plt.show()
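
The grouping step used before plotting can be looked at on its own. The sketch below uses made-up rows in which one week appears more than once, to show that `groupby(...).sum()` collapses them into a single weekly total. The week values are written in YYYYWW form, which is an assumption about how the dataset stores `mmwr_week`.

```python
import pandas as pd

# Made-up rows; two of the weeks appear twice on purpose.
sample = pd.DataFrame(
    {
        "mmwr_week": [202401, 202401, 202402, 202402, 202403],
        "pathogen": ["Influenza"] * 5,
        "lab_tot_positive": [4, 6, 3, 2, 7],
    }
)

# One total per week; the result is a Series indexed by mmwr_week, sorted by week.
weekly = sample.groupby("mmwr_week")["lab_tot_positive"].sum()
print(weekly)
```

The index of the resulting Series supplies the x values and `.values` supplies the y values, which is how the line plots above use it.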

In [None]:
#Now let's make a line plot of the weekly positive cases of influenza and COVID-19 over the year 2023
#This type of plot is useful for visualizing how positivity trends for the 2 pathogens changed over the entire year (52 weeks)

influenza_data2023 = df2023[df2023['pathogen'] == 'Influenza']
covid_data2023 = df2023[df2023['pathogen'] == 'SARS-CoV-2']

influenza_weekly_cases23 = influenza_data2023.groupby('mmwr_week')['lab_tot_positive'].sum()
covid_weekly_cases23 = covid_data2023.groupby('mmwr_week')['lab_tot_positive'].sum()

plt.figure(figsize=(10, 6))
plt.plot(influenza_weekly_cases23.index, influenza_weekly_cases23.values, label='Influenza')
plt.plot(covid_weekly_cases23.index, covid_weekly_cases23.values, label='COVID-19')

plt.xlabel('MMWR Week')
plt.ylabel('Weekly Positive Cases')
plt.title('Weekly Positive Cases of Influenza and COVID-19 over the year 2023')
plt.legend()

plt.grid(True)
plt.show()

In [None]:
#I want to create a bar plot showing the cumulative positive cases for ALL pathogens that exist over the 2023-2024 season. 
#This will require using the season data, and can provide an overview of the total positive cases throughout the season.

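#First filter the data to the 2023-2024 season
#season: Annually recurring reporting period for which estimates are calculated, beginning during MMWR week 40 and ending with week 39 of the following year

#Assumes mmwr_week is stored as a YYYYWW integer (e.g. 202340 = week 40 of 2023)
#Keep weeks 202340 through 202405; both conditions must hold, so they are combined with & rather than |
season_data = data[(data['mmwr_week'] >= 202340) & (data['mmwr_week'] <= 202405)]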
season_data

In [None]:
#Continuing to make the Bar Plot
#Now group the data by pathogens and find total cumulative positive cases for each pathogen

cumulative_cases = season_data.groupby('pathogen')['lab_tot_positive_cumulative'].max()

plt.figure(figsize=(10, 6))
cumulative_cases.plot.bar(color='skyblue')
plt.xlabel('Pathogen Type')
plt.ylabel('Cumulative Positive Cases')
plt.title('Cumulative Positive Cases for Each Pathogen over the season 2023-2024')
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
#Lastly I want to create a box plot showing the distribution of positivity rates for ALL pathogens that exist over the 2023-2024 season.

boxplot_data = season_data[['pathogen', 'lab_pct_positive']]

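# Making a box plot showing the distribution of positivity rates for each pathogen
# boxplot() creates its own figure when called with by= and figsize=, so a separate plt.figure() call would only leave an empty extra figure
boxplot = boxplot_data.boxplot(by='pathogen', column='lab_pct_positive', figsize=(10, 6))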

plt.xlabel('Pathogen Type')
plt.ylabel('Positivity Rate (%)')
plt.title('Distribution of Positivity Rates for Each Pathogen over the season 2023-2024')

plt.suptitle('')
plt.xticks(rotation=45, ha='right')
plt.show()