# **Analysis of Diabetes Risk in India**

## Objectives

* Create a comprehensive data analysis tool that helps medical professionals streamline data exploration, analysis, and visualisation when assessing how behavioural and lifestyle factors affect the risk of diabetes in young adults in India.

## Inputs

* Data source: https://www.kaggle.com/datasets/ankushpanday1/diabetes-in-youth-vs-adult-in-india


## Outputs

* A Jupyter notebook file (Diabetes Risk Analysis (Hackathon1.ipynb)), redone to showcase the data analysis and my progress since my first project.
* Code that supports descriptive analysis and explores the influence of lifestyle on diabetes outcome.



---

# Working directory

Change the working directory from its current folder to its parent folder.
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm new directory

In [None]:
current_dir = os.getcwd()
current_dir
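The parent folder can also be computed without changing the working directory, which avoids moving up a second level if the `os.chdir()` cell is run twice. The following is a minimal sketch using `pathlib`; the helper name `parent_of` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def parent_of(directory: str) -> str:
    """Return the parent folder of `directory` without changing the working directory."""
    return str(Path(directory).resolve().parent)


start_dir = os.getcwd()
target_dir = parent_of(start_dir)
print(f"Current: {start_dir}")
print(f"Parent:  {target_dir}")
```

Paths built from `target_dir` stay valid regardless of how many times the cell is executed.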

# Section 1 :  Data Extraction, Transformation, and Loading (ETL) 

Set up and import the Python packages used in this project to carry out the analysis: NumPy for numerical operations and array handling, Pandas for data manipulation and analysis, and Matplotlib, Seaborn and Plotly for data visualisations.

In [54]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline
import plotly.express as px

Load the dataset into a dataframe for data cleaning, transformation and analysis.

In [None]:
# Loading the CSV dataset containing the data collected previously and extracting it into 
# diabetes_df dataframe using pd.read_csv() function
 
# A forward slash is used because '\t' inside a normal string is read as a tab character
diabetes_df = pd.read_csv('inputs/thyroid_cancer_risk_data.csv')

In [None]:
#Previewing top 15 entries in dataset to get a general overview of the dataset with .head() method
diabetes_df.head(15)

In [None]:
#Checking the total number of rows and columns in diabetes_df dataframe using the .shape attribute
diabetes_df.shape
print(f"There are {diabetes_df.shape[0]} rows and {diabetes_df.shape[1]} columns in the Diabetes Dataframe.")

In [None]:
# Checking information on Index, Column names, Datatypes and Memory used using the .info() method
# .info() prints its report and returns None, so the result is not assigned
diabetes_df.info()

In [None]:
# Checking for any duplicate rows
duplicates_check = diabetes_df.duplicated().any()
print(f'Any duplicate rows: {duplicates_check}')

In [None]:
# Checking each column for missing values
missingvalues_check = diabetes_df.isnull().any()
missingvalues_check
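A True/False flag per column does not show how much data is missing. As a minimal sketch, the count and share of missing values per column can be computed as follows; `toy_df` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the check
toy_df = pd.DataFrame({
    "Age": [21, 25, np.nan, 30],
    "BMI": [22.5, np.nan, np.nan, 27.1],
    "Gender": ["Male", "Female", "Female", "Male"],
})

# Number of missing entries in each column
missing_counts = toy_df.isnull().sum()

# Fraction of missing entries in each column
missing_share = toy_df.isnull().mean().round(2)

print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```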

In [None]:
#Creating a dictionary of columns and their respective unique values

unique_values = {
    'Age' : diabetes_df['Age'].unique().tolist(),
    'Gender': diabetes_df['Gender'].unique().tolist(),
    'Region': diabetes_df['Region'].unique().tolist(),
    'Physical_Activity_Level': diabetes_df['Physical_Activity_Level'].unique().tolist(),
    'Dietary_Habits': diabetes_df['Dietary_Habits'].unique().tolist(),
    'Alcohol_Consumption': diabetes_df['Alcohol_Consumption'].unique().tolist(),
    'Smoking': diabetes_df['Smoking'].unique().tolist(),
    'Sleep_Hours': diabetes_df['Sleep_Hours'].unique().tolist(),
    'Stress_Level' : diabetes_df['Stress_Level'].unique().tolist(),
    'Screen_Time': diabetes_df['Screen_Time'].unique().tolist(),
}

# Converting the dictionary to a DataFrame for better visualization
unique_behaviour_df = pd.DataFrame.from_dict(unique_values, orient='index').transpose()

# Displaying the unique values of each column in DataFrame
unique_behaviour_df.head(15)

In [None]:
# Displaying the unique terms for each specified column 
for column, values in unique_values.items(): 
    print(f"Unique values for column '{column}':") 
    print(values)
    print("\n")

2. Transformation: cleaning the data for analysis and visualisation

In [None]:
# This analysis focuses on the influence of lifestyle factors on diabetes risk in the young population,
# so columns that will not be used in further analysis are dropped
col_dropped= [
              'ID',
              'Family_Income', 
              'Family_History_Diabetes',
              'Parent_Diabetes_Type', 
              'Genetic_Risk_Score',
              'Prediabetes',
              'Diabetes_Type'
              ]                               
diabetes_df= diabetes_df.drop(columns= col_dropped)

In [None]:
# Deriving a diabetes outcome label from HbA1c and Fasting Blood Sugar levels

# Both tests are used to diagnose diabetes and prediabetes
# HbA1c reflects average blood sugar over roughly the past 3 months (in %): below 5.7 is normal,
# 5.7 to 6.4 indicates prediabetes, and 6.5 or above indicates diabetes

# Fasting Blood Sugar is measured after fasting for at least 8 hours (in mg/dL): below 100 is normal,
# 100 to 125 indicates prediabetes, and 126 or above indicates diabetes

# Creating the 'Diabetes_Outcome' column based on both HbA1c and Fasting Blood Sugar levels
def determine_outcome(row):
    if row['HbA1c'] >= 6.5 or row['Fasting_Blood_Sugar'] >= 126:
        return 'Diabetic'
    elif (row['HbA1c'] >= 5.7 and row['HbA1c'] < 6.5) or (row['Fasting_Blood_Sugar'] >= 100 and row['Fasting_Blood_Sugar'] < 126):
        return 'Prediabetic'
    else:
        return 'Normal'

# Apply the function to create the 'Diabetes_Outcome' column
diabetes_df['Diabetes_Outcome'] = diabetes_df.apply(determine_outcome, axis=1)

# Display the updated DataFrame
diabetes_df[['HbA1c', 'Fasting_Blood_Sugar', 'Diabetes_Outcome']].head()
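The row-wise `apply` above can be slow on a large dataframe. A vectorised alternative, sketched here with `np.select` under the same thresholds, evaluates the conditions on whole columns; `toy_df` and its values are hypothetical and only illustrate the rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the thresholds
toy_df = pd.DataFrame({
    "HbA1c": [5.2, 5.9, 6.8, 5.0],
    "Fasting_Blood_Sugar": [90, 95, 110, 130],
})

# Either test at or above its diabetes threshold marks the row as diabetic
is_diabetic = (toy_df["HbA1c"] >= 6.5) | (toy_df["Fasting_Blood_Sugar"] >= 126)

# Either test at or above its prediabetes threshold; diabetic rows are
# already claimed because np.select uses the first matching condition
is_prediabetic = (toy_df["HbA1c"] >= 5.7) | (toy_df["Fasting_Blood_Sugar"] >= 100)

toy_df["Diabetes_Outcome"] = np.select(
    [is_diabetic, is_prediabetic],
    ["Diabetic", "Prediabetic"],
    default="Normal",
)
print(toy_df)
```

Because the conditions are checked in order, the upper bounds used in the `elif` branch are not needed here.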

In [None]:
diabetes_df

In [None]:
# Pipeline to change category values into numeric values
from sklearn.pipeline import Pipeline
from category_encoders import OrdinalEncoder

# Create the pipeline with ordinal encoder
pipeline = Pipeline([
    ('ordinal_encoder', OrdinalEncoder())
])

# Apply the pipeline to the DataFrame
diabetes_clean_df = pipeline.fit_transform(diabetes_df)

# Display the cleaned DataFrame
diabetes_clean_df


In [None]:
pipeline.fit(diabetes_df)

# Accessing the OrdinalEncoder instance after fitting
encoder = pipeline.named_steps['ordinal_encoder']

# Accessing the mappings from the OrdinalEncoder in category_encoders
mappings = encoder.mapping

# Displaying the mappings for each feature
for mapping in mappings:
    print(f"Feature: {mapping['col']}, Mappings: {mapping['mapping']}")

# Cleaning any NaN values
diabetes_clean_df= diabetes_clean_df.dropna()

In [None]:
# Descriptive Statistics Overview
Summary_stats= diabetes_clean_df.describe()

Summary_stats

Pipeline functions

In [None]:
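# Correlation analysis on the encoded dataframe
# (diabetes_df still holds text columns, which Pearson correlation cannot use)
corr_analysis = diabetes_clean_df.corr(method='pearson')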
corr_analysis

---

# Data Visualisation

---

In [21]:
# The dataset is too big for the project timescale
# Randomly selecting 500 records (fixed random_state for reproducibility)

diabetes_clean_df = diabetes_clean_df.sample(n=500, random_state=42)

In [None]:
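# set_theme resets the palette, so it is called first
sns.set_theme(style="whitegrid")
sns.set_palette("viridis")

# Mask the upper triangle (including the diagonal) so each pair appears only once
# np.bool was removed from NumPy, so the built-in bool is used
mask = np.zeros_like(corr_analysis, dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr_analysis, annot=True, mask=mask, cmap='viridis', annot_kws={"size": 9}, linewidths=1.5)
plt.ylim(corr_analysis.shape[1], 0)
plt.show()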
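To see what the triangular mask does before plotting, the idea can be sketched on a small matrix. The 3x3 correlation values below are made up for illustration.

```python
import numpy as np

# Hypothetical symmetric correlation matrix
corr = np.array([
    [1.0, 0.2, -0.1],
    [0.2, 1.0, 0.5],
    [-0.1, 0.5, 1.0],
])

# True marks the cells to hide: the diagonal and everything above it
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Hidden cells become NaN; only the lower triangle keeps its values
visible = np.where(mask, np.nan, corr)
print(visible)
```

Since the matrix is symmetric, the lower triangle carries all the pairwise information.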



In [None]:
# Grouping by 'Age' and 'Gender' and creating a stacked bar plot
ax = diabetes_clean_df.groupby(['Age','Gender']).size().unstack().plot(kind='bar', stacked=True)
# Setting the legend labels
ax.legend(title='Gender', labels=['Male', 'Female', 'Other'])
plt.show()
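The hard-coded legend labels assume a particular order of the encoded gender codes, which may not match the encoder's mapping. A safer approach is to build the labels from a code-to-label lookup. The sketch below imitates an ordinal encoding with `pd.factorize` (codes assigned in order of first appearance, starting at 1, which is an assumption about how the encoder behaves); the data and names are hypothetical.

```python
import pandas as pd

# Hypothetical raw values before encoding
raw_gender = pd.Series(["Female", "Male", "Female", "Other", "Male"])

# Imitate an ordinal encoding: integer codes in order of first appearance, from 1
codes, categories = pd.factorize(raw_gender)
encoded_gender = pd.Series(codes + 1)

# Lookup from integer code back to the original label
code_to_label = {i + 1: label for i, label in enumerate(categories)}

# Legend labels follow the order of the codes actually present in the data
counts = encoded_gender.value_counts().sort_index()
legend_labels = [code_to_label[code] for code in counts.index]
print(legend_labels)
```

In the notebook, the same lookup could be derived from the mappings printed after fitting the encoder, rather than typed by hand.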

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Assuming diabetes_clean_df is your DataFrame
# Group by 'Age' and calculate the average Diabetes_Outcome for each gender
grouped_data = diabetes_clean_df.groupby(['Age', 'Gender'], as_index=False).agg(AvgOutcome=('Diabetes_Outcome', 'mean'))

# Pivot the data to get separate columns for each gender
pivot_data = grouped_data.pivot(index='Age', columns='Gender', values='AvgOutcome')

# Plot the data
pivot_data.plot(kind='line', figsize=(5, 5))
plt.xlabel('Age')
plt.ylabel('Average Diabetes Outcome')

# Set custom labels for the legend
gender_labels = ['Male', 'Female', 'Other']  # Adjust this list as needed depending on your dataset
plt.title('Average Diabetes Outcome by Age and Gender')
plt.legend(title='Gender', labels=gender_labels)
plt.show()


In [None]:

# Scatter plot: Age vs. HbA1c, colored by Gender
fig1 = px.scatter(diabetes_clean_df, x='Age', y='HbA1c', color='Gender',
                  labels={'Age': 'Age', 'HbA1c': 'HbA1c Level', 'Gender': 'Gender'},
                  title='Age vs. HbA1c Levels by Gender')
fig1.show()


In [None]:
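# Bar Plot: Average BMI by Region
# px.bar stacks raw values, so the mean BMI per region is computed first
avg_bmi_df = diabetes_clean_df.groupby('Region', as_index=False)['BMI'].mean()
fig2 = px.bar(avg_bmi_df, x='Region', y='BMI', color='Region',
              labels={'Region': 'Region', 'BMI': 'Average BMI'},
              title='Average BMI by Region', barmode='group')
fig2.show()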


In [None]:

# Create the box plot with jitter and different colors for each level of physical activity

fig3 = px.box(diabetes_clean_df, x='Physical_Activity_Level', y='Diabetes_Outcome', 
              points="all", color='Physical_Activity_Level',
              labels={'Physical_Activity_Level': 'Physical Activity Level', 'Diabetes_Outcome': 'Diabetes Outcome'},
              title='Diabetes Outcome by Physical Activity Level')

# Show the plot
fig3.show()

In [None]:
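# plotly.graph_objects provides go.Figure and go.Heatmap used below
import plotly.graph_objects as go

# Calculating the correlation matrix
corr_matrix = diabetes_clean_df.corr()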

# Creating an annotated heatmap
fig4 = go.Figure(data=go.Heatmap(
                   z=corr_matrix.values,
                   x=corr_matrix.columns,
                   y=corr_matrix.index,
                   colorscale='Viridis'))

# Adding annotations
for i in range(len(corr_matrix)):
    for j in range(len(corr_matrix)):
        fig4.add_annotation(
            x=corr_matrix.columns[j],
            y=corr_matrix.index[i],
            text=str(np.round(corr_matrix.values[i, j], 2)),
            showarrow=False,
            font=dict(color="white" if corr_matrix.values[i, j] < 0 else "black")
        )
fig4.update_layout(
    title='Correlation Heatmap',
    xaxis_nticks=36
)
fig4.show()


---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is supplied
except Exception as e:
  print(e)
