# Diabetes risk factors
(Mathematical Statistics Homework Project)

**Viktória Nemkin (M8GXSS)**

I am interested in medical research, so I have chosen the [Diabetes prediction dataset](https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset) from Kaggle as the topic of my homework project.

## Input data

The dataset is anonymised and contains the following data about individuals:

- Age
- Gender
- Body Mass Index (BMI)
- Hypertension = persistently elevated blood pressure in the arteries.
- Heart disease
- Smoking history
- HbA1c level = glycated haemoglobin, a measure of the average blood sugar level over the past 2 to 3 months.
- Blood glucose level
- Diabetes

These are key clinical indicators together with general demographic data, which can be used to identify risk factors for developing diabetes.

## Tools

I used Python, with the Pandas library for manipulating the dataset and Scikit-learn for its statistical modelling and evaluation tools.

In [144]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

## Data cleaning and sanity checks

The first step in working with data is to understand it, check that it is
reliable, and clean out any problematic entries.

In [145]:
df = pd.read_csv('diabetes_prediction_dataset.csv')
df.head()

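| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |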


Types of columns:

In [146]:
df.dtypes

gender                  object
age                    float64
hypertension             int64
heart_disease            int64
smoking_history         object
bmi                    float64
HbA1c_level            float64
blood_glucose_level      int64
diabetes                 int64
dtype: object

Age should be an integer, gender and smoking history should be categorical, and the 0/1 indicator columns (hypertension, heart disease, diabetes) should be boolean.

In [147]:
df['gender'] = df['gender'].astype('category')
df['age'] = df['age'].astype(int)
df['smoking_history'] = df['smoking_history'].astype('category')
df['hypertension'] = df['hypertension'].astype(bool)
df['heart_disease'] = df['heart_disease'].astype(bool)
df['diabetes'] = df['diabetes'].astype(bool)

In [148]:
df.dtypes

gender                 category
age                       int32
hypertension               bool
heart_disease              bool
smoking_history        category
bmi                     float64
HbA1c_level             float64
blood_glucose_level       int64
diabetes                   bool
dtype: object
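
One caveat with the integer cast: `astype(int)` truncates any fractional part without warning. The sketch below shows a quick check that could be run before casting. The ages in it are hypothetical values chosen for illustration, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical ages (an assumption for illustration); two of them are fractional.
ages = pd.Series([80.0, 54.0, 0.08, 1.32], name="age")

# Values that would lose information when cast to int.
fractional = ages[ages != ages.round()]
print(fractional.tolist())

# astype(int) truncates toward zero rather than raising an error.
truncated = ages.astype(int)
print(truncated.tolist())
```

If such a check finds fractional ages in the real data, keeping the column as `float64` may be the safer choice.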

Check each column for missing values:

In [149]:
df.isna().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64
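
Note that `isna()` only detects real missing values. A placeholder string such as the `No Info` entries visible in `smoking_history` is counted as ordinary data. The following is a minimal sketch on a small hypothetical frame (not the real dataset) of how such placeholders could be converted to missing values if that is desired.

```python
import pandas as pd

# Hypothetical sample (an assumption); 'No Info' mimics the placeholder seen in the dataset.
demo = pd.DataFrame({"smoking_history": ["never", "No Info", "current", "No Info"]})

# The placeholder is a normal string, so isna() reports nothing missing.
missing_before = demo["smoking_history"].isna().sum()

# mask() replaces the entries where the condition holds with NaN.
cleaned = demo["smoking_history"].mask(demo["smoking_history"] == "No Info")
missing_after = cleaned.isna().sum()

print(missing_before, missing_after)
```

Whether to treat `No Info` as missing or as its own category is a modelling decision; this report keeps it as a category.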

Drop any rows with missing values (a no-op here, since none were found):

In [150]:
df.dropna(inplace=True)

One-hot encode the categorical columns and separate the features from the target:

In [151]:
df_encoded = pd.get_dummies(df)
X = df_encoded.drop('diabetes', axis=1)
X.head()

| | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | gender_Female | gender_Male | gender_Other | smoking_history_No Info | smoking_history_current | smoking_history_ever | smoking_history_former | smoking_history_never | smoking_history_not current |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80 | False | True | 25.19 | 6.6 | 140 | True | False | False | False | False | False | False | True | False |
| 1 | 54 | False | False | 27.32 | 6.6 | 80 | True | False | False | True | False | False | False | False | False |
| 2 | 28 | False | False | 27.32 | 5.7 | 158 | False | True | False | False | False | False | False | True | False |
| 3 | 36 | False | False | 23.45 | 5.0 | 155 | True | False | False | False | True | False | False | False | False |
| 4 | 76 | True | True | 20.14 | 4.8 | 155 | False | True | False | False | True | False | False | False | False |
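
With a full one-hot encoding, the dummy columns of each categorical variable sum to 1 in every row, which duplicates the information carried by the regression intercept. The coefficients of the individual dummies are then not uniquely determined. One common remedy is to drop one level per variable with `drop_first=True`. The sketch below uses a small hypothetical frame, not the real dataset.

```python
import pandas as pd

# Hypothetical sample (an assumption for illustration).
demo = pd.DataFrame({
    "gender": pd.Categorical(["Female", "Male", "Other", "Female"]),
    "age": [80, 54, 28, 36],
})

# Full encoding: one column per category level.
full = pd.get_dummies(demo)
row_sums = full[["gender_Female", "gender_Male", "gender_Other"]].sum(axis=1)
print(row_sums.tolist())  # every row sums to 1

# Reduced encoding: the first level (Female) becomes the reference category.
reduced = pd.get_dummies(demo, drop_first=True)
print(list(reduced.columns))
```

With the reduced encoding, each remaining dummy coefficient reads as the difference from the reference category.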


In [152]:
y = df_encoded['diabetes']
y.head()

0    False
1    False
2    False
3    False
4    False
Name: diabetes, dtype: bool

Hold out 20% of the rows as a test set:

In [153]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
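
If the positive class is a small minority, a purely random split can leave the training and test sets with slightly different class proportions. The `stratify` parameter of `train_test_split` keeps them equal. The sketch below demonstrates this on synthetic data; the 10% positive rate and the names `X_demo` and `y_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features and an imbalanced boolean target (10% positives, an assumption).
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([True] * 100 + [False] * 900)

# stratify=y_demo preserves the class proportions in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```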



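Fit a multiple linear regression on the training set and measure the mean squared error on the test set:

In [154]: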
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 0.051180377272048494
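
An MSE value is easier to judge against a baseline. For a True/False target with positive fraction $p$, always predicting $p$ gives an MSE of $p(1-p)$. The sketch below checks this identity on synthetic data; the 10% positive rate is an assumption and not the prevalence in this dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic binary target with roughly 10% positives (an assumption).
y_true = (rng.random(10_000) < 0.10).astype(float)

# Constant baseline: predict the observed positive fraction for everyone.
p = y_true.mean()
baseline_pred = np.full_like(y_true, p)
baseline_mse = mean_squared_error(y_true, baseline_pred)

print(baseline_mse, p * (1 - p))
```

Computing the same baseline from `y_test` would show how much of the reported error the features actually explain.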


In [155]:
# Display the model's coefficients and intercept
coefficients = pd.DataFrame({'Variable': X.columns, 'Coefficient': model.coef_})
intercept = pd.DataFrame({'Variable': ['Intercept'], 'Coefficient': model.intercept_})

print("Coefficients:")
print(coefficients)
print("\nIntercept:")
print(intercept)

Coefficients:
                       Variable  Coefficient
0                           age     0.001336
1                  hypertension     0.092471
2                 heart_disease     0.118214
3                           bmi     0.004055
4                   HbA1c_level     0.081449
5           blood_glucose_level     0.002275
6                 gender_Female     0.011399
7                   gender_Male     0.024079
8                  gender_Other    -0.035478
9       smoking_history_No Info    -0.011962
10      smoking_history_current    -0.001451
11         smoking_history_ever    -0.003983
12       smoking_history_former     0.015201
13        smoking_history_never    -0.002184
14  smoking_history_not current     0.004379

Intercept:
    Variable  Coefficient
0  Intercept    -0.870916


Source: https://www.geeksforgeeks.org/multiple-linear-regression-with-scikit-learn/
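
Because the target is True/False, the linear model's output is best read as a rough score rather than a probability: nothing constrains it to the range from 0 to 1. As a worked example, the sketch below plugs a hypothetical individual into the coefficients printed above. The individual's values are assumptions for illustration; features that are False for this person contribute zero and are left out.

```python
# Coefficients copied from the printed model output above.
coef = {
    "age": 0.001336,
    "bmi": 0.004055,
    "HbA1c_level": 0.081449,
    "blood_glucose_level": 0.002275,
    "gender_Female": 0.011399,
    "smoking_history_never": -0.002184,
}
intercept = -0.870916

# Hypothetical individual (an assumption, not a row of the dataset).
person = {
    "age": 20,
    "bmi": 20.0,
    "HbA1c_level": 5.0,
    "blood_glucose_level": 100,
    "gender_Female": 1,
    "smoking_history_never": 1,
}

# Linear prediction: intercept plus the weighted sum of the features.
prediction = intercept + sum(coef[name] * person[name] for name in coef)
print(round(prediction, 3))  # -0.119
```

The result is negative, which cannot be a probability. A common alternative for binary targets is logistic regression (`sklearn.linear_model.LogisticRegression`), whose predicted probabilities always lie between 0 and 1.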

In [156]:
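# df.corr() raises "ValueError: could not convert string to float: 'never'"
# because gender and smoking_history are categorical. The one-hot encoded
# frame contains only numeric and boolean columns, so use that instead.
correlation_matrix = df_encoded.corr()

print("Correlation Matrix:")
print(correlation_matrix)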
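
`DataFrame.corr()` needs numeric or boolean columns, so string and categorical columns have to be encoded or excluded first. For finding risk factors, the most relevant part of the matrix is the column for the target. The following is a minimal sketch on a small hypothetical frame (the values are assumptions, not dataset rows) that one-hot encodes the data and ranks the features by their correlation with `diabetes`.

```python
import pandas as pd

# Hypothetical sample (an assumption for illustration).
demo = pd.DataFrame({
    "age": [80, 54, 28, 36, 76, 45],
    "smoking_history": pd.Categorical(
        ["never", "No Info", "never", "current", "current", "former"]
    ),
    "diabetes": [True, False, False, False, True, False],
})

# Encode the categorical column so that every column is numeric or boolean.
encoded = pd.get_dummies(demo)

# Correlation of each feature with the target, strongest positive first.
target_corr = (
    encoded.corr()["diabetes"].drop("diabetes").sort_values(ascending=False)
)
print(target_corr)
```

Correlation only measures linear association between pairs of columns, so it complements the regression coefficients rather than replacing them.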