# Diabetes risk factors
(Mathematical Statistics Homework Project)

**Viktória Nemkin (M8GXSS)**

I am interested in medical research, so I have chosen the [Diabetes prediction dataset](https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset) from Kaggle as the topic of my homework project.

## Input data

The dataset is anonymised and contains the following data about individuals:

- Age
- Gender
- Body Mass Index (BMI)
- Hypertension = persistently elevated blood pressure in the arteries.
- Heart disease
- Smoking history
- HbA1c level = glycated haemoglobin, a measure of the average blood sugar level over the past 2 to 3 months.
- Blood glucose level
- Diabetes

Together with generic demographic data, these are some of the key indicators that could be used to determine risk factors for developing diabetes.

## Tools

I used Python, with the Pandas library for manipulating the dataset and Scikit-learn for its statistical analysis and evaluation tools.

In [194]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

## Data cleaning and sanity checks

The first step in working with data is understanding it, making sure it is
reliable, and removing problematic items from it.

In [195]:
df = pd.read_csv('diabetes_prediction_dataset.csv')
df.head()

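| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |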


Types of columns:

In [196]:
df.dtypes

gender                  object
age                    float64
hypertension             int64
heart_disease            int64
smoking_history         object
bmi                    float64
HbA1c_level            float64
blood_glucose_level      int64
diabetes                 int64
dtype: object

In [197]:

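for column in df.columns:
    # Print each column name followed by its sorted unique values
    print(f"{column}:")
    print(*sorted(df[column].unique()), "\n")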


gender:
Female Male Other 

age:
0.08 0.16 0.24 0.32 0.4 0.48 0.56 0.64 0.72 0.8
0.88 1.0 1.08 1.16 1.24 1.32 1.4 1.48 1.56 1.64
1.72 1.8 1.88 2.0 3.0 4.0 5.0 6.0 7.0 8.0
9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0 17.0 18.0
19.0 20.0 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0
29.0 30.0 31.0 32.0 33.0 34.0 35.0 36.0 37.0 38.0
39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0
59.0 60.0 61.0 62.0 63.0 64.0 65.0 66.0 67.0 68.0
69.0 70.0 71.0 72.0 73.0 74.0 75.0 76.0 77.0 78.0
79.0 80.0 

hypertension:
0 1 

heart_disease:
0 1 

smoking_history:
No Info current ever former never not current 

bmi:
10.01 10.08 10.14 10.19 10.21 10.3 10.34 10.4 10.5 10.59
10.6 10.62 10.64 10.69 10.76 10.77 10.86 10.89 10.91 10.98
11.0 11.01 11.05 11.08 11.09 11.1 11.16 11.2 11.24 11.25
11.28 11.31 11.34 11.36 11.38 11.39 11.4 11.43 11.44 11.47
11.51 11.53 11.55 11.56 11.65 11.69 11.74 11.75 11.82 11.85
11.88 11.9 11.91 11.93 11.94 11.95 11.97 11.98 11.99 12.0
12.03 

In [210]:
df.describe()

| | age | bmi | HbA1c_level | blood_glucose_level |
|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 41.87566 | 27.320767 | 5.527507 | 138.05806 |
| std | 22.535417 | 6.636783 | 1.070672 | 40.708136 |
| min | 0.0 | 10.01 | 3.5 | 80.0 |
| 25% | 24.0 | 23.63 | 4.8 | 100.0 |
| 50% | 43.0 | 27.32 | 5.8 | 140.0 |
| 75% | 60.0 | 29.58 | 6.2 | 159.0 |
| max | 80.0 | 95.69 | 9.0 | 300.0 |


In [212]:
columns = df.select_dtypes(include='category').columns.tolist()

for column in columns:
    values = sorted(list(df[column].unique()))
    print(column)
    print(values)
    print()

gender
['Female', 'Male', 'Other']

smoking_history
['current', 'ever', 'former', 'never', 'no_info', 'not_current']



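Age should be an integer, gender and smoking history should be categorical, and the 0/1 indicator columns (hypertension, heart disease, diabetes) should be boolean. The smoking history labels are also normalised to consistent identifiers. Note that casting age to an integer truncates the fractional ages recorded for children under two.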

In [199]:
df['smoking_history'] = df['smoking_history'].replace({'No Info': 'no_info', 'not current': 'not_current'})

df['gender'] = df['gender'].astype('category')
df['age'] = df['age'].astype(int)
df['smoking_history'] = df['smoking_history'].astype('category')
df['hypertension'] = df['hypertension'].astype(bool)
df['heart_disease'] = df['heart_disease'].astype(bool)
df['diabetes'] = df['diabetes'].astype(bool)

In [200]:
df.dtypes

gender                 category
age                       int32
hypertension               bool
heart_disease              bool
smoking_history        category
bmi                     float64
HbA1c_level             float64
blood_glucose_level       int64
diabetes                   bool
dtype: object

Check that no values are missing:

In [201]:
df.isna().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64

No column has missing values, so dropping N/A rows is only a safeguard here:

In [202]:
df.dropna(inplace=True)
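
Missing values are only one kind of problematic item. As a rough sketch of two further sanity checks, the snippet below counts duplicated rows and values outside a plausible range. The `sample` frame and the `bounds` dictionary are assumptions made for illustration; the bounds in particular are not taken from the dataset documentation and would need to be chosen with domain knowledge.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook, `df` would be used instead.
sample = pd.DataFrame({
    "age": [80, 54, 54, 28],
    "bmi": [25.19, 27.32, 27.32, 95.69],
    "HbA1c_level": [6.6, 6.6, 6.6, 5.7],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(sample.duplicated().sum())

# Assumed plausibility bounds (illustrative only, not from the dataset docs).
bounds = {"age": (0, 120), "bmi": (10, 70), "HbA1c_level": (3, 15)}

# For each column, count the values that fall outside its bounds.
out_of_range = {
    column: int((~sample[column].between(low, high)).sum())
    for column, (low, high) in bounds.items()
}

print("Duplicate rows:", n_duplicates)
print("Out-of-range values:", out_of_range)
```

Whether flagged rows should be dropped, capped, or kept is a judgment call; the sketch only reports them.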

In [203]:
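# One-hot encode the categorical columns (gender, smoking_history)
df_encoded = pd.get_dummies(df)
# Features: every column except the target
X = df_encoded.drop('diabetes', axis=1)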
X.head()

| | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | gender_Female | gender_Male | gender_Other | smoking_history_current | smoking_history_ever | smoking_history_former | smoking_history_never | smoking_history_no_info | smoking_history_not_current |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80 | False | True | 25.19 | 6.6 | 140 | True | False | False | False | False | False | True | False | False |
| 1 | 54 | False | False | 27.32 | 6.6 | 80 | True | False | False | False | False | False | False | True | False |
| 2 | 28 | False | False | 27.32 | 5.7 | 158 | False | True | False | False | False | False | True | False | False |
| 3 | 36 | False | False | 23.45 | 5.0 | 155 | True | False | False | True | False | False | False | False | False |
| 4 | 76 | True | True | 20.14 | 4.8 | 155 | False | True | False | True | False | False | False | False | False |
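
With full one-hot encoding, the indicator columns of each category always sum to one, so they are perfectly collinear with the intercept of a linear model. The individual coefficients are then not uniquely determined. One common remedy, shown here only as a sketch on a hypothetical `sample` frame, is to drop the first level of each category with the `drop_first` parameter of `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical sample; in the notebook, `df` would be used instead.
sample = pd.DataFrame({
    "age": [80, 54, 28],
    "gender": pd.Categorical(
        ["Female", "Female", "Male"],
        categories=["Female", "Male", "Other"],
    ),
})

# Full encoding: one indicator column per category level.
full = pd.get_dummies(sample)

# Reduced encoding: the first level ("Female") becomes the implicit baseline.
reduced = pd.get_dummies(sample, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced encoding, each remaining coefficient is read relative to the dropped baseline level.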


In [204]:
y = df_encoded['diabetes']
y.head()

0    False
1    False
2    False
3    False
4    False
Name: diabetes, dtype: bool

In [205]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



In [206]:
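# Fit an ordinary least-squares model; the boolean target is treated as 0/1
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate on the held-out 20% of the data
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)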

Mean Squared Error: 0.0511803772720485
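
Since the target is a yes/no outcome, a classifier may be a more natural fit than linear regression, whose predictions are not limited to the range from 0 to 1. The following is a minimal sketch of that alternative using Scikit-learn's `LogisticRegression`. It is not the method used in this notebook, and it runs on synthetic data (`X_demo`, `y_demo`) invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch): two features and a noisy
# binary label that depends on both of them.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 2))
noise = rng.normal(scale=0.5, size=500)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise) > 0

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(Xd_train, yd_train)

# Predicted probabilities of the positive class always lie in [0, 1].
proba = clf.predict_proba(Xd_test)[:, 1]
accuracy = accuracy_score(yd_test, clf.predict(Xd_test))
print("Accuracy:", accuracy)
```

On the real data, `X_train` and `y_train` from the split above could be passed in the same way, possibly after scaling the features.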


In [207]:
# Display the model's coefficients and intercept
coefficients = pd.DataFrame({'Variable': X.columns, 'Coefficient': model.coef_})
intercept = pd.DataFrame({'Variable': ['Intercept'], 'Coefficient': model.intercept_})

print("Coefficients:")
print(coefficients)
print("\nIntercept:")
print(intercept)

Coefficients:
                       Variable  Coefficient
0                           age     0.001336
1                  hypertension     0.092471
2                 heart_disease     0.118214
3                           bmi     0.004055
4                   HbA1c_level     0.081449
5           blood_glucose_level     0.002275
6                 gender_Female     0.011399
7                   gender_Male     0.024079
8                  gender_Other    -0.035478
9       smoking_history_current    -0.001451
10         smoking_history_ever    -0.003983
11       smoking_history_former     0.015201
12        smoking_history_never    -0.002184
13      smoking_history_no_info    -0.011962
14  smoking_history_not_current     0.004379

Intercept:
    Variable  Coefficient
0  Intercept    -0.870916
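
To see how these numbers combine, the prediction for the first row of the dataset can be recomputed by hand as the intercept plus the sum of each coefficient times its feature value. The sketch below uses the rounded coefficients printed above, so the result may differ slightly from `model.predict`. The dictionary names are chosen for illustration.

```python
# Rounded coefficients and intercept, copied from the printed output.
coefficients = {
    "age": 0.001336,
    "hypertension": 0.092471,
    "heart_disease": 0.118214,
    "bmi": 0.004055,
    "HbA1c_level": 0.081449,
    "blood_glucose_level": 0.002275,
    "gender_Female": 0.011399,
    "gender_Male": 0.024079,
    "gender_Other": -0.035478,
    "smoking_history_current": -0.001451,
    "smoking_history_ever": -0.003983,
    "smoking_history_former": 0.015201,
    "smoking_history_never": -0.002184,
    "smoking_history_no_info": -0.011962,
    "smoking_history_not_current": 0.004379,
}
intercept = -0.870916

# Non-zero features of row 0: an 80-year-old woman with heart disease,
# who never smoked (all other indicator columns are 0).
row0 = {
    "age": 80,
    "heart_disease": 1,
    "bmi": 25.19,
    "HbA1c_level": 6.6,
    "blood_glucose_level": 140,
    "gender_Female": 1,
    "smoking_history_never": 1,
}

# Linear model: intercept + sum of coefficient * feature value.
prediction = intercept + sum(
    coefficients[name] * value for name, value in row0.items()
)
print(f"Predicted value for row 0: {prediction:.4f}")
```

The result is roughly 0.32. It is a fitted value of a linear model, not a calibrated probability, and for other rows it can fall below 0 or above 1.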


Source: https://www.geeksforgeeks.org/multiple-linear-regression-with-scikit-learn/

In [208]:
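# df.corr() fails on the categorical columns with
# "ValueError: could not convert string to float: 'Female'",
# so the correlations are computed on the one-hot encoded frame instead.
correlation_matrix = df_encoded.corr()

print("Correlation Matrix:")
print(correlation_matrix)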