# Web application for predicting whether a patient may suffer a stroke

### Description

According to the World Health Organization (WHO), stroke is the second leading cause of death worldwide and is responsible for approximately 11% of all deaths.

This application predicts whether a patient may suffer a stroke based on input parameters such as gender, age, selected medical conditions, and smoking status.

### Goal

The goal of the project was to identify the factors that influence stroke risk.

### Results

Conditions such as hypertension and heart disease were observed to have a significant effect on stroke risk. In addition, stroke risk is noticeably higher among self-employed people.

### Data
The data comes from the [Stroke Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset).





## Importing the required libraries

In [126]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf

## Loading and transforming the data

In [127]:
from google.colab import drive
drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/stroke/healthcare-dataset-stroke-data.csv'
data = pd.read_csv(file_path)
# Drop the row identifier and the glucose column, which the model does not use
data = data.drop(columns=['id', 'avg_glucose_level'])

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Exploratory data analysis

In [128]:
print("Shape")
print(data.shape)
print("First 10")
print(data.head(10))
print("Types")
print(data.dtypes)
print("Description")
print(data.describe())
print("Nulls")
print(data.isnull().sum())

Shape
(5110, 10)
First 10
   gender   age  hypertension  heart_disease ever_married      work_type  \
0    Male  67.0             0              1          Yes        Private   
1  Female  61.0             0              0          Yes  Self-employed   
2    Male  80.0             0              1          Yes        Private   
3  Female  49.0             0              0          Yes        Private   
4  Female  79.0             1              0          Yes  Self-employed   
5    Male  81.0             0              0          Yes        Private   
6    Male  74.0             1              1          Yes        Private   
7  Female  69.0             0              0           No        Private   
8  Female  59.0             0              0          Yes        Private   
9  Female  78.0             0              0          Yes        Private   

  Residence_type   bmi   smoking_status  stroke  
0          Urban  36.6  formerly smoked       1  
1          Rural   NaN     never smok

## Mapping text values to numeric values

In [129]:
gender_mapping = {'Other': 2, 'Male': 0, 'Female': 1}
data['gender'] = data['gender'].map(gender_mapping)

married_mapping = {'No': 0, 'Yes': 1}
data['ever_married'] = data['ever_married'].map(married_mapping)

work_mapping = {'Never_worked': 0, 'Private': 1, 'Self-employed': 2, 'Govt_job': 3, 'children': 4}
data['work_type'] = data['work_type'].map(work_mapping)

residence_mapping = {'Rural': 0, 'Urban': 1}
data['Residence_type'] = data['Residence_type'].map(residence_mapping)

smoking_mapping = {'Unknown': 3, 'formerly smoked': 0, 'never smoked': 1, 'smokes': 2}
data['smoking_status'] = data['smoking_status'].map(smoking_mapping)
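
Integer codes such as those in `work_mapping` impose an ordering on categories that have none, and `Series.map` returns NaN for any label missing from the dictionary. The sketch below illustrates both points on a made-up three-row frame (`sample` and `partial_mapping` are hypothetical names, not part of the notebook). One-hot encoding with `pd.get_dummies` is shown only as a possible alternative, not as what this notebook does.

```python
import pandas as pd

# Hypothetical three-row sample; the notebook itself reads the Kaggle CSV.
sample = pd.DataFrame({
    "work_type": ["Private", "Self-employed", "Govt_job"],
    "age": [67.0, 61.0, 49.0],
})

# Series.map leaves labels that are absent from the dictionary as NaN.
partial_mapping = {"Private": 1, "Self-employed": 2}
mapped = sample["work_type"].map(partial_mapping)
unmapped_count = int(mapped.isna().sum())

# One-hot encoding creates one indicator column per category instead of
# a single ordered integer code.
encoded = pd.get_dummies(sample, columns=["work_type"], dtype=int)

print(unmapped_count)
print(sorted(encoded.columns))
```

Checking `isna().sum()` right after each `map` call is a cheap way to catch labels that the dictionary does not cover before `dropna()` removes those rows.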

## Removing unwanted values

In [130]:
data = data.dropna()
data = data[data["gender"] != 2]
data = data[data["smoking_status"] != 3]

### Exporting the data

In [131]:
data.to_csv('/content/drive/MyDrive/stroke/cleanup.csv')

In [125]:
# Share of stroke cases recorded for males (gender code 0)
stroke_cases = data[data["stroke"] == 1]
male_stroke_cases = stroke_cases[stroke_cases["gender"] == 0]
print(len(male_stroke_cases) / len(stroke_cases))

0.4166666666666667
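
The ratio above is the share of stroke cases with gender code 0. The same kind of comparison can be made per group with `value_counts` and `groupby`, which is one way the per-group risk mentioned in the results could be inspected. The frame `toy` below is made up for illustration; its values are not taken from the dataset.

```python
import pandas as pd

# Made-up data using the notebook's integer codes
toy = pd.DataFrame({
    "work_type": [1, 1, 1, 1, 2, 2, 3, 3],
    "gender":    [0, 1, 1, 0, 0, 1, 1, 0],
    "stroke":    [0, 0, 1, 0, 1, 1, 0, 0],
})

# Share of stroke cases that fall into each gender code
gender_share = toy.loc[toy["stroke"] == 1, "gender"].value_counts(normalize=True)

# Stroke rate within each work type: the mean of a 0/1 column is the
# fraction of rows equal to 1
rate_by_work = toy.groupby("work_type")["stroke"].mean()

print(gender_share)
print(rate_by_work)
```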


## Splitting into training and test data

In [None]:
X = data.drop('stroke', axis=1)
y = data['stroke']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
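
Because stroke cases are a small minority, a purely random split can leave the test set with a different share of positives than the training set. A minimal sketch of a stratified split, assuming the `stratify` argument of `train_test_split` is acceptable here. The arrays `X_demo` and `y_demo` are synthetic stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 50 positives out of 1000 rows
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:50] = 1

# stratify keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

In the notebook this would amount to adding `stratify=y` to the existing call.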

## Standardizing the data

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
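
`StandardScaler` rescales each column to `z = (x - mean) / std`, where the mean and standard deviation are taken from the data passed to `fit`. The sketch below, on made-up numbers, checks that `transform` on new rows reuses the training statistics. This is why the cell above calls `fit_transform` on the training set and only `transform` on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up columns standing in for age and bmi
train = np.array([[20.0, 18.0], [40.0, 24.0], [60.0, 30.0]])
test = np.array([[80.0, 36.0]])

scaler_demo = StandardScaler().fit(train)

# Manual version: training mean and population standard deviation (ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

scaled_test = scaler_demo.transform(test)
print(np.allclose(scaled_test, manual))
```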

## Defining the neural network model

---



In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
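
To make the architecture concrete, each `Dense` layer computes `activation(x @ W + b)`. The NumPy sketch below reproduces the forward pass of a 64-64-1 network. It assumes 9 input features, as in the cleaned data, and uses random placeholder weights rather than the trained ones. The last layer has no activation, so the output is an unbounded real number rather than a probability.

```python
import numpy as np

def relu(z):
    """Element-wise max(z, 0)."""
    return np.maximum(z, 0.0)

def forward(x, params):
    """Forward pass through two ReLU layers and a linear output layer."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(x @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return h2 @ W3 + b3  # linear output, not squashed into [0, 1]

rng = np.random.default_rng(0)
n_features = 9  # assumption: the nine feature columns left after cleaning
shapes = [(n_features, 64), (64, 64), (64, 1)]
params = [(rng.normal(scale=0.1, size=s), np.zeros(s[1])) for s in shapes]

batch = rng.normal(size=(4, n_features))
out = forward(batch, params)

# Weights plus biases across the three layers
n_params = sum(W.size + b.size for W, b in params)
print(out.shape, n_params)
```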

## Compiling the model

In [None]:
model.compile(optimizer='adam', loss='mse')

## Training the model

In [None]:
model.fit(X_train_scaled, y_train, epochs=50, batch_size=8, verbose=2)

Epoch 1/50
343/343 - 1s - loss: 0.0530 - 1s/epoch - 4ms/step
Epoch 2/50
343/343 - 1s - loss: 0.0464 - 516ms/epoch - 2ms/step
Epoch 3/50
343/343 - 1s - loss: 0.0454 - 524ms/epoch - 2ms/step
Epoch 4/50
343/343 - 1s - loss: 0.0446 - 507ms/epoch - 1ms/step
Epoch 5/50
343/343 - 1s - loss: 0.0430 - 518ms/epoch - 2ms/step
Epoch 6/50
343/343 - 1s - loss: 0.0425 - 640ms/epoch - 2ms/step
Epoch 7/50
343/343 - 1s - loss: 0.0419 - 746ms/epoch - 2ms/step
Epoch 8/50
343/343 - 1s - loss: 0.0418 - 793ms/epoch - 2ms/step
Epoch 9/50
343/343 - 1s - loss: 0.0412 - 790ms/epoch - 2ms/step
Epoch 10/50
343/343 - 1s - loss: 0.0410 - 761ms/epoch - 2ms/step
Epoch 11/50
343/343 - 1s - loss: 0.0405 - 1s/epoch - 4ms/step
Epoch 12/50
343/343 - 1s - loss: 0.0400 - 1s/epoch - 3ms/step
Epoch 13/50
343/343 - 1s - loss: 0.0397 - 1s/epoch - 4ms/step
Epoch 14/50
343/343 - 1s - loss: 0.0390 - 982ms/epoch - 3ms/step
Epoch 15/50
343/343 - 1s - loss: 0.0395 - 848ms/epoch - 2ms/step
Epoch 16/50
343/343 - 1s - loss: 0.0390 - 865m

<keras.src.callbacks.History at 0x7ba30a021660>

## Evaluating the model

In [None]:
mse = model.evaluate(X_test_scaled, y_test, verbose=0)
print("Mean Squared Error:", mse)

Mean Squared Error: 0.07942944765090942
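
Since the target is 0/1 and stroke cases are rare, the mean squared error alone says little about how many stroke cases are actually caught. A minimal sketch of a complementary check: threshold the scores at 0.5, the same cut-off used in the prediction cell, and compute a confusion matrix, precision, and recall. The arrays below are made up. In the notebook the scores would presumably come from `model.predict(X_test_scaled).ravel()`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and model scores for ten patients
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.05, 0.6, 0.3, 0.0, 0.7, 0.4, 0.1, 0.2])

# Turn continuous scores into class labels
y_pred = (scores > 0.5).astype(int)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(cm)
print(precision, recall)
```

Recall is the fraction of true stroke cases that the thresholded model flags, which is usually the figure of interest for a screening tool.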


## Example prediction on new data

In [None]:
new_data = pd.DataFrame([[1, 23, 0, 0, 0, 1, 1, 31, 3]], columns=['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'bmi', 'smoking_status'])
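# The model was trained on standardized features, so scale the input the same way
predicted = 'stroke' if model.predict(scaler.transform(new_data))[0][0] > 0.5 else 'no stroke'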
print(f'Predicted: {predicted}')

Predicted: stroke


## Exporting the data

0         Male
2         Male
3       Female
4       Female
5         Male
         ...  
5100      Male
5102    Female
5106    Female
5107    Female
5108      Male
Name: gender, Length: 3425, dtype: object


In [None]:
print(data.info())
test = data.drop('stroke', axis=1)

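# Predict on features scaled like the training data; flatten the (n, 1) output
data['training_score'] = model.predict(scaler.transform(test)).ravel()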

data['training_score'] = (data['training_score']-data['training_score'].min())/(data['training_score'].max()-data['training_score'].min())
data['training_score'] = round(data['training_score'])
data['training_score'] = data['training_score'].astype(int)


gender_mapping = {0: 'Male', 1: 'Female'}
data['gender'] = data['gender'].map(gender_mapping)

# Use .loc so the assignment changes data itself; chained indexing assigns to a copy
data.loc[data["age"] < 1, "age"] = 0

hypertension_mapping = {0: 'No', 1: 'Yes'}
data['hypertension'] = data['hypertension'].map(hypertension_mapping)

heart_mapping = {0: 'No', 1: 'Yes'}
data['heart_disease'] = data['heart_disease'].map(heart_mapping)

married_mapping = {0: 'No', 1: 'Yes'}
data['ever_married'] = data['ever_married'].map(married_mapping)

work_mapping = {0: 'Never_worked', 1: 'Private', 2: 'Self-employed', 3: 'Govt_job', 4: 'children'}
data['work_type'] = data['work_type'].map(work_mapping)

residence_mapping = {0: 'Rural', 1: 'Urban'}
data['Residence_type'] = data['Residence_type'].map(residence_mapping)

# Inverse of the mapping applied earlier (rows with code 3, 'Unknown', were removed)
smoking_mapping = {0: 'formerly smoked', 1: 'never smoked', 2: 'smokes', 3: 'Unknown'}
data['smoking_status'] = data['smoking_status'].map(smoking_mapping)

stroke_mapping = {0: 'No', 1: 'Yes'}
data['stroke'] = data['stroke'].map(stroke_mapping)

data.to_csv('/content/drive/MyDrive/stroke/result.csv')

<class 'pandas.core.frame.DataFrame'>
Index: 3425 entries, 0 to 5108
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   gender          3425 non-null   int64  
 1   age             3425 non-null   float64
 2   hypertension    3425 non-null   int64  
 3   heart_disease   3425 non-null   int64  
 4   ever_married    3425 non-null   int64  
 5   work_type       3425 non-null   int64  
 6   Residence_type  3425 non-null   int64  
 7   bmi             3425 non-null   float64
 8   smoking_status  3425 non-null   int64  
 9   stroke          3425 non-null   int64  
dtypes: float64(2), int64(8)
memory usage: 294.3 KB
None
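
The `training_score` column is built by min-max scaling the raw predictions to the range [0, 1] and then rounding. In effect the cut-off becomes the midpoint between the smallest and largest raw prediction, not a fixed 0.5. A small worked sketch on made-up scores (`raw` is hypothetical, not model output):

```python
import numpy as np

# Made-up raw model outputs
raw = np.array([-0.02, 0.01, 0.05, 0.12, 0.30])

# Min-max scaling followed by rounding, as in the export cell
scaled = (raw - raw.min()) / (raw.max() - raw.min())
labels = np.round(scaled).astype(int)

# Equivalent cut-off on the raw scale
midpoint = (raw.min() + raw.max()) / 2

# For comparison: a fixed 0.5 threshold on the raw outputs
fixed = (raw > 0.5).astype(int)

print(labels, midpoint, fixed)
```

With this scheme the highest-scoring row is always labelled 1, whatever its raw value, so the two thresholds can disagree.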
