# ***Data Description***

* **Disease (Categorical):**
Name of the disease or condition, stored as text (e.g., Asthma, Stroke).
* **Fever (Binary):**
Indicates whether the patient has a fever.
Stored as text (`object` dtype) in this file, so it is a label rather than a temperature reading.
* **Cough (Binary):**
Indicates whether the patient has a cough.
Stored as text (`object` dtype), not as a severity score.
* **Fatigue (Binary):**
Indicates whether the patient experiences fatigue.
Stored as text (`object` dtype), not as a measure of intensity.
* **Difficulty Breathing (Binary):**
Indicates whether the patient has breathing difficulties.
Stored as text (`object` dtype), not as a graded scale.
* **Age (Numeric):**
The patient's age, stored as an integer (`int64`).
* **Gender (Categorical):**
The patient's gender, stored as text.
* **Blood Pressure (Categorical):**
The patient's blood pressure. Stored as text (`object` dtype), so it holds level labels rather than readings in mmHg.
* **Cholesterol Level (Categorical):**
The patient's cholesterol level. Stored as text (`object` dtype), so it holds level labels rather than readings in mg/dL.
* **Outcome Variable (Categorical):**
The target of the classification. It takes two values in this dataset and records the outcome for the patient with respect to the listed disease.
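
The descriptions above can be checked against the file itself before any modelling. The following is a minimal sketch, assuming a small made-up DataFrame in place of the real CSV; `summarize_columns` is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, number of distinct values, and a few example values per column."""
    rows = []
    for col in df.columns:
        rows.append({
            "column": col,
            "dtype": str(df[col].dtype),
            "n_unique": df[col].nunique(),
            "examples": list(df[col].unique()[:3]),
        })
    return pd.DataFrame(rows).set_index("column")

# Hypothetical toy rows, NOT taken from the real dataset
toy = pd.DataFrame({
    "Fever": ["Yes", "No", "Yes"],
    "Age": [25, 40, 31],
    "Blood Pressure": ["Low", "Normal", "High"],
})

summary = summarize_columns(toy)
print(summary)
```

Running the same helper on `data` right after `pd.read_csv` would show which columns are truly binary and which hold several levels.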

**Importing libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
data=pd.read_csv("Disease_symptom_and_patient_profile_dataset.csv")

**Inspecting our data**

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349 entries, 0 to 348
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Disease               349 non-null    object
 1   Fever                 349 non-null    object
 2   Cough                 349 non-null    object
 3   Fatigue               349 non-null    object
 4   Difficulty Breathing  349 non-null    object
 5   Age                   349 non-null    int64 
 6   Gender                349 non-null    object
 7   Blood Pressure        349 non-null    object
 8   Cholesterol Level     349 non-null    object
 9   Outcome Variable      349 non-null    object
dtypes: int64(1), object(9)
memory usage: 27.4+ KB


In [None]:
data['Disease'].value_counts()

Disease,count
Asthma,23
Stroke,16
Osteoporosis,14
Hypertension,10
Diabetes,10
...,...
Autism Spectrum Disorder (ASD),1
Hypoglycemia,1
Fibromyalgia,1
"Eating Disorders (Anorexia,...",1


**Preparing my data**

In [None]:
data=data.dropna(subset=['Outcome Variable'])

**Encoding**

In [None]:
from sklearn import preprocessing

# Label-encode every text column (all columns except Age) to integer codes.
# One encoder is kept per column so the codes can be mapped back to labels later.
categorical_columns = ['Disease', 'Fever', 'Cough', 'Fatigue', 'Difficulty Breathing',
                       'Gender', 'Blood Pressure', 'Cholesterol Level', 'Outcome Variable']
encoders = {}
for column in categorical_columns:
    encoders[column] = preprocessing.LabelEncoder()
    data[column] = encoders[column].fit_transform(data[column])
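
`LabelEncoder` assigns integer codes in sorted order of the labels, which matters when a column has a natural order. The sketch below illustrates this on made-up values (the labels `Low`, `Normal`, `High` are an assumption for illustration, not read from the dataset) and shows how to recover the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, NOT taken from the real dataset
toy = pd.DataFrame({
    "Fever": ["Yes", "No", "Yes", "No"],
    "Blood Pressure": ["Low", "Normal", "High", "Normal"],
})

encoders = {}
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Codes follow the sorted order of the labels, not their meaning
mapping = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mapping)

# inverse_transform maps the integer codes back to the original labels
decoded = encoders["Blood Pressure"].inverse_transform(encoded["Blood Pressure"])
print(list(decoded))
```

In this toy case `High` receives a smaller code than `Low`, so the integers should be read as arbitrary identifiers rather than as an ordered scale.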

**Scaling using ROBUST SCALING**


In [None]:
from sklearn.preprocessing import RobustScaler

columns = data.columns

scaler = RobustScaler()
scaled_data = scaler.fit_transform(data)

data = pd.DataFrame(scaled_data, columns=columns)
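
With its default settings, `RobustScaler` subtracts the median of each column and divides by its interquartile range. The sketch below checks this on a made-up column with one extreme value. Note that the cell above passes every column to the scaler, including `Outcome Variable`, which is why the class labels show up later as -1.0 and 0.0.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single-column data with one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual version: (value - median) / (Q3 - Q1)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(manual.ravel())
```

The extreme value stays extreme after scaling; it simply does not distort the centre and spread used for the other values.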

**Dealing with outliers**

In [None]:
def remove_outlier(df, column):
    # Calculate the first quartile (Q1) - 25th percentile
    Q1 = np.percentile(df[column], 25, method='midpoint')

    # Calculate the third quartile (Q3) - 75th percentile
    Q3 = np.percentile(df[column], 75, method='midpoint')

    # Calculate the Interquartile Range (IQR)
    IQR = Q3 - Q1

    # Define lower limit for outliers
    low_lim = Q1 - 1.5 * IQR

    # Define upper limit for outliers
    up_lim = Q3 + 1.5 * IQR

    # Print column name, lower limit, and upper limit for reference
    print('*', column)
    print('low_limit is', low_lim)
    print('up_limit is', up_lim)
    print('\n')

    # Filter the DataFrame to keep only rows within the lower and upper limits
    df = df[(df[column] < up_lim) & (df[column] > low_lim)]

    return df

In [None]:
data=remove_outlier(data,'Age')



* Age
low_limit is -2.0
up_limit is 2.0
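
The limits printed above follow the usual IQR rule: anything outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR is dropped. As a small worked sketch on made-up ages (`iqr_fences` is a hypothetical helper that mirrors the quartile calculation in `remove_outlier`):

```python
import numpy as np
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) limits of the IQR rule using midpoint quartiles."""
    q1 = np.percentile(values, 25, method="midpoint")
    q3 = np.percentile(values, 75, method="midpoint")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy column, NOT taken from the real dataset
toy = pd.DataFrame({"Age": [1, 2, 3, 4, 5, 6, 7, 8, 100]})

low, high = iqr_fences(toy["Age"])
# Strict inequalities on both sides, as in remove_outlier
kept = toy[toy["Age"].between(low, high, inclusive="neither")]

print(low, high)
print(kept["Age"].tolist())
```

Here Q1 is 3 and Q3 is 7, so the limits are -3 and 13 and only the value 100 is removed. For the scaled `Age` column the limits are 4 units apart, which is what one would expect when the IQR is close to 1 after robust scaling.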




**Inspecting the prepared data**

In [None]:
data

,Disease,Fever,Cough,Fatigue,Difficulty Breathing,Age,Gender,Blood Pressure,Cholesterol Level,Outcome Variable
0,0.000000,0.0,0.0,0.0,1.0,-1.30,0.0,0.0,0.5,0.0
1,-0.592593,-1.0,1.0,0.0,0.0,-1.00,0.0,0.5,0.5,-1.0
2,-0.351852,-1.0,1.0,0.0,0.0,-1.00,0.0,0.5,0.5,-1.0
3,-0.925926,0.0,1.0,-1.0,1.0,-1.00,1.0,0.5,0.5,0.0
4,-0.925926,0.0,1.0,-1.0,1.0,-1.00,1.0,0.5,0.5,0.0
...,...,...,...,...,...,...,...,...,...,...
340,0.870370,-1.0,0.0,0.0,0.0,1.25,0.0,-0.5,-0.5,0.0
341,0.925926,0.0,1.0,0.0,0.0,1.25,0.0,-0.5,-0.5,0.0
342,1.074074,-1.0,0.0,0.0,0.0,1.25,0.0,0.5,0.5,0.0
343,0.833333,0.0,0.0,0.0,0.0,1.75,0.0,-0.5,-0.5,0.0


**Separating features and target**

In [None]:
from sklearn.model_selection import train_test_split

# Features: every column except the target
x = data[['Disease', 'Fever', 'Cough', 'Fatigue', 'Difficulty Breathing', 'Age', 'Gender', 'Blood Pressure', 'Cholesterol Level']]
# Target: the outcome label (already encoded and scaled above)
y = data['Outcome Variable']

***Feature Selection***

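**ANOVA (F-test): scores each feature by comparing how much its mean differs between the target classes with how much it varies inside each class. A higher F-score means the feature separates the classes better, which makes the test suitable for classification problems.**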

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

anova_selector = SelectKBest(f_classif, k=5)
X_kbest = anova_selector.fit_transform(x, y)
print("Selected features:", x.columns[anova_selector.get_support()])


Selected features: Index(['Fever', 'Fatigue', 'Gender', 'Blood Pressure', 'Cholesterol Level'], dtype='object')
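
To make the F-score concrete, the sketch below computes it by hand for made-up data and compares it with `f_classif`. The data, the helper `one_way_f`, and the choice `k=1` are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

def one_way_f(feature, labels):
    """One-way ANOVA F statistic: between-class mean square / within-class mean square."""
    groups = [feature[labels == c] for c in np.unique(labels)]
    grand_mean = feature.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(feature) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical data: one feature that shifts with the class, one pure-noise feature
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
informative = y + rng.normal(scale=0.3, size=100)
noise = rng.normal(size=100)
X = np.column_stack([informative, noise])

f_scores, p_values = f_classif(X, y)
selector = SelectKBest(f_classif, k=1).fit(X, y)

print("F-scores:", f_scores)
print("manual F for first feature:", one_way_f(X[:, 0], y))
print("selected mask:", selector.get_support())
```

Printing `anova_selector.scores_` and `anova_selector.pvalues_` in the notebook would show how close the selected features are to the rejected ones.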


In [None]:
x = x.loc[:, anova_selector.get_support()]


In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
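
In this notebook the scaler and the feature selector are fitted on all rows before the split, so the test rows influence both steps. One possible alternative, sketched here on synthetic data from `make_classification` (an assumption, not the notebook's dataset), is to split first and wrap the steps in a `Pipeline` so that they are fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the nine feature columns
X, y = make_classification(n_samples=300, n_features=9, n_informative=4, random_state=0)

# Split first; stratify keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

pipe = Pipeline([
    ("scale", RobustScaler()),                 # fitted on training rows only
    ("select", SelectKBest(f_classif, k=5)),   # fitted on training rows only
    ("tree", DecisionTreeClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

The target is left untouched by the pipeline, so the class labels keep their encoded values.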

# ***Decision tree***

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier(random_state=0)
clf=clf.fit(X_train, y_train)

In [None]:
y_pred=clf.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

[[23  7]
 [13 26]]
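
In scikit-learn's confusion matrix the rows are the true classes and the columns are the predicted classes. The per-class metrics can be derived directly from the matrix printed above, as in this short sketch:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[23, 7],
               [13, 26]])

accuracy = np.trace(cm) / cm.sum()            # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)      # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)         # per class: correct / truly that class
f1 = 2 * precision * recall / (precision + recall)

print("accuracy :", round(accuracy, 2))
print("precision:", np.round(precision, 2))
print("recall   :", np.round(recall, 2))
print("f1       :", np.round(f1, 2))
```

The row sums give the number of true samples per class (30 and 39), and 49 of the 69 test samples are classified correctly.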


In [None]:
from sklearn.metrics import classification_report

# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

        -1.0       0.64      0.77      0.70        30
         0.0       0.79      0.67      0.72        39

    accuracy                           0.71        69
   macro avg       0.71      0.72      0.71        69
weighted avg       0.72      0.71      0.71        69

