# AI-Powered Symptom Checker: Data Exploration

This notebook will help you explore and understand the datasets in the `data` folder. We will load, inspect, and summarize the CSV files to prepare for further preprocessing and modeling.

In [1]:
# Import Required Libraries
import pandas as pd
import os

In [2]:
# Set Data Folder Path
data_folder = 'data'

In [3]:
# List CSV Files in Data Folder
csv_files = [f for f in os.listdir(data_folder) if f.endswith('.csv')]
print('Found CSV files:', csv_files)

Found CSV files: ['Symptom-severity.csv', 'new_p.csv', 'disease_centric_knowledgebase_with_doctor.csv', 'symptom_Description.csv']


In [4]:
# Define Function to Explore CSV Files
def explore_csv(file_path):
    """Print shape, column names, first rows, and missing-value counts for one CSV file."""
    print(f'\nExploring {file_path}')
    df = pd.read_csv(file_path)
    print('Shape:', df.shape)
    print('Columns:', df.columns.tolist())
    print('First 5 rows:')
    print(df.head())
    print('Missing values per column:')
    print(df.isnull().sum())
    print('-' * 40)

In [5]:
# Explore All CSV Files in Folder
for csv_file in csv_files:
    explore_csv(os.path.join(data_folder, csv_file))


Exploring data/Symptom-severity.csv
Shape: (133, 2)
Columns: ['Symptom', 'weight']
First 5 rows:
                Symptom  weight
0               itching       1
1             skin_rash       3
2  nodal_skin_eruptions       4
3   continuous_sneezing       4
4             shivering       5
Missing values per column:
Symptom    0
weight     0
dtype: int64
----------------------------------------

Exploring data/new_p.csv
Shape: (100000, 137)
Columns: ['abdominal_pain', 'abnormal_menstruation', 'acidity', 'acute_liver_failure', 'altered_sensorium', 'anxiety', 'back_pain', 'belly_pain', 'blackheads', 'bladder_discomfort', 'blister', 'blood_in_sputum', 'bloody_stool', 'blurred_and_distorted_vision', 'breathlessness', 'brittle_nails', 'bruising', 'burning_micturition', 'chest_pain', 'chills', 'cold_hands_and_feets', 'coma', 'congestion', 'constipation', 'continuous_feel_of_urine', 'continuous_sneezing', 'cough', 'cramps', 'dark_urine', 'dehydration', 'depression', 'diarrhoea', 'dischromic__p
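
With 137 columns, printing every column name and the full `head()` of `new_p.csv` is hard to read. A compact summary can be easier to scan for wide files. The sketch below is one possible helper; `summarize_wide` and its `max_cols` parameter are hypothetical names, and the toy frame stands in for the real data.

```python
import pandas as pd

def summarize_wide(df, max_cols=5):
    """Return a compact summary dict for a DataFrame with many columns."""
    return {
        "shape": df.shape,
        # How many columns there are of each dtype
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        # Only the first few column names, to keep the output short
        "first_columns": df.columns[:max_cols].tolist(),
        # Total number of missing cells across the whole frame
        "total_missing": int(df.isnull().sum().sum()),
    }

# Hypothetical toy frame; not taken from the data folder.
toy = pd.DataFrame({"a": [1, 0], "b": [0, 1], "label": ["x", "y"]})
summary = summarize_wide(toy, max_cols=2)
print(summary)
```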

# Data Preprocessing

Now that we have explored the data, the next step is to preprocess it. Data preprocessing involves cleaning and transforming the data so it can be used to train a machine learning model. This includes handling missing values, standardizing text, and encoding categorical variables.

## 1. Handling Missing Values

Missing values can cause problems for machine learning models. We need to check for missing data and decide how to handle it (e.g., fill with a default value or remove rows/columns with too many missing values).

In [6]:
# Reload all CSV files into dataframes for preprocessing
all_dfs = {}
for csv_file in csv_files:
    df = pd.read_csv(os.path.join(data_folder, csv_file))
    all_dfs[csv_file] = df
    print(f"Loaded {csv_file} with shape {df.shape}")

Loaded Symptom-severity.csv with shape (133, 2)
Loaded new_p.csv with shape (100000, 137)
Loaded disease_centric_knowledgebase_with_doctor.csv with shape (4920, 4)
Loaded symptom_Description.csv with shape (41, 2)


In [7]:
# Check missing values in each dataframe
for name, df in all_dfs.items():
    print(f"\nMissing values in {name}:")
    print(df.isnull().sum())


Missing values in Symptom-severity.csv:
Symptom    0
weight     0
dtype: int64

Missing values in new_p.csv:
abdominal_pain           0
abnormal_menstruation    0
acidity                  0
acute_liver_failure      0
altered_sensorium        0
                        ..
gender                   0
disease                  0
precaution               0
doctor_type              0
total_symptoms           0
Length: 137, dtype: int64

Missing values in disease_centric_knowledgebase_with_doctor.csv:
disease         0
symptom_list    0
precaution      0
doctor_type     0
dtype: int64

Missing values in symptom_Description.csv:
Disease        0
Description    0
dtype: int64
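
None of the four files contains missing values, so no filling or dropping is needed here. If a later version of the data did have gaps, the two options mentioned above (dropping and filling) could be sketched as follows. The toy frame, the 50% threshold, and the fill value of 0 are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd

nan = float("nan")

# Hypothetical toy frame; not taken from the data folder.
toy = pd.DataFrame({
    "fever": [1, 0, nan, 1],
    "cough": [nan, nan, nan, 1],
    "disease": ["flu", None, "cold", "flu"],
})

# Drop columns where more than half of the values are missing (assumed threshold).
max_missing_ratio = 0.5
keep = toy.columns[toy.isnull().mean() <= max_missing_ratio]
cleaned = toy[keep].copy()

# Drop rows that have no label, then fill remaining gaps in the symptom flag with 0.
cleaned = cleaned.dropna(subset=["disease"])
cleaned["fever"] = cleaned["fever"].fillna(0).astype(int)

print(cleaned)
```

Whether 0 is a sensible fill value depends on what a missing entry means in the source data, so this choice should be checked against how the dataset was built.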


## 2. Standardizing Text Data

To ensure consistency, we will standardize all text data (such as symptom and disease names) by converting it to lowercase and stripping leading and trailing whitespace. This prevents the same value from being treated as several different ones because of inconsistent capitalization or stray spaces.

In [8]:
# Standardize all string columns in each dataframe
for name, df in all_dfs.items():
    for col in df.select_dtypes(include='object').columns:
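        # Lowercase and trim; .where keeps genuinely missing values as NaN
        # instead of turning them into the string 'nan'.
        df[col] = df[col].astype(str).str.lower().str.strip().where(df[col].notna())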
    all_dfs[name] = df
    print(f"Standardized text columns in {name}")

Standardized text columns in Symptom-severity.csv
Standardized text columns in new_p.csv
Standardized text columns in disease_centric_knowledgebase_with_doctor.csv
Standardized text columns in symptom_Description.csv
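
As a quick check of what this step does, the sketch below applies the same lowercasing and trimming to a few made-up labels. It also shows one thing `str.strip()` does not do: whitespace inside a value is left alone. Collapsing it with a regex replace is an optional extra, not part of the cell above.

```python
import pandas as pd

# Made-up labels with inconsistent case and stray spaces.
raw = pd.Series(["  Skin_Rash", "skin_rash ", "SKIN_RASH", "high  fever"])

# Same operations as the notebook cell: lowercase, then trim the ends.
standardized = raw.str.lower().str.strip()
print(standardized.nunique())  # the three skin_rash spellings now count as one value

# Optional: collapse runs of internal whitespace into a single space.
collapsed = standardized.str.replace(r"\s+", " ", regex=True)
print(collapsed.tolist())
```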


## 3. Encoding Categorical Variables

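Most machine learning models expect numeric input rather than text, so categorical columns (like disease or symptom names) need to be converted into numeric codes. This is called encoding. We'll use label encoding, which assigns each distinct value an integer and stores it in a new `<column>_code` column next to the original. These integers are arbitrary identifiers: a model may interpret them as ordered, so label encoding is safest for a target column such as the disease label.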

In [9]:
# Encode categorical variables using label encoding
for name, df in all_dfs.items():
    for col in df.select_dtypes(include='object').columns:
        # cat.codes numbers the sorted unique values from 0; missing values become -1.
        df[col + '_code'] = df[col].astype('category').cat.codes
    all_dfs[name] = df
    print(f"Encoded categorical columns in {name}")

Encoded categorical columns in Symptom-severity.csv
Encoded categorical columns in new_p.csv
Encoded categorical columns in disease_centric_knowledgebase_with_doctor.csv
Encoded categorical columns in symptom_Description.csv
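
The `_code` columns are only useful if the codes can be turned back into names later, for example to report a predicted disease. A minimal sketch of keeping that mapping is shown below, using made-up disease labels; the variable names are illustrative.

```python
import pandas as pd

# Made-up labels; not taken from the data folder.
diseases = pd.Series(["malaria", "allergy", "malaria", "gerd"])

as_category = diseases.astype("category")
codes = as_category.cat.codes

# Each code is the position of its label in the sorted category list.
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label)

# Map codes back to the original labels.
decoded = codes.map(code_to_label)
print(decoded.tolist())
```

Because the codes depend on which values are present, the same mapping should be saved and reused at prediction time rather than recomputed on new data.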


# Feature Selection and Model Building

Now that the data is cleaned and every text column has a numeric `_code` counterpart, the next step is to select the most relevant features (columns) for predicting diseases. After selecting features, we will build a simple machine learning model to predict diseases based on symptoms.
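
One way to set this up is to separate the symptom columns from the label and hold out some rows for evaluation. The sketch below uses a made-up miniature of `new_p.csv`. It assumes the symptom columns are 0/1 indicators and that `gender`, `disease`, `precaution`, `doctor_type`, and `total_symptoms` are the non-symptom columns; both assumptions should be verified against the real file.

```python
import pandas as pd

# Hypothetical miniature of new_p.csv; all values are made up.
toy = pd.DataFrame({
    "abdominal_pain": [1, 0, 0, 1, 0, 1],
    "acidity":        [0, 1, 0, 0, 1, 0],
    "cough":          [0, 0, 1, 0, 0, 1],
    "gender":         ["f", "m", "f", "m", "f", "m"],
    "disease":        ["gerd", "gerd", "flu", "gerd", "gerd", "flu"],
    "total_symptoms": [1, 1, 1, 1, 1, 2],
})

# Columns describing the patient or the outcome rather than a symptom (assumed list).
non_symptom_cols = ["gender", "disease", "precaution", "doctor_type", "total_symptoms"]
feature_cols = [c for c in toy.columns if c not in non_symptom_cols]

X = toy[feature_cols]                           # symptom indicators
y = toy["disease"].astype("category").cat.codes  # integer label per row

# Hold out two rows for evaluation; fixed seed for repeatability.
test_idx = X.sample(n=2, random_state=0).index
X_train, X_test = X.drop(test_idx), X.loc[test_idx]
y_train, y_test = y.drop(test_idx), y.loc[test_idx]
print(X_train.shape, X_test.shape)
```

Any classifier can then be fit on `X_train` and `y_train` and scored on the held-out rows.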