
In [67]:

# Step 1: Import libraries and load dataset

import pandas as pd

# Load dataset
DF = pd.read_csv("MEDICAL_APPOINMENT_NO_SHOW.csv")

print("✅ Dataset loaded successfully!\n")
print("Initial shape\n", DF.shape)
print("\nPreview of dataset\n\n", DF.head(10))


✅ Dataset loaded successfully!

Initial shape
 (110527, 14)

Preview of dataset

       PatientId  AppointmentID Gender          ScheduledDay  \
0  2.987250e+13        5642903      F  2016-04-29T18:38:08Z   
1  5.589978e+14        5642503      M  2016-04-29T16:08:27Z   
2  4.262962e+12        5642549      F  2016-04-29T16:19:04Z   
3  8.679512e+11        5642828      F  2016-04-29T17:29:31Z   
4  8.841186e+12        5642494      F  2016-04-29T16:07:23Z   
5  9.598513e+13        5626772      F  2016-04-27T08:36:51Z   
6  7.336882e+14        5630279      F  2016-04-27T15:05:12Z   
7  3.449833e+12        5630575      F  2016-04-27T15:39:58Z   
8  5.639473e+13        5638447      F  2016-04-29T08:02:16Z   
9  7.812456e+13        5629123      F  2016-04-27T12:48:25Z   

         AppointmentDay  Age      Neighbourhood  Scholarship  Hipertension  \
0  2016-04-29T00:00:00Z   62    JARDIM DA PENHA            0             1   
1  2016-04-29T00:00:00Z   56    JARDIM DA PENHA            0        
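
The preview above shows `PatientId` loaded as a float and printed in scientific notation. If the identifiers need to be kept exactly as written in the file, one option is to read them as text. The sketch below is a minimal illustration on two made-up rows (the values and the `ids_as_text` name are assumptions, not taken from the real CSV):

```python
import io
import pandas as pd

# Two made-up rows laid out like the first columns of the dataset.
sample_csv = io.StringIO(
    "PatientId,AppointmentID,Gender\n"
    "29872499824296,5642903,F\n"
    "558997776694438,5642503,M\n"
)

# dtype makes pandas read both identifier columns as text, not numbers.
ids_as_text = pd.read_csv(
    sample_csv, dtype={"PatientId": "string", "AppointmentID": "string"}
)

print(ids_as_text.dtypes)
print(ids_as_text["PatientId"].tolist())
```

Whether this is worth doing depends on how the identifiers are used later; for pure counting or grouping, the default numeric dtype may be enough.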

In [68]:

# Step 2: Identify and handle missing values

print("Identify and handle missing values")
print("\nMissing values before cleaning:\n", DF.isnull().sum())

# Drop missing rows in place
DF.dropna(inplace=True)

print("\n✅ Missing values after cleaning:\n", DF.isnull().sum())


Identify and handle missing values

Missing values before cleaning:
 PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64

✅ Missing values after cleaning:
 PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64
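
Both counts above are zero, so `dropna` removed nothing in this run. When a dataset does contain gaps, it can help to record how many rows were dropped. A minimal sketch on a toy frame (the `toy` data and variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 lacks an age, row 2 lacks a gender.
toy = pd.DataFrame({"age": [62, np.nan, 8], "gender": ["F", "M", None]})

rows_before = len(toy)
cleaned = toy.dropna()  # returns a new frame; toy itself is unchanged
rows_dropped = rows_before - len(cleaned)

print(f"Rows dropped: {rows_dropped} of {rows_before}")
```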


In [69]:

# Step 3: Remove duplicate records

print("Remove Duplicate Records")

duplicates_before = DF.duplicated().sum()
print(f"\nDuplicates before cleaning:\n {duplicates_before}")

DF.drop_duplicates(inplace=True)

duplicates_after = DF.duplicated().sum()
print(f"\n✅ Duplicates after cleaning:\n {duplicates_after}")


Remove Duplicate Records

Duplicates before cleaning:
 0

✅ Duplicates after cleaning:
 0
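
`DF.duplicated()` with no arguments only flags rows that match on every column. If a single column is expected to be unique (for example an appointment identifier, which is an assumption about the data rather than something verified here), the `subset` argument checks that column alone. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up rows: AppointmentID 2 appears twice with different ages.
toy = pd.DataFrame({"AppointmentID": [1, 2, 2], "Age": [30, 41, 42]})

# Full-row comparison: no two rows are identical in every column.
full_row_dupes = int(toy.duplicated().sum())

# Key-only comparison: the second occurrence of AppointmentID 2 is flagged.
key_dupes = int(toy.duplicated(subset=["AppointmentID"]).sum())

print(full_row_dupes, key_dupes)
```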


In [70]:

# Step 4: Rename columns for consistency

print("Rename columns for consistency")

DF.columns = [col.strip().lower().replace(" ", "_") for col in DF.columns]

print("\n✅ Standardized column names:\n", DF.columns.tolist())


Rename columns for consistency

✅ Standardized column names:
 ['patientid', 'appointmentid', 'gender', 'scheduledday', 'appointmentday', 'age', 'neighbourhood', 'scholarship', 'hipertension', 'diabetes', 'alcoholism', 'handcap', 'sms_received', 'no-show']
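
Note that `no-show` keeps its hyphen, because the list comprehension only replaces spaces. If hyphens should also become underscores, a small helper along these lines could be used (`to_snake` is a hypothetical name, and this is only one possible convention):

```python
import re

def to_snake(name: str) -> str:
    """Lowercase a column name and turn runs of spaces or hyphens into '_'."""
    return re.sub(r"[\s\-]+", "_", name.strip().lower())

original = ["PatientId", "SMS_received", "No-show"]
renamed = [to_snake(col) for col in original]
print(renamed)
```

With such a helper the target column would be named `no_show`, which is valid as a Python attribute name.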


In [71]:
# Step 5: Standardize text values (like gender, yes/no)

print("Standardize text values (like gender, yes/no)")

for col in DF.select_dtypes(include="object").columns:
    DF[col] = DF[col].astype(str).str.strip().str.lower()

# Expand the abbreviations with a nested mapping: {column: {old_value: new_value}}
if "gender" in DF.columns:
    DF.replace({"gender": {"f": "female", "m": "male"}}, inplace=True)

if "no-show" in DF.columns or "no_show" in DF.columns:
    colname = "no-show" if "no-show" in DF.columns else "no_show"
    DF.replace({colname: {"no": "showed_up", "yes": "no_show"}}, inplace=True)

print("\n✅ Text values standardized successfully!")


Standardize text values (like gender, yes/no)

✅ Text values standardized successfully!
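
`replace` leaves any value that is not in the mapping untouched, so unexpected codes pass through silently. One way to surface them is `Series.map`, which yields NaN for unmapped values. A minimal sketch with an invented stray code `"x"`:

```python
import pandas as pd

gender = pd.Series(["f", "m", "f", "x"])  # "x" is an invented stray value
labels = {"f": "female", "m": "male"}

mapped = gender.map(labels)  # values missing from labels become NaN
unmapped = gender[mapped.isna()].unique().tolist()

print("Unmapped codes:", unmapped)
```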


In [72]:
# Step 6: Convert date columns to consistent format

print("Convert date columns to consistent format")
date_cols = [col for col in DF.columns if "day" in col or "date" in col]

for col in date_cols:
    DF[col] = pd.to_datetime(DF[col], errors="coerce")

print("\n🕒 Date columns converted to datetime:", date_cols)


Convert date columns to consistent format

🕒 Date columns converted to datetime: ['scheduledday', 'appointmentday']
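
With `errors="coerce"`, any string that cannot be parsed becomes `NaT` without raising an error, so it is worth counting those afterwards. The sketch below uses made-up strings and also passes `utc=True` to keep the trailing `Z` as an explicit UTC timezone; both choices are assumptions about what is wanted, not part of the original cell:

```python
import pandas as pd

raw = pd.Series(["2016-04-29T18:38:08Z", "not a date", "2016-04-27T08:36:51Z"])

# Unparseable entries turn into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce", utc=True)
n_failed = int(parsed.isna().sum())

print(parsed.dtype)
print("Values coerced to NaT:", n_failed)
```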


In [73]:
# Step 7: Correct data types

print("Correct data types")

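if "age" in DF.columns:
    # Caution: fillna(0) turns any unparseable age into 0, which cannot be
    # told apart from a genuine age of 0 later on.
    DF["age"] = pd.to_numeric(DF["age"], errors="coerce").fillna(0).astype(int)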

print("\n✅ Data types corrected where necessary!")


Correct data types

✅ Data types corrected where necessary!
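
An alternative to filling with 0 is pandas' nullable `Int64` dtype, which keeps missing ages as `<NA>` while the rest stay integers. The sketch below runs on invented values and also counts negative ages as a basic sanity check; treating negatives as suspicious is an assumption:

```python
import pandas as pd

age_raw = pd.Series(["62", "-1", "unknown", "8"])  # invented values

# "unknown" becomes NaN, then <NA> once cast to the nullable integer dtype.
age = pd.to_numeric(age_raw, errors="coerce").astype("Int64")

n_missing = int(age.isna().sum())
n_negative = int((age < 0).sum())  # <NA> comparisons are skipped by sum()

print(age.dtype, n_missing, n_negative)
```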


In [74]:
# Step 8: Final inspection of cleaned dataset

print("Final inspection of cleaned dataset")
print("\n📊 Dataset information after cleaning:\n")
DF.info()  # info() prints its own report and returns None, so print() is not needed
print("\nFinal shape:", DF.shape)


Final inspection of cleaned dataset

📊 Dataset information after cleaning:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   patientid       110527 non-null  float64       
 1   appointmentid   110527 non-null  int64         
 2   gender          110527 non-null  object        
 3   scheduledday    110527 non-null  datetime64[ns]
 4   appointmentday  110527 non-null  datetime64[ns]
 5   age             110527 non-null  int64         
 6   neighbourhood   110527 non-null  object        
 7   scholarship     110527 non-null  int64         
 8   hipertension    110527 non-null  int64         
 9   diabetes        110527 non-null  int64         
 10  alcoholism      110527 non-null  int64         
 11  handcap         110527 non-null  int64         
 12  sms_received    110527 non-null  int64         
 13  no-show      

In [75]:

# Step 9: Save cleaned dataset

print("Save cleaned dataset")
output_path = "medical_appointment_cleaned.csv"
DF.to_csv(output_path, index=False)

print(f"\n✅ Cleaned dataset saved successfully as: {output_path}")


Save cleaned dataset

✅ Cleaned dataset saved successfully as: medical_appointment_cleaned.csv


In [76]:

# Step 10: Summary of cleaning changes

print("Summary of cleaning changes")
summary = {
    "Missing Values Removed": True,
    "Duplicates Removed": True,
    "Columns Renamed": True,
    "Text Standardized": True,
    "Date Columns Converted": date_cols,
    "Numeric Data Fixed": True,
    "Final Shape": DF.shape
}

print("\n📄 Cleaning Summary:\n")
for key, value in summary.items():
    print(f"{key}: {value}")

print("\n🎯 Data Cleaning & Preprocessing Completed Successfully!")


Summary of cleaning changes

📄 Cleaning Summary:

Missing Values Removed: True
Duplicates Removed: True
Columns Renamed: True
Text Standardized: True
Date Columns Converted: ['scheduledday', 'appointmentday']
Numeric Data Fixed: True
Final Shape: (110527, 14)

🎯 Data Cleaning & Preprocessing Completed Successfully!
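
The summary above uses hard-coded `True` flags, which would read the same even if a step removed nothing. A possible variant computes the figures from the data instead. This is a sketch on a toy frame; `cleaning_summary` and its keys are hypothetical:

```python
import pandas as pd

def cleaning_summary(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Report measured differences between the raw and the cleaned frame."""
    return {
        "Rows Removed": len(before) - len(after),
        "Missing Values Remaining": int(after.isna().sum().sum()),
        "Duplicates Remaining": int(after.duplicated().sum()),
        "Final Shape": after.shape,
    }

# Toy data: one row with a missing age and one exact duplicate row.
before = pd.DataFrame({"age": [62, None, 8, 8], "gender": ["f", "m", "f", "f"]})
after = before.dropna().drop_duplicates()

toy_summary = cleaning_summary(before, after)
for key, value in toy_summary.items():
    print(f"{key}: {value}")
```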
