Problem statement:
Patient Arrival and Check-In:
Case Study: 
•	Upon arrival, patients check in using an AI-powered kiosk or mobile app. 
•	The system uses data input to determine the purpose of visit (appointment, emergency, diagnostics).
•	 Patients receive a digital token or notification with estimated wait time and location guidance.


The features in the dataset are as follows:

1. **Patient ID**: 
   - Description: An identifier for each patient.
   - Non-Null Count: 5000 (indicating there are 5000 non-null values)
   - Dtype: int64 (indicating it is of integer type)

2. **Age**:
   - Description: Age of the patient.
   - Non-Null Count: 5000
   - Dtype: int64

3. **Gender**:
   - Description: Gender of the patient.
   - Non-Null Count: 5000
   - Dtype: object (typically indicating categorical data)

4. **Arrival Time**:
   - Description: Time of arrival of the patient.
   - Non-Null Count: 5000
   - Dtype: float64 (possibly indicating a continuous numerical value)

5. **Check-In Method**:
   - Description: Method used by the patient for check-in.
   - Non-Null Count: 5000
   - Dtype: object

6. **Purpose of Visit**:
   - Description: Purpose for which the patient visited (e.g., Routine Check-up, Consultation, Treatment, Emergency).
   - Non-Null Count: 5000
   - Dtype: object

7. **Medical History**:
   - Description: Medical history of the patient.
   - Non-Null Count: 3968 (indicating there are some missing values)
   - Dtype: object

8. **Estimated Wait Time**:
   - Description: Estimated wait time for the patient.
   - Non-Null Count: 5000
   - Dtype: object (it might be expected to be numerical, but it's stored as object indicating further exploration is needed)

9. **Location Guidance**:
   - Description: Guidance provided to the patient regarding the location.
   - Non-Null Count: 5000
   - Dtype: object

10. **Reason for Visit**:
    - Description: Reason for the patient's visit (e.g., Symptoms, Follow-up, Prescription Refill, General Inquiry).
    - Non-Null Count: 5000
    - Dtype: object

These features provide information about the patients, their characteristics, and the details of their visit to a healthcare facility. The "Non-Null Count" indicates the number of non-null values for each feature, while the "Dtype" indicates the data type of each feature. It's essential to understand these features' characteristics to perform data analysis, visualization, and modeling effectively. Additionally, handling missing values and ensuring appropriate data types are crucial preprocessing steps before further analysis.

In [1]:
#Importing libraries
import pandas as pd
import numpy as np
import random
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.offline as py 
py.init_notebook_mode(connected=True)                  
import plotly.graph_objs as go                         
import plotly.tools as tls                             
from collections import Counter                        
import plotly.figure_factory as ff
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV                                         # to split the data
from sklearn.metrics import mean_squared_error, roc_auc_score, roc_curve, r2_score, accuracy_score, confusion_matrix, classification_report, fbeta_score     # to evaluate our model
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from catboost import CatBoostClassifier
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore")

pd.pandas.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)


In [2]:
#Loading dataset
df = pd.read_csv(r"C:\Users\Bhimesh\Downloads\SyntheticData.csv")

In [3]:
#Information about dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Patient ID           5000 non-null   int64  
 1   Age                  5000 non-null   int64  
 2   Gender               5000 non-null   object 
 3   Arrival Time         5000 non-null   float64
 4   Check-In Method      5000 non-null   object 
 5   Purpose of Visit     5000 non-null   object 
 6   Medical History      3968 non-null   object 
 7   Estimated Wait Time  5000 non-null   object 
 8   Location Guidance    5000 non-null   object 
 9   Reason for Visit     5000 non-null   object 
dtypes: float64(1), int64(2), object(7)
memory usage: 390.8+ KB


In [4]:
#Shape of the dataset
df.shape

(5000, 10)

The shape of the dataset (5000, 10) indicates that it contains 5000 rows and 10 columns.



In [5]:
df.describe()

Unnamed: 0,Patient ID,Age,Arrival Time
count,5000.0,5000.0,5000.0
mean,5.358,48.125,45398.451
std,2.652,10.748,0.056
min,1.0,28.0,45398.354
25%,3.0,39.0,45398.402
50%,6.0,49.0,45398.469
75%,7.0,55.0,45398.494
max,10.0,68.0,45398.562


In [None]:
Here are some insights about the columns Patient ID, Age, and Arrival Time based on the provided statistics:

**Patient ID:**
- Count: 5000
- Mean: 5.358
- Standard Deviation (std): 2.652
- Minimum: 1
- 25th Percentile (25%): 3
- Median (50%): 6
- 75th Percentile (75%): 7
- Maximum: 10

These statistics suggest that:
- There are 5000 unique patient IDs in the dataset.
- The mean patient ID is approximately 5.358, with a standard deviation of 2.652, indicating some variation in the distribution of patient IDs.
- The range of patient IDs is from 1 to 10.
- The median patient ID is 6, which means that 50% of the patient IDs are less than or equal to 6.
- The interquartile range (IQR), which is the difference between the 75th and 25th percentiles, is 4, indicating the middle 50% of patient IDs fall within this range.

**Age:**
- Count: 5000
- Mean: 48.125
- Standard Deviation (std): 10.748
- Minimum: 28
- 25th Percentile (25%): 39
- Median (50%): 49
- 75th Percentile (75%): 55
- Maximum: 68

These statistics suggest that:
- There are 5000 age values in the dataset.
- The mean age is approximately 48.125, with a standard deviation of 10.748, indicating some variability in the ages of patients.
- The age range is from 28 to 68 years.
- The median age is 49, which means that 50% of the ages are less than or equal to 49.
- The interquartile range (IQR) is 16, indicating the middle 50% of ages fall within this range.

**Arrival Time:**
- Count: 5000
- Mean: 45398.451
- Standard Deviation (std): 0.056
- Minimum: 45398.354
- 25th Percentile (25%): 45398.402
- Median (50%): 45398.469
- 75th Percentile (75%): 45398.494
- Maximum: 45398.562

These statistics suggest that:
- There are 5000 arrival time values in the dataset.
- The mean arrival time is approximately 45398.451, with a very small standard deviation of 0.056, indicating minimal variability in the arrival times.
- The arrival time range is from 45398.354 to 45398.562.
- The median arrival time is 45398.469, indicating that 50% of the arrival times are less than or equal to this value.
- The interquartile range (IQR) is very small, indicating that most arrival times are clustered closely together.

Overall, these insights provide a summary of the distribution and variability of the Patient ID, Age, and Arrival Time columns in the dataset.

In [6]:
#Checking null values
df.isnull().sum()

Patient ID                0
Age                       0
Gender                    0
Arrival Time              0
Check-In Method           0
Purpose of Visit          0
Medical History        1032
Estimated Wait Time       0
Location Guidance         0
Reason for Visit          0
dtype: int64

Based on the provided information, here are some insights about the columns:

1. **Patient ID, Age, Gender, Arrival Time, Check-In Method, Purpose of Visit, Estimated Wait Time, Location Guidance, Reason for Visit**:
   - These columns have no missing values (counted as 0). This indicates that all 5000 rows in the dataset have values for these columns.
   - For Patient ID, Age, Arrival Time, Estimated Wait Time, and other categorical columns (Gender, Check-In Method, Purpose of Visit, Location Guidance, Reason for Visit), there are no missing values. This suggests that data collection for these features is complete and consistent across all samples.

2. **Medical History**:
   - This column has 1032 missing values.
   - Missing values in the Medical History column may indicate that not all patients have a recorded medical history in the dataset.
   - It's essential to handle missing values appropriately before further analysis or modeling, depending on the specific requirements of the analysis. Options for handling missing values include imputation (e.g., filling missing values with a default value or the mean/median of the column), deletion of rows or columns with missing values, or using advanced techniques such as predictive modeling to impute missing values.

Overall, the insights provided highlight the presence of missing values in the Medical History column, while other columns in the dataset have complete data. Handling missing values in the Medical History column is crucial to ensure accurate analysis and modeling results.

In [7]:
#The code df['Reason for Visit'].value_counts() is used to compute the frequency of each unique value in the "Reason for Visit" column of the DataFrame df.

df['Reason for Visit'].value_counts()


Reason for Visit
Follow-up on diabetes management       783
Physical examination for employment    782
Cholesterol level check                758
Psychological counseling               570
Joint pain management                  492
Cough and shortness of breath          404
Skin rash and itching                  348
Flu-like symptoms                      325
Cardiac rehabilitation                 318
Annual checkup                         220
Name: count, dtype: int64

In [8]:
#The code df['Purpose of Visit'].value_counts() is used to compute the frequency of each unique value in the "Purpose of Visit" column of the DataFrame df.

df['Purpose of Visit'].value_counts()


Purpose of Visit
Appointment    2331
Diagnostics    1769
Emergency       900
Name: count, dtype: int64

Based on the provided insights for the "Purpose of Visit" column:

1. **Appointment**: 
   - Count: 2331
   - This category indicates that a significant portion of visits are for scheduled appointments. These appointments could be for routine check-ups, follow-up visits, consultations, or other planned medical procedures.
   - The relatively high count suggests that a considerable number of patients make appointments for healthcare services, which could indicate proactive healthcare management or regular monitoring of health conditions.

2. **Diagnostics**:
   - Count: 1769
   - This category suggests that a substantial number of visits are for diagnostic purposes. These visits could involve tests, screenings, or evaluations to diagnose medical conditions, monitor disease progression, or assess overall health.
   - The count indicates that diagnostics play an essential role in patient care, highlighting the importance of accurate diagnosis and medical testing in healthcare delivery.

3. **Emergency**:
   - Count: 900
   - This category represents visits related to emergencies or urgent medical needs. Patients visiting the healthcare facility under this category likely require immediate medical attention due to acute illnesses, injuries, or exacerbation of existing medical conditions.
   - The count of emergency visits underscores the critical role of emergency healthcare services in providing timely and life-saving interventions to patients in urgent situations.

Overall, the insights provide a breakdown of the different purposes for patient visits, including scheduled appointments, diagnostic procedures, and emergency care. Understanding the distribution of visit purposes can help healthcare providers allocate resources effectively, prioritize patient needs, and tailor healthcare services to meet patient demands and expectations.

In [9]:
# Replace null values in a specific categorical column with the mode
df['Medical History'] = df['Medical History'].fillna(df['Medical History'].mode()[0])


In [10]:
#Again checking null values:
df.isnull().sum()

Patient ID             0
Age                    0
Gender                 0
Arrival Time           0
Check-In Method        0
Purpose of Visit       0
Medical History        0
Estimated Wait Time    0
Location Guidance      0
Reason for Visit       0
dtype: int64

There are no values in the dataset

In [11]:
# Step 2: Preprocess the data
# Encoding categorical variables using one-hot encoding
df_encoded = pd.get_dummies(df, columns=['Check-In Method', 'Location Guidance','Gender','Medical History','Reason for Visit'])


In [12]:
df_encoded

Unnamed: 0,Patient ID,Age,Arrival Time,Purpose of Visit,Estimated Wait Time,Check-In Method_Kiosk,Check-In Method_Mobile App,Location Guidance_ER Department,Location Guidance_Lab 1,Location Guidance_Lab 2,Location Guidance_Room 101,Location Guidance_Room 103,Location Guidance_Room 105,Location Guidance_Room 107,Location Guidance_Room 109,Gender_Female,Gender_Male,Medical History_Allergies,Medical History_Arthritis,Medical History_Asthma,Medical History_Depression,Medical History_Heart attack,Medical History_High cholesterol,Medical History_Hypertension,Medical History_Type 2 Diabetes,Reason for Visit_Annual checkup,Reason for Visit_Cardiac rehabilitation,Reason for Visit_Cholesterol level check,Reason for Visit_Cough and shortness of breath,Reason for Visit_Flu-like symptoms,Reason for Visit_Follow-up on diabetes management,Reason for Visit_Joint pain management,Reason for Visit_Physical examination for employment,Reason for Visit_Psychological counseling,Reason for Visit_Skin rash and itching
0,6,29,45398.413,Appointment,15 minutes,False,True,True,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
1,3,43,45398.544,Diagnostics,25 minutes,False,True,False,False,False,True,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
2,7,33,45398.493,Appointment,25 minutes,False,True,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False
3,6,55,45398.407,Diagnostics,20 minutes,False,True,False,False,False,True,False,False,False,False,False,True,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
4,8,60,45398.473,Appointment,30 minutes,False,True,False,False,False,False,False,True,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,6,50,45398.437,Appointment,30 minutes,True,False,False,True,False,False,False,False,False,False,False,True,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
4996,6,50,45398.376,Emergency,20 minutes,True,False,True,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
4997,1,67,45398.410,Appointment,25 minutes,False,True,True,False,False,False,False,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
4998,2,39,45398.477,Emergency,15 minutes,True,False,True,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False


In [13]:
mf = df_encoded

In [14]:
mf.columns

Index(['Patient ID', 'Age', 'Arrival Time', 'Purpose of Visit',
       'Estimated Wait Time', 'Check-In Method_Kiosk',
       'Check-In Method_Mobile App', 'Location Guidance_ER Department',
       'Location Guidance_Lab 1', 'Location Guidance_Lab 2',
       'Location Guidance_Room 101', 'Location Guidance_Room 103',
       'Location Guidance_Room 105', 'Location Guidance_Room 107',
       'Location Guidance_Room 109', 'Gender_Female', 'Gender_Male',
       'Medical History_Allergies', 'Medical History_Arthritis',
       'Medical History_Asthma', 'Medical History_Depression',
       'Medical History_Heart attack', 'Medical History_High cholesterol',
       'Medical History_Hypertension', 'Medical History_Type 2 Diabetes',
       'Reason for Visit_Annual checkup',
       'Reason for Visit_Cardiac rehabilitation',
       'Reason for Visit_Cholesterol level check',
       'Reason for Visit_Cough and shortness of breath',
       'Reason for Visit_Flu-like symptoms',
       'Reason for Visi

In [15]:
mf.drop(columns=['Patient ID'], inplace=True)


In [16]:
mf.columns


Index(['Age', 'Arrival Time', 'Purpose of Visit', 'Estimated Wait Time',
       'Check-In Method_Kiosk', 'Check-In Method_Mobile App',
       'Location Guidance_ER Department', 'Location Guidance_Lab 1',
       'Location Guidance_Lab 2', 'Location Guidance_Room 101',
       'Location Guidance_Room 103', 'Location Guidance_Room 105',
       'Location Guidance_Room 107', 'Location Guidance_Room 109',
       'Gender_Female', 'Gender_Male', 'Medical History_Allergies',
       'Medical History_Arthritis', 'Medical History_Asthma',
       'Medical History_Depression', 'Medical History_Heart attack',
       'Medical History_High cholesterol', 'Medical History_Hypertension',
       'Medical History_Type 2 Diabetes', 'Reason for Visit_Annual checkup',
       'Reason for Visit_Cardiac rehabilitation',
       'Reason for Visit_Cholesterol level check',
       'Reason for Visit_Cough and shortness of breath',
       'Reason for Visit_Flu-like symptoms',
       'Reason for Visit_Follow-up on diabet

In [17]:
mf

Unnamed: 0,Age,Arrival Time,Purpose of Visit,Estimated Wait Time,Check-In Method_Kiosk,Check-In Method_Mobile App,Location Guidance_ER Department,Location Guidance_Lab 1,Location Guidance_Lab 2,Location Guidance_Room 101,Location Guidance_Room 103,Location Guidance_Room 105,Location Guidance_Room 107,Location Guidance_Room 109,Gender_Female,Gender_Male,Medical History_Allergies,Medical History_Arthritis,Medical History_Asthma,Medical History_Depression,Medical History_Heart attack,Medical History_High cholesterol,Medical History_Hypertension,Medical History_Type 2 Diabetes,Reason for Visit_Annual checkup,Reason for Visit_Cardiac rehabilitation,Reason for Visit_Cholesterol level check,Reason for Visit_Cough and shortness of breath,Reason for Visit_Flu-like symptoms,Reason for Visit_Follow-up on diabetes management,Reason for Visit_Joint pain management,Reason for Visit_Physical examination for employment,Reason for Visit_Psychological counseling,Reason for Visit_Skin rash and itching
0,29,45398.413,Appointment,15 minutes,False,True,True,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
1,43,45398.544,Diagnostics,25 minutes,False,True,False,False,False,True,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
2,33,45398.493,Appointment,25 minutes,False,True,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False
3,55,45398.407,Diagnostics,20 minutes,False,True,False,False,False,True,False,False,False,False,False,True,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
4,60,45398.473,Appointment,30 minutes,False,True,False,False,False,False,False,True,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,50,45398.437,Appointment,30 minutes,True,False,False,True,False,False,False,False,False,False,False,True,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
4996,50,45398.376,Emergency,20 minutes,True,False,True,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
4997,67,45398.410,Appointment,25 minutes,False,True,True,False,False,False,False,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
4998,39,45398.477,Emergency,15 minutes,True,False,True,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False


In [18]:
mf.shape

(5000, 34)

In [19]:
string_to_remove = 'minutes'
mf['Estimated Wait Time'] = mf['Estimated Wait Time'].str.replace(string_to_remove, '')

In [20]:
from sklearn.preprocessing import StandardScaler
s_sc = StandardScaler()
col_to_scale =['Age','Estimated Wait Time']
mf[col_to_scale] = s_sc.fit_transform(mf[col_to_scale])

In [21]:
mf.head()

Unnamed: 0,Age,Arrival Time,Purpose of Visit,Estimated Wait Time,Check-In Method_Kiosk,Check-In Method_Mobile App,Location Guidance_ER Department,Location Guidance_Lab 1,Location Guidance_Lab 2,Location Guidance_Room 101,Location Guidance_Room 103,Location Guidance_Room 105,Location Guidance_Room 107,Location Guidance_Room 109,Gender_Female,Gender_Male,Medical History_Allergies,Medical History_Arthritis,Medical History_Asthma,Medical History_Depression,Medical History_Heart attack,Medical History_High cholesterol,Medical History_Hypertension,Medical History_Type 2 Diabetes,Reason for Visit_Annual checkup,Reason for Visit_Cardiac rehabilitation,Reason for Visit_Cholesterol level check,Reason for Visit_Cough and shortness of breath,Reason for Visit_Flu-like symptoms,Reason for Visit_Follow-up on diabetes management,Reason for Visit_Joint pain management,Reason for Visit_Physical examination for employment,Reason for Visit_Psychological counseling,Reason for Visit_Skin rash and itching
0,-1.78,45398.413,Appointment,-0.807,False,True,True,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True
1,-0.477,45398.544,Diagnostics,0.213,False,True,False,False,False,True,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
2,-1.407,45398.493,Appointment,0.213,False,True,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False
3,0.64,45398.407,Diagnostics,-0.297,False,True,False,False,False,True,False,False,False,False,False,True,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
4,1.105,45398.473,Appointment,0.723,False,True,False,False,False,False,False,True,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False


In [22]:
import imblearn
from imblearn.over_sampling import SMOTE

X = mf.drop('Purpose of Visit', axis=1)
y = mf[['Purpose of Visit']]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)


In [23]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    elif train==False:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")


In [24]:
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(X_train, y_train)

print_score(lr_clf, X_train, y_train, X_test, y_test, train=True)
print_score(lr_clf, X_train, y_train, X_test, y_test, train=False)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
print_score(lr_clf, X_train_resampled, y_train_resampled, X_test, y_test, train=True)
print_score(lr_clf, X_train_resampled, y_train_resampled, X_test, y_test, train=False)


Train Result:
Accuracy Score: 65.89%
_______________________________________________
CLASSIFICATION REPORT:
           Appointment  Diagnostics  Emergency  accuracy  macro avg  \
precision        0.737        0.555      0.000     0.659      0.431   
recall           0.890        0.678      0.000     0.659      0.523   
f1-score         0.807        0.610      0.000     0.659      0.472   
support       1652.000     1232.000    616.000     0.659   3500.000   

           weighted avg  
precision         0.543  
recall            0.659  
f1-score          0.596  
support        3500.000  
_______________________________________________
Confusion Matrix: 
 [[1471  181    0]
 [ 397  835    0]
 [ 127  489    0]]

Test Result:
Accuracy Score: 63.87%
_______________________________________________
CLASSIFICATION REPORT:
           Appointment  Diagnostics  Emergency  accuracy  macro avg  \
precision        0.723        0.533      0.000     0.639      0.419   
recall           0.888        0.6

In [25]:
test_score = accuracy_score(y_test, lr_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, lr_clf.predict(X_train)) * 100

results_df = pd.DataFrame(data=[["Logistic Regression", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results_df

Unnamed: 0,Model,Training Accuracy %,Testing Accuracy %
0,Logistic Regression,65.886,63.867


In [26]:
from sklearn.svm import SVC


svm_clf = SVC(kernel='rbf', gamma=0.1, C=1.0)
svm_clf.fit(X_train, y_train)

print_score(svm_clf, X_train, y_train, X_test, y_test, train=True)
print_score(svm_clf, X_train, y_train, X_test, y_test, train=False)


Train Result:
Accuracy Score: 74.74%
_______________________________________________
CLASSIFICATION REPORT:
           Appointment  Diagnostics  Emergency  accuracy  macro avg  \
precision        0.767        0.758      0.666     0.747      0.730   
recall           0.876        0.630      0.638     0.747      0.715   
f1-score         0.818        0.688      0.652     0.747      0.719   
support       1652.000     1232.000    616.000     0.747   3500.000   

           weighted avg  
precision         0.746  
recall            0.747  
f1-score          0.743  
support        3500.000  
_______________________________________________
Confusion Matrix: 
 [[1447  128   77]
 [ 336  776  120]
 [ 103  120  393]]

Test Result:
Accuracy Score: 74.40%
_______________________________________________
CLASSIFICATION REPORT:
           Appointment  Diagnostics  Emergency  accuracy  macro avg  \
precision        0.745        0.743      0.743     0.744      0.744   
recall           0.873        0.6

In [27]:
print_score(svm_clf, X_train, y_train, X_test, y_test, train=True)
print_score(svm_clf, X_train, y_train, X_test, y_test, train=False)
print_score(svm_clf, X_train_resampled, y_train_resampled, X_test, y_test, train=True)
print_score(svm_clf, X_train_resampled, y_train_resampled, X_test, y_test, train=False)

results_df_2 = pd.DataFrame(data=[["Support Vector Machine", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results_df_2

Train Result:
Accuracy Score: 74.74%
_______________________________________________
CLASSIFICATION REPORT:
           Appointment  Diagnostics  Emergency  accuracy  macro avg  \
precision        0.767        0.758      0.666     0.747      0.730   
recall           0.876        0.630      0.638     0.747      0.715   
f1-score         0.818        0.688      0.652     0.747      0.719   
support       1652.000     1232.000    616.000     0.747   3500.000   

           weighted avg  
precision         0.746  
recall            0.747  
f1-score          0.743  
support        3500.000  
_______________________________________________
Confusion Matrix: 
 [[1447  128   77]
 [ 336  776  120]
 [ 103  120  393]]

Test Result:
Accuracy Score: 74.40%
_______________________________________________
CLASSIFICATION REPORT:
           Appointment  Diagnostics  Emergency  accuracy  macro avg  \
precision        0.745        0.743      0.743     0.744      0.744   
recall           0.873        0.6

Unnamed: 0,Model,Training Accuracy %,Testing Accuracy %
0,Support Vector Machine,65.886,63.867


In [28]:
from sklearn.tree import DecisionTreeClassifier


tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

print_score(tree_clf, X_train, y_train, X_test, y_test, train=True)
print_score(tree_clf, X_train, y_train, X_test, y_test, train=False)


Train Result:
Accuracy Score: 99.89%
_______________________________________________
CLASSIFICATION REPORT:
           Appointment  Diagnostics  Emergency  accuracy  macro avg  \
precision        0.998        0.999      1.000     0.999      0.999   
recall           1.000        0.999      0.995     0.999      0.998   
f1-score         0.999        0.999      0.998     0.999      0.999   
support       1652.000     1232.000    616.000     0.999   3500.000   

           weighted avg  
precision         0.999  
recall            0.999  
f1-score          0.999  
support        3500.000  
_______________________________________________
Confusion Matrix: 
 [[1652    0    0]
 [   1 1231    0]
 [   2    1  613]]

Test Result:
Accuracy Score: 62.27%
_______________________________________________
CLASSIFICATION REPORT:
           Appointment  Diagnostics  Emergency  accuracy  macro avg  \
precision        0.701        0.573      0.517     0.623      0.597   
recall           0.719        0.5

In [29]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf_clf = RandomForestClassifier(n_estimators=300, random_state=42)
rf_clf.fit(X_train, y_train)

print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)
print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)


Train Result:
Accuracy Score: 99.89%
_______________________________________________
CLASSIFICATION REPORT:
           Appointment  Diagnostics  Emergency  accuracy  macro avg  \
precision        0.998        1.000      0.998     0.999      0.999   
recall           1.000        0.998      0.997     0.999      0.998   
f1-score         0.999        0.999      0.998     0.999      0.999   
support       1652.000     1232.000    616.000     0.999   3500.000   

           weighted avg  
precision         0.999  
recall            0.999  
f1-score          0.999  
support        3500.000  
_______________________________________________
Confusion Matrix: 
 [[1652    0    0]
 [   1 1230    1]
 [   2    0  614]]

Test Result:
Accuracy Score: 72.73%
_______________________________________________
CLASSIFICATION REPORT:
           Appointment  Diagnostics  Emergency  accuracy  macro avg  \
precision        0.736        0.714      0.725     0.727      0.725   
recall           0.854        0.6

In [30]:
test_score = accuracy_score(y_test, rf_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, rf_clf.predict(X_train)) * 100

results_df_2 = pd.DataFrame(data=[["Random Forest Classifier", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results_df_2


Unnamed: 0,Model,Training Accuracy %,Testing Accuracy %
0,Random Forest Classifier,99.886,72.733


In [31]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10],          # Regularization parameter
    'gamma': [0.01, 0.1, 1],    # Kernel coefficient for 'rbf' kernel
}


# Initialize SVM classifier
svm_clf = SVC(kernel='rbf')

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=svm_clf, param_grid=param_grid, cv=5)

# Perform grid search
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model
best_model_accuracy = best_model.score(X_test, y_test)

print("Best Parameters:", best_params)
print("Best Model Accuracy:", best_model_accuracy)


Best Parameters: {'C': 1, 'gamma': 0.1}
Best Model Accuracy: 0.744


In [32]:
# Assuming 'df' contains both features and target variable
X = mf.drop(columns=["Purpose of Visit"])  # Features
y = mf["Purpose of Visit"]  # Target variable

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Instantiate the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)  # You can choose the number of neighbors (K) here

# Train the KNN model
knn.fit(X_train_scaled, y_train)

# Make predictions on the test data
y_pred = knn.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.7


In [33]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit label encoder and transform labels
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)


In [34]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assuming 'df' contains both features and target variable
X = mf.drop(columns=["Purpose of Visit"])  # Features
y = mf["Purpose of Visit"]  # Target variable

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit label encoder and transform labels
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)


# Define the XGBoost classifier
xgb_classifier = xgb.XGBClassifier(objective='binary:logistic', random_state=42)

# Train the XGBoost model
xgb_classifier.fit(X_train, y_train_encoded)

# Make predictions on the test data
y_pred = xgb_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test_encoded, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.707


Conclusion:
1)Trained the model using various algorithms,Random forest has given highest training and testing accurcy
2)Did hyperparameter tuning for SVM,observed a little improvement
3)Applied SMOTE technique which reduces imbalance in the dataser