# Title: Building a Synthetic Medical Dataset for Health Condition Analysis Using Machine Learning

Abstract:
The increasing need for large, comprehensive datasets in health condition analysis often collides with stringent restrictions on accessing real medical data, especially for non-medical students. This report presents a methodology for constructing a synthetic dataset by integrating multiple publicly available datasets, enhancing them with additional variables through machine learning techniques, and generating a large-scale synthetic dataset using Conditional Tabular Generative Adversarial Networks (CTGAN). Our approach primarily utilized the Asthma Disease Dataset as the base, with the Classification of Coronary Artery Disease (CAD) Dataset providing supplementary cardiac variables. This document details the dataset integration, cleaning processes, model testing, and synthetic data generation.

# 1. Introduction

For this project, our objective was to create a synthetic dataset that models various health conditions and risk factors. However, our attempts to gain access to more comprehensive and detailed datasets from specialized sources were unsuccessful due to restrictions and delays in the approval process. To circumvent these challenges, we turned to publicly available datasets on platforms such as Kaggle, understanding that these might not fully encompass all the variables required for our analysis.

In this context, we aimed to build a synthetic dataset that models asthma and its interaction with cardiac variables, among other health factors. To achieve this, we selected the Asthma Disease Dataset and the Classification of Coronary Artery Disease Dataset from Kaggle. We employed machine learning techniques to integrate and augment these datasets, followed by the application of CTGAN to generate a large-scale synthetic dataset.

# 2. Core Datasets and Integration

The foundation of our synthetic dataset was the Asthma Disease Dataset (https://www.kaggle.com/datasets/rabieelkharoua/asthma-disease-dataset), a detailed collection of data on asthma patients. We chose this dataset for its coverage of variables related to respiratory health, which aligned with our project’s objectives.

However, to enrich the dataset and include cardiac variables, we integrated data from the Classification of Coronary Artery Disease (CAD) Dataset (https://www.kaggle.com/datasets/saeedeheydarian/classification-of-coronary-artery-disease). This dataset provided critical variables related to coronary health, which are essential for understanding the interactions between asthma and cardiac conditions. The integration process involved the following steps:

* Variable Matching and Alignment: Ensuring that the variables from both datasets were compatible, with consistent data types and ranges.
* Data Augmentation: Using machine learning models to predict missing variables and fill gaps in the data.
* Data Merging: Combining the datasets into a unified structure that could be used as a basis for further analysis and synthetic data generation.

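
The matching, alignment, and merging steps can be sketched on toy data. The column names, values, and the gender coding below are illustrative assumptions, not the actual schemas of the two Kaggle datasets; the sketch only shows renaming to shared names, casting to consistent types, and stacking the tables with a flag that records each row's origin.

```python
import pandas as pd

# Hypothetical miniature versions of the two sources (not the real schemas).
asthma = pd.DataFrame({"Age": [34, 51], "Gender": [0, 1], "BMI": [22.4, 30.1]})
cad = pd.DataFrame({"age": ["63", "47"], "Sex": ["Male", "Female"], "bmi": [27.0, 24.5]})

# 1. Variable matching: map each source's column names onto shared names.
asthma = asthma.rename(columns={"Age": "age", "Gender": "gender", "BMI": "bmi"})
cad = cad.rename(columns={"Sex": "gender"})

# 2. Alignment: enforce consistent dtypes and encodings.
cad["age"] = cad["age"].astype(int)
cad["gender"] = cad["gender"].map({"Male": 0, "Female": 1})  # assumed coding

# 3. Merging: stack the aligned tables and keep track of each row's origin.
merged = pd.concat(
    [asthma.assign(source="asthma"), cad.assign(source="cad")],
    ignore_index=True,
)
print(merged)
```

Keeping a `source` column makes it possible to check later whether a pattern in the merged table comes from only one of the inputs.
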
# 3. Challenges in Data Integration and Augmentation

One of the major challenges in this project was the heterogeneity of the datasets. The Asthma and CAD datasets had different structures and variable distributions, making it difficult to merge them seamlessly. To address these issues, we tested several machine learning models to predict and align variables, ensuring that the final integrated dataset was coherent and comprehensive.

We also encountered missing data, mainly in the CAD dataset. To resolve this, we applied imputation techniques and used machine learning models to estimate missing values, especially for variables that are critical for predicting health outcomes, such as age, BMI, and blood pressure.
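
As a minimal sketch of model-based imputation, the example below trains a regressor on rows where a variable is known and predicts it where it is missing. The toy columns (`age`, `weight`, `height`, `bmi`) and the choice of `RandomForestRegressor` are assumptions made for illustration; the report does not state which model was used for this step.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200

# Toy data: BMI is derived from weight and height, then partly removed.
df = pd.DataFrame({
    "age": rng.integers(20, 80, n),
    "weight": rng.normal(75, 12, n),
    "height": rng.normal(170, 9, n),
})
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2
df.loc[rng.choice(n, 30, replace=False), "bmi"] = np.nan

predictors = ["age", "weight", "height"]
known = df["bmi"].notna()

# Fit on rows with an observed BMI, then fill the gaps with predictions.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(df.loc[known, predictors], df.loc[known, "bmi"])
df.loc[~known, "bmi"] = model.predict(df.loc[~known, predictors])

print(df["bmi"].isna().sum())
```
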

# 4. Addressing the Challenges of Pollen Allergy Data

In constructing a comprehensive synthetic dataset for asthma and related respiratory conditions, one of the major hurdles was the incorporation of specific pollen allergy data. Pollen allergies play a significant role in the exacerbation of asthma symptoms, and understanding the relationship between pollen exposure and allergic reactions is crucial for accurate modeling. However, obtaining datasets that not only capture the presence of specific pollen allergies (e.g., ragweed, grass, birch) but also detail the severity of these allergies and their interaction with varying pollen concentrations presented several challenges.

The primary difficulties were:

* Lack of Detailed Allergic Response Data: Publicly available datasets often include general information about pollen exposure but fall short of providing specific details on the allergic responses of individuals to different types of pollen. This includes data on the severity of the allergic reactions, which is essential for understanding how different concentrations of pollen influence the health outcomes of individuals with asthma.
* Inconsistent Measurement of Pollen Concentrations: While some datasets provide data on pollen counts, the variability in how these counts are measured and reported across different studies and regions made it challenging to integrate them into a cohesive model. Moreover, datasets that linked pollen concentrations directly to health outcomes were scarce, making it difficult to draw definitive conclusions about the impact of pollen exposure on allergy severity and asthma exacerbation.
* Lack of Granular Data Linking Pollen Types to Specific Allergic Reactions: Another significant gap was the absence of datasets that detailed how specific types of pollen (e.g., ragweed, grass, birch) correlated with particular allergic reactions. This type of data is crucial for accurately modeling the influence of different pollen types on asthma severity, yet it was not readily available in the datasets we initially considered.

To address these challenges, we drew on academic research and published data. We used estimates and models from the literature to simulate realistic pollen exposure levels, focusing on the severity of allergic responses to different pollen types. This included insights from studies published in allergy and immunology journals, which guided our assumptions for generating synthetic data. Our sources are:

* https://www.meduniwien.ac.at/web/en/ueber-uns/news/2022/news-im-august-2022/ragweed-allergie-herkunftsort-und-umwelt-beeinflussen-aggressivitaet-der-pollen/
* https://www.meduniwien.ac.at/web/en/about-us/news/detailsite/2018/news-jaenner-2018/first-vaccine-in-the-world-developed-against-grass-pollen-allergy/medicine-science/
* https://www.thermofisher.com/allergy/us/en/allergen-fact-sheets/mugwort.html (10-15% in EU)
* https://www.meduniwien.ac.at/web/en/about-us/news/detailsite/apple-allergy-symptoms-may-be-significantly-reduced-in-future/ (4.73%)
* https://www.thermofisher.com/phadia/wo/en/resources/allergen-encyclopedia/t2.html (6-10%)
* http://www.globalasthmareport.org/
* https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5806141/


## Pollen Exposure Index (PEI) and Pollen Weighting:
* The impact of various pollen types on allergic conditions, such as allergic rhinitis and asthma, has been well documented in the Allergic Rhinitis and its Impact on Asthma (ARIA) guidelines, developed in collaboration with the World Health Organization (WHO). These guidelines provide a framework for understanding how different pollen types affect individuals with allergies and asthma. The weights assigned to different pollen types in the PEI calculation reflect their relative impact based on these guidelines.
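
A weighted index of this kind can be sketched as a weighted sum over pollen types. The weights and concentrations below are hypothetical placeholders (this report does not list the values it used), and `pollen_exposure_index` is an illustrative helper name rather than a function from the project code.

```python
# Hypothetical weights per pollen type; replace with guideline-derived values.
POLLEN_WEIGHTS = {"ragweed": 0.4, "grass": 0.35, "birch": 0.25}


def pollen_exposure_index(concentrations, weights=POLLEN_WEIGHTS):
    """Weighted sum of pollen concentrations, one weight per pollen type.

    Pollen types missing from `concentrations` contribute zero.
    """
    return sum(weights[p] * concentrations.get(p, 0.0) for p in weights)


# Example reading (made-up concentrations in grains per cubic metre).
pei = pollen_exposure_index({"ragweed": 20.0, "grass": 50.0})
print(pei)
```
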
## Air Quality Impact Index (AQII):
* The calculation of the AQII is based on the air quality guidelines provided by the World Health Organization (WHO) and the United States Environmental Protection Agency (EPA). These organizations have set thresholds for various pollutants, such as PM10, PM2.5, nitrogen dioxide (NO2), sulfur dioxide (SO2), carbon monoxide (CO), and ozone. The scores in the code are scaled according to these thresholds to reflect the health impact of exposure to these pollutants. The WHO's "Global Air Quality Guidelines" (2021) were particularly instrumental in defining these thresholds.
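
One way to scale scores against thresholds is to express each reading as a ratio to its limit. The sketch below assumes that approach; the limits are placeholder values that should be checked against the published WHO and EPA tables, only a subset of the pollutants is shown, and the cap and the averaging rule are illustrative choices, not the project's actual scoring.

```python
# Placeholder limits in micrograms per cubic metre (assumed, verify before use).
LIMITS = {"pm25": 15.0, "pm10": 45.0, "no2": 25.0, "so2": 40.0, "o3": 100.0}


def pollutant_score(value, limit, cap=2.0):
    """Ratio of a measured concentration to its limit, capped at `cap`."""
    return min(value / limit, cap)


def air_quality_impact_index(readings, limits=LIMITS):
    """Mean capped ratio over the pollutants that have a known limit."""
    scores = [pollutant_score(readings[p], limits[p]) for p in readings if p in limits]
    return sum(scores) / len(scores) if scores else 0.0


aqii = air_quality_impact_index({"pm25": 30.0, "pm10": 45.0, "o3": 50.0})
print(round(aqii, 3))
```

A score of 1.0 for a pollutant means the reading sits exactly at its limit, which keeps the index easy to interpret.
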
## Ozone-Pollen Interaction Index (OPII):
* The interaction between ozone levels and pollen allergenicity has been highlighted in research studies such as Bousquet et al., where it is noted that ozone can increase the allergenicity of pollen, exacerbating allergic symptoms. This insight was used to adjust the OPII based on ozone concentrations.
## Particulate Matter Health Index (PMHI):
* The health risks associated with particulate matter, particularly PM2.5 and PM10, are well-documented in numerous studies, including those published by the WHO and the EPA. PM2.5, being smaller, penetrates deeper into the lungs and is associated with more severe health effects, which is why it is weighted more heavily in the PMHI calculation.
## Composite Environmental Risk Index (CERI):
* The CERI is a composite index that aggregates the total environmental risk from various factors, including pollen exposure, air quality, and particulate matter. This approach is in line with multi-factorial health risk assessment frameworks used by public health organizations. The normalization of the total score to a 0-1 scale ensures that the risk is presented in an intuitive format for users.
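
The aggregation and 0-1 normalization can be illustrated with min-max scaling followed by a weighted mean. The component values, ranges, and equal weights below are assumptions chosen for the example; the project's own aggregation rule may differ.

```python
def normalize(value, lower, upper):
    """Min-max scale `value` into [0, 1], clipping values outside the range."""
    if upper <= lower:
        raise ValueError("upper must exceed lower")
    return min(max((value - lower) / (upper - lower), 0.0), 1.0)


def composite_risk(components, ranges, weights):
    """Weighted mean of normalized component scores; the result lies in [0, 1]."""
    total_weight = sum(weights.values())
    weighted = sum(
        weights[name] * normalize(components[name], *ranges[name])
        for name in weights
    )
    return weighted / total_weight


# Hypothetical component scores and their assumed value ranges.
ceri = composite_risk(
    components={"pei": 25.5, "aqii": 1.2, "pmhi": 0.6},
    ranges={"pei": (0, 100), "aqii": (0, 2), "pmhi": (0, 1)},
    weights={"pei": 1.0, "aqii": 1.0, "pmhi": 1.0},
)
print(round(ceri, 3))
```

Normalizing each component before combining prevents an index with a large numeric range from dominating the total.
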
## Health Conditions and Personalized Recommendations:
* The code incorporates personalized recommendations based on the user's health conditions, such as asthma, hay fever, eczema, and coronary artery disease (CAD). The increased risk for these conditions in polluted or high-pollen environments is supported by extensive literature, including the Global Initiative for Asthma (GINA) guidelines and studies on the "atopic march" in allergic diseases. The code's logic reflects the heightened sensitivity of individuals with these conditions to environmental factors, guiding them to take appropriate precautions based on the computed risk score.
* Specifically, the recommendations for individuals with CAD are influenced by guidelines from the American Heart Association (AHA), which emphasize the importance of avoiding poor air quality and high physical exertion in environments with elevated pollutant levels.

# 5. Synthetic Data Generation Using CTGAN

With the integrated dataset in place, we moved to the synthetic data generation phase. For this, we chose the Conditional Tabular GAN (CTGAN) model, which is particularly well-suited for handling the complexities of tabular data. The following subsections explain why we selected it.

## Introduction to Synthetic Data Generation in Tabular Datasets
One of the most advanced and effective methods for synthetic data generation in tabular datasets is the use of Generative Adversarial Networks (GANs), particularly the Conditional Tabular GAN (CTGAN). This chapter outlines the reasons for choosing CTGAN for synthetic data generation in our project, discusses its methodological strengths, and explains why it is a superior approach compared to traditional methods.
## Overview of CTGAN
CTGAN (Conditional Tabular GAN) is a variant of GANs specifically designed to handle the unique challenges posed by tabular data. Traditional GANs have been highly successful in generating synthetic data for images and text but struggle when applied directly to tabular data due to the discrete and often imbalanced nature of features in such datasets.
CTGAN addresses these challenges by incorporating a conditional mechanism that allows it to model both continuous and discrete variables effectively. It was developed to overcome the shortcomings of earlier models, such as inability to accurately capture the dependencies between variables in tabular data and poor performance on imbalanced datasets.
## Why CTGAN is an Ideal Choice
* Handling Mixed Data Types: Tabular data often includes a mix of continuous and categorical variables, which can be difficult to model simultaneously. CTGAN uses a technique called mode-specific normalization to transform continuous variables and applies a Gumbel-softmax trick to handle categorical variables. This allows CTGAN to maintain the relationships between these different data types, generating realistic synthetic data that reflects the underlying patterns in the original dataset.
* Capturing Complex Dependencies: In health-related datasets, such as the ones used in this project, variables can exhibit complex interdependencies (e.g., age, BMI, and gender influencing health outcomes). CTGAN’s conditional mechanism allows it to learn and preserve these dependencies, so that the synthetic data maintains the relationships present in the original data. This is particularly important for tasks such as predictive modeling, where the accuracy of the model relies heavily on the quality of the input data.
* Effective Management of Imbalanced Data: Many real-world datasets suffer from imbalanced classes, where certain outcomes or categories are underrepresented. Traditional synthetic data generation methods may amplify these imbalances, leading to biased models. CTGAN is designed to handle imbalances more effectively: its conditional generator and training-by-sampling scheme sample the categories of discrete columns more evenly during training, so that minority classes are not neglected.
* Preservation of Privacy: One of the primary motivations for using synthetic data is to protect the privacy of individuals in the dataset. CTGAN generates new data points that are statistically similar to the original data rather than copying individual records. It offers no formal privacy guarantee on its own, but it reduces direct exposure of real records, which makes it a good fit for projects that require data sharing or analysis.
* Flexibility and Adaptability: CTGAN is highly flexible and can be adapted to different types of tabular data across various domains. Its architecture allows for the inclusion of additional features or modifications to the generation process, making it a versatile tool in the synthetic data generation toolkit.
* Strong Reported Performance: Published studies and experiments report that CTGAN compares favorably with other synthetic data generation methods in both the realism of the generated data and the preservation of statistical properties. This has made CTGAN a common choice in academic and industrial applications where the accuracy and utility of synthetic data are paramount.
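
To make mode-specific normalization concrete, the sketch below encodes a toy bimodal column the way the CTGAN paper describes: a variational Gaussian mixture finds the modes, and each value becomes a scalar relative to its mode plus a one-hot mode indicator. This is an illustration built on scikit-learn, not the CTGAN library's internal code; the toy data, the number of components, and the use of `predict` (most likely mode) instead of sampling a mode are simplifying assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Toy bimodal column standing in for a continuous clinical variable.
column = np.concatenate([rng.normal(20, 2, 500), rng.normal(60, 5, 500)]).reshape(-1, 1)

# Variational Gaussian mixture; unused components receive near-zero weight.
gmm = BayesianGaussianMixture(
    n_components=5, weight_concentration_prior=1e-3, max_iter=500, random_state=0
)
gmm.fit(column)

# Assign each value to a mode (most likely mode, for determinism).
modes = gmm.predict(column)
means = gmm.means_[modes, 0]
stds = np.sqrt(gmm.covariances_[modes, 0, 0])

# Scalar part: position of the value within its mode, scaled by 4 standard deviations.
alpha = (column[:, 0] - means) / (4 * stds)
# Indicator part: one-hot vector naming the mode.
beta = np.eye(gmm.n_components)[modes]

encoded = np.column_stack([alpha, beta])
print(encoded.shape)
```

The encoding is invertible: multiplying `alpha` by four times the mode's standard deviation and adding the mode's mean recovers the original value.
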
## Application in the Current Project
In this project, CTGAN was chosen to generate synthetic data based on the integrated health datasets described earlier. The ability of CTGAN to handle mixed data types and maintain complex dependencies was critical in ensuring that the generated data was both realistic and useful for downstream predictive modeling tasks. By using CTGAN, we were able to overcome the limitations of incomplete datasets and enhance our dataset with additional synthetic records that closely mimic the statistical properties of the original data.
The synthetic data generated by CTGAN will serve as a valuable resource for training machine learning models, enabling robust predictions while safeguarding the privacy of the individuals represented in the original datasets.

# 6. Data Cleaning and Validation

After generating the synthetic dataset, we performed an extensive data cleaning process. This involved:

* Removing Noise and Outliers: Identifying and eliminating anomalous data points that could skew the analysis.
* Ensuring Consistency: Verifying that the synthetic data maintained the statistical properties and distributions of the original data.
* Validation: Comparing the synthetic data with the original dataset to assess its accuracy and fidelity. This step was crucial to ensure that the synthetic data could be reliably used for predictive modeling and analysis.
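
A simple version of the consistency and validation checks compares each numeric column's distribution in the real and synthetic tables. The sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy on toy data; the helper name `compare_numeric_columns` and the generated columns are assumptions for illustration, and the approach is independent of the `table_evaluator` package installed in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def compare_numeric_columns(real, synthetic, columns):
    """Return per-column means and the two-sample KS statistic and p-value."""
    rows = []
    for col in columns:
        result = ks_2samp(real[col], synthetic[col])
        rows.append({
            "column": col,
            "real_mean": real[col].mean(),
            "synthetic_mean": synthetic[col].mean(),
            "ks_statistic": result.statistic,
            "p_value": result.pvalue,
        })
    return pd.DataFrame(rows)


rng = np.random.default_rng(0)
# Toy stand-ins: 'age' matches between tables, 'bmi' is deliberately shifted.
real = pd.DataFrame({"age": rng.normal(58, 10, 300), "bmi": rng.normal(27, 4, 300)})
synthetic = pd.DataFrame({"age": rng.normal(58, 10, 1000), "bmi": rng.normal(31, 4, 1000)})

report = compare_numeric_columns(real, synthetic, ["age", "bmi"])
print(report)
```

A large KS statistic flags a column whose synthetic distribution has drifted from the original and deserves a closer look.
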

# 7. Application and Future Work

The synthetic dataset generated through this process is a valuable resource for training machine learning models, conducting epidemiological studies, and developing predictive tools. In future work, this dataset could be expanded by integrating additional health conditions or environmental factors, such as pollen exposure, to provide an even more comprehensive tool for health analysis.
Although not perfect, this dataset has an interesting structure that, if replicated with more complex and complete source datasets, could perform considerably better.

# 8. Conclusion

The creation of a synthetic medical dataset using publicly available data and machine learning techniques is a powerful approach to circumventing the challenges of data accessibility. By carefully selecting and integrating datasets, augmenting variables with predictive models, and employing advanced synthetic data generation techniques like CTGAN, we have constructed a dataset that can serve as a robust foundation for further research in health condition analysis.

In [2]:
# To run this notebook you need the following libraries: ctgan, pandas, numpy, table_evaluator, scikit-learn, lightgbm, xgboost
!pip install ctgan
!pip install table_evaluator



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd

# Define file paths as variables
asthma_dataset_path = '/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/asthma_disease_data.csv'
cad_dataset_path = '/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/CAD.csv'

# Load the datasets into separate DataFrames
asthmaOriginal_df = pd.read_csv(asthma_dataset_path)
cadOriginal_df = pd.read_csv(cad_dataset_path)

# Work on copies of the datasets to avoid modifying the originals
asthmadf = asthmaOriginal_df.copy()
caddf = cadOriginal_df.copy()

# Optional: If loading datasets from GitHub
# Replace the URLs below with the actual GitHub raw file links

#asthma_github_url = 'https://raw.githubusercontent.com/username/repository/branch/path_to_asthma_data.csv'
#cad_github_url = 'https://raw.githubusercontent.com/username/repository/branch/path_to_cad_data.csv'

# Load the datasets from GitHub
#asthma_df_github = pd.read_csv(asthma_github_url)
#cad_df_github = pd.read_csv(cad_github_url)

# Work on copies of the datasets loaded from GitHub
#asthma_df_github_copy = asthma_df_github.copy()
#cad_df_github_copy = cad_df_github.copy()


In [None]:
# Clean and encode the CAD dataset
from sklearn.preprocessing import LabelEncoder

caddf = caddf.dropna()

# Round BMI to one decimal place
caddf['BMI'] = caddf['BMI'].round(1)
# Fix the misspelled 'Fmale' label in the 'Sex' column
caddf['Sex'] = caddf['Sex'].replace({'Fmale': 'Female'})

# Display the data types and first few rows after the initial correction
print(caddf.dtypes)
print(caddf.head())

# Encode categorical variables as integers. LabelEncoder assigns codes in sorted
# label order, so check label_encoders[column].classes_ to see which integer
# stands for which label (this includes the 'Cath' target, renamed to 'Cad' below).
label_encoders = {}
for column in ['Sex', 'Obesity', 'Airway disease', 'Thyroid Disease', 'Cath']:
    le = LabelEncoder()
    caddf[column] = le.fit_transform(caddf[column])
    label_encoders[column] = le
# Rename the columns in caddf
caddf.rename(columns={'Age': 'age', 'Sex': 'gender', 'BMI': 'bmi', 'Cath': 'Cad'}, inplace=True)
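
Because `LabelEncoder` numbers classes in sorted order, the integer treated as the positive class by default (1) may not be the disease label. The sketch below uses hypothetical labels `"Cad"` and `"Normal"` (an assumption about the target's values, to be confirmed with `classes_` on the real column) to show how the mapping can be read back and passed to a metric explicitly.

```python
from sklearn.metrics import recall_score
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels; confirm the real ones via encoder.classes_.
labels = ["Cad", "Normal", "Cad", "Cad", "Normal"]

encoder = LabelEncoder()
y_true = encoder.fit_transform(labels)

# Map each label to the integer code LabelEncoder assigned to it.
code_of = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(code_of)

# Made-up predictions, already expressed as integer codes.
y_pred = [0, 1, 0, 1, 1]

# Recall for the disease class, selected explicitly via pos_label.
recall_cad = recall_score(y_true, y_pred, pos_label=code_of["Cad"])
# Default pos_label=1, which here corresponds to "Normal".
recall_default = recall_score(y_true, y_pred)
print(recall_cad, recall_default)
```

Passing `pos_label` explicitly makes it unambiguous which class the reported precision and recall refer to.
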

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Select relevant features and target variable
features = ['age', 'bmi', 'Current Smoker', 'gender', 'Obesity']
target = 'Cad'

# Extract the features and target variable from the dataset
X = caddf[features]
y = caddf[target]

# Check for missing values and handle them if necessary
if X.isnull().sum().any():
    imputer = SimpleImputer(strategy='mean')
    X = pd.DataFrame(imputer.fit_transform(X), columns=features)


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"ROC-AUC Score: {roc_auc:.2f}")


Accuracy: 0.71
Precision: 0.48
Recall: 0.48
ROC-AUC Score: 0.69


In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the model
rf_model = RandomForestClassifier(random_state=42)

# Initialize Grid Search
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit Grid Search
grid_search.fit(X_train, y_train)

# Best parameters from Grid Search
best_params = grid_search.best_params_

# Train the Random Forest model with the best parameters
best_rf_model = RandomForestClassifier(**best_params, random_state=42)
best_rf_model.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = best_rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, best_rf_model.predict_proba(X_test)[:, 1])

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"ROC-AUC Score: {roc_auc:.2f}")


Fitting 5 folds for each of 108 candidates, totalling 540 fits


KeyboardInterrupt: 
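
The exhaustive grid above needs 540 fits and was interrupted. One way to bound the cost is a randomized search that samples a fixed number of candidates from the same grid. The sketch below runs on generated toy data rather than the CAD table; `n_iter=5`, `cv=3`, and the `roc_auc` scoring are illustrative choices, not settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy imbalanced data standing in for the five CAD features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=42,
)

# Same search space as the grid search cell.
param_distributions = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Sample 5 candidates instead of evaluating all 108 combinations.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

With 5 candidates and 3 folds this performs 15 fits, so the total runtime can be controlled directly through `n_iter`.
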

In [None]:
# After some tests, the data appeared insufficient, so we applied CTGAN to create synthetic data and check whether the model assessments improve



In [None]:
#starting generation of synthetic data to have a sufficient number of subjects to work with other datasets
from ctgan import CTGAN
features = [
    'age',
    'Weight',
    'Length',
    'gender',
    'bmi',
    'DM',
    'HTN',
    'Current Smoker',
    'EX-Smoker',
    'FH',
    'Obesity',
    'CRF',
    'CVA',
    'Airway disease',
    'Thyroid Disease',
    'CHF',
    'DLP',
    'BP',
    'PR',
    'Edema',
    'Weak Peripheral Pulse',
    'Lung rales',
    'Systolic Murmur',
    'Diastolic Murmur',
    'Typical Chest Pain',
    'Dyspnea',
    'Function Class',
    'Atypical',
    'Nonanginal',
    'Exertional CP',
    'LowTH Ang',
    'Q Wave',
    'St Elevation',
    'St Depression',
    'Tinversion',
    'LVH',
    'Poor R Progression',
    'FBS',
    'CR',
    'TG',
    'LDL',
    'HDL',
    'BUN',
    'ESR',
    'HB',
    'K',
    'Na',
    'WBC',
    'Lymph',
    'Neut',
    'PLT',
    'EF-TTE',
    'Region RWMA',
    'VHD',
    'Cad'
]

# Drop rows with missing BMI values
startingdf = caddf.dropna(subset=['bmi'])

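# CTGAN.fit takes the list of discrete (categorical) columns as its second argument,
# so pass only those rather than every feature. Heuristic (an assumption to review
# per column): treat non-numeric columns and columns with few distinct values as discrete.
discrete_columns = [
    col for col in features
    if not pd.api.types.is_numeric_dtype(startingdf[col]) or startingdf[col].nunique() <= 10
]

# Initialize CTGAN; adjust the number of epochs as needed (possibly many more, e.g. 50000)
ctgan = CTGAN(epochs=5000, verbose=True)

# Fit CTGAN to the data
ctgan.fit(startingdf, discrete_columns)

# Generate 10,000 synthetic samples
samples = ctgan.sample(10000)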

# Post-process the synthetic data to ensure 'gender' is correctly encoded
samples['gender'] = samples['gender'].clip(0, 1).round().astype(int)

# Optionally, print or inspect the first few rows of the synthetic data
print(samples.head())


Gen. (-0.77) | Discrim. (-0.51): 100%|██████████| 5000/5000 [08:37<00:00,  9.67it/s]


   age  Weight  Length  gender   bmi  DM  HTN  Current Smoker  EX-Smoker  FH  \
0   57      66     170       1  26.3   0    0               0          0   0   
1   72      78     175       1  24.3   1    1               1          0   0   
2   56      63     162       1  25.1   1    0               0          0   0   
3   63      84     175       1  27.4   0    1               1          0   0   
4   49      81     155       0  28.4   0    0               0          0   1   

   ...    K   Na    WBC  Lymph  Neut  PLT EF-TTE  Region RWMA       VHD  Cad  
0  ...  3.4  146   6900     37    60  174     55            0      mild    1  
1  ...  4.7  139  10000     29    65  192     50            0      mild    1  
2  ...  4.6  142   8500     34    54  290     50            0  Moderate    0  
3  ...  4.8  153   9400     15    67  170     40            3      mild    0  
4  ...  4.3  138   6900     36    40  236     55            2         N    1  

[5 rows x 55 columns]


In [None]:
samples.to_csv('/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/CADSynthetic.csv', index=False)

In [None]:
samples.dtypes, samples.head(), len(samples)


(age                        int64
 Weight                     int64
 Length                     int64
 gender                     int64
 bmi                      float64
 DM                         int64
 HTN                        int64
 Current Smoker             int64
 EX-Smoker                  int64
 FH                         int64
 Obesity                    int64
 CRF                       object
 CVA                       object
 Airway disease             int64
 Thyroid Disease            int64
 CHF                       object
 DLP                       object
 BP                         int64
 PR                         int64
 Edema                      int64
 Weak Peripheral Pulse     object
 Lung rales                object
 Systolic Murmur           object
 Diastolic Murmur          object
 Typical Chest Pain         int64
 Dyspnea                   object
 Function Class             int64
 Atypical                  object
 Nonanginal                object
 Exertional CP

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

samples = pd.read_csv('/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/CADSynthetic.csv')
# Select relevant features and target variable
features = ['age', 'bmi', 'Current Smoker', 'gender', 'Obesity']
target = 'Cad'

# Extract the features and target variable from the dataset
X = samples[features]
y = samples[target]

# Check for missing values and handle them if necessary
if X.isnull().sum().any():
    imputer = SimpleImputer(strategy='mean')
    X = pd.DataFrame(imputer.fit_transform(X), columns=features)


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"ROC-AUC Score: {roc_auc:.2f}")


Accuracy: 0.70
Precision: 0.46
Recall: 0.35
ROC-AUC Score: 0.65
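
The metrics above are computed on a held-out slice of the synthetic samples themselves. A complementary check is to train on synthetic rows and test on real rows that the generator never saw. The sketch below illustrates that comparison on generated toy data: the "synthetic" set is simulated by adding noise to a separate pool of rows, which is an assumption standing in for CTGAN output, and `real_test_auc` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_all, y_all = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Small "real" table and a larger pool used to simulate synthetic rows.
X_real, X_pool, y_real, y_pool = train_test_split(
    X_all, y_all, train_size=300, random_state=0, stratify=y_all
)
rng = np.random.default_rng(0)
X_synth = X_pool + rng.normal(0, 0.3, X_pool.shape)  # noisy stand-in for CTGAN output
y_synth = y_pool

# Hold out real rows for testing before any model sees them.
X_real_train, X_real_test, y_real_train, y_real_test = train_test_split(
    X_real, y_real, test_size=0.3, random_state=0, stratify=y_real
)


def real_test_auc(X_train, y_train):
    """Train a Random Forest on the given rows and score it on the real test rows."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return roc_auc_score(y_real_test, model.predict_proba(X_real_test)[:, 1])


auc_real = real_test_auc(X_real_train, y_real_train)  # train on real, test on real
auc_synth = real_test_auc(X_synth, y_synth)           # train on synthetic, test on real
print(round(auc_real, 2), round(auc_synth, 2))
```

If the score from synthetic training stays close to the score from real training, the synthetic data has preserved the signal that matters for the prediction task.
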


In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the model
rf_model = RandomForestClassifier(random_state=42)

# Initialize Grid Search
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

# Fit Grid Search
grid_search.fit(X_train, y_train)

# Best parameters from Grid Search
best_params = grid_search.best_params_

# Train the Random Forest model with the best parameters
best_rf_model = RandomForestClassifier(**best_params, random_state=42)
best_rf_model.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = best_rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, best_rf_model.predict_proba(X_test)[:, 1])

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"ROC-AUC Score: {roc_auc:.2f}")


Fitting 5 folds for each of 108 candidates, totalling 540 fits


KeyboardInterrupt: 

In [None]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
import joblib  # For saving the model

# Select relevant features and target variable
features = ['age', 'bmi', 'Current Smoker', 'gender', 'Obesity']
target = 'Cad'

# Extract the features and target variable from the dataset
X = samples[features]
y = samples[target]

# Handle missing values
imputer = SimpleImputer(strategy='mean')
X = pd.DataFrame(imputer.fit_transform(X), columns=features)

# Encode categorical variables
for col in X.select_dtypes(include=['object']).columns:
    encoder = LabelEncoder()
    X[col] = encoder.fit_transform(X[col].astype(str))

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define model name for identification
model_name = "LightGBM_CAD_v1"

# Initialize and train the LightGBM model
model = lgb.LGBMClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print evaluation metrics
print(f"Model: {model_name}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"ROC-AUC Score: {roc_auc:.2f}")

# Save the model
model_filename = f"{model_name}_model.pkl"
joblib.dump(model, model_filename)

print(f"Model saved as {model_filename}")


[LightGBM] [Info] Number of positive: 1937, number of negative: 5063
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000900 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 181
[LightGBM] [Info] Number of data points in the train set: 7000, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.276714 -> initscore=-0.960819
[LightGBM] [Info] Start training from score -0.960819
Model: LightGBM_CAD_v1
Accuracy: 0.74
Precision: 0.58
Recall: 0.26
ROC-AUC Score: 0.71
Model saved as LightGBM_CAD_v1_model.pkl


In [None]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
import joblib  # For saving the model

# Select relevant features and target variable
features = ['age', 'bmi', 'Current Smoker', 'gender', 'Obesity']
target = 'Cad'

# Extract the features and target variable from the dataset
X = samples[features]
y = samples[target]

# Handle missing values
imputer = SimpleImputer(strategy='mean')
X = pd.DataFrame(imputer.fit_transform(X), columns=features)

# Encode categorical variables
for col in X.select_dtypes(include=['object']).columns:
    encoder = LabelEncoder()
    X[col] = encoder.fit_transform(X[col].astype(str))

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define model name for identification
model_name = "XGBoost_CAD_v1"

# Initialize and train the XGBoost model
model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print evaluation metrics
print(f"Model: {model_name}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"ROC-AUC Score: {roc_auc:.2f}")

# Save the model
model_filename = f"{model_name}_model.pkl"
joblib.dump(model, model_filename)

print(f"Model saved as {model_filename}")


Model: XGBoost_CAD_v1
Accuracy: 0.73
Precision: 0.52
Recall: 0.32
ROC-AUC Score: 0.70
Model saved as XGBoost_CAD_v1_model.pkl


Parameters: { "use_label_encoder" } are not used.
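
The training cells above repeat the same evaluation steps for every model. A minimal sketch of how those steps could be factored into one helper is shown below. The function name `evaluate_classifier`, the toy data from `make_classification`, and the logistic regression stand-in are assumptions for illustration only; they are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_classifier(model, X_test, y_test):
    """Return the four metrics printed in the cells above as a dict (hypothetical helper)."""
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, y_proba),
    }


# Toy imbalanced data standing in for the CAD samples (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.72, 0.28], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
demo_metrics = evaluate_classifier(demo_model, X_te, y_te)
for metric_name, value in demo_metrics.items():
    print(f"{metric_name}: {value:.2f}")
```

Any fitted estimator that exposes `predict` and `predict_proba` could be passed to the helper in the same way.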



In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import joblib  # For saving the model

# Load the dataset
samples = pd.read_csv('/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/CADSynthetic.csv')

# Define relevant features and target variable
features = ['age', 'bmi', 'Current Smoker', 'gender', 'Obesity']
target = 'Cad'

# Extract the features and target variable from the dataset
X = samples[features]
y = samples[target]

# Handle missing values
imputer = SimpleImputer(strategy='mean')
X = pd.DataFrame(imputer.fit_transform(X), columns=features)

# Encode categorical variables
for col in X.select_dtypes(include=['object']).columns:
    encoder = LabelEncoder()
    X[col] = encoder.fit_transform(X[col].astype(str))

# Create interaction and polynomial features
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X)

# Get the names of the new features
poly_features = poly.get_feature_names_out(features)

# Convert to DataFrame for better readability
X_poly_df = pd.DataFrame(X_poly, columns=poly_features)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_poly_df, y, test_size=0.3, random_state=42)

# Define model name for identification
model_name = "RandomForest_Poly_CAD_v1"

# Initialize and train the Random Forest model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print evaluation metrics
print(f"Model: {model_name}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"ROC-AUC Score: {roc_auc:.2f}")

# Save the model
model_filename = f"{model_name}_model.pkl"
joblib.dump(model, model_filename)

print(f"Model saved as {model_filename}")

Model: RandomForest_Poly_CAD_v1
Accuracy: 0.70
Precision: 0.46
Recall: 0.36
ROC-AUC Score: 0.66
Model saved as RandomForest_Poly_CAD_v1_model.pkl


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import joblib  # For saving the model

# Select relevant features and target variable
features = ['age', 'bmi', 'Current Smoker', 'gender', 'Obesity']
target = 'Cad'

# Extract the features and target variable from the dataset
X = samples[features]
y = samples[target]

# Handle missing values
imputer = SimpleImputer(strategy='mean')
X = pd.DataFrame(imputer.fit_transform(X), columns=features)

# Encode categorical variables
for col in X.select_dtypes(include=['object']).columns:
    encoder = LabelEncoder()
    X[col] = encoder.fit_transform(X[col].astype(str))

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define model name for identification
model_name = "RandomForest_Balanced_CAD_v1"

# Initialize and train the Random Forest model with class_weight='balanced'
model = RandomForestClassifier(random_state=42, class_weight='balanced')
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print evaluation metrics
print(f"Model: {model_name}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"ROC-AUC Score: {roc_auc:.2f}")

# Save the model
model_filename = f"{model_name}_model.pkl"
joblib.dump(model, model_filename)

print(f"Model saved as {model_filename}")


Model: RandomForest_Balanced_CAD_v1
Accuracy: 0.69
Precision: 0.43
Recall: 0.39
ROC-AUC Score: 0.65
Model saved as RandomForest_Balanced_CAD_v1_model.pkl


In [None]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
import pandas as pd
import joblib  # For saving the model

# Select relevant features and target variable
features = ['age', 'bmi', 'Current Smoker', 'gender', 'Obesity']
target = 'Cad'

# Extract the features and target variable from the dataset
X = samples[features]
y = samples[target]

# Handle missing values
imputer = SimpleImputer(strategy='mean')
X = pd.DataFrame(imputer.fit_transform(X), columns=features)

# Encode categorical variables
for col in X.select_dtypes(include=['object']).columns:
    encoder = LabelEncoder()
    X[col] = encoder.fit_transform(X[col].astype(str))

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define model name for identification
model_name = "LightGBM_CAD_v2"

# Initialize and train the LightGBM model
model = lgb.LGBMClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print evaluation metrics
print(f"Model: {model_name}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"ROC-AUC Score: {roc_auc:.2f}")

# Save the model
model_filename = f"{model_name}_model.pkl"
joblib.dump(model, model_filename)

print(f"Model saved as {model_filename}")


[LightGBM] [Info] Number of positive: 1937, number of negative: 5063
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001081 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 181
[LightGBM] [Info] Number of data points in the train set: 7000, number of used features: 5
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.276714 -> initscore=-0.960819
[LightGBM] [Info] Start training from score -0.960819
Model: LightGBM_CAD_v2
Accuracy: 0.74
Precision: 0.58
Recall: 0.26
ROC-AUC Score: 0.71
Model saved as LightGBM_CAD_v2_model.pkl


In [None]:
caddf.dtypes

age               int64
Weight            int64
Length            int64
gender            int64
bmi             float64
DM                int64
HTN               int64
Current Smoker    int64
EX-Smoker         int64
FH                int64


In [None]:
# Rename columns to match the naming used in the CAD dataset
asthmadf.rename(columns={'Age': 'age', 'Gender': 'gender', 'BMI': 'bmi', 'Smoking': 'Current Smoker'}, inplace=True)

# Round bmi to one decimal place
asthmadf['bmi'] = asthmadf['bmi'].round(1)


# Create the Obesity column (encoding: 0 = yes, 1 = no; cutoff is bmi > 25)
if 'bmi' in asthmadf.columns:
    asthmadf['Obesity'] = asthmadf['bmi'].apply(lambda x: 0 if x > 25 else 1)
else:
    print("Error: 'bmi' column is missing in asthmadf.")

In [None]:
asthmadf.dtypes

PatientID             int64
age                   int64
gender                int64
Ethnicity             int64
EducationLevel        int64
bmi                 float64
Current Smoker        int64
PhysicalActivity    float64
DietQuality         float64
SleepQuality        float64


In [None]:
import pandas as pd
import joblib
from sklearn.impute import SimpleImputer  # used below to impute the asthma features
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Load the trained models
rf_model = joblib.load('RandomForest_Balanced_CAD_v1_model.pkl')
lgb_model = joblib.load('LightGBM_CAD_v1_model.pkl')
xgb_model = joblib.load('XGBoost_CAD_v1_model.pkl')

# List of models and their names
models = {
    "RandomForest": rf_model,
    "LightGBM": lgb_model,
    "XGBoost": xgb_model
}

# Initialize variables for best model selection
best_model = None
best_model_name = ""
best_roc_auc = 0

# Use X_test and y_test to select the best model
for name, model in models.items():
    y_test_pred_proba = model.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(y_test, y_test_pred_proba)

    if roc_auc > best_roc_auc:
        best_roc_auc = roc_auc
        best_model = model
        best_model_name = name

print(f"Best model selected: {best_model_name} with ROC-AUC: {best_roc_auc:.2f}")

# Now proceed to apply the best model to the asthmadf dataset

print(f"Best model selected: {best_model_name} with ROC-AUC: {best_roc_auc:.2f}")

# NOTE: this assignment overrides the ROC-AUC-based selection above, so the
# balanced Random Forest is used for the predictions below. Remove this line
# to apply the model chosen by the loop instead.
best_model = rf_model

# Apply the best model to the asthmadf dataset
asthma_features = ['age', 'bmi', 'Current Smoker', 'gender', 'Obesity']
asthmadf_X = asthmadf[asthma_features]

# Handle missing values (use the same strategy used during training)
imputer = SimpleImputer(strategy='mean')
asthmadf_X = pd.DataFrame(imputer.fit_transform(asthmadf_X), columns=asthma_features)


# Make predictions
asthmadf['Cad_Probability'] = best_model.predict_proba(asthmadf_X)[:, 1]
asthmadf['Cad'] = (asthmadf['Cad_Probability'] > 0.5).astype(int)



# Save the adjusted predictions
asthmadf.to_csv('/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/asthma_disease_CADPredictions.csv', index=False)

print("Predictions and adjustments completed and saved.")


Best model selected: LightGBM with ROC-AUC: 0.71
Best model selected: LightGBM with ROC-AUC: 0.71
Predictions and adjustments completed and saved.
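
The cell above fits a new imputer on the asthma data before predicting. One alternative, sketched here under stated assumptions, is to bundle the imputer and the classifier in a scikit-learn `Pipeline` so that the means learned on the training data are reused at prediction time. The toy frames `train` and `new_rows` and their values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

features = ['age', 'bmi', 'Current Smoker', 'gender', 'Obesity']
rng = np.random.default_rng(42)

# Invented training data with a few missing bmi values
train = pd.DataFrame({
    'age': rng.integers(20, 80, 200).astype(float),
    'bmi': rng.normal(27, 4, 200).round(1),
    'Current Smoker': rng.integers(0, 2, 200),
    'gender': rng.integers(0, 2, 200),
    'Obesity': rng.integers(0, 2, 200),
})
train.iloc[::10, 1] = np.nan  # column 1 is bmi
y_train = rng.integers(0, 2, 200)

# The imputer is fitted once, together with the model
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=42, class_weight='balanced')),
])
pipe.fit(train[features], y_train)

# New rows are imputed with the training means, not with their own
new_rows = pd.DataFrame({
    'age': [45.0, 61.0],
    'bmi': [np.nan, 31.2],
    'Current Smoker': [0, 1],
    'gender': [1, 0],
    'Obesity': [1, 0],
})
new_proba = pipe.predict_proba(new_rows[features])[:, 1]
print(new_proba)
```

Saving the whole pipeline with `joblib.dump` would then keep the preprocessing and the model in a single file.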


In [None]:
asthmaCaddf = pd.read_csv('/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/asthma_disease_CADPredictions.csv')

In [None]:
cad_distribution = asthmaCaddf['Cad'].value_counts(normalize=True)
print(cad_distribution)

Cad
0    0.861204
1    0.138796
Name: proportion, dtype: float64
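
The LightGBM log earlier reports a positive share of about 27.7% in the training set (`pavg=0.276714`), while the 0.5 cutoff above labels about 13.9% of the asthma rows as positive. If a closer match to a target prevalence were wanted, one option is to derive the cutoff from a quantile of the predicted probabilities. This is only a sketch: the helper `threshold_for_prevalence` and the beta-distributed toy scores are assumptions, and the notebook itself uses the fixed 0.5 cutoff.

```python
import numpy as np


def threshold_for_prevalence(probabilities, target_prevalence):
    """Return a cutoff such that roughly `target_prevalence` of the scores exceed it."""
    return float(np.quantile(probabilities, 1.0 - target_prevalence))


# Toy scores standing in for Cad_Probability (assumption, not the real predictions)
rng = np.random.default_rng(0)
demo_proba = rng.beta(2, 5, size=10_000)

target = 0.2767  # training-set share reported in the LightGBM log
cutoff = threshold_for_prevalence(demo_proba, target)
flagged_share = float((demo_proba > cutoff).mean())

print(f"cutoff: {cutoff:.3f}, flagged share: {flagged_share:.4f}")
```

Whether matching the CAD dataset's prevalence is appropriate for an asthma cohort is a modelling decision that the sketch does not settle.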


In [None]:
# Pollen allergies considered: alder, birch, grass, mugwort, olive, ragweed

# Ragweed pollen, distribution source: https://www.meduniwien.ac.at/web/en/ueber-uns/news/2022/news-im-august-2022/ragweed-allergie-herkunftsort-und-umwelt-beeinflussen-aggressivitaet-der-pollen/
# 33 million Europeans suffered from ragweed allergy in 2022: 33 million / 447.03 million = 7.38%

# Grass pollen, distribution source: https://www.meduniwien.ac.at/web/en/about-us/news/detailsite/2018/news-jaenner-2018/first-vaccine-in-the-world-developed-against-grass-pollen-allergy/medicine-science/
# 400 million people worldwide in 2018: 400 million / 7.594 billion = 5.27%

# Mugwort pollen, distribution source: https://www.thermofisher.com/allergy/us/en/allergen-fact-sheets/mugwort.html (10-15% in the EU)

# Birch pollen, distribution source: https://www.meduniwien.ac.at/web/en/about-us/news/detailsite/apple-allergy-symptoms-may-be-significantly-reduced-in-future/ (4.73%)

# Alder pollen, distribution source: https://www.thermofisher.com/phadia/wo/en/resources/allergen-encyclopedia/t2.html (6-10%)

# Olive pollen: no reliable data on the distribution was found; given the range of the other allergies, it is assumed to be around 5-8%

# Asthma prevalence in Europe is 6.71% of the population. Sources: European Lung Foundation (ELF): https://europeanlung.org/en/ and Global Asthma Report: http://www.globalasthmareport.org/
# Studies have shown that asthma, eczema and pollen allergies are correlated, so an individual with one of these conditions is more prone to the others. Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5806141/
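
The prevalence figures in the notes above come from simple ratios. The short sketch below reproduces the two divisions and the midpoints of the quoted ranges; the variable names are chosen here for illustration.

```python
# Ragweed: 33 million affected out of 447.03 million Europeans (2022)
ragweed_share = 33e6 / 447.03e6

# Grass: 400 million affected out of a world population of 7.594 billion (2018)
grass_share = 400e6 / 7.594e9

# Midpoints of the ranges quoted for mugwort (10-15%) and alder (6-10%)
mugwort_midpoint = (0.10 + 0.15) / 2
alder_midpoint = (0.06 + 0.10) / 2

print(f"ragweed: {ragweed_share:.4f}")
print(f"grass:   {grass_share:.4f}")
print(f"mugwort midpoint: {mugwort_midpoint:.3f}")
print(f"alder midpoint:   {alder_midpoint:.3f}")
```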

In [None]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/asthma_disease_CADPredictions.csv')

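# Define the base probability for each allergy.
# NOTE: the source notes give 10-15% for mugwort and 4.73% for birch, which
# does not match the values below (mugwort reuses the grass value and birch
# holds the mugwort midpoint). They are left unchanged because the reported
# distribution was produced with them.
prob_ragweed_pollen = 0.0738
prob_grass_pollen = 0.0527
prob_mugwort_pollen = 0.0527
prob_birch_pollen = 0.125
prob_alder_pollen = 0.08
prob_olive_pollen = 0.07
prob_asthma = 0.0671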

# Increased probabilities based on conditions
prob_asthma_with_history = 0.60
prob_pollen_with_eczema_or_asthma = 0.40

# Initialize columns for the new allergies
df['Ragweed_Pollen_Allergy'] = 0
df['Grass_Pollen_Allergy'] = 0
df['Mugwort_Pollen_Allergy'] = 0
df['Birch_Pollen_Allergy'] = 0
df['Alder_Pollen_Allergy'] = 0
df['Olive_Pollen_Allergy'] = 0
df['Asthma_Allergy'] = 0
df['Eczema'] = 0

# Assign allergies based on the probabilities
for index, row in df.iterrows():
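    # Determine if the person has eczema (reuses the 0.40 probability; the
    # Eczema column was reset to 0 above, so any values loaded from the CSV are replaced)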
    if np.random.rand() < prob_pollen_with_eczema_or_asthma:
        df.at[index, 'Eczema'] = 1

    # Determine if the person has asthma, with higher probability if there's a family history
    if row['FamilyHistoryAsthma'] == 1:
        if np.random.rand() < prob_asthma_with_history:
            df.at[index, 'Asthma_Allergy'] = 1
    else:
        if np.random.rand() < prob_asthma:
            df.at[index, 'Asthma_Allergy'] = 1

    # Assign pollen allergies. NOTE: for rows with eczema or asthma, the extra 0.40 gate
    # below multiplies each base probability by 0.40, which lowers rather than raises it.
    if df.at[index, 'Eczema'] == 1 or df.at[index, 'Asthma_Allergy'] == 1:
        if np.random.rand() < prob_pollen_with_eczema_or_asthma:
            if np.random.rand() < prob_ragweed_pollen:
                df.at[index, 'Ragweed_Pollen_Allergy'] = 1
            if np.random.rand() < prob_grass_pollen:
                df.at[index, 'Grass_Pollen_Allergy'] = 1
            if np.random.rand() < prob_mugwort_pollen:
                df.at[index, 'Mugwort_Pollen_Allergy'] = 1
            if np.random.rand() < prob_birch_pollen:
                df.at[index, 'Birch_Pollen_Allergy'] = 1
            if np.random.rand() < prob_alder_pollen:
                df.at[index, 'Alder_Pollen_Allergy'] = 1
            if np.random.rand() < prob_olive_pollen:
                df.at[index, 'Olive_Pollen_Allergy'] = 1
    else:
        if np.random.rand() < prob_ragweed_pollen:
            df.at[index, 'Ragweed_Pollen_Allergy'] = 1
        if np.random.rand() < prob_grass_pollen:
            df.at[index, 'Grass_Pollen_Allergy'] = 1
        if np.random.rand() < prob_mugwort_pollen:
            df.at[index, 'Mugwort_Pollen_Allergy'] = 1
        if np.random.rand() < prob_birch_pollen:
            df.at[index, 'Birch_Pollen_Allergy'] = 1
        if np.random.rand() < prob_alder_pollen:
            df.at[index, 'Alder_Pollen_Allergy'] = 1
        if np.random.rand() < prob_olive_pollen:
            df.at[index, 'Olive_Pollen_Allergy'] = 1

    # Ensure that if a person has HayFever (value 1), they have at least one pollen allergy
    if row['HayFever'] == 1:
        if not (df.at[index, 'Ragweed_Pollen_Allergy'] == 1 or
                df.at[index, 'Grass_Pollen_Allergy'] == 1 or
                df.at[index, 'Mugwort_Pollen_Allergy'] == 1 or
                df.at[index, 'Birch_Pollen_Allergy'] == 1 or
                df.at[index, 'Alder_Pollen_Allergy'] == 1 or
                df.at[index, 'Olive_Pollen_Allergy'] == 1):
            # Randomly assign one pollen allergy if none are assigned yet
            pollen_allergies = ['Ragweed_Pollen_Allergy', 'Grass_Pollen_Allergy', 'Mugwort_Pollen_Allergy',
                                'Birch_Pollen_Allergy', 'Alder_Pollen_Allergy', 'Olive_Pollen_Allergy']
            chosen_allergy = np.random.choice(pollen_allergies)
            df.at[index, chosen_allergy] = 1

# Save the updated dataframe to a new CSV file
df.to_csv('/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/SmallDataset.csv', index=False)

# Check the distribution of allergies in the dataframe
distribution = {
    'Ragweed_Pollen_Allergy': df['Ragweed_Pollen_Allergy'].mean(),
    'Grass_Pollen_Allergy': df['Grass_Pollen_Allergy'].mean(),
    'Mugwort_Pollen_Allergy': df['Mugwort_Pollen_Allergy'].mean(),
    'Birch_Pollen_Allergy': df['Birch_Pollen_Allergy'].mean(),
    'Alder_Pollen_Allergy': df['Alder_Pollen_Allergy'].mean(),
    'Olive_Pollen_Allergy': df['Olive_Pollen_Allergy'].mean(),
    'Asthma_Allergy': df['Asthma_Allergy'].mean(),
    'Eczema': df['Eczema'].mean()
}

distribution


{'Ragweed_Pollen_Allergy': 0.07566889632107024,
 'Grass_Pollen_Allergy': 0.07065217391304347,
 'Mugwort_Pollen_Allergy': 0.07566889632107024,
 'Birch_Pollen_Allergy': 0.1149665551839465,
 'Alder_Pollen_Allergy': 0.07566889632107024,
 'Olive_Pollen_Allergy': 0.0798494983277592,
 'Asthma_Allergy': 0.23118729096989968,
 'Eczema': 0.395066889632107}
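
The notes above state that asthma, eczema and pollen allergies are correlated. A minimal vectorized sketch of one way to encode such a dependency is given below, where rows with asthma receive a higher pollen allergy probability than the rest. The `risk_multiplier` value, the toy frame `demo`, and the seeded generator are assumptions for illustration, not values taken from the sources or from the loop above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 20_000

# Toy frame with a random family-history flag (assumption, not the real data)
demo = pd.DataFrame({'FamilyHistoryAsthma': rng.integers(0, 2, n)})

# Asthma: higher probability when there is a family history
base_asthma, asthma_with_history = 0.0671, 0.60
p_asthma = np.where(demo['FamilyHistoryAsthma'] == 1, asthma_with_history, base_asthma)
demo['Asthma_Allergy'] = (rng.random(n) < p_asthma).astype(int)

# Birch pollen: base probability scaled up for rows with asthma
base_birch = 0.125
risk_multiplier = 2.0  # hypothetical value, chosen only for the demo
p_birch = np.where(
    demo['Asthma_Allergy'] == 1,
    min(base_birch * risk_multiplier, 1.0),
    base_birch,
)
demo['Birch_Pollen_Allergy'] = (rng.random(n) < p_birch).astype(int)

# Compare the allergy rate between the two groups
rates = demo.groupby('Asthma_Allergy')['Birch_Pollen_Allergy'].mean()
print(rates)
```

Drawing all random numbers at once with a seeded generator also makes the assignment reproducible and avoids the row-by-row loop.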

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/SmallDataset.csv')

# Function to generate random date of birth based on age
def generate_dob(age):
    year_of_birth = datetime.now().year - age
    # Generate random month
    month = np.random.randint(1, 13)
    # Generate random day based on the month and whether it's a leap year
    if month == 2:  # February
        if year_of_birth % 4 == 0 and (year_of_birth % 100 != 0 or year_of_birth % 400 == 0):
            day = np.random.randint(1, 30)  # Leap year
        else:
            day = np.random.randint(1, 29)
    elif month in [4, 6, 9, 11]:
        day = np.random.randint(1, 31)  # April, June, September, November have 30 days
    else:
        day = np.random.randint(1, 32)  # Other months have 31 days
    dob = datetime(year_of_birth, month, day).strftime('%Y-%m-%d')
    return dob

# Replace 'age' with 'Date_of_Birth'
df['Date_of_Birth'] = df['age'].apply(generate_dob)
df.drop(columns=['age'], inplace=True)

# Drop specified columns
columns_to_drop = ['Ethnicity', 'GastroesophagealReflux', 'LungFunctionFEV1', 'LungFunctionFVC',
                   'Diagnosis', 'DoctorInCharge']
df.drop(columns=columns_to_drop, inplace=True)

# Round specified columns to one decimal point
columns_to_round = ['PhysicalActivity', 'DietQuality', 'SleepQuality', 'PollutionExposure',
                    'PollenExposure', 'DustExposure']
df[columns_to_round] = df[columns_to_round].round(1)

# Save the modified dataframe to a new CSV file
df.to_csv('/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/SmallDatasetV1.csv', index=False)

# Display the first few rows of the modified dataframe
df.head()


Unnamed: 0,PatientID,gender,EducationLevel,bmi,Current Smoker,PhysicalActivity,DietQuality,SleepQuality,PollutionExposure,PollenExposure,...,Cad_Probability,Cad,Ragweed_Pollen_Allergy,Grass_Pollen_Allergy,Mugwort_Pollen_Allergy,Birch_Pollen_Allergy,Alder_Pollen_Allergy,Olive_Pollen_Allergy,Asthma_Allergy,Date_of_Birth
0,5034,0,0,15.8,0,0.9,5.5,8.7,7.4,2.9,...,0.43,0,0,0,0,0,0,0,0,1961-12-19
1,5035,1,2,22.8,0,5.9,6.3,5.2,2.0,7.5,...,0.04,0,0,0,0,0,0,0,1,1998-03-30
2,5036,0,1,18.4,0,6.7,9.2,6.8,1.5,1.4,...,0.16,0,0,0,0,1,0,0,1,1967-02-03
3,5037,1,1,38.5,0,1.4,5.8,4.3,0.6,7.6,...,0.12,0,1,0,0,0,0,0,0,1984-08-26
4,5038,0,3,19.3,0,4.6,3.1,9.6,1.0,3.0,...,0.13,0,0,0,1,0,0,0,0,1963-04-20
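
The date-of-birth step above handles month lengths and leap years by hand. A shorter sketch of the same idea, using `calendar.monthrange` from the standard library to look up the last day of the month, could look like this. The `reference_year` parameter and the seeded generator are assumptions added to make the example reproducible; the original uses the current year.

```python
import calendar
from datetime import date

import numpy as np

rng = np.random.default_rng(7)


def generate_dob(age, reference_year=2024):
    """Return a random ISO date of birth for someone who turns `age` in `reference_year`."""
    year = reference_year - age
    month = int(rng.integers(1, 13))               # 1..12
    last_day = calendar.monthrange(year, month)[1]  # accounts for leap years
    day = int(rng.integers(1, last_day + 1))       # 1..last_day
    return date(year, month, day).isoformat()


dobs = [generate_dob(age) for age in (30, 45, 63)]
print(dobs)
```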


In [None]:
import pandas as pd
from ctgan import CTGAN

# Load the dataset
asthmaCaddf = pd.read_csv('/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/SmallDatasetV1.csv')

def extract_features(file_path):
    df = pd.read_csv(file_path)
    features = df.columns.tolist()
    return features

# Example usage
file_path = '/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/SmallDatasetV1.csv'
features = extract_features(file_path)
print(features)



# Initialize CTGAN
ctgan = CTGAN(verbose=True)

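# Fit CTGAN to the data. NOTE: the second positional argument of fit() is
# discrete_columns, so passing the full column list treats every column,
# including continuous ones such as bmi, as categorical.
ctgan.fit(asthmaCaddf, features, epochs=5000)  # Adjust the number of epochs as needed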

# Generate 10000 synthetic samples
samples = ctgan.sample(10000)

# Post-process the synthetic data to ensure 'gender' is correctly encoded
samples['gender'] = samples['gender'].clip(0, 1).round().astype(int)

# Save the synthetic data to a CSV file
samples.to_csv('/content/drive/MyDrive/Big Data Technologies/datasets/RawDatasets/FinalDatasetSynthetic.csv', index=False)


['PatientID', 'gender', 'EducationLevel', 'bmi', 'Current Smoker', 'PhysicalActivity', 'DietQuality', 'SleepQuality', 'PollutionExposure', 'PollenExposure', 'DustExposure', 'PetAllergy', 'FamilyHistoryAsthma', 'HistoryOfAllergies', 'Eczema', 'HayFever', 'Wheezing', 'ShortnessOfBreath', 'ChestTightness', 'Coughing', 'NighttimeSymptoms', 'ExerciseInduced', 'Obesity', 'Cad_Probability', 'Cad', 'Ragweed_Pollen_Allergy', 'Grass_Pollen_Allergy', 'Mugwort_Pollen_Allergy', 'Birch_Pollen_Allergy', 'Alder_Pollen_Allergy', 'Olive_Pollen_Allergy', 'Asthma_Allergy', 'Date_of_Birth']


Gen. (-1.16) | Discrim. (-0.16):  92%|█████████▏| 4595/5000 [1:20:08<06:55,  1.03s/it]
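
CTGAN's `fit` takes a list of discrete columns as its second argument. A minimal sketch of one way to build that list from the data, rather than passing every column, is shown below. The helper `infer_discrete_columns`, the `max_unique` cutoff, and the toy frame are assumptions for illustration; the CTGAN calls are left as comments so the example runs without the package.

```python
import pandas as pd


def infer_discrete_columns(df, max_unique=10):
    """Treat non-numeric columns and low-cardinality numeric columns as discrete (heuristic)."""
    discrete = []
    for col in df.columns:
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        if not is_numeric or df[col].nunique() <= max_unique:
            discrete.append(col)
    return discrete


# Toy frame with a mix of binary, ordinal and continuous columns (invented values)
toy = pd.DataFrame({
    'gender': [0, 1] * 10,
    'EducationLevel': [0, 1, 2, 3] * 5,
    'bmi': [15.0 + 1.1 * i for i in range(20)],
    'Cad_Probability': [i / 20 for i in range(20)],
    'Cad': [0] * 15 + [1] * 5,
})

discrete_columns = infer_discrete_columns(toy)
print(discrete_columns)

# With the ctgan package installed, the list would be passed as the second argument:
# from ctgan import CTGAN
# ctgan = CTGAN(epochs=300, verbose=True)
# ctgan.fit(real_df, discrete_columns)
```

Identifier-like columns such as `PatientID` and high-cardinality strings such as `Date_of_Birth` would probably need separate handling before training, since the heuristic above does not address them.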