# **Data Preprocessing Binary Dataset**

Different machine learning methods require the data to be formatted in different ways.

Decision trees work best with categorical data, while neural networks work best with data normalised to the range 0 to 1.

This file takes the binary dataset for lung cancer and creates several modified versions of it to be used for model training and experimenting.

The choice of modifications is based on the earlier data analysis, the needs of the different models, and experimental purposes.

## Necessary Imports

In [13]:
import sys
assert sys.version_info >= (3, 6)  # f-strings are used in the save cells below

import sklearn
assert sklearn.__version__ >= "0.20"

import numpy as np
import os
#import tarfile
#import urllib
import pandas as pd
#import urllib.request

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

## Loading in original dataset

In [14]:
BINARY_PATH = os.path.join("..", "datasets", "1_binary", "untouched")

def load_binary(binary_path=BINARY_PATH):
    csv_path = os.path.join(binary_path, "1_binary.csv")
    return pd.read_csv(csv_path)

binary_unbalanced = load_binary() 

binary_unbalanced.head() #Display first five rows of the frame

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,M,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
1,M,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
2,F,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
3,M,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
4,F,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO


## Oversampling to balance target class

In [15]:
binary_unbalanced['LUNG_CANCER'].value_counts()

LUNG_CANCER
YES    270
NO      39
Name: count, dtype: int64

In [16]:
# Oversample 'NO' class by randomly duplicating instances until it matches the 'YES' count of 270
no_class = binary_unbalanced[binary_unbalanced['LUNG_CANCER'] == 'NO']
no_class_oversampled = no_class.sample(n=270, replace=True, random_state=42)

# Combine the oversampled 'NO' instances with the original 'YES' instances
binary_balanced = pd.concat([binary_unbalanced[binary_unbalanced['LUNG_CANCER'] == 'YES'], no_class_oversampled])

# Shuffle the dataset to mix 'YES' and 'NO' instances
binary_balanced = binary_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

binary_balanced['LUNG_CANCER'].value_counts()

LUNG_CANCER
YES    270
NO     270
Name: count, dtype: int64

The simplest oversampling method was used: randomly duplicating instances. This is not ideal; however, a balanced target class could improve the performance of certain models.

For every modified dataset, both an oversampled (balanced) and an unbalanced (original) version will be saved. Each machine learning model will use the version best suited to it.

Note: using the oversampled dataset may lead to overly optimistic accuracy results. The imbalance is roughly 1 to 7 (39 'NO' to 270 'YES'), so each 'NO' instance is duplicated about seven times on average. If the data is split after oversampling, the model will encounter during evaluation and testing the exact same 'NO' instances it saw during training. A model may not generalise well if the oversampled dataset is used.
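One common way to limit the leakage described in the note is to split the data first and oversample only the training portion. The following is a minimal sketch under that assumption, not part of this notebook's pipeline; the helper name `split_then_oversample`, the `ID` column and the toy frame are hypothetical and only serve the illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_then_oversample(df, target="LUNG_CANCER", minority="NO",
                          test_size=0.2, seed=42):
    """Hypothetical helper: hold out a test set, then balance only the training rows."""
    # Stratify so both classes appear in the training and test portions
    train, test = train_test_split(
        df, test_size=test_size, stratify=df[target], random_state=seed
    )
    majority_rows = train[train[target] != minority]
    minority_rows = train[train[target] == minority]
    # Duplicate minority rows at random until both classes have the same count
    minority_up = minority_rows.sample(
        n=len(majority_rows), replace=True, random_state=seed
    )
    train_balanced = pd.concat([majority_rows, minority_up]).sample(
        frac=1, random_state=seed
    )
    return train_balanced, test

# Toy frame standing in for the real data (35 'YES', 5 'NO')
toy = pd.DataFrame({"ID": range(40), "LUNG_CANCER": ["YES"] * 35 + ["NO"] * 5})
train_balanced, test = split_then_oversample(toy)
print(train_balanced["LUNG_CANCER"].value_counts())
```

With this ordering, duplicated rows can only appear inside the training set, so the test rows remain unseen during training.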

## Modifying dataset and saving as a file

#### 1 - Original dataset with string values converted to numbers: ["NO","YES"] to [1,2] and ["M","F"] to [1,2]

This is just the original dataset with string values converted to numerical values to match the rest of the data.

It could be used as a comparison point with the normalized dataset by certain models

In [23]:
# Create a copy of binary_unbalanced
binary_og_unbalanced = binary_unbalanced.copy()
# Create a copy of binary_balanced
binary_og_balanced = binary_balanced.copy()


# Map the string values to numbers for 'LUNG_CANCER' and 'GENDER'
binary_og_unbalanced['LUNG_CANCER'] = binary_og_unbalanced['LUNG_CANCER'].map({'NO': 1, 'YES': 2})
binary_og_unbalanced['GENDER'] = binary_og_unbalanced['GENDER'].map({'M': 1, 'F': 2})

binary_og_balanced['LUNG_CANCER'] = binary_og_balanced['LUNG_CANCER'].map({'NO': 1, 'YES': 2})
binary_og_balanced['GENDER'] = binary_og_balanced['GENDER'].map({'M': 1, 'F': 2})


# Verify the transformations
binary_og_unbalanced.head()
#binary_og_unbalanced['LUNG_CANCER'].value_counts()
#binary_og_balanced.head()
#binary_og_balanced['LUNG_CANCER'].value_counts()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,1,69,1,2,2,1,1,2,1,2,2,2,2,2,2,2
1,1,74,2,1,1,1,2,2,2,1,1,1,2,2,2,2
2,2,59,1,1,1,2,1,2,1,2,1,2,2,1,2,1
3,1,63,2,2,2,1,1,1,1,1,2,1,1,2,2,1
4,2,63,1,2,1,1,1,1,1,2,1,2,2,1,1,1
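`Series.map` with a dictionary turns any value that is missing from the dictionary into `NaN` without raising an error, so an unexpected label (for example a lowercase `'f'`) would pass silently. A minimal sketch of a stricter mapping follows; the helper name `map_strict` is hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd

def map_strict(series, mapping):
    """Hypothetical helper: map values and fail loudly on labels missing from the mapping."""
    mapped = series.map(mapping)
    # Values that were present in the input but became NaN were not in the mapping
    unmapped = series[mapped.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped values in '{series.name}': {list(unmapped)}")
    return mapped

gender = map_strict(pd.Series(["M", "F", "M"], name="GENDER"), {"M": 1, "F": 2})
print(gender.tolist())
```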


Save these versions of the dataset as CSV files in a folder called processed

In [24]:
# Specify the folder path
SAVE_PATH = os.path.join("..", "datasets", "1_binary", "processed")

# Specify file names (to be saved)
og_unbalanced_file_path = os.path.join(SAVE_PATH, "1_og_ub.csv")
og_balanced_file_path = os.path.join(SAVE_PATH, "1_og_b.csv")

# Check if the file already exists
if not os.path.exists(og_unbalanced_file_path):
    # Save the DataFrame if it doesn't exist
    binary_og_unbalanced.to_csv(og_unbalanced_file_path, index=False)
    print(f"File saved at {og_unbalanced_file_path}")
else:
    print(f"File already exists at {og_unbalanced_file_path}")

# Repeat for balanced dataset 
if not os.path.exists(og_balanced_file_path):
    binary_og_balanced.to_csv(og_balanced_file_path, index=False)
    print(f"File saved at {og_balanced_file_path}")
else:
    print(f"File already exists at {og_balanced_file_path}")

File saved at ../datasets/1_binary/processed/1_og_ub.csv
File saved at ../datasets/1_binary/processed/1_og_b.csv
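The same check-then-save pattern is repeated for every dataset version in this notebook. As a sketch only, it could be wrapped in a small helper; the name `save_if_missing` is hypothetical, and creating the target folder when it is absent is an added assumption (`to_csv` does not create missing folders by itself). The demonstration writes to a temporary directory rather than to the real `processed` folder.

```python
import os
import tempfile
import pandas as pd

def save_if_missing(df, file_path):
    """Hypothetical helper: write df to file_path unless the file already exists."""
    folder = os.path.dirname(file_path)
    if folder:
        # Assumption: create the target folder if it is not there yet
        os.makedirs(folder, exist_ok=True)
    if os.path.exists(file_path):
        print(f"File already exists at {file_path}")
        return False
    df.to_csv(file_path, index=False)
    print(f"File saved at {file_path}")
    return True

# Demonstration in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "processed", "demo.csv")
first = save_if_missing(pd.DataFrame({"A": [1, 2]}), demo_path)
second = save_if_missing(pd.DataFrame({"A": [1, 2]}), demo_path)
```

The first call writes the file and the second call leaves it untouched, which matches the behaviour of the cells in this notebook.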


#### 2 - Dataset normalized to values between 0 and 1: 0 = "NO", 1 = "YES" for the binary features; 'AGE' is scaled continuously

In [None]:
from sklearn.preprocessing import MinMaxScaler

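# Function to normalize all features using MinMaxScaler
def normalize_df(df):
    scaler = MinMaxScaler()
    # fit_transform returns a NumPy array, so rebuild a DataFrame with the original column names
    df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
    return df_normalized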

# Apply normalization to both unbalanced and balanced datasets
binary_normalized_unbalanced = normalize_df(binary_og_unbalanced)
binary_normalized_balanced = normalize_df(binary_og_balanced)

# Verify the normalization
binary_normalized_unbalanced.head()
#binary_normalized_balanced.head()

#binary_normalized_unbalanced['LUNG_CANCER'].value_counts()
#binary_normalized_balanced['LUNG_CANCER'].value_counts()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,0.0,0.727273,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,0.0,0.80303,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
2,1.0,0.575758,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0
3,0.0,0.636364,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
4,1.0,0.636364,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
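For reference, min-max scaling replaces each value `x` in a column with `(x - min) / (max - min)`, so a column holding only 1 and 2 becomes 0 and 1. The sketch below compares the manual formula with `MinMaxScaler` on a toy 'AGE' column. The endpoints 21 and 87 are an assumption: they are not stated in this notebook, but they reproduce the 'AGE' values shown in the output above (for example 69 becomes 0.727273).

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy column; 21 and 87 are assumed endpoints (see the note above)
ages = pd.DataFrame({"AGE": [21, 59, 63, 69, 74, 87]})

# Manual min-max formula applied column-wise
manual = (ages - ages.min()) / (ages.max() - ages.min())

# Same transformation through scikit-learn
scaled = pd.DataFrame(MinMaxScaler().fit_transform(ages), columns=ages.columns)

print(manual["AGE"].round(6).tolist())
print(scaled["AGE"].round(6).tolist())
```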


Save these versions of the dataset as CSV files in a folder called processed

In [31]:
# Specify file names (to be saved)
normalized_unbalanced_file_path = os.path.join(SAVE_PATH, "1_nrml_ub.csv")
normalized_balanced_file_path = os.path.join(SAVE_PATH, "1_nrml_b.csv")

# Check if the file already exists
if not os.path.exists(normalized_unbalanced_file_path):
    # Save the DataFrame if it doesn't exist
    binary_normalized_unbalanced.to_csv(normalized_unbalanced_file_path, index=False)
    print(f"File saved at {normalized_unbalanced_file_path}")
else:
    print(f"File already exists at {normalized_unbalanced_file_path}")

# Repeat for balanced dataset 
if not os.path.exists(normalized_balanced_file_path):
    binary_normalized_balanced.to_csv(normalized_balanced_file_path, index=False)
    print(f"File saved at {normalized_balanced_file_path}")
else:
    print(f"File already exists at {normalized_balanced_file_path}")

File saved at ../datasets/1_binary/processed/1_nrml_ub.csv
File saved at ../datasets/1_binary/processed/1_nrml_b.csv


#### 3 - Original dataset with features converted to string values - [1,2] to ["NO","YES"] and 'AGE' to ['YOUNG ADULT','ADULT','OLDER ADULT','ELDERLY']

Some models, like decision trees, are intended to work with categorical string values.

In [40]:
# Function to map [1, 2] to ["NO", "YES"] for all applicable features and to bin 'AGE' into four labelled groups
def map_numerical_to_string(df):
    df_mapped = df.copy()
    
    # Define the mapping
    mapping = {1: "NO", 2: "YES"}
    
    # Apply the mapping to all columns with values [1, 2]
    for column in df_mapped.columns:
        if df_mapped[column].isin([1, 2]).all():  # Check if the column contains only 1 and 2
            df_mapped[column] = df_mapped[column].map(mapping)

    # Bin 'AGE' using fixed cut points at 35, 50 and 65 (the resulting bins are not equal-width)
    age_bins = [df_mapped['AGE'].min(), 35, 50, 65, df_mapped['AGE'].max()]  
    age_labels = ['YOUNG ADULT', 'ADULT', 'OLDER ADULT', 'ELDERLY']
    df_mapped['AGE'] = pd.cut(df_mapped['AGE'], bins=age_bins, labels=age_labels, include_lowest=True)
    
    
    return df_mapped

# Apply mapping to both balanced and unbalanced datasets
binary_string_unbalanced = map_numerical_to_string(binary_unbalanced)
binary_string_balanced = map_numerical_to_string(binary_balanced)

# Verify the mapping
binary_string_unbalanced.head()
#binary_string_balanced.head()

#binary_string_unbalanced['LUNG_CANCER'].value_counts()
#binary_string_balanced['LUNG_CANCER'].value_counts()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,M,ELDERLY,NO,YES,YES,NO,NO,YES,NO,YES,YES,YES,YES,YES,YES,YES
1,M,ELDERLY,YES,NO,NO,NO,YES,YES,YES,NO,NO,NO,YES,YES,YES,YES
2,F,OLDER ADULT,NO,NO,NO,YES,NO,YES,NO,YES,NO,YES,YES,NO,YES,NO
3,M,OLDER ADULT,YES,YES,YES,NO,NO,NO,NO,NO,YES,NO,NO,YES,YES,NO
4,F,OLDER ADULT,NO,YES,NO,NO,NO,NO,NO,YES,NO,YES,YES,NO,NO,NO
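To make the bin boundaries explicit: `pd.cut` uses right-closed intervals by default, so an age of exactly 35 falls in 'YOUNG ADULT' and 36 in 'ADULT', while `include_lowest=True` keeps the minimum age inside the first bin. The sketch below applies the same cut points and labels to a toy series; the ages are made up for illustration.

```python
import pandas as pd

# Toy ages chosen to sit on and around the cut points
ages = pd.Series([21, 35, 36, 50, 63, 65, 69, 87], name="AGE")

edges = [ages.min(), 35, 50, 65, ages.max()]
labels = ["YOUNG ADULT", "ADULT", "OLDER ADULT", "ELDERLY"]

# Right-closed bins; include_lowest keeps the minimum age in the first bin
binned = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)

print(pd.DataFrame({"AGE": ages, "GROUP": binned}))
```

Because the first and last edges come from the data, `pd.cut` raises an error when they do not fit around the fixed cut points (for example, a minimum age of 35 or more), since bin edges must increase strictly.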


In [41]:
# Specify file names (to be saved)
string_unbalanced_file_path = os.path.join(SAVE_PATH, "1_str_ub.csv")
string_balanced_file_path = os.path.join(SAVE_PATH, "1_str_b.csv")

# Check if the file already exists
if not os.path.exists(string_unbalanced_file_path):
    # Save the DataFrame if it doesn't exist
    binary_string_unbalanced.to_csv(string_unbalanced_file_path, index=False)
    print(f"File saved at {string_unbalanced_file_path}")
else:
    print(f"File already exists at {string_unbalanced_file_path}")

# Repeat for balanced dataset 
if not os.path.exists(string_balanced_file_path):
    binary_string_balanced.to_csv(string_balanced_file_path, index=False)
    print(f"File saved at {string_balanced_file_path}")
else:
    print(f"File already exists at {string_balanced_file_path}")

File saved at ../datasets/1_binary/processed/1_str_ub.csv
File saved at ../datasets/1_binary/processed/1_str_b.csv


#### 4 - Dataset with combined features/extracted features

In the Data Analysis file I analysed the correlations between the features and the target class. I then produced combinations of the top correlated features, which resulted in noticeably higher correlations with the target.

For experimental purposes, to see how using these extracted features would affect model performance, a dataset with these features is created and saved.

In this code, I repeat the process of combining features and then create a dataframe containing the top 5 correlated single features, the top 5 most correlated two-fold combinations and the most correlated three-fold combinations.

Perform two feature combinations:

In [55]:
# Create a list of top 5 most correlated features
base_features = ['ALLERGY', 'ALCOHOL CONSUMING', 'SWALLOWING DIFFICULTY', 'WHEEZING', 'COUGHING']

# Create a new DataFrame to store summed interaction terms
lung_binary_interactions_two = pd.DataFrame()

# Generate interaction terms by summing each pair of top features
interaction_count = 0
for i in range(len(base_features)):
    for j in range(i + 1, len(base_features)):
        # Sum of each pair of features
        feature1, feature2 = base_features[i], base_features[j]
        interaction_column_name = f"{feature1}_{feature2}"
        lung_binary_interactions_two[interaction_column_name] = binary_normalized_unbalanced[feature1] + binary_normalized_unbalanced[feature2]
        interaction_count += 1
        if interaction_count >= 10:
            break
    if interaction_count >= 10:
        break

# Normalize the interaction terms to map values of 0, 1, 2 to 0, 0.5, 1
lung_binary_interactions_two = lung_binary_interactions_two / 2

# Add the target variable 'LUNG_CANCER' as the last column
lung_binary_interactions_two['LUNG_CANCER'] = binary_normalized_unbalanced['LUNG_CANCER']


Perform three feature combinations:

In [None]:
from itertools import combinations

# Create a new DataFrame to store the three-feature interaction terms
lung_binary_interactions_three = pd.DataFrame()

# Generate interaction terms by summing each combination of three features
for combo in combinations(base_features, 3):
    # Sum of the three features in the combination
    feature1, feature2, feature3 = combo
    interaction_column_name = f"{feature1}_{feature2}_{feature3}"
    lung_binary_interactions_three[interaction_column_name] = (
        binary_normalized_unbalanced[feature1] + binary_normalized_unbalanced[feature2] + binary_normalized_unbalanced[feature3]
    )

# Normalize the interaction terms to map values of 0, 1, 2, 3 to 0, 0.33, 0.67, and 1
lung_binary_interactions_three = lung_binary_interactions_three / 3

# Add the target variable 'LUNG_CANCER' as the last column (already 0/1, so it is not divided by 3)
lung_binary_interactions_three['LUNG_CANCER'] = binary_normalized_unbalanced['LUNG_CANCER']

Correlation of each three-feature combination (summed) with LUNG_CANCER:
LUNG_CANCER                                         1.000000
ALLERGY_SWALLOWING DIFFICULTY_COUGHING              0.487791
ALLERGY_ALCOHOL CONSUMING_SWALLOWING DIFFICULTY     0.465212
ALLERGY_SWALLOWING DIFFICULTY_WHEEZING              0.456343
ALCOHOL CONSUMING_SWALLOWING DIFFICULTY_COUGHING    0.454842
ALCOHOL CONSUMING_SWALLOWING DIFFICULTY_WHEEZING    0.417485
ALLERGY_ALCOHOL CONSUMING_COUGHING                  0.408984
ALLERGY_ALCOHOL CONSUMING_WHEEZING                  0.405010
SWALLOWING DIFFICULTY_WHEEZING_COUGHING             0.401074
ALLERGY_WHEEZING_COUGHING                           0.390398
ALCOHOL CONSUMING_WHEEZING_COUGHING                 0.363356
Name: LUNG_CANCER, dtype: float64
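The cell that produced a ranking like the one above is not shown here. As a sketch under that caveat, summed combinations of any size can be built with `itertools.combinations` and ranked by their correlation with the target. The helper name `summed_combinations` and the toy frame with columns `A`, `B`, `C` are hypothetical.

```python
from itertools import combinations
import pandas as pd

def summed_combinations(df, features, k):
    """Hypothetical helper: sum every k-sized combination of features and rescale to [0, 1]."""
    out = pd.DataFrame(index=df.index)
    for combo in combinations(features, k):
        # Dividing by k maps sums of 0..k back into the 0..1 range
        out["_".join(combo)] = df[list(combo)].sum(axis=1) / k
    return out

# Toy normalized data standing in for the real frame
toy = pd.DataFrame({
    "A": [0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
    "B": [0.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    "C": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "LUNG_CANCER": [0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
})

pairs = summed_combinations(toy, ["A", "B", "C"], 2)
pairs["LUNG_CANCER"] = toy["LUNG_CANCER"]

# Rank every combination by its correlation with the target
ranking = pairs.corr()["LUNG_CANCER"].sort_values(ascending=False)
print(ranking)
```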


Combine the top 5 correlated single features from the normalized dataframe with the selected two-feature and three-feature combinations into a new dataframe

In [58]:

# Select the top 5 most correlated columns from the two-fold combination
top_two_fold_features = [
    'ALLERGY_SWALLOWING DIFFICULTY', 
    'SWALLOWING DIFFICULTY_COUGHING', 
    'ALCOHOL CONSUMING_SWALLOWING DIFFICULTY', 
    'ALLERGY_WHEEZING', 
    'ALLERGY_ALCOHOL CONSUMING'
]

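# Select five highly correlated columns from the three-fold combination
# (note: 'ALCOHOL CONSUMING_SWALLOWING DIFFICULTY_COUGHING', ranked fourth at 0.454842 in the output above, is not in this list)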
top_three_fold_features = [
    'ALLERGY_SWALLOWING DIFFICULTY_COUGHING', 
    'ALLERGY_ALCOHOL CONSUMING_SWALLOWING DIFFICULTY', 
    'ALLERGY_SWALLOWING DIFFICULTY_WHEEZING', 
    'ALCOHOL CONSUMING_SWALLOWING DIFFICULTY_WHEEZING', 
    'ALLERGY_ALCOHOL CONSUMING_COUGHING'
]

# Combine all selected columns into a new DataFrame along with the target column
binary_extracted_features_unbalanced = binary_normalized_unbalanced[base_features].copy()
binary_extracted_features_unbalanced = pd.concat([binary_extracted_features_unbalanced, lung_binary_interactions_two[top_two_fold_features]], axis=1)
binary_extracted_features_unbalanced = pd.concat([binary_extracted_features_unbalanced, lung_binary_interactions_three[top_three_fold_features]], axis=1)
binary_extracted_features_unbalanced['LUNG_CANCER'] = binary_normalized_unbalanced['LUNG_CANCER']

# Display the first few rows of the new DataFrame to verify
binary_extracted_features_unbalanced.head()

Unnamed: 0,ALLERGY,ALCOHOL CONSUMING,SWALLOWING DIFFICULTY,WHEEZING,COUGHING,ALLERGY_SWALLOWING DIFFICULTY,SWALLOWING DIFFICULTY_COUGHING,ALCOHOL CONSUMING_SWALLOWING DIFFICULTY,ALLERGY_WHEEZING,ALLERGY_ALCOHOL CONSUMING,ALLERGY_SWALLOWING DIFFICULTY_COUGHING,ALLERGY_ALCOHOL CONSUMING_SWALLOWING DIFFICULTY,ALLERGY_SWALLOWING DIFFICULTY_WHEEZING,ALCOHOL CONSUMING_SWALLOWING DIFFICULTY_WHEEZING,ALLERGY_ALCOHOL CONSUMING_COUGHING,LUNG_CANCER
0,0.0,1.0,1.0,1.0,1.0,0.5,1.0,1.0,0.5,0.5,0.666667,0.666667,0.666667,1.0,0.666667,1.0
1,1.0,0.0,1.0,0.0,0.0,1.0,0.5,0.5,0.5,0.5,0.666667,0.666667,0.666667,0.333333,0.333333,1.0
2,0.0,0.0,0.0,1.0,1.0,0.0,0.5,0.0,0.5,0.0,0.333333,0.0,0.333333,0.333333,0.333333,0.0
3,0.0,1.0,1.0,0.0,0.0,0.5,0.5,1.0,0.0,0.5,0.333333,0.666667,0.333333,0.666667,0.333333,0.0
4,0.0,0.0,0.0,1.0,1.0,0.0,0.5,0.0,0.5,0.0,0.333333,0.0,0.333333,0.333333,0.333333,0.0


Verify correlations:

In [59]:
# Calculate the correlation matrix with 'LUNG_CANCER' as the target
correlation_matrix = binary_extracted_features_unbalanced.corr()
correlation_with_target = correlation_matrix['LUNG_CANCER'].sort_values(ascending=False)

# Display the correlations with 'LUNG_CANCER'
print("Correlation of each feature with LUNG_CANCER:")
print(correlation_with_target)


Correlation of each feature with LUNG_CANCER:
LUNG_CANCER                                         1.000000
ALLERGY_SWALLOWING DIFFICULTY_COUGHING              0.487791
ALLERGY_ALCOHOL CONSUMING_SWALLOWING DIFFICULTY     0.465212
ALLERGY_SWALLOWING DIFFICULTY_WHEEZING              0.456343
ALLERGY_SWALLOWING DIFFICULTY                       0.428705
ALCOHOL CONSUMING_SWALLOWING DIFFICULTY_WHEEZING    0.417485
ALLERGY_ALCOHOL CONSUMING_COUGHING                  0.408984
SWALLOWING DIFFICULTY_COUGHING                      0.391638
ALCOHOL CONSUMING_SWALLOWING DIFFICULTY             0.389447
ALLERGY_WHEEZING                                    0.376618
ALLERGY_ALCOHOL CONSUMING                           0.375856
ALLERGY                                             0.327766
ALCOHOL CONSUMING                                   0.288533
SWALLOWING DIFFICULTY                               0.259730
WHEEZING                                            0.249300
COUGHING                               

The correlation values look as expected.

Creating a balanced version of this modified dataframe:

In [None]:
# Separate the 'NO' class (0) instances
no_class = binary_extracted_features_unbalanced[binary_extracted_features_unbalanced['LUNG_CANCER'] == 0]

# Perform oversampling to match the number of 'YES' (1) instances
yes_count = int(binary_extracted_features_unbalanced['LUNG_CANCER'].sum())  # Convert to integer count of 'YES' instances
no_class_oversampled = no_class.sample(n=yes_count, replace=True, random_state=42)

# Combine the oversampled 'NO' instances with the original 'YES' instances
binary_extracted_features_balanced = pd.concat([
    binary_extracted_features_unbalanced[binary_extracted_features_unbalanced['LUNG_CANCER'] == 1],
    no_class_oversampled
])

# Shuffle the dataset
binary_extracted_features_balanced = binary_extracted_features_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# Display the balanced dataset to verify
binary_extracted_features_balanced.head()

# Verify value counts
#binary_extracted_features_unbalanced['LUNG_CANCER'].value_counts()
#binary_extracted_features_balanced['LUNG_CANCER'].value_counts()

LUNG_CANCER
1.0    270
0.0     39
Name: count, dtype: int64