# Data Exploration and Exploratory Data Analysis (EDA)
## Deep Learning-Based Lung Cancer Risk Prediction

**Research Title:** Deep Learning-Based Lung Cancer Risk Prediction: A Comparative Study of Artificial Neural Network Performance with Clinical Feature Analysis

**Notebook Purpose:** This notebook performs comprehensive exploratory data analysis on the lung cancer dataset, including:
- Data loading and basic statistics
- Missing value analysis
- Feature distribution analysis
- Correlation analysis
- Statistical tests for feature significance

---


In [None]:
# Import required libraries
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
# Suppress library warnings to keep notebook output readable
# (comment this out when debugging)
warnings.filterwarnings('ignore')

# Import custom modules
from data_preprocessing import LungCancerDataPreprocessor
from feature_analysis import FeatureAnalyzer, FeatureVisualizer

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("Libraries imported successfully!")


In [None]:
# Initialize preprocessor and load data
DATA_PATH = '../data/survey lung cancer.csv'
preprocessor = LungCancerDataPreprocessor(DATA_PATH)
df = preprocessor.load_data()

# Display basic information
print(f"\nDataset shape: {df.shape}")
print("\nFirst 5 rows:")
df.head()


In [None]:
# Comprehensive data exploration
exploration_results = preprocessor.explore_data()
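
The internals of `explore_data()` are not shown in this notebook. As a rough cross-check of the missing-value analysis, the sketch below computes per-column missing counts with plain pandas. The `toy` frame and the `summarize_missing` helper are hypothetical stand-ins, not part of the project code, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the survey data; values are made up.
toy = pd.DataFrame({
    "AGE": [61, 58, np.nan, 70],
    "SMOKING": [2, 1, 2, 2],
    "LUNG_CANCER": ["YES", "NO", "YES", None],
})


def summarize_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts, percentages, and dtypes."""
    missing = frame.isna().sum()
    summary = pd.DataFrame({
        "missing_count": missing,
        "missing_pct": 100 * missing / len(frame),
        "dtype": frame.dtypes.astype(str),
    })
    # Columns with the most missing values come first
    return summary.sort_values("missing_count", ascending=False)


missing_summary = summarize_missing(toy)
print(missing_summary)
print(f"Duplicate rows: {toy.duplicated().sum()}")
```

On the real data, the same helper could be called as `summarize_missing(df)`.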


In [None]:
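# Visualize target distribution
target_counts = df['LUNG_CANCER'].value_counts()

# value_counts() orders labels by frequency, so assign colors by label
# rather than by position (assumes the labels are 'YES' and 'NO')
palette = {'NO': '#2ecc71', 'YES': '#e74c3c'}
colors = [palette.get(label, '#95a5a6') for label in target_counts.index]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: absolute counts per class
target_counts.plot(kind='bar', ax=axes[0], color=colors)
axes[0].set_title('Lung Cancer Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Lung Cancer Status')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=0)

# Right panel: class proportions
axes[1].pie(target_counts, labels=target_counts.index, autopct='%1.1f%%',
            colors=colors, startangle=90)
axes[1].set_title('Lung Cancer Distribution', fontsize=14, fontweight='bold')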

plt.tight_layout()
plt.savefig('../figures/target_distribution.png', dpi=300, bbox_inches='tight')
plt.show()
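
The plots show the class split visually. It can also help to record the proportions and the majority-to-minority ratio as numbers, since class imbalance affects how later models should be evaluated. The sketch below uses a hypothetical `labels` series with made-up counts. On the real data the input would be `df['LUNG_CANCER']`.

```python
import pandas as pd

# Hypothetical labels with made-up counts; not the real class distribution.
labels = pd.Series(["YES"] * 8 + ["NO"] * 2, name="LUNG_CANCER")

counts = labels.value_counts()
proportions = labels.value_counts(normalize=True)

# Ratio of the largest class to the smallest class
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(proportions.round(3))
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f}")
```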


In [None]:
# Feature analysis
X, y = preprocessor.get_full_dataset()
feature_names = preprocessor.get_feature_names()

analyzer = FeatureAnalyzer(X, y, feature_names)
rf_importance = analyzer.calculate_feature_importance_rf()
comprehensive_importance = analyzer.get_comprehensive_importance()

print("\nTop 10 Most Important Features:")
print(comprehensive_importance[['Feature', 'Average_Importance']].head(10))

# Visualize the random forest importances
FeatureVisualizer.plot_feature_importance(
    rf_importance,
    title="Feature Importance Analysis",
    save_path='../figures/feature_importance_eda.png'
)
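
Model-based importance can be complemented by a statistical test of association between each categorical feature and the target. The sketch below applies a chi-square test of independence with `scipy.stats.chi2_contingency`. This is an illustrative addition, not necessarily what `FeatureAnalyzer` does internally, and the `SYMPTOM` column and its values are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data (made up): one binary symptom and the target label.
toy = pd.DataFrame({
    "SYMPTOM": [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30,
    "LUNG_CANCER": ["YES"] * 40 + ["NO"] * 40,
})

# 2x2 table of symptom presence against diagnosis
contingency = pd.crosstab(toy["SYMPTOM"], toy["LUNG_CANCER"])

# For 2x2 tables SciPy applies Yates' continuity correction by default
chi2, p_value, dof, expected = chi2_contingency(contingency)

print(contingency)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.2e}")

# The chi-square approximation is unreliable when expected counts are small
if expected.min() < 5:
    print("Warning: expected count below 5; consider Fisher's exact test.")
```

When many features are tested this way, the p-values should be adjusted for multiple comparisons before any feature is called significant.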


In [None]:
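# Correlation analysis
# Encode the two categorical columns so that every column is numeric.
# Note: .map() yields NaN for any value not listed in the mapping.
df_corr = df.copy()
df_corr['GENDER'] = df_corr['GENDER'].map({'M': 1, 'F': 0})
df_corr['LUNG_CANCER'] = df_corr['LUNG_CANCER'].map({'YES': 1, 'NO': 0})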

FeatureVisualizer.plot_correlation_matrix(df_corr, save_path='../figures/correlation_matrix.png')
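
A heatmap is easier to interpret alongside a ranked list of each feature's correlation with the target. The sketch below encodes the categorical columns in the same way as the cell above, checks for unmapped values, and sorts features by absolute Pearson correlation. The `toy` frame is hypothetical and its values are made up.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "GENDER": ["M", "F", "M", "F", "M", "F"],
    "SMOKING": [2, 1, 2, 1, 2, 2],
    "AGE": [65, 50, 70, 45, 60, 55],
    "LUNG_CANCER": ["YES", "NO", "YES", "NO", "YES", "NO"],
})

encoded = toy.copy()
encoded["GENDER"] = encoded["GENDER"].map({"M": 1, "F": 0})
encoded["LUNG_CANCER"] = encoded["LUNG_CANCER"].map({"YES": 1, "NO": 0})

# A category missing from the mapping becomes NaN, so fail early if that happens
if encoded.isna().any().any():
    raise ValueError("Unmapped category found after encoding")

# Pearson correlation matrix over all (now numeric) columns
corr = encoded.corr()

# Rank features by the strength of their correlation with the target
target_corr = (
    corr["LUNG_CANCER"]
    .drop("LUNG_CANCER")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Correlation on binary-encoded columns measures linear association only, so a ranking like this is a screening aid rather than evidence of statistical significance.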


## Summary

This exploratory analysis found:
1. The dataset is complete, with no missing values.
2. The feature importance analysis identified the features with the strongest predictive signal.
3. Clinical symptoms correlate with cancer diagnosis, as shown in the correlation matrix.
4. The dataset is ready for deep learning model development.

Next step: Proceed to model training and comparison in notebook 02.
