# Tumor Detection Project

**Assignment 7 — Complete project notebook**

This notebook follows the project instructions provided in the PDF. It covers data loading, cleaning, exploratory data analysis (EDA), preprocessing, model training (Random Forest), and evaluation.


## 1. Import libraries

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Display settings
pd.set_option('display.max_columns', 200)
sns.set(style='whitegrid')

## 2. Load dataset

Make sure `Tumor_Detection.csv` is in the same folder as this notebook. The next cell loads the CSV and shows the first five rows.

In [None]:
# Load dataset
df = pd.read_csv("Tumor_Detection.csv")
df.head()

## 3. Data overview

In [None]:
# Shape, info, describe
print('Shape:', df.shape)
print('\nInfo:')
df.info()  # prints its report directly and returns None, so display() is not needed
print('\nSummary statistics:')
display(df.describe())

## 4. Data cleaning

Drop irrelevant columns (like `id` and any unnamed columns) and check for missing values.

In [None]:
# Drop 'id' if present and unnamed columns
df = df.drop(columns=['id'], errors='ignore')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

# Show columns and first rows
print('Columns:', df.columns.tolist())
df.head()

In [None]:
# Check for missing values
missing = df.isnull().sum()
if missing.sum() > 0:
    display(missing[missing > 0])
else:
    print('No missing values detected')
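
If the check above does report missing values, one common option is to fill numeric gaps with each column's median. The sketch below is only illustrative: it uses a small made-up frame (`demo`, with hypothetical values) rather than `Tumor_Detection.csv`, and whether imputing or dropping rows is appropriate depends on the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
demo = pd.DataFrame({
    "radius_mean": [14.2, np.nan, 12.1, 20.5],
    "texture_mean": [19.0, 21.3, np.nan, 17.8],
    "diagnosis": ["M", "B", "B", "M"],
})

# Fill numeric gaps with each column's median; non-numeric columns are left untouched
num_cols = demo.select_dtypes(include="number").columns
demo[num_cols] = demo[num_cols].fillna(demo[num_cols].median())

print("Remaining missing values:", demo.isnull().sum().sum())
```

In a modelling workflow, the medians would ideally be computed on the training split only, for the same reason the scaler is fit on training data.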

## 5. Exploratory Data Analysis (EDA)

Visualize the distribution of the diagnosis labels and the correlations between features.

In [None]:
# Countplot of diagnosis
plt.figure(figsize=(6,4))
sns.countplot(x='diagnosis', data=df)
plt.title('Tumor Diagnosis Distribution (B = benign, M = malignant)')
plt.show()

In [None]:
# Correlation heatmap (may be large)
plt.figure(figsize=(18,14))
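# Encode diagnosis on a copy so it appears in the matrix (alphabetical codes: B -> 0, M -> 1)
corr_df = df.copy()
if not pd.api.types.is_numeric_dtype(corr_df['diagnosis']):
    corr_df['diagnosis'] = corr_df['diagnosis'].astype('category').cat.codes
corr = corr_df.corr(numeric_only=True)  # numeric_only requires pandas >= 1.5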
sns.heatmap(corr, cmap='coolwarm', annot=False)
plt.title('Correlation Heatmap')
plt.show()

In [None]:
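# Show top absolute correlations with the target (if it is in the correlation matrix)
if 'diagnosis' in corr.columns:
    cor_target = corr['diagnosis'].drop('diagnosis').abs().sort_values(ascending=False)
    display(cor_target.head(15))
else:
    print("'diagnosis' is not numeric, so it is missing from the correlation matrix")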

In [None]:
# Histogram of mean features (if present)
mean_cols = df.filter(regex='mean').columns.tolist()
if mean_cols:
    df[mean_cols].hist(figsize=(14,10))
    plt.suptitle('Distribution of Mean Features')
    plt.show()
else:
    print('No mean-feature columns detected by regex')

## 6. Preprocessing

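Encode labels, split data, and scale features. Random Forest itself is insensitive to feature scaling; the scaling step matters mainly if scale-sensitive classifiers (such as SVM) are tried later.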

In [None]:
# Encode diagnosis (M/B) to 1/0
if not pd.api.types.is_numeric_dtype(df['diagnosis']):
    le = LabelEncoder()
    df['diagnosis'] = le.fit_transform(df['diagnosis'])
    print('Label encoding applied:', dict(zip(le.classes_, le.transform(le.classes_))))
else:
    print('Diagnosis column already numeric')

In [None]:
# Prepare X and y
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)

In [None]:
# Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
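
The scaler is fit on the training split only so that test-set statistics do not leak into preprocessing. A quick sanity check, sketched here on synthetic data from `make_classification` (a stand-in, not the tumor dataset, with hypothetical variable names), is that the scaled training columns have mean 0 and standard deviation 1, while the test columns are only approximately standardized.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: shapes and values are illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

demo_scaler = StandardScaler()
X_tr_s = demo_scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_s = demo_scaler.transform(X_te)      # reuse the training statistics on test rows

print("Train means:", X_tr_s.mean(axis=0).round(3))
print("Train stds: ", X_tr_s.std(axis=0).round(3))
print("Test means: ", X_te_s.mean(axis=0).round(3))
```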

## 7. Model training — Random Forest Classifier

In [None]:
# Train Random Forest Classifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

## 8. Evaluation

In [None]:
# Predict and evaluate
y_pred = rf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Accuracy: {acc:.4f}\n')
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
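
For a tumor classifier, overall accuracy can hide the cost of missed malignant cases, so it may help to read sensitivity and specificity off the confusion matrix. The sketch below uses made-up label arrays (`y_true_demo`, `y_pred_demo`) and assumes the encoding 1 = malignant, 0 = benign; it is not computed from the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (assumption: 1 = malignant, 0 = benign)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# With labels=[0, 1], ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # share of malignant cases that were caught
specificity = tn / (tn + fp)  # share of benign cases correctly cleared

print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}")
```

The same unpacking could be applied to `cm` from the cell above, provided the label encoding matches the assumption.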

## 9. Feature importance (top features)

In [None]:
# Show feature importances (if feature names are available)
try:
    importances = rf.feature_importances_
    feat_names = df.drop('diagnosis', axis=1).columns
    feat_imp = pd.Series(importances, index=feat_names).sort_values(ascending=False)
    display(feat_imp.head(15))
    plt.figure(figsize=(8,6))
    feat_imp.head(15).plot(kind='bar')
    plt.title('Top 15 Feature Importances')
    plt.ylabel('Importance')
    plt.show()
except Exception as e:
    print('Could not compute feature importances:', str(e))

## 10. Conclusion

- Dataset cleaned and preprocessed.
- EDA performed (diagnosis distribution, correlations).
- Random Forest classifier trained and evaluated.

**Notes & next steps:**
- If you want to improve performance, try cross-validation, hyperparameter tuning (GridSearchCV/RandomizedSearchCV), or other classifiers (SVM, XGBoost).
- Save the trained model and scaler for deployment if needed.
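
The next steps listed above could be sketched as follows. This is an illustrative outline under stated assumptions, not part of the assignment: it uses synthetic data from `make_classification` in place of `Tumor_Detection.csv`, a deliberately tiny parameter grid, and a hypothetical output file name (`tumor_rf_pipeline.joblib`). Wrapping the scaler and classifier in a `Pipeline` keeps scaling inside each cross-validation fold and lets both be saved as a single object.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the tumor features and labels (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaler and model in one estimator, so scaling is re-fit within every CV fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])

# 5-fold cross-validated accuracy on the training split
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Small illustrative grid; the "rf__" prefix targets the pipeline step named "rf"
param_grid = {"rf__n_estimators": [50, 100], "rf__max_depth": [None, 5]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)
print("Best params:", search.best_params_)
print(f"Held-out accuracy: {search.score(X_te, y_te):.3f}")

# Save the tuned pipeline (scaler + model together), then reload it
model_path = os.path.join(tempfile.gettempdir(), "tumor_rf_pipeline.joblib")
joblib.dump(search.best_estimator_, model_path)
reloaded = joblib.load(model_path)
print("Reloaded pipeline steps:", [name for name, _ in reloaded.steps])
```

With the real data, `X_demo` and `y_demo` would be replaced by the unscaled `X` and `y` from section 6, since the pipeline handles scaling itself.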