# 📘 RFE Classification Notebook

This notebook demonstrates **Recursive Feature Elimination (RFE)** for feature selection  
and compares the performance of different **classification** models:

- **Logistic Regression**  
- **Support Vector Machine (SVM – Linear & RBF kernels)**  
- **K-Nearest Neighbors (KNN)**  
- **Naive Bayes**  
- **Decision Tree Classifier**  
- **Random Forest Classifier**

---

### 🔄 Workflow
1. **Load & preprocess** the dataset (train/test split, scaling).  
2. Apply **RFE** with multiple base estimators (Logistic, SVM, Random Forest, Decision Tree) to select the top `n` features.  
3. **Train** all classifiers on each RFE-reduced feature subset.  
4. **Evaluate** using **Accuracy**, **Confusion Matrix**, and **Classification Report**.  
5. **Summarize** results in a comparison **DataFrame** for quick benchmarking.

---

### 💡 Key Insight
- **RFE** ranks features using a fitted estimator's coefficients or importances and keeps only the top `n`, which removes less informative columns before the classifiers are trained.  
- Ensemble methods like **Random Forest** often yield higher accuracy than simpler linear models, though results depend on the dataset.


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split 
import time
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import pickle
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier   
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

### RFE Feature Selection

Applies Recursive Feature Elimination (RFE) with Logistic Regression, SVM, Random Forest, and Decision Tree models to select the top n features from the dataset and returns the reduced feature sets.

In [24]:
def rfeFeature(indep_X, dep_Y, n):
    """Run RFE once per base estimator and return the list of reduced feature arrays."""
    rfelist = []

    log_model = LogisticRegression(solver='lbfgs')
    RF = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
    DT = DecisionTreeClassifier(criterion='gini', max_features='sqrt', splitter='best', random_state=0)
    svc_model = SVC(kernel='linear', random_state=0)
    # GaussianNB and KNeighborsClassifier are left out: RFE needs an estimator that
    # exposes coef_ or feature_importances_, which those two do not.

    # The order here must match the row order used in rfe_classification.
    rfemodellist = [log_model, svc_model, RF, DT]
    for estimator in rfemodellist:
        print(estimator)
        rfe = RFE(estimator=estimator, n_features_to_select=n)
        rfe_fit = rfe.fit(indep_X, dep_Y)
        # Keep only the n selected columns.
        rfelist.append(rfe_fit.transform(indep_X))
    return rfelist
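
`rfeFeature` returns only the reduced arrays, so the names of the selected columns are lost. The sketch below shows one way to recover them from a fitted `RFE` object through its `support_` and `ranking_` attributes. The data is synthetic (`make_classification`) and the names `X_demo`, `y_demo`, and `selector` are placeholders, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for indep_X / dep_Y (hypothetical data, not prep.csv).
X_arr, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=3)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the input columns; ranking_ is 1 for kept features.
selected_names = X_demo.columns[selector.support_].tolist()
ranking = pd.Series(selector.ranking_, index=X_demo.columns).sort_values()

print(selected_names)
print(ranking)
```

Features with rank 1 are the ones kept; higher ranks were eliminated earlier.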

### Data Split & Scaling  
Splits the dataset into training (75%) and testing (25%), then applies StandardScaler to normalize features.  
Returns scaled X_train, X_test along with y_train and y_test.  

In [2]:
def split_scalar(indep_X, dep_Y):
    # Hold out 25% of the rows for testing; random_state fixes the split.
    X_train, X_test, y_train, y_test = train_test_split(indep_X, dep_Y, test_size=0.25, random_state=0)

    # Feature scaling: fit the scaler on the training rows only, then apply it to the test rows.
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)

    return X_train, X_test, y_train, y_test
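
To illustrate what the split-then-scale step does, here is a small sketch on random data (the arrays `X_demo` and `y_demo` are made up for the example). Because the scaler is fitted on the training rows only, the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 rows, 3 features centred around 5.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)   # learns mean/std from the training rows
X_te_s = sc.transform(X_te)       # reuses the training mean/std

print(X_tr.shape, X_te.shape)
print(X_tr_s.mean(axis=0).round(6))  # approximately 0 in every column
print(X_te_s.mean(axis=0).round(3))  # near 0, but not exactly
```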

### Confusion Matrix Prediction

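A helper function that evaluates a fitted classifier by predicting on X_test and comparing the predictions with y_test, which it reads from the notebook's global scope rather than taking as an argument. It returns the classifier, accuracy, classification report, X_test, y_test, and confusion matrix.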

In [4]:
def cm_prediction(classifier, X_test):
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    y_pred = classifier.predict(X_test)

    # NOTE: y_test is not a parameter. It is read from the notebook's global scope,
    # so it must be the y_test that belongs to this X_test (as set by split_scalar).
    cm = confusion_matrix(y_test, y_pred)
    Accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return classifier, Accuracy, report, X_test, y_test, cm
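
Since `cm_prediction` depends on a global `y_test`, it only works when the most recent split is still in scope. A possible alternative, sketched below under the assumption that callers can pass the labels explicitly, is a function that takes `y_test` as a parameter. The name `evaluate` and the synthetic data are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def evaluate(classifier, X_test, y_test):
    """Hypothetical variant of cm_prediction that takes y_test explicitly."""
    y_pred = classifier.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "report": classification_report(y_test, y_pred),
        "cm": confusion_matrix(y_test, y_pred),
    }


# Synthetic binary problem used only to exercise the function.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
metrics = evaluate(clf, X_te, y_te)
print(metrics["accuracy"])
print(metrics["cm"])
```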

### Logistic Regression

A helper function that trains a Logistic Regression model on X_train and y_train, then evaluates it on X_test using cm_prediction to return accuracy, classification report, and confusion matrix.

In [5]:
def logistic(X_train,y_train,X_test):       
        # Fitting Logistic Regression to the Training set
        from sklearn.linear_model import LogisticRegression
        classifier = LogisticRegression(random_state = 0)
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm    

### SVM (Linear Kernel)

A helper function that trains an SVM classifier with a linear kernel on X_train and y_train, then evaluates it on X_test using cm_prediction to return accuracy, classification report, and confusion matrix.

In [6]:
def svm_linear(X_train,y_train,X_test):
                
        from sklearn.svm import SVC
        classifier = SVC(kernel = 'linear', random_state = 0)
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm
    

### SVM (Non-Linear RBF Kernel)

A helper function that trains an SVM classifier with an RBF kernel on X_train and y_train, then evaluates it on X_test using cm_prediction to return accuracy, classification report, and confusion matrix.

In [7]:
def svm_NL(X_train,y_train,X_test):
                
        from sklearn.svm import SVC
        classifier = SVC(kernel = 'rbf', random_state = 0)
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm

### Naive Bayes

A helper function that trains a Gaussian Naive Bayes classifier on X_train and y_train, then evaluates it on X_test using cm_prediction to return accuracy, classification report, and confusion matrix. 

In [8]:
   
def Navie(X_train,y_train,X_test):       
        # Fitting Gaussian Naive Bayes to the Training set
        from sklearn.naive_bayes import GaussianNB
        classifier = GaussianNB()
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm         
    

### K-Nearest Neighbors (KNN)

A helper function that trains a KNN classifier with k=5 using the Minkowski distance with p=2 (that is, Euclidean distance) on X_train and y_train, then evaluates it on X_test using cm_prediction to return accuracy, classification report, and confusion matrix. 

In [10]:
    
def knn(X_train,y_train,X_test):
           
        # Fitting K-NN to the Training set
        from sklearn.neighbors import KNeighborsClassifier
        classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm

### Decision Tree

A helper function that trains a Decision Tree classifier using the entropy criterion on X_train and y_train, then evaluates it on X_test using cm_prediction to return accuracy, classification report, and confusion matrix.

In [11]:
def Decision(X_train,y_train,X_test):
        
        # Fitting Decision Tree to the Training set
        from sklearn.tree import DecisionTreeClassifier
        classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm      

### Random Forest

A helper function that trains a Random Forest classifier with 10 estimators using the entropy criterion on X_train and y_train, then evaluates it on X_test using cm_prediction to return accuracy, classification report, and confusion matrix.

In [14]:
def random(X_train,y_train,X_test):
        
        # Fitting Random Forest to the Training set
        from sklearn.ensemble import RandomForestClassifier
        classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
        classifier.fit(X_train, y_train)
        classifier,Accuracy,report,X_test,y_test,cm=cm_prediction(classifier,X_test)
        return  classifier,Accuracy,report,X_test,y_test,cm

### Recursive Feature Elimination (RFE) Classification Results

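A helper function that aggregates accuracy scores into a DataFrame for comparison. Each row is the base estimator used inside RFE (Logistic, SVC, Random Forest, Decision Tree), and each column is the classifier trained on the selected features (Logistic, SVM Linear, SVM Non-Linear, KNN, Naive Bayes, Decision Tree, and Random Forest).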

In [15]:
def rfe_classification(acclog,accsvml,accsvmnl,accknn,accnav,accdes,accrf): 
    
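    # Rows: estimator used inside RFE. Columns: classifier trained on the selected features.
    rfedataframe=pd.DataFrame(index=['Logistic','SVC','Random','DecisionTree'],columns=['Logistic','SVMl','SVMnl',
                                                                                        'KNN','Navie','Decision','Random'])

    for number,idex in enumerate(rfedataframe.index):
        # Assign with .loc in a single step; chained indexing (df[col][row] = value)
        # raises a pandas warning and does not update the frame under Copy-on-Write.
        rfedataframe.loc[idex,'Logistic']=acclog[number]
        rfedataframe.loc[idex,'SVMl']=accsvml[number]
        rfedataframe.loc[idex,'SVMnl']=accsvmnl[number]
        rfedataframe.loc[idex,'KNN']=accknn[number]
        rfedataframe.loc[idex,'Navie']=accnav[number]
        rfedataframe.loc[idex,'Decision']=accdes[number]
        rfedataframe.loc[idex,'Random']=accrf[number]
    return rfedataframe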

### Reading and Preprocessing the Dataset

- Load data from `prep.csv`  
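- Bind the loaded DataFrame to a second name, `df2` (this is a reference, not a copy)  
- Convert categorical columns into dummy variables (0/1) using `pd.get_dummies()`  
- Use `drop_first=True` to drop the first level of each categorical column, so that no dummy column is redundant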

In [18]:
dataset1=pd.read_csv("prep.csv",index_col=None)
df2=dataset1
df2 = pd.get_dummies(df2, drop_first=True)
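
As a small illustration of what `pd.get_dummies(..., drop_first=True)` does, the sketch below uses a toy frame. It assumes a yes/no column named `classification`, which would explain the `classification_yes` column used later, but the actual contents of `prep.csv` are not shown here, so treat the column names as hypothetical.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (hypothetical values).
toy = pd.DataFrame({"age": [48, 53, 63], "classification": ["yes", "no", "yes"]})

# The categorical column has levels "no" and "yes"; drop_first removes the
# first level ("no"), leaving a single indicator column for "yes".
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())  # ['age', 'classification_yes']
print(encoded)
```

Numeric columns pass through unchanged; only object or categorical columns are expanded.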

### Splitting Features and Target

- `indep_X` → all independent features (X values)  
- `dep_Y` → the dependent/target variable (y value)  
- We drop the column `classification_yes` from features,  
  and keep it separately as the target column.

In [19]:
indep_X = df2.drop('classification_yes', axis=1)  
dep_Y   = df2['classification_yes']            

### RFE Feature Selection & Accuracy Lists

The rfeFeature function selects the top 3 features from indep_X with respect to dep_Y, while the empty lists (acclog, accsvml, accsvmnl, accknn, accnav, accdes, accrf) are initialized to store accuracy scores for different classifiers.

In [25]:
rfelist = rfeFeature(indep_X, dep_Y, 3)

acclog=[]
accsvml=[]
accsvmnl=[]
accknn=[]
accnav=[]
accdes=[]
accrf=[]

LogisticRegression()


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

SVC(kernel='linear', random_state=0)
RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0)
DecisionTreeClassifier(max_features='sqrt', random_state=0)
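
The convergence warning above points to two remedies: scaling the inputs and raising `max_iter`. A minimal sketch of both, on synthetic data with one deliberately large-scale feature, might look like the following. The value `max_iter=1000` is an arbitrary choice for illustration, and the scaler here is fitted on all rows only to mirror how `rfeFeature` is called on the full dataset; a stricter setup would fit it on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (not prep.csv); one feature is blown up to mimic unscaled input.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
X_demo[:, 0] *= 1000.0

# Standardize first so that lbfgs converges more easily, and allow more iterations.
X_scaled = StandardScaler().fit_transform(X_demo)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
X_reduced = rfe.fit_transform(X_scaled, y_demo)

print(X_reduced.shape)  # (300, 3)
```

Note that changing the scaling or iteration limit can change which features RFE selects, so accuracies obtained this way are not directly comparable with the table in this notebook.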


### Model Evaluation over RFE Subsets

Iterates over each RFE-reduced feature set, splits/scales the data, trains seven classifiers (Logistic, SVM-linear/RBF, KNN, Naive Bayes, Decision Tree, Random Forest), and appends their accuracies to the respective lists. 

In [26]:
for i in rfelist:   
    X_train, X_test, y_train, y_test=split_scalar(i,dep_Y)   
    
        
    classifier,Accuracy,report,X_test,y_test,cm=logistic(X_train,y_train,X_test)
    acclog.append(Accuracy)
    
    classifier,Accuracy,report,X_test,y_test,cm=svm_linear(X_train,y_train,X_test)  
    accsvml.append(Accuracy)
    
    classifier,Accuracy,report,X_test,y_test,cm=svm_NL(X_train,y_train,X_test)  
    accsvmnl.append(Accuracy)
    
    classifier,Accuracy,report,X_test,y_test,cm=knn(X_train,y_train,X_test)  
    accknn.append(Accuracy)
    
    classifier,Accuracy,report,X_test,y_test,cm=Navie(X_train,y_train,X_test)  
    accnav.append(Accuracy)
    
    classifier,Accuracy,report,X_test,y_test,cm=Decision(X_train,y_train,X_test)  
    accdes.append(Accuracy)
    
    classifier,Accuracy,report,X_test,y_test,cm=random(X_train,y_train,X_test)  
    accrf.append(Accuracy)
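
The loop above repeats the same three lines for every classifier. One way to express the same pattern more compactly is a dictionary of models, sketched below. The synthetic data, the two hand-picked feature subsets, and the names `models` and `accuracies` are assumptions for illustration; only three of the seven classifiers are included to keep the example short.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the RFE-reduced feature sets (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
feature_subsets = [X_demo[:, :3], X_demo[:, 3:]]

models = {
    "Logistic": LogisticRegression(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
    "Navie": GaussianNB(),
}
accuracies = {name: [] for name in models}

for subset in feature_subsets:
    X_tr, X_te, y_tr, y_te = train_test_split(subset, y_demo, test_size=0.25, random_state=0)
    scaler = StandardScaler()
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)
    for name, model in models.items():
        # clone() gives a fresh, unfitted copy for every feature subset.
        fitted = clone(model).fit(X_tr, y_tr)
        accuracies[name].append(accuracy_score(y_te, fitted.predict(X_te)))

print(accuracies)
```

Each list in `accuracies` ends up with one entry per feature subset, which is the same layout as `acclog`, `accknn`, and the other lists.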
    

### RFE Classification Result

Calls rfe_classification with the collected accuracy lists from different classifiers and returns a DataFrame summarizing their performance for comparison.

In [27]:
result=rfe_classification(acclog,accsvml,accsvmnl,accknn,accnav,accdes,accrf)


In [28]:
result

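| RFE base estimator | Logistic | SVMl | SVMnl | KNN  | Navie | Decision | Random |
|--------------------|----------|------|-------|------|-------|----------|--------|
| Logistic           | 0.94     | 0.94 | 0.94  | 0.94 | 0.94  | 0.94     | 0.94   |
| SVC                | 0.87     | 0.87 | 0.87  | 0.87 | 0.87  | 0.87     | 0.87   |
| Random             | 0.91     | 0.92 | 0.93  | 0.93 | 0.86  | 0.91     | 0.94   |
| DecisionTree       | 0.93     | 0.93 | 0.94  | 0.95 | 0.74  | 0.95     | 0.97   |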
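
The best cell and the per-row ranges can also be read off programmatically. The sketch below re-types the accuracies shown above into a stand-alone frame named `result_demo` (a placeholder; in the notebook the same calls could be applied to `result` after converting it to a numeric dtype).

```python
import pandas as pd

# Accuracies copied from the result table above.
result_demo = pd.DataFrame(
    [
        [0.94, 0.94, 0.94, 0.94, 0.94, 0.94, 0.94],
        [0.87, 0.87, 0.87, 0.87, 0.87, 0.87, 0.87],
        [0.91, 0.92, 0.93, 0.93, 0.86, 0.91, 0.94],
        [0.93, 0.93, 0.94, 0.95, 0.74, 0.95, 0.97],
    ],
    index=["Logistic", "SVC", "Random", "DecisionTree"],
    columns=["Logistic", "SVMl", "SVMnl", "KNN", "Navie", "Decision", "Random"],
)

# Column holding the overall maximum, then the row where that column peaks.
best_col = result_demo.max().idxmax()
best_row = result_demo[best_col].idxmax()
print(best_row, best_col, result_demo.loc[best_row, best_col])  # DecisionTree Random 0.97

# Accuracy range per RFE base estimator.
ranges = pd.DataFrame({"min": result_demo.min(axis=1), "max": result_demo.max(axis=1)})
print(ranges)
```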


### 🔎 Feature Selection & Model Performance  

**Feature Selection**  
We used **Recursive Feature Elimination (RFE)** to pick the **top 3 features**.  
RFE works as follows:  
- Train a model on the current set of features,  
- Rank the features by the model's coefficients or feature importances,  
- Remove the least important feature,  
- Repeat until only the requested number of features is left.  

The classifiers are then trained on these 3 features only, rather than on every column in the dataset.  
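
The elimination loop described above can be sketched by hand. The version below is a simplified illustration, not scikit-learn's actual implementation: it uses Logistic Regression, scores each feature by its squared coefficient, and drops one feature per round. The data is synthetic and all variable names are placeholders. scikit-learn's `RFE` is run alongside for comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical; stands in for indep_X / dep_Y).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

remaining = list(range(X_demo.shape[1]))  # column indices still in play
while len(remaining) > 3:
    model = LogisticRegression(max_iter=1000).fit(X_demo[:, remaining], y_demo)
    importance = (model.coef_ ** 2).sum(axis=0)  # one score per remaining feature
    weakest = int(np.argmin(importance))         # position of the least important one
    remaining.pop(weakest)                       # eliminate it and refit next round

print("manual loop keeps:", remaining)

# Library version for comparison; it should normally pick the same columns.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_demo, y_demo)
print("sklearn RFE keeps:", np.flatnonzero(rfe.support_).tolist())
```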

---

**Model Performance (Accuracy Scores)**  

| RFE Base Model       | Accuracy Range |
|----------------------|----------------|
| Logistic Regression  | ~0.94 (same for all classifiers) |
| SVC (Linear)         | ~0.87 (same for all classifiers) |
| Random Forest        | ~0.86 – 0.94 |
| Decision Tree        | ~0.74 – **0.97** |

---

✅ **Observations**  
- Logistic Regression features → every classifier scored **0.94**.  
- SVC (linear) features → weakest feature set, with every classifier at **0.87**.  
- Random Forest features → strong, with scores up to **0.94**.  
- Decision Tree features → 🏆 best, reaching **0.97** with Random Forest.  
- Naive Bayes was the weakest classifier on both tree-based feature sets (**0.86** on Random Forest features, **0.74** on Decision Tree features).  

---

🎯 **Conclusion**  
- RFE helped us reduce the dataset to just 3 good features.  
- The **best combo** was Decision Tree (for feature selection) + Random Forest (for classification), at **0.97** accuracy.  
- Logistic Regression and linear SVM reached at most **0.94**, compared with **0.97** for Random Forest, on this single train/test split.  
