# 📘 RFE Regression Notebook

This notebook demonstrates **Recursive Feature Elimination (RFE)** for feature selection  
and compares the performance of different regression models:

- **Linear Regression**
- **Support Vector Regression (SVR – Linear & Non-Linear kernels)**
- **Decision Tree Regressor**
- **Random Forest Regressor**

### Workflow:
1. Load and preprocess the dataset (scaling, splitting).
2. Apply **RFE** to select the top `n` most important features.
3. Train multiple regression models on the reduced feature sets.
4. Evaluate model performance using **R² scores**.
5. Summarize results in a comparison table.
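
The workflow above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's actual run: the `make_regression` dataset and the names `selector`, `models`, and `scores` are assumptions introduced for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 8 features, only 3 of them informative
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Select the top 3 features with RFE, here driven by a linear estimator
selector = RFE(estimator=LinearRegression(), n_features_to_select=3)
X_selected = selector.fit_transform(X, y)

# Split the reduced matrix 75/25 and scale it, fitting the scaler on train only
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train each model on the reduced features and score it with R² on the test split
models = {"Linear": LinearRegression(),
          "Decision": DecisionTreeRegressor(random_state=0)}
scores = {name: r2_score(y_test, model.fit(X_train, y_train).predict(X_test))
          for name, model in models.items()}
print(scores)
```

The notebook repeats the selection step once per RFE estimator, which is why its result tables have one row per estimator.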

### Key Insight:
- **RFE** removes the weakest features step by step, so each model trains on a smaller, more relevant set of predictors.  
- In the results below, **Random Forest** and **Decision Tree** generally reach higher R² than the linear models.  


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE

### Data Split & Scaling  
Splits the dataset into training (75%) and testing (25%), then applies StandardScaler to normalize features.  
Returns scaled X_train, X_test along with y_train and y_test.  

In [2]:
def split_scalar(indep_X, dep_Y):
    # Hold out 25% of the rows for testing; the fixed seed keeps the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        indep_X, dep_Y, test_size=0.25, random_state=0)

    # Fit the scaler on the training split only, then apply it to both splits
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    return X_train, X_test, y_train, y_test
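
Because the scaler is fitted on the training split only, the test split is standardized with the training mean and standard deviation. A minimal sketch on toy numbers (the arrays and the `_scaled` names are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): one feature, three training rows, one test row
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean=2 and the population std
X_test_scaled = sc.transform(X_test)        # reuses the training statistics

print(X_train_scaled.mean(axis=0))  # approximately 0 by construction
print(X_test_scaled)                # (4 - 2) / std of the training column
```

The scaled test value is not centred on zero, which is expected: the test rows play no part in estimating the scaling parameters.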

### R² Prediction  
Uses the trained regressor to predict on X_test, compares with y_test,  
and returns the R² score as a measure of model accuracy.  

In [3]:
def r2_prediction(regressor, X_test, y_test):
    from sklearn.metrics import r2_score

    # Predict on the held-out split and score the predictions against the true targets
    y_pred = regressor.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    return r2
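
For reference, R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up values and checks the result against `r2_score`; the arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets and predictions (assumption, not taken from the dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares = 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares = 20.0
r2_manual = 1.0 - ss_res / ss_tot

print(round(r2_manual, 3))  # 0.925
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 means perfect predictions, 0 matches a model that always predicts the mean of `y_true`, and the value can go negative for a model that does worse than that.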

### Linear Regression Model  
Fits a Linear Regression model on training data,  
predicts on test data, and returns the R² score.  

In [4]:
def Linear(X_train, y_train, X_test):
    # Fit ordinary least-squares linear regression on the training set
    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    # y_test is not a parameter here: it is read from the notebook's global scope
    r2 = r2_prediction(regressor, X_test, y_test)
    return r2

### SVM (Linear Kernel)  
Fits an SVR model with a linear kernel on training data,  
predicts on test data, and returns the R² score.  

In [5]:
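def svm_linear(X_train, y_train, X_test):
    # Fit support vector regression with a linear kernel on the training set
    from sklearn.svm import SVR
    regressor = SVR(kernel='linear')
    regressor.fit(X_train, y_train)
    # y_test is read from the notebook's global scope
    r2 = r2_prediction(regressor, X_test, y_test)
    return r2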
    

### SVM (Non-Linear Kernel)  
Fits an SVR model with an RBF kernel on training data,  
predicts on test data, and returns the R² score.  

In [6]:
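def svm_NL(X_train, y_train, X_test):
    # Fit support vector regression with an RBF (non-linear) kernel on the training set
    from sklearn.svm import SVR
    regressor = SVR(kernel='rbf')
    regressor.fit(X_train, y_train)
    # y_test is read from the notebook's global scope
    r2 = r2_prediction(regressor, X_test, y_test)
    return r2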

### Decision Tree Regressor  
Fits a Decision Tree Regressor on training data,  
predicts on test data, and returns the R² score.  

In [7]:
def Decision(X_train, y_train, X_test):
    # Fit a decision tree regressor on the training set
    from sklearn.tree import DecisionTreeRegressor
    regressor = DecisionTreeRegressor(random_state=0)
    regressor.fit(X_train, y_train)
    # y_test is read from the notebook's global scope
    r2 = r2_prediction(regressor, X_test, y_test)
    return r2
     

### Random Forest Regressor  
Fits a Random Forest Regressor with 10 trees on training data,  
predicts on test data, and returns the R² score.  

In [8]:
def random(X_train, y_train, X_test):
    # Fit a random forest regressor with 10 trees on the training set
    from sklearn.ensemble import RandomForestRegressor
    regressor = RandomForestRegressor(n_estimators=10, random_state=0)
    regressor.fit(X_train, y_train)
    # y_test is read from the notebook's global scope
    r2 = r2_prediction(regressor, X_test, y_test)
    return r2

### RFE Feature Selection  
Runs Recursive Feature Elimination (RFE) with multiple regressors  
(Linear, SVM, Decision Tree, Random Forest) to select the top `n` features.  
Returns a list of transformed feature sets (`rfelist`) for further model training.  

In [29]:
def rfeFeature(indep_X, dep_Y, n):
    from sklearn.linear_model import LinearRegression
    from sklearn.svm import SVR
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor

    # RFE needs an estimator that exposes coef_ or feature_importances_,
    # so the RBF-kernel SVR is not used for selection.
    rfemodellist = [
        LinearRegression(),
        SVR(kernel='linear'),
        DecisionTreeRegressor(random_state=0),
        RandomForestRegressor(n_estimators=10, random_state=0),
    ]

    rfelist = []
    for estimator in rfemodellist:
        print(estimator)
        # Recursively drop the weakest feature until n remain
        rfe = RFE(estimator=estimator, n_features_to_select=n)
        rfe_fit = rfe.fit(indep_X, dep_Y)
        # Keep only the selected columns
        rfelist.append(rfe_fit.transform(indep_X))
    return rfelist

### RFE Regression Results  
Creates a DataFrame to compare R² scores of all models  
(Linear, SVM, Decision Tree, Random Forest) across RFE-selected feature sets.  
Each row is one RFE-selected feature set and each column is one trained model.  
Returns the result table for analysis.  

In [10]:
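def rfe_regression(acclog, accsvml, accdes, accrf):
    # Rows follow the order of the RFE estimators in rfeFeature
    # (Linear, linear SVR, Decision Tree, Random Forest), so the last two
    # index labels below appear swapped relative to that order.
    rfedataframe = pd.DataFrame(index=['Linear', 'SVC', 'Random', 'DecisionTree'],
                                columns=['Linear', 'SVMl', 'Decision', 'Random'])

    for number, idex in enumerate(rfedataframe.index):
        # .loc assigns in a single step and avoids pandas' chained-assignment warning
        rfedataframe.loc[idex, 'Linear'] = acclog[number]
        rfedataframe.loc[idex, 'SVMl'] = accsvml[number]
        rfedataframe.loc[idex, 'Decision'] = accdes[number]
        rfedataframe.loc[idex, 'Random'] = accrf[number]
    return rfedataframe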

### Reading and Preprocessing the Dataset

- Load data from `prep.csv`  
- Keep the raw frame as `dataset1` and work on a second variable, `df2`  
- Convert categorical columns into dummy variables (0/1) using `pd.get_dummies()`  
- Use `drop_first=True` to drop the first level of each categorical column, which avoids redundant dummy columns

In [21]:
dataset1=pd.read_csv("prep.csv",index_col=None)
df2=dataset1
df2 = pd.get_dummies(df2, drop_first=True)
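
As a small illustration of what `drop_first=True` does, consider a toy frame with one numeric and one two-level categorical column. The column names and values below are assumptions for the example, not the contents of `prep.csv`.

```python
import pandas as pd

# Toy frame (assumption): one numeric column, one yes/no categorical column
toy = pd.DataFrame({"age": [40, 55, 61],
                    "classification": ["yes", "no", "yes"]})

# Levels are sorted ("no", "yes"); drop_first removes the "no" dummy,
# leaving a single indicator column for "yes"
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))  # ['age', 'classification_yes']
```

Numeric columns pass through unchanged, and each two-level categorical column collapses to a single indicator named `<column>_<level>`.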

### Splitting Features and Target

- `indep_X` → all independent features (X values)  
- `dep_Y` → the dependent/target variable (y value)  
- We drop the column `classification_yes` from features,  
  and keep it separately as the target column.

In [25]:
indep_X = df2.drop('classification_yes', axis=1)  
dep_Y   = df2['classification_yes']            

### Running RFE (Recursive Feature Elimination) and storing R² scores

- `rfelist = rfeFeature(indep_X, dep_Y, 3)`  
  → Run RFE to select the top 3 features from `indep_X` against target `dep_Y`.

- Create empty lists to store each model's R² scores:  
  - `acclin`   → R² of Linear Regression  
  - `accsvml`  → R² of SVM (linear kernel)  
  - `accsvmnl` → R² of SVM (non-linear kernel); collected but not included in the result table  
  - `accdes`   → R² of Decision Tree  
  - `accrf`    → R² of Random Forest

In [31]:
rfelist=rfeFeature(indep_X,dep_Y,3)       

acclin=[]
accsvml=[]
accsvmnl=[]
accdes=[]
accrf=[]

LinearRegression()
SVR(kernel='linear')
DecisionTreeRegressor(random_state=0)
RandomForestRegressor(n_estimators=10, random_state=0)


### Model Training & Evaluation  
Trained Linear, SVM, Decision Tree, and Random Forest models on RFE-selected features.  
Collected R² scores for each model and summarized results in `result`.  

In [32]:
for i in rfelist:   
    X_train, X_test, y_train, y_test=split_scalar(i,dep_Y)  
    r2_lin=Linear(X_train,y_train,X_test)
    acclin.append(r2_lin)
    
    r2_sl=svm_linear(X_train,y_train,X_test)    
    accsvml.append(r2_sl)
    
    r2_NL=svm_NL(X_train,y_train,X_test)
    accsvmnl.append(r2_NL)
    
    r2_d=Decision(X_train,y_train,X_test)
    accdes.append(r2_d)
    
    r2_r=random(X_train,y_train,X_test)
    accrf.append(r2_r)
    
    
result=rfe_regression(acclin,accsvml,accdes,accrf)


### Print the final comparison of model R² scores after RFE-based feature selection

In [33]:
result

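|              | Linear   | SVMl     | Decision | Random   |
|--------------|----------|----------|----------|----------|
| Linear       | 0.441961 | 0.262153 | 0.441961 | 0.441816 |
| SVC          | 0.441961 | 0.262153 | 0.441961 | 0.441816 |
| Random       | 0.664893 | 0.609652 | 0.965961 | 0.916304 |
| DecisionTree | 0.676174 | 0.670691 | 0.933504 | 0.887256 |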


## 🔎 Feature Selection & Model Performance

### Feature Selection
- We applied **Recursive Feature Elimination (RFE)** to select the **top 3 features**.
- RFE works by:
  1. Fitting a model,
  2. Ranking features by importance,
  3. Eliminating the least important features step by step,
  4. Keeping only the best `n_features_to_select`.
- This reduces noise and helps models focus on the most relevant predictors.
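
The elimination steps listed above can be inspected through the fitted selector's `support_` and `ranking_` attributes. A minimal sketch on synthetic data (the dataset and the name `rfe` are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 6 features, 2 of them informative
X, y = make_regression(n_samples=150, n_features=6, n_informative=2,
                       random_state=0)

# step=1 removes one feature per iteration until 2 remain
rfe = RFE(estimator=DecisionTreeRegressor(random_state=0),
          n_features_to_select=2, step=1)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask: True for the features that were kept
print(rfe.ranking_)  # 1 for kept features; larger numbers were eliminated earlier
```

With a DataFrame input, the mask can be applied to the column index (for example `indep_X.columns[rfe.support_]`) to recover the names of the selected features.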

---

### Model Performance (R² Scores)

Each range spans the four RFE-selected feature sets (the rows of `result`) for one trained model (a column of `result`).

| Model             | R² Range     |
|-------------------|--------------|
| Linear Regression | ~0.44 – 0.68 |
| SVM (linear)      | ~0.26 – 0.67 |
| Decision Tree     | **~0.44 – 0.97** |
| Random Forest     | ~0.44 – 0.92 |

---

### ✅ Observations
- **Linear Regression / SVM** → weakest overall; R² never exceeds ~0.68 on any feature set.  
- **Decision Tree** → 🏆 highest single score (~0.97), good at handling non-linear relationships, but only ~0.44 on the feature sets in rows `Linear` and `SVC`.  
- **Random Forest** → close behind (up to ~0.92), with the same drop to ~0.44 on the feature sets in rows `Linear` and `SVC`.

---

### 🎯 Conclusion
- **RFE** reduced the dataset to **3 features** for each selecting estimator.  
- The tree-based models scored highest: **Decision Tree** (~0.97), followed closely by **Random Forest** (~0.92).  
- Simple models like **Linear Regression** and **SVM** were less effective.

### RFE with Top 4 Features  
Selects the 4 most important features using Recursive Feature Elimination (RFE).  

In [39]:
# Run RFE for top 4 features
rfelist_4 = rfeFeature(indep_X, dep_Y, 4)

LinearRegression()
SVR(kernel='linear')
DecisionTreeRegressor(random_state=0)
RandomForestRegressor(n_estimators=10, random_state=0)


### RFE with Top 5 Features  
Selects the 5 most important features using Recursive Feature Elimination (RFE).  

In [40]:
# Run RFE for top 5 features
rfelist_5 = rfeFeature(indep_X, dep_Y, 5)

LinearRegression()
SVR(kernel='linear')
DecisionTreeRegressor(random_state=0)
RandomForestRegressor(n_estimators=10, random_state=0)


In [41]:
# Initialize lists for 4 features
acclin_4, accsvml_4, accsvmnl_4, accdes_4, accrf_4 = [], [], [], [], []

for i in rfelist_4:
    X_train, X_test, y_train, y_test = split_scalar(i, dep_Y)

    acclin_4.append(Linear(X_train, y_train, X_test))
    accsvml_4.append(svm_linear(X_train, y_train, X_test))
    accsvmnl_4.append(svm_NL(X_train, y_train, X_test))
    accdes_4.append(Decision(X_train, y_train, X_test))
    accrf_4.append(random(X_train, y_train, X_test))

result_4 = rfe_regression(acclin_4, accsvml_4, accdes_4, accrf_4)


In [46]:
# Initialize lists for 5 features
acclin_5, accsvml_5, accsvmnl_5, accdes_5, accrf_5 = [], [], [], [], []

for i in rfelist_5:
    X_train, X_test, y_train, y_test = split_scalar(i, dep_Y)

    acclin_5.append(Linear(X_train, y_train, X_test))
    accsvml_5.append(svm_linear(X_train, y_train, X_test))
    accsvmnl_5.append(svm_NL(X_train, y_train, X_test))
    accdes_5.append(Decision(X_train, y_train, X_test))
    accrf_5.append(random(X_train, y_train, X_test))

result_5 = rfe_regression(acclin_5, accsvml_5, accdes_5, accrf_5)


### Print RFE Results  
Displays the model performance results separately for top 4 features and top 5 features.  

In [43]:
print("Results with 4 features:")
print(result_4)

Results with 4 features:
                Linear      SVMl  Decision    Random
Linear         0.60401  0.457046  0.776711  0.776492
SVC            0.60401  0.457046  0.776711  0.776492
Random        0.671727  0.628963  0.835247    0.8403
DecisionTree  0.681563  0.614992   0.96711  0.923559


In [50]:
print("\nResults with 5 features:")
print(result_5)


Results with 5 features:
                Linear      SVMl  Decision    Random
Linear        0.620124  0.457136   0.77924  0.780135
SVC           0.604508  0.456871  0.776474  0.776745
Random        0.674403  0.628206  0.696181  0.815538
DecisionTree  0.686361  0.643365  0.836806  0.845303


### Comparison of RFE Results  
Combines model performance results for top 3, 4, and 5 features into a single side-by-side table.  

In [49]:
# Combine results for top 3, 4, and 5 features
comparison = pd.concat(
    {"Top 3 Features": result, "Top 4 Features": result_4, "Top 5 Features": result_5},
    axis=1
)

comparison

|              | Top 3: Linear | Top 3: SVMl | Top 3: Decision | Top 3: Random | Top 4: Linear | Top 4: SVMl | Top 4: Decision | Top 4: Random | Top 5: Linear | Top 5: SVMl | Top 5: Decision | Top 5: Random |
|--------------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| Linear       | 0.441961 | 0.262153 | 0.441961 | 0.441816 | 0.60401  | 0.457046 | 0.776711 | 0.776492 | 0.620124 | 0.457136 | 0.77924  | 0.780135 |
| SVC          | 0.441961 | 0.262153 | 0.441961 | 0.441816 | 0.60401  | 0.457046 | 0.776711 | 0.776492 | 0.604508 | 0.456871 | 0.776474 | 0.776745 |
| Random       | 0.664893 | 0.609652 | 0.965961 | 0.916304 | 0.671727 | 0.628963 | 0.835247 | 0.8403   | 0.674403 | 0.628206 | 0.696181 | 0.815538 |
| DecisionTree | 0.676174 | 0.670691 | 0.933504 | 0.887256 | 0.681563 | 0.614992 | 0.96711  | 0.923559 | 0.686361 | 0.643365 | 0.836806 | 0.845303 |
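
Passing a dict to `pd.concat` with `axis=1` is what produces the two-level column header seen above: the dict keys become the outer level. A toy sketch with made-up numbers (the frames `a` and `b` are assumptions for illustration):

```python
import pandas as pd

# Two small result tables sharing the same row index (toy values)
a = pd.DataFrame({"Linear": [0.1, 0.2]}, index=["r1", "r2"])
b = pd.DataFrame({"Linear": [0.3, 0.4]}, index=["r1", "r2"])

# Dict keys become the outer level of a MultiIndex on the columns
combined = pd.concat({"Top 3": a, "Top 4": b}, axis=1)

print(combined.columns.tolist())                 # [('Top 3', 'Linear'), ('Top 4', 'Linear')]
print(combined.loc["r2", ("Top 4", "Linear")])   # 0.4
print(combined["Top 4"])                         # the sub-table for one key
```

Selecting by the outer key, as in `comparison["Top 4 Features"]`, returns the original per-run table.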


### 📊 RFE Feature Comparison (Top 3 vs Top 4 vs Top 5 Features)

- **Top 3 Features**  
  - Linear & SVM models performed weakly (Linear R² ~0.44–0.68, SVM ~0.26–0.67).  
  - Decision Tree and Random Forest gave strong results (R² ~0.89–0.97) on the feature sets in rows `Random` and `DecisionTree`, but only ~0.44 on those in rows `Linear` and `SVC`.  

- **Top 4 Features**  
  - Linear and SVM improved at the low end (Linear ~0.60–0.68, SVM ~0.46–0.63).  
  - Decision Tree achieved its best score (~0.97).  
  - Random Forest remained strong (~0.78–0.92).  

- **Top 5 Features**  
  - Linear and SVM stayed at similar levels (Linear ~0.60–0.69, SVM ~0.46–0.64).  
  - Decision Tree dropped compared to 4 features (~0.70–0.84).  
  - Random Forest also decreased at the top end (~0.78–0.85).  

### ✅ Conclusion
- **Best Performance** → **Decision Tree with Top 4 Features** (R² ≈ 0.97).  
- **Most Consistent** → **Random Forest**, which stayed at R² ≈ 0.78 or higher on every feature set with 4 or 5 features.  
- **Linear & SVM** → Improved when moving from 3 → 4 features, but overall still weaker compared to tree-based models.  