## ML Project Pipeline
1. Import the required libraries
2. Load the dataset
3. Preprocess the data
4. Exploratory data analysis (EDA)
5. Feature engineering
6. Train/test split
7. Train the model
8. Make predictions
9. Evaluate the model
10. Save the model

# 🌲 Random Forest (Machine Learning)

## What is Random Forest?
**Random Forest** is an **ensemble learning algorithm** that builds **multiple Decision Trees** and combines their predictions to produce a more **accurate and stable** result.

Instead of relying on one decision tree, Random Forest uses the **collective decision of many trees**.

---

## Why Random Forest?
Decision Trees often suffer from:
- Overfitting
- High variance

Random Forest reduces these problems by:
- Training trees on **different samples of data**
- Using **random subsets of features**
- Combining predictions through voting or averaging

---

## Core Idea
> **Many decision trees + randomness = better performance**

---

## How Random Forest Works

### 1. Bootstrapping (Row Sampling)
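- Random samples are drawn from the original dataset **with replacement**, so some rows repeat and others are left out
- Each decision tree is trained on its own **bootstrap sample**

Training each tree on a bootstrap sample and then aggregating the trees' predictions is called **Bagging (Bootstrap Aggregating)**.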
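
The row-sampling step can be sketched with NumPy. This is only an illustration under stated assumptions: the dataset size `n_rows`, the seed, and the variable names are hypothetical, and the "out-of-bag" rows are simply those a given tree never sees.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

n_rows = 10                     # hypothetical dataset size
row_ids = np.arange(n_rows)

# Draw n_rows indices with replacement: one bootstrap sample for one tree.
bootstrap_ids = rng.choice(row_ids, size=n_rows, replace=True)

# Rows that were never drawn are "out-of-bag" for this tree.
oob_ids = np.setdiff1d(row_ids, bootstrap_ids)

print("bootstrap sample:", bootstrap_ids)
print("out-of-bag rows: ", oob_ids)
```

Repeating the draw with a different random state for each tree gives every tree a slightly different view of the data.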

---

### 2. Feature Randomness (Column Sampling)
- At each split, only a **random subset of features** is considered
- This ensures trees are **less correlated**
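
As a rough sketch of column sampling, the snippet below picks a fresh random subset of features for each split. The square-root rule for the subset size is a common choice for classification, not something this notebook fixes; the helper `candidate_features` is hypothetical, and the feature names are the numeric columns of the dataset used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

feature_names = ["age", "salary", "experience_years",
                 "work_hours_per_week", "performance_score",
                 "promotion_last_5years"]
n_features = len(feature_names)

# Assumed subset size: floor(sqrt(n_features)), at least 1.
k = max(1, int(np.sqrt(n_features)))

def candidate_features(rng, n_features, k):
    """Return k distinct feature indices to consider at one split."""
    return rng.choice(n_features, size=k, replace=False)

# Each split gets its own independent draw.
for split in range(3):
    ids = candidate_features(rng, n_features, k)
    print(f"split {split}:", [feature_names[i] for i in ids])
```

Because each split sees only `k` of the features, a single strong predictor cannot dominate every tree, which is what keeps the trees less correlated.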

---

### 3. Train Multiple Decision Trees
- Each tree is trained independently
- Trees are usually grown deep (unpruned) to keep bias low; combining many trees then reduces the variance

---

### 4. Combine Predictions
- **Classification** → Majority voting
- **Regression** → Average of predictions

---

## Prediction Method

### Classification
If tree predictions are:
- Tree 1 → Yes
- Tree 2 → No
- Tree 3 → Yes

Final output = **Yes**

---

### Regression
If tree predictions are:
- 40, 45, 50  

Final output = **(40 + 45 + 50) / 3 = 45**
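
The two combination rules can be reproduced in a few lines of standard-library Python. The helper names `majority_vote` and `average_prediction` are hypothetical. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this illustrates the idea, not the library's internals.

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def average_prediction(values):
    """Return the arithmetic mean of the per-tree predictions."""
    return mean(values)

# The same tree outputs as in the examples above.
print(majority_vote(["Yes", "No", "Yes"]))  # Yes
print(average_prediction([40, 45, 50]))     # 45
```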

---

## Important Hyperparameters
| Parameter | Description |
|--------|------------|
| `n_estimators` | Number of trees in the forest |
| `max_depth` | Maximum depth of each tree |
| `max_features` | Number of features considered at each split |
| `min_samples_split` | Minimum number of samples required to split a node |
| `bootstrap` | Whether each tree is trained on a bootstrap sample |
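
A minimal sketch of setting these hyperparameters on scikit-learn's `RandomForestClassifier`. The values and the synthetic data from `make_classification` are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only so the example runs on its own.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

rf_demo = RandomForestClassifier(
    n_estimators=50,       # number of trees
    max_depth=5,           # cap on the depth of each tree
    max_features="sqrt",   # features considered at each split
    min_samples_split=4,   # a node needs at least 4 samples to be split
    bootstrap=True,        # train each tree on a bootstrap sample
    random_state=0,
)
rf_demo.fit(X, y)

print("trees built:", len(rf_demo.estimators_))
print("deepest tree:", max(tree.get_depth() for tree in rf_demo.estimators_))
```

In practice these values are usually chosen by cross-validation rather than set by hand.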

---

## Advantages of Random Forest
- Reduces overfitting compared with a single decision tree
- Often achieves high accuracy on tabular data
- Handles large datasets well
- Captures non-linear relationships
- Provides feature importance scores
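
Feature importance can be read from a fitted forest through the `feature_importances_` attribute. The sketch below assumes synthetic data and hypothetical column names (`feature_0` and so on); with a real DataFrame the column names would be used instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 2 informative features and 3 noise features.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
cols = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one per feature; they sum to 1.
importances = pd.Series(model.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances)
```

These scores are impurity-based, so they are best treated as a rough ranking rather than a precise measure of each feature's effect.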

---

## Disadvantages of Random Forest
- Less interpretable than a single decision tree
- Slower to train and to predict, since many trees are built and evaluated
- Requires more memory to store all the trees

---

## Random Forest vs Decision Tree

| Feature | Decision Tree | Random Forest |
|------|---------------|---------------|
| Overfitting | High | Low |
| Accuracy | Medium | High |
| Stability | Low | High |
| Interpretability | High | Medium |
| Training Time | Fast | Slower |
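
One way to check a comparison like this on a given dataset is cross-validation. The sketch below uses synthetic data and default settings, both of which are assumptions; the scores it prints say nothing about any other dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the example runs on its own.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy; the spread hints at stability.
    scores[name] = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean={scores[name].mean():.3f} std={scores[name].std():.3f}")
```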

---

## Applications of Random Forest
- Fraud detection
- Credit scoring
- Medical diagnosis
- Customer churn prediction
- Recommendation systems

---

## Conclusion
Random Forest improves on Decision Trees by combining **ensemble learning and randomness**.
It typically provides better accuracy, stability, and generalization than a single decision tree, at the cost of interpretability and training time.


In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('RF datasets.csv')

In [3]:
print(df.head())


   age    salary  experience_years education_level department city_tier  \
0   56  136748.0                33             PhD         IT    Tier-3   
1   46   25287.0                28     High School    Finance    Tier-2   
2   32  146593.0                 3             PhD         HR    Tier-1   
3   60   54387.0                16     High School         IT    Tier-3   
4   25   28512.0                34        Bachelor    Finance    Tier-2   

   work_hours_per_week  performance_score  promotion_last_5years  \
0                   47           3.800000                      0   
1                   40           2.400000                      1   
2                   45           2.970316                      0   
3                   47           2.600000                      0   
4                   64           1.900000                      0   

   target_left_company  
0                    1  
1                    1  
2                    0  
3                    0  
4              

In [4]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   age                    1000 non-null   int64  
 1   salary                 1000 non-null   float64
 2   experience_years       1000 non-null   int64  
 3   education_level        1000 non-null   object 
 4   department             1000 non-null   object 
 5   city_tier              1000 non-null   object 
 6   work_hours_per_week    1000 non-null   int64  
 7   performance_score      1000 non-null   float64
 8   promotion_last_5years  1000 non-null   int64  
 9   target_left_company    1000 non-null   int64  
dtypes: float64(2), int64(5), object(3)
memory usage: 78.2+ KB
None


In [5]:
print(df.describe())

               age         salary  experience_years  work_hours_per_week  \
count  1000.000000    1000.000000        1000.00000          1000.000000   
mean     40.986000   85084.980000          19.34500            49.554000   
std      13.497852   37146.053131          11.45492            11.255833   
min      18.000000   20060.000000           0.00000            30.000000   
25%      29.000000   53704.250000          10.00000            40.000000   
50%      42.000000   84772.000000          19.00000            49.000000   
75%      52.000000  116535.500000          29.00000            59.000000   
max      64.000000  149972.000000          39.00000            69.000000   

       performance_score  promotion_last_5years  target_left_company  
count        1000.000000            1000.000000          1000.000000  
mean            2.970316               0.511000             0.497000  
std             1.128795               0.500129             0.500241  
min             1.000000       

In [6]:
df.isna().sum()

age                      0
salary                   0
experience_years         0
education_level          0
department               0
city_tier                0
work_hours_per_week      0
performance_score        0
promotion_last_5years    0
target_left_company      0
dtype: int64

In [7]:
df['department'].value_counts()

department
IT            225
Finance       201
Operations    199
HR            190
Marketing     185
Name: count, dtype: int64

In [8]:
df['education_level'].value_counts()

education_level
High School    299
PhD            244
Master         232
Bachelor       225
Name: count, dtype: int64

In [9]:
df['city_tier'].value_counts()

city_tier
Tier-3    349
Tier-2    336
Tier-1    315
Name: count, dtype: int64

In [11]:
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns; drop='first' removes one level per column.
# Note: the sparse_output argument requires scikit-learn >= 1.2.
ohe = OneHotEncoder(sparse_output=False, drop='first')
cate_cols = ['department', 'education_level', 'city_tier']
oh_df = pd.DataFrame(ohe.fit_transform(df[cate_cols]),
                     columns=ohe.get_feature_names_out(cate_cols), index=df.index)
df = pd.concat([df.drop(columns=cate_cols), oh_df], axis=1)



In [12]:
print(df.head())

   age    salary  experience_years  work_hours_per_week  performance_score  \
0   56  136748.0                33                   47           3.800000   
1   46   25287.0                28                   40           2.400000   
2   32  146593.0                 3                   45           2.970316   
3   60   54387.0                16                   47           2.600000   
4   25   28512.0                34                   64           1.900000   

   promotion_last_5years  target_left_company  department_HR  department_IT  \
0                      0                    1            0.0            1.0   
1                      1                    1            0.0            0.0   
2                      0                    0            1.0            0.0   
3                      0                    0            0.0            1.0   
4                      0                    1            0.0            0.0   

   department_Marketing  department_Operations  educatio

In [None]:
from sklearn.model_selection import train_test_split

# Separate the features (x) from the target (y), then hold out 20% for testing.
x = df.drop(columns=['target_left_company'])
y = df['target_left_company']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=9137)


In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Fit a 100-tree forest on the training split and predict on the held-out split.
rf = RandomForestClassifier(n_estimators=100, random_state=9137)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)

# Report accuracy, the confusion matrix, and per-class precision/recall/F1.
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.495
Confusion Matrix:
 [[49 50]
 [51 50]]
Classification Report:
               precision    recall  f1-score   support

           0       0.49      0.49      0.49        99
           1       0.50      0.50      0.50       101

    accuracy                           0.49       200
   macro avg       0.49      0.49      0.49       200
weighted avg       0.50      0.49      0.50       200



In [19]:
comp_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comp_df.head(10))

     Actual  Predicted
269       1          1
686       1          0
473       0          0
811       1          1
344       1          0
24        1          0
982       1          1
336       0          1
584       0          1
619       0          0
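
The last pipeline step, saving the model, is not shown in the cells above. One possible sketch uses `joblib`, which is installed alongside scikit-learn. The file name `rf_model_demo.joblib`, the temporary directory, and the synthetic data are assumptions made so the example runs on its own; in this notebook the fitted `rf` object would be saved instead.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model trained on synthetic data (assumption for a runnable example).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Hypothetical output path in the system's temporary directory.
model_path = os.path.join(tempfile.gettempdir(), "rf_model_demo.joblib")

joblib.dump(model, model_path)     # save the fitted model to disk
restored = joblib.load(model_path) # load it back

# The restored model should reproduce the original predictions.
same = bool((restored.predict(X) == model.predict(X)).all())
print("predictions match:", same)
```

A saved model should be loaded with the same scikit-learn version that produced it, and only from files you trust.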
