## RANDOM FOREST
Random Forest is a popular and powerful ensemble machine learning algorithm used for classification, regression, and other tasks. It operates by building a multitude of decision trees and combining their outputs to make more accurate and stable predictions.
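The "multitude of decision trees" differ from one another because each tree is typically trained on its own bootstrap sample, that is, rows drawn with replacement from the training set. The following is a minimal standard-library sketch of that sampling step only; the dataset size, tree count, and the helper name `bootstrap_indices` are illustrative assumptions, not part of the notebook.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 10             # hypothetical toy dataset size
n_trees = 3             # hypothetical number of trees

samples = [bootstrap_indices(n_rows, rng) for _ in range(n_trees)]
for i, idx in enumerate(samples):
    # Rows that were never drawn are "out-of-bag" for this tree.
    out_of_bag = sorted(set(range(n_rows)) - set(idx))
    print(f"tree {i}: trains on rows {sorted(idx)}, never sees {out_of_bag}")
```

Because every tree sees a slightly different view of the data, their individual errors are less correlated, which is what makes combining them worthwhile.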



### ADVANTAGES 
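Handles high-dimensional data well, since each split considers only a random subset of the features.

More resistant to overfitting than a single decision tree, because averaging many decorrelated trees reduces variance.

Feature importance can be measured.

Works with both categorical and numerical data, although scikit-learn's implementation requires categorical columns to be encoded as numbers first.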
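To illustrate the feature-importance point, here is a small sketch on synthetic data. The data from `make_classification`, the feature names, and the hyperparameters are assumptions chosen for the example; only `RandomForestClassifier` and its `feature_importances_` attribute come from scikit-learn itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the notebook's dataset):
# 5 features, of which only 2 carry signal.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# feature_importances_ holds one impurity-based score per feature; they sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` is a common cross-check.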



In [1]:
import pandas as pd 
import numpy as np 

In [2]:
df = pd.read_csv("C:\\Users\\HP\\OneDrive\\Desktop\\DATASET\\ai_dev_productivity - ai_dev_productivity.csv")

In [3]:
df.head(2)

Unnamed: 0,hours_coding,coffee_intake_mg,distractions,sleep_hours,commits,bugs_reported,ai_usage_hours,cognitive_load,task_success
0,5.99,600,1,5.8,2,1,0.71,5.4,1
1,4.72,568,2,6.9,5,3,1.75,4.7,1


In [4]:
df.shape

(500, 9)

In [5]:
df['task_success'].value_counts()

task_success
1    303
0    197
Name: count, dtype: int64
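These class counts give a useful reference point: a model that always predicts the majority class would already reach a certain accuracy. A short sketch of that arithmetic, using only the counts printed above:

```python
# Class counts copied from the value_counts() output above.
counts = {1: 303, 0: 197}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(majority_class, baseline_accuracy)  # 1 0.606
```

Any accuracy reported for this dataset is easier to interpret against this 0.606 baseline.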

In [6]:
df.isnull().sum()

hours_coding        0
coffee_intake_mg    0
distractions        0
sleep_hours         0
commits             0
bugs_reported       0
ai_usage_hours      0
cognitive_load      0
task_success        0
dtype: int64

In [7]:
# Separate the features (x) from the target column (y).
x = df.drop(columns=['task_success'])
y = df['task_success']

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# Hold out 20% of the rows for testing; random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [10]:
from sklearn.ensemble import RandomForestClassifier

In [11]:
rf = RandomForestClassifier()
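Calling `RandomForestClassifier()` with no arguments uses the library defaults (100 trees in recent scikit-learn versions) and an unseeded random generator, so scores can vary slightly from run to run. The sketch below, on synthetic stand-in data that is an assumption rather than the notebook's dataset, shows that passing `random_state` makes two fits agree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the notebook's dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Same seed and same data: the two forests make identical predictions.
rf_a = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

same = (rf_a.predict(X) == rf_b.predict(X)).all()
print(same)
```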

In [12]:
rf.fit(x_train,y_train)

In [13]:
y_pred = rf.predict(x_test)

In [14]:
from sklearn.metrics import accuracy_score

In [15]:
accuracy_score(y_test , y_pred)

0.99
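Accuracy is simply the fraction of test rows predicted correctly. With `test_size = 0.2` on 500 rows the test set holds 100 rows, so 0.99 corresponds to 99 correct predictions. A minimal sketch of the calculation, using hypothetical labels rather than the notebook's test set:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for illustration only (not the notebook's test set).
y_true = [1, 0, 1, 1, 0]
y_hat = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_hat))  # 0.8
```

A score from a single 100-row split is fairly noisy, so cross-validation is a reasonable way to confirm a result this high.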

### 🌳 RANDOM FOREST CLASSIFIER vs REGRESSOR

| Feature | RandomForestClassifier | RandomForestRegressor |
|---|---|---|
| Task | Classification (predict categories) | Regression (predict continuous values) |
| Output | Class labels (e.g., "spam" or "not spam") | Continuous values (e.g., price = 523.4) |
| Aggregation Method | Majority vote across trees (scikit-learn averages the trees' predicted class probabilities) | Average of predictions from all trees |
| Split Criterion | Gini impurity (default) or entropy | Mean squared error (MSE), by default |
| Example Use Case | Email spam detection, disease classification | House price prediction, temperature forecasting |
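The aggregation and impurity rows of the comparison can be sketched in a few lines of standard-library Python. The function names and the per-tree outputs below are hypothetical and exist only for illustration; they are not scikit-learn internals.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Classifier-style aggregation: the most common label wins."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average(tree_predictions):
    """Regressor-style aggregation: the mean of the trees' outputs."""
    return mean(tree_predictions)


def gini_impurity(class_counts):
    """Gini impurity of a node: 1 - sum(p_k ** 2) over the classes."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


# Hypothetical per-tree outputs for a single sample.
print(majority_vote(["spam", "not spam", "spam"]))  # spam
print(average([520.0, 530.0, 519.0]))               # 523.0
print(gini_impurity([5, 5]))                        # 0.5
```

A Gini impurity of 0.5 is the maximum for two classes (a perfectly mixed node); a pure node scores 0.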

## RANDOM FOREST REGRESSOR

In [16]:
df = pd.read_csv("C:\\Users\\HP\\OneDrive\\Desktop\\DATASET\\insurance - insurance.csv")

In [17]:
df.head(2)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523


In [18]:
# One-hot encode the categorical columns so the model receives only numeric/boolean inputs.
df = pd.get_dummies(df, columns=['sex', 'smoker', 'region'])
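One-hot encoding replaces each categorical column with one boolean column per distinct value. The sketch below mimics that idea in plain Python for the two `smoker` values visible in the preview; the helper `one_hot` is a hypothetical stand-in, not how pandas implements `get_dummies`.

```python
def one_hot(values, prefix):
    """Return {column_name: [bool, ...]} with one column per distinct value."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [v == cat for v in values]
        for cat in categories
    }


# The two 'smoker' values visible in df.head(2) above.
encoded = one_hot(["yes", "no"], prefix="smoker")
print(encoded)
# {'smoker_no': [False, True], 'smoker_yes': [True, False]}
```

For binary columns the two dummies are redundant; `pd.get_dummies` accepts `drop_first=True` to drop one of them, though tree-based models generally tolerate the redundancy.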

In [19]:
df.head(2)

Unnamed: 0,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.9,0,16884.924,True,False,False,True,False,False,False,True
1,18,33.77,1,1725.5523,False,True,True,False,False,False,True,False


In [20]:
x = df.drop(columns = ['charges'])
y = df['charges']

In [21]:
x_train , x_test , y_train , y_test = train_test_split(x,y,test_size = 0.2 , random_state = 42)

In [22]:
from sklearn.ensemble import RandomForestRegressor 

In [23]:
rf = RandomForestRegressor()

In [24]:
rf.fit(x_train,y_train)

In [25]:
y_pred = rf.predict(x_test)

In [26]:
from sklearn.metrics import r2_score

In [27]:
r2_score(y_test ,y_pred )

0.8664762184400074
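The R² score measures how much of the target's variance the predictions explain: 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with hypothetical values (not the insurance data), alongside mean absolute error, which is expressed in the target's own units:

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_hat = [12.0, 18.0, 33.0, 37.0]
print(r2(y_true, y_hat))
print(mean_absolute_error(y_true, y_hat))
```

An R² of about 0.87 therefore means the forest explains roughly 87% of the variance in `charges` on the held-out rows; reporting an error metric in currency units alongside it makes the result easier to act on.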

## RANDOM FOREST CLASSIFIER ON ANOTHER DATASET

In [62]:
df = pd.read_csv("C:\\Users\\HP\\OneDrive\\Desktop\\DATASET\\Attrition - Attrition.csv")

In [63]:
df.head(2)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7


In [64]:
df.dtypes

Age                          int64
Attrition                   object
BusinessTravel              object
DailyRate                    int64
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeCount                int64
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
HourlyRate                   int64
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
MonthlyRate                  int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears   

In [69]:
df_obj = ['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'OverTime']

In [70]:
df = pd.get_dummies(df ,columns = df_obj)

In [71]:
df.head(2)

Unnamed: 0,Age,Attrition,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,...,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single,Over18_Y,OverTime_No,OverTime_Yes
0,41,Yes,1102,1,2,1,1,2,94,3,...,False,False,True,False,False,False,True,True,False,True
1,49,No,279,8,1,1,2,3,61,2,...,False,True,False,False,False,True,False,True,True,False


In [73]:
from sklearn.preprocessing import LabelEncoder

In [74]:
lb = LabelEncoder()

In [75]:
df['Attrition'] = lb.fit_transform(df['Attrition'])
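`LabelEncoder` assigns integer codes to the distinct labels in sorted order, so here `'No'` becomes 0 and `'Yes'` becomes 1. The plain-Python sketch below mirrors that behavior; `fit_label_mapping` is a hypothetical helper, not part of scikit-learn.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


# The two 'Attrition' values visible in df.head(2) above.
labels = ["Yes", "No"]
mapping = fit_label_mapping(labels)
print(mapping)                       # {'No': 0, 'Yes': 1}
print([mapping[v] for v in labels])  # [1, 0]
```

Knowing which class maps to 1 matters later when reading precision or recall for the "attrition" class.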


In [76]:
x = df.drop(columns = ['Attrition'])
y = df['Attrition']
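At this point `x` still contains `EmployeeNumber`, which looks like a row identifier, and columns such as `EmployeeCount` and `StandardHours` that show the same value in both previewed rows. `Over18` also produced a single dummy column (`Over18_Y`), which suggests it holds only one value. Columns like these add no signal. The sketch below shows one way to spot constant columns; the toy table is built only from the two previewed rows, so whether the columns are constant across the full dataset is an assumption to verify (for example with `df.nunique()`).

```python
def constant_columns(table):
    """Return names of columns that hold a single distinct value."""
    return [name for name, values in table.items() if len(set(values)) == 1]


# Toy table built from the two rows shown by df.head(2); the full dataset
# may contain other values, so this is illustrative only.
table = {
    "Age": [41, 49],
    "EmployeeCount": [1, 1],
    "StandardHours": [80, 80],
    "DailyRate": [1102, 279],
}
print(constant_columns(table))  # ['EmployeeCount', 'StandardHours']
```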

In [77]:
df.columns

Index(['Age', 'Attrition', 'DailyRate', 'DistanceFromHome', 'Education',
       'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction',
       'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
       'StandardHours', 'StockOptionLevel', 'TotalWorkingYears',
       'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany',
       'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager',
       'BusinessTravel_Non-Travel', 'BusinessTravel_Travel_Frequently',
       'BusinessTravel_Travel_Rarely', 'Department_Human Resources',
       'Department_Research & Development', 'Department_Sales',
       'EducationField_Human Resources', 'EducationField_Life Sciences',
       'EducationField_Marketing', 'EducationField_Medical',
       'EducationField_Other', 'EducationField_Technical Degree',
       'Gender_Female', 'Gender

In [78]:
from sklearn.model_selection import train_test_split

In [79]:
x_train , x_test , y_train , y_test = train_test_split(x,y,test_size = 0.2 , random_state = 42)

In [83]:
from sklearn.ensemble import RandomForestClassifier

In [90]:
rf = RandomForestClassifier()

In [91]:
rf.fit(x_train,y_train)

In [92]:
y_pred = rf.predict(x_test)

In [93]:
accuracy_score(y_test , y_pred)

0.8775510204081632
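Accuracy alone can look good when one class dominates, because the model may rarely predict the minority class. Whether that applies here can be checked with `df['Attrition'].value_counts()`, as was done for the first dataset. The sketch below uses hypothetical imbalanced labels (not the notebook's test set) to show accuracy and recall diverging; scikit-learn offers `confusion_matrix` and `classification_report` in `sklearn.metrics` for the same purpose.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


# Hypothetical imbalanced labels: 8 negatives, 2 positives.
y_true = [0] * 8 + [1] * 2
y_hat = [0] * 9 + [1] * 1  # the model misses one of the two positives

tp, fp, tn, fn = confusion_counts(y_true, y_hat)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
print(accuracy, recall)  # 0.9 0.5
```

In this toy case 90% accuracy coexists with only half of the positive cases being found, which is why per-class metrics are worth reporting for attrition-style problems.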