# Random Forest

**Aim:**
To understand, implement, and analyze the Random Forest algorithm to predict/classify data accurately and efficiently while leveraging the power of ensemble learning.

**Title:**
"Enhancing Predictive Accuracy: Exploring Random Forest Algorithm for Data Classification"

**Dataset Source:**
The California Housing dataset originates from the StatLib repository and is derived from the 1990 U.S. census. As loaded by scikit-learn's `fetch_california_housing`, it contains 20,640 census districts described by eight features: median income, median house age, average rooms, average bedrooms, population, average occupancy, latitude, and longitude. The target is the median house value of each district, expressed in units of $100,000.

**Theory (Explanation of Algorithm):**
Random Forest is an ensemble learning method for classification and regression. It trains multiple decision trees and combines their outputs: the mode of the trees' predictions for classification, or the mean for regression.

**Decision Trees:**

Random Forest is built on decision trees. Each tree is grown by repeatedly splitting nodes: at every node, the split that best separates the data according to an impurity criterion (such as Gini impurity or entropy) is chosen from the features considered at that node.
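To make "best separates the data" concrete, here is a minimal sketch of scoring a candidate split by entropy-based information gain, the criterion selected later in this notebook with `criterion="entropy"`. The helper names `entropy` and `information_gain` and the toy labels are illustrative assumptions, not scikit-learn's internal implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child

# Toy labels (hypothetical): 4 samples of class 0 and 4 of class 1.
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that separates the classes perfectly versus one that does not help.
perfect_gain = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])
useless_gain = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(perfect_gain)  # 1.0
print(useless_gain)  # 0.0
```

A tree-growing procedure would evaluate many candidate thresholds this way and keep the one with the highest gain.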

**Bootstrap Aggregating (Bagging):**

Random Forest employs bagging: each decision tree is trained on a different bootstrap sample, drawn from the training data at random with replacement. The resulting diversity among the trees reduces the variance of the ensemble and therefore its tendency to overfit.
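The sampling step can be sketched with the standard library alone. The function name `bootstrap_indices` and the sample size are assumptions chosen for illustration. Because rows are drawn with replacement, a bootstrap sample of size n contains roughly 63% of the distinct rows on average (1 - 1/e for large n); the remaining rows are "out-of-bag" for that tree.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_samples = 10_000

indices = bootstrap_indices(n_samples, rng)
unique_fraction = len(set(indices)) / n_samples

# Rows never drawn are "out-of-bag" for this tree.
out_of_bag = set(range(n_samples)) - set(indices)

print(f"unique rows in bootstrap sample: {unique_fraction:.3f}")
print(f"out-of-bag rows: {len(out_of_bag) / n_samples:.3f}")
```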

**Random Feature Selection:**

At each node of a decision tree, only a random subset of the features is considered for splitting. This decorrelates the trees, makes the model more robust, and keeps it from relying too heavily on any single feature.
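As a rough illustration, the sketch below draws a fresh candidate set for each node. It assumes the common square-root heuristic (which, to my knowledge, corresponds to scikit-learn's `max_features="sqrt"` setting for classifiers). The helper `candidate_features` is hypothetical. The feature names are those exposed by scikit-learn's California Housing loader.

```python
import math
import random

# Feature names as exposed by scikit-learn's California Housing loader.
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

def candidate_features(n_features, rng):
    """Pick floor(sqrt(n_features)) distinct feature indices for one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
for node in range(3):
    # Each node gets its own independent draw of candidate features.
    chosen = candidate_features(len(feature_names), rng)
    print(f"node {node}: candidates = {[feature_names[i] for i in chosen]}")
```

With eight features, each split considers only two candidates under this heuristic.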

**Voting/Averaging:**

For classification, the final prediction is the mode (majority vote) of the individual trees' predictions. For regression, it is the average of the predictions from all trees.
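The two aggregation rules can be sketched as follows. The per-tree predictions are made-up values for a single sample, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for a single sample.
class_votes = [2, 1, 2, 2, 3]
value_estimates = [2.1, 1.8, 2.4, 2.0, 2.2]

print(majority_vote(class_votes))  # 2
print(mean(value_estimates))       # average of the five estimates
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes. The two rules usually agree, but not always.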

**Conclusion:**

Random Forest is a powerful and widely used machine learning technique: it handles complex datasets, reduces overfitting relative to a single decision tree, and gives robust predictions. By combining many decorrelated decision trees, it usually achieves better accuracy and stability than a single tree and is often competitive with other traditional machine learning algorithms.

In [1]:
# Random Forest
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

In [7]:
# Load the California Housing dataset
d = fetch_california_housing()

In [9]:
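X, y = d.data, d.target
# The target is continuous (median house value in units of $100,000).
# Truncating it to an integer bins it into classes 0-5 so that a classifier can be used.
y = y.astype(int)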
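
The effect of the integer cast is easy to miss, so here is a small sketch of what it does to a few hypothetical target values. Python's `int()` truncates toward zero, which matches `astype(int)` for positive floats. Each class therefore covers a $100,000-wide band of median house value.

```python
# Hypothetical target values, in units of $100,000 as in the dataset.
median_values = [0.85, 1.20, 1.99, 2.00, 3.75, 5.00001]

# int() truncates toward zero, mirroring NumPy's astype(int) for positive floats.
classes = [int(v) for v in median_values]

for value, label in zip(median_values, classes):
    low, high = label * 100_000, (label + 1) * 100_000
    print(f"{value:>8} -> class {label}  (${low:,} to under ${high:,})")
```

The classes are ordered price bands rather than unrelated categories, so a regressor trained on the unmodified target would be a natural alternative to the classifier used here.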

In [10]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
# Standardize the features (optional: tree-based models do not require feature scaling)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
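
The fit/transform split in the cell above matters because the scaling statistics must come from the training data only. A minimal standard-library sketch with made-up values for one feature:

```python
from statistics import mean, pstdev

# Hypothetical values of one feature, already split into train and test.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# "fit": the statistics come from the training data only.
mu = mean(train)
sigma = pstdev(train)  # population standard deviation, as StandardScaler uses

# "transform": the same statistics are applied to both splits.
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print(train_scaled)
print(test_scaled)
```

Computing `mu` and `sigma` on the test set as well would leak information about the test data into preprocessing.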

In [12]:
# Fitting the Random Forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
# No random_state is set, so results vary slightly between runs
classifier = RandomForestClassifier(n_estimators=10, criterion="entropy")
classifier.fit(X_train, y_train)

In [13]:
# Predicting the test set result
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("RandomForestClassifier Accuracy:", accuracy)

RandomForestClassifier Accuracy: 0.6891957364341085


In [14]:
# Creating the confusion matrix (rows: true classes, columns: predicted classes)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[ 575  151    3    0    1    0]
 [ 143 1355  172   11    2    1]
 [  12  261  602   72    5    4]
 [   1   49  143  187   16   19]
 [   1   12   37   60   22   27]
 [   2    9   15   42   12  104]]
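
The confusion matrix carries more information than the single accuracy figure. The sketch below copies the matrix printed above into a plain list and derives the overall accuracy and the per-class recall from it. The variable names are illustrative, and the same numbers could be obtained from scikit-learn's metric functions.

```python
# Confusion matrix copied from the output above (rows = true class, columns = predicted).
cm = [
    [575, 151, 3, 0, 1, 0],
    [143, 1355, 172, 11, 2, 1],
    [12, 261, 602, 72, 5, 4],
    [1, 49, 143, 187, 16, 19],
    [1, 12, 37, 60, 22, 27],
    [2, 9, 15, 42, 12, 104],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
print(f"accuracy = {correct}/{total} = {correct / total:.4f}")

# Recall per class: correct predictions divided by the number of true samples.
recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
for label, recall in enumerate(recalls):
    print(f"class {label}: recall = {recall:.3f} ({sum(cm[label])} samples)")
```

The diagonal sums to 2845 of 4128 test samples, which reproduces the reported accuracy of about 0.689. Most errors fall in the cells next to the diagonal, meaning a neighbouring price band was predicted. The smaller high-value classes are recognised far less reliably than classes 0 and 1.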
