### **Random Forests**

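Random forests are ensembles built from decision trees. <br>
A random forest trains multiple decision trees and combines their predictions, by majority vote for classification or by averaging for regression, to obtain a more robust and accurate prediction than a single tree. <br>
Each tree is trained on a different random sample of the dataset (drawn with replacement) and considers only a random subset of features when choosing the split at each node.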



                          Dataset
                             |
              +--------------+--------------+
              |                             |
    +-------------------+         +-------------------+
    | Random Sample 1   |         | Random Sample 2   |
    | (Tree 1 Training) |         | (Tree 2 Training) |
    +-------------------+         +-------------------+
              |                             |
           Tree 1                        Tree 2
              |                             |
    +-------------------+         +-------------------+
    | Random Selection  |         | Random Selection  |
    | of Features for   |         | of Features for   |
    | Splitting         |         | Splitting         |
    +-------------------+         +-------------------+
              |                             |
    +-------------------+         +-------------------+
    | Prediction of     |         | Prediction of     |
    | Tree 1            |         | Tree 2            |
    +-------------------+         +-------------------+
              |                             |
    +-------------------------------------------------+
    |                 Averaged Voting                 |
    |                  of All Trees                   |
    +-------------------------------------------------+
                             |
                     Final Prediction
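
The flow in the diagram can be sketched by hand to make the two sources of randomness visible. This is a minimal illustration, not how `RandomForestClassifier` is implemented internally: the toy dataset and the names `X_demo`, `y_demo`, `trees`, and `forest_pred` are assumptions made for this example, and it uses a hard majority vote, whereas scikit-learn's forest averages the trees' predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy binary dataset (hypothetical, only for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a binary majority vote cannot tie
trees = []
for i in range(n_trees):
    # Randomness 1: each tree sees a bootstrap sample (rows drawn with replacement)
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Randomness 2: max_features="sqrt" considers a random feature subset at each split
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    t.fit(X_demo[idx], y_demo[idx])
    trees.append(t)

# Combine the trees: one row of predictions per tree, then a majority vote per sample
votes = np.stack([t.predict(X_demo) for t in trees])  # shape (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("Agreement with labels on the training data:", (forest_pred == y_demo).mean())
```

The score printed here is measured on the same rows the trees were trained on, so it only shows that the voting step works; it is not an estimate of accuracy on new data.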




`USE CASES`

Fraud detection: <br>
Variables: transaction details, user behavior patterns, etc. <br>
Objective: Identify fraudulent transactions. <br>
Use: Security in electronic payment systems. <br>


In [None]:
!pip3 install numpy pandas matplotlib scikit-learn imbalanced-learn

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.preprocessing import LabelEncoder

print("All libraries imported successfully.")

In [None]:

# Generate synthetic data for market segmentation
np.random.seed(42)
n = 100

# Factors to consider for market segmentation
# - age: integer from 18 to 69 (the upper bound of randint is exclusive)
# - income: between 20,000 and 120,000 (upper bound exclusive)
# - has purchased before: yes or no
# - will purchase: yes or no

# Alternative: a dataset with a clear purchase pattern (uncomment to use it instead)
# incomes = np.random.randint(20000, 100000, n)
# age = np.random.randint(18, 70, n)
# has_purchased_before = np.random.randint(0, 2, n)
# will_purchase = []
# for i in range(n):
#     if incomes[i] > 70000:
#         if age[i] >= 40:
#             will_purchase.append(1)  # Yes
#         else:
#             will_purchase.append(0)  # No
#     else:
#         if has_purchased_before[i] == 1:
#             will_purchase.append(1)  # Yes
#         else:
#             will_purchase.append(0)  # No

# data = {
#     'Age': age,
#     'Income': incomes,
#     'HasPurchasedBefore': has_purchased_before,
#     'WillPurchase': will_purchase
# }

# Here the dataset is completely random: the label is unrelated to the features,
# so accuracy should drop to roughly chance level
incomes = np.random.randint(20000, 120000, n)
age = np.random.randint(18, 70, n)
has_purchased_before = np.random.choice([0, 1], n)
will_purchase = np.random.choice(['Yes', 'No'], n)

# Create DataFrame
data = {
    'Age': age,
    'Income': incomes,
    'HasPurchasedBefore': has_purchased_before,
    'WillPurchase': will_purchase
}

# Convert dataset to a pandas DataFrame
df = pd.DataFrame(data)

# Save to CSV
df.to_csv('market_segmentation.csv', index=False)

In [None]:
# Load data
data = pd.read_csv('market_segmentation.csv')

# Show first few rows
print(data.head())

# Descriptive statistics (describe() covers numeric columns only by default)
# count: non-null values
# mean: average (sum of values in each column divided by number of rows)
# std: standard deviation
# min: minimum value per column
# 25%: 25% of rows are below these values
# 50%: 50% of rows are below these values (the median)
# 75%: 75% of rows are below these values
# max: maximum value per column
print(data.describe())

In [None]:
# Purchase distribution
# The x-axis shows the two categories ('Yes' and 'No').
# The y-axis shows the frequency of each category,
# that is, how many times it appears in the dataset.
plt.hist(data['WillPurchase'])
plt.xlabel('Will Purchase')
plt.ylabel('Frequency')
plt.title('Purchase Distribution')
plt.show()

# Encode categorical variables
# LabelEncoder assigns integers in sorted order, so 'No' -> 0 and 'Yes' -> 1.
# HasPurchasedBefore is already 0/1, so encoding leaves its values unchanged.
label_encoder = LabelEncoder()
data['HasPurchasedBefore'] = label_encoder.fit_transform(data['HasPurchasedBefore'])
data['WillPurchase'] = label_encoder.fit_transform(data['WillPurchase'])

In [None]:
# Features to consider for model training
# Use `data` (loaded from the CSV and encoded above), not the original `df`
X = data[['Age', 'Income', 'HasPurchasedBefore']]
y = data['WillPurchase']

# Split into training and test sets
# X: variables used for prediction
# y: variable we want to predict
# test_size=0.3: use 30% of the data for the test set and 70% for the training set
# The training set is used to fit the model, so it can learn the relationship between the features and the target
# The test set is held out to compare predictions against known answers and see how accurate they are
# random_state: controls how the data is shuffled before splitting. If two people run the same function
#   with the same random_state value, they will get exactly the same training/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
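
The effect of `random_state` described in the comments can be checked directly. The sketch below uses a small made-up array (`X_demo`, `y_demo` are hypothetical names, not part of the notebook's data) and splits it twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny illustrative dataset: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Two splits with the same seed
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Same seed, same rows in the test set
same_seed_matches = np.array_equal(a_test, b_test)
print("Same split with the same random_state:", same_seed_matches)
print("Train rows:", len(a_train), "Test rows:", len(a_test))
```

Changing `random_state` to another value will generally select different rows, which is why a fixed seed is useful when results need to be reproduced.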

In [None]:
# Train a random forest with 100 trees; random_state makes the result reproducible
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [None]:
# Predict on the test set
y_pred_rf = model.predict(X_test)

# Evaluate the model
print("Random Forest - Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Random Forest - Classification Report:\n", classification_report(y_test, y_pred_rf))
print("Random Forest - Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
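
Because the labels in this dataset are drawn at random, it helps to compare the forest against a trivial baseline before reading anything into the accuracy. The following is a sketch under stated assumptions: it regenerates similar random data with its own variable names (`X_rand`, `y_rand`, `baseline`, `forest`) rather than reusing the notebook's variables, and it uses scikit-learn's `DummyClassifier`, which always predicts the most frequent training class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_rows = 100

# Random features, mirroring the notebook's columns (values are illustrative)
X_rand = np.column_stack([
    rng.integers(18, 70, n_rows),         # age
    rng.integers(20000, 120000, n_rows),  # income
    rng.integers(0, 2, n_rows),           # has purchased before
])
y_rand = rng.integers(0, 2, n_rows)       # label drawn independently of the features

Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_rand, y_rand, test_size=0.3, random_state=42
)

# Baseline that ignores the features entirely
baseline = DummyClassifier(strategy="most_frequent").fit(Xr_train, yr_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(Xr_train, yr_train)

baseline_acc = accuracy_score(yr_test, baseline.predict(Xr_test))
forest_acc = accuracy_score(yr_test, forest.predict(Xr_test))
print(f"Baseline accuracy:      {baseline_acc:.2f}")
print(f"Random forest accuracy: {forest_acc:.2f}")
```

If the forest does not clearly beat the baseline, as expected when the labels carry no signal, the model has not learned a usable pattern.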

In [None]:
# Feature importance
importances = model.feature_importances_
feature_names = X.columns
feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)

# Visualize feature importance
feature_importances.plot(kind='bar')
plt.title('Feature Importance in Random Forest Model')
plt.show()
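
One caveat when reading this plot: the impurity-based values in `feature_importances_` are normalized to sum to 1, so the bars always add up to a full share even when no feature is actually predictive. The sketch below illustrates this on pure noise; `X_noise`, `y_noise`, and `noise_forest` are hypothetical names used only here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Three noise features and a label that is independent of them
X_noise = rng.normal(size=(200, 3))
y_noise = rng.integers(0, 2, 200)

noise_forest = RandomForestClassifier(n_estimators=50, random_state=0)
noise_forest.fit(X_noise, y_noise)

noise_importances = noise_forest.feature_importances_
print("Importances:", noise_importances)
print("Sum of importances:", noise_importances.sum())
```

Every feature receives a nonzero share here despite carrying no information, so the ranking is only meaningful when the model itself performs better than chance.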