What is a Random Forest:

A Random Forest is an ensemble learning method that builds multiple decision trees and combines their outputs to make a more accurate and stable prediction.

How it works:
It trains many decision trees, each on a bootstrap sample of your data (rows drawn at random with replacement).

At each split in each tree, it picks the best split from a random subset of features.

For classification, it uses majority voting (i.e., the class most trees agree on).

For regression, it uses the average of the predictions.
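The three ideas above (bootstrapping, majority voting, averaging) can be sketched in plain Python. This is only an illustration of the aggregation step, not how any library implements it; the helper names and the per-tree predictions are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(tree_predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_predictions):
    """Regression: return the mean of the per-tree predictions."""
    return mean(tree_predictions)


rng = random.Random(0)
sample = bootstrap_sample([1, 2, 3, 4, 5], rng)  # same size as the input; rows may repeat

# Hypothetical outputs of five trees for a single input row
vote = majority_vote(["survived", "died", "survived", "survived", "died"])
avg = average_prediction([10.0, 12.0, 11.0, 13.0, 14.0])
print(vote)  # survived
print(avg)   # 12.0
```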

When to Use Random Forest:
Use Random Forest when:

You want a robust model that is less prone to overfitting than a single decision tree.

You have noisy data or high dimensionality (many features).

You want feature importance insights.

You want a balance of accuracy and interpretability.


The parameter `n_estimators=100` specifies the number of decision trees that the Random Forest will build to make its final prediction.

Explanation in Simple Terms:
A Random Forest is made up of many individual Decision Trees.

The more trees you include (i.e., the higher the n_estimators), the more stable and accurate your model tends to be, up to a point. Beyond that point, extra trees add training time but little accuracy.

Each tree is trained on a slightly different random subset of the data, and then during prediction, the forest:

Votes (for classification)

Averages (for regression)


So Why 100?
100 is the default value in scikit-learn (since version 0.22) and often provides a good balance between performance and training time.

You can increase this number for better accuracy (especially with larger datasets), but:

It will take more time to train.

It will use more memory.

rf_model = RandomForestClassifier(n_estimators=10)   # fewer trees: faster, usually less accurate
rf_model = RandomForestClassifier(n_estimators=100)  # standard setting (scikit-learn default)
rf_model = RandomForestClassifier(n_estimators=500)  # more stable but slower

Rule of Thumb:
Small datasets: n_estimators=50-100 is usually enough.

Large datasets: You can use n_estimators=200-500+, depending on your system and needs.
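One way to check these rules of thumb on your own data is to train forests of different sizes and compare held-out accuracy. The sketch below assumes scikit-learn is installed and uses synthetic data from make_classification as a stand-in; the dataset sizes and the tree counts tried are arbitrary choices, not recommendations from these notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with your own X and y
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for n_trees in (10, 100, 300):
    rf_model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf_model.fit(X_train, y_train)
    scores[n_trees] = rf_model.score(X_test, y_test)  # accuracy on held-out data

for n_trees, accuracy in scores.items():
    print(f"n_estimators={n_trees}: test accuracy = {accuracy:.3f}")
```

If accuracy stops improving as the number of trees grows, the smaller setting is usually the better trade-off for training time and memory.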

Advantages of Random Forest:
Can work with missing data and categorical features, given some prep (imputing missing values and encoding categories as numbers)

Reduces overfitting from individual trees

Good for both classification and regression

Provides feature importance

Disadvantages of Random Forest: 
Slower than single decision trees

Less interpretable than one tree

Large models can take more memory





Focus on Relevant Features (Feature Selection)
We selected these features because they are the most directly related to whether someone survived or not (survived is the target). Including irrelevant or less informative features can:

Introduce noise

Make the model overfit

Complicate training without improving performance


| Abbreviation | Full Name      | Meaning                                           |
| ------------ | -------------- | ------------------------------------------------- |
| **TN**       | True Negative  | Model correctly predicted "Not Survived"          |
| **FP**       | False Positive | Model predicted "Survived" but was "Not Survived" |
| **FN**       | False Negative | Model predicted "Not Survived" but was "Survived" |
| **TP**       | True Positive  | Model correctly predicted "Survived"              |
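The four counts in the table can be read off scikit-learn's confusion_matrix. A minimal sketch, assuming the labels are coded 0 = "Not Survived" and 1 = "Survived"; the y_true and y_pred values are toy data invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Toy labels invented for illustration: 0 = "Not Survived", 1 = "Survived"
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# With labels=[0, 1] the matrix layout is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=3
```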



What Is Feature Importance?
In a Random Forest model, feature_importances_ tells you which features are most useful in making predictions.

It measures how much each feature reduces impurity (e.g., Gini impurity or entropy) at the splits where it is used, averaged across all trees in the forest.

Higher value → Feature is more important for prediction.

Lower value → Feature contributes little.


Step-by-Step Explanation
importances = rf_model.feature_importances_
This extracts a NumPy array with one importance score per feature used in training the RandomForestClassifier, in the same order as the training columns. The scores sum to 1.
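To make the scores readable, it helps to pair them with the column names and sort them. The sketch below assumes pandas and scikit-learn are available and uses synthetic data with made-up column names (feature_0 to feature_4), so the printed numbers say nothing about the survival dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with made-up column names
X_values, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(5)]
X = pd.DataFrame(X_values, columns=feature_names)

rf_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = rf_model.feature_importances_  # NumPy array, one score per column
ranked = pd.Series(importances, index=X.columns).sort_values(ascending=False)
print(ranked)
```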


Why Use Feature Importance?
To understand which features matter most.

To simplify your model by removing weak features.

To help interpret your machine learning model.


| Feature    | Importance |
| ---------- | ---------- |
| `sex`      | 0.40       |
| `fare`     | 0.30       |
| `age`      | 0.15       |
| `pclass`   | 0.10       |
| `embarked` | 0.05       |

You can say:

The model relies most on sex and fare to predict survival.

You might even consider removing embarked if it's contributing little.

Label Encoding converts categorical columns (text data) into numerical format so they can be used in a machine learning model like scikit-learn's Random Forest, which only accepts numerical input.

What this loop does:
For each column in the list label_cols, it does:

data[col].astype(str): Ensures that all values are strings. This avoids errors when there are missing values (NaN) or mixed types.

le.fit_transform(...): Converts each unique category into an integer, assigned in sorted order (e.g., 'No' → 0 and 'Yes' → 1 in one column; 'N' → 0 and 'S' → 1 in another).

data[col] = ...: Replaces the original string values in the column with their numeric codes.
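The loop described above is not shown in these notes, so here is a sketch of what it plausibly looks like, assuming data is a pandas DataFrame and le is a scikit-learn LabelEncoder. The toy frame and the encoders dictionary are additions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame invented for illustration
data = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
})
label_cols = ["sex", "embarked"]

encoders = {}  # hypothetical helper: keeps each fitted encoder for later decoding
for col in label_cols:
    le = LabelEncoder()  # fresh encoder per column
    data[col] = le.fit_transform(data[col].astype(str))
    encoders[col] = le

print(data)
print(encoders["embarked"].classes_)  # categories in the order of their codes
```

Keeping the fitted encoders makes it possible to map codes back to the original text with inverse_transform, or to apply the same mapping to new data with transform.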

Why This Is Important
ML algorithms need numeric input: Random Forest, Logistic Regression, etc., don’t understand text.

Label Encoding is fast and simple: It works well with tree-based models, which are less sensitive to the arbitrary order that the integer codes imply.

When to Use LabelEncoder (and When Not To)
Good Use Case:
Binary categories like 'Yes'/'No', 'Male'/'Female'

High cardinality categorical features when used with tree-based models (e.g., Random Forest)

Be Careful When:
Encoding ordinal categories where the order matters ('Low', 'Medium', 'High'): LabelEncoder assigns codes in alphabetical order, which does not match the real order.

Using linear models (e.g., Logistic Regression) → LabelEncoder may introduce a fake order → use OneHotEncoder instead.
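Both cautions can be seen on a tiny example. The sketch below assumes scikit-learn and NumPy are installed; the levels array is toy data, and OrdinalEncoder is brought in here as one possible way to state the order explicitly, not something these notes prescribe.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

levels = np.array(["Low", "Medium", "High", "Low"])

# LabelEncoder sorts categories alphabetically: High=0, Low=1, Medium=2
label_codes = LabelEncoder().fit_transform(levels)
print(label_codes)  # [1 2 0 1]

# OrdinalEncoder with an explicit category list keeps Low < Medium < High
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordinal_codes = ordinal.fit_transform(levels.reshape(-1, 1)).ravel()
print(ordinal_codes)  # [0. 1. 2. 0.]

# OneHotEncoder gives each category its own 0/1 column, so no order is implied
onehot = OneHotEncoder()
onehot_matrix = onehot.fit_transform(levels.reshape(-1, 1)).toarray()
print(onehot.categories_[0])  # column order: High, Low, Medium
print(onehot_matrix)
```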
