# Random Forest Tutorial - Student Template

## Introduction

Random Forest is an ensemble learning method that trains many decision trees and combines their predictions. It is one of the most widely used machine learning algorithms because it is usually accurate with little tuning and is less sensitive to noise than a single tree.

### Key Concepts:

1. **Ensemble Method**: Combines predictions from multiple models (decision trees)
2. **Bagging (Bootstrap Aggregating)**: Each tree is trained on a bootstrap sample of the training data (rows drawn at random with replacement)
3. **Feature Randomness**: Each tree considers only a random subset of features at each split
4. **Voting**: For classification, the final prediction is the mode of individual tree predictions
5. **Averaging**: For regression, the final prediction is the average of individual tree predictions
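
The bagging, voting, and averaging steps above can be sketched in plain Python. This is a minimal illustration that uses only the standard library, not how scikit-learn implements them; the helper names `bootstrap_sample`, `majority_vote`, and `average` are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Classification: return the most common label among the tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: return the mean of the tree predictions."""
    return mean(predictions)


rng = random.Random(42)
rows = list(range(10))  # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)

# The sample has the same size as the data, but some rows repeat and others are left out.
print(len(sample), len(set(sample)))

print(majority_vote([">50K", "<=50K", "<=50K"]))
print(average([10.0, 12.0, 14.0]))
```

Because each bootstrap sample omits some rows and repeats others, every tree sees slightly different data, which is what makes combining their predictions useful.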

### Advantages of Random Forest:
- Reduces overfitting compared to individual decision trees
- Can tolerate missing values in some implementations (in this tutorial we remove them during preprocessing)
- Provides feature importance rankings
- Robust to outliers
- Works well with both numerical and categorical data (scikit-learn requires categorical features to be encoded as numbers first)

### Disadvantages of Random Forest:
- Less interpretable than single decision trees
- Can be computationally expensive with large datasets
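- May still overfit noisy datasets when individual trees are grown very deep (adding more trees mainly increases training time rather than overfitting)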

In this tutorial, we'll explore how to implement Random Forest using scikit-learn with a real-world dataset.

In [None]:
# Import necessary libraries
# TODO: Import numpy, pandas, matplotlib, seaborn, and sklearn modules
# Hint: You'll need datasets, train_test_split, RandomForestClassifier, RandomForestRegressor,
#       accuracy_score, classification_report, confusion_matrix, mean_squared_error, r2_score,
#       LabelEncoder, and other necessary components

# Your code here:

## Dataset: Adult (Census Income)

We'll use the Adult dataset (also known as the Census Income dataset), which contains demographic information about individuals and whether they make more than $50K per year. This dataset has 48,842 samples and 14 features plus the income target (15 columns in total), which makes it large enough to be a realistic example for machine learning.

### Features:
1. **age**: Continuous
2. **workclass**: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
3. **fnlwgt**: Continuous (final weight)
4. **education**: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
5. **education-num**: Continuous
6. **marital-status**: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
7. **occupation**: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
8. **relationship**: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
9. **race**: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
10. **sex**: Female, Male
11. **capital-gain**: Continuous
12. **capital-loss**: Continuous
13. **hours-per-week**: Continuous
14. **native-country**: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands

### Target:
**income**: >50K, <=50K

Let's load the dataset and explore its structure.

In [None]:
# Load the Adult (Census Income) dataset
# TODO: Use fetch_openml to load the 'adult' dataset
# TODO: Create a DataFrame from the dataset
# Hint: Use adult.frame.copy() to create a DataFrame

# Your code here:

In [None]:
# Display a summary of the dataset (column types and non-null counts)
# TODO: Use the info() method to show dataset information

# Your code here:

In [None]:
# Check for missing values
# TODO: Check for missing values in the dataset
# Hint: Use isnull().sum() and filter for values > 0

# Your code here:

## Data Preprocessing

Before training our Random Forest model, we need to preprocess the data. This includes handling missing values, encoding categorical variables, and preparing the data for machine learning.
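
The two preprocessing steps can be tried first on a tiny hand-made table. The sketch below assumes missing entries are marked with `'?'`, as the next cell suggests; the `toy` DataFrame and its values are hypothetical and much smaller than the Adult data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the real Adult columns have many more categories.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private"],
    "age": [39, 50, 38, 53],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# Treat the '?' placeholder as a missing value, then drop incomplete rows.
toy = toy.replace("?", np.nan).dropna().reset_index(drop=True)

# Fit one encoder per categorical column so each mapping can be inverted later.
encoders = {}
for col in ["workclass", "income"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
print(list(encoders["income"].classes_))  # position in this list = encoded integer
```

Keeping the fitted encoders in a dictionary makes it possible to call `inverse_transform` later and recover the original labels.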

In [None]:
# Handle missing values
# TODO: Replace '?' with NaN
# TODO: Drop rows with missing values
# TODO: Print the new dataset shape
# TODO: Check the distribution of the target variable

# Your code here:

In [None]:
# Encode categorical variables
# TODO: Define categorical columns
# TODO: Create label encoders for each categorical column
# TODO: Encode the target variable
# TODO: Print success message and show first few rows

# Your code here:

## Data Visualization

Let's visualize some key aspects of our data to better understand the relationships.

In [None]:
# Plot the distribution of the target variable
# TODO: Create a bar plot of the target variable distribution
# Hint: Use value_counts().plot(kind='bar')

# Your code here:

In [None]:
# Plot the relationship between age and income
# TODO: Create a boxplot of age by income level
# Hint: Use boxplot() with column and by parameters

# Your code here:

In [None]:
# Plot the relationship between education and income
# TODO: Create a crosstab of education and income
# TODO: Create a stacked bar plot
# Hint: Use pd.crosstab() with normalize='index' and plot(kind='bar', stacked=True)

# Your code here:

## Preparing Data for Random Forest

Now we'll prepare our data for training the Random Forest model by splitting it into training and testing sets.
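
To see what a stratified split does, here is a minimal sketch on hypothetical imbalanced labels (75 negatives, 25 positives) rather than the Adult data. With `stratify=y`, the positive-class share should be the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75 negatives and 25 positives.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 75 + [1] * 25)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # positive-class share in each split
```

Without stratification, a random split of an imbalanced target can leave the test set with noticeably more or fewer positives than the training set.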

In [None]:
# Separate features (X) and target variable (y)
# TODO: Create X (features) by dropping the target column
# TODO: Create y (target) with just the target column
# TODO: Split the data using train_test_split with test_size=0.2, random_state=42, and stratify=y
# TODO: Print dataset information

# Your code here:

## Training a Random Forest Classifier

Now we'll train a Random Forest Classifier to predict income levels. We'll use scikit-learn's `RandomForestClassifier` class.
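
As a warm-up, the sketch below fits a small forest on synthetic data from `make_classification` instead of the Adult dataset. The sample size and hyperparameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the Adult dataset).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
model.fit(X_train, y_train)

print(len(model.estimators_))                 # number of fitted trees
print(model.predict(X_test[:5]))              # predicted class labels
print(model.predict_proba(X_test[:5])[:, 1])  # estimated probability of class 1
```

Setting `random_state` makes the bootstrap samples and feature subsets reproducible, and `n_jobs=-1` (used in the exercise below) trains the trees in parallel on all available cores.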

In [None]:
# Create a Random Forest Classifier model
# TODO: Create a RandomForestClassifier with n_estimators=100, max_depth=10, random_state=42, and n_jobs=-1
# TODO: Fit the model on the training data

# Your code here:

## Making Predictions

Let's use our trained model to make predictions on the test set and evaluate its performance.

In [None]:
# Make predictions on the test set
# TODO: Use the predict method to make predictions on X_test
# TODO: Get prediction probabilities for the positive class
# TODO: Calculate accuracy using accuracy_score
# TODO: Print performance metrics and classification report

# Your code here:

## Confusion Matrix

Let's visualize the confusion matrix to better understand the model's performance.
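
Precision, recall, and F1-score follow directly from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts laid out the way scikit-learn's `confusion_matrix` returns them for labels 0 and 1 (rows are true classes, columns are predicted classes); `binary_metrics` is an illustrative helper, not a library function.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts laid out as [[TN, FP], [FN, TP]].
cm = [[50, 10], [5, 35]]
(tn, fp), (fn, tp) = cm

accuracy, precision, recall, f1 = binary_metrics(tn, fp, fn, tp)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Computing the metrics by hand once is a useful check against the values reported by `classification_report`.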

In [None]:
# Create confusion matrix
# TODO: Create confusion matrix using confusion_matrix
# TODO: Plot the confusion matrix using seaborn heatmap
# TODO: Calculate precision, recall, and F1-score
# TODO: Print the metrics

# Your code here:

## Feature Importance

One of the advantages of Random Forest is that it provides insight into which features are most important for making predictions.
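
The sketch below shows one way to turn `feature_importances_` into a sorted table, using synthetic data and hypothetical feature names. These are impurity-based importances, which sum to 1 and can favor features with many distinct values, so treat the ranking as a guide rather than a definitive answer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 3 informative features out of 6 (not the Adult dataset).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each feature with its importance and sort from most to least important.
importance_df = (
    pd.DataFrame({"feature": feature_names, "importance": model.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
```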

In [None]:
# Get feature importances from the Random Forest
# TODO: Extract feature importances from the trained model
# TODO: Create a DataFrame with features and their importances
# TODO: Sort by importance (descending)
# TODO: Print the top 10 feature importances
# TODO: Create a bar plot of top 10 feature importances

# Your code here:

## Hyperparameter Tuning

Random Forest has several hyperparameters that can be tuned to improve performance. Let's experiment with different values for `n_estimators` and `max_depth`.
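
A simple grid over two hyperparameters can be written as a loop. The sketch below uses synthetic data and a deliberately tiny grid; the candidate values are arbitrary. It scores each combination on a separate validation split, on the assumption that the final test set is kept aside for the last evaluation.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (not the Adult dataset).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

n_estimators_values = [10, 50]
max_depth_values = [3, None]  # None lets each tree grow until its leaves are pure

# Train one model per combination and record its validation accuracy.
results = {}
for n_estimators, max_depth in product(n_estimators_values, max_depth_values):
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )
    model.fit(X_train, y_train)
    results[(n_estimators, max_depth)] = accuracy_score(y_val, model.predict(X_val))

best_params = max(results, key=results.get)
print(best_params, results[best_params])
```

For larger grids, scikit-learn's `GridSearchCV` performs the same search with cross-validation.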

In [None]:
# Test different n_estimators and max_depth values
# TODO: Define ranges for n_estimators and max_depth
# TODO: Loop through combinations and train models
# TODO: Calculate and store accuracies
# TODO: Find the best combination
# TODO: Create and display a heatmap of results

# Your code here:

## Comparison with Single Decision Tree

Let's compare our Random Forest model with a single Decision Tree to see the improvement.
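
The comparison can be rehearsed on synthetic data before running it on the Adult dataset. The sketch below assumes a noisy problem generated with `make_classification`; the size of the gap between the two models, and even its sign, depends on the data, so no particular result is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.1 assigns random labels to about 10% of the samples.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the same training data and score it on the same test data.
accuracies = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, clf.predict(X_test))

improvement = accuracies["random_forest"] - accuracies["decision_tree"]
print(accuracies)
print(f"Difference (forest - tree): {improvement:+.3f}")
```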

In [None]:
# Train a single Decision Tree
# TODO: Import DecisionTreeClassifier
# TODO: Create and train a DecisionTreeClassifier
# TODO: Make predictions with both models
# TODO: Calculate accuracies
# TODO: Compare the results and print the improvement

# Your code here:

## Conclusion

In this tutorial, we've explored how to use Random Forest for classification tasks:

### Key Takeaways:

1. **Random Forest is powerful**: It combines multiple decision trees to make more accurate predictions

2. **Data preprocessing is crucial**: We handled missing values and encoded categorical variables

3. **Hyperparameter tuning matters**: We experimented with different combinations of `n_estimators` and `max_depth`

4. **Model evaluation is important**: We used multiple metrics (accuracy, precision, recall, F1-score) to assess our model

5. **Feature importance provides insights**: Random Forest helps identify which features are most predictive

6. **Ensemble methods often outperform single models**: We compared our Random Forest against a single decision tree to measure the improvement

### Next Steps:

1. Try other hyperparameters like `min_samples_split`, `min_samples_leaf`, and `max_features`
2. Experiment with different preprocessing techniques (e.g., one-hot encoding instead of label encoding)
3. Apply Random Forest to regression problems
4. Explore other ensemble methods like Gradient Boosting or XGBoost

Random Forest is a versatile and powerful algorithm that often performs well out-of-the-box. Understanding it is essential for any machine learning practitioner!