# Machine Learning Project: Theoretical Formalism

This notebook gives the theoretical and mathematical background for the methods used in this project. It covers the regression and classification techniques applied, their assumptions, the effect of their key hyperparameters, and the metrics used to evaluate them.

---

## 1. Exploratory Data Analysis (EDA)

### Purpose of EDA
Exploratory Data Analysis is the process of summarizing the main characteristics of the dataset, often using visual methods. The goals include:
- Identifying data distributions.
- Detecting missing values and outliers.
- Understanding feature correlations.

### Key Techniques Used
1. **Descriptive Statistics**: Mean, median, standard deviation, and quartiles.
2. **Correlation Analysis**: The Pearson correlation coefficient measures the strength of the linear relationship between two features $x$ and $y$, and ranges from $-1$ to $1$:
   $$
   r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
   $$
3. **Visualization Tools**: Histograms show feature distributions, boxplots expose outliers, and correlation heatmaps summarize pairwise relationships between features.
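As a minimal sketch of the Pearson formula above, the function below computes $r$ term by term in plain Python. The function name `pearson_r` and the toy values are illustrative assumptions, not part of the project code or dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed directly from its definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)

# Toy data (illustrative only): the second list is exactly linear in the first.
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_b = [2.0, 4.0, 6.0, 8.0, 10.0]
print(pearson_r(feature_a, feature_b))
```

A perfectly increasing linear relationship gives $r = 1$, and reversing one of the lists gives $r = -1$.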

---

## 2. Regression Models

### 2.1 Linear Regression

**Objective**: Predict a continuous target variable (`MedHouseVal`) using a linear combination of features.

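**Mathematical Model**:
$$
\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_px_p
$$
where:
- $\hat{y}$ is the predicted value.
- $x_1, \ldots, x_p$ are the $p$ input features.
- $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_p$ are the feature coefficients, estimated by minimizing the sum of squared residuals.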
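To make the model concrete, here is a minimal sketch of ordinary least squares restricted to a single feature, where the coefficients have a simple closed form. The single-feature restriction, the function name `fit_simple_ols`, and the toy data are assumptions made for illustration; the project itself uses several features, for which a library solver would normally be used.

```python
def fit_simple_ols(xs, ys):
    """Fit y_hat = beta_0 + beta_1 * x by least squares (one feature only)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("the feature is constant, so the slope is undefined")
    beta_1 = s_xy / s_xx             # slope
    beta_0 = y_bar - beta_1 * x_bar  # intercept
    return beta_0, beta_1

# Toy data generated from y = 1 + 2x with no noise (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta_0, beta_1 = fit_simple_ols(xs, ys)
print(beta_0, beta_1)
```

On noise-free data the fit recovers the generating intercept and slope; with real data the coefficients are the values that minimize the squared residuals.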

**Key Assumptions**:
1. Linearity: The relationship between features and the target is linear.
2. Independence: Observations are independent.
3. Homoscedasticity: Residuals (errors) have constant variance.
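4. Normality: Residuals are normally distributed. This matters for confidence intervals and hypothesis tests on the coefficients, not for computing the least-squares fit itself.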

**Evaluation Metrics**:
- **Mean Squared Error (MSE)**:
  $$
  MSE = \frac{1}{n}\sum_{i=1}^n (\hat{y}_i - y_i)^2
  $$
- **Mean Absolute Error (MAE)**:
  $$
  MAE = \frac{1}{n}\sum_{i=1}^n |\hat{y}_i - y_i|
  $$
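- **R² Score**:
  $$
  R^2 = 1 - \frac{\sum_{i=1}^n (\hat{y}_i - y_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}
  $$
  where $\bar{y}$ is the mean of the observed targets. $R^2 = 1$ indicates a perfect fit, and $R^2 = 0$ indicates a model no better than always predicting $\bar{y}$.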
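The three metrics can be sketched directly from their definitions. The helper names and the toy predictions below are illustrative assumptions; they are not the project's results.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the targets are constant")
    return 1.0 - ss_res / ss_tot

# Toy targets and predictions (illustrative only).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(mse(y_true, y_pred), mae(y_true, y_pred), r2_score(y_true, y_pred))
```

Here the errors are $-1, 0, 1, 0$, so both MSE and MAE equal $0.5$, and $R^2 = 1 - 2/20 = 0.9$. MSE penalizes large errors more heavily than MAE because the errors are squared before averaging.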

---

### 2.2 Random Forest Regression

**Objective**: Average the predictions of many decision trees, each trained on a different bootstrap sample, to reduce variance relative to a single tree.

**Key Concepts**:
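1. **Bootstrap Aggregation (Bagging)**: Each tree is trained on a bootstrap sample drawn with replacement from the training data. Each split typically considers only a random subset of the features, which decorrelates the trees.
2. **Decision Tree Splitting**: Each split is chosen to maximize the reduction in target variance:
   $$
   \text{Variance Reduction} = \text{Var(Parent)} - \left(\text{Weight}_\text{left} \cdot \text{Var(Left)} + \text{Weight}_\text{right} \cdot \text{Var(Right)}\right)
   $$
   where $\text{Weight}_\text{left}$ and $\text{Weight}_\text{right}$ are the fractions of the parent's samples sent to the left and right child.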
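The variance-reduction criterion can be sketched for one candidate split as follows. The function names and the toy target values are illustrative assumptions; each child's weight is taken as its share of the parent's samples.

```python
def variance(values):
    """Population variance of a list of target values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Parent variance minus the sample-weighted variance of the two children."""
    if not left or not right or len(left) + len(right) != len(parent):
        raise ValueError("left and right must be non-empty and partition the parent")
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return variance(parent) - (w_left * variance(left) + w_right * variance(right))

# Toy targets at a node (illustrative only).
parent = [1.0, 1.0, 5.0, 5.0]
good_split = variance_reduction(parent, [1.0, 1.0], [5.0, 5.0])  # children are pure
poor_split = variance_reduction(parent, [1.0, 5.0], [1.0, 5.0])  # children mirror the parent
print(good_split, poor_split)
```

The first split separates the low and high targets completely and removes all of the parent's variance (4.0), while the second leaves both children as mixed as the parent and yields no reduction.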

**Advantages**:
- Handles non-linear relationships.
- Reduces overfitting by averaging multiple trees.
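The bagging idea can be sketched without a tree implementation. In the sketch below, each "model" is a deliberately trivial stand-in that always predicts the mean of its bootstrap sample; this stand-in, the function names, the seed, and the toy targets are all assumptions for illustration, not how the project's trees are built.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in range(len(rows))]

def fit_mean_predictor(targets):
    """Stand-in for a regression tree: always predicts the mean of its training targets."""
    mean = sum(targets) / len(targets)
    return lambda x: mean

def bagged_predict(models, x):
    """Ensemble prediction: the average of the individual model predictions."""
    return sum(model(x) for model in models) / len(models)

rng = random.Random(0)                   # fixed seed so the run is repeatable
targets = [1.0, 2.0, 3.0, 4.0, 100.0]    # toy targets with one extreme value
models = [fit_mean_predictor(bootstrap_sample(targets, rng)) for _ in range(200)]

single = models[0](None)                 # one model's prediction depends on its particular sample
ensemble = bagged_predict(models, None)  # the average over many samples
print(single, ensemble)
```

Individual models differ depending on how often the extreme value lands in their bootstrap sample. Averaging across many models smooths out that sample-to-sample variation, which is the mechanism behind the reduced overfitting.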

**Feature Importance**: Measures how much each feature contributes to reducing error. Calculated as the total reduction in impurity (e.g., variance) caused by splits involving that feature.

---

## 3. Classification Models

### 3.1 K-Nearest Neighbors (KNN)

**Objective**: Classify an instance by the majority class among its $k$ nearest neighbors in the training set.

**Mathematical Model**:
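1. **Distance Metric**: Typically the Euclidean distance between instances $i$ and $j$:
   $$
   d(i, j) = \sqrt{\sum_{m=1}^p (x_{im} - x_{jm})^2}
   $$
   where $p$ is the number of features and $x_{im}$ is the value of feature $m$ for instance $i$.
2. **Classification Rule**: Assign the class most common among the $k$ nearest neighbors.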
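The two steps above can be sketched as a brute-force classifier. The function names, the two-dimensional toy points, and the class labels are illustrative assumptions; ties in the vote are broken in favor of the class seen first among the nearest neighbors, which is one convention among several.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority class among the k training points closest to query."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training points by distance to the query and keep the k closest.
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set with two well-separated clusters (illustrative only).
train_X = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.0), (6.0, 7.0)]
train_y = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_X, train_y, (1.5, 1.5), k=3))
print(knn_predict(train_X, train_y, (6.5, 6.5), k=3))
```

Because the distance sums raw feature differences, features on larger scales dominate the result unless the features are standardized first.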

**Hyperparameter**:
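- $k$: Number of neighbors. A small $k$ makes predictions sensitive to noise in individual training points (high variance), while a large $k$ smooths the decision boundary and can blur genuine class differences (high bias).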

**Evaluation Metrics**:
- **Accuracy**:
  $$
  \text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}
  $$
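- **Confusion Matrix**: Tabulates the counts of true versus predicted classes. Diagonal entries are correct predictions, and off-diagonal entries show which classes are confused with each other.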
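Both metrics can be sketched in a few lines. The function names, labels, and predictions below are illustrative assumptions; rows of the matrix correspond to true classes and columns to predicted classes, which is a common but not universal layout.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with rows indexed by true class and columns by predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Toy labels and predictions (illustrative only).
y_true = ["low", "low", "high", "high", "high"]
y_pred = ["low", "high", "high", "high", "low"]

print(accuracy(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=["low", "high"]))
```

Three of the five predictions are correct, so the accuracy is 0.6. The matrix `[[1, 1], [1, 2]]` shows that one "low" instance was predicted as "high" and one "high" instance as "low", detail that the single accuracy figure hides.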

---

### 3.2 Decision Tree Classifier

**Objective**: Build a tree to classify instances by splitting features based on information gain or impurity reduction.

**Key Splitting Metrics**:
1. **Gini Impurity**:
   $$
   G = 1 - \sum_{i=1}^C p_i^2
   $$
   where $p_i$ is the proportion of samples in the node that belong to class $i$ and $C$ is the number of classes.
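2. **Entropy**:
   $$
   H = -\sum_{i=1}^C p_i \log_2(p_i)
   $$
   with the convention that terms with $p_i = 0$ contribute zero. Information gain is the parent's entropy minus the sample-weighted average entropy of its children.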
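The splitting metrics can be sketched from class counts as follows. The function names and the toy labels are illustrative assumptions; information gain is computed here as parent entropy minus the sample-weighted entropy of the two children.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion of each class present in a node."""
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits; absent classes never appear, so log2(0) is not evaluated."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))

def information_gain(parent, left, right):
    """Parent entropy minus the sample-weighted entropy of the two children."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))

# Toy node with two balanced classes (illustrative only).
node = ["a", "a", "b", "b"]
print(gini(node), entropy(node))
print(information_gain(node, ["a", "a"], ["b", "b"]))
```

A balanced two-class node has Gini impurity 0.5 and entropy 1 bit, the maximum for two classes. A split that separates the classes completely yields pure children and an information gain of 1 bit.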

**Advantages**:
- Easy to interpret.
- Captures non-linear relationships.

---

## 4. Feature Importance in Random Forest

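Feature importance quantifies the contribution of each feature to the model’s predictive power. In Random Forest, the impurity-based importance of a feature $f$ is its share of the impurity reduction within each tree, averaged over the set of trees $T$:
$$
\text{Importance}(f) = \frac{1}{|T|} \sum_{t \in T} \frac{\text{Reduction in Impurity from Splits on } f \text{ in Tree } t}{\text{Total Reduction in Impurity in Tree } t}
$$
With this normalization the importances sum to 1 across features, which makes it straightforward to rank the key predictors in the dataset.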
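As a minimal sketch of one common convention, the function below normalizes the impurity reductions within each tree and then averages the resulting shares across trees, so the importances sum to 1. The function name `forest_importance`, the feature names, and the per-tree reduction values are hypothetical; in practice a library reports these values from the fitted trees.

```python
def forest_importance(per_tree_reductions):
    """Average per-tree normalized impurity reductions.

    per_tree_reductions: list of dicts, one per tree, mapping a feature name
    to the total impurity reduction from that tree's splits on the feature.
    """
    features = sorted({f for tree in per_tree_reductions for f in tree})
    totals = {f: 0.0 for f in features}
    n_used = 0
    for tree in per_tree_reductions:
        tree_total = sum(tree.values())
        if tree_total <= 0:
            continue  # a tree that never split carries no importance information
        n_used += 1
        for f, reduction in tree.items():
            totals[f] += reduction / tree_total  # this feature's share within the tree
    if n_used == 0:
        raise ValueError("no tree produced any impurity reduction")
    return {f: totals[f] / n_used for f in features}

# Hypothetical impurity reductions for a two-tree forest (illustrative only).
trees = [
    {"MedInc": 6.0, "AveRooms": 2.0},
    {"MedInc": 3.0, "AveRooms": 1.0, "Latitude": 4.0},
]
importance = forest_importance(trees)
print(importance)
```

In this toy forest, `MedInc` accounts for 0.75 of the first tree's reduction and 0.375 of the second's, giving an averaged importance of 0.5625. A feature that a tree never splits on contributes zero for that tree.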

---

## 5. Conclusion

This notebook outlines the theoretical foundations of:
1. **Exploratory Data Analysis (EDA)**: Understanding data characteristics.
2. **Regression**: Modeling continuous outcomes with Linear Regression and Random Forests.
3. **Classification**: Modeling discrete outcomes with KNN and Decision Trees.
4. **Evaluation Metrics**: Quantitative measures of model performance.

For future work, consider:
- Testing additional models (e.g., Support Vector Machines, Gradient Boosting).
- Applying advanced feature engineering techniques.
- Incorporating cross-validation for robust performance evaluation.
