# Local Outlier Factor performance metrics

### Local Outlier Factor (LOF) for Outlier Detection
LOF identifies outliers by density: it compares the local density around each point with the local density around its nearest neighbors, and points that sit in a much sparser region than their neighbors receive a high score. If ground-truth outlier labels are available, detection can be evaluated as a binary classification problem (outlier or not) for each data point.
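To make the density comparison concrete, here is a minimal pure-Python sketch of the LOF score. It is a simplified illustration, not scikit-learn's implementation: the function name `lof_scores`, the sample points, and `k=2` are assumptions chosen for the example, ties at the k-th distance are broken by index, and duplicate points are not handled.

```python
import math


def lof_scores(points, k):
    """Return a Local Outlier Factor score for every point.

    Simplified sketch: uses exactly k neighbours per point (ties at the
    k-th distance are broken by index) and assumes no duplicate points.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbours of each point
    k_distance = []  # distance from each point to its k-th nearest neighbour
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability
    # distance from point i to its neighbours.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical data: a tight cluster plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = lof_scores(points, k=2)
for point, score in zip(points, scores):
    print(point, round(score, 2))
```

Scores near 1 mean a point is about as dense as its neighbors; the distant point gets a score well above 1, which is what marks it as an outlier.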

Treating outliers as the positive class, the usual classification metrics apply:

Precision: The proportion of correctly identified outliers (true positives) among all predicted outliers (true positives + false positives).

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall (Sensitivity): The proportion of actual outliers (true positives + false negatives) that were correctly identified.

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1 Score: The harmonic mean of precision and recall, balancing the trade-off between them.

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

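Accuracy: The proportion of all data points (both inliers and outliers) that were correctly classified. Because outliers are usually rare, accuracy can be high even when most outliers are missed, so read it together with precision and recall.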

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

False Positive Rate (FPR): The proportion of inliers that were incorrectly classified as outliers.

$$\text{FPR} = \frac{FP}{FP + TN}$$

False Negative Rate (FNR): The proportion of outliers that were incorrectly classified as inliers.

$$\text{FNR} = \frac{FN}{FN + TP}$$

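The six formulas above can be computed directly from the four confusion counts. The following is a small sketch under stated assumptions: labels are encoded as 1 for outlier and 0 for inlier, the function name `outlier_detection_metrics` and the sample labels are made up for illustration, and a ratio with an empty denominator is reported as 0.0.

```python
def outlier_detection_metrics(y_true, y_pred):
    """Compute classification metrics for outlier detection.

    Both inputs are sequences of 0/1 labels, where 1 marks an outlier
    (the positive class) and 0 marks an inlier.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Report 0.0 instead of raising when the denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
    }


# Hypothetical labels: 10 points, 3 of them true outliers.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
metrics = outlier_detection_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this sample the accuracy is 0.8 even though one of the three outliers is missed, which is why precision and recall are worth reporting alongside it.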

### Using LOF with Linear Regression for Outlier Detection
In a typical use case, you use LOF to detect outliers, filter them out if appropriate, and then train a Linear Regression model on the remaining inliers. The two steps are evaluated separately: the classification metrics above measure how well LOF detects outliers (this requires ground-truth outlier labels), and regression metrics (MAE, MSE, RMSE, R²) measure how well the regression model predicts the target.

Example Workflow:

1. Detect outliers using LOF:
   - Use LOF to label data points as outliers or inliers.
   - Evaluate the LOF performance using classification metrics (Precision, Recall, F1 Score, Accuracy).
2. Train Linear Regression Model:
   - Use the remaining inliers (after filtering out outliers) to train a Linear Regression model.
   - Evaluate the regression model using MAE, MSE, RMSE, and R².
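
The workflow can be sketched with scikit-learn as follows. Everything about the data is an assumption for illustration: the points are synthetic (a noisy line with 10 injected outliers, so that ground-truth labels exist), and `n_neighbors=20` and `contamination=0.05` are example settings chosen to match the injected outlier rate, not recommended defaults.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic data (assumption): y = 3x + 2 plus small Gaussian noise.
x = rng.uniform(0, 10, size=200)
y = 3 * x + 2 + rng.normal(0, 0.5, size=200)

# Inject 10 known outliers so LOF can be scored against ground truth.
is_outlier = np.zeros(200, dtype=int)
outlier_idx = rng.choice(200, size=10, replace=False)
y[outlier_idx] += 50
is_outlier[outlier_idx] = 1

# Step 1: run LOF on the (x, y) points.
# fit_predict returns -1 for outliers and 1 for inliers.
points = np.column_stack([x, y])
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
predicted_outlier = (lof.fit_predict(points) == -1).astype(int)

lof_precision = precision_score(is_outlier, predicted_outlier)
lof_recall = recall_score(is_outlier, predicted_outlier)
lof_f1 = f1_score(is_outlier, predicted_outlier)
print(f"LOF precision={lof_precision:.2f} recall={lof_recall:.2f} f1={lof_f1:.2f}")

# Step 2: keep the predicted inliers and fit the regression on a training split.
keep = predicted_outlier == 0
X_in, y_in = x[keep].reshape(-1, 1), y[keep]
X_train, X_test, y_train, y_test = train_test_split(
    X_in, y_in, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
y_hat = model.predict(X_test)

mae = mean_absolute_error(y_test, y_hat)
mse = mean_squared_error(y_test, y_hat)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_hat)
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

On real data the true outlier rate is usually unknown, so the `contamination` value and the LOF classification metrics are only as trustworthy as the labels you have to compare against.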

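Example Output:
If you run the code with sample data, you will get a confusion matrix plot where:

- True Positive (TP) is the top right corner.
- True Negative (TN) is the bottom left corner.
- False Positive (FP) is the top left corner.
- False Negative (FN) is the bottom right corner.

These positions correspond to a plot with predicted labels on the rows (outlier first) and actual labels on the columns (inlier first). scikit-learn's `confusion_matrix` uses a different default layout (rows are actual labels, columns are predicted labels, labels in sorted order), so check the axis labels of your plot before reading off the counts.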

# Linear Regression performance metrics


Linear regression is typically used for predicting continuous values, so the relevant metrics focus on how well your model predicts the target variable. Common metrics for regression tasks are:

Mean Absolute Error (MAE): Measures the average of the absolute errors between predicted and actual values.

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

Mean Squared Error (MSE): Measures the average of the squared errors. It penalizes larger errors more than MAE.

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

Root Mean Squared Error (RMSE): The square root of the MSE, which gives you the error in the same units as the target variable.

$$\text{RMSE} = \sqrt{\text{MSE}}$$

R-squared (R²): A measure of how well the regression model explains the variance in the target variable. Values close to 1 indicate a good fit.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$$

Where:

- $y_i$ is the actual value
- $\hat{y}_i$ is the predicted value
- $\bar{y}$ is the mean of the actual values
- $n$ is the number of samples
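
As a worked check of the four formulas, here is a short pure-Python sketch. The function name `regression_metrics` and the sample values are illustrative assumptions, not taken from any real dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R² from actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    y_mean = sum(y_true) / n

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)  # total sum of squares
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 8.5]
results = regression_metrics(y_true, y_pred)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Here the errors are 0.5, 0, -1 and 0.5, so MAE is 0.5 while MSE is 0.375: the single error of size 1 contributes two thirds of the MSE, which illustrates how squaring weights larger errors more heavily.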

# Linear Regression performance interpretation

The metrics you're seeing are typical performance indicators for a linear regression model. Let's break down each of them:

### 1. **Mean Absolute Error (MAE)**: `2.51`

- **What it is**: MAE measures the average magnitude of the errors between predicted and actual values. It tells you how far off your predictions are, on average, in the same units as the target variable (in this case, `Median` values).
  
- **Interpretation**: On average, your model's predictions are off by **2.51 units** from the true values. A lower MAE indicates a better fit to the data.

### 2. **Mean Squared Error (MSE)**: `8.14`

- **What it is**: MSE is the average of the squared differences between predicted and actual values. It penalizes larger errors more than smaller ones because the differences are squared. It's sensitive to outliers.
  
- **Interpretation**: The average of the squared errors is **8.14**, expressed in squared units of `Median`. Because of the squaring, MSE is harder to read directly than MAE; taking its square root gives an RMSE of about **2.85**, which is back in the units of the target. An RMSE noticeably larger than the MAE suggests that a few large errors dominate.

### 3. **R² (R-squared)**: `-0.358`

- **What it is**: R² represents the proportion of variance in the dependent variable that is predictable from the independent variable(s). For ordinary least squares with an intercept, R² on the training data lies between 0 and 1, but on held-out data it has no lower bound and can be negative:
  - **R² = 1** means perfect predictions (model explains 100% of the variance).
  - **R² = 0** means the model doesn't explain any variance, and it performs no better than just predicting the mean of the target.
  - **R² < 0** indicates that the model is worse than predicting the mean, meaning the fitted relationship does not carry over to the data being evaluated (for example, because of overfitting or because the features have little linear relationship with the target).

- **Interpretation**: An R² of `-0.358` means your model is **worse than simply predicting the mean value of `Median`**. It doesn't capture the relationship well, and it suggests that the linear regression model doesn't fit the data effectively. Negative values of R² generally indicate that your model is not suitable for the data or that there's a mismatch between the data and the model assumptions (e.g., linearity).
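
A tiny example shows how R² drops below zero. This is a sketch with made-up numbers: the helper `r_squared` and both sets of predictions are assumptions for illustration, comparing a mean-only baseline with predictions that follow the opposite trend.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
mean_value = sum(y_true) / len(y_true)

# Baseline: always predict the mean of the target.
r2_mean_baseline = r_squared(y_true, [mean_value] * len(y_true))

# Predictions that follow the opposite trend are worse than the baseline.
r2_reversed = r_squared(y_true, [5.0, 4.0, 3.0, 2.0, 1.0])

print(r2_mean_baseline, r2_reversed)
```

The baseline scores exactly 0, and the reversed predictions score below it because their squared error exceeds the total variance of the target.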

### Overall Interpretation:

- **Model Performance**: Based on these results, the model is performing poorly. The negative R² shows that it does worse than predicting the average value of `Median` for every sample. Whether an MAE of 2.51 and an MSE of 8.14 are large in absolute terms depends on the scale of `Median`.
- **Potential Issues**: There might be issues with the data (e.g., non-linear relationships, outliers) or with the features being used. A few possible next steps:
  - **Check for linearity**: Linear regression assumes a linear relationship. If the data is non-linear, a different model (e.g., polynomial regression or decision trees) might be more suitable.
  - **Feature Engineering**: You might need additional features or better-preprocessed data.
  - **Model Choice**: Consider testing other regression models (e.g., decision tree regressor, random forest regressor) to see if they can better capture the underlying patterns in the data.

### Next Steps:
1. **Visualize the Data**: Plot the data to check for trends or patterns that linear regression might not capture.
2. **Try Polynomial Regression**: If the data has a non-linear trend, try fitting a polynomial regression model instead.
3. **Feature Selection/Engineering**: Investigate whether additional features might improve the model's performance.
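
To illustrate the linearity check and the polynomial step, the sketch below fits the same one-feature least-squares line twice: once on `x` and once on `x ** 2`, which is a special case of polynomial regression. The data are hypothetical (an exact quadratic), the helper names `fit_line` and `r_squared` are assumptions, and both R² values are computed on the training data.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single feature."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical data with a purely quadratic trend: y = x^2.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

# Straight line in x: the symmetric curve leaves nothing for a line to explain.
slope, intercept = fit_line(x, y)
r2_linear = r_squared(y, [slope * xi + intercept for xi in x])

# Same fit on the squared feature.
x_squared = [xi ** 2 for xi in x]
slope_sq, intercept_sq = fit_line(x_squared, y)
r2_squared_feature = r_squared(y, [slope_sq * zi + intercept_sq for zi in x_squared])

print(r2_linear, r2_squared_feature)
```

Real data will rarely be this clean, but a large gap between the two R² values is a sign that the relationship is non-linear and that a transformed feature or a different model is worth trying.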