## Supervised Machine Learning Model

This project trains and evaluates two supervised classifiers, Logistic Regression and Random Forest, and compares their results on the same train/test split. Using one linear model and one tree-based ensemble lets us draw conclusions from how two different modeling approaches perform on the same data.

#### Logistic Regression

Why Logistic Regression?

Simplicity and Interpretability: Logistic Regression is a straightforward and interpretable model. It provides probabilities that can be easily understood and translated into decision-making. The coefficients of the model indicate the influence of each feature on the outcome, making it easier to interpret and understand the relationship between the variables.
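
To make the coefficient interpretation concrete, the sketch below converts a coefficient into an odds ratio and applies it to a baseline probability. The coefficient and baseline values are invented for illustration and are not taken from the model trained in this notebook.

```python
import math

# Hypothetical coefficient for a single feature (illustrative value only).
coef = 0.7

# Each Logistic Regression coefficient is a change in log-odds, so
# exp(coef) is the multiplicative change in the odds per unit increase.
odds_ratio = math.exp(coef)
print(f"Odds ratio: {odds_ratio:.2f}")

# Apply the odds ratio to a hypothetical baseline probability.
baseline_prob = 0.30
baseline_odds = baseline_prob / (1 - baseline_prob)
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)
print(f"Probability moves from {baseline_prob:.2f} to {new_prob:.2f}")
```

For a multiclass scikit-learn model, `coef_` holds one row of coefficients per class, so the same reading applies class by class.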

Linear Decision Boundary: Logistic Regression assumes a linear relationship between the features and the log-odds of the outcome. This is beneficial when the true relationship is approximately linear, since the model is then cheap to train and hard to beat. The resulting decision boundary separates the classes with a linear combination of the input features.
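
As a minimal sketch of what the linear boundary means, the snippet below evaluates a two-feature binary logistic model by hand. The weights, bias, and sample points are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math


def logistic_probability(x, weights, bias):
    """Return P(y = 1 | x) for a binary logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters (not fitted on the project data).
weights = [2.0, -1.0]
bias = 0.5

# The decision boundary is the set of points where the linear score z is 0,
# which is exactly where the predicted probability equals 0.5.
on_boundary = [0.0, 0.5]      # z = 2*0.0 - 1*0.5 + 0.5 = 0.0
positive_side = [1.0, 0.0]    # z = 2*1.0 - 1*0.0 + 0.5 = 2.5
negative_side = [-1.0, 1.0]   # z = 2*(-1.0) - 1*1.0 + 0.5 = -2.5

for point in (on_boundary, positive_side, negative_side):
    print(point, round(logistic_probability(point, weights, bias), 3))
```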


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [3]:
# Loading the dataset into `df`, the name used by the cells below
df = pd.read_csv('../data/data_tfidf.csv')

In [None]:
# Transforming categorical variables into dummies
X = pd.get_dummies(df.drop(columns=['resultado']), drop_first=True)
y = df['resultado']

In [None]:
# Transforming the target variable into numerical values
y = df['resultado'].map({'DEFERIDO': 1, 'INDEFERIDO': 0, 'NÃO DEFINIDO': 2})
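
`Series.map` with a dictionary returns `NaN` for any label missing from the dictionary, so it can be worth checking for unmapped values before training. A minimal sketch on toy labels, where `'PENDENTE'` is a hypothetical value that is not in the mapping:

```python
import pandas as pd

# Toy labels; 'PENDENTE' is a hypothetical value absent from the mapping.
labels = pd.Series(['DEFERIDO', 'INDEFERIDO', 'NÃO DEFINIDO', 'PENDENTE'])
mapping = {'DEFERIDO': 1, 'INDEFERIDO': 0, 'NÃO DEFINIDO': 2}

encoded = labels.map(mapping)

# Labels missing from the dictionary become NaN; list them explicitly.
unmapped = labels[encoded.isna()].unique().tolist()
print(f'Unmapped labels: {unmapped}')
```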

### Splitting the dataset, training and making predictions

In [None]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

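# Training the Logistic Regression model for multiclass classification (one-vs-rest)
# Note: the `multi_class` argument is deprecated in recent scikit-learn releases (1.5+)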
model = LogisticRegression(multi_class='ovr')
model.fit(X_train, y_train)

# Making predictions
y_pred_logistic = model.predict(X_test)

### Evaluating the model

In [None]:
# Evaluating
accuracy = accuracy_score(y_test, y_pred_logistic)
print(f'Accuracy: {accuracy:.2f}')
conf_matrix = confusion_matrix(y_test, y_pred_logistic)
print('Confusion Matrix:')
print(conf_matrix)
class_report = classification_report(y_test, y_pred_logistic)
print('Classification Report:')
print(class_report)
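
The numbers in the classification report follow directly from the confusion matrix. The sketch below recomputes accuracy, precision, and recall by hand from a made-up 3x3 matrix; the counts are hypothetical and are not results from this project.

```python
# Hypothetical confusion matrix (rows = actual class, columns = predicted class),
# class order: 0 = INDEFERIDO, 1 = DEFERIDO, 2 = NÃO DEFINIDO.
toy_cm = [
    [50, 8, 2],
    [10, 40, 5],
    [3, 2, 30],
]

n_classes = len(toy_cm)
total = sum(sum(row) for row in toy_cm)

# Accuracy: correctly classified samples (the diagonal) over all samples.
toy_accuracy = sum(toy_cm[i][i] for i in range(n_classes)) / total


def precision(cm, k):
    """Of the samples predicted as class k, the fraction that really are k."""
    predicted_k = sum(cm[i][k] for i in range(len(cm)))
    return cm[k][k] / predicted_k


def recall(cm, k):
    """Of the samples that really are class k, the fraction predicted as k."""
    actual_k = sum(cm[k])
    return cm[k][k] / actual_k


print(f'Accuracy: {toy_accuracy:.2f}')
for k in range(n_classes):
    print(f'Class {k}: precision={precision(toy_cm, k):.2f}, '
          f'recall={recall(toy_cm, k):.2f}')
```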

#### Random Forest

Why Random Forest?

Non-Linearity: Unlike Logistic Regression, which assumes a linear relationship between the variables and the outcome, Random Forest can capture more complex interactions.
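
This contrast can be illustrated on synthetic data where the label depends on an interaction between two features (an XOR-like pattern). The data, seeds, and variable names below are made up for illustration and are unrelated to the project dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the class is 1 when both features share the same sign,
# so no single straight line can separate the two classes.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(1000, 2))
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0)

# Fit both models on the same split and compare test accuracy.
lr_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
rf_acc = RandomForestClassifier(
    n_estimators=100, random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print(f'Logistic Regression accuracy: {lr_acc:.2f}')
print(f'Random Forest accuracy: {rf_acc:.2f}')
```

On data like this the linear model is expected to stay close to chance level, while the forest can recover the interaction through successive axis-aligned splits.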

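Resistance to Overfitting: Because it averages many trees, each trained on a bootstrap sample of the data, Random Forest is usually less prone to overfitting than a single decision tree. It is not immune, however, and can still overfit noisy or very small datasets.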
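
One way to see why combining many trees helps is to compute how often a majority vote is right when each voter is only moderately accurate. The sketch below assumes a binary problem and fully independent voters, which real trees in a forest are not, so it shows an idealized upper bound on the benefit rather than the behavior of an actual Random Forest.

```python
from math import comb


def majority_vote_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are correct."""
    majority = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct)**(n_voters - k)
        for k in range(majority, n_voters + 1)
    )


# Each hypothetical voter is right 60% of the time.
for n_voters in (1, 11, 101):
    print(n_voters, round(majority_vote_accuracy(n_voters, 0.6), 3))
```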

Interpretability: Although a forest is harder to read than a single linear model, it exposes feature importances, which help identify the features that contribute most to the predictions.
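
As a small illustration of feature importances, the sketch below fits a forest on synthetic data in which only one of three features carries signal. The feature names and data are hypothetical and unrelated to the project dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only `signal` determines the label; the others are noise.
rng = np.random.default_rng(0)
n_samples = 500
signal = rng.normal(size=n_samples)
noise_a = rng.normal(size=n_samples)
noise_b = rng.normal(size=n_samples)

feature_names = ['signal', 'noise_a', 'noise_b']
X_toy = np.column_stack([signal, noise_a, noise_b])
y_toy = (signal > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_toy, y_toy)

# feature_importances_ holds one impurity-based score per feature; they sum to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f'{name}: {score:.3f}')
```

In this notebook the same attribute would be available as `rf_model.feature_importances_` after fitting, aligned with the columns of `X`.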

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# One-hot encoding the categorical variables and mapping the target to numbers
# (same preprocessing as in the Logistic Regression section)
X = pd.get_dummies(df.drop(columns=['resultado']), drop_first=True)
y = df['resultado'].map({'DEFERIDO': 1, 'INDEFERIDO': 0, 'NÃO DEFINIDO': 2})

In [None]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Creating and training the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Making predictions
y_pred_rf = rf_model.predict(X_test)

In [None]:
# Evaluating
accuracy = accuracy_score(y_test, y_pred_rf)
print(f'Accuracy: {accuracy:.2f}')
conf_matrix = confusion_matrix(y_test, y_pred_rf)
print('Confusion Matrix:')
print(conf_matrix)
class_report = classification_report(y_test, y_pred_rf)
print('Classification Report:')
print(class_report)

### Data analysis and comparisons

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Calculating the predicted probabilities
y_prob_lr = model.predict_proba(X_test)
y_prob_rf = rf_model.predict_proba(X_test)

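# ROC curves for class 1 (DEFERIDO) against the other two classes (one-vs-rest).
# Column 1 of predict_proba corresponds to class 1 because classes_ is sorted,
# assuming all three classes (0, 1, 2) appear in the training set.
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr[:, 1], pos_label=1)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf[:, 1], pos_label=1)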

# AUC (Area Under the Curve)
roc_auc_lr = auc(fpr_lr, tpr_lr)
roc_auc_rf = auc(fpr_rf, tpr_rf)

# Plotting the ROC Curves
plt.figure(figsize=(10, 8))
plt.plot(fpr_lr, tpr_lr, color='blue',
         label=f'Logistic Regression (AUC = {roc_auc_lr:.2f})')
plt.plot(fpr_rf, tpr_rf, color='green',
         label=f'Random Forest (AUC = {roc_auc_rf:.2f})')
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
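
The AUC reported in the legend can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a handful of made-up scores; the helper `pairwise_auc` and the toy data are hypothetical and only meant to show the idea, not to replace `roc_curve` and `auc`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y != 1]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = positive class, 0 = everything else) and predicted scores.
toy_true = [1, 0, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]

toy_auc = pairwise_auc(toy_true, toy_scores)
print(f'AUC: {toy_auc:.3f}')
```

In the plot above, class 1 (DEFERIDO) plays the role of the positive class and the other two classes are pooled as negatives.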

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Assuming the true values and the model predictions are already available:
# y_true = true values
# y_pred_logistic = predictions from the Logistic Regression model
# y_pred_rf = predictions from the Random Forest model


def plot_confusion_matrix(y_true, y_pred, model_name):
    # Fixing the label order so rows and columns match the tick labels below
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
                xticklabels=['INDEFERIDO', 'DEFERIDO', 'NÃO DEFINIDO'],
                yticklabels=['INDEFERIDO', 'DEFERIDO', 'NÃO DEFINIDO'])
    plt.title(f'Confusion Matrix - {model_name}', fontsize=14)
    plt.ylabel('Actual Values', fontsize=12)
    plt.xlabel('Predicted Values', fontsize=12)
    plt.show()


# Example usage with Logistic Regression
plot_confusion_matrix(y_test, y_pred_logistic, "Logistic Regression")

# Example usage with Random Forest
plot_confusion_matrix(y_test, y_pred_rf, "Random Forest")