<a href="https://colab.research.google.com/github/lcbjrrr/quantai/blob/main/IA_R_Clas_RegLog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Topic:** R Lang

**Title:** Logistic Regression

**Author:** Luiz Barboza

**Date:** 20/dec/24

**Lang:** R

**Site:** https://quant-research.group/

**Email:** contato@quant-research.group

In [None]:
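# glm() is part of base R (the "stats" package), so no installation is needed.
# Optional: the separate "glm2" package could be installed and loaded like this.
#install.packages("glm2",repos = "http://cran.us.r-project.org")
#library("glm2")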

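Let's define some R functions for evaluation metrics:

* **`Score`:** Generates predictions from a fitted model (`mod`) on the predictors (`Xs`) and returns their accuracy against the expected values (`y_exp`). The `t` argument selects the prediction type: `"response"` for logistic regression (probabilities thresholded at 0.5), `"knn"` for k-nearest neighbours, and a plain `predict` call otherwise.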

* **`Accuracy`:** Computes the accuracy of predicted values (`prevs`) against expected values (`y_exp`).

* **`ConfusionMatrix`:** Creates a confusion matrix (table) showing the counts of true positives, true negatives, false positives, and false negatives, comparing predicted (`prevs`) and expected (`y_exp`) values.

* **`PrecisionRecall`:** Calculates and returns the precision and recall scores based on a confusion matrix derived from predicted (`prevs`) and expected (`y_exp`) values.
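
For reference, the standard definitions behind these functions can be written in terms of the confusion-matrix counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). These symbols are shorthand introduced here; in the code they correspond to `tp`, `tn`, `fp`, and `fn`.

```latex
\[
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}
\]
```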


In [None]:

Score <- function(mod,Xs,y_exp,t=""){
   if(t=="response"){
     prevs<-predict(mod,Xs,type=t)>0.5
   }else if(t=="knn"){
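     # Assumes mod is class::knn; Xs serves as both reference and query set,
     # and the labels are passed only through cl (not as a feature column)
     prevs<-mod(Xs,Xs,cl=y_exp,k=3)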
   }else{
     prevs<-predict(mod,Xs)
   }
   score<-sum(prevs==y_exp)/length(prevs)
   return(score)
}

Accuracy <- function(y_exp,prevs){
   accuracy<-sum(prevs==y_exp)/length(prevs)
   return(accuracy)
}

ConfusionMatrix <- function(y_exp,prevs){
   cm<-table(y_exp,prevs)
   return(cm)
}

PrecisionRecall <- function(y_exp,prevs){
   cm<-table(y_exp,prevs)
   tp<-cm[2,2]
   tn<-cm[1,1]
   fn<-cm[2,1]
   fp<-cm[1,2]
   precision <- tp/(tp+fp)
   recall <- tp/(tp+fn)
   return(c(precision,recall))
}


This dataset records gender (G, coded as 0 or 1), height (H), and weight (W) for eight individuals.

In [2]:
train <- read.csv('https://raw.githubusercontent.com/lcbjrrr/data/main/gender%20-%20tr.csv')
print(train)

  G   H  W
1 0 178 72
2 0 179 81
3 1 163 55
4 1 168 58
5 0 181 98
6 1 170 60
7 0 184 78
8 1 171 59


Let's build a logistic regression model (rlog). The formula `G ~ .` predicts the binary variable G from all other columns of the train data frame (H and W), and `family='binomial'` tells `glm` to fit a logistic regression.

In [None]:
rlog <- glm( G ~ . , data = train, family='binomial')
rlog
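
To make the link between the fitted coefficients and the predicted probabilities concrete, here is a minimal sketch that rebuilds the probabilities by hand with the logistic function and compares them with `predict`. The training data is copied from the table printed above; the names `eta`, `p_manual`, and `p_predict` are introduced only for this illustration. Because this small training set is perfectly separable by height alone, `glm` may warn about non-convergence or about fitted probabilities of 0 or 1; the sketch still runs.

```r
# Training data copied from the table printed earlier in this notebook.
train <- data.frame(
  G = c(0, 0, 1, 1, 0, 1, 0, 1),
  H = c(178, 179, 163, 168, 181, 170, 184, 171),
  W = c(72, 81, 55, 58, 98, 60, 78, 59)
)

# Same model call as in the notebook; warnings are possible on separable data.
rlog <- glm(G ~ ., data = train, family = "binomial")

# Linear predictor (log-odds) built by hand from the fitted coefficients.
b <- coef(rlog)
eta <- b["(Intercept)"] + b["H"] * train$H + b["W"] * train$W

# The logistic (inverse-logit) function maps log-odds to probabilities.
p_manual <- 1 / (1 + exp(-eta))

# predict(..., type = "response") should return the same probabilities.
p_predict <- predict(rlog, train, type = "response")
print(cbind(p_manual, p_predict))
```

The two columns should agree up to floating-point error, which is why thresholding the output of `predict(..., type = "response")` at 0.5 is equivalent to checking whether the log-odds are positive.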

Let's select and return columns 2 and 3 from the test data frame.


In [None]:
test <- read.csv('https://raw.githubusercontent.com/lcbjrrr/data/main/gender%20-%20ts.csv')
test[,2:3]

“incomplete final line found by readTableHeader on 'https://raw.githubusercontent.com/lcbjrrr/data/main/gender%20-%20ts.csv'”


    H  W
1 175 75
2 165 65


Let's predict the probability that G equals 1 using the logistic regression model rlog on columns 2 and 3 of the test data frame. The code then converts these probabilities into binary predictions by classifying any probability greater than 0.5 as TRUE, and finally converts the result to a factor with the fixed levels FALSE and TRUE.

In [None]:
pred_test<- factor(predict(rlog,test[,2:3],type = "response")>0.5,levels=c(FALSE,TRUE))
pred_test
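
The explicit `levels` argument matters more than it may seem. The following sketch, using hypothetical probabilities that are not taken from the model, shows what happens when every prediction falls on the same side of the threshold.

```r
# Hypothetical predicted probabilities, for illustration only.
probs <- c(0.91, 0.73, 0.66)

# Without explicit levels, a factor only knows the values it has seen.
pred_loose <- factor(probs > 0.5)
levels(pred_loose)   # only "TRUE"

# Fixing the levels keeps both classes, even when one never occurs.
pred_fixed <- factor(probs > 0.5, levels = c(FALSE, TRUE))
levels(pred_fixed)

# Hypothetical reference labels; the table always has FALSE and TRUE columns.
y_ref <- factor(c(TRUE, FALSE, TRUE), levels = c(FALSE, TRUE))
table(y_ref, pred_fixed)
```

With fixed levels the table is always 2 x 2, so positional indexing such as `cm[2,2]` in `PrecisionRecall` does not fail when one class is missing from the predictions.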

Let's calculate the accuracy of the logistic regression model rlog on the training data: columns 2 and 3 of the train data frame (H and W) are the predictors, and the first column (G) holds the expected values.

In [None]:
print("ACC Train: ")
Score(rlog,train[,2:3],train[,1],"response")

[1] "ACC Train: "


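Let's calculate the accuracy of the predictions pred_test against the actual values y_ref, that is, the proportion of correct predictions. The actual values come from the first column of the test data frame (G), converted to a factor in the same way as the predictions.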

In [None]:
print("ACC Test: ")
y_ref <- factor(test[,1]>0.5,levels=c(FALSE,TRUE))
Accuracy(y_ref,pred_test)

[1] "ACC Test: "


Let's generate a confusion matrix, a table that summarizes the performance of a classification model by cross-tabulating the actual values (y_ref, in rows) against the predicted values (pred_test, in columns). It shows how many cases of each class were classified correctly and how many were confused with the other class.

In [None]:
print("CM Test: ")
ConfusionMatrix(y_ref,pred_test)

[1] "CM Test: "


       prevs
y_exp   FALSE TRUE
  FALSE     1    0
  TRUE      0    1
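
As a small sketch of how to read this table, the matrix printed above can be rebuilt by hand and its cells picked out by name. The variable names `cm`, `tn`, `fp`, `fn`, `tp`, and `acc` are chosen for this illustration.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted).
cm <- as.table(matrix(c(1, 0, 0, 1), nrow = 2,
                      dimnames = list(y_exp = c("FALSE", "TRUE"),
                                      prevs = c("FALSE", "TRUE"))))

# Indexing cells by name avoids mixing up rows and columns.
tn <- cm["FALSE", "FALSE"]  # actual FALSE, predicted FALSE
fp <- cm["FALSE", "TRUE"]   # actual FALSE, predicted TRUE
fn <- cm["TRUE", "FALSE"]   # actual TRUE, predicted FALSE
tp <- cm["TRUE", "TRUE"]    # actual TRUE, predicted TRUE

# Accuracy is the share of observations on the diagonal.
acc <- sum(diag(cm)) / sum(cm)
acc
```

For this matrix both test observations lie on the diagonal, so the accuracy is 1. With only two test rows, that figure says little about how the model would generalize.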

Let's calculate the precision and recall scores based on the predicted values (pred_test) and the actual values (y_ref). Precision measures the proportion of correctly predicted positive cases out of all predicted positive cases, while recall measures the proportion of correctly predicted positive cases out of all actual positive cases.

In [None]:
print("Precision / Recall - Test: ")
PrecisionRecall(y_ref,pred_test)

[1] "Precision / Recall - Test: "
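
One edge case is worth keeping in mind: if the model never predicts the positive class, the precision denominator is zero and `tp/(tp+fp)` evaluates to `NaN`. The sketch below is a hypothetical variant, `precision_recall_safe`, which is not part of the notebook and returns `NA` in that situation. The label vectors are made up for illustration.

```r
# Hypothetical helper (not part of the notebook): precision and recall,
# returning NA instead of NaN when a denominator is zero.
precision_recall_safe <- function(y_exp, prevs) {
  lv <- c(FALSE, TRUE)
  # Fixed levels guarantee a 2 x 2 table (rows = actual, columns = predicted).
  cm <- table(factor(y_exp, levels = lv), factor(prevs, levels = lv))
  tp <- cm[2, 2]
  fn <- cm[2, 1]
  fp <- cm[1, 2]
  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall    <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  c(precision = precision, recall = recall)
}

# Made-up labels: 3 actual positives, 2 actual negatives.
y_true <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
y_hat  <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
res <- precision_recall_safe(y_true, y_hat)   # tp = 2, fp = 1, fn = 1
res

# No positive predictions at all: precision is undefined, recall is 0.
res_none <- precision_recall_safe(y_true, rep(FALSE, 5))
res_none
```

In the first call both precision and recall equal 2/3, since two of the three predicted positives are correct and two of the three actual positives are found.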
