## Data loading

In [1]:
train <- read.csv("advertising_train.csv")
str(train)

'data.frame':	700 obs. of  11 variables:
 $ X                       : int  714 503 358 624 985 718 919 470 966 516 ...
 $ Daily.Time.Spent.on.Site: num  49.4 66.2 49.8 88 66.5 ...
 $ Age                     : int  53 26 39 35 31 29 32 25 50 37 ...
 $ Area.Income             : num  45465 63580 45800 48919 58038 ...
 $ Daily.Internet.Usage    : num  128 229 112 149 256 ...
 $ Ad.Topic.Line           : chr  "Ameliorated well-modulated complexity" "Business-focused maximized complexity" "Reduced multimedia project" "Secured secondary superstructure" ...
 $ City                    : chr  "Jacquelineshire" "North Anaport" "Hannaport" "Port Brianfort" ...
 $ Male                    : int  1 0 0 1 1 0 1 1 0 1 ...
 $ Country                 : chr  "Congo" "Mexico" "Samoa" "France" ...
 $ Timestamp               : chr  "2016-07-07 18:07:19" "2016-05-02 00:01:56" "2016-02-09 07:21:25" "2016-03-24 05:38:01" ...
 $ Clicked.on.Ad           : int  1 0 1 0 0 0 0 0 1 1 ...


In [2]:
train$Male_f <- factor(train$Male, levels=c("0","1"))
summary(train$Male_f)

In [3]:
test <- read.csv("advertising_test.csv")
str(test)

'data.frame':	300 obs. of  11 variables:
 $ X                       : int  3 5 8 17 20 22 25 29 33 35 ...
 $ Daily.Time.Spent.on.Site: num  69.5 68.4 66 55.4 74.6 ...
 $ Age                     : int  26 35 48 37 40 35 41 34 57 57 ...
 $ Area.Income             : num  59786 73890 24593 23937 23822 ...
 $ Daily.Internet.Usage    : num  236 226 132 129 136 ...
 $ Ad.Topic.Line           : chr  "Organic bottom-line service-desk" "Robust logistical utilization" "Reactive local challenge" "Customizable multi-tasking website" ...
 $ City                    : chr  "Davidton" "South Manuel" "Port Jefferybury" "West Dylanberg" ...
 $ Male                    : int  0 0 1 0 1 1 0 0 1 1 ...
 $ Country                 : chr  "San Marino" "Iceland" "Australia" "Palestinian Territory" ...
 $ Timestamp               : chr  "2016-03-13 20:35:42" "2016-06-03 03:36:18" "2016-03-07 01:40:15" "2016-01-30 19:20:41" ...
 $ Clicked.on.Ad           : int  0 0 1 1 1 0 1 1 1 1 ...


In [4]:
test$Male_f <- factor(test$Male, levels=c("0","1"))
summary(test$Male_f)

## Model

In [5]:
reg <- glm(Clicked.on.Ad ~ Daily.Time.Spent.on.Site + Age + Area.Income 
    + Daily.Internet.Usage + Male_f, data=train, family="binomial")
summary(reg)


Call:
glm(formula = Clicked.on.Ad ~ Daily.Time.Spent.on.Site + Age + 
    Area.Income + Daily.Internet.Usage + Male_f, family = "binomial", 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7583  -0.1264   0.0013   0.0132   3.2345  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)               2.875e+01  3.601e+00   7.985 1.41e-15 ***
Daily.Time.Spent.on.Site -2.192e-01  2.845e-02  -7.704 1.32e-14 ***
Age                       1.746e-01  3.088e-02   5.655 1.56e-08 ***
Area.Income              -1.264e-04  2.304e-05  -5.485 4.14e-08 ***
Daily.Internet.Usage     -6.500e-02  8.456e-03  -7.687 1.50e-14 ***
Male_f1                  -2.581e-01  5.089e-01  -0.507    0.612    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 970.35  on 699  degrees of freedom
Residual deviance: 117.65  on 694  degrees of freedom

## Questions

### 1. Does a higher age imply a higher probability of clicking the ad link?

A: **True**. The coefficient for Age is positive (1.746e-01) and significant (p = 1.56e-08), so, holding the other predictors fixed, the probability of clicking the ad link increases with age.

### 2. Does more browsing time on the site (Daily.Time.Spent.on.Site) imply a higher probability of clicking the ad link?

A: **False**. The coefficient for Daily.Time.Spent.on.Site is -2.192e-01; the negative sign indicates that, holding the other predictors fixed, more browsing time on the site lowers the probability of clicking the ad link.
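
As a rough illustration of how the coefficient signs in questions 1 and 2 translate into probabilities, the sketch below plugs the rounded estimates from the `summary(reg)` output into the logistic function. The visitor profile used as a baseline (time on site 65, age 35, income 55000, internet usage 180, `Male_f` = 0) and the helper `click_prob` are hypothetical, chosen only for illustration; they are not taken from the data.

```r
# Rounded estimates copied from the summary(reg) output.
beta <- c(intercept = 28.75, time = -0.2192, age = 0.1746,
          income = -1.264e-04, usage = -0.065, male = -0.2581)

# Predicted click probability for a hypothetical visitor profile.
# Any single argument may be a vector; the others stay at their defaults.
click_prob <- function(time = 65, age = 35, income = 55000, usage = 180, male = 0) {
  eta <- beta[["intercept"]] + beta[["time"]] * time + beta[["age"]] * age +
    beta[["income"]] * income + beta[["usage"]] * usage + beta[["male"]] * male
  plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))
}

p_by_age  <- click_prob(age = seq(20, 60, by = 10))
p_by_time <- click_prob(time = seq(40, 90, by = 10))

print(round(p_by_age, 3))   # rises as age increases
print(round(p_by_time, 3))  # falls as time on site increases
```

The direction of each change follows from the coefficient sign alone; the size of the change depends on the baseline profile, which is why the profile is stated explicitly.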

### 3. At a 10% significance level, are all predictors significant?

A: **False**. The p-value for Male_f1 is 0.612, which is above 0.10, so that predictor is not significant at the 10% level. All other predictors have p-values below 0.001.
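
The comparison against the 10% threshold can be sketched as follows. The p-values are copied by hand from the `Pr(>|z|)` column of the `summary(reg)` output so that the snippet runs on its own; with the fitted model in memory, the same column should be available as `coef(summary(reg))[, "Pr(>|z|)"]`.

```r
# p-values copied from the Pr(>|z|) column of the summary(reg) output.
p_values <- c(
  "(Intercept)"              = 1.41e-15,
  "Daily.Time.Spent.on.Site" = 1.32e-14,
  "Age"                      = 1.56e-08,
  "Area.Income"              = 4.14e-08,
  "Daily.Internet.Usage"     = 1.50e-14,
  "Male_f1"                  = 0.612
)

alpha <- 0.10                      # significance level from the question
significant <- p_values < alpha    # TRUE where the term is significant

print(significant)
print(names(p_values)[!significant])  # terms that fail the 10% threshold
```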

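### 4. The odds ratio between men and women is:

\begin{matrix}
    \hline
    \text{Sex} & \text{Male_f1} \\
    \hline
    \text{Female} & 0\\    
    \text{Male} & 1 \\    
    \hline
\end{matrix}

Consider two observations that differ only in sex:

* $o_1 = (Daily.Time.Spent.on.Site, Age, Area.Income, Daily.Internet.Usage, 0)$
* $o_2 = (Daily.Time.Spent.on.Site, Age, Area.Income, Daily.Internet.Usage, 1)$

$\frac{m_{male}}{m_{female}} = e^{\text{Male_f1}} = e^{-0.2581} \approx 0.77$

That is, the estimated odds of clicking for men are about 0.77 times the odds for women with the same values of the other predictors, although this difference is not statistically significant (p = 0.612).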

In [6]:
exp(-2.581e-01)
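
To gauge how uncertain this odds ratio is, a Wald-type 95% interval can be sketched from the estimate and standard error reported for Male_f1 in the `summary(reg)` output. This is an illustrative calculation using the normal approximation, not part of the original analysis.

```r
# Estimate and standard error for Male_f1, copied from summary(reg).
est <- -0.2581
se  <- 0.5089

z <- qnorm(0.975)                   # two-sided 95% normal quantile
odds_ratio <- exp(est)              # point estimate of the odds ratio
ci <- exp(est + c(-1, 1) * z * se)  # interval on the odds-ratio scale

print(round(odds_ratio, 2))
print(round(ci, 2))
```

An interval that contains 1 is consistent with the large p-value reported for Male_f1.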

### 5, 6, 7, and 8. Accuracy, precision, sensitivity, and specificity of the model.

In [7]:
# Prediction (logistic regression models a probability).
p <- predict(reg, test, type="response")
p[1:5]
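
The role of `type="response"` can be illustrated on simulated data: `predict()` returns the linear predictor (log-odds) with `type = "link"` and the probability with `type = "response"`, and the two are related by the inverse logit. The data frame `sim`, the model `fit`, and the coefficients used to simulate are all hypothetical and unrelated to the advertising data.

```r
set.seed(1)

# Simulated binary outcome with a single predictor (illustrative only).
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * x))
sim <- data.frame(x = x, y = y)

fit <- glm(y ~ x, data = sim, family = "binomial")

new_obs <- data.frame(x = c(-2, 0, 2))
eta  <- predict(fit, new_obs, type = "link")      # log-odds scale
prob <- predict(fit, new_obs, type = "response")  # probability scale

# The probability is the inverse logit of the linear predictor.
print(cbind(eta = eta, prob = prob, plogis_eta = plogis(eta)))
```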

In [8]:
# Binarization: probabilities above 0.5 are labeled "S" (click), the rest "N" (no click).
p_class <- ifelse(p > 0.5, "S", "N")

In [9]:
# Compare the predicted classes against the actual values.
tab <- table(p_class, test$Clicked.on.Ad)
tab

       
p_class   0   1
      N 147   7
      S   6 140
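
The table above has its rows in the order N, S because `table()` sorts the labels alphabetically, which happens to match the 0, 1 order of the columns. A slightly more defensive variant, sketched here on made-up vectors (`p_demo` and `y_demo` are hypothetical), fixes the factor levels explicitly so that the layout stays 2 x 2 even if one class is never predicted, and so that cells can be read by name.

```r
# Hypothetical predicted probabilities and true labels (not from the data set).
p_demo <- c(0.91, 0.12, 0.73, 0.40, 0.05, 0.88)
y_demo <- c(1, 0, 1, 1, 0, 0)

# Explicit levels pin down the row and column order.
pred  <- factor(ifelse(p_demo > 0.5, 1, 0), levels = c(0, 1))
truth <- factor(y_demo, levels = c(0, 1))

tab_demo <- table(Predicted = pred, Actual = truth)
print(tab_demo)

# Cells can then be addressed by label instead of by position.
print(tab_demo["1", "1"])  # true positives
print(tab_demo["1", "0"])  # false positives
```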

In [10]:
# Rows of tab are predicted classes (N, S); columns are actual classes (0, 1).
# Accuracy: correct predictions over all predictions.
exa <- sum(diag(tab)) / sum(tab)

# Precision: true positives over predicted positives.
pre <- tab[2, 2] / (tab[2, 2] + tab[2, 1])

# Recall (sensitivity): true positives over actual positives.
sen <- tab[2, 2] / (tab[2, 2] + tab[1, 2])

# Specificity: true negatives over actual negatives.
esp <- tab[1, 1] / (tab[1, 1] + tab[2, 1])

print(paste0("Accuracy:         ", round(exa, 4), " = ", round(exa*100, 1), "%"))
print(paste0("Precision:        ", round(pre, 4), " = ", round(pre*100, 1), "%"))
print(paste0("Sensitivity:      ", round(sen, 4), " = ", round(sen*100, 1), "%"))
print(paste0("Specificity:      ", round(esp, 4), " = ", round(esp*100, 1), "%"))

[1] "Accuracy:         0.9567 = 95.7%"
[1] "Precision:        0.9589 = 95.9%"
[1] "Sensitivity:      0.9524 = 95.2%"
[1] "Specificity:      0.9608 = 96.1%"
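
The same four metrics can be wrapped in a small helper that reads the confusion table by label rather than by position. This is a sketch: the function name `binary_metrics` is hypothetical, and the counts are copied from the confusion table shown earlier so that the snippet runs without the data files.

```r
# Counts copied from the confusion table (rows = predicted, columns = actual).
tab_counts <- matrix(c(147, 6, 7, 140), nrow = 2,
                     dimnames = list(Predicted = c("0", "1"),
                                     Actual    = c("0", "1")))

# Hypothetical helper: metrics for a 2 x 2 table with labeled rows and columns.
binary_metrics <- function(tab, positive = "1", negative = "0") {
  tp <- tab[positive, positive]  # predicted positive, actually positive
  tn <- tab[negative, negative]  # predicted negative, actually negative
  fp <- tab[positive, negative]  # predicted positive, actually negative
  fn <- tab[negative, positive]  # predicted negative, actually positive
  c(accuracy    = (tp + tn) / sum(tab),
    precision   = tp / (tp + fp),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

m <- binary_metrics(tab_counts)
print(round(m, 4))
```

Indexing by label avoids depending on the order in which `table()` happens to sort the classes.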


In [12]:
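# Install caret only if it is not already available, then load it.
if (!requireNamespace("caret", quietly = TRUE)) install.packages("caret")
library(caret)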


In [13]:
# Cross-check the manually computed metrics against caret's confusionMatrix().
p_class <- ifelse(p > 0.5, 1, 0)
confusionMatrix(as.factor(p_class), as.factor(test$Clicked.on.Ad), positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 147   7
         1   6 140
                                         
               Accuracy : 0.9567         
                 95% CI : (0.927, 0.9767)
    No Information Rate : 0.51           
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.9133         
                                         
 Mcnemar's Test P-Value : 1              
                                         
            Sensitivity : 0.9524         
            Specificity : 0.9608         
         Pos Pred Value : 0.9589         
         Neg Pred Value : 0.9545         
             Prevalence : 0.4900         
         Detection Rate : 0.4667         
   Detection Prevalence : 0.4867         
      Balanced Accuracy : 0.9566         
                                         
       'Positive' Class : 1              
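
Several of the additional statistics in this output can be reproduced by hand from the four cell counts, which serves as a further consistency check. The sketch below uses the standard definitions of negative predictive value, balanced accuracy, and Cohen's kappa; the variable names are illustrative.

```r
# Cell counts from the confusion matrix, with class 1 as the positive class.
tp <- 140; tn <- 147; fp <- 6; fn <- 7
n  <- tp + tn + fp + fn

npv  <- tn / (tn + fn)              # negative predictive value
sens <- tp / (tp + fn)              # sensitivity
spec <- tn / (tn + fp)              # specificity
balanced_acc <- (sens + spec) / 2   # mean of sensitivity and specificity

# Cohen's kappa: observed agreement corrected for chance agreement.
p_obs <- (tp + tn) / n
p_exp <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
cohen_kappa <- (p_obs - p_exp) / (1 - p_exp)

stats <- c(npv = npv, balanced_accuracy = balanced_acc, kappa = cohen_kappa)
print(round(stats, 4))
```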
                                         