
In [1]:
R.version

               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          0.3                         
year           2020                        
month          10                          
day            10                          
svn rev        79318                       
language       R                           
version.string R version 4.0.3 (2020-10-10)
nickname       Bunny-Wunnies Freak Out     

###Preface

This document contains hands-on examples using R code to illustrate the methods discussed in the Bayesian networks course. There is usually one chapter here per lecture, but not all lectures have code examples.

This course does not cover the R programming language itself, but we try to be gentle and explain a few things as we go along. There are many excellent resources online for learning R.

###Naive and Optimal Bayesian Prediction

This vignette contains R code for performing prediction with Naive Bayes and logistic regression classifiers for an example from the lecture.

Learning objectives:
* Become acquainted with the R platform
* Be able to perform simple classification tasks on discrete data
* Understand the difference between logistic regression and Naive Bayes

Please read the following vignette and make sure that you can run the code on your own computer. You will first need to have R installed; RStudio is optional.

##Installing the R package for this course

We first need to install the Bayesian Networks R package from GitHub. If you have not done this yet, run the first two lines in the code block below. They first install the “remotes” package (together with “pROC” and “naivebayes”) from CRAN, and then install the “bayesianNetworks” package from GitHub.

During the course, this package may change and be upgraded, so it may be a good idea to reinstall it regularly.

In [2]:
install.packages(c("remotes","pROC","naivebayes"))
remotes::install_github("jtextor/bayesianNetworks")

library( bayesianNetworks )

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘plyr’


Downloading GitHub repo jtextor/bayesianNetworks@HEAD

✔  checking for file ‘/tmp/RtmpJ1QJn3/remotes646efb2749/jtextor-bayesianNetworks-bd3d8e0/DESCRIPTION’
─  preparing ‘bayesianNetworks’:
✔  checking DESCRIPTION meta-information
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘bayesianNetworks_0.0.0.9000.tar.gz’
   


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



Note: if installing the “bayesianNetworks” package above does not work for you, perhaps because your version of R is too old to install the “remotes” package it relies on, you can instead uncomment and run the line below to directly include the functions defined in that package.

```
# source("https://raw.githubusercontent.com/jtextor/bayesianNetworks/master/R/functions.R")
```
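
If you rerun this notebook often, you may not want to reinstall CRAN packages that are already present. The following is a minimal sketch of such a check using base R only. The helper name `missing_packages` is hypothetical and not part of any package, and the actual install calls are left commented out because they need network access.

```r
# Hypothetical helper: return the subset of `pkgs` that is not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" and "utils" ship with R, so nothing should be reported here.
print(missing_packages(c("stats", "utils")))

# Uncomment to install only what is missing (needs network access):
# to_install <- missing_packages(c("remotes", "pROC", "naivebayes"))
# if (length(to_install) > 0) install.packages(to_install)
# remotes::install_github("jtextor/bayesianNetworks")
```

The GitHub install is left unconditional so that you still pick up updates to the course package.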

##Generating some artificial data to work with

Now we simulate some data from the “Sprinkler” example in the lecture notes to train a classifier on. Separately we will also generate a test set to evaluate our classifier. Here I also seed the random number generator so that you hopefully get exactly the same output as I do.

In [3]:
set.seed(123)
training.data <- simulate_sprinkler( 500 )
test.data <- simulate_sprinkler( 1000 )
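
To give an idea of what simulating from a Bayesian network involves, here is a rough sketch in base R that samples each variable given its parents, in the order Cloudy, then Sprinkler and Rain, then WetGrass. The function name `simulate_sprinkler_sketch` is hypothetical, and all probabilities are assumed illustrative values; they are not taken from the `bayesianNetworks` package, so the output will not match `simulate_sprinkler`.

```r
# Hypothetical stand-in for simulate_sprinkler(); every probability below is
# an assumed illustrative value, not the one used by the course package.
simulate_sprinkler_sketch <- function(n) {
  cloudy    <- runif(n) < 0.5                       # P(Cloudy)
  sprinkler <- runif(n) < ifelse(cloudy, 0.1, 0.5)  # P(Sprinkler | Cloudy)
  rain      <- runif(n) < ifelse(cloudy, 0.8, 0.2)  # P(Rain | Cloudy)
  # P(WetGrass | Sprinkler, Rain)
  p_wet     <- ifelse(sprinkler & rain, 0.99,
               ifelse(sprinkler | rain, 0.90, 0.01))
  wet       <- runif(n) < p_wet
  data.frame(WetGrass = wet, Rain = rain,
             Sprinkler = sprinkler, Cloudy = cloudy)
}

set.seed(123)
sketch.data <- simulate_sprinkler_sketch(500)
str(sketch.data)
```

Calling `set.seed` with the same value before each run makes the simulated data reproducible.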

Let us see what the training data looks like. The line below generates a table similar to the ones we saw in the lecture by counting the number of data points in each category.

In [4]:
print( ftable(training.data, row.vars=1:4) )

WetGrass Rain  Sprinkler Cloudy     
FALSE    FALSE FALSE     FALSE   102
                         TRUE     43
               TRUE      FALSE    13
                         TRUE      0
         TRUE  FALSE     FALSE     3
                         TRUE     14
               TRUE      FALSE     1
                         TRUE      0
TRUE     FALSE FALSE     FALSE     1
                         TRUE      0
               TRUE      FALSE    87
                         TRUE      7
         TRUE  FALSE     FALSE    16
                         TRUE    170
               TRUE      FALSE    23
                         TRUE     20
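
To see how `ftable` arrives at such counts, it can help to try it on a data set small enough to check by hand. The sketch below uses a made-up six-row data frame (`toy` is purely illustrative and is not course data) with only two of the four variables.

```r
# A tiny made-up data set with two logical variables.
toy <- data.frame(
  WetGrass = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  Rain     = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

# One row per combination of WetGrass and Rain, with its count.
counts <- ftable(toy, row.vars = 1:2)
print(counts)

# The counts always add up to the number of data points.
sum(counts) == nrow(toy)

# table() gives the same counts in a form that is easy to index by name.
tab <- table(toy)
tab["TRUE", "TRUE"]   # data points with wet grass and rain
```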


##Using logistic regression to predict whether the grass is wet

Prediction in R is usually done in two steps. First, we fit a prediction model to a training dataset. For example, the line below fits a logistic regression model. Here `glm` stands for generalized linear model, and we are using the binomial family, whose default link function is the logit.

In [5]:
m <- glm( WetGrass ~ Cloudy + Rain + Sprinkler, data=training.data, family="binomial" )
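
As a small illustration of the `family` argument, the sketch below fits the same kind of model on made-up data twice, once with `family="binomial"` and once with the logit link spelled out. The data frame `toy` and its variables `x` and `y` are invented for this example.

```r
# Made-up data: a logical predictor x and a logical outcome y.
set.seed(1)
x <- runif(200) < 0.5
y <- runif(200) < ifelse(x, 0.8, 0.2)
toy <- data.frame(x = x, y = y)

# Short form, as used in this vignette.
m_short <- glm(y ~ x, data = toy, family = "binomial")

# Explicit form: binomial family with the logit link.
m_explicit <- glm(y ~ x, data = toy, family = binomial(link = "logit"))

# Both calls specify the same model, so the coefficients should agree.
all.equal(coef(m_short), coef(m_explicit))
m_short$family$link
```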

We can look at this model to get some information about the fit, such as the values of the coefficients.

In [6]:
summary(m)


Call:
glm(formula = WetGrass ~ Cloudy + Rain + Sprinkler, family = "binomial", 
    data = training.data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.0067  -0.1380   0.4066   0.4066   3.0523  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -4.6488     0.7675  -6.057 1.39e-09 ***
CloudyTRUE      0.9242     0.5614   1.646   0.0997 .  
RainTRUE        6.1758     0.7759   7.960 1.73e-15 ***
SprinklerTRUE   6.4995     0.8129   7.995 1.29e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 648.68  on 499  degrees of freedom
Residual deviance: 223.23  on 496  degrees of freedom
AIC: 231.23

Number of Fisher Scoring iterations: 7
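
The coefficients are on the log-odds scale. As a rough sketch of how to read them, the code below copies the estimates printed above into a vector by hand and converts two linear predictors into probabilities with the inverse logit function `plogis`. The vector `coefs` and its element names are chosen here for illustration only.

```r
# Estimates copied by hand from the summary output above.
coefs <- c(intercept = -4.6488, cloudy = 0.9242,
           rain = 6.1758, sprinkler = 6.4995)

# Predicted P(WetGrass) when Cloudy, Rain and Sprinkler are all FALSE:
# only the intercept contributes to the linear predictor.
p_none <- plogis(coefs[["intercept"]])

# Predicted P(WetGrass) when only Rain is TRUE.
p_rain_only <- plogis(coefs[["intercept"]] + coefs[["rain"]])

print(c(none = p_none, rain_only = p_rain_only))
```

Here `plogis(z)` computes `1 / (1 + exp(-z))`, which maps any log-odds value to a probability between 0 and 1.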
