# Sample R Notebook - dashDB Machine Learning - Linear Regression

Before running the notebook, insert a credentials cell here. To do so, click "Find and Add Data" at the top right of the screen, select "Connection", and then select "Insert to code" for the dashDB system of your choice. Make sure a dashDB connection is set up in your project beforehand.
<div> <img width = 370 height =286 src="https://ibm.box.com/shared/static/yc0airtlenm9ezywk3pigr453gkz3u1w.png"> </div>

In [None]:
# The code was removed by DSX for sharing.

Next, the ibmdbR push-down library for dashDB is loaded. It translates R data frame operations into SQL statements and runs machine learning routines inside dashDB.

In [None]:
# Load the ibmdbR package and make a connection
library(ibmdbR)
library(ggplot2)
library(scales)
con <- idaConnect(paste("DASHDB", credentials_1["dsn"], sep=";"),'','')
idaInit(con)

### Creating proxy data frames
Create ida (in-database analytics) data frames for the SAMPLES.SHOWCASE_SYSUSAGE, SAMPLES.SHOWCASE_SYSTEMS and SAMPLES.SHOWCASE_SYSTYPES sample tables. The data remains in dashDB.
Then print a small sample of the data in each table.

In [None]:
sysusage<-ida.data.frame('SAMPLES.SHOWCASE_SYSUSAGE')
systems<-ida.data.frame('SAMPLES.SHOWCASE_SYSTEMS')
systypes<-ida.data.frame('SAMPLES.SHOWCASE_SYSTYPES')

head(sysusage)
head(systems)
head(systypes)

The data in these tables holds time series of measurements of computer system resource usage in a compute center. It can be used to train a regression model of memory usage based on the number of users on the system.

### Data preparation: merging all three data frames inside the database
The merge is pushed down to dashDB, so the joined data is not pulled into R. Then print a sample of the merged data frame.

In [None]:
# Join the three tables on their TYPEID and SID columns.
mergedSys<-idaMerge(systems, systypes, by='TYPEID')
mergedUsage<-idaMerge(sysusage, mergedSys, by='SID')

head(mergedUsage)
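
To make the two-step join easier to follow without a dashDB connection, here is a minimal local sketch using base R `merge()` on small in-memory data frames. The toy values and the `TYPENAME` column are hypothetical; only the key columns `TYPEID` and `SID` and the measures `USERS` and `MEMUSED` come from the notebook.

```r
# Hypothetical toy tables. Only the key columns (TYPEID, SID) and the
# measures (USERS, MEMUSED) are taken from the notebook; values are made up.
systypes <- data.frame(TYPEID = c(1, 2), TYPENAME = c("small", "large"))
systems  <- data.frame(SID = c(10, 11, 12), TYPEID = c(1, 2, 2))
sysusage <- data.frame(SID = c(10, 10, 11, 12),
                       USERS = c(5, 8, 20, 31),
                       MEMUSED = c(900, 1300, 3100, 4700))

# Same two-step join as above, done locally with base R merge():
# first attach the system type to each system, then attach both to the usage rows.
mergedSys   <- merge(systems, systypes, by = "TYPEID")
mergedUsage <- merge(sysusage, mergedSys, by = "SID")

print(mergedUsage)
```

Each usage row keeps its measurements and gains the columns of its system and system type. `idaMerge` expresses the same kind of join, but it runs inside the database.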

A distribution histogram for different amounts of memory used:

In [None]:
# Obtain a random sample of 1000 data points for visualization
dfSample <- idaSample(mergedUsage[,c("MEMUSED", "USERS")], 1000)

In [None]:
d2 <- ggplot(dfSample) + geom_histogram(aes(x=MEMUSED, y=..count../sum(..count..)), binwidth=1000, colour="black", fill="white") + scale_y_continuous(labels=percent_format()) + labs(title="Memory Used") + labs(x="Memory Used",y="Frequency")
ggsave(filename = "img2.jpg", plot = d2, height=2, width=3, scale=2, dpi=120)
d2
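
The y aesthetic in the histogram divides each bin count by the total count, which turns counts into relative frequencies. A rough base R sketch of that calculation follows; the synthetic values and the bin boundaries are assumptions for illustration, and ggplot2 may align its bins differently.

```r
# Synthetic stand-in for the MEMUSED column (values are made up).
set.seed(42)
memused <- runif(1000, min = 0, max = 8000)

# Bins of width 1000, chosen here for illustration only.
breaks <- seq(0, 8000, by = 1000)
counts <- table(cut(memused, breaks = breaks, include.lowest = TRUE))

# Relative frequency per bin: count divided by the total count.
rel_freq <- counts / sum(counts)
print(round(rel_freq, 3))
```

The relative frequencies sum to 1, which is why the plot's y axis can be labeled as a percentage.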

A distribution histogram for different amounts of active users:

In [None]:
# Plot a histogram that shows relative frequency of various numbers of users.
d3 <- ggplot(dfSample) + geom_histogram(aes(x=USERS, y=..count../sum(..count..)), binwidth=7, colour="black", fill="white") + scale_y_continuous(labels=percent_format()) + labs(title="Active Users") + labs(x="Number of Users",y="Frequency")
ggsave(filename = "img3.jpg", plot = d3, height=2, width=3, scale=2, dpi=120)
d3

### Train a linear prediction model for MEMUSED based on USERS

In [None]:
lm1 <- idaLm(MEMUSED~USERS, mergedUsage)

lm1
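
As a sketch of what a one-predictor linear regression computes, assuming an ordinary least-squares fit, the slope and intercept can be derived from the covariance and the means. The example below uses synthetic local data and base R `lm()` rather than `idaLm`; the variable names and the true slope and intercept are made up.

```r
# Synthetic data: memory grows roughly linearly with the number of users.
set.seed(1)
users   <- sample(1:100, 200, replace = TRUE)
memused <- 500 + 40 * users + rnorm(200, sd = 100)

# Closed-form least-squares estimates for a single predictor.
slope     <- cov(users, memused) / var(users)
intercept <- mean(memused) - slope * mean(users)

# The same fit using base R lm() for comparison.
fit <- lm(memused ~ users)

print(c(slope = slope, intercept = intercept))
print(coef(fit))
```

Both approaches give the same line. Note that base R `coef()` lists the intercept first, so the order of coefficients should always be checked before indexing into them by position.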

### Visualize the model

A scatter plot of number of users vs. memory usage, overlaid with the calculated linear relationship. In the linear model, the first coefficient is the slope of the line in MB/user and the second coefficient is the y-intercept.

In [None]:
d1 <- ggplot(dfSample, aes(x=USERS, y=MEMUSED)) + geom_point(shape=1) + labs(title="Memory used") + labs(x="Number of Users",y="Memory Used (MB)") + stat_function(fun=function(x){x*lm1$coefficients[1]+lm1$coefficients[2]}, aes(colour="blue")) + scale_colour_manual("Legend", values=c("blue"), labels=c("idaLM"))
ggsave(filename = "img1.jpg", plot = d1, height=2, width=3, scale=2, dpi=120)
d1
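
The line drawn by `stat_function` can also be used to predict memory usage for a given number of users. The helper below is a hypothetical sketch: `predict_memused` and the coefficient values are made up, and the vector is ordered slope first, intercept second, as the plot code above assumes.

```r
# Hypothetical coefficients in the order assumed above: slope, then intercept.
coefs <- c(USERS = 40, Intercept = 500)

# Predict memory usage (MB) for a vector of user counts.
predict_memused <- function(users, coefs) {
  users * coefs[[1]] + coefs[[2]]
}

print(predict_memused(c(0, 10, 25), coefs))
```

With these made-up coefficients, zero users yields the intercept (500 MB), and each additional user adds 40 MB.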

### Persist the linear model
The coefficients are stored inside an R object in the dashDB database.

In [None]:
# Create a pointer to the private R object storage table of the current user. 
myModels <- ida.list(type="private")

# List all objects in the private R object storage table of the current user.
writeLines("Private R object storage table:")
names(myModels)
writeLines("")

In [None]:
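# Store the model coefficients under the key 'model1'.
myModels['model1'] <- lm1$coefficients
# Re-read the private storage table and list its objects to confirm the write.
myModels <- ida.list(type="private")
names(myModels)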

### Clean up

In [None]:
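# Remove the stored model from the private R object storage table.
myModels['model1'] <- NULL
# Drop the views behind the merged data frames.
idaDropView(mergedSys@table)
idaDropView(mergedUsage@table)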

idaClose(con)