## Community Tutorial 01: Using RHadoop to predict the number of visitors

**This tutorial is part of the Community tutorial series for the [Hortonworks Sandbox](http://hortonworks.com/products/sandbox) - a single-node Hadoop cluster running in a virtual machine. [Download](http://hortonworks.com/products/sandbox) the Hortonworks Sandbox to run this and other tutorials in the series.**

### Summary

This tutorial describes how to use RHadoop on the Hortonworks Data Platform and how running R on Hadoop creates a powerful analytics platform.

### Clickstream Data

Clickstream data is the information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files.

Clickstream data is described in the existing tutorial [10 - Visualizing Website Clickstream Data](../Sandbox/T10_Visualizing_Website_Clickstream_Data.md). This tutorial uses the same dataset, so it must be uploaded into the `omniturelogs` table.

### R Language

[![](./images/tutorial-01/Rlogo.png?raw=true)](./images/tutorial-01/Rlogo.png?raw=true)

R is a language for statistics, mathematics, and data science, created by statisticians for statisticians. It offers more than 5,000 implemented algorithms and an impressive base of over two million users with domain knowledge worldwide. However, it has one big disadvantage: all data is held in memory and processed in a single thread.

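As a toy illustration of this in-memory, single-threaded model (not part of the original tutorial), here is a small base-R session that fits a linear model entirely in memory; the data and numbers are made up:

```r
# all data lives in R's memory; the fit runs in a single thread
days   <- 1:15
clicks <- 10 + 2 * days                   # a perfectly linear toy series
model  <- lm(clicks ~ days)               # ordinary least squares, in memory
predict(model, data.frame(days = 16))     # extrapolate one day ahead: 42
```

This works only while the dataset fits in RAM; RHadoop exists precisely to move such computations onto a cluster.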
### Using R on Hadoop

Hadoop is written in Java, and Java is its main programming language. Although Java is the main language, you can still write MapReduce (MR) jobs in other languages - for example, Python, R, or Ruby - through the Streaming API. Because streaming works over Unix streams, not all features available in Java are available in R. Unfortunately, the Streaming API is not easy to use directly, which is why RHadoop was created. It still uses streaming, but has the following advantages:
* no need to manage key changes in the reducer
* no need to control function output manually
* simple MapReduce API for R
* enables access to files on HDFS
* R code can run locally or on Hadoop without changes

RHadoop is a set of packages for the R language; you install and load each of them the same way as any other R package. It currently contains the following packages:
* `rmr` - provides the MapReduce interface; mappers and reducers can be written in R and then called from R
* `rhdfs` - provides access to HDFS; using simple R functions, you can copy data between R memory, the local file system, and HDFS
* `rhbase` - required if you are going to access HBase
* `plyrmr` - common data manipulation operations
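A minimal sketch of what `rmr` code looks like, assuming the rmr2 package is installed; thanks to the local backend, it runs without a Hadoop cluster at all (the input numbers here are arbitrary):

```r
library(rmr2)

# run MapReduce in-process instead of on a cluster;
# the same code works on Hadoop after switching the backend back
rmr.options(backend = "local")

ints <- to.dfs(1:10)                 # put data into the (local) DFS
out  <- mapreduce(input = ints,
                  map = function(k, v) keyval(v %% 2, v))
from.dfs(out)                        # read the key-value pairs back into R
```

This "run locally or on Hadoop without changes" property is exactly the last advantage listed above.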

### Installation

To enable RHadoop on an existing Hadoop cluster, the following steps must be applied:
1. install R on each node in the cluster
2. install the RHadoop packages with their dependencies on each node
3. set up environment variables; run R from the console and check that these variables are accessible
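The package installation in step 2 might look roughly as follows; the dependency list and tarball file names are illustrative only (the actual tarballs come from the RevolutionAnalytics GitHub releases, and versions change over time):

```r
# CRAN dependencies commonly needed by rmr2/rhdfs (illustrative list)
install.packages(c("Rcpp", "RJSONIO", "itertools", "digest",
                   "functional", "reshape2", "stringr", "plyr",
                   "caTools", "rJava"))

# the RHadoop packages themselves are installed from downloaded source
# tarballs; the file names below are placeholders for whatever versions
# you downloaded
install.packages("rmr2.tar.gz",  repos = NULL, type = "source")
install.packages("rhdfs.tar.gz", repos = NULL, type = "source")
```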
The environment variables required by RHadoop are `HADOOP_CMD` and `HADOOP_STREAMING`; the details are described in the [RHadoop Wiki](https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr). To facilitate development, installing RStudio Server is recommended; it provides the same GUI for development as standalone RStudio. The RStudio web UI is accessible right after installation at `<host>:8787`; use the login and password of any non-system user on this host.

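For example, the variables can be set from within R itself; the paths below are typical for a Sandbox-like host but must be adjusted to wherever Hadoop and the streaming jar actually live on your cluster:

```r
# the paths are examples only; check your own installation
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")

# verify that the variables are visible to the R session
Sys.getenv(c("HADOOP_CMD", "HADOOP_STREAMING"))
```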
### Overview

We are going to predict the number of visitors in the next period for each country/state using RHadoop. We will do it with linear regression.

## Step 1. Create a table with the required data

In the “Loading Data into the Hortonworks Sandbox” tutorial, we loaded website data files into Hortonworks. **Omniture logs** – website log files containing information such as URL, timestamp, IP address, geocoded IP address, and user ID (SWID). First of all, we will create a table with the data we need.

[![](./images/tutorial-01/Omniture-hive.png?raw=true)](./images/tutorial-01/Omniture-hive.png?raw=true)

## Step 2. Prepare the Omniture dataset for regression

The Omniture dataset contains information from 2012-03-01 to 2012-03-15 (Hive query `select country, ts, count(*) from omniture2 group by country, ts`). For many countries there are gaps; we are going to put 0 into these gaps and remove series with too few elements, because they are not enough for regression. The result of this query looks like the following:

[![](./images/tutorial-01/Omniture-hive-res.png?raw=true)](./images/tutorial-01/Omniture-hive-res.png?raw=true)

We need to save this result for the next step by clicking 'Download as CSV'. Then save the result to HDFS, into the folder `/user/hue/hdp/in`:

[![](./images/tutorial-01/Omniture-hdfs-in.png?raw=true)](./images/tutorial-01/Omniture-hdfs-in.png?raw=true)

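The gap-filling rule described above can be sketched in plain R (the tutorial's actual job does this inside the reducer); the toy observations below are made up:

```r
# a country with observations only on days 1, 2, 4 and 7
observed <- data.frame(day = c(1, 2, 4, 7), clicksCount = c(5, 3, 6, 2))

# left-join against the full range of days, then turn the gaps into zeros
full <- merge(data.frame(day = 1:15), observed, by = "day", all.x = TRUE)
full$clicksCount[is.na(full$clicksCount)] <- 0
full                         # 15 rows, with zeros on the missing days
```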
## Step 3. Predict the number of visitors for the next period

Please do not treat the calculations here as academic research; this "prediction" has only one purpose: to show the power of RHadoop. So, let's open RStudio and write our first MapReduce job with RHadoop. RStudio on a local environment can be used, as well as the web UI (available at `<host>:8787` under your non-system user). The initial dataset contains the number of clicks for each day (with possible gaps) from March 1 till March 15. The following program forecasts the number of clicks for March 16.

[![](./images/tutorial-01/Omniture-hdfs-RSTUDIO.png?raw=true)](./images/tutorial-01/Omniture-hdfs-RSTUDIO.png?raw=true)

The whole listing is the following:

```{r}
library(rmr2)

# utility function - insert a new row into an existing data frame
insertRow <- function(target.dataframe, new.day) {
  new.row <- c(new.day, 0)
  target.dataframe <- rbind(target.dataframe, new.row)
  target.dataframe <- target.dataframe[order(c(1:(nrow(target.dataframe) - 1), new.day - 0.5)), ]
  row.names(target.dataframe) <- 1:nrow(target.dataframe)
  return(target.dataframe)
}

mapper = function(null, line) {
  # skip the header row
  if( "ts" != line[[2]] )
    keyval(line[[1]], paste(line[[1]], line[[2]], line[[3]], sep=","))
}

reducer = function(key, val.list) {
  # not possible to build a good enough regression for small datasets
  if( length(val.list) < 10 ) return()

  list <- list()
  # extract the country
  country <- unlist(strsplit(val.list[[1]], ","))[[1]]
  # extract the time interval and click count
  for(line in val.list) {
    l <- unlist(strsplit(line, split=","))
    x <- list(as.POSIXlt(as.Date(l[[2]]))$mday, l[[3]])
    list[[length(list)+1]] <- x
  }
  # convert to numeric values
  list <- lapply(list, as.numeric)
  # create the frame
  frame <- do.call(rbind, list)
  colnames(frame) <- c("day", "clicksCount")

  # set the click count to 0 for days missing from the input dataset
  i = 1
  # we must have 15 days in the dataset
  while(i < 16) {
    if(i <= nrow(frame))
      curDay <- frame[i, "day"]

    # day i is missing from the existing frame
    if( curDay != i ) {
      frame <- insertRow(frame, i)
    }
    i <- i + 1
  }

  # build a linear model for the prediction
  model <- lm(clicksCount ~ day, data=as.data.frame(frame))
  # predict for the next day
  p <- predict(model, data.frame(day=16))

  keyval(country, p)
}

# run the MapReduce job
mapreduce(input="/user/hue/hdp/in",
          input.format=make.input.format("csv", sep = ","),
          output="/user/hue/hdp/out",
          output.format="csv",
          map=mapper,
          reduce=reducer
)
```
As soon as the MapReduce job finishes, the result will be available in the expected directory as several CSV-formatted files. The directory structure is the regular one for MapReduce jobs:
[![](./images/tutorial-01/Omniture-hdfs-result.png?raw=true)](./images/tutorial-01/Omniture-hdfs-result.png?raw=true)
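Assuming the rhdfs package is installed and loaded, the output can also be inspected directly from R; the snippet below is only a sketch and requires a running cluster with the environment variables set as in the Installation section:

```r
library(rhdfs)
hdfs.init()          # must be called once before other rhdfs functions

# list the part files produced by the job
hdfs.ls("/user/hue/hdp/out")
```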