This tutorial is for social recommendation based on collaborative filtering. Processing
is done with a pipeline of MR jobs. Some of them are optional. All commands are in the
shell script brec.sh. Please make necessary changes for path, environment etc. in that file
before proceeding.
Dependency Jars
===============
Please refer to resource/jar_dpendency.txt
Shell script
============
Please modify the following variables in brec.sh to suit your environment
JAR_NAME
CHOMBO_JAR_NAME
HDFS_BASE_DIR
PROP_FILE
HDFS_META_BASE_DIR
Creating HDFS directories
=========================
Please create the HDFS directories manually as you need them, as shown below
hadoop fs -mkdir .....
You should create the directory defined by HDFS_BASE_DIR and the various sub directories
under it
Rating Data
===========
There are 3 ways to generate rating data.
1. Explicit
In reality you will hardly ever do recommendations based on explicit rating. This option
is provided for development purposes only
Follow step 1 and then 3
2. Implicit
This will be the typical way to create input rating
Follow steps 2.1 through 2.4 and 3
3. Blended Rating
This is a more advanced approach. It combines implicit rating with explicit
and rating data from CRM systems
Follow steps 2.1 through 2.9 and 3
Map reduce workflow
===================
There are some core MR jobs that are mandatory. They constitute CF processing.
The final output of this is obtained after executing step 7.
The output from step 7 can be used by one or more optional MR jobs for various post
processing depending on the need. The number of such optional MR jobs and the order
in which they are executed depend on the requirement.
1. Explicit Rating Data Generation (optional)
==============================================
You could generate rating data directly, by following the steps here. If not
you could generate implicit rating data as described in the next section. The format
of rating data generated is as follows. Each line has rating by all users for a
given item
item1,user1:3,user2:4,..
item2,user2:5,user4:2,...
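As an illustration of the format above, here is a minimal Python sketch that parses one
line into an item ID and a map of user ratings. The function name parse_compact_rating is
hypothetical and not part of the project. Integer ratings are an assumption based on the
sample lines.

```python
def parse_compact_rating(line):
    """Split 'item,user:rating,...' into (item_id, {user_id: rating})."""
    fields = [f.strip() for f in line.strip().split(",") if f.strip()]
    item_id = fields[0]
    ratings = {}
    for field in fields[1:]:
        # Each remaining field is assumed to look like 'user1:3'.
        user_id, rating = field.rsplit(":", 1)
        ratings[user_id] = int(rating)
    return item_id, ratings


item, ratings = parse_compact_rating("item1,user1:3,user2:4")
print(item, ratings)  # item1 {'user1': 3, 'user2': 4}
```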
You can use ratings.rb as follows to generate ratings data and save it in a file.
It requires util.rb to be in the ../lib directory. You can get util.rb
from the visitante project at the following location
https://github.com/pranab/visitante/tree/master/script/ruby
1.1 generate explicit rating data
./brec.sh genExplicitRating <item_count> <user_count> <user_per_items_multipler> <rating_file>
In the output, the average number of users rating an item will be
item_count * user_per_items_multipler / user_count. So choose user_per_items_multipler
as per your need. User count should be an order of magnitude higher than item count.
The value of user_per_items_multipler should be 5 or more. A reasonable value is 10.
1.2 Copy rating data to HDFS
./brec.sh expExplicitRating <rating_file>
2. Implicit Rating Generator and Blending Rating (optional)
===========================================================
This MR job is optional. Run it if you want to generate rating data
from user engagement click stream data. If you have generated rating data directly
with the script, skip this.
2.1 Export user engagement schema to HDFS. You can use engaementEvent.json as an
example schema file
./brec.sh expSchema <schema_file>
2.2 Generate user engagement data as follows
./brec.sh genHistEvent <item_count> <user_count> <average_event_count_per_user> <output_file_name>
item_count = number of items e.g. 1000
user_count = number of users e.g. 100
average_event_count_per_user = number of events per customer (a reasonable
number is around 10)
The data generated has the following fields
user ID,session ID,item ID,event type,time of event
2.3 Copy the input data file to HDFS input directory. This is the script to run MR
./brec.sh expEvent
2.4 Run MR
./brec.sh genRating <event_data_file>
The following additional steps should be performed if you want to aggregate explicit
and implicit ratings and, optionally, additional rating data from CRM systems
2.5 Create explicit rating data (if you want to use it)
Uses the implicit rating data file and generates explicit ratings for some of those
users that have converted, i.e., have the maximum implicit rating
./brec.sh createExplicitRating <implicit_rating_file> <percentage_rated> <expl_rating_file>
The argument percentage_rated is the percentage of users who have converted
(e.g., purchased) and also explicitly rated. Recommended value is 50
2.6 Export explicit rating data to HDFS (if you want to use it)
./brec.sh putExplicitRating <expl_rating_file> [clean]
Use the clean option if you want all existing files from the HDFS directory removed
before exporting
2.7 Create customer service rating data (if you want to use it)
Uses the implicit rating data file and generates customer service ratings for some of those
users that have converted, i.e., have the maximum implicit rating
./brec.sh createExplicitRating <implicit_rating_file> <percentage_rated> <cust_svc_rating_file>
The argument percentage_rated is percentage of users who have converted
(e.g., purchased) and contacted customer service. Recommended value is 30
2.8 Export customer service rating data to HDFS (if you want to use it)
./brec.sh putExplicitRating <cust_svc_rating_file>
2.9 Generate blended or aggregated rating by running MR job
./brec.sh blendRating
3. Rating data formatter (optional)
===================================
If you are using implicit rating or blended rating, the rating data is generated in
an exploded format as userID, itemID, rating. However, the Rating Correlation MR below
expects data in a compact format as itemID, userID1:rating1, userID2:rating2
3.1 Run MR
./brec.sh compactRating <rating_input>
Depending on the option chosen for rating data generation, rating_input should be rate
or erat. It points to the HDFS directory containing the rating data
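To make the two formats concrete, the following Python sketch converts exploded records
(userID,itemID,rating) into compact lines (itemID,userID1:rating1,...). It only
illustrates the transformation on in-memory data and is not the MR job itself. The
function name compact_ratings and the sample IDs are hypothetical.

```python
from collections import defaultdict


def compact_ratings(exploded_lines):
    """Group 'user,item,rating' records by item into 'item,user:rating,...' lines."""
    by_item = defaultdict(list)
    for line in exploded_lines:
        user_id, item_id, rating = [f.strip() for f in line.split(",")]
        by_item[item_id].append(f"{user_id}:{rating}")
    # Sort by item ID only to make the output order deterministic.
    return [",".join([item_id] + pairs) for item_id, pairs in sorted(by_item.items())]


exploded = ["u1,i1,3", "u2,i1,4", "u2,i2,5"]
for line in compact_ratings(exploded):
    print(line)
# i1,u1:3,u2:4
# i2,u2:5
```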
4. Rating Statistics (optional)
===============================
If the parameter input.rating.stdDev.weighted.average is set to true for UtilityAggregator,
then rating std dev calculation is necessary. In our example, we are not using it.
4.1 Run MR
./brec.sh ratingStat <rating_dat
5. Rating correlation
=====================
Correlation can be calculated in various ways. We will be using cosine similarity.
5.1 Run MR
./brec.sh correlation
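For reference, cosine similarity between two items can be sketched in Python as below.
Each item is represented by a map of user ID to rating, and a user who has not rated an
item is treated as a zero entry. That treatment of missing ratings is an assumption and
the MR job may handle it differently. The function name is hypothetical.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity of two {user_id: rating} vectors; missing users count as 0."""
    common_users = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[u] * ratings_b[u] for u in common_users)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


item1 = {"user1": 3, "user2": 4}
item2 = {"user2": 5, "user4": 2}
print(round(cosine_similarity(item1, item2), 4))  # 0.7428
```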
6. Rating Predictor
===================
The next step is to predict ratings based on the items already rated by the user and the
correlation calculated in step 5
6.1 The rating file should be renamed so that it has the same prefix
as defined by the config param rating.file.prefix (prRating here). It should
be repeated if there are multiple reducer output files
./brec.sh renameRatingFile part-r-00000 prRating0.txt
6.2 The rating stat file should be renamed so that it has the same prefix
as defined by the config param rating.stat.file.prefix. It should
be repeated if there are multiple reducer output files
./brec.sh renameRatingStat
6.3 Run MR as follows
./brec.sh ratingPred [withStat]
The last argument is necessary if rating stats data is used
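When there are several reducer output files, the rename command in step 6.1 has to be
issued once per file. The Python sketch below only builds the command strings so they can
be reviewed before running them. The helper name rename_commands is hypothetical, and
numbering the target files as prRating0.txt, prRating1.txt and so on is an assumption
extrapolated from the single example in step 6.1.

```python
def rename_commands(part_files, prefix="prRating"):
    """Build one renameRatingFile command per reducer output file."""
    return [
        f"./brec.sh renameRatingFile {name} {prefix}{index}.txt"
        for index, name in enumerate(sorted(part_files))
    ]


for command in rename_commands(["part-r-00001", "part-r-00000"]):
    print(command)
# ./brec.sh renameRatingFile part-r-00000 prRating0.txt
# ./brec.sh renameRatingFile part-r-00001 prRating1.txt
```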
7. Aggregate Rating Predictor
=============================
This predicts the final rating by aggregating contribution from all items rated
by the user
7.1 Run MR
./brec.sh ratingAggr
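The idea behind steps 6 and 7 can be sketched in Python as follows. Each item the user has
rated contributes to the predicted rating of a correlated, unrated item, and the
contributions are then aggregated. A correlation-weighted average is used here as an
assumption. The actual aggregation in the MR jobs is controlled by the configuration and
may differ. All names and sample values are hypothetical.

```python
def predict_ratings(user_ratings, correlations):
    """user_ratings: {item: rating}; correlations: {(item_a, item_b): weight}."""
    weighted_sum = {}
    weight_total = {}
    for (item_a, item_b), corr in correlations.items():
        # A pair is usable in either direction: rated item -> unrated target item.
        for rated, target in ((item_a, item_b), (item_b, item_a)):
            if rated in user_ratings and target not in user_ratings:
                weighted_sum[target] = weighted_sum.get(target, 0.0) + corr * user_ratings[rated]
                weight_total[target] = weight_total.get(target, 0.0) + corr
    # Aggregate the contributions as a correlation-weighted average.
    return {
        target: weighted_sum[target] / weight_total[target]
        for target in weighted_sum
        if weight_total[target] > 0
    }


user_ratings = {"i1": 4, "i2": 2}
correlations = {("i1", "i3"): 0.5, ("i2", "i3"): 0.5, ("i1", "i2"): 0.9}
print(predict_ratings(user_ratings, correlations))  # {'i3': 3.0}
```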
8. Business Goal Injection (optional)
=====================================
This is an optional MR, that combines scores of various business goals with
recommendation score using relative weighting to come up with the final score.
In our example, we are not using it.
8.1 Copy business score data
./brec.sh storeBizData <local_biz_data_file_name> <hdfs_biz_data_file_name>
hdfs_biz_data_file_name should have the prefix as defined by the config param
biz.goal.file.prefix
8.2 Run MR
./brec.sh injectBizGoal
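Relative weighting of a recommendation score and business goal scores can be illustrated
with the Python sketch below. It assumes a plain weighted average with all scores on the
same scale. The function name, the weights and the sample scores are hypothetical, and the
MR job may combine the scores differently.

```python
def blend_scores(rec_score, goal_scores, rec_weight, goal_weights):
    """Weighted average of a recommendation score and business goal scores."""
    if len(goal_scores) != len(goal_weights):
        raise ValueError("one weight is required per business goal score")
    total_weight = rec_weight + sum(goal_weights)
    weighted = rec_weight * rec_score + sum(
        weight * score for weight, score in zip(goal_weights, goal_scores)
    )
    return weighted / total_weight


# Recommendation score 80 with weight 3, one business goal score 40 with weight 1.
print(blend_scores(80, [40], 3, [1]))  # 70.0
```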
9. Order by User ID (optional)
==============================
It orders the final result by userID, so that you get all recommendation for
a given user together
9.1 Run MR. Unsorted data dir name (not full path) needs to be specified, because
unsorted data location depends on post processing done with predicted rating
(e.g., business goal injection)
./brec.sh sortByUser <unsorted_data_hdfs_dir>
10. Individual Novelty (optional)
=================================
Novelty can be blended in with predicted rating as follows
10.1 Calculate user item engagement distribution
./brec.sh genEngageDistr
10.2 Generate item novelty score
./brec.sh genItemNovelty
10.3 Rename predicted rating file to have prefix as defined in config param
first.type.prefix. The command should be repeatedly executed if there are multiple
reducer output files
./brec.sh renamePredRatingFile part-r-00000 prRatings0.txt
10.4 Join predicted rating and novelty
./brec.sh joinRatingNovelty
10.5 Weighted average of predicted rating and novelty
./brec.sh injectItemNovelty
11. Item popularity global (optional)
=====================================
It can be used to solve the cold start problem. Popularity is calculated by taking
a weighted average of various rating stats
11.1 Run MR
./brec.sh itemPopularity
12. Positive feedback driven rank reordering (optional)
=======================================================
The actual implicit rating based on user engagement data is used together with
predicted rating to generate modified ratings
12.1 Rename rating data file
./brec.sh renameRating part-r-00000 <rating_file_name>
rating_file_name should have the prefix defined by the config param actual.rating.file.prefix
12.2 Modify rating
./brec.sh posFeedbackReorder
13. Attribute diffusion based diversification
=============================================
It makes the recommendation result more diverse by maintaining a minimum rank
distance between items with same attribute values
13.1 Generate item attribute data
./brec.sh genItemAttrData <event_data_file>
It generates data with itemID and two attributes (category and brand)
13.2 Store item attribute data in HDFS
./brec.sh storeItemAttrData <local_file_name> <hdfs_file_name>
local_file_name = file name from step 13.1
hdfs_file_name = file name in hdfs with prefix as set in item.metadta.file.prefix
13.3 Run user, item attribute aggregation MR
./brec.sh userItemAttrAggr
13.4 Rename user, item attribute data file
./brec.sh renameUserItemAttrData <mr_generated_file_name> <new_file_name>
mr_generated_file_name = MR generated file name from step 13.3
new_file_name = new file name with prefix as set through item.metadta.file.prefix
13.5 Run attribute diffusion based diversifier
./brec.sh diversifyWithAttr
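The minimum rank distance idea can be illustrated with a greedy reordering in Python. At
each position it picks the highest ranked remaining item that shares no attribute value
with the items placed within the last min_distance - 1 positions, and falls back to the
highest ranked remaining item when none qualifies. This greedy strategy, the function name
and the sample data are assumptions for illustration and not necessarily what the MR job
does.

```python
def diversify(ranked_items, attributes, min_distance):
    """Reorder ranked_items so items sharing an attribute value are spread apart."""
    remaining = list(ranked_items)
    result = []
    while remaining:
        # Items placed close enough to the next position to violate the distance.
        recent = result[-(min_distance - 1):] if min_distance > 1 else []

        def conflicts(item):
            return any(
                value == other_value
                for other in recent
                for value, other_value in zip(attributes[item], attributes[other])
            )

        # Highest ranked non-conflicting item, else the highest ranked remaining one.
        chosen = next((item for item in remaining if not conflicts(item)), remaining[0])
        remaining.remove(chosen)
        result.append(chosen)
    return result


# attributes maps itemID -> (category, brand)
attributes = {
    "i1": ("shoes", "acme"),
    "i2": ("shoes", "zen"),
    "i3": ("hats", "bolt"),
    "i4": ("shoes", "bolt"),
}
print(diversify(["i1", "i2", "i3", "i4"], attributes, 2))  # ['i1', 'i3', 'i2', 'i4']
```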
Configuration
=============
Configuration for all the MR jobs is in reco.properties. Feel free to make changes as needed.
For the number of reducers there is a global config param num.reducer. For each MR job there is
a job specific config param named xxx.num.reducer. If this job specific param is defined,
it overrides the global param.
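The override rule for the reducer count can be sketched in Python as below, with the
properties held in a dictionary. The job prefixes xxx and yyy are placeholders, the
function name is hypothetical, and the fallback default of 1 when neither param is set is
an assumption.

```python
def reducer_count(props, job_prefix, default=1):
    """Job specific <job_prefix>.num.reducer wins over the global num.reducer."""
    job_key = f"{job_prefix}.num.reducer"
    value = props.get(job_key, props.get("num.reducer", default))
    return int(value)


props = {"num.reducer": "4", "xxx.num.reducer": "8"}
print(reducer_count(props, "xxx"))  # 8, job specific param is defined
print(reducer_count(props, "yyy"))  # 4, falls back to the global param
```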