resource/tutorial.txt

This tutorial covers social recommendation based on collaborative filtering. Processing
is done with a pipeline of MR jobs, some of which are optional. All commands are in the
shell script brec.sh. Please make the necessary changes for paths, environment, etc. in that
file before proceeding.


Dependency Jars
===============
Please refer to resource/jar_dpendency.txt

Shell script
============
Please modify the following variables in brec.sh to suit your environment:
JAR_NAME
CHOMBO_JAR_NAME
HDFS_BASE_DIR
PROP_FILE
HDFS_META_BASE_DIR


Creating HDFS directories
=========================
Please create the HDFS directories manually, as you need them, as below

hadoop fs -mkdir .....

You should create the directory defined by HDFS_BASE_DIR and the various sub directories
under it.

Rating Data
===========
There are 3 ways to generate rating data.

1. Explicit
In reality you will hardly ever do recommendations based on explicit rating. This option
is provided for development purposes only.
Follow step 1 and then step 3.

2. Implicit
This will be the typical way to create input rating data.
Follow steps 2.1 through 2.4 and then step 3.

3. Blended Rating
This is a more advanced approach. It combines implicit rating with explicit rating
and rating data from CRM systems.
Follow steps 2.1 through 2.9 and then step 3.


Map reduce workflow
===================
There are some core MR jobs that are mandatory. They constitute CF processing.
The final output of this is obtained after executing step 7.

The output from step 7 can be used by one or more optional MR jobs for various post
processing, depending on the need. The number of such optional MR jobs and the order
in which they are executed depend on the requirement.


1. Explicit Rating Data Generation (optional)
=============================================
You can generate rating data directly by following the steps here. If not,
you can generate implicit rating data as described in the next section. The format
of the generated rating data is as follows. Each line has the ratings by all users for a
given item

item1,user1:3,user2:4,..
item2,user2:5,user4:2,...
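
To make the format concrete, here is a minimal Python sketch that reads one such line
back into an item ID and a user-to-rating map. It is not part of the project; the function
name parse_compact_rating is hypothetical, and it assumes comma separated fields and
integer ratings, as in the sample lines above.

```python
def parse_compact_rating(line):
    """Split 'item,user:rating,...' into (item_id, {user_id: rating})."""
    fields = line.strip().split(",")
    item_id = fields[0]
    ratings = {}
    for field in fields[1:]:
        # Each remaining field is assumed to look like 'user1:3'
        user_id, rating = field.split(":")
        ratings[user_id] = int(rating)
    return item_id, ratings


item_id, ratings = parse_compact_rating("item1,user1:3,user2:4")
print(item_id, ratings)
```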

You can use ratings.rb as follows to generate ratings data and save it in a file.
It requires util.rb to be in the ../lib directory. You can get util.rb
from the visitante project at the following location
https://github.com/pranab/visitante/tree/master/script/ruby

1.1 Generate explicit rating data
./brec.sh genExplicitRating <item_count> <user_count> <user_per_items_multipler> <rating_file>

In the output, the average number of users rating an item will be
item_count * user_per_items_multipler / user_count. So choose user_per_items_multipler
as per your need. User count should be an order of magnitude higher than item count.
The value of user_per_items_multipler should be 5 or more. A reasonable value is 10.

1.2 Copy rating data to HDFS
./brec.sh expExplicitRating <rating_file>

2. Implicit Rating Generator and Blending Rating (optional)
===========================================================
This MR task is optional. Run it if you want to generate rating data
from user engagement click stream data. If you have generated rating data directly
from the script, skip this.

2.1 Export user engagement schema to HDFS. You can use engaementEvent.json as an
example schema file
./brec.sh expSchema <schema_file>

2.2 Generate user engagement data as follows
./brec.sh genHistEvent <item_count> <user_count> <average_event_count_per_user> <output_file_name>
item_count = number of items e.g. 1000
user_count = number of users e.g. 100
average_event_count_per_user = average number of events per user (a reasonable
number is around 10)

The data generated has the following fields
user ID,session ID,item ID,event type,time of event
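
As an illustration of this record layout, the following Python sketch parses one event
line into named fields. It is only a sketch: the Event type, the parse_event function and
the sample values are hypothetical, and all fields are kept as strings because the
tutorial does not specify their encoding.

```python
from collections import namedtuple

# Field order follows the list above; the Python names are illustrative
Event = namedtuple(
    "Event", ["user_id", "session_id", "item_id", "event_type", "event_time"]
)


def parse_event(line):
    """Parse one comma separated engagement event line into an Event."""
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    return Event(*fields)


# Hypothetical sample record, not taken from real generator output
event = parse_event("u100,s2001,i42,2,1380000000")
print(event.user_id, event.item_id, event.event_type)
```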

2.3 Copy the input data file to the HDFS input directory used by the MR job
./brec.sh expEvent

2.4 Run MR
./brec.sh genRating <event_data_file>

The following additional steps should be performed if you want to aggregate explicit
and implicit rating and, optionally, additional rating data from CRM systems

2.5 Create explicit rating data (if you want to use it)
This uses the implicit rating data file and generates explicit ratings for some of those
users who have converted, i.e., have the maximum implicit rating
./brec.sh createExplicitRating <implicit_rating_file> <percentage_rated> <expl_rating_file>

The argument percentage_rated is the percentage of users who have converted
(e.g., purchased) and also explicitly rated. The recommended value is 50.

2.6 Export explicit rating data to HDFS (if you want to use it)
./brec.sh putExplicitRating <expl_rating_file> [clean]
Use the clean option if you want all existing files removed from the HDFS directory
before exporting.

2.7 Create customer service rating data (if you want to use it)
This uses the implicit rating data file and generates customer service ratings for some of
those users who have converted, i.e., have the maximum implicit rating
./brec.sh createExplicitRating <implicit_rating_file> <percentage_rated> <cust_svc_rating_file>

The argument percentage_rated is the percentage of users who have converted
(e.g., purchased) and contacted customer service. The recommended value is 30.

2.8 Export customer service rating data to HDFS (if you want to use it)
./brec.sh putExplicitRating <cust_svc_rating_file>

2.9 Generate blended or aggregated rating by running MR job
./brec.sh blendRating

3. Rating data formatter (optional)
===================================
If you are using implicit rating or blended rating, the rating data is generated in
an exploded format as userID, itemID, rating. However, the Rating Correlation MR below
expects data in a compact format as itemID, userID1:rating1, userID2:rating2

3.1 Run MR
./brec.sh compactRating <rating_input>
Depending on the option chosen for rating data generation, rating_input should be rate
or erat. It points to the HDFS directory containing the rating data
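
The transformation performed by this step can be pictured with the following in-memory
Python sketch. It is not the MR job itself; the function name compact_ratings is
hypothetical, and it assumes comma separated input lines and output without spaces, as in
the sample lines of section 1.

```python
from collections import defaultdict


def compact_ratings(exploded_lines):
    """Group 'userID,itemID,rating' lines into 'itemID,userID:rating,...' lines."""
    by_item = defaultdict(list)
    for line in exploded_lines:
        user_id, item_id, rating = [f.strip() for f in line.split(",")]
        by_item[item_id].append("%s:%s" % (user_id, rating))
    # Items are emitted in the order they were first seen
    return [",".join([item_id] + pairs) for item_id, pairs in by_item.items()]


exploded = ["user1,item1,3", "user2,item1,4", "user2,item2,5"]
for compact_line in compact_ratings(exploded):
    print(compact_line)
```

In the actual pipeline the grouping by item is done by the MR job across reducers, so the
order of users within a line may differ from this sketch.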


4. Rating Statistics (optional)
===============================
If the parameter input.rating.stdDev.weighted.average is set to true for UtilityAggregator,
then rating std dev calculation is necessary. In our example, we are not using it.

4.1 Run MR
./brec.sh ratingStat <rating_dat


5. Rating correlation
=====================
Correlation can be calculated in various ways. We will be using cosine similarity.

5.1 Run MR
./brec.sh correlation
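
For readers unfamiliar with cosine similarity, here is a minimal Python sketch of the
textbook formula applied to two items' rating vectors. It is an illustration only: the
function name is hypothetical, a user missing from one item is treated as a rating of
zero, and the MR job may handle missing ratings and scaling differently.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity of two {user_id: rating} maps."""
    common_users = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[u] * ratings_b[u] for u in common_users)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no ratings on one side, so no measurable similarity
    return dot / (norm_a * norm_b)


item1 = {"user1": 3, "user2": 4}
item2 = {"user2": 5, "user4": 2}
print(cosine_similarity(item1, item2))
```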

6. Rating Predictor
===================
The next step is to predict rating based on items already rated by the user and the
correlation calculated in step 5

6.1 The rating file should be renamed so that it has the same prefix
as defined by the config param rating.file.prefix (prRating here). The command should
be repeated if there are multiple reducer output files
./brec.sh renameRatingFile part-r-00000 prRating0.txt

6.2 The rating stat file should be renamed so that it has the same prefix
as defined by the config param rating.stat.file.prefix. The command should
be repeated if there are multiple reducer output files
./brec.sh renameRatingStat

6.3 Run MR as follows
./brec.sh ratingPred [withStat]
The last argument is necessary if rating stats data is used

7. Aggregate Rating Predictor
=============================
This predicts the final rating by aggregating the contributions from all items rated
by the user

7.1 Run MR
./brec.sh ratingAggr
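
One common way to aggregate such contributions in item based collaborative filtering is
a correlation weighted average, sketched below in Python. This is an assumption made for
illustration, not a description of what the ratingAggr job computes; the function and
parameter names are hypothetical.

```python
def aggregate_prediction(user_ratings, correlations):
    """Predict a rating for one target item.

    user_ratings: {item_id: rating} for items the user has already rated
    correlations: {item_id: correlation between that item and the target item}
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for item_id, rating in user_ratings.items():
        corr = correlations.get(item_id)
        if corr is None:
            continue  # no known correlation with the target item
        weighted_sum += corr * rating
        weight_total += abs(corr)
    if weight_total == 0:
        return None  # nothing to base a prediction on
    return weighted_sum / weight_total


print(aggregate_prediction({"item1": 4, "item2": 2}, {"item1": 0.5, "item2": 0.5}))
```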

8. Business Goal Injection (optional)
=====================================
This is an optional MR that combines the scores of various business goals with the
recommendation score, using relative weighting, to come up with the final score.
In our example, we are not using it.

8.1 Copy business score data
./brec.sh storeBizData <local_biz_data_file_name> <hdfs_biz_data_file_name>

hdfs_biz_data_file_name should have the prefix as defined by the config param
biz.goal.file.prefix

8.2 Run MR
./brec.sh injectBizGoal


9. Order by User ID (optional)
==============================
It orders the final result by userID, so that you get all recommendations for
a given user together

9.1 Run MR. The unsorted data dir name (not the full path) needs to be specified, because
the unsorted data location depends on the post processing done with the predicted rating
(e.g., business goal injection)
./brec.sh sortByUser <unsorted_data_hdfs_dir>


10. Individual Novelty (optional)
=================================
Novelty can be blended in with the predicted rating as follows

10.1 Calculate user item engagement distribution
./brec.sh genEngageDistr

10.2 Generate item novelty score
./brec.sh genItemNovelty

10.3 Rename the predicted rating file to have the prefix defined in the config param
first.type.prefix. The command should be repeated if there are multiple
reducer output files
./brec.sh renamePredRatingFile  part-r-00000 prRatings0.txt

10.4 Join predicted rating and novelty
./brec.sh joinRatingNovelty

10.5 Weighted average of predicted rating and novelty
./brec.sh injectItemNovelty
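
The weighted average in step 10.5 can be pictured with the short Python sketch below.
It only illustrates the arithmetic: the function name, the weights and the sample scores
are hypothetical, and it assumes the predicted rating and the novelty score are on the
same scale.

```python
def blend_scores(predicted_rating, novelty, rating_weight, novelty_weight):
    """Weighted average of a predicted rating and a novelty score."""
    total_weight = rating_weight + novelty_weight
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return (rating_weight * predicted_rating + novelty_weight * novelty) / total_weight


# Predicted rating weighted three times as heavily as novelty
print(blend_scores(80, 40, 3, 1))
```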

11. Item popularity global (optional)
=====================================
It can be used to solve the cold start problem. Popularity is calculated by taking a
weighted average of various rating stats

11.1 Run MR
./brec.sh itemPopularity

12. Positive feedback driven rank reordering (optional)
=======================================================
The actual implicit rating based on user engagement data is used together with the
predicted rating to generate modified ratings

12.1 Rename rating data file
./brec.sh renameRating part-r-00000 <rating_file_name>

rating_file_name should have the prefix defined by the config param actual.rating.file.prefix

12.2 Modify rating
./brec.sh posFeedbackReorder

13. Attribute diffusion based diversification
=============================================
It makes the recommendation result more diverse by maintaining a minimum rank
distance between items with the same attribute values

13.1 Generate item attribute data
./brec.sh genItemAttrData <event_data_file>
It generates data with itemID and two attributes (category and brand)

13.2 Store item attribute data in HDFS
./brec.sh storeItemAttrData <local_file_name> <hdfs_file_name>
local_file_name = file name from step 13.1
hdfs_file_name = file name in hdfs with prefix as set in item.metadta.file.prefix

13.3 Run user, item attribute aggregation MR
./brec.sh userItemAttrAggr

13.4 Rename user, item attribute data file
./brec.sh renameUserItemAttrData <mr_generated_file_name> <new_file_name>

mr_generated_file_name = MR generated file name from step 13.3
new_file_name = new file name with prefix as set through item.metadta.file.prefix

13.5 Run attribute diffusion based diversifier
./brec.sh diversifyWithAttr
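
The idea of keeping a minimum rank distance can be illustrated with a simple greedy
reordering, sketched below in Python. This is an assumption made for illustration and not
the algorithm used by the diversifyWithAttr job. For brevity it uses a single attribute
per item rather than both category and brand, and all names are hypothetical.

```python
def diversify(ranked_items, attributes, min_distance):
    """Reorder a ranked list so items sharing an attribute value are kept
    at least min_distance ranks apart whenever that is possible."""
    remaining = list(ranked_items)
    result = []
    while remaining:
        # Attribute values used by the most recently placed items
        if min_distance > 1:
            recent = {attributes[i] for i in result[-(min_distance - 1):]}
        else:
            recent = set()
        # Best ranked remaining item that does not clash; otherwise fall back
        # to the best ranked remaining item
        pick = next((i for i in remaining if attributes[i] not in recent), remaining[0])
        remaining.remove(pick)
        result.append(pick)
    return result


ranked = ["item1", "item2", "item3", "item4"]
category = {"item1": "shoes", "item2": "shoes", "item3": "hats", "item4": "hats"}
print(diversify(ranked, category, 2))
```

With a minimum distance of 2, the two shoe items and the two hat items end up
interleaved instead of adjacent.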


Configuration
=============
The configuration for all the MR jobs is in reco.properties. Feel free to make changes as needed

For the number of reducers there is a global config param num.reducer. For each MR job there
is a job specific config param named xxx.num.reducer. If the job specific param is defined,
it overrides the global param.
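
The override rule can be summarized with the following Python sketch. It is only an
illustration of the lookup order described above: the function name, the job prefix abc
and the fallback default of 1 are hypothetical and are not taken from reco.properties.

```python
def resolve_num_reducer(config, job_prefix, default=1):
    """Return the reducer count for a job, preferring the job specific param."""
    job_key = "%s.num.reducer" % job_prefix
    if job_key in config:
        return int(config[job_key])
    # Fall back to the global param, then to an assumed default
    return int(config.get("num.reducer", default))


config = {"num.reducer": "2", "abc.num.reducer": "5"}
print(resolve_num_reducer(config, "abc"))
print(resolve_num_reducer(config, "xyz"))
```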