This repository contains Java source code for finding mutual friends and analyzing the Yelp dataset using Hadoop MapReduce.
Write a MapReduce program in Hadoop that implements a simple "mutual/common friend list of two friends". The key idea is that if two people are friends, they are likely to have many mutual/common friends. Given any two users as input, the program outputs the list of user IDs of their mutual friends. For example:
- Alice's friends are Bob, Sam, Sara, Nancy
- Bob's friends are Alice, Sam, Clara, Nancy
- Sara's friends are Alice, Sam, Clara, Nancy

Alice and Bob are friends, so their mutual friend list is [Sam, Nancy]. Sara and Bob are not friends, so their mutual friend list is empty.
The input contains the adjacency list and has multiple lines in the following format: <User><TAB><Friends> Here, <User> is a unique integer ID corresponding to a unique user and <Friends> is a comma-separated list of unique IDs (<User> IDs) corresponding to the friends of the user. Note that friendships are mutual (i.e., edges are undirected): if A is a friend of B, then B is also a friend of A. The data provided is consistent with that rule, as there is an explicit entry for each side of each edge. So when you build a pair for users A and B, always use either (A, B) or (B, A), but not both.
Output: The output should contain one line per pair of friends in the following format: <User_A>, <User_B><TAB><Mutual/Common Friend List> where <User_A> and <User_B> are unique IDs corresponding to users A and B (A and B are friends), and <Mutual/Common Friend List> is a comma-separated list of unique IDs corresponding to the mutual friends of users A and B.
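The pairing rule and the intersection step described above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration of the idea, not the code packaged in MutualFriend.jar: the class `MutualFriendsSketch` and its methods are hypothetical names, the in-memory `TreeMap` stands in for the shuffle phase, and the input line is assumed to separate the user from the friend list with a tab.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class MutualFriendsSketch {

    // Order-independent key, so (A, B) and (B, A) land in the same group.
    public static String pairKey(long a, long b) {
        return Math.min(a, b) + "," + Math.max(a, b);
    }

    // "Map" step: for one adjacency line (assumed "<user>\t<f1,f2,...>"),
    // emit (pairKey(user, friend), friend set of user) for every friend.
    public static void map(String line, Map<String, List<Set<Long>>> shuffled) {
        String[] parts = line.split("\t");
        if (parts.length < 2 || parts[1].isEmpty()) {
            return; // a user without friends contributes no pairs
        }
        long user = Long.parseLong(parts[0].trim());
        Set<Long> friends = new LinkedHashSet<>();
        for (String f : parts[1].split(",")) {
            friends.add(Long.parseLong(f.trim()));
        }
        for (long friend : friends) {
            shuffled.computeIfAbsent(pairKey(user, friend), k -> new ArrayList<>())
                    .add(friends);
        }
    }

    // "Reduce" step: a pair of friends receives two friend sets, one from each
    // side of the edge; their intersection is the mutual friend list.
    public static String reduce(String key, List<Set<Long>> lists) {
        if (lists.size() < 2) {
            return key + "\t"; // only one side seen: nothing to intersect
        }
        Set<Long> common = new LinkedHashSet<>(lists.get(0)); // copy before retainAll
        common.retainAll(lists.get(1));
        StringBuilder sb = new StringBuilder();
        for (long id : common) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(id);
        }
        return key + "\t" + sb;
    }

    // Runs map over every line, then reduce over every grouped key.
    public static List<String> run(List<String> lines) {
        Map<String, List<Set<Long>>> shuffled = new TreeMap<>();
        for (String line : lines) {
            map(line, shuffled);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Set<Long>>> e : shuffled.entrySet()) {
            out.add(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny made-up graph, not taken from soc-LiveJournal1Adj.txt.
        List<String> input = Arrays.asList("1\t2,3,4", "2\t1,3,4", "3\t1,2", "4\t1,2");
        run(input).forEach(System.out::println);
    }
}
```

Because each user emits its whole friend list once per friend, every pair of actual friends ends up with exactly two lists at the reducer, which is why a single intersection is enough.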
Find the output described above for the following pairs: (0, 4), (20, 22939), (1, 29826), (6222, 19272), (28041, 28056).
File: MutualFriend.jar
Steps to run jar file:
- From the terminal go inside MutualFriend directory
- Delete output directory - "/user/krupali/output1" if it already exists
- Type the following command: hadoop jar MutualFriend.jar MutualFriend /user/krupali/input/soc-LiveJournal1Adj.txt /user/krupali/output1
- Output is at : /user/krupali/output1/part-r-00000
- To get the output for a specific pair of friends, say 0,4, type the following command: hdfs dfs -cat /user/krupali/output1/part-r-00000 | grep "0,4<press Ctrl+v tab>"
  Output: 0,4 8,14,15,18,27,72,80,74,77
- Similarly, to get the output for the other pairs:
  - 20,22939: hdfs dfs -cat /user/krupali/output1/part-r-00000 | grep "20,22939<press Ctrl+v tab>"
    Output: 20,22939 1,5
  - 1,29826: hdfs dfs -cat /user/krupali/output1/part-r-00000 | grep "1,29826<press Ctrl+v tab>"
    Output: 1,29826
  - 6222,19272: hdfs dfs -cat /user/krupali/output1/part-r-00000 | grep "6222,19272<press Ctrl+v tab>"
    Output: 6222,19272 19263,19280,19281,19282
  - 28041,28056: hdfs dfs -cat /user/krupali/output1/part-r-00000 | grep "28041,28056<press Ctrl+v tab>"
    Output: 28041,28056 6245,28054,28061

Note: The reducer output separates the pair from the friend list with a TAB, so inside the grep pattern press Ctrl+v and then Tab to insert a literal tab character. Without the tab, grep may match unrelated lines or not give the expected output.
Answer this question using the dataset from Question 1. Find the friend pairs whose number of common friends is among the top 10 across all pairs, and output them in decreasing order of that number. Output format: <User_A>, <User_B><TAB><Mutual/Common Friend Number>
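Selecting the ten largest counts does not require sorting every pair. The following plain-Java sketch keeps a bounded min-heap and returns the survivors in decreasing order. It is an assumption about how such a step could look, not the code in TopTenFriends.jar; the class `TopTenPairsSketch`, the method `topN`, and the sample counts are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopTenPairsSketch {

    // Returns up to n lines of the form "pair<TAB>count", largest count first.
    public static List<String> topN(Map<String, Integer> counts, int n) {
        // Min-heap on the count: the smallest of the current top n sits at the
        // head and is evicted as soon as the heap grows past n entries.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Map.Entry<String, Integer> e = heap.poll();
            out.add(e.getKey() + "\t" + e.getValue());
        }
        Collections.reverse(out); // the heap drains in increasing order
        return out;
    }

    public static void main(String[] args) {
        // Made-up mutual friend counts keyed by "User_A,User_B".
        Map<String, Integer> counts = new HashMap<>();
        counts.put("1,2", 3);
        counts.put("3,4", 5);
        counts.put("5,6", 1);
        topN(counts, 2).forEach(System.out::println);
    }
}
```

Pairs with equal counts come out in no particular order in this sketch; a tie-breaking rule on the pair key would have to be added if a deterministic order is needed.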
File: TopTenFriends.jar
Steps to run jar file:
- From the terminal go inside MutualFriend directory
- Delete output directories - "/user/krupali/output2_1" and "/user/krupali/output2_2" if they already exist
- Type the following command: hadoop jar TopTenFriends.jar TopTenFriends /user/krupali/input/soc-LiveJournal1Adj.txt /user/krupali/output2_1 /user/krupali/output2_2
- Output is at : /user/krupali/output2_2/part-r-00000

========================================================================

In this question, you will apply Hadoop MapReduce to derive some statistics from the Yelp dataset.
------------------------------------- Data set Info -------------------------------------------
The dataset files are as follows; columns are separated using '::':
- business.csv
- review.csv
- user.csv

Dataset Description:
The dataset comprises three CSV files: user.csv, business.csv and review.csv.

business.csv contains basic information about local businesses. It has the following columns: "business_id"::"full_address"::"categories"
- 'business_id': a unique identifier for the business
- 'full_address': localized address
- 'categories': [localized category names]

review.csv contains the star rating given by a user to a business. Use user_id to associate a review with others by the same user, and business_id to associate it with others of the same business. It has the following columns: "review_id"::"user_id"::"business_id"::"stars"
- 'review_id': a unique identifier for the review
- 'user_id': the identifier of the authoring user
- 'business_id': the identifier of the reviewed business
- 'stars': star rating (integer 1-5) given by the user to the business

user.csv contains aggregate information about a single user across all of Yelp. It has the following columns: "user_id"::"name"::"url"
- 'user_id': unique user identifier
- 'name': first name and last initial, like 'Matt J.'; this column has been anonymized to preserve privacy
- 'url': URL of the user on Yelp

NB: '::' is the column separator in the files.
List the business_id, full address and categories of the top 10 businesses by average rating. This requires the review.csv and business.csv files. Use the reduce-side join and job chaining techniques to answer this problem.
Sample output:
businessid full address categories avg rating
xdf12344444444, CA 91711 List['Local Services', 'Carpet Cleaning'] 5.0
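The two chained steps (average the stars per business, then join with the business details and rank) can be sketched in plain Java as follows. This is a hedged illustration rather than the contents of TopTenBusinessRatings.jar: the class `TopBusinessesSketch`, its method names, the `R::`/`B::` source tags, and the sample rows are all assumptions, and an in-memory map plays the role of the shuffle that groups records by business_id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopBusinessesSketch {

    // Job 1 (sketch): average stars per business_id from review rows of the
    // form review_id::user_id::business_id::stars.
    public static Map<String, Double> averageStars(List<String> reviewLines) {
        Map<String, double[]> sumAndCount = new HashMap<>();
        for (String line : reviewLines) {
            String[] cols = line.split("::");
            if (cols.length < 4) {
                continue; // skip malformed rows
            }
            double[] acc = sumAndCount.computeIfAbsent(cols[2], k -> new double[2]);
            acc[0] += Double.parseDouble(cols[3]);
            acc[1] += 1;
        }
        Map<String, Double> avg = new HashMap<>();
        sumAndCount.forEach((id, acc) -> avg.put(id, acc[0] / acc[1]));
        return avg;
    }

    // Job 2 (sketch): reduce-side join. Records from both inputs are tagged by
    // source and grouped under the join key (business_id); each group is then
    // merged into one row, and the rows are ranked by average rating.
    public static List<String> joinAndRank(Map<String, Double> avg,
                                           List<String> businessLines, int n) {
        Map<String, List<String>> grouped = new HashMap<>();
        avg.forEach((id, a) ->
                grouped.computeIfAbsent(id, k -> new ArrayList<>()).add("R::" + a));
        for (String line : businessLines) {
            String[] cols = line.split("::");
            if (cols.length < 3) {
                continue;
            }
            grouped.computeIfAbsent(cols[0], k -> new ArrayList<>())
                    .add("B::" + cols[1] + "\t" + cols[2]);
        }
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, List<String>> g : grouped.entrySet()) {
            String rating = null;
            String details = null;
            for (String tagged : g.getValue()) {
                if (tagged.startsWith("R::")) {
                    rating = tagged.substring(3);
                } else if (details == null) {
                    details = tagged.substring(3); // keep the first business row
                }
            }
            if (rating != null && details != null) { // inner join: need both sides
                rows.add(new String[] {g.getKey(), details, rating});
            }
        }
        // Highest average first; ties broken by business_id for a stable order.
        rows.sort(Comparator
                .comparingDouble((String[] r) -> Double.parseDouble(r[2]))
                .reversed()
                .thenComparing((String[] r) -> r[0]));
        List<String> out = new ArrayList<>();
        for (String[] r : rows.subList(0, Math.min(n, rows.size()))) {
            out.add(r[0] + "\t" + r[1] + "\t" + r[2]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows that only mimic the column layout of the CSV files.
        List<String> reviews = Arrays.asList("r1::u1::b1::5", "r2::u2::b1::4", "r3::u1::b2::3");
        List<String> businesses = Arrays.asList("b1::1 Main St::List(Cafe)", "b2::2 Oak Ave::List(Bar)");
        joinAndRank(averageStars(reviews), businesses, 10).forEach(System.out::println);
    }
}
```

In a real Hadoop job the tagging would happen in the mappers and the merge in the reducer; the sketch only mirrors that data flow.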
File: TopTenBusinessRatings.jar
Steps to run jar file:
- From the terminal go inside YelpDataSetAnalysis folder
- Delete output directories - "/user/krupali/output3_1" and "/user/krupali/output3_2" if they already exist
- Type the following command: hadoop jar TopTenBusinessRatings.jar TopTenBusinessRatings /user/krupali/input/review.csv /user/krupali/output3_1 /user/krupali/input/business.csv /user/krupali/output3_2
- Output is at : /user/krupali/output3_2/part-r-00000
Using the Yelp dataset, list the 'user id' and 'rating' of users that reviewed businesses located in "Palo Alto". The required files are 'business' and 'review'.
Use the in-memory join technique to answer this problem. Hint: load all the data in business.csv into the distributed cache.
Sample output
User id Rating
0WaCdhr3aXb0G0niwTMGTg 4.0
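An in-memory (map-side) join loads the smaller table once and probes it for every record of the larger one. The plain-Java sketch below shows that shape under stated assumptions: `PaloAltoReviewsSketch` and its methods are hypothetical names, a `HashSet` stands in for the data a mapper would load from the distributed cache during setup, and matching the city by substring on full_address is an assumed rule, not necessarily what Question4.jar does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PaloAltoReviewsSketch {

    // Setup step (sketch): scan the business rows once and remember the ids of
    // businesses whose full_address mentions the city.
    public static Set<String> businessIdsIn(List<String> businessLines, String city) {
        Set<String> ids = new HashSet<>();
        for (String line : businessLines) {
            String[] cols = line.split("::");
            if (cols.length >= 2 && cols[1].contains(city)) {
                ids.add(cols[0]);
            }
        }
        return ids;
    }

    // Map step (sketch): for each review row, emit "user_id<TAB>stars" when its
    // business_id is in the in-memory set. No reduce phase is needed for the join.
    public static List<String> filterReviews(List<String> reviewLines, Set<String> ids) {
        List<String> out = new ArrayList<>();
        for (String line : reviewLines) {
            String[] cols = line.split("::");
            if (cols.length >= 4 && ids.contains(cols[2])) {
                out.add(cols[1] + "\t" + cols[3]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows that only mimic the column layout of the CSV files.
        List<String> businesses = Arrays.asList(
                "b1::1 Main St, Palo Alto, CA::List(Cafe)",
                "b2::2 Oak Ave, Tempe, AZ::List(Bar)");
        List<String> reviews = Arrays.asList("r1::u1::b1::4.0", "r2::u2::b2::5.0");
        Set<String> ids = businessIdsIn(businesses, "Palo Alto");
        filterReviews(reviews, ids).forEach(System.out::println);
    }
}
```

This approach only works when the cached side fits in each mapper's memory, which is why the hint suggests caching business.csv rather than review.csv.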
File: Question4.jar
Steps to run jar file:
- From the terminal go inside YelpDataSetAnalysis folder
- Delete output directory - "/user/krupali/output4" if it already exists
- Type the following command: hadoop jar Question4.jar Question4 /user/krupali/input/business.csv /user/krupali/input/review.csv /user/krupali/output4
- Output is at : /user/krupali/output4/part-r-00000