
Data is fetched from Stack Exchange, transformed using Pig, then stored and queried in Hive. Additionally, the TF-IDF of the top 10 users is calculated using Hive.


nrohit78/PigHive_StackExhangeData


Task Details:

  1. Acquire the top 200,000 posts by view count from Stack Exchange (https://data.stackexchange.com/stackoverflow/queries).
  2. Using Pig or MapReduce, extract, transform, and load the data as applicable.
  3. Using Hive and/or MapReduce, get:
     I. The top 10 posts by score.
     II. The top 10 users by post score.
     III. The number of distinct users who used the word "Hadoop" in one of their posts.
  4. Using MapReduce, Pig, or Hive, calculate the per-user TF-IDF (submit only the top 10 terms for each of the top 10 users from query 3.II).
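
To make task 4 concrete, here is a minimal Python sketch of a per-user TF-IDF calculation. It is not the repository's Hive implementation; it only illustrates the arithmetic under some assumptions: each user's posts are concatenated into one "document", term frequency is the term count divided by the user's total term count, and inverse document frequency is `log(N / df)` with `N` the number of users and `df` the number of users whose posts contain the term. The function names `tokenize` and `per_user_tfidf` and the `(user_id, text)` input shape are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def per_user_tfidf(posts, top_n=10):
    """Return {user_id: [(term, score), ...]} with the top_n terms per user.

    posts is an iterable of (user_id, text) pairs (an assumed input shape).
    All posts by one user are treated as a single document.
    """
    # Count terms per user across all of that user's posts.
    term_counts = defaultdict(Counter)
    for user_id, text in posts:
        term_counts[user_id].update(tokenize(text))

    # Document frequency: number of users whose posts contain each term.
    n_users = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    result = {}
    for user_id, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_users / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[user_id] = ranked[:top_n]
    return result


if __name__ == "__main__":
    sample = [(1, "hadoop hive hive"), (2, "hadoop pig")]
    for user, terms in per_user_tfidf(sample).items():
        print(user, terms)
```

With this definition, a term used by every user (such as "hadoop" in the sample) scores zero, so the top terms for each user are the ones that distinguish that user from the others. A Hive version would typically express the same three steps (per-user term counts, per-term user counts, and the final join) as separate queries.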
