big-data-intro

Using Hadoop and Spark for some basic exercises

Task 1:

Preprocess the data. Process the provided user query logs (search_data.sample) with Hadoop, stripping each clickUrl so that only the part before the first '/' remains.

Example input: zhidao.baidu.com/question/48881311

Example output: zhidao.baidu.com
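The stripping step can be sketched as a small Python mapper in the style of Hadoop Streaming. This is only an illustration, not the code in the task1 folder: the function names `strip_click_url` and `map_line` are hypothetical, and the sketch assumes the log is tab-separated with the clickUrl in the last field, which may not match the actual layout of search_data.sample.

```python
def strip_click_url(url):
    """Return the part of the URL before the first '/'."""
    return url.split("/", 1)[0]


def map_line(line, url_field=-1, sep="\t"):
    """Rewrite one log line so its clickUrl field keeps only the host part.

    Assumption: fields are separated by tabs and the clickUrl is the
    last field; adjust url_field and sep to the real log layout.
    """
    fields = line.rstrip("\n").split(sep)
    fields[url_field] = strip_click_url(fields[url_field])
    return sep.join(fields)


def run_mapper(lines):
    """Apply map_line to every non-empty input line (map-only job)."""
    for line in lines:
        if line.strip():
            yield map_line(line)
```

In a streaming job, `run_mapper` would typically be fed from `sys.stdin` and its results printed one per line; no reducer is needed because each line is transformed independently.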

Output from the MapReduce operation:

Task 2:

Rank the tokens that appear most often in the queried URLs. Tokenize the clickUrls in the query log, then rank the tokens by the number of times they appear. The output should be the top ten tokens and the number of times each appears.
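A minimal sketch of the counting logic in plain Python follows. The tokenization rule is an assumption, since the task does not define what a token is: here a token is any lowercase run of letters or digits, so '.' and '/' act as separators. The names `tokenize_url` and `top_tokens` are hypothetical. On a cluster, the same idea maps onto the usual word-count pattern (emit each token with a count of 1, sum per token, sort by count).

```python
import re
from collections import Counter

# Assumption: a token is a maximal run of ASCII letters or digits.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize_url(url):
    """Split a clickUrl into lowercase tokens."""
    return TOKEN_PATTERN.findall(url.lower())


def top_tokens(urls, n=10):
    """Count tokens across all URLs and return the n most frequent.

    Returns a list of (token, count) pairs, highest count first.
    """
    counts = Counter()
    for url in urls:
        counts.update(tokenize_url(url))
    return counts.most_common(n)
```

For example, `top_tokens(["zhidao.baidu.com/question/1", "www.baidu.com/"], n=2)` counts `baidu` and `com` twice each and every other token once.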

Output:

Task 3:

Rank the time periods (by minute) with the most queries. Count the number of queries in each minute, then rank the minutes from most to fewest queries. The output should be the top ten minutes and the number of queries in each.
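The per-minute count can be sketched the same way. This assumes each log line is tab-separated and begins with a timestamp that ends in `:SS` (for example `00:01:59`); the real format of search_data.sample may differ, and `minute_of` and `top_minutes` are hypothetical names used only for illustration.

```python
from collections import Counter


def minute_of(timestamp):
    """Drop the trailing ':SS' so that '00:01:59' becomes '00:01'."""
    return timestamp.rsplit(":", 1)[0]


def top_minutes(lines, n=10, time_field=0, sep="\t"):
    """Count queries per minute and return the n busiest minutes.

    Assumption: the timestamp sits in field time_field of a
    sep-separated line. Lines without that field are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) <= time_field or not fields[time_field]:
            continue
        counts[minute_of(fields[time_field])] += 1
    return counts.most_common(n)
```

Grouping on the truncated timestamp keeps the logic a plain count-by-key, which is the same shape as the token count in Task 2.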

Output:
