Using Hadoop and Spark for some basic exercises
Preprocess data. Process the provided user query logs (search_data.sample). Use Hadoop to strip each clickUrl in the query log so that only the part of the URL before the first ‘/’ remains.
Example input: zhidao.baidu.com/question/48881311
Example output: zhidao.baidu.com
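The stripping step can be sketched as a map-only function in Python, for example as the body of a Hadoop Streaming mapper. This is a minimal sketch under assumptions that the exercise text does not confirm: each log line is taken to be tab-separated with the clickUrl in the last field, and URLs are taken to carry no scheme prefix such as http://, as in the example above. The names strip_click_url and map_line are hypothetical.

```python
def strip_click_url(url: str) -> str:
    """Return the part of the URL before the first '/'."""
    # Assumption: no "http://" prefix, matching the example in the exercise.
    return url.strip().split("/", 1)[0]


def map_line(line: str, url_field: int = -1, sep: str = "\t") -> str:
    """Rewrite one log line so its clickUrl field holds only the stripped URL.

    Assumption: fields are tab-separated and the clickUrl is the last field.
    Adjust url_field and sep to match the real layout of search_data.sample.
    """
    fields = line.rstrip("\n").split(sep)
    fields[url_field] = strip_click_url(fields[url_field])
    return sep.join(fields)


# In a Hadoop Streaming mapper, map_line would be applied to every line
# read from standard input and the result written to standard output.
print(strip_click_url("zhidao.baidu.com/question/48881311"))
```

Because every line is handled independently, no reducer is needed for this step.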
Output from the MapReduce operation:
Rank the tokens that appear most often in the clicked URLs. Tokenize the clickUrls in the query log, then rank the tokens by the number of times they appear. The output should be the ten most frequent tokens and the number of times each appears.
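The count-then-rank logic for this task can be sketched in plain Python. The exercise does not define what a token is, so this sketch assumes a token is any maximal run of ASCII letters or digits in the URL, lowercased; tokenize_url and top_tokens are hypothetical names. In Spark the same idea would typically be expressed with flatMap for tokenizing and reduceByKey for counting, followed by taking the ten largest counts.

```python
import re
from collections import Counter

# Assumption: tokens are separated by any character that is not a letter
# or digit ('.', '/', '?', '=', and so on).
TOKEN_SPLIT = re.compile(r"[^0-9A-Za-z]+")


def tokenize_url(url: str) -> list:
    """Split a clickUrl into lowercase alphanumeric tokens."""
    return [t.lower() for t in TOKEN_SPLIT.split(url.strip()) if t]


def top_tokens(urls, n: int = 10) -> list:
    """Return the n most frequent tokens as (token, count) pairs."""
    counts = Counter()
    for url in urls:
        counts.update(tokenize_url(url))
    return counts.most_common(n)


sample_urls = ["zhidao.baidu.com/question/48881311", "www.baidu.com/"]
for token, count in top_tokens(sample_urls, 2):
    print(token, count)
```

If tokens should instead be, say, only the dot-separated parts of the host name, only TOKEN_SPLIT needs to change.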
Output:
Rank the time periods (by minute) with the most queries. Count the number of queries in each minute, then rank the minutes from most to fewest queries. The output should be the ten minutes with the most queries and the number of queries in each.
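The per-minute count can be sketched the same way. The timestamp layout of search_data.sample is not given here, so this sketch assumes the timestamp is the first tab-separated field and ends in ":SS" (for example HH:MM:SS), so that dropping the final colon-separated part yields the minute. The names minute_key and top_minutes are hypothetical.

```python
from collections import Counter


def minute_key(timestamp: str) -> str:
    """Truncate a timestamp ending in ':SS' to minute resolution."""
    # Assumption: seconds are the last ':'-separated component.
    return timestamp.strip().rsplit(":", 1)[0]


def top_minutes(lines, n: int = 10, time_field: int = 0, sep: str = "\t") -> list:
    """Return the n busiest minutes as (minute, query_count) pairs."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)
        counts[minute_key(fields[time_field])] += 1
    return counts.most_common(n)


sample_lines = [
    "00:00:01\tu1\tquery a\tzhidao.baidu.com",
    "00:00:59\tu2\tquery b\twww.baidu.com",
    "00:01:00\tu3\tquery c\twww.baidu.com",
]
for minute, count in top_minutes(sample_lines):
    print(minute, count)
```

If the real log uses a different timestamp format, only minute_key and time_field need to be adapted.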
Output: