PageRank_Implementations

We have implemented the PageRank algorithm in serial, parallel shared-memory, and distributed (map-reduce) versions, and compared the different implementation paradigms. This was our term project for CS425: Algorithms for Web-Scale Data, a CS technical elective course offered at Bilkent University in Fall 2018.

Brief Info

We used the OpenMP API, which supports multi-platform shared-memory multiprocessing in C++, for the parallel implementation, and the Apache Spark unified analytics engine for the distributed implementation (PySpark Map-Reduce). After implementing the algorithms, we tested them on three datasets from the Stanford Large Network Dataset Collection: the NotreDame web graph with 1.5 million edges, the Google web graph with 5.1 million edges, and the Patent Citation Network with 16.5 million edges. We then compared the execution time of each implementation as well as the resulting PageRank values.
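
For readers unfamiliar with the map-reduce formulation, here is a minimal PySpark sketch of the PageRank iteration, following the structure of the Apache Spark example cited under "Borrowed Ideas" below. The input file name, the iteration count, and the damping factor of 0.85 are illustrative assumptions, not the repository's actual configuration; dead-end handling is omitted (see Findings).

```python
# Minimal PySpark PageRank sketch. Assumes each input line is an edge
# "src dst" separated by whitespace; file name is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PageRankSketch").getOrCreate()

lines = spark.read.text("web-graph.txt").rdd.map(lambda r: r[0])
links = (lines.map(lambda line: tuple(line.split()[:2]))  # (src, dst) pairs
              .distinct()
              .groupByKey()                               # src -> [dst, ...]
              .cache())

# Start every page with rank 1.0.
ranks = links.mapValues(lambda _: 1.0)

for _ in range(10):  # fixed iteration count for the sketch
    # Map: each page sends rank / out-degree along every outgoing link.
    contribs = links.join(ranks).flatMap(
        lambda pair: [(dst, pair[1][1] / len(pair[1][0])) for dst in pair[1][0]]
    )
    # Reduce: sum contributions per page and apply the damping factor.
    ranks = (contribs.reduceByKey(lambda a, b: a + b)
                     .mapValues(lambda s: 0.15 + 0.85 * s))

print(ranks.top(5, key=lambda kv: kv[1]))  # five highest-ranked pages
spark.stop()
```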

Frameworks

Programming Languages

  • Python for Distributed Implementation
  • C++ for Serial and Parallel Implementation

Findings

As expected, we observed that the parallel implementation takes less time than the serial one. We also saw that the distributed Map-Reduce implementation is very efficient compared to the parallel implementation, and data preprocessing is faster in Spark. Nevertheless, this result comes with a trade-off: Spark setup requires its own time. Also, calculating the PageRank leakage caused by dead ends is not feasible in our distributed implementation. However, for graphs much larger than the datasets we used, the Spark setup time eventually becomes insignificant, so the distributed implementation is preferable to the parallel one. Moreover, with more sophisticated methods of calculating the PageRank leakage due to dead ends, this limitation of the distributed implementation can also be overcome.
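
To illustrate the dead-end issue mentioned above, here is a minimal serial Python sketch (not code from this repository, and not our distributed approach) of one common correction: the rank mass that would leak out through pages with no outgoing links is collected each iteration and redistributed uniformly. The graph representation and the damping factor of 0.85 are illustrative assumptions.

```python
# Serial PageRank sketch with uniform redistribution of dead-end leakage.
def pagerank_with_dead_ends(out_links, iterations=50, d=0.85):
    nodes = list(out_links)
    n = len(nodes)
    ranks = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        new_ranks = {v: (1.0 - d) / n for v in nodes}
        leaked = 0.0
        for v in nodes:
            if out_links[v]:
                share = d * ranks[v] / len(out_links[v])
                for u in out_links[v]:
                    new_ranks[u] += share
            else:
                leaked += d * ranks[v]  # dead end: this mass would be lost
        # Redistribute leaked mass uniformly so the ranks keep summing to 1.
        for v in nodes:
            new_ranks[v] += leaked / n
        ranks = new_ranks
    return ranks

# Example: node "c" is a dead end.
print(pagerank_with_dead_ends({"a": ["b", "c"], "b": ["c"], "c": []}))
```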

Borrowed Ideas

  • https://github.com/papadopoulosc/pagerank/blob/master/1st_parallel/pagerank_openmp.c
  • https://github.com/apache/spark/blob/master/examples/src/main/python/pagerank.py

Further Details

Please check our report, as well as our presentation.

Credits

  1. Naisila Puka: Distributed (Map-Reduce) Implementation, Report
  2. Endi Merkuri: Serial and Parallel Implementation, Conversion of datasets
  3. Fatbardh Feta: PySpark functions and OpenMP API research
