PageRank_Implementations

We have implemented the PageRank algorithm in serial, parallel shared-memory, and distributed (map-reduce) versions, and compared the different implementation paradigms. This was our term project for CS425: Algorithms for Web-Scale Data, a CS technical elective course offered at Bilkent University in Fall 2018.

Brief Info

We used the OpenMP API, which supports multi-platform shared-memory multiprocessing in C++, for the parallel implementation, and the Apache Spark unified analytics engine for the distributed implementation (PySpark Map-Reduce). After implementing the algorithms, we tested them on three datasets from the Stanford Large Network Dataset Collection: the NotreDame web graph with 1.5 million edges, the Google web graph with 5.1 million edges, and the Patent Citation Network with 16.5 million edges. We then compared the execution time of each implementation as well as the resulting PageRank values.
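
For readers unfamiliar with the map-reduce formulation, here is a minimal PySpark sketch of the PageRank iteration, following the structure of the Apache Spark example cited under "Borrowed Ideas" below. The input file name, the iteration count, and the damping factor of 0.85 are illustrative assumptions, not the repository's actual configuration; dead-end handling is omitted (see Findings).

```python
# Minimal PySpark PageRank sketch. Assumes each input line is an edge
# "src dst" separated by whitespace; file name is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PageRankSketch").getOrCreate()

lines = spark.read.text("web-graph.txt").rdd.map(lambda r: r[0])
links = (lines.map(lambda line: tuple(line.split()[:2]))  # (src, dst) pairs
              .distinct()
              .groupByKey()                               # src -> [dst, ...]
              .cache())

# Start every page with rank 1.0.
ranks = links.mapValues(lambda _: 1.0)

for _ in range(10):  # fixed iteration count for the sketch
    # Map: each page sends rank / out-degree along every outgoing link.
    contribs = links.join(ranks).flatMap(
        lambda pair: [(dst, pair[1][1] / len(pair[1][0])) for dst in pair[1][0]]
    )
    # Reduce: sum contributions per page and apply the damping factor.
    ranks = (contribs.reduceByKey(lambda a, b: a + b)
                     .mapValues(lambda s: 0.15 + 0.85 * s))

print(ranks.top(5, key=lambda kv: kv[1]))  # five highest-ranked pages
spark.stop()
```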

Frameworks

Programming Languages

  • Python for Distributed Implementation
  • C++ for Serial and Parallel Implementation

Findings

As expected, we observed that the parallel implementation takes less time than the serial one. We also saw that the distributed Map-Reduce implementation is very efficient compared to the parallel implementation, and data preprocessing is faster in Spark. Nevertheless, this result comes with a trade-off: Spark setup requires its own time. Also, calculating the PageRank leakage caused by dead ends is not feasible in our distributed implementation. However, for graphs much larger than the datasets we used, the Spark setup time eventually becomes insignificant, so the distributed implementation is preferable to the parallel one. Moreover, with more sophisticated methods of calculating the PageRank leakage due to dead ends, this limitation of the distributed implementation can also be overcome.
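
To illustrate the dead-end issue mentioned above, here is a minimal serial Python sketch (not code from this repository, and not our distributed approach) of one common correction: the rank mass that would leak out through pages with no outgoing links is collected each iteration and redistributed uniformly. The graph representation and the damping factor of 0.85 are illustrative assumptions.

```python
# Serial PageRank sketch with uniform redistribution of dead-end leakage.
def pagerank_with_dead_ends(out_links, iterations=50, d=0.85):
    nodes = list(out_links)
    n = len(nodes)
    ranks = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        new_ranks = {v: (1.0 - d) / n for v in nodes}
        leaked = 0.0
        for v in nodes:
            if out_links[v]:
                share = d * ranks[v] / len(out_links[v])
                for u in out_links[v]:
                    new_ranks[u] += share
            else:
                leaked += d * ranks[v]  # dead end: this mass would be lost
        # Redistribute leaked mass uniformly so the ranks keep summing to 1.
        for v in nodes:
            new_ranks[v] += leaked / n
        ranks = new_ranks
    return ranks

# Example: node "c" is a dead end.
print(pagerank_with_dead_ends({"a": ["b", "c"], "b": ["c"], "c": []}))
```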

Borrowed Ideas

  • https://github.com/papadopoulosc/pagerank/blob/master/1st_parallel/pagerank_openmp.c
  • https://github.com/apache/spark/blob/master/examples/src/main/python/pagerank.py

Further Details

Please check our report, as well as our presentation.

Credits

  1. Naisila Puka: Distributed (Map-Reduce) Implementation, Report
  2. Endi Merkuri: Serial and Parallel Implementation, Conversion of datasets
  3. Fatbardh Feta: PySpark functions and OpenMP API research
