SDM - Lab 2 @ UPC 👨🏻‍💻



Data drives the world. In this big data era, analysing large volumes of data has become ever more challenging and complex. Several different ecosystems have been developed, each trying to solve a particular class of problems. One of the main tools in the big data ecosystem is Apache Spark.

Apache Spark makes the analysis of big data significantly easier. It ships with implementations of many useful algorithms for data mining, data analysis, machine learning, and graph processing. Spark takes on the challenge of implementing sophisticated algorithms with tricky optimisations and the ability to run your code on a distributed cluster. It effectively solves problems like fault tolerance and provides a simple API for parallel computation.

GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.
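To make the abstraction concrete, here is a minimal sketch that builds a small property graph. It assumes a spark-shell session where sc is the active SparkContext; the vertex names and edge labels are made up for illustration:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertices: (id, property) pairs; here the property is a user name
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))

// Edges: directed, each carrying a property (a relationship label)
val relations: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(1L, 3L, "follows")))

// The property graph: Graph[vertex property, edge property]
val graph: Graph[String, String] = Graph(users, relations)

// Example query: the in-degree of every vertex
graph.inDegrees.collect().foreach(println)
```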

This repository serves as a starting point for working with the Spark GraphX API. As part of our SDM lab, we focus on getting a basic idea of how to work with Pregel and gaining hands-on experience with distributed processing of large graphs.

Pregel, originally developed by Google, is essentially a message-passing interface which facilitates the processing of large-scale graphs. Apache Spark's GraphX module provides the Pregel API, which allows us to write distributed graph programs and algorithms. For more details, kindly check out the original paper.
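As a taste of the API, below is a sketch of single-source shortest paths written with pregel, adapted from the GraphX programming guide; it again assumes a spark-shell session with sc available:

```scala
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.util.GraphGenerators

// A random graph with Double edge weights
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)

val sourceId: VertexId = 42L // the source vertex

// Every vertex starts at distance infinity, except the source at 0
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist), // vertex program: keep the best distance seen
  triplet => { // send message: relax edges that improve the destination's distance
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b) // merge messages: keep the smaller distance
)

println(sssp.vertices.collect().mkString("\n"))
```

In each superstep, vertices receive their merged messages, update their state with the vertex program, and send new messages along their edges; the computation stops once no more messages are sent.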


Before starting, you may need to set up your machine first. Please follow the guides below to set up Spark and Maven on your machine.

We have created a setup script which will set up brew, apache-spark, maven and a conda environment. If you are on a Mac machine, you can run the following commands:

git clone https://github.com/mohammadzainabbas/SDM-Lab-2.git
cd SDM-Lab-2 && sh scripts/setup.sh

If you are on Linux, you need to install Apache Spark yourself. You can follow this helpful guide to install Apache Spark, and install Maven via this guide.

We also recommend installing conda on your machine. You can set up conda from here.

Once you have conda, create a new environment via:

conda create -n spark_env python=3.8

Note: We are using Python 3.8 because Spark doesn't support Python 3.9 and above (at the time of writing).

Activate your environment:

conda activate spark_env

Now, you need to install pyspark:

pip install pyspark

Then point the PySpark driver at your conda environment's Python. If you are using bash:

echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.bashrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.bashrc
. ~/.bashrc

And if you are using zsh:

echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.zshrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.zshrc
. ~/.zshrc

Since this is a typical Maven project, you can run it however you'd normally run a Maven project. To make things easier, we provide two ways to run this project.

If you are using VS Code, change the args in the Launch Main configuration in the launch.json file located in the .vscode directory.
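For reference, a typical Launch Main entry looks roughly like the sketch below; the exact fields in this repository's launch.json, in particular the mainClass, may differ:

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "java",
      "name": "Launch Main",
      "request": "launch",
      "mainClass": "<your main class>",
      "args": ["exercise1"]
    }
  ]
}
```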

See the main class for the supported arguments.

Alternatively, just run the build script with one of the supported arguments:

sh scripts/build_n_run.sh exercise1

Note: exercise1 here is the argument you'd pass to run the first exercise.

Again, you can check the main class for the supported arguments.
