In this challenge, we download 100,000 public GitHub repositories and process the downloaded code. For each repository, the goal is to compute the following statistics for its Python code:
- Number of lines of code [this excludes comments, whitespace, and blank lines].
- List of external libraries/packages used.
- Nesting factor for the repository: the average depth of nested `for` loops throughout the code.
- Code duplication: the percentage of the code that is duplicated per file. If the same 4 consecutive lines of code (disregarding blank lines, comments, and other non-code items) appear in multiple places in a file, all occurrences except the first are considered duplicates.
- Average number of parameters per function definition in the repository.
- Average number of variables defined per line of code in the repository.
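The 4-consecutive-line duplication rule above can be sketched with a hash map over sliding windows of code lines. This is an illustrative sketch, not the project's actual implementation; the function name and the simple comment filter are assumptions.

```python
def duplication_percentage(lines, window=4):
    """Percent of code lines that fall inside a repeated `window`-line run.

    Sketch: hash every run of `window` consecutive non-blank, non-comment
    lines; any window seen before marks its lines as duplicates.
    """
    # Drop blank lines and full-line comments (a real analyzer would
    # also strip trailing comments, docstrings, etc.).
    code = [l.strip() for l in lines
            if l.strip() and not l.strip().startswith("#")]
    if len(code) < window:
        return 0.0
    seen = {}                       # window tuple -> first position
    duplicate = [False] * len(code)
    for i in range(len(code) - window + 1):
        key = tuple(code[i:i + window])
        if key in seen:
            # Repeated window: every line in it counts as duplicated.
            for j in range(i, i + window):
                duplicate[j] = True
        else:
            seen[key] = i
    return 100.0 * sum(duplicate) / len(code)
```

Using tuples of lines as dict keys gives O(1) average lookup per window, so the whole file is scanned in a single pass.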
To calculate the code statistics, Python's `ast` module is used to generate an Abstract Syntax Tree, which is then visited to calculate five of the statistics. To detect code duplication, a Python dict (hashmap) is used to compare every 4 consecutive lines of code in the repository.
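The AST-visiting approach can be sketched for two of the statistics (nesting factor and average parameters per function) with the standard-library `ast` module. Class and function names here are illustrative assumptions, and the parameter count is simplified to positional arguments only.

```python
import ast

class StatsVisitor(ast.NodeVisitor):
    """Collects for-loop nesting depths and per-function parameter counts."""

    def __init__(self):
        self.depths = []        # depth of every `for` loop encountered
        self.param_counts = []  # number of parameters per function def
        self._for_depth = 0

    def visit_For(self, node):
        # Track how deeply this loop is nested inside other for loops.
        self._for_depth += 1
        self.depths.append(self._for_depth)
        self.generic_visit(node)
        self._for_depth -= 1

    def visit_FunctionDef(self, node):
        # Simplification: counts positional parameters only.
        self.param_counts.append(len(node.args.args))
        self.generic_visit(node)

def analyze(source):
    """Return (nesting factor, average parameters per function)."""
    v = StatsVisitor()
    v.visit(ast.parse(source))
    nesting = sum(v.depths) / len(v.depths) if v.depths else 0.0
    avg_params = (sum(v.param_counts) / len(v.param_counts)
                  if v.param_counts else 0.0)
    return nesting, avg_params
```

A single tree walk like this can accumulate all the AST-derived statistics at once, which keeps per-file analysis linear in the size of the source.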
To perform the processing we use Kubernetes, Docker, and RabbitMQ. The `producer` adds a message to the queue for every repository that needs processing; using Kubernetes we then run several `workers` [5-10] that take messages from the queue, clone the repository, calculate the statistics, and add the result to a separate results queue in RabbitMQ.
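The producer → workers → results-queue flow can be illustrated with an in-process `queue.Queue` standing in for the RabbitMQ queues (the real system talks to the queues named in the `.env` file instead). All names and the placeholder statistics below are assumptions for illustration.

```python
import json
import queue
import threading

def producer(urls, url_queue):
    # Publish one message per repository URL.
    for url in urls:
        url_queue.put(url)

def worker(url_queue, results_queue):
    # Drain the URL queue; a real worker would clone the repo and
    # run the analyzer here instead of emitting a placeholder result.
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        stats = {"repo": url, "loc": 0}  # placeholder statistics
        results_queue.put(json.dumps(stats))

def run(urls, n_workers=5):
    url_queue, results_queue = queue.Queue(), queue.Queue()
    producer(urls, url_queue)
    threads = [threading.Thread(target=worker, args=(url_queue, results_queue))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [json.loads(results_queue.get()) for _ in range(results_queue.qsize())]
```

Because the queues decouple producing from consuming, the worker count can be scaled up or down (the [5-10] replicas above) without changing either end of the pipeline.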
The `results_parser` is responsible for consuming messages from the results queue and appending them to the JSON file. The final results are found in `results/results.json`.
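The results-parsing step can be sketched as follows, with a plain list of JSON strings standing in for the drained results queue; the function name and output shape are illustrative assumptions.

```python
import json

def parse_results(messages, out_path):
    """Decode queued result messages and write one combined JSON file."""
    results = [json.loads(m) for m in messages]
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
    return results
```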
| Class | Description |
|---|---|
| Worker | Connects to the RabbitMQ URLs queue, fetches a URL, uses RepoAnalyzer to process it, then adds the result to the results queue. |
| Producer | Loads the URLs file and adds the URLs to the queue for processing. |
| RepoAnalyzer | Clones a repository, scans for Python files, and performs the analysis using the Analyzer class. |
| ResultsParser | Fetches the results from the results queue and saves them to disk. |
Requirements:
- RabbitMQ
- Python 3
- Docker [recommended]
- Clone this repository
```shell
git clone https://github.com/melzareix/repo-analyzer.git
cd repo-analyzer
```
- Create a `.env` file with the following variables (you can also add these variables via Docker)
```
RABBIT_USERNAME=user
RABBIT_PASSWORD=user
RABBIT_HOST=127.0.0.1
RABBIT_PORT=5672
QUEUE_NAME=URLS_QUEUE_NAME
RESULTS_QUEUE=RESULTS_QUEUE_NAME
```
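As a hedged sketch, the services might read these settings through plain environment variables (a helper such as python-dotenv can load the `.env` file first); the function name and defaults below are assumptions.

```python
import os

def rabbit_config():
    """Read RabbitMQ connection settings from the environment,
    falling back to the defaults shown in the .env example."""
    return {
        "username": os.environ.get("RABBIT_USERNAME", "user"),
        "password": os.environ.get("RABBIT_PASSWORD", "user"),
        "host": os.environ.get("RABBIT_HOST", "127.0.0.1"),
        "port": int(os.environ.get("RABBIT_PORT", "5672")),
        "urls_queue": os.environ.get("QUEUE_NAME", "URLS_QUEUE_NAME"),
        "results_queue": os.environ.get("RESULTS_QUEUE", "RESULTS_QUEUE_NAME"),
    }
```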
- If you don't use Docker, use pip to install the dependencies
```shell
pip3 install -r requirements.txt
```
- If you are using Docker, build the container
```shell
docker build -t data-challenge .
```
Without Docker:
- Run the producer to add URLs to the queue.
```shell
python3 src/producer.py
```
- Run the worker to process URLs.
```shell
python3 src/worker.py
```
- After the worker finishes, run the results parser to parse the results.
```shell
python3 src/results_parser.py
```
With Docker:
- Run the producer to add URLs to the queue.
```shell
docker run -it --rm data-challenge /bin/sh -c "python3 src/producer.py"
```
- Run the worker to process URLs.
```shell
docker run -it --rm data-challenge
```
- After the worker finishes, run the results parser to parse the results.
```shell
docker run -it --rm --name data-container data-challenge /bin/sh -c "python3 src/results_parser.py"
```
- Copy the results file from the container to your machine.
```shell
docker cp data-container:/results/results_100000.json results.json
```