Repo Analyzer

Table Of Contents

  1. Introduction
  2. Architecture
  3. Installation
  4. Usage

1. Introduction

In this challenge, we download 100,000 public GitHub repositories and process the downloaded code. For each repository, the goal is to compute the following statistics for the Python code it contains (a sketch of how two of these can be computed appears after the list).

  1. Number of lines of code [excluding comments, whitespace, and blank lines].
  2. List of external libraries/packages used.
  3. The nesting factor for the repository: the average depth of nested for loops throughout the code.
  4. Code duplication: the percentage of code that is duplicated per file. If the same 4 consecutive lines of code (disregarding blank lines, comments, and other non-code items) appear in multiple places in a file, every occurrence except the first is considered a duplicate.
  5. Average number of parameters per function definition in the repository.
  6. Average number of variables defined per line of code in the repository.
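
For illustration, statistics 3 and 5 can be computed with Python's ast module roughly as follows; this visitor is a minimal sketch with assumed names, not the repository's actual Analyzer implementation.

import ast

class StatsVisitor(ast.NodeVisitor):
    """Illustrative visitor: records parameters per function and for-loop depths."""

    def __init__(self):
        self.param_counts = []  # parameters per function definition
        self.loop_depths = []   # depth at which each for loop appears
        self._depth = 0

    def visit_FunctionDef(self, node):
        self.param_counts.append(len(node.args.args))
        self.generic_visit(node)

    def visit_For(self, node):
        self._depth += 1
        self.loop_depths.append(self._depth)
        self.generic_visit(node)
        self._depth -= 1

source = """
def pairs(items, sep):
    for a in items:
        for b in items:
            print(a, sep, b)
"""
v = StatsVisitor()
v.visit(ast.parse(source))
avg_params = sum(v.param_counts) / len(v.param_counts)    # 2.0
nesting_factor = sum(v.loop_depths) / len(v.loop_depths)  # 1.5, under this reading of the definition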

2. Architecture

  1. To calculate the code statistics, Python's ast module is used to generate an abstract syntax tree, which is then visited to compute five of the statistics. To detect code duplication, we use Python's dict (its hashmap implementation) to compare every 4 consecutive lines of code (see the sketch after this list).

  2. To perform the processing we use Kubernetes, Docker, and RabbitMQ. The producer adds a message to a queue for every repository that needs processing; using Kubernetes we then run several workers [5-10] that take messages from the queue, clone the repository, calculate the statistics, and push the result to a separate results queue in RabbitMQ.

  3. The results_parser consumes messages from the results queue and appends them to the JSON results file.

  4. The final results are found in results/results.json.
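
The 4-consecutive-line duplication check mentioned in item 1 can be sketched as below, using a plain dict as the hashmap; the function name and the hashing choice are illustrative, not the repository's exact code.

import hashlib

def duplicate_percentage(lines, window=4):
    """Illustrative: percentage of lines that belong to repeated 4-line windows.

    `lines` is assumed to already exclude blank lines, comments, and other
    non-code items; only occurrences after the first count as duplicates.
    """
    seen = {}           # window hash -> index of its first occurrence
    duplicated = set()  # indices of lines counted as duplicates
    for i in range(len(lines) - window + 1):
        key = hashlib.md5("\n".join(lines[i:i + window]).encode()).hexdigest()
        if key in seen:
            duplicated.update(range(i, i + window))
        else:
            seen[key] = i
    return 100.0 * len(duplicated) / len(lines) if lines else 0.0

For example, duplicate_percentage(["a = 1", "b = 2", "c = 3", "d = 4"] * 2) returns 50.0: the second copy of the four lines counts as duplicated, the first does not.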

2.1 Classes

  • Worker: connects to the RabbitMQ URLs queue, fetches a URL, processes it with RepoAnalyzer, and pushes the result to the results queue (see the sketch below).
  • Producer: loads the URLs file and adds the URLs to the queue for processing.
  • RepoAnalyzer: clones a repository, scans it for Python files, and runs the analysis using the Analyzer class.
  • ResultsParser: fetches results from the results queue and saves them to disk.
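
A minimal sketch of the Worker loop described above, assuming the pika client and the environment variables from section 3.2; analyze_repo is a hypothetical stand-in for the RepoAnalyzer call, and the message format is an assumption.

import json
import os

import pika

def analyze_repo(url):
    """Hypothetical stand-in: clone the repository at `url` and compute its statistics."""
    return {"url": url}

params = pika.ConnectionParameters(
    host=os.environ.get("RABBIT_HOST", "127.0.0.1"),
    port=int(os.environ.get("RABBIT_PORT", "5672")),
    credentials=pika.PlainCredentials(
        os.environ.get("RABBIT_USERNAME", "user"),
        os.environ.get("RABBIT_PASSWORD", "user"),
    ),
)
connection = pika.BlockingConnection(params)
channel = connection.channel()
urls_queue = os.environ.get("QUEUE_NAME", "URLS_QUEUE_NAME")
results_queue = os.environ.get("RESULTS_QUEUE", "RESULTS_QUEUE_NAME")
channel.queue_declare(queue=urls_queue)
channel.queue_declare(queue=results_queue)

def on_message(ch, method, properties, body):
    # Each message is assumed to carry one repository URL.
    stats = analyze_repo(body.decode())
    ch.basic_publish(exchange="", routing_key=results_queue, body=json.dumps(stats))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue=urls_queue, on_message_callback=on_message)
channel.start_consuming()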

3. Installation

3.1 Dependencies

  • RabbitMQ
  • Python 3
  • Docker [recommended]

3.2 Steps

  1. Clone this repository
git clone https://github.com/melzareix/repo-analyzer.git
cd repo-analyzer
  2. Create a .env file with the following variables (you can also supply these variables via Docker); a sketch of how they might be loaded appears after these steps
RABBIT_USERNAME=user
RABBIT_PASSWORD=user
RABBIT_HOST=127.0.0.1
RABBIT_PORT=5672
QUEUE_NAME=URLS_QUEUE_NAME
RESULTS_QUEUE=RESULTS_QUEUE_NAME
  3. If you don't use Docker, use pip to install the dependencies
pip3 install -r requirements.txt
  4. If you are using Docker, build the container
docker build -t data-challenge .
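
The environment variables from step 2 might be loaded in code roughly as follows; whether the project uses python-dotenv or reads os.environ directly is an assumption here.

import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # copies the variables from .env into os.environ
rabbit_host = os.environ["RABBIT_HOST"]
rabbit_port = int(os.environ["RABBIT_PORT"])
rabbit_user = os.environ["RABBIT_USERNAME"]
rabbit_password = os.environ["RABBIT_PASSWORD"]
urls_queue = os.environ["QUEUE_NAME"]
results_queue = os.environ["RESULTS_QUEUE"]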

4. Usage

4.1 Non-Docker

  1. Run the producer to add URLs to the queue.
python3 src/producer.py
  2. Run the worker to process the URLs.
python3 src/worker.py
  3. After the worker finishes, run the results parser to parse the results.
python3 src/results_parser.py

4.2 Docker

  1. Run the producer to add URLs to the queue.
docker run -it --rm data-challenge /bin/sh -c "python3 src/producer.py"
  2. Run the worker to process the URLs.
docker run -it --rm data-challenge
  3. After the worker finishes, run the results parser to parse the results.
docker run -it --rm --name data-container data-challenge /bin/sh -c "python3 src/results_parser.py"
  4. Copy the results file from the container to your machine.
docker cp data-container:/results/results_100000.json results.json
