nash5657/Data-Engineering-II-Project

 
 


GitHub analytics system using a streaming framework

Project Description

This project, undertaken for the Data Engineering II course at Uppsala University, aims to develop an efficient analytics system for GitHub using Apache Pulsar and Kubernetes. The system works around the limitations of GitHub's search capabilities by streaming data from GitHub's API and analyzing it in real time. Apache Pulsar provides the streaming framework and Kubernetes provides scalable deployment, giving developers access to insights that were previously unavailable.

The analytics system focuses on answering key questions such as which programming languages are popular, which popular projects contain vulnerabilities, which projects are active, and how community trends develop. The two main areas of emphasis are robust data streaming with Apache Pulsar and deployment and scaling with Kubernetes. The final result will be a user-friendly interface or API that gives developers access to the information in the GitHub data stream.
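As an illustration of the "popular programming languages" question, here is a minimal Python sketch that counts repositories per primary language over a stream of JSON messages. It is not taken from the project's code. The function name `top_languages` and the sample payloads are hypothetical, and the messages are assumed to carry the `language` and `full_name` fields that GitHub's REST API returns for a repository. In a real deployment the messages would presumably be read from a Pulsar topic; here an in-memory list stands in for it.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def top_languages(messages: Iterable[bytes], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n most common primary languages in a stream of JSON messages."""
    counts: Counter = Counter()
    for raw in messages:
        repo = json.loads(raw)
        # GitHub reports null when no language was detected; skip those repos.
        language = repo.get("language")
        if language:
            counts[language] += 1
    return counts.most_common(n)


# Hypothetical payloads standing in for messages consumed from a Pulsar topic.
sample_stream = [
    json.dumps({"full_name": "a/one", "language": "Python"}).encode(),
    json.dumps({"full_name": "b/two", "language": "Go"}).encode(),
    json.dumps({"full_name": "c/three", "language": "Python"}).encode(),
    json.dumps({"full_name": "d/four", "language": None}).encode(),
    json.dumps({"full_name": "e/five", "language": "Rust"}).encode(),
    json.dumps({"full_name": "f/six", "language": "Go"}).encode(),
    json.dumps({"full_name": "g/seven", "language": "Python"}).encode(),
]

print(top_languages(sample_stream, n=2))
```

Keeping the aggregation independent of the message source means the same function could be applied to a Pulsar consumer loop or to a recorded batch, which makes it easy to test without a running cluster.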

This project also provides a practical learning experience for participants: it combines current technologies with real-world data analysis challenges and offers an opportunity to contribute to the GitHub developer community. It builds participants' data engineering skills and deepens their understanding of streaming frameworks and container orchestration platforms.

Languages

  • Jupyter Notebook 69.6%
  • Python 28.1%
  • Shell 1.6%
  • Dockerfile 0.7%