This project, undertaken for the Data Engineering II course at Uppsala University, aims to develop an efficient analytics system for GitHub using Apache Pulsar and Kubernetes. The system works around the limitations of GitHub's built-in search by streaming data from GitHub's API and analyzing it in real time. By combining Apache Pulsar's streaming framework with the scalability of Kubernetes, it gives developers access to insights that GitHub's search alone does not provide.
The analytics system focuses on answering key questions: which programming languages are popular, which popular projects contain vulnerabilities, which projects are currently active, and how community trends develop over time. The main areas of emphasis are robust data streaming with Apache Pulsar and seamless deployment and scaling with Kubernetes. The final result will be a user-friendly interface or API that lets developers easily access and use the information carried in the GitHub data stream.
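As an illustration of the first question above, popular programming languages can be found by counting the language of each repository record as it arrives. The following is a minimal sketch under stated assumptions, not the project's actual implementation: the function name `count_languages` and the sample records are hypothetical, each record is assumed to be a dictionary with a `language` key (as in repository objects returned by GitHub's REST API, where the value may be null), and a plain Python list stands in for the messages that would be received from an Apache Pulsar topic in the deployed system.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional


def count_languages(repos: Iterable[Mapping[str, Optional[str]]]) -> Counter:
    """Count how many repository records report each primary language.

    Records with a missing or null "language" field are skipped.
    """
    counts: Counter = Counter()
    for repo in repos:
        language = repo.get("language")
        if language:
            counts[language] += 1
    return counts


# Hypothetical sample records; in the streaming system these would be
# decoded from messages consumed from a Pulsar topic.
sample = [
    {"full_name": "example/a", "language": "Python"},
    {"full_name": "example/b", "language": "Go"},
    {"full_name": "example/c", "language": "Python"},
    {"full_name": "example/d", "language": None},
]

# The two most frequent languages in the sample.
top = count_languages(sample).most_common(2)
print(top)
```

Because `count_languages` accepts any iterable, the same function could be fed from a generator that yields decoded messages from a stream, so the aggregation logic stays independent of the transport.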
This project provides a practical learning experience for participants, combining cutting-edge technologies, real-world data analysis challenges, and the opportunity to contribute to the GitHub developer community. It enhances participants' data engineering skills and deepens their understanding of streaming frameworks and container orchestration platforms.