This project was developed to build practical skills and to explore techniques and tools designed for handling big data, as part of our academic curriculum as software engineers.
- About
- Architecture
- Implementation
- Demo
- Ways to improve
This project was undertaken as part of our software engineering curriculum, focusing on hands-on experience with big data tools and techniques. In a batch layer, we analyzed Steam reviews from Kaggle stored on Hadoop HDFS, applying sentiment analysis to rate games. In parallel, we used Spark and Kafka to build a real-time data collection system that monitors ongoing gaming conversations on Reddit, keeping our recommendations fresh. Processed data was managed in MongoDB, and a user-friendly interface built with React and Express allows easy exploration of the resulting gaming insights.
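To make the batch step concrete, here is a minimal, self-contained sketch of lexicon-based sentiment scoring and per-game aggregation. In the real project this runs over the full Kaggle dataset on HDFS; the tiny lexicon, the in-memory review list, and the game/review names below are illustrative assumptions, not project data.

```python
from collections import defaultdict

# Toy sentiment lexicon (assumption; a real job would use a proper model or lexicon).
POSITIVE = {"great", "fun", "amazing", "love"}
NEGATIVE = {"boring", "broken", "bad", "refund"}

def score_review(text: str) -> int:
    """Count positive minus negative words in one review."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def rate_games(reviews):
    """Average sentiment score per game (higher = better received)."""
    totals, counts = defaultdict(int), defaultdict(int)
    for game, text in reviews:
        totals[game] += score_review(text)
        counts[game] += 1
    return {g: totals[g] / counts[g] for g in totals}

reviews = [
    ("Portal 2", "great puzzles, so much fun"),
    ("Portal 2", "love this game"),
    ("Generic FPS", "boring and broken, asked for a refund"),
]
ratings = rate_games(reviews)
```

The same map-then-aggregate shape is what the distributed batch job expresses over HDFS data; only the scale and the scoring function differ.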
MongoDB is a NoSQL document database that can be used effectively in big data scenarios.
HDFS is a distributed file system that spreads data across multiple machines in the cluster.
Spark Streaming is a real-time data processing module in Apache Spark that supports stream processing.
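Spark Streaming processes an unbounded stream as a sequence of small batches. The hedged sketch below imitates that micro-batch idea in plain Python so the logic stays visible: a list stands in for the Kafka topic of Reddit posts, and another list stands in for the MongoDB sink; none of these names come from the project itself.

```python
from typing import Iterable, List

def micro_batches(stream: Iterable, batch_size: int):
    """Group incoming records into fixed-size micro-batches, like a
    stream processor slicing a feed into small units of work."""
    batch: List = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

sink = []  # stand-in for the MongoDB collection
posts = [{"id": i, "subreddit": "gaming"} for i in range(5)]  # stand-in for the Kafka topic
for batch in micro_batches(posts, batch_size=2):
    sink.extend(batch)  # "write" each processed micro-batch to the sink
```

In the actual pipeline, Spark reads the Kafka topic, applies the processing per micro-batch, and writes results to MongoDB; the batching discipline is the same.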
Kafka is a distributed event store and stream-processing platform.
React is a free and open-source front-end JavaScript library for building user interfaces. Express is a back end web application framework for building RESTful APIs with Node.js.
- One way to improve the project is to add a layer that filters the posts coming from Reddit by relevance and/or groups them by talking points.
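A minimal sketch of what that proposed relevance filter could look like: keep only posts that mention a tracked game. The keyword set and the post shape are assumptions for illustration; they are not part of the current project.

```python
# Hypothetical set of games the system tracks (assumption for this sketch).
TRACKED_GAMES = {"portal 2", "stardew valley", "hades"}

def is_relevant(post: dict) -> bool:
    """True if the post's title or body mentions a tracked game."""
    text = (post["title"] + " " + post.get("body", "")).lower()
    return any(game in text for game in TRACKED_GAMES)

posts = [
    {"title": "Just finished Hades, what a run!"},
    {"title": "My cat knocked over my keyboard"},
]
relevant = [p for p in posts if is_relevant(p)]
```

A production version would likely replace the keyword match with a learned relevance classifier, and the grouping by talking points could use topic modeling over the filtered posts.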
- Another improvement would be an executable pipeline that automates running and deploying the project.