This repository presents a comprehensive implementation of the MapReduce programming methodology, designed for processing massive data collections efficiently. This project encompasses two primary applications: Word Count and Inverted Index.
The implementation comprises several key components:
-
Server: A Python 3 TCP server that receives client requests and performs MapReduce operations based on the provided input file. The server is capable of handling multiple client connections concurrently, and it creates and invokes a
main
object located in the Master implementation. The server returns the results of themain
execution to the client. -
Master: The master node coordinates the MapReduce tasks. Upon receiving a request from a client, the master initiates the corresponding map and reduce functions on the input data. Depending on the map function type (e.g., "word_count" or "inverted_index"), the master creates instances of the appropriate classes to handle the operations.
-
Key-Value Server: This serves as a database for intermediate files and output files, allowing mappers and reducers to read from and write to these files. It provides functionalities such as retrieving file data and saving to files.
-
MapReduce: This section includes the implementations of the
WordCount
andInvertedIndex
classes:- WordCount Class: The mapper task reads input files based on file splits, emitting key-value pairs. These pairs are then passed to the combiner, which organizes the mapper output into key-value pairs that are suitable for the reducer. After combining, this data is stored in the key-value database. The reduce function aggregates the computed results from each worker, producing the final output by summing the values.
- InvertedIndex Class: The input file is divided into chunks equal to the number of mappers. Each mapper processes a chunk and outputs are combined to create a
MergedData
. The reducer processes unique keys and combines their data. Once all reducers are executed, the results are combined and stored as an output file.
-
Utils: Contains helper functions such as
split_input
,merge_files
, andjoin_files
. -
Helper: Contains the map and reduce functions corresponding to the
WordCount
andInvertedIndex
objects. -
Config: Contains host and port information that can be configured.
-
Log Files: Log files are created to document the process information and any errors.
The project follows this workflow:
- Configure the
config.ini
file according to requirements, setting port numbers, hosts, and other parameters. - Execute
start_server.sh
to launch the key-value server and main server. - Run
test.sh
to initiate test files:client_wordCount
,client_invertedIndex
,clients_multiple
. - Clients connect to the Master server, and requests are routed to the appropriate server components based on attributes like
num_mappers
,num_reducers
,input_location
,output_location
,map_function
, andreduce_function
. - The Master connects to the Key-Value server and creates class instances according to the client's request.
- Depending on the application, the Master dispatches mapper tasks asynchronously to run in parallel. The Master connects to mappers to assign map tasks to each mapper.
- Mappers read allocated data from the key-value store and write key-value pairs to it.
- After processing all pairs, mappers write intermediate outputs to the key-value store, using hashing to allocate keys to reducer inputs.
- After all mappers finish, reducers are initiated to complete the remaining tasks.
- Each reducer processes its input and combines values for each key.
- Upon completion, reducers write their output to the output file at the key-value server.
- Protocol Buffers: Implemented for efficient communication of intermediate data between components.
- Fault Tolerance: In the event of failures, clients can resend requests. If map or reduce functions fail, they are restarted. Reducer failures lead to reprocessing the stored intermediate data.
Original Map-Reduce | Distributed Map Reduce |
---|---|
Employs job scheduling with tasks assigned to machines | No job scheduling implemented |
Implements fault tolerance through heartbeat detection | Implements fault tolerance by detecting lack of worker output |
Uses task backups to improve execution time | Does not create backups of worker tasks |
- Supports only Word Count and Inverted Index applications.
- Concurrent client connections are constrained by server hardware capabilities.
- Implement load balancing for better execution times.
- Integrate a distributed data store like HDFS for enhanced scalability and performance.