Distributed Map Reduce

This repository presents a comprehensive implementation of the MapReduce programming methodology, designed for processing massive data collections efficiently. This project encompasses two primary applications: Word Count and Inverted Index.

Architecture

The implementation comprises several key components:

Server: A Python 3 TCP server that receives client requests and performs MapReduce operations based on the provided input file. The server is capable of handling multiple client connections concurrently, and it creates and invokes a main object located in the Master implementation. The server returns the results of the main execution to the client.
Master: The master node coordinates the MapReduce tasks. Upon receiving a request from a client, the master initiates the corresponding map and reduce functions on the input data. Depending on the map function type (e.g., "word_count" or "inverted_index"), the master creates instances of the appropriate classes to handle the operations.
Key-Value Server: This serves as a database for intermediate files and output files, allowing mappers and reducers to read from and write to these files. It provides functionalities such as retrieving file data and saving to files.
MapReduce: This section includes the implementations of the WordCount and InvertedIndex classes:
- WordCount Class: The mapper task reads input files based on file splits, emitting key-value pairs. These pairs are then passed to the combiner, which organizes the mapper output into key-value pairs that are suitable for the reducer. After combining, this data is stored in the key-value database. The reduce function aggregates the computed results from each worker, producing the final output by summing the values.
- InvertedIndex Class: The input file is divided into chunks equal to the number of mappers. Each mapper processes a chunk and outputs are combined to create a MergedData. The reducer processes unique keys and combines their data. Once all reducers are executed, the results are combined and stored as an output file.
Utils: Contains helper functions such as split_input, merge_files, and join_files.
Helper: Contains the map and reduce functions corresponding to the WordCount and InvertedIndex objects.
Config: Contains host and port information that can be configured.
Log Files: Log files are created to document the process information and any errors.

Workflow

The project follows this workflow:

Configure the config.ini file according to requirements, setting port numbers, hosts, and other parameters.
Execute start_server.sh to launch the key-value server and main server.
Run test.sh to initiate test files: client_wordCount, client_invertedIndex, clients_multiple.
Clients connect to the Master server, and requests are routed to the appropriate server components based on attributes like num_mappers, num_reducers, input_location, output_location, map_function, and reduce_function.
The Master connects to the Key-Value server and creates class instances according to the client's request.
Depending on the application, the Master dispatches mapper tasks asynchronously to run in parallel. The Master connects to mappers to assign map tasks to each mapper.
Mappers read allocated data from the key-value store and write key-value pairs to it.
After processing all pairs, mappers write intermediate outputs to the key-value store, using hashing to allocate keys to reducer inputs.
After all mappers finish, reducers are initiated to complete the remaining tasks.
Each reducer processes its input and combines values for each key.
Upon completion, reducers write their output to the output file at the key-value server.

Bonus Components

Protocol Buffers: Implemented for efficient communication of intermediate data between components.
Fault Tolerance: In the event of failures, clients can resend requests. If map or reduce functions fail, they are restarted. Reducer failures lead to reprocessing the stored intermediate data.

Comparisons with Original Map-Reduce

Original Map-Reduce	Distributed Map Reduce
Employs job scheduling with tasks assigned to machines	No job scheduling implemented
Implements fault tolerance through heartbeat detection	Implements fault tolerance by detecting lack of worker output
Uses task backups to improve execution time	Does not create backups of worker tasks

Limitations

Supports only Word Count and Inverted Index applications.
Concurrent client connections are constrained by server hardware capabilities.

Future Scope

Implement load balancing for better execution times.
Integrate a distributed data store like HDFS for enhanced scalability and performance.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
Data		Data
MapReduce.py		MapReduce.py
README.md		README.md
Report.pdf		Report.pdf
client_invertedIndex.py		client_invertedIndex.py
client_wordCount.py		client_wordCount.py
clients_multiple.py		clients_multiple.py
config.py		config.py
helper.py		helper.py
master.py		master.py
newServer.py		newServer.py
server.png		server.png
start_server.sh		start_server.sh
test.sh		test.sh
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Distributed Map Reduce

Architecture

Workflow

Bonus Components

Comparisons with Original Map-Reduce

Limitations

Future Scope

About

Releases

Packages

Languages

prashastikarlekar/Distributed-Map-Reduce

Folders and files

Latest commit

History

Repository files navigation

Distributed Map Reduce

Architecture

Workflow

Bonus Components

Comparisons with Original Map-Reduce

Limitations

Future Scope

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages