<a href="https://colab.research.google.com/github/pejmanrasti/Big_Data/blob/main/01_Example_MapReduce.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h3>Preparing the Environment</h3>

<p>
Before we can explore Hadoop and MapReduce inside Google Colab, we need to install the pieces that Hadoop depends on.
Hadoop runs on the Java Virtual Machine, so we begin by installing a Java 8 JDK.
Once Java is in place, we download the Hadoop 3.3.6 release from the official Apache download site and unpack it directly into the Colab workspace.
This gives us a single-machine Hadoop installation (a local “mini cluster” rather than a real one) with all the tools we will use throughout the rest of this notebook to run MapReduce jobs on actual data.
</p>

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
!tar -xzf hadoop-3.3.6.tar.gz

<h3>Configuring the Hadoop Environment</h3>

<p>
With Hadoop downloaded, the next step is to make Colab aware of where everything lives.
In this part of the setup, we define a few essential environment variables that allow the rest of the notebook to interact with Hadoop just as it would on a regular cluster.
We tell the system where Java is installed, point it to the location of our Hadoop folder, and finally extend the system path so that Hadoop commands can be executed naturally from any cell.
After this configuration, our Colab runtime behaves like a lightweight Hadoop environment, ready for the tasks that follow.
</p>

In [2]:
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["HADOOP_HOME"] = "/content/hadoop-3.3.6"
os.environ["PATH"] += f":{os.environ['HADOOP_HOME']}/bin:{os.environ['HADOOP_HOME']}/sbin"
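
<p>
Re-running the cell above appends the same two directories to <code>PATH</code> again each time. As a minimal sketch of one way to avoid that, the helper below (the name <code>extend_path</code> is hypothetical, not part of the notebook) only appends directories that are not already listed. The fallback value for <code>HADOOP_HOME</code> is assumed to match the path used above.
</p>

```python
import os

def extend_path(current, new_dirs):
    """Append each directory to a PATH-style string unless it is already present."""
    parts = [p for p in current.split(os.pathsep) if p]
    for directory in new_dirs:
        if directory not in parts:
            parts.append(directory)
    return os.pathsep.join(parts)

# Assumption: HADOOP_HOME was set as in the cell above; otherwise fall back to that path.
hadoop_home = os.environ.get("HADOOP_HOME", "/content/hadoop-3.3.6")
os.environ["PATH"] = extend_path(
    os.environ.get("PATH", ""),
    [f"{hadoop_home}/bin", f"{hadoop_home}/sbin"],
)
```

<p>
Calling the helper twice with the same directories leaves the string unchanged, so the cell can be re-run safely.
</p>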

<h3>Setting Up Hadoop’s Core Configuration</h3>

<p>
Now that the environment is prepared, we customize Hadoop’s core configuration to match the way we intend to use it inside Colab.
Since we are not running a distributed cluster here, we configure Hadoop to operate in local mode by setting <code>fs.defaultFS</code> to <code>file:///</code>, which makes it use the local filesystem instead of HDFS.
This small adjustment ensures that Hadoop works directly with the files stored in our Colab workspace, allowing us to run MapReduce jobs smoothly without needing a full cluster.
It is a lightweight setup, but perfectly suited for demonstrations, teaching, and hands-on experimentation.
</p>

In [3]:
%%bash
cat > /content/hadoop-3.3.6/etc/hadoop/core-site.xml << EOF
<configuration>
 <property>
   <name>fs.defaultFS</name>
   <value>file:///</value>
 </property>
</configuration>
EOF
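
<p>
To confirm what a configuration file actually contains, it can help to parse it rather than read it by eye. The sketch below uses Python's standard <code>xml.etree.ElementTree</code> module. The function name <code>read_hadoop_properties</code> is hypothetical, and the sample string simply repeats the XML written above.
</p>

```python
import xml.etree.ElementTree as ET

def read_hadoop_properties(xml_text):
    """Return a {name: value} dict from a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props

# Same content as the core-site.xml written in the cell above.
sample = """<configuration>
 <property>
   <name>fs.defaultFS</name>
   <value>file:///</value>
 </property>
</configuration>"""

print(read_hadoop_properties(sample))
```

<p>
To check the real file, read <code>/content/hadoop-3.3.6/etc/hadoop/core-site.xml</code> and pass its text to the same function.
</p>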

<h3>Creating the Mapper</h3>

<p>
With the environment ready, we begin building our first MapReduce task by defining the mapper component.
This simple script represents the “mapping” phase of the pipeline, where raw input text is broken down into smaller pieces that can later be aggregated.
Each line sent to the mapper is scanned and split into individual words, and for every word that appears, the mapper emits a pair consisting of the word itself and the number 1.
This transforms unstructured text into a stream of key-value pairs, setting the stage for Hadoop to group and process them in the next phase.
</p>

In [5]:
%%writefile mapper.py
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

Writing mapper.py
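
<p>
The mapping step can also be tried without Hadoop. The sketch below restates the mapper's logic as a plain function (the name <code>map_line</code> is hypothetical) that returns the pairs instead of printing them, which makes the emitted key-value pairs easy to inspect.
</p>

```python
def map_line(line):
    """Return the (word, 1) pairs that mapper.py would print for one input line."""
    return [(word, 1) for word in line.strip().split()]

print(map_line("hello big data world"))
# [('hello', 1), ('big', 1), ('data', 1), ('world', 1)]
```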


<h3>Creating the Reducer</h3>

<p>
We now complete the MapReduce workflow by introducing the reducer.
This component receives the intermediate key-value pairs produced by the mapper, but now they arrive sorted by key, so all pairs for the same word are adjacent.
The reducer’s job is to walk through this sorted stream and accumulate the counts for each unique word.
Whenever it detects that the word has changed, it outputs the final total for the previous one and moves on to the next.
By the end of this phase, all the small “1” values emitted by the mapper are combined into a complete tally for every word in the dataset, producing the final aggregated results of the MapReduce job.
</p>

In [6]:
%%writefile reducer.py
#!/usr/bin/env python3
import sys

cur = None
total = 0

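for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue  # ignore blank lines
    # Hadoop Streaming separates key and value with a tab
    word, count = line.split("\t", 1)
    count = int(count)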

    if word != cur:
        if cur is not None:
            print(f"{cur}\t{total}")
        cur = word
        total = count
    else:
        total += count

if cur is not None:
    print(f"{cur}\t{total}")

Writing reducer.py
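
<p>
The sort-then-accumulate idea can be sketched in plain Python as well. Here <code>sorted</code> is only a stand-in for Hadoop's sort phase, and <code>itertools.groupby</code> plays the role of the "has the word changed?" check in <code>reducer.py</code>. The function name <code>shuffle_and_reduce</code> is hypothetical.
</p>

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_reduce(pairs):
    """Sort (key, count) pairs by key, then sum the counts of each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, sum(count for _, count in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

pairs = [("hello", 1), ("world", 1), ("hello", 1), ("map", 1)]
print(shuffle_and_reduce(pairs))
# [('hello', 2), ('map', 1), ('world', 1)]
```

<p>
Note that <code>groupby</code> only merges adjacent equal keys, which is exactly why the input must be sorted first; the streaming reducer relies on the same property.
</p>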


<h3>Preparing a Sample Input File</h3>

<p>
Before running our MapReduce job, we need some data for Hadoop to process.
In this step, we create a small text file that will serve as our demonstration dataset.
</p>

In [7]:
%%writefile text.txt
hello big data world
hello map reduce
hello hello world

Writing text.txt


<h3>Running the MapReduce Job</h3>

<p>
We can now launch the full MapReduce workflow.
In this step, Hadoop Streaming is used to connect our Python scripts to the Hadoop engine, allowing the mapper and reducer to operate on the input file just like native Hadoop components.
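The command feeds the text file to the mapper, passes the sorted intermediate results to the reducer, and writes the final output to a dedicated directory.
The two <code>-file</code> options ship the scripts with the job; Hadoop logs a warning that <code>-file</code> is deprecated in favor of the generic <code>-files</code> option, but the job still runs.
Hadoop refuses to overwrite an existing output directory, so delete <code>output</code> before re-running the cell.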
This marks the first complete end-to-end execution in our mini Hadoop environment, demonstrating how custom Python code can be integrated seamlessly into the MapReduce model.
</p>

In [8]:
!hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
    -mapper mapper.py \
    -reducer reducer.py \
    -input text.txt \
    -output output \
    -file mapper.py \
    -file reducer.py

2025-11-20 14:28:31,976 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper.py, reducer.py] [] /tmp/streamjob8119443689867055725.jar tmpDir=null
2025-11-20 14:28:33,157 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-11-20 14:28:33,396 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-11-20 14:28:33,396 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-11-20 14:28:33,488 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-11-20 14:28:34,039 INFO mapred.FileInputFormat: Total input files to process : 1
2025-11-20 14:28:34,065 INFO mapreduce.JobSubmitter: number of splits:1
2025-11-20 14:28:34,594 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local466868392_0001
2025-11-20 14:28:34,595 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-11-20 14:28:35,469 INFO mapred.LocalDistributedCacheM

In [None]:
!cat output/part-00000

big	1
data	1
hello	4
map	1
reduce	1
world	2
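
<p>
For further processing in Python, the tab-separated result can be loaded into a dictionary. The sketch below assumes the output directory contains one or more <code>part-*</code> files with one <code>key&lt;TAB&gt;count</code> pair per line, as shown above. The function name <code>read_counts</code> is hypothetical.
</p>

```python
import glob
import os

def read_counts(output_dir):
    """Collect key/count pairs from every part-* file in a streaming output directory."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                key, value = line.split("\t", 1)
                counts[key] = counts.get(key, 0) + int(value)
    return counts

# Returns an empty dict if the directory does not exist or holds no part files.
print(read_counts("output"))
```

<p>
Run after the job above, this should reproduce the word counts printed by <code>cat</code>, now as a Python dictionary.
</p>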


<h3>Creating a Sample Web Server Log</h3>

<p>
To explore another MapReduce scenario, we prepare a small dataset that mimics the kind of access logs generated by a web server.
Each line in this file represents a single HTTP request, containing information such as the client’s IP address, the requested resource, the response status code, and the size of the returned data.
Although this dataset is intentionally small, it captures a variety of realistic events: successful page loads, missing pages, server errors, API calls, and repeated visits from the same user.
This structured but varied set of log entries is enough to demonstrate how MapReduce can count errors and summarize traffic patterns, the same kinds of questions asked of much larger production logs.
</p>
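
<p>
The fields described above can be pulled apart with a regular expression. The sketch below is one possible pattern for this line layout, not a complete parser for every server's log format. The names <code>LOG_PATTERN</code> and <code>parse_log_line</code> are hypothetical, and the example line is made up for illustration.
</p>

```python
import re

# One named group per field: client IP, identity, user, timestamp, request, status, size.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

# Made-up example line in the same layout as the dataset.
example = '192.0.2.1 - alice [10/Nov/2023:14:30:00 +0000] "GET /index.html HTTP/1.1" 200 1024'
fields = parse_log_line(example)
print(fields["ip"], fields["method"], fields["path"], fields["status"], fields["size"])
```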

<h3>Understanding the HTTP Status Codes in the Log File</h3>

<p>
It is helpful to understand the meaning behind the different HTTP status codes that appear in the log.
These codes reflect how the server responded to each request and play a key role in analyzing system reliability, user behavior, and error patterns.
</p>

<ul>
  <li>
    <strong>200 — OK:</strong>  
    The request was completed successfully. These logs usually represent normal traffic such as loading pages, images, or API responses.  
    In our dataset, most entries fall into this category, showing the server's expected behavior.
  </li>

  <li>
    <strong>404 — Not Found:</strong>  
    The client tried to access a resource that does not exist on the server.  
    These logs often indicate broken links, misspelled URLs, outdated content, or malicious probing.  
    They are useful for identifying issues in navigation or security.
  </li>

  <li>
    <strong>500 — Internal Server Error:</strong>  
    Something went wrong on the server while processing the request.  
    These entries are important signals of backend failures, misconfigured services, or unexpected conditions that need attention.
  </li>
</ul>

<p>
By combining these codes with information such as IP addresses, endpoints, and timestamps, we can answer questions such as which endpoints fail most often or which clients trigger the most errors.
This makes the dataset well suited for exercises in log analysis, error detection, and summarization using MapReduce.
</p>
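
<p>
Python's standard library already knows the reason phrases for these codes. As a small sketch, the helper below (the name <code>describe_status</code> is hypothetical) looks up the phrase with <code>http.HTTPStatus</code> and derives the broad class from the first digit.
</p>

```python
from http import HTTPStatus

def describe_status(code):
    """Return (reason phrase, broad class) for a numeric HTTP status code."""
    classes = {1: "informational", 2: "success", 3: "redirection",
               4: "client error", 5: "server error"}
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code not defined in the standard library enum
    return phrase, classes.get(code // 100, "unknown")

for code in (200, 404, 500):
    print(code, *describe_status(code))
```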

In [None]:
%%writefile access_log.txt
192.168.1.1 - user1 [10/Nov/2023:14:30:00 +0000] "GET /index.html HTTP/1.1" 200 1024
192.168.1.2 - user2 [10/Nov/2023:14:31:05 +0000] "GET /nonexistent.php HTTP/1.1" 404 0
192.168.1.3 - user3 [10/Nov/2023:14:32:10 +0000] "POST /api/data HTTP/1.1" 200 512
192.168.1.4 - user4 [10/Nov/2023:14:33:15 +0000] "GET /admin HTTP/1.1" 500 200
192.168.1.1 - user1 [10/Nov/2023:14:34:20 +0000] "GET /images/logo.png HTTP/1.1" 200 5000
192.168.1.5 - user5 [10/Nov/2023:14:35:25 +0000] "GET /index.html HTTP/1.1" 200 1024
192.168.1.6 - user6 [10/Nov/2023:14:36:30 +0000] "GET /about.html HTTP/1.1" 200 700
192.168.1.7 - user7 [10/Nov/2023:14:37:35 +0000] "GET /contact.php HTTP/1.1" 404 0
192.168.1.8 - user8 [10/Nov/2023:14:38:40 +0000] "GET /products/category1 HTTP/1.1" 200 3000
192.168.1.9 - user9 [10/Nov/2023:14:39:45 +0000] "PUT /api/update HTTP/1.1" 500 150

Writing access_log.txt


<h3>Building the Mapper for Status Code Analysis</h3>

<p>
To analyze the behavior of our web server logs, we begin by defining a mapper that focuses on extracting the status code from each request.
Every line of the log contains several pieces of information, but the status code is especially meaningful because it tells us whether the request succeeded, failed, or triggered an error.
The mapper reads each log entry, isolates the status code, and emits it together with the number 1.
This transforms the log file into a stream of simple countable units, allowing Hadoop to later group and total the occurrences of each unique status code.
This step sets the foundation for summarizing how often different types of responses—such as successful requests or server errors—appear in the dataset.
</p>

In [None]:
%%writefile mapper_status_code.py
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()
    if line:
        parts = line.split()
        # In Apache Common Log Format the status code is the second-to-last field.
        # Skip malformed lines whose candidate field is not numeric.
        if len(parts) >= 2 and parts[-2].isdigit():
            status_code = parts[-2]
            print(f"{status_code}\t1")

Writing mapper_status_code.py


In [None]:
!hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
    -mapper mapper_status_code.py \
    -reducer reducer.py \
    -input access_log.txt \
    -output output_status_codes \
    -file mapper_status_code.py \
    -file reducer.py

2025-11-18 15:38:56,565 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper_status_code.py, reducer.py] [] /tmp/streamjob3600783376251879981.jar tmpDir=null
2025-11-18 15:38:57,951 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-11-18 15:38:58,134 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-11-18 15:38:58,135 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-11-18 15:38:58,200 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-11-18 15:38:58,795 INFO mapred.FileInputFormat: Total input files to process : 1
2025-11-18 15:38:58,846 INFO mapreduce.JobSubmitter: number of splits:1
2025-11-18 15:38:59,216 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1503246243_0001
2025-11-18 15:38:59,216 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-11-18 15:38:59,918 INFO mapred.LocalDist

In [None]:
# List the contents of the output directory to confirm the output file exists
print("Contents of output_status_codes directory:")
!ls -l output_status_codes

# Display the results from the output file
print("\nResults of status code counts:")
!cat output_status_codes/part-00000

Contents of output_status_codes directory:
total 4
-rw-r--r-- 1 root root 18 Nov 18 15:39 part-00000
-rw-r--r-- 1 root root  0 Nov 18 15:39 _SUCCESS

Results of status code counts:
200	6
404	2
500	2
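
<p>
For a dataset this small, the Hadoop result is easy to cross-check in plain Python. The sketch below applies the same "second-to-last field" rule with <code>collections.Counter</code>. The function name <code>count_status_codes</code> is hypothetical, and the sample list holds only the first four lines of <code>access_log.txt</code>.
</p>

```python
from collections import Counter

def count_status_codes(lines):
    """Count the second-to-last whitespace-separated field of each non-empty line."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            counts[parts[-2]] += 1
    return counts

# First four lines of access_log.txt, copied from the cell above.
sample_lines = [
    '192.168.1.1 - user1 [10/Nov/2023:14:30:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '192.168.1.2 - user2 [10/Nov/2023:14:31:05 +0000] "GET /nonexistent.php HTTP/1.1" 404 0',
    '192.168.1.3 - user3 [10/Nov/2023:14:32:10 +0000] "POST /api/data HTTP/1.1" 200 512',
    '192.168.1.4 - user4 [10/Nov/2023:14:33:15 +0000] "GET /admin HTTP/1.1" 500 200',
]
print(sorted(count_status_codes(sample_lines).items()))
# [('200', 2), ('404', 1), ('500', 1)]
```

<p>
Passing the open file instead, as in <code>count_status_codes(open("access_log.txt"))</code>, should agree with the contents of <code>part-00000</code>.
</p>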


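<h3>Multiple Sample Web Log Files</h3>

<p>
Real log data is usually spread across several files.
Here we create two more small access logs and pass both to the same job by repeating the <code>-input</code> option, reusing the status code mapper and the reducer unchanged.
</p>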



In [None]:
%%writefile access_log_part1.txt
192.168.1.10 - userA [18/Nov/2023:10:00:00 +0000] "GET /home HTTP/1.1" 200 1500
192.168.1.11 - userB [18/Nov/2023:10:01:10 +0000] "POST /login HTTP/1.1" 200 200
192.168.1.12 - userC [18/Nov/2023:10:02:20 +0000] "GET /products HTTP/1.1" 200 3000
192.168.1.13 - userD [18/Nov/2023:10:03:30 +0000] "GET /invalid_page HTTP/1.1" 404 0
192.168.1.10 - userA [18/Nov/2023:10:04:40 +0000] "GET /images/bg.jpg HTTP/1.1" 200 10000
192.168.1.14 - userE [18/Nov/2023:10:05:50 +0000] "GET /admin HTTP/1.1" 403 120


Writing access_log_part1.txt


In [None]:
%%writefile access_log_part2.txt
192.168.1.15 - userF [18/Nov/2023:10:06:00 +0000] "GET /data.json HTTP/1.1" 200 800
192.168.1.16 - userG [18/Nov/2023:10:07:10 +0000] "POST /submit HTTP/1.1" 200 50
192.168.1.17 - userH [18/Nov/2023:10:08:20 +0000] "GET /bad_request HTTP/1.1" 400 0
192.168.1.18 - userI [18/Nov/2023:10:09:30 +0000] "GET /another_invalid_page HTTP/1.1" 404 0
192.168.1.15 - userF [18/Nov/2023:10:10:40 +0000] "GET /docs/api.html HTTP/1.1" 200 2500
192.168.1.19 - userJ [18/Nov/2023:10:11:50 +0000] "PUT /update_profile HTTP/1.1" 503 100


Writing access_log_part2.txt


In [None]:
!hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
    -mapper mapper_status_code.py \
    -reducer reducer.py \
    -input access_log_part1.txt \
    -input access_log_part2.txt \
    -output output_multi_log \
    -file mapper_status_code.py \
    -file reducer.py

2025-11-18 15:51:37,270 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper_status_code.py, reducer.py] [] /tmp/streamjob695153763025265967.jar tmpDir=null
2025-11-18 15:51:38,673 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-11-18 15:51:38,869 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-11-18 15:51:38,869 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-11-18 15:51:38,893 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-11-18 15:51:39,442 INFO mapred.FileInputFormat: Total input files to process : 2
2025-11-18 15:51:39,502 INFO mapreduce.JobSubmitter: number of splits:2
2025-11-18 15:51:39,813 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1667184421_0001
2025-11-18 15:51:39,813 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-11-18 15:51:40,247 INFO mapred.LocalDistr

In [None]:
# List the contents of the output directory to confirm the output file exists
print("Contents of output_multi_log directory:")
!ls -l output_multi_log

# Display the results from the output file
print("\nResults of status code counts from multiple logs:")
!cat output_multi_log/part-00000

Contents of output_multi_log directory:
total 4
-rw-r--r-- 1 root root 30 Nov 18 15:51 part-00000
-rw-r--r-- 1 root root  0 Nov 18 15:51 _SUCCESS

Results of status code counts from multiple logs:
200	7
400	1
403	1
404	2
503	1
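
<p>
Counting is additive, which is why the two files can be mapped independently and still yield one correct total. The sketch below illustrates only that arithmetic; it is not a model of how Hadoop schedules the work. The per-file tallies were counted by hand from the two log files above.
</p>

```python
from collections import Counter

# Status code tallies per file, counted by hand from the cells above.
part1 = Counter({"200": 4, "403": 1, "404": 1})            # access_log_part1.txt
part2 = Counter({"200": 3, "400": 1, "404": 1, "503": 1})  # access_log_part2.txt

combined = part1 + part2
print(sorted(combined.items()))
# [('200', 7), ('400', 1), ('403', 1), ('404', 2), ('503', 1)]
```

<p>
The combined tally matches the job output, regardless of the order in which the files are processed.
</p>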
