
# 🧠 Apache Server Log Analysis using PySpark RDD

## Overview
In this notebook, we’ll use **PySpark RDDs** to analyze an **Apache web server log file** and identify **bad requests**, here meaning responses with status code `400` (Bad Request), `404` (Not Found), or `500` (Internal Server Error).

We'll follow these steps:
1. Create a Spark session.  
2. Read the log file into an RDD.  
3. Preview the first log entry.  
4. Inspect the log line structure.  
5. Extract the status code field.  
6. Filter bad requests and count them.  
7. Count the occurrences of each bad request code.


## Step 1: Create a Spark Session

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

## Step 2: Read the Apache Log File into an RDD

In [None]:
logDataPath = 'access.log.1'
logRdd = spark.sparkContext.textFile(logDataPath)

## Step 3: Preview the First Log Entry

In [None]:
logRdd.first()

## Step 4: Inspect the Log Line Structure

In [None]:
x = logRdd.first()
x.split()  # split on whitespace to list the individual fields
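
To make the token positions concrete, here is a minimal sketch that splits a made-up log line in plain Python. The sample line is an assumption: it follows Apache's combined log format and is not taken from `access.log.1`, so the indices in your own file may differ if its log format is customized.

```python
# Hypothetical log line in Apache combined log format (not from access.log.1).
sample_line = (
    '203.0.113.7 - - [17/May/2015:10:05:03 +0000] '
    '"GET /missing-page.html HTTP/1.1" 404 512 '
    '"-" "Mozilla/5.0"'
)

tokens = sample_line.split()

# Print each whitespace-separated token next to its index.
for index, token in enumerate(tokens):
    print(index, token)

# With this layout the status code sits at index 8
# and the response size at index 9.
status_code = tokens[8]
```

The timestamp and the quoted request each span several tokens, which is why the status code ends up at index 8 rather than at an earlier position.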

## Step 5: Extract the Status Code Field

In [None]:
# Index 8 is the HTTP status code when the line is split on whitespace
x.split()[8]

In [None]:
# The same extraction written as a lambda, the form passed to RDD transformations
lambda x: x.split()[8]
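
Indexing with `[8]` assumes every line splits into at least nine tokens and that the request itself contains no extra spaces. As a hedged alternative, the sketch below pulls the status code with a regular expression and returns `None` for lines that do not match. The function name `extract_status`, the pattern, and both sample lines are illustrative assumptions, not part of the original notebook.

```python
import re

# Matches the first quoted field (the request) followed by a
# three-digit status code, e.g. ... "GET /index.html HTTP/1.1" 200 ...
# Assumption: the request itself contains no double quote.
STATUS_PATTERN = re.compile(r'"[^"]*"\s+(\d{3})\b')

def extract_status(line):
    """Return the status code as a string, or None if the line is malformed."""
    match = STATUS_PATTERN.search(line)
    return match.group(1) if match else None

# Hypothetical lines: one whose URL contains a space, one truncated.
good_line = ('203.0.113.7 - - [17/May/2015:10:05:03 +0000] '
             '"GET /a b.html HTTP/1.1" 500 0 "-" "curl/7.0"')
bad_line = 'truncated entry'

print(extract_status(good_line))
print(extract_status(bad_line))
```

For `good_line`, `split()[8]` would return part of the request instead of the status code, because the space in the URL shifts every later token by one. On the RDD, a function like this could be passed to `map()` in place of the lambda, with `None` results filtered out afterwards.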

## Step 6: Filter Bad Requests and Count Them

In [None]:
# count() is an action: it runs the filter and returns the number of matching lines
logRdd.filter(lambda x: x.split()[8] in ['400', '404', '500']).count()

## Step 7: Count the Occurrences of Each Bad Request Code

In [None]:
logRdd.filter(lambda x: x.split()[8] in ['400', '404', '500']) \
    .map(lambda x: (x.split()[8], 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .collect()
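
To see what the `map` and `reduceByKey` steps compute, here is a plain-Python sketch that mimics them on a small list. It runs without Spark. The status codes in `codes` are made-up sample data, and `reduce_by_key` is a hypothetical helper that only imitates the per-key merging, not Spark's distributed execution.

```python
from functools import reduce

def reduce_by_key(pairs, func):
    """Merge the values of each key with func, like a local reduceByKey."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

# Made-up status codes standing in for x.split()[8] on each log line.
codes = ['200', '404', '500', '404', '200', '400', '404']
bad_codes = ['400', '404', '500']

# filter -> map to (code, 1) -> merge the 1s per code
pairs = [(code, 1) for code in codes if code in bad_codes]
counts = reduce_by_key(pairs, lambda x, y: x + y)

print(sorted(counts.items()))
# [('400', 1), ('404', 3), ('500', 1)]
```

Spark's `collect()` returns the same kind of `(code, count)` pairs as a list, though their order is not guaranteed.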


## ✅ Summary
- Read Apache log data into an RDD.  
- Explored log structure to identify the status code field.  
- Filtered and counted bad requests (`400`, `404`, `500`).  
- Aggregated counts per error code using `reduceByKey()`.

Together, these steps show how RDD transformations (`filter()`, `map()`, `reduceByKey()`) and actions (`first()`, `count()`, `collect()`) combine for basic **log analysis** in PySpark.
