# Kafka Real-Time Processing Notes

## Overview
- **Task**: Deploy Kafka on AWS EC2 for an ML system.
- **Kafka Features**:  
  - Distributed, fault-tolerant, high-throughput.  
  - Real-time data feeds (more than a simple message queue).  
  - Durable storage with low latency.  
  - Terms noted: "Persistent Storage," "Horizontal Scaling."

---

## System Components

### Zookeeper
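- **Responsibilities**:  
  - Status monitoring of brokers and cluster metadata.
  - Problem handling (e.g., noticing a failed broker so partition leadership can move).

### Producer
- **Responsibilities**:  
  - Sending messages to topics.
  - Partition key distribution (text-based keys determine the target partition).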

### Kafka Cluster Setup
- **Brokers**:  
  - **Broker 1**: 
    - Partition 0 (Leader), Partition 1 (Follower).
  - **Broker 2**: 
    - Partition 0 (Follower), Partition 1 (Leader).
- **Consumer**: Listens and reads data from specified topics.
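
Key-based partition assignment can be sketched as follows. This is a minimal illustration, not Kafka's actual partitioner: Kafka's default partitioner hashes the key bytes with murmur2, while the sketch uses `zlib.crc32` as a stand-in. The function name `pick_partition` and the two-partition count are assumptions made for the example.

```python
import zlib


def pick_partition(key: str, num_partitions: int) -> int:
    """Map a text key to a partition index (illustrative stand-in hash)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# The same key always lands on the same partition, which preserves per-key ordering.
for symbol in ["AAPL", "MSFT", "AAPL"]:
    print(symbol, "->", pick_partition(symbol, 2))
```

Using the stock symbol as the key would keep all updates for one stock in a single partition, so a consumer sees them in order.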

---

## Real-Time Stock Market Analysis Design

### Functional Requirements
1. **Real-time data ingestion**: 
   - Ingest prices, volumes, metadata every 1 second.
2. **Stream processing**: 
   - Data cleaning, aggregation, and transformation in real-time.
3. **Batch and real-time storage**: 
   - Store raw and processed data in a query-optimized format (e.g., S3).
4. **Real-time alerts**: 
   - Detect anomalies (volume spikes, price drops).
5. **Batch analytics**: 
   - Execute SQL queries on historical data.
6. **ML predictions**: 
   - Predict short-term stock trends.
7. **Dashboards**: 
   - Visualize trends.
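
A minimal sketch of the ingestion requirement, assuming each update is a JSON message with `symbol`, `price`, `volume`, and `timestamp` fields. The field names, the `make_tick` and `encode_tick` helpers, and the random-walk price model are illustrative assumptions, not part of the original design. Sending the bytes to Kafka needs a client library and is left out here.

```python
import json
import random
import time


def make_tick(symbol: str, last_price: float, rng: random.Random) -> dict:
    """Build one simulated price update (hypothetical schema)."""
    # Random walk: move the price by up to +/-0.5% per tick.
    price = round(last_price * (1 + rng.uniform(-0.005, 0.005)), 2)
    return {
        "symbol": symbol,
        "price": price,
        "volume": rng.randint(100, 10_000),
        "timestamp": int(time.time()),
    }


def encode_tick(tick: dict) -> bytes:
    """Serialize a tick to UTF-8 JSON bytes, the form a producer would send."""
    return json.dumps(tick).encode("utf-8")


rng = random.Random(42)  # fixed seed so runs are repeatable
tick = make_tick("AAPL", 150.0, rng)
payload = encode_tick(tick)
print(payload)
```

Calling `make_tick` once per second per symbol, feeding each result's price back in as `last_price`, would approximate the 1-second feed.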

### Non-Functional Requirements
- **Low latency**: 
  - Aim for < 5 seconds (end-to-end data insight).
- **Scalability**: 
  - Support 1000 stocks with 1-second updates.
- **Fault tolerance**: 
  - Systems should remain operational in case of failure.
- **Cost efficiency**: 
  - Use S3 storage and EC2 virtual machines.
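
As a rough sizing check for the scalability target, the arithmetic below estimates message rate and daily raw volume. The 200-byte average message size is an assumption for illustration only; measure real payloads before relying on it.

```python
SYMBOLS = 1000            # from the scalability requirement
UPDATES_PER_SECOND = 1    # one update per stock per second
AVG_MESSAGE_BYTES = 200   # assumption, not from the notes

messages_per_second = SYMBOLS * UPDATES_PER_SECOND
messages_per_day = messages_per_second * 60 * 60 * 24
bytes_per_day = messages_per_day * AVG_MESSAGE_BYTES

print(f"{messages_per_second} msg/s")
print(f"{messages_per_day:,} msg/day")
print(f"{bytes_per_day / 1e9:.2f} GB/day (uncompressed)")
```

The daily figures treat the feed as running 24 hours a day, so they are an upper bound; markets trade for only part of the day.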

---

## Key Processes
1. **Real-Time Anomaly Detection**:
   - Identify volume spikes and price drops.
2. **Batch Analysis**:
   - Conduct SQL queries on historical data.
3. **ML Predictions**:
   - Short-term stock trend forecasting.
4. **Visualization**:
   - Display trends using dashboards.
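
The anomaly detection step could look roughly like the sketch below. The thresholds (volume above 3x the recent average, price down 2% or more versus the previous tick), the 60-tick window, and the `AnomalyDetector` class are all assumptions chosen for illustration; the notes do not specify them.

```python
from collections import deque
from statistics import mean


class AnomalyDetector:
    """Flag volume spikes and price drops for one symbol (illustrative thresholds)."""

    def __init__(self, window: int = 60, spike_factor: float = 3.0, drop_pct: float = 2.0):
        self.volumes = deque(maxlen=window)  # recent volumes used as the baseline
        self.spike_factor = spike_factor
        self.drop_pct = drop_pct
        self.last_price = None

    def check(self, price: float, volume: int) -> list:
        alerts = []
        # Volume spike: current volume far above the recent average.
        if self.volumes and volume > self.spike_factor * mean(self.volumes):
            alerts.append("volume_spike")
        # Price drop: percentage fall versus the previous tick.
        if self.last_price and (self.last_price - price) / self.last_price * 100 >= self.drop_pct:
            alerts.append("price_drop")
        self.volumes.append(volume)
        self.last_price = price
        return alerts


detector = AnomalyDetector()
for price, volume in [(100.0, 1000), (100.2, 1100), (97.0, 5000)]:
    print(price, volume, detector.check(price, volume))
```

One detector instance per symbol keeps baselines separate, which matters because normal volume differs widely between stocks.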

### Data Sources
- **APIs**: 
  - Yahoo Finance, Alpha Vantage API.

---

## Kafka on EC2 Deployment Steps
1. **EC2 Setup**:  
   - Launch an Amazon Linux 2 instance.  
   - Connect via SSH using a key pair.  
2. **Install Dependencies**:  
   ```bash
   sudo yum install -y java-1.8.0-openjdk   # Kafka 3.3.1 needs Java 8 or newer
   wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.12-3.3.1.tgz
   tar -xzf kafka_2.12-3.3.1.tgz
   cd kafka_2.12-3.3.1/
   ```
3. **Start Services**:  
   - **Start Zookeeper** (runs in the foreground and must stay running; add `-daemon` after the script name to run it in the background):  
     ```bash
     bin/zookeeper-server-start.sh config/zookeeper.properties
     ```
   - **Start Kafka server (New Terminal)**:  
     ```bash
     export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
     bin/kafka-server-start.sh config/server.properties
     ```
4. **S3 Integration**:  
   - Raw data path: `s3://stock-data/raw/date=2023-10-10/hour=12/example.csv`.
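
A small sketch of how such an object key might be built from a timestamp. It assumes Hive-style `date=` and `hour=` folder names, which Glue and Athena can treat as partitions; the `raw_key` helper and its parameters are hypothetical.

```python
from datetime import datetime, timezone


def raw_key(ts: datetime, filename: str, prefix: str = "raw") -> str:
    """Build a date/hour-partitioned object key (hypothetical helper)."""
    ts = ts.astimezone(timezone.utc)  # partition on UTC to avoid daylight-saving gaps
    return f"{prefix}/date={ts:%Y-%m-%d}/hour={ts:%H}/{filename}"


key = raw_key(datetime(2023, 10, 10, 12, 30, tzinfo=timezone.utc), "example.csv")
print(f"s3://stock-data/{key}")
```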

---

## ETL & Analytics

### AWS Services
- **Glue Jobs**: 
  - Convert JSON to structured data.
  - Clean and process the data (e.g., handle nulls, calculate moving averages).
- **Processed Data Path**:  
  `s3://stock-data/processed/date/hour/`.
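
The cleaning step can be illustrated in plain Python. This is a sketch of the logic only, not Glue or PySpark code; the `symbol`, `close`, and `volume` field names and the forward-fill policy for missing prices are assumptions.

```python
import json


def clean_records(json_lines):
    """Parse JSON lines, forward-fill missing prices, and drop unusable rows."""
    rows, last_close = [], {}
    for line in json_lines:
        rec = json.loads(line)
        symbol = rec.get("symbol")
        if symbol is None:
            continue  # cannot attribute the row to a stock
        close = rec.get("close")
        if close is None:
            close = last_close.get(symbol)  # forward-fill from the previous tick
            if close is None:
                continue  # nothing to fill from yet
        last_close[symbol] = close
        rows.append({
            "symbol": symbol,
            "close": float(close),
            "volume": int(rec.get("volume") or 0),  # treat a missing volume as 0
        })
    return rows


sample = [
    '{"symbol": "AAPL", "close": 150.0, "volume": 900}',
    '{"symbol": "AAPL", "close": null, "volume": null}',
    '{"close": 10.0}',
]
for row in clean_records(sample):
    print(row)
```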

### AWS Athena: SQL Example
```sql
CREATE EXTERNAL TABLE my_table (
    id INT,
    name STRING,
    age INT
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/data/';
```

### AWS Lambda (Optional)
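```python
import base64
import json
import boto3

sns = boto3.client('sns')  # created once and reused across invocations

def lambda_handler(event, context):
    # Kafka event sources group records by topic-partition and base64-encode each value.
    for batch in event['records'].values():
        for record in batch:
            data = json.loads(base64.b64decode(record['value']))
            if data['volume'] > 1000000:
                sns.publish(TopicArn='arn:aws:sns:...', Message=f"Volume spike: {data['symbol']}")
```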
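
The volume-spike check can be exercised locally without AWS. The sketch below assumes the event shape Lambda uses for Kafka event sources (records grouped under `"topic-partition"` keys, with base64-encoded values); the `find_volume_spikes` and `fake_record` helpers and the `stocks-0` key are hypothetical.

```python
import base64
import json


def find_volume_spikes(event: dict, threshold: int = 1_000_000) -> list:
    """Return symbols whose volume exceeds the threshold in a Kafka-style event."""
    spikes = []
    for batch in event.get("records", {}).values():  # keyed by "topic-partition"
        for record in batch:
            data = json.loads(base64.b64decode(record["value"]))
            if data["volume"] > threshold:
                spikes.append(data["symbol"])
    return spikes


def fake_record(symbol: str, volume: int) -> dict:
    """Build a record the way the event source would deliver it (base64 value)."""
    body = json.dumps({"symbol": symbol, "volume": volume}).encode("utf-8")
    return {"value": base64.b64encode(body).decode("ascii")}


event = {"records": {"stocks-0": [fake_record("AAPL", 2_500_000), fake_record("MSFT", 10_000)]}}
print(find_volume_spikes(event))
```

Keeping the detection logic in a pure function like this separates it from the SNS call, so it can be unit tested before deployment.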

---

## Modeling & Predictions

### Feature Engineering
- **Moving Average**:  
```python
df['5min_MA'] = df['close'].rolling(window=300).mean()  # 5-minute rolling mean, assuming one row per second (300 rows)
```
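
With an integer window, pandas counts rows rather than elapsed time and returns NaN until the window is full. A streaming equivalent, which could run inside a consumer as ticks arrive, might look like the sketch below; the `RollingMean` class is a hypothetical helper, and a window of 3 is used only to keep the demo short.

```python
from collections import deque


class RollingMean:
    """Fixed-size rolling mean; returns None until the window is full."""

    def __init__(self, window: int):
        self.window = window
        self.values = deque()
        self.total = 0.0

    def update(self, value: float):
        self.values.append(value)
        self.total += value
        if len(self.values) > self.window:
            self.total -= self.values.popleft()  # drop the oldest sample
        if len(self.values) < self.window:
            return None  # not enough history yet
        return self.total / self.window


ma = RollingMean(window=3)  # 300 would match 5 minutes of 1-second data
print([ma.update(x) for x in [10.0, 11.0, 12.0, 13.0]])
```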

### Models
- **XGBoost**:
  - Pros: Fast, good for structured/tabular data.
  - Best use: Short-term predictions.

- **LSTM**: 
  - Captures temporal patterns, handles volatility well.
  - Requires larger datasets; suited for high-frequency trading.

```python
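from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential([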
    LSTM(128, input_shape=(60, 5)),  # 60 time steps, 5 features
    Dropout(0.3),
    Dense(1)
])
model.compile(loss='mae', optimizer='adam')
```
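
An LSTM with `input_shape=(60, 5)` expects each training sample to be a window of 60 consecutive rows with 5 features. A minimal sketch of building such windows is shown below; the `make_windows` helper, the choice of column 0 as the close price, and the next-row target are assumptions for illustration.

```python
def make_windows(rows, steps=60):
    """Slice a feature table into (window, next-close) training pairs.

    rows: list of feature lists, oldest first; column 0 is assumed to be the close price.
    """
    X, y = [], []
    for end in range(steps, len(rows)):
        X.append(rows[end - steps:end])  # the previous `steps` rows
        y.append(rows[end][0])           # target: close price of the following row
    return X, y


# 100 synthetic rows with 5 features each (values are placeholders).
rows = [[float(i), 0.0, 0.0, 0.0, 0.0] for i in range(100)]
X, y = make_windows(rows)
print(len(X), len(X[0]), len(X[0][0]))
```

Converting `X` and `y` to NumPy arrays would give shapes `(40, 60, 5)` and `(40,)` for this sample data, matching what the model above takes as input.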

### Deployment
- Predictions sent to dashboards/APIs.
- Integration possibilities: AWS SageMaker, AWS Lambda.

---

## Production Considerations
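- **LSTM latency**: Inference time counts toward the < 5-second end-to-end target, so measure it before putting the model in the real-time path.
- **Kafka Integration**: Confirm that producers and consumers can actually reach the broker (for example, EC2 security group rules and the broker's advertised listener address).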
- **Zero Spend Budget**: Set alerts on expenses in AWS to prevent unexpected charges.

---

## Architecture Diagram using Mermaid

```mermaid
graph LR
    A[Stock Data Source - Yahoo Finance, Alpha Vantage] --> B[Kafka on EC2]
    B --> C[Real Time Alerting - AWS Lambda]
    B --> D[Store Raw Data in S3]
    C --> D
    D --> E[Transformation using Glue]
    E --> F[Store Processed Data in S3]
    F --> G[AWS Athena SQL Queries]
    F --> H[Model Predictions - XGBoost, LSTM]
    H --> I[Deploy Model to SageMaker]
    I --> J[Dashboard Visualization]
```

# Continuing from Kafka Real-Time Processing Notes

## Billing in AWS
### AWS Billing Overview
- **Understanding Charges and Costs**:
  - AWS operates on a pay-as-you-go model, meaning you pay for what you use, which can include services like EC2, S3, Lambda, and others.
  
### Zero Spend Budget
- **Purpose of Zero Spend Budget**:
  - This feature allows you to set a budget threshold (e.g., $0.01), which sends an alert when your spending reaches that limit.
  - It is crucial for new users to prevent unexpected charges while experimenting and learning about AWS services.

### Steps to Set Up a Zero Spend Budget:
1. **Navigate to Billing Dashboard**:
   - Go to your AWS Management Console and find the “Billing and Cost Management” section.
   
2. **Create a Budget**:
   - Click on “Budgets” and then “Create Budget.”
   - Select “Cost Budget” and choose “Use a template.”

3. **Set Budget Type**:
   - Choose “Zero Spend Budget” to be alerted as soon as any charge exceeds the threshold; the budget only sends notifications and does not block spending.
   - Set budget conditions based on anticipated usage of different services.

4. **Email Notifications**:
   - Add your email address to receive alerts when your spending approaches or exceeds the defined limit.
   
5. **Review and Confirm**:
   - Review your settings to ensure accuracy and confirm to create the budget.

---

## External Tables in AWS Athena
### What is an External Table?
- **Definition**: 
  - An external table in AWS Athena allows you to query data stored in S3 without the need for loading it into a database.
  - It provides a way to run SQL queries directly against the data stored in S3, treating it as a queryable dataset.

### Steps to Create an External Table:
1. **Use the AWS Athena Console**:
   - Navigate to the Athena service in the AWS Management Console.

2. **Select Database**:
   - Choose the database where you want to create the external table.

3. **Run SQL Command**:
   - Use the following SQL syntax to create your external table:
   ```sql
   CREATE EXTERNAL TABLE my_table (
       id INT,
       name STRING,
       age INT
   )
   ROW FORMAT DELIMITED 
   FIELDS TERMINATED BY ','
   STORED AS TEXTFILE
   LOCATION 's3://my-bucket/data/';
   ```
   - Make sure to replace `'s3://my-bucket/data/'` with the actual S3 path where your data files are stored.
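
For the table above to read files correctly, each file should contain comma-separated rows in column order. A header row would be returned as data unless the table sets the `skip.header.line.count` property, so the sketch below writes none. The `rows_to_csv` helper and sample values are hypothetical.

```python
import csv
import io


def rows_to_csv(rows) -> str:
    """Serialize (id, name, age) rows as headerless comma-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row_id, name, age in rows:
        writer.writerow([int(row_id), name, int(age)])
    return buf.getvalue()


print(rows_to_csv([(1, "Ada", 36), (2, "Linus", 28)]), end="")
```

Note that `csv.writer` quotes any field containing a comma, and a plain delimited table does not interpret those quotes, so this sketch is only suitable for values without embedded commas.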

---

## Conclusion
### Review of Key Concepts
1. **Kafka as a Streaming Platform**:
   - Understand Kafka's role in handling high-throughput, fault-tolerant messaging for real-time updates, such as stock prices.

2. **Setting Up in AWS**:
   - Knowledge of launching EC2 instances, setting up Kafka and Zookeeper, and managing lifecycle (start/stop/terminate).

3. **Real-Time Processing and Analysis**:
   - Combine tools like AWS Glue for ETL processes, AWS Athena for querying, and integration of alerts through AWS Lambda.

4. **Cost Management**:
   - Setting up a Zero Spend Budget is essential for managing costs effectively in AWS, especially for beginners.

5. **Database Interactions**:
   - Utilize external tables effectively in Athena, allowing flexible analysis of data located in S3.

---

## Next Steps
- **Practice Deployment**: 
  - Practice setting up each component (EC2 instance, Zookeeper, Kafka) end to end.
  - Explore real-time data ingestion through Kafka by simulating stock price updates.

- **Engage with AWS Services**: 
  - Familiarize yourself with other relevant services such as AWS Glue for ETL, AWS Lambda for serverless processing, and AWS SageMaker for model deployment.

- **Discussion and Troubleshooting**: 
  - Be prepared to share insights or challenges faced during implementations in the next class session so solutions can be collaboratively identified.