# <font color=blue><center>Log Processing Using Lambda Architecture</center></font>
## Agenda

### Why?
Web server log analysis can offer important insights on everything from security to customer service to SEO. The information collected in web server logs can help you with:
- Network troubleshooting efforts
- Development and quality assurance
- Identifying and understanding security issues
- Customer service
- Maintaining compliance with both government and corporate policies

### Architecture
- Overview of data flow
- Tech Stack
- End result

### Environment Setup
- AWS EC2 instance and security group creation
- Docker installation and running
- Usage of docker-compose and starting all the tools
- How to access the tools from the local machine

### Log Preprocessing
- Walkthrough of the Common Log Format
- Parsing the log file
- Data Cleaning
- Fix the rows with null content_size

### Extraction
- Download data from Kaggle
- Push data using NiFi
- Creating Kafka topic and publishing log data to it

### Transformation and Load
- Schema creation
- Reading data from Kafka as Streaming Dataframe
- Extraction and transformation of log data
- Continuous data load to Cassandra

### Visualization
- Multipage Dash Application development
- Realtime Dashboard
- Hourly Dashboard
- Daily Dashboard

### Code walkthrough
- Log Preprocessing
- Log Listener
- Log Visualizer

## <font color=blue>Architecture</font>
### Overview of data flow
#### Data Flow Architecture
![alt text](log_processing_using_lambda_architecture.png)
### Tech Stack
* AWS EC2
* Docker
* Jupyter Lab
* Spark Structured Streaming
* NiFi
* Kafka
* Python
* Cassandra
* Plotly
* Dash

### End result

## <font color=blue>Environment Setup</font>
### AWS EC2 instance and security group creation
- t2.xlarge instance
- 32GB of storage recommended
- Allow ports 4000 - 38888
- Connect to ec2 via ssh
 <code>ssh -i "D:\path\to\private\key.pem" user@Public_DNS</code>
 <br/>Example:<code>ssh -i "D:\Users\pyerravelly\Desktop\twitter_analysis.pem" ec2-user@ec2-54-203-235-65.us-west-2.compute.amazonaws.com</code><br/>
- Port forwarding 
 <code>ssh -i "D:\path\to\private\key.pem" user@Public_DNS -L local_port:localhost:remote_port</code>
 <br/>Example:<code>ssh -i "D:\Users\pyerravelly\Desktop\twitter_analysis.pem" ec2-user@ec2-34-208-254-29.us-west-2.compute.amazonaws.com -L 2081:localhost:2041 -L 4888:localhost:4888 -L 2080:localhost:2080 -L 8050:localhost:8050 -L 4141:localhost:4141</code><br/>
- Copy from local to ec2
  <code>scp -r -i "D:\path\to\private\key.pem" local_path user@Public_DNS:remote_path</code>
  <br/>Example:<code>scp -r -i "D:\Users\pyerravelly\Desktop\twitter_analysis.pem" D:\Users\pyerravelly\Downloads\spark-standalone-cluster-on-docker-master\build\docker\docker-exp ec2-user@ec2-34-208-254-29.us-west-2.compute.amazonaws.com:/home/ec2-user/docker_exp
</code>

### Docker installation and running
    
### Usage of docker-compose and starting all the tools

- Commands to install Docker

<code>sudo yum update -y</code>
<code><br/>sudo yum install docker</code>
<code><br/>sudo curl -L "https://github.com/docker/compose/releases/download/1.29.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose</code>
<code><br/>sudo chmod +x /usr/local/bin/docker-compose</code>
<code><br/>sudo gpasswd -a $USER docker</code>
<code><br/>newgrp docker</code>
<br/>Start Docker: <code>sudo systemctl start docker</code>
<br/>Stop Docker: <code>sudo systemctl stop docker</code>

- How to access the tools from the local machine <br/>
    List running Docker containers: <code>docker ps</code><br/>
    CLI access to a Docker container: <code>docker exec -i -t kafka bash</code><br/>
    NiFi at: http://localhost:2080/nifi/ <br/>
    Jupyter Lab at: http://localhost:4888/lab? <br/>
    HDFS at: http://localhost:50070/ <br/>
    Dash Application at: http://localhost:8050/ (available once log_visualizer.ipynb has been run)

## <font color=blue>Log Preprocessing</font>

- Walkthrough of the <a href="https://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format"> Common Log Format </a>
- Parsing the log file
- Data cleaning
- Fixing the rows with null content_size
- Formatting the timestamp
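
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the regular expression, the function name `parse_log_line`, and the choice to replace a missing content size (`-`) with `0` are assumptions made for the example.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<response>\d{3}) (?P<bytes>\d+|-)$'
)


def parse_log_line(line):
    """Parse one Common Log Format line into a dict, or return None if malformed."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None  # data cleaning: the caller can drop unparseable rows
    fields = match.groupdict()

    # The request is normally "METHOD URL PROTOCOL"; tolerate shorter requests.
    parts = fields.pop("request").split()
    fields["method"] = parts[0] if parts else None
    fields["url"] = parts[1] if len(parts) > 1 else None

    # Fix rows with no content size: "-" becomes 0 (an assumed convention).
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    fields["response"] = int(fields["response"])

    # Format the timestamp as ISO 8601, keeping the UTC offset.
    parsed_time = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["time"] = parsed_time.isoformat()
    return fields
```

Returning `None` for malformed lines keeps the cleaning decision (drop, log, or repair) with the caller rather than hiding it inside the parser.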

## <font color=blue>Extraction</font>

### Download Dataset
- <a href="https://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html"> Actual NASA Logs </a>
- <a href="https://www.kaggle.com/souhagaa/nasa-access-log-dataset-1995/download"> Kaggle </a>
- <code>docker exec -i -t nifi bash <br/>mkdir -p nasa_logs && cp /opt/workspace/nifi/InputData/data.csv nasa_logs/data.csv</code>

### NiFi
- Processor
- Connection

### Kafka
- Topic
- Producer
- Consumer

### Topic
- Topic creation through CLI

#### Commands
<code>docker ps</code> get the Kafka container name

<code>docker exec -i -t kafka bash</code> open a shell in the Kafka container

<code>kafka-topics.sh --create --topic nasa_logs_demo --partitions 1 --replication-factor 1 --if-not-exists --zookeeper zookeeper:2181</code> creation of topic named nasa_logs_demo

<code>kafka-topics.sh --list --bootstrap-server localhost:29092</code> list topics

<code>kafka-topics.sh --describe --topic nasa_logs_demo --zookeeper zookeeper:2181</code> describe topic

<code>kafka-topics.sh --delete --topic nasa_logs_demo --zookeeper zookeeper:2181</code> delete topic

<code>kafka-console-consumer.sh --bootstrap-server localhost:29092 --topic nasa_logs_demo --from-beginning --max-messages 30</code> consume/read data from topic

### Streaming data from file system using NiFi
- Go to http://localhost:2080/nifi/
- NiFi setup
- Publish log data via NiFi

## <font color=blue>Transformation and Load</font>

### Cassandra and HDFS set up
- Create keyspace and table in Cassandra
- CQL Commands

<code>docker exec -i -t cassandra bash</code>
<code><br/>cqlsh -u cassandra -p cassandra</code>
<code><br/>CREATE KEYSPACE IF NOT EXISTS LogAnalysis WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};</code>
<code><br/>CREATE TABLE IF NOT EXISTS LogAnalysis.NASALog (host text, time text, method text, url text, response text, bytes text, extension text, time_added text, PRIMARY KEY (host));</code>
<code><br/>TRUNCATE TABLE LogAnalysis.NASALog;</code>

- Create a folder in HDFS

<code>docker exec -i -t namenode bash</code>
<code><br/>hdfs dfs -mkdir -p /output1/nasa_logs/</code>
<br/>Check the new folder in the HDFS web UI at: http://localhost:50070/

### Read Streaming Data and Cleansing
- Schema creation
- Reading data from Kafka as Streaming Dataframe
- Extraction and cleansing of Log data

### Data Loading
- Continuous data load to Cassandra
- Writing data to HDFS
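
One possible way to wire both sinks is sketched below. It assumes the Spark Cassandra Connector is available to the Spark session, uses `foreachBatch` for the Cassandra writes, and writes CSV files to HDFS; the function names, the CSV format, and the checkpoint layout are illustrative choices, not the notebook's confirmed implementation. Cassandra lowercases unquoted identifiers, so `LogAnalysis.NASALog` is addressed as `loganalysis.nasalog`.

```python
def write_batch_to_cassandra(batch_df, batch_id):
    """Append one micro-batch to the LogAnalysis.NASALog table."""
    (
        batch_df.write.format("org.apache.spark.sql.cassandra")
        .options(keyspace="loganalysis", table="nasalog")
        .mode("append")
        .save()
    )


def start_sinks(stream_df, hdfs_path, checkpoint_root):
    """Start one streaming query per sink and return both query handles."""
    # Continuous load to Cassandra, one micro-batch at a time.
    cassandra_query = (
        stream_df.writeStream.foreachBatch(write_batch_to_cassandra)
        .option("checkpointLocation", checkpoint_root + "/cassandra")
        .outputMode("append")
        .start()
    )
    # Parallel copy of the same records to HDFS as CSV files.
    hdfs_query = (
        stream_df.writeStream.format("csv")
        .option("path", hdfs_path)
        .option("checkpointLocation", checkpoint_root + "/hdfs")
        .outputMode("append")
        .start()
    )
    return cassandra_query, hdfs_query
```

Each query gets its own checkpoint directory so that the two sinks track their progress independently. For `hdfs_path`, the folder created earlier (`/output1/nasa_logs/`) would be the natural target; the full HDFS URI depends on how the namenode is exposed to Spark.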

## <font color=blue>Visualization</font>
- Scatter graph and Table definition with intervals using Python Plotly and Dash
- Graph and Table app callback
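
A scatter graph and table refreshed on an interval could be sketched as follows. The names `build_scatter_figure`, `create_app`, and `fetch_rows` are hypothetical, as is the choice to plot request counts per response code; `fetch_rows` stands in for whatever function reads the latest rows from Cassandra and returns them as a list of dicts.

```python
from collections import Counter

# Column names taken from the Cassandra table definition.
TABLE_COLUMNS = [
    "host", "time", "method", "url",
    "response", "bytes", "extension", "time_added",
]


def build_scatter_figure(rows):
    """Count requests per response code and return a Plotly figure as a dict."""
    counts = Counter(row["response"] for row in rows)
    codes = sorted(counts)
    return {
        "data": [{
            "type": "scatter",
            "mode": "markers",
            "x": codes,
            "y": [counts[code] for code in codes],
        }],
        "layout": {
            "title": {"text": "Requests per response code"},
            "xaxis": {"title": {"text": "Response code"}, "type": "category"},
            "yaxis": {"title": {"text": "Requests"}},
        },
    }


def create_app(fetch_rows, refresh_ms=5000):
    """Build a Dash app whose graph and table reload every refresh_ms milliseconds."""
    # Imported lazily so the figure helper stays usable without Dash installed.
    from dash import Dash, Input, Output, dash_table, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([
        dcc.Graph(id="response-scatter"),
        dash_table.DataTable(
            id="log-table",
            columns=[{"name": name, "id": name} for name in TABLE_COLUMNS],
            page_size=10,
        ),
        # The interval component fires the callback on a timer.
        dcc.Interval(id="refresh", interval=refresh_ms, n_intervals=0),
    ])

    @app.callback(
        Output("response-scatter", "figure"),
        Output("log-table", "data"),
        Input("refresh", "n_intervals"),
    )
    def refresh(_n_intervals):
        rows = fetch_rows()
        return build_scatter_figure(rows), rows

    return app
```

The app returned by `create_app` can then be started on port 8050 with its `run` method (`run_server` on older Dash releases). Keeping the figure construction in a separate function makes it easy to check without starting a server.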

## <font color=blue>Code walkthrough</font>
- Log Listener
- Log Visualizer