# **Pinterest Data Pipeline**
*Pinterest Data Pipeline project using Databricks, Spark, Airflow, Kinesis, Kafka, API Gateway*

Pinterest crunches billions of data points every day to decide how to provide more value to their users.\
In this project, we'll create a similar system using the AWS Cloud.

To start, we used an **[emulation script](script/user_post_emulation.py)** to recreate sample data from Pinterest's infrastructure. The data is split into 3 parts: **Post**, **Geolocation** and **User**.

## // 1 // Sample Data

#### Pinterest Post Data
This data describes what a post contains: the image title, description and author (poster_name), plus additional details such as follower_count and tags. See the sample data below.

```
{   'index': 9198,
    'unique_id': '4beee221-e7c2-4e5d-93d9-a0e98e3450a0',
    'title': 'Find the best global talent.',
    'description': 'No description available',
    'poster_name': 'Fiverr',
    'follower_count': '565k',
    'tag_list': 'Hand Tattoos,Dope Tattoos,Pretty Tattoos,Beautiful Tattoos,Body Art Tattoos,Small Tattoos,Tattoos For Guys,Tattoos For Women,Flower Tattoos',
    'is_image_or_video': 'image',
    'image_src': 'https://i.pinimg.com/originals/f9/08/67/f908679c6fd45aed6ab23b482728fa83.jpg',
    'downloaded': 1,
    'save_location': 'Local save in /data/tattoos',
    'category': 'tattoos'
    }
```

#### Geolocation Data
Contains details of where the post was made from and where the author is from.

```
{   'ind': 9198,
    'timestamp': datetime.datetime(2019, 4, 7, 22, 11, 2),
    'latitude': -12.1295,
    'longitude': -29.9199,
    'country': 'Afghanistan'
    }
```

#### User Data
Contains the author's actual name and personal details, as opposed to the alias presented under "poster_name".

```
{   'ind': 9198,
    'first_name': 'Amber',
    'last_name': 'Chen',
    'age': 22,
    'date_joined': datetime.datetime(2015, 12, 30, 5, 21, 14)
    }
```
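
The three sample records share the same row number (`index` in the post data, `ind` in the geolocation and user data), so they can be joined on it. Below is a minimal sketch using trimmed copies of the samples; the helper name `combine_records` is hypothetical and is not part of the emulation script.

```python
from datetime import datetime

# Trimmed copies of the sample records shown above.
post = {"index": 9198, "title": "Find the best global talent.", "poster_name": "Fiverr"}
geo = {"ind": 9198, "timestamp": datetime(2019, 4, 7, 22, 11, 2), "country": "Afghanistan"}
user = {"ind": 9198, "first_name": "Amber", "last_name": "Chen", "age": 22}


def combine_records(post, geo, user):
    """Merge one post, geolocation and user record that share a row number.

    Hypothetical helper, for illustration only.
    """
    if not (post["index"] == geo["ind"] == user["ind"]):
        raise ValueError("records do not describe the same row")
    combined = {"ind": post["index"]}
    # Drop the per-source key so the row number appears only once.
    combined.update({k: v for k, v in post.items() if k != "index"})
    combined.update({k: v for k, v in geo.items() if k != "ind"})
    combined.update({k: v for k, v in user.items() if k != "ind"})
    return combined


row = combine_records(post, geo, user)
print(row["poster_name"], row["first_name"], row["country"])
```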

## // 2 // Set up EC2

Before we can start batch processing, we first need to set up a Kafka cluster. We will do this from our EC2 instance on AWS. You will need an IAM account along with an AWS access key ID and secret access key. Once your account is configured, connect to your EC2 instance through the terminal. To do this, create a .pem key file containing the private key for your connection.

In the AWS console, go to
```
EC2 > Instances > <instance-id> > Connect to instance
```
where you will find instructions on how to set up your connection.

## // 3 // Set up Kafka Cluster

Before we can run the Kafka cluster to ingest the data, we need to adjust client.properties so that it connects with our IAM authentication. This is so that the ingested data can be stored in our S3 bucket.

Create the client.properties file with the ```nano``` command in the Kafka bin directory shown below, and insert the following code with your IAM access role details.
```
ec2-user > kafka_[version-no.] > bin
```

```
# Sets up TLS for encryption and SASL for authN.
security.protocol = SASL_SSL

# Identifies the SASL mechanism to use.
sasl.mechanism = AWS_MSK_IAM

# Binds SASL client implementation.
sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn="Your Access Role";

# Encapsulates constructing a SigV4 signature based on extracted credentials.
# The SASL client bound by "sasl.jaas.config" invokes this class.
sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler
```

## // 4 // Create Kafka Topic

Next we need to create our Kafka topics, which tell our data where to go upon ingestion.

As there are 3 pieces of data, we will run the command below 3 times, once per topic. The ```kafka-topics.sh``` file is located inside the same ```ec2/kafka.../bin```. \
You can find your ```BootstrapServerString``` in the Amazon MSK console.

```
./kafka-topics.sh --bootstrap-server BootstrapServerString --command-config client.properties --create --topic <topic_name>
```
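
Because the same command is repeated once per data set, it can help to generate the three command lines rather than type them. The sketch below only builds the strings and does not run anything. The suffixes `pin`, `geo` and `user` are taken from the consumer commands further down this README, and the function name `topic_commands` is hypothetical.

```python
import shlex

# Assumed suffixes, one per data set (post, geolocation, user).
TOPIC_SUFFIXES = ("pin", "geo", "user")


def topic_commands(bootstrap_servers, topic_prefix, config_file="client.properties"):
    """Return one kafka-topics.sh --create command line per topic."""
    commands = []
    for suffix in TOPIC_SUFFIXES:
        args = [
            "./kafka-topics.sh",
            "--bootstrap-server", bootstrap_servers,
            "--command-config", config_file,
            "--create",
            "--topic", f"{topic_prefix}.{suffix}",
        ]
        # shlex.quote keeps each argument safe to paste into a shell.
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


for command in topic_commands("BootstrapServerString", "<YOUR_UUID>"):
    print(command)
```

Run the printed commands from the Kafka bin directory, replacing the placeholders with your own bootstrap server string and topic prefix.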

## // 5 // Connect the MSK Cluster to S3 Bucket

Now that the credentials are inside our client.properties, we need a means of running the ingestion process via Kafka. To do this we will set up the connection between our MSK cluster and the S3 bucket, using the Confluent S3 sink connector for Kafka Connect.

```
# switch to a login shell as ec2-user
sudo -u ec2-user -i

# create directory where we will save our connector 
mkdir kafka-connect-s3 && cd kafka-connect-s3

# download connector from Confluent
wget https://d1i4a15mxbxib1.cloudfront.net/api/plugins/confluentinc/kafka-connect-s3/versions/10.0.3/confluentinc-kafka-connect-s3-10.0.3.zip

# copy connector to our S3 bucket
aws s3 cp ./confluentinc-kafka-connect-s3-10.0.3.zip s3://<BUCKET_NAME>/kafka-connect-s3/
```

Next we create a **custom plugin** in the Amazon MSK console from the connector we uploaded.\
```Amazon MSK > Customised plugins > 0e2a0bfcc015-plugin```\
Insert the code below into the connector configuration.

```
connector.class=io.confluent.connect.s3.S3SinkConnector

# same region as our bucket and cluster
s3.region=us-east-1
flush.size=1
schema.compatibility=NONE
tasks.max=3

# naming pattern of the topics to read: this reads all data from topics whose names start with <YOUR_UUID>
topics.regex=<YOUR_UUID>.*
format.class=io.confluent.connect.s3.format.json.JsonFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
value.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
storage.class=io.confluent.connect.s3.storage.S3Storage
key.converter=org.apache.kafka.connect.storage.StringConverter
s3.bucket.name=<BUCKET_NAME>
```
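
To see which topics a `topics.regex` value would pick up, the pattern can be tried against a list of topic names. The sketch below uses Python's `re` module as a stand-in for the Java regular expressions that Kafka Connect uses, and it assumes the pattern has to match the whole topic name. The function name `matching_topics` and the topic `other.pin` are made up for illustration.

```python
import re


def matching_topics(topics_regex, topic_names):
    """Return the topic names that the pattern matches in full."""
    pattern = re.compile(topics_regex)
    return [name for name in topic_names if pattern.fullmatch(name)]


topics = ["0e2a0bfcc015.pin", "0e2a0bfcc015.geo", "0e2a0bfcc015.user", "other.pin"]
print(matching_topics("0e2a0bfcc015.*", topics))
```

Only the three topics that start with `0e2a0bfcc015` are returned, so a single connector covers the post, geolocation and user topics.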

## // 6 // Configure API Gateway

We will create our Kafka REST Proxy API.

1 - Go to the API Gateway console in AWS: API Gateway > APIs \
2 - Create a {proxy+} resource \
3 - Set up the integration through **Edit integration** and select **HTTP proxy** \
4 - For the **Endpoint URL**, use the Public IPv4 DNS of the Kafka client EC2 instance, found in the EC2 console. The format should be ```http://KafkaClientEC2InstancePublicDNS:8082/{proxy}``` \
5 - Deploy the API \
6 - Provide a stage name and details when prompted.

We will use the **invoke URL** inside of our **[emulation script](script/user_post_emulation.py)**. 
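
As a rough illustration of how the invoke URL is used, the sketch below builds a POST request in the format the Confluent REST Proxy v2 API accepts for JSON records. This is not the emulation script itself: the function name `build_record_request` and the URL `https://example.invalid/stage` are placeholders, and the request is only constructed, not sent.

```python
import json
import urllib.request
from datetime import datetime


def build_record_request(invoke_url, topic, record):
    """Build (but do not send) a POST request for the Kafka REST proxy."""
    # default=str turns datetime values into strings so they can be JSON-encoded.
    payload = json.dumps({"records": [{"value": record}]}, default=str)
    return urllib.request.Request(
        url=f"{invoke_url.rstrip('/')}/topics/{topic}",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        method="POST",
    )


request = build_record_request(
    "https://example.invalid/stage",  # placeholder for your invoke URL
    "0e2a0bfcc015.geo",
    {"ind": 9198, "timestamp": datetime(2019, 4, 7, 22, 11, 2), "country": "Afghanistan"},
)
print(request.full_url)
# To actually send the record: urllib.request.urlopen(request)
```

The Kafka REST proxy on the EC2 instance has to be running before a request like this can succeed.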

Once the above steps are completed, we will update the kafka-rest folder on our EC2 machine and place the following into kafka-rest.properties:

```
# Sets up TLS for encryption and SASL for authN.
client.security.protocol = SASL_SSL

# Identifies the SASL mechanism to use.
client.sasl.mechanism = AWS_MSK_IAM

# Binds SASL client implementation.
client.sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn="Your Access Role";

# Encapsulates constructing a SigV4 signature based on extracted credentials.
# The SASL client bound by "sasl.jaas.config" invokes this class.
client.sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler
```

#### Connect to the EC2 instance
Use the key pair to start the SSH session.

```
ssh -i "utility/0e2a0bfcc015-key-pair.pem" ec2-user@ec2-184-73-75-230.compute-1.amazonaws.com
```

#### Start the Kafka REST proxy
Start the Kafka REST proxy before sending any data via the API. Run it from ```confluent-7.2.0/bin```.

```
cd confluent-7.2.0/bin/
./kafka-rest-start /home/ec2-user/confluent-7.2.0/etc/kafka-rest/kafka-rest.properties
```

#### Kafka Consumer
The commands below are the ones that worked. Each establishes a live consumer stream for one topic. Run them from inside the Kafka bin directory.

```
cd kafka_2.12-2.8.1/bin/
```

```
./kafka-console-consumer.sh --bootstrap-server b-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098,b-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098,b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --consumer.config client.properties --topic 0e2a0bfcc015.pin --from-beginning --group students --max-messages 10

./kafka-console-consumer.sh --bootstrap-server b-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098,b-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098,b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --consumer.config client.properties --topic 0e2a0bfcc015.geo --from-beginning --group students --max-messages 10

./kafka-console-consumer.sh --bootstrap-server b-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098,b-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098,b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --consumer.config client.properties --topic 0e2a0bfcc015.user --from-beginning --group students --max-messages 10
```

## // 7 // Databricks

## // 8 // Data Transformation

## // 9 // Create DAG on MWAA Environment

## // 10 // Stream Processing: AWS Kinesis

```
import os
from os.path import expanduser

# Locate the Airflow home directory and check that it exists.
home = expanduser("~")
airflow_dir = os.path.join(home, 'airflow')
assert os.path.isdir(airflow_dir)
```

```
import os
from os.path import expanduser
from pathlib import Path

# Create the dags folder inside the Airflow home directory if it is missing.
home = expanduser("~")
airflow_dir = os.path.join(home, 'airflow')
Path(f"{airflow_dir}/dags").mkdir(parents=True, exist_ok=True)
```

```
from datetime import datetime, timedelta

from airflow.models import DAG
# Airflow 2.x deprecates this module path in favour of airflow.operators.bash.
from airflow.operators.bash_operator import BashOperator
```