polleyg/gcp-tweets-streaming-pipeline

 
 

Real time tweets pipeline using GCP

Forked from https://github.com/GoogleCloudPlatform/kubernetes-bigquery-python and then dutifully hacked.

Architecture

Twitter -> App Engine Flex -> PubSub -> Dataflow -> BigQuery
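To make the data flow concrete, here is a minimal Python sketch of how a single tweet might travel through the stages above. It is an illustration, not the repository's code. The two-column row shape (`timestamp` in epoch milliseconds, `payload` as a JSON string) is inferred from the sample query further down. The helper names `to_message` and `to_row`, the sample tweet, and the use of the tweet's `timestamp_ms` field are assumptions.

```python
import json
import time
from typing import Optional


def to_message(tweet: dict) -> bytes:
    """Serialize a tweet to UTF-8 JSON bytes, the form a Pub/Sub message body takes."""
    return json.dumps(tweet).encode("utf-8")


def to_row(tweet: dict, now_ms: Optional[int] = None) -> dict:
    """Wrap a tweet in the assumed row shape: epoch millis plus the raw JSON."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Assumption: use the tweet's own timestamp_ms if present, else the current time.
    ts = int(tweet.get("timestamp_ms", now_ms))
    return {"timestamp": ts, "payload": json.dumps(tweet)}


# Hypothetical tweet, trimmed to the fields the sample query reads.
tweet = {
    "text": "Streaming into BigQuery",
    "timestamp_ms": "1546300800000",
    "user": {"screen_name": "example_user", "followers_count": 42},
}

# Publisher side: tweet -> bytes. Pipeline side: bytes -> dict -> row.
message = to_message(tweet)
row = to_row(json.loads(message.decode("utf-8")))
```

Keeping the whole tweet as one JSON string means the table schema stays fixed even if the tweet format changes, at the cost of parsing JSON at query time.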

Prerequisites

  • A PubSub topic
  • A BigQuery Dataset
  • A GCS bucket called tweet-pipeline

Deployment

gcloud builds submit --config=cloudbuild.yaml .

Sample Query

SELECT
  TIMESTAMP_MILLIS(timestamp) AS tweet_timestamp,
  JSON_EXTRACT(payload,
    '$.text') AS tweet_text,
  JSON_EXTRACT(payload,
    '$.user.screen_name') AS user_screen_name,
  JSON_EXTRACT(payload,
    '$.user.location') AS user_location,
  JSON_EXTRACT(payload,
    '$.user.followers_count') AS user_followers_count
FROM
  `twitter.tweets`
WHERE
  JSON_EXTRACT(payload,
    '$.text') LIKE '%BigQuery%'
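
One detail worth knowing when reading the results: `JSON_EXTRACT` returns the value as JSON-encoded text, so string fields such as the tweet text come back wrapped in double quotes, whereas `JSON_EXTRACT_SCALAR` returns the bare value. The Python sketch below imitates that difference locally for simple string and number values. The helper names and the sample payload are hypothetical, and this is a rough stand-in rather than a full reimplementation of BigQuery's JSONPath handling.

```python
import json


def json_extract(payload: str, path: list):
    """Rough stand-in for JSON_EXTRACT: JSON-encoded text, or None if the path is missing."""
    value = json.loads(payload)
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return json.dumps(value)


def json_extract_scalar(payload: str, path: list):
    """Rough stand-in for JSON_EXTRACT_SCALAR: unquoted text for strings and numbers."""
    encoded = json_extract(payload, path)
    if encoded is None:
        return None
    value = json.loads(encoded)
    if isinstance(value, (dict, list)):
        return None  # objects and arrays are not scalars
    return value if isinstance(value, str) else encoded


# Hypothetical payload, shaped like the fields the query reads.
payload = json.dumps(
    {"text": "Loving BigQuery", "user": {"screen_name": "example_user", "followers_count": 42}}
)

quoted = json_extract(payload, ["text"])         # keeps the surrounding double quotes
bare = json_extract_scalar(payload, ["text"])    # plain text, no quotes
followers = json_extract(payload, ["user", "followers_count"])
```

The `LIKE '%BigQuery%'` filter matches either way, since the extra quotes sit outside the matched substring. The difference only shows up in the displayed columns.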

Languages

  • Python 50.9%
  • Java 44.6%
  • Dockerfile 4.5%