The code below is a pure Python app. It connects to Twitter's streaming API, receives live tweets (a small sample of them), assembles batches of tweets into JSON files, and drops the files in a directory so that they can be consumed by PySpark's Structured Streaming.

> In the past, I wrote the tweets to a socket in real time, which Structured Streaming can consume directly. That approach has much lower latency, but the Databricks environment does not seem to support sockets well.

- We use the [Tweepy](https://docs.tweepy.org/en/stable/) package to interface with Twitter.
- Twitter has REST APIs (which you can use to make one-off requests) and a [streaming API](https://docs.tweepy.org/en/stable/stream.html), with which you maintain a live connection and keep receiving new tweets.
- We choose to receive only COVID-related tweets in English.
- After receiving a tweet, we record its timestamp and text and discard the other information.
- We pack a fixed number of tweets (`tweets_per_file`, set to 10 in the code below) into a batch and write each batch to a JSON file in a local directory.
- We also print the tweets on screen.
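Each batch file holds one JSON object per line, with the two fields `time` and `text`. As a minimal sketch of that layout using only the standard library (the helper names `write_batch` and `read_batch` are hypothetical and not part of the app), writing a batch and reading it back might look like this:

```python
import json
import os
import tempfile
import time


def write_batch(directory, tweets):
    """Write a list of {'time', 'text'} dicts as one JSON object per line."""
    # The file is named after the current local time, as in the app.
    name = time.strftime("%Y%m%d-%H%M%S") + ".txt"
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf8") as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + "\n")
    return path


def read_batch(path):
    """Read a batch file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        {"time": "2022-11-02T12:34:56Z", "text": "first tweet"},
        {"time": "2022-11-02T12:34:57Z", "text": "second tweet"},
    ]
    with tempfile.TemporaryDirectory() as demo_dir:
        batch_path = write_batch(demo_dir, sample)
        print(read_batch(batch_path))
```

Keeping one object per line means a reader can process the file line by line without loading a whole JSON array.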

## Step 1: Obtain Twitter API Credentials
To use any of this, we need to set up a Developer API account with Twitter and create an application to get credentials.

- Make sure you have a Twitter account
- Set up a Developer API account with Twitter
- Create an application to get credentials at [https://apps.twitter.com/](https://apps.twitter.com/)
    + Consumer Key 
    + Consumer Secret 
    + Access Token
    + Access Token Secret

These credentials will be entered into your `tweetread.ipynb` so that you can connect to Twitter's streaming service and receive tweets.
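If you would rather not type the four values directly into the notebook, one possible alternative is to read them from environment variables. The sketch below is only an illustration: the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are assumptions, not something Twitter or Tweepy prescribes.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}
```

The returned dictionary uses the same four names as the variables in the program in Step 3, so its values can be assigned to them directly.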

## Step 2: Install `Tweepy`

In [None]:
!pip install tweepy

Collecting tweepy
  Downloading tweepy-4.12.0-py3-none-any.whl (101 kB)
Collecting requests-oauthlib<2,>=1.2.0
  Downloading requests_oauthlib-1.3.1-py2.py3-none-any.whl (23 kB)
Collecting oauthlib<4,>=3.2.0
  Downloading oauthlib-3.2.2-py3-none-any.whl (151 kB)

In [None]:
%%bash
rm -rf tweets

## Step 3: Develop the TweetRead Program

In the following, we develop an app that connects to Twitter's [Streaming API](https://docs.tweepy.org/en/stable/stream.html) and periodically writes the timestamped tweets to a local directory.

- In the `on_data` event handler of `TweetsListener`, we will
  - load the data into a JSON object
  - extract `created_at` and `text` from the JSON object and save them in a dictionary `{'time':, 'text':}`
  - append the dictionary to a list `buffer`
  - at the same time, print the tweet on screen
  - maintain a counter of the dictionaries in the buffer. Once the counter reaches `tweets_per_file`, we write the buffer to a file in the given directory, and then reset the buffer and the counter.

- In the `sendData(directory)` function, we will
  - create a `TweetsListener` instance, supplying it with the Twitter API credentials
  - save `directory` to the listener's `directory` property
  - call the listener's [`filter` API](https://docs.tweepy.org/en/stable/stream.html#tweepy.Stream.filter) to start listening to tweets on a particular topic (`covid`) and language (`en`)

- In the main logic, we will
  - create a directory `/databricks/driver/tweets`
  - call `sendData(...)`
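The buffering step can be tried out without a Twitter connection. The sketch below is a simplified stand-in, not the listener itself: the class name `TweetBuffer` and the `on_full` callback are hypothetical, and the callback takes the place of the file-writing step.

```python
class TweetBuffer:
    """Collect tweets and hand them off once `tweets_per_file` is reached."""

    def __init__(self, tweets_per_file, on_full):
        self.tweets_per_file = tweets_per_file
        self.on_full = on_full  # called with the full batch (a list of dicts)
        self.buffer = []

    def add(self, tweet):
        self.buffer.append(tweet)
        if len(self.buffer) >= self.tweets_per_file:
            # Swap in a fresh list before handing the batch off, so the
            # buffer is reset even if the callback raises.
            batch, self.buffer = self.buffer, []
            self.on_full(batch)


if __name__ == "__main__":
    batches = []
    buf = TweetBuffer(tweets_per_file=3, on_full=batches.append)
    for i in range(7):
        buf.add({"time": "2022-11-02T12:34:56Z", "text": "tweet %d" % i})
    print(len(batches), len(buf.buffer))  # prints: 2 1
```

Here the length of the list serves as the counter, so the two cannot drift apart.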

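For the fields of the tweet object (such as `created_at` and `text`), see the Twitter API v1.1 data dictionary: https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview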
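To illustrate the two fields the app keeps, here is a small sketch that converts a v1.1-style `created_at` string into an ISO 8601 timestamp. The payload is a trimmed, made-up example, and the helper name `to_record` is hypothetical; it uses `datetime` rather than the `time` module used in the program below, as one possible alternative.

```python
import json
from datetime import datetime, timezone

# Trimmed, made-up payload; real tweet objects carry many more fields.
RAW = '{"created_at": "Wed Nov 02 12:34:56 +0000 2022", "text": "covid update", "lang": "en"}'


def to_record(raw):
    """Keep only the timestamp (as ISO 8601, UTC) and the text of a tweet."""
    msg = json.loads(raw)
    created = datetime.strptime(msg["created_at"], "%a %b %d %H:%M:%S %z %Y")
    iso = created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"time": iso, "text": msg["text"]}


if __name__ == "__main__":
    print(to_record(RAW))
    # prints {'time': '2022-11-02T12:34:56Z', 'text': 'covid update'}
```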

In [None]:
import tweepy
import io
import json
import time
import os

# On Databricks, install tweepy via Cluster / Libraries / Install New / PyPI and enter `tweepy`.
# For demonstration purposes, you may also use `!pip install tweepy`, which installs it
# on the driver node only (not on all cluster nodes).

# todo: Set up your credentials (replace the placeholders with your own keys)
consumer_key='YOUR_CONSUMER_KEY'
consumer_secret='YOUR_CONSUMER_SECRET'
access_token='YOUR_ACCESS_TOKEN'
access_secret='YOUR_ACCESS_TOKEN_SECRET'

class TweetsListener(tweepy.Stream):
  counter = 0 # for data counter.
  tweets_per_file = 10 # how many tweets per file? configure based on your needs
  buffer = []
  directory = None

  # on_data is triggered each time new data (a tweet) arrives on the stream.
  def on_data(self, data):
      try:
          # Parse the raw payload and keep only the timestamp and the text.
          msg = json.loads(data)
          created = time.strptime(msg['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
          tweet = {"time": time.strftime('%Y-%m-%dT%H:%M:%SZ', created), "text": msg['text']}
          print(f"{tweet['time']} - {tweet['text']}")
          self.buffer.append(tweet)
          self.counter = self.counter + 1

          if self.counter >= self.tweets_per_file:
              try:
                  # Name the file after the current local time and write one JSON object per line.
                  # The `with` block closes the file, so no explicit close() is needed.
                  timestr = time.strftime('%Y%m%d-%H%M%S')
                  with io.open(self.directory + "/" + timestr + ".txt", "w", encoding="utf8") as f:
                      for row in self.buffer:
                          f.write(json.dumps(row))
                          f.write("\n")

                  # Reset the counter and the buffer for the next batch.
                  self.counter = 0
                  self.buffer = []
              except Exception as e:
                  print("error writing file: %s" % str(e))
      except Exception as e:
          # If anything goes wrong while processing the data, print the error and keep streaming.
          # Catching Exception (not BaseException) lets a cancel/KeyboardInterrupt stop the app.
          print("Error on_data: %s" % str(e))
      return True
  
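  # on_request_error is triggered when the stream receives a non-200 HTTP status code
  # (in Tweepy 4.x, tweepy.Stream calls this method rather than on_error).
  def on_request_error(self, status_code):
      print(status_code)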

def sendData(directory):
  # todo: create a TweetsListener named twitter_stream, configure the directory, and start listening.
  twitter_stream = TweetsListener(consumer_key, consumer_secret, access_token, access_secret)
  twitter_stream.directory = directory
  twitter_stream.filter(track=['covid'], languages=['en'])

# Create the output directory; exist_ok avoids an error if it already exists.
os.makedirs("/databricks/driver/tweets", exist_ok=True)

sendData("/databricks/driver/tweets")

Stream encountered HTTP error: 403
HTTP error response text: {"errors":[{"message":"You currently have Essential access which includes access to Twitter API v2 endpoints only. If you need access to this endpoint, you’ll need to apply for Elevated access via the Developer Portal. You can learn more here: https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api#v2-access-leve","code":453}]}



## Note that this app runs forever; make sure to cancel it when you're done using it.
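If you prefer the app to stop by itself, one option is a time limit checked from inside the listener. The sketch below shows only the timing part and is an assumption about how you might wire it in; the class name `Deadline` is hypothetical and nothing here is part of Tweepy.

```python
import time


class Deadline:
    """Tracks whether a fixed number of seconds has elapsed since creation."""

    def __init__(self, seconds, clock=time.monotonic):
        # `clock` is injectable so the behaviour can be checked without waiting.
        self._clock = clock
        self._end = clock() + seconds

    def expired(self):
        return self._clock() >= self._end


if __name__ == "__main__":
    deadline = Deadline(seconds=0.1)
    print(deadline.expired())  # False right after creation
    time.sleep(0.2)
    print(deadline.expired())  # True once the limit has passed
```

Inside `on_data`, a check such as `if self.deadline.expired(): self.disconnect()` could then end the stream, where `deadline` is a hypothetical attribute you would set on the listener and `disconnect()` is the `tweepy.Stream` method that closes the connection.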

In [None]:
%%bash
ls -l tweets

total 0
