Data is growing and accumulating faster than ever before. Around 90% of all the data in the world was generated in the last two years alone. Because of this staggering growth rate, big data platforms have had to adopt radical solutions in order to manage such huge volumes of data.

Social networks are one of the main sources of data today. Allow me to demonstrate a real-life example: processing, analyzing, and extracting insights from social network data in real time using one of the most important tools in the big data ecosystem, Apache Spark, together with Python.

In this step, I’ll show you how to build a simple client that gets tweets from the Twitter API using Python and passes them to the Spark Streaming instance.

Import the libraries that we’ll use as below:

In [1]:
import socket
import sys
import requests
import requests_oauthlib
import json

And add the variables that will be used in OAuth for connecting to Twitter as below:

In [2]:
CONSUMER_KEY = '69SNKxw0qbGY6PPZBcRVLZVpP'
CONSUMER_SECRET = 'sIUppqVb3mDXDdbnXGst50fL2DvmtcyjakxbJhA7D5vQpt3PNr'
ACCESS_TOKEN = '1039586617445572608-3R3OHxxrH5JY8e9IITG3w3pwkZpGFc'
ACCESS_SECRET = 'KrqdNUh5tddGq1SW1SUIeQMBelQcyLzQVOaOzAPfNZG84'
my_auth = requests_oauthlib.OAuth1(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
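Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, the four values could instead be read from environment variables. The helper name load_twitter_credentials and the environment variable names are my own assumptions, chosen to mirror the variables above; they are not part of any library.

```python
import os

# Hypothetical environment variable names, mirroring the notebook's variables.
CREDENTIAL_NAMES = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")


def load_twitter_credentials(env=os.environ):
    """Return the four OAuth values from `env`, failing early if any is missing."""
    missing = [name for name in CREDENTIAL_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in CREDENTIAL_NAMES)


# Intended use in the notebook (left as a comment so the sketch runs on its own):
# my_auth = requests_oauthlib.OAuth1(*load_twitter_credentials())
```

The values come back in the same order that OAuth1 is called with above, so they can be unpacked straight into that call.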

Now, we will create a new function called get_tweets that will call the Twitter API URL and return the response for a stream of tweets.

In [3]:
def get_tweets():
    url = 'https://stream.twitter.com/1.1/statuses/filter.json'
    query_data = [('language', 'en'),('track','#')]
    query_url = url + '?' + '&'.join([str(t[0]) + '=' + str(t[1]) for t in query_data])
    response = requests.get(query_url, auth=my_auth, stream=True)
    print(query_url, response)
    return response
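One thing to watch in get_tweets is that the query string is joined by hand, so the # in the track value is not URL-encoded. In a URL, # starts the fragment, which is not part of the query, so track effectively ends up empty. That is a plausible cause of the 406 response in the output at the end of this section, though this is an assumption rather than something verified against the API. The sketch below compares the hand-built URL with one built using the standard library's urlencode; build_query_url is an illustrative helper name, not part of the original client.

```python
from urllib.parse import urlencode, urlsplit


def build_query_url(base_url, query_data):
    """Append query_data to base_url with every key and value percent-encoded."""
    return base_url + "?" + urlencode(query_data)


url = 'https://stream.twitter.com/1.1/statuses/filter.json'
query_data = [('language', 'en'), ('track', '#')]

# Hand-built URL, as in get_tweets: the '#' is left as-is.
naive_url = url + '?' + '&'.join(str(k) + '=' + str(v) for k, v in query_data)
# Encoded URL: '#' becomes '%23' and stays inside the query string.
encoded_url = build_query_url(url, query_data)

print(urlsplit(naive_url).query)    # language=en&track=
print(urlsplit(encoded_url).query)  # language=en&track=%23
```

Alternatively, requests can do the encoding itself if the pairs are passed through the params argument of requests.get instead of being appended to the URL.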

Then, create a function that takes the response from the function above and extracts each tweet’s text from the full tweet JSON object. It then sends every tweet to the Spark Streaming instance (discussed later) through a TCP connection.

In [4]:
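def send_tweets_to_spark(http_resp, tcp_connection):
    for line in http_resp.iter_lines():
        try:
            full_tweet = json.loads(line)
            tweet_text = full_tweet['text']
            print("Tweet Text: " + tweet_text)
            print("------------------------------------------")
            # End each tweet with a newline so the receiver can split the stream into lines
            tweet_data = bytes(tweet_text + "\n", "utf-8")
            # sendall keeps writing until the whole payload has been sent
            tcp_connection.sendall(tweet_data)
        except Exception:
            # Report the exception type and keep reading the stream
            e = sys.exc_info()[0]
            print("Error: %s" % e)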
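Not every line in the stream is a tweet. As far as I know, a streaming response can contain blank keep-alive lines, and an error response body may not be JSON at all, which would explain a JSONDecodeError like the one in the output at the end of this section. Here is a minimal sketch of a parsing helper that returns None for anything that is not a tweet with a text field. The name extract_tweet_text is hypothetical, and the sample lines are made up for illustration.

```python
import json


def extract_tweet_text(line):
    """Return the tweet text from one raw stream line, or None if there is none."""
    if not line:
        # Blank line (for example a keep-alive): nothing to parse.
        return None
    try:
        payload = json.loads(line)
    except ValueError:
        # Covers json.JSONDecodeError and undecodable bytes; both subclass ValueError.
        return None
    if not isinstance(payload, dict):
        return None
    # Non-tweet messages have no 'text' key, so .get returns None for them.
    return payload.get("text")


sample_lines = [b'', b'{"text": "hello #spark"}', b'not json', b'{"delete": {}}']
texts = [extract_tweet_text(raw) for raw in sample_lines]
print(texts)  # [None, 'hello #spark', None, None]
```

With a helper like this, the loop in send_tweets_to_spark could skip a line whenever the result is None instead of relying on the exception handler.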

Now, we’ll write the main part, which makes the app host a socket connection that Spark will connect to. We’ll set the IP to localhost, since everything will run on the same machine, and the port to 9020. Then we’ll call the get_tweets function we made above to get the tweets from Twitter, and pass its response along with the socket connection to send_tweets_to_spark, which sends the tweets to Spark.

In [5]:
TCP_IP = "localhost"
TCP_PORT = 9020
conn = None
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind((TCP_IP, TCP_PORT))
s.listen(1)
print("Waiting for TCP connection...")
conn, addr = s.accept()
print("Connected... Starting getting tweets.")
resp = get_tweets()
send_tweets_to_spark(resp, conn)
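To see the TCP handoff in isolation, without Twitter or Spark, here is a small self-contained sketch. A server thread accepts one connection and sends newline-terminated lines, and a client stands in for Spark and reads them back. The function names, the sample lines, and the use of 127.0.0.1 with an OS-assigned port (port 0) are all assumptions made for this demo; the real app listens on the fixed port above.

```python
import socket
import threading


def serve_lines(server_sock, lines):
    """Accept one client and send each line, newline-terminated, then close."""
    conn, _addr = server_sock.accept()
    with conn:
        for text in lines:
            conn.sendall((text + "\n").encode("utf-8"))


def read_lines(host, port):
    """Connect like a stream consumer would and read lines until the server closes."""
    with socket.create_connection((host, port)) as client:
        with client.makefile("r", encoding="utf-8") as reader:
            return [line.rstrip("\n") for line in reader]


server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))      # port 0 lets the OS pick a free port for the demo
server.listen(1)
port = server.getsockname()[1]

sample = ["first #tweet", "second #tweet"]
worker = threading.Thread(target=serve_lines, args=(server, sample))
worker.start()

received = read_lines("127.0.0.1", port)
worker.join()
server.close()
print(received)  # ['first #tweet', 'second #tweet']
```

The newline at the end of each message is what lets the reader split the byte stream back into individual tweets.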

Waiting for TCP connection...
Connected... Starting getting tweets.
https://stream.twitter.com/1.1/statuses/filter.json?language=en&track=# <Response [406]>
Error: <class 'json.decoder.JSONDecodeError'>
