# Parquet reader for tweets

This notebook reads and analyzes data from a Parquet file.
The example uses Twitter data that was stored in Parquet format by the KafkaTwitterTopic-to-Parquet notebook.

In [1]:
# Spark, Spark SQL and the add operator, which we'll need to process the data
from pyspark import SparkContext

from pyspark.sql import SparkSession
from operator import add

In [2]:
# Create Spark Session 
spark = SparkSession \
    .builder \
    .appName("Parquet reader") \
    .getOrCreate()

In [3]:
# Read the data from the Parquet file
df = spark.read.parquet("data/tweets.parquet")

In [4]:
# Count the rows
df.count()

10548

In [5]:
# Print the schema; the next cell shows a sample from the beginning of the DataFrame
df.printSchema()

root
 |-- created_at: string (nullable = true)
 |-- tweet_id: long (nullable = true)
 |-- user_id: long (nullable = true)
 |-- user_name: string (nullable = true)
 |-- screen_name: string (nullable = true)
 |-- text: string (nullable = true)



In [None]:
df.show(5)

___
SQL queries for some simple sample operations.

In [6]:
# Create a temporary view for SQL queries
df.createOrReplaceTempView('table')

In [7]:
# Count the tweets that contain some of our keywords
marathon = spark.sql("SELECT count(*) AS marathon FROM table WHERE text LIKE '%marathon%'")
jogging = spark.sql("SELECT count(*) AS jogging FROM table WHERE text LIKE '%jogging%'")
trailrunning = spark.sql("SELECT count(*) AS trailrunning FROM table WHERE text LIKE '%trailrunning%'")

In [8]:
marathon.show()
jogging.show()
trailrunning.show()

+--------+
|marathon|
+--------+
|     236|
+--------+

+-------+
|jogging|
+-------+
|    101|
+-------+

+------------+
|trailrunning|
+------------+
|           5|
+------------+
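
Note that `LIKE '%marathon%'` in Spark SQL is a case-sensitive substring match, so a tweet that only contains "Marathon" is not counted by the queries above. The plain-Python sketch below mimics that behaviour on a few made-up rows; `sample_texts` and the `count_like` helper are hypothetical and not part of the notebook.

```python
# Hypothetical sample rows standing in for the `text` column.
sample_texts = [
    "Training for my first marathon",
    "Marathon day is here!",
    "Morning jogging by the river",
    "RT: MARATHON results are out",
]


def count_like(texts, keyword, ignore_case=False):
    """Count texts containing `keyword`, similar to SQL LIKE '%keyword%'."""
    if ignore_case:
        keyword = keyword.lower()
        return sum(1 for t in texts if keyword in t.lower())
    return sum(1 for t in texts if keyword in t)


# Case-sensitive match, as LIKE behaves in Spark SQL.
case_sensitive = count_like(sample_texts, "marathon")
# Lowercasing the text first also catches "Marathon" and "MARATHON".
case_insensitive = count_like(sample_texts, "marathon", ignore_case=True)
print(case_sensitive, case_insensitive)
```

In Spark SQL a comparable effect could be obtained with `WHERE lower(text) LIKE '%marathon%'`.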



In [None]:
# Let's see the most active users
users = spark.sql("SELECT user_name, screen_name, count(*) AS lkm FROM table GROUP BY user_name, screen_name ORDER BY lkm DESC")
users.show()
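
The `GROUP BY ... ORDER BY lkm DESC` query above counts tweets per user and ranks the users by that count. As a rough illustration of the same logic outside Spark, here is a plain-Python sketch using `collections.Counter`; the `sample_rows` data is made up for the example.

```python
from collections import Counter

# Hypothetical (user_name, screen_name) pairs, one per tweet.
sample_rows = [
    ("Alice", "alice_runs"),
    ("Bob", "bob42"),
    ("Alice", "alice_runs"),
    ("Carol", "carol_c"),
    ("Alice", "alice_runs"),
    ("Bob", "bob42"),
]

# Roughly GROUP BY user_name, screen_name with count(*).
tweets_per_user = Counter(sample_rows)

# Roughly ORDER BY lkm DESC: most_common() sorts by count, descending.
ranking = tweets_per_user.most_common()

# The first entry is the most active user.
(top_user, top_screen_name), top_count = ranking[0]
print(top_screen_name, top_count)
```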

In [None]:
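# And the tweets of the most active user.
# Replace the placeholder with a screen_name from the list above; '=' is used because '_' is a wildcard in LIKE.
mostactive = spark.sql("SELECT * FROM table WHERE screen_name = '_pick_most_active_user_screen_name_from_list_above_'")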
mostactive.select('text').show(10, truncate=False)

It seems that 'running' isn't a good keyword for capturing tweets about running as a sport. The tweets by the most active user seem to be about running tap water...

---
Let's try something else: count the words in the text column across all tweets and sort them in descending order of frequency.

In [9]:
# This is done with lambda functions on Spark's RDD data structure.
# text is the 6th column, so with zero-based indexing it is at index 5.
# Lowercase each text and split it into words on spaces.
# Then map each word to a (word, 1) tuple, sum the ones per word, and sort by count in descending order.
counts = df.rdd.map(lambda x: x[5]) \
            .flatMap(lambda x: x.lower().split(' ')) \
            .map(lambda x: (x, 1)) \
            .reduceByKey(add).sortBy(lambda x: -x[1])
            
counts.take(10)

[('the', 6905),
 ('rt', 6689),
 ('running', 5783),
 ('to', 3707),
 ('and', 3563),
 ('of', 3049),
 ('a', 2730),
 ('is', 2160),
 ('in', 1870),
 ('on', 1675)]
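
To make the pipeline above easier to follow, here is a minimal plain-Python sketch of the same steps on two made-up texts (`sample_texts` is hypothetical, not data from the Parquet file). It also shows one side effect of `split(' ')`: consecutive spaces produce empty-string "words", which a plain `split()` would avoid.

```python
from collections import Counter

# Hypothetical texts standing in for the `text` column.
sample_texts = [
    "RT Running the marathon",
    "the tap is  running",  # note the double space
]

# map + flatMap: lowercase each text and split it on single spaces.
words = [w for text in sample_texts for w in text.lower().split(' ')]

# map to (word, 1) + reduceByKey(add): Counter sums the occurrences per word.
word_counts = Counter(words)

# sortBy(-count): order the (word, count) pairs by count, descending.
top = sorted(word_counts.items(), key=lambda kv: -kv[1])
print(top[:3])
```

As in the real output, common words such as "the" rank near the top, so filtering out stop words and empty strings before counting may give a more informative list.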