# Twitter: Developing the app

## Source

We will use the `read` method instead of the `readStream` method. This gives us a static DataFrame built from the data already stored in the Kafka topic, which is easier to experiment with:

In [1]:
from pyspark.sql.functions import col, from_json, expr, to_json, window, to_timestamp
from pyspark.sql.types import StructType, StructField, LongType, StringType, ArrayType

In [2]:
raw = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.38.28.103:9092") \
    .option("subscribe", "twitter") \
    .option("startingOffsets", "earliest") \
    .load()

In [3]:
raw.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



For the tests let's keep only 5 rows of the raw data:

In [4]:
raw_subset = raw.limit(5)
raw_subset.cache()

DataFrame[key: binary, value: binary, topic: string, partition: int, offset: bigint, timestamp: timestamp, timestampType: int]

## Transform

We will load the different fields from the JSON string stored in the `value` column:

In [2]:
schema = StructType([
    StructField("created_at", StringType()),
    StructField("lang", StringType()),
    StructField("text", StringType()),
])

In [5]:
values = raw_subset.select(
    from_json(col("value").cast("string"), schema).alias("value"))
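
To make the two steps in the cell above more concrete (casting the binary `value` to a string, then extracting the schema fields), here is a minimal plain-Python sketch of the same idea applied to a single message. The `parse_value` helper and the sample payload are hypothetical, and this is not how Spark implements `from_json`. In particular, how malformed input is reported differs between Spark versions, so the sketch simply returns `None` in that case.

```python
import json

# Field names taken from the schema defined above
SCHEMA_FIELDS = ("created_at", "lang", "text")


def parse_value(value_bytes):
    """Decode one Kafka message value and keep only the schema fields."""
    try:
        # Equivalent of cast("string"): the payload is UTF-8 encoded text
        record = json.loads(value_bytes.decode("utf-8"))
    except ValueError:
        # Covers both invalid UTF-8 and invalid JSON (assumed behaviour)
        return None
    if not isinstance(record, dict):
        return None
    # Fields missing from the message become None; extra fields are dropped
    return {name: record.get(name) for name in SCHEMA_FIELDS}


# Hypothetical payload, only for illustration
sample = b'{"created_at": "2022-09-24T19:24:22.000Z", "lang": "en", "text": "hello", "id": "1"}'
parsed = parse_value(sample)
print(parsed)
```

The `id` field is dropped because it is not part of the schema, which mirrors why only `created_at`, `lang` and `text` appear in the rows shown next.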

In [6]:
values.collect()

[Row(value=Row(created_at=u'2022-09-24T19:24:22.000Z', lang=u'en', text=u"RT @dewaleism: It's unimaginable to see that these Iranian women have more balls that those Russian men fleeing Putin's #mobilization. You\u2026")),
 Row(value=Row(created_at=u'2022-09-24T19:24:23.000Z', lang=u'en', text=u'@AndrewDevoss @TulsiGabbard Brandon says, "F*** the American people". https://t.co/TzWyOPje7A')),
 Row(value=Row(created_at=u'2022-09-24T19:24:23.000Z', lang=u'en', text=u'@Zzzaikar Putin is murdering his own citizens. These are young people, with little military experience, being sent to fight a war that serves ONLY to enrich mafia Putin and his corrupted mates #PutinWarCriminal')),
 Row(value=Row(created_at=u'2022-09-24T19:24:24.000Z', lang=u'en', text=u'Because of Fascist-supporting, Putin apologist Tankies like you.\n\nPerhaps he needs the number for Wagner so he can sign up? https://t.co/W6keQhl9RJ')),
 Row(value=Row(created_at=u'2022-09-24T19:24:26.000Z', lang=u'en', text=u'RT @DrDenaGray

As we can see, each Row now has a `value` field that itself contains a Row object. We want to expand the fields of this inner Row into independent top-level columns:

In [7]:
exploded = values.selectExpr("value.*")

In [8]:
exploded.collect()

[Row(created_at=u'2022-09-24T19:24:22.000Z', lang=u'en', text=u"RT @dewaleism: It's unimaginable to see that these Iranian women have more balls that those Russian men fleeing Putin's #mobilization. You\u2026"),
 Row(created_at=u'2022-09-24T19:24:23.000Z', lang=u'en', text=u'@AndrewDevoss @TulsiGabbard Brandon says, "F*** the American people". https://t.co/TzWyOPje7A'),
 Row(created_at=u'2022-09-24T19:24:23.000Z', lang=u'en', text=u'@Zzzaikar Putin is murdering his own citizens. These are young people, with little military experience, being sent to fight a war that serves ONLY to enrich mafia Putin and his corrupted mates #PutinWarCriminal'),
 Row(created_at=u'2022-09-24T19:24:24.000Z', lang=u'en', text=u'Because of Fascist-supporting, Putin apologist Tankies like you.\n\nPerhaps he needs the number for Wagner so he can sign up? https://t.co/W6keQhl9RJ'),
 Row(created_at=u'2022-09-24T19:24:26.000Z', lang=u'en', text=u'RT @DrDenaGrayson: \U0001f6a8BREAKNG: Drafted men in Omsk, #Russia a

**We now have to parse the date string.**

Here we have the reference for the pattern syntax:
- [Datetime Patterns for Formatting and Parsing](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html)

Let's do some tests to see if we are able to parse it correctly.

- The automatic mode seems to recognize the format correctly:

In [9]:
spark.sql('''select to_timestamp("2022-09-24T19:24:22.000Z")''').show()

+----------------------------------------+
|to_timestamp('2022-09-24T19:24:22.000Z')|
+----------------------------------------+
|                     2022-09-24 21:24:22|
+----------------------------------------+



- Building a specific pattern string from the documentation does not seem to work:

In [10]:
spark.sql('''select to_timestamp("2022-09-24T19:24:22.000Z", "yyyy-MM-ddTHH:mm:ss.SSSZ")''').show()

+--------------------------------------------------------------------+
|to_timestamp('2022-09-24T19:24:22.000Z', 'yyyy-MM-ddTHH:mm:ss.SSSZ')|
+--------------------------------------------------------------------+
|                                                                null|
+--------------------------------------------------------------------+



- The problem is that literal text in the pattern, such as the `T`, has to be escaped with the `'` character (see the reference above):

In [11]:
spark.sql('''select to_timestamp("2022-09-24T19:24:22.000Z", "yyyy-MM-dd'T'HH:mm:ss")''').show()

+-------------------------------------------------------------------+
|to_timestamp('2022-09-24T19:24:22.000Z', 'yyyy-MM-dd\'T\'HH:mm:ss')|
+-------------------------------------------------------------------+
|                                                2022-09-24 19:24:22|
+-------------------------------------------------------------------+



In [12]:
tweets = exploded.withColumn(
    "created_at",
    to_timestamp(col("created_at"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
)
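
In the pattern above the `Z` is quoted, so it is matched as literal text and not interpreted as a UTC marker. This appears to be why the parsed values stay at 19:24, while the automatic mode earlier returned 21:24. The same distinction can be reproduced in plain Python (3.7 or later), as a rough analogy rather than a description of Spark internals. The UTC+2 offset used below is an assumption inferred from the output of the automatic mode.

```python
from datetime import datetime, timedelta, timezone

raw = "2022-09-24T19:24:22.000Z"

# "Z" matched as literal text: the result has no time zone attached
naive = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")

# "Z" parsed as a UTC offset via %z: the result is time zone aware
aware = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")

# Assumed local zone (UTC+2), used only to mirror the earlier output
local_zone = timezone(timedelta(hours=2))
local = aware.astimezone(local_zone)

print(naive.tzinfo)
print(aware.utcoffset())
print(local.strftime("%Y-%m-%d %H:%M:%S"))
```

Whether the literal or the zone-aware interpretation is the right one depends on how the timestamps are used later. For counting tweets per window either works, as long as it is applied consistently.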

In [13]:
tweets.collect()

[Row(created_at=datetime.datetime(2022, 9, 24, 19, 24, 22), lang=u'en', text=u"RT @dewaleism: It's unimaginable to see that these Iranian women have more balls that those Russian men fleeing Putin's #mobilization. You\u2026"),
 Row(created_at=datetime.datetime(2022, 9, 24, 19, 24, 23), lang=u'en', text=u'@AndrewDevoss @TulsiGabbard Brandon says, "F*** the American people". https://t.co/TzWyOPje7A'),
 Row(created_at=datetime.datetime(2022, 9, 24, 19, 24, 23), lang=u'en', text=u'@Zzzaikar Putin is murdering his own citizens. These are young people, with little military experience, being sent to fight a war that serves ONLY to enrich mafia Putin and his corrupted mates #PutinWarCriminal'),
 Row(created_at=datetime.datetime(2022, 9, 24, 19, 24, 24), lang=u'en', text=u'Because of Fascist-supporting, Putin apologist Tankies like you.\n\nPerhaps he needs the number for Wagner so he can sign up? https://t.co/W6keQhl9RJ'),
 Row(created_at=datetime.datetime(2022, 9, 24, 19, 24, 26), lang=u'en', 

In [14]:
tumbling_window = tweets.groupBy(
    window(col("created_at"), "5 minutes")
).count()

In [15]:
tumbling_window.show(truncate=False)

+------------------------------------------+-----+
|window                                    |count|
+------------------------------------------+-----+
|[2022-09-24 19:20:00, 2022-09-24 19:25:00]|5    |
+------------------------------------------+-----+
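
A tumbling window assigns every timestamp to exactly one fixed-size, non-overlapping bucket, which is why the five tweets above end up in a single 19:20 to 19:25 window. A minimal plain-Python sketch of the bucket computation follows (Python 3). The `window_bounds` helper is hypothetical, it is not Spark's API, and it ignores details such as the `startTime` offset and time zone handling.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def window_bounds(ts, minutes=5):
    """Return the (start, end) of the tumbling window that contains ts."""
    size = timedelta(minutes=minutes)
    # Number of whole windows elapsed since the epoch
    index = (ts - EPOCH) // size
    start = EPOCH + index * size
    return start, start + size


# Seconds taken from the created_at values of the five sample tweets
created = [datetime(2022, 9, 24, 19, 24, s) for s in (22, 23, 23, 24, 26)]
windows = {window_bounds(ts) for ts in created}
print(windows)
```

The window start is inclusive and the end is exclusive, so a tweet created at exactly 19:25:00 would fall into the next window.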



## Loading the ML model

In [16]:
from pyspark.ml import PipelineModel
my_saved_model = PipelineModel.load('models/amazon_sentiment_analysis_cv_model')

In [17]:
from pyspark.sql.functions import lit

df = tweets.withColumn('reviewText', col('text')).withColumn('overall', lit(5.0))

In [18]:
predictions = my_saved_model.transform(df)

In [19]:
predictions.select('text', 'prediction').show(vertical=True, truncate=160)

-RECORD 0----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 text       | RT @dewaleism: It's unimaginable to see that these Iranian women have more balls that those Russian men fleeing Putin's #mobilization. You…                      
 prediction | 1.0                                                                                                                                                              
-RECORD 1----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 text       | @AndrewDevoss @TulsiGabbard Brandon says, "F*** the American people". https://t.co/TzWyOPje7A                                                                    
 prediction | 1.0                                                                                                       

Remember that a prediction of 1.0 means positive sentiment and 0.0 means negative sentiment (see the Amazon lab).

## Using TextBlob

To use TextBlob you will have to install it first using:
```
pip install --user TextBlob
```

and then you have to download corpora:
```
python -m textblob.download_corpora
```

To use it in a Spark job we will have to generate a zip file with the dependencies:
```
pip install -t dependencies -r requirements_textblob.txt
cd dependencies
zip -r ../dependencies.zip .
```

And then to include the dependencies you will use the `--py-files dependencies.zip` option.

Additionally, we will need the corpora stored in `/opt/cesga/nltk_data`:
```
export NLTK_DATA="/opt/cesga/nltk_data"
```

How to use TextBlob:
- [TextBlob: Simplified Text Processing](https://textblob.readthedocs.io/en/dev/)

In [20]:
from __future__ import print_function
from textblob import TextBlob

text = """
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
"""

blob = TextBlob(text)

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)


0.06
-0.341666666667


Let's see how it works:

In [21]:
TextBlob('This is wonderful. Just wonderful.').sentences

[Sentence("This is wonderful."), Sentence("Just wonderful.")]

In [22]:
[s.polarity for s in TextBlob('This is wonderful. Just wonderful.').sentences]

[1.0, 1.0]

In [23]:
sum([s.polarity for s in TextBlob('This is wonderful. Just wonderful.').sentences])

2.0
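
Because the per-sentence polarities are summed, the total grows with the number of sentences and can leave the [-1, 1] range of a single TextBlob polarity, as the 2.0 above shows. If a bounded score is preferred, one option is to average instead of summing. A minimal sketch on plain lists of scores follows; the `total_polarity` and `mean_polarity` helpers are hypothetical and are not used by the rest of this lab.

```python
def total_polarity(scores):
    """Sum of per-sentence polarities (the approach used in this lab)."""
    return float(sum(scores))


def mean_polarity(scores):
    """Average per-sentence polarity; stays within [-1, 1]."""
    if not scores:
        return 0.0  # assumed convention for text without sentences
    return sum(scores) / float(len(scores))


# Per-sentence polarities from the cell above
scores = [1.0, 1.0]
print(total_polarity(scores))
print(mean_polarity(scores))
```

The sum rewards longer texts with a consistent tone, while the mean makes tweets of different lengths comparable. Which one is preferable depends on the analysis.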

### Using pyspark to test TextBlob interactively

First we will save our tweets dataframe in HDFS:

In [24]:
tweets.write.parquet('tweets')

To test that TextBlob works in the cluster we have to use `pyspark` instead of `jupyter`:
- We have to pass our `dependencies.zip` with the `--py-files` option
- We have to set the `NLTK_DATA` environment variable for the executors

```
PYSPARK_PYTHON=$(which python) PYSPARK_DRIVER_PYTHON=$(which ipython) pyspark \
    --py-files notebook/extended/exercises/dependencies.zip \
    --conf spark.executorEnv.NLTK_DATA="/opt/cesga/nltk_data"
```

Let's verify that the TextBlob module is available in the executors (run this inside the `pyspark` session you have just launched):
```python
from textblob import TextBlob
sc.parallelize(range(2), 2).map(lambda x: TextBlob('Wonderful')).collect()
```

To use TextBlob with the Spark Structured API we have to define a UDF first:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from textblob import TextBlob

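@udf(returnType=DoubleType())
def polarity(text):
    # Rows without text arrive as None; return null instead of failing
    if text is None:
        return None
    # float() keeps the result a double even when there are no sentences
    return float(sum(s.polarity for s in TextBlob(text).sentences))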
```

And then we can test the UDF in pyspark:
```python
from pyspark.sql.functions import col, lit

tweets = spark.read.parquet('tweets')
tweets.show()

df = (tweets
      .withColumn('polarity', polarity(col('text')))
     )
df.select('text', 'polarity').show(vertical=True, truncate=160)
```

We can also test TextBlob on a single tweet (we will see that the polarity returned is usually 0):
```python
[s.polarity for s in TextBlob(u'RT @DrDenaGrayson: 🚨BREAKNG: Drafted men in Omsk, #Russia are fighting police to avoid being forced into military service, telling the cops…').sentences]
```

## Ready to go

Now we are ready to go and we can proceed with our app.

To submit it to the cluster you can use:

```
spark-submit --conf spark.dynamicAllocation.enabled=false --num-executors 2 Unit_8_twitter_sentiment_analysis.py
```