# Assignment 3: Construction of a predictive model using Spark Structured Streaming and textual data (Hacker News)

## 1. Construction of a data set using the provided stream

In [3]:
import threading

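# Helper thread that keeps a Spark StreamingContext (DStream API) from blocking the Jupyter kernel.
# The Structured Streaming query used below starts without blocking and is stopped with query.stop().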
        
class StreamingThread(threading.Thread):
    def __init__(self, ssc):
        super().__init__()
        self.ssc = ssc
    def run(self):
        self.ssc.start()
        self.ssc.awaitTermination()
    def stop(self):
        print('----- Stopping... this may take a few seconds -----')
        self.ssc.stop(stopSparkContext=False, stopGraceFully=True)

In [4]:
sc

In [5]:
spark

In [6]:
socketDF = spark.readStream.format("socket").option("host", "seppe.net").option("port", 7778).load()
socketDF.printSchema()

root
 |-- value: string (nullable = true)



In [7]:
from pyspark.sql.functions import from_json, schema_of_json

In [8]:
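def process_row(df, epoch_id):
    # Called once per micro-batch; df holds the raw messages in a single `value` column
    print(epoch_id)
    if df.count() == 0:
        return
    # Infer the JSON schema from the first message of this batch
    schema = schema_of_json(df.first().value)
    # Parse each JSON string and expand its fields into top-level columns
    df_cols = df.selectExpr('CAST(value AS STRING)')\
        .select(from_json('value', schema)\
        .alias('temp'))\
        .select('temp.*')
    df_cols.show()
    # Append the raw (unparsed) messages of this batch as JSON files under "data"
    df.write.format("json").mode("append").save("data")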

In [9]:
query = socketDF.writeStream.trigger(processingTime='5 seconds').foreachBatch(process_row).start()  

0
1
+--------+--------+----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------+-----+
|     aid|comments|          domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|          user|votes|
+--------+--------+----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------+-----+
|40191264|       0| theatlantic.com|    false|2024-04-28 19:47:46|The New Quarter-L...|The New Quarter-L...|The New Quarter-L...|https://www.theat...|    wallflower|    1|
|40191287|       0|tamanotchi.world|    false|2024-04-28 19:52:10|tamanotchi.world ...|tamanotchi.world ...|TamaNOTchi – Cute...|https://tamanotch...|      memalign|    1|
|40191312|       2|     autodocs.me|    false|2024-04-28 19:55:53|AutoDocs\n\nAutoD...|            AutoDocs|Show HN: Auto-upd...|https:/

10
+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+
|     aid|comments|           domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|     user|votes|
+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+
|40191483|       0|hakaimagazine.com|    false|2024-04-28 20:22:24|The Waning Reign ...|The Waning Reign ...|The Waning Reign ...|https://hakaimaga...|PaulHoule|    1|
|40191486|       0|  alexanderell.is|    false|2024-04-28 20:22:35|Harry Chapin and ...|Harry Chapin and ...|Harry Chapin and ...|https://alexander...|    otras|    1|
+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------

19
+--------+--------+----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------------+-----+
|     aid|comments|          domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|           user|votes|
+--------+--------+----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------------+-----+
|40191728|       0|alaniswright.com|    false|2024-04-28 20:49:57|8 ways I am using...|8 ways I am using...|8 ways I am using...|https://alaniswri...|PuddleOfSausage|    1|
|40191731|       0|    esolangs.org|    false|2024-04-28 20:49:59|BytePusher - Esol...|BytePusher - Esolang|          BytePusher|https://esolangs....|           hggh|    1|
+--------+--------+----------------+---------+-------------------+--------------------+--------------------+--------------------+---

28
+--------+--------+-----------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+-------+-----+
|     aid|comments|     domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|   user|votes|
+--------+--------+-----------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+-------+-----+
|40191982|       0|  slate.com|    false|2024-04-28 21:11:58|“Affordances,” a ...|Cory Doctorow on ...|"Affordances" (2019)|https://slate.com...|mooreds|    1|
|40192060|       0|youtube.com|     true|2024-04-28 21:19:38|Cars are getting ...|Cars are getting ...|Cars are getting ...|https://www.youtu...| awnird|    3|
+--------+--------+-----------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+-------+-----+

29
+--------+--------+-----------+--

37
+--------+--------+-------------+---------+-------------------+--------------------+------------+--------+--------------------+------+-----+
|     aid|comments|       domain|frontpage|          posted_at|         source_text|source_title|   title|                 url|  user|votes|
+--------+--------+-------------+---------+-------------------+--------------------+------------+--------+--------------------+------+-----+
|40192310|       0|wikipedia.org|    false|2024-04-28 21:48:12|Pfaffian - Wikipe...|    Pfaffian|Pfaffian|https://en.wikipe...|gone35|    2|
+--------+--------+-------------+---------+-------------------+--------------------+------------+--------+--------------------+------+-----+

38
+--------+--------+------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------------+-----+
|     aid|comments|      domain|frontpage|          posted_at|         source_text|        source_title|               

46
+--------+--------+---------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------+-----+
|     aid|comments|         domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|       user|votes|
+--------+--------+---------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------+-----+
|40192566|       0|        cnn.com|    false|2024-04-28 22:29:33|edition.cnn.com\n...|     edition.cnn.com|Sexsomnia: An emb...|https://www.cnn.c...|Stratoscope|    1|
|40192579|       0|        acm.org|     true|2024-04-28 22:30:55|                \n\n|                NULL|The Essence of Co...|https://dl.acm.or...| swatson741|    4|
|40192594|       0|github.com/sass|    false|2024-04-28 22:32:42|Violation of the ...|Violation of the ...|The Sass project ...|https://github.co...|   zaert

56
+--------+--------+-------+---------+-------------------+-----------+------------+--------------------+--------------------+-------+-----+
|     aid|comments| domain|frontpage|          posted_at|source_text|source_title|               title|                 url|   user|votes|
+--------+--------+-------+---------+-------------------+-----------+------------+--------------------+--------------------+-------+-----+
|40192813|       0|cdr.fyi|    false|2024-04-28 23:07:34|       NULL|        NULL|A New API for Car...|https://www.cdr.f...|cdr_fyi|    1|
+--------+--------+-------+---------+-------------------+-----------+------------+--------------------+--------------------+-------+-----+

57
+--------+--------+---------------+---------+-------------------+-----------+------------+--------------------+--------------------+---------+-----+
|     aid|comments|         domain|frontpage|          posted_at|source_text|source_title|               title|                 url|     user|votes|


67
+--------+--------+----------+---------+-------------------+-----------+------------+--------------------+--------------------+--------+-----+
|     aid|comments|    domain|frontpage|          posted_at|source_text|source_title|               title|                 url|    user|votes|
+--------+--------+----------+---------+-------------------+-----------+------------+--------------------+--------------------+--------+-----+
|40193037|       0|deramp.com|    false|2024-04-28 23:47:44|       NULL|        NULL|Recollections of ...|https://deramp.co...|mmastrac|    1|
+--------+--------+----------+---------+-------------------+-----------+------------+--------------------+--------------------+--------+-----+

68
+--------+--------+--------------------+---------+-------------------+-----------+------------+--------------------+--------------------+--------+-----+
|     aid|comments|              domain|frontpage|          posted_at|source_text|source_title|               title|         

In [10]:
query.stop()

The historical data collected from the provided stream consists of JSON dictionaries with the following 11 variables:
- aid (the article identifier)
- title
- url
- domain (i.e., the website where a given article is published)
- votes
- user
- posted_at (in the format "yyyy-mm-dd hh:mm:ss")
- comments
- source_title
- source_text
- frontpage (i.e., our target)
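
Because these fields are known in advance, the schema could also be declared once instead of being inferred from the first message of every batch. The sketch below is one possible way to do that, not the method used in the cells above. It reads every field as a string through a DDL schema string and then casts the columns whose types seem evident from the printed output. The names `HN_FIELDS`, `HN_SCHEMA_DDL` and `parse_hn_batch`, and the chosen casts, are assumptions.

```python
# Field names as they appear in the stream output. Reading them all as STRING
# is an assumption that avoids guessing the JSON value types up front.
HN_FIELDS = [
    "aid", "title", "url", "domain", "votes", "user",
    "posted_at", "comments", "source_title", "source_text", "frontpage",
]
# Backticks keep names such as `user` from being read as SQL keywords.
HN_SCHEMA_DDL = ", ".join(f"`{name}` STRING" for name in HN_FIELDS)


def parse_hn_batch(df):
    """Turn a micro-batch with one JSON string column `value` into typed columns."""
    # Imported lazily so the constants above can be used without a Spark installation
    from pyspark.sql import functions as F

    parsed = (
        df.select(F.from_json(F.col("value"), HN_SCHEMA_DDL).alias("row"))
        .select("row.*")
    )
    # Assumed casts: counts to integers, the target to a boolean, the date to a timestamp
    return (
        parsed.withColumn("votes", F.col("votes").cast("int"))
        .withColumn("comments", F.col("comments").cast("int"))
        .withColumn("frontpage", F.col("frontpage").cast("boolean"))
        .withColumn("posted_at", F.to_timestamp("posted_at", "yyyy-MM-dd HH:mm:ss"))
    )
```

A fixed schema keeps the columns identical across batches, which matters once the saved batches are read back together for training.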

## 2. Building the model to predict the target ("frontpage") based on the available features

Things to take into account:
1) Class imbalance in the target ("frontpage"): "true" values are rare, while "false" values dominate.

Possible solutions: oversampling, undersampling, etc.
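
Of the options listed above, undersampling is the simplest to sketch: keep all "true" rows and a random fraction of the "false" rows. The code below is a minimal illustration under assumptions, not a tuned recipe. The helper names `undersample_fraction` and `undersample`, the default `ratio` and the `seed` are hypothetical, and the target column is assumed to be a boolean. Spark's `sample` is probabilistic, so the resulting class sizes are only approximately balanced.

```python
def undersample_fraction(n_minority, n_majority, ratio=1.0):
    """Fraction of majority rows to keep so that majority is about ratio * minority."""
    if n_majority <= 0:
        return 1.0  # nothing to drop
    return min(1.0, ratio * n_minority / n_majority)


def undersample(df, label_col="frontpage", ratio=1.0, seed=42):
    """Randomly drop majority-class ("false") rows of a Spark DataFrame."""
    # Imported lazily so the helper above can be used without a Spark installation
    from pyspark.sql import functions as F

    minority = df.filter(F.col(label_col))    # rows with frontpage == true
    majority = df.filter(~F.col(label_col))   # rows with frontpage == false
    fraction = undersample_fraction(minority.count(), majority.count(), ratio)
    sampled = majority.sample(withReplacement=False, fraction=fraction, seed=seed)
    return sampled.unionByName(minority)
```

Resampling should be applied to the training split only, so that the evaluation data keeps the class distribution of the real stream.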

In [None]:
# Build the model using spark.ml (MLlib)
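
This cell is still empty. As a starting point only, the sketch below shows one common spark.ml setup for text classification: TF-IDF features over the concatenated text columns, followed by logistic regression. The choice of model, the feature columns, the hyperparameters and the names `TEXT_COLS`, `LABEL_COL` and `build_pipeline` are all assumptions. `votes` and `comments` are left out here only to keep the sketch short.

```python
TEXT_COLS = ["title", "source_title", "source_text"]  # assumed text features
LABEL_COL = "frontpage"                               # target column


def build_pipeline(num_features=1 << 16, max_iter=20):
    """TF-IDF over the concatenated text columns, followed by logistic regression."""
    # Imported lazily so the constants above can be used without a Spark installation
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import (
        IDF, HashingTF, RegexTokenizer, SQLTransformer, StopWordsRemover,
    )

    # concat_ws skips NULL values; the target is turned into a 0.0/1.0 label
    prepare = SQLTransformer(statement=(
        f"SELECT *, concat_ws(' ', {', '.join(TEXT_COLS)}) AS text, "
        f"CAST(CAST({LABEL_COL} AS BOOLEAN) AS DOUBLE) AS label FROM __THIS__"
    ))
    tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
    remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
    tf = HashingTF(inputCol="filtered", outputCol="tf", numFeatures=num_features)
    idf = IDF(inputCol="tf", outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=max_iter)
    return Pipeline(stages=[prepare, tokenizer, remover, tf, idf, lr])


# Possible usage (parsed_df and the save path are hypothetical):
# train, test = parsed_df.randomSplit([0.8, 0.2], seed=42)
# model = build_pipeline().fit(train)
# model.write().overwrite().save("models/hn_frontpage")
```

Keeping the preprocessing inside the pipeline means the fitted `PipelineModel` can be applied unchanged to new messages.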

## 3. Using the trained model to make predictions as the stream comes in

The goal is to show that we can connect to the data source, preprocess and featurize incoming messages, have our model predict the label, and display the predictions, similar to spark_streaming_example_predicting.ipynb.
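
The steps listed here map onto a `foreachBatch` callback. The sketch below is a hedged outline, not the contents of the referenced example notebook. `predict_batch`, the `parse_fn` argument (any function that turns the raw `value` column into the columns the model expects) and the model path in the usage comment are hypothetical names.

```python
from functools import partial


def predict_batch(df, epoch_id, model, parse_fn,
                  cols=("aid", "title", "frontpage", "prediction")):
    """foreachBatch callback: parse raw messages, score them, print the predictions."""
    if df.count() == 0:
        return None                      # nothing arrived in this micro-batch
    features = parse_fn(df)              # raw JSON strings -> feature columns
    scored = model.transform(features)   # adds the model's `prediction` column
    result = scored.select(*cols)
    result.show(truncate=False)
    return result


# Possible usage (model path and parse_messages are hypothetical):
# from pyspark.ml import PipelineModel
# model = PipelineModel.load("models/hn_frontpage")
# callback = partial(predict_batch, model=model, parse_fn=parse_messages)
# query = (socketDF.writeStream.trigger(processingTime="5 seconds")
#          .foreachBatch(callback).start())
```

Showing the true `frontpage` value next to `prediction` makes it easy to see how the model behaves on live data.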

## Conclusion