## Workshop Goals

### - Get to know the Apache Spark engine.

### - Understand the capabilities of the spaCy NLP library.

### Apache Spark is a fast and general engine for large-scale data processing
![Spark Libs](img/spark-libs.png)

### It can access diverse data sources including HDFS, Cassandra, Hive, HBase, S3 and JDBC/ODBC
![Spark Compatibilities](img/spark-cmp.png)

![Hadoop data sharing](img/data-sharing-mapreduce.png)
![Spark data sharing](img/data-sharing-spark.png)

In [9]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as fun, types

import spacy

import pandas as pd
pd.set_option('max_colwidth', 80)

### Spark session init

In [10]:
# The builder creates (or reuses) the underlying SparkContext itself
spark = SparkSession.builder \
    .appName('NLP') \
    .getOrCreate()

### Load dataset

News Category Dataset:
https://www.kaggle.com/rmisra/news-category-dataset

Each JSON record contains the following attributes:

* category: Category the article belongs to

* headline: Headline of the article

* authors: Person who authored the article

* link: Link to the post

* short_description: Short description of the article

* date: Date the article was published
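
The file stores one JSON object per line, which is the layout `spark.read.json` expects by default. As a minimal sketch, this is roughly what a single record looks like when parsed with Python's standard `json` module. The field values are made up for illustration and are not taken from the dataset:

```python
import json

# One line of a JSON Lines file; all values are illustrative placeholders
line = (
    '{"category": "TECH", "headline": "Example headline", '
    '"authors": "Jane Doe", "link": "https://example.com/post", '
    '"short_description": "Example description.", "date": "2018-05-26"}'
)

record = json.loads(line)

# Check that every attribute listed above is present
expected_fields = {"category", "headline", "authors", "link", "short_description", "date"}
missing = expected_fields - record.keys()

print(sorted(record))
print("missing fields:", missing or "none")
```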

In [12]:
news_df = spark.read.json("News_Category_Dataset_v2.json")
news_df.show()

+--------------------+-------------+----------+--------------------+--------------------+--------------------+
|             authors|     category|      date|            headline|                link|   short_description|
+--------------------+-------------+----------+--------------------+--------------------+--------------------+
|     Melissa Jeltsen|        CRIME|2018-05-26|There Were 2 Mass...|https://www.huffi...|She left her husb...|
|       Andy McDonald|ENTERTAINMENT|2018-05-26|Will Smith Joins ...|https://www.huffi...|Of course it has ...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Hugh Grant Marrie...|https://www.huffi...|The actor and his...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Jim Carrey Blasts...|https://www.huffi...|The actor gives D...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Julianna Margulie...|https://www.huffi...|The "Dietland" ac...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Morgan Freeman 'D...|https://www.huffi...|"It is not right ...|
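+--------------------+-------------+----------+--------------------+--------------------+--------------------+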

### Examples

In [13]:
news_df.createOrReplaceTempView("news")

In [14]:
spark.sql("SELECT COUNT(*) AS count FROM news").show()

+------+
| count|
+------+
|200853|
+------+



In [15]:
news_df.count()

200853

In [16]:
spark.sql("SELECT category, count(category) AS count FROM news GROUP BY category ORDER BY count DESC").show()

+--------------+-----+
|      category|count|
+--------------+-----+
|      POLITICS|32739|
|      WELLNESS|17827|
| ENTERTAINMENT|16058|
|        TRAVEL| 9887|
|STYLE & BEAUTY| 9649|
|     PARENTING| 8677|
|HEALTHY LIVING| 6694|
|  QUEER VOICES| 6314|
|  FOOD & DRINK| 6226|
|      BUSINESS| 5937|
|        COMEDY| 5175|
|        SPORTS| 4884|
|  BLACK VOICES| 4528|
| HOME & LIVING| 4195|
|       PARENTS| 3955|
| THE WORLDPOST| 3664|
|      WEDDINGS| 3651|
|         WOMEN| 3490|
|        IMPACT| 3459|
|       DIVORCE| 3426|
+--------------+-----+
only showing top 20 rows



In [17]:
news_df.groupby('category') \
    .count() \
    .orderBy(fun.desc('count')) \
    .show()

+--------------+-----+
|      category|count|
+--------------+-----+
|      POLITICS|32739|
|      WELLNESS|17827|
| ENTERTAINMENT|16058|
|        TRAVEL| 9887|
|STYLE & BEAUTY| 9649|
|     PARENTING| 8677|
|HEALTHY LIVING| 6694|
|  QUEER VOICES| 6314|
|  FOOD & DRINK| 6226|
|      BUSINESS| 5937|
|        COMEDY| 5175|
|        SPORTS| 4884|
|  BLACK VOICES| 4528|
| HOME & LIVING| 4195|
|       PARENTS| 3955|
| THE WORLDPOST| 3664|
|      WEDDINGS| 3651|
|         WOMEN| 3490|
|        IMPACT| 3459|
|       DIVORCE| 3426|
+--------------+-----+
only showing top 20 rows



### Task 1. Select the longest headline

Hint: Use the `length` function and the `LIMIT` clause in SQL.

Available functions: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#module-pyspark.sql.functions

In [27]:
spark.sql("SELECT headline, LENGTH(news.headline) AS length FROM news ORDER BY length DESC LIMIT 1").show()

+--------------------+------+
|            headline|length|
+--------------------+------+
|Wendy Williams An...|   320|
+--------------------+------+



## spaCy NLP library
![Spacy Features](img/spacy-features.png)

## spaCy pipeline
![Spacy Pipeline](img/spacy-pipeline.png)

### Examples

In [28]:
# import spacy 

nlp = spacy.load("en_core_web_sm")
doc = nlp("In 2018 the Debian Linux project received a donation of $300,000")

for token in doc:
    print(token.text)

In
2018
the
Debian
Linux
project
received
a
donation
of
$
300,000


In [29]:
# noun_chunks yields multi-token spans, not single tokens
for chunk in doc.noun_chunks:
    print(chunk.text)

the Debian Linux project
a donation


In [30]:
for token in doc:
    if token.like_num:
        print(token.text)

2018
300,000


### Task 2. Extract named entities from the string

Hint: Use the `ents` attribute of the `Doc` and the `label_` attribute of each entity (`Span`) it contains.

Spacy Cheat Sheet: http://datacamp-community-prod.s3.amazonaws.com/29aa28bf-570a-4965-8f54-d6a541ae4e06

In [31]:
for ent in doc.ents:
    print(f"{ent.text} {ent.label_}")

2018 DATE
Debian NORP
Linux ORG
300,000 MONEY


## Let's combine the power of these two tools

### Task 3. Extract ORG, PERSON, GPE named entities in Spark

```
# Write a function that takes a news headline and generates output like this

[
  {
    'label': 'ORG',
    'text': 'ACME Inc.'
  },
  {
    'label': 'PERSON',
    'text': 'John Doe'
  },
  {
    'label': 'GPE',
    'text': 'London'
  }
  ...
]
```

In [32]:
news_df_sample = news_df.sample(withReplacement=False, fraction=0.002, seed=777)
news_df_sample.createOrReplaceTempView("news_sample")

In [33]:
class SpacyWrapper(object):
    """Wrapper class to load Spacy on worker nodes"""
    _spacys = {}
    disabled_pipeline_steps = ['parser', 'tagger']
    default_model = 'en_core_web_sm'

    @classmethod
    def get(cls, model=default_model, disable=disabled_pipeline_steps):
        if model not in cls._spacys:
            import spacy
            cls._spacys[model] = spacy.load(model, disable=disable)
        return cls._spacys[model]
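
The wrapper stores loaded models in a class-level dictionary, so a model is loaded at most once per Python process, on first use, rather than on every call. The sketch below shows the same caching idea with a stand-in loader so it runs without spaCy. `LazyModelCache`, `_load`, and `load_calls` are hypothetical names introduced only for this illustration:

```python
class LazyModelCache:
    """Same caching idea as SpacyWrapper, with a stand-in loader."""
    _models = {}
    load_calls = 0  # counts how often the expensive load runs

    @classmethod
    def _load(cls, name):
        # Hypothetical stand-in for spacy.load(name)
        cls.load_calls += 1
        return f"<model {name}>"

    @classmethod
    def get(cls, name="en_core_web_sm"):
        # Load only on the first request for this model name
        if name not in cls._models:
            cls._models[name] = cls._load(name)
        return cls._models[name]


first = LazyModelCache.get()
second = LazyModelCache.get()
print(first is second, LazyModelCache.load_calls)
```

Both calls return the same cached object, and the loader runs only once.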

### Named entity extraction function

Hint: Reuse the code from `Task 2`.

In [38]:
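def ner(doc):
    """Return ORG, PERSON and GPE entities found in `doc` as a list of dicts."""
    labels = ['ORG', 'PERSON', 'GPE']
    entities = []

    # Null or empty input has no entities
    if not doc:
        return entities

    # Load Spacy (cached per Python process by SpacyWrapper)
    nlp = SpacyWrapper.get()
    doc = nlp(doc)

    # ======== WRITE YOUR SOLUTION BELOW ========
    for ent in doc.ents:
        if ent.label_ in labels:
            entities.append({
                'label': ent.label_,
                'text': ent.text
            })

    return entities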

In [39]:
# Schema definition
schema = types.ArrayType(
    types.StructType([
        types.StructField('label', types.StringType(), nullable=False),
        types.StructField('text', types.StringType(), nullable=False)
    ])
)

# Register user defined function (UDF) to use in SQL
spark.udf.register('ner', ner, schema)

<function __main__.ner(doc)>

### Apply UDF to extract entities from short descriptions

In [40]:
ent_sample = spark.sql("SELECT short_description, ner(short_description) AS entities FROM news_sample")

In [41]:
ent_sample.toPandas()

| | short_description | entities |
|---|---|---|
| 0 | The outside groups can spend big to alter election outcomes, but don't have ... | [] |
| 1 | The Islamic State claimed responsibility. | [] |
| 2 | The letter to Congress contradicts the White House's account of the spousal ... | [(ORG, Congress), (ORG, the White House's)] |
| 3 | Like self-deportation, but for basic nutritional needs! | [] |
| 4 | Donations skyrocketed as the gun group issued frothy warnings of the "freedo... | [] |
| 5 | She plays "one of the universe’s most powerful heroes." | [] |
| 6 | The guy makes it too easy sometimes. | [] |
| 7 | Republican lawmakers have insisted that Trump let the special counsel to do ... | [] |
| 8 | How to use bitcoins, Sam Nunberg jokes and more. | [(PERSON, Sam Nunberg)] |
| 9 | Such a funny thing for us to try to explain at the Winter Olympics. | [] |


## Save output to JSON

In [42]:
spark.sql("SELECT short_description, ner(short_description) AS entities FROM news_sample") \
    .repartition(1) \
    .write \
    .json("output")
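
Spark writes `output` as a directory rather than a single file. With `repartition(1)` it holds one `part-*.json` file containing one JSON object per line. Below is a minimal sketch of reading such output back with the standard library. A temporary directory and a hand-written part file stand in for the real Spark output, and the part file name is made up. The record mirrors row 8 of the sample above:

```python
import glob
import json
import os
import tempfile

# Stand-in for the "output" directory Spark would create, so the sketch
# runs without Spark; the part file name is a made-up example.
output_dir = tempfile.mkdtemp()
sample = {
    "short_description": "How to use bitcoins, Sam Nunberg jokes and more.",
    "entities": [{"label": "PERSON", "text": "Sam Nunberg"}],
}
with open(os.path.join(output_dir, "part-00000-example.json"), "w", encoding="utf-8") as fh:
    fh.write(json.dumps(sample) + "\n")

# Read every part file back, one JSON object per line
records = []
for path in sorted(glob.glob(os.path.join(output_dir, "part-*.json"))):
    with open(path, encoding="utf-8") as fh:
        records.extend(json.loads(line) for line in fh if line.strip())

print(len(records), records[0]["entities"])
```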

## Make news classifier

In [99]:
import random

from spacy.util import minibatch, compounding


# Get dataset
data = spark.sql("SELECT short_description, category FROM news_sample")

dataset = data.toPandas()

# Distinct categories present in the sample
categories = dataset.category.unique()

# Create blank model
clear_nlp = spacy.blank('en')


In [105]:
# create classifier pipe
if "textcat" not in clear_nlp.pipe_names:
    textcat = clear_nlp.create_pipe(
        "textcat",
        config={
            "exclusive_classes": True,
            "architecture": "simple_cnn",
        }
    )
    clear_nlp.add_pipe(textcat, last=True)
else:
    textcat = clear_nlp.get_pipe("textcat")

for c in categories:
    textcat.add_label(c)

other_pipes = [pipe for pipe in clear_nlp.pipe_names if pipe != "textcat"]

In [106]:
# One {label: bool} dict per article: True only for the article's own category
train_cats = [
    {c: (cat == c) for c in categories}
    for cat in dataset.category
]

train_texts = list(dataset.short_description)
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))

{'POLITICS': True, 'WORLD NEWS': False, 'ENTERTAINMENT': False, 'WOMEN': False, 'BLACK VOICES': False, 'GREEN': False, 'SPORTS': False, 'WEIRD NEWS': False, 'QUEER VOICES': False, 'TASTE': False, 'COMEDY': False, 'EDUCATION': False, 'STYLE': False, 'PARENTS': False, 'SCIENCE': False, 'HEALTHY LIVING': False, 'TRAVEL': False, 'THE WORLDPOST': False, 'MEDIA': False, 'BUSINESS': False, 'RELIGION': False, 'LATINO VOICES': False, 'TECH': False, 'GOOD NEWS': False, 'CRIME': False, 'ARTS': False, 'ARTS & CULTURE': False, 'IMPACT': False, 'FIFTY': False, 'WORLDPOST': False, 'COLLEGE': False, 'WELLNESS': False, 'PARENTING': False, 'WEDDINGS': False, 'HOME & LIVING': False, 'FOOD & DRINK': False, 'DIVORCE': False, 'ENVIRONMENT': False, 'STYLE & BEAUTY': False, 'CULTURE & ARTS': False, 'MONEY': False}
{'POLITICS': False, 'WORLD NEWS': True, 'ENTERTAINMENT': False, 'WOMEN': False, 'BLACK VOICES': False, 'GREEN': False, 'SPORTS': False, 'WEIRD NEWS': False, 'QUEER VOICES': False, 'TASTE': False, 'C
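
Each training example pairs a text with a `{"cats": {...}}` annotation in which exactly one category is `True`. The sketch below builds the same structure for a made-up miniature dataset so the shape is easy to inspect. The texts and the three-category list are illustrative only:

```python
# Made-up miniature dataset standing in for `dataset`
categories = ["POLITICS", "SPORTS", "TRAVEL"]
rows = [
    ("Senate passes the bill.", "POLITICS"),
    ("Home team wins the final.", "SPORTS"),
]

# Pair each text with a one-hot {"cats": {...}} annotation
train_data = [
    (text, {"cats": {c: (c == label) for c in categories}})
    for text, label in rows
]

for text, annotation in train_data:
    print(text, annotation)
```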

In [111]:
n_iter = 50
with clear_nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = clear_nlp.begin_training()
    batch_sizes = compounding(4.0, 32.0, 1.001)
    for i in range(n_iter):
        losses = {}
        # batch up the examples using spaCy's minibatch
        random.shuffle(train_data)
        batches = minibatch(train_data, size=batch_sizes)
        for batch in batches:
            texts, annotations = zip(*batch)
            clear_nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
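
The loop draws its batch sizes from `compounding(4.0, 32.0, 1.001)`, which yields a slowly growing series capped at the stop value. `minibatch` consumes one size per batch. The sketch below is a simplified re-implementation for illustration, not spaCy's own code. The `_sketch` names are hypothetical, and a growth factor of 2.0 replaces 1.001 so the effect is visible on a short list:

```python
import itertools


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    """Split items into consecutive batches, drawing each batch size from `sizes`."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch


# Growth factor 2.0 instead of 1.001, purely to make the growth visible
sizes = compounding_sketch(4.0, 32.0, 2.0)
batches = list(minibatch_sketch(range(100), sizes))
print([len(b) for b in batches])
```

Because `batch_sizes` is created once, outside the epoch loop, the series keeps growing across epochs instead of restarting at 4 each time.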
        

In [110]:
test_text = 'Woman drown in the lake.'
new_doc = clear_nlp(test_text)

print(test_text)
print(new_doc.cats)
print(max(new_doc.cats, key=new_doc.cats.get))

Woman drown in the lake.
{'POLITICS': 0.037534356117248535, 'WORLD NEWS': 4.539787187241018e-05, 'ENTERTAINMENT': 0.1006203442811966, 'WOMEN': 0.00011079548130510375, 'BLACK VOICES': 0.0007788949878886342, 'GREEN': 0.00018963620823342353, 'SPORTS': 0.0007176463259384036, 'WEIRD NEWS': 4.539787187241018e-05, 'QUEER VOICES': 0.0006054925615899265, 'TASTE': 0.00022617630020249635, 'COMEDY': 4.539787187241018e-05, 'EDUCATION': 4.539787187241018e-05, 'STYLE': 4.539787187241018e-05, 'PARENTS': 5.187013448448852e-05, 'SCIENCE': 4.539787187241018e-05, 'HEALTHY LIVING': 0.0022958347108215094, 'TRAVEL': 9.292830509366468e-05, 'THE WORLDPOST': 4.539787187241018e-05, 'MEDIA': 4.539787187241018e-05, 'BUSINESS': 4.539787187241018e-05, 'RELIGION': 0.0002157364069717005, 'LATINO VOICES': 0.0002759491908363998, 'TECH': 4.539787187241018e-05, 'GOOD NEWS': 4.539787187241018e-05, 'CRIME': 5.741846689488739e-05, 'ARTS': 4.539787187241018e-05, 'ARTS & CULTURE': 4.539787187241018e-05, 'IMPACT': 4.53978718724
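
`max(new_doc.cats, key=new_doc.cats.get)` returns only the single best label. To inspect several candidates at once, the scores can be ranked with the standard library. This is a minimal sketch that uses a few scores copied (rounded) from the output above. The real dictionary has one entry per category, so this subset says nothing about the overall winner:

```python
import heapq

# Subset of scores from the output above, rounded for readability
cats = {
    "POLITICS": 0.0375,
    "ENTERTAINMENT": 0.1006,
    "HEALTHY LIVING": 0.0023,
    "CRIME": 0.0000574,
}

# Three highest-scoring labels within this subset
top3 = heapq.nlargest(3, cats.items(), key=lambda item: item[1])
for label, score in top3:
    print(f"{label:15s} {score:.4f}")
```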