## Workshop Goals

### - Get to know the Apache Spark engine.

### - Understand the capabilities of the spaCy NLP library.

### Apache Spark is a fast and general engine for large-scale data processing
![Spark Libs](img/spark-libs.png)

### It can access diverse data sources including HDFS, Cassandra, Hive, HBase, S3 and JDBC/ODBC
![Spark Compatibilities](img/spark-cmp.png)

![Hadoop data sharing](img/data-sharing-mapreduce.png)
![Spark data sharing](img/data-sharing-spark.png)

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as fun, types

import spacy

import pandas as pd
pd.set_option('max_colwidth', 80)

### Spark session init

In [2]:
# Use the class-level builder; it creates (or reuses) the underlying SparkContext
spark = SparkSession.builder \
    .appName('NLP') \
    .getOrCreate()

### Load dataset

News Category Dataset:
https://www.kaggle.com/rmisra/news-category-dataset

Each JSON record contains the following attributes:

* category: Category the article belongs to

* headline: Headline of the article

* authors: Person who authored the article

* link: Link to the post

* short_description: Short description of the article

* date: Date the article was published
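
By default `spark.read.json` expects JSON Lines input: one complete JSON object per line rather than a single top-level array. A minimal sketch of that layout using only the standard library; the field names follow the list above, while the record values are invented placeholders, not rows from the dataset.

```python
import json

FIELDS = ["category", "headline", "authors", "link", "short_description", "date"]

# Hypothetical records: the values are made up for illustration only.
hypothetical_records = [
    {"category": "TECH", "headline": "Example headline one", "authors": "Jane Doe",
     "link": "https://example.com/1", "short_description": "First example.",
     "date": "2018-01-01"},
    {"category": "SPORTS", "headline": "Example headline two", "authors": "John Doe",
     "link": "https://example.com/2", "short_description": "Second example.",
     "date": "2018-01-02"},
]

# JSON Lines: each record is serialized onto its own line.
json_lines = "\n".join(json.dumps(record) for record in hypothetical_records)

# Reading it back is a line-by-line json.loads.
parsed = [json.loads(line) for line in json_lines.splitlines() if line.strip()]
for record in parsed:
    print(record["category"], "|", record["headline"])
```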

In [3]:
news_df = spark.read.json("News_Category_Dataset_v2.json")
news_df.show()

+--------------------+-------------+----------+--------------------+--------------------+--------------------+
|             authors|     category|      date|            headline|                link|   short_description|
+--------------------+-------------+----------+--------------------+--------------------+--------------------+
|     Melissa Jeltsen|        CRIME|2018-05-26|There Were 2 Mass...|https://www.huffi...|She left her husb...|
|       Andy McDonald|ENTERTAINMENT|2018-05-26|Will Smith Joins ...|https://www.huffi...|Of course it has ...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Hugh Grant Marrie...|https://www.huffi...|The actor and his...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Jim Carrey Blasts...|https://www.huffi...|The actor gives D...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Julianna Margulie...|https://www.huffi...|The "Dietland" ac...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Morgan Freeman 'D...|https://www.huffi...|"It is not right ...|

### Examples

In [4]:
news_df.createOrReplaceTempView("news")

In [5]:
spark.sql("SELECT COUNT(*) AS count FROM news").show()

+------+
| count|
+------+
|200853|
+------+



In [6]:
news_df.count()

200853

In [7]:
spark.sql("SELECT category, count(category) AS count FROM news GROUP BY category ORDER BY count DESC").show()

+--------------+-----+
|      category|count|
+--------------+-----+
|      POLITICS|32739|
|      WELLNESS|17827|
| ENTERTAINMENT|16058|
|        TRAVEL| 9887|
|STYLE & BEAUTY| 9649|
|     PARENTING| 8677|
|HEALTHY LIVING| 6694|
|  QUEER VOICES| 6314|
|  FOOD & DRINK| 6226|
|      BUSINESS| 5937|
|        COMEDY| 5175|
|        SPORTS| 4884|
|  BLACK VOICES| 4528|
| HOME & LIVING| 4195|
|       PARENTS| 3955|
| THE WORLDPOST| 3664|
|      WEDDINGS| 3651|
|         WOMEN| 3490|
|        IMPACT| 3459|
|       DIVORCE| 3426|
+--------------+-----+
only showing top 20 rows



In [8]:
news_df.groupby('category') \
    .count() \
    .orderBy(fun.desc('count')) \
    .show()

+--------------+-----+
|      category|count|
+--------------+-----+
|      POLITICS|32739|
|      WELLNESS|17827|
| ENTERTAINMENT|16058|
|        TRAVEL| 9887|
|STYLE & BEAUTY| 9649|
|     PARENTING| 8677|
|HEALTHY LIVING| 6694|
|  QUEER VOICES| 6314|
|  FOOD & DRINK| 6226|
|      BUSINESS| 5937|
|        COMEDY| 5175|
|        SPORTS| 4884|
|  BLACK VOICES| 4528|
| HOME & LIVING| 4195|
|       PARENTS| 3955|
| THE WORLDPOST| 3664|
|      WEDDINGS| 3651|
|         WOMEN| 3490|
|        IMPACT| 3459|
|       DIVORCE| 3426|
+--------------+-----+
only showing top 20 rows



### Task 1. Select the longest headline

Hint: Use the `length` function and a `LIMIT` clause in SQL

Available functions: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#module-pyspark.sql.functions

In [9]:
spark.sql("SELECT headline, length(headline) AS length FROM news ORDER BY length DESC LIMIT 1").show()

+--------------------+------+
|            headline|length|
+--------------------+------+
|Wendy Williams An...|   320|
+--------------------+------+



## spaCy NLP library
![spaCy Features](img/spacy-features.png)

## spaCy pipeline
![spaCy Pipeline](img/spacy-pipeline.png)

### Examples

In [10]:
# import spacy 

nlp = spacy.load("en_core_web_sm")
doc = nlp("In 2018 the Debian Linux project received a donation of $300,000")

for token in doc:
    print(token.text)

In
2018
the
Debian
Linux
project
received
a
donation
of
$
300,000


In [11]:
# noun_chunks yields Span objects (multi-token phrases), not single tokens
for chunk in doc.noun_chunks:
    print(chunk.text)

the Debian Linux project
a donation


In [12]:
for token in doc:
    if token.like_num:
        print(token.text)

2018
300,000


### Task 2. Extract named entities from the string

Hint: Use the `ents` attribute of the `Doc` and the `label_` attribute of each entity `Span`

Spacy Cheat Sheet: http://datacamp-community-prod.s3.amazonaws.com/29aa28bf-570a-4965-8f54-d6a541ae4e06

In [13]:
for ent in doc.ents:
    print(ent.text, ent.label_)

2018 DATE
Debian NORP
Linux ORG
300,000 MONEY


## Let's combine the power of these two tools

### Task 3. Extract ORG, PERSON, GPE named entities in Spark

```
# Write a function that takes a news headline and generates output like this

[
  {
    'label': 'ORG', 
    'text': 'ACME Inc.'
  },
  {
    'label': 'PERSON', 
    'text': 'John Doe'   
  },
  {
    'label': 'GPE',
    'text': 'London'
  }
  ...
]
```

In [14]:
news_df_sample = news_df.sample(withReplacement=False, fraction=0.002, seed=777)
news_df_sample.createOrReplaceTempView("news_sample")

In [15]:
class SpacyWrapper(object):
    """Wrapper class to load Spacy on worker nodes"""
    _spacys = {}
    disabled_pipeline_steps = ['parser', 'tagger']
    default_model = 'en_core_web_sm'

    @classmethod
    def get(cls, model=default_model, disable=disabled_pipeline_steps):
        if model not in cls._spacys:
            import spacy
            cls._spacys[model] = spacy.load(model, disable=disable)
        return cls._spacys[model]
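
The wrapper defers `spacy.load` until the first call and keeps the result in a class-level dictionary, so each Python process that runs the UDF loads the model once instead of once per row. The same pattern can be sketched without spaCy; `LazyModelCache` and `fake_loader` are hypothetical names used only for this illustration.

```python
class LazyModelCache:
    """Load each named resource at most once per Python process."""
    _cache = {}
    load_count = 0  # demo-only counter to show how often the loader runs

    @classmethod
    def get(cls, name, loader):
        # Call the (expensive) loader only on the first request for this name.
        if name not in cls._cache:
            cls._cache[name] = loader(name)
            cls.load_count += 1
        return cls._cache[name]


def fake_loader(name):
    # Hypothetical stand-in for an expensive call such as spacy.load(name).
    return {"model": name}


first = LazyModelCache.get("en_core_web_sm", fake_loader)
second = LazyModelCache.get("en_core_web_sm", fake_loader)
print(first is second, LazyModelCache.load_count)  # True 1
```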

### Named entity extraction function

Hint: Reuse the code from `Task 2`.

In [16]:
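def ner(doc):
    labels = ['ORG', 'PERSON', 'GPE']
    entities = []

    # Null or empty input has no entities to extract
    if not doc:
        return entities

    # Load Spacy (cached by SpacyWrapper after the first call)
    nlp = SpacyWrapper.get()
    doc = nlp(doc)
    for ent in doc.ents:
        if ent.label_ in labels:
            entities.append({'label': ent.label_, 'text': ent.text})

    return entities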

In [17]:
# Schema definition
schema = types.ArrayType(
    types.StructType([
        types.StructField('label', types.StringType(), nullable=False),
        types.StructField('text', types.StringType(), nullable=False)
    ])
)

# Register user defined function (UDF) to use in SQL
spark.udf.register('ner', ner, schema)

<function __main__.ner(doc)>

### Apply the UDF to extract entities from short descriptions

In [18]:
ent_sample = spark.sql("SELECT short_description, ner(short_description) AS entities FROM news_sample")

In [19]:
ent_sample.toPandas()

| | short_description | entities |
|---|---|---|
| 0 | The outside groups can spend big to alter election outcomes, but don't have ... | [] |
| 1 | The Islamic State claimed responsibility. | [] |
| 2 | The letter to Congress contradicts the White House's account of the spousal ... | [(ORG, Congress), (ORG, the White House's)] |
| 3 | Like self-deportation, but for basic nutritional needs! | [] |
| 4 | Donations skyrocketed as the gun group issued frothy warnings of the "freedo... | [] |
| 5 | She plays "one of the universe’s most powerful heroes." | [] |
| 6 | The guy makes it too easy sometimes. | [] |
| 7 | Republican lawmakers have insisted that Trump let the special counsel to do ... | [] |
| 8 | How to use bitcoins, Sam Nunberg jokes and more. | [(PERSON, Sam Nunberg)] |
| 9 | Such a funny thing for us to try to explain at the Winter Olympics. | [] |


## Save output to JSON

In [23]:
spark.sql("SELECT short_description, ner(short_description) AS entities FROM news_sample") \
    .repartition(1) \
    .write \
    .mode("overwrite") \
    .json("output")
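
Spark writes `output` as a directory, not a single file: it contains one or more `part-*.json` files in JSON Lines form plus a `_SUCCESS` marker. A minimal sketch for reading that directory back without Spark; `read_json_lines_dir` is a hypothetical helper name, not part of any library.

```python
import glob
import json
import os


def read_json_lines_dir(path):
    """Collect every JSON Lines record from the part files in a Spark output directory."""
    rows = []
    # Only the part files hold data; _SUCCESS and hidden checksum files are skipped.
    for part in sorted(glob.glob(os.path.join(path, "part-*.json"))):
        with open(part, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    rows.append(json.loads(line))
    return rows


# Runs only if the directory written by the cell above exists.
if os.path.isdir("output"):
    for row in read_json_lines_dir("output")[:3]:
        print(row.get("short_description"), row.get("entities"))
```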

## Text Classifier

In [24]:
nlp = spacy.blank("en")

In [25]:
if "textcat" not in nlp.pipe_names:
    textcat = nlp.create_pipe(
        "textcat",
        config={
            "exclusive_classes": True,
            "architecture": "simple_cnn",
        }
    )
    nlp.add_pipe(textcat, last=True)
else:
    textcat = nlp.get_pipe("textcat")

In [26]:
if len(textcat.labels) == 0:
    categories = spark.sql("SELECT category AS name FROM news GROUP BY category")
    for category in categories.collect():
        textcat.add_label(category['name'])
print(textcat.labels)

['SPORTS', 'MEDIA', 'BLACK VOICES', 'POLITICS', 'ARTS', 'THE WORLDPOST', 'QUEER VOICES', 'CULTURE & ARTS', 'PARENTING', 'GREEN', 'ENTERTAINMENT', 'ENVIRONMENT', 'TECH', 'BUSINESS', 'LATINO VOICES', 'COMEDY', 'STYLE & BEAUTY', 'MONEY', 'IMPACT', 'RELIGION', 'TRAVEL', 'EDUCATION', 'HEALTHY LIVING', 'CRIME', 'WEDDINGS', 'TASTE', 'HOME & LIVING', 'FIFTY', 'WEIRD NEWS', 'WORLDPOST', 'GOOD NEWS', 'DIVORCE', 'FOOD & DRINK', 'WOMEN', 'COLLEGE', 'ARTS & CULTURE', 'STYLE', 'PARENTS', 'WORLD NEWS', 'SCIENCE', 'WELLNESS']


In [27]:
# Draw a small random sample and bring it to the driver as a pandas DataFrame
df = spark.sql("SELECT short_description, category FROM news") \
          .sample(withReplacement=False, fraction=0.001, seed=777) \
          .toPandas()
texts = df['short_description']
labels = df['category']

In [28]:
cats = []
for y in labels:
    item = {}
    for l in textcat.labels:
        item[l] = bool(l == y)
    cats.append(item)
print(cats[0])

{'SPORTS': False, 'MEDIA': False, 'BLACK VOICES': False, 'POLITICS': False, 'ARTS': False, 'THE WORLDPOST': False, 'QUEER VOICES': False, 'CULTURE & ARTS': False, 'PARENTING': False, 'GREEN': False, 'ENTERTAINMENT': False, 'ENVIRONMENT': False, 'TECH': False, 'BUSINESS': False, 'LATINO VOICES': False, 'COMEDY': False, 'STYLE & BEAUTY': False, 'MONEY': False, 'IMPACT': False, 'RELIGION': False, 'TRAVEL': False, 'EDUCATION': False, 'HEALTHY LIVING': False, 'CRIME': False, 'WEDDINGS': False, 'TASTE': False, 'HOME & LIVING': False, 'FIFTY': False, 'WEIRD NEWS': False, 'WORLDPOST': False, 'GOOD NEWS': False, 'DIVORCE': False, 'FOOD & DRINK': False, 'WOMEN': False, 'COLLEGE': False, 'ARTS & CULTURE': False, 'STYLE': False, 'PARENTS': False, 'WORLD NEWS': True, 'SCIENCE': False, 'WELLNESS': False}
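
Each item in `cats` is a one-hot dictionary: exactly one label maps to `True`. The same structure can be built with a dict comprehension; `one_hot_cats` is a hypothetical helper, and the three labels below are just a subset chosen for illustration.

```python
def one_hot_cats(label, all_labels):
    """Map every known label to True only if it equals the given label."""
    return {name: name == label for name in all_labels}


example_labels = ["POLITICS", "SPORTS", "TECH"]  # illustrative subset of the categories
print(one_hot_cats("SPORTS", example_labels))
# {'POLITICS': False, 'SPORTS': True, 'TECH': False}
```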


In [29]:
train_texts = texts
train_cats = cats
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))
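
Here every sampled row goes into `train_data`, so nothing is held back to check the classifier afterwards. One option, sketched below under the assumption that a simple random hold-out is good enough, is to split the pairs before training; `split_train_eval` and its parameters are hypothetical names.

```python
import random


def split_train_eval(data, eval_fraction=0.2, seed=777):
    """Shuffle a copy of the data and return (train, evaluation) lists."""
    items = list(data)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_eval = int(len(items) * eval_fraction)
    return items[n_eval:], items[:n_eval]


# Stand-in (text, annotation) pairs; real pairs would come from train_data.
pairs = [(f"text {i}", {"cats": {"A": i % 2 == 0}}) for i in range(10)]
train_part, eval_part = split_train_eval(pairs)
print(len(train_part), len(eval_part))  # 8 2
```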

In [31]:
import random
from spacy.util import minibatch, compounding

n_iter = 10
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()
    print("Training the model...")
    batch_sizes = compounding(4.0, 32.0, 1.001)
    for i in range(n_iter):
        print(f"Iteration: {i + 1}/{n_iter}")
        losses = {}
        random.shuffle(train_data)
        batches = minibatch(train_data, size=batch_sizes)
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)

Training the model...
Iteration: 1/10
Iteration: 2/10
Iteration: 3/10
Iteration: 4/10
Iteration: 5/10
Iteration: 6/10
Iteration: 7/10
Iteration: 8/10
Iteration: 9/10
Iteration: 10/10
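
`compounding(4.0, 32.0, 1.001)` produces an endless series of batch sizes that starts at 4.0, is multiplied by 1.001 after every batch, and is capped at 32.0. At that rate it takes roughly 2,000 batches to reach the cap, so on a small sample the batches stay close to 4. The generator below is a simplified re-implementation for the increasing case only, not spaCy's own code; the shorter series in the usage line just makes the capping visible.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop (start <= stop)."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


# Doubling from 1.0 with a cap of 4.0 shows the cap after two steps.
print(list(islice(compounding_sketch(1.0, 4.0, 2.0), 5)))
# [1.0, 2.0, 4.0, 4.0, 4.0]
```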


In [49]:
test_text = "Killed his wife"
doc = nlp(test_text)
max_value = 0
top_category = None
for key, value in doc.cats.items():
    if value > max_value:
        max_value = value
        top_category = key
print(top_category)

POLITICS
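
`doc.cats` is a dictionary from label to score, so picking the top category is an arg-max over its items. A compact sketch using `max` with a key function; the scores are invented for illustration, and `top_category_of` is a hypothetical helper. Unlike the loop above, it returns a label even when every score is 0.

```python
def top_category_of(cats):
    """Return the label with the highest score, or None for an empty dict."""
    if not cats:
        return None
    return max(cats, key=cats.get)


hypothetical_cats = {"POLITICS": 0.41, "CRIME": 0.37, "SPORTS": 0.22}  # made-up scores
print(top_category_of(hypothetical_cats))  # POLITICS
```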


## ¯\\_(ツ)_/¯ Need to use a bigger subset, but it takes too long