## Workshop Goals

* Get to know the Apache Spark engine.

* Understand the capabilities of the Spacy NLP library.

### Apache Spark is a fast and general engine for large-scale data processing
![Spark Libs](img/spark-libs.png)

### It can access diverse data sources including HDFS, Cassandra, Hive, HBase, S3 and JDBC/ODBC
![Spark Compatibilities](img/spark-cmp.png)

![Hadoop data sharing](img/data-sharing-mapreduce.png)
![Spark data sharing](img/data-sharing-spark.png)

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as fun, types

import spacy

import pandas as pd
pd.set_option('max_colwidth', 80)

### Spark session init

In [2]:
# The builder reuses an existing session (and SparkContext) or creates a new one
spark = SparkSession.builder \
    .appName('NLP') \
    .getOrCreate()

### Load dataset

News Category Dataset:
https://www.kaggle.com/rmisra/news-category-dataset

Each JSON record contains the following attributes:

* category: Category the article belongs to

* headline: Headline of the article

* authors: Person who authored the article

* link: Link to the post

* short_description: Short description of the article

* date: Date the article was published

In [4]:
news_df = spark.read.json("News_Category_Dataset_v2.json")
news_df.show()

+--------------------+-------------+----------+--------------------+--------------------+--------------------+
|             authors|     category|      date|            headline|                link|   short_description|
+--------------------+-------------+----------+--------------------+--------------------+--------------------+
|     Melissa Jeltsen|        CRIME|2018-05-26|There Were 2 Mass...|https://www.huffi...|She left her husb...|
|       Andy McDonald|ENTERTAINMENT|2018-05-26|Will Smith Joins ...|https://www.huffi...|Of course it has ...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Hugh Grant Marrie...|https://www.huffi...|The actor and his...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Jim Carrey Blasts...|https://www.huffi...|The actor gives D...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Julianna Margulie...|https://www.huffi...|The "Dietland" ac...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Morgan Freeman 'D...|https://www.huffi...|"It is not right ...|
|

In [36]:
news_df.head()

Row(authors='Melissa Jeltsen', category='CRIME', date='2018-05-26', headline='There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV', link='https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89', short_description='She left her husband. He killed their children. Just another day in America.')

### Examples

In [5]:
news_df.createOrReplaceTempView("news")

In [6]:
spark.sql("SELECT COUNT(*) AS count FROM news").show()

+------+
| count|
+------+
|200853|
+------+



In [7]:
news_df.count()

200853

In [8]:
spark.sql("SELECT category, count(category) AS count FROM news GROUP BY category ORDER BY count DESC").show()

+--------------+-----+
|      category|count|
+--------------+-----+
|      POLITICS|32739|
|      WELLNESS|17827|
| ENTERTAINMENT|16058|
|        TRAVEL| 9887|
|STYLE & BEAUTY| 9649|
|     PARENTING| 8677|
|HEALTHY LIVING| 6694|
|  QUEER VOICES| 6314|
|  FOOD & DRINK| 6226|
|      BUSINESS| 5937|
|        COMEDY| 5175|
|        SPORTS| 4884|
|  BLACK VOICES| 4528|
| HOME & LIVING| 4195|
|       PARENTS| 3955|
| THE WORLDPOST| 3664|
|      WEDDINGS| 3651|
|         WOMEN| 3490|
|        IMPACT| 3459|
|       DIVORCE| 3426|
+--------------+-----+
only showing top 20 rows



In [9]:
# groupby().count() already yields one row per category, so no distinct() is needed
news_df.groupby('category') \
    .count() \
    .orderBy(fun.desc('count')) \
    .show()

+--------------+-----+
|      category|count|
+--------------+-----+
|      POLITICS|32739|
|      WELLNESS|17827|
| ENTERTAINMENT|16058|
|        TRAVEL| 9887|
|STYLE & BEAUTY| 9649|
|     PARENTING| 8677|
|HEALTHY LIVING| 6694|
|  QUEER VOICES| 6314|
|  FOOD & DRINK| 6226|
|      BUSINESS| 5937|
|        COMEDY| 5175|
|        SPORTS| 4884|
|  BLACK VOICES| 4528|
| HOME & LIVING| 4195|
|       PARENTS| 3955|
| THE WORLDPOST| 3664|
|      WEDDINGS| 3651|
|         WOMEN| 3490|
|        IMPACT| 3459|
|       DIVORCE| 3426|
+--------------+-----+
only showing top 20 rows



### Task 1. Select the longest headline

Hint: Use the `length` function and the `LIMIT` clause in SQL

Available functions: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#module-pyspark.sql.functions

## Spacy NLP library
![Spacy Features](img/spacy-features.png)

## Spacy pipeline
![Spacy Pipeline](img/spacy-pipeline.png)

### Examples

In [10]:
# import spacy 

nlp = spacy.load("en_core_web_sm")
doc = nlp("In 2018 the Debian Linux project received a donation of $300,000")

for token in doc:
    print(token.text)

In
2018
the
Debian
Linux
project
received
a
donation
of
$
300,000


In [11]:
# noun_chunks yields multi-token spans, not single tokens
for chunk in doc.noun_chunks:
    print(chunk.text)

the Debian Linux project
a donation


In [12]:
for token in doc:
    if token.like_num:
        print(token.text)

2018
300,000


### Task 2. Extract named entities from the string

Hint: Use the `ents` attribute of the `Doc` and the `label_` attribute of each entity `Span`

Spacy Cheat Sheet: http://datacamp-community-prod.s3.amazonaws.com/29aa28bf-570a-4965-8f54-d6a541ae4e06

## Let's combine the power of these two tools

### Task 3. Extract ORG, PERSON, GPE named entities in Spark

```
# Write a function that takes a news headline and generates output like this

[
  {
    'label': 'ORG', 
    'text': 'ACME Inc.'
  },
  {
    'label': 'PERSON', 
    'text': 'John Doe'   
  },
  {
    'label': 'GPE',
    'text': 'London'
  }
  ...
]
```
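The target structure can be tried out without Spark or Spacy. The snippet below is a minimal sketch that assumes the recognized entities arrive as plain `(text, label)` tuples; `format_entities` and `SAMPLE_ENTITIES` are hypothetical names with made-up data, standing in for whatever the NLP library returns.

```python
# Hypothetical stand-in for recognized entities: (text, label) tuples.
SAMPLE_ENTITIES = [
    ("ACME Inc.", "ORG"),
    ("John Doe", "PERSON"),
    ("London", "GPE"),
    ("2018", "DATE"),  # not in the wanted labels, so it is dropped
]


def format_entities(entities, labels=("ORG", "PERSON", "GPE")):
    """Keep only entities with a wanted label and shape them as dicts."""
    return [
        {"label": label, "text": text}
        for text, label in entities
        if label in labels
    ]


print(format_entities(SAMPLE_ENTITIES))
```

Returning a list of flat dicts with exactly the keys `label` and `text` keeps the result compatible with an array-of-structs column type in Spark.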

In [13]:
news_df_sample = news_df.sample(withReplacement=False, fraction=0.002, seed=777)
news_df_sample.createOrReplaceTempView("news_sample")

In [14]:
class SpacyWrapper(object):
    """Wrapper class to load Spacy on worker nodes"""
    _spacys = {}
    disabled_pipeline_steps = ['parser', 'tagger']
    default_model = 'en_core_web_sm'

    @classmethod
    def get(cls, model=default_model, disable=disabled_pipeline_steps):
        if model not in cls._spacys:
            import spacy
            cls._spacys[model] = spacy.load(model, disable=disable)
        return cls._spacys[model]
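The wrapper above caches the loaded model in a class-level dictionary, so each worker's Python process presumably pays the loading cost only on the first call. The sketch below illustrates the same lazy-cache pattern in plain Python; `LazyModelCache`, `_load`, and `load_calls` are hypothetical names, and `_load` is a cheap stand-in for an expensive call such as `spacy.load`.

```python
class LazyModelCache:
    """Load each named resource at most once per Python process."""

    _cache = {}
    load_calls = 0  # counts real loads, for demonstration only

    @classmethod
    def _load(cls, name):
        # Hypothetical stand-in for an expensive loader
        cls.load_calls += 1
        return {"model": name}

    @classmethod
    def get(cls, name="en_core_web_sm"):
        # Only the first request for a given name triggers a load
        if name not in cls._cache:
            cls._cache[name] = cls._load(name)
        return cls._cache[name]


first = LazyModelCache.get()
second = LazyModelCache.get()
print(first is second, LazyModelCache.load_calls)
```

Because the load happens inside `get` rather than at import time, the function shipped to the workers only needs to reference the class, not a loaded model object.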

### Named entity extraction function

Hint: Reuse the code from `Task 2`.

In [15]:
def ner(doc):
    labels=['ORG', 'PERSON', 'GPE']
    entities = []
    
    # Load Spacy
    nlp = SpacyWrapper.get()
    doc = nlp(doc)
    
    # ======== WRITE YOUR SOLUTION BELOW ======== 
        
    return entities

In [16]:
# Schema definition
schema = types.ArrayType(
    types.StructType([
        types.StructField('label', types.StringType(), nullable=False),
        types.StructField('text', types.StringType(), nullable=False)
    ])
)

# Register user defined function (UDF) to use in SQL
spark.udf.register('ner', ner, schema)

<function __main__.ner(doc)>

### Apply the UDF to extract entities from short descriptions

In [17]:
ent_sample = spark.sql("SELECT short_description, ner(short_description) AS entities FROM news_sample")

In [18]:
ent_sample.toPandas()

|   | short_description | entities |
|---|---|---|
| 0 | The outside groups can spend big to alter election outcomes, but don't have ... | [] |
| 1 | The Islamic State claimed responsibility. | [] |
| 2 | The letter to Congress contradicts the White House's account of the spousal ... | [] |
| 3 | Like self-deportation, but for basic nutritional needs! | [] |
| 4 | Donations skyrocketed as the gun group issued frothy warnings of the "freedo... | [] |
| 5 | She plays "one of the universe’s most powerful heroes." | [] |
| 6 | The guy makes it too easy sometimes. | [] |
| 7 | Republican lawmakers have insisted that Trump let the special counsel to do ... | [] |
| 8 | How to use bitcoins, Sam Nunberg jokes and more. | [] |
| 9 | Such a funny thing for us to try to explain at the Winter Olympics. | [] |


## Save output to JSON

In [23]:
spark.sql("SELECT short_description, ner(short_description) AS entities FROM news_sample") \
    .repartition(1) \
    .write \
    .json("output")

# Text categorization

In [24]:
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import thinc.extra.datasets

import spacy
from spacy.util import minibatch, compounding

In [129]:
output_dir = "models"
n_texts = 2000
n_iter = 20
init_tok2vec = None

In [26]:
# create blank model
nlp = spacy.blank("en")

In [27]:
# create text classifier
textcat = nlp.create_pipe("textcat", config={"exclusive_classes":True, "architecture": "simple_cnn"})

In [28]:
# add text classifier to pipeline
nlp.add_pipe(textcat, last=True)

In [62]:
# get distinct categories
categories = [ row.category for row in news_df.select('category').distinct().collect()]
print(categories)
print(len(categories))

['SPORTS', 'MEDIA', 'BLACK VOICES', 'POLITICS', 'ARTS', 'THE WORLDPOST', 'QUEER VOICES', 'CULTURE & ARTS', 'PARENTING', 'GREEN', 'ENTERTAINMENT', 'ENVIRONMENT', 'TECH', 'BUSINESS', 'LATINO VOICES', 'COMEDY', 'STYLE & BEAUTY', 'MONEY', 'IMPACT', 'RELIGION', 'TRAVEL', 'EDUCATION', 'HEALTHY LIVING', 'CRIME', 'WEDDINGS', 'TASTE', 'HOME & LIVING', 'FIFTY', 'WEIRD NEWS', 'WORLDPOST', 'GOOD NEWS', 'DIVORCE', 'FOOD & DRINK', 'WOMEN', 'COLLEGE', 'ARTS & CULTURE', 'STYLE', 'PARENTS', 'WORLD NEWS', 'SCIENCE', 'WELLNESS']
41


In [30]:
for cat in categories:
    textcat.add_label(cat)

In [114]:
news_data = [ (row.category, row.headline) for row in news_df.collect()]    

In [115]:
news_data[:10]

[('CRIME', 'There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV'),
 ('ENTERTAINMENT',
  "Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song"),
 ('ENTERTAINMENT', 'Hugh Grant Marries For The First Time At Age 57'),
 ('ENTERTAINMENT',
  "Jim Carrey Blasts 'Castrato' Adam Schiff And Democrats In New Artwork"),
 ('ENTERTAINMENT',
  'Julianna Margulies Uses Donald Trump Poop Bags To Pick Up After Her Dog'),
 ('ENTERTAINMENT',
  "Morgan Freeman 'Devastated' That Sexual Harassment Claims Could Undermine Legacy"),
 ('ENTERTAINMENT',
  "Donald Trump Is Lovin' New McDonald's Jingle In 'Tonight Show' Bit"),
 ('ENTERTAINMENT', 'What To Watch On Amazon Prime That’s New This Week'),
 ('ENTERTAINMENT',
  "Mike Myers Reveals He'd 'Like To' Do A Fourth Austin Powers Film"),
 ('ENTERTAINMENT', 'What To Watch On Hulu That’s New This Week')]

In [100]:
def load_data(df, limit=0, split=0.8):
    """Load (category, headline) pairs from the News dataset and split them into train and dev sets."""
    # Take the label set from the DataFrame that was passed in, not from a global
    categories = [row.category for row in df.select('category').distinct().collect()]
    train_data = [(row.category, row.headline) for row in df.collect()]
    random.shuffle(train_data)
    # Keep only the last `limit` examples; limit=0 keeps everything
    train_data = train_data[-limit:]
    labels, texts = zip(*train_data)
    # One boolean per category: True only for the gold label
    cats = [{cat: (lbl == cat) for cat in categories} for lbl in labels]
    # Partition off part of the data for evaluation
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])

def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 0.0  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 0.0  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if label == "NEGATIVE":
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.0
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.0
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if (precision + recall) == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}
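`evaluate` counts true positives, false positives, and false negatives over every (document, label) pair using a 0.5 score threshold, then derives precision, recall, and F-score from those counts. The snippet below is a small worked example of that arithmetic; `precision_recall_f` is a hypothetical helper and the counts are made up. It guards against empty denominators explicitly instead of seeding the counters with `1e-8` as the function above does.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Made-up counts: 3 correct predictions, 1 spurious one, 3 missed labels
p, r, f = precision_recall_f(tp=3, fp=1, fn=3)
print(round(p, 3), round(r, 3), round(f, 3))  # prints: 0.75 0.5 0.6
```

High precision combined with lower recall, as in this toy case, means the classifier is usually right when it assigns a label but leaves many gold labels below the threshold.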


In [101]:
(train_texts, train_labels), (dev_texts, dev_labels) = load_data(df=news_df)

In [103]:
dev_texts[0]

'Photo Series Exposes The Joy And Heartache Of Raising Kids With Special Needs'

In [116]:
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_labels]))
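Each training example pairs a headline with a `{"cats": {...}}` dictionary holding one boolean per category. The sketch below rebuilds that shape on a toy scale; the three categories and both headlines are made up for illustration, and the `toy_` names are hypothetical.

```python
# Shortened, made-up label set and examples
toy_categories = ["POLITICS", "SPORTS", "TRAVEL"]
toy_labelled = [
    ("SPORTS", "Local Team Wins Final"),
    ("TRAVEL", "Ten Quiet Beaches"),
]

toy_labels, toy_texts = zip(*toy_labelled)
# One boolean per category: True only for the gold label
toy_cats = [{cat: (lbl == cat) for cat in toy_categories} for lbl in toy_labels]
# Wrap each dictionary under the "cats" key, as in the training data above
toy_train_data = list(zip(toy_texts, [{"cats": cats} for cats in toy_cats]))

print(toy_train_data[0])
```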

In [117]:
train_data[:10]

[('Friday Talking Points -- Rampant Republican Hypocrisy On Syria',
  {'cats': {'SPORTS': False,
    'MEDIA': False,
    'BLACK VOICES': False,
    'POLITICS': True,
    'ARTS': False,
    'THE WORLDPOST': False,
    'QUEER VOICES': False,
    'CULTURE & ARTS': False,
    'PARENTING': False,
    'GREEN': False,
    'ENTERTAINMENT': False,
    'ENVIRONMENT': False,
    'TECH': False,
    'BUSINESS': False,
    'LATINO VOICES': False,
    'COMEDY': False,
    'STYLE & BEAUTY': False,
    'MONEY': False,
    'IMPACT': False,
    'RELIGION': False,
    'TRAVEL': False,
    'EDUCATION': False,
    'HEALTHY LIVING': False,
    'CRIME': False,
    'WEDDINGS': False,
    'TASTE': False,
    'HOME & LIVING': False,
    'FIFTY': False,
    'WEIRD NEWS': False,
    'WORLDPOST': False,
    'GOOD NEWS': False,
    'DIVORCE': False,
    'FOOD & DRINK': False,
    'WOMEN': False,
    'COLLEGE': False,
    'ARTS & CULTURE': False,
    'STYLE': False,
    'PARENTS': False,
    'WORLD NEWS': False,
    

In [128]:
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()
    if init_tok2vec is not None:
        with init_tok2vec.open("rb") as file_:
            textcat.model.tok2vec.from_bytes(file_.read())
    print("Training the model...")
    print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))
    batch_sizes = compounding(4.0, 32.0, 1.001)
    for i in range(n_iter):
        losses = {}
        # batch up the examples using spaCy's minibatch
        random.shuffle(train_data)
        batches = minibatch(train_data, size=batch_sizes)
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        with textcat.model.use_params(optimizer.averages):
            # evaluate on the dev data split off in load_data()
            scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_labels)
        print(
            "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format(  # print a simple table
                losses["textcat"],
                scores["textcat_p"],
                scores["textcat_r"],
                scores["textcat_f"],
            )
        )

# test the trained model
test_text = "This movie sucked"
doc = nlp(test_text)
print(test_text, doc.cats)

Training the model...
LOSS 	  P  	  R  	  F  
4200.125	0.777	0.373	0.504
3082.331	0.778	0.397	0.525
2934.807	0.771	0.415	0.540
2836.402	0.766	0.426	0.548
2763.113	0.763	0.437	0.556
2703.992	0.761	0.444	0.561
2652.567	0.757	0.449	0.564
2613.604	0.755	0.452	0.565
2580.369	0.752	0.457	0.568
2546.641	0.749	0.461	0.571
2521.982	0.747	0.463	0.571
2502.292	0.745	0.466	0.574
2482.215	0.744	0.467	0.574
2460.835	0.743	0.470	0.576
2442.932	0.742	0.471	0.577
2427.028	0.740	0.473	0.577
2414.652	0.740	0.475	0.579
2400.553	0.739	0.476	0.579
2393.935	0.739	0.478	0.581
2380.902	0.739	0.478	0.580
This movie sucked {'SPORTS': 0.0029404545202851295, 'MEDIA': 7.699742855038494e-05, 'BLACK VOICES': 0.015056510455906391, 'POLITICS': 0.004427497275173664, 'ARTS': 4.539787187241018e-05, 'THE WORLDPOST': 0.0006049483199603856, 'QUEER VOICES': 0.0015267971903085709, 'CULTURE & ARTS': 4.539787187241018e-05, 'PARENTING': 0.00018652928702067584, 'GREEN': 0.0005045181605964899, 'ENTERTAINMENT': 0.9168505668640137, '
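The training loop draws its batch sizes from `compounding(4.0, 32.0, 1.001)`, which starts with small batches and grows them slowly up to a cap. The generator below is a re-implementation written for illustration only, not spaCy's own code; `compounding_sketch` is a hypothetical name, and it assumes the series multiplies the previous value by the compound rate and clips at the stop value.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


# Take the first 3000 batch sizes of the schedule used in the training loop
sizes = list(islice(compounding_sketch(4.0, 32.0, 1.001), 3000))
print(sizes[0], round(sizes[1], 3), sizes[-1])
```

With a rate of 1.001 the size grows by only 0.1% per batch, so it takes roughly two thousand batches to reach the cap of 32.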

In [131]:
with nlp.use_params(optimizer.averages):
    nlp.to_disk(output_dir)
    print("Saved model to", output_dir)
    # test the saved model
    print("Loading from", output_dir)
    nlp2 = spacy.load(output_dir)
    doc2 = nlp2(test_text)
    print(test_text, doc2.cats)

Saved model to models
Loading from models
This movie sucked {'SPORTS': 0.001859086798503995, 'MEDIA': 0.00012884176976513118, 'BLACK VOICES': 0.012571942992508411, 'POLITICS': 0.0035734246484935284, 'ARTS': 4.539787187241018e-05, 'THE WORLDPOST': 0.0022347450722008944, 'QUEER VOICES': 0.0011917247902601957, 'CULTURE & ARTS': 4.539787187241018e-05, 'PARENTING': 0.0001935432228492573, 'GREEN': 0.00033383368281647563, 'ENTERTAINMENT': 0.8976044058799744, 'ENVIRONMENT': 4.539787187241018e-05, 'TECH': 4.539787187241018e-05, 'BUSINESS': 0.0006732450565323234, 'LATINO VOICES': 0.00014961596752982587, 'COMEDY': 0.002926008542999625, 'STYLE & BEAUTY': 4.539787187241018e-05, 'MONEY': 4.539787187241018e-05, 'IMPACT': 0.00011039463424822316, 'RELIGION': 4.539787187241018e-05, 'TRAVEL': 0.0004975117044523358, 'EDUCATION': 4.539787187241018e-05, 'HEALTHY LIVING': 0.0004041741194669157, 'CRIME': 0.0008256362634710968, 'WEDDINGS': 4.539787187241018e-05, 'TASTE': 0.00010820441821124405, 'HOME & LIVING'

In [134]:
sorted(doc2.cats.items(), key=lambda kv: kv[1], reverse=True)

[('ENTERTAINMENT', 0.8976044058799744),
 ('WEIRD NEWS', 0.02132837474346161),
 ('BLACK VOICES', 0.012571942992508411),
 ('POLITICS', 0.0035734246484935284),
 ('COMEDY', 0.002926008542999625),
 ('THE WORLDPOST', 0.0022347450722008944),
 ('STYLE', 0.0020494158379733562),
 ('SPORTS', 0.001859086798503995),
 ('WELLNESS', 0.0017780056223273277),
 ('PARENTS', 0.001599202398210764),
 ('FOOD & DRINK', 0.0015194063307717443),
 ('QUEER VOICES', 0.0011917247902601957),
 ('CRIME', 0.0008256362634710968),
 ('BUSINESS', 0.0006732450565323234),
 ('TRAVEL', 0.0004975117044523358),
 ('HEALTHY LIVING', 0.0004041741194669157),
 ('GREEN', 0.00033383368281647563),
 ('WOMEN', 0.00021872953220736235),
 ('WORLDPOST', 0.00019946687098126858),
 ('PARENTING', 0.0001935432228492573),
 ('LATINO VOICES', 0.00014961596752982587),
 ('MEDIA', 0.00012884176976513118),
 ('IMPACT', 0.00011039463424822316),
 ('TASTE', 0.00010820441821124405),
 ('FIFTY', 7.51630068407394e-05),
 ('SCIENCE', 7.267794717336074e-05),
 ('WORLD 

In [161]:
test_text = "Photo Series Exposes The Joy And Heartache Of Raising Kids With Special Needs"
doc = nlp(test_text)
sorted(doc.cats.items(), key=lambda kv: kv[1], reverse=True)[:5]

[('PARENTS', 0.8480067253112793),
 ('PARENTING', 0.6117821931838989),
 ('QUEER VOICES', 0.00454875361174345),
 ('WELLNESS', 0.004119208548218012),
 ('GOOD NEWS', 0.0037084331270307302)]

In [162]:
test_text = "Anti-Semitism threatens fragile Jewish life in Romania"
doc = nlp(test_text)
sorted(doc.cats.items(), key=lambda kv: kv[1], reverse=True)[:5]

[('WORLDPOST', 0.8023033738136292),
 ('POLITICS', 0.15048882365226746),
 ('WORLD NEWS', 0.02114487811923027),
 ('IMPACT', 0.011980917304754257),
 ('FOOD & DRINK', 0.010317415930330753)]

In [163]:
test_text = "Masters: Tiger Woods' first major - from mixing with stars to Monday history class"
doc = nlp(test_text)
sorted(doc.cats.items(), key=lambda kv: kv[1], reverse=True)[:5]

[('SPORTS', 0.9766820669174194),
 ('POLITICS', 0.1045171245932579),
 ('ENTERTAINMENT', 0.02232290804386139),
 ('QUEER VOICES', 0.0053286077454686165),
 ('GREEN', 0.004776482470333576)]