## Workshop Goals

* Get to know the Apache Spark engine.

* Understand the capabilities of the Spacy NLP library.

### Apache Spark is a fast and general engine for large-scale data processing
![Spark Libs](img/spark-libs.png)

### It can access diverse data sources including HDFS, Cassandra, Hive, HBase, S3 and JDBC/ODBC
![Spark Compatibilities](img/spark-cmp.png)

![Hadoop data sharing](img/data-sharing-mapreduce.png)
![Spark data sharing](img/data-sharing-spark.png)

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as fun, types

import spacy

import pandas as pd
pd.set_option('max_colwidth', 80)

### Spark session init

In [2]:
# Build (or reuse) the session through the builder; it creates the SparkContext itself
spark = SparkSession.builder \
    .appName('NLP') \
    .getOrCreate()

### Load dataset

News Category Dataset:
https://www.kaggle.com/rmisra/news-category-dataset

Each JSON record contains the following attributes:

* category: Category article belongs to

* headline: Headline of the article

* authors: Person(s) who authored the article

* link: Link to the post

* short_description: Short description of the article

* date: Date the article was published

In [3]:
news_df = spark.read.json("News_Category_Dataset_v2.json")
news_df.show()

+--------------------+-------------+----------+--------------------+--------------------+--------------------+
|             authors|     category|      date|            headline|                link|   short_description|
+--------------------+-------------+----------+--------------------+--------------------+--------------------+
|     Melissa Jeltsen|        CRIME|2018-05-26|There Were 2 Mass...|https://www.huffi...|She left her husb...|
|       Andy McDonald|ENTERTAINMENT|2018-05-26|Will Smith Joins ...|https://www.huffi...|Of course it has ...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Hugh Grant Marrie...|https://www.huffi...|The actor and his...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Jim Carrey Blasts...|https://www.huffi...|The actor gives D...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Julianna Margulie...|https://www.huffi...|The "Dietland" ac...|
|          Ron Dicker|ENTERTAINMENT|2018-05-26|Morgan Freeman 'D...|https://www.huffi...|"It is not right ...|

### Examples

In [4]:
news_df.createOrReplaceTempView("news")

In [5]:
spark.sql("SELECT COUNT(*) AS count FROM news").show()

+------+
| count|
+------+
|200853|
+------+



In [6]:
news_df.count()

200853

In [7]:
spark.sql("SELECT category, count(category) AS count FROM news GROUP BY category ORDER BY count DESC").show()

+--------------+-----+
|      category|count|
+--------------+-----+
|      POLITICS|32739|
|      WELLNESS|17827|
| ENTERTAINMENT|16058|
|        TRAVEL| 9887|
|STYLE & BEAUTY| 9649|
|     PARENTING| 8677|
|HEALTHY LIVING| 6694|
|  QUEER VOICES| 6314|
|  FOOD & DRINK| 6226|
|      BUSINESS| 5937|
|        COMEDY| 5175|
|        SPORTS| 4884|
|  BLACK VOICES| 4528|
| HOME & LIVING| 4195|
|       PARENTS| 3955|
| THE WORLDPOST| 3664|
|      WEDDINGS| 3651|
|         WOMEN| 3490|
|        IMPACT| 3459|
|       DIVORCE| 3426|
+--------------+-----+
only showing top 20 rows



In [8]:
news_df.groupby('category') \
    .count() \
    .orderBy(fun.desc('count')) \
    .show()

+--------------+-----+
|      category|count|
+--------------+-----+
|      POLITICS|32739|
|      WELLNESS|17827|
| ENTERTAINMENT|16058|
|        TRAVEL| 9887|
|STYLE & BEAUTY| 9649|
|     PARENTING| 8677|
|HEALTHY LIVING| 6694|
|  QUEER VOICES| 6314|
|  FOOD & DRINK| 6226|
|      BUSINESS| 5937|
|        COMEDY| 5175|
|        SPORTS| 4884|
|  BLACK VOICES| 4528|
| HOME & LIVING| 4195|
|       PARENTS| 3955|
| THE WORLDPOST| 3664|
|      WEDDINGS| 3651|
|         WOMEN| 3490|
|        IMPACT| 3459|
|       DIVORCE| 3426|
+--------------+-----+
only showing top 20 rows



### Task 1. Select the longest headline

Hint: Use the `length` function and the `LIMIT` clause in SQL

Available functions: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#module-pyspark.sql.functions

In [9]:
spark.sql("SELECT headline, length(headline) AS max_headline FROM news ORDER BY max_headline DESC limit 1").show()

+--------------------+------------+
|            headline|max_headline|
+--------------------+------------+
|Wendy Williams An...|         320|
+--------------------+------------+
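
The same "compute the length, sort descending, keep one row" pattern can be tried without a Spark session. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL; the three headlines and the `headline_length` alias are made up for illustration.

```python
import sqlite3

# Hypothetical sample rows; the real data lives in the Spark `news` view.
headlines = [
    ("Short one",),
    ("A noticeably longer headline for the demo",),
    ("Medium headline",),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (headline TEXT)")
conn.executemany("INSERT INTO news VALUES (?)", headlines)

# Same shape as the Spark SQL query: length, ORDER BY ... DESC, LIMIT 1.
longest = conn.execute(
    "SELECT headline, length(headline) AS headline_length "
    "FROM news ORDER BY headline_length DESC LIMIT 1"
).fetchone()
conn.close()

print(longest)
```

Sorting the whole table to keep one row is fine for a workshop-sized dataset; on much larger data an aggregate such as `max(length(headline))` may be cheaper.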



## Spacy NLP library
![Spacy Features](img/spacy-features.png)

## Spacy pipeline
![Spacy Pipeline](img/spacy-pipeline.png)

### Examples

In [10]:
# import spacy 

nlp = spacy.load("en_core_web_sm")
doc = nlp("In 2018 the Debian Linux project received a donation of $300,000")

for token in doc:
    print(token.text)

In
2018
the
Debian
Linux
project
received
a
donation
of
$
300,000


In [11]:
# noun_chunks yields multi-token spans, not single tokens
for chunk in doc.noun_chunks:
    print(chunk.text)

the Debian Linux project
a donation


In [12]:
for token in doc:
    if token.like_num:
        print(token.text)

2018
300,000


### Task 2. Extract named entities from the string

Hint: Use the `ents` attribute of the `Doc` and the `label_` attribute of each entity `Span`

Spacy Cheat Sheet: http://datacamp-community-prod.s3.amazonaws.com/29aa28bf-570a-4965-8f54-d6a541ae4e06

In [13]:
for ent in doc.ents:
    print(ent.text)

2018
Debian
Linux
300,000


## Let's combine the power of these two tools

### Task 3. Extract ORG, PERSON, GPE named entities in Spark

```
# Write a function that takes a news headline and generates output like this

[
  {
    'label': 'ORG', 
    'text': 'ACME Inc.'
  },
  {
    'label': 'PERSON', 
    'text': 'John Doe'   
  },
  {
    'label': 'GPE',
    'text': 'London'
  }
  ...
]
```

In [14]:
news_df_sample = news_df.sample(withReplacement=False, fraction=0.002, seed=777)
news_df_sample.createOrReplaceTempView("news_sample")
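
Note that `fraction` is a per-row probability, not an exact share, so the sample size is only approximately `fraction * count`. The sketch below imitates that behaviour in plain Python. It uses Python's `random` module rather than Spark's generator, so it will not pick the same rows as `news_df.sample`; `bernoulli_sample` is a hypothetical helper name.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

total_rows = 200853  # row count reported earlier in the notebook
sample = bernoulli_sample(range(total_rows), fraction=0.002, seed=777)

expected = total_rows * 0.002  # about 402 rows on average
print(len(sample), "rows sampled; expected about", round(expected))
```

Fixing the seed makes the sample reproducible between runs, which is helpful when several workshop participants compare results.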

In [15]:
class SpacyWrapper(object):
    """Wrapper class to load Spacy on worker nodes"""
    _spacys = {}
    disabled_pipeline_steps = ['parser', 'tagger']
    default_model = 'en_core_web_sm'

    @classmethod
    def get(cls, model=default_model, disable=disabled_pipeline_steps):
        if model not in cls._spacys:
            import spacy
            cls._spacys[model] = spacy.load(model, disable=disable)
        return cls._spacys[model]
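
The wrapper above is a lazy, per-process cache: the model is loaded the first time `get` is called in a given Python process and reused afterwards. The sketch below shows the same pattern with a fake loader, so it runs without Spacy installed; `LazyModelCache` and `fake_loader` are hypothetical names used only for illustration.

```python
class LazyModelCache:
    """Load each named resource once per Python process, then reuse it."""
    _cache = {}

    @classmethod
    def get(cls, name, loader):
        # `loader` stands in for an expensive call such as spacy.load
        if name not in cls._cache:
            cls._cache[name] = loader(name)
        return cls._cache[name]


load_calls = []

def fake_loader(name):
    """Pretend to load a model and record that a load happened."""
    load_calls.append(name)
    return {"model": name}

first = LazyModelCache.get("en_core_web_sm", fake_loader)
second = LazyModelCache.get("en_core_web_sm", fake_loader)

print(len(load_calls))  # number of real loads performed
```

One thing to keep in mind with `SpacyWrapper`: the cache key is only the model name, so a later call with a different `disable` list would still return the first pipeline that was loaded.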

### Named entity extraction function

Hint: Reuse the code from `Task 2`.

In [16]:
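def ner(doc):
    """Return the ORG, PERSON and GPE entities found in the text `doc`."""
    labels = ['ORG', 'PERSON', 'GPE']
    entities = []

    # Load Spacy (the wrapper caches the model per worker process)
    nlp = SpacyWrapper.get()
    parsed = nlp(doc)

    # ======== WRITE YOUR SOLUTION BELOW ========
    for ent in parsed.ents:
        # Keep only the entity types requested in Task 3
        if ent.label_ in labels:
            entities.append({'label': ent.label_, 'text': ent.text})

    return entities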
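
A Python UDF receives `None` for NULL column values, and passing `None` to the Spacy pipeline would most likely raise an error inside the worker. A small guard around the extractor avoids that. The sketch below is one possible way to do it; `null_safe` and `fake_ner` are hypothetical names, and `fake_ner` only stands in for `ner` so the example runs without a Spacy model.

```python
def null_safe(extract):
    """Wrap an extractor so missing or blank text yields an empty list."""
    def wrapper(text):
        if text is None or not text.strip():
            return []
        return extract(text)
    return wrapper


# Hypothetical stand-in for `ner`: treats fully upper-case words as ORG entities.
def fake_ner(text):
    return [{'label': 'ORG', 'text': word} for word in text.split() if word.isupper()]


safe_ner = null_safe(fake_ner)
print(safe_ner(None), safe_ner("   "), safe_ner("NRA donations rise"))
```

In the notebook, the wrapped function (for example `null_safe(ner)`) could be registered as the UDF in place of `ner`.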

In [17]:
# Schema definition
schema = types.ArrayType(
    types.StructType([
        types.StructField('label', types.StringType(), nullable=False),
        types.StructField('text', types.StringType(), nullable=False)
    ])
)

# Register user defined function (UDF) to use in SQL
spark.udf.register('ner', ner, schema)

<function __main__.ner(doc)>

### Apply UDF to extract headlines

In [18]:
ent_sample = spark.sql("SELECT * FROM news_sample")

In [19]:
dataset = ent_sample.toPandas()

In [20]:
dataset.head()

Unnamed: 0,authors,category,date,headline,link,short_description
0,Kevin Robillard,POLITICS,2018-05-21,Super PACs That Meddled In West Virginia’s Senate Primary Didn’t Receive A P...,https://www.huffingtonpost.com/entry/super-pacs-that-meddled-in-west-virgini...,"The outside groups can spend big to alter election outcomes, but don't have ..."
1,Sara Boboltz,WORLD NEWS,2018-05-12,"Knife Attack In Paris Leaves 1 Dead, Several Injured",https://www.huffingtonpost.com/entry/knife-attack-paris-dead-injured_us_5af7...,The Islamic State claimed responsibility.
2,Doha Madani,POLITICS,2018-04-27,White House Knew About Rob Porter Allegations A Year Ago: FBI Letter,https://www.huffingtonpost.com/entry/fbi-congress-letter-rob-porter-white-ho...,The letter to Congress contradicts the White House's account of the spousal ...
3,Arthur Delaney,POLITICS,2018-04-18,Republicans Say People Would Kick Themselves Off Food Stamps Under Their New...,https://www.huffingtonpost.com/entry/republicans-food-stamps_us_5ad76c97e4b0...,"Like self-deportation, but for basic nutritional needs!"
4,Dominique Mosbergen,POLITICS,2018-03-29,NRA Is Pulling In Big Bucks After The Parkland Mass Shooting,https://www.huffingtonpost.com/entry/nra-donations-parkland-shooting_us_5abc...,"Donations skyrocketed as the gun group issued frothy warnings of the ""freedo..."


In [21]:
dataset = dataset[['category', 'short_description']]
dataset.head()

Unnamed: 0,category,short_description
0,POLITICS,"The outside groups can spend big to alter election outcomes, but don't have ..."
1,WORLD NEWS,The Islamic State claimed responsibility.
2,POLITICS,The letter to Congress contradicts the White House's account of the spousal ...
3,POLITICS,"Like self-deportation, but for basic nutritional needs!"
4,POLITICS,"Donations skyrocketed as the gun group issued frothy warnings of the ""freedo..."


## Save output to JSON

In [23]:
spark.sql("SELECT short_description, ner(short_description) AS entities FROM news_sample") \
    .repartition(1) \
    .write \
    .json("output")
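
Spark writes `output` as a directory of part files in JSON Lines format (one JSON object per line), not as a single JSON document. The sketch below reads such a directory back with the standard library. `read_json_lines_dir` is a hypothetical helper, the `part-*.json` pattern is an assumption about the part file names, and the demo record is made up.

```python
import glob
import json
import os
import tempfile

def read_json_lines_dir(path):
    """Read every record from the part files of a Spark JSON output directory."""
    records = []
    for part in sorted(glob.glob(os.path.join(path, "part-*.json"))):
        with open(part, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    records.append(json.loads(line))
    return records

# Demo on a temporary directory that mimics the layout of `output`.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "part-00000-demo.json")
    with open(demo_file, "w", encoding="utf-8") as handle:
        handle.write('{"short_description": "Demo text", "entities": []}\n')
    demo_records = read_json_lines_dir(tmp)

print(demo_records)
```

After the cell above has run, `read_json_lines_dir("output")` should return the saved rows as a list of dictionaries.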

In [24]:
nlp = spacy.blank("en")  # create blank Language class
print("Created blank 'en' model")
categories = dataset.category.unique()
print(categories)
textcat = nlp.create_pipe(
            "textcat",
            config={
                "exclusive_classes": True,
                "architecture": "simple_cnn",
            }
        )
nlp.add_pipe(textcat, last=True)
# Register every news category as a label of the text classifier
for singleCat in categories:
    textcat.add_label(singleCat)

Created blank 'en' model
['POLITICS' 'WORLD NEWS' 'ENTERTAINMENT' 'WOMEN' 'BLACK VOICES' 'GREEN'
 'SPORTS' 'WEIRD NEWS' 'QUEER VOICES' 'TASTE' 'COMEDY' 'EDUCATION' 'STYLE'
 'PARENTS' 'SCIENCE' 'HEALTHY LIVING' 'TRAVEL' 'THE WORLDPOST' 'MEDIA'
 'BUSINESS' 'RELIGION' 'LATINO VOICES' 'CRIME' 'TECH' 'ARTS'
 'ARTS & CULTURE' 'GOOD NEWS' 'WORLDPOST' 'IMPACT' 'FIFTY' 'COLLEGE'
 'PARENTING' 'STYLE & BEAUTY' 'WEDDINGS' 'MONEY' 'WELLNESS' 'FOOD & DRINK'
 'HOME & LIVING' 'CULTURE & ARTS' 'DIVORCE']


In [28]:
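# Work on a copy so that adding a column later does not trigger pandas' SettingWithCopyWarning
trainSet = dataset.copy()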

In [29]:
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]

In [30]:
from spacy.util import minibatch, compounding
import random

In [31]:
n_iter=20

In [32]:
trainSet.head()

Unnamed: 0,category,short_description
0,POLITICS,"The outside groups can spend big to alter election outcomes, but don't have ..."
1,WORLD NEWS,The Islamic State claimed responsibility.
2,POLITICS,The letter to Congress contradicts the White House's account of the spousal ...
3,POLITICS,"Like self-deportation, but for basic nutritional needs!"
4,POLITICS,"Donations skyrocketed as the gun group issued frothy warnings of the ""freedo..."


In [33]:
train_c = list(trainSet.category)
train_tex = list(trainSet.short_description)

In [34]:
train_tex[1]

'The Islamic State claimed responsibility.'

In [35]:
trainSet.category.unique()

array(['POLITICS', 'WORLD NEWS', 'ENTERTAINMENT', 'WOMEN', 'BLACK VOICES',
       'GREEN', 'SPORTS', 'WEIRD NEWS', 'QUEER VOICES', 'TASTE', 'COMEDY',
       'EDUCATION', 'STYLE', 'PARENTS', 'SCIENCE', 'HEALTHY LIVING',
       'TRAVEL', 'THE WORLDPOST', 'MEDIA', 'BUSINESS', 'RELIGION',
       'LATINO VOICES', 'CRIME', 'TECH', 'ARTS', 'ARTS & CULTURE',
       'GOOD NEWS', 'WORLDPOST', 'IMPACT', 'FIFTY', 'COLLEGE',
       'PARENTING', 'STYLE & BEAUTY', 'WEDDINGS', 'MONEY', 'WELLNESS',
       'FOOD & DRINK', 'HOME & LIVING', 'CULTURE & ARTS', 'DIVORCE'],
      dtype=object)

In [36]:
train_c[1]

'WORLD NEWS'

In [37]:
def encode_to_dict(target):
    tarDict = dict()
    for cat in trainSet.category.unique():
        tarDict[cat] = target == cat
    return tarDict
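
`encode_to_dict` produces the one-hot `cats` dictionary that the text classifier expects, but it recomputes the category list from the global `trainSet` on every call. A variant that takes the categories as an argument is easier to test in isolation. The sketch below assumes a made-up three-label subset, and `one_hot_cats` is a hypothetical name.

```python
def one_hot_cats(target, categories):
    """Map every category to True only when it equals `target`."""
    return {cat: cat == target for cat in categories}


# A small hypothetical subset of the notebook's category labels.
categories = ['POLITICS', 'WORLD NEWS', 'ENTERTAINMENT']
cats = one_hot_cats('WORLD NEWS', categories)

# With exclusive classes, exactly one label should be True per example.
assert sum(cats.values()) == 1
print(cats)
```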

In [38]:
trainSet['target'] = trainSet['category'].apply(encode_to_dict)

In [39]:
train_c = list(trainSet.target)
train_tex = list(trainSet.short_description)

In [41]:
train_data = list(zip(train_tex, [{"cats": cats} for cats in train_c]))
print(train_data[1])

('The Islamic State claimed responsibility.', {'cats': {'POLITICS': False, 'WORLD NEWS': True, 'ENTERTAINMENT': False, 'WOMEN': False, 'BLACK VOICES': False, 'GREEN': False, 'SPORTS': False, 'WEIRD NEWS': False, 'QUEER VOICES': False, 'TASTE': False, 'COMEDY': False, 'EDUCATION': False, 'STYLE': False, 'PARENTS': False, 'SCIENCE': False, 'HEALTHY LIVING': False, 'TRAVEL': False, 'THE WORLDPOST': False, 'MEDIA': False, 'BUSINESS': False, 'RELIGION': False, 'LATINO VOICES': False, 'CRIME': False, 'TECH': False, 'ARTS': False, 'ARTS & CULTURE': False, 'GOOD NEWS': False, 'WORLDPOST': False, 'IMPACT': False, 'FIFTY': False, 'COLLEGE': False, 'PARENTING': False, 'STYLE & BEAUTY': False, 'WEDDINGS': False, 'MONEY': False, 'WELLNESS': False, 'FOOD & DRINK': False, 'HOME & LIVING': False, 'CULTURE & ARTS': False, 'DIVORCE': False}})


In [42]:
with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()
    # Batch sizes grow from 4 towards 32 by a factor of 1.001 per batch
    batch_sizes = compounding(4.0, 32.0, 1.001)
    for i in range(n_iter):
        losses = {}  # accumulated loss per pipe for this iteration
        # Shuffle, then batch up the examples using spaCy's minibatch
        random.shuffle(train_data)
        batches = minibatch(train_data, size=batch_sizes)
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
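
To see what the batching in the loop above does, the following plain-Python sketch approximates the behaviour of `compounding` and `minibatch`. It is not spaCy's own code: `compounding_sketch` and `minibatch_sketch` are hypothetical re-implementations, and the growth factor is raised to 2.0 (instead of 1.001) so the growth is visible on a tiny input.

```python
import itertools

def compounding_sketch(start, stop, compound):
    """Yield start, start * compound, ... capped at stop (assumes start <= stop)."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

def minibatch_sketch(items, sizes):
    """Cut `items` into consecutive batches whose lengths come from `sizes`."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(sizes))))
        if not batch:
            return
        yield batch

# 70 dummy training examples, batch sizes 4, 8, 16, 32, 32, ...
schedule = compounding_sketch(4.0, 32.0, 2.0)
batch_lengths = [len(batch) for batch in minibatch_sketch(range(70), schedule)]
print(batch_lengths)
```

Because `batch_sizes` is created once, before the `for i in range(n_iter)` loop, the schedule keeps growing across iterations rather than restarting at 4 each time.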

