# Lab: Spark for Text Analytics

The first portion is adapted from the John Snow Labs [example notebook](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/SENTIMENT_EN.ipynb).

## 1. Setup

In [1]:
import findspark
findspark.init()

The Spark NLP Python package (imported as `sparknlp`) is not installed by the cluster bootstrap script, so we install it here. It is required only on the driver machine, not on the worker machines.

In [2]:
!/mnt/miniconda/bin/pip install spark-nlp==4.2.1 --force
!/mnt/miniconda/bin/pip install sparknlp

Collecting spark-nlp==4.2.1
  Downloading spark_nlp-4.2.1-py2.py3-none-any.whl (643 kB)
Installing collected packages: spark-nlp
  Attempting uninstall: spark-nlp
    Found existing installation: spark-nlp 4.2.2
    Uninstalling spark-nlp-4.2.2:
      Successfully uninstalled spark-nlp-4.2.2
Successfully installed spark-nlp-4.2.1
Collecting sparknlp
  Downloading sparknlp-1.0.0-py3-none-any.whl (1.4 kB)
Installing collected packages: sparknlp
Successfully installed sparknlp-1.0.0


In [3]:
import pandas as pd
import numpy as np
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

## 2. Start Spark Session

Note that the session configuration below names a specific Java package (`spark.jars.packages`). This is the Spark NLP library that the Python `sparknlp` package calls into.

We are using [Kryo serialization](https://spark.apache.org/docs/latest/tuning.html), which is typically faster and more compact than Spark's default Java serialization but does not support every `Serializable` type out of the box.

Note that this command downloads and installs many Java dependencies of Spark NLP from the Maven repository. It will take a minute or two.
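The pip install above and the Maven coordinate below both pin version 4.2.1, and keeping the two in step is the usual recommendation for Spark NLP. The following is a minimal sketch of a check for that, using only the standard library. The helper names `jar_version`, `installed_version`, and `versions_match` are hypothetical and not part of Spark NLP.

```python
from importlib import metadata
from typing import Optional

# Maven coordinate passed to spark.jars.packages in this lab.
SPARK_NLP_COORDINATE = "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.1"


def jar_version(coordinate: str) -> str:
    """Return the version part of a 'group:artifact:version' coordinate."""
    _group, _artifact, version = coordinate.split(":")
    return version


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a pip package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def versions_match(coordinate: str, package: str = "spark-nlp") -> bool:
    """True when the installed pip package version equals the jar version."""
    return installed_version(package) == jar_version(coordinate)


print("jar version:", jar_version(SPARK_NLP_COORDINATE))
print("pip version:", installed_version("spark-nlp"))
print("match:", versions_match(SPARK_NLP_COORDINATE))
```

Running this on the driver before building the session gives an early warning if the two versions have drifted apart.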

In [4]:
spark = SparkSession.builder \
    .appName("SparkNLP") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.1") \
    .master("yarn") \
    .getOrCreate()

Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-97361bbb-25ef-4e83-9f49-e9888d90ed03;1.0
	confs: [default]
	found com.johnsnowlabs.nlp#spark-nlp_2.12;4.2.1 in central
	found com.typesafe#config;1.4.2 in central
	found org.rocksdb#rocksdbjni;6.29.5 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.828 in central
	found com.github.universal-automata#liblevenshtein;3.0.0 in central
	found com.google.code.findbugs#annotations;3.0.1 in central
	found net.jcip#jcip-annotations;1.0 in central
	found com.google.code.findbugs#jsr305;3.0.1 in central
	found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
	found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
	foun

In [5]:
spark

## 3. Select the DL model and re-run cells below

In [6]:
#MODEL_NAME='sentimentdl_use_imdb'
MODEL_NAME='sentimentdl_use_twitter'

## 4. Some sample examples

In [7]:
## Generating Example Files ##

text_list = []
if MODEL_NAME=='sentimentdl_use_imdb':
  text_list = [
             """Demonicus is a movie turned into a video game! I just love the story and the things that goes on in the film.It is a B-film ofcourse but that doesn`t bother one bit because its made just right and the music was rad! Horror and sword fight freaks,buy this movie now!""",
             """Back when Alec Baldwin and Kim Basinger were a mercurial, hot-tempered, high-powered Hollywood couple they filmed this (nearly) scene-for-scene remake of the 1972 Steve McQueen-Ali MacGraw action-thriller about a fugitive twosome. It almost worked the first time because McQueen was such a vital presence on the screen--even stone silent and weary, you could sense his clock ticking, his cagey magnetism. Baldwin is not in Steve McQueen's league, but he has his charms and is probably a more versatile actor--if so, this is not a showcase for his attributes. Basinger does well and certainly looks good, but James Woods is artificially hammy in a silly mob-magnet role. A sub-plot involving another couple taken hostage by Baldwin's ex-partner was unbearable in the '72 film and plays even worse here. As for the action scenes, they're pretty old hat, which causes one to wonder: why even remake the original?""",
             """Despite a tight narrative, Johnnie To's Election feels at times like it was once a longer picture, with many characters and plot strands abandoned or ultimately unresolved. Some of these are dealt with in the truly excellent and far superior sequel, Election 2: Harmony is a Virtue, but it's still a dependably enthralling thriller about a contested Triad election that bypasses the usual shootouts and explosions (though not the violence) in favour of constantly shifting alliances that can turn in the time it takes to make a phone call. It's also a film where the most ruthless character isn't always the most threatening one, as the chilling ending makes only too clear: one can imagine a lifetime of psychological counselling being necessary for all the trauma that one inflicts on one unfortunate bystander. Simon Yam, all too often a variable actor but always at his best under To's direction, has possibly never been better in the lead, not least because Tony Leung's much more extrovert performance makes his stillness more the powerful.""",
             """This movie has successfully proved what we all already know, that professional basket-ball players suck at everything besides playing basket-ball. Especially rapping and acting. I can not even begin to describe how bad this movie truly is. First of all, is it just me, or is that the ugliest kid you have ever seen? I mean, his teeth could be used as a can-opener. Secondly, why would a genie want to pursue a career in the music industry when, even though he has magical powers, he sucks horribly at making music? Third, I have read the Bible. In no way shape or form did it say that Jesus made genies. Fourth, what was the deal with all the crappy special effects? I assure you that any acne-addled nerdy teenager with a computer could make better effects than that. Fifth, why did the ending suck so badly? And what the hell is a djin? And finally, whoever created the nightmare known as Kazaam needs to be thrown off of a plane and onto the Eiffel Tower, because this movie take the word "suck" to an entirely new level.""",
             """The fluttering of butterfly wings in the Atlantic can unleash a hurricane in the Pacific. According to this theory (somehow related to the Chaos Theory, I'm not sure exactly how), every action, no matter how small or insignificant, will start a chain reaction that can lead to big events. This small jewel of a film shows us a series of seemingly-unrelated characters, most of them in Paris, whose actions will affect each others' lives. (The six-degrees-of-separation theory can be applied as well.) Each story is a facet of the jewel that is this film. The acting is finely-tuned and nuanced (Audrey Tautou is luminous), the stories mesh plausibly, the humor is just right, and the viewer leaves the theatre nodding in agreement.""",
             """There have been very few films I have not been able to sit through. I made it through Battle Field Earth no problem. But this, This is one of the single worst films EVER to be made. I understand Whoopi Goldberg tried to get of acting in it. I do not blame her. I would feel ashamed to have this on a resume. I belive it is a rare occasion when almost every gag in a film falls flat on it's face. Well it happens here. Not to mention the SFX, look for the dino with the control cables hanging out of it rear end!!!!!! Halfway through the film I was still looking for a plot. I never found one. Save yourself the trouble of renting this and save 90 minutes of your life.""",
             """After a long hard week behind the desk making all those dam serious decisions this movie is a great way to relax. Like Wells and the original radio broadcast this movie will take you away to a land of alien humor and sci-fi paraday. 'Captain Zippo died in the great charge of the Buick. He was a brave man.' The Jack Nicholson impressions shine right through that alien face with the dark sun glasses and leather jacket. And always remember to beware of the 'doughnut of death!' Keep in mind the number one rule of this movie - suspension of disbelief - sit back and relax - and 'Prepare to die Earth Scum!' You just have to see it for yourself.""",
             """When Ritchie first burst on to movie scene his films were hailed as funny, witty, well directed and original. If one could compare the hype he had generated with his first two attempts and the almost universal loathing his last two outings have created one should consider - has Ritchie been found out? Is he really that talented? Does he really have any genuine original ideas? Or is he simply a pretentious and egotistical director who really wants to be Fincher, Tarantino and Leone all rolled into one colossal and disorganised heap? After watching Revolver one could be excused for thinking were did it all go wrong? What happened to his great sense of humour? Where did he get all these mixed and convoluted ideas from? Revolver tries to be clever, philosophical and succinct, it tries to be an intelligent psychoanalysis, it tries to be an intricate and complicated thriller. Ritchie does make a gargantuan effort to fulfil all these many objectives and invests great chunks of a script into existential musings and numerous plot twists. However, in the end all it serves is to construct a severely disjointed, unstructured and ultimately unfriendly film to the audience. Its plagiarism is so sinful and blatant that although Ritchie does at least attempt to give his own spin he should be punished for even trying to pass it off as his own work. So what the audience gets ultimately is a terrible screenplay intertwined with many pretentious oneliners and clumsy setpieces.<br /><br />Revolver is ultimately an unoriginal and bland movie that has stolen countless themes from masterpieces like Fight Club, Usual Suspects and Pulp Fiction. It aims high, but inevitably shots blanks aplenty.<br /><br />Revolver deserves to be lambasted, it is a truly poor film masquerading as a wannabe masterpiece from a wannabe auteur. However, it falls flat on its farcical face and just fails at everything it wants to be and achieve.""",
             """I always thought this would be a long and boring Talking-Heads flick full of static interior takes, dude, I was wrong. "Election" is a highly fascinating and thoroughly captivating thriller-drama, taking a deep and realistic view behind the origins of Triads-Rituals. Characters are constantly on the move, and although as a viewer you kinda always remain an outsider, it\'s still possible to feel the suspense coming from certain decisions and ambitions of the characters. Furthermore Johnnie To succeeds in creating some truly opulent images due to meticulously composed lighting and atmospheric light-shadow contrasts. Although there\'s hardly any action, the ending is still shocking in it\'s ruthless depicting of brutality. Cool movie that deserves more attention, and I came to like the minimalistic acoustic guitar score quite a bit.""",
             """This is to the Zatoichi movies as the "Star Trek" movies were to "Star Trek"--except that in this case every one of the originals was more entertaining and interesting than this big, shiny re-do, and also better made, if substance is more important than surface. Had I never seen them, I would have thought this good-looking but empty; since I had, I thought its style inappropriate and its content insufficient. The idea of reviving the character in a bigger, slicker production must have sounded good, but there was no point in it, other than the hope of making money; it\'s just a show, which mostly fails to capture the atmosphere of the character\'s world and wholly fails to take the character anywhere he hasn\'t been already (also, the actor wasn\'t at his best). I\'d been hoping to see Ichi at a late stage of life, in a story that would see him out gracefully and draw some conclusion from his experience overall; this just rehashes bits and pieces from the other movies, seasoned with more sex and sfx violence. Not the same experience at all."""
             ]
elif  MODEL_NAME=='sentimentdl_use_twitter':
  text_list = [
            """@Mbjthegreat i really dont want AT&amp;T phone service..they suck when it comes to having a signal""",
            """holy crap. I take a nap for 4 hours and Pitchfork blows up my twitter dashboard. I wish I was at Coachella.""",
            """@Susy412 he is working today  ive tried that still not working..... hmmmm!! im rubbish with computers haha!""",
            """Brand New Canon EOS 50D 15MP DSLR Camera Canon 17-85mm IS Lens ...: Web Technology Thread, Brand New Canon EOS 5.. http://u.mavrev.com/5a3t""",
            """Watching a programme about the life of Hitler, its only enhancing my geekiness of history.""",
            """GM says expects announcment on sale of Hummer soon - Reuters: WDSUGM says expects announcment on sale of Hummer .. http://bit.ly/4E1Fv""",
            """@accannis @edog1203 Great Stanford course. Thanks for making it available to the public! Really helpful and informative for starting off!""",
            """@the_real_usher LeBron is cool.  I like his personality...he has good character.""",
            """@sketchbug Lebron is a hometown hero to me, lol I love the Lakers but let's go Cavs, lol""",
            """@PDubyaD right!!! LOL we'll get there!! I have high expectations, Warren Buffet style.""",
            ]


## 5. Define Spark NLP pipeline

In [8]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")


sentimentdl = SentimentDLModel.pretrained(name=MODEL_NAME, lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

nlpPipeline = Pipeline(
      stages = [
          documentAssembler,
          use,
          sentimentdl
      ])


tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
Download done! Loading the resource.

2022-10-31 19:50:10.528418: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


2022-10-31 19:50:15.140513: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 60236800 exceeds 10% of free system memory.
2022-10-31 19:50:15.187226: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 60236800 exceeds 10% of free system memory.
2022-10-31 19:50:15.234789: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 60236800 exceeds 10% of free system memory.
2022-10-31 19:50:15.313565: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 60236800 exceeds 10% of free system memory.
2022-10-31 19:50:15.362942: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 60236800 exceeds 10% of free system memory.


[OK!]
sentimentdl_use_twitter download started this may take some time.
Approximate size to download 11.4 MB
Download done! Loading the resource.
[OK!]


## 6. Run the pipeline

In [9]:
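# Every stage is pretrained, so fit() learns nothing here; fitting on an
# empty DataFrame just turns the Pipeline into a PipelineModel.
empty_df = spark.createDataFrame([['']]).toDF("text")
pipelineModel = nlpPipeline.fit(empty_df)

# Score the sample texts selected above.
df = spark.createDataFrame(pd.DataFrame({"text":text_list}))
result = pipelineModel.transform(df)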

## 7. Visualize results

Review the schema and show some rows of data.

In [10]:
result.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContains

In [11]:
result.show(5)

                                                                                

+--------------------+--------------------+--------------------+--------------------+
|                text|            document| sentence_embeddings|           sentiment|
+--------------------+--------------------+--------------------+--------------------+
|@Mbjthegreat i re...|[[document, 0, 97...|[[sentence_embedd...|[[category, 0, 97...|
|holy crap. I take...|[[document, 0, 10...|[[sentence_embedd...|[[category, 0, 10...|
|@Susy412 he is wo...|[[document, 0, 10...|[[sentence_embedd...|[[category, 0, 10...|
|Brand New Canon E...|[[document, 0, 13...|[[sentence_embedd...|[[category, 0, 13...|
|Watching a progra...|[[document, 0, 89...|[[sentence_embedd...|[[category, 0, 89...|
+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows



The annotation columns are nested arrays of structs, so we pull out only the pieces we care about: the text and the sentiment result. This cell does that:
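To see what selecting `sentiment.result` and exploding it does, here is a plain-Python sketch that mimics the operation on ordinary lists and dictionaries. The rows are made-up examples trimmed to the fields that matter, and `explode_results` is a hypothetical helper, not a Spark function.

```python
# Rows shaped like the pipeline output: `sentiment` holds a list of
# annotation records (made-up values, trimmed to a few fields).
rows = [
    {"text": "great course, thanks!",
     "sentiment": [{"annotatorType": "category", "begin": 0, "end": 20,
                    "result": "positive"}]},
    {"text": "no signal again",
     "sentiment": [{"annotatorType": "category", "begin": 0, "end": 14,
                    "result": "negative"}]},
    {"text": "", "sentiment": []},  # a row with no annotations
]


def explode_results(rows, column="sentiment"):
    """Yield one (text, result) pair per annotation in `column`."""
    for row in rows:
        # Like selecting 'sentiment.result': keep only the `result` field.
        results = [annotation["result"] for annotation in row[column]]
        # Like F.explode: one output row per list element.
        for result in results:
            yield (row["text"], result)


flat = list(explode_results(rows))
for text, label in flat:
    print(f"{label:8s} | {text}")
```

As with Spark's `explode`, a row whose list is empty contributes no output rows, which is worth remembering when row counts before and after do not match.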

In [12]:
result.select('text', F.explode('sentiment.result')).show(5, truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------+--------+
|text                                                                                                                                       |col     |
+-------------------------------------------------------------------------------------------------------------------------------------------+--------+
|@Mbjthegreat i really dont want AT&amp;T phone service..they suck when it comes to having a signal                                         |negative|
|holy crap. I take a nap for 4 hours and Pitchfork blows up my twitter dashboard. I wish I was at Coachella.                                |negative|
|@Susy412 he is working today  ive tried that still not working..... hmmmm!! im rubbish with computers haha!                                |negative|
|Brand New Canon EOS 50D 15MP DSLR Camera Canon 17-85mm IS Lens ...: Web Technology Thread, Br

## Building Sentiment Model for News

We will use a news summarization dataset from [Kaggle](https://www.kaggle.com/datasets/sbhatti/news-summarization). The data has been converted from a zipped CSV file into a multi-part Parquet file. Load the data into your environment by following the standard steps:
    
- Copy data from central bucket to personal bucket
- Read in data to Spark from personal bucket
- Review the structure, size, number of partitions, and show a few rows of data
- Create a `df_small` dataset that contains a sample of the full data using the `sample` [function](https://sparkbyexamples.com/pyspark/pyspark-sampling-example/)
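When sampling without replacement, Spark keeps each row independently with probability `fraction`, so the sample size is close to, but not exactly, `fraction` times the row count. The sketch below imitates that behavior in plain Python. `bernoulli_sample` is a hypothetical helper written for illustration, not Spark's implementation.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10_000))
sample = bernoulli_sample(rows, fraction=0.3, seed=42)

# The size lands near 3,000 but is not guaranteed to equal it.
print(len(sample), "of", len(rows), "rows kept")

# The same seed reproduces the same sample.
assert sample == bernoulli_sample(rows, fraction=0.3, seed=42)
```

Passing a seed is what makes a sample repeatable between runs, which helps when comparing results on `df_small` across sessions.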

In [13]:
!hadoop distcp s3://bigdatateaching/news/summarization/ s3://anly502-fall-2022-yl1353/news/summarization/

2022-10-31 19:52:09,150 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=false, useRdiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=0.0, copyStrategy='uniformsize', preserveStatus=[BLOCKSIZE], atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[s3://bigdatateaching/news/summarization], targetPath=s3://anly502-fall-2022-yl1353/news/summarization, filtersFile='null', blocksPerChunk=0, copyBufferSize=8192, verboseLog=false, directWrite=false}, sourcePaths=[s3://bigdatateaching/news/summarization], targetPathExists=true, preserveRawXattrsfalse
2022-10-31 19:52:09,530 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-38-22.ec2.internal/172.31.38.22:8032
2022-10-31 19:52:09,874 INFO client.AHSProxy: Connecting to Application History server at ip-172-31-38-22.ec2.i

In [14]:
df = spark.read.parquet('s3://anly502-fall-2022-yl1353/news/summarization/')

                                                                                

In [15]:
df.count()

                                                                                

870521

In [16]:
len(df.columns)

5

In [17]:
df.rdd.getNumPartitions()

52

In [18]:
df.show(5)



+--------------------+--------------------+--------------------+--------------+-------------------+
|                  ID|             Content|             Summary|       Dataset|__null_dask_index__|
+--------------------+--------------------+--------------------+--------------+-------------------+
|f49ee725a0360aa68...|New York police a...|Police have inves...|CNN/Daily Mail|                  0|
|808fe317a53fbd313...|By . Ryan Lipman ...|Porn star Angela ...|CNN/Daily Mail|                  1|
|98fd67bd343e58bc4...|This was, Sergio ...|American draws in...|CNN/Daily Mail|                  2|
|e12b5bd7056287049...|An Ebola outbreak...|World Health Orga...|CNN/Daily Mail|                  3|
|b83e8bcfcd5141984...|By . Associated P...|A sinkhole opened...|CNN/Daily Mail|                  4|
+--------------------+--------------------+--------------------+--------------+-------------------+
only showing top 5 rows



                                                                                

In [19]:
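# Sample without replacement so no article appears twice; the seed makes the sample repeatable.
df_small = df.sample(withReplacement=False, fraction=0.3, seed=42)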

In [20]:
df_small.show(5)

                                                                                

+--------------------+--------------------+--------------------+--------------+-------------------+
|                  ID|             Content|             Summary|       Dataset|__null_dask_index__|
+--------------------+--------------------+--------------------+--------------+-------------------+
|550c7ea14b4ec91db...|An Australian fat...|Zia Abdul Haq is ...|CNN/Daily Mail|                  6|
|85fa186e116866297...|A community stalw...|Rahmat Ali Raja, ...|CNN/Daily Mail|                  8|
|6c9ecd04c8b2bd960...|(CNN) -- Three De...|Police arrested T...|CNN/Daily Mail|                 13|
|872dc5f31eedcb249...|Queens Park Range...|Hull and Stoke ar...|CNN/Daily Mail|                 16|
|b8e14b9ce93a52020...|Scathing: A repor...|Sir Cliff first s...|CNN/Daily Mail|                 17|
+--------------------+--------------------+--------------------+--------------+-------------------+
only showing top 5 rows



### Construct four dummy variables using pyspark and the regex function `rlike`. You will make each dummy variable using the regex statement provided below:

|dummy | regex|
|-----------|-----------|
|politics|**(?i)politics\|(?i)political\|(?i)senate\|(?i)government\|(?i)president\|(?i)prime minister\|(?i)congress**|
|sports|**(?i)sport\|(?i)ball\|(?i)coach\|(?i)goal\|(?i)baseball\|(?i)football\|(?i)basketball**|
|arts|**(?i)art\|(?i)painting\|(?i)artist\|(?i)museum\|(?i)photography\|(?i)sculpture**|
|history|**(?i)history\|(?i)historical\|(?i)ancient\|(?i)archaeology\|(?i)heritage\|(?i)fossil**|
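`rlike` is true whenever the pattern is found anywhere in the string, so short alternatives such as `art` also match inside longer words. The sketch below shows the effect with Python's `re` module on made-up headlines. It uses a simplified form of the `arts` pattern with `re.IGNORECASE` in place of the repeated `(?i)`, and `ARTS_WORDS` is a hypothetical stricter variant that is not part of the lab.

```python
import re

# Simplified form of the `arts` pattern from the table above.
ARTS = re.compile(
    r"art|painting|artist|museum|photography|sculpture", re.IGNORECASE
)

# Hypothetical stricter variant: the same terms, matched as whole words only.
ARTS_WORDS = re.compile(
    r"\b(?:art|painting|artist|museum|photography|sculpture)\b", re.IGNORECASE
)

# Made-up headlines for illustration.
headlines = [
    "The party leader started the debate",
    "Museum opens new sculpture wing",
    "Stocks fall on Monday",
]

for headline in headlines:
    substring_hit = bool(ARTS.search(headline))     # behaves like rlike
    whole_word_hit = bool(ARTS_WORDS.search(headline))
    print(f"{substring_hit!s:5} {whole_word_hit!s:5} {headline}")
```

The first headline is flagged by the substring pattern because of "party" and "started", which is one reason the `arts` dummy can be true for articles that are not about art.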

In [27]:
from pyspark.sql import functions
from pyspark.sql.functions import when  # explicit import instead of `*`, which shadows Python builtins such as `sum` and `max`

In [31]:
df = df.withColumn("politics", when(df.Content.rlike("(?i)politics|(?i)political|(?i)senate|(?i)government|(?i)president|(?i)prime minister|(?i)congress"),True)
                   .otherwise(False))

In [32]:
df = df.withColumn("sports", when(df.Content.rlike("(?i)sport|(?i)ball|(?i)coach|(?i)goal|(?i)baseball|(?i)football|(?i)basketball"),True)
                   .otherwise(False))

In [33]:
df = df.withColumn("arts", when(df.Content.rlike("(?i)art|(?i)painting|(?i)artist|(?i)museum|(?i)photography|(?i)sculpture"),True)
                   .otherwise(False))

In [34]:
df = df.withColumn("history", when(df.Content.rlike("(?i)history|(?i)historical|(?i)ancient|(?i)archaeology|(?i)heritage|(?i)fossil"),True)
                   .otherwise(False))

In [35]:
df.show(5)



+--------------------+--------------------+--------------------+--------------+-------------------+--------+------+-----+-------+
|                  ID|             Content|             Summary|       Dataset|__null_dask_index__|politics|sports| arts|history|
+--------------------+--------------------+--------------------+--------------+-------------------+--------+------+-----+-------+
|f49ee725a0360aa68...|New York police a...|Police have inves...|CNN/Daily Mail|                  0|   false|  true|false|  false|
|808fe317a53fbd313...|By . Ryan Lipman ...|Porn star Angela ...|CNN/Daily Mail|                  1|   false| false| true|  false|
|98fd67bd343e58bc4...|This was, Sergio ...|American draws in...|CNN/Daily Mail|                  2|   false|  true| true|  false|
|e12b5bd7056287049...|An Ebola outbreak...|World Health Orga...|CNN/Daily Mail|                  3|    true| false| true|  false|
|b83e8bcfcd5141984...|By . Associated P...|A sinkhole opened...|CNN/Daily Mail|           

                                                                                

### Show counts of each dummy variable in the dataset. Save the result for the dummy count of `arts` to the variable name specified

In [36]:
politics_count = df.groupby("politics").count()

In [37]:
sports_count = df.groupby("sports").count()

In [38]:
arts_count = df.groupby("arts").count()

In [39]:
df_art_soln = arts_count.toPandas().to_dict()

                                                                                

In [40]:
history_count = df.groupby("history").count()

### Build a SparkNLP Pipeline to construct positive/negative sentiment

Adapt the previous example to build the sentiment model for this dataset. Remember to start with the smaller dataset (`df_small`) before moving on to your full dataset!
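Adapting the pipeline mostly comes down to column wiring: each stage reads columns that must already exist, and the first stage now has to read `Content` rather than `text`. The sketch below checks that wiring in plain Python. `check_wiring` and the stage tuples are a hypothetical illustration, not a Spark NLP API.

```python
# (stage name, input columns, output column) for the three stages in this lab.
STAGES = [
    ("DocumentAssembler", ["Content"], "document"),
    ("UniversalSentenceEncoder", ["document"], "sentence_embeddings"),
    ("SentimentDLModel", ["sentence_embeddings"], "sentiment"),
]


def check_wiring(df_columns, stages):
    """Return the columns available after all stages, or raise on a missing input."""
    available = list(df_columns)
    for name, inputs, output in stages:
        missing = [col for col in inputs if col not in available]
        if missing:
            raise ValueError(f"{name} needs missing column(s): {missing}")
        available.append(output)
    return available


# Columns of the news DataFrame as shown by df.show().
news_columns = ["ID", "Content", "Summary", "Dataset", "__null_dask_index__"]
print(check_wiring(news_columns, STAGES))
```

Leaving the first stage pointed at `text` would raise here, which mirrors the error Spark reports when an input column does not exist.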

In [44]:
documentAssembler = DocumentAssembler()\
    .setInputCol("Content")\
    .setOutputCol("document")
    
use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")


sentimentdl = SentimentDLModel.pretrained(name=MODEL_NAME, lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

nlpPipeline = Pipeline(
      stages = [
          documentAssembler,
          use,
          sentimentdl
      ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentimentdl_use_twitter download started this may take some time.
Approximate size to download 11.4 MB
[OK!]


In [49]:
#empty_df_s = spark.createDataFrame([['']]).toDF("Content")
pipelineModel_df = nlpPipeline.fit(df)

In [50]:
result_df = pipelineModel_df.transform(df)

In [65]:
sentiment_df = result_df.select('Content', functions.explode('sentiment.result'))
sentiment_df.show(5)



+--------------------+--------+
|             Content|     col|
+--------------------+--------+
|New York police a...|negative|
|By . Ryan Lipman ...|negative|
|This was, Sergio ...|negative|
|An Ebola outbreak...|negative|
|By . Associated P...|negative|
+--------------------+--------+
only showing top 5 rows



                                                                                

### Pull out the sentiment output into its own column in the main dataframe. Create a new dataframe that includes sentiment, news source, and your four dummy variables.

In [84]:
df_new = df.join(sentiment_df, df.Content == sentiment_df.Content,"left")

In [85]:
df_new = df_new.drop(sentiment_df.Content)\
                .withColumnRenamed('col','sentiment')\
                .withColumnRenamed('Dataset','news_source')\
                .select('Content','sentiment','news_source','politics','sports','arts','history')

In [86]:
df_new.cache()

DataFrame[Content: string, sentiment: string, news_source: string, politics: boolean, sports: boolean, arts: boolean, history: boolean]

In [None]:
df_new.show(5)



### Create a summary table of the count of articles grouped by your `politics` dummy variable, news source `news_source`, and sentiment classification `sentiment`. Save the resulting dataframe into a variable called `df_sent_baseline`, similar to the previous step for saving the Pandas dataframe.
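Grouping by three columns produces one count per distinct combination of their values. The sketch below shows the same idea in plain Python with `collections.Counter`. The records, including the source name "Example Wire", are made up for illustration.

```python
from collections import Counter

# Made-up records with the three grouping columns.
records = [
    {"politics": True, "news_source": "CNN/Daily Mail", "sentiment": "negative"},
    {"politics": True, "news_source": "CNN/Daily Mail", "sentiment": "negative"},
    {"politics": False, "news_source": "CNN/Daily Mail", "sentiment": "positive"},
    {"politics": True, "news_source": "Example Wire", "sentiment": "positive"},
]

# One counter key per (politics, news_source, sentiment) combination.
counts = Counter(
    (r["politics"], r["news_source"], r["sentiment"]) for r in records
)

for (politics, source, sentiment), n in sorted(counts.items()):
    print(politics, source, sentiment, n)
```

The summary table should therefore have four columns: the three grouping keys plus `count`.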

In [82]:
table_summary = df_new.groupby('politics','news_source','sentiment').count()

In [83]:
table_summary.show(5)

+--------------------+---------+-----+
|         news_source|sentiment|count|
+--------------------+---------+-----+
|
  
 A crowdsourc...| negative|    1|
|"All Americans sh...| negative|    1|
|"I am incredibly ...| negative|    1|
|"I love Ireland,"...| positive|    1|
|"It cannot be rig...| negative|    1|
+--------------------+---------+-----+
only showing top 5 rows



In [None]:
df_sent_baseline = table_summary.toPandas().to_dict()

## **Save your analytics results to a json object - then add, commit, and push your notebook and json to GitHub!**

In [None]:
import json

# Write both results to disk; the `with` block closes the file once the dump finishes.
with open('lab-soln.json', 'w') as fp:
    json.dump({'df_arts_count': df_art_soln,
               'df_sentiment_baseline': df_sent_baseline},
              fp)
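One detail to keep in mind when reading the file back: `to_dict()` returns a mapping of column name to `{row index: value}`, and JSON stores the integer row indexes as strings. The round trip below is a small sketch of that, using made-up counts and a temporary file rather than the real `lab-soln.json`.

```python
import json
import os
import tempfile

# Shape produced by to_dict() on a two-row count table:
# {column -> {row index -> value}}. The counts are made-up placeholders.
arts_counts = {"arts": {0: True, 1: False}, "count": {0: 3, 1: 7}}

path = os.path.join(tempfile.mkdtemp(), "lab-soln-demo.json")

# Write the dictionary, closing the file when the block ends.
with open(path, "w") as fp:
    json.dump({"df_arts_count": arts_counts}, fp)

# Read it back and inspect the keys.
with open(path) as fp:
    loaded = json.load(fp)

print(loaded["df_arts_count"]["count"])  # integer row indexes come back as strings
```

Anything that later compares the saved JSON with a fresh `to_dict()` result needs to account for that change in key type.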

## STOP YOUR CLUSTER!!!

In [None]:
spark.stop()