**Natural Language Processing (NLP) using Spark and Pandas**

I created this notebook as a short demonstration of the Spark NLP library.
I found this exercise really fun and beginner-friendly.

We are going to analyze a dataset of Reddit comments and find the most common words.
Note that this dataset is very small and is only meant for practice, but you can apply the same methods to other datasets too.

To use this notebook you need to install:
* pyspark
* spark-nlp
* pandas

You can install them by running the following code in a Jupyter Notebook:


In [1]:
!pip install pyspark
!pip install spark-nlp==2.0.1
!pip install pandas

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/f2/64/a1df4440483df47381bbbf6a03119ef66515cf2e1a766d9369811575454b/pyspark-2.4.1.tar.gz (215.7MB)
Collecting py4j==0.10.7 (from pyspark)
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Stored in directory: /tmp/.cache/pip/wheels/47/9b/57/7984bf19763749a13eece44c3174adb6ae4bc95b920375

Import the `pandas` library and set the maximum column width to 800 so that long comments are not truncated when displayed.

In [2]:
import pandas as pd
pd.set_option('max_colwidth', 800)

Let's create a `SparkSession`. We're going to declare the Spark NLP package so that we can use the library and count the most common words in our dataset.

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.8.2") \
        .getOrCreate()

Declare a path variable and read the CSV files with the `SparkSession` created above.

Use the *csv* format and set the *header* option to true.

In [4]:
path = '../input/*.csv'
df = spark.read.format('csv').option('header', 'true').load(path)
df.limit(5).toPandas()

Unnamed: 0,Author,Comment,Score,ID
0,MuffinMedic,This sounds interesting! By any chance is the bot open source? I'd be interested in running this locally and collecting some data.,,
1,Also,"have you compared this to or looked into the Perspective API at all?""",3.0,ek6kzos
2,reseph,"You may want to get in touch with https://civilservant.io/ too, just to inform them about this neat thing. AI Moderation was one of the topics discussed at the summit.",2.0,ek6lqbn
3,shaggorama,"""Define """"bad comments""""""",2.0,ek6mled
4,FreeSpeechWarrior,If this is trained on a per subreddit basis I'd be interested in using this in a report/modmail only mod on r/WatchRedditDie and r/subredditcancer,,
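
A side note on the table above: the first comment appears to have been split across two rows (the second row's author is `Also`), which is what happens when a quoted field containing line breaks is read one line at a time. The plain-Python sketch below reproduces the effect with the standard `csv` module on a small made-up file; the authors `alice` and `bob` and all their values are hypothetical. Spark's CSV reader has a `multiLine` option that may help in this situation, but whether it fixes this particular dataset is untested here.

```python
import csv
import io

# Hypothetical two-record file; the first comment has line breaks inside quotes.
raw = (
    'Author,Comment,Score,ID\n'
    'alice,"First line.\n\nAlso, a second line.",3.0,abc123\n'
    'bob,Short comment,2.0,def456\n'
)

# Naive line-by-line splitting breaks the quoted comment into separate rows.
naive_rows = [line.split(',') for line in raw.splitlines()[1:] if line]

# csv.reader honours the quotes and keeps the comment as a single field.
parsed_rows = list(csv.reader(io.StringIO(raw)))[1:]

print(len(naive_rows), "rows when splitting line by line")
print(len(parsed_rows), "rows when the quotes are respected")
```

In the naive version the text after the line break starts a new row and its first word ends up in the author column, just like the `Also` row above.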


Our objective with this project is to count the most common words, so we don't want null comments.

Let's filter out all rows where the comment column is null.

In [5]:
df = df.filter('comment is not null')

I'm going to create a new DataFrame using the `explode` and `split` functions of `pyspark`.

The purpose is to create a new column called `word`, which will contain every word of our comments, split on whitespace.

In [6]:
from pyspark.sql.functions import split, explode, desc

dfWords = df.select(explode(split('comment', '\\s+')).alias('word')) \
                    .groupBy('word').count().orderBy(desc('word'))

dfWords.printSchema()

root
 |-- word: string (nullable = true)
 |-- count: long (nullable = false)



In [7]:
dfWords.orderBy(desc('count')).limit(5).toPandas()

Unnamed: 0,word,count
0,the,266
1,to,188
2,a,167
3,I,145
4,,139


Our new DataFrame doesn't look very good. As you can see, it contains blank rows and very common words such as articles, prepositions and pronouns.
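
One plausible source of the blank entry is leading or trailing whitespace in a comment: splitting on `\s+` then produces an empty string. The sketch below shows this in plain Python with `re.split`, assuming Spark's `split` behaves the same way in this case. The sample comments and the helper `count_words` are hypothetical and only serve as an illustration.

```python
import re
from collections import Counter

# Hypothetical comments; the first one starts with a space.
comments = [
    " Define bad comments",
    "the bot and the API",
]

def count_words(texts, drop_blank=False):
    """Split each text on whitespace and count the resulting words."""
    counts = Counter()
    for text in texts:
        for word in re.split(r"\s+", text):
            if drop_blank and word == "":
                continue  # skip the empty strings produced by edge whitespace
            counts[word] += 1
    return counts

raw_counts = count_words(comments)
clean_counts = count_words(comments, drop_blank=True)

print(raw_counts[""])        # the blank "word" is counted
print("" in clean_counts)    # it disappears once blanks are dropped
```

Dropping blanks removes the empty row, but common words like "the" still dominate, which is why the next step uses part-of-speech tags.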

Our goal is to count the relevant words in the posts. That's why we are going to use the Spark NLP library, which will tag every word in the dataset with its part of speech (noun, pronoun, verb, etc.).

In [8]:
from com.johnsnowlabs.nlp.pretrained.pipeline.en import BasicPipeline as bp

dfAnnotated = bp.annotate(df, 'comment')
dfAnnotated.printSchema()

root
 |-- Author: string (nullable = true)
 |-- text: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- ID: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 

* `text` is the original text from the comment column.
* `pos.metadata` contains a key/value pair for every word.
* `pos.result` is an array with one tag for every word in the comment.

Here is the list of part-of-speech tags (the Penn Treebank tag set): https://cs.nyu.edu/grishman/jet/guide/PennPOS.html
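
To make the tags easier to read, here is a small sketch that maps a few of them to their meaning. The dictionary `PENN_TAGS` covers only a subset of the Penn Treebank tag set, and both it and the helper `describe` are hypothetical names used for illustration. The sample words and tags are taken from the "Define bad comments" row of the output.

```python
# Partial lookup table for Penn Treebank part-of-speech tags (subset only).
PENN_TAGS = {
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "NNPS": "proper noun, plural",
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VB": "verb, base form",
    "VBZ": "verb, 3rd person singular present",
}

def describe(words, tags):
    """Pair every word with its tag and a readable description of the tag."""
    return [(w, t, PENN_TAGS.get(t, "unknown tag")) for w, t in zip(words, tags)]

words = ["Define", "bad", "comments"]
tags = ["NNP", "JJ", "NNS"]

described = describe(words, tags)
for word, tag, meaning in described:
    print(f"{word}: {tag} ({meaning})")
```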


In [9]:
dfPos = dfAnnotated.select("text", "pos.metadata", "pos.result")
dfPos.limit(5).toPandas()

Unnamed: 0,text,metadata,result
0,This sounds interesting! By any chance is the bot open source? I'd be interested in running this locally and collecting some data.,"[{'word': 'This'}, {'word': 'sounds'}, {'word': 'interesting'}, {'word': 'By'}, {'word': 'any'}, {'word': 'chance'}, {'word': 'is'}, {'word': 'the'}, {'word': 'bot'}, {'word': 'open'}, {'word': 'source'}, {'word': 'I'}, {'word': 'd'}, {'word': 'be'}, {'word': 'interested'}, {'word': 'in'}, {'word': 'running'}, {'word': 'this'}, {'word': 'locally'}, {'word': 'and'}, {'word': 'collecting'}, {'word': 'some'}, {'word': 'data'}]","[DT, VBZ, JJ, IN, DT, NN, VBZ, DT, NN, JJ, NN, PRP, SYM, VB, VBN, IN, VBG, DT, RB, CC, VBG, DT, NNS]"
1,"have you compared this to or looked into the Perspective API at all?""","[{'word': 'have'}, {'word': 'you'}, {'word': 'compared'}, {'word': 'this'}, {'word': 'to'}, {'word': 'or'}, {'word': 'looked'}, {'word': 'into'}, {'word': 'the'}, {'word': 'Perspective'}, {'word': 'API'}, {'word': 'at'}, {'word': 'all'}]","[VBP, PRP, VBD, DT, TO, CC, VBD, IN, DT, NNP, NNP, IN, DT]"
2,"You may want to get in touch with https://civilservant.io/ too, just to inform them about this neat thing. AI Moderation was one of the topics discussed at the summit.","[{'word': 'You'}, {'word': 'may'}, {'word': 'want'}, {'word': 'to'}, {'word': 'get'}, {'word': 'in'}, {'word': 'touch'}, {'word': 'with'}, {'word': 'httpscivilservantio'}, {'word': 'too'}, {'word': 'just'}, {'word': 'to'}, {'word': 'inform'}, {'word': 'them'}, {'word': 'about'}, {'word': 'this'}, {'word': 'neat'}, {'word': 'thing'}, {'word': 'AI'}, {'word': 'Moderation'}, {'word': 'was'}, {'word': 'one'}, {'word': 'of'}, {'word': 'the'}, {'word': 'topics'}, {'word': 'discussed'}, {'word': 'at'}, {'word': 'the'}, {'word': 'summit'}]","[PRP, MD, VB, TO, VB, IN, NN, IN, NN, RB, RB, TO, VB, PRP, IN, DT, JJ, NN, NNP, NNP, VBD, CD, IN, DT, NNS, VBD, IN, DT, NN]"
3,"""Define """"bad comments""""""","[{'word': 'Define'}, {'word': 'bad'}, {'word': 'comments'}]","[NNP, JJ, NNS]"
4,If this is trained on a per subreddit basis I'd be interested in using this in a report/modmail only mod on r/WatchRedditDie and r/subredditcancer,"[{'word': 'If'}, {'word': 'this'}, {'word': 'is'}, {'word': 'trained'}, {'word': 'on'}, {'word': 'a'}, {'word': 'per'}, {'word': 'subreddit'}, {'word': 'basis'}, {'word': 'I'}, {'word': 'd'}, {'word': 'be'}, {'word': 'interested'}, {'word': 'in'}, {'word': 'using'}, {'word': 'this'}, {'word': 'in'}, {'word': 'a'}, {'word': 'reportmodmail'}, {'word': 'only'}, {'word': 'mod'}, {'word': 'on'}, {'word': 'rWatchRedditDie'}, {'word': 'and'}, {'word': 'rsubredditcancer'}]","[IN, DT, VBZ, VBN, IN, DT, IN, NN, NN, PRP, SYM, VB, VBN, IN, VBG, DT, IN, DT, NN, RB, NN, IN, NN, CC, NN]"


Let's create a new DataFrame with one row per element of the `pos` array, so that each row holds a single `pos` struct.

In [10]:
dfSplitPos = dfAnnotated.select(explode("pos").alias("pos"))
dfSplitPos.limit(5).toPandas()

Unnamed: 0,pos
0,"(pos, 0, 3, DT, {'word': 'This'})"
1,"(pos, 5, 10, VBZ, {'word': 'sounds'})"
2,"(pos, 12, 22, JJ, {'word': 'interesting'})"
3,"(pos, 25, 26, IN, {'word': 'By'})"
4,"(pos, 28, 30, DT, {'word': 'any'})"
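
If `explode` is new to you, the following plain-Python sketch shows the idea: a list of lists (one inner list per comment) is flattened into one record per annotation. The tuples mimic the `pos` struct, with values copied from the outputs in this notebook; the variable names are hypothetical.

```python
from itertools import chain

# Each inner list stands for the `pos` array of one comment.
# Each tuple mirrors (annotatorType, begin, end, result, metadata).
pos_per_comment = [
    [
        ("pos", 0, 3, "DT", {"word": "This"}),
        ("pos", 5, 10, "VBZ", {"word": "sounds"}),
    ],
    [
        ("pos", 1, 6, "NNP", {"word": "Define"}),
    ],
]

# "Exploding" means turning every element of every array into its own row.
exploded = list(chain.from_iterable(pos_per_comment))

for row in exploded:
    print(row)
```

Two comments with two and one annotations become three rows, which is the shape the filter in the next step needs.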


I want to count every word tagged NNP or NNPS, which mean:
* NNP: proper noun, singular
* NNPS: proper noun, plural


In [11]:
NNPFilter = "pos.result = 'NNP' or pos.result = 'NNPS'"
dfNNPFilter = dfSplitPos.filter(NNPFilter)
dfNNPFilter.limit(10).toPandas()

Unnamed: 0,pos
0,"(pos, 45, 55, NNP, {'word': 'Perspective'})"
1,"(pos, 57, 59, NNP, {'word': 'API'})"
2,"(pos, 107, 108, NNP, {'word': 'AI'})"
3,"(pos, 110, 119, NNP, {'word': 'Moderation'})"
4,"(pos, 1, 6, NNP, {'word': 'Define'})"
5,"(pos, 15, 29, NNP, {'word': 'CivilServantio'})"
6,"(pos, 0, 5, NNP, {'word': 'GitHub'})"
7,"(pos, 0, 7, NNP, {'word': 'RemindMe'})"
8,"(pos, 0, 2, NNP, {'word': 'Atm'})"
9,"(pos, 182, 182, NNP, {'word': 'D'})"
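
The same filter can be sketched in plain Python with a set of accepted tags. Penn Treebank tags are written in upper case and the string comparison is case-sensitive, so a lower-case variant of a tag will not match. The sample annotations are copied from the outputs in this notebook, and `PROPER_NOUN_TAGS` is a hypothetical name.

```python
# Tags we want to keep: singular and plural proper nouns.
PROPER_NOUN_TAGS = {"NNP", "NNPS"}

# Tuples mirror the `pos` struct: (annotatorType, begin, end, result, metadata).
annotations = [
    ("pos", 0, 3, "DT", {"word": "This"}),
    ("pos", 45, 55, "NNP", {"word": "Perspective"}),
    ("pos", 57, 59, "NNP", {"word": "API"}),
    ("pos", 12, 22, "JJ", {"word": "interesting"}),
]

# Keep only the annotations whose tag (index 3) is a proper-noun tag.
proper_nouns = [a for a in annotations if a[3] in PROPER_NOUN_TAGS]

print([a[4]["word"] for a in proper_nouns])
print("NNPs" in PROPER_NOUN_TAGS)  # the comparison is case-sensitive
```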


I'm going to use the `selectExpr` function to create a new DataFrame with *word* and *tag* columns.

In [12]:
dfWordTag = dfNNPFilter.selectExpr("pos.metadata['word'] as word", "pos.result as tag")
dfWordTag.limit(10).toPandas()

Unnamed: 0,word,tag
0,Perspective,NNP
1,API,NNP
2,AI,NNP
3,Moderation,NNP
4,Define,NNP
5,CivilServantio,NNP
6,GitHub,NNP
7,RemindMe,NNP
8,Atm,NNP
9,D,NNP


Finally, our DataFrame has the shape we want, and we can start counting the most common words.

In [13]:
dfCountWords = dfWordTag.groupBy('word').count().orderBy(desc('count'))
dfCountWords.limit(20).toPandas()

Unnamed: 0,word,count
0,Reddit,15
1,PRAW,11
2,i,9
3,API,8
4,JSON,4
5,Apollo,4
6,RemindMe,3
7,HTML,2
8,GitHub,2
9,JSAPI,2
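
The result still contains a few tokens that are probably tagging mistakes, such as the single letter `i`. One possible cleanup, sketched here in plain Python, is to drop single-character tokens before counting. This is only an assumption about what counts as noise, and the word list below is a hypothetical sample, not the real dataset.

```python
from collections import Counter

# Hypothetical sample of words that were tagged as proper nouns.
tagged_words = ["Reddit", "PRAW", "i", "Reddit", "D", "API", "PRAW", "Reddit"]

# Drop single-character tokens, which are unlikely to be real proper nouns.
filtered = [w for w in tagged_words if len(w) > 1]

# Count the remaining words and keep the two most frequent ones.
top = Counter(filtered).most_common(2)
print(top)
```

A length threshold is a blunt tool, so it is worth checking what gets removed before applying it to a larger dataset.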


Our DataFrame doesn't tell us much because the dataset is quite small. The idea is to apply these methods to other projects; this notebook is just for practice and for discovering what you can do with the Spark NLP library.

Please feel free to let me know your thoughts about this and what I can do better in the next exercise.

You can reach me on Medium or GitHub:

* https://github.com/kennycontreras
* https://medium.com/@kennycontreras