<a href="https://colab.research.google.com/github/kaganilter/SPARK-NLP/blob/main/kagan_spark_nlp1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install PySpark and Spark NLP
! pip install --ignore-installed -q pyspark==2.4.4
! pip install --ignore-installed -q spark-nlp==2.6.3-rc1

openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-8u275-b01-0ubuntu1~18.04-b01)
OpenJDK 64-Bit Server VM (build 25.275-b01, mixed mode)

After installing Java, PySpark, and Spark NLP, let's import Spark NLP and start a session.

In [2]:
import sparknlp

spark = sparknlp.start()

# Optional parameters: gpu=False, spark23=False (set spark23=True to start with Spark 2.3)

print("Spark NLP version", sparknlp.version())

print("Apache Spark version:", spark.version)

Spark NLP version 2.6.3-rc1
Apache Spark version: 2.4.4


We can choose among different languages, libraries, and pipelines.

In [3]:
from sparknlp.pretrained import PretrainedPipeline


Let's try some of the pretrained pipelines.

In [4]:
pipeline = PretrainedPipeline('explain_document_ml', lang='en')

explain_document_ml download started this may take some time.
Approx size to download 9.4 MB
[OK!]


This pipeline has stages such as a tokenizer, spell checker, lemmatizer, and stemmer. Let's look at them.

In [5]:
pipeline.model.stages


[document_2ec0b742eccd,
 SENTENCE_98fb8e28cb7b,
 REGEX_TOKENIZER_1f63ed636a13,
 SPELL_e4ea67180337,
 LEMMATIZER_c62ad8f355f9,
 STEMMER_75edcc4a9cdb,
 POS_29fd848601e6]
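
Each stage is printed as a name followed by a unique suffix. As a minimal sketch, the readable stage names can be recovered by stripping that suffix. The identifiers below are copied from the output above; the helper `stage_name` is hypothetical, and the assumption that the suffix is always an underscore plus 12 hexadecimal characters is based only on this output.

```python
import re

# Stage identifiers copied from the output above.
stage_ids = [
    "document_2ec0b742eccd",
    "SENTENCE_98fb8e28cb7b",
    "REGEX_TOKENIZER_1f63ed636a13",
    "SPELL_e4ea67180337",
    "LEMMATIZER_c62ad8f355f9",
    "STEMMER_75edcc4a9cdb",
    "POS_29fd848601e6",
]


def stage_name(stage_id: str) -> str:
    """Drop a trailing '_<12 hex chars>' suffix (assumed format) from an identifier."""
    return re.sub(r"_[0-9a-f]{12}$", "", stage_id)


stage_names = [stage_name(s) for s in stage_ids]
print(stage_names)
```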

Let's create our test document.

In [17]:
testdoc=''' this ipsum is placeholder text commonly used in the graphic, print, and publishing industries for previewing layouts and visual mockups. @ether_radio yeah :S i feel all funny cause i haven't slept enough  i woke my mum up cause i was singing she's not impressed :S you? '''

In [18]:
# Alternatively, load the already downloaded pipeline from the local cache
pipeline_local = PretrainedPipeline.from_disk('/root/cache_pretrained/explain_document_ml_en_2.4.0_2.4_1580252705962')

In [19]:
result=pipeline.annotate(testdoc)

In [20]:
result.keys()

dict_keys(['document', 'spell', 'pos', 'lemmas', 'token', 'stems', 'sentence'])

In [21]:
result['token']

['this',
 'ipsum',
 'is',
 'placeholder',
 'text',
 'commonly',
 'used',
 'in',
 'the',
 'graphic',
 ',',
 'print',
 ',',
 'and',
 'publishing',
 'industries',
 'for',
 'previewing',
 'layouts',
 'and',
 'visual',
 'mockups',
 '.',
 '@ether_radio',
 'yeah',
 ':',
 'S',
 'i',
 'feel',
 'all',
 'funny',
 'cause',
 'i',
 "haven't",
 'slept',
 'enough',
 'i',
 'woke',
 'my',
 'mum',
 'up',
 'cause',
 'i',
 'was',
 'singing',
 "she's",
 'not',
 'impressed',
 ':',
 'S',
 'you',
 '?']

In [22]:
result['lemmas']

['this',
 'ipsum',
 'be',
 'placeholder',
 'text',
 'commonly',
 'use',
 'in',
 'the',
 'graphic',
 ',',
 'print',
 ',',
 'and',
 'publish',
 'industry',
 'for',
 'preview',
 'layout',
 'and',
 'visual',
 'mockups',
 '.',
 '@ether_radio',
 'yeah',
 ':',
 'S',
 'i',
 'feel',
 'all',
 'funny',
 'cause',
 'i',
 "haven't",
 'sleep',
 'enough',
 'i',
 'wake',
 'i',
 'mum',
 'up',
 'cause',
 'i',
 'be',
 'sing',
 "she's",
 'not',
 'impressed',
 ':',
 'S',
 'you',
 '?']
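
The token and lemma lists are aligned, so pairing them shows exactly which words the lemmatizer changed. The sketch below uses only the first sentence, copied from the two outputs above; in a live session the lists would come from `result['token']` and `result['lemmas']`.

```python
# First sentence of the outputs above, split back into lists.
tokens = (
    "this ipsum is placeholder text commonly used in the graphic , print , "
    "and publishing industries for previewing layouts and visual mockups ."
).split()
lemmas = (
    "this ipsum be placeholder text commonly use in the graphic , print , "
    "and publish industry for preview layout and visual mockups ."
).split()

# Keep only the positions where the lemma differs from the surface token.
lemma_changes = [(tok, lem) for tok, lem in zip(tokens, lemmas) if tok != lem]

for tok, lem in lemma_changes:
    print(f"{tok} -> {lem}")
```

Note that 'mockups' is left unchanged in the output above, so it does not appear among the changes.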

We can see the part-of-speech (POS) tag of each token by zipping the token and POS lists.

In [24]:
list(zip(result['token'], result['pos']))

[('this', 'DT'),
 ('ipsum', 'NN'),
 ('is', 'VBZ'),
 ('placeholder', 'NN'),
 ('text', 'NN'),
 ('commonly', 'RB'),
 ('used', 'VBD'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('graphic', 'JJ'),
 (',', ','),
 ('print', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('publishing', 'NN'),
 ('industries', 'NNS'),
 ('for', 'IN'),
 ('previewing', 'VBG'),
 ('layouts', 'NNS'),
 ('and', 'CC'),
 ('visual', 'JJ'),
 ('mockups', 'NNS'),
 ('.', '.'),
 ('@ether_radio', 'NN'),
 ('yeah', 'NN'),
 (':', ':'),
 ('S', 'NNP'),
 ('i', 'NNP'),
 ('feel', 'VBP'),
 ('all', 'DT'),
 ('funny', 'JJ'),
 ('cause', 'NN'),
 ('i', 'NNP'),
 ("haven't", 'NN'),
 ('slept', 'VBD'),
 ('enough', 'JJ'),
 ('i', 'NNP'),
 ('woke', 'VBD'),
 ('my', 'PRP$'),
 ('mum', 'JJ'),
 ('up', 'RP'),
 ('cause', 'NN'),
 ('i', 'NNP'),
 ('was', 'VBD'),
 ('singing', 'VBG'),
 ("she's", 'NN'),
 ('not', 'RB'),
 ('impressed', 'VBN'),
 (':', ':'),
 ('S', 'NNP'),
 ('you', 'PRP'),
 ('?', '.')]
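
Once tokens and tags are paired, a quick way to summarize them is to count how often each tag occurs and to group tokens by tag. This is a small sketch using only the standard library; the pairs are the first sentence of the output above.

```python
from collections import Counter, defaultdict

# (token, tag) pairs for the first sentence, copied from the output above.
tagged = [
    ("this", "DT"), ("ipsum", "NN"), ("is", "VBZ"), ("placeholder", "NN"),
    ("text", "NN"), ("commonly", "RB"), ("used", "VBD"), ("in", "IN"),
    ("the", "DT"), ("graphic", "JJ"), (",", ","), ("print", "NN"),
    (",", ","), ("and", "CC"), ("publishing", "NN"), ("industries", "NNS"),
    ("for", "IN"), ("previewing", "VBG"), ("layouts", "NNS"), ("and", "CC"),
    ("visual", "JJ"), ("mockups", "NNS"), (".", "."),
]

# Count how many tokens carry each tag.
tag_counts = Counter(tag for _, tag in tagged)

# Collect the tokens that belong to each tag, in document order.
tokens_by_tag = defaultdict(list)
for token, tag in tagged:
    tokens_by_tag[tag].append(token)

print(tag_counts.most_common(3))
print(tokens_by_tag["NNS"])
```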

We can also show this as a pandas DataFrame.

In [26]:
import pandas as pd
df=pd.DataFrame({'token':result['token'], 'pos':result['pos']})
df.head(10)

         token  pos
0         this   DT
1        ipsum   NN
2           is  VBZ
3  placeholder   NN
4         text   NN
5     commonly   RB
6         used  VBD
7           in   IN
8          the   DT
9      graphic   JJ


Now let's remove the stop words.

In [27]:
stopclean=PretrainedPipeline('clean_stop', lang='en')


clean_stop download started this may take some time.
Approx size to download 12.4 KB
[OK!]


In [28]:
result=stopclean.annotate(testdoc)
result.keys()

dict_keys(['document', 'sentence', 'token', 'cleanTokens'])

In [29]:
result['cleanTokens']

['ipsum',
 'placeholder',
 'text',
 'commonly',
 'graphic',
 ',',
 'print',
 ',',
 'publishing',
 'industries',
 'previewing',
 'layouts',
 'visual',
 'mockups',
 '.',
 '@ether_radio',
 'yeah',
 ':',
 'S',
 'feel',
 'funny',
 'cause',
 "haven't",
 'slept',
 'woke',
 'mum',
 'cause',
 'singing',
 "she's",
 'impressed',
 ':',
 'S',
 '?']
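
To see which words were treated as stop words, the cleaned list can be compared with the original tokens. The sketch below assumes the cleaned list is an in-order subsequence of the token list, which holds for the outputs shown here; `removed_tokens` is a hypothetical helper, and the data is the first sentence of those outputs.

```python
def removed_tokens(tokens, clean_tokens):
    """Return tokens missing from clean_tokens, assuming clean_tokens
    is an in-order subsequence of tokens."""
    removed = []
    i = 0  # position of the next expected cleaned token
    for tok in tokens:
        if i < len(clean_tokens) and tok == clean_tokens[i]:
            i += 1
        else:
            removed.append(tok)
    return removed


# First sentence of the 'token' and 'cleanTokens' outputs.
tokens = (
    "this ipsum is placeholder text commonly used in the graphic , print , "
    "and publishing industries for previewing layouts and visual mockups ."
).split()
clean_tokens = (
    "ipsum placeholder text commonly graphic , print , publishing industries "
    "previewing layouts visual mockups ."
).split()

stop_words_found = removed_tokens(tokens, clean_tokens)
print(stop_words_found)
```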

In [31]:
' '.join(result['cleanTokens'])

"ipsum placeholder text commonly graphic , print , publishing industries previewing layouts visual mockups . @ether_radio yeah : S feel funny cause haven't slept woke mum cause singing she's impressed : S ?"

Next, apply the clean_slang pipeline to normalize the text and expand slang.

In [32]:
clean_slang = PretrainedPipeline('clean_slang', lang='en')

clean_slang download started this may take some time.
Approx size to download 21.8 KB
[OK!]


In [34]:
result=clean_slang.annotate(testdoc)

In [36]:
result.keys()

dict_keys(['document', 'token', 'normal'])

In [38]:
' '.join(result['normal'])

'this ipsum is placeholder text commonly used in the graphic print and publishing industries for previewing layouts and visual mockups etherradio yeah S i feel all funny cause i havent slept enough i woke my mum up cause i was singing shes not impressed S you'

Now run the spell-checking pipeline.

In [39]:
spell_checker = PretrainedPipeline('check_spelling', lang='en')


check_spelling download started this may take some time.
Approx size to download 892.6 KB
[OK!]


In [41]:
result = spell_checker.annotate(testdoc)

In [42]:
result.keys()

dict_keys(['document', 'sentence', 'token', 'checked'])
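
Because the 'token' and 'checked' lists line up one to one, comparing them position by position shows what the spell checker changed. The sketch below uses the token list shown earlier and the checked tokens this pipeline returns for the test document, both copied from this notebook's outputs; it assumes both pipelines tokenize the text the same way, which the matching lengths suggest.

```python
# Original tokens and spell-checked tokens, copied from the notebook outputs.
original_tokens = (
    "this ipsum is placeholder text commonly used in the graphic , print , "
    "and publishing industries for previewing layouts and visual mockups . "
    "@ether_radio yeah : S i feel all funny cause i haven't slept enough "
    "i woke my mum up cause i was singing she's not impressed : S you ?"
).split()
checked_tokens = (
    "this ipsum is placeholder text commonly used in the graphic , print , "
    "and publishing industries for reviewing payouts and visual mockups . "
    "@ether_radio yeah : S i feel all funny cause i haven't slept enough "
    "i woke my mum up cause i was singing sheds not impressed : S you ?"
).split()

# Pairs where the spell checker replaced the original token.
spelling_changes = [
    (orig, new) for orig, new in zip(original_tokens, checked_tokens) if orig != new
]
print(spelling_changes)
```

All three replacements affect words that were spelled correctly, so the checked text is worth reviewing before it is used downstream.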

In [43]:
' '.join(result['checked'])

"this ipsum is placeholder text commonly used in the graphic , print , and publishing industries for reviewing payouts and visual mockups . @ether_radio yeah : S i feel all funny cause i haven't slept enough i woke my mum up cause i was singing sheds not impressed : S you ?"

In [44]:
testdoc

" this ipsum is placeholder text commonly used in the graphic, print, and publishing industries for previewing layouts and visual mockups. @ether_radio yeah :S i feel all funny cause i haven't slept enough  i woke my mum up cause i was singing she's not impressed :S you? "

In [58]:
tex2=''' whatsup bro, lol you are good, call me asap'''

In [59]:

res2=clean_slang.annotate(tex2)
' '.join(res2['normal'])

'how are you friend laugh out loud you are good call me as soon as possible'
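
The result above looks like a dictionary lookup applied after punctuation is stripped. The following is a minimal sketch of that idea, not Spark NLP's implementation: the `SLANG` dictionary and the `normalize_slang` helper are hypothetical and cover only the words in this example.

```python
import string

# Hypothetical slang dictionary, just large enough for the example sentence.
SLANG = {
    "whatsup": "how are you",
    "bro": "friend",
    "lol": "laugh out loud",
    "asap": "as soon as possible",
}


def normalize_slang(text: str) -> str:
    """Strip surrounding punctuation from each word and expand known slang."""
    words = []
    for raw in text.split():
        word = raw.strip(string.punctuation)
        if word:
            # Look up the lowercase form; keep the word unchanged if unknown.
            words.append(SLANG.get(word.lower(), word))
    return " ".join(words)


normalized = normalize_slang(" whatsup bro, lol you are good, call me asap")
print(normalized)
```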

Let's run sentiment analysis on a sample text.

In [60]:
sentiment = PretrainedPipeline('analyze_sentiment', lang='en')


analyze_sentiment download started this may take some time.
Approx size to download 4.9 MB
[OK!]


In [73]:
result=sentiment.annotate("i can't answer this question, i hate it")

In [74]:
result.keys()

dict_keys(['checked', 'document', 'sentiment', 'token', 'sentence'])

In [75]:
result['sentiment']

['negative']

For better sentiment analysis, try a pipeline based on Universal Sentence Encoder (USE) embeddings and trained on IMDB reviews.

In [76]:
sentiment_better=PretrainedPipeline('analyze_sentimentdl_use_imdb', lang='en')

analyze_sentimentdl_use_imdb download started this may take some time.
Approx size to download 935.8 MB
[OK!]


In [86]:
review_text = "Robot Roomba 675 Robot Vacuum with Wi-Fi Connectivity, Works with Alexa, Good for Pet Hair, Carpets, Hard Floors. This handy household cleaner, with its robot sensibilities, will clean your floors very well. It can be operated with Alexa, your cellphone or in and of itself: it allows the user to connect to clean from anywhere in the house. Its patented 3-Stage Cleaning System is specially engineered to loosen, lift, and suction everything from small particles to large debris from carpets and hard floors. Its patented dirt detect sensors alert to work harder on concentrated areas of dirt, such as high-traffic zones of your home. Its edge-sweeping brush is specially designed at a 27-degree angle to sweep debris away from edges and corners. Its full suite of intelligent sensors guide the robot under and around furniture to help thoroughly clean your floors while it runs for up to 90 minutes before automatically docking and recharging. Required for use: just charge its battery, press clean or schedule Roomba on the go with the iRobot HOME App or Alexa."
result = sentiment_better.annotate(review_text)

In [87]:
result['sentiment']

['positive']

In [88]:
result.keys()

dict_keys(['document', 'sentence_embeddings', 'sentiment'])

In [90]:
sentiment_better.fullAnnotate(review_text)[0]['sentiment']

[Annotation(category, 0, 1063, positive, {'sentence': '0', 'positive': '1.0', 'negative': '2.0834964E-14'})]
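
The metadata of the annotation carries a score per label as strings. As a small sketch, they can be converted to floats to pick the most likely label and its confidence. The dictionary below is copied from the output above; `label_scores` is a hypothetical helper, and in a live session the dictionary would presumably come from the annotation's metadata field.

```python
# Metadata dictionary copied from the Annotation shown above.
metadata = {"sentence": "0", "positive": "1.0", "negative": "2.0834964E-14"}


def label_scores(meta, labels=("positive", "negative")):
    """Convert the string scores of the given labels to floats."""
    return {label: float(meta[label]) for label in labels if label in meta}


scores = label_scores(metadata)
best_label = max(scores, key=scores.get)
print(best_label, scores[best_label])
```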