# 1 line to BioBERT Word Embeddings with NLU in Python
Including Part of Speech, Named Entity Recognition, Emotion Classification in the same line! With Bonus t-SNE plots!


## Introduction 
0.1 What is NLU?
John Snow Labs NLU library gives you 1000+ NLP models and 100+ Word Embeddings in 300+ languages and infinite possibilities to explore your data and gain insights.

Original Source -
https://medium.com/spark-nlp/1-line-to-biobert-word-embeddings-with-nlu-in-python-7224ab52e131

<br/>

https://nlu.johnsnowlabs.com/

<br/>

https://nlu.johnsnowlabs.com/docs/en/install

<br/>

In this tutorial, we will cover how to get the powerful BioBERT Embeddings with 1 line of NLU code and then how to visualize them with t-SNE. We will also compare sentiment with sarcasm and emotions!

## What is t-SNE?
t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.
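To make the cost function more concrete, here is a minimal sketch of the Kullback-Leibler divergence for two small discrete distributions. The probability values are made up for illustration. A real t-SNE implementation derives them from pairwise distances between points and minimizes this quantity by gradient descent.

```python
import math

def kl_divergence(p, q):
    """Return KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical pairwise-similarity distributions over the same three point pairs.
p_high = [0.70, 0.20, 0.10]   # similarities in the high-dimensional space
q_good = [0.65, 0.25, 0.10]   # a low-dimensional layout that preserves them well
q_bad = [0.10, 0.20, 0.70]    # a low-dimensional layout that distorts them

print(kl_divergence(p_high, q_good))  # small value: the layouts agree closely
print(kl_divergence(p_high, q_bad))   # larger value: the layouts disagree
```

The divergence is zero only when both distributions are identical, which is why a lower value indicates a more faithful low-dimensional layout.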

## What is BioBERT?
Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts.

## 1. Import NLU, load BioBERT, and embed a sample string in 1 line

In [1]:
!pip install nlu
!pip install pyspark==3.0.2
!pip install johnsnowlabs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nlu
  Downloading nlu-4.2.0-py3-none-any.whl (639 kB)
Collecting spark-nlp>=4.2.0
  Downloading spark_nlp-4.4.0-py2.py3-none-any.whl (486 kB)
Collecting dataclasses
  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
Installing collected packages: spark-nlp, dataclasses, nlu
Successfully installed dataclasses-0.6 nlu-4.2.0 spark-nlp-4.4.0


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark==3.0.2
  Downloading pyspark-3.0.2.tar.gz (204.8 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.2-py2.py3-none-any.whl size=205186690 sha256=11948fcb4156da4f3dec39f8f3b42b38cd6e9185e5478c24fd36f32de4a1f37c
  Stored in directory: /root/.cache/pip/wheels/aa/8e/b9/ed8017fb2997a648f5868a4b728881f320e3d1bd2b0274f137
Successfully built pyspark
Installing collected packages: py4j, py

In [2]:
import nlu

nlu.load('biobert').predict('He was suprised by the diversity of NLU')

biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,token,word_embedding_biobert
0,He,"[-0.30664220452308655, 0.013826360926032066, -..."
0,was,"[-0.12069427222013474, 0.1011619046330452, -0...."
0,suprised,"[-0.4372478723526001, 0.026471368968486786, 0...."
0,by,"[0.1463586688041687, 0.13498945534229279, -0.6..."
0,the,"[-0.32744330167770386, 0.013552097603678703, -..."
0,diversity,"[-0.13004884123802185, 0.14864905178546906, 0...."
0,of,"[0.23615342378616333, -0.07773970067501068, -0..."
0,NLU,"[0.08642497658729553, -0.2656814157962799, -0...."


## 2. Load a larger dataset
The following snippet will download a Reddit sarcasm dataset and load it into a Pandas Dataframe.

In [None]:
import pandas as pd

# Download the dataset
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sarcasm/train-balanced-sarcasm.csv -P /tmp


# Load dataset to Pandas

df = pd.read_csv('/tmp/train-balanced-sarcasm.csv')
df

--2023-04-06 08:42:18--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sarcasm/train-balanced-sarcasm.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.58.64, 54.231.196.72, 52.216.92.189, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.58.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 255268960 (243M) [text/csv]
Saving to: ‘/tmp/train-balanced-sarcasm.csv’


2023-04-06 08:42:37 (13.5 MB/s) - ‘/tmp/train-balanced-sarcasm.csv’ saved [255268960/255268960]



Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,NC and NH.,Trumpbart,politics,2,-1,-1,2016-10,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,Shbshb906,nba,-4,-1,-1,2016-11,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",Creepeth,nfl,3,3,0,2016-09,2016-09-22 21:45:37,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",icebrotha,BlackPeopleTwitter,-8,-1,-1,2016-10,2016-10-18 21:03:47,deadass don't kill my buzz
4,0,I could use one of those tools.,cush2push,MaddenUltimateTeam,6,-1,-1,2016-12,2016-12-30 17:00:13,Yep can confirm I saw the tool they use for th...
...,...,...,...,...,...,...,...,...,...,...
1010821,1,I'm sure that Iran and N. Korea have the techn...,TwarkMain,reddit.com,2,2,0,2009-04,2009-04-25 00:47:52,"No one is calling this an engineered pathogen,..."
1010822,1,"whatever you do, don't vote green!",BCHarvey,climate,1,1,0,2009-05,2009-05-14 22:27:40,In a move typical of their recent do-nothing a...
1010823,1,Perhaps this is an atheist conspiracy to make ...,rebelcommander,atheism,1,1,0,2009-01,2009-01-11 00:22:57,Screw the Disabled--I've got to get to Church ...
1010824,1,The Slavs got their own country - it is called...,catsi,worldnews,1,1,0,2009-01,2009-01-23 21:12:49,I've always been unsettled by that. I hear a l...


## Now let's load the johnsnowlabs library

In [None]:
from johnsnowlabs import nlp
nlp.load('emotion').predict('Wow that easy!')

classifierdl_use_emotion download started this may take some time.
Approximate size to download 21.3 MB
[OK!]
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,emotion,emotion_confidence,sentence,sentence_embedding_use
0,surprise,0.970199,Wow that easy!,"[0.0038029798306524754, -0.010633695870637894,..."


When using Annotator based pipelines, use nlp.start() to start up your session.

In [None]:
from johnsnowlabs import nlp
nlp.start()
pipe = nlp.Pipeline(stages=
[
    nlp.DocumentAssembler().setInputCol('text').setOutputCol('doc'),
    nlp.Tokenizer().setInputCols('doc').setOutputCol('tok')
])
nlp.to_nlu_pipe(pipe).predict('That was easy')

Spark Session already created, some configs may not take.
🤓 Looks like /root/.johnsnowlabs is missing, creating it
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs


Unnamed: 0,token
0,That
0,was
0,easy


## Load & Predict 1 liner

Source - https://nlu.johnsnowlabs.com/docs/en/predict_api

The johnsnowlabs library provides 2 simple methods with which most NLP tasks can be solved while achieving state-of-the-art results: the load and predict methods.

When building a load & predict based model, you will follow these steps:

1. Pick a model/pipeline/component you want to create from the Namespace.
2. Call the model = nlp.load(component) method, which will return an auto-completed pipeline.
3. Call model.predict('that was easy') on some String input.

These 3 steps can be boiled down to just 1 line:

In [None]:
from johnsnowlabs import nlp
nlp.load('sentiment').predict('How does this witchcraft work?')

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_converter,sentiment,sentiment_confidence,word_embedding_glove
0,How does this witchcraft work?,"[-0.002302003325894475, 0.4860450029373169, 0....",neg,0.999864,"[[-0.2376900017261505, 0.5939199924468994, 0.5..."


nlp.load() defines 18 component types usable in 1-liners. Some can be prefixed with train. for training models.

Any of the actions for the component types can be passed as a string to nlp.load() and will return the default model for that component type for the English language. You can further specify your model selection by placing a ‘.’ behind your component selection.
After the ‘.’ you can specify the model you want via a dataset or model version.
See the Models Hub, the Components Namespace and the load function for more information.


| Component type | nlp.load() base |
|---|---|
| Named Entity Recognition (NER) | nlp.load('ner') |
| Part of Speech (POS) | nlp.load('pos') |
| Classifiers | nlp.load('classify') |
| Word embeddings | nlp.load('embed') |
| Sentence embeddings | nlp.load('embed_sentence') |
| Chunk embeddings | nlp.load('embed_chunk') |
| Labeled dependency parsers | nlp.load('dep') |
| Unlabeled dependency parsers | nlp.load('dep.untyped') |
| Lemmatizers | nlp.load('lemma') |
| Matchers | nlp.load('match') |
| Normalizers | nlp.load('norm') |
| Sentence detectors | nlp.load('sentence_detector') |
| Chunkers | nlp.load('chunk') |
| Spell checkers | nlp.load('spell') |
| Stemmers | nlp.load('stem') |
| Stopwords cleaners | nlp.load('stopwords') |
| Cleaner | nlp.load('clean') |
| N-Grams | nlp.load('ngram') |
| Tokenizers | nlp.load('tokenize') |


# Annotator & PretrainedPipeline based pipelines
You can create Annotator & PretrainedPipeline based pipelines using all the classes attached to the nlp module.

nlp.PretrainedPipeline('pipe_name') gives access to Pretrained Pipelines


In [None]:
from johnsnowlabs import nlp
from pprint import pprint

nlp.start()
explain_document_pipeline = nlp.PretrainedPipeline("explain_document_ml")
annotations = explain_document_pipeline.annotate("We are very happy about SparkNLP")
pprint(annotations)

Spark Session already created, some configs may not take.
explain_document_ml download started this may take some time.
Approx size to download 9.2 MB
[OK!]
{'document': ['We are very happy about SparkNLP'],
 'lemmas': ['We', 'be', 'very', 'happy', 'about', 'SparkNLP'],
 'pos': ['PRP', 'VBP', 'RB', 'JJ', 'IN', 'NNP'],
 'sentence': ['We are very happy about SparkNLP'],
 'spell': ['We', 'are', 'very', 'happy', 'about', 'SparkNLP'],
 'stems': ['we', 'ar', 'veri', 'happi', 'about', 'sparknlp'],
 'token': ['We', 'are', 'very', 'happy', 'about', 'SparkNLP']}


## Custom Pipes
Alternatively, you can compose Annotators into a pipeline, which offers the highest degree of customization.

In [None]:
from johnsnowlabs import nlp
spark = nlp.start(nlp=False)
pipe = nlp.Pipeline(stages=
[
    nlp.DocumentAssembler().setInputCol('text').setOutputCol('doc'),
    nlp.Tokenizer().setInputCols('doc').setOutputCol('tok')
])
spark_df = spark.createDataFrame([['Hello NLP World']]).toDF("text")
pipe.fit(spark_df).transform(spark_df).show()

Spark Session already created, some configs may not take.
+---------------+--------------------+--------------------+
|           text|                 doc|                 tok|
+---------------+--------------------+--------------------+
|Hello NLP World|[{document, 0, 14...|[{token, 0, 4, He...|
+---------------+--------------------+--------------------+



# Predict method Parameters
## Output metadata
The predict() method has a boolean metadata parameter.
When it is set to True, it outputs the confidence and additional metadata for each prediction. Its default value is False.

In [None]:
nlp.load('lang').predict('What a wonderful day!', metadata=True)

detect_language_375 download started this may take some time.
Approx size to download 9.4 MB
[OK!]


Unnamed: 0,language_results,meta_language_confidence,sentence_dl
0,en,9.0,What a wonderful day!


## Output Level parameter
predict() defines 5 output levels for the generated predictions.
The output levels define how granular the predictions and outputs will be.
Depending on your goal, the output level may need to be adjusted.

- Token level: Outputs one row for every token in the input. One to many mapping.
- Chunk level: Outputs one row for every chunk in the input. One to many mapping.
- Sentence level: Outputs one row for every sentence in the input. One to many mapping.
- Relation level: Outputs one row for every relation predicted. One to many mapping.
- Document level: Outputs one row for every document in the input. One to one mapping.

predict() will try to infer the most useful output level automatically if an output level is not specified.
The inferred output level is usually that of the last element of the pipeline.

Take a look at the output levels demo, which goes over all the output levels.
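The one to many mappings can be illustrated without the library. The sketch below is a toy stand-in, not the library's implementation: the hypothetical explode() helper splits text with simple regular expressions where the real pipeline would use its sentence detector and tokenizer. It only shows how the row count grows with finer output levels while every row keeps the index of the document it came from.

```python
import re

def explode(docs, output_level):
    """Toy stand-in for output levels: return (origin_index, text) rows."""
    rows = []
    for idx, doc in enumerate(docs):
        if output_level == 'document':
            parts = [doc]
        elif output_level == 'sentence':
            # Naive split after '.', '!' or '?' followed by whitespace.
            parts = re.split(r'(?<=[.!?])\s+', doc.strip())
        elif output_level == 'token':
            # Words and single punctuation marks.
            parts = re.findall(r"\w+|[^\w\s]", doc)
        else:
            raise ValueError(f'unsupported output level: {output_level}')
        rows.extend((idx, part) for part in parts)
    return rows

docs = ['I love data science! It is so much fun! It can also be quite helpful to people.',
        'I love the city New-York']

for level in ('document', 'sentence', 'token'):
    rows = explode(docs, level)
    print(level, len(rows), rows[:4])
```

At the sentence level the first document yields three rows that all carry index 0, and the second document yields one row with index 1.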

## Document output level example
Every row in the input data frame will be mapped to one row in the output dataframe.

In [None]:
nlp.load('sentiment').predict(['I love data science! It is so much fun! It can also be quite helpful to people.', 'I love the city New-York'], output_level='document')

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


Unnamed: 0,document,sentence_embedding_converter,sentiment,sentiment_confidence,word_embedding_glove
0,I love data science! It is so much fun! It can...,"[-0.251600444316864, 0.40478071570396423, 0.47...",pos,3.0,"[[-0.046539001166820526, 0.6196600198745728, 0..."
1,I love the city New-York,"[0.12232939898967743, 0.28969937562942505, 0.4...",pos,1.0,"[[-0.046539001166820526, 0.6196600198745728, 0..."


## Sentence output level example
Every sentence in each row becomes a new row in the output dataframe.

In [None]:
nlp.load('sentiment').predict(['I love data science! It is so much fun! It can also be quite helpful to people.', 'I love the city New-York'], output_level='sentence')

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_converter,sentiment,sentiment_confidence,word_embedding_glove
0,I love data science!,"[-0.001255804323591292, 0.5551699995994568, 0....",pos,1.0,"[[-0.046539001166820526, 0.6196600198745728, 0..."
0,It is so much fun!,"[-0.2720366418361664, 0.38465380668640137, 0.6...",pos,1.0,"[[-0.046539001166820526, 0.6196600198745728, 0..."
0,It can also be quite helpful to people.,"[-0.37705671787261963, 0.33464911580085754, 0....",pos,1.0,"[[-0.046539001166820526, 0.6196600198745728, 0..."
1,I love the city New-York,"[0.12232939898967743, 0.28969937562942505, 0.4...",pos,0.999998,"[[-0.046539001166820526, 0.6196600198745728, 0..."


## Chunk output level example
Every chunk in each input row becomes a new row in the output dataframe. This is useful for components like the Named Entity Resolver. By setting the output level to chunk, you ensure every Named Entity becomes one row in your dataset. Named Entities are chunks.

In [None]:
# 'Angela Merkel' is a chunk. A chunk is a span of one or more tokens that is not a whole sentence.
nlp.load('ner').predict(['Angela Merkel and Donald Trump dont share many opinions', "Ashley wants to visit the Brandenburger Tor in Berlin"], output_level='chunk')

onto_recognize_entities_sm download started this may take some time.
Approx size to download 160.1 MB
[OK!]


Unnamed: 0,document,entities,entities_class,entities_confidence,entities_origin_chunk,entities_origin_sentence,sentence_pragmatic,word_embedding_embeddings
0,Angela Merkel and Donald Trump dont share many...,Angela Merkel,PERSON,0.98975,0,0,[Angela Merkel and Donald Trump dont share man...,"[[-0.563759982585907, 0.26958999037742615, 0.3..."
0,Angela Merkel and Donald Trump dont share many...,Donald Trump,PERSON,0.9956,1,0,[Angela Merkel and Donald Trump dont share man...,"[[-0.563759982585907, 0.26958999037742615, 0.3..."
1,Ashley wants to visit the Brandenburger Tor in...,Ashley,PERSON,0.9958,0,0,[Ashley wants to visit the Brandenburger Tor i...,"[[0.24997000396251678, -0.12275999784469604, -..."
1,Ashley wants to visit the Brandenburger Tor in...,the Brandenburger Tor,FAC,0.38396668,1,0,[Ashley wants to visit the Brandenburger Tor i...,"[[0.24997000396251678, -0.12275999784469604, -..."
1,Ashley wants to visit the Brandenburger Tor in...,Berlin,GPE,0.9946,2,0,[Ashley wants to visit the Brandenburger Tor i...,"[[0.24997000396251678, -0.12275999784469604, -..."


## Token output level example
Every token in each input row becomes a new row in the output dataframe.

In [None]:
nlp.load('sentiment').predict(['I love data science! It is so much fun! It can also be quite helpful to people.', 'I love the city New-York'], output_level='token')

## Positions parameter
By setting positions=True, the Dataframe generated will contain additional columns which describe the beginning and end of each feature inside of the original document. These additional _begin and _end columns let you infer the piece of the original input string that has been used to generate the output.

- If the output level is set to a different level than a feature's own output level, that feature's positional values will be inside of lists.
- If the output level is set to the same level as a feature's output level, its positional values will be single integers.

For token based components, the positional features refer to the beginning and the end of the token inside the original document the text originates from.
For sentence based components like sentence embeddings and sentence classifiers, the positional features describe the beginning and the end of the sentence that was used to generate the output.
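As a rough illustration of what the begin and end values mean, the following sketch computes character offsets for the sentence 'I love data science!' in plain Python. The regular expression is a simplified stand-in for the real tokenizer, and the end offset is treated as inclusive, which is an assumption read off the token_begin and token_end values in the output of the cell that follows.

```python
import re

text = 'I love data science!'

# Words and single punctuation marks; a simplified stand-in for the real tokenizer.
tokens = [(m.group(), m.start(), m.end() - 1)   # end index is inclusive
          for m in re.finditer(r"\w+|[^\w\s]", text)]

for token, begin, end in tokens:
    # Slicing with end + 1 recovers the token from the original string.
    print(token, begin, end, repr(text[begin:end + 1]))

# The whole sentence spans the first to the last character.
sentence_begin, sentence_end = 0, len(text) - 1
print(sentence_begin, sentence_end)
```

Because the end offset is inclusive, a one-character token such as '!' has the same begin and end value.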

In [None]:
nlp.load('sentiment').predict('I love data science!', output_level='token', positions=True)

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,document_begin,document_end,sentence_begin,sentence_embedding_converter,sentence_embedding_converter_begin,sentence_embedding_converter_end,sentence_end,sentiment,sentiment_begin,sentiment_confidence,sentiment_end,token,token_begin,token_end,word_embedding_glove,word_embedding_glove_begin,word_embedding_glove_end
0,[0],[19],[0],"[[-0.001255804323591292, 0.5551699995994568, 0...",[0],[19],[19],[pos],[0],[1.0],[19],I,0,0,"[-0.046539001166820526, 0.6196600198745728, 0....",0,0
0,[0],[19],[0],"[[-0.001255804323591292, 0.5551699995994568, 0...",[0],[19],[19],[pos],[0],[1.0],[19],love,2,5,"[0.25975000858306885, 0.5583299994468689, 0.57...",2,5
0,[0],[19],[0],"[[-0.001255804323591292, 0.5551699995994568, 0...",[0],[19],[19],[pos],[0],[1.0],[19],data,7,10,"[-0.47099000215530396, 0.6157699823379517, 0.6...",7,10
0,[0],[19],[0],"[[-0.001255804323591292, 0.5551699995994568, 0...",[0],[19],[19],[pos],[0],[1.0],[19],science,12,18,"[-0.13322000205516815, 0.48857998847961426, 0....",12,18
0,[0],[19],[0],"[[-0.001255804323591292, 0.5551699995994568, 0...",[0],[19],[19],[pos],[0],[1.0],[19],!,19,19,"[0.38471999764442444, 0.49351000785827637, 0.4...",19,19


## Row origin inference for one to many mappings
predict() will recycle the Pandas index from the input Dataframe.
The index is useful if one row is mapped to many rows during prediction.
The new rows which are generated from the input row will all have the same index as the original source row.
I.e. if one sentence row gets split into many token rows, each token row will have the same index as the sentence row.
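A small sketch of why the recycled index is handy: with it, the rows produced from one input row can be grouped back together. The token rows below are hypothetical and written by hand; in practice they would come from a token level prediction.

```python
from collections import defaultdict

# Hypothetical token-level output as (index, token) pairs. The index repeats
# because each token row inherits the index of the input row it came from.
token_rows = [(0, 'This'), (0, 'day'), (0, 'sucks'),
              (1, 'I'), (1, 'love'), (1, 'this'), (1, 'day')]

# The original input rows, keyed by the same index.
source_rows = {0: 'This day sucks', 1: 'I love this day'}

# Group the token rows by the index they share with their source row.
tokens_by_origin = defaultdict(list)
for index, token in token_rows:
    tokens_by_origin[index].append(token)

for index, tokens in tokens_by_origin.items():
    print(index, source_rows[index], tokens)
```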

## NaN Handling
Every NaN value is converted to Python's None, which is reflected in the final dataframe.
If a column contains only NaN or None, it will be dropped.
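The two rules can be sketched in plain Python. The clean_columns() helper and the column-wise dictionary are hypothetical and only mimic the described behaviour; they are not how the library implements it.

```python
import math

def clean_columns(columns):
    """Convert NaN to None, then drop columns that hold nothing but None."""
    cleaned = {}
    for name, values in columns.items():
        converted = [None if isinstance(v, float) and math.isnan(v) else v
                     for v in values]
        # Keep the column only if at least one real value survives.
        if any(v is not None for v in converted):
            cleaned[name] = converted
    return cleaned

# Hypothetical prediction output stored column-wise.
raw = {
    'sentiment': ['pos', float('nan'), 'neg'],
    'entities': [float('nan'), None, float('nan')],
}

print(clean_columns(raw))
```

Here the sentiment column keeps its values with the NaN replaced by None, while the entities column disappears because it never held a real value.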

## Memory optimization recommendations
Instead of passing your entire Pandas Dataframe to predict(), you can pass only the columns you need for later tasks.
This saves memory and computation time. For example, if latitude and longitude columns are irrelevant for later tasks, leave them out of the Dataframe you pass to predict().
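A minimal sketch of the idea, using plain Python records so it runs without any dependencies. The records and the column names text, location, latitude and longitude are made up for illustration.

```python
# Hypothetical records; only 'text' and 'location' matter for later tasks.
records = [
    {'text': 'NLU rocks!', 'location': 'Berlin', 'latitude': 52.55, 'longitude': 13.39},
    {'text': 'Try out NLU!', 'location': 'Paris', 'latitude': 48.86, 'longitude': 2.29},
]

# Keep only the columns needed downstream before handing the data on.
keep = ('text', 'location')
slim_records = [{key: row[key] for key in keep} for row in records]

print(slim_records)
# With a Pandas Dataframe df, the equivalent selection is df[['text', 'location']],
# and the slimmed Dataframe is what you would pass to predict().
```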

## Supported data types
predict() supports all of the common Python data types and formats:

- Pandas Dataframes
- Spark Dataframes
- Modin with Dask backend
- Modin with Ray backend
- 1-D Numpy arrays of Strings
- Strings
- Arrays of Strings

### Single strings

In [None]:
nlp.load('sentiment').predict('This is just one string')

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_converter,sentiment,sentiment_confidence,word_embedding_glove
0,This is just one string,"[-0.36997678875923157, 0.4857400059700012, 0.5...",neg,0.999081,"[[-0.570580005645752, 0.44183000922203064, 0.7..."


### Lists of strings

In [None]:
nlp.load('sentiment').predict(['This is an array', ' Of strings!'])

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_converter,sentiment,sentiment_confidence,word_embedding_glove
0,This is an array,"[-0.5481075048446655, 0.399619996547699, 0.563...",pos,0.999982,"[[-0.570580005645752, 0.44183000922203064, 0.7..."
1,Of strings!,"[0.019063333049416542, 0.3870999813079834, 0.5...",pos,0.810641,"[[-0.15289999544620514, -0.24278999865055084, ..."


### Pandas Dataframe
One column must be named text and be of object/string type. If no column named ‘text’ exists, the first column will be used instead.
Note: Passing the entire dataframe with additional features to the predict() method is very memory intensive.
It is recommended to pass only the columns required for downstream tasks to the predict() method.

In [None]:
from johnsnowlabs import nlp
import pandas as pd
data = {"text": ['This day sucks', 'I love this day', 'I dont like Sami']}
text_df = pd.DataFrame(data)
nlp.load('sentiment').predict(text_df)

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_converter,sentiment,sentiment_confidence,text,word_embedding_glove
0,This day sucks,"[-0.39792001247406006, 0.5047299861907959, 0.6...",pos,0.999882,This day sucks,"[[-0.570580005645752, 0.44183000922203064, 0.7..."
1,I love this day,"[-0.18106475472450256, 0.508804976940155, 0.49...",pos,1.0,I love this day,"[[-0.046539001166820526, 0.6196600198745728, 0..."
2,I dont like Sami,"[-0.09037527441978455, 0.49884024262428284, 0....",neg,0.994728,I dont like Sami,"[[-0.046539001166820526, 0.6196600198745728, 0..."


### Pandas Series
Pass a single column of object/string type, such as the text column of a Dataframe.
Note: This is the most memory efficient way.

In [None]:
from johnsnowlabs import nlp
import pandas as pd
data = {"text": ['This day sucks', 'I love this day', 'I dont like Sami']}
text_df = pd.DataFrame(data)
nlp.load('sentiment').predict(text_df['text'])

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_converter,sentiment,sentiment_confidence,word_embedding_glove
0,This day sucks,"[-0.39792001247406006, 0.5047299861907959, 0.6...",pos,0.999882,"[[-0.570580005645752, 0.44183000922203064, 0.7..."
1,I love this day,"[-0.18106475472450256, 0.508804976940155, 0.49...",pos,1.0,"[[-0.046539001166820526, 0.6196600198745728, 0..."
2,I dont like Sami,"[-0.09037527441978455, 0.49884024262428284, 0....",neg,0.994728,"[[-0.046539001166820526, 0.6196600198745728, 0..."


### Spark Dataframe
One column must be named text and be of string type. If no column named ‘text’ exists, the first column will be used instead.

In [None]:
from johnsnowlabs import nlp
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Create DataFrame").getOrCreate()

data = [
    {'text': 'This day sucks'},
    {'text': 'I love this day'},
    {'text': 'I dont like Sami'}
]

df = spark.createDataFrame(data)
nlp.load('sentiment').predict(df)

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_converter,sentiment,sentiment_confidence,text,word_embedding_glove
0,This day sucks,"[-0.39792001247406006, 0.5047299861907959, 0.6...",pos,0.999882,This day sucks,"[[-0.570580005645752, 0.44183000922203064, 0.7..."
1,I love this day,"[-0.18106475472450256, 0.508804976940155, 0.49...",pos,1.0,I love this day,"[[-0.046539001166820526, 0.6196600198745728, 0..."
2,I dont like Sami,"[-0.09037527441978455, 0.49884024262428284, 0....",neg,0.994728,I dont like Sami,"[[-0.046539001166820526, 0.6196600198745728, 0..."


### Modin Dataframe
Supports the Ray and Dask backends.
One column must be named text and be of string type. If no column named ‘text’ exists, the first column will be used instead.

In [None]:
!pip install modin[all]

In [None]:
from johnsnowlabs import nlp
import modin.pandas as pd
data = {"text": ['This day sucks', 'I love this day', 'I dont like Sami']}
text_pdf = pd.DataFrame(data)
nlp.load('sentiment').predict(text_pdf)

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_converter,sentiment,sentiment_confidence,word_embedding_glove
0,This day sucks,"[-0.39792001247406006, 0.5047299861907959, 0.6...",pos,0.999882,"[[-0.570580005645752, 0.44183000922203064, 0.7..."
1,I love this day,"[-0.18106475472450256, 0.508804976940155, 0.49...",pos,1.0,"[[-0.046539001166820526, 0.6196600198745728, 0..."
2,I dont like Sami,"[-0.09037527441978455, 0.49884024262428284, 0....",neg,0.994728,"[[-0.046539001166820526, 0.6196600198745728, 0..."


## Training Models with the fit() function


Source - https://colab.research.google.com/drive/1f-EORjO3IpvwRAktuL4EvZPqPr2IZ_g8?usp=sharing 



You can load a trainable NLP pipeline via nlp.load('train.<model>') and train it with fit().

### Binary Text Classifier Training
Sentiment classification training demo
To train a sentiment classifier model, you must pass a dataframe with a text column and a y column for the label. The classifier uses a deep neural network built in TensorFlow.
By default, Universal Sentence Encoder (USE) embeddings are used as sentence embeddings.
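As a sketch of the expected input shape, the snippet below renames the comment and label columns of the sarcasm dataset loaded earlier to text and y. It uses only the standard library and two rows copied from the dataset preview, so it runs without downloading anything. Whether labels should stay as strings or be converted is left open here, and the call that would consume the result is only mentioned in the note after the code.

```python
import csv
import io

# A tiny in-memory sample with the same 'label' and 'comment' column names
# as the sarcasm dataset; both rows are taken from the preview shown earlier.
sample_csv = io.StringIO(
    'label,comment\n'
    '0,I could use one of those tools.\n'
    '1,"whatever you do, don\'t vote green!"\n'
)

# Rename the columns to the 'text' and 'y' names described above.
train_rows = [{'text': row['comment'], 'y': row['label']}
              for row in csv.DictReader(sample_csv)]

print(train_rows)
```

With Pandas, the same renaming is df.rename(columns={'comment': 'text', 'label': 'y'}), and the resulting dataframe is what a trainable pipeline's fit() would receive.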