# Where we left off
In Part 02 we fetched question bodies, cleaned them up a bit, and saved both the raw data and clean data to a data directory. Now we're going to analyze some of the bodies using `spacy`.

# Spacy pipeline
Using [`spacy`](https://spacy.io/) out-of-the-box is pretty easy. Just load in a pretrained [model](https://spacy.io/models), e.g. `"en_core_web_sm"` (small), `"en_core_web_md"` (medium), `"en_core_web_lg"` (large), or `"en_core_web_trf"` (transformer), and you're set!

> NOTE: If you're having trouble downloading or installing a model, try this [solution](https://stackoverflow.com/a/72636669/6509519).

In [35]:
import spacy


nlp = spacy.load("en_core_web_sm")

I'm going to stay at a high level when talking about `spacy` as much as possible. But there are some terms we need to cover:
- A [`Doc`](https://spacy.io/api/doc) object is what the `spacy` pipeline returns. It's like a string of text, but with annotations like part-of-speech.
- A [`Token`](https://spacy.io/api/token) object is what makes up a `Doc` object. These are typically single words or punctuation marks.
- A [`Span`](https://spacy.io/api/span) object is a contiguous sequence of `Token` objects. When we slice a `Doc` object we get a `Span` object in return. You can think of a span as part of a sentence, like a first and last name or a phrase.


Here is an example:

In [45]:
doc = nlp("This is a string of text.")
span = doc[1:3]
token = span[0]

In [48]:
print(doc)

This is a string of text.


In [50]:
print(span)

is a


In [52]:
print(token)

is
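To make the relationship between the three objects a little more concrete, here is a minimal sketch that inspects their types and offsets. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) instead of `"en_core_web_sm"`, so it runs without a downloaded model.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# assumption: a blank pipeline (tokenizer only) is enough to show slicing
blank_nlp = spacy.blank("en")
doc = blank_nlp("This is a string of text.")
span = doc[1:3]   # slicing a Doc gives a Span
token = span[0]   # indexing a Span (or a Doc) gives a Token

print(type(doc).__name__, type(span).__name__, type(token).__name__)
# token offsets of the span within the doc (end is exclusive)
print(span.start, span.end)
# position of the token in the doc, and its character offset in the text
print(token.i, token.idx)
```

Note that a `Span` is a view onto its `Doc`, so `span.start` and `span.end` are token positions in the parent `Doc`, not character offsets.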


The `Token` objects in a `Doc` have annotations added to them throughout the `nlp` [pipeline](https://spacy.io/usage/processing-pipelines). We can use these to better understand a text.

In [63]:
pos = [(t, t.pos_) for t in doc]
print(f"Parts of speech: {pos}")
lemma = [(t, t.lemma_) for t in doc]
print(f"Lemmas: {lemma}")
sent_start = [(t, t.is_sent_start) for t in doc]
print(f"Is sentence start: {sent_start}")

Parts of speech: [(This, 'PRON'), (is, 'AUX'), (a, 'DET'), (string, 'NOUN'), (of, 'ADP'), (text, 'NOUN'), (., 'PUNCT')]
Lemmas: [(This, 'this'), (is, 'be'), (a, 'a'), (string, 'string'), (of, 'of'), (text, 'text'), (., '.')]
Is sentence start: [(This, True), (is, False), (a, False), (string, False), (of, False), (text, False), (., False)]
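If the part-of-speech labels above are unfamiliar, `spacy.explain` looks up a short description for each one. A small sketch, with the list of tags copied from the output above:

```python
import spacy

# tags taken from the "Parts of speech" output above
tags = ["PRON", "AUX", "DET", "NOUN", "ADP", "PUNCT"]
descriptions = {tag: spacy.explain(tag) for tag in tags}
for tag, description in descriptions.items():
    print(f"{tag}: {description}")
```

This needs no pretrained model because the descriptions come from a glossary that ships with `spacy` itself.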


The `nlp` pipeline can be applied to a stream of texts using [`nlp.pipe`](https://spacy.io/api/language#pipe). It processes the texts in batches and yields `Doc` objects one at a time, which is usually faster than calling `nlp` on each text and avoids holding every `Doc` in memory at once. Let's load in our clean data and process some of it.

In [87]:
import json


with open("../data/clean.jsonl", "r") as f:
    lines = f.readlines()
    # each line of a JSON Lines file is its own JSON object, so parse them one at a time
    data = [json.loads(line) for line in lines]
# not every item has a "body" key, so keep only the ones that do and convert each to a doc
docs = nlp.pipe(item["body"] for item in data if "body" in item)
doc = next(docs)
print(doc)


So I understand what that means, but is there a well-known alternative that is more open-standards friendly, not proprietary?  What driver do you use and/or recommend and what are the advantages of it?
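One thing the generator above loses is which item each `Doc` came from. A minimal sketch of one way to keep that link, using the `as_tuples` argument of `nlp.pipe`. The in-memory file, its contents, and the `question_id` key are assumptions for illustration, and a blank English pipeline stands in for `"en_core_web_sm"` so no model is needed.

```python
import io
import json

import spacy

# hypothetical stand-in for ../data/clean.jsonl; `question_id` is an assumed key
fake_file = io.StringIO(
    '{"question_id": 1, "body": "What driver do you use?"}\n'
    '{"question_id": 2}\n'
    '{"question_id": 3, "body": "Am I wrong to think that?"}\n'
)
items = [json.loads(line) for line in fake_file]

# assumption: a blank pipeline only tokenizes, which is enough to show the streaming
blank_nlp = spacy.blank("en")

# with as_tuples=True, nlp.pipe takes (text, context) pairs and yields (doc, context) pairs
pairs = ((item["body"], item["question_id"]) for item in items if "body" in item)
processed = [(qid, doc) for doc, qid in blank_nlp.pipe(pairs, as_tuples=True)]

for qid, doc in processed:
    print(qid, doc.text)
```

The item without a `"body"` key is skipped, and each remaining `Doc` stays paired with its id. With that aside out of the way, back to the question body printed above.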



It looks like a normal chunk of text. How would we go about labeling parts as "IS_QUESTION" or "NOT_QUESTION"? We could chunk it into sentences using the [`sents`](https://spacy.io/api/doc#sents) attribute.

# Sentences

In [94]:
print(f"Number of detected sentences: {sum(1 for _ in doc.sents)}")
print(*doc.sents, sep="\n"+"-"*30+"\n")

Number of detected sentences: 6
I recently migrated an older application we have at work from Java 1.5 to 1.6.  
------------------------------


------------------------------
So I understand what that means, but is there a well-known alternative that is more open-standards friendly, not proprietary?  
------------------------------
What driver do you use and/or recommend and what are the advantages of it?

------------------------------
------------------------------
Am I wrong to think that?



The pipeline split this first `doc` into six sentences. The `"en_core_web_sm"` `nlp` pipeline uses a `parser` ([`DependencyParser`](https://spacy.io/api/dependencyparser)) by default, which did a pretty good job at detecting sentence boundaries. We could potentially improve it further via either the rule-based `sentencizer` ([`Sentencizer`](https://spacy.io/api/sentencizer)) or the statistical `senter` ([`SentenceRecognizer`](https://spacy.io/api/sentencerecognizer)). We can customize the rules in the `sentencizer` or tune the already existing `senter` model with our own data.
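As a rough idea of what customizing the `sentencizer` rules could look like, here is a minimal sketch that changes its `punct_chars` setting so a colon also ends a sentence. The example text and the choice of `":"` are assumptions for illustration, and blank English pipelines are used so no pretrained model is needed.

```python
import spacy

# made-up example text, loosely modeled on a question body
text = "Here is the constructor for the form: In the Load method, I populate the grids."

# sentencizer with its default punctuation characters
default_nlp = spacy.blank("en")
default_nlp.add_pipe("sentencizer")

# sentencizer that also treats ":" as sentence-ending punctuation (an assumption, not a recommendation)
custom_nlp = spacy.blank("en")
custom_nlp.add_pipe("sentencizer", config={"punct_chars": [".", "?", "!", ":"]})

default_sents = [sent.text for sent in default_nlp(text).sents]
custom_sents = [sent.text for sent in custom_nlp(text).sents]

print(len(default_sents), default_sents)
print(len(custom_sents), custom_sents)
```

A rule like this is blunt: it would also split on colons that do not end a sentence, which is part of why a tuned statistical `senter` can be the better fit.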

> NOTE: The `senter` component is ~10× faster than the `parser` and more accurate than the rule-based `sentencizer`.

In [107]:
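# enable the `senter` model
# NOTE: `doc` was processed before this call, so `doc.sents` below still shows the `parser` boundaries
nlp.enable_pipe("senter")
print(f"Number of detected sentences: {sum(1 for _ in doc.sents)}")
print(*doc.sents, sep="\n"+"-"*30+"\n")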

Number of detected sentences: 6
I recently migrated an older application we have at work from Java 1.5 to 1.6.  
------------------------------


------------------------------
So I understand what that means, but is there a well-known alternative that is more open-standards friendly, not proprietary?  
------------------------------
What driver do you use and/or recommend and what are the advantages of it?

------------------------------
------------------------------
Am I wrong to think that?
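The output above is identical to the earlier one, which is expected: `doc` was annotated before `senter` was enabled, and an existing `Doc` keeps the annotations it was created with. A minimal sketch of that behavior, assuming a blank English pipeline with the rule-based `sentencizer` standing in for `senter` so that no model download is needed (the example text is made up):

```python
import spacy

# assumption: blank pipeline + sentencizer as a stand-in for en_core_web_sm + senter
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")
blank_nlp.disable_pipe("sentencizer")

text = "I migrated an older application at work. Am I wrong to think that?"

# processed while the sentence component is disabled: no sentence boundaries are set
old_doc = blank_nlp(text)

# enabling the component does not touch docs that already exist
blank_nlp.enable_pipe("sentencizer")
old_has_sents = old_doc.has_annotation("SENT_START")

# re-processing the text picks up the newly enabled component
new_doc = blank_nlp(old_doc.text)
new_has_sents = new_doc.has_annotation("SENT_START")

print(old_has_sents, new_has_sents)
print([sent.text for sent in new_doc.sents])
```

With the pretrained model the same idea applies: re-run `nlp` on the text after changing components. spaCy's documentation shows loading with `spacy.load("en_core_web_sm", exclude=["parser"])` before calling `nlp.enable_pipe("senter")`, so that the `senter` rather than the `parser` sets the boundaries.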



The pipeline did a pretty good job detecting (what I think are) sentences. Let's see a few more examples:

In [108]:
doc = next(docs)
print(f"Number of detected sentences: {sum(1 for _ in doc.sents)}")
print(*doc.sents, sep="\n"+"-"*30+"\n")

Number of detected sentences: 5
I am using Ruby on Rails and I have the following Mongo collection,

I have another MYSQL table called , which also has Country and  field.
------------------------------
I want to find the MondoDB document using  and update the value of  mongoDB  field from  table.
------------------------------
I want to update the country field of each document to a different values received from another MYSQL table.


------------------------------
I know how to update the multiple document with the same value, but here the case is each document should be updated with different values of  field.  
------------------------------
I know how to perform it using looping and updating, but I want to know is there a bulk update or updateMany query available to perform this type of operation.



This one looks like it could be improved. Because some of the code bits were removed, the structure of some sentences is off. We humans can navigate around this, but the model struggles a bit. If I were doing this manually I'd say there are 6 sentences instead of 5 (the first detected sentence should be split in two). Let's look at another.

In [109]:
doc = next(docs)
print(f"Number of detected sentences: {sum(1 for _ in doc.sents)}")
print(*doc.sents, sep="\n"+"-"*30+"\n")

Number of detected sentences: 6
I have a Windows Forms application that displays information in a Master-Detail DataGridView, written based on the instructions at https://learn.microsoft.com/en-us/dotnet/framework/winforms/controls/create-a-master-detail-form-using-two-datagridviews.


------------------------------
The data is displaying correctly, and selecting rows on the master DataGridView displays the expected data in the details DataGridView. 


------------------------------
What I am trying to do is pass in an integer when loading the page so that the DataGridViews will display with the right master row selected and the corresponding detail rows displayed. 


------------------------------
So far I can pass in the integer to select the correct Master row, but one still needs to click the row to display the correct details rows. 


------------------------------
Here is the constructor for the form:



In the Load() method, I populate the DGVs and Get Data for them.
-----------

Looks like another sentence could be split. Two out of three docs look like they could have their sentence boundary detection improved. That's enough for me to start building a training data set and tune the `senter` to our needs.
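A training example for sentence boundary detection boils down to a `Doc` whose sentence starts we trust. Here is a minimal sketch of creating one by hand, assuming a shortened version of the Ruby on Rails question above and a hand-picked boundary. The text, the boundary offsets, and the variable names are illustrative assumptions, not the final data format.

```python
import spacy
from spacy.tokens import DocBin

blank_nlp = spacy.blank("en")

# shortened, hand-edited version of the second question body (an assumption)
text = (
    "I am using Ruby on Rails and I have the following Mongo collection, "
    "I have another MYSQL table."
)
# character offsets where we, as annotators, say a sentence starts
sentence_starts = {0, text.index("I have another")}

# tokenize only, then mark every token as a sentence start or not
gold = blank_nlp.make_doc(text)
for token in gold:
    token.is_sent_start = token.idx in sentence_starts

gold_sents = [sent.text for sent in gold.sents]
print(gold_sents)

# DocBin is spaCy's container for serializing collections of Doc objects
doc_bin = DocBin(docs=[gold])
train_bytes = doc_bin.to_bytes()
print(len(doc_bin), len(train_bytes) > 0)
```

Picking boundaries by character offset keeps the annotation independent of how the tokenizer happens to split the text, as long as each offset lands on the start of a token.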