## Linguistic Terms:
1. Adposition(ADP):
    
    An adposition is a cover term for prepositions and postpositions. It is a member of a closed set of items that:
    a. Occur before or after a complement composed of a noun phrase, noun, pronoun, or clause that functions as a noun phrase.
    b. Form a single structure with the complement to express its grammatical and semantic relation to another unit within a clause.
    
2. Determiner (DET): A determiner is a word or affix that belongs to a class of noun modifiers that expresses the reference, including quantity, of a noun.


3. Complement

    Traditionally, a complement is a constituent of a clause, such as a noun phrase or adjective phrase, that is used to predicate a description of the subject or object of the clause.

4. Nominal subject (nsubj): A nominal subject is a nominal that is the syntactic subject and the proto-agent of a clause. That is, it is in the position that passes typical grammatical tests for subjecthood, and this argument is the more agentive, the do-er, or the proto-agent of the clause.

5. Clausal complement (ccomp)
6. Prepositional modifier (prep)
7. Object of preposition (pobj)
8. Adverbial modifier (advmod)

### Terminology

|No.|Name|Description|
|---|----|-----------|
|1|Tokenization|Segmenting text into words, punctuation marks, etc.|
|2|Part-of-speech(POS) Tagging|Assigning word types to tokens like verb or noun|
|3|Dependency Parsing|Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.|
|4|Lemmatization|Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.|
|5|Sentence Boundary Detection(SBD)|Finding and segmenting individual sentences.|
|6|Named Entity Recognition|Labelling named “real-world” objects, like persons, companies or locations.|
|7|Entity Linking (EL)|Disambiguating textual entities to unique identifiers in a knowledge base.|
|8|Similarity|Comparing words, text spans and documents and how similar they are to each other.|
|9|Text Classification|Assigning categories or labels to a whole document, or parts of a document.|
|10|Rule-based Matching|Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.|


### Stemming and lemmatization
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

1. Stemming: Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.<br>Popular algorithms: Porter stemmer, Paice/Husk stemmer.
2. Lemmatization: Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
3. Stemmers use language-specific rules, but they require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words.
4. A lemmatizer is a natural language processing tool that performs full morphological analysis to accurately identify the lemma of each word. Full morphological analysis produces at most very modest benefits for retrieval.
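
The contrast between the two approaches can be sketched in a few lines of plain Python. This is a toy illustration, not the Porter or Paice/Husk algorithm: the suffix list, the minimum stem length and the lemma table are all assumptions made up for this example.

```python
# Toy illustration only: real stemmers (e.g. Porter) use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")  # assumed, deliberately crude rule set


def crude_stem(word: str) -> str:
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# A lemmatizer needs vocabulary knowledge; this tiny table stands in for it.
LEMMA_TABLE = {"was": "be", "rats": "rat", "running": "run", "studies": "study"}


def lookup_lemma(word: str) -> str:
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["rats", "was", "running", "studies"]:
    print(w, crude_stem(w), lookup_lemma(w))
```

The stemmer produces non-words such as `runn` and `studi` and cannot relate `was` to `be`, while the lookup only works for words it already knows. That is the trade-off described in the list above.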

Information extraction is the task of automatically extracting structured information from unstructured and/or semi-structured documents. In most cases, this involves processing human-language text by means of NLP.

Typical order of operations in an information extraction task:
1. Named Entity Recognition (NER)
2. Named Entity Linking (NEL)
3. Relation Extraction

NEL is the task of linking entity mentions in text to their corresponding entities in a knowledge base.

#### Statistical Models:
Some of spaCy's features require trained pipelines to be loaded, which enable spaCy to predict linguistic annotations.
Common Components of the trained pipelines:
1. Binary weights for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.
2. Lexical entries in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
3. Data files like lemmatization rules and lookup tables.
4. Word vectors, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
5. Configuration options, like the language and processing pipeline settings and model implementations to use, to put spaCy in the correct state when you load the pipeline.

In [88]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

In [81]:
doc = nlp("Let's go to the mall sometime") #tokens on parsing
for token in doc:
    print(token.text, token.pos_, token.dep_, token.lemma_)
    
#displacy.render(doc, style="dep")



Let VERB ROOT let
's PRON nsubj 's
go VERB ccomp go
to ADP prep to
the DET det the
mall NOUN pobj mall
sometime ADV advmod sometime


In [79]:
def exp(label):
    # Print a label next to spaCy's built-in explanation of it
    print(label, spacy.explain(label))
exp('advmod')

advmod adverbial modifier


### Tokenization
During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language.

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

1. Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.

2. Can a prefix, suffix or infix be split off? For example, punctuation like commas, periods, hyphens or quotes.
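
The two checks above can be sketched as a small loop in plain Python. This is a simplified toy, not spaCy's tokenizer: the exception table and the prefix and suffix characters are assumptions chosen for the example, and infix handling is left out.

```python
# Toy sketch of the two checks described above; NOT spaCy's real tokenizer.
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}  # assumed special cases
PREFIXES = ('"', "(")                                    # assumed prefix characters
SUFFIXES = (",", ".", '"', ")", "!", "?")                # assumed suffix characters


def toy_tokenize(text: str) -> list[str]:
    tokens = []
    for substring in text.split():        # first, split on whitespace
        trailing = []                     # suffixes, collected right to left
        while substring:
            if substring in EXCEPTIONS:   # check 1: an exception rule wins
                tokens.extend(EXCEPTIONS[substring])
                substring = ""
            elif substring.startswith(PREFIXES):  # check 2: split off a prefix
                tokens.append(substring[0])
                substring = substring[1:]
            elif substring.endswith(SUFFIXES):    # check 2: split off a suffix
                trailing.insert(0, substring[-1])
                substring = substring[:-1]
            else:                         # nothing left to split off
                tokens.append(substring)
                substring = ""
        tokens.extend(trailing)
    return tokens


print(toy_tokenize("I don't live in the U.K., sadly."))
```

Because the exception check runs before the suffix check on every pass, `U.K.` survives as one token once the trailing comma has been removed, while the final period of `sadly.` is split off.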


### POS And Tagging
After tokenization, spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes binary data that is produced by showing a system enough examples for it to make predictions that generalize across the language.

### Named Entity
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

In [41]:
text = "When Raman started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."


doc2 = nlp(text)
print("POS Tagging")
for token in doc2:
    print(token.text, token.pos_)
for ent in doc2.ents:  # named entities with character offsets and labels
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
# displacy.render(doc2, style="dep")  # dependency visualization
displacy.render(doc2, style="ent")  # entity visualization

POS Tagging
When ADV
Raman PROPN
started VERB
working VERB
on ADP
self NOUN
- PUNCT
driving VERB
cars NOUN
at ADP
Google PROPN
in ADP
2007 NUM
, PUNCT
few ADJ
people NOUN
outside ADP
of ADP
the DET
company NOUN
took VERB
him PRON
seriously ADV
. PUNCT
Raman 5 10 PERSON
2007 61 65 DATE


### Word Vectors and Similarity
Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec.
To make them compact and fast, spaCy’s small pipeline packages (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors.


Word vectors are stored in a big table in the model and when you look up `cat`, you always get the same vector from this table.

The context-sensitive tensors are dense feature vectors computed by the models in the pipeline while analyzing the text. You will get different vectors for `cat` in different texts. If you use `en_core_web_sm`, the token `cat` in "I have a cat" will not have the same vector as in "The cat is black". Having the context-sensitive tensors available when the model doesn't include word vectors lets the similarity functions work to some degree, but the results are very different from those obtained with word vectors.

For most purposes, you probably want to use the _md or _lg model with word vectors.


In [53]:
nlp = spacy.load("en_core_web_md")
tokens = nlp("dog cat eat banana ksksdd car space")
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
#     print('[', end='')
#     for n in token.vector:
#         print(n, end=',')
#     print(']')

dog True 7.0336733 False
cat True 6.6808186 False
eat True 6.9718246 False
banana True 6.700014 False
ksksdd False 0.0 True
car True 7.149045 False
space True 6.336554 False


In [62]:
doc11 = nlp("dogs and cats are enemies")
doc12 = nlp("cats and dogs are friends") #gives high score
print('similarity between `%s` and `%s` is %s'%(doc11.text, doc12.text,doc11.similarity(doc12)))


similarity between `dogs and cats are enemies` and `cats and dogs are friends` is 0.9462085072523121


### Expectation(s) From similarity results

1. There’s no objective definition of similarity. Whether “I like burgers” and “I like pasta” are similar depends on your application. Both talk about food preferences, which makes them very similar – but if you’re analyzing mentions of food, those sentences are pretty dissimilar, because they talk about very different foods.
2. The similarity of Doc and Span objects defaults to the average of the token vectors. This means that the vector for “fast food” is the average of the vectors for “fast” and “food”, which isn’t necessarily representative of the phrase “fast food”.
3. Vector averaging means that the vector of multiple tokens is insensitive to the order of the words. Two documents expressing the same meaning with dissimilar wording will return a lower similarity score than two documents that happen to contain the same words while expressing different meanings.

By default, the similarity function therefore depends more on which words appear than on intent or context.
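
The effect of vector averaging can be reproduced with a small plain-Python sketch. The three-dimensional vectors below are invented for illustration and are not taken from any spaCy model; only the averaging and the cosine formula are the point.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
VECTORS = {
    "dogs": [0.9, 0.1, 0.0],
    "cats": [0.8, 0.2, 0.1],
    "are": [0.1, 0.1, 0.1],
    "and": [0.1, 0.0, 0.1],
    "enemies": [0.0, 0.9, 0.3],
    "friends": [0.1, 0.8, 0.5],
}


def doc_vector(text):
    """Average the vectors of all whitespace-separated tokens."""
    rows = [VECTORS[word] for word in text.split()]
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same words in a different order: the averaged vectors coincide.
same_words = cosine(doc_vector("dogs and cats"), doc_vector("cats and dogs"))
# Opposite meaning, but four of five words are shared.
opposite = cosine(doc_vector("dogs and cats are enemies"),
                  doc_vector("cats and dogs are friends"))
print(round(same_words, 3), round(opposite, 3))
```

Reordering the words leaves the average unchanged, and swapping a single word moves it only slightly. This is consistent with the high score for the "enemies" and "friends" sentences in the cell above.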

## Pipelines
When you call `nlp` on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps, which is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer.

The processing pipeline consists of one or more pipeline components that are called on the Doc in order. They can contain a statistical model and trained weights, or only make rule-based modifications to the Doc.
<img src="https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg">
Various components of pipelines:

|No.|Name|Component|Creates|Description|
|---|----|---------|-------|-----------|
|1|tokenizer|Tokenizer|Doc|Segment text into tokens.|
|2|tagger|Tagger|Token.tag|Assign part-of-speech tags.|
|3|parser|DependencyParser|Token.head, Token.dep, Doc.sents, Doc.noun_chunks|Assign dependency labels.|
|4|ner|EntityRecognizer|Doc.ents, Token.ent_iob, Token.ent_type|Detect and label named entities.|
|4.1|---|EntityRuler|---|Add entity spans to the Doc using token-based rules or exact phrase matches.|
|5|lemmatizer|Lemmatizer|Token.lemma|Assign base forms.|
|6|textcat|TextCategorizer|Doc.cats|Assign document labels.|
|7|custom|custom components| | |
|8|morphologizer|Morphologizer|---|Predict morphological features and coarse-grained part-of-speech tags.|
|9|senter|SentenceRecognizer|---|Predict sentence boundaries.|
|10|sentencizer|Sentencizer|---|Implement rule-based sentence boundary detection that doesn’t require the dependency parse.|
|11|---|TextCategorizer|---|Predict categories or labels over the whole document.|
|12|---|Tok2Vec|---|Apply a “token-to-vector” model and set its outputs.|
|13|---|TrainablePipe|---|Class that all trainable pipeline components inherit from.|
|14|---|Transformer|---|Use a transformer model and set its outputs.|


#### Order Of Pipeline
The statistical components like the tagger or parser are typically independent and don’t share any data between each other. For example, the named entity recognizer doesn’t use any features set by the tagger and parser, and so on. This means that you can swap them, or remove single components from the pipeline without affecting the others. However, components may share a “token-to-vector” component like Tok2Vec or Transformer.

- Custom components may also depend on annotations set by other components. For example, a custom lemmatizer may need the part-of-speech tags assigned, so it’ll only work if it’s added after the tagger.
- The parser will respect pre-defined sentence boundaries, so if a previous component in the pipeline sets them, its dependency predictions may be different.
- The EntityLinker, which resolves named entities to knowledge base IDs, should be preceded by a component that recognizes entities, such as the EntityRecognizer.
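
The ordering constraint from the first bullet can be sketched with plain functions. This is a toy model of a pipeline, not spaCy's `Language` class: the dict-based "doc", the component names and the tagging rule are all hypothetical.

```python
# Toy pipeline: each component takes a "doc" (here a plain dict) and returns it.
def toy_tokenizer(text):
    return {"words": text.split(), "tags": None, "lemmas": None}


def toy_tagger(doc):
    # Hypothetical rule: words ending in "s" are tagged NOUN, the rest X.
    doc["tags"] = ["NOUN" if w.endswith("s") else "X" for w in doc["words"]]
    return doc


def toy_lemmatizer(doc):
    # Depends on the tagger: it refuses to run without tags.
    if doc["tags"] is None:
        raise ValueError("toy_lemmatizer must be added after toy_tagger")
    doc["lemmas"] = [w[:-1] if t == "NOUN" else w
                     for w, t in zip(doc["words"], doc["tags"])]
    return doc


def run(text, components):
    doc = toy_tokenizer(text)        # the tokenizer always runs first
    for component in components:     # then the components run in order
        doc = component(doc)
    return doc


print(run("cats chase rats", [toy_tagger, toy_lemmatizer])["lemmas"])
```

Calling `run("cats chase rats", [toy_lemmatizer, toy_tagger])` raises the `ValueError`, which mirrors why a lemmatizer that needs part-of-speech tags has to come after the tagger.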

#### Tokenizer 
The tokenizer is a “special” component and isn’t part of the regular pipeline. It also doesn’t show up in nlp.pipe_names. The reason is that there can only really be one tokenizer, and while all other pipeline components take a Doc and return it, the tokenizer takes a `string` of text and turns it into a `Doc`.

#### Matchers
Matchers help you find and extract information from Doc objects based on match patterns describing the sequences you’re looking for. A matcher operates on a Doc and gives you access to the matched tokens in context.

|no.|Name|Description|
|---|----|-----------|
|1|DependencyMatcher|Match sequences of tokens based on dependency trees using Semgrex operators.|
|2|Matcher|Match sequences of tokens based on pattern rules, similar to regular expressions.|
|3|PhraseMatcher|Match sequences of tokens based on phrases.|
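
The idea behind token-based pattern matching can be sketched in plain Python. This toy matcher only supports exact attribute equality and is far simpler than spaCy's `Matcher`; the token dicts and the `toy_match` helper are hypothetical.

```python
# Toy token-pattern matcher; spaCy's Matcher is far richer than this.
def toy_match(tokens, pattern):
    """Return (start, end) index pairs where every pattern dict matches."""
    matches = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(all(tok.get(key) == value for key, value in spec.items())
               for tok, spec in zip(window, pattern)):
            matches.append((start, start + len(pattern)))
    return matches


# Hypothetical pre-annotated tokens (text plus coarse part-of-speech tag).
tokens = [
    {"text": "self", "pos": "NOUN"},
    {"text": "-", "pos": "PUNCT"},
    {"text": "driving", "pos": "VERB"},
    {"text": "cars", "pos": "NOUN"},
]
pattern = [{"pos": "VERB"}, {"pos": "NOUN"}]  # a verb followed by a noun
print(toy_match(tokens, pattern))
```

Each pattern entry describes one token, so a two-entry pattern is tested against every window of two consecutive tokens. Unlike a regular expression over raw text, the pattern can refer to annotations such as the part-of-speech tag.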

#### Container Objects

|no.|Name|Description|
|---|----|-----------|
|1|Corpus|Class for managing annotated corpora for training and evaluation data.|
|2|KnowledgeBase|Storage for entities and aliases of a knowledge base for entity linking.|
|3|Lookups|Container for convenient access to large lookup tables and dictionaries.|
|4|MorphAnalysis|A morphological analysis.|
|5|Morphology|Store morphological analyses and map them to and from hash values.|
|6|Scorer|Compute evaluation scores.|
|7|StringStore|Map strings to and from hash values.|
|8|Vectors|Container class for vector data keyed by string.|
|9|Vocab|The shared vocabulary that stores strings and gives you access to Lexeme objects.|

## Architecture 
Central Data Structures:
1. Language class (text-to-Doc conversion)<br>
   It is used to process the text and turn it into a `Doc`.
2. Vocab object (string store, lexical attributes and word vectors)<br>
   Strings, word vectors and lexical attributes are centralized in the Vocab, so spaCy avoids storing multiple copies of this data. This saves memory and ensures there’s a single source of truth.
3. Doc<br>
   The Doc object owns the sequence of tokens and all their annotations.
   
   <img src="https://spacy.io/architecture-415624fc7d149ec03f2736c4aa8b8f3c.svg" length=500 width=500>
 
The Doc object owns the data, and Span and Token are views that point into it.
The Doc object is constructed by the Tokenizer, and then modified in place by the components of the pipeline. 
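
The "owner and view" relationship can be sketched with two small classes. `ToyDoc` and `ToySpan` are hypothetical stand-ins, not spaCy's classes; the sketch only shows that a view stores offsets into the owner instead of copying its data.

```python
class ToyDoc:
    """Owns the token data."""

    def __init__(self, words):
        self.words = list(words)

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, _ = key.indices(len(self.words))
            return ToySpan(self, start, stop)
        return self.words[key]


class ToySpan:
    """A view: stores only a reference to the doc plus two offsets."""

    def __init__(self, doc, start, end):
        self.doc, self.start, self.end = doc, start, end

    @property
    def text(self):
        return " ".join(self.doc.words[self.start:self.end])


doc = ToyDoc(["Let", "'s", "go", "to", "the", "mall"])
span = doc[4:6]
doc.words[5] = "park"  # modify the owner in place...
print(span.text)       # ...and the view reflects the change
```

Because the span holds no data of its own, anything a pipeline component writes to the document in place is immediately visible through every span that points into it.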

The Language object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.

__orchestrate__: plan or coordinate the elements of (a situation) to produce a desired effect, especially surreptitiously.

Container Objects:

|Sno.|Name|Description|
|----|----|-----------|
|1|__Doc__|A container for accessing linguistic annotations.(Tokens and all annotations)|
|2|DocBin|A collection of `Doc` objects for efficient serialization. Used mostly for training data.|
|3|Example|A collection of training annotations, containing two `Doc` objects: the reference data and the predictions.|
|4|__Lexeme__|An entry in the vocabulary. It’s a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc.|
|5|__Language__|Processing class that turns text into Doc objects. Different languages implement their own subclasses of it. The variable is typically called `nlp`.|
|6|Span|A slice from a `Doc` object.|
|7|SpanGroup|A named collection of spans belonging to a `Doc`.|
|8|Token|An individual token, i.e. a word, punctuation symbol, whitespace, etc.|


Referencing Objects And Documents:
Whenever possible, spaCy tries to store data in a vocabulary, the Vocab, that will be shared by multiple documents.
To save memory, spaCy also encodes all strings to hash values.
Internally, spaCy only “speaks” in hash values.

Because all strings are encoded, the entries in the vocabulary don’t need to include the word text themselves. Instead, they can look it up in the StringStore via its hash value. Each entry in the vocabulary, also called a Lexeme, contains the context-independent information about a word.

In [102]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("when this happend that happened and other than this some of that didn't happen")
for token in doc:
    print(token.text,token.dep_, token.ent_id_,token.dep_)
    
    


when advmod  advmod
this det  det
happend ROOT  ROOT
that nsubj  nsubj
happened relcl  relcl
and cc  cc
other conj  conj
than prep  prep
this pobj  pobj
some nsubj  nsubj
of prep  prep
that pobj  pobj
did aux  aux
n't neg  neg
happen ROOT  ROOT


To make sure each value is unique, spaCy uses a hash function to calculate the hash based on the word string. This also means that the hash for “coffee” will always be the same, no matter which pipeline you’re using or how you’ve configured spaCy.
However, hashes cannot be reversed and there’s no way to resolve 3197928453018144401 back to “coffee”. All spaCy can do is look it up in the vocabulary.
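
The lookup behaviour described above can be sketched with a toy string store. The class and its hash function are assumptions for illustration: it takes the first 8 bytes of SHA-256, which is not the hash function spaCy uses, so the numbers will not match spaCy's.

```python
import hashlib


class ToyStringStore:
    """Maps strings to 64-bit hashes and back via a lookup table."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_string(string):
        # Assumption: first 8 bytes of SHA-256; spaCy uses a different hash.
        digest = hashlib.sha256(string.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "little")

    def add(self, string):
        key = self.hash_string(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        # The hash itself cannot be reversed; we can only look it up.
        return self._by_hash[key]


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])  # found, because this store has seen the string

other = ToyStringStore()
print(coffee_hash == other.hash_string("coffee"))  # same hash in any store
# other[coffee_hash] would raise KeyError: that store never saw "coffee"
```

The hash is a pure function of the string, so it is identical everywhere, but going from hash back to string only works in a store that already contains the string.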


### Serialization
If you have modified the pipeline, vocabulary, vectors and entities, or made updates to the component models, you will want to save your progress, for example everything that’s in your `nlp` object. This process is called serialization. spaCy comes with built-in serialization methods and supports the Pickle protocol.

The serialization methods come in two pairs:

`to_bytes` and `from_bytes`

`to_disk` and `from_disk`
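
A byte-level round trip can be sketched with a toy container that borrows the method names `to_bytes` and `from_bytes`. `ToyVocab` and its JSON encoding are assumptions for illustration; spaCy's own objects use their own binary formats.

```python
import json


class ToyVocab:
    """Hypothetical container holding a sorted list of unique strings."""

    def __init__(self, strings=()):
        self.strings = sorted(set(strings))

    def to_bytes(self):
        # Translate the object's contents into a byte string.
        return json.dumps(self.strings).encode("utf8")

    def from_bytes(self, data):
        # Load the contents into this instance and return it.
        self.strings = json.loads(data.decode("utf8"))
        return self


original = ToyVocab(["coffee", "mall", "cat"])
payload = original.to_bytes()
restored = ToyVocab().from_bytes(payload)
print(restored.strings == original.strings)
```

A `to_disk` and `from_disk` pair would follow the same pattern, writing the payload to a file instead of returning it.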

### Training
spaCy’s tagger, parser, text categorizer and many other components are powered by statistical models. Every “decision” these components make is a prediction based on the model’s current weight values. The weight values are estimated based on examples the model has seen during training.
<img src="https://spacy.io/training-2bc0e13c59784440ecb60ffa82c0783d.svg">
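
The loop of predicting, comparing with the expected label and adjusting weights can be sketched with a tiny perceptron. This is a stand-in technique chosen for brevity, not how spaCy's models are trained; the features and the four training examples are invented.

```python
# Toy perceptron: predicts whether a word is a NOUN (1) or not (0).
def features(word):
    # Hypothetical features: the last letter and a capitalisation flag.
    return ["suffix=" + word[-1], "is_title=" + str(word.istitle())]


def predict(weights, word):
    # The "decision" is a function of the current weight values.
    score = sum(weights.get(f, 0.0) for f in features(word))
    return 1 if score > 0 else 0


def train(examples, epochs=5):
    weights = {}
    for _ in range(epochs):
        for word, gold in examples:
            guess = predict(weights, word)
            if guess != gold:          # update only on mistakes
                step = gold - guess    # +1 or -1
                for f in features(word):
                    weights[f] = weights.get(f, 0.0) + step
    return weights


examples = [("cats", 1), ("rats", 1), ("go", 0), ("took", 0)]  # invented data
weights = train(examples)
print([predict(weights, w) for w, _ in examples])
```

Each mistake nudges the weights of the features that were active, so later predictions on similar words change. With more data and richer features the same feedback loop is what lets a model generalize.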