1. What are Corpora?

What is a corpus?
A text corpus is a very large collection of text (often many billions of words) produced by real users of the language and used to analyse how words, phrases and language in general are used. It is used by linguists, lexicographers, social scientists, humanities scholars, experts in natural language processing and specialists in many other fields. A corpus is also used for generating various language databases used in software development, such as predictive keyboards, spell checkers, grammar correction, text/speech understanding systems, text-to-speech modules, machine translation systems and many others.

Types of text corpora
It is not possible to easily classify a corpus into a certain category. Instead, corpora can have features or properties which can be used to group them. The same corpus can have one or more of these features.

Language
Monolingual corpus
A monolingual corpus is the most frequent type of corpus. It contains texts in one language only. The corpus is usually tagged for parts of speech and is used by a wide range of users for various tasks from highly practical ones, e.g. checking the correct usage of a word or looking up the most natural word combinations, to scientific use, e.g. identifying frequent patterns or new trends in language. Sketch Engine contains hundreds of monolingual corpora in dozens of languages.

see also What can Sketch Engine do? and Build your own corpus

Parallel corpus, multilingual corpus
A parallel corpus consists of two or more monolingual corpora. The corpora are translations of each other. For example, a novel and its translation or a translation memory of a CAT tool could be used to build a parallel corpus. The texts in both languages need to be aligned, i.e. corresponding segments, usually sentences or paragraphs, need to be matched. The user can then search for all examples of a word or phrase in one language and the results will be displayed together with the corresponding sentences in the other language. The user can then observe how the search word or phrase is translated.

see also Parallel / Bilingual Concordance and Build a parallel corpus

Comparable corpus
A comparable corpus is one corpus in a set of two or more monolingual corpora, typically each in a different language, built according to the same principles. The content is therefore similar and results can be compared between the corpora even though they are not translations of each other (and therefore they are not aligned). When users search these corpora, they can make use of the fact that the corpora also have the same metadata. Examples of comparable corpora in Sketch Engine are the CHILDES corpora and various corpora made from Wikipedia. The Araneum corpora are comparable too.

see comparable corpora CHILDES corpora and corpora from Wikipedia

Time
Diachronic corpus
A diachronic corpus is a corpus containing texts from different periods and is used to study the development or change in language. Sketch Engine allows searching the corpus as a whole or restricting the search to selected time intervals. In addition, there is a specialized diachronic feature called Trends, which identifies words whose usage changes the most over the selected period of time.

see also Trends – diachronic analysis

Synchronic corpus
The opposite is a synchronic corpus whose texts come from the same point of time. It is a snapshot of language in one moment. The enTenTen family of corpora are such snapshots because their content is collected within a couple of months.

Currentness
Static corpus
A static corpus (also called a reference corpus, although this term refers to something else in Sketch Engine) is a corpus whose development is complete. The content of the corpus does not change. Most corpora are static corpora. The benefit of a corpus that does not change is that the results of the analysis do not change either, which is important in many scenarios.

Monitor corpus
A monitor corpus is used to monitor change in language. It is a corpus which is regularly (or even continuously) updated: new texts are added as they are produced. The results of searches change because the content of the corpus grows all the time.

The Timestamped corpus in Sketch Engine is an example of a monitor corpus.

More features
Learner corpora
A learner corpus is a corpus of texts produced by learners of a language. The corpus is used to study the mistakes and problems learners have when learning a foreign language. Sketch Engine allows for learner corpora to be annotated for the type of error and provides a special interface to search either for the error itself, for the error correction, for the error type or for a combination of the three options.

see also Setting up a learner corpus

Error-annotated corpus
These corpora contain texts produced by learners of a language or by translators. The errors are annotated and can be used to study the types of errors different groups of learners or translators make.

see also Setting up a learner corpus

Specialized corpus
A specialized corpus contains texts limited to one or more subject areas, domains, topics etc. Such a corpus is used to study how the specialized language is used. The user can create specialized subcorpora from the general corpora in Sketch Engine.

see Build a subcorpus

Multimedia corpus
A multimedia corpus contains texts which are enhanced with audio or visual materials or other types of multimedia content. For example, the spoken part of the British National Corpus in Sketch Engine has links to the corresponding recordings, which can be played from the Sketch Engine interface.

Other corpora can have videos where the corpus text is spoken or images which show the original manuscript or printed copy of the text.

See BNC, where the spoken part is also available in the audio format and it can be played directly in the Sketch Engine interface.

2. What are Tokens?

Tokens are the building blocks of natural language. Tokenization is a way of separating a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords. Hence, tokenization can be broadly classified into three types: word, character, and subword (character n-gram) tokenization.
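
The three kinds of tokenization can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the regular expression, the sample text and the helper name char_ngrams are made up for this example, and real tokenizers handle many more cases (contractions, hyphens, learned subword vocabularies).

```python
import re

text = "Tokenization splits text."

# Word tokenization: runs of word characters, with punctuation as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character tokenization: every character (including spaces) is a token.
char_tokens = list(text)

def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word (one simple kind of subword unit)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

subword_tokens = char_ngrams("splits", 3)

print(word_tokens)      # ['Tokenization', 'splits', 'text', '.']
print(char_tokens[:5])  # ['T', 'o', 'k', 'e', 'n']
print(subword_tokens)   # ['spl', 'pli', 'lit', 'its']
```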

3. What are Unigrams, Bigrams, Trigrams?

N-grams are sequences of N consecutive words cut out of a sentence. Thus a unigram takes a sentence and gives us all the words in that sentence. A bigram takes a sentence and gives us sets of two consecutive words in the sentence. A trigram gives sets of three consecutive words in a sentence.

Let me explain with an example.

Unigram - [let] [me] [explain] [with] [an] [example]

Bigram - [let me] [me explain] [explain with] [with an] [an example]

Trigram - [let me explain] [me explain with] [explain with an] [with an example]

Hope it explains.

4. How to generate n-grams from text?

N-grams are contiguous sequences of n items in a sentence. N can be 1, 2 or any other positive integer, although usually we do not consider very large N because those n-grams rarely appear in many different places.

When performing machine learning tasks related to natural language processing, we usually need to generate n-grams from input sentences. For example, in text classification tasks, in addition to using each individual token found in the corpus, we may want to add bi-grams or tri-grams as features to represent our documents. This post describes several different ways to generate n-grams quickly from input sentences in Python.

The Pure Python Way
In general, an input sentence is just a string of characters in Python. We can use built-in functions in Python to generate n-grams quickly. Let’s take the following sentence as a sample input:

s = """
    Natural-language processing (NLP) is an area of
    computer science and artificial intelligence
    concerned with the interactions between computers
    and human (natural) languages.
"""
If we want to generate a list of bi-grams from the above sentence, the expected output would be something like the list below (depending on how we want to treat punctuation, the desired output can be different):

[
    "natural language",
    "language processing",
    "processing nlp",
    "nlp is",
    "is an",
    "an area",
    ...
]
The following function can be used to achieve this:

import re

def generate_ngrams(s, n):
    # Convert to lowercase
    s = s.lower()

    # Replace all non-alphanumeric characters with spaces
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)

    # Split on any whitespace (spaces and newlines); split() drops empty tokens
    tokens = s.split()

    # Build n shifted copies of the token list and zip them together,
    # then join each tuple of tokens into one n-gram string
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]
Applying the above function to the sentence, with n=5, gives the following output:

>>> generate_ngrams(s, n=5)
['natural language processing nlp is',
 'language processing nlp is an',
 'processing nlp is an area',
 'nlp is an area of',
 'is an area of computer',
 'an area of computer science',
 'area of computer science and',
 'of computer science and artificial',
 'computer science and artificial intelligence',
 'science and artificial intelligence concerned',
 'and artificial intelligence concerned with',
 'artificial intelligence concerned with the',
 'intelligence concerned with the interactions',
 'concerned with the interactions between',
 'with the interactions between computers',
 'the interactions between computers and',
 'interactions between computers and human',
 'between computers and human natural',
 'computers and human natural languages']
The above function makes use of the zip function, which creates an iterator that aggregates elements from multiple lists (or iterables in general). The blocks of code and comments below offer some more explanation of the usage:

# Sample sentence
s = "one two three four five"

tokens = s.split(" ")
# tokens = ["one", "two", "three", "four", "five"]

sequences = [tokens[i:] for i in range(3)]
# The above will generate sequences of tokens starting
# from different elements of the list of tokens.
# The parameter in the range() function controls
# how many sequences to generate.
#
# sequences = [
#   ['one', 'two', 'three', 'four', 'five'],
#   ['two', 'three', 'four', 'five'],
#   ['three', 'four', 'five']]

trigrams = zip(*sequences)
# The zip function takes the sequences as separate inputs
# (using the * operator, this is equivalent to
# zip(sequences[0], sequences[1], sequences[2])).
# Each tuple it returns will contain one element from
# each of the sequences.
#
# To inspect the content of trigrams, try:
# print(list(trigrams))
# which will give the following:
#
# [
#   ('one', 'two', 'three'),
#   ('two', 'three', 'four'),
#   ('three', 'four', 'five')
# ]
#
# Note: even though the first sequence has 5 elements,
# zip will stop after returning 3 tuples, because the
# last sequence only has 3 elements. In other words,
# the zip function automatically handles the ending of
# the n-gram generation.
Using NLTK
Instead of using pure Python functions, we can also get help from natural language processing libraries such as the Natural Language Toolkit (NLTK). In particular, nltk has the ngrams function, which returns a generator of n-grams given a tokenized sentence. (See the NLTK documentation for nltk.util.ngrams.)

import re
from nltk.util import ngrams

s = s.lower()
s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
tokens = s.split()  # split on any whitespace, dropping empty tokens
# ngrams() yields tuples of tokens, so join each one into a string
output = [" ".join(gram) for gram in ngrams(tokens, 5)]
The above block of code will generate the same output as the function generate_ngrams() shown above.
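
When n-grams are used as features, for example in text classification, we usually need their counts rather than just the list. The sketch below is one possible way to do that with the standard library; the helper name ngram_counts and the sample sentence are assumptions made for this illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count how often each n-gram (as a tuple of tokens) occurs in a token list."""
    return Counter(zip(*[tokens[i:] for i in range(n)]))

tokens = "the cat sat on the mat and the cat slept".split()
bigram_counts = ngram_counts(tokens, 2)

print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```

Keeping the n-grams as tuples makes them usable directly as dictionary keys; they can be joined into strings later if a library expects text features.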

5. Explain Lemmatization

Lemmatization is a linguistic term that means grouping together words with the same root or lemma but with different inflections or derivatives of meaning so they can be analyzed as one item. The aim is to take away inflectional suffixes and prefixes to bring out the word’s dictionary form.

For example, to lemmatize the words “cats,” “cat’s,” and “cats’” means taking away the suffixes “s,” “’s,” and “s’” to bring out the root word “cat.” Lemmatization is used in systems that process or produce language, such as conversational agents, making it important in the field of artificial intelligence (AI) known as “natural language processing (NLP)” or “natural language understanding.”

In general, lemmatization converts words into their base forms. Because stemming also reduces words to a shorter base, the two are often confused.

Differences between Lemmatization and Stemming
In stemming, a computer algorithm cuts off the ending or beginning of the word being analyzed. The cut takes out prefixes and suffixes, which can lead to errors. Take the word “studies” as an example. A stemming algorithm would drop the suffix “es,” arriving at the stem “studi,” which is not an actual word.

Lemmatization, on the other hand, subjects a word like “studies” to a morphological analysis based on a dictionary that the algorithm can consult to produce the correct base form. As such, a lemmatization-capable system would know that “studies” is either the third-person singular present form of the verb “study” or the plural of the noun “study,” and in both cases the lemma is “study.”
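
The difference can be illustrated with a toy example. The sketch below is not how production stemmers or lemmatizers are implemented; the suffix list, the small dictionary LEMMA_DICT and both function names are hypothetical and exist only to contrast blind suffix stripping with a dictionary lookup.

```python
def naive_stem(word):
    """Strip the first matching suffix; no dictionary is consulted."""
    for suffix in ("es", "s"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

# Hypothetical toy dictionary mapping inflected forms to lemmas.
LEMMA_DICT = {"studies": "study", "studied": "study", "studying": "study"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)

print(naive_stem("studies"))     # studi
print(toy_lemmatize("studies"))  # study
```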

Practical Applications of Lemmatization
As we said earlier, lemmatization is a crucial component of NLP. It is widely applied in text mining, which involves text analysis of data written in the natural language. This process allows computers to extract relevant information from a given set of text.

One widely known application of lemmatization is information retrieval for search engines. Lemmatization allows systems to map documents to topics, allowing search engines to display relevant results and even expanding them to include other information that readers may find useful, too.

Lemmatization is also used in sentiment analysis, which includes text preparation before examination. The concept is also applied in document clustering, where users need to extract topics and retrieve information.

Lemmatization is also useful in improving search engine optimization (SEO) results. Search engines like Google employ the technology to provide highly relevant results to users. Note that when users type in queries, a search engine automatically lemmatizes words to make sense of the search term and give relevant and comprehensive results.

Some examples of lemmatization tools currently out in the market include:

BioLemmatizer: Helps computers make sense of biomedical literature.
Lemmatization API: Automatically obtains the root of any given word.
Trinker/Textstem: Functions much like Lemmatization API.

Lemmatization, in a nutshell, is the process of obtaining the root of any word to make sense of a phrase, clause, sentence, or any kind of content.

6. Explain Stemming

Stemming is the process of reducing a word to its stem, the part to which suffixes and prefixes attach, or to the root form of the word, known as a lemma. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).

Stemming is studied in morphology within linguistics, and in information retrieval and extraction within artificial intelligence (AI). It helps extract meaningful information from vast sources such as big data or the Internet, since additional forms of a word related to a subject may need to be searched to get the best results. Stemming is also part of query processing in Internet search engines.

Recognizing, searching and retrieving more forms of words returns more results. When a form of a word is recognized it can make it possible to return search results that otherwise might have been missed. That additional information retrieved is why stemming is integral to search queries and information retrieval.

When a new word is found, it can present new research opportunities. Often, the best results can be attained by using the basic morphological form of the word: the lemma. To find the lemma, stemming is performed by an individual or an algorithm, which may be used by an AI system. Stemming uses a number of approaches to reduce a word to its base from whatever inflected form is encountered.

It can be simple to develop a stemming algorithm. Some simple algorithms will simply strip recognized prefixes and suffixes. However, these simple algorithms are prone to error. For example, such an algorithm can reduce a word like "laziness" to "lazi" instead of "lazy". Such algorithms may also have difficulty with terms whose inflectional forms don't perfectly mirror the lemma, such as "saw" and "see".

Examples of stemming algorithms include:

Lookups in tables of inflected forms of words. This approach requires all inflected forms be listed.

Suffix stripping. Algorithms recognize known suffixes on inflected words and remove them.

Lemmatization. This algorithm collects all inflected forms of a word in order to break them down to their root dictionary form or lemma. Words are broken down into a part of speech (the categories of word types) by way of the rules of grammar.

Stochastic models. This algorithm learns from tables of inflected forms of words. By understanding suffixes, and the rules by which they are applied, an algorithm can stem new words.
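
Two of these approaches, table lookup and suffix stripping, are often combined, with the table covering irregular forms that stripping cannot handle. The following is a minimal sketch of that idea; the table IRREGULAR_FORMS, the suffix list and the minimum stem length are assumptions chosen for this example, not part of any standard algorithm.

```python
# Hypothetical table of irregular inflected forms and their lemmas.
IRREGULAR_FORMS = {"saw": "see", "seen": "see", "went": "go"}

# Hypothetical suffix list, tried in this order (longest first).
SUFFIXES = ("ness", "ing", "ed", "s")

def stem(word):
    """Try the lookup table first, then fall back to suffix stripping."""
    if word in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word]
    for suffix in SUFFIXES:
        # Require a stem of at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(stem("saw"))       # see
print(stem("walked"))    # walk
print(stem("laziness"))  # lazi
```

The last line reproduces the "laziness" error described above: stripping alone cannot restore the "y". The table also ignores part of speech, so the noun "saw" would wrongly be mapped to "see" as well.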

7. Explain Part-of-speech (POS) tagging

Part-of-speech (POS) tagging is the process of converting a sentence from a list of words into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.

Default tagging is a basic step in part-of-speech tagging. It is performed using NLTK's DefaultTagger class, which takes a tag as its single argument. NN is the tag for a singular noun. DefaultTagger is most useful when it is given the most common part-of-speech tag, which is why a noun tag is recommended.



Code #1 : How it works ?

# Loading Libraries
from nltk.tag import DefaultTagger
  
# Defining Tag
tagging = DefaultTagger('NN')
  
# Tagging
tagging.tag(['Hello', 'Geeks'])
Output :

[('Hello', 'NN'), ('Geeks', 'NN')]
Each tagger has a tag() method that takes a list of tokens (usually list of words produced by a word tokenizer), where each token is a single word. tag() returns a list of tagged tokens – a tuple of (word, tag).



How DefaultTagger works ?
DefaultTagger is a subclass of SequentialBackoffTagger and implements the choose_tag() method, which takes three arguments:

list of tokens
index of the current token, to choose the tag.
list of the previous tags
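
To make the role of choose_tag() and of backoff more concrete, here is a simplified pure-Python imitation. It is not NLTK's implementation: the classes ToyDefaultTagger and ToyLookupTagger and the small word-to-tag table are hypothetical, and only the three-argument shape of choose_tag() follows the description above.

```python
class ToyDefaultTagger:
    """Assigns the same tag to every token, like a default tagger."""

    def __init__(self, tag):
        self.tag_name = tag

    def choose_tag(self, tokens, index, history):
        # tokens: all tokens; index: current position; history: tags chosen so far.
        return self.tag_name

    def tag(self, tokens):
        history = []
        for index in range(len(tokens)):
            history.append(self.choose_tag(tokens, index, history))
        return list(zip(tokens, history))


class ToyLookupTagger(ToyDefaultTagger):
    """Uses a word-to-tag table and backs off to another tagger for unknown words."""

    def __init__(self, table, backoff):
        self.table = table
        self.backoff = backoff

    def choose_tag(self, tokens, index, history):
        word = tokens[index].lower()
        if word in self.table:
            return self.table[word]
        return self.backoff.choose_tag(tokens, index, history)


tagger = ToyLookupTagger({"to": "TO", "for": "IN"}, backoff=ToyDefaultTagger("NN"))
print(tagger.tag(["Geeks", "for", "Geeks"]))
# [('Geeks', 'NN'), ('for', 'IN'), ('Geeks', 'NN')]
```

This shows why a default tagger is usually the last link in a chain: more specific taggers answer when they can, and the default supplies the most common tag otherwise.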
 
Code #2 : Tagging Sentences

# Loading Libraries
from nltk.tag import DefaultTagger
  
# Defining Tag
tagging = DefaultTagger('NN')
  
tagging.tag_sents([['welcome', 'to', '.'], ['Geeks', 'for', 'Geeks']])
Output :

[[('welcome', 'NN'), ('to', 'NN'), ('.', 'NN')],
 [('Geeks', 'NN'), ('for', 'NN'), ('Geeks', 'NN')]]
Note: Every tag in the list of tagged sentences (in the above code) is NN as we have used DefaultTagger class.

Code #3 : Illustrating how to untag.

from nltk.tag import untag
untag([('Geeks', 'NN'), ('for', 'NN'), ('Geeks', 'NN')])
Output :

['Geeks', 'for', 'Geeks']

8. Explain Chunking or shallow parsing

Chunking (Shallow Parsing): Understanding Text Syntax and Structures, Part 2
We got introduced to text syntax and structures and took a detailed look at part of speech tagging in part 1 of this tutorial series. In this tutorial, we will learn about phrasal structure and shallow parsing.

Phrasal Structures
A phrase can be a single word or a combination of words based on the syntax and position of the phrase in a clause or sentence. For example, in the following sentence

My dog likes his food.
there are three phrases. "My dog" is a noun phrase, "likes" is a verb phrase, and "his food" is also a noun phrase.

There are five major categories of phrases:

Noun phrase (NP): These are phrases where a noun acts as the head word. Noun phrases act as a subject or object to a verb or an adjective. In some cases a noun phrase can be replaced by a pronoun without changing the syntax of the sentence. Some examples of Noun phrases are "little boy", "hard rock", etc.
Verb phrase (VP): These phrases are lexical units that have a verb acting as the head word. Usually there are two forms of verb phrases. One form has the verb components as well as other entities such as nouns, adjectives, or adverbs as parts of the object. The verb here is known as a finite verb. For example in the sentence “The boy is playing football”, "playing football" is the finite verb phrase. The second form of this includes verb phrases which consist strictly of verb components only. For example, "is playing" in the same sentence is such a verb phrase.
Adjective phrase (ADJP): These are phrases with an adjective as the head word. Their main role is to describe or qualify nouns and pronouns in a sentence, and they will be either placed before or after the noun or pronoun. The sentence, "The cat is too cute" has an adjective phrase, "too cute", qualifying "cat".
Adverb phrase (ADVP): These are phrases where adverb acts as the head word in the phrase. Adverb phrases are used as modifiers for nouns, verbs, or adverbs themselves by providing further details that describe or qualify them. In the sentence "The train should be at the station pretty soon", the adverb phrase "pretty soon" describes when the train would be arriving.
Prepositional phrase (PP): These phrases usually contain a preposition as the head word and other lexical components like nouns, pronouns, and so on. A prepositional phrase acts like an adjective or adverb describing other words or phrases. The phrase "going up the stairs" contains the prepositional phrase "up the stairs", describing the direction of the movement.
These five major syntactic categories of phrases can be generated from words using several rules, utilizing syntax and grammars of different types.
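
As a rough illustration of generating phrases from words with rules, the sketch below groups already-tagged tokens into noun phrases using a single hand-written rule: an optional determiner or possessive, any adjectives, then one or more nouns. The function name chunk_noun_phrases, the rule itself and the hand-assigned Penn Treebank-style tags are assumptions for this example; real shallow parsers use richer grammars or trained models.

```python
def chunk_noun_phrases(tagged_tokens):
    """Group tokens matching: optional determiner/possessive, adjectives, then nouns."""
    chunks, current, has_noun = [], [], False

    def flush():
        # Keep the candidate chunk only if it contains at least one noun.
        nonlocal current, has_noun
        if has_noun:
            chunks.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged_tokens:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag in ("DT", "PRP$"):
            flush()              # a determiner or possessive opens a new candidate chunk
            current.append(word)
        elif tag == "JJ":
            if has_noun:
                flush()          # an adjective after a noun starts a new chunk
            current.append(word)
        else:
            flush()              # any other tag ends the current chunk
    flush()
    return chunks

# Tags assigned by hand for the example sentence used earlier.
sentence = [("My", "PRP$"), ("dog", "NN"), ("likes", "VBZ"),
            ("his", "PRP$"), ("food", "NN"), (".", ".")]
print(chunk_noun_phrases(sentence))  # ['My dog', 'his food']
```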

9. Explain Noun Phrase (NP) chunking

10. Explain Named Entity Recognition

Named entity recognition (NER) — sometimes referred to as entity chunking, extraction, or identification — is the task of identifying and categorizing key information (entities) in text. An entity can be any word or series of words that consistently refers to the same thing. Every detected entity is classified into a predetermined category. For example, an NER machine learning (ML) model might detect the word “super.AI” in a text and classify it as a “Company”.
NER is a form of natural language processing (NLP), a subfield of artificial intelligence. NLP is concerned with computers processing and analyzing natural language, i.e., any language that has developed naturally, rather than artificially, such as with computer coding languages.
This post explores the basics of how NER works, along with some high-level use cases and how you can apply it in your business or project.

How NER works
At the heart of any NER model is a two step process:
Detect a named entity
Categorize the entity
Beneath this lie a couple of things.
Step one involves detecting a word or string of words that form an entity. Each word represents a token: “The Great Lakes” is a string of three tokens that represents one entity. Inside-outside-beginning tagging is a common way of indicating where entities begin and end. We’ll explore this further in a future blog post.
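
A minimal sketch of inside-outside-beginning (IOB) tagging follows. It assumes entities are already known as token spans and only shows how they are encoded as per-token labels; the function name to_iob, the example sentence and the category labels LOC and TIME are illustrative choices, not a fixed standard.

```python
def to_iob(tokens, entities):
    """Label tokens with IOB tags.

    entities: list of (start, end, category) token spans, end exclusive.
    """
    tags = ["O"] * len(tokens)  # O = outside any entity
    for start, end, category in entities:
        tags[start] = f"B-{category}"  # B = beginning of an entity
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"  # I = inside the same entity
    return list(zip(tokens, tags))

tokens = ["She", "sailed", "across", "The", "Great", "Lakes", "in", "2006"]
entities = [(3, 6, "LOC"), (7, 8, "TIME")]  # hypothetical spans and category labels
for token, tag in to_iob(tokens, entities):
    print(token, tag)
```

Here the three tokens of “The Great Lakes” receive B-LOC, I-LOC and I-LOC, so the boundaries of the single entity can be recovered from the tags alone.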
The second step requires the creation of entity categories. Here are some common entity categories:
Person: e.g., Elvis Presley, Audrey Hepburn, David Beckham
Organization: e.g., Google, Mastercard, University of Oxford
Time: e.g., 2006, 16:34, 2am
Location: e.g., Trafalgar Square, MoMA, Machu Picchu
Work of art: e.g., Hamlet, Guernica, Exile on Main St.
These are just a few examples. You can create your own entity categories to suit your task, as well as provide granular rules for which entities belong to which categories in instances of ambiguity or task-specific ontologies.

To learn what is and is not a relevant entity and how to categorize them, a model requires training data. The more relevant that training data is to the task, the more accurate the model will be at completing said task. Train your model on Victorian gothic literature, and it will probably struggle to navigate Twitter.
Once you have defined your entities and your categories, you can use these to label data and create a training dataset (our named entity recognition data program can do this for you automatically). You then use this training dataset to train an algorithm to label your text predictively.
How is NER used?
NER is suited to any situation in which a high-level overview of a large quantity of text is helpful. With NER, you can, at a glance, understand the subject or theme of a body of text and quickly group texts based on their relevancy or similarity.
Some notable NER use cases include:
Human resources: Speed up the hiring process by summarizing applicants’ CVs; improve internal workflows by categorizing employee complaints and questions.
Customer support: Improve response times by categorizing user requests, complaints and questions and filtering by priority keywords.
Search and recommendation engines: Improve the speed and relevance of search results and recommendations by summarizing descriptive text, reviews, and discussions. (Booking.com is a notable success story here.)
Content classification: Surface content more easily and gain insights into trends by identifying the subjects and themes of blog posts and news articles.
Health care: Improve patient care standards and reduce workloads by extracting essential information from lab reports. (Roche is doing this with pathology and radiology reports.)
Academia: Enable students and researchers to find relevant material faster by summarizing papers and archive material and highlighting key terms, topics, and themes. (The EU’s digital platform for cultural heritage, Europeana, is using NER to make historical newspapers searchable.)

How can I use NER?
If you think that your business or project could benefit from NER, it’s pretty easy to start out. There are a number of excellent open-source libraries that can get you going, including NLTK, spaCy, and Stanford NER. Each has its own pros and cons.
But before you begin using one of these libraries to build a model, you will need to produce a relevant labeled dataset to train the model on. That’s where super.AI is there to help. Using our named entity recognition data program, you provide us your raw text and desired entities and categories. We’ll label the text you send and return a high quality training dataset that you can take to train and tailor your NER model.
