Class: MDS271
In the broadest sense, natural language processing (NLP) is the synthesis and processing of human languages. NLP ranges from simple pattern matching with regular expressions to deep neural networks that translate between languages.
NLP research started in the 1950s as a central part of artificial intelligence. Many seminal works were produced in this field, but accuracy and practical success were not achieved because of the ambiguity in languages.
A well-known example from that time --> “The spirit is strong, but the flesh is weak” was incorrectly translated into “The vodka is delicious, but the meat tastes bad.”
From the late 1980s, rule-based methods were gradually replaced by machine learning and statistical methods, which proved to be much more successful. An example of a statistical method: if, in the data, a verb is followed by a noun more often than by another verb, then we put a higher probability on “noun” when we see an unknown or ambiguous word after a verb.
As a result there was a resurgence in NLP technologies. In NLP algorithms, the use of linguistic rules is transformed into the use of features, or linguistic patterns for which statistics are collected and used by machine learning models.
From the 2000s, deep learning methods have overtaken the ML and statistics-based models. Over time, the influence of linguistics on NLP has been weakening.
One of the most fundamental tools for text processing is the regular expression. A regular expression is a formal language for specifying text strings.
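As a small illustration of specifying text strings with a regular expression, here is a sketch using Python's standard `re` module. The sample sentence and both patterns are made up for this example and are not part of the class material.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "The woodchuck saw two Woodchucks near 221B Baker Street in 1887."

# [wW] matches either case of the first letter; s? makes the plural "s" optional.
woodchucks = re.findall(r"[wW]oodchucks?", text)

# \b is a word boundary and \d+ is one or more digits, so only
# stand-alone numbers match ("221B" does not, because "B" follows the digits).
numbers = re.findall(r"\b\d+\b", text)

print(woodchucks)
print(numbers)
```

The character class and the optional `s` show how one short pattern can cover several surface forms of the same word.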
To segment sentences, we can use . ? ! etc. The characters ! and ? are relatively unambiguous, but the period is not.
A period is ambiguous because it can be
- a sentence boundary
- part of an abbreviation like Dr.
- part of a number like 76.87
To segment a sentence we can build a binary classifier that will
- Look at "."
- Decide whether it is EndOfSentence/NotEndOfSentence
- Use a classifier built from handwritten rules, regular expressions, or machine learning
A simple decision tree to predict EOS:
A more sophisticated decision tree can use features such as:
- Case of the word with "."
- Case of the word after "."
- Numeric features
- Length of the word with "."
- Probability(word with "." occurs at EOS)
- Probability(word after "." occurs at beginning of a sentence), e.g. The
Implementing Decision Tree
A decision tree is just a series of if-then statements; the difficult part is choosing the features.
We can think of the questions in a decision tree as features that could also be exploited by any other kind of classifier:
- Logistic regression
- SVM
- Neural nets
- etc
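The if-then view of an end-of-sentence decision tree can be sketched as a few handwritten rules. This is only a minimal sketch: the abbreviation list, the function names, and the specific rules (abbreviation check, case of the next word) are assumptions chosen for illustration, not the tree from the class.

```python
# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc.", "e.g.", "i.e."}


def is_end_of_sentence(word, next_word):
    """Classify a whitespace-separated word as EndOfSentence (True) or not."""
    if word.endswith(("?", "!")):
        return True                      # treated as unambiguous here
    if not word.endswith("."):
        return False                     # also covers 76.87, where "." is internal
    if word.lower() in ABBREVIATIONS:
        return False                     # e.g. "Dr."
    if next_word is None:
        return True                      # last word of the text
    return next_word[0].isupper()        # feature: case of the word after "."


def split_sentences(text):
    """Group words into sentences using the rules above."""
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        next_word = words[i + 1] if i + 1 < len(words) else None
        if is_end_of_sentence(word, next_word):
            sentences.append(" ".join(current))
            current = []
    if current:                          # trailing words without a final mark
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith paid 76.87 dollars. Was it too much? He thinks so."))
```

Rules like these fail on cases such as an abbreviation that really does end a sentence, which is one reason to learn the decision from data instead.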
I do uh main-mainly business data processing
Words like uh are known as filled pauses.
Words like main-mainly are known as fragments.
Seuss's cat in the hat is different from other cats
Lemma: same stem, part of speech, rough word sense
eg, cat and cats → same lemma
wordform → the full inflected surface form
eg cat and cats → different wordform
Type → an element of the vocabulary
Token → an instance of that type in running text
eg
they lay back on the San Francisco grass and looked at the stars and their
→ 15 tokens (or 14, if San Francisco counts as one token)
→ 13 types (or 12)
It depends on how we define our goal
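The token and type counts can be checked with a short sketch. It assumes whitespace tokenization and uses the example sentence in the form ending with "and their", which is the form for which the counts 15 and 13 hold; the underscore trick for treating San Francisco as one token is just an illustrative choice.

```python
sentence = ("they lay back on the San Francisco grass "
            "and looked at the stars and their")

# Tokens: every instance in running text. Types: distinct vocabulary entries.
tokens = sentence.split()
types = set(tokens)
print(len(tokens), len(types))

# Alternative goal: treat the name "San Francisco" as a single token.
merged_tokens = sentence.replace("San Francisco", "San_Francisco").split()
merged_types = set(merged_tokens)
print(len(merged_tokens), len(merged_types))
```

The words "the" and "and" each occur twice, which is why there are two fewer types than tokens.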
Issues in Tokenization →
Tokenization: Language issues →
Instead of white-space segmentation or single-character segmentation, we will use the data to tell us how to tokenize. Subword tokenization is a way in which the tokens can be part of words as well as whole words.
Three common subword tokenization algorithms:
- Byte Pair Encoding (BPE)
- Unigram language modeling tokenization
- WordPiece
All these algorithms have two parts:
- a token learner that takes a raw training corpus and induces a vocabulary
- a token segmenter that takes a raw test sentence and tokenizes it according to that vocabulary.
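The two parts can be sketched for Byte Pair Encoding as follows. This is a minimal, unoptimized sketch: the function names, the "_" end-of-word marker, the tie-breaking rule, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus, num_merges):
    """Token learner: return the ordered list of merges learned from the corpus."""
    # Each word starts as a tuple of characters plus the end-of-word marker "_".
    words = Counter(tuple(word) + ("_",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # ties: the pair seen first wins
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def segment(sentence, merges):
    """Token segmenter: apply the learned merges, in order, to new text."""
    tokens = []
    for word in sentence.split():
        symbols = tuple(word) + ("_",)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


# Toy training corpus, invented for this illustration.
corpus = "low " * 5 + "lowest " * 2 + "newer " * 6 + "wider " * 3 + "new " * 2
merges = learn_bpe(corpus, num_merges=8)
print(segment("newer lower", merges))
```

Here "lower" never occurs in the training corpus, yet it can still be tokenized into the known pieces "low" and "er_", which is the point of subword tokenization.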
Word normalization is the task of putting words/tokens in a standard format, choosing a single normal form for words with multiple forms like USA and US or uh-huh and uhhuh.
- We need to 'normalize' words
- We implicitly define equivalence classes of terms
- Alternative: asymmetric expansion
- Potentially more powerful, but less efficient
Case folding means changing all the words to lower case. One problem is words like US, which is different from us.
Lemmatization reduces inflections or variant forms to the base form
- am, are, is -> be
- car, cars, car's, cars' -> car
Lemmatization has to find the correct dictionary headword.
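A toy sketch of these two normalization steps, lower-casing and mapping to a dictionary headword, is shown below. The lookup table contains only the examples from these notes, and the `normalize` function and its `keep_case` exception set are hypothetical names introduced for illustration; real lemmatizers rely on a full dictionary and morphological analysis.

```python
# Hypothetical toy lemma dictionary built only from the examples in these notes.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}


def normalize(token, keep_case=frozenset({"US"})):
    """Lower-case a token (unless it is a listed exception), then look up its lemma."""
    if token not in keep_case:
        token = token.lower()
    # Unknown words are returned unchanged.
    return LEMMAS.get(token, token)


print([normalize(t) for t in ["Cars", "Is", "US", "us", "car's"]])
```

Keeping an exception set is one simple way to stop the country name US from collapsing into the pronoun us.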
Morphology deals with morphemes, the small meaningful units that make up words. They are of two types: stems and affixes. Stems are the core meaning-bearing units, and affixes are the bits and pieces that attach to stems and often have grammatical functions. For example, in the word stems, stem is the stem and s is the affix.
Stemming reduces terms to their stems. It is a crude way of chopping off affixes, and it is language dependent. For example, automatic and automation are both reduced to automat.
Three ways to do stemming are
- Porter's algorithm
- Lancaster algorithm
- Regex method
Porter's algorithm:
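To give a flavour of how Porter's algorithm works, here is a sketch of just its first step (step 1a), written with regular expressions. It covers only these four suffix rules; the full algorithm has several more steps and conditions, and the function name is an assumption for this example.

```python
import re

# Step 1a of Porter's algorithm as ordered (pattern, replacement) rules.
# The first rule whose suffix matches is applied; the rest are skipped.
STEP_1A = [
    (r"sses$", "ss"),  # caresses -> caress
    (r"ies$", "i"),    # ponies   -> poni
    (r"ss$", "ss"),    # caress   -> caress (protects "ss" from the next rule)
    (r"s$", ""),       # cats     -> cat
]


def porter_step_1a(word):
    """Apply the first matching step-1a suffix rule, or return the word unchanged."""
    for pattern, replacement in STEP_1A:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The output "poni" shows why stemming is called crude: the result is a stem, not necessarily a dictionary word.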
To count the number of words in a given sentence, we need to decide how exactly we are counting: are we counting the unique types or all the tokens?
Heaps' Law = Herdan's Law: |V| = kN^β, where N is the number of tokens, |V| is the vocabulary size (number of types), and often 0.67 < β < 0.75,
i.e. the vocabulary size grows faster than the square root of the number of word tokens.
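Heaps' law is usually written as |V| = k * N^β, with N the number of tokens and |V| the vocabulary size. Taking logs gives a straight line, log|V| = log k + β log N, so k and β can be estimated by a least-squares fit. The sketch below assumes that form; the function names and the synthetic data points (generated from k = 10, β = 0.7) are made up for illustration and are not measurements from a real corpus.

```python
import math


def heaps_vocab_size(n_tokens, k, beta):
    """Vocabulary size predicted by Heaps' law: |V| = k * N**beta."""
    return k * n_tokens ** beta


def fit_heaps(points):
    """Least-squares fit of log|V| = log k + beta * log N; returns (k, beta)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


# Synthetic (N, |V|) points generated from the law itself, not real corpus data.
points = [(n, heaps_vocab_size(n, k=10, beta=0.7)) for n in (1_000, 10_000, 100_000)]
k, beta = fit_heaps(points)
print(round(k, 3), round(beta, 3))
```

With real data, the points would come from counting types after every so many tokens of a corpus, and a fitted β above 0.5 would correspond to growth faster than the square root.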
Some common corpora:
Corpora vary along dimensions like
- Language --> 7097 languages in the world
- Variety --> like African American language varieties
- Code Switching --> switching from one language to another, like "No need to worry, abhi time hai"
- Genre --> newswire, scientific articles
- Author demographics --> writer's age, gender, etc.
How similar two strings are can be measured by their edit distance. The minimum edit distance between two strings is the minimum number of editing operations needed to transform one into the other. The editing operations are
- Insertion
- Deletion
- Substitution
Algorithm for edit distance:
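A common way to compute the minimum edit distance is dynamic programming over a table of prefix distances. The sketch below assumes insertions and deletions cost 1; the substitution cost is left as a parameter (`sub_cost`, a name chosen here), since some presentations charge 1 and others charge 2 for a substitution.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum edit distance between two strings via dynamic programming."""
    n, m = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                      # delete all i characters
    for j in range(1, m + 1):
        dist[0][j] = j                      # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + substitution,
            )
    return dist[n][m]


print(min_edit_distance("intention", "execution"))
print(min_edit_distance("intention", "execution", sub_cost=2))
```

The table has (n+1) x (m+1) cells and each is filled in constant time, so this sketch takes time and space proportional to n times m.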
Performance:
In natural language processing, the term topic means a set of words that “go together”. These are the words that come to mind when thinking about a topic. For example, words like athlete, soccer, and stadium bring the topic "Sports" to mind.
A topic model is one that automatically discovers the topics occurring in a collection of documents.
Since a text can be seen as a sequence, sequence-oriented neural networks like RNNs and LSTMs are used.
What do we need to handle in NLP:
- Morphology
- Syntax
- Semantics/World Knowledge
- Discourse
- Pragmatics
- Multilinguality
In order to handle these, we can use neural networks. Neural networks are tools that help us handle hard problems like these.
- Word2Vec