### *An Introduction to*
# Natural Language Processing

#### Contributors
Helene Willits,
Shaina Bagri,
Rachel Castellino

## What is NLP?
The textbook definition of Natural Language Processing says that "NLP is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications."

Let's break that down: humans communicate with each other in words (among other things). Sentences, paragraphs, texts, speeches and more - we all use some form of language to share our thoughts with someone else. NLP takes advantage of how pervasive language is and automates the processing of it. Machines learn how we speak, text and talk, and mimic it. They learn our patterns, our slang and our nuances as humans.

We can split NLP into 3 sub-groups of knowledge: mathematical linguistics, computational linguistics and statistical linguistics. Mathematical linguistics focuses on the use of discrete mathematical theory for natural language - think formulas, and analyzing word and sentence vectors to find the similarities and differences between them. Computational linguistics is the modern study of linguistics using the tools of computer science. Statistical linguistics uses statistical processing to study linguistics.

At its core, NLP takes in unstructured text and analyzes it. What is unstructured text? Unstructured text, also known as raw text, is the free, unmodified text you will find in essays, text messages, emails, newspapers and other records of natural human speech. It is then transformed into data that the machine can process.
   

Let's define what syntax and semantics are. Syntax is the arrangement of words and phrases to create sentences. Semantics is concerned with developing meaning from words.

An NLP project typically starts with a handful of basic steps: sentence tokenization, word tokenization, text lemmatization, stemming, stop word removal, bag of words and TF-IDF.

Sentence tokenization is the division of a string of written language into its component sentences. Word tokenization is the division of written language into its component words. Text lemmatization and stemming reduce inflectional forms and related words to a common base. For example, "go", "went" and "going" all boil down to the word "go", just used in different tenses. Stop word removal is the process of filtering out the most common words in a language. The exact list depends on the software, but it typically includes words such as "and", "but" and "the" that add little meaning to a piece of text. Bag of Words describes the occurrence of each word in a document without accounting for the order or structure of the words. TF-IDF, short for term frequency-inverse document frequency, is a statistical measure of how important a word is to a document within a collection of documents.
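
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how production libraries implement them: the stop-word list, the regular expressions, and every function name here are assumptions made for the example, and the TF-IDF variant used is the simple `tf * log(N / df)` form.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real libraries ship much longer ones.
STOP_WORDS = {"and", "but", "the", "a", "on"}

def sentence_tokenize(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence):
    """Lowercase the sentence and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", sentence.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(tokens):
    """Count occurrences of each word, ignoring word order."""
    return Counter(tokens)

def tf_idf(documents):
    """Return one {word: score} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

text = "The cat sat on the mat. The dog chased the cat!"
sentences = sentence_tokenize(text)
docs = [remove_stop_words(word_tokenize(s)) for s in sentences]
bags = [bag_of_words(d) for d in docs]
weights = tf_idf(docs)
```

Here each sentence is treated as its own "document". A word that appears in every document, such as "cat", gets a TF-IDF score of zero, while a word unique to one document scores higher.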


## Basics of Language Syntax and Semantics

There are various language semantics theories that compete in the world of Natural Language Processing.

## Basic History of NLP

## Natural Language Understanding

### Preprocessing
Preprocessing is the process of adding structure to raw text so that it can be understood and analyzed by computers. This mainly consists of defining the properties of the text so that the computer can determine the importance of different words and the relationships between them. This step also helps us to remove any noise in the data and reduce the ambiguity of our resulting model.

### How do we get Computers to Understand Words?
Computers are not very good at understanding words. However, they are built to understand numbers. In order to create complex relationships between words, NLP represents words using number vectors that make it easy to build and analyze the relationships between words.

### Document Preprocessing
There are many ways that engineers perform preprocessing, but many NLP models begin by processing the document(s) as a whole before addressing the words individually.

#### Splitting
The raw text is split into sections so that important sections can be more easily identified.

#### Deduplication
Similar sections of the text are grouped to remove redundancy and start forming relationships.
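
One simple way to spot similar sections is to compare their word sets. The sketch below assumes Jaccard similarity and a hypothetical threshold of 0.8, and it keeps only the first section of each near-duplicate group rather than building the groups themselves. The function names are illustrative.

```python
def jaccard_similarity(a, b):
    """Share of distinct lowercase words that two sections have in common."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty sections are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

def deduplicate(sections, threshold=0.8):
    """Keep a section only if it is not too similar to one already kept."""
    kept = []
    for section in sections:
        if all(jaccard_similarity(section, k) < threshold for k in kept):
            kept.append(section)
    return kept

sections = [
    "NLP turns raw text into data",
    "nlp turns raw text into data",
    "Transformers use self-attention",
]
unique_sections = deduplicate(sections)
```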

#### Stratification
Important parts of the text are selected to be focused on when performing the upcoming analysis.

### Parsing
The next step in our language processing is to parse the data. This means separating the data into various elements such as words, punctuation, phrases, and more.

#### Tokenization
Words and sentences are identified so that they can be processed.

#### Part-of-Speech Tagging
Words are classified based on their grammatical role, giving them tags such as noun, verb, adjective, and more.

#### Stemming / Lemmatization
Words are connected back to their root, so that verbs like *wanting*, *wanted*, and *wants* are all mapped back to the root *want*.
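
As a rough illustration of the idea, the sketch below strips a few common suffixes and falls back on a small lookup table for irregular forms. Both the suffix list and the table are assumptions made for this example. Real stemmers and lemmatizers use far richer rules and dictionaries.

```python
# Hypothetical irregular-form table; real lemmatizers use a full dictionary.
IRREGULAR_LEMMAS = {"went": "go", "gone": "go", "better": "good"}

# Suffixes are checked in this order, so "ing" is tried before "s".
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def naive_lemmatize(word):
    """Look up irregular forms first, then fall back to suffix stripping."""
    word = word.lower()
    return IRREGULAR_LEMMAS.get(word, naive_stem(word))

roots = [naive_lemmatize(w) for w in ["wanting", "wanted", "wants", "went"]]
```

Rules this simple both over- and under-stem. For instance, this sketch leaves "going" unchanged because the remaining stem would be too short, which is one reason practical systems rely on dictionaries.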

#### Dependency Parsing
Relationships between words within a sentence are identified. For example, in the sentence "I love my fluffy dog," Dependency Parsing would identify that the adjective "fluffy" is meant to describe the noun "dog" and that the verb "love" refers to the noun "dog," but would know that there isn't as strong a relationship between the words "love" and "fluffy." This is commonly done by applying deep learning algorithms. For the sake of this notebook, we won't go into the specifics of how these algorithms work.

![python-nlp-parts-speech-tagging-and-named-entity-recognition-1.png](attachment:python-nlp-parts-speech-tagging-and-named-entity-recognition-1.png)

#### Clause Analysis
Identifies clauses within a sentence. This process requires both Supervised Machine Learning and Linguistic Rules in order to separate sentences into blocks. 

Supervised Machine Learning is a process where a model, here a neural network, is given the data to be processed as an input and a set of target outputs that it is expected to produce. The neural network is made up of layers of neurons where each neuron in the first layer is connected to each input and each following layer is completely connected to the previous one. These connections are represented as weight vectors.

![image.png](attachment:image.png)

The weight vector determines how much (and what kind) of the input's data is translated into the neuron. This means that it can determine how important different qualities of the input are to that neuron. A neuron that is trained to recognize a certain feature will have a high value for that feature in its weight vector so that when an input has a high value for that feature it will pass that value on to the neuron. The weight vectors are initially randomized because the neural network will update them as it learns on its own how to identify important features of the input.

Once the neural network has run the input, it compares its results to the target output. If the result is inaccurate, the network performs backpropagation, a process where it works backwards through its layers (from the outputs to the inputs) to update the weight vectors in the neurons according to the inaccuracies in the output. At the end of Supervised Machine Learning, the network can be used as a model to predict the expected outputs.
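
To make the training loop concrete, here is a minimal sketch with a single neuron, the simplest case of the process described above. It assumes a sigmoid activation, a cross-entropy loss, and a toy dataset (the logical OR function). None of these come from a specific NLP system. With only one layer there is nothing to propagate further back, so the weight update below is the whole backward pass.

```python
import math
import random

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def train_neuron(inputs, targets, epochs=2000, learning_rate=0.5, seed=0):
    """Train one sigmoid neuron with per-example gradient descent."""
    rng = random.Random(seed)
    # Weights start out random; training adjusts them.
    weights = [rng.uniform(-1, 1) for _ in inputs[0]]
    bias = rng.uniform(-1, 1)
    for _ in range(epochs):
        for features, target in zip(inputs, targets):
            output = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            # Error signal for a sigmoid neuron under cross-entropy loss.
            delta = output - target
            # Move each weight against its share of the error.
            weights = [w - learning_rate * delta * x for w, x in zip(weights, features)]
            bias -= learning_rate * delta
    return weights, bias

def predict(weights, bias, features):
    """Run the trained neuron on one input."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

# Toy supervised dataset: inputs paired with the target outputs (logical OR).
inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [0, 1, 1, 1]
weights, bias = train_neuron(inputs, targets)
predictions = [round(predict(weights, bias, x)) for x in inputs]
```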

### Analyzing
Now we are ready to analyze our data to find relationships between topics and trends within the data. In this step, we discover more about the text as a whole. There are various methods of analysis that are commonly used:

#### Singular Value Decomposition
The frequency of each word in each document is recorded in a matrix, which is then factored with singular value decomposition so that the model can determine the importance of the topics mentioned.

#### Categorization
Similar pieces of text are grouped together into categories using supervised machine learning or by following a pre-made set of categorization rules.

![image.png](attachment:image.png)

#### Sentiment Analysis
The text is categorized by sentiment, often into buckets such as positive, negative, and neutral sentiment.
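
A very small lexicon-based classifier shows the bucketing idea. The word lists below are hypothetical and tiny, and the approach of counting positive and negative words is only one of several ways to do this. Trained models usually replace the hand-written lists.

```python
import re

# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}

def classify_sentiment(text):
    """Return 'positive', 'negative' or 'neutral' from simple word counts."""
    words = re.findall(r"[a-z']+", text.lower())
    # Each positive word adds 1 to the score, each negative word subtracts 1.
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

labels = [
    classify_sentiment("I love my fluffy dog"),
    classify_sentiment("This movie was terrible"),
    classify_sentiment("The train arrives at noon"),
]
```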

![image.png](attachment:image.png)

#### Word Embedding
Uses a spatial model to plot words so that words that are used in similar ways are closer together.
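
"Closer together" is often measured with cosine similarity between word vectors. The sketch below uses made-up three-dimensional vectors purely for illustration. Real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

# Hypothetical hand-picked vectors; real embeddings are learned, not written by hand.
EMBEDDINGS = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word):
    """Return the other word whose vector points in the most similar direction."""
    return max(
        (other for other in EMBEDDINGS if other != word),
        key=lambda other: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[other]),
    )

nearest_to_dog = most_similar("dog")
```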

As you can see, Natural Language Processing gives Engineers a variety of ways to define, categorize, and relate data from our text. At this point, we have created semi-structured data. Now that we know what was expressed, we can develop a model that understands the underlying meaning.

## Extracting Information
Natural Language Understanding is the process that predicts the meaning of processed text. Engineers determine the intent of a message using the document's context and various methods of reducing ambiguity. This is a harder task than Natural Language Generation because of the unpredictable nature of the input text.

#### Noun Groups
Identifies which noun in a sentence is the subject and uses context to determine the meaning of that subject.

#### Entity Detection
Identifies nouns as names of people, groups, places, and more in order to better understand their relationships with other words.

#### Sentiment Analysis
This version of sentiment analysis identifies the sentiment of a statement and determines its relationship with the entities in the text.

#### Semantic Role Labeling
Now that the words have been individually categorized, we can develop relationships between them with Semantic Role Labeling. For each verb, the model identifies:
- the entity that performs the action
- the entity that receives the action

It also identifies further roles if they apply to the verb, such as:
- beneficiaries of the action

#### Graph-Based Parsing
Graph-based parsing is a traditional method of message interpretation in NLU.

Using the elements of word identification that we developed so far, important concepts in the text are identified and represented as nodes. These nodes are then structured into a directed graph with edges between every node. Engineers then use algorithmic approaches to determine the relationships between nodes that represent syntactic, semantic, and topic relationships in order to predict which relationships are important.

## Natural Language Generation

### What is Natural Language Generation (NLG)?

Natural language generation is a process that transforms structured data into human-readable text.

### Stages of NLG Process

There are a variety of ways to break down the NLG process into different stages, but the following are among the more common. These stages help provide a step-by-step understanding of the concept of natural language generation.

#### Content Determination

First, data typically contains much more information than is needed to generate the document, so it is important to establish limits for the content to determine which data is needed.

#### Data Interpretation

Then, the analyzed data needs to be interpreted and put into context. This typically happens through machine learning techniques which recognize patterns in the processed data.

#### Document Planning

After the data is interpreted, it needs to be organized in order to create a narrative structure and a document plan. This typically results in a general document structure or template.

#### Sentence Aggregation

This stage is also typically referred to as microplanning. It involves choosing expressions and words within each sentence and combining sentences together based on their relevance.

#### Grammaticalization

After the sentences have been clustered, the process needs to make sure that they follow correct grammar, spelling, and punctuation. Additionally, they need to follow syntax, morphology, and orthography rules.

#### Language Implementation

Finally, the data is input into the previously generated templates and the formatting of the document is checked to make sure it is done correctly.

### Different Approaches to NLG

The two major approaches to NLG are using templates and creating documents dynamically. The following approaches show examples of these and how the approaches have built on themselves over the years.

#### Simple Gap-Filling Approach

The simple gap-filling approach is one of the oldest approaches. It uses a template system in order to generate texts. This works best for texts that have a predefined structure and simply need a small amount of data to be filled in.
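
A gap-filling template can be as simple as a string with named placeholders. The sketch below uses Python's standard `string.Template`. The weather sentence and the field names are invented for this example.

```python
from string import Template

# Hypothetical template with three gaps to fill in.
REPORT_TEMPLATE = Template("$city will be $condition today with a high of $high degrees.")

def fill_template(record):
    """Substitute the values of one structured record into the template."""
    return REPORT_TEMPLATE.substitute(record)

report = fill_template({"city": "Paris", "condition": "sunny", "high": 75})
```

The structure of the sentence never changes and only the data does, which is why this approach suits texts with a predefined layout.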

#### Scripts or Rules-Producing Text

The above approach was expanded by incorporating general-purpose programming constructs through either a scripting language or business rules. The scripting approach embeds a template within a general-purpose scripting language. An example of this is using web templating languages. The business rules approach is similar to the scripting approach but focuses on writing business rules rather than scripts.

#### Word-Level Grammatical Functions

The scripts or rules-producing text approach was further developed by adding word-level grammatical functions to handle morphology, morphophonology, and orthography rules as well as their exceptions. This ensures the template systems are more complete making it easier for them to generate texts that are grammatically correct.
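
Word-level functions of this kind might look like the sketch below, which handles plural forms and the choice between "a" and "an". The rules and the exception table are deliberately small assumptions for this example. English has many more exceptions than are covered here.

```python
# Hypothetical exception table; a real system would need a much larger one.
IRREGULAR_PLURALS = {"child": "children", "mouse": "mice"}

def pluralize(noun, count):
    """Return the noun in singular or plural form depending on count."""
    if count == 1:
        return noun
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Consonant + "y" becomes "ies" (city -> cities), vowel + "y" just adds "s".
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

def indefinite_article(word):
    """Pick 'an' before a vowel letter and 'a' otherwise (spelling-based only)."""
    return "an" if word[0].lower() in "aeiou" else "a"

def describe(count, noun):
    """Build a short phrase such as '3 cities' from a count and a noun."""
    return f"{count} {pluralize(noun, count)}"

phrases = [describe(1, "dog"), describe(2, "box"), describe(3, "city"), describe(2, "child")]
```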

#### Dynamic Sentence Generation

This approach exemplifies the transition from templates to dynamically created documents. It builds on the previous approach by using representations of the desired linguistic structure or the intended meaning to dynamically create sentences. Additionally, the system can linguistically "optimize" sentences.

#### Dynamic Document Creation

Dynamic document creation, then, uses the generated sentences to create a document. The document generated and the process for generating it depend on the goal of the text. For example, persuasive texts would be organized differently than informative ones.

### Models for Implementing NLG

A variety of models have been used to implement NLG, and the progression through them can be seen below.

#### Markov Chain

The Markov chain was one of the first algorithms used to implement NLG. It uses the current word and considers the relationships between it and every other unique word in order to predict what the next word in the sentence will be. A common example of this is when smartphones generate suggested next words while you are typing.
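
A word-level Markov chain can be built by recording, for each word, which words were seen to follow it. The sketch below is a minimal version under that assumption. The training sentence, the function names, and the fixed random seed are all illustrative.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed directly after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=8, seed=0):
    """Generate up to `length` words, choosing each next word from the chain."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        followers = chain.get(words[-1])
        if not followers:
            break  # the current word was never followed by anything
        # Repeated followers are more likely to be picked, mirroring their frequency.
        words.append(rng.choice(followers))
    return " ".join(words)

corpus = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(corpus)
sentence = generate(chain, "the")
```

Because the next word depends only on the current one, the generated text is locally plausible but has no memory of how the sentence began.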

<img src="https://3.bp.blogspot.com/-J3zfH_59exo/XDoZzkvKW5I/AAAAAAAAAK4/spLAxPpbY3QKexNxaEfFaJzLjxb_qwrvwCLcBGAs/s640/state.png"/>

<img src="http://2.bp.blogspot.com/-U2fyhOJ7bN8/UJsL23oh3zI/AAAAAAAADRs/wZNWvVR-Jco/s1600/text-markov.png" width="400"/>

#### Recurrent Neural Networks (RNN)

While a Markov chain only uses the immediately previous word to predict the next word, RNNs use all of the previous words they have encountered. This memory allows an RNN to "remember" the background and context of the text, making it more effective at generating language. It does this by iterating through a feedforward network that calculates the probability of the next word and stores the word with the highest probability in memory. However, RNNs struggle to retain words encountered far back in longer sequences and thus end up making predictions based mostly on the most recent words. As such, they have difficulty generating coherent long sentences.
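
The "memory" of an RNN is usually a hidden state vector that is updated at every step. The sketch below shows only that update, assuming the common form `h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)`. The weights are made-up numbers rather than trained values, and the step that turns the hidden state into next-word probabilities is left out.

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh, bias):
    """One recurrent update: combine the new input with the previous hidden state."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_xh[i], x))
            + sum(w * hj for w, hj in zip(w_hh[i], h_prev))
            + bias[i]
        )
        for i in range(len(bias))
    ]

def run_rnn(sequence, w_xh, w_hh, bias):
    """Feed a sequence of input vectors through the RNN and return the final state."""
    h = [0.0] * len(bias)  # the hidden state starts empty
    for x in sequence:
        h = rnn_step(x, h, w_xh, w_hh, bias)
    return h

# Hypothetical weights for 2 inputs and 2 hidden units.
W_XH = [[0.5, -0.3], [0.2, 0.8]]
W_HH = [[0.1, 0.4], [-0.2, 0.3]]
BIAS = [0.0, 0.0]

word_a, word_b = [1.0, 0.0], [0.0, 1.0]
state_ab = run_rnn([word_a, word_b], W_XH, W_HH, BIAS)
state_ba = run_rnn([word_b, word_a], W_XH, W_HH, BIAS)
```

Feeding the same two inputs in a different order produces a different final state, which is how the network keeps track of context that a Markov chain would lose.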

<img src="https://www.researchgate.net/profile/Le-Lu-9/publication/313021062/figure/fig3/AS:688562659917826@1541177539464/An-example-of-a-recurrent-neural-network-language-model.ppm"/>

#### Long Short-Term Memory (LSTM)

As explained above, RNNs are problematic for longer sequences. Long short-term memory addresses this weakness. This variant of the RNN uses a four-layer neural network unit consisting of the cell, the input gate, the output gate, and the forget gate. These gates adjust the information flow of the cell, which allows it to remember or forget words over arbitrary time intervals. For example, the forget gate can recognize that a period changes the context of the sentence, so the current cell state can be discarded. As a result, the network can selectively track only relevant information. However, LSTM memory is still limited in practice due to its high complexity and thus high computational requirements.
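
For reference, the input, output, and forget mechanisms described above are commonly written as the following update equations. The notation is an assumption introduced here rather than something defined in this notebook: $x_t$ is the current input, $h_{t-1}$ the previous output, $c_t$ the unit's internal state, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication.

$$
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(state update)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output)} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

When $f_t$ is close to zero the old state $c_{t-1}$ is dropped, which corresponds to the "forgetting" behavior in the period example.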

The image below shows one unit in an LSTM. Typically there are multiple units lined up that feed into each other, as in the RNN.

<img src="https://miro.medium.com/max/542/1*ULozye1lfd-dS9RSwndZdw.png"/>

#### Transformer

This model is relatively new: it was first introduced in 2017. A set of encoders processes inputs of any length, and a set of decoders returns the generated sentences. As opposed to previous models, the transformer performs a small, constant number of steps, using a self-attention mechanism to represent each word in context without needing to compress the information into a single fixed-length representation. This self-attention mechanism allows the model to handle longer sentences while keeping the number of sequential processing steps low.
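
The core of self-attention can be sketched as scaled dot-product attention: every word vector is compared with every other, and each output is a weighted average of all the vectors. The version below is heavily simplified and assumes the queries, keys, and values are the input vectors themselves. A real transformer first applies learned projections and uses several attention heads. The input vectors are invented for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = vectors."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        # How strongly this word attends to every word in the sequence.
        weights = softmax([dot(query, key) / scale for key in vectors])
        # The new representation is a weighted average of all word vectors.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(len(query))
        ])
    return outputs

# Hypothetical 2-dimensional vectors for a three-word sentence.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(word_vectors)
```

Each output row depends on the whole sequence at once, so no word has to be squeezed through a single running state as in an RNN.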

<img src="https://miro.medium.com/max/3008/1*HAArsaBKNQ0Sbof5X4e70w.png" width="700"/>

<img src="https://d2l.ai/_images/transformer.svg"/>

## Limitations

While it may seem like Natural Language Processing can do just about anything, NLP models are much more restricted than many people realize. NLP models are trained to perform extremely specific tasks. Even within the scope of the tasks that they are designed to perform, NLP models can have huge amounts of error.

## Ethical Implications

Even setting these innate limitations aside, there are limits to what Natural Language Processing should be used for. NLP systems are used widely in everyday life and have an enormous effect on people's interactions with each other, technology, and the services that are available to them. Because of this, it is critical that those who are using NLP consider the implications of the work they do on the world at large and the individuals who are affected by it.

Some of the most critical concerns regarding NLP at the moment are related to its ability to create or propagate discrimination against certain groups of people. With careful ethical consideration, NLP can be used in ways that minimize discrimination and increase equal access for all people.

## Applications of NLP Today

## Resources

https://medium.com/sciforce/a-comprehensive-guide-to-natural-language-generation-dd63a4b6e548

https://research.aimultiple.com/nlg/

https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a

https://medium.com/@gongster/building-a-simple-artificial-neural-network-with-keras-in-2019-9eccb92527b1

https://www.ibm.com/cloud/learn/neural-networks

https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition

https://www.youtube.com/watch?v=uCZ9nAe76Ss

https://www.youtube.com/watch?v=U1yT_4xcglY&t=963s

https://www.youtube.com/watch?v=NWNKiuI8ptc

https://www.youtube.com/watch?v=aircAruvnKk&t=11s