#### Introduction to Natural Language Processing (NLP)

##### Syntax and Semantics of English Language

###### Syntax:
Refers to the grammatical structure of a sentence, described in terms of units such as noun phrases, verb phrases, determiners, adjectives, nouns, verbs and adverbs.

The goal of understanding the syntax of a language is to help understand its meaning.

###### Semantics:
Refers to the actual meaning of the sentence.

###### Building your own tagger?
- Corpora: 
    - Universal Dependencies corpus (free)
    - Penn Treebank corpus (not free)
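
As a rough illustration of what building a tagger from a tagged corpus involves, here is a minimal sketch of a most-frequent-tag baseline. The three hand-tagged sentences are hypothetical stand-ins for a real corpus, the tag names are borrowed from the Universal Dependencies tag set, and `train_unigram_tagger` and `tag` are names invented for this example. Real taggers also use the surrounding context, which this baseline ignores.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sentences standing in for a real corpus.
TAGGED_SENTENCES = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"), ("loudly", "ADV")],
    [("a", "DET"), ("small", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), ("quietly", "ADV")],
]


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_name in sentence:
            counts[word.lower()][tag_name] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default_tag="NOUN"):
    """Tag each word, falling back to default_tag for unseen words."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]


model = train_unigram_tagger(TAGGED_SENTENCES)
print(tag(["The", "small", "cat", "barks"], model))
# [('The', 'DET'), ('small', 'ADJ'), ('cat', 'NOUN'), ('barks', 'VERB')]
```

Defaulting unseen words to `NOUN` is a common but arbitrary choice made here for simplicity.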

###### Parsing Resources
1. spaCy
    - Python, high accuracy, fast (https://spacy.io/)
2. Stanford CoreNLP
    - Java, high accuracy, medium speed (https://nlp.stanford.edu/software/corenlp.shtml)
3. NLTK
    - Python, low accuracy, fast (https://www.nltk.org)

##### Part of speech tags (POS Tags)
Part of speech tags describe the syntactic role of each word in a sentence. We can infer a word's part of speech tag from the context of the sentence.

###### What to do with POS tags
1. Keyword Extraction: Nouns and noun phrases are often the most significant pieces of information.
2. Entity Extraction: Entities are names of people, places, etc.

##### Process of Keyword Extraction: 
Extract candidate keywords, and rerank their relevance to the document based on a chosen (custom) metric.
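
The two steps above can be sketched as follows. This is a minimal illustration, not a production method: candidates are simply the non-stopword tokens rather than POS-filtered nouns, the ranking metric is raw frequency, and the stopword list, the `extract_keywords` name and the sample feedback text are all assumptions made for the example.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "is", "was", "and", "but", "to", "of", "it", "i", "very"}


def extract_keywords(text, top_n=3):
    """Return the top_n candidate keywords ranked by frequency."""
    # Step 1: extract candidates (lowercased alphabetic tokens minus stopwords).
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = [token for token in tokens if token not in STOPWORDS]
    # Step 2: rank candidates by the chosen metric (here, raw frequency).
    return Counter(candidates).most_common(top_n)


feedback = (
    "The battery is great but the screen is dim. "
    "I replaced the screen and the battery, and the battery life is very good."
)
print(extract_keywords(feedback, top_n=2))
# [('battery', 3), ('screen', 2)]
```

Swapping in a different metric, such as a TF-IDF score, only changes step 2.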

###### Where is Keyword Extraction used?
1. Customer feedback analysis

#### Language Models (LM)
Language models provide a list of probable interpretations of a given sentence.
Grammar and syntax alone yield just one representation of the text, but a sentence can have more than one meaning.

###### Understanding Language Models
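A language model predicts the probability of a word given the words that precede it. Applying this to each symbol in turn gives the probability of a whole sequence, for example the sequence "NLP":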

P(NLP) = p(N) * p(L|N) * p(P|NL)

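###### How do we compute these probabilities?
Conditioning on the full history is impractical, so n-gram models approximate it by looking only at the previous n-1 words:
- Bigram model (previous 1 word)
- Trigram model (previous 2 words)
- 4-gram model (previous 3 words)
- 5-gram model (previous 4 words)
- Etc.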
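
To make the bigram case concrete, the sketch below estimates probabilities by counting word pairs in a tiny corpus and multiplies them to score a sentence. The three-sentence corpus, the `<s>` and `</s>` boundary markers and the function names are assumptions for illustration. No smoothing is applied, so any unseen word pair drives the probability to zero.

```python
from collections import Counter

# Tiny hypothetical training corpus; a real model needs far more text.
CORPUS = [
    "the weather is nice",
    "the weather is cold",
    "i wonder whether it is cold",
]


def train_bigram_counts(sentences):
    """Count bigrams and how often each word appears as the preceding word."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts


def sentence_probability(sentence, context_counts, bigram_counts):
    """Product of p(word | previous word) over the sentence, without smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    probability = 1.0
    for previous, current in zip(tokens, tokens[1:]):
        if context_counts[previous] == 0:
            return 0.0  # the previous word was never seen in training
        probability *= bigram_counts[(previous, current)] / context_counts[previous]
    return probability


context_counts, bigram_counts = train_bigram_counts(CORPUS)
p_correct = sentence_probability("the weather is nice", context_counts, bigram_counts)
p_wrong = sentence_probability("the whether is nice", context_counts, bigram_counts)
print(p_correct > p_wrong)
# True
```

On this corpus "the weather is nice" scores 2/9 while "the whether is nice" scores 0, because "the whether" never occurs in the training text.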

##### Applications of Language Models
1. Spelling correction: the probability of a sentence containing a misspelling is much smaller than that of the correct sentence.
2. Speech recognition: a language model helps choose between words that sound alike, for instance "weather" and "whether", based on the surrounding words.
3. Machine translation: when translating from one language to another, the probability of each candidate word sequence can be used to select the more appropriate translation.
4. Predictive text: by looking at the previous sequence of words, a language model can predict the next word, a feature Google introduced in its Android phone keyboard.

##### Word Embeddings
Word embeddings are vector representations of words that capture their semantics.

A vector is a list of numbers that can capture the meaning of a word.
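
One common way to compare the meaning captured by two vectors is cosine similarity. The sketch below uses three hand-made 3-dimensional vectors, which are purely hypothetical values chosen for illustration. Trained embeddings have many more dimensions and are learned from text. The `cosine_similarity` and `most_similar` helpers are names made up for this example.

```python
import math

# Hypothetical 3-dimensional vectors chosen by hand for illustration.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings):
    """Return the other word whose vector is closest to the given word's vector."""
    return max(
        (other for other in embeddings if other != word),
        key=lambda other: cosine_similarity(embeddings[word], embeddings[other]),
    )


print(most_similar("king", EMBEDDINGS))
# queen
```

A nearest-neighbour lookup of this kind is how embeddings can act as a thesaurus.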

##### Examples of word embeddings
1. Word2Vec
2. GloVe

These models encode what they learn about word usage from large text corpora as dense vectors, commonly with around 300 dimensions.

##### How does Word2Vec work?
Word2Vec is a method to construct such an embedding. It can be obtained using two methods (both involving neural networks): Skip Gram and Continuous Bag of Words (CBOW).

##### CBOW Model
CBOW Model: This method takes the context of each word as the input and tries to predict the word corresponding to the context.

CBOW is faster and has better representations for more frequent words.

##### Skip Gram Model
Skip Gram: We take a target word and try to predict its context.

Skip Gram works well with small amounts of data and is found to represent rare words well.
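
The difference between the two models is easiest to see in the training examples each one is fed. The sketch below only builds those (input, label) pairs from a token list. It does not train a neural network, and the function names and the `window` parameter are assumptions made for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def cbow_examples(tokens, window=2):
    """CBOW: the context words are the input, the target word is the label."""
    return [(context, target) for target, context in context_windows(tokens, window)]


def skip_gram_examples(tokens, window=2):
    """Skip Gram: the target word is the input, each context word is a label."""
    return [
        (target, context_word)
        for target, context in context_windows(tokens, window)
        for context_word in context
    ]


tokens = "the quick brown fox".split()
print(cbow_examples(tokens, window=1))
# [(['quick'], 'the'), (['the', 'brown'], 'quick'), (['quick', 'fox'], 'brown'), (['brown'], 'fox')]
print(skip_gram_examples(tokens, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

CBOW yields one example per position, while Skip Gram yields one example per (target, context word) pair.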

##### Where can word embeddings be used?
1. Used as a thesaurus
2. Topic identification

#### Introduction to Natural Language Processing (NLP)

Natural Language Processing is an automated way to understand and analyze human languages and to extract information from such data by applying algorithms. The data can be text, audio, image or video.

NLP is also described as a field of computer science and AI concerned with extracting linguistic information from the underlying data. It enables machines to derive meaning from human or natural language input.

#### Why NLP
The world is now connected globally due to the advancement of technology and devices. This creates several needs that NLP addresses, including:
- Analyzing tons of data generated in the form of texts, audios, image or videos.
- Identifying various languages and dialects.
- Applying quantitative analysis on huge collections of data
- Handling ambiguities while interpreting data and extracting information.

With the advancement of technology and services, the world is now a global village.

One of the main goals of NLP is to understand various languages, process them and extract information from them.

#### Sentence Analysis

##### Eliminate punctuation and stopwords from the sentence

In [1]:
#Import the required libraries
#If the stopwords corpus is missing, run nltk.download('stopwords') once
import string
from nltk.corpus import stopwords

In the world of text analysis, stopwords usually carry little meaning, e.g. I, me, myself, you, yours, etc.

In [2]:
#View the first 10 stopwords present in the English corpus
stopwords.words('english')[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [3]:
#Create a test sentence to analyze
test_sentence = 'This is my first test string. Wow!! we are doing just fine'

In [4]:
#Eliminate the punctuation character by character and print the remaining characters
no_punctuation = [char for char in test_sentence if char not in string.punctuation]
no_punctuation

['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'm', 'y', ' ',
 'f', 'i', 'r', 's', 't', ' ', 't', 'e', 's', 't', ' ',
 's', 't', 'r', 'i', 'n', 'g', ' ', 'W', 'o', 'w', ' ',
 'w', 'e', ' ', 'a', 'r', 'e', ' ', 'd', 'o', 'i', 'n', 'g', ' ',
 'j', 'u', 's', 't', ' ', 'f', 'i', 'n', 'e']

In [5]:
#Now join the remaining characters back into a whole sentence and print it
no_punctuation = ''.join(no_punctuation)
no_punctuation

'This is my first test string Wow we are doing just fine'

In [6]:
#Split the new sentence into individual words
no_punctuation.split()

['This',
 'is',
 'my',
 'first',
 'test',
 'string',
 'Wow',
 'we',
 'are',
 'doing',
 'just',
 'fine']

In [7]:
#Now eliminate stopwords (build the stopword set once for fast lookups)
english_stopwords = set(stopwords.words('english'))
clean_sentence = [word for word in no_punctuation.split() if word.lower() not in english_stopwords]

In [8]:
#Print the final cleaned sentence
clean_sentence

['first', 'test', 'string', 'Wow', 'fine']
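
The cells above can be folded into one reusable function. This is a sketch under a few assumptions: `clean_text` is a name invented here, punctuation is removed with `str.translate` instead of a character-by-character comprehension, and the stopword set is passed in as a parameter. The small `demo_stopwords` set is hand-picked so the example runs without NLTK.

```python
import string


def clean_text(text, stopword_set):
    """Strip punctuation, split on whitespace, and drop stopwords (case-insensitive)."""
    no_punctuation = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punctuation.split() if word.lower() not in stopword_set]


# Hand-picked stopwords for this demo only; with NLTK installed,
# pass set(stopwords.words('english')) instead.
demo_stopwords = {"this", "is", "my", "we", "are", "doing", "just"}
print(clean_text("This is my first test string. Wow!! we are doing just fine", demo_stopwords))
# ['first', 'test', 'string', 'Wow', 'fine']
```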