In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# What is Natural Language Processing?
Natural Language Processing, usually shortened to NLP, is a branch of artificial intelligence that deals with the interaction between computers and humans through natural language.
The ultimate objective of NLP is to read, decipher, understand, and make sense of human language in a way that is useful.
Most NLP techniques rely on machine learning to derive meaning from human language.
A typical voice-based interaction between a human and a machine using Natural Language Processing could go as follows:
1. A human talks to the machine
2. The machine captures the audio
3. The audio is converted to text
4. The text is processed
5. The response data is converted to audio
6. The machine responds to the human by playing the audio file

# What is NLP used for?
Natural Language Processing is the driving force behind the following common applications:
1. Language translation applications such as Google Translate
2. Word processors such as Microsoft Word and writing assistants such as Grammarly, which use NLP to check the grammatical accuracy of texts.
3. Interactive Voice Response (IVR) applications used in call centers to respond to certain user requests.
4. Personal assistant applications such as Google Assistant, Siri, Cortana, and Alexa.


# Why is NLP difficult?
Natural Language Processing is considered a difficult problem in computer science. It is the nature of human language that makes NLP difficult.
The rules that govern how information is passed in natural languages are not easy for computers to understand.
Some of these rules are high-level and abstract; for example, when someone uses a sarcastic remark to convey information.
Others are low-level; for example, using the character “s” to signify the plural of an item.
Comprehensively understanding human language requires understanding both the words and how the concepts are connected to deliver the intended message.
While humans can easily master a language, the ambiguity and imprecision of natural languages are what make NLP difficult for machines.
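
To make the low-level example concrete, here is a minimal sketch of a naive rule that treats every trailing "s" as a plural marker. The function name `naive_singular` and the word list are illustrative assumptions, not part of any library; the point is only that even a simple surface rule misfires on ordinary words.

```python
def naive_singular(word):
    """Strip one trailing 's', assuming it always marks a plural (it does not)."""
    if word.endswith("s") and len(word) > 1:
        return word[:-1]
    return word

# The rule works for "cats" and "boats" but mangles "bus", "is" and "glass".
for w in ["cats", "boats", "bus", "is", "glass"]:
    print(w, "->", naive_singular(w))
```

Handling the exceptions requires knowledge about the words themselves, which is why practical systems combine rules with dictionaries or learned models.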

# How does Natural Language Processing work?
NLP entails applying algorithms to identify and extract the rules of natural language so that unstructured language data is converted into a form that computers can understand.
Given a text, the computer uses algorithms to extract the meaning of each sentence and collect the essential data from it.
Sometimes the computer fails to understand the meaning of a sentence, leading to obscure results.
For example, an often-repeated anecdote, possibly apocryphal, tells of a humorous machine translation between English and Russian in the 1950s.
Here is the biblical sentence that required translation:

“The spirit is willing, but the flesh is weak.” 

Here is the result when the sentence was translated to Russian and back to English:

“The vodka is good, but the meat is rotten.” 

# What are the techniques used in NLP?
Syntactic analysis and semantic analysis are the two main techniques used to complete Natural Language Processing tasks.
Here is a description of how they can be used.
### Syntax
Syntax refers to the arrangement of words in a sentence such that they make grammatical sense.
In NLP, syntactic analysis is used to assess how well a text conforms to grammatical rules.
Computer algorithms apply grammatical rules to a group of words and derive meaning from them.
Here are some syntax techniques that can be used:
* Lemmatization: It entails reducing the various inflected forms of a word to a single dictionary form (the lemma) for easy analysis.
* Morphological segmentation: It involves dividing words into individual units called morphemes.
* Word segmentation: It involves dividing a large piece of continuous text into distinct units.
* Part-of-speech tagging: It involves identifying the part of speech for every word.
* Parsing: It involves undertaking grammatical analysis for the provided sentence.
* Sentence breaking: It involves placing sentence boundaries on a large piece of text.
* Stemming: It involves cutting inflected words down to a stem, which need not be a valid word.


### Semantics
Semantics refers to the meaning conveyed by a text. Semantic analysis is one of the difficult aspects of Natural Language Processing and has not been fully resolved yet.
It involves applying computer algorithms to understand the meaning and interpretation of words and how sentences are structured.
Here are some techniques in semantic analysis:
* Named entity recognition (NER): It involves determining the parts of a text that can be identified and categorized into preset groups. Examples of such groups include names of people and names of places.
* Word sense disambiguation: It involves giving meaning to a word based on the context.
* Natural language generation: It involves using databases to derive semantic intentions and convert them into human language.
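
As an illustration of word sense disambiguation, the sketch below uses a simplified Lesk-style overlap: it picks the sense whose definition shares the most words with the surrounding context. This is one classic approach, not necessarily what any particular library does. The function `simplified_lesk` and the two-sense inventory for "bank" are hypothetical; a real system would draw its definitions from a lexicon such as WordNet.

```python
def simplified_lesk(word, context, sense_glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses[word].items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical sense inventory, written by hand for this example.
SENSES = {
    "bank": {
        "financial": "an institution where people deposit and borrow money",
        "river": "the sloping land beside a river or stream of water",
    }
}

print(simplified_lesk("bank", "she sat on the bank of the river", SENSES))
print(simplified_lesk("bank", "he went to the bank to deposit money", SENSES))
```

Plain word overlap is fragile (it counts function words such as "the" and ignores word forms), so treat this as a picture of the idea rather than a usable disambiguator.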


Source: https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32

# Basic Natural Language Processing with NLTK

In [2]:
import nltk
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Counting the vocabulary

In [3]:
text7

<Text: Wall Street Journal>

In [4]:
len(text7)

100676

In [5]:
sents()

sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .


In [6]:
len(set(text7))

12408
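
The two counts above (100676 tokens, 12408 distinct tokens) can be combined into a single ratio, often called lexical diversity. The helper below is a minimal sketch; the name `lexical_diversity` and the toy sentence are assumptions chosen so the example runs without the NLTK corpora.

```python
def lexical_diversity(tokens):
    """Fraction of tokens that are distinct types (0 < value <= 1)."""
    return len(set(tokens)) / len(tokens)

# Toy token list (hypothetical data): 11 tokens, 8 of them distinct.
tokens = "the cat sat on the mat because the mat was warm".split()
print(round(lexical_diversity(tokens), 3))

# The same ratio computed from the counts reported above for text7.
print(round(12408 / 100676, 4))
```

For text7 the ratio is roughly 0.12, meaning that on average each distinct token is used about eight times. Calling `lexical_diversity(text7)` in the notebook should give the same figure.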

In [7]:
list(set(text7))[:10]

['stockbrokers',
 'Killeen',
 'stretching',
 'anecdotal',
 'S&P',
 'workable',
 'photocopy',
 'lender',
 'Achievement',
 'community']

## Frequency of words

In [8]:
freq = FreqDist(text7)
freq

FreqDist({',': 4885, 'the': 4045, '.': 3828, 'of': 2319, 'to': 2164, 'a': 1878, 'in': 1572, 'and': 1511, '*-1': 1123, '0': 1099, ...})

In [9]:
freq[',']

4885

In [10]:
key = freq.keys()
list(key)[:10]

['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']

In [11]:
freqWords = [words for words in key if len(words)>5 and freq[words]>100]
freqWords

['billion',
 'company',
 'president',
 'because',
 'market',
 'million',
 'shares',
 'trading',
 'program']
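
`FreqDist` builds on Python's `collections.Counter`, so the same counting and filtering can be tried on any list of tokens without loading a corpus. The sketch below uses a small hypothetical token list and scales the thresholds down to match it.

```python
from collections import Counter

# Toy token list (hypothetical data) standing in for text7.
tokens = ["the", "market", "fell", ",", "the", "company", "said", ",",
          "as", "the", "market", "closed"]
counts = Counter(tokens)

# The three most frequent tokens with their counts.
print(counts.most_common(3))

# Same filter as above: long words that occur often (here: more than once).
frequent = [w for w, c in counts.items() if len(w) > 5 and c > 1]
print(frequent)
```

The length filter is a cheap way to skip punctuation and short function words such as "the", which dominate the raw frequency table.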

## Normalization and stemming

* Different forms of the same word

In [12]:
input1 = 'Go go Going Goings Goes'
word1 = input1.lower().split(' ')
word1

['go', 'go', 'going', 'goings', 'goes']

In [13]:
porter = nltk.PorterStemmer()
[porter.stem(i) for i in word1]

['go', 'go', 'go', 'go', 'goe']

## Lemmatization
* Lemmatization reduces a word to a form that is itself a meaningful word.

In [14]:
corpus = nltk.corpus.udhr.words('English-Latin1')
corpus

['Universal', 'Declaration', 'of', 'Human', 'Rights', ...]

In [15]:
# This is still stemming, not lemmatization: note non-word stems such as 'univers'
[porter.stem(t) for t in corpus[:20]]

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of']

In [16]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in corpus[:20]]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of']
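
The contrast between the two outputs above comes down to suffix chopping versus dictionary lookup. The sketch below imitates both with toy stand-ins: `toy_stem`, `toy_lemmatize` and the `LEMMAS` table are hypothetical and far simpler than the Porter stemmer or WordNet.

```python
# Hypothetical mini-dictionary mapping (word, part of speech) to a lemma.
LEMMAS = {
    ("goes", "verb"): "go",
    ("going", "verb"): "go",
    ("went", "verb"): "go",
    ("rights", "noun"): "right",
    ("better", "adj"): "good",
}

def toy_lemmatize(word, pos):
    """Look the word up; fall back to the word itself when it is unknown."""
    return LEMMAS.get((word.lower(), pos), word)

def toy_stem(word):
    """Chop a few common suffixes without consulting any dictionary."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

for word, pos in [("goes", "verb"), ("went", "verb"), ("better", "adj"), ("Rights", "noun")]:
    print(word, "| stem:", toy_stem(word.lower()), "| lemma:", toy_lemmatize(word, pos))
```

Irregular forms such as "went" and "better" are only resolved by the lookup, and the lookup needs the part of speech. NLTK's `lemmatize` likewise takes a `pos` argument that defaults to noun, which is worth keeping in mind when verbs come back unchanged.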

In [17]:
# Tokenization: a plain split on spaces leaves punctuation attached to words
text = 'hey whats going on.'
text.split(' ')

['hey', 'whats', 'going', 'on.']

In [18]:
nltk.word_tokenize(text)

['hey', 'whats', 'going', 'on', '.']
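
To see what a tokenizer adds over `split(' ')`, here is a minimal regular-expression sketch using only the standard library. The function `simple_tokenize` is an assumption for illustration; it is not how `nltk.word_tokenize` is implemented.

```python
import re

def simple_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("hey whats going on."))
print(simple_tokenize("Don't panic!"))
```

The first call separates the final period just as the cell above does. The second shows the limit of the pattern: the contraction is cut at the apostrophe into three pieces, a case that dedicated tokenizers handle with additional rules.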

In [19]:
text12 = "This is the first sentence. A gallon of milk in the Nepal costs Rs.300. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
sentences

['This is the first sentence.',
 'A gallon of milk in the Nepal costs Rs.300.',
 'Is this the third sentence?',
 'Yes, it is!']

In [20]:
len(sentences)

4
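
Sentence splitting looks trivial until a period appears inside a token such as "Rs.300". The sketch below compares two naive rules on the same text using only the standard library; both rules are illustrative assumptions, not the method `nltk.sent_tokenize` uses.

```python
import re

text12 = ("This is the first sentence. A gallon of milk in the Nepal costs Rs.300. "
          "Is this the third sentence? Yes, it is!")

# Rule 1: every period ends a sentence. This cuts "Rs.300" in half.
by_period = [part.strip() for part in text12.split(".") if part.strip()]
print(by_period)

# Rule 2: split only at whitespace that follows '.', '!' or '?'.
by_boundary = re.split(r"(?<=[.!?])\s+", text12)
print(by_boundary)

# Rule 2 still breaks on abbreviations that are followed by a space.
print(re.split(r"(?<=[.!?])\s+", "Dr. Smith arrived. He sat down."))
```

Rule 2 recovers the four sentences here, but it splits "Dr." from "Smith" in the last line, which is the kind of case a trained sentence tokenizer is meant to handle.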

# Advanced Natural Language Processing

## POS (Part-of-Speech) tagging

In [21]:
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [22]:
text13 = nltk.word_tokenize(text)
nltk.pos_tag(text13)

[('hey', 'NN'), ('whats', 'NNS'), ('going', 'VBG'), ('on', 'IN'), ('.', '.')]

In [23]:
text14 = nltk.word_tokenize("Chilling with friends is a fantastic feeling.")
nltk.pos_tag(text14)

[('Chilling', 'VBG'),
 ('with', 'IN'),
 ('friends', 'NNS'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('fantastic', 'JJ'),
 ('feeling', 'NN'),
 ('.', '.')]
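
The tag codes are terse, so a small lookup table helps when reading tagger output. The sketch below covers only the Penn Treebank tags that appear in this notebook's outputs; the `PENN_TAGS` dictionary and the `describe` helper are written by hand for illustration, and `nltk.help.upenn_tagset` remains the fuller reference.

```python
# Penn Treebank tags that appear in this notebook's outputs.
PENN_TAGS = {
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "MD": "modal auxiliary",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "RB": "adverb",
    "VBG": "verb, gerund or present participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
}

def describe(tagged):
    """Attach a readable description to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "punctuation or other"))
            for word, tag in tagged]

# Tagger output copied from the cell above.
tagged = [('Chilling', 'VBG'), ('with', 'IN'), ('friends', 'NNS'), ('is', 'VBZ'),
          ('a', 'DT'), ('fantastic', 'JJ'), ('feeling', 'NN'), ('.', '.')]
for row in describe(tagged):
    print(row)
```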

In [24]:
# Parsing sentence structure
text15 = nltk.word_tokenize("Alice loves Bob")
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")

parser = nltk.ChartParser(grammar)
trees = parser.parse_all(text15)
for tree in trees:
    print(tree)

(S (NP Alice) (VP (V loves) (NP Bob)))


## POS tagging and ambiguity

In [25]:
text18 = nltk.word_tokenize("The old man the boat")
nltk.pos_tag(text18)

[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]
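
The tagger reads "old man" as adjective plus noun, which leaves the sentence without a verb. The intended reading treats "old" as a noun (old people) and "man" as a verb (to crew). The sketch below enumerates every tag sequence allowed by a small hand-written lexicon to show how many readings a tagger has to choose between; the `LEXICON` dictionary is a hypothetical simplification, not data from NLTK.

```python
from itertools import product

# Hypothetical mini-lexicon: each word mapped to the tags it could take.
LEXICON = {
    "the": ["DT"],
    "old": ["JJ", "NN"],    # "the old house" vs. "the old" (old people)
    "man": ["NN", "VBP"],   # "a man" vs. "to man a boat"
    "boat": ["NN"],
}

words = "The old man the boat".split()

# Cartesian product of the per-word options gives every candidate sequence.
candidates = list(product(*(LEXICON[w.lower()] for w in words)))
for tags in candidates:
    print(list(zip(words, tags)))
```

Two ambiguous words already yield four candidate sequences. The one shown in the output above (DT JJ NN DT NN) is among them, as is the grammatical reading (DT NN VBP DT NN); choosing between them requires looking at the sentence as a whole.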

In [26]:
text19 = nltk.word_tokenize("Colorless green ideas sleep furiously")
nltk.pos_tag(text19)

[('Colorless', 'NNP'),
 ('green', 'JJ'),
 ('ideas', 'NNS'),
 ('sleep', 'VBP'),
 ('furiously', 'RB')]