# Natural language processing (NLP)
NLP is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws on many disciplines, including computer science and computational linguistics, in its effort to bridge the gap between human communication and computer understanding.

# NLTK 

NLTK (the Natural Language Toolkit) is a popular Python framework for working with human language data. It includes a set of text processing libraries for classification and semantic reasoning, as well as wrappers for industrial-strength NLP libraries and an active discussion forum.
- The NLTK module is a large toolkit aimed at helping you with the entire natural language processing (NLP) workflow.
- NLTK helps with everything from splitting paragraphs into sentences and sentences into words, to recognizing the part of speech of those words, highlighting the main subjects, and even helping your machine understand what the text is about.

### Installation
- conda install -c anaconda nltk

## Components of NLP
The five main components of natural language processing are:

## Morphological and Lexical Analysis
- The lexicon of a language is its vocabulary: its words and expressions.
- Lexical analysis identifies, analyzes and describes the structure of words.
- It includes dividing a text into paragraphs, sentences and words.
- Individual words are analyzed into their components, and non-word tokens such as punctuation are separated from the words.
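
The splitting steps listed above can be sketched with Python's standard `re` module. This is only an illustration of the idea, not how NLTK implements it; the function names `split_paragraphs` and `split_tokens` and the sample text are made up for this example.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs on one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_tokens(paragraph):
    """Return words and punctuation marks as separate tokens."""
    # \w+(?:'\w+)* keeps contractions such as "Don't" in one piece;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", paragraph)

text = "NLP is fun. Don't you think?\n\nTokens come first!"
for paragraph in split_paragraphs(text):
    print(split_tokens(paragraph))
```

Each paragraph is printed as a list in which punctuation marks such as `.` and `?` appear as their own tokens.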
 
## Semantic Analysis
- Semantic analysis assigns meanings to the structures created by the syntactic analyzer.
- This component maps linear sequences of words onto structures.
- It shows how the words are associated with each other.
## Pragmatic Analysis
- Pragmatic analysis deals with the overall communicative and social context and its effect on interpretation.
- It means abstracting or deriving the meaningful use of language in situations.
- In this analysis, the focus is on reinterpreting what was said in terms of what was actually meant.
## Syntax Analysis
- Words are commonly accepted as the smallest units of syntax.
- Syntax refers to the principles and rules that govern the sentence structure of an individual language.
## Discourse Integration
- It means a sense of the context.
- The meaning of any single sentence depends on the sentences that precede it, and it can in turn affect the meaning of the sentences that follow.

### NLP and writing systems
The kind of writing system used for a language is one of the deciding factors in determining the best approach for text pre-processing. Writing systems can be:

- Logographic: a large number of individual symbols represent words. Examples: Japanese, Mandarin
- Syllabic: individual symbols represent syllables
- Alphabetic: individual symbols represent sounds
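
To see why the writing system matters for pre-processing, the sketch below uses only the standard `unicodedata` module. The helper `script_of` and its labels are assumptions made for this illustration, and it recognises only a handful of scripts.

```python
import unicodedata

def script_of(char):
    """Rough script label based on the start of the Unicode character name."""
    name = unicodedata.name(char, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "logographic (CJK ideograph)"
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "syllabic (kana)"
    if name.startswith(("LATIN", "GREEK", "CYRILLIC")):
        return "alphabetic"
    return "other"

print(script_of("语"))
print(script_of("か"))
print(script_of("a"))

english = "I like natural language processing"
mandarin = "我喜欢自然语言处理"
print(len(english.split()))   # whitespace separates the words
print(len(mandarin.split()))  # no spaces, so the whole sentence stays in one chunk
```

Splitting on whitespace finds five words in the English sentence but returns the Mandarin sentence as a single chunk, which is why logographic text usually needs a dedicated word segmenter.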

### Challenges

- Extracting meaning (semantics) from a text is a challenge.
- NLP depends on the quality of the corpus. If the domain is vast, it is difficult to understand context.
- The approach depends on the character set and the language.

- First step: conda install -c anaconda nltk (or: pip install nltk)
- Second step: import nltk, then nltk.download()

- If pip reports SSL certificate errors: pip install --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org pip nltk

In [1]:
import nltk
#nltk.download()  # run once to open the NLTK downloader and fetch corpora and models

- Behind a proxy: pip install --proxy http://noidasezproxy.corp.exlservice.com:8000 package


### Tokenizing Words & Sentences

Text can be split into sentences with sent_tokenize() and into words with word_tokenize().


In [8]:
from nltk.tokenize import sent_tokenize, word_tokenize

E_TEXT = "Hello Hello, i am Monika Arora"

print(word_tokenize(E_TEXT))
print(sent_tokenize(E_TEXT))

['Hello', 'Hello', ',', 'i', 'am', 'Monika', 'Arora']
['Hello Hello, i am Monika Arora']


In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize

E_TEXT = "Positive thinking is all! , really a matter of habits. If you are; not quite a positive thinker. Change Yourself?"

print(sent_tokenize(E_TEXT))
# sent_tokenize returns a list; here each sentence ends with '!', '?' or '.'

['Positive thinking is all!', ', really a matter of habits.', 'If you are; not quite a positive thinker.', 'Change Yourself?']
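
The output above shows the text being cut after `!`, `.` and `?`, which is why `, really a matter of habits.` becomes a sentence of its own. A naive rule with the same effect on this input can be sketched with the standard `re` module. The function `naive_sent_tokenize` is a made-up name for this illustration, and it is not NLTK's method: `sent_tokenize` relies on a trained Punkt model rather than a single fixed rule.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = ("Positive thinking is all! , really a matter of habits. "
        "If you are; not quite a positive thinker. Change Yourself?")
print(naive_sent_tokenize(text))

# The same fixed rule wrongly splits after an abbreviation.
print(naive_sent_tokenize("Dr. Rao teaches NLP. She uses NLTK."))
```

The second call returns `'Dr.'` as a separate "sentence", the kind of case a fixed punctuation rule cannot handle without extra knowledge about abbreviations.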



In [10]:
# Store the sentences and words in variables so they can be converted to other types

from nltk.tokenize import sent_tokenize, word_tokenize
import numpy as np
 
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
 
phrases = sent_tokenize(data)
words = word_tokenize(data)
print(phrases)
print(words)

['All work and no play makes jack dull boy.', 'All work and no play makes jack a dull boy.']
['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '.']


In [11]:
 

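new_array = np.array(words)  # convert the list of word tokens to a NumPy array
print(new_array)
new_array2 = sent_tokenize(data)  # sent_tokenize returns a plain Python list of sentences
print(new_array2)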

['All' 'work' 'and' 'no' 'play' 'makes' 'jack' 'dull' 'boy' '.' 'All'
 'work' 'and' 'no' 'play' 'makes' 'jack' 'a' 'dull' 'boy' '.']
['All work and no play makes jack dull boy.', 'All work and no play makes jack a dull boy.']
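
Because the tokenizers return ordinary Python lists, the tokens can also be converted to standard-library containers. The sketch below copies the word tokens from the output above so that it runs without NLTK; the variable names `vocabulary`, `counts` and `as_tuple` are chosen for this example only.

```python
from collections import Counter

# Word tokens copied from the word_tokenize output shown above.
words = ['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'dull', 'boy', '.',
         'All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', '.']

vocabulary = sorted(set(words))  # unique tokens
counts = Counter(words)          # token -> number of occurrences
as_tuple = tuple(words)          # immutable copy, usable as a dict key

print(len(words), len(vocabulary))
print(counts["dull"], counts["a"])
```

The 21 tokens reduce to 11 distinct ones, and the counter shows that `dull` occurs twice while `a` occurs only once.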


### Stop Words
- Before a computer can analyze text, we need a way to convert words to values: numbers or signal patterns. The process of converting data to something a computer can understand is referred to as "pre-processing." One of the major forms of pre-processing is filtering out useless data. In natural language processing, useless words (data) are referred to as stop words.

- For now, we'll treat stop words as words that carry little meaning on their own, and we want to remove them.
 
- We can do this easily by storing a list of the words we consider to be stop words. NLTK starts you off with a list of words that it considers to be stop words; you can access it via the NLTK corpus package (nltk.corpus).


In [12]:
a = "I think i that Learning  Data Science will bring a  big leap in your Carrier Profile. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains"
print(a)


I think i that Learning  Data Science will bring a  big leap in your Carrier Profile. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains


In [13]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

a = "I think i that Learning  Data Science will bring a  big leap in your Carrier Profile. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains"

stop_words1 = set(stopwords.words('english'))  # load NLTK's English stop word list (needs nltk.download('stopwords') once)
word_tokens = word_tokenize(a)
print(stop_words1)
print("---------------------")
print(word_tokens)

{'m', 'y', "wasn't", 'those', 'for', 're', "that'll", 'did', 'through', 'ma', 'if', 'off', 'you', 'than', 'just', 'because', "shouldn't", 'having', 'and', "you've", 'this', "mustn't", 'his', 'now', 'or', 'yours', 'when', 'didn', 'were', 'o', 'its', 'how', 'other', 'will', "mightn't", 'very', "wouldn't", 'no', 'down', 'me', 'into', "weren't", 'at', 'then', 'should', 'herself', 'a', 'what', 'can', "shan't", 'with', 'are', 'not', 'until', "aren't", "doesn't", 'wasn', 'too', 'ourselves', 'after', 'themselves', 'to', 'ain', 'there', 'here', 'i', 'yourself', 'in', 'all', 'hasn', "isn't", 'more', 'few', 'above', 'your', 'him', "you'd", 'an', 'mustn', 'theirs', 'so', 'against', 'hers', 'her', 'over', 'only', 'we', 'during', 'he', "don't", 'am', "it's", "hasn't", 'why', 'these', "couldn't", 'same', 'mightn', 'don', 'shouldn', 'most', 'weren', 'that', "you'll", 'have', 'shan', 'himself', 'isn', 'needn', 'hadn', "you're", 'been', 'wouldn', 'but', 'the', 'where', 's', 'some', 'while', 'ours', 'doe

In [None]:
[w for w in word_tokens if w not in stop_words1]

In [14]:
# Keep only the tokens that are not in the stop word set
filtered_sentence = [w for w in word_tokens if w not in stop_words1]
print(filtered_sentence)
# Uncomment to see how many tokens were removed:
#print("The number of words stopped :",(len(word_tokens)-len(filtered_sentence)))

['I', 'think', 'Learning', 'Data', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', '.', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', ',', 'structured', 'unstructured', 'data', ',', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains']
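
Note that `'I'` survives in the output above while the lowercase `'i'` is removed: the entries in NLTK's English list are lowercase and the membership test is case-sensitive. One possible way around this, sketched here with a small made-up stop word set so that it runs without any downloads, is to lowercase each token before the lookup. The helper name `remove_stop_words` and the demo data are assumptions for this example.

```python
def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is in stop_words."""
    stop_words = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stop_words]

# Hypothetical mini stop word list, for illustration only.
demo_stop_words = {"i", "that", "will", "a", "in", "your"}
demo_tokens = ["I", "think", "i", "that", "Learning", "Data", "Science",
               "will", "bring", "a", "big", "leap", "in", "your",
               "Carrier", "Profile", "."]

print(remove_stop_words(demo_tokens, demo_stop_words))
```

The same function should also work with the notebook's own variables, for example `remove_stop_words(word_tokens, stop_words1)`.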


In [15]:
b = ["I", "."]  # custom stop words to add to NLTK's list
stop_words1 = list(stop_words1)
stop_words2 = b
stop_words = stop_words1 + stop_words2  # combined stop word list
word_tokens = word_tokenize(a)

# Keep only the tokens that are not in the combined list
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)

# Uncomment to see how many tokens were removed:
#print("The number of words stopped :",(len(word_tokens)-len(filtered_sentence)))

['think', 'Learning', 'Data', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', ',', 'structured', 'unstructured', 'data', ',', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains']
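
The commas are still present in the output above because only `"."` was added to the custom list. A variation, sketched below under the assumption that all punctuation should be dropped, builds the custom list as a set union and adds every character from `string.punctuation`. The set `base_stop_words` is a hypothetical stand-in for NLTK's list so that the example runs on its own.

```python
import string

# Hypothetical stand-in for NLTK's list so the sketch runs without downloads.
base_stop_words = {"i", "a", "and", "is", "an", "that", "to", "from"}

# Set union keeps lookups fast and avoids duplicate entries.
custom_stop_words = base_stop_words | {"I"} | set(string.punctuation)

tokens = ["I", "think", "Data", "science", "is", "an", "interdisciplinary",
          "field", ",", "and", "that", "."]
filtered = [t for t in tokens if t not in custom_stop_words]

print(filtered)
print("The number of words stopped:", len(tokens) - len(filtered))
```

With the notebook's variables, the equivalent would be along the lines of `set(stop_words1) | {"I"} | set(string.punctuation)`.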
