# Natural language processing (NLP)
NLP is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.

# NLTK 

NLTK is a popular Python framework for working with human language data. It includes a set of text processing libraries for classification and semantic reasoning, as well as wrappers for industrial-strength NLP libraries and an active discussion forum.
- The NLTK module is a massive tool kit, aimed at helping you with the entire Natural Language Processing (NLP) methodology. 
- NLTK will aid you with everything from splitting sentences from paragraphs, splitting up words, recognizing the part of speech of those words, highlighting the main subjects, and then even with helping your machine to understand what the text is all about.

### Installation
- conda install -c anaconda nltk

## Components of NLP
The five main components of natural language processing are:

## Morphological and Lexical Analysis
 - The lexicon of a language is its vocabulary: its words and expressions.
 - Morphological and lexical analysis identifies, analyzes, and describes the structure of words.
 - It includes dividing a text into paragraphs, sentences, and words.
 - Individual words are analyzed into their components, and non-word tokens such as punctuation are separated from the words.
 
 ## Semantic Analysis: 
 - Semantic analysis assigns meanings to the structures created by the syntactic analyzer.
 - This component maps linear sequences of words into structures.
 - It shows how the words are associated with each other.
## Pragmatic Analysis: 
- Pragmatic Analysis deals with the overall communicative and social content and its effect on interpretation. 
- It means abstracting or deriving the meaningful use of language in situations. 
- In this analysis, the focus is on reinterpreting what was said in terms of what was actually meant.
 ## Syntax Analysis:
 - Words are commonly accepted as the smallest units of syntax.
 - Syntax refers to the principles and rules that govern the sentence structure of any individual language.
  ## Discourse Integration : 
  - It means a sense of the context.
  - The meaning of any single sentence depends on the sentences that precede it, and it can also affect the meaning of the sentences that follow.

### NLP and writing systems
The kind of writing system used for a language is one of the deciding factors in determining the best approach for text pre-processing. Writing systems can be:

- Logographic: a large number of individual symbols represent words. Examples: Mandarin, Japanese (kanji).
- Syllabic: individual symbols represent syllables.
- Alphabetic: individual symbols represent sounds.
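
One way to see why the writing system matters: splitting on whitespace works for scripts that separate words with spaces, but not for logographic text written without spaces. The sketch below uses only the standard library. The helper name `script_of` is hypothetical, and taking the first word of a character's Unicode name is a rough heuristic, not a real script detector.

```python
import unicodedata

def script_of(ch):
    """Rough script guess: the first word of the character's Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

english = "natural language processing"
chinese = "自然语言处理"  # the same phrase in Chinese, written without spaces

print(english.split())  # whitespace splitting yields three words
print(chinese.split())  # a single chunk: there are no spaces to split on

print(script_of("a"))   # LATIN (alphabetic)
print(script_of("あ"))  # HIRAGANA (syllabic)
print(script_of("自"))  # CJK (logographic)
```

For text like the second string, a dedicated word segmenter is usually needed before any word-level processing can start.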

### Challenges

- Extracting meaning (semantics) from a text is a challenge.
- NLP is dependent on the quality of the corpus. If the domain is vast, it's difficult to understand context.
- There is a dependence on the character set and language.

- First step: install NLTK with `conda install -c anaconda nltk` or `pip install nltk`.
- Second step: run `import nltk`, then `nltk.download()` to open the NLTK data downloader.

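If pip cannot verify the package index's SSL certificates (for example on a restricted network), the hosts can be marked as trusted: `pip install --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org pip nltk`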

In [1]:
import nltk
#nltk.download()

Behind a proxy, pass it to pip with `--proxy` (replace `package` with the package name): `pip install --proxy http://noidasezproxy.corp.exlservice.com:8000 package`


### Tokenizing Words & Sentences

Text can be split into sentences with `sent_tokenize()` and into words with `word_tokenize()`.


In [13]:
from nltk.tokenize import sent_tokenize, word_tokenize

E_TEXT = "Hello! I am Monika Arora .# . I am taking a 'Gen AI' session"
print(type(E_TEXT))
print(E_TEXT)

<class 'str'>
Hello! I am Monika Arora .# . I am taking a 'Gen AI' session


In [14]:
print(word_tokenize(E_TEXT))

['Hello', '!', 'I', 'am', 'Monika', 'Arora', '.', '#', '.', 'I', 'am', 'taking', 'a', "'Gen", 'AI', "'", 'session']


In [15]:
print(sent_tokenize(E_TEXT))

['Hello!', 'I am Monika Arora .# .', "I am taking a 'Gen AI' session"]
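
To make the idea concrete, here is a rough regular-expression approximation of both steps using only the standard library. It is not how NLTK tokenizes (the NLTK functions treat abbreviations, contractions, and quotes with far more care), and the function names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this sketch.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Hello! I am learning NLP. Is it fun?"
print(simple_word_tokenize(sample))
print(simple_sent_tokenize(sample))
```

A pattern this simple would wrongly split after abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained tokenizer for real text.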


In [17]:
from nltk.tokenize import sent_tokenize, word_tokenize

E_TEXT2 = "Positive thinking is all! , really a matter of habits. If you are; not quite a positive thinker. Change Yourself?"
print(E_TEXT2)

Positive thinking is all! , really a matter of habits. If you are; not quite a positive thinker. Change Yourself?


In [20]:
a=word_tokenize(E_TEXT2)

In [19]:
print(word_tokenize(E_TEXT2))

['Positive', 'thinking', 'is', 'all', '!', ',', 'really', 'a', 'matter', 'of', 'habits', '.', 'If', 'you', 'are', ';', 'not', 'quite', 'a', 'positive', 'thinker', '.', 'Change', 'Yourself', '?']


In [21]:
type(a)

list

In [22]:
print(sent_tokenize(E_TEXT2))
# sent_tokenize splits at sentence terminators such as '!', '?' and '.'

['Positive thinking is all!', ', really a matter of habits.', 'If you are; not quite a positive thinker.', 'Change Yourself?']



In [24]:
# Store the tokenized words and sentences, then convert them to NumPy arrays:

#from nltk.tokenize import sent_tokenize, word_tokenize
import numpy as np
 
data = "All work and no play makes jack dull boy. All work and no play makes. Jack a dull boy."
 
phrases = sent_tokenize(data)
words = word_tokenize(data)
print(phrases)
print("----------")
print(words)

['All work and no play makes jack dull boy.', 'All work and no play makes.', 'Jack a dull boy.']
----------
['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'All', 'work', 'and', 'no', 'play', 'makes', '.', 'Jack', 'a', 'dull', 'boy', '.']


In [22]:
type(words)

list

In [25]:
print("------------")

a_1=np.array(words)
print(a_1)
print("------------")

a_2=np.array(phrases)
print(a_2)

------------
['All' 'work' 'and' 'no' 'play' 'makes' 'jack' 'dull' 'boy' '.' 'All'
 'work' 'and' 'no' 'play' 'makes' '.' 'Jack' 'a' 'dull' 'boy' '.']
------------
['All work and no play makes jack dull boy.' 'All work and no play makes.'
 'Jack a dull boy.']


In [26]:
type(a_1)

numpy.ndarray

In [27]:
type(a_2)

numpy.ndarray

In [None]:
 

# Convert both token lists to NumPy arrays
new_array = np.array(words)
print(new_array)
new_array2 = np.array(sent_tokenize(data))
print(new_array2)

### Stop Words
- Before a computer can analyze text, we need a way to convert words into values: numbers or signal patterns. The process of converting data into something a computer can understand is referred to as "pre-processing." One of the major forms of pre-processing is filtering out useless data. In natural language processing, such low-information words are referred to as stop words.

- For now, we'll consider stop words to be words that carry little meaning on their own, and we want to remove them.

- We can do this easily by storing a list of the words we consider to be stop words. NLTK starts you off with its own list of stop words, which you can access via the NLTK corpus module.


In [29]:
a = "I think i that Learning  Data Science will bring a  big leap in your Carrier Profile. Data Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains"
print(a)


I think i that Learning  Data Science will bring a  big leap in your Carrier Profile. Data Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains


In [32]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

a = "I think i that Learning  Data Science will bring a  big leap in your Carrier Profile. Data Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains"

stop_words1 = set(stopwords.words('english'))  # load the English stop-word list (run nltk.download('stopwords') once beforehand)

In [33]:
word_tokens = word_tokenize(a)
print(word_tokens)

['I', 'think', 'i', 'that', 'Learning', 'Data', 'Science', 'will', 'bring', 'a', 'big', 'leap', 'in', 'your', 'Carrier', 'Profile', '.', 'Data', 'Data', 'science', 'is', 'an', 'interdisciplinary', 'field', 'that', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'and', 'systems', 'to', 'extract', 'knowledge', 'and', 'insights', 'from', 'noisy', ',', 'structured', 'and', 'unstructured', 'data', ',', 'and', 'apply', 'knowledge', 'from', 'data', 'across', 'a', 'broad', 'range', 'of', 'application', 'domains']


In [34]:
print(stop_words1)

{'herself', 'myself', "that'll", "shan't", 'our', 'them', 'not', 'by', 'just', 'all', 'him', 'am', 'doing', 'again', 'any', 'isn', 'further', 'some', 'who', 'its', 'he', 'out', 'only', "don't", 'against', 'as', 'too', "hadn't", 'be', 'mightn', 'shan', 'having', 'an', 'y', "mustn't", 'of', 'where', 'doesn', 'very', 'been', 'same', "wouldn't", 'yourselves', 'it', 'yourself', 'itself', 'their', 'will', 'she', 'a', 'between', 'few', 'were', 'during', 'under', 'being', "weren't", 're', 'i', 'hasn', 'before', 'do', 'has', "you've", 'himself', 'll', 'o', 'at', 'such', 'most', 'nor', "should've", "hasn't", "couldn't", 'her', 'with', 'off', 'my', 'above', 'so', 'once', 'which', 'in', 'does', "it's", 'm', 'needn', "you'd", 'but', 'd', 'when', 'won', "she's", "haven't", 'did', 'aren', 'should', 'wouldn', 'other', 'you', 'weren', 'here', "needn't", 'now', 'until', 've', 'both', 'ma', 'are', 'hers', 'because', 'and', 'ourselves', 'your', 'have', 'or', 'through', 'while', 'own', 't', 'was', "wasn't"

In [35]:
print("---------------------")
print(word_tokens)

---------------------
['I', 'think', 'i', 'that', 'Learning', 'Data', 'Science', 'will', 'bring', 'a', 'big', 'leap', 'in', 'your', 'Carrier', 'Profile', '.', 'Data', 'Data', 'science', 'is', 'an', 'interdisciplinary', 'field', 'that', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'and', 'systems', 'to', 'extract', 'knowledge', 'and', 'insights', 'from', 'noisy', ',', 'structured', 'and', 'unstructured', 'data', ',', 'and', 'apply', 'knowledge', 'from', 'data', 'across', 'a', 'broad', 'range', 'of', 'application', 'domains']


In [36]:
[w for w in word_tokens if w not in stop_words1]

['I',
 'think',
 'Learning',
 'Data',
 'Science',
 'bring',
 'big',
 'leap',
 'Carrier',
 'Profile',
 '.',
 'Data',
 'Data',
 'science',
 'interdisciplinary',
 'field',
 'uses',
 'scientific',
 'methods',
 ',',
 'processes',
 ',',
 'algorithms',
 'systems',
 'extract',
 'knowledge',
 'insights',
 'noisy',
 ',',
 'structured',
 'unstructured',
 'data',
 ',',
 'apply',
 'knowledge',
 'data',
 'across',
 'broad',
 'range',
 'application',
 'domains']

In [37]:
filtered_sentence = [w for w in word_tokens if w not in stop_words1]
print(filtered_sentence)
#print(word_tokens)
#print("The number of words stopped :",(len(word_tokens)-len(filtered_sentence)))

['I', 'think', 'Learning', 'Data', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', '.', 'Data', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', ',', 'structured', 'unstructured', 'data', ',', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains']


In [39]:
b = ["I", "."]  # custom stop words to add to NLTK's list
stop_words1=list(stop_words1)
print(stop_words1)

['herself', 'myself', "that'll", "shan't", 'our', 'them', 'not', 'by', 'just', 'all', 'him', 'am', 'doing', 'again', 'any', 'isn', 'further', 'some', 'who', 'its', 'he', 'out', 'only', "don't", 'against', 'as', 'too', "hadn't", 'be', 'mightn', 'shan', 'having', 'an', 'y', "mustn't", 'of', 'where', 'doesn', 'very', 'been', 'same', "wouldn't", 'yourselves', 'it', 'yourself', 'itself', 'their', 'will', 'she', 'a', 'between', 'few', 'were', 'during', 'under', 'being', "weren't", 're', 'i', 'hasn', 'before', 'do', 'has', "you've", 'himself', 'll', 'o', 'at', 'such', 'most', 'nor', "should've", "hasn't", "couldn't", 'her', 'with', 'off', 'my', 'above', 'so', 'once', 'which', 'in', 'does', "it's", 'm', 'needn', "you'd", 'but', 'd', 'when', 'won', "she's", "haven't", 'did', 'aren', 'should', 'wouldn', 'other', 'you', 'weren', 'here', "needn't", 'now', 'until', 've', 'both', 'ma', 'are', 'hers', 'because', 'and', 'ourselves', 'your', 'have', 'or', 'through', 'while', 'own', 't', 'was', "wasn't"

In [40]:
stop_words2 = b  # the custom stop words defined above
print(stop_words2)

['I', '.']


In [41]:
stop_words=stop_words1+stop_words2
print(stop_words)

['herself', 'myself', "that'll", "shan't", 'our', 'them', 'not', 'by', 'just', 'all', 'him', 'am', 'doing', 'again', 'any', 'isn', 'further', 'some', 'who', 'its', 'he', 'out', 'only', "don't", 'against', 'as', 'too', "hadn't", 'be', 'mightn', 'shan', 'having', 'an', 'y', "mustn't", 'of', 'where', 'doesn', 'very', 'been', 'same', "wouldn't", 'yourselves', 'it', 'yourself', 'itself', 'their', 'will', 'she', 'a', 'between', 'few', 'were', 'during', 'under', 'being', "weren't", 're', 'i', 'hasn', 'before', 'do', 'has', "you've", 'himself', 'll', 'o', 'at', 'such', 'most', 'nor', "should've", "hasn't", "couldn't", 'her', 'with', 'off', 'my', 'above', 'so', 'once', 'which', 'in', 'does', "it's", 'm', 'needn', "you'd", 'but', 'd', 'when', 'won', "she's", "haven't", 'did', 'aren', 'should', 'wouldn', 'other', 'you', 'weren', 'here', "needn't", 'now', 'until', 've', 'both', 'ma', 'are', 'hers', 'because', 'and', 'ourselves', 'your', 'have', 'or', 'through', 'while', 'own', 't', 'was', "wasn't"

In [42]:
word_tokens = word_tokenize(a)

filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)

#print(word_tokens)
#print(filtered_sentence)
#print("The number of words stopped :",(len(word_tokens)-len(filtered_sentence)))

['think', 'Learning', 'Data', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', 'Data', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', ',', 'structured', 'unstructured', 'data', ',', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains']
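
In the filtered lists above, 'I' survives until it is added by hand, because NLTK's English list is lowercase and the membership test is case-sensitive. A possible alternative is to lowercase each token for the lookup only and to drop punctuation separately. The sketch below uses a tiny hand-written stop set as a stand-in for `set(stopwords.words('english'))`, so it runs without NLTK data; the token list is likewise a shortened example.

```python
import string

# Hypothetical miniature stop list; in practice this would be
# set(stopwords.words('english')) from NLTK.
stop_words = {"i", "that", "will", "a", "in", "your"}
punctuation = set(string.punctuation)

tokens = ["I", "think", "that", "Learning", "Data", "Science", "will",
          "bring", "a", "big", "leap", "in", "your", "profile", "."]

# Compare in lowercase, but keep the original casing of surviving tokens.
filtered = [
    t for t in tokens
    if t.lower() not in stop_words and t not in punctuation
]
print(filtered)
```

Keeping the stop words in a set rather than a list also makes each membership test faster, which matters on larger corpora.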


### Stemming and Lemmatization

In [50]:
# Import the necessary library
from nltk.stem import PorterStemmer

# Initialize the PorterStemmer
stemmer = PorterStemmer()

# List of example words to be stemmed
words = ["running", "jumping", "easy", "fairies","dancing"]

# Apply stemming to each word
stemmed_words = [stemmer.stem(word) for word in words]

# Print the original words and their stemmed versions
for original, stemmed in zip(words, stemmed_words):
    print(f"Original: {original}, Stemmed: {stemmed}")


Original: running, Stemmed: run
Original: jumping, Stemmed: jump
Original: easy, Stemmed: easi
Original: fairies, Stemmed: fairi
Original: dancing, Stemmed: danc


"running" is stemmed to "run", which is a valid base form of the word.


"jumping" is stemmed to "jump", which is also a valid base form.


"easy" is stemmed to "easi", which is not a valid English word but is a valid stem according to the rules of the algorithm: the Porter stemmer rewrites a final "y" as "i".


"fairies" is stemmed to "fairi", which, like "easi", is not a valid English word but is a valid stem according to the algorithm.


"dancing" is stemmed to "danc": the suffix "ing" is removed and the dropped "e" is not restored.

The algorithm strips common suffixes by rule rather than looking words up in a dictionary, so a stem does not have to be a real word.

In [52]:
# Import the necessary library
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer)
type(lemmatizer)

<WordNetLemmatizer>


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\monika201103\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


nltk.stem.wordnet.WordNetLemmatizer

In [53]:
# List of example words to be lemmatized
words = ["running", "jumps", "easily", "fairly", "better", "best"]

# Apply lemmatization to each word
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

# Print the original words and their lemmatized versions
for original, lemmatized in zip(words, lemmatized_words):
    print(f"Original: {original}, Lemmatized: {lemmatized}")

Original: running, Lemmatized: running
Original: jumps, Lemmatized: jump
Original: easily, Lemmatized: easily
Original: fairly, Lemmatized: fairly
Original: better, Lemmatized: better
Original: best, Lemmatized: best


The output shows the original words and their lemmatized versions using the WordNetLemmatizer from the NLTK library. Here's the breakdown:

- "running" remains "running" because `lemmatize()` treats every word as a noun by default (`pos="n"`), and "running" is a valid noun. Calling `lemmatizer.lemmatize("running", pos="v")` returns "run".

- "jumps" is lemmatized to "jump" because "jump" is the singular form of the noun "jumps".

- "easily" remains "easily" because it is an adverb: the default noun lookup finds no match and returns the word unchanged.

- "fairly" remains "fairly" for the same reason as "easily".

- "better" remains "better" under the default noun setting. With `pos="a"` (adjective) it is lemmatized to "good".

- "best" also remains "best" under the default noun setting.

In summary, the WordNetLemmatizer reduces words to their dictionary form (lemma), but the result depends on the part of speech: without a `pos` argument, every word is treated as a noun, so verbs and adjectives are often returned unchanged.
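
The effect of the part of speech can be shown with a toy dictionary-based lemmatizer. NLTK's `lemmatize()` takes a `pos` argument that defaults to `"n"` (noun); the sketch below mimics only that default. The lookup table is hand-written for illustration and is not WordNet data, and `toy_lemmatize` is a hypothetical helper.

```python
# Hand-written (word, pos) -> lemma table; a stand-in for a real lexicon.
LEMMAS = {
    ("running", "v"): "run",
    ("jumps", "n"): "jump",
    ("jumps", "v"): "jump",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word, pos), word)

for word, pos in [("running", "n"), ("running", "v"),
                  ("better", "n"), ("better", "a")]:
    print(f"{word} ({pos}) -> {toy_lemmatize(word, pos)}")
```

With the noun default, "running" and "better" come back unchanged; passing the matching part of speech yields "run" and "good". In a real pipeline the tag would typically come from a part-of-speech tagger.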

In [38]:
from nltk.stem import PorterStemmer
words= ["writing", "ready", "writes"]
sample_ps =PorterStemmer()
for x in words:
    root_Word=sample_ps.stem(x)
    print(root_Word)


write
readi
write
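
To illustrate what rule-based suffix stripping means, here is a deliberately tiny stemmer. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the function `toy_stem` and its rule table are invented for this sketch.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
RULES = (("ies", "i"), ("ing", ""), ("es", ""), ("s", ""))

def toy_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["jumping", "jumps", "fairies", "running", "easy"]:
    print(w, "->", toy_stem(w))
```

Here "running" comes out as "runn" because the sketch has no rule for doubled consonants, a case the Porter stemmer does handle. That gap is a reminder of why a tested implementation is preferable to ad hoc rules.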


In [40]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
text = "writes dancing study studying"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))



Lemma for writes is writes
Lemma for dancing is dancing
Lemma for study is study
Lemma for studying is studying


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\monika201103\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
