<a href="https://colab.research.google.com/github/javeria843/NLP/blob/main/NLP_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**What is natural language processing (NLP)?**

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and humans through natural language.

**NLTK**, the Natural Language Toolkit, is a Python library for working with human language data. It provides the tokenizers, stop word lists, stemmers, lemmatizers, and taggers used in the cells below.

In [None]:
!pip install nltk



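Downloading the NLTK data packages (corpora, tokenizer models, and taggers). `nltk.download("all")` fetches every package, which is convenient but large.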

In [None]:
import nltk
nltk.download("all")

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_rus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |  

True

**Tokenization.**

 This is the process of breaking text into words, phrases, symbols, or other meaningful elements, known as tokens.

In [None]:
from nltk.tokenize import word_tokenize

##### ENGLISH TEXT TOKENS

text = "Use logistic regression, and naïve Bayes, and word vectors to implement sentiment analysis, complete analogies & translate words."
tokens = word_tokenize(text)
# print(tokens.count("and"))
print(tokens)

##### URDU TEXT TOKENS
# urdu_text = "جذباتی تجزیے، مکمل تشبیہات اور الفاظ کا ترجمہ کرنے کے لیے لاجسٹک ریگریشن، ناواقف بیز، اور لفظ ویکٹر کا استعمال کریں۔"
# tokens = word_tokenize(urdu_text)
# print(tokens)

##### CHINESE TEXT TOKENS

# chinese_text = "使用邏輯迴歸、樸素貝葉斯和詞向量來實現情緒分析、完成類比和翻譯單字。"
# tokens = word_tokenize(chinese_text)
# print(tokens)





['Use', 'logistic', 'regression', ',', 'and', 'naïve', 'Bayes', ',', 'and', 'word', 'vectors', 'to', 'implement', 'sentiment', 'analysis', ',', 'complete', 'analogies', '&', 'translate', 'words', '.']
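To make the idea of a token concrete, here is a minimal regex-based sketch that uses only the standard library. It is a simplified stand-in and not the algorithm behind `word_tokenize`; the function name `simple_tokenize` and the pattern are assumptions chosen for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores (Unicode-aware in Python 3).
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Use logistic regression, and naïve Bayes."
print(simple_tokenize(sample))
```

A pattern like this knows nothing about contractions or abbreviations, and because it depends on whitespace and punctuation it cannot find word boundaries in text written without spaces between words, such as Chinese.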


**Sentence Tokenization:**

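This splits text into sentences. NLTK's `sent_tokenize` uses the pre-trained Punkt model, which looks at punctuation (such as !, ?, .) and its surrounding context to decide where one sentence ends and the next begins.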

In [None]:
from nltk.tokenize import sent_tokenize

text = "This is a cat, if you want it please take her. Also please take care of her."
sentences = sent_tokenize(text)
print(sentences)


['This is a cat, if you want it please take her.', 'Also please take care of her.']
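Splitting on punctuation alone is not enough in general. The sketch below uses a naive regular expression (an assumption for illustration, not what `sent_tokenize` does internally) and shows how an abbreviation such as "Dr." produces a false sentence break.

```python
import re

def naive_sentence_split(text):
    """Split wherever whitespace follows '.', '!' or '?'."""
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sentence_split("This is a cat. Also please take care of her."))
print(naive_sentence_split("Dr. Smith arrived. He sat down."))
```

The second call breaks after "Dr.", which is the kind of case a trained sentence tokenizer is meant to handle.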


**Stop words**

Stop words are common words like "the", "is", and "in" that usually carry little meaning on their own. Removing them helps focus on the more informative words.

In [None]:
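import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

# Build the stop word set once; a set lookup avoids rescanning the list for every token
stop_words = set(stopwords.words('english'))

words = word_tokenize("This is an example showing off stop word filtration.")
# Compare in lower case so capitalized stop words such as "This" are removed too
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)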

['example', 'showing', 'stop', 'word', 'filtration', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = """This is a sample sentence,
				showing off the stop words filtration."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)
print(word_tokens)
# Case-sensitive filter: the stop word list is lower case, so "This" is kept.
# For a case-insensitive filter, compare w.lower() instead:
# filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(filtered_sentence)


['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
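Note that "This" survives the filter above because the comparison is case-sensitive. A minimal sketch with a small hand-written stop word set (a hypothetical subset, not NLTK's full list) makes the difference explicit:

```python
# Hypothetical subset of stop words, for illustration only
stop_words = {"this", "is", "a", "off", "the"}

tokens = ["This", "is", "a", "sample", "sentence", ",",
          "showing", "off", "the", "stop", "words", "filtration", "."]

# Case-sensitive: "This" does not equal "this", so it is kept
case_sensitive = [t for t in tokens if t not in stop_words]

# Case-insensitive: lower-case each token before the lookup
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Which variant is appropriate depends on the task; lower-casing is the usual choice when capitalization carries no information.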


In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
print(stopwords.words('english'))


['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Stemming:**

Stemming strips suffixes to reduce words to a common stem. The stem need not be a dictionary word: below, the variations of "class" are reduced to "class", "classif", and "classifi".

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["python", "pythoner", "pythoning", "pythonly"]
words2 = ['class', 'classes', 'classing', 'classification', 'classify']
stemmed = [ps.stem(w) for w in words2]
print(stemmed)

['class', 'class', 'class', 'classif', 'classifi']
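One common use of stems is to group related word forms. The sketch below does this with the word/stem pairs copied from the output above, so it runs without NLTK; the variable names are assumptions for illustration.

```python
from collections import defaultdict

# Word/stem pairs copied from the PorterStemmer output above
words = ["class", "classes", "classing", "classification", "classify"]
stems = ["class", "class", "class", "classif", "classifi"]

# Map each stem to the list of words that share it
groups = defaultdict(list)
for word, stem in zip(words, stems):
    groups[stem].append(word)

for stem, members in groups.items():
    print(stem, "->", members)
```

The grouping shows a limit of stemming: "classification" and "classify" end up under different stems even though they are closely related.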


**Lemmatization.**

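Lemmatization reduces a word to its dictionary form (its lemma), so that different inflected forms of the same word can be grouped together. Unlike a stem, a lemma is a valid word, and the result depends on the part of speech passed via `pos`.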

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos = 'v'))

run


**POS tagging:**


 POS tagging labels each word with its grammatical role (e.g., noun, verb, adjective).

In [None]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize

words = word_tokenize("He is learning NLP.")
words2 = word_tokenize("The provided Python code demonstrates stopword removal using the Natural Language Toolkit (NLTK) library. In the first step, the sample sentence, which reads “This is a sample sentence, showing off the stop words filtration,” is tokenized into words using the word_tokenize function. The code then filters out stopwords by converting each word to lowercase and checking its presence in the set of English stopwords obtained from NLTK. The resulting filtered_sentence is printed, showcasing both lowercased and original versions, providing a cleaned version of the sentence with common English stopwords removed.")
# print(pos_tag(words))
print(pos_tag(words2))


[('The', 'DT'), ('provided', 'JJ'), ('Python', 'NNP'), ('code', 'NN'), ('demonstrates', 'VBZ'), ('stopword', 'JJ'), ('removal', 'NN'), ('using', 'VBG'), ('the', 'DT'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Toolkit', 'NNP'), ('(', '('), ('NLTK', 'NNP'), (')', ')'), ('library', 'NN'), ('.', '.'), ('In', 'IN'), ('the', 'DT'), ('first', 'JJ'), ('step', 'NN'), (',', ','), ('the', 'DT'), ('sample', 'JJ'), ('sentence', 'NN'), (',', ','), ('which', 'WDT'), ('reads', 'VBZ'), ('“', 'NN'), ('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('sentence', 'NN'), (',', ','), ('showing', 'VBG'), ('off', 'RP'), ('the', 'DT'), ('stop', 'NN'), ('words', 'NNS'), ('filtration', 'NN'), (',', ','), ('”', 'NNP'), ('is', 'VBZ'), ('tokenized', 'VBN'), ('into', 'IN'), ('words', 'NNS'), ('using', 'VBG'), ('the', 'DT'), ('word_tokenize', 'NN'), ('function', 'NN'), ('.', '.'), ('The', 'DT'), ('code', 'NN'), ('then', 'RB'), ('filters', 'VBZ'), ('out', 'RP'), ('stopwords', 'NNS'), ('by', 'IN'), ('conv
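The tagger returns a list of `(word, tag)` tuples, which is easy to post-process. As a small sketch, the code below counts tags and extracts nouns from the first few pairs of the output above (copied by hand so that it runs without NLTK). In the Penn Treebank tag set used here, noun tags start with "NN".

```python
from collections import Counter

# First few (word, tag) pairs copied from the pos_tag output above
tagged = [("The", "DT"), ("provided", "JJ"), ("Python", "NNP"), ("code", "NN"),
          ("demonstrates", "VBZ"), ("stopword", "JJ"), ("removal", "NN")]

# How often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)

# Keep every word whose tag is a noun tag (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts.most_common())
print(nouns)
```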

**spaCy**

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It's designed specifically for production use and helps you build applications that process and "understand" large volumes of text.

In [None]:
!pip install spacy



In [None]:
import spacy
# Load the installed model "en_core_web_sm"
nlp = spacy.load("en_core_web_sm")

In [None]:
doc = nlp("This is a text.")
# Token text
doc_token = [token.text for token in doc]
print(doc_token)


['This', 'is', 'a', 'text', '.']


In [None]:
doc = nlp("This is a text")
span = doc[2:4]
span.text

'a text'

**Named Entity Recognition (NER)**

NER identifies entities such as persons, organizations, and locations in the text. The entities are predicted by a statistical model, so the results depend on the model and on the input sentence.

In [None]:
doc = nlp("Larry Page founded Google")
# Text and label of named entity span
doc_entities =  [(ent.text, ent.label_) for ent in doc.ents]
print(doc_entities)

[('Larry Page', 'PERSON'), ('Google', 'ORG')]


In [None]:
doc = nlp("Pakistan is good")
# Text and label of named entity span
doc_entities =  [(ent.text, ent.label_) for ent in doc.ents]
print(doc_entities)

[('Pakistan', 'GPE')]
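Since each entity comes back as a `(text, label)` pair, a typical next step is to collect entities by type. A minimal sketch using the pairs from the two outputs above (copied by hand, so no spaCy model is needed to run it):

```python
from collections import defaultdict

# (text, label) pairs copied from the two outputs above
entities = [("Larry Page", "PERSON"), ("Google", "ORG"), ("Pakistan", "GPE")]

# Map each label to the entity texts that carry it
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(dict(by_label))
```

With a real `doc`, the same loop could run over `(ent.text, ent.label_)` for each `ent` in `doc.ents`.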


In [None]:
# spacy.explain("RB")
# 'adverb'
spacy.explain("GPE")
# 'Countries, cities, states'


'Countries, cities, states'

In [None]:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

**Word vectors and similarity**

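⚠️ To use word vectors, you need to install one of the larger models ending in `md` or `lg`, for example `en_core_web_lg`. The small model `en_core_web_sm` used in this notebook ships without word vectors, so the similarity scores it produces are only rough estimates.

Comparing similarity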

In [None]:
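doc1 = nlp("I like cats")
doc2 = nlp("I like dogs")
# A notebook cell displays only the value of its last expression,
# so only the token-to-span similarity appears in the output.
# Compare 2 documents
doc1.similarity(doc2)
# Compare 2 tokens
doc1[2].similarity(doc2[2])
# Compare tokens and spans
doc1[0].similarity(doc2[1:3])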




0.17864830791950226
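By default, spaCy's `similarity` methods return the cosine similarity of the underlying vectors. The sketch below computes it in plain Python; the three-dimensional vectors are made-up toy values, not real embeddings, and returning 0.0 for a zero vector is a choice made here for illustration.

```python
import math

def l2_norm(vec):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in vec))

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a, norm_b = l2_norm(a), l2_norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, for illustration only)
cats = [0.9, 0.1, 0.3]
dogs = [0.8, 0.2, 0.3]
car = [0.1, 0.9, 0.0]

print(cosine_similarity(cats, dogs))
print(cosine_similarity(cats, car))
```

Vectors that point in nearly the same direction score close to 1, which is why the first pair scores higher than the second.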

In [None]:
doc = nlp("I like cats")
# Vector as a numpy array
doc[2].vector
# The L2 norm of the token's vector
doc[2].vector_norm


np.float32(6.874912)