
#1. Text Preprocessing with NLTK and spaCy

In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Natural_language_processing"

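response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.text, "html.parser")

# Use the first <p> element of the page as the sample text
text = soup.find_all("p")[0].get_text()
print("Sample Text:", text)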

Sample Text: Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically data is collected in text corpora, using either rule-based, statistical or neural-based approaches in machine learning and deep learning.
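
Indexing `find_all("p")[0]` assumes that the first `<p>` element on the page carries text. Some pages begin with empty paragraphs, so a more defensive variant could pick the first paragraph that is non-empty after stripping whitespace. The following is a minimal sketch; `first_nonempty` is a hypothetical helper (not part of requests or BeautifulSoup), and the sample list stands in for the strings extracted from the page.

```python
def first_nonempty(paragraphs):
    """Return the first string with non-whitespace content, or None."""
    for paragraph in paragraphs:
        stripped = paragraph.strip()
        if stripped:
            return stripped
    return None


# Hypothetical sample input standing in for
# [p.get_text() for p in soup.find_all("p")]
sample_paragraphs = ["\n", "   ", "Natural language processing (NLP) is a subfield.\n"]
print(first_nonempty(sample_paragraphs))
```

In the notebook, the helper could be fed the paragraph texts collected from `soup` instead of the sample list.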



In [None]:
import nltk  # Import the nltk library

nltk.download('punkt_tab')  # Download the pretrained 'punkt_tab' model for tokenizing text

from nltk.tokenize import word_tokenize  # Import the word_tokenize function

nltk_tokens = word_tokenize(text)  # Tokenize the text

# Print each token on a new line to keep the output aligned without causing horizontal scrolling
print("NLTK Tokenization:")
for token in nltk_tokens:
    print(token)

NLTK Tokenization:
Natural
language
processing
(
NLP
)
is
a
subfield
of
computer
science
and
especially
artificial
intelligence
.
It
is
primarily
concerned
with
providing
computers
with
the
ability
to
process
data
encoded
in
natural
language
and
is
thus
closely
related
to
information
retrieval
,
knowledge
representation
and
computational
linguistics
,
a
subfield
of
linguistics
.
Typically
data
is
collected
in
text
corpora
,
using
either
rule-based
,
statistical
or
neural-based
approaches
in
machine
learning
and
deep
learning
.


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
import spacy  # import spaCy

nlp = spacy.load("en_core_web_sm")  # load the small English model

doc = nlp(text)  # process the text

spacy_tokens = [token.text for token in doc]  # collect the text of each token

print("spaCy Tokenization:")
for token in spacy_tokens:
    print(token)



spaCy Tokenization:
Natural
language
processing
(
NLP
)
is
a
subfield
of
computer
science
and
especially
artificial
intelligence
.
It
is
primarily
concerned
with
providing
computers
with
the
ability
to
process
data
encoded
in
natural
language
and
is
thus
closely
related
to
information
retrieval
,
knowledge
representation
and
computational
linguistics
,
a
subfield
of
linguistics
.
Typically
data
is
collected
in
text
corpora
,
using
either
rule
-
based
,
statistical
or
neural
-
based
approaches
in
machine
learning
and
deep
learning
.
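
Comparing the two outputs, NLTK keeps hyphenated words such as "rule-based" as a single token, while spaCy splits them into three tokens. One way to surface such differences automatically is to diff the two token lists with the standard library's `difflib`. The sketch below uses a short excerpt of the tokens printed above; `token_differences` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def token_differences(tokens_a, tokens_b):
    """Return (a_span, b_span) pairs for every region where the lists differ."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Excerpt of the NLTK and spaCy outputs shown above
nltk_excerpt = ["using", "either", "rule-based", ",", "statistical",
                "or", "neural-based", "approaches"]
spacy_excerpt = ["using", "either", "rule", "-", "based", ",", "statistical",
                 "or", "neural", "-", "based", "approaches"]

for nltk_span, spacy_span in token_differences(nltk_excerpt, spacy_excerpt):
    print(nltk_span, "vs", spacy_span)
```

In the notebook, the full `nltk_tokens` and `spacy_tokens` lists could be passed in place of the excerpts.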




In [None]:
nltk.download('wordnet')    # WordNet corpus, required by the lemmatizer
nltk.download('omw-1.4')    # Open Multilingual WordNet data used alongside WordNet
nltk.download('stopwords')  # stopword lists

from nltk.tokenize import word_tokenize  # tokenization function
from nltk.corpus import stopwords        # stopword lists
from nltk.stem import WordNetLemmatizer  # WordNet-based lemmatizer

nltk_tokens = word_tokenize(text)  # tokenize the text

lemmatizer = WordNetLemmatizer()  # initialize the lemmatizer
nltk_stopwords = set(stopwords.words('english'))  # set of English stopwords for fast lookups

# Drop stopwords (case-insensitive), then lemmatize the remaining tokens.
# lemmatize() treats every token as a noun unless a POS tag is passed,
# so verb forms such as "concerned" and "providing" are left unchanged.
nltk_processed = [
    lemmatizer.lemmatize(token)
    for token in nltk_tokens
    if token.lower() not in nltk_stopwords
]

print("NLTK Processed:")
for token in nltk_processed:
    print(token)

NLTK Processed:
Natural
language
processing
(
NLP
)
subfield
computer
science
especially
artificial
intelligence
.
primarily
concerned
providing
computer
ability
process
data
encoded
natural
language
thus
closely
related
information
retrieval
,
knowledge
representation
computational
linguistics
,
subfield
linguistics
.
Typically
data
collected
text
corpus
,
using
either
rule-based
,
statistical
neural-based
approach
machine
learning
deep
learning
.


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
doc = nlp(text)  # process the input text

# Lemmatize tokens and remove stopwords using spaCy
spacy_processed = [token.lemma_ for token in doc if not token.is_stop]

print("spaCy Processed:")
for token in spacy_processed:
    print(token)

spaCy Processed:
natural
language
processing
(
NLP
)
subfield
computer
science
especially
artificial
intelligence
.
primarily
concern
provide
computer
ability
process
datum
encode
natural
language
closely
relate
information
retrieval
,
knowledge
representation
computational
linguistic
,
subfield
linguistic
.
typically
datum
collect
text
corpora
,
rule
-
base
,
statistical
neural
-
base
approach
machine
learning
deep
learning
.
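
Both processed outputs still contain punctuation tokens such as "(", ",", and ".", because stopword removal does not cover them. A possible extra step is to drop tokens made up entirely of punctuation characters. The sketch below uses only the standard library; `drop_punctuation` is a hypothetical helper and the sample list is an excerpt of the tokens shown above.

```python
import string


def drop_punctuation(tokens):
    """Keep tokens that contain at least one non-punctuation character."""
    return [
        token
        for token in tokens
        if not all(ch in string.punctuation for ch in token)
    ]


# Excerpt of the processed tokens shown above
sample_tokens = ["Natural", "language", "processing", "(", "NLP", ")",
                 "rule-based", ",", "-", "."]
print(drop_punctuation(sample_tokens))
```

Hyphenated tokens such as "rule-based" survive because only tokens consisting solely of punctuation are removed. With spaCy, filtering on the token attribute `is_punct` would be an alternative.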




#2. Named Entity Recognition (NER) with spaCy

In [None]:
import spacy  # import the spaCy library

nlp = spacy.load("en_core_web_sm")  # load spaCy's small English model

text = "Almaty,[a] formerly Alma-Ata,[b] is the largest city in Kazakhstan, with a population exceeding two million residents within its metropolitan area.[8] Located in the foothills of the Trans-Ili Alatau mountains in southern Kazakhstan, near the border with Kyrgyzstan, Almaty stands as a pivotal center of culture, commerce, finance and innovation. The city is nestled at an elevation of 700–900 metres (2,300–3,000 feet), with the Big Almaty and Small Almaty rivers running through it, originating from the surrounding mountains and flowing into the plains. Almaty is the second-largest city in Central Asia and the third-largest in the Eurasian Economic Union (EEU)."

doc = nlp(text) #process the text

print("Named Entities:")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)



Named Entities:
Almaty,[a -> CARDINAL
Kazakhstan -> GPE
two million -> CARDINAL
the Trans-Ili Alatau -> ORG
Kazakhstan -> GPE
Kyrgyzstan -> GPE
Almaty -> GPE
700–900 metres -> QUANTITY
2,300–3,000 feet -> QUANTITY
the Big Almaty -> ORG
Small Almaty -> PERSON
Almaty -> ORG
second -> ORDINAL
Central Asia -> LOC
third -> ORDINAL
the Eurasian Economic Union -> ORG
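
The first entity, `Almaty,[a`, shows that the Wikipedia footnote markers (`[a]`, `[b]`, `[8]`) left in the text interfere with entity detection. Removing them before calling `nlp(text)` may give cleaner results, although the effect on the entities above has not been verified here. A minimal sketch, assuming markers are a single lowercase letter or a number in square brackets; `strip_footnote_markers` is a hypothetical helper and the sample is an abridged version of the text above.

```python
import re

# Matches markers such as [a], [b], or [8]
FOOTNOTE_MARKER = re.compile(r"\[(?:[a-z]|\d+)\]")


def strip_footnote_markers(text):
    """Remove Wikipedia-style footnote markers from the text."""
    return FOOTNOTE_MARKER.sub("", text)


# Abridged sample of the text used above
sample = ("Almaty,[a] formerly Alma-Ata,[b] is the largest city in "
          "Kazakhstan.[8] Located in the foothills.")
print(strip_footnote_markers(sample))
```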


In [None]:
from spacy import displacy  # spaCy's built-in visualizer

# Highlight the named entities inline. The "distance" option applies only to
# the dependency view (style="dep"), so it is omitted here.
displacy.render(doc, style="ent", jupyter=True)

#3. Text Vectorization using Transformers

In [None]:
from transformers import AutoTokenizer, AutoModel  #import necessary classes from Hugging Face

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") #load the tokenizer for bert-base-uncased

model = AutoModel.from_pretrained("bert-base-uncased") #load the BERT model for bert-base-uncased


In [None]:
text = "Almaty,[a] formerly Alma-Ata,[b] is the largest city in Kazakhstan, with a population exceeding two million residents within its metropolitan area.[8] Located in the foothills of the Trans-Ili Alatau mountains in southern Kazakhstan, near the border with Kyrgyzstan, Almaty stands as a pivotal center of culture, commerce, finance and innovation. The city is nestled at an elevation of 700–900 metres (2,300–3,000 feet), with the Big Almaty and Small Almaty rivers running through it, originating from the surrounding mountains and flowing into the plains. Almaty is the second-largest city in Central Asia and the third-largest in the Eurasian Economic Union (EEU)."

encoded_input = tokenizer(text, return_tensors='pt') # Tokenize and encode the sentence and return PyTorch tensors

print("Encoded Input:", encoded_input)

Encoded Input: {'input_ids': tensor([[  101, 11346,  3723,  1010,  1031,  1037,  1033,  3839, 11346,  1011,
         29533,  1010,  1031,  1038,  1033,  2003,  1996,  2922,  2103,  1999,
         11769,  1010,  2007,  1037,  2313, 17003,  2048,  2454,  3901,  2306,
          2049,  4956,  2181,  1012,  1031,  1022,  1033,  2284,  1999,  1996,
         18455,  1997,  1996,  9099,  1011,  6335,  2072, 21862,  2696,  2226,
          4020,  1999,  2670, 11769,  1010,  2379,  1996,  3675,  2007, 23209,
          1010, 11346,  3723,  4832,  2004,  1037, 20369,  2415,  1997,  3226,
          1010,  6236,  1010,  5446,  1998,  8144,  1012,  1996,  2103,  2003,
         22704,  2012,  2019,  6678,  1997,  6352,  1516,  7706,  3620,  1006,
          1016,  1010,  3998,  1516,  1017,  1010,  2199,  2519,  1007,  1010,
          2007,  1996,  2502, 11346,  3723,  1998,  2235, 11346,  3723,  5485,
          2770,  2083,  2009,  1010, 14802,  2013,  1996,  4193,  4020,  1998,
          8577,  2046, 
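
In the ids above, the pair `11346, 3723` recurs wherever "Almaty" appears, and `11346` also starts "Alma-Ata". This is consistent with the word being split into two sub-word pieces. BERT's tokenizer uses WordPiece, which breaks words that are not in the vocabulary into smaller pieces by greedy longest-match, marking continuation pieces with `##`. The sketch below illustrates that matching step only; the vocabulary is a toy, hypothetical one and `wordpiece` is not the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary for illustration only, not BERT's real vocabulary
toy_vocab = {"alma", "##ty", "ata", "city"}
print(wordpiece("almaty", toy_vocab))
print(wordpiece("city", toy_vocab))
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens(...)` can be used to inspect the actual pieces behind the ids.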

In [None]:
outputs = model(**encoded_input)  # pass the encoded input through the model

# last_hidden_state holds one contextual embedding per WordPiece token, with
# shape (batch_size, sequence_length, hidden_size); hidden_size is 768 for bert-base
hidden_states = outputs.last_hidden_state

print("Hidden States (Embeddings):", hidden_states)

Hidden States (Embeddings): tensor([[[-0.1223, -0.0352, -0.1076,  ...,  0.3376,  0.9551,  0.0992],
         [ 0.2084,  0.6483, -0.2224,  ..., -0.3481,  0.7040, -0.3805],
         [-0.1361,  0.0042,  0.7339,  ...,  0.1515,  0.4812,  1.2165],
         ...,
         [ 0.6598,  0.0037, -0.1837,  ...,  0.0428, -0.6580, -0.3002],
         [ 0.5075, -0.0304, -0.2800,  ...,  0.2003, -0.4971, -0.4049],
         [-0.5466,  0.3170, -0.1027,  ...,  0.2216,  0.1571, -0.5755]]],
       grad_fn=<NativeLayerNormBackward0>)
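
The hidden states give one vector per token. To represent the whole text as a single vector, a common option is mean pooling: averaging the token vectors while skipping positions that the attention mask marks as padding. The sketch below shows the arithmetic on plain Python lists with toy numbers; `mean_pool` is a hypothetical helper, and with PyTorch tensors the same computation would normally be written with tensor operations.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of tokens whose mask value is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy data: three token vectors of size 2, the last one is padding
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
print(mean_pool(toy_vectors, toy_mask))  # [2.0, 3.0]
```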


#4. Sentiment Analysis with Transformers

In [None]:
import pandas as pd
from transformers import pipeline  # Import Hugging Face's pipeline for sentiment analysis

url = "https://raw.githubusercontent.com/SK7here/Movie-Review-Sentiment-Analysis/refs/heads/master/IMDB-Dataset.csv"

df = pd.read_csv(url) #load the dataset directly from the URL

reviews = df["review"].head(10).tolist()  # use the first 10 movie reviews for demonstration

sentiment_pipeline = pipeline("sentiment-analysis")  # no model is named, so the library's default is loaded

print("Sentiment Analysis using Transformers on Movie Reviews:")
for review in reviews:
    # truncation=True cuts reviews that exceed the model's maximum input length
    result = sentiment_pipeline(review, truncation=True)
    print("Review:", review)
    print("Result:", result, "\n")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Sentiment Analysis using Transformers on Movie Reviews:
Review: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /
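
The review text above contains raw `<br />` tags from the dataset. They carry no sentiment, so stripping them before classification is a reasonable cleaning step. A minimal sketch using a regular expression; `clean_review` is a hypothetical helper and the sample is a short excerpt of the first review.

```python
import re

BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
WHITESPACE = re.compile(r"\s+")


def clean_review(review):
    """Replace <br> tags with spaces and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", review)
    return WHITESPACE.sub(" ", without_tags).strip()


# Short excerpt of the first review shown above
sample_review = "you'll be hooked.<br /><br />The first thing that struck me"
print(clean_review(sample_review))
```

In the notebook, this could be applied as `reviews = [clean_review(r) for r in reviews]` before the loop.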

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

url = "https://raw.githubusercontent.com/SK7here/Movie-Review-Sentiment-Analysis/refs/heads/master/IMDB-Dataset.csv"

df = pd.read_csv(url)  # load the dataset directly from the URL

reviews = df["review"].head(10).tolist()        # first 10 movie reviews, for demonstration
sentiments = df["sentiment"].head(10).tolist()  # their sentiment labels

# Vectorize the text data using CountVectorizer (bag-of-words model)
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

# Initialize and train the Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X, sentiments)

# Predict on the same 10 reviews the classifier was trained on, so the
# predictions below show training fit only, not accuracy on unseen reviews
predictions = nb_classifier.predict(X)

print("Naive Bayes Sentiment Analysis:")
for review, actual, predicted in zip(reviews, sentiments, predictions):
    print("Review:", review)
    print("Actual Sentiment:", actual, "-> Predicted Sentiment:", predicted)
    print("-" * 80)

Naive Bayes Sentiment Analysis:
Review: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the m
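
Because the classifier above is evaluated on the reviews it was trained on, its predictions say little about unseen data. Holding out part of the data for testing gives a fairer estimate. The sketch below shows the idea with the standard library only; `holdout_split` and `accuracy` are hypothetical helpers, and the sample items and labels are made up for illustration. scikit-learn's `train_test_split` offers the same functionality.

```python
import random


def holdout_split(items, labels, test_fraction=0.2, seed=0):
    """Shuffle indices and split items/labels into train and test parts."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [items[i] for i in train_idx],
        [items[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual one."""
    if not actual:
        raise ValueError("actual must not be empty")
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Made-up sample data standing in for reviews and sentiments
sample_items = [f"review {i}" for i in range(10)]
sample_labels = ["positive" if i % 2 == 0 else "negative" for i in range(10)]

train_x, test_x, train_y, test_y = holdout_split(sample_items, sample_labels)
print(len(train_x), "training items,", len(test_x), "test items")
```

The vectorizer and classifier would then be fitted on the training part only, and `accuracy` computed on predictions for the test part. With only 10 reviews the estimate is very noisy, so a larger slice of the dataset would be needed for a meaningful number.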