Description:
Scrape text data from any two different websites, apply preprocessing techniques to it, and
then extract all the unique values.
The minimum preprocessing techniques required:
1. Tokenization
  - Splitting text into words, sentences, or subwords.
  - Example: "I love NLP" → ["I", "love", "NLP"]
2. Lowercasing
  - Converting all text to lowercase to ensure uniformity.
  - Example: "Machine Learning" → "machine learning"
3. Stopword Removal
  - Removing common words like "the," "is," "and" that do not add much meaning.
  - Example: "I love the new AI model" → ["love", "new", "AI", "model"]
4. Removing Special Characters, Numbers and Punctuation
  - Example: "Hello!!! NLP is awesome :) " → "Hello NLP is awesome"
  - Example: "COVID-19 cases reached 500000" → "COVID cases reached"
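
The two examples in step 4 can be reproduced with the standard library alone. This is only a minimal sketch: the helper name `strip_non_letters` is hypothetical, and collapsing the leftover whitespace is an added assumption so that the results match the examples exactly.

```python
import re

def strip_non_letters(text):
    """Hypothetical helper: keep only ASCII letters, separated by single spaces."""
    # Drop every character that is not a letter or whitespace
    letters_only = re.sub(r"[^a-zA-Z\s]", "", text)
    # Collapse the runs of whitespace left behind by the removed characters
    return " ".join(letters_only.split())

print(strip_non_letters("Hello!!! NLP is awesome :) "))    # Hello NLP is awesome
print(strip_non_letters("COVID-19 cases reached 500000"))  # COVID cases reached
```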

# Start coding

In [1]:
!pip install nltk



In [2]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [3]:
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
stop_words = set(stopwords.words('english'))
print(stop_words)

{"haven't", 'mightn', 'over', 'any', 'same', 'why', 'under', "you're", 'll', 'be', 'do', "i'm", "shouldn't", 'the', 'where', 'all', 'into', 'because', 'weren', 'who', "she'll", 'can', 'am', 'him', "i've", 'being', 'these', "you'll", 'there', 'you', 'haven', 'did', 'me', 'against', 'some', "isn't", 'further', 'just', "she'd", 'and', 'of', "they'd", "hasn't", 'have', 'his', 'its', "he'll", 'with', 'm', 'if', "i'd", "wouldn't", 've', 'what', 'hadn', "should've", 'your', "you'd", 'shouldn', 'ain', 'is', 'very', 'whom', 'won', 'which', 'themselves', 'needn', 'y', 'not', "wasn't", "doesn't", 'from', "hadn't", "we'd", 'before', 'does', 'how', 're', 'it', "they've", "couldn't", "i'll", 'about', "we'll", "won't", 'had', 'theirs', 'yours', 'most', 'between', 'aren', 'by', 'those', 'more', 'should', 'than', 'couldn', 'himself', 'then', 'through', 'wouldn', 'yourselves', 'they', 'that', "needn't", "shan't", 'having', 'shan', 'such', "weren't", 'are', 'to', 'my', 'until', 'at', 'yourself', 'ours', 

In [5]:
import re

In [6]:
sentence = "Mina, Filopater and Martina for the assignment.!!!"

In [7]:
def clean_text(text):
    # Return an empty string if the input is not a string
    if not isinstance(text, str):
        return ""
    # Keep only letters and spaces
    cleaned_text = re.sub(r"[^a-zA-Z\s]", "", text)
    return cleaned_text

clent_sent = clean_text(sentence)
print(clent_sent)

Mina Filopater and Martina for the assignment


In [8]:
Lsent = clent_sent.lower()
print(Lsent)

mina filopater and martina for the assignment


In [9]:
sent_tok = word_tokenize(Lsent)
print(sent_tok)

['mina', 'filopater', 'and', 'martina', 'for', 'the', 'assignment']


In [10]:
sent_tok_filtered = [word for word in sent_tok if word not in stop_words]

print(f"sentence after: {sent_tok_filtered}")

sentence after: ['mina', 'filopater', 'martina', 'assignment']
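
The order used above (clean, lowercase, tokenize, filter) has one side effect: stripping punctuation first turns a contraction such as "isn't" into "isnt", which no longer matches the stopword entry "isn't". The sketch below illustrates this. It rests on two simplifying assumptions so that it runs without NLTK data: a tiny hand-picked stopword subset instead of the full English list, and `str.split` instead of `word_tokenize`. Both function names are hypothetical.

```python
import re

# Tiny illustrative subset of the English stopword list (assumption for this sketch)
STOP_SUBSET = {"the", "is", "isn't"}

def clean_then_filter(text):
    # Same order as the cells above: strip non-letters, lowercase, split, filter
    cleaned = re.sub(r"[^a-zA-Z\s]", "", text).lower()
    return [w for w in cleaned.split() if w not in STOP_SUBSET]

def filter_then_clean(text):
    # Alternative order: filter the raw lowercase tokens, then strip non-letters
    kept = [w for w in text.lower().split() if w not in STOP_SUBSET]
    stripped = [re.sub(r"[^a-z]", "", w) for w in kept]
    return [w for w in stripped if w]  # drop tokens that became empty

sample = "The model isn't new!"
print(clean_then_filter(sample))  # ['model', 'isnt', 'new']
print(filter_then_clean(sample))  # ['model', 'new']
```

With plain `str.split`, punctuation stays attached to words, so a token like "the," would slip past the filter in the second variant. A real tokenizer would be needed before relying on that order.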


## Combine into a class

In [13]:
class TextPreprocessing:
  def __init__(self, text):
    self.text = text
    self.stop_words = set(stopwords.words('english'))

  def _clean_text(self, text_to_clean):
    # Return an empty string if the input is not a string
    if not isinstance(text_to_clean, str):
      return ""
    # Keep only letters and spaces
    cleaned_text = re.sub(r"[^a-zA-Z\s]", "", text_to_clean)
    return cleaned_text

  def fit(self):
    cleaned_text = self._clean_text(self.text)
    Lsent = cleaned_text.lower()
    sent_tok = word_tokenize(Lsent)
    sent_tok_filtered = [word for word in sent_tok if word not in self.stop_words]
    return sent_tok_filtered

In [14]:
sentence = "Mina, Filopater and Martina for the assignment.!!!"
txtprepros = TextPreprocessing(sentence)
finaltxt = txtprepros.fit()
print(finaltxt)

['mina', 'filopater', 'martina', 'assignment']


# Start scraping web pages

In [15]:
!pip install requests
!pip install beautifulsoup4
!pip install lxml



In [16]:
import requests
import lxml
from bs4 import BeautifulSoup

## page-1

In [17]:
page = requests.get("https://www.ask-aladdin.com/all-destinations/egypt/category/egyptian-pharaohs/page/tut-ankh-amun-and-his-treasures")

In [18]:
# get source page
src = page.content
print(src)

b'<!DOCTYPE html><html lang="en"><head>\n  <meta charset="utf-8">\n  <title>The Treasures of Tut Ankh Amun | TutAnkhAmun - AskAladdin</title>\n  <base href="/">\n  <link rel="icon" type="image/x-icon" href="./favicon.ico">\n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  <meta name="robots" content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1">\n  <link rel="alternate" href="https://www.ask-aladdin.com/all-destinations/egypt/category/egyptian-pharaohs/page/tut-ankh-amun-and-his-treasures" hreflang="x-default">\n  <link rel="alternate" href="https://www.ask-aladdin.com/all-destinations/egypt/category/egyptian-pharaohs/page/tut-ankh-amun-and-his-treasures" hreflang="en-us">\n  <link rel="canonical" href="https://www.ask-aladdin.com/all-destinations/egypt/category/egyptian-pharaohs/page/tut-ankh-amun-and-his-treasures">\n  <meta name="description" content="King Tut Ankh Amun (King Tutankhamun) was one of the kings of the 18th dy

In [19]:
# get the page source in a better format
soup = BeautifulSoup(src, 'lxml')
print(soup)

<!DOCTYPE html>
<html lang="en"><head>
<meta charset="utf-8"/>
<title>The Treasures of Tut Ankh Amun | TutAnkhAmun - AskAladdin</title>
<base href="/"/>
<link href="./favicon.ico" rel="icon" type="image/x-icon"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots"/>
<link href="https://www.ask-aladdin.com/all-destinations/egypt/category/egyptian-pharaohs/page/tut-ankh-amun-and-his-treasures" hreflang="x-default" rel="alternate"/>
<link href="https://www.ask-aladdin.com/all-destinations/egypt/category/egyptian-pharaohs/page/tut-ankh-amun-and-his-treasures" hreflang="en-us" rel="alternate"/>
<link href="https://www.ask-aladdin.com/all-destinations/egypt/category/egyptian-pharaohs/page/tut-ankh-amun-and-his-treasures" rel="canonical"/>
<meta content="King Tut Ankh Amun (King Tutankhamun) was one of the kings of the 18th dynasty of the New Kingdom of the Pharaonic 

In [20]:
AllPageContent = soup.find('div', {'class': 'inner-html'})
print(AllPageContent)

<div _ngcontent-sc102="" class="inner-html"><h2>Who was Tut Ankh Amun?</h2>
<p><img alt="Tut Ankh Amun and his Treasures" src="https://admin.ask-aladdin.com/photos/egypt/articles/tut-ankh-amun-and-his-treasures8-askaladdin.webp" style="margin-bottom:10px; margin-top:10px"/></p>
<p>King Tut Ankh Amun (King Tutankhamun) was one of the kings of the 18th dynasty of the New Kingdom of the Pharaonic period. He was the ruler of Egypt from 1334 until 1325 B.C.<br/>
Tut Ankh Amun became the king of Egypt when he was only 9 years old. The word “Tut Ankh Amun” in ancient Egyptian means “the living incarnation of Amun,” the most important god in ancient Egypt.<br/>
Tut Ankh Amun lived in a transitory period of ancient Egyptian history as he became the ruler of Egypt after Akhenaton, who tried to unify the multi-god system in Egypt into the worship of only one god, Aton, the god of the Sun.<br/>
However, when Akhenaton passed away, and Tut Ankh Amun became his successor, the multi-god system became

In [21]:
AllPageContent = soup.find_all('p')
print(AllPageContent)

[<p _ngcontent-sc40="">Cairo Travel Information</p>, <p _ngcontent-sc40="">Luxor Travel Guide</p>, <p _ngcontent-sc40="">Alexandria Travel Guide</p>, <p _ngcontent-sc40="">Aswan Travel Guide</p>, <p _ngcontent-sc40="">Hurghada Travel Guide</p>, <p _ngcontent-sc40="">Port Said travel guide</p>, <p _ngcontent-sc40="">Marsa Matrouh Travel guide</p>, <p _ngcontent-sc40="">Sharm el Sheikh Travel Guide</p>, <p _ngcontent-sc40="">Dahab Travel Guide</p>, <p _ngcontent-sc40="">Farafra Oasis</p>, <p _ngcontent-sc40="">Nuweiba Travel Guide</p>, <p _ngcontent-sc40="">Taba Travel Guide</p>, <p _ngcontent-sc40="">Contact Us</p>, <p _ngcontent-sc40="">About Us</p>, <p _ngcontent-sc40="">Request A Call back</p>, <p _ngcontent-sc40="">Ask The Experts</p>, <p _ngcontent-sc102="" class="font-weight-bold"></p>, <p><img alt="Tut Ankh Amun and his Treasures" src="https://admin.ask-aladdin.com/photos/egypt/articles/tut-ankh-amun-and-his-treasures8-askaladdin.webp" style="margin-bottom:10px; margin-top:10px"/

In [22]:
print(AllPageContent[18].text)

King Tut Ankh Amun (King Tutankhamun) was one of the kings of the 18th dynasty of the New Kingdom of the Pharaonic period. He was the ruler of Egypt from 1334 until 1325 B.C.
Tut Ankh Amun became the king of Egypt when he was only 9 years old. The word “Tut Ankh Amun” in ancient Egyptian means “the living incarnation of Amun,” the most important god in ancient Egypt.
Tut Ankh Amun lived in a transitory period of ancient Egyptian history as he became the ruler of Egypt after Akhenaton, who tried to unify the multi-god system in Egypt into the worship of only one god, Aton, the god of the Sun.
However, when Akhenaton passed away, and Tut Ankh Amun became his successor, the multi-god system became prominent in Egypt once again, represented by the ascendance of the worship of Amun once again. The tomb of Tutankhamun was discovered in 1932 by Howard Carter and managed to garner major media attention worldwide. His grave was intact and featured some of the most beautiful burial items and fur

In [23]:
# Three attributes that mark irrelevant content
print(AllPageContent[1]) # indices 0 to 15 are navigation links with attribute _ngcontent-sc40; 16 and 17 hold no text
print(AllPageContent[45]) # irrelevant content with attribute _ngcontent-sc43
print(AllPageContent[50]) # irrelevant content with attribute _ngcontent-sc42

<p _ngcontent-sc40="">Luxor Travel Guide</p>
<p _ngcontent-sc43="" class="ng-tns-c43-0">This website uses cookies to ensure you get the best experience on our website.</p>
<p _ngcontent-sc42="" class="text-white">Powered By <a _ngcontent-sc42="" class="main-color" href="https://digitalexperts.ae/">Digital Experts</a></p>


In [27]:
txt = ""
# Attributes that mark navigation, cookie-banner, and footer paragraphs
irrelevant_attrs = ('_ngcontent-sc40', '_ngcontent-sc42', '_ngcontent-sc43')
for p in AllPageContent:
  # Keep only paragraphs that carry none of the irrelevant attributes
  if not any(attr in p.attrs for attr in irrelevant_attrs):
    txt += f"{p.text} \n"
print(txt)

 
 
King Tut Ankh Amun (King Tutankhamun) was one of the kings of the 18th dynasty of the New Kingdom of the Pharaonic period. He was the ruler of Egypt from 1334 until 1325 B.C.
Tut Ankh Amun became the king of Egypt when he was only 9 years old. The word “Tut Ankh Amun” in ancient Egyptian means “the living incarnation of Amun,” the most important god in ancient Egypt.
Tut Ankh Amun lived in a transitory period of ancient Egyptian history as he became the ruler of Egypt after Akhenaton, who tried to unify the multi-god system in Egypt into the worship of only one god, Aton, the god of the Sun.
However, when Akhenaton passed away, and Tut Ankh Amun became his successor, the multi-god system became prominent in Egypt once again, represented by the ascendance of the worship of Amun once again. The tomb of Tutankhamun was discovered in 1932 by Howard Carter and managed to garner major media attention worldwide. His grave was intact and featured some of the most beautiful burial items and
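
An alternative to filtering on the `_ngcontent-*` attributes is to search for paragraphs only inside the article container found earlier with `soup.find('div', {'class': 'inner-html'})`. The sketch below is a hedged illustration on a small stand-in HTML string, not the real page. It assumes that all of the article text sits inside that `div`, which has not been verified for the whole page, and it uses the built-in `html.parser` backend to avoid depending on `lxml`.

```python
from bs4 import BeautifulSoup

# Small stand-in for the real page (assumption: article text sits in div.inner-html)
html = """
<nav><p _ngcontent-sc40="">Luxor Travel Guide</p></nav>
<div class="inner-html">
  <h2>Who was Tut Ankh Amun?</h2>
  <p>King Tut Ankh Amun was one of the kings of the 18th dynasty.</p>
  <p>Tut Ankh Amun became the king of Egypt when he was only 9 years old.</p>
</div>
<footer><p _ngcontent-sc42="">Powered By Digital Experts</p></footer>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("div", {"class": "inner-html"})
# find() returns None when the container is missing, so guard against that
paragraphs = article.find_all("p") if article is not None else []
article_text = "\n".join(p.get_text(strip=True) for p in paragraphs)
print(article_text)
```

Scoping the search this way does not rely on the generated attribute names, which may change when the site is rebuilt.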

In [28]:
txtprepros = TextPreprocessing(txt)
finaltxt = txtprepros.fit()
print(finaltxt)

['king', 'tut', 'ankh', 'amun', 'king', 'tutankhamun', 'one', 'kings', 'th', 'dynasty', 'new', 'kingdom', 'pharaonic', 'period', 'ruler', 'egypt', 'bc', 'tut', 'ankh', 'amun', 'became', 'king', 'egypt', 'years', 'old', 'word', 'tut', 'ankh', 'amun', 'ancient', 'egyptian', 'means', 'living', 'incarnation', 'amun', 'important', 'god', 'ancient', 'egypt', 'tut', 'ankh', 'amun', 'lived', 'transitory', 'period', 'ancient', 'egyptian', 'history', 'became', 'ruler', 'egypt', 'akhenaton', 'tried', 'unify', 'multigod', 'system', 'egypt', 'worship', 'one', 'god', 'aton', 'god', 'sun', 'however', 'akhenaton', 'passed', 'away', 'tut', 'ankh', 'amun', 'became', 'successor', 'multigod', 'system', 'became', 'prominent', 'egypt', 'represented', 'ascendance', 'worship', 'amun', 'tomb', 'tutankhamun', 'discovered', 'howard', 'carter', 'managed', 'garner', 'major', 'media', 'attention', 'worldwide', 'grave', 'intact', 'featured', 'beautiful', 'burial', 'items', 'furniture', 'ever', 'found', 'funeral', 'm
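
The description also asks for the unique values. A minimal sketch of that last step is shown below, assuming a token list like the one returned by `fit()`. The sample tokens are the first eight from the output above, and the variable names are illustrative only.

```python
# First eight tokens from the output above, standing in for the full list
tokens = ['king', 'tut', 'ankh', 'amun', 'king', 'tutankhamun', 'one', 'kings']

# set() removes duplicates but does not preserve order
unique_unordered = set(tokens)

# dict.fromkeys() removes duplicates and keeps first-seen order (Python 3.7+)
unique_ordered = list(dict.fromkeys(tokens))

print(len(tokens), len(unique_unordered))  # 8 7
print(unique_ordered)
```

A `set` is enough when only membership or counts matter. The `dict.fromkeys` variant is useful when the unique words should stay in reading order.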