### NLP task: Take a text document
#### a) Tokenize into sentences and words

#### b) Remove stop words and punctuation

#### c) Lemmatize the words

#### d) Summarize the document

### Use Beautiful Soup to scrape a page on the internet, e.g. the Hong Kong Wikipedia page

### Create a summary of the page.

In [70]:
# import libraries as required
import requests
from bs4 import BeautifulSoup

# Fetch the Wikipedia page
url = "https://en.wikipedia.org/wiki/Hong_Kong"
response = requests.get(url, timeout=10)  # timeout so a stalled connection cannot hang the cell

# Check the response status
if response.status_code == 200:
    print("Successfully fetched the page.")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

# Parse the content
soup = BeautifulSoup(response.content, 'html.parser')

# Collect all paragraphs plus section headings (h2, h3) for context
content = soup.find_all(['p', 'h2', 'h3'])

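# Join the extracted elements into one string; fall back to an empty string
# so that `text` is always defined for the checks below
if content:
    text = ' '.join(element.get_text() for element in content)
else:
    text = ''
    print("No <p>, <h2>, or <h3> elements found on the page.")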

# Print the length of the retrieved text
print(f"Length of text: {len(text)}")

# If text is empty, print part of the raw HTML for debugging
if len(text) == 0:
    print("Raw HTML snippet:")
    print(response.text[:1000])  # Print the first 1000 characters of the decoded HTML
else:
    print("Retrieved text snippet:")
    print(text[:500])  # Print the first 500 characters of the text

Successfully fetched the page.
Length of text: 59685
Retrieved text snippet:
Contents 
 Hong Kong[e] is a special administrative region of the People's Republic of China. With 7.4 million residents of various nationalities[f] in a 1,104-square-kilometre (426 sq mi) territory, Hong Kong is the fourth most densely populated region in the world.
 Hong Kong was established as a colony of the British Empire after the Qing dynasty ceded Hong Kong Island in 1841–1842 as a consequence of losing the First Opium War. The colony expanded to the Kowloon Peninsula in 1860 and was fur
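The snippet above still contains Wikipedia footnote markers such as `[e]` and `[f]`, and later outputs show numeric ones such as `[50]`. A minimal sketch of stripping them before tokenizing follows. It assumes the markers are short bracketed numbers or letters; `strip_citation_markers` and `CITATION_RE` are hypothetical names that are not part of the original notebook.

```python
import re

# Matches bracketed footnote markers such as [e], [f], [50] or [note 1].
# The exact set of marker shapes is an assumption based on the output above.
CITATION_RE = re.compile(r"\[(?:\d+|[a-z]{1,2}|note \d+)\]")

def strip_citation_markers(raw: str) -> str:
    """Remove footnote markers, then collapse runs of whitespace to one space."""
    without_markers = CITATION_RE.sub("", raw)
    # \s also matches non-breaking spaces (\xa0), which appear in the scraped text.
    return re.sub(r"\s+", " ", without_markers).strip()

sample = "Hong Kong[e] is a special administrative region.[50] It has 7.4 million residents[f]."
print(strip_citation_markers(sample))
# Hong Kong is a special administrative region. It has 7.4 million residents.
```

In the notebook this could be applied as `text = strip_citation_markers(text)` right after the text is assembled, so that the tokenizer and the summarizer never see the markers.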


In [63]:
# importing more libraries
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

# Download necessary NLTK resources
# (recent NLTK releases may also need nltk.download('punkt_tab') for sent_tokenize)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [65]:
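# Tokenization: split the text into sentences and into word-level tokens
sentences = sent_tokenize(text)
words = word_tokenize(text)

# Remove stop words and punctuation: keep alphanumeric tokens that are not
# in NLTK's English stop-word list (compared case-insensitively)
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.isalnum() and word.lower() not in stop_words]

# Lemmatization: lemmatize() treats every word as a noun unless a pos argument is given
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word.lower()) for word in filtered_words]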

# Print the results
print("Tokenized Sentences:")
print(sentences[:3])  # Print first 3 sentences

print("\nFiltered Words:")
print(filtered_words[:10])  # Print first 10 filtered words

print("\nLemmatized Words:")
print(lemmatized_words[:10])  # Print first 10 lemmatized words


Tokenized Sentences:
["Contents \n Hong Kong[e] is a special administrative region of the People's Republic of China.", 'With 7.4 million residents of various nationalities[f] in a 1,104-square-kilometre (426\xa0sq\xa0mi) territory, Hong Kong is the fourth most densely populated region in the world.', 'Hong Kong was established as a colony of the British Empire after the Qing dynasty ceded Hong Kong Island in 1841–1842 as a consequence of losing the First Opium War.']

Filtered Words:
['Contents', 'Hong', 'Kong', 'e', 'special', 'administrative', 'region', 'People', 'Republic', 'China']

Lemmatized Words:
['content', 'hong', 'kong', 'e', 'special', 'administrative', 'region', 'people', 'republic', 'china']
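The stray `'e'` in the lists above comes from the footnote marker `[e]`: the brackets are dropped by `isalnum()`, but the letter itself passes the filter. The sketch below compares the notebook's filter with a slightly stricter one on a hand-written token list. `keep_token` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop-word set is only a stand-in for NLTK's list so that the example runs without downloads.

```python
# Small stand-in for NLTK's English stop-word list (assumption for this demo).
DEMO_STOP_WORDS = {"is", "a", "of", "the", "in"}

def keep_token(token: str, stop_words: set) -> bool:
    """Keep alphabetic tokens longer than one character that are not stop words."""
    return token.isalpha() and len(token) > 1 and token.lower() not in stop_words

# Hand-written tokens resembling the start of the scraped text.
tokens = ["Hong", "Kong", "[", "e", "]", "is", "a", "special",
          "region", "of", "7.4", "million", "."]

# Same condition as the notebook cell above.
original_filter = [t for t in tokens if t.isalnum() and t.lower() not in DEMO_STOP_WORDS]
# Stricter variant that also drops single-letter residue.
stricter_filter = [t for t in tokens if keep_token(t, DEMO_STOP_WORDS)]

print(original_filter)  # ['Hong', 'Kong', 'e', 'special', 'region', 'million']
print(stricter_filter)  # ['Hong', 'Kong', 'special', 'region', 'million']
```

The stricter filter trades recall for cleanliness: because it uses `isalpha()`, it also discards purely numeric tokens such as years, which the original `isalnum()` check keeps.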


In [67]:
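# Summarization: parse the raw text with sumy's English tokenizer and
# extract the 5 highest-ranked sentences using latent semantic analysis (LSA)
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LsaSummarizer()
summary = summarizer(parser.document, 5)  # Summarize to 5 sentences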

print("\nSummary:")
for sentence in summary:
    print(sentence)


Summary:
[50] Administrative infrastructure was quickly built by early 1842, but piracy, disease, and hostile Qing policies initially prevented the government from attracting commerce.
[65] Although the territory's competitiveness in manufacturing gradually declined because of rising labour and property costs, it transitioned to a service-based economy.
[123] Hong Kong residents are not required to perform military service, and current law has no provision for local enlistment, so its defence is composed entirely of non-Hongkongers.
[288] Vehicle traffic is extremely congested in urban areas, exacerbated by limited space to expand roads and an increasing number of vehicles.
[324] Spiritual concepts such as feng shui are observed; large-scale construction projects often hire consultants to ensure proper building positioning and layout.
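As a rough cross-check on the LSA output, a much simpler extractive summary can be built from word frequencies alone. The sketch below is not LSA and not what `sumy` does internally; it is a plain frequency-scoring baseline using only the standard library. `frequency_summary` and `DEMO_STOP_WORDS` are hypothetical names, the sentence splitter is a naive regex, and the stop-word set is a stand-in for the NLTK list used earlier.

```python
import re
from collections import Counter

# Tiny stand-in stop-word list (assumption); the NLTK list could be passed instead.
DEMO_STOP_WORDS = frozenset(
    {"the", "a", "an", "of", "in", "is", "and", "to", "it", "was", "as"}
)

def frequency_summary(document: str, n_sentences: int = 2,
                      stop_words=DEMO_STOP_WORDS) -> list:
    """Return the n highest-scoring sentences, in their original order."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Lowercased alphabetic words per sentence, with stop words removed.
    tokenized = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in stop_words]
        for s in sentences
    ]
    freq = Counter(w for words in tokenized for w in words)
    # Score = average document-wide frequency of the sentence's content words.
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]

demo = (
    "Hong Kong is a major port. "
    "The port made Hong Kong a trading hub. "
    "Feng shui is also observed."
)
for sentence in frequency_summary(demo, n_sentences=2):
    print(sentence)
```

On the demo text the first two sentences are selected because they share the frequent words "hong", "kong" and "port". Averaging by sentence length keeps long sentences from winning purely on word count, which is one design choice among several possible ones.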
