Topic 2: Choose a website (e.g., Wikipedia) and perform web/text mining in a Jupyter Notebook.

Extract and analyze the following sections:

Title

Introductory Paragraph

History

Main Content

References

Then apply sentiment analysis to the extracted text using NLP tools like TextBlob, VADER, or NLTK.



## Part 1: Web Scraping 
#### Web Content Mining with BeautifulSoup and Wikipedia
Concept: Web content mining extracts information from web pages.

Installation:

In [76]:
#requests: Downloads the web page content
#BeautifulSoup: Parses the HTML and extracts the required parts

#!pip install beautifulsoup4 requests



In [78]:
#show title

#bs4 is short for BeautifulSoup version 4, a Python library used to parse HTML and XML files.
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Getty_Center"
response = requests.get(url)

#parsing
soup = BeautifulSoup(response.content, 'html.parser')

# show title
title = soup.find("h1").text
print(title)

Getty Center


In [52]:
#check the html code
#print(soup)

In [16]:
# This lookup failed: no element with class "mw-parser-output" was found
#intro=soup.find("div",{"class":"mw-parser-output"}).find("p").text

In [26]:
# First attempt at showing the introductory paragraph (prints nothing)

# soup.find("p"): finds the first <p> tag (paragraph) on the page
# .text: extracts the plain-text content of the <p> tag (HTML tags removed)
content = soup.find("p").text
print(content)

# The output below is blank. Explanation suggested by ChatGPT:
# the first <p> tag on the page is probably empty, or contains only
# whitespace, so there is nothing visible to print.





In [80]:
#Show Introductory Paragraph
paragraphs = soup.find_all("p")
for p in paragraphs:
    text = p.get_text(strip=True)
    if text:
        content = text
        break

print(content)

TheGetty Center, inLos Angeles, California, United States, is a campus of theGetty Museumand other programs of theGetty Trust. The $1.3 billion center opened to the public on December 16, 1997,[2]and is well known for its architecture, gardens, and views overlooking Los Angeles. The center sits atop a hill connected to a visitors' parking garage at the bottom of the hill bya three-car, cable-pulledhovertrainpeople mover.[3]
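The output above has words glued together ("TheGetty Center, inLos Angeles"). This is a side effect of `get_text(strip=True)`, which strips every text fragment before joining them. The sketch below reproduces the effect on a small, hypothetical HTML snippet (not the real page) and shows one possible alternative: call `get_text()` without arguments and strip only the ends of the result. The helper name `first_nonempty_paragraph` is an illustrative assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: an empty first <p>, then a paragraph with inline tags.
demo_html = (
    '<p class="mw-empty-elt">\n</p>'
    '<p>The <b>Getty Center</b>, in <a href="#">Los Angeles</a>, is a campus.</p>'
)
demo_soup = BeautifulSoup(demo_html, "html.parser")


def first_nonempty_paragraph(soup):
    """Return the text of the first <p> that contains visible text, or None."""
    for p in soup.find_all("p"):
        text = p.get_text().strip()  # strip only the ends, keep inner spacing
        if text:
            return text
    return None


# strip=True strips each fragment separately, which removes the spaces around links.
glued = demo_soup.find_all("p")[1].get_text(strip=True)
intro_demo = first_nonempty_paragraph(demo_soup)

print(glued)
print(intro_demo)
```

On the real page, the same helper could be called as `first_nonempty_paragraph(soup)`.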


In [30]:
# Get the main content (paragraphs [1:4] only)
main_content = " ".join([p.get_text() for p in soup.find_all("p")[1:4]])
print(main_content)

The Getty Center, in Los Angeles, California, United States, is a campus of the Getty Museum and other programs of the Getty Trust. The $1.3 billion center opened to the public on December 16, 1997,[2] and is well known for its architecture, gardens, and views overlooking Los Angeles. The center sits atop a hill connected to a visitors' parking garage at the bottom of the hill by a three-car, cable-pulled hovertrain people mover.[3]
 Located in the Brentwood neighborhood of Los Angeles, the center is one of two locations of the J. Paul Getty Museum and draws 1.8 million visitors annually. (The other location is the Getty Villa in the Pacific Palisades neighborhood of Los Angeles, California.) The center branch of the museum features pre-20th-century European paintings, drawings, illuminated manuscripts, sculpture, and decorative arts; and photographs from the 1830s through present day from all over the world.[4][5] In addition, the museum's collection at the center includes outdoor scu
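The extracted text still contains citation markers such as `[2]` and `[4][5]`, plus line breaks between paragraphs. Before passing the text to a sentiment tool, it may help to remove them. The following is a minimal sketch using only the standard library; the function name `strip_citation_markers` is an assumption made for illustration, and the pattern only targets purely numeric markers.

```python
import re


def strip_citation_markers(text):
    """Remove bracketed numeric citation markers such as [2] or [4][5]."""
    without_markers = re.sub(r"\[\d+\]", "", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(without_markers.split())


sample = (
    "opened to the public on December 16, 1997,[2] and is well known.[4][5]\n"
    " Located in Brentwood"
)
cleaned_sample = strip_citation_markers(sample)
print(cleaned_sample)
```

In the notebook, this could be applied as `main_content = strip_citation_markers(main_content)`.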

In [68]:
# Extract the key sections of the page
sections = soup.find_all("h2")
for section in sections:
    section_title = section.text
    # find_next starts at the current heading tag and searches forward
    # through the document for the first <p> tag that appears after it.
    section_content_tag = section.find_next("p")
    if section_content_tag:
        section_content = section_content_tag.text
        print("section title:", section_title)
        print("section content:", section_content)
        print()

section title: Contents
section content: 


section title: Location and history
section content: Originally, the Getty Museum started in J. Paul Getty's house located in Pacific Palisades in 1954. He expanded the house with a museum wing. In the 1970s, Getty built a replica of an Italian villa on his home's land to better house his collection, which opened in 1974. After Getty's death in 1976, the entire property was turned over to the Getty Trust for museum purposes. However, the collection outgrew the site, which has since been renamed the Getty Villa, and management sought a location more accessible to Los Angeles. The purchase of the land upon which the center is located, a campus of 24 acres (9.7 ha) on a 110-acre (45 ha) site in the Santa Monica Mountains above Interstate 405, surrounded by 600 acres (240 ha) kept in a natural state, was announced in 1983. The top of the hill is 900 feet (270 m) above sea level, high enough that on a clear day it is possible to see not only the L

#### History

In [72]:
# Every attempt below to find the heading through a <span> id returned None.
#history_section = soup.find("span",{"id":"Location and history"})
#import re
#history_section = soup.find("span", id=re.compile("Location.*history", re.I))
#history_section = soup.find("span",{"id":"history"})

# Wikipedia section ids replace spaces with underscores. This lookup is kept
# active so that history_section is defined for the code below (still None here).
section_id = "Location and history".replace(" ", "_")
history_section = soup.find("span", {"id": section_id})

print(history_section)

#if history section exists
if history_section:
    #inside h2
    history_heading= history_section.find_parent("h2")
    
    #initialize a list to store content paragraphs
    history_content = []
    #get all sibling elements until the next <h2> tag and extract the text
    next_sibling= history_heading.find_next_sibling()
    while next_sibling and next_sibling.name != "h2":
        # keep only paragraph (<p>) elements
        if next_sibling.name == "p":
            history_content.append(next_sibling.get_text())
        next_sibling=next_sibling.find_next_sibling()
        
    print("History and Location Content:")
    for paragraph in history_content:
        print(paragraph.strip())
else:
    print("history and location not found.")

None
history and location not found.


In [86]:
# List every <span> id on the page: none of them refers to the history section,
# so the page has no <span id="Location_and_history"> element.

for span in soup.find_all("span"):
    if span.get("id"):
        print(span["id"])

coordinates
Getty_Research_Institute_.28GRI.29


In [96]:
#Location and history not found

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Getty_Center"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

target_title = "Location and history"

# Iterate over every h2 and look for its headline span
for h2 in soup.find_all("h2"):
    span = h2.find("span", class_="mw-headline")
    if span and target_title.lower() in span.get_text().lower():
        print(f"== {target_title} ==")
        
        # Collect the content after this heading, up to the next h2
        content = []
        next_tag = h2.find_next_sibling()
        while next_tag and next_tag.name != "h2":
            if next_tag.name == "p":
                content.append(next_tag.get_text(strip=True))
            next_tag = next_tag.find_next_sibling()
        
        for para in content:
            print(para)
        break
else:
    print(f"Section '{target_title}' not found.")

Section 'Location and history' not found.
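The earlier `h2` loop printed "Location and history" as a plain heading text, and no `mw-headline` span was found, so matching on the heading text itself may be more robust than matching on a nested span. The sketch below takes that approach. It additionally assumes (unverified against the live page) that a heading may be wrapped in a `<div class="mw-heading ...">`, in which case the section body follows the wrapper rather than the `<h2>`. The function name `section_paragraphs` and the demo HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def section_paragraphs(soup, title):
    """Collect the <p> texts between the <h2> named `title` and the next <h2>."""
    wanted = title.strip().lower()
    heading = None
    for h2 in soup.find_all("h2"):
        if h2.get_text(strip=True).lower() == wanted:
            heading = h2
            break
    if heading is None:
        return []

    # Assumption: the heading may be wrapped in <div class="mw-heading ...">.
    # If so, the section body follows the wrapper, not the <h2> itself.
    start = heading
    parent = heading.parent
    if (
        parent is not None
        and parent.name == "div"
        and "mw-heading" in (parent.get("class") or [])
    ):
        start = parent

    paragraphs = []
    for sibling in start.find_next_siblings():
        # Stop at the next section, whether its <h2> is bare or wrapped.
        if sibling.name == "h2" or sibling.find("h2", recursive=False) is not None:
            break
        if sibling.name == "p":
            text = sibling.get_text().strip()
            if text:
                paragraphs.append(text)
    return paragraphs


# Hypothetical markup used only to exercise the function.
demo_html = """
<div class="mw-heading mw-heading2"><h2 id="Location_and_history">Location and history</h2></div>
<p>First history paragraph.</p>
<ul><li>not a paragraph</li></ul>
<p>Second history paragraph.</p>
<div class="mw-heading mw-heading2"><h2 id="Architecture">Architecture</h2></div>
<p>Architecture paragraph.</p>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")
history_demo = section_paragraphs(demo_soup, "Location and history")
print(history_demo)
```

On the real page, the call would be `section_paragraphs(soup, "Location and history")`; an empty list means the heading was not found or had no paragraphs.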


In [106]:
# Location and history not found
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Getty_Center"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Look for an <ol> whose class is "Location and history"
# (the page has no such list, so nothing is found)
ref_list = soup.find("ol", class_="Location and history")

if ref_list:
    references = []
    for li in ref_list.find_all("li"):
        references.append(li.get_text(strip=True))

    print("== References ==")
    for i, ref in enumerate(references, 1):
        print(f"[{i}] {ref}")
else:
    print("No references found.")


No references found.


#### References

In [108]:
#find reference 
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Getty_Center"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Look up <ol class="references"> directly
ref_list = soup.find("ol", class_="references")

if ref_list:
    references = []
    for li in ref_list.find_all("li"):
        references.append(li.get_text(strip=True))

    print("== References ==")
    for i, ref in enumerate(references, 1):
        print(f"[{i}] {ref}")
else:
    print("No references found.")


== References ==
[1] ^"TOP 100 Art museum attendance (continued from page 3)". Vol. 29, no. 322 (International ed.).The Art Newspaper. April 2020. p. 15.
[2] ^ab"The Getty Center: Reflecting on 10 Years". Archived fromthe originalon July 13, 2010. RetrievedAugust 27,2020.
[3] ^abSimon, Richard (August 11, 1995)."The Art of Getting to the Getty Will Have Visitors Floating on Air".Los Angeles Times.
[4] ^"About the Museum (Getty Museum)".www.getty.edu. RetrievedMarch 16,2018.
[5] ^"Photographs | the J. Paul Getty Museum".www.getty.edu. RetrievedMarch 16,2018.
[6] ^Morgenstern, Joe. Getty opens mammoth hilltop center to public.Wall Street Journal(Eastern edition), December 16, 1997.
[7] ^Hardy, Terri. "Covering all angles – 'preview' a coveted assignment".Daily News of Los Angeles, December 10, 1997.
[8] ^Miller, Daryl H. Meier: centering on a landmark.Daily News of Los Angeles, December 20, 1987.
[9] ^abMoody, Lori. "In the home stretch – half-finished Getty Center nearing landmark statu
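Each reference above starts with backlink residue such as `^` or `^ab`, and words are glued together because of `get_text(strip=True)`. One way to get cleaner entries is to read only the citation body and normalize whitespace afterwards. The sketch below assumes each `<li>` holds its citation inside a `<span class="reference-text">`; that class name, the demo markup, and the helper name `reference_texts` are assumptions that should be checked against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on a single reference entry.
demo_html = """
<ol class="references">
  <li><span class="mw-cite-backlink"><b>^</b></span>
      <span class="reference-text">"About the Museum (Getty Museum)".
      <i>www.getty.edu</i>. Retrieved March 16, 2018.</span></li>
</ol>
"""


def reference_texts(soup):
    """Return the citation text of each reference, without backlink markers."""
    ref_list = soup.find("ol", class_="references")
    if ref_list is None:
        return []
    refs = []
    for li in ref_list.find_all("li"):
        body = li.find("span", class_="reference-text")
        if body is None:
            body = li  # fall back to the whole list item
        # get_text() keeps the inner spacing; split/join normalizes whitespace.
        refs.append(" ".join(body.get_text().split()))
    return refs


demo_refs = reference_texts(BeautifulSoup(demo_html, "html.parser"))
print(demo_refs)
```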

## Part 2: Sentiment Analysis

In [116]:
#!pip install textblob

Collecting textblob
  Downloading textblob-0.19.0-py3-none-any.whl.metadata (4.4 kB)
Downloading textblob-0.19.0-py3-none-any.whl (624 kB)
Installing collected packages: textblob
Successfully installed textblob-0.19.0


In [118]:
#Import Libraries
from textblob import TextBlob

## Analyzing Sentiment

Each `TextBlob` object has a `sentiment` property that you can use to get the sentiment of the text. This property returns a named tuple of the polarity and subjectivity.

In [121]:
blob = TextBlob(main_content)

sentiment = blob.sentiment
print("Polarity: ", sentiment.polarity)
print("Subjectivity: ", sentiment.subjectivity)

Polarity:  -0.025238095238095233
Subjectivity:  0.18444444444444444


### Polarity
Polarity is a measure from -1 to +1, where -1 indicates negative sentiment and +1 indicates positive sentiment. Here the polarity is approximately -0.025, which is very close to 0: the extracted text reads as essentially neutral, with only a negligible lean toward the negative side. This is plausible for encyclopedic prose, which mostly describes facts rather than expressing approval or disapproval.
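To make the reading of a polarity score explicit, the score can be mapped to a coarse label. The sketch below is one way to do this; the function name `label_polarity` and the 0.05 cutoff are illustrative assumptions, not part of TextBlob. It is applied to the polarity value printed by the notebook.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse label.

    The 0.05 threshold is an illustrative assumption, not a TextBlob default.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Polarity value printed by the notebook for main_content.
notebook_label = label_polarity(-0.025238095238095233)
print(notebook_label)
```

With this cutoff, the Getty Center text is labeled neutral; a stricter or looser threshold would change where the boundaries fall.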

### Subjectivity
Subjectivity quantifies how much of the text is personal opinion rather than factual information. The scale ranges from 0 to 1, where 0 is very objective (fact-based) and 1 is very subjective (opinion-based). Here the subjectivity is approximately 0.184, which places the text close to the objective end of the scale. This matches the content: the extracted paragraphs mostly state dates, locations, and visitor figures rather than personal opinions.