# Practical 4


## Goals

1.  Understand text analysis techniques, including stemming, lemmatization, and morphological analysis.
2.  Practice extracting structured elements from HTML (links, images, tables, sections).
3.  Apply NLP tools to explore a real-world text corpus.


## Exercise 4.1 \[★\]

Download the French Wikipedia page for Paris (https://fr.wikipedia.org/wiki/Paris) and save it as an HTML file. Analyze the page by extracting and counting words, links, images, numbers, dates, proper nouns, and structured data from tables, while differentiating between sections and paragraphs. This involves downloading the HTML, parsing it, and systematically identifying relevant content. Write a program that implements the following tasks:

1. **Download HTML**: Fetch and save the Wikipedia page as an HTML file.
2. **Load Content**: Read and parse the HTML file for analysis.
3. **Word Analysis**: Count word occurrences in the text.
4. **Extract Links**: Identify and categorize internal and external links.
5. **Image Extraction**: Locate images and gather their URLs and sizes.
6. **Number and Date Extraction**: Identify numbers, dates, and geographical coordinates.
7. **Proper Nouns**: Extract names of people and places.
8. **Table Data**: Locate and extract data from tables.
9. **Section Differentiation**: Identify sections and paragraphs in the content.

#### Analysis of Wikipedia Page: Paris

This notebook walks through extracting and analyzing various elements of the Wikipedia page for Paris.

##### Step 1: Download the HTML Page
First, download the HTML content of the specified Wikipedia page and save it as an HTML file. We use the `requests` library to handle the HTTP request. Remember to check the response status to confirm that the page was downloaded successfully.

In [None]:
import requests

# URL of the Wikipedia page
url = "https://fr.wikipedia.org/wiki/Paris"

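# Send a GET request to the URL
# (Wikimedia asks clients to send a descriptive User-Agent; this value is a placeholder)
headers = {"User-Agent": "practical4-notebook/0.1 (educational use)"}
response = requests.get(url, headers=headers, timeout=30)

# Stop with an error if the download failed (4xx/5xx status)
response.raise_for_status()

# Save the content as an HTML file
with open("paris.html", "w", encoding='utf-8') as file:
    file.write(response.text)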

print("HTML page downloaded and saved as paris.html")


##### Step 2: Load the HTML Content
Load the downloaded HTML file for further analysis.
- **Comment**: Parsing the HTML is crucial for extracting data. Make sure to use a library like BeautifulSoup that can navigate the HTML structure effectively.

Familiarize yourself with the `BeautifulSoup` methods to find elements in the HTML, such as `find()` and `find_all()`.

In [None]:
from bs4 import BeautifulSoup

# Load the HTML file
with open("paris.html", "r", encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
print("HTML content loaded.")


##### Step 3: Extract and Analyze Words
Count the occurrences of each word in the page.
- **Comment**: Consider normalizing the text by converting it to lowercase to avoid counting the same word in different cases separately. We use regular expressions to effectively filter out non-word characters when splitting the text into words.

In [None]:
from collections import Counter
import re

# Extract text from the HTML content
text = soup.get_text()

# Clean and split text into words
words = re.findall(r'\w+', text.lower())
word_count = Counter(words)

# Display the 10 most common words
print(word_count.most_common(10))


##### Step 4: Extract Links
Identify all internal and external links from the page.

- **Comment**: Understanding the difference between internal and external links is important for categorization.
- **Hint**: Check the `href` attribute of the anchor (`<a>`) tags to determine the type of link.
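
As a minimal sketch of the categorization described above, the function below classifies a single `href` value using only the standard library. It assumes that "internal" means "same host as the Paris page", so links to other Wikipedia language editions count as external; `classify_link`, `BASE_URL`, and the sample values are illustrative names, not part of the exercise.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://fr.wikipedia.org/wiki/Paris"  # page the hrefs were taken from


def classify_link(href, base_url=BASE_URL):
    """Return 'anchor', 'internal', 'external', or 'other' for one href value."""
    if href.startswith("#"):
        return "anchor"  # jump within the same page
    # Resolve relative ("/wiki/...") and protocol-relative ("//host/...") forms
    absolute = urljoin(base_url, href)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return "other"  # e.g. mailto: or javascript:
    if parsed.netloc == urlparse(base_url).netloc:
        return "internal"  # assumption: same host means internal
    return "external"


# Illustrative href values covering each category
sample_hrefs = [
    "#Histoire",
    "/wiki/France",
    "//en.wikipedia.org/wiki/Paris",
    "https://www.paris.fr/",
    "mailto:someone@example.org",
]
for href in sample_hrefs:
    print(href, "->", classify_link(href))
```

With the `soup` object from Step 2, the `href` values can be collected with `[a["href"] for a in soup.find_all("a", href=True)]` and passed to this function one by one.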

##### Step 5: Extract Images and Their Sizes
Identify all images on the page and get their sizes.

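- **Comment**: Be aware that image `src` values are not always written in the same form: some are absolute URLs, others are relative or protocol-relative (starting with `//`). Ensure you construct the correct full URLs for them.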
- **Hint**: You may need to check the attributes of the `<img>` tags to get additional information, such as the size of the images if available.
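
One possible way to handle this, sketched under assumptions: the helper below takes the attribute dictionary of an `<img>` tag, builds an absolute URL, and reads the declared display size. The name `image_info` and the sample attributes are hypothetical. The `width` and `height` attributes describe the rendered size in pixels, not the file size in bytes.

```python
from urllib.parse import urljoin

BASE_URL = "https://fr.wikipedia.org/wiki/Paris"  # page the images were found on


def image_info(attrs, base_url=BASE_URL):
    """Build an absolute URL and read the declared size from an <img> attribute dict."""

    def to_int(value):
        # width/height are optional and arrive as strings such as "220"
        return int(value) if value and value.isdigit() else None

    src = attrs.get("src")
    return {
        # urljoin also resolves protocol-relative paths such as "//host/file.jpg"
        "url": urljoin(base_url, src) if src else None,
        "width": to_int(attrs.get("width")),
        "height": to_int(attrs.get("height")),
        "alt": attrs.get("alt", ""),
    }


# Hypothetical attribute dict, shaped like what BeautifulSoup exposes as img.attrs
sample = {
    "src": "//upload.wikimedia.org/example/220px-Example.jpg",
    "width": "220",
    "height": "147",
}
print(image_info(sample))
```

With the `soup` object from Step 2, this could be applied as `[image_info(img.attrs) for img in soup.find_all("img")]`.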

##### Step 6: Extract Numbers, Dates, and Geographical Coordinates
Identify numbers, dates, and geographical coordinates from the text.

- **Comment**: Different formats for dates and numbers can complicate extraction. Consider the various ways these can appear on the page.
- **Hint**: Use regular expressions tailored for specific patterns (e.g., date formats or geographic coordinates) to accurately identify them.
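
The patterns below are a rough sketch of such tailored expressions, assuming French conventions: dates written as day, month name, year; numbers with space-separated thousands and a decimal comma; coordinates in degrees, minutes, and seconds. The pattern names and the sample sentence are illustrative, and real pages will contain formats these patterns miss.

```python
import re

MONTHS = ("janvier|février|mars|avril|mai|juin|juillet|août|"
          "septembre|octobre|novembre|décembre")

# Day (optionally "1er"), French month name, four-digit year
DATE_RE = re.compile(rf"\b(\d{{1,2}}(?:er)?)\s+({MONTHS})\s+(\d{{4}})\b", re.IGNORECASE)

# Digits with space-separated thousands groups (plain, non-breaking, or narrow
# non-breaking space) and an optional decimal comma; otherwise a plain number
NUMBER_RE = re.compile(r"\d{1,3}(?:[ \u00a0\u202f]\d{3})+(?:,\d+)?|\d+(?:,\d+)?")

# Degrees, minutes, seconds followed by a direction word or letter
COORD_RE = re.compile(
    r"(\d{1,3})°\s*(\d{1,2})′\s*(\d{1,2})″\s*(nord|sud|est|ouest|[NSEO])\b",
    re.IGNORECASE,
)

# Illustrative sentence, not taken from the page
sample = ("Paris compte 2 102 650 habitants sur 105,4 km². "
          "La Bastille est prise le 14 juillet 1789. "
          "Coordonnées : 48° 51′ 24″ nord, 2° 21′ 07″ est.")

print("Dates:", DATE_RE.findall(sample))
print("Numbers:", NUMBER_RE.findall(sample))
print("Coordinates:", COORD_RE.findall(sample))
```

Note that the number pattern also picks up the digits inside dates and coordinates, so one option is to remove those matches from the text before searching for plain numbers.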

##### Step 7: Identify Proper Nouns
Extract proper nouns from the text.

- **Comment**: Proper nouns can include names of people, places, and organizations. Identifying them correctly can enhance your data analysis.
- **Hint**: Use Natural Language Processing (NLP) techniques, such as named entity recognition, to automate the identification of proper nouns.
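
A sketch of the named entity recognition approach, assuming spaCy with the French model `fr_core_news_sm` is installed and that the model labels people as `PER` and places as `LOC`. The function names are illustrative. The counting helper only relies on the `ents`, `text`, and `label_` attributes, and spaCy is imported inside `extract_names` so the helper can be used on its own.

```python
from collections import Counter


def names_by_label(doc, labels=("PER", "LOC")):
    """Count entity texts per label for an object exposing spaCy's doc.ents interface."""
    counts = {label: Counter() for label in labels}
    for ent in doc.ents:
        if ent.label_ in counts:
            counts[ent.label_][ent.text.strip()] += 1
    return counts


def extract_names(text, model_name="fr_core_news_sm"):
    """Run a spaCy pipeline over the text and count people and place names."""
    import spacy  # imported here so names_by_label works without spaCy installed

    nlp = spacy.load(model_name)
    return names_by_label(nlp(text))
```

A call such as `extract_names(soup.get_text())["LOC"].most_common(10)` would then list the ten most frequent place names the model found.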

##### Step 8: Extract Structured Data (Tables)
Identify and extract data from tables present in the HTML.

- **Comment**: Tables often contain organized data that can be useful for analysis. Make sure to capture both header and data cells.
- **Hint**: Familiarize yourself with the structure of HTML tables, including how to navigate rows (`<tr>`) and cells (`<td>` and `<th>`).
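
To illustrate capturing header and data cells separately, here is a small sketch that works on rows already reduced to `(tag_name, text)` pairs. It treats a row as a header row only when every cell is a `<th>`, which is an assumption that will not suit every table. `split_table` and the sample rows are hypothetical.

```python
def split_table(rows):
    """Separate header rows from data rows.

    `rows` is a list of rows; each row is a list of (tag_name, text) pairs,
    where tag_name is "th" or "td".
    """
    headers, data = [], []
    for row in rows:
        if not row:
            continue  # skip empty <tr> elements
        texts = [text for _, text in row]
        if all(tag == "th" for tag, _ in row):
            headers.append(texts)  # every cell is a header cell
        else:
            data.append(texts)  # mixed rows keep their row header as first value
    return headers, data


# Placeholder rows shaped like a small two-column table
sample_rows = [
    [("th", "Year"), ("th", "Value")],
    [("td", "2000"), ("td", "10")],
    [("th", "2010"), ("td", "12")],  # a row header followed by a data cell
]
headers, data = split_table(sample_rows)
print("Headers:", headers)
print("Data:", data)
```

With BeautifulSoup, the rows of one `table` element can be produced by `[[(c.name, c.get_text(" ", strip=True)) for c in tr.find_all(["th", "td"])] for tr in table.find_all("tr")]`, repeated for each table returned by `soup.find_all("table")`.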

##### Step 9: Differentiate Sections and Paragraphs
Identify and separate sections and paragraphs in the content.

- **Comment**: Sections help in understanding the organization of the content. Recognizing different heading levels can aid in content navigation.
- **Hint**: Use appropriate tags (`<h1>`, `<h2>`, etc.) to differentiate between sections and ensure you capture their associated content, like paragraphs.
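
One way to associate paragraphs with their headings, sketched here on `(tag_name, text)` pairs in document order: each paragraph is attached to the most recent heading seen. The name `group_by_section`, the `"(lead)"` label for text before the first heading, and the sample elements are all illustrative choices.

```python
def group_by_section(elements, heading_tags=("h1", "h2", "h3")):
    """Group paragraph texts under the most recent heading.

    `elements` is an iterable of (tag_name, text) pairs in document order.
    Paragraphs before the first heading are stored under the title "(lead)".
    """
    sections = [{"level": 0, "title": "(lead)", "paragraphs": []}]
    for tag, text in elements:
        if tag in heading_tags:
            # The digit in "h2" gives the heading level
            sections.append({"level": int(tag[1]), "title": text, "paragraphs": []})
        elif tag == "p" and text:
            sections[-1]["paragraphs"].append(text)
    return sections


# Placeholder elements imitating the start of an article
sample_elements = [
    ("p", "Intro text."),
    ("h2", "Histoire"),
    ("p", "First paragraph."),
    ("p", "Second paragraph."),
    ("h3", "Antiquité"),
    ("p", "Third paragraph."),
]
for section in group_by_section(sample_elements):
    print(section["level"], section["title"], len(section["paragraphs"]))
```

With the `soup` object from Step 2, the input can be built with `[(el.name, el.get_text(" ", strip=True)) for el in soup.find_all(["h1", "h2", "h3", "p"])]`, since `find_all` returns matches in document order.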


## Exercise 4.2 \[★★\]

Using the extracted text, build a cleaned corpus and compare word frequencies.

1.  Remove punctuation, digits, and stopwords (French). You may use `nltk.corpus.stopwords` or `spacy.lang.fr.stop_words`.
2.  Compute the top 20 most frequent words before and after stopword removal.
3.  Plot a bar chart of the top 20 words after cleaning.

Hint: Keep your preprocessing steps in a small function so you can reuse it.

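    from collections import Counter
    import re

    # Requires nltk.download('stopwords') to have been run once
    from nltk.corpus import stopwords as nltk_stopwords

    # French stopword list (spacy.lang.fr.stop_words.STOP_WORDS is an alternative)
    stopwords = set(nltk_stopwords.words("french"))

    def tokenize(text):
        # Starting point: lowercase, then keep runs of letters only,
        # which drops punctuation and digits. Extend the cleaning as needed.
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        return tokens

    raw_tokens = tokenize(text)
    clean_tokens = [t for t in raw_tokens if t not in stopwords]

    print(Counter(raw_tokens).most_common(20))
    print(Counter(clean_tokens).most_common(20))

    # Plot the top 20 cleaned words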


## Exercise 4.3 \[★★★\]
Analyze the text from the downloaded Wikipedia page by applying stemming, n-gram extraction, PoS tagging, lemmatization, morphological analysis, named entity recognition, and word embedding using Word2Vec models. Compare the results from NLTK, spaCy, and Gensim to evaluate their effectiveness in text analysis tasks.

#### Prerequisites
Make sure you have the required libraries installed. You can install them using pip if you haven't already:

In [None]:
!pip install nltk spacy gensim wordcloud seaborn

In [None]:
! python -m spacy download fr_core_news_sm  # For French language processing

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('maxent_ne_chunker')
nltk.download('words')

#### Step 1: Load the Wikipedia Page
Start by loading the HTML file you saved earlier and extracting the text.

In [None]:
from bs4 import BeautifulSoup

# Load the HTML file
with open("paris.html", "r", encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_content, "html.parser")
text = soup.get_text()

#### Step 2: Apply Stemming Algorithms
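Use the Porter and Snowball stemmers from NLTK to stem the words from the text. The Porter stemmer implements English rules only, so on this French page it serves as a baseline for comparison; the Snowball stemmer is initialized for French.

In [None]:
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer
from collections import Counter
import re

# Tokenize and clean the text
words = re.findall(r'\w+', text.lower())

# Initialize stemmers
porter_stemmer = PorterStemmer()  # English-only algorithm, kept as a baseline
snowball_stemmer = SnowballStemmer("french")  # the page is in French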

# Apply stemming
porter_stems = [porter_stemmer.stem(word) for word in words]
snowball_stems = [snowball_stemmer.stem(word) for word in words]

# Count unique stems
porter_stem_count = Counter(porter_stems)
snowball_stem_count = Counter(snowball_stems)

# Display the most common stems and count of unique stems
print("Most common Porter stems:", porter_stem_count.most_common(10))
print("Unique Porter stems count:", len(porter_stem_count))

print("Most common Snowball stems:", snowball_stem_count.most_common(10))
print("Unique Snowball stems count:", len(snowball_stem_count))

#### Step 3: Extract N-grams
Generate and display the most common n-grams (1-grams to 5-grams) from the text.
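
A minimal sketch of n-gram counting with the standard library only, in which an n-gram is a tuple of `n` consecutive tokens. The helper names and the sample sentence are illustrative; `nltk.util.ngrams` offers an equivalent generator.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    # zip over n shifted copies of the list; stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))


def most_common_ngrams(tokens, max_n=5, top=3):
    """Map each n in 1..max_n to its `top` most frequent n-grams."""
    return {n: Counter(ngrams(tokens, n)).most_common(top) for n in range(1, max_n + 1)}


# Illustrative token list; in the notebook, use the `words` list from Step 2
sample_tokens = "la ville de paris est la capitale de la france".split()
for n, common in most_common_ngrams(sample_tokens, max_n=3).items():
    print(n, common)
```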

#### Step 4: Part-of-Speech (PoS) Tagging
Use NLTK or spaCy to perform PoS tagging on the text.

#### Step 5: Lemmatization
Apply lemmatization using NLTK or spaCy.

#### Step 6: Morphological Analysis
Use spaCy to perform morphological analysis on the text.
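
PoS tags, lemmas, and morphological features are all attributes of the same spaCy tokens, so one pass over the document can collect them together. The sketch below assumes the French model `fr_core_news_sm` is installed; the function names are illustrative. The two helpers only read `text`, `pos_`, `lemma_`, `morph`, and `is_alpha`, and spaCy itself is imported inside `analyze`.

```python
from collections import Counter


def token_table(doc):
    """Collect (text, PoS tag, lemma, morphology) for each alphabetic token."""
    return [
        (tok.text, tok.pos_, tok.lemma_, str(tok.morph))
        for tok in doc
        if tok.is_alpha
    ]


def pos_counts(doc):
    """Count how often each coarse PoS tag occurs among alphabetic tokens."""
    return Counter(tok.pos_ for tok in doc if tok.is_alpha)


def analyze(text, model_name="fr_core_news_sm"):
    """Run the spaCy pipeline once and return both summaries."""
    import spacy  # imported here so the helpers above work without spaCy installed

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return token_table(doc), pos_counts(doc)
```

For a quick look, `rows, counts = analyze(text[:5000])` processes only the first 5000 characters, which keeps the run short while experimenting.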

#### Step 7: Named Entity Recognition (NER)
Use spaCy to identify named entities in the text.


#### Step 8: Frequency Distribution of Words
Visualize the frequency distribution of words using Matplotlib.
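
A possible shape for this plot, as a sketch: count the tokens, keep the most frequent ones, and draw a bar chart. The function names and the default of 20 words are illustrative choices. Matplotlib is imported inside the plotting function so the counting helper has no dependencies.

```python
from collections import Counter


def top_frequencies(tokens, n=20):
    """Return parallel lists of the n most frequent tokens and their counts."""
    pairs = Counter(tokens).most_common(n)
    labels = [word for word, _ in pairs]
    counts = [count for _, count in pairs]
    return labels, counts


def plot_frequencies(tokens, n=20):
    """Draw a bar chart of the n most frequent tokens."""
    import matplotlib.pyplot as plt  # only needed when plotting

    labels, counts = top_frequencies(tokens, n)
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(labels, counts)
    ax.set_xlabel("Word")
    ax.set_ylabel("Occurrences")
    ax.set_title(f"Top {len(labels)} words")
    ax.tick_params(axis="x", labelrotation=60)  # tilt labels so they do not overlap
    fig.tight_layout()
    plt.show()
```

Calling `plot_frequencies(words)` with the token list from Step 2 shows the raw distribution; passing a stopword-filtered list usually gives a more informative chart.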

#### Step 9: Create a Word Cloud

Generate a word cloud to visualize the most frequent words.
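
The sketch below shows one way to do this with the `wordcloud` package installed in the prerequisites. It assumes the cloud is built from a word-to-count mapping; the helper names, the minimum token length of 3, and the image dimensions are illustrative choices.

```python
from collections import Counter


def cloud_frequencies(tokens, stopwords=frozenset(), min_length=3):
    """Word -> count mapping, skipping stopwords and very short tokens."""
    kept = (t for t in tokens if len(t) >= min_length and t not in stopwords)
    return dict(Counter(kept))


def show_word_cloud(frequencies):
    """Render a word cloud from a word -> count mapping."""
    # Third-party imports kept local so cloud_frequencies has no dependencies
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")  # hide the axes around the image
    plt.show()
```

For example, `show_word_cloud(cloud_frequencies(words, stopwords=french_stopwords))` draws the cloud, where `french_stopwords` stands for whichever French stopword set you loaded.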

#### Step 10: Visualization of Named Entities

Visualize the named entities recognized in the text using Matplotlib.

#### Step 11: Visualization of Most Common Nouns

Visualize the most common nouns in the text, which can provide insights into the main subjects discussed.