<a href="https://colab.research.google.com/github/junya2025/text-retrieval-and-mining/blob/main/assignment_TRM_24_25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Analysis and Summarization System

## Deadline
**Friday January 24 by 23:59 at the latest**. Please do not submit your assignment after the deadline as late submissions will not be graded.

## Learning Objectives:

* Work with text data in Python
* Understand basic text preprocessing
* Use simple APIs for text analysis
* Collaborate on a coding project
* Create a basic command-line interface

## Project Description:
Your team of 4 will build a Python program that helps analyze and summarize documents. The program should:

**Session 1 (~3 hours):**


* Read and preprocess text files
* Calculate basic text statistics (word count, sentence count, average word length)
* Find most common words and phrases
* Generate and show 3 word clouds
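
The Session 1 statistics can be prototyped with the standard library alone. The sketch below is a simplification and not the NLTK-based approach used in the notebook: it assumes a regex tokenizer and a naive sentence split on `.`, `!` and `?`, and the names `text_stats` and `top_ngrams` are hypothetical.

```python
import re
from collections import Counter

def text_stats(text):
    """Return word count, sentence count, and average word length."""
    words = re.findall(r"[A-Za-z']+", text)
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {"words": len(words), "sentences": len(sentences), "avg_word_length": avg_len}

def top_ngrams(text, n=2, k=3):
    """Return the k most frequent n-word phrases (lowercased)."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    # Zip n shifted copies of the word list to form consecutive n-grams.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)

sample = "The cat sat. The cat ran! Did the cat sleep?"
stats = text_stats(sample)
phrases = top_ngrams(sample, n=2, k=1)
print(stats)
print(phrases)
```

A real tokenizer handles abbreviations, numbers and punctuation far better than these regular expressions, which is why the notebook relies on `sent_tokenize` and `word_tokenize`.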


**Session 2 (~3 hours):**


* Use the Hugging Face Transformers library (https://huggingface.co/docs/hub/en/transformers) to generate summaries of the news articles.

* Create a simple command-line interface to run all analyses
* Save dataframe into a CSV file
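
One possible shape for the command-line interface is sketched below with `argparse` from the standard library. All argument names and defaults (`--text-column`, `--sample-size`, `--output`, `--no-summary`) are assumptions made for illustration; the assignment does not prescribe them.

```python
import argparse

def build_parser():
    """Build the argument parser for a hypothetical analysis CLI."""
    parser = argparse.ArgumentParser(description="Analyze and summarize news articles.")
    parser.add_argument("input_csv", help="Path to the tab-separated dataset")
    parser.add_argument("--text-column", default="content", help="Column holding the article text")
    parser.add_argument("--sample-size", type=int, default=500, help="Total number of articles to sample")
    parser.add_argument("--output", default="analyzed_bbc_news.csv", help="Where to write the results")
    parser.add_argument("--no-summary", action="store_true", help="Skip the slow summarization step")
    return parser

# Parsing an explicit list keeps the example runnable inside a notebook,
# where sys.argv belongs to the Jupyter kernel rather than to this script.
args = build_parser().parse_args(["bbc-news-data.csv", "--sample-size", "100"])
print(args.input_csv, args.text_column, args.sample_size, args.output, args.no_summary)
```

In a standalone script, calling `parse_args()` without arguments would read the options from the real command line instead.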

**Please note that the suggested times are only an estimate of how long the assignment might take. They are a guide, not a limit: take the time that you need.**

## Use the following News Articles Dataset:

BBC News Dataset: https://www.kaggle.com/datasets/hgultekin/bbcnewsarchive
Contains ~2000 news articles in 5 categories. For your task, please use the column '**Content**' of this dataset. Use a sample of 500 news articles. Make sure your sample contains articles from all 5 categories.
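
Drawing the sample so that every category is represented amounts to stratified sampling. The sketch below shows the idea with the standard library only, on toy rows that stand in for the real dataset; the helper name `stratified_sample` and the toy data are hypothetical, and the notebook itself does the equivalent with pandas.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, total, seed=1):
    """Draw an equal share of `total` rows from every value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    per_group = total // len(groups)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for name in sorted(groups):
        if len(groups[name]) < per_group:
            raise ValueError(f"Category {name!r} has only {len(groups[name])} rows")
        sample.extend(rng.sample(groups[name], per_group))
    return sample

# Toy data standing in for the dataset: 5 categories, 20 rows each.
categories = ["business", "entertainment", "politics", "sport", "tech"]
rows = [{"category": c, "content": f"{c} article {i}"} for c in categories for i in range(20)]
picked = stratified_sample(rows, key="category", total=50)
print(Counter(row["category"] for row in picked))
```

With 500 articles and 5 categories the same logic yields 100 articles per category.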

## Deliverable
One self-contained, fully functional notebook. Please submit only the notebook.

<div style="text-align: center;">
    <h1> Text Retrieval and Mining Group Assignment </h1>
    <h3>Date: 2025-01-07</h3>
    <strong>Group Number:</strong>  <br>
    <strong>Student Names:</strong> Maeve Fang (14186578), Junya Tamura (14194643), Kai-Wen Huang (14168766), Ketong Chen (14259915)
</div>

---

## Package Importing

In [5]:
# TODO: Import required libraries
# Hint: You'll need nltk, pandas, matplotlib, wordcloud, and transformers
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
# Add more imports here...
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.util import ngrams
from transformers import pipeline

# Download required NLTK data
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Main Function Demonstration

In [8]:
class DocumentAnalyzer:
    def __init__(self):
        """Initialize the DocumentAnalyzer with necessary resources"""
        self.stop_words = set(stopwords.words("english"))  # Use the English stop-word list
        self.summarizer = pipeline("summarization", model="facebook/bart-large-cnn")  # Initialize the summarization pipeline

    def process_dataframe(self, df, text_column):
        """
        Process text data from a pandas DataFrame column.

        Args:
            df (pd.DataFrame): Input DataFrame
            text_column (str): Name of the column containing text data

        Returns:
            pd.DataFrame: DataFrame with added analysis columns
        """
        # Validate that the required columns exist before sampling.
        if "category" not in df.columns:
            raise ValueError("Column 'category' does not exist.")

        if text_column not in df.columns:
            raise ValueError(f"Column '{text_column}' does not exist.")

        # Sample 500 articles in total, split equally across the categories.
        # Determine the number of unique categories and the sample size per category.
        unique_categories = df["category"].unique()
        num_categories = len(unique_categories)
        articles_per_category = 500 // num_categories

        # 1: Sample randomly, taking the same number of articles from each category
        sampled_dfs = []
        for category, group in df.groupby("category"):
            if len(group) < articles_per_category:
                raise ValueError(
                    f"Not enough articles in category '{category}' to sample "
                    f"{articles_per_category} (only {len(group)} found)."
                )
            sampled_dfs.append(group.sample(n=articles_per_category, random_state=1))

        # 2: Combine sampled articles back into a single DataFrame
        sampled_df = pd.concat(sampled_dfs).reset_index(drop=True)

        # 3: Apply the text analysis to the sampled articles.
        # .apply(pd.Series) expands each returned dictionary into one row of a new DataFrame.
        stats = sampled_df[text_column].apply(self.basic_stats).apply(pd.Series)

        # 4: Add the results as new columns of the sampled DataFrame (not the full
        # input `df`), so that only the sampled articles are analyzed and summarized.
        sampled_df['Number of words'] = stats['Number of words']
        sampled_df['Number of sentences'] = stats['Number of sentences']
        sampled_df['Average word length'] = stats['Average word length']
        sampled_df['Average sentence length'] = stats['Average sentence length']
        sampled_df['Common Words'] = sampled_df[text_column].apply(self.get_common_words)
        sampled_df['Common Phrases'] = sampled_df[text_column].apply(self.get_common_phrases)
        sampled_df['Summary'] = sampled_df[text_column].apply(self.generate_summary)

        # 5: Return the sampled DataFrame with the analysis columns.
        return sampled_df

    def basic_stats(self, text):
        """
        Calculate basic text statistics.

        Args:
            text (str): The input article to be analyzed.

        Returns:
            dict: Statistics keyed by column name:
                - 'Number of words'
                - 'Number of sentences'
                - 'Average word length'
                - 'Average sentence length' (words per sentence)

        """
        if not isinstance(text, str) or text.strip() == "":
            return {
                'Number of words': 0,
                'Number of sentences': 0,
                'Average word length': 0,
                'Average sentence length': 0
                }

        # Count the number of words
        words = word_tokenize(text)  # Tokenize the text into words
        num_words = len(words)  # Count the total number of words

        # Count the number of sentences
        sentences = sent_tokenize(text)  # Tokenize the text into sentences
        num_sentences = len(sentences)  # Count the total number of sentences

        # Calculate the Average Word Length
        total_characters = sum(len(word) for word in words)  # Total character count of all words
        avg_word_length = total_characters / num_words if num_words > 0 else 0

        # Calculate the average sentence length (number of words per sentence)
        sentence_lengths = [len(word_tokenize(sentence)) for sentence in sentences]  # Number of words in each sentence
        avg_sentence_length = sum(sentence_lengths) / num_sentences if num_sentences > 0 else 0

        return {
            'Number of words': num_words,
            'Number of sentences': num_sentences,
            'Average word length': avg_word_length,
            'Average sentence length': avg_sentence_length
            }

    def get_common_words(self, text, n=10):
        """Find the n most common words in the text."""
        # 1. Tokenize the text
        words = word_tokenize(text)

        # 2. Convert to lowercase
        words = [word.lower() for word in words]

        # 3. Remove stopwords
        words = [word for word in words if word.isalpha() and word not in self.stop_words]

        # 4. Count frequencies using pandas
        word_counts = pd.Series(words).value_counts()

        # 5. Return the top `n` most common words as tuples
        return list(word_counts.head(n).items())

    def get_common_phrases(self, text, n=5, phrase_length=2):
        """
        Find the n most common phrases of specified length using pandas.Series.

        Args:
            text (str): Input text for phrase analysis.
            n (int): Number of most common phrases to return.
            phrase_length (int): Length of the phrases (in words).

        Returns:
            list: List of tuples (phrase, frequency) for the top n phrases.
        """
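        # Step 1: Tokenize and preprocess the text.
        # Lowercase before the stop-word check so capitalized stop words ("The") are removed too.
        words = [word.lower() for word in word_tokenize(text)]
        words = [word for word in words if word.isalpha() and word not in self.stop_words]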

        # Step 2: Generate n-grams (phrases of specified length)
        phrases = list(ngrams(words, phrase_length))  # List of tuples

        # Step 3: Convert phrases to pandas.Series for frequency calculation
        phrase_series = pd.Series(phrases)

        # Step 4: Use value_counts to calculate frequencies and get the top n phrases
        phrase_counts = phrase_series.value_counts().head(n)

        # Step 5: Convert to a list of tuples (phrase, frequency)
        return list(phrase_counts.items())

    def generate_summary(self, text, max_length=50, min_length=30):
        """Generate a summary using the BART model"""
        # 1. Handle long texts by splitting them into chunks
        chunks = self._split_into_chunks(text)

        # 2. Summarize each chunk with the summarization pipeline
        summarized_chunks = [
            self.summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)[0]["summary_text"]
            for chunk in chunks
        ]

        # 3. Combine the chunk summaries into one string
        return " ".join(summarized_chunks)

    def _split_into_chunks(self, text, max_chunk_size=1000):
        """
        Split long text into smaller chunks for summarization.

        Args:
            text (str): The input text to split.
            max_chunk_size (int): Maximum number of characters per chunk.

        Returns:
            list: List of text chunks.
        """
        # Step 1: Split text into sentences
        sentences = sent_tokenize(text)

        # Step 2: Initialize variables
        chunks = []  # To store resulting chunks
        current_chunk = []  # Temporarily holds sentences for the current chunk
        current_length = 0  # Tracks the total length of the current chunk

        # Step 3: Group sentences into chunks
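        for sentence in sentences:
            # Start a new chunk when adding this sentence would exceed the limit.
            # The `current_chunk` check avoids emitting an empty chunk when a
            # single sentence is itself longer than max_chunk_size.
            if current_chunk and current_length + len(sentence) > max_chunk_size:
                chunks.append(" ".join(current_chunk))
                current_chunk = []
                current_length = 0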

            # Add the current sentence to the chunk
            current_chunk.append(sentence)
            current_length += len(sentence)

        # Step 4: Add the last chunk if there are leftover sentences
        if current_chunk:
            chunks.append(" ".join(current_chunk))

        return chunks

    def save_analysis(self, filepath, analysis_results):
        """Save analysis results to a CSV file."""
        if isinstance(analysis_results, pd.DataFrame):
            analysis_results.to_csv(filepath, index=False)
        elif isinstance(analysis_results, dict):
            df = pd.DataFrame([analysis_results])
            df.to_csv(filepath, index=False)
        elif isinstance(analysis_results, list):
            df = pd.DataFrame(analysis_results, columns=["Phrase", "Frequency"])
            df.to_csv(filepath, index=False)
        else:
            raise ValueError("Analysis results must be a pandas DataFrame, dictionary, or list of tuples.")
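
The class above does not define a word cloud method, although later cells call `create_wordcloud`. A minimal sketch of what such a step might look like follows. It is an assumption rather than the authors' implementation: `word_frequencies` and `show_wordcloud` are hypothetical names, and the plotting imports sit inside the function so that the counting helper can be tried without `matplotlib` or `wordcloud` installed.

```python
import re
from collections import Counter

def word_frequencies(text, stop_words=frozenset()):
    """Count lowercase alphabetic words, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)

def show_wordcloud(frequencies, title=""):
    """Render one word cloud from a word -> count mapping."""
    # Local imports: only needed when a cloud is actually drawn.
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()

freqs = word_frequencies("The market rose as the market rallied", stop_words={"the", "as"})
print(freqs.most_common(2))
# show_wordcloud(freqs, title="Example")  # uncomment in the notebook to draw the cloud
```

To produce the three word clouds the assignment asks for, `show_wordcloud` could be called once per selected article or once per category.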

## Brief Check

In [9]:
# Load the dataset (tab-separated file)
df = pd.read_csv('bbc-news-data.csv', delimiter='\t')

# Instantiate the DocumentAnalyzer
analyzer = DocumentAnalyzer()

# Sample the articles and run the analysis on them
processed_df = analyzer.process_dataframe(df, text_column="content")

# Create word clouds for the sampled articles.
# NOTE: create_wordcloud is not defined in the DocumentAnalyzer class above,
# so this call raises AttributeError until that method is added.
analyzer.create_wordcloud(text_column="content")


Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Your max_length is set to 50, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max_length is set to 50, but your input_length is only 18. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=9)
Your max_length is set to 50, but your input_length is only 25. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)
Your max_length is set to 50, but your input_length is only 41. Since this is a summarization task, where outputs shorter t

KeyboardInterrupt: 
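
The repeated warnings above appear because `max_length=50` is larger than some inputs. One way to quiet them is to derive the limits from the input length, as the warning text itself suggests. The helper below is a hypothetical sketch: `summary_lengths` and its halving rule are assumptions, and in `generate_summary` the token count would have to come from the pipeline's tokenizer, which is not shown here.

```python
def summary_lengths(input_tokens, max_cap=50, min_cap=30):
    """Pick (max_length, min_length) that stay below the input length."""
    # Aim for roughly half the input, but never above the configured cap.
    max_len = max(2, min(max_cap, input_tokens // 2))
    # Keep min_length strictly below max_length.
    min_len = min(min_cap, max_len - 1)
    return max_len, min_len

# Input lengths taken from the warnings above, plus one long input.
for n in (31, 18, 25, 200):
    print(n, summary_lengths(n))
```

For long inputs the helper falls back to the original limits of 50 and 30, so only short chunks are affected.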

## Implement the main function

In [None]:
def main():
    # 1. Initialize the analyzer
    analyzer = DocumentAnalyzer()

    # 2. Load the DataFrame
    df = pd.read_csv('bbc-news-data.csv', delimiter='\t')

    # 3. Process the DataFrame
    processed_df = analyzer.process_dataframe(df, text_column="content")

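    # 4. Generate word clouds for 5 random summaries.
    # NOTE: create_wordcloud is not defined in DocumentAnalyzer above and must be
    # added before this call can run. As a bound method it receives `analyzer`
    # implicitly, so the instance is not passed a second time.
    analyzer.create_wordcloud(processed_df, num_samples=5)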

    # 5. Save the updated DataFrame to a CSV file
    output_path = "analyzed_bbc_news.csv"
    processed_df.to_csv(output_path, index=False)
    print(f"Processed DataFrame saved as: {output_path}")

# Ensure the script runs the main() function only when executed directly
if __name__ == "__main__":
    main()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
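
The `LookupError` above means the `punkt_tab` tokenizer data was not available in the session that ran `main()`, for example because the download cell had not been executed after a runtime restart. A defensive pattern is to check for each resource and download it only when missing. The helper `ensure_resources` below is a hypothetical sketch; the lookup and download functions are passed in as parameters, and the stand-ins at the bottom exist only so the example runs without NLTK installed.

```python
def ensure_resources(resources, find, download):
    """Download each (path, package) resource only if `find` cannot locate it."""
    fetched = []
    for path, package in resources:
        try:
            find(path)
        except LookupError:
            download(package)
            fetched.append(package)
    return fetched

# In the notebook the real functions would be passed in, e.g.:
#   import nltk
#   ensure_resources(REQUIRED, nltk.data.find, nltk.download)
REQUIRED = [("tokenizers/punkt_tab", "punkt_tab"), ("corpora/stopwords", "stopwords")]

# Stand-ins that simulate a session where only punkt_tab is missing.
downloaded = []

def fake_find(path):
    if path == "tokenizers/punkt_tab":
        raise LookupError(path)

def fake_download(package):
    downloaded.append(package)

fetched = ensure_resources(REQUIRED, fake_find, fake_download)
print(fetched)
```

Calling such a helper at the start of `main()` would make the function independent of which cells were run earlier.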


## Testing Function

In [None]:
from pathlib import Path

def run_tests():
    """Run tests for the DocumentAnalyzer class and main function."""
    print("Running tests...")

    # Step 1: Run the main function
    try:
        print("Executing main function...")
        main()
        print("Main function executed successfully!")
    except Exception as e:
        print(f"Main function failed: {e}")
        return  # Exit the tests if main fails

    # Step 2: Verify output file exists using pathlib
    output_path = Path("analyzed_bbc_news.csv")
    if output_path.exists():
        print(f"Output file '{output_path}' found.")
    else:
        print(f"Output file '{output_path}' is missing!")
        return

    # Step 3: Validate output file content
    try:
        processed_df = pd.read_csv(output_path)
        required_columns = [
            "category", "content", "Number of words",
            "Number of sentences", "Average word length",
            "Average sentence length", "Common Words", "Common Phrases"
        ]
        for col in required_columns:
            assert col in processed_df.columns, f"Missing column: {col}"
        print("Output file content is valid!")
    except Exception as e:
        print(f"Error validating output file content: {e}")
        return

    print("All tests passed successfully!")
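
`run_tests` only checks the pipeline end to end, which requires the model and the dataset. Smaller checks on individual steps are cheaper to run. The sketch below tests the chunking logic on its own; `split_into_chunks` is a standalone variant of `_split_into_chunks` that takes already-split sentences (an assumption made to avoid the NLTK dependency) and guards against emitting an empty chunk.

```python
def split_into_chunks(sentences, max_chunk_size=1000):
    """Group sentences into chunks of at most max_chunk_size characters."""
    chunks, current, length = [], [], 0
    for sentence in sentences:
        # Flush the current chunk before it would exceed the limit,
        # but never flush an empty one.
        if current and length + len(sentence) > max_chunk_size:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

# No input gives no chunks.
assert split_into_chunks([]) == []
# A single over-long sentence becomes its own chunk, with no empty chunk before it.
assert split_into_chunks(["a" * 30], max_chunk_size=10) == ["a" * 30]
# Sentences are grouped until the size limit is reached.
chunks = split_into_chunks(["One two.", "Three four.", "Five."], max_chunk_size=20)
assert chunks == ["One two. Three four.", "Five."]
print("Chunking checks passed.")
```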

In [None]:
from google.colab import drive
drive.mount('/content/drive')