<a href="https://colab.research.google.com/github/junya2025/text-retrieval-and-mining/blob/main/assignment_TRM_24_25_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Analysis and Summarization System

## Deadline
**Friday January 24 by 23:59 at the latest**. Please do not submit your assignment after the deadline as late submissions will not be graded.

## Learning Objectives:

* Work with text data in Python
* Understand basic text preprocessing
* Use simple APIs for text analysis
* Collaborate on a coding project
* Create a basic command-line interface

## Project Description:
Your team of 4 will build a Python program that helps analyze and summarize documents. The program should:

**Session 1 (~3 hours):**


* Read and preprocess text files
* Calculate basic text statistics (word count, sentence count, average word length)
* Find most common words and phrases
* Generate and show 3 word clouds
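
The statistics listed for Session 1 can be sketched with the standard library alone. This is only an illustration: the function name `simple_stats` and the regex-based tokenization are assumptions made for the example, and the notebook itself uses NLTK's `sent_tokenize` and `word_tokenize`, which handle abbreviations and punctuation more carefully.

```python
import re
from collections import Counter

def simple_stats(text):
    """Return rough text statistics using regex tokenization (hypothetical helper)."""
    # Split after ., ! or ? followed by whitespace; a crude stand-in for sent_tokenize.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    # Lowercased runs of letters/apostrophes; a crude stand-in for word_tokenize.
    words = re.findall(r"[a-z']+", text.lower())
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0
    # Two-word phrases are pairs of adjacent words.
    bigrams = Counter(zip(words, words[1:]))
    return {
        "num_sentences": len(sentences),
        "num_words": len(words),
        "avg_word_length": avg_word_length,
        "top_words": Counter(words).most_common(2),
        "top_bigrams": bigrams.most_common(1),
    }

stats = simple_stats("The cat sat. The cat ran! Did the dog run?")
print(stats)
```

Stop-word removal is left out here to keep the sketch short; with it, frequent function words such as "the" would not dominate the counts.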


**Session 2 (~3 hours):**


* Use the Hugging Face Transformers library (https://huggingface.co/docs/hub/en/transformers) to generate summaries of the news articles.

* Create a simple command-line interface to run all analyses
* Save the DataFrame to a CSV file
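
One possible shape for the command-line interface is an `argparse` parser. The sketch below is an assumption about how the options could be laid out; the flag names (`--output`, `--sample-size`, `--no-summary`) and the default output filename are illustrative and not prescribed by the assignment.

```python
import argparse

def build_parser():
    """Build a hypothetical CLI for running the analyses."""
    parser = argparse.ArgumentParser(
        description="Analyze and summarize a news article dataset."
    )
    parser.add_argument("input", help="path to the tab-separated dataset file")
    parser.add_argument("--output", default="analysis.csv",
                        help="where to write the results CSV (placeholder default)")
    parser.add_argument("--sample-size", type=int, default=500,
                        help="number of articles to analyze")
    parser.add_argument("--no-summary", action="store_true",
                        help="skip the slow summarization step")
    return parser

# Passing an explicit list avoids reading sys.argv, which inside a notebook
# contains the kernel's own arguments rather than yours.
args = build_parser().parse_args(
    ["bbc-news-data.csv", "--sample-size", "50", "--no-summary"]
)
print(args.input, args.output, args.sample_size, args.no_summary)
```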

**Please note that the suggested times are only an estimate of how long each session might take, not a limit. Take the time that you need.**

## Use the following News Articles Dataset:

BBC News Dataset: https://www.kaggle.com/datasets/hgultekin/bbcnewsarchive
Contains ~2000 news articles in 5 categories. For your task, please use the column '**Content**' of this dataset. Use a sample of 500 news articles. Make sure your sample contains articles from all 5 categories.
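
A sample that covers every category can be drawn by sampling the same number of articles from each category. The sketch below shows the idea with the standard library on toy rows; the helper name `stratified_sample` and the toy data are assumptions for illustration, not the real dataset.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, total, seed=42):
    """Pick `total` rows, taking an equal share from each group defined by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    per_group = total // len(groups)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for name in sorted(groups):
        # random.sample raises ValueError if a group has fewer than per_group rows.
        sample.extend(rng.sample(groups[name], per_group))
    return sample

# Toy data: 5 categories with 20 placeholder rows each.
categories = ["business", "entertainment", "politics", "sport", "tech"]
rows = [{"category": c, "content": f"{c} article {i}"}
        for c in categories for i in range(20)]
sample = stratified_sample(rows, "category", total=50)
per_category = Counter(row["category"] for row in sample)
print(per_category)
```

With pandas 1.1 or later, `df.groupby('category').sample(n=100, random_state=42)` gives a comparable result in one line, assuming the DataFrame has a `category` column and each category holds at least 100 articles.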

## Deliverable
One self-contained, fully functional notebook. Please submit only the notebook.

In [None]:
# TODO: Import required libraries
# Hint: You'll need nltk, pandas, matplotlib, wordcloud, and transformers
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# Add more imports here...

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\junya\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\junya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\junya\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
pip install wordcloud




In [None]:
pip install transformers


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from collections import Counter
from nltk.util import ngrams

In [None]:
df = pd.read_csv('bbc-news-data.csv', delimiter='\t')
print(df.head())

   category filename                              title  \
0  business  001.txt  Ad sales boost Time Warner profit   
1  business  002.txt   Dollar gains on Greenspan speech   
2  business  003.txt  Yukos unit buyer faces loan claim   
3  business  004.txt  High fuel prices hit BA's profits   
4  business  005.txt  Pernod takeover talk lifts Domecq   

                                             content  
0   Quarterly profits at US media giant TimeWarne...  
1   The dollar has hit its highest level against ...  
2   The owners of embattled Russian oil giant Yuk...  
3   British Airways has blamed high fuel prices f...  
4   Shares in UK drinks and food firm Allied Dome...  


In [None]:
class DocumentAnalyzer:
    def __init__(self):
        """Initialize the DocumentAnalyzer with necessary resources"""
        # TODO: Initialize stop words and the summarization pipeline
        self.stop_words = set(stopwords.words('english'))
        self.summarizer = None  # Initialize the summarization pipeline, leave none for now

    def basic_stats(self, text):   #write this function first
        """Calculate basic text statistics

        Args:
            text (str): The input text.

        Returns:
            dict: A dictionary containing the analysis results.
        """
        # TODO: Calculate and return a dictionary containing:
        sentences = sent_tokenize(text)
        # Keep only tokens containing a letter or digit, so punctuation is not counted as a word
        words = [word for word in word_tokenize(text) if any(ch.isalnum() for ch in word)]
        # Lowercased alphabetic tokens with stop words removed
        filtered_words = [word.lower() for word in words
                          if word.isalpha() and word.lower() not in self.stop_words]

        # - Number of sentences (num_sentences)
        num_sentences = len(sentences)

        # - Number of words (num_words)
        num_words = len(words)

        # - Average word length (avg_word_length), computed over the filtered words
        avg_word_length = sum(len(word) for word in filtered_words) / len(filtered_words) if filtered_words else 0

        # - Average sentence length (avg_sentence_length), in words per sentence;
        #   guarded so that empty text does not raise ZeroDivisionError
        avg_sentence_length = num_words / num_sentences if num_sentences else 0

        return {
            "num_sentences": num_sentences,
            "num_words": num_words,
            "avg_word_length": avg_word_length,
            "avg_sentence_length": avg_sentence_length,
            "filtered_words": filtered_words
        }

    def process_dataframe(self, df, text_column):
        """Process text data from a pandas DataFrame column

        Args:
            df (pd.DataFrame): Input DataFrame
            text_column (str): Name of the column containing text data

        Returns:
            pd.DataFrame: DataFrame with added analysis columns
        """
        # TODO: Implement DataFrame processing
        # 4. Handle errors appropriately
        try:
            # 1. Select the 500 BBC news articles, sampled evenly per category so that
            #    every category is represented (assumes the DataFrame has a 'category' column).
            n_per_category = 500 // df['category'].nunique()
            sample_df = (df.groupby('category')
                           .sample(n=n_per_category, random_state=42)
                           .reset_index(drop=True))

            # 2. Apply text analysis functions to the text column of the sample
            stats_df = sample_df[text_column].apply(self.basic_stats).apply(pd.Series)

            # 3. Add and populate a new column for each of the following results: number of sentences (num_sentences), number of words (num_words),
            #    average word length (avg_word_length), average sentence length (avg_sentence_length), top 10 most common words (common_words),
            #    top 5 most common phrases (of length 2, i.e., two words),
            #    Note: you have to make these calculations for each of the 500 news articles. Only use the 'content' column.
            for col in ['num_sentences', 'num_words', 'avg_word_length', 'avg_sentence_length']:
                sample_df[col] = stats_df[col]

            def get_top_10_words(words):
                return [word for word, _ in Counter(words).most_common(10)]
            sample_df['top_10_words'] = stats_df['filtered_words'].apply(get_top_10_words)

            def get_top_phrases(words):
                phrases = [f"{words[i]} {words[i+1]}" for i in range(len(words) - 1)]
                return [phrase for phrase, _ in Counter(phrases).most_common(5)]
            sample_df['top_5_phrases'] = stats_df['filtered_words'].apply(get_top_phrases)

            return sample_df
        except Exception as e:
            print(f"Error processing DataFrame: {e}")
            return df

    def get_common_words(self, text, n=10):
        """Find the n most common words in the text"""
        # TODO: Implement word frequency analysis
        # Remember to:
        # 1. Tokenize the text
        words = word_tokenize(text)
        # 2. Convert to lowercase
        words = [word.lower() for word in words]
        # 3. Remove stopwords (and non-alphabetic tokens such as punctuation)
        words = [word for word in words if word.isalpha() and word not in self.stop_words]
        # 4. Count frequencies
        word_counts = Counter(words)
        return word_counts.most_common(n)

    def get_common_phrases(self, text, n=5, phrase_length=2):
        """Find the n most common phrases of specified length"""
        # TODO: Implement phrase frequency analysis using nltk.util.ngrams
        # Lowercase before the stop-word check so capitalized stop words ("The") are removed too
        words = [word.lower() for word in word_tokenize(text)]
        words = [word for word in words if word.isalpha() and word not in self.stop_words]
        phrases = list(ngrams(words, phrase_length))
        phrase_counts = Counter(phrases)
        return phrase_counts.most_common(n)

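    def create_wordcloud(self, text):
        """Generate and show a word cloud for the given text"""
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.show()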

analyzer = DocumentAnalyzer()
processed_df = analyzer.process_dataframe(df, 'content')

print(processed_df.head(15))

    category filename                              title  \
0   business  001.txt  Ad sales boost Time Warner profit   
1   business  002.txt   Dollar gains on Greenspan speech   
2   business  003.txt  Yukos unit buyer faces loan claim   
3   business  004.txt  High fuel prices hit BA's profits   
4   business  005.txt  Pernod takeover talk lifts Domecq   
5   business  006.txt   Japan narrowly escapes recession   
6   business  007.txt   Jobs growth still slow in the US   
7   business  008.txt   India calls for fair trade rules   
8   business  009.txt  Ethiopia's crop production up 24%   
9   business  010.txt  Court rejects $280bn tobacco case   
10  business  011.txt  Ask Jeeves tips online ad revival   
11  business  012.txt   Indonesians face fuel price rise   
12  business  013.txt     Peugeot deal boosts Mitsubishi   
13  business  014.txt   Telegraph newspapers axe 90 jobs   
14  business  015.txt   Air passengers win new EU rights   

                                       

In [None]:
class DocumentAnalyzer:
    def __init__(self):
        """Initialize the DocumentAnalyzer with necessary resources"""
        # TODO: Initialize stop words and the summarization pipeline
        self.stop_words = set(stopwords.words('english'))
        self.summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
        self.tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")



    def basic_stats(self, text):
        """Calculate basic text statistics"""
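        # TODO: Calculate and return a dictionary containing:
        sentences = sent_tokenize(text)
        # Keep only tokens containing a letter or digit, so punctuation is not counted as a word
        words = [word for word in word_tokenize(text) if any(ch.isalnum() for ch in word)]
        content_words = [word for word in words
                         if word.isalpha() and word.lower() not in self.stop_words]
        # - Number of sentences (num_sentences)
        num_sentences = len(sentences)
        # - Number of words (num_words)
        num_words = len(words)
        # - Average word length (avg_word_length), computed over non-stop words
        avg_word_length = sum(len(word) for word in content_words) / len(content_words) if content_words else 0
        # - Average sentence length (avg_sentence_length), in words per sentence
        avg_sentence_length = num_words / num_sentences if num_sentences else 0
        return {
            "num_sentences": num_sentences,
            "num_words": num_words,
            "avg_word_length": avg_word_length,
            "avg_sentence_length": avg_sentence_length,
        }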

    def process_dataframe(self, df, text_column):
        """Process text data from a pandas DataFrame column

        Args:
            df (pd.DataFrame): Input DataFrame
            text_column (str): Name of the column containing text data

        Returns:
            pd.DataFrame: DataFrame with added analysis columns
        """
        # TODO: Implement DataFrame processing
        # 4. Handle errors appropriately
        try:
            # 1. Select the 500 BBC news articles, sampled evenly per category so that
            #    every category is represented (assumes the DataFrame has a 'category' column).
            n_per_category = 500 // df['category'].nunique()
            sample_df = (df.groupby('category')
                           .sample(n=n_per_category, random_state=42)
                           .reset_index(drop=True))

            # 2. Apply text analysis functions to the text column.
            #    Use self rather than a new DocumentAnalyzer(), which would reload the BART model.
            stats = sample_df[text_column].apply(self.basic_stats).apply(pd.Series)

            # 3. Add and populate a new column for each of the following results: number of sentences (num_sentences), number of words (num_words),
            #    average word length (avg_word_length), average sentence length (avg_sentence_length), top 10 most common words (common_words),
            #    top 5 most common phrases (of length 2, i.e., two words),
            #    Note: you have to make these calculations for each of the 500 news articles. Only use the 'content' column.
            for col in ['num_sentences', 'num_words', 'avg_word_length', 'avg_sentence_length']:
                sample_df[col] = stats[col]
            sample_df['common_words'] = sample_df[text_column].apply(self.get_common_words)
            sample_df['common_phrases'] = sample_df[text_column].apply(self.get_common_phrases)
            return sample_df
        except Exception as e:
            print(f"Error processing DataFrame: {e}")
            return df



    def get_common_words(self, text, n=10):
        """Find the n most common words in the text"""
        # TODO: Implement word frequency analysis
        # Remember to:
        # 1. Tokenize the text
        words = word_tokenize(text)
        # 2. Convert to lowercase
        words = [word.lower() for word in words]
        # 3. Remove stopwords (and non-alphabetic tokens such as punctuation)
        words = [word for word in words if word.isalpha() and word not in self.stop_words]
        # 4. Count frequencies
        word_counts = Counter(words)
        return word_counts.most_common(n)

    def get_common_phrases(self, text, n=5, phrase_length=2):
        """Find the n most common phrases of specified length"""
        # TODO: Implement phrase frequency analysis using nltk.util.ngrams
        # Lowercase before the stop-word check so capitalized stop words ("The") are removed too
        words = [word.lower() for word in word_tokenize(text)]
        words = [word for word in words if word.isalpha() and word not in self.stop_words]
        phrases = list(ngrams(words, phrase_length))
        phrase_counts = Counter(phrases)
        return phrase_counts.most_common(n)

    def create_wordcloud(self, text):
        """Generate and save a word cloud visualization"""
        # TODO: Create and show word cloud visualizations of the most common words for 3 randomly selected news articles.
        # Use WordCloud class and matplotlib
        # This method draws one cloud for the given text; call it once per selected article.
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.show()

    def generate_summary(self, text, max_length=50, min_length=30):
        """Generate a summary using the BART model"""
        # TODO: Implement text summarization
        # Remember to:
        # 1. Handle long texts by splitting into chunks
        #    (token-based, via the helper below, instead of fixed character slices that cut words in half)
        chunks = self._split_into_chunks(text)
        # 2. Use the summarization pipeline
        summaries = []
        for chunk in chunks:
            # truncation=True guards against a chunk that still exceeds the model's input limit
            summary = self.summarizer(chunk, max_length=max_length, min_length=min_length,
                                      truncation=True)[0]['summary_text']
            summaries.append(summary)
        # 3. Combine summaries if needed
        combined_summary = " ".join(summaries)
        return combined_summary

    def _split_into_chunks(self, text, max_chunk_size=1000):
        """Helper method to split text into chunks for summarization"""
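        # TODO: Implement text splitting into chunks
        # This is needed because the summarizer has a maximum input length
        # Chunk size is measured in tokenizer tokens, not characters. Special tokens are
        # left out here because the pipeline adds its own when it re-tokenizes each chunk.
        all_tokens = self.tokenizer.encode(text, add_special_tokens=False)
        chunks = []
        for i in range(0, len(all_tokens), max_chunk_size):
            chunk_tokens = all_tokens[i:i + max_chunk_size]
            chunk_text = self.tokenizer.decode(chunk_tokens, skip_special_tokens=True)
            chunks.append(chunk_text)
        return chunks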

    def save_analysis(self, filepath, analysis_results):
        """Save analysis results to a CSV file"""
        # TODO: Implement saving dataframe to a CSV file
        analysis_results.to_csv(filepath, index=False)
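
Because the summarizer has a maximum input length, long articles need to be split before summarization. As an alternative to cutting at a fixed offset, the sketch below groups whole sentences into chunks under a word budget, so that no chunk ends mid-sentence. It is an illustration only: the function name, the `max_words` parameter, and the use of word counts as a rough proxy for model tokens are all assumptions, and a tokenizer-based count would be more accurate.

```python
import re

def split_into_sentence_chunks(text, max_words=400):
    """Group whole sentences into chunks of at most `max_words` words (hypothetical helper)."""
    # Crude sentence split: after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than max_words still becomes its own (oversized) chunk.
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = split_into_sentence_chunks(
    "One two three. Four five. Six seven eight nine.", max_words=5
)
print(chunks)
```

Each chunk can then be summarized separately and the partial summaries joined, as `generate_summary` does.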


In [None]:
def main():
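    # TODO: Implement the main function that:
    # 1. Initializes the analyzer
    # 2. Populates the DataFrame with 500 BBC news articles
    # 3. Processes the DataFrame
    # 4. Generates word clouds for 5 random summaries
    # 5. Saves the resulting updated dataframe to CSV

    # Hint: Start by loading your 500 news articles into a dataframe. Make sure to include the code to select those 500 news articles.
    filepath = 'bbc-news-data.csv'  # tab-separated dataset file loaded earlier in this notebook
    df = pd.read_csv(filepath, sep='\t')

    analyzer = DocumentAnalyzer()
    # process_dataframe is responsible for selecting the 500 articles
    processed_df = analyzer.process_dataframe(df, 'content')
    processed_df['summary'] = processed_df['content'].apply(analyzer.generate_summary)
    for summary in processed_df['summary'].sample(n=5, random_state=42):
        analyzer.create_wordcloud(summary)
    # The output filename is a placeholder; change it as needed
    analyzer.save_analysis('bbc_news_analysis.csv', processed_df)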

if __name__ == "__main__":
    main()

# Testing functions
def run_tests():
    # TODO: Implement tests for your functions
    pass
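
A minimal sketch of what `run_tests` could look like, using plain `assert` statements on a small standalone helper. The helper `average_word_length` is hypothetical and stands in for the analyzer methods, which need NLTK data and the dataset to run; the point is to check edge cases such as empty input, which would otherwise cause a division by zero.

```python
def average_word_length(words):
    """Hypothetical helper: mean length of the given words, 0.0 for an empty list."""
    return sum(len(word) for word in words) / len(words) if words else 0.0

def run_tests():
    # Empty input must not raise ZeroDivisionError.
    assert average_word_length([]) == 0.0
    # Lengths 2 and 4 average to 3.
    assert average_word_length(["ab", "abcd"]) == 3.0
    # A single word is its own average.
    assert average_word_length(["news"]) == 4.0
    return "all tests passed"

result = run_tests()
print(result)
```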