# Wikipedia Article Recommender

This notebook walks through the complete process of building and using the article recommendation system. It uses the classes defined in the `components` directory to perform the following steps:

1.  **Scrape Wikipedia Articles**: Scrape articles from Wikipedia and save them as CSV files.
2.  **Merge Articles**: Combine individual scraped article CSVs into a single corpus.
3.  **Process Corpus**: Tokenize, stem, and lemmatize the text from the merged corpus.
4.  **Process History**: Scrape and process a new list of URLs (simulating browsing history).
5.  **Build/Load Model**: Create or load a TF-IDF model from the processed corpus.
6.  **Get Recommendations**: Find and display the most relevant articles from the corpus based on the history.
7.  **Database Statistics**: Plot some interesting insights from our data.

---

### 0. Setup and Imports

Import all necessary classes from our components and define the file paths that will be used throughout the notebook.

In [1]:
import os
import asyncio
from collections import Counter
from ast import literal_eval

import nest_asyncio
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.metrics.pairwise import cosine_similarity

from components.crawler import AsyncWikiCrawler
from components.articles import ArticleMerger
from components.processor import TextProcessor
from components.similarities import SimilarityCalculator

nest_asyncio.apply()

data_dir = 'data'
corpus_file = os.path.join(data_dir, 'wiki_articles_corpus.csv')
processed_corpus_file = os.path.join(data_dir, 'processed_articles.csv')
processed_history_file = os.path.join(data_dir, 'processed_history.csv')

history_urls_to_process = [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://en.wikipedia.org/wiki/Natural_language_processing"
]


### 1. Wiki Scraper

First we can use the custom `AsyncWikiCrawler` to scrape a list of Wikipedia articles and save them as individual CSV files.
The crawler scrapes up to a specified number of articles concurrently, starting from the given URLs. This step is optional: it only produces new article files, which the `ArticleMerger` class later appends to the existing corpus of scraped Wiki articles. The result of this step is a file named `wiki_articles_<timestamp>.csv` in the `data` directory.

Set the parameters for scraping and run the crawler:
- `start_urls`: List of Wikipedia article URLs to start scraping from.
- `target_count`: Total number of articles to scrape, treated as an upper limit.
- `concurrency_limit`: Number of concurrent requests to make while scraping.
- `worker_target_count`: Number of articles each worker should scrape, starting from its assigned start URL.
- `max_depth`: Maximum depth to follow links from the start URLs.

The `AsyncWikiCrawler` class is designed for efficient, high-speed scraping of Wikipedia articles. It operates asynchronously, allowing it to handle multiple network requests concurrently. Key features include:

- **Asynchronous Fetching**: Uses `aiohttp` to download many pages at once, significantly speeding up the scraping process compared to sequential requests.
- **Worker-Based Crawling**: It partitions the list of starting URLs among several "workers," each running in parallel to explore different sections of Wikipedia simultaneously.
- **Depth and Scope Control**: You can define the maximum depth for the crawler to follow links and set a target for how many articles each worker should aim to collect, preventing infinite or overly broad crawls.
- **Duplicate Prevention**: It calculates a hash of the article content to avoid saving the same article multiple times, ensuring the resulting dataset is clean.
- **Resilience**: Includes error handling and timeouts to manage network issues gracefully.

The crawler saves its findings into timestamped CSV files in the `data` directory.
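
The concurrency cap and the hash-based duplicate check described above can be illustrated with a minimal sketch. This is not the `AsyncWikiCrawler` implementation: `fake_fetch` is a hypothetical stand-in for an `aiohttp` request so the example runs offline, and the use of SHA-256 and the `content_hash` field name are assumptions.

```python
import asyncio
import hashlib

# Hypothetical pages; the third URL serves the same content as the first.
PAGES = {
    "https://example.org/a": "Article about physics.",
    "https://example.org/b": "Article about chemistry.",
    "https://example.org/a-mirror": "Article about physics.",
}


async def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP GET: returns canned text after a short delay."""
    await asyncio.sleep(0.01)  # simulate network latency
    return PAGES[url]


async def crawl(urls, concurrency_limit=2):
    semaphore = asyncio.Semaphore(concurrency_limit)  # caps in-flight requests
    seen_hashes = set()
    articles = []

    async def worker(url):
        async with semaphore:
            text = await fake_fetch(url)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # identical content was already stored
            return
        seen_hashes.add(digest)
        articles.append({"url": url, "content_hash": digest})

    await asyncio.gather(*(worker(u) for u in urls))
    return articles


articles = asyncio.run(crawl(list(PAGES)))
print([a["url"] for a in articles])
```

Only two of the three URLs end up in `articles`, because the mirror page hashes to the same digest as the first one.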

In [2]:
def scrape(start_urls):
    crawler = AsyncWikiCrawler(
        start_urls=start_urls,
        output_dir='./data',
        target_count=300, 
        max_depth=3,
        concurrency_limit=5,
        worker_target_count=10
    )
    
    try:
        asyncio.run(crawler.run())
    except KeyboardInterrupt:
        print("\nCrawl interrupted by user.")
        crawler.logger.info("Saving partial results...")
        crawler._save_data()
    
START_URLS = [
    # Computer Science
    'https://en.wikipedia.org/wiki/Python_(programming_language)',
    'https://en.wikipedia.org/wiki/Web_scraping',
    'https://en.wikipedia.org/wiki/Artificial_intelligence',

    # Natural Sciences
    'https://en.wikipedia.org/wiki/Physics',
    'https://en.wikipedia.org/wiki/Chemistry',
    'https://en.wikipedia.org/wiki/Biology',

    # Medicine
    'https://en.wikipedia.org/wiki/Medicine',
    'https://en.wikipedia.org/wiki/Neuroscience',
    'https://en.wikipedia.org/wiki/Genetics',

    # History
    'https://en.wikipedia.org/wiki/History',
    'https://en.wikipedia.org/wiki/Ancient_history',
    'https://en.wikipedia.org/wiki/Renaissance',

    # Culture
    'https://en.wikipedia.org/wiki/Art',
    'https://en.wikipedia.org/wiki/Music',
    'https://en.wikipedia.org/wiki/Literature',

    # Social Sciences
    'https://en.wikipedia.org/wiki/Psychology',
    'https://en.wikipedia.org/wiki/Sociology',
    'https://en.wikipedia.org/wiki/Economics',

    # Geography
    'https://en.wikipedia.org/wiki/Geography',
    'https://en.wikipedia.org/wiki/Asia',
    'https://en.wikipedia.org/wiki/Africa',

    # Engineering
    'https://en.wikipedia.org/wiki/Engineering',
    'https://en.wikipedia.org/wiki/Agriculture',
    'https://en.wikipedia.org/wiki/Space_exploration',

    # Sports
    'https://en.wikipedia.org/wiki/Olympic_Games',
    'https://en.wikipedia.org/wiki/National_Basketball_Association',
    'https://en.wikipedia.org/wiki/Skiing',
]


# OPTIONAL: To scrape data, uncomment the line below
# scrape(START_URLS)

### 2. Merge Individual Articles into a Corpus

This step uses the `ArticleMerger` to combine all the `wiki_articles_*.csv` files from the `data` directory into a single `wiki_articles_corpus.csv`. After a successful merge, it cleans up by deleting the original smaller files.

In [3]:
# Initialize and run the article merger
merger = ArticleMerger(data_directory=data_dir)
if merger.merge_articles():
    print("Corpus merge successful.")
    merger.cleanup_source_files()
else:
    print("Corpus merge failed.")

Found 6825 existing articles in 'data\wiki_articles_corpus.csv'.
No source article files found to merge.
Corpus merge successful.
Cleaning up source files...
No source files to clean up.


The `ArticleMerger` class serves as a utility to consolidate the scattered data collected by the crawler. Its primary function is to:

- **Combine CSVs**: It scans the `data` directory for individual `wiki_articles_*.csv` files generated during scraping sessions.
- **Deduplicate Corpus**: Before merging, it reads the existing `wiki_articles_corpus.csv` (if any) and records the content hashes of the articles already present. It then appends only articles whose hash is not yet in the corpus, preventing duplicates.
- **Cleanup**: After a successful merge, it provides an option to delete the original, smaller CSV files, keeping the data directory tidy.
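
A rough sketch of the merge-and-deduplicate idea, using only the standard library. The column names (`url`, `title`, `content_hash`) and the helper functions are assumptions made for illustration; the real `ArticleMerger` may organize this differently.

```python
import csv
import glob
import os
import tempfile

FIELDS = ["url", "title", "content_hash"]  # assumed column layout


def write_csv(path, rows):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def merge_into_corpus(data_dir, corpus_name="wiki_articles_corpus.csv"):
    """Append rows with unseen content hashes to the corpus; return how many were added."""
    corpus_path = os.path.join(data_dir, corpus_name)
    corpus = read_csv(corpus_path) if os.path.exists(corpus_path) else []
    seen = {row["content_hash"] for row in corpus}
    added = 0
    for path in sorted(glob.glob(os.path.join(data_dir, "wiki_articles_*.csv"))):
        if os.path.basename(path) == corpus_name:
            continue  # the corpus file itself also matches the glob pattern
        for row in read_csv(path):
            if row["content_hash"] not in seen:
                seen.add(row["content_hash"])
                corpus.append(row)
                added += 1
    write_csv(corpus_path, corpus)
    return added


with tempfile.TemporaryDirectory() as tmp:
    write_csv(os.path.join(tmp, "wiki_articles_corpus.csv"),
              [{"url": "u1", "title": "Physics", "content_hash": "h1"}])
    write_csv(os.path.join(tmp, "wiki_articles_20240101.csv"),
              [{"url": "u1", "title": "Physics", "content_hash": "h1"},
               {"url": "u2", "title": "Chemistry", "content_hash": "h2"}])
    added = merge_into_corpus(tmp)
    merged = read_csv(os.path.join(tmp, "wiki_articles_corpus.csv"))

print(added, len(merged))
```

Note that the corpus file name also matches the `wiki_articles_*.csv` pattern, so a merger built this way has to skip it explicitly.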

### 3. Preprocess the Corpus

This step uses the `TextProcessor` to read `wiki_articles_corpus.csv`, perform NLP tasks (tokenization, stemming, lemmatization), and save the output to `processed_articles.csv`.

In [4]:
# OPTIONAL if no new articles have been scraped. Takes about 15 minutes: the whole corpus (around 7k articles) is processed in batches of 1000.
# processor_corpus = TextProcessor(corpus_path=corpus_file, output_path=processed_corpus_file)
# processor_corpus.process_corpus()

The `TextProcessor` class is responsible for cleaning and transforming raw text into a structured format suitable for machine learning. It performs several key NLP preprocessing steps:

- **Tokenization**: Breaks down the raw text of each article into individual words or tokens.
- **Filtering**: Removes stop words (common words like "the," "is," "a") and non-alphabetic characters, which generally do not contribute significant meaning.
- **Stemming and Lemmatization**: 
    - **Stemming** reduces words to their root form (e.g., "running" becomes "run").
    - **Lemmatization** reduces words to their base or dictionary form (e.g., "better" becomes "good"), which is often more contextually accurate than stemming.
- **Batch Processing**: To handle a large corpus without consuming excessive memory, it processes the main CSV file in smaller chunks or batches.

The output is a new CSV file (`processed_articles.csv`) containing the original metadata along with columns for tokens, stemmed words, and lemmatized words.
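
To make the tokenization, filtering, and batching steps concrete, here is a small standard-library sketch. It is only an illustration: the stop-word set is a tiny hypothetical sample, the helper names are invented, and stemming and lemmatization are left out because they rely on NLTK resources.

```python
import re
from itertools import islice

STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}  # tiny illustrative set


def tokenize(text):
    """Lowercase the text and keep runs of alphabetic characters only."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


articles = [
    "The cat is running in the garden.",
    "A theory of relativity, 1905!",
    "Physics and chemistry.",
]

processed = []
for batch in batched(articles, batch_size=2):
    # Each batch is processed independently, so memory use is bounded by batch_size.
    processed.extend(remove_stop_words(tokenize(text)) for text in batch)

print(processed)
```

Digits and punctuation disappear at the tokenization step, and the stop words are dropped afterwards, which leaves only the content-bearing tokens.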

### 4. Scrape and Process Browsing History

Here, we simulate a user's browsing history by providing a list of URLs. The `TextProcessor` scrapes these pages, processes the text in the same way as the corpus, and saves the result to `processed_history.csv`.

In [5]:
# We need a TextProcessor instance, even if we skipped corpus processing, to process the history
# The corpus_path and output_path are not strictly needed here but are required by the constructor.
processor_history = TextProcessor(corpus_path=corpus_file, output_path=processed_corpus_file)
processor_history.process_history(urls=history_urls_to_process, output_filename='processed_history.csv')

Downloading necessary NLTK data...
NLTK data downloaded.
Processing 3 URLs from history...
History processing complete. Results saved to 'data\processed_history.csv'.


To generate recommendations, we must first understand the user's interests. This is simulated by processing a "browsing history" of a few article URLs. The `TextProcessor` is reused for this task to ensure the history text is processed in exactly the same way as the corpus.

- **Scraping History**: It fetches the content from each URL in the `history_urls_to_process` list.
- **Consistent Processing**: It applies the same tokenization, filtering, and lemmatization pipeline used on the corpus. This consistency is crucial because the TF-IDF model will expect the history text to be in the same format as the corpus data it was trained on.
- **Saving Processed History**: The processed history is saved to its own file, `processed_history.csv`, ready to be fed into the recommendation model.

### 5. Build or Load the TF-IDF Model

This is the core of the recommendation engine. The `SimilarityCalculator` is used to build the TF-IDF model from the processed corpus. This only needs to be done once. If a model already exists in the `models` directory, it will be loaded from disk instead of being rebuilt.

In [6]:
calculator = SimilarityCalculator(
    processed_corpus_path=processed_corpus_file,
    processed_history_path=processed_history_file
)

# Fit the model on the corpus. Set force_refit=True to rebuild an existing model (for example, after the corpus has changed).
calculator.fit_corpus(force_refit=False)

TF-IDF model already exists. Loading from disk.
Successfully loaded TF-IDF model from disk.


The `SimilarityCalculator` class is the heart of the recommendation system. It builds or loads a TF-IDF model to measure the textual similarity between articles.

#### TF-IDF Vectorization

**Term Frequency-Inverse Document Frequency (TF-IDF)** is a numerical statistic that reflects how important a word is to a document in a collection or corpus. The `SimilarityCalculator` uses `TfidfVectorizer` from scikit-learn to perform this transformation:

1.  **Fit on Corpus**: The vectorizer is first "fitted" on the entire processed corpus. It analyzes the text to build a vocabulary of all unique words and calculates their Inverse Document Frequency (IDF). IDF measures how rare a word is across all documents; rare words get a higher score.
2.  **Transform to Vectors**: Once fitted, the `transform` method is used to convert each article's text into a TF-IDF vector. Each element in the vector corresponds to a word in the vocabulary, and its value is the TF-IDF score for that word in that specific article.
    - **Term Frequency (TF)**: How often a word appears in a single document.
    - **TF-IDF Score**: `TF * IDF`. This score is high for words that are frequent in one document but rare across the entire corpus, making them good indicators of the document's topic.
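
The fit and transform steps can be worked through by hand on a toy corpus. The sketch below uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which mirrors the defaults of scikit-learn's `TfidfVectorizer`; whether `SimilarityCalculator` keeps those defaults is an assumption, and the function names are invented for this example.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def transform(doc, idf):
    """Return an L2-normalized {term: weight} vector; unknown terms are ignored."""
    tf = Counter(t for t in doc if t in idf)
    weights = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


corpus = [
    ["neural", "network", "learning"],
    ["learning", "algorithm", "data"],
    ["medieval", "history", "learning"],
]

idf = fit_idf(corpus)
vec = transform(corpus[0], idf)

for term in sorted(vec, key=vec.get, reverse=True):
    print(f"{term}: idf={idf[term]:.3f} tfidf={vec[term]:.3f}")
```

"learning" occurs in every document, so it receives the lowest IDF and the smallest weight in the first article's vector, while "neural" and "network" dominate it.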

The fitted vectorizer and the corpus's TF-IDF vectors are saved to the `models` directory, so this computationally expensive step only needs to be run once. On subsequent runs, the pre-computed model is loaded directly from disk.
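
The load-or-fit behaviour can be sketched as follows. The file name `tfidf.pkl`, the use of `pickle`, and the `load_or_fit` helper are assumptions for illustration; the notebook does not show how `SimilarityCalculator` actually serializes its model.

```python
import os
import pickle
import tempfile


def load_or_fit(model_path, fit_fn, force_refit=False):
    """Return (model, loaded_from_disk), fitting only when necessary."""
    if os.path.exists(model_path) and not force_refit:
        with open(model_path, "rb") as f:
            return pickle.load(f), True
    model = fit_fn()
    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model, False


def expensive_fit():
    # Placeholder for fitting a vectorizer on the full corpus.
    return {"vocabulary": {"learning": 0, "history": 1}}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "models", "tfidf.pkl")
    _, first_loaded = load_or_fit(path, expensive_fit)         # fits and saves
    model, second_loaded = load_or_fit(path, expensive_fit)    # loads from disk
    _, third_loaded = load_or_fit(path, expensive_fit, force_refit=True)  # refits

print(first_loaded, second_loaded, third_loaded)
```

The `force_refit` flag plays the same role as in `fit_corpus`: it bypasses the cached file when the corpus has changed.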

### 6. Get and Display Recommendations

Finally, we use the fitted model to calculate the similarity between the user's history and the corpus. The function returns the top K most similar articles, which we then print out.

In [7]:
# Get the top K recommendations - 5 by default
K = 5
top_recommendations = calculator.get_recommendations(top_k=K)

# Display the results
if top_recommendations:
    print(f"\n--- Top {K} Article Recommendations ---")
    for url, title, score in top_recommendations:
        print(f"Score: {score:.4f} | Title: {title} | URL: {url}")
else:
    print(f"\nCould not generate recommendations.")

Processing history to generate recommendations...

--- Top 5 Article Recommendations ---
Score: 0.3906 | Title: Applications of artificial intelligence | URL: https://en.wikipedia.org/wiki/Applications_of_artificial_intelligence
Score: 0.3791 | Title: History of artificial intelligence | URL: https://en.wikipedia.org/wiki/History_of_artificial_intelligence
Score: 0.3304 | Title: Symbolic artificial intelligence | URL: https://en.wikipedia.org/wiki/Symbolic_artificial_intelligence
Score: 0.3217 | Title: Outline of machine learning | URL: https://en.wikipedia.org/wiki/Outline_of_machine_learning
Score: 0.3190 | Title: Ethics of artificial intelligence | URL: https://en.wikipedia.org/wiki/Ethics_of_artificial_intelligence


With the TF-IDF model fitted and loaded, the `SimilarityCalculator` can now generate recommendations.

#### Similarity Calculation and Scoring

1.  **Transform History**: The processed text from the user's browsing history is transformed into TF-IDF vectors using the *same* vectorizer that was fitted on the corpus. This ensures that the vectors for the history and the corpus are in the same vector space.

2.  **Cosine Similarity**: To find relevant articles, we calculate the **cosine similarity** between the history vectors and all the article vectors in the corpus. Cosine similarity is the cosine of the angle between two vectors. Because TF-IDF weights are non-negative, the score here ranges from 0 (no shared terms) to 1 (vectors pointing in the same direction). It is effective for text because it compares the orientation of the vectors, not their magnitude, making it robust to differences in document length.

3.  **Aggregate Scoring**: Since the history may contain multiple articles, a similarity matrix is generated (rows are history articles, columns are corpus articles). The scores for each corpus article are then averaged across all history articles. This creates a single, final similarity score for every article in the corpus, representing its overall relevance to the user's entire browsing history.

4.  **Ranking and Filtering**: The articles are ranked by their final similarity score in descending order. Articles that were already in the user's history are filtered out, and the top K remaining articles are returned as the final recommendations.
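
The four steps above can be sketched end to end with sparse dictionary vectors. This is an illustrative reimplementation under simplifying assumptions (hypothetical URLs, hand-written weights, plain Python instead of scikit-learn), not the code inside `get_recommendations`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(history, corpus, top_k=2):
    """history and corpus map url -> sparse vector; returns [(url, mean_score), ...]."""
    if not history:
        return []
    scores = {}
    for url, vec in corpus.items():
        if url in history:
            continue  # never recommend something the user has already read
        sims = [cosine(h, vec) for h in history.values()]
        scores[url] = sum(sims) / len(sims)  # average over all history articles
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


history = {
    "wiki/AI": {"intelligence": 1.0, "learning": 1.0},
    "wiki/ML": {"learning": 1.0, "data": 1.0},
}
corpus = {
    "wiki/AI": {"intelligence": 1.0, "learning": 1.0},
    "wiki/Deep_learning": {"learning": 1.0, "network": 1.0},
    "wiki/Renaissance": {"art": 1.0, "history": 1.0},
    "wiki/Data_mining": {"data": 1.0, "pattern": 1.0},
}

top = recommend(history, corpus, top_k=2)
for url, score in top:
    print(f"{score:.4f} {url}")
```

Averaging rewards articles that are moderately close to every history item: the deep learning entry overlaps with both history articles and therefore outranks the data mining entry, which overlaps with only one.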

### 7. Database Statistics

In this section, we delve into the dataset to uncover interesting patterns and statistics. We will visualize:
- The most frequent words across the entire corpus.
- A detailed breakdown of a recommendation, showing the shared words that contribute to the similarity score.
- Examples of the most similar and dissimilar articles to provide a qualitative sense of the model's performance.
- The overall distribution of article lengths to understand the nature of our text data.

In [8]:
processed_df = pd.read_csv(processed_corpus_file)
corpus_df = pd.read_csv(corpus_file)

vectorizer = calculator.vectorizer
corpus_vectors = calculator.corpus_vectors
corpus_data = calculator.corpus_data

def safe_eval(cell):
    try:
        # Parse the stringified list stored in the CSV back into a Python list
        return literal_eval(cell)
    except (ValueError, SyntaxError):
        return []

processed_df['lemmatized_list'] = processed_df['lemmatized'].apply(safe_eval)


### Corpus Size

In [9]:
num_unique_urls = corpus_data['url'].nunique()
num_unique_titles = corpus_data['title'].nunique()
vocabulary_size = len(vectorizer.vocabulary_)
num_history_articles = len(history_urls_to_process)

print(f"Distinct articles in corpus (by URL): {num_unique_urls}")
print(f"Distinct articles in corpus (by Title): {num_unique_titles}")
print(f"Vocabulary size (unique words): {vocabulary_size}")
print(f"Articles in browsing history: {num_history_articles}")

Distinct articles in corpus (by URL): 6760
Distinct articles in corpus (by Title): 6752
Vocabulary size (unique words): 185548
Articles in browsing history: 3


### Most Frequent Words

In [10]:
all_words = [word for sublist in processed_df['lemmatized_list'] for word in sublist]
word_counts = Counter(all_words)
most_common_words = word_counts.most_common(20)
freq_df = pd.DataFrame(most_common_words, columns=['Word', 'Frequency'])

fig = px.bar(freq_df, x='Frequency', y='Word', orientation='h',
             title='Top 20 Most Frequent Words in the Corpus',
             labels={'Word': 'Word', 'Frequency': 'Frequency Count'})
fig.update_layout(yaxis={'categoryorder':'total ascending'})
fig.show()

## Recommendation Breakdown

To understand *why* an article is recommended, we can break down the similarity score. The chart below visualizes the contribution of shared important words between a history article and a recommended article. It shows the top 10 words that they have in common, weighted by their TF-IDF scores in each document. Words with high TF-IDF scores in both are strong indicators of shared topics.
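
One way to see why the product of the two TF-IDF scores is a meaningful ranking key: when both vectors are L2-normalized (the scikit-learn default, assumed here), the cosine similarity is exactly the sum of those per-word products, so each product is that word's share of the score. A small sketch with made-up weights:

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}


# Hypothetical raw TF-IDF weights for a history article and a recommended article.
history_vec = l2_normalize({"learning": 3.0, "intelligence": 2.0, "agent": 1.0})
reco_vec = l2_normalize({"learning": 2.0, "intelligence": 4.0, "robot": 1.0})

# Only shared words contribute; each contribution is the product of the two weights.
shared = history_vec.keys() & reco_vec.keys()
contributions = {t: history_vec[t] * reco_vec[t] for t in shared}

cosine = sum(contributions.values())
top_words = sorted(contributions, key=contributions.get, reverse=True)

for word in top_words:
    print(f"{word}: {contributions[word]:.4f} ({contributions[word] / cosine:.0%} of the score)")
print(f"cosine similarity: {cosine:.4f}")
```

Words that appear in only one of the two articles ("agent", "robot") contribute nothing, which is why the chart is restricted to shared words.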

In [11]:
def visualize_recommendation_breakdown(history_url, recommended_url, vectorizer, corpus_data, corpus_vectors, history_data, history_vectors):
    """
    Visualizes the shared high-TF-IDF words between two articles.
    """
    try:
        history_idx = history_data[history_data['url'] == history_url].index[0]
        reco_idx = corpus_data[corpus_data['url'] == recommended_url].index[0]

        history_title = history_data.loc[history_idx, 'title']
        reco_title = corpus_data.loc[reco_idx, 'title']

        # Get the TF-IDF vectors
        history_vec = history_vectors[history_idx]
        reco_vec = corpus_vectors[reco_idx]

        # Get the feature names (words) from the vectorizer
        feature_names = np.array(vectorizer.get_feature_names_out())

        # Find common words with non-zero TF-IDF scores
        common_indices = (history_vec.toarray().flatten() > 0) & (reco_vec.toarray().flatten() > 0)
        common_words = feature_names[common_indices]
        
        if len(common_words) == 0:
            print(f"No common words found between '{history_title}' and '{reco_title}'.")
            return

        history_scores = history_vec.toarray().flatten()[common_indices]
        reco_scores = reco_vec.toarray().flatten()[common_indices]
        
        breakdown_df = pd.DataFrame({
            'Word': common_words,
            'History_TFIDF': history_scores,
            'Recommendation_TFIDF': reco_scores
        })
        
        breakdown_df['Combined_Score'] = breakdown_df['History_TFIDF'] * breakdown_df['Recommendation_TFIDF']
        top_words_df = breakdown_df.sort_values('Combined_Score', ascending=False).head(10)

        fig = go.Figure(data=[
            go.Bar(name=f'History: "{history_title}"', x=top_words_df['Word'], y=top_words_df['History_TFIDF']),
            go.Bar(name=f'Recommendation: "{reco_title}"', x=top_words_df['Word'], y=top_words_df['Recommendation_TFIDF'])
        ])
        fig.update_layout(
            barmode='group',
            title=f'TF-IDF Score Breakdown for Top Shared Words',
            xaxis_title='Word',
            yaxis_title='TF-IDF Score'
        )
        fig.show()

    except IndexError:
        print("Could not find one of the URLs in the corpus or history data. The article might not have been scraped or was filtered out.")
    except Exception as e:
        print(f"An error occurred: {e}")


if top_recommendations:
    history_data_df = calculator._load_and_preprocess_data(processed_history_file)
    history_vectors_for_viz = calculator.vectorizer.transform(history_data_df['processed_text'])

    history_article_url = history_urls_to_process[0]
    recommended_article_url = top_recommendations[0][0]
    
    visualize_recommendation_breakdown(
        history_article_url, 
        recommended_article_url, 
        vectorizer, 
        corpus_data, 
        corpus_vectors,
        history_data_df,
        history_vectors_for_viz
    )
else:
    print("No recommendations were generated, so breakdown cannot be shown.")

## Document Similarity Examples

To get a feel for how the similarity model works, let's look at pairs of articles that are either very similar or very dissimilar.

### Most Similar Documents

Below is a pair of distinct articles from the corpus with one of the highest cosine similarity scores. The cell searches a random subset of the corpus for the highest-scoring pair, skipping near-exact duplicates (scores of 0.9999 or more), identical URLs, and pairs where one title contains the other. It then prints the titles and the similarity score and shows the top shared words that contribute to it. As expected, the two articles cover closely related topics.

In [12]:
# To avoid a huge matrix, we'll compare a random subset of 1000 articles against the whole corpus
subset_size = 1000
if subset_size > corpus_vectors.shape[0]:
    subset_size = corpus_vectors.shape[0]

random_indices = np.random.choice(corpus_vectors.shape[0], subset_size, replace=False)
subset_vectors = corpus_vectors[random_indices]

similarity_matrix = cosine_similarity(subset_vectors, corpus_vectors)
sorted_indices = np.argsort(similarity_matrix, axis=None)[::-1]
found_pair = False
for flat_idx in sorted_indices:
    idx_2d = np.unravel_index(flat_idx, similarity_matrix.shape)
    sim_score = similarity_matrix[idx_2d]
    
    if sim_score >= 0.9999:
        continue
        
    doc1_corpus_idx = random_indices[idx_2d[0]]
    doc2_corpus_idx = idx_2d[1]
    
    doc1_title = corpus_data.loc[doc1_corpus_idx, 'title']
    doc2_title = corpus_data.loc[doc2_corpus_idx, 'title']
    url1 = corpus_data.loc[doc1_corpus_idx, 'url']
    url2 = corpus_data.loc[doc2_corpus_idx, 'url']
    
    if url1 == url2:
        continue
    if doc1_title in doc2_title or doc2_title in doc1_title:
        continue

    max_sim = sim_score
    found_pair = True
    break

if found_pair:
    print(f"--- Most Similar Documents (Score < 1.0) ---")
    print(f"Score: {max_sim:.4f}")
    print(f"Article 1: '{doc1_title}' ({url1})")
    print(f"Article 2: '{doc2_title}' ({url2})")

    print("\n--- Shared Words Breakdown ---")
    visualize_recommendation_breakdown(url1, url2, vectorizer, corpus_data, corpus_vectors, corpus_data, corpus_vectors)
else:
    print("Could not find a pair of different documents meeting all criteria.")


--- Most Similar Documents (Score < 1.0) ---
Score: 0.8954
Article 1: 'Glossary of structural engineering' (https://en.wikipedia.org/wiki/Glossary_of_structural_engineering)
Article 2: 'Glossary of mechanical engineering' (https://en.wikipedia.org/wiki/Glossary_of_mechanical_engineering)

--- Shared Words Breakdown ---


### Most Dissimilar Documents

Conversely, here are two articles with a cosine similarity score of 0.0. With non-negative TF-IDF weights, a zero score means the two vectors are orthogonal, i.e. the articles share no vocabulary terms. This usually indicates unrelated topics, although it can also occur when one article's processed text is very short or empty. The breakdown visualization should report no common words, confirming the result.

In [13]:
zero_sim_indices = np.where(similarity_matrix == 0)

found_dissimilar_pair = False
if len(zero_sim_indices[0]) > 0:
    for i in range(len(zero_sim_indices[0])):
        dissim_doc1_idx_in_subset = zero_sim_indices[0][i]
        dissim_doc1_corpus_idx = random_indices[dissim_doc1_idx_in_subset]
        dissim_doc2_corpus_idx = zero_sim_indices[1][i]

        url1 = corpus_data.loc[dissim_doc1_corpus_idx, 'url']
        url2 = corpus_data.loc[dissim_doc2_corpus_idx, 'url']

        if url1 != url2:
            dissim_doc1_title = corpus_data.loc[dissim_doc1_corpus_idx, 'title']
            dissim_doc2_title = corpus_data.loc[dissim_doc2_corpus_idx, 'title']
            dissim_doc1_url = url1
            dissim_doc2_url = url2
            found_dissimilar_pair = True
            break

if found_dissimilar_pair:
    print(f"--- Most Dissimilar Documents ---")
    print(f"Score: 0.0")
    print(f"Article 1: '{dissim_doc1_title}' ({dissim_doc1_url})")
    print(f"Article 2: '{dissim_doc2_title}' ({dissim_doc2_url})")

    print("\n--- Shared Words Breakdown ---")
    visualize_recommendation_breakdown(dissim_doc1_url, dissim_doc2_url, vectorizer, corpus_data, corpus_vectors, corpus_data, corpus_vectors)
else:
    print("Could not find a pair of different documents with 0.0 similarity in the random subset.")


--- Most Dissimilar Documents ---
Score: 0.0
Article 1: 'Python Software Foundation' (https://en.wikipedia.org/wiki/Python_Software_Foundation)
Article 2: 'Web indexing' (https://en.wikipedia.org/wiki/Web_indexing)

--- Shared Words Breakdown ---
No common words found between 'Python Software Foundation' and 'Web indexing'.


### Most Dissimilar Documents with Non-Zero Similarity

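Here we look for the pair with the smallest non-zero similarity: the two documents share at least one word but are still largely unrelated. Because the score is printed with four decimal places, it can display as 0.0000 even though it is greater than zero.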

In [14]:
temp_sim_matrix = np.copy(similarity_matrix)
temp_sim_matrix[temp_sim_matrix == 0] = np.inf

min_nonzero_sim = np.min(temp_sim_matrix)

found_pair = False
if not np.isinf(min_nonzero_sim):
    min_sim_indices = np.where(similarity_matrix == min_nonzero_sim)
    
    if len(min_sim_indices[0]) > 0:
        for i in range(len(min_sim_indices[0])):
            doc1_idx_in_subset = min_sim_indices[0][i]
            doc1_corpus_idx = random_indices[doc1_idx_in_subset]
            doc2_corpus_idx = min_sim_indices[1][i]

            url1 = corpus_data.loc[doc1_corpus_idx, 'url']
            url2 = corpus_data.loc[doc2_corpus_idx, 'url']

            if url1 != url2:
                doc1_title = corpus_data.loc[doc1_corpus_idx, 'title']
                doc2_title = corpus_data.loc[doc2_corpus_idx, 'title']
                found_pair = True
                break

if found_pair:
    print(f"--- Most Dissimilar Documents (Score > 0) ---")
    print(f"Score: {min_nonzero_sim:.4f}")
    print(f"Article 1: '{doc1_title}' ({url1})")
    print(f"Article 2: '{doc2_title}' ({url2})")

    print("\n--- Shared Words Breakdown ---")
    visualize_recommendation_breakdown(url1, url2, vectorizer, corpus_data, corpus_vectors, corpus_data, corpus_vectors)
else:
    print("Could not find a pair of different documents with a non-zero similarity score.")


--- Most Dissimilar Documents (Score > 0) ---
Score: 0.0000
Article 1: 'Autocrine signaling' (https://en.wikipedia.org/wiki/Autocrine_signaling)
Article 2: '1978–79 UEFA Cup' (https://en.wikipedia.org/wiki/1978%E2%80%9379_UEFA_Cup)

--- Shared Words Breakdown ---


### Article Length Distribution

The histogram below shows the distribution of article lengths based on word count. Most articles are relatively short, but there is a long tail of very comprehensive articles. This is typical for a diverse collection like Wikipedia.

In [15]:
fig = px.histogram(corpus_df, x='word_count',
                   title='Distribution of Article Word Counts',
                   labels={'word_count': 'Word Count'},
                   nbins=100)
fig.show()