# Wikipedia Article Recommender

This notebook walks through the complete process of building and using the article recommendation system. It uses the classes defined in the `components` directory to perform the following steps:

1.  **Scrape Wikipedia Articles**: Scrape articles from Wikipedia and save them as CSV files.
2.  **Merge Articles**: Combine individual scraped article CSVs into a single corpus.
3.  **Process Corpus**: Tokenize, stem, and lemmatize the text from the merged corpus.
4.  **Process History**: Scrape and process a new list of URLs (simulating browsing history).
5.  **Build/Load Model**: Create or load a TF-IDF model from the processed corpus.
6.  **Get Recommendations**: Find and display the most relevant articles from the corpus based on the history.

---

### 0. Setup and Imports

Import all necessary classes from our components and define the file paths that will be used throughout the notebook.

In [6]:
import asyncio
import os

from components.crawler import AsyncWikiCrawler
from components.articles import ArticleMerger
from components.processor import TextProcessor
from components.similarities import SimilarityCalculator
import nest_asyncio

nest_asyncio.apply()

data_dir = 'data'
corpus_file = os.path.join(data_dir, 'wiki_articles_corpus.csv')
processed_corpus_file = os.path.join(data_dir, 'processed_articles.csv')
processed_history_file = os.path.join(data_dir, 'processed_history.csv')

history_urls_to_process = [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://en.wikipedia.org/wiki/Natural_language_processing"
]

### 1. Wiki Scraper

First, we use the custom `AsyncWikiCrawler` to scrape a list of Wikipedia articles and save them as individual CSV files.
The crawler scrapes the specified number of articles concurrently, starting from the given start URLs. This step is optional: its output is only appended later to the existing corpus of scraped Wiki articles by the `ArticleMerger` class. Each run produces a file named `wiki_articles_<timestamp>.csv` in the `data` directory.

Set the parameters for scraping and run the crawler:
- `start_urls`: List of Wikipedia article URLs to start scraping from.
- `target_count`: Total number of articles to scrape, treated as an upper limit.
- `concurrency_limit`: Number of concurrent requests to make while scraping.
- `worker_target_count`: Number of articles a worker should scrape from a given domain, i.e. from one start URL.
- `max_depth`: Maximum depth to follow links from the start URLs.

In [None]:
def scrape(start_urls):
    crawler = AsyncWikiCrawler(
        start_urls=start_urls,
        output_dir='./data',
        target_count=300, 
        max_depth=3,
        concurrency_limit=5,
        worker_target_count=10
    )
    
    try:
        asyncio.run(crawler.run())
    except KeyboardInterrupt:
        print("\nCrawl interrupted by user.")
        crawler.logger.info("Saving partial results...")
        crawler._save_data()
    
START_URLS = [
    # Computer Science
    'https://en.wikipedia.org/wiki/Python_(programming_language)',
    'https://en.wikipedia.org/wiki/Web_scraping',
    'https://en.wikipedia.org/wiki/Artificial_intelligence',

    # Natural Sciences
    'https://en.wikipedia.org/wiki/Physics',
    'https://en.wikipedia.org/wiki/Chemistry',
    'https://en.wikipedia.org/wiki/Biology',

    # Medicine
    'https://en.wikipedia.org/wiki/Medicine',
    'https://en.wikipedia.org/wiki/Neuroscience',
    'https://en.wikipedia.org/wiki/Genetics',

    # History
    'https://en.wikipedia.org/wiki/History',
    'https://en.wikipedia.org/wiki/Ancient_history',
    'https://en.wikipedia.org/wiki/Renaissance',

    # Culture
    'https://en.wikipedia.org/wiki/Art',
    'https://en.wikipedia.org/wiki/Music',
    'https://en.wikipedia.org/wiki/Literature',

    # Social Sciences
    'https://en.wikipedia.org/wiki/Psychology',
    'https://en.wikipedia.org/wiki/Sociology',
    'https://en.wikipedia.org/wiki/Economics',

    # Geography
    'https://en.wikipedia.org/wiki/Geography',
    'https://en.wikipedia.org/wiki/Asia',
    'https://en.wikipedia.org/wiki/Africa',

    # Engineering
    'https://en.wikipedia.org/wiki/Engineering',
    'https://en.wikipedia.org/wiki/Agriculture',
    'https://en.wikipedia.org/wiki/Space_exploration',

    # Sports
    'https://en.wikipedia.org/wiki/Olympic_Games',
    'https://en.wikipedia.org/wiki/National_Basketball_Association',
    'https://en.wikipedia.org/wiki/Skiing',
]


# OPTIONAL: To scrape data, uncomment the line below
# scrape(START_URLS)

2025-11-09 15:41:55,866 - components.crawler - INFO - Starting crawl: Target=300, Depth=3, Workers=5
2025-11-09 15:41:55,868 - components.crawler - INFO - Worker 0 assigned 6 start URLs
2025-11-09 15:41:55,871 - components.crawler - INFO - Worker 1 assigned 6 start URLs
2025-11-09 15:41:55,874 - components.crawler - INFO - Worker 2 assigned 5 start URLs
2025-11-09 15:41:55,876 - components.crawler - INFO - Worker 3 assigned 5 start URLs
2025-11-09 15:41:55,879 - components.crawler - INFO - Worker 4 assigned 5 start URLs
2025-11-09 15:41:55,882 - components.crawler - INFO - Worker 0 starting with 6 URLs
2025-11-09 15:41:55,884 - components.crawler - INFO - Worker 0 starting sub-crawl from: https://en.wikipedia.org/wiki/Art
2025-11-09 15:41:55,887 - components.crawler - INFO - Worker 1 starting with 6 URLs
2025-11-09 15:
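
The effect of `concurrency_limit` and `target_count` can be illustrated with a small, self-contained sketch. This is not the code of `AsyncWikiCrawler`; it assumes the limit is enforced with an `asyncio.Semaphore`, and the names `fetch`, `crawl`, and `run_demo` as well as the `example.org` URLs are hypothetical. Network access is replaced by a short sleep.

```python
import asyncio


async def fetch(url: str, semaphore: asyncio.Semaphore, active: list[int]) -> str:
    """Pretend to download `url`; a real crawler would issue an HTTP request here."""
    async with semaphore:  # at most `concurrency_limit` coroutines are inside this block
        active[0] += 1
        active[1] = max(active[1], active[0])  # remember the peak number of parallel fetches
        await asyncio.sleep(0.01)  # stand-in for network latency
        active[0] -= 1
        return f"content of {url}"


async def crawl(urls: list[str], concurrency_limit: int, target_count: int) -> tuple[list[str], int]:
    """Fetch at most `target_count` URLs, never more than `concurrency_limit` at a time."""
    semaphore = asyncio.Semaphore(concurrency_limit)
    active = [0, 0]  # [currently running, peak running]
    tasks = [fetch(url, semaphore, active) for url in urls[:target_count]]
    pages = await asyncio.gather(*tasks)
    return list(pages), active[1]


def run_demo() -> tuple[list[str], int]:
    # Hypothetical URLs; nothing is actually requested.
    urls = [f"https://example.org/page/{i}" for i in range(20)]
    return asyncio.run(crawl(urls, concurrency_limit=5, target_count=12))
```

Calling `run_demo()` returns the fetched pages together with the peak number of simultaneous fetches, which never exceeds the limit. Inside Jupyter, `asyncio.run` only works because the setup cell calls `nest_asyncio.apply()`.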

### 2. Merge Individual Articles into a Corpus

This step uses the `ArticleMerger` to combine all the `wiki_articles_*.csv` files from the `data` directory into a single `wiki_articles_corpus.csv`. After a successful merge, it cleans up by deleting the original smaller files.

In [30]:
# Initialize and run the article merger
merger = ArticleMerger(data_directory=data_dir)
if merger.merge_articles():
    print("Corpus merge successful.")
    merger.cleanup_source_files()
else:
    print("Corpus merge failed.")

Found 6819 existing articles in 'data\wiki_articles_corpus.csv'.
Found 1 source files to process.
Found 6 new unique articles to add.
Appended 6 new articles to the corpus file.
Corpus merge successful.
Cleaning up source files...
Deleted data\wiki_articles_20251109_154529.csv
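
The merge behaviour visible in the output above (existing articles counted, only new unique articles appended) can be sketched with the standard library. This is a minimal illustration, not the implementation of `ArticleMerger`: it assumes articles are identified by a `url` column, and the function names `merge_unique` and `merge_directory` are hypothetical.

```python
import csv
import glob
import os


def merge_unique(existing: list[dict], incoming: list[dict], key: str = "url") -> list[dict]:
    """Return the rows from `incoming` whose `key` value is not already in `existing`."""
    seen = {row[key] for row in existing}
    new_rows = []
    for row in incoming:
        if row[key] not in seen:
            seen.add(row[key])  # also removes duplicates within the incoming rows
            new_rows.append(row)
    return new_rows


def merge_directory(data_dir: str, corpus_name: str = "wiki_articles_corpus.csv") -> int:
    """Append unseen rows from wiki_articles_*.csv to the corpus; return how many were added."""
    corpus_path = os.path.join(data_dir, corpus_name)
    existing, fieldnames = [], None
    corpus_exists = os.path.exists(corpus_path)
    if corpus_exists:
        with open(corpus_path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            existing, fieldnames = list(reader), reader.fieldnames

    incoming = []
    for path in sorted(glob.glob(os.path.join(data_dir, "wiki_articles_*.csv"))):
        if os.path.abspath(path) == os.path.abspath(corpus_path):
            continue  # the corpus file itself also matches the glob pattern
        with open(path, newline="", encoding="utf-8") as f:
            incoming.extend(csv.DictReader(f))

    new_rows = merge_unique(existing, incoming)
    if new_rows:
        fieldnames = fieldnames or list(new_rows[0].keys())
        with open(corpus_path, "a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
            if not corpus_exists:
                writer.writeheader()  # only a freshly created corpus needs a header
            writer.writerows(new_rows)
    return len(new_rows)
```

Deleting the source files, as `cleanup_source_files()` does, is deliberately left out of the sketch so that it never removes data.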


### 3. Preprocess the Corpus

This step uses the `TextProcessor` to read `wiki_articles_corpus.csv`, perform NLP tasks (tokenization, stemming, lemmatization), and save the output to `processed_articles.csv`.

In [None]:
# OPTIONAL: skip this cell if no extra articles have been scraped.
# It takes about 15 minutes, as the whole corpus (around 7k articles) is processed in batches of 1000.
processor_corpus = TextProcessor(corpus_path=corpus_file, output_path=processed_corpus_file)
processor_corpus.process_corpus()

Downloading necessary NLTK data...
NLTK data downloaded.
Starting processing of 'data\wiki_articles_corpus.csv'...
Processing batch 1...
Processing batch 2...
Processing batch 3...
Processing batch 4...
Processing batch 5...
Processing batch 6...
Processing batch 7...
Processing complete. Results saved to 'data\processed_articles.csv'.
Total processing time: 693.91 seconds.
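
The batch-wise processing reported in the log can be sketched as follows. This is an illustration under assumptions, not the code of `TextProcessor`: the helper names `batched`, `normalize`, and `process_in_batches` and the `text` column are hypothetical, and a simple lowercase-and-split step stands in for the real tokenization, stemming, and lemmatization.

```python
import re
from itertools import islice
from typing import Iterable, Iterator


def batched(rows: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `batch_size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def normalize(text: str) -> list[str]:
    """Stand-in for the NLP step: lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def process_in_batches(rows: Iterable[dict], batch_size: int = 1000) -> list[dict]:
    """Process rows batch by batch so that only one batch is transformed at a time."""
    processed = []
    for number, batch in enumerate(batched(rows, batch_size), start=1):
        print(f"Processing batch {number}...")
        processed.extend({**row, "tokens": normalize(row["text"])} for row in batch)
    return processed
```

With roughly 6,800 articles and a batch size of 1000, such a loop runs seven times, which matches the seven batches in the output above.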


### 4. Scrape and Process Browsing History

Here, we simulate a user's browsing history by providing a list of URLs. The `TextProcessor` scrapes these pages, processes the text in the same way as the corpus, and saves the result to `processed_history.csv`.

In [32]:
# A TextProcessor instance is needed to process the history, even if corpus processing was skipped.
# corpus_path and output_path are not strictly needed here, but the constructor requires them.
processor_history = TextProcessor(corpus_path=corpus_file, output_path=processed_corpus_file)
processor_history.process_history(urls=history_urls_to_process, output_filename='processed_history.csv')

Downloading necessary NLTK data...
NLTK data downloaded.
Processing 3 URLs from history...
History processing complete. Results saved to 'data\processed_history.csv'.


### 5. Build or Load the TF-IDF Model

This is the core of the recommendation engine. The `SimilarityCalculator` is used to build the TF-IDF model from the processed corpus. This only needs to be done once. If a model already exists in the `models` directory, it will be loaded from disk instead of being rebuilt.

In [5]:
calculator = SimilarityCalculator(
    processed_corpus_path=processed_corpus_file,
    processed_history_path=processed_history_file
)

# Fit the model on the corpus. Set force_refit=True to rebuild an existing model, e.g. after the corpus has changed.
calculator.fit_corpus(force_refit=False)

TF-IDF model already exists. Loading from disk.
Successfully loaded TF-IDF model from disk.
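
The load-or-rebuild behaviour controlled by `force_refit` follows a common caching pattern, sketched below. How `SimilarityCalculator` actually stores its model is not shown in this notebook, so the use of `pickle`, the function name `load_or_build`, and the `build_fn` callback are assumptions made for illustration.

```python
import os
import pickle
from typing import Any, Callable


def load_or_build(model_path: str, build_fn: Callable[[], Any], force_refit: bool = False) -> tuple[Any, bool]:
    """Return (model, was_loaded_from_disk).

    The cached model is reused unless it is missing or `force_refit` is True.
    """
    if os.path.exists(model_path) and not force_refit:
        with open(model_path, "rb") as f:
            return pickle.load(f), True

    model = build_fn()  # the expensive step, e.g. fitting TF-IDF on the whole corpus
    os.makedirs(os.path.dirname(model_path) or ".", exist_ok=True)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model, False
```

A cache like this does not notice when the corpus changes, which is why a rebuild has to be requested explicitly after new articles are merged.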


### 6. Get and Display Recommendations

Finally, we use the fitted model to calculate the similarity between the user's history and the corpus. The function returns the top K most similar articles, which we then print out.

In [4]:
# Get the top K recommendations - 5 by default
K = 5
top_recommendations = calculator.get_recommendations(top_k=K)

# Display the results
if top_recommendations:
    print(f"\n--- Top {K} Article Recommendations ---")
    for url, title, score in top_recommendations:
        print(f"Score: {score:.4f} | Title: {title} | URL: {url}")
else:
    print("\nCould not generate recommendations.")

Processing history to generate recommendations...

--- Top 5 Article Recommendations ---
Score: 0.3906 | Title: Applications of artificial intelligence | URL: https://en.wikipedia.org/wiki/Applications_of_artificial_intelligence
Score: 0.3791 | Title: History of artificial intelligence | URL: https://en.wikipedia.org/wiki/History_of_artificial_intelligence
Score: 0.3304 | Title: Symbolic artificial intelligence | URL: https://en.wikipedia.org/wiki/Symbolic_artificial_intelligence
Score: 0.3217 | Title: Outline of machine learning | URL: https://en.wikipedia.org/wiki/Outline_of_machine_learning
Score: 0.3190 | Title: Ethics of artificial intelligence | URL: https://en.wikipedia.org/wiki/Ethics_of_artificial_intelligence
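
To make the scores above more tangible, here is a minimal pure-Python sketch of ranking documents by TF-IDF similarity. It is not the code of `SimilarityCalculator`: it assumes cosine similarity as the measure, a smoothed IDF formula, and that the whole browsing history is treated as a single query document. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def weigh(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Turn a token list into an L2-normalized TF-IDF vector (sparse dict)."""
    counts = Counter(t for t in tokens if t in idf)  # terms unknown to the corpus are ignored
    vector = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {term: w / norm for term, w in vector.items()} if norm else {}


def fit_idf(docs: list[list[str]]) -> dict[str, float]:
    """Compute a smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two normalized sparse vectors is their dot product."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def recommend(corpus: dict[str, list[str]], history_tokens: list[str], top_k: int = 5) -> list[tuple[str, float]]:
    """Return the `top_k` (title, score) pairs most similar to the history."""
    titles = list(corpus)
    idf = fit_idf([corpus[t] for t in titles])
    query = weigh(history_tokens, idf)
    scored = [(title, cosine(query, weigh(corpus[title], idf))) for title in titles]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Under these assumptions every score lies between 0 and 1, and a score of 0 means the article shares no vocabulary with the history.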
