<a href="https://colab.research.google.com/github/micah-shull/LLMs/blob/main/LLM_052_RAG_CahsFlow4Cast_BlogPost_Chunking_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## ✅ RAG for CashFlow4Cast

> **Turn your Blogger site into a RAG-powered support assistant**  
This means taking your blog content and:
- Converting it into a **searchable document base**
- Letting users ask questions in natural language
- Using **retrieval + generation** to answer based on your posts

---

## 🔧 Step-by-Step Plan

### 1. **Export or Copy Your Blog Posts**
You can:
- **Manually copy/paste** posts into `.txt` or `.md` files (good for 5–20 posts)
- Use a tool like [Blogger2Markdown](https://github.com/mosra/blogger2markdown) to bulk export

💡 **Format tip:** Each post should be saved as its own file or labeled section (with title and body).

---

### 2. **Preprocess Into Paragraph Chunks**
Once in `.txt` or `.md` form:
- Split content into smaller paragraphs (~2–5 sentences each)
- Save these as the **documents** to embed and retrieve from
- Keep metadata (e.g., post title, URL) so the chatbot can cite it later

---

### 3. **Embed & Index Your Blog with FAISS**
Just like you did with HotpotQA:
- Use `sentence-transformers` to embed each paragraph
- Store those in a FAISS index
- Optionally use the post title or URL as metadata
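
The embed-and-index step can be pictured with a dependency-free sketch. This is not the `sentence-transformers` + FAISS code itself: `toy_embed` below is a hypothetical word-count stand-in for a real embedding model, and the brute-force loop stands in for a FAISS index search. Only the shape of the workflow carries over: embed every chunk once, keep the metadata next to each vector, and rank by similarity at query time. The sample chunks are made up for illustration.

```python
import math
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(chunks):
    """Map every word seen in the chunks to a vector position."""
    words = sorted({w for c in chunks for w in tokenize(c["text"])})
    return {w: i for i, w in enumerate(words)}


def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: normalised word counts."""
    vec = [0.0] * len(vocab)
    for word in tokenize(text):
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # L2-normalise so that a dot product equals cosine similarity
    return [v / norm for v in vec] if norm else vec


def build_index(chunks, vocab):
    """Embed each chunk once and keep its metadata alongside the vector."""
    return [(toy_embed(c["text"], vocab), c) for c in chunks]


def search(index, question, vocab, k=2):
    """Brute-force top-k search by cosine similarity."""
    q = toy_embed(question, vocab)
    scored = [(sum(a * b for a, b in zip(q, vec)), meta) for vec, meta in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up sample chunks, each carrying its post title as metadata
chunks = [
    {"title": "Pricing", "text": "Pricing plans for monthly cash flow forecasts."},
    {"title": "How It Works", "text": "The model learns from historical sales data."},
]
vocab = build_vocab(chunks)
index = build_index(chunks, vocab)
for score, meta in search(index, "What are the pricing plans?", vocab, k=1):
    print(f"{score:.2f}  {meta['title']}")
```

In the real pipeline, the embedding model would replace `toy_embed` and the FAISS index would replace the list scan, while the metadata dictionaries would stay the same.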

---

### 4. **Build the RAG Pipeline**
Re-use your `rag_qa()` function to:
- Embed the user’s question
- Retrieve the top blog paragraphs
- Generate a response using a model like `FLAN-T5`
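
The generation step usually comes down to packing the retrieved paragraphs into a prompt. The sketch below shows one possible way to do that. `build_prompt`, the prompt wording, the `max_chars` budget, and the sample chunks are all illustrative assumptions; the actual `rag_qa()` implementation is not shown in this notebook.

```python
def build_prompt(question, retrieved_chunks, max_chars=1500):
    """Assemble a grounded prompt from retrieved chunks (hypothetical helper)."""
    context_lines = []
    used = 0
    for chunk in retrieved_chunks:
        # Prefix each paragraph with its post title so the answer can cite it
        line = f"[{chunk['title']}] {chunk['text']}"
        # Stop before the context grows past the character budget
        if used + len(line) > max_chars:
            break
        context_lines.append(line)
        used += len(line)
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up retrieval results, in ranked order
retrieved = [
    {"title": "Pricing", "text": "Plans are billed monthly."},
    {"title": "How It Works", "text": "Forecasts are built from historical sales."},
]
prompt = build_prompt("How is the service billed?", retrieved)
print(prompt)
```

The character budget is a crude guard for models with short input windows; a token-based limit would be more precise.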

---

### 5. **Optional: Deploy to Blogger**
You can’t run a full model directly on Blogger, but you can:
- Host the backend on Hugging Face Spaces or Replit (Google Colab also works for demos, but its sessions are temporary)
- Embed a chatbot UI (like [Gradio](https://www.gradio.app/)) into your Blogger post via `<iframe>`

So your flow could look like:

> ✅ User visits your blog  
> 💬 Types a question into the embedded chatbot  
> 🧠 Backend retrieves answers from your own blog posts using RAG  
> 🤖 Model answers clearly and contextually

---

## 🔥 Real World Value

This gives you:
- A **custom knowledge base** that stays current as long as you re-export and re-index new posts
- A chatbot grounded in your actual content
- A modern **AI-powered customer support system**
- An edge over static FAQs, since users can ask questions in their own words




## 🧾 Notebook Purpose: Blog Post Chunking & Preprocessing

This notebook is dedicated to preparing blog content for use in a Retrieval-Augmented Generation (RAG) pipeline. Specifically, it focuses on:

### ✅ Objectives

- **Importing and parsing** a Blogger XML export
- **Manually curating** a list of relevant blog posts
- **Cleaning** HTML tags, Blogger comments, and formatting artifacts
- **Splitting** each post into paragraph-sized chunks (up to 4 sentences)
- **Adding metadata** (title, filename, chunk ID) for each chunk
- **Exporting** the clean, structured content to a `.csv` file for use in downstream machine learning workflows

### 📦 Output

- A CSV file: `cleaned_blog_chunks.csv`
- Located in: `My Drive/CF4C/BLOG/`
- Each row contains:
  - Blog post title
  - Source filename
  - Chunk ID
  - Cleaned text chunk

---

### 🚀 Next Step

Use this cleaned and chunked dataset in a separate notebook to:
- Generate **embeddings**
- Build a **FAISS index**
- Create a **blog-aware chatbot** using RAG techniques



In [1]:
from google.colab import files
uploaded = files.upload()

Saving blog-04-10-2025.xml to blog-04-10-2025 (1).xml


In [2]:
import xml.etree.ElementTree as ET
import os

# Load your uploaded file
xml_file = list(uploaded.keys())[0]

# Parse the XML
tree = ET.parse(xml_file)
root = tree.getroot()

# Blogger uses this namespace
ns = {'atom': 'http://www.w3.org/2005/Atom'}

# Create output folder
os.makedirs("blog_posts", exist_ok=True)

# Loop through entries
post_count = 0
for entry in root.findall('atom:entry', ns):
    # Only keep entries with actual content
    content_elem = entry.find('atom:content', ns)
    title_elem = entry.find('atom:title', ns)

    if content_elem is not None and title_elem is not None:
        title = title_elem.text or "Untitled"
        content = content_elem.text or ""

        # Skip empty content
        if content.strip() == "":
            continue

        # Save as .txt
        safe_title = title.replace(" ", "_").replace("/", "-").lower()
        filepath = os.path.join("blog_posts", f"{safe_title[:40]}.txt")
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(f"# {title}\n\n{content}")
        post_count += 1

print(f"✅ Exported {post_count} posts to the 'blog_posts/' folder.")


✅ Exported 69 posts to the 'blog_posts/' folder.
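
The 69 exported files include Blogger settings and template entries as well as real posts. Blogger's Atom export typically tags each entry with a `category` element whose `term` ends in `kind#post`, `kind#page`, `kind#settings`, `kind#template`, or `kind#comment`. Assuming this export follows that convention (worth confirming by opening the XML), a filter like the sketch below could drop non-content entries before anything is written to disk. The sample feed and the `entry_kind` and `content_entries` helpers are illustrative, not part of the original cell.

```python
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom"}
KIND_SCHEME = "http://schemas.google.com/g/2005#kind"

# Made-up two-entry feed that mimics the assumed export layout
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <category scheme="http://schemas.google.com/g/2005#kind"
              term="http://schemas.google.com/blogger/2008/kind#settings"/>
    <title>The name of the blog</title>
    <content>Example Blog</content>
  </entry>
  <entry>
    <category scheme="http://schemas.google.com/g/2005#kind"
              term="http://schemas.google.com/blogger/2008/kind#post"/>
    <title>Pricing</title>
    <content>&lt;p&gt;Example body&lt;/p&gt;</content>
  </entry>
</feed>"""


def entry_kind(entry):
    """Return 'post', 'page', 'settings', ... or None if no kind is tagged."""
    for cat in entry.findall("atom:category", NS):
        if cat.get("scheme") == KIND_SCHEME:
            return cat.get("term", "").rsplit("#", 1)[-1]
    return None


def content_entries(root, keep=("post", "page")):
    """Keep only entries whose kind is in `keep`."""
    return [e for e in root.findall("atom:entry", NS) if entry_kind(e) in keep]


root = ET.fromstring(SAMPLE_FEED)
titles = [e.find("atom:title", NS).text for e in content_entries(root)]
print(titles)
```

A filter like this would not replace manual curation entirely, but it could shorten the list that has to be reviewed by hand.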


### Check for Duplicates

In [3]:
import os
from collections import Counter

blog_dir = "/content/blog_posts"
# Named txt_files so it does not shadow the `files` module imported from google.colab
txt_files = [f for f in os.listdir(blog_dir) if f.endswith(".txt")]

# Normalize filenames to detect duplicates
normalized_names = [os.path.splitext(f.lower())[0] for f in txt_files]
name_counts = Counter(normalized_names)
duplicates = {name: count for name, count in name_counts.items() if count > 1}

print(f"Found {len(duplicates)} possible duplicates:")
for name, count in duplicates.items():
    print(f"{name} ({count})")


Found 0 possible duplicates:
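
Note that this check runs after the files were written, so two titles that share the same first 40 characters would already have overwritten each other and would not show up here. One way to guard against that is to make filenames unique at write time. The `unique_filename` helper below is a hypothetical sketch that reuses the export cell's replacement rules; it is not part of the original notebook.

```python
def unique_filename(title, seen, max_len=40):
    """Build a safe filename and append a counter if the name was already used."""
    base = title.replace(" ", "_").replace("/", "-").lower()[:max_len] or "untitled"
    count = seen.get(base, 0)
    seen[base] = count + 1
    # First use keeps the plain name; later uses get _2, _3, ...
    return f"{base}.txt" if count == 0 else f"{base}_{count + 1}.txt"


seen = {}
names = [unique_filename(t, seen) for t in ["Pricing", "pricing", "How It Works"]]
print(names)
```

Passing the same `seen` dictionary through the whole export loop is what makes the counter work across entries.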


### List of Posts & Items

In [4]:
import os

# List the exported files in the blog_posts folder
os.listdir("/content/blog_posts")

['📋_store_17_–_forecasting_accuracy_summar.txt',
 '📋_store_40_–_forecasting_accuracy_summar.txt',
 '🤖_how_it_works.txt',
 'default_comment_mode_for_posts.txt',
 '📋_store_36_–_forecasting_accuracy_summar.txt',
 '📋_store_13_–_forecasting_accuracy_summar.txt',
 '📋_store_37_–_forecasting_accuracy_summar.txt',
 'gainesville_economic_indicators_that_mat.txt',
 "the_list_of_administrators'_emails_for_t.txt",
 '📋_store_10_–_forecasting_accuracy_summar.txt',
 'the_type_of_publishing_done_for_this_blo.txt',
 'whether_quick_editing_is_enabled.txt',
 'whether_to_show_comments.txt',
 '📋_store_31_–_forecasting_accuracy_summar.txt',
 'blog_comment_form_location.txt',
 'the_number_of_the_time_stamp_format.txt',
 'the_number_of_the_archive_index_date_for.txt',
 '📋_store_23_–_forecasting_accuracy_summar.txt',
 'unit_of_things_to_show_on_the_main_page.txt',
 '📌_forecasting_you_can_trust.txt',
 'template:_cashflow_4cast.txt',
 'maximum_number_of_things_to_show_on_the_.txt',
 'a_description_of_the_blog.txt',
 ...]

### Filter List for Posts Only

In [5]:
blog_posts_list = [
 '📋_store_17_–_forecasting_accuracy_summar.txt',
 '📋_store_40_–_forecasting_accuracy_summar.txt',
 '🤖_how_it_works.txt',
#  'default_comment_mode_for_posts.txt',
 '📋_store_36_–_forecasting_accuracy_summar.txt',
 '📋_store_13_–_forecasting_accuracy_summar.txt',
 '📋_store_37_–_forecasting_accuracy_summar.txt',
 'gainesville_economic_indicators_that_mat.txt',
#  "the_list_of_administrators'_emails_for_t.txt",
 '📋_store_10_–_forecasting_accuracy_summar.txt',
#  'the_type_of_publishing_done_for_this_blo.txt',
#  'whether_quick_editing_is_enabled.txt',
#  'whether_to_show_comments.txt',
 '📋_store_31_–_forecasting_accuracy_summar.txt',
#  'blog_comment_form_location.txt',
#  'the_number_of_the_time_stamp_format.txt',
#  'the_number_of_the_archive_index_date_for.txt',
 '📋_store_23_–_forecasting_accuracy_summar.txt',
#  'unit_of_things_to_show_on_the_main_page.txt',
 '📌_forecasting_you_can_trust.txt',
#  'template:_cashflow_4cast.txt',
#  'maximum_number_of_things_to_show_on_the_.txt',
#  'a_description_of_the_blog.txt',
#  'whether_to_show_a_related_link_box_in_th.txt',
#  'whether_to_show_images_in_the_lightbox_w.txt',
#  'how_frequently_this_blog_should_be_archi.txt',
#  'whether_to_show_profile_images_in_commen.txt',
#  'whether_this_blog_serves_custom_robots.t.txt',
 '📋_store_15_–_forecasting_accuracy_summar.txt',
 '🚀_looking_ahead:_the_power_of_economic_i.txt',
 '📋_store_4_–_forecasting_accuracy_summary.txt',
#  'the_type_of_feed_to_provide_for_per-post.txt',
 '📋_store_19_–_forecasting_accuracy_summar.txt',
 '📋_store_28_–_forecasting_accuracy_summar.txt',
#  'the_number_of_the_date_header_format.txt',
 '📋_store_34_–_forecasting_accuracy_summar.txt',
 'pricing.txt',
#  'whether_this_blog_serves_custom_ads.txt_.txt',
 '📋_store_5_–_forecasting_accuracy_summary.txt',
#  'the_access_type_for_the_readers_of_the_b.txt',
#  'whether_to_show_a_link_for_users_to_e-ma.txt',
 'forecasting_you_can_trust_in_uncertain_t.txt',
 '📋_store_38_–_forecasting_accuracy_summar.txt',
 'what_if_you_could_cut_cash_flow_forecast.txt',
#  'language_for_this_blog.txt',
 'the_blogspot_subdomain_under_which_to_pu.txt',  # NOTE: looks like a Blogger setting, not a post
 '📋_store_27_–_forecasting_accuracy_summar.txt',
#  'whether_to_provide_an_archive_page_for_e.txt',
 'the_time_zone_for_this_blog.txt',  # NOTE: looks like a Blogger setting, not a post
 '📋_store_1_–_forecasting_accuracy_summary.txt',
#  'whether_this_blog_should_be_indexed_by_s.txt',
#  'who_can_comment.txt',
#  'whether_this_blog_contains_adult_content.txt',
#  'whether_to_enable_comment_moderation.txt',
#  'the_type_of_feed_to_provide_for_blog_pos.txt',
 'federal_economic_indicators_that_impact_.txt',
#  'the_type_of_feed_to_provide_for_blog_com.txt',
#  'whether_float_alignment_is_enabled_for_t.txt',
#  'number_of_days_after_which_new_comments_.txt',
 '📋_store_43_–_forecasting_accuracy_summar.txt',
 'consistency_that_builds_confidence.txt',
 '💼_smarter_forecasting_in_uncertain_times.txt',
#  'the_name_of_the_blog.txt',
 'about_micah_shull.txt',
#  'whether_this_blog_is_served_with_meta_de.txt',
 '🏠_federal_economic_indicators_that_impac.txt'
#  'whether_to_require_commenters_to_complet.txt',
#  'comment_time_stamp_format_number.txt'
 ]

blog_posts_list.sort()
blog_posts_list

['about_micah_shull.txt',
 'consistency_that_builds_confidence.txt',
 'federal_economic_indicators_that_impact_.txt',
 'forecasting_you_can_trust_in_uncertain_t.txt',
 'gainesville_economic_indicators_that_mat.txt',
 'pricing.txt',
 'the_blogspot_subdomain_under_which_to_pu.txt',
 'the_time_zone_for_this_blog.txt',
 'what_if_you_could_cut_cash_flow_forecast.txt',
 '🏠_federal_economic_indicators_that_impac.txt',
 '💼_smarter_forecasting_in_uncertain_times.txt',
 '📋_store_10_–_forecasting_accuracy_summar.txt',
 '📋_store_13_–_forecasting_accuracy_summar.txt',
 '📋_store_15_–_forecasting_accuracy_summar.txt',
 '📋_store_17_–_forecasting_accuracy_summar.txt',
 '📋_store_19_–_forecasting_accuracy_summar.txt',
 '📋_store_1_–_forecasting_accuracy_summary.txt',
 '📋_store_23_–_forecasting_accuracy_summar.txt',
 '📋_store_27_–_forecasting_accuracy_summar.txt',
 '📋_store_28_–_forecasting_accuracy_summar.txt',
 '📋_store_31_–_forecasting_accuracy_summar.txt',
 '📋_store_34_–_forecasting_accuracy_summar.txt',
 ...]

### Strip HTML and Chunk the Posts

In [9]:
import os
import re
import pandas as pd
from bs4 import BeautifulSoup

# Clean HTML and Blogger comments
def clean_html(text):
    # Remove comments like <!-- ... -->
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    # Remove HTML tags using BeautifulSoup
    soup = BeautifulSoup(text, "html.parser")
    clean_text = soup.get_text()
    # Remove extra spacing
    return re.sub(r"\s+", " ", clean_text).strip()

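# Chunking function: up to max_sentences sentences per chunk
# (the last chunk of a post may be shorter)
def split_into_chunks(text, max_sentences=4):
    # Split after sentence-ending punctuation and drop empty fragments,
    # so an empty post body does not produce an empty chunk
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    return [
        " ".join(sentences[i:i + max_sentences]).strip()
        for i in range(0, len(sentences), max_sentences)
    ]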

# Re-process your curated list of blog posts
blog_dir = "/content/blog_posts"
document_chunks = []

for filename in blog_posts_list:
    filepath = os.path.join(blog_dir, filename)
    try:
        with open(filepath, "r", encoding="utf-8") as file:
            content = file.read()

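        lines = content.splitlines()
        # The first line is the "# Title" header written during export;
        # strip only the leading "#" so a "#" inside the title survives
        title = lines[0].lstrip("#").strip()
        body = "\n".join(lines[1:]).strip()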
        cleaned_body = clean_html(body)

        chunks = split_into_chunks(cleaned_body)

        for idx, chunk in enumerate(chunks):
            document_chunks.append({
                "title": title,
                "filename": filename,
                "chunk_id": idx,
                "text": chunk
            })
    except FileNotFoundError:
        print(f"⚠️ Skipped missing file: {filename}")

# Display cleaned and chunked data
df_clean_chunks = pd.DataFrame(document_chunks)
df_clean_chunks.head(50)



| | title | filename | chunk_id | text |
|---|---|---|---|---|
| 0 | About Micah Shull | about_micah_shull.txt | 0 | About Micah Shull \| Data Scientist & Founder o... |
| 1 | About Micah Shull | about_micah_shull.txt | 1 | That’s why I created Cashflow 4Cast — a servic... |
| 2 | Consistency That Builds Confidence | consistency_that_builds_confidence.txt | 0 | It Wasn’t Just One Store — It Was Every Store ... |
| 3 | Consistency That Builds Confidence | consistency_that_builds_confidence.txt | 1 | And in every case, it consistently cut forecas... |
| 4 | Consistency That Builds Confidence | consistency_that_builds_confidence.txt | 2 | MAPE (Mean Absolute Percentage Error): Shows f... |
| 5 | Consistency That Builds Confidence | consistency_that_builds_confidence.txt | 3 | If you're a mid-size business owner looking to... |
| 6 | Federal Economic Indicators That Impact Gaines... | federal_economic_indicators_that_impact_.txt | 0 | 📘 Federal Economic Indicators That Impact Gain... |
| 7 | Federal Economic Indicators That Impact Gaines... | federal_economic_indicators_that_impact_.txt | 1 | Rising inflation affects what people can affor... |
| 8 | Federal Economic Indicators That Impact Gaines... | federal_economic_indicators_that_impact_.txt | 2 | 2. Consumer Confidence Index What It Is: This ... |
| 9 | Federal Economic Indicators That Impact Gaines... | federal_economic_indicators_that_impact_.txt | 3 | A drop in confidence can lead to: Fewer custom... |
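
With a fixed step of four sentences, the last chunk of a post can end up holding a single sentence. If chunks of at least two sentences are wanted, one option is to fold a short trailing chunk into the previous one. The sketch below is an illustrative variant, not the function used above; `chunk_sentences` and its `min_sentences` parameter are assumptions.

```python
import re


def chunk_sentences(text, max_sentences=4, min_sentences=2):
    """Group sentences into chunks, merging a too-short final chunk backwards."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    groups = [
        sentences[i:i + max_sentences]
        for i in range(0, len(sentences), max_sentences)
    ]
    # Fold a short trailing group into the one before it
    if len(groups) > 1 and len(groups[-1]) < min_sentences:
        tail = groups.pop()
        groups[-1].extend(tail)
    return [" ".join(g) for g in groups]


text = "One. Two! Three? Four. Five."
print(chunk_sentences(text))
```

The trade-off is that a merged chunk can exceed `max_sentences` by up to `min_sentences - 1` sentences.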


### Save to CSV

In [11]:
from google.colab import drive
drive.mount('/content/drive')

# Save to local Colab (optional)
csv_path = "/content/cleaned_blog_chunks.csv"
df_clean_chunks.to_csv(csv_path, index=False)

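# ✅ Save directly to Google Drive (create the target folder if it does not exist yet)
drive_path = "/content/drive/My Drive/CF4C/BLOG/cleaned_blog_chunks.csv"
os.makedirs(os.path.dirname(drive_path), exist_ok=True)
df_clean_chunks.to_csv(drive_path, index=False)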

print(f"✅ Saved to Google Drive at: {drive_path}")

Mounted at /content/drive
✅ Saved to Google Drive at: /content/drive/My Drive/CF4C/BLOG/cleaned_blog_chunks.csv
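
Before moving on to the embedding notebook, it can help to read the file back and confirm its shape. The sketch below uses only the standard library. `validate_chunks` is a hypothetical helper, and the in-memory sample stands in for the real `cleaned_blog_chunks.csv`.

```python
import csv
import io

EXPECTED_COLUMNS = ["title", "filename", "chunk_id", "text"]


def validate_chunks(file_obj):
    """Check the header and count rows with empty text in a chunks CSV."""
    reader = csv.DictReader(file_obj)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {reader.fieldnames}")
    rows = list(reader)
    empty = sum(1 for row in rows if not row["text"].strip())
    return {"rows": len(rows), "empty_text": empty}


# Made-up sample with one deliberately empty chunk
sample = io.StringIO(
    "title,filename,chunk_id,text\n"
    "Pricing,pricing.txt,0,Example chunk text.\n"
    "Pricing,pricing.txt,1,\n"
)
report = validate_chunks(sample)
print(report)
```

For the real file, the same function could be called inside `with open(csv_path, newline="", encoding="utf-8") as f:`.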
