# 🗂️ Meeting Notes Optimization Pipeline

This notebook builds a complete data engineering pipeline to process internal meeting notes:
- Clean and structure raw meeting text
- Engineer LLM prompts
- Generate AI summaries (with Hugging Face)
- Reformat into natural language
- Upload final output to **BigQuery**


## 📥 Load & Inspect Data

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/Users/mohamedalibenbelhassen/Documents/PERSONAL/notes-optimizer/Structured_Meeting_Notes.csv')
df.head()

Unnamed: 0,date,speaker,note,project,action_item,deadline
0,2025-03-28,Jordan,[Project: AI assistant integration] Jordan agr...,AI assistant integration,Set up Airflow DAGs,Friday
1,2025-04-04,Jordan,[Project: Internal documentation cleanup] Jord...,Internal documentation cleanup,Test Gemini with internal queries,Friday
2,2025-03-24,Jordan,[Project: Data pipeline improvement] Jordan hi...,Data pipeline improvement,Organize guild workshop,Monday
3,2025-02-13,Omar,[Project: Knowledge sharing guilds] Omar highl...,Knowledge sharing guilds,Summarize feedback for product,Thursday
4,2025-03-27,Anaïs,[Project: Dashboard redesign] Anaïs requested ...,Dashboard redesign,Summarize feedback for product,Thursday


## 🧹 Clean & Preprocess Data

### Drop rows with missing notes

In [3]:
df_clean = df.dropna(subset=['note']).copy()  # .copy() avoids SettingWithCopyWarning when columns are added later
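
The cell above only removes rows whose `note` is missing. If exact duplicates also need checking, a minimal sketch could look like the following. The sample frame and the choice of `key_columns` are assumptions made for illustration, not values from the real CSV.

```python
import pandas as pd

# Hypothetical miniature of the notes table; the values are made up for illustration.
sample = pd.DataFrame({
    "date": ["2025-03-28", "2025-03-28", "2025-04-04"],
    "speaker": ["Jordan", "Jordan", "Omar"],
    "note": ["Set up DAGs", "Set up DAGs", None],
})

# Columns that together identify a note (an assumption; adjust to the real data).
key_columns = ["date", "speaker", "note"]

# Count rows that repeat an earlier row on the key columns.
n_duplicates = int(sample.duplicated(subset=key_columns).sum())

# Drop the repeats, then drop rows without note text; .copy() returns an independent frame.
deduplicated = (
    sample.drop_duplicates(subset=key_columns)
          .dropna(subset=["note"])
          .copy()
)

print(n_duplicates, len(deduplicated))
```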


### Title-case the `speaker` and `project` columns

In [4]:
df_clean['speaker'] = df_clean['speaker'].str.title()
df_clean['project'] = df_clean['project'].str.title()
df_clean.head()

Unnamed: 0,date,speaker,note,project,action_item,deadline
0,2025-03-28,Jordan,[Project: AI assistant integration] Jordan agr...,Ai Assistant Integration,Set up Airflow DAGs,Friday
1,2025-04-04,Jordan,[Project: Internal documentation cleanup] Jord...,Internal Documentation Cleanup,Test Gemini with internal queries,Friday
2,2025-03-24,Jordan,[Project: Data pipeline improvement] Jordan hi...,Data Pipeline Improvement,Organize guild workshop,Monday
3,2025-02-13,Omar,[Project: Knowledge sharing guilds] Omar highl...,Knowledge Sharing Guilds,Summarize feedback for product,Thursday
4,2025-03-27,Anaïs,[Project: Dashboard redesign] Anaïs requested ...,Dashboard Redesign,Summarize feedback for product,Thursday
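
As the output above shows, `str.title()` also lowercases the tail of acronyms, so `AI` becomes `Ai`. One possible workaround is sketched below. The helper name `title_keep_acronyms` and its `acronyms` parameter are hypothetical and not part of the original notebook.

```python
def title_keep_acronyms(text, acronyms=("AI",)):
    """Title-case text, then restore the listed acronyms to upper case."""
    # Map the title-cased form back to the original acronym, e.g. {"Ai": "AI"}.
    restore = {acronym.title(): acronym for acronym in acronyms}
    words = text.title().split(" ")
    return " ".join(restore.get(word, word) for word in words)


print("AI assistant integration".title())               # Ai Assistant Integration
print(title_keep_acronyms("AI assistant integration"))  # AI Assistant Integration
```

It could be applied with `df_clean['project'].apply(title_keep_acronyms)` in place of `.str.title()`, assuming the list of acronyms is known in advance.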


### ✂️ Truncate Notes for Prompt Stability

### Truncate each note to 300 characters to avoid overflowing or confusing the model

In [5]:
df_clean['note_short'] = df_clean['note'].apply(lambda x: x[:300])
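
Slicing at a fixed character count can cut a note in the middle of a word. A minimal alternative sketch, assuming word-boundary truncation is acceptable, uses the standard library's `textwrap.shorten`. The helper name `truncate_at_word` is hypothetical, and note that `shorten` also collapses runs of whitespace.

```python
import textwrap


def truncate_at_word(text, limit=300, placeholder="..."):
    """Shorten text to at most `limit` characters, cutting only between words."""
    return textwrap.shorten(text, width=limit, placeholder=placeholder)


long_note = "word " * 100           # 500 characters of made-up text
short_note = truncate_at_word(long_note)

print(len(short_note) <= 300)       # the result never exceeds the limit
print(truncate_at_word("short note"))
```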


## 📆 Parse Deadlines into Dates

### Map the deadline weekday to a calendar date

In [6]:
from datetime import datetime, timedelta

def convert_weekday_to_date(row):
    weekdays = {
        "Monday": 0, "Tuesday": 1, "Wednesday": 2,
        "Thursday": 3, "Friday": 4, "Saturday": 5, "Sunday": 6
    }
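    try:
        # Start from the day after the meeting so that a deadline falling on the
        # meeting's own weekday resolves to the following week, not the same day.
        note_date = datetime.strptime(row["date"], "%Y-%m-%d") + timedelta(days=1)
        target_weekday = weekdays.get(row["deadline"])
        if target_weekday is None:
            return None  # unknown or missing weekday name
        days_ahead = (target_weekday - note_date.weekday()) % 7
        return (note_date + timedelta(days=days_ahead)).strftime("%Y-%m-%d")
    except (ValueError, TypeError):  # malformed or non-string date
        return None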

df_clean.head()


Unnamed: 0,date,speaker,note,project,action_item,deadline,note_short
0,2025-03-28,Jordan,[Project: AI assistant integration] Jordan agr...,Ai Assistant Integration,Set up Airflow DAGs,Friday,[Project: AI assistant integration] Jordan agr...
1,2025-04-04,Jordan,[Project: Internal documentation cleanup] Jord...,Internal Documentation Cleanup,Test Gemini with internal queries,Friday,[Project: Internal documentation cleanup] Jord...
2,2025-03-24,Jordan,[Project: Data pipeline improvement] Jordan hi...,Data Pipeline Improvement,Organize guild workshop,Monday,[Project: Data pipeline improvement] Jordan hi...
3,2025-02-13,Omar,[Project: Knowledge sharing guilds] Omar highl...,Knowledge Sharing Guilds,Summarize feedback for product,Thursday,[Project: Knowledge sharing guilds] Omar highl...
4,2025-03-27,Anaïs,[Project: Dashboard redesign] Anaïs requested ...,Dashboard Redesign,Summarize feedback for product,Thursday,[Project: Dashboard redesign] Anaïs requested ...
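
To make the weekday arithmetic concrete, here is a standalone sketch of the same calculation on plain strings. The function name `next_deadline` is hypothetical; the logic mirrors `convert_weekday_to_date` without the row lookup or error handling.

```python
from datetime import datetime, timedelta

WEEKDAYS = {
    "Monday": 0, "Tuesday": 1, "Wednesday": 2,
    "Thursday": 3, "Friday": 4, "Saturday": 5, "Sunday": 6,
}


def next_deadline(meeting_date, deadline_day):
    """Return the first `deadline_day` strictly after `meeting_date` (YYYY-MM-DD)."""
    start = datetime.strptime(meeting_date, "%Y-%m-%d") + timedelta(days=1)
    days_ahead = (WEEKDAYS[deadline_day] - start.weekday()) % 7
    return (start + timedelta(days=days_ahead)).strftime("%Y-%m-%d")


# Friday meeting, Friday deadline: the one-day shift pushes it to the next week.
print(next_deadline("2025-03-28", "Friday"))  # 2025-04-04
# Thursday meeting, Friday deadline: the very next day.
print(next_deadline("2025-03-27", "Friday"))  # 2025-03-28
```

Every sample row shown in this notebook has a deadline on the meeting's own weekday, which is why the computed gap for those rows is always 7 days.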


## 🗓️ Apply the Deadline Conversion

In [7]:
df_clean["deadline_date"] = df_clean.apply(convert_weekday_to_date, axis=1)
df_clean["deadline_date"] = pd.to_datetime(df_clean["deadline_date"])
df_clean["date"] = pd.to_datetime(df_clean["date"])



In [8]:
df_clean["days_until_deadline"] = (df_clean["deadline_date"] - df_clean["date"]).dt.days
df_clean.head()

Unnamed: 0,date,speaker,note,project,action_item,deadline,note_short,deadline_date,days_until_deadline
0,2025-03-28,Jordan,[Project: AI assistant integration] Jordan agr...,Ai Assistant Integration,Set up Airflow DAGs,Friday,[Project: AI assistant integration] Jordan agr...,2025-04-04,7
1,2025-04-04,Jordan,[Project: Internal documentation cleanup] Jord...,Internal Documentation Cleanup,Test Gemini with internal queries,Friday,[Project: Internal documentation cleanup] Jord...,2025-04-11,7
2,2025-03-24,Jordan,[Project: Data pipeline improvement] Jordan hi...,Data Pipeline Improvement,Organize guild workshop,Monday,[Project: Data pipeline improvement] Jordan hi...,2025-03-31,7
3,2025-02-13,Omar,[Project: Knowledge sharing guilds] Omar highl...,Knowledge Sharing Guilds,Summarize feedback for product,Thursday,[Project: Knowledge sharing guilds] Omar highl...,2025-02-20,7
4,2025-03-27,Anaïs,[Project: Dashboard redesign] Anaïs requested ...,Dashboard Redesign,Summarize feedback for product,Thursday,[Project: Dashboard redesign] Anaïs requested ...,2025-04-03,7


## 🤖 Build the Template Summary and LLM Input

In [9]:
def reformat_summary(row):
    date = pd.to_datetime(row["date"]).strftime("%B %d, %Y")  # e.g. March 27, 2025
    return (
        f"On {date}, the team held a meeting about the '{row['project']}' project. "
        f"{row['speaker']} shared updates and outlined key action items: {row['action_item']}. "
        f"The deadline agreed upon was {row['deadline']}."
    )

# Apply to the whole DataFrame
df_clean["summary"] = df_clean.apply(reformat_summary, axis=1)

In [10]:
df_clean["llm_input"] = df_clean.apply(
    lambda row: f"Meeting on {row['date']} about '{row['project']}' led by {row['speaker']}: {row['note_short']}",
    axis=1
)

df_clean["llm_input"].head()

0    Meeting on 2025-03-28 00:00:00 about 'Ai Assis...
1    Meeting on 2025-04-04 00:00:00 about 'Internal...
2    Meeting on 2025-03-24 00:00:00 about 'Data Pip...
3    Meeting on 2025-02-13 00:00:00 about 'Knowledg...
4    Meeting on 2025-03-27 00:00:00 about 'Dashboar...
Name: llm_input, dtype: object
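
Because `date` was converted to a datetime earlier, the f-string renders it with a `00:00:00` time component. A minimal sketch of one way to keep only the date is shown below, using a format spec inside the f-string. It uses a plain `datetime` for self-containment; `pandas.Timestamp` is a `datetime` subclass, so the same spec should apply. The row values are illustrative, and the note text is made up.

```python
from datetime import datetime

# Illustrative row; the note text is hypothetical.
row = {
    "date": datetime(2025, 3, 28),
    "project": "Dashboard Redesign",
    "speaker": "Anaïs",
    "note_short": "hypothetical note text",
}

# The ":%Y-%m-%d" format spec drops the time component from the rendered date.
llm_input = (
    f"Meeting on {row['date']:%Y-%m-%d} about '{row['project']}' "
    f"led by {row['speaker']}: {row['note_short']}"
)

print(llm_input)
```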

## 📝 Clean the Prompt, Generate AI Summaries, and Reformat into Natural Language

In [11]:
import re

def clean_note(text):
    return re.sub(r'\[Project: .*?\]\s*', '', text)


df_clean['llm_input'] = df_clean['llm_input'].apply(clean_note) 

df_clean['llm_input'].head()

0    Meeting on 2025-03-28 00:00:00 about 'Ai Assis...
1    Meeting on 2025-04-04 00:00:00 about 'Internal...
2    Meeting on 2025-03-24 00:00:00 about 'Data Pip...
3    Meeting on 2025-02-13 00:00:00 about 'Knowledg...
4    Meeting on 2025-03-27 00:00:00 about 'Dashboar...
Name: llm_input, dtype: object
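
The truncated preview above looks unchanged because the `[Project: ...]` tag sits after the prefix that `head()` displays. A small standalone sketch of the same regex on a full string makes the effect visible. The function name `strip_project_tag` and the sample sentence are hypothetical.

```python
import re


def strip_project_tag(text):
    """Remove a '[Project: ...]' tag and any whitespace that follows it."""
    return re.sub(r"\[Project: .*?\]\s*", "", text)


# Made-up example in the same shape as the llm_input column.
raw = (
    "Meeting on 2025-03-27 about 'Dashboard Redesign' led by Anaïs: "
    "[Project: Dashboard redesign] Anaïs requested a review."
)

print(strip_project_tag(raw))
```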

In [12]:
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

df_clean["summary"] = df_clean["llm_input"].apply(
    lambda x: summarizer(x, max_length=40, min_length=10, do_sample=False)[0]['summary_text']
)

df_clean.head(10)





Device set to use mps:0


Unnamed: 0,date,speaker,note,project,action_item,deadline,note_short,deadline_date,days_until_deadline,summary,llm_input
0,2025-03-28,Jordan,[Project: AI assistant integration] Jordan agr...,Ai Assistant Integration,Set up Airflow DAGs,Friday,[Project: AI assistant integration] Jordan agr...,2025-04-04,7,Meeting on 2025-03-28 00:00:00 about 'Ai Assi...,Meeting on 2025-03-28 00:00:00 about 'Ai Assis...
1,2025-04-04,Jordan,[Project: Internal documentation cleanup] Jord...,Internal Documentation Cleanup,Test Gemini with internal queries,Friday,[Project: Internal documentation cleanup] Jord...,2025-04-11,7,Meeting on 2025-04-04 00:00:00 about 'Interna...,Meeting on 2025-04-04 00:00:00 about 'Internal...
2,2025-03-24,Jordan,[Project: Data pipeline improvement] Jordan hi...,Data Pipeline Improvement,Organize guild workshop,Monday,[Project: Data pipeline improvement] Jordan hi...,2025-03-31,7,Meeting on 2025-03-24 00:00:00 about 'Data Pi...,Meeting on 2025-03-24 00:00:00 about 'Data Pip...
3,2025-02-13,Omar,[Project: Knowledge sharing guilds] Omar highl...,Knowledge Sharing Guilds,Summarize feedback for product,Thursday,[Project: Knowledge sharing guilds] Omar highl...,2025-02-20,7,Meeting on 2025-02-13 00:00:00 about 'Knowled...,Meeting on 2025-02-13 00:00:00 about 'Knowledg...
4,2025-03-27,Anaïs,[Project: Dashboard redesign] Anaïs requested ...,Dashboard Redesign,Summarize feedback for product,Thursday,[Project: Dashboard redesign] Anaïs requested ...,2025-04-03,7,Meeting on 2025-03-27 00:00:00 about 'Dashboa...,Meeting on 2025-03-27 00:00:00 about 'Dashboar...
5,2025-03-30,Louis,[Project: Malty AI latency issue] Louis raised...,Malty Ai Latency Issue,Run dbt models in staging,Sunday,[Project: Malty AI latency issue] Louis raised...,2025-04-06,7,Meeting on 2025-03-30 00:00:00 about 'Malty A...,Meeting on 2025-03-30 00:00:00 about 'Malty Ai...
6,2025-03-02,Mélanie,[Project: Malty AI latency issue] Mélanie agre...,Malty Ai Latency Issue,Review latency logs,Sunday,[Project: Malty AI latency issue] Mélanie agre...,2025-03-09,7,Meeting on 2025-03-02 00:00:00 about 'Malty A...,Meeting on 2025-03-02 00:00:00 about 'Malty Ai...
7,2025-02-28,Jordan,[Project: Knowledge sharing guilds] Jordan rai...,Knowledge Sharing Guilds,Share documentation with the team,Friday,[Project: Knowledge sharing guilds] Jordan rai...,2025-03-07,7,Meeting on 2025-02-28 00:00:00 about 'Knowled...,Meeting on 2025-02-28 00:00:00 about 'Knowledg...
8,2025-03-31,Louis,[Project: Freelancer matching algorithm] Louis...,Freelancer Matching Algorithm,Summarize feedback for product,Monday,[Project: Freelancer matching algorithm] Louis...,2025-04-07,7,Meeting on 2025-03-31 00:00:00 about 'Freelan...,Meeting on 2025-03-31 00:00:00 about 'Freelanc...
9,2025-02-24,Louis,[Project: Freelancer matching algorithm] Louis...,Freelancer Matching Algorithm,Review latency logs,Monday,[Project: Freelancer matching algorithm] Louis...,2025-03-03,7,Louis requested a report on the freelancer ma...,Meeting on 2025-02-24 00:00:00 about 'Freelanc...
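
The cell above calls the summarizer once per row. Hugging Face pipelines also accept a list of strings, so a batched variant is possible. The sketch below assumes the summarizer is passed in as a callable that takes a list of texts and returns one `{"summary_text": ...}` dict per text. The helper `summarize_in_batches` and the stand-in `fake_summarizer` are hypothetical, and the stand-in only exists so the example runs without downloading a model.

```python
def summarize_in_batches(texts, summarize_fn, batch_size=8, **generate_kwargs):
    """Summarize texts in fixed-size batches and return the summaries in order."""
    summaries = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        results = summarize_fn(batch, **generate_kwargs)
        summaries.extend(result["summary_text"] for result in results)
    return summaries


def fake_summarizer(batch, **kwargs):
    """Stand-in for a real pipeline: returns the first 20 characters of each text."""
    return [{"summary_text": text[:20]} for text in batch]


print(summarize_in_batches(["a" * 30, "b", "c"], fake_summarizer, batch_size=2))

# With the real pipeline from this notebook, the call might look like:
# df_clean["summary"] = summarize_in_batches(
#     df_clean["llm_input"].tolist(), summarizer,
#     max_length=40, min_length=10, do_sample=False,
# )
```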


In [13]:
reformat_summary(df_clean.iloc[0])


"On March 28, 2025, the team held a meeting about the 'Ai Assistant Integration' project. Jordan shared updates and outlined key action items: Set up Airflow DAGs. The deadline agreed upon was Friday."

In [14]:
df_clean["summary"] = df_clean.apply(reformat_summary, axis=1)  # overwrites the model-generated summaries with the template text


In [15]:
df_clean[["project", "speaker", "summary"]].head()


Unnamed: 0,project,speaker,summary
0,Ai Assistant Integration,Jordan,"On March 28, 2025, the team held a meeting abo..."
1,Internal Documentation Cleanup,Jordan,"On April 04, 2025, the team held a meeting abo..."
2,Data Pipeline Improvement,Jordan,"On March 24, 2025, the team held a meeting abo..."
3,Knowledge Sharing Guilds,Omar,"On February 13, 2025, the team held a meeting ..."
4,Dashboard Redesign,Anaïs,"On March 27, 2025, the team held a meeting abo..."
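
The template prints zero-padded days ("April 04") and reports the deadline only as a weekday name. A possible refinement, sketched under the assumption that the computed `deadline_date` should appear in the sentence, is shown below. The helper names `format_long_date` and `summary_with_deadline_date` are hypothetical, and the example row reuses values from the first row of the table above.

```python
from datetime import date


def format_long_date(d):
    """Format a date as 'April 4, 2025' without a zero-padded day."""
    return f"{d:%B} {d.day}, {d.year}"


def summary_with_deadline_date(row):
    """Build the template sentence, adding the resolved deadline date."""
    return (
        f"On {format_long_date(row['date'])}, the team held a meeting about the "
        f"'{row['project']}' project. {row['speaker']} shared updates and outlined "
        f"key action items: {row['action_item']}. The deadline agreed upon was "
        f"{row['deadline']}, {format_long_date(row['deadline_date'])}."
    )


example_row = {
    "date": date(2025, 3, 28),
    "project": "Ai Assistant Integration",
    "speaker": "Jordan",
    "action_item": "Set up Airflow DAGs",
    "deadline": "Friday",
    "deadline_date": date(2025, 4, 4),
}

print(summary_with_deadline_date(example_row))
```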


## ✅ Upload to BigQuery

In [18]:
from google.oauth2 import service_account
from pandas_gbq import to_gbq

# 🔐 Use your downloaded JSON key here
credentials = service_account.Credentials.from_service_account_file(
    "gcp-key-pipeline-notes.json"  # <-- replace with the exact filename if different
)

# 🚀 Upload to BigQuery
to_gbq(
    dataframe=df_clean,
    destination_table="meeting_notes.summarized_notes",  # dataset.table
    project_id="pipeline-notes-optimizer",
    credentials=credentials,
    if_exists="replace"  # or "append" if you're adding more data later
)

FileNotFoundError: [Errno 2] No such file or directory: 'gcp-key-pipeline-notes.json'
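
The error above means the key file was not found relative to the notebook's working directory. A minimal sketch for failing earlier with a clearer message is shown below. The helper `resolve_key_path` is hypothetical; it reads an optional override from `GOOGLE_APPLICATION_CREDENTIALS`, the environment variable Google's client libraries conventionally use, and otherwise falls back to the filename used in this notebook.

```python
import os
from pathlib import Path


def resolve_key_path(default_name="gcp-key-pipeline-notes.json",
                     env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return the service-account key path, or raise with the absolute path tried."""
    candidate = Path(os.environ.get(env_var, default_name)).expanduser()
    if not candidate.is_file():
        raise FileNotFoundError(
            f"Service-account key not found at '{candidate.resolve()}'. "
            f"Set {env_var} or place the key next to the notebook."
        )
    return candidate


try:
    key_path = resolve_key_path()
    print(f"Using key file: {key_path}")
except FileNotFoundError as exc:
    print(exc)
```

The returned path could then be passed to `service_account.Credentials.from_service_account_file(str(key_path))` before calling `to_gbq`.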