##  Agenda-Based Conference Optimization
---


##  Objective

The objective of this project is to:
- Analyze research abstract titles submitted to the AIOS conference for a selected year.
- Identify and separate unique talks from similar/duplicate ones using ML and clustering techniques.
- Help organizers detect redundant abstracts by grouping similar titles using similarity scores.

---

##  Techniques & Algorithms Used

- **YAKE (Yet Another Keyword Extractor):** Extracts important keywords from abstract titles to capture meaningful content.
- **TF-IDF (Term Frequency–Inverse Document Frequency):** Transforms text data into numerical feature vectors for similarity computation.
- **Cosine Similarity Matrix:** Measures similarity between abstract titles based on vector angles.
- **KMeans Clustering (K=4):** Groups titles into four clusters of similar patterns. K is fixed at 4 in the code and can be tuned with the elbow method.
- **Thresholding:** Filters pairs with cosine similarity ≥ 0.85 to detect high similarity.

---

##  Steps Involved

1. Load & Preprocess Data:

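- Read the input CSV file.
- Drop rows with missing values in the conf_name or title fields.
- Extract the four-digit year from the conference name (for example, "AIOS 2014" gives 2014).
- Clean and normalize the title field by lowercasing it and trimming surrounding whitespace.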
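The preprocessing step can be sketched with the standard library alone. This is a minimal illustration, not the notebook's pandas code: the helper names `extract_year` and `normalize_title` and the sample rows are hypothetical, and collapsing internal whitespace is an extra normalization assumed here.

```python
import re

def extract_year(conf_name):
    """Return the first four-digit run in conf_name as an int, or None if absent."""
    match = re.search(r"\d{4}", str(conf_name))
    return int(match.group()) if match else None

def normalize_title(title):
    """Lowercase, trim, and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", str(title).lower().strip())

# Hypothetical rows standing in for the CSV contents.
rows = [
    {"conf_name": "AIOS 2014", "title": "  Cataract Surgery   Outcomes "},
    {"conf_name": "AIOS 2015", "title": None},           # missing title: dropped
    {"conf_name": None, "title": "Glaucoma Screening"},  # missing conf_name: dropped
]

# Keep only rows where both required fields are present, then derive the new fields.
cleaned = [
    {"year": extract_year(r["conf_name"]), "title_clean": normalize_title(r["title"])}
    for r in rows
    if r["conf_name"] is not None and r["title"] is not None
]
print(cleaned)
```

Only the first row survives, because the other two each lack a required field.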


2. Filter by Year:

- Select only those rows (abstracts/titles) from a specific year for analysis.

3. TF-IDF Vectorization: (The YAKE keywords of each title are converted into a TF-IDF vector)

- Algorithm Used: TF-IDF (Term Frequency-Inverse Document Frequency).

- Input: Top YAKE keywords extracted from the preprocessed titles of the filtered year.

- Output: Sparse matrix of numerical features representing the text content.
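To make the weighting concrete, here is a from-scratch sketch of TF-IDF using a smoothed IDF and L2 normalization. It is an illustration under assumptions, not the scikit-learn implementation: the function name `tfidf_vectors` and the sample documents are hypothetical, tokenization is a plain whitespace split, and stop-word removal is omitted.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, each scaled to unit L2 length."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so rarer terms get larger weights.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Hypothetical keyword strings, one per title.
docs = ["cataract surgery outcomes", "cataract surgery complications", "glaucoma screening"]
vecs = tfidf_vectors(docs)
for doc, vec in zip(docs, vecs):
    print(doc, "->", {t: round(w, 3) for t, w in vec.items()})
```

In the first document, "outcomes" receives a higher weight than "cataract" because it occurs in fewer documents.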



4. Cosine Similarity Calculation:

- Algorithm Used: Cosine Similarity Matrix.

- Calculate pairwise cosine similarity scores between all title vectors.

- Threshold Score: Similarity ≥ 0.85 is considered similar or duplicate (a score of 1 indicates a duplicate).

- Output: Matrix of similarity scores between every pair of titles.
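The similarity and thresholding logic can be sketched on sparse vectors stored as dictionaries. This is a minimal illustration, assuming hypothetical term weights and the helper names `cosine` and `similar_pairs`; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similar_pairs(vectors, threshold=0.85):
    """Return (i, j, score) for every pair i < j scoring at or above the threshold."""
    pairs = []
    for i, j in combinations(range(len(vectors)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, round(score, 3)))
    return pairs

# Hypothetical term-weight vectors for four titles.
vectors = [
    {"cataract": 1.0, "surgery": 1.0},
    {"cataract": 1.0, "surgery": 1.0},                   # same terms as title 0
    {"cataract": 1.0, "surgery": 1.0, "outcomes": 0.5},  # one extra term
    {"glaucoma": 1.0},                                   # no shared terms
]
print(similar_pairs(vectors))
```

Titles 0 and 1 share identical vectors and count as duplicates, title 2 is similar to both, and title 3 shares no terms with the rest and is excluded.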




5. Cluster Similar Titles:

- Algorithm Used: KMeans Clustering.

- Technique: K is fixed at 4 in the code; the elbow method can be used to choose a better value.

- Input: TF-IDF vectors of all titles from the selected year.

- Output: A cluster ID for every title; each similar pair is tagged with the cluster ID of its first talk.
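The elbow method compares the within-cluster sum of squares (inertia) across several values of K and picks the point where further increases stop paying off. The sketch below is a from-scratch version of Lloyd's algorithm on toy 2-D points, written only to show the idea: `kmeans_inertia`, the seed, and the data are assumptions, and with scikit-learn the same quantity is available as the `inertia_` attribute of a fitted `KMeans`.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_inertia(points, k, n_iter=20, seed=42):
    """Run Lloyd's algorithm and return the final within-cluster sum of squares."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k distinct data points
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (a center with no members stays where it is).
        centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centers[c]
            for c, members in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Toy data: three well-separated groups of four points each.
points = [
    (cx + dx, cy + dy)
    for cx, cy in [(0, 0), (10, 0), (0, 10)]
    for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]
]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

Plotting inertia against K and looking for the bend in the curve gives a candidate value for K; the result depends on initialization, so several seeds are worth trying.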



6. Identify Unique Talks:
- Extract talks that are not part of any pair with similarity ≥ 0.85.

- Output: Talks considered unique.
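Identifying unique talks amounts to a set difference between all abstracts and those that appear in at least one similar pair. A minimal sketch, assuming hypothetical abstract numbers and pair tuples of the form `(abs_no_1, abs_no_2, score)`:

```python
# Hypothetical abstract numbers and the similar pairs found by thresholding.
all_abs_nos = [101, 102, 103, 104, 105]
similar_pairs = [(101, 103, 0.91), (103, 104, 1.0)]

# Every abstract that appears on either side of a similar pair.
used = {abs_no for first, second, _ in similar_pairs for abs_no in (first, second)}

# Unique talks are the abstracts that never appear in any pair.
unique = sorted(n for n in all_abs_nos if n not in used)
print(unique)
```

Here abstracts 102 and 105 never appear in a pair, so they are the unique talks.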

7. Save Output CSVs:
- CSV 1: List of unique talks (no similar title found).

- CSV 2: List of similar/duplicate talk pairs with their similarity score and cluster ID.

In [2]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
import yake  # YAKE for keyword extraction (keyword ~ core meaning)

# -----------------------------
# Clean text: normalize and remove punctuation
# -----------------------------
def clean_text(text):
    if pd.isna(text):
        return ""
    text = str(text).lower().strip()
    text = re.sub(r"[.,\-;_!:‘’`]+", "", text)  # strip common punctuation
    text = re.sub(r"\s+", " ", text)  # normalize multiple spaces to one
    return text

# -----------------------------
# STEP 1: Read the input ophthalmology abstracts dataset
# -----------------------------
file_path = "/Users/pranavs/Desktop/Conference Optimization/AIOC 2014 - 2025 All Abstracts Data - Shared to VS - Sheet1.csv"
df = pd.read_csv(file_path)

# -----------------------------
# STEP 2: Clean the dataset – Drop rows with missing fields in conf_name and title
# -----------------------------
df = df.dropna(subset=['# conf_name', 'title'])

# -----------------------------
# STEP 3: Extract year from conference name (example: AIOS 2014 ➝ 2014)
# -----------------------------
df['year'] = df['# conf_name'].str.extract(r'(\d{4})')
df = df[df['year'].notnull()]
df['year'] = df['year'].astype(int)

# Clean and lowercase titles
df['title_clean'] = df['title'].astype(str).str.lower().str.strip()

# -----------------------------
# STEP 4: Filter by year – Select only abstracts from a specific year
# -----------------------------
year_to_analyze = 2014
df_year = df[df['year'] == year_to_analyze].reset_index(drop=True)

if len(df_year) < 2:
    # Stop with a message instead of calling the interactive-only exit() helper
    raise SystemExit(f"Not enough entries in year {year_to_analyze}. Exiting.")

print(f"\n Processing Abstracts for Year: {year_to_analyze}")

# -----------------------------
# STEP 5: Extract keywords using YAKE (Yet Another Keyword Extractor)
# -----------------------------
kw_extractor = yake.KeywordExtractor(lan="en", n=1, top=5)
df_year['keywords'] = df_year['title_clean'].apply(lambda text: " ".join([kw for kw, _ in kw_extractor.extract_keywords(text)]))

# -----------------------------
# STEP 6: TF-IDF (Term Frequency-Inverse Document Frequency) – Transform keywords into numerical feature vectors
# -----------------------------
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df_year['keywords'])

# -----------------------------
# STEP 7: Cosine Similarity Matrix Approach– Compute similarity between title vectors
# -----------------------------
similarity_matrix = cosine_similarity(tfidf_matrix)

# -----------------------------
# STEP 8: Apply Thresholding – Similarity score >= 0.85 = similar or duplicate
# Duplicate talks: score = 1
# Similar talks: score in [0.85, 1)
# -----------------------------
threshold = 0.85
similar_pairs = []
used_abs_nos = set()

for i in range(len(df_year)):
    for j in range(i + 1, len(df_year)):
        score = similarity_matrix[i, j]
        if score >= threshold:
            row_i = df_year.iloc[i]
            row_j = df_year.iloc[j]
            similar_pairs.append({
                'year': year_to_analyze,
                'similarity_score': round(score, 3),
                'abs_no_1': row_i['abs_no'],
                'name_1': row_i['name'],
                'title_1': row_i['title'],
                'abs_no_2': row_j['abs_no'],
                'name_2': row_j['name'],
                'title_2': row_j['title'],
            })
            used_abs_nos.update([row_i['abs_no'], row_j['abs_no']])

# -----------------------------
# STEP 9: Identify Unique Talks – Not part of any similarity match
# -----------------------------
unique_df = df_year[~df_year['abs_no'].isin(used_abs_nos)][['abs_no', 'name', 'title', 'year']]
unique_df = unique_df.sort_values(by='abs_no')

# -----------------------------
# STEP 10: Cluster similar/duplicate talks using KMeans Clustering Algorithm (Unsupervised ML)
# -----------------------------
k = 4  # Can be determined via elbow method 
kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
df_year['cluster'] = kmeans.fit_predict(tfidf_matrix)

# Assign cluster info to similar pairs for context
for pair in similar_pairs:
    idx1 = df_year[df_year['abs_no'] == pair['abs_no_1']].index[0]
    pair['cluster'] = df_year.loc[idx1, 'cluster']

# -----------------------------
# STEP 11: Output Results – Save unique and similar talks to separate CSVs
# (two CSVs: one with unique talks, the other with similar/duplicate talks)
# -----------------------------
similar_df = pd.DataFrame(similar_pairs)
if not similar_df.empty:
    # Sort only when pairs exist; an empty frame has no similarity_score column
    similar_df = similar_df.sort_values(by="similarity_score", ascending=True)  # ascending similarity score
similar_output_path = f"similar_abstracts_{year_to_analyze}.csv"
unique_output_path = f"unique_abstracts_{year_to_analyze}.csv"

similar_df.to_csv(similar_output_path, index=False)
unique_df.to_csv(unique_output_path, index=False)

print("\n Done!")
print(f" Similar talks saved to: {similar_output_path}")
print(f" Unique talks saved to: {unique_output_path}")





 Processing Abstracts for Year: 2014

 Done!
 Similar talks saved to: similar_abstracts_2014.csv
 Unique talks saved to: unique_abstracts_2014.csv


##  Ranking similar/duplicate talks by how often they appear in similar pairs

This ranking answers the question:
- “Which abstracts are most commonly reused or appear in multiple similar pairs?”
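The counting idea can be shown with the standard library before turning to the pandas version. This is a small sketch with hypothetical abstract numbers: each pair contributes one count to both of its members.

```python
from collections import Counter

# Hypothetical similar pairs as (abs_no_1, abs_no_2).
pairs = [(101, 103), (103, 104), (103, 107), (101, 104)]

# Count how many pairs each abstract takes part in.
counts = Counter(abs_no for pair in pairs for abs_no in pair)

# most_common() returns (abs_no, count) tuples, highest count first.
ranked = counts.most_common()
for abs_no, similarity_count in ranked:
    print(abs_no, similarity_count)
```

Abstract 103 ranks first because it takes part in three of the four pairs.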



In [2]:

import pandas as pd

# Load the similar/duplicate talks CSV
similar_df = pd.read_csv("/Users/pranavs/Desktop/Conference Optimization/results/similar_abstracts_2014.csv")

# Count how often each abstract appears in similar pairs (Example: 280 -> appeared in 280 similarity pairs)
talk_counts = pd.concat([similar_df['abs_no_1'], similar_df['abs_no_2']])
talk_freq = talk_counts.value_counts().reset_index()
talk_freq.columns = ['abs_no', 'similarity_count']

# Merge with original titles for context
titles = pd.concat([
    similar_df[['abs_no_1', 'title_1']].rename(columns={'abs_no_1': 'abs_no', 'title_1': 'title'}),
    similar_df[['abs_no_2', 'title_2']].rename(columns={'abs_no_2': 'abs_no', 'title_2': 'title'})
]).drop_duplicates(subset='abs_no')

ranked_talks = talk_freq.merge(titles, on='abs_no', how='left')
ranked_talks = ranked_talks.sort_values(by='similarity_count', ascending=False)


# Save the full ranked list to an output CSV file
output_path = "/Users/pranavs/Desktop/Conference Optimization/results/ranked_duplicate_talks_2014.csv"
ranked_talks.to_csv(output_path, index=False)

print(f"\n Full ranked duplicate talks saved to: {output_path}")



 Full ranked duplicate talks saved to: /Users/pranavs/Desktop/Conference Optimization/results/ranked_duplicate_talks_2014.csv
