## Check for Duplicate Words Between Captions and Tags
----

This script walks a directory with `os.walk` and passes every `.txt` file to `analyze_duplicates`. The function reads the file and splits its content on commas, treating the text after the last comma as the caption and everything before it as tags. Both sides are lowercased, split into words, and stored in sets; the intersection of the two sets, minus a small set of common words ("the", "and", "a", and so on), gives the words that appear in both. For each file with at least one such word, the script prints the file path followed by the shared words, which points to possible redundancy between the tags and the caption.
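To make the set logic easier to follow in isolation, here is a minimal sketch that applies the same steps to an in-memory string instead of a file. The helper name `shared_words` and the sample text are hypothetical and used only for illustration; they are not part of the script.

```python
# Hypothetical helper for illustration; not part of the original script.
STOP_WORDS = {"the", "and", "a", "of", "is", "has", "in", "its", "on"}

def shared_words(content, stop_words=STOP_WORDS):
    """Return the words found in both the tags and the caption of one text."""
    parts = content.split(",")
    # Text after the last comma is the caption; everything before it is tags.
    tags, caption = parts[:-1], parts[-1]
    tag_words = set()
    for tag in tags:
        tag_words.update(tag.lower().split())
    caption_words = set(caption.lower().split())
    # Intersection of both sides, with the common words removed.
    return (tag_words & caption_words) - set(stop_words)

# Made-up sample: two tags, then a caption after the last comma.
sample = "speech bubble, black outline, a dog with a speech bubble and the outline"
print(sorted(shared_words(sample)))  # ['bubble', 'outline', 'speech']
```

Because the split is on every comma, a caption that itself contains commas contributes only its final clause; the earlier clauses are counted as tags.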

In [2]:
import os

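# Common words that should not be reported as duplicates.
STOP_WORDS = {"the", "and", "a", "of", "is", "has", "in", "its", "on"}

def analyze_duplicates(file_path):
    """Print the words that appear in both the tags and the caption of one file."""
    unique_tags = set()
    unique_captions = set()

    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()

    # Everything before the last comma is treated as tags; the rest is the caption.
    parts = content.split(',')
    tags = parts[:-1]
    caption = parts[-1].strip()

    # Collect the lowercase words of each side.
    for tag in tags:
        unique_tags.update(tag.lower().split())
    unique_captions.update(caption.lower().split())

    # Words present on both sides, minus the common words.
    duplicate_words = unique_tags.intersection(unique_captions) - STOP_WORDS

    if duplicate_words:
        print("File:", file_path)
        for word in duplicate_words:
            print(f"- {word}")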

directory_to_analyze = r"C:\Users\kade\Desktop\training_dir_staging"

# Walk the directory tree and check every .txt file.
for root, _, files in os.walk(directory_to_analyze):
    for filename in files:
        if filename.endswith(".txt"):
            file_path = os.path.join(root, filename)
            analyze_duplicates(file_path)

File: C:\Users\kade\Desktop\training_dir_staging\1_furry_sticker\037ceb330a46b3bd01e2bfda92fd66f5.txt
- black
File: C:\Users\kade\Desktop\training_dir_staging\1_furry_sticker\0630ce3d01e25eb803781716440bd7b5.txt
- brown
File: C:\Users\kade\Desktop\training_dir_staging\1_furry_sticker\07b5e552b6b946ebb9c970354c55b198.txt
- brown
File: C:\Users\kade\Desktop\training_dir_staging\1_furry_sticker\0d281d07dfc997a5b1037939ccd33eca.txt
- outline
- bubble
- speech
File: C:\Users\kade\Desktop\training_dir_staging\1_furry_sticker\12cc788a23632fbc5bddbe4a3d0b63c0.txt
- portrait
File: C:\Users\kade\Desktop\training_dir_staging\1_furry_sticker\13f2de85dc85ab31d084254faac0e64d.txt
- hair
File: C:\Users\kade\Desktop\training_dir_staging\1_furry_sticker\15814659a362c2fff15be0e673226521.txt
- sticker
File: C:\Users\kade\Desktop\training_dir_staging\1_furry_sticker\17536230452b954e4df791723dad4580.txt
- sticker
File: C:\Users\kade\Desktop\training_dir_staging\1_furry_sticker\211ee180a626eb52835735734dfc4