<a href="https://colab.research.google.com/github/k-dinakaran/automation-of-wordpress-post-publication-using-AI-tools/blob/main/build_an_AI_assistant_to_detect_plagiarism_in_content_drafts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import spacy
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import re

# Load spaCy and Sentence-BERT model
nlp = spacy.load("en_core_web_sm")
model = SentenceTransformer("all-MiniLM-L6-v2")

# Preprocess text function
def preprocess_text(text):
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(tokens)

# Embed text for similarity calculation
def embed_text(text):
    return model.encode([text])[0]

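# Calculate the cosine similarity between the embeddings of two texts
def calculate_similarity(text1, text2):
    embedding1 = embed_text(text1)
    embedding2 = embed_text(text2)
    # cosine_similarity returns a 1x1 matrix here; take the single entry and
    # cast it to a plain Python float so it rounds and serializes predictably
    similarity = float(cosine_similarity([embedding1], [embedding2])[0][0])
    return similarity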

# Detect plagiarism by comparing draft with each source
def detect_plagiarism(content, sources, threshold):
    results = []
    processed_content = preprocess_text(content)

    for source_name, source_text in sources.items():
        processed_source = preprocess_text(source_text)
        similarity_score = calculate_similarity(processed_content, processed_source)

        if similarity_score >= threshold:
            results.append({
                "Source": source_name,
                "Similarity Score": round(similarity_score * 100, 2),
                "Matched Text": source_text
            })
    return results

# Dummy paraphrasing function (replace with actual model for production)
def suggest_paraphrase(text):
    return "Suggested paraphrase: " + re.sub(r"\bAI\b", "Artificial Intelligence", text)

# Generate plagiarism report
def generate_plagiarism_report(content_draft, sources, threshold=0.3):
    matches = detect_plagiarism(content_draft, sources, threshold)

    report = f"**Plagiarism Detection Report**\n\n**Similarity Threshold**: {threshold * 100}%\n"
    report += "### Detailed Analysis\n\n"

    if not matches:
        report += "No significant similarities detected.\n"
    else:
        for match in matches:
            source = match['Source']
            score = match['Similarity Score']
            matched_text = match['Matched Text']
            paraphrase_suggestion = suggest_paraphrase(matched_text)

            report += f"- **{source}** - Similarity Score: {score}%\n"
            report += f"  Matched Text: \"{matched_text}\"\n"
            report += f"  Paraphrasing Suggestion: \"{paraphrase_suggestion}\"\n\n"

    return report

# User input for content draft
print("Enter your content draft:")
content_draft = input()

# User input for sources
print("\nEnter sources (type 'done' when finished):")
sources = {}
while True:
    source_name = input("Source Name (or 'done' to finish): ")
    if source_name.lower() == "done":
        break
    source_text = input("Source Text: ")
    sources[source_name] = source_text

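# User input for similarity threshold (a fraction, not a percentage)
threshold = float(input("\nEnter similarity threshold (e.g., 0.3 for 30%): "))
if not 0.0 <= threshold <= 1.0:
    raise ValueError("Threshold must be between 0 and 1 (e.g., 0.3 for 30%).")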

# Generate and print plagiarism report
report = generate_plagiarism_report(content_draft, sources, threshold)
print("\n" + report)
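
The `calculate_similarity` step above reduces to a cosine between two embedding vectors, which is then compared against the threshold. The sketch below shows that arithmetic in isolation with plain NumPy. The helper name `cosine`, the source labels, and the 3-dimensional toy vectors are illustrative assumptions standing in for real Sentence-BERT embeddings; they are not part of the notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return float(np.dot(u, v) / denom)

# Toy vectors standing in for embeddings (hypothetical values).
draft_vec = [1.0, 2.0, 2.0]
source_vecs = {
    "source A": [2.0, 4.0, 4.0],   # same direction as the draft
    "source B": [2.0, -1.0, 0.0],  # orthogonal to the draft
}

threshold = 0.3
scores = {name: cosine(draft_vec, vec) for name, vec in source_vecs.items()}
# Keep only the sources whose score reaches the threshold.
flagged = [name for name, score in scores.items() if score >= threshold]

print(scores)
print(flagged)
```

Because the score depends only on the angle between the vectors, "source A" scores 1.0 even though it is twice as long as the draft vector, while the orthogonal "source B" scores 0.0 and falls below the threshold.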



Enter your content draft:
Artificial intelligence has revolutionized diagnostic practices.

Enter sources (type 'done' when finished):
Source Name (or 'done' to finish): source 1
Source Text: AI has changed diagnostics by processing data quickly.
Source Name (or 'done' to finish): source 2
Source Text: Healthcare uses AI for patient care predictions.
Source Name (or 'done' to finish): done

Enter similarity threshold (e.g., 0.3 for 30%): 0.3

**Plagiarism Detection Report**

**Similarity Threshold**: 30.0%
### Detailed Analysis

- **source 1** - Similarity Score: 56.7%
  Matched Text: "AI has changed diagnostics by processing data quickly."
  Paraphrasing Suggestion: "Suggested paraphrase: Artificial Intelligence has changed diagnostics by processing data quickly."

- **source 2** - Similarity Score: 48.1%
  Matched Text: "Healthcare uses AI for patient care predictions."
  Paraphrasing Suggestion: "Suggested paraphrase: Healthcare uses Artificial Intelligence for patient care predicti