# String Theory: Unfolding Additional Dimensions Beyond the 3∆-Space Approach to Identifying Copy-Pasta, Rewording, and Translation in Information Manipulation
In an era where social media plays a pivotal role in shaping public opinion and discourse, the manipulation of information has emerged as a significant challenge. The proliferation of sophisticated techniques, such as copy-pasting, rewording, and translating messages, has enabled malicious actors to conduct large-scale information manipulation campaigns. These campaigns aim to influence opinions, spread disinformation, and evade detection by platform moderators. The original study by Richard et al. (2023) introduced the “3∆-space duplicate methodology,” a novel approach to identifying these manipulation techniques by quantifying semantic, grapheme, and language proximities within messages.

While the 3∆-space methodology has proven effective in detecting coordinated inauthentic behavior, there remain opportunities for enhancement. This paper seeks to build upon the foundational work of Richard et al., proposing refined techniques and advanced algorithms to improve the accuracy and efficiency of detecting manipulated textual content. By leveraging state-of-the-art machine learning models and innovative computational methods, we aim to address the limitations of the original approach and provide a more robust framework for identifying information manipulation on social media platforms.

Our enhanced methodology focuses on three primary areas of improvement: semantic proximity analysis, grapheme distance computation, and language differentiation. We introduce advanced sentence embeddings and refined distance metrics to better capture the nuances of textual manipulation. Additionally, we explore the integration of new tools and datasets to validate our approach, ensuring its applicability across diverse contexts and languages.

The significance of this research lies in its potential to advance the field of information integrity and security. By improving detection capabilities, we can better identify and mitigate the impact of coordinated inauthentic behavior, thereby safeguarding the integrity of public discourse. This paper not only contributes to the academic understanding of information manipulation techniques but also provides practical solutions for social media platforms and policymakers to enhance their defense mechanisms against disinformation campaigns.

In the following sections, we will detail the proposed improvements to the 3∆-space methodology, present our experimental validation using synthetic and real-world datasets, and discuss the implications of our findings for future research and practical applications.



## Limitations of the 3∆ Approach 

### The Twitter Transparency Dataset
While the Twitter dataset related to Venezuelan actors, released by Twitter Transparency in 2021, offers valuable insights into information manipulation tactics, it presents several limitations that could impact the robustness and generalizability of the findings. Firstly, the dataset's geographical focus on Venezuelan actors may limit the applicability of the results to other regions or contexts. Information manipulation techniques can vary significantly based on cultural, political, and social factors; therefore, the insights derived from this dataset might not fully capture the nuances of similar campaigns conducted in different geopolitical environments.

Secondly, the dataset is predominantly in Spanish (86.2%), with only marginal representation of other languages such as English (1.5%) and Portuguese (1.1%) (2312.17338v1). This language limitation poses a challenge for developing and validating detection methodologies that are effective across multiple languages. Since the detection of translated content is a key component of identifying information manipulation, the lack of linguistic diversity in the dataset may hinder the development of robust, cross-linguistic detection algorithms.

Finally, the dataset's temporal scope and length restrictions also pose limitations. The analysis focuses on tweets from January to June 2021, which might not provide a comprehensive view of longer-term manipulation strategies. Furthermore, Twitter's inherent character limit—140 characters for tweets prior to 2017 and 280 characters thereafter—imposes constraints on the length and complexity of the messages analyzed. Shorter text lengths may result in reduced semantic richness, making it challenging to accurately detect nuanced manipulations such as subtle rewordings or sophisticated translations.


### Platform-specific adjustments (out of scope)

The original paper, despite using a single platform's dataset, makes no effort to adapt its approach to the data unique to that platform, namely usernames, hashtags, and metadata about the posting users. While this is out of scope for this paper, any truly comprehensive quantitative approach to "identifying Copy-Pasta, Rewording, and Translation in Information Manipulation" should incorporate this information into its model.

###  ∆1: Grapheme distance

#### Using NLP data cleaning techniques  
The described approach, while comprehensive in its use of grapheme distance algorithms for detecting Copy-Pasta, has a few limitations that can be addressed using basic NLP data cleaning techniques. The primary criticisms concern the lack of text normalization and the noise in the data that preprocessing steps could mitigate.
- Presence of Stopwords: Stopwords are often irrelevant to the meaning of a text but can affect the grapheme distance calculations. Their presence can create false positives in the detection of Copy-Pasta and rewording.
- Lemmatization and Stemming: The current approach does not leverage lemmatization or stemming, which reduce words to a base or root form. This omission means that words like "running" and "ran" (both forms of "run") are treated as different, increasing grapheme distances unnecessarily.

#### Changing granularity 
Using letters as tokens rather than words in grapheme distance calculations has several limitations that can negatively impact the accuracy and effectiveness of detecting textual manipulations. 
- Treating letters as the unit of measure results in a very fine-grained analysis. This granularity can lead to an overemphasis on minor, superficial changes such as typos, small edits, or formatting differences, which may not significantly alter the overall meaning of the text. For instance, every typo, inserted symbol, or formatting change adds to the grapheme distance, so a message carrying several such edits can be misclassified as reworded even though its wording is unchanged.
- Letter-based analysis ignores the context provided by whole words. Words carry meaning and contextual information that individual letters do not. For example, the difference between "cat" and "bat" is significant in a word context but might be overly emphasized in a letter-based distance metric.
- Textual noise, such as typographical errors or inserted special characters, can have a large impact on letter-based grapheme distances. This sensitivity can obscure genuine content similarities or differences.
- Calculating grapheme distances based on individual letters can be computationally expensive, especially for longer texts. The number of comparisons needed increases significantly with text length, leading to inefficiencies and potentially longer processing times.




#### Semantic Clustering
The existing methodology makes no attempt to generate semantic clusters that could be used to detect suspect new terms.

###  ∆2: Semantic  distance

#### Comparing USE With BERT, SBERT, RoBERTa, GPT-3

###  ∆3: Translation

#### Comparing USE With BERT, SBERT, RoBERTa, GPT-3

###  ∆5: Classifiers


### Libraries, Constants, and preloading

In [4]:
import pandas as pd
import os
import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.tokenize import word_tokenize


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

file_path_naive = "C:\\Users\\benzo\\repo\\nlp-project\\data\\combined_dataset.csv"
file_path_detailed = "C:\\Users\\benzo\\repo\\nlp-project\\data\\combined_dataset_detailed.csv"
file_path_info_ops = "C:\\Users\\benzo\\repo\\nlp-project\\data\\combined_info_ops_dataset_slim.csv"
RAND_SEED = 417





[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\benzo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\benzo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\benzo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Improvements — an expanded dataset: Adding Context, Sources, Campaigns, and Languages

All of the datasets from June 2020 through 2021 are included, except the following, for the noted reasons:
- Armenia (February 2021) - low number of accounts and tweets
- Russia (September 2020) - low number of accounts
- Saudi Arabia (September 2020) - low number of accounts and tweets
- Turkey - overwhelmingly large number of posts (5 GB)

A comparison set of tweets is sourced from the following two datasets:
- English Tweets of 2022 - https://www.kaggle.com/datasets/amirhosseinnaghshzan/twitter-2022
- 1.6 million random tweets - https://www.kaggle.com/datasets/i191796majid/tweets


The datasets and the ETL process can be found in ./data_transform.py. To summarize the transformations: the combine_comparison_datasets function loads three separate CSV files, concatenates their tweet text columns into a single DataFrame, filters out short tweets, and saves the combined DataFrame to a specified output file. Similarly, the combine_info_ops_datasets function processes multiple CSV files in a specified directory, labels each tweet with a source derived from its filename, and filters out retweets and tweets with undefined or short text.
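The filtering and labeling steps described above can be sketched roughly as follows. This is a minimal standard-library illustration, not the contents of data_transform.py: the function name combine_info_ops, the is_retweet column, the in-memory row format, and the 10-character cutoff are all assumptions made for the example.

```python
from pathlib import Path

MIN_CHARS = 10  # assumed cutoff; the value used by the real ETL is not stated here


def combine_info_ops(rows_by_file, min_chars=MIN_CHARS):
    """Merge per-campaign rows, tagging each with a source taken from its filename.

    rows_by_file maps a (hypothetical) CSV filename to a list of row dicts with
    a 'tweet_text' string and a boolean 'is_retweet' flag.
    """
    combined = []
    for filename, rows in rows_by_file.items():
        source = Path(filename).stem  # "uganda.csv" -> "uganda"
        for row in rows:
            if row.get("is_retweet"):  # drop retweets
                continue
            text = (row.get("tweet_text") or "").strip()
            if len(text) < min_chars:  # drop undefined or short text
                continue
            combined.append({"tweet_text": text, "source": source})
    return combined


example = {
    "uganda.csv": [
        {"tweet_text": "a tweet long enough to keep", "is_retweet": False},
        {"tweet_text": "short", "is_retweet": False},
        {"tweet_text": "a retweet that is long enough", "is_retweet": True},
        {"tweet_text": None, "is_retweet": False},
    ]
}
kept = combine_info_ops(example)
print(kept)
```

Only the first row survives in this toy input: the others are dropped as too short, as a retweet, or as undefined text.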



## Improvements - do we need to bother with a quantitative approach?

Before proposing a "quantitative approach to detecting Copy-pasta, Rewording, and Translation on Social Media", *Unmasking information manipulation* does not address a fundamental premise: are information operations linguistically distinctive from authentic content?

To explore this question, we implement a series of machine learning classifiers that analyze tweets and test whether a distinctive linguistic pattern separates information operations from authentic content. The following code loads, vectorizes, and classifies a dataset of tweets to look for these patterns. It begins with a 'naive' approach that assumes all information operations have something in common (a single info_ops label), then repeats the experiment with tweets labeled by individual source.

In [9]:

def try_out_models(df):
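    # Sample the data to speed up training; comment out to use the full dataset.
    # Note: replace=True samples with replacement, so duplicated rows can land in
    # both the train and test splits; use replace=False to avoid that overlap.
    df = df.sample(frac=0.25, replace=True, random_state=RAND_SEED)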

    # Minimally preprocess the data 
    X = df['tweet_text']
    y = df['source']

    # Vectorize the text data
    vectorizer = TfidfVectorizer()
    X_vectorized = vectorizer.fit_transform(X)

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=RAND_SEED)

    # Define classifiers, tweaking hyperparameters appropriately
    classifiers = {
        "MultinomialNB": MultinomialNB(),
        "MultinomialNB_tightfit": MultinomialNB(alpha=0.1),
        "LogisticRegression_HighCon": LogisticRegression(max_iter=1000),
        "LogisticRegression_LowCon": LogisticRegression(max_iter=200, C=0.5),
        #"SVC_linear": SVC(kernel='linear'), # takes too long to train, may revisit later
        #"SVC_rbf": SVC(kernel='rbf'), # takes too long to train, may revisit later
        "RandomForest_Sparse": RandomForestClassifier(max_depth=10, random_state=RAND_SEED),
        "RandomForest_Deep": RandomForestClassifier(max_depth=100, random_state=RAND_SEED)

    }

    # Train and evaluate classifiers
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(f"Classifier: {name}")
        print(classification_report(y_test, y_pred))

df_naive = pd.read_csv(file_path_naive)
df_detailed = pd.read_csv(file_path_detailed)

# Load the naive dataset
print("Naive dataset")
try_out_models(df_naive)

# Repeat the process with detailed source samples
print("Detailed dataset")
try_out_models(df_detailed)

 



Naive dataset
Classifier: MultinomialNB
              precision    recall  f1-score   support

  comparison       0.94      1.00      0.97     94159
    info_ops       0.98      0.61      0.75     15407

    accuracy                           0.94    109566
   macro avg       0.96      0.80      0.86    109566
weighted avg       0.95      0.94      0.94    109566

Classifier: MultinomialNB_tightfit
              precision    recall  f1-score   support

  comparison       0.97      0.98      0.98     94159
    info_ops       0.89      0.84      0.87     15407

    accuracy                           0.96    109566
   macro avg       0.93      0.91      0.92    109566
weighted avg       0.96      0.96      0.96    109566

Classifier: LogisticRegression_HighCon
              precision    recall  f1-score   support

  comparison       0.96      1.00      0.98     94159
    info_ops       0.96      0.75      0.84     15407

    accuracy                           0.96    109566
   macro avg  



              precision    recall  f1-score   support

  comparison       0.86      1.00      0.92     94159
    info_ops       0.00      0.00      0.00     15407

    accuracy                           0.86    109566
   macro avg       0.43      0.50      0.46    109566
weighted avg       0.74      0.86      0.79    109566

Classifier: RandomForest_Deep
              precision    recall  f1-score   support

  comparison       0.90      1.00      0.95     94159
    info_ops       1.00      0.30      0.46     15407

    accuracy                           0.90    109566
   macro avg       0.95      0.65      0.70    109566
weighted avg       0.91      0.90      0.88    109566

Detailed dataset
Classifier: MultinomialNB




              precision    recall  f1-score   support

        CNCC       0.00      0.00      0.00       129
        CNHU       1.00      0.22      0.36       747
         GRU       0.98      0.09      0.16       578
         IRA       1.00      0.01      0.01      1133
         REA       0.00      0.00      0.00       329
         RNA       0.00      0.00      0.00       497
   Venezuela       0.00      0.00      0.00       119
       china       1.00      0.00      0.01       484
  comparison       0.90      1.00      0.95     93893
        iran       0.98      0.20      0.33      3485
      russia       1.00      0.00      0.00       991
      uganda       0.99      0.57      0.72      7148

    accuracy                           0.90    109533
   macro avg       0.65      0.17      0.21    109533
weighted avg       0.90      0.90      0.87    109533

Classifier: MultinomialNB_tightfit
              precision    recall  f1-score   support

        CNCC       1.00      0.09      0.16

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Classifier: LogisticRegression_LowCon
              precision    recall  f1-score   support

        CNCC       1.00      0.02      0.05       129
        CNHU       0.99      0.63      0.77       747
         GRU       0.91      0.81      0.86       578
         IRA       0.87      0.55      0.67      1133
         REA       0.87      0.35      0.50       329
         RNA       0.89      0.38      0.53       497
   Venezuela       1.00      0.22      0.36       119
       china       1.00      0.15      0.26       484
  comparison       0.95      1.00      0.97     93893
        iran       0.91      0.67      0.77      3485
      russia       0.93      0.26      0.40       991
      uganda       0.99      0.79      0.88      7148

    accuracy                           0.95    109533
   macro avg       0.94      0.48      0.58    109533
weighted avg       0.95      0.95      0.94    109533

Classifier: RandomForest_Sparse




              precision    recall  f1-score   support

        CNCC       0.00      0.00      0.00       129
        CNHU       0.00      0.00      0.00       747
         GRU       0.00      0.00      0.00       578
         IRA       0.00      0.00      0.00      1133
         REA       0.00      0.00      0.00       329
         RNA       0.00      0.00      0.00       497
   Venezuela       0.00      0.00      0.00       119
       china       0.00      0.00      0.00       484
  comparison       0.86      1.00      0.92     93893
        iran       0.00      0.00      0.00      3485
      russia       0.00      0.00      0.00       991
      uganda       0.00      0.00      0.00      7148

    accuracy                           0.86    109533
   macro avg       0.07      0.08      0.08    109533
weighted avg       0.73      0.86      0.79    109533

Classifier: RandomForest_Deep




              precision    recall  f1-score   support

        CNCC       0.00      0.00      0.00       129
        CNHU       1.00      0.21      0.34       747
         GRU       0.99      0.25      0.40       578
         IRA       1.00      0.01      0.02      1133
         REA       0.00      0.00      0.00       329
         RNA       1.00      0.01      0.03       497
   Venezuela       1.00      0.03      0.07       119
       china       0.00      0.00      0.00       484
  comparison       0.88      1.00      0.94     93893
        iran       1.00      0.18      0.31      3485
      russia       1.00      0.02      0.04       991
      uganda       1.00      0.32      0.49      7148

    accuracy                           0.89    109533
   macro avg       0.74      0.17      0.22    109533
weighted avg       0.89      0.89      0.85    109533





### Analysis
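One way to read the reports above: the ones in which every info_ops class scores 0.00 look like what a classifier would produce if it always predicted the majority class. The sketch below recomputes that baseline from the support counts printed in the naive-dataset reports (94159 comparison tweets and 15407 info_ops tweets). The function name majority_baseline is introduced here for illustration only.

```python
def majority_baseline(support):
    """Metrics for a classifier that always predicts the most frequent class.

    Returns (majority_label, accuracy, precision, recall, f1), where precision,
    recall, and f1 are those of the majority class itself.
    """
    total = sum(support.values())
    majority = max(support, key=support.get)
    precision = support[majority] / total  # every prediction is the majority label
    recall = 1.0                           # every majority item is predicted correctly
    accuracy = precision                   # correct predictions are exactly the majority items
    f1 = 2 * precision * recall / (precision + recall)
    return majority, accuracy, precision, recall, f1


# Support counts taken from the naive-dataset classification reports above
support = {"comparison": 94159, "info_ops": 15407}
label, accuracy, precision, recall, f1 = majority_baseline(support)
print(f"{label}: accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Rounded to two decimals, these values reproduce the 0.86 / 1.00 / 0.92 row reported for the comparison class. This suggests that accuracy alone says little on data this imbalanced, and that per-class recall for the information-operation classes is the more informative number when comparing the classifiers.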


## Improvements - Grapheme 

### Additional Preprocessing: NLP Cleaning
Stopword Removal:
Removing stopwords before calculating grapheme distances can reduce noise and focus the analysis on meaningful content. This can help in more accurately identifying Copy-Pasta and rewording by eliminating common but insignificant words.

Lemmatization and Stemming:
Applying lemmatization or stemming can normalize words to their root forms, reducing the variability in the text. For instance, a part-of-speech-aware lemmatizer maps both "running" and "ran" to "run" (a stemmer handles only the regular form "running"), making the grapheme distance more reflective of actual content changes rather than superficial differences.

Consistent Spacing:
Normalizing whitespace (e.g., converting multiple spaces to a single space) can prevent spacing differences from affecting the grapheme distance. This step can help in accurately categorizing texts that have been slightly altered in terms of spacing.

### Decreasing Granularity 
Reduced Noise Sensitivity:
Treating words as tokens can mitigate the impact of minor textual noise. Since words are larger units, small alterations (like a single-character change) will have a reduced effect on the overall distance calculation, leading to more stable and reliable results.

Improved Context Handling:
Words provide context that letters do not. Using words as tokens can help maintain the contextual integrity of the text, allowing for more accurate comparisons. For example, "breaking news" and "urgent news" share a contextual meaning that would be lost in a letter-based analysis.

Efficient Computation:
Word-based distance metrics can be more computationally efficient for longer texts. Instead of comparing every single letter, the algorithm can focus on comparing words, which can reduce the computational complexity and improve processing speed.
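To make the comparison between the two granularities concrete, the sketch below applies one edit-distance routine to a character sequence and to a word sequence. It is a minimal pure-Python illustration, not the implementation used in this notebook (which relies on the Levenshtein package); the helper names edit_distance and normalized_distance are assumptions introduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


s1 = "breaking news from the capital"
s2 = "urgent news from the capital"

char_level = normalized_distance(s1, s2)                  # letters as tokens
word_level = normalized_distance(s1.split(), s2.split())  # words as tokens; 1 of 5 differs

print(f"character-level: {char_level:.2f}")
print(f"word-level:      {word_level:.2f}")
```

The word-level comparison also works on far fewer tokens per message. Whether its normalized value comes out higher or lower than the character-level one depends on the length of the words that change, so it is worth checking both on a sample before fixing a threshold.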



In [None]:

# Preprocessing functions

# Language-specific stopwords,  ru, es, en, zh, ar, fr are supported by NLTK and have > 10K results 
stop_words = {
    'en': set(stopwords.words('english')),
    'ru': set(stopwords.words('russian')),
    'es': set(stopwords.words('spanish')),
    'zh': set(stopwords.words('chinese')),
    'fr': set(stopwords.words('french')),
    'ar': set(stopwords.words('arabic'))
    #'in': set(stopwords.words('indonesian'))
}

lemmatizer = WordNetLemmatizer()
stemmers = {
    'en': SnowballStemmer('english'),
    'ru': SnowballStemmer('russian'),
    'es': SnowballStemmer('spanish'),
    'zh': None,  # Chinese does not use stemming in the same way
    'fr': SnowballStemmer('french'),
    'ar': SnowballStemmer('arabic')
}

def lower_and_remove_nonlang(text):
    # Lowercasing
    text = text.lower()

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # Remove non-language characters (punctuation, emojis, etc.)
    text = re.sub(r'[^\w\s]', '', text)

    return text
    

# Preprocessing function
def preprocess_text(text, lang='en'):
    # Lowercasing and removing non-language characters
    text = lower_and_remove_nonlang(text)

    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words.get(lang, set())]
    
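    # Stemming/Lemmatization
    # Note: WordNetLemmatizer.lemmatize defaults to pos='n' (noun), so verb forms
    # such as "running" are returned unchanged unless a POS tag is supplied.
    if lang == 'en':
        tokens = [lemmatizer.lemmatize(word) for word in tokens]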
    else:
        stemmer = stemmers.get(lang)
        if stemmer:
            tokens = [stemmer.stem(word) for word in tokens]
    
    # Consistent spacing (normalize whitespace)
    text = ' '.join(tokens)
        
    return text


df_info_ops = pd.read_csv(file_path_info_ops)
df_info_ops['tweet_text_ppp'] = df_info_ops.apply(lambda row: lower_and_remove_nonlang(row['tweet_text']), axis=1)

# Apply full preprocessing (stopword removal, stemming/lemmatization) using each tweet's language
df_info_ops['tweet_text_fpp'] = df_info_ops.apply(lambda row: preprocess_text(row['tweet_text'], row['tweet_language']), axis=1)

# Save the preprocessed datasets
df_info_ops.to_csv(file_path_info_ops.replace('.csv', '_preprocessed.csv'), index=False)


### Assessing Grapheme distance — comparing methods
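As a small worked example of two of the set-based measures compared in this section, the sketch below computes a word-level Jaccard similarity and a word-bigram Dice similarity for one pair of sentences. It uses only the standard library; the function names and the choice to return 0.0 when there is nothing to compare are assumptions made for the illustration.

```python
def word_bigrams(text):
    """Set of adjacent word pairs in a whitespace-tokenized string."""
    tokens = text.split()
    return set(zip(tokens, tokens[1:]))


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the two word sets."""
    a, b = set(a.split()), set(b.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice_bigrams(a, b):
    """2 * |A ∩ B| / (|A| + |B|) over the two word-bigram sets."""
    a, b = word_bigrams(a), word_bigrams(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


s1 = "breaking news from the capital today"
s2 = "urgent news from the capital today"

# 5 shared words out of 7 distinct words
print(f"Jaccard: {jaccard(s1, s2):.2f}")
# 4 shared bigrams, 5 bigrams on each side
print(f"Dice:    {dice_bigrams(s1, s2):.2f}")
```

Single-word inputs have no bigrams, so the guard in dice_bigrams returns 0.0 for them instead of dividing by zero.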

In [5]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import  euclidean
from scipy.stats import wasserstein_distance
from Levenshtein import distance as levenshtein_distance
from sklearn.feature_extraction.text import CountVectorizer
from nltk import ngrams
from itertools import combinations


# Define similarity measures
def jaccard_similarity(str1, str2):
    a = set(str1.split())
    b = set(str2.split())
    intersection = len(a.intersection(b))
    union = len(a.union(b))
    return intersection / union

def dice_similarity(str1, str2):
    a = set(ngrams(str1.split(), 2))
    b = set(ngrams(str2.split(), 2))
    intersection = len(a.intersection(b))
    return 2 * intersection / (len(a) + len(b))



def levenshtein_similarity(str1, str2):
    max_len = max(len(str1), len(str2))
    return 1 - (levenshtein_distance(str1, str2) / max_len)

def compare_distances(df):
    threshold = 0.50 # Using the threshold suggested by the paper
    similar_pairs = []

    for (index1, row1), (index2, row2) in combinations(df.iterrows(), 2):
        if index1 == index2:
            continue
        tweet1 = row1['tweet_text_fpp']
        tweet2 = row2['tweet_text_fpp']
        tweet1_pp = row1['tweet_text_ppp']
        tweet2_pp = row2['tweet_text_ppp']

        if len(tweet1) == 0 or len(tweet2) == 0 or len(tweet1_pp) == 0 or len(tweet2_pp) == 0:
            continue
        try:
            jaccard_sim = jaccard_similarity(tweet1, tweet2)
            dice_sim = dice_similarity(tweet1, tweet2)

            levenshtein_sim = levenshtein_similarity(tweet1, tweet2)
            jaccard_sim_pp = jaccard_similarity(tweet1_pp, tweet2_pp)
            dice_sim_pp = dice_similarity(tweet1_pp, tweet2_pp)

            levenshtein_sim_pp = levenshtein_similarity(tweet1_pp, tweet2_pp)
        except Exception as e:
            #print(f"Error comparing {tweet1} and {tweet2}: {e}")
            continue
        

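        # Keep the pair if either Levenshtein similarity exceeds the threshold,
        # excluding exact duplicates after full preprocessing (similarity of 1.0).
        # The parentheses matter: `and` binds tighter than `or` in Python.

        if (levenshtein_sim > threshold or levenshtein_sim_pp > threshold) and levenshtein_sim < 1.0: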
            
            similar_pairs.append({
                'index1': index1,
                'index2': index2,
                'og_tweet1': row1['tweet_text'],
                'og_tweet2': row2['tweet_text'],
                'tweet1': tweet1,
                'tweet2': tweet2,
                'tweet1_pp': tweet1_pp,
                'tweet2_pp': tweet2_pp,
                'jaccard': jaccard_sim,
                'dice': dice_sim,
                #'wasserstein': wasserstein_sim,
                'levenshtein': levenshtein_sim,
                'jaccard_pp': jaccard_sim_pp,
                'dice_pp': dice_sim_pp,
                #'wasserstein_pp': wasserstein_sim_pp,
                'levenshtein_pp': levenshtein_sim_pp
            })


    return similar_pairs

#load the preprocessed dataset
df_info_ops = pd.read_csv(file_path_info_ops.replace('.csv', '_preprocessed.csv'))

#filter df_info_ops to rows whose original and preprocessed text are all longer than 10 characters
df_info_ops = df_info_ops[df_info_ops['tweet_text'].str.len() > 10]
df_info_ops = df_info_ops[df_info_ops['tweet_text_fpp'].str.len() > 10]

df_info_ops = df_info_ops[df_info_ops['tweet_text_ppp'].str.len() > 10]
    
# grab the first 1000 rows from each 

#filter df_info_ops to just english
df_info_ops_en = df_info_ops[df_info_ops['tweet_language'] == 'en']

#split info_ops into a list of dfs by the source column and grab the first 1000 rows of each
df_info_ops_en_list = [df_info_ops_en[df_info_ops_en['source'] == source].head(1000) for source in df_info_ops_en['source'].unique()]

#compare distances for each source, for both the original and preprocessed text
similars = []
for sampled_df in df_info_ops_en_list:
    #print the source
    print(sampled_df['source'].iloc[0])
    similars.append( compare_distances(sampled_df))

#save the results to a json
import json
with open('similar_tweets.json', 'w') as f:
    json.dump(similars, f, indent=4)

#load the results from the json
# with open('similar_tweets.json', 'r') as f:
#     similars = json.load(f)

#analyze average similarity scores
def analyze_similarities(similars):
    for source_similars in similars:
        if len(source_similars) == 0:
            continue
        jaccard = np.mean([sim['jaccard'] for sim in source_similars])
        dice = np.mean([sim['dice'] for sim in source_similars])
        #wasserstein = np.mean([sim['wasserstein'] for sim in source_similars])
        levenshtein = np.mean([sim['levenshtein'] for sim in source_similars])
        jaccard_pp = np.mean([sim['jaccard_pp'] for sim in source_similars])
        dice_pp = np.mean([sim['dice_pp'] for sim in source_similars])
        #wasserstein_pp = np.mean([sim['wasserstein_pp'] for sim in source_similars])
        levenshtein_pp = np.mean([sim['levenshtein_pp'] for sim in source_similars])
        print(f"Jaccard (fully preprocessed): {jaccard:.2f}")
        print(f"Dice (fully preprocessed): {dice:.2f}")
        #print(f"Wasserstein: {wasserstein:.2f}")
        print(f"Levenshtein(fully preprocessed) : {levenshtein:.2f}")
        print(f"Jaccard (partially preprocessed): {jaccard_pp:.2f}")
        print(f"Dice (partially preprocessed): {dice_pp:.2f}")
        #print(f"Wasserstein (preprocessed): {wasserstein_pp:.2f}")
        print(f"Levenshtein (partially preprocessed): {levenshtein_pp:.2f}")
        print()
    #print the overall average similarity scores
    jaccard = np.mean([sim['jaccard'] for source_similars in similars for sim in source_similars])
    dice = np.mean([sim['dice'] for source_similars in similars for sim in source_similars])
    #wasserstein = np.mean([sim['wasserstein'] for source_similars in similars for sim in source_similars])
    levenshtein = np.mean([sim['levenshtein'] for source_similars in similars for sim in source_similars])
    jaccard_pp = np.mean([sim['jaccard_pp'] for source_similars in similars for sim in source_similars])
    dice_pp = np.mean([sim['dice_pp'] for source_similars in similars for sim in source_similars])
    #wasserstein_pp = np.mean([sim['wasserstein_pp'] for source_similars in similars for sim in source_similars])
    levenshtein_pp = np.mean([sim['levenshtein_pp'] for source_similars in similars for sim in source_similars])
    print("Overall averages")
    print(f"Jaccard (fully preprocessed): {jaccard:.2f}")
    print(f"Dice (fully preprocessed): {dice:.2f}")
    #print(f"Wasserstein: {wasserstein:.2f}")
    print(f"Levenshtein (fully preprocessed): {levenshtein:.2f}")
    print(f"Jaccard (partial preprocessed): {jaccard_pp:.2f}")
    print(f"Dice (partial preprocessed): {dice_pp:.2f}")
    #print(f"Wasserstein (preprocessed): {wasserstein_pp:.2f}")
    print(f"Levenshtein (partial preprocessed): {levenshtein_pp:.2f}")
    

#analyze the similarities
analyze_similarities(similars)

china
CNCC
CNHU
GRU
iran
IRA
MX
REA
RNA
russia
Tanzania
thailand
uganda
Venezuela
Jaccard (fully preprocessed): 0.06
Dice (fully preprocessed): 0.03
Levenshtein(fully preprocessed) : 0.31
Jaccard (partially preprocessed): 0.08
Dice (partially preprocessed): 0.03
Levenshtein (partially preprocessed): 0.33

Jaccard (fully preprocessed): 0.16
Dice (fully preprocessed): 0.09
Levenshtein(fully preprocessed) : 0.40
Jaccard (partially preprocessed): 0.16
Dice (partially preprocessed): 0.10
Levenshtein (partially preprocessed): 0.40

Jaccard (fully preprocessed): 0.03
Dice (fully preprocessed): 0.01
Levenshtein(fully preprocessed) : 0.30
Jaccard (partially preprocessed): 0.06
Dice (partially preprocessed): 0.01
Levenshtein (partially preprocessed): 0.32

Jaccard (fully preprocessed): 0.06
Dice (fully preprocessed): 0.01
Levenshtein(fully preprocessed) : 0.29
Jaccard (partially preprocessed): 0.12
Dice (partially preprocessed): 0.04
Levenshtein (partially preprocessed): 0.32

Jaccard (fully pre

### Benchmarking USE, BERT, SBERT, RoBERTa, for shortform translated content

Rather than generating a synthetic dataset, as the original paper does, we use another source of translated short-form content to benchmark models for use in the translation component.
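The cells that follow score each embedding method by top-1 retrieval: for every original text, find the most similar rewritten text and count a hit when it is the text's own counterpart. The sketch below shows that metric on toy vectors in pure Python. The vectors, the function names cosine and top1_accuracy, and the assumption that counterparts share an index are all illustrative; the notebook itself uses scikit-learn's NearestNeighbors on real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top1_accuracy(originals, rewrites):
    """Share of originals whose most similar rewrite sits at the same index."""
    hits = 0
    for i, original in enumerate(originals):
        best = max(range(len(rewrites)), key=lambda j: cosine(original, rewrites[j]))
        hits += best == i
    return hits / len(originals)


# Toy embeddings: the third rewrite has drifted away from its original
originals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rewrites = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.9, 0.2, 0.1]]

accuracy = top1_accuracy(originals, rewrites)
print(f"top-1 retrieval accuracy: {accuracy:.2f}")
```

Because the metric depends only on the ranking of similarities, the same evaluation loop can be reused unchanged when the TF-IDF vectors are swapped for sentence embeddings.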

In [None]:
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Load the dataset
df = pd.read_csv('C:\\Users\\benzo\\repo\\nlp-project\\data\\gemma\\70k_gemma_template_built.csv')
df = df[['original_text', 'generated_text']]  # Filter out unnecessary columns


# Generate unique IDs
df['id'] = np.arange(len(df))

# Text Embedding
vectorizer = TfidfVectorizer(stop_words='english')
all_text = pd.concat([df['original_text'], df['generated_text']])
tfidf_matrix = vectorizer.fit_transform(all_text)
tfidf_matrix_normalized = normalize(tfidf_matrix)


# Split the matrices
midpoint = len(df)
original_text_matrix = tfidf_matrix_normalized[:midpoint]
generated_text_matrix = tfidf_matrix_normalized[midpoint:]

# Apply KNN to find the nearest rewritten text for each original text
knn = NearestNeighbors(n_neighbors=1, metric='cosine')
knn.fit(generated_text_matrix)

# Query the model to find the nearest neighbors
distances, indices = knn.kneighbors(original_text_matrix)

# Map indices to IDs
df['nearest_rewritten_id'] = df.iloc[indices.flatten()]['id'].values
df['nearest_rewritten_text'] = df.iloc[indices.flatten()]['generated_text'].values

# Output results -- how many nearest neighbors are correctly identified as having the same id
correct = np.sum(df['id'] == df['nearest_rewritten_id'])
total = len(df)
print(f"TF-IDF - Correctly identified {correct} out of {total} nearest neighbors, {correct/total*100:.2f}%")

# redo with bert embeddings


Correctly identified 65233 out of 69487 nearest neighbors.
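The cell above scores a model by whether the single nearest neighbour of each original text is its own rewrite. A minimal sketch of the same check generalised to the top k neighbours is given below, using only NumPy on toy vectors. The function name `recall_at_k` and the toy arrays are illustrative assumptions, not part of the original notebook, and the sketch assumes row i of the queries pairs with row i of the candidates and that no row is all zeros.

```python
import numpy as np

def recall_at_k(query_vecs, candidate_vecs, k=1):
    """Fraction of queries whose same-index candidate is among the
    k most cosine-similar candidates (hypothetical helper)."""
    # L2-normalise rows so the dot product equals cosine similarity
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = q @ c.T
    # Indices of the k most similar candidates for each query
    top_k = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(len(q))[:, None]
    return float((top_k == targets).any(axis=1).mean())

# Toy stand-ins for embedded originals and rewrites (assumed data)
originals = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rewrites = np.array([[0.9, 0.1], [1.0, 0.9], [0.1, 1.0]])

for k in (1, 2, 3):
    print(f"recall@{k}: {recall_at_k(originals, rewrites, k):.2f}")
```

With k=1 this reduces to the accuracy figure printed by the cell above; larger k shows how often the correct rewrite is a near miss rather than absent.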


In [30]:
# REDO Above with BERT embeddings
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from transformers import AutoTokenizer, AutoModel
import torch
# Generate unique IDs
df['id'] = np.arange(len(df))

# Work on a small sample; .copy() avoids pandas chained-assignment warnings below
df = df.head(100).copy()

# Load BERT model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def get_bert_embeddings(texts):
    """Return one mean-pooled BERT vector per text as a 2D array."""
    embeddings = []
    # enumerate() supplies the running index; iterating a Series yields only the strings
    for idx, text in enumerate(texts):
        with torch.no_grad():
            print(f"Processing text {idx}")
            inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
            outputs = model(**inputs)
            # Take the mean of all token embeddings to get a single vector per text
            mean_embedding = outputs.last_hidden_state.mean(1)
            embeddings.append(mean_embedding.squeeze().numpy())
    return np.vstack(embeddings)

# Generate embeddings
all_text = pd.concat([df['original_text'], df['generated_text']])
all_embeddings = get_bert_embeddings(all_text)

# Split the embeddings
midpoint = len(df)
original_embeddings = all_embeddings[:midpoint]
generated_embeddings = all_embeddings[midpoint:]

# Apply KNN
knn = NearestNeighbors(n_neighbors=1, metric='cosine')
knn.fit(generated_embeddings)

# Query the model
distances, indices = knn.kneighbors(original_embeddings)

# Map indices to IDs
df['nearest_rewritten_id'] = df.iloc[indices.flatten()]['id'].values
df['nearest_rewritten_text'] = df.iloc[indices.flatten()]['generated_text'].values

# Evaluate results
correct = np.sum(df['id'] == df['nearest_rewritten_id'])
total = len(df)
print(f"BERT - Correctly identified {correct} out of {total} nearest neighbors, {correct/total*100}%")



KeyboardInterrupt: 
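The BERT cell above encodes one text at a time, so a plain `.mean(1)` over tokens is safe. If the texts were instead encoded in padded batches to speed the run up, padding positions would be averaged in as well. The sketch below shows mask-aware mean pooling in NumPy on stand-in arrays; `masked_mean_pool` and the toy values are assumptions for illustration, and in practice the inputs would be the model's `last_hidden_state` and the tokenizer's `attention_mask` converted to arrays.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per text, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch: the first text has one padded position holding junk values
tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(tokens, mask)
print(pooled)
```

The junk values in the padded slot do not affect the first row's pooled vector, which is the property that matters once texts of different lengths share a batch.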

## Improvements: testing predictivity

The point-in-time nature of the original 3∆ datasets allows prediction validity to be tested only within those datasets.

## Improvements: image p-hashes

## Improvements: Leveraging Retrieval-Augmented Generation (RAG) for Detecting Textual Manipulation
The detection of textual manipulation on social media is a complex and evolving challenge. Traditional methods, such as the 3∆-space duplicate methodology, have made significant strides in identifying manipulated content through semantic, grapheme, and language proximity analysis. However, these methods have limitations, particularly in handling diverse languages, short text length, and evolving manipulation techniques. To address these limitations, we propose an alternative methodology leveraging Retrieval-Augmented Generation (RAG), a state-of-the-art technique that combines the strengths of retrieval-based and generation-based models.

## Suggestions for future research

Integrate context-aware p-hashes or other near-duplicate image detection.
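As a rough illustration of what near-duplicate image detection by hashing involves, the sketch below implements an average hash, a simpler relative of the DCT-based p-hash and not context-aware. It operates on a grayscale image given as a 2D NumPy array at least `hash_size` pixels on each side. The function names and the random test images are assumptions for illustration only.

```python
import numpy as np

def average_hash(gray_image, hash_size=8):
    """Return a boolean hash of hash_size * hash_size bits.

    The image is cropped to a multiple of hash_size, block-averaged down
    to hash_size x hash_size, and each cell is compared with the mean.
    """
    h, w = gray_image.shape
    h_crop, w_crop = h - h % hash_size, w - w % hash_size
    cropped = gray_image[:h_crop, :w_crop].astype(float)
    blocks = cropped.reshape(
        hash_size, h_crop // hash_size, hash_size, w_crop // hash_size
    )
    small = blocks.mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def hamming_distance(hash_a, hash_b):
    """Number of bit positions where the two hashes differ."""
    return int(np.count_nonzero(hash_a != hash_b))

# Synthetic stand-ins for real images (assumed data)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64)).astype(float)
brighter = image + 10.0  # same picture with a uniform brightness shift
unrelated = rng.integers(0, 256, size=(64, 64)).astype(float)

print(hamming_distance(average_hash(image), average_hash(brighter)))
print(hamming_distance(average_hash(image), average_hash(unrelated)))
```

A small Hamming distance between two hashes suggests the images are near duplicates; the threshold would need to be tuned on real data.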

Citations:
https://aclanthology.org/2023.findings-acl.426.pdf

