This notebook contains the code to generate essays with a Cohere LLM, sample and reshape a Hugging Face dataset, and calculate the AI-ratio and perplexity features.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("train_essays.csv")
prompts_df = pd.read_csv("train_prompts.csv")

In [None]:
filtered_human_texts = df[df['generated'] == 0]
sample_data = filtered_human_texts.sample(n=1000, random_state=43)

Store the reference source text that is included in each prompt when generating new essays.

In [None]:
source_0 = prompts_df.iloc[0]['source_text']

In [None]:
prompts = [
    "Write a persuasive essay explaining the benefits of reducing car usage for both individuals and society. Base your essay on evidence and information found in the passage set.",
    "Compose an informative essay to highlight the environmental, economic, and social advantages of limiting car usage. Use information from the provided passages to support your points.",
    "Draft an essay to educate readers about the importance of decreasing car dependency. Incorporate evidence from the passage set, ensuring a balanced use of multiple sources.",
    "Write an essay to discuss how reducing car usage can improve urban living conditions. Use data and insights from the passage set to back your claims.",
    "Create an analytical essay on the reasons why limiting car use is beneficial for the environment and public health. Use the passage set to gather evidence and structure your essay with clear arguments.",
    "Develop an essay explaining the role of reduced car usage in combating climate change and improving air quality. Rely on evidence from the passage set.",
    "Write an essay to explain how decreasing car reliance can lead to better urban planning and increased efficiency. Use multiple sources from the passage set to provide evidence.",
    "Construct an essay that examines the economic advantages of reducing car usage, such as cost savings and increased public transportation investment. Base your response on evidence from the passage set.",
    "Write an essay to explain how limiting car usage contributes to sustainable development and resource conservation. Use the passage set to gather evidence.",
    "Create an essay discussing how reducing car usage can help alleviate traffic congestion and improve public health. Support your points with information from the passage set.",
    "Compose an essay explaining the ways limiting car usage can benefit local economies and create more sustainable communities. Rely on information from the provided passages.",
    "Write an essay analyzing how reducing car dependency aligns with global efforts to combat climate change. Use the passage set to provide supporting evidence.",
    "Develop an essay to discuss the challenges and solutions related to limiting car usage, supported by evidence from the passage set.",
    "Construct an essay that highlights the advantages of alternative transportation methods when car usage is limited. Use evidence from the passage set to build your case.",
    "Write an essay exploring the connection between reduced car use and improved quality of life in urban areas. Use multiple sources from the passage set to support your points.",
    "Compose an essay explaining the significance of reducing car usage for urban infrastructure and community development. Use multiple sources from the passage set.",
    "Create an essay describing the potential health benefits of reducing car usage, such as lower pollution exposure and increased physical activity. Base your essay on ideas and information from the passage set.",
    "Develop an essay to explain how limiting car usage contributes to sustainable cities and addresses traffic congestion. Use information from the passage set.",
    "Write an essay describing the benefits of reducing car dependency for future generations, emphasizing long-term advantages. Use evidence from the provided passages.",
    "Compose an essay to explain the social and environmental impacts of reducing car usage, using multiple sources from the passage set."
]


In [None]:
temperatures = [0.8]

In [None]:
import csv

def save_responses_to_csv(responses, filename="generated_responses.csv"):
    """Write each generated response as one CSV row labeled generated=1."""
    header = ["text", "generated"]
    with open(filename, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(header)
        for response in responses:
            writer.writerow([response, 1])
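
Essays usually contain commas, quotes, and line breaks, so it is worth checking that the CSV format preserves them. The sketch below is a minimal in-memory round trip using only the standard library; `rows_to_csv_text` and `csv_text_to_rows` are hypothetical helper names, not part of the notebook.

```python
import csv
import io

def rows_to_csv_text(responses, label=1):
    """Serialize responses to CSV text with a 'text,generated' header."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["text", "generated"])
    for response in responses:
        writer.writerow([response, label])
    return buffer.getvalue()

def csv_text_to_rows(csv_text):
    """Parse the CSV text back into (text, generated) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return [(row["text"], int(row["generated"])) for row in reader]

# Sample essays with a comma, an embedded newline, and double quotes
essays = ["Cars, buses, and trains.\nSecond paragraph.", 'He said "walk".']
round_tripped = csv_text_to_rows(rows_to_csv_text(essays))
print(round_tripped == [(essay, 1) for essay in essays])
```

The `csv` module quotes fields that contain delimiters or newlines, which is why the essays come back unchanged.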

# Cohere

In [None]:
!pip install cohere
import cohere



In [None]:
def generate_responses_cohere(prompts, source, temperatures, api_key, num_responses=100):
    """Generate essays for every prompt/temperature pair, about num_responses in total."""
    responses = []
    co = cohere.Client(api_key)
    # With 20 prompts, 1 temperature and num_responses=100 this is 5 per combination
    responses_per_combination = max(1, num_responses // (len(prompts) * len(temperatures)))

    for prompt in prompts:
        for temp in temperatures:
            for _ in range(responses_per_combination):
                formatted_prompt = f"{prompt} \n Use the Source text: {source}"
                response = co.generate(
                    model='command',
                    prompt=formatted_prompt,
                    max_tokens=150,
                    temperature=temp,
                    k=50,
                    # Generation stops at the first newline, so each response is a single paragraph
                    stop_sequences=["\n"]
                )
                generated_text = response.generations[0].text.strip()
                responses.append(generated_text)
    return responses

In [None]:
temperatures = [0.8]
api_key = ""
responses = generate_responses_cohere(prompts, source_0, temperatures, api_key)
save_responses_to_csv(responses, "cohere-generated.csv")
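
Before spending API quota, it can help to count how many calls the nested loops will make. The following is a dry-run sketch that only enumerates the (prompt, temperature) pairs and never contacts Cohere; `plan_generation_jobs` and the placeholder prompts are hypothetical.

```python
from itertools import product

def plan_generation_jobs(prompts, temperatures, responses_per_combination):
    """Return one (prompt, temperature) pair per API call the loops would make."""
    return [
        (prompt, temp)
        for prompt, temp in product(prompts, temperatures)
        for _ in range(responses_per_combination)
    ]

# Placeholder prompts standing in for the 20 essay prompts defined earlier
example_prompts = [f"prompt {i}" for i in range(20)]
jobs = plan_generation_jobs(example_prompts, [0.8], 5)
print(len(jobs))
```

With 20 prompts, one temperature, and 5 responses per combination, the plan contains 100 calls.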

Generating the AI ratio feature

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
import re
from transformers import AutoTokenizer, AutoModel
import torch

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

file_path = 'final_huggingface_dataset.csv'
df = pd.read_csv(file_path)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [None]:
len(df)

23786

In [None]:
def get_counts(column):
  """Return (word, count) pairs for the AI or human rows, sorted by descending count.

  `column` is the value of `generated` that selects the rows to count.
  """
  filtered_df = df[df['generated'] == column]
  vectorizer = CountVectorizer(stop_words='english')
  X = vectorizer.fit_transform(filtered_df['text'])
  word_counts = X.sum(axis=0).A1
  words = vectorizer.get_feature_names_out()
  word_freq = dict(zip(words, word_counts))
  sorted_word_freq = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
  return sorted_word_freq
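
To make the counting step concrete, here is a plain-Python sketch of the same idea on three toy sentences. It is an approximation, not scikit-learn's implementation: `count_words` is a hypothetical helper, the stop-word set is a tiny illustrative list, and the regular expression mirrors `CountVectorizer`'s default of tokens with two or more word characters.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, much shorter than scikit-learn's)
STOP_WORDS = {"the", "a", "is", "and", "to", "of"}

def count_words(texts):
    """Count non-stop-word tokens across texts, sorted by descending count."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\b\w\w+\b", text.lower())
        counts.update(token for token in tokens if token not in STOP_WORDS)
    # Break ties alphabetically so the order is deterministic
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

toy_texts = ["The car is fast", "A car and a bus", "The bus is slow and the car is loud"]
top_words = count_words(toy_texts)
print(top_words[:2])
```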

In [None]:
sorted_word_freq = get_counts('1')
sorted_word_freq1 = get_counts('0')

In [None]:
print(sorted_word_freq[:10])
print(sorted_word_freq1[:10])

[('students', 9931), ('like', 5697), ('school', 4362), ('people', 4212), ('help', 4125), ('time', 3715), ('important', 3617), ('make', 3180), ('car', 2821), ('learning', 2742)]
[('people', 17471), ('students', 16249), ('school', 10229), ('car', 9603), ('cars', 9494), ('electoral', 8492), ('college', 7589), ('like', 7422), ('just', 7173), ('vote', 7173)]


In [None]:
# Ratio of each word's frequency in the AI data to its frequency in the human data.
# Words that never appear in the human data get a fixed ratio of 100.
human_freq = dict(sorted_word_freq1)
result = [(word, freq / human_freq[word]) if word in human_freq else (word, 100) for word, freq in sorted_word_freq]

# Keep only the words whose ratio is at least 1.5
filtered_result = [item for item in result if item[1] >= 1.5]

# Sort the words in descending order of their ratio
sorted_result = sorted(filtered_result, key=lambda x: x[1], reverse=True)
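
As a worked example of the ratio and threshold, the sketch below applies the same rule to made-up counts. `frequency_ratios` and the toy dictionaries are hypothetical; the fallback of 100 and the 1.5 cutoff are the values used in this notebook.

```python
def frequency_ratios(ai_counts, human_counts, missing_ratio=100.0, threshold=1.5):
    """Return (word, ratio) pairs with ratio >= threshold, highest ratio first."""
    ratios = {}
    for word, ai_freq in ai_counts.items():
        human_freq = human_counts.get(word)
        # Words absent from the human data get the fixed fallback ratio
        ratios[word] = ai_freq / human_freq if human_freq else missing_ratio
    kept = [(word, ratio) for word, ratio in ratios.items() if ratio >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Made-up counts for illustration only
ai_counts = {"delve": 30, "students": 90, "car": 20}
human_counts = {"students": 60, "car": 80}
ai_leaning_words = frequency_ratios(ai_counts, human_counts)
print(ai_leaning_words)
```

Here "delve" is kept with the fallback ratio, "students" is kept at exactly 1.5, and "car" (0.25) is dropped.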

In [None]:
# Getting a list of the words
result = [i for i, score in sorted_result]

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
# Return True if spaCy tags the word as a base-form verb (VB), adverb (RB), or adjective (JJ)
def is_ai_word(word):
    ai_related_tags = ['VB', 'RB', 'JJ']
    doc = nlp(word)
    return doc[0].tag_ in ai_related_tags

In [None]:
# Remove the words tagged as verbs, adverbs, or adjectives from the word list
result = [i for i in result if not is_ai_word(i)]

In [None]:
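# Calculating the AI-ratio: the fraction of a text's tokens that appear in the AI word list

def calculate_ai_ratio(text, result):
    ai_words = set(result)  # set membership checks are fast
    tokens = re.findall(r'\b\w+\b', text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty or punctuation-only text
    ai_count = sum(1 for token in tokens if token in ai_words)
    return ai_count / len(tokens)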
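
The AI ratio of a text is the number of its tokens that appear in the AI word list divided by its total number of tokens. A minimal standalone sketch on a toy sentence follows; `ai_ratio` and the two-word list are hypothetical stand-ins for the notebook's function and its much longer word list.

```python
import re

def ai_ratio(text, ai_words):
    """Fraction of word tokens in text that belong to ai_words."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    if not tokens:
        return 0.0  # empty or punctuation-only text
    return sum(token in ai_words for token in tokens) / len(tokens)

# Illustrative word list, not the one derived from the dataset
toy_ai_words = {"delve", "crucial"}
toy_ratio = ai_ratio("It is crucial to delve deeper", toy_ai_words)
print(round(toy_ratio, 3))
```

Two of the six tokens are in the list, so the ratio is one third.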


In [None]:
# Adding the ratio to dataset and saving it
df['ai_ratio'] = df['text'].apply(lambda x: calculate_ai_ratio(x, result))
df.to_csv("new_dataset.csv", index=False)

Calculating the AI ratio for the test dataset

In [None]:
df2 = pd.read_csv('test_dataset.csv')
df2.head()

Unnamed: 0,text,generated,word_count,vocabulary_richness,gunning_fog,smog_index,polarity,subjectivity,noun_count,verb_count,adverb_count,noun_density,verb_density,adjective_density,adverb_density,perplexity
0,"With this project, participants can also learn...",1,81,0.728395,12.07,11.2,0.156944,0.315873,28,11,3,0.321839,0.126437,0.114943,0.034483,26.170715
1,Longer school days would give them less time t...,1,256,0.53125,11.65,11.7,0.073198,0.390412,63,42,19,0.226619,0.151079,0.111511,0.068345,17.081923
2,This could include having to independently pay...,1,290,0.593103,12.69,13.7,0.103705,0.519151,69,53,18,0.213622,0.164087,0.099071,0.055728,13.664727
3,Some poeple think the first impressions they c...,0,256,0.285156,6.34,11.0,-0.027778,0.54321,42,60,13,0.15,0.214286,0.142857,0.046429,16.209841
4,Also i think that the principal should decide ...,0,118,0.550847,8.12,10.6,0.155952,0.72619,33,22,1,0.253846,0.169231,0.069231,0.007692,33.50761


In [None]:
df2['ai_ratio'] = df2['text'].apply(lambda x: calculate_ai_ratio(x, result))

In [None]:
# Note: this reuses the filename of the training export above and overwrites that file
df2.to_csv("new_dataset.csv", index=False)

Creating New Dataset from HuggingFace

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl 

In [None]:
from datasets import load_dataset

ds = load_dataset("dmitva/human_ai_generated_text")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/1.04k [00:00<?, ?B/s]

model_training_dataset.csv:   0%|          | 0.00/3.93G [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000000 [00:00<?, ? examples/s]

In [None]:
import pandas as pd
# Convert the train split directly instead of iterating over one million rows
df = ds['train'].to_pandas()

In [None]:
df.head()

In [None]:
df = df.sample(n=50000, random_state=42)

In [None]:
new_df = pd.DataFrame(columns=['essay_id', 'text', 'source', 'generated'])

In [None]:
import pandas as pd
df_transformed = pd.DataFrame({
    'essay_id': df['id'].repeat(2).values,
    'text': df[['human_text', 'ai_text']].stack().values,
    'generated': [0, 1] * len(df)
}).reset_index(drop=True)
print(df_transformed)

                                   essay_id  \
0      28354571-51ed-4e57-8b62-473b686a3346   
1      28354571-51ed-4e57-8b62-473b686a3346   
2      6321ea61-7264-4060-811c-cea1286aa84f   
3      6321ea61-7264-4060-811c-cea1286aa84f   
4      6f1591c9-6cdf-4012-93f0-dc41dbd1055d   
...                                     ...   
99995  029b8b13-217d-400e-9450-7058416e848f   
99996  78d30c4b-46d2-4082-b115-3b1cd85b704e   
99997  78d30c4b-46d2-4082-b115-3b1cd85b704e   
99998  59bad321-a5fe-4cae-bc73-172e9166f6df   
99999  59bad321-a5fe-4cae-bc73-172e9166f6df   

                                                    text  generated  
0      Even though lots of students like to work at h...          0  
1      \n\nUltimately, the decision as to whether stu...          1  
2      Fresh air we're use to being home all day and ...          0  
3      For physical activity, going to the park can b...          1  
4      Many people believe that self-esteem comes fro...          0  
...            

In [None]:
df_transformed['source'] = 'huggingface'

In [None]:
df_transformed.head()

Unnamed: 0,essay_id,text,generated,source
0,e553a352-f066-4518-8431-1e59ca621e16,For those reasons I think class in arts should...,0,huggingface
1,e553a352-f066-4518-8431-1e59ca621e16,"\n\nFrom my own experience, taking classes in ...",1,huggingface
2,8a2d32c2-4a67-4e5c-9f3f-8dfd42e88fca,"But, they their is a negative effects on peopl...",0,huggingface
3,8a2d32c2-4a67-4e5c-9f3f-8dfd42e88fca,"It has made our lives easier in many ways, but...",1,huggingface
4,9ef77519-ec09-4038-92a8-7f8f5fc3a1b5,You don't have to change yourself to prove a p...,0,huggingface


In [None]:
df_transformed.to_csv('100kdataset.csv', index=False)
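
The reshape above turns each paired record into two labeled rows, human first and AI second. The plain-Python sketch below shows the same interleaving on two made-up records; `interleave_pairs` is a hypothetical helper, and the field names follow the columns used above.

```python
def interleave_pairs(records):
    """Expand each {id, human_text, ai_text} record into two labeled rows."""
    rows = []
    for record in records:
        rows.append({"essay_id": record["id"], "text": record["human_text"], "generated": 0})
        rows.append({"essay_id": record["id"], "text": record["ai_text"], "generated": 1})
    return rows

# Made-up records for illustration only
toy_records = [
    {"id": "a1", "human_text": "I walk to school.", "ai_text": "Walking offers many benefits."},
    {"id": "b2", "human_text": "Cars are loud.", "ai_text": "Vehicles contribute to noise."},
]
toy_rows = interleave_pairs(toy_records)
print([(row["essay_id"], row["generated"]) for row in toy_rows])
```

Each `essay_id` appears twice, which is why 50,000 sampled pairs yield 100,000 rows.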

Calculating perplexity
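
Perplexity is the exponential of the mean per-token negative log-likelihood, which is why the code in this section exponentiates the model's loss. The sketch below works through the arithmetic with made-up per-token losses instead of GPT-2 outputs; all function names are hypothetical. It also illustrates that averaging per-chunk perplexities, as this notebook does for long texts, generally gives a different number than pooling all tokens.

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood over tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def mean_of_chunk_perplexities(chunks):
    """Unweighted average of per-chunk perplexities."""
    return sum(perplexity(chunk) for chunk in chunks) / len(chunks)

def pooled_perplexity(chunks):
    """Perplexity over all tokens, so longer chunks carry more weight."""
    return perplexity([nll for chunk in chunks for nll in chunk])

# Made-up losses: a 4-token chunk with perplexity 10 and a 1-token chunk with perplexity 40
toy_chunks = [[math.log(10)] * 4, [math.log(40)]]
chunk_average = mean_of_chunk_perplexities(toy_chunks)
pooled = pooled_perplexity(toy_chunks)
print(round(chunk_average, 2), round(pooled, 2))
```

The chunk average treats the short final chunk as equal to the long one, so it can be noticeably higher than the pooled value when the last chunk is short and hard to predict.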

In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

MODEL_MAX_LENGTH = model.config.max_position_embeddings

def calculate_chunk_perplexity(chunk, tokenizer, model):
    try:
        tokens = tokenizer.encode(chunk, return_tensors="pt", truncation=True, max_length=MODEL_MAX_LENGTH).to(device)
        if tokens.size(1) == 0:
            print(f"Empty token sequence for chunk: {chunk[:50]}...")
            return None
        with torch.no_grad():
            outputs = model(tokens, labels=tokens)
            loss = outputs.loss
            if loss is None or torch.isnan(loss):
                print(f"Loss is NaN for chunk: {chunk[:50]}...")
                return None
        return torch.exp(loss).cpu().item()
    except Exception as e:
        print(f"Error processing chunk: {chunk[:50]}... | Error: {e}")
        return None

# Function to handle long texts by splitting into chunks
def calculate_text_perplexity(text, tokenizer, model):
    tokens = tokenizer.encode(text)
    if len(tokens) <= MODEL_MAX_LENGTH:
        return calculate_chunk_perplexity(text, tokenizer, model)
    else:
        chunks = [tokens[i:i + MODEL_MAX_LENGTH] for i in range(0, len(tokens), MODEL_MAX_LENGTH)]
        chunk_texts = [tokenizer.decode(chunk, skip_special_tokens=True) for chunk in chunks]
        perplexities = []
        for chunk in chunk_texts:
            chunk_perplexity = calculate_chunk_perplexity(chunk, tokenizer, model)
            if chunk_perplexity is not None:
                perplexities.append(chunk_perplexity)
        if perplexities:
            return sum(perplexities) / len(perplexities)
        return None

# Function to process dataset in batches for NaN perplexity rows
def recalculate_na_perplexity_in_batches(dataset, batch_size=100, save_path="test_dataset.csv"):
    rows_with_na = dataset[dataset['perplexity'].isna()]
    print(f"Total rows with NaN perplexity: {len(rows_with_na)}")

    total_rows = len(rows_with_na)
    for start in range(0, total_rows, batch_size):
        end = min(start + batch_size, total_rows)
        batch_indices = rows_with_na.index[start:end]
        print(f"Processing batch {start // batch_size + 1} ({start}-{end})...")

        for idx in batch_indices:
            row = dataset.loc[idx]
            perplexity = calculate_text_perplexity(row["text"], tokenizer, model)
            dataset.at[idx, "perplexity"] = perplexity

        dataset.to_csv(save_path, index=False)
        print(f"Completed batch {start // batch_size + 1} and saved to {save_path}")

dataset = df
dataset["perplexity"] = None

# Clean text column
dataset['text'] = dataset['text'].fillna('').str.strip()
# Recalculate perplexity for NaN rows in batches
recalculate_na_perplexity_in_batches(dataset, batch_size=100, save_path="Dataset_with_new_features.csv")

Total rows with NaN perplexity: 23786
Processing batch 1 (0-100)...


Token indices sequence length is longer than the specified maximum sequence length for this model (1048 > 1024). Running this sequence through the model will result in indexing errors


Completed batch 1 and saved to Dataset_with_new_features.csv
Processing batch 2 (100-200)...
Completed batch 2 and saved to Dataset_with_new_features.csv
Processing batch 3 (200-300)...
Completed batch 3 and saved to Dataset_with_new_features.csv
Processing batch 4 (300-400)...
Completed batch 4 and saved to Dataset_with_new_features.csv
Processing batch 5 (400-500)...
Completed batch 5 and saved to Dataset_with_new_features.csv
Processing batch 6 (500-600)...
Completed batch 6 and saved to Dataset_with_new_features.csv
Processing batch 7 (600-700)...
Completed batch 7 and saved to Dataset_with_new_features.csv
Processing batch 8 (700-800)...
Completed batch 8 and saved to Dataset_with_new_features.csv
Processing batch 9 (800-900)...
Completed batch 9 and saved to Dataset_with_new_features.csv
Processing batch 10 (900-1000)...
Completed batch 10 and saved to Dataset_with_new_features.csv
Processing batch 11 (1000-1100)...
Completed batch 11 and saved to Dataset_with_new_features.csv
Pr