<a href="https://colab.research.google.com/github/julia-lukasiewicz-pater/gpt-wiki-features/blob/main/Code/Creating_features_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating GPT-wiki-intro-features dataset


In this notebook you can find the code used to create the [GPT-wiki-intro-features dataset](https://huggingface.co/datasets/julia-lukasiewicz-pater/GPT-wiki-intro-features) and the [small-GPT-wiki-intro-features dataset](https://huggingface.co/datasets/julia-lukasiewicz-pater/small-GPT-wiki-intro-features), both available on &#x1F917;HuggingFace.

The full version of the dataset contains 150k short texts from Wikipedia (label 0) and the corresponding 150k texts generated by ChatGPT (label 1), 300k texts in total.

The smaller version contains 100k short texts: 50k from Wikipedia and 50k generated by ChatGPT.

The texts come from the excellent [aadityaubhat/GPT-wiki-intro](https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro) dataset, which you can consult for more details.

For each text, a variety of text complexity measures were calculated using several Python libraries: [NLTK](https://www.nltk.org/), [py-readability-metrics](https://pypi.org/project/py-readability-metrics/), [lexical-diversity](https://pypi.org/project/lexical-diversity/), and [TextDescriptives](https://hlasse.github.io/TextDescriptives/).

Let's see how it was done.


# Installing and loading libraries

In [None]:
!pip install datasets
!pip install py-readability-metrics
!python -m nltk.downloader punkt
!pip install lexical-diversity
!pip install textdescriptives
!python -m spacy download en_core_web_lg

In [None]:
from datasets import load_dataset, concatenate_datasets
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('universal_tagset')
nltk.download('averaged_perceptron_tagger')
from collections import Counter
from readability import Readability
from readability.exceptions import ReadabilityException
from lexical_diversity import lex_div as ld
import random
import math
import spacy
import textdescriptives as td

# Functions

This section contains all the custom functions used to calculate linguistic features. 

* `estimate_max_bigram_entropy`: estimates the maximum bigram entropy on a sample of 1000 ChatGPT and 1000 Wikipedia texts; the result is used to normalize the per-text entropy

* `estimate_bigram_entropy`: estimates the total bigram entropy of a single text in the dataset

* `count_part_of_speech`: creates a dictionary of part-of-speech counts for a text, with zero for tags that do not occur

* `add_features`: calculates features such as mean word length, mean sentence length, readability, and lexical diversity measures, and adds them to the dataset

* `add_spacy_features`: calculates and adds the rest of the features (from the TextDescriptives library)

In [None]:
def estimate_max_bigram_entropy(dataset):
  random.seed(10)
  # Rows 0..149999 are Wikipedia texts and rows 150000..299999 are generated texts
  random_wiki_indices = random.sample(range(0,150000),1000)
  random_gpt_indices = random.sample(range(150000, 300000),1000)

  wiki = dataset.select(random_wiki_indices)
  gpt = dataset.select(random_gpt_indices)

  small_dataset = concatenate_datasets([wiki,gpt])
  words = word_tokenize(' '.join(small_dataset['text']))
  bigrams = list(nltk.bigrams(words))
  freq_dist = nltk.FreqDist(bigrams)
  total_bigrams = len(bigrams)
  entropy_values = [-freq_dist[bigram]/total_bigrams*math.log2(freq_dist[bigram]/total_bigrams) for bigram in freq_dist]
  max_entropy = sum(entropy_values)
  return max_entropy
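
For reference, the quantity computed here (and again per text in `estimate_bigram_entropy`) can be written as the Shannon entropy of the empirical bigram distribution. The symbols below are notation introduced only for this note: $c(b)$ is the number of occurrences of bigram $b$ and $N$ is the total number of bigrams. Note that the value returned by `estimate_max_bigram_entropy` is the entropy of the pooled 2000-text sample, which serves as a reference for normalization in `add_features` rather than as a theoretical maximum.

$$
H = -\sum_{b} \frac{c(b)}{N}\,\log_2 \frac{c(b)}{N},
\qquad
H_{\text{norm}} = \frac{H_{\text{text}}}{H_{\text{sample}}}
$$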

In [None]:
def estimate_bigram_entropy(text):
  words = word_tokenize(text)
  bigrams = list(nltk.bigrams(words))
  freq_dist = nltk.FreqDist(bigrams)
  total_bigrams = len(bigrams)
  entropy_values = [-freq_dist[bigram]/total_bigrams*math.log2(freq_dist[bigram]/total_bigrams) for bigram in freq_dist]
  total_entropy = sum(entropy_values)
  return total_entropy
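
To make the entropy calculation concrete, here is a minimal sketch on two toy strings. It assumes plain whitespace splitting instead of NLTK's `word_tokenize` and uses `collections.Counter` in place of `nltk.FreqDist`; the helper name `toy_bigram_entropy` is hypothetical and not part of the notebook.

```python
import math
from collections import Counter

def toy_bigram_entropy(text):
    # Assumption: whitespace tokenization stands in for nltk.word_tokenize
    words = text.split()
    bigrams = list(zip(words, words[1:]))
    counts = Counter(bigrams)
    total = len(bigrams)
    # Shannon entropy (in bits) of the empirical bigram distribution
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Four distinct bigrams, each with probability 1/4
uniform = toy_bigram_entropy("a b c d e")
# Only two distinct bigrams, ('a', 'b') and ('b', 'a'), each with probability 1/2
repetitive = toy_bigram_entropy("a b a b a")
print(uniform, repetitive)
```

The repetitive string gets the lower value, which is the intuition behind using bigram entropy as a feature: more predictable word sequences yield lower entropy.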


In [None]:
def count_part_of_speech(counted):
  # Universal tagset used by NLTK; the order here fixes the order of the output counts
  pos_tags = ['ADJ', 'ADP', 'ADV', 'CONJ', 'DET', 'NOUN', 'NUM', 'PRT', 'PRON', 'VERB', '.', 'X']
  all_tags = dict.fromkeys(pos_tags, 0)
  # Accept the counts only if every observed tag belongs to the universal tagset
  if set(counted) <= set(pos_tags):
    # Start from zero for every tag, then overwrite with the observed counts
    return {**all_tags, **counted}
  else:
    return -1
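
As a small usage illustration, the sketch below applies the same merge logic to a hand-made `Counter`. The tag counts are invented for the example, and `fill_missing_tags` is a hypothetical stand-in for `count_part_of_speech` so that the snippet runs on its own.

```python
from collections import Counter

UNIVERSAL_TAGS = ['ADJ', 'ADP', 'ADV', 'CONJ', 'DET', 'NOUN',
                  'NUM', 'PRT', 'PRON', 'VERB', '.', 'X']

def fill_missing_tags(counted):
    # Start from zero for every universal tag, then overwrite with observed counts
    if set(counted) <= set(UNIVERSAL_TAGS):
        return {**dict.fromkeys(UNIVERSAL_TAGS, 0), **counted}
    return -1  # same sentinel value that count_part_of_speech returns

# Invented counts, as a tagger might produce for "The cat sleeps ."
observed = Counter({'DET': 1, 'NOUN': 1, 'VERB': 1, '.': 1})
filled = fill_missing_tags(observed)
unknown = fill_missing_tags(Counter({'FOO': 2}))
print(filled)
print(unknown)
```

The result always has the 12 tags in the same order, which is what lets `add_features` read the counts by position.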

In [None]:
def add_features(example):
  text = example['text']
  words = word_tokenize(text)
  sentences = sent_tokenize(text)
  # Average sentence length in tokens and average token length in characters
  mean_sent_length = np.mean([len(word_tokenize(sentence)) for sentence in sentences])
  mean_word_length = np.mean([len(word) for word in words])
  # Count universal POS tags; count_part_of_speech fills in zeros for absent tags
  tagged = nltk.pos_tag(words, tagset = 'universal')
  unzipped_tags = list(zip(*tagged))[1]
  counted_tags = Counter(unzipped_tags)
  all_tags = count_part_of_speech(counted_tags)
  num_tags = np.zeros(shape=(12))
  for j, tag in enumerate(all_tags):
    num_tags[j] = all_tags[tag]
  r = Readability(text)
  try:
    gunning_fog = r.gunning_fog().score
    ari = r.ari().score
    dale_chall = r.dale_chall().score
  except ReadabilityException:
    # The library could not score this text (e.g. it is too short); fall back to 0
    gunning_fog = 0
    ari = 0
    dale_chall = 0

  HDD = ld.hdd(words)
  MTLD = ld.mtld(words)
  MATTR = ld.mattr(words)

  bigram_entropy = estimate_bigram_entropy(text)

  example['normalized_bigram_entropy'] = bigram_entropy / max_bigram_entropy
  example['mean_word_length'] = mean_word_length
  example['mean_sent_length'] = mean_sent_length
  example['fog'] = gunning_fog
  example['ari'] = ari
  example['dale_chall'] = dale_chall
  example['hdd'] = HDD
  example['mtld'] = MTLD
  example['mattr'] = MATTR

  # Despite the 'number_of_' prefix, these are tag counts divided by the number of tokens
  example['number_of_ADJ'] =  num_tags[0] / len(words)
  example['number_of_ADP'] = num_tags[1] / len(words)
  example['number_of_ADV'] = num_tags[2] / len(words)
  example['number_of_CONJ'] = num_tags[3] / len(words)
  example['number_of_DET'] = num_tags[4] / len(words)
  example['number_of_NOUN'] = num_tags[5] / len(words)
  example['number_of_NUM'] = num_tags[6] / len(words)
  example['number_of_PRT'] = num_tags[7] / len(words)
  example['number_of_PRON'] = num_tags[8] / len(words)
  example['number_of_VERB'] = num_tags[9] / len(words)
  example['number_of_DOT'] = num_tags[10] / len(words)
  example['number_of_X'] =  num_tags[11] / len(words)

  return example
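
Of the lexical diversity measures used above, MATTR (moving-average type-token ratio) is the easiest to illustrate. The following is a generic sketch of the idea, not the `lexical-diversity` implementation: the function name `moving_average_ttr`, the default window of 50 tokens, and the fallback for short texts are assumptions made for this example, and `ld.mattr` may differ in detail.

```python
def moving_average_ttr(tokens, window=50):
    # Short texts: fall back to the plain type-token ratio (assumed behaviour)
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    # Type-token ratio of every full window, averaged over all window positions
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

tokens = "the cat saw the cat and the dog".split()
plain_ttr = moving_average_ttr(tokens)            # whole text fits in one window
windowed_ttr = moving_average_ttr(tokens, window=4)
print(plain_ttr, windowed_ttr)
```

Averaging over fixed-size windows makes the measure less sensitive to text length than the plain type-token ratio, which matters when comparing texts of different sizes.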

In [None]:
def add_spacy_features(example):
  text = example['text']
  doc = nlp(text)
  features = td.extract_dict(doc)[0]
  # TextDescriptives metrics to copy, in the order they are added to the dataset
  feature_names = [
    'perplexity', 'entropy', 'automated_readability_index', 'per_word_perplexity',
    'dependency_distance_mean', 'dependency_distance_std',
    'first_order_coherence', 'second_order_coherence', 'smog',
    'prop_adjacent_dependency_relation_mean', 'prop_adjacent_dependency_relation_std',
    'syllables_per_token_mean', 'syllables_per_token_median',
    'token_length_std', 'token_length_median', 'sentence_length_median',
    'syllables_per_token_std', 'proportion_unique_tokens',
    'top_ngram_chr_fraction_3', 'top_ngram_chr_fraction_2', 'top_ngram_chr_fraction_4',
    'proportion_bullet_points', 'flesch_reading_ease', 'flesch_kincaid_grade',
    'gunning_fog', 'coleman_liau_index', 'oov_ratio',
  ]
  # Two metrics are stored under a different column name; the rest keep their own
  renamed = {'perplexity': 'spacy_perplexity', 'per_word_perplexity': 'per_word_spacy_perplexity'}
  for name in feature_names:
    example[renamed.get(name, name)] = features[name]

  return example


# Adding modifications and saving the dataset

Since the original dataset is on &#x1F917;HuggingFace, I can take advantage of the `datasets` library. Check out its documentation [here](https://huggingface.co/docs/datasets/index)!

It took only a few steps to generate the new dataset. I started by loading the original dataset and transforming it into the desired format.

In [None]:
dataset = load_dataset('aadityaubhat/GPT-wiki-intro', split='train')

In [None]:
# Columns that neither subset needs for feature extraction
unused_columns = ['id', 'url', 'title', 'title_len', 'wiki_intro_len', 'generated_intro_len', 'prompt', 'generated_text', 'prompt_tokens', 'generated_text_tokens']
wiki = dataset.remove_columns(unused_columns + ['generated_intro'])
gpt = dataset.remove_columns(unused_columns + ['wiki_intro'])
wiki = wiki.rename_column('wiki_intro', 'text')
gpt = gpt.rename_column('generated_intro', 'text')
concatenated_dataset = concatenate_datasets([wiki,gpt])

Now, my dataset consists of 300k rows and just one column `text`. First in order are 150k texts from Wikipedia, followed by 150k generated texts. Knowing this, I can create a new `class` column with 0 meaning Wikipedia and 1 meaning ChatGPT.

In [None]:
class_gpt = np.repeat(1, 150000)
class_wiki = np.repeat(0, 150000)
binary_class = np.concatenate((class_wiki, class_gpt), axis = 0)
concatenated_dataset = concatenated_dataset.add_column('class',binary_class)

Here, I set some global variables. `nlp` is used by the TextDescriptives library (which in turn relies on [spaCy](https://spacy.io/)) to parse the input text, and `max_bigram_entropy` is used by the `add_features` function.

In [None]:
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textdescriptives/all")
max_bigram_entropy = estimate_max_bigram_entropy(concatenated_dataset)

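300k rows is a lot to process, so for the smaller version I randomly selected an equal number of rows (50k) from each class to be included in the final dataset.

NOTE: this chunk of code was only used to generate the smaller version of the dataset. The sample is drawn from Python's global `random` state, which was seeded inside `estimate_max_bigram_entropy`, so the selection depends on running the cells in order.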

In [None]:
wiki_to_select = list(random.sample(range(0,150000),50000))
gpt_to_select = list(random.sample(range(150000,300000),50000))
to_select = wiki_to_select + gpt_to_select
selected_dataset = concatenated_dataset.select(to_select)
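
The balanced selection can be sketched on a toy scale as follows. This is an illustration only: the sizes are hypothetical, the helper `sample_balanced_indices` is not part of the notebook, and it uses a local `random.Random` generator with an explicit seed, whereas the code above relies on the global `random` state.

```python
import random

def sample_balanced_indices(n_per_class, k, seed=0):
    # Assumption: rows 0..n_per_class-1 are class 0, the next n_per_class rows are class 1
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    class0 = rng.sample(range(0, n_per_class), k)
    class1 = rng.sample(range(n_per_class, 2 * n_per_class), k)
    return class0 + class1

# Toy sizes (hypothetical): 10 rows per class, 3 sampled from each
indices = sample_balanced_indices(n_per_class=10, k=3)
labels = [0 if i < 10 else 1 for i in indices]
print(indices, labels)
```

Sampling each index range separately guarantees the same number of rows per class, which a single draw from the full range would not.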

Now I can apply the feature calculation to each of the selected texts.

In [None]:
selected_dataset = selected_dataset.map(add_features, num_proc = 2) 

In [None]:
selected_dataset = selected_dataset.map(add_spacy_features, num_proc = 2)

And finally save the dataset!

In [None]:
selected_dataset.save_to_disk('.../small-GPT-wiki-intro-features')

# Bonus

Let's have a quick initial look at how these features perform in the classification task using a simple Random Forest model.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

In [None]:
my_dataset = load_dataset('julia-lukasiewicz-pater/small-GPT-wiki-intro-features', split = 'train')


In [None]:
my_dataset

Dataset({
    features: ['Unnamed: 0', 'text', 'normalized_bigram_entropy', 'mean_word_length', 'mean_sent_length', 'fog', 'ari', 'dale_chall', 'hdd', 'mtld', 'mattr', 'number_of_ADJ', 'number_of_ADP', 'number_of_ADV', 'number_of_CONJ', 'number_of_DET', 'number_of_NOUN', 'number_of_NUM', 'number_of_PRT', 'number_of_PRON', 'number_of_VERB', 'number_of_DOT', 'number_of_X', 'class', 'spacy_perplexity', 'entropy', 'automated_readability_index', 'per_word_spacy_perplexity', 'dependency_distance_mean', 'dependency_distance_std', 'first_order_coherence', 'second_order_coherence', 'smog', 'prop_adjacent_dependency_relation_mean', 'prop_adjacent_dependency_relation_std', 'syllables_per_token_mean', 'syllables_per_token_median', 'token_length_std', 'token_length_median', 'sentence_length_median', 'syllables_per_token_std', 'proportion_unique_tokens', 'top_ngram_chr_fraction_3', 'top_ngram_chr_fraction_2', 'top_ngram_chr_fraction_4', 'proportion_bullet_points', 'flesch_reading_ease', 'flesch_kinc

In [None]:
my_dataset = my_dataset.remove_columns(['text', 'Unnamed: 0'])
my_dataset = my_dataset.shuffle(seed=42)
my_dataset = pd.DataFrame(my_dataset)
y = my_dataset['class']
X = my_dataset.drop(['class'],axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Replace missing feature values with 0, then fit a Random Forest
my_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy = 'constant', fill_value = 0)),
                              ('model', RandomForestClassifier(random_state=42))
                             ])
# 5-fold cross-validated F1 score on the training split
score = cross_val_score(my_pipeline, X_train, y_train, cv = 5, scoring = 'f1')

In [None]:
# Cross-validation scores
score

array([0.91619677, 0.91563932, 0.91371994, 0.91443678, 0.91393469])
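
A compact way to report these fold scores is their mean and spread. The sketch below uses only the standard library and the five values copied from the output above; the variable names are chosen for this example.

```python
from statistics import mean, stdev

# Fold scores copied from the cross-validation output above
scores = [0.91619677, 0.91563932, 0.91371994, 0.91443678, 0.91393469]
cv_mean = mean(scores)
cv_std = stdev(scores)  # sample standard deviation across the 5 folds
print(f"F1 = {cv_mean:.4f} +/- {cv_std:.4f}")
```

The small spread across folds suggests the score does not depend much on which part of the training data is held out.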

In [None]:
my_pipeline.fit(X_train,y_train)
preds = my_pipeline.predict(X_test)
f1 = f1_score(y_test, preds)

# F1 score on the test set
f1

0.9127233407136605
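
For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall for the positive class (here class 1, the generated texts). The sketch below computes it by hand on invented toy labels; `f1_from_labels` is a hypothetical helper written for this illustration, not the scikit-learn implementation.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    # Count true positives, false positives and false negatives for the positive class
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented toy labels: 1 = generated, 0 = Wikipedia
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
toy_f1 = f1_from_labels(toy_true, toy_pred)
print(toy_f1)
```

Here two of the three generated texts are found (recall 2/3) and two of the three positive predictions are correct (precision 2/3), so the F1 score is also 2/3.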

:)