## [Description](https://www.kaggle.com/competitions/comp90051-2024s1-project-1)
Text generation has become an increasingly popular task with the rise of natural language processing (NLP) techniques and advancements in deep learning. Given a short text prompt written by a human, text generation employs overparameterised models to generate response text: a likely sequence of text that might follow the prompt, based on enormous training datasets of text from news articles, online libraries of books, and from scraping the web. While text generation has a wide range of applications, including chatbots, language translation, and content creation, it also poses a significant challenge in ensuring content authenticity, accuracy, and authoritativeness. This is where text generation detection comes in, which is the process of identifying whether a given text is machine-generated or human-written. Developing effective text generation detection models is important because it can help prevent the spread of fake news, misinformation, and propaganda.

Your task is to come up with test predictions for a machine-generated detection problem given a training set and test input instances. That is, your task is to predict whether given text input instances have been generated by a human or a machine.

Full details for this task are provided in the assignment description on Canvas.


## [Evaluation](www.kaggle.com/competitions/comp90051-2024s1-project-1/overview/evaluation)
For all participants, the same 50% of predictions from the test set are assigned to the public leaderboard. The score you see on the public leaderboard reflects your model’s accuracy on this portion of the test set. The other 50% of the test set will only be used once, after the competition has closed, to determine your final ranking and accuracy scores. This means that you must take care not to overfit to the leaderboard.

Submissions will be evaluated using the classification accuracy between the predicted class and the observed target.



In [135]:
#Import packages
import json
import sklearn
import numpy as np
import pandas as pd

In [136]:
#Load datasets
with open("domain1_train_data.json", "r") as f:
    dataset_1 = [ json.loads(line, parse_int = str) for line in f ]

with open("domain2_train_data.json", "r") as f:
    dataset_2 = [ json.loads(line, parse_int = str) for line in f ]

with open("test_data.json", "r") as f:
    testset = [ json.loads(line, parse_int = str) for line in f ]

n_samples_1 = len(dataset_1)
n_samples_2 = len(dataset_2)
n_tests = len(testset)

In [137]:
#Build a dict of vocabulary containing 1-grams (single words), and n-grams (word sequences)
#A key is a unique vocabulary, and the value is its frequency throughout dataset
vocab = {}
for m in range(n_samples_1):
    text = dataset_1[m]['text']
    textlength = len(text)
    for i in range(textlength):
        onegram = f"{text[i]}"
        vocab[onegram] = vocab.get(onegram, 0) + 1
        if i < textlength-1: 
            twogram = f"{text[i]} {text[i+1]}"
            vocab[twogram] = vocab.get(twogram, 0) + 1
        # if i < textlength-2:
        #     threegram = f"{text[i]} {text[i+1]} {text[i+2]}"
        #     vocab[threegram] = vocab.get(threegram, 0) + 1
        

for m in range(n_samples_2):
    text = dataset_2[m]['text']
    textlength = len(text)
    for i in range(textlength):
        onegram = f"{text[i]}"
        vocab[onegram] = vocab.get(onegram, 0) + 1
        if i < textlength-1: 
            twogram = f"{text[i]} {text[i+1]}"
            vocab[twogram] = vocab.get(twogram, 0) + 1
        # if i < textlength-2:
        #     threegram = f"{text[i]} {text[i+1]} {text[i+2]}"
        #     vocab[threegram] = vocab.get(threegram, 0) + 1

In [138]:
#Same as above, but we only count how often a vocabulary appears in a document (so repeating 
vocab = {}
for m in range(n_samples_1):
    text = dataset_1[m]['text']
    textlength = len(text)
    for i in range(textlength):
        onegram = f"{text[i]}"
        vocab[onegram] = vocab.get(onegram, 0) + 1
        if i < textlength-1: 
            twogram = f"{text[i]} {text[i+1]}"
            vocab[twogram] = vocab.get(twogram, 0) + 1
        # if i < textlength-2:
        #     threegram = f"{text[i]} {text[i+1]} {text[i+2]}"
        #     vocab[threegram] = vocab.get(threegram, 0) + 1
        

for m in range(n_samples_2):
    text = dataset_2[m]['text']
    textlength = len(text)
    for i in range(textlength):
        onegram = f"{text[i]}"
        vocab[onegram] = vocab.get(onegram, 0) + 1
        if i < textlength-1: 
            twogram = f"{text[i]} {text[i+1]}"
            vocab[twogram] = vocab.get(twogram, 0) + 1
        # if i < textlength-2:
        #     threegram = f"{text[i]} {text[i+1]} {text[i+2]}"
        #     vocab[threegram] = vocab.get(threegram, 0) + 1

In [139]:
print("Size of vocabulary: ", len(vocab))

Size of vocabulary:  1040230


In [140]:
#To reduce number of features, just choose ALL 1-grams (like bag of words)
#Then include the top most occuring (2,3)-grams
# final_vocab = {str(i): i for i in range(83583)} #1-grams
# s = 0
final_vocab = {}
t = 0
number_1gram = 0
number_2gram = 0

#Remove unique words and n-grams in the final vocabulary, as they will act as noise in classifying.
for word, count in vocab.items():
    if (count > 1) and (" " not in word): #1-grams don't have space
        final_vocab[ word ] = t
        number_1gram += 1
        t += 1
        
    elif (count > 1) and (" " in word):
        final_vocab[ word ] = t
        number_2gram += 1
        t += 1

# sorted_vocabs = sorted( vocab.items(), key=lambda x:x[1] )
# number_ngrams_included = 10000
# while s < number_ngrams_included:
#     word, count = sorted_vocabs.pop()
#     if word not in final_vocab:
#         final_vocab[str(word)] = 83583+s
#         s += 1

In [141]:
with open("final_vocab.json", "w") as f:
    json.dump(final_vocab, f)

In [142]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def feature_select( texts: list[str], *, vocabulary: dict = None, method="countvectorize", sparse=False, **kwargs ):
    """From a list of texts, output a dataframe of features with shape (n_samples, n_features).

     Args:
         texts (list[str]): list of strings, each item corresponding to a text.
         vocabulary (dict, optional): _description_. Defaults to None.
         method (str, optional): Method to select features. Defaults to "count-vectorizer".
         **kwargs: kwarg arguments to pass to Vectorizer classes of sklearn.
    Raises:
        ValueError: If passing an non-specified method of text feature extraction

     Returns:
         pd.DataFrame: dataframe of shape (n_samples, n_features)
    """
    #We want single digits to tokenized. This regex considers everything as a token except whitespace.
    kwargs['token_pattern'] = r'\S+' 
    if method == "countvectorize":
        vectorizer = CountVectorizer(vocabulary = vocabulary, **kwargs) if vocabulary else CountVectorizer(**kwargs)
    elif method == "tfidf":
        vectorizer = TfidfVectorizer(vocabulary = vocabulary, **kwargs) if vocabulary else TfidfVectorizer(**kwargs)
    else:
        raise ValueError(f"{method} is not a supported method.")
    if not sparse:
        X = vectorizer.fit_transform(texts)
        feature_names = vectorizer.get_feature_names_out()
        df = pd.DataFrame.sparse.from_spmatrix(data=X, columns = feature_names)
        return df, vectorizer
    else:
        X = vectorizer.fit_transform(texts)
        return X, vectorizer

In [143]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [144]:
#Turn lists of words from dataset into sentences:
datatexts = []
for dataset in [dataset_1, dataset_2]:
    for instance in dataset:
        datatexts += [ " ".join(instance["text"]) ]
testtexts = []
for instance in testset:
    testtexts += [ " ".join(instance["text"]) ]

In [160]:
#Feature Selection:
with open("final_vocab.json", "r") as f:
    final_vocab = json.load(f)
    
X, vectorizer = feature_select(texts = datatexts, 
                    method='tfidf',
                    ngram_range=(1,2),
                    sparse=True,
                    max_df=0.95, #Ignore vocabulary appearing most instances. We expect these are words corresponding to things like "is", "are", "and", "this" etc.
                    min_df=10, #Ignore vocabulary that is too infrequent, as this may lead to low prediction accuracy.
                    )
X.shape

(18000, 65578)

In [161]:
#Labels
y = [ [dataset_1[i]['label']] for i in range(n_samples_1) ] 
y += [ [dataset_2[i]['label']] for i in range(n_samples_2) ]
y = pd.DataFrame( y, columns=["label"] )

In [162]:
from sklearn.decomposition import IncrementalPCA, PCA, SparsePCA
from sklearn.feature_selection import mutual_info_classif, f_classif, SelectKBest

# #sparse matrix feature selection
# #Feature Selection:
# X_sparse, vectorizer = feature_select(texts = datatexts, 
#                     vocabulary=final_vocab, 
#                     method='tfidf',
#                     ngram_range=(1,2),
#                     sparse=True
#                     )

In [163]:
# #Repeat for full vocab
# X_sparse_2, vectorizer_2 = feature_select(texts = datatexts, 
#                     vocabulary=vocab.keys(), 
#                     method='tfidf',
#                     ngram_range=(1,2),
#                     sparse=True
#                     )
# X_sparse_2.shape

In [None]:
%%capture pca_transform
%%time
transformer = IncrementalPCA(n_components=4500, batch_size=4500)
X_pca = transformer.fit_transform(X.toarray())
X_pca.shape

In [17]:
pca_transform()

(18000, 4500)

In [109]:
X_train, X_test, y_train, y_test = train_test_split(X.toarray(), y, 
                                                    test_size=0.2, 
                                                    train_size=0.8, 
                                                    random_state=2024, 
                                                    shuffle=True, 
                                                    stratify=None )

In [110]:
# X_train = MinMaxScaler().fit_transform(X_train.to_numpy(dtype=np.float64))
# X_test = MinMaxScaler().fit_transform(X_test.to_numpy(dtype=np.float64))

X_train = MinMaxScaler().fit_transform(X_train)
X_test = MinMaxScaler().fit_transform(X_test)

In [111]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  y = column_or_1d(y, warn=True)


0.7686111111111111

In [114]:
testtexts = [ testset[i]['text'] for i in range(n_tests) ]
testtexts = [ " ".join(testtexts[i]) for i in range(n_tests) ]
X_test_featureset = vectorizer.transform( testtexts )
# X_test_featureset = transformer.transform( X_test_featureset )
X_test_featureset = MinMaxScaler().fit_transform(X_test_featureset.toarray())
predictions = classifier.predict( X_test_featureset )
predictions = pd.DataFrame( predictions, index=range(n_tests), columns=[ "class" ])

In [115]:
predictions.value_counts()

class
0        2004
1        1996
dtype: int64

In [116]:
predictions.to_csv("sample.csv", sep=",", header=True, index_label="id")

In [120]:
!kaggle competitions submit -c comp90051-2024s1-project-1 -f sample.csv -m "Using ~4k features only"

100%|██████████████████████████████████████| 26.3k/26.3k [00:02<00:00, 10.9kB/s]
Successfully submitted to COMP90051 2024S1 Project 1

In [127]:
rng = np.random.default_rng(1231)
random_predicts = pd.DataFrame( [ rng.choice([0,1]) for _ in range(4000) ], index=range(n_tests), columns=[ "class" ])

In [129]:
random_predicts.to_csv("random.csv", sep=",", header=True, index_label="id")
!kaggle competitions submit -c comp90051-2024s1-project-1 -f random.csv -m "Completely random guessing"

100%|██████████████████████████████████████| 26.3k/26.3k [00:01<00:00, 13.6kB/s]
Successfully submitted to COMP90051 2024S1 Project 1