## [Description](https://www.kaggle.com/competitions/comp90051-2024s1-project-1)
Text generation has become an increasingly popular task with the rise of natural language processing (NLP) techniques and advancements in deep learning. Given a short text prompt written by a human, text generation employs overparameterised models to generate response text: a likely sequence of text that might follow the prompt, based on enormous training datasets of text from news articles, online libraries of books, and from scraping the web. While text generation has a wide range of applications, including chatbots, language translation, and content creation, it also poses a significant challenge in ensuring content authenticity, accuracy, and authoritativeness. This is where text generation detection comes in, which is the process of identifying whether a given text is machine-generated or human-written. Developing effective text generation detection models is important because it can help prevent the spread of fake news, misinformation, and propaganda.

Your task is to come up with test predictions for a machine-generated detection problem given a training set and test input instances. That is, your task is to predict whether given text input instances have been generated by a human or a machine.

Full details for this task are provided in the assignment description on Canvas.


## [Evaluation](www.kaggle.com/competitions/comp90051-2024s1-project-1/overview/evaluation)
For all participants, the same 50% of predictions from the test set are assigned to the public leaderboard. The score you see on the public leaderboard reflects your model’s accuracy on this portion of the test set. The other 50% of the test set will only be used once, after the competition has closed, to determine your final ranking and accuracy scores. This means that you must take care not to overfit to the leaderboard.

Submissions will be evaluated using the classification accuracy between the predicted class and the observed target.



In [None]:
#Import packages
import json
import sklearn
import numpy as np
import pandas as pd

In [None]:
#Load datasets
with open("domain1_train_data.json", "r") as f:
    dataset_1 = [ json.loads(line, parse_int = str) for line in f ]

with open("domain2_train_data.json", "r") as f:
    dataset_2 = [ json.loads(line, parse_int = str) for line in f ]

with open("test_data.json", "r") as f:
    testset = [ json.loads(line, parse_int = str) for line in f ]

n_samples_1 = len(dataset_1)
n_samples_2 = len(dataset_2)
n_tests = len(testset)

In [3]:
#Build a vocabulary of 1-grams, 2-grams, 3-grams (1-word, 2-word, 3-word)
vocab = {}
for m in range(n_samples_1):
    text = dataset_1[m]['text']
    textlength = len(text)
    for i in range(textlength):
        onegram = f"{text[i]}"
        vocab[onegram] = vocab.get(onegram, 0) + 1
        if i < textlength-1: 
            twogram = f"{text[i]} {text[i+1]}"
            vocab[twogram] = vocab.get(twogram, 0) + 1
        if i < textlength-2:
            threegram = f"{text[i]} {text[i+1]} {text[i+2]}"
            vocab[threegram] = vocab.get(threegram, 0) + 1
        
    
for m in range(n_samples_2):
    text = dataset_2[m]['text']
    textlength = len(text)
    for i in range(textlength):
        onegram = f"{text[i]}"
        vocab[onegram] = vocab.get(onegram, 0) + 1
        if i < textlength-1: 
            twogram = f"{text[i]} {text[i+1]}"
            vocab[twogram] = vocab.get(twogram, 0) + 1
        if i < textlength-2:
            threegram = f"{text[i]} {text[i+1]} {text[i+2]}"
            vocab[threegram] = vocab.get(threegram, 0) + 1

In [4]:
print("Number of 1,2,3-grams: ", len(vocab))

Number of 1,2,3-grams:  3359653


In [5]:
#To reduce number of features, just choose ALL 1-grams (like bag of words)
#Then include the top most occuring (2,3)-grams
final_vocab = {str(i): i for i in range(83583)} #1-grams
s = 0
sorted_vocabs = sorted( vocab.items(), key=lambda x:x[1] )
number_ngrams_included = 100000
while s < number_ngrams_included:
    word, count = sorted_vocabs.pop()
    if word not in final_vocab:
        final_vocab[str(word)] = 83583+s
        s += 1

In [5]:
len(final_vocab)

183583

In [6]:
with open("final_vocab.json", "w") as f:
    json.dump(final_vocab, f)

In [7]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def feature_select( texts: list[str], *, vocabulary: dict = None, method="countvectorize", **kwargs ) -> pd.DataFrame:
    """From a list of texts, output a dataframe of features with shape (n_samples, n_features).

     Args:
         texts (list[str]): list of strings, each item corresponding to a text.
         vocabulary (dict, optional): _description_. Defaults to None.
         method (str, optional): Method to select features. Defaults to "count-vectorizer".
         **kwargs: kwarg arguments to pass to Vectorizer classes of sklearn.
    Raises:
        ValueError: If passing an non-specified method of text feature extraction

     Returns:
         pd.DataFrame: dataframe of shape (n_samples, n_features)
    """
    #We want single digits to tokenized. This regex considers everything as a token except whitespace.
    kwargs['token_pattern'] = r'\S+' 
    if method == "countvectorize":
        vectorizer = CountVectorizer(vocabulary = vocabulary, **kwargs) if vocabulary else CountVectorizer(**kwargs)
    elif method == "tfidf":
        vectorizer = TfidfVectorizer(vocabulary = vocabulary, **kwargs) if vocabulary else TfidfVectorizer(**kwargs)
    else:
        raise ValueError(f"{method} is not a supported method.")
    X = vectorizer.fit_transform(texts).toarray()
    feature_names = vectorizer.get_feature_names_out()
    df = pd.DataFrame( data = X, columns=feature_names )
    return df

In [9]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [10]:
#Turn lists of words from dataset into sentences:
datatexts = []
for dataset in [dataset_1, dataset_2]:
    for instance in dataset:
       datatexts += [ " ".join(instance["text"]) ]
testtexts = []
for instance in testset:
    testtexts += [ " ".join(instance["text"]) ]

In [12]:
#Feature Selection:
with open("final_vocab.json", "r") as f:
    final_vocab = json.load(f)
    
X = feature_select(texts = datatexts, 
                    vocabulary=final_vocab, 
                    method='tfidf',
                    ngram_range=(1,2)
                    )
X.shape

(18000, 183583)

In [13]:
#Labels
y = [ [dataset_1[i]['label']] for i in range(n_samples_1) ] 
y += [ [dataset_2[i]['label']] for i in range(n_samples_2) ]
y = pd.DataFrame( y, columns=["label"] )

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    train_size=0.8, 
                                                    random_state=200124, 
                                                    shuffle=True, 
                                                    stratify=None )

In [15]:
X_train = MinMaxScaler().fit_transform(X_train.to_numpy(dtype=np.float64))
X_test = MinMaxScaler().fit_transform(X_test.to_numpy(dtype=np.float64))

In [14]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

  y = column_or_1d(y, warn=True)


0.7766666666666666

In [18]:
testtexts = [ testset[i]['text'] for i in range(n_tests) ]
testtexts = [ " ".join(testtexts[i]) for i in range(n_tests) ]
X_test_featureset = feature_select(texts = testtexts, 
                                         vocabulary=final_vocab,
                                         method="tfidf",
                                         ngram_range=(1,2)
                                        )
predictions = classifier.predict( X_test_featureset )
predictions = pd.DataFrame( predictions, index=range(n_tests), columns=[ "class" ])



In [19]:
predictions.to_csv("sample.csv", sep=",", header=True, index_label="id")

In [21]:
predictions.value_counts()

class
0        2072
1        1928
Name: count, dtype: int64

In [22]:
!kaggle competitions submit -c comp90051-2024s1-project-1 -f sample.csv -m "Trial"

100%|██████████████████████████████████████| 26.3k/26.3k [00:01<00:00, 16.3kB/s]
Successfully submitted to COMP90051 2024S1 Project 1