### OpenAI Benchmark

In [10]:
! pip install transformers==4.25.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.25.0
  Downloading transformers-4.25.0-py3-none-any.whl (5.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.8/5.8 MB 21.0 MB/s eta 0:00:00
Reason for being yanked: Version was not properly set
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.26.0
    Uninstalling transformers-4.26.0:
      Successfully uninstalled transformers-4.26.0
Successfully installed transformers-4.25.0


In [2]:
import os

import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import pipeline

In [3]:
test_set = pd.read_csv('./data/test.csv')
test_set.shape

(4763, 2)

In [4]:
# OpenAI's RoBERTa-based detector, loaded as a text-classification pipeline

pipe = pipeline("text-classification", model="roberta-base-openai-detector")

Some weights of the model checkpoint at roberta-base-openai-detector were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
# Cast to str so that non-string entries (e.g. NaN) do not break the pipeline
test_set.text = test_set.text.astype(str)

In [6]:
# Run the detector on each text, one row at a time
test_set['predictions'] = test_set.text.apply(lambda x: pipe(x))

In [7]:
# Each prediction is a one-element list of dicts; keep only the predicted label
test_set['predicted_label'] = test_set.predictions.apply(lambda x: x[0]['label'])

In [8]:
# Encode the string labels to match the 0/1 encoding of test_set.label
test_set['predicted_label'] = test_set['predicted_label'].replace({'Real': 0, 'Fake': 1})
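
The two cells above rely on the shape of a text-classification pipeline result: a list holding one dict with a `label` and a `score`. The sketch below mimics that shape with a hand-written value, so it runs without downloading the model. The score `0.98`, the name `mock_prediction`, and the helper `to_binary_label` are illustrative assumptions, not output from the detector.

```python
# Hand-written stand-in for what pipe("some text") returns for a single input.
# The score is made up for illustration; only the structure matters here.
mock_prediction = [{"label": "Fake", "score": 0.98}]

# Same mapping as the notebook: Real -> 0, Fake -> 1.
LABEL_TO_ID = {"Real": 0, "Fake": 1}


def to_binary_label(prediction):
    """Take the first (only) result dict and map its label to 0 or 1."""
    return LABEL_TO_ID[prediction[0]["label"]]


print(to_binary_label(mock_prediction))
```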

In [9]:
(
    accuracy_score(test_set.label, test_set.predicted_label),
    precision_score(test_set.label, test_set.predicted_label),
    recall_score(test_set.label, test_set.predicted_label),
    f1_score(test_set.label, test_set.predicted_label),
)

(0.7677934075162712, 0.848651623555311, 0.649810366624526, 0.7360381861575179)
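
The tuple above is ordered as (accuracy, precision, recall, F1). As a quick sanity check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. This is a minimal sketch in plain Python; the helper name `f1_from_precision_recall` is hypothetical and not part of the notebook.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the OpenAI detector output above.
precision = 0.848651623555311
recall = 0.649810366624526

f1 = f1_from_precision_recall(precision, recall)
print(f1)  # should agree with the reported F1 up to floating-point rounding
```

The gap between precision (about 0.85) and recall (about 0.65) is what pulls F1 below the accuracy figure: the detector misses roughly a third of the texts labelled 1.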

### ML Benchmarks

In [None]:
! pip install scikit-learn==0.22
! pip install xgboost==1.7.3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-learn==0.22
  Downloading scikit_learn-0.22-cp38-cp38-manylinux1_x86_64.whl (7.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.0/7.0 MB 24.9 MB/s eta 0:00:00
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.5 requires scikit-learn>=1.0.0, but you have scikit-learn 0.22 which is incompatible.
imbalanced-learn 0.8.1 requires scikit-learn>=0.24, but you have scikit-learn 0.22 which is incompatible.
Successfully installed scikit-learn-0.22
Looking in indexes:

In [None]:
# Logistic Regression

! python3 src/benchmarks_sklearn.py --model lr

Model chosen: lr.
Optimising..
Best estimator: 
LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=2000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
Evaluation metrics on test set: 
Accuracy:  85.07 %
Precision:  87.0 %
Recall:  82.34 %
F1-score 84.61 %


In [None]:
# Naive Bayes

! python3 src/benchmarks_sklearn.py --model nb

Model chosen: nb.
Optimising..
Best estimator: 
MultinomialNB(alpha=10, class_prior=None, fit_prior=True)
Evaluation metrics on test set: 
Accuracy:  83.48 %
Precision:  93.62 %
Recall:  71.72 %
F1-score 81.22 %


In [None]:
# Random Forest

! python3 src/benchmarks_sklearn.py --model rf

Model chosen: rf.
Optimising..
Best estimator: 
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=14, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=125,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [None]:
# XGBoost

! python3 src/benchmarks_sklearn.py --model xgb

Model chosen: xgb.
Optimising..
Best estimator: 
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eta=0.05, eval_metric=None,
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=6,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=125, n_jobs=None,
              num_parallel_tree=None, objective='binary:logistic', ...)

### LSTM - Train and Test

In [None]:
! pip install torch==1.9.0
! pip install torchtext==0.10.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch==1.9.0
  Downloading torch-1.9.0-cp38-cp38-manylinux1_x86_64.whl (831.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 831.4/831.4 MB 1.9 MB/s eta 0:00:00
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.13.1+cu116
    Uninstalling torch-1.13.1+cu116:
      Successfully uninstalled torch-1.13.1+cu116
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.14.1+cu116 requires torch==1.13.1, but you have torch 1.9.0 which is incompatible.
torchtext 0.14.1 requires torch==1.13.1, but you have torch 1.9.0 which is incompatible.
torchaudio 0.13.1+cu116 requires torch==1.13.1, but you have torch 1.9.0 which is incompatible.
Succes

In [None]:
! python src/lstm.py

LSTM
Epoch:  1
0.8446350762527233 0.7683615211496916 0.9194938215183572

Epoch:  2
0.8822848583877996 0.9001832682589406 0.8469693568902293

Epoch:  3
0.9001225490196079 0.8890568793220108 0.8926733413867008

Epoch:  4
0.9068627450980392 0.8836963640853891 0.9160211778689973

Epoch:  5
0.9120710784313726 0.8818227557151429 0.9294686411168178

Epoch:  6
0.9178921568627451 0.9047282577832343 0.9174654621268373

Epoch:  7
0.9197303921568627 0.887390695411315 0.9400910989477955

Epoch:  8
0.914828431372549 0.9064595724949062 0.9074656876690427

Epoch:  9
0.9169730392156863 0.917068861204106 0.9022680362057482

Epoch:  10
0.9224877450980392 0.8999004013938061 0.9313452251210955

Epoch:  11
0.9231004901960784 0.8949831609016423 0.9360736126114281

Epoch:  12
0.9231004901960784 0.8995015047300782 0.932278141443243

Epoch:  13
0.9261642156862745 0.896331815389551 0.9442345420194634

Epoch:  14
0.9231004901960784 0.9031088990108489 0.9307820766153844

Epoch:  15
0.9169730392156863 0.92481400450

Accuracy: 93.71%

Precision: 93.09%

Recall: 93.09%

F1-score: 93.09%

### GPT - Train

In [None]:
! pip install transformers==4.25.0
! pip install torch==1.12
! pip install scikit-learn==0.22

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.25.0
  Downloading transformers-4.25.0-py3-none-any.whl (5.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.8/5.8 MB 33.4 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 65.3 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.3/190.3 KB 17.2 MB/s eta 0:00:00
Reason for being yanked: Version was not properly set
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 

In [None]:
# torch 1.12

! python3 src/gpt_2/train.py

Training on device:  cuda
Learning rate: 0.0001
EPOCH: 1..
Training..
Processed 500th batch..
Processed 1000th batch..
Processed 1500th batch..
Processed 2000th batch..
Processed 2500th batch..
Processed 3000th batch..
Processed 3500th batch..
Processed 4000th batch..
Processed 4500th batch..
Processed 5000th batch..
Processed 5500th batch..
Processed 6000th batch..
Processed 6500th batch..
Processed 7000th batch..
Processed 7500th batch..
Processed 8000th batch..
Processed 8500th batch..
Processed 9000th batch..
Processed 9500th batch..
Processed 10000th batch..
Processed 10500th batch..
Processed 11000th batch..
Processed 11500th batch..
Processed 12000th batch..
Processed 12500th batch..
Processed 13000th batch..
Processed 13500th batch..
Processed 14000th batch..
Processed 14500th batch..
Processed 15000th batch..
Processed 15500th batch..
Processed 16000th batch..
Testing..
Prediction metrics at 0.5: 
Accuracy: 0.9432845123091306
Precision: 0.9558541266794626
Recall: 0.92968263845

### GPT - Eval

In [None]:
# best at epoch 2

! python3 src/gpt_2/evaluate.py --weights_path "output/gpt2_2.pt" -j 0.5947999954223633

Loading Model..
Downloading (…)olve/main/vocab.json: 100% 1.04M/1.04M [00:01<00:00, 940kB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 488kB/s]
Downloading (…)lve/main/config.json: 100% 665/665 [00:00<00:00, 137kB/s]
Downloading (…)"pytorch_model.bin";: 100% 548M/548M [00:05<00:00, 102MB/s]
Model Loaded.
Evaluating..
Evaluation metrics on test set: 
Accuracy:  94.88 %
Precision:  95.98 %
Recall:  93.64 %
F1-score 94.8 %


In [None]:
# at threshold 0.5

! python3 src/gpt_2/evaluate.py --weights_path "output/gpt2_2.pt" -j 0.5

Loading Model..
Model Loaded.
Evaluating..
Evaluation metrics on test set: 
Accuracy:  94.63 %
Precision:  94.57 %
Recall:  94.65 %
F1-score 94.61 %
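
The two evaluation runs differ only in the value passed to `-j`, which the second cell's comment describes as a threshold. Raising it from 0.5 to about 0.5948 trades recall (94.65 % to 93.64 %) for precision (94.57 % to 95.98 %). How 0.5948 was chosen is not shown in this notebook. One common approach, and only an assumption about what `-j` stands for, is to pick the threshold that maximises Youden's J statistic (true positive rate minus false positive rate). The sketch below illustrates that idea on a small made-up set of scores; the data and all function names are hypothetical.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, fn, tn) when scores >= threshold are labelled 1."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def youden_j(y_true, scores, threshold):
    """Youden's J = true positive rate - false positive rate."""
    tp, fp, fn, tn = confusion_counts(y_true, scores, threshold)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr - fpr


def best_threshold(y_true, scores):
    """Try every observed score as a threshold and keep the one with the highest J."""
    return max(sorted(set(scores)), key=lambda t: youden_j(y_true, scores, t))


# Made-up labels and scores, for illustration only (not from the test set).
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.10, 0.20, 0.35, 0.60, 0.40, 0.55, 0.80, 0.95]

for threshold in (0.5, best_threshold(toy_labels, toy_scores)):
    tp, fp, fn, tn = confusion_counts(toy_labels, toy_scores, threshold)
    print(threshold, "precision:", tp / (tp + fp), "recall:", tp / (tp + fn))
```

On real data the candidate thresholds would normally come from a validation split rather than the test set, so that the reported test metrics stay unbiased.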
