### Lab 7.1: Bag of Words Model

In this lab you will use the bag-of-words model to perform authorship attribution on a [dataset of texts from Victorian authors](https://github.com/agungor2/Authorship_Attribution?tab=readme-ov-file).

In [12]:
import numpy as np
import sklearn
import pandas as pd

Here we download the CSV file containing the text snippets and author IDs.

In [13]:
!wget --no-clobber -O Gungor_2018_VictorianAuthorAttribution_data-train.csv -q "https://www.dropbox.com/scl/fi/emk9db05t9u8yzgrjje7t/Gungor_2018_VictorianAuthorAttribution_data-train.csv?rlkey=kzvbl0mbpnrpjr4c3q18le6w2&dl=1"

In [14]:
df = pd.read_csv('Gungor_2018_VictorianAuthorAttribution_data-train.csv', encoding="ISO-8859-1")
df.head()

Unnamed: 0,text,author
0,ou have time to listen i will give you the ent...,1
1,wish for solitude he was twenty years of age a...,1
2,and the skirt blew in perfect freedom about th...,1
3,of san and the rows of shops opposite impresse...,1
4,an hour s walk was as tiresome as three in a s...,1


In [15]:
text = list(df['text'])
labels = df['author'].values

In [16]:
text[0]

'ou have time to listen i will give you the entire story he said it may form the basis of a future novel and prove quite as interesting as one of your own invention i had the time to listen of course one has time for anything and everything agreeable in the best place to hear the tale was in a victoria and with my good on the box with the coachman we set out at once on a drive to the as the recital was only half through when we reached the house we postponed the remainder while we stopped there for an excellent lunch on the way back to my friend continued and finished the story it was indeed quite suitable for use and i told my friend with thanks that i should at once put it in shape for my readers i said i should make a few alterations in it for the sake of dramatic interest but in the main would follow the lines he had given me it would spoil my romance were i to answer on this page the question that must be uppermost in the reader s mind i have already revealed almost too much of th

### Exercises

1. Use the `CountVectorizer` from `sklearn.feature_extraction.text` to produce a term frequency vector for each text.  Set `max_features=1000` to only use the top 1000 terms.

Prepare a 90/10 train-test split with `random_state=42`.

Train the default `MLPClassifier` from `sklearn.neural_network` on the data and report the train and test accuracy.  You can pass `verbose=True` to `MLPClassifier` to monitor training.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Term-count vectors restricted to the 1000 most frequent terms.
vectorizer = CountVectorizer(max_features=1000)
X = vectorizer.fit_transform(text).toarray()
# Hold out 10% of the snippets for testing.
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.1, random_state=42)
# Default MLP; verbose=True prints the loss after each iteration.
clf = MLPClassifier(verbose=True)
clf.fit(X_train, y_train)

# Compare accuracy on the training data and on the held-out data.
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
print("Train accuracy:", accuracy_score(y_train, y_train_pred))
print("Test accuracy:", accuracy_score(y_test, y_test_pred))

Iteration 1, loss = 1.92049611
Iteration 2, loss = 0.50725217
Iteration 3, loss = 0.28534340
Iteration 4, loss = 0.19642350
Iteration 5, loss = 0.14355440
Iteration 6, loss = 0.10809862
Iteration 7, loss = 0.08454068
Iteration 8, loss = 0.06441515
Iteration 9, loss = 0.04846363
Iteration 10, loss = 0.03714305
Iteration 11, loss = 0.02882209
Iteration 12, loss = 0.02181749
Iteration 13, loss = 0.01952588
Iteration 14, loss = 0.01487900
Iteration 15, loss = 0.01151154
Iteration 16, loss = 0.01048287
Iteration 17, loss = 0.00885181
Iteration 18, loss = 0.00858167
Iteration 19, loss = 0.00888093
Iteration 20, loss = 0.00837009
Iteration 21, loss = 0.00816079
Iteration 22, loss = 0.01030391
Iteration 23, loss = 0.00885173
Iteration 24, loss = 0.00919736
Iteration 25, loss = 0.01052904
Iteration 26, loss = 0.00779272
Iteration 27, loss = 0.00746545
Iteration 28, loss = 0.00432913
Iteration 29, loss = 0.00455537
Iteration 30, loss = 0.00497592
Iteration 31, loss = 0.01959066
Iteration 32, los


2. Repeat the steps, this time using `TfidfVectorizer` to produce term frequency-inverse document frequency (TF-IDF) vectors.

Does the IDF weighting improve the results?

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Same pipeline as before, but with TF-IDF weights instead of raw counts.
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(text).toarray()
# Same split parameters, so both models are tested on the same snippets.
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.1, random_state=42)
clf = MLPClassifier(verbose=True)
clf.fit(X_train, y_train)

# Compare accuracy on the training data and on the held-out data.
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
print("Train accuracy:", accuracy_score(y_train, y_train_pred))
print("Test accuracy:", accuracy_score(y_test, y_test_pred))

Iteration 1, loss = 3.01926992
Iteration 2, loss = 1.99989702
Iteration 3, loss = 1.38074143
Iteration 4, loss = 0.99334759
Iteration 5, loss = 0.75812567
Iteration 6, loss = 0.60819539
Iteration 7, loss = 0.50657658
Iteration 8, loss = 0.43203296
Iteration 9, loss = 0.37583316
Iteration 10, loss = 0.33166791
Iteration 11, loss = 0.29580513
Iteration 12, loss = 0.26717061
Iteration 13, loss = 0.24178459
Iteration 14, loss = 0.22146274
Iteration 15, loss = 0.20319301
Iteration 16, loss = 0.18773918
Iteration 17, loss = 0.17354409
Iteration 18, loss = 0.16096451
Iteration 19, loss = 0.15036259
Iteration 20, loss = 0.14057153
Iteration 21, loss = 0.13134682
Iteration 22, loss = 0.12332095
Iteration 23, loss = 0.11578058
Iteration 24, loss = 0.10905762
Iteration 25, loss = 0.10293909
Iteration 26, loss = 0.09675139
Iteration 27, loss = 0.09191076
Iteration 28, loss = 0.08682815
Iteration 29, loss = 0.08222857
Iteration 30, loss = 0.07808349
Iteration 31, loss = 0.07403504
Iteration 32, los

IDF weighting did not significantly improve the results: both models achieved similar accuracy scores, although the training loss fell more slowly with the TF-IDF features.
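One way to make such a comparison easier to read is to wrap the shared steps in a function and print the accuracies side by side. The sketch below is only illustrative: `evaluate_vectorizer` is a hypothetical helper, the tiny `demo_texts` corpus is synthetic so that the block runs on its own, and fixing `random_state` on the classifier is an added assumption for reproducibility (the cells above use the unseeded default).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def evaluate_vectorizer(vectorizer, texts, labels, seed=42):
    """Hypothetical helper: vectorize, split 90/10, train an MLP, return accuracies."""
    X = vectorizer.fit_transform(texts).toarray()
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.1, random_state=seed
    )
    # Seeding the classifier is an assumption made here for reproducibility.
    clf = MLPClassifier(random_state=seed)
    clf.fit(X_train, y_train)
    return {
        "train": accuracy_score(y_train, clf.predict(X_train)),
        "test": accuracy_score(y_test, clf.predict(X_test)),
    }


# Tiny synthetic corpus so the sketch is self-contained; in the lab you would
# pass `text` and `labels` from the cells above instead.
demo_texts = ["the sea and the ship and the storm"] * 20 + ["a garden with tea and roses"] * 20
demo_labels = [0] * 20 + [1] * 20

results = {
    "counts": evaluate_vectorizer(CountVectorizer(max_features=1000), demo_texts, demo_labels),
    "tfidf": evaluate_vectorizer(TfidfVectorizer(max_features=1000), demo_texts, demo_labels),
}
for name, scores in results.items():
    print(f"{name:>6s}  train={scores['train']:.3f}  test={scores['test']:.3f}")
```

Because both calls use the same split seed, any difference in the printed scores comes from the feature weighting and not from which snippets were held out.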