![intro](intro.png)

## Summary
Model: Sequential neural network (from Keras)

Accuracy: 85.3%

## Dataset Information

A structured dataset is provided, where each row represents a text sample. The columns indicate either the presence of specific words in the text or the author who wrote it.

<code>Word</code> Columns: These columns represent individual words. Each entry is binary: 1 indicates that the word appears in the text, and 0 indicates that it does not.

<code>author</code> Column: This column contains the name of the author who wrote the text. It serves as the target variable that the model is expected to predict.

|word_1|word_2|...|word_n|**author**|
|:------:|:---:|:---:|:---:|:---:|
|0|1|...|1|`Mason Reed`|
|1|1|...|1|`Ava Thompson`|
|0|1|...|0|`Liam Carter`|

The test dataset has the same structure as the training dataset, except that it does not include the <code>author</code> column (the target variable). The test set contains 798 rows.
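To make the encoding concrete, here is a minimal sketch of how a text sample could be turned into one binary word-presence row. The vocabulary, the sample sentences, and the helper `encode_presence` are hypothetical; the provided dataset is already encoded, and how it was originally tokenized is not stated.

```python
# Hypothetical vocabulary; the real dataset defines its own word columns.
vocabulary = ["rain", "hair", "oil", "run", "eat"]


def encode_presence(text, vocabulary):
    """Return one 0/1 flag per vocabulary word: 1 if the word occurs in the text."""
    tokens = set(text.lower().split())  # naive whitespace tokenization (assumption)
    return [1 if word in tokens else 0 for word in vocabulary]


samples = [
    "Rain fell on the oil lamp",
    "They run and run until they eat",
]

rows = [encode_presence(text, vocabulary) for text in samples]
for text, row in zip(samples, rows):
    print(row, "<-", text)
```

Because the columns record presence rather than counts, the repeated "run" in the second sample still yields a single 1.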


### Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical

### Load Dataset

In [None]:
train_df = pd.read_csv(r'.../train.csv')
test_df = pd.read_csv(r'.../test.csv')
test_df.head()

| |lung|council|solution|quite|rain|hair|skill|difficulty|add|pull|...|stocking|near|oil|dive|many|run|tender|asleep|eat|sweep|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|1|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|2|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|3|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|4|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|


### Data processing

In [3]:
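# Separate the word-presence features from the target column
x = train_df.drop(columns='author')
y = train_df['author']

# Encode author names as integer class labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Hold out 20% for validation; stratify keeps the author proportions equal in both splits
x_train, x_val, y_train, y_val = train_test_split(
    x, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

# The test set has no 'author' column, so it is used as-is
x_test = test_df.copy()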
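`LabelEncoder` assigns integer codes to the author names in sorted order. The sketch below mirrors that behaviour with plain NumPy so the mapping is easy to inspect; it is an illustration on made-up rows, not scikit-learn's implementation, and the four sample labels are assumptions taken from the example table above.

```python
import numpy as np

# Toy target column (hypothetical rows, names borrowed from the example table)
authors = np.array(["Mason Reed", "Ava Thompson", "Liam Carter", "Ava Thompson"])

# np.unique returns the sorted distinct labels; return_inverse=True also gives,
# for every original entry, its index in that sorted array.
classes, encoded = np.unique(authors, return_inverse=True)

print(classes)  # sorted author names
print(encoded)  # integer code per row

# Decoding (what inverse_transform does) is plain indexing into the class array
decoded = classes[encoded]
print(decoded)
```

The integer codes carry no ordering between authors; they only index the output units of the classifier.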

### Model training

In [4]:
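from tensorflow.keras import Input

# One output unit per author known to the label encoder
num_classes = len(label_encoder.classes_)

# Fully connected network with dropout after each hidden layer
model = Sequential([
    Input(shape=(x_train.shape[1],)),
    Dense(256, activation='relu'),
    Dropout(0.3),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(num_classes, activation='softmax')
])

# Labels are integer-encoded, so use the sparse variant of cross-entropy
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.fit(x_train, y_train, epochs=20, batch_size=32, validation_data=(x_val, y_val))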

Epoch 1/20
80/80 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.3808 - loss: 1.4499 - val_accuracy: 0.7531 - val_loss: 0.7116
Epoch 2/20
80/80 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.7454 - loss: 0.7367 - val_accuracy: 0.8172 - val_loss: 0.5492
Epoch 3/20
80/80 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.7979 - loss: 0.5950 - val_accuracy: 0.8172 - val_loss: 0.5277
Epoch 4/20
80/80 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.8291 - loss: 0.4910 - val_accuracy: 0.8109 - val_loss: 0.5334
Epoch 5/20
80/80 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.8543 - loss: 0.4358 - val_accuracy: 0.8359 - val_loss: 0.5286
Epoch 6/20
80/80 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.8753 - loss: 0.3678 - val_accuracy: 0.8219 - val_loss: 0.5619
Epoch 7/20
80/80 ━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x16b8401d0>
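The `loss` values in the log come from sparse categorical cross-entropy applied to the softmax output. As a rough illustration of what that quantity measures, the following NumPy sketch computes it by hand for two made-up samples. The logits, labels, and function names are assumptions for this example, and Keras additionally clips probabilities for numerical safety, which is omitted here.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true integer label."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())


# Hypothetical raw scores for 2 samples and 3 classes
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.2, 3.0]])
labels = np.array([0, 2])  # integer-encoded classes, as produced by a label encoder

probs = softmax(logits)
loss = sparse_categorical_crossentropy(probs, labels)

print(probs.round(3))
print(round(loss, 4))
```

A model that spreads probability evenly over `k` classes scores `log(k)`, which gives a useful baseline when reading the first-epoch loss.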

### Model Testing

In [5]:
# Evaluate on the validation split with macro-averaged F1
from sklearn.metrics import f1_score

val_preds = model.predict(x_val)
val_pred_labels = val_preds.argmax(axis=1)  # most probable class per row

f1 = f1_score(y_val, val_pred_labels, average='macro')
print("F1 Score (macro):", round(f1, 3))
print("Final Score:", round(f1 * 100, 1))


20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 650us/step
F1 Score (macro): 0.838
Final Score: 83.8
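Macro F1 averages the per-class F1 scores with equal weight, so it can differ from plain accuracy when some authors are predicted better than others. The following pure-Python sketch shows the calculation on made-up labels; `macro_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (a class with no support and no predictions scores 0)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical validation labels: one sample of class 0 is mistaken for class 1
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", round(accuracy, 3))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

In this toy case a single error lowers the F1 of two classes at once (recall for class 0, precision for class 1), which is why the two metrics diverge.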


### Predict test.csv

In [6]:
# Predict class probabilities for the test set and take the most likely class
y_test_pred_probs = model.predict(x_test)
y_test_pred_labels = y_test_pred_probs.argmax(axis=1)

# Map the integer labels back to author names
y_test_pred_names = label_encoder.inverse_transform(y_test_pred_labels)

submission = pd.DataFrame({
    'author': y_test_pred_names
})

submission

25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step


| |author|
|:---:|:---:|
|0|Olivia Bennett|
|1|Ethan Brooks|
|2|Liam Carter|
|3|Liam Carter|
|4|Olivia Bennett|
|...|...|
|794|Olivia Bennett|
|795|Liam Carter|
|796|Olivia Bennett|
|797|Ava Thompson|
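The step from softmax probabilities to author names can be pictured with a small NumPy sketch. The probability matrix and the three-author class array below are made up for illustration; in the notebook the class order comes from the fitted `label_encoder`.

```python
import numpy as np

# Hypothetical class order (sorted, as a label encoder would store it)
classes = np.array(["Ava Thompson", "Liam Carter", "Mason Reed"])

# Hypothetical softmax output: one row per test sample, one column per class
probs = np.array([
    [0.10, 0.20, 0.70],
    [0.80, 0.15, 0.05],
    [0.30, 0.45, 0.25],
])

pred_labels = probs.argmax(axis=1)  # index of the most probable class per row
pred_names = classes[pred_labels]   # same lookup that inverse_transform performs
confidence = probs.max(axis=1)      # probability assigned to the chosen class

for name, conf in zip(pred_names, confidence):
    print(f"{name}: {conf:.2f}")
```

Keeping the winning probability alongside the name can help spot low-confidence predictions, such as the third row here.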


![outro](outro.png)