# Modelling

https://www.datacamp.com/es/tutorial/introduction-to-convolutional-neural-networks-cnns

In [1]:
# Constants & Hyperparameters to define
RANDOM_SEED = 42

NUM_WORDS = 5000
MAX_SEQ_LEN = 100
EMBEDDING_DIM = 50
NUM_FILTERS = 64
KERNEL_SIZE = 5
NUM_CLASSES = 3

In [2]:
# Import Libraries
import pandas as pd
from tensorflow import keras
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

# Import functions
import sys
sys.path.append('../src')
from support_model import f1_score



In [3]:
train_data = pd.read_csv('../data/train_data_preprocessed.csv')
text_data = train_data['text']
labels = train_data['labels']

In [4]:
# Text preprocessing
tokenizer = Tokenizer(num_words=NUM_WORDS)
tokenizer.fit_on_texts(text_data)
sequences = tokenizer.texts_to_sequences(text_data)
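
To make the tokenization step concrete, here is a minimal pure-Python sketch of the idea behind `Tokenizer`: words are ranked by frequency, the most frequent word gets index 1 (index 0 is left free for padding), and only words whose index is below `num_words` are kept. The helper names `build_word_index` and `texts_to_sequences` and the toy texts are illustrative assumptions; the real class also strips punctuation and offers more options.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so ties keep their first-seen order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words with index >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


# Hypothetical toy corpus, not taken from the training data
toy_texts = ["the film was great", "the film was bad", "great great film"]
word_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, word_index, num_words=4)
print(word_index)
print(toy_sequences)
```

With `num_words=4` only the three most frequent words survive, which is why rarer words simply disappear from the sequences.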

In [5]:
# Padding sequences
sequences = pad_sequences(
    sequences, 
    maxlen=MAX_SEQ_LEN, 
    padding='post')
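
As a rough illustration of what the padding step does to a single sequence, the sketch below pads at the end with zeros (`padding='post'`) and, for sequences that are too long, keeps the last `maxlen` tokens, which mirrors the default `truncating='pre'` of `pad_sequences`. The function name `pad_post` is an assumption made for this example.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad at the end with `value`; if too long, keep the last `maxlen` tokens."""
    truncated = sequence[-maxlen:]  # mirrors the default truncating='pre'
    return truncated + [value] * (maxlen - len(truncated))


short_padded = pad_post([7, 3, 9], maxlen=5)
long_padded = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_padded)  # [7, 3, 9, 0, 0]
print(long_padded)   # [3, 4, 5, 6, 7]
```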

In [6]:
# One-hot encode labels
labels_encoded = to_categorical(
    labels, 
    num_classes=NUM_CLASSES)
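
One-hot encoding can be sketched in a few lines, assuming the labels are integers in the range `0` to `NUM_CLASSES - 1` (the actual label values in the CSV are not shown here). The helper `one_hot` and the toy labels are hypothetical.

```python
def one_hot(label, num_classes):
    """Return a vector with 1.0 at position `label` and 0.0 elsewhere."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row


toy_labels = [0, 2, 1]  # hypothetical integer class labels
toy_encoded = [one_hot(label, 3) for label in toy_labels]
print(toy_encoded)
```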

In [7]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    sequences, 
    labels, 
    test_size=0.2, 
    random_state=RANDOM_SEED)

In [8]:
y_train_encoded = to_categorical(
    y_train, 
    num_classes=NUM_CLASSES) 

In [15]:
print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('y_train_encoded shape:', y_train_encoded.shape)


X_train shape: (2880000, 100)
y_train shape: (2880000,)
y_train_encoded shape: (2880000, 3)


## Model
**Sequential convolutional neural network (CNN) for text classification**

In [11]:
# Define the CNN model
model = Sequential([
    Embedding(
        input_dim=NUM_WORDS,  # vocabulary size; must exceed the largest token index
        output_dim=EMBEDDING_DIM,
        input_length=MAX_SEQ_LEN),
    Conv1D(
        filters=NUM_FILTERS, 
        kernel_size=KERNEL_SIZE, 
        activation='relu', 
        padding='same'),
    MaxPooling1D(
        pool_size=4, 
        padding='same'),
    Flatten(),
    Dense(10, activation='relu'),
    Dense(NUM_CLASSES, activation='softmax')
])

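# Compile the model. fit() receives one-hot targets, so use 'categorical_crossentropy'
# (the sparse variant expects integer labels); metric classes must be instantiated.
model.compile(optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall(), f1_score,
             keras.metrics.CategoricalCrossentropy(), keras.metrics.AUC()])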



1. Embedding Layer
* `Embedding(input_dim=5000, output_dim=50, input_length=100)`
* `input_dim=5000`: The vocabulary size (`NUM_WORDS`). The layer stores one vector per token index, so this value must be larger than the highest index the tokenizer can emit (up to **5000** distinct indices).
* `output_dim=50`: The dimensionality of the embedding (`EMBEDDING_DIM`). Each token index is mapped to a dense **50**-dimensional vector.
* `input_length=100`: The length of every input sequence (`MAX_SEQ_LEN`), i.e. **100** tokens after padding or truncation.
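
An embedding layer is essentially a lookup table with one row per token index. The sketch below uses toy sizes (6 rows of 4 values instead of 5000 rows of 50 values) and random values in place of learned weights; `embedding_table` and `embed` are names invented for this illustration.

```python
import random

random.seed(42)
VOCAB_SIZE, EMBED_DIM = 6, 4  # toy sizes, stand-ins for NUM_WORDS and EMBEDDING_DIM

# One row per token index; a trained layer would hold learned values here
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(sequence):
    """Map each token index to its row in the table."""
    return [embedding_table[token] for token in sequence]


embedded = embed([3, 1, 0, 0])  # a padded toy sequence of length 4
print(len(embedded), len(embedded[0]))  # 4 4
```

Because the table has only `VOCAB_SIZE` rows, any token index equal to or above that size has no row to look up, which is why the first argument of the layer has to cover the whole vocabulary.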

2. Convolutional Layer
* `Conv1D(filters=64, kernel_size=5, activation='relu', padding='same')`: This 1D convolutional layer extracts local features from the embedded text sequences. With `padding='same'`, the output keeps the input length of 100 time steps.
* `filters=64`: This indicates the number of filters used to identify patterns in the text.
* `kernel_size=5`: This defines the size of the window that the filter slides over the text sequence (**5** words in this case).
* `activation='relu'`: This activation function introduces non-linearity, allowing the model to learn complex relationships between words.

    * `'relu'` means Rectified Linear Unit (ReLU). 
    * For any input value $x$, it outputs the value itself if it is positive ($x > 0$) and zero otherwise ($x \le 0$).
    * Mathematically, it can be represented as:
    * $f(x) = \max(0, x)$
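
The sliding-window idea and the ReLU activation can be sketched for a single filter over a single input channel. This is a simplification: the layer described above applies 64 filters, each spanning all embedding dimensions. The function name `conv1d_same` and the toy numbers are assumptions for illustration.

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def conv1d_same(signal, kernel, bias=0.0):
    """One filter, one channel, stride 1; zero padding keeps the output length."""
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [
        relu(sum(w * padded[i + j] for j, w in enumerate(kernel)) + bias)
        for i in range(len(signal))
    ]


signal = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0]  # toy input sequence
kernel = [0.5, 0.0, -1.0, 0.0, 0.5]        # toy filter of size 5
features = conv1d_same(signal, kernel)
print(features)
```

The output has as many positions as the input, and every negative pre-activation is clipped to zero.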

3. Pooling Layer
* `MaxPooling1D(pool_size=4, padding='same')`: This layer downsamples the sequence by taking the maximum value from every window of size **4**, which shortens it from 100 to 25 time steps. Keeping only the strongest activation in each window reduces the number of weights in the following layers, which can help limit overfitting.
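
A minimal sketch of max pooling over one feature channel, assuming non-overlapping windows (stride equal to the pool size, which is the Keras default). The helper `max_pool1d` is a made-up name. When the length is a multiple of the pool size, as with 100 and 4, no padding is needed; for other lengths the exact window alignment used by TensorFlow with `padding='same'` may differ from this simplified version.

```python
import math


def max_pool1d(values, pool_size):
    """Take the maximum of each consecutive window of `pool_size` values."""
    n_out = math.ceil(len(values) / pool_size)
    return [max(values[i * pool_size:(i + 1) * pool_size]) for i in range(n_out)]


pooled = max_pool1d([0.1, 0.9, 0.3, 0.2, 0.7, 0.4, 0.0, 0.6], pool_size=4)
print(pooled)  # [0.9, 0.7]

# A sequence of 100 steps shrinks to 25 steps
pooled_length = len(max_pool1d(list(range(100)), pool_size=4))
print(pooled_length)  # 25
```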

4. Flattening Layer
* `Flatten()`: This layer reshapes the 2D output of the pooling layer (25 time steps × 64 filters per sample) into a 1D vector of 1600 values suitable for feeding into the fully connected layers.
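
The per-sample shapes up to this point can be traced with plain arithmetic, using the constants defined at the top of the notebook and assuming `padding='same'` in both the convolution and the pooling layer. The variable names below are illustrative only.

```python
import math

MAX_SEQ_LEN, EMBEDDING_DIM = 100, 50
NUM_FILTERS, POOL_SIZE = 64, 4

embedding_shape = (MAX_SEQ_LEN, EMBEDDING_DIM)  # one vector per token
conv_shape = (MAX_SEQ_LEN, NUM_FILTERS)         # 'same' padding keeps the length
pooled_shape = (math.ceil(MAX_SEQ_LEN / POOL_SIZE), NUM_FILTERS)
flat_size = pooled_shape[0] * pooled_shape[1]   # Flatten() multiplies the two axes

print(embedding_shape, conv_shape, pooled_shape, flat_size)
```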

5. Fully Connected Layers
* `Dense(10, activation='relu')`: This first fully connected layer has **10** neurons and uses the ReLU activation function. It learns higher-level features by combining the features extracted by the convolutional and pooling layers.

* `Dense(3, activation='softmax')`: This final fully connected layer has 3 neurons and uses the softmax activation function. It outputs a probability distribution over 3 categories, making it suitable for multi-class classification tasks (e.g., classifying text into 3 different genres).

    * `'softmax'`: For each element $x_i$ of the input vector, softmax calculates the probability $p_i$ using the following formula:
    * $p_i = \dfrac{\exp(x_i)}{\sum_j \exp(x_j)}$, where $j$ runs over all elements of the vector
    * Here, $\exp(x_i)$ is the exponential of the $i$-th element of the input vector.
    * $\sum_j \exp(x_j)$ is the sum of the exponentials of all elements in the vector.
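
The softmax formula can be checked with a short pure-Python sketch. Subtracting the maximum before exponentiating is a common numerical-stability trick that leaves the result unchanged; it is an addition for this example, not something stated above.

```python
import math


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(logits)  # avoids overflow in exp() without changing the result
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # toy scores for 3 classes
print(probs, sum(probs))
```

The largest score receives the largest probability, and the three outputs add up to 1.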

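6. Compiling the Model
* `model.compile(optimizer=..., loss=..., metrics=[...])`: This compiles the model by specifying the optimizer (Adam, short for Adaptive Moment Estimation), the loss function, and the metrics tracked during training and evaluation (accuracy, precision, recall, F1 score, categorical crossentropy, and AUC). For multi-class classification the loss is a categorical crossentropy: `'sparse_categorical_crossentropy'` expects integer class labels such as `y_train`, while `'categorical_crossentropy'` expects one-hot labels such as `y_train_encoded`, so the loss and the targets passed to `fit` must match.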
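
The two crossentropy variants compute the same value and differ only in the label format they accept. The sketch below shows this for a single sample; the function names mirror the Keras loss names for readability, but they are simplified stand-ins written for this example, not the library implementations.

```python
import math


def categorical_crossentropy(one_hot_label, probs):
    """Expects a one-hot label vector, e.g. [0.0, 1.0, 0.0]."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))


def sparse_categorical_crossentropy(class_index, probs):
    """Expects an integer class index, e.g. 1."""
    return -math.log(probs[class_index])


predicted = [0.7, 0.2, 0.1]  # toy softmax output for one sample
loss_dense = categorical_crossentropy([0.0, 1.0, 0.0], predicted)
loss_sparse = sparse_categorical_crossentropy(1, predicted)
print(loss_dense, loss_sparse)
```

Both calls return the negative log of the probability assigned to the true class, so passing one-hot targets to the sparse variant (or integer targets to the dense one) is a format mismatch rather than a different objective.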

### Train the Model

In [23]:
# Train model
model.fit(X_train, y_train_encoded, epochs=10, validation_split=0.2)

Epoch 1/10


AttributeError: 'NoneType' object has no attribute 'items'


In [None]:
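# Returned values follow the order of `metrics` in model.compile(); y_test is one-hot encoded like the training targets.
# y_test_encoded = to_categorical(y_test, num_classes=NUM_CLASSES)
# model_loss, model_accuracy, model_precision, model_recall, model_f1_score, model_categorical_crossentropy, model_auc = model.evaluate(X_test, y_test_encoded)
# print("F1 Score:", model_f1_score)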