# Convert Dataset into Vectors
The dataset we'll use combines two open-source datasets curated by [The University of California, Irvine (UCI)](https://archive.ics.uci.edu). Learn how to [prepare this dataset here](https://github.com/codingforentrepreneurs/AI-Microservice-from-Scratch/blob/main/guides/Prepare%20the%20AI%20Spam%20Classifier%20Dataset.ipynb).

#### Requirements
- Python
- Jupyter (Set up with [this video](https://www.youtube.com/watch?v=9tPS-7TWjq0))
- Pandas
- `scikit-learn` (aka `sklearn`)
- TensorFlow

### Step 1: Load Dataset

In [1]:
import pandas as pd
import os
import pathlib

In [2]:
USE_PROJECT_ROOT = True
BASE_DIR = pathlib.Path().resolve()
if USE_PROJECT_ROOT:
    BASE_DIR = BASE_DIR.parent.parent
DATASET_DIR = BASE_DIR / "datasets"
EXPORT_DIR = DATASET_DIR / "exports"
DATASET_CSV_PATH = EXPORT_DIR / 'spam-dataset.csv'
TRAINING_DATA_PATH = EXPORT_DIR / 'spam-training-data.pkl'
print("BASE_DIR is", BASE_DIR)

BASE_DIR is /Users/albertsalgueda/Desktop/prodAI


In [3]:
RUN_DATASET_PREPARE = False
if RUN_DATASET_PREPARE:
    # if active, this will download and prepare the dataset.
    SOURCE_NB = pathlib.Path('1 - Prepare the AI Spam Classifier Dataset.ipynb')
    if SOURCE_NB.exists():
        %run './{SOURCE_NB}'
    else:
        print(f"{SOURCE_NB.name} does not exist.")

In [4]:
if not DATASET_CSV_PATH.exists():
    raise Exception(f"You must download or create the spam-dataset.csv \n{DATASET_CSV_PATH} not found.")

In [5]:
df = pd.read_csv(str(DATASET_CSV_PATH))
df.head()

Unnamed: 0,label,text,source
0,ham,"Go until jurong point, crazy.. Available only ...",uci-spam-sms
1,ham,Ok lar... Joking wif u oni...,uci-spam-sms
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,uci-spam-sms
3,ham,U dun say so early hor... U c already then say...,uci-spam-sms
4,ham,"Nah I don't think he goes to usf, he lives aro...",uci-spam-sms


### Step 2: Convert Dataset to Lists

In [6]:
texts = df['text'].tolist()
labels = df['label'].tolist()

Now we need to map our labels from being text values to being integer values. It's pretty simple:

In [7]:
labels_legend = {'ham': 0, 'spam': 1}
labels_legend_inverted = {f"{v}":k for k,v in labels_legend.items()}

The inverted legend is there to help us when we need to add a label to our predictions later.
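
As a minimal sketch of how the inverted legend might be used later, assume a model returns one score per class. The `prediction` list below is made-up data, not real model output. The index of the highest score is looked up in the inverted legend:

```python
labels_legend = {'ham': 0, 'spam': 1}
labels_legend_inverted = {f"{v}": k for k, v in labels_legend.items()}

# Hypothetical model output: one score per class, in legend order.
prediction = [0.08, 0.92]

# Index of the highest score (a plain-Python argmax).
top_idx = max(range(len(prediction)), key=lambda i: prediction[i])

# The inverted legend uses string keys, so convert the index first.
predicted_label = labels_legend_inverted[str(top_idx)]
print(predicted_label)  # spam
```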

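What's cool is that you can create this legend automatically. Sorting the unique labels keeps the mapping the same from one run to the next (a plain `set` has no guaranteed order):

```python
legend = {f"{x}": i for i, x in enumerate(sorted(set(labels)))}
legend_inverted = {f"{v}":k for k,v in legend.items()}
```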

In [8]:
labels_as_int =  [labels_legend[str(x)] for x in labels]

__Verify the indices__

It's important that our indices are still correct since this is the data we'll be using to train our model.

In [9]:
import random
random_idx = random.randint(0, len(texts) - 1)  # randint includes both endpoints
print('Random Index', random_idx)

assert texts[random_idx] == df.iloc[random_idx].text
assert labels[random_idx] == df.iloc[random_idx].label
assert labels_legend_inverted[str(labels_as_int[random_idx])] == labels[random_idx]

Random Index 1491
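
A single random index only spot-checks one row. As a small sketch on stand-in data (the `sample_*` names are made up for illustration), the same round trip can be checked for every row at once:

```python
sample_labels = ['ham', 'spam', 'ham', 'ham', 'spam']
sample_legend = {'ham': 0, 'spam': 1}
sample_legend_inverted = {f"{v}": k for k, v in sample_legend.items()}

# Text label -> integer -> text label again.
sample_as_int = [sample_legend[str(x)] for x in sample_labels]
round_trip = [sample_legend_inverted[str(i)] for i in sample_as_int]

# Every row should come back unchanged and in the same position.
assert round_trip == sample_labels
print(sample_as_int)  # [0, 1, 0, 0, 1]
```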


### Step 3: Tokenize Texts

The Keras `Tokenizer` will convert our raw text into vectors of integers, where each integer stands for a word. Converting texts to vectors is a required step for any machine learning model (not just Keras).

In [10]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [11]:
MAX_NUM_WORDS=280

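> `MAX_NUM_WORDS` borrows the 280 figure from the current max length of any given post (tweet) on Twitter, which is measured in characters; SMS texts are typically limited to 160 characters. Note that the `num_words` argument of the Keras `Tokenizer` caps the vocabulary (only the most frequent words are kept when texts are converted to sequences); it does not limit how long a message can be.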

In [12]:
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 9538 unique tokens.
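
The count above covers every distinct word in `word_index`, which is larger than `MAX_NUM_WORDS`. The following is a simplified, pure-Python sketch of the idea, not the Keras implementation; the helper names `build_word_index` and `to_sequences` are hypothetical. Words are ranked by frequency starting at index 1, and only words whose index is below `num_words` survive when texts are converted to sequences.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace each word with its index, dropping words ranked at or above num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

sample_texts = ["free entry free prize", "ok see you later", "free prize inside"]
sample_index = build_word_index(sample_texts)
sample_sequences = to_sequences(sample_texts, sample_index, num_words=3)

print(len(sample_index))   # 8 distinct words in the full index
print(sample_sequences)    # [[1, 1, 2], [], [1, 2]]
```

In this toy run the full index still holds all 8 words, but the sequences only ever contain indices 1 and 2.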


In [13]:
assert len(sequences) == len(texts) == len(labels_as_int)

### Step 4: Create `X`, `y` training sets

In machine learning, it's common to denote the training inputs as `X` and their corresponding labels (the outputs) as `y`. 

Let's start with the `X` data (aka the text) by padding all of our tokenized sequences. This ensures all training inputs are the same shape (aka size). 

Sentences, paragraphs, and conversations are rarely the same length. Some will match by chance, but never all of them. Even so, we want to categorize every sentence (or paragraph) as either `spam` or `ham`, which means turning data of arbitrary length into data of a known length.

This means we have two challenges:
- Matrix multiplication has strict rules.
- Spoken or written language rarely adheres to strict rules.

What to do?

`X` is a new representation of the `text` from our raw dataset. As stated above, there's very little chance that every item in this group is exactly the same length, so we'll use the built-in tool called `pad_sequences` to correct for the inconsistent lengths. This length is actually called shape because of its roots in linear algebra (matrix multiplication).

In [14]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_SEQUENCE_LENGTH = 280

X = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
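
To make the padding step concrete, here is a rough pure-Python sketch of what `pad_sequences` does with its default settings: zeros are added at the front of short sequences, and long sequences lose items from the front. The `pad_left` helper is hypothetical and only for illustration.

```python
def pad_left(seqs, maxlen, value=0):
    """Pad or truncate at the front so every sequence has length maxlen."""
    padded = []
    for seq in seqs:
        trimmed = seq[-maxlen:]                    # keep the last maxlen items
        fill = [value] * (maxlen - len(trimmed))   # filler goes in front
        padded.append(fill + trimmed)
    return padded

sample_sequences = [[5, 2], [7, 1, 4, 9, 3, 8], [6]]
sample_X = pad_left(sample_sequences, maxlen=4)

for row in sample_X:
    print(row)
# [0, 0, 5, 2]
# [4, 9, 3, 8]
# [0, 0, 0, 6]
```

Every row now has the same length, which is what lets the data be treated as a single matrix.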

Now we convert our `labels_as_int` into a corresponding matrix (instead of just a list of ints) by using the built-in `to_categorical` function. The number of labels does not have to be 2 (as we have), but it should be at least 2.

In [15]:
import numpy as np
from tensorflow.keras.utils import to_categorical

y = to_categorical(np.asarray(labels_as_int))
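
For intuition, this kind of one-hot encoding can be sketched in plain Python. It only illustrates the shape of the result (the `one_hot` helper is hypothetical); `to_categorical` returns a NumPy array rather than nested lists.

```python
def one_hot(int_labels, num_classes=None):
    """Turn integer class ids into rows with a single 1.0 at the class position."""
    if num_classes is None:
        num_classes = max(int_labels) + 1
    return [
        [1.0 if col == label else 0.0 for col in range(num_classes)]
        for label in int_labels
    ]

sample_labels = [0, 1, 1, 0]   # 0 = ham, 1 = spam
sample_y = one_hot(sample_labels)
print(sample_y)
# [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```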

### Step 5: Split our Training Data

If we trained on all of our data, our model would fit that training data very *well*, but we'd have no way of knowing how it performs on new data. A model that only does well on data it has already seen is mostly useless.

Since we have the `X` and `y` designations, we split the data into at least 2 corresponding sets: training data and test (validation) data for each designation, resulting in `X_train`, `X_test`, `y_train`, `y_test`.

An easy way (but not the only way) is to use `scikit-learn` for this:

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

As we'll see soon, the test sets (aka `X_test` and `y_test`) are used to evaluate how well our AI model is learning (aka its performance). This means it's often a good idea to save the test sets for future training runs instead of splitting the data all over again. Using the same test set over and over shows how our model's performance changes over time.
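
The idea behind a reproducible split can be sketched without `scikit-learn`. This is not how `train_test_split` is implemented; `simple_split` is a hypothetical helper that shuffles row indices with a fixed seed so the same rows land in the test set on every run.

```python
import random

def simple_split(X, y, test_size=0.33, seed=42):
    """Shuffle row indices with a fixed seed, then slice off the test portion."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)

sample_X = [[i, i + 1] for i in range(9)]
sample_y = [i % 2 for i in range(9)]

first = simple_split(sample_X, sample_y)
second = simple_split(sample_X, sample_y)

print(len(first[0]), len(first[1]))  # 6 3
print(first == second)               # True
```

Because the same indices are used for inputs and labels, each row of `X` stays paired with its label after the shuffle.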

### Step 6: Export our Training Data

For this step, we'll use Python's built-in `pickle` module. Pickle is *not secure*, so only open pickles that you create yourself. I'm using it as a way to pass data between Jupyter notebooks.

In [17]:
import pickle

In [18]:
training_data = {
    'X_train': X_train,
    'X_test': X_test,
    'y_train': y_train,
    'y_test': y_test,
    'max_words': MAX_NUM_WORDS,
    'max_sequence': MAX_SEQUENCE_LENGTH,
    'legend': labels_legend,
    'labels_legend_inverted': labels_legend_inverted,
    "tokenizer": tokenizer,
}

In [19]:
with open(TRAINING_DATA_PATH, 'wb') as f:
    pickle.dump(training_data, f)

In [20]:
data = {}

with open(TRAINING_DATA_PATH, 'rb') as f:
    data = pickle.load(f)
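
A quick way to confirm the export worked is to compare what was loaded with what was saved. The sketch below uses a small stand-in dictionary and a temporary file rather than the real `training_data`, so the keys and values are illustrative only.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for training_data; the real dict also holds arrays and the tokenizer.
sample_data = {
    'max_words': 280,
    'max_sequence': 280,
    'legend': {'ham': 0, 'spam': 1},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / 'sample-training-data.pkl'
    with open(sample_path, 'wb') as f:
        pickle.dump(sample_data, f)
    with open(sample_path, 'rb') as f:
        loaded = pickle.load(f)

# The reloaded object should match the original key for key.
assert loaded.keys() == sample_data.keys()
assert loaded == sample_data
print(sorted(loaded))  # ['legend', 'max_sequence', 'max_words']
```

For the NumPy arrays in the real dictionary, `==` compares element by element, so something like `np.array_equal` is the safer comparison there.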