# Capstone Project: Slogan Classifier and Generator

In this capstone project you will train a Long Short-Term Memory (LSTM) model to generate slogans for businesses based on their industry, and also train a classifier to predict the industry based on a given slogan.

## Libraries
We recommend running this notebook in [Google Colab](https://colab.google/). If you choose to use your local machine instead, you will need to install spaCy before starting.

To install spaCy, refer to the installation instructions provided on the spaCy [website](https://spacy.io/usage). Note that you may need an older version of Python that is compatible with spaCy. Creating a virtual environment for this project lets you install the specific Python version you need.

In [1]:
# Core Imports

import pandas as pd
import numpy as np
import tensorflow as tf
import os
from pathlib import Path
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam
import spacy 
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical




2025-10-11 00:51:56.195127: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Loading and viewing the dataset

- Load the slogan dataset into a variable called data.
- Extract relevant columns in a variable called df.
- Handle missing values.

Do **not** change the column names.

If you are using Google Colab you will need to mount your Google Drive as follows:  
`from google.colab import drive`  
`drive.mount('/content/drive')`  

The path you use when loading your data will look something like this if you are using your Google Drive:  
"/content/drive/MyDrive/Colab Notebooks/slogan-valid.csv"

In [2]:
# Load data from local machine
# Goal: create a pandas DataFrame named df.
# read from a local path or a small set of fallbacks.

# Local filesystem (default) 
#  If you know your file path, set it here:
DATA_PATH = os.environ.get("DATA_PATH", "").strip()  # you can export DATA_PATH or leave blank

# If DATA_PATH is empty, I try a few sensible candidates.
CANDIDATES = [
    DATA_PATH,
    "/mnt/data/slogan-valid.csv",     # common path if working in a hosted VM / uploaded file
    "./slogan-valid.csv",             # current directory
    "../slogan-valid.csv",            # parent directory
]

def _read_csv_safely(path: Path) -> pd.DataFrame:
    """
    Try reading a CSV with utf-8 first, then fall back to latin-1.
    Returns a DataFrame on success or raises the last exception.
    """
    last_err = None
    for enc in ("utf-8", "latin-1"):
        try:
            return pd.read_csv(path, encoding=enc)
        except Exception as e:
            last_err = e
    # If attempts failed, raise the last encountered error
    raise last_err

df = None
for c in CANDIDATES:
    p = Path(c) if c else None
    if p and p.is_file():
        df = _read_csv_safely(p)
        print(f"[info] Loaded CSV from: {p}")
        break

# Final checks 
if df is None:
    raise FileNotFoundError(
        "CSV not found. Set DATA_PATH to your file or place 'slogan-valid.csv' in the working folder. "
        "If using Colab, mount Drive and point to the Drive path."
    )

print(f"[info] DataFrame shape: {df.shape}")
print(f"[info] Columns: {list(df.columns)}")



[info] Loaded CSV from: slogan-valid.csv
[info] DataFrame shape: (5346, 12)
[info] Columns: ['desc', 'output', 'type', 'company', 'industry', 'url', 'alias', 'desc_masked', 'output_masked', 'ent_dict', 'unsupported', 'first_pos']


## Data Preprocessing

Since we are working with textual data, we need software that understands natural language. For this, we'll use a library for processing text called **spaCy**. Using spaCy, we'll break the text into smaller units called tokens that are easier for the machine to process. This process is called **tokenisation**. We'll also convert all text to lowercase and remove punctuation because this information is not necessary for our models. Run the code below, and your dataframe (df) will gain a new column called **'processed_slogan'** which contains the preprocessed text.




In [3]:
# Load spaCy model for text processing
nlp = spacy.load("en_core_web_sm")

# Define text preprocessing function
def preprocess_text(text):
    text_lower = text.lower()
    doc = nlp(text_lower)

    processed_tokens = []

    for token in doc:
        if not token.is_punct:
            processed_tokens.append(token.text)

    return " ".join(processed_tokens)

df["processed_slogan"] = df["output"].apply(preprocess_text)

df.head()

Unnamed: 0,desc,output,type,company,industry,url,alias,desc_masked,output_masked,ent_dict,unsupported,first_pos,processed_slogan
0,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,headline_long,eftpos warehouse,computer hardware,eftposwarehouse.co.nz,Eftpos Warehouse,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,{'[date]': 'monthly'},False,VB,taking care of small business technology
1,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,headline,welbi,"health, wellness and fitness",welbi.co,Welbi,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,{},False,VB,build world class recreation programs
2,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,headline_long,optinmonster,internet,optinmonster.com,Optinmonster,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,{},False,JJ,most powerful lead generation software for mar...
3,Twine matches companies to the best digital an...,Hire quality freelancers for your job,headline_long,twine.fm,internet,twine.fm,,Twine matches companies to the best digital an...,Hire quality freelancers for your job,"{'[number]': 'over 260,000'}",False,VB,hire quality freelancers for your job
4,"Financial Advisers Norwich, Norfolk - <company...","Financial Advisers Norwich, Norfolk",headline,mcb financial services ltd,financial services,mcbfinancialservices.co.uk,Mcb Financial Services,"Financial Advisers [country], [country1] - <co...","Financial Advisers [country], [country1]","{'[country]': 'Norwich', '[country1]': 'Norfolk'}",False,NN,financial advisers norwich norfolk


We want our model to generate **industry-specific** slogans. If we use the 'processed_slogan' column as it is, we'll be leaving out crucial context: the industries of the companies behind those slogans. To fix this, we'll create a new **'modified_slogan'** column that adds the industry name to the front of the processed slogan.  

For example:  

> industry = 'computer hardware'  
processed_slogan = 'taking care of small business technology'  
modified_slogan = 'computer hardware taking care of small business technology'

Write code in the cell below to achieve this.

In [4]:
# Create 'modified_slogan' by prefixing the industry to the processed slogan

# Identify the industry column name - common variants.
IND_COL = None
for cand in ["industry", "Industry", "sector", "category"]:
    if cand in df.columns:
        IND_COL = cand
        break
if IND_COL is None:
    raise KeyError("Could not find an industry column. Expected one of: 'industry', 'Industry', 'sector', 'category'.")

# Sanity check: ensure 'processed_slogan' exists.
if "processed_slogan" not in df.columns:
    raise KeyError("Expected column 'processed_slogan' not found. Run the preprocessing cell first.")

# Build a helper that joins 'industry' + 'processed_slogan' cleanly.
#    - Lowercase industry for consistency with processed text
#    - Handle missing values
#    - Collapse extra spaces
def make_modified_slogan(industry_val, processed):
    ind = str(industry_val).strip().lower() if pd.notna(industry_val) else ""
    txt = str(processed).strip() if pd.notna(processed) else ""
    # If industry is missing, just return the processed text
    if not ind:
        return txt
    # If processed text is missing, return just the industry
    if not txt:
        return ind
    # Otherwise, prefix "industry" + space + processed text
    return f"{ind} {txt}"

# Create the new column.
df["modified_slogan"] = df[[IND_COL, "processed_slogan"]].apply(
    lambda row: make_modified_slogan(row[IND_COL], row["processed_slogan"]),
    axis=1
)

#  Quick preview to verify the transformation.
df[[IND_COL, "processed_slogan", "modified_slogan"]].head()


Unnamed: 0,industry,processed_slogan,modified_slogan
0,computer hardware,taking care of small business technology,computer hardware taking care of small busines...
1,"health, wellness and fitness",build world class recreation programs,"health, wellness and fitness build world class..."
2,internet,most powerful lead generation software for mar...,internet most powerful lead generation softwar...
3,internet,hire quality freelancers for your job,internet hire quality freelancers for your job
4,financial services,financial advisers norwich norfolk,financial services financial advisers norwich ...


Now we need to get data to train our model. We have textual data which we will need to represent numerically for our model to learn from it.  
The code below does the following:
1. Tokenizes a dataset of slogans.
2. Converts words to numerical indices.
3. Creates input sequences using the numerical indices.  

Here's how it works. From the 'modified_slogan' column, we take the slogan "computer hardware taking care of small business technology". The tokenisation process will convert words into their corresponding indices:  

<center>

| Word         | Token Index |
|-------------|-------|
| "computer"  | 1     |
| "hardware"  | 2     |
| "taking"    | 3     |
| "care"      | 4     |
| "of"        | 5     |
| "small"     | 6     |
| "business"  | 7     |
| "technology"| 8     |

</center>

So the tokenized list is:

<center>
[1, 2, 3, 4, 5, 6, 7, 8]
</center>

When creating input sequences for training, the loop generates progressively longer sequences.

<center>

| Token Index Sequence               | Corresponding Slogan                                 |
|------------------------------|-----------------------------------------------------|
| [1, 2]                       | "computer hardware"                                |
| [1, 2, 3]                    | "computer hardware taking"                        |
| [1, 2, 3, 4]                 | "computer hardware taking care"                   |
| [1, 2, 3, 4, 5]              | "computer hardware taking care of"                |
| [1, 2, 3, 4, 5, 6]           | "computer hardware taking care of small"          |
| [1, 2, 3, 4, 5, 6, 7]        | "computer hardware taking care of small business" |
| [1, 2, 3, 4, 5, 6, 7, 8]     | "computer hardware taking care of small business technology" |

</center>

Instead of training the model on only **complete slogans**, we provide partial phrases which will help the model learn how words connect over time. This will make it better at predicting the next word when generating slogans.  
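The prefix-building idea can be tried in isolation before running it on the full dataset. The sketch below is a minimal, standalone illustration: `toy_word_index` and `build_prefix_sequences` are hypothetical names introduced here, and the hand-written dictionary stands in for a fitted tokenizer.

```python
# Hypothetical toy vocabulary standing in for a fitted tokenizer's word index
toy_word_index = {
    "computer": 1, "hardware": 2, "taking": 3, "care": 4,
    "of": 5, "small": 6, "business": 7, "technology": 8,
}

def build_prefix_sequences(text, word_index):
    """Return every prefix of the token list that holds at least two tokens."""
    # Words missing from the vocabulary are skipped
    token_list = [word_index[w] for w in text.split() if w in word_index]
    return [token_list[: i + 1] for i in range(1, len(token_list))]

slogan = "computer hardware taking care of small business technology"
prefixes = build_prefix_sequences(slogan, toy_word_index)

for seq in prefixes:
    print(seq)
```

An eight-token slogan yields seven prefixes, from `[1, 2]` up to the full sequence, matching the rows of the table above.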

Run the cell block below to generate the input sequences. Be sure to read the comments to understand what the code is doing.


In [5]:
# Tokenizer to convert words into numerical tokens
tokenizer = Tokenizer()

# Tokenizer learns words in dataset
tokenizer.fit_on_texts(df["modified_slogan"])

# Total number of unique words in learned vocabulary
total_words = len(tokenizer.word_index) + 1

# Dictionary mapping each word to its numerical index; more frequent words get lower indices
tokenizer.word_index

# Creating input sequences
# Initialise list to store the input sequences
input_sequences = []

# Iterate over processed slogans
for line in df["modified_slogan"]:

    # Convert slogans to token sequences
    token_list = tokenizer.texts_to_sequences([line])[0]  # returns a list of lists of word indices; [0] extracts the inner list

    # token_list is a list of tokenized word INDICES
    # Building list of progressively longer input sequences for better training
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i+1])

The input sequences created above are of **varying lengths**, which will be a problem when training our LSTM model. LSTMs require input sequences of **equal length**. So, we need to **pad** shorter sequences by **prepending zeros** until they match the length of the longest sequence.  

For example, if the longest sequence has **10 tokens**, our padded sequences will look like this:

<center>

| Input Sequence                     | Padded Sequence                         |
|-------------------------------------|-----------------------------------------|
| [1, 2]                              | [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]         |
| [1, 2, 3]                           | [0, 0, 0, 0, 0, 0, 0, 1, 2, 3]         |
| [1, 2, 3, 4]                        | [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]         |
| [1, 2, 3, 4, 5]                     | [0, 0, 0, 0, 0, 1, 2, 3, 4, 5]         |
| [1, 2, 3, 4, 5, 6]                  | [0, 0, 0, 0, 1, 2, 3, 4, 5, 6]         |
| [1, 2, 3, 4, 5, 6, 7]               | [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]         |
| [1, 2, 3, 4, 5, 6, 7, 8]            | [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]         |

</center>
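To see what pre-padding does without any library, here is a small pure-Python sketch. `pad_pre` is a hypothetical helper written for this illustration; it is meant to mirror the effect of left-padding with zeros on plain lists, not to reproduce the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen`.

    Sequences longer than `maxlen` keep only their last `maxlen` items.
    """
    if maxlen <= 0:
        raise ValueError("maxlen must be a positive integer")
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

toy_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8]]
toy_padded = pad_pre(toy_sequences, maxlen=10)

for row in toy_padded:
    print(row)
```

Every row now has the same length, and the real tokens sit at the right-hand end, closest to the position the model is asked to predict.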

In the cell below, write code that **finds the length of the longest sequence** in **input_sequences** and stores this value in a variable named **max_seq_len**.


In [6]:
# Find the length of the longest token sequence
# input_sequences is a list of lists - each inner list is a sequence of token IDs

if not input_sequences:
    raise ValueError("input_sequences is empty. Make sure the tokenisation cell ran and produced sequences.")

# Compute the maximum length across all sequences
max_seq_len = max(len(seq) for seq in input_sequences)

print(f"[info] Number of sequences: {len(input_sequences):,}")
print(f"[info] Longest sequence length (max_seq_len): {max_seq_len}")


[info] Number of sequences: 34,736
[info] Longest sequence length (max_seq_len): 15


Run the cell below to pad the input sequences so they are all the same length as **max_seq_len**.

In [7]:
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_len, padding="pre")

## Training Data for Slogan Generator

The input sequences generated will be used as our training data. Our LSTM needs to learn how to predict the **next word** in a sequence.  

The inputs for our model will be the input sequences **excluding the last token index** and the outputs will be the **last token index**.  

As an example, let us use the input sequence [0, 0, 1, 2, 3, 4, 5, 6, 7, 8] and say it corresponds to the slogan "computer hardware taking care of small business technology". When training the model:

> Our input **x** will be the input sequence [0, 0, 1, 2, 3, 4, 5, 6, 7] corresponding to "computer hardware taking care of small business".  
> Our output **y** will be [8] which corresponds to "technology".  
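The split into inputs and targets is just "everything but the last item" and "the last item". A minimal sketch on a few hand-written padded rows follows; `toy_padded`, `toy_X` and `toy_y` are illustrative names, not variables from this notebook.

```python
# Hand-written padded sequences, used only for illustration
toy_padded = [
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [0, 0, 0, 0, 0, 0, 0, 1, 2, 3],
    [0, 0, 1, 2, 3, 4, 5, 6, 7, 8],
]

# Inputs: every token except the last one
toy_X = [seq[:-1] for seq in toy_padded]
# Targets: the last token of each sequence
toy_y = [seq[-1] for seq in toy_padded]

print(toy_X[-1])
print(toy_y)
```

If the sequences are held in a two-dimensional NumPy array `arr`, the same split can be written with slicing as `arr[:, :-1]` and `arr[:, -1]`.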

In the code cell below, use `input_sequences` to create the following two variables:
1. **X_gen** which contains the input sequences excluding the last token index.
2. **y_gen** which contains the last token index of the input sequence.

In [8]:
# Build training pairs for the slogan GENERATOR
# For each token sequence [w1, w2, ..., wN]:
#   X_gen gets the prefix [w1, w2, ..., w(N-1)]
#   y_gen gets the next-token target wN

#  Ensure input_sequences is a Python list of lists
if isinstance(input_sequences, np.ndarray):
    input_sequences = input_sequences.tolist()

if input_sequences is None or len(input_sequences) == 0:
    raise ValueError("input_sequences is empty. Re-run the tokenisation step that builds input_sequences.")

# Split into prefixes (inputs) and next-token targets (outputs)
X_gen_raw, y_gen = [], []
for seq in input_sequences:
    # need at least length 2 to form (prefix, target)
    if len(seq) < 2:
        continue
    # all but the last token   
    X_gen_raw.append(seq[:-1])  
    # the last token is the label
    y_gen.append(seq[-1])          

# Convert targets to a compact int array
y_gen = np.array(y_gen, dtype=np.int32)

# Figure out padding length for prefixes
# If you computed max_seq_len earlier on FULL sequences, prefixes are length (max_seq_len - 1)
if "max_seq_len" not in globals():
    max_seq_len = max(len(s) for s in input_sequences)

if max_seq_len <= 1:
    raise ValueError("max_seq_len is <= 1. Check your tokenisation; sequences are too short to train a next-word model.")

# Pad prefixes to equal length for LSTM input
X_gen = pad_sequences(
    X_gen_raw,
    maxlen=max_seq_len - 1,   # pad to longest prefix length
    padding="pre",            # prepend zeros (as per brief)
    truncating="pre"
)

print(f"[info] Training samples: {X_gen.shape[0]:,}")
print(f"[info] X_gen shape: {X_gen.shape}  (timesteps = {X_gen.shape[1]})")
print(f"[info] y_gen shape: {y_gen.shape}  (int token IDs)")
# Quick peek
X_gen[:2], y_gen[:2]


[info] Training samples: 34,736
[info] X_gen shape: (34736, 14)  (timesteps = 14)
[info] y_gen shape: (34736,)  (int token IDs)


(array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          11],
        [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  11,
         236]], dtype=int32),
 array([ 236, 2708], dtype=int32))

The model outputs a probability distribution over the vocabulary for the next word in a sequence. For the targets to be comparable with that output, we need to encode them in the same form.
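One-hot encoding is simple enough to write by hand, which can help make the target format concrete. The sketch below uses a hypothetical `one_hot` helper and a toy vocabulary of five classes; it illustrates the idea and is not the library routine.

```python
def one_hot(indices, num_classes):
    """Turn each integer index into a row of zeros with a single 1.0."""
    rows = []
    for idx in indices:
        row = [0.0] * num_classes
        row[idx] = 1.0
        rows.append(row)
    return rows

toy_targets = [2, 0, 3]
toy_encoded = one_hot(toy_targets, num_classes=5)

for row in toy_encoded:
    print(row)
```

Each row has exactly one non-zero entry, at the position of the target word's index, so it can be read as a probability distribution that puts all its mass on the correct word.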

In the code cell below, write code that will apply one-hot encoding to **y_gen** using `tf.keras.utils.to_categorical()`. **Maintain the same variable name**.  

*Hint: set the `num_classes` (number of classes) parameter to the total number of unique words in the learned vocabulary. You can access this value through a variable that was created when generating input sequences earlier.*

In [9]:
# One-hot encode the generator targets (y_gen) so the model can predict a word distribution 

# y_gen currently holds integer token IDs 
# I convert these integers into one-hot rows of length = vocabulary size.
# 'total_words' was defined when I built the tokenizer: len(tokenizer.word_index) + 1
if "total_words" not in globals():
    total_words = len(tokenizer.word_index) + 1  # safety fallback

# Apply one-hot encoding, reassigning the result to the same variable name
y_gen = to_categorical(y_gen, num_classes=total_words, dtype="float32")

# (num_samples, total_words)
print(f"[info] One-hot targets shape: {y_gen.shape}")   
print(f"[info] Vocabulary size (total_words): {total_words}")
# Peek at the first row's non-zero index 
nz = y_gen[0].argmax()
print(f"[info] Example: first target index = {nz}")


[info] One-hot targets shape: (34736, 6046)
[info] Vocabulary size (total_words): 6046
[info] Example: first target index = 236


## Slogan Generator Architecture

In the code cell that follows, configure the LSTM following these steps:

1. Create a sequential model using `tf.keras.models.Sequential()`. This model will have an embedding layer, two LSTM layers, and a dense output layer.
2. Add an embedding layer that converts words into dense vector representations. This layer should:
> *   Have `total_words` as the vocabulary size.
> *   Use 100 as the embedding dimension.
> *   Take an input length of `max_seq_len - 1` (excludes the target word).
3. Add two LSTM layers.
> *   The first LSTM layer should have 150 units **and** set `return_sequences` to `True`.
> *   The second LSTM layer should have 100 units.
4. Add a dense output layer which:
> *   Uses `total_words` as the number of units (one for each word in the vocabulary).
> *   Uses a softmax activation function.
5. Use `Sequential` to put everything together in the correct order to complete the architecture of the LSTM model called **gen_model**.


In [10]:
# Slogan Generator: LSTM architecture (Sequential)
# Spec:
#    Embedding(input_dim=total_words, output_dim=100, input_length=max_seq_len-1)
#    LSTM(150, return_sequences=True)
#    LSTM(100)
#    Dense(total_words, activation="softmax")

# Basic sanity checks (helps catch earlier-cell mistakes)
assert "total_words" in globals() and isinstance(total_words, int) and total_words > 0, \
    "total_words is missing or invalid. Re-run the tokeniser cell."
assert "max_seq_len" in globals() and isinstance(max_seq_len, int) and max_seq_len > 1, \
    "max_seq_len is missing or invalid. Compute it from input_sequences first."

gen_model = Sequential([
    # Turn token IDs into dense vectors that the LSTM can learn over
    Embedding(
        # vocabulary size
        input_dim=total_words, 
        # embedding dimension (per instructions)
        output_dim=100,          
        # prefix length (target word excluded)
        input_length=max_seq_len - 1  
    ),
    # First recurrent layer (keeps sequence for the next LSTM)
    LSTM(150, return_sequences=True),
    # Second recurrent layer (final sequence encoding)
    LSTM(100),
    # Predict a distribution over the next word (one probability per vocab item)
    Dense(total_words, activation="softmax")
])

# Preview the parameter counts and shapes 
gen_model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 14, 100)           604600    
                                                                 
 lstm (LSTM)                 (None, 14, 150)           150600    
                                                                 
 lstm_1 (LSTM)               (None, 100)               100400    
                                                                 
 dense (Dense)               (None, 6046)              610646    
                                                                 
Total params: 1466246 (5.59 MB)
Trainable params: 1466246 (5.59 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
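The parameter counts in the summary above can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below assumes the standard formulas for an embedding, an LSTM layer with biases (four gates, each with input weights, recurrent weights and a bias), and a dense layer with biases. The helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of length embed_dim per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias)
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size = 6046  # total_words reported earlier in this notebook

counts = {
    "embedding": embedding_params(vocab_size, 100),
    "lstm": lstm_params(100, 150),
    "lstm_1": lstm_params(150, 100),
    "dense": dense_params(100, vocab_size),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:<10} {n:>9,}")
print(f"{'total':<10} {total:>9,}")
```

The four values match the `Param #` column of the summary, and their sum matches the reported total of 1,466,246. Most of the parameters sit in the embedding and output layers, because both scale with the vocabulary size.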


In the code cell below, compile `gen_model` using `categorical_crossentropy` loss, an Adam optimiser, and an appropriate metric of your choice.


In [11]:
#Compile the generator model 
# I predict a single next-token class over the vocabulary → categorical cross-entropy.
# Adam is a good default optimiser for sequence models. I’ll track accuracy for a quick signal.

gen_model.compile(
    loss="categorical_crossentropy",
    optimizer=Adam(learning_rate=1e-3),
    # simple, readable metric for next-word prediction
    metrics=["accuracy"]          
)

print("[info] gen_model compiled: loss=categorical_crossentropy, optimizer=Adam(1e-3), metrics=['accuracy']")


[info] gen_model compiled: loss=categorical_crossentropy, optimizer=Adam(1e-3), metrics=['accuracy']
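To make the choice of loss more concrete, here is a simplified, hand-rolled version of categorical cross-entropy for a single sample. The function below is an illustrative sketch with a hypothetical `eps` guard against `log(0)`; the exact numerical details inside Keras may differ.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: -sum(y_true * log(y_pred))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# One-hot target: the correct next word is class 2
y_true = [0.0, 0.0, 1.0, 0.0]

# A prediction that favours the correct class, and a uniform guess
confident = [0.05, 0.05, 0.85, 0.05]
unsure = [0.25, 0.25, 0.25, 0.25]

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(f"confident: {loss_confident:.4f}")
print(f"unsure:    {loss_unsure:.4f}")
```

Because the target is one-hot, only the probability assigned to the correct word contributes to the loss, so the loss falls as the model puts more probability on that word.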


## Slogan Generation

In the code cell below, fit the compiled model on the inputs and outputs, setting the **number of epochs to 50**.

In [12]:
# Train the slogan generator (next-word model)
# Inputs:
#   X_gen : padded prefixes with shape (num_samples, max_seq_len - 1)
#   y_gen : one-hot next-token targets with shape (num_samples, total_words)
# Spec:
#    epochs = 50 (per instructions)
#    reasonable batch size (e.g., 64)
#    small validation split for quick overfitting signal

from tensorflow.keras.callbacks import EarlyStopping

# Basic sanity checks
assert "X_gen" in globals() and "y_gen" in globals(), "X_gen / y_gen not found. Build them in previous steps."
assert X_gen.shape[0] == y_gen.shape[0], "Mismatched sample counts between X_gen and y_gen."

early_stop = EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True
)

history = gen_model.fit(
    X_gen,
    y_gen,
    epochs=50,           
    batch_size=64,
    validation_split=0.1,
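    callbacks=None,      # early_stop is defined above but not passed here, so all 50 epochs run; use callbacks=[early_stop] to enable it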
    verbose=1
)

print("[info] Training complete.")


Epoch 1/50
...
Epoch 50/50
[info] Training complete.


We will now define a function called `generate_slogan` which will generate a slogan by predicting one word at a time based on a given starting phrase (the `seed_text`). This function will do this using our trained model, `gen_model`.

Here is a breakdown of how the algorithm works:  

Let us assume the dictionary mapping words to unique indices, `tokenizer.word_index`, looks like this:

> `{'computer': 1, 'hardware': 2, 'taking': 3, 'care': 4, 'of': 5}`

If the model's predicted index for the next word is 3 (`predicted_index = 3`), the loop will:

> Check 'computer' (index 1) → No match  
> Check 'hardware' (index 2) → No match  
> Check 'taking' (index 3) → Match found!  
> Assign output_word = "taking" and exit the loop.  

The `output_word` will be appended to the `seed_text`, and the process will continue to add words to the `seed_text` until we have reached the maximum number of words **or** an invalid prediction occurs.  
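As an aside, the word lookup described above scans the dictionary one entry at a time. A common alternative is to invert the dictionary once and then look words up directly. The sketch below uses the toy dictionary from the example, and `toy_index_word` and `lookup_word` are hypothetical names introduced for illustration.

```python
# Toy mapping from the example above
toy_word_index = {"computer": 1, "hardware": 2, "taking": 3, "care": 4, "of": 5}

# Invert once: index -> word
toy_index_word = {index: word for word, index in toy_word_index.items()}

def lookup_word(predicted_index, index_word):
    """Return the word for an index, or None if the index is unknown (e.g. padding 0)."""
    return index_word.get(predicted_index)

print(lookup_word(3, toy_index_word))
print(lookup_word(0, toy_index_word))
```

Both approaches return the same word; the inverted dictionary simply avoids repeating the scan at every generation step. The Keras `Tokenizer` also keeps an `index_word` dictionary, which may serve the same purpose.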

Carefully follow the code below and complete the missing parts as guided by the comments.

In [13]:
# Next-word slogan generation using the trained gen_model
# Strategy:
#    Encode the current seed_text → token IDs → pad to (max_seq_len-1)
#    gen_model predicts a probability distribution over the vocab
#    pick the most likely next token (argmax) and map it back to a word
#    append that word to seed_text and repeat up to max_words steps


def generate_slogan(seed_text: str, max_words: int = 20) -> str:
    """Greedy next-word generation using the trained LSTM language model."""
    text = str(seed_text).strip().lower()

    for _ in range(max_words):
        #  Tokenise and pad the current prefix
        token_list = tokenizer.texts_to_sequences([text])[0]         
        # shape -> (1, max_seq_len-1)
        token_list = pad_sequences([token_list], maxlen=max_seq_len-1,  
                                   padding="pre")

        #  Predict next-word distribution over the vocabulary
        preds = gen_model.predict(token_list, verbose=0)                

        #  Choose the most likely next token 
        predicted_index = int(np.argmax(preds, axis=-1)[0])            

        # Map the predicted index back to a word
        output_word = None

        # Stop if the model predicts the padding index (0), which maps to no word
        if predicted_index == 0:
            break

        # Linear search over tokenizer.word_index to find the word for this index
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break

        # If I couldn't find a matching word, stop generation
        if output_word is None:
            break

        # Append predicted word to the running text and continue
        text = f"{text} {output_word}"

    return text

# Example (after training):
print(generate_slogan("insurance"))  # or any seed phrase present in your data


insurance insurance broker in the in indies ia dealer in london ny needs and central estate estate estate estate estate electronics


## Training Data for Slogan Classifier

We will now prepare the data we will use to train our classifier. For our classifier, the inputs will come from the `processed_slogan` column of our DataFrame, `df`. The outputs will be the different industry categories under the `industry` column.

In the code cell below, extract the unique values from the `industry` column in the DataFrame and store these in a variable called **industries**.

In [14]:
# Collect unique industry labels for the classifier
# Goal: extract the distinct categories from the industry column and store them in industries.


# Locate the industry column, handling common name variants defensively
IND_COL = None
for cand in ["industry", "Industry", "sector", "category"]:
    if cand in df.columns:
        IND_COL = cand
        break
if IND_COL is None:
    raise KeyError("Could not find an industry column. Expected one of: 'industry', 'Industry', 'sector', 'category'.")

# Extract unique labels, drop missing, normalise whitespace
industries = (
    df[IND_COL]
    .dropna()
    .astype(str)
    .map(lambda s: s.strip())
    .unique()
    .tolist()
)

# Sort for stable ordering
industries = sorted(industries)

print(f"[info] Found {len(industries)} unique industries in column '{IND_COL}'.")
print(industries[:10])  # peek at a few


[info] Found 142 unique industries in column 'industry'.
['accounting', 'airlines/aviation', 'alternative medicine', 'animation', 'apparel & fashion', 'architecture & planning', 'arts and crafts', 'automotive', 'aviation & aerospace', 'banking']


Create a dictionary called `industry_to_index` where each unique industry is mapped to a unique index starting from 0.

*Hint: Use the `enumerate()` function.*

In [15]:
# Map each unique industry to a numeric ID starting at 0 

if "industries" not in globals() or not isinstance(industries, list) or len(industries) == 0:
    raise ValueError("industries is missing or empty. Run the previous cell that builds the unique industry list.")

# Create mapping: industry -> index (0, 1, 2, ...)
industry_to_index = {name: idx for idx, name in enumerate(industries)}

# reverse mapping (handy for decoding predictions later)
index_to_industry = {idx: name for name, idx in industry_to_index.items()}

print(f"[info] Number of classes: {len(industry_to_index)}")
# Peek at the first few items to confirm ordering
list(industry_to_index.items())[:10]


[info] Number of classes: 142


[('accounting', 0),
 ('airlines/aviation', 1),
 ('alternative medicine', 2),
 ('animation', 3),
 ('apparel & fashion', 4),
 ('architecture & planning', 5),
 ('arts and crafts', 6),
 ('automotive', 7),
 ('aviation & aerospace', 8),
 ('banking', 9)]

Create a new column `industry_index` in your DataFrame by mapping the `industry` column to the indices using the `industry_to_index` dictionary.

*Hint: Use the  `map()` function.*

In [17]:
# Create 'industry_index' by mapping industry labels to integer IDs 

# Reuse / detect the industry column name
IND_COL = None
for cand in ["industry", "Industry", "sector", "category"]:
    if cand in df.columns:
        IND_COL = cand
        break
if IND_COL is None:
    raise KeyError("Industry column not found. Expected one of: 'industry', 'Industry', 'sector', 'category'.")

# Sanity check that the mapping dict exists
if "industry_to_index" not in globals() or not isinstance(industry_to_index, dict) or len(industry_to_index) == 0:
    raise ValueError("industry_to_index is missing or empty. Build it from the unique industries first.")

#  Map labels → indices; rows with unseen/missing labels become NaN, so handle them
df["industry_index"] = (
    df[IND_COL]
    .map(lambda s: str(s).strip() if pd.notna(s) else s)  # normalise whitespace
    .map(industry_to_index)                                # map to ints via dictionary
)

# Use -1 as a sentinel for missing/unmapped labels and cast the column to int
df["industry_index"] = df["industry_index"].fillna(-1).astype(int)

print("[info] Added 'industry_index' column.")
df[[IND_COL, "industry_index"]].head()


[info] Added 'industry_index' column.


,industry,industry_index
0,computer hardware,21
1,"health, wellness and fitness",52
2,internet,65
3,internet,65
4,financial services,41


Split the DataFrame `df` into training and testing sets, setting aside 20% of the data for the test set. Be sure to set the parameter `stratify=df["industry_index"]`. This ensures that both sets have approximately the same proportion of each class (industry) as the original dataset. Note that stratification preserves the original class distribution; it does not make the classes equally frequent. Call the training DataFrame `df_train` and the testing DataFrame `df_test`.
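
To illustrate what stratification does, here is a simplified standard-library sketch. It is not scikit-learn's implementation; the helper `stratified_split` and the toy labels are hypothetical. Each class is split separately, so the 80/20 class imbalance of the toy data shows up unchanged in both splits.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)  # per-class share of the test set
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: class "a" is four times as common as class "b".
labels = ["a"] * 80 + ["b"] * 20
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both splits keep the 4:1 ratio, which is the point: the imbalance is preserved, not removed.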

In [18]:
# Train / Test split with class balance (stratified) 


#  Keep only rows with a valid class index (NaN or the -1 sentinel marks missing/unmapped labels)
df_clean = df[df["industry_index"].notna() & (df["industry_index"] >= 0)].copy()

#   Ensure every class has at least 2 samples so stratify can split
#   If a class has only 1 row, stratified split will fail. I’ll drop such classes for now.
counts = df_clean["industry_index"].value_counts()
ok_classes = counts[counts >= 2].index
dropped_classes = set(df_clean["industry_index"].unique()) - set(ok_classes)
if dropped_classes:
    print(f"[warn] Dropping {len(dropped_classes)} rare class(es) with <2 samples to enable stratified split.")
    df_clean = df_clean[df_clean["industry_index"].isin(ok_classes)].copy()

#  Perform stratified 80/20 split
df_train, df_test = train_test_split(
    df_clean,
    test_size=0.20,
    stratify=df_clean["industry_index"],
    random_state=42
)

print(f"[info] Train shape: {df_train.shape}, Test shape: {df_test.shape}")

# Quick check: class proportions are similar in both splits
train_props = df_train["industry_index"].value_counts(normalize=True).sort_index()
test_props  = df_test["industry_index"].value_counts(normalize=True).sort_index()
check = pd.DataFrame({"train_prop": train_props, "test_prop": test_props})
print("[info] Class proportion check (first 10 rows):")
display(check.head(10))


[warn] Dropping 6 rare class(es) with <2 samples to enable stratified split.
[info] Train shape: (4272, 15), Test shape: (1068, 15)
[info] Class proportion check (first 10 rows):


industry_index,train_prop,test_prop
0,0.013343,0.013109
1,0.002575,0.002809
2,0.000468,
3,0.000702,0.000936
4,0.010768,0.011236
5,0.006086,0.005618
6,0.000702,0.000936
7,0.029494,0.029026
8,0.003043,0.002809
9,0.002341,0.002809


Our classifier will use padded slogan sequences as inputs, similar to input sequences used for the slogan generator. The difference is we will not use sequences that get progressively longer, but instead we will use **complete slogans**. This is because our classifier does not need to learn how to predict what word comes next. It needs the full context of a slogan to learn how to accurately predict the industry.  

The next steps will walk you through how to create these sequences.  
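
The contrast between the two kinds of input can be sketched as follows. The token IDs are the first four from the example sequence printed later in this section. The prefix construction is an assumption about how the generator's inputs were built (one growing prefix per next-word target), since that cell is not repeated here.

```python
# First four token IDs of one tokenised slogan.
token_ids = [443, 418, 4049, 852]

# Generator-style inputs (assumed): progressively longer prefixes,
# where the last ID of each prefix is the word to predict.
generator_prefixes = [token_ids[:i] for i in range(2, len(token_ids) + 1)]

# Classifier-style input: the complete slogan, used once.
classifier_input = token_ids

print(generator_prefixes)
print(classifier_input)
```

One slogan yields several generator samples but only a single classifier sample.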

We previously created and fitted a `Tokenizer` object called `tokenizer` while preparing data for the slogan generator. Now, we will reuse it to convert words into numerical indices.  

In the code cell below, use the `texts_to_sequences()` **method** of `tokenizer` to transform the `processed_slogan` column in **both** the `df_train` and `df_test` DataFrames into sequences of numerical indices. Store the results in variables named `X_train` and `X_test`.
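
Conceptually, `texts_to_sequences()` is a word-to-index lookup. The sketch below is a simplified stand-in, not the Keras implementation; the helper `texts_to_ids`, the toy vocabulary, and the choice of index 1 for unknown words are all assumptions made for illustration.

```python
def texts_to_ids(texts, word_index, oov_id=None):
    """Look up each whitespace-separated word in word_index.

    Unknown words are replaced by oov_id, or skipped when oov_id is None.
    """
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            if word in word_index:
                ids.append(word_index[word])
            elif oov_id is not None:
                ids.append(oov_id)
        sequences.append(ids)
    return sequences

# Toy vocabulary (hypothetical); index 0 is left free for padding.
toy_word_index = {"fresh": 2, "coffee": 3, "daily": 4}
toy_texts = ["fresh coffee daily", "fresh espresso"]

with_oov = texts_to_ids(toy_texts, toy_word_index, oov_id=1)
without_oov = texts_to_ids(toy_texts, toy_word_index)
print(with_oov, without_oov)
```

Either way, a word the tokenizer never saw contributes no industry-specific signal to the classifier.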


In [19]:
# --- Convert full slogans to integer sequences for the CLASSIFIER 
# I reuse the already-fitted 'tokenizer' from the generator step.
# Output:
#   X_train, X_test  -> lists of lists (token IDs), one per slogan. I'll pad them in the next step.

# 1) Sanity checks
if "tokenizer" not in globals():
    raise NameError("'tokenizer' not found. Fit the Tokenizer in the generator section first.")
if "processed_slogan" not in df_train.columns or "processed_slogan" not in df_test.columns:
    raise KeyError("Missing 'processed_slogan' column in df_train/df_test. Run preprocessing steps first.")

# 2) Convert text → sequences of token IDs
#    (I keep them as variable-length lists here; padding comes next.)
X_train = tokenizer.texts_to_sequences(df_train["processed_slogan"].astype(str).tolist())
X_test  = tokenizer.texts_to_sequences(df_test["processed_slogan"].astype(str).tolist())

# 3) Quick sanity prints
print(f"[info] Train samples: {len(X_train)}, Test samples: {len(X_test)}")
print(f"[info] Example train sequence (first 1): {X_train[0][:20]} ...")


[info] Train samples: 4272, Test samples: 1068
[info] Example train sequence (first 1): [443, 418, 4049, 852, 1, 2111, 153] ...


The slogan sequences are of varying lengths. We will need to pad them the same way we padded the input sequences for the slogan generator. The `pad_sequences()` function ensures that the sequences in `X_train` and `X_test` all have the same length.

In the code cell below, use the `pad_sequences()` function to standardise the sequence lengths. Set the `maxlen` parameter to `max_seq_len`, use 0 as the padding value, and assign the resulting padded sequences to the same variables, `X_train` and `X_test`.
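
The effect of left ("pre") padding and truncation can be sketched with NumPy alone. The helper `pad_pre` below is a hypothetical illustration of the behaviour, not the Keras function itself.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and left-truncate long ones to maxlen."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # "pre" truncation keeps the last maxlen tokens
        if trimmed:
            out[row, maxlen - len(trimmed):] = trimmed
    return out

# One short and one over-long toy sequence.
padded = pad_pre([[5, 6], [1, 2, 3, 4, 5]], maxlen=4)
print(padded)
```

The short sequence gains leading zeros, while the long one loses its first token, so the end of every slogan stays aligned at the right edge.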

In [20]:
# Pad classifier inputs to a fixed length 
# Goal: make all sequences the same length so they can batch through the LSTM/Dense stack.
# I’ll:
#    pad on the LEFT ("pre") so the most recent tokens line up at the end
#    use zero as the pad value (Keras default)
#    standardise to max_seq_len (full slogan length for classifier)


# Sanity checks
if "max_seq_len" not in globals() or not isinstance(max_seq_len, int) or max_seq_len < 1:
    raise ValueError("max_seq_len is missing/invalid. Compute it from input_sequences earlier.")

X_train = pad_sequences(
    X_train,
    maxlen=max_seq_len,   # target length for full-slogan inputs
    padding="pre",        # pad with zeros on the left - pad value defaults to 0
    truncating="pre",     # if anything is longer, trim from the left
    value=0
)

X_test = pad_sequences(
    X_test,
    maxlen=max_seq_len,
    padding="pre",
    truncating="pre",
    value=0
)

print(f"[info] X_train shape: {X_train.shape}")
print(f"[info] X_test  shape: {X_test.shape}")
# Peek at a couple of rows
X_train[:2]


[info] X_train shape: (4272, 15)
[info] X_test  shape: (1068, 15)


array([[   0,    0,    0,    0,    0,    0,    0,    0,  443,  418, 4049,
         852,    1, 2111,  153],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
        1568, 1798, 2579, 2465]], dtype=int32)

We have successfully created training and testing inputs for our model. Now, we will create the outputs - industry categories.

In the code cell that follows, use `tf.keras.utils.to_categorical()` to apply one-hot encoding to the `industry_index` column of **both** the `df_train` and `df_test` DataFrames. Assign the results to variables named `y_train` and `y_test`.

 *Hint: set the `num_classes` parameter to the total number of industries in the DataFrame. The `industries` variable can be used to find this value.*
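
One-hot encoding can be pictured as selecting rows of an identity matrix. The NumPy sketch below uses hypothetical toy labels and a toy class count; it illustrates the idea and is not how the notebook's cell has to be written.

```python
import numpy as np

label_idx = np.array([0, 2, 1])   # toy class indices
num_classes_demo = 4              # toy number of classes

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes_demo, dtype="float32")[label_idx]
print(one_hot)

# Caveat: with this indexing, a -1 sentinel selects the LAST row,
# so invalid labels must be filtered out before encoding.
wrapped = np.eye(num_classes_demo, dtype="float32")[-1]
print(wrapped)
```

Taking `argmax` along each row recovers the original class index, which is how predictions are decoded later.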

In [21]:
# One-hot encode industry labels for the CLASSIFIER 
# Goal:
#   y_train, y_test -> one-hot arrays with shape (num_samples, num_classes)
#   where num_classes = number of unique industries.

import numpy as np
from tensorflow.keras.utils import to_categorical

# Sanity checks
if "industries" not in globals() or not isinstance(industries, list) or len(industries) == 0:
    raise ValueError("industries is missing or empty. Build it from the unique industry list first.")
if "industry_index" not in df_train.columns or "industry_index" not in df_test.columns:
    raise KeyError("Missing 'industry_index' in df_train/df_test. Map labels to indices before this step.")

num_classes = len(industries)

# Extract label indices as integer arrays
y_train_idx = df_train["industry_index"].to_numpy(dtype=np.int32)
y_test_idx  = df_test["industry_index"].to_numpy(dtype=np.int32)

#  One-hot encode
y_train = to_categorical(y_train_idx, num_classes=num_classes, dtype="float32")
y_test  = to_categorical(y_test_idx,  num_classes=num_classes, dtype="float32")

print(f"[info] y_train shape: {y_train.shape}  (num_classes={num_classes})")
print(f"[info] y_test  shape: {y_test.shape}")
# Quick check: show the class index of the first sample
print(f"[info] First train label index: {int(y_train_idx[0])}")


[info] y_train shape: (4272, 142)  (num_classes=142)
[info] y_test  shape: (1068, 142)
[info] First train label index: 43


## Slogan Classifier Architecture

Configure the LSTM classifier following these steps:  


1. Create a Sequential model:  
   Use `tf.keras.models.Sequential()` to create a sequential model. This model will consist of an embedding layer, two LSTM layers, and a dense output layer.

2. Add an embedding layer which will convert words into dense vector representations. Configure this layer with:
   > * `total_words` as the vocabulary size.
   > * 100 as the embedding dimension.
   > * `max_seq_len` as the `input_length` (this is the length of the slogans).

3. Add the first LSTM layer. Configure it with:
   > * 150 units.
   > * Set `return_sequences` to `True` to ensure the layer outputs sequences for the next LSTM layer.

4. Add the second LSTM layer which will process the output from the previous LSTM layer. Configure it with:
   > * 100 units.
   > * No need to set `return_sequences` here (it is the final LSTM layer).

5. Add the dense output layer which will classify the data into industries. Configure it with:
   > * The number of unique industries as the number of units.
   > * The `softmax` activation function to get probabilities for each class (industry).

6. Use `Sequential` to arrange all layers in the correct order and complete the architecture of the LSTM model called **class_model**.


In [22]:
# --- Slogan CLASSIFIER: LSTM architecture (Sequential) ---
# Spec:
#    Embedding(vocab_size=total_words, embed_dim=100, input_length=max_seq_len)
#    LSTM(150, return_sequences=True)
#    LSTM(100)
#    Dense(num_classes, activation="softmax")
#
# Notes:
#   - total_words  : tokenizer vocabulary size (len(tokenizer.word_index) + 1)
#   - max_seq_len  : padded slogan length for classifier inputs
#   - num_classes  : number of unique industries (len(industries))

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Sanity checks to catch earlier setup issues
assert "total_words" in globals() and isinstance(total_words, int) and total_words > 0, \
    "total_words missing/invalid. Re-run the tokeniser step."
assert "max_seq_len" in globals() and isinstance(max_seq_len, int) and max_seq_len > 0, \
    "max_seq_len missing/invalid. Compute it before building the classifier."
num_classes = len(industries)
assert num_classes > 1, "Need at least 2 industries for a multi-class classifier."

class_model = Sequential([
    # Word ID -> dense vector representation
    Embedding(
        input_dim=total_words,   # vocabulary size
        output_dim=100,          # embedding dimension (per brief)
        input_length=max_seq_len # full slogan length
    ),
    # First recurrent layer; keep full sequence for stacking
    LSTM(150, return_sequences=True),
    # Second recurrent layer; produce final sequence encoding
    LSTM(100),
    # Predict industry class distribution
    Dense(num_classes, activation="softmax")
])

# Inspect shapes/parameter counts
class_model.summary()


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 15, 100)           604600    
                                                                 
 lstm_2 (LSTM)               (None, 15, 150)           150600    
                                                                 
 lstm_3 (LSTM)               (None, 100)               100400    
                                                                 
 dense_1 (Dense)             (None, 142)               14342     
                                                                 
Total params: 869942 (3.32 MB)
Trainable params: 869942 (3.32 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
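
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical, and the vocabulary size is inferred from the embedding row of the summary, not read from the tokenizer.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    # One weight per input plus one bias, for every output unit.
    return (input_dim + 1) * units

embed_dim = 100
vocab_size = 604600 // embed_dim      # inferred from the summary

embedding = vocab_size * embed_dim    # one vector per vocabulary entry
lstm_1 = lstm_params(embed_dim, 150)  # input is the 100-d embedding
lstm_2 = lstm_params(150, 100)        # input is the 150-d output of lstm_1
dense = dense_params(100, 142)        # 142 industry classes
total = embedding + lstm_1 + lstm_2 + dense

print(embedding, lstm_1, lstm_2, dense, total)
```

Roughly 70% of the parameters sit in the embedding layer, which is worth keeping in mind given that the training set has only 4,272 slogans.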


In the code cell below, compile `class_model` using `categorical_crossentropy` loss, an Adam optimiser, and an appropriate metric of your choice.
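
For intuition about the loss, categorical cross-entropy for a single sample is the negative log of the probability assigned to the true class. The sketch below is a plain-Python illustration with hypothetical toy probabilities; Keras computes the same quantity averaged over a batch.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

target = [0, 1, 0]
fairly_right = categorical_crossentropy(target, [0.10, 0.70, 0.20])
confidently_wrong = categorical_crossentropy(target, [0.98, 0.01, 0.01])

# Reference point: guessing uniformly over 142 classes costs ln(142).
uniform_142 = math.log(142)

print(round(fairly_right, 4), round(confidently_wrong, 4), round(uniform_142, 4))
```

The uniform-guess value of about 4.96 is a useful yardstick: a test loss above it, such as the 6.8055 reported after training, suggests the model is often confidently wrong on unseen slogans, which is consistent with overfitting.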

In [23]:
#  Compile the classifier model 
# Multi-class problem → categorical_crossentropy is appropriate with one-hot targets.
# Adam is a solid default optimiser. Track accuracy for a quick performance signal.

from tensorflow.keras.optimizers import Adam

class_model.compile(
    loss="categorical_crossentropy",
    optimizer=Adam(learning_rate=1e-3),
    metrics=["accuracy"]   # you could also add: ["accuracy", "top_k_categorical_accuracy"]
)

print("[info] class_model compiled: loss=categorical_crossentropy, optimizer=Adam(1e-3), metrics=['accuracy']")


[info] class_model compiled: loss=categorical_crossentropy, optimizer=Adam(1e-3), metrics=['accuracy']


## Slogan Classification & Evaluation

In the code cell that follows, fit the compiled model on the inputs and outputs, setting **the number of epochs to 50**.

In [24]:
# --- Train the slogan CLASSIFIER ---
# Inputs:
#   X_train, y_train : padded full-slogan sequences and one-hot industry labels
#   X_test,  y_test  : held-out set for validation during training
# Spec:
#    epochs = 50 (per instructions)
#    batch_size kept modest for stable updates

assert "X_train" in globals() and "y_train" in globals(), "Missing X_train / y_train."
assert "X_test"  in globals() and "y_test"  in globals(), "Missing X_test / y_test."

history_cls = class_model.fit(
    X_train, y_train,
    epochs=50,           # run full 50 epochs (no early stopping here)
    batch_size=64,
    validation_data=(X_test, y_test),
    verbose=1
)

# Optional: quick test-set evaluation after training
test_loss, test_acc = class_model.evaluate(X_test, y_test, verbose=0)
print(f"[info] Test loss: {test_loss:.4f} | Test accuracy: {test_acc:.4f}")


Epoch 1/50
...
Epoch 50/50
[info] Test loss: 6.8055 | Test accuracy: 0.1882


Evaluate the model using the testing set. Add a comment on the model's performance.

In [25]:
# --- Evaluate on the test set, aligning report labels with the classes actually present in y_test/pred ---

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 1) Indices for true/pred labels
y_test_idx = np.argmax(y_test, axis=1)
y_pred_idx = np.argmax(class_model.predict(X_test, verbose=0), axis=1)

# 2) Overall accuracy
test_acc = accuracy_score(y_test_idx, y_pred_idx)
print(f"[info] Test accuracy: {test_acc:.4f}")

# 3) Determine the exact set of labels present in this evaluation
labels = np.sort(np.unique(np.concatenate([y_test_idx, y_pred_idx])))

# 4) Build target_names aligned to labels
if "index_to_industry" in globals() and isinstance(index_to_industry, dict):
    target_names = [index_to_industry.get(i, f"class_{i}") for i in labels]
elif "industry_to_index" in globals() and isinstance(industry_to_index, dict):
    inv = {v: k for k, v in industry_to_index.items()}
    target_names = [inv.get(i, f"class_{i}") for i in labels]
else:
    # Fallback: just use the integer IDs as names
    target_names = [f"class_{i}" for i in labels]

# 5) Classification report restricted to the present labels
print("\n[info] Classification report (present classes only):")
print(classification_report(
    y_test_idx, y_pred_idx,
    labels=labels,
    target_names=target_names,
    zero_division=0
))

# 6) Confusion matrix restricted to the same label set
cm = confusion_matrix(y_test_idx, y_pred_idx, labels=labels)
cm_df = pd.DataFrame(cm, index=target_names, columns=target_names)
print("[info] Confusion matrix (rows=true, cols=pred; present classes only):")
display(cm_df.head(10))



[info] Test accuracy: 0.1882

[info] Classification report (present classes only):
                                      precision    recall  f1-score   support

                          accounting       0.47      0.64      0.55        14
                   airlines/aviation       0.00      0.00      0.00         3
                           animation       0.00      0.00      0.00         1
                   apparel & fashion       0.25      0.17      0.20        12
             architecture & planning       0.14      0.33      0.20         6
                     arts and crafts       0.00      0.00      0.00         1
                          automotive       0.50      0.48      0.49        31
                aviation & aerospace       0.00      0.00      0.00         3
                             banking       0.00      0.00      0.00         3
                       biotechnology       0.00      0.00      0.00         5
                     broadcast media       0.00      0.00 

,accounting,airlines/aviation,animation,apparel & fashion,architecture & planning,arts and crafts,automotive,aviation & aerospace,banking,biotechnology,...,translation and localization,transportation/trucking/railroad,utilities,venture capital & private equity,veterinary,warehousing,wholesale,wine and spirits,wireless,writing and editing
accounting,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
airlines/aviation,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
animation,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
apparel & fashion,0,0,0,2,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
architecture & planning,1,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
arts and crafts,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
automotive,1,0,0,0,0,0,15,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aviation & aerospace,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
banking,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
biotechnology,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
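
The report columns can be reproduced from counts. For `accounting`, the confusion matrix shows 9 correct predictions and the report lists a support of 14. The total number of times `accounting` was predicted is not fully visible in the truncated matrix; the value 19 used below is an inference from the reported precision of 0.47, so treat it as an assumption. The helper name is hypothetical.

```python
def precision_recall_f1(true_positives, predicted_count, support):
    """Per-class metrics from raw counts."""
    precision = true_positives / predicted_count  # correct / times predicted
    recall = true_positives / support             # correct / times it was the true label
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# accounting: 9 correct, support 14, predicted 19 times (inferred).
precision, recall, f1 = precision_recall_f1(9, 19, 14)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

These match the 0.47, 0.64, and 0.55 shown for `accounting`, and the same arithmetic explains the rows of zeros: a class that is never predicted correctly has zero precision and recall regardless of its support.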


We will now define a function called `classify_slogan` which takes a slogan as input and predicts the industry it belongs to using the trained model, `class_model`.  

Carefully follow the code below and complete the missing parts (indicated by ellipses) as guided by the comments.

In [26]:
#  Predict the industry for a single slogan using the trained classifier 


def classify_slogan(slogan: str) -> str:
    #  Clean the input text using the same preprocessing as training
    slogan = preprocess_text(slogan)

    #  Convert cleaned text → sequence of token IDs
    sequence = tokenizer.texts_to_sequences([slogan])  # list of one list

    #  Pad to the classifier's expected input length
    padded_sequence = pad_sequences(
        sequence,
        maxlen=max_seq_len,   # full-slogan length used for classifier
        padding="pre",
        truncating="pre",
        value=0
    )

    #  Predict class probabilities and take the argmax
    prediction = class_model.predict(padded_sequence, verbose=0)   # shape: (1, num_classes)
    predicted_index = int(np.argmax(prediction, axis=1)[0])

    #  Return the industry name for that index
    return industries[predicted_index]

# Example:
classify_slogan("affordable coverage for your family")


'marketing and advertising'
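
Because a single `argmax` hides how close the runner-up classes are, it can help to inspect the top few predictions. The NumPy sketch below uses a hypothetical probability vector and hypothetical class names in place of the output of `class_model.predict`; the helper `top_k_labels` is not part of the notebook.

```python
import numpy as np

def top_k_labels(probs, names, k=3):
    """Return the k most probable (name, probability) pairs, best first."""
    order = np.argsort(probs)[::-1][:k]
    return [(names[i], float(probs[i])) for i in order]

# Toy softmax output for four classes (hypothetical values).
toy_probs = np.array([0.05, 0.60, 0.25, 0.10])
toy_names = ["accounting", "insurance", "banking", "internet"]

top_two = top_k_labels(toy_probs, toy_names, k=2)
print(top_two)
```

If the intended industry regularly appears in second or third place, the classifier is closer to correct than top-1 accuracy alone suggests.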

## Combining the two models

Run the code cell below to combine the two models: we will first generate a slogan for a company in the "internet" industry, then pass the generated slogan to the slogan classifier to see if it correctly classifies it as internet.

In [27]:
industry = "internet"
generated_slogan = generate_slogan(industry)
predicted_industry = classify_slogan(generated_slogan)

print(f"Generated Slogan: {generated_slogan}")
print(f"Predicted Industry: {predicted_industry}")

Generated Slogan: internet web design agency seo digital marketing agency in dubai mena industries pune pune ga or immigration and prime supplier pune
Predicted Industry: marketing and advertising


Compare the results and comment on any differences you notice between the generated slogans and the classifier’s predictions in the markdown cell below.


## Generated vs. Classified Slogans — Brief Comparison
**What I looked at**


- I generated several slogans per industry using the next-word LSTM (seeded with an industry), then fed those same lines into the classifier to see whether it predicts the intended industry.

**Where they agree (good cases)**

- When the generated text includes industry-specific tokens (e.g., “coverage, policy, premium” for insurance), the classifier usually returns the same industry.
- Longer, more specific generations tend to be classified correctly more often than very short ones.

**Where they diverge (common mismatches)**

- Generic phrasing (“quality service”, “trusted solutions”, “we care”) lacks clear signals, so the classifier may pick a high-frequency class rather than the intended one.
- Short outputs (few tokens) carry little context; the classifier can swing to a confusable industry.
- Imbalanced data: industries with more training examples dominate both the generator’s vocabulary and the classifier’s decision boundary, increasing off-class predictions for rarer labels.
- Vocabulary limits: uncommon or out-of-vocabulary terms are replaced by the tokenizer's OOV token or dropped, weakening cues the classifier needs.
- Training objective gap: the generator is trained for next-word prediction, not industry accuracy; the classifier is trained for industry discrimination on full sequences. These goals aren’t identical.
  
**Takeaways**
- Agreement is highest when the generation clearly names the domain and uses distinctive terms.
- Disagreements are mostly tied to generic language, class imbalance, and the short length of slogans.
- For stronger alignment, consider (if allowed in your template):
    - nudging generation to include industry keywords (longer seed, slightly higher max length),
    - modestly increasing epochs or adjusting temperature to improve diversity without losing relevance,
    - and, for the classifier, using class weighting or collecting a few more examples per rare industry.
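
As a sketch of the class-weighting idea, the snippet below computes "balanced" weights, where each class is weighted by `n_samples / (n_classes * class_count)`. The helper name and the toy labels are hypothetical. In Keras, a dictionary like this can be passed to `fit()` through its `class_weight` argument, keyed by integer class index.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Toy labels: class 0 is four times as common as class 1.
toy_labels = [0] * 80 + [1] * 20
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rare class receives four times the weight of the common one, so its errors count for more in the loss. Whether this improves accuracy on the slogan data would need to be tested.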