# Capstone Project: Slogan Classifier and Generator

In this capstone project you will train a Long Short-Term Memory (LSTM) model to generate slogans for businesses based on their industry, and also train a classifier to predict the industry based on a given slogan.

## Libraries
We recommend running this notebook in [Google Colab](https://colab.google/). If you choose to use your local machine instead, you will need to install spaCy before starting.

To install spaCy, refer to the installation instructions provided on the spaCy [website](https://spacy.io/usage). Note that you may need an older version of Python that is compatible with spaCy; a virtual environment lets you install that specific version for this project. You will also need to download the `en_core_web_sm` model used later in this notebook (`python -m spacy download en_core_web_sm`).

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam
import spacy # available on Google Colab
from sklearn.model_selection import train_test_split

## Loading and viewing the dataset

- Load the slogan dataset into a variable called data.
- Extract relevant columns in a variable called df.
- Handle missing values.

Do **not** change the column names.

If you are using Google Colab, you will need to mount your Google Drive as follows:  
`from google.colab import drive`  
`drive.mount('/content/drive')`  

The path you use when loading your data will look something like this if you are using your Google Drive:  
"/content/drive/MyDrive/Colab Notebooks/slogan-valid.csv"

In [2]:
from google.colab import files

#uploaded = files.upload()

# Assuming the file is named 'slogan-valid.csv'
import pandas as pd
df = pd.read_csv('slogan-valid.csv')

# Display the first few rows
df.head()


Saving slogan-valid.csv to slogan-valid.csv


Unnamed: 0,desc,output,type,company,industry,url,alias,desc_masked,output_masked,ent_dict,unsupported,first_pos
0,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,headline_long,eftpos warehouse,computer hardware,eftposwarehouse.co.nz,Eftpos Warehouse,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,{'[date]': 'monthly'},False,VB
1,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,headline,welbi,"health, wellness and fitness",welbi.co,Welbi,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,{},False,VB
2,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,headline_long,optinmonster,internet,optinmonster.com,Optinmonster,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,{},False,JJ
3,Twine matches companies to the best digital an...,Hire quality freelancers for your job,headline_long,twine.fm,internet,twine.fm,,Twine matches companies to the best digital an...,Hire quality freelancers for your job,"{'[number]': 'over 260,000'}",False,VB
4,"Financial Advisers Norwich, Norfolk - <company...","Financial Advisers Norwich, Norfolk",headline,mcb financial services ltd,financial services,mcbfinancialservices.co.uk,Mcb Financial Services,"Financial Advisers [country], [country1] - <co...","Financial Advisers [country], [country1]","{'[country]': 'Norwich', '[country1]': 'Norfolk'}",False,NN


## Data Preprocessing

Since we are working with textual data, we need software that understands natural language. For this, we'll use a library for processing text called **spaCy**. Using spaCy, we'll break the text into smaller units called tokens that are easier for the machine to process. This process is called **tokenisation**. We'll also convert all text to lowercase and remove punctuation because this information is not necessary for our models. Run the code below, and your dataframe (df) will gain a new column called **'processed_slogan'** which contains the preprocessed text.
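
To see the target transformation in isolation, separate from the spaCy cell that follows, here is a rough stand-in that uses only Python's `re` module. It is an approximation, not spaCy's tokeniser: the helper name `simple_preprocess` is hypothetical, and spaCy splits edge cases such as contractions differently.

```python
import re

def simple_preprocess(text):
    """Lowercase the text and keep only runs of word characters.

    Rough approximation of the spaCy-based function; spaCy handles
    contractions, URLs, and similar cases differently.
    """
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(tokens)

print(simple_preprocess("Build World-Class Recreation Programs"))
# build world class recreation programs
```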




In [3]:
# Load spaCy model for text processing
nlp = spacy.load("en_core_web_sm")

# Define text preprocessing function
def preprocess_text(text):
    text_lower = text.lower()
    doc = nlp(text_lower)

    processed_tokens = []

    for token in doc:
        if not token.is_punct:
            processed_tokens.append(token.text)

    return " ".join(processed_tokens)

df["processed_slogan"] = df["output"].apply(preprocess_text)

df.head()

Unnamed: 0,desc,output,type,company,industry,url,alias,desc_masked,output_masked,ent_dict,unsupported,first_pos,processed_slogan
0,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,headline_long,eftpos warehouse,computer hardware,eftposwarehouse.co.nz,Eftpos Warehouse,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,{'[date]': 'monthly'},False,VB,taking care of small business technology
1,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,headline,welbi,"health, wellness and fitness",welbi.co,Welbi,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,{},False,VB,build world class recreation programs
2,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,headline_long,optinmonster,internet,optinmonster.com,Optinmonster,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,{},False,JJ,most powerful lead generation software for mar...
3,Twine matches companies to the best digital an...,Hire quality freelancers for your job,headline_long,twine.fm,internet,twine.fm,,Twine matches companies to the best digital an...,Hire quality freelancers for your job,"{'[number]': 'over 260,000'}",False,VB,hire quality freelancers for your job
4,"Financial Advisers Norwich, Norfolk - <company...","Financial Advisers Norwich, Norfolk",headline,mcb financial services ltd,financial services,mcbfinancialservices.co.uk,Mcb Financial Services,"Financial Advisers [country], [country1] - <co...","Financial Advisers [country], [country1]","{'[country]': 'Norwich', '[country1]': 'Norfolk'}",False,NN,financial advisers norwich norfolk


We want our model to generate **industry-specific** slogans. If we use the 'processed_slogan' column as it is, we'll be leaving out crucial context - the industries of the companies behind those slogans. To fix this, we'll create a new **'modified_slogan'** column that adds the industry name to the front of the processed slogan.  

For example:  

> industry = 'computer hardware'  
processed_slogan = 'taking care of small business technology'  
modified_slogan = 'computer hardware taking care of small business technology'

Write code in the cell below to achieve this.

In [4]:
# Combine 'industry' and 'processed_slogan' into a new column 'modified_slogan'

# The lambda function adds the industry name before the processed slogan
df['modified_slogan'] = df.apply(lambda row: f"{row['industry']} {row['processed_slogan']}", axis=1)

# Display the first few rows to verify the result
df[['industry', 'processed_slogan', 'modified_slogan']].head()


Unnamed: 0,industry,processed_slogan,modified_slogan
0,computer hardware,taking care of small business technology,computer hardware taking care of small busines...
1,"health, wellness and fitness",build world class recreation programs,"health, wellness and fitness build world class..."
2,internet,most powerful lead generation software for mar...,internet most powerful lead generation softwar...
3,internet,hire quality freelancers for your job,internet hire quality freelancers for your job
4,financial services,financial advisers norwich norfolk,financial services financial advisers norwich ...


Now we need to get data to train our model. We have textual data which we will need to represent numerically for our model to learn from it.  
The code below does the following:
1. Tokenizes a dataset of slogans.
2. Converts words to numerical indices.
3. Creates input sequences using the numerical indices.  

Here's how it works. From the 'modified_slogan' column, we take the slogan "computer hardware taking care of small business technology". The tokenisation process will convert words into their corresponding indices:  

<center>

| Word         | Token Index |
|-------------|-------|
| "computer"  | 1     |
| "hardware"  | 2     |
| "taking"    | 3     |
| "care"      | 4     |
| "of"        | 5     |
| "small"     | 6     |
| "business"  | 7     |
| "technology"| 8     |

</center>

So the tokenized list is:

<center>
[1, 2, 3, 4, 5, 6, 7, 8]
</center>

When creating input sequences for training, the loop generates progressively longer sequences.

<center>

| Token Index Sequence               | Corresponding Slogan                                 |
|------------------------------|-----------------------------------------------------|
| [1, 2]                       | "computer hardware"                                |
| [1, 2, 3]                    | "computer hardware taking"                        |
| [1, 2, 3, 4]                 | "computer hardware taking care"                   |
| [1, 2, 3, 4, 5]              | "computer hardware taking care of"                |
| [1, 2, 3, 4, 5, 6]           | "computer hardware taking care of small"          |
| [1, 2, 3, 4, 5, 6, 7]        | "computer hardware taking care of small business" |
| [1, 2, 3, 4, 5, 6, 7, 8]     | "computer hardware taking care of small business technology" |

</center>
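
The two tables above can be reproduced with a few lines of plain Python. This is a toy sketch: the word index here is assigned in order of first appearance so that it matches the worked example, whereas the Keras `Tokenizer` used in the notebook assigns indices by word frequency across the whole dataset.

```python
slogan = "computer hardware taking care of small business technology"

# Assign each new word the next free index, starting at 1 (0 is reserved for padding)
word_index = {}
for word in slogan.split():
    if word not in word_index:
        word_index[word] = len(word_index) + 1

tokens = [word_index[word] for word in slogan.split()]

# Build every prefix of length 2 or more
prefix_sequences = [tokens[:i] for i in range(2, len(tokens) + 1)]

print(tokens)                 # [1, 2, 3, 4, 5, 6, 7, 8]
print(prefix_sequences[0])    # [1, 2]
print(len(prefix_sequences))  # 7
```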

Instead of training the model on only **complete slogans**, we provide partial phrases which will help the model learn how words connect over time. This will make it better at predicting the next word when generating slogans.  

Run the cell block below to generate the input sequences. Be sure to read the comments to understand what the code is doing.


In [5]:
# Import the Tokenizer from Keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Initialize the tokenizer
tokenizer = Tokenizer()

# Fit the tokenizer on the modified slogans
tokenizer.fit_on_texts(df['modified_slogan'])

# Convert each slogan into a list of token indices
token_list = tokenizer.texts_to_sequences(df['modified_slogan'])

# Prepare input sequences for training
input_sequences = []

for tokens in token_list:
    for i in range(2, len(tokens) + 1):
        sequence = tokens[:i]
        input_sequences.append(sequence)

# Check the first few sequences
for i in range(5):
    print(input_sequences[i])

# Vocabulary size (+1 for reserved index 0)
vocab_size = len(tokenizer.word_index) + 1
print("\nVocabulary size:", vocab_size)


[11, 236]
[11, 236, 2708]
[11, 236, 2708, 23]
[11, 236, 2708, 23, 24]
[11, 236, 2708, 23, 24, 414]

Vocabulary size: 6046


In [6]:
# Initialize tokenizer
tokenizer = Tokenizer()

# Fit tokenizer on the modified slogans
tokenizer.fit_on_texts(df["modified_slogan"])

# Get total number of unique words (+1 for reserved index 0)
total_words = len(tokenizer.word_index) + 1

# Display the word index mapping (word → token index)
tokenizer.word_index

# Create input sequences for the slogan generator
input_sequences = []

for line in df["modified_slogan"]:
    # Convert the slogan to a list of token indices
    token_list = tokenizer.texts_to_sequences([line])[0]

    # Generate progressively longer sequences
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i + 1])


The input sequences created above are of **varying lengths**, which will be a problem when training our LSTM model. LSTMs require input sequences of **equal length**. So, we need to **pad** shorter sequences by **prepending zeros** until they match the length of the longest sequence.  

For example, if the longest sequence has **10 tokens**, our padded sequences will look like this:

<center>

| Input Sequence                     | Padded Sequence                         |
|-------------------------------------|-----------------------------------------|
| [1, 2]                              | [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]         |
| [1, 2, 3]                           | [0, 0, 0, 0, 0, 0, 0, 1, 2, 3]         |
| [1, 2, 3, 4]                        | [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]         |
| [1, 2, 3, 4, 5]                     | [0, 0, 0, 0, 0, 1, 2, 3, 4, 5]         |
| [1, 2, 3, 4, 5, 6]                  | [0, 0, 0, 0, 1, 2, 3, 4, 5, 6]         |
| [1, 2, 3, 4, 5, 6, 7]               | [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]         |
| [1, 2, 3, 4, 5, 6, 7, 8]            | [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]         |

</center>
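
Pre-padding is simple enough to write by hand. The hypothetical helper `pad_pre` below mirrors what `pad_sequences(..., padding="pre")` does for a single sequence under its default settings, including dropping tokens from the front when a sequence is longer than `maxlen`. It is an illustration, not the Keras implementation, and it assumes `maxlen` is a positive integer.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad a list of token indices to exactly maxlen entries."""
    # Keep only the last maxlen tokens if the sequence is too long
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([1, 2], 10))       # [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
print(pad_pre([1, 2, 3, 4], 3))  # [2, 3, 4]
```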

In the cell below, write code that **finds the length of the longest sequence** in **input_sequences** and stores this value in a variable named **max_seq_len**.


In [7]:
# Find the length of the longest sequence
max_seq_len = max(len(seq) for seq in input_sequences)

# Display the result
print("Maximum sequence length:", max_seq_len)


Maximum sequence length: 15


Run the cell below to pad the input sequences so they all have length **max_seq_len**.

In [8]:
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_len, padding="pre")

## Training Data for Slogan Generator

The input sequences generated will be used as our training data. Our LSTM needs to learn how to predict the **next word** in a sequence.  

The inputs for our model will be the input sequences **excluding the last token index** and the outputs will be the **last token index**.  

As an example, let us use the input sequence [0, 0, 1, 2, 3, 4, 5, 6, 7, 8] and say it corresponds to the slogan "computer hardware taking care of small business technology". When training the model:

> Our input **x** will be the input sequence [0, 0, 1, 2, 3, 4, 5, 6, 7], corresponding to "computer hardware taking care of small business".  
> Our output **y** will be [8], which corresponds to "technology".  
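
The same split can be written with plain list slicing. A small sketch using the padded example sequence (the variable names are illustrative):

```python
padded = [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]

x = padded[:-1]  # everything except the final token: the model input
y = padded[-1]   # the final token: the word the model should predict

print(x)  # [0, 0, 1, 2, 3, 4, 5, 6, 7]
print(y)  # 8
```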

In the code cell below, use `input_sequences` to create the following two variables:
1. **X_gen** which contains the input sequences excluding the last token index.
2. **y_gen** which contains the last token index of the input sequence.

In [9]:
# Prepare training data for the slogan generator

# X_gen will contain all tokens except the last one
# y_gen will contain the last token (the word to be predicted)
X_gen = np.array([seq[:-1] for seq in input_sequences])
y_gen = np.array([seq[-1] for seq in input_sequences])

# Display shapes to confirm everything looks right
print("Shape of X_gen:", X_gen.shape)
print("Shape of y_gen:", y_gen.shape)


Shape of X_gen: (34736, 14)
Shape of y_gen: (34736,)


For each input sequence, the model will output a probability distribution over the vocabulary for the next word. To train against that output, we need to encode each target word as a vector of the same size.
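
One-hot encoding turns each token index into a vector that is all zeros except for a single 1 at that index. A minimal pure-Python sketch of the idea follows. The helper name `one_hot` is hypothetical, and the notebook itself uses `tf.keras.utils.to_categorical()`, which returns a NumPy array rather than nested lists.

```python
def one_hot(indices, num_classes):
    """Return one row per index, with a 1.0 in that index's position."""
    rows = []
    for index in indices:
        row = [0.0] * num_classes
        row[index] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([3, 0, 2], num_classes=5)
print(encoded[0])  # [0.0, 0.0, 0.0, 1.0, 0.0]
```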

In the code cell below, write code that will apply one-hot encoding to **y_gen** using `tf.keras.utils.to_categorical()`. **Maintain the same variable name**.  

*Hint: set the `num_classes` (number of classes) parameter to the total number of unique words in the learned vocabulary. You can access this value through a variable that was created when generating input sequences earlier.*

In [10]:
# One-hot encode the output variable y_gen

# Convert y_gen to one-hot encoding
y_gen = tf.keras.utils.to_categorical(y_gen, num_classes=total_words)

# Display the new shape
print("Shape of one-hot encoded y_gen:", y_gen.shape)


Shape of one-hot encoded y_gen: (34736, 6046)


## Slogan Generator Architecture

In the code cell that follows, configure the LSTM following these steps:

1. Create a sequential model using `tf.keras.models.Sequential()`. This model will have an embedding layer, two LSTM layers, and a dense output layer.
2. Add an embedding layer that converts words into dense vector representations. This layer should:
> *   Have `total_words` as the vocabulary size.
> *   Use 100 as the embedding dimension.
> *   Take an input length of `max_seq_len - 1` (excludes the target word).
3. Add two LSTM layers.
> *   The first LSTM layer should have 150 units **and** set `return_sequences` to `True`.
> *   The second LSTM layer should have 100 units.
4. Add a dense output layer which:
> *   Uses `total_words` as the number of units (one for each word in the vocabulary).
> *   Uses a softmax activation function.
5. Use `Sequential` to put everything together in the correct order to complete the architecture of the LSTM model called **gen_model**.
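
Before building the model, it can be useful to estimate how many trainable parameters this architecture has. The sketch below uses the standard parameter-count formulas for embedding, LSTM, and dense layers with biases, and assumes `total_words = 6046`, the vocabulary size printed earlier. The helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

total_words = 6046  # assumed: vocabulary size reported earlier in the notebook

counts = {
    "embedding": embedding_params(total_words, 100),
    "lstm_1": lstm_params(100, 150),
    "lstm_2": lstm_params(150, 100),
    "dense": dense_params(100, total_words),
}

for name, count in counts.items():
    print(f"{name}: {count:,}")
print(f"total: {sum(counts.values()):,}")
```

The embedding and output layers dominate the total because both scale with the vocabulary size.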


In [11]:
# ================================
# Slogan Generator Architecture
# ================================

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Create the model
gen_model = Sequential()

# 1️⃣ Embedding layer
gen_model.add(Embedding(
    input_dim=total_words,       # Vocabulary size
    output_dim=100,              # Embedding dimension
    input_length=max_seq_len - 1 # Sequence length (excluding target word)
))

# 2️⃣ First LSTM layer
gen_model.add(LSTM(150, return_sequences=True))

# 3️⃣ Second LSTM layer
gen_model.add(LSTM(100))

# 4️⃣ Dense output layer with softmax activation
gen_model.add(Dense(total_words, activation='softmax'))

# Display the model summary
gen_model.summary()




In the code cell below, compile `gen_model` using `categorical_crossentropy` loss, an Adam optimiser, and an appropriate metric of your choice.
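
For a one-hot target, categorical cross-entropy reduces to the negative log of the probability the model assigns to the correct class. The sketch below illustrates this in plain Python (the function name is hypothetical). It also computes the loss of a model that guesses uniformly over a vocabulary of 6046 words, which gives a rough baseline for reading the training log.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# A confident, correct prediction gives a small loss
confident_loss = categorical_crossentropy([0, 1, 0], [0.05, 0.90, 0.05])
print(round(confident_loss, 4))

# Uniform guessing assigns 1/vocab to the correct word, so the loss is log(vocab)
vocab = 6046  # assumed: vocabulary size reported earlier in the notebook
uniform_loss = math.log(vocab)
print(round(uniform_loss, 2))  # 8.71
```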


In [12]:
# Compile the model
gen_model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)


## Slogan Generation

In the code cell below, fit the compiled model on the inputs and outputs, setting the **number of epochs to 50**.

In [13]:
# Train the model
history = gen_model.fit(
    X_gen,       # input sequences
    y_gen,       # one-hot encoded next words
    epochs=50,   # number of passes through the dataset
    batch_size=64  # optional: adjust for memory or performance
)


Epoch 1/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 11s 10ms/step - accuracy: 0.0625 - loss: 7.4764
Epoch 2/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 5s 9ms/step - accuracy: 0.0826 - loss: 6.5905
Epoch 3/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 6s 10ms/step - accuracy: 0.1020 - loss: 6.2642
Epoch 4/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 5s 9ms/step - accuracy: 0.1374 - loss: 6.0076
Epoch 5/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 5s 10ms/step - accuracy: 0.1617 - loss: 5.7884
Epoch 6/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 5s 9ms/step - accuracy: 0.1831 - loss: 5.5741
Epoch 7/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 5s 9ms/step - accuracy: 0.2027 - loss: 5.3802
Epoch 8/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 5s 10ms/step - accuracy: 0.2147 - loss: 5.2298
Epoch 9/50
543/543 ━━━

We will now define a function called `generate_slogan` which will generate a slogan by predicting one word at a time based on a given starting phrase (the `seed_text`). This function will do this using our trained model, `gen_model`.

Here is a breakdown of how the algorithm works:  

Let us assume the dictionary mapping words to unique indices, `tokenizer.word_index`, looks like this:

> `{'computer': 1, 'hardware': 2, 'taking': 3, 'care': 4, 'of': 5}`

If the model's predicted index for the next word is 3 (`predicted_index = 3`), the loop will:

> Check 'computer' (index 1) → No match  
> Check 'hardware' (index 2) → No match  
> Check 'taking' (index 3) → Match found!  
> Assign output_word = "taking" and exit the loop.  

The `output_word` will be appended to the `seed_text`, and the process will continue to add words to the `seed_text` until we have reached the maximum number of words **or** an invalid prediction occurs.  
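
The lookup described above can be sketched with the toy dictionary from the example. The first helper mirrors the linear scan. The second approach inverts the dictionary once so that each lookup is a single dictionary access; it is an optional alternative rather than something the exercise requires. Both names are hypothetical.

```python
word_index = {"computer": 1, "hardware": 2, "taking": 3, "care": 4, "of": 5}

def find_word_by_scan(predicted_index):
    """Walk through the dictionary until an index matches."""
    for word, index in word_index.items():
        if index == predicted_index:
            return word
    return None  # no match, e.g. the padding index 0

# Alternative: invert the mapping once, then look up directly
index_to_word = {index: word for word, index in word_index.items()}

print(find_word_by_scan(3))  # taking
print(index_to_word.get(3))  # taking
print(find_word_by_scan(0))  # None
```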

Carefully follow the code below and complete the missing parts as guided by the comments.

In [14]:
def generate_slogan(seed_text, max_words=20):
    for _ in range(max_words):

        # Tokenising and padding seed_text
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_seq_len-1, padding="pre")

        # Use your trained model (gen_model) on token_list to predict the probability distribution of the next word over the vocabulary
        predictions = gen_model.predict(token_list, verbose=0)

        # From the predicted probabilities, identify the word index with the highest probability
        predicted_index = np.argmax(predictions, axis=-1)[0]

        output_word = None

        # Searching for the word that corresponds to the predicted index
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break

        # If no valid word is found, algorithm stops
        if output_word is None:
            break  # out of main loop

        # Append the predicted word to seed_text
        seed_text += " " + output_word

    return seed_text


## Training Data for Slogan Classifier

We will now prepare the data we will use to train our classifier. For our classifier, the inputs will come from the `processed_slogan` column of our DataFrame, `df`. The outputs will be the different industry categories under the `industry` column.

In the code cell below, extract the unique values from the `industry` column in the DataFrame and store these in a variable called **industries**.

In [15]:
# Extract unique industry categories
industries = df['industry'].unique()

# Display the result
print(industries)


['computer hardware' 'health, wellness and fitness' 'internet'
 'financial services' 'mechanical or industrial engineering'
 'marketing and advertising' 'hospital & health care' 'research'
 'information technology and services' 'computer software' 'oil & energy'
 'dairy' 'transportation/trucking/railroad' 'design' 'furniture'
 'professional training & coaching' 'hospitality' 'textiles'
 'food & beverages' 'management consulting' 'medical practice'
 'accounting' 'performing arts' 'electrical/electronic manufacturing'
 'higher education' 'outsourcing/offshoring'
 'venture capital & private equity' 'writing and editing'
 'mining & metals' 'construction' 'consumer electronics' 'retail'
 'human resources' 'staffing and recruiting' 'farming' 'wholesale'
 'events services' 'import and export'
 'non-profit organization management' 'machinery' 'information services'
 'biotechnology' 'philanthropy' 'law practice' 'real estate'
 'graphic design' 'building materials' 'medical devices' 'consumer go

Create a dictionary called `industry_to_index` where each unique industry is mapped to a unique index starting from 0.

*Hint: Use the `enumerate()` function.*

In [16]:
# Create dictionary mapping industries to unique indices
industry_to_index = {industry: idx for idx, industry in enumerate(industries)}

# Display the mapping
print(industry_to_index)


{'computer hardware': 0, 'health, wellness and fitness': 1, 'internet': 2, 'financial services': 3, 'mechanical or industrial engineering': 4, 'marketing and advertising': 5, 'hospital & health care': 6, 'research': 7, 'information technology and services': 8, 'computer software': 9, 'oil & energy': 10, 'dairy': 11, 'transportation/trucking/railroad': 12, 'design': 13, 'furniture': 14, 'professional training & coaching': 15, 'hospitality': 16, 'textiles': 17, 'food & beverages': 18, 'management consulting': 19, 'medical practice': 20, 'accounting': 21, 'performing arts': 22, 'electrical/electronic manufacturing': 23, 'higher education': 24, 'outsourcing/offshoring': 25, 'venture capital & private equity': 26, 'writing and editing': 27, 'mining & metals': 28, 'construction': 29, 'consumer electronics': 30, 'retail': 31, 'human resources': 32, 'staffing and recruiting': 33, 'farming': 34, 'wholesale': 35, 'events services': 36, 'import and export': 37, 'non-profit organization management

Create a new column `industry_index` in your DataFrame by mapping the `industry` column to the indices using the `industry_to_index` dictionary.

*Hint: Use the  `map()` function.*

In [17]:
# Map industry names to their corresponding indices
df['industry_index'] = df['industry'].map(industry_to_index)

# Display the first few rows to verify
df.head()


Unnamed: 0,desc,output,type,company,industry,url,alias,desc_masked,output_masked,ent_dict,unsupported,first_pos,processed_slogan,modified_slogan,industry_index
0,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,headline_long,eftpos warehouse,computer hardware,eftposwarehouse.co.nz,Eftpos Warehouse,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,{'[date]': 'monthly'},False,VB,taking care of small business technology,computer hardware taking care of small busines...,0
1,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,headline,welbi,"health, wellness and fitness",welbi.co,Welbi,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,{},False,VB,build world class recreation programs,"health, wellness and fitness build world class...",1
2,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,headline_long,optinmonster,internet,optinmonster.com,Optinmonster,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,{},False,JJ,most powerful lead generation software for mar...,internet most powerful lead generation softwar...,2
3,Twine matches companies to the best digital an...,Hire quality freelancers for your job,headline_long,twine.fm,internet,twine.fm,,Twine matches companies to the best digital an...,Hire quality freelancers for your job,"{'[number]': 'over 260,000'}",False,VB,hire quality freelancers for your job,internet hire quality freelancers for your job,2
4,"Financial Advisers Norwich, Norfolk - <company...","Financial Advisers Norwich, Norfolk",headline,mcb financial services ltd,financial services,mcbfinancialservices.co.uk,Mcb Financial Services,"Financial Advisers [country], [country1] - <co...","Financial Advisers [country], [country1]","{'[country]': 'Norwich', '[country1]': 'Norfolk'}",False,NN,financial advisers norwich norfolk,financial services financial advisers norwich ...,3


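Split the DataFrame `df` into training and testing sets, setting aside 20% of the data for the test set. Be sure to stratify on the `industry_index` column using the `stratify` parameter. This ensures that both sets have roughly the same proportion of each class (industry) as the original dataset; it does not balance the classes themselves. Stratified splitting needs at least two samples per class, so the code below first drops industries that appear only once and stratifies on the filtered DataFrame. Call the training DataFrame `df_train` and the testing DataFrame `df_test`.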
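
To make the effect of stratification concrete, here is a simplified pure-Python illustration. It is not scikit-learn's algorithm: the hypothetical helper `stratified_split` samples each class separately, and it raises an error for a class with a single member because one sample cannot appear in both sets.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with per-class sampling."""
    by_class = defaultdict(list)
    for position, label in enumerate(labels):
        by_class[label].append(position)

    rng = random.Random(seed)
    train, test = [], []
    for label, positions in by_class.items():
        if len(positions) < 2:
            raise ValueError(f"class {label!r} has only one sample")
        rng.shuffle(positions)
        # At least one test sample per class, the rest go to training
        n_test = max(1, round(len(positions) * test_size))
        test.extend(positions[:n_test])
        train.extend(positions[n_test:])
    return train, test

labels = ["internet"] * 10 + ["retail"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 12 3
```

In this toy dataset the 2:1 ratio of "internet" to "retail" is preserved in both the training and the test indices.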

In [18]:
# Count samples per industry
counts = df['industry_index'].value_counts()

# Keep only classes with at least 2 samples
df_filtered = df[df['industry_index'].isin(counts[counts >= 2].index)]

# Split filtered data
df_train, df_test = train_test_split(
    df_filtered,
    test_size=0.2,
    stratify=df_filtered["industry_index"],
    random_state=42
)



Our classifier will use padded slogan sequences as inputs, similar to input sequences used for the slogan generator. The difference is we will not use sequences that get progressively longer, but instead we will use **complete slogans**. This is because our classifier does not need to learn how to predict what word comes next. It needs the full context of a slogan to learn how to accurately predict the industry.  

The next steps will walk you through how to create these sequences.  

We previously created and fitted a `Tokenizer` object called `tokenizer` while preparing data for the slogan generator. Now, we will reuse it to convert words into numerical indices.  

In the code cell below, use the `texts_to_sequences()` **method** of `tokenizer` to transform the `processed_slogan` column in **both** the `df_train` and `df_test` DataFrames into sequences of numerical indices. Store the results in variables named `X_train` and `X_test`.


In [19]:
# Convert processed slogans to sequences using the tokenizer
X_train = tokenizer.texts_to_sequences(df_train["processed_slogan"])
X_test = tokenizer.texts_to_sequences(df_test["processed_slogan"])

# Display first few sequences to verify
print(X_train[:5])
print(X_test[:5])


[[1091, 167, 209, 33, 4, 583], [78, 44, 109], [260, 2908, 195, 1456], [1400, 9, 72], [2187, 1, 566, 556, 52, 408]]
[[5224, 3, 5225, 621], [1748, 15, 1, 43, 2588, 4, 4918, 1, 842], [1202, 432, 565, 432, 4880, 299, 91], [45, 13, 45, 36, 210], [300, 868, 112]]


The slogan sequences are of varying lengths. We will need to pad them the same way we padded the input sequences for the slogan generator. The `pad_sequences()` function ensures that the sequences in `X_train` and `X_test` all have the same length.  

In the code cell below, use the `pad_sequences()` function to standardise the sequence lengths. Set the `maxlen` parameter to `max_seq_len` and the `padding` parameter to `"pre"` (so that zeros are prepended), and assign the resulting padded sequences to the same variables, `X_train` and `X_test`.

In [20]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad the sequences so they all have the same length
X_train = pad_sequences(X_train, maxlen=max_seq_len, padding="pre")
X_test = pad_sequences(X_test, maxlen=max_seq_len, padding="pre")

# Display shapes to verify
print(X_train.shape)
print(X_test.shape)


(4272, 15)
(1068, 15)


We have successfully created training and testing inputs for our model. Now, we will create the outputs - industry categories.

In the code cell that follows, use `tf.keras.utils.to_categorical()` to apply one-hot encoding to the `industry_index` column of **both** the `df_train` and `df_test` DataFrames. Assign the results to variables named `y_train` and `y_test`.

*Hint: set the `num_classes` parameter to the total number of industries in the DataFrame. The `industries` variable can be used to find this value.*

In [21]:
import tensorflow as tf

# One-hot encode the industry indices
y_train = tf.keras.utils.to_categorical(df_train["industry_index"], num_classes=len(industries))
y_test = tf.keras.utils.to_categorical(df_test["industry_index"], num_classes=len(industries))

# Display shapes to verify
print(y_train.shape)
print(y_test.shape)


(4272, 142)
(1068, 142)


## Slogan Classifier Architecture

Configure the LSTM classifier following these steps:  


1. Create a Sequential model:  
   Use `tf.keras.models.Sequential()` to create a sequential model. This model will consist of an embedding layer, two LSTM layers, and a dense output layer.

2. Add an embedding layer which will convert words into dense vector representations. Configure this layer with:
   > * `total_words` as the vocabulary size.
   > * 100 as the embedding dimension.
   > * `max_seq_len` as the `input_length` (this is the length of the slogans).

3. Add the first LSTM layer. Configure it with:
   > * 150 units.
   > * Set `return_sequences` to `True` to ensure the layer outputs sequences for the next LSTM layer.

4. Add the second LSTM layer which will process the output from the previous LSTM layer. Configure it with:
   > * 100 units.
   > * No need to set `return_sequences` here (it is the final LSTM layer).

5. Add the dense output layer which will classify the data into industries. Configure it with:
   > * The number of unique industries as the number of units.
   > * The `softmax` activation function to get probabilities for each class (industry).

6. Use `Sequential` to arrange all layers in the correct order and complete the architecture of the LSTM model called **class_model**.
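
The softmax activation in the output layer converts raw scores into probabilities that sum to 1, and the predicted industry is the class with the highest probability. A minimal pure-Python sketch (the function name and the example scores are illustrative):

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
print(predicted_class)  # 0
```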


In [22]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Build the LSTM classifier model
class_model = Sequential([
    # Embedding layer (note: newer Keras releases deprecate `input_length`
    # and may emit a warning when it is passed)
    Embedding(input_dim=total_words, output_dim=100, input_length=max_seq_len),

    # First LSTM layer
    LSTM(150, return_sequences=True),

    # Second LSTM layer
    LSTM(100),

    # Dense output layer for classification
    Dense(len(industries), activation='softmax')
])

# Display the model summary to verify the architecture
class_model.summary()
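
As a rough cross-check of what `class_model.summary()` should report, the sketch below estimates the number of trainable parameters per layer using the standard formulas for embedding, LSTM, and dense layers. The vocabulary size `assumed_total_words` is a hypothetical placeholder, since the real value of `total_words` depends on the fitted tokenizer; the 142 output units come from the shape of `y_train` shown earlier.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

assumed_total_words = 5000  # hypothetical; replace with the real total_words
num_industries = 142        # number of classes, from the y_train shape

param_counts = {
    "embedding": embedding_params(assumed_total_words, 100),
    "lstm_1": lstm_params(input_dim=100, units=150),
    "lstm_2": lstm_params(input_dim=150, units=100),
    "dense": dense_params(input_dim=100, units=num_industries),
}

for layer_name, count in param_counts.items():
    print(f"{layer_name}: {count:,}")
print(f"total: {sum(param_counts.values()):,}")
```

With a large vocabulary the embedding layer tends to dominate the parameter count, which is worth keeping in mind when the training set has only a few thousand slogans.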




In the code cell below, compile `class_model` using `categorical_crossentropy` loss, an Adam optimiser, and an appropriate metric of your choice.

In [23]:
class_model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
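
To illustrate what the `categorical_crossentropy` loss measures, here is a minimal NumPy sketch that computes it by hand for one-hot targets. The function and the example arrays are hypothetical; Keras computes the loss internally and its numerical details may differ slightly.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    # Clip predictions to avoid taking log(0)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_sample = -np.sum(np.asarray(y_true) * np.log(y_pred), axis=1)
    return float(np.mean(per_sample))

# Two hypothetical samples with three classes each (one-hot targets)
y_true = np.array([[0, 1, 0],
                   [1, 0, 0]])

# Predictions that put most probability on the correct class
good_pred = np.array([[0.05, 0.90, 0.05],
                      [0.80, 0.10, 0.10]])

# Predictions that put most probability on a wrong class
bad_pred = np.array([[0.90, 0.05, 0.05],
                     [0.10, 0.10, 0.80]])

good_loss = categorical_crossentropy(y_true, good_pred)
bad_loss = categorical_crossentropy(y_true, bad_pred)
print(f"good predictions: {good_loss:.4f}")
print(f"bad predictions:  {bad_loss:.4f}")
```

The loss grows quickly when the model is confident and wrong, so a high loss can accompany a moderate accuracy.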


## Slogan Classification & Evaluation

In the code cell that follows, fit the compiled model on the inputs and outputs, setting **the number of epochs to 50**.

In [28]:
history = class_model.fit(
    X_train,      # training inputs
    y_train,      # training outputs (one-hot encoded)
    epochs=50,    # number of training epochs
    batch_size=64, # optional, can adjust for memory/performance
    validation_data=(X_test, y_test)  # optional, to monitor performance on test set
)


Epoch 1/50
67/67 ━━━━━━━━━━━━━━━━━━━━ 2s 23ms/step - accuracy: 0.9951 - loss: 0.0180 - val_accuracy: 0.1882 - val_loss: 7.4591
Epoch 2/50
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 20ms/step - accuracy: 0.9967 - loss: 0.0142 - val_accuracy: 0.1854 - val_loss: 7.4822
Epoch 3/50
67/67 ━━━━━━━━━━━━━━━━━━━━ 2s 24ms/step - accuracy: 0.9970 - loss: 0.0138 - val_accuracy: 0.1845 - val_loss: 7.5024
Epoch 4/50
67/67 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - accuracy: 0.9976 - loss: 0.0122 - val_accuracy: 0.1863 - val_loss: 7.5276
Epoch 5/50
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - accuracy: 0.9970 - loss: 0.0150 - val_accuracy: 0.1873 - val_loss: 7.5404
Epoch 6/50
67/67 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - accuracy: 0.9973 - loss: 0.0130 - val_accuracy: 0.1901 - val_loss: 7.5632
Epoch 7/50
67/67 ━━━━

Evaluate the model using the testing set. Add a comment on the model's performance.

In [29]:
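# Evaluate the classifier on the test set
# Comment on performance: training accuracy reaches about 0.99 during fitting,
# while test accuracy is only about 0.19. That is well above random guessing
# across 142 industries (about 0.007), but the large gap indicates overfitting.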
loss, accuracy = class_model.evaluate(X_test, y_test, verbose=0)

print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")


Test Loss: 8.2944
Test Accuracy: 0.1919
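
One common response to a validation loss that keeps rising is early stopping. The sketch below applies simple patience-based logic to the `val_loss` values logged during fitting above. The helper `best_epoch_with_patience` is hypothetical and only illustrates the idea; in practice Keras offers `tf.keras.callbacks.EarlyStopping` (with parameters such as `monitor`, `patience`, and `restore_best_weights`) for this purpose.

```python
def best_epoch_with_patience(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` epochs pass without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)

# val_loss values copied from the first six epochs of the training log
logged_val_losses = [7.4591, 7.4822, 7.5024, 7.5276, 7.5404, 7.5632]

best_epoch, stop_epoch = best_epoch_with_patience(logged_val_losses, patience=3)
print(f"best epoch: {best_epoch}, training would stop at epoch: {stop_epoch}")
```

Because the logged validation loss increases from the first epoch shown, this rule would stop almost immediately, which suggests that further epochs in that run were not improving generalisation.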


We will now define a function called `classify_slogan` which takes a slogan as input and predicts the industry it belongs to using the trained model, `class_model`.  

Carefully follow the code below and complete the missing parts (indicated by ellipses) as guided by the comments.

In [30]:
def classify_slogan(slogan):
    # Preprocess the input slogan using the preprocessing function
    slogan = preprocess_text(slogan)

    # Convert the slogan to a sequence of indices
    sequence = tokenizer.texts_to_sequences([slogan])

    # Pad the sequence to the same length as training sequences
    padded_sequence = pad_sequences(sequence, maxlen=max_seq_len, padding="pre")

    # Get predicted probabilities from the classifier
    prediction = class_model.predict(padded_sequence, verbose=0)

    # Get the index of the industry with the highest probability
    predicted_index = np.argmax(prediction, axis=-1)[0]

    # Return the predicted industry name
    return industries[predicted_index]
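
When a prediction looks wrong, it can help to inspect more than the single most likely class. The following is a minimal sketch of a top-k lookup over a probability vector using NumPy. The helper `top_k_labels`, the label list, and the probabilities are all hypothetical; with the real model, the probabilities would come from `class_model.predict(...)[0]` and the labels from `industries`.

```python
import numpy as np

def top_k_labels(probabilities, labels, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probabilities = np.asarray(probabilities, dtype=float)
    # argsort is ascending, so reverse it and keep the first k indices
    top_indices = np.argsort(probabilities)[::-1][:k]
    return [(labels[i], float(probabilities[i])) for i in top_indices]

# Hypothetical labels and predicted probabilities for one slogan
example_labels = ["internet", "marketing", "finance", "retail"]
example_probs = [0.30, 0.35, 0.10, 0.25]

ranked = top_k_labels(example_probs, example_labels, k=2)
for label, prob in ranked:
    print(f"{label}: {prob:.2f}")
```

If the expected industry appears just below the top prediction, the classifier is closer to correct than a single argmax suggests.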


## Combining the two models

Run the code cell below to combine the two models: we first generate a slogan for a company in the "internet" industry, then pass the generated slogan to the slogan classifier to check whether it is classified as "internet".

In [31]:
industry = "internet"
generated_slogan = generate_slogan(industry)
predicted_industry = classify_slogan(generated_slogan)

print(f"Generated Slogan: {generated_slogan}")
print(f"Predicted Industry: {predicted_industry}")

Generated Slogan: internet web design agency in pune india in singapore cloud acquisition and capital solutions for be today tricks area companies and
Predicted Industry: international affairs


Compare the results and comment on any differences you notice between the generated slogans and the classifier’s predictions in the markdown cell below.


**Comparison of Generated Slogan vs Predicted Industry**

**Generated slogan:** internet web design agency in pune india in singapore cloud acquisition and capital solutions for be today tricks area companies and

**Predicted industry:** international affairs

**Observations and comments**

- **Mismatch between content and prediction:** The generated slogan mentions web design, cloud solutions, and tech services, which is very different from international affairs.
- **Possible causes of the misclassification:**
  - The classifier may have been trained on a limited or imbalanced dataset, so some industries (such as tech/digital) may be underrepresented.
  - Generated slogans often combine multiple phrases and keywords, which might confuse the classifier and cause it to choose a more frequent or "closest matching" category.
- **Nature of the generator vs the classifier:**
  - The slogan generator predicts the next word based on sequence patterns learned across all industries, so it does not always stick to a single industry.
  - The classifier only sees the final slogan and tries to assign it to one of the predefined categories, so mixed-industry phrases can mislead it.

**Conclusion:** There is a clear discrepancy between the slogan content and the predicted industry.
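
One way to check the suspected class imbalance is to count how many training slogans each industry has. The sketch below does this with `collections.Counter` on a small hypothetical label list; the helper `class_distribution` is an illustration, and with the real data the labels could be taken from the training DataFrame, for example `df_train["industry"].tolist()`, assuming that column is still present.

```python
from collections import Counter

def class_distribution(labels):
    """Return (label, count, share) tuples, most frequent label first."""
    counts = Counter(labels)
    total = len(labels)
    return [(label, count, count / total) for label, count in counts.most_common()]

# Hypothetical industry labels for ten training slogans
example_labels = ["internet"] * 2 + ["marketing"] * 5 + ["finance"] * 3

distribution = class_distribution(example_labels)
for label, count, share in distribution:
    print(f"{label}: {count} slogans ({share:.0%})")

# Share of the largest class: the accuracy of always predicting that class
majority_label, majority_count, majority_share = distribution[0]
print(f"majority baseline: {majority_share:.2f} ({majority_label})")
```

Comparing the test accuracy with the majority baseline gives a quick sense of how much the classifier has learned beyond class frequency.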
