## 📦 Step 1: Importing Required Libraries

We begin by importing essential libraries for data processing and building the LSTM model:

In [None]:
import pandas as pd
import numpy as np
from collections import Counter
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

## 📄 Step 2: Load Dataset in Chunks

To handle large datasets efficiently, we load the FreeCodeCamp chat data in chunks of 100,000 rows:

In [None]:
data_path = "freecodecamp_chat.csv"
texts = []
chunks = pd.read_csv(data_path, chunksize=100_000)

for i, chunk in enumerate(chunks):
    chunk = chunk[chunk["text"].notna()]
    texts += chunk["text"].astype(str).tolist()

    if i == 5:
        break

print("Total messages loaded:", len(texts))

- We use `chunksize=100_000` to read the dataset incrementally.
- Only rows with non-null `text` values are retained.
- The loop stops after reading 6 chunks (`i` runs from 0 to 5), so at most 600,000 rows are read, and fewer messages are kept once null `text` rows are dropped.
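
The chunked-loading pattern above can be tried on a small scale. The sketch below is only an illustration: the in-memory CSV, its `id` column, and the chunk size of 2 are assumptions made for the demo, not properties of the real dataset.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real file (hypothetical contents).
# Row 2 has an empty `text` field, which pandas reads as NaN.
toy_csv = io.StringIO("id,text\n1,hello\n2,\n3,how are you\n4,fine\n5,bye\n")

toy_texts = []
for i, chunk in enumerate(pd.read_csv(toy_csv, chunksize=2)):
    chunk = chunk[chunk["text"].notna()]          # drop rows without text
    toy_texts += chunk["text"].astype(str).tolist()
    if i == 1:  # stop after two chunks, mirroring the `i == 5` cutoff
        break

print(toy_texts)
```

Because the loop breaks after the second chunk, the fifth row is never read, just as rows beyond the sixth chunk are skipped in the notebook.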

## 🧹 Step 3: Text Preprocessing and Character Filtering

We create a clean corpus by lowercasing all text, filtering infrequent characters, and mapping characters to integer indices:

In [None]:
corpus = " ".join(texts).lower()

char_freq = Counter(corpus)
min_freq = 100
valid_chars = sorted([c for c, f in char_freq.items() if f >= min_freq])

char_to_idx = {c: i for i, c in enumerate(valid_chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

# Dict membership is O(1), which matters over a multi-million-character corpus
clean_corpus = ''.join(c for c in corpus if c in char_to_idx)
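
To see what the frequency filter does, here is a minimal sketch on a toy string. The sample text and the threshold `toy_min_freq = 2` are illustrative assumptions; the notebook itself uses the joined chat messages and `min_freq = 100`.

```python
from collections import Counter

# Toy corpus (hypothetical); the notebook uses the joined chat messages instead.
toy_corpus = "Hello world, hello Keras!".lower()
toy_min_freq = 2  # assumed threshold for the demo

toy_freq = Counter(toy_corpus)
toy_valid = sorted(c for c, f in toy_freq.items() if f >= toy_min_freq)
toy_char_to_idx = {c: i for i, c in enumerate(toy_valid)}

# Characters below the threshold are removed from the text entirely.
toy_clean = "".join(c for c in toy_corpus if c in toy_char_to_idx)

print(toy_valid)
print(repr(toy_clean))
```

Rare characters are deleted rather than replaced by a placeholder, so words that contain them are shortened in the cleaned corpus.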

## 🔢 Step 4: Sequence Generation

We split the cleaned corpus into overlapping sequences of fixed length and prepare input-output pairs for the model:

In [None]:
maxlen = 100
step = 3

sequences = []
next_chars = []

for i in range(0, len(clean_corpus) - maxlen, step):
    sequences.append(clean_corpus[i:i+maxlen])
    next_chars.append(clean_corpus[i + maxlen])

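# Keras expects array inputs, so convert the nested lists to NumPy arrays
X = np.array([[char_to_idx[c] for c in seq] for seq in sequences], dtype=np.int32)
y = np.array([char_to_idx[c] for c in next_chars], dtype=np.int32)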
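
The sliding-window logic is easier to follow on a short string. This sketch uses an assumed window length of 4 (the notebook uses 100) with the same stride of 3; the `toy_` names are hypothetical.

```python
toy_text = "abcdefghij"
toy_maxlen = 4   # assumed window length for the demo; the notebook uses 100
toy_step = 3     # same stride as the notebook

# Each pair is (input window, the character that immediately follows it).
toy_pairs = [
    (toy_text[i:i + toy_maxlen], toy_text[i + toy_maxlen])
    for i in range(0, len(toy_text) - toy_maxlen, toy_step)
]

for window, target in toy_pairs:
    print(window, "->", target)
```

Consecutive windows overlap by `maxlen - step` characters, and a larger `step` yields fewer training pairs from the same corpus.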

## 🧠 Step 5: Model Architecture and Training

We build a character-level LSTM model using Keras and train it on the prepared input-output sequences:

In [None]:
model = Sequential([
    Embedding(input_dim=len(char_to_idx), output_dim=64, input_length=maxlen),
    LSTM(128),
    Dropout(0.2),
    Dense(len(char_to_idx), activation='softmax')
])

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=Adam(learning_rate=0.001),
    metrics=['accuracy']
)

model.fit(X, y, batch_size=64, epochs=5, validation_split=0.1)

### 🔍 Model Details:
- `Embedding`: Converts character indices to dense vectors of size 64.
- `LSTM`: 128 units to capture sequential dependencies.
- `Dropout`: Reduces overfitting by randomly zeroing 20% of the LSTM output units during training; it is inactive at inference time.
- `Dense`: Final layer with softmax activation to predict the next character.
- `loss`: `sparse_categorical_crossentropy` matches the integer-encoded targets in `y`, so no one-hot encoding is needed.
- `optimizer`: Adam optimizer with learning rate 0.001.
- `validation_split=0.1`: 10% of the data is used for validation.

Training runs for 5 epochs with a batch size of 64.
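
As a rough size check for this architecture, the trainable parameter counts can be computed by hand. The helper below is a sketch that assumes the standard Keras layer definitions with biases enabled; `count_params` and the vocabulary size of 60 are hypothetical and chosen only for illustration.

```python
def count_params(vocab_size, embed_dim=64, lstm_units=128):
    """Trainable parameter counts for Embedding -> LSTM -> Dense(softmax)."""
    embedding = vocab_size * embed_dim
    # An LSTM has 4 gates, each with input weights, recurrent weights, and a bias.
    lstm = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)
    # Dense layer: one weight per (LSTM unit, character) pair plus one bias per character.
    dense = lstm_units * vocab_size + vocab_size
    return {
        "embedding": embedding,
        "lstm": lstm,
        "dense": dense,
        "total": embedding + lstm + dense,
    }

params = count_params(vocab_size=60)  # 60 is a hypothetical vocabulary size
print(params)
```

The LSTM term does not depend on the vocabulary size, so it dominates the total for small character sets. `Dropout` adds no parameters.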

## ✍️ Step 6: Text Generation with the Trained Model

We define functions to generate text character-by-character using the trained LSTM model. The generation process is autoregressive: each predicted character is appended to the input for the next prediction.

In [None]:
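def sample(preds, temperature=1.0):
    # Cast to float64 so the renormalized probabilities pass np.random.choice's sum-to-1 check
    preds = np.asarray(preds).astype("float64")
    preds = np.log(preds + 1e-8) / temperature
    preds = np.exp(preds) / np.sum(np.exp(preds))
    return np.random.choice(len(preds), p=preds)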

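def generate_text(model, seed, length=300, temperature=1.0):
    # Match the training preprocessing: lowercase, then drop characters outside the vocabulary
    seed = ''.join(c for c in seed.lower() if c in char_to_idx)
    result = seed
    input_seq = seed[-maxlen:]

    for _ in range(length):
        input_indices = [char_to_idx[c] for c in input_seq]
        # Left-pad short inputs with index 0; note that 0 is also a real character's index
        input_array = np.zeros((1, maxlen), dtype=np.int32)
        if input_indices:
            input_array[0, -len(input_indices):] = input_indices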

        preds = model.predict(input_array, verbose=0)[0]
        next_idx = sample(preds, temperature)
        next_char = idx_to_char[next_idx]

        result += next_char
        input_seq = result[-maxlen:]

    return result
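
The effect of `temperature` can be checked without a trained model. The sketch below applies the same log, divide, and renormalize steps as `sample` to a made-up distribution, using only the standard library. `apply_temperature` and `toy_probs` are hypothetical names, and subtracting the maximum is an added stability step that does not change the result.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability vector the way `sample` does before drawing."""
    logits = [math.log(p + 1e-8) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]

toy_probs = [0.7, 0.2, 0.1]  # hypothetical model output over three characters
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(toy_probs, t)])
```

Temperatures below 1 sharpen the distribution toward the most likely character, which gives more repetitive but safer text. Temperatures above 1 flatten it, which gives more varied but noisier text.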

## 🚀 Step 7: Generate Sample Text

We now generate a sample output using the trained LSTM model and a custom seed string:

In [None]:
seed = "what are you working on"
print(generate_text(model, seed, length=300, temperature=0.8))