# 📊 **New York Times Comments Dataset Analysis**
This notebook analyzes the New York Times Comments dataset available on Kaggle.
We will extract metadata, check for missing values, and summarize the structure of the dataset before proceeding with text analysis.

## **📌 Step 1: Setup the Environment**
We start by importing the necessary libraries and listing all available files in the dataset.

In [1]:
import os
import pandas as pd

# Set path to dataset (Kaggle users should adjust as needed)
dataset_path = "/kaggle/input/nyt-comments"

# List all files in the dataset
files = os.listdir(dataset_path)
print("Files in dataset:\n", files)

Files in dataset:
 ['CommentsFeb2018.csv', 'ArticlesFeb2017.csv', 'CommentsApril2018.csv', 'ArticlesJan2017.csv', 'ArticlesMay2017.csv', 'CommentsJan2017.csv', 'CommentsMarch2017.csv', 'CommentsMay2017.csv', 'CommentsMarch2018.csv', 'CommentsApril2017.csv', 'ArticlesMarch2017.csv', 'ArticlesApril2017.csv', 'CommentsFeb2017.csv', 'ArticlesJan2018.csv', 'ArticlesFeb2018.csv', 'ArticlesMarch2018.csv', 'CommentsJan2018.csv', 'ArticlesApril2018.csv']


## **📌 Step 2: Load & Inspect Data**
Let's load files (e.g., `ArticlesJan2017.csv` and `CommentsApril2017.csv`) to inspect its structure.

In [2]:
# Load an example file to inspect its structure
sample_file = "ArticlesJan2017.csv"  # You can change this to any file in the dataset
df = pd.read_csv(os.path.join(dataset_path, sample_file))

# Display first few rows
df.head()

Unnamed: 0,articleID,abstract,byline,documentType,headline,keywords,multimedia,newDesk,printPage,pubDate,sectionName,snippet,source,typeOfMaterial,webURL,articleWordCount
0,58691a5795d0e039260788b9,,By JENNIFER STEINHAUER,article,G.O.P. Leadership Poised to Topple Obama’s Pi...,"['United States Politics and Government', 'Law...",1,National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led...,The New York Times,News,https://www.nytimes.com/2017/01/01/us/politics...,1324
1,586967bf95d0e03926078915,,By MARK LANDLER,article,Fractured World Tested the Hope of a Young Pre...,"['Obama, Barack', 'Afghanistan', 'United State...",1,Foreign,1,2017-01-01 20:34:00,Asia Pacific,A strategy that went from a “good war” to the ...,The New York Times,News,https://www.nytimes.com/2017/01/01/world/asia/...,2836
2,58698a1095d0e0392607894a,,By CAITLIN LOVINGER,article,Little Troublemakers,"['Crossword Puzzles', 'Boxing Day', 'Holidays ...",1,Games,0,2017-01-01 23:00:24,Unknown,Chuck Deodene puts us in a bubbly mood.,The New York Times,News,https://www.nytimes.com/2017/01/01/crosswords/...,445
3,5869911a95d0e0392607894e,,By JOCHEN BITTNER,article,"Angela Merkel, Russia’s Next Target","['Cyberwarfare and Defense', 'Presidential Ele...",1,OpEd,15,2017-01-01 23:30:27,Unknown,"With a friend entering the White House, Vladim...",The New York Times,Op-Ed,https://www.nytimes.com/2017/01/01/opinion/ang...,864
4,5869a61795d0e03926078962,,By JIAYIN SHEN,article,Boots for a Stranger on a Bus,"['Shoes and Boots', 'Buses', 'New York City']",0,Metro,12,2017-01-02 01:00:02,Unknown,Witnessing an act of generosity on a rainy day.,The New York Times,Brief,https://www.nytimes.com/2017/01/01/nyregion/me...,309


In [9]:
# Load an example file to inspect its structure
sample_file_2 = "CommentsApril2017.csv"  # You can change this to any file in the dataset
df_2 = pd.read_csv(os.path.join(dataset_path, sample_file_2))

# Display first few rows
df_2.head()

  df_2 = pd.read_csv(os.path.join(dataset_path, sample_file_2))


Unnamed: 0,approveDate,commentBody,commentID,commentSequence,commentTitle,commentType,createDate,depth,editorsSelection,parentID,...,userLocation,userTitle,userURL,inReplyTo,articleID,sectionName,newDesk,articleWordCount,printPage,typeOfMaterial
0,1491245186,This project makes me happy to be a 30+ year T...,22022598.0,22022598,<br/>,comment,1491237000.0,1,False,0.0,...,"Riverside, CA",,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
1,1491188619,Stunning photos and reportage. Infuriating tha...,22017350.0,22017350,,comment,1491180000.0,1,False,0.0,...,<br/>,,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
2,1491188617,Brilliant work from conception to execution. I...,22017334.0,22017334,<br/>,comment,1491179000.0,1,False,0.0,...,Raleigh NC,,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
3,1491167820,NYT reporters should provide a contributor's l...,22015913.0,22015913,<br/>,comment,1491150000.0,1,False,0.0,...,"Missouri, USA",,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
4,1491167815,Could only have been done in print. Stunning.,22015466.0,22015466,<br/>,comment,1491147000.0,1,False,0.0,...,"Tucson, Arizona",,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News


## **📌 Step 3: Extract Metadata**
Now, we extract key metadata, such as column names, data types, and missing values.

In [4]:
# Display dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   articleID         850 non-null    object
 1   abstract          40 non-null     object
 2   byline            850 non-null    object
 3   documentType      850 non-null    object
 4   headline          850 non-null    object
 5   keywords          850 non-null    object
 6   multimedia        850 non-null    int64 
 7   newDesk           850 non-null    object
 8   printPage         850 non-null    int64 
 9   pubDate           850 non-null    object
 10  sectionName       850 non-null    object
 11  snippet           850 non-null    object
 12  source            850 non-null    object
 13  typeOfMaterial    850 non-null    object
 14  webURL            850 non-null    object
 15  articleWordCount  850 non-null    int64 
dtypes: int64(3), object(13)
memory usage: 106.4+ KB


In [10]:
# Display dataset information
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243832 entries, 0 to 243831
Data columns (total 34 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   approveDate            243832 non-null  int64  
 1   commentBody            243832 non-null  object 
 2   commentID              243832 non-null  float64
 3   commentSequence        243832 non-null  int64  
 4   commentTitle           228498 non-null  object 
 5   commentType            243832 non-null  object 
 6   createDate             243832 non-null  float64
 7   depth                  243832 non-null  int64  
 8   editorsSelection       243832 non-null  bool   
 9   parentID               243832 non-null  float64
 10  parentUserDisplayName  70526 non-null   object 
 11  permID                 243832 non-null  object 
 12  picURL                 243832 non-null  object 
 13  recommendations        243832 non-null  float64
 14  recommendedFlag        0 non-null   

## **📌 Step 4: Check for Missing Values**
Checking for missing values in each column.

In [5]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]

abstract    810
dtype: int64

In [11]:
# Check for missing values
missing_values_2 = df_2.isnull().sum()
missing_values_2[missing_values_2 > 0]

commentTitle              15334
parentUserDisplayName    173306
recommendedFlag          243832
reportAbuseFlag          243832
userDisplayName              77
userLocation                 62
userTitle                243791
userURL                  243827
dtype: int64

## **📌 Step 5: Summary Statistics**
Generate a summary of numeric and categorical columns.

In [6]:
# Display summary statistics
df.describe(include="all").transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
articleID,850.0,850.0,58691a5795d0e039260788b9,1.0,,,,,,,
abstract,40.0,40.0,"After losing three limbs in Afghanistan, a Mar...",1.0,,,,,,,
byline,850.0,434.0,By THE EDITORIAL BOARD,32.0,,,,,,,
documentType,850.0,2.0,article,810.0,,,,,,,
headline,850.0,774.0,Unknown,73.0,,,,,,,
keywords,850.0,717.0,[],73.0,,,,,,,
multimedia,850.0,,,,0.927059,0.260193,0.0,1.0,1.0,1.0,1.0
newDesk,850.0,28.0,OpEd,175.0,,,,,,,
printPage,850.0,,,,7.077647,10.100022,0.0,0.0,1.0,12.0,66.0
pubDate,850.0,786.0,2017-02-02 08:21:23,4.0,,,,,,,


In [12]:
# Display summary statistics
df_2.describe(include="all").transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
approveDate,243832.0,,,,1492504461.72283,1330667.064463,1491008047.0,1491783853.25,1492437034.5,1493119511.0,1524346252.0
commentBody,243832.0,243169.0,Well said.,21.0,,,,,,,
commentID,243832.0,,,,22188608.516839,185515.004384,21999548.0,22092910.75,22176681.5,22260467.5,26824246.0
commentSequence,243832.0,,,,22188608.516839,185515.004384,21999548.0,22092910.75,22176681.5,22260467.5,26824246.0
commentTitle,228498.0,1.0,<br/>,228498.0,,,,,,,
commentType,243832.0,3.0,comment,173277.0,,,,,,,
createDate,243832.0,,,,1492495336.431272,1328516.276157,1491006872.0,1491768493.75,1492428038.0,1493086582.25,1524345694.0
depth,243832.0,,,,1.289425,0.453641,1.0,1.0,1.0,2.0,3.0
editorsSelection,243832.0,2.0,False,238159.0,,,,,,,
parentID,243832.0,,,,6416284.403421,10055291.992253,0.0,0.0,0.0,22051084.75,26426201.0


## **📌 Step 6: Check for Unique Identifiers**
Find columns that can be used as unique identifiers.

In [7]:
# Check if any column can be used as a unique identifier
unique_counts = df.nunique()
unique_counts

articleID           850
abstract             40
byline              434
documentType          2
headline            774
keywords            717
multimedia            2
newDesk              28
printPage            43
pubDate             786
sectionName          30
snippet             846
source                2
typeOfMaterial       11
webURL              850
articleWordCount    689
dtype: int64

In [13]:
# Check if any column can be used as a unique identifier
unique_counts_2 = df_2.nunique()
unique_counts_2

approveDate              115718
commentBody              243169
commentID                243832
commentSequence          243832
commentTitle                  1
commentType                   3
createDate               228348
depth                         3
editorsSelection              2
parentID                  41494
parentUserDisplayName     15712
permID                   243832
picURL                     4282
recommendations            1232
recommendedFlag               0
replyCount                   79
reportAbuseFlag               0
sharing                       2
status                        1
timespeople                   2
trusted                       2
updateDate               136865
userDisplayName           46510
userID                    62946
userLocation              15890
userTitle                     9
userURL                       1
inReplyTo                 41494
articleID                   886
sectionName                  31
newDesk                      28
articleW

## **📌 Step 7: Automate Metadata Extraction for All Files**
Instead of manually inspecting each file, we automate metadata extraction for all files.

In [8]:
# Iterate over all files and extract metadata
metadata_summary = []

for file in files:
    file_path = os.path.join(dataset_path, file)
    df = pd.read_csv(file_path)

    metadata_summary.append({
        "File Name": file,
        "Rows": df.shape[0],
        "Columns": df.shape[1],
        "Missing Values": df.isnull().sum().sum(),
        "Duplicate Rows": df.duplicated().sum(),
        "Unique Columns": df.nunique().to_dict(),
    })

# Convert metadata summary to DataFrame for better readability
metadata_df = pd.DataFrame(metadata_summary)
metadata_df

  df = pd.read_csv(file_path)
  df = pd.read_csv(file_path)
  df = pd.read_csv(file_path)
  df = pd.read_csv(file_path)
  df = pd.read_csv(file_path)
  df = pd.read_csv(file_path)
  df = pd.read_csv(file_path)
  df = pd.read_csv(file_path)
  df = pd.read_csv(file_path)


Unnamed: 0,File Name,Rows,Columns,Missing Values,Duplicate Rows,Unique Columns
0,CommentsFeb2018.csv,215282,34,1018593,0,"{'approveDate': 158054, 'articleID': 1155, 'ar..."
1,ArticlesFeb2017.csv,885,16,855,0,"{'articleID': 885, 'abstract': 30, 'byline': 4..."
2,CommentsApril2018.csv,264924,34,1240919,0,"{'approveDate': 196777, 'articleID': 1351, 'ar..."
3,ArticlesJan2017.csv,850,16,810,0,"{'articleID': 850, 'abstract': 40, 'byline': 4..."
4,ArticlesMay2017.csv,996,16,963,0,"{'abstract': 33, 'articleID': 996, 'articleWor..."
5,CommentsJan2017.csv,231449,34,1114483,0,"{'approveDate': 106710, 'articleID': 850, 'art..."
6,CommentsMarch2017.csv,260967,34,1249140,0,"{'approveDate': 115903, 'articleID': 949, 'art..."
7,CommentsMay2017.csv,276389,34,1322148,0,"{'approveDate': 160236, 'commentBody': 275493,..."
8,CommentsMarch2018.csv,246915,34,1331416,0,"{'approveDate': 187256, 'articleID': 1385, 'ar..."
9,CommentsApril2017.csv,243832,34,1164061,0,"{'approveDate': 115718, 'commentBody': 243169,..."


## **🔍 Conclusion**
This notebook provides insights into the dataset structure, missing values, and metadata, making it ready for further text processing and LSTM-based text generation analysis.

# 📊 **LSTM-Based Text Generation on NYT Comments Dataset**
This notebook trains an LSTM model using the **New York Times Comments dataset** to generate human-like text. The notebook follows a structured process: merging datasets, preprocessing text, tokenization, training an LSTM model, and saving progress to prevent data loss in case of session shutdown.

## **📌 Step 1: Load & Merge All Comment Datasets**
We combine all comments into a single dataset for better model generalization.

In [3]:
import os
import pandas as pd

# Path to dataset directory
dataset_path = "/kaggle/input/nyt-comments"

# List all comment files
comment_files = [file for file in os.listdir(dataset_path) if file.startswith("Comments")]

# Initialize empty list to store DataFrames
df_list = []

# Load and merge all comment files
for file in comment_files:
    file_path = os.path.join(dataset_path, file)
    df = pd.read_csv(file_path, usecols=["commentBody"])
    df_list.append(df)

# Combine all comments into one DataFrame
df_combined = pd.concat(df_list, ignore_index=True)

# Save merged dataset to avoid reloading
df_combined.to_csv("nyt_comments_cleaned.csv", index=False)

# Display dataset shape
print("Total Comments:", df_combined.shape[0])
df_combined.head()

Total Comments: 2176364


Unnamed: 0,commentBody
0,The snake-filled heads comment made me think o...
1,She-devil reporting for duty!
2,XX is the new mark of the devil.
3,"""Courtland Sykes"" should be writing for The On..."
4,"I happen to descend for a few of them, because..."


## **📌 Step 2: Preprocessing the Text**
We clean the text by converting to lowercase, removing special characters, and tokenizing words into sequences.

### The implemntation below uses too much memory

In [None]:
import re
import json
import numpy as np
import tensorflow as tf
from tqdm import tqdm
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tqdm.pandas()  # Enables progress bars for Pandas operations

# Function to clean text with tqdm progress bar
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# Apply text cleaning with a progress bar
print("\n🔄 Cleaning text data...")
df_combined["commentBody"] = df_combined["commentBody"].astype(str).progress_apply(clean_text)

# ✅ Print sample cleaned text
print("\n📌 Sample cleaned text:\n", df_combined["commentBody"].head(5))

# Tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_combined["commentBody"])

# ✅ Print vocabulary size
print(f"\n📌 Vocabulary Size: {len(tokenizer.word_index)} unique words")

# Save tokenizer
with open("tokenizer.json", "w") as f:
    json.dump(tokenizer.to_json(), f)

# Convert text to sequences
print("\n🔄 Tokenizing text sequences...")
sequences = tokenizer.texts_to_sequences(df_combined["commentBody"])

# ✅ Print first few tokenized sequences
print("\n📌 Sample tokenized sequences:\n", sequences[:5])

# Create input sequences with tqdm progress bar
sequence_length = 50
input_sequences = []

print("\n🔄 Creating input sequences...")
for seq in tqdm(sequences, desc="Processing sequences"):
    for i in range(1, len(seq)):
        input_sequences.append(seq[:i+1])

# ✅ Print first few input sequences before padding
print("\n📌 Sample input sequences before padding:\n", input_sequences[:5])

# Pad sequences
print("\n🔄 Padding input sequences...")
input_sequences = pad_sequences(input_sequences, maxlen=sequence_length, padding="pre")

# ✅ Print shape of input sequences after padding
print(f"\n📌 Padded input shape: {input_sequences.shape}")

# Extract input (X) and output (y)
X, y = input_sequences[:, :-1], input_sequences[:, -1]

# ✅ Print shape of X and y
print(f"\n📌 X shape: {X.shape}, y shape: {y.shape}")

# Convert y to categorical
print("\n🔄 One-hot encoding target labels...")
y = tf.keras.utils.to_categorical(y, num_classes=len(tokenizer.word_index) + 1)

# ✅ Print a sample one-hot encoded output
print("\n📌 Sample y (one-hot encoded output):\n", y[:3])

# Save tokenized sequences
print("\n💾 Saving tokenized sequences...")
np.save("input_sequences.npy", input_sequences)

# ✅ Confirm data saving
print("\n✅ Tokenized sequences saved successfully!")



🔄 Cleaning text data...


100%|██████████| 2176364/2176364 [01:06<00:00, 32676.46it/s]



📌 Sample cleaned text:
 0    the snakefilled heads comment made me think of...
1                          shedevil reporting for duty
2                      xx is the new mark of the devil
3    courtland sykes should be writing for the onio...
4    i happen to descend for a few of them because ...
Name: commentBody, dtype: object

📌 Vocabulary Size: 1525475 unique words

🔄 Tokenizing text sequences...

📌 Sample tokenized sequences:
 [[1, 151481, 1639, 624, 167, 84, 83, 4, 51071, 633, 212, 22, 3913, 41, 13206], [66071, 1181, 9, 1837], [22001, 6, 1, 116, 1946, 4, 1, 3835], [64185, 12872, 64, 15, 884, 9, 1, 11959, 13, 626, 9, 236], [10, 459, 2, 11706, 9, 5, 233, 4, 59, 69, 23, 2758, 168, 53, 2, 15706, 7, 11843, 49511, 9723, 3, 423138, 161, 425, 43, 301, 29, 622, 2, 15, 15318, 1010, 49511, 55, 1748, 98, 1, 253, 27, 23, 362, 101, 50, 4, 51]]

🔄 Creating input sequences...


Processing sequences:  30%|██▉       | 650081/2176364 [02:23<03:27, 7360.24it/s] 

### Will now implement an more memory friendly solution

In [5]:
import re
import json
import numpy as np
import tensorflow as tf
from tqdm import tqdm
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tqdm.pandas()

# Function to clean text
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)  
    text = re.sub(r'[^\w\s]', '', text)  
    text = re.sub(r'\s+', ' ', text).strip()  
    return text

# Apply text cleaning with progress bar
print("\n🔄 Cleaning text data...")
df_combined["commentBody"] = df_combined["commentBody"].astype(str).progress_apply(clean_text)


🔄 Cleaning text data...


100%|██████████| 2176364/2176364 [01:01<00:00, 35259.67it/s]


In [6]:
from tqdm import tqdm

# Tokenize text with vocab limit
tokenizer = Tokenizer(num_words=20000)  # Limit vocab to 20,000 words

print("\n🔄 Fitting tokenizer on text data...")
tokenizer.fit_on_texts(tqdm(df_combined["commentBody"], desc="Processing text"))

# ✅ Print unique words before filtering
print(f"\n📌 Total Unique Words (before limiting): {len(tokenizer.word_index)}")

# ✅ Extract word frequencies
word_counts = sorted(tokenizer.word_counts.items(), key=lambda x: x[1], reverse=True)

# ✅ Keep only the most frequent 20,000 words
tokenizer.word_index = {word: i for word, i in word_counts[:20000]}

# ✅ Verify Vocabulary Size
print(f"\n📌 Vocabulary Size (After Limiting): {len(tokenizer.word_index)} words")

# ✅ Print top 10 most frequent words
print("\n📌 Top 10 Most Frequent Words in the Dataset:")
for i, (word, count) in enumerate(word_counts[:10]):
    print(f"   {i+1}. {word}: {count} occurrences")

# Save tokenizer
with open("tokenizer.json", "w") as f:
    json.dump(tokenizer.to_json(), f)

# Convert text to sequences with tqdm progress bar
print("\n🔄 Tokenizing text sequences...")
sequences = list(tqdm(tokenizer.texts_to_sequences(df_combined["commentBody"]), desc="Converting to sequences"))

# ✅ Confirm the process is complete
print("\n✅ Tokenization complete!")



🔄 Fitting tokenizer on text data...


Processing text: 100%|██████████| 2176364/2176364 [01:45<00:00, 20662.24it/s]



📌 Total Unique Words (before limiting): 1525475

📌 Vocabulary Size (After Limiting): 20000 words

📌 Top 10 Most Frequent Words in the Dataset:
   1. the: 8013972 occurrences
   2. to: 4570214 occurrences
   3. and: 4198158 occurrences
   4. of: 3740329 occurrences
   5. a: 3271858 occurrences
   6. is: 2604472 occurrences
   7. in: 2367733 occurrences
   8. that: 2219162 occurrences
   9. for: 1606318 occurrences
   10. i: 1387537 occurrences

🔄 Tokenizing text sequences...


Converting to sequences: 100%|██████████| 2176364/2176364 [00:00<00:00, 4614921.02it/s]



✅ Tokenization complete!


In [7]:
# Create input sequences using NumPy arrays (Optimized)
sequence_length = 30  # Reduced from 50 to 30
max_sequences = sum(len(seq) for seq in sequences)
input_sequences = np.zeros((max_sequences, sequence_length), dtype=np.int32)


print("\n🔄 Creating input sequences...")
index = 0
for seq in tqdm(sequences, desc="Processing sequences"):
    for i in range(1, len(seq)):
        sub_seq = seq[:i+1]

        # ✅ Fix: Trim sequences that exceed `sequence_length`
        if len(sub_seq) > sequence_length:
            sub_seq = sub_seq[-sequence_length:]  # Keep last 30 tokens
        
        input_sequences[index, -len(sub_seq):] = sub_seq  # Insert at the end
        index += 1

# Pad sequences with reduced max length
print("\n🔄 Padding input sequences...")
input_sequences = pad_sequences(input_sequences, maxlen=sequence_length, padding="pre")

print(f"\n📌 Padded input shape: {input_sequences.shape}")

# Extract input (X) and output (y)
X, y = input_sequences[:, :-1], input_sequences[:, -1]

print(f"\n📌 X shape: {X.shape}, y shape: {y.shape}")

# Convert y to integer labels (Sparse Encoding)
print("\n🔄 Converting y to sparse labels...")
y = np.array(y, dtype=np.int32)  # Uses sparse categorical encoding

# Save tokenized sequences
print("\n💾 Saving tokenized sequences...")
np.save("input_sequences.npy", input_sequences)

print("\n✅ Tokenized sequences saved successfully!")


🔄 Creating input sequences...


Processing sequences: 100%|██████████| 2176364/2176364 [00:54<00:00, 40274.38it/s]



🔄 Padding input sequences...

📌 Padded input shape: (37039301, 30)

📌 X shape: (37039301, 29), y shape: (37039301,)

🔄 Converting y to sparse labels...

💾 Saving tokenized sequences...

✅ Tokenized sequences saved successfully!


## **📌 Step 3: Building the LSTM Model**
We define an LSTM-based architecture with embedding and dense layers.

In [2]:
import numpy as np

sequence_length = 30  # Reduced from 50 to 30

# ✅ Load preprocessed input sequences
print("\n📂 Loading preprocessed sequences from 'input_sequences.npy'...")
input_sequences = np.load("input_sequences.npy", allow_pickle=True)

# ✅ Extract X and y from input sequences
X, y = input_sequences[:, :-1], input_sequences[:, -1]

print(f"✅ Loaded input sequences. X shape: {X.shape}, y shape: {y.shape}")



📂 Loading preprocessed sequences from 'input_sequences.npy'...
✅ Loaded input sequences. X shape: (37039301, 29), y shape: (37039301,)


In [3]:
import json
from tensorflow.keras.preprocessing.text import tokenizer_from_json

# ✅ Load tokenizer
print("\n📂 Loading tokenizer from 'tokenizer.json'...")
with open("tokenizer.json", "r") as f:
    tokenizer = tokenizer_from_json(json.load(f))

# ✅ Get vocab size
vocab_size = len(tokenizer.word_index) + 1
print(f"✅ Loaded tokenizer. Vocabulary size: {vocab_size}")



📂 Loading tokenizer from 'tokenizer.json'...
✅ Loaded tokenizer. Vocabulary size: 20001


In [4]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input

# Ensure tokenizer has been trained
if not tokenizer.word_index:
    raise ValueError("Tokenizer word_index is empty. Ensure tokenizer.fit_on_texts() was called.")

# Define vocabulary size
vocab_size = len(tokenizer.word_index) + 1  # Ensure vocabulary size is correct

# ✅ Use `Input()` for defining the input layer
model = Sequential([
    Input(shape=(sequence_length-1,)),  # Explicit input layer
    Embedding(input_dim=vocab_size, output_dim=128),  # Removed input_shape
    LSTM(128, return_sequences=True),
    LSTM(128),
    Dense(128, activation="relu"),
    Dense(vocab_size, activation="softmax")
])

# Compile model
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# ✅ Print model summary (No need to call `build()` manually)
model.summary()


## **📌 Step 4: Training the LSTM Model**
We train the LSTM model with categorical cross-entropy loss.

In [3]:
import pandas as pd

print("\n📂 Loading cleaned dataset 'nyt_comments_cleaned.csv'...")
df_combined = pd.read_csv("nyt_comments_cleaned.csv")

print(f"✅ Loaded dataset with {df_combined.shape[0]} rows.")
print(df_combined.head())  # Preview first few rows



📂 Loading cleaned dataset 'nyt_comments_cleaned.csv'...
✅ Loaded dataset with 2176364 rows.
                                         commentBody
0  The snake-filled heads comment made me think o...
1                      She-devil reporting for duty!
2                   XX is the new mark of the devil.
3  "Courtland Sykes" should be writing for The On...
4  I happen to descend for a few of them, because...


In [7]:
import tensorflow as tf

# Check if GPU is available
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available: 1


In [8]:
# Ensure TensorFlow uses GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)  # Prevents memory overflow issues
        print("\n✅ GPU is enabled and TensorFlow is using it!")
    except RuntimeError as e:
        print(e)
else:
    print("\n❌ No GPU detected, training may be slow!")


Physical devices cannot be modified after being initialized


In [10]:
import time
import pickle
import numpy as np

# Ensure `y` is sparse categorical (integer labels)
y = np.array(y, dtype=np.int32)

# Compile model with sparse categorical cross-entropy
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Start Timer
start_time = time.time()
print("\n🚀 Starting Model Training...\n")

# Train model with verbose logging
history = model.fit(
    X, y,
    epochs=10,
    batch_size=512,
    validation_split=0.2,
    verbose=1
)

# Compute total training time
end_time = time.time()
total_time = end_time - start_time
print(f"\n✅ Training Completed in {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

# Save Model
model.save("nyt_lstm_model.h5")
print("\n💾 Model saved as 'nyt_lstm_model.h5'")

# Save Training History
with open("training_history.pkl", "wb") as f:
    pickle.dump(history.history, f)
print("\n📊 Training history saved as 'training_history.pkl'")

# Print Final Training Stats
print("\n📌 Final Training Metrics:")
print(f"   🔹 Final Training Loss: {history.history['loss'][-1]:.4f}")
print(f"   🔹 Final Validation Loss: {history.history['val_loss'][-1]:.4f}")
print(f"   🔹 Final Training Accuracy: {history.history['accuracy'][-1]:.4f}")
print(f"   🔹 Final Validation Accuracy: {history.history['val_accuracy'][-1]:.4f}")

print("\n🎯 Training Complete! You can now evaluate the model and generate text.")



🚀 Starting Model Training...

Epoch 1/10
[1m 1521/57874[0m [37m━━━━━━━━━━━━━━━━━━━━[0m [1m38:15[0m 41ms/step - accuracy: 9.2576e-04 - loss: 8.4111

KeyboardInterrupt: 

## **📌 Step 5: Generate New Comments Using the LSTM**
We use the trained model to predict and generate text from a given seed phrase.

In [None]:
import numpy as np

def generate_text(seed_text, next_words=50, temperature=1.0):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=sequence_length-1, padding="pre")

        # Predict next word
        predicted_probs = model.predict(token_list, verbose=0)
        predicted_index = np.argmax(predicted_probs, axis=-1)[0]

        # Convert index to word
        output_word = tokenizer.index_word.get(predicted_index, "")
        seed_text += " " + output_word
    return seed_text

# Example
print(generate_text("the government should", next_words=20))

## **📌 Final Summary**
1. **Merged all comment datasets** into a single dataset.
2. **Preprocessed and tokenized the text** for input sequences.
3. **Trained an LSTM model** with embeddings and dense layers.
4. **Saved progress at every stage** to prevent data loss.
5. **Generated new comments** based on seed text input.