# 📊 **New York Times Comments Dataset Analysis**
This notebook analyzes the New York Times Comments dataset available on Kaggle.
We will extract metadata, check for missing values, and summarize the structure of the dataset before proceeding with text analysis.

## **📌 Step 1: Set Up the Environment**
We start by importing the necessary libraries and listing all available files in the dataset.

In [1]:
import os
import pandas as pd

# Set path to dataset (Kaggle users should adjust as needed)
dataset_path = "/kaggle/input/nyt-comments"

# List all files in the dataset
files = os.listdir(dataset_path)
print("Files in dataset:\n", files)

Files in dataset:
 ['CommentsFeb2018.csv', 'ArticlesFeb2017.csv', 'CommentsApril2018.csv', 'ArticlesJan2017.csv', 'ArticlesMay2017.csv', 'CommentsJan2017.csv', 'CommentsMarch2017.csv', 'CommentsMay2017.csv', 'CommentsMarch2018.csv', 'CommentsApril2017.csv', 'ArticlesMarch2017.csv', 'ArticlesApril2017.csv', 'CommentsFeb2017.csv', 'ArticlesJan2018.csv', 'ArticlesFeb2018.csv', 'ArticlesMarch2018.csv', 'CommentsJan2018.csv', 'ArticlesApril2018.csv']


## **📌 Step 2: Load & Inspect Data**
Let's load two example files (`ArticlesJan2017.csv` and `CommentsApril2017.csv`) to inspect their structure.

In [2]:
# Load an example file to inspect its structure
sample_file = "ArticlesJan2017.csv"  # You can change this to any file in the dataset
df = pd.read_csv(os.path.join(dataset_path, sample_file))

# Display first few rows
df.head()

Unnamed: 0,articleID,abstract,byline,documentType,headline,keywords,multimedia,newDesk,printPage,pubDate,sectionName,snippet,source,typeOfMaterial,webURL,articleWordCount
0,58691a5795d0e039260788b9,,By JENNIFER STEINHAUER,article,G.O.P. Leadership Poised to Topple Obama’s Pi...,"['United States Politics and Government', 'Law...",1,National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led...,The New York Times,News,https://www.nytimes.com/2017/01/01/us/politics...,1324
1,586967bf95d0e03926078915,,By MARK LANDLER,article,Fractured World Tested the Hope of a Young Pre...,"['Obama, Barack', 'Afghanistan', 'United State...",1,Foreign,1,2017-01-01 20:34:00,Asia Pacific,A strategy that went from a “good war” to the ...,The New York Times,News,https://www.nytimes.com/2017/01/01/world/asia/...,2836
2,58698a1095d0e0392607894a,,By CAITLIN LOVINGER,article,Little Troublemakers,"['Crossword Puzzles', 'Boxing Day', 'Holidays ...",1,Games,0,2017-01-01 23:00:24,Unknown,Chuck Deodene puts us in a bubbly mood.,The New York Times,News,https://www.nytimes.com/2017/01/01/crosswords/...,445
3,5869911a95d0e0392607894e,,By JOCHEN BITTNER,article,"Angela Merkel, Russia’s Next Target","['Cyberwarfare and Defense', 'Presidential Ele...",1,OpEd,15,2017-01-01 23:30:27,Unknown,"With a friend entering the White House, Vladim...",The New York Times,Op-Ed,https://www.nytimes.com/2017/01/01/opinion/ang...,864
4,5869a61795d0e03926078962,,By JIAYIN SHEN,article,Boots for a Stranger on a Bus,"['Shoes and Boots', 'Buses', 'New York City']",0,Metro,12,2017-01-02 01:00:02,Unknown,Witnessing an act of generosity on a rainy day.,The New York Times,Brief,https://www.nytimes.com/2017/01/01/nyregion/me...,309


In [9]:
# Load an example file to inspect its structure
sample_file_2 = "CommentsApril2017.csv"  # You can change this to any file in the dataset
df_2 = pd.read_csv(os.path.join(dataset_path, sample_file_2))

# Display first few rows
df_2.head()


Unnamed: 0,approveDate,commentBody,commentID,commentSequence,commentTitle,commentType,createDate,depth,editorsSelection,parentID,...,userLocation,userTitle,userURL,inReplyTo,articleID,sectionName,newDesk,articleWordCount,printPage,typeOfMaterial
0,1491245186,This project makes me happy to be a 30+ year T...,22022598.0,22022598,<br/>,comment,1491237000.0,1,False,0.0,...,"Riverside, CA",,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
1,1491188619,Stunning photos and reportage. Infuriating tha...,22017350.0,22017350,,comment,1491180000.0,1,False,0.0,...,<br/>,,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
2,1491188617,Brilliant work from conception to execution. I...,22017334.0,22017334,<br/>,comment,1491179000.0,1,False,0.0,...,Raleigh NC,,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
3,1491167820,NYT reporters should provide a contributor's l...,22015913.0,22015913,<br/>,comment,1491150000.0,1,False,0.0,...,"Missouri, USA",,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
4,1491167815,Could only have been done in print. Stunning.,22015466.0,22015466,<br/>,comment,1491147000.0,1,False,0.0,...,"Tucson, Arizona",,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News


## **📌 Step 3: Extract Metadata**
Now, we extract key metadata, such as column names, data types, and missing values.

In [4]:
# Display dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   articleID         850 non-null    object
 1   abstract          40 non-null     object
 2   byline            850 non-null    object
 3   documentType      850 non-null    object
 4   headline          850 non-null    object
 5   keywords          850 non-null    object
 6   multimedia        850 non-null    int64 
 7   newDesk           850 non-null    object
 8   printPage         850 non-null    int64 
 9   pubDate           850 non-null    object
 10  sectionName       850 non-null    object
 11  snippet           850 non-null    object
 12  source            850 non-null    object
 13  typeOfMaterial    850 non-null    object
 14  webURL            850 non-null    object
 15  articleWordCount  850 non-null    int64 
dtypes: int64(3), object(13)
memory usage: 106.4+ KB


In [10]:
# Display dataset information
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243832 entries, 0 to 243831
Data columns (total 34 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   approveDate            243832 non-null  int64  
 1   commentBody            243832 non-null  object 
 2   commentID              243832 non-null  float64
 3   commentSequence        243832 non-null  int64  
 4   commentTitle           228498 non-null  object 
 5   commentType            243832 non-null  object 
 6   createDate             243832 non-null  float64
 7   depth                  243832 non-null  int64  
 8   editorsSelection       243832 non-null  bool   
 9   parentID               243832 non-null  float64
 10  parentUserDisplayName  70526 non-null   object 
 11  permID                 243832 non-null  object 
 12  picURL                 243832 non-null  object 
 13  recommendations        243832 non-null  float64
 14  recommendedFlag        0 non-null   

## **📌 Step 4: Check for Missing Values**
We count the missing values in each column and show only the columns that have any.

In [5]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]

abstract    810
dtype: int64

In [11]:
# Check for missing values
missing_values_2 = df_2.isnull().sum()
missing_values_2[missing_values_2 > 0]

commentTitle              15334
parentUserDisplayName    173306
recommendedFlag          243832
reportAbuseFlag          243832
userDisplayName              77
userLocation                 62
userTitle                243791
userURL                  243827
dtype: int64
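
The raw counts above are easier to compare as a share of rows: `recommendedFlag` and `reportAbuseFlag` are missing in all 243,832 rows, so they carry no information. A minimal sketch of that check on a toy frame follows; the frame and its values are made up for illustration, and only the column names come from the dataset.

```python
import pandas as pd

# Toy frame standing in for a comments file; the values are invented.
toy = pd.DataFrame({
    "commentBody": ["Well said.", "Agreed.", "Thanks."],
    "userTitle": [None, None, "Editor"],
    "recommendedFlag": [None, None, None],
})

# The mean of a boolean mask is the fraction of missing values per column.
missing_share = toy.isnull().mean()

# Columns that are empty in every row are candidates for dropping.
all_null = missing_share[missing_share == 1.0].index.tolist()

print(missing_share)
print("Columns with no data at all:", all_null)
```

Applied to `df_2`, the same two lines would flag the fully empty columns before any further processing.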

## **📌 Step 5: Summary Statistics**
Generate a summary of numeric and categorical columns.

In [6]:
# Display summary statistics
df.describe(include="all").transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
articleID,850.0,850.0,58691a5795d0e039260788b9,1.0,,,,,,,
abstract,40.0,40.0,"After losing three limbs in Afghanistan, a Mar...",1.0,,,,,,,
byline,850.0,434.0,By THE EDITORIAL BOARD,32.0,,,,,,,
documentType,850.0,2.0,article,810.0,,,,,,,
headline,850.0,774.0,Unknown,73.0,,,,,,,
keywords,850.0,717.0,[],73.0,,,,,,,
multimedia,850.0,,,,0.927059,0.260193,0.0,1.0,1.0,1.0,1.0
newDesk,850.0,28.0,OpEd,175.0,,,,,,,
printPage,850.0,,,,7.077647,10.100022,0.0,0.0,1.0,12.0,66.0
pubDate,850.0,786.0,2017-02-02 08:21:23,4.0,,,,,,,


In [12]:
# Display summary statistics
df_2.describe(include="all").transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
approveDate,243832.0,,,,1492504461.72283,1330667.064463,1491008047.0,1491783853.25,1492437034.5,1493119511.0,1524346252.0
commentBody,243832.0,243169.0,Well said.,21.0,,,,,,,
commentID,243832.0,,,,22188608.516839,185515.004384,21999548.0,22092910.75,22176681.5,22260467.5,26824246.0
commentSequence,243832.0,,,,22188608.516839,185515.004384,21999548.0,22092910.75,22176681.5,22260467.5,26824246.0
commentTitle,228498.0,1.0,<br/>,228498.0,,,,,,,
commentType,243832.0,3.0,comment,173277.0,,,,,,,
createDate,243832.0,,,,1492495336.431272,1328516.276157,1491006872.0,1491768493.75,1492428038.0,1493086582.25,1524345694.0
depth,243832.0,,,,1.289425,0.453641,1.0,1.0,1.0,2.0,3.0
editorsSelection,243832.0,2.0,False,238159.0,,,,,,,
parentID,243832.0,,,,6416284.403421,10055291.992253,0.0,0.0,0.0,22051084.75,26426201.0
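
The `approveDate` and `createDate` values sit around 1.49 billion, which is consistent with Unix timestamps in seconds for April 2017. Assuming that interpretation holds, a minimal sketch of converting them to readable datetimes could look like this. The two sample values are the `approveDate` entries from the first rows shown in Step 2.

```python
import pandas as pd

# Two approveDate values from the first rows of CommentsApril2017.csv.
# Assumption: they are Unix timestamps in seconds (UTC).
approve_raw = pd.Series([1491245186, 1491188619], name="approveDate")

# unit="s" tells pandas to read the integers as seconds since 1970-01-01.
approve_dt = pd.to_datetime(approve_raw, unit="s")
print(approve_dt)
```

If the conversion yields dates outside the file's month, the seconds assumption is probably wrong for that column.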


## **📌 Step 6: Check for Unique Identifiers**
Find columns that can be used as unique identifiers.

In [7]:
# Check if any column can be used as a unique identifier
unique_counts = df.nunique()
unique_counts

articleID           850
abstract             40
byline              434
documentType          2
headline            774
keywords            717
multimedia            2
newDesk              28
printPage            43
pubDate             786
sectionName          30
snippet             846
source                2
typeOfMaterial       11
webURL              850
articleWordCount    689
dtype: int64

In [13]:
# Check if any column can be used as a unique identifier
unique_counts_2 = df_2.nunique()
unique_counts_2

approveDate              115718
commentBody              243169
commentID                243832
commentSequence          243832
commentTitle                  1
commentType                   3
createDate               228348
depth                         3
editorsSelection              2
parentID                  41494
parentUserDisplayName     15712
permID                   243832
picURL                     4282
recommendations            1232
recommendedFlag               0
replyCount                   79
reportAbuseFlag               0
sharing                       2
status                        1
timespeople                   2
trusted                       2
updateDate               136865
userDisplayName           46510
userID                    62946
userLocation              15890
userTitle                     9
userURL                       1
inReplyTo                 41494
articleID                   886
sectionName                  31
newDesk                      28
articleW
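
A column qualifies as a unique identifier when its number of distinct values equals the row count and it has no missing values. In the output above, `commentID`, `commentSequence`, and `permID` each have 243,832 distinct values, matching the row count. The sketch below automates that comparison on a toy frame; the data is invented for illustration.

```python
import pandas as pd

# Toy frame: commentID is unique, the other columns repeat values.
toy = pd.DataFrame({
    "commentID": [101, 102, 103, 104],
    "articleID": ["a1", "a1", "a2", "a2"],
    "depth": [1, 2, 1, 1],
})

# A candidate key has no duplicates and no missing values.
candidate_keys = [
    col for col in toy.columns
    if toy[col].is_unique and toy[col].notnull().all()
]
print(candidate_keys)
```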

## **📌 Step 7: Automate Metadata Extraction for All Files**
Instead of manually inspecting each file, we automate metadata extraction for all files.

In [8]:
# Iterate over all files and extract metadata
metadata_summary = []

for file in files:
    file_path = os.path.join(dataset_path, file)
    df = pd.read_csv(file_path)

    metadata_summary.append({
        "File Name": file,
        "Rows": df.shape[0],
        "Columns": df.shape[1],
        "Missing Values": df.isnull().sum().sum(),
        "Duplicate Rows": df.duplicated().sum(),
        "Unique Columns": df.nunique().to_dict(),
    })

# Convert metadata summary to DataFrame for better readability
metadata_df = pd.DataFrame(metadata_summary)
metadata_df


Unnamed: 0,File Name,Rows,Columns,Missing Values,Duplicate Rows,Unique Columns
0,CommentsFeb2018.csv,215282,34,1018593,0,"{'approveDate': 158054, 'articleID': 1155, 'ar..."
1,ArticlesFeb2017.csv,885,16,855,0,"{'articleID': 885, 'abstract': 30, 'byline': 4..."
2,CommentsApril2018.csv,264924,34,1240919,0,"{'approveDate': 196777, 'articleID': 1351, 'ar..."
3,ArticlesJan2017.csv,850,16,810,0,"{'articleID': 850, 'abstract': 40, 'byline': 4..."
4,ArticlesMay2017.csv,996,16,963,0,"{'abstract': 33, 'articleID': 996, 'articleWor..."
5,CommentsJan2017.csv,231449,34,1114483,0,"{'approveDate': 106710, 'articleID': 850, 'art..."
6,CommentsMarch2017.csv,260967,34,1249140,0,"{'approveDate': 115903, 'articleID': 949, 'art..."
7,CommentsMay2017.csv,276389,34,1322148,0,"{'approveDate': 160236, 'commentBody': 275493,..."
8,CommentsMarch2018.csv,246915,34,1331416,0,"{'approveDate': 187256, 'articleID': 1385, 'ar..."
9,CommentsApril2017.csv,243832,34,1164061,0,"{'approveDate': 115718, 'commentBody': 243169,..."
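
The loop in this step reassigns `df` on every iteration, so the frame loaded in Step 2 is overwritten. One possible way to avoid that is to move the per-file logic into a small function. The helper name `summarize_frame` and the toy data below are hypothetical and only illustrate the idea.

```python
import pandas as pd

def summarize_frame(frame, name):
    """Return one metadata row for a DataFrame (hypothetical helper)."""
    return {
        "File Name": name,
        "Rows": int(frame.shape[0]),
        "Columns": int(frame.shape[1]),
        "Missing Values": int(frame.isnull().sum().sum()),
        "Duplicate Rows": int(frame.duplicated().sum()),
    }

# Toy frame: rows 0 and 1 are identical, and column "b" has one missing value.
toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", None]})
row = summarize_frame(toy, "toy.csv")
print(row)
```

Casting to `int` keeps the values as plain Python integers, which is convenient if the summary is later written to JSON.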


## **🔍 Conclusion**
This notebook provides insights into the dataset structure, missing values, and metadata, making it ready for further text processing and LSTM-based text generation analysis.

# 📊 **LSTM-Based Text Generation on NYT Comments Dataset**
This notebook trains an LSTM model using the **New York Times Comments dataset** to generate human-like text. The notebook follows a structured process: merging datasets, preprocessing text, tokenization, training an LSTM model, and saving progress to prevent data loss in case of session shutdown.

## **📌 Step 1: Load & Merge All Comment Datasets**
We combine all comments into a single dataset for better model generalization.

In [2]:
import os
import pandas as pd

# Path to dataset directory
dataset_path = "/kaggle/input/nyt-comments"

# List all comment files
comment_files = [file for file in os.listdir(dataset_path) if file.startswith("Comments")]

# Initialize empty list to store DataFrames
df_list = []

# Load and merge all comment files
for file in comment_files:
    file_path = os.path.join(dataset_path, file)
    df = pd.read_csv(file_path, usecols=["commentBody"])
    df_list.append(df)

# Combine all comments into one DataFrame
df_combined = pd.concat(df_list, ignore_index=True)

# Display dataset shape before cleaning
print("Total Comments (Before Cleaning):", df_combined.shape[0])
df_combined.head()

Total Comments (Before Cleaning): 2176364


Unnamed: 0,commentBody
0,The snake-filled heads comment made me think o...
1,She-devil reporting for duty!
2,XX is the new mark of the devil.
3,"""Courtland Sykes"" should be writing for The On..."
4,"I happen to descend for a few of them, because..."
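
The merge relies on two pandas features: `usecols` reads only the `commentBody` column from each file, and `ignore_index=True` renumbers the combined rows from zero. The sketch below shows both on two in-memory CSVs; the file contents are invented, and `io.StringIO` stands in for the real files.

```python
import io
import pandas as pd

# In-memory stand-ins for two monthly comment files (invented content).
csv_a = io.StringIO("commentID,commentBody\n1,First comment\n2,Second comment\n")
csv_b = io.StringIO("commentID,commentBody\n3,Third comment\n")

# usecols skips every column except commentBody while parsing.
frames = [pd.read_csv(buf, usecols=["commentBody"]) for buf in (csv_a, csv_b)]

# ignore_index=True gives the result a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Without `ignore_index=True`, each file's rows would keep their own index starting at 0, producing duplicate index labels in the combined frame.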


## **📌 Step 2: Preprocessing the Text**
We clean the text by lowercasing it, removing digits and punctuation, and collapsing extra whitespace. The cleaned comments are then tokenized into integer sequences.

In [4]:
import re
import json
import numpy as np
import tensorflow as tf
from tqdm import tqdm
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tqdm.pandas()

# Function to clean text
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# Apply text cleaning with progress bar
print("\n🔄 Cleaning text data...")
df_combined["commentBody"] = df_combined["commentBody"].astype(str).progress_apply(clean_text)

# ✅ Save AFTER cleaning
df_combined.to_csv("nyt_comments_cleaned.csv", index=False)
print("\n✅ Cleaned dataset saved as 'nyt_comments_cleaned.csv'")

# Display sample cleaned text
print("\n📌 Sample cleaned text:\n", df_combined["commentBody"].head(5))


🔄 Cleaning text data...

100%|██████████| 2176364/2176364 [01:48<00:00, 20001.10it/s]

✅ Cleaned dataset saved as 'nyt_comments_cleaned.csv'

📌 Sample cleaned text:
 0    the snakefilled heads comment made me think of...
1                          shedevil reporting for duty
2                      xx is the new mark of the devil
3    courtland sykes should be writing for the onio...
4    i happen to descend for a few of them because ...
Name: commentBody, dtype: object
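
The sample output shows a side effect worth noting: because punctuation is deleted rather than replaced with a space, hyphenated words and contractions are fused ("She-devil" becomes "shedevil"). The sketch below reruns the same `clean_text` function on a few strings to make this visible. The second and third sample strings are invented for illustration.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)            # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)        # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()   # Collapse extra spaces
    return text

samples = [
    "She-devil reporting for duty!",   # from the dataset preview
    "Don't  panic: it's 2018.",        # invented: contractions and digits
    "user_name",                       # invented: underscore counts as \w
]

for s in samples:
    print(repr(s), "->", repr(clean_text(s)))
```

If fused words are undesirable, replacing punctuation with a space instead of an empty string is one alternative. That would change the vocabulary the model sees, so it is a design choice rather than a fix.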


In [5]:
import pandas as pd

print("\n📂 Loading cleaned dataset 'nyt_comments_cleaned.csv'...")
df_combined = pd.read_csv("nyt_comments_cleaned.csv")

print(f"✅ Loaded dataset with {df_combined.shape[0]} rows.")
print(df_combined.head())  # Preview first few rows



📂 Loading cleaned dataset 'nyt_comments_cleaned.csv'...
✅ Loaded dataset with 2176364 rows.
                                         commentBody
0  the snakefilled heads comment made me think of...
1                        shedevil reporting for duty
2                    xx is the new mark of the devil
3  courtland sykes should be writing for the onio...
4  i happen to descend for a few of them because ...
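
The next cell turns each comment into a fixed-length row of integers. As a rough mental model, the pure-Python sketch below builds a frequency-ranked vocabulary, maps unknown words to a reserved index, and pads or truncates to a fixed length. It is an illustration only, not the Keras `TextVectorization` implementation. The reserved indices (0 for padding, 1 for out-of-vocabulary) follow what is understood to be that layer's default integer mode, and the helper names are hypothetical.

```python
import math
from collections import Counter

def build_vocab(texts, max_tokens):
    """Map the most frequent words to indices 2..max_tokens-1 (hypothetical helper)."""
    counts = Counter(word for t in texts for word in t.split())
    # Two slots are reserved: 0 = padding, 1 = out-of-vocabulary.
    kept = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i + 2 for i, w in enumerate(kept)}

def encode(text, vocab, sequence_length):
    """Convert text to exactly sequence_length integers (hypothetical helper)."""
    ids = [vocab.get(w, 1) for w in text.split()][:sequence_length]  # truncate
    return ids + [0] * (sequence_length - len(ids))                  # pad

texts = ["the devil is in the details", "the devil"]
vocab = build_vocab(texts, max_tokens=4)
encoded = [encode(t, vocab, sequence_length=4) for t in texts]
print(vocab)
print(encoded)

# Number of batches needed for the full dataset at batch_size = 10000.
n_batches = math.ceil(2176364 / 10000)
print(n_batches)
```

The batch count of 218 matches the total shown in the progress bar of the tokenization cell.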


In [6]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization
import numpy as np
import json
from tqdm import tqdm

# ✅ Set vocabulary size & sequence length
max_tokens = 50000  # Adjusted to avoid memory overload
sequence_length = 30  # Limit sequence length
batch_size = 10000  # Process comments in batches

print("\n🔄 Initializing TextVectorization layer...")
vectorizer = TextVectorization(max_tokens=max_tokens, output_sequence_length=sequence_length)

# ✅ Convert comments to NumPy array & ensure all are strings
print("\n🔄 Preparing comments array...")
# Fill NaNs first: calling astype(str) before fillna would turn them into the literal string "nan"
comments_array = df_combined["commentBody"].fillna("").astype(str).to_numpy()

# ✅ Fit vectorizer to text data
print("\n🔄 Fitting vectorizer on text data...")
vectorizer.adapt(comments_array)

# ✅ Convert text to sequences in batches
print("\n🔄 Converting text to tokenized sequences...")
sequences = []

for i in tqdm(range(0, len(comments_array), batch_size), desc="Processing text", ncols=100, dynamic_ncols=True, leave=True):
    batch = comments_array[i : i + batch_size]  # Process in batches
    batch_sequences = vectorizer(batch).numpy()  # Vectorize batch at once
    sequences.extend(batch_sequences)

    # ✅ Log progress with tqdm.write so the message does not overwrite the bar (prints one line per batch)
    tqdm.write(f"✅ Processed {min(i+batch_size, len(comments_array))}/{len(comments_array)} comments...")

# ✅ Convert to NumPy array
sequences = np.array(sequences, dtype=np.int32)

# ✅ Verify Vocabulary Size
tqdm.write(f"\n📌 Vocabulary Size (After Limiting): {vectorizer.vocabulary_size()} words")

# ✅ Save vectorizer settings (note: this stores only the two parameters, not the learned vocabulary)
tqdm.write("\n💾 Saving vectorizer configuration...")
vectorizer_config = {"max_tokens": max_tokens, "sequence_length": sequence_length}
with open("vectorizer_config.json", "w") as f:
    json.dump(vectorizer_config, f)

# ✅ Save tokenized text before sequence creation
print("\n💾 Saving raw tokenized sequences...")
np.save("tokenized_comments.npy", sequences)

# ✅ Confirm the process is complete
tqdm.write("\n✅ Tokenization complete!")



🔄 Initializing TextVectorization layer...

🔄 Preparing comments array...

🔄 Fitting vectorizer on text data...

🔄 Converting text to tokenized sequences...


Processing text:   0%|          | 1/218 [00:00<00:40,  5.42it/s]

✅ Processed 10000/2176364 comments...

...

Processing text:  57%|█████▋    | 125/218 [00:18<00:13,  6.83it/s]

✅ Processed 1250000/2176364 comments...


Processing text:  58%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä    | 126/218 [00:18<00:13,  6.64it/s]

‚úÖ Processed 1260000/2176364 comments...


Processing text:  58%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä    | 127/218 [00:19<00:14,  6.41it/s]

‚úÖ Processed 1270000/2176364 comments...


Processing text:  59%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä    | 128/218 [00:19<00:14,  6.26it/s]

‚úÖ Processed 1280000/2176364 comments...


Processing text:  59%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ    | 129/218 [00:19<00:14,  6.29it/s]

‚úÖ Processed 1290000/2176364 comments...


Processing text:  60%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ    | 130/218 [00:19<00:14,  6.27it/s]

‚úÖ Processed 1300000/2176364 comments...


Processing text:  60%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà    | 131/218 [00:19<00:14,  6.18it/s]

‚úÖ Processed 1310000/2176364 comments...


Processing text:  61%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà    | 132/218 [00:19<00:13,  6.29it/s]

‚úÖ Processed 1320000/2176364 comments...


Processing text:  61%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà    | 133/218 [00:20<00:13,  6.49it/s]

‚úÖ Processed 1330000/2176364 comments...


Processing text:  61%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè   | 134/218 [00:20<00:12,  6.61it/s]

‚úÖ Processed 1340000/2176364 comments...


Processing text:  62%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè   | 135/218 [00:20<00:12,  6.40it/s]

‚úÖ Processed 1350000/2176364 comments...


Processing text:  62%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè   | 136/218 [00:20<00:12,  6.35it/s]

‚úÖ Processed 1360000/2176364 comments...


Processing text:  63%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé   | 137/218 [00:20<00:12,  6.24it/s]

‚úÖ Processed 1370000/2176364 comments...


Processing text:  63%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé   | 138/218 [00:20<00:12,  6.42it/s]

‚úÖ Processed 1380000/2176364 comments...


Processing text:  64%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç   | 139/218 [00:20<00:12,  6.48it/s]

‚úÖ Processed 1390000/2176364 comments...


Processing text:  64%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç   | 140/218 [00:21<00:12,  6.39it/s]

‚úÖ Processed 1400000/2176364 comments...


Processing text:  65%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç   | 141/218 [00:21<00:12,  6.38it/s]

‚úÖ Processed 1410000/2176364 comments...


Processing text:  65%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå   | 142/218 [00:21<00:12,  6.21it/s]

‚úÖ Processed 1420000/2176364 comments...


Processing text:  66%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå   | 143/218 [00:21<00:12,  6.19it/s]

‚úÖ Processed 1430000/2176364 comments...


Processing text:  66%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå   | 144/218 [00:21<00:11,  6.35it/s]

‚úÖ Processed 1440000/2176364 comments...


Processing text:  67%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã   | 145/218 [00:21<00:11,  6.63it/s]

‚úÖ Processed 1450000/2176364 comments...


Processing text:  67%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã   | 146/218 [00:22<00:10,  7.11it/s]

‚úÖ Processed 1460000/2176364 comments...


Processing text:  67%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã   | 147/218 [00:22<00:09,  7.60it/s]

‚úÖ Processed 1470000/2176364 comments...


Processing text:  68%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä   | 148/218 [00:22<00:08,  7.89it/s]

‚úÖ Processed 1480000/2176364 comments...


Processing text:  68%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä   | 149/218 [00:22<00:08,  8.11it/s]

‚úÖ Processed 1490000/2176364 comments...


Processing text:  69%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ   | 150/218 [00:22<00:08,  8.13it/s]

‚úÖ Processed 1500000/2176364 comments...


Processing text:  69%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ   | 151/218 [00:22<00:08,  7.54it/s]

‚úÖ Processed 1510000/2176364 comments...


Processing text:  70%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ   | 152/218 [00:22<00:09,  7.26it/s]

‚úÖ Processed 1520000/2176364 comments...


Processing text:  70%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà   | 153/218 [00:22<00:09,  7.00it/s]

‚úÖ Processed 1530000/2176364 comments...


Processing text:  71%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà   | 154/218 [00:23<00:09,  6.89it/s]

‚úÖ Processed 1540000/2176364 comments...


Processing text:  71%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà   | 155/218 [00:23<00:08,  7.03it/s]

‚úÖ Processed 1550000/2176364 comments...


Processing text:  72%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè  | 156/218 [00:23<00:09,  6.88it/s]

‚úÖ Processed 1560000/2176364 comments...


Processing text:  72%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè  | 157/218 [00:23<00:09,  6.65it/s]

‚úÖ Processed 1570000/2176364 comments...


Processing text:  72%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè  | 158/218 [00:23<00:08,  6.70it/s]

‚úÖ Processed 1580000/2176364 comments...


Processing text:  73%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé  | 159/218 [00:23<00:08,  6.96it/s]

‚úÖ Processed 1590000/2176364 comments...


Processing text:  73%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé  | 160/218 [00:23<00:08,  6.86it/s]

‚úÖ Processed 1600000/2176364 comments...


Processing text:  74%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç  | 161/218 [00:24<00:08,  6.75it/s]

‚úÖ Processed 1610000/2176364 comments...


Processing text:  74%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç  | 162/218 [00:24<00:08,  6.74it/s]

‚úÖ Processed 1620000/2176364 comments...


Processing text:  75%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç  | 163/218 [00:24<00:08,  6.82it/s]

‚úÖ Processed 1630000/2176364 comments...


Processing text:  75%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå  | 164/218 [00:24<00:08,  6.42it/s]

‚úÖ Processed 1640000/2176364 comments...


Processing text:  76%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå  | 165/218 [00:24<00:08,  6.33it/s]

‚úÖ Processed 1650000/2176364 comments...


Processing text:  76%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå  | 166/218 [00:24<00:08,  6.42it/s]

‚úÖ Processed 1660000/2176364 comments...


Processing text:  77%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã  | 167/218 [00:25<00:07,  6.52it/s]

‚úÖ Processed 1670000/2176364 comments...


Processing text:  77%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã  | 168/218 [00:25<00:07,  6.66it/s]

‚úÖ Processed 1680000/2176364 comments...


Processing text:  78%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä  | 169/218 [00:25<00:07,  6.52it/s]

‚úÖ Processed 1690000/2176364 comments...


Processing text:  78%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä  | 170/218 [00:25<00:07,  6.62it/s]

‚úÖ Processed 1700000/2176364 comments...


Processing text:  78%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä  | 171/218 [00:25<00:06,  6.73it/s]

‚úÖ Processed 1710000/2176364 comments...


Processing text:  79%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ  | 172/218 [00:25<00:06,  6.82it/s]

‚úÖ Processed 1720000/2176364 comments...


Processing text:  79%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ  | 173/218 [00:25<00:06,  6.77it/s]

‚úÖ Processed 1730000/2176364 comments...


Processing text:  80%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ  | 174/218 [00:26<00:06,  6.62it/s]

‚úÖ Processed 1740000/2176364 comments...


Processing text:  80%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà  | 175/218 [00:26<00:06,  6.42it/s]

‚úÖ Processed 1750000/2176364 comments...


Processing text:  81%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà  | 176/218 [00:26<00:06,  6.29it/s]

‚úÖ Processed 1760000/2176364 comments...


Processing text:  81%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà  | 177/218 [00:26<00:06,  6.23it/s]

‚úÖ Processed 1770000/2176364 comments...


Processing text:  82%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè | 178/218 [00:26<00:06,  6.31it/s]

‚úÖ Processed 1780000/2176364 comments...


Processing text:  82%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè | 179/218 [00:26<00:06,  6.26it/s]

‚úÖ Processed 1790000/2176364 comments...


Processing text:  83%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé | 180/218 [00:27<00:06,  6.15it/s]

‚úÖ Processed 1800000/2176364 comments...


Processing text:  83%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé | 181/218 [00:27<00:06,  6.17it/s]

‚úÖ Processed 1810000/2176364 comments...


Processing text:  83%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé | 182/218 [00:27<00:05,  6.26it/s]

‚úÖ Processed 1820000/2176364 comments...


Processing text:  84%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç | 183/218 [00:27<00:05,  6.19it/s]

‚úÖ Processed 1830000/2176364 comments...


Processing text:  84%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç | 184/218 [00:27<00:05,  5.78it/s]

‚úÖ Processed 1840000/2176364 comments...


Processing text:  85%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç | 185/218 [00:27<00:05,  5.66it/s]

‚úÖ Processed 1850000/2176364 comments...


Processing text:  85%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå | 186/218 [00:28<00:05,  5.86it/s]

‚úÖ Processed 1860000/2176364 comments...


Processing text:  86%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå | 187/218 [00:28<00:05,  5.72it/s]

‚úÖ Processed 1870000/2176364 comments...


Processing text:  86%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå | 188/218 [00:28<00:05,  5.99it/s]

‚úÖ Processed 1880000/2176364 comments...


Processing text:  87%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã | 189/218 [00:28<00:04,  5.90it/s]

‚úÖ Processed 1890000/2176364 comments...


Processing text:  87%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã | 190/218 [00:28<00:04,  5.89it/s]

‚úÖ Processed 1900000/2176364 comments...


Processing text:  88%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä | 191/218 [00:28<00:04,  6.00it/s]

‚úÖ Processed 1910000/2176364 comments...


Processing text:  88%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä | 192/218 [00:29<00:04,  6.14it/s]

‚úÖ Processed 1920000/2176364 comments...


Processing text:  89%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä | 193/218 [00:29<00:03,  6.57it/s]

‚úÖ Processed 1930000/2176364 comments...


Processing text:  89%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ | 194/218 [00:29<00:03,  7.12it/s]

‚úÖ Processed 1940000/2176364 comments...


Processing text:  89%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ | 195/218 [00:29<00:03,  7.52it/s]

‚úÖ Processed 1950000/2176364 comments...


Processing text:  90%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ | 196/218 [00:29<00:02,  7.63it/s]

‚úÖ Processed 1960000/2176364 comments...


Processing text:  90%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà | 197/218 [00:29<00:02,  7.87it/s]

‚úÖ Processed 1970000/2176364 comments...


Processing text:  91%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà | 198/218 [00:29<00:02,  7.70it/s]

‚úÖ Processed 1980000/2176364 comments...


Processing text:  91%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè| 199/218 [00:29<00:02,  7.43it/s]

‚úÖ Processed 1990000/2176364 comments...


Processing text:  92%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè| 200/218 [00:30<00:02,  7.08it/s]

‚úÖ Processed 2000000/2176364 comments...


Processing text:  92%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè| 201/218 [00:30<00:02,  7.06it/s]

‚úÖ Processed 2010000/2176364 comments...


Processing text:  93%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé| 202/218 [00:30<00:02,  6.96it/s]

‚úÖ Processed 2020000/2176364 comments...


Processing text:  93%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé| 203/218 [00:30<00:02,  7.05it/s]

‚úÖ Processed 2030000/2176364 comments...


Processing text:  94%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé| 204/218 [00:30<00:02,  6.79it/s]

‚úÖ Processed 2040000/2176364 comments...


Processing text:  94%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç| 205/218 [00:30<00:01,  6.77it/s]

‚úÖ Processed 2050000/2176364 comments...


Processing text:  94%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç| 206/218 [00:31<00:01,  6.79it/s]

‚úÖ Processed 2060000/2176364 comments...


Processing text:  95%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç| 207/218 [00:31<00:01,  6.75it/s]

‚úÖ Processed 2070000/2176364 comments...


Processing text:  95%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå| 208/218 [00:31<00:01,  6.65it/s]

‚úÖ Processed 2080000/2176364 comments...


Processing text:  96%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå| 209/218 [00:31<00:01,  6.50it/s]

‚úÖ Processed 2090000/2176364 comments...


Processing text:  96%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã| 210/218 [00:31<00:01,  6.60it/s]

‚úÖ Processed 2100000/2176364 comments...


Processing text:  97%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã| 211/218 [00:31<00:01,  6.65it/s]

‚úÖ Processed 2110000/2176364 comments...


Processing text:  97%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã| 212/218 [00:31<00:00,  6.62it/s]

‚úÖ Processed 2120000/2176364 comments...


Processing text:  98%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä| 213/218 [00:32<00:00,  5.15it/s]

‚úÖ Processed 2130000/2176364 comments...


Processing text:  98%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä| 214/218 [00:32<00:00,  4.06it/s]

‚úÖ Processed 2140000/2176364 comments...


Processing text:  99%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä| 215/218 [00:32<00:00,  3.67it/s]

‚úÖ Processed 2150000/2176364 comments...


Processing text:  99%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ| 216/218 [00:33<00:00,  3.52it/s]

‚úÖ Processed 2160000/2176364 comments...


Processing text: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ| 217/218 [00:33<00:00,  3.09it/s]

‚úÖ Processed 2170000/2176364 comments...


Processing text: 100%|██████████| 218/218 [00:33<00:00,  6.41it/s]


✅ Processed 2176364/2176364 comments...

📌 Vocabulary Size (After Limiting): 50000 words

💾 Saving vectorizer configuration...

💾 Saving raw tokenized sequences...

✅ Tokenization complete!


In [7]:
import numpy as np
from tqdm import tqdm
from tensorflow.keras.preprocessing.sequence import pad_sequences

# ✅ Load tokenized text
print("\n📂 Loading tokenized sequences from 'tokenized_comments.npy'...")
sequences = np.load("tokenized_comments.npy", allow_pickle=True)

# ✅ Ensure the shape is correct
print(f"\n📌 Loaded tokenized sequences with shape: {sequences.shape}")

# ✅ Preallocate the output array with NumPy (Optimized)
sequence_length = 30  # Ensure consistency
max_sequences = sequences.shape[0]  # Cap the number of training rows at the number of comments
input_sequences = np.zeros((max_sequences, sequence_length), dtype=np.int32)

print("\n🔄 Creating input sequences...")
index = 0

for seq in tqdm(sequences, desc="Processing sequences"):
    seq = seq[seq != 0]  # Drop padding so labels are real tokens (assumes id 0 is the padding token)
    for i in range(1, len(seq)):
        sub_seq = seq[:i+1]

        # ✅ Trim sequences that exceed `sequence_length`
        if len(sub_seq) > sequence_length:
            sub_seq = sub_seq[-sequence_length:]  # Keep last 30 tokens

        # ✅ Prevent overflow: stop writing once `input_sequences` is full
        if index >= max_sequences:
            break

        input_sequences[index, -len(sub_seq):] = sub_seq  # Insert at the end (left-padded with zeros)
        index += 1

# ✅ Pad sequences (a no-op here, since every row is already `sequence_length` wide)
print("\n🔄 Padding input sequences...")
input_sequences = pad_sequences(input_sequences, maxlen=sequence_length, padding="pre")

print(f"\n📌 Padded input shape: {input_sequences.shape}")

# ✅ Extract input (X) and output (y): the last token of each row is the label
X, y = input_sequences[:, :-1], input_sequences[:, -1]

print(f"\n📌 X shape: {X.shape}, y shape: {y.shape}")

# ✅ Keep y as integer labels (Sparse Encoding)
print("\n🔄 Converting y to sparse labels...")
y = np.array(y, dtype=np.int32)  # Integer class ids, as expected by sparse categorical cross-entropy

# ✅ Save processed training sequences separately
print("\n💾 Saving LSTM training sequences...")
np.save("X.npy", X)
np.save("y.npy", y)

print("\n✅ Tokenized sequences saved successfully!")



📂 Loading tokenized sequences from 'tokenized_comments.npy'...

📌 Loaded tokenized sequences with shape: (2176364, 30)

🔄 Creating input sequences...


Processing sequences: 100%|██████████| 2176364/2176364 [00:05<00:00, 408875.99it/s]



🔄 Padding input sequences...

📌 Padded input shape: (2176364, 30)

📌 X shape: (2176364, 29), y shape: (2176364,)

🔄 Converting y to sparse labels...

💾 Saving LSTM training sequences...

✅ Tokenized sequences saved successfully!
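
The cell above turns each comment into rows made of a growing prefix, left-padded to a fixed width, with the last token of each row used as the label. A minimal, dependency-free sketch of that idea on a toy sequence may help; the helper name `make_prefix_windows` and the toy token ids are illustrative assumptions, and id 0 is assumed to be padding.

```python
def make_prefix_windows(token_ids, window, pad_id=0):
    """Return left-padded prefixes of token_ids, each exactly `window` long."""
    tokens = [t for t in token_ids if t != pad_id]   # ignore padding ids
    rows = []
    for end in range(2, len(tokens) + 1):            # prefixes with at least 2 tokens
        prefix = tokens[:end][-window:]              # keep only the last `window` tokens
        rows.append([pad_id] * (window - len(prefix)) + prefix)
    return rows


rows = make_prefix_windows([7, 3, 9, 4, 0, 0], window=4)
X_toy = [row[:-1] for row in rows]   # all but the last token: model input
y_toy = [row[-1] for row in rows]    # last token: next-word label

for features, label in zip(X_toy, y_toy):
    print(features, "->", label)
```

Each comment of `n` real tokens yields up to `n - 1` rows, so a full expansion of the dataset would be much larger than the number of comments.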


## **📌 Step 3: Building the LSTM Model**
We define an LSTM-based architecture with embedding and dense layers.

In [7]:
import numpy as np

sequence_length = 30  # Ensure consistency with previous settings

# ✅ Load preprocessed training data
print("\n📂 Loading preprocessed sequences from 'X.npy' and 'y.npy'...")
X = np.load("X.npy", allow_pickle=True)
y = np.load("y.npy", allow_pickle=True)

# ✅ Confirm dataset shapes
print(f"✅ Loaded input sequences. X shape: {X.shape}, y shape: {y.shape}")



📂 Loading preprocessed sequences from 'X.npy' and 'y.npy'...
✅ Loaded input sequences. X shape: (2176364, 29), y shape: (2176364,)


In [3]:
import json
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional, Input, TextVectorization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import mixed_precision

# ✅ Enable Mixed Precision for Faster Training
mixed_precision.set_global_policy("mixed_float16")

# ✅ Load vectorizer configuration
print("\n📂 Loading vectorizer configuration from 'vectorizer_config.json'...")
with open("vectorizer_config.json", "r") as f:
    vectorizer_config = json.load(f)

# ✅ Restore `TextVectorization` layer with saved settings
max_tokens = vectorizer_config["max_tokens"]
sequence_length = vectorizer_config["sequence_length"]

print(f"\n✅ Restoring vectorizer with max_tokens={max_tokens} and sequence_length={sequence_length}...")

vectorizer = TextVectorization(max_tokens=max_tokens, output_sequence_length=sequence_length)

# ✅ Define vocabulary size from `vectorizer`
# Note: the layer above is rebuilt from settings only and has not been adapted or given
# a vocabulary, so this returns 2 (the padding and '[UNK]' tokens), not `max_tokens`
vocab_size = vectorizer.vocabulary_size()

print(f"✅ Vocabulary Size: {vocab_size}")

# ✅ Define Optimized LSTM Model
model = Sequential([
    Input(shape=(sequence_length - 1,)),  # Explicit input layer (29 context tokens)
    Embedding(input_dim=vocab_size, output_dim=256),  # One 256-dim vector per token id in [0, vocab_size)
    Bidirectional(LSTM(256, return_sequences=True)),  # BiLSTM for deeper context understanding
    LSTM(256),  # Additional LSTM layer
    Dropout(0.3),  # Dropout to reduce overfitting
    Dense(256, activation="relu"),
    Dense(vocab_size, activation="softmax", dtype="float32")  # Keep the output layer in float32 under mixed precision
])

# ✅ Compile model with Adam optimizer and lower learning rate
model.compile(loss="sparse_categorical_crossentropy", optimizer=Adam(learning_rate=0.0005), metrics=["accuracy"])

# ✅ Print model summary
model.summary()



📂 Loading vectorizer configuration from 'vectorizer_config.json'...

✅ Restoring vectorizer with max_tokens=50000 and sequence_length=30...
✅ Vocabulary Size: 2


## **📌 Step 4: Training the LSTM Model**
We train the LSTM model with sparse categorical cross-entropy loss, since the labels in `y` are integer token ids rather than one-hot vectors.
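
As a small worked example of what this loss computes: for one training row, sparse categorical cross-entropy is the negative log of the probability the model assigns to the true next-word id. The sketch below uses plain Python and a made-up three-word distribution; the function name and numbers are illustrative assumptions, not values from the notebook.

```python
import math


def sparse_categorical_crossentropy(probs, label):
    """Loss for one example: -log of the probability given to the true class id."""
    return -math.log(probs[label])


probs = [0.1, 0.7, 0.2]   # toy softmax output over a 3-word vocabulary

loss_right = sparse_categorical_crossentropy(probs, 1)   # true word got p = 0.7
loss_wrong = sparse_categorical_crossentropy(probs, 0)   # true word got p = 0.1

print(f"confident and right: {loss_right:.4f}")
print(f"confident and wrong: {loss_wrong:.4f}")
```

The label is used directly as an index into the predicted distribution, which is why every label must be smaller than the size of the model's output layer.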

In [4]:
import tensorflow as tf

# Check if GPU is available
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available: 1


In [5]:
# Enable memory growth so TensorFlow allocates GPU memory on demand
# (this must run before TensorFlow initializes the GPU, otherwise it raises a RuntimeError)
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print("\n✅ GPU is enabled and TensorFlow is using it!")
    except RuntimeError as e:
        print(e)
else:
    print("\n❌ No GPU detected, training may be slow!")


Physical devices cannot be modified after being initialized


In [9]:
import time
import pickle
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.optimizers import Adam

# ✅ Ensure `y` is sparse categorical (integer labels)
y = np.array(y, dtype=np.int32)

# ✅ Compute the loss in float32 when mixed precision is enabled
if tf.keras.mixed_precision.global_policy().name == "mixed_float16":
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False, dtype=tf.float32)
else:
    loss_fn = "sparse_categorical_crossentropy"

# ✅ Recompile with the same Adam learning rate used in the model definition
# (passing the string "adam" would reset it to the Keras default)
model.compile(loss=loss_fn, optimizer=Adam(learning_rate=0.0005), metrics=["accuracy"])

# ✅ Define Callbacks (Save best model & stop early if validation loss stops improving)
callbacks = [
    ModelCheckpoint("nyt_lstm_best_model.keras", save_best_only=True, monitor="val_loss", mode="min", verbose=1),
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True, verbose=1)
]

# ✅ Start Timer
start_time = time.time()
print("\n🚀 Starting Model Training...\n")

# ✅ Reduce training dataset size for quick testing (Adjust as needed)
X_train, y_train = X[:1000000], y[:1000000]

# ✅ Train model with verbose logging and callbacks
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=512,
    validation_split=0.2,  # Holds out the last 20% of X_train (taken before any shuffling)
    shuffle=True,  # Shuffles the remaining training rows each epoch
    verbose=1,
    callbacks=callbacks
)

# ✅ Compute total training time
end_time = time.time()
total_time = end_time - start_time
print(f"\n✅ Training Completed in {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

# ✅ Save Model (Final version)
model.save("nyt_lstm_model.keras")
print("\n💾 Model saved as 'nyt_lstm_model.keras'")

# ✅ Save Training History
with open("training_history.pkl", "wb") as f:
    pickle.dump(history.history, f)
print("\n📊 Training history saved as 'training_history.pkl'")

# ✅ Print Final Training Stats
print("\n📌 Final Training Metrics:")
print(f"   🔹 Final Training Loss: {history.history['loss'][-1]:.4f}")
print(f"   🔹 Final Validation Loss: {history.history['val_loss'][-1]:.4f}")
print(f"   🔹 Final Training Accuracy: {history.history['accuracy'][-1]:.4f}")
print(f"   🔹 Final Validation Accuracy: {history.history['val_accuracy'][-1]:.4f}")

print("\n🎯 Training Complete! You can now evaluate the model and generate text.")



🚀 Starting Model Training...

Epoch 1/10


InvalidArgumentError: Graph execution error:

Detected at node TensorScatterUpdate defined at (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main

  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code

  File "/usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py", line 37, in <module>

  File "/usr/local/lib/python3.10/dist-packages/traitlets/config/application.py", line 992, in launch_instance

  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelapp.py", line 619, in start

  File "/usr/local/lib/python3.10/dist-packages/tornado/platform/asyncio.py", line 195, in start

  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever

  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once

  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run

  File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 685, in <lambda>

  File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 738, in _run_callback

  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 825, in inner

  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 786, in run

  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelbase.py", line 361, in process_one

  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 234, in wrapper

  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelbase.py", line 261, in dispatch_shell

  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 234, in wrapper

  File "/usr/local/lib/python3.10/dist-packages/ipykernel/kernelbase.py", line 539, in execute_request

  File "/usr/local/lib/python3.10/dist-packages/tornado/gen.py", line 234, in wrapper

  File "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py", line 302, in do_execute

  File "/usr/local/lib/python3.10/dist-packages/ipykernel/zmqshell.py", line 539, in run_cell

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 2975, in run_cell

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3030, in _run_cell

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/async_helpers.py", line 78, in _pseudo_sync_runner

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3257, in run_cell_async

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3473, in run_ast_nodes

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code

  File "<ipython-input-9-d193ab5ea42c>", line 33, in <cell line: 33>

  File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/trainer.py", line 320, in fit

  File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/trainer.py", line 121, in one_step_on_iterator

  File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/trainer.py", line 108, in one_step_on_data

  File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/trainer.py", line 73, in train_step

  File "/usr/local/lib/python3.10/dist-packages/keras/src/optimizers/base_optimizer.py", line 291, in apply_gradients

  File "/usr/local/lib/python3.10/dist-packages/keras/src/optimizers/loss_scale_optimizer.py", line 165, in apply

  File "/usr/local/lib/python3.10/dist-packages/keras/src/optimizers/loss_scale_optimizer.py", line 216, in _tf_apply

  File "/usr/local/lib/python3.10/dist-packages/keras/src/optimizers/loss_scale_optimizer.py", line 199, in _common_apply

  File "/usr/local/lib/python3.10/dist-packages/keras/src/optimizers/loss_scale_optimizer.py", line 252, in check_finite

  File "/usr/local/lib/python3.10/dist-packages/keras/src/optimizers/loss_scale_optimizer.py", line 252, in <listcomp>

  File "/usr/local/lib/python3.10/dist-packages/keras/src/ops/numpy.py", line 2859, in isfinite

  File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/sparse.py", line 338, in sparse_wrapper

2 root error(s) found.
  (0) INVALID_ARGUMENT:  indices[27] = [11] does not index into shape [2,256]
	 [[{{node TensorScatterUpdate}}]]
	 [[StatefulPartitionedCall/TensorScatterUpdate/_36]]
  (1) INVALID_ARGUMENT:  indices[27] = [11] does not index into shape [2,256]
	 [[{{node TensorScatterUpdate}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_one_step_on_iterator_4914]
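
The error above says that token id 11 does not fit into an embedding table of shape `[2, 256]`: the model was built with `vocab_size = 2` (see the "Vocabulary Size: 2" line in Step 3), while `X` holds ids from a vocabulary limited to 50000 tokens. A check along the following lines could catch such a mismatch before training starts. This is only a sketch in plain Python; `check_token_ids` is a hypothetical helper and the toy batch stands in for `X`.

```python
def check_token_ids(batch, vocab_size):
    """Raise ValueError if any token id falls outside [0, vocab_size)."""
    largest = max(max(row) for row in batch)
    smallest = min(min(row) for row in batch)
    if smallest < 0 or largest >= vocab_size:
        raise ValueError(
            f"token ids span [{smallest}, {largest}] "
            f"but the embedding has only {vocab_size} rows"
        )
    return largest


toy_batch = [[0, 0, 5], [0, 11, 3]]   # stand-in for a few rows of X

try:
    check_token_ids(toy_batch, vocab_size=2)       # mirrors the failing setup
except ValueError as err:
    print("mismatch:", err)

print("ok, largest id:", check_token_ids(toy_batch, vocab_size=50000))
```

In the notebook, the equivalent comparison would be between `X.max()` and `vocab_size`. One likely remedy, assuming the saved configuration is trustworthy, is to size the `Embedding` and output layers from the saved `max_tokens` value, or to restore the fitted vocabulary into the `TextVectorization` layer before calling `vocabulary_size()`.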

## **📌 Step 5: Generate New Comments Using the LSTM**
We use the trained model to predict and generate text from a given seed phrase.

In [None]:
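import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Assumes `model` is trained and `vectorizer` holds the fitted vocabulary (id 0 is padding)
vocab = vectorizer.get_vocabulary()

def generate_text(seed_text, next_words=50, temperature=1.0):
    for _ in range(next_words):
        # Tokenize only the most recent words, drop padding, then left-pad to the model's input length
        context = " ".join(seed_text.split()[-(sequence_length - 1):])
        token_list = vectorizer([context]).numpy()[0]
        token_list = token_list[token_list != 0]
        token_list = pad_sequences([token_list], maxlen=sequence_length-1, padding="pre")

        # Predict next word probabilities
        predicted_probs = model.predict(token_list, verbose=0)[0].astype("float64")

        # Apply temperature to control randomness (the epsilon avoids log(0))
        scaled = np.log(predicted_probs + 1e-9) / temperature
        exp_preds = np.exp(scaled - np.max(scaled))
        predicted_probs = exp_preds / np.sum(exp_preds)

        # Sample from probability distribution
        predicted_index = np.random.choice(len(predicted_probs), p=predicted_probs)

        # Convert index to word
        output_word = vocab[predicted_index] if predicted_index < len(vocab) else ""
        seed_text += " " + output_word
    return seed_text

# Example usage
print(generate_text("the government should", next_words=20, temperature=0.7))  # Try different values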
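
To see what the temperature step in `generate_text` does on its own, here is a small sketch in plain Python. Dividing the log-probabilities by a temperature `T` and renormalising is equivalent to raising each probability to the power `1/T`: values below 1 sharpen the distribution and values above 1 flatten it. The helper name `apply_temperature` and the three-word distribution are illustrative assumptions.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a probability distribution as p_i ** (1 / T), then renormalise."""
    scaled = [math.log(p + 1e-9) / temperature for p in probs]   # epsilon avoids log(0)
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]               # subtract peak for stability
    total = sum(weights)
    return [w / total for w in weights]


probs = [0.6, 0.3, 0.1]                 # toy next-word distribution
sharp = apply_temperature(probs, 0.5)   # more confident: favours the top word
flat = apply_temperature(probs, 2.0)    # more exploratory: closer to uniform

print("T=0.5:", [round(p, 3) for p in sharp])
print("T=2.0:", [round(p, 3) for p in flat])

# Sampling a few word ids from the sharpened distribution
random.seed(0)
print(random.choices(range(len(probs)), weights=sharp, k=5))
```

Lower temperatures tend to produce safer, more repetitive text, while higher ones produce more varied but less coherent output.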


## **📌 Final Summary**
1. **Merged all comment datasets** into a single dataset.
2. **Preprocessed and tokenized the text** for input sequences.
3. **Trained an LSTM model** with embeddings and dense layers.
4. **Saved progress at every stage** to prevent data loss.
5. **Generated new comments** based on seed text input.