# 📊 **New York Times Comments Dataset Analysis**
This notebook analyzes the New York Times Comments dataset available on Kaggle.
We will extract metadata, check for missing values, and summarize the structure of the dataset before proceeding with text analysis.

## **📌 Step 1: Set Up the Environment**
We start by importing the necessary libraries and listing all available files in the dataset.

In [None]:
import os
import pandas as pd

# Set path to dataset (Kaggle users should adjust as needed)
dataset_path = "../input/nyt-comments/"

# List all files in the dataset
files = os.listdir(dataset_path)
print("Files in dataset:\n", files)
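
The listing mixes article files and comment files, so it can help to separate them before loading anything. The following is a minimal sketch, assuming the file names start with `Articles` or `Comments`; the helper `group_files` and the sample names in the demo list are hypothetical.

```python
def group_files(file_names):
    """Split file names into article, comment, and other files by prefix."""
    groups = {"articles": [], "comments": [], "other": []}
    for name in sorted(file_names):
        if not name.lower().endswith(".csv"):
            groups["other"].append(name)
        elif name.startswith("Articles"):
            groups["articles"].append(name)
        elif name.startswith("Comments"):
            groups["comments"].append(name)
        else:
            groups["other"].append(name)
    return groups


# Hypothetical listing; in the notebook, pass the list returned by os.listdir
example = ["CommentsJan2017.csv", "ArticlesJan2017.csv", "notes.txt"]
print(group_files(example))
```

Grouping by prefix keeps later steps from accidentally passing a non-CSV file to `pd.read_csv`.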

## **📌 Step 2: Load & Inspect Data**
Let's load one file (e.g., `ArticlesJan2017.csv`) to inspect its structure.

In [None]:
# Load an example file to inspect its structure
sample_file = "ArticlesJan2017.csv"  # You can change this to any file in the dataset
df = pd.read_csv(os.path.join(dataset_path, sample_file))

# Display first few rows
df.head()

## **📌 Step 3: Extract Metadata**
Next, we use `df.info()` to list the column names, data types, and non-null counts.

In [None]:
# Display dataset information
df.info()

## **📌 Step 4: Check for Missing Values**
We count the missing values in each column and show only the columns that have at least one.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]
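
Raw counts are hard to compare across files of different sizes, so a percentage column is often useful. The sketch below is one possible way to build such a report; the helper `missing_report` and the toy column names are assumptions for illustration, not part of the dataset.

```python
import pandas as pd


def missing_report(frame):
    """Summarize missing values per column as counts and percentages."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with hypothetical column names
toy = pd.DataFrame({
    "headline": ["a", None, "c", None],
    "wordCount": [10, 20, None, 40],
    "id": [1, 2, 3, 4],
})
print(missing_report(toy))
```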

## **📌 Step 5: Summary Statistics**
Generate a summary of numeric and categorical columns.

In [None]:
# Display summary statistics
df.describe(include="all").transpose()

## **📌 Step 6: Check for Unique Identifiers**
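A column can serve as a unique identifier if it has no missing values and its number of distinct values equals the number of rows, so we count the distinct values in each column and compare them with the row count.

In [None]:
# Count distinct values per column; compare each count with the number of rows
unique_counts = df.nunique()
print("Rows:", len(df))
unique_counts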
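
The comparison with the row count can also be automated. This is a minimal sketch under the assumption that an identifier must be non-null and distinct in every row; `identifier_candidates` and the toy column names are hypothetical.

```python
import pandas as pd


def identifier_candidates(frame):
    """Return columns with no missing values and one distinct value per row."""
    n_rows = len(frame)
    return [
        col for col in frame.columns
        if frame[col].notna().all() and frame[col].nunique() == n_rows
    ]


# Toy frame with hypothetical column names
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "section": ["World", "World", "Opinion"],
    "byline": ["x", None, "z"],
})
print(identifier_candidates(toy))
```

In the toy frame only `id` qualifies: `section` repeats a value and `byline` has a missing entry.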

## **📌 Step 7: Automate Metadata Extraction for All Files**
Instead of manually inspecting each file, we automate metadata extraction for all files.

In [None]:
# Iterate over all CSV files and extract metadata
metadata_summary = []

for file in files:
    if not file.endswith(".csv"):
        continue  # Skip anything that is not a CSV file

    # Use a separate name so the `df` loaded in Step 2 is not overwritten
    file_df = pd.read_csv(os.path.join(dataset_path, file))

    metadata_summary.append({
        "File Name": file,
        "Rows": file_df.shape[0],
        "Columns": file_df.shape[1],
        "Missing Values": file_df.isnull().sum().sum(),
        "Duplicate Rows": file_df.duplicated().sum(),
        "Unique Values per Column": file_df.nunique().to_dict(),
    })

# Convert metadata summary to DataFrame for better readability
metadata_df = pd.DataFrame(metadata_summary)
metadata_df

## **🔍 Conclusion**
This notebook summarizes the dataset's structure, missing values, and per-file metadata, which prepares the data for the text processing and LSTM-based text generation that follow.

# 📊 **LSTM-Based Text Generation on NYT Comments Dataset**
This notebook trains an LSTM model on the **New York Times Comments dataset** to generate comment-like text. It proceeds in stages: merging the comment files, cleaning and tokenizing the text, training the LSTM model, and generating text. The output of each stage is saved to disk so that finished work is not lost if the session shuts down.

## **📌 Step 1: Load & Merge All Comment Datasets**
We combine all comments into a single dataset for better model generalization.

In [None]:
import os
import pandas as pd

# Path to dataset directory
dataset_path = "../input/nyt-comments/"

# List all comment files
comment_files = [file for file in os.listdir(dataset_path) if file.startswith("Comments")]

# Initialize empty list to store DataFrames
df_list = []

# Load and merge all comment files
for file in comment_files:
    file_path = os.path.join(dataset_path, file)
    df = pd.read_csv(file_path, usecols=["commentBody"])
    df_list.append(df)

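# Combine all comments into one DataFrame and drop rows with no comment text
# (otherwise astype(str) in Step 2 would turn missing values into the word "nan")
df_combined = pd.concat(df_list, ignore_index=True)
df_combined = df_combined.dropna(subset=["commentBody"]).reset_index(drop=True)

# Save the merged comments so later sessions can skip the merge
# (text cleaning happens in Step 2)
df_combined.to_csv("nyt_comments_cleaned.csv", index=False)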

# Display dataset shape
print("Total Comments:", df_combined.shape[0])
df_combined.head()

## **📌 Step 2: Preprocessing the Text**
We clean the text by converting it to lowercase, removing digits and punctuation, and collapsing whitespace. We then tokenize the comments into integer sequences and build the training examples from them.

In [None]:
import re
import json
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Function to clean text
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply text cleaning
df_combined["commentBody"] = df_combined["commentBody"].astype(str).apply(clean_text)

# Tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_combined["commentBody"])

# Save tokenizer (to_json() already returns a JSON string, so write it directly)
with open("tokenizer.json", "w", encoding="utf-8") as f:
    f.write(tokenizer.to_json())

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(df_combined["commentBody"])

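# Create input sequences: every prefix of a comment (length >= 2) becomes one
# training example whose last token is the word to predict
sequence_length = 50
input_sequences = []
for seq in sequences:
    for i in range(1, len(seq)):
        # Keep at most the last `sequence_length` tokens; pad_sequences would
        # truncate longer prefixes from the left anyway
        input_sequences.append(seq[max(0, i + 1 - sequence_length):i + 1])

# Left-pad shorter sequences with zeros to a fixed length
input_sequences = pad_sequences(input_sequences, maxlen=sequence_length, padding="pre")

# Extract input (X: all tokens but the last) and output (y: the last token)
X, y = input_sequences[:, :-1], input_sequences[:, -1]

# One-hot encode y; the result has shape (num_samples, vocab_size), so memory
# grows with both the number of sequences and the vocabulary size
y = tf.keras.utils.to_categorical(y, num_classes=len(tokenizer.word_index) + 1)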

# Save tokenized sequences
import numpy as np
np.save("input_sequences.npy", input_sequences)
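
To make the sequence construction concrete, the sketch below reproduces the prefix-and-pad logic in plain Python on a toy comment. The helper `make_training_pairs` and the token IDs are hypothetical; the notebook itself uses `pad_sequences` with a window of 50.

```python
def make_training_pairs(token_ids, window):
    """Turn one tokenized comment into (context, target) pairs.

    Each prefix of length >= 2 yields one pair. The prefix is left-padded
    with zeros (or cut from the left) to exactly `window` tokens, then split
    into a context of window - 1 tokens and a single target token.
    """
    pairs = []
    for i in range(1, len(token_ids)):
        prefix = token_ids[max(0, i + 1 - window):i + 1]
        padded = [0] * (window - len(prefix)) + prefix
        pairs.append((padded[:-1], padded[-1]))
    return pairs


# Toy example: a four-token comment and a window of 4
for context, target in make_training_pairs([7, 3, 9, 2], window=4):
    print(context, "->", target)
# [0, 0, 7] -> 3
# [0, 7, 3] -> 9
# [7, 3, 9] -> 2
```

A comment with `n` tokens produces `n - 1` training examples, so the number of sequences grows with the total token count of the dataset rather than with the number of comments.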

## **📌 Step 3: Building the LSTM Model**
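We define the architecture: an embedding layer, two stacked LSTM layers, a dense hidden layer, and a softmax output layer that predicts the next word over the whole vocabulary.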

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Define LSTM model
vocab_size = len(tokenizer.word_index) + 1

model = Sequential([
    Embedding(vocab_size, 128, input_length=sequence_length-1),
    LSTM(256, return_sequences=True),
    LSTM(256),
    Dense(256, activation="relu"),
    Dense(vocab_size, activation="softmax")
])

# Compile model
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Model summary
model.summary()
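
The parameter counts reported by `model.summary()` can be checked by hand. The sketch below assumes standard LSTM layers with bias terms (four gates, each with input weights, recurrent weights, and a bias) and uses a vocabulary size of 50,000 purely as an illustrative figure; the real value comes from `len(tokenizer.word_index) + 1`.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


def model_params(vocab_size, embed_dim=128, lstm_units=256, dense_units=256):
    """Per-layer parameter counts for the architecture defined above."""
    return {
        "embedding": vocab_size * embed_dim,
        "lstm_1": lstm_params(embed_dim, lstm_units),
        "lstm_2": lstm_params(lstm_units, lstm_units),
        "dense_hidden": dense_params(lstm_units, dense_units),
        "dense_output": dense_params(dense_units, vocab_size),
    }


# vocab_size=50_000 is an assumed figure for illustration only
counts = model_params(vocab_size=50_000)
for name, n in counts.items():
    print(f"{name:>13}: {n:,}")
print(f"{'total':>13}: {sum(counts.values()):,}")
```

Under these assumptions the embedding and output layers account for most of the parameters, because both scale with the vocabulary size.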

## **📌 Step 4: Training the LSTM Model**
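We train the model for 30 epochs with categorical cross-entropy loss, holding out 20% of the sequences for validation, and then save the model and its training history.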

In [None]:
# Train model
history = model.fit(X, y, epochs=30, batch_size=128, validation_split=0.2)

# Save model
model.save("nyt_lstm_model.h5")

# Save training history
import pickle
with open("training_history.pkl", "wb") as f:
    pickle.dump(history.history, f)
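
Once the history is saved, it can be inspected to see where validation loss stopped improving. The sketch below assumes the dictionary has the usual `loss` and `val_loss` lists; the helper `best_epoch` is hypothetical and the numbers are made up to show the shape of the data, not results from this model.

```python
def best_epoch(history, metric="val_loss"):
    """Return the 1-based epoch with the lowest value of `metric`, and that value."""
    values = history[metric]
    index = min(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Made-up numbers in the shape of `history.history`
toy_history = {
    "loss": [6.9, 6.1, 5.6, 5.2],
    "val_loss": [6.5, 6.0, 6.1, 6.3],
}
epoch, value = best_epoch(toy_history)
print(f"Lowest val_loss {value} at epoch {epoch}")
```

A training loss that keeps falling while validation loss rises, as in the toy numbers, is the usual sign of overfitting.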

## **📌 Step 5: Generate New Comments Using the LSTM**
We use the trained model to generate text from a seed phrase, predicting one word at a time and appending it to the input for the next prediction.

In [None]:
import numpy as np

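def generate_text(seed_text, next_words=50, temperature=1.0):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=sequence_length-1, padding="pre")

        # Predict the probability distribution over the next word
        predicted_probs = model.predict(token_list, verbose=0)[0].astype("float64")

        if temperature <= 0:
            # Greedy decoding: always pick the most likely word
            predicted_index = int(np.argmax(predicted_probs))
        else:
            # Rescale the distribution by the temperature, then sample from it
            scaled = np.log(predicted_probs + 1e-9) / temperature
            scaled = np.exp(scaled - np.max(scaled))
            predicted_index = int(np.random.choice(len(scaled), p=scaled / scaled.sum()))

        # Convert index to word (index 0 is padding and has no word)
        output_word = tokenizer.index_word.get(predicted_index, "")
        if output_word:
            seed_text += " " + output_word
    return seed_text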

# Example
print(generate_text("the government should", next_words=20))
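
The `temperature` argument is meant to control how adventurous the generated text is. The standalone sketch below shows the effect on a toy next-word distribution using only the standard library; the helpers `apply_temperature` and `sample_index` and the probability values are illustrative assumptions, not output from the trained model.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution: values below 1 sharpen it, values above 1 flatten it."""
    logits = [math.log(p + 1e-9) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, temperature=1.0, rng=random):
    """Draw one index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]


# Toy next-word distribution over four words (illustrative values)
toy_probs = [0.6, 0.25, 0.1, 0.05]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(toy_probs, t)])

# A seeded generator makes the draw repeatable
print(sample_index(toy_probs, temperature=0.7, rng=random.Random(0)))
```

Low temperatures concentrate probability on the most likely word, which tends to produce repetitive text, while high temperatures spread it out and produce more varied but less coherent text.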

## **📌 Final Summary**
1. **Merged all comment datasets** into a single dataset.
2. **Preprocessed and tokenized the text** for input sequences.
3. **Trained an LSTM model** with embeddings and dense layers.
4. **Saved the output of each stage** (merged comments, tokenizer, input sequences, model, and training history) to prevent data loss.
5. **Generated new comments** based on seed text input.