# 📊 **New York Times Comments Dataset Analysis**
This notebook analyzes the New York Times Comments dataset available on Kaggle.
We will extract metadata, check for missing values, and summarize the structure of the dataset before proceeding with text analysis.

## **📌 Step 1: Set Up the Environment**
We start by importing the necessary libraries and listing all available files in the dataset.

In [None]:
import os
import pandas as pd

# Set path to dataset (Kaggle users should adjust as needed)
dataset_path = "../input/nyt-comments/"

# List all files in the dataset
files = os.listdir(dataset_path)
print("Files in dataset:\n", files)

## **📌 Step 2: Load & Inspect Data**
Let's load one file (e.g., `ArticlesJan2017.csv`) to inspect its structure.

In [None]:
# Load an example file to inspect its structure
sample_file = "ArticlesJan2017.csv"  # You can change this to any file in the dataset
df = pd.read_csv(os.path.join(dataset_path, sample_file))

# Display first few rows
df.head()

## **📌 Step 3: Extract Metadata**
Next, we use `df.info()` to extract key metadata: column names, data types, non-null counts per column, and memory usage.

In [None]:
# Display dataset information
df.info()

## **📌 Step 4: Check for Missing Values**
We count the missing values in each column and display only the columns that have at least one.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]
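
Raw counts are easier to interpret next to the share of rows they represent. The sketch below is one possible way to report both. The helper name `missing_report` and the toy column names are illustrative assumptions, not part of the dataset's documentation.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for a real file (column names are hypothetical)
toy = pd.DataFrame({
    "articleID": ["a1", "a2", "a3", "a4"],
    "abstract": [None, None, "text", None],
    "byline": ["By A", None, "By B", "By C"],
})

report = missing_report(toy)
print(report)
```

On the real data, the same helper could be called as `missing_report(df)`.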

## **📌 Step 5: Summary Statistics**
Generate a summary of numeric and categorical columns.

In [None]:
# Display summary statistics
df.describe(include="all").transpose()

## **📌 Step 6: Check for Unique Identifiers**
Count the distinct values in each column. A column whose distinct count equals the number of rows is a candidate unique identifier.

In [None]:
# Count distinct values per column; compare against len(df) to spot identifier candidates
unique_counts = df.nunique()
unique_counts
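
The comparison against the row count can also be done programmatically. The following is a minimal sketch, assuming a column qualifies as an identifier only if it has no missing values and a distinct value in every row. The helper `candidate_keys` and the toy columns are hypothetical.

```python
import pandas as pd


def candidate_keys(frame: pd.DataFrame) -> list:
    """Return columns with no missing values and a distinct value in every row."""
    n_rows = len(frame)
    return [
        col for col in frame.columns
        if frame[col].notna().all() and frame[col].nunique() == n_rows
    ]


# Toy frame (column names are illustrative, not taken from the dataset)
toy = pd.DataFrame({
    "articleID": ["a1", "a2", "a3"],
    "headline": ["X", "X", "Y"],      # repeated value, so not unique
    "abstract": ["p", None, "q"],     # contains a missing value
})

keys = candidate_keys(toy)
print(keys)
```

Calling `candidate_keys(df)` on a loaded file would list the columns worth checking as join keys.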

## **📌 Step 7: Automate Metadata Extraction for All Files**
Instead of manually inspecting each file, we automate metadata extraction for all files.

In [None]:
# Iterate over all CSV files and extract metadata
metadata_summary = []

for file in files:
    # Skip anything that is not a CSV file (e.g., subfolders or other formats)
    if not file.lower().endswith(".csv"):
        continue

    file_path = os.path.join(dataset_path, file)
    # Use a separate name so the `df` loaded in Step 2 is not overwritten
    file_df = pd.read_csv(file_path)

    metadata_summary.append({
        "File Name": file,
        "Rows": file_df.shape[0],
        "Columns": file_df.shape[1],
        "Missing Values": int(file_df.isnull().sum().sum()),
        "Duplicate Rows": int(file_df.duplicated().sum()),
        "Unique Values per Column": file_df.nunique().to_dict(),
    })

# Convert metadata summary to DataFrame for better readability
metadata_df = pd.DataFrame(metadata_summary)
metadata_df
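
Before combining files, it can help to check whether they share the same columns. The sketch below reads only each file's header row with `nrows=0`, so it stays cheap even for large files. The helper `columns_by_file` and the demo file names and columns are made up for illustration; on Kaggle the function would be pointed at `dataset_path` instead of a temporary folder.

```python
import os
import tempfile

import pandas as pd


def columns_by_file(folder: str) -> dict:
    """Map each CSV file name in `folder` to its list of column names."""
    schemas = {}
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".csv"):
            continue
        # nrows=0 parses only the header row, not the data
        header = pd.read_csv(os.path.join(folder, name), nrows=0)
        schemas[name] = list(header.columns)
    return schemas


# Demo on throwaway files (file names and columns are hypothetical)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"articleID": ["a1"], "headline": ["X"]}).to_csv(
        os.path.join(tmp, "ArticlesDemo.csv"), index=False)
    pd.DataFrame({"commentID": [1], "articleID": ["a1"]}).to_csv(
        os.path.join(tmp, "CommentsDemo.csv"), index=False)
    schemas = columns_by_file(tmp)

# Columns present in every file are natural candidates for joining
shared = set.intersection(*(set(cols) for cols in schemas.values()))
print(schemas)
print("Columns shared by every file:", shared)
```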

## **🔍 Conclusion**
This notebook summarizes the dataset's structure, missing values, and per-file metadata. These checks are groundwork for further text processing and LSTM-based text generation.