# 📊 **New York Times Comments Dataset Analysis**
This notebook analyzes the New York Times Comments dataset available on Kaggle.
We will extract metadata, check for missing values, and summarize the structure of the dataset before proceeding with text analysis.

## **📌 Step 1: Setup the Environment**
We start by importing the necessary libraries and listing all available files in the dataset.

In [1]:
import os
import pandas as pd

# Set path to dataset (Kaggle users should adjust as needed)
dataset_path = "/kaggle/input/nyt-comments"

# List all files in the dataset
files = os.listdir(dataset_path)
print("Files in dataset:\n", files)

Files in dataset:
 ['CommentsFeb2018.csv', 'ArticlesFeb2017.csv', 'CommentsApril2018.csv', 'ArticlesJan2017.csv', 'ArticlesMay2017.csv', 'CommentsJan2017.csv', 'CommentsMarch2017.csv', 'CommentsMay2017.csv', 'CommentsMarch2018.csv', 'CommentsApril2017.csv', 'ArticlesMarch2017.csv', 'ArticlesApril2017.csv', 'CommentsFeb2017.csv', 'ArticlesJan2018.csv', 'ArticlesFeb2018.csv', 'ArticlesMarch2018.csv', 'CommentsJan2018.csv', 'ArticlesApril2018.csv']
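
The listing suggests a `<Kind><Month><Year>.csv` naming scheme, with one articles file and one comments file per month. As a minimal sketch under that assumption, the file names can be paired by month. The helper `group_files` and the regular expression are illustrative and not part of the dataset.

```python
import re
from collections import defaultdict

# Naming scheme inferred from the listing above: <Kind><Month><Year>.csv
FILE_PATTERN = re.compile(r"^(Articles|Comments)([A-Za-z]+)(\d{4})\.csv$")


def group_files(file_names):
    """Map (month, year) to a dict holding the Articles and Comments file names."""
    groups = defaultdict(dict)
    for name in file_names:
        match = FILE_PATTERN.match(name)
        if match is None:
            continue  # skip anything that does not follow the naming scheme
        kind, month, year = match.groups()
        groups[(month, int(year))][kind] = name
    return dict(groups)


# Small in-memory example; in the notebook you would pass `files` instead.
example = ["CommentsFeb2018.csv", "ArticlesFeb2018.csv", "ArticlesJan2017.csv"]
pairs = group_files(example)
print(pairs)
```

Pairing the files up front makes it easy to spot a month where one of the two files is missing.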


## **📌 Step 2: Load & Inspect Data**
Let's load two sample files, `ArticlesJan2017.csv` and `CommentsApril2017.csv`, to inspect their structure.

In [2]:
# Load an example file to inspect its structure
sample_file = "ArticlesJan2017.csv"  # You can change this to any file in the dataset
df = pd.read_csv(os.path.join(dataset_path, sample_file))

# Display first few rows
df.head()

Unnamed: 0,articleID,abstract,byline,documentType,headline,keywords,multimedia,newDesk,printPage,pubDate,sectionName,snippet,source,typeOfMaterial,webURL,articleWordCount
0,58691a5795d0e039260788b9,,By JENNIFER STEINHAUER,article,G.O.P. Leadership Poised to Topple Obama’s Pi...,"['United States Politics and Government', 'Law...",1,National,1,2017-01-01 15:03:38,Politics,The most powerful and ambitious Republican-led...,The New York Times,News,https://www.nytimes.com/2017/01/01/us/politics...,1324
1,586967bf95d0e03926078915,,By MARK LANDLER,article,Fractured World Tested the Hope of a Young Pre...,"['Obama, Barack', 'Afghanistan', 'United State...",1,Foreign,1,2017-01-01 20:34:00,Asia Pacific,A strategy that went from a “good war” to the ...,The New York Times,News,https://www.nytimes.com/2017/01/01/world/asia/...,2836
2,58698a1095d0e0392607894a,,By CAITLIN LOVINGER,article,Little Troublemakers,"['Crossword Puzzles', 'Boxing Day', 'Holidays ...",1,Games,0,2017-01-01 23:00:24,Unknown,Chuck Deodene puts us in a bubbly mood.,The New York Times,News,https://www.nytimes.com/2017/01/01/crosswords/...,445
3,5869911a95d0e0392607894e,,By JOCHEN BITTNER,article,"Angela Merkel, Russia’s Next Target","['Cyberwarfare and Defense', 'Presidential Ele...",1,OpEd,15,2017-01-01 23:30:27,Unknown,"With a friend entering the White House, Vladim...",The New York Times,Op-Ed,https://www.nytimes.com/2017/01/01/opinion/ang...,864
4,5869a61795d0e03926078962,,By JIAYIN SHEN,article,Boots for a Stranger on a Bus,"['Shoes and Boots', 'Buses', 'New York City']",0,Metro,12,2017-01-02 01:00:02,Unknown,Witnessing an act of generosity on a rainy day.,The New York Times,Brief,https://www.nytimes.com/2017/01/01/nyregion/me...,309
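
In the preview, the `keywords` column looks like a Python list stored as a string. The sketch below assumes that is the case and converts each value into a real list with `ast.literal_eval`. The helper `parse_keywords` and the new column `keywords_list` are hypothetical names introduced for illustration.

```python
import ast

import pandas as pd


def parse_keywords(value):
    """Turn a stringified list such as "['a', 'b']" into a Python list."""
    if isinstance(value, list):
        return value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []  # unparsable or missing values fall back to an empty list
    return parsed if isinstance(parsed, list) else []


# Tiny stand-in for the articles frame, using a value from the preview above.
demo = pd.DataFrame(
    {"keywords": ["['Shoes and Boots', 'Buses', 'New York City']", "[]"]}
)
demo["keywords_list"] = demo["keywords"].apply(parse_keywords)
print(demo["keywords_list"].tolist())
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from a file.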


In [9]:
# Load an example file to inspect its structure
sample_file_2 = "CommentsApril2017.csv"  # You can change this to any file in the dataset
df_2 = pd.read_csv(os.path.join(dataset_path, sample_file_2))

# Display first few rows
df_2.head()


Unnamed: 0,approveDate,commentBody,commentID,commentSequence,commentTitle,commentType,createDate,depth,editorsSelection,parentID,...,userLocation,userTitle,userURL,inReplyTo,articleID,sectionName,newDesk,articleWordCount,printPage,typeOfMaterial
0,1491245186,This project makes me happy to be a 30+ year T...,22022598.0,22022598,<br/>,comment,1491237000.0,1,False,0.0,...,"Riverside, CA",,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
1,1491188619,Stunning photos and reportage. Infuriating tha...,22017350.0,22017350,,comment,1491180000.0,1,False,0.0,...,<br/>,,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
2,1491188617,Brilliant work from conception to execution. I...,22017334.0,22017334,<br/>,comment,1491179000.0,1,False,0.0,...,Raleigh NC,,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
3,1491167820,NYT reporters should provide a contributor's l...,22015913.0,22015913,<br/>,comment,1491150000.0,1,False,0.0,...,"Missouri, USA",,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
4,1491167815,Could only have been done in print. Stunning.,22015466.0,22015466,<br/>,comment,1491147000.0,1,False,0.0,...,"Tucson, Arizona",,,0,58def1347c459f24986d7c80,Unknown,Insider,716.0,2,News
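
The `approveDate` and `createDate` values look like Unix timestamps in seconds. Assuming that reading is correct, a short sketch can add readable datetime columns next to the raw ones. The helper `add_datetime_columns` and the `_dt` suffix are illustrative choices.

```python
import pandas as pd


def add_datetime_columns(frame, columns=("approveDate", "createDate")):
    """Return a copy with a UTC datetime column added for each timestamp column."""
    out = frame.copy()
    for col in columns:
        if col in out.columns:
            # Assumption: the values are seconds since the Unix epoch.
            out[col + "_dt"] = pd.to_datetime(out[col], unit="s", utc=True)
    return out


# One row taken from the preview above.
demo = pd.DataFrame({"approveDate": [1491245186], "createDate": [1491237000.0]})
converted = add_datetime_columns(demo)
print(converted[["approveDate_dt", "createDate_dt"]])
```

The first value converts to a date in early April 2017, which is consistent with the file name `CommentsApril2017.csv`.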


## **📌 Step 3: Extract Metadata**
Now, we extract key metadata, such as column names, data types, and missing values.

In [4]:
# Display dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   articleID         850 non-null    object
 1   abstract          40 non-null     object
 2   byline            850 non-null    object
 3   documentType      850 non-null    object
 4   headline          850 non-null    object
 5   keywords          850 non-null    object
 6   multimedia        850 non-null    int64 
 7   newDesk           850 non-null    object
 8   printPage         850 non-null    int64 
 9   pubDate           850 non-null    object
 10  sectionName       850 non-null    object
 11  snippet           850 non-null    object
 12  source            850 non-null    object
 13  typeOfMaterial    850 non-null    object
 14  webURL            850 non-null    object
 15  articleWordCount  850 non-null    int64 
dtypes: int64(3), object(13)
memory usage: 106.4+ KB


In [10]:
# Display dataset information
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243832 entries, 0 to 243831
Data columns (total 34 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   approveDate            243832 non-null  int64  
 1   commentBody            243832 non-null  object 
 2   commentID              243832 non-null  float64
 3   commentSequence        243832 non-null  int64  
 4   commentTitle           228498 non-null  object 
 5   commentType            243832 non-null  object 
 6   createDate             243832 non-null  float64
 7   depth                  243832 non-null  int64  
 8   editorsSelection       243832 non-null  bool   
 9   parentID               243832 non-null  float64
 10  parentUserDisplayName  70526 non-null   object 
 11  permID                 243832 non-null  object 
 12  picURL                 243832 non-null  object 
 13  recommendations        243832 non-null  float64
 14  recommendedFlag        0 non-null   

## **📌 Step 4: Check for Missing Values**
We count the missing values in each column and display only the columns that contain at least one.

In [5]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]

abstract    810
dtype: int64

In [11]:
# Check for missing values
missing_values_2 = df_2.isnull().sum()
missing_values_2[missing_values_2 > 0]

commentTitle              15334
parentUserDisplayName    173306
recommendedFlag          243832
reportAbuseFlag          243832
userDisplayName              77
userLocation                 62
userTitle                243791
userURL                  243827
dtype: int64
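
Raw counts are hard to compare between an 850-row file and a 243,832-row file. As a sketch, the same check can report the share of missing values per column and flag columns that are entirely empty, as `recommendedFlag` and `reportAbuseFlag` are in the output above. The helper `missing_report` is a hypothetical name.

```python
import pandas as pd


def missing_report(frame):
    """Return missing counts and shares for columns with at least one missing value."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({"missing": counts, "share": counts / len(frame)})
    report = report[report["missing"] > 0]
    return report.sort_values("share", ascending=False)


# Tiny stand-in for the comments frame.
demo = pd.DataFrame(
    {
        "commentBody": ["a", "b", "c", "d"],
        "userTitle": [None, None, None, "x"],
        "recommendedFlag": [None, None, None, None],
    }
)
report = missing_report(demo)

# Columns with no values at all are candidates for dropping.
empty_columns = report.index[report["share"] == 1.0].tolist()
print(report)
print("Entirely empty:", empty_columns)
```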

## **📌 Step 5: Summary Statistics**
Generate a summary of numeric and categorical columns.

In [6]:
# Display summary statistics
df.describe(include="all").transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
articleID,850.0,850.0,58691a5795d0e039260788b9,1.0,,,,,,,
abstract,40.0,40.0,"After losing three limbs in Afghanistan, a Mar...",1.0,,,,,,,
byline,850.0,434.0,By THE EDITORIAL BOARD,32.0,,,,,,,
documentType,850.0,2.0,article,810.0,,,,,,,
headline,850.0,774.0,Unknown,73.0,,,,,,,
keywords,850.0,717.0,[],73.0,,,,,,,
multimedia,850.0,,,,0.927059,0.260193,0.0,1.0,1.0,1.0,1.0
newDesk,850.0,28.0,OpEd,175.0,,,,,,,
printPage,850.0,,,,7.077647,10.100022,0.0,0.0,1.0,12.0,66.0
pubDate,850.0,786.0,2017-02-02 08:21:23,4.0,,,,,,,


In [12]:
# Display summary statistics
df_2.describe(include="all").transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
approveDate,243832.0,,,,1492504461.72283,1330667.064463,1491008047.0,1491783853.25,1492437034.5,1493119511.0,1524346252.0
commentBody,243832.0,243169.0,Well said.,21.0,,,,,,,
commentID,243832.0,,,,22188608.516839,185515.004384,21999548.0,22092910.75,22176681.5,22260467.5,26824246.0
commentSequence,243832.0,,,,22188608.516839,185515.004384,21999548.0,22092910.75,22176681.5,22260467.5,26824246.0
commentTitle,228498.0,1.0,<br/>,228498.0,,,,,,,
commentType,243832.0,3.0,comment,173277.0,,,,,,,
createDate,243832.0,,,,1492495336.431272,1328516.276157,1491006872.0,1491768493.75,1492428038.0,1493086582.25,1524345694.0
depth,243832.0,,,,1.289425,0.453641,1.0,1.0,1.0,2.0,3.0
editorsSelection,243832.0,2.0,False,238159.0,,,,,,,
parentID,243832.0,,,,6416284.403421,10055291.992253,0.0,0.0,0.0,22051084.75,26426201.0


## **📌 Step 6: Check for Unique Identifiers**
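Find columns that can be used as unique identifiers, meaning columns whose number of distinct values equals the number of rows. In the outputs below, `articleID` and `webURL` are unique across all 850 articles, and `commentID`, `commentSequence`, and `permID` are unique across all 243,832 comments.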

In [7]:
# Check if any column can be used as a unique identifier
unique_counts = df.nunique()
unique_counts

articleID           850
abstract             40
byline              434
documentType          2
headline            774
keywords            717
multimedia            2
newDesk              28
printPage            43
pubDate             786
sectionName          30
snippet             846
source                2
typeOfMaterial       11
webURL              850
articleWordCount    689
dtype: int64

In [13]:
# Check if any column can be used as a unique identifier
unique_counts_2 = df_2.nunique()
unique_counts_2

approveDate              115718
commentBody              243169
commentID                243832
commentSequence          243832
commentTitle                  1
commentType                   3
createDate               228348
depth                         3
editorsSelection              2
parentID                  41494
parentUserDisplayName     15712
permID                   243832
picURL                     4282
recommendations            1232
recommendedFlag               0
replyCount                   79
reportAbuseFlag               0
sharing                       2
status                        1
timespeople                   2
trusted                       2
updateDate               136865
userDisplayName           46510
userID                    62946
userLocation              15890
userTitle                     9
userURL                       1
inReplyTo                 41494
articleID                   886
sectionName                  31
newDesk                      28
articleW
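
Reading identifier candidates off the `nunique()` output by eye gets tedious across many files. The sketch below automates the check: a column qualifies as a candidate key if it has no missing values and as many distinct values as rows. It also lists constant columns such as `commentTitle`, which carry no information. Both helper names are illustrative.

```python
import pandas as pd


def candidate_keys(frame):
    """Columns with no missing values and one distinct value per row."""
    n_rows = len(frame)
    return [
        col
        for col in frame.columns
        if frame[col].notna().all() and frame[col].nunique() == n_rows
    ]


def constant_columns(frame):
    """Columns with at most one distinct non-missing value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=True) <= 1]


# Tiny stand-in for the comments frame.
demo = pd.DataFrame(
    {
        "commentID": [1, 2, 3],
        "commentTitle": ["<br/>", "<br/>", None],
        "depth": [1, 1, 2],
    }
)
print("Candidate keys:", candidate_keys(demo))
print("Constant columns:", constant_columns(demo))
```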

## **📌 Step 7: Automate Metadata Extraction for All Files**
Instead of manually inspecting each file, we automate metadata extraction for all files.

In [8]:
# Iterate over all files and extract metadata.
# Note: the loop reuses the name `df`, replacing the frame loaded in Step 2.
metadata_summary = []

for file in files:
    file_path = os.path.join(dataset_path, file)
    df = pd.read_csv(file_path)

    metadata_summary.append({
        "File Name": file,
        "Rows": df.shape[0],
        "Columns": df.shape[1],
        "Missing Values": df.isnull().sum().sum(),   # total missing cells in the file
        "Duplicate Rows": df.duplicated().sum(),     # rows identical in every column
        "Unique Columns": df.nunique().to_dict(),    # distinct-value count per column
    })

# Convert metadata summary to DataFrame for better readability
metadata_df = pd.DataFrame(metadata_summary)
metadata_df


Unnamed: 0,File Name,Rows,Columns,Missing Values,Duplicate Rows,Unique Columns
0,CommentsFeb2018.csv,215282,34,1018593,0,"{'approveDate': 158054, 'articleID': 1155, 'ar..."
1,ArticlesFeb2017.csv,885,16,855,0,"{'articleID': 885, 'abstract': 30, 'byline': 4..."
2,CommentsApril2018.csv,264924,34,1240919,0,"{'approveDate': 196777, 'articleID': 1351, 'ar..."
3,ArticlesJan2017.csv,850,16,810,0,"{'articleID': 850, 'abstract': 40, 'byline': 4..."
4,ArticlesMay2017.csv,996,16,963,0,"{'abstract': 33, 'articleID': 996, 'articleWor..."
5,CommentsJan2017.csv,231449,34,1114483,0,"{'approveDate': 106710, 'articleID': 850, 'art..."
6,CommentsMarch2017.csv,260967,34,1249140,0,"{'approveDate': 115903, 'articleID': 949, 'art..."
7,CommentsMay2017.csv,276389,34,1322148,0,"{'approveDate': 160236, 'commentBody': 275493,..."
8,CommentsMarch2018.csv,246915,34,1331416,0,"{'approveDate': 187256, 'articleID': 1385, 'ar..."
9,CommentsApril2017.csv,243832,34,1164061,0,"{'approveDate': 115718, 'commentBody': 243169,..."
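
Both file types share an `articleID` column, so comments can be linked back to their articles. The sketch below assumes `articleID` is unique within an articles file, as Step 6 shows for `ArticlesJan2017.csv`, and attaches the headline to each comment. The helper `attach_headlines` is a hypothetical name, and the frames are small stand-ins.

```python
import pandas as pd


def attach_headlines(comments, articles):
    """Left-join article headlines onto comments via articleID."""
    lookup = articles[["articleID", "headline"]]
    return comments.merge(
        lookup,
        on="articleID",
        how="left",
        validate="many_to_one",  # raises if articleID repeats on the articles side
        indicator=True,          # adds a _merge column showing where each row matched
    )


comments = pd.DataFrame(
    {"commentID": [1, 2, 3, 4], "articleID": ["a", "a", "b", "z"]}
)
articles = pd.DataFrame({"articleID": ["a", "b"], "headline": ["H1", "H2"]})

merged = attach_headlines(comments, articles)
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged)
print("Comments without a matching article:", unmatched)
```

The `_merge` column makes it easy to count comments whose article is absent from the paired articles file.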


## **🔍 Conclusion**
This notebook documents the dataset's structure, missing values, and per-file metadata, which lays the groundwork for text preprocessing and LSTM-based text generation.