In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
file_path = 'css_chatgpt2023_research.csv'

In [3]:
df = pd.read_csv(file_path)

## General metrics:

In [4]:
print("DataFrame columns:")
print(df.columns.tolist())

DataFrame columns:
['subreddit', 'author', 'score', 'created', 'body', 'link']


In [5]:
total_size_mb = df.memory_usage(deep=True).sum() / (1024**2)
print(f"Total data size: {total_size_mb:.2f} MB")

Total data size: 1776.44 MB


In [6]:
num_rows = len(df)
print(f"Number of rows (records): {num_rows:,}")

Number of rows (records): 2,678,522


In [7]:
df.info(verbose=True, memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2678522 entries, 0 to 2678521
Data columns (total 6 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   subreddit  object
 1   author     object
 2   score      int64 
 3   created    object
 4   body       object
 5   link       object
dtypes: int64(1), object(5)
memory usage: 1.7 GB


In [8]:
print("Numerical statistics:")
print(df.describe(include='number'))

Numerical statistics:
              score
count  2.678522e+06
mean   9.170316e+00
std    1.080675e+02
min   -7.660000e+02
25%    1.000000e+00
50%    1.000000e+00
75%    3.000000e+00
max    2.210300e+04


In [9]:
print("Categorical statistics:")
print(df.describe(include='object'))

Categorical statistics:
         subreddit       author           created       body  \
count      2678522      2678522           2678522    2678478   
unique          10       460491            176850    2333352   
top     technology  u/[deleted]  2023-04-20 19:12  [removed]   
freq        883513       285177                65     123135   

                                                     link  
count                                             2678522  
unique                                            2678522  
top     https://www.reddit.com/r/singularity/comments/...  
freq                                                    1  


## Cleaning the data:

During our initial review, we found that 123,135 comment bodies were the placeholder [removed] (deleted by the user or removed by moderators), and that 285,177 comments came from deleted accounts (u/[deleted]). Because of this, we had to clean the data before doing any sentiment analysis.
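Before dropping anything, it can help to count how many rows each kind of noise affects. The following is a minimal sketch on a toy DataFrame, not the real dataset; the names `sample`, `placeholders`, and `noise_summary` are hypothetical and the rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real DataFrame; the values are illustrative only.
sample = pd.DataFrame({
    "author": ["u/alice", "u/[deleted]", "u/bob", "u/[deleted]", "u/carol"],
    "body": ["Great point", "[removed]", "[deleted]", np.nan, "I disagree"],
})

placeholders = ["[deleted]", "[removed]", ""]

# Rows whose text is a placeholder or missing altogether.
body_is_noise = sample["body"].isin(placeholders) | sample["body"].isna()

# Rows whose author account no longer exists (note the "u/" prefix).
author_is_deleted = sample["author"].eq("u/[deleted]")

noise_summary = {
    "noisy_body": int(body_is_noise.sum()),
    "deleted_author": int(author_is_deleted.sum()),
    "either": int((body_is_noise | author_is_deleted).sum()),
}
print(noise_summary)
```

Counting the two conditions separately, and then their union, shows how much they overlap before any rows are removed.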

In [10]:
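# Placeholder or missing comment bodies that carry no usable text.
noise_values = ['[deleted]', '[removed]', '', None, np.nan]
df.drop(df[df['body'].isin(noise_values)].index, inplace=True)

# Drop rows whose author field is a bare deleted-account marker.
# Note: describe() above lists authors with a 'u/' prefix (e.g. 'u/[deleted]'),
# which this exact match does not cover.
df.drop(df[df['author'].isin(['[deleted]', 'None'])].index, inplace=True)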


We filtered out all rows where the comment text was missing or was a placeholder such as [deleted] or [removed]. This reduced the dataset from 2,678,522 to 2,440,487 rows (about 8.9% removed) and keeps only real, readable comments. As a result, our analysis of comment frequency and sentiment will better reflect actual user opinions, not moderation noise.
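The same kind of filter can also be written with a boolean mask that returns a new DataFrame instead of modifying it in place. This is only a sketch: the helper `drop_noise`, the two constants, and the toy rows are hypothetical. It also makes two assumptions that go beyond the cell above: it strips whitespace before matching, and it treats both `[deleted]` and `u/[deleted]` as deleted-author markers.

```python
import pandas as pd

BODY_PLACEHOLDERS = {"[deleted]", "[removed]", ""}
# Assumption: deleted accounts may appear with or without the "u/" prefix.
DELETED_AUTHORS = {"[deleted]", "u/[deleted]"}


def drop_noise(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without placeholder bodies or deleted authors."""
    # Missing bodies become "", which is itself listed as a placeholder.
    clean_text = frame["body"].fillna("").astype(str).str.strip()
    keep = ~clean_text.isin(BODY_PLACEHOLDERS) & ~frame["author"].isin(DELETED_AUTHORS)
    return frame.loc[keep].reset_index(drop=True)


# Toy rows for illustration only.
sample = pd.DataFrame({
    "author": ["u/alice", "u/[deleted]", "u/bob", "u/dave", "u/carol", "[deleted]"],
    "body": ["Great point", "Still readable", "[removed]", None, "I disagree", "  "],
})

cleaned = drop_noise(sample)
print(cleaned)
```

Returning a new frame leaves the raw data available for comparison, at the cost of holding both copies in memory.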


In [11]:
try:
    df['score'] = pd.to_numeric(df['score'], errors='coerce').fillna(0).astype(int)
except Exception as e:
    print(f"Score conversion error: {e}. Left as 'object'")

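Next, we cleaned the score column, which shows how popular each comment is. Any value that could not be parsed as a number was treated as missing and replaced with zero, so every comment has a usable score. Finally, we converted the whole column to integers. Since df.info() already reports score as int64, this step acts mainly as a safeguard.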

This helps us make sure the popularity scores are consistent and reliable, which is important for calculating average sentiment and other key statistics.
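To illustrate what the coercion does, here is a small sketch on a made-up Series (the names `raw_scores`, `as_numbers`, `n_coerced`, and `clean_scores` are hypothetical). It counts the unparseable entries before filling them, because zero is also a legitimate score and the replaced values cannot be told apart afterwards.

```python
import pandas as pd

# Made-up raw values: two of them cannot be parsed as numbers.
raw_scores = pd.Series(["12", "-3", "n/a", None, "7"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
as_numbers = pd.to_numeric(raw_scores, errors="coerce")

# Record how many entries were coerced before they are overwritten.
n_coerced = int(as_numbers.isna().sum())

# Replace the missing values with 0 and store the column as integers.
clean_scores = as_numbers.fillna(0).astype(int)

print(f"Coerced entries: {n_coerced}")
print(clean_scores.tolist())
```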

## Final result:

In [12]:
print("Numerical statistics:")
print(df.describe(include='number'))

Numerical statistics:
              score
count  2.440487e+06
mean   9.726924e+00
std    1.118540e+02
min   -5.450000e+02
25%    1.000000e+00
50%    1.000000e+00
75%    3.000000e+00
max    2.210300e+04


In [13]:
print("Categorical statistics:")
print(df.describe(include='object'))

Categorical statistics:
         subreddit           author           created  \
count      2440487          2440487           2440487   
unique          10           460484            176832   
top     technology  u/AutoModerator  2023-04-20 19:12   
freq        789831            99794                58   

                                                     body  \
count                                             2440487   
unique                                            2333350   
top     **Attention! [Serious] Tag Notice**\n\n : Joke...   
freq                                                 8092   

                                                     link  
count                                             2440487  
unique                                            2440487  
top     https://www.reddit.com/r/singularity/comments/...  
freq                                                    1  


In [14]:
print("Simple frequency overview per category:")
print("\nSubreddit Counts:")
print(df['subreddit'].value_counts())

Simple frequency overview per category:

Subreddit Counts:
subreddit
technology               789831
ChatGPT                  747607
Futurology               317730
singularity              311371
OpenAI                   110510
privacy                   54066
MachineLearning           52557
ArtificialInteligence     47408
Gemini                     5924
Bard                       3483
Name: count, dtype: int64


In [15]:
df['date'] = pd.to_datetime(df['created'])
df['month'] = df['date'].dt.to_period('M')

print("Monthly counts:")
print(df['month'].value_counts().sort_index())

Monthly counts:
month
2023-03    600616
2023-04    659669
2023-05    642643
2023-06      4016
2023-12    532340
2024-01      1203
Freq: M, Name: count, dtype: int64


When we looked at the final data, we saw that most comments came from our planned months (March, April, May, and December 2023). However, a very small number of comments were also counted in June 2023 and January 2024.

This happens because Reddit records all timestamps in UTC (Coordinated Universal Time). When we converted these UTC timestamps to local dates for our analysis, comments posted in the last few hours of one month in UTC landed in the first few hours of the next month in local time. The number of such comments is very small: 4,016 in June 2023 and 1,203 in January 2024, about 0.2% of all rows, compared with hundreds of thousands in each target month. These two months therefore do not affect our main analysis of the two major sentiment phases.
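The month-boundary effect can be reproduced on a few made-up timestamps. This is a sketch under assumptions: the local time zone `Europe/Berlin` is chosen purely for illustration (the zone actually used is not stated here), and the names `utc_times`, `local_times`, and `target_months` are hypothetical.

```python
import pandas as pd

# Made-up UTC timestamps close to a month boundary.
utc_times = pd.Series(pd.to_datetime([
    "2023-05-31 21:30",
    "2023-05-31 22:30",
    "2023-12-31 23:30",
])).dt.tz_localize("UTC")

# Assumed local zone, for illustration only.
local_times = utc_times.dt.tz_convert("Europe/Berlin")


def month_labels(times: pd.Series) -> list:
    """Drop the time zone (keeping wall-clock time) and label each row by month."""
    return times.dt.tz_localize(None).dt.to_period("M").astype(str).tolist()


utc_months = month_labels(utc_times)
local_months = month_labels(local_times)

# Keep only rows whose local month is one of the planned months.
target_months = ["2023-03", "2023-04", "2023-05", "2023-12"]
in_target = pd.Series(local_months).isin(target_months).tolist()

print(utc_months)
print(local_months)
print(in_target)
```

A mask like `in_target` is one way to exclude the spill-over rows if they should not appear in monthly summaries.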

In [16]:
num_rows = len(df)
print(f"Number of rows after cleaning: {num_rows:,}")

Number of rows after cleaning: 2,440,487


In [17]:
print("DataFrame head:")
print(df.head())

DataFrame head:
     subreddit                 author  score           created  \
0  singularity        u/rodeoclownboy     -2  2023-03-01 02:00   
1   technology  u/Siliceously_Sintery      0  2023-03-01 02:00   
2   Futurology          u/WarLordM123     -3  2023-03-01 02:00   
3   technology   u/Sven_Grammerstorf_      1  2023-03-01 02:00   
4   technology             u/Nkognito      1  2023-03-01 02:00   

                                                body  \
0  if your job is at serious risk of being replac...   
1  Anything out of China can be controlled direct...   
2  It's the reason we have the wonders of the mod...   
3              Most large companies are self insured   
4  You missed the beard scratch violation in the ...   

                                                link                date  \
0  https://www.reddit.com/r/singularity/comments/... 2023-03-01 02:00:00   
1  https://www.reddit.com/r/technology/comments/1... 2023-03-01 02:00:00   
2  https://www.reddit.

In [19]:
df.head()

| | subreddit | author | score | created | body | link | date | month |
|---|---|---|---|---|---|---|---|---|
| 0 | singularity | u/rodeoclownboy | -2 | 2023-03-01 02:00 | if your job is at serious risk of being replac... | https://www.reddit.com/r/singularity/comments/... | 2023-03-01 02:00:00 | 2023-03 |
| 1 | technology | u/Siliceously_Sintery | 0 | 2023-03-01 02:00 | Anything out of China can be controlled direct... | https://www.reddit.com/r/technology/comments/1... | 2023-03-01 02:00:00 | 2023-03 |
| 2 | Futurology | u/WarLordM123 | -3 | 2023-03-01 02:00 | It's the reason we have the wonders of the mod... | https://www.reddit.com/r/Futurology/comments/1... | 2023-03-01 02:00:00 | 2023-03 |
| 3 | technology | u/Sven_Grammerstorf_ | 1 | 2023-03-01 02:00 | Most large companies are self insured | https://www.reddit.com/r/technology/comments/1... | 2023-03-01 02:00:00 | 2023-03 |
| 4 | technology | u/Nkognito | 1 | 2023-03-01 02:00 | You missed the beard scratch violation in the ... | https://www.reddit.com/r/technology/comments/1... | 2023-03-01 02:00:00 | 2023-03 |
