Welcome to the first notebook of the Un4CHANate project.

In this notebook you will chunk and clean a dataset from the 4plebs data dump on the Internet Archive (https://archive.org/details/4plebs-org-data-dump-2024-01). Chunking divides the files into smaller pieces that can be opened and processed on a standard computer without high-performance hardware.

If you want to try our second notebook (for data analysis), a demo CSV file is included in this GitHub repository.

This notebook reduces the CSV file from 4plebs to a smaller CSV file that contains only the timestamp and comment of the posts within the timeframe of interest for your own analysis.

**STEP I:**

Download your chosen dataset at https://archive.org/details/4plebs-org-data-dump-2024-01

*Categories:*
- /b/: Random (the infamous anything-goes board).
- /v/: Video games.
- /pol/: Politically incorrect.
- /a/: Anime & manga.

**STEP II:** 

Write down the start and end date of the timeframe of interest for your analysis (YYYY-MM-DD). The code cell below returns the corresponding UNIX values.

In [None]:
from datetime import datetime

# Enter your timeframe of interest in Year-Month-Day format. This cell returns the
# Unix values needed to find that timeframe in the larger CSV file.

def convert_to_unix(date_string, date_format="%Y-%m-%d"):
    """Convert a date string to a Unix timestamp in whole seconds."""
    try:
        # Parse the date string into a datetime object
        date_obj = datetime.strptime(date_string, date_format)
    except ValueError as e:
        # Stop with a clear message instead of returning an error string,
        # which would fail later when it is passed to int()
        raise ValueError(f"Could not parse '{date_string}' as YYYY-MM-DD: {e}") from e
    # A datetime without time zone info is interpreted in this computer's local time zone
    return int(date_obj.timestamp())

# Prompt the user for start and end dates
start_date = input("Enter the start date (format: YYYY-MM-DD): ")
end_date = input("Enter the end date (format: YYYY-MM-DD): ")

# Convert the provided dates to Unix timestamps
timestamp_start = int(convert_to_unix(start_date))
timestamp_end = int(convert_to_unix(end_date))

# Print the results
print(f"Date of start = {start_date}")
print(f"timestamp_start = {timestamp_start}")
print(f"Date of end = {end_date}")
print(f"timestamp_end = {timestamp_end}")
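
The cell above reads each date in your computer's local time zone, so the same date can give slightly different UNIX values on different machines. As a minimal sketch, the hypothetical helper `to_unix_utc` below reads the date as UTC instead. Which time zone the 4plebs timestamps use is not covered in this notebook, so treat UTC as an assumption. The optional `end_of_day` flag (also an assumption, not part of the original cell) moves the value to 23:59:59 so that posts made during the end date are included.

```python
from datetime import datetime, timezone

def to_unix_utc(date_string, end_of_day=False):
    """Convert 'YYYY-MM-DD' to a Unix timestamp, reading the date as UTC."""
    date_obj = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    unix_timestamp = int(date_obj.timestamp())
    if end_of_day:
        # Move from 00:00:00 to 23:59:59 of the same day
        unix_timestamp += 24 * 60 * 60 - 1
    return unix_timestamp

print(to_unix_utc("2020-01-01"))                   # 1577836800
print(to_unix_utc("2020-01-01", end_of_day=True))  # 1577923199
```

If you use this variant, assign its results to `timestamp_start` and `timestamp_end` so that the later cells pick them up.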



**STEP III:** 

Enter the path to the CSV file that contains the dataset downloaded from 4plebs. Create a folder and enter its path; this is where the file chunks will be saved.

Please note that if you change the columns of interest, the notebooks will not work correctly.

This process takes some time. You can wait until all chunks have been processed.
Alternatively, you can open the chunks that are already processed and check whether the numbers in the 'time' column fall within your timeframe. The numbers that belong to your timeframe are the UNIX values from STEP II.

**Note:** You can change 'chunk_size'. The smaller the number, the less time your computer needs to create each chunk, but you will end up with more chunks.
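
To make the effect of 'chunk_size' concrete, here is a minimal sketch that reads a made-up 10-row CSV held in memory with two different chunk sizes. The data and the helper name `chunk_sizes` are hypothetical and only serve the illustration; the real cell below reads your downloaded file instead.

```python
import io
import pandas as pd

# Hypothetical 10-row CSV held in memory (stand-in for the large 4plebs file)
csv_text = "time,comment\n" + "\n".join(
    f"{1700000000 + n},post {n}" for n in range(10)
)

def chunk_sizes(text, chunk_size):
    """Return the number of rows in each chunk pandas yields for this chunk size."""
    reader = pd.read_csv(io.StringIO(text), chunksize=chunk_size)
    return [len(chunk) for chunk in reader]

print(chunk_sizes(csv_text, 4))  # [4, 4, 2]
print(chunk_sizes(csv_text, 2))  # [2, 2, 2, 2, 2]
```

A smaller chunk size gives more, smaller chunks; only one chunk is held in memory at a time.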

In [None]:
import pandas as pd
import os
import re

# File paths
input_file = r'C:\Users'  # Path to the CSV file.
output_folder = r'C:\Users' # Path to the folder in which you want the chunked files

chunk_size = 10000000 # Number of rows per chunk
columns_to_extract = [4, 22]  # Columns containing the timestamp and comment string
pattern = r">>"  # Regex pattern to match strings containing '>>'

# Create the output folder if it does not exist yet
os.makedirs(output_folder, exist_ok=True)

# Process the file in chunks
chunks = pd.read_csv(
    input_file,
    engine='python',
    chunksize=chunk_size,
    on_bad_lines='skip',
    delimiter=',',
    quoting=3  # 3 = csv.QUOTE_NONE: quote characters are treated as ordinary text
)

for i, chunk in enumerate(chunks):
    try:
        # Extract specified columns and drop missing values
        selected_columns = chunk.iloc[:, columns_to_extract].copy()
        selected_columns.columns = ['time', 'comment']  # Rename the columns for clarity
        selected_columns = selected_columns.dropna()

        # Filter out rows where 'comment' contains '>>'; copy so the column edits below do not act on a slice
        cleaned_data = selected_columns[~selected_columns['comment'].str.contains(pattern, na=False)].copy()

        # Strip whitespace or quotes from the 'time' column and convert to numbers
        cleaned_data['time'] = cleaned_data['time'].astype(str).str.strip(' "')
        cleaned_data['time'] = pd.to_numeric(cleaned_data['time'], errors='coerce')

        # Drop rows where conversion resulted in NaN, then store 'time' as integers
        cleaned_data = cleaned_data.dropna(subset=['time']).astype({'time': 'int64'})

        # Define the output file path for this chunk
        output_file = os.path.join(output_folder, f"chunk_{i + 1}.csv")

        # Save the cleaned chunk to a CSV file
        cleaned_data.to_csv(output_file, index=False, header=True)
        print(f"Cleaned chunk {i + 1} saved to {output_file}.")

    except Exception as e:
        print(f"Error processing chunk {i + 1}: {e}")

print("Finished processing all chunks.")
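
STEP III suggests checking whether a processed chunk falls within your timeframe. Rather than opening the file by hand, a small helper can report the earliest and latest value in the 'time' column and whether that range overlaps the timeframe. This is only a sketch: the function name `chunk_time_range` and the example values are hypothetical.

```python
import pandas as pd

def chunk_time_range(chunk_df, start, end):
    """Return (earliest, latest, overlaps) for the 'time' column of one chunk."""
    earliest = int(chunk_df['time'].min())
    latest = int(chunk_df['time'].max())
    # Two ranges overlap when each one starts before the other one ends
    overlaps = earliest <= end and latest >= start
    return earliest, latest, overlaps

# Made-up chunk for illustration; for a real chunk, load it with pd.read_csv first
example_chunk = pd.DataFrame({
    'time': [1577800000, 1577900000, 1578000000],
    'comment': ['a', 'b', 'c'],
})
print(chunk_time_range(example_chunk, 1577836800, 1577923199))
```

For a real chunk, pass the DataFrame read from one of the saved chunk files together with `timestamp_start` and `timestamp_end` from STEP II.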


**STEP IV:**

Set 'filtered_folder' to the folder where you want to save the filtered chunks.

Set 'final_output_file' to the path and name for the final CSV.

The code cell below filters the CSV chunks by your timeframe and merges them into one CSV file that contains only the rows within your timeframe.
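
The idea behind this step can be sketched on two tiny in-memory chunks before running it on the real files. The helper `filter_and_merge` and the numbers below are hypothetical; the actual cell works on the CSV files in your folders.

```python
import pandas as pd

def filter_and_merge(chunk_frames, start, end):
    """Keep rows with start <= time <= end from each chunk and join them into one table."""
    kept = []
    for frame in chunk_frames:
        in_range = frame[(frame['time'] >= start) & (frame['time'] <= end)]
        if not in_range.empty:
            kept.append(in_range)
    if not kept:
        # Nothing matched: return an empty table with the expected columns
        return pd.DataFrame(columns=['time', 'comment'])
    return pd.concat(kept, ignore_index=True)

# Made-up chunks for illustration
chunk_a = pd.DataFrame({'time': [100, 200], 'comment': ['early', 'inside']})
chunk_b = pd.DataFrame({'time': [300, 400], 'comment': ['inside too', 'late']})

merged = filter_and_merge([chunk_a, chunk_b], 150, 350)
print(merged)
```

Both ends of the timeframe are inclusive here, which matches the comparison used in the cell below.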




In [None]:
import os
import pandas as pd

def filter_chunks_by_timestamp(input_folder, output_folder, timestamp_start, timestamp_end):

    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    for file in os.listdir(input_folder):
        if file.endswith('.csv'):
            file_path = os.path.join(input_folder, file)
            try:
                # Read the CSV file
                df = pd.read_csv(file_path)

                # Clean the first column (the timestamp) and convert it to numbers.
                # Assigning by column name replaces the column, so the new dtype is kept.
                time_col = df.columns[0]
                df[time_col] = pd.to_numeric(
                    df[time_col].astype(str).str.strip(' "'), errors='coerce'
                )

                # Drop rows where conversion resulted in NaN, then store the column as integers
                df = df.dropna(subset=[time_col]).astype({time_col: 'int64'})

                # Filter rows based on the timestamp range (both ends inclusive)
                filtered_df = df[(df[time_col] >= timestamp_start) & (df[time_col] <= timestamp_end)]

                # If there are matching rows, save the filtered chunk to the output folder
                if not filtered_df.empty:
                    output_path = os.path.join(output_folder, file)
                    filtered_df.to_csv(output_path, index=False)
                    print(f"Filtered chunk saved: {output_path}")

            except Exception as e:
                print(f"Error processing {file}: {e}")

def merge_filtered_chunks(filtered_folder, output_file):
  
    dataframes = []

    for file in os.listdir(filtered_folder):
        if file.endswith('.csv'):
            file_path = os.path.join(filtered_folder, file)
            try:
                df = pd.read_csv(file_path)
                dataframes.append(df)
            except Exception as e:
                print(f"Error reading {file}: {e}")

    if dataframes:
        merged_df = pd.concat(dataframes, ignore_index=True)
        merged_df.to_csv(output_file, index=False)
        print(f"Merged CSV saved to {output_file}")
    else:
        print("No CSV files found to merge.")

# Fill in the paths below
input_folder = output_folder
filtered_folder = r"C:\Users"  # Replace with the folder where you want to save the filtered chunks
final_output_file = r"C:\Users.csv"  # Replace with the path and name for the final CSV

filter_chunks_by_timestamp(input_folder, filtered_folder, timestamp_start, timestamp_end)
merge_filtered_chunks(filtered_folder, final_output_file)
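
As a quick sanity check on the result, the 'time' values can be converted back to readable dates to confirm that they span your timeframe. The sketch below uses a made-up two-row table and a hypothetical helper `summarize_time_range`, and it displays the dates in UTC, which is an assumption.

```python
import pandas as pd

def summarize_time_range(df):
    """Return the row count and the earliest and latest post time as readable dates."""
    times = pd.to_datetime(df['time'], unit='s', utc=True)
    return len(df), times.min(), times.max()

# Made-up data for illustration; for the real file, load the final CSV with pd.read_csv first
example = pd.DataFrame({
    'time': [1577836800, 1577923199],
    'comment': ['first', 'last'],
})
row_count, first_post, last_post = summarize_time_range(example)
print(row_count, first_post, last_post)
```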


**DATA_LOADER_4CHAN** Completed! Now proceed to the second notebook to analyze the dataset!