 \ \ \ \ \ \  **WARNING!** \ \ \ \ \ \
 THE FOLLOWING CODE CONTAINS TOXIC SPEECH AND HARMFUL CONTENT.

Welcome to the Un4CHANate Analysis notebook I! This ipynb file is designed to help you research 4chan datasets from 4plebs. If you have already followed our 'DATA_LOADER_4CHAN_v2.ipynb' to chunk the dataset and merge the data into your specific timeframe, you are now ready to analyze the posts! The analysis is carried out in two distinct notebooks: "DATA_ANALYSIS_4CHAN_I_" and "DATA_ANALYSIS_4CHAN_II_".

This notebook has two parts: first, it will help you read your data; second, it will guide you through specific topic-modelling techniques, performed with a "BERTopic" model.

To help you visualize the potential results of these techniques, example graphs are available in the GitHub repository. In addition, as mentioned in 'DATA_LOADER_4CHAN_v2', a demo dataset called 'demo_politically_incorrect_7d_jan_6_2021.csv' is included in the repository.

Phase I: Cleaning

**STEP I:**

Set file_path to the path of your dataset (a CSV file).

In [None]:
import pandas as pd
import os

file_path = r'' #Write the path to your CSV file here
all_comments = pd.DataFrame()

try:
    all_comments = pd.read_csv(file_path)
except pd.errors.ParserError:
    print(f"Error parsing file: {file_path}. Skipping.")
except Exception as e:
    print(f"An error occurred while reading {file_path}: {e}")

print(f"Total number of rows: {len(all_comments)}")
all_comments.info()  #info() prints its summary itself, so it must be called rather than printed
all_comments.describe()

**STEP II:**

In our 4chan data, we noticed an unwanted column, "Unnamed: 0", so we delete it.

Run cell below.

In [None]:
if 'Unnamed: 0' in all_comments.columns:
    all_comments = all_comments.drop(columns=['Unnamed: 0'])
    print("Column 'Unnamed: 0' successfully deleted.")
else:
    print("Column 'Unnamed: 0' not found in the DataFrame.")

all_comments.info() #let's see if it worked
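
An "Unnamed: 0" column typically appears when a DataFrame is saved together with its index and later read back without declaring that index. We cannot confirm this is how it ended up in this dataset, but the sketch below reproduces the effect on an in-memory toy table (the toy values and buffer names are illustrative assumptions) and shows that saving with index=False avoids it.

```python
import io
import pandas as pd

# Toy table standing in for the 4chan data (illustrative values only)
toy = pd.DataFrame({"comment": ["first post", "second post"],
                    "time": [1609891200, 1609894800]})

# Saving WITH the index writes an unnamed first column to the CSV
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# Saving with index=False writes only the real columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

This is also why the final export in this notebook passes index=False to to_csv.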

**STEP III:**

Let's now take a look at our data.

Run cell below.

In [None]:
print(len(all_comments))
print(all_comments.sample(40))

**STEP IV:**

The "time" column contains Unix timestamps in seconds. This format is not very readable, so let's convert it to actual datetimes.

Run cell below.

In [None]:
all_comments['time'] = pd.to_datetime(all_comments['time'], unit='s')
print(all_comments.head(10))
print(all_comments.tail(10))
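
Unix timestamps count seconds since 1970-01-01 00:00 UTC, so the converted values are UTC times without an explicit timezone label. The short sketch below uses two made-up timestamps (not taken from the dataset) to show the conversion, and the optional utc=True argument for cases where timezone-aware values are preferred.

```python
import pandas as pd

# Two illustrative Unix timestamps in seconds (not taken from the dataset)
raw = pd.Series([1609891200, 1609894800])

naive = pd.to_datetime(raw, unit="s")            # timezone-naive datetimes (UTC clock time)
aware = pd.to_datetime(raw, unit="s", utc=True)  # same instants, explicitly marked as UTC

print(naive.iloc[0])
print(aware.iloc[0])
```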

Phase II: Topic modelling

**STEP I:**

Make sure to pip install the following:
- !pip install transformers
- !pip install bertopic umap-learn hdbscan sentence-transformers
- !pip install nltk

Run cell below.


In [None]:
from transformers import pipeline
import numpy as np
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
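
If one of the imports above fails, it can help to check which packages are missing before installing anything. The following is a minimal sketch using only the standard library; the helper name missing_packages is our own, and the list contains the import names (which sometimes differ from the pip names, for example umap for umap-learn and sklearn for scikit-learn).

```python
import importlib.util

def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names of the packages used in this notebook
required = ["transformers", "bertopic", "umap", "hdbscan",
            "sklearn", "sentence_transformers", "nltk"]
print("Missing packages:", missing_packages(required))
```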

**STEP II:**

The following code sets up a pipeline to analyze text and uncover hidden topics.

Run cell below.

In [None]:
sentence_model = SentenceTransformer('all-mpnet-base-v2') #This converts text into a format that a machine can understand
# Dimensionality Reduction
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=10) #The text embeddings are very detailed, often with hundreds of dimensions.
                                                                                                  #To simplify and speed up analysis, we use UMAP to reduce these to just 5
                                                                                                  #dimensions while preserving the structure of the data. "min_dist"
                                                                                                  #controls how tightly the data points are packed after reduction. "metric"
                                                                                                  #specifies how similarity between data points is measured; "n_neighbors"
                                                                                                  #determines how much context each data point gets from its neighbors.


# Clustering
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=2, metric='euclidean', cluster_selection_method='eom') #Once the data is reduced, the next step is to group similar
                                                                                                                #comments together into clusters, which can be interpreted as
                                                                                                                #potential topics. HDBSCAN does this by looking for dense groups
                                                                                                                #of points. "min_cluster_size" determines the minimum size of
                                                                                                                #clusters. "min_samples" controls how conservative the clustering is and "metric"
                                                                                                                #determines the way distances are measured.

# Tokenization
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=5) # To be analyzed, text needs to be split into smaller units. "CountVectorizer" does this
                                                                                       #by removing common stopwords, extracting single words and two-word combinations (ngram_range) and ignoring
                                                                                       #terms that appear in fewer than 5 comments (min_df=5)

# Weighting Scheme
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True) #this highlights the words that are unique to specific topics
# Representation Tuning
representation_model = KeyBERTInspired() #this step picks the most representative words or phrases for each topic.

topic_model = BERTopic(
    embedding_model=sentence_model,
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model
)                                                           #all previous tools are being now combined into one BERTopic model
docs = all_comments['comment'].dropna().astype(str).tolist()  #rows with a missing comment are dropped here
topics, probs = topic_model.fit_transform(docs)               #fit the model once and get one topic per document

#info
topic_info = topic_model.get_topic_info()
topic_info

**STEP III:**

*Optional* 
If there are too many topics, we can force a reduction. Let's reduce the number to 100 topics. This is only an example; any number can be used.

Run cell below.

In [None]:
topic_model.reduce_topics(docs, nr_topics=100) #The forced reduction of topics is a highly sensitive part of the analysis. It is always a good idea to try several values and compare the outputs.


**STEP IV:**

We shall now clean the extracted topics with a few techniques. This phase can change depending on the data being analysed, as well as on the number of topic groups extracted. In any case, it is always a good idea to remove stopwords from the comments, using NLTK, so that they no longer show up in the topics. After that, check how many outliers there are: topic "-1" is an automatically produced row that contains all outliers. In our case, since there were too many outliers, we decided to force them into the other existing topic groups.

Run cell below.

In [None]:
import nltk                                     #NLTK is maybe the best tool for removing stopwords. Unfortunately, many stopwords were used by our model to form the groups of topics,
nltk.download('stopwords')                      #resulting in topics such as "in", "of" and so on. This step ensures no more stopwords are used in the analysis.
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))


docs_cleaned = []
for doc in docs:
    words = doc.split()                                                  #split each comment on whitespace
    filtered_words = [w for w in words if w.lower() not in stop_words]   #keep only non-stopwords
    docs_cleaned.append(" ".join(filtered_words))

topics, probs = topic_model.fit_transform(docs_cleaned)
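
The filter above splits each comment on whitespace, which has two consequences worth knowing: a stopword with punctuation attached (such as "the,") is kept, and a comment made only of stopwords becomes an empty string. The sketch below shows both on toy sentences; the helper name remove_stopwords and the tiny stopword set are illustrative assumptions (the notebook itself uses NLTK's English list).

```python
def remove_stopwords(doc, stop_words):
    """Drop whitespace-separated tokens whose lowercase form is in stop_words."""
    return " ".join(w for w in doc.split() if w.lower() not in stop_words)

# Tiny hand-made stopword set for illustration; the notebook uses NLTK's English list
toy_stop_words = {"the", "is", "of", "in"}

print(remove_stopwords("The end of the thread", toy_stop_words))
print(remove_stopwords("what is the, uh, point", toy_stop_words))  # "the," keeps its comma and survives
print(repr(remove_stopwords("of the", toy_stop_words)))            # only stopwords: empty string
```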

**STEP V:**

Run cell below to see results.

In [None]:
topic_info = topic_model.get_topic_info()
topic_info

**STEP VI:**

*Optional* We still had too many outliers. Let's reassign them to the existing topics using the "distributions" strategy.

Run cell below.

In [None]:
new_topics = topic_model.reduce_outliers(docs_cleaned, topics, strategy="distributions") if topic_model._outliers else topics
topic_model.update_topics(docs_cleaned, topics=new_topics)
topic_model.get_topic_info()
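
To judge whether the outlier reduction was needed, and what it changed, it can be useful to compute the share of documents in topic "-1" before and after. This is a minimal sketch with made-up topic lists; the helper name outlier_share is our own.

```python
from collections import Counter

def outlier_share(topic_list):
    """Fraction of documents assigned to the outlier topic -1."""
    if not topic_list:
        return 0.0
    return Counter(topic_list)[-1] / len(topic_list)

# Illustrative topic assignments, not real model output
toy_before = [-1, 0, 1, -1, 0, -1, 2, 0]
toy_after = [0, 0, 1, 1, 0, 2, 2, 0]

print(outlier_share(toy_before), outlier_share(toy_after))
```

In the notebook, the same function could be called on topics and new_topics.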

Congratulations! Now the topics are cleaned and ready to be explored through visualization techniques.

**STEP VII:**

TOPIC MAP NR. 1. This is a visual representation that shows how the identified topics relate to one another in terms of their content.

Run cell below.

In [None]:
topic_model.visualize_topics()

**STEP VIII:**

TOPIC MAP NR. 2. The following map will allow us to visualize how the cleaned text documents are represented in a reduced-dimensional space.

Run the three cells below.

In [None]:
embedding_model = sentence_model
document_embeddings = embedding_model.encode(docs_cleaned, show_progress_bar=True)  #this converts text into numerical embeddings. "show_progress_bar" is entirely optional, it just allows us to see the bar. I like progress bars.

In [None]:
topic_model.visualize_documents(docs=docs_cleaned, embeddings=document_embeddings, hide_annotations=True)

In [None]:
topic_model.get_topic_info()

Well done! Now that you know the densest topics and the most important clusters, it is time to **analyse them timewise**. We can do that by adding the topics as a **third column** to our dataset. In this way, every row will show the comment, the datetime and the associated topic group.

**STEP IX:**

Run cell below.

In [None]:
import pandas as pd
import os

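# Topics exist only for rows with a non-null comment (rows with NaN were dropped before modelling),
# so align on those rows; assigning the list directly fails if the lengths differ.
valid_idx = all_comments['comment'].dropna().index
all_comments['topics'] = pd.Series(topic_model.topics_, index=valid_idx)  #topics_ holds the latest assignments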

# Save DataFrame to CSV
csv_filename = r'' #fill in file name and path
all_comments.to_csv(csv_filename, index=False)

# Print the absolute path so you can manually download it
file_path = os.path.abspath(csv_filename)
print(f"File saved at: {file_path}")
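
As a preview of what a time-wise analysis can look like, the sketch below counts comments per day and per topic on a toy table with the same three columns. It is only one possible approach under our own assumptions (daily bins, made-up values) and does not describe what the second notebook actually does.

```python
import pandas as pd

# Toy table with the same columns as the exported CSV (illustrative values only)
toy = pd.DataFrame({
    "comment": ["a", "b", "c", "d"],
    "time": pd.to_datetime([1609891200, 1609894800, 1609977600, 1609981200], unit="s"),
    "topics": [0, 1, 0, 0],
})

# Count comments per calendar day and topic, then spread topics into columns
daily_counts = (
    toy.groupby([toy["time"].dt.floor("D"), "topics"])
       .size()
       .unstack(fill_value=0)
)
print(daily_counts)
```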


Congrats! DATA_ANALYSIS_4CHAN_I_ is complete. Use the .CSV file with the topics for the second analysis notebook, 'DATA_ANALYSIS_4CHAN_II_'.