# Bleach Topic Words and Run a Machine Learning Experiment



## LDA Topic Modeling using `scikit-learn` in Jupyter Notebook




### What is Topic Modeling?

Topic modeling is an unsupervised machine learning technique used to automatically identify topics present in a text corpus. By analyzing the patterns of words and their co-occurrence, topic modeling algorithms can group words into topics and assign topics to individual documents. This helps in understanding the main themes of large collections of texts without having to read through each document.

### How does LDA work?

Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling techniques. The steps below give the sampling-based intuition; scikit-learn's `LatentDirichletAllocation` fits the same model with variational Bayes. At a high level, LDA works as follows:

1. **Initialization**: Randomly assign each word in each document to a topic.
2. **Iterative assignment**: For each document, go through each word and reassign it to a topic based on two questions:
    - How prevalent is that word in each topic?
    - How prevalent is each topic in that document?
3. **Convergence**: Repeat the reassignment until the topic assignments stabilize and change little between iterations.

The result is that each document gets a distribution over topics, and each topic gets a distribution over words.
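
The steps above correspond most closely to collapsed Gibbs sampling, one common way to fit LDA. The sketch below is a toy illustration only: the corpus, the priors `alpha` and `beta`, and the fixed iteration count `n_iters` are all assumptions made for the example, and this is not how scikit-learn fits the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus (made up): each document is a list of word ids from a 6-word vocabulary
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 5, 3]]
n_topics, vocab_size = 2, 6
alpha, beta = 0.1, 0.01  # assumed symmetric Dirichlet priors
n_iters = 50             # fixed number of sweeps instead of a convergence check

# Count tables updated during sampling
doc_topic = np.zeros((len(docs), n_topics))     # words per topic in each document
topic_word = np.zeros((n_topics, vocab_size))   # occurrences of each word per topic
topic_total = np.zeros(n_topics)                # total words per topic

# Step 1: randomly assign each word in each document to a topic
assignments = []
for d, doc in enumerate(docs):
    z_d = rng.integers(n_topics, size=len(doc))
    assignments.append(z_d)
    for w, z in zip(doc, z_d):
        doc_topic[d, z] += 1
        topic_word[z, w] += 1
        topic_total[z] += 1

# Steps 2 and 3: repeatedly reassign every word
for _ in range(n_iters):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = assignments[d][i]
            # Remove the word's current assignment from the counts
            doc_topic[d, z] -= 1
            topic_word[z, w] -= 1
            topic_total[z] -= 1
            # (a) how prevalent the word is in each topic
            word_given_topic = (topic_word[:, w] + beta) / (topic_total + vocab_size * beta)
            # (b) how prevalent each topic is in this document
            topic_given_doc = doc_topic[d] + alpha
            p = word_given_topic * topic_given_doc
            z = rng.choice(n_topics, p=p / p.sum())
            # Record the new assignment
            assignments[d][i] = z
            doc_topic[d, z] += 1
            topic_word[z, w] += 1
            topic_total[z] += 1

# Smoothed estimates of the two resulting distributions
theta = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)  # document-topic
phi = (topic_word + beta) / (topic_word + beta).sum(axis=1, keepdims=True)    # topic-word
```

Each row of `theta` is one document's distribution over topics, and each row of `phi` is one topic's distribution over words.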

### What are the outcomes?

The output of LDA consists of two sets of distributions:

1. **Document-Topic Distribution**: Each document is represented as a mixture of topics.
2. **Topic-Word Distribution**: Each topic is represented as a mixture of words.

For example, if we have a corpus of news articles, LDA might identify topics such as 'politics', 'sports', 'economy', etc. An article about a football match would have a high probability for the 'sports' topic, while an article about a new policy might have a high probability for the 'politics' topic.


## Read Data

We use the training data from all five conspiracy topics (seeds) to generate the topic words.

In [1]:
import pandas as pd

# Seed names; there is one training CSV per seed
seeds = ['big.foot','flat.earth','climate', 'vaccine','pizzagate']

# Use a comprehension to read each file into a DataFrame
dfs = [pd.read_csv(f'./data/train_{s}.csv') for s in seeds]

# Concatenate all of these DataFrames
merged_df = pd.concat(dfs, ignore_index=True)



In [2]:
merged_df

Unnamed: 0.1,Unnamed: 0,text,label,doc_id,seeds
0,844,Sounds like a Dan Brown novel set in the Pacif...,mainstream,M108d5,big.foot
1,170,"ST. PETERSBURG, Fla. — Even on a night when th...",mainstream,M10dee,big.foot
2,942,"Our physical body if its human, needs its down...",conspiracy,C01c06,big.foot
3,461,For the first time since Franklin D. Roosevelt...,conspiracy,C0198f,big.foot
4,209,Imagine taking a helicopter to an uncharted is...,mainstream,M0c1b4,big.foot
...,...,...,...,...,...
5315,1095,The conspiracy theory took the internet by sto...,mainstream,M07f3c,pizzagate
5316,1130,"If QAnon’s claims were true, they would shake ...",mainstream,M0e932,pizzagate
5317,1294,"On Sunday afternoon, a 28-year-old man walked ...",mainstream,M13dfb,pizzagate
5318,860,June 22 (Reuters) - A North Carolina man who w...,mainstream,M1520a,pizzagate


## Text Preprocessing

Two preprocessing steps are applied:

1. Lowercase all the data.
2. Remove stop words: the NLTK default English list plus a few added by Danny.

Before performing LDA, we need to convert our documents into a matrix of token counts. We'll use `CountVectorizer` for this.


In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords


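# Lowercase the data (CountVectorizer also lowercases by default;
# doing it here keeps the DataFrame itself consistent)
merged_df['text'] = merged_df['text'].str.lower()

# Stop words: NLTK's English list plus a few extras
# (requires the NLTK stop word corpus, e.g. via nltk.download("stopwords"))
STOP_WORDS = set(stopwords.words("english")) | {"you", "it", "we", "there", "they", "us", "the", "them", "i've", "i'm", "too", "you've"}

# Initialize CountVectorizer; min_df=2 drops words that occur in fewer than two documents
vectorizer = CountVectorizer(stop_words=sorted(STOP_WORDS), min_df=2)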

# Transform the documents into a matrix of token counts
X = vectorizer.fit_transform(merged_df['text'])
 

## Build LDA model

In [9]:
from sklearn.decomposition import LatentDirichletAllocation

# Initialize LDA model with the desired number of topics
lda_model = LatentDirichletAllocation(n_components=5, random_state=42)

# Fit the LDA model on the data
lda_model.fit(X)

In [12]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx + 1}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
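
The slice `topic.argsort()[:-no_top_words - 1:-1]` used in `display_topics` returns the indices of the largest weights, from largest to smallest. A small standalone illustration with made-up weights (`weights`, `n` and `top_idx` are names chosen for this example):

```python
import numpy as np

weights = np.array([0.1, 0.7, 0.2, 0.9, 0.4])
n = 3

# argsort() orders indices by ascending weight; the reversed slice
# walks back from the end and stops after n entries
top_idx = weights.argsort()[:-n - 1:-1]
print(top_idx)  # [3 1 4]

# An equivalent formulation: sort by negated weight and take the first n
same_idx = np.argsort(-weights)[:n]
```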



In [13]:
def save_topics_to_file(model, feature_names, no_top_words):
    num_topics = model.n_components
    with open(f"{num_topics}_topic_words.txt", 'w') as file:
        file.write(f"Number of Topics: {num_topics}\n\n")
        for topic_idx, topic in enumerate(model.components_):
            file.write(f"Topic {topic_idx + 1}:\n")
            file.write(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
            file.write("\n\n")




In [14]:
no_top_words = 10
display_topics(lda_model, vectorizer.get_feature_names_out(), no_top_words)


Topic 1:
one like earth people would time world know also even
Topic 2:
trump state new government president people said one american would
Topic 3:
climate change said global world emissions new carbon would countries
Topic 4:
vaccine vaccines people may health also children study one climate
Topic 5:
said news trump media conspiracy people one clinton also fake


In [None]:
# Save topic words to a file
no_top_words = 20
save_topics_to_file(lda_model, vectorizer.get_feature_names_out(), no_top_words)
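
To reuse the saved words for bleaching, the file has to be read back into a collection of words. The sketch below assumes the file layout written by `save_topics_to_file`; `load_topic_words` is a hypothetical helper that is not part of the original notebook, and the sample file content is made up.

```python
import tempfile
from pathlib import Path


def load_topic_words(path):
    """Hypothetical helper: collect the words of all topics in a saved file into one set."""
    words = set()
    lines = Path(path).read_text().splitlines()
    for prev, line in zip(lines, lines[1:]):
        # The word list is the line directly after a "Topic k:" header
        if prev.startswith("Topic ") and prev.endswith(":"):
            words.update(line.split())
    return words


# Made-up sample in the same layout as the saved file
sample = (
    "Number of Topics: 2\n\n"
    "Topic 1:\nearth world time\n\n"
    "Topic 2:\nvaccine health children\n\n"
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "2_topic_words.txt"
    sample_path.write_text(sample)
    topic_words = load_topic_words(sample_path)

print(sorted(topic_words))
# ['children', 'earth', 'health', 'time', 'vaccine', 'world']
```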


## Bleach words and generate new training data

In [15]:
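def bleach_topic_words(topic_words, df):
    """Replace every topic word in df['word_pos'] with its POS tag.

    Each 'word_pos' entry is expected to be a string of word_POS tokens
    separated by whitespace. The result is stored in a new 'bleach' column.
    """
    bleach_words = set(topic_words)  # set for fast membership tests
    bleach_lines = []
    for v in df['word_pos'].values:
        new_doc = []
        # Missing values (NaN) are not strings; treat them as empty documents
        tokens = v.split() if isinstance(v, str) else []
        for token in tokens:
            # Split on the last underscore so words containing '_' stay intact
            word, sep, pos = token.rpartition('_')
            if not sep:
                # No POS tag attached: report the token and keep it unchanged
                print(token)
                new_doc.append(token)
            elif word.lower() in bleach_words:
                new_doc.append(pos)
            else:
                new_doc.append(word)
        bleach_lines.append(' '.join(new_doc))

    df['bleach'] = bleach_lines
    return df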
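
As a small illustration of what bleaching does to a single document, the sketch below applies the same idea to one made-up `word_POS` string. The helper `bleach_doc`, the example sentence and its tags are assumptions for illustration; the actual tag set in the `word_pos` column is not shown in this notebook.

```python
def bleach_doc(word_pos_doc, bleach_words):
    """Replace each topic word in a 'word_POS word_POS ...' string with its POS tag."""
    out = []
    for token in word_pos_doc.split():
        word, sep, pos = token.rpartition("_")
        if not sep:
            # Token without a POS tag: keep it unchanged
            out.append(token)
        elif word.lower() in bleach_words:
            out.append(pos)
        else:
            out.append(word)
    return " ".join(out)


print(bleach_doc("The_DT vaccine_NN is_VBZ safe_JJ", {"vaccine", "safe"}))
# The NN is JJ
```

The topic words `vaccine` and `safe` are replaced by their tags, while the other words keep their surface form.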

In [None]:
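def gen_new_df(folder, n_topics, topic_words):
    """Bleach every train/test/val split of every seed and save it as a new CSV.

    `folder` must end with a path separator; `topic_words` is the collection
    of topic words passed on to bleach_topic_words.
    """
    seeds = ['big.foot', 'flat.earth', 'climate', 'vaccine', 'pizzagate']
    for data_type in ['train', 'test', 'val']:
        for seed in seeds:
            file = f"{folder}{data_type}_{seed}.wp.csv"
            df = pd.read_csv(file)
            df = bleach_topic_words(topic_words, df)
            # Replace the original text with the bleached text
            df.drop(['text'], axis=1, inplace=True)
            # Drop the first column (presumably a leftover index column)
            df.drop(df.columns[0], axis=1, inplace=True)
            df.rename(columns={'bleach': 'text'}, inplace=True)
            df.to_csv(f"{folder}{data_type}_{seed}.tp{n_topics}.csv")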
