<a href="https://colab.research.google.com/github/hbedle/DISC_NLPwrkshp/blob/main/DISC_workshop_Feb2024_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Background information

For today's workshop, we will be doing some basic visualization and analysis of text using Natural Language Processing (NLP) techniques.

One thing to always keep in mind is that machine learning techniques augment our analyses. Many techniques exist, and the right method depends on the task you are trying to accomplish. I highly recommend the text '*Text as Data*' by Grimmer et al. if you really want to dig into NLP methods more.

With any data analysis, it is important to consider how you select your data (text) and any biases it may have when you are building the collection of text documents (called a corpus) to analyze.

Today, we are going to start with a technique called **'Bag of Words'**, so named because we treat our corpus as a collection of individual words.

**So the basic steps to get started are:**

1.   **Choose your unit of analysis**... is it news headlines, paragraphs in political speeches, tweets?

2.   **Tokenize**- this is how you will break down your document.  Typically we are looking at individual words [unigram], but you could have phrases that are important like "white house" and want to include bigrams [two word phrases].

3. **Reduce the complexity** of your text/words:
  - Make all text lowercase
  - Remove punctuation
  - Remove stop words [common words]
  - Create equivalent word classes
    - lemmatization
    - stemming
  - Filter words by frequency

4. You can then look at **representations from language sequencing**, such as:
  - Parts of Speech tagging [POS]
    - are you interested in nouns? verbs? adjectives before a noun?
  - Named entity recognition [NER]

5. **Distributed representations of words**: looking at words in multidimensional space with **word embeddings** and **vector space modeling**

6. **Clustering** methods - these take in the text of each document and assign each document to one of 'n' categories.  A common starting method is k-means.

7. **Topic Modeling** - similar to clustering, but allows multiple topics per document.
  - Latent Dirichlet Allocation [LDA] is one of the most popular methods for topic modeling.
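
Steps 2 and 3 above can be sketched in a few lines of plain Python. This is only an illustration of the bag-of-words idea: the two sentences and the tiny `STOP_WORDS` set are made up for this example (NLTK's English stop word list is much longer), and splitting on whitespace is a simplification of a real tokenizer.

```python
import string
from collections import Counter

# Tiny illustrative stop word list (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"the", "is", "a", "and", "of", "to", "in", "are"}

def bag_of_words(document):
    """Lowercase, strip punctuation, split on whitespace, drop stop words, count."""
    text = document.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return Counter(tokens)

# Hypothetical example documents (not from the survey data)
docs = [
    "The climate is changing, and the weather is warmer.",
    "Scientists agree the climate is changing.",
]
bags = [bag_of_words(doc) for doc in docs]
for doc, bag in zip(docs, bags):
    print(doc, "->", dict(bag))
```

Each document ends up as a set of word counts with word order thrown away, which is exactly what "bag" refers to.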

**TODAY** - Using data from OU's CRCM National 2018 survey: http://crcm.ou.edu/epscordata/

**Codebook**:

  **gender**: 0 = female, 1 = male

  **glbcc**: In your view, are greenhouse gases, such as those resulting from the combustion of coal, oil, natural gas, and other materials, causing average global temperatures to rise?
        0 = No, 1 = Yes

  **glbcc_change**: In the last 5 years have you changed your beliefs about whether humans are/are not causing global climate change?
        0 = No, 1 = Yes, 2 = Don't know

  **party_w_lean**: 1 = Democrat, 2 = Republican, NA = other

  **glbcc_change_led1**: [only IF glbcc_change = 0 or 2] What led you to be more or less confident in your climate change views than 5 years ago?

  **glbcc_change_led2**: [only IF glbcc_change = 1] What led you to change your views about whether humans are causing global climate change?

**A couple of Google Colab shortcuts**

To **comment/uncomment** a section of code, highlight it and press Ctrl + /.
Other shortcuts can be found in the Colab menu under 'Tools -> Keyboard Shortcuts'.

## Import all needed python packages and the data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
from nltk.probability import FreqDist
import string
from collections import Counter
from gensim import corpora
from gensim.models import LdaModel
import spacy
from gensim.models import CoherenceModel
#from gensim.models.ldamodel import LdaModel



In [None]:
# Download NLTK (Natural Language Toolkit) resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
################  YOU WILL NEED TO CHANGE THE FILE PATH TO POINT TO YOUR EXCEL #####################################
# Read the Excel sheet containing text data
file_path =  '/content/gdrive/MyDrive/data_examples/CRCM_glbcc_txt.xlsx'
df = pd.read_excel(file_path)

In [None]:
#check out the header and top couple rows
df.head()

## Let's start by visualizing the data in Python

In [None]:
################## Pie chart for gender  #########################
# Replace numeric values with meaningful labels  & Count the occurrences of each gender
gender_counts = df['gender'].replace({0: 'Female', 1: 'Male'}).value_counts()

# Define colors for each gender - feel free to pick new colors!
gender_colors = {'Female': 'purple', 'Male': 'blue'}

#Create a pie chart
plt.pie(gender_counts, labels=gender_counts.index, colors=[gender_colors[gender] for gender in gender_counts.index], autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Gender')
plt.show()

In [None]:
print(df.head())

In [None]:
####################### Vertical bar chart for party lean #########################
party_counts = df['party_w_lean'].replace({1: 'Democrat', 2: 'Republican', 3: 'Other'}).value_counts()
party_colors = {'Other': 'grey', 'Democrat': 'blue', 'Republican': 'red'}

# Plot the bar chart with custom colors and counts on top
ax = party_counts.plot(kind='bar', color=[party_colors[party] for party in party_counts.index])
ax.bar_label(ax.containers[0], fmt='%d', label_type='edge', fontsize=8)  # Add counts on top of each bar
plt.title('Party Affiliation Distribution')
plt.ylabel('Number of Respondents')
plt.xticks(rotation=0)  # Adjust x-axis ticks
plt.show()

In [None]:
######################### Horizontal bar chart for beliefs about climate change  #########################
beliefs_labels = df['glbcc'].replace({0: 'No', 1: 'Yes'})
beliefs_counts = beliefs_labels.value_counts()
beliefs_colors = {'No': 'turquoise', 'Yes': 'orange'}

# Create a horizontal bar chart
fig, ax = plt.subplots()
bars = ax.barh(beliefs_counts.index, beliefs_counts, color=[beliefs_colors[belief] for belief in beliefs_counts.index])

# Add values at the end of each bar using a for loop
for bar in bars:
    ax.text(bar.get_width(), bar.get_y() + bar.get_height() / 2, f'{bar.get_width()}', ha='left', va='center')

# Set title and labels
plt.title('Beliefs About Climate Change')
plt.xlabel('Number of Respondents')
plt.show()

In [None]:
########################### NOW LETS COMBINE VARIABLES FOR A FEW MORE VISUALIZATIONS! USING SEABORN  ###########################
# Set the style for seaborn
sns.set(style="whitegrid")

# Map numeric values to meaningful labels - NOTE WE ARE ADDING NEW LABEL COLUMNS TO THE DATAFRAME ALONGSIDE THE NUMERIC ONES!  SO COOL!
### Can see this with print(df.head())
df['genderFM'] = df['gender'].replace({0: 'Female', 1: 'Male'})
df['party_w_leanDRO'] = df['party_w_lean'].replace({1: 'Democrat', 2: 'Republican', 3: 'Other'})
df['glbccNY'] = df['glbcc'].replace({0: 'No', 1: 'Yes'})

# Create a countplot
plt.figure(figsize=(12, 8))
sns.countplot(x='party_w_leanDRO', hue='genderFM', data=df, palette={'Female': 'purple', 'Male': 'blue'}, hue_order=['Female', 'Male'])

# Customize the plot
plt.title('Distribution of Gender across Party Affiliation')
plt.xlabel('Party Affiliation')
plt.ylabel('Number of Respondents')
plt.legend(title='Gender')
plt.show()


In [None]:
# Or as an alternative can create a heatmap for the previous information

plt.subplots(figsize=(8, 8))
df_2dhist = pd.DataFrame({
    x_label: grp['party_w_leanDRO'].value_counts()
    for x_label, grp in df.groupby('genderFM')
})
sns.heatmap(df_2dhist, cmap='viridis')
plt.title('Distribution of Gender across Party Affiliation')
plt.xlabel('genderFM')
_ = plt.ylabel('party_w_leanDRO')

In [None]:
########################## CLIMATE CHANGE BELIEFS BY GENDER ##################
plt.figure(figsize=(10, 6))
sns.countplot(x='glbccNY', hue='genderFM', data=df, palette={'Female': 'purple', 'Male': 'blue'}, hue_order=['Female', 'Male'])

# Customize the plot
plt.title('Beliefs About Climate Change by Gender')
plt.xlabel('Beliefs About Climate Change')
plt.ylabel('Number of Respondents')
plt.legend(title='Gender')
plt.show()

In [None]:
################## HOW MANY HAVE CHANGED THEIR MIND ABOUT CLIMATE CHANGE?  ############################
plt.figure(figsize=(10, 6))
sns.countplot(x='glbcc_change', data=df, palette='viridis')

# Customize x-axis labels
plt.xticks(ticks=[0, 1, 2], labels=['No', 'Yes', "Don't know"])

plt.title('Distribution of Changes in Climate Change Beliefs')
plt.xlabel('Changes in Beliefs')
plt.ylabel('Number of Respondents')
plt.show()

## Now let's look at the textual answers!

Let's look at the survey answers just for *glbcc_change_led2*

In [None]:
#########   FIRST LET'S MAKE A WORD CLOUD OF THE FREE RESPONSES   ########
# If you want to combine text from two variables, fill blanks first so the rows stay aligned:
#free_responses = df['glbcc_change_led1'].fillna('').astype(str) + ' ' + df['glbcc_change_led2'].fillna('').astype(str)

# Note: astype(str) turns missing values into the text 'nan', so the dropna() that follows removes nothing here
free_responses =  df['glbcc_change_led2'].astype(str).dropna()

# Generate Word Cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(free_responses))

plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Free Responses')
plt.show()

In [None]:
#Let's print out the top of free_responses just to see what the data looks like...
free_responses[:20]

**HMMMMM.... MAYBE WE WANT TO TAKE OUT SOME OF THOSE WORDS.... LETS GO OVER SOME NLP PROCESSING BASICS.....**

**Stop words** are common words that are often removed from text data during NLP. These words, such as "the," "and," and "is," appear frequently but carry little meaning on their own.
Removing stop words can help focus on the more meaningful content of the text.

**Lemmatization** is the process of reducing words to their base or root form, known as the lemma.
It involves removing inflections or variations to standardize words. This helps in grouping together different forms of a word and simplifies analysis.

  - *Example: The lemma of the words "running," "ran," and "runs" is "run."*

**Tokenization** is the process of breaking down a text into individual units, typically words or phrases, known as tokens.
These tokens are the building blocks used in NLP tasks. Tokenization is a crucial step in text analysis as it allows computers to understand and process the structure of the text.
  - *For example: The sentence "I love programming in Python" would be tokenized into individual words: ["I", "love", "programming", "in", "Python"].*

**Stemming** is another text normalization process that involves reducing words to their base or root form, known as the "stem."
Unlike lemmatization, stemming may result in a root form that is not a valid word, as it applies a set of rules to chop off prefixes or suffixes from words. The goal of stemming is to group together words with similar meanings and treat them as the same entity, even if the resulting stem is not a complete word.

  - *Example-- a Porter-style stemmer reduces "happiness" and "happy" to "happi," which is not a valid word.*

  - Stemming is often less linguistically precise than lemmatization but can be computationally faster.
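
To make the contrast between stemming and lemmatization concrete, here is a toy sketch. It is not the Porter stemmer or the WordNet lemmatizer: `toy_stem` simply chops a few hard-coded suffixes, and `toy_lemmatize` looks words up in a small hand-written table, both invented for this illustration.

```python
# Toy suffix-stripping stemmer: chops common endings with no dictionary lookup
SUFFIXES = ("ing", "ly", "es", "ed", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: looks words up in a small hand-written dictionary (an assumption)
LEMMA_TABLE = {"running": "run", "ran": "run", "runs": "run", "studies": "study"}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

for word in ["running", "ran", "runs", "studies"]:
    print(f"{word:10s} stem={toy_stem(word):8s} lemma={toy_lemmatize(word)}")
```

The rule-based stemmer is fast but produces non-words such as "runn" and "studi" and cannot connect "ran" to "run", while the lookup-based lemmatizer returns real words at the cost of needing linguistic knowledge.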

In [None]:
###########################################  OK NOW LET'S PREPROCESS   ################################################
# Copy responses from the 'glbcc_change_led2' column and set as type 'string'
df['glbcc_change_led2_NLP'] = df['glbcc_change_led2'].astype(str)

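# Extract responses from the 'glbcc_change_led2' column, dropping blank (NaN) values BEFORE converting to strings
# (astype(str) first would turn NaN into the text 'nan', which dropna() cannot remove)
responses = df['glbcc_change_led2'].dropna().astype(str)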

# Tokenization and cleaning with lemmatization
lemmatizer = WordNetLemmatizer()

# this is my custom stop word list - where I can add other words if I want
custom_stop_words = set(['na', 'nan'])  # Add your own custom stop words here!

def preprocess_text(text):
    # Convert to lowercase
    text = str(text).lower()

    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))

    # Tokenization and lemmatization
    tokens = [lemmatizer.lemmatize(token) for token in word_tokenize(text)]

    # Remove custom and NLTK stop words
    stop_words = set(stopwords.words('english'))
    stop_words = stop_words.union(custom_stop_words)
    tokens = [token for token in tokens if token not in stop_words]

    return tokens

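# Apply tokenization and cleaning to each survey response.
# 'responses' is a Series, so collect the text and its tokens in a DataFrame with a 'Tokens' column.
responses = pd.DataFrame({'Response': responses, 'Tokens': responses.apply(preprocess_text)})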

# put all the tokens in one list
all_tokens = [token for sublist in responses['Tokens'].tolist() for token in sublist]

In [None]:
print(responses['Tokens'].head(20))

In [None]:
# what does our tokenized, concatenated text look like?
all_tokens[:10]

In [None]:
##############################  Now that we have pre-processed, lets make the word cloud again - notice any differences?   ######
# Word Cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(all_tokens))
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('NEW Word Cloud of Responses after NLP preprocessing')
plt.show()


In [None]:
#####################  WHAT ARE SOME OTHER WAYS TO VISUALIZE THE WORDS IN THE FREE RESPONSE?   ##################
# Bar chart of most common words  [falls under frequency analysis]
fdist = FreqDist(all_tokens)
common_words = fdist.most_common(15)
common_words_df = pd.DataFrame(common_words, columns=['Word', 'Frequency'])
plt.figure(figsize=(10, 5))
plt.bar(common_words_df['Word'], common_words_df['Frequency'], color='skyblue')
plt.title('Top 15 Most Common Words')
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()

In [None]:
##################### Or you can just print this out to your terminal screen --- the most common words and their frequencies
print(common_words_df)

Let's see what happens if we look for phrases, not just individual words...
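
A bigram is just a pair of adjacent tokens, so it can be built with `zip`. The sketch below uses two made-up, already-preprocessed responses and builds bigrams inside each response separately. That differs from counting bigrams over one concatenated token list such as `all_tokens`, where the last word of one response can end up paired with the first word of the next.

```python
from collections import Counter

def bigrams(tokens):
    """Return adjacent token pairs, e.g. [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical, already-preprocessed responses (one token list per response)
tokenized = [
    ["scientific", "evidence", "changed", "mind"],
    ["extreme", "weather", "scientific", "evidence"],
]

# Build bigrams within each response so that no pair spans two responses
bigram_counts = Counter(pair for tokens in tokenized for pair in bigrams(tokens))
print(bigram_counts.most_common(3))
```

With short survey answers the boundary effect is usually small, but it is worth knowing about when a surprising phrase shows up in the top bigrams.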

In [None]:
from nltk.probability import FreqDist
from nltk.util import ngrams

# Calculate frequency distribution of all tokens (unigrams)
fdist_all_tokens = FreqDist(all_tokens)

# Get the top 15 most common unigrams
top_15_unigrams = fdist_all_tokens.most_common(15)

# Print the top 15 unigrams and their frequencies
print("Top 15 Unigrams:")
for unigram, frequency in top_15_unigrams:
    print(f"{unigram}: {frequency}")

# Extract all bigrams from the list of tokens
all_bigrams = list(ngrams(all_tokens, 2))

# Calculate frequency distribution of bigrams
fdist_bigrams = FreqDist(all_bigrams)

# Get the top 15 most common bigrams
top_15_bigrams = fdist_bigrams.most_common(15)

# Print the top 15 bigrams and their frequencies
print("\nTop 15 Bigrams:")
for bigram, frequency in top_15_bigrams:
    print(f"{' '.join(bigram)}: {frequency}")


## Parts of speech (POS)

In [None]:
# Load the English language model
nlp = spacy.load('en_core_web_sm')

# Define a function to perform POS tagging on a text
def pos_tagging(text):
    doc = nlp(text)
    pos_tags = [(token.text, token.pos_) for token in doc]
    return pos_tags

# Example text
text = "Heather thinks it's really fun to sit around and write code."

# Perform POS tagging
pos_tags = pos_tagging(text)

# Print the POS tags
for token, pos_tag in pos_tags:
    print(f"{token}: {pos_tag}")


In [None]:
# OK - so now lets do this on our data!
#Define a function to extract verbs from text
def extract_verbs(text):
    doc = nlp(text)
    verbs = [token.lemma_ for token in doc if token.pos_ == 'VERB']
    return verbs

# Apply the function to each non-blank response in the glbcc_change_led1 column
# recall this variable is -- What led you to be more or less confident in your climate change views than 5 years ago?
all_verbs = [verb for response in df['glbcc_change_led1'].dropna() for verb in extract_verbs(str(response))]

# Count the frequency of each verb
verb_counts = Counter(all_verbs)

# Get the top 10 most common verbs
top_verbs = verb_counts.most_common(10)

# Extract verbs and their frequencies
verbs, frequencies = zip(*top_verbs)

# Plot the top verbs
plt.figure(figsize=(10, 6))
plt.bar(verbs, frequencies, color='turquoise')
plt.title('Top 10 Verbs in Responses')
plt.xlabel('Verbs')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()


In [None]:
# Now we can expand this to look at adjective-noun pairs
def extract_adj_noun_pairs(text):
    doc = nlp(text)
    adj_noun_pairs = [(token.text, token.head.text) for token in doc if token.pos_ == 'ADJ' and token.head.pos_ == 'NOUN']
    return adj_noun_pairs

# Apply the function to each non-blank response in the glbcc_change_led1 column
all_adj_noun_pairs = [pair for response in df['glbcc_change_led1'].dropna() for pair in extract_adj_noun_pairs(str(response))]

# Count the frequency of each adjective-noun pair
adj_noun_pair_counts = Counter(all_adj_noun_pairs)

# Get the top 10 most common adjective-noun pairs
top_adj_noun_pairs = adj_noun_pair_counts.most_common(10)

# Extract adjective-noun pairs and their frequencies
adj_noun_pairs, frequencies = zip(*top_adj_noun_pairs)

cmap = plt.cm.inferno

# Plot the top adjective-noun pairs
plt.figure(figsize=(10, 6))
#plt.barh(range(len(adj_noun_pairs)), frequencies, color='coral')  # Use barh for horizontal bars
plt.barh(range(len(adj_noun_pairs)), frequencies, color=cmap(np.linspace(0, 1, len(adj_noun_pairs))))
plt.yticks(range(len(adj_noun_pairs)), adj_noun_pairs)  # Set y-ticks as adjective-noun pairs
plt.title('Top 10 Adjective-Noun Pairs in Responses')
plt.xlabel('Frequency')
plt.ylabel('Adjective-Noun Pairs')
plt.show()


## Named Entity Recognition (NER)

In [None]:
#######  First, let's look only at folks who changed their beliefs about climate change and try to understand why they changed their minds ######

# Filter the dataframe based on the condition glbcc_change = 1
changed_views_df = df[df['glbcc_change'] == 1]

# Extract the glbcc_change_led2 column
change_led2_responses = changed_views_df['glbcc_change_led2']

# Display the first few responses to understand the data
print(change_led2_responses.head())

# Filter out NaN values in glbcc_change_led2
valid_responses = changed_views_df['glbcc_change_led2'].dropna()

In [None]:
# Perform NER on each valid response
named_entities_list = []
for response in valid_responses:
    # Process the response with spaCy NLP pipeline
    doc = nlp(str(response))

    # Extract named entities
    named_entities = [(ent.text, ent.label_) for ent in doc.ents]
    named_entities_list.extend(named_entities)

# Print the named entities
print("Named Entities:")
for entity, label in named_entities_list:
    print(f"{entity} - {label}")

# Create a DataFrame from the named_entities_list
df_named_entities = pd.DataFrame(named_entities_list, columns=['Entity', 'Type'])

## Note in the printout below that it is far from perfect - e.g. "Milder Winters" is a person?

In [None]:
# Count the occurrences of each entity type
entity_type_counts = Counter(df_named_entities['Type'])

# Plot a bar chart
plt.figure(figsize=(10, 6))
plt.bar(entity_type_counts.keys(), entity_type_counts.values())
plt.title('Named Entity Types')
plt.xlabel('Entity Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Print the named entities in the geo-political entity category
print("Named Entities in GPE category:")
for entity, label in named_entities_list:
    if label == 'GPE':
        print(f"{entity} - {label}")

## K-means clustering

**K-means clustering** is a popular unsupervised machine learning algorithm used for clustering data points into groups or clusters based on their similarities.

The goal of k-means clustering is to partition the data into a predefined number of clusters, with each cluster represented by its centroid (the mean of all data points in the cluster). The algorithm works iteratively by first randomly initializing cluster centroids and then assigning each data point to the nearest centroid. After the initial assignment, the centroids are updated based on the mean of the data points assigned to each cluster. This process repeats until the centroids no longer change significantly, indicating convergence. K-means clustering is widely used in various applications such as customer segmentation, image compression, and anomaly detection.
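
The assign-then-update loop described above can be written out directly. The following is a minimal sketch on made-up 2D points, not what scikit-learn does internally: the starting centroids are fixed by hand so the result is repeatable, whereas real k-means chooses them randomly (or with a smarter seeding scheme).

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means on 2D points, starting from the given centroids."""
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
# Fixed starting centroids keep the example deterministic
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

For text, the "points" are not 2D coordinates but high-dimensional vectors (one per response), and the same two steps apply.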

In [None]:
#First lets look at the data we want to understand
print(valid_responses.head())
print('  ')
print('The number of valid responses is:')
print(valid_responses.size)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD

# Convert all responses to string type
valid_responses = valid_responses.astype(str)

# Define the number of clusters
num_clusters = 5

# Define the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')

# Define the KMeans clustering model with n_init set explicitly
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)

# Create a pipeline with TF-IDF vectorizer, TruncatedSVD (for dimensionality reduction), and KMeans
pipeline = make_pipeline(tfidf_vectorizer, TruncatedSVD(n_components=50), kmeans)

# Fit the pipeline to the valid responses
pipeline.fit(valid_responses)

# Get the cluster labels
cluster_labels = pipeline.predict(valid_responses)

# Print the cluster labels
print("Cluster labels:")
print(cluster_labels)


In [None]:
# Or can print our cluster label next to responses:

# Print cluster labels along with corresponding responses for top 20
for index, label in enumerate(cluster_labels[:20]):
    print(f"Cluster {label}: {valid_responses.iloc[index]}")

#But if in your own dataset you want to print out all of them - uncomment and use code below:
# for i, response in enumerate(valid_responses):
#     print(f"Cluster {cluster_labels[i]}: {response}")

In [None]:
# print out top 20 comments in Cluster 4
cluster_4_responses = valid_responses[cluster_labels == 4]

for i, response in enumerate(cluster_4_responses[:20]):
    print(f"Response {i+1}: {response}")

In [None]:
# And if you want to save the cluster and response to a spreadsheet
# Create a DataFrame with cluster labels and corresponding responses
cluster_data = pd.DataFrame({'Cluster Label': cluster_labels, 'Response': valid_responses})

# Save the DataFrame to an Excel file
cluster_data.to_excel('/content/gdrive/MyDrive/data_examples/cluster_responses_NLPwkshp.xlsx', index=False)

## Visualizing Cluster results in multi-dimensional space

Here we are going to look at two different methods:

1.   Truncated Singular Value Decomposition (SVD)
2.   t-distributed Stochastic Neighbor Embedding (t-SNE)

**So what's the difference?**

**SVD** focuses on capturing the overall variance of the data by decomposing it into its principal components. It's suitable for summarizing the **global structure** of the data and reducing its dimensionality while preserving as much information as possible. SVD is often used for tasks like noise reduction, feature extraction, and data compression.

**t-SNE** emphasizes the preservation of local structure, aiming to represent nearby points in the high-dimensional space as close together in the low-dimensional space. This makes it effective for visualizing clusters and identifying **local patterns** in the data. It's particularly useful for exploring relationships between nearby data points and uncovering clusters with complex shapes.

To apply this to **NLP**:

In **natural language processing (NLP) analysis, data often exists in a high-dimensional space where each dimension represents a unique word or feature.** Techniques like t-SNE and SVD help reduce this high-dimensional space to a lower-dimensional space, making it easier to visualize and interpret.
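
To see where the high-dimensional space comes from, the sketch below turns three made-up token lists into TF-IDF vectors by hand. It uses the plain textbook weighting `tf * log(N / df)`; scikit-learn's `TfidfVectorizer` uses a smoothed variant and normalizes each row, so its numbers will differ, but the shape of the result is the same idea: one row per document, one column per unique word.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [
    ["climate", "science", "evidence"],
    ["weather", "hotter", "weather"],
    ["science", "weather"],
]

# The vocabulary defines the dimensions: one axis per unique word
vocab = sorted({word for doc in docs for word in doc})

def tfidf_vector(doc):
    counts = Counter(doc)
    vector = []
    for word in vocab:
        tf = counts[word] / len(doc)                    # term frequency
        df = sum(1 for d in docs if word in d)          # document frequency
        vector.append(tf * math.log(len(docs) / df))    # plain tf * log(N / df)
    return vector

matrix = [tfidf_vector(doc) for doc in docs]
print(vocab)
for row in matrix:
    print([round(x, 2) for x in row])
```

Even this tiny corpus needs five dimensions, and most entries are zero. A real survey produces hundreds or thousands of columns, which is why a reduction step is needed before plotting.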

When we talk about **global patterns in NLP**, we're referring to the overall structure of the dataset, such as **broad topics or themes that encompass the entire corpus of text**. SVD is particularly useful for capturing these global patterns by decomposing the data into its principal components, which represent the main sources of variation in the text.

On the other hand, **local patterns refer to more specific relationships or clusters of words that occur within smaller subsets of the data**. t-SNE excels at preserving these local patterns by focusing on the relationships between nearby data points. In the context of NLP, t-SNE can help reveal clusters of words that are semantically similar or frequently co-occur within documents.

So, in **NLP analysis, SVD can provide insights into the overarching structure and main themes of the text, while t-SNE can help uncover more nuanced relationships and clusters of words that exist within smaller sections of the data**.

Clusters are tricky to visualize.

First let's try **Truncated Singular Value Decomposition (SVD)** for dimensionality reduction (because the responses are clustered in a higher-dimensional space, and we want to plot them in two dimensions).

The plot shows each data point as a dot, colored by its assigned cluster, allowing us to observe patterns and structures in the data. Closer dots indicate higher similarity, while separated clusters represent distinct groups within the data.

In [None]:
from sklearn.decomposition import TruncatedSVD

# Fit TruncatedSVD to the TF-IDF vectors
svd = TruncatedSVD(n_components=2)
tfidf_vectors = tfidf_vectorizer.fit_transform(valid_responses)
tfidf_vectors_2d = svd.fit_transform(tfidf_vectors)

# Plot the clusters
plt.figure(figsize=(10, 6))
for cluster_num in range(num_clusters):
    plt.scatter(tfidf_vectors_2d[cluster_labels == cluster_num, 0], tfidf_vectors_2d[cluster_labels == cluster_num, 1], label=f'Cluster {cluster_num}', alpha=0.5)
plt.title('K-means Clustering Visualization')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.legend()
plt.show()


Next - **t-SNE** to project high-dimensional data into a 2D space while preserving the local structure of the data points. Each data point, representing a response, is plotted on a scatter plot, with its position determined by its similarity to other points. Points belonging to different clusters are color-coded, allowing us to observe how well-separated or overlapped the clusters are in the 2D space. This visualization helps us gain insights into the relationships between responses and the effectiveness of the clustering algorithm in distinguishing different groups based on their content.

In [None]:

from sklearn.manifold import TSNE
import seaborn as sns

# Initialize t-SNE with 2 components (2D space) and random initialization
tsne = TSNE(n_components=2, random_state=42, init='random')

# Fit and transform the TF-IDF vectors to 2D space
tfidf_vectors_2d = tsne.fit_transform(tfidf_vectors)

# Create a DataFrame with the 2D vectors and cluster labels
tsne_df = pd.DataFrame({'X': tfidf_vectors_2d[:, 0], 'Y': tfidf_vectors_2d[:, 1], 'Cluster Label': cluster_labels})

# Plot the scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=tsne_df, x='X', y='Y', hue='Cluster Label', palette='viridis', alpha=0.7)
plt.title('t-SNE Visualization of Cluster Overlap')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()

In [None]:
# Cluster 1 looks very tight - let's print it out to see why:
# print out the first 10 comments in Cluster 1
cluster_1_responses = valid_responses[cluster_labels == 1]

for i, response in enumerate(cluster_1_responses[:10]):
    print(f"Response {i+1}: {response}")

## NLP Theme Analysis

Recall that **topic modeling** involves identifying and understanding recurring patterns, topics, or themes within a collection of documents.

There are lots of methods; I'll cover one basic one, **LDA**.

In [None]:
#######  First, let's look only at folks who changed their beliefs about climate change and try to understand why they changed their minds ######

# Filter the dataframe based on the condition glbcc_change = 1
changed_views_df = df[df['glbcc_change'] == 1]

# Extract the glbcc_change_led2 column
change_led2_responses = changed_views_df['glbcc_change_led2']

# Display the first few responses to understand the data
print(change_led2_responses.head())

# Filter out NaN values in glbcc_change_led2
valid_responses = changed_views_df['glbcc_change_led2'].dropna()

**LDA analysis - BASIC TOPIC MODELING**

Latent Dirichlet Allocation (LDA) is a probabilistic topic modeling technique designed to reveal hidden themes within a set of documents. Operating as a generative probabilistic model, LDA assumes that documents are composed of a mix of topics, and topics are comprised of words. The algorithm iteratively  assigns topics to words and adjusts probability distributions, ultimately producing the probability distribution of topics for each document and the probability distribution of words for each topic.
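
The "generative" story behind LDA can be simulated in a few lines. This sketch only runs the model forwards: it draws a topic mixture for one document and then draws words from those topics. It does not do the inference that gensim performs (working backwards from words to topics), and the two topics and their word probabilities are invented for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Two hypothetical topics, each a probability distribution over words
topics = {
    "science": {"evidence": 0.5, "research": 0.3, "data": 0.2},
    "weather": {"storm": 0.4, "heat": 0.4, "drought": 0.2},
}

def sample_dirichlet(alpha, size):
    """Draw topic proportions from a symmetric Dirichlet via normalized gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(num_words, alpha=0.5):
    names = list(topics)
    mixture = sample_dirichlet(alpha, len(names))  # this document's topic mix
    words = []
    for _ in range(num_words):
        topic = random.choices(names, weights=mixture)[0]  # pick a topic for this word
        dist = topics[topic]
        words.append(random.choices(list(dist), weights=list(dist.values()))[0])  # pick a word
    return mixture, words

mixture, words = generate_document(num_words=8)
print([round(m, 2) for m in mixture], words)
```

Fitting LDA amounts to asking which topic-word distributions and per-document mixtures would make the observed responses most plausible under this process.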

In [None]:
# Add additional stop words if needed
custom_stop_words = set(stopwords.words('english'))
custom_stop_words.update(['can', 'will'])

# Tokenize and preprocess the responses using lemmatization and refined stop word removal
lemmatizer = WordNetLemmatizer()
tokenized_responses = [
    [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(str(response)) if word.isalnum() and word.lower() not in custom_stop_words and len(word) > 2]
    for response in valid_responses
    if isinstance(response, str)  # Check if the response is a string
]

# Remove empty sequences
tokenized_responses = [tokens for tokens in tokenized_responses if len(tokens) > 0]

# Check if there are any valid tokenized responses
if not tokenized_responses:
    print("No valid tokenized responses to analyze.")
else:
    # Create a dictionary representation of the documents
    dictionary = corpora.Dictionary(tokenized_responses)

    # Create a bag-of-words representation of the documents
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_responses]

    # Train the LDA model
    lda_model = LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)

    # Print the topics and their keywords  note usually start with 3 to 5 keywords
    topics = lda_model.show_topics(num_topics=5, num_words=4, formatted=False)
    for topic_num, keywords in topics:
        print(f"Topic {topic_num + 1}: {', '.join([word[0] for word in keywords])}")

    # Compute coherence score using C_v coherence which is a metric used to evaluate topic coherence by measuring the semantic
    # similarity between the top words within each topic. Higher C_v coherence scores indicate more coherent and interpretable
    # topics in topic modeling results.
    coherence_model = CoherenceModel(model=lda_model, texts=tokenized_responses, corpus=corpus, coherence='c_v')
    cv_coherence = coherence_model.get_coherence()

    print(f"C_v Coherence Score: {cv_coherence}")

Scientific evidence seems to be a theme - let's look at those folks.

Side note --- I've chatted with some NLP folks on campus and they often use LLMs (like ChatGPT) to provide themes.  Of course... this depends on the privacy of your data.

**OK but let's work with just these words...**

In [None]:
# First, let's do this with science as a key word

# Filter the DataFrame to include only responses mentioning 'science' in glbcc_change_led2
science_responses_df = df[df['glbcc_change_led2'].str.contains('science', case=False, na=False)].copy()

# Replace numeric values with meaningful labels for gender
science_responses_df.loc[:, 'gender'] = science_responses_df['gender'].replace({0: 'Female', 1: 'Male'})

# Replace numeric values with meaningful labels for party affiliation
science_responses_df.loc[:, 'party_w_lean'] = science_responses_df['party_w_lean'].replace({1: 'Democrat', 2: 'Republican', 3: 'Other'})

# Create a new column to represent combinations of gender and party affiliation
science_responses_df.loc[:, 'gender_party'] = science_responses_df['gender'] + ' ' + science_responses_df['party_w_lean']

# Count the occurrences of each combination of gender and party affiliation
gender_party_counts = science_responses_df['gender_party'].value_counts()

# Plot the distribution with specific color coding
plt.figure(figsize=(10, 6))
# Define colors for each combination of gender and party affiliation
colors = {'Male Democrat': 'blue', 'Female Democrat': 'lightblue',
          'Male Republican': 'red', 'Female Republican': 'lightcoral',
          'Male Other': 'green', 'Female Other': 'lightgreen'}
# Map every combination present in the data to a color, falling back to gray for unexpected labels
colors = {key: colors.get(key, 'gray') for key in gender_party_counts.index}
gender_party_counts.plot(kind='bar', color=[colors[x] for x in gender_party_counts.index])
plt.title('Distribution of Party Affiliation across Gender for Respondents Mentioning "Science"')
plt.xlabel('Gender and Party Affiliation')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()



In [None]:
# We can slice this many ways. Here we look at responses mentioning 'weather',
# expressed as a percentage of respondents who changed their climate beliefs

# Filter the DataFrame to include only respondents who changed their climate beliefs
changed_beliefs_df = df[df['glbcc_change'] == 1]

# Calculate the total number of respondents who changed their climate beliefs
total_changed_beliefs = len(changed_beliefs_df)

# Filter the DataFrame to include only responses mentioning 'weather' in glbcc_change_led2
weather_responses_df = changed_beliefs_df[changed_beliefs_df['glbcc_change_led2'].str.contains('weather', case=False, na=False)].copy()

# Replace numeric values with meaningful labels for gender
# (assign the column directly: writing strings into a numeric column via .loc[:, ...] triggers a dtype warning in recent pandas)
weather_responses_df['gender'] = weather_responses_df['gender'].replace({0: 'Female', 1: 'Male'})

# Replace numeric values with meaningful labels for party affiliation
weather_responses_df['party_w_lean'] = weather_responses_df['party_w_lean'].replace({1: 'Democrat', 2: 'Republican', 3: 'Other'})

# Create a new column to represent combinations of gender and party affiliation
weather_responses_df.loc[:, 'gender_party'] = weather_responses_df['gender'] + ' ' + weather_responses_df['party_w_lean']

# Count the occurrences of each combination of gender and party affiliation
gender_party_counts = weather_responses_df['gender_party'].value_counts()

# Calculate the percentages
gender_party_percentages = (gender_party_counts / total_changed_beliefs) * 100

# Plot the distribution with specific color coding
plt.figure(figsize=(10, 6))
# Define colors for each combination of gender and party affiliation
colors = {'Male Democrat': 'blue', 'Female Democrat': 'lightblue',
          'Male Republican': 'red', 'Female Republican': 'lightcoral',
          'Male Other': 'green', 'Female Other': 'lightgreen'}
# Map every combination present in the data to a color, falling back to gray for unexpected labels
colors = {key: colors.get(key, 'gray') for key in gender_party_percentages.index}
gender_party_percentages.plot(kind='bar', color=[colors[x] for x in gender_party_percentages.index])
plt.title('Distribution of Party Affiliation across Gender for Respondents Mentioning "Weather" (as a percentage of climate belief changers)')
plt.xlabel('Gender and Party Affiliation')
plt.ylabel('Percentage')
plt.xticks(rotation=45)
plt.show()
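
The 'science' and 'weather' cells repeat the same filter, relabel, and count steps, so they could be folded into one helper. The sketch below is one possible way to do that, not the workshop's own code: the function name `keyword_share_by_group` and the tiny demo DataFrame are made up for illustration, and the column names and numeric codes are assumed to match the ones used above.

```python
import pandas as pd

# Label mappings assumed to match the survey coding used in the cells above
GENDER_LABELS = {0: 'Female', 1: 'Male'}
PARTY_LABELS = {1: 'Democrat', 2: 'Republican', 3: 'Other'}

def keyword_share_by_group(data, keyword, text_col='glbcc_change_led2'):
    """Percentage of rows in `data` that mention `keyword`, split by gender and party."""
    # Keep only rows whose free-text answer contains the keyword (case-insensitive)
    mentions = data[data[text_col].str.contains(keyword, case=False, na=False)]
    # Build labels such as 'Female Democrat' from the numeric codes
    labels = mentions['gender'].map(GENDER_LABELS) + ' ' + mentions['party_w_lean'].map(PARTY_LABELS)
    # Express each group's count as a share of all rows passed in
    return labels.value_counts() / len(data) * 100

# Hypothetical mini-dataset, only to show the shape of the result
demo_df = pd.DataFrame({
    'gender': [0, 1, 1, 0],
    'party_w_lean': [1, 2, 2, 3],
    'glbcc_change_led2': ['Extreme weather events', 'The weather changed', 'science', None],
})
demo_shares = keyword_share_by_group(demo_df, 'weather')
print(demo_shares)
```

With the workshop data this would be called as `keyword_share_by_group(changed_beliefs_df, 'weather')`, and the returned Series can be passed to `.plot(kind='bar')` as before.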


## OK, now one last visualization - Co-Occurrence Analysis

A **co-occurrence matrix** is a fundamental concept in natural language processing (NLP) that **helps us understand how often words appear together** in a given context.

Imagine we have a large collection of text documents, and we want to analyze which words tend to occur together frequently. To create a co-occurrence matrix, we first identify all the unique words in our corpus (vocabulary). Then, for each pair of words, we count how many times they appear together within a specified window of text. The resulting matrix provides a numerical representation of word co-occurrences, where each cell indicates the frequency with which two words appear together.
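
To make the counting concrete, here is a minimal sketch on a made-up three-sentence corpus. The function name `window_cooccurrence`, the `window` parameter, and the toy sentences are illustrative assumptions, not part of the workshop data. It counts a pair whenever the two words fall within `window` tokens of each other.

```python
from collections import Counter

def window_cooccurrence(docs, window=2):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for i, left in enumerate(tokens):
            # Look ahead at most `window` tokens past position i
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    # Sort the pair so (a, b) and (b, a) share one key
                    counts[tuple(sorted((left, right)))] += 1
    return counts

# Toy corpus (hypothetical sentences)
toy_docs = [
    "scientific evidence changed my view",
    "weather evidence changed my mind",
    "scientific evidence is strong",
]
pair_counts = window_cooccurrence(toy_docs, window=2)
print(pair_counts[("evidence", "scientific")])  # 2
print(pair_counts[("evidence", "weather")])     # 1
```

A larger `window` captures looser associations; treating each whole document as the window simply counts how many documents contain both words.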

In [None]:
# Combine all responses into a single text
all_responses = ' '.join(str(response) for response in changed_views_df['glbcc_change_led2'])

# Tokenize the text
tokens = word_tokenize(all_responses)

# Remove stopwords and non-alphabetic tokens
stop_words = set(stopwords.words('english'))
filtered_tokens = [word.lower() for word in tokens if word.isalpha() and word.lower() not in stop_words]

# Calculate word frequencies
freq_dist = FreqDist(filtered_tokens)

# Extract the 20 most frequent words (the co-occurrence matrix is built over these)
top_words = [word for word, _ in freq_dist.most_common(20)]

# Create a co-occurrence matrix
co_occurrence_matrix = pd.DataFrame(index=top_words, columns=top_words, data=0)

# Count co-occurrences: each pair is counted once for every response that contains both words
for response in changed_views_df['glbcc_change_led2']:
    response_tokens = word_tokenize(str(response))
    response_tokens = [word.lower() for word in response_tokens if word.isalpha() and word.lower() not in stop_words]

    for i, word1 in enumerate(top_words):
        if word1 in response_tokens:
            for j, word2 in enumerate(top_words):
                if j > i and word2 in response_tokens:
                    co_occurrence_matrix.at[word1, word2] += 1
                    co_occurrence_matrix.at[word2, word1] += 1

# Create a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(co_occurrence_matrix, annot=True, cmap="YlGnBu", fmt='g')
plt.title('Co-occurrence Heatmap of Top 20 Words: Why People Changed Their Mind About Climate Change')
plt.show()
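
The nested loops above count, for every pair of top words, the number of responses that contain both. The same counts can be obtained with a matrix product: build a binary response-by-word matrix `X` and compute `X.T @ X`. The sketch below shows this on a made-up example. The function name `cooccurrence_from_presence`, the toy responses, and the toy vocabulary are assumptions for illustration only, and the diagonal is zeroed to match the loop version, which never pairs a word with itself.

```python
import numpy as np
import pandas as pd

def cooccurrence_from_presence(tokenized_docs, vocab):
    """Document-level co-occurrence counts for `vocab` via a binary matrix product."""
    # presence[d, w] is 1 if word w occurs anywhere in document d, else 0
    presence = np.array(
        [[1 if word in doc_words else 0 for word in vocab]
         for doc_words in map(set, tokenized_docs)]
    )
    # Entry (a, b) of X.T @ X is the number of documents containing both a and b
    counts = presence.T @ presence
    # The diagonal holds each word's document frequency; zero it out
    np.fill_diagonal(counts, 0)
    return pd.DataFrame(counts, index=vocab, columns=vocab)

# Toy tokenized responses and vocabulary (hypothetical)
toy_responses = [
    ['science', 'evidence', 'weather'],
    ['evidence', 'weather'],
    ['science', 'evidence'],
]
toy_vocab = ['science', 'evidence', 'weather']
toy_matrix = cooccurrence_from_presence(toy_responses, toy_vocab)
print(toy_matrix)
```

The resulting DataFrame has the same layout as `co_occurrence_matrix`, so it could be passed to `sns.heatmap` in the same way.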