# Cluster & Topic Analysis on The Hill Articles
This code loads the scraped articles CSV file into a DataFrame, preprocesses the article text, vectorizes it with TfidfVectorizer, applies k-means clustering and LDA topic modeling, and displays the top contributing keywords for each LDA topic.

In [1]:
import pandas as pd
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation as LDA
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from collections import Counter
import pyLDAvis
import pyLDAvis.sklearn

# Load the CSV file into a pandas DataFrame
df = pd.read_csv("the_hill_articles.csv")

# Preprocess the article text (in the 'body' column)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
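    # Replace punctuation with spaces and convert to lowercase
    # (re.escape keeps characters such as ']' and '\' literal inside the character class)
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text.lower())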
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Remove stopwords and lemmatize tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    
    return " ".join(tokens)

df['preprocessed_body'] = df['body'].apply(preprocess_text)

# Transform the preprocessed text using TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['preprocessed_body'])

# Apply k-means clustering
num_clusters = 4
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
df['cluster'] = kmeans.fit_predict(X)

# Apply LDA topic modeling
num_topics = 4  # Since we have 4 categories
lda = LDA(n_components=num_topics, random_state=42)
lda.fit(X)

# Display the top contributing keywords for each topic
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx + 1}:")

        # Rank words by their weight in this topic (LDA pseudo-counts, not raw frequencies)
        topic_word_weights = Counter({feature_names[i]: topic[i] for i in range(len(topic))})
        top_words = topic_word_weights.most_common(no_top_words)

        # Display the top words and their weights
        for word, weight in top_words:
            print(f"{word}: {weight:.2f}")
        print()

no_top_words = 10
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
display_topics(lda, vectorizer.get_feature_names_out(), no_top_words)


Topic 1:
trump: 112.00
asylum: 81.46
law: 63.04
student: 61.76
bragg: 60.69
fazio: 60.25
antisemitism: 56.94
gas: 55.49
va: 55.30
legal: 54.75

Topic 2:
said: 249.98
percent: 141.45
year: 119.12
biden: 118.03
house: 115.46
state: 112.24
would: 111.78
president: 111.66
trump: 111.63
senate: 109.83

Topic 3:
epa: 74.49
transit: 70.86
hampshire: 49.67
china: 46.74
iowa: 46.42
russia: 43.79
assembly: 42.28
public: 41.96
wafer: 39.77
ford: 36.61

Topic 4:
bank: 108.37
rate: 90.18
violence: 87.51
fed: 79.07
china: 78.64
child: 69.01
ai: 62.32
israel: 59.65
care: 56.20
price: 56.04





**Interpretation**: 

**Topic 1:**
This topic seems to revolve around Trump, asylum policy, legal matters ('law', 'legal', 'bragg'), and antisemitism. It may cover immigration policy and the legal disputes surrounding it, although terms such as 'gas' and 'va' suggest that some unrelated themes are mixed in.

**Topic 2:**
This topic appears to cover general political news, focusing on President Biden, former President Trump, the Senate, and the House (the keyword 'house' could refer to either the House of Representatives or the White House). The highest-weighted terms are generic reporting words such as 'said', 'percent', and 'would', so it likely gathers coverage of policies, political events, and polling or approval figures.

**Topic 3:**
This topic seems to combine environmental and transportation policy ('epa', 'transit') with international affairs involving China and Russia and regional terms such as 'iowa' and 'hampshire'. It could be about environmental regulations, public transit policies, and geopolitical matters.

**Topic 4:**
This topic appears to involve financial and social issues, including banks, interest rates, the Federal Reserve, prices, violence, and child care. It may discuss economic policies as well as social issues and their impact on communities, with some international coverage ('china', 'israel') mixed in.

## Topic Modeling Visualization

In [2]:
pyLDAvis.enable_notebook()
lda_viz = pyLDAvis.sklearn.prepare(lda, X, vectorizer, mds='tsne')

pyLDAvis.display(lda_viz)


**Interpretation:** The bubbles on the intertopic distance map are well spread out and do not overlap, which suggests that the topics are well separated and share few keywords. This may indicate that the topic model has captured distinct themes within the dataset. Because the layout is computed with t-SNE (`mds='tsne'`), the exact distances between bubbles should be read loosely.
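The visual impression of separation can also be checked numerically. pyLDAvis bases its layout on the Jensen-Shannon divergence between topic-word distributions, so a rough sketch is to compute pairwise Jensen-Shannon distances directly. The weights below are invented stand-ins for `lda.components_`, and `topic_distance_matrix` is a hypothetical helper, not part of pyLDAvis or scikit-learn.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def topic_distance_matrix(components):
    """Pairwise Jensen-Shannon distances between topics (0 = identical, 1 = disjoint)."""
    components = np.asarray(components, dtype=float)
    # Normalize each row so that it is a probability distribution over words.
    topic_word = components / components.sum(axis=1, keepdims=True)
    n_topics = topic_word.shape[0]
    distances = np.zeros((n_topics, n_topics))
    for i in range(n_topics):
        for j in range(i + 1, n_topics):
            d = jensenshannon(topic_word[i], topic_word[j], base=2)
            distances[i, j] = distances[j, i] = d
    return distances


# Invented topic-word weights; in the notebook this would be lda.components_.
toy_components = np.array([
    [9.0, 8.0, 0.5, 0.5],
    [8.0, 9.0, 0.5, 0.5],
    [0.5, 0.5, 9.0, 8.0],
])
toy_distances = topic_distance_matrix(toy_components)
print(np.round(toy_distances, 2))
```

Values close to 1 for every pair of topics would support the reading that the topics share few keywords, while small values would point to overlapping topics.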

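The visualized topics correspond to the topics in the keyword output above, but pyLDAvis renumbers them by their share of the corpus, so the numbering differs:

**Topic 1** (Topic 2 in the keyword output):
- Political subjects and events
- Biden and Trump administrations, legislation, and policies

**Topic 2** (Topic 4 in the keyword output):
- Financial and economic policies
- Social issues and international relations

**Topic 3** (Topic 1 in the keyword output):
- Trump, asylum policies, and legal matters
- Immigration policies and social issues

**Topic 4** (Topic 3 in the keyword output):
- Regional and international affairs
- Public transit policies and political leadership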

# Biggest Challenges of the Project

**1. Class imbalance in our dataset**
- Finding samples with richer sentiments would lead to a more realistic model that performs better on real examples
- Adding more data to the sample set might also improve model performance

A dataset with imbalanced classes, such as one dominated by neutral sentiments, can hurt the model's performance. The model may learn to predict the majority class accurately while ignoring the minority classes. This is why our model predicted Neutral for many of our example sentences.
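One common mitigation, sketched here on invented labels, is to weight each class inversely to its frequency so that errors on the minority classes cost more during training. The label list is hypothetical and does not reflect the project's actual class counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented sentiment labels with a heavy neutral majority.
toy_sentiments = np.array(["neutral"] * 8 + ["positive"] + ["negative"])

classes = np.unique(toy_sentiments)
# "balanced" assigns n_samples / (n_classes * class_count) to each class.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_sentiments)
class_weights = {str(c): float(w) for c, w in zip(classes, weights)}

for label, weight in class_weights.items():
    print(f"{label}: {weight:.2f}")
```

Many scikit-learn classifiers accept the same idea directly through `class_weight="balanced"`; whether it helps here would need to be checked on the real data.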

**2. Evaluation of Classifiers**
- Evaluating the performance of a text classification model was also challenging. Common evaluation metrics, such as accuracy, can be misleading on imbalanced datasets, so it is important to choose both the model and the performance metrics carefully.
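The gap between accuracy and imbalance-aware metrics can be illustrated with a small sketch. The labels are invented, and the "model" simply predicts the majority class for every example, which mirrors the Neutral-heavy predictions described above.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Invented ground truth: 8 neutral, 1 positive, 1 negative.
y_true = ["neutral"] * 8 + ["positive", "negative"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["neutral"] * 10

accuracy = accuracy_score(y_true, y_pred)
# Balanced accuracy is the mean of per-class recall.
balanced_acc = balanced_accuracy_score(y_true, y_pred)
# Macro F1 averages per-class F1, so ignored classes pull the score down.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"Accuracy:          {accuracy:.2f}")      # 0.80
print(f"Balanced accuracy: {balanced_acc:.2f}")  # 0.33
print(f"Macro F1:          {macro_f1:.2f}")      # 0.30
```

Accuracy looks acceptable even though two of the three classes are never predicted, while balanced accuracy and macro F1 expose the problem.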