# Modeling

In this notebook we perform unsupervised topic modeling with three models: Non-negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA).

In [35]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation

We start by loading the data that we cleaned previously.

In [10]:
# Load the cleaned data
cleaned_df = pd.read_csv('MSADS509_News_Project_Dataset/cleaned.csv')

# Display the first few rows of the dataframe
cleaned_df.head()

Unnamed: 0,source,url,content,tokens
0,cnn,https://www.cnn.com/2024/02/12/politics/cq-bro...,Chairman of the Joint Chiefs of Staff Gen. CQ ...,"['chairman', 'joint', 'chiefs', 'staff', 'gen'..."
1,cnn,https://www.cnn.com/2024/02/12/politics/trump-...,Trump has endorsed North Carolina Republican P...,"['trump', 'endorsed', 'north', 'carolina', 're..."
2,cnn,https://www.cnn.com/2024/02/12/politics/senate...,The Senate is inching closer to final passage ...,"['senate', 'inching', 'closer', 'final', 'pass..."
3,cnn,https://www.cnn.com/2024/02/12/politics/bidens...,Biden and King Abdullah II of Jordan met Monda...,"['biden', 'king', 'abdullah', 'ii', 'jordan', ..."
4,cnn,https://www.cnn.com/2024/02/12/politics/trump-...,Trump on Monday asked the SupremeCourt to step...,"['trump', 'monday', 'asked', 'supremecourt', '..."
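
One detail in the preview is worth noting: because the data was saved to CSV and read back, the tokens column holds strings that only look like lists, and the saved index appears as an "Unnamed: 0" column. Below is a minimal sketch of one way to handle both. It uses a small in-memory CSV as a stand-in for cleaned.csv; the two sample rows, the example.com URLs, and the name demo_df are made up for illustration.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row stand-in for cleaned.csv (same column layout as above)
csv_text = (
    ",source,url,content,tokens\n"
    "0,cnn,https://example.com/a,Trump asked the court,\"['trump', 'asked', 'court']\"\n"
    "1,foxnews,https://example.com/b,Biden met the king,\"['biden', 'met', 'king']\"\n"
)

# index_col=0 reuses the saved index instead of creating an "Unnamed: 0" column
demo_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Each tokens cell is a plain string at this point, not a list
tokens_type_before = type(demo_df.loc[0, "tokens"])

# ast.literal_eval safely converts the string representation back into a list
demo_df["tokens"] = demo_df["tokens"].apply(ast.literal_eval)

print(tokens_type_before.__name__, "->", type(demo_df.loc[0, "tokens"]).__name__)
print(demo_df.loc[0, "tokens"])
```

The models below are fit on the content column, so this conversion only matters if the tokens column is used later.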


Next, we split the data by source, since our goal is to run a topic analysis on CNN and Fox News separately and then compare the results.

In [15]:
# Split up the cleaned DataFrame by source
cnn_df = cleaned_df[cleaned_df['source'] == 'cnn']
foxnews_df = cleaned_df[cleaned_df['source'] == 'foxnews']

### NMF Model

We start by converting the text into numerical form with Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, which makes it suitable for the NMF model. We apply TF-IDF to the "content" column.

In [16]:
# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

Next, we start with CNN: we fit the TF-IDF vectorizer on the content column, fit an NMF model with five topics, and print the top five words for each topic.

In [18]:
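# Helper to print the top words per topic (assumed definition; the original is not shown in this section)
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx + 1}:")
        print(" ".join(feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]))

# Process and display topics for CNN
tfidf_cnn = tfidf_vectorizer.fit_transform(cnn_df['content'])
nmf_model_cnn = NMF(n_components=5, random_state=42).fit(tfidf_cnn)
tfidf_feature_names_cnn = tfidf_vectorizer.get_feature_names_out()
print("CNN Topics:")
display_topics(nmf_model_cnn, tfidf_feature_names_cnn, 5)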

CNN Topics:
Topic 1:
democrat republican trump suozzi voter
Topic 2:
trump case trial court supremecourt
Topic 3:
nato trump defense spending alliance
Topic 4:
biden hur classified report documents
Topic 5:
ukraine aid senate border republican


Next, we repeat the NMF analysis on the Fox News data.

In [20]:
# Process and display topics for Fox News (fit_transform refits the shared vectorizer, replacing the CNN vocabulary)
tfidf_foxnews = tfidf_vectorizer.fit_transform(foxnews_df['content'])
nmf_model_foxnews = NMF(n_components=5, random_state=42).fit(tfidf_foxnews)
tfidf_feature_names_foxnews = tfidf_vectorizer.get_feature_names_out()
print("\nFox News Topics:")
display_topics(nmf_model_foxnews, tfidf_feature_names_foxnews, 5)


Fox News Topics:
Topic 1:
house border senate republican aid
Topic 2:
bobulinski biden hunterbiden business family
Topic 3:
read trump campaign trail biden
Topic 4:
biden hur president special counsel
Topic 5:
trump nato alliance putin russia


### LSA Model

We use the same TF-IDF representation as before and fit the LSA model on it. For the model itself we use scikit-learn's TruncatedSVD, since applying truncated SVD to a TF-IDF matrix is how LSA is performed. For clarity, we re-initialize a separate TF-IDF vectorizer for each source and reapply it to the CNN and Fox News data.

In [27]:
# Re-initialize the TF-IDF Vectorizer and apply it to both subsets again for clarity
tfidf_vectorizer_cnn = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf_cnn = tfidf_vectorizer_cnn.fit_transform(cnn_df['content'])
tfidf_feature_names_cnn = tfidf_vectorizer_cnn.get_feature_names_out()

tfidf_vectorizer_foxnews = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf_foxnews = tfidf_vectorizer_foxnews.fit_transform(foxnews_df['content'])
tfidf_feature_names_foxnews = tfidf_vectorizer_foxnews.get_feature_names_out()

In [28]:
# Apply Truncated SVD for LSA on the TF-IDF matrices
lsa_model_cnn = TruncatedSVD(n_components=5, random_state=42)
lsa_cnn = lsa_model_cnn.fit_transform(tfidf_cnn)

lsa_model_foxnews = TruncatedSVD(n_components=5, random_state=42)
lsa_foxnews = lsa_model_foxnews.fit_transform(tfidf_foxnews)
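
As a rough check on how much structure the chosen number of components captures, TruncatedSVD exposes explained_variance_ratio_. The sketch below runs on a made-up toy corpus; the documents and the names X, svd, and doc_topic are illustrative assumptions. Note also that, unlike NMF, the SVD components can contain negative weights, so ranking words by signed weight (as the display function in this notebook does) is one choice among several; ranking by absolute value is another.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration)
docs = [
    "court trial judge ruling",
    "court judge appeal trial",
    "court ruling appeal verdict",
    "nato alliance defense spending",
    "nato defense alliance russia",
]

X = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=42)
doc_topic = svd.fit_transform(X)  # one row per document, one column per component

# Share of the total variance captured by each component
print(svd.explained_variance_ratio_)
print(svd.explained_variance_ratio_.sum())
```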

In [30]:
# Function to display topics for LSA (using Truncated SVD components)
def display_topics_svd(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx + 1}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))


In [31]:
# Display LSA Topics for CNN
print("LSA Topics for CNN:")
display_topics_svd(lsa_model_cnn, tfidf_feature_names_cnn, 5)

LSA Topics for CNN:
Topic 1:
trump biden said republican election
Topic 2:
trump case trial willis court
Topic 3:
nato ukraine russia trump defense
Topic 4:
hur biden classified report documents
Topic 5:
ukraine aid willis case border


In [32]:
# Display LSA Topics for Fox News
print("\nLSA Topics for Fox News:")
display_topics_svd(lsa_model_foxnews, tfidf_feature_names_foxnews, 5)


LSA Topics for Fox News:
Topic 1:
biden read trump bobulinski house
Topic 2:
bobulinski hunterbiden biden business family
Topic 3:
read biden counsel hur trail
Topic 4:
hur biden president memory report
Topic 5:
trump nato read bobulinski alliance


### LDA Model

Next, we set up the LDA model, which requires a different vectorization step. Unlike NMF and LSA, which we fit on TF-IDF weights, LDA is a probabilistic model of word counts. Therefore, we use CountVectorizer from sklearn.

We will start by initializing the CountVectorizer and declaring the number of topics we want to use.

In [42]:
# Initialize the CountVectorizer
count_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

# Number of topics
n_topics_lda = 5
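
To illustrate the difference in inputs and outputs, here is a minimal sketch of LDA on a made-up toy corpus. The documents and the names counts, lda, and doc_topic are illustrative assumptions. It shows that CountVectorizer produces integer counts and that each row of the LDA document-topic output is a probability distribution over topics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration)
docs = [
    "court trial judge ruling",
    "court judge appeal trial",
    "court ruling appeal verdict",
    "nato alliance defense spending",
    "nato defense alliance russia",
]

# Raw integer term counts rather than TF-IDF weights
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(counts)

# Each row sums to 1: a topic distribution for one document
print(counts.dtype)
print(doc_topic.sum(axis=1))
```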

We process one news organization at a time, beginning with CNN.

In [43]:
# Apply the vectorizer to the 'content' column for CNN
count_cnn = count_vectorizer.fit_transform(cnn_df['content'])

# Get the feature names (words)
count_feature_names_cnn = count_vectorizer.get_feature_names_out()

In [44]:
# Initialize and fit the LDA model to the CNN data
lda_model_cnn = LatentDirichletAllocation(n_components=n_topics_lda, random_state=42)
lda_cnn = lda_model_cnn.fit_transform(count_cnn)

In [45]:
# Display LDA Topics for CNN (display_topics_svd only reads model.components_, so it works for LDA too)
print("LDA Topics for CNN:")
display_topics_svd(lda_model_cnn, count_feature_names_cnn, 5)

LDA Topics for CNN:
Topic 1:
trump said nato biden house
Topic 2:
democrat trump republican senate biden
Topic 3:
said intelligence cnn turner statement
Topic 4:
biden trump said president voter
Topic 5:
trump election case newyork said


In [46]:
# Apply the vectorizer to the 'content' column for Fox News
count_foxnews = count_vectorizer.fit_transform(foxnews_df['content'])

# Get the feature names (words)
count_feature_names_foxnews = count_vectorizer.get_feature_names_out()

# Initialize and fit the LDA model to the Fox News data
lda_model_foxnews = LatentDirichletAllocation(n_components=n_topics_lda, random_state=42)
lda_foxnews = lda_model_foxnews.fit_transform(count_foxnews)

# Display LDA Topics for Fox News
print("\nLDA Topics for Fox News:")
display_topics_svd(lda_model_foxnews, count_feature_names_foxnews, 5)


LDA Topics for Fox News:
Topic 1:
biden president trump house special
Topic 2:
house said border senate security
Topic 3:
biden bobulinski hunterbiden business said
Topic 4:
trump said president sen ukraine
Topic 5:
read trump biden republican special
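
Since the goal is to compare the two sources, one possible way to quantify how similar two topics are is the Jaccard overlap of their top words. This measure is our choice for illustration only and is not used elsewhere in this notebook; the function name jaccard and the variable names are assumptions. The word lists are copied from the NMF outputs above.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two space-separated word lists have in common."""
    a, b = set(words_a.split()), set(words_b.split())
    return len(a & b) / len(a | b)

# Top words copied from the NMF output above
cnn_nato = "nato trump defense spending alliance"   # CNN Topic 3
fox_nato = "trump nato alliance putin russia"       # Fox News Topic 5
cnn_hur = "biden hur classified report documents"   # CNN Topic 4
fox_hur = "biden hur president special counsel"     # Fox News Topic 4

nato_overlap = jaccard(cnn_nato, fox_nato)  # 3 shared words out of 7 distinct
hur_overlap = jaccard(cnn_hur, fox_hur)     # 2 shared words out of 8 distinct

print(round(nato_overlap, 3), round(hur_overlap, 3))
```

With only five words per topic the score is coarse, so it is best read as a rough guide to which topics line up across sources.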
