# CNN Topic Modeling

This project aims to analyze and visualize the topics covered in CNN articles using the BERTopic model. The goal is to gain insights into the variety of subjects addressed by CNN and to showcase the effectiveness of topic modeling techniques.

In [None]:
# Importing the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from bertopic import BERTopic
import plotly.express as px
from umap import UMAP

In [None]:
# Load the data

articles1 = pd.read_csv("articles1.csv")

In [None]:
articles1.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


In [None]:
# Retaining the data for CNN only

cnn = articles1[articles1['publication'] == 'CNN']

In [None]:
cnn.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
31584,31592,50358,Istanbul attack: Dozens killed at nightclub,CNN,Euan McKirdy,2016-12-31,2016.0,12.0,,Istanbul (CNN) At least 39 people were killed ...
31585,31593,50359,"Alabama, Clemson back in national title game",CNN,Jill Martin,2016-12-31,2016.0,12.0,,Atlanta (CNN) This season’s College Football P...
31586,31594,50360,New year celebrations ring in 2017,CNN,Ray Sanchez,2016-12-31,2016.0,12.0,,(CNN) Revelers on the United States’ west coa...
31587,31595,50361,Trump says he has inside information on hacking,CNN,Kevin Liptak,2017-01-01,2017.0,1.0,,"West Palm Beach, Florida (CNN) Donald Trump s..."
31588,31596,50362,3 dead in Texas plane crash collision,CNN,Tony Marco,2017-01-01,2017.0,1.0,,(CNN) Two small planes collided in Texas on S...
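
The `content` previews above show that CNN articles open with a dateline such as `Istanbul (CNN)` or a bare `(CNN)`. Left in place, that token appears in nearly every document and carries no topical signal. A minimal sketch of a cleanup step follows; the helper name `strip_dateline` and the 60-character cap on the location prefix are assumptions made for illustration and are not part of the original notebook.

```python
import re

# Matches an optional short location prefix followed by "(CNN)" at the start of the text.
# The 60-character cap on the prefix is an arbitrary guard against over-matching.
DATELINE = re.compile(r"^\s*[^()]{0,60}?\(CNN\)\s*")

def strip_dateline(text):
    """Remove a leading '<Location> (CNN)' dateline, if present."""
    return DATELINE.sub("", text, count=1)

samples = [
    "Istanbul (CNN) At least 39 people were killed",
    "(CNN) Two small planes collided in Texas",
    "West Palm Beach, Florida (CNN) Donald Trump",
    "No dateline here",
]
cleaned = [strip_dateline(s) for s in samples]
print(cleaned)
```

In the notebook this could be applied before modeling, for example with `[strip_dateline(t) for t in cnn['content'].dropna()]`.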


In [None]:
cnn_content = cnn['content'].tolist()

# Create a BERTopic model
cnn_model = BERTopic(embedding_model="distilbert-base-nli-mean-tokens")

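# Fit the model and extract topics.
# Note: fit_transform returns (topics, probabilities), so `embeddings` holds
# per-document topic probabilities rather than embedding vectors.
topics, embeddings = cnn_model.fit_transform(cnn_content)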

In [None]:
# Display the top topics
topic_freq = cnn_model.get_topic_info()
print(topic_freq.head())

   Topic  Count                            Name
0     -1   6401                -1_the_to_and_of
1      0    496            0_isis_attack_in_the
2      1    355        1_trump_his_he_president
3      2    187  2_police_officers_said_officer
4      3    150                3_her_she_and_it
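
The topic names in the output above are built from each topic's top words, and several consist mostly of function words (`the`, `to`, `and`). A rough way to quantify that, followed by one common remedy, is sketched below. The stop-word set and the helper names `stopword_share` and `build_model_without_stopwords` are illustrative assumptions. Passing a scikit-learn `CountVectorizer` through BERTopic's `vectorizer_model` argument is one way to filter stop words from the topic representation; the imports are deferred so the first half runs without those packages installed.

```python
# Small illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"the", "to", "and", "of", "in", "his", "he", "her", "she", "it"}

def stopword_share(topic_name):
    """Fraction of a topic name's words that are stop words, e.g. '3_her_she_and_it'."""
    words = topic_name.split("_")[1:]  # drop the leading topic id
    return sum(w in STOP_WORDS for w in words) / len(words)

# Topic names copied from the get_topic_info() output above.
names = [
    "-1_the_to_and_of",
    "0_isis_attack_in_the",
    "1_trump_his_he_president",
    "2_police_officers_said_officer",
    "3_her_she_and_it",
]
shares = {name: stopword_share(name) for name in names}
for name, share in shares.items():
    print(f"{name}: {share:.2f}")

def build_model_without_stopwords():
    """Sketch: same embedding model as above, with English stop words removed from topic words."""
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(stop_words="english")
    return BERTopic(
        embedding_model="distilbert-base-nli-mean-tokens",
        vectorizer_model=vectorizer,
    )
```

The vectorizer only changes how topics are described after clustering, so it should make the names more readable without changing which articles end up in topic -1.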


In [None]:
# Visualize pairwise topic similarity as a heatmap
cnn_model.visualize_heatmap()
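
According to the BERTopic documentation, the heatmap is based on a cosine similarity matrix between topic embeddings. The sketch below computes such a matrix for three made-up topic vectors to show how the cells are read: 1 means the same direction, 0 means unrelated. The vectors and the function names `cosine` and `similarity_matrix` are hypothetical and stand in for the library's internals.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vector i with vector j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

topic_vectors = [
    [1.0, 0.0],  # hypothetical topic A
    [0.0, 1.0],  # hypothetical topic B, orthogonal to A
    [1.0, 1.0],  # hypothetical topic C, halfway between A and B
]
sim = similarity_matrix(topic_vectors)
for row in sim:
    print([round(x, 3) for x in row])
```

Bright off-diagonal cells in the real heatmap mark topic pairs that may be candidates for merging.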

In [None]:
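# Reduce dimensionality with UMAP. fit_transform returns (topics, probabilities), so embed the articles explicitly.
from sentence_transformers import SentenceTransformer
doc_embeddings = SentenceTransformer("distilbert-base-nli-mean-tokens").encode(cnn_content)
umap_embeddings = UMAP(n_neighbors=15, n_components=2, metric="cosine").fit_transform(doc_embeddings)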

In [33]:
# Create a mapping of topic numbers to their most representative words
topic_mapping = {}
for topic_num in set(topics):
    if topic_num == -1:
        topic_mapping[topic_num] = "Outliers"
    else:
        topic_words = [word for word, _ in cnn_model.get_topic(topic_num)[:5]]
        topic_mapping[topic_num] = ", ".join(topic_words)

In [34]:
# Create a dataframe for visualization
vis_df = pd.DataFrame(umap_embeddings, columns=["x", "y"])
vis_df["Topic"] = [topic_mapping[t] for t in topics]

In [36]:
# Visualize the topics using a scatter plot
# Topic labels are categorical, so a continuous color scale does not apply here
fig = px.scatter(vis_df, x="x", y="y", color="Topic", hover_name="Topic")
fig.show()
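
With one legend entry per topic, the scatter plot can become hard to read when the model finds many topics. One option is to keep the labels of the largest topics and group the rest under a single label before plotting. A minimal sketch, assuming a hypothetical helper `lump_small_topics` and a cutoff `top_n`, neither of which is part of the original notebook:

```python
from collections import Counter

def lump_small_topics(labels, top_n=10, other="Other"):
    """Keep the top_n most frequent labels; replace all others with `other`."""
    keep = {label for label, _ in Counter(labels).most_common(top_n)}
    return [label if label in keep else other for label in labels]

# Toy labels in the same "word, word" style as topic_mapping above.
labels = [
    "Outliers",
    "isis, attack",
    "Outliers",
    "trump, president",
    "isis, attack",
    "golf, woods",
]
lumped = lump_small_topics(labels, top_n=2)
print(lumped)
```

Applied to the notebook's dataframe, this would look like `vis_df["Topic"] = lump_small_topics(vis_df["Topic"].tolist(), top_n=10)` ahead of the `px.scatter` call.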

## Insights

- CNN's coverage spans a wide range of topics, from politics (e.g., Trump, Clinton, Obamacare) and international relations (e.g., China, North Korea, Syria) to sports (e.g., NBA, NFL, golf) and natural disasters (e.g., earthquakes, hurricanes, storms).

- Significant coverage is dedicated to both national and international events, reflecting the global nature of CNN's audience and interests.

- The topics reveal a focus on major personalities and figures, such as politicians (e.g., Trump, Clinton), sports stars (e.g., Woods, Ronaldo), and public figures (e.g., Cosby, Griffin).

- Health-related issues and studies are a recurring theme (e.g., cancer, Zika, marijuana, opioids), which suggests that health news makes up a steady share of the coverage.

- A variety of human interest stories and lighter topics are covered, such as architecture, fashion, and animal stories, providing a diverse mix of content for readers.

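Taken together, the topics point to coverage that mixes hard news, sports, health, and lighter features. These observations describe only the articles the model assigned to a topic: the output above shows 6,401 articles left in the outlier group (topic -1).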