# Topic Modeling -- Text Segmentation Using Unsupervised Learning

In this project, roughly 12,000 NPR articles are clustered into several groups using two common text segmentation techniques: Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF).

## Import Required Packages

In [1]:
# Data manipulation libraries
import pandas as pd
import numpy as np
# Display float values with 3 decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Data visualization libraries
from IPython.display import display, Markdown

# NLP libraries
import spacy
nlp = spacy.load('en_core_web_sm')

# Modeling libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Mounting the drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


## Loading Data

In [2]:
# Importing the text data
data_orig = pd.read_csv('/content/drive/MyDrive/Python_files/npr.csv')
# Ensuring the original data remains unchanged
df = data_orig.copy()

## Data Overview

In [3]:
# Viewing a sample of data
df.sample(10)

Unnamed: 0,Article
4552,Want to launder $20 million in illicit drug mo...
7818,The U. S. spends a lot of money on preschool ...
11493,The holiday competition to warm the cold cockl...
688,Biologist Shaun Clements stands in the winter ...
9070,One of the most surprising stories of the Olym...
6486,Imagine this. You’re a student in a remote ...
10296,Scientists are one step closer to understandin...
8710,There is an old stereotype about women in poli...
489,Gene Demby of NPR’s Code Switch team is in our...
8072,The Afghan army commander said the treacherous...


In [4]:
# Shape of data
print(f'The data contains {df.shape[0]} articles.')

The data contains 11992 articles.


In [5]:
# Randomly viewing a few articles
from random import randint, seed
seed(0)
for i in range(5):
    print(f'Random Article {i+1}:')
    # randint includes both endpoints, so the upper bound must be the last valid index
    article = nlp(df.Article[randint(0, df.shape[0] - 1)])
    display(Markdown(str(article[:100]) + '...'))
    print('\n'+ ('*'*100 + '\n')*2)

Random Article 1:


I first met Tess Johnston in the late 1990s in a yellow, stucco apartment building where she lived in Shanghai’s former French Concession. As was her habit, she dropped the key from her   window and I let myself in. Her drafty apartment was crammed with books and street maps. Over tea, Johnston, then in her late 60s, regaled me with her latest adventures, rushing through the city’s back alleys to photograph old   villas before they succumbed to sledge hammers. ”There are heartbreaking times when we...


****************************************************************************************************
****************************************************************************************************

Random Article 2:


The anonymous source behind the Panama Papers document dump has offered to help law enforcement officials pursue wrongdoers, in exchange for immunity from prosecution. ”Legitimate   who expose unquestionable wrongdoing, whether insiders or outsiders, deserve immunity from government retribution,” the source said, in a statement released late Thursday. It was verified by Süddeutsche Zeitung, the German newspaper that helped bring the documents to worldwide attention. In early April, media organizations around the world reported that millions of documents were hacked from Mossack Fonseca, a Panama law firm...


****************************************************************************************************
****************************************************************************************************

Random Article 3:


Even as President Trump takes steps to restrict visitors from some   countries, he and his family continue to do business in some of the others. Ethics experts question whether that might indicate conflicts between Trump’s business interests and his role as U. S. president. The executive action, ”Protecting The Nation From Foreign Terrorist Entry Into The United States,” targets seven nations: Iran, Iraq, Libya, Somalia, Sudan, Syria and Yemen. Trump has no business interests in those countries. One other thing they have in...


****************************************************************************************************
****************************************************************************************************

Random Article 4:


Last week, the New York Times published an   titled ”In Defense of Cultural Appropriation” in which writer Kenan Malik attempted to extol the virtues of artistic appropriation and chastise those who would stand in the way of necessary ”cultural engagement.” (No link, because you have Google and I’d rather not give that piece more traffic than it deserves.) What would have happened, he argues, had Elvis Presley not been able to swipe the sounds of black musicians? Malik is not the first person to defend...


****************************************************************************************************
****************************************************************************************************

Random Article 5:


Each month, NPR Music asks disc jockeys and music experts from public radio stations across the country to share one song they can’t get enough of. In July’s edition, you’ll hear new music by Queens rocksteady band The Frightnrs, RB producer and singer Blood Orange, Irish   James Vincent McMorrow and more  —   an eclectic mix that’s just right for summer....


****************************************************************************************************
****************************************************************************************************



## Sanity Checks on Data

Here, we ensure there are no NaN or blank articles in the data.

In [6]:
print('No. of NaN articles:', df['Article'].isnull().sum())

No. of NaN articles: 0


In [7]:
no_blanks = 0
for i in range(df.shape[0]):
    # Checking each article individually, not the whole column
    Q = df['Article'].iloc[i]
    if isinstance(Q, str):
      if Q.isspace():
        no_blanks += 1
print('No. of blank articles:', no_blanks)

No. of blank articles: 0


#### Observations:

As can be seen, there are no missing values or blank articles in the data.

## Model Building

Here, we first use Non-Negative Matrix Factorization to cluster the text data, and then compare its results with those of Latent Dirichlet Allocation. Note that the hyperparameters of both models, especially the number of clusters, were tuned iteratively and set to the values that produced the most insightful results.
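The notebook does not show the iterative search itself. One possible way to run it, sketched below on a small random matrix, is to fit NMF for several candidate values of `n_components` and compare the `reconstruction_err_` attribute of each fit. The toy matrix, the candidate values, and the use of reconstruction error as a guide are all assumptions for illustration, not the procedure used for the results here.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative matrix standing in for the TF-IDF matrix
rng = np.random.default_rng(0)
toy_matrix = rng.random((30, 12))

# Fit one model per candidate number of components and record its error
errors = {}
for k in (2, 4, 6):
    model = NMF(n_components=k, max_iter=1000, random_state=1)
    model.fit(toy_matrix)
    # Frobenius norm of (toy_matrix - W @ H) for the fitted model
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f'n_components = {k}: reconstruction error = {err:.3f}')
```

The reconstruction error usually keeps shrinking as components are added, so it can only help narrow the range. The final choice still depends on reading the top words of each topic and judging how interpretable they are.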

### Non-Negative Matrix Factorization

In [8]:
# Instantiating the TF-IDF vectorizer required for NMF
tiv = TfidfVectorizer(min_df = 5, max_df = 0.9, stop_words = 'english')

# Creating the TF-IDF matrix of the data
tim = tiv.fit_transform(df['Article'])
tim.shape

(11992, 33085)

#### Notes:

Here, words appearing in more than 90% of the documents have been assumed too common, and those appearing in fewer than 5 documents have been deemed too specific. This has resulted in a vocabulary of 33,085 words across all 11,992 articles.

In [9]:
# Applying NMF to the sparse TF-IDF matrix
nmf_model = NMF(n_components = 7, max_iter = 1000, random_state = 1)
nmf_results = nmf_model.fit_transform(tim)

In [10]:
# Finding the top 20 words for each topic
for topic_index in range(nmf_model.components_.shape[0]):
    freq_words_index = nmf_model.components_[topic_index, :].argsort()[-20:]
    freq_words = tiv.get_feature_names_out()[freq_words_index[::-1]]
    print(f'Top 20 repeating words for Topic {topic_index}:\n\n {np.array(freq_words)}')
    print('\n'+ ('*'*100 + '\n')*2)

Top 20 repeating words for Topic 0:

 ['says' 'people' 'zika' 'food' 'water' 'study' 'women' 'virus' 'disease'
 'percent' 'like' 'health' 'patients' 'new' 'research' 'scientists'
 'years' 'university' 'brain' 'researchers']

****************************************************************************************************
****************************************************************************************************

Top 20 repeating words for Topic 1:

 ['trump' 'president' 'said' 'campaign' 'donald' 'house' 'white' 'obama'
 'republican' 'election' 'administration' 'russia' 'presidential' 'comey'
 'pence' 'gop' 'republicans' 'office' 'nominee' 'intelligence']

****************************************************************************************************
****************************************************************************************************

Top 20 repeating words for Topic 2:

 ['health' 'care' 'insurance' 'medicaid' 'coverage' 'obamacare'
 'affordable' 'repu

In [11]:
# Assigning topics to the dataframe of the articles
nmf_label_dicts = {0: 'Health/Science', 1: 'Trump/White House', 
                   2: 'Domestic/Legislative', 3: 'International/Immigration',
                   4: 'Election', 5: 'Art/Culture', 6: 'Education/Activism'}
df['NMF_Label'] = nmf_results.argmax(axis = 1) 
df['NMF_Label'] = df['NMF_Label'].map(nmf_label_dicts)
df.head()

Unnamed: 0,Article,NMF_Label
0,"In the Washington of 2016, even when the polic...",Trump/White House
1,Donald Trump has used Twitter — his prefe...,Trump/White House
2,Donald Trump is unabashedly praising Russian...,Trump/White House
3,"Updated at 2:50 p. m. ET, Russian President Vl...",International/Immigration
4,"From photography, illustration and video, to d...",Education/Activism


#### Notes:

The title of each topic (cluster) has been chosen based on the dominant words in the cluster. 
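The assignment step can be sketched on a toy document-topic matrix. The weights and topic names below are hypothetical; the point is only to show how `argmax(axis = 1)` picks the dominant topic per row before the index is mapped to a readable label.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights: 4 documents (rows) x 3 topics (columns)
toy_weights = np.array([
    [0.70, 0.10, 0.05],
    [0.02, 0.60, 0.30],
    [0.00, 0.20, 0.90],
    [0.40, 0.35, 0.10],
])

# Hypothetical topic names, keyed by column index
toy_labels = {0: 'Health', 1: 'Politics', 2: 'Culture'}

# argmax(axis = 1) returns, for each row, the column index of the largest weight
dominant_topic = toy_weights.argmax(axis = 1)

toy_df = pd.DataFrame({'Topic': dominant_topic})
toy_df['Label'] = toy_df['Topic'].map(toy_labels)
print(toy_df)
```

This is a hard assignment: the last row is labeled 'Health' even though its second-largest weight (0.35) is close to the largest (0.40). Articles that straddle two topics therefore get a single label.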

In [12]:
# Selecting a few random articles to examine how reasonable the article segmentation looks
for i in range(10):
    index = randint(0, df.shape[0] - 1)  # randint includes both endpoints
    article = nlp(df.loc[index, 'Article'])
    print('Label:', df.loc[index, 'NMF_Label'])
    display(Markdown(str(article[:100]) + '...'))
    print('\n'+ ('*'*100 + '\n')*2)

Label: International/Immigration


The U. S. State Department issued its annual Trafficking in Persons report on Thursday, and the big news is the status of Thailand. Thailand is now on the ”Tier 2 Watch List” for countries that do not meet the minimum U. S. standards for the elimination of trafficking, but are making significant efforts to do so. Last year it was on the ”Tier 3” list of the worst human trafficking offenders  —   countries making no significant effort to meet minimum U. S. standards. That list includes Burma, Haiti...


****************************************************************************************************
****************************************************************************************************

Label: Health/Science


Over the past decade, states have passed laws intended to help women understand the results of their breast cancer screening mammograms if they have dense breasts. But those notifications can be downright confusing and may, in fact, cause more misunderstanding than understanding. A study published Tuesday in JAMA, the journal of the American Medical Association, finds the wording of some notifications so complex that only a Ph. D. could understand them. This lack of simple, direct information could lead to greater health disparities in diagnosis and treatment of breast cancer...


****************************************************************************************************
****************************************************************************************************

Label: Election


Stephanie Hundley is an enthusiastic Bernie Sanders supporter. The    from Waterloo is also enthusiastic about the fact that she’s not going to vote for Hillary Clinton just because she’s a woman. ”I don’t think she’s the woman to be representative of women,” Hundley said. She ticked off a list of Clinton criticisms: the ”damn emails,” the ”” her vote to go to war in Iraq. Citing Sanders’ record of supporting women’s rights, Hundley said his overall views embody hers more...


****************************************************************************************************
****************************************************************************************************

Label: Health/Science


The American Kennel Club says it is officially granting full status to the pumi, a herding breed originally from Hungary. It’s the 190th breed recognized by the AKC, the ”largest purebred dog registry in the world,” which oversees some 22, 000 events annually. This opens the door for the energetic canine known for its ”whimsical expression” to compete in the Westminster Kennel Club Dog Show for the first time. To learn more about the lengthy process of obtaining full recognition from the AKC, The   spoke with...


****************************************************************************************************
****************************************************************************************************

Label: Health/Science


Faced with her own forgetfulness, former NPR correspondent and author Barbara Bradley Hagerty tried to do something about it. She’s written about her efforts in her book on midlife, called Life Reimagined. To her surprise, she discovered that an older dog can learn new tricks.  A confession: I loathe standardized tests, and one of the perks of reaching midlife is that I thought I’d never have to take another. But lately I’ve noticed that in my 50s, my memory isn’t the same as it once...


****************************************************************************************************
****************************************************************************************************

Label: Art/Culture


Artists make the best cultural critics. They reveal what’s happening around us with whatever level of transparency they see fit, with whatever level of opaqueness they desire to sustain mystery. They’re observers, internally and outwardly, operating in a space that allows us, the voyeur, the listener, to learn. Kim Gordon has been teaching us for over three decades. Now she’s doing it under her own name. The experimental music icon got her start in Sonic Youth 35 years ago, revolutionizing the ’80s New York...


****************************************************************************************************
****************************************************************************************************

Label: International/Immigration


Crude oil is now flowing through the Dakota Access Pipeline, despite months of protests against it by Native American tribes and environmental groups. The pipeline spans more than 1, 000 miles from North Dakota to Illinois and cost some $3. 8 billion to construct. It is expected to transport approximately 520, 000 barrels of oil daily. ”Construction on the project was supposed to wrap up late last year,” as Prairie Public Broadcasting’s Amy Sisk reported. ”But protests led to delays in permitting the final stretch of...


****************************************************************************************************
****************************************************************************************************

Label: Art/Culture


For much of the   age, and particularly in such   cerebral genres as indie rock, contemporary folk and Americana, artists have been more likely to command critical respect for cultivating their songwriting voices than for interpreting songs from others’ pens. But John Prine, who was once pegged as a new Dylan, seems to be having a fine time toying with that modern musical hierarchy. The profoundly human characters,   wisdom and wry turns of phrase in his songs have earned his spot in the Nashville Songwriters Hall of Fame many...


****************************************************************************************************
****************************************************************************************************

Label: Health/Science


In emergencies, administering drugs quickly and easily can be a matter of life and death. This has emergency departments turning to the nose as a delivery route because it’s so accessible and doesn’t require direct contact with a needle. Using the nose as a passage for steroids like Flonase and vaccines like FluMist has been common practice for decades. In recent years, more Americans have also become aware of the emergency drug naloxone, which is used to reverse the effects of an opioid overdose, even when someone has stopped breathing....


****************************************************************************************************
****************************************************************************************************

Label: Trump/White House


The sixth Republican debate began harmoniously. Every candidate on stage agreed on one thing: Obama is a terrible president. Then Ted Cruz’s citizenship issue came up and the gloves were off. Cruz was attacked much of the night, but never backed down. After explaining the Constitution to Donald Trump, the former solicitor general said: ”I’m not going to take legal advice from Donald Trump.” Trump doubled down on his   and    policies. And proudly professed: ”I will gladly accept the mantle of anger...


****************************************************************************************************
****************************************************************************************************



#### Observations:

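As can be seen, most of the sampled articles carry a plausible label, although a few assignments are debatable (for example, the Dakota Access Pipeline story labeled International/Immigration).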

In [13]:
# Share of each topic in the whole corpus
print('Share of Each Subject Among All Articles:')
df.NMF_Label.value_counts(normalize = True)

Share of Each Subject Among All Articles:


Art/Culture                 0.282
International/Immigration   0.236
Health/Science              0.210
Trump/White House           0.109
Election                    0.055
Education/Activism          0.054
Domestic/Legislative        0.053
Name: NMF_Label, dtype: float64

### Latent Dirichlet Allocation

In [14]:
# Instantiating the count vectorizer required for LDA
cv = CountVectorizer(min_df = 5, max_df = 0.3, stop_words = 'english')

# Creating the word count matrix of the data
cm = cv.fit_transform(df['Article'])
cm.shape

(11992, 33050)

#### Notes:

*   Here, words appearing in more than 30% of the documents have been assumed too common, and those appearing in fewer than 5 documents have been deemed too specific. This has resulted in a vocabulary of 33,050 words across all 11,992 articles.

*   The max_df hyperparameter is set much lower here than for the NMF method because LDA works on raw counts, with no inverse-document-frequency weighting to discount ubiquitous words. With a higher threshold, LDA would have identified many commonplace English words as the top keywords of the clusters.
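The difference between the two vectorizers can be seen on a toy corpus. In the sketch below (the three documents are made up for illustration), the word 'said' appears in every document and 'court' in only one. Raw counts treat them identically, while TF-IDF gives the ubiquitous word a lower weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: 'said' and 'the' occur in every document
toy_docs = [
    'said the court ruling',
    'said the health study',
    'said the election campaign',
]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)

toy_tv = TfidfVectorizer()
toy_tfidf = toy_tv.fit_transform(toy_docs)

# Column indices of the two words in each vocabulary
c_said, c_court = toy_cv.vocabulary_['said'], toy_cv.vocabulary_['court']
t_said, t_court = toy_tv.vocabulary_['said'], toy_tv.vocabulary_['court']

# In the first document both words have the same raw count ...
print('counts:', toy_counts[0, c_said], toy_counts[0, c_court])
# ... but TF-IDF weights the ubiquitous word lower than the rare one
print('tf-idf:', round(toy_tfidf[0, t_said], 3), round(toy_tfidf[0, t_court], 3))
```

With counts as input, a frequent but uninformative word can only be kept out of the topics by removing it from the vocabulary. That is what the lower max_df threshold does.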

In [15]:
# Applying LDA to the sparse word count matrix
lda_model = LatentDirichletAllocation(n_components = 6, random_state = 1)
lda_results = lda_model.fit_transform(cm)

In [16]:
# Finding the top 20 words for each topic
for topic_index in range(lda_model.components_.shape[0]):
    freq_words_index = lda_model.components_[topic_index, :].argsort()[-20:]
    freq_words = cv.get_feature_names_out()[freq_words_index[::-1]]
    print(f'Top 20 repeating words for Topic {topic_index}:\n\n {np.array(freq_words)}')
    print('\n'+ ('*'*100 + '\n')*2)

Top 20 repeating words for Topic 0:

 ['school' 'students' 'university' 'study' 'science' 'women' 'research'
 'children' 'kids' 'schools' 'parents' 'different' 'education' 'human'
 'use' 'college' 'team' 'life' 'need' 'brain']

****************************************************************************************************
****************************************************************************************************

Top 20 repeating words for Topic 1:

 ['police' 'government' 'reports' 'news' 'department' 'reported' 'law'
 'security' 'russia' 'case' 'officials' 'statement' 'court' 'attack'
 'china' 'public' 'information' 'military' 'investigation' 'federal']

****************************************************************************************************
****************************************************************************************************

Top 20 repeating words for Topic 2:

 ['trump' 'clinton' 'campaign' 'obama' 'house' 'republican' 'election'
 'white' '

In [17]:
# Assigning topics to the dataframe of the articles
lda_label_dicts = {0: 'Education', 1: 'International', 2: 'Election',
                   3: 'Life/Art/Culture', 4: 'Health/Science', 5: 'Politics/Economics'}
df['LDA_Label'] = lda_results.argmax(axis = 1) 
df['LDA_Label'] = df['LDA_Label'].map(lda_label_dicts)
df.head()

Unnamed: 0,Article,NMF_Label,LDA_Label
0,"In the Washington of 2016, even when the polic...",Trump/White House,Election
1,Donald Trump has used Twitter — his prefe...,Trump/White House,Election
2,Donald Trump is unabashedly praising Russian...,Trump/White House,Election
3,"Updated at 2:50 p. m. ET, Russian President Vl...",International/Immigration,International
4,"From photography, illustration and video, to d...",Education/Activism,Politics/Economics


#### Notes:

The subject of each cluster has been selected based on its most frequent words.

In [18]:
# Selecting a few random articles to examine how reasonable the article segmentation looks
for i in range(10):
    index = randint(0, df.shape[0] - 1)  # randint includes both endpoints
    article = nlp(df.loc[index, 'Article'])
    print('Label:', df.loc[index, 'LDA_Label'])
    display(Markdown(str(article[:100]) + '...'))
    print('\n'+ ('*'*100 + '\n')*2)

Label: Life/Art/Culture


A huge, internally flawless, 59.  diamond called the ”Pink Star” went for a whopping $71. 2 million at auction in Hong Kong  —   the highest price ever for a jewel. The oval gem was purchased at Sotheby’s by Hong   jewelry company Chow Tai Fook in a bid placed over the phone by the chairman, who renamed the jewel ”CTF PINK STAR” in honor of his late father. ”Not only was the price more than double the previous record for a fancy vivid...


****************************************************************************************************
****************************************************************************************************

Label: International


President Trump signed a new executive order on Monday, after his first action temporarily barring refugees and travel from specific   countries faced a slew of criticism and lawsuits. The revised order has a number of changes, including dropping Iraq from the list of countries with restrictions. It also explicitly does not apply to lawful permanent residents (green card holders) or existing visa holders. The order goes into effect on March 16. Journalists across NPR will be annotating the full text of the order....


****************************************************************************************************
****************************************************************************************************

Label: Politics/Economics


When the Labor Department announces the September   numbers on Friday, presidential candidates will pounce, hoping to find data to support their talking points on the economy. For the past three months, the numbers have been favoring the incumbent Democratic Party. Candidate Hillary Clinton could point to a steady, low unemployment rate of 4. 9 percent and average growth of 232, 000 jobs per month, a robust pace. But the September numbers might show the economy is slowing  —   giving Republican Donald Trump an opportunity to strengthen his...


****************************************************************************************************
****************************************************************************************************

Label: International


When faced with allegations of sex abuse against one of its bishops, the Church of England ”colluded and concealed rather than seeking to help those who were brave enough to come forward,” the church’s leader acknowledged Thursday. ”For the survivors who were brave enough to share their story and bring Peter Ball to justice, I once again offer an unreserved apology,” Justin Welby, archbishop of Canterbury, said in a statement. ”There are no excuses whatsoever for what took place and the systemic abuse of trust perpetrated...


****************************************************************************************************
****************************************************************************************************

Label: Life/Art/Culture


Seven teenagers stand in the courtyard of the Smithsonian National Portrait Gallery, dressed in costumes and surrounded by onlookers. Some of the characters they play are immediately recognizable  —   Malcolm X, Albert Einstein  —   and others don’t register until they announce their names. ”We stand before you as a reflection of community,” the group announces in unison. One after another, they speak up: ”As reminders of social activists.” ”Some of us are leaders.” ”Or presidents.” Then...


****************************************************************************************************
****************************************************************************************************

Label: International


Among the queries included in a questionnaire sent by   Donald Trump’s transition team to workers at the Department of Energy is a request for an inventory of all agency employees or contractors who attended meetings or conferences on climate change. Another question asks for a current list of professional society memberships of any lab staff. The   questionnaire has raised fears among civil rights lawyers specializing in federal worker whistleblower protections, who say the incoming administration is at a minimum trying to influence or limit the research at the Department of Energy. And at...


****************************************************************************************************
****************************************************************************************************

Label: Election


More than 100 million people are expected to watch the first debate between Hillary Clinton and Donald Trump on Monday night, potentially the largest audience for a campaign event in American history. Why? What do we expect from this   faceoff? A watershed moment in our history? A basis on which to choose between the candidates? Or just a ripping good show? Obviously, many of us hope to get all three. [The debate from Hofstra University in Hempstead N. Y. will be broadcast live on NPR beginning at 9 p....


****************************************************************************************************
****************************************************************************************************

Label: Life/Art/Culture


Sinkane opened its Tiny Desk Concert with a song that has been a bit of an anthem for me lately. ”U’Huh” contains the Arabic phrase ”kulu shi tamaam,” which translates to ”everything’s great  —   it’s all going to be all right.” Sinkane is the music of Ahmed Gallab  —   and such hopeful music it is. He grew up in London and has lived in Sudan and in Ohio and, these days, New York City. His band reflects his own love for...


****************************************************************************************************
****************************************************************************************************

Label: Life/Art/Culture


Here’s what we know: Coldplay and Beyoncé will perform at Sunday’s Super Bowl halftime. The duo just released a song called ”Hymn for the Weekend.” But they won’t be performing it  —   because it’s too new, according to the band. ”I don’t think it would be quite right,” said frontman Chris Martin, according to The Associated Press. The decision comes as the song’s music video has ignited a heated debate about cultural appropriation. The video, which uses India...


****************************************************************************************************
****************************************************************************************************

Label: International


Gunmen dressed as medical staff stormed a military hospital in Kabul on Wednesday morning, killing at least 30 people and injuring dozens more in a raid that lasted hours. In a statement published on the Islamic   Aamaq news agency, the militant group claimed responsibility for the assault in the Afghan capital. The attack on Sardar Mohammad Daud Khan hospital ended midafternoon local time, after several hours of    clashes with Afghan security forces left all four attackers dead, according to Gen. Dawlat Waziri, an Afghan defense ministry spokesman. ”While we...


****************************************************************************************************
****************************************************************************************************



#### Observations:

The labels mostly agree with the content of the articles.

In [19]:
# Share of each topic in the whole corpus
print('Share of Each Subject Among All Articles:')
df.LDA_Label.value_counts(normalize = True)

Share of Each Subject Among All Articles:


Life/Art/Culture     0.277
International        0.193
Politics/Economics   0.162
Election             0.158
Education            0.122
Health/Science       0.088
Name: LDA_Label, dtype: float64

## Summary:

*   Both models identify broadly similar topics for the articles, but NMF does so in considerably less time.

*   The NMF-based model detects an extra, more specific cluster, Trump/White House, for news in which Trump (or his White House) is the centerpiece. Note that these articles date from 2016 and 2017, when he was an inevitable part of American politics. The LDA-based model, however, is not that granular and lumps all Trump news into the Politics/Economics or Election clusters. This is why NMF ends up with one more segment than LDA.

*   Note that NMF's Domestic/Legislative cluster is analogous to LDA's Politics/Economics cluster, except the former doesn't contain the White-House-centric news.

*   Interestingly, cultural and international news are identified as the first and second most frequent topics, respectively, by both methods: 28.2% and 23.6% with NMF, versus 27.7% and 19.3% with LDA.

*   The share of Health/Science news looks implausibly low in the LDA results (8.8%, versus 21.0% with NMF). LDA appears to categorize many health- or science/research-related articles as political, likely because the government is often involved in the decisions and, as a result, political terms appear in those articles as well.
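One way to check such comparisons directly is to cross-tabulate the two label columns. The sketch below uses six made-up rows, not results from this notebook. On the real data, the same `pd.crosstab` call would be applied to `df['NMF_Label']` and `df['LDA_Label']`.

```python
import pandas as pd

# Hypothetical labels for six articles, for illustration only
toy = pd.DataFrame({
    'NMF_Label': ['Trump/White House', 'Trump/White House', 'Health/Science',
                  'Health/Science', 'Art/Culture', 'Election'],
    'LDA_Label': ['Election', 'Politics/Economics', 'Health/Science',
                  'Politics/Economics', 'Life/Art/Culture', 'Election'],
})

# Rows: NMF labels; columns: LDA labels; cells: number of articles
agreement = pd.crosstab(toy['NMF_Label'], toy['LDA_Label'])
print(agreement)
```

Each row of the table shows how the articles of one NMF cluster are spread across the LDA clusters. That would make it visible, for instance, where LDA places the articles that NMF labels Health/Science.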