<a href="https://colab.research.google.com/github/harishkollana/Topic-Modeling-on-News-Articles-Clustering/blob/main/Topic_Modeling_on_News_Articles_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Extraction/identification of major topics & themes discussed in news articles. </u></b>

## <b> Problem Description </b>

### In this project, your task is to identify the major themes/topics across a collection of BBC news articles. You can use topic modeling techniques such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA), as well as clustering algorithms.
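
As a rough illustration of one of the techniques named above, the sketch below runs LSA (TF-IDF weighting followed by a truncated SVD) on a tiny made-up corpus. The four sentences, the variable names, and the choice of two components are assumptions for demonstration only; they are not taken from the BBC data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; the real project uses the BBC articles.
toy_docs = [
    "shares profit market bank economy",
    "bank market shares growth profit",
    "match goal team player league",
    "team player match win goal",
]

# LSA = TF-IDF weighting followed by a truncated SVD of the document-term matrix.
toy_vectorizer = TfidfVectorizer()
toy_doc_term = toy_vectorizer.fit_transform(toy_docs)

lsa = TruncatedSVD(n_components=2, random_state=0)
lsa_doc_topic = lsa.fit_transform(toy_doc_term)  # rows: documents, columns: latent topics

# vocabulary_ maps term -> column index; invert it to look terms up by index.
index_to_term = {index: term for term, index in toy_vectorizer.vocabulary_.items()}
for topic_id, weights in enumerate(lsa.components_):
    # Terms with the largest weights in this component
    top_terms = [index_to_term[i] for i in weights.argsort()[::-1][:3]]
    print(f"Topic {topic_id}: {top_terms}")
```

The sign of an SVD component is arbitrary, so the printed terms are best read as "terms that move together" rather than as a definitive topic label.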

## <b> Data Description </b>

### The dataset contains news articles from five major segments: business, entertainment, politics, sport and technology. You need to create an aggregate dataset of all the news articles and perform topic modeling on it. Then verify whether the discovered topics correspond to the available tags.
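
One possible way to check whether discovered topics line up with the tags is to assign each document its dominant topic and cross-tabulate that against the tag. The sketch below does this on a made-up four-document corpus; the data, the column name `dominant_topic`, and the use of two topics are all assumptions for illustration, not results from the BBC dataset.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the aggregated dataset (text plus its tag).
toy = pd.DataFrame({
    "text": [
        "shares profit market bank economy",
        "bank market shares growth profit",
        "match goal team player league",
        "team player match win goal",
    ],
    "type": ["business", "business", "sport", "sport"],
})

# LDA works on raw term counts rather than TF-IDF weights.
toy_counts = CountVectorizer().fit_transform(toy["text"])
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topic = toy_lda.fit_transform(toy_counts)  # per-document topic proportions

# Dominant topic = the topic with the highest proportion in each document.
toy["dominant_topic"] = toy_doc_topic.argmax(axis=1)

# Rows are tags, columns are topic ids; a clean match puts each tag in one column.
agreement = pd.crosstab(toy["type"], toy["dominant_topic"])
print(agreement)
```

Topic ids are arbitrary, so the check is whether each tag concentrates in a single column, not whether the ids match any particular order.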

In [1]:
#import libraries for topic modeling on news articles
import numpy as np
import pandas as pd
import scipy.stats as stats

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import glob
import string
import nltk
nltk.download('omw-1.4')

from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('punkt')

from sklearn.feature_extraction.text import CountVectorizer

from textblob import TextBlob

from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.manifold import TSNE

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
!pip install spacy
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-py3-none-any.whl size=98051301 sha256=ea0ca8fdec2de8315e1c423592e79e3ad0587671d861c54e9f36f79203c209d1
  Stored in directory: /tmp/pip-ephem-wheel-cache-ufdrhrhp/wheels/69/c5/b8/4f1c029d89238734311b3269762ab2ee325a42da2ce8edb997
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [3]:
!pip install pyLDAvis==3.2.1

Collecting pyLDAvis==3.2.1
  Downloading pyLDAvis-3.2.1.tar.gz (1.7 MB)
Collecting funcy
  Downloading funcy-1.17-py2.py3-none-any.whl (33 kB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-3.2.1-py2.py3-none-any.whl size=136187 sha256=cd1eb586102aad7abf53a8abc74655e64de7074574e52a9bc52d11ebab68f8f7
  Stored in directory: /root/.cache/pip/wheels/c6/ee/a6/7c17a63623f940dff0b9cbd7e48a27543f088fa55a7d2b62d0
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.17 pyLDAvis-3.2.1


In [4]:
!pip install -U pandas-profiling

Collecting pandas-profiling
  Downloading pandas_profiling-3.1.0-py2.py3-none-any.whl (261 kB)
Collecting visions[type_image_path]==0.7.4
  Downloading visions-0.7.4-py3-none-any.whl (102 kB)
Collecting joblib~=1.0.1
  Downloading joblib-1.0.1-py3-none-any.whl (303 kB)
Collecting multimethod>=1.4
  Downloading multimethod-1.7-py3-none-any.whl (9.5 kB)
Collecting htmlmin>=0.1.12
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
Collecting phik>=0.11.1
  Downloading phik-0.12.0-cp37-cp37m-manylinux2010_x86_64.whl (675 kB)
Collecting requests>=2.24.0
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
Collecting tangled-up-in-unicode==0.1.0
  Downloading tangled_up_in_unico

Import Data

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
#load path to data
path = '/content/drive/MyDrive/Capstone Projects/Topic Modeling on News Articles/bbc'

In [7]:
#Importing text file paths
business = glob.glob(path+'/business/*')
entertainment = glob.glob(path+'/entertainment/*')
politics = glob.glob(path+'/politics/*')
sports = glob.glob(path+'/sport/*')
tech = glob.glob(path+'/tech/*')

In [8]:
business[0:5]

['/content/drive/MyDrive/Capstone Projects/Topic Modeling on News Articles/bbc/business/507.txt',
 '/content/drive/MyDrive/Capstone Projects/Topic Modeling on News Articles/bbc/business/505.txt',
 '/content/drive/MyDrive/Capstone Projects/Topic Modeling on News Articles/bbc/business/471.txt',
 '/content/drive/MyDrive/Capstone Projects/Topic Modeling on News Articles/bbc/business/498.txt',
 '/content/drive/MyDrive/Capstone Projects/Topic Modeling on News Articles/bbc/business/510.txt']

In [9]:
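# Read every file in a list of paths and return the contents as a list of strings.
def make_list(paths):
    texts = []
    for file_path in paths:
        # errors='ignore' skips bytes that cannot be decoded instead of raising an error
        with open(file_path, 'r', errors='ignore') as file:
            texts.append(file.read())
    return texts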

In [10]:
sports_text = []

# errors='ignore' skips bytes in the sport files that cannot be decoded
for file_path in sports:
    with open(file_path, errors='ignore') as f:
        sports_text.append(f.read())

# Print once, after the loop has finished
print('List ended !!')

List ended !!

In [11]:
sports_text[0]

'Henman to face Saulnier test\n\nBritish number one Tim Henman will face France\'s Cyril Saulnier in the first round of next week\'s Australian Open.\n\nGreg Rusedski, the British number two, is in the same quarter of the draw and could face Andy Roddick in the second round if he beats Swede Jonas Bjorkman. Local favourite Lleyton Hewitt will meet France\'s Arnaud Clement, while defending champion and world number one Roger Federer faces Fabrice Santoro. Women\'s top seed Lindsay Davenport drew Spanish veteran Conchita Martinez.\n\nHenman came from two sets down to defeat Saulnier in the first round of the French Open last year, so he knows he faces a tough test in Melbourne. The seventh seed, who has never gone beyond the quarter-finals in the year\'s first major and is lined up to meet Roddick in the last eight, is looking forward to the match. "He\'s tough player on any surface, he\'s got a lot of ability," he said. "We had a really tight one in Paris that went my way so I\'m going 

In [None]:
business_texts= make_list(business)
entertainment_text = make_list(entertainment)
politics_texts= make_list(politics)
tech_text = make_list(tech)

In [None]:
#Number of documents in each topic
print(len(business_texts),len(entertainment_text),len(politics_texts),len(sports_text),len(tech_text))

In [None]:
# Combine the topics.
complete_text = business_texts + entertainment_text + politics_texts + sports_text + tech_text

In [None]:
len(complete_text)

The combined corpus contains 2225 articles.

In [None]:
# Make the dataframe of texts, with one label per article in the same order as complete_text.
labels = (['business'] * len(business_texts)
          + ['entertainment'] * len(entertainment_text)
          + ['politics'] * len(politics_texts)
          + ['sport'] * len(sports_text)
          + ['tech'] * len(tech_text))
df = pd.DataFrame({'text': complete_text, 'type': labels})

Data Cleaning

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# removing duplicate observations
df=df.drop_duplicates()

In [None]:
# Replace "\n" with a space.
# Convert the words to lowercase.
# Remove stopwords.

def text_processing(data):
  # Build the stopword set once; calling stopwords.words() for every word is very slow
  stop_words = set(stopwords.words('english'))
  data = data.map(lambda x: x.replace('\n',' '))
  data = data.map(lambda x: x.lower())
  #data = data.map(lambda x: ''.join([i for i in x if i not in string.punctuation]))
  data = data.map(lambda x: ' '.join([i for i in x.split(' ') if i not in stop_words]))
  return data

In [None]:
# Converting the column to a string
df['text'] = df['text'].astype('str') 

In [None]:
#check data
df.head()

In [None]:
#add a new column for number of sentences in the text
df['sentence_count'] = [len(i) for i in df['text'].apply(nltk.sent_tokenize)]

#remove punctuation
df['text'] = df['text'].map(lambda x: ''.join([i for i in x if i not in string.punctuation]))

#add a new column for number of words in the text
df['word_count'] = [len(i.split()) for i in df['text']]

#apply lemmatization
lemmatizer = WordNetLemmatizer()
df['text'] = df['text'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

#add a new column for number of characters in the text
df['char_count'] = df['text'].str.len()

#add a new column for average sentence length in the text
df['avg_sentence_length'] = df['word_count']/df['sentence_count']

#add a new column for average word length in the text
df['avg_word_length'] = df['char_count']/df['word_count']

#add a new column for number of unique words in the text
df['unique_word_count'] = df['text'].apply(lambda x: len(set(x.split(' '))))

#add a new column for number of digits in the text
df['digit_count'] = df['text'].apply(lambda x: len([c for c in x if c in string.digits]))

#check data
df.head()

Exploratory Data Analysis

In [None]:
#create a word cloud for popular words in each topic
from wordcloud import WordCloud

def word_cloud(data, topic):
    # keep only the articles tagged with the requested topic
    topic_text = ' '.join(data.loc[data['type'] == topic, 'text'])
    # WordCloud drops the words passed through its stopwords argument
    stop = set(stopwords.words('english'))
    wordcloud = WordCloud(background_color="white", max_words=100, stopwords=stop, max_font_size=40,
                          scale=3, random_state=1).generate(topic_text)
    plt.figure(figsize=(20, 20))
    plt.axis('off')
    plt.imshow(wordcloud)
    plt.title(topic)
    plt.show()

word_cloud(df, 'business')
word_cloud(df, 'entertainment')
word_cloud(df, 'politics')
word_cloud(df, 'sport')
word_cloud(df, 'tech')



In [None]:
#create a word cloud for popular words in all topics
from wordcloud import WordCloud

def word_cloud_all(data):
    # WordCloud drops the words passed through its stopwords argument
    stop = set(stopwords.words('english'))
    wordcloud = WordCloud(background_color="white", max_words=100, stopwords=stop, max_font_size=40,
                          scale=3, random_state=1).generate(' '.join(data['text']))
    plt.figure(figsize=(20, 20))
    plt.axis('off')
    plt.imshow(wordcloud)
    plt.title('All topics')
    plt.show()

word_cloud_all(df)


In [None]:
#set figure size
plt.rcParams['figure.figsize'] = [10, 7]

#create a countplot for type in df
sns.countplot(x='type', data=df)
plt.xlabel("Type Of News")
plt.ylabel("News Count")
plt.title("Type Of News Counts")
plt.show()



In [None]:
#create pairplot for all the features
sns.pairplot(df, hue='type', palette='Set1')
plt.show()

In [None]:
df.head()

Topic Modeling

Hierarchical Clustering

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [None]:
tfidf = TfidfVectorizer(decode_error='ignore', lowercase = True, min_df=2)

dtm=tfidf.fit_transform(df['text'])

dtm.shape
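
To illustrate what `min_df=2` does in the cell above, here is a small sketch on a made-up three-document corpus (the documents and the `toy_` names are assumptions, not BBC data): terms that occur in fewer than two documents are dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "bank" and "market" each appear in two documents,
# while "profit" and "goal" appear in only one.
toy_docs = ["bank profit", "bank market", "market goal"]

toy_tfidf = TfidfVectorizer(lowercase=True, min_df=2)
toy_dtm = toy_tfidf.fit_transform(toy_docs)

# vocabulary_ holds only the terms that survived the min_df filter.
kept_terms = sorted(toy_tfidf.vocabulary_)
print(kept_terms)     # ['bank', 'market']
print(toy_dtm.shape)  # (3, 2)
```

On a real corpus this filter mostly removes typos and one-off names, which shrinks the document-term matrix before it is converted to a dense array.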


In [None]:
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

In [None]:

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)


X = dtm.toarray()

In [None]:
# setting distance_threshold=0 ensures we compute the full tree.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)

model = model.fit(X)
plt.title('Hierarchical Clustering Dendrogram')
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode='level', p=3)
#plt.xlabel("Number of points in node")
plt.show()

From the dendrogram above, we can see that the articles can be separated into 5 distinct clusters.
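
Reading a cluster count off a dendrogram is a visual judgment, so it can help to check the resulting cluster labels against the known tags. The sketch below does this with the adjusted Rand index on a made-up four-document corpus; the documents, the tag list, and the `toy_` names are assumptions for illustration, and the score says nothing about the BBC data.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Hypothetical documents with known tags (two business, two sport).
toy_docs = [
    "shares profit market bank",
    "bank market shares growth",
    "match goal team player",
    "team player match win",
]
toy_tags = ["business", "business", "sport", "sport"]

# AgglomerativeClustering needs a dense array, as in the notebook.
toy_X = TfidfVectorizer().fit_transform(toy_docs).toarray()
toy_labels = AgglomerativeClustering(n_clusters=2).fit_predict(toy_X)

# The adjusted Rand index ignores how cluster ids are numbered:
# 1.0 means the grouping matches the tags exactly, values near 0 mean chance level.
ari = adjusted_rand_score(toy_tags, toy_labels)
print(toy_labels, ari)
```

The same comparison could be made on the full dataset by passing `df['type']` and the fitted `labels_` to `adjusted_rand_score`, assuming both are in the same row order.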

In [None]:
clustering = AgglomerativeClustering(n_clusters=5).fit(X)
clustering

AgglomerativeClustering()

In [None]:
pd.Series(clustering.labels_).unique()

# dtm is a SciPy sparse matrix, so build the DataFrame from the dense array X instead
heirdf=pd.DataFrame(X)

In [None]:
heirdf.head()

Latent Dirichlet Allocation (LDA)

In [None]:
import spacy

In [None]:
!pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
nlp = spacy.load("en_core_web_md", disable=['parser', 'ner'])

In [None]:
# Lemmatize each text with spaCy, keeping only tokens whose part of speech is allowed.

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ')):
    output = []
    for sent in texts:
        doc = nlp(sent)
        output.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return output

In [None]:
text_list=df['text'].tolist()
print(text_list[1])
tokenized_texts = lemmatization(text_list)
print(tokenized_texts[1])