<a href="https://www.kaggle.com/code/paulolsson/bbc-news-analysis?scriptVersionId=95636517" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
!pip install -U spacytextblob

Collecting spacytextblob
  Downloading spacytextblob-4.0.0-py3-none-any.whl (4.5 kB)
Collecting textblob<0.16.0,>=0.15.3
  Downloading textblob-0.15.3-py2.py3-none-any.whl (636 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m636.5/636.5 KB[0m [31m748.5 kB/s[0m eta [36m0:00:00[0m
Collecting typing-extensions<4.0.0.0,>=3.7.4
  Downloading typing_extensions-3.10.0.2-py3-none-any.whl (26 kB)
Installing collected packages: typing-extensions, textblob, spacytextblob
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.2.0
    Uninstalling typing_extensions-4.2.0:
      Successfully uninstalled typing_extensions-4.2.0
  Attempting uninstall: textblob
    Found existing installation: textblob 0.17.1
    Uninstalling textblob-0.17.1:
      Successfully uninstalled textblob-0.17.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the sour

In [2]:
import pyLDAvis.gensim
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from spacy.tokens import DocBin
from tqdm import tqdm
import warnings

warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)


  """
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  np.object,
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  np.object,
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  np.bool,
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  np.object:
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  np.bool:
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def randint(low, high=None, size=None, dtype=onp.int):  # pylint: disable=missing-function-docstring
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-

In [3]:
# Configuration
N_TOPICS = 4
CUSTOM_STOP_WORDS = ['say', 's', 'mr', 'Mr', 'said', 'says', 'saying', 'today', 'be', 'I', 'm']

# Topics and sentiment in the BBC News Dataset

### Research

First step was to review the various language and code library options for doing NLP analysis. Python was my preferred choice as it's the language I use most. It was a top recommended choice, so an easy decision.

Second which Python libraries? This wasn't as straightforward to answer but reading around it seems like nltk is a foundational library, great for learning, but outdated. SpaCy is strong for production environments. However GENSIM was recommended for Latent Dirichlet Allocation (LDA) which was relevant for topic analysis.

### Learning

To learn NLP techniques in Python, I read around the subject, stole and adapted code in Pycharm, and repeated until I'd run out of time. Here are some of the sources I used the most:

> Analysis on the same dataset with some nice graphics: 
https://medium.com/analytics-vidhya/bbc-news-text-classification-a1b2a61af903

> Notebook on kaggle with worked example using spaCy and GENSIM 
https://www.kaggle.com/code/faressayah/text-analysis-topic-modelling-with-spacy-gensim/notebook

> Sentiment analysis
https://towardsdatascience.com/unsupervised-sentiment-analysis-a38bf1906483


### Analysing
Having gathered some useful analysis techniques, I ran and reviewed and changed the model to improve. This largely comprised adding to the stop words and tweaking the number of topics.

### Findings
LDA analysis on the entire dataset with 10 topics wasn't very conclusive, but there was a well separated cluster for sport-type key words. Reducing to 5 topics improved the separation of topics. A sports topic was still clear but the other keywords groups didn't clear describe a topic to me.

### Next steps/extending the analysis
Quality assurance and refining the model would be the next step. I don't think the results are valid yet without further work.

It would also be possible to scrape news data from the bbc website. This could increase the number of articles and would remove some preprocessing that had already occured in the data before it was saved in the csv which meant the distinction between the header, article summary and article main text was lost.



## Load the data
Remove exact duplicate articles

In [4]:
import pandas as pd
from spacy.cli import download


def validate_data(df):
    # check is as expected and described
    for category in df.category.unique():
        assert category in ['business', 'entertainment', 'politics', 'sport', 'tech']
    assert not any(df.duplicated())


def remove_duplicates(df):
    if any(df.duplicated()):
        dup = df.duplicated().value_counts()
        df = df[~df.duplicated()]
        print(f'{dup[True]} duplicated articles removed from dataset')
    return df

In [5]:
# read the data
df = pd.read_csv('../input/newsgroup20bbcnews/bbc-text.csv')
df = remove_duplicates(df)
validate_data(df)

99 duplicated articles removed from dataset


In [6]:
# summarise
df.describe()

Unnamed: 0,category,text
count,2126,2126
unique,5,2126
top,sport,tv future in the hands of viewers with home th...
freq,504,1


### Preprocessing
Use spaCy to process each article using one of the spaCy models appropritely trained on web data such as new articles.
The spaCy preprocessing separates the data into tokens, grouping noun chunks, removes punctuation, stop words, numbers, spaces and lemminizes. 


In [7]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")
nlp.add_pipe("spacytextblob")

for stopword in CUSTOM_STOP_WORDS:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True
    
def preprocess(df):

    articles, article = [], []
    doc_bin = DocBin()
    dictionary = Dictionary()
    corpus = []

    print('cleaning and create dictionary mapping of tokens in articles for analysis')
    for doc, category in tqdm(nlp.pipe(zip(df.text.values, df.category.values), as_tuples=True, disable=["tok2vec"])):
        doc_bin.add(doc)
        doc.cats['category'] = category
        article = [
            token.lemma_ for token in doc 
            if not token.is_stop 
            and not token.is_punct 
            and not token.like_num
            and not token.is_space]
        dictionary.add_documents([article])
        corpus.append(dictionary.doc2bow(article))
        
    return articles, doc_bin, dictionary, corpus

In [8]:
articles, doc_bin, dictionary, corpus = preprocess(df)

cleaning and create dictionary mapping of tokens in articles for analysis


2126it [02:22, 14.95it/s]


### Build a Latent Dirichlet Allocation (LDA) model
Let's try and find meaningful subtopics.

In [9]:
print('Building LDA model')
lda_model = LdaModel(corpus=corpus, num_topics=N_TOPICS, id2word=dictionary)
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)

Building LDA model


In [10]:
def format_topics_sentences(ldamodel, corpus):
    sent_topics_df = pd.DataFrame()
    for i, row_list in enumerate(lda_model[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list            
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output
    sent_topics_df = pd.concat([sent_topics_df, df], axis=1)
    return sent_topics_df

df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus)

## Review the text and topics
Look at the top percentage contributions for topic keyword collections, along with the original text. 

In [11]:
df_dominant_topic = df_topic_sents_keywords.reset_index().sort_values('Perc_Contribution', ascending=False)
df_dominant_topic.head(10)

Unnamed: 0,index,Dominant_Topic,Perc_Contribution,Topic_Keywords,category,text
1552,1552,1.0,0.9992,"new, people, year, blair, film, uk, labour, wo...",tech,attack prompts bush site block the official re...
226,226,2.0,0.9991,"year, $, new, sales, world, t, players, game, ...",tech,learning to love broadband we are reaching the...
1851,1851,2.0,0.9989,"year, $, new, sales, world, t, players, game, ...",politics,observers to monitor uk election ministers wil...
2026,2026,1.0,0.9985,"new, people, year, blair, film, uk, labour, wo...",politics,job cuts false economy - tuc plans to shed ...
1004,1004,2.0,0.9984,"year, $, new, sales, world, t, players, game, ...",entertainment,critics back aviator for oscars martin scorses...
873,873,2.0,0.9982,"year, $, new, sales, world, t, players, game, ...",politics,howard rejects bnp s claim tory leader michael...
790,790,2.0,0.9982,"year, $, new, sales, world, t, players, game, ...",tech,text message record smashed uk mobile owners c...
668,668,1.0,0.9982,"new, people, year, blair, film, uk, labour, wo...",business,sec to rethink post-enron rules the us stock m...
257,257,2.0,0.9982,"year, $, new, sales, world, t, players, game, ...",tech,sporting rivals go to extra time the current s...
430,430,2.0,0.998,"year, $, new, sales, world, t, players, game, ...",business,s korea spending boost to economy south korea ...


### Review the detail of some of the data
Let's article with the highest percentage contribution was about civil service job cuts.

In [12]:
index_id = 984

df_dominant_topic.Topic_Keywords[index_id]
df_dominant_topic.text[index_id]

'music mogul fuller sells company pop idol supremo simon fuller has sold his 19 entertainment company to an us entrepreneur in a $156m (£81.5m) deal.  robert sillerman s sports entertainment enterprises  which is to be renamed cfx  recently also bought an 85% share in the estate of elvis presley. mr fuller has been appointed to the cfx board and will plan and implement the company s creative strategy. the 19 firm handles a roster of music artists  tv shows and pr strategies for stars including the beckhams. the deal sees mr fuller receive £64.5m in cash and about 1.9 million shares in sports entertainment. there will also be a further £19.2m in either cash or stocks by the end of the financial year in june. mr fuller has signed a long-term agreement with the company which will see him continue to expand and develop entertainment brands. he said:  this is a hugely exciting new partnership for myself and 19 entertainment.   ckx will provide 19 with a powerful platform for global growth a

### Correlation between new topics and categories


In [13]:
# to be added
pass


### Repeat the analysis for indvidual categories



In [14]:
df.columns
category = 'tech'

articles, doc_bin, dictionary, corpus = preprocess(df[df.category==category])
lda_model = LdaModel(corpus=corpus, num_topics=N_TOPICS, id2word=dictionary)
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)

cleaning and create dictionary mapping of tokens in articles for analysis


347it [00:30, 11.38it/s]
