# NLP Project
## Topic extraction
The purpose of this project is to classify news articles by topic, using topic extraction to identify what each article is about.

## Step-by-step Process
1. Find a suitable NLP model to use for topic extraction
2. Find a dataset to train the model on
3. Preprocess the data
4. Train the model
5. Evaluate the performance of the model against a test set
6. Visualise results

## Concepts to Explore
- BERT: Bidirectional Encoder Representations from Transformers; embeds text while preserving contextual information, https://www.tensorflow.org/text/tutorials/classify_text_with_bert
- NER: Named Entity Recognition
- LDA: https://github.com/priya-dwivedi/Deep-Learning/blob/master/topic_modeling/LDA_Newsgroup.ipynb
- Topic modelling: https://monkeylearn.com/blog/introduction-to-topic-modeling/

- spaCy
    - https://machinelearningknowledge.ai/spacy-nlp-pipeline-tutorial-for-beginners/
    - https://towardsdatascience.com/structured-natural-language-processing-with-pandas-and-spacy-7089e66d2b10

In [31]:
# import dependencies
import pandas as pd
# from sklearn.model_selection import train_test_split
import spacy
from spacy import displacy
from collections import Counter

### Data Pre-processing

In [32]:
# read in data
# df_1 = pd.read_csv('Data/articles1.csv')['content'].to_frame()  # only get content-column
# df_2 = pd.read_csv('Data/articles2.csv')['content'].to_frame()
# df_3 = pd.read_csv('Data/articles3.csv')['content'].to_frame()
# df = pd.concat([df_1, df_2, df_3])
df = pd.read_csv('Data/articles1.csv')['content'].to_frame()
# drop the first 40,000 rows and keep the remaining articles
df.drop(df.index[0:40000], axis=0, inplace=True)
print('\nData set, shape:', df.shape)

print(df.head(5))


Data set, shape: (10000, 1)

In [33]:
# # split data into ~67% training and ~33% testing
# train, test = train_test_split(df, test_size=0.33, random_state=1)
# print('\nTraining data set, shape:', train.shape)
# print('Testing data set, shape:', test.shape)

In [34]:
# check for missing data
print(df.isna().sum())  # shows no null values in content-column

# print(train.isna().sum())  # shows no null values in content-column
# print('\n', test.isna().sum())  # shows no null values in content-column

content    0
dtype: int64
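
Note that `isna()` only counts `NaN`/`None` values, so an article whose content is an empty or whitespace-only string would still pass the check above. The following is a minimal sketch of a stricter check, assuming such rows should be dropped; the `toy` frame and its values are invented for illustration and are not part of the dataset.

```python
import pandas as pd

# toy frame standing in for the articles data (values are made up)
toy = pd.DataFrame({'content': ['Some article text', '', '   ', None, 'More text']})

# treat empty / whitespace-only strings as missing, then drop them
is_blank = toy['content'].isna() | (toy['content'].str.strip() == '')
clean = toy[~is_blank].reset_index(drop=True)

print('blank or missing rows:', int(is_blank.sum()))
print(clean)
```

In the notebook, `df` would take the place of `toy`.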


In [35]:
# reset indices
# train = train.reset_index(drop=True)
# test = test.reset_index(drop=True)
# print(train.head(5))
# print('\n', test.head(5))

### Building the Data Pipeline

In [36]:
# load pipeline
nlp = spacy.load('en_core_web_sm')

In [37]:
# entity labels (spaCy NER types) to keep as keywords
useful_entities = ['PERSON', 'NORP', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART']

In [38]:
# run pipeline on first article and print keywords + text with highlighted keywords for demonstration
pd.set_option('display.max_colwidth', None)
doc = nlp(df['content'].iloc[0])

entities = []
for ent in doc.ents:
    if ent.label_ in useful_entities:
        entities.append(ent.lemma_)
cmn_ents = Counter(entities).most_common(10)
keywords = [e[0] for e in cmn_ents]
print('Keywords:', keywords)
displacy.render(doc, style="ent", jupyter=True)

Keywords: ['Shin', 'Harden', 'north korean']
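
The counting step can be tried in isolation, without loading a spaCy model. The sketch below assumes the entities are available as `(text, label)` pairs, standing in for `(ent.lemma_, ent.label_)`; the helper name `top_keywords` and the sample pairs are hypothetical and only serve as illustration.

```python
from collections import Counter

def top_keywords(entities, allowed_labels, n=3):
    """Return the n most frequent entity strings whose label is allowed.

    `entities` is a list of (text, label) pairs, standing in for
    (ent.lemma_, ent.label_) taken from a spaCy doc.
    """
    counts = Counter(text for text, label in entities if label in allowed_labels)
    return [text for text, _ in counts.most_common(n)]

# hypothetical entity pairs, invented for illustration
sample = [
    ('Shin', 'PERSON'), ('Harden', 'PERSON'), ('Shin', 'PERSON'),
    ('north korean', 'NORP'), ('2012', 'DATE'), ('Shin', 'PERSON'),
    ('Harden', 'PERSON'),
]
print(top_keywords(sample, {'PERSON', 'NORP'}))
```

`Counter.most_common(n)` returns fewer than `n` items when there are fewer distinct entities, which is why asking for ten keywords can still yield only three.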


In [39]:
# function to apply to each article:
#   returns a list holding the single most common keyword (named entity)
#   in the article, or an empty list if no useful entity is found
def get_keywords(x):
    doc = nlp(x)
    entities = [ent.lemma_ for ent in doc.ents if ent.label_ in useful_entities]
    cmn_ents = Counter(entities).most_common(1)
    return [e[0] for e in cmn_ents]

# apply get_keywords to all articles and store the result as a new column
df['keywords'] = df['content'].apply(get_keywords)
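
Calling the pipeline once per article through `apply` can be slow on 10,000 articles; spaCy's `nlp.pipe` processes texts in batches and is usually faster. The sketch below is one possible way to do this, not the method used in this notebook. It assumes a blank English pipeline with a rule-based `entity_ruler` and made-up patterns so that it runs without downloading a model, and it uses `ent.text` instead of `ent.lemma_` because a blank pipeline has no lemmatizer. The names `nlp_demo`, `labels`, and `texts` are hypothetical.

```python
from collections import Counter
import spacy

# assumption: a blank pipeline with hand-written patterns stands in for
# the 'en_core_web_sm' model so the sketch runs without a model download
nlp_demo = spacy.blank('en')
ruler = nlp_demo.add_pipe('entity_ruler')
ruler.add_patterns([
    {'label': 'GPE', 'pattern': 'Taiwan'},
    {'label': 'EVENT', 'pattern': 'Super Bowl'},
])

labels = {'GPE', 'EVENT'}
texts = [
    'Taiwan held a vote. Observers said Taiwan was calm.',
    'The Super Bowl drew a record audience.',
    'Nothing of note happened here.',
]

keywords_per_text = []
# nlp.pipe yields one Doc per input text, in input order
for doc in nlp_demo.pipe(texts, batch_size=2):
    counts = Counter(ent.text for ent in doc.ents if ent.label_ in labels)
    keywords_per_text.append([text for text, _ in counts.most_common(1)])

print(keywords_per_text)
```

In the notebook, the loaded `nlp` object and `df['content']` would replace `nlp_demo` and `texts`, and the resulting list could be assigned to the `keywords` column.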

In [40]:
df.to_csv('Data/articles_keywords.csv', index=False)
print(df['keywords'].head(10))

40000          [Shin]
40001        [Taiwan]
40002    [Super Bowl]
40003         [Bravo]
40004         [Yemen]
40005        [Romney]
40006    [Super Bowl]
40007         [Rocky]
40008        [Israel]
40009          [King]
Name: keywords, dtype: object
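
The sample already shows one keyword ('Super Bowl') shared by two articles, which suggests the keywords can act as crude topic labels. A quick way to see how articles would group is to count keyword frequencies. The sketch below uses the ten keyword lists printed above; in the notebook, `df['keywords']` would take the place of `keyword_lists`.

```python
from collections import Counter

# the ten keyword lists printed above
keyword_lists = [['Shin'], ['Taiwan'], ['Super Bowl'], ['Bravo'], ['Yemen'],
                 ['Romney'], ['Super Bowl'], ['Rocky'], ['Israel'], ['King']]

# flatten the per-article lists and count how often each keyword occurs
freq = Counter(kw for kws in keyword_lists for kw in kws)
shared = {kw: n for kw, n in freq.items() if n > 1}

print(freq.most_common(3))
print('keywords shared by several articles:', shared)
```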


### Other Methods of Topic Extraction to Explore
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
- https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation
- https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html
- https://medium.com/analytics-vidhya/topic-modelling-using-latent-dirichlet-allocation-in-scikit-learn-7daf770406c4
- https://towardsdatascience.com/introduction-to-topic-modeling-using-scikit-learn-4c3f3290f5b9
- https://towardsdatascience.com/effortless-nlp-using-pre-trained-hugging-face-pipelines-with-just-3-lines-of-code-a4788d95754f
- https://kyawkhaung.medium.com/multi-label-text-classification-with-bert-using-pytorch-47011a7313b9
- https://towardsdatascience.com/question-answering-with-a-fine-tuned-bert-bc4dafd45626
- https://www.analyticsvidhya.com/blog/2021/06/part-3-topic-modeling-and-latent-dirichlet-allocation-lda-using-gensim-and-sklearn/