# NLP Project
## Topic extraction
The purpose of this project is to extract topics from news articles.

### Step-by-step Process
1. Choose a suitable NLP approach for topic extraction: named entity recognition (NER) with spaCy
2. Preprocess the data
3. Get results
4. Documentation

In [1]:
# import dependencies
import pandas as pd
import gensim
import spacy
from spacy import displacy
from collections import Counter

### Data Pre-processing

In [2]:
# Full-dataset version (all three article files, currently disabled):
# from sklearn.model_selection import train_test_split

# # read in data
# df_1 = pd.read_csv('Data/articles1.csv')['content'].to_frame()  # only get content-column
# df_2 = pd.read_csv('Data/articles2.csv')['content'].to_frame()
# df_3 = pd.read_csv('Data/articles3.csv')['content'].to_frame()
# df = pd.concat([df_1, df_2, df_3])  # DataFrame.append was removed in pandas 2.0
# print('\nData set, shape:', df.shape)
# print(df.head(1))

# # check for missing data
# print(df.isna().sum())  # shows no null values in content-column

# # split data into ~67% training and ~33% testing
# train, test = train_test_split(df, test_size=0.33, random_state=1)
# print('\nTraining data set, shape:', train.shape)
# print('Testing data set, shape:', test.shape)

# # reset indices
# train = train.reset_index(drop=True)
# test = test.reset_index(drop=True)
# print(train.head(5))
# print('\n', test.head(5))

In [3]:
# read in data
df = pd.read_csv('Data/articles1.csv')['content'].to_frame()
# keep only the rows from position 49900 onward (the last 100 articles)
df.drop(df.index[0:49900], axis=0, inplace=True)
print('\nData set, shape:', df.shape)
print(df.head(1))


Data set, shape: (100, 1)
                                                 content
49900  “Let us not be timid,” Paul Ryan exhorted memb...


In [4]:
# check for missing data
print(df.isna().sum())  # shows no null values in content-column

content    0
dtype: int64


### Loading the Pipeline

In [5]:
# load pipeline
nlp = spacy.load('en_core_web_sm')

In [6]:
# entity labels to keep as keywords (the commented-out list also includes PERSON and NORP)
# useful_entities = ['PERSON', 'NORP', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART']
useful_entities = ['ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART']

pd.set_option('display.max_colwidth', None)

In [7]:
# function to apply to each article: returns a list of its (up to) 3 most common entity lemmas
def add_keywords(x):
    doc = nlp(x)  # run article through pipeline
    entities = []  # lemmas of all entities whose label is in useful_entities
    for ent in doc.ents:
        if ent.label_ in useful_entities:
            entities.append(ent.lemma_)
    cmn_ents = Counter(entities).most_common(3)  # list of (lemma, count) pairs
    keywords = [e[0] for e in cmn_ents]  # keep the lemma, drop the count
    return keywords
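
The filtering and counting steps of `add_keywords` can be tried without loading a spaCy model. The following is a minimal sketch, assuming the entities have already been extracted as `(text, label)` pairs; the helper `top_keywords` and the sample data are hypothetical and not part of the notebook.

```python
from collections import Counter

# Same labels as useful_entities in the notebook.
USEFUL_ENTITIES = {'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART'}


def top_keywords(entities, n=3):
    """Return the n most frequent entity texts whose label is in USEFUL_ENTITIES.

    `entities` is an iterable of (text, label) pairs (hypothetical input format).
    """
    kept = [text for text, label in entities if label in USEFUL_ENTITIES]
    return [text for text, _count in Counter(kept).most_common(n)]


# Hand-made sample standing in for doc.ents; PERSON entries are filtered out.
sample = [
    ('House', 'ORG'), ('GOP', 'ORG'), ('House', 'ORG'),
    ('Paul Ryan', 'PERSON'), ('Congress', 'ORG'), ('GOP', 'ORG'),
    ('House', 'ORG'), ('Paul Ryan', 'PERSON'),
]
print(top_keywords(sample))  # ['House', 'GOP', 'Congress']
```

Although 'Paul Ryan' appears twice, it is dropped because PERSON is not among the kept labels, which mirrors the choice made for `useful_entities`.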

### Testing and Running the Pipeline

In [8]:
# run pipeline on first article and print keywords plus text with highlighted keywords for demonstration
first_keywords = add_keywords(df['content'].iloc[0])
print('Keywords:', first_keywords)
displacy.render(nlp(df['content'].iloc[0]), style="ent", jupyter=True)

Keywords: ['House', 'GOP', 'Congress']


In [9]:
# apply add_keywords-function to all articles to load each one's keywords into new column
df['keywords'] = df['content'].apply(add_keywords)

In [10]:
# save the df with added keywords into a csv file
df.to_csv('Data/articles_keywords.csv', index=False)

# print first 10 rows of topics
print(df['keywords'].head(10))

49900                                        [House, GOP, Congress]
49901           [Congress, the 115th Congress, Pew Research Center]
49902                                     [OA, American Honey, HBO]
49903                   [North Korea, the United States, Pyongyang]
49904                                 [Reynolds, Hollywood, Fisher]
49905                                        [Trump, House, Senate]
49906                   [Jewels, run the Jewels 3, the Oval Office]
49907                                         [Texas, Trump, Obama]
49908                           [the White House, Obama, Instagram]
49909    [Obama, White House correspondent’ Dinner, the Daily Show]
Name: keywords, dtype: object
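
One caveat when saving the keywords: a CSV cell can only hold text, so each keyword list is written as its string representation and comes back as a string when the file is read again. The sketch below illustrates this with the standard-library `csv` module standing in for `to_csv`/`read_csv`; the in-memory buffer and the two sample rows are assumptions for illustration. With pandas, passing `converters={'keywords': ast.literal_eval}` to `pd.read_csv` should achieve the same restoration.

```python
import ast
import csv
import io

# Two keyword lists taken from the output above.
rows = [
    ['House', 'GOP', 'Congress'],
    ['North Korea', 'the United States', 'Pyongyang'],
]

# Write each list as its string representation, as happens when a list column is saved to CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['keywords'])
writer.writeheader()
for keywords in rows:
    writer.writerow({'keywords': str(keywords)})

# Read the file back: every cell is now a plain string, not a list.
buffer.seek(0)
raw = [row['keywords'] for row in csv.DictReader(buffer)]

# ast.literal_eval safely parses the string back into a Python list.
restored = [ast.literal_eval(cell) for cell in raw]

print(type(raw[0]).__name__)       # str
print(type(restored[0]).__name__)  # list
```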


## Future Concepts to Explore
- BERT (Bidirectional Encoder Representations from Transformers): text embeddings that keep contextual information, https://www.tensorflow.org/text/tutorials/classify_text_with_bert
- LDA (Latent Dirichlet Allocation)
- Entity linking
- Coreference resolution
- Knowledge graph
- Academic papers
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
- https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation
- https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html
- https://medium.com/analytics-vidhya/topic-modelling-using-latent-dirichlet-allocation-in-scikit-learn-7daf770406c4
- https://towardsdatascience.com/introduction-to-topic-modeling-using-scikit-learn-4c3f3290f5b9
- https://towardsdatascience.com/effortless-nlp-using-pre-trained-hugging-face-pipelines-with-just-3-lines-of-code-a4788d95754f
- https://kyawkhaung.medium.com/multi-label-text-classification-with-bert-using-pytorch-47011a7313b9
- https://towardsdatascience.com/question-answering-with-a-fine-tuned-bert-bc4dafd45626
- https://www.analyticsvidhya.com/blog/2021/06/part-3-topic-modeling-and-latent-dirichlet-allocation-lda-using-gensim-and-sklearn/