# NLP Project
## Topic extraction
The purpose of this project is to extract topics from news articles.

### Step-by-step Process
1. Choose a suitable NLP technique for topic extraction: named entity recognition (NER)
2. Preprocess the data
3. Run the pipeline and collect the results
4. Document the process

In [4]:
# import dependencies
import pandas as pd
import spacy
from spacy import displacy
from collections import Counter

### Data Pre-processing

In [5]:
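# read in data, keeping only the 'content' column
df = pd.read_csv('Data/articles1.csv')['content'].to_frame()
# drop the first 49990 rows; axis is passed by keyword because the positional form is deprecated
df.drop(df.index[0:49990], axis=0, inplace=True)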
print('\nData set, shape:', df.shape)
print(df.head(10))


Data set, shape: (10, 1)
                                                 content
49990  In January 1999, Prosecutor General Yury Skura...
49991          This article is part of a feature we a...
49992  President Obama’s farewell speech was an exerc...
49993  Updated on January 11 at 5:56 p. m. ET,   Dona...
49994  A large cohort of Americans have reservations ...
49995  As chairman and CEO of ExxonMobil, Rex Tillers...
49996  I’ve spent nearly 20 years looking at intellig...
49997    Donald Trump will not be taking necessary st...
49998  Dozens of   colleges could be forced to close ...
49999  The force of gravity can be described using a ...



In [6]:
# check for missing data
print(df.isna().sum())  # shows no null values in content-column

content    0
dtype: int64


### Loading the Pipeline

In [7]:
# load pipeline
nlp = spacy.load('en_core_web_sm')

In [8]:
# entity types (spaCy NER labels) whose entities count as keywords
# full candidate list, including PERSON and NORP, kept for reference:
# useful_entities = ['PERSON', 'NORP', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART']
useful_entities = ['ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART']

pd.set_option('display.max_colwidth', None)

In [9]:
# define function to apply to each article: returns a list of the 3 most common keywords,
# restricted to the entity types listed in useful_entities above
def add_keywords(x):
    doc = nlp(x)  # run article through pipeline
    # collect the lemma of every entity whose label is in useful_entities
    entities = [ent.lemma_ for ent in doc.ents if ent.label_ in useful_entities]
    cmn_ents = Counter(entities).most_common(3)  # (keyword, count) pairs
    return [keyword for keyword, _ in cmn_ents]
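
The filter-and-count logic inside `add_keywords` can be tried without loading a spaCy model. The sketch below is illustrative only: `sample_entities` and `top_keywords` are hypothetical names, and the hand-written `(lemma, label)` pairs stand in for what `doc.ents` would provide.

```python
from collections import Counter

# Hypothetical (lemma, label) pairs standing in for the entities in doc.ents
sample_entities = [
    ("Kremlin", "ORG"),
    ("Russia", "GPE"),
    ("Kremlin", "ORG"),
    ("Yury", "PERSON"),  # PERSON is not in useful_entities, so it is dropped
    ("Russia", "GPE"),
    ("Kremlin", "ORG"),
    ("Moscow", "GPE"),
]
useful_entities = {"ORG", "GPE", "LOC", "PRODUCT", "EVENT", "WORK_OF_ART"}


def top_keywords(pairs, allowed, k=3):
    """Return the k most frequent lemmas whose label is in `allowed`."""
    kept = [lemma for lemma, label in pairs if label in allowed]
    # most_common sorts by count; equal counts keep first-seen order
    return [lemma for lemma, _ in Counter(kept).most_common(k)]


keywords = top_keywords(sample_entities, useful_entities)
print(keywords)  # ['Kremlin', 'Russia', 'Moscow']
```

An article with fewer than three matching entities simply yields a shorter list, and one with none yields an empty list.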

### Testing and Running the Pipeline

In [10]:
# run pipeline on first article and print keywords plus text with highlighted keywords for demonstration
first_keywords = add_keywords(df['content'].iloc[0])
print('Keywords:', first_keywords)
displacy.render(nlp(df['content'].iloc[0]), style="ent", jupyter=True)

Keywords: ['Kremlin', 'Russia', 'Moomoo']


In [11]:
# apply add_keywords-function to all articles to load each one's keywords into new column
df['keywords'] = df['content'].apply(add_keywords)

In [12]:
# save the df with added keywords into a csv file
df.to_csv('Data/articles_keywords.csv', index=False)

# print first 10 rows of topics
print(df['keywords'].head(10))

49990                                         [Kremlin, Russia, Moomoo]
49991                           [Trump, Politics  Policy Daily, Russia]
49992                                            [Trump, CIA, Congress]
49993                                          [Trump, Trump’s, Dillon]
49994                           [Washington, Hollywood, South Carolina]
49995                             [Tillerson, Exxon, the United States]
49996                                                 [FBI, CNN, Trump]
49997                          [Trump, Trump’s, the Trump Organization]
49998    [George Washington University, the Education Department, NBER]
49999                     [Sagittarius, Center for Astrophysics, Chile]
Name: keywords, dtype: object
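
One thing to keep in mind when reloading `articles_keywords.csv`: a list-valued cell is written to CSV as its string form, so it comes back as a plain string rather than a list. The sketch below shows one way to restore it with `ast.literal_eval`. It uses an in-memory buffer and the standard `csv` module instead of the real file, and the variable names are illustrative assumptions.

```python
import ast
import csv
import io

original = ["Kremlin", "Russia", "Moomoo"]

# Write one list-valued cell the way it ends up in a CSV file: as a string
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["keywords"])
writer.writerow([str(original)])

# Read it back: the cell is now text, not a list
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_cell = rows[0]["keywords"]

# literal_eval safely parses the Python literal back into a list
restored = ast.literal_eval(raw_cell)
print(type(raw_cell).__name__, type(restored).__name__)  # str list
```

With pandas, passing `converters={'keywords': ast.literal_eval}` to `pd.read_csv` should apply the same conversion at load time.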
