# NLP Project
## Topic extraction
The purpose of this project is to extract topics from news articles.

### Step-by-step Process
1. Choose a suitable NLP method for topic extraction: RAKE (Rapid Automatic Keyword Extraction)
2. Preprocess the data
3. Extract key phrases and review the results
4. Document the process

In [1]:
# import dependencies
import pandas as pd
from gensim.parsing.preprocessing import (
    preprocess_string, remove_stopwords, stem_text, strip_multiple_whitespaces,
    strip_numeric, strip_punctuation, strip_short, strip_tags,
)
import gensim  # necessary?
from gensim import corpora
import spacy
from rake_nltk import Rake
import nltk
import warnings
warnings.filterwarnings('ignore')

### Load the Data

In [2]:
# read in data
df = pd.read_csv('Data/articles1.csv')['content'].to_frame()
# df.drop(df.index[0:49999], axis=0, inplace=True)  # optionally drop rows to shrink the dataset (this range keeps only the last of the 50,000 rows)
print('\nData set, shape:', df.shape)

# check for missing data
print(df.isna().sum())  # shows no null values in content-column

# pd.set_option('display.max_colwidth', None)
print(df.head(10))


Data set, shape: (50000, 1)
content    0
dtype: int64
                                             content
0  WASHINGTON  —   Congressional Republicans have...
1  After the bullet shells get counted, the blood...
2  When Walt Disney’s “Bambi” opened in 1942, cri...
3  Death may be the great equalizer, but it isn’t...
4  SEOUL, South Korea  —   North Korea’s leader, ...
5  LONDON  —   Queen Elizabeth II, who has been b...
6  BEIJING  —   President Tsai   of Taiwan sharpl...
7  Danny Cahill stood, slightly dazed, in a blizz...
8  Just how   is Hillary Kerr, the    founder of ...
9  Angels are everywhere in the Muñiz family’s ap...


### Pre-process the Data

In [3]:
# load spacy nlp pre-processing pipeline to use for lemmatization
nlp = spacy.load('en_core_web_sm')

In [4]:
# build the filter list for the gensim pre-processing pipeline: every default step except stemming
CUSTOM_FILTERS = [lambda x: x.lower(),  # lowercase
                  strip_tags,  # remove HTML/XML tags
                  strip_punctuation,  # replace punctuation with whitespace
                  strip_multiple_whitespaces,  # collapse repeated whitespace
                  strip_numeric,  # remove digits
                  remove_stopwords,  # remove stopwords
                  strip_short,  # remove words with fewer than 3 characters
                  # stem_text,  # Porter stemming, left out in favor of lemmatization
                 ]
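
The filter list works because `preprocess_string` applies each filter to the string in order and then splits the result on whitespace. The sketch below imitates that chaining with the standard library only. The helpers `apply_filters`, `drop_punctuation`, `drop_digits`, and `drop_short` are hypothetical stand-ins written for illustration; they are not the gensim functions and may differ from them in edge cases.

```python
import re


def apply_filters(text, filters):
    """Apply each filter to the string in order, then split on whitespace."""
    for f in filters:
        text = f(text)
    return text.split()


# Hypothetical stand-ins for three of the gensim filters (illustration only).
def drop_punctuation(s):
    # Replace every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", " ", s)


def drop_digits(s):
    return re.sub(r"\d+", "", s)


def drop_short(s, minsize=3):
    # Keep only words with at least `minsize` characters.
    return " ".join(w for w in s.split() if len(w) >= minsize)


filters = [str.lower, drop_punctuation, drop_digits, drop_short]
tokens = apply_filters("Hello, World! It is 2017.", filters)
print(tokens)  # ['hello', 'world']
```

Order matters in such a chain: lowercasing first means later filters only ever see lowercase text, and removing short words last catches fragments left behind by the punctuation step.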

In [5]:
sample = "Hello, my name is something you'll never guess, Kim! ...But I wrote my signature. Right! My parents called me this, what can I say?"
print(sample)

Hello, my name is something you'll never guess, Kim! ...But I wrote my signature. Right! My parents called me this, what can I say?


In [6]:
# test sample string with gensim's default filters, which include Porter stemming
test_a = preprocess_string(sample)
print(test_a)

['hello', 'guess', 'kim', 'wrote', 'signatur', 'right', 'parent', 'call']


In [7]:
# test sample string with the custom filters (no stemming), followed by spaCy lemmatization
test_b = ' '.join(preprocess_string(sample, CUSTOM_FILTERS))  # pre-process without stemming
lem = [token.lemma_ for token in nlp(test_b)]  # lemmatize
print(lem)

['hello', 'guess', 'kim', 'write', 'signature', 'right', 'parent', 'call']


In [8]:
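def preprocess_articles(x):
    """Clean one article with CUSTOM_FILTERS and return its lemmatized tokens."""
    prep = ' '.join(preprocess_string(x, CUSTOM_FILTERS))  # clean, then re-join for spaCy
    return [token.lemma_ for token in nlp(prep)]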

In [9]:
# apply final pipeline to all data
df['preprocessed'] = df['content'].apply(preprocess_articles)

In [10]:
# print head of preprocessed df
print(df['preprocessed'].head(1))

0    [washington, congressional, republicans, new, ...
Name: preprocessed, dtype: object


In [11]:
# nltk.download('stopwords')  # uncomment on first run to fetch the NLTK stopword list that Rake uses by default
r = Rake()
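
To make the scores in the following cells easier to read, here is a minimal sketch of the RAKE idea: split the text into candidate phrases at stopwords and punctuation, score each word as degree divided by frequency, and score each phrase as the sum of its word scores. This is a simplified illustration, not the `rake_nltk` implementation. The tiny `STOPWORDS` set and the helper names `candidate_phrases` and `rake_scores` are assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; rake_nltk uses NLTK's English list).
STOPWORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "for", "are", "on"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Return (score, phrase) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # summed length of the phrases each word occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {(sum(word_score[w] for w in p), " ".join(p)) for p in phrases}
    return sorted(scored, reverse=True)


text = ("Keyword extraction is the task of finding important phrases. "
        "Keyword extraction tools are simple.")
for score, phrase in rake_scores(text):
    print(score, phrase)
```

Words that appear mostly inside long phrases get a high degree-to-frequency ratio, which is why long multi-word phrases tend to rank at the top of the output.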

In [18]:
# extract key phrases from the first article and list them with their scores, highest first
r.extract_keywords_from_text(df['content'].iloc[0])
r.get_ranked_phrases_with_scores()

[(22.25, 'comment ,” said phillip j'),
 (18.67857142857143, 'disputed subsidies could conceivably cause'),
 (18.677777777777777, 'house republicans last month told'),
 (17.28846153846154, 'incoming trump administration could choose'),
 (16.0, 'despite widespread internal skepticism'),
 (15.25, 'many legal experts said'),
 (15.25, 'eligible consumers could race'),
 (14.733333333333334, 'republicans gain full control'),
 (14.5, '“ upon taking office'),
 (14.03409090909091, 'health care systems generally'),
 (14.0, 'trump era might come'),
 (13.538461538461538, 'trump administration may come'),
 (13.2, '” republican leadership officials'),
 (13.0, 'required — even though'),
 (12.8, 'preserving executive branch prerogatives'),
 (12.733333333333334, 'washington — congressional republicans'),
 (12.7, '“ cascading effects ”'),
 (12.538461538461538, 'administration initially sought one'),
 (11.13409090909091, 'obama health care law'),
 (9.125, 'affordable care act'),
 (9.0, 'trump transition e