## NLP List ##
- Grab data from JSON file
- Target XML files and grab articles for individual news sites
- Parsing and organizing data

## JSON Input ##
Reading the website list from a JSON file.

In [11]:
# Imports for JSON Input Section
import json
import os

In [12]:
# Two input files are available; uncomment the one to use
file_name = "testdata.json"     # One website
# file_name = "websites.json"     # All websites

with open(file_name) as file:
    websites = []    
    # Load in JSON file
    data = json.load(file)  
    # For each site in JSON file, append to websites
    for site in data['target-sites']:
        websites.append(site['parse-type']['url'])


In [13]:
# Print the list of websites that were loaded
print(websites)

['https://www.cnn.com/sitemaps/cnn/news.xml']
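
The loading cell above implies a particular layout for the JSON file. The sketch below shows one layout that would satisfy it. Only the keys `target-sites`, `parse-type`, and `url` come from the loading code; the rest of the sample, and the names `sample_json`, `sample_data`, and `sample_sites`, are assumptions made for illustration.

```python
import json

# Hypothetical file contents: only the keys 'target-sites', 'parse-type',
# and 'url' are taken from the loading code; everything else is assumed.
sample_json = """
{
  "target-sites": [
    {"parse-type": {"url": "https://www.cnn.com/sitemaps/cnn/news.xml"}}
  ]
}
"""

sample_data = json.loads(sample_json)
# Same extraction as the append loop, written as a list comprehension
sample_sites = [site["parse-type"]["url"] for site in sample_data["target-sites"]]
print(sample_sites)
```

A real entry may carry more fields per site; the loop only needs the nested `url` value.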


## Creating our corpus ##


In [14]:
# Imports for section
import requests
import timeit
import pandas as pd

from bs4 import BeautifulSoup

# time.sleep is used to pause between article requests
import time
# Start a timer to measure the speed of the aggregation process
start = timeit.default_timer()

Rather than appending each new entry to the DataFrame, we will grow a plain Python list and convert it to a DataFrame once at the end. Reference [here](https://stackoverflow.com/a/56746204/7560483).
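
As a minimal sketch of that pattern, assuming made-up rows in place of scraped articles (the names `sketch_rows` and `sketch_df` and the `example.com` URLs are hypothetical):

```python
import pandas as pd

columns = ["pub-date", "source", "name", "title", "content"]

# Hypothetical rows standing in for scraped articles
sketch_rows = []
for i in range(3):
    sketch_rows.append([
        f"2021-02-0{i + 1}",
        f"https://example.com/article-{i}",
        "Example",
        f"Title {i}",
        "Some text",
    ])

# Build the DataFrame once, after the loop has finished
sketch_df = pd.DataFrame(sketch_rows, columns=columns)
print(sketch_df.shape)
```

Appending to a list is cheap, whereas growing a DataFrame inside the loop copies the existing rows on every iteration.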

In [15]:
# Today's date in US Central time
today = pd.to_datetime("today").tz_localize('US/Central').date()
data = [] 

# Load previously stored articles, or start with an empty DataFrame
try:
    # Import all articles stored within the csv
    df = pd.read_csv('articles.csv', index_col=0)
    # Keep only articles published today (dates are read back from the csv as strings)
    df = df[pd.to_datetime(df['pub-date'], utc=True).dt.date == today]
except FileNotFoundError:
    # Create an empty DataFrame if articles.csv doesn't exist
    df = pd.DataFrame(data, columns=['pub-date','source','name','title','content'])
    

for site in websites:
    # Grab individual sites XML 
    resp = requests.get(site)

    # Parse XML using BeautifulSoup
    soup = BeautifulSoup(resp.content, 'lxml-xml')
    # For each article listed in the sitemap
    for article in soup.find_all('url'):
        # Grab and parse the publication date of the current article
        pub_date = pd.to_datetime(article.publication_date.get_text()).tz_convert('US/Central').date()
        # print(pub_date , today , pub_date - today)
        if article.loc.get_text() in df['source'].values:     # Continue if previously processed article
            continue
        elif pub_date < today:     # Continue if pub-date not current date
            continue
        # Grab and parse the article
        page = requests.get(article.loc.get_text())
        soup2 = BeautifulSoup(page.content, 'html.parser')

        # Build content based on HTML format
        content = []
        for ele in soup2.find_all('div',{'class':'zn-body__paragraph'}):
            content.append(ele.get_text())
        if not content:
            for par in soup2.find_all('p'):
                content.append(par.get_text())
        # Append article onto data list
        data.append([
            pub_date,
            article.loc.get_text(),
            article.find('name').get_text(),
            article.title.get_text(),
            " ".join(content)
        ])
        
        # Sleep for etiquette
        time.sleep(.5)

# Concatenate the new rows, as a DataFrame, with the existing DataFrame
df = pd.concat([df, pd.DataFrame(data, columns=['pub-date','source','name','title','content'])], ignore_index=True)

# Final type conversions on the aggregated data
df['pub-date'] = pd.to_datetime(df['pub-date'], utc=True)
df['content'] = df['content'].astype(str)
# Export data
df.to_csv("articles.csv")

# Stop the timer and report the elapsed time of the aggregation process
stop = timeit.default_timer()
print("Time: ", stop-start)

Time:  3.713070900000048
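
The scraping loop above reads four fields from each `<url>` entry: `loc`, `name`, `title`, and `publication_date`. The sketch below shows where those fields would sit, assuming the feed follows the Google News sitemap layout, which the tag names suggest but the notebook does not confirm. It uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra packages, and the sample entry (URL, date, title) is entirely made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-entry sitemap; the URL, date, and title are made up
sample_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/2021/02/01/story.html</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2021-02-01T15:30:00Z</news:publication_date>
      <news:title>Example headline</news:title>
    </news:news>
  </url>
</urlset>"""

# Namespace prefixes used in the search paths below
ns = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

root = ET.fromstring(sample_xml)
entries = []
for url in root.findall("sm:url", ns):
    entries.append({
        "source": url.findtext("sm:loc", namespaces=ns),
        "name": url.findtext(".//news:name", namespaces=ns),
        "title": url.findtext(".//news:title", namespaces=ns),
        "pub-date": url.findtext(".//news:publication_date", namespaces=ns),
    })
print(entries[0]["title"])
```

With BeautifulSoup's `lxml-xml` parser the same fields can be reached by local tag name, which is why the loop can write `article.publication_date` without a namespace prefix.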


In [17]:
df.head()

| Unnamed: 0 | pub-date | source | name | title | content | summary |
|---|---|---|---|---|---|---|
| 0 | 2021-02-01 00:00:00+00:00 | https://www.cnn.com/2021/02/01/us/aclu-first-b... | CNN | ACLU, for first time in its 101-year history, ... | The ACLU made the announcement Monday, calling... | "After beginning my career as an ACLU fellow, ... |
| 1 | 2021-02-01 00:00:00+00:00 | https://www.cnn.com/2021/02/01/politics/joe-ma... | CNN | White House reached out to Manchin after Harri... | The outreach comes after Harris' apparent move... | In an interview with WSAZ Thursday, Harris sai... |
| 2 | 2021-02-01 00:00:00+00:00 | https://www.cnn.com/2021/02/01/health/us-coron... | CNN | The US is in an &apos;absolute race&apos; with... | "What concerns me most is that we already know... | While the state has the capacity to give 250,0... |
| 3 | 2021-02-01 00:00:00+00:00 | https://www.cnn.com/2021/02/01/investing/googl... | CNN | Google investors may have forgotten how much l... | Alphabet (GOOGL) is trading near an all-time h... | But broader concerns about the pandemic could ... |
| 4 | 2021-02-01 00:00:00+00:00 | https://edition.cnn.com/2021/02/01/europe/nean... | CNN International | Prehistoric teeth hint at Stone Age sex with N... | During this time, Homo sapiens and Neanderthal... | The team was trying to recover DNA from the te... |


## Cleaning and preparations ## 

In [18]:
# Imports for Cleaning and preparations
import numpy as np
import heapq
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

Now that we've aggregated articles for our data, let's preview an article.

In [19]:
df['content'][0]

'The ACLU made the announcement Monday, calling Archer "an established civil rights attorney, scholar, and teacher." In addition to her professorship, Archer is the co-faculty director at NYU\'s Center on Race, Inequality, and the Law, and the director of the Civil Rights Clinic at NYU School of Law. Archer has been a part of the ACLU for years, beginning her career as a legal fellow in the ACLU Racial Justice Program, the organization stated. She\'s been a member of the board since 2009 and a general counsel since 2017. "After beginning my career as an ACLU fellow, it is an honor to come full circle and now lead the organization as board president," Archer said in a statement. "The ACLU has proven itself as an invaluable voice in the fight for civil rights in the last four years of the Trump era, and we are better positioned than ever to face the work ahead." Susan Herman previously held the role, serving for 12 years.  Archer\'s duties will include leading its more than 60 members in

Notice the symbols and characters within the article content. Before we calculate the word frequencies, let's use regular expressions to clean things up.
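
To see what the substitutions used in the next cell do, here is a small sketch on a made-up sentence (`raw`, `sample_text`, and `letters_only` are illustrative names; the final `.strip()` is added here only to tidy the printed result):

```python
import re

# Hypothetical input with a newline, extra spaces, digits, and punctuation
raw = "She's been a member of the board since 2009,\n  and a general counsel since 2017."

# Collapse runs of whitespace (spaces, tabs, newlines) into a single space
sample_text = re.sub(r"\s+", " ", raw)
# Replace every character that is not an ASCII letter with a space
letters_only = re.sub(r"[^a-zA-Z]", " ", sample_text)
# Collapse the whitespace left behind by the previous substitution
letters_only = re.sub(r"\s+", " ", letters_only).strip()
print(letters_only)
# → She s been a member of the board since and a general counsel since
```

Note that the apostrophe in "She's" is also replaced, which leaves a stray "s" token; NLTK's English stop-word list contains "s", so it is dropped later.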

In [20]:
# One summary slot per article (np.object was removed from NumPy; use the builtin object)
summary_data = np.full([len(df)], "", dtype=object)
for index, row in df.iterrows():
    # Remove all extra spaces
    text = re.sub(r'\s+', ' ', row['content'])
    # Remove all numbers and special characters, used for frequency
    format_text = re.sub('[^a-zA-Z]', ' ', text )
    format_text = re.sub(r'\s+', ' ', format_text)

    # Remove stopwords from tokenized array
    stop_words = list(stopwords.words('english'))
    word_tokens = np.array(word_tokenize(format_text))
    format_text = np.array([word for word in word_tokens if not word in stop_words])

    # Splitting sentences
    sentence_list = np.array(text.split('. '))
    
    word_freq = {}
    for word in format_text:
        if word not in word_freq.keys():
            word_freq[word] = 1
        else: 
            word_freq[word] += 1

    # Normalize each count by the count of the most frequent word
    max_freq = max(word_freq.values(), default=0)
    for word in word_freq.keys():
        word_freq[word] = (word_freq[word]/max_freq)
    
    sentence_score = {}
    sentence_avg = round(1.5*sum([len(sent.split(' ')) for sent in sentence_list])/len(sentence_list))
    for sentence in sentence_list:
        for word in word_tokenize(sentence.lower()):
            if word in word_freq.keys():
                if len(sentence.split(' ')) < sentence_avg:
                    if sentence not in sentence_score.keys():
                        sentence_score[sentence] = word_freq[word]
                    else:
                        sentence_score[sentence] += word_freq[word]

    summary_sentences = heapq.nlargest(7, sentence_score, key=sentence_score.get)
    summary_data[index] = '. '.join(summary_sentences)+"."

# Append new column
df['summary'] = summary_data 
df.to_csv("articles.csv")
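
The scoring idea in the cell above can be condensed into a single function. The following is a sketch under stated assumptions, not the notebook's exact method: it uses only the standard library, splits words with a regular expression instead of NLTK's `word_tokenize`, lowercases every word before counting, and uses a tiny hypothetical stop-word set (`STOP_WORDS`) in place of NLTK's English list. The function name `summarize` is also an assumption.

```python
import heapq
import re
from collections import Counter

# Hypothetical, tiny stop-word set; the notebook uses NLTK's English list
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "in", "on"}

def summarize(text, n=2):
    """Return the n highest-scoring sentences of text, joined into one string."""
    sentences = [s for s in re.sub(r"\s+", " ", text).strip().split(". ") if s]
    # Word frequencies over lowercased alphabetic tokens, minus stop words
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)
    if not freq:
        return ""
    top = max(freq.values())
    weights = {w: c / top for w, c in freq.items()}
    # Only score sentences shorter than 1.5x the average sentence length
    limit = round(1.5 * sum(len(s.split(" ")) for s in sentences) / len(sentences))
    scores = {}
    for s in sentences:
        if len(s.split(" ")) < limit:
            score = sum(weights.get(w, 0) for w in re.findall(r"[a-z]+", s.lower()))
            if score > 0:
                scores[s] = score
    best = heapq.nlargest(n, scores, key=scores.get)
    return ". ".join(best).rstrip(".") + "."

example = ("Cats chase mice. Cats sleep. Dogs bark at cats in the yard "
           "every single morning without fail. Birds sing.")
print(summarize(example, n=2))
# → Cats chase mice. Cats sleep.
```

In the example, "cats" is the most frequent word, so the two short sentences containing it score highest, and the long third sentence is skipped by the length limit.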

In [26]:
print(summary_sentences)
print(sentence_list)

["Rubio's comments about the bus incident The video's description of Rubio's comments about the bus incident is fair enough", 'On January 8, Rubio posted a video statement on Twitter that mixed criticism of the insurrection with criticism of others', 'This claim, about Rubio contributing to harmful political "conditions," is a more nuanced and defensible claim than the suggestion in the video that insurrectionists heard and acted on Rubio\'s words', 'The video goes on to say that Rubio applauded the convoy of Donald Trump supporters that was filmed surrounding a Joe Biden campaign bus on an interstate highway in Texas in October', "The video itself, however, doesn't offer this kind of nuanced criticism of the strength of Rubio's responses to the attack", 'Still, though, he indisputably condemned the insurrection the MeidasTouch video claims he did not condemn.', 'This is 3rd world style anti-American anarchy." That night on Fox News, Rubio said to host Tucker Carlson that the insurrect

In [27]:
# Now let's compare the content and the summary for the first article
print('Content:\n'+df['content'][0])
print('Summary:\n'+df['summary'][0])

Content:
The ACLU made the announcement Monday, calling Archer "an established civil rights attorney, scholar, and teacher." In addition to her professorship, Archer is the co-faculty director at NYU's Center on Race, Inequality, and the Law, and the director of the Civil Rights Clinic at NYU School of Law. Archer has been a part of the ACLU for years, beginning her career as a legal fellow in the ACLU Racial Justice Program, the organization stated. She's been a member of the board since 2009 and a general counsel since 2017. "After beginning my career as an ACLU fellow, it is an honor to come full circle and now lead the organization as board president," Archer said in a statement. "The ACLU has proven itself as an invaluable voice in the fight for civil rights in the last four years of the Trump era, and we are better positioned than ever to face the work ahead." Susan Herman previously held the role, serving for 12 years.  Archer's duties will include leading its more than 60 membe