<a href="https://colab.research.google.com/github/linesn/reddit_analysis/blob/main/Notebooks/Newspaper_Search-2.0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# News searching
*Nick Lines*

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Parameters" data-toc-modified-id="Parameters-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Parameters</a></span></li><li><span><a href="#Functions-and-Classes" data-toc-modified-id="Functions-and-Classes-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Functions and Classes</a></span></li><li><span><a href="#System-dependent-Configuration" data-toc-modified-id="System-dependent-Configuration-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>System-dependent Configuration</a></span></li></ul></li><li><span><a href="#Collect-Data" data-toc-modified-id="Collect-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Collect Data</a></span><ul class="toc-item"><li><span><a href="#Collect-Newspaper-Articles" data-toc-modified-id="Collect-Newspaper-Articles-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Collect Newspaper Articles</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

# Introduction
This notebook is adapted from a playbook provided by the Discovery Lab, Applied Intelligence, Accenture Federal Services. While the notebook you are reading is original work, I credit the Discovery Lab playbook for pointing out all the packages and libraries in use.

The purpose of this notebook is to let the user search for a specific term or set of terms over a given time period and receive the Google News headlines associated with that search. The results are written to a CSV file. The notebook is designed to run either in Google Colab or on a desktop.

# Setup


<p> The imports, function and class definitions, global variables, and system-dependent configuration are in this section. </p>


## Imports

The first few cells are required to run this notebook in Colab. All other imports are usually already available in the Colab environment.

In [11]:
try:
  from selenium import webdriver
  from selenium.common.exceptions import StaleElementReferenceException
  from selenium.webdriver.common.keys import Keys
  from selenium.webdriver.chrome.options import Options
  from selenium.webdriver.support.ui import WebDriverWait
except ImportError:
  !pip install selenium
  from selenium import webdriver
  from selenium.common.exceptions import StaleElementReferenceException
  from selenium.webdriver.common.keys import Keys
  from selenium.webdriver.chrome.options import Options
  from selenium.webdriver.support.ui import WebDriverWait 

In [12]:
try:
  from newspaper import Article, fulltext
except ImportError:
  !pip install newspaper3k
  from newspaper import Article, fulltext

In [13]:
try:
  from GoogleNews import GoogleNews
except ImportError:
  !pip install GoogleNews  
  from GoogleNews import GoogleNews

In [14]:
try:
  from unidecode import unidecode
except ImportError:
  !pip install unidecode
  from unidecode import unidecode
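
The four cells above repeat the same import-or-install pattern. As a minimal sketch of how that pattern could be factored out, the helper below (`ensure_package` is a hypothetical name, not part of this notebook) tries the import first and only calls `pip` when the module is missing. It assumes `pip` is available for the interpreter running the notebook.

```python
import importlib
import subprocess
import sys


def ensure_package(import_name, pip_name=None):
    """Import a module, installing it with pip first if it is missing.

    `ensure_package` is a hypothetical helper written for illustration.
    """
    try:
        return importlib.import_module(import_name)
    except ImportError:
        # Install into the same interpreter that is running this notebook.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or import_name]
        )
        importlib.invalidate_caches()
        return importlib.import_module(import_name)


# A standard-library module is already importable, so nothing is installed here.
json_module = ensure_package("json")
```

For a package whose import name differs from its PyPI name, both would be passed, for example `ensure_package("newspaper", "newspaper3k")`.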

In [15]:
"""This cell imports necessary Python modules and performs initial configuration
"""

### Data manipulation libraries
# import json
import pandas as pd 
#import csv

### Visualization and Interaction
# import matplotlib.pyplot as plt
# plt.style.use('ggplot')

from IPython.display import set_matplotlib_formats, display, clear_output, HTML
set_matplotlib_formats('retina')

#import plotly.graph_objs as go
#from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot 
#init_notebook_mode(connected=True)

#import ipywidgets as widgets
#from ipywidgets import interact, interactive, fixed, interact_manual
#from ipywidgets import VBox, HBox, Button, HTML, Label

### Computation libraries 
import numpy as np
import re
import random

### Graph analysis
# import networkx as nx
# import community

### System related
# import warnings;
# warnings.filterwarnings('ignore')
import io
import os
import platform
from pathlib import Path
import sys
# from joblib import Parallel, delayed

### Datetime libraries
from datetime import datetime, timedelta
import time
from pytz import timezone

### NLP dependencies
# import spacy
# from spacy.tokenizer import Tokenizer
# nlp = spacy.load('en')
# tokenizer = Tokenizer(nlp.vocab)

# from langdetect import detect

### Scraping libraries

from bs4 import BeautifulSoup

### Machine learning libraries
# from sklearn import datasets
# from sklearn import linear_model
# from sklearn.feature_selection import f_regression, mutual_info_regression
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import classification_report

### Logging
import logging 
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)
#import spacy
# nlp = spacy.load('en')


## System-dependent Configuration
This cell allows the user to save the results of the query in Google Drive if the notebook is running on Colab, or locally if they are running this on their own machine.

In [26]:
"""This cell defines system-dependent configuration such as those different in Linux vs. Windows
"""
if 'COLAB_GPU' in os.environ: # a hacky way of determining if you are in colab.
  print("Notebook is running in colab")
  from google.colab import drive
  drive.mount("/content/drive")
  OUTPUT_DIR = "./drive/My Drive/Data/raw/"
  
else:
  # Get the system information from the OS
  PLATFORM_SYSTEM = platform.system()

  # Darwin is macOS
  if PLATFORM_SYSTEM == "Darwin":
      EXECUTABLE_PATH = Path("../dependencies/chromedriver")
  elif PLATFORM_SYSTEM == "Windows":
      EXECUTABLE_PATH = Path("../dependencies/chromedriver.exe")
  else:
      logging.critical("No Chromedriver path is configured for platform %r", PLATFORM_SYSTEM)
      exit()
  OUTPUT_DIR = "../Data/raw/"
output_file = OUTPUT_DIR + "articles_output.csv"
os.makedirs(OUTPUT_DIR, exist_ok=True)

Notebook is running in colab
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
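
The choice of output directory can also be written as a small function, which makes it easy to check without mounting Drive. This is only a sketch: `resolve_output_dir` is a hypothetical helper, and it reuses the same heuristic as the cell above, treating the presence of `COLAB_GPU` in the environment as a sign that the notebook is running in Colab.

```python
from pathlib import Path


def resolve_output_dir(environ,
                       colab_dir="./drive/My Drive/Data/raw/",
                       local_dir="../Data/raw/"):
    """Return the output directory for the current environment.

    `environ` is any mapping of environment variables, for example os.environ.
    This helper is hypothetical and only mirrors the logic of the cell above.
    """
    # Same heuristic as the configuration cell: look for COLAB_GPU.
    if "COLAB_GPU" in environ:
        return Path(colab_dir)
    return Path(local_dir)


colab_path = resolve_output_dir({"COLAB_GPU": "0"})
local_path = resolve_output_dir({})
output_path = local_path / "articles_output.csv"
```

Passing the environment in as an argument, rather than reading `os.environ` inside the function, keeps the decision testable on any machine.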


## Parameters

In [52]:
"""This cell defines global variables and parameters used throughout the playbook
"""

# Set this to True if you want to watch Selenium scrape pages
WATCH_SCRAPING = True

# Set this to True if you want to use incognito mode
USE_INCOGNITO = True

# Directory the raw data is written to (unused here; OUTPUT_DIR is set in the configuration cell)
# RAW_DATA_DIRECTORY = Path("../data/raw/")

# Setup logging level
LOGGING_LEVEL = logging.INFO 
logging.basicConfig(level=LOGGING_LEVEL)
start_date = "02/02/2021" # (datetime.today() - timedelta(days = 1)).strftime('%m/%d/%Y')
end_date = "02/03/2021" # datetime.today().strftime('%m/%d/%Y')
lang = "en"
search_terms = "biden"
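
Since `start_date` and `end_date` are plain strings in `MM/DD/YYYY` form, a typo would only surface later, during the search. A minimal sketch of an early check is shown below; `validate_date_range` and `DATE_FORMAT` are hypothetical names introduced for illustration.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # MM/DD/YYYY, the format used by start_date and end_date


def validate_date_range(start, end):
    """Parse two MM/DD/YYYY strings and check that start is not after end."""
    # strptime raises ValueError if a string does not match DATE_FORMAT.
    start_dt = datetime.strptime(start, DATE_FORMAT)
    end_dt = datetime.strptime(end, DATE_FORMAT)
    if start_dt > end_dt:
        raise ValueError(f"start date {start} is after end date {end}")
    return start_dt, end_dt


start_dt, end_dt = validate_date_range("02/02/2021", "02/03/2021")
span_days = (end_dt - start_dt).days
```

With the values from the parameters cell, the range covers a single day.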

## Functions and Classes

I did not use most of these, but have left them in case they are of use later.

In [28]:
"""This cell defines functions and classes used throughout the playbook
"""

# APIs
import requests

import random
user_agent_list = [
   #Chrome
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36','Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36','Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36','Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36','Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36','Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36','Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36','Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    # Internet Explorer (MSIE / Trident)
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)','Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko','Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)','Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko','Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko','Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko','Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)','Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko','Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)','Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko','Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)','Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)', 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
]


# logging
import logging
log = logging.getLogger(__name__)

# general python 
import time
import sys
import re
import json
from datetime import datetime, timedelta
import csv
columns = ['url', 'body', 'title', 'summary', 'keywords', 'authors3k', 'pubdate', 'quotes', 'rel_sents']

# nlp: download language package outside of shell before running
# python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")

# multithreading
import threading
import queue
global lck 
lck = threading.Lock()

# start forreal



def collect_urls(query, p_results, date_start, date_end, lang):
    results_list = []
    googlenews = GoogleNews(lang=lang, start=date_start, end=date_end) # format: '07/24/2020'
    googlenews.search(query)
    for i in range(p_results):
        results_list.extend(googlenews.result())
        googlenews.getpage(i+2)
    
    if False:
        skips = ['videos'] # sites that don't contain article content
        results_list = [item for item in results_list if not any([skip for skip in skips if skip in item])]
    
    log.info('links to attack: {}'.format(results_list))
    return results_list


class Worker(threading.Thread):


    def __init__(self, q, i, *args, **kwargs):
        self.q = q
        self.i = i
        super().__init__(*args, **kwargs)
        
        
    def run(self):
        while True:
            try:
                j, url, keywords = self.q.get(timeout=4)  # 4s timeout
                i = self.i
            except queue.Empty:
                return
            
            article_dict = {}
            body = ''
            try:
                article_dict['url'] = url
                log.info('[t{}] {}- processing\n\tlink: {}'.format(i, j, url))
                # randomize user agents
                user_agent = random.choice(user_agent_list)
                headers = {'User-Agent': user_agent}
                article = requests.get(url, headers=headers)
                article_orig = article
                # newspaper3k processing
                article = Article(url)
                article.download()
                article.parse()
                article.nlp()
                
                article_dict['title'] = article.title
                article_dict['summary'] = article.summary
                article_dict['keywords'] = article.keywords
                article_dict['authors3k'] = article.authors
                article_dict['pubdate'] = article.publish_date
                body = ''
                try:
                    body = article.text
                except:
                    body = 'fail'

                try:
                    html_body = article_orig.text
                    body = fulltext(html_body)
                except:
                    body = 'fail'
                    
                article_dict['body'] = body
                
            except:
                log.error('\n\n[t{}] {}- error: \n\turl: {}\n\tdetails [line {}]:{}\n\n'.format(i, j, url, sys.exc_info()[-1].tb_lineno, sys.exc_info()[0]))
            
            # extract quotes from body, remove stuff like l & r quotes first
            body = unidecode(body)
            if len(body) > 5:
                # extract quotes
                quotes = []
                terms = ['said', 'say', 'state', 'argue', 'told', 'wrote', 'writ', 'tweet', 'announc']
                for term in terms:
                    r = re.compile(
                        r'''(?:"(?P<quote>[^"]+)"\W+(?i:{0}\w*)(?P<speaker>(?:\s(?:he|she|they|[A-Z]+[a-z.]*))+)(?:\s|(?:,(?P<title>(?:(?:\s[\w\']+))+)[,.])))|(?:"(?P<quote1>[^"]+)"(?P<speaker1>(?:\s(?:he|she|they|(?:[A-Z]+[a-z.]*\s)+))+)(?:\s?|(?:,(?P<title1>(?:(?:\s[\w\']+))+),))(?i:{0}\w*))|(?:(?P<speaker2>(?:he|she|they|(?:[A-Z]+[a-z.]*\s)+))(?:\s?|(?:,(?P<title2>(?:\s[\w\']+)+),\s))(?i:{0}\w*)\s(?:\w+\s)*"(?P<quote2>[^"]+)")|(?:(?i:{0}\w*)(?P<speaker3>[^"]+)"(?P<quote3>[^"]+)"\.)'''.format(term))
                    re_dicts = [m.groupdict() for m in r.finditer(body)]
                    for re_dict in re_dicts:
                        clean_dict = {k: v for k, v in re_dict.items() if v}
                        new_dict = {}
                        for k, v in clean_dict.items():
                            new_dict[re.sub(r'\d+', '', str(k))] = v
                        quotes.append(new_dict)
                # extract relevant sentences "about" the subject
                sentences = []
                if len(keywords) > 0:
                    for sentence in body.split('.'):
                        doc = nlp(sentence)
                        if any([keyword for keyword in keywords for token in doc if keyword in token.text.lower() and token.dep_ in ['dobj','attr']]):
                            sentences.append(sentence)
                if len(sentences) == 0:
                    sentences = 'none'
                        
                # clean out empty regex returns
                # quotes = [[item for item in tup if len(item)>0] for tup in quotes] # for tuples
                # clean out html junk; obsolete since using newspaper3k to grab body
                # body = re.sub(r'#[\w\-\s.:()]+{[^}]+}', '', body)
                article_dict['quotes'] = json.dumps(quotes)
                article_dict['rel_sents'] = sentences
                log.info('[t{}] {}- success\n\turl: {}'.format(i, j, url))
                
            else:
                log.error('[t{}] {}- unsuccessful: \n\turl: {}'.format(i, j, url))
                for kee in columns:
                    if kee not in article_dict: article_dict[kee] = 'fail'
                
            with lck:
                with open(output_file, 'a', encoding='utf-8-sig', newline='') as g:
                    csv.DictWriter(g, fieldnames=columns).writerow(article_dict)
            log.info('[t{}] {}- written\n\tlink: {}'.format(i, j, url))
            self.q.task_done()


def process(query=None, urls=[], keywords = [], n_threads: int=40, p_results=1, 
             date_start=(datetime.today() - timedelta(days = 1)).strftime('%m/%d/%Y'), date_end=datetime.today().strftime('%m/%d/%Y'), lang='en'):
    log.info('article scraping start time: {}'.format(datetime.now().strftime("%Y-%m-%d-%H.%M.%S")))
    start_time = time.time()
    
    if (query and len(urls)>0) or (not query and len(urls)==0):
        log.error('provide query or urls; not both or neither')
        sys.exit(1)
    
    if query:
        log.info('query: {}'.format(query))
        urls = collect_urls(str(query), int(p_results), date_start, date_end, lang)
        if len(urls) ==0:
            log.error('no news results found')
            sys.exit(1)
        
    log.info('URL count: {}'.format(len(urls)))
    # Always start at least one worker; zero workers would leave q.join() waiting forever
    if len(urls) < n_threads: n_threads = max(1, round(len(urls)/4))

    with open(output_file, 'w',encoding='utf-8-sig', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=columns)
        writer.writeheader()
    
    q = queue.Queue()
    for k, result in enumerate(urls):
        q.put_nowait((k, result['link'], keywords))
    for _ in range(n_threads):
        Worker(q, _).start()
        time.sleep(1)
    q.join()
    
    log.info('article scraping finished. end time: {}'.format(datetime.now().strftime("%Y-%m-%d-%H.%M.%S")))
    log.info('article scraping completed in {}'.format(timedelta(seconds=int(time.time() - start_time))))


import pandas as pd
import numpy as np
from ast import literal_eval
import re
from textblob import TextBlob


def literal_return(val):
    try:
        return literal_eval(val)
    except (ValueError, SyntaxError) as e:
        return val


def position_process(keywords = None):
    articles_df = pd.read_csv(output_file)
    articles_df.quotes = articles_df.quotes.apply(lambda x: literal_return(str(x)))


    # split quotes into columns
    quote_df = articles_df[['url', 'quotes']]
    quote_df = quote_df.quotes.apply(pd.Series) \
        .merge(quote_df, left_index = True, right_index = True) \
        .drop(["quotes"], axis = 1) \
        .melt(id_vars = ['url'], value_name = "quote") \
        .dropna() \
        .drop("variable", axis = 1) \
        .reset_index(drop=True)
        
    # to explode dictionary
    # quote_df = quote_df.rename(columns={'quote':'quote_dict'})
    # quote_df.join(quote_df.quote_dict.apply(pd.Series))


    # split sentences into columns
    sent_df = articles_df[['url', 'rel_sents']]
    sent_df = sent_df.rel_sents.apply(pd.Series) \
        .merge(sent_df, left_index = True, right_index = True) \
        .drop(["rel_sents"], axis = 1) \
        .melt(id_vars = ['url'], value_name = "rel_sents") \
        .dropna() \
        .drop("variable", axis = 1) \
        .reset_index(drop=True)
        
    # select only the split columns' rows if others cannot be filtered out
    # quote_df = quote_df[quote_df.variable.apply(lambda x: isinstance(x, (int)))]

    # get rid of fails
    quote_df = quote_df[quote_df.quote != 'fail']

    # get rid of empties
    sent_df = sent_df[sent_df.rel_sents != 'none']

    def clean_keys(dict_cur):
        new_dict = {}
        for k, v in dict_cur.items():
            new_dict[re.sub(r'\d+', '', str(k))] = v
        return new_dict

    quote_df.quote = quote_df.quote.apply(clean_keys)

    # source: https://medium.com/swlh/simple-sentiment-analysis-for-nlp-beginners-and-everyone-else-using-vader-and-textblob-728da3dbe33d
    # setup:
    # pip install -U textblob
    # python -m textblob.download_corpora

    # set sentiment column when keywords appear in the quote
    quote_df['position'] = np.nan
    # Score every quote; to restrict scoring to quotes that mention a keyword, add:
    #   if not keywords or any(keyword in x['quote']['quote'].lower() for keyword in keywords) else x['position']
    quote_df['position'] = quote_df.apply(
        lambda x: TextBlob(x['quote']['quote']).sentiment.polarity, axis=1)
    quote_df['speaker'] = quote_df['quote'].apply(lambda x: x['speaker'].strip())
    quote_df['quote'] = quote_df['quote'].apply(lambda x: x['quote'].strip())
    quote_df.to_csv('quotes.csv')
    # or sentence
    sent_df['position'] = np.nan
    sent_df['position'] = sent_df.apply(lambda x: TextBlob(x['rel_sents']).sentiment.polarity \
    if not keywords or any([keyword for keyword in keywords if keyword in x['rel_sents'].lower()]) \
    else x['position'], axis=1)
    # quote_df['speaker'] = quote_df['rel_sents'].apply(lambda x: x['speaker'])
    # quote_df['rel_sents'] = quote_df['rel_sents'].apply(lambda x: x['rel_sents'])
    sent_df.to_csv('sentences.csv')

    # TODO: quote_mean_df = quote_df.groupby(['url']).mean()
    articles_df.to_csv('position_results.csv')
    """
    quote_mean_df = quote_df.groupby(['url'])
    try: 
        sent_mean_df = sent_df.groupby(['url']).mean()
        df_means = pd.merge(quote_mean_df, sent_mean_df, on=['url'])
        df_means['pos_mean'] = df_means.mean(axis=1)
        
    except: 
        df_means = quote_mean_df
        
    
    df_final = pd.merge(articles_df, df_means, on=['url'])
    df_final.to_csv('position_results.csv')
    """

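The quote extraction in `Worker.run` relies on one large regular expression with several alternatives. To make the idea easier to follow, here is a deliberately simplified sketch that handles only one shape: a double-quoted passage, a reporting verb, then a capitalised name. `QUOTE_PATTERN` and `extract_quotes` are hypothetical names, and this pattern is not equivalent to the one used above.

```python
import re

# Simplified pattern: "quoted text", a reporting verb, then a capitalised name.
# It covers far fewer cases than the pattern in Worker.run.
QUOTE_PATTERN = re.compile(
    r'"(?P<quote>[^"]+)"\W+(?:said|told|wrote)\s'
    r'(?P<speaker>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)'
)


def extract_quotes(text):
    """Return a list of {'quote': ..., 'speaker': ...} dicts found in text."""
    return [match.groupdict() for match in QUOTE_PATTERN.finditer(text)]


# Made-up sentence for illustration only.
sample = '"We will act quickly," said Jane Doe. The meeting ended at noon.'
found = extract_quotes(sample)
```

As in the original code, the text should have curly quotes converted to straight ones first (the notebook uses `unidecode` for this), because the pattern only matches the plain `"` character.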
In [53]:
googlenews = GoogleNews(lang=lang, start=start_date, end=end_date) # format: '07/24/2020'
googlenews.search(search_terms)

In [54]:
results = googlenews.results()

In [55]:
df = pd.DataFrame.from_records(data=results)

In [56]:
df.to_csv(output_file)
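
Search results may contain the same link more than once. The sketch below shows one way to drop repeats before writing the CSV, using only the standard library. `dedupe_by_link` and the sample records are made up for illustration; the field names are taken from the header of the CSV this notebook produces.

```python
import csv
import io

# Column names as they appear in the header of articles_output.csv.
FIELDS = ["title", "media", "date", "datetime", "desc", "link", "img"]


def dedupe_by_link(records):
    """Keep the first record for each distinct 'link' value, preserving order."""
    seen = set()
    unique = []
    for record in records:
        link = record.get("link")
        if link in seen:
            continue
        seen.add(link)
        unique.append(record)
    return unique


# Made-up records for illustration; real ones come from googlenews.results().
records = [
    {"title": "A", "link": "https://example.com/a"},
    {"title": "A (repeat)", "link": "https://example.com/a"},
    {"title": "B", "link": "https://example.com/b"},
]
unique_records = dedupe_by_link(records)

# Write to an in-memory buffer; missing fields are filled with empty strings.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="")
writer.writeheader()
writer.writerows(unique_records)
csv_text = buffer.getvalue()
```

With a DataFrame, `df.drop_duplicates(subset="link")` would achieve a similar result before calling `df.to_csv`.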

# Collect Data

## Collect Newspaper Articles

# Conclusion

In [49]:
!head ./drive/My\ Drive/Data/raw/articles_output.csv

,title,media,date,datetime,desc,link,img
0,"FACT SHEET: President Biden Outlines Steps to Reform Our Immigration System by Keeping Families Together, Addressing the Root Causes of Irregular Migration, and Streamlining the Legal Immigration System",,1 day ago,2021-02-02 20:39:50.560515,"President Biden's strategy is centered on the basic premise that our country is safer, stronger, and more prosperous with a fair, safe and orderly immigration system ...",https://www.whitehouse.gov/briefing-room/statements-releases/2021/02/02/fact-sheet-president-biden-outlines-steps-to-reform-our-immigration-system-by-keeping-families-together-addressing-the-root-causes-of-irregular-migration-and-streamlining-the-legal-immigration-syst/,"data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
1,Biden's Executive Orders: President to Sign 3 Rolling Back Trump's Immigration Agenda,The New York Times,th · 1 day ago,2021-02-02 20:39:50.662227,"As a candidate, Mr. Biden vowed to o

In [50]:
"""Add post-processing steps here
"""

# Clean up the environment
# driver.quit()

'Add post-processing steps here\n'