# Automated News Update, or, "Dwyer's attempt at automating himself out of a job"

### Sources:
* For the **requests** package: https://stackoverflow.com/questions/25067580/passing-web-data-into-beautiful-soup-empty-list
* Stanford course on web scraping, notes: http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html
* BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* Selenium: https://stackoverflow.com/questions/13960326/how-can-i-parse-a-website-using-selenium-and-beautifulsoup-in-python
* Digging through iframes: https://stackoverflow.com/questions/7534622/selecting-an-iframe-using-python-selenium https://stackoverflow.com/questions/23028664/python-beautifulsoup-iframe-document-html-extract

# Change Log
* 8/29/2018: Added Citylab, Electrek, cleaned code
* 8/7/2018: Added Transport Reviews to academic paper scraper
* 7/30/2018: Fixed GovTech scraper
* 6/29/2018: Changed the whole scraper over to utilize a new class called *scraypah*. 
* 5/12/2018: Added Semiconductor Engineering scraper and academic articles scraper (~3 hours)
* 4/13/2018: Integrated Word document production through Python
* 3/19/2018: Added OEM/Gov section that quickly checks 17 sites for updates - only prints a notification that it needs to be checked if there are new updates from the past week
* 2/27/2018: Wrote a function *page_scan* to more efficiently create the relevant web page dictionary "profiles"
* 2/27/2018: Added 21CTP trucking news keywords to search for. Integrated functionality into existing web scraper.
* 2/14/2018: Added NGV Global scraper for AFV stuff
* 2/14/2018: Added fuel cells, hybrid, hybrid-electric, 'electric buses', 'electric truck', 'electric trucks', 'electric drive' to the search terms for AFVs...
* 1/31/2018: Added *print_results* function to streamline printed results for each scraper. Added counter to track #articles that were too old. Added meta-data tracking capability (dumps into SQL database every week)
* 1/31/2018: Split EV market analysis and web scraper into two different Notebooks
* 1/26/2018: Added Lexology scraper
* 1/19/2018: Fixed GreenCarCongress scraper (site redesign)
* 1/4/2018: Added Engadget scraper
* 1/4/2018: Added "replace_em" function to streamline removal of meaningless substrings from body text summaries
* 12/29/2017: Added Reuters, MITNews, and Ars Technica scrapers. Did some streamlining in the EV Sales analysis
* 12/20/2017: Wrote up quick-guide to all the post-Python processing needed for the final News Update doc.
* 12/20/2017: Changed to .xls format. Had to import a different package to do so, but makes mail merge work better
* 12/13/2017: Fixed Trucks.com scraper - was pulling out the wrong date for each article (pulled a date from the sidebar...)
* 12/8/2017: Edited Trucks.com search so that it doesn't pick up paragraph tags that are actually image captions (added condition that "class = None")
* 12/8/2017: Added a bunch of comments, specifically in the first code segment ("IEEE Spectrum") for explanatory purposes

# To Do
* Add all relevant companies to keywords
* Functionality to add news article to SQL database after the fact
* Auto open word doc in gen_docx
* MORE AUTOMATION
* Fix Reuters, Quartz
* Add Nikkei: https://asia.nikkei.com/
* Add Phys.org
* Add Jalopnik
* Add some debugging assistance of some kind
    * Quick "request/bs4 of a url"
* Truck News
* Transport Policy
* [Journal of Modern Transportation](https://link.springer.com/journal/40534)

## Info

html.parser - BeautifulSoup(markup, "html.parser")

* Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.2)

* Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

lxml - BeautifulSoup(markup, "lxml")

* Advantages: Very fast, Lenient

* Disadvantages: External C dependency

html5lib - BeautifulSoup(markup, "html5lib")

* Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5

* Disadvantages: Very slow, External Python dependency
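
As a rough illustration of the differences listed above, the sketch below feeds the same malformed snippet to each parser and prints what comes back. The snippet `<a></p>` and the helper name `compare_parsers` are illustrative choices, not part of the scraper. `lxml` and `html5lib` are optional installs, so the loop reports any parser that is missing instead of failing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken_markup = "<a></p>"  # illustrative malformed snippet, not taken from any scraped site

def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: repaired markup}, or None for parsers that are not installed."""
    results = {}
    for parser in parsers:
        try:
            results[parser] = str(BeautifulSoup(markup, parser))
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing
            results[parser] = None
    return results

parser_results = compare_parsers(broken_markup)
for parser, repaired in parser_results.items():
    print(f"{parser:12} -> {repaired if repaired is not None else 'not installed'}")
```

Each parser repairs the stray `</p>` differently, which is why a query written against one parser's tree can come back empty under another.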

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

# Import packages, define important stuff

In [None]:
from bs4 import BeautifulSoup
from bs4 import SoupStrainer

from selenium import webdriver
import requests
import datetime as dt
import certifi
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
import docx
from docx.enum.text import WD_COLOR_INDEX
from docx.shared import Pt
import time

import win32com.client as win32
import os

tick_fontsize = 12
axislabel_fontsize = 20
title_fontsize = 25
legend_fontsize = 15
plt.rcParams.update({'legend.fontsize':15, 'xtick.labelsize': 15, 'ytick.labelsize': 15,
                    'figure.titlesize': 30, 'axes.labelsize':25, 'xtick.major.pad': 8, 'ytick.major.pad':8, 
                    'axes.titlepad': 0, 'axes.labelpad': 15})
sns.set(rc={"figure.figsize": (12, 9)}, style='ticks', context='poster')


In [None]:
# import sys
# !conda install -c conda-forge --yes --prefix {sys.prefix} python-docx

In [None]:
cav_keywords = ['self-driving','automated', 'self driving', 'autonomous', 'MaaS', 'ride-sharing', 'ridesharing', 'ride-hailing', 
                'ridehailing', 'lidar', 'LiDAR','rideshare', 'ridehail', 'ride-hail', 'ridesource', 'ride-source', 'ride-sourcing',
                'carsharing', 'car-sharing', 'carshare', 'car-share', 'Uber', 'Lyft', 'Chariot', 'connected car']
afv_keywords = ['rare-earth', 'rare earth', 'natural gas', 'electric vehicles', 'electric vehicle', 'electric car', 'EV', 'electrification', 'alternative fuel', 'CNG', 'LNG',
                'alt-fuel', 'propane', 'charging stations', 'EVSE', 'electric vehicle charging', 'HEV', 'hybrid', 'hybrid-electric', 'plug-in', 'PHEV', 'electric motor',
               'bio-fuel', 'biofuel', 'idle reduction', 'fuel cell', 'electric bus', 'electric buses', 'electric truck', 'electric trucks', 'electric drive',
               'battery-electric', 'battery electric', 'battery-electric-powered']
truck_keywords = ['alternative fuels', 'natural gas', 'compressed natural gas', 'liquefied natural gas', 'CNG', 'LNG', 'propane', 'LPG', 'dimethyl ether', 'DME', 'electric', 'electricity', 'electrified', 'electric drive', 
                  'battery', 'energy storage', 'hydrogen', 'fuel cell', 'hybrid', 'hybrid electric', 'hybrid hydraulic', 'Phase 2', 'Phase II', 'efficiency', 'fuel efficiency', 'fuel economy', 'aftertreatment',
                  'emission control', 'diesel particulate filter', 'DPF', 'selective catalytic reduction', 'SCR', 'aerodynamics', 'sustainability', 'waste heat recovery', 'Rankine', 'organic Rankine', 'SuperTruck', 
                  'automated manual', 'AMT', 'platooning', 'lithium', 'biofuel', 'fast charging', 'downspeed', 'downsize', 'clean diesel', 'turbocompound', 'rolling resistance', 'skirt', 'boat tail', 'axle', 'low viscosity',
                  'catenary', 'autonomy', 'autonomous', 'connected and autonomous', 'connected', 'telematics', 'driver assist', 'CACC', 'active cruise control', 'crash avoidance', 'crashworthiness', 'weigh-in-motion', 'weigh in motion', 
                  'high productivity', 'truck size and weight', 'V2I', 'V2V', 'vehicle to infrastructure', 'vehicle to vehicle', 'restructuring', 'acquisition', 'fuel cost', 'driver cost', 'operational efficiency', 'facility', 
                  'facilities', 'proving ground', 'partnership', 'regional haul', 'joint venture', 'grant', 'FOA', 'funding opportunity', 'unveil', 'announce', 'offer', 'expansion', 'greenhouse gas', 'GHG', 'emission regulation', 
                  'emissions regulation', 'idle', 'idling', 'zero emissions', 'strategic plan', 'SmartWay', 'VIUS', 'well to wheels', 'pump to wheels', 'well to pump', 'CARB', 'CEC', 'air resources board', 'energy commission', 'EPA', 
                  'Environmental Protection Agency', 'smart mobility', 'smart cities']
hyperloop_keywords = ['hyperloop', 'high-speed train', 'high speed train', 'bullet train']

# Used for diagnostics/tracking later
scrape_specs = {}

# Age filter, in days
max_age = 7

# For file naming and tracking
search_date = str(dt.date.today())
print('Time of search: {}'.format(search_date))

# Needed for web scraping "browser"
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

# For database update; ensures duplicates aren't loaded
db_update = False

# driver = webdriver.Firefox(executable_path='geckodriver64.exe')
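
The `max_age` cutoff is applied later by comparing each article's publication date with today's date. A minimal sketch of that comparison follows, assuming dates arrive as `datetime.date` objects. The helper name `is_recent` and its optional `today` argument are illustrative additions, not part of the notebook.

```python
import datetime as dt

max_age = 7  # age filter, in days (same value as above)

def is_recent(published, today=None, max_age=max_age):
    """Return True if `published` is no more than `max_age` days before `today`."""
    today = today or dt.date.today()
    return (published - today).days >= -max_age

reference_day = dt.date(2018, 8, 29)
print(is_recent(dt.date(2018, 8, 22), today=reference_day))  # exactly 7 days old -> True
print(is_recent(dt.date(2018, 8, 21), today=reference_day))  # 8 days old -> False
```

Passing `today` explicitly makes the check repeatable when re-running an old search.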

In [None]:
scraped_count=0
skip_count=0
too_old=0
iteration=0
skip_ind = []
old_ind = []
def reset_trackers():
    ''' Resets tracking metrics - executed before each scraper is run '''
    global scraped_count
    scraped_count=0
    global skip_count
    skip_count=0
    global too_old
    too_old=0
    global iteration
    iteration=0
    global skip_ind
    skip_ind = []
    global old_ind
    old_ind = []
def replace_em(text):
    '''Replaces odd characters in text. Used for page titles and summaries'''
    bad_chars = ['â€œ', 'â€™', 'â€�','\n', 'Â', 'â€”', '(earlier post)', 'â€?', '\t', 'â€œ']
    for bad_char in bad_chars:
        text = text.replace(bad_char,'')
    return text

def grab_homepage(url):
    '''Creates BeautifulSoup object using input url'''
    headers = {'user-agent': 'Mozilla/5.0'}
    page_1 = requests.get(url,headers=headers)
    return BeautifulSoup(page_1.content, "html5lib")

def print_results(site, scraped_count, skip_count, too_old, df, duration, scrape_specs):
    '''Prints out a quick summary of one website's full scraping and adds summary specs to scrape_specs dictionary'''
    print('{} {} article(s) scraped'.format(scraped_count,site)) 
    print('{} {} article(s) skipped due to error'.format(skip_count,site))  #(see urls_to_scrape[skip_ind+1])
    print('{} {} article(s) skipped due to age'.format(too_old,site)) #(see urls_to_scrape[old_ind+1])
    print('{} relevant article(s) collected'.format(df.shape[0])) 
    scrape_specs[f"{site}"] = {'Pages Scraped': scraped_count, 'Relevant Articles': df.shape[0], 'Errors': skip_count, 
                               'Too old': too_old, 'Time spent':duration}
    return scrape_specs

def page_scan(title, summary, url, date, source):
    '''Searches a web page title and summary for keywords; returns the dictionary object that is used to create the final dataframe'''
    # Hyperloop, CAV, and AFV keywords are matched case-sensitively (so 'EV' does not match 'every')
    hyperloop_bool = int(any(keyword in title or keyword in summary for keyword in hyperloop_keywords))
    cav_bool = int(any(keyword in title or keyword in summary for keyword in cav_keywords))
    afv_bool = int(any(keyword in title or keyword in summary for keyword in afv_keywords))
    # Truck keywords are matched case-insensitively, and only count when the same text also mentions trucks.
    # Keywords are lowercased here because entries such as 'CNG' or 'SuperTruck' could never match lowercased text.
    truck_bool = 0
    for text in (title.lower(), summary.lower()):
        if 'truck' in text and any(keyword.lower() in text for keyword in truck_keywords):
            truck_bool = 1
    if afv_bool or cav_bool or truck_bool or hyperloop_bool:
        return {'title':title, 'summary':summary, 'link':url, 'source':source, 'date':date, 'AFV':afv_bool, 'CAV':cav_bool, 
                '21CTP':truck_bool, 'Hyperloop':hyperloop_bool}
    else:
        # Sentinel value that scrape_em checks for
        return 'Most definitely nope'

### The following two functions are for the Word document output!
def add_hyperlink(paragraph, url, text):
    '''
    :param paragraph: The paragraph we are adding the hyperlink to.
    :param url: A string containing the required url
    :param text: The text displayed for the url
    :return: The hyperlink object
    '''
    # This gets access to the document.xml.rels file and gets a new relation id value
    part = paragraph.part
    r_id = part.relate_to(url, docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, is_external=True)

    # Create the w:hyperlink tag and add needed values
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    hyperlink.set(docx.oxml.shared.qn('r:id'), r_id, )

    # Create a w:r element
    new_run = docx.oxml.shared.OxmlElement('w:r')

    # Create a new w:rPr element
    rPr = docx.oxml.shared.OxmlElement('w:rPr')
    
    # bold the text
    u = docx.oxml.shared.OxmlElement('w:b')
#     u.set(docx.oxml.shared.qn('w:val'), 'single')
    rPr.append(u)

    # Join all the xml elements together and add the required text to the w:r element
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)

    paragraph._p.append(hyperlink)

    return hyperlink

CA_nums = 'NEED TO INSERT'
def gen_docx(newstype, dwyer=True, CA_nums = CA_nums):
    '''
    Generates news Word doc using data file from web scrape
    :param newstype: Either "21CTP", "CAV", or "AFV"
    :param dwyer: If not running on Dwyer's computer, set this to False and put all needed files in the same directory
    :param CA_nums: Input string for the CA EVSE numbers (automatically populates the caption for the EVSE bar chart figure)
    '''
    
    # select data file (xls) based on the newstype and date. Note that search_date is a global variable defined outside
    # of this function. Each news update only happens once a week --> only one xls file per newstype per week --> can't just
    # pick any old search_date and make a file.
    if dwyer:
        data_file = f"{newstype.lower()}_news_updates/{search_date}_{newstype}_news_download.xls"    # Name of the excel file (standardized)
    else:
        data_file = f"{search_date}_{newstype}_news_download.xls"
    
    # Read the data in from the selected file
    df = pd.read_excel(data_file)
    df = df.reset_index(drop=True).T.to_dict()
    
    # Start creating the word doc
    newsdoc = docx.Document(docx='python_docx.docx')
    
    # Add up-front stuff - title, headers, and for the AFV update, some other stuff (two captions and some text)
    if newstype == 'AFV':
        newsdoc.add_heading(f"Alternative Fuel Vehicle Weekly News Update – {dt.date.today().strftime('%m/%d/%Y')}",0)
        newsdoc.add_heading('EVSE Market Analysis', 1)
        evse_bar_chart = newsdoc.add_paragraph().add_run('INSERT EVSE BAR CHART HERE')
        evse_bar_chart.font.bold = True
        evse_bar_chart.font.size=Pt(16)
        evse_bar_chart.font.highlight_color = WD_COLOR_INDEX.YELLOW
        newsdoc.add_paragraph('Figure: Number of EVSE plugs (note: not stations) by state and charging level. ' 
                              'CA is not included, since it would make the rest of the state numbers illegible. ' 
                              f"CA holds a disproportionately large share of the total EVSE plugs: {CA_nums} "
                              'of Level 1, Level 2, and DCFC plugs respectively. Data Source: U.S. DOE AFDC Station Locator.',
                             style='Caption')
        newsdoc.add_paragraph(' ')
        newsdoc.add_paragraph('The table below summarizes overall changes in number of EV charging stations by state between '
                              f"{(dt.date.today() - dt.timedelta(7)).strftime('%m/%d/%Y')} and {dt.date.today().strftime('%m/%d/%Y')}:",
                             style='Normal')
        newsdoc.add_paragraph('Table 1: Change in number of EV charging stations by state, between '
                              f"{(dt.date.today() - dt.timedelta(7)).strftime('%m/%d/%Y')} and {dt.date.today().strftime('%m/%d/%Y')}",
                             style='Caption')
        evse_bar_chart = newsdoc.add_paragraph().add_run('INSERT EVSE DELTA TABLE HERE')
        evse_bar_chart.font.bold = True
        evse_bar_chart.font.size=Pt(16)
        evse_bar_chart.font.highlight_color = WD_COLOR_INDEX.YELLOW    
    
    if newstype == 'CAV':
        newsdoc.add_heading(f"Connected and Automated Vehicle Weekly News Update – {dt.date.today().strftime('%m/%d/%Y')}",0)
        
    if newstype == '21CTP':
        newsdoc.add_heading(f"21CTP Trucking Weekly News Update – {dt.date.today().strftime('%m/%d/%Y')}",0)
    
    for header in ['Business and Market Analysis','Technology, Testing, and Analysis','Policy and Government']:
        newsdoc.add_heading(header, 1)
        newsdoc.add_paragraph('')
    # Add all of the actual news items
    for row in df:
        row = df[row]
        newsdoc.add_heading(row['title'],level=2)
        p = newsdoc.add_paragraph(row['summary'] + ' ')
        p.add_run('(')
        add_hyperlink(p, '{}'.format(row['link']), '{}'.format(row['source']))   # This is where the add_hyperlink function is used
        p.add_run(')')
    if newstype == 'CAV':
        newsdoc.add_heading('Relevant Transportation Research', 1)
        newsdoc.add_paragraph('This section includes publications, papers, articles, and conferences that investigate and/or '
                              'discuss transportation and travel demand impacts of MaaS or other “future travel” considerations. '
                              'Portions of the abstract or description (not my words) are included under each title for more information.')
    if dwyer:
        newsdoc.save(f"{newstype.lower()}_news_updates/Energetics {newstype} News Update - {search_date}.docx")
    else:
        filename = f"Energetics {newstype} News Update - {search_date}.docx"
        newsdoc.save(filename)
        
def which_keyword_found(row):
    '''
    Identifies and stores which keywords triggered the news item pull
    '''
    words_found = []
    for keyword in cav_keywords + afv_keywords:
        try:
            # Use "in" rather than find() > 0, which misses a keyword at the very start of the text
            if (keyword in row['summary']) or (keyword in row['title']):
                words_found.append(keyword)
        except (KeyError, TypeError):
            # Skip rows with a missing or non-string title/summary
            continue
    return ', '.join(words_found)

def keyword_pull(string):
    '''
    Pulls all capitalized words out of the title, as a quick "keyword" list
    
    https://stackoverflow.com/questions/13205343/code-to-detect-all-words-that-start-with-a-capital-letter-in-a-string
    '''
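    try:
        words = string.lstrip().split(' ')
    except AttributeError:
        # Non-string input (e.g. NaN from a missing title)
        return np.nan
    skip_words = {'The', 'This', 'I'}
    # Empty strings (from repeated spaces) are dropped before checking the first character
    keywords = [word for word in words if word and word[0].isupper() and word not in skip_words]
    return ', '.join(keywords)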


### Testing class-based scraper
Quartz doesn't work

Bloomberg doesn't allow scraping

Reuters doesn't work

In [None]:
class scraypah:
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    
    def __init__(self, params):
        self.base_url = params['url']
        self.source = params['source']
        self.strainer = params['strain_bool']
        if self.strainer:
            self.strain_tag = params['strain_tag']
            self.strain_attr_type = params['strain_attr_type']
            self.strain_attr = params['strain_attr']
        self.date_loc = params['date_loc']
        self.date_format = params['date_format']
        self.sum_loc = params['sum_loc']
        self.title_loc = params['title_loc']
        self.url_list_query = params['url_list_query']
        
    def get_urls(self):
        '''Builds self.urls_to_scrape from the listing page(s) in base_url (one URL string or a list of URLs)'''
        self.urls_to_scrape = []
        single_page = isinstance(self.base_url, str)
        base_urls = [self.base_url] if single_page else list(self.base_url)
        for url in base_urls:
            page = requests.get(url, headers=self.headers)
            if not self.strainer:
                # Brief pause between requests (0.5 s for a single page, 1 s when looping over several)
                time.sleep(0.5 if single_page else 1)
                self.base_soup = BeautifulSoup(page.content, "lxml")
            else:
                # Parse only the tags that hold article links
                only_parse = SoupStrainer(self.strain_tag, attrs={self.strain_attr_type:self.strain_attr})
                self.base_soup = BeautifulSoup(page.content, "lxml", parse_only=only_parse)
            # url_list_query is a string of Python code that reads self.base_soup
            self.urls_to_scrape += eval(self.url_list_query)
        
    def scrape_em(self):
        self.relevant_articles = {}
        self.scraped_count = 0
        self.skip_count=0
        self.too_old=0
        self.iteration=0
        self.skip_ind = []
        self.old_ind = []
        for url in self.urls_to_scrape:
            self.iteration+=1
            try:
                page = requests.get(url, headers=self.headers)
                if self.source in ['Semiconductor Engineering', 'Reuters', 'Recode']:
                    article=BeautifulSoup(page.content, "html5lib")
                else:
                    article=BeautifulSoup(page.content, "lxml")
                # Strip non-breaking spaces and byline residue before parsing the date
                date = pd.to_datetime(eval(self.date_loc).strip().replace('\xa0','').replace(' -\nBy:',''), format=self.date_format).date()
                if (date - dt.date.today()).days >= -max_age:
                    if self.source == 'Autoblog':
                        try:
                            summary = eval(self.sum_loc)
                            summary = replace_em(summary[0].text+ ' '+summary[1].text+ ' '+summary[2].text)
                        except Exception:
                            # Fall back to the raw post body when fewer than three paragraph tags are found
                            summary = ' '.join(article.find('div', attrs={'class':'post-body'}).text.replace('\t','').replace('\n\n', '\n').split('\n')[1:4])
                    else:
                        summary = eval(self.sum_loc)
                        summary = replace_em(summary[0].text+ ' '+summary[1].text+ ' '+summary[2].text)
                    title = eval(self.title_loc).replace('â€™',"'").replace('\xa0',' ').replace('\n','').lstrip().replace('  ','')
                    temp = page_scan(title, summary, url, date, self.source)
                    if temp != 'Most definitely nope':
                        self.relevant_articles[self.scraped_count] = temp
                    self.scraped_count+=1
                else:
                    self.too_old += 1
                    self.old_ind.append(self.iteration-1)
            except Exception as exc:
                print(f"{str(exc)}: {url}")
                self.skip_count+=1
                self.skip_ind.append(self.iteration-1)
                continue
        self.relevant_df = pd.DataFrame.from_dict(self.relevant_articles).T
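
Each parameter dictionary passed to `scraypah` stores its lookups as strings of Python code, which `get_urls` and `scrape_em` run with `eval` against `self.base_soup` or `article`. The sketch below reproduces that mechanism on a hard-coded HTML snippet so it can be tried without network access. The snippet, the `example.com` links, and the helper `extract_links` are made up for illustration, and `html.parser` stands in for `lxml` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up listing page standing in for a news site's front page
listing_html = """
<div class="sidebar"><h2 class="ad"><a href="https://example.com/ad">Ad</a></h2></div>
<h2 class="entry-title"><a href="https://example.com/story-1">Story 1</a></h2>
<h2 class="entry-title"><a href="https://example.com/story-2">Story 2</a></h2>
"""

def extract_links(html, strain_tag, strain_attr_type, strain_attr, url_list_query):
    """Parse only the strained tags, then evaluate the query string against base_soup."""
    only_parse = SoupStrainer(strain_tag, attrs={strain_attr_type: strain_attr})
    base_soup = BeautifulSoup(html, "html.parser", parse_only=only_parse)
    # The query string refers to the name `base_soup`, so expose it to eval explicitly
    return eval(url_list_query, {}, {"base_soup": base_soup})

links = extract_links(
    listing_html,
    strain_tag="h2",
    strain_attr_type="class",
    strain_attr="entry-title",
    url_list_query="[item.a['href'] for item in base_soup.find_all('h2', class_='entry-title')]",
)
print(links)
```

Because `eval` executes whatever string it is given, this pattern is only reasonable for queries written by hand in the notebook itself. Storing callables (for example `lambda soup: [...]`) in the dictionary would be one way to get the same flexibility without `eval`.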

In [None]:
scraper_dict = {'MIT': {'url':'http://news.mit.edu/mit-news', 
                        'source': 'MIT',
                        'strain_tag':'ul', 
                        'strain_attr_type':'class', 
                        'strain_attr':'view-mit-news clearfix', 
                        'url_list_query':"['http://news.mit.edu'+item.a['href'] for item in self.base_soup.find('ul', class_='view-mit-news clearfix').find_all('li')]",
                        'date_loc': "article.find('span', attrs={'itemprop':'datePublished'}).text", 
                        'date_format':None,
                        'sum_loc': "article.find('div', attrs={'class': 'field-item even'}).find_all('p')",
                        'title_loc':"article.find('h1', attrs={'class':'article-heading'}).text", 
                        'strain_bool':True},
                'SemEng': {'url':'http://semiengineering.com/category-main-page-iot-security/', 
                          'source': 'Semiconductor Engineering', 
                          'strain_tag':'div', 
                          'strain_attr_type':'class', 
                          'strain_attr':'l_col', 
                          'url_list_query':"[item['href'] for item in self.base_soup.find('div', class_='l_col').find_all('a', href=True,title=True)]",
                          'date_loc': "article.find('div',class_='loop_post_meta').contents[0]", 
                          'date_format':None,
                          'sum_loc': "article.find('div', class_='post_cnt post_cnt_first_letter').find_all('p')[1:4]",
                          'title_loc':"article.find('h1', class_='post_title').text", 
                          'strain_bool':True},
                'Quartz': {'url':'https://qz.com/search/self-driving', 
                           'source': 'Quartz', 
                           'url_list_query':"[item.a['href'] for item in self.base_soup.find_all(class_='queue-article')]",
                           'date_loc': "article.find('span', attrs={'class':'timestamp'}).text", 
                           'date_format':None,
                           'sum_loc': "article.find_all('p')[:3]",
                           'title_loc':"article.find('h1').text", 
                           'strain_bool':False},
                'Recode': {'url':'https://www.recode.net/', 
                           'source': 'Recode', 
                           'strain_tag':'a', 
                           'strain_attr_type':'data-analytics-link', 
                           'strain_attr':'article', 
                           'url_list_query':"[item['href'] for item in self.base_soup.find_all('a', attrs={'data-analytics-link':'article'})]",
                           'date_loc': "article.time.text.replace('\\n', '')", 
                           'date_format':None,
                           'sum_loc': "article.find_all('p')",
                           'title_loc':"article.h1.text", 
                           'strain_bool':True},
                'GovTech': {'url':'http://www.govtech.com/fs/transportation/',
                            'source':'GovTech', 
                            'url_list_query':"[item.a['href'] for item in self.base_soup.find_all(class_=['sub-feature-article','feature-article'])]",
                            'date_loc':"article.find('span', class_='date').text.strip()", 
                            'date_format':None,
                            'sum_loc':"[item for item in article.find(class_='col-md-10').find_all('div') if len(str(item)) > 12] \
                                        if len([item for item in article.find(class_='col-md-10').find_all('p')]) < 3 \
                                        else [item for item in article.find(class_='col-md-10').find_all('p')]" ,
                            'title_loc': "article.find('h1').text.strip()", 
                            'strain_bool':False},
                'Reuters': {'url':'https://www.reuters.com/news/technology', 
                            'source': 'Reuters', 
                            'url_list_query':"[item.a['href'] for item in self.base_soup.find_all('h2', class_='headline_ZR_Fh')]",
                            'date_loc': "article.find('div', attrs={'class':'date_V9eGk'}).text.split('/')[0]", 
                            'date_format':None,
                            'sum_loc': "article.find('div', attrs={'class':'body_1gnLA'}).find_all('p')",
                            'title_loc':"article.h1.text", 
                            'strain_bool':False},
                'CityLab': {'url':'https://www.citylab.com/transportation/', 
                            'source': 'Citylab', 
                            'strain_tag':['h2','h1'], 
                            'strain_attr_type':'class', 'strain_attr':['c-promo__hed','c-river-item__hed c-river-item__hed--'], 
                            'url_list_query':"[item.a['href'] for item in self.base_soup.find_all(['h1','h2'], class_=['c-promo__hed','c-river-item__hed c-river-item__hed--'])]",
                            'date_loc': "article.time.text", 
                            'date_format':None,
                            'sum_loc': "article.find_all('p')[1:]",
                            'title_loc':"article.h1.text", 
                            'strain_bool':True},
                'Engadget': {'url':['https://www.engadget.com/tags/transportation/','https://www.engadget.com/tag/transportation/page/2/'], 
                             'source': 'Engadget', 
                             'strain_tag':'a', 
                             'strain_attr_type':'class', 
                             'strain_attr':'o-hit__link', 
                             'url_list_query':"['https://www.engadget.com'+item['href'] for item in self.base_soup.find_all('a', attrs={'class':'o-hit__link'})]",
                             'date_loc': "article.find('meta', attrs={'name':'published_at'})['content']", 
                             'date_format':None,
                             'sum_loc': "article.find('div', attrs={'class':'container@m-'}).find_all('p')",
                             'title_loc':"article.title.text", 
                             'strain_bool':True},
                'Autoblog': {'url':'https://www.autoblog.com/',
                             'source': 'Autoblog',
                             'strain_tag':'li',
                             'strain_attr_type':'class', 
                             'strain_attr':'flex-item promo-list-item',
                             'url_list_query':"['https://www.autoblog.com/'+item.a['href'] for item in self.base_soup.find_all('li', attrs={'class':'flex-item promo-list-item'})]",
                             'date_loc': "article.find('div', attrs={'class':'post-date'}).text", 
                             'date_format':None,
                             'sum_loc': "article.find('div', attrs={'class':'post-body'}).find_all('p')",
                             'title_loc':"article.h1.text", 
                             'strain_bool':True},
                'Electrek': {'url':'https://electrek.co/', 
                             'source': 'Electrek', 
                             'strain_tag':'h1', 
                             'strain_attr_type':'class', 'strain_attr':'post-title', 
                             'url_list_query':"[item.a['href'] for item in self.base_soup.find_all('h1', class_='post-title')]",
                             'date_loc': "article.find('p', class_='time-twitter').text", 
                             'date_format':None,
                             'sum_loc': "article.find('div', class_='post-body').find_all('p')[1:]",
                             'title_loc':"article.find('h1', class_='post-title').text", 
                             'strain_bool':True},
                'The Verge': {'url':'https://www.theverge.com/transportation',
                              'source': 'The Verge',
                              'strain_tag':'h2', 
                              'strain_attr_type':'class', 'strain_attr':'c-entry-box--compact__title', 
                              'url_list_query':"[item.a['href'] for item in self.base_soup.find_all('h2', class_='c-entry-box--compact__title')]",
                              'date_loc': "article.time.text",
                              'date_format':None,
                              'sum_loc': "article.find_all('p')",
                              'title_loc':"article.h1.text",
                              'strain_bool':True},
                'TechCrunch': {'url':['https://techcrunch.com/', 'https://techcrunch.com/page/2/', 'https://techcrunch.com/page/3/','https://techcrunch.com/page/4/'], 
                               'source': 'TechCrunch',
                               'strain_tag':'a', 
                               'strain_attr_type':'class', 
                               'strain_attr':'post-block__title__link',
                               'url_list_query':"[item['href'] for item in self.base_soup.find_all('a', class_='post-block__title__link')]",
                               'date_loc': "url[23:33]", 
                               'date_format':None,
                               'sum_loc': "article.find('div', attrs={'class':'article-content'}).find_all('p')",
                               'title_loc':"article.find('h1', attrs={'class':'article__title'}).text", 
                               'strain_bool':True},
                'NGV Global': {'url':'http://www.ngvglobal.com/', 
                               'source': 'NGV Global', 
                               'strain_tag':'h2', 
                               'strain_attr_type':'class', 
                               'strain_attr':'entry-title', 
                               'url_list_query':"[item.a['href'] for item in self.base_soup.find_all('h2', attrs={'class':'entry-title'})]",
                               'date_loc': "article.find('time')['title']", 
                               'date_format': None,
                               'sum_loc': "article.find('div', attrs={'class':'pf-content'}).find_all('p')",
                               'title_loc':"article.find('h1', attrs={'class':'entry-title'}).text", 
                               'strain_bool':True},
                'Charged EVs': {'url':['https://chargedevs.com/category/newswire/','https://chargedevs.com/category/newswire/page/2/'],
                                'source':'Charged EVs', 
                                'strain_tag':'h3',
                                'strain_attr_type':'class',
                                'strain_attr':'h2', 
                                'url_list_query':'[item.a["href"] for item in self.base_soup.find_all("h3", class_="h2")]',
                                'date_loc':"article.find('time').text", 
                                'date_format':None,
                                'sum_loc':"article.find('section',class_='entry-content clearfix').find_all('p')",
                                'title_loc': "article.find('h2', class_='page-title').text", 
                                'strain_bool':True},
               'ARS Technica': {'url':'https://arstechnica.com/cars/', 
                                'source': 'ARS Technica', 
                                'strain_tag':'a', 
                                'strain_attr_type':'class', 
                                'strain_attr':'overlay', 
                                'url_list_query':"[item['href'] for item in self.base_soup.find_all('a', attrs={'class': 'overlay'})]",
                                'date_loc': "article.find('time', attrs={'class':'date'}).text", 
                                'date_format':None,
                                'sum_loc': "article.find('div', attrs={'itemprop':'articleBody'}).find_all('p', attrs={'class':None})",
                                'title_loc':"article.h1.text", 
                                'strain_bool':True},
                'IEEE Spectrum': {'url':'https://spectrum.ieee.org/transportation', 'source': 'IEEE Spectrum', 
                                  'strain_tag':'article',
                                  'strain_attr_type':'class', 
                                  'url_list_query': "['https://spectrum.ieee.org'+item.a['href'] for item in self.base_soup.find_all('article')]",
                                  'strain_attr':'item sml_article transportation',
                                  'date_loc': "article.label.text", 
                                  'date_format':'%d %b %Y | %H:%M GMT',
                                  'sum_loc': "article.find_all('p', limit=5)",
                                  'title_loc':"article.h1.text", 
                                  'strain_bool':True},
                'GreenCarCongress': {'url':['http://www.greencarcongress.com/', 'http://www.greencarcongress.com/page/2/'], 
                                     'source': 'GreenCarCongress', 
                                     'strain_tag':'article', 
                                     'strain_attr_type':'class', 
                                     'strain_attr':'post entry', 
                                     'url_list_query':"[item.a['href'] for item in self.base_soup.find_all('article', attrs={'class': 'post entry'})]",
                                     'date_loc': "article.find('span', attrs={'class':'entry-date'}).a.text", 
                                     'date_format':None,
                                     'sum_loc': "article.find_all('p', limit=5)",
                                     'title_loc':"article.h2.a.text", 
                                     'strain_bool':True},
#                 'Bloomberg': {'url':['https://www.bloomberg.com/search?query=self+driving','https://www.bloomberg.com/search?query=electric%20vehicles'], 'source': 'Bloomberg', 
#                               'url_list_query':"[item.a['href'] for item in self.base_soup.find_all('h1')]",
#                               'date_loc': "article.find('time', attrs={'class':'article-timestamp'})['datetime']", 'date_format':None,
#                               'sum_loc': "article.find('div', attrs={'class':'body-copy-v2 fence-body'}).find_all('p')",
#                               'title_loc':"article.find('h1', attrs={'class':'lede-text-v2__hed'}).text", 'strain_bool':False},
}
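
The `url_list_query`, `date_loc`, `sum_loc` and `title_loc` entries above are Python expressions stored as strings. A minimal sketch of how one of them could be resolved, assuming the scraper class evaluates each string with `eval` against a parsed page bound to the name `article` (the class itself is not shown in this section; `resolve_locator` and the stand-in page object are hypothetical):

```python
from types import SimpleNamespace


def resolve_locator(expression, article):
    """Evaluate a locator string with `article` as the only visible name.

    Hypothetical helper: eval runs arbitrary code, so the strings must
    come from a trusted, hand-written profile dictionary.
    """
    return eval(expression, {"__builtins__": {}}, {"article": article})


# Stand-in for a parsed page so the sketch runs without network access.
fake_article = SimpleNamespace(h1=SimpleNamespace(text="  Example headline\n"))

profile = {"title_loc": "article.h1.text"}
title = resolve_locator(profile["title_loc"], fake_article).strip()
print(title)
```

Keeping the locators as strings means a new site only needs a new dictionary entry, at the cost of errors surfacing at scrape time rather than when the dictionary is defined.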

In [None]:
scrape_specs = {}
scraypahs = {}
start_time = time.time()

for site in list(scraper_dict.keys()):
    temp_start_time = time.time()
    print('\n'+site.upper())
    scraypahs[site] = scraypah(scraper_dict[site])
    scraypahs[site].get_urls()
    scraypahs[site].scrape_em()
    scrape_specs = print_results(scraypahs[site].source, scraypahs[site].scraped_count, scraypahs[site].skip_count, 
                             scraypahs[site].too_old, scraypahs[site].relevant_df, round(time.time()-temp_start_time,2), 
                             scrape_specs)

# Trucks.com scraper is unique, can't use standard class
start_time=time.time()
url = 'https://www.trucks.com/'
soup = BeautifulSoup(requests.get(url,headers=headers).content, "html5lib")
trucks_urls_to_scrape = []

for item in soup.find_all('div', attrs={'class':'content-block'}):
    try:
        trucks_urls_to_scrape.append({'link':item.find('div', attrs={'class':'title'}).a['href'],
                                      'date':pd.to_datetime(item.find('div', attrs={'class':'date'}).text + ', {}'.format(dt.date.today().year))})
    except Exception:
        # Block is missing a title link or a parseable date; skip it
        continue

print('\nTRUCKS.COM')
trucks_relevant_articles = {}
reset_trackers()
source = 'Trucks.com'

for url in trucks_urls_to_scrape:
    iteration+=1
    date = url['date'].date()
    
    try:
        # Only fetch articles published within the last max_age days
        if (date - dt.date.today()).days >= -max_age:
            page = requests.get(url['link'], headers=headers)
            article=BeautifulSoup(page.content, "lxml")

            # Summary = first three unclassed paragraphs of the article body (fewer if the article is short)
            summary = article.find('section', attrs={'itemprop':'articleBody'}).find_all('p', attrs={'class':None})
            summary = replace_em(' '.join(p.text for p in summary[:3]))
            title = article.h1.text.replace('â€™',"'").replace('\xa0',' ').replace('\n','').lstrip().replace('  ','')

            temp = page_scan(title, summary, url['link'], date, source)
            if temp != 'Most definitely nope':
                trucks_relevant_articles[scraped_count] = temp
            scraped_count+=1
        
        else:
            too_old += 1
            old_ind.append(iteration-1)
        
    except Exception:
        # Any fetch or parse failure: count the article as skipped and move on
        skip_count+=1
        skip_ind.append(iteration-1)
        continue
        
trucks_df = pd.DataFrame.from_dict(trucks_relevant_articles).T
scrape_specs = print_results(source, scraped_count, skip_count, too_old, trucks_df, round(time.time()-start_time,2), scrape_specs)
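
The Trucks.com listing dates carry no year, so the cell above appends the current year before parsing. In early January that would place late-December articles in the future. A minimal sketch of one way to handle the rollover, alongside the same age test used in the loop; `infer_date` and `is_recent` are hypothetical helpers, not part of the notebook:

```python
import datetime as dt


def infer_date(month, day, today):
    """Attach a year to a month/day pair, assuming it is not in the future."""
    for year in (today.year, today.year - 1):
        try:
            candidate = dt.date(year, month, day)
        except ValueError:  # e.g. Feb 29 in a non-leap year
            continue
        if candidate <= today:
            return candidate
    raise ValueError("no valid date for {}/{}".format(month, day))


def is_recent(date, today, max_age):
    """Same test as the loop above: at most `max_age` days old."""
    return (date - today).days >= -max_age


today = dt.date(2019, 1, 2)
article_date = infer_date(12, 30, today)
print(article_date, is_recent(article_date, today, max_age=7))
```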

#### For quick testing

In [None]:
url = 'https://www.citylab.com/transportation/'
page = requests.get(citylab.urls_to_scrape[0], headers = headers)
soup = BeautifulSoup(page.content, 'lxml')

In [None]:
soup.find_all('p')[1:]

In [None]:
test_dict = {'url':'https://electrek.co/', 
             'source': 'Electrek', 
             'strain_tag':'h1', 
             'strain_attr_type':'class', 'strain_attr':'post-title', 
             'url_list_query':"[item.a['href'] for item in self.base_soup.find_all('h1', class_='post-title')]",
             'date_loc': "article.find('p', class_='time-twitter').text", 
             'date_format':None,
             'sum_loc': "article.find('div', class_='post-body').find_all('p')[1:]",
             'title_loc':"article.find('h1', class_='post-title').text", 
             'strain_bool':True
            }

In [None]:
def test_a_scraypah(attr_dict):
    scraper = scraypah(attr_dict)
    scraper.get_urls()
    scraper.scrape_em()
    return scraper

In [None]:
electrek = test_a_scraypah(test_dict)

In [None]:
electrek.relevant_df

# Summary/Concatenate

In [None]:
scrape_specs_df = pd.DataFrame.from_dict(scrape_specs).T.reset_index()
scrape_specs_df['Time per relevant article'] = scrape_specs_df['Time spent']/scrape_specs_df['Relevant Articles']
display(scrape_specs_df)

all_news_dfs = [value.relevant_df for value in scraypahs.values()]
all_news_dfs.append(trucks_df)  # Trucks.com is scraped outside the scraypah class, so add it explicitly

### Stack all of the articles into a single dataframe and do some cleaning (drop duplicate articles)
all_df = pd.concat(all_news_dfs)
all_df = all_df[['title', 'date', 'AFV', 'CAV', '21CTP', 'Hyperloop', 'summary', 'source', 'link']].sort_values('date', ascending=False)
all_df.drop_duplicates(subset='title', inplace=True)
all_df = all_df.replace(r'\$','$', regex=True)

# int(...) works whether sum() returns a numpy scalar or a plain Python number
print('Smart Mobility articles found: {}'.format(int(all_df['CAV'].sum())))
print('Alternative Fuel Vehicle articles found: {}'.format(int(all_df['AFV'].sum())))
print('21CTP articles found: {}'.format(int(all_df['21CTP'].sum())))
print('Hyperloop articles found: {}'.format(int(all_df['Hyperloop'].sum())))

### Populate meta-data columns
all_df['reason_for_tag'] = all_df.apply(which_keyword_found, axis=1)
all_df['keywords'] = all_df['title'].apply(keyword_pull)

### Format for excel writing
AFV_news = all_df[all_df['AFV'] == 1].sort_values('date', ascending=False)
CAV_news = all_df[all_df['CAV'] == 1].sort_values('date', ascending=False)
truck_news = all_df[all_df['21CTP'] == 1].sort_values('date', ascending=False)
hyperloop_news = all_df[all_df['Hyperloop'] == 1].sort_values('date', ascending=False)

In [None]:
if (dt.date.today().weekday() == 0):
    print('Monday!')
    filename = f'cav_news_updates/{search_date}_cav_news_download.xls'
    CAV_news.to_excel(filename)
    if hyperloop_news.shape[0] > 0:
        filename2 = f'hyperloop_news_updates/{search_date}_hyperloop_news_download.xls'
        hyperloop_news.to_excel(filename2)
        print('Some hyperloop stuff!')
elif (dt.date.today().weekday() == 2):
    print('Wednesday!')
    filename = f'afv_news_updates/{search_date}_afv_news_download.xls'
    AFV_news.to_excel(filename)
elif (dt.date.today().weekday() == 4):
    print('Friday!')
    filename = f'21CTP_news_updates/{search_date}_21CTP_news_download.xls'
    truck_news.to_excel(filename)
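
The weekday branching above can also be written as a lookup table, which makes off days explicit instead of leaving `filename` undefined. A minimal sketch, assuming the same folder and file naming as the cell above; `SCHEDULE` and `output_path` are hypothetical names:

```python
import datetime as dt

# weekday() -> (output folder, file stem); both mirror the cell above.
SCHEDULE = {
    0: ("cav_news_updates", "cav"),
    2: ("afv_news_updates", "afv"),
    4: ("21CTP_news_updates", "21CTP"),
}


def output_path(day, search_date):
    """Return the spreadsheet path for `day`, or None on off days."""
    entry = SCHEDULE.get(day.weekday())
    if entry is None:
        return None
    folder, stem = entry
    return f"{folder}/{search_date}_{stem}_news_download.xls"


print(output_path(dt.date(2018, 8, 27), "2018-08-27"))  # a Monday
print(output_path(dt.date(2018, 8, 28), "2018-08-28"))  # a Tuesday
```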

In [None]:
# Open excel file to edit or add any additional news items
cwd = os.getcwd()
xls_file = cwd+'/'+filename

excel = win32.gencache.EnsureDispatch('Excel.Application')
excel.Visible = True

# open the file
excel.Workbooks.Open(xls_file)

# wait before closing
_ = input("Press enter to close Excel: ")
excel.Application.Quit()

## Generate xls and database to track news items
Only run with **final** news item spreadsheet

In [None]:
last_week = str((pd.to_datetime(search_date) - dt.timedelta(days=7)).date())

In [None]:
conn = sqlite3.connect('news_updates.db')
if (dt.date.today().weekday() == 0) and not db_update:
    print('CAV')
    pd.read_excel('cav_news_updates/{}_cav_news_download.xls'.format(search_date)).drop('21CTP', axis=1).to_sql('CAV', conn, if_exists='append', index=False)
    db_update = True
elif (dt.date.today().weekday() == 2) and not db_update:
    print('AFV')
    pd.read_excel('afv_news_updates/{}_afv_news_download.xls'.format(search_date)).drop('21CTP', axis=1).to_sql('AFV', conn, if_exists='append', index=False)
    db_update = True
conn.close()

if dt.date.today().weekday() == 2:
    print('Uploaded metadata! So many datas!')
    conn = sqlite3.connect('news_updates_meta.db')
    scrape_specs_df.drop(['Time spent', 'Time per relevant article'], axis=1).to_sql('news_updates_meta', conn, if_exists='append', index=False)
    conn.close()
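
The `db_update` flag only lives in the notebook session, so a kernel restart would allow the same week to be appended twice. A minimal sketch of letting the database itself guard against that, using an in-memory database and a made-up three-column table (the real tables hold the spreadsheet columns; `append_once` is a hypothetical helper):

```python
import sqlite3
from contextlib import closing


def append_once(conn, table, rows, search_date):
    """Insert rows for `search_date` unless that date is already stored."""
    # `table` is a trusted constant here, never user input.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (search_date TEXT, title TEXT, link TEXT)"
    )
    already = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE search_date = ?", (search_date,)
    ).fetchone()[0]
    if already:
        return 0
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?, ?)",
        [(search_date, title, link) for title, link in rows],
    )
    conn.commit()
    return len(rows)


# closing() guarantees the connection is closed even if an insert fails.
with closing(sqlite3.connect(":memory:")) as conn:
    rows = [("Example title", "https://example.com/a")]
    first = append_once(conn, "CAV", rows, "2018-08-27")
    second = append_once(conn, "CAV", rows, "2018-08-27")
    print(first, second)
```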

## Python docx
https://github.com/python-openxml/python-docx/issues/384

In [None]:
if dt.date.today().weekday() == 2:
    print('AFV')
    gen_docx('AFV')
elif dt.date.today().weekday() == 0:
    print('CAV')
    gen_docx('CAV')

In [None]:
gen_docx('21CTP')

## Academic articles
Still preliminary; needs more honing/tinkering. Should also add other sites.

NEED TO ADD: TRB!!!

https://www.springer.com/engineering/civil+engineering/journal/42421?TrucksFoT

In [None]:
driver = webdriver.Firefox(executable_path='geckodriver64.exe')

In [None]:
from selenium.webdriver.common.by import By  # locator constants for driver.find_element

def paypuh_scraypuh(url, source):
    '''
    Scrape the recent-articles listing at url and return a DataFrame of papers newer than max_age days.

    bad_egg: Missing a key component (usually abstract), so skip printout/tracking
    still_more: Date is still within past week, continue scraping!
    '''
    soup = grab_homepage(url)
    papers_to_scrape = [paper.a['href'] for paper in soup.find_all('div', attrs={'class':'pod-listing-header'})]
    still_more = True
    scraped_count=0
    papers = {}

    for paper in papers_to_scrape:
        bad_egg=False
        if not still_more:
            break
        driver.get(paper)
        try:
            # First page layout
            driver.find_element(By.CSS_SELECTOR, "span[class='CollapseText']").click()
            date = pd.to_datetime(driver.find_element(By.CSS_SELECTOR, "dl[class='articleDates smh']").text.split('Available online ')[1]).date()
            soup = BeautifulSoup(driver.page_source,"html5lib")
            if (date - dt.date.today()).days > -max_age:
                title = soup.find('h1',class_='svTitle').text
                summary = soup.find('div',class_='abstract svAbstract ').p.text
            else:
                still_more=False
        except Exception:
            try:
                # Fallback for the second page layout
                driver.find_element(By.CSS_SELECTOR, "button[class='show-hide-details']").click()
                soup = BeautifulSoup(driver.page_source,"html5lib")
                date = pd.to_datetime(soup.find('div',class_='wrapper').p.text.split('Available online ')[1]).date()
                if (date - dt.date.today()).days > -max_age:
                    title = soup.find('span',class_='title-text').text
                    summary = soup.find('div',class_='abstract author').p.text
                else:
                    still_more=False
            except Exception:
                bad_egg = True
                print('bad egg in {}: {}'.format(source, paper))

        if still_more and not bad_egg:
            # Count only papers that are kept, so the total printed below is accurate
            scraped_count+=1
            papers[scraped_count] = {'title':title, 'summary':summary, 'link':paper, 'source':source, 'date':date}
            
    print('{} new papers in {}'.format(scraped_count,source))

    return pd.DataFrame(papers).T
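
The loop in `paypuh_scraypuh` relies on the listing being newest-first: it skips pages that fail to parse and stops at the first paper that is too old. A minimal sketch of that control flow with the browser replaced by a stand-in `fetch` callable; `collect_recent`, `fake_fetch` and the sample records are hypothetical:

```python
import datetime as dt


def collect_recent(items, fetch, today, max_age):
    """Walk newest-first items; skip failures, stop at the first stale one."""
    kept = []
    for item in items:
        try:
            record = fetch(item)
        except Exception as err:  # the "bad egg" case
            print("bad egg: {} ({})".format(item, err))
            continue
        if (record["date"] - today).days <= -max_age:
            break  # everything after this point is older still
        kept.append(record)
    return kept


today = dt.date(2018, 8, 29)
pages = {
    "a": {"title": "A", "date": dt.date(2018, 8, 28)},
    "b": None,  # simulates a page with a missing abstract
    "c": {"title": "C", "date": dt.date(2018, 8, 1)},
    "d": {"title": "D", "date": dt.date(2018, 8, 27)},
}


def fake_fetch(key):
    page = pages[key]
    if page is None:
        raise ValueError("missing abstract")
    return page


recent = collect_recent(["a", "b", "c", "d"], fake_fetch, today, max_age=7)
print([record["title"] for record in recent])
```

Note that "d" is never fetched: once "c" turns out to be stale the loop ends, which is only safe if the listing really is sorted by date.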

In [None]:
tparta = paypuh_scraypuh('https://www.journals.elsevier.com/transportation-research-part-a-policy-and-practice/recent-articles', 'Transportation Part A')
tpartb = paypuh_scraypuh('https://www.journals.elsevier.com/transportation-research-part-b-methodological/recent-articles','Transportation Part B')
tpartc = paypuh_scraypuh('https://www.journals.elsevier.com/transportation-research-part-c-emerging-technologies/recent-articles','Transportation Part C')
tpartd = paypuh_scraypuh('https://www.journals.elsevier.com/transportation-research-part-d-transport-and-environment/recent-articles','Transportation Part D')
tparte = paypuh_scraypuh('https://www.journals.elsevier.com/transportation-research-part-e-logistics-and-transportation-review/recent-articles','Transportation Part E')
tpartf = paypuh_scraypuh('https://www.journals.elsevier.com/transportation-research-part-f-traffic-psychology-and-behaviour/recent-articles','Transportation Part F')

In [None]:
source = 'Transport Reviews'
soup = BeautifulSoup(requests.get('https://www.tandfonline.com/action/showAxaArticles?journalCode=ttrv20').content, 'lxml')

transport_review_dict = {}
i=0
for article in soup.find_all('div', class_="tocArticleEntry include-metrics-panel"):
    i += 1
    date = pd.to_datetime(article.find(class_='tocEPubDate').text.split(':')[1]).date()
    if (dt.date.today() - date).days < 10:
        url = 'https://www.tandfonline.com' + article.find(class_='tocEPubDate').find_parent().find_parent().find_all('a')[0]['href']
        article_soup = BeautifulSoup(requests.get(url).content, 'lxml')
        title = article_soup.find('span', class_='NLM_article-title hlFld-title').text
        abstract = article_soup.find('div', class_='abstractSection abstractInFull').text
        transport_review_dict[i] = {'title':title, 'summary':abstract, 'link':url, 'source':source, 'date':date}

In [None]:
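# orient='index' makes each article a row, matching the frames returned by paypuh_scraypuh
week_o_papers = pd.concat([tparta, tpartb, tpartc, tpartd, tparte, tpartf, pd.DataFrame.from_dict(transport_review_dict, orient='index')])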
# week_o_papers.to_excel('{} papers.xls'.format(search_date))

In [None]:
newsdoc = docx.Document(docx='python_docx.docx')

for _, row in week_o_papers.iterrows():
    newsdoc.add_heading(row['title'],level=2)
    p = newsdoc.add_paragraph(row['summary'] + ' ')
    p.add_run('(')
    add_hyperlink(p, str(row['link']), str(row['source']))
    p.add_run(')')
newsdoc.save('{} papers.docx'.format(search_date))

In [None]:
conn = sqlite3.connect('news_papers.db')
week_o_papers.to_sql('news_papers', conn, if_exists='append', index=False)
conn.close()