# Business Need

The business needs to query the Google News API for articles about disease outbreaks, parse each article for the locations of the outbreak, and send those locations to another service for geolocation tracking.

### What is the problem you are attempting to solve? 
Copied from the proposal sent to me by the company, R Zero:
"(Some intro stuff)...

We call Doppler "the Health Forecast" - it's an app on your phone that you open up to see any health risks for where you are today. We've done the conceptual work and are now ready to move into real development.

We want you to help us create a system that retrieves information from a variety of sources through an RSS feed and processes it so the information can be smoothly incorporated into our platform.

Specifically, we want an application that:
* Retrieves information from an RSS feed for
    * Locations (for example, an article that mentioned Zika in the US would mention Miami Beach, Florida - we need that)
        * And, ideally processes the location for a geo-coordinate (latitude/longitude)
    * Diseases/Conditions (such as Zika, the flu, etc)
    * Number of cases, if included
    
* Records the information into an easily accessible format (a CSV file would be great for now)
    * Include the processed information and the link to the source

* Has thorough code comments throughout, so developers that extend it in the future can understand the logic behind it."
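
As a rough illustration of the output format requested above, the sketch below writes one row per extracted finding to CSV using Python's standard `csv` module. The column names, the `write_records` helper, and the sample row are assumptions made for illustration; the proposal only asks for location, disease, case count, and a link to the source.

```python
import csv
import io

# Hypothetical column layout; the proposal only asks for location, disease,
# case count (if any), and a link to the source.
FIELDNAMES = ["disease", "location", "latitude", "longitude", "cases", "source_url"]


def write_records(records, handle):
    """Write a list of record dicts to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for record in records:
        # Missing values (e.g. no case count in the article) become empty cells.
        writer.writerow({name: record.get(name, "") for name in FIELDNAMES})


# Made-up example row, for illustration only.
sample = [{
    "disease": "Zika",
    "location": "Miami Beach, Florida",
    "source_url": "https://example.com/article",
}]

# An in-memory buffer stands in for a real file such as open("outbreaks.csv", "w", newline="").
buffer = io.StringIO()
write_records(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Using `DictWriter` keeps the column order fixed even when an article is missing some of the fields.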

### How is your solution valuable?
The software will be purchased by R Zero to be extended and licensed out to healthcare institutions.

### What is your data source and how will you access it?
Data will be gathered by querying the Google News API for URLs of articles that discuss the onset, diagnosis, spread, etc. of diseases. The articles will then be scraped for locations, diseases and conditions, and numbers of cases.
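
A minimal sketch of how diseases and case counts might be pulled out of article text with regular expressions is shown below. This is a simple stand-in, not the method the project settles on: the `DISEASES` watch list, the pattern, and both helper names are assumptions, and the example sentence is made up.

```python
import re

# Hypothetical watch list; extend it with whatever diseases the business tracks.
DISEASES = ["Zika", "H1N1", "flu"]

# Matches phrases such as "15 cases" or "1,200 confirmed cases".
CASE_PATTERN = re.compile(r"(\d[\d,]*)\s+(?:\w+\s+)?cases?\b", re.IGNORECASE)


def find_diseases(text):
    """Return the watch-list diseases mentioned in the text, in watch-list order."""
    return [
        disease for disease in DISEASES
        if re.search(r"\b" + re.escape(disease) + r"\b", text, re.IGNORECASE)
    ]


def find_case_counts(text):
    """Return every number that precedes 'case(s)', optionally separated by one word."""
    return [int(match.group(1).replace(",", "")) for match in CASE_PATTERN.finditer(text)]


# Made-up sentence, for illustration only.
example = "Officials reported 1,200 confirmed cases of Zika and 15 cases of flu."
print(find_diseases(example))
print(find_case_counts(example))
```

A pattern like this will miss counts written as words ("twelve cases") and cannot tell which count belongs to which disease, so it is only a starting point.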

### What techniques from the course do you anticipate using?
Scraping to get the data as described above.

Unsupervised Neural Networks for Natural Language Processing of the articles to extract the important information.
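
Before any neural model is trained, a plain lookup table can serve as a baseline for the location step. The sketch below is that swapped-in baseline, not a neural network: it matches place names from a tiny, hypothetical gazetteer and returns rounded, approximate coordinates. The `GAZETTEER` contents and the `find_locations` helper are assumptions for illustration.

```python
# Tiny hypothetical gazetteer: place name -> approximate (latitude, longitude).
# The coordinates are rounded and for illustration only.
GAZETTEER = {
    "Miami Beach": (25.79, -80.13),
    "Los Angeles": (34.05, -118.24),
}


def find_locations(text):
    """Return (place, latitude, longitude) for each gazetteer place named in the text."""
    lowered = text.lower()
    return [
        (place, lat, lon)
        for place, (lat, lon) in GAZETTEER.items()
        if place.lower() in lowered
    ]


print(find_locations("A Zika case was reported in Miami Beach, Florida."))
```

A baseline like this also gives a reference point for judging whether a learned model extracts locations any better.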

Supervised Neural Networks to create a regression model that predicts the likelihood of a disease spreading to an area depending on its distance from the current outbreak. 
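
The distance feature such a regression model would need can be computed from two latitude/longitude pairs with the haversine formula. The sketch below assumes a spherical Earth with a mean radius of 6371 km; the function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assuming a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Two points on the equator, half the globe apart: half the Earth's circumference.
print(haversine_km(0.0, 0.0, 0.0, 180.0))
```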

### What do you anticipate will be the biggest challenge you'll face?
Getting reliable data from the scraping process.


### Note:
Before running this program, install Python 3.6 on the machine (I suggest downloading Anaconda) and run "pip install newsapi-python" from the command line.

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import json
import requests
from scrapy.linkextractors import LinkExtractor

# Build a crawler that queries the news API at newsapi.org and pulls articles mentioning the given set of diseases
class GoogleSpider(scrapy.Spider):
    name = "GS"
    
    allowed_domains = ['newsapi.org']
    
    # The API key used here belongs to Matthew Kennedy. The business can register for its own free key
    # at https://newsapi.org and substitute it in the URL below.
    start_urls = ['https://newsapi.org/v2/everything?q=Zika&apiKey=2613ce5e838a464b814b7d5b4c2e6bf8'
                  # Additional urls can go here. Change 'q=Zika' to another disease and add it to the list
                  # Links need to be in '' and comma separated.
                 ]
      
    # The API responds with JSON, so we decode the response body and yield the fields we want from each article.
    def parse(self, response):
        data = json.loads(response.text)
        for article in data['articles']:
            yield {
                'url' : article['url']
                , 'title': article['title']
                , 'description': article['description']
                , 'publish_date': article['publishedAt']
            }
                

process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    # The FEED_URI is the name of the file that will be saved
    'FEED_URI': 'HealthAppScraper.json',
    # Note that because we are doing API queries, the robots.txt file doesn't apply to us.
    'ROBOTSTXT_OBEY': False,
    # This will need to be changed to the business' name
    'USER_AGENT': 'MatthewGoogleNewsCrawler (makennedy626@gmail.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,
    # Uncomment CLOSESPIDER_PAGECOUNT to stop the crawl after the first 10 responses.
    # 'CLOSESPIDER_PAGECOUNT' : 10
})
                                         

# Starting the crawler with our spider.
process.crawl(GoogleSpider)
process.start()
print('First 100 links extracted!')



First 100 links extracted!


In [3]:
from pprint import pprint

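# Load the feed written by the crawler above. The path is relative to the notebook's working directory;
# adjust it if the file was saved elsewhere.
with open('HealthAppScraper.json') as data_file: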
    data = json.load(data_file)
pprint(data)

[{'description': 'Zika virüsü, sivrisinek ısırığıyla bulaşır fakat virüsün '
                 'insandan insana geçip geçmediği hakkında bir bilgi mevcut '
                 "değil. İnsanlarda ilk kez 1954 yılında Nijerya'da görülen "
                 'Zika virüsü, insandan önce 1947... Devamı için tıklayınız',
  'publish_date': '2018-01-09T15:53:21Z',
  'title': 'Zika virüsü nedir? (Zika virüsü belirtileri neler?)',
  'url': 'https://www.sabah.com.tr/saglik/2018/01/09/zika-virusu-nedir-zika-virusu-belirtileri-neler'},
 {'description': 'Zika virüsünün Türkiye’nin bazı bölgelerine yerleştiği iddia '
                 'edildi. Peki zika virüsünün bulaştığı nasıl anlaşılır, hangi '
                 'belirtiler görülür? Uzmanlar virüsün belirtilerini açıkladı. '
                 'Sivrisineklerden insanlara bulaşan zika virüsü nedir? Zika '
                 'virüsü insanlara …',
  'publish_date': '2018-01-11T14:47:24Z',
  'title': 'Zika virüsü nedir? Zika virüsü insanlara nasıl bulaşır? Hamile

In [4]:
print(str(data[1]['description']))

Zika virüsünün Türkiye’nin bazı bölgelerine yerleştiği iddia edildi. Peki zika virüsünün bulaştığı nasıl anlaşılır, hangi belirtiler görülür? Uzmanlar virüsün belirtilerini açıkladı. Sivrisineklerden insanlara bulaşan zika virüsü nedir? Zika virüsü insanlara …
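
The sample above includes Turkish-language articles because the query does not restrict the language. One possible way to build the `start_urls` list for several diseases while requesting English results only is sketched below. It relies on the `language` query parameter of the News API `everything` endpoint; the `build_query_urls` helper and the `YOUR_API_KEY` placeholder are assumptions for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/everything"


def build_query_urls(diseases, api_key, language="en"):
    """Return one News API query URL per disease name."""
    return [
        BASE_URL + "?" + urlencode({"q": disease, "language": language, "apiKey": api_key})
        for disease in diseases
    ]


# 'YOUR_API_KEY' is a placeholder; substitute the business's own key.
urls = build_query_urls(["Zika", "H1N1"], api_key="YOUR_API_KEY")
for url in urls:
    print(url)
```

The resulting list can be assigned to a spider's `start_urls` in place of the hand-written URL.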


In [12]:
import pandas as pd

# The JSON feed is a list of dictionaries, one per article. Build the DataFrame in a single step
# so that every row gets its own index label instead of a repeated 0.
df_data = pd.DataFrame({
    'Description': [entry['description'] for entry in data],
    'Publish Date': [entry['publish_date'] for entry in data],
    'URL': [entry['url'] for entry in data],
})

In [13]:
df_data

Unnamed: 0,Description,Publish Date,URL
0,"Zika virüsü, sivrisinek ısırığıyla bulaşır fak...",2018-01-09T15:53:21Z,https://www.sabah.com.tr/saglik/2018/01/09/zik...
1,Zika virüsünün Türkiye’nin bazı bölgelerine ye...,2018-01-11T14:47:24Z,http://www.sozcu.com.tr/2018/saglik/zika-virus...
2,Dünya üzerinde yayılım göstererek hayati tehli...,2018-01-10T07:33:00Z,http://www.haber7.com/guncel/haber/2520534-zik...
3,Did you ever wonder what happened to Zika? You...,2018-01-08T04:01:00Z,https://www.lewrockwell.com/2018/01/dr-david-b...
4,Monkeys who catch Zika virus through bites fro...,2017-12-13T21:11:05Z,https://www.sciencedaily.com/releases/2017/12/...
5,( University of Wisconsin-Madison ) Monkeys wh...,2017-12-13T05:00:00Z,https://www.eurekalert.org/pub_releases/2017-1...
6,"Zika virüsü, neden olduğu rahatsızlıklar neden...",2018-01-11T09:13:48Z,http://www.hurriyet.com.tr/zika-virusu-nedir-n...
7,Reuters US study sheds light on how Zika cause...,2017-12-13T23:28:54Z,https://www.reuters.com/article/us-health-zika...
8,Patch.com Sexually Transmitted Zika Case Found...,2017-11-03T19:30:20Z,https://patch.com/florida/miami/sexually-trans...
9,Los Angeles Times Los Angeles: 1st sexually tr...,2018-01-05T10:00:57Z,http://outbreaknewstoday.com/los-angeles-1st-s...


In [None]:
############################################################################
###### Now, create a crawler to crawl the information of each url ##########
############################################################################

# Importing in each cell because of the kernel restarts.
import scrapy
from scrapy.crawler import CrawlerProcess


# Collect the article URLs from the DataFrame into a plain list.
# (The URLs live in df_data; data is the raw list of dictionaries and has no 'URL' key.)
url_list = df_data['URL'].tolist()
    
print(url_list)


class ESSpider(scrapy.Spider):
    # Naming the spider is important if you are running more than one spider of
    # this class simultaneously.
    name = "ESS"
    
    # URL(s) to start with. start_urls must be a flat list of URL strings, not a list containing a list.
    start_urls = url_list

    # Use XPath to parse the response we get.
    def parse(self, response):
        
        # Iterate over every <article> element on the page.
        for article in response.xpath('//article'):
            
            # Yield a dictionary with the values we want.
            yield {
                # This is the code to choose what we want to extract
                # You can modify this with other Xpath expressions to extract other information from the site
                'name': article.xpath('header/h2/a/@title').extract_first(),
                'date': article.xpath('header/section/span[@class="entry-date"]/text()').extract_first(),
                'text': article.xpath('section[@class="entry-content"]/p/text()').extract(),
                'tags': article.xpath('*/span[@class="tag-links"]/a/text()').extract()
            }

# Tell the script how to run the crawler by passing in settings.
process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in JSON format.
    'FEED_URI': 'firstpage.json',  # Name our storage file.
    'LOG_ENABLED': False           # Turn off logging for now.
})

# Start the crawler with our spider.
process.crawl(ESSpider)
process.start()
print('Success!')

# A second attempt, using the newsapi-python client

In [1]:
#Import the necessary modules
import datetime
from newsapi import NewsApiClient
import json
import requests

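# Matthew Kennedy's API key. The business will need to register for its own key (registration is free) and place it below.
newsapi = NewsApiClient(api_key='2613ce5e838a464b814b7d5b4c2e6bf8')

# Dates are passed as ISO 8601 strings such as '2018-01-31', not as a day-of-month integer.
today = datetime.datetime.now().strftime('%Y-%m-%d')

# Several search terms are combined with OR inside a single query string.
all_articles = newsapi.get_everything(q='Zika OR H1N1'
                                     # Optional start date. The keyword is from_param in recent
                                     # newsapi-python releases (older releases used from_parameter).
                                     #, from_param = today
                                     , to = today
                                     , language = 'en')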

In [2]:
data = all_articles['articles']

import pandas as pd

# The API result is a list of dictionaries, one per article. Build the DataFrame in a single step
# so that every row gets its own index label instead of a repeated 0.
df_data = pd.DataFrame({
    'Description': [entry['description'] for entry in data],
    'Publish Date': [entry['publishedAt'] for entry in data],
    'URL': [entry['url'] for entry in data],
})

df_data.head()

Unnamed: 0,Description,Publish Date,URL
0,Did you ever wonder what happened to Zika? You...,2018-01-08T04:01:00Z,https://www.lewrockwell.com/2018/01/dr-david-b...
1,Monkeys who catch Zika virus through bites fro...,2017-12-13T21:11:05Z,https://www.sciencedaily.com/releases/2017/12/...
2,( University of Wisconsin-Madison ) Monkeys wh...,2017-12-13T05:00:00Z,https://www.eurekalert.org/pub_releases/2017-1...
3,Reuters US study sheds light on how Zika cause...,2017-12-13T23:28:54Z,https://www.reuters.com/article/us-health-zika...
4,FiercePharma Takeda's Zika vaccine candidate w...,2018-01-29T20:12:01Z,https://www.fiercepharma.com/vaccines/takeda-s...


In [3]:
############################################################################
###### Now, create a crawler to crawl the information of each url ##########
############################################################################

# Importing in each cell because of the kernel restarts.
import scrapy
from scrapy.crawler import CrawlerProcess


# Collect the article URLs from the DataFrame into a plain list.
url_list = df_data['URL'].tolist()
    
print(url_list)


class ESSpider(scrapy.Spider):
    # Naming the spider is important if you are running more than one spider of
    # this class simultaneously.
    name = "ESS"
    
    # URL(s) to start with.
    start_urls = url_list

    # Use XPath to parse the response we get.
    def parse(self, response):
        
        # Iterate over every <article> element on the page.
        for article in response.xpath('//article'):
            
            # Yield a dictionary with the values we want.
            yield {
                # Choose what to extract here. The active XPath below targets a single post
                # (id "post-669657"), so it has to be adapted for each site that is crawled.
                #'name': article.xpath('header/h2/a/@title').extract_first(),
                #'date': article.xpath('header/section/span[@class="entry-date"]/text()').extract_first(),
                'text': article.xpath('//*[@id="post-669657"]/div[3]').extract(),
                #'tags': article.xpath('*/span[@class="tag-links"]/a/text()').extract()
            }

# Tell the script how to run the crawler by passing in settings.
process = CrawlerProcess({
    'FEED_FORMAT': 'json',         # Store data in JSON format.
    'FEED_URI': 'firstpage.json',  # Name our storage file.
    'LOG_ENABLED': False           # Turn off logging for now.
})

# Start the crawler with our spider.
process.crawl(ESSpider)
process.start()
print('Success!')



['https://www.lewrockwell.com/2018/01/dr-david-brownstein/zika-schmika-iv-where-is-our-billion-dollars/', 'https://www.sciencedaily.com/releases/2017/12/171213161105.htm', 'https://www.eurekalert.org/pub_releases/2017-12/uow-mib121317.php', 'https://www.reuters.com/article/us-health-zika-paralysis/u-s-study-sheds-light-on-how-zika-causes-nerve-disorder-idUSKBN1E735E', 'https://www.fiercepharma.com/vaccines/takeda-s-zika-vaccine-gets-fda-fast-track-though-virus-no-longer-emergency', 'https://in.reuters.com/article/us-health-zika-paralysis/u-s-study-sheds-light-on-how-zika-causes-nerve-disorder-idINKBN1E735E', 'https://www.laboratoryequipment.com/news/2017/12/cdc-report-infants-affected-zika-not-reaching-developmental-milestones', 'https://medicalxpress.com/news/2017-11-zika-outbreak.html', 'http://outbreaknewstoday.com/los-angeles-1st-sexually-transmitted-zika-infection-reported-13985/', 'https://uk.reuters.com/article/us-health-zika-birthdefects/more-birth-defects-seen-in-u-s-areas-whe
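
Because the XPath in the spider is specific to one site, it may help to group the collected URLs by domain so that each site can be given its own extraction rules. The sketch below does this with the standard library; the `group_by_domain` helper is an assumption, and the sample URLs are taken from the lists printed in this notebook.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_domain(urls):
    """Return a dict mapping each domain (netloc) to the URLs that belong to it."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return dict(groups)


sample_urls = [
    "https://www.sciencedaily.com/releases/2017/12/171213161105.htm",
    "https://medicalxpress.com/news/2017-11-zika-outbreak.html",
    "http://outbreaknewstoday.com/los-angeles-1st-sexually-transmitted-zika-infection-reported-13985/",
    "http://outbreaknewstoday.com/birth-defects-seen-parts-u-s-local-zika-spread-cdc-83806/",
]

groups = group_by_domain(sample_urls)
for domain, domain_urls in groups.items():
    print(domain, len(domain_urls))
```

Each domain key could then be mapped to its own set of XPath expressions inside the spider's `parse` method.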

In [4]:
import scrapy
from scrapy.crawler import CrawlerProcess

# Collect the article URLs from the DataFrame into a plain list.
url_list = df_data['URL'].tolist()
    
print(url_list)

class WikiSpider(scrapy.Spider):
    name = "WS"
    
    # Here is where we insert our API call.
    start_urls = url_list

    # Identifying the information we want from the query response and extracting it using xpath.
    def parse(self, response):
        for item in response.xpath('//lh'):
            # The 'lh' elements and the 'ns' attribute come from MediaWiki API responses, where ns == '0' marks a
            # regular Wikipedia entry. News article pages do not contain them, so this filter needs site-specific XPath.
            if item.xpath('@ns').extract_first() == '0':
                yield {
                    'text': item.xpath('//*[@id="post-669657"]/div[3]').extract_first() 
                    ,'body': item.xpath('//*[@id="single-post"]').extract_first()
                    }
        # Getting the information needed to continue to the next ten entries.
        next_page = response.xpath('continue/@lhcontinue').extract_first()
        
        # Recursively calling the spider to process the next ten entries, if they exist.
        if next_page is not None:
            next_page = '{}&lhcontinue={}'.format(self.start_urls[0],next_page)
            yield scrapy.Request(next_page, callback=self.parse)
            
    
process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'PythonLinks.json',
    # Note that because we are doing API queries, the robots.txt file doesn't apply to us.
    'ROBOTSTXT_OBEY': False,
    'USER_AGENT': 'ThinkfulDataScienceBootcampCrawler (thinkful.com)',
    'AUTOTHROTTLE_ENABLED': True,
    'HTTPCACHE_ENABLED': True,
    'LOG_ENABLED': False,
    # We use CLOSESPIDER_PAGECOUNT to limit our scraper to the first 10 pages.
    'CLOSESPIDER_PAGECOUNT' : 10
})
                                         

# Starting the crawler with our spider.
process.crawl(WikiSpider)
process.start()
print('Crawl finished!')



['https://www.lewrockwell.com/2018/01/dr-david-brownstein/zika-schmika-iv-where-is-our-billion-dollars/', 'https://www.sciencedaily.com/releases/2017/12/171213161105.htm', 'https://www.eurekalert.org/pub_releases/2017-12/uow-mib121317.php', 'https://www.reuters.com/article/us-health-zika-paralysis/u-s-study-sheds-light-on-how-zika-causes-nerve-disorder-idUSKBN1E735E', 'https://www.fiercepharma.com/vaccines/takeda-s-zika-vaccine-gets-fda-fast-track-though-virus-no-longer-emergency', 'https://www.laboratoryequipment.com/news/2017/12/cdc-report-infants-affected-zika-not-reaching-developmental-milestones', 'https://in.reuters.com/article/us-health-zika-paralysis/u-s-study-sheds-light-on-how-zika-causes-nerve-disorder-idINKBN1E735E', 'https://medicalxpress.com/news/2017-11-zika-outbreak.html', 'http://outbreaknewstoday.com/los-angeles-1st-sexually-transmitted-zika-infection-reported-13985/', 'http://outbreaknewstoday.com/birth-defects-seen-parts-u-s-local-zika-spread-cdc-83806/', 'https://u