<a href="https://colab.research.google.com/github/mco-gh/pylearn/blob/master/notebooks/9_Exceptions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 9 - Project

**Make a copy of this notebook by selecting File->Save a copy in Drive from the menu bar above.**

Things you'll learn in this lesson:
- how to formulate and carry out a Python project

[Previous Lesson](https://pylearn.io/lessons/8-Files/)

# Let's do a project together

## Problem Statement
Enable people to automatically find articles of interest from their favorite news sites.

## Requirements

### Must...
- maintain a configurable list of target websites
- support a per-user configurable list of topics of interest
- keep track of what we've already seen
- be automated, with no manual steps other than running the app
- present results via a web app

### Should...
- be able to run automatically on a regular schedule
- provide ability to send daily summaries by email
- provide a more sophisticated way of gauging interest than topic enumeration (e.g. embeddings)
- run in the cloud

## Problem Decomposition
We can follow a pattern that many data science projects use:
```
gather => format => model => report
```

1. **gather** - data acquisition, getting your hands on the data you care about
1. **format** - data engineering, convert the data into a format you can use
1. **model** - data modeling, build prediction and/or classification model(s) to categorize and assess discovered data
1. **report** - present the insights visually and/or analytically
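
The four stages above can be sketched as a chain of small functions. This is only an illustrative skeleton: the function names, the stubbed sample data, and the simple keyword match used for the model stage are all assumptions, not part of the project code that follows.

```python
# Hypothetical stage functions; names and sample data are placeholders.
def gather(sites):
    """Data acquisition: return raw records for each site (stubbed here)."""
    return [{'site': s, 'title': f'  Sample headline from {s}  '} for s in sites]

def format_records(raw):
    """Data engineering: normalize raw records into a consistent shape."""
    return [{'site': r['site'], 'title': r['title'].strip()} for r in raw]

def model(records, topics):
    """Data modeling: keep records whose title mentions a topic of interest."""
    return [r for r in records
            if any(t.lower() in r['title'].lower() for t in topics)]

def report(records):
    """Reporting: turn the selected records into printable lines."""
    return [f"{r['site']}: {r['title']}" for r in records]

def run_pipeline(sites, topics):
    """Run gather => format => model => report in order."""
    return report(model(format_records(gather(sites)), topics))

for line in run_pipeline(['example.com', 'example.org'], ['headline']):
    print(line)
```

Keeping each stage in its own function means any one of them can later be swapped for a real implementation without touching the others.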

## Step 1 - Gather

Given a list of websites, gather all available articles.

## Can we reuse some code?

Yes! We're going to use the [NewsCatcher API](https://github.com/kotartemiy/newscatcher), which is a Python library that claims to: *Programmatically collect normalized news from (almost) any website.*



In [None]:
# Let's install it...
!pip install newscatcher

In [None]:
# Let's try it out...
from newscatcher import Newscatcher
nc = Newscatcher(website='theguardian.com')
results = nc.get_news()
print(type(results))
print(results.keys())
print(results)

In [None]:
# Find out how many websites are supported...
from newscatcher import urls
sites = urls()
print('number of sites supported:', len(sites))
unique_sites = set(sites)
print('unique sites:', len(unique_sites))

In [None]:
# Get some articles...
nc = Newscatcher(website='theguardian.com')
results = nc.get_news()
articles = results['articles']
print(type(articles))
print('number of articles:', len(articles))
print('article keys:', articles[0].keys())
print()

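# Number the articles starting at 1; avoid shadowing the built-in id()
for cnt, article in enumerate(articles, start=1):
  article_id = article['id']
  title = article['title']
  print(f'{cnt:2d}. {title:70.70s}  {article_id}')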

In [None]:
# List topics...
from newscatcher import describe_url
describe = describe_url('nytimes.com')
print(describe['topics'])
describe = describe_url('fivethirtyeight.com')
print(describe['topics'])

In [None]:
# Get articles with a specific topic...
nc = Newscatcher(website='fivethirtyeight.com', topic='science')
results = nc.get_news()
articles = results['articles']

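# Number the articles starting at 1; avoid shadowing the built-in id()
for cnt, article in enumerate(articles, start=1):
  article_id = article['id']
  title = article['title']
  print(f'{cnt:2d}. {title:70.70s}  {article_id}')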

In [None]:
# Function to get articles from a given site with a given topic...
def get_new_articles(site, topic):
  nc = Newscatcher(website=site, topic=topic)
  results = nc.get_news()
  # Return the articles, or an empty list if nothing came back,
  # so callers can safely use len() and += on the result
  if results and 'articles' in results:
    return results['articles']
  return []

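# Function to display articles from a set of results...
def display(articles):
  # Number the articles starting at 1; avoid shadowing the built-in id()
  for cnt, article in enumerate(articles, start=1):
    article_id = article['id']
    title = article['title']
    print(f'{cnt:2d}. {title:70.70s}  {article_id}')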

In [None]:
results = get_new_articles('nytimes.com', 'food')
display(results)

In [None]:
# Let's define some selection criteria...

# sites of interest
sites = [
  'nytimes.com',
  'washingtonpost.com',
  'theguardian.com',
  'si.com',
]

# topics of interest
topics = [
  'politics',
  'tech',
  'business',
  'sport',
]

print('sites:', sites)
print('topics:', topics)
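
The lists above are hard-coded, while the requirements ask for a per-user configurable list of topics. One possible shape for that configuration is sketched below; the dictionary layout, the user names, and the helper functions are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical per-user configuration; structure and names are assumptions.
config = {
    'sites': ['nytimes.com', 'theguardian.com'],
    'users': {
        'alice': ['politics', 'tech'],
        'bob': ['sport'],
    },
}

def topics_for(config, user):
    """Return the topics configured for a user, or an empty list if unknown."""
    return config['users'].get(user, [])

def all_topics(config):
    """Return every topic any user cares about, so each is fetched only once."""
    return sorted({t for topics in config['users'].values() for t in topics})

print(topics_for(config, 'alice'))
print(all_topics(config))
```

A dictionary like this could also be stored in a JSON file so that it can be edited without changing the code.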

In [None]:
all_articles = []
for i in sites:
  tmp = describe_url(i)
  topic_list = tmp['topics']
  for j in topics:
    print(f'site: {i:20.20s}  topic: {j:20.20s}', end='')
    if j not in topic_list:
      print('topic not available')
      continue
    # Fall back to an empty list if nothing came back for this site/topic pair
    articles = get_new_articles(i, j) or []
    print(len(articles))
    all_articles += articles

#display(all_articles)

In [None]:
def get_news(sites, topics):
  all_articles = []
  for i in sites:
    topic_list = describe_url(i)['topics']
    for j in topics:
      if j not in topic_list:
        continue
      # Fall back to an empty list if nothing came back for this site/topic pair
      articles = get_new_articles(i, j) or []
      all_articles += articles
  return all_articles


In [None]:
results = get_news(['nytimes.com', 'washingtonpost.com'], topics)
display(results)
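
One of the requirements is to keep track of what we've already seen. A minimal sketch of one way to do that follows, assuming each article dictionary has an `'id'` key holding a string (as in the output above) and that the IDs are saved to a JSON file between runs. The function names and the file format are assumptions made for illustration.

```python
import json
import os

def load_seen(path):
    """Load the set of article IDs recorded by earlier runs (empty if none)."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))

def save_seen(path, seen):
    """Write the set of seen article IDs to a JSON file."""
    with open(path, 'w') as f:
        json.dump(sorted(seen), f)

def filter_unseen(articles, seen):
    """Return articles whose 'id' is not in seen, and add their IDs to seen."""
    fresh = []
    for article in articles:
        if article['id'] not in seen:
            seen.add(article['id'])
            fresh.append(article)
    return fresh

# Stand-in data; real articles would come from get_news()
sample = [
    {'id': 'a1', 'title': 'First'},
    {'id': 'a2', 'title': 'Second'},
    {'id': 'a1', 'title': 'First'},
]
seen = set()
print(len(filter_unseen(sample, seen)))  # prints 2
```

In the app, the flow might be to call `load_seen` at startup, pass the output of `get_news` through `filter_unseen`, and call `save_seen` before exiting.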
