<a href="https://colab.research.google.com/github/mco-gh/pylearn/blob/master/notebooks/8_Files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 8 - Files

**Reading & Writing Files, Shared Project**

Link to this notebook: [mco.fyi/py6](https://mco.fyi/py6)

**Make a copy of this notebook by selecting File->Save a copy in Drive from the menu bar above.**

Things you'll learn in this lesson:
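- why programs need persistent storage
- how to open, read, write, and close files
- how the `with` statement closes files for you
- how to start a shared project that finds news articles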

# Reading and Writing Files

## Our programs have amnesia
- Program variables reside in memory,
- and main memory is not persistent,
- so when you terminate your program or you turn off your computer, your data disappears.
- imagine having to re-enter your list of friends every time you used Facebook
- so we'll need a way to store and retrieve things



## Storage Tradeoffs
- there are two kinds of storage in your computer
  - main memory, aka RAM, is fast, but transient (like human memory)
  - disk storage is slow(er), but permanent (like a notebook) and higher capacity
- all the things we've worked with so far (variables, functions, expressions) reside in main memory
- we'll save information across program executions using disk storage in units we call files

## What is a file anyway?
- a named chunk of persistent disk storage is called a `file`
- files are organized into hierarchical structures, called directories or folders
- examples...
  - Windows:  `c:\Users\marccohen\my_fave_movies.md`
  - Mac/Linux: `/Users/marccohen/my_fave_movies.md`
- `path` is the location, e.g. `c:\Users\marccohen\`
  - it's the "where"
- `filename` is the name, e.g. `my_fave_movies.md`
  - it's the "which"
  
```
Queen Elizabeth   -----> the which
Buckingham Palace \
London, England    |---> the where
SW1A 1AA.         /
```
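Here's a small sketch of splitting a file specification into its "where" and "which" parts. It uses `os.path.split()` from Python's standard library, which this lesson doesn't otherwise cover, and the path is just the example from above.

```python
import os

full_path = '/Users/marccohen/my_fave_movies.md'

# os.path.split() separates the "where" (path) from the "which" (filename)
where, which = os.path.split(full_path)

print('path:    ', where)
print('filename:', which)
```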

## Opening a File
- Before you can read or write a file, you need to open it
- Use the `open()` function
- prototype: `variable = open(filename, mode)`
- example: `file = open("myfile", "r")`
- the first argument is a file specification, which may or may not include a path
- if no path is provided, the file is assumed to reside in the "current directory"
- we'll cover the second argument in the next cell
- open returns a special type, called a file object, which is used for subsequent operations on the file


## File Access Modes
Mode|Description|Access|If file exists...|If file doesn't exist...
----|-----------|------|-----------------|------------------------
"r"|read from a file|read|open file|generate error
"w"|write to a file|write|overwrite & open|create file & open
"a"|append to a file|write|open for append|create file & open
"r+"|read/write from/to a file|read/write|open file|generate error
"w+"|write/read from/to a file|read/write|overwrite & open|create file & open
"a+"|append/read a text file|read/write|open for append|create file & open

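To make the table concrete, here's a rough sketch that tries out `"w"`, `"a"`, and `"r"`. The file name `modes_demo.txt` and the use of the system's temporary directory are assumptions made for this demo, and the `try`/`except` at the end is only there to catch the error that `"r"` generates for a missing file.

```python
import os
import tempfile

# Scratch file in the system's temporary directory (name chosen for this demo)
path = os.path.join(tempfile.gettempdir(), 'modes_demo.txt')

# "w" creates the file, or overwrites it if it already exists
f = open(path, 'w')
f.write('first\n')
f.close()

# "a" keeps the existing contents and adds to the end
f = open(path, 'a')
f.write('second\n')
f.close()

f = open(path, 'r')
after_append = f.read()
f.close()
print(after_append)

# "w" again wipes out what was there before
f = open(path, 'w')
f.write('third\n')
f.close()

f = open(path, 'r')
after_overwrite = f.read()
f.close()
print(after_overwrite)

# "r" on a file that doesn't exist generates an error
os.remove(path)
missing_file_error = False
try:
    f = open(path, 'r')
except FileNotFoundError:
    missing_file_error = True
    print('cannot read a file that does not exist')
```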

## Closing a File
- the opposite of `open()` is `close()`
- when you're done working with a file, you should close it
- closing a file cleans up the loose ends
- `close()` is a method of the file object
- example: `file.close()`
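One way to see what `close()` does is to check the file object's `closed` attribute before and after. This is only a sketch; the file name `close_demo.txt` and its location in the temporary directory are assumptions for the demo.

```python
import os
import tempfile

# Scratch file in the system's temporary directory (name chosen for this demo)
path = os.path.join(tempfile.gettempdir(), 'close_demo.txt')

f = open(path, 'w')
open_before = f.closed     # False: the file is still open
f.write('hello\n')
f.close()
closed_after = f.closed    # True: the file has been closed

# Using a file object after closing it is an error
try:
    f.write('too late\n')
    write_failed = False
except ValueError:
    write_failed = True

print(open_before, closed_after, write_failed)
os.remove(path)
```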


## Writing to a File
- `file.write('this is a line of text\n')`
- file must have been opened with write access
- writes the passed string into the file
- you have to include newline characters where you want them, otherwise subsequent write calls will build one long line
- writes are buffered, so they may not be visible in the file until you close it


In [None]:
f = open('test.txt', 'w')
f.write('This is my test file.\n')
for i in range(10):
  f.write('line number ' + str(i) + '\n')
f.close()


## Reading From a File
- `mystr = file.read()`
- file must have been opened with read access
- reads the entire file into memory
- the result is returned in a string
- you can pass an argument to limit how many characters are read

In [None]:
f = open('test.txt', 'r')
s = f.read()
f.close()
print(s)
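Here's a short sketch of passing an argument to `read()` to limit how many characters are read. Each call picks up where the previous one stopped. The file name `read_demo.txt` and its contents are made up for this example.

```python
import os
import tempfile

# Scratch file in the system's temporary directory (name chosen for this demo)
path = os.path.join(tempfile.gettempdir(), 'read_demo.txt')

f = open(path, 'w')
f.write('abcdefghij')
f.close()

f = open(path, 'r')
first = f.read(4)     # the first 4 characters
second = f.read(4)    # the next 4 characters
rest = f.read()       # whatever is left
f.close()

print(first, second, rest)
os.remove(path)
```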

## Reading a file iteratively
- `for line in file:`
- this iterates over the lines in a file
- each iteration of the loop reads the next line from the file and assigns it, as a string, to the loop variable (`line`, in this case)
- the string includes the trailing newline
- this is a very handy way of processing a text file one line at a time
- also space-efficient because it only needs to store one line at a time in main memory


In [None]:
! cat test.txt


In [None]:
myfile = open('test.txt', 'r')
for text in myfile:
  print(text, end='')
myfile.close()


## The `with` Statement

- automatically ensures files get closed (and other resources get cleaned up)
- without the `with` statement...
```
file = open('file_path', 'w')
file.write('Hello world!')
file.close()
```
- using the `with` statement...
```
with open('file_path', 'w') as file:
    file.write('Hello world!')
```

In [None]:
with open('test.txt', 'r') as f:
  for line in f:
    print(line, end='')  # each line already ends with a newline
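To check that `with` really does close the file, even when something goes wrong inside the block, here's a small sketch. The file name `with_demo.txt` is an assumption, and the `int()` call is there only to trigger an error on purpose.

```python
import os
import tempfile

# Scratch file in the system's temporary directory (name chosen for this demo)
path = os.path.join(tempfile.gettempdir(), 'with_demo.txt')

with open(path, 'w') as f:
    f.write('Hello world!\n')
closed_normally = f.closed    # True: closed as soon as the block ended

# The file is closed even if an error happens inside the block
try:
    with open(path, 'r') as g:
        number = int(g.read())    # the text is not a number, so this raises ValueError
except ValueError:
    closed_after_error = g.closed

print(closed_normally, closed_after_error)
os.remove(path)
```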

## Summary of File Functions and Methods
- `open()` - open a file
- `close()` - close a file
- `read(n)` - read up to n chars from the current position and return them in a string. If n is not provided, read all chars from the current position to the end of the file.
- `readlines()` - read all remaining lines in a file and return in a list of strings.
- `write(s)` - write string s to a file
- `writelines(list)` - write the strings in the passed list to a file


# When you run a cell in a notebook, where does it actually run?



<img src="https://lh3.googleusercontent.com/LWgdIXTXW6nO0Wi5rGpEJoZ5Hd4EtXq8gm55_wyfIcfZOs07paFyWlrlFUyl9bRCKFKpS_I3nP6O4CN8vXwWG0bV2XtAUH4X2PRWiQ=w1200-l80-sg-rj-c0xffffff">

[Learn more.](https://www.google.com/about/datacenters/)


<img src="https://docs.google.com/drawings/d/e/2PACX-1vQh48LKMQ7Y4bNewgnLj2a429ZjV4yFS2ghfKXiK1Wn1skq5JH1sMGtrGaU7MYPkN_m-bXCRWqSZpB-/pub?w=1440&h=1080">




# Let's do a project together

## Problem Statement
Enable people to automatically find articles of interest from their favorite news sites.

## Requirements

### Must...
- maintain a configurable list of target websites
- support a per-user configurable list of topics of interest
- keep track of what we've already seen
- must be automated, no manual steps other than running the app
- must present results via web app

### Should
- should be able to automatically and regularly run app on a scheduled basis
- should provide ability to send daily summaries by email
- should provide a more sophisticated way of gauging interest than topic enumeration (e.g. machine learning)
- should run in the cloud


## Problem Decomposition
- we can follow a pattern that many data science projects use
```
gather => format => model => report
```

1. **gather** - data acquisition, getting your hands on the data you care about
1. **format** - data engineering, convert the data into a format you can use
1. **model** - data modeling, build prediction and/or classification model(s) to categorize and assess discovered data
1. **report** - present the insights visually and/or analytically
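As a rough illustration of the pattern, here's a toy version with one function per stage. All four functions and the sample titles are hypothetical stand-ins, not the code we'll actually write for the project, and the "model" here is nothing more than a keyword match.

```python
# Hypothetical stand-ins for the four stages; the data is made up.

def gather():
    """Pretend to fetch raw articles from somewhere."""
    return [
        {'title': '  New telescope spots distant galaxy '},
        {'title': 'Local team wins final'},
    ]

def format_articles(raw):
    """Clean up the raw data: keep just the title, without extra spaces."""
    return [article['title'].strip() for article in raw]

def model(titles, topics):
    """Keep the titles that mention at least one topic of interest."""
    selected = []
    for title in titles:
        for topic in topics:
            if topic.lower() in title.lower():
                selected.append(title)
                break
    return selected

def report(titles):
    """Present the results."""
    for title in titles:
        print('-', title)

# gather => format => model => report
picked = model(format_articles(gather()), ['telescope'])
report(picked)
```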

## Step 1 - Gather

Given a list of websites, gather all available articles.

## Can we reuse some code?

Yes! We're going to use the [NewsCatcher API](https://github.com/kotartemiy/newscatcher), which is a Python library that claims to: *Programmatically collect normalized news from (almost) any website.*



In [None]:
# Let's install it...
!pip install newscatcher

In [None]:
# Let's try it out...
from newscatcher import Newscatcher
nc = Newscatcher(website='theguardian.com')
results = nc.get_news()
print(type(results))
print(results.keys())
print(results)

In [None]:
# Find out how many websites are supported...
from newscatcher import urls
sites = urls()
print('number of sites supported:', len(sites))
unique_sites = set(sites)
print('unique sites:', len(unique_sites))

In [None]:
# Get some articles...
nc = Newscatcher(website='theguardian.com')
results = nc.get_news()
articles = results['articles']
print(type(articles))
print('number of articles:', len(articles))
print('article keys:', articles[0].keys())
print()

cnt = 1
for i in articles:
  id = i['id']
  title = i['title']
  print(f'{cnt:2d}. {title:70.70s}  {id}')
  cnt += 1

In [None]:
# List topics...
from newscatcher import describe_url
describe = describe_url('nytimes.com')
print(describe['topics'])
describe = describe_url('fivethirtyeight.com')
print(describe['topics'])

In [None]:
# Get articles with a specific topic...
nc = Newscatcher(website='fivethirtyeight.com', topic='science')
results = nc.get_news()
articles = results['articles']

cnt = 1
for i in articles:
  id = i['id']
  title = i['title']
  print(f'{cnt:2d}. {title:70.70s}  {id}')
  cnt += 1

In [None]:
# Function to get articles from a given site with a given topic...
def get_new_articles(site, topic):
  nc = Newscatcher(website=site, topic=topic)
  results = nc.get_news()
  # Return the articles, or an empty list if none were found,
  # so callers can safely use len() and += on the result
  if results and 'articles' in results:
    return results['articles']
  return []

# Function to display articles from a set of results...
def display(articles):
  cnt = 1
  for i in articles:
    id = i['id']
    title = i['title']
    print(f'{cnt:2d}. {title:70.70s}  {id}')
    cnt += 1

In [None]:
results = get_new_articles('nytimes.com', 'food')
display(results)

In [None]:
# Let's define some selection criteria...

# sites of interest
sites = [
  'nytimes.com',
  'washingtonpost.com',
  'theguardian.com',
  'si.com',
]

# topics of interest
topics = [
  'politics',
  'tech',
  'business',
  'sport',
]

print('sites:', sites)
print('topics:', topics)

In [None]:
all_articles = []
for i in sites:
  tmp = describe_url(i)
  topic_list = tmp['topics']
  for j in topics:
    print(f'site: {i:20.20s}  topic: {j:20.20s}', end='')
    if j not in topic_list:
      print('topic not available')
      continue
    articles = get_new_articles(i, j)
    print(len(articles))
    all_articles += articles

#display(all_articles)

In [None]:
def get_news(sites, topics):
  all_articles = []
  for i in sites:
    topic_list = describe_url(i)['topics']
    for j in topics:
      if j not in topic_list:
        continue
      articles = get_new_articles(i, j)
      all_articles += articles
  return all_articles


In [None]:
results = get_news(['nytimes.com', 'washingtonpost.com'], topics)
display(results)

# Lesson 8 Homework


* Make a copy of this notebook (if you haven't already done so) and complete the challenges above. You can make a copy of this notebook by selecting File->Save a copy in Drive from the menu bar above.
* Review your copy of this notebook.
  * Complete the questions below.
  * If something is unclear, experiment and see if you can understand it better.
* For those who want to go deeper...
  * Read [Chapter 9 - Reading and Writing Files](https://automatetheboringstuff.com/2e/chapter9/) in our textbook to learn more about reading and writing files.


## Question 1

In the Mu IDE (or your chosen IDE if you use another one), write a function named `enumerate()` that takes a list of strings and enumerates them, i.e. it returns a list where each passed string is prefixed by a sequential number, starting at 1. For example...
```
li = ['test', 'another test', 'last test']
results = enumerate(li)
for i in results:
  print(i)
```

should produce this output:
```
1. test
2. another test
3. last test
```


## Question 2



Download this file: `input.txt`. Using Mu, write a function called `read_file()` that takes one string argument, opens the named file, and uses a `for` loop to read each line of the file into a list of strings. Get rid of newlines at the end of each line using the `strip()` string method. Return the list of strings to the caller.

For example:
```
results = read_file('input.txt')
for i in results:
  print(i)
```
should produce this output:
```
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty,
and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated,
can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field,
as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting
and proper that we should do this.

But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men,
living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world
will little note, nor long remember what we say here, but it can never forget what they did here. It is for us
the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly
advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored
dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here
highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth
of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.
```

## Question 3
Using Mu, write a function called `write_file()` that takes two arguments: a filename and a list of strings, opens the named file for write access and uses a `for` loop to write the list contents into the file, one string per line.

For example:
```
li = ['test', 'another test', 'last test']
write_file('output.txt', li)
```
Using your system's file explorer or command line, verify the file was created and has the expected contents. If you're not sure how to do that, you could also use your new `read_file()` function!

## Question 4
Now tie everything together by writing a program that reads the contents of `input.txt` (using `read_file()`), enumerates the lines found therein (using `enumerate()`), and writes the enumerated lines to `output.txt` (using `write_file()`).

## Question 5
Copy/paste the code from our newsfinder program to Mu and see if you can get it working locally, on your own computer. Play around with the sites and topics lists to customize the results to your own needs.

**NOTE**: You will need to install two packages in Mu. Do this by clicking on the gear icon in the lower right corner, as shown here:

<img src="https://mco.dev/img/mu1.png">

Enter the two required packages `feedparser==6.0.0` (two equal signs!) and `newscatcher`  into this dialog and click OK.

<img src="https://mco.dev/img/mu2.png">

# Next Week - Web Servers

We're going to use files to keep track of which articles we've already seen and we're going to build a web user interface on top of our newsfinder program.
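If you'd like a head start, here's one possible way to remember which articles we've seen, using only what we covered in this lesson. It's a sketch, not necessarily the approach we'll take next week. The function names, the file name `seen_ids_demo.txt`, and the sample articles are all made up, and it assumes each article is a dictionary with a string `'id'`, like the ones we printed earlier.

```python
import os
import tempfile

# Hypothetical file name, stored in the system's temporary directory for this demo
seen_path = os.path.join(tempfile.gettempdir(), 'seen_ids_demo.txt')

def load_seen(path):
    """Return the set of article ids saved so far (empty if there's no file yet)."""
    if not os.path.exists(path):
        return set()
    with open(path, 'r') as f:
        return set(line.strip() for line in f)

def filter_new(articles, path):
    """Return only the unseen articles and append their ids to the file."""
    seen = load_seen(path)
    new_articles = [a for a in articles if a['id'] not in seen]
    with open(path, 'a') as f:
        for a in new_articles:
            f.write(a['id'] + '\n')
    return new_articles

# Start from a clean slate so the demo gives the same result every time
if os.path.exists(seen_path):
    os.remove(seen_path)

# Made-up articles shaped like the ones used above
batch1 = [{'id': 'a1', 'title': 'First'}, {'id': 'a2', 'title': 'Second'}]
batch2 = batch1 + [{'id': 'a3', 'title': 'Third'}]

first_run = filter_new(batch1, seen_path)     # everything is new
second_run = filter_new(batch2, seen_path)    # only 'a3' is new

print([a['id'] for a in first_run])
print([a['id'] for a in second_run])
```

Because the ids live in a file rather than in a variable, they survive between runs of the program, which is exactly the "amnesia" problem this lesson started with.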