# DSCI 511: Data acquisition and pre-processing<br>Chapter 7: Building and Maintaining a Robust Acquisition Stream

## 7.0 Challenges with live or recurrent data acquisition
We've already discussed some basic methods for acquiring a dataset and for working with that data,
but what if we are working on a project where we'd like to collect data continually? Creating a data
acquisition stream can be difficult, as there are many more moving pieces at work (so a lot more can
go wrong). For example, here are some of the challenges we might face when creating an acquisition stream:
* Controlling your rate of access to follow rate limits
* Avoiding obtaining redundant data
* Intelligent storage of large and/or varied data
* Running your application recurrently with a desired frequency
* Handling errors or missing data in a way that won't collapse the stream
* Handling potential system failure out of your control (equipment failure)
* Dealing with constantly changing APIs and terms of service (very important!)

## 7.1 Rate limiting
A _rate limit_ is the maximum rate at which the owner of a data source allows other internet users to request that data. Some sources have no rate limit (probably a bad idea, since it makes DoS attacks easy), but most sites you'll encounter have one. Each site sets its own limits, and they often vary wildly from platform to platform (or even on a single platform over time).

### 7.1.1 Rate limiting with APIs
If you're using an API to collect data, there will almost certainly be a rate limit. The more commoditized/commercialized an API is, the more watchful its platform tends to be, and the more quickly it will shut your app down. While your app might depend on a particular rate limit, it's not uncommon for limits to change, too. For example, the Facebook Graph API used to have a limit of about 3600 calls per hour for any app in development mode, but reduced this to 200 in 2018 (with v.3). Generally, it is the duty of the collector to be diligent about rate limits, both as a kindness to hosts and to make streams robust. This means following the details on rate limiting in the documentation for the data you are working with. For example, current pointers for Twitter and Facebook are here:
- Facebook: https://developers.facebook.com/docs/graph-api/advanced/rate-limiting/
- Twitter: https://developer.twitter.com/en/docs/basics/rate-limiting.html

#### 7.1.1.2 Exercise: Understanding API rate limits
Read each of the above API docs and describe how much API usage is allowed per day on each platform for a given app. Do all apps get the same bandwidth? What methods/metrics do the platforms use to determine limits and overuse? How should an app be constructed to maximize data access?

_Response._

#### 7.1.1.3 Monitoring API Rate Limits
When it comes to APIs, the actual process of obtaining the data is usually orders of magnitude easier than writing scripts to scrape it, but this often comes at the cost of stricter and more complicated rate limiting. Also, APIs and their rate limits can change at any moment! So, it's important to check the documentation of the API you're using frequently, and to become very familiar with it. This is (usually) the place to find the rate limiting information for the API.

Oftentimes, there are several components to API rate limiting. For example, on the Facebook Graph API there are three separate categories that you need to look out for (on an hourly basis): total number of calls, total time used, and the total CPU usage. If your script goes over any of these three categories in an hour, then you'll be locked out of further use of the API until the next hour. Unfortunately, this means that if we want to remain within the rate limits, we need to do a lot more work than just setting some sleep call. We might be tempted to just run a few calls, note how much of each category these calls "use up", then divide an hour by these figures to get the number of calls we might be allowed per hour. But this isn't a good idea, because every call is different. Some may take up a lot of CPU power, while others take up very little.

So, how do we deal with these rate limits? Luckily, most APIs have a method that will, when queried, return to the user various statistics regarding current API usage. A good solution to our problem is to have our script request these statistics after each call, and make sure to only continue making calls while under the rate limits. For a transparent example we'll use the Facebook Graph API. 

Note: the Facebook example code will not run unless you authenticate with application and access tokens and pass application review, so its value here is mostly conceptual. Information on these processes can be found in the (very dense) Graph API documentation.

#### 7.1.1.4 Example: Facebook Graph API
First, we should pick upper limits for the various rate limit categories so that the script will stop running once they are hit. Upon examining the documentation, we see that the rate limiting data for the three categories are returned as percentages, so once we hit 100% on any of the three categories, the API will be shut off for the rest of the hour. So, to be extra careful, let's tell our script to shut off once any of these categories hits 95%. Next, we need to figure out how to get the rate limit data itself! It turns out that this data is returned with all API calls as an _HTTP header_. Basically, this is an extra bit of information (response metadata) that comes along with the response to a URL request. To get the response headers from a urllib response, simply invoke the method: `response.info()`.

Upon inspection of the Facebook documentation: 
- https://developers.facebook.com/docs/graph-api/advanced/rate-limiting/

the limiting information is returned as a dictionary object of the following form:
```
{
  "call_count"    : x, 
  "total_time"    : y, 
  "total_cputime" : z
}
```

So we just need to get this `dict` and read `x`, `y`, and `z`. Per the same documentation, it arrives as a JSON string in the `X-App-Usage` response header, so our old tools for accessing HTML, together with the `json` module, will allow us to read this data:

In [None]:
import json
import urllib.request

## supposing we wanted/were authorized to access the Drexel page's feed
## the following URL would return the most recent post
url = "https://graph.facebook.com/drexeluniv/feed?&fields=attachments,created_time,message&limit=1"

## We already decided on capping the rate limit at 95%
RATE_MAX = 95

## Initialize some variables to store the category data
total_time, total_cpu, calls = 0, 0, 0

## Now let's use our trusty friend the while loop to keep running 
## and collect the latest post so long as we don't hit the API max:
while total_time < RATE_MAX and total_cpu < RATE_MAX and calls < RATE_MAX:

    ## build the request
    request = urllib.request.Request(url = url)

    ## Open the URL
    response = urllib.request.urlopen(request)
    
    ## Now we can grab the rate limit dict using the .info() method;
    ## the header holds a JSON string, so parse it before reading the values
    usage = json.loads(response.info().get('X-App-Usage', '{}'))
    if not usage:  ## no usage data came back, so stop rather than loop blindly
        break
    total_time = usage.get('total_time', 0)
    total_cpu = usage.get('total_cputime', 0)
    calls = usage.get('call_count', 0)

This loop will run until one of the three categories hits 95% usage, and then stop. So now, if we can run this script every hour, we'll be sure to never be rate-limited, and all will be well!
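
Since the Facebook example cannot run without credentials, the stopping condition can be tried offline instead. The following is a minimal sketch, assuming the usage statistics arrive as a JSON string of the form shown earlier. The function name `under_cap` and the sample values are made up for illustration and are not part of the Graph API.

```python
import json

RATE_MAX = 95  # same cap as above, in percent

def under_cap(usage_json, cap=RATE_MAX):
    """Return True while every usage category is below the cap."""
    usage = json.loads(usage_json)
    categories = ("call_count", "total_time", "total_cputime")
    # a category missing from the data is treated as 0% used (an assumption)
    return all(usage.get(category, 0) < cap for category in categories)

# hypothetical header values, not real API output
print(under_cap('{"call_count": 28, "total_time": 25, "total_cputime": 25}'))  # True
print(under_cap('{"call_count": 96, "total_time": 25, "total_cputime": 25}'))  # False
```

Keeping the check in its own function makes it easy to test the cap logic without spending any real API calls.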

#### 7.1.1.5 Example: HTTP headers from Twitter using `Twython`
Using `urllib`, it's straightforward to view the headers in a response to a urllib request, using the syntax `response.info()`. However, URL construction and authorization are a bit more complicated with Twitter:
- https://developer.twitter.com/en/docs/basics/authentication/guides/authorizing-a-request.html

so for a working example we'll take the headers as being passed down from a Python API client we're familiar with: `Twython`. According to the rate limiting docs:
- https://developer.twitter.com/en/docs/basics/rate-limiting.html

the headers we're interested in are:
- `x-rate-limit-limit`: the rate limit ceiling for that given endpoint
- `x-rate-limit-remaining`: the number of requests left for the 15 minute window
- `x-rate-limit-reset`: the remaining window before the rate limit resets, in UTC epoch seconds

To get the headers back from `Twython` we can consult the docs:
- https://twython.readthedocs.io/en/latest/usage/advanced_usage.html#access-headers-of-previous-call

These can be accessed from the previous Twython call using the `.get_lastfunction_header(header)` method:

In [1]:
from twython import Twython

## place authorization strings here to run code
consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""

## initialize the module with both the app and the user credentials
twitter = Twython(consumer_key, consumer_secret, access_token, access_token_secret)

## The notable tweet IDs from Chapter 3
IDs = ["1121915133", "64780730286358528", "64877790624886784", "20", "467192528878329856", 
       "474971393852182528", "475071400466972672", "475121451511844864", "440322224407314432",
       "266031293945503744", "3109544383", "1895942068", "839088619", "8062317551", "232348380431544320",
       "286910551899127808", "286948264236945408", "27418932143", "786571964", 
       "467896522714017792", "290892494152028160", "470571408896962560"]

headers = ['x-rate-limit-limit', 'x-rate-limit-remaining', 'x-rate-limit-reset']

for ID in IDs:
    status = twitter.show_status(id = ID)
    print(status["text"])
    for header in headers:
        print(header, twitter.get_lastfunction_header(header))
    print()

http://twitpic.com/135xa - There's a plane in the Hudson. I'm on the ferry going to pick up the people. Crazy.
x-rate-limit-limit 900
x-rate-limit-remaining 899
x-rate-limit-reset 1537895497

Helicopter hovering above Abbottabad at 1AM (is a rare event).
x-rate-limit-limit 900
x-rate-limit-remaining 898
x-rate-limit-reset 1537895497

So I'm told by a reputable person they have killed Osama Bin Laden. Hot damn.
x-rate-limit-limit 900
x-rate-limit-remaining 897
x-rate-limit-reset 1537895497

just setting up my twttr
x-rate-limit-limit 900
x-rate-limit-remaining 896
x-rate-limit-reset 1537895497

India has won! भारत की विजय। अच्छे दिन आने वाले हैं।
x-rate-limit-limit 900
x-rate-limit-remaining 895
x-rate-limit-reset 1537895497

We can neither confirm nor deny that this is our first tweet.
x-rate-limit-limit 900
x-rate-limit-remaining 894
x-rate-limit-reset 1537895497

Thank you for the @Twitter welcome! We look forward to sharing great #unclassified content with you.
x-rate-limit-limit 90
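
The headers printed above are enough to decide how long to pause before the next call. Here is a minimal sketch, assuming the header values arrive as strings and that `x-rate-limit-reset` is an epoch timestamp in seconds, as described in the list above. The helper name `seconds_to_wait` is hypothetical and is not part of `Twython`.

```python
import time

def seconds_to_wait(remaining, reset_epoch, now=None):
    """Seconds to sleep before the next call, given the rate limit headers."""
    if now is None:
        now = time.time()
    # calls are still available in this window, so there is no need to wait
    if int(remaining) > 0:
        return 0.0
    # otherwise wait until the window resets (never a negative amount)
    return max(0.0, float(reset_epoch) - now)

# `now` is passed explicitly so the example gives the same answer every time
print(seconds_to_wait("894", "1537895497", now=1537895000))  # 0.0
print(seconds_to_wait("0", "1537895497", now=1537895000))    # 497.0
```

In a collection loop, the two arguments would come from `twitter.get_lastfunction_header('x-rate-limit-remaining')` and `twitter.get_lastfunction_header('x-rate-limit-reset')`, and the result would be handed to `time.sleep()`.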

### 7.1.2 Rate limited web content: Robots.txt
Twitter and Facebook both have extensive documentation on using their APIs and on the rate limits that apply to each.
But what if you've scoured your data source's website and haven't been able to find any information
regarding their rate limiting? Well, first make sure you've checked the terms of service. There you can often find
what you are legally permitted to do in using and harvesting the data. Apart from this, something almost every website has is a
text file called `robots.txt`. This file is used to tell web crawling programs (usually referred to as __spiders__) how
they should behave, and can even ban certain programs from accessing the data at all! Programs which are known to be
abusive and which refuse to follow rate-limiting guidelines can end up being banned. How are these files set up? Well, they're actually written in a way that's pretty close to plain English. Let's look at Twitter's robots.txt:

In [2]:
import urllib.request

url = "https://twitter.com/robots.txt"

# Make the request
req = urllib.request.Request(url = url)

# Open the URL
handler = urllib.request.urlopen(req)

# Read/view the data as a string
robots = handler.read().decode('utf-8')
print(robots)

#Google Search Engine Robot
User-agent: Googlebot
Allow: /?_escaped_fragment_

Allow: /*?lang=
Allow: /hashtag/*?src=
Allow: /search?q=%23
Disallow: /search/realtime
Disallow: /search/users
Disallow: /search/*/grid

Disallow: /*?
Disallow: /*/followers
Disallow: /*/following

Disallow: /account/not_my_account

#Yahoo! Search Engine Robot
User-Agent: Slurp
Allow: /?_escaped_fragment_

Allow: /*?lang=
Allow: /hashtag/*?src=
Allow: /search?q=%23
Disallow: /search/realtime
Disallow: /search/users
Disallow: /search/*/grid

Disallow: /*?
Disallow: /*/followers
Disallow: /*/following

Disallow: /account/not_my_account

#Yandex Search Engine Robot
User-agent: Yandex
Allow: /?_escaped_fragment_

Allow: /*?lang=
Allow: /hashtag/*?src=
Allow: /search?q=%23
Disallow: /search/realtime
Disallow: /search/users
Disallow: /search/*/grid

Disallow: /*?
Disallow: /*/followers
Disallow: /*/following

Disallow: /account/not_my_account

#Microsoft Search Engine Robot
User-Agent: msnbot
Allow: /?_escaped_fra

#### 7.1.2.1 What does this all mean?
Near the bottom we see a line that says `Crawl-delay: 1`. This is the rate limit! This is telling spiders to wait 1 second in between calls, so if we decided to scrape Twitter, we'd have to make sure to do it no more than 3600 times per hour. But what is all the other stuff? 

When you access web content, you always send some information to the server about your own identity (at least, roughly where you are and what browser you're using). This information is referred to as your __`User-agent`__. The file above forbids certain User-agents from crawling specific portions of Twitter with the `Disallow` tag. Notice as well that there's a somewhat mysterious comment above the wild card (`*`) `User-agent`:
```
# Every bot that might possibly read and respect this file.
User-agent: *
Allow: /*?lang=
Allow: /hashtag/*?src=
Allow: /search?q=%23
Disallow: /search/realtime
Disallow: /search/users
Disallow: /search/*/grid

Disallow: /*?
Disallow: /*/followers
Disallow: /*/following
```
Technically, this pertains to us or any bot we create!
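
The `Crawl-delay` value can also be read programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`, used here only to look up the delay. The `sample_robots` text is a shortened, made-up file in the style of the one above, and the helper name `crawl_delay` is hypothetical.

```python
import urllib.robotparser

# a shortened, made-up robots.txt in the style of the file above
sample_robots = """\
User-agent: *
Allow: /search?q=%23
Disallow: /search/realtime
Crawl-delay: 1
"""

def crawl_delay(robots_text, user_agent="*"):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.crawl_delay(user_agent)

delay = crawl_delay(sample_robots)
print(delay)          # 1
print(3600 // delay)  # 3600 requests per hour at most
```

A file with no `Crawl-delay` line returns `None`, so a script should fall back to a conservative default of its own in that case.
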
#### 7.1.2.3 Exercise: robots.txt
Take a look at the robots file for each of `facebook.com` and `amazon.com`. Determine and discuss any allowances/disallowances for bots that you might create to crawl these sites. Do you infer any cultural differences around data sharing and access between these companies and also with Twitter?

_Response._

#### 7.1.2.4 Implementing the rules in robots.txt
So, how can we follow these rules that are handed to us by websites? Well, as burgeoning data mungers we might be tempted to attack `robots.txt` as a regex challenge, but as with most things in Python, there's a handy module for this. There's one inside of urllib, `robotparser`, but instead we'll be using an improved module built around the rules that big tech companies like Google use. There's a lot of history and disagreement, with no truly universal standard on how to parse/interpret robots.txt, but the makers of the improved module, `robotexclusionrulesparser`, provide a nice discussion of the ecosystem:
- http://nikitathespider.com/python/rerp/

Let's use `robotexclusionrulesparser.RobotFileParserLookalike` to confirm that we're allowed to search for the top (`/search?q=%23`) and not the most recent (`/search/realtime`) tweets matching a search term; we'll use `'data science'`:

In [3]:
## don't use this one if you want to actually scrape
## it will basically just tell you that you can't scrape things
# import urllib.robotparser as robotparser

## use this updated module to access the big tech parse for scraping rules
import robotexclusionrulesparser

## spin up the module
rp = robotexclusionrulesparser.RobotFileParserLookalike()

## parse the robots file
rp.parse(robots)

## we're allowed to scrape the top, relatively current matches
print(rp.can_fetch("*", "https://twitter.com/search?q=%23/data science"))
## we're not entitled to scrape the most current matches
print(rp.can_fetch("*", "https://www.twitter.com/search/realtime/data science"))

True
False


#### 7.1.2.5 Specifying a `User-agent`
Since we'll be accessing web content using Python and not a browser, we have the option of setting our `User-agent`. This can allow us to get more data than if we were to just leave it unspecified. If you're trying to access data online and you're getting errors, a handy trick is to specify a `User-agent` that would ordinarily be presented by a web browser, like `'Mozilla/5.0'`:

In [4]:
# Let's say we're having trouble opening up our trusty example page, example.com
import urllib.request
from bs4 import BeautifulSoup
url = 'http://www.example.com/'

# Create the user-agent header
# NOTE: This is about as basic as they come. 'Mozilla/5.0' is the token that most web browsers' User-Agent
# strings begin with, which makes you look more like a human than just using the command line to make requests.
header = {'User-Agent': 'Mozilla/5.0'}

# Make the request
req = urllib.request.Request(url = url, headers = header)

# Open the URL
handler = urllib.request.urlopen(req)

soup = BeautifulSoup(handler.read(), 'html.parser')
print(soup)

<!DOCTYPE doctype html>

<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is established to be used

There are all kinds of possibilities for a User-Agent, but the vast majority of the time, if changing the User-Agent is going to help you get better results, just using this basic configuration supplied will do the trick.

### 7.1.3 Sleep timers
After making sure that we are allowed to obtain data from a website and determining what the time limitations are for acquiring said data, and possibly having set up the appropriate `User-Agent`, we might actually be getting data! So how do we make sure to follow the time limitations? By telling Python to sleep!

The easiest way to make sure we don't overload a website is by telling Python to wait a certain amount of time after making each request. For example, it's common for `robots.txt` to tell spiders to wait 30 seconds in between web scrapes. So here we'd just make sure to tell Python to wait 30 seconds each time after we use `urllib.request`. This functionality occurs in the `time` module. So, let's say we want to (for some reason) scrape and print out the HTML of `http://example.com/` every 30 seconds, forever:

In [5]:
import time
from bs4 import BeautifulSoup
import urllib.request

counter = 0
while True: # The easiest way to set up an infinite loop in Python
    # This is exactly the same code as before
    counter += 1 # Don't want this to run forever!
    html_text = urllib.request.urlopen("http://www.example.com/").read()
    soup = BeautifulSoup(html_text, 'html.parser')
    print("finished request number", counter)
    time.sleep(30) # This command makes Python do nothing for 30 seconds
    if counter > 3:
        break

finished request number 1
finished request number 2
finished request number 3
finished request number 4


You don't have to put a fixed number inside `time.sleep()`. If we wanted to, we could create a script that has a variable wait time instead, perhaps set dynamically in response to a platform's rate limit information. As a simple example, after each successive call let's make the script wait an additional second:

In [8]:
wait_time = 0
counter = 0

while True:
    counter += 1
    wait_time += 1  # Increment the wait time
    html_text = urllib.request.urlopen("http://www.example.com/").read()
    soup = BeautifulSoup(html_text, 'html.parser')
    print("finished request number", counter, "after waiting for ", wait_time, "seconds")
    time.sleep(wait_time)  
    if counter > 3:
        break

finished request number 1 after waiting for  1 seconds
finished request number 2 after waiting for  2 seconds
finished request number 3 after waiting for  3 seconds
finished request number 4 after waiting for  4 seconds


When it comes to web scraping, it helps to appear less robotic and more human in our scraping. In this case, we could make the amount of time that we wait in between scrapes random:

In [7]:
import random
counter = 0

while True:
    counter += 1
    wait_time = 30 + random.randrange(0, 30)  # Here the wait time is a random whole number from 30 to 59
    html_text = urllib.request.urlopen("http://www.example.com/").read()
    soup = BeautifulSoup(html_text, 'html.parser')
    print("finished request number", counter, "after waiting for", wait_time, "seconds")
    time.sleep(wait_time)  
    if counter > 3:
        break

finished request number 1 after waiting for 45 seconds
finished request number 2 after waiting for 47 seconds
finished request number 3 after waiting for 30 seconds
finished request number 4 after waiting for 57 seconds
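
The loops above always sleep for the full wait time, on top of however long the request itself took. A small variation is to sleep only for whatever remains of the interval. Below is a minimal sketch of that idea; the class name `Throttle` and the short 0.2 second interval are assumptions chosen so the example finishes quickly, and no request is actually made.

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous call, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            # sleep only for the part of the interval that has not yet passed
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()

throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a request would go here
elapsed = time.monotonic() - start
print("three passes took", round(elapsed, 2), "seconds")
```

The first pass does not wait at all, so three passes take roughly two intervals. With a real scraper, `min_interval` would be set from the site's `Crawl-delay` or the API's documented limits.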


## 7.2 Recurrent script execution

### 7.2.1 The cron utility

Alright, so let's assume that we're working with an API that has hourly rate limits (as most do), and we're finished writing our script, which will acquire data from the API while respecting the rate limits. How do we get our computer to automatically run this script hourly? If you're running a UNIX-like operating system (Mac OS X, any flavor of GNU/Linux, or BSD), then this functionality is built into the `cron` utility.

`cron` is a time-based job scheduling utility for UNIX-like systems used to schedule jobs to run periodically at fixed intervals of time. Basically, `cron` uses a cron table file (called `crontab`) to execute commands at the specified times. To open up the `crontab` file for editing, just type the following into your terminal:

`$ crontab -e`

#### 7.2.1.1 Editing crontab

This will bring up a text file in your default text editor. So, first we need to see where the script that we'd like to run hourly is located on the machine. You need to specify the whole path, so if you aren't sure, navigate to the file(s) in your terminal, then use the UNIX command `pwd` (print working directory), which will display the full path to the directory you are currently in. So, as an example, let's say we've done this, and we'd like to run a file named `script.py` hourly with the following path: `/Projects/scraper/script.py`.

Once we're editing the `crontab`, we can add a new job by adding the correct line of text. This line will fit the following schema:

```
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of week (0 - 6) (Sunday to Saturday;
# │ │ │ │ │                                       7 is also Sunday on some systems)
# │ │ │ │ │
# │ │ │ │ │
# * * * * *  command to execute
```

So, by specifying numbers in place of the asterisks, one can have jobs run with very specific timings. Leaving an asterisk in a spot places no restriction on that field; for example, an asterisk in the day of the month column will have the command execute every single day of the month.

This may seem a bit complicated, and it is. Luckily, most implementations of `cron` allow for shortcuts for widely-used timings. Instead of what's presented above, one can just use:
+ `@hourly command to execute` - run the command once an hour
+ `@daily command to execute` - run the command once daily
+ `@weekly command to execute` - run the command weekly
+ etc.

So, since we need to call python to execute our script, our command will look like this:

`python3 /Projects/scraper/script.py`

So, to perform our scraping once hourly, we just add the following line to `crontab`:

`@hourly python3 /Projects/scraper/script.py`

Make sure to save the text file after adding this, and we're done! Our script will run at the beginning of each hour.
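
To make the five-field schema concrete, here is a minimal sketch that checks a cron expression against a given time. It is deliberately simplified: each field must be either `*` or a single number, whereas real `cron` also accepts ranges, lists and step values and has its own rules for combining the two day fields. The function name `cron_matches` is made up for illustration.

```python
from datetime import datetime

def cron_matches(expression, when):
    """Check a five-field cron expression against a datetime.

    Simplified: each field must be '*' or a single integer.
    """
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected five fields")
    # cron counts Sunday as 0, while datetime.weekday() counts Monday as 0
    cron_weekday = (when.weekday() + 1) % 7
    actual = [when.minute, when.hour, when.day, when.month, cron_weekday]
    return all(field == "*" or int(field) == value
               for field, value in zip(fields, actual))

print(cron_matches("0 * * * *", datetime(2018, 9, 25, 13, 0)))   # True: the top of any hour
print(cron_matches("0 * * * *", datetime(2018, 9, 25, 13, 5)))   # False: five minutes past
print(cron_matches("30 2 * * 0", datetime(2018, 9, 23, 2, 30)))  # True: 2:30 on a Sunday
```

The first pattern, `0 * * * *`, is the explicit form of `@hourly`: minute zero of every hour, every day.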

#### 7.2.1.2 Exercise: understanding a crontab for a recurrent, whole-site data access application
Gutenberg is an open data repository, so we should be able to download all of its data!  To start, let's review the robots file on Project Gutenberg's website:
- http://www.gutenberg.org/robots.txt

What do you notice about this file? Is anyone allowed to crawl the site? Do you think Gutenberg uses the newer, big tech rules? How frequently can we make requests?

Use the `robotexclusionrulesparser` module from Section 7.1.2.4 to determine if we can access a given data file. Use the URL for the text copy of Moby Dick:
- https://www.gutenberg.org/files/2701/2701-0.txt

Following the above, review the instructions on mirroring the repository:
- https://www.gutenberg.org/wiki/Gutenberg:Mirroring_How-To

and explain why Gutenberg requests using the `rsync` command-line utility to copy its data. Can you decode the two presented crontab patterns?

_Response._

In [None]:
## place code here

## 7.3 Monitoring processes

So, we've set up a stream that follows all the rate limits and automatically executes on an hourly basis, downloading all kinds of useful data. What happens if something goes wrong? As we have it set up, if there's some kind of bug or error which stops the execution of the script, it will go unnoticed until someone decides to manually check up on it. This could mean you haven't been collecting data for days, weeks, or even months! It would be nice, and give us peace of mind, to create an additional mechanism solely dedicated to checking up on our stream and, in case it goes down for some reason, restarting it. This is made easy with the `os` Python module and a utility on UNIX-like systems called `ps`.

### 7.3.1 Checking processes with `ps`
If you've run Microsoft Windows for any amount of time, you've probably had an application freeze up on you. Usually, a simple solution to this problem is to go into the process manager (by pressing `ctrl + alt + del`), end the process, and then start it back up. This is exactly what we'd like to do with our stream script, but we need to find a way to automate it. Fortunately, there are multiple process managers in UNIX-like systems akin to the Windows process manager. For this example, we'll just use one of the most basic ones available: `ps` (many people use `htop` or `top`, but `ps` has simpler output). The output of `ps` will yield a line for each process currently running.

#### 7.3.1.1 A restarter script using `ps`
Using the output of `ps`, we can regularly check if our streaming process is running and, if the script finds that it isn't, restart it. This is a bit tricky, since it involves regular expressions: we need to parse the output of the `ps` command and search it for our specific streaming process.

Note: this example is intended to generalize process monitoring to environments where all that is available is text output on active/inactive processes, like on a cluster's queue. For Pythonic use cases, we'll explore a version of this code that uses a Python module (`psutil`), reusing some pieces that we first build up incrementally now. The most important thing is to develop ideas for applications in your own work!

Before we get started, let's make a dummy script that just loops, sleeping for a second at a time inside of a while loop and exiting at random with 1/10 probability on each pass.

In [9]:
%%writefile dummy.py
import random, time

while random.random() <= 0.90:
    time.sleep(1)

Overwriting dummy.py


To run this script we can use the `nohup` command, which keeps the process running even if the terminal that launched it is closed. The `&` at the end specifies that the command should run in the background, separately from our Python notebook. This means our script will run and our notebook won't have to wait for it to finish.

In [14]:
import os

status = os.system("nohup python3 ./dummy.py &")
print(status)

0


Now, to get things started we import the other necessary modules, and obtain the text output of the `ps` command:

In [15]:
import re
import datetime

## the os.popen command creates an instance of the process we're opening 
## that's ready to read into text (this is piping)
processes = os.popen("ps -A").read().split("\n")
processes[:15]

['  PID TTY           TIME CMD',
 '    1 ??        47:39.21 /sbin/launchd',
 '   43 ??         2:22.02 /usr/sbin/syslogd',
 '   44 ??         1:59.36 /usr/libexec/UserEventAgent (System)',
 '   47 ??         0:19.65 /System/Library/PrivateFrameworks/Uninstall.framework/Resources/uninstalld',
 '   48 ??         0:27.99 /usr/libexec/kextd',
 '   49 ??        14:57.82 /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/FSEvents.framework/Versions/A/Support/fseventsd',
 '   52 ??         0:43.89 /opt/cisco/anyconnect/bin/vpnagentd -execv_instance',
 '   53 ??         0:06.43 /System/Library/PrivateFrameworks/MediaRemote.framework/Support/mediaremoted',
 '   55 ??         0:10.37 /System/Library/CoreServices/appleeventsd --server',
 '   56 ??         1:57.65 /usr/sbin/systemstats --daemon',
 '   58 ??         7:04.27 /usr/libexec/configd',
 '   59 ??         6:13.62 /System/Library/CoreServices/powerd.bundle/powerd',
 '   60 ??         0:00.07 /Library/Application Suppor

How do we process this? Since we're going so low level, we'll have to use regex. Inspecting the above, the rows appear to have 4 columns, with the last being the process name. The columns are separated by one or more whitespace characters, which the pattern `'\s+'` matches.

In [16]:
def check_process(process_name):
    ## review the currently running processes
    processes = os.popen("ps -A").read().split("\n")
    
    ## Get the process names
    ## Splits up the row by whitespace, then looks at the last element (the name)
    process_names = [re.split(r'\s+', row.strip())[-1] for row in processes]

    ## collect the names that match the process we're looking for
    ## (re.escape makes the '.' in a name like "dummy.py" match literally)
    is_running = [name for name in process_names if re.search(re.escape(process_name), name)]

    ## Return the matches; an empty list means the process is not running
    return is_running

In [17]:
# name of the process we're looking for
name = "dummy.py"
check_process(name)

['./dummy.py']

Ok, now we have an idea of whether our stream is up or not. Next, we need to restart the stream if it isn't running (and keep a note in the log file), or do nothing if it is (and again keep a note in the log file):

In [18]:
## create a log file for our script-monitoring code
logfile = 'restarter.log'
open(logfile, 'w').close()

## perform the initial execution
current_time = datetime.datetime.strftime(datetime.datetime.now(), "%Y-%m-%d-%H-%M")
status = os.system("nohup python3 ./dummy.py &")
with open(logfile, 'a') as f:  # Open the logfile
    f.writelines("Started the process " + name +"\n") 

while 1:
    current_time = datetime.datetime.strftime(datetime.datetime.now(), "%Y-%m-%d-%H-%M")
    instances = check_process(name)

    if instances:  # The stream is working fine
        with open(logfile, 'a') as f:  # Open the logfile
            f.writelines(current_time + " : The process " + name + " is running \n")  # Write that all is well
            f.writelines("sleeping for another 10 seconds...\n\n")
    else:  # Stream is down
        with open(logfile, 'a') as f:  # Open the logfile
            f.writelines(current_time + " : The process " + name + " died! Restarting... \n") # Note error
        status = os.system("nohup python3 ./dummy.py &")  # Finally, restart the stream
        ## for the notebook---break if the code got restarted
        break
    ## sleep for another 2 seconds
    time.sleep(2)

In [19]:
!cat restarter.log

Started the process dummy.py
2018-09-25-13-03 : The process dummy.py is running 
sleeping for another 10 seconds...

2018-09-25-13-03 : The process dummy.py died! Restarting... 


### 7.3.2 Monitoring processes with `psutil`
Knowing how to apply regex to the output of command-line utilities like `ps`, `top`, `htop`, or anything text based on a cluster or server is important to have in your back pocket, but in a basic local environment there's of course a more Pythonic way. There's a Python module specifically geared towards working with your machine's currently running processes: `psutil`. While the regex approach with `ps` is one that we can generalize to other process-monitoring scenarios, we can use `psutil` to much more easily get a list of our processes on a single machine. Here's how we can get the process names using `psutil`:

In [20]:
import psutil

# use psutil to easily get the list of processes we want
processes = list(psutil.process_iter())

# we can use psutil to easily access all sorts of information
print(processes[0].name(), processes[0].status(), processes[0].pid)

kernel_task running 0


Note: psutil is a little tricky, and the `.name()` method throws an error if a process has `status='zombie'`. Apparently, using `ps` in our `os.popen()` command has left it a zombie! Let's gather the names of the zombie and non-zombie processes using a list comprehension with some control for status:

In [21]:
processes = list(psutil.process_iter())

## collect the process names as long as they're not zombies
process_names = [process.name() for process in processes if process.status() != "zombie"]
## collect the zombies
zombies = [process for process in processes if process.status() == "zombie"]

print("Here are the zombie processes: ")
print(zombies)
print()

print("Here are the living processes: ")
print(process_names[:10])

Here are the zombie processes: 
[psutil.Process(pid=10079, status='zombie'), psutil.Process(pid=32868, name='ps', started='13:03:08')]

Here are the living processes: 
['kernel_task', 'launchd', 'syslogd', 'UserEventAgent', 'uninstalld', 'kextd', 'fseventsd', 'vpnagentd', 'mediaremoted', 'appleeventsd']


#### 7.3.2.1 Killing a process
What do we do with a process running in the background that we need to stop? Well, it's not actually possible to kill zombies; those are unreaped child processes that will disappear once their parent processes finish (e.g., the server running this notebook). However, if we need to take down our streaming application for service or anything else, we might need to interrupt it. This can be done using the `kill <PID>` command-line utility, after obtaining the `PID` information from psutil as `process.pid`, or from reading/parsing the output of one of the command-line utilities. However, `psutil` makes it even easier with the `process.kill()` method.
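
For comparison, the `kill <PID>` route can also be taken from Python with only the standard library. This is a minimal sketch, assuming a UNIX-like system; the throwaway child process stands in for a real stream, and none of it uses `psutil`.

```python
import os
import signal
import subprocess
import sys

# start a throwaway child process that would otherwise sleep for a minute
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# the equivalent of `kill <PID>`: send SIGTERM to the process ID
os.kill(child.pid, signal.SIGTERM)

# wait() collects the exit status, so the child does not linger as a zombie
return_code = child.wait(timeout=10)
print(return_code)
```

On UNIX-like systems a negative return code means the child was ended by that signal number. Because the parent calls `wait()`, the finished child is reaped right away rather than left behind as a zombie.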

#### 7.3.2.2 Exercise: a script restarter using psutil that also kills zombies
Rewrite `check_process(name)` above by using psutil to 1) obtain process names more easily without regex, and use this 2) to restart our dummy process if it's finished after 3 or fewer passes in the while loop, and kill it if it's still running after 4 or more passes. 

In [None]:
## Enter code here