# Getting the Bugs Out
### Troubleshooting Code in Python

In this workshop we'll look at examples of Python code that isn't working as intended or that could be improved, and we'll practice strategies for fixing and improving it.

### Understanding exceptions in Python

#### Example 1: Analyzing Twitter data

Data from the Twitter API is in [JSON-L](https://jsonlines.org/) format, in which each line of a text file is a complete, valid JSON (JavaScript Object Notation) value.
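
As a minimal sketch of the format, the following uses only the standard library's `json` module and two made-up records (they are not taken from the workshop dataset):

```python
import json

# Two made-up JSON-L lines; each line is a complete JSON object on its own
jsonl_text = '{"id": 1, "text": "first"}\n{"id": 2, "text": "second"}'

# Parse each line independently into a Python dictionary
records = [json.loads(line) for line in jsonl_text.splitlines()]

print(records[1]["text"])  # second
```

Because every line stands alone, a file like this can be processed one record at a time.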

We have a file containing a couple thousand Tweets by U.S. Senators. They are all (or should be all) retweets -- meaning, instances where the Twitter user has retweeted someone else's Tweet.

We want to parse this file and examine which accounts are being retweeted by which Senators.

#### Loading and parsing JSON-L data

Our first step is to import a couple of libraries we'll need for this task.

In [None]:
import requests
import jsonlines

Now we'll load the Twitter dataset (originally retrieved from GW LAI's [TweetSets](https://tweetsets.library.gwu.edu/) platform, but hosted here on GitHub for convenience).

In [None]:
data = requests.get('https://raw.githubusercontent.com/gwu-libraries/gwlibraries-workshops/master/python-debugging/sample-retweets.jsonl')

Now let's parse this data with the `jsonlines` module, which lets us iterate over a sequence of JSON objects in a file, converting each to Python data structures.

1. Create a `Reader` object. 
2. Use a [list comprehension](https://realpython.com/list-comprehension-python/) to collect each Tweet in a list called `tweets`.

In [None]:
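# Reader expects a file or an iterable of lines, so pass the response text split into lines
reader = jsonlines.Reader(data.text.splitlines())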
tweets = [r for r in reader]

#### Finding accounts retweeted by Senators

Each Tweet that is a retweet has a `retweeted_status` element with information about the original Tweet. So if we look for the `user` property of that (retweeted) Tweet, we could associate it with the `user` property of the retweet, in order to find out which Senators are retweeting which other Twitter accounts. 

One way to do this would be to associate a list of Senatorial accounts with each retweeted account that we find.

The following code is a first pass at doing so.

In [None]:
# Find the accounts that Senators are retweeting, and record which Senators are retweeting which accounts
# To hold our retweeted user accounts
retweeted_accounts = [] 
# Iterate over all our Tweets
for tweet in tweets:
    # Access the information about the original Tweet (what's being retweeted)
    retweet = tweet['retweeted_status']
    # Account of the original Tweet
    account = retweet['user']
    # Save this and associate with the account of the current (re)Tweet
    retweeted_accounts[account].append(tweet['user']) 
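
The cell above can fail in at least two ways: indexing a list with anything other than an integer or slice raises a `TypeError`, and a Tweet that has no `retweeted_status` key raises a `KeyError`. One possible revision is sketched below. It is an illustration rather than the only fix: the sample Tweets are made up, and keying on `screen_name` is an assumption about which field identifies an account.

```python
# Made-up Tweets for illustration; the real dataset has many more fields
sample_tweets = [
    {"user": {"screen_name": "SenatorA"},
     "retweeted_status": {"user": {"screen_name": "NewsOutlet"}}},
    {"user": {"screen_name": "SenatorB"},
     "retweeted_status": {"user": {"screen_name": "NewsOutlet"}}},
    # Not a retweet: there is no retweeted_status key
    {"user": {"screen_name": "SenatorC"}},
]

# A dictionary (not a list) lets us look up entries by account name
retweeted_accounts = {}
skipped = 0
for tweet in sample_tweets:
    try:
        retweet = tweet["retweeted_status"]
    except KeyError:
        # This Tweet is not a retweet, so count it and move on
        skipped += 1
        continue
    # The user object is itself a dictionary, which cannot serve as a
    # dictionary key, so we key on the screen name (a string) instead
    account = retweet["user"]["screen_name"]
    # setdefault creates an empty list the first time we see an account
    retweeted_accounts.setdefault(account, []).append(tweet["user"]["screen_name"])

print(retweeted_accounts)  # {'NewsOutlet': ['SenatorA', 'SenatorB']}
print(skipped)  # 1
```

Catching only `KeyError` keeps other, unexpected errors visible instead of silently hiding them.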

### Best practices

#### Encapsulation

Beyond catching exceptions, other patterns help keep code free of bugs. A very common one is to use functions generously to encapsulate discrete behaviors in your code.

Functions allow us to avoid repeating ourselves, which in turn reduces the chance of syntax errors and the like. Functions also make code easier to test, debug, and reason about.

Let's say we want to load another dataset of Tweets, also in JSON-L. This dataset contains a mix of different kinds of Tweets (including retweets and quotes as well as original Tweets).

We could re-run the cell we used above or copy and paste the code into a new cell. But if this is something we might need to do regularly, it's useful to define a new function to contain this logic.

In [None]:
def load_tweets(url_of_tweets):
    data = requests.get(url_of_tweets)
    reader = jsonlines.Reader(data.text.splitlines())
    tweets = [r for r in reader]
    return tweets

Functions consist of the following:
- a `def` statement followed by a unique name (try to avoid names used by other Python functions), parentheses, and a colon
- optionally, a list of parameters inside the parentheses. Parameters with a default value (which take the form `parameter=default_value`) are optional; if not provided, they receive the default value. Parameters without a default value are required; if one is missing from the function call, Python raises an exception.
- an indented body containing the function's code
- usually, a `return` statement. Without one, the function returns `None`, so it is useful only for its side effects.
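
A small example may make these rules concrete. The function `describe_tweet` and its `max_length` parameter are names invented for this sketch; they are not part of the workshop code.

```python
def describe_tweet(text, max_length=20):
    """Return the text, shortened to max_length characters if necessary."""
    if len(text) <= max_length:
        return text
    return text[:max_length] + "..."

# max_length is optional, so it falls back to its default value of 20
short = describe_tweet("Hello, world")

# A keyword argument overrides the default
custom = describe_tweet("Hello, world", max_length=5)

# text has no default, so leaving it out raises a TypeError
try:
    describe_tweet()
except TypeError as error:
    message = str(error)

print(short)   # Hello, world
print(custom)  # Hello...
```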

We can use our function to load a new dataset.

In [None]:
tweets_2 = load_tweets('https://raw.githubusercontent.com/gwu-libraries/gwlibraries-workshops/master/python-debugging/tweets-shuffled.jsonl')

Functions can also make complicated logic easier to read by breaking it into smaller units. Let's say we want to extract from each Tweet the account screen name, the number of times it was retweeted, and the text of the Tweet, and we want to include retweeted and quoted Tweets, too.

#### Writing DRY code

In [None]:
tweet_metadata = []
for tweet in tweets_2:
    # If this Tweet is a retweet, record the original (retweeted) Tweet
    rt = tweet.get('retweeted_status')
    if rt:
        rt_screen_name = rt['user']['screen_name']
        rt_text = rt['full_text']
        rt_num_rt = rt['retweet_count']
        tweet_metadata.append({'screen_name': rt_screen_name,
                               'text': rt_text,
                               'num_retweets': rt_num_rt})
    else:
        # Otherwise, if it quotes another Tweet, record the quoted Tweet
        qt = tweet.get('quoted_status')
        if qt:
            qt_screen_name = qt['user']['screen_name']
            qt_text = qt['full_text']
            qt_num_rt = qt['retweet_count']
            tweet_metadata.append({'screen_name': qt_screen_name,
                                   'text': qt_text,
                                   'num_retweets': qt_num_rt})
    # In every case, record the Tweet itself
    screen_name = tweet['user']['screen_name']
    text = tweet['full_text']
    num_rt = tweet['retweet_count']
    tweet_metadata.append({'screen_name': screen_name,
                           'text': text,
                           'num_retweets': num_rt})
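
The same three fields are extracted three times in the loop above. One way to remove that repetition is sketched below. It assumes the intent is to record the embedded retweeted or quoted Tweet (when there is one) as well as the Tweet itself; `extract_metadata`, `collect_metadata`, and the sample data are hypothetical names and values introduced for illustration.

```python
def extract_metadata(tweet):
    """Return the screen name, text, and retweet count of one Tweet."""
    return {'screen_name': tweet['user']['screen_name'],
            'text': tweet['full_text'],
            'num_retweets': tweet['retweet_count']}

def collect_metadata(tweets):
    """Collect metadata for each Tweet and for any Tweet embedded in it."""
    metadata = []
    for tweet in tweets:
        # Prefer the retweeted Tweet; fall back to the quoted Tweet, if any
        embedded = tweet.get('retweeted_status') or tweet.get('quoted_status')
        if embedded:
            metadata.append(extract_metadata(embedded))
        metadata.append(extract_metadata(tweet))
    return metadata

# Made-up Tweets for illustration
sample = [
    {'user': {'screen_name': 'SenatorA'}, 'full_text': 'RT @NewsOutlet: Big story',
     'retweet_count': 5,
     'retweeted_status': {'user': {'screen_name': 'NewsOutlet'},
                          'full_text': 'Big story', 'retweet_count': 5}},
    {'user': {'screen_name': 'SenatorB'}, 'full_text': 'My own Tweet',
     'retweet_count': 2},
]
result = collect_metadata(sample)
print([m['screen_name'] for m in result])  # ['NewsOutlet', 'SenatorA', 'SenatorB']
```

With the extraction logic in one place, a change to the fields we collect only has to be made once.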

### Working with third-party libraries

#### Example 2: Residential household energy consumption in Arlington, VA by hour and day, 2014

Debugging code that uses third-party Python libraries presents its own challenges. The following example highlights some pitfalls with datetime values in the pandas library and illustrates how built-in library methods can simplify and optimize your code.

The data source is originally from the [Open Energy Data Initiative](https://data.openei.org/submissions/153), but I've modified it for this lesson. It has one row for each day/hour in the year for 2014, and columns corresponding to the energy usage for various types of utilities and appliances.

In [None]:
import pandas as pd

In [None]:
energy_df = pd.read_csv('https://raw.githubusercontent.com/gwu-libraries/gwlibraries-workshops/master/python-debugging/energy-consumption-arlington-2014.csv')

In [None]:
energy_df

#### Troubleshooting date/time problems 

pandas has a handy `to_datetime` function that converts strings to datetime values. That will be useful if, say, we want to plot this data as a time series. But running it on the `Date/Time` column raises an exception.

In [None]:
pd.to_datetime(energy_df['Date/Time'])
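
One general way to track down this kind of failure is to pass `errors='coerce'`, which makes `to_datetime` return `NaT` ("not a time") for values it cannot parse instead of raising an exception, so the offending rows can be inspected. The sketch below uses made-up strings and an explicit format, both assumptions for illustration; the actual contents of the `Date/Time` column may fail for a different reason.

```python
import pandas as pd

# Made-up values for illustration (not taken from the energy dataset)
sample = pd.Series(['2014-01-01 01:00:00',
                    '2014-01-01 24:00:00',   # hour 24 is not a valid hour
                    'not a date'])

# errors='coerce' turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(sample, format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Select the original strings that could not be parsed
bad_rows = sample[parsed.isna()]
print(bad_rows.tolist())
```

Looking at the rows that fail usually suggests what needs cleaning before the conversion can succeed.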

#### When `for` loops are superfluous

Let's say we wanted to plot the usage of these various utilities over the course of the year. Hourly data often proves rather noisy when plotted, which can make it hard to discern patterns. 

Even a plot of a single utility is very noisy, which makes it hard to compare different utilities or types of consumption.

In [None]:
energy_df.plot(x='Date/Time', y='Electricity:Facility [kW](Hourly)')

In these cases, the [rolling mean](https://en.wikipedia.org/wiki/Moving_average) can be a useful technique. By smoothing values over a particular period of time, it can reveal trends that emerge at different time scales. It lets you "zoom out" on the data, so to speak. 

Since the data is hourly, we might want to calculate the rolling mean over a 24-hour period, which would give us a better impression of the trends from day to day. 

Our `for` loop approach to the rolling mean uses pandas indexing and slicing to average over the last 24 hours of data, hour by hour in our dataset. 

In [None]:
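# Rolling 24-hour means, starting with the first full 24-hour window
mean_kw = []
for i in range(23, len(energy_df)):
    # Index of the first hour in the 24-hour window ending at row i
    first_idx = i - 23
    # iloc slicing excludes the end point, so i+1 includes row i
    period = energy_df.iloc[first_idx:i+1]['Electricity:Facility [kW](Hourly)']
    mean_kw.append(period.mean())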
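
pandas also provides a built-in `rolling` method that computes the same windowed mean without an explicit loop. The sketch below compares the two approaches on a small made-up Series (not the energy data), assuming a 24-row window as in the loop above.

```python
import pandas as pd

# Made-up hourly values for illustration: three "days" of 24 readings
values = pd.Series([float((i * 7) % 10) for i in range(72)])

# Loop version: average each 24-row window ending at row i
loop_means = []
for i in range(23, len(values)):
    loop_means.append(values.iloc[i - 23:i + 1].mean())

# Built-in version: rolling(24) yields NaN until a full window is available,
# so dropna() leaves the same windows the loop computed
rolling_means = values.rolling(window=24).mean().dropna().tolist()

# The two results agree up to floating-point rounding
same = all(abs(a - b) < 1e-9 for a, b in zip(loop_means, rolling_means))
print(len(loop_means), len(rolling_means), same)
```

On a DataFrame, the same call can be applied to a single column, which keeps the result aligned with the original rows.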