In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# These lines load the tests.

from gofer.ok import check

# Homework 4: Functions, Histograms, and Groups #
## Due: Thursday 3/25/2020 11:59pm ##
Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests and modules you will be using for this homework.

**Reading**: 

* [Visualizing Numerical Distributions](https://www.inferentialthinking.com/chapters/07/2/visualizing-numerical-distributions.html) 
* [Functions and Tables](https://www.inferentialthinking.com/chapters/08/functions-and-tables.html)

Throughout this homework and all future ones, please do not reassign variables anywhere in the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Moreover, please put your written answers only in the provided cells.

## 1. Working with Text using Functions


The following table contains the words from four chapters of Charles Dickens' [*A Tale of Two Cities*](http://www.gutenberg.org/cache/epub/98/pg98.txt).  We're going to compute some simple facts about each chapter.  Since we're performing the same computation on each chapter, it's best to encapsulate each computational procedure in a function, and then call the function several times. Run the cell to get a table with one column.

In [None]:
# Just run this cell to load the data.
tale_chapters = Table.read_table("tale.csv")
tale_chapters

**Question 1.** Write a function called `word_count` that takes a single argument, the text of a single chapter, and returns the number of words in that chapter.  Assume that words are separated from each other by spaces. 

*Hint:* Try the string method [`split`](https://docs.python.org/3/library/stdtypes.html#str.split) and the function [`len`](https://docs.python.org/3/library/functions.html#len).
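
As a rough illustration of the two tools named in the hint, the sketch below applies them to a made-up sentence (not taken from `tale.csv`). It also shows how `split(" ")` and `split()` differ when a string contains consecutive spaces, which may matter depending on how the text is formatted.

```python
# A toy sentence, assumed for illustration only.
sentence = "It was the best of times"

# split(" ") breaks the string at every single space and returns a list.
words = sentence.split(" ")
print(words)       # ['It', 'was', 'the', 'best', 'of', 'times']

# len counts the items in that list.
print(len(words))  # 6

# With two spaces in a row, split(" ") produces an empty string,
# while split() with no argument treats any run of whitespace as one separator.
print("a  b".split(" "))  # ['a', '', 'b']
print("a  b".split())     # ['a', 'b']
```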

In [None]:
...

word_count(tale_chapters.column("Chapter text").item(0))

In [None]:
check('tests/q1_1.py')

**Question 2.** Create an array called `chapter_lengths` which contains the length of each chapter in `tale_chapters`.

*Hint:* Consider using `apply` along with the function you have defined in the previous question.

In [None]:
chapter_lengths = ...
chapter_lengths

In [None]:
check('tests/q1_2.py')

**Question 3.** Write a function called `character_count`.  It should take a string as its argument and return the number of characters in that string that aren't spaces (" "), periods ("."), exclamation marks ("!"), or question marks ("?"). Remember that `tale_chapters` is a table, and that the function takes in only the text of one chapter as input.

*Hint:* Try using the string method `replace` several times to remove the characters we don't want to count.
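
To show how chained `replace` calls behave, here is a small sketch on a made-up phrase. It removes hyphens and commas, which are deliberately not the characters this question asks about. Note that `replace` returns a new string and leaves the original unchanged.

```python
# A toy phrase, assumed for illustration only.
phrase = "tick-tock, tick-tock"

# Each replace call returns a new string, so calls can be chained.
cleaned = phrase.replace("-", "").replace(",", "")

print(cleaned)       # ticktock ticktock
print(len(phrase))   # 20
print(len(cleaned))  # 17

# The original string is untouched.
print(phrase)        # tick-tock, tick-tock
```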

In [None]:
...

In [None]:
check('tests/q1_3.py')

**Question 4.** Write a function called `chapter_number`.  It should take a single argument, the text of a chapter from our dataset, and return the number of that chapter, as a Roman numeral.  (For example, it should return the string "I" for the first chapter and "II" for the second.)  If the argument doesn't have a chapter number in the same place as the chapters in our dataset, `chapter_number` can return whatever you like.

To help you with this, we've included a function called `text_before`.  Its documentation describes what it does.

In [None]:
def text_before(full_text, pattern):
    """Finds all the text that occurs in full_text before the specified pattern.

    Parameters
    ----------
    full_text : str
        The text we want to search within.
    pattern : str
        The thing we want to search for.

    Returns
    -------
    str
        All the text that occurs in full_text before pattern.  If pattern
        doesn't appear anywhere, all of full_text is returned.
    
    Examples
    --------
    
    >>> text_before("The rain in Spain falls mainly on the plain.", "Spain")
    'The rain in '
    >>> text_before("The rain in Spain falls mainly on the plain.", "ain")
    'The r'
    >>> text_before("The rain in Spain falls mainly on the plain.", "Portugal")
    'The rain in Spain falls mainly on the plain.'
    """
    return np.array(full_text.split(pattern)).item(0)

def chapter_number(chapter_text):
    ...

In [None]:
check('tests/q1_4.py')

## 2. Uber


In hw03 we worked with the same Uber data. Below we load tables containing 200,000 weekday Uber rides in the Manila, Philippines, and Boston, Massachusetts metropolitan areas from the [Uber Movement](https://movement.uber.com) project. The `sourceid` and `dstid` columns contain codes corresponding to the start and end locations of each ride. The `hod` column contains codes corresponding to the hour of the day the ride took place. The `ride time` column contains the length of the ride, in minutes.

In [None]:
boston = Table.read_table("boston.csv")
manila = Table.read_table("manila.csv")
print("Boston Table")
boston.show(4)
print("Manila Table")
manila.show(4)

We produce the corresponding histograms below.

In [None]:
bins = np.arange(0, 120, 5)
boston.hist('ride time', bins = bins, unit = 'ride time')

In [None]:
manila.hist('ride time', bins = bins, unit = 'ride time')

Now consider the following two questions.

**Question 1.** The `hod` column in each table represents the hour of the day during which the Uber was called. 0 corresponds to 12-1 AM, 1 to 1-2 AM, 13 to 1-2 PM, etc. Write a function which takes in a table like `boston` or `manila`, and an `hod` number between 0 and 23, and displays a histogram of ride lengths from that hour in that city. Use the same bins as before.
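
The `hod` coding described above can be spelled out with a small conversion sketch. The helper name `hod_label` is hypothetical and is not part of the homework or the `datascience` library; it only translates a code from 0 to 23 into the clock range it stands for.

```python
def hod_label(hod):
    """Hypothetical helper: turn an hour-of-day code (0-23) into a label."""
    def clock(hour):
        # Convert a 24-hour value to a 12-hour value plus AM/PM.
        hour = hour % 24
        return (hour % 12 or 12), ("AM" if hour < 12 else "PM")

    start, start_half = clock(hod)
    end, end_half = clock(hod + 1)
    if start_half == end_half:
        return f"{start}-{end} {end_half}"
    # The range crosses noon or midnight, so label both ends.
    return f"{start} {start_half}-{end} {end_half}"

print(hod_label(0))   # 12-1 AM
print(hod_label(1))   # 1-2 AM
print(hod_label(13))  # 1-2 PM
print(hod_label(22))  # 10-11 PM
```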

In [None]:
def hist_for_time(tbl, hod):
    bins = np.arange(0, 120, 5)
    ...

#DO NOT DELETE THIS LINE! 
hist_for_time(boston, 12)

**Question 2.** Which city has a larger difference between Uber ride times at 10 AM vs. 10 PM? In other words, which is larger: the difference between 10 AM and 10 PM Uber ride times in Manila, or the difference between 10 AM and 10 PM Uber ride times in Boston? Use the function you just created to answer this question. You do not need to calculate an actual difference.

Assign `larger_diff` to the number 1 if the answer is Manila, and 2 if the answer is Boston. 

In [None]:
larger_diff = ... 

In [None]:
check('tests/q2_5.py')

## 3. NBA player salaries


Recall that in class we worked with the NBA player salaries dataset from the 2015-2016 season.

In [None]:
# This table can be found online: 
# https://www.statcrunch.com/app/index.php?dataid=1843341

# NBA players, 2015-2016 season
nba = Table.read_table('nba_salaries.csv').relabeled(3, 'SALARY').sort('PLAYER')
nba

We want to use this table to generate arrays with the names of each player in each position.

**Question 1.** Set `player_names` to a table with two columns. The first column should be called "position" and have the name of every position once, and the second column should be called "name" and contain an *array* of the names of all players in that position. 

*Hint:* Think about how `group` works: it collects values into an array and then applies a function to that array. We have defined two functions below for you, and you will need to use one of them in your call to `group`.
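
The collect-then-apply idea in the hint can be sketched in plain Python. This is only a conceptual stand-in, not the `datascience` implementation: the helper `group_by` and the fruit data are hypothetical, and the functions passed in (`max` and `len`) are chosen purely to show the mechanism.

```python
def group_by(rows, collect):
    """Hypothetical stand-in for grouping: rows is a list of (key, value) pairs."""
    buckets = {}
    for key, value in rows:
        # Step 1: collect every value that shares a key into one list.
        buckets.setdefault(key, []).append(value)
    # Step 2: apply the collect function to each list of values.
    return {key: collect(values) for key, values in sorted(buckets.items())}

# Toy data, assumed for illustration only.
rows = [("apple", 3), ("pear", 5), ("apple", 7), ("pear", 1), ("fig", 4)]

print(group_by(rows, max))  # {'apple': 7, 'fig': 4, 'pear': 5}
print(group_by(rows, len))  # {'apple': 2, 'fig': 1, 'pear': 2}
```

Whatever the function returns for a group becomes that group's entry, so the choice of function decides what ends up in the second column.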

In [None]:
# Pick between the two functions defined below 
def identity(array):
    return array 

def first(array):
    return array.item(0)

In [None]:
player_names = ...
player_names

In [None]:
check('tests/q3_1.py')

**Question 2.** At the moment, the `name` column is sorted by first name. Would the arrays you generated in the previous part be the same if we had sorted by last name instead before generating them? Two arrays are the **same** if they contain the same number of elements and the elements located at corresponding indexes in the two arrays are identical. Explain your answer. If you feel you need to make certain assumptions about the data, feel free to state them in your response. 
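
The definition of **same** given above can be written out as a short check. The helper `same_arrays` is hypothetical and uses plain lists of made-up names instead of the homework data; it is only meant to make the definition concrete.

```python
def same_arrays(a, b):
    """Hypothetical check: equal length and identical elements at each index."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))

# Made-up names, assumed for illustration only.
print(same_arrays(["Ann", "Bo", "Cy"], ["Ann", "Bo", "Cy"]))  # True
print(same_arrays(["Ann", "Bo", "Cy"], ["Cy", "Bo", "Ann"]))  # False
print(same_arrays(["Ann", "Bo"], ["Ann", "Bo", "Cy"]))        # False
```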

*Write your answer here, replacing this text.*

**Question 3.** Set `biggest_range_position` to the name of the position with the largest salary range, where range is defined as the **difference between the lowest and highest salaries in the position**. 

*Hint:* First you'll need to define a new function `salary_range` which takes in an array of salaries and returns the salary range of the corresponding position. Then, set `position_ranges` to a table containing the names and salary ranges of each position. 

In [None]:
# Define salary_range in this cell
...
    ...

In [None]:
position_ranges = ...
biggest_range_position = ...
biggest_range_position

In [None]:
check('tests/q3_3.py')

## 4. Submission


Congratulations, you're done with Homework 4!  Be sure to:
- **Run all the tests and verify that they all pass** (the next cell has a shortcut for that).
- **Save and Checkpoint** from the `File` menu.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import glob
from gofer.ok import grade_notebook
if not globals().get('__GOFER_GRADER__', False):
    display(grade_notebook('hw04.ipynb', sorted(glob.glob('tests/q*.py'))))