# DSC 80: Lab 01

### Due Date: Tuesday January 14, Midnight (11:59 PM)

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the problem statements and provides code and markdown cells to display your answers. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `lab01.py` file, which will be imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying Python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks serve to present the questions to you and give you a place to present your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in Python usually involves larger files containing many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab01.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab**.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab**.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab**.py` in the notebook.
    - `autoreload` is necessary because Python loads a module only once per session: upon first import, `lab**.py` is compiled to bytecode (cached in the directory `__pycache__`) and the loaded module is kept in memory. Subsequent imports of `lab**` merely reuse the already-loaded module.
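
To see why a plain re-`import` does not pick up edits, here is a minimal sketch that uses only the standard library. The module name `demo_reload_mod` and the temporary directory are hypothetical stand-ins for `lab**.py`; `importlib.reload` plays the role that `autoreload` automates in the notebook.

```python
import importlib
import os
import sys
import tempfile

# Assumption for this demo only: skip writing bytecode so the example
# does not depend on the contents of __pycache__.
sys.dont_write_bytecode = True

# Write a tiny module to a temporary directory (hypothetical module name).
tmp_dir = tempfile.mkdtemp()
module_path = os.path.join(tmp_dir, "demo_reload_mod.py")
with open(module_path, "w") as fh:
    fh.write("VALUE = 1\n")

sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()
import demo_reload_mod

# Edit the source file on disk, as you would edit lab**.py.
with open(module_path, "w") as fh:
    fh.write("VALUE = 2222\n")

# A second import statement reuses the module already held in sys.modules.
import demo_reload_mod
stale = demo_reload_mod.VALUE

# An explicit reload re-executes the edited source.
importlib.reload(demo_reload_mod)
fresh = demo_reload_mod.VALUE

print(stale, fresh)
```

The value read after the second `import` is still the old one; only the explicit reload sees the edit.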

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
!pip install pandas

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Collecting pandas
  Downloading pandas-1.0.1-cp37-cp37m-manylinux1_x86_64.whl (10.1 MB)
Collecting pytz>=2017.2
  Downloading pytz-2019.3-py2.py3-none-any.whl (509 kB)
Collecting numpy>=1.13.3
  Downloading numpy-1.18.1-cp37-cp37m-manylinux1_x86_64.whl (20.1 MB)
Installing collected packages: pytz, numpy, pandas
Successfully installed numpy-1.18.1 pandas-1.0.1 pytz-2019.3


In [3]:
import lab01 as lab

In [4]:
import os
import pandas as pd
import numpy as np

## Python Basics

---
**Question 0 (EXAMPLE):**

Write a function that takes in a possibly empty list of integers and:
* Returns `True` if there exist two adjacent list elements that are consecutive integers.
* Otherwise, returns `False`.

For example, because `9` is next to `8`:
```
>>> lab.consecutive_ints([5,3,6,4,9,8])
True
```
Whereas:
```
>>> lab.consecutive_ints([1,3,5,7,9])
False
```

*Note*: This question is done for you, to demonstrate a completed homework problem.
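
Since this question is already solved in the starter code, here is one possible way such a function could be written. It is a sketch for illustration, not necessarily the implementation in `lab01.py`; it pairs each element with its right-hand neighbor and checks whether any pair differs by exactly 1.

```python
def consecutive_ints(ints):
    """
    Return True if two adjacent elements of ints are consecutive integers.

    >>> consecutive_ints([5, 3, 6, 4, 9, 8])
    True
    >>> consecutive_ints([1, 3, 5, 7, 9])
    False
    >>> consecutive_ints([])
    False
    """
    # zip stops at the shorter input, so empty and one-element lists
    # produce no pairs and any() returns False.
    return any(abs(left - right) == 1 for left, right in zip(ints, ints[1:]))
```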

In [5]:
# Develop your code here (or in an IDE) if you'd like.
# Though only code in lab01.py will be graded!

In [6]:
# Add more cells if you'd like!

Test your code in two ways:
1. Run the cell below to test your code. You should also copy the cell and change the input to test further (i.e. write your own doctests)! Does it work for corner cases? Real-world data is **very messy** and you should expect your data processing code to break without thorough testing!
2. Run doctests on `lab01.py` by running the following command on the command line:
```
python -m doctest lab01.py
```
If the doctests pass, then there should be *no* output.
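
If you want to see what a doctest looks like and check one from inside Python, the following sketch may help. The function `double` is a hypothetical example, not part of the lab; the `doctest` classes used are from the standard library.

```python
import doctest


def double(x):
    """
    Return twice the input.

    >>> double(2)
    4
    >>> double(-1.5)
    -3.0
    """
    return 2 * x


# Parse the examples out of the docstring and run them.
parser = doctest.DocTestParser()
test = parser.get_doctest(double.__doc__, {"double": double}, "double", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

# A passing run prints nothing; the summary counts failures and attempts.
print(results.failed, results.attempted)
```

Each `>>>` line is executed and its output is compared, as text, with the line that follows it.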

In [7]:
# test your code!
lab.consecutive_ints([1,3,2,4])

True

In [8]:
lab.consecutive_ints([0])

False

In [9]:
lab.consecutive_ints([])

False

---
**Question 1 (median):**

Write a function called *median* that takes a non-empty list of numbers, returning the median element of the list. If the list has even length, it should return the mean of the two elements in the middle. Do not use any imported libraries for this question; you may use any built-in function.


In [10]:
# Try this
lab.median([0, -1, 1, 100])

0.0

---
**Question 2 (List Distances):**

Similar to Question 0, write a function that takes in a possibly empty list of integers and:
* Returns `True` if there exist two list elements $i$ places apart, whose distance as integers is also $i$.
* Otherwise, returns `False`.

Assume your inputs tend to satisfy the condition, and that the pair(s) satisfying the condition tend to be close together; design your function to run faster for this case. (Optimizing your code for an assumed distribution of incoming data is very common in data science.)

For example, because `3` and (the second) `5` are two places apart, and $|3-5| = 2$:
```
>>> lab.same_diff_ints([5,3,1,5,9,8])
True
```
Whereas:
```
>>> lab.same_diff_ints([1,3,5,7,9])
False
```

*Note*: Make sure to define some extreme test cases. Use the `%time` command to time your function!
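
Outside the notebook, the standard-library `timeit` module gives similar measurements to `%time`. The sketch below uses two hypothetical helper functions (unrelated to `same_diff_ints`) to show how stopping at the first match can pay off when a match tends to occur early in the input.

```python
import timeit


def has_negative_eager(values):
    # Builds the full list of comparisons before any() looks at it.
    return any([v < 0 for v in values])


def has_negative_lazy(values):
    # The generator lets any() stop at the first True it encounters.
    return any(v < 0 for v in values)


# The match sits at the very front, which favors the early-exit version.
data = [-1] + list(range(100_000))

eager_time = timeit.timeit(lambda: has_negative_eager(data), number=20)
lazy_time = timeit.timeit(lambda: has_negative_lazy(data), number=20)

print(f"eager: {eager_time:.4f}s, lazy: {lazy_time:.4f}s")
```

Both functions return the same answer; only the amount of work done before returning differs.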

In [11]:
%time lab.same_diff_ints([5,3,1,5,9,8])

CPU times: user 99 µs, sys: 33 µs, total: 132 µs
Wall time: 111 µs


True

---
## Strings and Files

The following questions will help you (re)learn the basics of working with strings and reading data from files (which are read in as strings, by default).

---
**Question 3 (Prefixes):**

Write a function `prefixes` that takes a string and returns a string of every consecutive prefix of the input string. For example, `prefixes('Data!')` should return `'DDaDatDataData!'`.  (See the doctests for more examples).

Recall that [strings may be sliced](https://docs.python.org/3/tutorial/introduction.html#strings), like lists.
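
As a quick refresher, here are a few slices of the example string, using the same syntax as list slicing:

```python
word = "Data!"

first_three = word[:3]    # characters at positions 0, 1, 2
last_two = word[-2:]      # the final two characters
every_other = word[::2]   # every second character, starting at position 0
nothing = word[:0]        # an empty slice is an empty string

print(first_three, last_two, every_other, repr(nothing))
```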


In [12]:
lab.prefixes('Data!')

'DDaDatDataData!'

---
**Question 4 (Evens reversed):**

Write a function `evens_reversed` that takes in a non-negative integer $N$ and returns a string containing all even integers from $1$ to $N$ (inclusive) in reversed order, separated by spaces. Additionally, [zero pad](https://www.tutorialspoint.com/python/string_zfill.htm) each integer, so that each has the same length.
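
Zero padding with `str.zfill` can be sketched as follows. The numbers are arbitrary and chosen only to show strings of different lengths being padded to a common width.

```python
numbers = [7, 42, 100]

# Pad every number to the width of the longest one.
width = len(str(max(numbers)))
padded = [str(n).zfill(width) for n in numbers]

print(" ".join(padded))
```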

In [13]:
lab.evens_reversed(7)

'6 4 2'

---

[Recall](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) that the built-in function `open` takes in a file path and returns *a file object* (sometimes called a *file handle*). Below are a few properties of file objects:

* `open(path)` opens the file at location `path` for reading.
* `open(path)` is an *iterable*, which contains successive lines of the file.
* Once opened, a file object should be closed after use to avoid leaking resources (such as open file handles). To ensure a file is closed once you are done with it, use a *context manager* as follows:
```
with open(path) as fh:
    for line in fh:
        process_line(line)
```
* To read the entire file into a string, use the read method:
```
with open(path) as fh:
    s = fh.read()
```
However, you should be careful when reading an entire file into memory that the file isn't too big! *You should avoid this whenever possible!*
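
The behaviors listed above can be tried on a small temporary file. In this sketch the file name `example.txt` and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small two-line file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")
with open(path, "w") as fh:
    fh.write("first line\nsecond line\n")

# Iterating over the file object yields one line at a time,
# each still carrying its trailing newline.
with open(path) as fh:
    lines = [line for line in fh]

# read() returns the whole file as a single string.
with open(path) as fh:
    whole = fh.read()

# Leaving the with block closes the file automatically.
is_closed = fh.closed

# Remove the newline delimiter to get the content of each line.
stripped = [line.rstrip("\n") for line in lines]
print(lines, stripped, is_closed)
```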

**Question 5 (Reading Files):**

Create a function `last_chars` that takes a file object and returns a string consisting of the last character of each line.

*Remark:* A newline is the "delimiter" of the lines of a file, and doesn't count as part of the line (as the tests imply). Every other character is part of the line. For more info on this, see [the interpretation](https://en.wikipedia.org/wiki/Newline#Interpretation) of a file as a sequence of newline-delimited lines.



In [14]:
import os

In [15]:
fp = os.path.join('data', 'chars.txt')
lab.last_chars(open(fp))

'hrg'

---

## `numpy` exercises

For an introduction to arrays and `numpy` recall the relevant section of [DSC 10](https://www.inferentialthinking.com/chapters/05/1/Arrays.html).
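
As a short refresher on loop-free array operations, the sketch below shows a few common `numpy` idioms on a made-up array. None of these is a solution to the questions that follow; they only illustrate elementwise arithmetic, boolean comparisons, and differences between neighbors.

```python
import numpy as np

a = np.array([4, 9, 16, 25])

positions = np.arange(len(a))   # the index of each element
shifted = a + positions         # elementwise addition of two arrays
is_large = a > 10               # elementwise comparison gives a boolean array
steps = np.diff(a)              # differences between successive elements

print(positions, shifted, is_large, steps)
```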

**Question 6 (Basic Arrays):**

Create the following functions using `numpy` methods satisfying the requirements given in each part. Your solutions should **not** contain any loops or list comprehensions.

* A function `arr_1` that takes in a `numpy` array and adds to each element the square-root of the index of each element.

* A function `arr_2` that takes in a `numpy` array of integers and returns a boolean array (i.e. an array of booleans) whose `ith` element is `True` if and only if the `ith` element of the input array is divisible by 16.

* A function `arr_3` that takes in a `numpy` array of [stock prices](https://en.wikipedia.org/wiki/Stock) per share on successive days in USD and returns an array of growth rates. That is, the `ith` number of the output array should contain the rate of growth in stock price between the $i^{th}$ day and the $(i+1)^{th}$ day. The growth rate should be a proportion, rounded to the nearest hundredth.

* Suppose:
    - `A` is a `numpy` array of [stock prices](https://en.wikipedia.org/wiki/Stock) per share for a company on successive days in USD 
    - you start with \\$20, and put aside \\$20 at the end of each day to buy as much stock as possible the following day. 
    - Any money left-over after a given day is saved for possibly buying stock on a future day. 
    - Create a function `arr_4` that takes in `A` and returns the day on which you can buy at least one share from 'left-over' money. If this never happens, return `-1`. The first stock purchase occurs on day 0. *Note: you cannot buy fractions of a share of stock*.
    
*Example:* If the stock price is \\$3 every day, then the answer is 'day 1':
* day 0: buy 6 shares; \\$2 left-over; \\$22 at end of day.
* day 1: buy 7 shares; \\$1 left-over; \\$21 at end of day.
This is more than the 6 shares that \\$20 can buy.
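
The arithmetic in the \\$3 example can be traced with a plain Python loop. This sketch is only a checking aid under the stated rules (the helper name `trace_purchases` is hypothetical); it is not an acceptable `arr_4`, since solutions to this question may not use loops.

```python
def trace_purchases(prices, daily=20):
    """Return (day, shares bought, left-over, cash at end of day) per day."""
    cash = daily
    rows = []
    for day, price in enumerate(prices):
        shares = cash // price             # whole shares only
        leftover = cash - shares * price   # money not spent today
        cash = leftover + daily            # set aside another $20 at end of day
        rows.append((day, shares, leftover, cash))
    return rows


for row in trace_purchases([3, 3]):
    print(row)
```

On day 1 the purchase is 7 shares, one more than the 6 shares that the daily \\$20 alone would buy, which matches the example.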

In [16]:
fp = os.path.join('data', 'stocks.csv')
stocks = np.array([float(x) for x in open(fp)])
stocks

array([ 9.89,  9.87,  9.97,  9.83,  9.86,  9.9 ,  9.86, 10.05, 10.14,
       10.38, 10.51, 10.58, 10.6 , 10.62, 10.7 , 10.69, 10.61, 10.59,
       10.62, 10.48, 10.54, 10.54, 10.52, 10.68, 10.71, 10.78, 10.74,
       10.79, 10.94, 10.76, 10.82, 10.87, 10.72, 10.86, 10.88, 10.85,
       10.79, 10.9 , 11.19, 11.12, 11.1 , 11.23, 11.3 , 11.33, 11.35,
       11.32, 11.42, 11.52, 11.51, 11.53, 11.73, 11.63, 11.56, 11.71,
       11.61, 11.74, 11.95, 11.89, 11.75, 11.74, 11.8 , 11.81, 11.79,
       11.8 , 11.96, 11.95, 12.04, 12.01, 12.12, 12.22, 12.31, 12.29,
       12.25, 12.37, 12.38, 12.4 , 12.61, 12.39, 12.38, 12.47, 12.5 ,
       12.63, 12.77, 12.73, 12.48, 12.33, 12.26, 12.11, 11.99, 12.01,
       12.11, 12.18, 12.27, 12.25, 12.25, 12.2 , 12.11, 12.26, 12.41,
       12.45])

In [17]:
out1 = lab.arr_1(stocks)
out1

array([   9.89,   10.87,   13.97,   18.83,   25.86,   34.9 ,   45.86,
         59.05,   74.14,   91.38,  110.51,  131.58,  154.6 ,  179.62,
        206.7 ,  235.69,  266.61,  299.59,  334.62,  371.48,  410.54,
        451.54,  494.52,  539.68,  586.71,  635.78,  686.74,  739.79,
        794.94,  851.76,  910.82,  971.87, 1034.72, 1099.86, 1166.88,
       1235.85, 1306.79, 1379.9 , 1455.19, 1532.12, 1611.1 , 1692.23,
       1775.3 , 1860.33, 1947.35, 2036.32, 2127.42, 2220.52, 2315.51,
       2412.53, 2511.73, 2612.63, 2715.56, 2820.71, 2927.61, 3036.74,
       3147.95, 3260.89, 3375.75, 3492.74, 3611.8 , 3732.81, 3855.79,
       3980.8 , 4107.96, 4236.95, 4368.04, 4501.01, 4636.12, 4773.22,
       4912.31, 5053.29, 5196.25, 5341.37, 5488.38, 5637.4 , 5788.61,
       5941.39, 6096.38, 6253.47, 6412.5 , 6573.63, 6736.77, 6901.73,
       7068.48, 7237.33, 7408.26, 7581.11, 7755.99, 7933.01, 8112.11,
       8293.18, 8476.27, 8661.25, 8848.25, 9037.2 , 9228.11, 9421.26,
       9616.41, 9813

In [18]:
np.all(out1 >= stocks)

True

In [19]:
out2 = lab.arr_2(np.array([1, 2, 16, 17, 32, 33]))
out2

array([False, False,  True, False,  True, False])

In [20]:
out2.dtype == np.dtype('bool')

True

In [21]:
out3 = lab.arr_3(stocks)
out3

array([-0.  ,  0.01, -0.01,  0.  ,  0.  , -0.  ,  0.02,  0.01,  0.02,
        0.01,  0.01,  0.  ,  0.  ,  0.01, -0.  , -0.01, -0.  ,  0.  ,
       -0.01,  0.01,  0.  , -0.  ,  0.02,  0.  ,  0.01, -0.  ,  0.  ,
        0.01, -0.02,  0.01,  0.  , -0.01,  0.01,  0.  , -0.  , -0.01,
        0.01,  0.03, -0.01, -0.  ,  0.01,  0.01,  0.  ,  0.  , -0.  ,
        0.01,  0.01, -0.  ,  0.  ,  0.02, -0.01, -0.01,  0.01, -0.01,
        0.01,  0.02, -0.01, -0.01, -0.  ,  0.01,  0.  , -0.  ,  0.  ,
        0.01, -0.  ,  0.01, -0.  ,  0.01,  0.01,  0.01, -0.  , -0.  ,
        0.01,  0.  ,  0.  ,  0.02, -0.02, -0.  ,  0.01,  0.  ,  0.01,
        0.01, -0.  , -0.02, -0.01, -0.01, -0.01, -0.01,  0.  ,  0.01,
        0.01,  0.01, -0.  ,  0.  , -0.  , -0.01,  0.01,  0.01,  0.  ])

In [22]:
out3.max() == 0.03

True

In [30]:
out4 = lab.arr_4(np.array([3, 3, 3, 3]))
out4

1

---
## Getting Started with Pandas

The following questions will help you get comfortable with Pandas. These questions are similar to questions on tables in DSC 10; review the [textbook](https://www.inferentialthinking.com) as necessary. As always for Pandas questions:
1. Avoid writing loops through the rows of the dataset to do the problem, and
2. Test the output/correctness of your code with the help of the dataset given, but be sure your code will also run on data "like" the dataset given (sampling rows using the `.sample` method is useful for this!).
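
As a sketch of the second point, the `.sample` method draws random rows, which lets you run a function on data "like" the given dataset. The small table and the helper `total_movies` below are made up for illustration.

```python
import pandas as pd

# A made-up table in the same spirit as the lab data.
toy = pd.DataFrame({
    "Year": [2001, 2002, 2003, 2004, 2005],
    "Number of Movies": [10, 12, 9, 15, 11],
})


def total_movies(df):
    # Hypothetical function under test.
    return df["Number of Movies"].sum()


# Draw 3 random rows; random_state makes the draw repeatable.
subset = toy.sample(n=3, random_state=0)
subset_total = total_movies(subset)

print(subset)
print(subset_total)
```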

**Question 7 (Pandas basics):**

Read in the file `movies_by_year.csv` in the `data` directory and understand the dataset by answering the following questions. To do this, create a function `movie_stats` that takes in a dataframe like `movies` and returns a series containing the following statistics:
* The number of years covered by the dataset (`num_years`).
* The total number of movies made over all years in the dataset (`tot_movies`).
* The year with the fewest number of movies made; a tie should return the earliest year (`yr_fewest_movies`).
* The average amount of money grossed over all the years in the dataset (`avg_gross`).
* The year with the highest gross *per movie* (`highest_per_movie`).
* The name of the top movie during the second-lowest (total) grossing year (`second_lowest`).
* The average number of movies made the year *after* a Harry Potter movie was the #1 movie (`avg_after_harry`).

The index labels of the output series are given in parentheses above.

*Note*: Your function should work on a dataset of the same format that contains information from other years. You may assume that none of the answers involving ranking returns a tie.

*Note*: To make sure your function still runs in the event that one of the 7 parts throws an exception (e.g. due to a very incorrect answer), use `try`/`except` structures.
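
One way to arrange the `try`/`except` structure is sketched below. The helper `safe_stats`, the toy table, and the choice of `NaN` as the fallback value are assumptions made for illustration, not requirements of the lab.

```python
import numpy as np
import pandas as pd


def safe_stats(df, stat_funcs):
    """Compute each named statistic, recording NaN for any that fails."""
    results = {}
    for name, func in stat_funcs.items():
        try:
            results[name] = func(df)
        except Exception:
            # One failing statistic should not prevent the others from running.
            results[name] = np.nan
    return pd.Series(results)


toy = pd.DataFrame({"Year": [2000, 2001], "Number of Movies": [10, 12]})
out = safe_stats(toy, {
    "num_years": lambda d: d["Year"].nunique(),
    "broken": lambda d: d["No Such Column"].sum(),  # raises KeyError
})
print(out)
```

The loop here runs over the statistics, not over the rows of the dataset.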

In [31]:
movie_fp = os.path.join('data', 'movies_by_year.csv')
movies = pd.read_csv(movie_fp)
movies

Unnamed: 0,Year,Total Gross,Number of Movies,#1 Movie
0,2015,11128.5,702,Star Wars: The Force Awakens
1,2014,10360.8,702,American Sniper
2,2013,10923.6,688,Catching Fire
3,2012,10837.4,667,The Avengers
4,2011,10174.3,602,Harry Potter / Deathly Hallows (P2)
5,2010,10565.6,536,Toy Story 3
6,2009,10595.5,521,Avatar
7,2008,9630.7,608,The Dark Knight
8,2007,9663.8,631,Spider-Man 3
9,2006,9209.5,608,Dead Man's Chest


In [57]:
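# Scratch work for Question 7; only the final expression is displayed.
movies['Year'].nunique()  # num_years
movies['Number of Movies'].sum()  # tot_movies
movies['Number of Movies'].min()
# yr_fewest_movies: take the minimum year so that a tie returns the earliest one
movies.loc[movies['Number of Movies'] == movies['Number of Movies'].min()]['Year'].min()
movies['Total Gross'].mean()  # avg_gross
avg = movies['Total Gross']/movies['Number of Movies']
movies.loc[avg.idxmax(), 'Year']  # highest_per_movie: the year, not the gross value
# second_lowest: nsmallest sorts ascending, so position 1 is the second-lowest year
movies.nsmallest(2, 'Total Gross')['#1 Movie'].values[1]
# avg_after_harry: years in which a Harry Potter movie was the #1 movie
hp = movies.loc[movies['#1 Movie'].str.contains('Harry Potter')]['Year']
total = 0
for harry_year in hp.values:
    # number of movies made in the year after each Harry Potter year
    total += movies.loc[movies['Year'] == harry_year + 1]['Number of Movies'].values[0]
total/len(hp)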

573.0

In [72]:
out = lab.movie_stats(movies)
out

num_years                 34
tot_movies             17834
yr_fewest_movies        1990
avg_gross            7226.91
highest_per_movie    20.3369
second_lowest           E.T.
avg_after_harry          573
dtype: object

In [73]:
isinstance(out, pd.Series)

True

In [74]:
'num_years' in out.index

True

In [75]:
isinstance(out.loc['second_lowest'], str)

True

---

## CSV Files

**Question 8 (Reading malformed csv files):**

`malformed.csv` is a file of comma-separated values with the following fields:


|column name|description|type|
|---|---|---|
|first|first name of person|str|
|last|last name of person|str|
|weight|weight of person (lbs)|float|
|height|height of person (in)|float|
|geo|location of person; comma-separated latitude/longitude|str|

Unfortunately, the entries contain errors that cause the Pandas `read_csv` function to fail to parse the file with the default settings. Instead, you must read in the file manually using Python's built-in `open` function.

Clean the csv file into a Pandas DataFrame with columns as described in the table above, by creating a function called `parse_malformed` that takes in a file path and returns a parsed, properly-typed dataframe. The dataframe should contain columns as described in the table above (with the specified types); it should agree with `pd.read_csv` when the lines are not malformed.


*Note:* Assume that the given csv file is a sample of a larger file; you will be graded against a **different** sample of the larger file that has the same type of parsing errors. That is, you should **not** hard-code your cleaning of the data to specific errors on specific lines in the data.
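
Once the lines have been cleaned into lists of strings, a properly-typed dataframe can be assembled along the following lines. This is a sketch that assumes the cleaning step has already happened; the two rows are copied from the sample output shown in this notebook.

```python
import numpy as np
import pandas as pd

cols = ["first", "last", "weight", "height", "geo"]

# Assumed result of some cleaning step: every field is still a string.
rows = [
    ["Julia", "Wagner", "142.0", "86.0", "39.8,15.4"],
    ["Angelica", "Rija", "155.0", "56.0", "38.2,-71.7"],
]

# Build the frame, then cast the numeric columns to float.
df_sketch = pd.DataFrame(rows, columns=cols).astype({"weight": float, "height": float})
print(df_sketch.dtypes)
```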

In [92]:
mf_fp = os.path.join('data', 'malformed.csv')
mf = open(mf_fp)
mf.readline()[:-1].split(",")
mf.readline()[:-1].split(",")

['Julia', 'Wagner', '142.0', '86.0', '"39.8', '15.4"']
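
The six pieces above come from splitting on every comma, including the one inside the quoted `geo` field. For comparison, this sketch shows how the standard-library `csv` module reads the same well-formed line. It only illustrates the quoting convention; lines with stray quotes still need handling of your own.

```python
import csv

line = 'Julia,Wagner,142.0,86.0,"39.8,15.4"'

# Splitting on commas also splits the quoted geo field in two.
naive = line.split(",")

# csv.reader treats the quoted field as a single value and drops the quotes.
parsed = next(csv.reader([line]))

print(len(naive), naive)
print(len(parsed), parsed)
```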

In [93]:
count = 0
while True:
    line = mf.readline()
    print(line)
    count += 1
    if line == '':
        break
print(count)

Angelica,Rija,155.0,56.0,"38.2,-71.7"

Tyler,Micajah,116.0,73.0","38.0,6.9"

Kathleen,Nakea,163.0,69.0,"36.3,-86.8"

Axel,Ronit,95.0,74.0,"36.8,128.2"

Amiya,Kyona,130.0,72.0,"36.3,114.5"

Torrey,Joshuacaleb,105.0,79.0,"38.3,145.1"

Mariah,Alese,149.0,68.0,"36.1,45.7"

Grayson,Daimen,140.0,80.0","38.1,-72.6"

Yvette,Trayce,179.0,67.0,"36.9,-8.3"

Cody,Hatim,150.0,63.0,"38.0,-7.3"

Marissa,Daud,135.0,58.0,"37.3,11.0"

Logan,Cristel,133.0,67.0,"35.5,-110.2"

Kaiyah,Brinden,187.0,82.0,"34.8,83.2"

Ivan,Devyne,193.0,54.0,"36.6,262.0"

Shamaria,Aldrick",139.0,73.0,"38.5,-94.6"

Travis,Anavictoria,117.0,62.0,"36.3,69.5"

Kennedy,Dalynn,171.0",77.0,"37.3,-27.5"

Alina,Danniell,105.0,55.0,"37.4,314.7"

Cameron,Angelica,139.0,56.0,"38.8,-79.3"

Madison,Barkley,120.0",69.0,"38.2,86.1"

Jackson,Taylr,113.0,78.0,"36.7,56.7"

Agustin,Stephanye,91.0,62.0,"36.4,54.5"

Janesha,Jhayla,143.0,64.0,"35.9,-70.5"

Nickolas,Karenna,159.0,75.0,"35.9,-73.9"

Stacy,Meaghen,149.0,68.0,"36.6,-27.7"

Matthew,Kalis

In [143]:
df = lab.parse_malformed(mf_fp)
df

Unnamed: 0,first,last,weight,height,geo
0,Julia,Wagner,142.0,86.0,"39.8,15.4"
1,Angelica,Rija,155.0,56.0,"38.2,-71.7"
2,Tyler,Micajah,116.0,73.0,"38.0,6.9"
3,Kathleen,Nakea,163.0,69.0,"36.3,-86.8"
4,Axel,Ronit,95.0,74.0,"36.8,128.2"
...,...,...,...,...,...
95,Yasmeen,Jahron,135.0,84.0,"38.3,-127.3"
96,Meghan,Carlyann,101.0,66.0,"36.6,80.5"
97,Tess,Shree,146.0,68.0,"38.8,64.9"
98,Maria,Kalvyn,115.0,51.0,"37.1,-90.4"


In [144]:
cols = ['first', 'last', 'weight', 'height', 'geo']
list(df.columns) == cols

True

In [145]:
df['last'].dtype == np.dtype('O')

True

In [146]:
df['height'].dtype == np.dtype('float64')

True

In [147]:
df['geo'].str.contains(',').all()

True

In [148]:
len(df) == 100

True

In [149]:
df.iloc[9:13]

Unnamed: 0,first,last,weight,height,geo
9,Yvette,Trayce,179.0,67.0,"36.9,-8.3"
10,Cody,Hatim,150.0,63.0,"38.0,-7.3"
11,Marissa,Daud,135.0,58.0,"37.3,11.0"
12,Logan,Cristel,133.0,67.0,"35.5,-110.2"


In [150]:
dg = pd.read_csv(mf_fp, nrows=4, skiprows=10, names=cols)
dg.index = range(9, 13)
(dg == df.iloc[9:13]).all().all()

True

In [151]:
dg

Unnamed: 0,first,last,weight,height,geo
9,Yvette,Trayce,179.0,67.0,"36.9,-8.3"
10,Cody,Hatim,150.0,63.0,"38.0,-7.3"
11,Marissa,Daud,135.0,58.0,"37.3,11.0"
12,Logan,Cristel,133.0,67.0,"35.5,-110.2"


## Congratulations! You're done!

* Submit the lab on Gradescope

In [152]:
lab.check_for_graded_elements()

True