# Assignment

For this assignment, we assume that you are already familiar with the basics of Python. One of the Python libraries data scientists often use is `numpy`, which is a library that facilitates array computations, such as matrix algebra. What this means is that using numpy, we can manipulate vectors, matrices or any n-dimensional array mostly without the need to write loops, so the code is cleaner and more succinct. In this exercise, we want to learn the basics of `numpy` and `pandas`.

1. Create a Python list whose elements are the numbers 3, 7, 1, 3, 5. <span style="color:red" float:right>[1 point]</span>

In [79]:
# Create a Python list
list1 = [3, 7, 1, 3, 5]
print(list1)

[3, 7, 1, 3, 5]


A Python list named `list1` containing [3, 7, 1, 3, 5] has been created.

2.  Write a function that computes the average of a list of numbers. Run your function on the above list so that it returns its average. Your function should only make use of Python **built-ins** (no libraries). <span style="color:red" float:right>[2 points]</span>

In [80]:
# Create a function to compute the average of a list
def list_avg(a): 
    mean = sum(a)/len(a)
    return mean

results = list_avg(list1)
print(results)

3.8


The average of the given list is 3.8
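
One edge case the function above does not cover is an empty list, where `sum(a)/len(a)` raises `ZeroDivisionError`. The following is a minimal sketch of a guarded variant that still uses only built-ins. The name `list_avg_safe` and the choice to return `None` for empty input are assumptions made for illustration and are not part of the assignment.

```python
def list_avg_safe(values):
    """Return the arithmetic mean of values, or None if values is empty."""
    if not values:
        # sum([]) / len([]) would raise ZeroDivisionError, so bail out early
        return None
    return sum(values) / len(values)


print(list_avg_safe([3, 7, 1, 3, 5]))  # 3.8
print(list_avg_safe([]))               # None
```

Returning `None` is only one possible convention; raising a `ValueError` with a clear message would be an equally reasonable choice.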

3. Use the `%%timeit` magic to compute the average runtime of your function. Use the `-n 100` switch to choose to re-run the function 100 times (the more often you re-run it, the more accurate the average runtime is). <span style="color:red" float:right>[1 point]</span>

In [81]:
%%timeit -n 100
# use timeit magic to compute the time needed to run the below function with 100 re-runs
list_avg(list1)


240 ns ± 129 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)


The average runtime of my function is 240 ns ± 129 ns per loop (mean ± std. dev. of 7 runs, 100 loops each), as reported above. The exact figure varies from run to run.
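
The `%%timeit` magic only works inside IPython or Jupyter. Outside a notebook, a roughly comparable measurement can be sketched with the standard-library `timeit` module. The values `repeat=7` and `number=100` are chosen to mirror the "7 runs, 100 loops each" wording in the output above, and the variable names `totals` and `per_call_ns` are illustrative assumptions.

```python
import timeit

list1 = [3, 7, 1, 3, 5]


def list_avg(a):
    return sum(a) / len(a)


# Each entry in totals is the total time in seconds for 100 calls.
totals = timeit.repeat(lambda: list_avg(list1), repeat=7, number=100)

# Convert each batch total to nanoseconds per call.
per_call_ns = [t / 100 * 1e9 for t in totals]
print(f"best of 7 runs: {min(per_call_ns):.0f} ns per call")
```

The `lambda` wrapper adds a small call overhead of its own, so the numbers will not match `%%timeit` exactly and should be read as a rough guide only.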

4. Load the `numpy` library and use it to turn the above list into a `numpy` 1-D array. HINT: Use `numpy.array`. <span style="color:red" float:right>[1 point]</span>

In [82]:
# load the numpy library
import numpy as np
# convert python list to numpy 1-D array
list2 = np.array(list1)
print(list2)

[3 7 1 3 5]


I converted the Python list `list1` to a numpy 1-D array named `list2`.

Because getting the average of an array is a common operation, with `numpy` we don't have to "reinvent the wheel": we can just call the `mean` function. There are two ways of doing this: (1) you can call the `numpy.mean` function and pass it the array, or (2) you can call the `mean` method of the array. 

5. Print the average of the above array. Get the average using `numpy` in **both** of the ways described above. <span style="color:red" float:right>[2 points]</span>

In [83]:
# compute and print the average of the numpy array using two methods

# method 1 - call numpy.mean
print(np.mean(list2))

# method 2 - call the mean method of the array
print(list2.mean())

3.8
3.8


Both methods return the same average: 3.8

6. Compare the runtime of the average computation using `numpy` with the runtime of the function you wrote earlier. <span style="color:red" float:right>[1 point]</span>

In [84]:
%%timeit -n 100
list_avg(list1)

190 ns ± 33.6 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [85]:
%%timeit -n 100
np.mean(list2)

7.24 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [86]:
%%timeit -n 100
list2.mean()

4.46 µs ± 519 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)


**Comparisons**

1. Using the function I wrote with built-in Python functions, the runtime was 190 ns ± 33.6 ns per loop (mean ± std. dev. of 7 runs, 100 loops each).
2. Using the `numpy.mean` function, the runtime was 7.24 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 100 loops each).
3. Using the `mean` method of the array, the runtime was 4.46 µs ± 519 ns per loop (mean ± std. dev. of 7 runs, 100 loops each).

**Conclusion**
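For this five-element list, the function based on Python built-ins computes the average faster than either numpy approach: about 190 ns per loop versus 4.46 µs to 7.24 µs, a difference of more than 20x.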

Of course, most data scientists don't write their own machine learning algorithms. Instead we use existing algorithms and apply them to real-world problems. So `numpy` is to some extent too "low level" and we need a higher level library like `pandas` to work with data. So what do `numpy` and `pandas` have in common? First let's see what a 1-D array looks like in `pandas`:

7. Load the `pandas` library and use `pandas.Series` to create a pandas `Series` object, which is the equivalent of a `numpy` 1-D array. <span style="color:red" float:right>[1 point]</span>

In [87]:
# Load pandas 
import pandas as pd

# convert the numpy array into a panda Series
list3 = pd.Series(list2)
print(list3)

0    3
1    7
2    1
3    3
4    5
dtype: int64


I converted the numpy array "list2" to a pandas Series "list3"

8. Pass the `Series` to the `numpy.mean` function to confirm it returns its average. <span style="color:red" float:right>[1 point]</span>

In [88]:
# pass the series to the numpy.mean function
np.mean(list3)

3.8

Using numpy.mean on a pandas Series returns the same average we had before, which is 3.8

9. Call the `mean` method of the `Series` and confirm it returns its average. <span style="color:red" float:right>[1 point]</span>

In [89]:
list3.mean()

3.8

Using the mean method of the Series also returns 3.8

So you can think of a `Series` in `pandas` almost as the same thing as a 1-D array in `numpy`. In fact calling the `values` attribute of the `Series` returns it as a numpy array.

10. Show that by calling the `values` attribute of a `Series` object, you get a `numpy` array. HINT: You can use the `type` built-in to check its type. <span style="color:red" float:right>[1 point]</span>

In [90]:
# show the series's values
list_values = list3.values
print(list_values)

# check its type
print(type(list_values))

[3 7 1 3 5]
<class 'numpy.ndarray'>


**Conclusion**

A pandas Series is almost the same as a numpy 1-D array. The `values` attribute of a pandas Series returns its data as a `numpy.ndarray`.
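
As a small complement to the `values` check above, the sketch below uses the `Series.to_numpy()` method, which also returns the underlying data as a numpy array, and verifies the type with `isinstance` rather than by reading the printed class name. The variable names `s` and `arr` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 7, 1, 3, 5])

# to_numpy() returns the Series data as a numpy array, like .values does here
arr = s.to_numpy()

print(isinstance(arr, np.ndarray))    # True
print(np.array_equal(arr, s.values))  # True
```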

### Bonus
While numpy has computational overhead that makes it less efficient for small computations, it uses advanced optimizations that generally make it faster when working with large data. Can you write code that will determine the approximate length of the python list when numpy's mean method becomes faster than your program using python built-in functions?

**Problem Approach**

1. Use a while loop that keeps growing the list for as long as (time to average the list as a numpy array) - (time to average the Python list with built-in functions) is positive; the length at which the loop stops is the crossover length
1. Use random integers between 0 and 99 to make up the list
1. Repeat the while loop 1000 times and take the average of the resulting lengths as the approximate crossover length

**Pseudocode**

1. import numpy, random, time
1. create an empty results list to store results
1. create a for loop to run n = 1000
    1. initialize a list with 5 random integers
    1. initialize timer_list = 0
    1. initialize timer_array = 1
    1. create a while loop that runs while timer_array - timer_list > 0
        - append a random integer to the list
        - start time = current time
        - get average of list
        - collect time: timer_list = current time - start time
        - convert list to np array
        - start time = current time
        - get average of array
        - collect time: timer_array = current time - start time
    1. print length of list
    2. store length in results 
1. compute average of results list
1. print result
   

In [91]:
import numpy as np
import random
import time

# create a list to store length results
results = []

# create a for loop to run this 1000 times
for _ in range(1000):
    nums = [random.randint(0, 99) for _ in range(5)]
    timer_list = 0 # initialize timer for list
    timer_array = 1 # initialize timer for array (bigger number than timer for list)
    
    while timer_array - timer_list > 0:
        # add an integer to the list
        nums.append(random.randint(0,99))
        # get current time as start time
        start_time = time.time()
        # compute average using built-in functions
        mean = sum(nums)/len(nums)
        # collect compute time used
        timer_list = time.time() - start_time
        # now convert list to numpy array
        nums_arr = np.array(nums)
        # reset start time
        start_time = time.time()
        # compute average using numpy mean function
        np.mean(nums_arr)
        # collect compute time used
        timer_array = time.time() - start_time
    
    # print length at the end of each run of while loop
    # print("Length of the list is ", len(nums))
    # store the length in results
    results.append(len(nums))

# compute the average of the results
ans = sum(results)/len(results)
# print the final approximated length
print("Numpy's mean method is faster than using python built-in functions when length is approximately ", ans)

Numpy's mean method is faster than using python built-in functions when length is approximately  294.105


**Conclusions:**

I tested the code with lists containing integers from 0 to 9 and with lists containing integers from 0 to 99, and the approximate lengths differed. With single-digit integers, numpy's mean method became faster than the Python built-in functions at a list length of about 260, whereas with integers from 0 to 99 the crossover was longer, at around 300. However, the number of iterations (constrained by run time) may not be high enough to give a very close approximation.

Based on my code, numpy's mean method is faster than the Python built-in functions once the list length reaches approximately 300.
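
A single `time.time()` measurement of a sub-microsecond operation can be noisy, so one possible refinement is to time each candidate length with `timeit.repeat` and compare the best per-call times. The following is only a sketch under stated assumptions: the helper names `time_per_call` and `find_crossover`, the list of candidate lengths, and the `number`/`repeat` values are all hypothetical choices, and the sketch times `arr.mean()` rather than `np.mean(arr)`. The result is machine-dependent, so no expected output is given.

```python
import random
import timeit

import numpy as np


def time_per_call(func, number=200, repeat=5):
    """Best per-call time in seconds over `repeat` batches of `number` calls."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


def find_crossover(lengths):
    """Return the first length where ndarray.mean() beats sum()/len(), else None."""
    for n in lengths:
        nums = [random.randint(0, 99) for _ in range(n)]
        # The list-to-array conversion is kept outside the timed region,
        # matching the approach used in the cell above.
        arr = np.array(nums)
        t_list = time_per_call(lambda: sum(nums) / len(nums))
        t_arr = time_per_call(lambda: arr.mean())
        if t_arr < t_list:
            return n
    return None


candidate_lengths = [5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000]
print(find_crossover(candidate_lengths))
```

Because only the listed lengths are tested, the returned value is an upper bound on the crossover within that grid; a finer grid around the reported length would narrow it down.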








# End of assignment