## Basic Python Data Structures and Numpy

The objective of this assignment is to introduce Jupyter notebook as a development environment and to introduce data structures in Python that provide the basis for much of the data manipulation and, eventually, visualization work that we will perform in later sections.

**Recall the keyboard shortcuts introduced in the video lectures:**
- Convert current chunk to Markdown: *Esc* -> *M*
- New code chunk...
    - above current cell: *Esc* -> *A*
    - below current cell: *Esc* -> *B*
- Run code chunk: *Shift+Enter*
- Comment: *ctrl + /* (PC) or *command + /* (Mac)

### Task 1: Create and Modify a List
We begin with one of the simplest ways to store data: a list. Recall that when we were working in R, lists could store data of different types (characters, integers, logicals), and each element retained its data type when referenced.

Fortunately, Python offers similar functionality. We will start by exploring some of the basic features of a list.

*1a) Creating a list*

Create a list using square brackets.

In [1]:
# Create a list
my_list = [1, 'Friends', 3, True] # Add some elements (characters, integers, logicals) to this list, separated by a comma
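
To check that each element of a list keeps its own data type, we can inspect the elements one at a time. This is a minimal sketch; the name `example_list` is an assumption, chosen so that `my_list` is not overwritten.

```python
# Hypothetical list mirroring my_list; each element keeps its own type
example_list = [1, 'Friends', 3, True]

# type(item).__name__ gives the name of each element's type as a string
element_types = [type(item).__name__ for item in example_list]
print(element_types)  # ['int', 'str', 'int', 'bool']
```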

*1b) Appending a list and selecting the last item*

Use *.append()* to add an element to the end of the list, then reference it using [-1].

In [2]:
my_list.append('Freedom')

### Task 2: Subsetting a list
*2a:* Select the first element of that list.

**Question**: How is this different from R? R indexing starts at 1, whereas Python indexing starts at 0, so the first element of the list is selected with index 0.

In [3]:
my_list[0]

1

*2b:* Select the last element of that list.

In [4]:
my_list[-1]

'Freedom'
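
The indexing rules used in Task 2 can be sketched on a small example. The list `letters` is a hypothetical stand-in, not part of the assignment.

```python
# Hypothetical list used only for illustration
letters = ['a', 'b', 'c', 'd']

first = letters[0]     # Python indexing starts at 0
last = letters[-1]     # negative indices count back from the end
middle = letters[1:3]  # slices include the start index and exclude the stop index

print(first, last, middle)  # a d ['b', 'c']
```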

### Task 3: Modifying elements of a list and list comprehension
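*3a:* Try creating another list of just numbers (e.g. [1, 3, 4]) and then try adding 1 to every element of that list (even if it contains a string). What happens? Adding 1 to the list directly (list_2 + 1) raises a TypeError, because a list does not support elementwise arithmetic. Looping over the elements instead, as in the code below, adds 1 to each element of list_2.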


In [5]:
list_2 = [1,2,3,4]
list_2 = [x+1 for x in list_2]
list_2

[2, 3, 4, 5]
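
A short sketch of why the elementwise loop is needed: adding an integer to a list directly fails, while visiting each element works. The list `numbers` is a hypothetical example.

```python
numbers = [1, 2, 3, 4]  # hypothetical list for illustration

# Adding an integer to a list is not an elementwise operation; it raises TypeError
try:
    numbers + 1
    error_raised = False
except TypeError:
    error_raised = True

# Visiting each element in turn applies the addition one element at a time
incremented = [x + 1 for x in numbers]

print(error_raised)  # True
print(incremented)   # [2, 3, 4, 5]
```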

*3b*: We can try this again using list comprehension. Starting with the following code as a base, complete the previous task; add 1 to each element of list_2.

In [6]:
# x refers to each element of list_2 in turn
[str(x + 1) + ' BOO' for x in list_2]  # any name works in place of x, e.g. [item for item in list_2]


['3 BOO', '4 BOO', '5 BOO', '6 BOO']

## Numpy: A comparison to R
Once you start working with numpy, you will begin to realize that many of its functions and objects bear a striking resemblance to those used in R.

We can begin by considering the atomic vector in R. Recall that in R an atomic vector holds a single data type (e.g. character, integer, float). Numpy arrays in Python work the same way! We will illustrate this by first creating a numpy array.

In [7]:
# !! Don't forget to import numpy!!
import numpy as np

### Task 4: Creating a numpy array
Using the list you used in task 3, you can create a numpy array with the following code. What happens when you add a character to that original list and try converting it to a numpy array? How does this compare to R?

Every element of the array is converted to a string, because a numpy array holds a single data type. R behaves essentially the same way: an atomic vector cannot mix data types, so adding a character to a numeric vector coerces every element to character.


In [8]:
list_2.append('cool')
my_array = np.array(list_2)
my_array

array(['2', '3', '4', '5', 'cool'], dtype='<U21')
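
The coercion can be confirmed by inspecting the `dtype` of an array built with and without a string element. This is a minimal sketch; the array names are assumptions used for illustration.

```python
import numpy as np

numeric_array = np.array([2, 3, 4, 5])        # all integers
mixed_array = np.array([2, 3, 4, 5, 'cool'])  # one string forces a common type

print(numeric_array.dtype.kind)  # 'i' for a signed integer type
print(mixed_array.dtype.kind)    # 'U' for a unicode string type
print(repr(mixed_array[0]))      # the integer 2 is now stored as the string '2'
```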

As with R, there are several built-in functions within numpy that can help us create arrays and matrices. Here we have listed just a few:

- `np.arange()`
- `np.linspace()`
- `np.random.randn()`
- `np.random.randint()`
- `np.random.random()`
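
A brief sketch of what each of these functions returns. The argument values and variable names are assumptions picked for illustration; the random draws will differ on every run.

```python
import numpy as np

evenly_stepped = np.arange(0, 10, 2)                # start, stop (exclusive), step
evenly_spaced = np.linspace(0, 1, 5)                # 5 points from 0 to 1, endpoints included
standard_normal = np.random.randn(3)                # 3 draws from a standard normal distribution
random_integers = np.random.randint(0, 10, size=4)  # 4 integers from 0 up to (not including) 10
uniform_values = np.random.random(3)                # 3 floats in the interval [0, 1)

print(evenly_stepped)  # [0 2 4 6 8]
print(evenly_spaced)
```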

*4b*: We will use np.random.randn() to create a vector and convert it to a 10 x 10 matrix. Modify the code to create a 20 x 5 matrix.

In [9]:
my_matrix = np.random.randn(100).reshape(20, 5)
my_matrix

array([[ 0.52151283, -0.05688287, -0.59252923,  0.4535095 , -0.04155082],
       [ 0.37487575, -0.73589984, -1.20198882, -1.14765131,  1.36589016],
       [ 2.00801076,  0.01641455, -0.24728926,  0.50809245,  1.77044481],
       [ 1.74639888,  0.44580057, -0.15638228,  0.31481783, -0.21251275],
       [-1.07274618,  0.4068032 , -0.99349265,  1.30188345,  0.47110615],
       [ 1.01290256, -0.18050331, -1.49796784,  0.29706307,  0.35245327],
       [-1.2716223 ,  1.32882956, -0.4299807 , -0.51412089, -0.58227673],
       [ 0.55027653, -0.05004354,  1.30428293,  0.28011727,  0.29698488],
       [-1.45954203, -1.60708692, -1.64595636,  0.4199458 , -1.01368032],
       [-0.02796228,  0.38496829,  0.67419387,  0.3225068 ,  0.58408416],
       [-0.2258757 ,  0.32292358,  0.65935294,  2.07938649, -0.91048001],
       [ 0.19216612, -0.81562688,  0.04454732, -1.04923655,  0.71023616],
       [ 1.3264207 ,  0.79101987, -0.51946672, -0.6175505 ,  1.72280586],
       [-1.11295414,  0.20966343,  0.3

### Task 5: Subsetting and summarizing a numpy array


As with R, we can subset matrices (2d arrays) using the syntax:

         my_matrix[row_subset, col_subset]
         
*row_subset* and/or *col_subset* can be logical arrays or arrays of integers designating the rows/columns we would like to keep. Crucially, if we do not want to subset along one dimension, we must still include a colon in its place. For example, if we are only filtering by columns (not rows) we would write:

        my_matrix[:, col_subset]
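
Both kinds of subset can be sketched on a small matrix. The names `small_matrix` and `keep_rows` are hypothetical and used only for this illustration.

```python
import numpy as np

small_matrix = np.arange(12).reshape(3, 4)  # hypothetical 3 x 4 matrix holding 0..11

# An array of integers keeps the columns at those positions (all rows kept via the colon)
cols_by_index = small_matrix[:, [0, 2]]

# A logical array keeps the rows marked True (all columns kept via the colon)
keep_rows = np.array([True, False, True])
rows_by_mask = small_matrix[keep_rows, :]

print(cols_by_index)
print(rows_by_mask)
```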

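*5a*: Filter the matrix created in 4b to include only columns 2 and 3. Because Python indexing starts at 0 and a slice excludes its stop index, these are selected with 1:3.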

In [20]:
my_matrix[:,1:3]

array([[-0.05688287, -0.59252923],
       [-0.73589984, -1.20198882],
       [ 0.01641455, -0.24728926],
       [ 0.44580057, -0.15638228],
       [ 0.4068032 , -0.99349265],
       [-0.18050331, -1.49796784],
       [ 1.32882956, -0.4299807 ],
       [-0.05004354,  1.30428293],
       [-1.60708692, -1.64595636],
       [ 0.38496829,  0.67419387],
       [ 0.32292358,  0.65935294],
       [-0.81562688,  0.04454732],
       [ 0.79101987, -0.51946672],
       [ 0.20966343,  0.38616493],
       [ 1.0497069 ,  0.29637932],
       [ 0.62830753, -0.98326907],
       [-0.24755138, -0.80758018],
       [ 1.75422248,  0.16652487],
       [ 0.23807625,  1.05331481],
       [-0.21322954,  1.90981943]])

*5b*: Recall the distinction between functions and methods. For each of the following lines of code, create a comment describing its function, then rewrite it using a method instead of the function.

In [24]:
np.ravel(my_matrix)  # np.ravel flattens a multi-dimensional array into a one-dimensional array
my_matrix.ravel()    # method form of the same operation

array([ 0.52151283, -0.05688287, -0.59252923,  0.4535095 , -0.04155082,
        0.37487575, -0.73589984, -1.20198882, -1.14765131,  1.36589016,
        2.00801076,  0.01641455, -0.24728926,  0.50809245,  1.77044481,
        1.74639888,  0.44580057, -0.15638228,  0.31481783, -0.21251275,
       -1.07274618,  0.4068032 , -0.99349265,  1.30188345,  0.47110615,
        1.01290256, -0.18050331, -1.49796784,  0.29706307,  0.35245327,
       -1.2716223 ,  1.32882956, -0.4299807 , -0.51412089, -0.58227673,
        0.55027653, -0.05004354,  1.30428293,  0.28011727,  0.29698488,
       -1.45954203, -1.60708692, -1.64595636,  0.4199458 , -1.01368032,
       -0.02796228,  0.38496829,  0.67419387,  0.3225068 ,  0.58408416,
       -0.2258757 ,  0.32292358,  0.65935294,  2.07938649, -0.91048001,
        0.19216612, -0.81562688,  0.04454732, -1.04923655,  0.71023616,
        1.3264207 ,  0.79101987, -0.51946672, -0.6175505 ,  1.72280586,
       -1.11295414,  0.20966343,  0.38616493,  0.65223537, -0.02

In [27]:
np.argsort(my_matrix, axis=1)  # np.argsort returns the indices that would sort each row in ascending order
my_matrix.argsort(axis=1)      # method form of the same operation

array([[2, 1, 4, 3, 0],
       [2, 3, 1, 0, 4],
       [2, 1, 3, 4, 0],
       [4, 2, 3, 1, 0],
       [0, 2, 1, 4, 3],
       [2, 1, 3, 4, 0],
       [0, 4, 3, 2, 1],
       [1, 3, 4, 0, 2],
       [2, 1, 0, 4, 3],
       [0, 3, 1, 4, 2],
       [4, 0, 1, 2, 3],
       [3, 1, 2, 0, 4],
       [3, 2, 1, 0, 4],
       [0, 4, 1, 2, 3],
       [0, 4, 2, 3, 1],
       [2, 0, 1, 4, 3],
       [4, 0, 2, 1, 3],
       [4, 2, 3, 0, 1],
       [4, 1, 0, 2, 3],
       [0, 1, 3, 4, 2]])
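
To see what the indices returned by argsort mean, we can apply them back to the array and compare the result with a direct sort. This is a minimal sketch on a hypothetical matrix; `np.take_along_axis` is one way to apply per-row indices and is not part of the original task.

```python
import numpy as np

small_matrix = np.array([[3, 1, 2],
                         [9, 7, 8]])  # hypothetical 2 x 3 matrix

# Positions that would sort each row in ascending order
order = small_matrix.argsort(axis=1)

# Applying those positions row by row yields the sorted rows
sorted_rows = np.take_along_axis(small_matrix, order, axis=1)

print(order)        # [[1 2 0], [1 2 0]]
print(sorted_rows)  # [[1 2 3], [7 8 9]]
```

The function form `np.argsort(small_matrix, axis=1)` returns the same indices as the method form used here.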

In [29]:
np.cumsum(my_matrix, axis=0)  # np.cumsum calculates the cumulative sum down each column (axis 0)
my_matrix.cumsum(axis=0)      # method form of the same operation

array([[ 0.52151283, -0.05688287, -0.59252923,  0.4535095 , -0.04155082],
       [ 0.89638859, -0.79278271, -1.79451804, -0.69414181,  1.32433934],
       [ 2.90439934, -0.77636816, -2.0418073 , -0.18604936,  3.09478416],
       [ 4.65079823, -0.33056759, -2.19818958,  0.12876847,  2.88227141],
       [ 3.57805204,  0.07623561, -3.19168223,  1.43065191,  3.35337756],
       [ 4.5909546 , -0.1042677 , -4.68965007,  1.72771498,  3.70583083],
       [ 3.31933231,  1.22456186, -5.11963077,  1.21359409,  3.1235541 ],
       [ 3.86960883,  1.17451833, -3.81534784,  1.49371136,  3.42053898],
       [ 2.4100668 , -0.4325686 , -5.4613042 ,  1.91365716,  2.40685866],
       [ 2.38210452, -0.04760031, -4.78711033,  2.23616396,  2.99094282],
       [ 2.15622883,  0.27532327, -4.12775739,  4.31555045,  2.08046281],
       [ 2.34839495, -0.54030361, -4.08321007,  3.2663139 ,  2.79069897],
       [ 3.67481565,  0.25071626, -4.60267679,  2.6487634 ,  4.51350483],
       [ 2.5618615 ,  0.46037969, -4.2