<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
    <img style="float: right; padding-right: 10px; width: 65px" src="https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/images/clemson_paw.png"> </div>

## Week 1 | Lab : Introduction to Python and its Numerical Stack

**Clemson University**<br>
**Instructor(s):** Tim Ransom <br>

---
## Learning goals

- Write basic Python code using functions, loops, and data structures.
- Compare and contrast Python lists and NumPy arrays.
- Utilize statistical libraries like scipy.stats and statsmodels.

-------------

## Programming Expectations
All lab assignments and homework sets for this class will use Python and the browser-based Jupyter (IPython) notebook format you are currently viewing.  Programming at the level of CPSC 2120 is a prerequisite for this course.  If you have concerns about this, come speak with any of the instructors.

We will refer to the Python 3 [documentation](https://docs.python.org/3/) in this lab and throughout the course.  

## About 

- This introductory lab is a condensed introduction to Python numerical programming.  
- By the end of this lab, you will feel more comfortable:
     1. Writing short Python code using functions, loops, lists, numpy arrays, and dictionaries.
     2. Manipulating Python lists and numpy arrays and understanding the difference between them.

In [None]:
""" RUN THIS CELL TO GET THE RIGHT FORMATTING """
import requests
from IPython.display import HTML
css_file = 'https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/css/cpsc6300.css'
styles = requests.get(css_file).text
HTML(styles)

# 1 - Getting Started with Python and its numerical stack

### 1.1 Importing modules
All notebooks should begin with code that imports *modules*, collections of reusable, commonly used Python functions and objects.  Below we import the NumPy module, a fast numerical programming library for scientific computing.  Future labs will require additional modules, which we'll import with the same syntax.

`import MODULE_NAME as MODULE_NICKNAME`

In [None]:
import numpy as np

Now that Numpy has been imported, we can access some useful functions.  For example, we can use `mean` to calculate the mean of a set of numbers.

In [None]:
my_list = [1.2, 2, 3.3]

np.mean(my_list)

### 1.2 Calculations and variables

In [None]:
# // is floor (integer) division
print("Hello")
1/2, 1//2, 1.0/2, 3*3.2

The last line in a cell is returned as the output value, as above.  For cells with multiple lines of results, we can display results using ``print``, as can be seen below.

In [None]:
print(1 + 3.0, "\n", 9, 7)
5/3

We can store integer or floating point values as variables.  The other basic Python data types -- booleans, strings, lists -- can also be stored as variables.

In [None]:
a = 1
b = 2.0

Here we store a list in a variable:

In [None]:
a = [1, 2, 3]

Think of a variable as a label for a value, not a box in which you put the value.

![sticksnotboxes.png](attachment:sticksnotboxes.png)

(**Image:** Fluent Python by Luciano Ramalho)

In [None]:
b = a
b

This DOES NOT create a new copy of `a`. It merely attaches a second label, `b`, to the same list object that `a` refers to, as the following code shows:

In [None]:
print("a", a)
print("b", b)
a[1] = 7
print("a after change", a)
print("b after change", b)
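To get an independent copy rather than a second label, one option is the list's `copy()` method (slicing with `[:]` also works). A minimal sketch, using the hypothetical names `original`, `alias`, and `clone`:

```python
# Hypothetical names for illustration
original = [1, 2, 3]
alias = original          # second label on the same list object
clone = original.copy()   # new list with the same elements

original[1] = 7

print(alias)              # changes along with `original`
print(clone)              # unaffected by the change
print(alias is original, clone is original)
```

The `is` operator checks whether two labels refer to the same object, which makes it a handy way to diagnose this kind of surprise.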

### 1.3 Tuples

Multiple comma-separated expressions on one line are returned as a *tuple*, an immutable sequence of Python objects. See the end of this notebook for an interesting use of tuples.

In [None]:
a = 1
b = 2.0
a + a, a - b, b * b, 10*a

#### `type()`

We can obtain the type of a variable, and use boolean comparisons to test these types. VERY USEFUL when things go wrong and you cannot understand why this method does not work on a specific variable!

In [None]:
type(a) == float

In [None]:
type(a) == int

In [None]:
type(a)
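A related check is the built-in `isinstance()`, which is often preferred over comparing `type()` results because it also accepts a tuple of types. A small sketch with hypothetical variables:

```python
count = 1       # hypothetical int
ratio = 2.0     # hypothetical float

print(isinstance(count, int))
print(isinstance(ratio, (int, float)))   # True if it is any of the listed types
print(isinstance(ratio, int))
```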

For reference, below are common arithmetic and comparison operations.

![ops1_v2.png](attachment:ops1_v2.png)


![ops2_v2.png](attachment:ops2_v2.png)

<div class='exercise'> <b> EXERCISE 1:  Create a tuple called `tup` with the following five objects: </b></div>

- The first element `a` is an integer of your choice
- The second element `b` is a float of your choice  
- The third element `c` is the sum of the first two elements
- The fourth element `d` is the difference of the first two elements
- The fifth element `e` is the first element divided by the second element
- Create a tuple called `tup` and add all above elements to it.
- Display the output of `tup`.  What is the type of the variable `tup`? What happens if you try to change an item in the tuple?

In [None]:
"""Your code for exercise 1 here:"""

# your code here
raise NotImplementedError

In [None]:
# this is supposed to throw an error so you can see what an error looks like in python notebooks (this is not graded)
try:
    tup[2] = 3
except TypeError:  # tuples are immutable, so item assignment raises a TypeError
    print('Tuple object does not support item assignment')  

### 1.4 Lists

Much of Python is based on the notion of a list.  In Python, a list is a sequence of items separated by commas, all within square brackets.  The items can be integers, floating points, or another type.  Unlike in C arrays, items in a Python list can be different types, so Python lists are more versatile than traditional arrays in C or other languages.

Let's start out by creating a few lists.  

In [None]:
empty_list = []
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mixed_list = [1, 2., 3, 4., 5]

print(empty_list)
print(int_list)
print(mixed_list, float_list)

Lists in Python are zero-indexed, as in C.  The first entry of the list has index 0, the second has index 1, and so on.

In [None]:
print(int_list[0])
print(float_list[1])

What happens if we try to use an index that doesn't exist for that list?  Python will complain!

In [None]:
try:
    print(float_list[10])
except IndexError:
    print("list index out of range")

You can find the length of a list using the built-in function `len`:

In [None]:
print(float_list)
len(float_list)

### 1.5 Indexing on lists plus Slicing

Since Python is zero-indexed, the last element of `float_list` is at index `len(float_list) - 1`:

In [None]:
float_list[len(float_list)-1]

It is more idiomatic in Python to use -1 for the last element, -2 for the second-to-last, and so on:

In [None]:
float_list[-3]

We can use the ``:`` operator to access a subset of the list.  This is called **slicing.**

In [None]:
print(float_list[1:-1])
print(float_list[0:2])

In [None]:
lst = ['hi', 7, 'c', 'cat', 'hello', 8]

Below is a summary of list slicing operations:
![ops3_v2.png](attachment:ops3_v2.png)

In [None]:
lst = ['hi', 7, 'c', 'cat', 'hello', 8]
lst[:2]

You can slice "backwards" as well:

In [None]:
float_list[:-2] # up to second last

In [None]:
float_list[:4] # up to but not including 5th element

You can also slice with a stride:

In [None]:
float_list[:4:2] # above but skipping every second element
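The general slice form is `seq[start:stop:step]`, where each part is optional and a negative `step` walks backwards. A short sketch on a hypothetical list `letters`:

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']   # hypothetical example list

print(letters[1:5:2])   # start at index 1, stop before index 5, step 2
print(letters[::2])     # every second element of the whole list
print(letters[::-1])    # a reversed copy
```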

We can iterate through a list using a loop.  Here's a **for loop.**

In [None]:
for ele in float_list:
    print(ele)

What if you wanted the index as well?

Use the built-in Python function `enumerate`, which returns an iterator of tuples, each of the form `(index, value)`.

In [None]:
for i, ele in enumerate(float_list):
    print(i, ele)
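`enumerate` also accepts an optional `start` argument when counting should begin somewhere other than 0. A minimal sketch with a hypothetical list:

```python
scores = [9.5, 7.0, 8.25]   # hypothetical values

# enumerate yields (index, value) pairs lazily; list() collects them
pairs = list(enumerate(scores, start=1))
print(pairs)
```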

### 1.6 Appending and deleting

We can also add items to the end of a list. The `+` operator builds a new list and leaves the original unchanged, while `append` modifies the existing list in place.

In [None]:
new_list = float_list + [.333]
new_list

In [None]:
float_list.append(.444)

In [None]:
print(float_list)
len(float_list)

Now, run the cell with `float_list.append()` a second time. Then run the subsequent cell. What happens?  

To remove an item from the list, use `del`.

In [None]:
del(float_list[2])
print(float_list)

You may also add an element (`elem`) at a specific position (`index`) in the list using `insert`:

In [None]:
elem = '3.14'
index = 1
float_list.insert(index, elem)
float_list
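The operations in this section differ in whether they modify the list in place. The sketch below, on a hypothetical list `vals`, shows that `+` builds a new list while `append`, `insert`, and `del` change the existing one:

```python
vals = [1.0, 2.0, 3.0]        # hypothetical list

longer = vals + [4.0]         # new list; vals is unchanged
print(vals, longer)

vals.append(4.0)              # modifies vals in place, returns None
vals.insert(1, 1.5)           # insert 1.5 at index 1
del vals[0]                   # remove the element at index 0
print(vals)
```

Because the in-place methods return `None`, writing `vals = vals.append(x)` is a common mistake that replaces the list with `None`.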

### 1.7 List Comprehensions

Lists can be constructed in a compact way using a *list comprehension*.  Here's a simple example.

In [None]:
int_list

In [None]:
squaredlist = []

for i in int_list:
    squaredlist.append(i*i)
squaredlist

In [None]:
squaredlist = [i*i for i in int_list]
squaredlist

And here's a more complicated one, requiring a conditional.

In [None]:
comp_list1 = [2*i for i in squaredlist if i % 2 == 0]
print(comp_list1)

This is entirely equivalent to creating `comp_list1` using a loop with a conditional, as below:

In [None]:
comp_list2 = []
for i in squaredlist:
    if i % 2 == 0:
        comp_list2.append(2*i)

print(comp_list2)

The list comprehension syntax

```python
[expression for item in list if conditional]

```
is equivalent to the syntax

```python
for item in list:
    if conditional:
        expression
```

<div class='exercise'><b> Exercise 2.1: Build a list called "primes" that contains every prime number between 1 and 100, in two different ways: </b></div>

**Using `for` loop and conditional `if` statements.**

- Build a list named `primes` that contains every prime number between 1 and 100 
- **Note:** Complete above task by only using `for` loops and conditional `if` statements.

In [None]:
"""Your code for exercise 2.1 here:"""

# your code here
raise NotImplementedError

<div class='exercise'><b> Exercise 2.2: Build a list called "primes" that contains every prime number between 1 and 100, in two different ways: </b></div>
    
**(Stretch Goal)** Using a list comprehension.  You should be able to do this in one line of code. 

- Build a list named `primes` that contains every prime number between 1 and 100 
- **Complete above task using list comprehension.**
    
**Hint:** It might help to look up the function `all()` in this [documentation](https://docs.python.org/3/library/functions.html).

In [None]:
primes = [] 

"""Your code for exercise 2.2 here:"""

# your code here
raise NotImplementedError

### 1.8 Functions

A *function* object is a reusable block of code that does a specific task.  Functions are commonplace in Python, either on their own or as they belong to other objects. To invoke a function `func`, you call it as `func(arguments)`.

We've seen built-in Python functions and methods (details below). For example, `len()` and `print()` are built-in Python functions. And at the beginning, you called `np.mean()` to calculate the mean of three numbers, where `mean()` is a function in the numpy module and numpy was abbreviated as `np`. This syntax allows us to have multiple "mean" functions in different modules; calling this one as `np.mean()` guarantees that we will execute numpy's mean function, as opposed to a mean function from a different module.

#### 1.8.1 User-defined functions

We'll now learn to write our own user-defined functions.  Below is the syntax for defining a basic function with one input argument and one output. You can also define functions with no input or output arguments, or multiple input or output arguments.

```
def name_of_function(arg):
    ...
    return(output)
```

We can write functions with one input and one output argument.  Here are two such functions.

In [None]:
def square(x):
    x_sqr = x*x
    return(x_sqr)

def cube(x):
    x_cub = x*x*x
    return(x_cub)

square(5),cube(5)

What if you want to return two variables at a time? The usual way is to return a tuple:

In [None]:
def square_and_cube(x):
    x_cub = x*x*x
    x_sqr = x*x
    return(x_sqr, x_cub)

square_and_cube(5)
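The returned tuple can be unpacked directly into separate variables. A minimal sketch (the function is repeated so the block runs on its own):

```python
def square_and_cube(x):
    # return both results together as a tuple
    return x*x, x*x*x

sq, cu = square_and_cube(5)   # tuple unpacking
print(sq, cu)
```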

#### 1.8.2 Lambda functions

Often we quickly define mathematical functions with a one-line function called a *lambda* function.  Lambda functions are great because they enable us to write functions without having to name them, i.e., they're *anonymous*.  
No return statement is needed: the value of the expression is returned automatically.


In [None]:
# create an anonymous function and assign it to the variable square
square = lambda x: x*x
print(square(3))

# length of the hypotenuse of a right triangle with legs x and y
hypotenuse = lambda x, y: (x*x + y*y) ** 0.5

## Same as
# def hypotenuse(x, y):
#     return((x*x + y*y) ** 0.5)

hypotenuse(3,4)
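Lambda functions are most often passed directly to another function. As one illustration, the built-in `sorted()` takes a `key` function; the list `words` here is a hypothetical example:

```python
words = ['banana', 'fig', 'cherry']    # hypothetical data

# sort by word length instead of alphabetically
by_length = sorted(words, key=lambda w: len(w))
print(by_length)
```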

### 1.9 Methods
A function that belongs to an object is called a *method*. By "object," we mean an "instance" of a class (e.g., list, integer, or floating point variable).

For example, when we invoke `append()` on an existing list, `append()` is a method.

In other words, a *method* is a function on a specific *instance* of a class (i.e., *object*). In this example, our class is a list. `float_list` is an instance of a list (thus, an object), and the `append()` function is technically a *method* since it pertains to the specific instance `float_list`.

In [None]:
float_list = [1.0, 2.09, 4.0, 2.0, 0.444]
print(float_list)
float_list.append(56.7)
float_list

<div class='exercise'><b> Exercise 3: generate a list of the prime numbers between 1 and 100</b></div>
    
- In Exercise 2, above, you wrote code that generated a list of the prime numbers between 1 and 100. 
- Now, write a function called `isprime()` that takes in a positive integer $N$, and determines whether or not it is prime.  
- Return `True` if it's prime and return `False` if it isn't. 
- Then, using a list comprehension and `isprime()`, create a list `myprimes` that contains all the prime numbers less than 100.

In [None]:
"""Your code for exercise 3 here:"""

# your code here
raise NotImplementedError

### 1.10 Introduction to Numpy
Scientific Python code uses a fast array structure, called the numpy array. Those who have programmed in MATLAB will find this very natural. For reference, the numpy documentation can be found [here](https://docs.scipy.org/doc/numpy/reference/).  

Let's make a numpy array:

In [None]:
my_array = np.array([1, 2, 3, 4])
my_array

In [None]:
# works as it would with a standard list
len(my_array)

The `shape` attribute of an array is very useful (we'll see more of it later when we talk about 2D arrays -- matrices -- and higher-dimensional arrays).

In [None]:
my_array.shape

Numpy arrays are **typed**. This means that all the elements of an array share a single type (e.g., integer, float, or string).

In [None]:
my_array.dtype
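One consequence of arrays being typed is that mixed inputs are converted to a single common type. A small sketch, assuming NumPy's usual type-promotion behavior:

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])      # ints are promoted to floats
texty = np.array([1, 'two', 3])    # everything becomes a string

print(ints.dtype, mixed.dtype, texty.dtype)
print(mixed)
```

This is a notable contrast with Python lists, which keep each element's original type.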

Numpy arrays have similar functionality as lists! Below, we compute the length, slice the array, and iterate through it (one could identically perform the same with a list).

In [None]:
print(len(my_array))
print(my_array[2:4])

for ele in my_array:
    print(ele)

There are two ways to manipulate numpy arrays: a) by calling the array's own methods (e.g., `my_array.mean()`) or b) by applying a numpy module function (e.g., `np.mean()`) with the array as an argument.

In [None]:
print(my_array.mean())
print(np.mean(my_array))

A ``constructor`` is a general programming term that refers to the mechanism for creating a new object (e.g., list, array, String).

There are many other efficient ways to construct numpy arrays. Here are some commonly used numpy array constructors. Read more details in the numpy documentation.

In [None]:
np.ones(10) # generates 10 floating point ones

Numpy gains a lot of its efficiency from being typed. That is, all elements in the array have the same type, such as integer or floating point. The default type, as can be seen above, is a float. (Each default float is a 64-bit double, `float64`, which uses 8 bytes of memory regardless of whether the code is running on a 32-bit or 64-bit machine.)

In [None]:
np.dtype(float).itemsize # in bytes (remember, 1 byte = 8 bits)
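The storage size per element depends on the dtype rather than on the machine. A quick sketch comparing a few dtypes:

```python
import numpy as np

for dt in (np.float32, np.float64, np.int8):
    arr = np.ones(10, dtype=dt)
    # itemsize is bytes per element; nbytes is the total for the array
    print(arr.dtype, arr.itemsize, arr.nbytes)
```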

In [None]:
np.ones(10, dtype='int') # generates 10 integer ones

In [None]:
np.zeros(10)

Often, you will want random numbers. Use the `random` constructor!

In [None]:
np.random.random(10) # uniform from [0, 1): 0 is included, 1 is not

You can generate random numbers from a normal distribution with mean 0 and variance 1:

In [None]:
normal_array = np.random.randn(1000)
print("The sample mean and standard deviation are %f and %f, respectively." %(np.mean(normal_array), np.std(normal_array)))

In [None]:
len(normal_array)
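Random draws change on every run. When reproducible results are needed, one option is NumPy's `Generator` interface with a fixed seed; the seed value `42` below is an arbitrary choice:

```python
import numpy as np

rng_a = np.random.default_rng(42)   # arbitrary seed
rng_b = np.random.default_rng(42)   # same seed, same stream of numbers

draws_a = rng_a.standard_normal(1000)
draws_b = rng_b.standard_normal(1000)

print(np.array_equal(draws_a, draws_b))
print(draws_a.mean(), draws_a.std())
```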

You can sample with and without replacement from an array. Let's first construct an array with evenly-spaced values:

In [None]:
grid = np.arange(0., 1.01, 0.1)
grid

Without replacement

In [None]:
np.random.choice(grid, 5, replace=False)

In [None]:
# supposed to error out - read the error message!
np.random.choice(grid, 20, replace=False)

With replacement:

In [None]:
np.random.choice(grid, 20, replace=True)

### 1.11 Tensors

We can think of tensors as a general name for multidimensional arrays of numerical values. While tensors originated in mathematics and physics, they have since been applied to numerous other disciplines, including machine learning. In this class you will only be using **scalars**, **vectors**, and **2D arrays**, so you do not need to worry about the name 'tensor'.

We will use the following naming conventions:

- scalar = just a number = rank 0 tensor  ( $a \in F$ )
<BR><BR>
- vector = 1D array = rank 1 tensor ( $x = (x_1, \ldots, x_n)^\top \in F^n$ )
<BR><BR>
- matrix = 2D array = rank 2 tensor ( $\textbf{X} = [a_{ij}] \in F^{m \times n}$ )
<BR><BR>
- 3D array = rank 3 tensor ( $\mathscr{X} = [t_{ijk}] \in F^{m \times n \times l}$ )
<BR><BR>
- $\mathscr{N}$D array = rank $\mathscr{N}$ tensor ( $\mathscr{T} = [t_{i_1 \ldots i_\mathscr{N}}] \in F^{n_1 \times \ldots \times n_\mathscr{N}}$ )


### Slicing a 2D array

![slicing_2D_oreilly.png](attachment:slicing_2D_oreilly.png)

[source: oreilly](https://www.oreilly.com/library/view/python-for-data/9781449323592/ch04.html)

In [None]:
# how do we get just the second row of the above array?
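Since the array in the figure is not defined in this notebook, the sketch below builds a hypothetical 3x3 array `arr2d` and shows a few common 2D slices, including the second row:

```python
import numpy as np

arr2d = np.arange(1, 10).reshape(3, 3)   # hypothetical 3x3 array with values 1..9

second_row = arr2d[1]          # same as arr2d[1, :]
second_col = arr2d[:, 1]       # all rows, column index 1
top_right = arr2d[:2, 1:]      # first two rows, last two columns

print(second_row)
print(second_col)
print(top_right)
```

The index before the comma selects rows and the index after it selects columns; a bare `:` means "everything along this axis".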

#### Numpy supports vector operations

What does this mean? It means that instead of adding two arrays, element by element, you can just say: add the two arrays.

In [None]:
first = np.ones(5)
second = np.ones(5)
first + second # element-wise addition; returns a new array (first and second are unchanged)

Note that this behavior is very different from python lists where concatenation happens.

In [None]:
first_list = [1., 1., 1., 1., 1.]
second_list = [1., 1., 1., 1., 1.]
first_list + second_list # concatenation

On some computer chips, this numpy addition actually happens in parallel and can yield significant increases in speed. But even on regular chips, the advantage of greater readability is important.
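To make the comparison concrete, the sketch below computes the same element-wise sum with a Python loop over lists and with a single NumPy expression (the variable names are illustrative):

```python
import numpy as np

left = [1.0, 2.0, 3.0]
right = [10.0, 20.0, 30.0]

# element-by-element addition with plain lists
loop_sum = [a + b for a, b in zip(left, right)]

# the same result as one vectorized expression
vec_sum = np.array(left) + np.array(right)

print(loop_sum)
print(vec_sum)
```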

#### Broadcasting

Numpy supports a concept known as *broadcasting*, which dictates how arrays of different sizes are combined together. There are too many rules to list here, but importantly, multiplying an array by a number multiplies each element by the number. Adding a number adds the number to each element.

In [None]:
first_list = np.array([1., 1., 1., 1., 1.])
first_list + 1

In [None]:
first*5

This means that if you wanted samples from a normal distribution with mean 5 and standard deviation 7, i.e. $N(5, 7^2)$, you could do:

In [None]:
normal_5_7 = 5 + 7*normal_array
np.mean(normal_5_7), np.std(normal_5_7)

Multiplying two arrays multiplies them element-by-element:

In [None]:
(first +1) * (first*5)

You might have wanted to compute the dot product instead:

In [None]:
np.dot((first +1) , (first*5))
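Broadcasting also applies between arrays of different shapes, not only between an array and a number. A brief sketch with hypothetical arrays, where a row of length 3 and a column of height 2 are each combined with a 2x3 array:

```python
import numpy as np

table = np.array([[1., 2., 3.],
                  [4., 5., 6.]])        # shape (2, 3)
row = np.array([10., 20., 30.])       # shape (3,)
col = np.array([[100.], [200.]])      # shape (2, 1)

print(table + row)    # row is added to each row of table
print(table + col)    # col is added to each column of table
```

Shapes are compared from the last axis backwards, and an axis of length 1 (or a missing axis) is stretched to match the other array.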

### 1.12 Probability Distributions from `scipy.stats` and `statsmodels`

Two useful statistics libraries in python are `scipy` and `statsmodels`.

For example, we can load a z-test for proportions from `statsmodels`. A two-tailed test is appropriate if you want to determine whether there is any difference between the groups you are comparing. For instance, if you want to see whether Group A scored higher or lower than Group B, you would use a two-tailed test.

In [None]:
import statsmodels
from statsmodels.stats.proportion import proportions_ztest

In [None]:
x = np.array([74,100])
n = np.array([152,266])

zstat, pvalue = statsmodels.stats.proportion.proportions_ztest(x, n)
print("Two-sided z-test for proportions: \n","z =",zstat,", pvalue =",pvalue)
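For intuition, the two-sample z statistic can be computed by hand. The sketch below assumes the pooled-proportion form of the test, which is what `proportions_ztest` is understood to use by default for two samples, so the values should closely match the output above:

```python
import math

successes = [74, 100]    # same counts as above
trials = [152, 266]

p1 = successes[0] / trials[0]
p2 = successes[1] / trials[1]
p_pooled = sum(successes) / sum(trials)

# standard error of the difference under the null of equal proportions
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / trials[0] + 1 / trials[1]))
z_manual = (p1 - p2) / se

# two-sided p-value from the standard normal distribution
p_manual = math.erfc(abs(z_manual) / math.sqrt(2))

print(z_manual, p_manual)
```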

In [None]:
#The `%matplotlib inline` ensures that plots are rendered inline in the browser.
%matplotlib inline
import matplotlib.pyplot as plt

Let's get the normal distribution namespace from `scipy.stats`. See here for [Documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html).

In [None]:
from scipy.stats import norm

Let's create 1,000 points between -10 and 10

In [None]:
x = np.linspace(-10, 10, 1000) # linspace() returns evenly-spaced numbers over a specified interval
x[0:10], x[-10:]

Let's get the pdf of a normal distribution with a mean of 1 and standard deviation 3, and plot it using the grid points computed before:

In [None]:
pdf_x = norm.pdf(x, 1, 3)
plt.plot(x, pdf_x)

You can also draw random samples (random variates) from the distribution using the `rvs` method.
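A minimal sketch of drawing samples with `rvs`, using the same mean and standard deviation as the pdf above; the sample size and `random_state` seed are arbitrary choices:

```python
from scipy.stats import norm

# loc is the mean, scale is the standard deviation
samples = norm.rvs(loc=1, scale=3, size=1000, random_state=0)

print(samples[:5])
print(samples.mean(), samples.std())
```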

### 1.13 Dictionaries
A dictionary is another data structure (aka storage container) -- arguably the most powerful. Like a list, a dictionary is a collection of items. Unlike a list, its items are accessed with keys and not integer positions. (Since Python 3.7, dictionaries preserve insertion order, but they still cannot be indexed by position.)  

Dictionaries are the closest data structure we have to a database.

Let's make a dictionary with a few courses and their corresponding enrollment numbers.

In [None]:
enroll2017_dict = {'CS50': 692,
                   'CS109A / Stat 121A / AC 209A': 352,
                   'Econ1011a': 95,
                   'AM21a': 153,
                   'Stat110': 485}
enroll2017_dict

One can obtain the value corresponding to a key via:

In [None]:
enroll2017_dict['CS50']

If you try to access a key that isn't present, your code will yield an error:

In [None]:
# supposed to error out - READ THE ERROR MESSAGE!
enroll2017_dict['CS630']

Alternatively, the `.get()` method allows one to gracefully handle these situations by providing a default value if the key isn't found:

In [None]:
enroll2017_dict.get('CS630', 5)

Note, this does not _store_ a new value for the key; it only provides a value to return if the key isn't found.

In [None]:
# supposed to error out - READ THE ERROR MESSAGE!!!
enroll2017_dict['CS630']

In [None]:
enroll2017_dict.get('C730', None)
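Two related tools are the `in` operator, which tests whether a key exists, and `setdefault()`, which (unlike `get()`) does store the default. A sketch on a small hypothetical dictionary:

```python
enroll = {'CS50': 692, 'Stat110': 485}   # hypothetical subset of the data

print('CS630' in enroll)                  # membership test on keys
print(enroll.get('CS630', 0))             # default returned, nothing stored
print('CS630' in enroll)

enroll.setdefault('CS630', 0)             # stores the default if the key is missing
print(enroll)
```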

All sorts of iterations are supported:

In [None]:
enroll2017_dict.values()

In [None]:
enroll2017_dict.items()

We can iterate over the tuples obtained above:

In [None]:
for key, value in enroll2017_dict.items():
    print("%s: %d" %(key, value))

Simply iterating over a dictionary gives us the keys. This is useful when we want to do something with each item:

In [None]:
second_dict={}

for key in enroll2017_dict:
    second_dict[key] = enroll2017_dict[key]

second_dict

The above creates a new dictionary that is a (shallow) __copy__ of `enroll2017_dict`, unlike `second_dict = enroll2017_dict`, which would have made both variables label the same object.
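A shorter way to make the same kind of copy is the dictionary's `copy()` method. The sketch below contrasts it with plain assignment, using a hypothetical dictionary:

```python
source = {'a': 1, 'b': 2}     # hypothetical dictionary

alias = source                # same object, two labels
duplicate = source.copy()     # new dictionary with the same items

source['c'] = 3

print(alias)
print(duplicate)
```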

In the previous dictionary example, the keys were strings corresponding to course names.  Keys don't have to be strings, though; they can be other _immutable_ data types such as numbers or tuples (not lists, as lists are mutable).
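For instance, a tuple can serve as a key while a list cannot. A minimal sketch with hypothetical (course, year) keys:

```python
by_course_year = {('CS50', 2017): 692}    # hypothetical tuple keys

print(by_course_year[('CS50', 2017)])

try:
    by_course_year[['CS50', 2018]] = 700  # a list is mutable, hence unhashable
except TypeError as err:
    print("TypeError:", err)
```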

### Dictionary comprehension: "Do not try this at home"

You can construct dictionaries using a *dictionary comprehension*, which is similar to a list comprehension. Notice the braces {} and the use of `zip` (see next cell for more on `zip`).

In [None]:
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

my_dict = {k:v for (k, v) in zip(int_list, float_list)}
my_dict

#### Creating tuples with `zip`

`zip` is a Python built-in function that returns an iterator that aggregates elements from each of the iterables. This is an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables. The iterator stops when the shortest input iterable is exhausted. The `set()` built-in function returns a `set` object, optionally with elements taken from another iterable. Wrapping `zip` in `set()` (or `list()`) consumes the iterator so that its contents can be displayed; note that a set is unordered and drops duplicates. In the example below, the iterables are the two lists, `float_list` and `int_list`. We can have more than two iterables.

In [None]:
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

viz_zip = set(zip(int_list, float_list))
viz_zip

In [None]:
type(viz_zip)

In [None]:
set((1,2,3,2,1,4))
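Since a set does not preserve order, `list()` is another way to view the output of `zip`. The sketch below also zips three hypothetical iterables and shows that the result stops at the shortest one:

```python
ids = [1, 2, 3, 4]
names = ['a', 'b', 'c']
flags = (True, False, True)

zipped = list(zip(ids, names, flags))   # stops after 3 tuples
print(zipped)

# zip(*...) reverses the operation
unzipped_ids, unzipped_names, unzipped_flags = zip(*zipped)
print(unzipped_ids)
```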

### References

A useful book by Jake Vanderplas:  [PythonDataScienceHandbook](https://jakevdp.github.io/PythonDataScienceHandbook/).

You may also benefit from using [Chris Albon's web site](https://chrisalbon.com) as a reference. It contains lots of useful information.

# END