# Introduction to Python and Jupyter

Welcome to CHM 493, A.I. in Chemistry! 

This course will cover diverse topics related to the increasing presence of data-driven techniques in chemistry and biochemistry. The best way to become fully immersed in these topics is to run, write, and analyze our own models of chemical phenomena, which requires some practice with programming.


I'll do my best to provide as much help as possible with programming, and that starts with this notebook. Many of your assignments will be encapsulated in these Jupyter notebooks. A Jupyter notebook is a flexible programming environment that lets you write and run Python code, as well as write responses to conceptual questions. We will also be able to run scientific software and look at molecules, proteins, etc. all within these notebooks. This first notebook is designed to introduce/refresh what I view as the most important parts of Python to know, ending with an assignment to practice these tools. The assignment is also designed to get you all thinking about numerical models, which we'll apply to chemical problems soon enough.

For all assignments, **the first thing you need to do is go to File->Make a Copy.** Then rename the copy to include your initials. To submit your work, you'll need to upload both the .ipynb file and a PDF of the notebook.

## Using Jupyter

Jupyter notebooks are composed of cells. Cells can have types, but we'll only use the Markdown and Code types. You can see the type of cell in the toolbar at the top. Markdown cells contain text (this cell is Markdown), and Code cells contain Python code. To execute any cell, press `SHIFT+ENTER`.

The cell below is Markdown. Execute it to render the text (make sure you've clicked the cell to make it the active cell). You may add any text you like, and double-click the cell to re-edit it after you've executed it.

I'm a markdown cell! Feel free to edit me!

The cell below is Python code. Execute it to see the results.

In [None]:
print("Hello World")

result = 1 + 1
print("1 + 1 =", result)

To save your progress, which you should do frequently, simply type `CTRL+s`. You will see the last checkpoint time update at the top of the browser.

Lastly, Python is a language that uses many external libraries, which need to be imported. If you've run Python in the past, you'll remember some `import` statements at the top of your scripts. Here, we need to execute a cell with our imports once, and we're good to go for the whole notebook.

Let's import numpy, which has a lot of very useful mathematical operations with numbers and arrays.

In [None]:
import numpy as np

What follows is a very quick walkthrough of some basic Python functionality. Go through the sections, running all code cells, and editing them with your own lines just to get practice. I want you to be comfortable with this content, but by no means are you expected to be experts in Python (I'm not). My best programming advice is to use Google and StackOverflow frequently. If there's something you don't know how to do (like, how do I evaluate $e^\pi$?), there's almost certainly an answer online.

## Variables, Types, and Printing

Now we'll get to some Python basics. 

Variables are used to store values. They have different types depending on the value assigned. Execute the cell below.

In [None]:
# these are integers
int1 = 1
int2 = 5

# these are floats
f1 = 1.334
f2 = 0.93

# these are strings, they can have any text, or they can be empty
s1 = "hello"
s2 = ""
# even if a string looks like a number, it's still a string and not a float
s3 = "4.4"
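
If you're ever unsure what type a variable has, Python's built-in `type()` function will tell you, and `float()` or `int()` can convert a numeric-looking string into a number. Here is a minimal sketch; the variable names are made up just for this illustration:

```python
# Illustrative variables (names chosen only for this example)
count = 5
mass = 1.334
label = "4.4"

# type() reports the type of any value
print(type(count))
print(type(mass))
print(type(label))

# A string that looks like a number must be converted before doing math with it
label_as_float = float(label)
print(label_as_float + 1.0)
```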

The value of variables is that we can reuse them, even in different cells. To print a variable (or anything) use the `print()` function.

In [None]:
# We can print the ints on separate lines
print(int1)
print(int2)

# or on the same line by separating with commas
print(int1,int2)

# we can even mix types here
print(int1, f1, s1)

# we can also print things that aren't even stored as variables
print("The current year is", 2023)
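
When you want more control over how a printed line looks, one common option is an f-string, which substitutes variables into text. The sketch below is just one way to do it; the variable names are illustrative, and the values are the benzene name and molar mass that show up later in this notebook:

```python
# Illustrative variables for this example
name = "benzene"
molar_mass = 78.114

# An f-string replaces each {...} with the value of the expression inside;
# :.2f rounds the number to two decimal places when printing
message = f"The molar mass of {name} is {molar_mass:.2f} g/mol"
print(message)
```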

## Mathematical Operations

Python supports many types of operations, including numerical, comparison, and assignment operations.

Here are just a few numerical operations:

In [None]:
# adding
result = 1 + 1
print(result)

# subtraction
result = 100-230
print(result)

# multiplication
result = 3.3 * 12.0001
print(result)

# division
result = 3123/3415234.1
print(result)

# power
result = 2.1**(4.2)
print(result)

# natural logarithm: here we'll use the NumPy library
result = np.log(32)
print(result)

As you can see, arithmetic in python is a lot like using a calculator.
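
The cell above only covers numerical operations, so here is a short sketch of comparison and assignment operators as well. The variable names and values are arbitrary choices for illustration:

```python
# Comparison operators return a boolean (True or False)
x = 7
y = 2
is_greater = x > y        # greater than
is_equal = x == y         # equality test (note the two equals signs)
is_different = x != y     # inequality test
print(is_greater, is_equal, is_different)

# Augmented assignment updates a variable using its current value
total = 10
total += 5    # same as total = total + 5
total *= 2    # same as total = total * 2
print(total)

# Floor division and remainder (modulo)
quotient = x // y
remainder = x % y
print(quotient, remainder)
```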

In the cell below, calculate the square root of 2 using two different methods. You may want to search online for a way to do one of them.

## Lists, Dictionaries, and NumPy arrays

### Lists

We are familiar with the idea that we can store values as variables. These variables can have types, like string, int, or float. More advanced objects in python allow us to group and organize these simpler types.

A list is an ordered collection of items, written in the following way:

In [None]:
# Here is a list of strings
fruit = ['apple', 'banana', 'tomato']

# We can make a list of numbers as well,
numbers = [2, 4, 56]

# print the lists


We can add to a list in a few ways. We can append to the end of it,

In [None]:
fruit.append('orange')

# print the appended list

We can also insert an item using an index (remember, indexing always starts at zero).

In [None]:
numbers.insert(2, 12)

print(numbers)

Square brackets are used to access a particular element of the list. For example,

In [None]:
a_fruit = fruit[1]

Conveniently, we can also access elements from the back by using a negative index,

In [None]:
last_fruit = fruit[-1]

print(a_fruit, last_fruit)

A more advanced version of indexing is called *slicing*, where we can get a new list from a subset of a larger list. Suppose we want to get the first two elements:

In [None]:
sliced_fruit = fruit[:2]

or the middle two,

In [None]:
sliced_fruit_2 = fruit[1:3]
print(sliced_fruit_2)
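
The general slicing pattern is `start:stop:step`, where the start index is included and the stop index is not. Here is a small sketch on a separate, made-up list (so the `fruit` list stays as it is):

```python
# An illustrative list, separate from the fruit list above
elements = ['H', 'He', 'Li', 'Be', 'B', 'C']

# The start index is included, the stop index is not
first_three = elements[0:3]
print(first_three)

# Omitting the stop index runs to the end of the list
from_index_two = elements[2:]
print(from_index_two)

# A third number sets the step size
every_other = elements[::2]
print(every_other)

# Slicing returns a new list; the original still has all six items
print(len(elements))
```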

In the cell below, add two fruits to the list, one at the beginning and one at the end. Then, print a slice of the list with the last three elements.

### Dictionaries

Dictionaries are one of the most useful tools in Python. A dictionary is similar to a list in that it is a collection of elements, but rather than being ordered with numerical indices, its elements come in key-value pairs. The keys are typically strings. This provides a very convenient way to store a set of labelled data, where the labels are the keys and the data are the values.

We can initialize an empty dictionary using curly braces. We'll use this dictionary to store chemical information about a particular molecule.

In [None]:
benzene_info = {}

Let's populate this dictionary with some information. This can be done by redefining the object, or by adding key/value pairs one at a time.

In [None]:
# here we redefine the dictionary, but this time with some values
benzene_info = {'name': 'benzene', 'molar_mass':78.114, 'SMILES':'c1ccccc1', 'LD50': 930}

# We can also add them one at a time
benzene_info['melting_point'] = 5.53

#print the dictionary below


Similar to a list, we can print an item in the dictionary using the key. In the cell below, print the molar mass of benzene using the dictionary.

Notice that the types of the values can vary.
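
A few other dictionary operations come up often: checking whether a key exists, looking up a key with a fallback value, and listing the keys. The sketch below uses a second, made-up dictionary for water so that `benzene_info` is left alone:

```python
# An illustrative dictionary, separate from benzene_info
water_info = {'name': 'water', 'molar_mass': 18.015, 'SMILES': 'O'}

# Check whether a key exists before using it
has_smiles = 'SMILES' in water_info
has_ld50 = 'LD50' in water_info
print(has_smiles, has_ld50)

# .get() returns a default value instead of raising an error for a missing key
ld50 = water_info.get('LD50', 'not recorded')
print(ld50)

# List the keys, and confirm that the values can have different types
keys = list(water_info.keys())
print(keys)
print(type(water_info['name']), type(water_info['molar_mass']))
```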

### Numpy Arrays

NumPy is an external library that we'll basically always want to use. For our purposes, NumPy arrays are lists of numbers that are amenable to efficient mathematical manipulation. They can be 1-dimensional (vectors), 2-dimensional (matrices), or of general rank.

Numpy arrays come with a lot of automated functions that let us do operations like dot products, norms, and even more advanced numerical algorithms like diagonalization and SVD.
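
As a small taste of those more advanced operations, here is a sketch that diagonalizes a 2x2 symmetric matrix with `np.linalg.eigh` and multiplies the matrix by a vector. The matrix and vector are arbitrary values chosen for illustration:

```python
import numpy as np

# A small symmetric matrix chosen only for illustration
matrix = np.array([[2.0, 1.0],
                   [1.0, 2.0]])

# eigh is intended for symmetric (Hermitian) matrices; it returns the
# eigenvalues in ascending order and the eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(matrix)
print(eigenvalues)

# Matrix-vector product with the @ operator
vector = np.array([1.0, -1.0])
product = matrix @ vector
print(product)
```

For this particular matrix the eigenvalues are 1 and 3, and the chosen vector is an eigenvector with eigenvalue 1, so the product equals the vector itself.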

If we have the values we want, we can make a numpy array:

In [None]:
# four values, so that `a` has the same length as `numbers` after the insert above
a = np.array([1,2,3,4])
print(a)

Or we can make an array from an existing list,

In [None]:
# let's make a numpy array from our earlier example
a2 = np.asarray(numbers)
print(a2)

Arrays of the same length can be added and subtracted element by element,

In [None]:
# verify that these give the same result
print(a2 + a2)
print(np.add(a2,a2))

We can also compute the dot product,

In [None]:
print(np.dot(a,a2))

Now I'll introduce two convenient functions for creating arrays. The first is the `zeros` function, which creates an array of specified dimension, filled with zeros.

In [None]:
# Make a rank 1 array with 10 zeroed elements
zarray = np.zeros(10)
print(zarray)

The next function is `linspace`, which creates an array of numbers with a specified minimum, maximum, and number of elements. For example, to get an array with five equally spaced numbers from 2 to 3 (both endpoints included), we'd use:

In [None]:
larray = np.linspace(2,3,5)
print(larray)
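
Two details of these functions are easy to miss, so here is a short sketch with arbitrary example values: `np.zeros` accepts a tuple to make a multi-dimensional array, and `np.linspace` includes both endpoints by default.

```python
import numpy as np

# Passing a tuple to np.zeros gives a 2-dimensional array (2 rows, 3 columns)
zmatrix = np.zeros((2, 3))
print(zmatrix.shape)

# linspace includes both endpoints, so 5 points from 0 to 1 are spaced by 0.25
points = np.linspace(0, 1, 5)
spacing = points[1] - points[0]
print(points)
print(spacing)
```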

In the cell below create a numpy array with consecutive integers from 0 to 100. Print the result.

Now, add the first 50 elements of this array to the last 50 elements, and store the result in a new array.

## Conditional Statements and Loops

Conditional statements allow you to control the sequence and/or logic of a set of operations.

Here's a numerical example:

In [None]:
x = 2.5

if x < 4:
    print('x is less than 4')
else:
    print('x is greater than or equal to 4')

You can also use `if` statements to compare strings, or to see if an element of a list or dictionary exists:

In [None]:
if 'apple' in fruit:
    print('apple is in fruit')
    
if 'kiwi' in fruit:
    print('kiwi is in fruit')
else:
    fruit.append('kiwi')
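
When there are more than two possibilities, `elif` lets us chain several conditions together. The sketch below is a made-up example that classifies water by temperature at 1 atm:

```python
# Illustrative temperature in degrees Celsius
temp_celsius = 25.0

# Only the first condition that is True has its block executed
if temp_celsius < 0:
    phase = 'solid'
elif temp_celsius < 100:
    phase = 'liquid'
else:
    phase = 'gas'
print(phase)

# Conditions can also be combined with `and`, `or`, and `not`
is_liquid = temp_celsius >= 0 and temp_celsius < 100
print(is_liquid)
```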

Loops allow us to iterate over numbers, elements in a list, or key/value pairs in a dictionary. They are also useful to repeat blocks of code as needed. We will use them a lot.

In [None]:
# We can loop over items in a list:
for f in fruit:
    print(f)
    
# We can also loop through a dictionary
for key, value in benzene_info.items():
    print(key, value)
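
To loop over numbers rather than items, Python provides `range`, and `enumerate` gives the index of each item along with the item itself. A minimal sketch with made-up values:

```python
# range(1, 11) yields the integers 1 through 10; the stop value is excluded
total = 0
for i in range(1, 11):
    total = total + i
print(total)   # 55

# enumerate gives the index alongside each item
halogens = ['F', 'Cl', 'Br', 'I']
labelled = []
for index, symbol in enumerate(halogens):
    labelled.append(f"{index}: {symbol}")
print(labelled)
```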
    

In the cell below, use a `for` loop to calculate the norm of your vector from the previous section (the one where you added the two parts together).

Then, use a numpy function to verify your result.

## Functions

Functions are useful when you need to perform a task multiple times. You can pass parameters to a function, and the function will act on them and optionally return a value. Reassigning a parameter inside the function does not change the caller's variable, although mutable objects such as lists can be modified in place.

Here's a simple (and contrived) example of an adding function:

In [None]:
def add_two_numbers(n1, n2):
    return n1 + n2

result = add_two_numbers(4,np.pi)
print(result)
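
One subtlety about what a function can and cannot change is easier to see in code. In this sketch (both function names are hypothetical, written just for this illustration), reassigning a parameter leaves the caller's variable alone, while appending to a list that was passed in modifies the caller's list:

```python
def celsius_to_kelvin(temp_c):
    """Convert a temperature from Celsius to Kelvin."""
    temp_c = temp_c + 273.15   # reassigning the parameter only affects the local name
    return temp_c

def append_kelvin(temps):
    """Append the Kelvin value of the first entry to the list, in place."""
    temps.append(temps[0] + 273.15)

room_temp = 25.0
room_temp_k = celsius_to_kelvin(room_temp)
print(room_temp, room_temp_k)   # room_temp is still 25.0

readings = [25.0]
append_kelvin(readings)
print(readings)   # the caller's list now has two entries
```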

Suppose we want to find the distance between two points, which might represent the locations of atoms in a molecule. In the cell below, write a function to calculate the distance between two points, and make your code flexible enough to work for 1-, 2-, and 3-dimensional coordinate systems.

In [None]:
# test your function with these inputs:
point1 = [2.1]
point2 = [7.55]

point1 = [4.66, 14.90]
point2 = [20.02, 15.11]


point1 = [70.1,22.1,3.0]
point2 = [200.11, 30.33, 78.9]

## Assignment

Throughout this course, we will be writing python functions to perform a variety of tasks related to building machine learning models. Before we get to that, let's practice writing functions to perform a simple mathematical task, estimating $\pi$ with a numerical method.

Your assignment is to write a numerical algorithm to calculate $\pi$. Consider a 1x1 box with a circle inscribed,

![Circle inscribed in a square](images/Circle-inscribed-in-a-square.jpg)

where we can imagine it existing on an $xy$-plane with the bottom left corner at (0,0) and the top right point at (1,1). Assuming the radius of the circle is $r$, which in our example is 0.5, we can calculate $\pi$ using the areas of the square and circle,

$$\frac{\mathrm{area~of~circle}}{\mathrm{area~of~square}} = \frac{\pi r^2}{(2r)^2} = \frac{\pi}{4}$$

which rearranges to,
$$\pi = 4 \times \frac{\mathrm{area~of~circle}}{\mathrm{area~of~square}}$$


So how do we get these areas without already using $\pi$? This is the part we will do numerically. We can calculate the ratio of areas in a clever way:

 1) First choose a number of random points to generate.
 2) For each of these points, generate a random (x,y) point within our square (what are the limits of x and y?).
 3) Count how many of those random points also lie within the circle, let's call these 'hits'.
 4) The ratio of areas is then just the number of hits divided by the total number of random points.
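
If you haven't used Python's built-in `random` module before, the sketch below shows two of its functions that generate random floats. It only demonstrates the library calls; how to use them in the steps above is up to you.

```python
import random

# Fixing the seed makes the sequence of random numbers repeatable (optional)
random.seed(0)

# random.random() returns a float in the interval [0, 1)
u = random.random()

# random.uniform(a, b) returns a float between a and b
v = random.uniform(-1.0, 1.0)

print(u, v)
```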
 
 
 **For your assignment, use the cell below to write a function to estimate the value of $\pi$ given a maximum number of random points. Then, briefly answer the questions that follow.**

In [None]:
## Your code will go in this cell. I'll help get you started

## First we need to import a library that calculates random numbers for us
import random

## Let's define a function for this, we'll want to be able to toggle the number of random points
def compute_pi(max_points):
    
    # 1) We'll need to store the number of hits, let's start at zero
    #    and increment it by one whenever we get a hit
    nhits = 0
    
    # 2) Now we need to see how many hits we get out of max_points 
    
    
    # 3) Once we have nhits finalized, calculate and return pi
    
    
# With the above function completed, test it here

1. Use your function to estimate $\pi$ using 5, 10, 100, 1,000, 100,000, and 1,000,000 points. Report the error for each run with respect to the exact value.

2. Does $\pi$ seem to converge to the correct value quickly or slowly? Run any additional calculations you'd like, but I'm only looking for a qualitative answer.

3. Run your function five times, with 100, 1,000, 100,000, and 1,000,000 points for each set of trials. For each number of max points, does the estimated value of $\pi$ change significantly (give the order of magnitude where you see changes)? How does this random error change with the number of max points?

4. A major topic of this course centers on evaluating the performance of A.I. models of chemical phenomena. Does our model of $\pi$ seem to be effective, and how should we define effectiveness? Suggest a possible improvement (not to the equations, just the algorithm itself) to our model.

Bonus: Write a new function to estimate $\pi$ that does not generate data from random numbers. Does this function perform better or worse than the stochastic one? Use data to support your claims.