# Python Introduction for Climate and Big Data

First things first, let's introduce what exactly you are looking at.  This interface is a Jupyter notebook, which is going to be our workspace for all of our data analysis (we will start in Google Colab and then switch to Jupyter, but they are in practice the same).  In a notebook, you can do lots, but most importantly you can write and run code which will load, analyze, and graph data.

### Let's get working with python

Python is a programming language, much like any other you may have heard of or used in a class/for research.  At its core, we write a series of commands in a syntax that the computer understands, and then the computer will execute those commands for us.

In [None]:
# anything that follows a "#" will be a comment, which does not get run as code and we use to annotate our code

# we can define and assign variables using "="
x = 3

# we can check the value assigned to a variable either by printing
print('x =', x)

# or simply by typing the object as the last line of a cell
x

This brings us to some characteristics of a Jupyter notebook.  Above, I mentioned a "cell," which is essentially just a block of code.  We can run a single cell by pressing **SHIFT + ENTER**.  Go ahead and click on the above cell and run it.  While a cell is running, there will be a "\*" at the top left corner of the cell, and once the cell has run, there will be a number indicating the order in which cells have been run.  To create a new cell above or below the currently selected cell, press **a** or **b**, respectively, and to delete a cell, double tap **d**.

For the rest of this notebook, once you have read and understood the content of each cell, run it!  Don't hesitate to make new cells to write some of your own code to test your understanding.

In [None]:
# variable names can be anything, but it is good practice to make them informative
my_variable = 9.81
g = 9.81
print('acceleration due to gravity =', g)

In [None]:
# since "=" is used for assignment, we use a double "==" to check for equality
print('x == 4?', x == 4)
print('x == 3?', x == 3)

In [None]:
# the order in which cells are run is important because variables can be reassigned
# not realizing you have reassigned a variable or doing so in the wrong order can be the cause of lots of errors
print('old x =', x)

# let's declare and assign a new variable y
y = x
print('y =', y)

# now we reassign x, but be wary because this does NOT also update y
x = 5
print('reassigned x =', x)
print('y does not change', y)                      # if you run this cell again, see what happens to y

In python, there are numerous datatypes.  The important ones for us are integers (**int**), decimal numbers (**float**), text strings (**str**), and true/false booleans (**bool**).  Use the built-in **type** function to get the type of any object.

In [None]:
my_int = 2
type(my_int)

In [None]:
my_float = 2.1
type(my_float)

In [None]:
my_str = 'human memory'
type(my_str)

In [None]:
my_bool = True
type(my_bool)
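
To see how these datatypes interact, here is a small sketch.  The variable names and values are made up for illustration and are not used elsewhere in this notebook.  Combining an int with a float gives a float, and the built-in **int**, **float**, and **str** functions convert between types.

```python
# illustrative values only, not part of the notebook's data
count = 2            # int
scale = 2.5          # float

mixed = count * scale            # int combined with float gives a float
print(mixed, type(mixed))

as_text = str(count)             # convert the int to a string
print(as_text, type(as_text))

back_to_int = int('7')           # convert a numeric string to an int
print(back_to_int + count)

truncated = int(scale)           # int() drops the decimal part (it does not round)
print(truncated)
```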

We also want to get used to using **Lists**, which allow us to hold several variables/objects.  Lists are denoted by square brackets, with elements separated by commas.  Often, we will create a variable that is an empty list and then append (add) elements to it.

In [None]:
# we can make a list using some variables we have already defined
my_list = [my_int, my_float]
print(my_list)

# or we can make an empty list and then append elements to it using the .append() function
lst = []
print('Empty list prior to appending', lst)
lst.append(1)
lst.append(2)
lst.append(3)
lst.append(500)
print('List with some items added', lst)

In [None]:
# to get the number of elements in a list, use the len() function
length = len(lst)
print('length of lst =', length)

In [None]:
# if we want the element at a specific index of the list, we can do so using the following syntax
# note that in python, 0 is the first index (not 1)
first_elem = lst[0]
third_elem = lst[2]

print('first element =', first_elem)
print('third element =', third_elem)

# we can also index backwards from the end of the list, where the final item has index -1
final_elem = lst[-1]

print('final element =', final_elem)

In [None]:
# if we want a slice of a list, we can specify the start and end index
# the slice includes the start index but not the end index
slice1 = lst[1:3]
print(slice1)

# we can also only indicate the start index (inclusive)
slice2 = lst[1:]
print(slice2)

# or only the end index (exclusive)
slice3 = lst[:3]
print(slice3)
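
Slicing also works with negative indices.  The short sketch below uses a made-up list called `letters` (not defined anywhere else in this notebook) to show a few common patterns.

```python
letters = ['a', 'b', 'c', 'd', 'e']   # illustrative list

print(letters[1:3])     # elements at indices 1 and 2
print(letters[-2:])     # the last two elements
print(letters[:-1])     # everything except the final element

# a slice from index i to index j contains j - i elements
print(len(letters[1:4]))
```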

We can use python to do math.  All basic operations work as expected, and we can use parentheses to enforce order of operation.

In [None]:
# we can start by making a few variables
a = 10
b = 5
c = 7
d = 3

# let's try some simple operations
my_sum = a + b + c + d
print('my_sum =', my_sum)

div = a / b                    # by default, python will return floats from division
print('div =', div)

# we can also do lots of steps on the same line, just be careful with your parentheses
answer = (a + b) * ((c-d) + c*d)
print('answer =', answer)
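
Beyond the four basic operations, a few other operators come up often.  The sketch below reuses the values of `a` and `d` from the cell above; `no_parens` and `with_parens` are names invented for this example.

```python
a = 10
d = 3

print('a / d  =', a / d)     # true division, always returns a float
print('a // d =', a // d)    # floor division: drops the remainder
print('a % d  =', a % d)     # modulo: the remainder after division
print('a ** d =', a ** d)    # exponentiation: a raised to the power d

# parentheses change the order of operations
no_parens = a + d * 2        # multiplication happens first
with_parens = (a + d) * 2    # addition happens first
print(no_parens, with_parens)
```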

### If Statements & For Loops

The two major pieces of logic we need to understand are **if** statements and **for** loops.  If statements let us run code based on a given condition, while for loops allow us to iterate over some values and repeat a process automatically.  Both follow the specific syntax shown below -- note that indentation is important in python.

In [None]:
# let's start with a simple if statement
# in general, the condition must be a boolean (i.e. evaluate to true or false)

run_code = True
# we condition on whether our variable run_code is true
if run_code:
    print('computer running code...')

In [None]:
# we can also add an "else" clause which gets executed when the "if" condition is not satisfied
password = 'upenn_24'
if password == 'UPenn_25':
    print('correct password')
else:
    print('incorrect password')

In [None]:
# finally, we can use "elif" to add further branches
school = 'Nursing'
if school == 'SAS':
    print('degree = BA')
elif school == 'SEAS':
    print('degree = BSE')
elif school == 'Wharton':
    print('degree = BS')
elif school == 'Nursing':
    print('degree = BSN')
else:
    print('sure you go to Penn?')

In [None]:
# note that we can use "!=" for not equals
my_number = 19
if my_number != 13:
    print('Lucky Number')
else:
    print('Unlucky Number')

In [None]:
# we can have multiple conditions in the same if statement
x1 = 3
x2 = 10

# "and" = all conditions must be true
if x1 == 3 and x2 == 7:
    print('and = true')
else:
    print('and = false')
    
# "or" = at least one condition must be true
if x1 == 5 or x2 == 10:
    print('or = true')
else:
    print('or = false')

In [None]:
# now let's try a for loop
# for example, say we want to loop over all the items in "lst," the list we defined above, and print out each element
print(lst)
for elem in lst:         # each iteration, the variable "elem" gets assigned to the next element in "lst"
    print(elem)

In [None]:
# we can also use the range() function to loop over the list
print(lst)
for i in range(len(lst)):                       # each iteration, the variable "i" gets assigned to the next index of the list
    print('index =', i, 'element =', lst[i])    # we can use the index to grab each element

In [None]:
# we can use a for loop to sum up all of the elements in our list
total = 0                      # define our total outside of the for loop, so that it doesn't get reassigned each iteration
for i in range(len(lst)):
    total += lst[i]            # "+=" is equivalent to writing total = total + lst[i]

print('total =', total)
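
As a quick check on the loop above, python's built-in **sum** function gives the same answer in one call.  This sketch uses a separate list named `values` (a name chosen for this example) holding the same numbers that `lst` holds at this point in the notebook.

```python
values = [1, 2, 3, 500]      # same contents as "lst" at this point in the notebook

# manual accumulation, looping directly over the elements
loop_total = 0
for elem in values:
    loop_total += elem

# the built-in sum() does the same thing in one call
print('loop total =', loop_total)
print('built-in sum =', sum(values))
```

Writing the loop by hand is still worth practicing, since the same pattern works for running totals that have no built-in shortcut.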

### Functions

User-defined functions are super useful in python.  At their core, functions take in a set of inputs (called "arguments" in codespeak), run some code, and return some outputs.  Importantly, functions allow us to do some "abstraction" (again, codespeak), which essentially means instead of having to repeatedly write the same code, we can put it in a function and just call the function.

In [None]:
# first, we define our function
# we'll name this 'fxn', but the name can be whatever you want (make it informative)
# we can give our function inputs (here we give it x1 and x2)
# and then return outputs (here we return tot)

def fxn(x1, x2):
    # we want our function to return the sum of the inputs
    tot = x1 + x2
    return tot

In [None]:
# now let's call our function and store what it returns
result = fxn(10, 20)
print(result)

# we can also pass in variables as arguments
# here let's give our function 2 elements from our list
print(lst)
result = fxn(lst[0], lst[2])
print(result)

In [None]:
# now let's write a function that will find the smallest number in a list
def small_lst(lst):
    small = lst[0]                     # the current smallest item is the first element
    for i in range(len(lst)):          # loop over the elements in the list
        if lst[i] < small:             # check if current item is smaller than the current smallest
            small = lst[i]             # if so, update the smallest
            
    return small

In [None]:
# let's define a list and hand it to our function
lst = [10, 50, 30, 40, 70, 90, 100, 110, 5, 45, 60]
small = small_lst(lst)
print(small)
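
Python also has a built-in **min** function that does the same job.  The sketch below repeats our function (looping over the elements directly instead of over indices) and compares the two results; `test_lst` is a name introduced just for this example.

```python
def small_lst(lst):
    small = lst[0]             # start by assuming the first element is the smallest
    for elem in lst:           # loop over the elements directly
        if elem < small:       # found something smaller, so update
            small = elem
    return small

test_lst = [10, 50, 30, 40, 70, 90, 100, 110, 5, 45, 60]
print(small_lst(test_lst))     # our function
print(min(test_lst))           # built-in equivalent
```

Note that both versions assume the list has at least one element; an empty list raises an error in either case.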

### Numpy, Pandas, Matplotlib

Python makes use of lots of "packages," which essentially contain pre-written functions that you can call, saving you from having to do lots of manual coding.  In order to make use of the functions in these packages, we must first import them.  

In [None]:
# imports
import numpy as np                   # we use numpy arrays to do math
import pandas as pd                  # we use pandas dataframes to work with tabular data
import matplotlib.pyplot as plt      # we use matplotlib for graphing
%matplotlib inline

#### Numpy
Numpy arrays are much like lists, except they are of fixed size and they allow us to do some more advanced mathematical operations.

In [None]:
# define a numpy array from a list
lst = [1, 2, 3, 4]
arr = np.array(lst)
print(arr)

In [None]:
# mathematical operations on a numpy array will apply to all elements
arr2 = arr + 1
print(arr2)

In [None]:
# mathematical operations between numpy arrays work elementwise
# addition
arr3 = arr + arr2
print(arr3)

# multiplication
arr4 = arr * arr2
print(arr4)

In [None]:
# we have some handy ways of making numpy arrays
# all zeros
zeros = np.zeros(5)
print(zeros)

# range from 0 to n - 1
a = np.arange(0, 10)
print(a)

In [None]:
# again, we can index arrays as we did with lists
first_elem = a[0]
print(first_elem)
last_elem = a[-1]
print(last_elem)

# and slice arrays as we did with lists
arr_slice = a[3:7]
print(arr_slice)

In [None]:
# also, the numpy package offers some handy built-in functions
# mean
mean = np.mean(arr2)
print(mean)

# standard deviation
stdev = np.std(arr2)
print(stdev)

# maximum value
max_val = np.max(arr2)
print(max_val)
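
To see why applying math to a whole array at once is convenient, here is a small sketch that converts temperatures from Celsius to Fahrenheit, first with a list and a for loop and then with a numpy array.  The temperature values are made up for illustration.

```python
import numpy as np

# made-up temperatures in degrees Celsius (illustrative only)
temps_c = [12.5, 15.0, 9.0, 21.5]

# with a list, we convert one element at a time
temps_f_list = []
for t in temps_c:
    temps_f_list.append(t * 9 / 5 + 32)
print(temps_f_list)

# with a numpy array, the same formula applies to every element at once
temps_f_arr = np.array(temps_c) * 9 / 5 + 32
print(temps_f_arr)
```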

#### Pandas
Pandas dataframes allow us to store and parse tabular data.  You can think of pandas as a python version of excel.

In [None]:
# don't worry about the details here, just run this cell to build an example dataframe

x = [1, 2, 3, 5, 7, 10, 11, 12, 14, 17, 19]
y = [100, 56, 73, 124, 74, 73, 89, 92, 110, 108, 132]
z = [90, 101, 90, 90, 90, 65, 90, 90, 110, 90, 90]
bl = [True, True, False, True, False, True, True, False, False, False, True]

df = pd.DataFrame()
df['x'] = x; df['y'] = y; df['z'] = z; df['bl'] = bl
df

In [None]:
# as we can see above, the dataframe has a number of columns, or fields
# one of the first things we may want to do with our data is select a certain field

# select the y data using the following syntax
y_data = df['y']
y_data

In [None]:
# that gave us a pandas series
# we can make this data more useful to us (or at least more familiar) by transforming it to a list or an array
y_list = list(y_data)
print(y_list)

y_arr = np.array(y_data)
print(y_arr)

In [None]:
# the other thing we will often need to do is filter the data based on some condition
# for example, what if we only want rows where 'bl' is True
# we filter with the following syntax
# we want to start with 'df' and condition on whether the 'bl' field of 'df' is True
all_true = df[df['bl'] == True]
all_true

In [None]:
# now let's filter this data for the rows where 'y' is greater than 90
# notice that we are now filtering 'all_true' instead of 'df'
great_y = all_true[all_true['y'] > 90]
great_y

In [None]:
# finally, using this data, let's find the average 'x' value
# we select the field, convert it to a list, and use a numpy function
x_data = list(great_y['x'])
print(x_data)

x_mean = np.mean(x_data)
print(x_mean)

In [None]:
# we can also filter on multiple conditions at once
# for each condition, we just need to add parentheses () to our filtering syntax
# and use the following symbols for logic: '&' = and, '|' = or

# starting from df, let's get rows where 'bl' is False AND 'y' is less than 90
false_bl_and_small_y = df[(df['bl'] == False) & (df['y'] < 90)]
false_bl_and_small_y

In [None]:
# how about rows where 'bl' is False OR 'y' is less than 90
false_bl_or_small_y = df[(df['bl'] == False) | (df['y'] < 90)]
false_bl_or_small_y
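
It can help to look at what the condition inside the brackets produces on its own.  The sketch below builds a smaller made-up dataframe called `small_df` (not the `df` from above) and prints the intermediate True/False values, often called a mask.

```python
import pandas as pd

# a smaller made-up dataframe, only for illustration
small_df = pd.DataFrame({'y': [100, 56, 124], 'bl': [True, False, False]})

# the condition on its own produces a Series of True/False values, one per row
mask = small_df['y'] > 90
print(mask)

# passing that mask to the dataframe keeps only the rows marked True
print(small_df[mask])

# masks combine with & and |, and each condition needs its own parentheses
combined = (small_df['bl'] == False) & (small_df['y'] > 90)
print(small_df[combined])
```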

#### Matplotlib
Matplotlib allows us to make graphs to visualize our data.

In [None]:
# for all of our graphs, we will be plotting some independent variable with some dependent variable

# IV
x = np.linspace(1, 10, 10)
print(x)

# DV
y = [10, 23, 6, 15, 28, 23, 12, 25, 14, 33]

# in our plot statement, put the IV first, the DV second, and make sure the two are the same shape!
plt.plot(x, y)

# we can also add labels to our plot
plt.xlabel('x')
plt.ylabel('y')
plt.title('matplotlib graph')

# show our plot
plt.show()

In [None]:
# we can also put multiple graphs on the same plot --> be sure to add a legend

# specify the color and the legend label
plt.plot(x, y, color='purple', label='first_curve')
plt.plot(x, np.array(y) + 3, color='orange', label='second_curve')

# add the legend
plt.legend()

plt.show()

## Problems

Now let's get some practice writing python code to answer the questions below.  First, make sure you have downloaded the python_intro_files folder (that contains 'Example_Penn_Data.csv') and put it in the same folder as this notebook.  Run the below cell to load the data.

In [None]:
data = pd.read_csv('python_intro_files/Example_Penn_Data.csv')
data

### Overview of the Data

For the questions in this notebook, you will be analyzing some made-up data on the different degrees offered at Penn (don't use this as a guide for what major to declare!).  The columns, or fields (codespeak), are as follows:

- School = which of Penn's four undergraduate schools
- Major = name of degree program
- Course_Units = number of CUs required to graduate
- PhD = 1 if department offers a PhD program, 0 if not
- Intro_101 = 1 if department offers an introductory course, 0 if not
- Class_Size = number of students in the previous graduating class

### Question 1

For degrees from the College of Arts and Sciences, what is the average number of CUs required to graduate?  Is this more or less than the average for the other three schools combined?

In [None]:
### YOUR CODE HERE


### Question 2

How many departments with a class size of under 30 offer a PhD program?  How many departments with a class size over 50 do not offer a PhD program?

In [None]:
### YOUR CODE HERE


### Question 3

For each school, what percentage of the programs offer an introductory course?

In [None]:
### YOUR CODE HERE


### Question 4

Make a bar chart that plots the school on the x-axis and the average class size across all degrees in the school on the y-axis.  Make sure to label your axes.

In [None]:
### YOUR CODE HERE
