# Python Practice for Climate and Big Data

This notebook contains several exercises to practice your Python programming skills.  While the problems are not drawn from the course material itself, they reinforce the "style of thinking" and introduce data analysis techniques that will be themes throughout the course.

In [None]:
# imports
import pickle
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)   # show every column when displaying DataFrames
warnings.filterwarnings('ignore')            # suppress warnings to keep notebook output clean

## Problem 1: Lists, Matplotlib, and Dynamical Systems

A central question in theoretical neuroscience is how to describe the activity of individual neurons mathematically.  One attempt is the Izhikevich model, which consists of three equations that model the dynamics of a neuron's membrane potential $V$ (or voltage, in mV), where $\Delta V$ is the change in $V$ over each timestep of duration $\Delta t$ (in ms).

$$
\frac{\Delta V}{\Delta t} = 0.04V^{2} + 5V + 140 - u + I
$$

$$
\frac{\Delta u}{\Delta t} = a(bV - u)
$$

$$
\text{if } V \geq 30 \ \text{mV}, \ \text{then } \{V \leftarrow c \ ; \ u \leftarrow u + d\}
$$

Here $u$ is a membrane recovery variable, $I$ is the input current (in mA), and $a$, $b$, $c$, and $d$ are free parameters.  Essentially, the first two equations model how the voltage $V$ and recovery variable $u$ change in time, while the third indicates that when the voltage reaches the threshold, $V$ is reset to the value $c$ and $u$ is incremented by $d$.  This reset represents the hyperpolarization that follows the firing of an action potential.

In this problem, we will simulate the Izhikevich model with different parameter sets to generate distinct neuronal behaviors.  We will use the same constants and initial values for each part as defined below, with our simulations running over 300 ms.  For the input current, we will assume 50 ms of no current followed by 250 ms of constant current at 10 mA.

* Hint: The first two equations correspond to two differential equations, the first of which tells us how the membrane potential changes with time.  If we want to run a simulation that steps forward in time, think of how we can solve for the change in voltage.
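
As a minimal sketch of the time-stepping idea in the hint, the cell below applies the forward Euler update $x \leftarrow x + \Delta t \cdot f(x)$ to a toy decay equation, $\Delta x / \Delta t = -x$, which is not part of the assignment.  The function name *euler_decay* and its arguments are hypothetical; the point is only that multiplying the rate of change by $\Delta t$ gives the change over one step.

```python
import math

def euler_decay(x0, dt, n_steps):
    """Integrate dx/dt = -x with forward Euler (toy example, hypothetical helper)."""
    x = x0
    xs = [x]
    for _ in range(n_steps):
        dx = -x * dt      # change over one step: rate of change times dt
        x = x + dx
        xs.append(x)
    return xs

xs = euler_decay(x0=1.0, dt=0.01, n_steps=100)   # integrate from t = 0 to t = 1
print(xs[-1], math.exp(-1.0))                    # numerical vs. exact solution
```

A smaller time step brings the numerical value closer to the exact one, at the cost of more iterations.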

In [None]:
# constants, initial values
Vthresh = 30                    # threshold potential [mV]
V0 = -65                        # initial potential [mV]
t0 = -50                        # initial time [ms]
tf = 250                        # final time [ms]
dt = 0.01                       # time step [ms]
I_in = 10                       # input current [mA]

To run simulations of the model, we will define a function that takes in several arguments for the initial conditions and parameter values and returns two lists, one that holds the membrane potential and one that holds the time at each step of the simulation, as well as the number of action potentials fired.

In [None]:
# Izhikevich model simulation
def Izh_model(Vthresh, V0, t0, tf, dt, I_in, a, b, c, d):
    u0 = b * V0
    V = V0                        # set initial voltage
    u = u0                        # set initial membrane recovery
    t = t0                        # set initial time
    Vlist = [V]                   # list holding voltage time-series
    tlist = [t]                   # list holding times
    action_potentials = 0         # number of action potentials
    
    ### YOUR CODE HERE
    raise NotImplementedError
    
    return Vlist, tlist, action_potentials

#### Part (a) Regular Spiking Neuron

Simulate the Izhikevich model with the parameters $a = 0.02$, $b = 0.2$, $c = -65 \ mV$, $d = 8$.  Apply a step current to the neuron: $I = 0 \ mA$ when $t < 0$ and $I = 10 \ mA$ when $t \geq 0$.  

Plot the membrane potential as a function of time and indicate how many action potentials the neuron fired.  Make sure to label your axes, as well as indicate on your graph the reset and threshold potentials using dashed horizontal lines.

In [None]:
### YOUR CODE HERE
# parameters


In [None]:
# run the model


In [None]:
# graphing


#### Part (b) Chattering Neuron

Simulate the Izhikevich model with the parameters $a = 0.02$, $b = 0.2$, $c = -50 \ mV$, $d = 2$.  Apply a step current to the neuron: $I = 0 \ mA$ when $t < 0$ and $I = 10 \ mA$ when $t \geq 0$.

Plot the membrane potential as a function of time and indicate how many action potentials the neuron fired.  Make sure to label your axes, as well as indicate on your graph the reset and threshold potentials using dashed horizontal lines.

In [None]:
### YOUR CODE HERE
# parameters


In [None]:
# run the model


In [None]:
# graphing


## Problem 2: NumPy and Multi-Dimensional Data

We have data on the temperature profile of a lake in Switzerland, in the form of a 4-dimensional array.  The first dimension represents depth (40 meters), the second length (70 meters), the third width (50 meters), and the last dimension represents the day of the year (starting January 1st and ending December 31st).  In this problem, we will analyze the spatio-temporal dynamics of the lake's temperature profile.

Running the cell below will load the data for you to analyze.  Make sure you have the python_practice_files folder in the same directory as this notebook.

* Hint: We are modelling the lake as a rectangular prism (with a fourth time dimension).  To help in trying to visualize, you can think of the length as east-west (longitude) and the width as north-south (latitude).

In [None]:
def gen_lake_data():
    return np.load('python_practice_files/t_lake.npy')

t_lake = gen_lake_data()

#### Part (a)

Find the days of the year where the average temperature of the entire lake is the coldest and the warmest.
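
A minimal sketch of reducing a 4-dimensional array over its spatial axes, using a small synthetic array rather than the lake data.  The array *toy*, its shape, and its seasonal values are hypothetical; the axis order (depth, length, width, day) is assumed to match the description above.

```python
import numpy as np

# Toy stand-in for the lake array (hypothetical values): depth x length x width x day
days = np.arange(365)
seasonal = 10.0 - 8.0 * np.cos(2 * np.pi * (days - 20) / 365)   # coldest near day index 20
toy = np.ones((4, 7, 5, 1)) * seasonal                          # broadcast to shape (4, 7, 5, 365)

daily_mean = toy.mean(axis=(0, 1, 2))     # average over all three spatial axes -> shape (365,)
coldest = int(np.argmin(daily_mean))      # zero-based day index
warmest = int(np.argmax(daily_mean))
print(coldest + 1, warmest + 1)           # add 1 to convert an index to a day of the year
```

Note that *np.argmin* and *np.argmax* return zero-based indices, so index 0 corresponds to January 1st.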

In [None]:
### YOUR CODE HERE


In [None]:
# print results


#### Part (b)

Plot the daily change in temperature of the lake averaged over the entire water body as a time-series.
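
One way to obtain day-to-day changes from a series is *np.diff*.  The sketch below uses a short hypothetical series of daily averages, not the lake data.

```python
import numpy as np

daily_mean = np.array([4.0, 4.5, 5.5, 5.0, 6.0])   # hypothetical daily averages
daily_change = np.diff(daily_mean)                 # element i is day i+1 minus day i
print(daily_change)                                # one fewer entry than daily_mean
```

Because the result is one element shorter than the input, the x-axis of a plot of the changes needs one fewer point than the number of days.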

In [None]:
### YOUR CODE HERE


In [None]:
# graphing


#### Part (c)

For each 10-meter layer of depth, find the sample standard deviation of temperature over the year, treating each day as independent (meaning we average over the length and width).  First, do so manually using the formula for the sample standard deviation:

$$
\sigma = \sqrt{\frac{\sum{(x_{i}-\bar{x})^{2}}}{n-1}}
$$

where we sum over all observations $x_{i}$, $\bar{x}$ is the sample mean, and $n$ is the number of data points.  Then, do the same calculation using a NumPy function.
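
A minimal sketch comparing the formula above with *np.std* on a handful of hypothetical observations.  The detail worth noticing is the *ddof* argument: *np.std* divides by $n$ by default, and *ddof=1* switches to the $n-1$ denominator.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # hypothetical observations

n = x.size
manual = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))   # formula above, n - 1 in the denominator
builtin = np.std(x, ddof=1)                               # ddof=1 requests the same n - 1 denominator

print(manual, builtin, np.std(x))   # the last value uses the default ddof=0 and is smaller
```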

In [None]:
### YOUR CODE HERE


In [None]:
# manually calculate


In [None]:
# print results


In [None]:
# use a numpy function


In [None]:
# print results


#### Part (d)

Selecting a width in the middle of the lake (i.e., a specific latitude), plot the two-dimensional (length on x-axis, depth on y-axis) temperature profile of the lake on one day in the summer and one day in the winter.

Also, averaging over the width of the lake, plot the two-dimensional (length on x-axis, depth on y-axis) temperature profile averaged over one week in the summer and one week in the winter.

* Hint: *plt.matshow* is useful for plotting 1 dependent variable as a function of two independent variables.
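
A minimal sketch of displaying a 2-D array as an image.  The array *toy_temp* and its values are hypothetical, and the sketch uses the Axes method *ax.matshow*, the object-oriented counterpart of *plt.matshow*.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy field (hypothetical values): temperature falls with depth and varies gently along the length
depth = np.arange(40)                 # row index
length = np.arange(70)                # column index
toy_temp = 20.0 - 0.3 * depth[:, None] + np.sin(length / 10.0)[None, :]

fig, ax = plt.subplots()
im = ax.matshow(toy_temp)             # rows run down the y-axis, columns along the x-axis
fig.colorbar(im, ax=ax, label='temperature (toy units)')
ax.set_xlabel('length index')
ax.set_ylabel('depth index')
```

Since rows are drawn from top to bottom, putting depth on the first axis of the array places the surface at the top of the image.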

In [None]:
### YOUR CODE HERE


In [None]:
# graphing


## Problem 3: Pandas and Tabular Data

Imagine you work for a professional football club.  Your team just lost its best player, so the owner has tasked you with identifying potential replacement players that your team can sign.  Specifically, the manager has instructed you that they are looking for players with the following qualifications:

1. Midfielders: only look at players who have 'MF' listed as a position they can play ('Pos')
2. Injury-Free: only look at players who played more than 20 90s in the previous season ('90s')
3. Above average passing completion percentage ('Total_Cmp%')
4. Above average key passes per 90 minutes ('KP')
5. Above average progressive passing distance per 90 minutes ('Total_PrgDist')
6. Youth: only look at players under 25 years old ('Age')

Run the following cell to load in a dataset containing player statistics from the previous season.  Make sure you have the python_practice_files folder in the same directory as this notebook.

In [None]:
def load_fbref_data():
    return pd.read_csv('python_practice_files/passing_t5.csv')

dataframe = load_fbref_data()
dataframe

#### Part (a)

First, let's clean up our dataset a little bit.  There are numerous columns we are not concerned with, so let's just remove them.  Drop all of the columns in the list below from our dataframe.

In [None]:
cols_to_drop = ['Rk', 'Born', 'Short_Cmp', 'Short_Att', 'Short_Cmp%', 'Medium_Cmp', 'Medium_Att', 'Medium_Cmp%', 
                'Long_Cmp', 'Long_Att', 'Long_Cmp%','xAG', 'A-xAG', '1/3', 'PPA', 'CrsPA', 'Matches']

### YOUR CODE HERE


#### Part (b)

Now, let's begin filtering our data.  First, let's determine how many players qualify based on our first 2 criteria.  Filter out all players who are not midfielders or who played 20 or fewer 90s last season.  Report the number of players who qualify from each league ('Comp').
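
A minimal sketch of boolean-mask filtering and per-group counting in pandas.  The rows below are hypothetical; only the column names ('Pos', '90s', 'Comp') come from the exercise, and the comma-separated position format is an assumption about how multiple positions are stored.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the exercise
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D', 'E'],
    'Pos':    ['MF', 'FW,MF', 'DF', 'MF', 'MF,DF'],
    '90s':    [25.3, 21.0, 30.1, 12.4, 28.7],
    'Comp':   ['League X', 'League Y', 'League X', 'League Y', 'League X'],
})

is_mf = toy['Pos'].str.contains('MF')          # True for any position string that includes 'MF'
played_enough = toy['90s'] > 20                # strictly more than 20 full matches
qualified = toy[is_mf & played_enough]         # combine boolean masks with &

print(qualified['Comp'].value_counts())        # number of qualifying players per league
```

Each mask is a boolean Series aligned with the rows of the dataframe, so masks can be built separately and combined before indexing.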

In [None]:
### YOUR CODE HERE


In [None]:
# print results


#### Part (c)

For the next 3 criteria, we want to judge individual performance in relation to the performance of other players.  Also, we want to look at statistics based on a per-90-minute rate.  Using the data we have available, create two new columns ('KP_per90' and 'Total_PrgDist_per90') and populate them with the rate data we want.  

Then, filter the data further and identify the list of players who had above average passing completion percentage, key passes per 90 minutes, and progressive passing distance per 90 minutes.  Also, report the means for these three statistics.
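
A minimal sketch of deriving rate columns and keeping rows above the column means.  The values are hypothetical, and the sketch assumes that 'KP' and 'Total_PrgDist' hold season totals, so dividing by '90s' gives a per-90 rate; which set of players the means should be taken over is left to the exercise.

```python
import pandas as pd

# Hypothetical season totals; column names follow the exercise
toy = pd.DataFrame({
    'Player':        ['A', 'B', 'C', 'D'],
    '90s':           [20.0, 25.0, 30.0, 10.0],
    'KP':            [40, 25, 60, 5],
    'Total_PrgDist': [6000, 5000, 9000, 1500],
})

# Convert season totals to per-90 rates by dividing by the number of 90s played
toy['KP_per90'] = toy['KP'] / toy['90s']
toy['Total_PrgDist_per90'] = toy['Total_PrgDist'] / toy['90s']

stats = ['KP_per90', 'Total_PrgDist_per90']
means = toy[stats].mean()                          # column-wise means (a Series)
above = toy[(toy[stats] > means).all(axis=1)]      # keep rows above the mean in every column
print(means)
print(above['Player'].tolist())
```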

In [None]:
### YOUR CODE HERE


In [None]:
# print results


#### Part (d)

Finally, filter based on the youth criterion and present a list of players who the team should try to sign.  Report the standard error of the mean of our 3 criteria from part (c) for this subset of players.  Recall the formula for the standard error:

$$
SE = \frac{\sigma}{\sqrt{n}}
$$

where $\sigma$ is the sample standard deviation, and $n$ is the number of independent observations.
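
A minimal sketch of the standard error formula using only the standard library, on a few hypothetical values.

```python
import math
import statistics

values = [82.0, 85.5, 79.0, 88.5, 90.0]        # hypothetical completion percentages

n = len(values)
sample_sd = statistics.stdev(values)           # sample standard deviation (n - 1 denominator)
se = sample_sd / math.sqrt(n)                  # standard error of the mean
print(round(se, 3))
```

For a pandas Series, the *sem* method computes the same quantity, also with an $n-1$ denominator by default.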

In [None]:
### YOUR CODE HERE


In [None]:
# print results


## Problem 4: Reading, Parsing, and Writing Files

The opening few pages of Edward Albee's play "A Delicate Balance" are stored in the text file python_practice_files/Albee_A_Delicate_Balance.txt.  Each line in the text file contains one line of dialogue, with the start of the line indicating the character speaking and a ':' separating the character from the dialogue (i.e., CHARACTER_NAME: line of dialogue).
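
A minimal sketch of parsing the "NAME: dialogue" format described above.  The lines and speaker names below are hypothetical and are not taken from the play.  Splitting only at the first colon keeps any colons inside the dialogue intact.

```python
# Hypothetical lines in the "NAME: dialogue" format (not taken from the play)
lines = [
    "ALICE: Is it late? I thought: perhaps not.",
    "BOB: Not very.",
    "ALICE: Good.",
]

counts = {}                                        # dicts preserve insertion order (Python 3.7+)
for line in lines:
    speaker, dialogue = line.split(":", 1)         # split only at the first colon
    speaker = speaker.strip()
    counts[speaker] = counts.get(speaker, 0) + 1

print(list(counts.items()))
```

Because the dictionary keeps keys in the order they were first inserted, the resulting items follow the order in which each speaker first appears.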

For each question, you will append your answer to the "answers" list, and at the end of the problem you will save that list to a pickle (.pkl) file.

In [None]:
answers = []

#### Part (a)

Determine how many lines each character has.  Store (character, number of lines) tuples in the answers list, which should be ordered based on the order in which the characters' first lines appear in the play.

In [None]:
### YOUR CODE HERE


In [None]:
# append to answers list


#### Part (b)

Determine how many times a character says "Julia".  Store the integer in the answers list.

In [None]:
### YOUR CODE HERE


In [None]:
# append to answers list


#### Part (c)

Find the 10th word of every line and add it to your answers list (as strings).  For lines with fewer than 10 words, instead add the number of words on the line to your list of answers (as integers).  

Note that the character name does not count as a word, and that we only want to count words (not numbers or special characters).
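
A minimal sketch of one possible way to separate words from numbers and punctuation with a regular expression.  The definition of a "word" used here (a run of letters, optionally joined by apostrophes) is an assumption, and the example line is hypothetical.

```python
import re

dialogue = "Well, at 9 o'clock -- no, 10 -- we left."   # hypothetical line, speaker already removed

# One possible definition of a "word": a run of letters, optionally with internal apostrophes
words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", dialogue)
print(words, len(words))
```

Other reasonable definitions (for example, treating hyphenated compounds as one word) would need a different pattern.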

In [None]:
### YOUR CODE HERE


In [None]:
# append to answers list


#### Part (d)

Save your answers to a pickle file called 'answers.pkl' within the python_practice_files directory.
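
A minimal sketch of a pickle round trip, writing a hypothetical list to a temporary file rather than to the assignment's directory.  The file name *example.pkl* is illustrative only.

```python
import os
import pickle
import tempfile

example = [("ALICE", 2), 7, "word"]                      # hypothetical mixed-type list

path = os.path.join(tempfile.mkdtemp(), "example.pkl")   # temporary location for the demo
with open(path, "wb") as f:                              # pickle files are opened in binary mode
    pickle.dump(example, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == example)
```

Loading the file back is a quick way to check that the saved object has the expected contents and types.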

In [None]:
### YOUR CODE HERE


## Problem 5: Randomize

Randomize a list of $n$ items.  First, do so using only *np.random.randint* to generate 1 random integer at a time.  Then, do so with *np.random.shuffle*.  Compare how long the two methods take for lists of length 1000 and 1000000.

* Hint: *time.time()* will get the current time, which we can use to calculate how long code takes to run.
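
A minimal sketch of the timing pattern from the hint.  The function *hypothetical_workload* is a stand-in for whatever code is being timed and is not part of the exercise.

```python
import time

def hypothetical_workload(n):
    """Stand-in for the code being timed."""
    return sum(i * i for i in range(n))

start = time.time()                      # seconds since the epoch, as a float
result = hypothetical_workload(100_000)
elapsed = time.time() - start            # wall-clock duration in seconds
print(f"took {elapsed:.4f} s")
```

For very short runs, *time.perf_counter()* offers a higher-resolution clock and can be used in the same way.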

In [None]:
# sequences of different lengths (stored as NumPy arrays)
lst_thousand = np.arange(1000)
lst_million = np.arange(1000000)

In [None]:
def randomize(lst):
    ### YOUR CODE HERE
    raise NotImplementedError
    
def shuffle(lst):
    new_lst = lst.copy()
    np.random.shuffle(new_lst)
    return new_lst

In [None]:
### YOUR CODE HERE


In [None]:
# print results
