# COGS 4290 Assignment 0

In [1]:
# imports
import numpy as np
import pandas as pd; pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
import warnings; warnings.filterwarnings('ignore')
import pickle
import time

## Problem 1: Lists and Matplotlib
A question of great interest in neuroscience is how to describe the activity of individual neurons mathematically.  One attempt at this is the Izhikevich model, which consists of 3 equations that model the dynamics of a neuron's membrane potential (or voltage in mV) $V$, where $\Delta V$ is the change in $V$ over each timestep of duration $\Delta t$ (in ms).

$$
\frac{\Delta V}{\Delta t} = 0.04V^{2} + 5V + 140 - u + I
$$

$$
\frac{\Delta u}{\Delta t} = a(bV - u)
$$

$$
\text{if } V \geq 30 \text{ mV, then } \{V \leftarrow c;\ u \leftarrow u + d\}
$$

Here $u$ is a membrane recovery variable, $I$ is the input current (in mA), and $a$, $b$, $c$, and $d$ are free parameters.  Essentially, the first two equations model how the voltage $V$ and recovery variable $u$ change in time, while the third equation indicates that if the voltage reaches a certain threshold, it is reset to some value $c$ and $u$ is incremented by $d$.  This reset is meant to represent the hyperpolarization that follows the firing of an action potential.

In this problem, we will simulate the Izhikevich model with different parameter sets to generate unique neuronal behaviors.  We will use the same constants and initial values for each part as defined below, with our simulations running over 300 ms.  For the input current, we will presume that there is 50 ms of no current followed by 250 ms of constant current at 10 mA.

* Hint: The first two equations correspond to two differential equations, the first of which tells us how the membrane potential changes with time.  If we want to run a simulation that steps forward in time, think of how we can solve for the change in voltage.
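The idea in the hint, multiplying a rate of change by the time step to get the change over that step, is the forward Euler method. Below is a minimal sketch on an unrelated toy equation, $\frac{\Delta x}{\Delta t} = -kx$; the function name `euler_decay` and its parameters are illustrative assumptions and not part of the assignment.

```python
def euler_decay(x0, k, dt, n_steps):
    """Integrate dx/dt = -k * x with the forward Euler method (toy example)."""
    x = x0
    xs = [x]                      # list holding the value at each step
    for _ in range(n_steps):
        dx = -k * x * dt          # change over one step: (dx/dt) * dt
        x = x + dx                # step forward in time
        xs.append(x)
    return xs

xs = euler_decay(x0=1.0, k=0.5, dt=0.01, n_steps=200)
print(len(xs), xs[-1])
```

The same pattern (compute the change, add it to the current value, append to a list) carries over to any equation written as a rate of change.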

In [2]:
# constants, initial values
Vthresh = 30           # threshold potential [mV]
V0 = -65               # initial potential [mV]
t0 = -50               # initial time [ms]
tf = 250               # final time [ms]
dt = 0.01              # time step [ms]
I_in = 10              # input current [mA]

To run simulations of the model, we will define a function that takes in several arguments for the initial conditions and parameter values and returns two lists, one that holds the membrane potential and one that holds the time at each step of the simulation, as well as the number of action potentials fired.

In [3]:
# Izhikevich model simulation
def Izh_model(V0, Vthresh, t0, tf, dt, a, b, c, d, I_in):
    V = V0                       # set initial voltage
    u0 = b * V0
    u = u0                       # set initial membrane recovery
    t = t0                       # set initial time
    Vlist = [V]                  # list holding voltage time-series
    tlist = [t]                  # list holding times
    action_potentials = 0        # number of action potentials
    
    ### YOUR CODE HERE
    raise NotImplementedError
    
    return Vlist, tlist, action_potentials
    

#### Part (a) Regular Spiking Neuron
Simulate the Izhikevich model with the parameters $a = 0.02$, $b = 0.2$, $c = -65 \ mV$, $d = 8$.  Apply a step current to the neuron: $I = 0$ when $t < 0$ and $I = 10$ when $t \geq 0$.  Plot the results and indicate how many action potentials the neuron fired.  Make sure to label your axes, and mark the reset and threshold potentials on your graph using dashed horizontal lines.
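One possible way to draw labeled axes and dashed horizontal reference lines in Matplotlib is sketched below. The trace is a placeholder sine wave, not model output, and the colors and label strings are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder trace standing in for V(t); this is NOT output of the model
t = np.linspace(-50, 250, 301)
v = -65 + 10 * np.sin(t / 20)

fig, ax = plt.subplots()
ax.plot(t, v, label="V (placeholder)")
# Dashed horizontal lines at the threshold and reset potentials
ax.axhline(30, linestyle="--", color="red", label="threshold (30 mV)")
ax.axhline(-65, linestyle="--", color="gray", label="reset (c = -65 mV)")
ax.set_xlabel("Time [ms]")
ax.set_ylabel("Membrane potential [mV]")
ax.legend()
```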

In [4]:
### YOUR CODE HERE
# parameters


In [5]:
# run the model


In [6]:
# graphing


#### Part (b) Chattering Neuron
Simulate the Izhikevich model with the parameters $a = 0.02$, $b = 0.2$, $c = -50 \ mV$, $d = 2$.  Apply a step current to the neuron: $I = 0$ when $t < 0$ and $I = 10$ when $t \geq 0$.  Plot the results and indicate how many action potentials the neuron fired.  Make sure to label your axes, and mark the reset and threshold potentials on your graph using dashed horizontal lines.

In [7]:
### YOUR CODE HERE
# parameters


In [8]:
# run the model


In [9]:
# graphing


## Problem 2: Numpy and Multi-dimensional Data

We have data on the temperature profile of a lake in Switzerland, in the form of a 4-dimensional array.  The first dimension represents depth, the second length, the third width (all in meters), and the last dimension represents the day of the year (starting January 1st and ending December 31st).  We are interested in the dynamics of the temperature profile of the lake over the course of the year and want to answer the following.

Questions:
1) Find the days of the year on which the average temperature of the entire lake is the coldest and the warmest.
2) Plot the change in temperature of the lake averaged over the entire body of water as a time-series, taking the difference each day.
3) For each 10 meter layer of depth, find the sample standard deviation of temperature over the year, treating each day as independent.
4) Averaging over the width, plot the 2-dimensional temperature profile of the lake averaged over 1 week in the summer and 1 week in the winter.

Running the cell below will generate the data for us to analyze.

In [10]:
def gen_lake_data():
    temp_lake = np.zeros((50, 100, 70, 365), dtype=float)    # 50 meters depth, 100 meters length, 70 meters width, 365 days
    ones = np.ones((temp_lake.shape[1], temp_lake.shape[2]))
    
    # linear function of height
    h = np.arange(0, 50)
    temp_h = -0.3*h + 15
    
    # sinusoid on time
    day = np.arange(0, 365)
    temp_day = np.sin(np.radians(day) - 5*np.pi/6)
    
    for i in range(temp_lake.shape[0]):
        for j in range(temp_lake.shape[3]):
            np.random.seed(i*j)
            temp_lake[i,:,:,j] = temp_h[i] * temp_day[j] * ones * 0.8 + np.random.uniform(-1.0, 1.0, ones.shape) + 10   # randomly sample for each width and length
    
    return temp_lake

t_lake = gen_lake_data()
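With the axes ordered (depth, length, width, day), most of the questions above come down to reducing along the right axes. The sketch below uses a small random stand-in array rather than `t_lake`; the name `toy` and its shape are assumptions made purely for illustration.

```python
import numpy as np

# Small stand-in array with axes (depth, length, width, day)
rng = np.random.default_rng(0)
toy = rng.uniform(0.0, 20.0, size=(4, 5, 3, 6))

daily_mean = toy.mean(axis=(0, 1, 2))      # whole-volume mean, one value per day
coldest_day = int(np.argmin(daily_mean))   # index of the day with the lowest mean
daily_change = np.diff(daily_mean)         # day-to-day differences (one fewer entry)
width_mean = toy.mean(axis=2)              # average over width: (depth, length, day)

print(daily_mean.shape, daily_change.shape, width_mean.shape)
```

Passing a tuple to `axis` collapses several dimensions at once, which avoids writing nested loops over depth, length, and width.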

#### Part (a)
Find the day of the year where the average temperature of the entire lake is the coldest and the warmest.

In [11]:
### YOUR CODE HERE


In [12]:
# print results


#### Part (b)
For each 10 meter layer of depth, find the sample standard deviation of temperature over the year, treating each day as independent.  First, do so manually using the formula for the sample standard deviation:

$$
\sigma = \sqrt{\frac{\sum{(x_{i}-\bar{x})^{2}}}{n-1}}
$$

where we sum over all observations $x_{i}$, $\bar{x}$ is the sample mean, and $n$ is the number of data points.  Then, do the same calculation using a numpy function.
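As a quick check of the formula on a small made-up sample (the values in `x` are arbitrary), the manual calculation can be compared against `np.std`. Note that `np.std` uses `ddof=0` (the population formula, dividing by $n$) by default, so `ddof=1` is needed to divide by $n-1$.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # arbitrary sample
n = x.size

# Manual sample standard deviation: sqrt(sum((x_i - mean)^2) / (n - 1))
manual = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))

# Same quantity via numpy; ddof=1 selects the n - 1 denominator
builtin = np.std(x, ddof=1)

print(manual, builtin)
```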

In [13]:
### YOUR CODE HERE


In [14]:
# manually calculate


In [15]:
# print results


In [16]:
# use a numpy function


In [17]:
# print results


#### Part (c)
Selecting a width in the middle of the lake, plot the two-dimensional temperature profile of the lake on 1 day in the summer and 1 day in the winter.  Also, averaging over the width of the lake, plot the temperature profile averaged over 1 week in the summer and 1 week in the winter.

In [18]:
### YOUR CODE HERE


In [19]:
# graphing


## Problem 3: Pandas

Imagine you work for a professional football club.  Your team just lost your best player, so the owner has tasked you with identifying potential replacement players that your team can sign.  Specifically, the manager has instructed you that they are looking for players with the following qualifications:

1) Midfielders: Only look at players who have 'MF' listed as a position they can play ('Pos').
2) Injury-Free: Only look at players who played more than 10 90s in the previous season ('90s').
3) Above average passing completion percentage ('Total_Cmp%').
4) Above average key passes per 90 minutes ('KP').
5) Above average progressive passing distance per 90 minutes ('Total_PrgDist').
6) Youth: Only look at players under 25 years old ('Age').

Run the following code to load in a dataset containing player statistics from the previous season.

In [2]:
def load_fbref_data():
    # load in data from html
    passing_t5 = pd.read_html('https://fbref.com/en/comps/Big5/2022-2023/passing/players/2022-2023-Big-5-European-Leagues-Stats')[0]
    # relabel column names
    passing_t5.columns = [x[0] + '_' + x[1] if 'Unnamed' not in x[0] else x[1] for x in passing_t5.columns]
    # clean up ages
    passing_t5['Age'] = [x.split('-')[0] if isinstance(x, str) else '100' for x in passing_t5.Age]
    # convert from strings to numerics
    num_cols = ['Age', 'Born', '90s', 'Total_Cmp', 'Total_Att', 'Total_Cmp%', 'Total_TotDist', 'Total_PrgDist', 'Short_Cmp', 'Short_Att', 'Short_Cmp%',
                'Medium_Cmp', 'Medium_Att', 'Medium_Cmp%', 'Long_Cmp', 'Long_Att', 'Long_Cmp%', 'Ast', 'xAG', 'Expected_xA', 'Expected_A-xAG', 'KP', '1/3', 'PPA', 'CrsPA', 'PrgP']
    for c in passing_t5.columns:
        if c in num_cols:
            passing_t5[c] = pd.to_numeric(passing_t5[c], errors='coerce')
    
    
    return passing_t5

dataframe = load_fbref_data()
dataframe

Unnamed: 0,Rk,Player,Nation,Pos,Squad,Comp,Age,Born,90s,Total_Cmp,Total_Att,Total_Cmp%,Total_TotDist,Total_PrgDist,Short_Cmp,Short_Att,Short_Cmp%,Medium_Cmp,Medium_Att,Medium_Cmp%,Long_Cmp,Long_Att,Long_Cmp%,Ast,xAG,xA,A-xAG,KP,1/3,PPA,CrsPA,PrgP,Matches
0,1,Brenden Aaronson,us USA,"MF,FW",Leeds United,eng Premier League,21.0,2000.0,26.4,592.0,797.0,74.3,7577.0,2182.0,346.0,423.0,81.8,150.0,195.0,76.9,25.0,65.0,38.5,3.0,4.2,2.6,-1.2,46.0,47.0,16.0,4.0,86.0,Matches
1,2,Paxten Aaronson,us USA,"MF,DF",Eint Frankfurt,de Bundesliga,18.0,2003.0,1.9,51.0,71.0,71.8,659.0,109.0,36.0,38.0,94.7,14.0,23.0,60.9,1.0,6.0,16.7,0.0,0.0,0.1,0.0,1.0,3.0,0.0,0.0,6.0,Matches
2,3,James Abankwah,ie IRL,DF,Udinese,it Serie A,18.0,2004.0,0.7,23.0,29.0,79.3,375.0,79.0,14.0,15.0,93.3,6.0,8.0,75.0,2.0,5.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Matches
3,4,George Abbott,eng ENG,MF,Tottenham,eng Premier League,16.0,2005.0,0.0,1.0,1.0,100.0,8.0,0.0,1.0,1.0,100.0,0.0,0.0,,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Matches
4,5,Yunis Abdelhamid,ma MAR,DF,Reims,fr Ligue 1,34.0,1987.0,37.0,1679.0,2031.0,82.7,32967.0,13407.0,490.0,571.0,85.8,1006.0,1117.0,90.1,155.0,279.0,55.6,2.0,1.0,0.9,+1.0,13.0,155.0,5.0,0.0,215.0,Matches
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2999,2885,Martin Ødegaard,no NOR,MF,Arsenal,eng Premier League,23.0,1998.0,34.7,1449.0,1804.0,80.3,22540.0,6014.0,759.0,873.0,86.9,518.0,617.0,84.0,110.0,175.0,62.9,7.0,8.1,7.9,-1.1,76.0,135.0,91.0,4.0,266.0,Matches
3000,2886,Milan Đurić,ba BIH,FW,Hellas Verona,it Serie A,32.0,1990.0,15.2,268.0,523.0,51.2,3119.0,740.0,186.0,336.0,55.4,54.0,119.0,45.4,5.0,13.0,38.5,1.0,1.3,0.9,-0.3,21.0,27.0,5.0,0.0,29.0,Matches
3001,2887,Filip Đuričić,rs SRB,"MF,FW",Sampdoria,it Serie A,30.0,1992.0,24.1,586.0,758.0,77.3,9599.0,2169.0,302.0,347.0,87.0,211.0,254.0,83.1,52.0,86.0,60.5,0.0,1.9,1.6,-1.9,34.0,50.0,18.0,4.0,77.0,Matches
3002,2888,Blanco,,MF,Cádiz,es La Liga,22.0,2000.0,1.8,,,,,,,,,,,,,,,0.0,,,,,,,,,Matches


#### Part (a)
First, let's clean up our dataset a little bit.  There are numerous columns we are not concerned with, so let's just remove them.  Drop all of the columns in the list below from our dataframe.

In [21]:
cols_to_drop = ['Rk', 'Born', 'xAG', 'Expected_A-xAG', '1/3', 'PPA', 'CrsPA', 'Matches']


### YOUR CODE HERE


#### Part (b)
Now, let's begin filtering our data.  First, let's determine how many players qualify based on our first 2 criteria.  Filter out all players who are not midfielders or who played fewer than 10 90s last season.  Report the number of players who qualify from each league ('Comp').
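The filtering pattern described here (a substring test on a text column combined with a numeric threshold, followed by a per-group count) can be sketched on a tiny made-up table. The player and league names below are placeholders, not rows from the real dataset.

```python
import pandas as pd

# Made-up rows that mimic the relevant columns
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Pos": ["MF,FW", "DF", "MF", "FW,MF"],
    "Comp": ["League X", "League X", "League Y", "League Y"],
    "90s": [26.4, 30.0, 5.0, 12.1],
})

is_mf = toy["Pos"].str.contains("MF", na=False)   # 'MF' anywhere in the position string
enough_90s = toy["90s"] > 10                      # more than ten 90s played
qualified = toy[is_mf & enough_90s]

counts = qualified["Comp"].value_counts()         # number of qualifying players per league
print(counts)
```

Using `na=False` treats missing positions as non-matches instead of propagating `NaN` into the boolean mask.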

In [22]:
### YOUR CODE HERE


In [23]:
# print results


#### Part (c)
For the next 3 criteria, we want to judge individual performance in relation to the performance of other players.  Also, we want to look at statistics on a per 90 minute basis.  Using the data we have available, create two new columns ('KP_per90' and 'Total_PrgDist_per90') and populate them with the rate data we want.  Then, filter the data further and identify the list of players who had above average passing completion percentage, key passes per 90 minutes, and progressive passing distance per 90 minutes.  Also report the means for these three statistics.
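A hedged sketch of the per-90 idea on made-up numbers: it assumes the raw column is a season total, so the rate is the total divided by the number of 90s played. Whether that assumption holds for a given column should be checked against the data source.

```python
import pandas as pd

# Made-up totals; 'KP' is assumed here to be a season total
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "90s": [10.0, 20.0, 30.0],
    "KP": [10.0, 40.0, 30.0],
})

toy["KP_per90"] = toy["KP"] / toy["90s"]      # rate per 90 minutes played
mean_rate = toy["KP_per90"].mean()            # average rate across these players
above = toy[toy["KP_per90"] > mean_rate]      # keep only above-average players

print(mean_rate)
print(above)
```

Note that the mean here is taken over whichever rows are in the table at that point, so the order of filtering and averaging affects the result.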

In [24]:
### YOUR CODE HERE


In [25]:
# print results


#### Part (d)
Finally, filter based on the final youth criterion, and present a list of players who the team should try to sign.  Report the standard error of the mean of our 3 criteria from part (c) for this subset of players.  Recall the formula for the standard error:

$$
SE = \frac{\sigma}{\sqrt{n}}
$$

where $\sigma$ is the sample standard deviation, and $n$ is the number of independent observations.
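As a small numerical illustration of the formula (the sample values are arbitrary), the standard error can be computed from the sample standard deviation with `ddof=1`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # arbitrary sample
n = x.size

sigma = np.std(x, ddof=1)                 # sample standard deviation (n - 1 denominator)
se = sigma / np.sqrt(n)                   # standard error of the mean

print(sigma, se)
```

For this sample the sum of squared deviations is 10, so $\sigma = \sqrt{10/4}$ and $SE = \sqrt{2.5/5} = \sqrt{0.5}$.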

In [26]:
### YOUR CODE HERE


In [27]:
# print results


## Problem 4: Reading, Parsing, and Writing Files
The first few pages of Edward Albee's play "A Delicate Balance" are stored in the text file sample_files/Albee_A_Delicate_Balance.txt.  Each line in the text file contains one line of dialogue, with the start of the line indicating the character speaking (i.e. CHARACTER_NAME: line of dialogue).

For each question, you will append your answer to the "answers" list, and at the end of the problem you will save that list to a pickle (.pkl) file.

In [28]:
answers = []

#### Part (a)
Determine how many lines each character has.  Store in the answers list as (character, number of lines) tuples, which should be ordered based on the order in which the characters' first lines appear in the play.
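A minimal sketch of splitting lines in the `CHARACTER_NAME: line of dialogue` format and tallying speakers in order of first appearance. The sample lines and speaker names are invented for illustration and do not come from the play; reading the actual file is left out.

```python
# Invented sample lines in the same "NAME: dialogue" format
sample_lines = [
    "ALICE: Hello there, how are you?",
    "BOB: Fine. And you?",
    "ALICE: Well enough.",
]

counts = {}
for line in sample_lines:
    # partition splits on the FIRST colon only, so colons in the dialogue are kept
    speaker, _, dialogue = line.partition(":")
    speaker = speaker.strip()
    counts[speaker] = counts.get(speaker, 0) + 1

# dicts preserve insertion order (Python 3.7+), i.e. order of first appearance
ordered = list(counts.items())
print(ordered)
```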

In [29]:
### YOUR CODE HERE


In [30]:
# append to answers list


#### Part (b)
Determine how many times a character says "Julia."

In [31]:
### YOUR CODE HERE


In [32]:
# append to answers list


#### Part (c)
Find the 10th word of every line and add it to your list of answers.  For lines with fewer than 10 words, instead add the number of words on the line to your list of answers.  Note that the character name does not count as a word, and that we only want words (not numbers or special characters).

In [33]:
### YOUR CODE HERE


In [34]:
# append to answers list


#### Part (d)
Save your answers to a pickle file called 'answers.pkl'.
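For reference, a round trip through `pickle` can be sketched as follows. The list contents and the file name `example_answers.pkl` (written to the system temporary directory) are placeholders chosen for this illustration.

```python
import os
import pickle
import tempfile

example = [("ALICE", 2), 3, "word"]       # placeholder list, not real answers
path = os.path.join(tempfile.gettempdir(), "example_answers.pkl")

# Pickle files are binary, so open with "wb" to write and "rb" to read
with open(path, "wb") as f:
    pickle.dump(example, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == example)
```

Loading the file back is a cheap way to confirm that what was saved matches what was intended.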

In [35]:
### YOUR CODE HERE


## Problem 5: Randomize
Randomize a list of $n$ items.  First, do so using only *np.random.randint* to generate 1 random integer at a time.  Then, do so with *np.random.shuffle*.  Compare how long the two methods take for lists of length 1000 and 1000000.
* Hint: *time.time()* returns the current time, which we can use to calculate how long code takes to run.
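The timing pattern from the hint can be sketched with a small helper. The name `time_call` is a hypothetical helper introduced here, and the built-in `sorted` stands in for the function being timed.

```python
import time

def time_call(func, *args):
    """Run func(*args) and return (result, elapsed seconds). Hypothetical helper."""
    start = time.time()               # wall-clock time before the call
    result = func(*args)
    elapsed = time.time() - start     # wall-clock time spent in the call
    return result, elapsed

# Stand-in workload: sort a reversed list of 100000 integers
result, elapsed = time_call(sorted, list(range(100000, 0, -1)))
print(f"sorted {len(result)} items in {elapsed:.4f} s")
```

A single measurement can be noisy, so repeating the call a few times and comparing the results gives a more reliable picture.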

In [36]:
# lists of different lengths
lst_thousand = np.arange(1000)
lst_million = np.arange(1000000)

In [37]:
def randomize(lst):
    ### YOUR CODE HERE
    raise NotImplementedError
    
def shuffle(lst):
    new_lst = lst.copy()
    np.random.shuffle(new_lst)
    return new_lst

In [38]:
### YOUR CODE HERE


In [39]:
# print results
