This document is a Python exploration of this R-based document: https://m-clark.github.io/data-processing-and-visualization/iterative.html.  Code is *not* optimized for anything but learning.  In addition, the explanatory content lives in the main document, not here, so some sections may not be included.  I focus only on reproducing the code chunks.

## Iterative Programming

Almost everything you do when dealing with data will need to be done again, and again, and again.  If you are copy-pasting your way to repetitively do the same thing, you're not only doing things inefficiently, you're almost certainly setting yourself up for trouble if anything changes about the data or underlying process.

In order to avoid this, you need to be familiar with basic programming, and a starting point is to use an iterative approach to repetitive problems. 

In [1]:
import pandas as pd
import numpy as np

weather = pd.read_csv('../data/weather.csv')

### For Loops

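This is the sort of thing we don't want: the same call copied and edited once per column.  (The notebook also displays only the value of the last line.)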

In [2]:
np.mean(weather.humid)
np.mean(weather.temp)
np.mean(weather.wind_speed)
np.mean(weather.precip)

0.004469079073329505

In [3]:
for column in ['temp', 'humid', 'wind_speed', 'precip']:
    print(np.mean(weather[[column]]))

temp    55.260392
dtype: float64
humid    62.530059
dtype: float64
wind_speed    10.517488
dtype: float64
precip    0.004469
dtype: float64


Now if the data name changes, the columns we want change, or we want to calculate something else, we usually only have to change one thing, rather than *at least* one and probably many more.  In addition, the amount of code is the same whether the loop goes over 100 columns or 4.

Let's do things a little differently.  The following stores the means instead of just printing them, and is coded in the same fashion as the R example (not necessarily optimal).

In [4]:
?np.mean

Signature: np.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)
Docstring:
Compute the arithmetic mean along the specified axis.

Returns the average of the array elements.  The average is taken over
the flattened array by default, otherwise over the specified axis.
`float64` intermediate and return values are used for integer inputs.

Parameters
----------
a : array_like
    Array containing numbers whose mean is desired. If `a` is not an
    array, a conversion is attempted.
axis : None or int or tuple of ints, optional
    Axis or axes along which the means are computed. The default is to
    compute the mean of the flattened array.

    .. versionadded:: 1.7.0

    If this is a t

In [5]:
columns = ['temp', 'humid', 'wind_speed', 'precip']
nyc_means = np.repeat(None, len(columns))

for i in range(len(columns)):
  column = columns[i]
  nyc_means[i] = np.mean(weather[[column]])

print(nyc_means)

[temp    55.260392
dtype: float64 humid    62.530059
dtype: float64
 wind_speed    10.517488
dtype: float64 precip    0.004469
dtype: float64]
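
The `dtype: float64` fragments in the printed array appear because `weather[[column]]` (double brackets) selects a one-column data frame, so in the pandas version that produced the output above, `np.mean` returns a labeled Series instead of a number. A minimal sketch of the alternative, using single brackets and a dictionary to keep the column names; the `toy` data frame is a hypothetical stand-in for the weather data so the block runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for the weather data
toy = pd.DataFrame({
    'temp': [50.0, 60.0, 70.0],
    'humid': [40.0, 60.0, 80.0],
})

columns = ['temp', 'humid']
toy_means = {}   # a dict keeps each result labeled by its column

for column in columns:
    # toy[column] (single brackets) is a Series, so .mean() is a plain scalar
    toy_means[column] = toy[column].mean()

print(type(toy[['temp']]).__name__)   # DataFrame
print(type(toy['temp']).__name__)     # Series
print(toy_means)
```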


Unlike in R, loops in Python are fast enough to be viable.  This doesn't get around the verbosity issue, but it does mean we needn't be as wary of them as we are cautioned to be in R.  The other nice thing is that loops in Python are more flexible than in R.

Python provides what is called a *list comprehension*: a shorthand for a loop that builds a new list from any *iterable*, such as a list or an array.

To demonstrate, we'll just get the squared values of 0, 1 and 2.

In [6]:
[x**2 for x in range(3)]

[0, 1, 4]
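
To make the shorthand concrete, here is a small sketch of the explicit loop that the comprehension replaces, plus the optional filtering clause. The variable names are only for illustration.

```python
# The explicit loop: start with an empty list and append one item at a time
squares_loop = []
for x in range(3):
    squares_loop.append(x**2)

# The comprehension builds the same list in one expression
squares_comp = [x**2 for x in range(3)]

# A trailing `if` keeps only the items that pass the condition
even_squares = [x**2 for x in range(6) if x % 2 == 0]

print(squares_loop == squares_comp)   # True
print(even_squares)                   # [0, 4, 16]
```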

Now let's try it for our weather data.

In [7]:
[np.mean(weather[[x]]) for x in columns] # columns was created previously above

[temp    55.260392
 dtype: float64,
 humid    62.530059
 dtype: float64,
 wind_speed    10.517488
 dtype: float64,
 precip    0.004469
 dtype: float64]

While not too dissimilar from how we use sapply or lapply in R, there is no special function to call.

Another thing I like about Python loops versus R loops is that it is easy to create multiple objects within the loop.  It's not intuitive at first for our example, so let's build some intuition.

First, let's just do a simple double assignment.

In [8]:
x, y = [1, 2]

In [9]:
x

1

In [10]:
y

2

Well that was easy enough!  Let's try it with a standard loop.

In [11]:
nyc_means = np.repeat(None, len(columns))
nyc_sds = np.repeat(None, len(columns))

for i in range(len(columns)):
    nyc_means[i], nyc_sds[i] = np.mean(weather[[columns[i]]]), np.std(weather[[columns[i]]])
    
nyc_means

array([temp    55.260392
dtype: float64,
       humid    62.530059
dtype: float64,
       wind_speed    10.517488
dtype: float64,
       precip    0.004469
dtype: float64], dtype=object)

In [12]:
nyc_sds

array([temp    17.787512
dtype: float64,
       humid    19.395547
dtype: float64,
       wind_speed    8.539089
dtype: float64,
       precip    0.030153
dtype: float64], dtype=object)

We can now use list comprehension and do this in one line. The comprehension produces a list of (mean, sd) pairs; the `*` unpacks that list so that each pair is passed to `zip` as a separate argument, and `zip` then regroups them into one tuple of means and one tuple of standard deviations. This approach gets us what we want in a very succinct fashion.

In [13]:
nyc_means, nyc_sds = zip(*[(np.mean(weather[[x]]), np.std(weather[[x]])) for x in columns])

In [14]:
nyc_means

(temp    55.260392
 dtype: float64,
 humid    62.530059
 dtype: float64,
 wind_speed    10.517488
 dtype: float64,
 precip    0.004469
 dtype: float64)

In [15]:
nyc_sds

(temp    17.787512
 dtype: float64,
 humid    19.395547
 dtype: float64,
 wind_speed    8.539089
 dtype: float64,
 precip    0.030153
 dtype: float64)
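
The `zip(*...)` idiom is easier to follow on plain values. A small sketch with made-up pairs:

```python
pairs = [(1, 'a'), (2, 'b'), (3, 'c')]

# zip(*pairs) is the same call as zip((1, 'a'), (2, 'b'), (3, 'c')):
# it collects all the first items together, then all the second items
numbers, letters = zip(*pairs)

print(numbers)   # (1, 2, 3)
print(letters)   # ('a', 'b', 'c')
```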

In the end though, creating a function and using map or another approach, as one would in R, may be best for a particular problem.
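
As one such approach, pandas can compute several summaries for several columns in a single call with `DataFrame.agg`. This is a sketch on a hypothetical `toy` data frame rather than the weather data, and it is not how the R document does it. Note that pandas' `std` divides by n - 1 by default, whereas `np.std` divides by n.

```python
import pandas as pd

# Hypothetical stand-in for the weather data
toy = pd.DataFrame({
    'temp': [50.0, 60.0, 70.0],
    'humid': [40.0, 60.0, 80.0],
})

# One row per statistic, one column per variable
summary = toy.agg(['mean', 'std'])   # 'std' here is the sample sd (ddof=1)

toy_means = summary.loc['mean']
toy_sds = summary.loc['std']

print(summary)
```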

### Using while

As in other programming languages, a `while` statement in Python is another way to write a loop.  If you use one, you can take advantage of the `+=` operator, the lack of which is a baffling oversight of the R language.  Note that we start at zero and change `<=` to `<` as a result, but otherwise this is identical to the R example.

In [16]:
nyc_means = np.repeat(None, len(columns))
i = 0

while i < len(columns):
    nyc_means[i] = np.mean(weather[[columns[i]]])
    i += 1

nyc_means

array([temp    55.260392
dtype: float64,
       humid    62.530059
dtype: float64,
       wind_speed    10.517488
dtype: float64,
       precip    0.004469
dtype: float64], dtype=object)
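
When the counter exists only to index into a list, `enumerate` hands back the position and the item together, so there is nothing to initialize or increment by hand. A small sketch comparing the two forms on the column names alone:

```python
columns = ['temp', 'humid', 'wind_speed', 'precip']

# while: the counter is created, tested, and incremented manually
i = 0
while_seen = []
while i < len(columns):
    while_seen.append((i, columns[i]))
    i += 1

# for + enumerate: position and item arrive as a pair
for_seen = []
for i, column in enumerate(columns):
    for_seen.append((i, column))

print(while_seen == for_seen)   # True
```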

Understanding loops is fundamental to spending less time processing data and more time exploring it. Your code will be more succinct and more able to handle the usual changes that come with dealing with data.

### Apply-type approaches

In [17]:
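def stdize(x):
    # NOTE: division binds tighter than subtraction, so this returns
    # x - (mean / std), not the z-score (x - mean) / std; the output
    # below reflects the expression exactly as written
    return x - np.mean(x) / np.std(x)

weather[columns].apply(stdize, axis = 1)   # axis=0 applies to each column, axis=1 to each row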

|       | temp      | humid     | wind_speed | precip    |
|-------|-----------|-----------|------------|-----------|
| 0     | 37.860264 | 58.210264 | 9.197284   | -1.159736 |
| 1     | 37.917737 | 60.527737 | 6.953197   | -1.102263 |
| 2     | 37.870971 | 63.280971 | 10.358771  | -1.149029 |
| 3     | 38.730931 | 61.020931 | 11.469511  | -1.189069 |
| 4     | 37.850398 | 63.260398 | 11.488978  | -1.169602 |
| ...   | ...       | ...       | ...        | ...       |
| 26110 | 34.685413 | 50.505413 | 12.534773  | -1.274587 |
| 26111 | 32.617056 | 48.147056 | 15.898756  | -1.362944 |
| 26112 | 30.694580 | 47.884580 | 13.654720  | -1.305420 |
| 26113 | 29.541923 | 45.361923 | 15.883623  | -1.378077 |


Sadly, the rowwise `apply` on the weather data shows how much slower working with data frames can be in Python than in R.  The operation took several seconds, since a rowwise `apply` calls the function once for each of the more than 26,000 rows.  But as a counterpoint, Python's string capabilities are very easy to use and fast relative to R.  The following provides an example with list comprehension.

In [18]:
x = ['aba', 'abb', 'abc', 'abd', 'abe']

print([i.strip('ab') for i in x]) 

['', '', 'c', 'd', 'e']
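
The first two results are empty because `strip('ab')` treats its argument as a set of characters and removes any run of `a` or `b` from both ends, not the literal prefix `'ab'`. If the goal is to drop only a leading `'ab'`, one option is the sketch below (Python 3.9+ also offers `str.removeprefix` for this). The names `words` and `prefix` are only for illustration.

```python
words = ['aba', 'abb', 'abc', 'abd', 'abe']
prefix = 'ab'

# strip: removes any leading/trailing characters found in 'ab'
stripped = [w.strip('ab') for w in words]

# prefix removal: cut the first len(prefix) characters only when they match
no_prefix = [w[len(prefix):] if w.startswith(prefix) else w for w in words]

print(stripped)    # ['', '', 'c', 'd', 'e']
print(no_prefix)   # ['a', 'b', 'c', 'd', 'e']
```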


Here is an example of a rowwise application.

In [19]:
df = pd.DataFrame(
    {
        'a': range(1,4),
        'b': range(4,7)
    }
)

df

df.apply(np.sum, 1)

0    5
1    7
2    9
dtype: int64
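
Returning to the `stdize` example earlier in this section: as written, it computes `x - (mean / std)` and is applied across rows. Assuming the intent was the usual z-score for each column, a minimal sketch looks like the following. The `toy` data frame and the name `standardize` are hypothetical, and the last line shows that pandas arithmetic gives the same result without `apply`.

```python
import pandas as pd

# Hypothetical small data frame so the block runs on its own
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 60.0]})

def standardize(x):
    # Parentheses make the subtraction happen before the division
    return (x - x.mean()) / x.std(ddof=0)   # ddof=0 matches np.std's default

z_apply = toy.apply(standardize, axis=0)          # one call per column
z_direct = (toy - toy.mean()) / toy.std(ddof=0)   # same values, no apply

print(z_apply.round(3))
```

Each standardized column should have mean 0 and (population) standard deviation 1, which is a quick way to check the result.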

### Map functionality

While we have apply functionality, we also have map functionality similar to that demonstrated with R.  Base R has a Map function, but purrr adds both flexibility and some rigor to the utilization of it.  The main point here is that we can also use something similar for Python.

In [20]:
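# Note: this name shadows Python's built-in round() for the rest of the notebook
round = lambda x: '%.2f' % x

# Newer pandas (2.1+) deprecates applymap in favor of DataFrame.map
weather[columns].applymap(round)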

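|       | temp  | humid | wind_speed | precip |
|-------|-------|-------|------------|--------|
| 0     | 39.02 | 59.37 | 10.36      | 0.00   |
| 1     | 39.02 | 61.63 | 8.06       | 0.00   |
| 2     | 39.02 | 64.43 | 11.51      | 0.00   |
| 3     | 39.92 | 62.21 | 12.66      | 0.00   |
| 4     | 39.02 | 64.43 | 12.66      | 0.00   |
| ...   | ...   | ...   | ...        | ...    |
| 26110 | 35.96 | 51.78 | 13.81      | 0.00   |
| 26111 | 33.98 | 49.51 | 17.26      | 0.00   |
| 26112 | 32.00 | 49.19 | 14.96      | 0.00   |
| 26113 | 30.92 | 46.74 | 17.26      | 0.00   |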


The `map` method of a pandas Series applies a function to each element of that one vector, typically a column. The following is just a single-column version of what `applymap` did above.

In [21]:
df.a.map(round)

0    1.00
1    2.00
2    3.00
Name: a, dtype: object
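
Python also has a built-in `map` that works on any iterable, not just pandas objects. It returns a lazy iterator, so it is wrapped in `list` here to see the values. A small sketch, using a hypothetical helper named `two_decimals` so that the built-in `round` is left alone:

```python
import pandas as pd

def two_decimals(x):
    # Format a number as a string with two decimal places
    return f'{x:.2f}'

# Built-in map on a plain list
formatted = list(map(two_decimals, [1, 2.5, 3.14159]))

# The same function passed to Series.map
formatted_series = pd.Series([1, 2, 3]).map(two_decimals)

print(formatted)                   # ['1.00', '2.50', '3.14']
print(formatted_series.tolist())   # ['1.00', '2.00', '3.00']
```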

### Working with lists

List objects make it very easy to iterate some form of data processing.

Let’s say you have models of increasing complexity, and you want to easily summarise and/or compare them. We create a list for which each element is a model object. We then apply a function to each element, e.g. to get its AIC value or adjusted R-squared.

In [22]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [23]:
mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data
results = list()


# Fit three regression models of increasing complexity
results.append(smf.ols('mpg ~ wt', data = mtcars).fit())
results.append(smf.ols('mpg ~ wt*hp', data = mtcars).fit())
results.append(smf.ols('mpg ~ wt + hp + vs + am', data = mtcars).fit())


In [24]:
results

[<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fc2fbf68bd0>,
 <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fc2fbf8c810>,
 <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fc2fc04aad0>]

In [25]:
print([round(x.rsquared_adj) for x in results])

['0.74', '0.87', '0.83']


In [26]:
print([round(x.aic) for x in results])

['164.03', '143.61', '154.06']
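
A variation that may be easier to read back later is to key the fitted models by their formula and collect the statistics in a data frame, so each number stays attached to the model that produced it. This is only a sketch: the data are synthetic (made up here so that nothing has to be downloaded), and the names `toy`, `formulas`, `fits`, and `comparison` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on both x1 and x2, plus noise
rng = np.random.default_rng(0)
n = 50
toy = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
toy['y'] = 1 + 2 * toy['x1'] - toy['x2'] + rng.normal(size=n)

formulas = ['y ~ x1', 'y ~ x1 + x2', 'y ~ x1 * x2']

# One fitted model per formula, keyed by the formula itself
fits = {f: smf.ols(f, data=toy).fit() for f in formulas}

# One row per model, one column per statistic
comparison = pd.DataFrame({
    'adj_r2': {f: m.rsquared_adj for f, m in fits.items()},
    'aic': {f: m.aic for f, m in fits.items()},
})

print(comparison.round(2))
```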


## Iterative Programming Exercises

### Exercise 1

With the following matrix, use apply and the sum function to get row or column sums of the matrix x.

In [27]:
x = np.matrix(np.arange(1,10)).reshape(3, 3)
x

matrix([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])

### Exercise 2

With the following list object `x`, loop over the elements and sum them.

In [28]:
x = [
    np.arange(1, 4), 
    np.arange(4, 11), 
    np.arange(11, 101)
]

### Exercise 3

As in the previous example, use a map function to create a data frame of the column means. See ?map to see all your options.

In [29]:
d = pd.DataFrame({
  'x' : np.random.normal(size = 100),
  'y' : np.random.normal(10, 2, 100),
  'z' : np.random.normal(50, 10, 100),
})