# Functions

## Defining Functions in Python

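Defining a function in Python is straightforward: write `def`, the function name, the arguments in parentheses, and a colon, then indent the body: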

In [2]:
def my_first_function(x):  # `x` rather than `input`, which would shadow the built-in input()
    output = 2*x
    return output

To use the function, call it by name once it has been defined:

In [3]:
my_first_function(5)

10

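Note the indentation: the body of a function is every indented line after the `def` line. The function ends at the first line that returns to the previous indentation level. Leaving a blank line before that line is a convention, not a requirement.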

In [5]:
def my_second_function(x):
    y = int(x/2)
    return y

my_second_function(12)

6

### _Docstrings_

We can document a function by writing a string, called a _docstring_, on the line immediately after the `def` statement:

In [7]:
def my_documented_function(input1, input2):
    """
    This function converts inputs to string, concatenates the input, then reverses the order.
    """
    in1 = str(input1)
    in2 = str(input2)
    combi = in1+in2
    rev = combi[::-1]
    output = rev
    return output

my_documented_function(43214, 'cats')

'stac41234'
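The docstring is stored on the function object itself, so it can be read in any Python session, not only in IPython or Jupyter. The following is a minimal sketch using the built-in `help()`, the `__doc__` attribute, and the standard-library `inspect.getdoc`. The function is redefined (with a shortened body) so that the block runs on its own.

```python
import inspect

def my_documented_function(input1, input2):
    """
    This function converts inputs to string, concatenates the input, then reverses the order.
    """
    combi = str(input1) + str(input2)
    return combi[::-1]

# Raw docstring, including the leading newline and the indentation
raw_doc = my_documented_function.__doc__

# Cleaned docstring: common indentation and surrounding blank lines removed
clean_doc = inspect.getdoc(my_documented_function)
print(clean_doc)

# help() prints the signature together with the docstring
help(my_documented_function)
```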

In [19]:
my_documented_function?

Signature: my_documented_function(input1, input2)
Docstring: This function converts inputs to string, concatenates the input, then reverses the order.
File:      ~/Documents/Teaching/dpir-intro-python/Week3/<ipython-input-7-f32523afe821>
Type:      function


### _Default Values_

- We can give arguments default values when defining a function.
- Arguments taking default values become optional arguments; we do not have to pass a value each time we call the function.


In [17]:
def function_with_defaults(x, replace=" ", val=" "):
    "Casts input to string, replaces `replace` with `val`, prints result. Returns None."
    y = str(x)
    y = y.replace(replace, val)
    print(y)

function_with_defaults("1. I love dogs.")
function_with_defaults("2. I love dogs.", " dogs.")
function_with_defaults("3. I love dogs.", val="_")
function_with_defaults("4. I love dogs.", "d", "b")
function_with_defaults("5. I love dogs.", "dog", "cat")

1. I love dogs.
2. I love 
3._I_love_dogs.
4. I love bogs.
5. I love cats.


### Namespaces

Recall the difference between the local and global namespaces.

- Variables defined inside a function are not accessible outside the function.
- When a name is used inside a function, Python first checks whether it is defined locally, then whether it is defined globally.

In [40]:
def function_with_local():
    some_local_variable = 12
    
print(some_local_variable)

NameError: name 'some_local_variable' is not defined

In [48]:
a = "Global A"

def function1():
    print(a)
    
def function2():
    a = "Local A"
    print(a)
    
def function3(a):
    print(a)

function1()
function2()
function3("Argument A")

Global A
Local A
Argument A
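One consequence of the lookup order is worth making explicit: assigning to a name inside a function creates a local variable and leaves a global variable of the same name untouched. A minimal sketch follows; the function name `shadow_a` is made up for this illustration.

```python
a = "Global A"

def shadow_a():
    # Assignment inside the function creates a new local name `a`
    a = "Local A"
    return a

local_result = shadow_a()

print(local_result)  # the local value returned by the function
print(a)             # the global `a` is unchanged
```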


### Namespaces Take Away

When defining variables in the global environment, use _unique, specific and informative names_. Inside functions, generic names are fine as long as they describe the role the argument or variable plays.

# Applying Functions to Vectors

This section covers several ways to apply a function to a `pandas.Series` or `pandas.DataFrame`:

- Transformations:
    - Element-wise Operations
    - Cumulative Operations
- Summaries:
    - Point Summaries
    - Grouped Summaries

## Element-wise Operations on a Series

We can use the `pd.Series.apply()` method to apply a function element-wise to a pandas Series.

In [69]:
import pandas as pd

In [79]:
ser = pd.Series(range(0, 31, 2)) # range(start, stop, step)

In [80]:
ser

0      0
1      2
2      4
3      6
4      8
5     10
6     12
7     14
8     16
9     18
10    20
11    22
12    24
13    26
14    28
15    30
dtype: int64

In [81]:
def square(x):
    y = x**2
    return y

ser.apply(square)

0       0
1       4
2      16
3      36
4      64
5     100
6     144
7     196
8     256
9     324
10    400
11    484
12    576
13    676
14    784
15    900
dtype: int64

In [82]:
def exponentiate(x, e):
    y = x**e
    return y

ser.apply(lambda x: exponentiate(x, 3))

0         0
1         8
2        64
3       216
4       512
5      1000
6      1728
7      2744
8      4096
9      5832
10     8000
11    10648
12    13824
13    17576
14    21952
15    27000
dtype: int64
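Wrapping the call in a `lambda` is one way to supply the second argument. As an alternative sketch, `Series.apply` also forwards extra positional arguments given in `args` and extra keyword arguments to the function. The block recreates `ser` and `exponentiate` so that it runs on its own.

```python
import pandas as pd

ser = pd.Series(range(0, 31, 2))

def exponentiate(x, e):
    return x**e

# Extra positional arguments go in the `args` tuple
cubed_args = ser.apply(exponentiate, args=(3,))

# Extra keyword arguments are forwarded to the function as well
cubed_kwargs = ser.apply(exponentiate, e=3)

# Both match the lambda-based version
cubed_lambda = ser.apply(lambda x: exponentiate(x, 3))
print(cubed_args.equals(cubed_lambda), cubed_kwargs.equals(cubed_lambda))
```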

In [83]:
ser.apply(lambda x: x**3)

0         0
1         8
2        64
3       216
4       512
5      1000
6      1728
7      2744
8      4096
9      5832
10     8000
11    10648
12    13824
13    17576
14    21952
15    27000
dtype: int64

In [84]:
e = 1/2
ser.apply(lambda x: x**e)

0     0.000000
1     1.414214
2     2.000000
3     2.449490
4     2.828427
5     3.162278
6     3.464102
7     3.741657
8     4.000000
9     4.242641
10    4.472136
11    4.690416
12    4.898979
13    5.099020
14    5.291503
15    5.477226
dtype: float64
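For simple arithmetic such as these powers, `apply` is not strictly needed: arithmetic operators already act element-wise on a Series, and this is usually faster. A short sketch comparing the two approaches for squaring:

```python
import pandas as pd

ser = pd.Series(range(0, 31, 2))

# Arithmetic operators act element-wise on a Series
squared_vectorised = ser**2
squared_applied = ser.apply(lambda x: x**2)

print(squared_vectorised.equals(squared_applied))
```

`apply` remains the tool of choice when the function cannot be written as Series arithmetic, for example when it branches on each value.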

## Cumulative Operations on a Series

For cumulative operations, we can use either one of the `cum*` methods (such as `cumsum`) or the more general `pd.Series.expanding` method.

In [89]:
ser.cumsum()

0       0
1       2
2       6
3      12
4      20
5      30
6      42
7      56
8      72
9      90
10    110
11    132
12    156
13    182
14    210
15    240
dtype: int64

In [87]:
ser.expanding()

Expanding [min_periods=1,center=False,axis=0]

In [88]:
ser.expanding().sum()

0       0.0
1       2.0
2       6.0
3      12.0
4      20.0
5      30.0
6      42.0
7      56.0
8      72.0
9      90.0
10    110.0
11    132.0
12    156.0
13    182.0
14    210.0
15    240.0
dtype: float64

In [92]:
ser.expanding(2).sum() # min_periods=2: the result is NaN until two observations are available

0       NaN
1       2.0
2       6.0
3      12.0
4      20.0
5      30.0
6      42.0
7      56.0
8      72.0
9      90.0
10    110.0
11    132.0
12    156.0
13    182.0
14    210.0
15    240.0
dtype: float64
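The advantage of `expanding()` is that it accepts aggregations with no dedicated `cum*` method, such as a running mean. A brief sketch on the same Series, recreated so that the block runs on its own:

```python
import pandas as pd

ser = pd.Series(range(0, 31, 2))

# Built-in cumulative methods
running_total = ser.cumsum()
running_max = ser.cummax()

# expanding() supports other aggregations, for example a running mean
running_mean = ser.expanding().mean()

# expanding().sum() gives the same values as cumsum(), returned as floats
same_totals = (ser.expanding().sum() == running_total).all()

print(running_mean.head())
print(same_totals)
```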

We can also use `apply` with a DataFrame. In this case, the function receives each column (`axis=0`, the default) or each row (`axis=1`) as a Series.

In [234]:
import numpy as np  # `pd.np` was deprecated in pandas 1.0 and removed in 2.0

df = pd.DataFrame({
    'col1': np.random.randint(-100, 100, 5),
    'col2': np.random.randint(-100, 100, 5),
    'col3': np.random.randint(-100, 100, 5)
})
df.head()

|   | col1 | col2 | col3 |
|---|------|------|------|
| 0 | -70 | -60 | -5 |
| 1 | -69 | -80 | 86 |
| 2 | 77 | 81 | -15 |
| 3 | 97 | 73 | 47 |
| 4 | -50 | -100 | 11 |


In [235]:
df.apply(lambda x: x.mean(), axis=0)

col1    -3.0
col2   -17.2
col3    24.8
dtype: float64

In [236]:
df.apply(lambda x: x.sum(), axis=1)

0   -135
1    -63
2    143
3    217
4   -139
dtype: int64
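To make the direction of `axis` concrete, here is a small sketch with fixed values instead of random ones. The DataFrame `small_df` is made up for this illustration.

```python
import pandas as pd

small_df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
column_ranges = small_df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series
row_totals = small_df.apply(lambda row: row.sum(), axis=1)

print(column_ranges)
print(row_totals)
```

The result of `axis=0` has one entry per column, while the result of `axis=1` has one entry per row.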

In [239]:
df.applymap(lambda x: abs(x)**0.5)  # applies the function to every cell; renamed to DataFrame.map in pandas 2.1

|   | col1 | col2 | col3 |
|---|------|------|------|
| 0 | 8.3666 | 7.745967 | 2.236068 |
| 1 | 8.306624 | 8.944272 | 9.273618 |
| 2 | 8.774964 | 9.0 | 3.872983 |
| 3 | 9.848858 | 8.544004 | 6.855655 |
| 4 | 7.071068 | 10.0 | 3.316625 |


## Point Summaries

We have already looked at a number of point summary functions in the previous week.

- `pd.Series.mean()`
- `pd.Series.sum()`

We do not spend more time on them here.

## Grouped Summaries

The syntax for group summaries is [explained in detail in the lecture](https://muhark.github.io/dpir-intro-python/Week3/lecture.html#/groupby-syntax-simple-group-operations).

In [241]:
df = pd.read_feather("../Week2/data/bes_data_subset_week2.feather")

In [103]:
df.groupby('region')['Age'].mean()

region
East Midlands         54.903226
Eastern               54.070796
London                46.896552
North East            54.276786
North West            51.388158
Scotland              53.109948
South East            51.971631
South West            54.560241
Wales                 51.269841
West Midlands         54.451327
Yorkshire & Humber    53.152174
Name: Age, dtype: float64

In [105]:
df.groupby('region')[['Age']].mean() # List indexer returns a DataFrame of width 1.

| region | Age |
|--------|-----|
| East Midlands | 54.903226 |
| Eastern | 54.070796 |
| London | 46.896552 |
| North East | 54.276786 |
| North West | 51.388158 |
| Scotland | 53.109948 |
| South East | 51.971631 |
| South West | 54.560241 |
| Wales | 51.269841 |
| West Midlands | 54.451327 |


We can pass custom functions to the groupby object by using `apply`:

In [133]:
df.groupby(['region', 'Constit_Code'])['Age'].apply(lambda x: f"{int(x.min())}-{int(x.max())}").rename('Age Range')

region              Constit_Code
East Midlands       Ashfield        21-83
                    Bassetlaw       23-93
                    Bolsover        27-65
                    Broxtowe        35-67
                    Charnwood       36-80
                                    ...  
Yorkshire & Humber  Sheffield       19-71
                    Sheffield,      22-90
                    Skipton an      22-84
                    York Centr      19-77
                    York Outer      46-86
Name: Age Range, Length: 218, dtype: object
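Because the survey file is needed to run the cell above, here is a self-contained sketch of the same pattern on a made-up DataFrame. The names `toy` and `age_range` and the values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "Age": [25, 61, 34, 48, 70],
})

def age_range(ages):
    "Format the minimum and maximum of a group as a 'min-max' string."
    return f"{int(ages.min())}-{int(ages.max())}"

# The function is called once per group, with that group's Age values
ranges = toy.groupby("region")["Age"].apply(age_range).rename("Age Range")
print(ranges)
```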

We can apply several functions at once by passing a list to the `agg` method:

In [113]:
df.groupby(['region', 'Constit_Code'])[['Age']].agg(['mean', len])  # 'mean' replaces pd.np.mean (pd.np was removed in pandas 2.0)

| region | Constit_Code | Age (mean) | Age (len) |
|--------|--------------|------------|-----------|
| East Midlands | Ashfield | 56.888889 | 9.0 |
| East Midlands | Bassetlaw | 46.000000 | 10.0 |
| East Midlands | Bolsover | 50.375000 | 8.0 |
| East Midlands | Broxtowe | 55.833333 | 6.0 |
| East Midlands | Charnwood | 60.818182 | 11.0 |
| ... | ... | ... | ... |
| Yorkshire & Humber | Sheffield | 44.000000 | 12.0 |
| Yorkshire & Humber | Sheffield, | 55.038462 | 26.0 |
| Yorkshire & Humber | Skipton an | 54.444444 | 9.0 |
| Yorkshire & Humber | York Centr | 52.777778 | 9.0 |
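As a runnable sketch of the same idea on made-up data (the DataFrame `toy` is hypothetical), built-in aggregations can be requested by name as strings. Selecting the column with a single label rather than a list keeps the resulting column labels flat.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "Age": [25, 61, 34, 48, 70],
})

# Strings name built-in aggregations; each one becomes a column of the result
summary = toy.groupby("region")["Age"].agg(["mean", "count", "max"])
print(summary)
```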


We can also apply a single function to multiple columns simultaneously:

In [148]:
df.groupby(['region', 'Constit_Code'])[['k03', 'y06', 'y09', 'y11']].apply(lambda x: x.mode().iloc[0])

| region | Constit_Code | k03 | y06 | y09 | y11 |
|--------|--------------|-----|-----|-----|-----|
| East Midlands | Ashfield | Mail | Christian - no denomination | Female | English/Welsh/Scottish/Northern Irish/British |
| East Midlands | Bassetlaw | Mirror/Record | No religion | Male | English/Welsh/Scottish/Northern Irish/British |
| East Midlands | Bolsover | Mail | No religion | Male | English/Welsh/Scottish/Northern Irish/British |
| East Midlands | Broxtowe | Guardian/Observer | No religion | Male | English/Welsh/Scottish/Northern Irish/British |
| East Midlands | Charnwood | Mail | Church of England/ Anglican/Episcopal | Male | English/Welsh/Scottish/Northern Irish/British |
| ... | ... | ... | ... | ... | ... |
| Yorkshire & Humber | Sheffield | Guardian/Observer | No religion | Female | English/Welsh/Scottish/Northern Irish/British |
| Yorkshire & Humber | Sheffield, | Guardian/Observer | No religion | Female | English/Welsh/Scottish/Northern Irish/British |
| Yorkshire & Humber | Skipton an | Mail | No religion | Female | English/Welsh/Scottish/Northern Irish/British |
| Yorkshire & Humber | York Centr | Mail | No religion | Male | English/Welsh/Scottish/Northern Irish/British |


And finally, we can map different functions to different columns using the `agg()` function with a dictionary:

In [169]:
def group_mode(x):
    "Function for extracting first modal value from pandas groupby object"
    m = x.value_counts().index[0]
    return m

def gender_proportion(x):
    "Function for computing the share of 'Female' entries in a group"
    m = (x == "Female").mean()  # True counts as 1, False as 0
    return m


df.groupby(['region', 'Constit_Code']).agg({'k03': group_mode,
                                            'y06': group_mode,
                                            'y09': gender_proportion,
                                            'Age': ['min', 'max']})

| region | Constit_Code | k03 (group_mode) | y06 (group_mode) | y09 (gender_proportion) | Age (min) | Age (max) |
|--------|--------------|------------------|------------------|-------------------------|-----------|-----------|
| East Midlands | Ashfield | Mail | Christian - no denomination | 0.555556 | 21.0 | 83.0 |
| East Midlands | Bassetlaw | Mirror/Record | No religion | 0.400000 | 23.0 | 93.0 |
| East Midlands | Bolsover | Other | No religion | 0.500000 | 27.0 | 65.0 |
| East Midlands | Broxtowe | Guardian/Observer | No religion | 0.500000 | 35.0 | 67.0 |
| East Midlands | Charnwood | Mail | Church of England/ Anglican/Episcopal | 0.454545 | 36.0 | 80.0 |
| ... | ... | ... | ... | ... | ... | ... |
| Yorkshire & Humber | Sheffield | Guardian/Observer | No religion | 0.583333 | 19.0 | 71.0 |
| Yorkshire & Humber | Sheffield, | Guardian/Observer | No religion | 0.538462 | 22.0 | 90.0 |
| Yorkshire & Humber | Skipton an | Mail | No religion | 0.555556 | 22.0 | 84.0 |
| Yorkshire & Humber | York Centr | Other | No religion | 0.444444 | 19.0 | 77.0 |
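The dictionary form produces two-level column labels. As an alternative sketch, pandas also supports named aggregation, where each output column is given as `new_name=(source_column, function)` and the result has flat column labels. The DataFrame `toy`, the helper `female_share`, and the output names are made up for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "gender": ["Female", "Male", "Female", "Female", "Male"],
    "Age": [25, 61, 34, 48, 70],
})

def female_share(x):
    "Share of entries equal to 'Female'."
    return (x == "Female").mean()

# Named aggregation: new_column=(source_column, function)
flat = toy.groupby("region").agg(
    female_share=("gender", female_share),
    age_min=("Age", "min"),
    age_max=("Age", "max"),
)
print(flat)
```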


Note: the DataFrame above has a hierarchical index (a `MultiIndex`) on both its rows and its columns, so selecting from it is easiest with `pd.IndexSlice`.

For more general notes, see: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-indexing-with-hierarchical-index

In [170]:
temp = df.groupby(['region', 'Constit_Code']).agg({'k03': group_mode,
                                                   'y06': group_mode,
                                                   'y09': gender_proportion,
                                                   'Age': ['min', 'max']})

In [176]:
idx = pd.IndexSlice
temp.loc[idx['East Midlands', :], idx[:, 'group_mode']]

| region | Constit_Code | k03 (group_mode) | y06 (group_mode) |
|--------|--------------|------------------|------------------|
| East Midlands | Ashfield | Mail | Christian - no denomination |
| East Midlands | Bassetlaw | Mirror/Record | No religion |
| East Midlands | Bolsover | Other | No religion |
| East Midlands | Broxtowe | Guardian/Observer | No religion |
| East Midlands | Charnwood | Mail | Church of England/ Anglican/Episcopal |
| East Midlands | Daventry | Mail | No religion |
| East Midlands | Derby Sout | Telegraph | No religion |
| East Midlands | Grantham a | Other | Church of England/ Anglican/Episcopal |
| East Midlands | Harborough | Mail | No religion |
| East Midlands | Lincoln | Other | No religion |
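The selection above reads as "rows: every constituency in East Midlands; columns: the `group_mode` entry under every variable". Here is a self-contained sketch of the same pattern on made-up data; the DataFrame `toy` and its column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "seat": ["A", "B", "C", "D"],
    "Age": [25, 61, 34, 48],
    "Income": [20, 35, 28, 41],
})

# Grouping by two keys and passing lists of functions gives a MultiIndex
# on the rows and on the columns
summary = toy.groupby(["region", "seat"]).agg({"Age": ["min", "max"],
                                               "Income": ["min", "max"]})

idx = pd.IndexSlice
# Rows: every seat in "North"; columns: the "max" entry under every variable
north_max = summary.loc[idx["North", :], idx[:, "max"]]
print(north_max)
```

Within `idx[...]`, each position refers to one level of the index, and `:` means "everything at this level".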


# Combining Datasets

We look at two functions in particular: `pd.concat` and `pd.merge`. For an in-depth explanation, see: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

In [186]:
df1 = df.loc[:, ['finalserialno', 'Age', 'y09']]
df2 = df.loc[:, ['finalserialno', 'y06']]

## Using `pd.concat`

In [189]:
print(df1.shape)
print(df2.shape)
print(pd.concat([df1, df2], axis=0, sort=False).shape)
pd.concat([df1, df2], axis=0, sort=False)

(2194, 3)
(2194, 2)
(4388, 4)


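|   | finalserialno | Age | y09 | y06 |
|---|---------------|-----|-----|-----|
| 0 | 10115 | 21.0 | Female | NaN |
| 1 | 10119 | 53.0 | Male | NaN |
| 2 | 10125 | 56.0 | Male | NaN |
| 3 | 10215 | 65.0 | Female | NaN |
| 4 | 10216 | 68.0 | Female | NaN |
| ... | ... | ... | ... | ... |
| 2189 | 69923 | NaN | NaN | No religion |
| 2190 | 69925 | NaN | NaN | No religion |
| 2191 | 70016 | NaN | NaN | No religion |
| 2192 | 70022 | NaN | NaN | No religion |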


In [191]:
print(df1.shape)
print(df2.shape)
print(pd.concat([df1, df2], axis=1).shape)

pd.concat([df1, df2], axis=1)

(2194, 3)
(2194, 2)
(2194, 5)


|   | finalserialno | Age | y09 | finalserialno | y06 |
|---|---------------|-----|-----|---------------|-----|
| 0 | 10115 | 21.0 | Female | 10115 | No religion |
| 1 | 10119 | 53.0 | Male | 10119 | Islam/Muslim |
| 2 | 10125 | 56.0 | Male | 10125 | No religion |
| 3 | 10215 | 65.0 | Female | 10215 | Christian - no denomination |
| 4 | 10216 | 68.0 | Female | 10216 | Christian - no denomination |
| ... | ... | ... | ... | ... | ... |
| 2189 | 69923 | 59.0 | Female | 69923 | No religion |
| 2190 | 69925 | 46.0 | Female | 69925 | No religion |
| 2191 | 70016 | 50.0 | Female | 70016 | No religion |
| 2192 | 70022 | 82.0 | Female | 70022 | No religion |


In [192]:
df3 = df.loc[:2130, ['finalserialno', 'a02']]

In [194]:
pd.concat([df1, df2, df3], axis=1)

|   | finalserialno | Age | y09 | finalserialno | y06 | finalserialno | a02 |
|---|---------------|-----|-----|---------------|-----|---------------|-----|
| 0 | 10115 | 21.0 | Female | 10115 | No religion | 10115.0 | Labour |
| 1 | 10119 | 53.0 | Male | 10119 | Islam/Muslim | 10119.0 | None/No party |
| 2 | 10125 | 56.0 | Male | 10125 | No religion | 10125.0 | Don`t know |
| 3 | 10215 | 65.0 | Female | 10215 | Christian - no denomination | 10215.0 | Don`t know |
| 4 | 10216 | 68.0 | Female | 10216 | Christian - no denomination | 10216.0 | Labour |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2189 | 69923 | 59.0 | Female | 69923 | No religion | NaN | NaN |
| 2190 | 69925 | 46.0 | Female | 69925 | No religion | NaN | NaN |
| 2191 | 70016 | 50.0 | Female | 70016 | No religion | NaN | NaN |
| 2192 | 70022 | 82.0 | Female | 70022 | No religion | NaN | NaN |
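The behaviour in the cells above can be summarised on a small made-up example (the DataFrames `left` and `right` are hypothetical): `axis=0` stacks rows and fills columns missing from one input with `NaN`, while `axis=1` places the inputs side by side, aligning rows on the index rather than on any key column. The sketch also shows `ignore_index=True`, which replaces the repeated row labels that stacking otherwise keeps.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "Age": [21, 53, 56]})
right = pd.DataFrame({"id": [1, 2], "party": ["Labour", "Conservatives"]})

# axis=0 stacks rows; columns missing from one input are filled with NaN
stacked = pd.concat([left, right], axis=0, sort=False)

# ignore_index=True replaces the repeated row labels with 0..n-1
stacked_clean = pd.concat([left, right], axis=0, ignore_index=True)

# axis=1 places the inputs side by side, aligning rows on the index
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape, side_by_side.shape)
print(side_by_side)
```

Note that `id` appears twice in `side_by_side`: `pd.concat` does not treat it as a key.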


## Using `pd.merge`

In [195]:
pd.merge(df1, df2, on="finalserialno")

|   | finalserialno | Age | y09 | y06 |
|---|---------------|-----|-----|-----|
| 0 | 10115 | 21.0 | Female | No religion |
| 1 | 10119 | 53.0 | Male | Islam/Muslim |
| 2 | 10125 | 56.0 | Male | No religion |
| 3 | 10215 | 65.0 | Female | Christian - no denomination |
| 4 | 10216 | 68.0 | Female | Christian - no denomination |
| ... | ... | ... | ... | ... |
| 2189 | 69923 | 59.0 | Female | No religion |
| 2190 | 69925 | 46.0 | Female | No religion |
| 2191 | 70016 | 50.0 | Female | No religion |
| 2192 | 70022 | 82.0 | Female | No religion |


In [205]:
df4 = df.loc[30:, ['finalserialno', 'y11']]

In [206]:
print(df3.index)
print(df4.index)
for join in ['inner', 'left', 'right', 'outer']:
    print(pd.merge(df3, df4, how=join, on="finalserialno").shape)

RangeIndex(start=0, stop=2131, step=1)
RangeIndex(start=30, stop=2194, step=1)
(2101, 3)
(2131, 3)
(2164, 3)
(2194, 3)
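The four shapes differ because each join type keeps a different set of keys: `inner` keeps keys found in both inputs, `left` and `right` keep every key of one input, and `outer` keeps all keys. Here is a sketch on made-up data (the DataFrames `left` and `right` are hypothetical). It also uses the `indicator=True` option of `pd.merge`, which records where each key was found.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3],
                     "party": ["Labour", "Conservatives", "Labour"]})
right = pd.DataFrame({"id": [2, 3, 4],
                      "religion": ["No religion", "Islam/Muslim", "No religion"]})

# Number of rows produced by each join type
for join in ["inner", "left", "right", "outer"]:
    print(join, pd.merge(left, right, how=join, on="id").shape)

# indicator=True adds a `_merge` column recording where each key was found
outer = pd.merge(left, right, how="outer", on="id", indicator=True)
print(outer)
```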


In [209]:
df4 = df4.rename({'finalserialno':'serialno'}, axis=1).set_index('serialno')
df4

| serialno | y11 |
|----------|-----|
| 10817 | English/Welsh/Scottish/Northern Irish/British |
| 10821 | English/Welsh/Scottish/Northern Irish/British |
| 10823 | English/Welsh/Scottish/Northern Irish/British |
| 10916 | English/Welsh/Scottish/Northern Irish/British |
| 10917 | English/Welsh/Scottish/Northern Irish/British |
| ... | ... |
| 69923 | English/Welsh/Scottish/Northern Irish/British |
| 69925 | English/Welsh/Scottish/Northern Irish/British |
| 70016 | English/Welsh/Scottish/Northern Irish/British |
| 70022 | English/Welsh/Scottish/Northern Irish/British |


In [210]:
pd.merge(df3, df4, how="outer", left_on="finalserialno", right_index=True)

|   | finalserialno | a02 | y11 |
|---|---------------|-----|-----|
| 30 | 10817 | None/No party | English/Welsh/Scottish/Northern Irish/British |
| 31 | 10821 | None/No party | English/Welsh/Scottish/Northern Irish/British |
| 32 | 10823 | None/No party | English/Welsh/Scottish/Northern Irish/British |
| 33 | 10916 | Conservatives | English/Welsh/Scottish/Northern Irish/British |
| 34 | 10917 | Labour | English/Welsh/Scottish/Northern Irish/British |
| ... | ... | ... | ... |
| 2126 | 68423 | None/No party | English/Welsh/Scottish/Northern Irish/British |
| 2127 | 68425 | Labour | English/Welsh/Scottish/Northern Irish/British |
| 2128 | 68525 | Conservatives | White and Black Caribbean |
| 2129 | 68602 | All of them/ more than one | English/Welsh/Scottish/Northern Irish/British |


# Melting and Pivoting

In [258]:
long_df = pd.DataFrame({
    "Constituency": ['Oxford West', 'Oxford East']*4,
    "Year": [2010, 2010, 2015, 2015, 2017, 2017, 2019, 2019],
    "Party": ["Labour", "Tory"]*2+["Labour", "LibDem"]*2
})

long_df

|   | Constituency | Year | Party |
|---|--------------|------|-------|
| 0 | Oxford West | 2010 | Labour |
| 1 | Oxford East | 2010 | Tory |
| 2 | Oxford West | 2015 | Labour |
| 3 | Oxford East | 2015 | Tory |
| 4 | Oxford West | 2017 | Labour |
| 5 | Oxford East | 2017 | LibDem |
| 6 | Oxford West | 2019 | Labour |
| 7 | Oxford East | 2019 | LibDem |


In [261]:
wide_df = long_df.pivot(index="Constituency", columns="Year", values="Party")
wide_df

| Constituency / Year | 2010 | 2015 | 2017 | 2019 |
|---------------------|------|------|------|------|
| Oxford East | Tory | Tory | LibDem | LibDem |
| Oxford West | Labour | Labour | Labour | Labour |


In [277]:
wide_df.reset_index().melt(id_vars="Constituency", value_vars=[2010, 2015, 2017, 2019], var_name="Year")

|   | Constituency | Year | value |
|---|--------------|------|-------|
| 0 | Oxford East | 2010 | Tory |
| 1 | Oxford West | 2010 | Labour |
| 2 | Oxford East | 2015 | Tory |
| 3 | Oxford West | 2015 | Labour |
| 4 | Oxford East | 2017 | LibDem |
| 5 | Oxford West | 2017 | Labour |
| 6 | Oxford East | 2019 | LibDem |
| 7 | Oxford West | 2019 | Labour |
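`pivot` and `melt` are inverse operations up to row order and column naming. The sketch below checks this round trip on the same data. It assumes the `value_name` argument of `melt` is used to restore the `Party` column name, and it converts `Year` back to integers because melting column labels may change their dtype.

```python
import pandas as pd

long_df = pd.DataFrame({
    "Constituency": ["Oxford West", "Oxford East"] * 4,
    "Year": [2010, 2010, 2015, 2015, 2017, 2017, 2019, 2019],
    "Party": ["Labour", "Tory"] * 2 + ["Labour", "LibDem"] * 2,
})

# Long -> wide: one row per constituency, one column per year
wide_df = long_df.pivot(index="Constituency", columns="Year", values="Party")

# Wide -> long: value_name restores the original column name
round_trip = wide_df.reset_index().melt(
    id_vars="Constituency", var_name="Year", value_name="Party"
)

# Compare after putting rows in the same order and aligning the Year dtype
key = ["Year", "Constituency"]
original_sorted = long_df.sort_values(key).reset_index(drop=True)
round_trip_sorted = (
    round_trip.astype({"Year": "int64"}).sort_values(key).reset_index(drop=True)
)
print(original_sorted.equals(round_trip_sorted[original_sorted.columns]))
```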
