# Functions

## Defining Functions in Python

Defining functions in Python is simple:

In [None]:
def my_first_function(x):
    output = 2*x
    return output

To use the function, call it anywhere after it has been defined:

In [None]:
my_first_function(5)

Note the indentation when writing functions: the indented block is the function body. To end the function, remove the indent. Leaving a blank line after the body is conventional but not required.

In [None]:
def my_second_function(x):
    y = int(x/2)
    return y 

my_second_function(12)

**Task**:

Create a function, call it `my_third_function`. This function should take a single argument, `x`, and return `x+2`.

In [None]:
## Your answer here


### _Docstrings_

We can write documentation for our functions, called a _docstring_, by writing a string on the line immediately following the `def` statement:

In [None]:
def my_documented_function(input1, input2):
    """
    This function converts inputs to string,
    concatenates the input, then reverses the order.
    """
    in1 = str(input1)
    in2 = str(input2)
    combi = in1+in2
    return combi[::-1]    

my_documented_function(43214, 'cats')

In [None]:
my_documented_function?
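
The `?` suffix is an IPython/Jupyter feature. Outside a notebook, the same docstring can be read with the built-in `help()` or the `__doc__` attribute. A minimal sketch, using a hypothetical function `double` that is not part of the exercises:

```python
def double(x):
    """Return twice the input."""
    return 2 * x

# The docstring is stored on the function object itself.
print(double.__doc__)

# help() prints the signature together with the docstring.
help(double)
```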

**Task**:

Create a function that:

- Takes two arguments, x and y
- Returns (x-y)\*(y-x)

Name it appropriately and describe it in a docstring.

In [None]:
# Your answer here


### _Default Values_

- We can give arguments default values when defining a function.
- Arguments taking default values become optional arguments; we do not have to pass a value each time we call the function.


In [None]:
def function_with_defaults(x, replace=" ", val=" "):
    "Casts input to string, replaces `replace` with `val`, prints result. Returns None"
    y = str(x)
    y = y.replace(replace, val)
    print(y)

function_with_defaults("1. I love dogs.")
function_with_defaults("2. I love dogs.", " dogs.")
function_with_defaults("3. I love dogs.", val="_")
function_with_defaults("4. I love dogs.", "d", "b")
function_with_defaults("5. I love dogs.", "dog", "cat")

### Namespaces

Recall the difference between local and global namespace.

- Variables defined within a function are not accessible outside the function.
- When a variable is referenced within a function, Python first checks whether it is defined locally, then whether it is defined globally.

In [None]:
def function_with_local():
    some_local_variable = 12
    
print(some_local_variable)
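
The cell above raises a `NameError`, because `some_local_variable` only exists inside the function body. As a rough sketch, we can catch the error to see the message and then retrieve the value through a `return` instead (the added `return` line is an assumption for illustration):

```python
def function_with_local():
    some_local_variable = 12
    return some_local_variable

try:
    print(some_local_variable)  # not defined in the global namespace
except NameError as err:
    message = str(err)
    print("NameError:", message)

# The value is only reachable through the function's return value.
value = function_with_local()
print(value)
```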

In [None]:
a = "Global A"

def function1():
    print(a)
    
def function2():
    a = "Local A"
    print(a)
    
def function3(a):
    print(a)

function1()
function2()
function3("Argument A")

### Namespaces Takeaway

When defining variables within the global environment, use _unique, specific and informative names_. When working within functions, generic names are fine as long as they convey what the argument or variable does.

In [None]:
# Check: What is the value of x?
x = 10

def do_something(x):
    x = x+5
    
do_something(x)
# print(x)

In [None]:
# Another check: What is the value of y?
x = 10
y = -5

def do_something(y):
    y = x+y
    return y

y = do_something(y)
# print(y)

In [None]:
# Cleaning up:
del x, y

# Applying Functions to Vectors

We go over a variety of ways in which you may apply a function to a `pandas.Series` or `pandas.DataFrame`.

- Transformations:
    - Element-wise Operations
    - Cumulative Operations
- Summaries:
    - Point Summaries
    - Grouped Summaries

## Element-wise Operations on a Series

We can use the `pd.Series.apply()` method to apply a function element-wise to a pandas Series.

In [None]:
import pandas as pd

ser = pd.Series(range(0, 12, 2)) # range(start, stop, step)
ser

In [None]:
def square(x):
    y = x**2
    return y

ser.apply(square)

In [None]:
def exponentiate(x, e=3):
    y = x**e
    return y

ser.apply(lambda x: exponentiate(x, 3))
ser.apply(exponentiate)
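
As an alternative to wrapping the call in a lambda, `Series.apply` forwards extra keyword arguments to the function. A small sketch that rebuilds `ser` and `exponentiate` so it runs on its own:

```python
import pandas as pd

def exponentiate(x, e=3):
    return x**e

ser = pd.Series(range(0, 12, 2))

# Keyword arguments that apply() does not recognise are passed on to the function.
squares = ser.apply(exponentiate, e=2)
# Equivalent call using a lambda.
squares_lambda = ser.apply(lambda x: exponentiate(x, e=2))

print(squares.tolist())
```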

### Lambda Functions

Python allows you to write in-line, anonymous functions using a `lambda` expression. The following two are equivalent:

```
def my_func(x):
    return x+3

my_func
lambda x: x+3
```
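
To check the equivalence, we can call both forms with the same input. In this sketch the lambda is bound to the name `my_lambda` (an illustrative name) only so that it can be called twice:

```python
def my_func(x):
    return x + 3

my_lambda = lambda x: x + 3  # same behaviour, bound to a name for comparison

print(my_func(2), my_lambda(2))
# A lambda has no name of its own, which is why it is called anonymous.
print(my_lambda.__name__)
```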

In [None]:
ser.apply(lambda x: x**3)

In [None]:
e = 1/2
ser.apply(lambda x: x**e)

## Cumulative Operations on a Series

For cumulative operations, we can use either one of the `cum*` methods (such as `cumsum`) or the more general `pd.Series.expanding` method.

In [None]:
ser.cumsum()

In [None]:
ser.expanding()

In [None]:
ser.expanding().sum()

In [None]:
ser.expanding(2).sum() # The first argument, min_periods, sets the minimum number of observations needed for a result.
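
The relationship between the two approaches can be sketched as follows, rebuilding `ser` so the example is self-contained. The variable names are illustrative:

```python
import pandas as pd

ser = pd.Series(range(0, 12, 2))

running_total = ser.cumsum()
# Same values as cumsum, but returned as floats.
expanding_total = ser.expanding().sum()
# With min_periods=2 the first window holds only one value, so the result is NaN.
expanding_min2 = ser.expanding(min_periods=2).sum()

print(running_total.tolist())
print(expanding_min2.tolist())
```

The advantage of `expanding()` is that other aggregations, such as `.mean()` or `.max()`, can be chained in the same way.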

We can also use `apply` with a DataFrame. In this case, each column (`axis=0`) or each row (`axis=1`) is passed to the function as a Series.

In [None]:
import numpy as np

df = pd.DataFrame({
    'col1': np.random.randint(-100, 100, 5), # 5 random integers from [-100, 100)
    'col2': np.arange(0, 5), # Integers from [0, 5)
    'col3': np.linspace(0, 1, 5) # Five evenly spaced values from [0, 1]
})
df

In [None]:
# Means over each column
df.apply(lambda x: x.mean(), axis=0)

In [None]:
# Sums over each row
df.apply(lambda x: x.sum(), axis=1)
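
Because `df` above contains random values, here is a small deterministic sketch of the same idea. The frame `small` and its values are made up for illustration:

```python
import pandas as pd

small = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the function receives each column in turn, giving one value per column.
col_ranges = small.apply(lambda col: col.max() - col.min(), axis=0)
# axis=1: the function receives each row in turn, giving one value per row.
row_sums = small.apply(lambda row: row.sum(), axis=1)

print(col_ranges.to_dict())
print(row_sums.tolist())
```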

In [None]:
# |x|**0.5 to every element
df.applymap(lambda x: abs(x)**0.5)
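
In recent pandas releases (2.1 and later, to the best of our knowledge), `DataFrame.applymap` is deprecated in favour of `DataFrame.map`. The sketch below picks whichever method the installed version provides. The frame `small` is made up for illustration:

```python
import pandas as pd

small = pd.DataFrame({"a": [-4, 9], "b": [16, -25]})

# Prefer DataFrame.map when the installed pandas provides it; fall back to applymap.
elementwise = small.map if hasattr(small, "map") else small.applymap
roots = elementwise(lambda x: abs(x) ** 0.5)

print(roots)
```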

## Point Summaries

We have already looked at a number of point summary functions in the previous week.

- `pd.Series.mean()`
- `pd.Series.sum()`

We do not spend more time on them here.

## Grouped Summaries

The syntax for group summaries is [explained in detail in the lecture](https://muhark.github.io/dpir-intro-python/Week3/lecture.html#/groupby-syntax-simple-group-operations).

In [None]:
link = 'http://github.com/muhark/dpir-intro-python/raw/master/Week2/data/bes_data_subset_week2.feather'
df = pd.read_feather(link)
col_name_dict =  {
    'a01': 'top_issue',
    'a02': 'top_issue-best_party',
    'a03': 'politics_interest',
    'e01': 'ideo_LR',
    'k01': 'politics_attention',
    'k02': 'read_pol_news',
    'k03': 'newspaper',
    'k11': 'canvasser_contact',
    'k13': 'party_contact',
    'k06': 'twitter_use',
    'k08': 'facebook_use',
    'y01': 'income_band',
    'y03': 'housing',
    'y06': 'religion',
    'y07': 'religiosity',
    'y08': 'union_member',
    'y09': 'gender',
    'y11': 'ethnicity',
    'y17': 'employment_type',
    'y18': 'has_worked'
}
df = df.rename(col_name_dict, axis=1)
df.loc[:, 'finalserialno'] = df['finalserialno'].astype(str)
df.loc[:, 'ideo_LR'] = df['ideo_LR'].replace(
    {'0 Left': '0', '10 Right': '10', 'Refused': np.nan, 'Don`t know': np.nan}
).astype(float)

In [None]:
df.groupby('region')['Age'].mean().astype('int')

In [None]:
df.groupby('region')[['Age']].mean() # List indexer returns a DataFrame of width 1.
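
The difference between the two selectors is easier to see on a toy frame. The data below is made up and only mirrors the column names used above:

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "Age": [20, 40, 30, 50],
})

as_series = toy.groupby("region")["Age"].mean()    # string selector -> Series
as_frame = toy.groupby("region")[["Age"]].mean()   # list selector -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
print(as_series.to_dict())
```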

**Task**:

Calculate:

- Max age for each gender
- Median ideology for each age

In [None]:
## Your answer here

We can group over multiple columns simultaneously. Pass the grouping columns as a list in the order that they should be grouped by.

In [None]:
df.groupby(['region', 'gender'])['Age'].mean()
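
Grouping by several columns returns a result with a hierarchical index, one level per grouping column. A minimal sketch on made-up data shows how to read a single group and how `unstack()` turns the inner level into columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "North", "South", "South"],
    "gender": ["Female", "Male", "Male", "Female", "Female"],
    "Age": [30, 40, 60, 25, 35],
})

means = toy.groupby(["region", "gender"])["Age"].mean()
print(means.index.names)             # the grouping columns become index levels
print(means.loc[("North", "Male")])  # select one group with a tuple

# unstack() moves the inner level (gender) into columns; missing groups become NaN.
table = means.unstack()
print(table)
```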

We can also pass custom functions using `apply`:

In [None]:
df.groupby(['region', 'Constit_Code'])['Age']\
    .apply(lambda x: f"{int(x.min())}-{int(x.max())}")\
    .rename('Age Range')

**Task:**

Calculate the proportion of Facebook users per income band per region.


We can pass multiple functions by using the `agg` function.

In [None]:
df.groupby(['region', 'Constit_Code'])['Age']\
    .agg(['mean', len])\
    .rename({'len':'count'}, axis=1)
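
An alternative to renaming afterwards is named aggregation, where each keyword passed to `agg` becomes an output column name. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "South"],
    "Age": [20, 40, 30],
})

# Each keyword is an output column; each value is the aggregation to compute.
summary = toy.groupby("region")["Age"].agg(mean="mean", count="count")
print(summary)
```

Note that `"count"` counts non-missing values, whereas `len` counts all rows in the group; the two agree here because the toy data has no missing ages.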

**Task:**

Show the number of Facebook users and total number of individuals per gender per income band.

We can also apply a single function to multiple columns simultaneously:

In [None]:
df.groupby(['region', 'Constit_Code'])\
    [['newspaper', 'religion', 'gender', 'ethnicity']]\
    .apply(lambda x: x.mode().iloc[0])

And finally, we can map different functions to different columns using the `agg()` function with a dictionary:

In [None]:
def group_mode(x):
    "Function for extracting first modal value from pandas groupby object"
    m = x.value_counts().index[0]
    return m

def gender_proportion(x):
    m = x.apply(lambda e: 1 if e=="Female" else 0)
    m = m.astype(int).mean()
    return m

df.groupby(['region', 'Constit_Code']).agg({'newspaper': group_mode,
                                            'religion': group_mode,
                                            'gender': gender_proportion,
                                            'Age': ['min', 'max']})

Note: to index the above DataFrame, you will need some fancy indexers, namely `pd.IndexSlice`.

For more general notes, see: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-indexing-with-hierarchical-index

In [None]:
temp = df.groupby(['region', 'Constit_Code']).agg({'newspaper': group_mode,
                                                   'religion': group_mode,
                                                   'gender': gender_proportion,
                                                   'Age': ['min', 'max']})

temp

In [None]:
idx = pd.IndexSlice 
temp.loc[idx['East Midlands', :], idx[:, 'group_mode']]
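
The same indexing pattern can be tried on a small, made-up frame. Here `toy`, `temp_toy` and the column names are illustrative; the structure (two grouping columns, lists of aggregations) mirrors the example above:

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "North", "South"],
    "code": ["N1", "N1", "N2", "S1"],
    "Age": [20, 40, 30, 50],
    "score": [1.0, 3.0, 5.0, 7.0],
})

# Two grouping columns give a MultiIndex on the rows;
# lists of functions give a MultiIndex on the columns.
temp_toy = toy.groupby(["region", "code"]).agg({"Age": ["min", "max"], "score": ["mean"]})

idx = pd.IndexSlice
# Rows: every code within "North". Columns: every "min" statistic.
north_min = temp_toy.loc[idx["North", :], idx[:, "min"]]
print(north_min)
```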


# Combining Datasets

We look at two commands in particular. For an in-depth explanation, see: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

In [None]:
df1 = df.loc[:, ['finalserialno', 'Age', 'gender']]
df2 = df.loc[:, ['finalserialno', 'religion']]

## Using `pd.concat`

In [None]:
print(df1.shape)
print(df2.shape)
print(pd.concat([df1, df2], axis=0, sort=False).shape)
pd.concat([df1, df2], axis=0, sort=False)   

In [None]:
print(df1.shape)
print(df2.shape)
print(pd.concat([df1, df2], axis=1).shape)

pd.concat([df1, df2], axis=1)
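
To see what `axis` does in `pd.concat`, it helps to use two tiny frames whose indexes only partly overlap. The frames `left` and `right` are made up for illustration:

```python
import pandas as pd

left = pd.DataFrame({"id": ["a", "b", "c"], "Age": [20, 30, 40]}, index=[0, 1, 2])
right = pd.DataFrame({"id": ["b", "c", "d"], "religion": ["X", "Y", "Z"]}, index=[1, 2, 3])

# axis=0: rows are appended; the columns are the union of both frames.
stacked = pd.concat([left, right], axis=0, sort=False)
# axis=1: columns are appended; rows are aligned on the index, not on "id".
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape)
print(side_by_side.shape)
print(side_by_side.columns.tolist())
```

Note that `pd.concat` never matches rows on a key column: with `axis=1` the `id` column simply appears twice, and index labels present in only one frame produce missing values.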

In [None]:
df3 = df.sample(1000).loc[:, ['finalserialno', 'politics_attention']]

In [None]:
pd.concat([df1, df2, df3], axis=1).info()
# Check: why are there differing numbers of NAs?

In [None]:
# Check: How many columns, rows, and NAs will the following have?
pd.concat([df1, df2, df3], axis=0).info()

## `pd.merge`

In [None]:
pd.merge(df1, df2, on="finalserialno")

In [None]:
df4 = df.sample(30).loc[:, ['finalserialno', 'ethnicity']]

In [None]:
print(df3.index)
print(df4.index)
for join in ['inner', 'left', 'right', 'outer']:
    print(pd.merge(df3, df4, how=join, on="finalserialno").shape)
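
Since `df3` and `df4` are random samples, the shapes above change between runs. A deterministic sketch with made-up keys shows what each join type keeps. The `indicator=True` option adds a `_merge` column recording where each row came from:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

shapes = {
    how: pd.merge(left, right, how=how, on="key").shape
    for how in ["inner", "left", "right", "outer"]
}
print(shapes)

# indicator=True labels each row as left_only, right_only or both.
outer = pd.merge(left, right, how="outer", on="key", indicator=True)
print(outer[["key", "_merge"]])
```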

In [None]:
df4 = df4.rename({'finalserialno':'serialno'}, axis=1)
df4

In [None]:
pd.merge(
    df3, df4,
    how="outer", left_on="finalserialno", right_on='serialno')

In [None]:
# Can check the same with set operations
len(set(df3['finalserialno']).intersection(set(df4['serialno'])))

**Task**:

Sample 100 rows of `df` to create a new dataframe and call it `df5`.

Left-join `df5` on `df` on a unique key. How many rows and columns are in the resulting dataframe?

# Melting and Pivoting

In [None]:
long_df = pd.DataFrame({
    "Constituency": ['Oxford West', 'Oxford East']*4,
    "Year": [2010, 2010, 2015, 2015, 2017, 2017, 2019, 2019],
    "Party": ["Labour", "Tory"]*2+["Labour", "LibDem"]*2
})

long_df

In [None]:
wide_df = long_df.pivot(index="Constituency", columns="Year", values="Party")
wide_df

In [None]:
wide_df\
    .reset_index()\
    .melt(
        id_vars="Constituency",
        value_vars=[2010, 2015, 2017, 2019],
        var_name="Year")
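
Pivoting and melting are inverse operations. As a rough check, the sketch below rebuilds `long_df`, pivots it, melts it back, and compares the rows. Passing `value_name="Party"` is an addition here so that the column names match the original:

```python
import pandas as pd

long_df = pd.DataFrame({
    "Constituency": ["Oxford West", "Oxford East"] * 4,
    "Year": [2010, 2010, 2015, 2015, 2017, 2017, 2019, 2019],
    "Party": ["Labour", "Tory"] * 2 + ["Labour", "LibDem"] * 2,
})
wide_df = long_df.pivot(index="Constituency", columns="Year", values="Party")

round_trip = wide_df.reset_index().melt(
    id_vars="Constituency", var_name="Year", value_name="Party"
)
# The year labels may come back as generic objects, so cast them to integers.
round_trip["Year"] = round_trip["Year"].astype(int)

# Compare the two frames row by row, ignoring row order.
original_rows = sorted(long_df.itertuples(index=False, name=None))
recovered_rows = sorted(round_trip.itertuples(index=False, name=None))
same = original_rows == recovered_rows
print(same)
```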

**Task**:

Pivot `long_df` so that the years are the index and the constituencies the columns.