# Week 3 Exercises

This week we learned how to do the following tasks:

- Write functions.
- Apply functions element-wise and cumulatively.
- Calculate point and grouped summaries.
- Concatenate and merge datasets.


## Task 1: Functions

### Task 1a: Numeric Functions

In this exercise you will write functions whose domains are either scalar numbers or numeric vectors.

#### Scalar Functions

- One Input: Absolute value
- Two Inputs: Calculate the difference between the first input and the largest multiple of the second input that does not exceed the first input. For example, if the inputs are (41, 10), the function should calculate 41 - 4\*10 = 1.
- Challenge: Write a function that returns the prime factors of the input. For example, 132 = 2\*2\*3\*11, so $f(132) = \{2, 2, 3, 11\}$.

#### Vector Functions

- One Input: Write a summary statistics function. Given a vector, this function should return the following statistics in a `pd.Series` object with corresponding index labels: number of elements, sum, mean, median, variance, standard deviation, and any other statistics that you think are helpful.
- Two Inputs: Write a function that, given two equal-length inputs, determines whether each element in the first is divisible by the corresponding element in the second. The output should be a vector of the same length as the inputs, indicating with True/False values whether each element of the first vector is divisible by the corresponding element of the second. Challenge: Allow the function to take either a scalar or a vector as its second argument.

### Task 1b: String Functions

#### Scalar Functions

- One Input: Write a function that divides a string into a list of words. Note: the `str.split()` function is useful here.
- Two Inputs: Write a function that calculates the number of times the second argument occurs in the first. e.g. "How many times does the letter e occur in this sentence?"

#### Vector Function

- One Input: Write a function that, given a vector/list/series of strings, returns a series where the index contains the unique words in the input and the values are the number of times each unique word occurs in the entire input. For example, given a list containing all of the State of the Union Addresses, the function should report a) which unique words appear in the collection of all Addresses, and b) how many times each of those words occurs in the total collection.


In [None]:
def absolute_value(x):
    if x < 0:
        x = x*-1
    return x

def largest_mult_diff(x, y):
    # Very lazy solution, but doesn't reinvent the wheel
    return divmod(x, y)[1]

def factorize(x):
    remainder = x
    divide = 2
    factors = []
    while remainder > 1:
        while (remainder % divide) == 0:
            factors.append(divide)
            remainder = remainder//divide
        divide += 1
    return factors
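
The `factorize` solution above tries every candidate divisor up to the input itself. A common refinement, sketched here under the assumption that the input is a positive integer, stops at the square root and treats whatever remains as prime. The name `prime_factors` is illustrative only and is not part of the original solutions.

```python
def prime_factors(n):
    """Return the prime factors of n (with multiplicity) in ascending order."""
    factors = []
    divisor = 2
    # Only test divisors up to sqrt(n): a composite n always has a factor in that range
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    # Anything left above 1 has no smaller factor, so it is itself prime
    if n > 1:
        factors.append(n)
    return factors


print(prime_factors(132))  # [2, 2, 3, 11]
```

For a large prime input this performs roughly sqrt(n) iterations instead of n.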

In [None]:
import pandas as pd

def series_summary(x):
    if not isinstance(x, pd.Series):
        x = pd.Series(x)
    # The task also asks for the sum, so include it alongside the other statistics
    index = ['n', 'sum', 'mean', 'median', 'variance', 'std']
    data = [len(x),
            x.sum(),
            x.mean(),
            x.median(),
            x.var(),
            x.std()
           ]
    return pd.Series(data, index=index)

def check_divisible(x, y):
    return x%y==0
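
The `check_divisible` solution works as written when the inputs are `pd.Series` or NumPy arrays, but it fails on plain Python lists. One possible way to cover lists and the scalar-or-vector challenge is to convert both inputs to NumPy arrays and rely on broadcasting. The name `check_divisible_any` is hypothetical, and the sketch assumes non-zero integer divisors.

```python
import numpy as np


def check_divisible_any(x, y):
    """Element-wise divisibility test; y may be a scalar or have the same length as x."""
    x = np.asarray(x)
    y = np.asarray(y)
    # Broadcasting pairs each element of x with the matching element of y,
    # or with y itself when y is a scalar
    return x % y == 0


print(check_divisible_any([10, 15, 21], [5, 4, 7]))  # [ True False  True]
print(check_divisible_any([10, 15, 21], 5))          # [ True  True False]
```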

In [None]:
import re

# There is a way to do this without re, but the regex solution is the most efficient and generalisable (can deal with weird characters)
def split_into_words(x, delim=" "):
    x = re.sub(r'[^A-Za-z ]+', '', x) # This pattern deletes everything but letters and spaces ([A-z] would also keep [, \, ], ^, _ and `)
    x = x.split(delim)
    return x

def count_occurrences(x, e):
    # Number of non-overlapping times the second argument e occurs in the first argument x
    count = len(x.split(e)) - 1
    return count

In [None]:
from collections import Counter

def text_col_to_dfm(text_series):
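    # Lowercase, then strip everything except letters and spaces.
    # regex=True is required because newer pandas versions default to literal replacement.
    text_series = text_series.str.lower().str.replace(r"[^a-z ]", '', regex=True)
    index = text_series.index.values
    tokens = list(set(text_series.str.split(" ").sum()))
    data = []
    # Iterate over the values positionally so that non-default indexes also work
    for text in text_series.values:
        row = []
        l = Counter(text.split(" "))
        for token in tokens:
            row.append(l.get(token, 0))
        data.append(row)
    # One row per document, one column per unique token; df.sum() gives the total counts
    df = pd.DataFrame(index=index, columns=tokens, data=data)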
    return df
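
The solution above returns a document-feature matrix (one row per text), whereas the task asks for a single series of total counts. A minimal sketch of that series, assuming words are runs of ASCII letters and that case should be ignored, could look like the following. The function name `word_counts` and the sample sentences are made up for illustration.

```python
import re
from collections import Counter

import pandas as pd


def word_counts(texts):
    """Count how often each unique word occurs across all strings in texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then treat every run of letters as one word
        counts.update(re.findall(r"[a-z]+", text.lower()))
    # Index = unique words, values = total occurrences, most frequent first
    return pd.Series(counts, dtype="int64").sort_values(ascending=False)


speeches = ["The state of the union is strong.", "The union endures."]
totals = word_counts(speeches)
print(totals["the"])    # 3
print(totals["union"])  # 2
```

Note that this tokenisation splits contractions such as "don't" into two tokens, which differs slightly from deleting the punctuation first.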

## Task 2: Apply

### Task 2a: Element-Wise Operations

1. Using the `Age` variable from the BES dataset, calculate the age of each respondent rounded down to the nearest multiple of 5. Try writing this both using a defined function and with a `lambda` function.
2. Recode the column `y09` as 0 and 1.
3. Write a function that gets the lower bound from the income bounds reported in column `y01`, and returns it as an integer.


### Task 2b: Grouped Functions

1. Calculate the summary statistics on `Age` for each region, and each region/constituency.
2. Calculate the median income bracket (`y01`) per region and region/constituency.
3. Calculate the most commonly given answer to `a02` per region and region/income bracket.
4. Calculate the most commonly given answer to `a02` and `y06` per region.

In [None]:
df = pd.read_feather("../Week2/data/bes_data_subset_week2.feather")

In [None]:
df['Age'].apply(lambda x: x//5*5)

def myround(x, base=5):
    return x//base*base

df['Age'].apply(myround)
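
Because `//` and `*` are already vectorised on a `pd.Series`, the same rounding can also be written without `apply`. The sketch below uses a small made-up series in place of the BES `Age` column, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for df['Age']; these ages are invented for the example
ages = pd.Series([18, 23, 35, 59, 60], name="Age")

# Floor-divide by 5, then multiply back to get the nearest lower multiple of 5
rounded = ages // 5 * 5

print(rounded.tolist())  # [15, 20, 35, 55, 60]
```

On large columns the vectorised form is usually faster than calling a Python function once per element.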

In [None]:
df['y09'].apply(lambda x: int(x=='Female'))

In [None]:
df['y01'].unique().tolist()

In [None]:
def get_lower_income_bound(x):
    if x == 'Under GBP 2,600':
        return 0
    elif x == 'GBP 100,000 or more':
        return 100000
    elif x in ['Don`t know', 'Refused']:
        return float('nan')  # pd.np was removed in pandas 2.0
    else:
        return int(x.split(" - ")[0].split("GBP ")[1].replace(",", ""))

df['y01'].apply(get_lower_income_bound)

In [None]:
df.groupby(['region'])['Age'].describe()

In [None]:
df.groupby(['region', 'Constit_Code'])['Age'].describe()

In [None]:
df['lower_income_bound'] = df['y01'].apply(get_lower_income_bound)
df.groupby(['region'])['lower_income_bound'].median()
df.groupby(['region', 'Constit_Code'])['lower_income_bound'].median()

In [None]:
df.groupby(['region'])['a02'].apply(lambda x: pd.Series.mode(x)[0])
# The task asks for region/income bracket, so group by y01 rather than constituency
df.groupby(['region', 'y01'])['a02'].apply(lambda x: pd.Series.mode(x)[0])
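
For item 4 of Task 2b (the most common answer to both `a02` and `y06` per region), one possible approach is to select both columns after grouping and aggregate each with its mode. The sketch below runs on a small invented DataFrame; the region names and answer values are placeholders and not taken from the BES data, and `most_common` is a hypothetical helper.

```python
import pandas as pd

# Invented stand-in for the BES data; only the column names follow the exercise
toy = pd.DataFrame({
    "region": ["North", "North", "North", "South", "South"],
    "a02": ["Yes", "Yes", "No", "No", "No"],
    "y06": ["Owner", "Renter", "Renter", "Owner", "Owner"],
})


def most_common(s):
    # mode() returns all tied values in sorted order; keep the first one
    return s.mode().iloc[0]


result = toy.groupby("region")[["a02", "y06"]].agg(most_common)
print(result)
```

The result has one row per region and one column per question, each cell holding the most frequent answer in that group.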