# Week 3 Exercises

This week we learned how to do the following tasks:

- Write functions.
- Apply functions element-wise and cumulatively.
- Calculate point and grouped summaries.
- Concatenate and merge datasets.


## Task 1: Functions

### Task 1a: Numeric Functions

In this exercise you write functions whose domain is either scalar numbers or numeric vectors.

#### Scalar Functions

- One Input: Absolute value
- Two Inputs: Calculate the difference between the first input and the largest multiple of the second input that does not exceed the first input. For example, if the inputs are (41, 10), the function should calculate 41 - 4\*10 = 1.
- Challenge: Write a function that returns the prime factors of the input. For example, 132 = 2\*2\*3\*11, so $f(132) = \{2, 2, 3, 11\}$.

#### Vector Functions

- One Input: Write a summary statistics function. Given a vector, this function should return the following statistics in a `pd.Series` object with corresponding index labels: number of elements, sum, mean, median, variance, standard deviation, and any other statistics that you think are helpful.
- Two Inputs: Write a function that, given two equal-length inputs, determines whether each element in the first is divisible by the corresponding element in the second. The output should be a vector of the same length as the inputs, indicating with True/False values whether each element of the first vector is divisible by the corresponding element of the second. Challenge: Allow the function to take either a scalar or a vector as its second argument.

### Task 1b: String Functions

#### Scalar Functions

- One Input: Write a function that divides a string into a list of words. Note: the `str.split()` function is useful here.
- Two Inputs: Write a function that calculates the number of times the second argument occurs in the first. e.g. "How many times does the letter e occur in this sentence?"

#### Vector Function

- One Input: Write a function that, given a vector/list/series of strings, returns a series whose index contains the unique words in the input and whose values are the number of times each word occurs in the entire input. For example, given a list containing all of the State of the Union Addresses, the function should report a) what the unique words in the collection of all Addresses are, and b) how many times each of those words occurs in the whole collection.


In [None]:
def absolute_value(x):
    """
    Python also has a built-in abs() function.
    This is just another way to implement it.
    """
    if x < 0:
        x = x*-1
    return x

def largest_mult_diff(x, y):
    """
    There are a variety of ways to do this.
    Modulo operations are probably the easiest: x % y
    """
    return x % y

def factorize(x):
    """
    Factorizes x.
    """
    # Initial values
    remainder = x
    divide =  2
    # Store factors as we find them
    factors = []
    while remainder > 1: # When remainder == 1, we've finished factorizing
        # Inner while loop because a single prime can be a factor multiple times
        while (remainder % divide) == 0: # Check if it cleanly divides.
            factors.append(divide) # If it cleanly divides, then add it to the list of factors.
            remainder = remainder//divide # Update the remainder, try again.
        divide += 1 # Increment up through all integers. Faster to try only primes.
    return factors
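
The comment in `factorize` notes that trying every integer is slower than necessary. One common shortcut, sketched below under the hypothetical name `factorize_fast`, is to stop trial division once the candidate divisor squared exceeds what is left; anything remaining above 1 at that point must itself be prime. This is an illustrative variant, not part of the original solution.

```python
def factorize_fast(x):
    """Trial division that stops once divide**2 exceeds the remainder."""
    factors = []
    remainder = x
    divide = 2
    while divide * divide <= remainder:
        # A prime can divide the remainder several times
        while remainder % divide == 0:
            factors.append(divide)
            remainder //= divide
        divide += 1
    # Whatever is left above 1 has no divisor up to its square root, so it is prime
    if remainder > 1:
        factors.append(remainder)
    return factors

print(factorize_fast(132))  # [2, 2, 3, 11]
```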

In [2]:
import pandas as pd

def series_summary(x):
    """
    Returns key statistics of a series.
    """
    if not isinstance(x, pd.Series): # Checks if input is pd.Series object
        x = pd.Series(x) # If not, then make it be so
    index = ['n', 'sum', 'mean', 'median', 'variance', 'std'] # 6 statistics
    data = [len(x),
            x.sum(),
            x.mean(),
            x.median(),
            x.var(),
            x.std()
           ]
    return pd.Series(data, index=index)

def check_divisible(x, y):
    """
    Element-wise divisibility check; x and y should be pd.Series or NumPy arrays.
    """
    return x%y==0
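
For the challenge part of the task (a scalar or a vector as the second argument), one possible approach is to convert both inputs to NumPy arrays and rely on broadcasting. The name `check_divisible_flexible` is made up for this sketch, and it assumes plain lists may be passed in as well.

```python
import numpy as np

def check_divisible_flexible(x, y):
    """Element-wise divisibility; y may be a scalar or a vector as long as x."""
    x = np.asarray(x)
    y = np.asarray(y)  # a scalar becomes a 0-d array and broadcasts against x
    return x % y == 0

print(check_divisible_flexible([10, 15, 21], [5, 4, 7]))  # vector second argument
print(check_divisible_flexible([10, 15, 21], 5))          # scalar second argument
```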

In [3]:
import re

# This can be done without re, but a regex copes with punctuation and other unusual characters
def split_into_words(x, delim=" "):
    x = re.sub(r'[^\w ]+', '', x) # This pattern deletes everything except word characters (letters, digits, underscore) and spaces
    x = x.split(delim)
    return x

def count_occurrences(x, e):
    """
    Returns the number of times 'e' (second argument) occurs in x (first argument).
    """
    count = len(x.split(e))-1
    # We don't need to count the occurrences, we can just break up the string on 'e' and
    # count how many parts it gets split into.
    return count

In [4]:
print(split_into_words('Hello World! My name is Myles Morales. How are you?'))
print(count_occurrences('Hello World! My name is Myles Morales. How are you?', 'e'))

['Hello', 'World', 'My', 'name', 'is', 'Myles', 'Morales', 'How', 'are', 'you']
5
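
As a cross-check on the split-based counting trick, the built-in `str.count` method also counts non-overlapping occurrences of a substring, so it should agree with the output above. A minimal sketch using the same sentence:

```python
sentence = 'Hello World! My name is Myles Morales. How are you?'

# Splitting on 'e' yields one more piece than there are occurrences of 'e'
split_count = len(sentence.split('e')) - 1

# str.count gives the same number directly
builtin_count = sentence.count('e')

print(split_count, builtin_count)
```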


In [5]:
from collections import Counter

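def text_col_to_dfm(text_series):
    """
    Builds a document-feature matrix: one row per text, one column per unique word.
    """
    # regex=True is needed in pandas >= 2.0, where str.replace matches literally by default
    text_series = text_series.str.lower().str.replace(r"[^\w ]", '', regex=True)
    index = text_series.index.values
    tokens = list(set(text_series.str.split(" ").sum()))
    data = []
    for text in text_series.values: # Iterate over values so any index works
        row = []
        l = Counter(text.split(" "))
        for token in tokens:
            row.append(l.get(token, 0))
        data.append(row)
    df = pd.DataFrame(index=index, columns=tokens, data=data)
    return df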

In [6]:
text_input_series = pd.Series(
    ['Hello World!', 'Hello Jello!', 'World News Report']
)
text_col_to_dfm(text_input_series)

| | news | report | hello | jello | world |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 1 | 0 |
| 2 | 1 | 1 | 0 | 0 | 1 |
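
The task itself asks for a single series of total word counts rather than a per-document matrix. One way to get there directly is sketched below with `collections.Counter`; the function name `word_counts` is hypothetical, and the cleaning step simply mirrors the lower-casing and punctuation removal used above.

```python
import re
from collections import Counter

import pandas as pd

def word_counts(texts):
    """Returns a Series indexed by unique word, with total counts as values."""
    counts = Counter()
    for text in texts:
        # Lower-case and strip everything except word characters and spaces
        cleaned = re.sub(r"[^\w ]+", "", text.lower())
        counts.update(cleaned.split())
    return pd.Series(counts).sort_values(ascending=False)

counts = word_counts(['Hello World!', 'Hello Jello!', 'World News Report'])
print(counts)
```

Summing the columns of the document-feature matrix (`text_col_to_dfm(...).sum()`) should give the same totals.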


## Task 2: Apply

### Task 2a: Element-Wise Operations

1. Using the `Age` variable from the BES dataset, calculate the age of each respondent rounded down to the nearest multiple of 5. Try writing this both using a defined function and with a `lambda` function.
2. Recode the column `y09` as 0 and 1.
3. Write a function that gets the lower bound from the income bounds reported in column `y01`, and returns it as an integer.


### Task 2b: Grouped Functions

1. Calculate the summary statistics on `Age` for each region, and each region/constituency.
2. Calculate the median income bracket (`y01`) per region and region/constituency.
3. Calculate the most commonly given answer to `a02` per region and region/income bracket.
4. Calculate the most commonly given answer to `a02` and `y06` per region.

In [7]:
df = pd.read_feather("../Week2/data/bes_data_subset_week2.feather")

In [8]:
df['Age'].apply(lambda x: x//5*5) # // integer division

def myround(x, base=5):
    return x//base*base

df['Age'].apply(myround)

0       20.0
1       50.0
2       55.0
3       65.0
4       65.0
        ... 
2189    55.0
2190    45.0
2191    50.0
2192    80.0
2193    85.0
Name: Age, Length: 2194, dtype: float64
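
Because `//` and `*` already work element-wise on a pandas Series, the same rounding can be written without `apply`. A small sketch on made-up ages (the values below are illustrative, not taken from the BES data):

```python
import pandas as pd

# Hypothetical ages standing in for df['Age']
ages = pd.Series([23.0, 50.0, 59.0, 67.0])

# Floor-divide by 5, then multiply back up: rounds down to a multiple of 5
rounded = ages // 5 * 5
print(rounded.tolist())  # [20.0, 50.0, 55.0, 65.0]
```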

In [9]:
df['y09'].apply(lambda x: int(x=='Female'))

0       1
1       0
2       0
3       1
4       1
       ..
2189    1
2190    1
2191    1
2192    1
2193    0
Name: y09, Length: 2194, dtype: category
Categories (2, int64): [0 < 1]
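
The same recoding can also be expressed as a vectorised comparison followed by a cast to integer. The sketch below uses a toy series; the label `'Male'` is an assumption, since only `'Female'` appears in the code above.

```python
import pandas as pd

# Toy stand-in for df['y09']; 'Male' is an assumed label
gender = pd.Series(['Female', 'Male', 'Male', 'Female'])

# Comparison yields booleans; astype(int) turns True/False into 1/0
is_female = (gender == 'Female').astype(int)
print(is_female.tolist())  # [1, 0, 0, 1]
```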

In [10]:
df['y01'].unique().tolist()

['GBP 5,200 - GBP 10,399',
 'GBP 2,600 - GBP 5,199',
 'GBP 36,400 - GBP 39,999',
 'GBP 40,000 - GBP 44,999',
 'Don`t know',
 'GBP 10,400 - GBP 15,599',
 'GBP 50,000 - GBP 59,999',
 'GBP 31,200 - GBP 36,399',
 'GBP 26,000 - GBP 31,199',
 'GBP 60,000 - GBP 74,999',
 'GBP 15,600 - GBP 20,799',
 'Refused',
 'GBP 75,000 - GBP 99,999',
 'GBP 45,000 - GBP 49,999',
 'GBP 100,000 or more',
 'GBP 20,800 - GBP 25,999',
 'Under GBP 2,600']

In [11]:
def get_lower_income_bound(x):
    if x == 'Under GBP 2,600':
        return 0
    elif x == 'GBP 100,000 or more':
        return 100000
    elif x in ['Don`t know', 'Refused']:
        return float('nan') # pd.np was removed in pandas 2.0
    else:
        return int(x.split(" - ")[0].split("GBP ")[1].replace(",", ""))

df['y01'].apply(get_lower_income_bound)

0        5200.0
1        2600.0
2        5200.0
3       36400.0
4       40000.0
         ...   
2189    60000.0
2190    75000.0
2191     5200.0
2192    15600.0
2193    45000.0
Name: y01, Length: 2194, dtype: float64
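
An alternative to chaining `split` calls is to pull the first amount out with a regular expression. The sketch below, under the hypothetical name `lower_bound_regex`, returns `None` for non-numeric answers; it is one possible variant, not the original solution.

```python
import re

def lower_bound_regex(label):
    """Extracts the lower income bound from a y01-style label."""
    if label.startswith('Under'):
        return 0  # 'Under GBP 2,600' has no stated lower bound
    match = re.search(r'GBP ([\d,]+)', label)  # first amount in the label
    if match is None:
        return None  # e.g. 'Refused' or 'Don`t know'
    return int(match.group(1).replace(',', ''))

print(lower_bound_regex('GBP 5,200 - GBP 10,399'))  # 5200
print(lower_bound_regex('GBP 100,000 or more'))     # 100000
```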

In [12]:
df.groupby(['region'])['Age'].describe()

| region | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| East Midlands | 155.0 | 54.903226 | 17.222295 | 19.0 | 42.5 | 57.0 | 66.0 | 94.0 |
| Eastern | 226.0 | 54.070796 | 18.42955 | 18.0 | 41.0 | 55.0 | 68.0 | 96.0 |
| London | 203.0 | 46.896552 | 18.675821 | 18.0 | 32.0 | 41.0 | 61.0 | 89.0 |
| North East | 112.0 | 54.276786 | 20.313405 | 20.0 | 36.0 | 55.5 | 69.25 | 91.0 |
| North West | 304.0 | 51.388158 | 17.946216 | 18.0 | 37.0 | 50.0 | 67.0 | 95.0 |
| Scotland | 191.0 | 53.109948 | 16.996701 | 18.0 | 40.5 | 54.0 | 65.0 | 97.0 |
| South East | 282.0 | 51.971631 | 18.33591 | 18.0 | 36.0 | 52.0 | 67.0 | 91.0 |
| South West | 166.0 | 54.560241 | 19.453892 | 19.0 | 39.0 | 56.5 | 70.0 | 99.0 |
| Wales | 126.0 | 51.269841 | 20.510061 | 18.0 | 33.25 | 51.5 | 67.75 | 89.0 |
| West Midlands | 226.0 | 54.451327 | 17.967126 | 18.0 | 41.0 | 56.0 | 68.75 | 95.0 |


In [13]:
df.groupby(['region', 'Constit_Code'])['Age'].describe()

| region | Constit_Code | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| East Midlands | Ashfield | 9.0 | 56.888889 | 18.923824 | 21.0 | 53.00 | 65.0 | 66.00 | 83.0 |
| East Midlands | Bassetlaw | 10.0 | 46.000000 | 23.598493 | 23.0 | 28.75 | 39.0 | 54.75 | 93.0 |
| East Midlands | Bolsover | 8.0 | 50.375000 | 12.872312 | 27.0 | 43.00 | 53.0 | 58.75 | 65.0 |
| East Midlands | Broxtowe | 6.0 | 55.833333 | 12.221566 | 35.0 | 50.75 | 59.0 | 65.00 | 67.0 |
| East Midlands | Charnwood | 11.0 | 60.818182 | 14.647991 | 36.0 | 51.50 | 62.0 | 70.00 | 80.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Yorkshire & Humber | Sheffield | 12.0 | 44.000000 | 16.814496 | 19.0 | 29.00 | 44.5 | 56.00 | 71.0 |
| Yorkshire & Humber | Sheffield, | 26.0 | 55.038462 | 18.287659 | 22.0 | 42.50 | 51.0 | 70.50 | 90.0 |
| Yorkshire & Humber | Skipton an | 9.0 | 54.444444 | 23.553721 | 22.0 | 34.00 | 54.0 | 73.00 | 84.0 |
| Yorkshire & Humber | York Centr | 9.0 | 52.777778 | 19.936009 | 19.0 | 44.00 | 54.0 | 67.00 | 77.0 |


In [14]:
df['lower_income_bound'] = df['y01'].apply(get_lower_income_bound)
df.groupby(['region'])['lower_income_bound'].median()
df.groupby(['region', 'Constit_Code'])['lower_income_bound'].median()

region              Constit_Code
East Midlands       Ashfield         7800.0
                    Bassetlaw       33800.0
                    Bolsover        31200.0
                    Broxtowe        31200.0
                    Charnwood       55000.0
                                     ...   
Yorkshire & Humber  Sheffield       28600.0
                    Sheffield,      26000.0
                    Skipton an      15600.0
                    York Centr      20800.0
                    York Outer      40700.0
Name: lower_income_bound, Length: 218, dtype: float64

In [15]:
df.groupby(['region'])['a02'].apply(lambda x: pd.Series.mode(x)[0])
df.groupby(['region', 'Constit_Code'])['a02'].apply(lambda x: pd.Series.mode(x)[0])

region              Constit_Code
East Midlands       Ashfield           Don`t know
                    Bassetlaw       None/No party
                    Bolsover        None/No party
                    Broxtowe        None/No party
                    Charnwood       Conservatives
                                        ...      
Yorkshire & Humber  Sheffield              Labour
                    Sheffield,      Conservatives
                    Skipton an         Don`t know
                    York Centr         Don`t know
                    York Outer      Conservatives
Name: a02, Length: 218, dtype: object
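
Item 4 of Task 2b asks for the most common answer to two columns at once. One way to approach it is to select both columns after the `groupby` and aggregate each with a mode helper. The sketch below runs on a small made-up frame; the region names, the `y06` answers, and the helper name `most_common` are all illustrative assumptions, not BES data.

```python
import pandas as pd

# Toy stand-in for the BES data; all values are invented for illustration
toy = pd.DataFrame({
    "region": ["North", "North", "North", "South", "South"],
    "a02": ["Labour", "Labour", "Conservatives", "Conservatives", "Conservatives"],
    "y06": ["Yes", "No", "No", "Yes", "Yes"],
})

def most_common(x):
    """Returns the first mode of a Series (ties are broken by sort order)."""
    return x.mode().iloc[0]

# agg applies most_common to each selected column within each region
modes = toy.groupby("region")[["a02", "y06"]].agg(most_common)
print(modes)
```

On the real data the same pattern would be `df.groupby('region')[['a02', 'y06']].agg(most_common)`, assuming every region has at least one non-missing answer in each column.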