## Problem 1

### Wide to Long Conversions

I have a dataset of cow-level dairy data that is arranged in a very specific format, namely a __wide format__.

I would like it to be in a __long format__.

How do we solve this problem?

Assume that:
- The data is too big to fit in memory.
- There is a "wide_to_long" function like [this one in pandas](https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html).
- The wide to long conversion cannot be done for all the rows at once without hitting memory problems.

### Data Explanation
This data is from an organization called a __Dairy Herd Improvement Association__ (DHIA). This organization takes monthly measurements (called "tests") for all the dairy cows on a member farm. Dairy cows produce milk in cycles, and the DHIA estimates the "total yield" for each whole cycle from a mathematical formula using the monthly measurements.

Description of the data:
- cow_id: the id of the dairy cow.
- total_yield: total milk yield for that cycle.
- max_segment: the total number of tests taken.
- cycle: the production cycle the row belongs to.
- seg_yield: the calculated yield for that test.
- seg_stage: the stage of production cycle.
- seg_time: the date of the test.

### Objective: Make Data1 into Data2

Rather than actually implementing the conversion, sketch out how you would solve the problem on the provided Jamboards.

In [4]:
data1 = '''
cow_id     , total_yield, max_segment, cycle, seg1_yield, seg1_time, seg1_stage, seg2_yield, seg2_time, seg2_stage, seg3_yield, seg3_time, seg3_stage
1          , 10         , 1          , 1    , 10        , 1/3/2013 , 1         ,           ,          ,           ,           ,          , 
1          , 55         , 2          , 1    , 10        , 1/3/2013 , 1         , 6         , 2/3/2013 , 2         ,           ,          , 
2          , 306        , 3          , 1    , 4         , 3/13/2013, 1         , 4         , 4/13/2013, 2         , 12        , 5/13/2013, 3
2          , 35         , 1          , 2    , 10        , 7/3/2013 , 1         ,           ,          ,           ,           ,          , 
'''

data2 = '''
cow_id     , total_yield, max_segment, cycle, seg_yield , seg_time , seg_stage 
1          , 10         , 1          , 1    , 10        , 1/3/2013 , 1         
1          , 55         , 2          , 1    , 6         , 2/3/2013 , 2         
2          ,            , 3          , 1    , 4         , 3/13/2013, 1         
2          ,            , 3          , 1    , 6         , 4/13/2013, 2         
2          , 306        , 3          , 1    , 12        , 5/13/2013, 3         
2          , 35         , 1          , 2    , 10        , 7/3/2013 , 1               
'''

In [5]:
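import pandas as pd
from io import StringIO

# The CSV text above is padded with spaces, so skip the spaces after each
# comma and strip the padding that remains on the header names.
df1 = pd.read_csv(StringIO(data1), skipinitialspace=True)
df2 = pd.read_csv(StringIO(data2), skipinitialspace=True)
df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()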

In [6]:
df1

Unnamed: 0,cow_id,total_yield,max_segment,cycle,seg1_yield,seg1_time,seg1_stage,seg2_yield,seg2_time,seg2_stage,seg3_yield,seg3_time,seg3_stage
0,1,10,1,1,10,1/3/2013,1,,,,,,
1,1,55,2,1,10,1/3/2013,1,6.0,2/3/2013,2.0,,,
2,2,306,3,1,4,3/13/2013,1,4.0,4/13/2013,2.0,12.0,5/13/2013,3.0
3,2,35,1,2,10,7/3/2013,1,,,,,,


In [7]:
df2

Unnamed: 0,cow_id,total_yield,max_segment,cycle,seg_yield,seg_time,seg_stage
0,1,10.0,1,1,10,1/3/2013,1
1,1,55.0,2,1,6,2/3/2013,2
2,2,,3,1,4,3/13/2013,1
3,2,,3,1,6,4/13/2013,2
4,2,306.0,3,1,12,5/13/2013,3
5,2,35.0,1,2,10,7/3/2013,1
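
One possible way to approach Problem 1 is to stream the wide file in chunks, reshape each chunk with `pd.wide_to_long`, and append the result to an output file. The sketch below is only an illustration under several assumptions: the names `chunk_to_long`, `wide_to_long_chunked`, `row_id`, and `segment` are hypothetical, the input is a small in-memory stand-in for a file on disk, and the output is the plain reshape (one row per test taken). It does not try to reproduce every detail of Data2, such as the blank `total_yield` values.

```python
import re
from io import StringIO

import pandas as pd

STUBS = ["seg_yield", "seg_time", "seg_stage"]
ID_COLS = ["cow_id", "total_yield", "max_segment", "cycle"]


def chunk_to_long(chunk: pd.DataFrame) -> pd.DataFrame:
    # wide_to_long expects "<stub><sep><number>", so turn seg1_yield into seg_yield_1.
    chunk = chunk.rename(columns=lambda c: re.sub(r"^seg(\d+)_(\w+)$", r"seg_\2_\1", c))
    # The id columns can repeat across rows, so use the row position as a unique key.
    chunk = chunk.rename_axis("row_id").reset_index()
    long = pd.wide_to_long(chunk, stubnames=STUBS, i="row_id", j="segment", sep="_")
    # Drop the empty slots for tests that were never taken.
    long = long.dropna(subset=["seg_yield"]).reset_index()
    long = long.sort_values(["row_id", "segment"])
    return long[ID_COLS + ["segment"] + STUBS]


def wide_to_long_chunked(source, sink, chunksize=100_000):
    # Read the wide data piece by piece so only one chunk is in memory at a time.
    for n, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # Write the header once, then keep appending rows to the same sink.
        chunk_to_long(chunk).to_csv(sink, header=(n == 0), index=False)


# Tiny stand-in for the real file (assumed layout, two segments only).
wide_csv = """cow_id,total_yield,max_segment,cycle,seg1_yield,seg1_time,seg1_stage,seg2_yield,seg2_time,seg2_stage
1,55,2,1,10,1/3/2013,1,6,2/3/2013,2
2,35,1,2,10,7/3/2013,1,,,
"""

out = StringIO()
wide_to_long_chunked(StringIO(wide_csv), out, chunksize=1)
long_df = pd.read_csv(StringIO(out.getvalue()))
```

With a real file, `source` would be a path and `sink` an open file handle. The chunk size would be chosen so that the reshaped chunk, which has more rows than the wide chunk, still fits in memory.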


## Problem 2

Given an array of 1s and 2s, suppose that I need to count how many times the value switches from 1 to 2. How many such switches happen in this array?

In [8]:
import numpy as np
np.random.seed(444)

x = np.random.choice([1, 2], size=100000)

For reference, here is the for-loop version:

In [9]:
def count_transitions(x) -> int:
    count = 0
    for i, j in zip(x[:-1], x[1:]):
        if j==2 and i==1:
            count += 1
    return count

count_transitions(x)

24984

How would you make this vectorized and do it in one line?
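
One possible vectorized answer compares the array with a copy of itself shifted by one position, then counts the positions where a 1 is followed by a 2. The function name `count_transitions_vectorized` and the small test array are made up for this sketch.

```python
import numpy as np


def count_transitions_vectorized(x: np.ndarray) -> int:
    # x[:-1] holds each element, x[1:] holds its successor.
    # A switch is a position where the element is 1 and the successor is 2.
    return int(np.count_nonzero((x[:-1] == 1) & (x[1:] == 2)))


small = np.array([1, 2, 2, 1, 1, 2, 1])
n_switches = count_transitions_vectorized(small)
```

Because the array only contains 1s and 2s, `np.count_nonzero(np.diff(x) == 1)` counts the same switches.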

## Problem 3
Given panel data, how can you do a vectorized demeaning by group `i`?

__Hint: use the `stack` function__

In [10]:
panel_data = '''
i,t,value
0,0,4.688025813099681
0,1,5.52269259395655
0,2,3.794489256250384
0,3,3.9649616347982652
0,4,2.4118897054484862
0,5,3.4117301110880547
0,6,5.471047024539948
0,7,2.2243578376592072
0,8,3.0142946545236295
0,9,3.6286363500138106
1,0,20.763821602823516
1,1,21.364728204694217
1,2,16.473117837949424
1,3,22.906416928450746
1,4,20.685359348048078
1,5,23.98534980700863
1,6,17.449606215978182
1,7,24.30998889198093
1,8,19.684197761131074
1,9,19.468066794961956
2,0,13.345220275775793
2,1,13.759424454205883
2,2,15.433405160603295
2,3,7.599836415708792
2,4,12.815756066971403
2,5,9.567004610686734
2,6,13.707836922291087
2,7,5.037831324914107
2,8,9.862822201697297
2,9,17.52951651777798
3,0,2.8745063600488643
3,1,17.500660845021965
3,2,21.60320009500734
3,3,3.615382578465688
3,4,12.067565035781877
3,5,12.34780837405084
3,6,6.555567279947617
3,7,7.524996593472945
3,8,-2.8197131509063347
3,9,-12.42907679575168
'''

In [11]:
sample = pd.read_csv(StringIO(panel_data))

sample

Unnamed: 0,i,t,value
0,0,0,4.688026
1,0,1,5.522693
2,0,2,3.794489
3,0,3,3.964962
4,0,4,2.41189
5,0,5,3.41173
6,0,6,5.471047
7,0,7,2.224358
8,0,8,3.014295
9,0,9,3.628636
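
One way to read the `stack` hint is to pivot the panel into an `i` by `t` matrix with `unstack`, subtract each row's mean, and `stack` the result back into long form. The sketch below uses a small synthetic panel rather than the data above, and the names `panel`, `wide`, and `value_demeaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic balanced panel: 4 groups observed over 10 periods.
rng = np.random.default_rng(0)
panel = pd.DataFrame({
    "i": np.repeat(np.arange(4), 10),
    "t": np.tile(np.arange(10), 4),
    "value": rng.normal(size=40),
})

# Reshape to one row per group i and one column per period t.
wide = panel.set_index(["i", "t"])["value"].unstack("t")

# Subtract each group's mean from its row in a single vectorized step.
demeaned_wide = wide.sub(wide.mean(axis=1), axis=0)

# Stack back into long form with one row per (i, t).
demeaned = demeaned_wide.stack().rename("value_demeaned").reset_index()
```

For comparison, `panel["value"] - panel.groupby("i")["value"].transform("mean")` gives the same numbers without reshaping.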


## Problem 4

Use chunking to calculate the mean and variance of the two columns of this array.

In [None]:
import numpy as np
sample = np.random.normal(loc=[4., 20.], scale=[1., 3.5],
                           size=(100000000, 2))
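
One possible chunked approach keeps a running count, mean, and sum of squared deviations per column, and merges each chunk's statistics into them (the pairwise update usually attributed to Chan et al.). The function name `chunked_mean_var` and the smaller test array are assumptions made for this sketch. The variance returned is the population variance, which matches the default of `np.var`.

```python
import numpy as np


def chunked_mean_var(data: np.ndarray, chunk_size: int = 10_000):
    n = 0                              # rows seen so far
    mean = np.zeros(data.shape[1])     # running column means
    m2 = np.zeros(data.shape[1])       # running sums of squared deviations
    for start in range(0, data.shape[0], chunk_size):
        chunk = data[start:start + chunk_size]
        n_b = chunk.shape[0]
        mean_b = chunk.mean(axis=0)
        m2_b = ((chunk - mean_b) ** 2).sum(axis=0)
        # Merge the chunk's statistics into the running totals.
        delta = mean_b - mean
        total = n + n_b
        mean = mean + delta * n_b / total
        m2 = m2 + m2_b + delta ** 2 * n * n_b / total
        n = total
    return mean, m2 / n


# Smaller array than in the problem so the check runs quickly.
rng = np.random.default_rng(0)
small_sample = rng.normal(loc=[4.0, 20.0], scale=[1.0, 3.5], size=(100_000, 2))
mean, var = chunked_mean_var(small_sample, chunk_size=7_000)
```

Here the chunks are slices of an array that already sits in memory. In a true out-of-core setting each chunk would instead be read from disk or generated on the fly, and the same update would apply.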