## Intro

My friend is learning how to program with pandas. During the pandemic, he has been running his classes online, and the output is a log of in and out times for the students. Some students have connection issues that make them clock in and out a lot, but they are still in class most of the time. Other students skip class altogether. Can we use pandas to figure out how long each student was in class?


In [2]:
import pandas as pd
import numpy as np

## Setup
First we are going to generate the data. If you are just learning pandas and what I am doing below looks a little intense, don't worry about it. Basically, we are creating 4 students with random in/out times. Then we shuffle the order so that the times and IDs need to be sorted later.

In [3]:
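frames = []

for i in range(4):
    # Pick an even number of timestamps so every clock-in has a clock-out
    num_times = np.random.randint(low=1, high=6) * 2
    # np.float was removed in NumPy 1.24; the builtin float does the same job
    inout = np.arange(num_times, dtype=float)
    inout += np.random.uniform(0, 30, num_times)
    inout.sort()
    frames.append(pd.DataFrame({'student_id': i, 'timestamp': inout}))

# DataFrame.append was removed in pandas 2.0, so build the log with pd.concat
student_log = pd.concat(frames, ignore_index=True)

# Shuffle the rows so the log has to be sorted later
student_log = student_log.sample(frac=1.0).reset_index(drop=True)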

<!-- TEASER_END -->

In [4]:
student_log

| | student_id | timestamp |
|---|---|---|
| 0 | 1 | 26.707855 |
| 1 | 2 | 17.490896 |
| 2 | 2 | 7.517338 |
| 3 | 0 | 9.278458 |
| 4 | 0 | 11.517678 |
| 5 | 0 | 1.804287 |
| 6 | 0 | 29.014598 |
| 7 | 1 | 23.594506 |
| 8 | 0 | 1.801327 |
| 9 | 0 | 13.595199 |


## Data Clean

Now that we have the data, we should get it in the proper order. I am going to sort by `student_id` and then by `timestamp`. Then we can calculate how long each student was in class.

In [5]:
student_log = student_log.sort_values(['student_id', 'timestamp'])
student_log

| | student_id | timestamp |
|---|---|---|
| 8 | 0 | 1.801327 |
| 5 | 0 | 1.804287 |
| 3 | 0 | 9.278458 |
| 4 | 0 | 11.517678 |
| 17 | 0 | 12.619402 |
| 9 | 0 | 13.595199 |
| 13 | 0 | 20.954802 |
| 6 | 0 | 29.014598 |
| 7 | 1 | 23.594506 |
| 0 | 1 | 26.707855 |


The index is out of order, but that is OK. You can reset the index if needed but we are going to ignore it for now.
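If you do want a clean index after sorting, one option is sketched below. The small `demo` frame is made-up data for illustration, not the class log. Both `reset_index(drop=True)` and the `ignore_index=True` argument of `sort_values` are standard pandas.

```python
import pandas as pd

# Hypothetical mini log, shuffled on purpose (not the data from this post)
demo = pd.DataFrame({
    'student_id': [1, 0, 1, 0],
    'timestamp': [8.0, 5.0, 2.0, 1.0],
})

# Option 1: sort, then throw away the old index
sorted_demo = demo.sort_values(['student_id', 'timestamp']).reset_index(drop=True)

# Option 2: let sort_values renumber the rows itself (pandas 1.0 or newer)
sorted_demo2 = demo.sort_values(['student_id', 'timestamp'], ignore_index=True)

print(sorted_demo)
```

Either way the rows come out numbered 0, 1, 2, ... in sorted order. Nothing later in this post depends on the index, so this is purely cosmetic.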

## Aggregate the data

Now we can figure out how long each student was in class. We need to diff consecutive rows to find out how much time passed between one record and the next for each student.

In [6]:
student_log.diff().head(10)

| | student_id | timestamp |
|---|---|---|
| 8 | NaN | NaN |
| 5 | 0.0 | 0.00296 |
| 3 | 0.0 | 7.474172 |
| 4 | 0.0 | 2.23922 |
| 17 | 0.0 | 1.101724 |
| 9 | 0.0 | 0.975797 |
| 13 | 0.0 | 7.359603 |
| 6 | 0.0 | 8.059797 |
| 7 | 1.0 | -5.420092 |
| 0 | 0.0 | 3.113348 |


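Oops, the diff runs straight across the boundary between students: the row where we switch from student 0 to student 1 shows a meaningless jump of -5.42. We don't want that, nor do we want to diff the student ID. Grouping by `student_id` before diffing avoids both problems.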

In [7]:
# Diff the timestamps within each student so no difference crosses a student boundary
student_log['diff'] = student_log.groupby('student_id')['timestamp'].diff(periods=1)
student_log.groupby('student_id').sum()

| student_id | timestamp | diff |
|---|---|---|
| 0 | 100.585752 | 27.213271 |
| 1 | 50.302361 | 3.113348 |
| 2 | 126.539852 | 31.714492 |
| 3 | 69.41785 | 20.6639 |


In [8]:
# or simply select the diff column before summing
student_log.groupby('student_id')['diff'].sum()

student_id
0    27.213271
1     3.113348
2    31.714492
3    20.663900
Name: diff, dtype: float64
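
One thing worth checking: within a student, the sum of every consecutive difference telescopes to the last timestamp minus the first, so it also counts the gaps between a clock-out and the next clock-in. If you only want the time between each clock-in and its clock-out, a minimal sketch is below. It assumes the rows are already sorted and strictly alternate in, out, in, out for each student. The `demo` data and the `position`, `is_out`, `time_in_class`, and `span` names are made up for illustration.

```python
import pandas as pd

# Hypothetical log: student 0 is in class 1-3 and 6-10, student 1 is in class 2-5
demo = pd.DataFrame({
    'student_id': [0, 0, 0, 0, 1, 1],
    'timestamp': [1.0, 3.0, 6.0, 10.0, 2.0, 5.0],
}).sort_values(['student_id', 'timestamp'])

# Time since the previous record of the same student
demo['diff'] = demo.groupby('student_id')['timestamp'].diff()

# Position of each record within its student: 0 = in, 1 = out, 2 = in, ...
position = demo.groupby('student_id').cumcount()
is_out = position % 2 == 1

# Keep only the differences that end on a clock-out (in -> out intervals)
time_in_class = demo.loc[is_out].groupby('student_id')['diff'].sum()

# Summing every diff gives last minus first instead
span = demo.groupby('student_id')['diff'].sum()

print(time_in_class)
print(span)
```

In this toy log, student 0 gets 6.0 from the paired intervals but 9.0 from the full span, because the gap between 3 and 6 is included in the second number. Which of the two you want depends on whether short disconnects should count as class time.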

## Conclusion

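This gives us one number per student, and we can see who was in class for how long. One caveat: summing every difference within a student adds up to the last timestamp minus the first, so the time between a clock-out and the next clock-in is counted too. I have also made some assumptions here: that there are always in/out pairs, and that there are no double "in" records caused by funny networking issues. In a future post I might think about addressing these other issues.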
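
A quick sanity check for the in/out pairs assumption could look like the sketch below: count the records per student and flag any odd count, since an odd count cannot be split into in/out pairs. The `demo` frame and the `counts` and `unpaired` names are hypothetical, and this check does not detect double "in" records.

```python
import pandas as pd

# Hypothetical log where student 1 has a clock-in without a clock-out
demo = pd.DataFrame({
    'student_id': [0, 0, 1, 1, 1],
    'timestamp': [1.0, 4.0, 2.0, 3.0, 7.0],
})

# Number of records per student
counts = demo.groupby('student_id')['timestamp'].count()

# An odd count cannot be split into in/out pairs
unpaired = counts[counts % 2 == 1].index.tolist()

print(unpaired)
```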

### Summary
We have used pandas functions to calculate how many minutes each student was in class. We can now see who we should mark as absent. Thanks for reading, and let me know if you have any comments!