## Intro

My friend is learning how to program with pandas. During the pandemic, he has been running classes online, and the output he gets is a log of clock-in and clock-out times for the students. Some students have connection issues that make them clock in and out a lot, but they are still in class most of the time. Other students skip class altogether. Can we use pandas to figure out how long each student was in class?


In [1]:
import pandas as pd
import numpy as np

## Setup
First we are going to generate the data. If you are just learning pandas and what I am doing below looks a little intense, don't worry about it. Basically, we are creating 4 students with random in/out times. Then we shuffle the rows so that the times and IDs need to be sorted later.

In [37]:
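frames = []

for i in range(4):
    # An even number of timestamps, so every clock-in has a matching clock-out
    num_pairs = np.random.randint(low=1, high=6)*2
    # np.float was removed from NumPy; the builtin float is equivalent
    inout = np.arange(num_pairs, dtype=float)
    inout += np.random.uniform(0,30,num_pairs)
    inout.sort()
    frames.append(pd.DataFrame({'student_id': i, 'timestamp': inout}))

# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
student_log = pd.concat(frames, ignore_index=True)

# Shuffle the rows so the log arrives out of order
student_log = student_log.sample(frac=1.0).reset_index(drop=True)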

In [38]:
student_log

| | student_id | timestamp |
|---|---|---|
| 0 | 2 | 20.302406 |
| 1 | 1 | 10.067449 |
| 2 | 2 | 24.856066 |
| 3 | 3 | 21.860119 |
| 4 | 0 | 20.329282 |
| 5 | 1 | 27.197078 |
| 6 | 1 | 10.765858 |
| 7 | 1 | 17.504872 |
| 8 | 2 | 25.197883 |
| 9 | 1 | 11.399453 |


## Data Clean

Now that we have the data, we should get it in the proper order. I am going to sort by `student_id`, then by `timestamp`. Then we can calculate how long each student was in class.

In [40]:
student_log = student_log.sort_values(['student_id', 'timestamp'])
student_log

| | student_id | timestamp |
|---|---|---|
| 16 | 0 | 3.655195 |
| 25 | 0 | 8.933129 |
| 20 | 0 | 9.405293 |
| 15 | 0 | 9.946995 |
| 21 | 0 | 10.825147 |
| 4 | 0 | 20.329282 |
| 10 | 1 | 9.715836 |
| 1 | 1 | 10.067449 |
| 6 | 1 | 10.765858 |
| 9 | 1 | 11.399453 |


The index is out of order, but that is OK. You can reset the index if needed but we are going to ignore it for now.
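If you do want a tidy index, a minimal sketch looks like the following. The three-row `log` here is made up for illustration and is not the notebook's data.

```python
import pandas as pd

# Hypothetical three-row log, deliberately out of order
log = pd.DataFrame({'student_id': [1, 0, 0], 'timestamp': [10.1, 9.4, 3.7]})

# Sorting moves the rows but keeps their original index labels
sorted_log = log.sort_values(['student_id', 'timestamp'])

# drop=True discards the old labels instead of keeping them as a new column
tidy = sorted_log.reset_index(drop=True)

print(list(sorted_log.index))
print(list(tidy.index))
```

Without `drop=True`, the old labels would be kept as an extra column named `index`.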

## Aggregate the data

Now we can work out how long each student was in class. We need to diff the rows to find the time between each clock event and the one before it.

In [43]:
student_log.diff().head(10)

| | student_id | timestamp |
|---|---|---|
| 16 | NaN | NaN |
| 25 | 0.0 | 5.277934 |
| 20 | 0.0 | 0.472164 |
| 15 | 0.0 | 0.541702 |
| 21 | 0.0 | 0.878151 |
| 4 | 0.0 | 9.504136 |
| 10 | 1.0 | -10.613446 |
| 1 | 0.0 | 0.351613 |
| 6 | 0.0 | 0.698409 |
| 9 | 0.0 | 0.633594 |


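Oops, we get a big jump when we switch students: the row where student 1 starts shows a difference of -10.613446, because it is compared against student 0's last timestamp. We don't want to diff across students, nor do we want to diff the student ID. Grouping by `student_id` before diffing fixes both.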

In [54]:
# Diff only the timestamp column, within each student
student_log['diff'] = student_log.groupby('student_id')['timestamp'].diff(periods=1)
student_log.groupby('student_id').sum()

| student_id | timestamp | diff |
|---|---|---|
| 0 | 63.095041 | 16.674087 |
| 1 | 125.28042 | 17.481242 |
| 2 | 169.642426 | 27.468956 |
| 3 | 145.355685 | 32.663703 |


In [55]:
# or simply
student_log.groupby('student_id')['diff'].sum()

student_id
0    16.674087
1    17.481242
2    27.468956
3    32.663703
Name: diff, dtype: float64
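One thing to watch: adding up every difference for a student collapses to the last timestamp minus the first, so the totals above also count the gaps between a clock-out and the next clock-in. Here is a sketch of one way to count only the clocked-in intervals. It assumes each student's sorted events strictly alternate in, out, in, out. It uses student 0's timestamps from the sorted table, plus a made-up student 9 for comparison.

```python
import pandas as pd

# Student 0's timestamps from the sorted table; student 9 is hypothetical
log = pd.DataFrame({
    'student_id': [0] * 6 + [9] * 2,
    'timestamp': [3.655195, 8.933129, 9.405293, 9.946995,
                  10.825147, 20.329282, 1.0, 4.0],
})
log = log.sort_values(['student_id', 'timestamp'])

# Position of each event within its student: 0 = first in, 1 = first out, ...
position = log.groupby('student_id').cumcount()

# Time since the previous event for the same student
gap = log.groupby('student_id')['timestamp'].diff()

# Odd positions are clock-outs, so only their gaps are time spent in class
in_class = gap.where(position % 2 == 1).groupby(log['student_id']).sum()

# Summing every gap gives last timestamp minus first, as in the cells above
span = gap.groupby(log['student_id']).sum()

print(in_class)
print(span)
```

For student 0 the span comes out near 16.67, matching the table above, while the in-class total is smaller (about 15.32) because the two logged-out gaps are left out.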

## Conclusion

We have used pandas functions to calculate, in minutes, the span between each student's first clock-in and last clock-out, which is an upper bound on how long they were in class. We can now see who we should mark as absent. Thanks for reading, and let me know if you have any comments!