# Summarizing Logs by First/Last Seen
I frequently want to know the first-seen and last-seen timestamps for each value (an IP address, for example) across a large amount of log data.

It's easy to summarize this using Pandas.

First, let's create a small set of fake log data. The same approach works on much larger and more complex data sets, as long as each record has a timestamp:

In [2]:
import pandas as pd

# Fake log records: one dict per log line
logs = [
    {'date':'2020-11-01','ip':'10.1.1.1'},
    {'date':'2020-11-01','ip':'10.2.2.2'},
    {'date':'2020-11-01','ip':'10.3.3.3'},
    {'date':'2020-11-01','ip':'10.4.4.4'},
    {'date':'2020-11-01','ip':'10.5.5.5'},
    {'date':'2020-11-01','ip':'10.6.6.6'},
    {'date':'2020-11-01','ip':'10.7.7.7'},
    {'date':'2020-11-01','ip':'10.8.8.8'},
    {'date':'2020-11-01','ip':'10.9.9.9'},
    {'date':'2020-11-02','ip':'10.1.1.1'},
    {'date':'2020-11-02','ip':'10.2.2.2'},
    {'date':'2020-11-03','ip':'10.3.3.3'},
    {'date':'2020-11-03','ip':'10.4.4.4'},
    {'date':'2020-11-03','ip':'10.5.5.5'},
    {'date':'2020-11-04','ip':'10.6.6.6'},
    {'date':'2020-11-04','ip':'10.7.7.7'},
    {'date':'2020-11-05','ip':'10.8.8.8'},
    {'date':'2020-11-07','ip':'10.9.9.9'},
]

In [3]:
df = pd.DataFrame(logs)

In [4]:
df

| | date | ip |
|---|---|---|
| 0 | 2020-11-01 | 10.1.1.1 |
| 1 | 2020-11-01 | 10.2.2.2 |
| 2 | 2020-11-01 | 10.3.3.3 |
| 3 | 2020-11-01 | 10.4.4.4 |
| 4 | 2020-11-01 | 10.5.5.5 |
| 5 | 2020-11-01 | 10.6.6.6 |
| 6 | 2020-11-01 | 10.7.7.7 |
| 7 | 2020-11-01 | 10.8.8.8 |
| 8 | 2020-11-01 | 10.9.9.9 |
| 9 | 2020-11-02 | 10.1.1.1 |
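
The sample dates are ISO-formatted strings (`YYYY-MM-DD`), which happen to sort chronologically even when compared as plain text. Real logs often use other formats, where comparing strings would give the wrong first and last values. Here is a minimal sketch of parsing the column with `pd.to_datetime` first. The `raw` frame and its day/month/year format are hypothetical and only there for illustration:

```python
import pandas as pd

# Hypothetical sample in day/month/year format, where plain string
# comparison would not order the values chronologically.
raw = pd.DataFrame({
    'date': ['02/11/2020', '01/12/2020', '15/11/2020'],
    'ip': ['10.1.1.1', '10.1.1.1', '10.1.1.1'],
})

# Parse the strings into real timestamps. The format string is an
# assumption about how the logs are written; adjust it to match yours.
raw['date'] = pd.to_datetime(raw['date'], format='%d/%m/%Y')

# min/max now compare timestamps rather than text
earliest = raw['date'].min()
latest = raw['date'].max()
```

With the column parsed, the same grouping steps should apply unchanged.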


Next, we group the logs by IP and call `.agg` with `min` and `max` as the aggregation functions.

This gives us the data we want, but the column labels come back as a MultiIndex:

In [9]:
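# Pass the aggregations by name; recent pandas versions warn when given the builtins min and max
df2 = df.groupby('ip').agg({'date':['min','max']}).reset_index()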
df2

| | ip | date | date |
|---|---|---|---|
| | | min | max |
| 0 | 10.1.1.1 | 2020-11-01 | 2020-11-02 |
| 1 | 10.2.2.2 | 2020-11-01 | 2020-11-02 |
| 2 | 10.3.3.3 | 2020-11-01 | 2020-11-03 |
| 3 | 10.4.4.4 | 2020-11-01 | 2020-11-03 |
| 4 | 10.5.5.5 | 2020-11-01 | 2020-11-03 |
| 5 | 10.6.6.6 | 2020-11-01 | 2020-11-04 |
| 6 | 10.7.7.7 | 2020-11-01 | 2020-11-04 |
| 7 | 10.8.8.8 | 2020-11-01 | 2020-11-05 |
| 8 | 10.9.9.9 | 2020-11-01 | 2020-11-07 |
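
To see what the MultiIndex means in practice, it can help to look at the column labels directly. The sketch below uses a shortened copy of the sample data (an assumption made to keep it brief) and shows that each column is addressed by a tuple:

```python
import pandas as pd

# Shortened copy of the sample logs, for illustration only
logs = [
    {'date': '2020-11-01', 'ip': '10.1.1.1'},
    {'date': '2020-11-02', 'ip': '10.1.1.1'},
    {'date': '2020-11-01', 'ip': '10.9.9.9'},
    {'date': '2020-11-07', 'ip': '10.9.9.9'},
]
df = pd.DataFrame(logs)
df2 = df.groupby('ip').agg({'date': ['min', 'max']}).reset_index()

# Each column label is a (top level, second level) tuple
column_labels = df2.columns.tolist()
# → [('ip', ''), ('date', 'min'), ('date', 'max')]

# Selecting a single column therefore needs the full tuple
last_seen = df2[('date', 'max')].tolist()
# → ['2020-11-02', '2020-11-07']
```

Tuple labels like these are awkward to work with, which is why flattening them to plain names is usually worthwhile.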


Rename the columns to flatten the MultiIndex and give them descriptive names:

In [10]:
df2.columns = ['ip','first_seen','last_seen']
df2

| | ip | first_seen | last_seen |
|---|---|---|---|
| 0 | 10.1.1.1 | 2020-11-01 | 2020-11-02 |
| 1 | 10.2.2.2 | 2020-11-01 | 2020-11-02 |
| 2 | 10.3.3.3 | 2020-11-01 | 2020-11-03 |
| 3 | 10.4.4.4 | 2020-11-01 | 2020-11-03 |
| 4 | 10.5.5.5 | 2020-11-01 | 2020-11-03 |
| 5 | 10.6.6.6 | 2020-11-01 | 2020-11-04 |
| 6 | 10.7.7.7 | 2020-11-01 | 2020-11-04 |
| 7 | 10.8.8.8 | 2020-11-01 | 2020-11-05 |
| 8 | 10.9.9.9 | 2020-11-01 | 2020-11-07 |
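
As an alternative to renaming afterwards, pandas also supports named aggregation, which should produce flat, descriptively named columns in one step. This is a sketch using a shortened copy of the sample data (an assumption made to keep it brief), not a replacement for the approach above:

```python
import pandas as pd

# Shortened copy of the sample logs, for illustration only
logs = [
    {'date': '2020-11-01', 'ip': '10.1.1.1'},
    {'date': '2020-11-02', 'ip': '10.1.1.1'},
    {'date': '2020-11-01', 'ip': '10.9.9.9'},
    {'date': '2020-11-07', 'ip': '10.9.9.9'},
]
df = pd.DataFrame(logs)

# Named aggregation: output_column=(input_column, aggregation)
summary = (
    df.groupby('ip')
      .agg(first_seen=('date', 'min'), last_seen=('date', 'max'))
      .reset_index()
)
```

Here `summary` has the plain columns `ip`, `first_seen`, and `last_seen`, so no MultiIndex flattening is needed.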
