<a href="https://colab.research.google.com/github/pathwaycom/pathway-examples/blob/main/tutorials/suspicious_user_activity.ipynb" target="_parent"><img src="https://pathway.com/assets/colab-badge.svg" alt="Run In Colab" class="inline"/></a>

# Installing Pathway with Python 3.8+

In the cell below, we install Pathway into a Python 3.8+ Linux runtime.

> **If you are running in Google Colab, please run the colab notebook (Ctrl+F9)**, disregarding the 'not authored by Google' warning.
> 
> **The installation and loading time is less than 1 minute**.


In [None]:
%%capture --no-display
!pip install --extra-index-url https://packages.pathway.com/966431ef6ba pathway

# Detecting suspicious user activity with Tumbling Window group-by

Your task is to detect suspicious login activity: many failed login attempts from the same IP address within a short period of time.
The main ingredient is grouping over a tumbling window.
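
A tumbling window splits the time axis into fixed-size, non-overlapping intervals, so every event falls into exactly one window. The sketch below illustrates the idea in plain Python, independently of Pathway. The helper name `window_start`, the 60-second window size, and the sample timestamps are assumptions chosen for illustration.

```python
WINDOW_SECONDS = 60  # assumed window size: one minute


def window_start(timestamp: int, size: int = WINDOW_SECONDS) -> int:
    """Return the start of the tumbling window that contains `timestamp`."""
    return timestamp - timestamp % size


# Hypothetical Unix timestamps (in seconds), not taken from logins.csv.
timestamps = [120, 125, 179, 180, 241]
windows = [window_start(t) for t in timestamps]
print(windows)
```

Events that share a window start belong to the same group, which is what the group-by later in this tutorial relies on.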

You have an input data table with the following columns:
* `username`,
* whether the login was `successful`,
* `time` of a login attempt,
* `ip_address` of a login.


First ingest the data.

In [1]:
# Uncomment to download the required files.
# %%capture --no-display
# !wget https://public-pathway-releases.s3.eu-central-1.amazonaws.com/data/suspicious_users_tutorial_logins.csv -O logins.csv

In [2]:
from datetime import datetime

import pathway as pw

logins = pw.csv.read("logins.csv", value_columns=["username", "successful", "time", "ip_address"])

The next two cells turn the `successful` column into a boolean: they compare the value read from the CSV file with the string `"True"` and then declare the type of the resulting column as `bool`.

In [3]:
logins = logins.select(
    *pw.this.without(pw.this.successful),
    successful=(pw.this.successful == "True"),
)

In [4]:
logins = logins.select(
    *pw.this.without(pw.this.successful),
    successful=pw.declare_type(bool, pw.this.successful),
)

Then filter attempts and keep only the unsuccessful ones.

In [5]:
processed = logins.filter(~pw.this.successful)

Now group the remaining attempts by `ip_address` and by login `time` truncated to the minute (that is, ignoring the seconds), so that each minute acts as one tumbling window.

In [6]:
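by_minutes = processed.select(
    pw.this.ip_address,
    # Convert the Unix timestamp to an ISO 8601 string and zero out the seconds,
    # so that all attempts within the same minute share one `time` value.
    time=pw.apply(
        lambda timestamp_str: datetime.fromtimestamp(int(timestamp_str)).isoformat()[:-2] + "00",
        pw.this.time,
    ),
)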
grouped_by_minutes = by_minutes.groupby(pw.this.time, pw.this.ip_address)
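
To see what the lambda above produces, the same truncation can be tried on a single value outside Pathway. This is a minimal sketch: the helper name `truncate_to_minute` and the input value are made up for illustration, and the conversion is pinned to UTC so that the result is reproducible, whereas the cell above uses the machine's local time zone.

```python
from datetime import datetime, timezone


def truncate_to_minute(timestamp_str: str) -> str:
    """Format a Unix timestamp as ISO 8601 with the seconds set to 00."""
    moment = datetime.fromtimestamp(int(timestamp_str), tz=timezone.utc)
    # Drop the time zone so the output has the same shape as in the cell above.
    return moment.replace(second=0, tzinfo=None).isoformat()


example = "1545733845"  # hypothetical input value
print(truncate_to_minute(example))
```

Replacing the seconds on the `datetime` object avoids slicing the string, but for whole-second timestamps both approaches should give strings of the same shape.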

The next step is to count the logins...

In [7]:
logins_counted = grouped_by_minutes.reduce(
    by_minutes.time, by_minutes.ip_address, count=pw.reducers.count(by_minutes.id)
)

...and to keep only incidents where the number of failed logins reached the threshold (here, at least 5).

In [8]:
suspicious_logins = logins_counted.filter(pw.this.count >= 5)
pw.debug.compute_and_print(suspicious_logins)

            | time                | ip_address    | count
^QGJQT5M... | 2018-12-25T10:30:00 | 50.37.169.241 | 7
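
For comparison, the whole pipeline (filter failed attempts, bucket by minute and IP address, count, apply the threshold) can be sketched without Pathway using `collections.Counter`. The records below are invented for illustration and do not come from `logins.csv`; the helper names `minute_of` and `suspicious` and the UTC conversion are likewise assumptions.

```python
from collections import Counter
from datetime import datetime, timezone

THRESHOLD = 5  # same threshold as in the filter above


def minute_of(timestamp: int) -> str:
    """Return the ISO 8601 minute (seconds set to 00) containing `timestamp`."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.replace(second=0, tzinfo=None).isoformat()


def suspicious(records, threshold=THRESHOLD):
    """Return {(minute, ip_address): count} for buckets with at least `threshold` failed logins."""
    counts = Counter(
        (minute_of(time), ip_address)
        for _username, successful, time, ip_address in records
        if not successful
    )
    return {key: n for key, n in counts.items() if n >= threshold}


# Hypothetical records: (username, successful, time, ip_address).
records = [("user", False, 1545733800 + i, "10.0.0.1") for i in range(6)]
records += [("user", True, 1545733810, "10.0.0.1")]
records += [("other", False, 1545733800 + i, "10.0.0.2") for i in range(3)]
print(suspicious(records))
```

Only the first address reaches the threshold in this made-up data: the successful login is filtered out, and the second address has just three failed attempts in the window.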
