# Occupancy Distribution Through Time

Let's look at how occupancy counts are distributed across time and locations.

We'll start by setting up the script.

In [16]:
# The Python interpreter requires this workaround to
# import modules outside of this notebook's directory.
import sys
sys.path.append("..")

In [17]:
# Third-party and standard modules
from typing import Optional
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
import pytz
from importlib import reload

# User-defined modules.
# The reload() function is needed to update modules after
# changes are made to their files.
import dataframe_manip as dfm
dfm = reload(dfm)

In [18]:
def dataframe_info(d: pd.DataFrame) -> str:
    """A basic function for getting dataframe info."""
    return """
    shape: {}
    index datatype: {}
    na value count: {}
    """.format(
        str(d.shape),
        d.index.dtype,
        d.isna().sum().sum()
    )

## The main dataframe: `occupancy`

We'll mainly work with a single dataframe named `occupancy`, created from the provided CSV file.

`occupancy` is a time-series dataframe. Each of its rows represents a point in time. Each of its columns represents a WiFi access point.

In [19]:
filepath = './wifi_data_until_20190204.csv'
timezone = pytz.timezone('US/Pacific')

occupancy: pd.DataFrame = dfm.csv_to_timeseries_df(
    filepath=filepath, timezone=timezone
)

print(dataframe_info(occupancy))

TypeError: Already tz-aware, use tz_convert to convert.
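The `TypeError` above is the message pandas raises when `tz_localize` is called on a datetime index that already carries a timezone. The source of `dfm.csv_to_timeseries_df` isn't shown here, so the following is only a sketch of one way such a conversion could be guarded. It assumes the problem is an unconditional `tz_localize` call. The helper name `to_timezone` and the sample timestamps are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps; the real CSV's format is not shown in this notebook.
naive = pd.DatetimeIndex(["2019-02-04 08:00", "2019-02-04 09:00"])
aware = pd.DatetimeIndex(["2019-02-04 16:00", "2019-02-04 17:00"], tz="UTC")


def to_timezone(index: pd.DatetimeIndex, tz: str) -> pd.DatetimeIndex:
    """Attach `tz` to a naive index, or convert an already tz-aware one."""
    if index.tz is None:
        # Naive timestamps: interpret them as wall-clock times in `tz`.
        return index.tz_localize(tz)
    # Aware timestamps: tz_localize would raise TypeError, so convert instead.
    return index.tz_convert(tz)


naive_pacific = to_timezone(naive, "US/Pacific")
aware_pacific = to_timezone(aware, "US/Pacific")
```

Whether the timestamps in the CSV should be localized or converted depends on how they were recorded, which this sketch doesn't attempt to decide.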

## Occupancy by the hour

Let's start by finding out how total connectivity (across all access points) varies by the hour.

We'll reduce `occupancy` to the `total_occupancy_vs_time` dataframe. Like `occupancy`, each of its rows represents a point in time. However, it has only one column, which holds the total across all access points.

In [None]:
total_occupancy_vs_time = dfm.row_totals(occupancy)

print(dataframe_info(
    total_occupancy_vs_time
))
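For readers without access to `dataframe_manip`, the reduction can be approximated in plain pandas. This is a minimal sketch that assumes `dfm.row_totals` amounts to a row-wise sum; its actual implementation isn't shown here, and the toy access point names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `occupancy`: rows are points in time, columns are access points.
toy = pd.DataFrame(
    {"ap_1": [3, 5, np.nan], "ap_2": [2, np.nan, np.nan]},
    index=pd.to_datetime(
        ["2019-02-04 08:00", "2019-02-04 09:00", "2019-02-04 10:00"]
    ),
)

# Row-wise sum. With the default skipna=True, missing values count as 0,
# so a row with no data at all yields a total of 0.
totals = toy.sum(axis=1)

# min_count=1 keeps a row's total as NaN when every access point is missing.
totals_strict = toy.sum(axis=1, min_count=1)
```

The choice matters for the boxplots: treating an all-missing row as zero occupancy pulls the lower end of the distribution down, while keeping it as NaN leaves that row out.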

Let's create 24 boxplots, one per hour of the day, which show how the distribution of **occupancy** (as measured by the number of devices connected to access points) *throughout all buildings* varies hour by hour.

In [None]:
fig, ax = plt.subplots(figsize=(24, 15))

seaborn.boxplot(
    # The hours (a number [0,23]) for each row.
    x=total_occupancy_vs_time.index.hour,
    # The total occupancy at that hour.
    y=total_occupancy_vs_time,
    ax=ax
)

It's immediately clear that total occupancy depends strongly on the hour of the day. The transitions between hours are smooth (i.e. there are no abrupt jumps from one hour to the next) but steep.

The total **peaks at 22:00 (10:00 pm)** and is **lowest around 13:00 (1:00 pm) or 14:00 (2:00 pm)**.

Interestingly, the **interquartile range** (i.e. the height of each box) of occupancy also **peaks around midnight** and is **lowest around 13:00 or 14:00**. The upper quartile (75th percentile) varies far more across the day than the lower quartile or the median.
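The quartiles read off the boxplots can also be computed directly. The sketch below uses synthetic data in place of `total_occupancy_vs_time` (the values are random and say nothing about the real dataset) and assumes the totals are available as a pandas Series with a datetime index.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for total_occupancy_vs_time: two weeks of hourly totals.
rng = np.random.default_rng(0)
index = pd.date_range("2019-01-01", periods=24 * 14, freq="60min")
series = pd.Series(rng.integers(0, 1000, size=len(index)), index=index)

# Group observations by hour of day and compute the three quartiles per hour.
quartiles = (
    series.groupby(series.index.hour)
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# The interquartile range is the height of each box in the boxplot.
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
```

Calling `quartiles["iqr"].idxmax()` and `quartiles["iqr"].idxmin()` on the real data would give the hours with the widest and narrowest boxes, which is a way to check the visual reading above.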

### Questions and hypotheses

Note that this data comes from a college.

#### Why are there far more devices at night than during the day?

It's possible that the number of devices connected to WiFi is higher at night because students have no classes, are in their study areas (e.g. dormitories, library), and are on their laptops, which don't connect to WiFi when asleep.

The extreme changes (in the thousands) could also be due to missing values in the dataset. Many of our access points have missing data points, so a device connected in one building in the afternoon may have gone uncounted until it moved to another building at night.
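One way to probe the missing-data hypothesis is to measure what share of access point readings is missing at each hour of the day. The following is a sketch on toy data with a deliberately planted afternoon gap; the column names are hypothetical, and on the real data `toy` would be replaced by `occupancy`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `occupancy`: two access points over two days of hourly readings.
index = pd.date_range("2019-02-04", periods=48, freq="60min")
toy = pd.DataFrame({"ap_1": 1.0, "ap_2": 1.0}, index=index)

# Plant a gap: ap_1 reports nothing during the 13:00 hour.
toy.loc[toy.index.hour == 13, "ap_1"] = np.nan

# Fraction of access points missing at each timestamp, averaged per hour of day.
missing_share_by_hour = (
    toy.isna().mean(axis=1).groupby(toy.index.hour).mean()
)
```

If the real data showed a much higher missing share in the afternoon than at night, that would support the idea that part of the day/night difference is a measurement artifact rather than a change in behavior.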

#### Why is the distribution lowest around 13:30?

It's worth noting that 12:00 is not when we should expect occupancy to be at its lowest. Classes generally run from 9:00 to 18:00, and the midpoint of that window is 13:30, so 13:30 is a reasonable expectation for the distribution's lowest point.

#### Why are changes from 1:30 to 13:30 more gradual than those from 13:30 to 1:30?

This may be because there's greater variation in sleep time than variation in the time that students start studying at night.

#### Why is there much more variation in the upper quartile than in the other quartiles?

This may be due to a fixed/stable number of devices which don't vary much throughout the day. These could be public devices (e.g. library computers) or staff devices (e.g. staff phones). It's worth noting that the minimum doesn't change much throughout the day.

In [None]:
# acpt is short for 'access point'
stats_per_acpt: pd.DataFrame = pd.DataFrame.from_dict({
    'total':
    dfm.column_totals(occupancy),
    'mean':
    dfm.column_means(occupancy, skipna=True),
    'mean, skipna=False':
    dfm.column_means(occupancy, skipna=False),
    'median':
    dfm.column_medians(occupancy, skipna=True),
    'median, skipna=False':
    dfm.column_medians(occupancy, skipna=False)
})

print(dataframe_info(
    stats_per_acpt
))
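The `skipna` variants above matter because of the missing values discussed earlier. As a sketch, assuming `dfm.column_means` and `dfm.column_medians` behave like pandas' `DataFrame.mean` and `DataFrame.median` (their source isn't shown here), the difference looks like this on a toy frame with hypothetical access point names:

```python
import numpy as np
import pandas as pd

# One access point with complete data, one with a missing reading.
toy = pd.DataFrame({
    "ap_full": [2.0, 4.0, 6.0],
    "ap_gappy": [2.0, np.nan, 6.0],
})

stats = pd.DataFrame({
    # skipna=True ignores missing readings when aggregating.
    "mean": toy.mean(skipna=True),
    # skipna=False returns NaN for any column that has a missing reading.
    "mean, skipna=False": toy.mean(skipna=False),
    "median": toy.median(skipna=True),
    "median, skipna=False": toy.median(skipna=False),
})
```

Comparing the two variants side by side is a quick way to see which access points have any gaps at all: they are the rows where the `skipna=False` columns are NaN.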