# Exploratory Data Analysis
------

## What Kind of Data do We Have

As we saw in the previous notebook, we are working with image data; see the examples below. The images are saved as JPEG files, and each one is named after the pattern **pic_&lt;timestamp-in-seconds&gt;.jpg**, e.g. **pic_1492067569.jpg**. The images also carry some interesting attributes in their EXIF data, a selection of which is listed further below.

<table>
<tr>
    <td>
        <figure>
            <img src="../data/raw/example/open/pic_1492067569.jpg" width=350/>
            <figcaption>Example 1</figcaption>
        </figure>
    </td>
    <td>
        <figure>
            <img src="../data/raw/example/open/pic_1495255809.jpg" width=350/>
            <figcaption>Example 2</figcaption>
        </figure>
    </td>
</tr>
</table>
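
The naming pattern described above can be decoded with a few lines of Python. The sketch below is only an illustration: the helper name `capture_time_from_filename` is hypothetical, and it assumes that the number in the file name is a Unix timestamp that can be interpreted in UTC (the notebook does not state the camera's time zone).

```python
import re
from datetime import datetime, timezone

# same naming scheme as described above: pic_<timestamp-in-seconds>.jpg
FILENAME_PATTERN = re.compile(r"^pic_(\d{10,})\.jpg$")


def capture_time_from_filename(filename):
    """Return the capture time encoded in a file name, or None if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    # assumption: the number is a Unix timestamp, interpreted here in UTC
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


print(capture_time_from_filename("pic_1492067569.jpg"))
```

Under these assumptions, the first example image was captured on 2017-04-13 at 07:12:49 UTC.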

### EXIF Data

This list contains some, but not all, of the EXIF data found in the images.

<table>
    <tr>
        <th>Attribute</th>
        <th>Example Value</th>
    </tr>
    <tr>
        <td>Image resolution (WxH)</td>
        <td>2592px x 1944px</td>
    </tr>
    <tr>
        <td>Exposure time</td>
        <td>1/51 s</td>
    </tr>
    <tr>
        <td>f-number</td>
        <td>f/2.9</td>
    </tr>
    <tr>
        <td>Shutter speed value</td>
        <td>5.69 EV</td>
    </tr>
    <tr>
        <td>Aperture value</td>
        <td>3.07 EV</td>
    </tr>
    <tr>
        <td>Brightness value</td>
        <td>2.67 EV</td>
    </tr>
    <tr>
        <td>Focal length</td>
        <td>3.6 mm</td>
    </tr>
    <tr>
        <td>ISO speed rating</td>
        <td>100</td>
    </tr>
</table>
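
The EV entries in the table are related to the exposure time and the f-number. As a rough sketch, assuming the values follow the APEX definitions that EXIF uses for these tags (the function names below are hypothetical), they can be reproduced like this:

```python
import math


def shutter_speed_value(exposure_time_s):
    """APEX time value: Tv = -log2(exposure time in seconds)."""
    return -math.log2(exposure_time_s)


def aperture_value(f_number):
    """APEX aperture value: Av = 2 * log2(f-number)."""
    return 2 * math.log2(f_number)


# example values taken from the table above
print(round(shutter_speed_value(1 / 51), 2))
print(round(aperture_value(2.9), 2))
```

For f/2.9 this gives 3.07, matching the table. For 1/51 s it gives 5.67, close to but not identical with the listed 5.69 EV; a plausible explanation is that the displayed exposure time is itself a rounded value.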

### What Else do We Have

#### Directory Structure

Pictures were saved in a separate directory for each capture day, so the directory structure looks something like this:

```
$ tree .
.
├── ...
├── 2017-05-07_auto
├── 2017-05-08_auto
├── 2017-05-09_auto
├── 2017-05-10_auto
├── ...
├── 2017-05-27_auto
├── 2017-05-28_auto
├── 2017-05-29_auto
├── 2017-05-30_auto
├── 2017-05-31_auto
├── ...
├── 2017-08-06_auto
├── 2017-12-24_auto
├── 2017-12-31_auto
│   ├── pic_1514702547.jpg
│   ├── pic_1514702617.jpg
│   ├── pic_1514702727.jpg
│   ├── pic_1514702797.jpg
│   ├── pic_1514702879.jpg
│   ├── ...
└── ...
```
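
Since each directory name starts with an ISO date, it can be parsed and compared with the timestamps of the files it contains. The following is a minimal sketch; `capture_date_from_dirname` is a hypothetical helper, and the comparison assumes that the file timestamps are Unix times interpreted in UTC.

```python
from datetime import date, datetime, timezone


def capture_date_from_dirname(dirname):
    """Parse a directory name such as '2017-12-31_auto' into a date."""
    date_part = dirname.split("_", 1)[0]
    return date.fromisoformat(date_part)


# directory and first file name taken from the listing above
dir_date = capture_date_from_dirname("2017-12-31_auto")
# assumption: the file timestamp is a Unix time, interpreted in UTC
file_date = datetime.fromtimestamp(1514702547, tz=timezone.utc).date()

print(dir_date, file_date, dir_date == file_date)
```

A check like this could be used to spot images that were filed under the wrong day, for example captures close to midnight.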

#### Time Range

As mentioned before, the images were captured over a little more than one year. The first capture date is **2017-03-07** and the last is **2018-04-04**, which means we have images from all four seasons, namely **spring**, **summer**, **autumn** and **winter**, and under different weather conditions; see the following examples.

> Note: Unfortunately, the camera was moved a little bit during this period. We have to live with that, sorry.

<table>
<tr>
    <td>
        <figure>
            <img src="../data/raw/example/time-range/spring-midday-pic_1492083489.jpg" width=350/>
            <figcaption>Spring</figcaption>
        </figure>
    </td>
    <td>
        <figure>
            <img src="../data/raw/example/time-range/summer-midday-pic_1495800549.jpg" width=350/>
            <figcaption>Summer</figcaption>
        </figure>
    </td>
    <td>
        <figure>
            <img src="../data/raw/example/time-range/autumn-midday-pic_1509968123.jpg" width=350/>
            <figcaption>Autumn</figcaption>
        </figure>
    </td>
    <td>
        <figure>
            <img src="../data/raw/example/time-range/winter-midday-pic_1518610019.jpg" width=350/>
            <figcaption>Winter</figcaption>
        </figure>
    </td>
</tr>
</table>
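
The length of the capture period follows directly from the two dates given above. A small sketch using only the standard library:

```python
from datetime import date

# first and last capture date as stated in the text
first_day = date(2017, 3, 7)
last_day = date(2018, 4, 4)

span = last_day - first_day
print("capture period: {} days".format(span.days))
```

This yields 393 days, i.e. roughly 13 months. It says nothing about gaps in between, so the number of days that actually contain images may be lower.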

#### Time Between Images

To get an overview of what we have, we want to determine the mean time between captured images. First, let's list the first *n* files of an image directory.

In [1]:
import os

IMG_DIR = "../data/raw/raw"
FILE_COUNT = 10
DIR_COUNT = 1

# print the first FILE_COUNT files of the first DIR_COUNT subdirectories
# note: os.walk yields IMG_DIR itself first, then its subdirectories
dir_counter = 0
for dirpath, dirnames, filenames in os.walk(IMG_DIR, topdown=True):
    # os.walk returns the files in arbitrary order, so the listing is not sorted
    for file in filenames[:FILE_COUNT]:
        print(os.path.join(dirpath, file))
    # stop after IMG_DIR plus the next DIR_COUNT directories
    dir_counter += 1
    if dir_counter > DIR_COUNT:
        break

../data/raw/raw/2017-05-13_auto/pic_1494657849.jpg
../data/raw/raw/2017-05-13_auto/pic_1494699169.jpg
../data/raw/raw/2017-05-13_auto/pic_1494664149.jpg
../data/raw/raw/2017-05-13_auto/pic_1494648549.jpg
../data/raw/raw/2017-05-13_auto/pic_1494689949.jpg
../data/raw/raw/2017-05-13_auto/pic_1494696649.jpg
../data/raw/raw/2017-05-13_auto/pic_1494646242.jpg
../data/raw/raw/2017-05-13_auto/pic_1494662449.jpg
../data/raw/raw/2017-05-13_auto/pic_1494694309.jpg
../data/raw/raw/2017-05-13_auto/pic_1494692049.jpg


Next, we extract the timestamps from the filenames and calculate statistics for the time between two consecutive captures, such as **mean**, **std** and **quantiles**.

In [2]:
import os
import sys
import re
import numpy as np

IMG_DIR = "../data/raw/raw"
FILE_COUNT = sys.maxsize
SUB_DIR_COUNT = 1

# define a regex to get the timestamp
PATTERN_TIMESTAMP = re.compile(r"^pic_(\d{10,})\.jpg$")

def extract_timestamp_from_filenames(
    path, sub_dir_count = 1,
    file_count = sys.maxsize,
    pattern_timestamp = PATTERN_TIMESTAMP,
    pattern_group = 1
):
    # collect the timestamps as integers
    dir_counter = 0
    timestamps = []
    for dirpath, dirnames, filenames in os.walk(path, topdown=True):
        # os.walk yields <path> itself first, so <path> plus the next
        # <sub_dir_count> directories are analyzed
        if dir_counter > sub_dir_count:
            break
        dir_counter += 1

        # extract the timestamp from at most <file_count> files per directory
        for file in filenames[:file_count]:
            match = re.match(pattern_timestamp, file)
            if match is None:
                # skip files that do not follow the naming pattern
                continue
            timestamps.append(int(match.group(pattern_group)))
    return np.array(timestamps, dtype=np.int64)

def calculate_timestamp_diffs(timestamps):
    # convert to integers first, then sort numerically in ascending order
    sorted_timestamps = np.sort(np.asarray(timestamps).astype(np.int64))
    return np.diff(sorted_timestamps)

def timestamp_diff_statistics(timestamp_diffs):
    mean_timestamp = np.mean(timestamp_diffs)
    std_timestamp = np.std(timestamp_diffs)
    quantiles_timestamp = np.quantile(timestamp_diffs, q=[.25, .5, .75, 1])
    return (mean_timestamp, std_timestamp, quantiles_timestamp)


## use functions
timestamps = extract_timestamp_from_filenames(IMG_DIR, sub_dir_count=SUB_DIR_COUNT, file_count=FILE_COUNT)
#timestamps = extract_timestamp_from_filenames("../data/raw/raw/2017-05-20_auto", sub_dir_count=1)
timestamp_diffs = calculate_timestamp_diffs(timestamps)

ts_mean, ts_std, ts_quantiles = timestamp_diff_statistics(timestamp_diffs)
print("Mean time between images in {}: {:.2f}s (std {:.2f}s, quartiles in s {})".format(
        IMG_DIR,
        ts_mean, ts_std, ts_quantiles)
)


Mean time between images in ../data/raw/raw: 89.86s (std 16.59s, quartiles in s [ 80.  87. 100. 120.])
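
As a small sanity check of the idea behind these functions, the same computation can be done by hand on the five file names shown in the `2017-12-31_auto` directory listing earlier. This sketch uses only the standard library instead of NumPy and is meant as an illustration, not as a replacement for the functions above.

```python
import statistics

# timestamps taken from the five file names in the 2017-12-31_auto listing
timestamps = [1514702547, 1514702617, 1514702727, 1514702797, 1514702879]

# differences between consecutive captures, in seconds
ordered = sorted(timestamps)
diffs = [later - earlier for earlier, later in zip(ordered, ordered[1:])]

print(diffs)  # [70, 110, 70, 82]
print("mean: {}s, median: {}s".format(statistics.mean(diffs), statistics.median(diffs)))
```

The mean of these four differences is 83 s, which is in the same range as the mean reported for the full directory above.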


In [8]:
import plotly.express as px
import dash_core_components as dcc
import dash_html_components as html
import pandas as pd

from jupyter_dash import JupyterDash
from dash.dependencies import Input, Output

IMG_ROOT_DIR = "../data/raw/raw"

# get all subdirectories, which can be selected from dropdown
day_dates_images_captured = sorted(os.listdir(IMG_ROOT_DIR))

# Build App
app = JupyterDash(__name__)
app.layout = html.Div([
    html.H1("Time Differences Between Image Captures"),
    dcc.Graph(id='graph'),
    html.Label([
        "Capture Date",
        dcc.Dropdown(
            id = 'capture-date-dropdown',
            clearable = False,
            value = '2017-05-13_auto',
            options = [
                {'label': c, 'value': c}
                for c in day_dates_images_captured
            ])
    ]),
])

# Define callback to update graph
@app.callback(
    Output('graph', 'figure'),
    [Input("capture-date-dropdown", "value")]
)
def update_figure(capture_date):
    # Load Data, use one subdirectory of IMG_ROOT_DIR
    path = os.path.join(IMG_ROOT_DIR, capture_date)
    timestamps = extract_timestamp_from_filenames(path, sub_dir_count=1)
    timestamp_diffs = calculate_timestamp_diffs(timestamps)

    # create the bins
    #counts, bin_edges = np.histogram(timestamp_diffs, bins=range(0, int(np.max(timestamp_diffs)), 5))
    counts, bin_edges = np.histogram(timestamp_diffs, bins=10)
    bins = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    return px.bar(x=bins, y=counts, labels={'x':'time between captures in seconds', 'y':'count'})
    #df = pd.DataFrame(data=timestamp_diffs, columns=["diffs"])
    #return px.histogram(df, x="diffs", nbins=10, labels={'diffs':'time between captures in seconds', 'y':'count'})

# Run app and display result inline in the notebook
## Workaround: set load_dotenv=False to prevent the current working directory from being changed,
## which would break relative paths such as the image directory
## for more information, see the function run(...) in https://github.com/pallets/flask/blob/master/src/flask/app.py
app.run_server(mode='inline', load_dotenv=False)
#app.run_server(mode='inline')


Path: ../data/raw/raw/2017-05-13_auto
Path: ../data/raw/raw/2017-04-14_auto
