# d_summary
Algorithm that performs some very basic validation of the remote dataset. It includes:
* Check header labels
* Count the number of rows
* Average of `int64` and `float64` columns
* Categories of `category` columns

Limits of the algorithm:
* The entire dataset needs at least 10 records.
* `category` columns need at least 2 different values.
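The notebook below does not show how these limits are enforced, so the following is only a minimal sketch of one way to check them. The function name `check_limits` and the constants `MIN_ROWS` and `MIN_CATEGORIES` are hypothetical and not part of the original algorithm.

```python
import pandas

MIN_ROWS = 10        # limit stated above: at least 10 records
MIN_CATEGORIES = 2   # limit stated above: at least 2 different values


def check_limits(dataframe):
    """Return a list of limit violations (hypothetical helper, empty if none)."""
    problems = []
    if len(dataframe) < MIN_ROWS:
        problems.append(
            f"dataset has {len(dataframe)} rows, needs at least {MIN_ROWS}"
        )
    # only columns with the pandas `category` dtype are checked for distinct values
    for name in dataframe.select_dtypes(include="category").columns:
        distinct = dataframe[name].nunique()
        if distinct < MIN_CATEGORIES:
            problems.append(
                f"column '{name}' has {distinct} distinct value(s), "
                f"needs at least {MIN_CATEGORIES}"
            )
    return problems


# small illustrative frame: too few rows, and `cat` has a single value
example = pandas.DataFrame({
    "stage": pandas.Categorical(["IV", "I", "II"]),
    "cat": pandas.Categorical(["Q", "Q", "Q"]),
})
violations = check_limits(example)
for violation in violations:
    print(violation)
```

Returning a list of messages rather than raising on the first problem is a design choice made here so that all violations can be reported at once.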

## 1. Algorithm

In [1]:
import pandas
import json

### input.txt
The input.txt is mounted in the docker-container and contains the input to the algorithm.

The input for this algorithm includes the name of the method that is called in the docker-container, `summarize`, and a `dict` that maps column names to dtypes. The allowed dtypes are: `object`, `int64`, `float64`, `bool`, `datetime64`, `category`
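As a rough illustration, the input could be checked against the method name and the allowed dtypes before it is used. This sketch assumes that input.txt holds JSON, which the notebook does not state; the helper `validate_input` and the constant `ALLOWED_DTYPES` are hypothetical names.

```python
import json

# the dtypes listed above as allowed
ALLOWED_DTYPES = {"object", "int64", "float64", "bool", "datetime64", "category"}


def validate_input(input_):
    """Return the column/dtype mapping, or raise ValueError (hypothetical helper)."""
    if input_.get("method") != "summarize":
        raise ValueError(f"unsupported method: {input_.get('method')!r}")
    columns = input_.get("columns")
    if not isinstance(columns, dict) or not columns:
        raise ValueError("'columns' must be a non-empty dict")
    unknown = {name: dtype for name, dtype in columns.items()
               if dtype not in ALLOWED_DTYPES}
    if unknown:
        raise ValueError(f"unsupported dtypes: {unknown}")
    return columns


# assumption: the content of input.txt is a JSON document like this one
raw = '{"method": "summarize", "columns": {"age": "int64", "stage": "category"}}'
columns = validate_input(json.loads(raw))
print(columns)
```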

In [111]:
input_ = {
    "method":"summarize", 
    "columns":{
        "patient_id": 'int64',
        "age": 'int64',
        "weight": 'float64', 
        "stage": 'category',
        "cat": 'category'
    }
}

### database.csv
The database csv-file is mounted in the docker-container.

In [112]:
dataframe = pandas.read_csv("example_dataset.csv", sep=";",decimal=",", dtype=input_.get("columns"))
dataframe

   patient_id  age  weight stage cat
0           1   41    73.2    IV   Q
1           2   37    65.9     I   Q
2           3   45    84.1    II   Q


In [113]:
dataframe.dtypes

patient_id       int64
age              int64
weight         float64
stage         category
cat           category
dtype: object
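The `sep=";"` and `decimal=","` arguments used above indicate a semicolon-separated file with decimal commas. The sketch below reproduces that parsing on an in-memory string so it can be run without the mounted file; the CSV text is an assumed layout based on the displayed rows, not the actual content of `example_dataset.csv`.

```python
import io

import pandas

# assumed file layout: semicolon-separated, decimal comma
csv_text = (
    "patient_id;age;weight;stage;cat\n"
    "1;41;73,2;IV;Q\n"
    "2;37;65,9;I;Q\n"
    "3;45;84,1;II;Q\n"
)
dtypes = {
    "patient_id": "int64",
    "age": "int64",
    "weight": "float64",
    "stage": "category",
    "cat": "category",
}

# same read_csv arguments as in the cell above, applied to the in-memory text
frame = pandas.read_csv(io.StringIO(csv_text), sep=";", decimal=",", dtype=dtypes)
print(frame.dtypes)
```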

### algorithm.py

In [114]:
# expected column names and dtypes, as supplied in the input
columns_series = pandas.Series(data=input_.get("columns"))

# retrieve column names from the dataset
column_names = list(dataframe.columns)

# compare column names from dataset to the input column names
column_names_correct = column_names == list(input_.get("columns").keys())
print(f"column_names_correct={column_names_correct}")

column_names_correct=True


In [115]:
# count the number of rows in the dataset
number_of_rows = len(dataframe)
print(f"number_of_rows={number_of_rows}")

number_of_rows=3


In [116]:
# compute the average of the numeric columns
numeric_columns = columns_series.loc[columns_series.isin(['int64','float64'])]
averages = {}
for column_name in numeric_columns.keys():
    averages[column_name] = dataframe[column_name].mean()
print(f"computed averages={averages}")

computed averages={'patient_id': 2.0, 'age': 41.0, 'weight': 74.4}


In [125]:
# return the categories in categorical columns
categorical_columns = columns_series.loc[columns_series.isin(['category'])]
categories = {}
for column_name in categorical_columns.keys():
    found = list(dataframe[column_name].cat.categories)
    # columns with fewer than 2 categories are reported without their value
    categories[column_name] = found if len(found) > 1 else "single category"
print(f"found categories={categories}")

found categories={'stage': ['I', 'II', 'IV'], 'cat': 'single category'}


In [126]:
output = {
    "column_names_correct": column_names_correct,
    "number_of_rows": number_of_rows,
    "averages": averages,
    "categories": categories
}
output

{'column_names_correct': True,
 'number_of_rows': 3,
 'averages': {'patient_id': 2.0, 'age': 41.0, 'weight': 74.4},
 'categories': {'stage': ['I', 'II', 'IV'], 'cat': 'single category'}}

### output.txt

In [128]:
with open("output.txt", "w") as fp:
    json.dump(output,fp)
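The cells above could be collected into a single function that matches the `summarize` method name from the input. This is only a sketch of how that might look: the function signature, the comparison against the columns actually present in the dataframe, and the `float()` conversion are assumptions, and the example dataframe is built in memory instead of being read from the mounted csv-file.

```python
import json

import pandas


def summarize(dataframe, columns):
    """Collect the checks from this notebook into one dict (illustrative sketch)."""
    columns_series = pandas.Series(data=columns)

    # header check: dataset column names against the expected names
    column_names_correct = list(dataframe.columns) == list(columns_series.keys())

    # averages of the int64 and float64 columns, as plain Python floats
    numeric = columns_series.loc[columns_series.isin(["int64", "float64"])]
    averages = {name: float(dataframe[name].mean()) for name in numeric.keys()}

    # categories of the category columns, hidden when there is only one
    categorical = columns_series.loc[columns_series.isin(["category"])]
    categories = {}
    for name in categorical.keys():
        found = list(dataframe[name].cat.categories)
        categories[name] = found if len(found) > 1 else "single category"

    return {
        "column_names_correct": column_names_correct,
        "number_of_rows": len(dataframe),
        "averages": averages,
        "categories": categories,
    }


# small in-memory example instead of the mounted csv-file
columns = {"age": "int64", "stage": "category", "cat": "category"}
example = pandas.DataFrame({
    "age": [41, 37, 45],
    "stage": pandas.Categorical(["IV", "I", "II"]),
    "cat": pandas.Categorical(["Q", "Q", "Q"]),
})
result = summarize(example, columns)

# the result contains only JSON-serializable values
serialized = json.dumps(result)
print(serialized)
```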