In [None]:
import io

import numpy as np
import pandas as pd
import plotly.express as px

# Data frames 1: What is a data frame



A data frame is a table of data. For example, a standard spreadsheet with data
can be thought of as a data frame. Let's look at an example.

In [None]:
df = pd.read_csv('data/tokyo-weather.csv')
df.head()

A data frame has columns, rows, and cells holding the values. The values in the cells can be numeric (including NaN to represent missing numbers), or they can be strings that represent text or categorical data. The interpretation of a data frame comes from statistics.
Each column in the data frame corresponds to a variable, that is, something that either
can be measured or can be controlled by us. Each row corresponds to one observation, with
the values in its different columns being logically related. For example, in the table above,
one row corresponds to the weather data for one hour.

In the Python pandas library, the column types can be inspected using the `dtypes` property. Note that numeric types
are further subdivided into integer (`int64`) and floating point (`float64`) types. String data is represented with dtype `object`.
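
As a minimal sketch of inspecting column types, the following builds a small hypothetical data frame and reads its `dtypes` property. The column names `hour`, `temperature` and `sky` are made up for illustration and are not taken from the Tokyo weather file. The dtype reported for the string column may depend on the pandas version.

```python
import io

import pandas as pd

# Hypothetical miniature data set; the empty temperature field becomes NaN.
sample = pd.read_csv(io.StringIO("""hour,temperature,sky
0,12.5,clear
1,,cloudy
2,11.0,clear
"""))

# dtypes is a property (no parentheses); it holds one dtype per column.
print(sample.dtypes)

# Count the missing values in the numeric column.
print(sample["temperature"].isna().sum())
```

A column that contains a missing number is stored as floating point, because NaN is itself a floating point value.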

## What is the CSV format

There are many ways to represent tabular data, spreadsheets being the most popular among general computer users. However, for programmatic access, a simpler format such as CSV may be even more useful.
It is easy to generate, even by typing manually, and relatively easy to parse. CSV stands for comma-separated values: it uses a comma `,` to separate the values in a single row.

The CSV format has several detailed definitions that disagree in small details, but it is possible to stick
to a conservative set of rules that serves as a lowest common denominator and is highly interoperable. Here are the conservative rules:

* Every line has the same number of fields, separated by commas. In CSV terminology, each line is called a "record".
* Field values should not contain commas or newline characters. In the rare event that a comma needs to be part of a value, the field value should be enclosed in double quotes. (If the field itself needs to contain a double quote character, it should be doubled inside the quotes, but this quickly gets dangerous, because the details of the escaping rules may differ between programs that work with CSV files.)
* The first line in the file may optionally be a header, i.e. contain the human-readable column names.
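
The quoting rules above can be sketched with a small hypothetical table (the city notes are invented for illustration). The literal uses triple single quotes so that the double quotes inside the CSV text need no Python escaping.

```python
import io

import pandas as pd

# The second record has a comma inside a value, so that field is quoted.
# The third record has double quotes inside a value, so they are doubled.
cities = pd.read_csv(io.StringIO('''city,note
Tokyo,capital
Kyoto,"temples, gardens"
Osaka,"known as ""the kitchen"""
'''))

print(cities.shape)
print(cities.loc[1, "note"])
print(cities.loc[2, "note"])
```

Every record still parses into exactly two fields, which is what the first rule requires.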

Typically the CSV format is used in files with the `.csv` suffix, but Python makes it easy to parse CSV defined in string literals. This may in fact be the easiest way to define small data frames in Jupyter notebooks. Here is an example.

In [None]:
df = pd.read_csv(io.StringIO("""
x,y
1,2
3,4
"""))
df

In case you are curious, `pd.read_csv` accepts file-like objects to read the data from, and `io.StringIO` is a way to create a file-like object from a string. Triple quotes `"""` are the Python syntax for defining a multi-line string literal. The blank first line of the literal is harmless, because `pd.read_csv` skips blank lines by default.

Here is another, more traditional way: create a CSV file from the Jupyter notebook with the `%%writefile` cell magic and then load it as a regular file:

In [None]:
%%writefile test.csv
x,y
1,2
3,4

In [None]:
test_df = pd.read_csv('test.csv')
test_df

## Tidy data frames: How to think about data frame structure

There are many possible ways to put the same data into a tabular format.

     TODO(salikh): Add examples
     
One way to think about data, inspired by statistics, is as an experiment report.
Data organized this way is called _tidy_ and satisfies the following conditions:

* Each kind of "experiment" is kept in a separate table (data frame).
* In a table, one row is one observation, and one column is one variable.
* Values appear in fields only, i.e. values should never occur in column headers. Variable names appear in column headers only, i.e. variable names should never occur in field values.
* Variables (columns) can be subdivided into _controlled_ (how we set up an experiment) and _measured_ (the values that we are measuring). This way of thinking explains what we mean by each row corresponding to one observation.

By contrast, all data formats that are not tidy are called _messy_.
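
As a hedged illustration of the rule that values should not occur in column headers, consider a hypothetical table with one column per year (the numbers and the `rainy_days` name are invented). The years are values of a variable, so this layout is messy. One possible way to make it tidy in pandas is `DataFrame.melt`:

```python
import io

import pandas as pd

# Messy: the years 2019 and 2020 are values, but they appear as column headers.
messy = pd.read_csv(io.StringIO("""city,2019,2020
Tokyo,10,12
Osaka,7,8
"""))

# Tidy: one row per (city, year) observation, one column per variable.
tidy = messy.melt(id_vars=["city"], var_name="year", value_name="rainy_days")

# The former headers are strings, so convert them to integers.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

In the tidy version, `city` and `year` play the role of controlled variables and `rainy_days` is the measured one.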

Tidy data frames have some connection to the third normal form in database theory, but data frames tend to be more flexible and malleable. It is also worth noting that, depending on the purpose of the data analysis and the required computations, the definition of "one observation" may differ. For example, assume that we have data about flight departure and arrival times. If we want to study flight lengths, then it is convenient to have departure and arrival as independent variables in separate columns, which makes it easy to compute the flight length. If, on the other hand, we want to study how the airstrip is used over time, then departures and arrivals are just timestamps of events, and arrival/departure is better thought of as an additional categorical variable.
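
The two views of the flight data can be sketched as follows. The flights are hypothetical, and times are given as minutes since midnight purely to keep the example short.

```python
import io

import pandas as pd

# Hypothetical flights; times are minutes since midnight.
flights = pd.read_csv(io.StringIO("""flight,departure,arrival
A1,600,720
B2,630,810
"""))

# View 1: one row per flight, so the flight length is a column difference.
flights["length"] = flights["arrival"] - flights["departure"]

# View 2: one row per airstrip event; the kind of event becomes
# a categorical variable stored in the "event" column.
events = flights.melt(
    id_vars=["flight"],
    value_vars=["departure", "arrival"],
    var_name="event",
    value_name="time",
).sort_values("time", ignore_index=True)

print(flights)
print(events)
```

Both frames are tidy; they differ only in what counts as one observation.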


There are two benefits to tidy data frames:

* Bringing all data into the tidy format makes your life easier, as you do not need
  to remember and handle the peculiarities of various data formats. Data handling becomes much more
  uniform.
  
* There is an existing set of tools that work best when the data is in tidy format. The most
  important of those tools is the plotting library used for data visualization.
  We will see some examples later in this unit.


# Exercise: Create a data frame from a textual description

In this exercise, your task is to create a tidy data frame based on the textual description
provided below. A person (Alice) wants to analyze her coffee drinking habits.

Here is Alice's description of her week:

* Alice goes to the office every weekday.
* Alice drops by the coffee shop before work every day except Wednesdays.
* In the morning, Alice buys an S-size cup of coffee.
* Alice goes to the gym every Tuesday and Thursday.
* After the gym, Alice goes to the coffee shop and has an L-size coffee.
* When not going to the gym, Alice goes straight home and goes to sleep without coffee.
* On weekends, Alice does not go to coffee shops, but brews coffee at home, once on Saturday and once on
  Sunday. Her coffee maker makes 500 ml of coffee.
* An S-size cup is 200 ml. An L-size cup is 300 ml.
  
Your task: create a data frame named `coffee` that describes how much coffee Alice drinks on each day of the week,
with additional columns describing the day:

* `"work"`: boolean (True/False), describes whether the day is a workday (True) or a weekend day (False).
* `"gym"`: boolean (True/False), describes whether Alice goes to the gym on that day (True: goes to the gym, False: does not go to the gym).



In [None]:
coffee = pd.read_csv(io.StringIO("""day,coffee_ml,work,gym
...
"""))

In [None]:
# Inspect the resulting data frame
coffee

In [None]:
# Test the data frame.
assert len(coffee) == 7, "Your dataframe should have 7 rows, one for each day of the week"
assert 'day' in coffee, "Your dataframe should have a 'day' column"
assert 'coffee_ml' in coffee, "Your dataframe should have a 'coffee_ml' column"
assert 'work' in coffee, "Your dataframe should have a 'work' column"
assert 'gym' in coffee, "Your dataframe should have a 'gym' column"