# Data Formats and Terminology

## Key Concepts and Definitions
- Column
- Row
- Field
- Table
- Cell
- Variable
- Observation
- Binary File
- ASCII File
- CSV Format
- JSON Format
- XML Format
- API

## Key Questions
- Where does data come from?
- What is the relationship between rows and columns? Fields, cells, tables?
- What is hierarchical data?
- How do we store data in files?
- What is the relationship between APIs and file formats?
- Can data be redundant? Do we want that?
- Why is data typically 2 dimensional?
- When is two-dimensional data insufficient?
- What is relational data? Multidimensional data?
- What applications can I use to open data?
- What formats can I use to save data?


## What is Data?
You can find a lot of discussion of the various aspects of data in books and in articles on the web, just about anywhere. Lots of them are covered in Wikipedia's article here:
https://en.wikipedia.org/wiki/Data

I want to highlight two excerpts from that article:

>Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing.

and 

> The Latin word data is the plural of datum, "(thing) given," neuter past participle of dare "to give". Data may be used as a plural noun in this sense, with some writers in the 2010s using datum in the singular and data for plural. In the 2010s, though, in non-specialist, everyday writing, "data" is most commonly used in the singular, as a mass noun (like "information", "sand" or "rain").

First, data is an encoding of... *something*. What exactly? Second, there's an inherent plurality to its meaning. Data, as it has come to be useful, is about many encodings, not individual ones.

## Where Does Data Come From?

I find that Quora is a nice place to start learning about things. At its best, it has experts weighing in on questions.

One answer to the question "Where does evidence/data come from? How is it used?":
https://www.quora.com/Where-does-evidence-data-come-from-How-it-used

> Observations of naturally occurring events are recorded as data points, and then used for analysis and inference. This can help the observer make decisions, or gain more information about causation.

[Charlie Pioli](https://www.quora.com/Where-does-evidence-data-come-from-How-it-used/answer/Charlie-Pioli)

This suggests that, ultimately, data starts in the natural world. IRL. What is the process of encoding the natural world into what we commonly think of as data? Every value that makes its way into a spreadsheet file followed some path in the real world to get there. In my experience, that path isn't something that people think about enough, so I hope it is something we do a bit of here.

https://www.quora.com/What-is-data-1


## Tabular Data

|ObservationNum|Field|Field2|Field3|
|--|--|--|--|
|Observation 1|Cell|Cell|Cell|
|Observation 2|Cell|Cell|Cell|
|Observation 3|Cell|Cell|Cell|
|Observation 4|Cell|Cell|Cell|
|Observation 5|Cell|Cell|Cell|

In tabular data, __rows__ often represent individual __observations__ or __data points__.

__Columns__, on the other hand, represent __variables__ or __attributes__ of the phenomenon being measured or recorded. If you think about how this relates to something like a Google Form, you might consider them to represent __fields__.

Individual __cells__ in a data table represent the measurements or variable/attribute/field value for a given observation.
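
As a rough sketch of these terms in code, the table above could be held in Python as a list of dictionaries, one per row. The names `table`, `row`, `column`, and `cell` are illustrative choices, not part of any particular library, and only three of the five rows are shown.

```python
# Each dictionary is one row (observation); each key is a column (variable/field).
table = [
    {"ObservationNum": "Observation 1", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
    {"ObservationNum": "Observation 2", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
    {"ObservationNum": "Observation 3", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
]

row = table[1]                          # one observation
column = [r["Field2"] for r in table]   # one variable across all observations
cell = table[1]["Field2"]               # one field value for one observation

print(row["ObservationNum"])
print(column)
print(cell)
```

This is only one of several reasonable layouts; a dictionary of columns would work just as well and makes whole-column operations easier.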

## How do we store data on a computer? And why does it matter?
This tends to be important because eventually you need to move data from one tool to another. A computer file of some kind is needed to facilitate the interchange. Some discussion here:
https://www.thoughtspot.com/blog/csv-vs-delimited-flat-files-how-choose

#### Binary Files
It's not so important now, but understanding how computers fundamentally store information will become more important as you work with more complex and "closer to the hardware" tools. You've probably heard of binary, 0s and 1s, and bits, and that all information on a computer is stored as bits. This is the only way computers are able to store information: as little switches that, when enough of them are combined, can represent very complex things. There's a nice little discussion here:
https://betterexplained.com/articles/a-little-diddy-about-binary-file-formats/
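
To make the idea of bits concrete, here is a small sketch. A fixed number of bits can represent a fixed number of distinct patterns, and each extra bit doubles that number. The value 77 is an arbitrary choice for illustration.

```python
# A byte is 8 bits, so it can hold 2**8 = 256 distinct patterns.
n_bits = 8
n_patterns = 2 ** n_bits

# The integer 77 written out as 8 bits, then converted back again.
bits = format(77, "08b")
value = int(bits, 2)

# The same number as it would sit in a file: a single raw byte.
raw = (77).to_bytes(1, "big")

print(n_patterns, bits, value, raw)
```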

#### ASCII Files
There is also a whole world of __flat__ or __text__ files that are very important to the world of developers and data users. 
https://en.wikipedia.org/wiki/Text_file

These files are important to us because:
- source code in most programming languages is written in text files
- they can be manipulated with a rich ecosystem of text editing tools
- as we'll see, many data interchange formats are text based.

Ultimately these files are binary as well, but they all use an encoding scheme that translates each byte (8 bits), or a short sequence of bytes, into a human-readable character. ASCII is one such scheme; it defines 128 characters.
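
A minimal sketch of that translation, using Python's built-in `str.encode` and `bytes.decode` with the ASCII scheme. The sample text is an arbitrary choice.

```python
text = "Cell,Cell"

# Encode the characters to bytes using the ASCII scheme.
data = text.encode("ascii")

# Each byte is a small integer; ASCII maps "C" to 67 and "," to 44.
codes = list(data)

# Decoding reverses the mapping and recovers the original text.
round_trip = data.decode("ascii")

print(codes)
print(round_trip)
```

Swapping `"ascii"` for `"utf-8"` gives the same bytes for this text, since UTF-8 agrees with ASCII on these characters.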

#### Table
So given the following table, what are some ways we can represent this as a text file? There are a few common formats, and you can read about them all over the internet. A couple of links:
https://www.analyticsvidhya.com/blog/2017/03/read-commonly-used-formats-using-python/

https://scrapehero.freshdesk.com/support/solutions/articles/5000008629-data-formats-csv-json-xml-or-sql


|ObservationNum|Field|Field2|Field3|
|--|--|--|--|
|Observation 1|Cell|Cell|Cell|
|Observation 2|Cell|Cell|Cell|
|Observation 3|Cell|Cell|Cell|
|Observation 4|Cell|Cell|Cell|
|Observation 5|Cell|Cell|Cell|

#### CSV
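
In CSV, each row of the table becomes one line of text and the cells are separated by commas, usually with the column names on the first line. A minimal sketch that produces that text for the table above with Python's standard `csv` module (the variable names are illustrative):

```python
import csv
import io

header = ["ObservationNum", "Field", "Field2", "Field3"]
rows = [[f"Observation {i}", "Cell", "Cell", "Cell"] for i in range(1, 6)]

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The first two lines of the printed text are `ObservationNum,Field,Field2,Field3` and `Observation 1,Cell,Cell,Cell`.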

#### JSON

```json
{ "Observations":[
    {"ObservationNum": "Observation 1", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
    {"ObservationNum": "Observation 2", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
    {"ObservationNum": "Observation 3", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
    {"ObservationNum": "Observation 4", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
    {"ObservationNum": "Observation 5", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"}
]}
```
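
A short sketch of reading this kind of JSON in Python with the standard `json` module. The embedded text is a two-row excerpt of the table. Note that the objects in a JSON array must be separated by commas, or parsing fails.

```python
import json

json_text = """
{ "Observations": [
    {"ObservationNum": "Observation 1", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
    {"ObservationNum": "Observation 2", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"}
]}
"""

# json.loads turns JSON text into nested Python dicts and lists.
document = json.loads(json_text)
observations = document["Observations"]

print(len(observations))
print(observations[1]["ObservationNum"])

# json.dumps goes the other way, from Python objects back to JSON text.
round_trip = json.loads(json.dumps(document))
```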

#### XML

```xml
<Observations>
    <Observation>
        <ObservationNum>Observation 1</ObservationNum>
        <Field>Cell</Field>
        <Field2>Cell</Field2>
        <Field3>Cell</Field3>
    </Observation>
    <Observation>
        <ObservationNum>Observation 2</ObservationNum>
        <Field>Cell</Field>
        <Field2>Cell</Field2>
        <Field3>Cell</Field3>
    </Observation>
    <Observation>
        <ObservationNum>Observation 3</ObservationNum>
        <Field>Cell</Field>
        <Field2>Cell</Field2>
        <Field3>Cell</Field3>
    </Observation>
    <Observation>
        <ObservationNum>Observation 4</ObservationNum>
        <Field>Cell</Field>
        <Field2>Cell</Field2>
        <Field3>Cell</Field3>
    </Observation>
    <Observation>
        <ObservationNum>Observation 5</ObservationNum>
        <Field>Cell</Field>
        <Field2>Cell</Field2>
        <Field3>Cell</Field3>
    </Observation>
</Observations>
```
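
A sketch of reading XML shaped like this in Python with the standard `xml.etree.ElementTree` module, turning each `<Observation>` element back into a row. The embedded text is a shortened excerpt with two rows and two fields.

```python
import xml.etree.ElementTree as ET

xml_text = """
<Observations>
    <Observation>
        <ObservationNum>Observation 1</ObservationNum>
        <Field>Cell</Field>
    </Observation>
    <Observation>
        <ObservationNum>Observation 2</ObservationNum>
        <Field>Cell</Field>
    </Observation>
</Observations>
"""

# Parse the text into a tree; the root is the <Observations> element.
root = ET.fromstring(xml_text.strip())

# Each child element of an <Observation> becomes one key/value pair.
rows = [
    {child.tag: child.text for child in observation}
    for observation in root.findall("Observation")
]

print(root.tag)
print(rows)
```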

- CSV: 
    - https://en.wikipedia.org/wiki/Comma-separated_values
- JSON:
    - https://en.wikipedia.org/wiki/JSON
    - https://www.w3schools.com/js/js_json_intro.asp
    - http://www.json.org/
- XML: 
    - https://en.wikipedia.org/wiki/XML
    - https://www.w3.org/XML/
- Comparing JSON and XML
    - https://www.w3schools.com/js/js_json_xml.asp

#### APIs and Response formats

As a final point in this discussion about data formats, let's think a bit about __APIs__. API is an acronym for "Application Programming Interface". In principle, it gives a programmer working on their own application access to the services or functions that are part of another program. In practice, these are often connections made over the internet. Your internet-enabled application can connect, using internet protocols, to a server that has an API set up. The medium of communication is typically data. Your program talks to a server, requests something, and the API sends you a response as data. How is that data sent to you? Often in one of the data formats above. A nice discussion of the relative values of each data format can be found here:

http://ezinearticles.com/?CSV-vs-XML-vs-JSON---Which-is-the-Best-Response-Data-Format?&id=4073117
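
To connect response formats to code, here is a hedged sketch of how a program might handle an API response depending on its declared format. `parse_response` is a hypothetical helper, the response bodies are hard-coded stand-ins for what a server might send, and no network request is made.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_response(body, content_type):
    """Turn a response body into a list of row dictionaries (hypothetical helper)."""
    if content_type == "application/json":
        return json.loads(body)["Observations"]
    if content_type == "text/csv":
        return [dict(row) for row in csv.DictReader(io.StringIO(body))]
    if content_type == "application/xml":
        root = ET.fromstring(body)
        return [{child.tag: child.text for child in obs} for obs in root]
    raise ValueError(f"unsupported content type: {content_type}")


# Stand-ins for what a server might send back.
json_body = '{"Observations": [{"ObservationNum": "Observation 1", "Field": "Cell"}]}'
csv_body = "ObservationNum,Field\nObservation 1,Cell\n"
xml_body = (
    "<Observations><Observation>"
    "<ObservationNum>Observation 1</ObservationNum><Field>Cell</Field>"
    "</Observation></Observations>"
)

from_json = parse_response(json_body, "application/json")
from_csv = parse_response(csv_body, "text/csv")
from_xml = parse_response(xml_body, "application/xml")
print(from_json == from_csv == from_xml)
```

The point of the sketch is that the three formats can carry the same rows; what differs is the parsing step on the receiving side.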



## Relational Data

Are there ever conditions when it doesn't make sense to store data in a single table?
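
One situation worth considering is when the same facts repeat on many rows. The sketch below uses station and temperature values invented purely for illustration. It splits a redundant single table into two tables linked by a shared key, then joins them back together.

```python
# A single table: the station's location is repeated on every reading.
readings_flat = [
    {"station": "A", "location": "Harbor", "day": 1, "temp": 24},
    {"station": "A", "location": "Harbor", "day": 2, "temp": 25},
    {"station": "B", "location": "Summit", "day": 1, "temp": 12},
]

# Two tables: each station's location is stored once, keyed by station id.
stations = {row["station"]: {"location": row["location"]} for row in readings_flat}
readings = [
    {"station": row["station"], "day": row["day"], "temp": row["temp"]}
    for row in readings_flat
]

# Joining on the shared key rebuilds the original single table.
rejoined = [{**r, **stations[r["station"]]} for r in readings]
print(rejoined == readings_flat)
```

Whether the split is worth it is a judgment call: it removes the repetition, but reading the data back as one table then requires the join.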


## Where can I find data to look at?

#### Hawaii Datasets 
- Honolulu Open Data Portal: https://data.honolulu.gov/browse?limitTo=datasets&utf8=%E2%9C%93
- Hawaii Open Data Portal: https://data.hawaii.gov/browse?limitTo=datasets&utf8=%E2%9C%93

#### UHERO Macroeconomic Indicators
http://data.uhero.hawaii.edu/#/

#### Federal Data Sources
US Census, Bureau of Labor Statistics, Bureau of Economic Analysis

#### Kaggle

|Dataset|Link|
|--|--|
|Individual Income Tax Statistics|	https://www.kaggle.com/irs/individual-income-tax-statistics|
|Health Insurance Marketplace|	https://www.kaggle.com/hhs/health-insurance-marketplace|
|US Consumer Finance Complaints|	https://www.kaggle.com/cfpb/us-consumer-finance-complaints|
|Daily Happiness and Employee Turnover|	https://www.kaggle.com/harriken/employeeturnover|
|Loan Data|	https://www.kaggle.com/zhijinzhai/loandata|
|Medical Appointment No Shows|	https://www.kaggle.com/joniarroba/noshowappointments|
|AirBnb Property Data from Texas|	https://www.kaggle.com/PromptCloudHQ/airbnb-property-data-from-texas|
|Credit Card Fraud Detection|	https://www.kaggle.com/dalpozz/creditcardfraud|
|Human Resources Analytics|	https://www.kaggle.com/ludobenistant/hr-analytics|
|Default of Credit Card Clients Dataset|	https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset|
|Twitter US Airlines Sentiment|	https://www.kaggle.com/crowdflower/twitter-airline-sentiment|
|IBM HR Analytics Employee Attrition & Performance|	https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset|

## Possible Projects

#### Loading CSVs / XML / JSON in various applications
Experiment with and document how to "load" data in various programming languages and applications. What does it mean when the data are loaded? What can you do with it?
Options to try: Python, R, Terminal / Command Line, Text Editor (Atom, Sublime, vi), Excel, Tableau

#### Storymapping Data
Find a dataset and illustrate how it is created and all of the steps that it goes through to arrive on your computer.