# 1. Data Formats and Terminology
*Module: Acquiring Data and Data Formats (Sprint 1 of 2)*

## Sprint Module Review and Data Stories

#### Basic Data Manipulation
*“Getting” the data you want to analyze means downloading it from websites, connecting to and querying databases, extracting it from HTML webpages, interfacing with APIs (application programming interfaces), importing and exporting files, and converting back and forth between data formats. Programming languages, databases, command line-based applications and graphical applications each have something to offer.*

|Data Journalist| Data Engineer | Statistical Modeler| Business Analyst |
|----|----------------|----|------------------|
|...I need to **identify data formats** to successfully load data into tools and investigate it|...I need to be able to **programmatically read, write, edit, and convert data files** so that my tools can work with data sources|...I need to understand the **basic terminology and structure of data** so that I can apply statistical analyses|...I need to understand the **story of where the data came from** so I know how it is relevant to my action or recommendation|

## Key Questions
- Where does data come from?
- What is the relationship between rows and columns? Fields, cells, tables?
- What is hierarchical data?
- How do we store data in files?
- What is the relationship between APIs and file formats?
- Can data be redundant? Do we want that?
- Why is data typically two-dimensional?
- When is two-dimensional data insufficient?
- What is relational data? Multidimensional data?
- What applications can I use to open data?
- What formats can I use to save data?

## Key Concepts and Definitions
- Column
- Row
- Field
- Table
- Cell
- Variable
- Observation
- Binary File
- ASCII File
- CSV Format
- JSON Format
- XML Format
- API


---

## What is Data?
You can find discussion of the various aspects of data in books, in articles on the web, and just about anywhere else. Lots of them are covered in Wikipedia's article here:
https://en.wikipedia.org/wiki/Data

I want to highlight two excerpts from that article:

>Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing.

and 

> The Latin word data is the plural of datum, "(thing) given," neuter past participle of dare "to give".[4] Data may be used as a plural noun in this sense, with some writers in the 2010s using datum in the singular and data for plural. In the 2010s, though, in non-specialist, everyday writing, "data" is most commonly used in the singular, as a mass noun (like "information", "sand" or "rain").[5]

First, data is an encoding of... *something*. What exactly? Second, there's an inherent plurality to its meaning. Data, as it has come to be useful, is about multiple encodings, not individual ones.

## Where Does Data Come From?

I find that Quora is a nice place to start learning about things. At its best, it has experts weigh in on questions.

One answer to the question "Where does evidence/data come from? How is it used?":
https://www.quora.com/Where-does-evidence-data-come-from-How-it-used

> Observations of naturally occurring events are recorded as data points, and then used for analysis and inference. This can help the observer make decisions, or gain more information about causation.

[Charlie Pioli](https://www.quora.com/Where-does-evidence-data-come-from-How-it-used/answer/Charlie-Pioli)

This answer suggests that, ultimately, data starts in the natural world. IRL. What is the process of encoding the natural world into what we commonly think of as data? Every value that makes its way into a spreadsheet file followed some path in the real world to get there. In my experience, that path isn't something that people think about enough, so I hope it is something we do a bit of here.

https://www.quora.com/What-is-data-1



## Tabular Data

|ObservationNum|Field|Field2|Field3|
|--|--|--|--|
|Observation 1|Cell|Cell|Cell|
|Observation 2|Cell|Cell|Cell|
|Observation 3|Cell|Cell|Cell|
|Observation 4|Cell|Cell|Cell|
|Observation 5|Cell|Cell|Cell|

|ObservationNum|Turbidity|Total Suspended Solids|pH|
|--|--|--|--|
|Water Sample 1|100|120|7|
|Water Sample 2|120|360|4.5|
|Water Sample 3|250|180|7.5|
|Water Sample 4|10|90|9|
|Water Sample 5|300|45|2|

In tabular data, __rows__ often represent individual __observations__ or __data points__.

__Columns__, on the other hand, represent __variables__ or __attributes__ of the phenomenon being measured or recorded. If you think about how this relates to something like a Google Form, you might consider them to represent __fields__.

Individual __cells__ in a data table represent the measurements or variable/attribute/field value for a given observation.
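
To make the vocabulary concrete, here is a minimal sketch of the water-sample table above held in plain Python. Storing it as a list of dictionaries is just one convenient choice for illustration, and the variable names (`samples`, `row`, `column`, `cell`) are made up for this example.

```python
# The water-sample table from above, stored as a list of rows.
# Each dict is one observation (row); each key is a variable (column).
samples = [
    {"ObservationNum": "Water Sample 1", "Turbidity": 100, "Total Suspended Solids": 120, "pH": 7},
    {"ObservationNum": "Water Sample 2", "Turbidity": 120, "Total Suspended Solids": 360, "pH": 4.5},
    {"ObservationNum": "Water Sample 3", "Turbidity": 250, "Total Suspended Solids": 180, "pH": 7.5},
    {"ObservationNum": "Water Sample 4", "Turbidity": 10, "Total Suspended Solids": 90, "pH": 9},
    {"ObservationNum": "Water Sample 5", "Turbidity": 300, "Total Suspended Solids": 45, "pH": 2},
]

row = samples[1]                      # one observation (Water Sample 2)
column = [s["pH"] for s in samples]   # one variable across all observations
cell = samples[1]["pH"]               # one value: the pH of Water Sample 2

print(row["ObservationNum"])  # Water Sample 2
print(column)                 # [7, 4.5, 7.5, 9, 2]
print(cell)                   # 4.5
```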

## How do we store data on a computer? And why does it matter?
This tends to be important because eventually you need to move data from one tool to another. A computer file of some kind is needed to facilitate the interchange. Some discussion here:
https://www.thoughtspot.com/blog/csv-vs-delimited-flat-files-how-choose

#### Binary Files
It's not so important now, but as you work with more complex and "closer to the hardware" tools, it becomes more important to understand how computers fundamentally store information. You've probably heard of binary, 0s and 1s, and bits, and that all information on a computer is stored as bits. This is the only way computers are able to store information: as little switches that, when enough of them are combined, can represent arbitrarily complex things. There's a nice little discussion here:
https://betterexplained.com/articles/a-little-diddy-about-binary-file-formats/
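
As a small illustration of "everything is bits", the sketch below takes a single number (300, the turbidity of Water Sample 5) and shows it three ways: as a string of bits, as two raw bytes the way a binary file might hold it, and as the three text characters a text file would hold. The choice of a little-endian unsigned 16-bit layout is an assumption made for the example; real binary formats define their own layouts.

```python
import struct

value = 300  # the turbidity reading of Water Sample 5

bits = format(value, "b")             # the integer written in base 2
packed = struct.pack("<H", value)     # 2 raw bytes: little-endian, unsigned 16-bit
as_text = str(value).encode("ascii")  # the characters "3", "0", "0" as 3 bytes

print(bits)                       # 100101100
print(packed)                     # b',\x01'
print(len(packed), len(as_text))  # 2 3
```

The binary form is smaller, but you need to know the layout to read it back; the text form is larger, but any text editor can open it.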

#### ASCII Files
There is also a whole world of __flat__ or __text__ files that are very important to the world of developers and data users. 
https://en.wikipedia.org/wiki/Text_file

These files are important to us because:
- programs in most programming languages are written as text files
- they can be manipulated with a rich ecosystem of text editing tools
- as we'll see, many data interchange formats are text-based.

Ultimately these files are binary as well, but they all use an encoding scheme, such as ASCII, that maps bytes (8 bits each) to human-readable characters.
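
The mapping between characters and bytes can be seen directly in Python. This is a minimal sketch using a made-up fragment of text (`pH,7.5`); it encodes the text as ASCII, prints the byte value and bit pattern for each character, and then decodes the bytes back.

```python
text = "pH,7.5"             # a made-up fragment of a text data file
raw = text.encode("ascii")  # the bytes that would actually be stored on disk

# Show each character next to its byte value and the 8 bits of that byte
for ch, byte in zip(text, raw):
    print(ch, byte, format(byte, "08b"))

decoded = raw.decode("ascii")  # reading the file reverses the mapping
print(decoded == text)         # True
```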




#### Table
So given the following table, what are some ways we can represent it as a text file? There are a few common formats, and you can read about them all over the internet. A couple of links:
https://www.analyticsvidhya.com/blog/2017/03/read-commonly-used-formats-using-python/

https://scrapehero.freshdesk.com/support/solutions/articles/5000008629-data-formats-csv-json-xml-or-sql


|ObservationNum|Field|Field2|Field3|
|--|--|--|--|
|Observation 1|Cell|Cell|Cell|
|Observation 2|Cell|Cell|Cell|
|Observation 3|Cell|Cell|Cell|
|Observation 4|Cell|Cell|Cell|
|Observation 5|Cell|Cell|Cell|

#### CSV
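
In CSV (comma-separated values), each row of the table becomes one line of text and the cells are separated by commas. As a hedged sketch, the table above could be written and read back with Python's standard `csv` module; the in-memory `io.StringIO` buffer stands in for a file on disk, and the variable names are illustrative only.

```python
import csv
import io

header = ["ObservationNum", "Field", "Field2", "Field3"]
rows = [[f"Observation {i}", "Cell", "Cell", "Cell"] for i in range(1, 6)]

buffer = io.StringIO()  # an in-memory stand-in for a file on disk
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)  # first line: the column names
writer.writerows(rows)   # one line per observation

csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: every line becomes a list of strings
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1])  # ['Observation 1', 'Cell', 'Cell', 'Cell']
```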

#### JSON

```json
{ "Observations":[
    {"ObservationNum": "Observation 1", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
    {"ObservationNum": "Observation 2", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
    {"ObservationNum": "Observation 3", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
    {"ObservationNum": "Observation 4", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
    {"ObservationNum": "Observation 5", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"}
]}
```
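
What does "loading" this JSON look like? A minimal sketch with Python's standard `json` module follows. It uses a shortened, two-observation copy of the document above so the example stays small; the outer object becomes a `dict` and the array of observations becomes a `list` of dicts.

```python
import json

# A shortened copy of the JSON document above (two observations only)
json_text = """
{ "Observations": [
    {"ObservationNum": "Observation 1", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"},
    {"ObservationNum": "Observation 2", "Field": "Cell", "Field2": "Cell", "Field3": "Cell"}
]}
"""

data = json.loads(json_text)         # text -> Python objects
observations = data["Observations"]  # the array becomes a list
first = observations[0]              # each object becomes a dict

print(type(data).__name__, type(observations).__name__)  # dict list
print(first["ObservationNum"])                           # Observation 1

round_trip = json.dumps(data, indent=2)  # Python objects -> text again
```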

#### XML

```xml
<Observations>
    <Observation>
        <ObservationNum>Observation 1</ObservationNum>
        <Field>Cell</Field>
        <Field2>Cell</Field2>
        <Field3>Cell</Field3>
    </Observation>
    <Observation>
        <ObservationNum>Observation 2</ObservationNum>
        <Field>Cell</Field>
        <Field2>Cell</Field2>
        <Field3>Cell</Field3>
    </Observation>
    <Observation>
        <ObservationNum>Observation 3</ObservationNum>
        <Field>Cell</Field>
        <Field2>Cell</Field2>
        <Field3>Cell</Field3>
    </Observation>
    <Observation>
        <ObservationNum>Observation 4</ObservationNum>
        <Field>Cell</Field>
        <Field2>Cell</Field2>
        <Field3>Cell</Field3>
    </Observation>
    <Observation>
        <ObservationNum>Observation 5</ObservationNum>
        <Field>Cell</Field>
        <Field2>Cell</Field2>
        <Field3>Cell</Field3>
    </Observation>
</Observations>
```
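
The XML version can be loaded in a similar way. The sketch below uses Python's standard `xml.etree.ElementTree` module on a shortened, two-observation copy of the document above and turns each `<Observation>` element back into a dictionary, which shows that the hierarchical markup carries the same rows and columns as the table.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the XML document above (two observations only)
xml_text = """
<Observations>
    <Observation>
        <ObservationNum>Observation 1</ObservationNum>
        <Field>Cell</Field>
        <Field2>Cell</Field2>
        <Field3>Cell</Field3>
    </Observation>
    <Observation>
        <ObservationNum>Observation 2</ObservationNum>
        <Field>Cell</Field>
        <Field2>Cell</Field2>
        <Field3>Cell</Field3>
    </Observation>
</Observations>
"""

root = ET.fromstring(xml_text.strip())  # the <Observations> element

# Each <Observation> is a row; each child element is one cell,
# with the tag as the column name and the text as the value.
records = [
    {child.tag: child.text for child in observation}
    for observation in root.findall("Observation")
]

print(root.tag)      # Observations
print(records[0])
```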

- CSV: 
    - https://en.wikipedia.org/wiki/Comma-separated_values
- JSON:
    - https://en.wikipedia.org/wiki/JSON
    - https://www.w3schools.com/js/js_json_intro.asp
    - http://www.json.org/
- XML: 
    - https://en.wikipedia.org/wiki/XML
    - https://www.w3.org/XML/
- Comparing JSON and XML
    - https://www.w3schools.com/js/js_json_xml.asp

#### APIs and Response formats

As a final point in this discussion about data formats, let's think a bit about __APIs__. API is an acronym for "Application Programming Interface". In principle, it gives a programmer working on their own application access to the services or functions that are part of another program. In practice, these are often connections made over the internet. Your internet-enabled application can connect to a server with an API set up using internet protocols. The medium of communication is typically data. Your program talks to a server, requests something, and the API sends you a response as data. How is that data sent to you? Often in one of the data formats above. A nice discussion of the relative values of each data format can be found here:

http://ezinearticles.com/?CSV-vs-XML-vs-JSON---Which-is-the-Best-Response-Data-Format?&id=4073117
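
To connect APIs back to file formats, here is a hedged sketch of the client side. The function name `decode_response` and the canned response bodies are invented for illustration. With a real API the body would come from an HTTP request (for example via `urllib.request.urlopen`) and the content type from the response headers. The point is that the same observation can arrive as JSON, CSV, or XML, and the program has to decode whichever format the server chose.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def decode_response(content_type, body):
    """Turn a response body (text) into a list of dicts, one per observation.

    Hypothetical helper: assumes the JSON and XML layouts used earlier
    in these notes and a header row in the CSV.
    """
    if content_type == "application/json":
        return json.loads(body)["Observations"]
    if content_type == "text/csv":
        return [dict(row) for row in csv.DictReader(io.StringIO(body))]
    if content_type == "application/xml":
        root = ET.fromstring(body)
        return [{child.tag: child.text for child in obs} for obs in root]
    raise ValueError(f"unsupported format: {content_type}")


# Canned bodies standing in for what a server might send back
json_body = '{"Observations": [{"ObservationNum": "Observation 1", "Field": "Cell"}]}'
csv_body = "ObservationNum,Field\nObservation 1,Cell\n"
xml_body = (
    "<Observations><Observation>"
    "<ObservationNum>Observation 1</ObservationNum><Field>Cell</Field>"
    "</Observation></Observations>"
)

from_json = decode_response("application/json", json_body)
from_csv = decode_response("text/csv", csv_body)
from_xml = decode_response("application/xml", xml_body)

print(from_json == from_csv == from_xml)  # True
```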



## Relational Data

Are there ever conditions when it doesn't make sense to store data in a single table?
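
One possible answer, sketched with Python's built-in `sqlite3` module: when several observations share the same background facts, repeating those facts on every row is redundant. Splitting the data into two related tables stores each fact once and joins them when needed. The table names, columns, and values below (sampling sites and their samples) are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway in-memory database
conn.executescript("""
CREATE TABLE sites (
    site_id INTEGER PRIMARY KEY,
    name TEXT,
    water_body TEXT
);
CREATE TABLE samples (
    sample_id INTEGER PRIMARY KEY,
    site_id INTEGER REFERENCES sites(site_id),
    ph REAL
);
""")

# Each site's details are stored once...
conn.executemany(
    "INSERT INTO sites VALUES (?, ?, ?)",
    [(1, "Site A", "stream"), (2, "Site B", "pond")],
)
# ...and each sample only carries the id of the site it came from.
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(1, 1, 7.0), (2, 1, 4.5), (3, 2, 7.5)],
)

# A join rebuilds the single wide table on demand
rows = conn.execute("""
    SELECT samples.sample_id, sites.name, sites.water_body, samples.ph
    FROM samples JOIN sites ON samples.site_id = sites.site_id
    ORDER BY samples.sample_id
""").fetchall()
conn.close()

for r in rows:
    print(r)
```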


## Where can I find data to look at?

#### Hawaii Datasets 
- Honolulu Open Data Portal: https://data.honolulu.gov/browse?limitTo=datasets&utf8=%E2%9C%93
- Hawaii Open Data Portal: https://data.hawaii.gov/browse?limitTo=datasets&utf8=%E2%9C%93

#### UHERO Macroeconomic Indicators
http://data.uhero.hawaii.edu/#/

#### Federal Datasources
US Census, Bureau of Labor Statistics, Bureau of Economic Analysis

#### Kaggle

|Dataset|Link|
|--|--|
|Individual Income Tax Statistics|	https://www.kaggle.com/irs/individual-income-tax-statistics|
|Health Insurance Marketplace|	https://www.kaggle.com/hhs/health-insurance-marketplace|
|US Consumer Finance Complaints|	https://www.kaggle.com/cfpb/us-consumer-finance-complaints|
|Daily Happiness and Employee Turnover|	https://www.kaggle.com/harriken/employeeturnover|
|Loan Data|	https://www.kaggle.com/zhijinzhai/loandata|
|Medical Appointment No Shows|	https://www.kaggle.com/joniarroba/noshowappointments|
|AirBnb Property Data from Texas|	https://www.kaggle.com/PromptCloudHQ/airbnb-property-data-from-texas|
|Credit Card Fraud Detection|	https://www.kaggle.com/dalpozz/creditcardfraud|
|Human Resources Analytics|	https://www.kaggle.com/ludobenistant/hr-analytics|
|Default of Credit Card Clients Dataset|	https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset|
|Twitter US Airlines Sentiment|	https://www.kaggle.com/crowdflower/twitter-airline-sentiment|
|IBM HR Analytics Employee Attrition & Performance|	https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset|

## Possible Projects

#### Loading CSVs / XML / JSON in various applications
Experiment with and document how to "load" data in various programming languages and applications. What does it mean when the data are loaded? What can you do with it?
Python, R, Terminal / Command Line, Text Editor (Atom, Sublime, vi), Excel, Tableau

#### Storymapping Data
Find a dataset and illustrate how it is created and all of the steps that it goes through to arrive on your computer.

##### Helpful Links from Class Slack Channel


 |Topic/Link|Added by|
|--|--|
|[Intro to Geospatial Data with  Python](https://github.com/SocialDataSci/Geospatial_Data_with_Python/blob/master/Intro%20to%20Geospatial%20Data%20with%20Python.ipynb)| Hunter|
|[Essential Python Geospatial Libraries](https://github.com/SpatialPython/spatial_python/blob/master/packages.md)| Justin|
|[Working with Shape Files](http://basemaptutorial.readthedocs.io/en/latest/shapefile.html)| Ben|
|[Fast Geospatial Analysis in Python](http://matthewrocklin.com/blog/work/2017/09/21/accelerating-geopandas-1)| Justin|
|[Intro to Python](https://campus.datacamp.com/courses/intro-to-python-for-data-science/chapter-1-python-basics?ex=1)| Ben|
|[Natural Language Toolkit](http://www.nltk.org/)| Hunter|	
|[AutoSum: Summarize Publications Automatically](https://github.com/soodoku/autosum)| Hunter|
|[Tesseract OCR](https://github.com/tesseract-ocr/tesseract)|	Hunter|
|[ASCII](https://en.wikipedia.org/wiki/ASCII#/media/File:USASCII_code_chart.png)| Hunter | 
|[ASCII or Binary?](http://www.webweaver.nu/html-tips/ascii-binary.shtml)|	Hunter |
|[Towards Data Science](https://towardsdatascience.com/)| Justin |
|[A Monthly Digest of the Best Data Visualization](http://www.visualisingdata.com/2017/10/best-visualisation-web-august-2017)| Vic	|
 
 
 
 


## Project Pitches

#### Journey-mapping Projects
- https://medium.com/code-for-america/journey-mapping-our-citys-permitting-process-e20a0823ecc3
- https://uxmastery.com/how-to-create-a-customer-journey-map/
- https://www.smashingmagazine.com/2015/01/all-about-customer-journey-mapping/
- Make an Excel chart. What is the frequency of each pathway to purchase for each membership plan? Map, online, app.
- Survey Monkey - create an end of sprint survey so that all of us in the BDA cohort can give feedback. How did data go from concept to Survey Monkey, Excel, R?
- Autorenewing bikeshare memberships. $$. Avoid user fees. Most popular ways, opt into autorenew, newsletters - link. Text reminder if under 60/30 minutes. Helpful to know this. Categorizing the process for autorenew. Autorenewed after being charged the user fee. More categories.
- Residents don't know about the monthly plan options at kiosks. Look at the stations that people sign up at with zip codes in Hawaii. Can develop programs at the stations. Specific to the station. Figuring out what stations.
- Sources of data about perceptions of bikeshare from low income... have several options. Data format to analyze?
- Collect data in life throughout the week and color postcards (Dear Data)
- People viewing an Ad. What does it mean when people view the Ad. 10,158 have seen it. Who are they?
- Z-Score: Where does the data that goes into the formula for calculating come from?
- What data does real estate utilize to determine where to list?
- Statistical model for starting on fantasy football team. Gather data from different sources. (Historical performance, head to head matchups, live weather data). Data that affects scores.
- External Benchmarking - How did this email do? Determining where to get the data. Getting to raw sources. At industry level. Email, website bounce rates, social media (open rate, read rate, engagement by channel). Vendor platforms pull from their clients.
- Internal KPI determination. Determine KPIs from external sources. Lead counting, CRM, Adwords, marketing spend (API's). Python or R?

#### "Structure of data" Explorations
- Personality Types - Relationship, Myers-Briggs. What kind of data?
- Birthday Cards - Month and Day birthday data, Python or R to convert to month and day data. Export from a report at work. Write to the same Excel file (maybe expand)
- Existing training surveys. Worked with in Excel. But now work with it in R.
- Convert a file from XML to EXCEL and EXCEL into XML. What does it look like. What do you type in to do the conversion.
- Environmental Assessments (EA) Impact statements (EIS) - different geographic areas (e.g. wetlands, hazards, special management, other land use terms). Less cumbersome mapping. and implications. Match the actions to geographic indicators. Possible Applications : Permitting, Geographic Overview. Automating the process. 
- Quarterly financial data for companies into a standard format. XML, JSON.
- Compare the format and structure of news articles. Differences and Similarities between takes from different news media sources. Grab HTML news articles. Read the tags and parse out pieces of information from the HTML. Title, date. Body. Put it into different file formats.
-  Compare the structure and format of a few companies from different web services. NASDAQ. YAHOO. Put it into a usable format. 
- Download economic reporting calendars and compare the formatting differences. Need to know when the government is going to report their earnings. Build a calendar to know when to pull things down.

#### File i/o (input / output projects)
- Convert a file from XML to EXCEL and EXCEL into XML. What does it look like. What do you type in to do the conversion.
- Existing training surveys. Worked with in Excel. But now work with it in R.
- Birthday Cards - Month and Day birthday data, Python or R to convert to month and day data. Export from a report at work. Write to the same Excel file (maybe expand)

#### Compiling Data
- Energy Efficiencies for different water sources. Fresh water and ways to make it. Energy related processes across the kinds of water that are available. Pull data on it. Hunt it down, and compile. Compare the amount of effort / energy have to expend to get it to a drinkable state. Wastewater reuse - stormwater reuse - pumping water from aquifers - desalination. 
- Statistical model for starting on fantasy football team. Gather data from different sources. (Historical performance, head to head matchups, live weather data). Data that affects scores.
- Extrapolate availability considering use and current availability. (per capita use, population). Pull the data, formats, work with it. End goal: Forecast
- Gather Stock performance data from different sources. 
- Start from Scratch - What data should we be looking at? Other boot camps have thought about this. Tracing back the report findings that you see out there. (Flatiron coding school article). Committee on outcomes. Raw data Dumps. PDFs. Determining how to pull the data from PDFS and learning how to extract it into a table form. Script in python to scrape. TABPY (plugin) - Autosum.py?
- External Benchmarking - How did this email do? Determining where to get the data. Getting to raw sources. At industry level. Email, website bounce rates, social media (open rate, read rate, engagement by channel). Vendor platforms pull from their clients.
- Internal KPI determination. Determine KPIs from external sources. Lead counting, CRM, Adwords, marketing spend (API's). Python or R?
- Environmental Assessments (EA) Impact statements (EIS) - different geographic areas (e.g. wetlands, hazards, special management, other land use terms). Less cumbersome mapping. and implications. Match the actions to geographic indicators. Possible Applications : Permitting, Geographic Overview. Automating the process. 
- Quarterly financial data for companies into a standard format. XML, JSON.
- Paid social optimization - how much money do you need. Export the campaign data to determine an optimal measure of spend. Shares, likes, reach. Python or R. Salesforce + Mailchimp. Combining the two datasets

#### Connecting to APIs
- Pull data from a USDA API. List of organic farms, isolate address data. Matching to TMK. Map in python or R.
- Program to fill out Family Medical Leave Act form in Python/R?
- Tedious process of combining data from multiple spreadsheets. Pull analytics from github accounts using R.  API to github.
- Internal KPI determination. Determine KPIs from external sources. Lead counting, CRM, Adwords, marketing spend (API's). Python or R?
- Paid social optimization - how much money do you need. Export the campaign data to determine an optimal measure of spend. Shares, likes, reach. Python or R. Salesforce + Mailchimp. Combining the two datasets

#### Data Generation Projects
- Log data for the most frequently requested forms and keep them out and easily accessible
- Survey Monkey - create an end of sprint survey so that all of us in the BDA cohort can give feedback. How did data go from concept to Survey Monkey, Excel, R?

#### Geo data formats Themed Projects
- Convert a shapefile (local rainfall) to a CSV with lat and lon, using Python and R.
- Convert geospatial data into other formats. Python libraries - shapely and fiona.
- Crop suitability shapefile -> Select a part of it -> calculate some statistics. About a bunch of files. Automate process. Compare GIS, python and R.
- Water Resource / Availability - competing ideas. Different geographic. Board of Water Supply has an infographic. Trace back the data sources. Understand the data files that make it up. Compare them again to make a new one and communicate that idea.
- Pull data from a USDA API. List of organic farms, isolate address data. Matching to TMK. Map in python or R.
- NetCDF Files - Planting dates of soy. Day planted and harvested. Read in with the Python library. The NetCDF file is in binary. Bring it into a dataframe. Write it to disk. Interact with the terminal. Compare the size of the file in ASCII vs Binary. Water quality shapefile, data set.
- Mapping points of interest near properties - amenities by walking distance. Visually demonstrate the walking distances.

#### Working with PDFs
- PDF of agricultural dedications. TMK data, etc. Extract TMKs (and/or address), and Geocode into lat, lon. Map using python or R. Ideally a TMK and a detailed shape.
- Look at the outcomes from vocational schools and compare to graduates from the UH system - comparative analysis. Open source file that is also PDF.
- Download information from the US Treasury site. PDF's OCR. Extract information.
- Program to fill out Family Medical Leave Act form in Python/R?
- Read Faster - OCR to find keywords. Investigate ways to programmatically search and make the work less tedious. Automate it more. One application is the legislative postings - to be able to search capitol.hawaii.gov. (Tesseract)
- Start from Scratch - What data should we be looking at? Other boot camps have thought about this. Tracing back the report findings that you see out there. (Flatiron coding school article). Committee on outcomes. Raw data Dumps. PDFs. Determining how to pull the data from PDFS and learning how to extract it into a table form. Script in python to scrape. TABPY (plugin) - Autosum.py?

#### Aesthetic Projects
- Make fancy chart for parents. "Cool". "Pretty"
- Collect data in life throughout the week and color postcards (Dear Data)
- Lead Volume Comparison - visualization. Leads per tower on tableau. Looking for velocity. Marketing spend.








