In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Data Management & Visualization Fall 2023

Welcome to **Data Management & Visualization**, a one-semester crash course on working with and visualizing data. Throughout this course we will work through many topics in data science and software development, learning how to use code to perform data analysis and answer questions for us.

### Contents:

* [Introduction](#introduction)
* [What is Data?](#what-is-data)
* [What is Data Management?](#what-is-data-management)
* [What is Data Visualization?](#what-is-data-visualization)
* [What isn't Data Visualization?](#what-isnt-data-visualization)

<a id='introduction'></a>
## Introduction

### Prerequisites

* Register for a free account at [`GitHub`](https://www.github.com). **This is mandatory**.
    * Send me your GitHub username in an email.
* Register for a free account at [`Kaggle`](https://www.kaggle.com). **This is optional**.

### Syllabus

Our syllabus can be found in two places:

* [GitHub](https://github.com/ruc-data-viz/data-viz-syllabus)
* [Canvas](https://rutgers.instructure.com/courses/254127/files/33955545)

The GitHub version will always be the most up-to-date version. I will do my best to update the Canvas version whenever changes are made, but if you are ever unsure, just consult the GitHub version.

### Development Environment Set Up

You are free to use *any* development environment you wish. There are a handful of highly recommended ways to work through material in this course:

* GitHub Codespaces
* Conda
* Docker + Dev Container

#### GitHub Codespaces (Recommended)

This is the simplest and quickest way to get started with this class! GitHub Codespaces are free and integrated directly into the course, and are preconfigured with everything you need to work from *anywhere* that has an active internet connection. The drawbacks? You need an active internet connection, and sometimes they can be slow due to resource limitations. Every repository available in this course is configured to run within a Codespace for convenience.

This is what we will use today.

#### Conda

Conda is an environment and package manager for Python. You are free to install *Anaconda* (or *Miniconda*, for more advanced usage) and build the development environment locally on your PC. This setup is more complicated, but it only needs to be done once. Your laptop is likely much faster than a Codespace, and you do not need an active internet connection to work. However, you would need to keep your environment up-to-date yourself.

#### Docker + Dev Container

This offers benefits similar to Conda, with the added bonus that keeping your environment up-to-date is easier. However, the initial setup is much more difficult and you will need a relatively modern PC. This is for advanced users who already know how to use dev containers.

<a id='what-is-data'></a>
## What is Data?

Data is simply information, and information can come in many forms. It can be qualitative or quantitative. For computing systems, data can be *structured*, *unstructured*, or somewhere in between.

* Structured data follows some relational format, usually optimized for storage and/or access.
    * e.g. databases, spreadsheets, etc.
* Unstructured data is data that is simply not structured. It is much more common than structured data: most data is unstructured.
    * e.g. images, audio, video, sensor data, surveys, etc.
* Semi-structured data may be formatted using structures and standardized formats, but lacks a complete structure in its data model.
    * e.g. JSON, YAML, XML, CSV, etc.
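The difference between structured and semi-structured data can be sketched in a few lines. This is only an illustration: the records, field names, and values below are made up, and it uses nothing beyond Python's built-in `json` module.

```python
import json

# Structured: every row has the same columns (hypothetical values for illustration).
columns = ("name", "age")
rows = [("Ada", 36), ("Grace", 45)]
structured = [dict(zip(columns, row)) for row in rows]

# Semi-structured: JSON has a standard syntax, but each record may carry
# different or nested fields.
raw = (
    '[{"name": "Ada", "age": 36},'
    ' {"name": "Grace", "contact": {"email": "grace@example.com"}}]'
)
semi_structured = json.loads(raw)

# Collect the distinct sets of keys used by the records in each collection.
structured_keys = {frozenset(record) for record in structured}
semi_keys = {frozenset(record) for record in semi_structured}

print(len(structured_keys))  # number of distinct record shapes in the table
print(len(semi_keys))        # number of distinct record shapes in the JSON
```

The structured rows all share one shape, while the JSON records do not, even though both are perfectly valid in their own format.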

<a id='what-is-data-management'></a>
## What is Data Management?

Data management is the process of loading, transforming, and storing data. For a myriad of reasons, raw data is rarely in a usable form.

### Loading Data

When working with data, it needs to be loaded into some context. In the most basic scenario this means loading data from a file and storing its contents in the computer's memory. However, data can be too large for some computing systems to do just that, and it may not exist as a file that can be opened. We may query a large database for a subset of data. We may load data from a remote server using network requests, leveraging *pagination* to iteratively pull fragments of the data one at a time. We may need to map the data stored on a hard drive to our computer's memory. We may even need to asynchronously distribute the loading of the data to multiple systems!

Some examples:

* Querying a census database for demographics data for New Jersey from 2000 through 2010.
* Loading a CSV containing NFL players and their fantasy draft points and statistics.
* Loading 10,000 CSV files containing survey results for 10,000 different districts across the country, each survey containing on average 50k entries.
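As a rough sketch of the first example, the snippet below queries a database for a subset of rows and pulls the result a page at a time rather than all at once. It uses an in-memory SQLite database as a stand-in for a real census database. The table name, column names, and population figures are all invented for illustration, and `load_in_pages` is a hypothetical helper, not part of any library.

```python
import sqlite3

# Build a tiny in-memory database to stand in for a large census database.
# The table, columns, and figures are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (state TEXT, year INTEGER, population INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?, ?)",
    [
        ("NJ", 1990, 100),
        ("NJ", 2000, 110),
        ("NJ", 2005, 115),
        ("NJ", 2010, 120),
        ("NY", 2005, 300),
    ],
)


def load_in_pages(connection, page_size=2):
    """Yield the New Jersey rows for 2000 through 2010, one page at a time."""
    cursor = connection.execute(
        "SELECT state, year, population FROM census "
        "WHERE state = ? AND year BETWEEN ? AND ? ORDER BY year",
        ("NJ", 2000, 2010),
    )
    while True:
        page = cursor.fetchmany(page_size)
        if not page:
            break
        yield page


pages = list(load_in_pages(conn))
rows = [row for page in pages for row in page]
conn.close()

print(len(pages), "pages,", len(rows), "rows")
```

Fetching in pages keeps only a small fragment of the result in memory at any moment, which is the same idea behind paginated network requests.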

### Transforming Data



Data is rarely in a form that is readily usable and beneficial to analytics. We usually need to distill large datasets into smaller ones that are more focused on answering a specific question. Data may contain noise, outliers, and other unwanted data points that need to be identified and filtered out. Transforming data is meant to handle *all of this*. Oftentimes we need to fuse multiple datasets together, and to do so we not only need to load the various pieces of data, but also need to combine them in ways that make sense. Transformations may be computationally costly, so we need to be actively aware of what we are doing to our data and how we are doing it.

Some examples:

* Joining a temperature sensor, humidity sensor, and air quality sensor on their time metric. [Using Pandas and NumPy here]
* Smoothing imaging data to remove impurities due to imperfect apparatus.
* Aggregating test scores to compute the average score, total score, and standard deviation of a test group.
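A minimal sketch of the join, filter, and aggregate steps described above, using Pandas. The sensor readings, column names, and the 50 degree cutoff are all assumptions made up for this illustration. Note that the Pandas `std` aggregation computes the sample standard deviation.

```python
import pandas as pd

# Hypothetical readings from three sensors; all values are made up.
temperature = pd.DataFrame({"time": [0, 1, 2, 3], "temp_c": [20.0, 21.0, 22.0, 95.0]})
humidity = pd.DataFrame({"time": [0, 1, 2, 3], "humidity": [40.0, 42.0, 41.0, 43.0]})
air_quality = pd.DataFrame({"time": [1, 2, 3], "aqi": [12, 15, 14]})

# Join: keep only the timestamps that all three sensors share (inner join).
merged = temperature.merge(humidity, on="time").merge(air_quality, on="time")

# Filter: drop an obviously bad reading (the cutoff is an arbitrary choice).
clean = merged[merged["temp_c"] < 50.0]

# Aggregate: summarize what is left.
summary = clean["temp_c"].agg(["mean", "sum", "std"])

print(merged)
print(summary)
```

The inner join silently discards the reading at time 0 because the air quality sensor has no matching row, which is exactly the kind of side effect worth being aware of when combining datasets.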

### Storing Data

[Not very relevant.]
[We should save the data.]

[Saving to YAML, JSON, CSV, etc.]

Once we have loaded and transformed our data, we very likely want to store the result. Storing data is important so that we do not need to reload and transform it again, which can be expensive operations to perform. For example, it may take 5 minutes to load and transform the data - should you need to wait 5 minutes every time you want to view your analysis? (No!)

Storing data can be as simple as saving your data as a file (e.g. CSV, JSON, YAML, etc.) or pushing data to a database (INSERT/UPDATE/UPSERT). It may also become more complicated as the size and complexity of the data grows. Should the data be stored in a way optimized for disk space or for load times? How flat or structured is the data? Depending on the answers to these questions, you will find you have many options for storing your data.

Some examples:

* Pushing sensor data to a database.
* Storing 10 TB of data as *Parquet* files to archive it for long-term use.
* Storing 10 GB of data as *Arrow* files to improve the rapid and iterative loading of the data.

[Parquet files are optimized for very large datasets.] <br>
[Arrow is for smaller ones.] <br>
[We will only use them through Pandas.]
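The "do not wait 5 minutes every time" idea can be sketched as a simple cache: run the expensive step once, store the result as a file, and reload it afterwards. This is one possible pattern, not a prescribed one. The function names, the file name, and the scores are hypothetical, and JSON is used only because it needs no extra libraries.

```python
import json
import tempfile
from pathlib import Path

calls = {"count": 0}  # tracks how often the expensive step actually runs


def load_and_transform():
    """Stand-in for an expensive load-and-transform step (values are made up)."""
    calls["count"] += 1
    scores = [70, 85, 90]
    return {"scores": scores, "average": sum(scores) / len(scores)}


def cached_result(cache_path):
    """Reuse the stored result if it exists; otherwise compute it and store it."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    result = load_and_transform()
    cache_path.write_text(json.dumps(result))
    return result


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "result.json"
    first = cached_result(cache_file)   # computes the result and stores it
    second = cached_result(cache_file)  # loads the stored result from disk

print(first == second, calls["count"])
```

The second call never touches `load_and_transform`, so the cost of loading and transforming is paid only once.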

<a id='what-is-data-visualization'></a>
## What is Data Visualization?

Data visualization is the use of graphical elements to represent data - graphical elements may be plots, charts, images, and more. Usually data is too complex, too large, or some combination of the two to comprehend directly. We can use visual elements to represent large quantities of data in small visual spaces, or to visualize complex relationships between various pieces of data. Data visualization allows us to see our data in ways that the raw data does not allow for. Consider the following dataset on iris flowers:

In [1]:
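import pandas as pd
# Load the iris dataset from a remote CSV file into a DataFrame
iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
# Display the DataFrame as the cell output
iris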

# There are three different species of flower here

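| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |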


What can we conclude from this data table? We can see the columns and the number of rows, but it's just too large. Simply looking at the raw data is unhelpful. But if we visualize it...

In [2]:
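import hvplot.pandas  # importing this registers the .hvplot accessor on DataFrames
# Scatter plot of petal length vs. petal width, colored by species
iris.hvplot.scatter(x='petal_length', y='petal_width', by='species')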

Here we have taken our raw data and constructed a *scatter plot* that places petal length on the x-axis against petal width on the y-axis, with each point colored by the species of flower it belongs to. This visual, while omitting *sepal* data, is highly informative! We can clearly see a natural clustering within the data that shows a clear delineation of sizes between the various species within the dataset. By staring at the raw data you *may* be able to draw the same conclusion, but by visualizing the data we can instantly see the relationships in it. What if our data had 1000 entries and/or 10 columns? How could we possibly reason about our data by staring at it?

*We aim to tell true stories about our data*. Telling a story about data requires more than just plotting data - it requires finding ways to intuitively represent (usually many) different aspects of our data that allow users to understand as much about the data as they can.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/29/Minard.png/1920px-Minard.png)

```text
Charles Minard's map of Napoleon's disastrous Russian campaign of 1812. The graphic is notable for its representation in two dimensions of six types of data: the number of Napoleon's troops; distance; temperature; the latitude and longitude; direction of travel; and location relative to specific dates. Statistician professor Edward Tufte described the graphic as what "may well be the best statistical graphic ever drawn".
```
[https://en.wikipedia.org/wiki/Charles_Joseph_Minard](https://en.wikipedia.org/wiki/Charles_Joseph_Minard)


<a id='what-isnt-data-visualization'></a>
## What isn't Data Visualization?

Data visualization is not *marketing*. Marketing graphics are usually focused on highlighting partial conclusions to draw in viewers. Critical information is often skewed in ways that visually distort the message and convince the viewer of partial truths or false conclusions.


[Talks about the *lie factor*: how we skew the data.]

[Book: there is a nice description of this in the book.]
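The lie factor mentioned above is commonly attributed to Edward Tufte, who defines it as the size of the effect shown in the graphic divided by the size of the effect in the data. A value near 1 means the graphic represents the data faithfully. The sketch below is one way to compute it; the function names are my own, and the numbers come from Tufte's well-known fuel-economy example.

```python
def effect_size(first, second):
    """Relative change from the first value to the second."""
    return abs(second - first) / abs(first)


def lie_factor(data_first, data_second, graphic_first, graphic_second):
    """Ratio of the effect shown in the graphic to the effect in the data."""
    return effect_size(graphic_first, graphic_second) / effect_size(data_first, data_second)


# Tufte's fuel-economy example: the standard rose from 18 to 27.5 mpg,
# while the lines drawn to represent it grew from 0.6 to 5.3 inches.
factor = lie_factor(18.0, 27.5, 0.6, 5.3)
print(round(factor, 1))  # 14.8
```

Here the data grew by about 53%, but the graphic grew by about 783%, so the graphic overstates the change by a factor of nearly 15.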