[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nishantsule/Data-Visualization-Workshop/blob/master/Data-Viz-FundamentalsReview.ipynb)

# Fundamentals of Data Visualization using Python

## A brief introduction to dealing with data in Python

First off, let's review some basics of Python and some useful libraries that we will use to work with data and create plots.

If you have programmed in Python before, these initial parts may be slow. If you haven't used Python before, this might be a bit challenging, but it should hopefully help you get started.

You are running this notebook on [Colab](https://colab.research.google.com). You will notice buttons in the top left that say "+ Code" and "+ Text". The "+ Code" button adds a code cell and the "+ Text" button adds a text cell. If you want to try them out, select this cell by clicking on it, then press the "+ Text" button. A new cell should appear below this one.

You can type in the cell and use the buttons to format it, or you can mark up the text using [Markdown](https://www.markdownguide.org/basic-syntax/). When you are done, hold down Shift and press Enter, which will format the text you just typed and then move to the next cell. (Shift + Enter is also used to run code cells, as we will see below.)

The first thing to try out with Python is that it can be used as a scripting language. With numbers in particular, it can be used simply as a calculator. Create a code cell below and try out some simple arithmetic operations. The operators for basic arithmetic are `+`, `-`, `*`, and `/`, representing addition, subtraction, multiplication, and division, respectively. To run a code cell, you can click the "Play" button on the left of the cell or simply hold Shift and press Enter.

In [2]:
34 + 6 / 10

34.6

In [3]:
(34 + 6) / 10

4.0

Notice that parentheses are important because they determine the order of arithmetic operations.
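
The two cells above can be restated as a small sketch that stores each result under a name so the values can be compared side by side. The variable names `without_parens` and `with_parens` are made up for this illustration.

```python
# Division is evaluated before addition unless parentheses say otherwise.
without_parens = 34 + 6 / 10    # 6 / 10 is evaluated first
with_parens = (34 + 6) / 10     # the parentheses force the addition first

print(without_parens)  # 34.6
print(with_parens)     # 4.0

# The / operator always returns a float, which is why the result
# prints as 4.0 rather than 4.
print(type(with_parens))
```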

### NumPy

Next, we will look at `NumPy`, a very useful library for more advanced mathematical computations. We can import any library in Python using the `import` keyword as shown below.

In [4]:
import numpy as np

Once imported, the constants and functions that the library provides can be accessed using the `np.` prefix, as you can see below.

In [5]:
print('Pi =', np.pi)
print('Sin(pi/2) =', np.sin(np.pi / 2))

Pi = 3.141592653589793
Sin(pi/2) = 1.0


`NumPy` is most useful because of its arrays. An array is a collection of numbers on which you can perform an operation on all elements at once or access elements one at a time. Two very useful functions for generating arrays are `linspace` and `arange`. `linspace` produces evenly spaced numbers given a *start*, a *stop*, and the *number of points*, whereas `arange` produces evenly spaced numbers given a *start*, a *stop*, and the *step* between consecutive points. Run the examples below to see this in action. **Note** that `arange` does not include the last point, i.e. *stop*, whereas `linspace` includes it by default.

In [6]:
print(np.linspace(0, 5, 6))  # 0=start, 5=stop, 6=number of points

[0. 1. 2. 3. 4. 5.]


In [7]:
print(np.arange(0, 5, 1))  # 0=start, 5=stop (non inclusive), 1=step

[0 1 2 3 4]
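
As a minimal sketch of the two array features mentioned above (operating on all elements at once and accessing elements one at a time), the following uses the same two functions with different arguments. The names `a` and `b` are chosen only for this illustration.

```python
import numpy as np

a = np.linspace(0, 1, 5)   # 5 points from 0 to 1, both endpoints included
b = np.arange(0, 10, 2)    # steps of 2 starting at 0; the stop value 10 is excluded

print(a)
print(b)

# An arithmetic operation on an array is applied to every element at once.
print(b * 10)

# Individual elements are accessed by position, counting from 0.
# A negative index counts from the end of the array.
print(b[0])    # 0
print(b[-1])   # 8
```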


### Saving your work

On Colab press the "Copy to Drive" button (near the upper left), which saves a copy of this notebook in your Google Drive.
If you want to change the name of the file, you can click on the name in the upper left.
If you don't use Google Drive, look under the File menu to see other options.
Once you make a copy, any additional changes you make will be saved automatically, so now you can continue without worrying about losing your work.

### Variables: Numeric, Strings, Date/Times, Location

A variable is essentially a name that refers to a value. The value of a variable can take many different forms. Some of the more commonly used types are numeric, strings, date/times, and locations. Let's explore these below.

**Numeric**

In [8]:
x = np.arange(0, 5, 1)
y = np.exp(x)
print('x =', x)
print('e^x =', y)

x = [0 1 2 3 4]
e^x = [ 1.          2.71828183  7.3890561  20.08553692 54.59815003]


**Strings**

In [9]:
names = ['Sanders', 'Biden', 'Warren', 'Buttigieg', 'Klobuchar', 'Bloomberg']
print('Democratic presidential candidates:', names)

Democratic presidential candidates: ['Sanders', 'Biden', 'Warren', 'Buttigieg', 'Klobuchar', 'Bloomberg']


**Date/Times**

Dates, times, and time differences (timedeltas) are frequently used variable types. Although Python has native support for them in the built-in `datetime` module, we will use a different library called `pandas`. The name `pandas` derives from "panel data" and is also a play on "Python data analysis". The library is extremely useful for working with large datasets, which often contain datetime variables that `pandas` can help manipulate. Import `pandas` as shown below.

In [10]:
import pandas as pd

`pandas` provides a `Timestamp` class that can convert many common string formats into datetime values, as shown below. When the string contains only a time, the date defaults to the day on which the code is run.

In [11]:
pd.Timestamp('10:15:00')

Timestamp('2020-03-10 10:15:00')

In [12]:
date_of_birth = pd.Timestamp('July 9, 1984, 7:42pm')
print(date_of_birth)

1984-07-09 19:42:00
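
As a small sketch of how flexible the parsing is, the same moment can be written in several ways. The names `a`, `b`, and `c` are illustrative, and the third form passes the components as keyword arguments instead of a string.

```python
import pandas as pd

a = pd.Timestamp('July 9, 1984, 7:42pm')         # free-form text
b = pd.Timestamp('1984-07-09 19:42')             # ISO-style string
c = pd.Timestamp(year=1984, month=7, day=9, hour=19, minute=42)  # explicit components

# All three describe the same moment, so they compare equal.
print(a == b == c)  # True
```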


For a datetime variable, you can find the year, month, day, name of the day, name of the month, etc., as shown below.

In [13]:
print(date_of_birth.year)
print(date_of_birth.month)
print(date_of_birth.day)

1984
7
9


In [14]:
print(date_of_birth.day_name())
print(date_of_birth.month_name())

Monday
July


You can also do arithmetic with datetime variables. Subtracting one `Timestamp` from another gives a `Timedelta`, and you can obtain all the parts of a `Timedelta` using its `components` attribute, as shown below.

In [15]:
now = pd.Timestamp.now()
age = now - date_of_birth
print(age)

13027 days 18:58:01.584548


In [16]:
age.components

Components(days=13027, hours=18, minutes=58, seconds=1, milliseconds=584, microseconds=548, nanoseconds=0)
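
Because `pd.Timestamp.now()` changes every time the cell is run, the output above will differ for you. The sketch below uses two fixed dates, chosen arbitrarily for illustration, so that the result is reproducible.

```python
import pandas as pd

start = pd.Timestamp('2020-01-01')
end = pd.Timestamp('2020-03-10 12:00')

elapsed = end - start             # subtracting two Timestamps gives a Timedelta
print(elapsed)                    # 69 days 12:00:00
print(elapsed.days)               # 69 (whole days only)
print(elapsed.components.hours)   # 12 (the leftover hours)
print(elapsed.total_seconds())    # the full duration expressed in seconds

# A Timedelta can also be added to a Timestamp to move forward in time.
print(start + pd.Timedelta(days=30))  # 2020-01-31 00:00:00
```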

**Location**

We can store the latitude and longitude of any location as a pair of numbers. Python has libraries that can convert such pairs into geographic information and, for example, let us place points on a map. We will look at one such library, `geopandas`, later in the workshop. For now, note that Python lets us group more than one value under a single variable name. Such a group is called a tuple.

In [17]:
lat = 42.3601
lon = -71.0589
boston = lat, lon
print(boston)

(42.3601, -71.0589)
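
Once a tuple exists, its values can be read back by position or unpacked into separate names. The following is a minimal sketch using the same coordinates; the names `boston_lat` and `boston_lon` are made up for this illustration.

```python
lat = 42.3601
lon = -71.0589
boston = (lat, lon)   # the parentheses are optional but make the tuple explicit

# Elements are accessed by position, counting from 0.
print(boston[0])  # 42.3601
print(boston[1])  # -71.0589

# Unpacking assigns each element of the tuple to its own name.
boston_lat, boston_lon = boston
print(boston_lat, boston_lon)

# A tuple cannot be changed after it is created; item assignment raises TypeError.
try:
    boston[0] = 0.0
except TypeError as err:
    print('Cannot modify a tuple:', err)
```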


**Side Note: Lists**

Python provides another handy way to store a sequence of elements, which is known as a list.
To create a list, you put a sequence of elements in square brackets. A list is very useful in scenarios where the number of elements is changing or is unknown. A list can also be used to store elements of different types. We can add an element at the end of a list using the `append` method.

In [18]:
list1 = []  # creates an empty list
print(list1)

[]


In [19]:
list2 = [1, 2, 3, 'Hello']  # creates a list with provided elements
print(list2)

[1, 2, 3, 'Hello']


In [20]:
list2.append('World')
print(list2)

list2.append(4)
print(list2)

[1, 2, 3, 'Hello', 'World']
[1, 2, 3, 'Hello', 'World', 4]
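
A common pattern, and one that is handy for the exercises that follow, is to start with an empty list and fill it inside a `for` loop. The sketch below assumes nothing beyond plain Python; the names `items` and `squares` are illustrative.

```python
items = [1, 2, 3, 'Hello']
items.append('World')

print(len(items))   # 5, the number of elements
print(items[0])     # 1, the first element
print(items[-1])    # World, the last element

# Build a list step by step when the number of elements is not known ahead of time.
squares = []
for n in range(5):          # n takes the values 0, 1, 2, 3, 4
    squares.append(n ** 2)  # ** is the power operator
print(squares)      # [0, 1, 4, 9, 16]
```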


Let's bring what we have looked at so far together with a couple of small exercises.

### Exercise 1

**A.** Convert each element of the two lists below to a `Timestamp`.

*Hint:* This requires writing a `for` loop.

In [21]:
date_of_birth = ['June 14, 1946', 'Aug 4, 1961', 'July 6, 1946', 
                 'Aug 19, 1946', 'June 12, 1924', 'Feb 6, 1911', 'Oct 1, 1924']
year = [2017, 2009, 2001, 1993, 1989, 1981, 1977]

In [22]:
# Solution

for i in range(len(date_of_birth)):
    date_of_birth[i] = pd.Timestamp(date_of_birth[i])  # parse the date string
print(date_of_birth)

for i in range(len(year)):
    year[i] = pd.Timestamp(str(year[i]))  # a bare year is parsed as January 1 of that year
print(year)

[Timestamp('1946-06-14 00:00:00'), Timestamp('1961-08-04 00:00:00'), Timestamp('1946-07-06 00:00:00'), Timestamp('1946-08-19 00:00:00'), Timestamp('1924-06-12 00:00:00'), Timestamp('1911-02-06 00:00:00'), Timestamp('1924-10-01 00:00:00')]
[Timestamp('2017-01-01 00:00:00'), Timestamp('2009-01-01 00:00:00'), Timestamp('2001-01-01 00:00:00'), Timestamp('1993-01-01 00:00:00'), Timestamp('1989-01-01 00:00:00'), Timestamp('1981-01-01 00:00:00'), Timestamp('1977-01-01 00:00:00')]


**B.** You perhaps guessed this already, but if not read on. The `date_of_birth` list contains dates of birth of the last seven US Presidents and the `year` list contains the year they took office. Now create a list called `age` and populate it with the age of the last seven US Presidents when they took office.

*Hint:* You can find the difference between two `Timestamp` values to calculate a `Timedelta`. The `Timedelta` will be in days, so to approximate the age in years, divide the number of days by 365 and round the value to an integer.

In [40]:
# Solution

age = []
for i in range(len(year)):
    age.append(year[i] - date_of_birth[i])
print(age)

[Timedelta('25769 days 00:00:00'), Timedelta('17317 days 00:00:00'), Timedelta('19903 days 00:00:00'), Timedelta('16937 days 00:00:00'), Timedelta('23579 days 00:00:00'), Timedelta('25532 days 00:00:00'), Timedelta('19085 days 00:00:00')]
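
The solution above stops at the `Timedelta` values. One possible way to finish the conversion suggested in the hint is sketched below for the first two entries only. The names `births`, `office_start`, and `age_in_years` are chosen for this illustration, and the result is only approximate: each year is parsed as January 1, leap days are ignored, and `round` rounds to the nearest integer rather than down to completed years.

```python
import pandas as pd

# First two entries of the exercise lists, already converted to Timestamps.
births = [pd.Timestamp('June 14, 1946'), pd.Timestamp('Aug 4, 1961')]
office_start = [pd.Timestamp('2017'), pd.Timestamp('2009')]

age_in_years = []
for i in range(len(office_start)):
    delta = office_start[i] - births[i]            # a Timedelta
    age_in_years.append(round(delta.days / 365))   # approximate: ignores leap days
print(age_in_years)  # [71, 47]
```

Replacing `round(...)` with `int(...)` would truncate instead, giving the number of completed 365-day years.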


## Principles of effective data visualizations

***1. Have graphical integrity***

(Do not try to mislead using data visualizations)

***2. Keep it simple***

(Avoid chart junk, maximize data-ink ratio)

***3. Adapt to your audience***

(Know your audience or make it neutral)

***4. Use the right display***

(Appropriate plot types)

***5. Use color strategically***

(Do not overuse color, consider colorblindness)