## Exercise 02 - Best charts to represent different data or dataset types.
Each type of data or dataset is best visualized by certain kinds of charts, depending on both the target audience and the personal preferences of the data visualizer. In this exercise you will first simulate different types of data and datasets in Python. Randomization is useful for understanding certain statistical concepts and as a basis for random sampling, which may be required when dealing with big data. Based on these simulated data, the exercise consists of choosing the type of chart that you find most adequate to represent each type of data and dataset.

The objectives of this exercise are to:

1. identify the type of each variable and table that was created.
2. try your best to interpret each line of the code provided.
3. based on these simulated data, insert markdown cells into this notebook stating the type of chart that you find most adequate to represent the different types of data and datasets, justifying your choices. You should identify the different axes of the plot, if applicable. Don't forget that drawing sketches might help! You may get some help from this site: https://datavizproject.com/

You will use two modules that provide pseudo-random number generators to implement random sampling routines. Have a look here (random module) and here (numpy.random module). Both can simulate data and take random samples, although np.random offers more pseudo-random generator methods.

To run the simulations, first import the pandas, numpy and random modules (if you have not installed pandas and numpy yet, you will need to install them beforehand; random is part of the Python standard library).

In [1]:
import pandas as pd
import numpy as np
import random

In [2]:
# Simulate var1
var1 = []
random.seed(24) # optional: used to fix the seed of the pseudo-random number generator (use any number of your choice)
levels = ["Permanent crops", "Irrigated crops", "Managed Forest", "Natural Forest", "Agro-Forestry system", "Urban", "Pasture", "Shrubland"]
for _ in range(100): # a loop is needed because random.sample selects unique elements (with no replacement)
    var1 += random.sample(levels, 1) # var1.append(random.sample(levels, 1)[0]) would also work; without [0], append would add a one-element list
print(var1)

['Pasture', 'Managed Forest', 'Natural Forest', 'Managed Forest', 'Natural Forest', 'Managed Forest', 'Irrigated crops', 'Managed Forest', 'Agro-Forestry system', 'Permanent crops', 'Shrubland', 'Shrubland', 'Irrigated crops', 'Permanent crops', 'Managed Forest', 'Shrubland', 'Shrubland', 'Agro-Forestry system', 'Shrubland', 'Irrigated crops', 'Agro-Forestry system', 'Managed Forest', 'Urban', 'Agro-Forestry system', 'Irrigated crops', 'Urban', 'Permanent crops', 'Natural Forest', 'Urban', 'Urban', 'Irrigated crops', 'Agro-Forestry system', 'Irrigated crops', 'Natural Forest', 'Managed Forest', 'Shrubland', 'Agro-Forestry system', 'Agro-Forestry system', 'Agro-Forestry system', 'Natural Forest', 'Irrigated crops', 'Shrubland', 'Managed Forest', 'Managed Forest', 'Urban', 'Shrubland', 'Natural Forest', 'Managed Forest', 'Shrubland', 'Agro-Forestry system', 'Natural Forest', 'Natural Forest', 'Irrigated crops', 'Managed Forest', 'Agro-Forestry system', 'Urban', 'Natural Forest', 'Pasture

In [23]:
# alternative to run a random sampling with replacement (using numpy)
levels = np.array(["Permanent crops", "Irrigated crops", "Managed Forest", "Natural Forest", "Agro-Forestry system", "Urban", "Pasture", "Shrubland"])
sampler = np.random.randint(0, len(levels), 100) # 100 random values within an interval (0 to 7)
var1 = levels.take(sampler) # use sampler to select values from "levels"; take - returns elements from array along the mentioned axis and indices
# print(var1)
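
As a loop-free alternative using only the standard library, `random.choices` draws with replacement directly. This is a minimal sketch, assuming Python 3.6 or later; the seed value and the name `var1_alt` are arbitrary choices for illustration and are not part of the exercise.

```python
import random

levels = ["Permanent crops", "Irrigated crops", "Managed Forest", "Natural Forest",
          "Agro-Forestry system", "Urban", "Pasture", "Shrubland"]

random.seed(24)  # arbitrary seed, only for reproducibility

# random.choices samples WITH replacement, so no explicit loop is needed
var1_alt = random.choices(levels, k=100)

print(len(var1_alt))
print(var1_alt[:5])
```

The result is a plain Python list of 100 strings, so it can be passed to `pd.DataFrame` in the same way as `var1`.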

In [14]:
sampler = np.random.randint(0, len(levels), 100)
type(sampler)

numpy.ndarray

In [24]:
# Simulate var2
np.random.seed(24) # optional: used to fix the seed of the pseudo-random number generator (use any number of your choice)
var2 = np.random.uniform(0, 100, 100)
# print(var2)

In [16]:
# Simulate table1
table1 = pd.DataFrame(var1).value_counts(sort=True)
table1 = table1.rename_axis("landuse")
table1 = table1.reset_index(name="Frequency")
print(table1)

                landuse  Frequency
0       Permanent crops         17
1  Agro-Forestry system         15
2               Pasture         14
3             Shrubland         13
4        Natural Forest         12
5       Irrigated crops         11
6        Managed Forest         10
7                 Urban          8


### My notes
1. Variables: landuse, number of times each landuse type appears in the table
2. Code interpretation:
    - A df is being made from var1 (the land use categories sampled earlier), and value_counts counts how often each type occurs, ordered descending
    - rename_axis names the index (the land use categories) "landuse"
    - reset_index turns that index into a regular "landuse" column and names the counts column "Frequency"
3. Best type of chart: bar chart with land use on the x-axis and the count of each land use type on the y-axis
    - Good for comparing categorical data, easy to visually identify occurrences of each landuse type
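
To see what each of the three steps does, they can be run one at a time on a tiny sample. This is only a sketch: the list `toy` is a made-up example, and `pd.Series` is used in place of `pd.DataFrame` as a simplification.

```python
import pandas as pd

toy = ["Urban", "Pasture", "Urban", "Shrubland", "Urban", "Pasture"]  # hypothetical sample

counts = pd.Series(toy).value_counts(sort=True)   # counts per category, descending
counts = counts.rename_axis("landuse")            # gives the index the name "landuse"
toy_table = counts.reset_index(name="Frequency")  # index -> "landuse" column, counts -> "Frequency"

print(toy_table)
```

After `reset_index`, the categories are an ordinary column and the row index is a plain 0, 1, 2, ... sequence, which is the shape seen in the printed table1.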

In [7]:
# Simulate table2
table2 = pd.DataFrame(list(zip(var1, var2)), columns = ["landuse", "cover"])
print(table2)

                 landuse      cover
0   Agro-Forestry system  96.001730
1                  Urban  69.951205
2   Agro-Forestry system  99.986729
3                Pasture  22.006730
4        Permanent crops  36.105635
..                   ...        ...
95       Irrigated crops  27.560264
96       Irrigated crops  60.397982
97                 Urban  54.597285
98        Natural Forest  20.978981
99  Agro-Forestry system  13.612275

[100 rows x 2 columns]


### My notes
1. Variables: landuse, area of each landuse type
2. Code interpretation:
    - A df is being made using the land use categories that were defined earlier, and a value for cover is assigned based on the var2 numbers
        - The zip() function is being used to pair a land use value with a cover value
    - Column names are assigned
3. Best type of chart:
    - Bar chart with land use on the x-axis and cover on the y-axis. Could also use a pie chart if you wanted to indicate the proportion of each land use cover type
    - Bar chart is best for comparing the cover of each land type
    - Pie chart isn't as good for comparing exact proportions of land cover types, but gives a good general idea
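
Because a table like table2 holds many rows with repeated land use types, a bar chart needs one value per category first. The sketch below assumes the mean cover is the summary of interest (a sum or median would work the same way); the shortened `levels` list, the seed and the name `toy2` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(24)             # arbitrary seed
levels = ["Urban", "Pasture", "Shrubland"]  # shortened, hypothetical list

toy2 = pd.DataFrame({
    "landuse": rng.choice(levels, size=30),
    "cover": rng.uniform(0, 100, size=30),
})

# Collapse to one value per land use type: the bar heights
mean_cover = toy2.groupby("landuse")["cover"].mean().sort_values(ascending=False)
print(mean_cover)
```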

Note: The zip() function returns a zip object, which is an iterator of tuples in which the first items of all passed iterables are paired together, then the second items, and so on. Wrapping it in tuple() turns the iterator into a tuple that can be printed and inspected - try running: print(tuple(zip(var1,var2)))
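
Two properties of zip() are easy to miss: it stops at the shortest input, and the zip object can only be consumed once. A small sketch with made-up lists:

```python
landuse = ["Urban", "Pasture", "Shrubland"]  # hypothetical short lists
cover = [12.5, 40.0]                         # one element shorter on purpose

pairs = zip(landuse, cover)  # a lazy iterator, nothing is paired yet
as_list = list(pairs)        # pairing stops at the shortest input

print(as_list)      # [('Urban', 12.5), ('Pasture', 40.0)]
print(list(pairs))  # [] because the iterator is now exhausted
```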

In [17]:
# print(tuple(zip(var1,var2)))

In [22]:
# Simulate table3
np.random.seed(24) # optional: used to fix the seed of the pseudo-random number generator (use any number of your choice)
year = list(range(1970,2021))
temp = np.random.normal(17,2,51)
table3 = pd.DataFrame(list(zip(year, temp)), columns = ["Year", "Temperature"])
table3.head()

   Year  Temperature
0  1970    19.658424
1  1971    15.459933
2  1972    16.367439
3  1973    15.018379
4  1974    14.858367


### My notes
1. Variables: year, temperature
2. Code interpretation:
   - Create a list of years from 1970 to 2020 (range() excludes the stop value 2021), 51 values in total
   - Create an array of 51 temperature values drawn from a normal distribution with an average of 17 and a SD of 2
   - Create a df by zipping each year to a temperature value
   - Assign column names
3. Best type of chart:
    - Line chart - x = year, y = temperature
        - Good for visualizing temperature change over time
    - Scatterplot - x = year, y = temperature
        - Same as with line plot, just wouldn't have a line connecting the data points
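
The end point of the range and the parameters of the normal draw can be checked directly. This sketch uses NumPy's `default_rng` generator rather than `np.random.seed`, so the numbers will not match the table3 output; the seed is arbitrary.

```python
import numpy as np

years = list(range(1970, 2021))         # the stop value 2021 is excluded
print(years[0], years[-1], len(years))  # 1970 2020 51

rng = np.random.default_rng(24)         # arbitrary seed
temps = rng.normal(loc=17, scale=2, size=len(years))  # mean 17, SD 2

# The sample mean and SD should land near 17 and 2, not exactly on them
print(round(temps.mean(), 2), round(temps.std(), 2))
```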

In [10]:
# Simulate table4
xx = np.array([16,21])
yy = np.array([300, 1200])
means = [xx.mean(), yy.mean()]  
stds = [xx.std() / 3, yy.std() / 3]
corr = -0.7 # correlation
covs = [[stds[0]**2          , stds[0]*stds[1]*corr], 
        [stds[0]*stds[1]*corr,           stds[1]**2]] # covariance matrix
table4 = pd.DataFrame(np.random.multivariate_normal(means, covs, 100), columns = ["Mean Anual Temperature", "Total Precipitation"])
print(table4)

    Mean Anual Temperature  Total Precipitation
0                18.294961           909.201074
1                18.556194           684.600944
2                18.213491           840.320436
3                18.157810           755.513792
4                17.785119           824.875035
..                     ...                  ...
95               18.941036           608.551015
96               18.339957           645.584342
97               18.835738           610.889186
98               18.311114           787.298732
99               18.915846           691.382160

[100 rows x 2 columns]


### My notes
1. Variables: mean annual temperature, total precipitation
2. Code interpretation:
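    - Create two arrays holding a pair of values for temperature (16, 21) and for precipitation (300, 1200)
    - Compute the means and standard deviations (one third of each array's SD), assign a correlation coefficient of -0.7, and build the covariance matrix
    - Create a df by generating 100 values from a multivariate normal distribution using the means and covariance matrix
3. Best type of chart: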
    - Scatterplot - x = mean annual temp, y = precipitation
        - Allows for visualization of correlation between the two variables
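
One way to confirm that the covariance matrix encodes the intended correlation is to draw a large sample and compute the sample correlation. This is a sketch under assumptions: the sample size of 10,000 and the seed are arbitrary, and `default_rng` replaces the legacy `np.random` call used in the cell above.

```python
import numpy as np

xx = np.array([16, 21])
yy = np.array([300, 1200])
means = [xx.mean(), yy.mean()]
stds = [xx.std() / 3, yy.std() / 3]
corr = -0.7

# Off-diagonal terms are corr * sd_x * sd_y, diagonal terms are the variances
covs = [[stds[0] ** 2, stds[0] * stds[1] * corr],
        [stds[0] * stds[1] * corr, stds[1] ** 2]]

rng = np.random.default_rng(24)  # arbitrary seed
sample = rng.multivariate_normal(means, covs, 10_000)

sample_corr = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]
print(round(sample_corr, 2))  # expected to be close to -0.7
```

With only 100 rows, as in table4, the sample correlation will scatter more widely around -0.7, which is worth keeping in mind when reading the scatterplot.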

In [11]:
# Simulate table5
col1 = pd.Series(list(range(1900,2010,10))).repeat(8)
col2 = ["Permanent crops", "Irrigated crops", "Managed Forest", "Natural Forest", "Agro-Forestry system", "Urban", "Pasture", "Shrubland" ]*11
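col3 = np.random.uniform(0, 100, 90) # only the first 88 values are used: zip() stops at the shortest input (col1 and col2 have 88 elements)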
table5 = pd.DataFrame(list(zip(col1, col2, col3)), columns = ["Year", "Landuse", "Cover"])
print(table5)

    Year               Landuse      Cover
0   1900       Permanent crops  77.543675
1   1900       Irrigated crops  38.634305
2   1900        Managed Forest  36.945386
3   1900        Natural Forest  69.217019
4   1900  Agro-Forestry system   4.370542
..   ...                   ...        ...
83  2000        Natural Forest  32.942488
84  2000  Agro-Forestry system  68.643865
85  2000                 Urban  34.579609
86  2000               Pasture  45.485347
87  2000             Shrubland  53.094214

[88 rows x 3 columns]


### My notes
1. Variables: year, landuse, cover
2. Code interpretation:
    - Create the decades from 1900 to 2000, each repeated 8 times (col1)
    - Assign land use type values, with the 8 types repeated 11 times (col2)
    - Create an array of 90 decimal numbers drawn uniformly between 0 and 100, so each value is equally likely. Only the first 88 end up in the table, because zip() stops at the shortest input
    - Create a df by zipping three variables together and assigning column names
3. Best type of chart:
    - Stacked area plot
        - x = year, y = cover
        - Each land use type would be one band of the stack, allowing the viewer to see how land use was distributed in each year
    - Line plot
        - x = year, y = cover
        - Each line would represent a different land use type, which allows for easy visualization of how land use changes over time
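
Both chart types need one series per land use type, which usually means reshaping the long table into a wide one (years as rows, land use types as columns). A minimal sketch, assuming a table with the same three columns as table5; the names `long` and `wide` and the seed are made up for illustration.

```python
import numpy as np
import pandas as pd

landuse = ["Permanent crops", "Irrigated crops", "Managed Forest", "Natural Forest",
           "Agro-Forestry system", "Urban", "Pasture", "Shrubland"]
years = list(range(1900, 2010, 10))  # 11 decades, 1900 to 2000

rng = np.random.default_rng(24)      # arbitrary seed
long = pd.DataFrame({
    "Year": np.repeat(years, len(landuse)),  # each year repeated 8 times
    "Landuse": landuse * len(years),         # the 8 types repeated 11 times
    "Cover": rng.uniform(0, 100, len(years) * len(landuse)),
})

# One row per year, one column per land use type
wide = long.pivot(index="Year", columns="Landuse", values="Cover")
print(wide.shape)  # (11, 8)
```

With matplotlib installed, `wide.plot.area()` or `wide.plot.line()` would then draw one band or one line per land use type.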