## Data Visualization in Python for Absolute Beginners

### Basic Charts

#### Histograms

For the first few examples, we will use some data on the top 50 bestselling novels on Amazon. Let's start with a histogram. Histograms are used to visualize the distribution of a continuous variable.

In our case, we will visualize the distribution of prices of bestselling books from 2021.

We will first define the data we want to plot as a list. A list is a Python data type that stores multiple elements and is enclosed in square brackets (`[]`).
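As a quick illustration of how lists behave, here is a minimal sketch. The name `sample_prices` and its values are made up for this example and are not part of the dataset.

```python
# A short illustrative list; these values are made up, not taken from the dataset
sample_prices = [7.5, 12.0, 18.25]

# Items are accessed by position, counting from zero
first_price = sample_prices[0]

# len() returns the number of items in the list
n_prices = len(sample_prices)

# Built-in functions such as min() and max() accept a list
cheapest = min(sample_prices)
most_expensive = max(sample_prices)

print(first_price, n_prices, cheapest, most_expensive)
```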

In [29]:
# Specify the data
prices_2021 = [7.48, 12.52, 17.78, 11.98, 7.49, 5.36, 6.99, 13.58, 14.34, 8.99, 6.62, 12.01, 18.0, 15.98, 10.62, 6.0, 10.58, 6.99, 4.31, 4.14, 10.26, 4.07, 14.4, 8.48, 8.49, 14.16, 26.0, 15.49, 4.79, 9.59, 11.6, 7.57, 13.99, 11.4, 10.35, 7.74, 13.79, 9.58, 13.09, 13.29, 19.42, 9.42, 10.34, 17.99, 14.8, 5.06, 8.55, 8.37, 14.89, 5.98]

# Preview the list
print(prices_2021)

[7.48, 12.52, 17.78, 11.98, 7.49, 5.36, 6.99, 13.58, 14.34, 8.99, 6.62, 12.01, 18.0, 15.98, 10.62, 6.0, 10.58, 6.99, 4.31, 4.14, 10.26, 4.07, 14.4, 8.48, 8.49, 14.16, 26.0, 15.49, 4.79, 9.59, 11.6, 7.57, 13.99, 11.4, 10.35, 7.74, 13.79, 9.58, 13.09, 13.29, 19.42, 9.42, 10.34, 17.99, 14.8, 5.06, 8.55, 8.37, 14.89, 5.98]
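Before plotting, it can help to see what a histogram computes. The sketch below is a simplified illustration using only the standard library, not a description of how Plotly works internally. It groups the first ten prices from `prices_2021` into bins that are 5 wide (an arbitrary width chosen for the example) and counts how many prices fall into each bin.

```python
from collections import Counter

# First ten prices from prices_2021, repeated here so the example runs on its own
sample = [7.48, 12.52, 17.78, 11.98, 7.49, 5.36, 6.99, 13.58, 14.34, 8.99]

BIN_WIDTH = 5  # arbitrary width chosen for illustration

# Map each price to the lower edge of its bin, then count the prices per bin
bin_counts = Counter(int(price // BIN_WIDTH) * BIN_WIDTH for price in sample)

for lower_edge in sorted(bin_counts):
    print(f"{lower_edge} to {lower_edge + BIN_WIDTH}: {bin_counts[lower_edge]} books")
```

A histogram draws one bar per bin, with the height of each bar set to the count for that bin.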


We then import `plotly.express` (as `px`) and call its `histogram()` function to create a figure object named `fig`, passing our list to the `x` parameter so that the prices appear on the x-axis.

Lastly, we use the .show() method to generate the plot!

In [30]:
# Import Plotly Express
import plotly.express as px

# Initialize a histogram
fig = px.histogram(x=prices_2021)

# Show the plot
fig.show()


#### Bar charts

Next, we will cover the [bar chart](https://plotly.com/python/bar-charts/)! Bar charts are a great way to compare a number, such as a count, a percentage, or an average, across the categories of a categorical variable.

For our bar chart, we will plot the average price by genre.

In [31]:
# Specify the data
genres = ["Fiction", "Non Fiction"]
average_prices = [10.6, 14.5]

# Preview the list of average prices
print(average_prices)

[10.6, 14.5]


In [32]:
# Numeric values on x and categories on y produce horizontal bars
fig = px.bar(x=average_prices, y=genres)

fig.show()


#### Line charts

Line charts are typically used to show how a variable (or variables) changes over time. With Plotly Express, it is incredibly easy to create a [line chart](https://plotly.com/python/line-charts/).

Let's plot the total number of reviews for each year.

In [33]:
# Specify the data
years = [2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
total_reviews = [235506, 273981, 405041, 654546, 654907, 792997, 711669, 709800, 644420, 696521, 794917, 1790733, 2818117]

# Initialize a line plot with years on the x-axis and review counts on the y-axis
fig = px.line(x=years, y=total_reviews)

fig.show()

#### Scatter plots
Scatter plots are a great way to visualize the relationship between two continuous variables. Unlike line plots, they do not connect the points, so they suit observations that have no natural order.

Creating a [scatter plot with Plotly](https://plotly.com/python/line-and-scatter/) is just as easy as creating a line chart.

We will use a pandas DataFrame instead of Python lists to make things easier. A DataFrame is a data structure composed of labelled columns and rows, much like a spreadsheet.
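To get a feel for the structure before loading the full file, here is a minimal sketch that builds a small DataFrame by hand. The three rows are books from the dataset typed in manually, and `sample_books` is a name introduced only for this example.

```python
import pandas as pd

# Three rows typed in by hand; each dictionary key becomes a column label
sample_books = pd.DataFrame(
    {
        "Name": ["10-Day Green Smoothie Cleanse", "11/22/63: A Novel", "1984"],
        "Author": ["JJ Smith", "Stephen King", "George Orwell"],
        "Price": [8.00, 22.00, 7.48],
    }
)

# .shape gives (number of rows, number of columns)
print(sample_books.shape)

# Selecting a single column by its label returns that column's values
print(sample_books["Price"].max())
```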

In [34]:
import pandas as pd

books = pd.read_csv("amazon.csv")

books

| Unnamed: 0 | Name | Author | User Rating | Reviews | Price | Price_r | Year | Genre |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10-Day Green Smoothie Cleanse | JJ Smith | 4.7 | 17350.0 | 8.00 | $8 | 2016 | Non Fiction |
| 1 | 11/22/63: A Novel | Stephen King | 4.6 | 2052.0 | 22.00 | $22 | 2011 | Fiction |
| 2 | 12 Rules for Life: An Antidote to Chaos | Jordan B. Peterson | 4.7 | 18979.0 | 15.00 | $15 | 2018 | Non Fiction |
| 3 | 1984 | George Orwell | 4.7 | 70425.0 | 7.48 | $8 | 2021 | Fiction |
| 4 | 1984 (Signet Classics) | George Orwell | 4.7 | 21424.0 | 6.00 | $6 | 2017 | Fiction |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 645 | Wrecking Ball (Diary of a Wimpy Kid Book 14) | Jeff Kinney | 4.9 | 9413.0 | 8.00 | $8 | 2019 | Fiction |
| 646 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 4.7 | 14331.0 | 8.00 | $8 | 2016 | Non Fiction |
| 647 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 4.7 | 14331.0 | 8.00 | $8 | 2017 | Non Fiction |
| 648 | You Are a Badass: How to Stop Doubting Your Gr... | Jen Sincero | 4.7 | 14331.0 | 8.00 | $8 | 2018 | Non Fiction |


In [35]:
fig = px.scatter(books, x = "Price", y = "Reviews", opacity = 0.5)

fig.show()

### Customizing plots

#### Labels
Okay, we have a plot that visualizes the relationship between price and the number of reviews. But would anyone know this from the chart alone?

Let's improve the x- and y-axis labels and add a title to our plot!

In [36]:
# Keys in `labels` must match the column names ("Price", "Reviews") exactly
fig = px.scatter(books,
                 x="Price",
                 y="Reviews",
                 title="Number of Reviews by Book Price<br><sup>Amazon Bestseller List 2009 to Present</sup>",
                 labels={"Reviews": "# of Reviews", "Price": "Price (USD)"})

fig.show()

#### Colors
Some plots can split the data by a categorical variable. In the case of a scatter plot, you can use the `color` parameter to differentiate points by another variable.

Let's color the points by the "Genre" column to see whether there is any clear difference between fiction and non-fiction books.

In [37]:
# Initialize a scatter plot
fig = px.scatter(books,
                 x="Price",
                 y="Reviews",
                 title="Number of Reviews by Book Price<br><sup>Amazon Bestseller List 2009 to Present</sup>",
                 labels={"Reviews": "# of Reviews", "Price": "Price (USD)"},
                 color="Genre"
                )

# Show the plot
fig.show()

#### Changing templates

Another fun way to customize your plots is to change the `template`. Here, we pass "seaborn" as the argument to this parameter.

Plotly has several built-in templates that you can use:
- 'ggplot2' 
- 'seaborn'
- 'simple_white'
- 'plotly'
- 'plotly_white'
- 'plotly_dark'
- 'presentation'
- 'xgridoff'
- 'ygridoff'
- 'gridon'
- 'none'

In [38]:
# Initialize a scatter plot
fig = px.scatter(books,
                 x="Price",
                 y="Reviews",
                 title="Number of Reviews by Book Price<br><sup>Amazon Bestseller List 2009 to Present</sup>",
                 labels={"Reviews": "# of Reviews", "Price": "Price (USD)"},
                 color="Genre",
                 template="seaborn"
                )

# Show the plot
fig.show()

## Geographical scatter plot
For our final plot, let's create a [geographical scatter plot](https://plotly.com/python/scatter-plots-on-maps/) that shows where meteorites have landed and how large they were.

The data we will use is adapted from [this dataset](https://www.kaggle.com/datasets/nasa/meteorite-landings), filtered to only include meteorites that were 100 kg or heavier.

In [39]:
heavy = pd.read_csv("heavy_meteorites.csv")

heavy

| Unnamed: 0 | name | id | nametype | recclass | mass | fall | year | reclat | reclong | GeoLocation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Abee | 6 | Valid | EH4 | 107000 | Fell | 1952.0 | 54.21667 | -113.00000 | (54.216670, -113.000000) |
| 1 | Alfianello | 466 | Valid | L6 | 228000 | Fell | 1883.0 | 45.26667 | 10.15000 | (45.266670, 10.150000) |
| 2 | Allende | 2278 | Valid | CV3 | 2000000 | Fell | 1969.0 | 26.96667 | -105.31667 | (26.966670, -105.316670) |
| 3 | Bjurböle | 5064 | Valid | L/LL4 | 330000 | Fell | 1899.0 | 60.40000 | 25.80000 | (60.400000, 25.800000) |
| 4 | Boguslavka | 5098 | Valid | Iron, IIAB | 256000 | Fell | 1916.0 | 44.55000 | 131.63333 | (44.550000, 131.633330) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 420 | Zaragoza | 48916 | Valid | Iron, IVA-an | 162000 | Found | | 41.65000 | -0.86667 | (41.650000, -0.866670) |
| 421 | Zerhamra | 30403 | Valid | Iron, IIIAB-an | 630000 | Found | 1967.0 | 29.85861 | -2.64500 | (29.858610, -2.645000) |
| 422 | Zhaoping | 54609 | Valid | Iron, IAB complex | 2000000 | Found | 1983.0 | 24.23333 | 111.18333 | (24.233330, 111.183330) |
| 423 | Zhigansk | 30405 | Valid | Iron, IIIAB | 900000 | Found | 1966.0 | 68.00000 | 128.30000 | (68.000000, 128.300000) |


We create the plot in the same way as other plots, but this time we use some different parameters:
- `lat` and `lon` specify the columns that contain the latitude and longitude of each landing.

In [40]:
# Initialize a geographical scatter plot
fig = px.scatter_geo(heavy,
                     lat="reclat",
                     lon="reclong",
                     title="Meteorite Landings"
                    )

# Show the plot
fig.show()

Let's improve the plot by adding a template and coloring the points by the "fall" column, which records whether a meteorite was seen falling or was found later. We can also include the following:

- `size` to scale the points by the mass of the meteorite.
- `hover_data` to provide additional information on the meteorite upon hover.

In [41]:
# Initialize a geographical scatter plot
fig = px.scatter_geo(heavy,
                     lat="reclat",
                     lon="reclong",
                     title="Meteorite Landings",
                     color="fall",
                     template="plotly_dark",
                     size="mass",
                     hover_data=["name", "year"]
                    )

# Show the plot
fig.show()

## Bonus: No-code visualizations
Workspace now has no-code visualizations! Let's try out a line plot using some pre-processed meteorite data.

After loading it in, we can click "Visualize" on the DataFrame output and customize our plot!

In [42]:
# Read in the csv as a DataFrame
mass = pd.read_csv("average_meteorite_mass.csv")

# Preview the DataFrame
mass

Unnamed: 0,year,fall,mass
0,1990-01-01,Fell,44276.325
1,1990-01-01,Found,508.334716
2,1991-01-01,Fell,27186.14
3,1991-01-01,Found,708.811344
4,1992-01-01,Fell,56316.666667
5,1992-01-01,Found,7191.786377
6,1993-01-01,Fell,14866.5
7,1993-01-01,Found,100.885063
8,1994-01-01,Fell,15751.5
9,1994-01-01,Found,837.276032
