# An Introduction to Jupyter Notebooks

Jupyter Notebooks are a file format (```*.ipynb```) in which you can execute your code and explain it step by step.
> Jupyter Notebooks support code execution not only in Python but in over 40 languages, including R, Lua, Rust, and Julia, through numerous [kernels](https://docs.jupyter.org/en/latest/projects/kernels.html#kernels-programming-languages).

We can use Markdown to write text with some control over its formatting.
- [Here's a link to basic Markdown syntax](https://www.markdownguide.org/basic-syntax/)
- [Here's a link to Markdown's extended syntax](https://www.markdownguide.org/extended-syntax/)

Topics We Will Cover
- Importing different files and filetypes with [```pandas```](https://pandas.pydata.org/docs/index.html)
- Basic Statistical Analysis of tabular data with ```pandas``` and ```numpy```
- Creating charts with Python packages from the [Matplotlib](https://matplotlib.org/), [Plotly](https://plotly.com/python/), or [HoloViz](https://holoviz.org/background.html#background-why-holoviz) ecosystems
- Evaluating the potential use cases for each visualization package

![EvidenceOfLearning](../images/learning.gif)
<br>
*This is you, enjoying the learning process.*

Step 1: Import ```pandas``` and ```numpy``` into your Python program.

In [1]:
# Import the pandas and numpy packages into your Python program.
import pandas as pd
import numpy as np

df_json = pd.read_json('../data/food-waste-pilot/food-waste-pilot.json')
df_csv = pd.read_csv('../data/food-waste-pilot/food-waste-pilot.csv')
df_xlsx = pd.read_excel('../data/food-waste-pilot/food-waste-pilot.xlsx')
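
Each of the readers above returns a `DataFrame`, so the rest of the notebook works the same way whichever file you start from. Here is a minimal sketch of that idea using in-memory text instead of the files on disk. The rows are copied from the `head()` output in this notebook; the names `sample_csv` and `sample_json` are hypothetical, and the record-per-row JSON layout is an assumption, since the structure of the real JSON file is not shown.

```python
import io

import pandas as pd

# Five rows copied from the notebook's head() output, as CSV text.
csv_text = """Collection Date,Food Waste Collected,Estimated Earned Compost Created
2022-02-25,250.8,25
2022-03-02,298.8,30
2022-03-21,601.2,60
2022-03-28,857.2,86
2022-03-30,610.8,61
"""

# read_csv accepts any file-like object, not only a path.
sample_csv = pd.read_csv(io.StringIO(csv_text))

# Assumed JSON layout: a list of records, one object per row.
json_text = sample_csv.to_json(orient="records")
sample_json = pd.read_json(io.StringIO(json_text))

print(sample_csv.shape)
print(list(sample_json.columns) == list(sample_csv.columns))
```

Reading from `io.StringIO` is also a convenient way to try out reader options on a few rows before pointing them at a large file.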

In [2]:
df_csv.shape

(152, 3)

In [3]:
df_csv.head() # Grabs the top 5 items in your Dataframe by default.

| | Collection Date | Food Waste Collected | Estimated Earned Compost Created |
|---|---|---|---|
| 0 | 2022-02-25 | 250.8 | 25 |
| 1 | 2022-03-02 | 298.8 | 30 |
| 2 | 2022-03-21 | 601.2 | 60 |
| 3 | 2022-03-28 | 857.2 | 86 |
| 4 | 2022-03-30 | 610.8 | 61 |


In [4]:
df_csv.tail() # Grabs the bottom 5 items in your Dataframe by default.

| | Collection Date | Food Waste Collected | Estimated Earned Compost Created |
|---|---|---|---|
| 147 | 2022-10-12 | 385.8 | 39 |
| 148 | 2022-10-28 | 713.6 | 71 |
| 149 | 2022-10-31 | 953.4 | 95 |
| 150 | 2022-12-14 | 694.4 | 69 |
| 151 | 2023-01-06 | 968.6 | 97 |
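
Both `head()` and `tail()` take an optional row count, so you are not limited to five rows. A small sketch on a hypothetical ten-row frame (`demo` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for df_csv: ten rows of made-up values.
demo = pd.DataFrame({"Food Waste Collected": [float(i) for i in range(10)]})

top_three = demo.head(3)          # first 3 rows instead of the default 5
last_two = demo.tail(2)           # last 2 rows
all_but_last_two = demo.head(-2)  # negative n: everything except the last 2 rows

print(len(top_three), len(last_two), len(all_but_last_two))
```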


In [5]:
df_csv.columns

Index(['Collection Date', 'Food Waste Collected',
       'Estimated Earned Compost Created'],
      dtype='object')

In [9]:
df_csv.dtypes # Returns the data types of your columns.

Collection Date                      object
Food Waste Collected                float64
Estimated Earned Compost Created      int64
dtype: object

In [10]:
df_csv.describe()

| | Food Waste Collected | Estimated Earned Compost Created |
|---|---|---|
| count | 152.0 | 152.0 |
| mean | 526.873684 | 52.611842 |
| std | 197.838075 | 19.787631 |
| min | 0.0 | 0.0 |
| 25% | 398.05 | 39.75 |
| 50% | 531.5 | 53.0 |
| 75% | 658.9 | 66.0 |
| max | 1065.8 | 107.0 |
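
The figures reported by `describe()` can be reproduced with `numpy`, which helps when you need only one of them. The sketch below uses just the five `Food Waste Collected` values from the `head()` output, so its numbers will not match the full-dataset summary; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# The five "Food Waste Collected" values shown by head().
waste = pd.Series([250.8, 298.8, 601.2, 857.2, 610.8], name="Food Waste Collected")

summary = waste.describe()

# numpy equivalents. Note ddof=1: pandas reports the sample standard deviation,
# while np.std defaults to the population version (ddof=0).
values = waste.to_numpy()
mean_np = np.mean(values)
std_np = np.std(values, ddof=1)
median_np = np.percentile(values, 50)

print(summary["mean"], mean_np)
print(summary["std"], std_np)
print(summary["50%"], median_np)
```

The `ddof` difference is the usual reason a hand-computed standard deviation disagrees with the one pandas prints.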


In [7]:
df_csv.info() # Returns index, column names, a count of Non-Null values, and data types.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152 entries, 0 to 151
Data columns (total 3 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Collection Date                   152 non-null    object 
 1   Food Waste Collected              152 non-null    float64
 2   Estimated Earned Compost Created  152 non-null    int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 3.7+ KB
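
The `+` in `memory usage: 3.7+ KB` marks the figure as a lower bound: for `object` columns, pandas by default does not measure the Python strings themselves. A minimal sketch of how to ask for the full figure, on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
demo = pd.DataFrame({
    "Collection Date": ["2022-02-25", "2022-03-02", "2022-03-21"],
    "Food Waste Collected": [250.8, 298.8, 601.2],
})

shallow_bytes = demo.memory_usage().sum()        # default, fast estimate
deep_bytes = demo.memory_usage(deep=True).sum()  # also measures string contents

print(shallow_bytes, deep_bytes)

# info() accepts the same switch and then reports the measured total.
demo.info(memory_usage="deep")
```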


In [7]:
# Oh no! Collection Date is stored as object (text) rather than as a date, so we need to convert it to a datetime value.

df_csv['Collection Date'] = pd.to_datetime(df_csv['Collection Date'])

In [8]:
df_csv.dtypes

Collection Date                     datetime64[ns]
Food Waste Collected                       float64
Estimated Earned Compost Created             int64
dtype: object
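
The conversion can also happen while the file is being read. The sketch below shows both routes on two rows copied from the `head()` output; `parsed_on_read` and `raw` are hypothetical names, and `errors="coerce"` is an optional extra that the notebook itself does not use.

```python
import io

import pandas as pd

csv_text = """Collection Date,Food Waste Collected,Estimated Earned Compost Created
2022-02-25,250.8,25
2022-03-02,298.8,30
"""

# Option 1: parse the column while reading.
parsed_on_read = pd.read_csv(io.StringIO(csv_text), parse_dates=["Collection Date"])

# Option 2: convert afterwards. errors="coerce" turns unparseable strings into NaT
# instead of raising, which can help with messy columns.
raw = pd.read_csv(io.StringIO(csv_text))
raw["Collection Date"] = pd.to_datetime(raw["Collection Date"], errors="coerce")

print(parsed_on_read.dtypes)

# Once the column is a datetime, the .dt accessor exposes date parts.
print(raw["Collection Date"].dt.month.tolist())
```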

In [9]:
df_csv.describe()

| | Food Waste Collected | Estimated Earned Compost Created |
|---|---|---|
| count | 152.0 | 152.0 |
| mean | 526.873684 | 52.611842 |
| std | 197.838075 | 19.787631 |
| min | 0.0 | 0.0 |
| 25% | 398.05 | 39.75 |
| 50% | 531.5 | 53.0 |
| 75% | 658.9 | 66.0 |
| max | 1065.8 | 107.0 |


In [11]:
# What if we want to know the date that we collected the most food waste?

df_csv.loc[
    df_csv['Food Waste Collected'].idxmax(),
    ['Collection Date']
]

Collection Date    2022-08-10
Name: 20, dtype: object
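
The lookup above returns a one-item `Series` because the column label is wrapped in a list. Passing the bare label returns the value itself, and `nlargest` is an alternative when you want the top few rows. A sketch on the five rows from the `head()` output (`demo` is a hypothetical name):

```python
import pandas as pd

demo = pd.DataFrame({
    "Collection Date": pd.to_datetime(
        ["2022-02-25", "2022-03-02", "2022-03-21", "2022-03-28", "2022-03-30"]
    ),
    "Food Waste Collected": [250.8, 298.8, 601.2, 857.2, 610.8],
})

# A plain column label (no list) returns the value, not a one-item Series.
peak_label = demo["Food Waste Collected"].idxmax()
peak_date = demo.loc[peak_label, "Collection Date"]

# nlargest returns whole rows, sorted from largest to smallest.
top_two = demo.nlargest(2, "Food Waste Collected")

print(peak_date.date())
print(top_two)
```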

In [10]:
import plotly.express as px

df = px.data.gapminder().query("year==2007")

In [11]:
df.columns

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap',
       'iso_alpha', 'iso_num'],
      dtype='object')

In [12]:
df.describe()

| | year | lifeExp | pop | gdpPercap | iso_num |
|---|---|---|---|---|---|
| count | 142.0 | 142.0 | 142.0 | 142.0 | 142.0 |
| mean | 2007.0 | 67.007423 | 44021220.0 | 11680.07182 | 425.880282 |
| std | 0.0 | 12.073021 | 147621400.0 | 12859.937337 | 249.111541 |
| min | 2007.0 | 39.613 | 199579.0 | 277.551859 | 4.0 |
| 25% | 2007.0 | 57.16025 | 4508034.0 | 1624.842248 | 209.5 |
| 50% | 2007.0 | 71.9355 | 10517530.0 | 6124.371109 | 410.0 |
| 75% | 2007.0 | 76.41325 | 31210040.0 | 18008.83564 | 636.0 |
| max | 2007.0 | 82.603 | 1318683000.0 | 49357.19017 | 894.0 |


In [13]:
px.strip(df, x='lifeExp', hover_name="country")

In [14]:
px.strip(df, x='lifeExp', color="continent", hover_name="country")

In [15]:
px.histogram(df, x='lifeExp', color="continent", hover_name="country")

In [16]:
px.histogram(df, x='lifeExp', color="continent", hover_name="country", marginal="rug")

In [17]:
px.histogram(df, x='lifeExp', y="pop", color="continent", hover_name="country", marginal="rug")
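
When `y` is supplied, `px.histogram` stops counting rows per bin and, by default, sums the `y` values of the rows that fall in each bin. A rough pandas equivalent is sketched below. The values are hypothetical (only the column names match the gapminder frame), and the bin edges are an assumption chosen for illustration, since Plotly picks its own.

```python
import pandas as pd

# Hypothetical rows: column names match gapminder, the values do not.
demo = pd.DataFrame({
    "lifeExp": [42.0, 48.5, 61.0, 64.2, 71.9, 78.3],
    "pop": [10, 20, 30, 40, 50, 60],
})

# Assumed bin edges, chosen only for illustration.
bins = pd.cut(demo["lifeExp"], bins=[40, 50, 60, 70, 80])

# Sum the population within each life-expectancy bin.
# observed=False keeps empty bins in the result.
pop_per_bin = demo.groupby(bins, observed=False)["pop"].sum()

print(pop_per_bin)
```

Read this way, the bar heights in the chart are total population per life-expectancy range, not a number of countries.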

In [18]:
px.histogram(df, x='lifeExp', y="pop", color="continent", hover_name="country", marginal="rug", facet_col="continent")

In [19]:
px.bar(df, color='lifeExp', x="pop", y="continent", hover_name="country")

In [20]:
px.sunburst(df, color='lifeExp', values="pop", path=["continent", "country"], hover_name="country", height=500)

In [21]:
px.treemap(df, color='lifeExp', values="pop", path=["continent", "country"], hover_name="country", height=500)

In [22]:
px.choropleth(df, color='lifeExp', locations="iso_alpha", hover_name="country", height=500)

In [23]:
px.scatter(df, x="gdpPercap", y='lifeExp', hover_name="country", height=500)

In [24]:
px.scatter(df, x="gdpPercap", y='lifeExp', hover_name="country", color="continent",size="pop", height=500)

We can see that the points follow a roughly logarithmic curve, so we set `log_x=True` to straighten it out and make the relationship easier to read. In the graph below we can look for [monotonic and nonmonotonic relationships](https://www.statology.org/monotonic-relationship/) in the dataset.

In [25]:
px.scatter(df, x="gdpPercap", y='lifeExp', hover_name="country", color="continent",size="pop", size_max=60, log_x=True, height=500)
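
The effect of `log_x=True` can be checked numerically. The sketch below uses synthetic data, not gapminder, in which `y` depends on the logarithm of `x` by construction. It compares the linear (Pearson) correlation before and after the log transform with the rank-based (Spearman) correlation, which only measures whether the relationship is monotonic.

```python
import numpy as np
import pandas as pd

# Synthetic data, not gapminder: y grows with the logarithm of x by construction.
x = pd.Series(np.logspace(2.5, 4.5, 50))
y = 20 + 15 * np.log10(x)

pearson_raw = x.corr(y)            # linear correlation on the original scale
pearson_log = np.log10(x).corr(y)  # linear correlation after the log transform

# Spearman correlation is the Pearson correlation of the ranks (no ties here),
# so it measures monotonicity and ignores the shape of the curve.
spearman_raw = x.rank().corr(y.rank())

print(round(pearson_raw, 3), round(pearson_log, 3), round(spearman_raw, 3))
```

A relationship can therefore be perfectly monotonic while still looking curved on a linear axis, which is the case a log axis straightens out.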

In [26]:
fig = px.scatter(df, x="gdpPercap", y='lifeExp', hover_name="country", color="continent",size="pop", size_max=60, log_x=True, height=500)

The `"json"` renderer lets you inspect the values behind the figure. Unfortunately, the output is a great deal easier to read in JupyterLab.

In [27]:
fig.show("json")

In [28]:
import plotly.express as px

df = px.data.gapminder().query("year == 2007")

fig = px.scatter(df, y="lifeExp", x="gdpPercap", color="continent", log_x=True, size="pop", size_max=60,
                 hover_name="country", height=600, width=1000, template="simple_white", 
                 color_discrete_sequence=px.colors.qualitative.G10,
                 title="Health vs Wealth 2007",
                 labels=dict(
                     continent="Continent", pop="Population",
                     gdpPercap="GDP per Capita (US$, price-adjusted)", 
                     lifeExp="Life Expectancy (years)"))

fig.update_layout(font_family="Rockwell",
                  legend=dict(orientation="h", title="", y=1.1, x=1, xanchor="right", yanchor="bottom"))
fig.update_xaxes(tickprefix="$", range=[2,5], dtick=1)
fig.update_yaxes(range=[30,90])
fig.add_hline((df["lifeExp"]*df["pop"]).sum()/df["pop"].sum(), line_width=1, line_dash="dot")
fig.add_vline((df["gdpPercap"]*df["pop"]).sum()/df["pop"].sum(), line_width=1, line_dash="dot")
fig.show()

fig.write_image("gapminder_2007.svg") # static export (requires the kaleido package)
fig.write_html("gapminder_2007.html") # interactive export
fig.write_json("gapminder_2007.json") # serialized export
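
The two dotted reference lines in the cell above sit at population-weighted averages, computed as the sum of value times population divided by total population. A minimal sketch with hypothetical values for three countries (not taken from gapminder), which also shows why the x-axis range is written as `[2, 5]`:

```python
import numpy as np

# Hypothetical values for three countries (not taken from gapminder).
life_exp = np.array([60.0, 70.0, 80.0])
population = np.array([1_000_000, 1_000_000, 3_000_000])

# Same formula as the cell above: sum(value * weight) / sum(weight).
weighted_by_hand = (life_exp * population).sum() / population.sum()

# numpy's built-in equivalent.
weighted_np = np.average(life_exp, weights=population)

# On a log axis, Plotly expects the range in log10 units,
# so range=[2, 5] spans 10**2 to 10**5.
axis_limits = 10.0 ** np.array([2, 5])

print(weighted_by_hand, weighted_np, life_exp.mean())
print(axis_limits)
```

The weighted figure is pulled toward the most populous country, so it can differ noticeably from the plain mean.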


In [29]:
px.defaults.height = 600  # default height for Plotly Express figures created after this line

In [30]:
import plotly.express as px

z = [[.1, .3, .5, .7, .9],
     [1, .8, .6, .4, .2],
     [.2, 0, .5, .7, .9],
     [.9, .8, .4, .2, 0],
     [.3, .4, .5, .7, 1]]

fig = px.imshow(z, text_auto=True)
fig.show()

In [31]:
import plotly.express as px
df = px.data.wind()
fig = px.bar_polar(df, r="frequency", theta="direction", height=600,
                   color="strength", template="plotly_dark",
                   color_discrete_sequence=px.colors.sequential.Plasma_r)
fig.show()

In [32]:
df = px.data.iris()
fig = px.parallel_coordinates(df, color="species_id", labels={"species_id": "Species",
                  "sepal_width": "Sepal Width", "sepal_length": "Sepal Length",
                  "petal_width": "Petal Width", "petal_length": "Petal Length", },
                    color_continuous_scale=px.colors.diverging.Tealrose, color_continuous_midpoint=2)
fig.show()





In [33]:
df = px.data.tips()
fig = px.parallel_categories(df, color="size", color_continuous_scale=px.colors.sequential.Inferno)
fig.show()






In [34]:
df = px.data.election()
fig = px.scatter_ternary(df, a="Joly", b="Coderre", c="Bergeron", color="winner", size="total", hover_name="district",
                   size_max=15, color_discrete_map = {"Joly": "blue", "Bergeron": "green", "Coderre":"red"} )
fig.show()

In [35]:
df = px.data.election()
fig = px.scatter_3d(df, x="Joly", y="Coderre", z="Bergeron", color="winner", size="total", hover_name="district",
                  symbol="result", color_discrete_map = {"Joly": "blue", "Bergeron": "green", "Coderre":"red"})
fig.show()