# Welcome to Jupyter! 
What you are looking at is called a Jupyter notebook. Notebooks like this are often used in data science as a way to teach and collaborate. They let you read about a challenge or technique, see and try running code step-by-step, and make changes and see the results, all on one page.

You'll notice there are two types of page sections. Narrative, like this, is just for reading. Any code in here won't run. If you click on the narrative section, you'll enter edit mode. You can make notes for yourself here. Don't worry, nobody else can see your changes.

Below, you'll see a simple block of code. You should see a little triangle arrow to the left of the code when your mouse hovers over the code block. If you click this arrow, you should see the result of the code (Hello world!) printed below the code box.

> If you are not already in a Jupyter notebook environment, you can launch this notebook in an online JupyterLab session hosted by mybinder.org:
>
> https://mybinder.org/v2/gh/intersective/binder-base/trunk?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252Fintersective%252Fdata-capstones%26urlpath%3Dlab%252Ftree%252Fdata-capstones%252Fskillsbuild%252Fsustainability%252Fnyc_water_project_1.ipynb%26branch%3Dtrunk


In [None]:
text_variable='Hello world!' #creates a variable which contains specific text. You can change the text if you want.
print(text_variable) #prints out the text stored in the variable text_variable.

Code blocks "remember" what happened in the previous code block, but only if the previous block "ran". For example, assuming you ran the above code, we can now run this next block of code that references `text_variable`. 

In [None]:
print(text_variable + " to you too")

This is important to remember: in these lessons it's very easy to forget to run a block, which may cause later blocks to raise errors. You can use "Run all", which starts from the top and runs every code block in order.
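
As a rough illustration of what such an error looks like, the sketch below refers to a variable that was never created. The name `not_yet_defined` is hypothetical and only used for this example; the `try`/`except` is there so the message can be printed instead of stopping the cell.

```python
# Referring to a variable before the block that creates it has run
# raises a NameError. `not_yet_defined` is a made-up name for this demo.
try:
    print(not_yet_defined)
except NameError as err:
    message = str(err)
    print("NameError:", message)
```

If you see a `NameError` like this in a later block, the usual fix is to go back and run the block that defines the missing variable.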

If you want to learn more about how notebooks work, [here is a good introduction](../../../binder_intro.ipynb) which should open in a new tab within JupyterLab.

Now let's get to work!

# Redoing the Water Consumption in NYC Analysis in Python

We're going to quickly redo the mini-project we previously did in Google Sheets (analyzing NYC water consumption from 1979-2022). This time, we'll do it all in Python using this notebook and a graphing library called plot.ly. This will be a "warm-up" to the real project, which is a much deeper dive into NYC water consumption.

To start, we need to initialize our notebook with the required Python libraries. Just running this next code block will install and load the libraries.

In [None]:
%pip install pandas numpy cufflinks plotly
import pandas as pd
import numpy as np
import cufflinks as cf
import plotly.express as px
import plotly.offline as py
import plotly.graph_objs as go
cf.go_offline() # required to use plotly offline (no account required).
py.init_notebook_mode() # graphs charts inline (IPython).
print("Imported!")

Now let's access our data source. NYC's Open Data initiative makes data sets available via its simple Socrata API system. You can get all of the data in JSON format, or you can pre-filter the data before downloading it.
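
As a minimal sketch of what pre-filtering could look like, the example below builds a URL using Socrata's query parameters (`$select`, `$where`, `$order`). The specific filter values are illustrative assumptions, and the `$where` clause assumes the `year` column is stored as a number. This notebook itself downloads the full data set instead.

```python
from urllib.parse import quote, urlencode

base_url = "https://data.cityofnewyork.us/resource/ia2d-e54m.json"

# Query parameters (the filter values here are illustrative assumptions)
query = {
    "$select": "year,new_york_city_population",  # only return these two columns
    "$where": "year >= 2000",                    # assumes `year` is stored as a number
    "$order": "year",                            # sort the rows by year
}

# Keep "$" and "," readable in the URL; other special characters are percent-encoded
filtered_url = base_url + "?" + urlencode(query, safe="$,", quote_via=quote)
print(filtered_url)
```

The resulting URL could then be passed to `pd.read_json()` in place of the unfiltered one.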

We're going to access the same data we used for the Google Sheets project, but this time via the API. For reference, the data set is described on the [NYC Open Data website](https://data.cityofnewyork.us/Environment/Water-Consumption-in-the-City-of-New-York/ia2d-e54m).

In [None]:
dataUrl = 'https://data.cityofnewyork.us/resource/ia2d-e54m.json'
data = pd.read_json(dataUrl)
print("Rows, Columns of data: ", data.shape)
print("\nData types of columns:")
data.dtypes

Let's recreate the simple chart of population by year that we did at the start of the Google Sheets project.

We will use the plot.ly library and its iplot renderer, which creates interactive graphs. Plot.ly combines line graphs and scatter plot graphs: a line graph is a connected scatter plot graph.

In [None]:
# we need to set up our data object, which configures plot.ly. There are many more options, as we will see shortly.
graphData = [
    go.Scatter(
        x=data.year,
        y=data.new_york_city_population
    )
]
py.iplot(graphData)


While it's great to be able to plot a chart with a few lines of code, this chart does not look very nice and doesn't follow best practices:
-   The chart is too wide. The aspect ratio should be at most 1.5:1 (width to height), for example 900px wide by 600px high
-   There is no title. The title should be centred and descriptive
-   The X and Y axes have no labels
-   The background color is not white
-   The Y axis starts at 7m instead of 0, distorting the graph and making growth seem greater than it actually was

Here's more robust code that implements these formatting changes:

In [None]:
pop = go.Scatter(
    x=data.year,                        # x-axis will be the year
    y=data.new_york_city_population,    # y-axis will be the population
    fill='tozeroy',                     # fill the area under the graph
)
layout = go.Layout(
        template='simple_white',# use a white background
        autosize=False,         # don't automatically size the graph
        width=900,              # 900x600 is a 1.5:1 ratio
        height=600,
        font=dict(              # sets the font family for the whole graph
            family="Rockwell",
            color="slategray"
        ),   
        xaxis=dict(
            title_text="Year",  # the axis title
            color="slategray",  # the axis color
            dtick=10,           # the step size of the axis, in this case 10 years
            showgrid=False,     # don't show the major gridlines
        ),
        yaxis=dict(
            title_text="Population", 
            color="slategray",
            showgrid=True,      # we do want the y-axis major gridlines
            rangemode="tozero", # ensure the y-axis starts at 0
            dtick=2500000       # set the tick size to 2.5m 
        ),
        title=dict(
            text="NYC Population from 1979-2022", 
            x=.5                # this horizontally centers the title
        ),
        title_font=dict(        # lets us override the font size & color for the title
            size=24,
            color="darkslategray"
        ),
)
go.Figure(data=pop, layout=layout)



This looks a lot like what we did in Google Sheets. The great part is we can easily reuse the formatting code for future graphs. Let's now transform our data and render the final exercise graph, where we show the percentage change since 1979 for population, consumption and per-capita consumption.

First we'll need to create those additional columns in our data set. Note that there is a built-in function called pct_change(), but like our first attempt with Google Sheets, that function calculates the change between one row and the next, which gives us the EKG-type graph. So we are creating a custom function (lambda) that takes a column and applies the following math equation to every row in it:
`% = (current row / row 0) - 1`

This is another way of getting the percentage change between two rows, here always measured against the first row. Note that this value is expressed as a decimal, so we'll need to format the axis accordingly!
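
To see the difference between the two approaches, here is a small sketch on made-up numbers (not the NYC data). It compares `pct_change()` with the "change since row 0" formula above.

```python
import pandas as pd

# Made-up values for illustration only
sample = pd.Series([200.0, 250.0, 300.0])

# Change from one row to the next (the first row has nothing to compare to)
row_to_row = sample.pct_change()

# Change relative to the first row: (current row / row 0) - 1
since_start = sample.div(sample.iloc[0]).subtract(1)

print(row_to_row.tolist())
print(since_start.tolist())

# The results are decimals; formatting as a percentage turns 0.5 into 50%
print(f"{since_start.iloc[2]:.0%}")
```

The row-to-row values jump around from one row to the next, while the values measured against row 0 accumulate, which is what we want for a "change since 1979" chart.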

In [None]:
# add one column per measure holding its percentage change relative to the first row (row 0)
data['perc_change_population'] = data[['new_york_city_population']].apply(lambda x: x.div(x.iloc[0]).subtract(1))
data['perc_change_consumption'] = data[['nyc_consumption_million_gallons_per_day']].apply(lambda x: x.div(x.iloc[0]).subtract(1))
data['perc_change_percap'] = data[['per_capita_gallons_per_person_per_day']].apply(lambda x: x.div(x.iloc[0]).subtract(1))

# show the new table
data


Now that we've transformed the data, let's graph it using all of our best practices!
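
Since this chart uses almost the same formatting as the population chart, one option is to wrap the shared settings in a small helper. This is only a sketch: `make_layout` and its parameters are hypothetical names introduced here, not part of plot.ly, and the code cell that follows spells out the layout in full instead.

```python
def make_layout(title, y_title, y_dtick, y_tickformat=None):
    """Return the layout settings shared by the charts in this notebook."""
    yaxis = dict(
        title_text=y_title,
        color="slategray",
        showgrid=True,          # show the y-axis major gridlines
        rangemode="tozero",     # make sure 0 is included in the y-axis range
        dtick=y_dtick,          # the step size of the y-axis
    )
    if y_tickformat is not None:
        yaxis["tickformat"] = y_tickformat  # e.g. '.0%' for percentages
    return dict(
        template="simple_white",
        autosize=False,
        width=900,              # 900x600 is a 1.5:1 ratio
        height=600,
        font=dict(family="Rockwell", color="slategray"),
        xaxis=dict(title_text="Year", color="slategray", dtick=10, showgrid=False),
        yaxis=yaxis,
        title=dict(text=title, x=0.5),
        title_font=dict(size=24, color="darkslategray"),
    )

population_layout = make_layout("NYC Population from 1979-2022", "Population", 2500000)
percent_layout = make_layout(
    "NYC Water Consumption vs Population 1979-2022", "Percentage Change", 0.25, ".0%"
)
print(percent_layout["yaxis"])
```

The returned dictionary can be unpacked into plot.ly, for example `go.Layout(**percent_layout)`, so only the title and y-axis settings change from chart to chart.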

In [None]:
colors = ['lightslategray',] * 5
colors[0] = 'blue'
population = go.Scatter(
    x=data.year,                        # x-axis will be the year
    y=data.perc_change_population,      # y-axis will be the % change in population
    name='% Change in Population',      # set the title of the series for the legend
    line_color='slategray'              # set the color of the series line
)

consumption = go.Scatter(
    x=data.year,                        # x-axis will be the year
    y=data.perc_change_consumption,     # y-axis will be the % change in consumption
    name='% Change in Water<br>Consumption', # we add a <BR> into the title so it wraps onto two lines
    line_color='royalblue'              # set the color of the series line
)

percap = go.Scatter(
    x=data.year,                        # x-axis will be the year
    y=data.perc_change_percap,          # y-axis will be the % change in per-capita consumption
    name='% Change in Per-Capita<br>Water Consumption', # we add a <BR> into the title so it wraps onto two lines
    line_color='lightblue'              # set the color of the series line
)
layout = go.Layout(
        template='simple_white',# use a white background
        autosize=False,         # don't automatically size the graph
        width=900,              # 900x600 is a 1.5:1 ratio
        height=600,
        font=dict(              # sets the font family for the whole graph
            family="Rockwell",
            color="slategray"
        ),   
        xaxis=dict(
            title_text="Year",  # the axis title
            color="slategray",  # the axis color
            dtick=10,           # the step size of the axis, in this case 10 years
            showgrid=False,     # don't show the major gridlines
        ),
        yaxis=dict(
            title_text="Percentage Change", 
            color="slategray",
            showgrid=True,      # we do want the y-axis major gridlines
            rangemode="tozero", # ensure the y-axis range includes 0 (values here can be negative)
            dtick=.25,          # set the tick size to 25%
            tickformat='.0%',   # format as percentage
        ),
        title=dict(
            text="NYC Water Consumption vs Population 1979-2022", 
            x=.5                # this horizontally centers the title
        ),
        title_font=dict(        # lets us override the font size & color for the title
            size=24,
            color="darkslategray"
        ),
)
go.Figure(data=[population,consumption,percap], layout=layout)

You've just quickly recreated the chart we made in Google Sheets, but this time using the power of Python and plot.ly. Now that you have the basics down, let's get into a much deeper data set. 

Follow [this link](nyc_water_project_2.ipynb) to go to the next workbook!