# Introduction to Data Tutorial 1

This is an introduction to Python, as well as some of its most useful [modules](https://realpython.com/python-modules-packages/) like `numpy`, `pandas`, and `matplotlib`. The first part was written by [Simon Mudd](https://github.com/simon-m-mudd/smm_teaching_notebooks). It was originally adapted for Google Colab by Joanmarie Del Vecchio and then turned back into normal notebooks for JHub. The second half will involve a dataset from closer to home.


# Learning objectives

- Introduction to Python and basic packages like `numpy`, `pandas` and `matplotlib`
- Introduction to loading in data from spreadsheets and manipulating the data with code
- Not much geomorphology, but we will get to look at some data!

# Part 1: Python basics


Written by Simon M. Mudd with last update 21/01/2022

This series of lessons is a very basic introduction to python. They are intended for students with no python background, or for students who need a basic refresher.

If you spend a few seconds on any search engine you will find hundreds of basic python tutorials that are better than this one. But this one specifically covers the stuff I want you to know for lessons 2 and 3, so that is why I wrote it.



## What is python?
Python is a programming language. It is used for all kinds of things but the target audience for these lessons is people who will do some data analysis or plotting. Similar alternatives are Matlab (but this is a commercial product and you have to pay for it), R (which is focussed on statistics and not as flexible as python), and Julia (which has far fewer users than python so help is harder to find).

Amongst the above languages, Python is the most widely used and also the most frequently cited in job ads.

## What do you use python for (in these lessons)
These lessons show how you might replace some tasks in Excel with python. This boils down to handling data and plotting it. Excel is fast and easy for lots of things, but certain things are easier in python.

## How does python work?
You need some version of python installed to get it to work. But if you are using a functioning python notebook, someone has already done that for you.

## Python in a notebook
You are reading this in a notebook. The notebook is made up of "cells". Cells can either be text (usually formatted using something called markdown, which you can look up) or code. If you are going to run some code the cell needs to be a "code" cell.

This cell is a text cell so it won't run any python code.

The next cell will be code.

In [None]:
# This is a code cell. 
# If you put a `#` on a line python will ignore it. 
# This is called a comment. 
# Below I will tell python the letter a has the value 5 and I will then print it to screen
# To run this cell click on it and press shift+enter
a = 5
print(a)

You run cells by "executing" them. Do this by clicking on the cell and typing shift+enter. If there is any output from the cell it will appear below the cell.

## Assigning variables
In python you can assign variables with the = symbol. Python will remember this variable.

In [None]:
a = 5
b = 6
print(a+b)

## Printing things
You can print things with the `print` command. What you are printing needs to go in parentheses after it. Python assigns a type to its variables. The variable could be a number, or a string of characters (called a "string" or "str"). You assign strings by using quotation marks. You can't join a string and a number with `+`, because they are different types. But you can convert everything to a string and print that.

In [None]:
# THIS DOESN'T WORK
a = 5
b = "The number is: "
print(b+a)

In [None]:
# This works because I convert the number to a string:
a = str(5)
b = "The number is: "
print(b+a)
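
Here are a couple of other ways to get text and a number onto the same line, as a minimal sketch (the variable names `label` and `message` are just made up for this example). `print` accepts several items separated by commas, and an "f-string" lets you drop a variable straight into a string.

```python
a = 5
label = "The number is:"

# print() accepts several items separated by commas, and they can be different types.
# It puts a space between them for you.
print(label, a)

# An f-string (note the f before the quotes) inserts the value of a where the {a} is
message = f"The number is: {a}"
print(message)

# type() tells you what type a variable is
print(type(a))
print(type(message))
```

The f-string version is handy when the number needs to go in the middle of a sentence.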

## Lists of stuff (numbers, for example)
You can assign a number or a string to a variable. But you can also assign groups of numbers or strings. These are called lists.

When you have a list you can access its individual values using something called an index. python uses square brackets for this. python uses something called "zero indexing" where the first element in a list has the index of 0. This might seem a bit weird but it works that way for some [computer science reasons](https://en.wikipedia.org/wiki/Zero-based_numbering#:~:text=Zero%2Dbased%20numbering%20is%20a,mathematical%20or%20non%2Dprogramming%20circumstances). As a consequence programmers think you are a loser if you index your first element with a 1, like a normal human. It is annoying at first but you will get used to it.

In [None]:
some_numbers = [7,4,2,-1,-5]
print("Here is the list: ")
print(some_numbers)
print("Below is the first element. The index of the first element is 0, not 1!!")
print(some_numbers[0])
print("The fifth element is:")
print(some_numbers[4])

You can do the same thing with strings:

In [None]:
some_strings = ["yo","yo","ma"]
print("The strings are:")
print(some_strings)
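
A few more things you can do with a list, sketched with the same numbers as above (the names `n_items`, `last_item` and `first_three` are only for illustration):

```python
some_numbers = [7, 4, 2, -1, -5]

# len() tells you how many items are in the list
n_items = len(some_numbers)
print(n_items)

# A negative index counts backwards from the end, so -1 is the last item
last_item = some_numbers[-1]
print(last_item)

# A "slice" grabs several items at once.
# It starts at the first index and stops just before the second one.
first_three = some_numbers[0:3]
print(first_three)

# append() adds a new item to the end of the list
some_numbers.append(10)
print(some_numbers)
```

Note that the slice `[0:3]` gives you the items with index 0, 1 and 2, but not 3. This is another one of those things that is annoying at first.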

## Stuff to try
* Try to assign a variable
* Try to print your variable
* Try to assign a list
* Try to print an item in your list



# Importing intro

Written by Simon M. Mudd with last update 21/01/2022

## Importing useful stuff
There are loads of people that write useful python code. They make this code available so that you can use it in your own code. One day you might even write some python code that you let other people use.

This useful code is usually distributed as "packages". To get a package you need to install it. But if you are just starting out, you get told where to run your python code (for example, in Google Colab, or in some university's notebook server), and someone has already installed the useful packages for you. You are almost certainly in one of those places now.

But just installing some package doesn't make it available. You need to `import` it.

## What sorts of useful stuff can I import?
If you are reading this tutorial you are vaguely going to use python for data science, broadly defined (as opposed to, say, web development, or making games). Any data scientist will use this stuff:

* `numpy`: short for numerical python. Mathsy stuff in here.
* `matplotlib`: for making plots
* `scipy`: more mathsy stuff but more specific than `numpy`. Statistics are in here.
* `pandas`: reading and dealing with data.

The above packages are really, really common. Everyone uses them. And those packages are very stable: the installations almost never break and there is lots of documentation and tutorials.

You might also be someone interested in geospatial data. This requires packages that are a bit more niche than the packages listed above so I will leave that to a later lesson.

## Show me how to import something!!
You import a package by using the python command `import`. Like this:

In [None]:
import numpy
numpy.__version__

I imported `numpy` there and then I called `__version__` which you can use with pretty much any python package to tell you the version number.

To use something in a package you use the name of the package followed by the `.` symbol. Here are two more things that you might do with numpy (note this only works if you have already imported `numpy`):

In [None]:
a = numpy.arange(6)
print("Here is a numpy array:")
print(a)
b = numpy.linspace(1,5,11)
print("Here is another one made a different way:")
print(b)
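
To spell out what those two functions are doing, here is a small sketch (the variable names are made up for this example). `arange` counts up in steps and stops just before the end value, whereas `linspace` gives you a set number of evenly spaced values and includes both ends.

```python
import numpy

# arange(6) counts from 0 up to, but not including, 6
counting = numpy.arange(6)
print(counting)

# arange(start, stop, step) lets you choose where to start and how big the steps are
evens = numpy.arange(0, 10, 2)
print(evens)

# linspace(start, stop, number) gives that many evenly spaced values,
# and both the start and the stop values are included
spaced = numpy.linspace(1, 5, 11)
print(spaced)
print(len(spaced))
```

A rough rule of thumb: use `arange` when you know the step size you want, and `linspace` when you know how many points you want.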

## Wait a minute. That looks like a list. Why did you call it an array?
`numpy` has its own version of a list called a *numpy array*. Without giving you too many details, a numpy array behaves much like a list but has some added features that make it better than a list for certain kinds of computation. You can convert a list to an array using the numpy function `asarray`:

In [None]:
a_list = [1,2,3,4]
an_array = numpy.asarray(a_list)
print("Type of a_list is:")
print(type(a_list))
print("Type of an_array is:")
print(type(an_array))
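
Here is one example of those "added features", as a minimal sketch (the names `doubled_list` and `doubled_array` are just for illustration). If you multiply a list by 2, python repeats the list. If you multiply a numpy array by 2, every number in it gets multiplied, which is usually what you wanted.

```python
import numpy

a_list = [1, 2, 3, 4]
an_array = numpy.asarray(a_list)

# Multiplying a list by 2 repeats the whole list
doubled_list = a_list * 2
print(doubled_list)

# Multiplying an array by 2 multiplies each number in it
doubled_array = an_array * 2
print(doubled_array)

# You can also do maths between two arrays of the same length, item by item
print(an_array + doubled_array)
```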

## Do I really have to type out `numpy` each time I want to do something with numpy?
It is a bit annoying to type out the full name of a package each time you want to use something in it. Luckily python lets you give packages short names (or long names, if you have some weird fetish) by using the `import ... as` syntax:

In [None]:
import numpy as np
a = np.arange(6)
print("Here is a numpy array:")
print(a)

## Let's import matplotlib and make a plot
Another useful package is `matplotlib`, which does plotting. `matplotlib` gives you a lot of control over the appearance of plots, but here is a very simple example:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Data for plotting
x = [1,2,3,4,5]
y = [2,3,3.5,7,9]

plt.plot(x,y)
plt.xlabel("x")
plt.ylabel("y")
plt.title('About as simple as it gets, folks')
plt.grid()
plt.show()

Or we could do a scatter plot:

In [None]:
# Clear the plot with the clf() command
plt.clf()

plt.scatter(x,y)
plt.xlabel("x",fontsize=20)
plt.ylabel("y",fontsize=20)

# I'm going to use a newline character (\n) in the title
plt.title('Scatter plots are usually\nbetter with discrete data',fontsize=24)
plt.grid()
plt.show()

Okay, we have some very basic tools up our sleeves. Let's move on to `pandas`.

# Pandas intro

It is possible that someone who knows something about python will look at these lessons. They will skim lesson 01 and 02 and then get to this lesson and think "what are you doing, you can't go from something as basic as assigning and making a numpy array straight to pandas, you psychopath!". They will then run away, screaming like a banshee, and jump into a canal.

But I don't care what those people think. Because I want you to know how to plot some spreadsheet data as quickly as possible.

Ready? Let's go!!

## What does a Chinese bear have to do with importing data?
pandas comes from "python data analysis library". It sounds better than "pdal". Although I like bicycles so I might have called it "pedal". Anyway, I did not write pandas. A guy working for a hedge fund did, because analysis of financial data often involves dealing with messy time series data. This package has ended up being immensely useful to all kinds of people and now forms one of the keystones of data science in python.

pandas is most at home reading csv data. csv stands for "comma-separated values". You can save Excel files in this format. The format has columns separated by commas.

Actually, let me write this file. You won't really be writing files like this in python so don't worry about the syntax.

In [None]:
f = open("toadfile.csv", "w")
f.write("year,pond,toads")
f.write("\n2011,1,1")
f.write("\n2011,2,13")
f.write("\n2012,1,2")
f.write("\n2012,2,11")
f.write("\n2013,1,7")
f.write("\n2013,2,4")
f.close()

with open("toadfile.csv", 'r') as fin:
    print(fin.read())

Okay, if you wanted to you could open this file using Excel. But instead we will open it with `pandas`

## Pandas and toads
We first need to import `pandas`. And because I don't feel like writing out `pandas` all the time I will import it as `pd`.

I will then read the toads file using the `read_csv` function:

In [None]:
import pandas as pd
df = pd.read_csv("toadfile.csv")

Hey what happened? There is no output if you run the above cell.

That is because pandas has read the data in the toadfile into a variable called `df`. In the pandas world `df` is quite a common name for a variable because it is short for `dataframe`. And pandas calls any collection of data a `DataFrame`. You can see this with the `type` command.

In [None]:
print(type(df))

Okay, what if we want to look at the data? You can use the `head` command to see the data:

In [None]:
df.head()

The default is to see the first 5 rows. You can change this quite easily:

In [None]:
df.head(6)
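
`head` is not the only way to have a look at a dataframe. Below is a short sketch of a few others. To make this cell run by itself I have put the same toad data into a string and read it with `io.StringIO` (part of standard python), which is an assumption of this example rather than something you need for the lesson.

```python
import io
import pandas as pd

# The same toad data as above, held in a string so this cell runs by itself
toad_text = "year,pond,toads\n2011,1,1\n2011,2,13\n2012,1,2\n2012,2,11\n2013,1,7\n2013,2,4"
df = pd.read_csv(io.StringIO(toad_text))

# shape is (number of rows, number of columns)
n_rows, n_columns = df.shape
print(n_rows, n_columns)

# columns holds the column names
column_names = list(df.columns)
print(column_names)

# tail() is like head() but shows the last rows instead of the first
print(df.tail(2))

# describe() gives summary statistics (mean, min, max and so on) for each numeric column
print(df.describe())
```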

## Selecting data using pandas (where a lot of the magic happens)
Why is `pandas` so useful? Well one of the things I find most useful is the ease of selecting data.

Observe me get the `toads` data:

In [None]:
toads = df.toads
print("The data type of the variable toads is: ")
print(type(toads))
print("And here is the data:")
print(toads)
print("I can convert this to a list")
print(list(toads))
print("And I can convert it to an array:")
print(toads.to_numpy())
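
There are two ways to pull out a column, and once you have one you can do quick calculations on it. Here is a minimal sketch (I rebuild the toad data by hand so the cell runs by itself, and the variable names are made up for this example):

```python
import pandas as pd

# The toad data again, typed in directly this time
df = pd.DataFrame({"year": [2011, 2011, 2012, 2012, 2013, 2013],
                   "pond": [1, 2, 1, 2, 1, 2],
                   "toads": [1, 13, 2, 11, 7, 4]})

# These two lines select the same column.
# The square bracket version also works when the column name has spaces in it.
toads_dot = df.toads
toads_bracket = df["toads"]
print(toads_dot.equals(toads_bracket))

# A column has built-in functions for simple statistics
total_toads = toads_bracket.sum()
most_toads = toads_bracket.max()
mean_toads = toads_bracket.mean()
print(total_toads, most_toads, mean_toads)
```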

Here is another very useful feature. You can select data by conditional statements (if the data meets a condition, you keep it).

You need this funny syntax with some brackets but as long as you copy the format below with your own data you should be fine.

In [None]:
df_2011 = df[(df['year'] == 2011)]
df_2011.head()

In [None]:
# This time we keep the years after 2011, so the variable gets a new name
df_after_2011 = df[(df['year'] > 2011)]
df_after_2011.head()

In [None]:
df_pond1 = df[(df['pond'] == 1)]
df_pond1.head()
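
If you are wondering what the funny syntax is doing, here is a sketch that splits it into steps and then combines two conditions (the toad data is rebuilt by hand so the cell runs by itself, and the variable names are only for illustration). The bit inside the brackets makes a column of `True` and `False` values, and pandas keeps the rows where it is `True`.

```python
import pandas as pd

df = pd.DataFrame({"year": [2011, 2011, 2012, 2012, 2013, 2013],
                   "pond": [1, 2, 1, 2, 1, 2],
                   "toads": [1, 13, 2, 11, 7, 4]})

# The condition on its own gives True or False for every row
is_after_2011 = df["year"] > 2011
print(is_after_2011)

# & means "and". Each condition needs its own set of parentheses.
df_late_pond1 = df[(df["year"] > 2011) & (df["pond"] == 1)]
print(df_late_pond1)

# | means "or"
df_2011_or_2013 = df[(df["year"] == 2011) | (df["year"] == 2013)]
print(df_2011_or_2013)
```

This is why the examples above wrap the condition in parentheses even when there is only one: it makes it easy to bolt on a second condition later.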

## Plotting some data using pandas
Recent versions of pandas have some built-in plotting functions. [You can see the options here](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).



In [None]:
df.plot.scatter(x="year", y="toads", c="pond", cmap="viridis", s=50);

# Part 2: Plotting more complex data

Here is where Joanmarie takes over with some more complicated data than a few toads. 

In [None]:
import pandas as pd 
# As long as this .csv file is in the "data" directory,
# we can load temperature data from a CSV file
weather_data = pd.read_csv('data/williamsburg_meteo.csv')

# Peek at the data and particularly the column names
weather_data.head()

We can create new columns and fill them with a single value or perform an operation on a column:

In [None]:
# You can make any column and fill it with anything you want
weather_data['QC'] = 'good' # a pretend "quality control column"

# You can look at data in one column and make a new column with slightly different formatting
weather_data['datetime'] = pd.to_datetime(weather_data['DATE']) # the pd.to_datetime() just reads the dates as a specific type of data that plots well for time series

# You can do a calculation on a column! 
weather_data['PRCP_cm'] = weather_data['PRCP'] * 2.54 # convert inches to centimeters

weather_data.head()
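
To see what `pd.to_datetime()` actually buys you, here is a small sketch on a made-up table. The dates and rainfall values below are invented for illustration and are NOT the real Williamsburg data; only the column names `DATE` and `PRCP` are borrowed from the cell above.

```python
import pandas as pd

# Made-up values for illustration only (not the real weather data)
example = pd.DataFrame({"DATE": ["2020-01-30", "2020-01-31", "2020-02-01"],
                        "PRCP": [0.0, 0.5, 1.0]})

# To begin with the dates are just text
print(example["DATE"].dtype)

# After conversion pandas knows they are dates
example["datetime"] = pd.to_datetime(example["DATE"])
print(example["datetime"].dtype)

# Once a column holds dates you can pull out parts of the date with .dt
example["month"] = example["datetime"].dt.month

# A calculation on a column is applied to every row at once
example["PRCP_cm"] = example["PRCP"] * 2.54  # inches to centimeters

print(example)
```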

Precip and temperature data were originally given in imperial units. Using the example above where I converted inches to centimeters (`weather_data['PRCP']` to `weather_data['PRCP_cm']`), create new columns where temperature values are given in metric units (Celsius).

In [None]:
# your code here

Now, let's make some plots:

In [None]:
# For ease, we will define separate variables as the columns in our DataFrame. 
# This way you can type "date" instead of the DataFrame and column name

date = weather_data['datetime']

temperature = weather_data['TOBS'] 

We can now use the `plt` module we loaded to make a simple plot. You can always [read the docs](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) to understand the arguments that plotting functions take. (And you *find* the docs by searching "matplotlib [function]")

In [None]:
plt.plot(date, temperature)

## Your turn!

Create a [scatter](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) plot of temperature over time with points colored by precipitation. Most plotting functions allow you to specify a `c` axis that colors certain datapoints to be a third data axis for data-rich plots. When you do that you'll want to add a [colorbar](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.colorbar.html) so your viewers know what they're looking at. 

I'll get you started:

In [None]:
# Specify the "c" keyword to add precipitation data as colors!
plt.scatter(???, ???, c=???)


# Call the "colorbar()" function to add a colorbar!
plt.colorbar(label="precipitation")

# Set the title of the plot
plt.title('??')

# Label the x-axis
plt.xlabel('??')

# Label the y-axis
plt.ylabel('???')

Whew! Congrats, you've made it to the end of the introduction to Python, data, and plotting!!

You can experiment with plotting in the cells above. If the above material was a bit of a stretch for you and you need to digest it a little, take the rest of the time to read over the content and convince yourself you know what goes on when you call certain lines of code. 

If, on the other hand, you feel confident with the material covered so far, you can read ahead to learn about some of the nuances of plotting. 

# Advanced topic: the nuances of making plots

## Using `plt.plot()`

In Matplotlib, both `plt.figure()` and `fig, ax = plt.subplots()` are used to create figures, but they have different use cases and behaviors:

`plt.figure()`:

- `plt.figure()` is used to create a single figure object, and it returns a reference to that figure. This figure can contain one or more subplots (Axes).

- When you create plots using `plt.plot()`, `plt.scatter()`, etc., without explicitly specifying an Axes object, Matplotlib will automatically create an Axes within the current figure.

- It is useful when you want to create a single plot without multiple subplots, and you are not concerned about creating multiple axes explicitly.

Here, we will create a `figure` object

In [None]:
# Create a figure with a specific size (10x6 inches)
plt.figure(figsize=(10, 6))

# Create a line plot using time on the x-axis and temperature on the y-axis
# Customize the plot with blue color, circular markers, solid lines, and marker size
plt.plot(date, temperature, color='blue', marker='o', linestyle='-', markersize=4)

# Set the title of the plot
plt.title('Temperature Over Time')

# Label the x-axis
plt.xlabel('Time')

# Label the y-axis
plt.ylabel('Temperature (°F)')  # TOBS has not been converted, so it is still in Fahrenheit

Note we can do things like [set the limits of axes](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ylim.html):

In [None]:
# Create a figure with a specific size (10x6 inches)
plt.figure(figsize=(10, 6))

# Create a line plot using time on the x-axis and temperature on the y-axis
# Customize the plot with blue color, circular markers, solid lines, and marker size
plt.plot(date, temperature, color='blue', marker='o', linestyle='-', markersize=4)

# Set the title of the plot
plt.title('Temperature Over Time')

# Label the x-axis
plt.xlabel('Time')

# Set the y-axis limits from 0 to 100
plt.ylim(0, 100)

# Label the y-axis
plt.ylabel('Temperature (°F)')  # TOBS has not been converted, so it is still in Fahrenheit
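
If you want to convince yourself that `plt.plot()` really does quietly create the Axes for you, here is a minimal sketch with some made-up numbers. `plt.gcf()` ("get current figure") is a real matplotlib function that hands back whatever figure `plt` is currently working on.

```python
import matplotlib.pyplot as plt

# plt.figure() makes a new figure and returns a reference to it
fig = plt.figure(figsize=(6, 4))

# matplotlib keeps track of a "current" figure, and the new one is now it
print(fig is plt.gcf())

# A brand new figure has no axes in it yet
print(len(fig.axes))

# plt.plot() draws on the current figure and creates the axes automatically
plt.plot([1, 2, 3], [2, 4, 8])
print(len(fig.axes))
```

This hidden "current figure" is convenient for quick plots, but it is also the reason people move to explicit `ax` objects once their figures get more complicated.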

## Using `ax` objects

In contrast, we can use `fig, ax = plt.subplots()`:

- Multiple Subplots: `plt.subplots()` is used to create a figure (`Figure`) and one or more subplots (`Axes`) within that figure. It returns both the figure and the Axes: a single Axes object if you ask for one subplot, or an array of Axes objects if you ask for more.

- Explicit Axes: You explicitly create and specify the Axes objects when using `fig, ax = plt.subplots()`. This allows you to have more control over the placement and arrangement of subplots.

- Usage: It is useful when you need to create multiple subplots within a single figure, such as creating a grid of plots.

A main difference is that the syntax for customizing `ax` objects will often include "`set_`" as in `set_xlabel()` as opposed to just `plt.xlabel()`


In [None]:
# Create a fig and an ax object with two elements
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

# Plot something on the ax object
ax[0].plot(date, temperature, color='red', marker='o', linestyle='-', markersize=4)

# Set the title of the axis
ax[0].set_title('Temperature Over Time')

# Label the x-axis
ax[0].set_xlabel('Time')

# Label the y-axis
ax[0].set_ylabel('Temperature (°F)')  # TOBS has not been converted, so it is still in Fahrenheit

# Plot something on the ax object
ax[1].plot(date, weather_data['PRCP'], color='blue', marker='o', linestyle='-', markersize=4)

# Set the title of the axis
ax[1].set_title('Precipitation Over Time')

# Label the x-axis
ax[1].set_xlabel('Time')

# Label the y-axis
ax[1].set_ylabel('Precipitation (in)')



But you don't need to create multiple plots if you don't want to:

In [None]:
# Create a fig and ax object
fig, ax = plt.subplots(figsize=(10, 6))

# Plot something on the ax object
ax.plot(date, temperature, color='blue', marker='o', linestyle='-', markersize=4)

# Set the title of the axis
ax.set_title('Temperature Over Time')

# Label the x-axis
ax.set_xlabel('Time')

# Label the y-axis
ax.set_ylabel('Temperature (°F)')  # TOBS has not been converted, so it is still in Fahrenheit
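
The reason you write `ax[0]` in one example and plain `ax` in the other is that `plt.subplots()` hands back different things depending on how many plots you ask for. Here is a short sketch that checks this (the variable names are made up for this example, and the plots are left empty on purpose):

```python
import matplotlib.pyplot as plt
import numpy as np

# One plot: you get a single Axes object, so no square brackets are needed
fig_single, ax_single = plt.subplots()
print(type(ax_single))

# One row of two plots: you get a numpy array holding two Axes objects
fig_pair, ax_pair = plt.subplots(1, 2, figsize=(8, 3))
print(isinstance(ax_pair, np.ndarray))
print(ax_pair.shape)

# A 2 by 2 grid: the array is two dimensional, so you index it with [row, column]
fig_grid, ax_grid = plt.subplots(2, 2)
ax_grid[0, 1].set_title("top right")
```

So if you get an error saying an array has no attribute `plot`, you have probably forgotten the square brackets.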


## Using `pandas`' built-in plotting functions

`pandas` actually has its own [plotting functions](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) that use a slightly different syntax for quick visualization of data.

You can see below that the syntax is `[name of the data frame].plot.[type of plot]` for something like a [scatterplot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.scatter.html).

In [None]:
# Use the built-in plot.scatter() function to create a scatter plot
weather_data.plot.scatter(x='datetime', y='TMAX', c='TMIN', title='Example Plot', marker='o', cmap='viridis')

You can specify the `ax` object to plot on for maximum customization of axes!

In [None]:
# Create a fig and ax object
fig, ax = plt.subplots(figsize=(10, 6))

# Use the built-in plot.scatter() function to create a scatter plot on our ax object
weather_data.plot.scatter(x='datetime', y='TMAX', c='TMIN', title='Example Plot', marker='o', cmap='viridis', ax=ax)

ax.set_ylim(0, 100)

ax.set_ylabel('Maximum temp (F)')

ax.set_xlabel('Date')
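
The pandas plotting functions also hand back the Axes they drew on, so you can keep customizing afterwards. Here is a minimal sketch using the toad data from Part 1, typed in by hand so the cell runs by itself (the names `toads_df` and `returned_ax` are just for illustration):

```python
import matplotlib.pyplot as plt
import pandas as pd

toads_df = pd.DataFrame({"year": [2011, 2011, 2012, 2012, 2013, 2013],
                         "pond": [1, 2, 1, 2, 1, 2],
                         "toads": [1, 13, 2, 11, 7, 4]})

# Create a fig and ax object
fig, ax = plt.subplots(figsize=(6, 4))

# Tell pandas to draw on our ax, and catch what it returns
returned_ax = toads_df.plot.scatter(x="year", y="toads", c="pond", cmap="viridis", ax=ax)

# What comes back is the same ax we passed in
print(returned_ax is ax)

# So either name can be used to customize the plot
returned_ax.set_title("Toads per pond")
```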

## Your turn!

Bringing all your knowledge together, create a visual that shows both a line plot (which cannot be colored by another variable) and a scatter plot (which can be colored) that shows some data. 

In [None]:
# Create a fig and ax object
fig, ax = plt.subplots(figsize=(10, 6))

# Plot something on the ax object
# zorder tells the program what order to plot objects in
ax.plot(date, ???, color='???', linestyle='-', zorder=0)

# One way to do it is to name a variable the ax object's plot
# I am also specifying a "vmin" and "vmax" which are the minimum and maximum values for the colorbar
scatterplot = ax.scatter(date, ???, c=???, marker='o', linestyle='-',
                         vmin=???,
                         vmax=???,
                         zorder=1)

# Customize the colorbar by specifying the variable name for the axis object
colorbar = plt.colorbar(scatterplot, ax=ax)
colorbar.set_label('???')  # Set the label for the colorbar

ax.set_ylabel('???')

ax.set_xlabel('???')