# NM Supercomputing Challenge 2020

## Session #1 (Mark Servilla): Introduction to Git and GitHub, Jupyter Notebooks, Python, and pandas

---
## Interacting with Git and GitHub

 1. Fork the [2020-10-03-nmcsa-pangeo](https://github.com/nmcarpentries/2020-10-03-nmcsa-pangeo) repository to your GitHub account
 1. Click on the **launch-binder** button near the top of the README.md file (see README.md instructions)
 1. Once the *Pangeo Binder* is launched and you see the *JupyterLab* environment, click on the **terminal** button near the bottom of the *Launcher* tab
 1. In the terminal window (Linux Bash shell), configure this local Git instance for your profile and point it at your "forked" repository, so that changes made in this Binder environment are saved to your GitHub account (replace the name, email address, and GitHub username shown here with your own):  
   ```
   export PS1="> "
   git config user.email "mark.servilla@gmail.com"
   git config user.name "Mark Servilla"
   git remote set-url origin https://github.com/servilla/2020-10-03-nmcsa-pangeo
   git remote -v
     origin  https://github.com/servilla/2020-10-03-nmcsa-pangeo (fetch)
     origin  https://github.com/servilla/2020-10-03-nmcsa-pangeo (push)
   ```

 #### Come to this step at the end of each session
 5. To save the work you have just done in this session, we'll use Git to *push* our changes back to your GitHub account. But first, you should always perform a `git pull` before doing your *push*:
   ```
   git pull
   git add *
   git commit -m "Saving changes to GitHub"
   git push
     Username for 'https://github.com': servilla
     Password for 'https://servilla@github.com':
     Counting objects: 4, done.
     Delta compression using up to 6 threads.
     Compressing objects: 100% (3/3), done.
     Writing objects: 100% (4/4), 505 bytes | 505.00 KiB/s, done.
     Total 4 (delta 2), reused 0 (delta 0)
     remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
     To https://github.com/servilla/2020-10-03-nmcsa-pangeo
        2963446..0e25a7d  master -> master
   ```

---
## Working with Jupyter Notebooks

"Jupyter notebooks are documents that combine live runnable code with narrative text (Markdown), equations (LaTeX), images, interactive visualizations and other rich output. Jupyter notebooks (.ipynb files) are fully supported in JupyterLab. The notebook document format used in JupyterLab is the same as in the classic Jupyter Notebook. Your existing notebooks should open correctly in JupyterLab." - https://jupyterlab.readthedocs.io/en/stable/user/notebook.html

Types of cells:
  1. Raw cell - to display raw text
  1. Code cell - to enter and execute code
  1. Markdown cell - to enter and display markdown (like this cell)

Common keyboard commands:
 - Esc - Enter command mode
 - Enter - Enter edit mode
 - Ctrl-Enter - Run cell
 - Alt-Enter - Run cell and insert new cell below
 - X - Cut cell (press D twice to delete a cell)
 - C - Copy cell
 - A - Insert cell above
 - B - Insert cell below

[Keyboard Shortcuts](https://blog.ja-ke.tech/assets/jupyterlab-shortcuts/Shortcuts2020.png)

---
## Python quickstart

What is Python?
 - named after the British comedy troupe "Monty Python"
 - is an interpreted computer language
 - created by Guido van Rossum and first released in 1991.

### Comments:

In [None]:
x = 1  # Comments begin with a "hash" or "sharp" character

### Built-in data types:

In [None]:
# Variables in Python are names that refer to values and are created with the
# assignment operator "="; variable names may include numbers, but must start
# with a letter or an underscore:

text = "Data Carpentry"  # An example of a string
number = 42  # An example of an integer
pi_value = 3.1415  # An example of a float
is_true = True  # An example of a boolean

In [None]:
text

In [None]:
number

In [None]:
pi_value

In [None]:
is_true

In [None]:
# We can also use the built-in function "print" to produce visual output of
# expressions and variables

print("This is a text string: ", text)
print("This is an integer number: ", number)
print("This is a floating point or real number: ", pi_value)
print("This is a boolean: ", is_true)

In [None]:
# The built-in "type" function tells us what data type each variable is using

print("text is type: ", type(text))
print("number is type: ", type(number))
print("pi_value is type: ", type(pi_value))
print("is_true is type: ", type(is_true))
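
Values can also be converted from one built-in type to another. The cell below is a minimal sketch of that idea; the variable names are made up for illustration and are not part of the lesson data.

```python
# Convert between built-in types with int(), float(), and str().
number_text = "42"               # a string that happens to contain digits
as_integer = int(number_text)    # string -> integer
as_float = float(number_text)    # string -> float
as_text = str(3.1415)            # float -> string

print(as_integer, type(as_integer))
print(as_float, type(as_float))
print(as_text, type(as_text))

# isinstance() checks whether a value has a given type.
is_integer = isinstance(as_integer, int)
print(is_integer)
```

Converting text to a number is a common first step when values are read from a file, because they usually arrive as strings.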

### Operators:

In [None]:
# We can perform mathematical calculations in Python using the basic operators
# +, -, *, /, //, %, **

addition = 2 + 2
subtraction = 10 - 5
multiplication = 4 * 2
float_quotient = 12 / 5
integer_quotient = 12 // 5
remainder = 12 % 5
exponentiation = 2 ** 10

In [None]:
print("addition = 2 + 2: ", addition)
print("subtraction = 10 - 5: ", subtraction)
print("multiplication = 4 * 2: ", multiplication)
print("float_quotient = 12 / 5: ", float_quotient)
print("integer_quotient = 12 // 5: ", integer_quotient)
print("remainder = 12 % 5: ", remainder)
print("exponentiation = 2 ** 10: ", exponentiation)
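
The two division operators and the remainder operator are related. As a small sketch (the names `dividend`, `divisor`, and `rebuilt` are chosen here only for illustration), the floor quotient and the remainder can be combined to recover the original number:

```python
dividend = 12
divisor = 5

float_quotient = dividend / divisor      # true division always gives a float
integer_quotient = dividend // divisor   # floor division drops the fractional part
remainder = dividend % divisor           # what is left over after floor division

# Floor division and remainder fit together to rebuild the original number.
rebuilt = integer_quotient * divisor + remainder
print(float_quotient, integer_quotient, remainder, rebuilt)
```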

In [None]:
# We can also use comparison operators: <, >, ==, !=, <=, >=

less_than = 5 < 6
greater_than = 6 > 4
is_equal = 5 == 5
is_not_equal = 5 != 7
less_than_or_equal = 2 <= 3
greater_than_or_equal = 4 >= 3

In [None]:
print("less_than = 5 < 6: ", less_than)
print("greater_than = 6 > 4: ", greater_than)
print("is_equal = 5 == 5: ", is_equal)
print("is_not_equal = 5 != 7: ", is_not_equal)
print("less_than_or_equal = 2 <= 3: ", less_than_or_equal)
print("greater_than_or_equal = 4 >= 3: ", greater_than_or_equal)

In [None]:
# And we can combine boolean values with the logical operators and, or, not

and_comparison = True and True
or_comparison = True or False
not_comparison = not False

In [None]:
print("and_comparison = True and True: ", and_comparison)
print("or_comparison = True or False: ", or_comparison)
print("not_comparison = not False: ", not_comparison)
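
Comparison and logical operators are usually combined. The following sketch uses a hypothetical `temperature` value and made-up thresholds to show how several comparisons become a single True/False answer:

```python
temperature = 72  # hypothetical reading, used only for illustration

# "and" is True only when both comparisons are True.
is_comfortable = temperature >= 65 and temperature <= 80

# "or" is True when at least one comparison is True.
is_extreme = temperature < 32 or temperature > 100

# "not" flips a boolean value.
is_not_extreme = not is_extreme

print("is_comfortable: ", is_comfortable)
print("is_extreme: ", is_extreme)
print("is_not_extreme: ", is_not_extreme)
```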

### Lists and Dictionaries:

In [None]:
# Lists are a common data structure to hold an ordered sequence of elements.
# Each element can be accessed by an index. Note that Python indexes start
# with 0 instead of 1

numbers = [1, 2, 3]
numbers[0]

In [None]:
# A for loop can be used to access the elements in a list or other Python
# data structure one at a time. Indentation is very important in Python.
# Note that the second line in the example below is indented.

for num in numbers:
    print(num)

In [None]:
# To add elements to the end of a list, we can use the append method. Methods
# are a way to interact with an object (a list, for example). We can invoke a
# method using the dot . followed by the method name and a list of arguments
# in parentheses. Let’s look at an example using append:

numbers.append(4)
print(numbers)
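
A few more ways to look inside a list are sketched here, assuming the same `numbers` list as in the cells above (it is rebuilt so the cell runs on its own):

```python
numbers = [1, 2, 3]
numbers.append(4)

first = numbers[0]        # index 0 is the first element
last = numbers[-1]        # negative indexes count from the end
count = len(numbers)      # len() returns the number of elements
first_two = numbers[0:2]  # a slice includes index 0 and 1, but stops before 2

print(first, last, count, first_two)
```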

In [None]:
# A dictionary is a container that holds pairs of objects - keys and values:

translation = {'one': 'first', 'two': 'second'}
translation['one']

In [None]:
# To add an item to the dictionary we assign a value to a new key:

rev = {'first': 'one', 'second': 'two'}
print(rev)
rev['third'] = 'three'
print(rev)

In [None]:
# Using for loops with dictionaries is a little more complicated:

for key, value in rev.items():
    print(key, '->', value)
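
Asking a dictionary for a key it does not contain raises a `KeyError`. As a brief sketch, reusing the `translation` dictionary from above, the `in` operator and the `get` method are two ways to check for a key safely:

```python
translation = {'one': 'first', 'two': 'second'}

has_one = 'one' in translation       # is the key present?
has_three = 'three' in translation

# get() returns a fallback value instead of raising an error for a missing key.
missing = translation.get('three', 'unknown')

# Assigning to an existing key replaces its value.
translation['two'] = '2nd'

print(has_one, has_three, missing, translation)
```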

### Functions:

In [None]:
# Defining a section of code as a function in Python is done using the def
# keyword. For example a function that takes two arguments and returns their
# sum can be defined as:

def add_function(a, b):
    result = a + b
    return result

z = add_function(20, 22)
print(z)
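
Functions can also give their arguments default values. The variant below is only an illustration and is not used elsewhere in this notebook; it makes the second argument optional and adds a docstring that describes what the function does:

```python
def add_with_default(a, b=0):
    """Return the sum of a and b; b defaults to 0 when omitted."""
    return a + b

both_given = add_with_default(20, 22)     # both arguments supplied
default_used = add_with_default(20)       # b falls back to its default of 0
by_keyword = add_with_default(a=1, b=2)   # arguments passed by name

print(both_given, default_used, by_keyword)
```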

---
## Introduction to Python Pandas

 - "pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python." - https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html
 - pandas was created by Wes McKinney in 2008 and the name *pandas* is derived from the term "panel data" used in the field of economics.
 - Think of pandas as a combination of spreadsheet, database, and statistical software rolled into a single Python package.

Python doesn’t load all of the libraries available to it by default. We have to add an `import` statement to our code in order to use library functions. To import a library, we use the syntax `import libraryName`. If we want to give the library a nickname to shorten the command, we can add `as nickNameHere`. An example of importing the pandas library using the common nickname `pd` is below.

You must install, then `import` pandas to use it in the notebook:

In [None]:
# A leading exclamation point tells Jupyter to run the rest of the line as a
# shell command

!pip install pandas

In [None]:
import pandas as pd
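
The same `import` syntax works for any library. As a small sketch, here it is applied to two modules from the Python standard library (`math` and `statistics`, chosen only as examples because they need no installation):

```python
import math                  # plain import: names are used as math.<name>
import statistics as stats   # import with a nickname: names are used as stats.<name>

square_root = math.sqrt(16)
average = stats.mean([1, 2, 3, 4])

print(square_root, average)
```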

### What is a pandas DataFrame?

"A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, factors and more) in columns. It is similar to a spreadsheet or an SQL table or the `data.frame` in R. A DataFrame always has an index (0-based). An index refers to the position of an element in the data structure." - https://datacarpentry.org/python-ecology-lesson/02-starting-with-data/index.html

Reading the data file `./data/surveys.csv`:

In [None]:
# Note that pd.read_csv is used because we imported pandas as pd
surveys_df = pd.read_csv("data/surveys.csv")

### Viewing the contents of the *surveys* DataFrame:

In [None]:
# We can use the print function to display the DataFrame:

print(surveys_df)

In [None]:
# But just letting the Jupyter Notebook format the default output is much nicer:

surveys_df

### Exploring the structure of the *surveys* DataFrame:

In [None]:
# We can look at the top of the DataFrame with the head method:

surveys_df.head()

In [None]:
# We can look at the bottom of the DataFrame with the tail method:

surveys_df.tail()

In [None]:
# We can look at the data type of the DataFrame with the Python type function:

type(surveys_df)

In [None]:
# We can look at the columns of the DataFrame with the columns attribute:

surveys_df.columns

In [None]:
# We can look at the shape of the DataFrame with the shape attribute:

surveys_df.shape

In [None]:
# We can look at the data types of the DataFrame with the dtypes attribute:

surveys_df.dtypes
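
To see what these attributes report without loading the full file, it can help to build a tiny DataFrame by hand. The sketch below uses made-up values; only the column names are borrowed from the cells above, and `small_df` is a hypothetical name.

```python
import pandas as pd

# Hypothetical miniature table with four rows and four columns.
small_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["NL", "DM", "DM", "PF"],
    "sex": ["M", "F", "M", "F"],
    "weight": [40.0, 41.5, 38.0, 7.0],
})

n_rows, n_columns = small_df.shape        # shape is a (rows, columns) tuple
column_names = list(small_df.columns)     # columns holds the column labels

print(n_rows, n_columns)
print(column_names)
print(small_df.dtypes)                    # one data type per column
```

Note that `shape`, `columns`, and `dtypes` are written without parentheses because they are attributes, while `head()` and `tail()` are methods and need them.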

### Analyzing and calculating statistics on the *surveys* DataFrame:

In [None]:
# We can compute summary statistics of the DataFrame with the describe function:

surveys_df.describe()

In [None]:
# We can compute the unique values in a categorical column of the DataFrame:

surveys_df["species_id"].unique()

In [None]:
# We can compute statistics for a single column of the DataFrame:

surveys_df['weight'].describe()

In [None]:
# We can compute statistics about categories in a column of the DataFrame:

grouped_data = surveys_df.groupby('sex')
grouped_data.describe()

In [None]:
# And if we want to see only the record counts for each sex in the DataFrame:

grouped_data.describe()["record_id"]["count"]

In [None]:
# We can also do basic math on numeric columns of the DataFrame:

weight_doubled = surveys_df["weight"] * 2
print(surveys_df["weight"], weight_doubled)

In [None]:
# Count the number of samples by species:

species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(species_counts)
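
One detail worth knowing is that `count()` only counts values that are present. The sketch below uses a hypothetical four-row table (made-up values, with one missing weight) to show how the count of `record_id` and the count of `weight` can differ within the same group:

```python
import pandas as pd

# Hypothetical miniature table; the second row has no recorded weight.
small_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["NL", "DM", "DM", "PF"],
    "weight": [40.0, None, 38.0, 7.0],
})

# Every row has a record_id, so this counts rows per species.
record_counts = small_df.groupby("species_id")["record_id"].count()

# Missing weights are skipped, so this counts measured weights per species.
weight_counts = small_df.groupby("species_id")["weight"].count()

print(record_counts)
print(weight_counts)
```

Counting a column that is never empty, such as `record_id`, is therefore a reasonable way to count the rows in each group.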

### Displaying a graph of data from the *surveys* DataFrame:

In [None]:
# Install the plotting library "matplotlib"

!pip install matplotlib

In [None]:
# Make sure figures appear inline in the Notebook with the %matplotlib magic

%matplotlib inline

In [None]:
# Create a quick bar chart of species counts

species_counts.plot(kind='bar');

In [None]:
# We can also look at how many animals were captured at each site:

total_count = surveys_df.groupby('plot_id')['record_id'].nunique()
total_count.plot(kind='bar');