# Introduction to Python for Researchers 

Ian Thomas, Research Innovation and Capability, RMIT University

***

## Introduction

In this tutorial, we will 

* introduce NeCTAR, the Australian national infrastructure for research cloud computing, 

* create a Jupyter notebook (a user interface for Python programming), 

* review SciPy (a collection of Python packages for mathematics, science and engineering), and 

* experiment with the Pandas and Matplotlib packages for a small case study.

This tutorial is a quick introduction to these tools and a stepping-off point for further investigation.  I provide links to each tool and to other background information that may be helpful.

### Who is this for?

* "Absolute" beginners

* Who might be good at spreadsheets (e.g., Excel)

* Who might be interested in improving the reproducibility of their analyses

* Who are interested in getting a high-level overview of data wrangling and are prepared to do their own self-study.

### "Why would I want to learn and use Python?"

* It's relatively easy to learn.

* It's free, with a good user community for assistance.

* It works well as a scripting language for gluing other programs together.

* It includes lots of helpful libraries of extra tools (including data wrangling and analysis).

* It's a robust general-purpose language, useful beyond just data wrangling.

* It's a more reproducible alternative to using Excel?






## Jupyter Notebooks

Text from <http://jupyter.org>

> The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

We will install a Jupyter notebook, a web-based environment for Python (and other language) programming.  Here we show *three* different ways of getting to a Jupyter notebook, and along the way highlight a key technology or platform that can be useful for other research activities.


### Getting a Jupyter notebook up and running

1.  **The locally installed option**: 
Download and install the Anaconda distribution, an open-source distribution with hundreds of packages useful for data science, machine learning and scientific programming, onto your Mac or PC: <https://www.anaconda.com/products/individual>

2. **The cloud option**: Deploy Jupyter on the NeCTAR research cloud, the national cloud computing platform for Australian researchers (including HDRs).

3. **The demonstrator option**: Use Binder <http://mybinder.org> to spin up _temporary_ Jupyter notebooks for experimental purposes.  This is the *easy* option!

See <https://jupyter.org/install.html> for more information.


### The Anaconda Distribution: The locally installed option

Text from <https://www.anaconda.com/>

> With over 20 million users worldwide, the open-source Individual Edition (Distribution) is the easiest way to perform Python/R data science and machine learning on a single machine. Developed for solo practitioners, it is the toolkit that equips you to work with thousands of open-source packages and libraries.

This application is available for Windows, Mac or Linux, and provides an environment for running Python applications and using an extensive set of preinstalled libraries.

It runs on your desktop or laptop (not in the cloud) and is great for doing your own analyses or programming on your own machine.


#### Other Anaconda applications

There are other Anaconda distribution applications including RStudio (for R), UI builders and other analysis tools.





### Nectar deployment: The cloud option

Text from <http://nectar.org.au>

> The National eResearch Collaboration Tools and Resources project (Nectar) provides an online infrastructure that supports researchers to connect with colleagues in Australia and around the world, allowing them to collaborate and share ideas and research outcomes, which will ultimately contribute to our collective knowledge and make a significant impact on our society.

> *Nectar Cloud* provides computing infrastructure, software and services that allow Australia’s research community to store, access, and run data, remotely, rapidly and autonomously. Nectar Cloud’s self-service structure allows users to access their own data at any time and collaborate with others from their desktop in a fast and efficient way.

The good news is that you have access to the Nectar research cloud right now, as all Australian researchers (including HDRs) are allocated a small amount of resources with which to experiment. RMIT has its own share of these resources that it can provide to RMIT researchers through an allocation scheme.

Try it!  Go to 
<http://cloud.nectar.org.au/start-now/>
and log in using your university credentials.

The cloud also provides common applications that can be automatically provisioned onto the cloud.  One of those is Jupyter:

<https://support.ehelp.edu.au/support/solutions/articles/6000196124-nectar-applications-jupyter-notebook>

#### A cloud for researchers 

The Nectar cloud is more than just Jupyter notebooks.  

The research cloud can be used to "scale up" research software to use more resources than are available on a typical desktop.  For example, the largest NeCTAR instance (virtual machine) has 32 CPU cores and 64GB of RAM, which might be more than your laptop can provide. You can also quickly create multiple instances to get even more work done.

To find out more, go to <http://nectar.org.au/cloudpage>

The basic terminology of the cloud is described here: <https://support.ehelp.edu.au/support/solutions/articles/6000055378-welcome>

Creating an instance is described here: <https://support.ehelp.edu.au/support/solutions/articles/6000055376-launching-virtual-machines>

If you’d like to apply for additional resources, instructions are here: <https://support.ehelp.edu.au/support/solutions/articles/6000068044-managing-an-allocation>







### Binder deployment: The Demonstrator Option

Text from http://mybinder.org:

> Have a repository full of Jupyter notebooks? With Binder, open those notebooks in an executable environment, making your code immediately reproducible by anyone, anywhere.

This is by far the simplest option for just running this tutorial and trying out Python.
If you have created a notebook using one of the other options, then Binder is the easiest way of demonstrating your results and allowing others to investigate them.

So, we're using Binder to do this tutorial!  The following link will spin up your own virtual machine with Jupyter, SciPy and Python, with this tutorial running:

### <http://eresear.ch/pythontutorial>

You can experiment with this notebook: add and change code, then rerun it.  However, note that the computing resource behind Binder is temporary and only designed for short experiments.  For example, the notebook will be deleted if it is idle for more than ten minutes, along with any changes you've made.  For more permanent resources, try the other two deployments from earlier.


#### Reproducible experiments

What is interesting here is that we have a reproducible way to publish Python code, text and visualisations for others to view, edit and investigate.



##  Let's get to the actual Jupyter Notebook (already)



The cells in a Jupyter notebook, like those you've seen above, aren't just text; they can also contain Python code. Below is a really simple example.

* To execute this cell, select it and press the play button above ⏯ (or hit shift-return).
* Pressing play repeatedly will execute each cell in turn down the page.
* You can select and edit any code cell and press play to run the cell again. 
* To run all the code in the notebook, press the fast-forward button ⏩

In [None]:
print("Hello World")

Try changing the printed words in the previous cell and rerunning it!


Python is an extensive and powerful programming language, and while many tasks can be achieved using the standard tools of the language, its main power comes from its collection of additional domain-specific Python libraries.  

So in this tutorial we're going to skim the surface of the language proper, and focus on using a powerful toolbox of libraries designed for data processing in Python.  We're not going to delve too deep into the complexity of the language, but you can get a surprising amount done with just a few commands.  

There will be resources on the Python language later to help you investigate further.
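
As a small taste of the core language, here is a minimal sketch using only built-in features: a list, a function and a loop. The names (`scores`, `average`) and the values are made up for illustration and have nothing to do with the data set used later.

```python
# A made-up list of review scores (illustrative values only)
scores = [87, 90, 92, 85]


def average(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Loop over the list and print each score in turn
for score in scores:
    print("Score:", score)

mean_score = average(scores)
print("Average score:", mean_score)
```

Try adding another number to the list and rerunning the cell.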


***

## Scientific processing with SciPy

### SciPy Toolkit

Text from www.scipy.org

> SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering.

Includes

* **NumPy** Base N-dimensional array package

* **SciPy library** Fundamental library for scientific computing

* **Matplotlib** Comprehensive 2D plotting

* **Pandas** Data structures and analysis

* **scikit-image** Collection of algorithms for image processing

* **scikit-learn** Collection of tools for machine learning

* ...

In this tutorial we investigate the _pandas_ and _matplotlib_ packages.
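
Pandas builds on NumPy arrays, so it can help to see one first. The following is a minimal sketch with made-up numbers; the variable names are illustrative only.

```python
import numpy as np

# Made-up measurements (illustrative values only)
readings = np.array([1.0, 2.0, 3.0, 4.0])

# Arithmetic applies to every element at once, with no explicit loop
doubled = readings * 2

print(doubled)
print(readings.mean())
```

Operating on whole arrays like this, rather than looping over items one by one, is the style you will see throughout the pandas examples below.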


### Pandas 

Text from <http://pandas.pydata.org>

> Python has long been great for data munging and preparation, but less so for data analysis and modeling. Pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

> Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.

> ### Library Highlights

> * A fast and efficient DataFrame object for data manipulation with integrated indexing;
* Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
* Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
* Flexible reshaping and pivoting of data sets;
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
* Columns can be inserted and deleted from data structures for size mutability;
* Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
* High performance merging and joining of data sets;
* Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
* Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
* Highly optimized for performance, with critical code paths written in Cython or C.
* Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
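
To make a couple of these highlights concrete, here is a minimal sketch using a tiny, made-up table (it is not taken from the case study data). It shows a DataFrame with a missing value, and a column being added after the table is created. The column names and the 1.1 multiplier are assumptions chosen purely for illustration.

```python
import pandas as pd

# A tiny, made-up table (not taken from any real data set)
df = pd.DataFrame({
    "variety": ["Shiraz", "Riesling", "Shiraz"],
    "price": [20.0, None, 30.0],   # None becomes a missing value (NaN)
})

# Missing values are skipped by default when computing statistics
print(df["price"].mean())

# Columns can be added (or removed) after the table is created
df["price_with_tax"] = df["price"] * 1.1
print(df)
```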

***

### Okay, this is all well and good, but what about in practice?


## Wine Reviews: 130k wine reviews with variety, location, winery, price and description 
![](noun_Wine_15141.png) 

Scraped from Wine Enthusiast magazine during the week of June 15th 2017

Collated by Zack Thoutt:   <https://www.kaggle.com/zynicide/wine-reviews>

Analysis below based on <https://github.com/kjingers/reproducible-python>

Adapted by Ian Thomas, RMIT University.

### We load the Pandas library so we can use it for data exploration


In [None]:
import pandas as pd

### First load the CSV (comma-separated values) file with the data into a Pandas DataFrame


In [None]:
wine = pd.read_csv("winemag-data-130k-v2.csv")
wine.head()
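
If you'd like to see what `read_csv` does without the full file, here is a minimal sketch that reads a few invented rows from an in-memory string instead of from disk. The column names mirror three of those used later in this tutorial, but the values are made up.

```python
import io

import pandas as pd

# Invented CSV text standing in for a file on disk
csv_text = """country,points,price
Australia,90,25.0
France,92,
Italy,88,18.0
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())    # the first rows (up to five)
print(sample.dtypes)    # column types inferred from the text
```

Note that the empty price field on the second row is read as a missing value rather than causing an error.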

### Get basic description of the data

In [None]:
wine.describe(include ='all').T

In [None]:
wine.points.describe()

### Let's look closer at the tasters

In [None]:
wine.taster_name.unique()

In [None]:
wine.taster_name.describe()

In [None]:
wine.taster_name.value_counts()
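
The three calls above are easier to compare on a tiny example. This is a minimal sketch with made-up names, including one missing entry, to show how each call treats it.

```python
import pandas as pd

# Made-up taster names, including one missing entry
tasters = pd.Series(["Ann", "Bob", "Ann", None, "Ann"])

print(tasters.unique())        # distinct values, including the missing one
print(tasters.nunique())       # number of distinct non-missing values
print(tasters.value_counts())  # rows per value, most frequent first
```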

### Grouping per country and points to analyse the mean price of the wines


In [None]:
count = wine.groupby(['country','points'])['price'].agg(['count','min','max','mean']).sort_values(by = 'mean',ascending = False)[:20]
count.reset_index(inplace=True)
count

### Let's break that down

In [None]:
temp = wine.groupby(['country','points'])
print(temp)
temp = temp['price'] 
print(temp) # Question: why do we not see anything up to here?
temp = temp.agg(['count','min','max','mean'])
print(temp)
temp = temp.sort_values(by='mean',ascending=False)
print(temp)
temp = temp[:20]
print(temp)
temp.reset_index(inplace=True)
print(temp)
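
The same steps can be easier to follow on a table small enough to check by hand. Here is a minimal sketch with four invented rows that reuse the column names from the cell above; it may also give a hint about the question in that cell.

```python
import pandas as pd

# Invented rows using the same column names as the cell above
toy = pd.DataFrame({
    "country": ["Australia", "Australia", "France", "France"],
    "points":  [90, 90, 92, 92],
    "price":   [20.0, 30.0, 50.0, 70.0],
})

# Split: this only records how to group the rows; nothing is computed yet
groups = toy.groupby(["country", "points"])

# Apply and combine: the aggregation produces one row per group
summary = groups["price"].agg(["count", "min", "max", "mean"])
print(summary)

# Move the group labels from the index back into ordinary columns
print(summary.reset_index())
```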

These are just a few functions from the Pandas library.  Much more information is available here: <http://pandas.pydata.org>

There is a handy cheatsheet for pandas here: <https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf>

### Matplotlib

Text from <https://matplotlib.org/>

> Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.


In [None]:
# We need this line for showing plots in jupyter notebooks
%matplotlib inline 

# include the library so we can use it as "plt"
import matplotlib.pyplot as plt


In [None]:

wine['points'].value_counts().sort_index().plot.bar(color = 'blue',
                                                   title = 'Rankings given by wine magazine');

### Let's break that down

In [None]:
temp = wine['points']
print(temp)
temp = temp.value_counts()
print(temp)
temp = temp.sort_index()
print(temp)

temp.plot.bar(color="blue", title="Rankings given by wine magazine")


### How many countries are represented?

In [None]:
wine['country'].nunique()

### Countries with the most wine representations

In [None]:
fig, ax = plt.subplots(figsize = (10, 8))

# Maybe you want to try breaking down the following line like we have been doing?
country = wine['country'].value_counts().to_frame()[0:20] 

country.plot.bar(ax = ax, color="blue", legend = None, title = 'Countries with most wine representations');
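
The `fig, ax = plt.subplots(...)` line in the cell above creates a figure (the whole image) and a set of axes (the plotting area), and passing `ax=ax` tells pandas where to draw. Here is a minimal sketch of that pattern with made-up scores; the title and labels are illustrative only.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up scores (illustrative values only)
scores = pd.Series([88, 90, 88, 92, 90, 88])

# Count each score, then order the bars by score rather than by count
counts = scores.value_counts().sort_index()

# Create the figure and axes explicitly, then draw onto those axes
fig, ax = plt.subplots(figsize=(6, 4))
counts.plot.bar(ax=ax, color="blue", title="Made-up score counts")

# Because we hold a reference to the axes, we can keep adjusting them
ax.set_xlabel("Points")
ax.set_ylabel("Number of reviews")
```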

### Which countries have the highest point mean?

In [None]:

# the following lines may seem familiar...
country_grouped = wine.groupby('country')
grouped_list = country_grouped['points'].mean().reset_index()


grouped_list.sort_values(by ='points', ascending = False).iloc[:20].reset_index(drop = True)


### Question: what does `drop=True` mean in reset_index?

Don't be afraid to web search for the answer (no one ever remembers all the options!)

<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html>
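
Once you've had a go, you can check your answer by experiment. This is a minimal sketch on a three-row, made-up table that compares `reset_index()` with and without `drop=True`.

```python
import pandas as pd

# A made-up table, sorted so that the index labels end up out of order
toy = pd.DataFrame({"country": ["A", "B", "C"], "points": [85, 95, 90]})
ranked = toy.sort_values(by="points", ascending=False)
print(ranked)    # index labels are now 1, 2, 0

# Default: the old index is kept as a new column named "index"
print(ranked.reset_index())

# drop=True: the old index is discarded and rows are renumbered from 0
print(ranked.reset_index(drop=True))
```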


### Let's look at Australia!

![](noun_Kangaroo_3308208.png) 



In [None]:
oz = wine[wine['country'] == 'Australia'].copy()
oz.head()
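
The expression `wine[wine['country'] == 'Australia']` in the cell above packs two steps into one line. Here is a minimal sketch that separates them on a made-up table and shows what `.copy()` is for; the names `toy`, `mask` and `oz_toy` are illustrative only.

```python
import pandas as pd

# A made-up table with the same two column names as the wine data
toy = pd.DataFrame({
    "country": ["Australia", "France", "Australia"],
    "points":  [90, 92, 87],
})

# Step 1: comparing a column with a value gives one True/False per row
mask = toy["country"] == "Australia"
print(mask)

# Step 2: indexing with the mask keeps only the rows marked True;
# .copy() makes an independent table, so later edits don't touch `toy`
oz_toy = toy[mask].copy()
oz_toy["points"] = oz_toy["points"] + 1
print(oz_toy)
print(toy)    # unchanged
```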

In [None]:
fig, ax = plt.subplots(figsize = (10, 8))

oz_points = oz['points'].value_counts().to_frame()
oz_points.sort_values(by = 'points', ascending = False, inplace = True)

oz_points.plot.bar(ax = ax, legend = None)

ax.set_xlabel('Points')
ax.set_ylabel('No of wines');



## Where to from here?


### Further Resources

* The Jupyter notebook <http://jupyter.org>
* Nectar Research Cloud <http://nectar.org.au>
* Binder <http://mybinder.org>
* SciPy libraries <http://www.scipy.org>
* Learning Python <https://swcarpentry.github.io/python-novice-inflammation/>
* Pandas <http://pandas.pydata.org>
* Matplotlib <https://matplotlib.org/>
* Jupyter notebooks for GLAM (Galleries, libraries, archives, museums) <https://glam-workbench.github.io/>
* Jupyter notebooks for the digital humanities <https://github.com/quinnanya/dh-jupyter>


