# Before we begin 

1. Start a new project in SherlockML. 
2. Spin up a server. 
3. Open a terminal and clone the repository containing this notebook:
    ``git clone xx``

# Jupyter Notebook Basics

This is a Jupyter notebook. The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. 

#### Notebooks have two different keyboard input modes:
1. <b>Edit mode</b> allows you to type code/text into a cell and is indicated by a green cell border. 
2. <b>Command mode</b> binds the keyboard to notebook level actions and is indicated by a grey cell border.
<br>

Switch from edit mode to command mode by pressing `esc`, and switch back by pressing `enter`.

#### Types of cells

This is a Markdown cell

In [None]:
print("This is a Python cell!")

#### Change, add and delete cells in command mode
- Change cell type from code to markdown by pressing `m`. Change it back to code with `y`. Or use the drop down menu. 
- Add a cell above with `a` and below with `b`
- Delete a cell with `dd`

Type `h` for more keyboard shortcuts. 

#### Running commands

To run a command, click in the cell and click the play button above or press ctrl+enter (or shift+enter which automatically places your cursor in the next cell down, or alt+enter to also add a new cell below). 

In [None]:
# the value of the last expression in a cell is displayed automatically, so no print is needed
1 + 2 

In [None]:
# a semicolon suppresses the cell output
1 + 2;

Shell commands start with a `!`

In [None]:
!pwd

In [None]:
!ls

# Python Basics 

Python is an easy to learn, powerful programming language. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for data science. 

In [None]:
import this

## Hello world 

In [None]:
print('Hello Python world!')

That's it! This simplicity makes Python very quick to develop in. 

In [None]:
from IPython.display import Image  # Image must be imported before it is used
Image('img/python.png', width=400)

#### Aside: Python 2 vs Python 3

There are two versions of Python which are very commonly used: Python 2 and 3. 

If you are using Python 2.7, you would instead type the following (which is a syntax error in Python 3):

In [None]:
print "Hello Python world!"

If possible, use Python 3! 

But be aware: Python 3 broke backward compatibility, and much Python 2 code does not run unmodified on Python 3.
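As a small illustration of the kind of change involved, the sketch below shows two well-known differences as they behave in Python 3: `print` is a function, and `/` between two integers is true division. The variable names are only illustrative.

```python
# Python 3: / is true division, even between two integers
result = 7 / 2      # Python 2 would floor this to 3 for two integers

# // is floor division in both Python 2 and Python 3
floored = 7 // 2

# print is a function in Python 3, so the parentheses are required
print('true division:', result, 'floor division:', floored)
```

Code that relied on the Python 2 behaviour of `/` or on the `print` statement needs editing before it runs on Python 3.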

### Packages

__Batteries included__: Python ships with a rich and versatile standard library that is immediately available (e.g. `sys`, `os`, `time`, `shutil`, `glob`, `re`, `random`).

In addition, Python has a large collection of extremely useful third-party packages for scientific analysis; there is a package for almost everything. This is a key reason for the rapid adoption of Python in science.

__Numpy__ is the fundamental library for data science. It gives us *fast* and *powerful* tools for numerical operations on large, multi-dimensional arrays of data, which, as you can imagine, is useful for much of data science!

__Pandas__ is a library built on top of numpy which makes analysing messy, real-world datasets more intuitive. Pandas adds more functionality and a wonderfully useful 2-dimensional data structure known as a `DataFrame`.

Knowing how to use these libraries will make the slog of understanding your data and getting it into a useable state much easier. 

# Numpy

In [None]:
# You will always see these libraries imported in the following way
import numpy as np

## Getting help

Think np has a sum method? Let's check!

In [None]:
np.su*?

To display all the contents of the numpy namespace:

```ipython
In [3]: np.<TAB>
```

To display Numpy's built-in documentation:

```ipython
In [4]: np?
```

In general, make extensive use of documentation & Stack Overflow. Numpy and Pandas have so many users that any question you have has likely been asked and answered on Stack Overflow. Other useful resources:

- [Pandas online documentation](http://pandas.pydata.org/)
- [Numpy online documentation](https://docs.scipy.org/doc/)
- [*Python Data Science Handbook*](http://shop.oreilly.com/product/0636920023784.do) by Jake VanderPlas.
- [*Python for Data Analysis*](http://shop.oreilly.com/product/0636920023784.do) by Wes McKinney (the original creator of Pandas). Second edition out soon.

In [None]:
from IPython.display import Image
Image(filename='img/datasciencehandbook.jpg', width=300) 

Freely available as Jupyter Notebooks [here](https://github.com/jakevdp/PythonDataScienceHandbook).

## Why do we care about numpy?

Python is quick to develop in, but can be slow to execute. With Numpy...

1. Our code is faster
2. Our code is (often) more readable
3. Our code is (almost always) more intuitive

#### For example:  Implementing a simple  [random walk](https://en.wikipedia.org/wiki/Random_walk)

i.e. at each step, move either one place forward or one place backward

In [None]:
# python implementation - requires for loop
import random

def random_walk(n):
    '''Randomly walk n steps'''
    position = 0
    walk = [position]
    for i in range(n):
        position += random.choice([-1, 1])
        walk.append(position)
    return walk

%timeit random_walk(10000) # timeit is a "magic" ipython command - see the documentation for others

In [None]:
# numpy implementation - no for loop, ~100x faster, more readable
def random_walk(n):
    '''Randomly walk n steps'''
    steps = np.random.choice([-1, 1], size=n) 
    return np.cumsum(steps)

%timeit random_walk(10000)

The idea of removing `for` loops in favour of creating and manipulating whole arrays at a time is central to numerical computing in Python, and most of what follows focuses on it. This is known as a *vectorized* operation. This vectorized approach is designed to push the loop into compiled C code that NumPy calls, leading to much faster execution.

You can make use of this by using numpy arrays rather than python lists, and using:
1. <b><a href=http://docs.scipy.org/doc/numpy/reference/ufuncs.html>Ufuncs</a></b> for element-wise operations on arrays (+, -, *, /, etc.)
2. <b>Aggregations</b> for summarizing the values of an array (e.g. np.min, np.max, np.sum, np.mean)
3. <b><a href=http://scipy.github.io/old-wiki/pages/EricsBroadcastingDoc>Broadcasting</a></b> for combining arrays
4. <b><a href=http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html>Indexing and slicing</a></b> 

We will see examples of all of these in the remainder of the notebook. 
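As a quick preview of the first two items, here is a minimal sketch of ufuncs and aggregations on small arrays. The array contents and variable names are made up for illustration.

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0, 16.0])

# Ufuncs act on every element at once, with no explicit Python loop
roots = np.sqrt(values)        # element-wise square root
shifted = values * 2 + 1       # arithmetic operators are ufuncs too

# Aggregations reduce an array to a summary value
total = values.sum()
mean = values.mean()

# On a 2-D array, `axis` chooses the direction of the reduction
grid = np.arange(6).reshape(2, 3)
column_sums = grid.sum(axis=0)  # one sum per column
row_sums = grid.sum(axis=1)     # one sum per row
```

Each of these calls runs its loop in compiled code, which is where the speed-up over a Python `for` loop comes from.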

# Hello Numpy World 
Let's see what I'm on about. 

We'll cover:
    
    i. Creating data arrays
    ii. Indexing 
    iii. Reshaping arrays
    iv. Broadcasting scalars and arrays to different sizes


## i. Creating data

Create a numpy array from a Python list

In [None]:
np.array([1, 0, 0, 1, 0]) 

Create a 5-element array of zeros

In [None]:
np.zeros(5) 

Create a 3x5 array of integer ones

In [None]:
np.ones((3, 5), dtype=int) 

Create an evenly spaced array of length 5 between 0 and 1 

In [None]:
np.linspace(0, 1, num=5) 

Create a 4x3 array of random integers between 0 and 5 (the upper bound, 6, is exclusive)

In [None]:
r = np.random.randint(0, 6, size=(4, 3))
r

Create an array of zeros of the same shape

In [None]:
np.zeros_like(r)

## ii. Access data by indexing

In [None]:
a = np.arange(9).reshape(3,3)
a

Item by index

In [None]:
a[2, 2] 

Row by index

In [None]:
a[1, :] 

Column by index

In [None]:
a[:, 2] 

In [None]:
b = np.arange(10)
b

Every element from the 2nd to the 6th 

In [None]:
b[1:6] 

Every other element

In [None]:
b[::2]

The final element

In [None]:
b[-1]

The third and eighth elements

In [None]:
b[[2, 7]]
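Indexing also works with a boolean mask, which is the same idea used later to filter rows of a `DataFrame`. A minimal sketch, reusing an array like `b` (the thresholds are arbitrary):

```python
import numpy as np

b = np.arange(10)

# A comparison produces an array of True/False, one entry per element
mask = b > 6

# Indexing with the mask keeps only the elements where it is True
large = b[mask]

# Combine conditions with & (and) or | (or); the parentheses are required
even_and_small = b[(b % 2 == 0) & (b < 5)]
```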

## iii. Reshaping

In [None]:
z = np.arange(6)
z

z is a one-dimensional array

In [None]:
z.shape

Reshape z by adding an extra dimension

In [None]:
z = z.reshape(len(z), 1) 
z

Reshape z into a 3x2 array

In [None]:
z = z.reshape(3, 2)
z

Transpose z

In [None]:
z.T

Flatten z

In [None]:
z.flatten()

## iv. Broadcasting
On numpy arrays, operations like `+`, `-` and `*` are elementwise. It’s possible to do __operations on arrays of different sizes__ when numpy can transform them to be the same size (known as "broadcasting").

In [None]:
Z = np.arange(9).reshape(3, 3)
Z

Add 1 to every element in Z

In [None]:
Z + 1

1 was 'broadcast' into the same shape as Z, i.e. `np.ones(shape=(3,3))`

In [None]:
np.all(Z + 1 == Z + np.ones((3, 3)))  # np.alltrue was removed in NumPy 2.0

What would this look like without broadcasting?

In [None]:
for i in range(3):
    for j in range(3):
        Z[i, j] += 1 
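Broadcasting is not limited to scalars. The sketch below combines two arrays of different shapes, and then subtracts a 1-D array from every row of a 2-D array. The names `row`, `column`, `table`, `grid` and `centred` are only illustrative.

```python
import numpy as np

row = np.arange(3)                    # shape (3,)
column = np.arange(3).reshape(3, 1)   # shape (3, 1)

# Shapes (3, 1) and (3,) are both stretched to the common shape (3, 3)
table = column + row

# A 1-D array of column means is broadcast across every row of a 2-D array
grid = np.arange(9).reshape(3, 3)
centred = grid - grid.mean(axis=0)    # subtract each column's mean
```

No copies of `row` or `column` are written out by hand; numpy works out the common shape from the two input shapes.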

# Pandas

In [None]:
import pandas as pd

# have plots render in notebook
%matplotlib inline 

## Introduction

Pandas is a package that builds on the NumPy array structure by introducing ``DataFrame``s, which are essentially multidimensional arrays with attached row and column labels. 
Pandas is the tool of choice for the sort of "data munging" tasks that occupy much of a data scientist's time.
In this (short!) introduction to pandas we will introduce the basic functionalities of pandas which you will find useful on a day to day basis as a data scientist. 

## Pandas objects

Pandas has three fundamental data structures: the ``Series``, ``DataFrame``, and ``Index``.

### Series

A Pandas ``Series`` is a one-dimensional array of indexed data.
One way to create a series is as follows:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

Get the values

In [None]:
data.values

Get the index

In [None]:
data.index

Get the second row using integer indexing

In [None]:
data.iloc[1]

Get the same row using the index label

In [None]:
data.loc['b']

A series can also be created from a dictionary. 

In [None]:
age_dict = {'Max': 26,
            'Andy': 25,
            'Ben': 28,
            'Sarah': 26,
            'Anne': 21}
age = pd.Series(age_dict)
age

### DataFrame

The next fundamental structure in Pandas is the ``DataFrame``. If a ``Series`` is an analog of a one-dimensional array with flexible indices, a ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [None]:
height_dict = {'Max': 170, 
               'Andy': 164, 
               'Ben': 175,
               'Sarah': 165, 
               'Anne': 160}
height = pd.Series(height_dict)
height

In [None]:
people = pd.DataFrame({'age': age,
                       'height': height})
people

Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels:

In [None]:
people.index

Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [None]:
people.columns

Get the age of Andy

In [None]:
people.loc['Andy' , 'age']

Access the age series. 

In [None]:
people['age']

This is a convenient shorthand for:

In [None]:
people.loc[:, 'age']

Columns can also be accessed using attribute-style (dot) syntax:

In [None]:
people.age

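However, this syntax fails when a column name clashes with an existing ``DataFrame`` attribute or method, or is not a valid Python identifier, so it is generally discouraged.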
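To make the risk concrete, here is a small sketch with a hypothetical frame whose column names were chosen to cause trouble: one matches the `DataFrame.count` method and the other contains a space.

```python
import pandas as pd

# Hypothetical frame: 'count' clashes with a method, 'max score' has a space
scores = pd.DataFrame({'count': [3, 5], 'max score': [10, 12]})

by_bracket = scores['count']     # the column, as expected
by_attribute = scores.count      # the DataFrame.count method, not the column

is_column = isinstance(by_bracket, pd.Series)
is_method = callable(by_attribute)

# 'max score' is not a valid identifier, so it can only be reached with brackets
max_scores = scores['max score']
```

Bracket access works for every column name, which is why it is the safer habit.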

## Missing data

In the real world, data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing. Pandas uses ``None`` or ``NaN`` (short for *Not a Number*) to represent missing data.

In [None]:
data = pd.Series([1, np.nan, 'hello', None])
data

Where are the null values?

In [None]:
data.isnull()

Drop the null values

In [None]:
data.dropna()

Replace the null values with zeros

In [None]:
data.fillna(0)

Notice that a copy is returned and the original series is unchanged. 

In [None]:
data

If we want to modify the original series we can set the `inplace` argument to `True`.

In [None]:
data.dropna(inplace=True)
data

## Combining Datasets: Concat and Merge

``pd.concat()`` can be used for a simple concatenation of ``Series`` or ``DataFrame`` objects

In [None]:
s1 = pd.Series(['Alpha', 'Bravo', 'Charlie'], index=[0, 1, 2])
s1

In [None]:
s2 = pd.Series(['Delta', 'Echo', 'Foxtrot'], index=[3, 4, 5])
s2 

Perform a row-wise concatenation. 

In [None]:
pd.concat([s1, s2])

In [None]:
df1 = pd.DataFrame({'employee': ['John', 'Simon', 'Lucy', 'Sue'],
                    'job': ['Data Scientist', 'Data Engineer', 'Software Developer', 'HR']})
df1

In [None]:
df2 = pd.DataFrame({'employee': ['Sue', 'Simon', 'Lucy', 'John'],
                    'years_at_company': [1, 3, 2, 1]})
df2

To combine this information into a single ``DataFrame``, we can use the ``pd.merge()`` function:

In [None]:
df3 = pd.merge(df1, df2)
df3

The common column 'employee' is used as the join column. 

Also look up the `DataFrame.join` method in the Pandas documentation.
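As a rough sketch of both options, the example below repeats the merge with the key and join type spelled out, and then gets a similar result with `join`. It rebuilds `df1` and `df2` so that it runs on its own; `merged` and `joined` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['John', 'Simon', 'Lucy', 'Sue'],
                    'job': ['Data Scientist', 'Data Engineer',
                            'Software Developer', 'HR']})
df2 = pd.DataFrame({'employee': ['Sue', 'Simon', 'Lucy', 'John'],
                    'years_at_company': [1, 3, 2, 1]})

# Naming the key column and the join type makes the intent explicit
merged = pd.merge(df1, df2, on='employee', how='inner')

# join aligns on the index, so move 'employee' into the index first
joined = df1.set_index('employee').join(df2.set_index('employee'))
```

`merge` matches rows on column values, while `join` matches on index labels; which one reads better depends on how the data is already indexed.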

## Aggregation and Grouping

In [None]:
rng = np.random.RandomState(42)
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df

Compute the mean of each column.

In [None]:
df.mean()

Compute the sum across the columns, giving one value per row.

In [None]:
df.sum(axis='columns')

Groupby breaks up a dataframe depending on the value of a specified key, computes some function within the individual groups (usually an aggregate, transformation, or filtering), and finally merges the results of these into an output array. 

In [None]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

Group by the key

In [None]:
df.groupby('key')

Notice that what is returned is not a set of ``DataFrame``s, but a ``DataFrameGroupBy`` object. To produce a result, we can apply an aggregate to this ``DataFrameGroupBy`` object. 

Aggregate the groupby object by computing the sum.

In [None]:
df.groupby('key').sum()

Many aggregation methods are built in. Otherwise, arbitrary functions can be applied to the groups using `agg` or `apply`.
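A minimal sketch of `agg` on the same kind of data: first with several built-in aggregates at once, then with a custom function. The lambda computing a per-group range is just an example of an arbitrary function.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

grouped = df.groupby('key')['data']

# Several built-in aggregates at once: one output column per function
summary = grouped.agg(['min', 'max', 'mean'])

# An arbitrary function, applied to each group's values in turn
spread = grouped.agg(lambda values: values.max() - values.min())
```

Here `summary` has one row per key and one column per aggregate, while `spread` is a `Series` indexed by key.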

# Vectorized String Operations

These are very useful when working with real-world (i.e. messy) data. 

In [None]:
data = ['london', 'LEEDS', None, 'CamBridge']
names = pd.Series(data)
names

Capitalize the names of the cities. 

In [None]:
names.str.capitalize()
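Other string methods are available through the same `.str` accessor. The sketch below shows a few of them on the same city names; note how the missing entry is handled without raising an error.

```python
import pandas as pd

names = pd.Series(['london', 'LEEDS', None, 'CamBridge'])

# The missing entry stays missing rather than raising an exception
lowered = names.str.lower()

# Number of characters in each entry (missing stays missing)
lengths = names.str.len()

# na=False treats the missing entry as "does not match"
starts_with_l = names.str.lower().str.startswith('l', na=False)
```

The boolean result `starts_with_l` can be used directly as a mask, for example `names[starts_with_l]`.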

## Dealing with real data 

In [None]:
df = pd.read_csv('data/meteors.csv', encoding="ISO-8859-1")

Let's get a sense of what's in this dataset by printing the first 5 rows. 

In [None]:
df.head(5)

What are the columns?

In [None]:
df.columns

How many rows? What are the data types of the columns? Are there any null values?

In [None]:
df.info()

The dates and times in the DataFrame have been read as strings. We can parse them into datetime objects by passing `parse_dates` to `read_csv`.

In [None]:
df = pd.read_csv('data/meteors.csv', encoding="ISO-8859-1", parse_dates=['created_at'])

In [None]:
df.created_at[:5]
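To see the effect of `parse_dates` in isolation, here is a self-contained sketch that reads a tiny in-memory CSV twice. The three rows and the `name` column are invented for the example and are not taken from `meteors.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a real file
csv_text = """name,created_at
Aachen,2015-03-01 10:30:00
Aarhus,2016-07-15 08:00:00
Abee,2016-12-31 23:59:00
"""

# Without parse_dates the column is read as plain strings
as_text = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column becomes a datetime column
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['created_at'])

# Datetime columns expose date components through the .dt accessor
years = as_dates['created_at'].dt.year
```

Once a column holds datetimes, components such as `.dt.year` or `.dt.month` can be used for filtering and grouping.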

`describe()` computes several common aggregates for each column and returns the result.

In [None]:
df.describe()

Plot the counts of 'fell' and 'found'.

In [None]:
df['fell_found'].value_counts()[:2].plot(kind='bar');

## Exercises for you

Select all the meteorites which fell after 1999. 

In [None]:
# your code 
df[df['year'] > 1999]

Select all of the meteorites of type 'L6' which fell after 1999

In [None]:
# your code 
df[(df['year'] > 1999) & (df['type_of_meteorite'] == "L6")]

Find the masses of the heaviest 5 meteors

In [None]:
# your code 
df.sort_values(by='mass_g', ascending=False)['mass_g'].iloc[:5]

Plot a bar chart showing the number of meteorites by meteorite type. Show only the 10 most common types of meteorites to have fallen.

In [None]:
# your code 
df['type_of_meteorite'].value_counts()[:10].plot(kind='bar');

Plot the number of meteorites which fell each year after 1999.

In [None]:
# your code 
(df.loc[df['year'] > 1999, 'year']
   .astype(int)
   .value_counts()
   .sort_index()
   .plot(kind='bar'));  # kind must be passed by keyword

Which years had the largest average meteorite mass?

In [None]:
# your code
annual = df.groupby('year')
avg_mass = annual['mass_g'].mean()
avg_mass.sort_values(ascending=False).iloc[:5]

### Task (optional): recreate the analysis in this [FiveThirtyEight story](https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/)
We provide the data from their github, which you will need to merge, or join, or concatenate. 
Please only use Pandas.

In [None]:
births_94_03 = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv')
births_00_14 = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv')