# Titanic lab
### Introduction to pandas and matplotlib

## 1. Jupyter notebooks 
* You are reading this line in a Jupyter notebook.
* A notebook consists of cells. A cell can contain either code or markdown. 
    * This cell contains markdown. The next cell contains code.
* If you are not familiar with markdown, here is a [cheatsheet](https://wordpress.com/support/markdown-quick-reference/)
* You can __run a cell__ with code by selecting it (click) and pressing `Ctrl + Enter` to execute the code and display output (if any).
* If you're running this on a device with no keyboard, use the top-bar __play/stop/restart__ buttons to run code.
* Behind the scenes, there's a Python interpreter that runs that code and remembers anything you defined.

Run the following cells to get started:

In [None]:
a = 5

In [None]:
a*2

In [None]:
print(a * 2)

* __`Ctrl + S`__ to save changes (or use the button that looks like a floppy disk)
* __Top menu -> Kernel -> Interrupt__ (or Stop button) if you want to stop a running cell midway.
* __Top menu -> Kernel -> Restart__ (or cyclic arrow button) if interrupt doesn't fix the problem (you will lose all variables).
* To make the interpreter forget all your previous runs and start from scratch, use the __Kernel/Restart and Clear output__ button.

* More tutorials: [Hacker's guide](http://arogozhnikov.github.io/2016/09/10/jupyter-features.html), [Beginner's guide](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/), [Datacamp tutorial](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)

<br>__Intellisense__ : 
* If you're typing something, press `Tab` to see automatic suggestions, then use the arrow keys + Enter to pick one.
* If you move your cursor inside some function and press __Shift + Tab__, you'll get a docstring window. __Shift + (Tab , Tab)__ will expand it.
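
The same information can be reached from plain Python, which may help when the pop-ups are unavailable. The sketch below uses `math.hypot` purely as an illustration; `dir()` only roughly approximates the list that `Tab` shows.

```python
import math

# names in the math module starting with "h" - roughly what Tab suggests after typing `math.h`
suggestions = [name for name in dir(math) if name.startswith("h")]
print(suggestions)

# the text shown by Shift + Tab comes from the function's docstring
print(math.hypot.__doc__)
```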

<div style="background-color:yellow;">    
    <h4>Exercise</h4>
Type this in the next cell:<br>
`import math`<br>
`math.a` 
    
 </div> 

In [None]:

# then place your cursor at the end of the unfinished line 'math.a ...' and press Tab
# select the function that computes the arctangent from two parameters (it should have 2 in its name)
# once you select the function, press Shift + Tab, then Tab again, to see the docstring

## 2. Pandas
Pandas is a library that helps you load the data, prepare it and perform some basic statistical analysis. The main object is the `pandas.DataFrame` - a 2d table with batteries included. 
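
To make "2d table" concrete, here is a minimal sketch that builds a tiny `DataFrame` by hand. The rows are made up for illustration and are not taken from the lab's data file.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from titanic.csv.
toy = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cid"],
        "Age": [31.0, 4.0, None],   # None is stored as NaN (a missing value)
        "Survived": [1, 1, 0],
    },
    index=pd.Index([101, 102, 103], name="PassengerId"),
)

print(toy)
print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # each column has its own data type
```

Each column behaves like a labelled array with its own type, and the index (here `PassengerId`) labels the rows.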

In the cells below we use `pandas` to read the data on the infamous Titanic shipwreck.

__Keep running all the code cells as you read__

### 2.1. Sample dataset
Download the data file [titanic.csv](https://docs.google.com/spreadsheets/d/1QGNxqRU02eAvTGih1t0cErB5R05mdOdUBgJZACGcuvs/edit?usp=sharing) to your local directory.

__Update the variable `file_name` in the cell below to point to your local directory where you will store the datasets for this course__ and then run the cell.


In [None]:
file_name = "../../data_ml_2020/titanic.csv"

In [None]:
import pandas as pd

# this creates a pandas.DataFrame
data = pd.read_csv(file_name, index_col='PassengerId')

In [None]:
# Selecting rows
head = data[:10]

head  # if you leave an expression at the end of a cell, jupyter will "display" it automatically

#### Some dataset variables
* Name - a string with person's full name
* Survived - 1 if a person survived the shipwreck, 0 otherwise.
* Pclass - passenger class. Pclass == 3 is cheap'n'cheerful, Pclass == 1 is for moneybags.
* Sex - a person's gender (in those ol' times when just 2 of them were allowed)
* Age - age in years, if available
* Sibsp - number of siblings and spouses aboard
* Parch - number of parents and children aboard
* Fare - ticket cost
* Embarked - port where the passenger embarked
    * C = Cherbourg; Q = Queenstown; S = Southampton

### 2.2. Pandas basics

In [None]:
# table dimensions
print("len(data) = ", len(data))
print("data.shape = ", data.shape)

In [None]:
# select a single row
print(data.loc[4])

In [None]:
# select a single column.
ages = data["Age"] # alternatively: data.Age
print(ages[:10])  

In [None]:
# select several columns and rows at once
# alternatively: data[["Fare","Pclass"]].loc[5:10]
data.loc[5:10, ["Fare", "Pclass"]]
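
Because the table is indexed by `PassengerId`, `.loc` selects by index label, not by position. The sketch below, on a hypothetical mini-table, contrasts `.loc` with the position-based `.iloc`.

```python
import pandas as pd

# Hypothetical mini-table; the index labels deliberately start at 1, not 0.
toy = pd.DataFrame(
    {"Fare": [7.25, 71.28, 7.92, 53.10], "Pclass": [3, 1, 3, 1]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

by_label = toy.loc[2]        # the row whose index label is 2
by_position = toy.iloc[2]    # the third row (positions start at 0)
print(by_label["Fare"], by_position["Fare"])

# .loc slices include both ends; .iloc slices exclude the stop position
print(len(toy.loc[2:3]), len(toy.iloc[2:3]))

subset = toy.loc[2:3, ["Fare"]]  # rows by label, columns by name
print(subset)
```

Mixing up the two is a common source of off-by-one errors when the index does not start at 0.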

Pandas also has some basic data analysis tools. For one, you can quickly display statistical aggregates for each column using `.describe()`

In [None]:
data.describe()

In [None]:
# filters
print("Only male children")
mc = data[(data['Age'] < 18) & (data['Sex'] == 'male')]
mc.head()
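
A filter is just a boolean column used as a row selector. The sketch below, on made-up data, shows the three combining operators. Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `<` or `==` in Python.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {"Age": [4.0, 35.0, 16.0, 52.0], "Sex": ["male", "female", "female", "male"]}
)

is_child = toy["Age"] < 18          # boolean Series, one value per row
is_male = toy["Sex"] == "male"

boys = toy[is_child & is_male]            # and
child_or_male = toy[is_child | is_male]   # or
adults = toy[~is_child]                   # not

print(len(boys), len(child_or_male), len(adults))
```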

__More pandas__: 
* A neat [tutorial](http://pandas.pydata.org/) from pydata
* Official [tutorials](https://pandas.pydata.org/pandas-docs/stable/tutorials.html), including this [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html#min)
* A bunch of cheat sheets is just one Google query away (e.g. [basics](http://blog.yhat.com/static/img/datacamp-cheat.png), [combining datasets](https://pbs.twimg.com/media/C65MaMpVwAA3v0A.jpg) and so on). 

<div style="background-color:yellow;">
    <h3>Task 1. Your turn:</h3>    
</div>



In [None]:
# select passengers number 13 and 666 - did they survive?

<YOUR CODE>

In [None]:
# compute the overall survival rate (what fraction of passengers survived the shipwreck)

<YOUR CODE>

### 2.3. Missing values
Some columns contain __NaN__ values - this means that there is no data there. For example, passenger `#6` has unknown *Age*, some others have unknown *Fare*. To simplify the data analysis, we can replace the missing values using the pandas `fillna` function.

_Note: we do this only for the purpose of this tutorial. In general, think twice before you modify data like this._

In [None]:
# Age before replacement
data.loc[6]

In [None]:
data['Age'] = data['Age'].fillna(value=data['Age'].mean())
data['Fare'] = data['Fare'].fillna(value=data['Fare'].mean())

In [None]:
# Age after replacement - what does the new value mean?
data.loc[6]
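
Before filling, it can help to count how many values are actually missing. The sketch below uses a small made-up `Age` column. It also hints at why the note above asks for caution: mean imputation keeps the mean but shrinks the spread of the column.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([22.0, np.nan, 30.0, np.nan, 38.0], name="Age")

print(ages.isna().sum())   # number of missing entries
print(ages.mean())         # NaN entries are skipped by default

filled = ages.fillna(ages.mean())
print(filled.tolist())

# the mean is unchanged, but the standard deviation gets smaller
print(ages.std(), filled.std())
```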

## 3. Numpy 
### 3.1. Arrays

Almost any machine learning model requires some computationally heavy lifting, often involving vectors and matrices. Pure Python loops are too slow for this, so instead we use `numpy`. The main object here is the numpy array (`numpy.ndarray`, created with `numpy.array`), which is used to represent vectors and matrices.

In [None]:
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
print("a = ", a)
print("b = ", b)

# math and boolean operations can be applied to each element of an array
print("a + 1 =", a + 1)
print("a * 2 =", a * 2)
print("a == 2", a == 2)

# ... or corresponding elements of two (or more) arrays
print("a + b =", a + b)
print("a * b =", a * b)
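
For comparison, here is a small sketch of the same kind of element-wise computation written once as a plain Python loop and once as a numpy expression. The variable names are illustrative only.

```python
import numpy as np

values = list(range(1, 6))          # plain Python list
array = np.array(values)            # the same numbers as a numpy array

# pure Python: an explicit loop over the elements
squares_loop = [v * v for v in values]

# numpy: one vectorized expression, the loop runs in compiled code
squares_vec = array * array

print(squares_loop)
print(squares_vec)
print(type(array).__name__, array.dtype, array.shape)
```

Both give the same numbers; on large arrays the vectorized form is typically much faster.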

### 3.2. Matrix/vector operations
There's also a bunch of pre-implemented operations on the entire vector/matrix: [cheatsheet](./docs/Numpy_Python_Cheat_Sheet.pdf). 

In [None]:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
print("numpy.sum(a) = ", np.sum(a))
print("numpy.mean(a) = ", np.mean(a))
print("numpy.min(a) = ",  np.min(a))
print("numpy.argmin(b) = ", np.argmin(b))  # index of minimal element

# dot product - used for matrix/vector multiplication
print("numpy.dot(a,b) = ", np.dot(a, b))

print("numpy.unique(['male','male','female','female','male']) = ", np.unique(
    ['male', 'male', 'female', 'female', 'male']))
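
The cell above applies `np.dot` to two vectors. As a sketch of the matrix/vector case mentioned in its comment, the example below multiplies a small made-up matrix by a vector and contrasts the result with element-wise `*`.

```python
import numpy as np

matrix = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])        # shape (3, 2)
vector = np.array([10, 1])         # shape (2,)

product = np.dot(matrix, vector)   # matrix-vector product, shape (3,)
print(product)
print(matrix @ vector)             # the @ operator computes the same product

elementwise = matrix * vector      # NOT a matrix product: element-wise with broadcasting
print(elementwise)
```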

### 3.3. Indexing/slicing 

In [None]:
a = np.array([0, 1, 4, 9, 16, 25])
ix = np.array([1, 2, 5])
print("a = ", a)
print("Select by index")
print("a[[1,2,5]] = ", a[ix])

### 3.4. Boolean operations and filters

In [None]:
print("Boolean operations")

print('a = ', a)
print('b = ', b)
print("a > 2", a > 2)
print("numpy.logical_not(a>2) = ", np.logical_not(a > 2))
print("numpy.logical_and(a>2,a<10) = ", np.logical_and(a > 2, a < 10))
print("numpy.logical_or(b<2,b>4) = ", np.logical_or(b < 2, b > 4))

print("\nSelect by boolean filter")
print("a[a > 5] = ", a[a > 5])

print("(a % 2 == 0) =", a % 2 == 0)  # True for even, False for odd
print("a[a % 2 == 0] =", a[a % 2 == 0])  # select all elements in a that are even
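
The `np.logical_*` functions also have operator spellings (`&`, `|`, `~`), which are the ones used in the pandas filters earlier. A short sketch, reusing the array `a` from above:

```python
import numpy as np

a = np.array([0, 1, 4, 9, 16, 25])

mask_func = np.logical_and(a > 2, a < 10)
mask_op = (a > 2) & (a < 10)       # parentheses are required around each comparison

print(np.array_equal(mask_func, mask_op))
print(a[mask_op])                  # elements selected by the mask
print(np.sum(mask_op))             # True counts as 1, so this counts the matches
print(np.where(mask_op)[0])        # positions at which the mask is True
```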

### 3.5. Numpy and pandas dataframe
The important part: all this functionality works with pandas dataframes!

In [None]:
print("Max ticket price: ", np.max(data["Fare"]))
print("\nThe guy who paid the most:\n", data.iloc[np.argmax(data["Fare"])])
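
The cell above pairs `np.argmax` with `.iloc`, that is, it treats the result as a row position. A label-based alternative is `idxmax` with `.loc`. The sketch below shows both on a made-up fare column.

```python
import pandas as pd

# Made-up fares for illustration only.
fares = pd.Series(
    [7.25, 512.33, 8.05],
    index=pd.Index([11, 12, 13], name="PassengerId"),
    name="Fare",
)

position = fares.to_numpy().argmax()   # position of the maximum (0-based)
label = fares.idxmax()                 # index label of the maximum

print(position, label)
print(fares.iloc[position], fares.loc[label])   # both reach the same value
```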

<div style="background-color:yellow;">
    <h3>Task 2. Your turn:</h3>    
</div>


In [None]:
# your code: compute mean passenger age and the data about the oldest woman on the ship
<YOUR CODE>

In [None]:
# who on average paid more for their ticket, men or women?

mean_fare_men = <YOUR CODE>
mean_fare_women = <YOUR CODE>

print(mean_fare_men, mean_fare_women)

In [None]:
# who is more likely to survive: a child (<18 yo) or an adult?

child_survival_rate = <YOUR CODE>
adult_survival_rate = <YOUR CODE>

print(child_survival_rate, adult_survival_rate)

More about Pandas:
[kaggle microlesson](https://www.kaggle.com/learn/pandas)

## 4. Plots and matplotlib

Visualizing data in Python is covered by yet another library: `matplotlib`.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
# ^-- this "magic" tells all future matplotlib plots to be drawn inside notebook and not in a separate window.

# line plot
plt.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25])

In [None]:
# scatter-plot
plt.scatter([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25])

plt.show()  # show the first plot and begin drawing next one

In [None]:
# draw a scatter plot with custom markers and colors
plt.scatter([1, 1, 2, 3, 4, 4.5], [3, 2, 2, 5, 15, 24],
            c=["red", "blue", "orange", "green", "cyan", "gray"], marker="x")

# without .show(), several plots will be drawn on top of one another
plt.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25], c="black")

# adding more sugar
plt.title("Conspiracy theory proven!!!")
plt.xlabel("Per capita alcohol consumption")
plt.ylabel("# of data scientists per 100,000")

# fun with correlations: http://bit.ly/1FcNnWF

In [None]:
# histogram - showing data density
plt.hist([0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 8, 9, 10])
plt.show()

plt.hist([0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4,
          4, 5, 5, 5, 6, 7, 7, 8, 9, 10], bins=5)
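
To see what the `bins` argument does, the sketch below computes the bin counts with `np.histogram` instead of drawing them. `plt.hist` uses the same binning, with 10 bins by default.

```python
import numpy as np

values = [0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 8, 9, 10]

counts, edges = np.histogram(values, bins=5)
print(edges)    # 6 edges delimit 5 equal-width bins between min and max
print(counts)   # number of values falling into each bin
```

With `bins=5` each bar covers a range of width 2, so neighbouring values are grouped together and the plot looks smoother.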

<div style="background-color:yellow;">
    <h3>Task 3. Your turn:</h3>    
</div>


In [None]:
# plot a histogram of age and a histogram of ticket fares on separate plots

<YOUR CODE>


In [None]:
# Can you do that? Use Tab and Shift + Tab to see if there is a way to draw a 2D histogram of age vs fare.

In [None]:
# make a scatter plot of passenger age vs ticket fare

<YOUR CODE>



In [None]:
# Can you do that? add separate colors for men and women

#### More about charts
* Extended [tutorial](https://matplotlib.org/2.0.2/users/pyplot_tutorial.html)
* A [cheat sheet](docs/Python_Matplotlib_Cheat_Sheet.pdf)
* Other libraries for more sophisticated stuff: [Plotly](https://plot.ly/python/) and [Bokeh](https://bokeh.pydata.org/en/latest/)
* Also check out this micro-lesson from kaggle: [data visualization](https://www.kaggle.com/learn/data-visualization)

### This is the end of the pandas-primer lab. 
We learned some pandas, some numpy, and some matplotlib.

Copyright &copy; 2020 Marina Barsky.