# Python for everyone -- 11-12 NumPy, plotting and Pandas

<a href="https://classroom-40p3.onrender.com/" target="_blank">Classroom sign-in</a>

## Data analysis in Python

Python is the most popular programming language for data science. The main reason is that it is easy to write and read, so you get quick results (and easily reusable code). Also, in a self-reinforcing way, many packages have been created for data analysis and scientific use.

The main packages in Python's scientific ecosystem:
* **NumPy**: It provides fast numerical calculations and an intuitive syntax to write mathematical expressions. It does the heavy work for many other data analysis packages.
* **Matplotlib**: The default plotting package, very flexible and has many options. Sometimes it gets a bit complicated to produce customized figures.
* **Pandas**: Extremely useful package for loading and wrangling (cleaning and re-structuring) data. Good for exploratory data analysis.
* **SciPy**: Scientific python package, provides optimizers, differential equation solving, some statistics, etc.
* **scikit-learn**: Machine learning library
* **PyTorch**: deep learning library, supports the use of GPUs. ChatGPT is mostly implemented in Python and uses PyTorch.

Other plotting libraries:
* **Seaborn**: Builds on Matplotlib. Creates nice-looking plots that are common in data analysis with fewer instructions.
* **Plotly**: Higher-level alternative to Matplotlib. Good for, among other things, interactive visualizations that are displayed in your browser.

Other tools:
* **Anaconda environment manager**: Separate environment for each project. Makes installing data analysis libraries easy.
* **Jupyter Notebooks**: Popular for data analysis, because results display next to the code.



## 01 NumPy

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=2aebaf08-1ba6-41e4-89d2-acb600e0ecf2)

From NumPy's documentation:

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

 * a powerful N-dimensional array object
 * sophisticated (broadcasting) functions
 * tools for integrating C/C++ and Fortran code
 * useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

Traditionally, everyone imports NumPy as:

In [None]:
import numpy as np

The central object of NumPy is the N-dimensional array (or `ndarray`):

In [None]:
l = [1,2,3,4,5]
print(type(l),l,l[0],l[2:4])

In [None]:
a = np.array(l) #convert list to array
print(type(a),a,a[0],a[2:4])

Indexing and slicing work similarly to lists.

If used right, NumPy arrays are much more effective for numerical calculations than lists because array operations avoid Python `for` loops and offload heavy duty to C code (more on this later).

### Dimension and shape

A one-dimensional array is like a list: it is a sequence of values that can be accessed by a single index.

In [None]:
a1 = np.array([1,1,4,-2])
a1

In [None]:
a1[2]

The shape of a one-dimensional array is its length:

In [None]:
a1.shape

A two-dimensional array is like a table with rows and columns, or like a list of lists.

In [None]:
a2 = np.array([[11,12,13,14],
               [21,22,23,24],
               [31,32,33,34]])
a2

The elements of a 2-dimensional array can be accessed with two indices:

In [None]:
a2[2][1] # first index: row, second index: column

In [None]:
a2[2,1]

The shape of a 2-dimensional array is a tuple with two integers:

In [None]:
a2.shape # number of rows and columns

You can also create 3- and higher dimensional arrays.
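
As a small illustration of a higher-dimensional array, the sketch below builds a 3-dimensional array and inspects it. The use of `reshape` and the particular sizes are choices made for this example, not something the cells above rely on.

```python
import numpy as np

# A 3-dimensional array: 2 blocks, each with 3 rows and 4 columns
a3 = np.arange(24).reshape(2, 3, 4)

print(a3.ndim)      # number of dimensions
print(a3.shape)     # one length per dimension
print(a3[1, 2, 3])  # three indices select a single element
```

The shape tuple now has three entries, and you need three indices to reach a single element.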

### Data type

As opposed to lists, arrays contain a single type of data, which is usually numeric:

In [None]:
a

In [None]:
a.dtype

In [None]:
a_float = np.array( [1.,2.5,3.3])

In [None]:
a_float.dtype

Arrays are mutable, i.e., you can change the values of their elements:

In [None]:
a

In [None]:
a[1]=42
a

But the data type cannot change. Assigning a float to an integer array silently truncates the value to an integer:

In [None]:
a[0]=12.34

In [None]:
a
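
If you do need to store floats, one option is to make a converted copy of the array. The sketch below assumes a fresh integer array and uses `astype`, which is one common way to do this (the variable name `a_as_float` is made up for the example).

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])   # integer array
a[0] = 12.34                    # the float is truncated to fit the integer dtype
print(a[0], a.dtype)

a_as_float = a.astype(float)    # astype returns a new array with the requested dtype
a_as_float[0] = 12.34
print(a_as_float[0], a_as_float.dtype)
```

Note that `astype` does not modify `a`; it returns a new array.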

### Array operations and functions

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=86512763-8ed4-4b45-b2d0-acb600e71194)

Most array operations and mathematical functions implemented in NumPy are applied to arrays elementwise: 

In [None]:
a = np.ones(3) 
b = np.arange(3, dtype=float) # creates an array [0 1 2]
print(a,"+",b,"=",a+b)
print(a,"-",b,"=",a-b)
print('sin(b)=',np.sin(b))
print("b**3 =", np.power(b,3))

And there are many useful functions that calculate summaries of arrays:

In [None]:
a = np.arange(4)
print("a =", a)
print("sum =", np.sum(a))
print("mean =", np.mean(a))
print("std =", np.std(a))
print("median =", np.median(a))
print("min =", np.min(a))


### Loops vs array operations (important!)

Working with arrays you should vectorize your algorithm, meaning that you avoid explicit `for` loops. Array operations off-load Python loops to compiled C code, leading to large performance improvements.

Consider:

In [None]:
#%%timeit
n = 100000000
total = 0
for i in range(n):
    total += i
total

Same thing with vectorized code:

In [None]:
#%%timeit
a = np.arange(n)
np.sum(a)

Note: using the built-in `sum()` function is also slow:

In [None]:
sum(a)

This is because `sum` cannot take advantage of the properties of a numpy array, while `np.sum` can.
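
If you want to measure the difference outside of a notebook (where `%%timeit` is not available), the standard `timeit` module can be used. The following is a rough sketch: the smaller `n`, the helper `python_loop` and the number of repetitions are arbitrary choices for this example, and the actual timings will depend on your machine.

```python
import timeit
import numpy as np

n = 1_000_000          # smaller than in the cells above so this finishes quickly
a = np.arange(n)

def python_loop():
    # explicit Python loop, as in the first example
    total = 0
    for i in range(n):
        total += i
    return total

# Run each version three times and report the total time in seconds
t_loop = timeit.timeit(python_loop, number=3)
t_builtin = timeit.timeit(lambda: sum(a), number=3)
t_numpy = timeit.timeit(lambda: np.sum(a), number=3)

print(f"for loop:      {t_loop:.4f} s")
print(f"built-in sum:  {t_builtin:.4f} s")
print(f"np.sum:        {t_numpy:.4f} s")
```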

## 02 Plotting with matplotlib

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=3c096e75-27e6-4fbd-a120-acbb00cf049e)

Just like NumPy, matplotlib is a fundamental part of Python's scientific stack. It is the plotting tool most people default to. It has a wide range of options, it is very flexible and customizable, and even more is available through third party [extensions](https://matplotlib.org/thirdpartypackages/index.html). A number of modules, such as pandas, Seaborn, and NetworkX, make use of matplotlib. 

Just looking at the documentation can be daunting due to the high number of options and various ways to interact with the module. The recommended strategy to learn matplotlib is to familiarize yourself with plotting via [tutorials](https://matplotlib.org/tutorials/introductory/pyplot.html) and [examples](https://matplotlib.org/gallery/). Kind of like what we will be doing here!

Traditionally everyone imports matplotlib as

In [None]:
import matplotlib.pyplot as plt

`pyplot` is the submodule that contains the functions that we use to interface with the module.

Next, let's take a look at parts of a figure:
<img src="https://matplotlib.org/_images/anatomy.png" width="50%">
<font size=.1>(image borrowed from matplotlib's official tutorial)</font>

With matplotlib you can customize every part of this and more. Before going into the myriad of options, we should internalize two important objects: 
* `Figure`: the object that contains everything (canvas, title, axes,...); this is what you first create when you get ready to plot something
* `Axes`: Defines a region on your figure where you actually plot data using x-y coordinates. A `Figure` can have any number of them.
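
To make the relationship between the two objects concrete, here is a minimal sketch that creates an empty `Figure`, adds a single `Axes` to it, and checks how they refer to each other. It uses `fig.add_subplot`, which is one of several ways to add an `Axes`.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(4, 3))   # an empty Figure, no Axes yet
print(len(fig.axes))               # number of Axes currently on the Figure

ax = fig.add_subplot(1, 1, 1)      # add one Axes in a 1x1 grid
print(len(fig.axes))
print(ax.figure is fig)            # each Axes knows which Figure it belongs to
```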



In our first figure, we will plot the $x^2$ function. So before that we have to create the data:

In [None]:
x = np.arange(0,10,.1) 
y = x**2.
x[:30],y[:30]

Side note: matplotlib is designed to plot NumPy arrays, but similar objects typically work (for example lists).

It adds to the sometimes confusing syntax of matplotlib that there are two styles to create figures: object oriented and pyplot style.

**Object oriented** is when you explicitly create the `Figure` and the `Axes`, and explicitly specify which `Axes` you are working with:

In [None]:
fig  = plt.figure(figsize=(3,3))
ax   = plt.axes()

ax.plot(x, y, label="x**2") #first two positions: x and y data; label a string that will appear in the legend

ax.set_xlabel('x')
ax.set_ylabel('y')


ax.legend() # add legend

ax.set_title('A title') # add title


Or you can use the **pyplot style**, which is meant to resemble plotting with MATLAB. It automatically creates the `Figure` and `Axes`, and keeps track of a current `Figure` and `Axes` in the background. The same figure:

In [None]:
plt.plot(x, y, label="x**2") #first two positions: x and y data; label a string that will appear in the legend


plt.xlabel('x')
plt.ylabel('y')

plt.legend() # add legend

plt.title('A nice quadratic function'); # add title

Both styles are equally powerful, you can use either or even mix them. In this notebook, we will use both: pyplot for simple one plot figures, and object oriented style for more complicated cases.

In our first plot we used the default style and color. How can we customize it? Through the arguments of the `plot` function. But again, there are two typical ways:
* format string for quick formatting
* separate attributes for more explicit (and complete) formatting

In [None]:
x = np.linspace(0,2*np.pi,20)


plt.plot(x, np.sin(x), 'ro--', linewidth=3., label="sin") # o: circle marker; --: dashed line; r:red color

plt.plot(x, np.cos(x), color='green', marker='s', linestyle='solid', label="cos") 
plt.xlabel('x')
plt.ylabel('y')

plt.legend() # add legend

plt.title('Adding some style'); # add title

The format string requires less typing but is less flexible. It has only three parts (you can change their order):
```python
fmt = '[marker][line][color]'
```
Try out these format strings: 
* `'b'`: blue markers with default shape
* `'-g'`: green solid line
* `'^k:'`: black triangle_up markers connected by a dotted line

For a full list of options click [here](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot), and scroll all the way down to notes.
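
To see that the two ways of formatting end up setting the same properties, the sketch below plots the same made-up data twice, once with a format string and once with keyword arguments, and then compares the resulting line objects. The variable names and the data are only for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

fig, ax = plt.subplots()
# plot() returns a list of Line2D objects, one per plotted line
(line_fmt,) = ax.plot(x, y, 'ro--')
(line_kw,) = ax.plot(x, y, color='red', marker='o', linestyle='dashed')

# Compare the properties that the two calls ended up setting
print(line_fmt.get_marker(), line_kw.get_marker())
print(line_fmt.get_linestyle(), line_kw.get_linestyle())
print(to_rgba(line_fmt.get_color()) == to_rgba(line_kw.get_color()))
```

Colors are compared through `to_rgba` because `'r'` and `'red'` are different strings that name the same color.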

### Exercise

Plot the first three powers of $x$, i.e., $x$, $x^2$ and $x^3$, in a single figure. Create a legend too. Try out the default formatting and try to modify it.

Advanced: Use a `for` loop to plot more powers of $x$, for example all exponents between 1 and 5. 

<details><summary><u>Solution.</u></summary>
<p>
    
```python
x= np.linspace(0,1.1,20)

plt.plot(x,x,label="x")
plt.plot(x,x**2,label="x**2")
plt.plot(x,x**3,label="x**3")

plt.xlabel('x')
plt.ylabel('y')
plt.legend();
    
# advanced:
    
x= np.linspace(0,1.1,20)
for k in range(1,6):
    plt.plot(x,x**k,label="x**%d"%k)
plt.xlabel('x')
plt.ylabel('y')
plt.legend();
```
    
</p>
</details>

## Subplots


You can add multiple `Axes` to a `Figure` to create subplots. The easiest way of doing this is using the `plt.subplots` function, which creates a grid of plots. For example:

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(6,4)) # returns a Figure and an array of Axes
print("shape of axes array:",axes.shape)

#plotting in subplots
x = np.arange(0.1,10,.1)

axes[0][0].plot(x,np.exp(x),':', linewidth=2)

axes[0][1].plot(x,np.log(x),'--', color="#aaaa00",linewidth=8)

axes[1][0].plot(x,np.tan(x),'b-', markersize=10.,linewidth=2)

axes[1][1].plot(x,x**2,'x', linewidth=8, markersize=2.) 


# let's label all axes the same way
for ax in axes.reshape(-1): #we reshape the axes to be one dimensional, try iterating over it without reshaping
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    
plt.tight_layout()

(Why do you see those uneven peaks for the `tan` function? Try changing the increment in the `np.arange(.1,10,.1)`.)

## Histograms

The `hist(data)` function can both calculate and plot a histogram from a sample `data`.

It has a `bins` argument that works just like the `bins` argument of NumPy's `histogram` function: you can specify the number of evenly spaced bins or the edges of the bins.

In [None]:
#generate a large sample of normally distributed numbers with mean 2 and standard deviation 1
x = np.random.normal(2,1,10000)
plt.hist(x, 50, facecolor='g',alpha=1.) #try changing facecolor and alpha
plt.xlabel('x')
plt.ylabel('count')
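
The two ways of specifying `bins` can be tried out without plotting by calling `np.histogram` directly. The data values below are made up for the example.

```python
import numpy as np

data = np.array([0.5, 1.2, 1.7, 2.1, 2.4, 2.9, 3.3, 4.8])

# bins as an integer: that many equal-width bins between min and max
counts, edges = np.histogram(data, bins=4)
print(counts, edges)

# bins as a sequence: explicit bin edges, which may be unevenly spaced
counts2, edges2 = np.histogram(data, bins=[0, 1, 2, 3, 5])
print(counts2, edges2)
```

In both cases there is one more edge than there are bins, and passing the same `bins` value to `plt.hist` should give the matching plot.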


### 🔴 Exercise

Download a csv file from [here](https://posfaim.github.io/dnds5027/data/weight-height-clean.csv). This file contains the weight, height, age and gender of a sample of people.

Run the following code to open the csv file and save the weight of the people. Don't forget to adjust the file path to reflect the location of the file on your machine.

In [None]:
weights = []
with open("data/weight-height-clean.csv") as f:
    print(f.readline()) #print out the header
    for line in f:
        data = line.split(',')
        weights.append(float(data[0]))

weights = np.array(weights) # convert from list to numpy array
weights[:5]

Use numpy functions to calculate the minimum, the maximum and the average weight.

<details><summary><u>Solution.</u></summary>
<p>
    
```python
print(f"min: {weights.min()}, max: {weights.max()}")
print(f"mean weight:{weights.mean():.2f}")
```
    
</p>
</details>

Plot the histogram of the weights.

<details><summary><u>Solution.</u></summary>
<p>
    
```python
plt.hist(weights, bins=50)
plt.xlabel('weight [kg]')
plt.ylabel('count')
```
    
</p>
</details>

## 03 Pandas

[Intro video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=4aa0894d-686a-42d6-9f07-ab8a00d1e2bf)

In this part we introduce and explore [pandas](http://pandas.pydata.org/), a library for manipulating and analyzing tabular data. It implements common tasks, making the following (and more) very straightforward:
- Reading in and cleaning tabular data
- Merging data from different sources
- Basic analysis and plotting

Pandas stands for **Pan**el **Da**ta (an expression borrowed from econometrics). It brings tools and ideas from Excel, R, and SQL to Python.

We'll have a look at the basic data structures: Series and Dataframes. Then we'll look at a dataset about the passengers of the Titanic.

Traditionally everyone imports pandas as:

In [None]:
import pandas as pd

# we will also use:
import numpy as np
import matplotlib.pyplot as plt

### Series

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=0c046192-d9a5-4906-a275-acbd00fde38b)

Pandas has two primary data structures: series and dataframes. Series are similar to Python lists or NumPy vectors: they are one dimensional. 

We can create a series from a list:

In [None]:
my_list = [3,2,1,34,5,3,2,3,100]
my_first_series = pd.Series(my_list)
my_first_series

When we create the series, notice that we also get a column with a number for each row. This is the index of the series and it contains a label for each row. 

You can access the values and the indices separately: 

In [None]:
S = my_first_series # just so I don't have to type that much

#access the values
print(S.values)
print(type(S.values))

#access indices
print(S.index)

The values are stored in a NumPy array.

By default the indices are integers starting from 0. But you can use a variety of data types:

In [None]:
S1 = pd.Series([1.5,2.,3.],index=['A','B','C'])
S1

In [None]:
S2 = pd.Series([1,2,3])
S2.index = [.5,.75,1.]
S2

### Accessing elements

There are three main ways to access elements:
* By index using `S.loc[...]`
* By position using `S.iloc[...]`
* And using `S[]`

Let's look at these individually.

#### `loc`
With `S.loc[label]` you can access elements of `S` using the labels set as the index of the series, so this is kind of like indexing a dictionary:

In [None]:
print(S1.loc['A'], S2.loc[.75], S.loc[0])

Unlike dictionaries, we can also use lists of labels:

In [None]:
S1.loc[['A','C']]

The labels do not have to be in the original order:

In [None]:
S2.loc[[1.,.5]]

In [None]:
S.loc[[3,2,1]]

Note that in this last example the numbers are the labels, which just happen to be the same as the position of the rows in the original series `S`.

And you can also do slicing:

In [None]:
S1.loc['B':'C'] # if using labels, the end label 'C' is also included

#### `iloc`

With `S.iloc[pos]` you can access elements based on their position, so this is kind of like the indexing of lists and arrays:

In [None]:
print(S.iloc[0],S1.iloc[1],S2.iloc[2])

Again you can provide a list of positions:

In [None]:
S1.iloc[[2,1,0]]

Or you can use slices:

In [None]:
S2.iloc[0:2] # when using positions the final position, here 2, is not included
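
The different treatment of the end point in label slices and position slices is easy to overlook, so here is a short side-by-side sketch using the same values as `S1` above (the names `by_label` and `by_position` are made up for the example).

```python
import pandas as pd

S1 = pd.Series([1.5, 2.0, 3.0], index=['A', 'B', 'C'])

by_label = S1.loc['A':'B']    # label slice: the end label 'B' is included
by_position = S1.iloc[0:1]    # position slice: the end position 1 is excluded

print(len(by_label), len(by_position))
```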

#### `[...]`

So what does `S[x]` return? Let's try it out:

In [None]:
print(S1['A'],S1[0])
print(S2[.75],S1[1])

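pandas guesses whether the key is a label or a position! (Recent pandas versions deprecate this guessing for integer keys and show a warning, which is one more reason to be explicit.) But what happens if both labels and positions are integers? Let's try it out. First, let's create a series where both the positions and the labels are integers, but in a different order: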

In [None]:
S3 = S[::-1] #with this slice we reversed the order of rows
S3

In [None]:
print(S3[0],S3[8])

This is based on label!

In [None]:
S3[1:4]

Confusingly, slices are based on position.

Bottom line: when in doubt, use `loc` and `iloc`.
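
As a sketch of what being explicit looks like, the reversed series from above can be accessed with `loc` and `iloc` so that there is no guessing about labels versus positions:

```python
import pandas as pd

S = pd.Series([3, 2, 1, 34, 5, 3, 2, 3, 100])
S3 = S[::-1]            # reversed rows: labels now run 8, 7, ..., 0

print(S3.loc[0])        # row whose label is 0 (the last row of S3)
print(S3.iloc[0])       # row at position 0 (the first row of S3)
print(S3.loc[8:6])      # label slice, both ends included
print(S3.iloc[0:3])     # position slice, end excluded
```

Here the last two lines select the same three rows, once by label and once by position.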

### Exercise

Create a series with values that are integers from 1 to 5, and index labels that are the first five letters of the English alphabet. Then select every second row, try to do it as many ways as you can think of!

<details><summary><u>Hint</u></summary>
<p>

Try using slices and list of both positions and labels.
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python

S = pd.Series(np.arange(1,6),index=['a','b','c','d','e'])
print(S[::2])
print(S.loc['a':'e':2])
print(S.loc[['a','c','e']])
print(S.iloc[[ i for i in range(0,5,2)]])

```
    
</p>
</details>

### Masking

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=1e5bae96-d184-413a-a633-acbd01020c36)

We can also use masks the same way we did with NumPy arrays. A mask is a list or array of bools that has the same length as your series. Using it as an index will return elements where the mask is `True`.

In [None]:
S = pd.Series([3,5,2,3],index=['A','B','C','D'])
mask = [True, False, True, False]
S[mask]

We can use this to filter rows based on the value of their elements, by creating a mask using comparison operators such as `==`, `>`, or `>=`. For example:

In [None]:
just_threes_mask = (S == 3)
print(just_threes_mask)

Using this mask to filter rows that are equal to 3:

In [None]:
S[just_threes_mask]

Or in one line:

In [None]:
S[S==3]

### Exercise

Using masking, select rows of the following series such that values are larger than 5 but smaller than 10. You can use a `for` loop to create the mask, or even better, try to use the elementwise logical and operator `&`: 

<details><summary><u>Hint</u></summary>
<p>

You can iterate over the values of a series in either of the following two ways:
```python
    
for x in S.values:
    ...
    
for x in S:
    ...
```
</p>
</details>

In [None]:
S = pd.Series(range(20))


<details><summary><u>Solution.</u></summary>
<p>
    
```python
print(S[(S>5) & (S<10)])

mask = []
for x in S:
    mask.append(x>5 and x<10)
print(S[mask])
```
    
</p>
</details>
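
A common stumbling block is using Python's `and` instead of `&` on whole series. The sketch below shows the difference; the `try`/`except` is only there so that the example keeps running after the error.

```python
import pandas as pd

S = pd.Series(range(20))

# `&` combines two boolean series element by element.
# The parentheses are required because `&` binds tighter than `>` and `<`.
mask = (S > 5) & (S < 10)
print(S[mask].tolist())

# `and` asks for a single True/False value for a whole series, which is ambiguous
try:
    (S > 5) and (S < 10)
except ValueError as err:
    print("ValueError:", err)
```

Inside the `for` loop of the solution, `and` works because there `x` is a single number, not a series.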

### Applying functions to series

[Video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=56629303-7ec5-491a-b6ed-acbd010662fb)

In most use cases, our series will contain only one data type, so let's look at examples like that.

In [None]:
a = pd.Series([1.0, 2.0, 3.0])
b = pd.Series([1.0, 1.0, 1.0])

print(a)

Similarly to numpy, basic operations get applied to the series elementwise. For example, multiplying a series by 2, doubles each element:

In [None]:
2*a

Or adding two series is calculated elementwise (elements are matched by label):

In [None]:
a+b

We can even apply a NumPy function to the entire series:

In [None]:
np.sin(a)

If we want to apply a general function to the series elementwise, we have to use the `apply` function. This is important: we will use it often to create new columns in tables.

The `apply()` function does for series what a `for` loop or a list comprehension does for lists. Let's look at a few examples:

In [None]:
L = [10,'apple',3.4,'hello']

new_list = []
for x in L:
    new_list.append(type(x))

print(new_list)

The same using series and `apply()`:

In [None]:
S = pd.Series(L)
new_series = S.apply(type)
print(new_series)

We can define our own functions:

In [None]:
#list
L = ["apple!","pear!","watermelon!"]

L2 = []
for s in L:
    L2.append(s.replace("!",""))
    
print(L2)

In [None]:
#series
s = pd.Series(L)

def func(x):
    return x.replace("!","")

s2 = s.apply(func)
print(s2)

Note that you can also do this in an alternative way:

In [None]:
s.str.replace("!","")

### Exercise

Take the following list and `for` loop and
* Convert the list into a pandas series
* Use `apply()` to do the same operation as the list comprehension

In [None]:
L = ["APPLE","PEAR","WATERMELON"]

L2 = []
for s in L:
    L2.append(s.lower()+"!")
L2

<details><summary><u>Solution.</u></summary>
<p>
    
```python
S= pd.Series(L)
S2 = S.apply(lambda s: s.lower()+"!")
print(S2)
```
    
</p>
</details>

### From dictionaries to series

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=8a39629f-df2c-4b20-85b1-acbd010a3cbc)

So far we have converted lists to create series; we can also use Python dictionaries.

Let's say that I have my ratings of my favorite actors in a dictionary and let's convert this into a series: 

In [None]:
#let's create a dict of the average movie ratings from 1 to 10 of actors.
actor_rating_dict = {'Nicolas Cage':3,'Robert Redford':5,'Julianne Moore':8,
                     'Jeff Bridges':7, 'Idris Elba':8,'Meryl Streep':9,
                     'Pam Grier':9, 'Dorottya Udvaros':7.5}
actor_rating_series = pd.Series(actor_rating_dict)
print(actor_rating_series)

Note the keys of the dictionary are mapped to index labels.

As we have seen before, we can now look up values based on index label or by position:

In [None]:
#look up by index label
print(actor_rating_series['Idris Elba'])
#look up by position (iloc makes it explicit that 4 is a position, not a label)
print(actor_rating_series.iloc[4])

### Dataframe

Look, I have another dictionary! This dictionary stores the number of movies I have seen with the actors in them:

In [None]:
#creating another series: this time how many movies I have seen with each actor.
actor_frequency_dict = {'Nicolas Cage':20,'Robert Redford':6,
                        'Julianne Moore':10, 'Jeff Bridges':2,
                        'Idris Elba':14,'Mr. Bean':3,'Meryl Streep':7,
                        'Pam Grier':11,'Dorottya Udvaros':5}

actor_frequency_series = pd.Series(actor_frequency_dict)
actor_frequency_series

We can combine any number of series with the `concat` command. What is returned is a dataframe.

In [None]:
df = pd.concat([actor_rating_series, actor_frequency_series], axis=1, sort=True)
df

What does Mr. Bean's `NaN` mean? I have seen three movies with Mr. Bean, but for some reason I didn't rate him. If the `concat()` function encounters an index label that is missing from one of the series, it substitutes the missing value with the special value `NaN`, which stands for not-a-number. 
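
The same behavior can be reproduced on a tiny example. The labels and numbers below are made up; `isna()` is used to locate the missing entries.

```python
import pandas as pd

ratings = pd.Series({'A': 3, 'B': 5})
counts = pd.Series({'A': 20, 'B': 6, 'C': 3})   # 'C' has no rating

table = pd.concat([ratings, counts], axis=1)
print(table)
print(table.isna())          # True marks the missing entries
print(table[0].dtype)        # the ratings column became float to hold NaN
```

Since the two series have no names, the columns of `table` are simply labelled `0` and `1`.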

### Tiny exercise
What happens if we exclude `axis=1` from the concat command? Try it out!

We can rename the columns to have more descriptive labels:

In [None]:
df.columns = ['Average_Rating','Number_of_Movies']
df

We can access columns by name, this returns a series:

In [None]:
df['Number_of_Movies']

Or this is the same:

In [None]:
df.Number_of_Movies

We can access elements by index label using `loc`:

In [None]:
#element: [row_label,column_label]
print(df.loc['Mr. Bean','Number_of_Movies'])

Or we can access entire rows and columns:

In [None]:
#row: [row_name]
print(df.loc['Mr. Bean'])
print(type(df.loc['Mr. Bean']))
print('-------------')

#column: [:,column_name]
print(df.loc[:,'Number_of_Movies'])
print(type(df.loc[:,'Number_of_Movies']))


The rows and columns are returned as series objects.

You can also do masking. The most common way to use this is to filter rows by creating a mask that has as many `True` or `False` values as the number of rows. For example, to get all actors with a rating larger than 6:

In [None]:
df[df['Average_Rating']>6.]

### Dealing with missing values

[Video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=58cfe757-1aed-4d44-ba11-ab8a00dc9c7e)

A common task in data processing is to deal with missing data. For example, we don't have a rating for Mr. Bean. One possibility is that we make an educated guess, and say that Mr. Bean has the same rating as the average of everyone else:

In [None]:
#df['Average_Rating'] returns a series corresponding to the column
#and series has a method to calculate its mean
avg = df['Average_Rating'].mean() 
print('Average of Average Rating = %g' % (avg))

#We can change the elements directly:
df.loc['Mr. Bean','Average_Rating'] = avg

df

This is such a common task, that pandas has a built in method to locate and replace `NaN` called `.fillna()`.

In [None]:
# set Mr. Bean's average rating back to np.nan
df.loc['Mr. Bean','Average_Rating'] = np.nan

# override the Average_Rating column with a version that has the nan's replaced by the average
# of the non-nan entries.
df['Average_Rating'] = df['Average_Rating'].fillna(np.mean(df['Average_Rating']))
df
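
The pieces involved can also be looked at one by one on a small series. This is only a sketch with made-up values; `dropna()` is shown as an alternative to filling, for cases where discarding incomplete rows is acceptable.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.0, np.nan, 8.0], index=['a', 'b', 'c'])

print(ratings.isna())            # locate the missing values
print(ratings.mean())            # mean() skips NaN by default

filled = ratings.fillna(ratings.mean())   # returns a new series
dropped = ratings.dropna()                # alternative: discard the missing rows
print(filled)
print(dropped)
```

Both `fillna()` and `dropna()` return new objects, which is why the cell above assigns the result back to the column.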

### Using apply() with a dataframe

[Video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=dd8d9677-fa4a-4493-9047-ab8a00e03354)

We have seen before that you can get a column as a series; this way you can use `apply()` like we did before. For example, we can create a series indicating our favorite actors: it will contain 1 if the rating of the actor is higher than 7, otherwise 0.

In [None]:
def is_favorite(rating):
    if rating>7:
        return 1
    else:
        return 0

favorites = df['Average_Rating'].apply(is_favorite)
print(favorites)

And we can use this to create a new column:

In [None]:
df['Favorites']=favorites
df

However, we might want to use more than one column as input. For example, I would like to create a new column that will indicate actors that I don't like, yet I've seen many times:

In [None]:
df

In [None]:
#define a function that we will use with apply
def func(row):
    if row['Average_Rating']<=5 and row['Number_of_Movies']>=10:
        return 1
    else:
        return 0

love_to_hate = df.apply(func, axis=1)
df['love_to_hate'] = love_to_hate
df

The setting `axis=1` tells `apply()` to iterate through rows, while `axis=0` iterates through columns. Take a look:

In [None]:
df.apply(np.mean, axis=0)

### Exercise

Create a new column `love_to_love` that contains a `1` for actors that have rating at least `7` and I have seen movies with them at least `10` times.

<details><summary><u>Solution.</u></summary>
<p>
    
```python
#define a function that we will use with apply
def func(row):
    if row['Average_Rating']>=7 and row['Number_of_Movies']>=10:
        return 1
    else:
        return 0

love_to_love= df.apply(func, axis=1)
df['love_to_love'] = love_to_love
df
```
    
</p>
</details>
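
For simple conditions like this one, the same column can also be built from a boolean mask instead of a row-wise `apply()`. The sketch below compares the two approaches on a small made-up table (the actor labels are hypothetical); it is an alternative, not a replacement for the solution above.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Average_Rating': [3.0, 8.0, 9.0], 'Number_of_Movies': [20, 14, 7]},
    index=['actor_1', 'actor_2', 'actor_3'],   # hypothetical labels
)

def func(row):
    if row['Average_Rating'] >= 7 and row['Number_of_Movies'] >= 10:
        return 1
    else:
        return 0

with_apply = toy.apply(func, axis=1)

# Same condition written as a boolean mask, then converted to 0/1
with_mask = ((toy['Average_Rating'] >= 7) & (toy['Number_of_Movies'] >= 10)).astype(int)

print(with_apply.tolist(), with_mask.tolist())
```

The mask version avoids the Python-level loop over rows, in the same spirit as the vectorized NumPy code earlier.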

### Plotting

Pandas has built in plotting to quickly take care of common plots. It sits 'on top of' matplotlib and so can be customized in the same way.

The pandas `plot()` function returns the matplotlib `Axes` object of the figure that it created. You can use this `Axes` to customize your figure.

Creating an automatically labelled bar chart is very simple:

In [None]:
ax = df.plot(kind='bar', use_index=True, y='Average_Rating')
ax.set_title('Actor Number of Films vs Avg. Rating', size=15);

Or a histogram:

In [None]:
ax=df.plot(kind='hist', y='Number_of_Movies', legend=True)

ax.set_title('Number of Movies Histogram');

So now we know the basics, let's do something more complex.

## Part II: Surviving the Titanic

Download the csv file from [here](https://posfaim.github.io/dnds5027/data/titanic.txt).

[Video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=41bee325-1d33-4211-a829-ab8a00e2e714)

Pandas has great data manipulation abilities. Let's finally consider some real data. We are going to look at passenger data from the Titanic, which sank on its maiden voyage. Of 2,224 passengers and crew, more than 1,500 died.

We have a data file containing data of some of the passengers. This is what it looks like (this will work on Linux or Mac, but not on Windows):

In [None]:
!head data/titanic.csv

Reading from a csv in pandas is very easy! `read_csv` is very flexible: it can read `txt` files, other plain text files, and many more.

In [None]:
df = pd.read_csv('data/titanic.csv', header=0, sep=',')

`header=0` indicates that the first row is the header. In this case it is not necessary. `sep` is the column separator; other common examples include tabs `\t`, white space, and `|`.

Note that pandas automatically guesses the datatype of each column and converts it appropriately. This usually works, but in some unusual cases it might fail, for example, phone numbers might be converted to numbers instead of kept as strings. In these cases the `dtype` argument can be used to specify the data type by hand.
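
The phone number case can be reproduced without any file on disk. In the sketch below, the csv content is a made-up string read through `io.StringIO`, and the column names are hypothetical.

```python
import io
import pandas as pd

# A small in-memory stand-in for a csv file (hypothetical data)
csv_text = "name,phone\nAnna,0612345\nBela,0698765\n"

guessed = pd.read_csv(io.StringIO(csv_text))
print(guessed['phone'].tolist())     # parsed as integers, leading zeros are lost

typed = pd.read_csv(io.StringIO(csv_text), dtype={'phone': str})
print(typed['phone'].tolist())       # kept as text
```

The `dtype` argument takes a dictionary, so you can fix the type of just the columns that need it and let pandas guess the rest.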

What does our dataframe look like? We have too many rows to print out the whole table, but we can use the `head()` method to show only the first five rows:

In [None]:
df.head()

The `tail()` method shows the last five rows:

In [None]:
df.tail()

### Some more information about the data:
- Pclass: passenger class. 
- SibSp: number of siblings+spouses aboard
- Parch: number of parents+children aboard
- Fare: cost of ticket
- Cabin: room ID, if passenger had a room
- Embarked: port of departure (C= Cherbourg; Q= Queenstown; S=Southampton)

### Let's check out a few data exploration techniques

Basic statistics are printed out by the `describe()` method for numeric columns.

In [None]:
df.describe()

We can also visualize pairs of variables quickly. The `pd.plotting.scatter_matrix()` function creates a matrix of plots: the diagonals contain histograms, and the off-diagonal plots are scatter plots.

In [None]:
pd.plotting.scatter_matrix(df[['Parch','Age','Fare']],figsize=(8,8));

Where `df[['Parch','Age','Fare']]` selects three columns.

###  Grouping

[Video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=c74c32a0-0766-4867-90e7-ab8a00e52015)

Pandas has powerful grouping methods that allow us to group entries based on a column value.

For example, we can ask the question: does the average survival rate depend on the passenger's ticket class? For this we group passengers based on the column `Pclass`, which creates three groups. Then we calculate the average survival rate for each group separately by calculating the mean of the `Survived` column. (Remember, this column is 1 if the passenger survived and 0 if they died; therefore the mean is the survival rate!)

These steps are done easily with pandas:

In [None]:
df.groupby('Pclass')['Survived'].mean()
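
To see what `groupby` does behind the scenes, the sketch below performs the same steps by hand on a tiny made-up table (not the Titanic data) and compares the result with the one-line version.

```python
import pandas as pd

toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})   # made-up values, not the Titanic data

# Manual version: filter each group, then average
for pclass in sorted(toy['Pclass'].unique()):
    group = toy[toy['Pclass'] == pclass]
    print(pclass, group['Survived'].mean())

# groupby does the split, the averaging and the recombination in one call
rates = toy.groupby('Pclass')['Survived'].mean()
print(rates)
```

The result of `groupby(...).mean()` is a series whose index holds the group labels.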

### Exercise

What about "women and children first"? Does sex correlate with survival rate? Calculate the average survival rate for men and women separately!

<details><summary><u>Hint</u></summary>
<p>

Do the same thing as in the previous example, only this time group by `Sex` instead of `Pclass`.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
df.groupby(['Sex'])['Survived'].mean()
```
    
</p>
</details>

We can do even more refined grouping. Let's combine the two: group by both class and sex, and calculate the survival rates.

In [None]:
survived_by_class_and_sex = df.groupby(['Pclass','Sex'])['Survived'].mean()
survived_by_class_and_sex

Let's plot these survival rates!

In [None]:
survived_by_class_and_sex.unstack(1).plot(kind='bar',
                                          title='Survival Probability by Sex and Class');

### Exercise

In the previous example we used the `unstack()` method. What does this do to grouped data? Try `unstack(0)` and `unstack(1)` in the next cell, also try the plot without using `unstack()`. Explain what the function does!

<details><summary><u>Hint</u></summary>
<p>

In addition to trying out plots, print out `survived_by_class_and_sex`, `survived_by_class_and_sex.unstack(0)`, and `survived_by_class_and_sex.unstack(1)`. What is their type? What are the indices?

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
The call `df.groupby(['Pclass','Sex'])['Survived'].mean()` returns a series object with [hierarchical indices](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html), where `Pclass` is the level 0 index and `Sex` is the subindex at level 1. The method `unstack()` transforms the 1 dimensional series with hierarchical indices to a 2 dimensional data frame where the column names and row names are the two levels of indices.

Try these lines in separate cells:

```python
survived_by_class_and_sex
survived_by_class_and_sex.unstack(1).plot(kind='bar', title='Survival Probability by Sex and Class')
survived_by_class_and_sex.unstack(1)
```
    
</p>
</details>
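
The effect of `unstack()` can also be checked on a hand-built series with a two-level index. The survival rates below are made-up numbers; only the index structure mirrors the grouped Titanic result.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[1, 2], ['female', 'male']], names=['Pclass', 'Sex']
)
rates = pd.Series([0.9, 0.4, 0.8, 0.2], index=index)   # made-up numbers

wide_by_sex = rates.unstack(1)      # level 1 (Sex) becomes the columns
wide_by_class = rates.unstack(0)    # level 0 (Pclass) becomes the columns

print(wide_by_sex)
print(wide_by_class)
```

The number passed to `unstack()` picks which index level is moved to the columns; the other level stays as the row index.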

### Subsetting

As we mentioned before, filtering the data based on some condition is also simple. If you remember, this is similar to using boolean masks in NumPy.

We can subset the data to only include passengers below 30:

In [None]:
under_30=df[df['Age']<30]
under_30.head()

### Exercise

Subset the data to only include passengers that paid less than the average fare.

<details><summary><u>Hint</u></summary>
<p>

Calculate the mean fare using `df['Fare'].mean()`.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
avg_fare = df['Fare'].mean()
print("The average fare is", round(avg_fare,3))
below_average_fare = df[df['Fare']<avg_fare]
below_average_fare.head()
```
    
</p>
</details>

We can do more complicated filtering using logical operators. (Remember `&`=and and `|`=or)

For example, to select the male passengers who got on the Titanic in France (Cherbourg is in France) we can write:

In [None]:
males_from_france = df[(df['Embarked']=='C') & (df['Sex']=='male')]
males_from_france.head()

### New columns

We can also create new columns! Let's count the reverends on board.

First we define a helper function that takes a name as input and returns 1 if they are a reverend and 0 if they are a layman. 

In [None]:
def is_rev(input_name):
    # they are a reverend if their name contains the 'Rev.' title
    if 'Rev.' in input_name:
        return 1  
    else:
        return 0

To test the function, we can apply it elementwise to the `Name` column and count the number of reverends on board:

In [None]:
np.sum(df['Name'].apply(is_rev))

To define a new column we can simply write:

In [None]:
df['is_reverend'] = df['Name'].apply(is_rev)

#check the columns
df.columns
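
As with the earlier `s.str.replace` example, string columns offer an alternative to `apply()` through the `.str` accessor. The sketch below compares the two on a few hypothetical names written in a "Last, Title First" style; it is meant as an illustration, not as the required way to build the column.

```python
import pandas as pd

names = pd.Series([
    'Smith, Rev. John',      # hypothetical names, not taken from the data file
    'Brown, Mr. Thomas',
    'Jones, Miss. Anna',
])

def is_rev(input_name):
    if 'Rev.' in input_name:
        return 1
    else:
        return 0

with_apply = names.apply(is_rev)

# regex=False treats 'Rev.' as plain text, so the dot is not a wildcard
with_str = names.str.contains('Rev.', regex=False).astype(int)

print(with_apply.tolist(), with_str.tolist())
```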