---
title: NumPy and Pandas
---


## NumPy

Fast computation using vectors and matrices. Without NumPy, adding two lists element by element takes an explicit loop:

In [None]:
list1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
list2 = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

summed = []
for i in range(len(list1)):
    summed.append(list1[i] + list2[i])
summed

### Arrays

In [None]:
import numpy as np

In [None]:
a = np.array(list1)
b = np.array(list2)
a, b

### Broadcasting

In [None]:
a + b

In [None]:
a * b

In [None]:
a - 10

In [None]:
a.sum()

In [None]:
a.mean()
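The scalar case above (`a - 10`) is the simplest form of broadcasting: the scalar is stretched to the shape of the array. The same rule applies between arrays of different but compatible shapes. A minimal sketch, where `col`, `row` and `grid` are illustrative names that do not appear elsewhere in this notebook:

```python
import numpy as np

a = np.arange(10)  # same values as `a` above: 0..9

# Scalar broadcasting: 10 is stretched to match the shape of `a`
shifted = a - 10

# Shape broadcasting: a (3, 1) column combined with a (4,) row gives a (3, 4) grid
col = np.array([[0], [10], [20]])
row = np.array([0, 1, 2, 3])
grid = col + row

print(shifted)
print(grid.shape)
print(grid)
```

Each element of `grid` is the sum of one entry of `col` and one entry of `row`, computed without an explicit loop.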

### Multidimensional arrays

In [None]:
list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
list_of_lists

In [None]:
list_of_lists[1][1]

In [None]:
matrix = np.array(list_of_lists)
matrix

In [None]:
matrix[1][1] # not efficient: builds the intermediate row matrix[1] first

In [None]:
matrix[1, 1] # efficient
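The comma-separated index also accepts slices, which is where it becomes more than a shorthand. A small sketch using the same 3x3 matrix (`second_column` and `top_left` are names chosen for illustration):

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Both forms reach the same element; the second does it in a single lookup
print(matrix[1][1], matrix[1, 1])

second_column = matrix[:, 1]   # all rows, column 1
top_left = matrix[:2, :2]      # first two rows and first two columns

print(second_column)
print(top_left)
```

Selecting a column this way has no direct equivalent with nested lists, where `list_of_lists[:][1]` would simply return the second row.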

In [None]:
matrix - 10

In [None]:
matrix.sum()

In [None]:
list_of_lists_of_lists = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
list_of_lists_of_lists

In [None]:
tensor = np.array(list_of_lists_of_lists)
tensor

In [None]:
tensor[1, 1, 1]

## Pandas

Fast computations on data tables (on top of NumPy).

In [None]:
import pandas as pd

### DataFrame

In [None]:
df = pd.DataFrame({'name': ['Mike', 'Mia', 'Jake'], 'weight': [82, 62, 75]})
df

In [None]:
type(df)

In [None]:
df = pd.DataFrame(dict(name=['Mike', 'Mia', 'Jake'], weight=[82, 62, 75]))
df

In [None]:
records = [('Mike', 82), ('Mia', 62), ('Jake', 75)]

df = pd.DataFrame.from_records(records, columns=['name', 'weight'])
df

In [None]:
df.index

In [None]:
df.index.values

In [None]:
df.columns

In [None]:
df.dtypes

Add a column to an existing dataframe:

In [None]:
df['height'] = [182.5, 173.0, 192.5]
df

Add another, categorical, column:

In [None]:
df['sex'] = pd.Categorical(['male', 'female', 'male'], categories=['female', 'male'], ordered=True)
df

In [None]:
df.dtypes
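A categorical column stores each value as a small integer code that points into a fixed list of categories. The sketch below rebuilds the `sex` column on its own, as a standalone Series rather than part of `df`, to show those two parts:

```python
import pandas as pd

sex = pd.Series(pd.Categorical(['male', 'female', 'male'],
                               categories=['female', 'male'], ordered=True))

print(sex.cat.categories.tolist())  # the declared categories, in order
print(sex.cat.codes.tolist())       # integer code of each value
print((sex > 'female').tolist())    # comparisons are allowed because ordered=True
```

With `ordered=False` the last line would raise a `TypeError`, since unordered categories have no notion of greater or smaller.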

A Series just wraps an array:

In [None]:
df.height.to_numpy()
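To make the "wrapping" concrete, a Series can be taken apart into its values and the index that pandas adds on top. A minimal sketch with a standalone `height` Series (built here only for illustration):

```python
import pandas as pd

height = pd.Series([182.5, 173.0, 192.5], name='height')

values = height.to_numpy()  # the underlying NumPy array
labels = height.index       # the row labels that pandas adds on top

print(type(values).__name__, values.dtype)
print(list(labels))
```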

### Example penguin data set

In [None]:
import seaborn as sns

penguins = sns.load_dataset('penguins')

In [None]:
penguins

In [None]:
penguins.dtypes

In [None]:
penguins.head()

In [None]:
penguins.tail()

### Series

In [None]:
penguins['flipper_length_mm']

In [None]:
penguins.flipper_length_mm

In [None]:
type(penguins.flipper_length_mm)

### Broadcasting

In [None]:
penguins.bill_depth_mm - 1000

In [None]:
penguins.bill_depth_mm * penguins.flipper_length_mm

### Indexing

#### Get a cell

In [None]:
penguins.loc[4, 'island']

#### Get a row

In [None]:
penguins.loc[4]

#### Get a column

In [None]:
penguins['bill_depth_mm']

In [None]:
penguins.bill_depth_mm

#### Get a range of rows and multiple columns

In [None]:
penguins.loc[40:45, ['island', 'body_mass_g']]

#### Use boolean series as index to subset data

In [None]:
idx = penguins.bill_length_mm > 55
idx

In [None]:
penguins.loc[idx]

In [None]:
penguins.loc[(penguins.bill_length_mm > 55) & (penguins.sex == 'Female')]
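Boolean series combine with `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`. The sketch below uses a small hand-made table, `birds`, as a stand-in for the penguin data so that it runs without downloading anything:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the penguin table
birds = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap', 'Gentoo'],
    'sex': ['Female', 'Female', 'Male', 'Male'],
    'bill_length_mm': [39.1, 56.2, 58.0, 49.9],
})

long_bill = birds.bill_length_mm > 55
female = birds.sex == 'Female'

both = birds.loc[long_bill & female]    # AND
either = birds.loc[long_bill | female]  # OR
not_long = birds.loc[~long_bill]        # NOT
some_species = birds.loc[birds.species.isin(['Adelie', 'Chinstrap'])]

print(both.index.tolist(), either.index.tolist(), not_long.index.tolist())
print(some_species)
```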

#### Setting and resetting the index

In [None]:
penguins

In [None]:
df = penguins.set_index(['species', 'sex', 'island'])
df.head(10)

In [None]:
df.reset_index()
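Once columns have been moved into the index, rows are looked up by those labels instead of by row number. A minimal sketch on a hand-made table (`birds` and its values are made up for illustration):

```python
import pandas as pd

birds = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'island': ['Dream', 'Biscoe', 'Biscoe'],
    'body_mass_g': [3700, 3800, 5000],
})

indexed = birds.set_index(['species', 'island'])

# With a two-level index, .loc takes a tuple of labels
mass = indexed.loc[('Adelie', 'Biscoe'), 'body_mass_g']

# reset_index moves the index levels back into ordinary columns
restored = indexed.reset_index()

print(mass)
print(restored.columns.tolist())
```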

### Sorting rows

In [None]:
sorted_df = penguins.sort_values(by="bill_length_mm")
sorted_df.head()

In [None]:
sorted_df.index.values

Click to the left of an output cell to enable/disable scrolling of the output (useful for large amounts of output).

In [None]:
sorted_df.loc[0] # loc selects by label: the row labelled 0, not the first row

In [None]:
sorted_df.flipper_length_mm[0]

In [None]:
sorted_df.iloc[0] # iloc selects by position: the first row after sorting

In [None]:
sorted_df.flipper_length_mm.iloc[0]
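The difference between `loc` and `iloc` is easiest to see on a table small enough to read at a glance. In this sketch the three values are made up; sorting moves the row labelled `1` to the top:

```python
import pandas as pd

df = pd.DataFrame({'bill_length_mm': [39.1, 32.1, 46.0]})  # labels 0, 1, 2
sorted_df = df.sort_values(by='bill_length_mm')            # labels now 1, 0, 2

by_label = sorted_df.loc[0, 'bill_length_mm']      # row labelled 0
by_position = sorted_df['bill_length_mm'].iloc[0]  # first row after sorting

print(sorted_df.index.tolist())
print(by_label, by_position)
```

Sorting reorders the rows but keeps their labels, so `loc[0]` still returns the row that was first before sorting.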

### Summary stats

In [None]:
penguins.describe()

In [None]:
penguins.bill_length_mm.mean()

In [None]:
penguins.bill_length_mm.count()

### Group

In [None]:
penguins.groupby('island')

### Aggregate

*Aggregating* produces a **single** value for each variable in each group:

Means for all numeric variables for each island:

In [None]:
penguins.groupby('island').aggregate("mean", numeric_only=True)

In [None]:
penguins.groupby('island').mean(numeric_only=True)

Means for `bill_length_mm` and `flipper_length_mm`:

In [None]:
penguins.groupby('island')[['bill_length_mm', 'flipper_length_mm']].mean()

Just for `flipper_length_mm`:

In [None]:
penguins.groupby('island').flipper_length_mm.mean()
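Different statistics for different columns can be requested in one call with named aggregation, where each keyword becomes an output column. A sketch on a small made-up table (`birds`, `mean_bill`, `max_flipper` and `n` are illustrative names):

```python
import pandas as pd

birds = pd.DataFrame({
    'island': ['Dream', 'Dream', 'Biscoe', 'Biscoe'],
    'bill_length_mm': [40.0, 50.0, 45.0, 47.0],
    'flipper_length_mm': [190.0, 200.0, 210.0, 220.0],
})

# One row per island; each keyword is (input column, aggregation function)
stats = birds.groupby('island').agg(
    mean_bill=('bill_length_mm', 'mean'),
    max_flipper=('flipper_length_mm', 'max'),
    n=('bill_length_mm', 'count'),
)

print(stats)
```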

### Transform

*Transforming* produces new columns with the *same length* as the input:

In [None]:
penguins.groupby('island')[['bill_length_mm', 'flipper_length_mm']].transform("mean")

In [None]:
def z_value(sr):
    return (sr - sr.mean()) / sr.std()

penguins.groupby('island')[['bill_length_mm', 'flipper_length_mm']].transform(z_value)
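Because the result keeps the length and index of the input, it can be assigned straight back as a new column. A minimal sketch on a made-up table (the column name `bill_minus_island_mean` is chosen for illustration):

```python
import pandas as pd

birds = pd.DataFrame({
    'island': ['Dream', 'Dream', 'Biscoe', 'Biscoe'],
    'bill_length_mm': [40.0, 50.0, 45.0, 47.0],
})

# Each row receives the mean of its own island
group_mean = birds.groupby('island')['bill_length_mm'].transform('mean')

# Same length and index as the input, so it lines up row by row
birds['bill_minus_island_mean'] = birds['bill_length_mm'] - group_mean

print(birds)
```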

### Apply

Flexible method allowing any operation on grouped data.

Return a single value:

In [None]:
def fun(df):
    # one number per island: mean of bill plus flipper length, relative to mean body mass
    return (df.bill_length_mm + df.flipper_length_mm).mean() / df.body_mass_g.mean()

penguins.groupby('island').apply(fun)  # .to_frame('my_stat')

Return a dataframe:

In [None]:
def fun(df):
    return pd.DataFrame({'sqrt_bill': np.sqrt(df.bill_length_mm),
                         'bill_squared': df.bill_length_mm**2})

penguins.groupby('island').apply(fun)
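The shape of the result follows from what the function returns: a scalar per group gives a Series, a dataframe per group gives a dataframe with the group key added to the index. The sketch below shows both on a made-up table; `bill_per_kg` and `heaviest` are hypothetical helpers. The columns are selected before `apply` so that the function only sees the data columns, which also sidesteps differences between pandas versions in how the grouping column is passed along.

```python
import pandas as pd

birds = pd.DataFrame({
    'island': ['Dream', 'Dream', 'Biscoe', 'Biscoe'],
    'bill_length_mm': [40.0, 50.0, 45.0, 47.0],
    'body_mass_g': [3000.0, 4000.0, 5000.0, 5000.0],
})
cols = ['bill_length_mm', 'body_mass_g']

def bill_per_kg(group):
    # Scalar per group: mean bill length per kilogram of mean body mass
    return group.bill_length_mm.mean() / (group.body_mass_g.mean() / 1000)

def heaviest(group):
    # DataFrame per group: the single heaviest row
    return group.nlargest(1, 'body_mass_g')

per_island = birds.groupby('island')[cols].apply(bill_per_kg)
top_rows = birds.groupby('island')[cols].apply(heaviest)

print(per_island)
print(top_rows)
```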