# Data Science: Data processing

Typical packages: `pandas`, `plotnine`, `plotly`, `streamlit`

**References**

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- [Python for Data Analysis, 2nd Edition](https://github.com/wesm/pydata-book)

## Introduction to `pandas`

In [1]:
import numpy as np
import pandas as pd

## Series and Data Frames

### Series objects

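A `Series` is like a vector: all elements share a single type, and missing values are stored as nulls (`NaN`). In the example below, the `None` forces the integers to be stored as `float64`.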

In [2]:
s = pd.Series([1,1,2,3] + [None])
s

0    1.0
1    1.0
2    2.0
3    3.0
4    NaN
dtype: float64

### Size

In [3]:
s.size

5

### Unique Counts

In [4]:
s.value_counts()

1.0    2
3.0    1
2.0    1
dtype: int64

### Special types of series

#### Strings

In [5]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1

0      the quick
1    quick brown
2      brown fox
3      fox jumps
4     jumps over
5       over the
6       the lazy
7       lazy dog
dtype: object

In [6]:
s1.str.upper()

0      THE QUICK
1    QUICK BROWN
2      BROWN FOX
3      FOX JUMPS
4     JUMPS OVER
5       OVER THE
6       THE LAZY
7       LAZY DOG
dtype: object

In [7]:
s1.str.split()

0      [the, quick]
1    [quick, brown]
2      [brown, fox]
3      [fox, jumps]
4     [jumps, over]
5       [over, the]
6       [the, lazy]
7       [lazy, dog]
dtype: object

In [8]:
s1.str.split()[1]

['quick', 'brown']

In [9]:
s1.str.split().str[1]

0    quick
1    brown
2      fox
3    jumps
4     over
5      the
6     lazy
7      dog
dtype: object

### Categories

In [10]:
s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])
s2

0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: object

In [11]:
s2 = s2.astype('category')
s2

0       Asian
1       Asian
2       White
3       Black
4       White
5    Hispanic
dtype: category
Categories (4, object): ['Asian', 'Black', 'Hispanic', 'White']

In [12]:
s2.cat.categories

Index(['Asian', 'Black', 'Hispanic', 'White'], dtype='object')

In [13]:
s2.cat.codes

0    0
1    0
2    3
3    1
4    3
5    2
dtype: int8

### Dates and times

Datetimes are often useful as indices to a time series.

In [14]:
import pendulum

In [15]:
d = pendulum.today()

In [16]:
d.to_date_string()

'2021-09-15'

In [17]:
k = 18
s3 = pd.Series(range(k), 
               index=pd.date_range(d.to_date_string(),
                                   periods=k, 
                                   freq='M'))

In [18]:
s3

2021-09-30     0
2021-10-31     1
2021-11-30     2
2021-12-31     3
2022-01-31     4
2022-02-28     5
2022-03-31     6
2022-04-30     7
2022-05-31     8
2022-06-30     9
2022-07-31    10
2022-08-31    11
2022-09-30    12
2022-10-31    13
2022-11-30    14
2022-12-31    15
2023-01-31    16
2023-02-28    17
Freq: M, dtype: int64

In [19]:
s3['2021']

2021-09-30    0
2021-10-31    1
2021-11-30    2
2021-12-31    3
Freq: M, dtype: int64

In [20]:
s3['2021-01':'2021-06']

Series([], Freq: M, dtype: int64)

To use datetime methods on a `Series` (rather than on an index), go through the `dt` accessor.

In [21]:
s4 = s3.index.to_series()

In [22]:
s4.dt.day_name()

2021-09-30     Thursday
2021-10-31       Sunday
2021-11-30      Tuesday
2021-12-31       Friday
2022-01-31       Monday
2022-02-28       Monday
2022-03-31     Thursday
2022-04-30     Saturday
2022-05-31      Tuesday
2022-06-30     Thursday
2022-07-31       Sunday
2022-08-31    Wednesday
2022-09-30       Friday
2022-10-31       Monday
2022-11-30    Wednesday
2022-12-31     Saturday
2023-01-31      Tuesday
2023-02-28      Tuesday
Freq: M, dtype: object

### DataFrame objects

A `DataFrame` is like a matrix. Columns in a `DataFrame` are `Series`.

- Each column in a DataFrame represents a **variable**
- Each row in a DataFrame represents an **observation**
- Each cell in a DataFrame represents a **value**

In [23]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df

Unnamed: 0,num
0,1.0
1,2.0
2,3.0
3,


In [24]:
df.num

0    1.0
1    2.0
2    3.0
3    NaN
Name: num, dtype: float64

### Index

Row and column identifiers are of `Index` type.

Somewhat confusingly, "index" is also a synonym for the row identifiers.

In [25]:
df.index

RangeIndex(start=0, stop=4, step=1)

#### Setting a column as the row index

In [26]:
df

Unnamed: 0,num
0,1.0
1,2.0
2,3.0
3,


In [27]:
df1 = df.set_index('num')
df1

num
1.0
2.0
3.0
NaN


#### Making an index into a column

In [28]:
df1.reset_index()

Unnamed: 0,num
0,1.0
1,2.0
2,3.0
3,


### Columns

The column labels are stored in another `Index` object.

In [29]:
df.columns

Index(['num'], dtype='object')

### Getting raw values

Sometimes you just want a `numpy` array, and not a `pandas` object.

In [30]:
df.values

array([[ 1.],
       [ 2.],
       [ 3.],
       [nan]])

## Creating Data Frames

### Manual

In [31]:
from collections import OrderedDict

In [32]:
n = 5
dates = pd.date_range(start='now', periods=n, freq='d')
df = pd.DataFrame(OrderedDict(pid=np.random.randint(100, 999, n), 
                              weight=np.random.normal(70, 20, n),
                              height=np.random.normal(170, 15, n),
                              date=dates,
                             ))
df

Unnamed: 0,pid,weight,height,date
0,263,50.398153,161.413359,2021-09-15 03:14:43.772058
1,353,54.710708,172.305744,2021-09-16 03:14:43.772058
2,498,74.319565,139.533164,2021-09-17 03:14:43.772058
3,384,69.405748,153.92614,2021-09-18 03:14:43.772058
4,910,57.587377,153.467541,2021-09-19 03:14:43.772058


### From file

You can read in data from many different file types - plain text, JSON, spreadsheets, databases etc. Functions to read in data look like `read_X` where X is the data type.

In [33]:
%%file measures.txt
pid	weight	height	date
328	72.654347	203.560866	2018-11-11 14:16:18.148411
756	34.027679	189.847316	2018-11-12 14:16:18.148411
185	28.501914	158.646074	2018-11-13 14:16:18.148411
507	17.396343	180.795993	2018-11-14 14:16:18.148411
919	64.724301	173.564725	2018-11-15 14:16:18.148411

Writing measures.txt


In [34]:
df = pd.read_table('measures.txt')
df

Unnamed: 0,pid,weight,height,date
0,328,72.654347,203.560866,2018-11-11 14:16:18.148411
1,756,34.027679,189.847316,2018-11-12 14:16:18.148411
2,185,28.501914,158.646074,2018-11-13 14:16:18.148411
3,507,17.396343,180.795993,2018-11-14 14:16:18.148411
4,919,64.724301,173.564725,2018-11-15 14:16:18.148411
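The `read_X` functions also accept file-like objects, which is handy for small experiments. The sketch below parses tab-separated text held in memory rather than a file on disk; the sample values and the name `measures` are made up for illustration, and `parse_dates` is used so the `date` column arrives as a datetime rather than a string.

```python
import io

import pandas as pd

# Hypothetical tab-separated data, standing in for a file on disk.
raw = (
    "pid\tweight\theight\tdate\n"
    "328\t72.65\t203.56\t2018-11-11\n"
    "756\t34.03\t189.85\t2018-11-12\n"
)

# read_csv accepts any file-like object; sep selects the delimiter
# and parse_dates converts the named column to datetime64 on load.
measures = pd.read_csv(io.StringIO(raw), sep="\t", parse_dates=["date"])

print(measures.dtypes)
```

Parsing dates at load time avoids a separate `pd.to_datetime` step later, at the cost of slightly slower reading.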


## Indexing Data Frames

### Implicit defaults

If you provide a slice, it is assumed that you are asking for rows.

In [35]:
df[1:3]

Unnamed: 0,pid,weight,height,date
1,756,34.027679,189.847316,2018-11-12 14:16:18.148411
2,185,28.501914,158.646074,2018-11-13 14:16:18.148411


If you provide a single value or list, it is assumed that you are asking for columns.

In [36]:
df[['pid', 'weight']]

Unnamed: 0,pid,weight
0,328,72.654347
1,756,34.027679
2,185,28.501914
3,507,17.396343
4,919,64.724301


### Extracting a column

#### Dictionary style access

In [37]:
df['pid']

0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: int64

#### Property style access

This only works for column names that are also valid Python identifiers (i.e., no spaces, dashes, or keywords).

In [38]:
df.pid

0    328
1    756
2    185
3    507
4    919
Name: pid, dtype: int64
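A small sketch of where property-style access breaks down, assuming a made-up frame `people` with one column name containing a space and another that collides with an existing `DataFrame` method. Bracket access works in both cases.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another
# ("count") collides with an existing DataFrame method.
people = pd.DataFrame({"body weight": [70.5, 82.0], "count": [3, 5]})

# Bracket access works for any column label.
weights = people["body weight"]
counts = people["count"]

# Attribute access finds the DataFrame.count method, not the column.
print(callable(people.count))
```

For this reason, bracket access is the safer default in code that has to handle arbitrary column names.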

### Indexing by location

This is similar to `numpy` indexing

In [39]:
df.iloc[1:3, :]

Unnamed: 0,pid,weight,height,date
1,756,34.027679,189.847316,2018-11-12 14:16:18.148411
2,185,28.501914,158.646074,2018-11-13 14:16:18.148411


In [40]:
df.iloc[1:3, 1:4:2]

Unnamed: 0,weight,date
1,34.027679,2018-11-12 14:16:18.148411
2,28.501914,2018-11-13 14:16:18.148411


### Indexing by name

In [41]:
df.loc[1:3, 'weight':'height']

Unnamed: 0,weight,height
1,34.027679,189.847316
2,28.501914,158.646074
3,17.396343,180.795993


**Warning**: When using `loc`, the row slice refers to row labels, not positions, and both endpoints are included.

In [42]:
df1 = df.copy()
df1.index = df.index + 1
df1

Unnamed: 0,pid,weight,height,date
1,328,72.654347,203.560866,2018-11-11 14:16:18.148411
2,756,34.027679,189.847316,2018-11-12 14:16:18.148411
3,185,28.501914,158.646074,2018-11-13 14:16:18.148411
4,507,17.396343,180.795993,2018-11-14 14:16:18.148411
5,919,64.724301,173.564725,2018-11-15 14:16:18.148411


In [43]:
df1.loc[1:3, 'weight':'height']

Unnamed: 0,weight,height
1,72.654347,203.560866
2,34.027679,189.847316
3,28.501914,158.646074
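The difference between position-based and label-based slicing can be seen side by side in a minimal sketch. The frame `shifted` is hypothetical; its row labels start at 1 so that labels and positions disagree.

```python
import pandas as pd

# Hypothetical frame whose row labels start at 1 instead of 0.
shifted = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[1, 2, 3, 4])

by_position = shifted.iloc[1:3]  # positions 1 and 2; end excluded
by_label = shifted.loc[1:3]      # labels 1, 2 and 3; end included

print(by_position["x"].tolist())
print(by_label["x"].tolist())
```

The same slice `1:3` selects two rows with `iloc` and three rows with `loc`.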


## Structure of a Data Frame

### Data types

In [44]:
df.dtypes

pid         int64
weight    float64
height    float64
date       object
dtype: object

### Converting data types

#### Using `astype` on one column

In [45]:
df.pid = df.pid.astype('category')

#### Using `astype` on multiple columns

In [46]:
df = df.astype(dict(weight=float, height=float))

#### Using a conversion function

In [47]:
df.date = pd.to_datetime(df.date)

#### Check

In [48]:
df.dtypes

pid             category
weight           float64
height           float64
date      datetime64[ns]
dtype: object

### Basic properties

In [49]:
df.size

20

In [50]:
df.shape

(5, 4)

In [51]:
df.describe()

Unnamed: 0,weight,height
count,5.0,5.0
mean,43.460917,181.282995
std,23.960945,16.895933
min,17.396343,158.646074
25%,28.501914,173.564725
50%,34.027679,180.795993
75%,64.724301,189.847316
max,72.654347,203.560866


In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   pid     5 non-null      category      
 1   weight  5 non-null      float64       
 2   height  5 non-null      float64       
 3   date    5 non-null      datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(2)
memory usage: 453.0 bytes


### Inspection

In [53]:
df.head(n=3)

Unnamed: 0,pid,weight,height,date
0,328,72.654347,203.560866,2018-11-11 14:16:18.148411
1,756,34.027679,189.847316,2018-11-12 14:16:18.148411
2,185,28.501914,158.646074,2018-11-13 14:16:18.148411


In [54]:
df.tail(n=3)

Unnamed: 0,pid,weight,height,date
2,185,28.501914,158.646074,2018-11-13 14:16:18.148411
3,507,17.396343,180.795993,2018-11-14 14:16:18.148411
4,919,64.724301,173.564725,2018-11-15 14:16:18.148411


In [55]:
df.sample(n=3)

Unnamed: 0,pid,weight,height,date
3,507,17.396343,180.795993,2018-11-14 14:16:18.148411
4,919,64.724301,173.564725,2018-11-15 14:16:18.148411
0,328,72.654347,203.560866,2018-11-11 14:16:18.148411


In [56]:
df.sample(frac=0.5)

Unnamed: 0,pid,weight,height,date
1,756,34.027679,189.847316,2018-11-12 14:16:18.148411
2,185,28.501914,158.646074,2018-11-13 14:16:18.148411


## Selecting, Renaming and Removing Columns

### Selecting columns

In [57]:
df.head(3)

Unnamed: 0,pid,weight,height,date
0,328,72.654347,203.560866,2018-11-11 14:16:18.148411
1,756,34.027679,189.847316,2018-11-12 14:16:18.148411
2,185,28.501914,158.646074,2018-11-13 14:16:18.148411


In [59]:
df.filter(items=['pid', 'date'])

Unnamed: 0,pid,date
0,328,2018-11-11 14:16:18.148411
1,756,2018-11-12 14:16:18.148411
2,185,2018-11-13 14:16:18.148411
3,507,2018-11-14 14:16:18.148411
4,919,2018-11-15 14:16:18.148411


In [60]:
df.filter(regex='.*ght')

Unnamed: 0,weight,height
0,72.654347,203.560866
1,34.027679,189.847316
2,28.501914,158.646074
3,17.396343,180.795993
4,64.724301,173.564725


### By data type

In [61]:
df.dtypes

pid             category
weight           float64
height           float64
date      datetime64[ns]
dtype: object

In [62]:
df.select_dtypes(include=np.number)

Unnamed: 0,weight,height
0,72.654347,203.560866
1,34.027679,189.847316
2,28.501914,158.646074
3,17.396343,180.795993
4,64.724301,173.564725


In [63]:
df.select_dtypes(exclude=['datetime64', 'category'])

Unnamed: 0,weight,height
0,72.654347,203.560866
1,34.027679,189.847316
2,28.501914,158.646074
3,17.396343,180.795993
4,64.724301,173.564725


#### Note that you can also use regular string methods on the columns

In [64]:
df.loc[:, df.columns.str.contains('d')]

Unnamed: 0,pid,date
0,328,2018-11-11 14:16:18.148411
1,756,2018-11-12 14:16:18.148411
2,185,2018-11-13 14:16:18.148411
3,507,2018-11-14 14:16:18.148411
4,919,2018-11-15 14:16:18.148411


### Renaming columns

In [65]:
df.rename(dict(weight='w', height='h'), axis=1)

Unnamed: 0,pid,w,h,date
0,328,72.654347,203.560866,2018-11-11 14:16:18.148411
1,756,34.027679,189.847316,2018-11-12 14:16:18.148411
2,185,28.501914,158.646074,2018-11-13 14:16:18.148411
3,507,17.396343,180.795993,2018-11-14 14:16:18.148411
4,919,64.724301,173.564725,2018-11-15 14:16:18.148411


In [66]:
orig_cols = df.columns 

In [67]:
df.columns = list('abcd')

In [68]:
df

Unnamed: 0,a,b,c,d
0,328,72.654347,203.560866,2018-11-11 14:16:18.148411
1,756,34.027679,189.847316,2018-11-12 14:16:18.148411
2,185,28.501914,158.646074,2018-11-13 14:16:18.148411
3,507,17.396343,180.795993,2018-11-14 14:16:18.148411
4,919,64.724301,173.564725,2018-11-15 14:16:18.148411


In [69]:
df.columns = orig_cols

In [70]:
df

Unnamed: 0,pid,weight,height,date
0,328,72.654347,203.560866,2018-11-11 14:16:18.148411
1,756,34.027679,189.847316,2018-11-12 14:16:18.148411
2,185,28.501914,158.646074,2018-11-13 14:16:18.148411
3,507,17.396343,180.795993,2018-11-14 14:16:18.148411
4,919,64.724301,173.564725,2018-11-15 14:16:18.148411


### Removing columns

In [71]:
df.drop(['pid', 'date'], axis=1)

Unnamed: 0,weight,height
0,72.654347,203.560866
1,34.027679,189.847316
2,28.501914,158.646074
3,17.396343,180.795993
4,64.724301,173.564725


In [72]:
df.drop(columns=['pid', 'date'])

Unnamed: 0,weight,height
0,72.654347,203.560866
1,34.027679,189.847316
2,28.501914,158.646074
3,17.396343,180.795993
4,64.724301,173.564725


In [73]:
df.drop(columns=df.columns[df.columns.str.contains('d')])

Unnamed: 0,weight,height
0,72.654347,203.560866
1,34.027679,189.847316
2,28.501914,158.646074
3,17.396343,180.795993
4,64.724301,173.564725


#### You can also use regular indexing

In [74]:
df.loc[:, ~df.columns.str.contains('d')]

Unnamed: 0,weight,height
0,72.654347,203.560866
1,34.027679,189.847316
2,28.501914,158.646074
3,17.396343,180.795993
4,64.724301,173.564725


## Selecting, Renaming and Removing Rows

### Selecting rows

In [75]:
df[df.weight.between(60,70)]

Unnamed: 0,pid,weight,height,date
4,919,64.724301,173.564725,2018-11-15 14:16:18.148411


In [76]:
df[(df.weight >= 60) & (df.weight <= 70)]

Unnamed: 0,pid,weight,height,date
4,919,64.724301,173.564725,2018-11-15 14:16:18.148411


In [None]:
df.query('60 <= weight <= 70')

In [None]:
df[df.date.between(pd.to_datetime('2018-11-13'), 
                   pd.to_datetime('2018-11-15 23:59:59'))]

### Renaming rows

In [None]:
df.rename({i:letter for i,letter in enumerate('abcde')})

In [None]:
df.index = ['the', 'quick', 'brown', 'fox', 'jumps']

In [None]:
df

In [None]:
df = df.reset_index(drop=True)

In [None]:
df

### Dropping rows

In [None]:
df.drop([1,3], axis=0)

In [None]:
df[::2]

#### Dropping duplicated data

In [None]:
df['something'] = [1,1,None,2,None]

In [None]:
df

In [None]:
df.loc[df.something.duplicated()]

In [None]:
df.drop_duplicates(subset='something')

#### Filling or dropping missing data

In [None]:
df

In [None]:
df.something.fillna(0)

In [None]:
df.something.ffill()

In [None]:
df.something.bfill()

In [None]:
df.something.interpolate()

In [None]:
df.dropna()

## Transforming and Creating Columns

In [None]:
df.assign(bmi=df['weight'] / (df['height']/100)**2)

In [None]:
df['bmi'] = df['weight'] / (df['height']/100)**2

In [None]:
df

In [None]:
df['something'] = [2,2,None,None,3]

In [None]:
df

## Sorting Data Frames

### Sort on indexes

In [None]:
df.sort_index(axis=1)

In [None]:
df.sort_index(axis=0, ascending=False)

### Sort on values

In [None]:
df.sort_values(by=['something', 'bmi'], ascending=[True, False])

## Summarizing

### Apply an aggregation function

In [None]:
df.select_dtypes(include=np.number)

In [None]:
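df.select_dtypes(include=np.number).agg('sum')

In [None]:
# Restrict to numeric columns: sum and mean are not defined for the category column
df.select_dtypes(include=np.number).agg(['count', 'sum', 'mean'])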

In [None]:
df.agg({'weight': ['median', 'sum'], 'height': ['min', 'max']})

## Split-Apply-Combine

We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with `groupby` followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.
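The three steps described above can be spelled out by hand and compared with `groupby`. This is only an illustrative sketch on a made-up frame `trial`; in practice `groupby` does the same work in one call.

```python
import pandas as pd

# Hypothetical data: a grouping column and one measurement.
trial = pd.DataFrame({
    "treatment": ["a", "b", "a", "b", "a"],
    "weight": [60.0, 70.0, 80.0, 100.0, 100.0],
})

# Split: one sub-frame per treatment level.
pieces = {level: trial[trial["treatment"] == level]
          for level in trial["treatment"].unique()}

# Apply: reduce each piece to a single number.
means = {level: piece["weight"].mean() for level, piece in pieces.items()}

# Combine: gather the per-group results into one Series.
by_hand = pd.Series(means, name="weight").sort_index()

# groupby performs the same three steps in one call.
with_groupby = trial.groupby("treatment")["weight"].mean()

print(by_hand.tolist())
print(with_groupby.tolist())
```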

In [None]:
df['treatment'] = list('ababa')

In [None]:
df

In [None]:
grouped = df.groupby('treatment')

In [None]:
grouped.get_group('a')

In [None]:
# numeric_only=True skips columns (such as the category `pid`) that have no mean
grouped.mean(numeric_only=True)

### Using `agg` with `groupby`

In [None]:
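# Select the numeric columns first; 'mean' is not defined for the category column
grouped[['weight', 'height', 'bmi']].agg('mean')

In [None]:
grouped[['weight', 'height', 'bmi']].agg(['mean', 'std'])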

In [None]:
grouped.agg({'weight': ['mean', 'std'], 
             'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})

### Using `transform` with `groupby`

When you apply a transform to a grouped object, the result has the same shape as the input: the group-level value is repeated for every row in that group, so all rows are represented.

In [None]:
g_mean = grouped[['weight', 'height']].transform('mean')
g_mean

In [None]:
# 'std' is the sample standard deviation (ddof=1)
g_std = grouped[['weight', 'height']].transform('std')
g_std

In [None]:
(df[['weight', 'height']] - g_mean)/g_std
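To make the contrast between `agg` and `transform` concrete, here is a minimal sketch on a made-up frame `scores_by_group`: `agg` returns one row per group, while `transform` returns one row per original observation.

```python
import pandas as pd

scores_by_group = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 30.0],
})
grouped_value = scores_by_group.groupby("group")["value"]

# agg collapses each group to one row.
per_group = grouped_value.agg("mean")

# transform broadcasts the group result back to every original row.
per_row = grouped_value.transform("mean")

# Because per_row is aligned with the original frame, centering is direct.
centered = scores_by_group["value"] - per_row

print(len(per_group), len(per_row))
print(centered.tolist())
```

This alignment is what makes within-group standardization a one-line operation.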

## Combining Data Frames

In [None]:
df

In [None]:
df1 =  df.iloc[3:].copy()

In [None]:
df1.drop('something', axis=1, inplace=True)
df1

### Adding rows

Note that `pandas` aligns by column indexes automatically.
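A short sketch of that alignment, using two hypothetical frames `top` and `bottom` whose columns are in a different order and only partly overlap. Columns are matched by name, and cells with no source value become `NaN`.

```python
import pandas as pd

# Hypothetical frames with the same columns in a different order,
# plus one column that exists only in the first frame.
top = pd.DataFrame({"pid": [1, 2], "weight": [60.0, 70.0], "note": ["x", "y"]})
bottom = pd.DataFrame({"weight": [80.0], "pid": [3]})

# Columns are matched by name, not position; missing cells become NaN.
# ignore_index=True renumbers the rows 0..n-1.
stacked = pd.concat([top, bottom], ignore_index=True)

print(stacked)
```

Without `ignore_index=True` the original row labels are kept, which can leave duplicate labels in the result.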

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported way to stack rows
pd.concat([df, df1], sort=False)

### Adding columns

In [None]:
df.pid

In [None]:
df2 = pd.DataFrame(OrderedDict(pid=[649, 533, 400, 600], age=[23,34,45,56]))

In [None]:
df2.pid

In [None]:
df.pid = df.pid.astype('int')

In [None]:
pd.merge(df, df2, on='pid', how='inner')

In [None]:
pd.merge(df, df2, on='pid', how='left')

In [None]:
pd.merge(df, df2, on='pid', how='right')

In [None]:
pd.merge(df, df2, on='pid', how='outer')

### Merging on the index

In [None]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))

In [None]:
df1

In [None]:
df2

In [None]:
df3

In [None]:
df1.join([df2, df3])

## Fixing common DataFrame issues

### Multiple variables in a column

In [None]:
df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))
df

In [None]:
df.pid_treat.str.split('-')

In [None]:
df.pid_treat.str.split('-', expand=True)

### Multiple values in a cell

In [None]:
df = pd.DataFrame(dict(pid=['a', 'b', 'c'], 
                       vals = [(1,2,3), (4,5,6), (7,8,9)]))
df

#### Easy way

In [None]:
df.explode('vals')

#### Hard way

In [None]:
df[['t1', 't2', 't3']]  = df.vals.apply(pd.Series)
df

In [None]:
df.drop('vals', axis=1, inplace=True)

In [None]:
pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)

## Reshaping Data Frames

Sometimes we need to make rows into columns or vice versa.

### Converting multiple columns into a single column

This is often useful if you need to condition on some variable.

In [None]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)

In [None]:
iris.head()

In [None]:
iris.shape

In [None]:
df_iris = pd.melt(iris, id_vars='species')

In [None]:
df_iris.sample(10)

## Pivoting

Sometimes we need to convert categorical values in a column into separate columns. This is often done at the same time as performing a summary.

In [None]:
df_iris.pivot_table(index='variable', columns='species', values='value', aggfunc='mean')
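Melting and pivoting are roughly inverse operations. The sketch below takes a small made-up wide table (`wide`, with hypothetical `visit1` and `visit2` columns) to long form and back again.

```python
import pandas as pd

# Hypothetical wide table: one row per subject, one column per visit.
wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "visit1": [5.0, 7.0],
    "visit2": [6.0, 9.0],
})

# Wide to long: the visit columns become (variable, value) pairs.
long = pd.melt(wide, id_vars="subject", var_name="visit", value_name="score")

# Long to wide: each distinct visit becomes a column again.
# aggfunc only matters when a (subject, visit) pair occurs more than once.
back = long.pivot_table(index="subject", columns="visit",
                        values="score", aggfunc="mean")

print(long.shape)
print(back.loc["s2", "visit2"])
```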

## Functional style - `apply`, `applymap` and `map`

`apply` can be used to apply a custom function

In [None]:
scores = pd.DataFrame(
    np.around(np.clip(np.random.normal(90, 10, (5,3)), 0, 100), 1),
    columns = ['math', 'stat', 'biol'],
    index = ['anne', 'bob', 'charles', 'dirk', 'edgar']
)

In [None]:
scores

In [None]:
def convert_grade_1(score):
    return np.where(score > 90, 'A', 
                    np.where(score > 80, 'B',
                            np.where(score > 70, 'C', 'F')))

In [None]:
scores.apply(convert_grade_1)

The `np.where` is a little clumsy - here is an alternative.

In [None]:
def convert_grade_2(score):
    if score.name == 'math': # math professors are mean
        return np.choose(
            pd.cut(score, [-1, 80, 90, 95, 100], labels=False),
            ['F', 'C', 'B', 'A']
        )    
    else:
        return np.choose(
            pd.cut(score, [-1, 70, 80, 90, 100], labels=False),
            ['F', 'C', 'B', 'A']
        )

In [None]:
scores.apply(convert_grade_2)

`apply` can be used to avoid explicit looping

In [None]:
def likely_profession(row):
    if (row.biol > row.math) and (row.biol > row.stat):
        return 'farmer'
    elif (row.math > row.biol) and (row.math > row.stat):
        return 'high school teacher'
    elif (row.stat > row.math) and (row.stat > row.biol):
        return 'actuary'
    else:
        return 'doctor'

In [None]:
scores.apply(likely_profession, axis=1)

If all else fails, you can loop over `pandas` data frames.

- Be prepared for pitying looks from more snobbish Python coders

Loops are frowned upon because they are not efficient, but sometimes pragmatism beats elegance.

In [None]:
for idx, row in scores.iterrows():
    print(f'\nidx = {idx}\nrow = {row.index}: {row.values}\n', 
          end='-'*30)
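For comparison, here is a sketch of the same per-row computation written as an explicit loop and as a vectorized expression. The table `marks` and the "range of marks" calculation are made up for illustration; both approaches give the same answer, and the vectorized form is usually the faster one on large frames.

```python
import pandas as pd

# Hypothetical marks table, standing in for `scores` above.
marks = pd.DataFrame(
    {"math": [90.0, 70.0], "stat": [80.0, 60.0], "biol": [85.0, 95.0]},
    index=["anne", "bob"],
)

# Explicit loop: build the result one row at a time.
looped = {}
for name, row in marks.iterrows():
    looped[name] = row.max() - row.min()
looped = pd.Series(looped)

# Vectorized: the same computation expressed on whole columns at once.
vectorized = marks.max(axis=1) - marks.min(axis=1)

print(looped.tolist())
print(vectorized.tolist())
```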

`apply` can be used for reductions along margins

In [None]:
df = pd.DataFrame(np.random.randint(0, 10, (4,5)), 
                  columns=list('abcde'), 
                  index=list('wxyz'))

In [None]:
df

In [None]:
df.apply(sum, axis=0)

In [None]:
df.apply(sum, axis=1)

#### For element-wise mapping operations

In [None]:
import string

In [None]:
char_map = {i: c for i,c in enumerate(string.ascii_uppercase)}

In [None]:
# pandas 2.1+ provides DataFrame.map as the replacement; applymap is deprecated there
df.applymap(lambda x: char_map[x])

#### For mapping a series

In [None]:
df.assign(b_map = df.b.map(char_map))

## Chaining commands

Sometimes you see this functional style of method chaining that avoids the need for temporary intermediate variables.

Chaining requires the use of a `DataFrame` method. If no appropriate method exists, you can use `pipe` to create a custom one.

In [None]:
(
    iris.
    sample(frac=0.2).
    filter(regex='s.*').
    assign(both=iris.sepal_length + iris.petal_length).
    query('both > 2').
    groupby('species').agg(['mean', 'sum']).
    pipe(lambda x: np.around(x, 1))
)
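When a step has no built-in method, `pipe` also accepts a named function with extra arguments, which tends to read better than a lambda. A minimal sketch, assuming a hypothetical helper `add_ratio` and a made-up frame `flowers`:

```python
import pandas as pd


def add_ratio(frame, numerator, denominator, name="ratio"):
    """Return a copy of frame with a new column numerator / denominator."""
    return frame.assign(**{name: frame[numerator] / frame[denominator]})


# Hypothetical measurements; column names are illustrative only.
flowers = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "length": [4.0, 6.0, 9.0],
    "width": [2.0, 2.0, 3.0],
})

summary = (
    flowers
    .pipe(add_ratio, "length", "width")   # custom step via pipe
    .groupby("species")["ratio"]
    .mean()
)

print(summary.to_dict())
```

Because `pipe` passes the frame as the first argument, any function of the form `f(frame, ...)` can be slotted into a chain.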

## Moving between R and Python in Jupyter

In [None]:
%load_ext rpy2.ipython

In [None]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

In [None]:
iris = %R iris

In [None]:
iris.head()

In [None]:
iris_py = iris.copy()
iris_py.Species = iris_py.Species.str.upper()

In [None]:
%%R -i iris_py -o iris_r

iris_r <- iris_py[1:3,]

In [None]:
iris_r

In [None]:
! python3 -m pip install --quiet watermark

In [None]:
%load_ext watermark

In [None]:
%watermark -v -iv