# Pandas

This notebook contains examples of using Pandas to perform exploratory data analysis (EDA) on tabular data.

In [1]:
# Only include this if you don't already have the pandas module installed. You can also do this on the command line.
%pip install pandas 

import pandas as pd 

mouseDF = pd.read_csv('../data/mousew.csv')

Note: you may need to restart the kernel to use updated packages.


In [2]:
mouseDF

Unnamed: 0,strain,sex,weight
0,C57Bl/6J,M,27.596753
1,C57Bl/6J,M,29.635837
2,C57Bl/6J,M,28.460685
3,C57Bl/6J,M,29.248225
4,C57Bl/6J,M,25.702118
...,...,...,...
395,DBA/2J,F,20.489587
396,DBA/2J,F,21.512789
397,DBA/2J,F,18.677338
398,DBA/2J,F,22.974729


In [2]:
mouseDF = pd.read_csv('../data/mousew.csv.gz')
mouseDF

Unnamed: 0,strain,sex,weight
0,C57Bl/6J,M,27.596753
1,C57Bl/6J,M,29.635837
2,C57Bl/6J,M,28.460685
3,C57Bl/6J,M,29.248225
4,C57Bl/6J,M,25.702118
...,...,...,...
395,DBA/2J,F,20.489587
396,DBA/2J,F,21.512789
397,DBA/2J,F,18.677338
398,DBA/2J,F,22.974729


## Writing data out

We can write a DataFrame to many formats. Here are some popular ones:
  * CSV (comma separated values)
  * TSV (tab separated values)
  * JSON (JavaScript Object Notation)
  * Excel

In [None]:
# If you happen to be writing to or reading from an Excel (.xlsx or .xls) file (like we are in this cell), 
# you'll need to install the Python library openpyxl. Use the magic pip command for this:
%pip install openpyxl

# CSV (comma separated values):
mouseDF.to_csv('../data/mousew2.csv')

# TSV (tab separated values):
mouseDF.to_csv('../data/mousew.tsv', sep='\t')

# JSON (JavaScript Object Notation). By default, to_json() writes one entry per column, mapping each
# index label to that row's value; pass orient='records' to list each row as a dictionary instead.
mouseDF.to_json('../data/mousew.json')

# Excel:
mouseDF.to_excel('../data/mousew.xlsx')

# By default, the .to_*() calls will add an Index column; you can prevent that column from being 
# output by passing in an index=False argument:
mouseDF.to_csv('../data/mousew-noindex.csv', index=False)
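
The layout of the JSON output depends on the `orient` argument of `to_json()`. The sketch below compares the default orientation with `orient='records'`. It uses a small made-up DataFrame rather than the mouse data file, and the variable names are illustrative only.

```python
import json
import pandas as pd

# Hypothetical two-row stand-in for mouseDF.
df = pd.DataFrame({'strain': ['C57Bl/6J', 'DBA/2J'], 'weight': [27.5, 20.5]})

# Default orientation: one entry per column, mapping index label -> value.
by_column = json.loads(df.to_json())

# orient='records': one dictionary per row, with column names as keys.
by_row = json.loads(df.to_json(orient='records'))

print(by_column)
print(by_row)
```

The `records` layout is often easier to consume from other programs, but it does not store the index.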




## Reading data in

When we read data in, we use the pandas module directly (which we import as `pd` by convention). This is different from writing data out, where we call the `.to_*()` method on the DataFrame we want to save.

In [4]:
# TSV
df = pd.read_table('../data/mousew.tsv')
df
# # CSV
# df = pd.read_csv('../data/mousew.csv')
# # ***OR****
# df = pd.read_table('../data/mousew.csv', delimiter=',')

# # JSON
# df = pd.read_json('../data/mousew.json')

# # Excel
# df = pd.read_excel('../data/mousew.xlsx')

# # If you have a large file, pass in an argument for the `nrows` parameter,
# # Here we're only reading in the first 100 rows:
# df = pd.read_csv('../data/mousew.csv', nrows=100)

Unnamed: 0.1,Unnamed: 0,strain,sex,weight
0,0,C57Bl/6J,M,27.596753
1,1,C57Bl/6J,M,29.635837
2,2,C57Bl/6J,M,28.460685
3,3,C57Bl/6J,M,29.248225
4,4,C57Bl/6J,M,25.702118
...,...,...,...,...
395,395,DBA/2J,F,20.489587
396,396,DBA/2J,F,21.512789
397,397,DBA/2J,F,18.677338
398,398,DBA/2J,F,22.974729


In [8]:

mouseDF[(mouseDF['weight'] >= 28) & (mouseDF['weight'] < 29)].to_csv('../data/mousew-28oz.csv.gz')

In [9]:
mouseDF28Only = pd.read_csv('../data/mousew-28oz.csv.gz')
mouseDF28Only

Unnamed: 0.1,Unnamed: 0,strain,sex,weight
0,2,C57Bl/6J,M,28.460685
1,17,C57Bl/6J,M,28.131761
2,18,C57Bl/6J,M,28.724828
3,21,C57Bl/6J,M,28.393142
4,28,C57Bl/6J,M,28.248257
5,30,C57Bl/6J,M,28.431916
6,32,C57Bl/6J,M,28.900976
7,40,C57Bl/6J,M,28.227189
8,41,C57Bl/6J,M,28.427392
9,48,C57Bl/6J,M,28.531084


In [11]:
mouseDF28Only = pd.read_csv('../data/mousew-28oz.csv.gz', index_col=0)
mouseDF28Only

Unnamed: 0,strain,sex,weight
2,C57Bl/6J,M,28.460685
17,C57Bl/6J,M,28.131761
18,C57Bl/6J,M,28.724828
21,C57Bl/6J,M,28.393142
28,C57Bl/6J,M,28.248257
30,C57Bl/6J,M,28.431916
32,C57Bl/6J,M,28.900976
40,C57Bl/6J,M,28.227189
41,C57Bl/6J,M,28.427392
48,C57Bl/6J,M,28.531084
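
The extra `Unnamed: 0` column appears because `to_csv()` writes the index as an unnamed first column, and `read_csv()` then treats it as ordinary data unless told otherwise. The sketch below reproduces this round trip in memory with `io.StringIO`, so no files are needed. The tiny DataFrame is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical filtered data whose index labels (2 and 17) are not 0, 1, ...
df = pd.DataFrame({'strain': ['C57Bl/6J', 'C57Bl/6J'], 'weight': [28.46, 28.13]},
                  index=[2, 17])

# Write to an in-memory text buffer; the index is written as an unnamed first column.
buffer = io.StringIO()
df.to_csv(buffer)

# Reading it back naively turns the old index into a data column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 tells pandas to use the first column as the index again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.index))
```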


## Creating DataFrames from a dictionary or set of lists

In [12]:
# Creating a DataFrame from a list of dictionaries:
rawData = [
    {
        'strain': 'C57BI/6J',
        'sex': 'M',
        'weight': 27.596753
    },
    {
        'strain': 'C57BI/6J',
        'sex': 'M',
        'weight': 29.248225
    },
    # ...
]
data = pd.DataFrame(rawData)
data

Unnamed: 0,strain,sex,weight
0,C57BI/6J,M,27.596753
1,C57BI/6J,M,29.248225


In [13]:
# Creating a DataFrame from a list of lists + column names:
rawData = [
    ['C57BI/6J', 'M', 27.596753],
    ['C57BI/6J', 'M', 29.248225]
]

data = pd.DataFrame(rawData, columns=['strain', 'sex', 'weight'])
data

Unnamed: 0,strain,sex,weight
0,C57BI/6J,M,27.596753
1,C57BI/6J,M,29.248225
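
A third option is a dictionary of lists, where each key is a column name and each list holds that column's values. This is a minimal sketch with the same two illustrative rows; the name `columnData` is made up for this example.

```python
import pandas as pd

# Each key becomes a column; all lists must have the same length.
columnData = {
    'strain': ['C57BI/6J', 'C57BI/6J'],
    'sex': ['M', 'M'],
    'weight': [27.596753, 29.248225],
}

data = pd.DataFrame(columnData)
print(data)
```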


## Basic pandas—examining DataFrames

Some of the things we might want to know about a DataFrame are its size, its columns, and summary statistics for its values.


In [14]:
# How many rows?
# Finding the length (number of keys in) a dictionary:
# x = {'key1': 1, 'key2': 2, 'key3': 3, 'key4': 4}
# len(x.keys())

# TODO
len(mouseDF)

400

In [16]:
# How many columns?

# TODO
len(mouseDF.columns)

3

In [17]:
# We can ask for the shape of a DataFrame, which gives us a tuple with these values:

# TODO
mouseDF.shape

(400, 3)

In [18]:
# Get the number of rows using mouseDF.shape:

# TODO
mouseDF.shape[0]

400

In [19]:
# Select the weight column, which is a Series object.
# Preferred way:

# TODO
mouseDF['weight']

0      27.596753
1      29.635837
2      28.460685
3      29.248225
4      25.702118
         ...    
395    20.489587
396    21.512789
397    18.677338
398    22.974729
399    19.312054
Name: weight, Length: 400, dtype: float64

In [20]:
mouseDF28Only['weight']

2      28.460685
17     28.131761
18     28.724828
21     28.393142
28     28.248257
30     28.431916
32     28.900976
40     28.227189
41     28.427392
48     28.531084
61     28.212117
63     28.148897
65     28.539025
69     28.807248
71     28.999457
77     28.984005
83     28.276477
85     28.281818
87     28.535487
88     28.038621
95     28.456805
96     28.181539
98     28.059095
203    28.852402
211    28.058833
218    28.572702
231    28.622008
234    28.187609
236    28.027747
240    28.890028
242    28.380888
249    28.726820
260    28.353228
264    28.352875
274    28.142244
277    28.754204
278    28.136773
279    28.863385
286    28.886780
291    28.082726
292    28.988429
Name: weight, dtype: float64

In [21]:
# Dot notation -- not preferred because:
#   - it only works with column names that are valid Python identifiers (no spaces, starts with alpha+_, only contains alphanumeric+_)
#   - since it can't be used everywhere, when you do use it, it'll cause inconsistencies
#   - it's non-programmatic (you can't access a column whose name is stored in a variable this way)

# TODO
mouseDF.weight

0      27.596753
1      29.635837
2      28.460685
3      29.248225
4      25.702118
         ...    
395    20.489587
396    21.512789
397    18.677338
398    22.974729
399    19.312054
Name: weight, Length: 400, dtype: float64

In [None]:
# See just the first 10 rows (if you omit the argument, the default is 5)

# TODO


In [None]:
# See just the last 10 rows (or 5 if the argument is omitted)

# TODO
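
One possible way to approach the two cells above is with `head()` and `tail()`. The sketch below uses a made-up 20-row DataFrame so that it runs without the data file.

```python
import pandas as pd

# Hypothetical 20-row DataFrame with weights 20.0 through 39.0.
df = pd.DataFrame({'weight': [float(w) for w in range(20, 40)]})

first_ten = df.head(10)  # first 10 rows
last_five = df.tail()    # no argument, so the last 5 rows

print(first_ten)
print(last_five)
```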

In [None]:
# Get basic stats about each numeric column.

# TODO

In [None]:
# Summarize all of the numeric columns of a DataFrame:

# TODO

In [None]:
# Sum every column -- Question: what looks amiss about the result?

# TODO
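
As a hedged sketch of the summary cells above: `describe()` reports count, mean, standard deviation, minimum, quartiles, and maximum for the numeric columns, and `sum(numeric_only=True)` restricts the sum to numeric columns. The four-row DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical data with one text column and one numeric column.
df = pd.DataFrame({'sex': ['M', 'M', 'F', 'F'],
                   'weight': [20.0, 22.0, 28.0, 30.0]})

# By default, describe() summarizes only the numeric columns.
stats = df.describe()

# numeric_only=True leaves the text column out of the sum.
totals = df.sum(numeric_only=True)

print(stats)
print(totals)
```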

In [None]:
# Also, min, max, count, mean:

print('MIN:\n', mouseDF.min())
print('\nMAX:\n', mouseDF.max())
print('\nCOUNT:\n', mouseDF.count())
# numeric_only=True skips the text columns, which cannot be averaged:
print('\nMEAN:\n', mouseDF.mean(numeric_only=True))


## Selecting data in pandas

In [None]:
# Selecting columns:
mouseDF['weight'] # --> returns a Series (a single column)

mouseDF[['strain', 'weight']] # --> returns a DataFrame (two or more columns)

In [None]:
# Selecting columns:
mouseDF['weight']
# Selecting rows:
mouseDF[0:2] # --> returns a DataFrame containing row indexes 0 and 1 -- this is just like list slicing

In [None]:
# Selecting both (an intersection of row and column):
mouseDF['weight'][1] # --> the second row of the 'weight' column
# mouseDF['weight'] -> returns a series
# (^^^)[1] -- says "get index 1 from the series"

In [None]:
# A second way of writing the same thing using .loc, which selects by label (row label, column name):
mouseDF.loc[1, 'weight']

In [None]:
# A third way using .iloc, which selects by integer position:
mouseDF.iloc[1, 2] # --> the 1 means "row position 1" and the 2 means "column position 2"
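
The difference between `.loc` and `.iloc` only shows up when the index labels are not simply 0, 1, 2, and so on, as in a filtered DataFrame like `mouseDF28Only`. The sketch below uses a made-up three-row DataFrame with index labels 2, 17, and 18.

```python
import pandas as pd

# Hypothetical filtered data: the index labels are 2, 17, 18, not 0, 1, 2.
df = pd.DataFrame({'sex': ['M', 'M', 'M'],
                   'weight': [28.46, 28.13, 28.72]},
                  index=[2, 17, 18])

# .loc uses labels: the row labelled 17, the column named 'weight'.
by_label = df.loc[17, 'weight']

# .iloc uses positions: the second row, the second column.
by_position = df.iloc[1, 1]

# There is no row labelled 1, so .loc raises a KeyError for it.
try:
    df.loc[1, 'weight']
    label_one_exists = True
except KeyError:
    label_one_exists = False

print(by_label, by_position, label_one_exists)
```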

## Selecting rows based on criteria (filtering)

In [None]:
# How would you find all mice with weights greater than 25?

# TODO

# NOTE: 171 rows were returned out of the 400 mice.

In [None]:
# Let's break it down here.

# This creates a boolean Series (a mask) in which True marks each row where the
# condition (weight > 25) holds, and False marks every other row.
mouseDF['weight'] > 25

# Passing this mask into the square brackets tells the DataFrame to return only the rows
# whose entry in the mask is True.
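
Here is a minimal sketch of the two steps, building the boolean mask and then using it to filter. The four-row DataFrame is made up, so the row counts differ from the real mouse data.

```python
import pandas as pd

# Hypothetical stand-in for mouseDF.
df = pd.DataFrame({'strain': ['C57Bl/6J', 'C57Bl/6J', 'DBA/2J', 'C57Bl/6J'],
                   'weight': [27.5, 24.0, 25.0, 29.1]})

# Step 1: a boolean Series, True where the condition holds.
mask = df['weight'] > 25

# Step 2: keep only the rows where the mask is True.
heavy = df[mask]

print(mask.tolist())
print(heavy)
```

Note that a weight of exactly 25 is excluded because the comparison is strict.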

In [None]:
# Let's try a trickier one...select all mice who are male AND have a weight *below* 22.

# TODO

# We need to use the element-wise AND operator (a single ampersand) to combine the boolean masks from the
# two subexpressions. Both subexpressions must be wrapped in parentheses because & binds more tightly
# than comparison operators such as == and <; without them, the expression raises an error or gives
# the wrong result. To OR two masks, use | (a single pipe).
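
One possible way to combine two conditions is sketched below on a made-up four-row DataFrame. Each comparison is wrapped in its own parentheses before being combined with `&` or `|`.

```python
import pandas as pd

# Hypothetical stand-in for mouseDF.
df = pd.DataFrame({'sex': ['M', 'M', 'F', 'F'],
                   'weight': [21.0, 27.5, 20.5, 23.0]})

# AND: male and lighter than 22.
light_males = df[(df['sex'] == 'M') & (df['weight'] < 22)]

# OR: male, or lighter than 22 (or both).
male_or_light = df[(df['sex'] == 'M') | (df['weight'] < 22)]

print(light_males)
print(male_or_light)
```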

In [None]:
# Find missing weights (there is no missing data in mouseDF, so this isn't interesting here):
mouseDF[mouseDF['weight'].isna()]


In [None]:
# Find rows where at least one column value is missing:
mouseDF[ mouseDF.isna().any(axis=1) ]

# The `axis=1` tells the any() method to aggregate across the columns in each row.
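
Because `mouseDF` has no missing values, the two cells above return empty results. The sketch below runs the same expressions on a made-up DataFrame that does contain gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is missing 'sex', row 2 is missing 'weight'.
df = pd.DataFrame({'sex': ['M', None, 'F'],
                   'weight': [27.5, 21.0, np.nan]})

# Rows where the weight is missing.
missing_weight = df[df['weight'].isna()]

# Rows where at least one column is missing.
any_missing = df[df.isna().any(axis=1)]

print(missing_weight)
print(any_missing)
```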

In [None]:
# Find if a column value is in a list -- not that interesting here, as we only have two 
# strains and two sexes.

mouseDF[mouseDF['strain'].isin(['DBA/2J', 'DBA/2M'])]

## Removing data

In [None]:
# For the axis parameter -- 1 = column, 0 = row
mouseDF.drop('sex', axis=1) # This returns a new DataFrame without the 'sex' column

In [None]:
# The original data is unaltered, though:
mouseDF

In [None]:
# You can also remove data in place. 
# Here, I'm making a copy of mouseDF first so I can use the original data later.
mouseDF2 = mouseDF.copy()
mouseDF2.drop('sex', axis=1, inplace=True)
mouseDF2
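
A small sketch of the same idea on made-up data: `drop()` returns a new DataFrame and leaves the original untouched. The `columns=` keyword shown here is an alternative spelling of `axis=1`.

```python
import pandas as pd

# Hypothetical stand-in for mouseDF.
df = pd.DataFrame({'strain': ['C57Bl/6J', 'DBA/2J'],
                   'sex': ['M', 'F'],
                   'weight': [27.5, 20.5]})

# Both calls return a new DataFrame without the 'sex' column.
without_sex = df.drop('sex', axis=1)
same_thing = df.drop(columns='sex')

print(list(df.columns))           # the original still has 'sex'
print(list(without_sex.columns))
```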

In [None]:
# You can drop just missing data. We will discuss dealing with missing data in 
# more detail later -- this is but one option, and is not necessarily the best option.

# Drop any row with one or more missing values:
mouseDF.dropna() 
# Same as:
mouseDF.dropna(how='any', axis=0, inplace=False)

# Drop any row missing all values.
mouseDF.dropna(how='all', axis=0, inplace=False)

# Drop any row missing all values in the columns 'sex' and 'weight'
mouseDF.dropna(how='all', axis=0, inplace=False, subset=['weight', 'sex'])

# Drop any column with at least one missing value:
mouseDF.dropna(how='any', axis=1, inplace=False)
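
Since `mouseDF` has nothing to drop, here is a sketch of how the `how` and `subset` arguments behave on a made-up DataFrame with gaps in different places.

```python
import numpy as np
import pandas as pd

# Hypothetical data:
#   row 0 is complete, row 1 lacks 'sex',
#   row 2 lacks 'sex' and 'weight', row 3 lacks everything.
df = pd.DataFrame({'strain': ['C57Bl/6J', 'C57Bl/6J', 'DBA/2J', None],
                   'sex': ['M', None, None, None],
                   'weight': [27.5, 21.0, np.nan, np.nan]})

# how='any' (the default): keep only fully populated rows.
complete_rows = df.dropna()

# how='all': drop only rows in which every value is missing.
not_empty = df.dropna(how='all')

# subset: only look at 'weight' and 'sex' when deciding.
has_sex_or_weight = df.dropna(how='all', subset=['weight', 'sex'])

print(list(complete_rows.index))
print(list(not_empty.index))
print(list(has_sex_or_weight.index))
```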

## Working with data

In [None]:
maxWeight = mouseDF['weight'].max()
minWeight = mouseDF['weight'].min()

# Here's an example of broadcasting
# -- scalar values (that's what we call single values vs. structured values like lists or dictionaries) 
# are *broadcast* along the dimension of a Series.
mouseDF['scaledWeight'] = (maxWeight - mouseDF['weight'])/(maxWeight-minWeight)
mouseDF
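
To make the arithmetic concrete, here is the same scaling applied to three made-up weights. With this formula the heaviest mouse maps to 0 and the lightest to 1.

```python
import pandas as pd

# Hypothetical weights.
weights = pd.Series([20.0, 25.0, 30.0])

maxWeight = weights.max()  # 30.0
minWeight = weights.min()  # 20.0

# The scalars maxWeight and minWeight are broadcast against every element.
scaled = (maxWeight - weights) / (maxWeight - minWeight)

print(scaled.tolist())
```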

In [None]:
# You can also broadcast scalar values across an entire DataFrame.
# Operations between two Series are element-wise, matched on the index. The following
# adds mouseDF['weight'][i] to mouseDF['scaledWeight'][i]
# for each index label i in mouseDF:
mouseDF['weight'] + mouseDF['scaledWeight']

In [None]:
# We can also do it like this.
mouseDF['weight'].add(mouseDF['scaledWeight'])

You can do some other cool things with `add()`, `sub()`, `mul()`, `div()`, `mod()`, `pow()` with DataFrames—[check out this documentation on add](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.add.html?highlight=add).

In [None]:
# Apply a function to every element in a Series (this is slow!):
mouseDF['weight'].apply(lambda elem: (6-elem)/2)

In [None]:
# With DataFrame, we can apply a function to every row or column:
mouseDF.apply(lambda row: row['weight']/(row['scaledWeight']+2),axis=1)
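
Both `apply()` calls above can also be written as plain arithmetic on whole columns, which usually runs much faster. The sketch below compares the two styles on made-up values.

```python
import pandas as pd

# Hypothetical stand-in for mouseDF.
df = pd.DataFrame({'weight': [2.0, 4.0, 6.0],
                   'scaledWeight': [1.0, 0.5, 0.0]})

# Element by element with apply() ...
applied = df['weight'].apply(lambda elem: (6 - elem) / 2)
# ... and the equivalent vectorized expression.
vectorized = (6 - df['weight']) / 2

# Row by row with apply(axis=1) ...
row_applied = df.apply(lambda row: row['weight'] / (row['scaledWeight'] + 2), axis=1)
# ... and the equivalent column arithmetic.
row_vectorized = df['weight'] / (df['scaledWeight'] + 2)

print(vectorized.tolist())
print(row_vectorized.tolist())
```

Reserve `apply()` for logic that cannot be expressed with column operations.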

## Sorting

In [None]:
# Sort the DataFrame rows by weight.
mouseDF.sort_values(['weight'])
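
`sort_values()` returns a new, sorted DataFrame. It can sort in descending order or by several columns at once. The sketch below shows both on a made-up four-row DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for mouseDF.
df = pd.DataFrame({'strain': ['DBA/2J', 'C57Bl/6J', 'DBA/2J', 'C57Bl/6J'],
                   'weight': [20.5, 27.5, 22.9, 29.6]})

# Heaviest first.
by_weight_desc = df.sort_values('weight', ascending=False)

# By strain, then by weight within each strain.
by_strain_then_weight = df.sort_values(['strain', 'weight'])

print(by_weight_desc)
print(by_strain_then_weight)
```

The original index labels travel with their rows, which makes it easy to see how the rows were reordered.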

## Grouping

Grouping follows a "split-apply-combine" model. This is similar to GROUP BY in SQL and pivot tables in Excel.

In [None]:
# Split
groupedByStrain = mouseDF[['strain', 'weight', 'scaledWeight']].groupby('strain')

# Apply
groupedByStrain.agg(['min', 'mean', 'max'])
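
Here is a small sketch of split-apply-combine on made-up data, with values chosen so that the result is easy to check by hand. The name `summary` is illustrative.

```python
import pandas as pd

# Hypothetical stand-in for mouseDF: two mice per strain.
df = pd.DataFrame({'strain': ['C57Bl/6J', 'C57Bl/6J', 'DBA/2J', 'DBA/2J'],
                   'weight': [26.0, 30.0, 19.0, 23.0]})

# Split by strain, apply three aggregations to 'weight',
# and combine the results into one row per strain.
summary = df.groupby('strain')['weight'].agg(['min', 'mean', 'max'])

print(summary)
```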

### Activity

Let's load in one of the Apache files and see what we can do with it...

In [None]:
# TODO