# Introduction to pandas

## Wrangling Tabular Data in Python

### DATA 601

**By: Usman Alim**

Further Reading:

* **Python for Data Analysis** (second edition), by _Wes McKinney_  ([Library link](https://ucalgary-primo.hosted.exlibrisgroup.com/primo-explore/fulldisplay?docid=01UCALG_ALMA51642853910004336&context=L&vid=UCALGARY&search_scope=EVERYTHING&tab=everything&lang=en_US) for book) <br>
The material in this notebook is based on Chapters 5, 6, 7, and 8.
* [**pandas official documentation**](https://pandas.pydata.org/pandas-docs/stable/)

## Outline

- **[Introduction](#introduction)**


- **[Core functionality](#core)**


- **[Reading CSV data](#reading)**


- **[Data wrangling](#wrangling)**


## <a name="introduction">Introduction</a>

- pandas: Python Data Analysis Library<br>
  By convention, `import pandas as pd`.
  

- Designed for tabular or spreadsheet data.


- Borrows many idioms from NumPy but supports heterogeneous data.


- The main data structures are: `Series` and `DataFrame`


- Lots of optimized functions to work with _small_ to _medium_ datasets.

###  `Series`

- A series is a one-dimensional data structure where each item has an associated label known as the _index_.<br>
  Access to the elements is via the index.
  
  
- Behaves like a Python `dict` but data can be ordered. Think of it as a _fixed length_ ordered dictionary.


- Duplicate indices are supported but for efficiency reasons, it is better to have unique indices. Unique indices will allow $O(1)$ access to the rows.


- Supports NumPy-like indexing and filtering as well as vectorized computation. Series data is stored as a NumPy array.
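
The properties listed above can be sketched in a few lines. The series `temps` and its city labels are made up for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: daily high temperatures (deg C) labelled by city.
temps = pd.Series(
    [21.5, 18.0, 25.2, 18.0],
    index=["calgary", "vancouver", "toronto", "halifax"],
)

# Dict-like access via the index label.
calgary_temp = temps["calgary"]

# NumPy-like boolean filtering.
warm = temps[temps > 20]

# Vectorized computation: convert every value to Fahrenheit at once.
temps_f = temps * 9 / 5 + 32

# The underlying data is exposed as a NumPy array.
raw = temps.to_numpy()

print(calgary_temp)
print(warm)
print(temps_f)
print(type(raw))
```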

In [11]:
!pip install matplotlib



In [10]:
# Series example
import pandas as pd
import numpy as np

# Let's create a series from a dictionary. We can assign names to the index and the values.

ages_data = {'susan': 22, 'joe': 29, 'al': 21, 'frank': 30, 'salim': 27}
data = pd.Series(ages_data)
data.index.name = "name"
data.name = "age"
print(data)
print("\n")

# The series is iterable.
for i in data:
    print(i)

# The ordering can be changed in place
data.sort_index(inplace=True)
print("\n")
print(data)

data.sort_values(inplace=True)
print("\n")
print(data)


# inspect the types of the indices and values
print("\n")
print(type(data.index))
print(type(data.values))
    

name
susan    22
joe      29
al       21
frank    30
salim    27
Name: age, dtype: int64


22
29
21
30
27


name
al       21
frank    30
joe      29
salim    27
susan    22
Name: age, dtype: int64


name
al       21
susan    22
salim    27
joe      29
frank    30
Name: age, dtype: int64


<class 'pandas.core.indexes.base.Index'>
<class 'numpy.ndarray'>


In [2]:
# We can alter the index and the values in-place
data.index = [str.upper(name) for name in data.index]
print(data)
print("\n")

data["AL"] = data["AL"] + 5
print(data)

# add some more data
data["BOB"] = 28
data["BOB"] = 29
data["JOHN"] = np.nan # used to represent missing data
print("\n")
print(data)

AL       21
SUSAN    22
SALIM    27
JOE      29
FRANK    30
Name: age, dtype: int64


AL       26
SUSAN    22
SALIM    27
JOE      29
FRANK    30
Name: age, dtype: int64


AL       26.0
SUSAN    22.0
SALIM    27.0
JOE      29.0
FRANK    30.0
BOB      29.0
JOHN      NaN
Name: age, dtype: float64


In [3]:
# Let's investigate the impact of non-unique indices.

x = np.random.randint(0, 10**6, 10**6)
s1 = pd.Series(x) # this will have default integer indices.
s1.index.name="unique"
s1.name="rands"
print(s1.is_unique) # note: this checks the values, not the index

print("\n")
s2 = s1.reindex(index=x)
s2.index.name="not-unique"
print(s2.head())
print("\n")

s3 = s2.sort_index()
s3.index.name="not-unique-sorted"
print(s3.head())
print("\n")

%time print(s1[s2.index[0]])
print("\n")

%time print(s2[s2.index[0]])
print("\n")

%time print(s3[s2.index[0]])

False


not-unique
65265     112319
517635    780127
651800    333704
521433    883098
636833    407939
Name: rands, dtype: int64


not-unique-sorted
0     65265
1    517635
1    517635
4    636833
5    214211
Name: rands, dtype: int64


112319
CPU times: user 39 µs, sys: 12 µs, total: 51 µs
Wall time: 54.8 µs


not-unique
65265    112319
65265    112319
Name: rands, dtype: int64
CPU times: user 56.4 ms, sys: 5.78 ms, total: 62.2 ms
Wall time: 61.9 ms


not-unique-sorted
65265    112319
65265    112319
Name: rands, dtype: int64
CPU times: user 48.2 ms, sys: 12.1 ms, total: 60.3 ms
Wall time: 68.3 ms
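
In the cell above, `s1.is_unique` reports on the values of the series, not on its index. A minimal sketch of the difference, using two small made-up series `s` and `d`:

```python
import pandas as pd

s = pd.Series([10, 10, 30], index=["a", "b", "c"])

# Series.is_unique looks at the values; Index.is_unique looks at the labels.
print(s.is_unique)        # the values 10, 10, 30 contain a repeat
print(s.index.is_unique)  # the labels a, b, c are all distinct

# With a duplicated label, selecting by that label returns a Series
# instead of a scalar.
d = pd.Series([1, 2, 3], index=["x", "x", "y"])
print(d.index.is_unique)
print(d["x"])
print(d["y"])
```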


###  `DataFrame`

- A `DataFrame` is a rectangular 2D table with named rows and columns. Each column can have a different value type.


- Think of it as a number of `Series` objects all sharing the same index.


- Both rows and columns can be indexed. Data can be sorted and filtered in a number of ways.


- pandas provides convenient functions to read a number of formats into a `DataFrame`.
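
The idea of a `DataFrame` as several `Series` objects sharing one index can be sketched as follows. The names `height`, `city` and `people` are hypothetical.

```python
import pandas as pd

# Two hypothetical Series with overlapping but not identical indices.
height = pd.Series({"ann": 165, "bo": 180, "cy": 172})
city = pd.Series({"bo": "Calgary", "cy": "Toronto", "di": "Halifax"})

# Building a DataFrame from a dict of Series aligns them on a shared index
# (the union of the labels); entries missing from a Series become NaN.
people = pd.DataFrame({"height": height, "city": city})
print(people)

# Each column is itself a Series carrying the shared index.
print(type(people["height"]))
print(people.dtypes)
```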

In [4]:
# Let's build a data frame by adding another column to the series 

loc_data = dict(zip(ages_data.keys(),['Calgary', 'Calgary', 'Vancouver', 'Toronto', 'Beirut']))
frame_data = {'age': [*ages_data.values()], 'location': [*loc_data.values()] }
frame = pd.DataFrame(frame_data)
frame.index = [*ages_data.keys()]
display(frame) # jupyter will pretty-print DataFrames


Unnamed: 0,age,location
susan,22,Calgary
joe,29,Calgary
al,21,Vancouver
frank,30,Toronto
salim,27,Beirut


In [5]:
# We can index by column in one of two ways. A series is returned
print(frame["age"]) # syntax for columns
print("\n")
print(frame.age)
print("\n")

# Individual entries can be altered. The preferred way to do that is to use the
# loc operator (with NumPy-like syntax). The pandas manual advises against
# using chained indexing.

print(frame.loc["susan"]) # syntax for rows
print(frame.loc["susan", "age"])
frame.loc["susan", "age"] = 24

frame

susan    22
joe      29
al       21
frank    30
salim    27
Name: age, dtype: int64


susan    22
joe      29
al       21
frank    30
salim    27
Name: age, dtype: int64


age              22
location    Calgary
Name: susan, dtype: object
22


Unnamed: 0,age,location
susan,24,Calgary
joe,29,Calgary
al,21,Vancouver
frank,30,Toronto
salim,27,Beirut


## <a name="core">Core Functionality</a>

- Many fundamental operations on tabular data are already implemented with convenient syntactic sugar. Do not reinvent the wheel; make use of vectorized computation as much as possible.


- Some things we may want to do with tabular data:

  - Indexing, selection and filtering
  
  - Sorting

  - Arithmetic, function application, computing statistics



In [6]:
# Indexing and slicing using the indexing operator []:
# Indexing using the '[]' operator indexes the columns.

print(frame['age']) # one column
display(frame[['location','age']]) # or a list of columns

# Can also use [] operator with the slicing operator : to index rows
print("\n")
display(frame[0:2])
print("\n")
display(frame['al':])

susan    24
joe      29
al       21
frank    30
salim    27
Name: age, dtype: int64


Unnamed: 0,location,age
susan,Calgary,24
joe,Calgary,29
al,Vancouver,21
frank,Toronto,30
salim,Beirut,27






Unnamed: 0,age,location
susan,24,Calgary
joe,29,Calgary






Unnamed: 0,age,location
al,21,Vancouver
frank,30,Toronto
salim,27,Beirut


In [7]:
# Some may prefer indexing and slicing using loc (label indexing) and 
# iloc (integer indexing)

# Let's add another column of data first
frame['ID'] = pd.Series(np.random.randint(low=100, high=1000, size=len(frame.index)), index=frame.index)
display(frame)

# loc indexing and slicing use numpy-like syntax: the first argument is the
# row, the second is the column. Unlike numpy, label slices include the end label.
print(frame.loc['susan',:])
display(frame.loc['susan',['ID','age']])
display(frame.loc['susan':'al',['ID','age']])
display(frame.loc['susan':'al', 'age':'location'])

frame.loc['susan':'al','ID'] = 0
frame


Unnamed: 0,age,location,ID
susan,24,Calgary,523
joe,29,Calgary,968
al,21,Vancouver,163
frank,30,Toronto,257
salim,27,Beirut,197


age              24
location    Calgary
ID              523
Name: susan, dtype: object


ID     523
age     24
Name: susan, dtype: object

Unnamed: 0,ID,age
susan,523,24
joe,968,29
al,163,21


Unnamed: 0,age,location
susan,24,Calgary
joe,29,Calgary
al,21,Vancouver


Unnamed: 0,age,location,ID
susan,24,Calgary,0
joe,29,Calgary,0
al,21,Vancouver,0
frank,30,Toronto,257
salim,27,Beirut,197
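
The cell above uses `loc` only. As a complement, here is a minimal sketch of `iloc`, using a small frame `df` re-created for illustration so that the block runs on its own.

```python
import pandas as pd

# Hypothetical frame, similar in shape to the one used above.
df = pd.DataFrame(
    {"age": [24, 29, 21, 30],
     "location": ["Calgary", "Calgary", "Vancouver", "Toronto"]},
    index=["susan", "joe", "al", "frank"],
)

# iloc uses integer positions; the end of a slice is excluded, as in NumPy.
by_position = df.iloc[0:2, 0]

# loc uses labels; the end label of a slice is included.
by_label = df.loc["susan":"al", "age"]

print(by_position)
print(by_label)
```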


In [8]:
# Boolean indexing and filtering
#
# AKA: How to execute SQL-like queries
#
# Selecting and filtering data in pandas is done through boolean
# indexing. The syntax is similar to numpy. 

# select all rows where age >= 25
display(frame[frame["age"] >= 25])

# select all rows where location is Calgary
display(frame[frame["location"] == 'Calgary'])

# Some may find the following syntax more convenient
display(frame[(frame.age <=27) & (frame.ID > 0)])

display(frame.loc[(frame.age <=27) & (frame.ID > 0)])

Unnamed: 0,age,location,ID
joe,29,Calgary,0
frank,30,Toronto,257
salim,27,Beirut,197


Unnamed: 0,age,location,ID
susan,24,Calgary,0
joe,29,Calgary,0


Unnamed: 0,age,location,ID
salim,27,Beirut,197


Unnamed: 0,age,location,ID
salim,27,Beirut,197
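
For readers coming from SQL, the same kind of filter can also be written with `DataFrame.query` and `Series.isin`. The sketch below re-creates the frame as `df` so that it is self-contained; it is one possible style, not a replacement for boolean masks.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [24, 29, 21, 30, 27],
     "location": ["Calgary", "Calgary", "Vancouver", "Toronto", "Beirut"],
     "ID": [0, 0, 0, 257, 197]},
    index=["susan", "joe", "al", "frank", "salim"],
)

# The same filter written with a boolean mask and with DataFrame.query().
mask_result = df[(df.age <= 27) & (df.ID > 0)]
query_result = df.query("age <= 27 and ID > 0")

# isin() plays the role of SQL's IN clause.
in_result = df[df["location"].isin(["Calgary", "Toronto"])]

print(query_result)
print(in_result)
```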


In [9]:
# Organising query results

# Display only a subset of the columns
display(frame.loc[frame.age >= 25].loc[:,["age", "location"]])

# Sorting the index in reverse order
display( frame.loc[frame.age >= 25].loc[:,["age", "location"]].sort_index(ascending=False) )

# sorting according to a particular column
display(frame.loc[frame.age >= 25].loc[:,["age", "location"]].sort_values(by='location', ascending=True))

Unnamed: 0,age,location
joe,29,Calgary
frank,30,Toronto
salim,27,Beirut


Unnamed: 0,age,location
salim,27,Beirut
joe,29,Calgary
frank,30,Toronto


Unnamed: 0,age,location
salim,27,Beirut
joe,29,Calgary
frank,30,Toronto


### Arithmetic Operations and Function Application

- Since series data is stored as NumPy arrays, vectorized arithmetic operations between DataFrames are supported.


- Similarly, we can apply functions to entire rows or columns in a vectorized manner.


- DO NOT loop over the data; this is inefficient!


- By default, a binary arithmetic operation on two dataframes will align by the row and column indices, performing a union of the indices. Missing values are created for data that does not exist. 
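
A minimal sketch of the alignment behaviour described in the last point, using two small made-up frames `a` and `b`. It also shows the `fill_value` argument of `DataFrame.add`, which is one way to avoid some of the missing values.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [10, 20], "z": [30, 40]}, index=["r2", "r3"])

# The + operator aligns on the union of row and column labels;
# any cell missing from either operand becomes NaN.
plain = a + b

# add() with fill_value treats a cell missing from one operand as 0.
# Cells missing from both operands remain NaN.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```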


In [10]:
# Let's create two dataframes and add them together

f1 = pd.DataFrame(np.arange(12).reshape((4,3)), columns=list('ABC'), index=['susan', 'joe', 'frank', 'al'])
f2 = pd.DataFrame(np.arange(9).reshape((3,3)), columns=list('BCD'), index=['susan', 'joe', 'robert'])
display(f1)
display(f2)

display(f1 + f2)

Unnamed: 0,A,B,C
susan,0,1,2
joe,3,4,5
frank,6,7,8
al,9,10,11


Unnamed: 0,B,C,D
susan,0,1,2
joe,3,4,5
robert,6,7,8


Unnamed: 0,A,B,C,D
al,,,,
frank,,,,
joe,,7.0,9.0,
robert,,,,
susan,,1.0,3.0,


In [11]:
# Arithmetic operations are also defined between a dataframe and a series. The series
# is "broadcast" to all the rows of the dataframe. 

display(f1)
display(f2)

display(f1.loc["susan"] + f1)
display(f2.loc["susan"] + f1)

Unnamed: 0,A,B,C
susan,0,1,2
joe,3,4,5
frank,6,7,8
al,9,10,11


Unnamed: 0,B,C,D
susan,0,1,2
joe,3,4,5
robert,6,7,8


Unnamed: 0,A,B,C
susan,0,2,4
joe,3,5,7
frank,6,8,10
al,9,11,13


Unnamed: 0,A,B,C,D
susan,,1.0,3.0,
joe,,4.0,6.0,
frank,,7.0,9.0,
al,,10.0,12.0,


In [12]:
# Vectorized functions for simple aggregation

display(f1)
display(f1.sum()) # Sum down the columns
display(f1.sum(axis='columns')) # Sum along the rows

# Can also do accumulations
display(f1.cumsum())

# NumPy ufuncs are also supported.
display(np.cos(f1)) # Applies to the entire array. Numeric data needed.
display(np.cos(f1.loc["susan"])) # applies to the specific row

Unnamed: 0,A,B,C
susan,0,1,2
joe,3,4,5
frank,6,7,8
al,9,10,11


A    18
B    22
C    26
dtype: int64

susan     3
joe      12
frank    21
al       30
dtype: int64

Unnamed: 0,A,B,C
susan,0,1,2
joe,3,5,7
frank,9,12,15
al,18,22,26


Unnamed: 0,A,B,C
susan,1.0,0.540302,-0.416147
joe,-0.989992,-0.653644,0.283662
frank,0.96017,0.753902,-0.1455
al,-0.91113,-0.839072,0.004426


A    1.000000
B    0.540302
C   -0.416147
Name: susan, dtype: float64

In [13]:
# Apply and Map for Data Transformation

# Use apply to apply custom functions to the rows or columns

# In the following, the argument is the entire row (or column)
func = lambda x: np.sqrt(np.dot(x, x))

display(f1.apply(func))
display(f1.apply(func, axis="columns"))

# Use map (for a Series) and applymap (for a DataFrame) to apply a function in an 
# element-wise manner
formatter = lambda x: '%0.2f' % x

display(f1.apply(func, axis="columns").map(formatter))
display(f1.applymap(lambda x: x*x))




A    11.224972
B    12.884099
C    14.628739
dtype: float64

susan     2.236068
joe       7.071068
frank    12.206556
al       17.378147
dtype: float64

susan     2.24
joe       7.07
frank    12.21
al       17.38
dtype: object

Unnamed: 0,A,B,C
susan,0,1,4
joe,9,16,25
frank,36,49,64
al,81,100,121


## <a name="reading">Importing Data into pandas</a>

- pandas provides a number of readers and writers to read/write data in various tabular formats.


- For a reader, the input can be in text format (csv, html etc.) or a binary format (e.g. HDF5) and the output is a `DataFrame` object. Portions of a dataset can also be read.


- Readers have numerous options to specify how the data is delimited, how it is to be interpreted (data types for columns), what to do with missing values etc.


- Here, we'll look at some examples of reading csv files. Please refer to `pandas` [documentation](http://pandas.pydata.org/pandas-docs/stable/io.html) for additional supported file formats.
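
Some of the reader options mentioned above can be tried without any file or network access by reading from an in-memory string. The CSV content below is made up for illustration.

```python
import io
import pandas as pd

# A small made-up CSV held in memory so the example needs no network access.
raw = io.StringIO(
    "id;name;score\n"
    "1;ann;3.5\n"
    "2;bo;missing\n"
    "3;cy;4.0\n"
)

df = pd.read_csv(
    raw,
    sep=";",                 # how the fields are delimited
    na_values=["missing"],   # extra strings to treat as missing
    dtype={"id": "int64"},   # explicit type for a column
    index_col="id",          # use a column as the index
    nrows=2,                 # read only a portion of the data
)
print(df)
print(df.dtypes)
```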

In [14]:
# Reading csv data. Data can be local or on a server

URL_base = "https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/"
ex1 = "examples/ex1.csv"
ex2 = "examples/ex2.csv"

display(pd.read_csv(URL_base + ex1))

# This example doesn't have a header row. We can let pandas assign defaults 
# or specify the column names
display(pd.read_csv(URL_base + ex2))
names=['A', 'B', 'C', 'D', 'message']
display(pd.read_csv(URL_base + ex2, names=names))

# One of the columns can serve as the index
display(pd.read_csv(URL_base + ex2, names=names, index_col='A'))

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Unnamed: 0,1,2,3,4,hello
0,5,6,7,8,world
1,9,10,11,12,foo


Unnamed: 0,A,B,C,D,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Unnamed: 0_level_0,B,C,D,message
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [15]:
# The delimiter can be specified; regular expressions are supported.

ex3 = "examples/ex3.txt"

# whitespace as a separator
display(pd.read_csv(URL_base + ex3, sep=r'\s+')) # raw string avoids an invalid-escape warning

# Specifying missing values
ex5 = 'examples/ex5.csv'
display(pd.read_csv(URL_base + ex5))
miss = {'message': ['NA', 'foo'], 'something': ['two']}
display(pd.read_csv(URL_base + ex5, na_values=miss))

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,


### Exercise

- Read the [Historical Rainfall](https://data.calgary.ca/Environment/Historical-Rainfall/d9kv-swk3) dataset from the [City of Calgary's Open Data Portal](https://data.calgary.ca).
  Note that you can read the data directly via its URL. At the time of writing this, the URL is:
  https://data.calgary.ca/resource/d9kv-swk3.csv
- Display the head and tail of the dataset.
- Get a list of all the communities where the rain gauges are located.
- Plot a histogram of the values in the 'rainfall' column to get a sense of the distribution of data in this column.
  

In [16]:
# When working with large files or to get an idea of the structure of the data,
# it may be useful to read the data in pieces.

ex6 = 'examples/ex6.csv'
frame = pd.read_csv(URL_base + ex6, nrows=10)
display(frame)

# We can specify a chunksize (in terms of number of rows) 
# and read the file incrementally

streamer = pd.read_csv(URL_base + ex6, chunksize=1000)
print(type(streamer))

# Now we can iterate over the chunks
totalone = 0
for chunk in streamer:
    totalone = totalone + chunk['one'].sum()
    
print(totalone)


Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
5,1.81748,0.742273,0.419395,-2.251035,Q
6,-0.776764,0.935518,-0.332872,-1.875641,U
7,-0.913135,1.530624,-0.572657,0.477252,K
8,0.35848,-0.497572,-0.367016,0.507702,S
9,-1.740877,-1.160417,-1.63783,2.172201,G


<class 'pandas.io.parsers.readers.TextFileReader'>
457.5016080420038


### Exercise

- For the above example dataset, read in chunks of 100 rows and determine the mean values for the first four columns.


- Use the `DataFrame` member function `sum()` to determine the sum of each column in each chunk.
  - Beware of _overflow_ issues when summing up large arrays. For this exercise, you don't need to worry about it.


- Verify your answer by computing the mean directly (i.e. by reading the dataset in full).

In [17]:
streamer = pd.read_csv(URL_base + ex6, chunksize=100)
tots = pd.Series(dtype=float) # explicit dtype avoids a warning for an empty Series
nums = 0
for chunk in streamer:
    tots = tots.add(chunk[['one','two','three','four']].sum(), fill_value=0)
    nums = nums + len(chunk.index)
print(nums)
    
print(tots)
print("\n")
print(tots / nums)

print("\n")
display(pd.read_csv(URL_base + ex6)[['one', 'two', 'three', 'four']].mean())


10000
one      457.501608
two        8.708928
three   -264.634750
four     159.854005
dtype: float64


one      0.045750
two      0.000871
three   -0.026463
four     0.015985
dtype: float64




one      0.045750
two      0.000871
three   -0.026463
four     0.015985
dtype: float64

## <a name="wrangling">More Data Wrangling</a>

- Data Cleaning and Filtering

- Joining and Combining Datasets

- Hierarchical Indexing and Reshaping

- Grouping and Aggregation

### Missing Data

- There may be missing entries in datasets you read, or the results of an operation might introduce missing values. In `pandas`, the floating point `NaN` is used to indicate missing numeric values. `None` is sometimes used to indicate missing string values.


- Any arithmetic operation with a `NaN` will result in a `NaN`. So an entire series will end up with `NaN`s if they are not handled appropriately. 


- `pandas` provides a number of helper routines for dropping, filtering and replacing missing values. We'll look at a few examples here.
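
A short sketch of how `NaN` behaves in practice, using a made-up series `s`. Element-wise arithmetic propagates `NaN`, whereas reductions such as `sum()` skip it by default.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Element-wise arithmetic propagates NaN.
print(s + 1)

# Reductions such as sum() and mean() skip NaN by default (skipna=True).
print(s.sum(), s.mean())
print(s.sum(skipna=False))

# isna() and notna() locate missing entries.
print(s.isna())
```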

In [18]:
# Dropping NA values using dropna(). Consult documentation of 
# dropna() for more details.

display(f1)
display(f2)
result = f1+f2
result["E"] = 10
display(result)

# By default, dropna will drop any row with a missing value
display(result.dropna()) 

# We can operate along the columns
display(result.dropna(axis=1))

# We can specify a threshold that we can tolerate
display(result.dropna(thresh=2))


Unnamed: 0,A,B,C
susan,0,1,2
joe,3,4,5
frank,6,7,8
al,9,10,11


Unnamed: 0,B,C,D
susan,0,1,2
joe,3,4,5
robert,6,7,8


Unnamed: 0,A,B,C,D,E
al,,,,,10
frank,,,,,10
joe,,7.0,9.0,,10
robert,,,,,10
susan,,1.0,3.0,,10


Unnamed: 0,A,B,C,D,E


Unnamed: 0,E
al,10
frank,10
joe,10
robert,10
susan,10


Unnamed: 0,A,B,C,D,E
joe,,7.0,9.0,,10
susan,,1.0,3.0,,10


In [19]:
# filling in missing data with reasonable values using fillna(). Consult
# documentation for more details. Method used for filling in missing
# data will be application dependent.

# We can specify a default value to use for missing values.
display(result.fillna(0))

# Use different values by column
display(result.fillna({'A': 0.0, 'D': 5.0}))

# Fill in by looking up neighbours (e.g. backfill)
display(result.bfill()) # same as fillna(method='bfill'), which is deprecated in recent pandas

Unnamed: 0,A,B,C,D,E
al,0.0,0.0,0.0,0.0,10
frank,0.0,0.0,0.0,0.0,10
joe,0.0,7.0,9.0,0.0,10
robert,0.0,0.0,0.0,0.0,10
susan,0.0,1.0,3.0,0.0,10


Unnamed: 0,A,B,C,D,E
al,0.0,,,5.0,10
frank,0.0,,,5.0,10
joe,0.0,7.0,9.0,5.0,10
robert,0.0,,,5.0,10
susan,0.0,1.0,3.0,5.0,10


Unnamed: 0,A,B,C,D,E
al,,7.0,9.0,,10
frank,,7.0,9.0,,10
joe,,7.0,9.0,,10
robert,,1.0,3.0,,10
susan,,1.0,3.0,,10


In [20]:
# Boolean indexing can be used to perform many filtering operations

df = pd.DataFrame(np.random.randn(1000,3))
display(df.describe())
print(df.head())
# select all rows where absolute value of any coordinate is greater
# than 2.5
dfmask = (np.abs(df) > 2.5).any(axis=1)
display(dfmask)
df[dfmask] = 0
display(df[dfmask])

df.describe()

Unnamed: 0,0,1,2
count,1000.0,1000.0,1000.0
mean,-0.002602,-0.005594,0.004198
std,1.010692,0.998364,0.982207
min,-3.196248,-3.22846,-2.939277
25%,-0.687071,-0.686133,-0.658583
50%,-0.026152,-0.032649,-0.03768
75%,0.680601,0.656847,0.644568
max,3.445887,3.013404,3.688991


          0         1         2
0  0.857496 -0.241225 -0.519681
1 -1.046730  0.518678  2.452448
2 -2.021943 -0.448106 -0.081444
3  0.124962 -0.689818  0.898168
4  0.969328  1.040416 -2.047172


0      False
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Length: 1000, dtype: bool

Unnamed: 0,0,1,2
27,0.0,0.0,0.0
56,0.0,0.0,0.0
75,0.0,0.0,0.0
164,0.0,0.0,0.0
177,0.0,0.0,0.0
199,0.0,0.0,0.0
202,0.0,0.0,0.0
206,0.0,0.0,0.0
210,0.0,0.0,0.0
213,0.0,0.0,0.0


Unnamed: 0,0,1,2
count,1000.0,1000.0,1000.0
mean,-0.008869,-0.009229,0.016479
std,0.923592,0.927838,0.922283
min,-2.428483,-2.357321,-2.267478
25%,-0.662053,-0.637539,-0.577714
50%,0.0,0.0,0.0
75%,0.610341,0.59079,0.587247
max,2.433722,2.456641,2.494613


In [21]:
# Vectorized string functions for Series

# pandas provides vectorized string functions for pattern matching
# on string series. These are more robust and don't fail on missing
# data (compared to applying a function via map()).

eseries = pd.Series({'rob': 'rob123@gmail.com', 'al' : 'al345@gmail.com', 'susan' : 'susan678@yahoo.com', 'bob' : np.nan})
display(eseries)

display(eseries.str.contains('gmail'))

# Use a regular expression to extract the username
pattern = '([a-zA-Z0-9]+)@'
display(eseries.str.findall(pattern))

rob        rob123@gmail.com
al          al345@gmail.com
susan    susan678@yahoo.com
bob                     NaN
dtype: object

rob       True
al        True
susan    False
bob        NaN
dtype: object

rob        [rob123]
al          [al345]
susan    [susan678]
bob             NaN
dtype: object
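
Two related string methods may be useful here: `str.extract`, which returns capture groups as columns, and the `na` argument of `str.contains`. The sketch below uses a made-up series `emails` and an illustrative pattern.

```python
import numpy as np
import pandas as pd

emails = pd.Series(["rob123@gmail.com", "susan678@yahoo.com", np.nan])

# str.extract() returns the first match of each capture group; named groups
# become column names. Missing entries stay missing.
parts = emails.str.extract(r"(?P<user>[a-zA-Z0-9]+)@(?P<domain>[a-z.]+)")
print(parts)

# na=False makes contains() return False instead of NaN for missing entries,
# so the result can be used directly as a boolean mask.
is_gmail = emails.str.contains("gmail", na=False)
print(emails[is_gmail])
```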

### Combining and Joining Datasets

- Simple concatenations of columns or rows can be done via `pandas.concat`.

- pandas supports database join operations via `pandas.merge`. A _join_ operation joins two tables based on one or more keys.


- Joins come in a number of flavours: _inner_, _outer_, _left-outer_, _right-outer_.


- We'll look at examples of simple concatenations and joins. Please consult the documentation ([`pandas.concat`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html), [`pandas.merge`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html)) for an exhaustive list of supported operations. 
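
As a compact preview of the join flavours listed above, the sketch below merges two small made-up tables `left` and `right`. The `indicator=True` option adds a `_merge` column that records where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [20, 30, 40]})

# Concatenate along rows; ignore_index=True builds a fresh 0..n-1 index.
stacked = pd.concat([left, left], ignore_index=True)
print(stacked)

# The `how` argument selects the join flavour.
for how in ["inner", "outer", "left", "right"]:
    joined = pd.merge(left, right, on="key", how=how)
    print(how, sorted(joined["key"]))

# indicator=True records which table(s) each row came from.
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
print(outer)
```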


In [22]:
# Concatenating along columns.
df1 = pd.DataFrame(np.arange(6).reshape((3,2)), index=['c','b','a'], columns=['one','two'])
df2 = pd.DataFrame(6+np.arange(6).reshape((3,2)), index=['a','b','c'], columns=['three', 'four'])
display(df1)
display(df2)

display(pd.concat([df1,df2], axis=1, sort=True))


Unnamed: 0,one,two
c,0,1
b,2,3
a,4,5


Unnamed: 0,three,four
a,6,7
b,8,9
c,10,11


Unnamed: 0,one,two,three,four
a,4,5,6,7
b,2,3,8,9
c,0,1,10,11


In [23]:
# When keys are shared, we can do database joins

# Let's create two dataframes to explore how joins work

df1 = pd.DataFrame({'key':list('bbacaabe'),'field1':np.random.randn(8)})
df2 = pd.DataFrame({'key':list('abd'),'field2':np.random.randn(3)})
display(df1.sort_values(by='key'))
display(df2.sort_values(by='key'))

# An inner join (intersection of keys, commutative)
print("\nInner")
display(pd.merge(df1,df2, on='key').sort_values(by='key'))

# An outer join (union of keys, commutative)
print("\nOuter")
display(pd.merge(df1,df2, on='key', how='outer').sort_values(by='key'))

Unnamed: 0,key,field1
2,a,0.30959
4,a,-0.18726
5,a,-1.311738
0,b,1.36257
1,b,0.102864
6,b,-0.583256
3,c,0.746899
7,e,0.64903


Unnamed: 0,key,field2
0,a,2.147297
1,b,-1.30774
2,d,0.013559



Inner


Unnamed: 0,key,field1,field2
3,a,0.30959,2.147297
4,a,-0.18726,2.147297
5,a,-1.311738,2.147297
0,b,1.36257,-1.30774
1,b,0.102864,-1.30774
2,b,-0.583256,-1.30774



Outer


Unnamed: 0,key,field1,field2
3,a,0.30959,2.147297
4,a,-0.18726,2.147297
5,a,-1.311738,2.147297
0,b,1.36257,-1.30774
1,b,0.102864,-1.30774
2,b,-0.583256,-1.30774
6,c,0.746899,
8,d,,0.013559
7,e,0.64903,


In [24]:
# Left-outer (use all keys from the left table, non-commutative)
# This way of calling may be more intuitive
print("\nLeft-outer")
display(df1.merge(df2, on='key', how='left').sort_values(by='key')) 

# Right-outer (use all keys from the right table, non-commutative)
print("\nRight-outer")
display(df1.merge(df2, on='key', how='right').sort_values(by='key')) 


Left-outer


Unnamed: 0,key,field1,field2
2,a,0.30959,2.147297
4,a,-0.18726,2.147297
5,a,-1.311738,2.147297
0,b,1.36257,-1.30774
1,b,0.102864,-1.30774
6,b,-0.583256,-1.30774
3,c,0.746899,
7,e,0.64903,



Right-outer


Unnamed: 0,key,field1,field2
0,a,0.30959,2.147297
1,a,-0.18726,2.147297
2,a,-1.311738,2.147297
3,b,1.36257,-1.30774
4,b,0.102864,-1.30774
5,b,-0.583256,-1.30774
6,d,,0.013559


In [25]:
# The above are examples of one-to-many joins. In many-to-many joins, a Cartesian product
# of the common keys determines the output.

df1 = pd.DataFrame({'key':list('bbacaabe'),'field1':np.random.randn(8)})
df2 = pd.DataFrame({'key':list('aabbcd'),'field2':np.random.randn(6)})
display(df1.sort_values(by='key'))
display(df2.sort_values(by='key'))

# Inner join
print("\nInner")
display(pd.merge(df1,df2, on='key').sort_values(by='key'))

# Outer join 
print("\nOuter")
display(pd.merge(df1,df2, on='key', how='outer').sort_values(by='key'))

Unnamed: 0,key,field1
2,a,-0.713127
4,a,3.498356
5,a,1.370341
0,b,1.395264
1,b,-1.580269
6,b,-0.113888
3,c,1.053545
7,e,-0.355607


Unnamed: 0,key,field2
0,a,-1.02679
1,a,0.272756
2,b,-0.102783
3,b,-2.447511
4,c,-1.544197
5,d,-0.390294



Inner


Unnamed: 0,key,field1,field2
6,a,-0.713127,-1.02679
7,a,-0.713127,0.272756
8,a,3.498356,-1.02679
9,a,3.498356,0.272756
10,a,1.370341,-1.02679
11,a,1.370341,0.272756
0,b,1.395264,-0.102783
1,b,1.395264,-2.447511
2,b,-1.580269,-0.102783
3,b,-1.580269,-2.447511



Outer


Unnamed: 0,key,field1,field2
6,a,-0.713127,-1.02679
7,a,-0.713127,0.272756
8,a,3.498356,-1.02679
9,a,3.498356,0.272756
10,a,1.370341,-1.02679
11,a,1.370341,0.272756
0,b,1.395264,-0.102783
1,b,1.395264,-2.447511
2,b,-1.580269,-0.102783
3,b,-1.580269,-2.447511
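
The row counts in a many-to-many join follow directly from the Cartesian product of matching keys. A minimal sketch with made-up tables `left` and `right`, which also shows the optional `validate` argument of `pd.merge` for checking the expected key relationship:

```python
import pandas as pd

left = pd.DataFrame({"key": list("aab"), "lval": [1, 2, 3]})
right = pd.DataFrame({"key": list("aaab"), "rval": [10, 20, 30, 40]})

# Key 'a' occurs 2 times on the left and 3 times on the right, so the
# inner join produces 2 * 3 = 6 rows for 'a', plus 1 * 1 = 1 row for 'b'.
joined = pd.merge(left, right, on="key")
print(joined["key"].value_counts())

# The optional `validate` argument checks the expected key relationship
# and raises MergeError when it does not hold.
try:
    pd.merge(left, right, on="key", validate="one_to_many")
    validated = True
except pd.errors.MergeError:
    validated = False
print(validated)
```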


### Hierarchical Indexing, Reshaping and Pivoting

- Hierarchical indexing (a `MultiIndex`) allows an axis of a `Series` or `DataFrame` to carry more than one index level. This is the way to represent multi-dimensional tables in pandas.
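
A minimal sketch of a two-level index built explicitly with `pd.MultiIndex.from_product`. The series `rain`, its level names and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical measurements indexed by (city, year).
idx = pd.MultiIndex.from_product(
    [["YYC", "YVR"], [2020, 2021]], names=["city", "year"]
)
rain = pd.Series([10.0, 12.5, 30.0, 28.0], index=idx, name="rain")
print(rain)

# Selecting on the outer level returns a Series indexed by the inner level.
print(rain["YYC"])

# unstack() moves the inner level to the columns, giving a 2D table.
table = rain.unstack()
print(table)
```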


In [26]:
# Example: A series with a two-level index

data = pd.Series(np.arange(12), index=[list('aaabbccddeee'), [1,2,3,1,3,1,2,2,3,1,2,3]])
display(data)
display(data.index)

# Look at a reshaped version to see what's going on
display(data.unstack())

# The labels are integer indices into the levels lists for each data entry.

a  1     0
   2     1
   3     2
b  1     3
   3     4
c  1     5
   2     6
d  2     7
   3     8
e  1     9
   2    10
   3    11
dtype: int64

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3),
            ('e', 1),
            ('e', 2),
            ('e', 3)],
           )

Unnamed: 0,1,2,3
a,0.0,1.0,2.0
b,3.0,,4.0
c,5.0,6.0,
d,,7.0,8.0
e,9.0,10.0,11.0


In [27]:
# Let's explore how to use the two-level index.

display(data['b'])
print("\n")
display(data.loc[['a','e']])

# Can also index from the inner level
print("\n")
display(data.loc['a':'c',2])

print("\n")
display(data.loc[:,3])


1    3
3    4
dtype: int64





a  1     0
   2     1
   3     2
e  1     9
   2    10
   3    11
dtype: int64





a  2    1
c  2    6
dtype: int64





a     2
b     4
d     8
e    11
dtype: int64

In [28]:
# With a DataFrame, both the rows and columns can have a hierarchical index

mframe = pd.DataFrame(np.arange(12).reshape((4,3)), index=[list('aabb'), [1,2,1,2]], columns=[['YYC', 'YYC', 'YVR'], ['green', 'red', 'green']])
mframe.index.names = ['alpha', 'num']
mframe.columns.names = ['province', 'colour']

display(mframe)
display(mframe.index)
display(mframe.columns)

Unnamed: 0_level_0,province,YYC,YYC,YVR
Unnamed: 0_level_1,colour,green,red,green
alpha,num,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           names=['alpha', 'num'])

MultiIndex([('YYC', 'green'),
            ('YYC',   'red'),
            ('YVR', 'green')],
           names=['province', 'colour'])

In [29]:
# We can select groups of columns
display(mframe['YYC'])

# or groups of rows
display(mframe.loc['a'])

# or sub-tables
display(mframe.loc['a','YYC'])

# or an inner level
display(mframe.loc[:,('YYC','green')])
display(mframe.loc[('a',1),:])



Unnamed: 0_level_0,colour,green,red
alpha,num,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


province,YYC,YYC,YVR
colour,green,red,green
num,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,0,1,2
2,3,4,5


colour,green,red
num,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0,1
2,3,4


alpha  num
a      1      0
       2      3
b      1      6
       2      9
Name: (YYC, green), dtype: int64

province  colour
YYC       green     0
          red       1
YVR       green     2
Name: (a, 1), dtype: int64

In [30]:
# The indices can be sorted
display(mframe.sort_index(level=0))
display(mframe.sort_index(level=1))

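# summary stats can be computed according to a particular level
# (the level= argument of sum() was removed in pandas 2.0; groupby replaces it)
display(mframe.groupby(level='alpha').sum())
display(mframe.groupby(level='num').sum())
# for column levels, group the transposed frame and transpose back
display(mframe.T.groupby(level='province', sort=False).sum().T)
display(mframe.T.groupby(level='colour').sum().T)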

Unnamed: 0_level_0,province,YYC,YYC,YVR
Unnamed: 0_level_1,colour,green,red,green
alpha,num,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


Unnamed: 0_level_0,province,YYC,YYC,YVR
Unnamed: 0_level_1,colour,green,red,green
alpha,num,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11



province,YYC,YYC,YVR
colour,green,red,green
alpha,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
a,3,5,7
b,15,17,19



province,YYC,YYC,YVR
colour,green,red,green
num,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16



Unnamed: 0_level_0,province,YYC,YVR
alpha,num,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,1,2
a,2,7,5
b,1,13,8
b,2,19,11



Unnamed: 0_level_0,colour,green,red
alpha,num,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10


In [31]:
# stack() and unstack() reshape columns->rows and rows->columns 
# respectively in a flat (non-hierarchical) DataFrame.
# The result is a Series with a hierarchical index.

df1 = pd.DataFrame(np.arange(6).reshape((2,3)), index=['a','b'], columns=['one', 'two', 'three'])
display(df1)

display(df1.stack())
display(df1.unstack())

# A hierarchically indexed Series can be reshaped into a table via unstack().
# stack().unstack() recovers df1; unstack().unstack() yields its transpose.
display(df1.stack().unstack())
display(df1.unstack().unstack())

# We can also transpose a table
display(df1.transpose())



Unnamed: 0,one,two,three
a,0,1,2
b,3,4,5


a  one      0
   two      1
   three    2
b  one      3
   two      4
   three    5
dtype: int64

one    a    0
       b    3
two    a    1
       b    4
three  a    2
       b    5
dtype: int64

Unnamed: 0,one,two,three
a,0,1,2
b,3,4,5


Unnamed: 0,a,b
one,0,3
two,1,4
three,2,5


Unnamed: 0,a,b
one,0,3
two,1,4
three,2,5
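
The round trips shown above can be checked programmatically. This is a small sketch that rebuilds `df1` as defined in the cell and compares labels and values; the variable names `roundtrip` and `flipped` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape((2, 3)),
                   index=['a', 'b'], columns=['one', 'two', 'three'])

# stack() moves the column labels into an inner row level;
# unstack() moves the innermost row level back into the columns.
roundtrip = df1.stack().unstack()

# unstack() on a flat table puts the column labels on the OUTER level,
# so a second unstack() pivots the original row labels into columns.
flipped = df1.unstack().unstack()

print(roundtrip)
print(flipped)
```

In this example `roundtrip` has the same labels and values as `df1`, while `flipped` matches `df1.transpose()`.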


In [32]:
# We can reshape hierarchical DataFrames as well
display(mframe)

display(mframe.unstack('num'))
display(mframe.unstack('alpha'))

display(mframe.unstack('num').stack('colour'))


Unnamed: 0_level_0,province,YYC,YYC,YVR
Unnamed: 0_level_1,colour,green,red,green
alpha,num,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


province,YYC,YYC,YYC,YYC,YVR,YVR
colour,green,green,red,red,green,green
num,1,2,1,2,1,2
alpha,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3
a,0,3,1,4,2,5
b,6,9,7,10,8,11


province,YYC,YYC,YYC,YYC,YVR,YVR
colour,green,green,red,red,green,green
alpha,a,b,a,b,a,b
num,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3
1,0,6,1,7,2,8
2,3,9,4,10,5,11


Unnamed: 0_level_0,province,YVR,YVR,YYC,YYC
Unnamed: 0_level_1,num,1,2,1,2
alpha,colour,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
a,green,2.0,5.0,0,3
a,red,,,1,4
b,green,8.0,11.0,6,9
b,red,,,7,10
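
The `NaN` entries in the last table arise because the column index has no `('YVR', 'red')` pair: once `colour` becomes a row level, every province needs a value for every colour, and the missing combination is filled with `NaN`. A minimal sketch follows; `frame` is again a stand-in for `mframe`, reconstructed from the displayed output (an assumption).

```python
import numpy as np
import pandas as pd

# Stand-in for mframe (assumed construction, inferred from the displayed output).
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_tuples(
        [('a', 1), ('a', 2), ('b', 1), ('b', 2)], names=['alpha', 'num']),
    columns=pd.MultiIndex.from_tuples(
        [('YYC', 'green'), ('YYC', 'red'), ('YVR', 'green')],
        names=['province', 'colour']),
)

# Move 'num' from the rows to the columns, then 'colour' from the columns to the rows.
reshaped = frame.unstack('num').stack('colour')

# ('YVR', 'green') exists in the source table, ('YVR', 'red') does not.
present = reshaped.loc[('a', 'green'), ('YVR', 1)]
missing = reshaped.loc[('a', 'red'), ('YVR', 1)]
print(present, missing)
```

The ordering of the resulting columns may differ between pandas versions, so the sketch looks values up by label rather than by position.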


### Grouping and Aggregation

- One of the most important features of `pandas` is the ability to group the rows of a table by the values in one or more columns and produce informative summaries that can then be visualized.


- Grouping and aggregation implement the _split-apply-combine_ paradigm.
  - _Split_: split a table into groups based on one or more _keys_.
  - _Apply_: apply a function to each group.
  - _Combine_: combine the results yielding a new table.
  
  
- The relevant tools are the `groupby` method, the optimized built-in aggregations (`count`, `sum`, `mean`, `std`, etc.), and `agg`, which aggregates each group with a user-specified function.

- Helper functions are also provided to produce pivot tables (`pivot_table`) and cross tabulations (`crosstab`).
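
The split-apply-combine steps and the helper functions listed above can be sketched on a tiny table. The `sales` data and its column names are hypothetical, made up purely for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only.
sales = pd.DataFrame({
    'city':   ['YYC', 'YYC', 'YVR', 'YVR', 'YYC'],
    'colour': ['green', 'red', 'green', 'green', 'red'],
    'amount': [10, 20, 30, 40, 50],
})

# Split by city, apply built-in aggregations, combine into one table.
summary = sales.groupby('city')['amount'].agg(['sum', 'mean'])

# agg also accepts a custom function that maps each group to a single value.
spread = sales.groupby('city')['amount'].agg(lambda s: s.max() - s.min())

# Pivot table: one row per city, one column per colour, summed amounts.
pivot = sales.pivot_table(values='amount', index='city', columns='colour',
                          aggfunc='sum', fill_value=0)

# Cross tabulation: how often each (city, colour) pair occurs.
counts = pd.crosstab(sales['city'], sales['colour'])

print(summary, spread, pivot, counts, sep='\n\n')
```

`pivot_table` and `crosstab` are conveniences layered on the same grouping idea: the first aggregates a value column over two keys, the second counts occurrences.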

In [33]:
# Usually we'll have a flat table from which we'd want to choose groups. 

x, y = np.meshgrid(np.arange(3), np.arange(3))
vals = pd.DataFrame({
    'x': x.ravel(),
    'y': y.ravel(),
    'f1': np.random.randn(9),
    'f2': np.random.randn(9),
})
display(vals)

# group f1 according to x
xgroup = vals['f1'].groupby(vals['x'])
print(type(xgroup))
print(xgroup.mean())

# group f1 and f2 according to x and y
xygroup = vals[['f1','f2']].groupby([vals['x'],vals['y']])
print(xygroup.sum())

Unnamed: 0,x,y,f1,f2
0,0,0,-1.099427,-0.003904
1,1,0,-0.297728,0.925884
2,2,0,-2.914717,0.537495
3,0,1,-1.573877,0.346516
4,1,1,-0.173257,0.577865
5,2,1,2.378632,-0.193861
6,0,2,-1.953944,0.679607
7,1,2,0.64723,-0.375465
8,2,2,0.144038,1.597389


<class 'pandas.core.groupby.generic.SeriesGroupBy'>
x
0   -1.542416
1    0.058749
2   -0.130682
Name: f1, dtype: float64
           f1        f2
x y                    
0 0 -1.099427 -0.003904
  1 -1.573877  0.346516
  2 -1.953944  0.679607
1 0 -0.297728  0.925884
  1 -0.173257  0.577865
  2  0.647230 -0.375465
2 0 -2.914717  0.537495
  1  2.378632 -0.193861
  2  0.144038  1.597389


In [34]:
# The following syntax may be more intuitive when the keys are found in the table.
print(vals.groupby(['x','y']).sum())

# To select a particular column or columns, we can index the 
# GroupBy object.
print(vals.groupby(['x'])['f1'].mean())


           f1        f2
x y                    
0 0 -1.099427 -0.003904
  1 -1.573877  0.346516
  2 -1.953944  0.679607
1 0 -0.297728  0.925884
  1 -0.173257  0.577865
  2  0.647230 -0.375465
2 0 -2.914717  0.537495
  1  2.378632 -0.193861
  2  0.144038  1.597389
x
0   -1.542416
1    0.058749
2   -0.130682
Name: f1, dtype: float64
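
The two spellings used so far, grouping a column by an external key and indexing a GroupBy object built from the table, should give the same result. The sketch below checks this on a table shaped like `vals`; the seeded generator `rng` is an addition made here so the example is reproducible, and it does not reproduce the numbers printed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded for reproducibility (not in the original cells)
x, y = np.meshgrid(np.arange(3), np.arange(3))
vals = pd.DataFrame({'x': x.ravel(), 'y': y.ravel(),
                     'f1': rng.standard_normal(9),
                     'f2': rng.standard_normal(9)})

# Group a single column by a key Series ...
via_series = vals['f1'].groupby(vals['x']).mean()

# ... or group the table by column name and then select the column.
via_frame = vals.groupby('x')['f1'].mean()

# A list of column names selects several columns at once.
both = vals.groupby('x')[['f1', 'f2']].mean()

pd.testing.assert_series_equal(via_series, via_frame)
print(both)
```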


In [35]:
# The GroupBy object is not a table. It is, however, iterable: each
# iteration yields a (key, sub-table) pair.

for name, group in vals.groupby('x'):
    print(name)
    print(type(group))
    print(group)

print("\n")    
    
for (name1, name2), group in vals.groupby(['x','y']):
    print((name1, name2))
    print(group)
    

0
<class 'pandas.core.frame.DataFrame'>
   x  y        f1        f2
0  0  0 -1.099427 -0.003904
3  0  1 -1.573877  0.346516
6  0  2 -1.953944  0.679607
1
<class 'pandas.core.frame.DataFrame'>
   x  y        f1        f2
1  1  0 -0.297728  0.925884
4  1  1 -0.173257  0.577865
7  1  2  0.647230 -0.375465
2
<class 'pandas.core.frame.DataFrame'>
   x  y        f1        f2
2  2  0 -2.914717  0.537495
5  2  1  2.378632 -0.193861
8  2  2  0.144038  1.597389


(0, 0)
   x  y        f1        f2
0  0  0 -1.099427 -0.003904
(0, 1)
   x  y        f1        f2
3  0  1 -1.573877  0.346516
(0, 2)
   x  y        f1        f2
6  0  2 -1.953944  0.679607
(1, 0)
   x  y        f1        f2
1  1  0 -0.297728  0.925884
(1, 1)
   x  y        f1        f2
4  1  1 -0.173257  0.577865
(1, 2)
   x  y       f1        f2
7  1  2  0.64723 -0.375465
(2, 0)
   x  y        f1        f2
2  2  0 -2.914717  0.537495
(2, 1)
   x  y        f1        f2
5  2  1  2.378632 -0.193861
(2, 2)
   x  y        f1        f2
8  2  2  0.144038  1.597389
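
Because iteration yields (key, sub-table) pairs, the groups can be collected into a dictionary, and a single group can be fetched with `get_group`. A short sketch on a small hypothetical table (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical data, for illustration only.
vals = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2],
                     'y': [0, 0, 0, 1, 1, 1],
                     'f1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

grouped = vals.groupby('x')

# Collect every group table, keyed by its group name.
pieces = {name: group for name, group in grouped}

# Fetch one group directly, without iterating.
x1 = grouped.get_group(1)

# Number of rows in each group.
sizes = grouped.size()

print(pieces[1])
print(sizes)
```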

In [36]:
# Grouping according to a function is also supported. 
# The function is called once per index label, and rows whose labels
# give the same return value are placed in the same group.

# Suppose we want to group x into evens and odds.

xpgroup = vals.set_index('x').groupby(lambda t: t%2)
for name, group in xpgroup:
    print(name)
    print(group)
    print("\n")
    
print(xpgroup.mean())

0
   y        f1        f2
x                       
0  0 -1.099427 -0.003904
2  0 -2.914717  0.537495
0  1 -1.573877  0.346516
2  1  2.378632 -0.193861
0  2 -1.953944  0.679607
2  2  0.144038  1.597389


1
   y        f1        f2
x                       
1  0 -0.297728  0.925884
1  1 -0.173257  0.577865
1  2  0.647230 -0.375465


     y        f1        f2
x                         
0  1.0 -0.836549  0.493874
1  1.0  0.058749  0.376095
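
Grouping by a function requires the key to live in the index. An alternative that avoids `set_index` is to compute the key as a Series and pass it to `groupby`. The sketch below compares the two on a small hypothetical table (values made up for illustration).

```python
import pandas as pd

# Hypothetical data, for illustration only.
vals = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2],
                     'f1': [1.0, 10.0, 3.0, 4.0, 20.0, 8.0]})

# Function applied to each index label: 0 for even x, 1 for odd x.
by_function = vals.set_index('x').groupby(lambda t: t % 2)['f1'].mean()

# Same grouping, with the key computed up front as a Series.
by_key = vals.groupby(vals['x'] % 2)['f1'].mean()

print(by_function)
print(by_key)
```

Both approaches place the rows with even `x` in group 0 and the rows with odd `x` in group 1; which one reads better depends on whether the key is already the index.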
