# Pandas Tips & Tricks

This notebook collects tricks for manipulating data that are typically not obvious to newcomers to Pandas and data science.

In [1]:
import numpy as np
import pandas as pd

## Reading Data

### Combining Multiple CSV Files

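This example builds a single dataframe from multiple files that share the same structure (columns). The cell also compares the result with `wc -l`: the line counts include one header line per file, so 23 lines across two files yield 21 rows.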

In [2]:
import glob

files = '../data/commits*.tsv'
df = pd.concat([pd.read_csv(x, sep='\t', parse_dates=['Date']) for x in glob.glob(files)], 
               ignore_index=True)

!wc -l {files}
print('♯', len(df))

  11 ../data/commits1.tsv
  12 ../data/commits2.tsv
  23 total
♯ 21
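
The same consistency check can be done in pure Python, without the shell. The following is a minimal sketch under stated assumptions: the file contents and the shortened column set are hypothetical stand-ins written to a temporary directory, not the actual `commits*.tsv` data.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the commits*.tsv files
samples = {
    "commits1.tsv": "SHA\tAuthor\na1b2c3d\tjhermann\nb2c3d4e\tjhermann\n",
    "commits2.tsv": "SHA\tAuthor\nc3d4e5f\tjhermann\n",
}

with tempfile.TemporaryDirectory() as tmp:
    for name, text in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    paths = sorted(glob.glob(os.path.join(tmp, "commits*.tsv")))
    combined = pd.concat(
        [pd.read_csv(path, sep="\t") for path in paths],
        ignore_index=True,
    )

    # Count physical lines per file, as `wc -l` would
    total_lines = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            total_lines += sum(1 for _ in handle)

# Each file contributes one header line that does not become a row
expected_rows = total_lines - len(paths)
print(len(combined), expected_rows)
```

Sorting the `glob` results is a design choice here: `glob.glob` makes no ordering guarantee, so sorting keeps the row order of the combined frame reproducible.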


In [3]:
# Two equivalent ways to count the distinct authors
len(df.Author.unique()), len(pd.unique(df.Author))

(2, 2)
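
A third option is `Series.nunique()`, which returns the count directly. It differs from `len(... .unique())` when values are missing, as this small sketch shows; the `authors` series is made up for illustration and is not part of the commit data.

```python
import numpy as np
import pandas as pd

# Hypothetical author column with one missing entry
authors = pd.Series(["jhermann", "jhermann", "octocat", np.nan])

distinct_with_nan = len(authors.unique())          # NaN counts as its own value
distinct_default = authors.nunique()               # NaN is skipped by default
distinct_keep_nan = authors.nunique(dropna=False)  # NaN is counted again

print(distinct_with_nan, distinct_default, distinct_keep_nan)
```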

In [None]:
df

## Inspecting Dataframes

Inspecting the contents and metadata of your dataframes helps you understand the data they represent, so that you can transform it into the results you need.

In [4]:
# Data dimensions (rows, cols)
df.shape

(21, 4)

In [5]:
# Data types
df.dtypes

SHA                                      object
Author                                   object
Date       datetime64[ns, pytz.FixedOffset(60)]
Message                                  object
dtype: object

When looking at a sample, it often helps to transpose the data, especially when there are many columns.

In [6]:
print(df.head(2).transpose())

                                           0                            1
SHA                                  8fe0ea9                      9de3457
Author                              jhermann                     jhermann
Date               2019-02-16 06:04:19+01:00    2019-02-16 05:57:50+01:00
Message  :link: Python Data Science Handbook  add requirements for Binder
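
The benefit is easier to see on a frame with more columns. Here is a short sketch using a hypothetical 12-column frame (the `col_NN` names are invented for this example); `.T` is the shorthand for `.transpose()`.

```python
import pandas as pd

# Hypothetical wide frame: 3 rows, 12 columns named col_00 ... col_11
wide = pd.DataFrame({f"col_{i:02d}": [i, i + 100, i + 200] for i in range(12)})

# Each original column becomes a row, so the sample reads top to bottom
sample = wide.head(2).T

print(sample.shape)
print(sample.head())
```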


And then there is `describe` with some core statistics about the dataframe…

In [7]:
print(df.describe().transpose())

        count unique                                              top freq  \
SHA        21     21                                          6454187    1   
Author     21      2                                         jhermann   20   
Date       21     21                        2019-02-15 21:23:05+01:00    1   
Message    21     21  link to Netflix 'Notebook Innovation' blog post    1   

                             first                       last  
SHA                            NaN                        NaN  
Author                         NaN                        NaN  
Date     2019-02-11 16:06:17+01:00  2019-02-16 06:04:19+01:00  
Message                        NaN                        NaN  
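
The exact rows that `describe` reports can vary between pandas versions, particularly for datetime columns. As a hedged sketch, passing `include="all"` explicitly requests a summary for every column, and plain column-level calls recover individual facts regardless of layout. The frame below is a made-up miniature, not the commit data.

```python
import pandas as pd

# Hypothetical miniature of the commits frame
frame = pd.DataFrame({
    "Author": ["jhermann", "jhermann", "octocat"],
    "Date": pd.to_datetime([
        "2019-02-11 16:06:17",
        "2019-02-15 21:23:05",
        "2019-02-16 06:04:19",
    ]),
    "Message": ["first", "second", "third"],
})

# Ask for every column, not only the ones pandas picks by default
summary = frame.describe(include="all")
print(summary.transpose())

# Column-level calls give the same facts independent of describe's layout
earliest, latest = frame["Date"].min(), frame["Date"].max()
top_author = frame["Author"].value_counts().idxmax()
print(earliest, latest, top_author)
```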


… and `info` with more technical information.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 4 columns):
SHA        21 non-null object
Author     21 non-null object
Date       21 non-null datetime64[ns, pytz.FixedOffset(60)]
Message    21 non-null object
dtypes: datetime64[ns, pytz.FixedOffset(60)](1), object(3)
memory usage: 752.0+ bytes
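
The trailing `+` in the memory figure indicates a lower bound: for `object` columns, the default only counts the references, not the Python strings behind them. A minimal sketch of requesting the deep measurement follows, using a hypothetical two-row frame; the actual numbers depend on the platform and pandas version.

```python
import io

import pandas as pd

# Hypothetical frame with string columns
frame = pd.DataFrame({
    "SHA": ["a1b2c3d", "b2c3d4e"],
    "Message": [
        "add requirements for Binder",
        "a considerably longer commit message " * 4,
    ],
})

# Shallow: size of the column buffers only; deep: also inspects the objects
shallow = frame.memory_usage(index=True).sum()
deep = frame.memory_usage(index=True, deep=True).sum()
print(shallow, deep)

# info() accepts the same switch; buf= captures the report as text
buffer = io.StringIO()
frame.info(memory_usage="deep", buf=buffer)
report = buffer.getvalue()
print(report)
```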
