# A very brief introduction to Pandas
This notebook covers a few quick examples that show what Pandas DataFrames are and how we can use them to work with data. Pandas is an indispensable library for working with data. It imports and exports data from a wide variety of sources (Excel, CSV, SQL, JSON, HTML, etc.), provides convenient row/column indexing for viewing, selecting, and manipulating the data, and works well with time series data. Its built-in vectorized operations are also usually faster than an equivalent loop you would write yourself in pure Python.

For more resources check out:
- [The official 10-min intro to Pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html)
- A nice [DataCamp tutorial on Pandas](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python)
- The absolutely terrific [Modern Pandas](https://tomaugspurger.github.io/modern-1-intro.html)

By convention, we import Pandas as `pd`.

In [1]:
import pandas as pd
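To make the import/export claim from the introduction concrete, here is a minimal sketch of a CSV to JSON round trip. The `csv_text` string is a made-up stand-in for a real file on disk; `io.StringIO` is used only so the example runs without touching the filesystem.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV text standing in for a file on disk
csv_text = "label,data\nA,0\nB,1\nC,2\n"

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

# Export the same table as JSON text, one object per row
json_text = df_csv.to_json(orient="records")

# Read the JSON text back into a new DataFrame
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(json_text)
```

With a real file you would pass a path instead of the `StringIO` object, for example `pd.read_csv('my_file.csv')`, where the file name is again only a placeholder.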

## Make a simple dataframe

We can create a DataFrame with a dictionary of list or array-like objects. Each item in the dictionary becomes a column, and all items must have the same length.

In [2]:
data = {
    'label': list('ABCABC'),  # list() splits a string into its individual characters
    'data': range(6)
}

data

{'label': ['A', 'B', 'C', 'A', 'B', 'C'], 'data': range(0, 6)}

Jupyter notebooks do a nice job of formatting DataFrames.

In [3]:
df = pd.DataFrame(data)
df

|   | label | data |
|---|-------|------|
| 0 | A | 0 |
| 1 | B | 1 |
| 2 | C | 2 |
| 3 | A | 3 |
| 4 | B | 4 |
| 5 | C | 5 |
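The rule that every column must have the same length can be checked directly. The sketch below builds one DataFrame from a list and a NumPy array, then shows that mismatched lengths raise a `ValueError`. The names `columns`, `frame`, and `error_message` are only illustrative.

```python
import numpy as np
import pandas as pd

# A list and a NumPy array of the same length both work as columns
columns = {"label": list("ABC"), "value": np.arange(3)}
frame = pd.DataFrame(columns)
print(frame)

# Columns of different lengths are rejected with a ValueError
try:
    pd.DataFrame({"label": list("ABC"), "value": [1, 2]})
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```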


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
label    6 non-null object
data     6 non-null int64
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes


## Select portions of the data

Select a single column, which is a Pandas Series, using brackets (`df['name']`), attribute access (`df.name`), or `.loc[]`. Always use `.loc[]` when assigning values to a portion of a DataFrame; the last section of this notebook shows why.

In [5]:
df.loc[:, 'label']

0    A
1    B
2    C
3    A
4    B
5    C
Name: label, dtype: object

In [6]:
df['label']

0    A
1    B
2    C
3    A
4    B
5    C
Name: label, dtype: object

In [7]:
df.label

0    A
1    B
2    C
3    A
4    B
5    C
Name: label, dtype: object

In [8]:
type(df.label)

pandas.core.series.Series
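The three access styles above are not fully interchangeable. As a small sketch (the `df_demo` frame and its `count` column are made up for illustration), attribute access breaks down when a column name collides with an existing DataFrame method, while brackets always work. Passing a list of names to the brackets returns a DataFrame rather than a Series.

```python
import pandas as pd

# Hypothetical frame with a column named like a DataFrame method
df_demo = pd.DataFrame({"label": list("ABC"), "count": [1, 2, 3]})

# Brackets and attribute access agree for an ordinary column name
same = df_demo["label"].equals(df_demo.label)

# df_demo.count is the DataFrame.count method, not the 'count' column
is_method = callable(df_demo.count)

# Brackets always return the column itself
count_column = df_demo["count"]

# A list of names selects several columns and returns a DataFrame
two_columns = df_demo[["label", "count"]]

print(same, is_method, type(two_columns).__name__)
```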

### Slice the DataFrame

In [9]:
df.loc[1:4, 'data']

1    1
2    2
3    3
4    4
Name: data, dtype: int64
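Note that the `.loc[1:4]` slice above includes row 4. A short sketch of the difference between label-based `.loc[]` (end label included) and position-based `.iloc[]` (end position excluded, like ordinary Python slicing), using a fresh copy of the same data under the assumed name `df_slice`:

```python
import pandas as pd

df_slice = pd.DataFrame({"label": list("ABCABC"), "data": range(6)})

# .loc slices by index label and includes both endpoints: rows 1, 2, 3, 4
by_label = df_slice.loc[1:4, "data"]

# .iloc slices by integer position and excludes the end: rows 1, 2, 3
# (column position 1 is 'data')
by_position = df_slice.iloc[1:4, 1]

print(by_label.tolist())
print(by_position.tolist())
```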

## Perform string operations on a column

In [10]:
df['label'].str.lower()

0    a
1    b
2    c
3    a
4    b
5    c
Name: label, dtype: object
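The `.str` accessor exposes many of Python's string methods element by element. A minimal sketch with a made-up Series `fruit`, including a boolean result that can be used to filter rows:

```python
import pandas as pd

# Hypothetical example values
fruit = pd.Series(["Apple", "banana", "Cherry"])

upper = fruit.str.upper()           # element-wise upper-casing
lengths = fruit.str.len()           # length of each string
has_an = fruit.str.contains("an")   # True where the substring occurs

# A boolean Series can be used directly to filter the original
matches = fruit[has_an]

print(upper.tolist())
print(lengths.tolist())
print(matches.tolist())
```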

## Perform math operations

In [11]:
df['data'] * 2

0     0
1     2
2     4
3     6
4     8
5    10
Name: data, dtype: int64

In [12]:
df['data'].sum()

15

In [13]:
df['data'].mean()

2.5
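Math operations also combine naturally with column assignment and comparisons. The sketch below, using a fresh copy of the data under the assumed name `df_math`, stores a computed column and uses a comparison to select rows above the mean.

```python
import pandas as pd

df_math = pd.DataFrame({"label": list("ABCABC"), "data": range(6)})

# Store the result of an element-wise operation as a new column
df_math["squared"] = df_math["data"] ** 2

# Comparisons are element-wise too and produce a boolean Series
above_mean = df_math["data"] > df_math["data"].mean()

# Use the boolean Series to keep only the matching rows
large_rows = df_math.loc[above_mean]

# Several summary statistics at once
stats = df_math["data"].agg(["min", "max", "sum"])

print(large_rows)
print(stats)
```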

## Find unique values

In [14]:
df['label'].unique()

array(['A', 'B', 'C'], dtype=object)
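Two closely related methods are often useful alongside `unique()`: `nunique()` counts the distinct values and `value_counts()` counts how often each one occurs. A small sketch on the same labels, with `labels` as an assumed variable name:

```python
import pandas as pd

labels = pd.Series(list("ABCABC"), name="label")

# Number of distinct values
n_distinct = labels.nunique()

# How many times each value appears
counts = labels.value_counts()

print(n_distinct)
print(counts)
```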

## Group data
`groupby` splits a single DataFrame into groups based on the unique values in one or more columns. You can then specify an operation to apply to each group. Pandas applies the operation, recombines the results, and returns them.

Because `groupby` splits the data with very little code, it can also be a convenient way to access filtered portions of a large DataFrame.

In [15]:
df.groupby('label')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x11151a320>

In [16]:
list(df.groupby('label'))[0]

('A',   label  data
 0     A     0
 3     A     3)

In [17]:
list(df.groupby('label'))[0][0]

'A'

In [18]:
list(df.groupby('label'))[0][-1]

|   | label | data |
|---|-------|------|
| 0 | A | 0 |
| 3 | A | 3 |


In [19]:
df.groupby('label').sum()

| label | data |
|-------|------|
| A | 3 |
| B | 5 |
| C | 7 |
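Converting the groupby object to a list works for inspection, but a groupby object can also be iterated directly, and `get_group()` returns a single group by its key. The sketch below shows both, plus `agg()` for computing several statistics per group in one call. It rebuilds the data under the assumed name `df_groups`.

```python
import pandas as pd

df_groups = pd.DataFrame({"label": list("ABCABC"), "data": range(6)})

# Iterating yields (group key, sub-DataFrame) pairs
names = []
for name, group in df_groups.groupby("label"):
    names.append(name)
    print(name, group["data"].tolist())

# Fetch one group by its key without building a list of all groups
group_a = df_groups.groupby("label").get_group("A")

# Compute several statistics per group in one call
summary = df_groups.groupby("label")["data"].agg(["sum", "mean", "count"])
print(summary)
```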


## Why you should use `.loc[]`

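Chained indexing such as `df[mask]['col'] = value` first builds a temporary copy of the selected rows and then assigns into that copy, so the original DataFrame is left unchanged. A single `.loc[rows, columns]` call selects rows and columns in one step and assigns into the original DataFrame. See the [Setting with copy section](https://tomaugspurger.github.io/modern-1-intro.html) of Modern Pandas for a more detailed explanation.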

In [20]:
df1 = pd.DataFrame(data)
df1

|   | label | data |
|---|-------|------|
| 0 | A | 0 |
| 1 | B | 1 |
| 2 | C | 2 |
| 3 | A | 3 |
| 4 | B | 4 |
| 5 | C | 5 |


In [21]:
df1[df1['label'] == 'A']['data'] = df1[df1['label'] == 'A']['data'] + 2

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [22]:
df1

|   | label | data |
|---|-------|------|
| 0 | A | 0 |
| 1 | B | 1 |
| 2 | C | 2 |
| 3 | A | 3 |
| 4 | B | 4 |
| 5 | C | 5 |


In [23]:
df1[df1['label'] == 'A']['data']

0    0
3    3
Name: data, dtype: int64

In [24]:
df1.loc[df1['label'] == 'A', 'data'] = df1.loc[df1['label'] == 'A', 'data'] + 2

In [25]:
df1

|   | label | data |
|---|-------|------|
| 0 | A | 2 |
| 1 | B | 1 |
| 2 | C | 2 |
| 3 | A | 5 |
| 4 | B | 4 |
| 5 | C | 5 |
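The `.loc[]` assignment above can be written a little more compactly by naming the boolean mask and using `+=`. The sketch below also shows the complementary case: when you do want an independent subset to modify, calling `.copy()` makes that intent explicit. The names `df2`, `mask`, and `subset` are only illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({"label": list("ABCABC"), "data": range(6)})

# Name the row selection once and reuse it
mask = df2["label"] == "A"

# In-place update of the selected rows in the original DataFrame
df2.loc[mask, "data"] += 2
print(df2)

# An explicit copy is independent: changing it leaves df2 untouched
subset = df2.loc[mask].copy()
subset["data"] = 0
print(df2.loc[mask, "data"].tolist())
```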
