In [2]:
import seaborn as sns
import pandas as pd
import numpy as np

# Pandas Misc Useful Functions

Pandas is massive, with hundreds of functions. We are not going to go over all of them here, but I'll show you a handful of the most useful ones.


This time most functions come with a link to their documentation, so let's just jump right in.

In [3]:
tips = sns.load_dataset('tips')
tips.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


## Sample

Pretty useful. It lets you draw random rows from a dataframe in several ways: a fixed number of rows, a fraction of the rows, with or without replacement, and optionally weighted.

http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selecting-random-samples

In [4]:
tips.sample(5)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
137,14.15,2.0,Female,No,Thur,Lunch,2
159,16.49,2.0,Male,No,Sun,Dinner,4
203,16.4,2.5,Female,Yes,Thur,Lunch,2
216,28.15,3.0,Male,Yes,Sat,Dinner,5
32,15.06,3.0,Female,No,Sat,Dinner,2


In [5]:
tips.sample?
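
The main options can be sketched on a small hand-made frame. The frame `df` and its values are made up for illustration; `n`, `frac`, `replace`, and `random_state` are parameters of `DataFrame.sample`.

```python
import pandas as pd

# Small stand-in frame; the values are invented for this example.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "day": ["Sun", "Sun", "Sat", "Sat", "Thur", "Fri"],
})

# A fixed number of rows; random_state makes the draw reproducible.
three_rows = df.sample(n=3, random_state=0)

# A fraction of the rows instead of a count (half of the frame here).
half = df.sample(frac=0.5, random_state=0)

# Sampling with replacement can return more rows than the frame holds.
bootstrap = df.sample(n=10, replace=True, random_state=0)

# frac=1 returns every row exactly once, in shuffled order.
shuffled = df.sample(frac=1, random_state=0)
```

Setting `random_state` is worth the habit whenever someone else needs to reproduce your sample.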

## isin

The next useful function is isin. It is applied to an entire column and returns a boolean Series, which makes it very handy for selecting specific rows.

http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-with-isin

In [8]:
is_weekend = tips.day.isin(['Sat', 'Sun']).sample(5)
is_weekend

107     True
217     True
193    False
226    False
214     True
Name: day, dtype: bool

In [10]:
tips[tips.day.isin(['Sat', 'Sun'])].sample(5)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
44,30.4,5.6,Male,No,Sun,Dinner,4
61,13.81,2.0,Male,Yes,Sat,Dinner,2
240,27.18,2.0,Female,Yes,Sat,Dinner,2
168,10.59,1.61,Female,Yes,Sat,Dinner,2
103,22.42,3.48,Female,Yes,Sat,Dinner,2
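
Because the result is a boolean mask, it can be negated with `~` or combined with other conditions. A minimal sketch on a made-up frame (the names `df`, `weekend_mask`, and the values are only for illustration):

```python
import pandas as pd

# Invented data, just enough rows to show the masks.
df = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sun", "Sat"],
    "tip": [2.0, 3.0, 1.5, 4.0, 2.5],
})

# True where the day is one of the listed values.
weekend_mask = df["day"].isin(["Sat", "Sun"])

weekend = df[weekend_mask]    # rows that match
weekday = df[~weekend_mask]   # ~ flips the mask: rows that do NOT match

# Masks combine with & and |; wrap each comparison in parentheses.
big_weekend_tips = df[weekend_mask & (df["tip"] > 2.0)]
```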


## drop_duplicates

This one is useful in a lot of situations. It removes repeated rows, and it can compare on a single column or on a combination of several columns.

http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#duplicate-data

In [11]:
tips[['time', 'day']].drop_duplicates(keep='first')

Unnamed: 0,time,day
0,Dinner,Sun
19,Dinner,Sat
77,Lunch,Thur
90,Dinner,Fri
220,Lunch,Fri
243,Dinner,Thur
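
If you want to keep all the columns but only compare on some of them, `subset` does that, and `keep` decides which of the repeated rows survives. A small sketch with made-up data:

```python
import pandas as pd

# Invented rows: (Dinner, Sun) and (Lunch, Thur) each appear twice.
df = pd.DataFrame({
    "time": ["Dinner", "Dinner", "Lunch", "Dinner", "Lunch"],
    "day": ["Sun", "Sun", "Thur", "Sat", "Thur"],
    "tip": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Compare only on time and day, but keep every column in the result.
first_seen = df.drop_duplicates(subset=["time", "day"], keep="first")
last_seen = df.drop_duplicates(subset=["time", "day"], keep="last")

# keep=False drops every row that has a duplicate, leaving only unique combinations.
unique_only = df.drop_duplicates(subset=["time", "day"], keep=False)
```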


## cut

This will cut your numeric data into equal-width buckets and then assign each value a label depending on its bucket. Pretty useful, and if you want buckets that hold roughly the same number of rows instead, you can use qcut.

http://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#tiling

In [12]:
pd.cut(tips['total_bill'], 3, labels=['low', 'mid', 'high']).head()

0    low
1    low
2    mid
3    mid
4    mid
Name: total_bill, dtype: category
Categories (3, object): [low < mid < high]
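
The difference between the two shows up as soon as the data has an outlier. In this sketch (the `bills` values are made up), `pd.cut` splits the value range into three equal-width intervals, while `pd.qcut` splits at quantiles so each bucket gets about the same number of rows:

```python
import pandas as pd

# Invented bills with one large outlier at the end.
bills = pd.Series([5, 10, 12, 15, 18, 20, 25, 30, 100])
labels = ["low", "mid", "high"]

# Equal-width bins: the outlier stretches the range, so most values land in "low".
equal_width = pd.cut(bills, 3, labels=labels)

# Quantile-based bins: the edges move so the rows are spread evenly.
equal_count = pd.qcut(bills, 3, labels=labels)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```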

## str

The str functions are really useful and there are a ton of them. If you ever need to perform a string operation on a column, look here first.

http://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

In [13]:
tips.sex.str.lower().head()

0    female
1      male
2      male
3      male
4    female
Name: sex, dtype: object
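
The str methods chain, and they pass missing values through instead of raising. A minimal sketch on a made-up, messy column (the `names` Series is only for illustration):

```python
import pandas as pd

# Invented messy labels, including stray whitespace and a missing value.
names = pd.Series(["  Female", "MALE ", "male", None])

# Chain methods: trim whitespace, then lowercase. The missing value stays missing.
cleaned = names.str.strip().str.lower()

# na=False treats missing values as "no match", so the result is a clean boolean mask.
is_male = cleaned.str.startswith("m", na=False)

# .str also supports indexing and slicing on each element.
first_letter = cleaned.str[0].str.upper()
```

Note that `startswith` is used rather than `contains("male")`, since "female" also contains "male".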

## NaNs

There are three functions for missing values that are pretty useful:

* isna
* fillna
* dropna

They are all pretty self-explanatory, but it is nice to know that they exist.

In [16]:
tips.isna().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [17]:
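# Assign the result back; calling fillna(inplace=True) on a selected column
# is chained assignment and may not update tips in recent pandas versions.
tips['tip'] = tips['tip'].fillna(0)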

In [18]:
tips.dropna(axis=1, how='any', inplace=True)
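
The tips dataset has no missing values, so the cells above do not change anything. Here is a sketch on a small made-up frame that actually contains NaNs, to show what each of the three functions does:

```python
import numpy as np
import pandas as pd

# Invented frame with one missing tip and one missing day.
df = pd.DataFrame({
    "tip": [1.0, np.nan, 3.0],
    "day": ["Sun", None, "Sat"],
    "size": [2, 3, 4],
})

# isna gives a boolean frame; summing it counts the missing values per column.
missing_per_column = df.isna().sum()

# fillna accepts a dict, so each column can get its own fill value.
filled = df.fillna({"tip": 0, "day": "unknown"})

# dropna removes rows by default; axis=1 removes columns instead.
rows_dropped = df.dropna()
cols_dropped = df.dropna(axis=1)
```

None of these calls modify `df`; each returns a new frame.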

## corr

Calculates the pairwise correlation between numeric columns. Pretty straightforward.

http://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#correlation

In [19]:
tips[['tip', 'total_bill']].corr('pearson')

Unnamed: 0,tip,total_bill
tip,1.0,0.675734
total_bill,0.675734,1.0
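
Pearson measures linear association. If the relationship is monotonic but not linear, the rank-based Spearman method may describe it better. A small sketch with made-up columns `x` and `y`, where `y` is just `x` squared:

```python
import pandas as pd

# Invented data: y grows with x, but not in a straight line.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 2

# Pearson is high but below 1 because the relationship is curved.
pearson = df.corr(method="pearson").loc["x", "y"]

# Spearman only looks at the ordering, which is identical for x and y.
spearman = df.corr(method="spearman").loc["x", "y"]

# A single pair of columns can also be compared directly (Pearson by default).
single_pair = df["x"].corr(df["y"])
```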


## rank

This will calculate the rank of each entry within its column. Tied values share the average of their ranks by default.

http://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#data-ranking

In [20]:
tips.tip.rank().head()

0      5.0
1     33.0
2    177.0
3    165.0
4    185.0
Name: tip, dtype: float64
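
How ties are handled is controlled by `method`, and `ascending` and `pct` change the direction and scale. A minimal sketch on made-up values with one tie:

```python
import pandas as pd

# Invented tips; 3.0 appears twice so there is a tie.
tip_values = pd.Series([1.0, 3.0, 2.0, 3.0])

# Default: tied values share the average of the ranks they would occupy (3 and 4).
average_rank = tip_values.rank()

# method="min": tied values all get the lowest of those ranks.
min_rank = tip_values.rank(method="min")

# method="dense" leaves no gaps; ascending=False gives rank 1 to the largest value.
dense_desc = tip_values.rank(method="dense", ascending=False)

# pct=True rescales the ranks to the range (0, 1].
pct_rank = tip_values.rank(pct=True)
```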

## rename

Rename, while not strictly needed, is a nice convenience function. You can rename columns or index labels.

http://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#renaming-mapping-labels

In [21]:
tips.rename(columns={'total_bill': 'bill'}, inplace=True)
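
Without `inplace=True`, rename returns a new frame and leaves the original alone. It also accepts a function instead of a dict. A small sketch on a made-up frame:

```python
import pandas as pd

# Invented two-row frame.
df = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

# A dict maps old names to new ones; columns not listed stay as they are.
renamed = df.rename(columns={"total_bill": "bill"})

# A function is applied to every column name.
upper = df.rename(columns=str.upper)

# Index labels are renamed the same way.
relabeled = df.rename(index={0: "first", 1: "second"})
```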

## itertuples

There are a couple of iterators for dataframes. I would strongly caution you not to use them unless you are really sure you know what you are doing. They are slow compared to vectorized functions, but when working with a small dataframe they can be really useful.

http://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#itertuples

In [22]:
for tup in tips.itertuples():
    print(tup)
    break

Pandas(Index=0, bill=16.99, tip=1.01, sex='Female', smoker='No', day='Sun', time='Dinner', size=2)
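
To make the caution concrete, here is the same calculation done both ways on a made-up frame. Each row from itertuples is a namedtuple, so columns are available as attributes. The column operation gives the same numbers and is usually much faster on large frames:

```python
import pandas as pd

# Invented bills and tips.
df = pd.DataFrame({
    "bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
})

# Row by row: index=False leaves the index out of each tuple.
pct_loop = []
for row in df.itertuples(index=False):
    pct_loop.append(row.tip / row.bill)

# Vectorized: one operation over whole columns, no Python-level loop.
pct_vectorized = df["tip"] / df["bill"]
```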


## Conclusion

I hope this has been interesting. These are the functions that I use most (other than the functions I demonstrated in the other notebooks).

There are a couple of other topics that I have not gone over, but if I get enough interest I'd be happy to make notebooks on:


* timeseries 
* io
* performance

Please let me know if these interest you! And at this point you should be ready for all the exercises listed [here](https://github.com/guipsamora/pandas_exercises#merge)

In [23]:
ls

Combining DataFrames.ipynb             README.md
Group Operations.ipynb                 Row-Column Transformations.ipynb
Indexing and Selecting.ipynb           env/
Misc Functions.ipynb                   requirements.txt
Pandas Intro to Data Structures.ipynb
