# Useful Pandas Snippets

- From [Pandas Snippets](https://github.com/siebenrock/pandas-snippets) by [Kai Siebenrock](https://github.com/siebenrock)

Updated by [John Fogarty](https://github.com/jfogarty) for Python 3.6, [Base2 MLI](https://github.com/base2solutions/mli), and standalone evaluation in [Colab](https://colab.research.google.com).

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Importing Data

<figure>
    <img src="../images/titanic_ship.jpeg" />
    <figcaption></figcaption>
</figure>

In [76]:
#@title Nasty File Transfer Utility Tools
import numpy as np
import requests
import shutil
import gzip
import os
from bs4 import BeautifulSoup

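# np.DataSource was removed from the top-level namespace in NumPy 2.0;
# the class is still reachable as np.lib.npyio.DataSource.
ds = np.lib.npyio.DataSource()
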
def copyHere(URL, toPath, quiet=False):
    toDir, toFile = os.path.split(toPath)
    toPath = os.path.join(toDir, toFile)
    if os.path.exists(toPath):
        if not quiet:
            print(f"- Skipped copy of existing file {toPath}.")
    else:
        if ds.exists(URL):
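            if not toFile:
                # No file name given: reuse the last path component of the URL.
                urlPrefix, toFile = os.path.split(URL)
                toPath = os.path.join(toDir, toFile)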
            response = requests.get(URL, stream=True)
            response.raw.decode_content = True  # undo gzip/deflate transfer encoding automatically.
            response.encoding = 'utf-8'
            if toDir and not os.path.exists(toDir):
                print(f"- Creating directory '{toDir}'.")
                os.makedirs(toDir)
            with open(toPath, 'wb') as f:
                shutil.copyfileobj(response.raw, f)
            if not quiet:
                print(f"- Downloaded {URL}.")
            gzipped = False
            # If the file sent is gzipped, unpack it anyway.
            with open(toPath, 'rb') as fin:
                prefix = fin.read(2) 
                gzipped = prefix == b'\x1f\x8b'
            if gzipped:
                gzPath = toPath + '.gz'               
                if os.path.exists(gzPath):
                    os.remove(gzPath)
                os.rename(toPath, gzPath)
                with gzip.open(gzPath) as gz:
                    with open(toPath, 'wb') as fout:
                        shutil.copyfileobj(gz, fout)
                if not quiet:
                    print(f"- Unpacked gzipped file '{gzPath}' to '{toPath}'.")
            elif not quiet:
                # Honor the quiet flag here as well.
                print(f"- Installed locally as '{toPath}'.")
        else:
            print(f"** Sorry, can't copy '{URL}' to '{toPath}'.")

In [0]:
import os
REPODATA='https://github.com/plotly/datasets/blob/master/titanic.csv'
RAWDATA='https://raw.githubusercontent.com/plotly/datasets/master/titanic.csv'
filename='titanic.csv'
TMPDATA='./tmpData'
os.makedirs(TMPDATA, exist_ok=True)
datafile=os.path.join(TMPDATA, filename)
!curl $RAWDATA -o $datafile

In [0]:
copyHere(RAWDATA, datafile)

Read from CSV file

In [0]:
df = pd.read_csv(datafile)

df.describe()
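
`read_csv` can also restrict and type columns while loading. The following is a minimal sketch on a hypothetical in-memory sample (not the Titanic file), so it runs without a download; the column names and values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for a CSV file on disk.
csv_text = """Name,Age,Fare
Marie,32,7.25
John,,71.28
Max,27,8.05
"""

# usecols limits which columns are parsed; dtype fixes a column's type up front.
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Name", "Fare"],
    dtype={"Fare": float},
)
print(sample.shape)
```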

In [0]:
df.head()

In [0]:
df

## Creating Data

Using the DataFrame constructor

In [0]:
pd.DataFrame({'Name':['Marie', 'John', 'Max', 'Jane'],
              'Age':[32, 28, 27, 33]}, 
             index=['rank1','rank2','rank3','rank4'])

In [0]:
pd.DataFrame(np.random.randint(low=0, high=100, size=(5, 5)), 
             columns=['A', 'B', 'C', 'D', 'E'])

Using list comprehension

In [0]:
# Avoid naming the variable `list`, which would shadow the built-in type.
squares = [x**2 for x in range(10)]

In [0]:
[x for x in squares if x % 2 == 0]

## Cleaning

Drop rows with NaN in Fare

In [0]:
df.dropna(subset=["Fare"], inplace=True)

Return rows with a null Fare

In [0]:
df[df['Fare'].isnull()]
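
The two cells above are order-sensitive: once rows with a missing Fare have been dropped, the null filter returns an empty frame. A small sketch on a hypothetical toy frame (made-up values) shows inspecting nulls first and dropping afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({"Fare": [7.25, np.nan, 8.05], "Age": [22.0, 38.0, np.nan]})

# Count missing values per column before changing anything.
null_counts = toy.isnull().sum()

# Inspect the affected rows, then drop them into a new frame.
rows_with_null_fare = toy[toy["Fare"].isnull()]
cleaned = toy.dropna(subset=["Fare"])
```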

Upper case all column names

In [0]:
df.columns = df.columns.str.upper()
df.head()

Rename columns

In [0]:
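# The columns were upper-cased in the previous cell, so the keys must match that case;
# rename() silently ignores keys that are not present.
df = df.rename(columns={
    'PCLASS': 'Class',
    'NAME': 'Full Name',
})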

Alternatively, assign the full list of column names

In [0]:
df.columns = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
df.columns = ['Id', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard', 
              'Parents/Children Aboard', 'Ticket', 'Fare', 'Cabin', 'Embarked']

Filter columns containing "Aboard"

In [0]:
df_aboard = df.loc[:, df.columns.str.contains('Aboard')]

In [0]:
df_aboard.head()

Replace strings in column

In [0]:
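# regex=False makes 'Mr.' a literal string; as a regex the dot would match any character.
df['Sex'] = df['Sex'].str.replace('Mr.', 'Mister', regex=False)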
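
Whether the pattern is treated as a regular expression changes the result, and the default for `regex` has differed between pandas versions, so passing it explicitly is the safer habit. A short sketch on hypothetical names:

```python
import pandas as pd

# Hypothetical names chosen to expose the difference.
names = pd.Series(["Mr. Owen Harris", "Mrs. Florence Briggs", "Mrx Example"])

# Literal match: only the exact text "Mr." is replaced.
literal = names.str.replace("Mr.", "Mister", regex=False)

# Regex match: "." matches any character, so "Mrs" and "Mrx" are hit too.
pattern = names.str.replace("Mr.", "Mister", regex=True)
```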

Remove if contains character

In [0]:
files = ['afile', 'bfile', 'not~mefile', 'cfile']
notfiles = [file for file in files if "~" not in file]
notfiles

Remove based on multiple values

In [0]:
df = df[~df['Name'].isin(['Invalid', 'Unknown'])]
df

Change type

In [0]:
df['Fare'] = df['Fare'].astype(float)

Reset index

In [0]:
df.reset_index(drop=True, inplace=True)

Convert to lower case

In [0]:
df['Sex'] = df['Sex'].str.lower()

Scale a column (returns a new Series; df is unchanged)

In [0]:
df["Pclass"] * 1000

In [0]:
df.head()

Deleting columns

In [0]:
del df['Siblings/Spouses Aboard']
del df['Parents/Children Aboard']
df.head()

## Exploring

Number of rows

In [0]:
len(df.index)

Get info

In [0]:
df.info()

Describe data

In [0]:
df.describe()

Select two columns

In [0]:
df[['Name', 'Fare']].head()

Get titles

In [0]:
df["Title"] = df["Name"].str.split(" ").str[0]

Find records with no specified Age

In [0]:
df[df['Age'].isna()]

Looking only at males

In [0]:
df[df['Sex'] == 'male'].head()

Looking only at males who survived

In [0]:
df[(df['Sex'] == 'male') & (df['Survived'] == 1)].head()

Looking only at males above the age of 50 who survived

In [0]:
df[(df['Sex'] == 'male') & (df['Survived'] == 1) & (df['Age'] > 50)].head()

Set column value based on other columns

In [0]:
df['Note'] = None  # object dtype, so strings can be assigned later without a dtype warning

In [0]:
df.loc[(df['Sex'] == 'male') & (df['Survived'] == 1) & (df['Age'] > 50), 
       ['Note']] = 'Male Above 50 Survived'

In [0]:
df['Note'].sort_values()[:3]
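
As an alternative to the `.loc` assignment, `np.where` can fill the column in one step when every row should get a value. This is a sketch on a hypothetical toy frame, and the empty-string fallback is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic frame.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Survived": [1, 1, 0],
    "Age": [62, 58, 71],
})

# Build the boolean mask once, then choose a label per row.
mask = (toy["Sex"] == "male") & (toy["Survived"] == 1) & (toy["Age"] > 50)
toy["Note"] = np.where(mask, "Male Above 50 Survived", "")
```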

Number of men who survived

In [0]:
len(df[(df['Sex'] == 'male') & (df['Survived'] == 1)])

Average age of men who survived

In [0]:
df[(df['Sex'] == 'male') & (df['Survived'] == 1)]['Age'].mean()

Filter by multiple values

In [0]:
df[df["Name"].isin(["Mr. Charles Eugene Williams", "Mr. Lawrence Beesley"])]

Highest fare paid

In [0]:
df.loc[df['Fare'].idxmax()]

Sorting

In [0]:
df.sort_values('Fare', ascending=False).head()

Sort by multiple columns

In [0]:
df.sort_values(['Fare', 'Age'], ascending=[False, True]).head()

Unique class values

In [0]:
df['Pclass'].unique()  # use .nunique() for the number of classes

Count of each class

In [0]:
df['Pclass'].value_counts()

Find duplicates: [pandas.DataFrame.duplicated](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html)

In [0]:
df[df.duplicated(['Name'], keep=False)]
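
The `keep` argument decides which members of a duplicate group are flagged. A small sketch on a hypothetical toy frame compares the default with `keep=False` and shows `drop_duplicates` for removing the repeats.

```python
import pandas as pd

# Hypothetical toy data: "Ann" appears twice.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cy"],
    "Fare": [7.0, 8.0, 9.0, 10.0],
})

# Default keep='first': only later occurrences are marked as duplicates.
later_only = toy.duplicated(subset=["Name"])

# keep=False: every member of a duplicate group is marked.
all_members = toy.duplicated(subset=["Name"], keep=False)

# Keep the first occurrence of each name and drop the rest.
deduped = toy.drop_duplicates(subset=["Name"])
```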

## Looping

In [0]:
for index, row in df.iterrows():
    print(index)
    print(row)

Returning tuples

In [0]:
for row in df.itertuples():
    print(row)
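
Each tuple from `itertuples` exposes the columns as attributes, which is usually more convenient than positional access. A minimal sketch on a hypothetical toy frame; column names that are not valid Python identifiers (for example ones containing spaces) are replaced by positional names, so this assumes simple names.

```python
import pandas as pd

# Hypothetical toy data with identifier-friendly column names.
toy = pd.DataFrame({"Name": ["Ann", "Bob"], "Fare": [7.0, 8.5]})

labels = []
# index=False leaves the index out of each tuple.
for row in toy.itertuples(index=False):
    labels.append(f"{row.Name}: {row.Fare}")
```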

## Grouping

Group by class and aggregate fare by mean

In [0]:
df.groupby(['Pclass'])['Fare'].mean()

Pivot table

In [0]:
# Passing the string 'mean' avoids the deprecation warning newer pandas emits for np.mean.
pd.pivot_table(df, values='Fare', index='Pclass', columns='Sex',
               aggfunc='mean')
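
To see the shape of the result without the Titanic data, here is a sketch on a hypothetical toy frame. Combinations that do not occur in the data come back as NaN.

```python
import pandas as pd

# Hypothetical toy data: class 2 has no female passengers.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2],
    "Sex": ["male", "female", "male", "male"],
    "Fare": [80.0, 100.0, 20.0, 30.0],
})

# Rows are classes, columns are sexes, cells hold the mean fare.
table = pd.pivot_table(toy, values="Fare", index="Pclass", columns="Sex", aggfunc="mean")
```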

Sample weighted average aggregation function

In [0]:
# colA, colB and colC are placeholder column names; the weights are looked up in df by index.
agg_func = {'colA': ['sum'],
            'colB': lambda x: np.average(x, weights=df.loc[x.index, 'colC'])}
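
The dictionary above is only a template. A runnable sketch of how it might be passed to `groupby(...).agg`, using a hypothetical toy frame with a made-up `group` column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; colC holds the weights for colB.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "colA": [1, 2, 3, 4],
    "colB": [10.0, 20.0, 30.0, 40.0],
    "colC": [1, 3, 1, 1],
})

agg_func = {
    "colA": "sum",
    # x is one group's colB values; x.index selects the matching weights.
    "colB": lambda x: np.average(x, weights=toy.loc[x.index, "colC"]),
}
result = toy.groupby("group").agg(agg_func)
```

The lambda relies on the original index being preserved within each group, so it should be applied before any `reset_index` that would break the alignment with the weights.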

## Miscellaneous

Functions in dictionary

In [0]:
# The keys name what each function computes: powers of x, not multiples.
func = {
    'square': lambda x: print("The solution is: {}".format(x**2)),
    'cube': lambda x: print("The solution is: {}".format(x**3)),
    'fourth_power': lambda x: print("The solution is: {}".format(x**4))
}

In [0]:
func['square'](3)

## Recommended Cheat Sheets

* [Pandas DataFrame Object](http://www.webpages.uidaho.edu/~stevel/504/Pandas%20DataFrame%20Notes.pdf) from University of Idaho
* [Data Wrangling with Pandas](http://cs.umw.edu/~stephen/cpsc219/Pandas_Cheat_Sheet.pdf) from University of Mary Washington
* [Python for Data Science Pandas Basics](http://datacamp-community.s3.amazonaws.com/3857975e-e12f-406a-b3e8-7d627217e952) from DataCamp
* [Data Science Python Intermediate](https://www.dataquest.io/blog/large_files/python-cheat-sheet-intermediate.pdf) from Dataquest
* [Data Science Numpy](https://www.dataquest.io/blog/large_files/numpy-cheat-sheet.pdf) from Dataquest
* [Data Science Pandas](https://www.dataquest.io/blog/large_files/pandas-cheat-sheet.pdf) from Dataquest

### End of notebook.