# `pandas`

- wrangling data 
    - `pandas`
    - common manipulations


## Setup

In [None]:
# Import standard libraries
import pandas as pd
import numpy as np

## `pandas`: Common Manipulations

You'll want to be *very* familiar with a few common data manipulations when wrangling data, each of which is described below:

Manipulation | Description
-------|:------------
**select** | select which columns to include in dataset
**filter** | filter dataset to only include specified rows
**mutate** | add a new column based on values in other columns
**groupby** | group values to apply a function within the specified groups
**summarize** | calculate specified summary metric of a specified variable
**arrange** | sort rows in ascending or descending order of a specified column
**merge** | join separate datasets into a single dataset based on a common column
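
As a rough orientation, each manipulation in the table maps onto a pandas method or idiom. The sketch below uses a small made-up DataFrame; the names `toy` and `lookup`, their columns, and their values are illustrative assumptions, not part of the course data.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'group': ['a', 'a', 'b', 'b'],
    'score': [10.0, 20.0, 30.0, 50.0],
})
lookup = pd.DataFrame({'group': ['a', 'b'], 'label': ['control', 'treatment']})

selected = toy[['id', 'score']]                           # select: keep some columns
filtered = toy[toy['score'] > 15]                         # filter: keep some rows
mutated = toy.assign(score_x2=toy['score'] * 2)           # mutate: add a column
summary = toy.groupby('group')['score'].mean()            # groupby + summarize
arranged = toy.sort_values(by='score', ascending=False)   # arrange: sort rows
merged = pd.merge(toy, lookup, on='group')                # merge: join on a common column
```

Each step returns a new object and leaves `toy` unchanged. The sections that follow look at each manipulation in turn.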



## Loading Data

In [None]:
# Load a csv file of data
df = pd.read_csv('data/my_data.csv')

In [None]:
# Check out a few rows of the dataframe
df.head()
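
If you don't have a CSV file at hand, the same call can be tried on an in-memory string. This is a minimal sketch: the CSV text and the name `demo` are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a file on disk
csv_text = """id,age,score
1,21,88.5
2,35,
3,17,92.0
"""

# read_csv accepts a path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)    # (number of rows, number of columns)
print(demo.dtypes)   # data type inferred for each column
print(demo.head())   # the empty score field is read in as NaN
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm the file was parsed the way you expected.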

## Selecting & Dropping Columns

- include subset of columns of larger data frame

In [None]:
df.head()

In [None]:
# specify which columns to include
select_df = df[['id', 'age', 'score', 'value']]
select_df.head()

In [None]:
# Drop columns we don't want
df = df.drop(labels=['first_name', 'last_name'], axis=1)

In [None]:
# Check out the DataFrame after dropping some columns
df.head()
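
Two details of selecting and dropping are easy to miss, so here is a small sketch on made-up data (the `people` DataFrame and its values are assumptions for illustration): a list of column names gives back a DataFrame while a single name gives back a Series, and `drop` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Hypothetical data, made up for illustration
people = pd.DataFrame({
    'id': [1, 2],
    'first_name': ['Ada', 'Alan'],
    'last_name': ['Lovelace', 'Turing'],
    'age': [36, 41],
})

# Selecting with a list returns a DataFrame; a single label returns a Series
ages_frame = people[['age']]
ages_series = people['age']

# drop(columns=...) is equivalent to drop(labels=..., axis=1)
no_names = people.drop(columns=['first_name', 'last_name'])

# `people` itself still has all four columns
print(people.columns.tolist())
print(no_names.columns.tolist())
```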

## Filtering Data (slicing)

- include a subset (slice) of rows from larger data frame

In [None]:
# Check if we have any data from people below the age of 18
sum(df['age'] < 18)

In [None]:
# Select only participants who are 18 or older
df = df[df['age'] >= 18]
df.shape
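
Filtering works through boolean masks, and masks can be combined. The sketch below uses a made-up `ppl` DataFrame (an assumption for illustration) to show how a mask is built, combined, and counted.

```python
import pandas as pd

# Hypothetical data, made up for illustration
ppl = pd.DataFrame({'age': [16, 18, 25, 40, 70],
                    'score': [55, 80, 91, 62, 77]})

# A comparison yields a boolean Series (a mask), one value per row
is_adult = ppl['age'] >= 18

# Combine masks with & (and), | (or), ~ (not);
# wrap each comparison in parentheses
adults_high = ppl[(ppl['age'] >= 18) & (ppl['score'] > 75)]

# .sum() on a mask counts the True values
n_minors = (~is_adult).sum()
```

The parentheses matter: `&` binds more tightly than `>=` in Python, so leaving them out typically raises an error.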

## Missing Data (NaNs)

In [None]:
# Check for missing values
df['value'].hasnans

In [None]:
# Count the missing (null) values
sum(df['value'].isnull())

In [None]:
# Have a look at the missing values
df[df['value'].isnull()]

## Dealing with Missing Data - NaNs

In [None]:
# Dealing with null values: Drop rows with missing data
df = df.dropna()
df.shape
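
Dropping every row that has any missing value is only one option. As a hedged sketch on made-up data (the `obs` DataFrame is an assumption for illustration), the example below compares it with dropping based on a single column and with filling the gaps instead.

```python
import numpy as np
import pandas as pd

# Hypothetical data, made up for illustration
obs = pd.DataFrame({'score': [1.0, np.nan, 3.0, 4.0],
                    'value': [10.0, 20.0, np.nan, 40.0]})

# Count missing values per column
n_missing = obs.isnull().sum()

# dropna() removes a row if ANY column is missing
dropped_any = obs.dropna()

# subset= restricts the check to the listed columns
dropped_value = obs.dropna(subset=['value'])

# Alternative: fill gaps instead of dropping (here, with each column's mean)
filled = obs.fillna(obs.mean())
```

Which approach is appropriate depends on the analysis; filling with the mean keeps the rows but changes the distribution of the column.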

## Finding Missing Data - Bad Values

In [None]:
# Summary statistics (count, mean, std, min, quartiles, max) for a specific column
df['score'].describe()

In [None]:
# Plot a histogram of score to see the distribution
# (recent pandas versions require `kind` to be passed as a keyword argument)
df['score'].plot(kind='hist', bins=25);

## Dealing with Missing Data - Bad Values

In [None]:
# Look for how many values have a -1 value in 'score'
sum(df['score'] == -1)

In [None]:
# Drop any row with -1 value in 'score'
df = df[df['score'] != -1]
df.shape
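
An alternative to dropping rows is to recode the placeholder value as `NaN`, so pandas treats it as missing. The sketch below assumes, as in the cells above, that -1 is a stand-in for "no score"; the `raw` DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data where -1 marks a missing score
raw = pd.DataFrame({'score': [12.0, -1.0, 15.0, -1.0, 9.0]})

# Left in place, the placeholder drags the mean down
mean_with_placeholder = raw['score'].mean()

# Recode -1 as NaN so pandas treats it as missing
recoded = raw.assign(score=raw['score'].replace(-1, np.nan))

# NaN values are skipped by default when computing the mean
mean_recoded = recoded['score'].mean()
```

After recoding, the missing-data tools from the previous sections (`isnull`, `dropna`) apply to these values too.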

## Creating new columns (mutating)

- `assign` can be very helpful in adding a new column
- lambda functions can be used to carry out calculations

In [None]:
# convert age in years to age in (approximate) days
df = df.assign(age_days=df['age'] * 365)
df.head()
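
The lambda form mentioned in the bullets is not used in the cell above, so here is a minimal sketch of it. The `ages` DataFrame and the `age_weeks` column are made up for illustration.

```python
import pandas as pd

# Hypothetical data, made up for illustration
ages = pd.DataFrame({'age': [20, 30]})

# Each lambda receives the DataFrame as it exists at that point in the call,
# so a later column can build on one created earlier in the same assign
ages = ages.assign(
    age_days=lambda d: d['age'] * 365,
    age_weeks=lambda d: d['age_days'] / 7,
)
```

The lambda form is handy in chained operations, where the intermediate DataFrame has no name you could refer to.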

## Grouping & summarizing

- group by a particular variable
- calculate summary statistics/metrics within group

In [None]:
# calculate the average of each column within each age
df.groupby('age').mean(numeric_only=True)
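
To compute several summaries at once and name the resulting columns, `agg` also accepts keyword arguments of the form `new_name=(column, function)`. A small sketch on made-up data (the `scores` DataFrame and the output column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data, made up for illustration
scores = pd.DataFrame({'age': [20, 20, 30, 30, 30],
                       'score': [80.0, 90.0, 70.0, 75.0, 95.0]})

# Named aggregation: output_column=(input_column, function)
by_age = scores.groupby('age').agg(
    mean_score=('score', 'mean'),
    max_score=('score', 'max'),
    n=('score', 'count'),
)
print(by_age)
```

The grouping variable (`age`) becomes the index of the result; call `.reset_index()` if you would rather have it back as a regular column.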

## Sorting Rows (arrange)

- specify order in which to display rows

In [None]:
df.head()

In [None]:
# sort by values in age
df = df.sort_values(by=['age'])
df.head()
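
Sorting also works on several columns, with a separate direction for each. A minimal sketch, assuming a made-up `rows` DataFrame:

```python
import pandas as pd

# Hypothetical data, made up for illustration
rows = pd.DataFrame({'age': [30, 20, 30, 20],
                     'score': [70, 90, 95, 80]})

# Sort by age ascending, then by score descending within each age
ordered = rows.sort_values(by=['age', 'score'], ascending=[True, False])

# The original index labels travel with the rows; reset them if desired
ordered = ordered.reset_index(drop=True)
```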

## Combining datasets
![](img/join.png)

In [None]:
# Create two DataFrames
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

In [None]:
left

In [None]:
right

In [None]:
# Show the documentation for pd.merge (IPython/Jupyter syntax)
pd.merge?

In [None]:
# inner merge by default
pd.merge(left, right, on='key')

In [None]:
# same as above
pd.merge(left, right, on='key', how='inner')

In [None]:
# right merge
pd.merge(left, right, on='key', how='right')

In [None]:
# left merge
pd.merge(left, right, on='key', how='left')

In [None]:
# outer join
pd.merge(left, right, on='key', how='outer')
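
Because both DataFrames have a column called `value`, the merged result carries two value columns, suffixed `_x` and `_y` by default. The sketch below uses fixed numbers instead of random ones (the `left_demo` and `right_demo` names and values are made up for illustration) and shows two optional `pd.merge` parameters: `suffixes` to name the overlapping columns, and `indicator` to record where each row came from.

```python
import pandas as pd

# Fixed values (rather than random ones) so the result is easy to follow
left_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
right_demo = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [20, 40, 50, 60]})

# Outer merge keeps every key from both sides;
# the _merge column reports 'left_only', 'right_only', or 'both'
both = pd.merge(left_demo, right_demo, on='key', how='outer',
                suffixes=('_left', '_right'), indicator=True)
print(both)
```

Keys present on only one side get `NaN` in the other side's value column, which is the main difference between the inner, left, right, and outer merges above.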

## Visualizing Data

- We'll have a whole lecture (or two) on visualization
- For now, we'll just look at one uniquely-pandas approach

In [None]:
# Plot pairwise scatter plots of the selected numerical columns
# (histograms of each column appear on the diagonal)
pd.plotting.scatter_matrix(df[['age', 'score', 'value']], figsize=[12, 12], marker='o');

## Check for correlations between variables

In [None]:
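# Take the pairwise correlations between all numerical columns
# (numeric_only=True skips non-numeric columns, which recent pandas versions would otherwise reject)
df.corr(numeric_only=True)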
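
To get a feel for how to read a correlation matrix, here is a small sketch on made-up data where the relationships are known in advance (the `xy` DataFrame is an assumption for illustration).

```python
import pandas as pd

# Hypothetical data with known relationships
xy = pd.DataFrame({'x': [1, 2, 3, 4],
                   'y': [2, 4, 6, 8],     # increases in step with x
                   'z': [8, 6, 4, 2]})    # decreases in step with x

# Pearson correlation between every pair of columns (the default method)
corr = xy.corr()

# Correlation between two specific columns
r_xy = xy['x'].corr(xy['y'])
print(corr)
```

The matrix is symmetric with ones on the diagonal; here the x-y entry is close to 1 and the x-z entry is close to -1.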