# Pandas for Exploratory Data Analysis

## Learning Objectives

- **Define** what Pandas is and how it relates to data science
- **Manipulate** Pandas DataFrames and Series
- **Filter and sort** data using Pandas
- **Manipulate** DataFrame columns
- **Know** how to handle null and missing values

## What is Pandas?

- Pandas is a Python library that primarily adds two new datatypes to Python: `DataFrame` and `Series`.
    - A `Series` is a sequence of items, where each item has a label. Together the labels form the `index`. Labels are usually unique, although Pandas does not require it.
    - A `DataFrame` is a table of data. Each row has a label (the `row index`), and each column has a label (the `column index`).
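
To make the two data types concrete, here is a minimal sketch that builds one of each by hand. The values and the labels `'a'`, `'b'`, `'c'` are invented for illustration and do not come from any dataset used in this lesson.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
ages = pd.Series([24, 53, 23], index=['a', 'b', 'c'], name='age')

people = pd.DataFrame(
    {'age': [24, 53, 23], 'gender': ['M', 'F', 'M']},
    index=['a', 'b', 'c'],
)

print(ages['b'])                   # look up one item in the Series by its label
print(people.index)                # the row index
print(people.columns)              # the column index
print(people.loc['b', 'gender'])   # one cell, by row label and column label
```

Each column of `people` is itself a `Series` that shares the row index of the `DataFrame`.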

### Using Pandas

Pandas is frequently used in data science because it:
- offers a large set of commonly used functions
- is relatively fast
- has a large community

### Class methods and attributes

- A Pandas `DataFrame` is an instance of a Pandas class, so it comes with attributes and methods.

- To access these, follow the variable name with a dot. For example, given a `DataFrame` called `users`:

```
users.index       # accesses the `index` attribute -- note there are no parentheses
users.head()      # calls the `head` method (since there are open/closed parentheses)
users.head(10)    # calls the `head` method with parameter `10`, indicating the first 10 rows
```

### Viewing Documentation

There are a few ways to find more information about a method.

- **Method 1:** In Jupyter, follow the method name with a `?`:

```
users.head?
```

```
Signature: users.head(n=5)
Docstring: Returns first n rows
```

- **Method 2:** Search the web for a phrase such as "`DataFrame head`".

## Import Libraries

In [1]:
# Load pandas and matplotlib into Python
import pandas as pd
from matplotlib import pyplot as plt

%matplotlib inline

### Reading Files, Selecting Columns, and Summarizing

In [2]:
# Read data file into pandas dataframe
users = pd.read_table('../data/user.tbl', sep='|')
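
If the data file is not at hand, the same kind of read can be tried on an in-memory string. The sketch below is only a stand-in: the column names and rows are invented, and the real contents of `user.tbl` may differ. It uses `pd.read_csv` with `sep='|'`, which parses pipe-delimited text the same way as the `pd.read_table` call above.

```python
import io
import pandas as pd

# Hypothetical stand-in for a pipe-delimited file; these columns and
# rows are invented and are not the contents of user.tbl.
raw = io.StringIO(
    "user_id|age|gender\n"
    "1|24|M\n"
    "2|53|F\n"
)

# read_csv accepts any file-like object and a custom separator
sample = pd.read_csv(raw, sep='|')

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # data type inferred for each column
```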

**Examine the users data**

In [None]:
# DataFrame
type(users)             

In [3]:
# print the first 5 rows
users.head()            


In [None]:
# print the last 5 rows
users.tail()

In [None]:
# the row index (aka "the row labels" -- in this case integers)
users.index

In [None]:
# column names (which is "an index")
users.columns

In [None]:
# data types of each column -- each column is stored as an ndarray which has a data type
users.dtypes

In [None]:
# number of rows and columns
users.shape

In [None]:
# all values as a NumPy array (recent pandas versions also offer users.to_numpy())
users.values

In [None]:
# concise summary (including memory usage) -- useful to quickly see if nulls exist
users.info()      
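
As a rough illustration of spotting nulls, the sketch below builds a small `DataFrame` with one missing value and inspects it two ways. The data is invented for illustration. `isnull().sum()` is one common way to count missing values per column; it is shown here as a complement to `info()`, not as the only approach.

```python
import pandas as pd

# Hypothetical data with one missing value, invented for illustration.
sample = pd.DataFrame({
    'age': [24, 53, None],
    'gender': ['M', 'F', 'M'],
})

sample.info()                          # reports how many non-null entries each column has

null_counts = sample.isnull().sum()    # number of missing values per column
print(null_counts)
```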

**Selecting or indexing data**
- Pandas `DataFrame`s have structural similarities with Python-style lists and dictionaries.  
- In the example below, we select a column of data using the name of the column in a similar manner to how we select a dictionary value with the dictionary key.

In [None]:
# select a column - returns a Pandas 'Series' (essentially an ndarray with an index)
users['gender']

In [None]:
# 'DataFrame' columns are Pandas 'Series'
type(users['gender'])

In [None]:
# select one column using the DataFrame attribute
users.gender

# while a useful shorthand, attribute access only works if the column name
# is a valid Python identifier (no punctuation or spaces) and does not
# clash with an existing DataFrame attribute or method
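
A small sketch of where the attribute shorthand breaks down. The column names `'zip code'` and `'count'` are hypothetical, chosen only to show a name with a space and a name that matches an existing `DataFrame` method.

```python
import pandas as pd

# Hypothetical columns, invented to show the limits of attribute access.
sample = pd.DataFrame({
    'gender': ['M', 'F'],
    'zip code': ['85711', '94043'],   # name contains a space
    'count': [3, 7],                  # same name as the DataFrame.count method
})

print(sample.gender)            # works: a plain identifier with no clash
print(sample['zip code'])       # sample.zip code would be a syntax error
print(sample['count'])          # bracket notation returns the column...
print(callable(sample.count))   # ...while the attribute is the built-in method
```

Bracket notation works for every column name, so it is the safer default in reusable code.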

**Summarize (describe) the data**
- Pandas has many built-in methods to quickly summarize your data.

In [None]:
# describe all numeric columns
users.describe()          

In [None]:
# describe all object columns (can include multiple types)
users.describe(include=['object'])

In [None]:
# describe all columns, including non-numeric
users.describe(include='all')
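
To see how the `include` argument changes the result, here is a minimal sketch on a tiny invented `DataFrame`. The columns and values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
sample = pd.DataFrame({
    'age': [24, 53, 23, 33],
    'gender': ['M', 'F', 'M', 'F'],
})

numeric_summary = sample.describe()            # numeric columns only (the default)
all_summary = sample.describe(include='all')   # every column, numeric or not

print(numeric_summary)
print(all_summary)
```

In the `include='all'` result, statistics that do not apply to a column (for example the mean of `gender`) are filled with `NaN`.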

In [None]:
# describe a single column -- recall that 'users.gender' refers to a Series
users.gender.describe()

In [None]:
# calculate the mean of the ages
users.age.mean()

In [None]:
# draw a histogram of a column (the distribution of ages)
users.age.hist();

**Count the number of occurrences of each value**

In [None]:
users.gender.value_counts()     # most useful for categorical variables

In [None]:
# can also be used with numeric variables
#   try .sort_index() to sort by indices or .sort_values() to sort by counts
users.age.value_counts()
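
The difference between the two sort options is easier to see on a tiny example. The ages below are invented for illustration.

```python
import pandas as pd

# Hypothetical ages, invented for illustration only.
ages = pd.Series([22, 22, 22, 25, 30, 30], name='age')

counts = ages.value_counts()      # default order: most frequent value first
by_age = counts.sort_index()      # ordered by the age values (the index)
by_count = counts.sort_values()   # ordered by the counts, smallest first

print(counts)
print(by_age)
print(by_count)
```

Sorting by index is usually the better choice before plotting a numeric variable, since the bars then appear in increasing order of the value itself.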

In [None]:
users.gender.value_counts().plot(kind='bar')     # quick plot by category

In [None]:
users.age.value_counts().sort_index().plot(kind='bar', figsize=(12,12));     # bigger plot by increasing age
plt.xlabel('Age');
plt.ylabel('Number of users');
plt.title('Number of users per age');

### EXERCISE ONE

In [None]:
# read drinks.csv into a DataFrame called 'drinks'
import pandas as pd
drinks = pd.read_csv('data/drinks.csv')

In [None]:
# 1. print the head and the tail
# 2. examine the default index, data types, and shape
# 3. print the 'beer_servings' Series
# 4. calculate the average 'beer_servings' for the entire dataset
# 5. count the number of occurrences of each 'continent' value and see if it looks correct