# Data Analysis with Pandas — Day 1
## Exploratory Data Analysis

This is the Day 1 notebook for the June 2021 course "Data Analysis with Pandas," part of the [Text Analysis Pedagogy Institute](https://nkelber.github.io/tapi2021/book/intro.html).

In this lesson, we're going to introduce the basics of [Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html), a powerful Python library for working with tabular data like CSV files.

We will cover:

* The Essential Structures of Pandas
* How to Load Data
* How to Explore and Filter Data
* How to Make Simple Plots

___

## Dataset
### Seattle Public Library Book Circulation Data

This week, we will be working with [circulation data](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6) made publicly available by the Seattle Public Library. The dataset includes items that were checked out 20+ times in a month between January 2015 and June 2021.

For more information about this dataset, see the Seattle Public Library's [data portal](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).
___

## Import Pandas

To use the Pandas library, we first need to `import` it.

In [None]:
import pandas as pd

The above `import` statement not only imports the Pandas library but also gives it an alias or nickname — `pd`.

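By default, Pandas will display 60 rows and 20 columns. I often change [Pandas' default display settings](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) to show more rows or columns. Below, we raise the maximum column width to 100 characters so that longer text values, like titles, are truncated less often.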

In [None]:
pd.options.display.max_colwidth = 100
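
The row and column limits mentioned above can be changed in the same way. This is only a small sketch: the values 100 and 50 are arbitrary choices for illustration, not settings used in this course.

```python
import pandas as pd

# Show up to 100 rows and 50 columns before Pandas truncates the display.
# (Illustrative values only; pick whatever suits your screen.)
pd.options.display.max_rows = 100
pd.options.display.max_columns = 50

# Read the current settings back to confirm the change
print(pd.get_option("display.max_rows"))
print(pd.get_option("display.max_columns"))
```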

## How to Load Data

| File Type | Pandas Method |
|----------|---------|
| CSV file | `pd.read_csv(filepath, delimiter=',')` |
| TSV file | `pd.read_csv(filepath, delimiter='\t')` |
| Excel file | `pd.read_excel(filepath)` |
| JSON file | `pd.read_json(filepath)`, `pd.json_normalize(data)` |
| SQL table | `pd.read_sql_table(table_name, con)` |

To read in a CSV file, we will use the function `pd.read_csv()` and insert the name of our desired file path, along with a delimiter (the character that separates columns in our file) and a character encoding. 

In [None]:
pd.read_csv('Seattle-Library_2015-2021.csv', delimiter=",", encoding="utf-8")

This creates a Pandas [DataFrame object](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe) — often abbreviated as *df*, e.g., *seattle_df*.
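
To see how the `delimiter` parameter works without needing a file on disk, here is a minimal sketch that reads a tiny, made-up tab-separated table from a string. The titles, numbers, and the name `sample_df` are invented for illustration and are not part of the Seattle dataset.

```python
import io
import pandas as pd

# A tiny, made-up tab-separated table standing in for a real file
tsv_text = "Title\tCheckouts\nExhalation\t25\nThick\t31\n"

# read_csv accepts file-like objects as well as file paths;
# delimiter="\t" tells Pandas that columns are separated by tabs
sample_df = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")

print(sample_df.shape)  # (2, 2): two rows, two columns
```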

## How to Display Data

We can display a Pandas DataFrame in a Jupyter notebook simply by running a cell with the variable name of the DataFrame.

In [None]:
seattle_df = pd.read_csv('Seattle-Library_2015-2021.csv', delimiter=",", encoding="utf-8")

In [None]:
seattle_df

We can examine the first *n* rows by using `df.head(n)`.

In [None]:
seattle_df.head(5)

Sometimes we want to see data beyond the first few rows. To display a random sample of *n* rows, we can use `df.sample(n)`.

In [None]:
seattle_df.sample(5)

Go ahead and run it a few times to check out random rows!
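
If we ever need the same "random" rows every time (for example, to share a result with someone else), `.sample()` also accepts a `random_state` seed. A minimal sketch, assuming a made-up DataFrame called `demo_df`; the seed value 42 is arbitrary.

```python
import pandas as pd

# Made-up data for illustration only
demo_df = pd.DataFrame({"Title": ["A", "B", "C", "D", "E", "F"],
                        "Checkouts": [21, 35, 48, 22, 90, 27]})

# Using the same seed twice returns the same rows both times
first_draw = demo_df.sample(3, random_state=42)
second_draw = demo_df.sample(3, random_state=42)

print(first_draw.equals(second_draw))  # True
```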

## Exploratory Data Analysis — Overview

Ok so what's actually in this Seattle Public Library dataset? What categories are included? What time period(s) does it cover? Is there missing data? Is there messy data?

We can check to see how many rows vs columns are included in the DataFrame by getting the attribute `.shape`.

In [None]:
seattle_df.shape

To get important information about all the columns in a DataFrame, we can use `df.info()`.

In [None]:
seattle_df.info()

<div class="admonition pythonreview" name="html-admonition" style="background: lightgreen; padding: 10px">
<p class="title"><b>Python Review 🐍 </b></p>


| **Python Data Type** | **Example** |
|:-------------:|:-------------:|
| string | "Exhalation / Ted Chiang."; "2020" |
| float | 20.2 |
| integer | 20 |
| boolean | True/False |
</div>

Python has different data types, which we can check with the built-in `type()` function.

In [None]:
type(2020)

In [None]:
type("2020")

Similarly, Pandas has different data types, too.

These data types are automatically assigned to columns when we read in a file.



| **Pandas Data Type** | **Explanation** | **Example** |
|:-------------:|:-------------:|---|
| `object` | strings, mixture of strings and numeric values | "Exhalation / Ted Chiang." |
| `float64` | floats, `NaN` | 20.2, `NaN` |
| `int64` | integers | 20 |
| `datetime64` | datetimes | `2021-02-01` |

We can check these Pandas data types explicitly with the [`.dtypes` attribute](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html).

In [None]:
seattle_df.dtypes
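
One consequence of automatic type assignment is worth seeing in miniature: a column of whole numbers that contains a missing value is read as `float64`, because `NaN` is a float. A minimal sketch with made-up CSV text (the rows and the name `demo_df` are invented for illustration):

```python
import io
import pandas as pd

# Made-up CSV text; the second row is missing its Checkouts value
csv_text = "Title,Checkouts,CheckoutYear\nExhalation,25,2020\nThick,,2019\n"
demo_df = pd.read_csv(io.StringIO(csv_text))

# Checkouts becomes float64 because of the missing value,
# while CheckoutYear, which has no gaps, stays int64
print(demo_df.dtypes)
```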

To calculate summary statistics for every column in our DataFrame, we can use the `.describe()` method.

By default, `.describe()` summarizes only numerical columns, but we can ask it to describe all columns with `include='all'`.

In [None]:
seattle_df.describe(include='all')

`NaN` is the Pandas value for any missing data. We'll cover this in more detail later (you can read ["Working with missing data"](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html?highlight=nan) for more information if you're curious now). The `NaN` values in this table indicate that there is no applicable result for the column, e.g., there's no mean or standard deviation for the "title" of items.

- What is the maximum number of checkouts per month?
- What is the minimum number of checkouts per month?
- How many different material types are there?
- Which columns have missing data, and how much data is missing?
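
Questions like these can also be answered one at a time in code. The sketch below uses a small made-up DataFrame with column names borrowed from the Seattle data; the values and the variable names are invented for illustration.

```python
import pandas as pd

# Made-up data for illustration only
demo_df = pd.DataFrame({
    "MaterialType": ["BOOK", "EBOOK", "BOOK", "AUDIOBOOK"],
    "Checkouts": [25, 31, 48, 22],
    "Publisher": ["Publisher A", None, "Publisher B", None],
})

max_checkouts = demo_df["Checkouts"].max()         # largest value
min_checkouts = demo_df["Checkouts"].min()         # smallest value
n_material_types = demo_df["MaterialType"].nunique()  # number of distinct values
missing_per_column = demo_df.isna().sum()          # missing values in each column

print(max_checkouts, min_checkouts, n_material_types)
print(missing_per_column)
```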

**Moment of Reflection** 🛑   

What questions might we explore with this data? What are some potential problems or issues with this data? How might we resolve them?

## Pandas Essentials — DataFrame vs Series

There are two main types of data structures in Pandas, *DataFrame* objects and *Series* objects.

| Pandas objects | Think of it like...  |   Dimensions | It looks like...                   |
|----------|---------| ----- | -----|
| `DataFrame`    | A spreadsheet | 2-dimensional |  A pretty, nicely formatted table |
| `Series`      | A single column | 1-dimensional | A more basic printed code output   |                

This is a Pandas [DataFrame object](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe), which looks and acts a lot like a spreadsheet.

In [None]:
seattle_df

We can confirm that this is a DataFrame by using the built-in Python function `type()`.

In [None]:
type(seattle_df)

This is a Series object, a single column from the DataFrame, which we can access with square brackets `[]` and the name of the column in quotation marks.

In [None]:
seattle_df['MaterialType']

A Series object displays differently than a DataFrame object. 

In [None]:
type(seattle_df['MaterialType'])

<div class="admonition pythonreview" name="html-admonition" style="background: lightgreen; padding: 10px">
<p class="title"><b>Python Review 🐍 </b></p>

Python dictionaries are made up of key-value pairs, e.g., `{'key': 'value'}`.  
To access a value in a Python dictionary, we use square brackets `['key']`.
</div>

In [3]:
# Python dictionary
book_dict = {'title': 'My Brilliant Friend',
             'author': 'Elena Ferrante',
             'publication_year': 2011}

# Get value for the key author
book_dict['author']

'Elena Ferrante'

In [2]:
# Python dictionary
title_author_dict = {'My Brilliant Friend': 'Elena Ferrante',
                     'Thick': 'Tressie McMillan Cottom'}

# Get value for the key Thick
title_author_dict['Thick']

'Tressie McMillan Cottom'

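There are actually two different ways of accessing a column or Series object. You can also access a column with dot `.` notation. Dot notation only works when the column name is a valid Python name that does not clash with an existing DataFrame attribute or method, so to stay consistent, we will use square brackets throughout these lessons.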

In [None]:
seattle_df.MaterialType

In [None]:
type(seattle_df.MaterialType)
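
Here is a minimal sketch of a case where the two notations disagree. The DataFrame and its column names (`count` and `Material Type`) are made up to provoke the problem; they are not columns in the Seattle data.

```python
import pandas as pd

# Made-up DataFrame with two awkward column names
demo_df = pd.DataFrame({"count": [25, 31],
                        "Material Type": ["BOOK", "EBOOK"]})

# Square brackets always return the column as a Series
print(type(demo_df["count"]))

# Dot notation finds the DataFrame's built-in .count() method instead
print(callable(demo_df.count))  # True

# A column name containing a space can only be reached with square brackets
print(demo_df["Material Type"])
```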

If we use two square brackets, we will return a DataFrame rather than a Series object.

In [None]:
seattle_df[['MaterialType']]

In [None]:
type(seattle_df[['MaterialType']])

If we want to select multiple columns, we will need to use two square brackets.

In [None]:
seattle_df[['Publisher', 'MaterialType']]
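
The "two square brackets" are really a Python list of column names placed inside the usual selection brackets. A minimal sketch with a made-up DataFrame (the values and variable names are invented for illustration):

```python
import pandas as pd

# Made-up data for illustration only
demo_df = pd.DataFrame({
    "Publisher": ["Publisher A", "Publisher B"],
    "MaterialType": ["BOOK", "EBOOK"],
    "Checkouts": [25, 31],
})

one_column = demo_df["MaterialType"]       # a string inside brackets -> Series
one_column_df = demo_df[["MaterialType"]]  # a list inside brackets -> DataFrame

# The list can live in its own variable, which makes the structure clearer
columns_to_keep = ["Publisher", "MaterialType"]
two_columns = demo_df[columns_to_keep]

print(type(one_column))
print(type(one_column_df))
print(two_columns.shape)  # (2, 2)
```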

## Basic Statistics and Value Counts

| Pandas method | Explanation                         |
|----------|-------------------------------------|
| `.sum()`      | Sum of values                       |
| `.mean()`     | Mean of values                      |
| `.median()`   | Median of values         |
| `.min()`      | Minimum                             |
| `.max()`      | Maximum                             |
| `.mode()`     | Mode                                |
| `.std()`      | Unbiased standard deviation         |
| `.count()`    | Total number of non-blank values    |
| `.value_counts()` | Frequency of unique values |

There are a number of convenient methods that we can use with Series objects, such as `.max()` and `.mean()`.

In [None]:
seattle_df['Checkouts'].max()

In [None]:
seattle_df['Checkouts'].mean()

We can also count the number of records in each category (excluding `NaN` values) in a column with `value_counts()`.

In [None]:
seattle_df['MaterialType'].value_counts()

We can get the proportion of each category, rather than the raw count, by setting `normalize=True`.

In [None]:
seattle_df['MaterialType'].value_counts(normalize=True)
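
To make the difference between counts and proportions concrete, here is a minimal sketch on a made-up Series of four values (not real library data):

```python
import pandas as pd

# Made-up Series: three BOOK records and one EBOOK record
demo_series = pd.Series(["BOOK", "BOOK", "EBOOK", "BOOK"], name="MaterialType")

counts = demo_series.value_counts()                     # raw counts
proportions = demo_series.value_counts(normalize=True)  # share of the total

print(counts["BOOK"])       # 3
print(proportions["BOOK"])  # 0.75
```

The proportions always add up to 1, so multiplying them by 100 gives percentages.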

<div class="admonition pythonreview" name="html-admonition" style="background: lightgreen; padding: 10px">
<p class="title"><b>Python Review 🐍 </b></p>

Python `lists` consist of items separated by commas in square brackets.    

`books = ['My Brilliant Friend', 'Goosebumps', 'Man in the High Castle', 'Thick']`  

To slice a Python list and extract the first 2 values, we can use `[:2]`, e.g., `books[:2]`.  

`['My Brilliant Friend', 'Goosebumps']`
</div>

In [None]:
# Python list
books = ['My Brilliant Friend', 'Goosebumps', 'Man in the High Castle', 'Thick']
# Slice the list
books[:2]

In [None]:
# Python list
books = ['My Brilliant Friend', 'Goosebumps', 'Man in the High Castle', 'Thick']
# Index the list
books[2]

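Returning to `.value_counts()`: we can also include `NaN` values in the counts by setting `dropna=False`. The `[:10]` slice at the end works just like slicing a Python list and keeps only the first 10 rows of the result, which are the 10 most frequent values.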

In [None]:
seattle_df['Publisher'].value_counts(dropna=False)[:10]
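
A minimal sketch of what `dropna=False` changes, using a made-up Series in which three of the five values are missing (the publisher name is a placeholder):

```python
import pandas as pd

# Made-up Series: two real values and three missing ones
publishers = pd.Series(["Publisher A", None, "Publisher A", None, None],
                       name="Publisher")

without_missing = publishers.value_counts()           # NaN is skipped
with_missing = publishers.value_counts(dropna=False)  # NaN gets its own row

print(without_missing.sum())  # 2
print(with_missing.sum())     # 5
print(with_missing.head(1))   # the most frequent "value" here is NaN
```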

## Pandas Essentials — Index

The other essential structure in Pandas is the *index*, the bolded ascending numbers in the leftmost column of the DataFrame.

In [None]:
seattle_df.head(3)

In [None]:
seattle_df.index

In [None]:
type(seattle_df.index)

We can access rows by their index number with `.iloc`, or integer-location indexing.

In [None]:
seattle_df.iloc[0]

### Set Index

We can also change the index from ascending row numbers to one of our DataFrame columns. This can be useful for indexing based on values.

In [None]:
seattle_df.set_index('CheckoutYear')

In [None]:
seattle_df.set_index('CheckoutYear').loc[2018]
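
Two details are easy to miss here: `.set_index()` returns a new DataFrame and leaves the original unchanged, and `.loc[]` returns a DataFrame when the label matches several rows but a Series when it matches only one. A minimal sketch with made-up data (`demo_df`, `year_df`, and the values are invented for illustration):

```python
import pandas as pd

# Made-up data: two rows for 2018 and one row for 2019
demo_df = pd.DataFrame({"CheckoutYear": [2018, 2018, 2019],
                        "Title": ["A", "B", "C"],
                        "Checkouts": [25, 31, 48]})

# set_index returns a new DataFrame; demo_df keeps its default index
year_df = demo_df.set_index("CheckoutYear")

rows_2018 = year_df.loc[2018]  # label matches two rows -> DataFrame
row_2019 = year_df.loc[2019]   # label matches one row  -> Series

print(type(rows_2018))
print(type(row_2019))
print(demo_df.index)
```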

### Reset Index

We can also "reset" the index to the default integer index by using `.reset_index()`.

In [None]:
seattle_df['MaterialType'].value_counts()

Series objects like the result of this `.value_counts()` method also have an index.

In [None]:
seattle_df['MaterialType'].value_counts().index

In [None]:
seattle_df['MaterialType'].value_counts().reset_index()
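
The column names that `.reset_index()` produces for a `.value_counts()` result have differed between Pandas versions, so it can help to name them explicitly. This is one possible approach, sketched on a made-up Series; the column name `Count` is my own choice, not something from the course.

```python
import pandas as pd

# Made-up Series for illustration only
material_types = pd.Series(["BOOK", "BOOK", "EBOOK"], name="MaterialType")

counts = material_types.value_counts()

# rename_axis names the index; reset_index(name=...) names the values column
counts_df = counts.rename_axis("MaterialType").reset_index(name="Count")

print(counts_df)
```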

## How to Select, Subset, and Filter Data

There are a number of different ways that you can select, subset and filter a DataFrame. 

This useful summary below is borrowed from the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#indexing-selection):

<table class="colwidths-given table">
<colgroup>
<col style="width: 25%">
<col style="width: 20%">
<col style="width: 40%">
<col style="width: 15%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Selection Method</p></th>
<th class="head"><p>Pandas Syntax</p></th>
<th class="head"><p>Example</p></th>
<th class="head"><p>Output</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>Select column</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">df[col]</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">seattle_df['MaterialType']</span></code></p></td>
<td><p>Series</p></td>
</tr>
<tr class="row-odd"><td><p>Select row by label</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">df.loc[label]</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">df.loc[2018]</span></code></p></td>
<td><p>Series</p></td>
</tr>
<tr class="row-even"><td><p>Select row by integer location</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">df.iloc[loc]</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">seattle_df.iloc[103]</span></code></p></td>
<td><p>Series</p></td>
</tr>
<tr class="row-odd"><td><p>Select rows by filter or "boolean vector"</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">df[filter/boolean]</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">seattle_df[seattle_df['MaterialType'] == 'BOOK']</span></code></p></td>
<td><p>DataFrame</p></td>
</tr>
</tbody>
</table>

Select by column

In [None]:
seattle_df['MaterialType']

<div class="admonition pythonreview" name="html-admonition" style="background: lightgreen; padding: 10px">
<p class="title"><b>Python Review 🐍 </b></p>


A comparison in Python evaluates to a Boolean value, which is either `True` or `False`.  
A double equals sign `==` is the equality operator in Python, and `!=` is the "not equal" operator.
</div>

In [None]:
# Assign a variable the value 'Book'
some_variable = 'Book'

# Evaluate whether the variable equals 'Book'
some_variable == 'Book'

In [None]:
# Evaluate whether the variable does NOT equal 'Book'
some_variable != 'Book'

We can construct a conditional statement with Pandas that returns a Series of `True`/`False` values, one for each row.

In [None]:
seattle_df['MaterialType'] == 'BOOK'

We can then subset the DataFrame by filtering with a conditional or Boolean vector.

In [None]:
# Boolean vector
book_filter = seattle_df['MaterialType'] == 'BOOK'

# Filter by Boolean vector
seattle_df[book_filter]

In [None]:
# Boolean vector
checkouts_filter = seattle_df['Checkouts'] > 750

# Filter by Boolean vector
seattle_df[checkouts_filter]

We can also combine conditions, so that a row is kept only when one condition AND (`&`) another are both True, or when one condition OR (`|`) another is True. Each condition needs its own parentheses.

For example, we might be interested in filtering to see only books by the author Ted Chiang.

In [None]:
(seattle_df['MaterialType'] == 'BOOK') & (seattle_df['Creator'] == 'Chiang, Ted')

In [None]:
# Boolean vector
book_author_filter = (seattle_df['MaterialType'] == 'BOOK') \
                    & (seattle_df['Creator'] == 'Chiang, Ted')

# Filter by Boolean vector
seattle_df[book_author_filter]
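
To see how `&` and `|` differ, here is a minimal sketch on a made-up four-row DataFrame (the rows are invented; only the column names come from the Seattle data). The parentheses matter because `&` and `|` bind more tightly than `==` in Python, so storing each condition in its own variable is another way to keep things unambiguous.

```python
import pandas as pd

# Made-up data for illustration only
demo_df = pd.DataFrame({
    "MaterialType": ["BOOK", "EBOOK", "BOOK", "AUDIOBOOK"],
    "Creator": ["Chiang, Ted", "Chiang, Ted",
                "Ferrante, Elena", "Ferrante, Elena"],
    "Checkouts": [25, 31, 48, 22],
})

is_book = demo_df["MaterialType"] == "BOOK"
is_chiang = demo_df["Creator"] == "Chiang, Ted"

both = demo_df[is_book & is_chiang]    # rows where both conditions are True
either = demo_df[is_book | is_chiang]  # rows where at least one is True

print(len(both))    # 1
print(len(either))  # 3
```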

## Missing Data

We can see if data is missing or NOT missing with `.isna()` or `.notna()`.

In [None]:
seattle_df[seattle_df['Publisher'].isna()]

In [None]:
seattle_df[seattle_df['Creator'].isna()]

In [None]:
seattle_df[seattle_df['Creator'].notna()]
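
Because `.isna()` and `.notna()` return Boolean Series, and `True` counts as 1 when summed, adding `.sum()` gives a quick count of missing or present values. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up data: the second row has no Creator
demo_df = pd.DataFrame({
    "Title": ["Exhalation", "Unknown item", "My Brilliant Friend"],
    "Creator": ["Chiang, Ted", None, "Ferrante, Elena"],
})

missing_mask = demo_df["Creator"].isna()      # True where the value is missing
n_missing = missing_mask.sum()                # number of missing values
n_present = demo_df["Creator"].notna().sum()  # number of non-missing values

print(n_missing, n_present)  # 1 2
```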

## Making a DataFrame Copy

If we want to make a different DataFrame based on an original DataFrame, we can use `df.copy()`.

In [None]:
# Boolean vector
book_filter = seattle_df['MaterialType'] == 'BOOK'

book_df = seattle_df[book_filter].copy()

In [None]:
book_df.head(4)

In [None]:
book_df['MaterialType'].value_counts()

In [None]:
book_df['Checkouts'].max()

In [None]:
book_df['Checkouts'].min()

In [None]:
book_df['Checkouts'].mean()
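
The point of `.copy()` is that the new DataFrame is independent of the original: changing one does not change the other. Without it, modifying a filtered DataFrame can trigger a `SettingWithCopyWarning` in some Pandas versions. A minimal sketch with made-up data (doubling the checkouts is just an arbitrary change to demonstrate the idea):

```python
import pandas as pd

# Made-up data for illustration only
demo_df = pd.DataFrame({"MaterialType": ["BOOK", "EBOOK", "BOOK"],
                        "Checkouts": [25, 31, 48]})

# Filter to books and make an independent copy
book_df = demo_df[demo_df["MaterialType"] == "BOOK"].copy()

# Change the copy only
book_df["Checkouts"] = book_df["Checkouts"] * 2

print(list(book_df["Checkouts"]))  # [50, 96]
print(list(demo_df["Checkouts"]))  # [25, 31, 48], the original is untouched
```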

## Make and Save Plots

Pandas makes it easy to create simple plots and data visualizations.

We can make a simple plot by adding `.plot()` to any DataFrame or Series object that has appropriate numeric data.

In [None]:
seattle_df['MaterialType'].value_counts().plot(kind='bar')

We can specify the title with the `title=` parameter and the kind of plot by altering the `kind=` parameter:

* `'bar'` or `'barh'` for bar plots
* `'hist'` for histogram
* `'box'` for boxplot
* `'kde'` or `'density'` for density plots
* `'area'` for area plots
* `'scatter'` for scatter plots
* `'hexbin'` for hexagonal bin plots
* `'pie'` for pie plots

For example, to make a pie chart of the five most common material types, we can set `kind='pie'`.

In [None]:
seattle_df['MaterialType'].value_counts()[:5].plot(title='SPL Material Types 2015-2021',
                                               kind='pie')

To save a plot as an image file or PDF file, we can assign the plot to a variable called `ax`, short for axes.

Then we can use `ax.figure.savefig('FILE-NAME.extension')`.

In [None]:
ax = seattle_df['MaterialType'].value_counts().plot(kind='pie')
ax.figure.savefig('SPL-MaterialTypes.pdf')

## Reflection and Next Steps

What next steps should we take to analyze this data or better prepare it for analysis?