
# Data Wrangling in Python
## Introduction to the pandas library, part 1
### [dataservices.library.jhu.edu](https://dataservices.library.jhu.edu/)
#### Reina Chano Murray, JHU Data Services
#### Date: February 27, 2023

## Table of Contents

#### Introduction
[Software and materials](#Software-and-materials)   
[Pre-requisites](#Pre-requisites)   
[Learning objectives](#Learning-Objectives)   

#### Section 1: Temperatures dataset
[pandas Overview](#pandas:-a-Python-library-for-data-analysis)   
[Exercise 1: Why use pandas?](#Exercise-1:-Why-use-pandas?)   

[Data structures: Series and DataFrame](#Data-structures:-Series-and-DataFrame)     
[Exercise 2: Create a Series object](#Exercise-2:-Create-a-Series-object)   
[Exercise 3: Create a DataFrame](#Exercise-3:-Create-a-DataFrame)   
[Exercise 4: Exploring a DataFrame](#Exercise-4:-Exploring-a-DataFrame)   
[Exercise 5: Subsetting a DataFrame](#Exercise-5:-Subsetting-a-DataFrame)   
[Exercise 6: Adding and renaming columns](#Exercise-6:-Adding-and-renaming-columns)   

#### Section 2: Palmer Penguins dataset
[More data manipulation](#More-data-manipulation)   
[Exercise 7: Exploratory data analysis](#Exercise-7:-Exploratory-data-analysis)   
[Exercise 8: Dealing with missing values](#Exercise-8:-Dealing-with-missing-values)   
[Exercise 9: Sorting data](#Exercise-9:-Sorting-data)

#### Summary
[Summary](#Summary)  
[Questions?](#Questions?)   

## Software and materials     

- Jupyter Notebooks or JupyterLab ([Anaconda distribution](https://www.anaconda.com/products/individual) recommended)   
    - Please install the following libraries:
        - `pandas`
- Zip folder from the Data Services [GitHub repo](https://github.com/jhu-data-services/data-wrangling-pandas) containing:
    - DataWranglingPandas_InClass.ipynb
    - Images folder
    - Data folder

## Pre-requisites

- Knowledge of basic programming concepts
    - Data types
    - Variable assignment
    - Function calls
- Introductory experience in Python or R (e.g., Data Services Intro to Python or Intro to R workshops)

## About this Webinar

#### Recording
This workshop will be recorded. Recording will be stopped during Q&A. An edited version of this recording will be made available for JHU patrons to access via Panopto later in the semester. 

#### 2-part series
Today is part 1 of a 2-part series. Part 2 will take place on **March 6, 2023** from 1-3 pm.  

## Learning Objectives
<div class="alert alert-info">
    <p>Over the course of this 2-part webinar series, students will learn:
        <ul>
            <li>What the pandas library is</li>
            <li>The two primary data structures of the pandas library: Series and DataFrame</li>
            <li>How to implement functions from the pandas library to explore and manipulate a dataset, including:
                <ul>
                    <li>Exploratory data analysis</li>
                    <li>Subsetting or filtering data</li>
                    <li>Handling missing data</li>
                    <li>Sorting data</li>
                    <li>Calculating basic summary statistics</li>
                    <li>Grouping data</li>
                    <li>Joining data</li>
                </ul></li>
            <li>How to review documentation and reference information for pandas</li>
         </ul>
    </p>
</div>

***

<center><img src='./Images/DataServicesAbout.png'></center>

## Note: the copy of these materials you have downloaded is YOURS

Add notes, write additional code or comments, mark up the document in a way that is helpful to you!

***

<center><img src='./Images/pandas-logo.png'></center>

<div class="alert alert-block alert-info">
    <h3> Section 1: Temperatures dataset </h3>
    <h4>In this section:</h4> 
    
[pandas Overview](#pandas:-a-Python-library-for-data-analysis)  
- [Exercise 1: Why use pandas?](#Exercise-1:-Why-use-pandas?)   
    
[Data structures: Series and DataFrame](#Data-structures:-Series-and-DataFrame)  
- [pandas Series](#pandas-Series)   
- [Exercise 2: Create a Series object](#Exercise-2:-Create-a-Series-object)   
    
[pandas DataFrame](#pandas-DataFrame)   
- [Exercise 3: Create a DataFrame](#Exercise-3:-Create-a-DataFrame)   
- [Exercise 4: Exploring a DataFrame](#Exercise-4:-Exploring-a-DataFrame)   
- [Exercise 5: Subsetting a DataFrame](#Exercise-5:-Subsetting-a-DataFrame)   
- [Exercise 6: Adding and renaming columns](#Exercise-6:-Adding-and-renaming-columns)   
</div>

## pandas: a Python library for data analysis
The `pandas` library is an open-source Python library that helps you work with data. 

- Supports a full data analysis workflow:
    - data cleaning
    - data exploration
    - data transformation (merging, joining, reshaping, pivoting)
    - data analysis
    - data visualization

- Works with a range of data formats (CSV, Excel, JSON, XML, SQL, etc.)

- Similar in structure to data frames in the R programming language

- Especially good for time series data, statistics, machine learning

- Documentation: [https://pandas.pydata.org/docs/index.html](https://pandas.pydata.org/docs/index.html)

### Exercise 1: Why use pandas?

<div class="alert alert-block alert-warning">
Below is a list of temperatures in Fahrenheit. In an empty code cell, write some Python code to convert the temperatures from Fahrenheit to Celsius. Assign the new temperatures to a new list called <code>temps_c</code> </div>

The formula to convert Fahrenheit to Celsius is
<br>
$${\frac {F-32}{1.8}}$$

In [None]:
temps_f = [66, 70, 66, 64, 64, 59, 52]

In [None]:
# code to convert temps_f to Celsius
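One possible solution, sketched in plain Python with a list comprehension (no pandas yet). Rounding to two decimal places is an assumption added for readability; the exercise does not ask for it.

```python
temps_f = [66, 70, 66, 64, 64, 59, 52]

# apply (F - 32) / 1.8 to each value, one element at a time
temps_c = [round((f - 32) / 1.8, 2) for f in temps_f]

print(temps_c)
```

This works, but every calculation has to be spelled out as a loop over the list. Keep that in mind when comparing it with the pandas approach in the next exercise.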


## Data structures: Series and DataFrame

### pandas Series

- A one-dimensional array
    - Similar to a spreadsheet with 1 column

- Can hold any data type (integer, string, float, Python objects, etc.)

- Row (axis) labels are called the **index**

#### Exercise 2: Create a Series object

<div class="alert alert-block alert-warning">
    Create a pandas Series using our list of temperatures in Fahrenheit, <code>temps_f</code>. Then use the pandas library to convert the temperatures to Celsius. </div>

In [None]:
# import pandas library


In [None]:
# transform list temps_f into a pandas series named temps_series_f


In [None]:
temps_series_f

In `temps_series_f`, the left column (0, 1, 2, 3,...) is the index. The right column (66, 70, 66...) is our data.

In [None]:
# convert the Fahrenheit values in temps_series_f to Celsius, saved in a variable named temps_series_c
# reminder: celsius = (temp_f - 32) / 1.8

In [None]:
temps_series_c
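A minimal sketch of one way to complete Exercise 2, assuming pandas is imported under its conventional alias `pd`. The variable names follow the ones used in the cells above.

```python
import pandas as pd

temps_f = [66, 70, 66, 64, 64, 59, 52]

# build a Series from the list; pandas assigns a default integer index (0 to 6)
temps_series_f = pd.Series(temps_f)

# arithmetic on a Series is applied to every element at once, so no loop is needed
temps_series_c = (temps_series_f - 32) / 1.8

print(temps_series_c)
```

Compared with Exercise 1, the conversion is a single expression on the whole Series rather than a loop over individual values.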

#### pandas Series - some useful attributes and methods
For more information on series, view its [documentation](https://pandas.pydata.org/docs/reference/series.html)

In [None]:
# how big is our series?
temps_series_c.size

In [None]:
# are all the values unique?
temps_series_c.is_unique

In [None]:
# see number of occurrences of each value in a series
temps_series_c.value_counts()

In [None]:
# look at just the top or last n rows
temps_series_c.head()   # returns top n rows

In [None]:
temps_series_c.tail()   # returns last n rows

In [None]:
# aggregations

# what is the average temperature recorded?
temps_series_c.mean()

In [None]:
# what is the median temperature recorded?
temps_series_c.median()

In [None]:
# what is the highest temperature recorded?
temps_series_c.max()

In [None]:
# what is the lowest temperature recorded?
temps_series_c.min()

In [None]:
# sort Series in order of smallest to largest temperature
temps_series_c.sort_values()
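A Series has two sorting methods that are easy to mix up. The sketch below contrasts them; `by_value` and `by_label` are illustrative names, not part of the workshop materials.

```python
import pandas as pd

temps_series_c = (pd.Series([66, 70, 66, 64, 64, 59, 52]) - 32) / 1.8

# sort_values() orders by the data, smallest to largest by default
by_value = temps_series_c.sort_values()

# sort_index() orders by the index labels instead and ignores the data
by_label = temps_series_c.sort_index(ascending=False)

print(by_value)
print(by_label)
```

To order temperatures from smallest to largest, `sort_values()` is the method to reach for; `sort_index()` only rearranges rows by their labels.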

### pandas DataFrame

- A two-dimensional array
    - Similar to a spreadsheet with multiple columns, or many Series combined
    - Will also have a header row

- Can hold any data type
    - different columns can hold different data types

- Row (axis 0) labels are called the index
- Column (axis 1) labels are called columns

### Exercise 3: Create a DataFrame

<div class="alert alert-block alert-warning">
    Create a dataframe, called <code>df</code>, using our list of temperatures <code>temps_f</code> and the below list of days of the week <code>days_list</code>.<br>The temperatures listed in <code>temps_f</code> represent estimated high temperatures for a recent week in Baltimore.</div>

In [None]:
days_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

There are multiple ways we can create a dataframe from Python lists.

<div class="alert alert-block alert-success">
    Option 1: Create an empty DataFrame, then add our lists as new columns</div>

In [None]:
# create empty dataframe named df


The syntax to add a new column is `dataframe[col_name] = data_for_column`

In [None]:
# add column days


In [None]:
# add column temps_f


In [None]:
# view the dataframe


<div class="alert alert-block alert-success">
Option 2: Combine our two lists into a Python dictionary, then create a DataFrame from the dictionary</div>  

In [None]:
# combine days and temps_f into a Python dictionary


In [None]:
# create DataFrame from dictionary object


In [None]:
# view the dataframe
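Both options can be sketched side by side as follows. The column names `days` and `temps_f` match the ones used later in this notebook; `temps_dict` and `df2` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

temps_f = [66, 70, 66, 64, 64, 59, 52]
days_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Option 1: start with an empty DataFrame, then add each list as a column
df = pd.DataFrame()
df['days'] = days_list
df['temps_f'] = temps_f

# Option 2: build a dictionary of column name -> list of values, then convert it
temps_dict = {'days': days_list, 'temps_f': temps_f}
df2 = pd.DataFrame(temps_dict)

print(df)
print(df2)
```

Either route should give a table with 7 rows and 2 columns. The dictionary approach is usually more compact when all of the data is available up front.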


# 5 minute break
When we come back: Exploratory data analysis

### Exercise 4: Exploring a DataFrame
In this section, we will find basic information about our dataframe and start to manipulate our data.

`pandas` has many methods and attributes to help us explore a new dataset. Here are a few of them:  

| Syntax | Description |
| :----------- | :----------- |
| **.head()** | returns first 5 rows [default, put desired number of rows in the parentheses] |
| **.tail()** | returns last 5 rows [default, put desired number of rows in the parentheses] |
| **.sample()** | returns random sample of the dataframe |
| **.dtypes** | returns data type of each column |
| **.shape** | returns tuple representing the dimensionality of the dataframe (rows, columns) |
| **.axes** | returns list representing axes of the dataframe |
| **.info()** | prints a summary of the dataframe |
| **.columns** | returns column names |
| **.unique()** | returns unique values for a given column or Series. *Note*: Must use this function on an individual column in a DataFrame, or on a singular Series |
| **.describe()** | returns summary statistics for numeric columns |

Why do some of these have parentheses and some do not?

Python supports several programming paradigms, including **procedural** (or sequential) programming and **object-oriented programming** (OOP).  

Many Python libraries, including pandas, are built with object-oriented programming. In object-oriented programming, you create **classes**.  

Pandas DataFrames and Series are both classes; when we create a DataFrame in our code, we are creating an *instance* of this class (what's known as instantiating a class). 

Classes have **methods** and **attributes**. 

**Methods** are functions belonging to the class.   

- these functions perform actions on the dataframe or series
- parentheses () can hold additional arguments

Pandas example: `.sample()` - random sample of the dataframe

**Attributes** are properties of the class.  

- these attributes are intrinsic to the dataframe or series
- used for description

Pandas example: `.columns` - column names of the dataframe

For more information, view the [full list of DataFrame attributes and methods](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
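The difference between attributes and methods can be seen in a short sketch. It rebuilds the temperatures DataFrame so that the block runs on its own.

```python
import pandas as pd

days_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temps_f = [66, 70, 66, 64, 64, 59, 52]
df = pd.DataFrame({'days': days_list, 'temps_f': temps_f})

# attributes: no parentheses, they simply report a stored property
print(df.shape)      # (rows, columns) tuple
print(df.columns)    # column labels

# methods: parentheses call the function, optionally with arguments
print(df.head(2))        # first 2 rows
print(df.sample(n=1))    # 1 randomly chosen row

# leaving the parentheses off a method gives back the method itself, not its result
print(callable(df.head))
```

If a cell prints something like `<bound method ...>` instead of data, the usual cause is a method name typed without its parentheses.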

<div class="alert alert-block alert-warning">
    Try 4 or 5 of the attributes and methods listed to explore our temperatures dataset <code>df</code> </div>

| Syntax | Description |
| :----------- | :----------- |
| **.head()** | returns first 5 rows [default, put desired number of rows in the parentheses] |
| **.tail()** | returns last 5 rows [default, put desired number of rows in the parentheses] |
| **.sample()** | returns random sample of the dataframe |
| **.dtypes** | returns data type of each column |
| **.shape** | returns tuple representing the dimensionality of the dataframe (rows, columns) |
| **.axes** | returns list representing axes of the dataframe |
| **.info()** | prints a summary of the dataframe |
| **.columns** | returns column names |
| **.unique()** | returns unique values for a given column or Series. *Note*: Must use this function on an individual column in a DataFrame, or on a singular Series |
| **.describe()** | returns summary statistics for numeric columns |

In [None]:
# Try 4 or 5 of the above attributes and methods to explore our temperatures dataset df
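As one possible way through this exercise, the sketch below tries four of the listed tools on the temperatures data. The selection is arbitrary; any of the others would do equally well.

```python
import pandas as pd

days_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temps_f = [66, 70, 66, 64, 64, 59, 52]
df = pd.DataFrame({'days': days_list, 'temps_f': temps_f})

print(df.dtypes)                 # data type of each column
df.info()                        # prints a summary; info() is a method, so it needs ()
print(df.describe())             # count, mean, std, min, quartiles, max for numeric columns
print(df['temps_f'].unique())    # unique() works on a single column (a Series)
```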


### Exercise 5: Subsetting a DataFrame
What if we want to search our dataframe for the temperature on a specific day? Or find all days with a specific temperature value?

<div class="alert alert-block alert-success">
Option 1: Extract specific rows by index</div>   

- **.iloc[ ]** - integer location; returns row at given integer
- **.loc[ ]** - location; returns all rows with given index value; does not need to be an integer     

We use the square bracket [ ] notation to select an index, just as we would when indexing strings or lists in plain Python.

In [None]:
# Look for the information at index 0


In [None]:
# Look for information at index 1


In [None]:
# get multiple locations by indicating a range for our index
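A rough sketch of these index-based lookups, assuming the DataFrame still has its default integer index (0 to 6):

```python
import pandas as pd

days_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temps_f = [66, 70, 66, 64, 64, 59, 52]
df = pd.DataFrame({'days': days_list, 'temps_f': temps_f})

# a single row comes back as a Series
print(df.iloc[0])    # row at integer position 0
print(df.loc[1])     # row whose index label is 1

# a range of rows comes back as a DataFrame
print(df.iloc[0:3])  # positions 0, 1, 2 (end position is excluded)
print(df.loc[0:3])   # labels 0, 1, 2, 3 (end label is included)
```

Note the difference in the last two lines: slicing with `.iloc[]` follows the usual Python rule of excluding the end, while slicing with `.loc[]` includes the end label.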


`.iloc[]` and `.loc[]` are great for subsetting our data, but we may not always know the indices we want to subset. More often, we'll be looking to subset our data based on certain *conditions*. 

<div class="alert alert-block alert-success">
    Option 2: Filter by known element in <code>days</code> column</div>   

1. Select a column using `df.colName` or `df['colName']`
2. Filter that column using [comparison and logical operators](https://www.w3schools.com/python/python_operators.asp) (examples: >, <, ==, |, &)

<div class="alert alert-block alert-warning">
    Filter <code>df</code> to show all rows where the <code>days</code> column is Tuesday </div>

In [None]:
# show all rows where days == Tuesday, using dot notation (df.days)


In [None]:
# show all rows where days == Tuesday, using bracket notation (df['days'])


Both of these syntax approaches return a Series of boolean values of `True` or `False`.  
To return a subset, similar to what we did earlier with `.loc[]` and `.iloc[]`, we need to put our comparison statement in `[]`.

In [None]:
# return a subset of the dataframe where days == Tuesday
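The two steps, building a boolean Series and then using it to pull out rows, can be sketched like this. `is_tuesday`, `same_mask`, and `tuesday_rows` are illustrative names only.

```python
import pandas as pd

days_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temps_f = [66, 70, 66, 64, 64, 59, 52]
df = pd.DataFrame({'days': days_list, 'temps_f': temps_f})

# step 1: the comparison gives one True/False value per row
is_tuesday = df['days'] == 'Tuesday'
same_mask = df.days == 'Tuesday'     # dot notation produces the same result

# step 2: placing the boolean Series inside [] keeps only the True rows
tuesday_rows = df[is_tuesday]

print(is_tuesday)
print(tuesday_rows)
```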


<div class="alert alert-block alert-warning">
    Filter <code>df</code> to show where the <code>days</code> column is Tuesday or Wednesday </div>

<div class="alert alert-block alert-warning">
    Filter <code>df</code> to return a <b>subset</b> of rows where the <code>days</code> column is Tuesday or Wednesday </div>

<div class="alert alert-block alert-warning">
    Filter <code>df</code> to return a <b>subset</b> of rows where the <code>temps_f</code> column is greater than 60 </div>

<div class="alert alert-block alert-warning">
    Filter <code>df</code> to return a <b>subset</b> of rows where <code>days</code> is Saturday or Sunday, and <code>temps_f</code> is greater than 55 </div>
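One way to approach the subset exercises above is sketched below. The result names are hypothetical, and `.isin()` is used for the weekend filter as a shorthand that the exercise text does not require.

```python
import pandas as pd

days_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temps_f = [66, 70, 66, 64, 64, 59, 52]
df = pd.DataFrame({'days': days_list, 'temps_f': temps_f})

# OR: combine two conditions with |, wrapping each condition in parentheses
tue_or_wed = df[(df['days'] == 'Tuesday') | (df['days'] == 'Wednesday')]

# a single numeric condition
warm = df[df['temps_f'] > 60]

# AND: combine with &; isin() checks membership in a list of values
warm_weekend = df[df['days'].isin(['Saturday', 'Sunday']) & (df['temps_f'] > 55)]

print(tue_or_wed)
print(warm)
print(warm_weekend)
```

The parentheses around each condition matter: `|` and `&` bind more tightly than `==` and `>`, so leaving them out typically raises an error or gives an unintended result.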

**Resource:** [Learn more about the indexing operator and how to select subsets of dataframes](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c)

<div class="alert alert-block alert-success">
    Option 3: Set <code>days</code> as the index, extract rows by index</div> 
    
We can set one of the dataframe columns as the index using the **.set_index()** method.   
We can then use **.loc[ ]** to index the dataframe. 

In [None]:
# take a look at the original dataframe


In [None]:
# Create a new dataframe
# use .set_index() to set the index to the 'days' column


In [None]:
# Notice that now, our dataframe has the days as the index


In [None]:
# explore the new dataframe


Why do this? Remember `.loc[]`?

**.loc[ ]** - returns all rows with given index value; does not need to be an integer     


In [None]:
# return rows where the index value is 'Tuesday'

In [None]:
# notice how now you'll get an error if you search for index 0 using .loc[] but you CAN use .iloc[]
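A brief sketch of Option 3, where `df_days` is a hypothetical name for the re-indexed copy:

```python
import pandas as pd

days_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temps_f = [66, 70, 66, 64, 64, 59, 52]
df = pd.DataFrame({'days': days_list, 'temps_f': temps_f})

# set_index() returns a new DataFrame; the original df is left unchanged
df_days = df.set_index('days')

# .loc[] now looks rows up by day name
print(df_days.loc['Tuesday'])

# .iloc[] still works by integer position
print(df_days.iloc[0])

# .loc[0] fails because no row is labelled 0 any more
try:
    df_days.loc[0]
    got_key_error = False
except KeyError:
    got_key_error = True
print(got_key_error)
```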


### Exercise 6: Adding and renaming columns
The temperature data we have are the high temps for the week. Let's add a new column with the week's low temperatures:

In [None]:
low_temp_list = [41, 43, 45, 43, 54, 43, 37]

Remember the syntax to add a new column: `dataframe[col_name] = data_for_column`

In [None]:
# add new column from low_temp_list


In [None]:
# view the dataframe
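Adding the low temperatures might look like the following sketch. The column name `low_temps` is an assumption; the exercise does not specify one.

```python
import pandas as pd

days_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temps_f = [66, 70, 66, 64, 64, 59, 52]
df = pd.DataFrame({'days': days_list, 'temps_f': temps_f})

low_temp_list = [41, 43, 45, 43, 54, 43, 37]

# the list must have one value per existing row (7 here)
df['low_temps'] = low_temp_list   # 'low_temps' is an assumed column name

print(df)
```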


Now let's change the name of column `temps_f` to the more descriptive `high_temps`   

We have 2 options: we can use the **.rename()** method or the **columns** attribute

[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) for .rename()

<div class="alert alert-block alert-warning">
    To use the .rename() method, we provide a dictionary with <code>'old_column_name' : 'new_column_name'</code><br>Remember that Python dictionaries use <code>{}</code> </div>

In [1]:
# create a Python dictionary where the key = current column name, value = new column name
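A minimal sketch of the `.rename()` approach, with `rename_map` as a hypothetical name for the dictionary:

```python
import pandas as pd

days_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temps_f = [66, 70, 66, 64, 64, 59, 52]
df = pd.DataFrame({'days': days_list, 'temps_f': temps_f})

# key = current column name, value = new column name
rename_map = {'temps_f': 'high_temps'}

# rename() returns a new DataFrame, so assign the result back to keep the change
df = df.rename(columns=rename_map)

print(df.columns)
```

Columns that are not mentioned in the dictionary keep their current names.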

To use the columns attribute, simply provide a list of all of the column names you want for the dataframe. This is a great option if you are renaming multiple columns. However, you must provide a name for **all** of the columns in the dataframe, even if you do not want to change all of the column names.   

<div class="alert alert-block alert-warning">
    Use the <code>columns</code> attribute to change all of the column names to uppercase: </div>

In [None]:
# start by listing all columns
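The `columns` approach can be sketched as below. It assumes the DataFrame currently has the columns `days`, `high_temps`, and `low_temps`; the last of these is an assumed name for the low-temperature column.

```python
import pandas as pd

df = pd.DataFrame({
    'days': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    'high_temps': [66, 70, 66, 64, 64, 59, 52],
    'low_temps': [41, 43, 45, 43, 54, 43, 37],   # assumed column name
})

# start by listing all current column names
print(list(df.columns))

# assign a full replacement list: one new name per existing column, in order
df.columns = ['DAYS', 'HIGH_TEMPS', 'LOW_TEMPS']

print(list(df.columns))
```

When the change follows a pattern, as it does here, `df.columns = df.columns.str.upper()` is a shorter alternative that avoids typing each name.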


# 5 minute break
When we come back: Penguins!

![Gentoo penguin with chick](Images/Gentoo_Penguin_with_chick_at_Jougla_Point,_Antarctica_(6063647060).jpg)

<div class="alert alert-block alert-info">
    <h3>Section 2: Palmer Penguins dataset</h3>
    <h4>In this section:</h4>
    
[More data manipulation](#More-data-manipulation)   
[Exercise 7: Exploratory data analysis](#Exercise-7:-Exploratory-data-analysis)   
[Exercise 8: Dealing with missing values](#Exercise-8:-Dealing-with-missing-values)   
[Exercise 9: Sorting data](#Exercise-9:-Sorting-data) 
    </div>

### More data manipulation
In this section, we'll use the Palmer Penguins dataset. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

This dataset was compiled by developer Allison Horst as an R package [(see R documentation here)](https://allisonhorst.github.io/palmerpenguins/).   

The dataset is also available as a [Python library](https://pypi.org/project/palmerpenguins/), which I have converted to a CSV file and provided for this workshop.

In this section, we will:
- Import data from a CSV file
- Perform exploratory data analysis
- Clean and manipulate the dataset
    - Handle missing values
    - Sort the dataset  
    - Calculate basic summary statistics

<div class="alert alert-block alert-warning">
    Use the <code>.read_csv()</code> function to import our dataset.
    </div>

[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for .read_csv()

In [None]:
# import the dataset from file palmerpenguins.csv
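The sketch below shows how `read_csv()` turns CSV text into a DataFrame. So that it runs without the workshop file, it reads from an in-memory string holding three made-up rows; these values are placeholders, not real penguin measurements, and only a few of the dataset's columns are included.

```python
from io import StringIO

import pandas as pd

# stand-in for the workshop file: made-up rows for illustration only
csv_text = StringIO(
    "species,island,bill_length_mm,year\n"
    "Adelie,Torgersen,39.0,2007\n"
    "Gentoo,Biscoe,,2008\n"
    "Chinstrap,Dream,50.2,2009\n"
)

# read_csv() uses the first line as column names and infers each column's type
penguins = pd.read_csv(csv_text)

print(penguins)
print(penguins.dtypes)
```

With the real dataset, pass the file path as a string instead of the `StringIO` object. The exact path depends on where `palmerpenguins.csv` sits relative to your notebook. The empty field in the second row is read in as `NaN`, which is how pandas marks a missing value.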


### Exercise 7: Exploratory data analysis
<div class="alert alert-block alert-warning">
    Spend 2 minutes getting to know the <code>penguins</code> dataset. Try methods and attributes like .shape, .dtypes, .describe(), or .unique()
    </div>

### Exercise 8: Dealing with missing values
<div class="alert alert-block alert-warning">
    We'll use the <code>.isna()</code> method to check if we have any missing values (NaN) in our dataset. Then we will drop all rows that have any missing value using <code>.dropna()</code>
    </div>

- [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html) for .isna()   
- [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) for .dropna()

In [None]:
# check for missing values


`.isna()` will also work with a specific column: `df[column].isna()`   
You can add the `.unique()` method to quickly see if any data in the column is missing

In [None]:
# use .isna() and .unique() together
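To illustrate how these checks behave, here is a sketch on a tiny made-up table (`toy`) rather than the real penguins data. The values are placeholders chosen so that one row has gaps.

```python
import pandas as pd

# made-up rows for illustration; None becomes a missing value
toy = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap'],
    'bill_length_mm': [39.0, None, 50.2],
    'sex': ['male', None, 'female'],
})

# isna() gives True wherever a value is missing
print(toy.isna())

# summing the True values counts the missing entries per column
print(toy.isna().sum())

# on one column: if True appears in the result, something is missing
print(toy['bill_length_mm'].isna().unique())
```

`.isna().sum()` is not mentioned in the text above, but it is a common companion to `.isna()` because it condenses the True/False table into one count per column.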


**.dropna()** will drop rows or columns that have missing values.  
The `axis` argument is EXTREMELY important!
- `axis=0` -- drop rows with missing values
- `axis=1` -- drop columns with missing values

Read more [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html). 

In [None]:
# remove all rows that have at least one missing value
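The effect of the `axis` argument is easiest to see on a small example. The sketch below uses a made-up three-row table (`toy`), not the real penguins data.

```python
import pandas as pd

# made-up rows for illustration; the second row has two missing values
toy = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap'],
    'bill_length_mm': [39.0, None, 50.2],
    'sex': ['male', None, 'female'],
})

# axis=0 (the default): drop every ROW that contains a missing value
rows_dropped = toy.dropna(axis=0)

# axis=1: drop every COLUMN that contains a missing value
cols_dropped = toy.dropna(axis=1)

print(toy.shape, rows_dropped.shape, cols_dropped.shape)
```

Here `axis=0` removes one row, while `axis=1` removes two whole columns. On a real dataset the wrong choice can discard most of the data, which is why the argument deserves attention. Like most pandas methods, `dropna()` returns a new DataFrame, so assign the result to a variable to keep it.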


Our dataset started with 344 rows. After dropping rows with missing values, how many rows are left?

In [None]:
# check size of the dataset


We now have 333 rows. We still have all 8 columns.

You can check for missing values again in, for example, the `bill_length_mm` column:

### Exercise 9: Sorting data 
<div class="alert alert-block alert-warning">
    Use <code>.sort_values()</code> to order <code>penguins</code> by bill length, from smallest to largest. Then order <code>penguins</code> by bill length from largest to smallest.  
    </div>

By default, the **.sort_values()** method sorts data in ascending order (smallest to largest).  
- Use the `by` argument to specify which column(s) to sort by
- Use the `ascending=False` argument to sort in descending order

[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) for .sort_values()

In [None]:
# sort penguins by bill length, smallest to largest


In [None]:
# sort penguins by bill length, largest to smallest
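Both sort directions can be sketched as follows. The three rows are made up for illustration and stand in for the real `penguins` DataFrame.

```python
import pandas as pd

# made-up rows for illustration only
penguins = pd.DataFrame({
    'species': ['Gentoo', 'Adelie', 'Chinstrap'],
    'bill_length_mm': [46.5, 39.0, 50.2],
})

# ascending order is the default
smallest_first = penguins.sort_values(by='bill_length_mm')

# ascending=False reverses the order
largest_first = penguins.sort_values(by='bill_length_mm', ascending=False)

print(smallest_first)
print(largest_first)
```

The original index labels travel with their rows, so the index of the sorted result will appear out of order. `sort_values()` also leaves `penguins` itself unchanged unless the result is assigned back.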


We can sort a dataframe by multiple variables by passing a list of column names into the `by` argument.   

<div class="alert alert-block alert-warning">
    Sort <code>penguins</code> first by year, then by bill length in ascending order (smallest to largest).
    </div>

In [None]:
# sort penguins by year, then bill length in ascending order


We can also specify ascending vs descending order for each column when sorting by multiple columns by passing a second list for our `ascending` argument. 

<div class="alert alert-block alert-warning">
    Sort <code>penguins</code> first by year in descending order, then by bill length in ascending order.
    </div>

In [None]:
# sort penguins by year (descending order), then bill length (ascending order)
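Sorting by more than one column might look like the sketch below, again on made-up rows standing in for the real `penguins` data.

```python
import pandas as pd

# made-up rows for illustration only
penguins = pd.DataFrame({
    'year': [2007, 2008, 2007, 2008],
    'bill_length_mm': [46.5, 39.0, 41.2, 50.2],
})

# both columns ascending: rows are ordered by year, ties broken by bill length
both_ascending = penguins.sort_values(by=['year', 'bill_length_mm'])

# one True/False per column in `by`: year descending, bill length ascending
mixed_order = penguins.sort_values(by=['year', 'bill_length_mm'],
                                   ascending=[False, True])

print(both_ascending)
print(mixed_order)
```

The list passed to `ascending` must have the same length as the list passed to `by`, with the entries matched up by position.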


**Resource**: for more examples of sorting mechanisms in pandas, see [this article](https://www.geeksforgeeks.org/how-to-sort-a-pandas-dataframe-by-multiple-columns-in-python/)

# Summary
Today we covered:
- what is `pandas`?
- Data Structures: pandas Series and DataFrames
- Creating and exploring a Series object
- Creating and exploring a DataFrame object
    - subsetting 
    - adding and renaming columns
- Data manipulation:
    - exploratory data analysis
    - removing missing values (`NaN`)
    - sorting data

## Stay tuned for Part 2:
We'll cover:
- Data manipulation continued:
    - grouping and aggregating data
    - joining data
- Further data exploration:
    - basic calculations
- Exporting a DataFrame

March 6, 2023 from 1-3 pm.  
Same Zoom link

## Questions?   

## Contact us at dataservices@jhu.edu

### About this Presentation  
This presentation was created using Jupyter Notebook version 6.5.2 and the RISE notebook extension version 5.7.1.    

### Terms of Use 
The presentation materials are licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/), attributable to Data Services, Johns Hopkins University.   

Please cite this material as:

> Johns Hopkins University Data Services. (2023, February 27). Data Wrangling in Python: Introduction to the pandas library [workshop presentation].