# Data Wrangling in Python: Introduction to the pandas library
## [dataservices.library.jhu.edu](https://dataservices.library.jhu.edu/)
### Marley Kalt and Harshil Desai, JHU Data Services
### Date: November 9, 2021

## Table of Contents

#### Introduction
[Software and materials](#Software-and-materials)   
[Pre-requisites](#Pre-requisites)   
[Learning objectives](#Today,-you-will-learn:)   

#### Section 1: Temperatures dataset
[pandas Overview](#pandas:-a-Python-library-for-data-analysis)   
[Exercise 1: Why use pandas?](#Exercise-1:-Why-use-pandas?)   
[Data structures: Series and DataFrame](#Data-structures:-Series-and-DataFrame)   
[pandas Series](#pandas-Series)   
[Exercise 2: Create a Series object](#Exercise-2:-Create-a-Series-object)   
[pandas DataFrame](#pandas-DataFrame)   
[Exercise 3: Create a DataFrame](#Exercise-3:-Create-a-DataFrame)   
[Exercise 4: Exploring a DataFrame](#Exercise-4:-Exploring-a-DataFrame)   
[Exercise 5: Subsetting a DataFrame](#Exercise-5:-Subsetting-a-DataFrame)   
[Exercise 6: Adding and renaming columns](#Exercise-6:-Adding-and-renaming-columns)   

#### Section 2: Palmer Penguins dataset
[More data manipulation](#More-data-manipulation)   
[Exercise 7: Exploratory data analysis](#Exercise-7:-Exploratory-data-analysis)   
[Exercise 8: Dealing with missing values](#Exercise-8:-Dealing-with-missing-values)   
[Exercise 9: Sorting data](#Exercise-9:-Sorting-data)   
[Exercise 10: Basic calculations](#Exercise-10:-Basic-calculations)   
[Exercise 11: Grouping and aggregating data](#Exercise-11:-Grouping-and-aggregating-data)   

#### Resources section
[Additional Practice](#Additional-Practice)   
[Resources](#Resources)   
[Questions?](#Questions?)   

## Software and materials     

- Jupyter Notebooks or JupyterLab ([Anaconda distribution](https://www.anaconda.com/products/individual) recommended)   
- pandas library installed
- Zip folder containing:
    - DataWranglingPandas_Workshop.ipynb
    - Images folder
    - Data folder

## Pre-requisites

- Knowledge of basic programming concepts
    - Data types
    - Variable assignment
    - Function calls
- Introductory experience in Python or R (like Data Services' Intro to Python or Intro to R workshops)

## This workshop will not be recorded  

You will receive all workshop materials by tomorrow afternoon.

<center><img src='./Images/DataServicesAbout.png'></center>

***

## Today, you will learn:

- The two primary data structures of the pandas library: Series and DataFrame

- How to implement functions from the pandas library to explore and manipulate a dataset, including:  
    - Exploratory data analysis
    - Subsetting or filtering data
    - Handling missing data  
    - Sorting data  
    - Grouping data  
    - Calculating basic summary statistics  

***

<center><img src='./Images/pandas-logo.png'></center>

### Section 1: Temperatures dataset
#### In this section:
[pandas Overview](#pandas:-a-Python-library-for-data-analysis)   
[Exercise 1: Why use pandas?](#Exercise-1:-Why-use-pandas?)   
[Data structures: Series and DataFrame](#Data-structures:-Series-and-DataFrame)   
[pandas Series](#pandas-Series)   
[Exercise 2: Create a Series object](#Exercise-2:-Create-a-Series-object)   
[pandas DataFrame](#pandas-DataFrame)   
[Exercise 3: Create a DataFrame](#Exercise-3:-Create-a-DataFrame)   
[Exercise 4: Exploring a DataFrame](#Exercise-4:-Exploring-a-DataFrame)   
[Exercise 5: Subsetting a DataFrame](#Exercise-5:-Subsetting-a-DataFrame)   
[Exercise 6: Adding and renaming columns](#Exercise-6:-Adding-and-renaming-columns)   

## pandas: a Python library for data analysis

- Supports data manipulation and analysis

- Works with tabular data (spreadsheets, databases)

- Similar in structure to data frames in the R programming language

- Especially good for time series data, statistics, machine learning

- Documentation: [https://pandas.pydata.org/docs/index.html](https://pandas.pydata.org/docs/index.html)

### Exercise 1: Why use pandas?

Below is a list of temperatures in Fahrenheit. In an empty code cell, write some Python code to convert the temperatures from Fahrenheit to Celsius. Assign the new temperatures to a new list called `temps_c`   

The formula to convert Fahrenheit to Celsius is
$${\frac {F-32}{1.8}}$$

In [None]:
temps_f = [66, 70, 66, 64, 64, 59, 52]

In [None]:
# code to convert temps_f to Celsius
temps_c = []
for temp in temps_f:
    celsius = (temp - 32) / 1.8
    temps_c.append(celsius)

In [None]:
temps_c

## Data structures: Series and DataFrame

### pandas Series

- A one-dimensional array
    - Similar to a spreadsheet with 1 column

- Can hold any data type

- Row (axis) labels are called the **index**
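As a minimal sketch of the points above, the cell below builds a small Series with custom index labels. The labels (`mon`, `tue`, `wed`) and the variable name `readings` are made up for illustration and are not part of the workshop data.

```python
import pandas as pd

# Hypothetical labels and values, used only for illustration
readings = pd.Series([66, 70, 66], index=['mon', 'tue', 'wed'])

print(readings.index.tolist())   # the row labels (the index)
print(readings.values.tolist())  # the data
print(readings['tue'])           # look up one value by its label
```

If no `index` argument is given, pandas numbers the rows 0, 1, 2, and so on.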

### Exercise 2: Create a Series object

Create a pandas Series using our list of temperatures in Fahrenheit, `temps_f`, and use the pandas library to convert the temperatures to Celsius

In [None]:
# import pandas library

In [None]:
# transform list temps_f into a pandas series named temps_series_f

In [None]:
temps_series_f

In `temps_series_f`, the left column (0, 1, 2, 3,...) is the index. The right column (66, 70, 66...) is our data.
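One possible way to do the conversion, shown here as a sketch rather than the only answer: arithmetic on a Series is applied to every element at once, so the loop from Exercise 1 is not needed.

```python
import pandas as pd

temps_f = [66, 70, 66, 64, 64, 59, 52]
temps_series_f = pd.Series(temps_f)

# Arithmetic on a Series is applied element by element; no loop is needed
temps_series_c = (temps_series_f - 32) / 1.8

print(temps_series_c.round(1).tolist())
```

The result is a new Series with the same index as `temps_series_f`.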

In [None]:
# convert the Fahrenheit values in temps_series_f to Celsius, saved in a variable named temps_series_c

In [None]:
temps_series_c

### pandas DataFrame

- A two-dimensional array
    - Similar to a spreadsheet with multiple columns, or many Series combined

- Can hold any data type

- Row (axis 0) labels are called the index
- Column (axis 1) labels are called columns
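To illustrate "many Series combined", here is a small sketch that builds a DataFrame from two Series and prints its row and column labels. The name `small_df` and the three-row sample are assumptions made for this example.

```python
import pandas as pd

days = pd.Series(['Monday', 'Tuesday', 'Wednesday'])
temps = pd.Series([66, 70, 66])

# Each Series becomes one column; the dictionary keys become the column labels
small_df = pd.DataFrame({'Days': days, 'Temps_F': temps})

print(small_df.index.tolist())    # row labels (axis 0)
print(small_df.columns.tolist())  # column labels (axis 1)
```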

### Exercise 3: Create a DataFrame

Create a dataframe, called `df`, using our list of temperatures, `temps_f`, and the below list of days of the week, `days`

In [None]:
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

There are multiple ways we can create a dataframe from scratch. Below are two possibilities:

**Option 1: Create an empty DataFrame, then add our lists as new columns**

In [None]:
# create empty dataframe named df

The syntax to add a new column is `dataframe[col_name] = data_for_column`
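A short sketch of this syntax, assuming the column names `Days` and `Temps_F` used in Option 2:

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temps_f = [66, 70, 66, 64, 64, 59, 52]

df = pd.DataFrame()       # start with an empty DataFrame
df['Days'] = days         # each assignment adds one column
df['Temps_F'] = temps_f

print(df.shape)
```

Every new column must have the same number of values as the existing rows.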

In [None]:
# add column days

In [None]:
# add column temps_f

In [None]:
# view the dataframe
df

**Option 2: Combine our two lists into a Python dictionary, then create a DataFrame from the dictionary**   

In [None]:
# combine days and temps_f into a Python dictionary
temp_dict = {'Days': days, 'Temps_F': temps_f}
temp_dict

In [None]:
# create DataFrame from dictionary object
df_fromDict = pd.DataFrame(temp_dict)

In [None]:
# view the dataframe
df_fromDict

# 5 minute break
When we come back: Exploratory data analysis

### Exercise 4: Exploring a DataFrame
In this section, we will find basic information about our dataframe and start to manipulate our data

pandas has several functions to help us explore a new dataset:
- **.head()** - first 5 rows [default, put desired number of rows in the parentheses]
- **.tail()** - last 5 rows [default, put desired number of rows in the parentheses]
- **.sample()** - random sample of the dataframe
- **.dtypes** -  data type of each column
- **.shape** - tuple of (rows, columns)
- **len()** - base Python function, length of object
    - Note: In the pandas functions, we put our data object first, `df.function_name`, as required by pandas syntax. In contrast, `len()` is a Python function and uses different syntax.
- **.columns** - column names
- **.unique()** - unique values for a given column or Series
    - Must use this function on an individual column in a DataFrame, or on a singular Series
- **.describe()** - summary statistics for numeric columns

Why do some of these functions have parentheses and some do not?

Names followed by () are **methods**   

Names without () are **attributes**

Methods:
- Actions performed on a dataframe
- Parentheses () can hold additional arguments

Attributes:
- Things intrinsic to the dataframe
- Used for description
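The difference can be seen side by side in the sketch below, which uses a three-row sample of the temperatures data.

```python
import pandas as pd

df = pd.DataFrame({'Days': ['Monday', 'Tuesday', 'Wednesday'],
                   'Temps_F': [66, 70, 66]})

# Attributes: no parentheses, they describe the DataFrame
print(df.shape)
print(df.dtypes)

# Methods: parentheses, which can hold arguments
print(df.head(2))
print(df['Temps_F'].unique())

# Built-in Python function: the object goes inside the parentheses
print(len(df))
```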

**Resource:** [Full list of DataFrame attributes and methods](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)

Try 4 or 5 of the above functions to explore our temperatures dataset, `df`

### Exercise 5: Subsetting a DataFrame
What if we want to search our dataframe for the temperature on a specific day? Or find all days with a specific temperature value?

**Option 1: Extract specific rows by index**
- **.iloc[ ]** - integer location; returns the row at the given integer position
- **.loc[ ]** - label location; returns all rows with the given index label; the label does not need to be an integer     

We use the square bracket [ ] notation to select an index, just as we would when indexing strings or lists in other Python programs.
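A minimal sketch of both indexers on the temperatures data. With the default index, positions and labels are the same numbers, but slices behave differently: `.iloc` excludes the end position, while `.loc` includes the end label. The variable names are made up for this example.

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temps_f = [66, 70, 66, 64, 64, 59, 52]
df = pd.DataFrame({'Days': days, 'Temps_F': temps_f})

first_row = df.iloc[0]         # row at integer position 0
by_label = df.loc[1]           # row whose index label is 1
first_three = df.iloc[0:3]     # positions 0, 1, 2 (end excluded)
through_label_3 = df.loc[0:3]  # labels 0 through 3 (end included)

print(first_row)
print(len(first_three), len(through_label_3))
```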

**Option 2: Filter by known element in `Days` column**

1. Select a column using `df.colName` or `df['colName']`
2. Filter that column using [comparison and logical operators](https://www.w3schools.com/python/python_operators.asp) (examples: >, <, ==, |, &)

Filter `df` to show all rows where the Days column is Tuesday

Filter `df` to show all rows where the Days column is Tuesday or Wednesday

Filter `df` to show all rows where the Temps_F column is greater than 60

Filter `df` to show all rows where Days is Saturday or Sunday, and Temps_F is greater than 55
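One possible set of answers is sketched below; other approaches work too. Each comparison produces a Series of True/False values, and `df[...]` keeps the rows where the value is True. When combining conditions with `|` or `&`, wrap each comparison in parentheses. The result names are assumptions for this example.

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temps_f = [66, 70, 66, 64, 64, 59, 52]
df = pd.DataFrame({'Days': days, 'Temps_F': temps_f})

tuesday = df[df['Days'] == 'Tuesday']
tue_or_wed = df[(df['Days'] == 'Tuesday') | (df['Days'] == 'Wednesday')]
warm = df[df['Temps_F'] > 60]

# .isin() is a shorter way to write several "or" conditions on one column
weekend_warm = df[df['Days'].isin(['Saturday', 'Sunday']) & (df['Temps_F'] > 55)]

print(weekend_warm)
```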

**Resource:** [Learn more about the indexing operator and how to select subsets of dataframes](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c)

**Option 3: Set `Days` as the index, extract rows by index**   

We can set one of the dataframe columns as the index using the **.set_index()** function   
We can then use **.loc[ ]** to index the dataframe  

[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) for .set_index()
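As a sketch, the cell below sets `Days` as the index and then looks up rows by day name. The name `df_by_day` is an assumption for this example. Note that `.set_index()` returns a new DataFrame and leaves `df` unchanged unless the result is assigned back.

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
temps_f = [66, 70, 66, 64, 64, 59, 52]
df = pd.DataFrame({'Days': days, 'Temps_F': temps_f})

df_by_day = df.set_index('Days')   # returns a new DataFrame

print(df_by_day.loc['Tuesday'])                # one row, by label
print(df_by_day.loc[['Saturday', 'Sunday']])   # several rows, by a list of labels
```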

### Exercise 6: Adding and renaming columns
The temperature data we have are the high temps for this week. Let's add a new column with this week's low temperatures

In [None]:
low_temp_list = [41, 43, 45, 43, 54, 43, 37]

Remember the syntax to add a new column: `dataframe[col_name] = data_for_column`

In [None]:
# add new column from low_temp_list

In [None]:
# view the dataframe
df

Now let's change the name of column `Temps_F` to the more descriptive `High_Temps`   

We can use the **.rename()** function or the **columns** attribute

[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) for .rename()

To use the .rename() function, we provide a dictionary with 'old_column_name' : 'new_column_name'

To use the columns attribute, assign a list of all of the column names you want for the dataframe. This is a great option if you are renaming multiple columns. However, you must provide a name for **all** of the columns in the dataframe, even the ones you do not want to change.
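Both approaches are sketched below. The column name `Low_Temps` is an assumption made for this example; use whatever name you gave the low temperatures column.

```python
import pandas as pd

df = pd.DataFrame({'Days': ['Monday', 'Tuesday', 'Wednesday'],
                   'Temps_F': [66, 70, 66],
                   'Low_Temps': [41, 43, 45]})   # hypothetical column name

# Option A: .rename() with a dictionary of old name -> new name
renamed = df.rename(columns={'Temps_F': 'High_Temps'})

# Option B: assign a full list of names to the columns attribute
relabeled = df.copy()
relabeled.columns = ['Days', 'High_Temps', 'Low_Temps']

print(renamed.columns.tolist())
print(relabeled.columns.tolist())
```

`.rename()` returns a new DataFrame, so assign the result back to `df` to keep the change.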

# 5 minute break
When we come back: Penguins!

![Gentoo penguin with chick](Images/Gentoo_Penguin_with_chick_at_Jougla_Point,_Antarctica_(6063647060).jpg)

### Section 2: Palmer Penguins dataset
#### In this section:
[More data manipulation](#More-data-manipulation)   
[Exercise 7: Exploratory data analysis](#Exercise-7:-Exploratory-data-analysis)   
[Exercise 8: Dealing with missing values](#Exercise-8:-Dealing-with-missing-values)   
[Exercise 9: Sorting data](#Exercise-9:-Sorting-data)   
[Exercise 10: Basic calculations](#Exercise-10:-Basic-calculations)   
[Exercise 11: Grouping and aggregating data](#Exercise-11:-Grouping-and-aggregating-data)   

### More data manipulation
In this section, we'll use the Palmer Penguins dataset. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

This dataset was compiled by developer Allison Horst as an R package [(see R documentation here)](https://allisonhorst.github.io/palmerpenguins/).   

The dataset is also available as a [Python library](https://pypi.org/project/palmerpenguins/), which I have converted to a CSV file and provided for this workshop.

In this section, we will:
- Import data from a CSV file
- Perform exploratory data analysis
- Clean and manipulate the dataset
    - Handle missing values
    - Sort the dataset  
    - Group the dataset in different ways  
    - Calculate basic summary statistics

Use the **.read_csv()** function to import our dataset.

[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for .read_csv()
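To show what `read_csv()` does without needing a file on disk, this sketch reads a few lines of CSV text from memory. The rows are invented for illustration and are not real penguin measurements.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of a CSV file
csv_text = """species,island,bill_length_mm,year
Adelie,Torgersen,38.0,2007
Gentoo,Biscoe,47.5,2007
Chinstrap,Dream,,2008
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)
```

The first line of the text becomes the column names, and the empty field in the last row is read as a missing value (NaN). With a real file, pass its path instead, for example `pd.read_csv('Data/palmerpenguins.csv')` if the file is in the workshop's Data folder (the exact path is an assumption).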

In [None]:
# import the dataset from file palmerpenguins.csv

### Exercise 7: Exploratory data analysis
Spend 3 minutes getting to know the `penguins` dataset

Try functions like .shape, .dtypes, .describe(), or .unique()

### Exercise 8: Dealing with missing values
We'll use the **.isna()** function to check if we have any missing values (NaN) in our dataset. Then we will drop all rows that have any missing value using **.dropna()**    

- [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html) for .isna()   
- [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) for .dropna()

In [None]:
# check for missing values

`.isna()` will also work with a specific column: `df[column].isna()`   
You can add the `.unique()` function to quickly see if any data in the column is missing
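A small sketch of these checks on a made-up table. The name `toy` and its values are hypothetical; only the column names follow the penguins dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries in the second row
toy = pd.DataFrame({'species': ['Adelie', 'Gentoo', 'Chinstrap'],
                    'body_mass_g': [3700.0, np.nan, 3800.0],
                    'sex': ['male', None, 'female']})

print(toy.isna())         # True wherever a value is missing
print(toy.isna().sum())   # number of missing values in each column

# For one column: if True appears in the result, something is missing
print(toy['body_mass_g'].isna().unique())
```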

In [None]:
# use .isna() and .unique() together

**.dropna()** will drop rows or columns that have missing values.    
Specify that we want to drop rows using the "axis=0" argument. If we wanted to drop columns with missing values, we would use "axis=1"
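The two `axis` settings can be compared on a made-up table, as in the sketch below (all values are hypothetical). `.dropna()` returns a new DataFrame, so the result must be assigned to a variable to keep it.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries in the second row
toy = pd.DataFrame({'species': ['Adelie', 'Gentoo', 'Chinstrap'],
                    'body_mass_g': [3700.0, np.nan, 3800.0],
                    'sex': ['male', None, 'female']})

rows_kept = toy.dropna(axis=0)   # drop rows that contain a missing value
cols_kept = toy.dropna(axis=1)   # drop columns that contain a missing value

print(rows_kept.shape)
print(cols_kept.columns.tolist())
```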

In [None]:
# remove all rows that have at least one missing value

In [None]:
penguins.shape

Our dataset started with 344 rows. After dropping rows with missing values, we have 333 rows. We still have all 8 columns.

### Exercise 9: Sorting data
The **.sort_values()** function sorts data in ascending order by default. Use sort_values() to order `penguins` by bill length, from smallest to largest. Then order `penguins` by bill length from largest to smallest.  

- Use the `by` argument to specify which column(s) to sort
- Use the `ascending=False` argument to sort in descending order

[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) for .sort_values()

In [None]:
# sort penguins by bill length, smallest to largest

In [None]:
# sort penguins by bill length, largest to smallest

We can sort a dataframe by multiple variables by passing a list of column names into the `by` argument.   
Sort `penguins` first by year, then by bill length in ascending order.
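The three kinds of sort are sketched below on a made-up table. The values are hypothetical; the column names follow the penguins dataset.

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({'year': [2008, 2007, 2008, 2007],
                    'bill_length_mm': [45.2, 39.5, 41.0, 50.1]})

smallest_first = toy.sort_values(by='bill_length_mm')
largest_first = toy.sort_values(by='bill_length_mm', ascending=False)

# A list in `by` sorts on the first column, then breaks ties with the second
by_year_then_bill = toy.sort_values(by=['year', 'bill_length_mm'])

print(by_year_then_bill)
```

The original row labels travel with the rows, so the index of the sorted result is no longer in order.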

In [None]:
# sort penguins by year, then bill length 

### Exercise 10: Basic calculations  

The pandas library includes computational tools to analyze a dataframe. These can give us summary statistics like **.mean()** or **.median()**, or more advanced statistics like correlation (**.corr()**)

[More on computation](https://pandas.pydata.org/docs/user_guide/computation.html) in pandas

Write code to find the mean value for each of the numeric variables and the correlation between numeric variables.
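One way to approach this is sketched below on a made-up table. Recent pandas versions raise an error if `.mean()` or `.corr()` is called on a dataframe that still contains text columns, so this sketch first selects the numeric columns with `.select_dtypes()`. All values are hypothetical.

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({'species': ['Adelie', 'Adelie', 'Gentoo'],
                    'bill_length_mm': [40.0, 42.0, 44.0],
                    'body_mass_g': [3000.0, 3500.0, 4000.0]})

numeric = toy.select_dtypes(include='number')   # keep only numeric columns

means = numeric.mean()   # one mean per column
corr = numeric.corr()    # pairwise correlation between columns

print(means)
print(corr)
```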

In [None]:
# find mean of each numeric variable

In [None]:
# calculate correlation between variables

### Exercise 11: Grouping and aggregating data

The **.groupby()** function separates a dataframe into groups based on the dataframe's columns. This function uses a split-apply-combine process:   
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure   

The .groupby() function keeps track of which rows of the dataframe belong to each group. The function returns a GroupBy object that is not very informative on its own. We can see what is inside of a GroupBy object by using methods and attributes like **.get_group()**, **.groups**, or **.size()**.  We can also aggregate data within groups by adding functions like **.sum()** or **.mean()**.   

[Documentation](https://pandas.pydata.org/docs/reference/groupby.html) for .groupby()   

[More on the split-apply-combine process](https://pandas.pydata.org/docs/user_guide/groupby.html)
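The split-apply-combine steps can be sketched on a made-up table, as below. The values and variable names are hypothetical; the column names follow the penguins dataset.

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Adelie'],
    'island': ['Biscoe', 'Dream', 'Biscoe', 'Biscoe', 'Biscoe'],
    'body_mass_g': [3500.0, 3700.0, 5000.0, 5200.0, 3900.0],
})

by_species = toy.groupby('species')             # split
gentoo = by_species.get_group('Gentoo')         # look inside one group
mean_mass = by_species['body_mass_g'].mean()    # apply and combine

# Grouping by two columns: each group is identified by a (species, island) pair
by_species_island = toy.groupby(['species', 'island'])
adelie_biscoe = by_species_island.get_group(('Adelie', 'Biscoe'))
counts = by_species_island.size()               # number of rows in each group

print(mean_mass)
print(counts)
```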

**Task 1: Group `penguins` by species. Show the dataframe for Gentoo penguins.**

**Task 2: Group `penguins` by species. Calculate the mean for each of the numeric variables.**

**Task 3: Group `penguins` by species and island. Show the dataframe for Adelie penguins on Biscoe island.**

**Task 4: Group `penguins` by species and island. Get a count of how many penguins of each species were on each island.**

### Resources section links:
[Additional Practice](#Additional-Practice)   
[Resources](#Resources)   
[Questions?](#Questions?)

### Additional Practice
Below are a few additional questions to continue practicing pandas functions. Answers provided in the answer key are just suggestions; there are many ways to solve these problems!

**Practice 1: Extract a dataset of just Gentoo penguins**   

**Practice 2: Which species had the highest number of male penguins in 2007? In 2009?**   

**Practice 3: How many total penguins were on each island in each year? How many penguins of each species were on each island, in each year?**

**Practice 4: For Adelie penguins in 2009, what percentage of the total population was female?**   

## Resources

__pandas Resources__   
[pandas Official Documentation](https://pandas.pydata.org/pandas-docs/stable/)   
[pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html)   
[Comparing pandas to R programming](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html)   
Comparing pandas to [Excel](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_spreadsheets.html), [SQL](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html), [SAS](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sas.html), and [Stata](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_stata.html)   


[Towards Data Science](https://towardsdatascience.com/) - contains articles on Python and other programming languages, from beginner to expert levels

__A few more things pandas can do:__   
Pivot tables and reshaping datasets - [blog post with images](https://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/), [official documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html)   

Merging, joining and comparing datasets - [official documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html), [.join() function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html), [tutorial with images](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/08_combine_dataframes.html)

__Jupyter Notebooks Resources__   
[Project Jupyter](https://jupyter.org/) - organization behind Jupyter Notebooks   
[Anaconda](https://www.anaconda.com/) - environment manager and GUI for launching Jupyter Notebooks  
[RISE slideshow extension for Jupyter Notebooks](https://rise.readthedocs.io/en/stable/)   
[Guide to interactive notebooks](https://morphocode.com/interactive-notebooks-data-analysis-visualization/)   
[Basic Markdown syntax](https://www.markdownguide.org/basic-syntax) for formatting text elements   

__Conferences__   
[Pycon 2022](https://us.pycon.org/2022/) - annual Python users conference, past talks [available on Youtube](https://www.youtube.com/channel/UCMjMBMGt0WJQLeluw6qNJuA)     
[PyData conferences and meetups](https://pydata.org/)   
[SciPy conference](http://conference.scipy.org/)   
[More Python community events](https://www.python.org/community/workshops/)

### Want to visualize your data? Take our [Data Visualization in Python](https://jhu.libcal.com/event/8219524) workshop on Tuesday November 16
### [Register here](https://jhu.libcal.com/event/8219524)   

<img src="./Images/matplotlib_logo.svg" width='400'/>

### Take our survey to help us improve this workshop:   
### https://www.surveymonkey.com/r/IntroPandas


## Questions?   

## Contact us at dataservices@jhu.edu

### About this Presentation  
This presentation was created using Jupyter Notebooks version 6.1.4 and the RISE notebook extension version 5.6.1.    

### Terms of Use 
The presentation materials are licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/), attributable to Data Services, Johns Hopkins University.   

Please cite this material as:

> Johns Hopkins University Data Services. (2021, November 9). Data Wrangling in Python: Introduction to the pandas library [workshop presentation].