# UE4: Pandas

<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/2560px-Pandas_logo.svg.png height=200>

### **Introduction to Pandas for Data Analysis**

Welcome to the Introduction to **Pandas** course, designed to equip you with the skills necessary for data manipulation and analysis in Python using the Pandas library. 

This two-hour course is structured to offer a comprehensive overview of Pandas' capabilities, ensuring you're prepared to tackle real-world data analysis challenges.

---

## **Course Outline**

#### **Chapter 1. [Introduction to Pandas](#1-introduction-to-pandas)**
- **Chapter 1.1 [Overview of Pandas](#1.1-overview-of-pandas):** What is Pandas, and why is it essential for data science?
- **Chapter 1.2 [Installation and Setup](#1.2-installation-and-setup):** Getting Pandas up and running.
- **Chapter 1.3 [Pandas Data Structures](#1.3-pandas-data-structures):** Understanding Series and DataFrames.

#### **Chapter 2. [Data Manipulation with Pandas](#2-data-manipulation-with-pandas)**
- **Chapter 2.1 [Data Importing](#2.1-data-importing):** Reading data from various sources.
- **Chapter 2.2 [Data Exploration](#2.2-data-exploration):** Basic data exploration techniques.
- **Chapter 2.3 [Data Indexing](#2.3-data-indexing):** Indexing your data.
- **Chapter 2.4 [Data Cleaning](#2.4-data-cleaning):** Preparing your data for analysis.
- **Chapter 2.5 [Coding Challenge](#2.5-coding-challenge):** Practical exercise on data manipulation.

#### **Chapter 3. [Data Analysis and Aggregation](#3-data-analysis-and-aggregation)**
- **Chapter 3.1 [Filtering and Sorting](#3.1-filtering-and-sorting):** Slicing and dicing data.
- **Chapter 3.2 [Grouping and Aggregating](#3.2-grouping-and-aggregating):** Summarizing data effectively.
- **Chapter 3.3 [Pivot Tables and Cross-Tabulation](#3.3-pivot-tables-and-cross-tabulation):** Advanced data summarization.
- **Chapter 3.4 [Coding Challenge](#3.4-coding-challenge):** Deep dive into data analysis.

#### **Chapter 4. [Advanced Pandas Techniques](#4-advanced-pandas-techniques)**
- **Chapter 4.1 [Time Series Analysis](#4.1-time-series-analysis):** Handling time-stamped data.
- **Chapter 4.2 [Text Data Handling](#4.2-text-data-handling):** Managing textual data.
- **Chapter 4.3 [Combining and Merging Data Sets](#4.3-combining-and-merging-data-sets):** Building complex data relationships.
- **Chapter 4.4 [Coding Challenge](#4.4-coding-challenge):** Applying advanced techniques in practical scenarios.

#### **Chapter 5. [Best Practices and Resources](#5-best-practices-and-resources)**
- **Chapter 5.1 [Writing Efficient Pandas Code](#5.1-writing-efficient-pandas-code):** Optimization tips.
- **Chapter 5.2 [Learning Resources](#5.2-learning-resources):** Furthering your Pandas knowledge.
- **Chapter 5.3 [Q&A Session](#5.3-qa-session):** Addressing common queries.

---

## **[1. Introduction to Pandas](#1-introduction-to-pandas)**

**Pandas** is a cornerstone library for data analysis and manipulation in Python, offering rich data structures and functions designed to make data exploration and manipulation straightforward and efficient. 

<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/2560px-Pandas_logo.svg.png height=100>

At the heart of Pandas are its two primary data structures: Series and DataFrame, which provide the functionality to handle and analyze data in a way that is both fast and intuitive. 

Conceived by Wes McKinney in 2008, Pandas has become indispensable for data scientists for tasks ranging from simple data filtering and aggregation to more complex data transformations and analysis. Its seamless integration with other libraries like NumPy, Scikit-learn, and Matplotlib makes it a versatile tool in the data science toolkit, enabling analysts and researchers to draw insights from data with ease. 

### [1.1 Overview of Pandas](#1.1-overview-of-pandas)

Pandas users range from beginners in data science to seasoned analysts and researchers, all benefiting from its extensive functionality for handling and analyzing input data, regardless of its origin or format. 

The library's central feature is its powerful and flexible `DataFrame` object, which allows for sophisticated data manipulation and analysis.


#### Core Feature: DataFrame

The DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It's akin to a spreadsheet or SQL table and is the most commonly used pandas object.

```python
pandas.DataFrame()
```

#### Why Pandas?

Pandas simplifies data manipulation and analysis through its powerful data structures.

It provides:
- Fast and efficient DataFrame object for data manipulation with integrated indexing.
- Tools for reading and writing data between in-memory data structures and different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of datasets.
- Label-based slicing, indexing, and subsetting of large datasets.
- Data structure column insertion and deletion.
- Group by functionality for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.

Pandas is built on top of NumPy and is designed to integrate well within a scientific computing environment with many other 3rd party libraries.


### [1.2 Installation and Setup](#1.2-installation-and-setup)
<a name="1.2-installation-and-setup"></a>
Pandas can be installed using Python's package manager `pip`. 

If you have Python and pip on your system, you can install Pandas by running:

    pip install pandas

To verify the installation, you can import pandas and check its version with:

In [None]:
import pandas as pd
print(pd.__version__)

### [1.3 Pandas Data Structures](#1.3-pandas-data-structures)
<a name="1.3-pandas-data-structures"></a>
The two primary data structures of pandas, `Series` (1-dimensional) and `DataFrame` (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. 

For [R](https://www.r-project.org/) users, DataFrame provides everything that [R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame)'s `data.frame` provides and much more.

<img src=../images/ue4_series-dataframe.png height=200>

#### Series

A Series is a `one-dimensional array-like object` containing a sequence of values and an associated array of data labels, called its index. 

A simple Series is formed from only an array of data:

In [None]:
series = pd.Series([4, 7, -5, 3])

print(series)

#### DataFrame

The DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). 

The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.

Creating a DataFrame is as simple as passing a dict of objects:

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2000, 2002, 2001, 2003, 2002],
        'pop': [1.5, 1.5, 3.6, 2.4, 2.9, 3.9]}
dataframe = pd.DataFrame(data)  

dataframe

<img src= https://cdn-images-1.medium.com/v2/1*5zJ9tsVIRvxY83GsO8eyOw.png height=500>

**Source:** https://medium.com/dunder-data/the-pandas-dataframe-and-series-a7e7a5987492

Pandas DataFrames are versatile and powerful, capable of solving a wide array of data manipulation tasks.
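
To make the "dict of Series sharing the same index" idea concrete, here is a minimal sketch. The labels `a`, `b`, `c` and the column names `x` and `y` are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Two Series with explicit index labels instead of the default 0..n-1
x = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
y = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame built from a dict of Series: each Series becomes a column,
# and the shared labels become the row index
table = pd.DataFrame({"x": x, "y": y})

# Selecting a single column gives back a Series with the same index
col_x = table["x"]

print(table)
print(type(col_x).__name__)  # Series
print(col_x["b"])            # 2.0
```

Each column keeps its own dtype (`float64` for `x`, an integer dtype for `y`), while both share the row labels.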

---

## **[2. Data Manipulation with Pandas](#2-data-manipulation-with-pandas)**

In the next section we use a dataset from an injection molding experiment. Here it only serves to demonstrate another way of obtaining a DataFrame: loading it from a file. A research paper details the experiments undertaken to obtain the data:

> Polenta, A.; Tomassini, S.; Falcionelli, N.; Contardo, P.; Dragoni, A.F.; Sernani, P. A Comparison of Machine Learning Techniques for the Quality Classification of Molded Products. Information 2022, 13, 272. https://doi.org/10.3390/info13060272

Please download the `csv` data [here](https://github.com/airtlab/machine-learning-for-quality-prediction-in-plastic-injection-molding/blob/master/dataset/data.csv) and save it in your `/data` folder.



### **[2.1 Data Importing](#2.1-data-importing)**
<a name="2.1-data-importing"></a>
Pandas supports a wide range of data formats for reading and writing data, making it incredibly versatile for data analysis tasks. 

You can easily import data from `CSV`, `Excel`, `JSON`, `SQL` databases, and many other formats into Pandas DataFrames. 

Here's how to import data from a CSV file:

In [None]:
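# Import data with the pandas read_csv function
# (the file is semicolon-separated, so the delimiter is passed explicitly, as in chapter 2.5)
df = pd.read_csv('../data/data.csv', delimiter=";")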

# In the next chapters (2.2, 2.3 and 2.4), we are going to use simpler example data.
# For the coding challenge (2.5), we will use the injection molding dataset.

Importing works the same way for every format, using the matching `pd.read_<format>` function, e.g.:
- pd.read_csv
- pd.read_json
- pd.read_excel
- pd.read_html
- pd.read_xml
- pd.read_sql

For more information about data import, see the official pandas documentation ([click here](https://pandas.pydata.org/pandas-docs/stable/reference/io.html)).
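
The reader functions can be tried without any file on disk. The sketch below feeds a small, made-up semicolon-separated text (hypothetical column names and values) to `pd.read_csv` through `io.StringIO`, then round-trips it through JSON to show that the writers mirror the readers.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content standing in for a file on disk
csv_text = "id;temperature;quality\n1;106.5;1\n2;105.5;2\n3;106.4;1\n"

# read_csv accepts file paths as well as file-like objects;
# sep (alias: delimiter) tells it how the fields are separated
df_csv = pd.read_csv(io.StringIO(csv_text), sep=";")

# Writers mirror the readers: DataFrame.to_json pairs with pd.read_json
json_text = df_csv.to_json(orient="records")
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.shape)  # (3, 3)
print(df_json)
```

If `sep` is left out for a semicolon-separated file, pandas falls back to the comma and reads each line as a single column, so checking `df.shape` right after import is a cheap sanity check.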

### **[2.2 Data Exploration](#2.2-data-exploration)**
<a name="2.2-data-exploration"></a>
Once you have your data loaded into a DataFrame, Pandas offers several functions to perform initial data exploration. 

These functions help to understand the structure and content of your data:
- name_of_the_dataframe **.head()**
- name_of_the_dataframe **.describe()**
- name_of_the_dataframe **.dtypes** (an attribute, so no parentheses)
- name_of_the_dataframe **.index** (an attribute, so no parentheses)
- name_of_the_dataframe **.columns** (an attribute, so no parentheses)

In [None]:
# Creating a new DataFrame to explore 
import numpy as np

# Create some dummy data using NumPy
dummy_data = np.random.randn(20,4)

# Create a dataframe with dummy data 
df = pd.DataFrame(data=dummy_data)

df 

> Why do we see something, even though we did not print the dataframe?

When you simply place the name of a DataFrame (or any object for that matter) at the end of a Jupyter notebook cell, Jupyter uses the _repr_html_() method of the DataFrame object to render it as an `HTML table`. This method provides a visually appealing, styled HTML representation of the DataFrame, which includes features like a boxed layout, background shading for the header, and potentially interactive elements like scrollbars for wide or long DataFrames.

On the other hand, when you use the `print(df)` statement, the DataFrame is converted to a string representation using the `__str__()` method. This results in a text-based output that looks more like traditional, console-based output. This output is less visually appealing than the HTML representation and lacks the formatting enhancements like background shading and borders.

The HTML representation is generally more readable, especially for large DataFrames, as it makes better use of the notebook's capabilities to present data in an interactive environment. However, if you need to share your DataFrame in a text format or prefer the simplicity of the text output for any reason, you might use `print(df)` instead.
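
The two representations can be inspected directly. This is a small sketch with a made-up two-row DataFrame; it calls `_repr_html_()` by hand, which Jupyter normally does for you.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# What Jupyter renders when the DataFrame is the last expression in a cell
html_repr = df_demo._repr_html_()

# What print(df_demo) writes to the output
text_repr = str(df_demo)

print(html_repr[:80])  # beginning of the HTML markup
print(text_repr)
```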

In [None]:
# To make the data more interesting, we are going to add three Series to df

# Create three new Series with integer, categorical (string), and boolean values
dummy_series_int = pd.Series(np.random.choice([0,1,2], size=20))
dummy_series_cat = pd.Series(np.random.choice(["a","b","c"], size=20))
dummy_series_bool = pd.Series(np.random.choice([True, False], size=20))

# Add the Series to the DataFrame
df = df.assign(int_col=dummy_series_int, cat_col=dummy_series_cat, bool_col=dummy_series_bool)

df

The `.assign` method in pandas is a powerful tool for adding new columns to a `DataFrame` in a functional style, meaning it returns a *new* DataFrame without modifying the original one. This method is particularly useful for creating new columns that are derived from existing ones or for adding entirely new information to a DataFrame. You will see more ways to add columns to DataFrames in the next chapters.
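
A minimal sketch of both points, using made-up `width` and `height` columns: the callable passed to `.assign` derives a new column from existing ones, and the original DataFrame stays untouched.

```python
import pandas as pd

original = pd.DataFrame({"width": [2.0, 3.0], "height": [4.0, 5.0]})

# A callable receives the DataFrame and returns the values of the new column
with_area = original.assign(area=lambda d: d["width"] * d["height"])

print(list(original.columns))      # ['width', 'height']  (unchanged)
print(list(with_area.columns))     # ['width', 'height', 'area']
print(with_area["area"].tolist())  # [8.0, 15.0]
```

To keep the result, assign it back to a name, as the cell above does with `df = df.assign(...)`.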

**.head():** Display the first 5 rows of the DataFrame. You can change the number of rows from 5 to *n* with `.head(n)`. With `.head(-n)` every row is displayed except the last *n* rows.

In [None]:
# Display the first 5 rows of the DataFrame
df.head()  # Try to change the number of rows

**.describe()**: Display the summary statistics of the DataFrame's numeric columns. With the parameter `include='all'`, all columns are included in the output, including non-numeric ones.

In [None]:
# Display the summary statistics of the DataFrame's numeric columns
df.describe()  # Try with parameter include='all'

**.dtypes:** Display the data type of each column. Columns with mixed types are stored with the 'object' dtype. Note that `dtypes` is an attribute, so it is accessed without parentheses.

In [None]:
# Display the data types of each column
df.dtypes

In pandas, the data types (dtypes) of columns specify the kind of data each column contains. Understanding these dtypes is crucial for efficient data manipulation and analysis.
* The `float64` dtype represents floating-point numbers in pandas, stored using 64 bits of memory. A floating-point number is one that can contain fractions (decimal points), used for more precise calculations or when dealing with real numbers. The `64` in `float64` indicates the precision level of the floating-point data, with 64 bits offering double precision. This is the standard floating-point dtype in pandas and is capable of representing a wide range of decimal values. It's particularly useful for numerical analysis, scientific computations, and any situation where fractional numbers are common.

* The `int32` dtype in pandas represents integer numbers, using 32 bits of memory. Unlike floating-point numbers (`float64`), integers are whole numbers without a decimal point. The 32 in int32 indicates that each integer value is stored using 32 bits. This allows for a range of values from `-2,147,483,648` to `2,147,483,647`. The choice of `int32` over, say, `int64`, may be motivated by memory considerations, as `int32` uses half the memory of `int64`. However, it's important to ensure that the data you're working with fits within the `int32` range to avoid overflow errors. In pandas, integer columns are very common for indexing, counts, and any other numerical data that doesn't require a fractional component.

* The `object` dtype in pandas is used for columns that contain data of mixed types or non-numeric data. Most commonly, object dtype columns store text data (strings), but they can also hold complex objects like lists, dictionaries, or even other DataFrames if you have a particularly complex dataset. An `object` dtype is essentially a catch-all for columns that do not fit into other specific dtypes like `int64`, `float64`, or `bool`. Because object dtypes can store various types of data, operations on these columns might be less efficient, and performing numerical operations directly on an object column without first converting it to a more specific type can lead to errors or unexpected results.

* The `bool` dtype in pandas is used for columns that contain Boolean values: `True` or `False`. It is the pandas equivalent of Python's built-in bool type and represents a binary state. Boolean columns are extremely useful in data filtering operations, where they can be used directly as masks to select subsets of data that meet certain conditions. For example, a bool column might be used to flag certain rows based on whether they meet a specific criterion, such as whether a value in another column exceeds a certain threshold. In memory, `bool` types are very efficient because they only need to represent two possible states.

When working with pandas, managing your DataFrame's dtypes is essential for both performance and accuracy. Converting columns to the most appropriate dtypes can significantly speed up data processing and ensure that operations like aggregations, sorting, and mathematical computations are performed correctly.
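
The memory argument can be checked with `memory_usage`. The following sketch uses made-up data; `select_dtypes` is shown as one way to pick columns by dtype family.

```python
import numpy as np
import pandas as pd

n = 1000
ints = pd.Series(np.arange(n), dtype="int64")

# Bytes used by the values alone (index excluded)
bytes_int64 = ints.memory_usage(index=False)
bytes_int32 = ints.astype("int32").memory_usage(index=False)
print(bytes_int64, bytes_int32)  # 8000 4000

# select_dtypes picks columns by dtype; "number" covers ints and floats, not bool
mixed = pd.DataFrame({
    "x": [1.5, 2.5],
    "n": [1, 2],
    "label": ["a", "b"],
    "flag": [True, False],
})
numeric_cols = mixed.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['x', 'n']
```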

**.index**: Display the index of the DataFrame (an attribute, accessed without parentheses).

In [None]:
#  Display the index of the DataFrame
df.index

* RangeIndex: This is a type of index that pandas uses. It's an optimized form of index for cases where the index is equivalent to range(start, stop, step). It's particularly memory-efficient because, instead of storing all index values, it only needs to store the start, stop, and step values.
    * `start=0`: This is the starting value of the index. It indicates that the index starts at 0.
    * `stop=20`: This is the stopping value of the index, and it is exclusive. That means the actual last index value is one less than this number. For the 20-row DataFrame above, the last index is therefore 19.
    * `step=1`: This indicates the increment between each index value. A step of 1 means the index values are sequential integers.
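
The three values can be read directly from the index object. A minimal sketch, assuming a DataFrame with 20 rows like the one above (filled with zeros here, since the values do not matter):

```python
import numpy as np
import pandas as pd

df_range = pd.DataFrame(np.zeros((20, 4)))
idx = df_range.index

print(idx)                            # RangeIndex(start=0, stop=20, step=1)
print(idx.start, idx.stop, idx.step)  # 0 20 1
print(idx[-1])                        # 19, because stop is exclusive
```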

In [None]:
# Pandas allows you to create dataframe with any index you like
df_idx = pd.DataFrame(data=[[1,2,3], [4,5,6], [7,8,9]], index=["A", "B", "B"], columns=["X", "Y", "Z"])
df_idx

**.columns**: Display the column names of the DataFrame (an attribute, accessed without parentheses).

In [None]:
# Display the column names of the DataFrame
df.columns

### **[2.3 Data Indexing](#2.3-data-indexing)**
<a name="2.3-data-indexing"></a>
Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns. Indexing can also be known as *Subset Selection*.

Data indexing is important because it:
- Identifies data (i.e. provides metadata) using known indicators, which matters for analysis, visualization, and interactive console display.
- Enables automatic and explicit data alignment.
- Allows intuitive getting and setting of subsets of the data set.
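
Automatic alignment is easiest to see when two Series share only some labels. The following sketch uses made-up labels and numbers:

```python
import pandas as pd

q1 = pd.Series([10, 20, 30], index=["north", "south", "west"])
q2 = pd.Series([5, 15, 25], index=["south", "west", "east"])

# Values are matched by label, not by position;
# labels present in only one Series produce NaN
total = q1 + q2
print(total)

# Explicit alignment: treat a missing partner as 0 instead
total_filled = q1.add(q2, fill_value=0)
print(total_filled)
```

Here `south` and `west` are summed, while `north` and `east` become `NaN` in `total` and keep their single value in `total_filled`.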

In [None]:
# First, we create a new example dataframe from a dictionary to work with 

data = {
    "city": ["Dortmund", "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, 191854, 539050, 204202, 591672],
    "year": [2019, 2019, 2023, 2021, 2022, 2023]
}

df = pd.DataFrame(data)

df

#### Simple indexing

* **Select columns:** With the name of one selected column or more selected columns as a list.

In [None]:
# Select a column via its name

df["city"]

In [None]:
# Select multiple columns by providing a list 
df[["city", "population"]]

#### Simple slicing

* **Select rows:** With a slicing operation similar to NumPy.
    * Use `[:n]` to get the first n rows.
    * Use `[n:]` to get all rows except the first n.
    * Use `[::n]` to get every n-th row.

In [None]:
# Get the first three rows
df[:3]

In [None]:
# Get the last three rows
df[-3:]

In [None]:
# Get every second row
df[::2]

#### Label-based selection with `loc`

The `loc` indexer selects rows and columns by their labels. It supports both selecting specific rows and columns by label and slicing ranges of rows or columns.

In [None]:
# Select a single row by index label
df.loc[0]

In [None]:
# Select multiple rows by index labels
df.loc[[3, 4]]

In [None]:
# Select rows by index label range (unlike Python slicing, both endpoints are included)
df.loc[2:5]

In [None]:
# Select specific rows and columns
df.loc[[2, 3], ['city', 'population']]


#### Positional Indexing with `iloc`

The `iloc` indexer selects rows and columns by their integer positions, which start at `0`. It’s similar to Python *list slicing*, and as such, it’s exclusive of the endpoint.

In [None]:
# Select a single row by position
df.iloc[0]

In [None]:
# Select multiple rows by position
df.iloc[1:5]  # Selects rows 1 to 4

In [None]:
# Select specific rows and columns by position
df.iloc[[0, 2, 3], [1, 2]]  # Rows 0, 2, 3 and columns 1, 2

In [None]:
# Using a boolean array
df.iloc[[True, False, True, False, True, True]]

**Summary:** In pandas, both `.loc[]` and `.iloc[]` are used for indexing and selecting data from DataFrames, but they work in fundamentally different ways due to their indexing criteria: labels for `.loc[]` and integer positions for `.iloc[]`. 

Understanding the difference between `.loc[]` and `.iloc[]` is crucial for correctly accessing data within pandas objects.

* Label-based Indexing: `.loc[]` is used to select rows and columns by their labels.
* Positional Indexing: `.iloc[]` is used to select rows and columns by their integer position (i.e., their physical location in the DataFrame).
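
With the default `RangeIndex`, labels and positions coincide, which hides the difference. The sketch below uses a made-up DataFrame whose labels deliberately differ from the positions:

```python
import pandas as pd

df_labels = pd.DataFrame(
    {"value": [10, 20, 30, 40]},
    index=[3, 2, 1, 0],  # labels run opposite to the positions
)

by_label = df_labels.loc[0, "value"]  # row labelled 0 (the last row)
by_position = df_labels.iloc[0, 0]    # first row, first column
print(by_label, by_position)          # 40 10

# Label slices include the end label, positional slices exclude the end position
label_slice = df_labels.loc[3:1]       # labels 3, 2, 1
position_slice = df_labels.iloc[0:2]   # positions 0 and 1
print(len(label_slice), len(position_slice))  # 3 2
```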

### **[2.4 Data Cleaning](#2.4-data-cleaning)**
<a name="2.4-data-cleaning"></a>
Data cleaning is a crucial step before the actual analysis. 

Pandas provides several methods to deal with duplicate data, missing data and data transformation:

In [None]:
# First, we create a new example dataframe from a dictionary to work with 

data = {
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund", "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672, 191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019,2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023]
}

df = pd.DataFrame(data)

df

#### Remove duplicate values

Removing duplicates from a DataFrame is an essential step in data cleaning to ensure the accuracy of your data analysis. Pandas provides two main functions to handle duplicate data: `duplicated()` and `drop_duplicates()`. Understanding how to use these functions allows for effective identification and removal of duplicate rows in your dataset.

* The `duplicated()` function is used to identify duplicate rows in a DataFrame. By default, it returns a *boolean* Series indicating whether each row is a duplicate of a row encountered earlier in the DataFrame. You can specify how to consider duplicates with the `keep` parameter.


In [None]:
# Find duplicates
df_copy = df.copy()
df_duplicates = df.duplicated()

df_duplicates

* The `drop_duplicates()` function removes duplicate rows from a DataFrame. Like `duplicated()`, it offers flexibility in defining what constitutes a duplicate.

In [None]:
# Removing duplicate entries
df_raw = df.copy()
df_copy = df.drop_duplicates()

df_copy
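
The `keep` parameter, and the `subset` parameter of both functions, can be sketched on a smaller frame. The rows below are taken from the example data above; the variable names are chosen for illustration only.

```python
import pandas as pd

df_dup = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel"],
    "year": [2023, 2023, 2023, 2022],
    "population": [591672, 191854, 591672, 204202],
})

# Default keep="first": only later repeats are flagged
later_repeats = df_dup.duplicated()
print(later_repeats.tolist())   # [False, False, True, False]

# keep=False flags every member of a duplicate group
all_members = df_dup.duplicated(keep=False)
print(all_members.tolist())     # [True, False, True, False]

# subset restricts the comparison to chosen columns;
# keep="last" retains the last row per city
one_per_city = df_dup.drop_duplicates(subset=["city"], keep="last")
print(one_per_city)
```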

#### Handle missing values

Handling missing values is a critical aspect of data cleaning that directly impacts the *quality of data analysis*. Missing data can arise for various reasons, such as errors in data collection, processing, or transmission. 

**Pandas** provides several methods to deal with missing values effectively: `isnull()`, `notnull()`, `dropna()`, and `fillna()`. Each of these methods plays a specific role in identifying, removing, or replacing missing values in a `DataFrame` or `Series`.

* Using `isnull()` helps to identify missing values in a DataFrame or Series. It returns a boolean object of the same size indicating whether the elements are `NaN` or `None`.

In [None]:
# Example use of isnull()
df.isnull()

* `notnull()` can be considered the opposite of `isnull()`, since it identifies non-missing values. It also returns a boolean object of the same size, but indicates whether the elements are **not** `NaN` or `None`.

In [None]:
# Example use of notnull()
df.notnull()

* You can use `dropna()` to remove missing values from a `DataFrame` or a `Series`. 
    * The parameter `axis` allows you to specify whether *rows* (`0` or `'index'`) or *columns* (`1` or `'columns'`) should be dropped.


In [None]:
# Dropping the rows
df_copy = df_copy.dropna(axis=0)

In [None]:
# Dropping the columns
df_copy = df.copy()
df_copy.dropna(axis=1)

* You can use `fillna()` to replace missing values with a specified value, or with a value computed by an imputation method such as `mean`, `median`, or `mode`.

In [None]:
# Filling NaN with a string
df_copy = df.copy()
df_copy.fillna(value="Value to fill")

In [None]:
# Filling NaNs using an imputation method
df_copy = df.copy()
df_copy.fillna(value={"population": df_copy["population"].mean()})
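
Other imputation choices follow the same pattern. The sketch below uses a subset of the example rows and shows a column-wide median. As an extension beyond the text above, it also shows a group-wise mean, which may be more plausible here since the two cities differ strongly in size.

```python
import pandas as pd

df_missing = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund", "Kassel"],
    "population": [587010, 201585, None, None, 591672, 191854],
})

# Column-wide median: one value for every gap
median_value = df_missing["population"].median()
median_filled = df_missing["population"].fillna(median_value)

# Group-wise mean: each gap is filled with the mean of its own city
city_mean = df_missing.groupby("city")["population"].transform("mean")
group_filled = df_missing["population"].fillna(city_mean)

print(median_value)             # 394297.5
print(group_filled.tolist())
```

With the column-wide median, both gaps receive 394297.5, a value that fits neither city. The group-wise version fills the Dortmund gap with 589341.0 and the Kassel gap with 196719.5.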

#### Working with Data Types

Converting data types in a pandas DataFrame is a common task during the data cleaning process, ensuring that each column has the most appropriate type for data analysis or machine learning models. This process involves first inspecting the current data types of each column and then using the `astype()` method to explicitly convert columns to different types.

* Before converting data types, it's crucial to understand the current data types of the columns in your DataFrame. Pandas provides the `dtypes` attribute for this purpose.

In [None]:
# Inspecting Data Types with dtypes
df.dtypes

* Once you've identified the columns whose data types need to be converted, you can use the `astype()` method to perform the conversion. This method is versatile, allowing for the conversion to most pandas data types and even custom types.

In [None]:
# Converting Data Types with astype()
df_copy = df.copy()
df_copy = df_copy.fillna(42)
df_copy = df_copy.astype({'population': 'int32', 'year': 'category'})

df_copy.dtypes

In this example, the population column is converted to a `32-bit` integer to potentially reduce memory usage, and the `year` column is converted to a category type, which is often useful for columns with a relatively small number of unique values and can lead to performance improvements in certain operations.
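
The `fillna(42)` step in the cell above matters because a NumPy integer dtype cannot hold `NaN`. The sketch below shows the failure and, as one possible alternative to a placeholder value, the nullable `Int64` extension dtype; the three-element Series is made up from the example values.

```python
import pandas as pd

pop = pd.Series([587010.0, None, 591672.0], name="population")

# Casting a column that still contains NaN to a NumPy integer dtype fails
try:
    pop.astype("int32")
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)  # True

# The nullable "Int64" dtype (capital I) keeps the gap as <NA>
nullable = pop.astype("Int64")
print(nullable)
```

Whether a placeholder such as 42 or an explicit `<NA>` is more appropriate depends on the analysis; a placeholder silently enters every later statistic.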

### **[2.5 Coding Challenge](#2.5-coding-challenge)**

For this coding challenge, you'll be working with a dataset of your choice. Your task is to import the data into a Pandas DataFrame, explore the data to understand its structure, clean the data as necessary, and perform some basic analysis. This exercise will test your ability to manipulate and analyze data using Pandas.

Challenge objectives:

1. Import a dataset into a Pandas DataFrame.
2. Use DataFrame methods to explore the data (e.g., `head()`, `describe()`, `dtypes`).
3. Clean the dataset by handling missing values and duplicates.

To give an idea of how the task can be solved, the following example analyses a dataset on "Machine Learning for Quality Prediction in Plastic Injection Molding". The real-world dataset was collected during the production process and is used for a "systematic comparison of ML techniques to predict the quality classes of plastic molded products":

> Polenta, A.; Tomassini, S.; Falcionelli, N.; Contardo, P.; Dragoni, A.F.; Sernani, P. A Comparison of Machine Learning Techniques for the Quality Classification of Molded Products. Information 2022, 13, 272. https://doi.org/10.3390/info13060272


Again, you can download the dataset and get more information [here](https://github.com/airtlab/machine-learning-for-quality-prediction-in-plastic-injection-molding).


**Step 1: Import a dataset into a Pandas DataFrame.**
- Load the data using a pandas import function to begin the analysis.


In [1]:
# Step 1: Import a dataset into a Pandas DataFrame

# Sample code template
import pandas as pd
# Import the dataset
df = pd.read_csv('../data/data.csv', delimiter=";")

df.head()

|   | Melt temperature | Mold temperature | time_to_fill | ZDx - Plasticizing time | ZUx - Cycle time | SKx - Closing force | SKs - Clamping force peak value | Ms - Torque peak value current cycle | Mm - Torque mean value current cycle | APSs - Specific back pressure peak value | APVs - Specific injection pressure peak value | CPn - Screw position at the end of hold pressure | SVo - Shot volume | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 106.476184 | 80.617 | 7.124 | 3.16 | 74.83 | 886.9 | 904.0 | 116.9 | 104.3 | 145.6 | 922.3 | 8.82 | 18.73 | 1.0 |
| 1 | 105.505 | 81.362 | 6.968 | 3.16 | 74.81 | 919.409791 | 935.9 | 113.9 | 104.9 | 145.6 | 930.5 | 8.59 | 18.73 | 1.0 |
| 2 | 105.505 | 80.411 | 6.864 | 4.08 | 74.81 | 908.6 | 902.344823 | 120.5 | 106.503496 | 147.0 | 933.1 | 8.8 | 18.98 | 1.0 |
| 3 | 106.474827 | 81.162 | 6.864 | 3.16 | 74.82 | 879.410871 | 902.033653 | 127.3 | 104.9 | 145.6 | 922.3 | 8.85 | 18.73 | 1.0 |
| 4 | 106.46614 | 81.471 | 6.864 | 3.22 | 74.83 | 885.64426 | 902.821269 | 120.5 | 106.7 | 145.6 | 917.5 | 8.8 | 18.75 | 1.0 |


**Step 2: Use DataFrame methods to explore the data**
- Apply `head()`, `describe()` and `dtypes`

In [2]:
# Step 2: Use DataFrame methods to explore the data

df.shape

(1451, 14)

In [None]:
df.head()

In [3]:
df.describe()

|   | Melt temperature | Mold temperature | time_to_fill | ZDx - Plasticizing time | ZUx - Cycle time | SKx - Closing force | SKs - Clamping force peak value | Ms - Torque peak value current cycle | Mm - Torque mean value current cycle | APSs - Specific back pressure peak value | APVs - Specific injection pressure peak value | CPn - Screw position at the end of hold pressure | SVo - Shot volume | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 | 1451.0 |
| mean | 106.89204 | 81.326023 | 7.459043 | 3.234173 | 75.218794 | 901.974834 | 919.351778 | 116.716747 | 104.163904 | 146.230048 | 900.972846 | 8.808863 | 18.756285 | 2.461751 |
| std | 5.615773 | 0.428813 | 1.688106 | 0.34323 | 0.432761 | 11.098192 | 10.780023 | 5.029085 | 4.802195 | 0.804894 | 25.519215 | 0.097238 | 0.095528 | 1.123611 |
| min | 81.747 | 78.409 | 6.084 | 2.78 | 74.78 | 876.7 | 894.8 | 94.2 | 76.5 | 144.8 | 780.5 | 8.33 | 18.51 | 1.0 |
| 25% | 105.9145 | 81.1235 | 6.292 | 3.0 | 74.82 | 893.6 | 914.4 | 114.2 | 103.55 | 145.6 | 886.65 | 8.77 | 18.71 | 1.0 |
| 50% | 106.089 | 81.327 | 6.968 | 3.1926 | 74.83 | 902.4 | 918.8 | 116.9 | 105.2 | 146.1 | 906.8 | 8.82 | 18.75 | 2.0 |
| 75% | 106.263 | 81.441 | 7.124 | 3.29 | 75.65 | 909.4 | 926.3 | 120.2 | 106.531415 | 146.7 | 918.9 | 8.85 | 18.79 | 4.0 |
| max | 155.032 | 82.159 | 11.232 | 6.61 | 75.79 | 930.6 | 946.5 | 130.3 | 114.9 | 150.5 | 943.0 | 9.06 | 19.23 | 4.0 |


In [4]:
df.dtypes

Melt temperature                                    float64
Mold temperature                                    float64
time_to_fill                                        float64
ZDx - Plasticizing time                             float64
ZUx - Cycle time                                    float64
SKx - Closing force                                 float64
SKs - Clamping force peak value                     float64
Ms - Torque peak value current cycle                float64
Mm - Torque mean value current cycle                float64
APSs - Specific back pressure peak value            float64
APVs - Specific injection pressure peak value       float64
CPn - Screw position at the end of hold pressure    float64
SVo - Shot volume                                   float64
quality                                             float64
dtype: object

**Step 3: Clean the dataset**
- by handling missing values 
- by handling duplicate values
- by changing the datatype

In [None]:
# Step 3: Clean the dataset

# Find NaN
df.isna().any()


In [None]:
# Find duplicates
df.duplicated().any()

In [None]:
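# Change the data type to 32-bit floats
# Note: "Float32" (capital F) is pandas' nullable extension dtype; "float32" is the plain NumPy dtype
df = df.astype("Float32")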

df.dtypes

[--> Back to Outline](#course-outline)

---

## **[3. Data Analysis and Aggregation](#3-data-analysis-and-aggregation)**


In [5]:
# Again, we create our example dataframe from a dictionary to work with 

data = {
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund", "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672, 191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019,2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023]
}

df = pd.DataFrame(data)

df

|   | city | population | year |
|---|---|---|---|
| 0 | Dortmund | 587010.0 | 2019 |
| 1 | Kassel | 201585.0 | 2019 |
| 2 | Dortmund | NaN | 2020 |
| 3 | Kassel | NaN | 2020 |
| 4 | Dortmund | 591672.0 | 2023 |
| 5 | Kassel | 191854.0 | 2023 |
| 6 | Kassel | 230011.0 | 2023 |
| 7 | Dortmund | 539050.0 | 2021 |
| 8 | Kassel | 204202.0 | 2022 |
| 9 | Dortmund | 591672.0 | 2023 |


### **[3.1 Filtering and Sorting](#3.1-filtering-and-sorting)**

Pandas provides numerous options for filtering and sorting data, allowing you to slice and dice your dataset to uncover insights or prepare data for visualization or further analysis.

#### Filtering

Filtering in pandas is a crucial technique for selecting specific rows from a DataFrame or Series based on one or more conditions. This process allows you to extract subsets of data that meet certain criteria, making it easier to perform targeted analysis. There are several ways to filter data in pandas, including using boolean indexing, the `query()` method, and conditional expressions.

* `Boolean indexing` involves creating a boolean condition (or conditions) that is applied to the DataFrame or Series to filter rows. The condition evaluates to `True` or `False` for each row, and only rows where the condition is `True` are retained.

In [6]:
# Filter rows where the 'population' is greater than 500000
df[df['population'] > 500000]

Unnamed: 0,city,population,year
0,Dortmund,587010.0,2019
4,Dortmund,591672.0,2023
7,Dortmund,539050.0,2021
9,Dortmund,591672.0,2023


In this example, `df['population'] > 500000` creates a boolean Series that is `True` for rows where the population is greater than `500000`. When this boolean Series is passed to `df[]`, only the rows with True are selected.
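
Boolean conditions can also be combined. The sketch below rebuilds the example DataFrame and shows the element-wise operators `&` (and), `|` (or) and `~` (not), plus `isin()` for membership tests. Each condition needs its own parentheses because these operators bind more tightly than comparisons. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund",
             "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672,
                   191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019, 2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023],
})

# AND: both conditions must hold
mask = (df["population"] > 200000) & (df["city"] == "Kassel")
kassel_large = df[mask]

# NOT: invert a condition
not_2023 = df[~(df["year"] == 2023)]

# Membership test against a list of values
selected_years = df[df["year"].isin([2019, 2021])]

print(kassel_large)
print(not_2023)
print(selected_years)
```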

* The `query()` method allows for filtering using an expression string, making it a more concise syntax for complex filtering operations.

In [8]:
# Filter rows where the 'population' is greater than 200000 and the 'year' is 2023
filtered_df = df.query("population > 200000 and year == 2023")
filtered_df

Unnamed: 0,city,population,year
4,Dortmund,591672.0,2023
6,Kassel,230011.0,2023
9,Dortmund,591672.0,2023


This example filters the DataFrame to include only rows where the population is greater than 200000 and the year is 2023. The `query()` method is particularly useful for filtering with multiple conditions and offers a syntax that is easier to read and write for complex expressions.
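
Inside a `query()` string, Python variables can be referenced with an `@` prefix, which avoids building the expression by hand. A minimal sketch, assuming the illustrative variable names `threshold` and `target_year`:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund",
             "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672,
                   191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019, 2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023],
})

threshold = 200000
target_year = 2023

# @name refers to a variable from the surrounding Python scope
result = df.query("population > @threshold and year == @target_year")

# The same filter written with boolean indexing
same_result = df[(df["population"] > threshold) & (df["year"] == target_year)]

print(result)
```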

* Pandas also supports the use of `conditional expressions` for filtering. This can be useful when you need to apply more complex logic that isn't easily represented as a single condition or when working with multiple `DataFrames`.

In [9]:
# Use np.where to create a new column based on a condition
import numpy as np
df['large_city'] = np.where(df['population'] > 500000, 'Yes', 'No')
df

Unnamed: 0,city,population,year,large_city
0,Dortmund,587010.0,2019,Yes
1,Kassel,201585.0,2019,No
2,Dortmund,,2020,No
3,Kassel,,2020,No
4,Dortmund,591672.0,2023,Yes
5,Kassel,191854.0,2023,No
6,Kassel,230011.0,2023,No
7,Dortmund,539050.0,2021,Yes
8,Kassel,204202.0,2022,No
9,Dortmund,591672.0,2023,Yes


This example uses `np.where` to create a new column, `'large_city'`, that marks cities with a population greater than `500000` as `'Yes'`, and all others as `'No'`. Rows with a missing population fail the comparison and are therefore also labeled `'No'`.

While this is more about generating new data based on conditions, it illustrates the flexibility of conditional expressions in pandas.
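
In the output above, the rows with a missing population end up as `'No'`. If missing values should be labeled separately, one possible approach is `np.select`, which checks several conditions in order. This is a sketch only; the `'Unknown'` label is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund",
             "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672,
                   191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019, 2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023],
})

# Conditions are evaluated in order; the first match determines the label
conditions = [df["population"].isna(), df["population"] > 500000]
choices = ["Unknown", "Yes"]

# Rows matching no condition receive the default label
df["large_city"] = np.select(conditions, choices, default="No")

print(df)
```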

#### Sorting

Sorting data in pandas is an essential operation for organizing your dataset in a meaningful order, whether it be ascending or descending, based on one or more columns. This process can greatly enhance your ability to analyze trends, patterns, and anomalies within your data. Pandas provides several methods for sorting, including `sort_values()`, `sort_index()`, and the `rank()` method for ranking data.

* The `sort_values()` method sorts a DataFrame based on the values of one or more columns. It's the primary method used for value-based sorting.

In [None]:
# Sort the DataFrame based on the 'population' column in ascending order
sorted_df = df.sort_values(by='population')
sorted_df

In [None]:
# Sort by multiple columns, first by 'city' in ascending order, then by 'population' in descending order
sorted_df = df.sort_values(by=['city', 'population'], ascending=[True, False])
sorted_df

In the first example, `sort_values(by='population')` sorts the DataFrame based on the population column in ascending order. The second example demonstrates sorting by multiple columns: it sorts by city in ascending order and then within each city, sorts by population in descending order.
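
Two further `sort_values()` parameters are worth knowing for data with gaps: `na_position` controls where missing values are placed (they go last by default), and `ignore_index=True` renumbers the result from 0. A short sketch on the example data, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund",
             "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672,
                   191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019, 2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023],
})

# Default: ascending order, missing values at the end
by_pop = df.sort_values(by="population")

# Descending order with missing values placed first
by_pop_desc = df.sort_values(by="population", ascending=False, na_position="first")

# Multi-column sort with a fresh 0..n-1 index
renumbered = df.sort_values(by=["city", "population"],
                            ascending=[True, False], ignore_index=True)

print(by_pop)
print(by_pop_desc)
print(renumbered)
```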

* The `sort_index()` method sorts a DataFrame or Series based on its index. This method is particularly useful when you want to revert a DataFrame to its original order after performing operations that alter its row order.

In [10]:
# Sort the DataFrame by its index in ascending order
sorted_df = df.sort_index()
sorted_df

Unnamed: 0,city,population,year,large_city
0,Dortmund,587010.0,2019,Yes
1,Kassel,201585.0,2019,No
2,Dortmund,,2020,No
3,Kassel,,2020,No
4,Dortmund,591672.0,2023,Yes
5,Kassel,191854.0,2023,No
6,Kassel,230011.0,2023,No
7,Dortmund,539050.0,2021,Yes
8,Kassel,204202.0,2022,No
9,Dortmund,591672.0,2023,Yes


In [11]:
# Sort the DataFrame by its index in descending order
sorted_df = df.sort_index(ascending=False)
sorted_df

Unnamed: 0,city,population,year,large_city
9,Dortmund,591672.0,2023,Yes
8,Kassel,204202.0,2022,No
7,Dortmund,539050.0,2021,Yes
6,Kassel,230011.0,2023,No
5,Kassel,191854.0,2023,No
4,Dortmund,591672.0,2023,Yes
3,Kassel,,2020,No
2,Dortmund,,2020,No
1,Kassel,201585.0,2019,No
0,Dortmund,587010.0,2019,Yes


* The `rank()` method assigns ranks to data, treating ties in a specific manner determined by its parameters. Ranking is different from sorting in that it assigns a numerical rank to each entry based on its value, rather than rearranging the entries.

In [12]:
# Rank the 'population' column in descending order, so larger populations receive smaller rank numbers
df['population_rank'] = df['population'].rank(ascending=False)
df

Unnamed: 0,city,population,year,large_city,population_rank
0,Dortmund,587010.0,2019,Yes,3.0
1,Kassel,201585.0,2019,No,7.0
2,Dortmund,,2020,No,
3,Kassel,,2020,No,
4,Dortmund,591672.0,2023,Yes,1.5
5,Kassel,191854.0,2023,No,8.0
6,Kassel,230011.0,2023,No,5.0
7,Dortmund,539050.0,2021,Yes,4.0
8,Kassel,204202.0,2022,No,6.0
9,Dortmund,591672.0,2023,Yes,1.5


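This assigns ranks to the population column in descending order, so higher populations receive a lower numerical rank. Tied values share the average of their ranks by default, which is why the two rows with a population of `591672` both receive rank `1.5`, and missing values remain unranked (`NaN`). The ranks are stored in a new column called `'population_rank'`.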
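
How ties are treated is controlled by the `method` parameter of `rank()`. The sketch below compares the default (`'average'`) with `'min'` and `'dense'` on the example data; the `ranks` comparison table is purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund",
             "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672,
                   191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019, 2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023],
})

pop = df["population"]
ranks = pd.DataFrame({
    "population": pop,
    # Default: tied values share the average of their ranks
    "average": pop.rank(ascending=False),
    # 'min': tied values all get the lowest rank of the tie; the next rank is skipped
    "min": pop.rank(ascending=False, method="min"),
    # 'dense': like 'min', but without gaps after a tie
    "dense": pop.rank(ascending=False, method="dense"),
})

print(ranks)
```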

Sorting and ranking are powerful tools in data analysis, enabling you to organize your data in meaningful ways and derive insights based on the order of data points. Whether you're preparing your dataset for analysis, looking for the top or bottom entries, or assigning ranks for comparison, pandas provides robust and flexible methods to accomplish these tasks efficiently.

### **[3.2 Grouping and Aggregating](#3.2-grouping-and-aggregating)**

Grouping and aggregation are powerful concepts in pandas that allow you to organize your data into groups and then perform calculations over those groups, summarizing or transforming the original detailed data into a new form. This can be highly useful for statistical analysis, data summarization, and even preparing data for further analysis or visualization.

#### Grouping

* The `groupby()` method is the cornerstone of data aggregation in pandas. It involves splitting the data into groups based on some criteria, applying a function to each group independently, and then combining the results into a data structure.


In [15]:
# Group the DataFrame by the 'city' column and calculate the average 'population' for each city
grouped_df = df.groupby('city')['population'].mean()
grouped_df

city
Dortmund    577351.0
Kassel      206913.0
Name: population, dtype: float64

In this example, `df.groupby('city')` creates a group for each unique city, and then for each group, the average population is calculated using `.mean()`. The result is a Series where the index consists of the unique cities, and the values are the average populations.
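
To make the split-apply-combine idea concrete, the following sketch performs the three phases by hand and compares the result with the one-line `groupby()` call. It is meant as an illustration of the concept, not as the way pandas works internally.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund",
             "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672,
                   191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019, 2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023],
})

manual = {}
# Split: iterating over a groupby object yields (key, sub-DataFrame) pairs
for city, group in df.groupby("city"):
    # Apply: compute the statistic for this group only
    manual[city] = group["population"].mean()

# Combine: collect the per-group results into one Series
manual_means = pd.Series(manual, name="population")

# The same computation in a single step
grouped_means = df.groupby("city")["population"].mean()

print(manual_means)
print(grouped_means)
```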

* You can also `group by multiple columns` by passing a list of column names. This is particularly useful for more complex data analysis tasks.

In [16]:
# Group by both 'city' and 'year', then calculate the total 'population' for each group
grouped_df = df.groupby(['city', 'year'])['population'].sum()
grouped_df

city      year
Dortmund  2019     587010.0
          2020          0.0
          2021     539050.0
          2023    1183344.0
Kassel    2019     201585.0
          2020          0.0
          2022     204202.0
          2023     421865.0
Name: population, dtype: float64

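This groups the data first by `city`, and within each `city`, further groups by `year`. Then, for each `city-year` combination, it calculates the total population. Because `sum()` skips missing values, the 2020 groups, which contain only `NaN`, are reported as `0.0`.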
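
A total of `0.0` for a group without any valid data can be misleading. One option is the `min_count` parameter of `sum()`, which requires a minimum number of non-missing values per group and returns `NaN` otherwise. A brief sketch with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund",
             "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672,
                   191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019, 2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023],
})

grouped = df.groupby(["city", "year"])["population"]

# Default behavior: groups containing only NaN sum to 0.0
totals_default = grouped.sum()

# min_count=1: at least one valid value is required, otherwise the result is NaN
totals_strict = grouped.sum(min_count=1)

print(totals_default)
print(totals_strict)
```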

#### Aggregate functions

After grouping, you can compute aggregate statistics such as `sum()`, `mean()`, `median()`, `min()`, `max()`, `count()`, and `nunique()` on each group. 

These functions can be applied directly to the grouped object.

In [17]:
# Count the non-missing 'population' entries for each 'city'
entry_count = df.groupby('city')['population'].count()
entry_count

city
Dortmund    4
Kassel      4
Name: population, dtype: int64

In [18]:
# Find the maximum 'population' for each 'city'
max_population = df.groupby('city')['population'].max()
max_population

city
Dortmund    591672.0
Kassel      230011.0
Name: population, dtype: float64
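
Note that `count()` only counts non-missing values, which is why each city shows 4 above although it has 5 rows. The sketch below contrasts `count()` with `size()`, which counts all rows, and with `nunique()`, which counts distinct non-missing values. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund",
             "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672,
                   191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019, 2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023],
})

# Non-missing population values per city
non_missing = df.groupby("city")["population"].count()

# All rows per city, including those with a missing population
row_count = df.groupby("city").size()

# Distinct non-missing population values per city
distinct = df.groupby("city")["population"].nunique()

print(non_missing)
print(row_count)
print(distinct)
```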

* The `agg()` method allows for more flexibility, letting you apply multiple aggregation operations to your groups in a single step or even define your custom aggregation functions.

In [19]:
# Apply multiple aggregate functions to 'population' for each 'city'
agg_df = df.groupby('city')['population'].agg(['mean', 'min', 'max'])
agg_df

Unnamed: 0_level_0,mean,min,max
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Dortmund,577351.0,539050.0,591672.0
Kassel,206913.0,191854.0,230011.0


In [20]:
# Custom aggregation
def range_func(x):
    return x.max() - x.min()

agg_df = df.groupby('city')['population'].agg(['mean', 'min', 'max', range_func])
agg_df

Unnamed: 0_level_0,mean,min,max,range_func
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Dortmund,577351.0,539050.0,591672.0,52622.0
Kassel,206913.0,191854.0,230011.0,38157.0


In the first example, `['mean', 'min', 'max']` applies multiple predefined aggregation functions to each group. The second example adds the custom aggregation function `range_func`, which calculates the range of populations (maximum minus minimum) within each city group.
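
`agg()` also accepts keyword arguments of the form `new_name=(column, function)`, often called named aggregation. This makes it possible to aggregate several columns at once and choose the output column names. A minimal sketch; the output names `mean_population`, `population_range` and `first_year` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund",
             "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672,
                   191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019, 2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023],
})

# Each keyword defines one output column: (source column, aggregation)
summary = df.groupby("city").agg(
    mean_population=("population", "mean"),
    population_range=("population", lambda s: s.max() - s.min()),
    first_year=("year", "min"),
)

print(summary)
```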

### **[3.3 Pivot Tables and Cross-Tabulation](#3.3-pivot-tables-and-cross-tabulation)**

Pivot tables and cross-tabulation are versatile tools in pandas for summarizing, analyzing, and presenting data. They allow you to reorganize and aggregate your data across multiple dimensions, making it easier to extract useful information and insights.

#### Pivot Tables with `pivot_table()`

* The `pivot_table()` function in pandas creates spreadsheet-style pivot tables as DataFrame objects. It's a highly versatile function that allows you to specify row and column indices, data values to fill the table, and the aggregation function to be applied.

In [21]:
# Apply pivot table
pivot = df.pivot_table(values='population', index='city', columns='year', aggfunc='mean')
pivot

year,2019,2021,2022,2023
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Dortmund,587010.0,539050.0,,591672.0
Kassel,201585.0,,204202.0,210932.5


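This example creates a pivot table that shows the `mean population` for each `city` by `year`, with cities as row indices and years as column labels. The year 2020 does not appear because both of its entries are missing, and `pivot_table()` by default (`dropna=True`) leaves out columns whose entries are all `NaN`.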
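
Two variations may help when reading such a table. The `fill_value` parameter replaces the empty cells of the pivot table, and a `groupby()` followed by `unstack()` produces a similar layout while keeping the year 2020. This is a sketch for comparison; whether filling gaps with `0` is sensible depends on the analysis.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund",
             "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672,
                   191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019, 2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023],
})

# fill_value replaces cells without data (here with 0)
pivot_filled = df.pivot_table(values="population", index="city",
                              columns="year", aggfunc="mean", fill_value=0)

# Similar table via groupby + unstack; the all-NaN year 2020 stays as a column
unstacked = df.groupby(["city", "year"])["population"].mean().unstack("year")

print(pivot_filled)
print(unstacked)
```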

* The `crosstab()` function computes a cross-tabulation of two (or more) factors. It's useful for summarizing categorical data and creating contingency tables, which show the frequency distribution of variables.

In [22]:
# Apply cross tabulation
ct = pd.crosstab(index=df['city'], columns=df['year'])
ct

year,2019,2020,2021,2022,2023
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Dortmund,1,1,1,0,2
Kassel,1,1,0,1,2


This creates a simple cross-tabulation that counts the occurrences of each `city-year` combination. If `df` includes other columns you wish to aggregate, you can specify them with the `values` parameter together with an appropriate `aggfunc`.
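
The following sketch shows the `values`/`aggfunc` combination on the example data, along with `normalize='index'`, which turns the counts of each row into proportions. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dortmund", "Kassel", "Dortmund", "Kassel", "Dortmund",
             "Kassel", "Kassel", "Dortmund", "Kassel", "Dortmund"],
    "population": [587010, 201585, None, None, 591672,
                   191854, 230011, 539050, 204202, 591672],
    "year": [2019, 2019, 2020, 2020, 2023, 2023, 2023, 2021, 2022, 2023],
})

# Aggregate a third column instead of counting occurrences
ct_mean = pd.crosstab(index=df["city"], columns=df["year"],
                      values=df["population"], aggfunc="mean")

# Share of each year within a city (every row sums to 1)
ct_share = pd.crosstab(index=df["city"], columns=df["year"], normalize="index")

print(ct_mean)
print(ct_share)
```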

### **[3.4 Coding Challenge](#3.4-coding-challenge)**

We are going to use the dataset from the previous coding challenge. It contains metrics from an injection molding process, including melt temperature, mold temperature, cycle times, forces, pressures, volumes, and a quality indicator. The goal is to analyze the dataset to uncover insights into the manufacturing process and quality outcomes.

**Challenge objectives:**

1. Step: Filter the dataset to include only cycles where the `'Melt temperature'` is greater than the average melt temperature.

2. Step: Slice the dataset to include only the columns related to temperatures (`'Melt temperature'`, `'Mold temperature'`) and `quality`.

3. Step: Group the dataset by `'quality'` and calculate the average `'Mold temperature'` and `'Melt temperature'` for each quality group.

4. Step: For each `'quality'` group, find the maximum `'Specific injection pressure peak value'` (APVs).

5. Step: Create a pivot table showing the average `'Cycle time'` (`'ZUx - Cycle time'`) for each combination of `'quality'` and `'Mold temperature'` (rounded to the nearest integer).

6. Step: Use cross-tabulation to summarize the count of cycles by `'quality'` and the `'Screw position at the end of hold pressure'` (CPn) binned into categories (e.g., <9, 9-10, >10).

In [23]:
# Step 0: Import a dataset into a Pandas DataFrame

# Sample code template
import pandas as pd
# Import the dataset
df = pd.read_csv('../data/data.csv', delimiter=";")

df.head()

Unnamed: 0,Melt temperature,Mold temperature,time_to_fill,ZDx - Plasticizing time,ZUx - Cycle time,SKx - Closing force,SKs - Clamping force peak value,Ms - Torque peak value current cycle,Mm - Torque mean value current cycle,APSs - Specific back pressure peak value,APVs - Specific injection pressure peak value,CPn - Screw position at the end of hold pressure,SVo - Shot volume,quality
0,106.476184,80.617,7.124,3.16,74.83,886.9,904.0,116.9,104.3,145.6,922.3,8.82,18.73,1.0
1,105.505,81.362,6.968,3.16,74.81,919.409791,935.9,113.9,104.9,145.6,930.5,8.59,18.73,1.0
2,105.505,80.411,6.864,4.08,74.81,908.6,902.344823,120.5,106.503496,147.0,933.1,8.8,18.98,1.0
3,106.474827,81.162,6.864,3.16,74.82,879.410871,902.033653,127.3,104.9,145.6,922.3,8.85,18.73,1.0
4,106.46614,81.471,6.864,3.22,74.83,885.64426,902.821269,120.5,106.7,145.6,917.5,8.8,18.75,1.0


In [27]:
# 1. Step: Filter the dataset to include only cycles where the `'Melt temperature'` is greater than the average melt temperature.

avg_melt_temp = df['Melt temperature'].mean()
filtered_df = df[df['Melt temperature'] > avg_melt_temp]

filtered_df.head()

Unnamed: 0,Melt temperature,Mold temperature,time_to_fill,ZDx - Plasticizing time,ZUx - Cycle time,SKx - Closing force,SKs - Clamping force peak value,Ms - Torque peak value current cycle,Mm - Torque mean value current cycle,APSs - Specific back pressure peak value,APVs - Specific injection pressure peak value,CPn - Screw position at the end of hold pressure,SVo - Shot volume,quality
391,124.391,80.844,6.968,3.17,74.81,916.7,933.0,124.7,110.0,146.1,914.9,8.88,18.69,2.0
454,141.136,80.947,6.968,3.13,74.8,921.0,936.7,123.8,110.0,145.6,918.0,8.86,18.71,2.0
470,110.223,81.081,7.748,3.1,74.84,921.2,936.7,126.4,113.0,147.5,923.1,8.9,18.67,2.0
511,140.641,81.183,7.852,3.12,74.82,918.2,937.0,118.7,109.9,145.7,930.8,8.95,18.62,2.0
547,132.74,80.934,6.968,3.14,74.81,918.2,936.5,126.4,109.7,146.4,913.0,8.88,18.69,2.0


This step keeps only the rows whose `'Melt temperature'` is above the column's average, so the result contains only cycles with higher-than-average melt temperatures.

In [28]:
# 2. Step: Slice the dataset to include only the columns related to temperatures (`'Melt temperature'`, `'Mold temperature'`) and `quality`.

sliced_df = filtered_df[['Melt temperature', 'Mold temperature', 'quality']]
sliced_df.head()

Unnamed: 0,Melt temperature,Mold temperature,quality
391,124.391,80.844,2.0
454,141.136,80.947,2.0
470,110.223,81.081,2.0
511,140.641,81.183,2.0
547,132.74,80.934,2.0


This step narrows down the dataset to only include the columns that are immediately relevant to the subsequent analysis, enhancing clarity and performance.

In [24]:
# 3. Step: Group the dataset by `'quality'` and calculate the average `'Mold temperature'` and `'Melt temperature'` for each quality group.

avg_temp_by_quality = df.groupby('quality')[['Mold temperature', 'Melt temperature']].mean()

avg_temp_by_quality

Unnamed: 0_level_0,Mold temperature,Melt temperature
quality,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,81.106881,106.062991
2.0,81.227101,106.640461
3.0,81.151197,106.020055
4.0,81.806685,108.752877


In [25]:
# 4. Step: For each `'quality'` group, find the maximum `'Specific injection pressure peak value'` (APVs).

max_apvs_by_quality = df.groupby('quality')['APVs - Specific injection pressure peak value'].max()

max_apvs_by_quality

quality
1.0    937.7
2.0    943.0
3.0    927.9
4.0    935.6
Name: APVs - Specific injection pressure peak value, dtype: float64

Step 3 and Step 4 leverage the `groupby()` method to segment the data by `'quality'`, enabling detailed analysis of temperature averages and maximum injection pressures within each quality category.

In [29]:
# 5. Step: Create a pivot table showing the average `'Cycle time'` (`'ZUx - Cycle time'`) for each combination of `'quality'` and `'Mold temperature'` (rounded to the nearest integer).

df['Mold temperature rounded'] = df['Mold temperature'].round()
pivot_table = df.pivot_table(values='ZUx - Cycle time', index='quality', columns='Mold temperature rounded', aggfunc='mean')

pivot_table

Mold temperature rounded,78.0,79.0,80.0,81.0,82.0
quality,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1.0,,,74.820286,74.819305,74.81
2.0,,74.835,,74.827239,74.825
3.0,74.88,74.88,75.2875,75.736831,75.628889
4.0,,,,75.64693,75.635299


This step creates a pivot table that averages `'Cycle time'` across combinations of `'quality'` and the rounded `'Mold temperature'`, offering insights into how these variables interact. Empty cells (`NaN`) mark combinations for which the dataset contains no cycles.

In [30]:
# 6. Step: Use cross-tabulation to summarize the count of cycles by `'quality'` and the `'Screw position at the end of hold pressure'` (CPn) binned into categories (e.g., <9, 9-10, >10).

# Binning the CPn values (NumPy supplies the infinite bin edges)
import numpy as np
df['CPn_category'] = pd.cut(df['CPn - Screw position at the end of hold pressure'], bins=[-np.inf, 9, 10, np.inf], labels=['<9', '9-10', '>10'])

# Cross-tabulation
cross_tab = pd.crosstab(index=df['quality'], columns=df['CPn_category'])

cross_tab


CPn_category,<9,9-10
quality,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,370,0
2.0,406,0
3.0,308,2
4.0,343,22


This step uses `pd.cut()` to bin `'Screw position at the end of hold pressure'` into categories and `pd.crosstab()` to count the cycles per category and `'quality'` value. With the default `right=True`, each bin includes its right edge, so the `<9` label also covers a value of exactly 9. In this dataset no counted cycle falls above 10, which is why the output has no `>10` column.
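
How `pd.cut()` treats values that lie exactly on a bin edge can be checked on a few hand-picked numbers. The sketch below uses hypothetical screw positions (not taken from the dataset) and compares the default right-closed bins with `right=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen to sit on and around the bin edges
positions = pd.Series([8.5, 9.0, 9.5, 10.0, 10.5])
bins = [-np.inf, 9, 10, np.inf]
labels = ["<9", "9-10", ">10"]

# Default (right=True): bins are (-inf, 9], (9, 10], (10, inf)
right_closed = pd.cut(positions, bins=bins, labels=labels)

# right=False: bins are [-inf, 9), [9, 10), [10, inf)
left_closed = pd.cut(positions, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"value": positions,
                    "right_closed": right_closed,
                    "left_closed": left_closed}))
```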

[--> Back to Outline](#course-outline)

---
## **[5. Best Practices and Resources](#5-best-practices-and-resources)**


### **[5.1 Writing Efficient Pandas Code](#5.1-writing-efficient-pandas-code)**
<a name="5.1-writing-efficient-pandas-code"></a>
Writing efficient Pandas code is crucial for dealing with large datasets and complex data transformations. Here are some tips:

- **Use vectorized operations**: Avoid loops; instead, use Pandas' built-in methods that are optimized for performance.
- **Avoid chained indexing**: Expressions such as `df[mask]['col'] = value` may operate on an intermediate copy of the data, so the assignment can get lost. Use a single `.loc[mask, 'col'] = value` for direct assignment instead.
- **Utilize categorical data**: When dealing with object types, consider converting them to categorical data, which can save both memory and time.
- **Minimize memory usage**: Pay attention to data types. Downcasting numeric data to smaller types can significantly reduce memory usage.
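
Some of these tips can be checked on synthetic data. The sketch below compares a vectorized operation with an explicit loop, measures the memory effect of converting repeated strings to a categorical, and downcasts an integer column. The data is generated for illustration only, and the actual savings depend on your dataset and pandas version.

```python
import numpy as np
import pandas as pd

n = 10_000
cities = pd.Series(["Dortmund", "Kassel"] * (n // 2))
values = pd.Series(np.arange(n, dtype="int64"))

# Vectorized operation: one call on the whole column ...
doubled_vectorized = values * 2
# ... gives the same result as an element-by-element loop, which is usually slower
doubled_loop = pd.Series([v * 2 for v in values])

# Categorical data: repeated strings are stored once plus small integer codes
string_bytes = cities.memory_usage(deep=True)
category_bytes = cities.astype("category").memory_usage(deep=True)

# Downcasting: pick the smallest integer type that can hold the values
small = pd.to_numeric(values, downcast="integer")

print(string_bytes, category_bytes)
print(values.dtype, "->", small.dtype)
```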


### **[5.2 Learning Resources](#5.2-learning-resources)**
<a name="5.2-learning-resources"></a>
To further enhance your Pandas skills, consider exploring the following resources:

- **[Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/index.html)**: The official documentation is an excellent resource for learning and reference.
- **[Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)**: Offers an in-depth look at using Pandas for data science.
- **[Kaggle Pandas Micro-Course](https://www.kaggle.com/learn/pandas)**: Provides practical exercises and examples.
- **Online forums and communities**: Websites like Stack Overflow (in particular its `pandas` tag) and Reddit's r/learnpython are great for getting help and advice.

### **[5.3 Q&A Session](#5.3-qa-session)**
<a name="5.3-qa-session"></a>
Q: How can I improve my Pandas code's performance?
A: Focus on vectorized operations, use appropriate data types, and leverage the power of indexing for selecting and filtering data efficiently.

Q: Can Pandas handle very large datasets?
A: Pandas can handle large datasets, but it loads data into memory, so the practical limit depends on your system's RAM. For extremely large datasets, consider using Dask or reading the data in chunks, for example with the `chunksize` parameter of `pd.read_csv()`.
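
As a minimal sketch of chunked processing, the example below computes a mean without holding the whole table in memory at once. A small in-memory CSV (`csv_text`, made up for this example) stands in for a large file; with real data you would pass a file path to `pd.read_csv()` instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file: one column with the values 1..100
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 101))

total = 0
rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=25):
    total += chunk["value"].sum()
    rows += len(chunk)

mean_value = total / rows
print(mean_value)
```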

Q: How can I stay updated with new Pandas features?
A: Follow the official Pandas blog and GitHub repository for release notes and updates. Participating in communities can also keep you informed about best practices and new features.

[--> Back to Outline](#course-outline)

---