<center>

# **Fall 2025 &mdash; CIS 3803<br>Introduction to Data Science**
### Week 3: Pandas Tutorial

</center>

**Date:** 15 September 2025  
**Time:** 6:00–9:00 PM  
**Instructor:** Dr. Patrick T. Marsh  
**Course Verse:** “It is the glory of God to conceal a matter; to search out a matter is the glory of kings.” &mdash; *Proverbs 25:2 (NIV)*

This notebook provides a crash course in Pandas and begins to cover the skills needed to be successful in data science.

****

## **Opening Devotional and Reflection**

*"The Spirit you received does not make you slaves, so that you live in fear again; rather, the Spirit you received brought about your adoption to sonship. And by him we cry, 'Abba, Father.'"*

**&mdash; Romans 8:15 (NIV)**

#### **Faith Reflection:** 
As we continue our work with data, we often face a spirit of fear—fear of messy data, fear of making mistakes, or the fear that our skills aren't good enough. This can make us feel like slaves to the process. But today's verse reminds us that we are adopted as God’s children. He has not given us a spirit of fear, but a spirit of love and a sound mind. As we face the tedious and often frustrating task of data cleaning and preparation, we can trust that we have the wisdom and patience to bring order out of chaos. How can you approach this week's data wrangling challenge with a spirit of confidence, knowing that God has equipped you for the task?

****

## **Pandas**

Pandas is an open-source software library for Python, widely used for data analysis and manipulation. Pandas has two powerful and easy-to-use data structures:

- **Series:** A one-dimensional labeled array capable of holding any data type. Think of it as a single column in a spreadsheet.
- **DataFrame:** A two-dimensional labeled data structure with columns of potentially different types. It's like a whole spreadsheet or SQL table.

The common way of importing Pandas into a Python script is:

```python
import pandas as pd
```

This is the import convention I will use throughout the semester.

#### **Series**

Pandas Series are:

- **One dimensional:** It has only one axis.
- **Labeled:** Each element in a Series has an associated *index*. This index can be a sequence of integers (default), or it can be custom labels (like dates, strings, etc.). This labeling allows for flexible data access and alignment.
- **Homogeneous Data Type:** While Pandas is flexible, a single Series generally holds elements of the same data type for optimal performance.
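
The labeled index is what separates a Series from a plain Python list. The short sketch below uses made-up values and labels (`s1`, `s2`, and `total` are illustrative names) to show label-based access and how arithmetic lines up on labels rather than on position.

```python
import pandas as pd

# Two Series with made-up values; their labels overlap only partially
s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Access a single element by its label
print(s1['b'])

# Arithmetic aligns on labels; a label found in only one Series gives NaN
total = s1 + s2
print(total)

# Each Series reports a single dtype for all of its elements
print(s1.dtype, total.dtype)
```

Because `NaN` is a floating-point value, the result of the addition is stored as floats even though both inputs hold integers.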

You can create a Series in several ways:

- From a list or array &mdash; and the default integer index is automatically generated
- From a list or array with a custom index
- From a dictionary &mdash; and the dictionary keys become the Series index

Examples:

In [None]:

import pandas as pd

# Create a Pandas Series from a list
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(f"Pandas Series from list:")
print(s)

# Create a Pandas Series with custom index
s_custom_index = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(f"\nPandas Series with custom index:")
print(s_custom_index)

# Create a Pandas Series from a dictionary
data_dict = {'apple': 1, 'banana': 2, 'cherry': 3}
s_dict = pd.Series(data_dict)
print(f"\nPandas Series from dictionary:")
print(s_dict)

#### **DataFrame**

Pandas DataFrames are the most commonly used Pandas object. They are:

- **Two-dimensional:** It has both rows and columns
- **Labeled Axes:** Both rows and columns have an index. Rows have row labels (index) and columns have column labels (headers).
- **Heterogeneous Data Types:** Columns can contain different data types (e.g., one column can be integers, another strings, another booleans).
- **Size Mutable:** You can add or delete rows and columns

You can create a DataFrame in multiple ways:

- From a dictionary of lists or arrays
- From a list of dictionaries

Examples:

In [None]:

# Dictionary of lists or arrays
data_df = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 35],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data_df)
print(f"\nDataFrame from dictionary of lists:")
print(df)

# List of Dictionaries
data_list_dict = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 22, 'City': 'Chicago'},
    {'Name': 'David', 'Age': 35, 'City': 'Houston'}
]
df_list = pd.DataFrame(data_list_dict)
print(f"\nDataFrame from list of dictionaries:")
print(df_list)

### **Reading and Writing Data**

Reading and writing files is a fundamental part of data science. Pandas makes it easy to handle various file formats. Here, we'll focus on CSV, Excel, and JSON.

#### Reading Data

Pandas provides a suite of `read_****()` functions, which are listed below, that can be used to read data into Pandas. The ones in bold are the functions we are most likely to encounter this semester.

- Reading from Text and Web
    - `pd.read_clipboard()`: Reads the contents of the clipboard into a DataFrame. This is useful for quickly getting data from a table copied from a website or a spreadsheet.
    - **`pd.read_csv()`: The most common function. It reads any delimited text file into a DataFrame (default is comma). It can also handle headers and different encoding formats.**
    - `pd.read_fwf()`: Reads a fixed-width formatted file. This is for data where columns are aligned by character position rather than by a delimiter.
    - `pd.read_html()`: Reads HTML tables from a URL, a string, or local file path and returns a list of DataFrames.
    - `pd.read_json()`: Reads a JSON (JavaScript Object Notation) file or JSON string into a DataFrame. It can handle various JSON formats, specified by the `orient` parameter.
    - `pd.read_table()`: Similar to `pd.read_csv()`, but the default delimiter is a tab (`\t`). It's commonly used for reading tab-separated value (TSV) files. 
    - `pd.read_xml()`: Reads an XML file into a DataFrame. It can parse data from XML tags and attributes.  
- Reading from Databases
    - `pd.read_gbq()`: Reads data from Google BigQuery into a DataFrame. This function requires the `pandas-gbq` library and proper authentication.
    - **`pd.read_sql()`: A general-purpose function for reading an SQL query or a database table into a DataFrame. It requires a database connection, such as a SQLAlchemy engine or a connection string.**
    - `pd.read_sql_query()`: Reads the results of an SQL query from a database directly into a DataFrame.
    - `pd.read_sql_table()`: Reads an entire table from a database into a DataFrame.
- Reading from Binary and Compressed Files
    - **`pd.read_excel()`: Reads data from an Excel file (`.xls`, `.xlsx`). It can read from a specific sheet by name or index into a DataFrame. Reading `.xlsx` files requires the `openpyxl` library to be installed (via conda); legacy `.xls` files require `xlrd` instead.**
    - `pd.read_feather()`: Reads a Feather file, a fast, lightweight, and language-agnostic format for storing data frames.
    - `pd.read_hdf()`: Reads an HDF5 (Hierarchical Data Format) file, which is designed for storing large amounts of data.
    - `pd.read_orc()`: Reads an ORC (Optimized Row Columnar) file, a free and open-source file format for columnar storage.
    - `pd.read_parquet()`: Reads a Parquet file, a columnar storage format that's highly efficient for large-scale data processing.
    - `pd.read_pickle()`: Reads a pickled (serialized) Python object from a file. This is a very fast way to save and load Pandas objects. 
- Reading from Statistical Software
    - `pd.read_sas()`: Reads a SAS (Statistical Analysis System) file. It can handle both `.sas7bdat` and XPORT files.
    - `pd.read_spss()`: Reads an SPSS (Statistical Package for the Social Sciences) file.
    - `pd.read_stata()`: Reads a Stata file. 
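
To try a reader without any files on disk, you can hand it an in-memory text buffer. The sketch below is only an illustration: the semicolon-delimited text, the column names, and `df_demo` are all made up, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Made-up semicolon-delimited text standing in for a file on disk
raw_text = "id;name;score\n1;Ana;90.5\n2;Ben;82.0\n3;Cy;77.5\n"

# read_csv accepts file-like objects as well as paths;
# sep sets the delimiter and index_col picks the row labels
df_demo = pd.read_csv(io.StringIO(raw_text), sep=';', index_col='id')

print(df_demo)
print(df_demo.dtypes)
```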

#### Writing Data

Pandas also provides a suite of `to_****()` methods, which are listed below, that can be used to convert or export your DataFrame. The ones in bold are the methods we are most likely to encounter this semester. **Note: Not all of these methods create an output file! Some simply return the object in a new format in memory!**

- Text and Web Formats
    - `to_clipboard()`: Writes the object to the user's clipboard. This is useful for quickly pasting data into other applications like spreadsheets.
    - **`to_csv()`: Writes the DataFrame to a comma-separated values (CSV) file. You can specify the delimiter, whether to include the index, and other file properties.**
    - `to_dict()`: Converts the DataFrame to a Python dictionary. This is great for integrating with other Python code or for creating structured data objects.
    - `to_html()`: Renders the DataFrame as an HTML table string. This is useful for displaying data in web reports.
    - `to_json()`: Converts the DataFrame to a JSON (JavaScript Object Notation) string or file. The `orient` parameter controls the JSON format, such as `records` or `split`.
    - `to_latex()`: Renders the DataFrame as a LaTeX table. This is perfect for including tables in academic papers or documents written in LaTeX.
    - `to_markdown()`: Converts the DataFrame to a Markdown table string, ideal for documentation or README files.
    - `to_string()`: Renders the DataFrame as a string representation, suitable for printing to the console or writing to a text file.
    - `to_xml()`: Writes the DataFrame to an XML file. This method provides options to define the XML structure and tags.
- Binary and Database Formats
    - **`to_excel()`: Writes the DataFrame to an Excel file. You can write to a specific sheet and use the Pandas built-in `ExcelWriter` to write multiple DataFrames to one file. (Note: You still must have the `openpyxl` library installed to write Excel files!)**
    - `to_feather()`: Writes the DataFrame to the Apache Feather format, a fast, language-agnostic binary format for storing data frames, which is highly efficient for data transfer between Python and other languages like R.
    - `to_hdf()`: Writes the DataFrame to an HDF5 file. This is ideal for storing very large, hierarchical datasets.
    - `to_numpy()`: Converts the DataFrame to a NumPy `ndarray`. This is a crucial step for many machine learning tasks that require NumPy arrays as input. 
    - `to_orc()`: Writes the DataFrame to an ORC file, a columnar storage format optimized for large-scale data processing in big data ecosystems.
    - `to_parquet()`: Writes the DataFrame to a Parquet file, another highly efficient columnar storage format widely used in distributed computing frameworks like Apache Spark.
    - `to_pickle()`: Serializes the DataFrame to a Python pickle file. This is a fast way to save and load a pandas object from within Python, as it preserves all data types and metadata.
    - `to_sql()`: Writes the records in the DataFrame to a SQL database table. This requires a database connection and table name.
- Specialty Formats and Transformations
    - `to_gbq()`: Writes the DataFrame to a Google BigQuery table. This is part of the `pandas-gbq` library and is useful for cloud-based data warehousing.
    - `to_records()`: Converts the DataFrame to a NumPy `recarray` (record array), where each row is a record that can be accessed by field name.
    - `to_stata()`: Writes the DataFrame to a Stata file (`.dta`), a format used by the statistical software Stata.
    - `to_xarray()`: Converts the DataFrame to an `xarray` object, which is useful for working with multi-dimensional labeled arrays, often used in geosciences and climatology.
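
The note above about in-memory results can be seen with a tiny example. This is a minimal sketch with made-up data (`df_demo` and the variable names are illustrative); it calls a few `to_*()` methods without a file path, so each one returns an object instead of writing a file.

```python
import pandas as pd

df_demo = pd.DataFrame({'name': ['Ana', 'Ben'], 'score': [90.5, 82.0]})

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = df_demo.to_csv(index=False)
print(csv_text)

# to_dict and to_json (with no path) also return in-memory objects
records = df_demo.to_dict(orient='records')
json_text = df_demo.to_json(orient='records')
print(records)
print(json_text)

# to_numpy returns a NumPy array
values = df_demo['score'].to_numpy()
print(values)
```

Passing a path as the first argument, for example `df_demo.to_csv('out.csv', index=False)`, writes a file instead; the filename here is only a placeholder.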

In [None]:
import pathlib
cwd = pathlib.Path().resolve()
datadir = cwd.joinpath('example-files')

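# Read a CSV file (by default the first row supplies the column names,
# which the later cells rely on when they select the 'id' and 'city' columns)
df_csv = pd.read_csv(datadir.joinpath('data.csv'))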
print(f"\nDataFrame from CSV file:")
print(df_csv)

# Read a CSV with a different delimiter (e.g., a semicolon)
df_semicolon = pd.read_csv(datadir.joinpath('data-semicolon.csv'), sep=';')
print(f"\nDataFrame from CSV file (semicolon delimited):")
print(df_semicolon)

# Read a CSV and specify a column as the index
df_indexed = pd.read_csv(datadir.joinpath('data.csv'), index_col='id')
print(f"\nDataFrame from CSV file (indexed):")
print(df_indexed)

# Read a CSV and skip the first few rows
df_skipped = pd.read_csv(datadir.joinpath('data.csv'), skiprows=5, header=None)
print(f"\nDataFrame from CSV file (skipped rows):")
print(df_skipped)

In [None]:
# Read the first sheet of an Excel file
df_excel = pd.read_excel(datadir.joinpath('data.xlsx'))
print(f"DataFrame from Excel file:")
print(df_excel)

# Read a specific sheet by name
df_sheet2 = pd.read_excel(datadir.joinpath('data.xlsx'), sheet_name='Sheet2')
print(f"\nDataFrame from Excel file (Sheet2):")
print(df_sheet2)

# Read a specific sheet by index (0-based)
df_sheet1 = pd.read_excel(datadir.joinpath('data.xlsx'), sheet_name=0)
print(f"\nDataFrame from Excel file (Sheet1):")
print(df_sheet1)


In [None]:
# Read a JSON file
df_json = pd.read_json(datadir.joinpath('data.json'))
print(f"DataFrame from JSON file:")
print(df_json)

# Reading a JSON file with a different orientation
# This is useful when the JSON is formatted as a list of records
df_json_records = pd.read_json(datadir.joinpath('data-records.json'), orient='records')
print(f"\nDataFrame from JSON file (records):")
print(df_json_records)

### Inspecting Data

Once you've loaded data into a Pandas DataFrame, the first step is to get a general overview of the dataset. 

- **`.head()` and `.tail()`**: These are the most common functions for quickly viewing the top or bottom rows of your data. By default, they show the first or last 5 rows, but you can specify a different number.
- **`.info()`**: This provides a concise summary of the DataFrame. It shows the number of entries, column names, the number of non-null values in each column, and the data types (`Dtype`). This is essential for identifying missing data and incorrect data types.
- **`.shape`**: This attribute returns a tuple representing the dimensions of the DataFrame, in the format `(rows, columns)`. It's a quick way to see how much data you're working with.
- **`.columns`**: This attribute returns the column names of the DataFrame as a Pandas `Index` object. Use `list(df.columns)` if you need a plain Python list.

#### Statistical Summary

To understand the distribution of your data, use these functions.

- **`.describe()`**: This is a powerful function that generates descriptive statistics for all numerical columns. It includes `count`, `mean`, `std` (standard deviation), `min`, `max`, and the quartile values. For non-numerical data (like strings or objects), you can add the `include='all'` parameter to see the count, unique values, and frequency of the most common values.
- **`.value_counts()`**: This is perfect for categorical data. It returns a `Series` containing the counts of unique values in a specified column, sorted in descending order.
- **`.unique()`**: This function returns an array of the unique values in a column, without their counts.

#### Handling Missing Values (more later)

Dealing with missing data is a critical part of data inspection.

- **`.isnull()` / `.isna()`**: These methods return a DataFrame of booleans, where `True` indicates a missing value (`NaN`). You can combine them with `sum()` to get a count of missing values per column.
- **`.notnull()` / `.notna()`**: The inverse of the above, returning `True` for non-missing values.

A typical workflow for initial data inspection is to use a combination of these methods:

1.  **`df.info()`** to check data types and find columns with a different number of non-null values.
2.  **`df.isnull().sum()`** to get a precise count of missing values per column.
3.  **`df.describe()`** to understand the distribution of numerical data.
4.  **`df['column'].value_counts()`** on each categorical column to check the frequency of its values.
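
The four steps above can be sketched on a small, made-up DataFrame so the output is easy to read. The names `df_demo`, `missing`, and `city_counts`, and all of the values, are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing age and one missing city
df_demo = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cy', 'Dee'],
    'age': [25, np.nan, 31, 40],
    'city': ['Tulsa', 'Tulsa', None, 'Norman'],
})

# 1. Data types and non-null counts
df_demo.info()

# 2. Missing values per column
missing = df_demo.isnull().sum()
print(missing)

# 3. Summary statistics for the numeric columns
print(df_demo.describe())

# 4. Frequencies for one categorical column (missing values are skipped)
city_counts = df_demo['city'].value_counts()
print(city_counts)
```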


In [None]:
# View the first 5 rows
print("First 5 rows of the DataFrame using .head():")
print(df_csv.head())

# View the last 7 rows
print("\nLast 7 rows of the DataFrame using .tail(7):")
print(df_csv.tail(7))

# View the summary of the DataFrame
print("\nSummary of the DataFrame:")
df_csv.info()

# Get the number of rows and columns
print("\nNumber of rows and columns:")
print(df_csv.shape)

# Get the column names
print("\nColumn names:")
print(df_csv.columns)

# Get descriptive statistics for numerical columns
print("\nDescriptive statistics for numerical columns:")
print(df_csv.describe())

# Get descriptive statistics for all columns
print("\nDescriptive statistics for all columns:")
print(df_csv.describe(include='all'))

# Get the frequency of each unique value in the 'city' column
print("\nFrequency of each unique value in the 'city' column:")
print(df_csv['city'].value_counts())

# Find all unique cities
print("\nUnique cities:")
print(df_csv['city'].unique())

# Check for missing values in the DataFrame
print("\nMissing values in the DataFrame:")
print(df_csv.isnull().sum())


### Data Selection and Indexing

As a budding data scientist you need to understand how to access and manipulate subsets of your data. This is a fundamental skill for any data science task, from cleaning to analysis. Pandas offers several methods for this, primarily `[]`, `.loc[]`, and `.iloc[]`. Within these methods, you can select data, index data, and slice data.

#### Basic Selection with `[]`

The most straightforward method for selecting data is using the `[]` operator.

- **Selecting a single column:** Use the column name inside the brackets. This returns a **Series**.
- **Selecting multiple columns:** Use a list of column names inside the brackets. This returns a **DataFrame**.
- **Slicing rows:** You can slice rows using integer-based indexing, just like with a Python list. The slice is based on the row's position.

#### Label-based Selection with `.loc[]`

`.loc[]` is the primary method for **label-based** indexing. It selects data based on row and column labels. The syntax is `df.loc[row_label, column_label]`. The start and end labels are **inclusive**.

- **Selecting by label:** Use a single label or a list of labels.
- **Slicing with labels:** `.loc[]` can slice both rows and columns.
- **Boolean indexing:** `.loc[]` is excellent for filtering data based on a condition.

#### Position-based Selection with `.iloc[]`

`.iloc[]` is for **integer-location based** indexing. It selects data based on the integer position of rows and columns, similar to NumPy arrays. The syntax is `df.iloc[row_position, column_position]`. The end position is **exclusive**, just like Python slicing.

- **Selecting by position:** Use an integer or a list of integers.
- **Slicing with positions:** Slicing works exactly like Python lists.

#### Summary of Differences

| Method | Syntax | Selection Type | End Point |
| :--- | :--- | :--- | :--- |
| `[]` | `df[label]` | Column (sometimes rows) | Exclusive (for slicing) |
| `.loc[]` | `df.loc[row_labels, col_labels]` | Label-based | Inclusive |
| `.iloc[]` | `df.iloc[row_positions, col_positions]` | Position-based | Exclusive |
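
The inclusive and exclusive end points are easiest to see when row labels are not the default integers. The following sketch assumes a made-up DataFrame with string row labels (`df_demo` and the result names are illustrative).

```python
import pandas as pd

# Made-up frame with string row labels so labels and positions differ
df_demo = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]},
    index=['w', 'x', 'y', 'z'],
)

# Label slice: the end label 'y' is included
by_label = df_demo.loc['w':'y', 'A']

# Position slice: stops before position 2
by_position = df_demo.iloc[0:2, 0]

# Row slice with []: also position-based and end-exclusive
plain_slice = df_demo[0:2]

print(by_label)
print(by_position)
print(plain_slice)
```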

In [None]:
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)
print("DataFrame for selection examples:")
print(df)

# Select column 'A'
series_A = df['A']
print("\nColumn 'A':")
print(series_A)

# Select columns 'A' and 'C'
df_AC = df[['A', 'C']]
print("\nColumns 'A' and 'C':")
print(df_AC)

# Select rows from index 0 up to (but not including) 2
rows_0_to_1 = df[0:2]
print("\nRows from index 0 to 1:")
print(rows_0_to_1)

# Select row with label 1 and column with label 'B'
value = df.loc[1, 'B']
print(f"\nValue at row label 1 and column 'B': {value}")

# Select all columns for row with label 0
row_0 = df.loc[0, :]
print("\nAll columns for row with label 0:")
print(row_0)

# Select all rows from label 0 to 2, and columns from 'A' to 'C'
subset = df.loc[0:2, 'A':'C']
print("\nSubset of rows 0 to 2 and columns 'A' to 'C':")
print(subset)

# Select all rows where column 'B' is greater than 5
filtered_df = df.loc[df['B'] > 5]
print("\nRows where column 'B' is greater than 5:")
print(filtered_df)

# Select the value in the row at position 1 and column at position 1
value = df.iloc[1, 1]
print(f"\nValue at row position 1 and column position 1: {value}")

# Select all rows at positions 0 and 2, and all columns
rows_0_2 = df.iloc[[0, 2], :]
print("\nRows at positions 0 and 2:")
print(rows_0_2)

# Select rows from position 0 up to (but not including) 2, and columns from 1 up to (but not including) 3
subset = df.iloc[0:2, 1:3]
print("\nSubset of rows at positions 0 to 1 and columns at positions 1 to 2:")
print(subset)


### Handling Missing Data

When handling missing data in Pandas, you need to first **detect** where the missing values are, then decide whether to **drop** the rows or columns with missing data, or **fill** them with new values.

#### Detecting Missing Data

The first step is to identify where missing values exist. Pandas represents missing data with `NaN` (Not a Number) for numerical data and `None` or `NaN` for other data types.

- **`.isnull()` or `.isna()`**: These methods return a DataFrame of boolean values where `True` indicates a missing value.
- **`.isnull().sum()`**: This is the most common way to get a count of missing values per column.

#### Dropping Missing Data

If the number of missing values is small, or if dropping them won't significantly impact your analysis, you can simply remove the rows or columns.

- **`.dropna()`**: This method removes rows or columns that contain missing values.
- **Dropping rows**: The default behavior of `.dropna()` is to drop any row that has **at least one** missing value.
- **Dropping columns**: To drop columns with missing values, set the `axis` parameter to `1` (or `'columns'`).
- **Controlling the drop**: You can use the `how` parameter to control the dropping behavior.
    - `how='any'` (default): Drop if **any** value is `NaN`.
    - `how='all'`: Drop only if **all** values in a row/column are `NaN`.
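
The difference between `how='any'` and `how='all'` only shows up when some row is entirely missing. The sketch below uses a made-up DataFrame (`df_demo`) built so that one row is all `NaN` and another is only partly missing.

```python
import numpy as np
import pandas as pd

# Made-up frame: row 1 is entirely missing, row 2 is partly missing
df_demo = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [4.0, np.nan, np.nan],
})

# how='any' is the default: only the fully populated row survives
dropped_any = df_demo.dropna()

# how='all' removes only the row where every value is NaN
dropped_all = df_demo.dropna(how='all')

# axis=1 drops columns; here every column has a NaN, so none remain
dropped_cols = df_demo.dropna(axis=1)

print(dropped_any)
print(dropped_all)
print(dropped_cols)
```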

#### Filling Missing Data

Often, you don't want to lose data by dropping rows or columns. **Filling** or **imputing** missing values is a common solution.

- **`.fillna()`**: This is the main function for filling missing values.
- **Filling with a single value**: You can fill all `NaN` values with a specific value.
- **Filling with a statistical measure**: A common approach is to fill missing values with the mean, median, or mode of that column.
- **Forward-fill (`.ffill()`) or Backward-fill (`.bfill()`)**: These methods fill missing values using the value from the previous (`.ffill()`) or next (`.bfill()`) valid observation. This is useful for time-series data.
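
As one possible illustration of these options, the sketch below fills a made-up table of daily readings. The column names `temp` and `rain` and the chosen fill values are assumptions for the example, not a recommendation for any particular dataset.

```python
import numpy as np
import pandas as pd

# Made-up daily readings with gaps
df_demo = pd.DataFrame({
    'temp': [70.0, np.nan, 74.0, np.nan],
    'rain': [0.0, 0.5, np.nan, 0.1],
})

# A dictionary gives each column its own fill value
filled = df_demo.fillna({'temp': df_demo['temp'].median(), 'rain': 0.0})
print(filled)

# Forward-fill carries the last valid value down the column
ffilled = df_demo.ffill()
print(ffilled)

# Backward-fill pulls the next valid value up; a trailing gap stays NaN
bfilled = df_demo.bfill()
print(bfilled)
```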

In [None]:
import numpy as np

data = {'A': [1, 2, np.nan], 'B': [4, np.nan, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)
print("Original DataFrame with missing values:")
print(df)

# Check for missing values
print("\nDataFrame missing values:")
print(df.isnull())

# Get the count of missing values for each column
print(f"\nThe count of missing values for each column:")
print(df.isnull().sum())

# Drop any row with a missing value
df_dropped_rows = df.dropna()
print("\nDataFrame after dropping rows with any missing values:")
print(df_dropped_rows)

# Drop any column with a missing value
df_dropped_cols = df.dropna(axis=1)
print("\nDataFrame after dropping columns with any missing values:")
print(df_dropped_cols)

# Drop rows where all values are NaN
df_dropped_all = df.dropna(how='all')
print("\nDataFrame after dropping rows where all values are missing:")
print(df_dropped_all)

# Fill all missing values with 0
df_filled_zero = df.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_filled_zero)

# Fill missing values in column 'B' with the mean of that column
mean_b = df['B'].mean()
df_filled_mean = df['B'].fillna(mean_b)
print("\nColumn 'B' after filling its missing values with the column mean:")
print(df_filled_mean)

# Fill the missing values in every numeric column with that column's own mean
df_filled_with_means = df.fillna(df.mean(numeric_only=True))
print('\nDataFrame after filling missing values with column means:')
print(df_filled_with_means)

# Forward-fill missing values
df_ffill = df.ffill()
print('\nDataFrame after forward-filling missing values:')
print(df_ffill)

# Backward-fill missing values
df_bfill = df.bfill()
print('\nDataFrame after backward-filling missing values:')
print(df_bfill)

### Data Manipulation

#### Adding New Columns

You can add a new column to a `DataFrame` by assigning a `Series` or a list to a new column name.

- **From a single value**: Assigning a single value to a new column will populate every row with that value.
- **From an existing column**: A common practice is to create a new column based on calculations from one or more existing columns.
- **Using `.assign()`**: The `.assign()` method is an alternative way to add new columns. It's especially useful for method chaining as it returns a new `DataFrame`.
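
Method chaining with `.assign()` might look like the sketch below. The `orders` table and its column names are made up; the lambdas let the second step refer to a column created by the first.

```python
import pandas as pd

# Made-up order data
orders = pd.DataFrame({'price': [2.0, 3.5, 4.0], 'qty': [3, 2, 5]})

# Each .assign() returns a new DataFrame, so the calls can be chained;
# a lambda receives the DataFrame as it exists at that point in the chain
result = (
    orders
    .assign(total=lambda d: d['price'] * d['qty'])
    .assign(is_large=lambda d: d['total'] > 10)
)

print(result)

# The original DataFrame is unchanged
print(orders.columns.tolist())
```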

#### Modifying Existing Columns

You can modify an existing column by selecting it and then assigning a new value to it, just as you would when adding a new column.

- **Direct assignment**: Simply use the column's name to select it and then assign the new values.
- **Conditional modification**: Use boolean indexing with `.loc[]` to modify values based on a condition.

#### Dropping Columns

You can remove one or more columns using the `.drop()` method. The `axis=1` parameter is crucial as it specifies that you are dropping a column rather than a row.

- **Dropping a single column**: Pass the column name as a string.
- **Dropping multiple columns**: Pass a list of column names.

To drop columns from the original `DataFrame` without creating a new one, use the `inplace=True` parameter.
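
A small sketch of the default, non-in-place behavior follows. It also uses the `columns=` keyword, which `.drop()` accepts as an equivalent to passing `axis=1`; the DataFrame itself is made up.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# columns= is an equivalent spelling of axis=1
no_b = df_demo.drop(columns='B')
same_no_b = df_demo.drop('B', axis=1)

# Without inplace=True the original keeps all of its columns
print(df_demo.columns.tolist())
print(no_b.columns.tolist())
```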

#### Renaming Columns

The `.rename()` method is used to change the names of one or more columns.

- **Renaming one or more columns**: Pass a dictionary to the `columns` parameter, where keys are the old column names and values are the new names.
- **Using a function**: You can also pass a function to `rename` to perform a uniform operation on all column names.

Just like with `.drop()`, you can use `inplace=True` to modify the original `DataFrame`.
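
Both forms of `.rename()` are sketched below on a made-up two-column DataFrame. The last call shows a detail worth knowing: by default, names in the dictionary that do not exist in the DataFrame are silently ignored.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Dictionary form: old name -> new name
renamed = df_demo.rename(columns={'A': 'First', 'B': 'Second'})
print(renamed.columns.tolist())

# Function form: applied to every column name
lowered = df_demo.rename(columns=str.lower)
print(lowered.columns.tolist())

# Names that are not in the DataFrame are ignored by default
unchanged = df_demo.rename(columns={'Z': 'Nope'})
print(unchanged.columns.tolist())
```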

In [None]:
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
print("Initial DataFrame:")
print(df)

# Add a new column 'C' with a constant value
df['C'] = 10
print("\nDataFrame after adding column 'C':")
print(df)

# Add a new column 'D' which is the sum of columns 'A' and 'B'
df['D'] = df['A'] + df['B']
print("\nDataFrame after adding column 'D':")
print(df)

# Add a new column 'E' using the assign method
df_new = df.assign(E=lambda x: x['A'] * 2)
print("\nDataFrame after adding column 'E':")
print(df_new)

# Rename columns to lowercase
df_new.columns = ['a', 'b', 'c', 'd', 'e']
print("\nDataFrame after renaming columns to lowercase:")
print(df_new)

# Modify column 'A' by multiplying its values by 10
df['A'] = df['A'] * 10
print("\nDataFrame after modifying column 'A':")
print(df)

# Change values in column 'B' to 99 where 'A' is greater than 15
df.loc[df['A'] > 15, 'B'] = 99
print("\nDataFrame after modifying column 'B':")
print(df)

# Drop column 'B'
df_no_B = df.drop('B', axis=1)
print("\nDataFrame after dropping column 'B':")
print(df_no_B)

# Drop columns 'A' and 'C'
df_no_AC = df.drop(['A', 'C'], axis=1)
print("\nDataFrame after dropping columns 'A' and 'C':")
print(df_no_AC)

# Drop 'A' and 'B' permanently from the original DataFrame
df.drop(['A', 'B'], axis=1, inplace=True)
print("\nDataFrame after dropping columns 'A' and 'B' permanently:")
print(df)

# Rename 'C' to 'First' and 'D' to 'Second'
# ('A' and 'B' were dropped in place above, so only 'C' and 'D' remain)
df_renamed = df.rename(columns={'C': 'First', 'D': 'Second'})
print("\nDataFrame after renaming columns 'C' to 'First' and 'D' to 'Second':")
print(df_renamed)

# Convert all column names to lowercase
df.rename(columns=str.lower, inplace=True)
print("\nDataFrame after converting column names to lowercase using function:")
print(df)


### Grouping and Aggregation

You can perform grouping and aggregation in pandas using the **`.groupby()`** method, which is a powerful way to summarize data. The process typically involves three steps: 

1. **split**
2. **apply**
3. **combine**

#### The `groupby()` Method

The `groupby()` method is at the core of this process. You use it to split your DataFrame into groups based on one or more columns. The result isn't a DataFrame itself, but a special `GroupBy` object that's ready for an aggregation function.

- **Syntax**: `df.groupby('column_to_group_by')` or `df.groupby(['col1', 'col2'])`
- **Split**: This is the grouping step. For example, if you group by the 'City' column, pandas internally creates separate groups for 'New York', 'Chicago', etc.
- **Apply**: Next, you apply an aggregation function to each group. This function calculates a single value for each group, such as the mean, sum, or count.
- **Combine**: Finally, pandas combines the results from each group into a new DataFrame or Series.
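
To make the split step visible, you can iterate over the `GroupBy` object before aggregating. This is a minimal sketch with made-up sales records; the `sales` table and its values are invented for illustration.

```python
import pandas as pd

# Made-up sales records
sales = pd.DataFrame({
    'City': ['Tulsa', 'Tulsa', 'Norman'],
    'Sales': [100, 150, 200],
})

# Split: the result is a GroupBy object, not a DataFrame
grouped = sales.groupby('City')

# Iterating shows the piece pandas created for each key
for city, group in grouped:
    print(city)
    print(group)

# Apply + combine: one value per group, returned as a Series
totals = grouped['Sales'].sum()
print(totals)
```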

#### Common Aggregation Functions

After grouping, you apply an aggregation function to the `GroupBy` object. Some of the most common functions are:

- **`.sum()`**: Computes the sum of values within each group.
- **`.mean()`**: Calculates the average of values in each group.
- **`.count()`**: Counts the number of non-null values in each group.
- **`.min()` and `.max()`**: Finds the minimum and maximum values in each group.
- **`.size()`**: Counts the total number of rows (including nulls) in each group.
- **`.agg()`**: Allows you to apply multiple aggregation functions at once.
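
The difference between `.count()` and `.size()` appears only when a group contains missing values. The sketch below assumes a made-up table with one missing `Sales` entry, and also shows `.agg()` with a list of function names.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing Sales value
sales = pd.DataFrame({
    'City': ['Tulsa', 'Tulsa', 'Norman'],
    'Sales': [100.0, np.nan, 200.0],
})

grouped = sales.groupby('City')['Sales']

# count: non-null values only
counts = grouped.count()
print(counts)

# size: all rows in the group, nulls included
sizes = grouped.size()
print(sizes)

# A list of function names passed to .agg() gives one column per function
summary = grouped.agg(['sum', 'mean', 'min', 'max'])
print(summary)
```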

#### Step-by-Step Examples

Let's use a simple DataFrame to demonstrate.



In [None]:
data = {'City': ['New York', 'New York', 'Chicago', 'Chicago', 'New York'],
        'Product': ['A', 'B', 'A', 'A', 'C'],
        'Sales': [100, 150, 200, 50, 300]}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

**1. Grouping by a single column and calculating the sum:**

To find the total sales for each city, you group by 'City' and then apply the `.sum()` function to the 'Sales' column.

In [None]:
# Group by 'City' and sum the 'Sales' for each group
city_sales = df.groupby('City')['Sales'].sum()
print("\nTotal sales per city:")
print(city_sales)

**2. Grouping by multiple columns:**

You can group by more than one column to get more granular results. For example, to find the average sales for each product in each city, you group by both 'City' and 'Product'.

In [None]:
# Group by 'City' and 'Product' and get the mean of 'Sales'
avg_sales = df.groupby(['City', 'Product'])['Sales'].mean()
print("\nAverage sales per city and product:")
print(avg_sales)

**3. Applying multiple aggregations with `.agg()`:**

The `.agg()` method is highly flexible and lets you apply multiple aggregation functions at once, even to different columns.

In [None]:
# Group by 'City' and get the sum, mean, and count of 'Sales'
city_stats = df.groupby('City').agg(
    total_sales=('Sales', 'sum'),
    average_sales=('Sales', 'mean'),
    number_of_transactions=('Sales', 'count')
)
print("\nMultiple aggregations per city:")
print(city_stats)


The syntax `total_sales=('Sales', 'sum')` creates a new column called `total_sales` with the sum of the 'Sales' column.
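
The same named-aggregation syntax can draw on different source columns in one call. The sketch below adds a made-up `Units` column to illustrate this; the table and the output column names are assumptions for the example.

```python
import pandas as pd

# Made-up sales records with a second numeric column
sales = pd.DataFrame({
    'City': ['Tulsa', 'Tulsa', 'Norman'],
    'Sales': [100, 150, 200],
    'Units': [1, 3, 2],
})

# Each keyword names an output column: (source column, function)
stats = sales.groupby('City').agg(
    total_sales=('Sales', 'sum'),
    max_units=('Units', 'max'),
)
print(stats)
```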

### Merging and Joining DataFrames

Merging and joining DataFrames is a fundamental operation in pandas for combining data from different sources. The primary methods for this are **`pd.merge()`** and **`DataFrame.join()`**.

#### Understanding Merges (`pd.merge()`)

The `pd.merge()` function is a versatile tool for combining two DataFrames based on a common column or index. It's similar to SQL `JOIN` operations.

The key parameters of `pd.merge()` are:

- **`left` and `right`**: The two DataFrames you want to merge.
- **`on`**: The column name(s) to join on. If the column names are different in the two DataFrames, you can use `left_on` and `right_on`.
- **`how`**: The type of merge to perform. This is the most crucial parameter. The options are:
    - **`'inner'`** (default): Returns only the rows where the key is present in **both** DataFrames. This is the most common type of merge.
    - **`'outer'`**: Returns all rows from both DataFrames. Where keys don't match, `NaN` is filled in.
    - **`'left'`**: Returns all rows from the **left** DataFrame, and matching rows from the right. `NaN` is filled for rows in the left DataFrame that have no match in the right.
    - **`'right'`**: Returns all rows from the **right** DataFrame, and matching rows from the left. `NaN` is filled for rows in the right DataFrame that have no match in the left.
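
As a small sketch of `left_on`, `right_on`, and `how='left'` working together, the example below merges two hypothetical tables (the table names, column names, and values are made up for illustration) whose key columns have different names.

```python
import pandas as pd

# Hypothetical tables whose key columns have different names
customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Ana', 'Ben', 'Cam']})
orders = pd.DataFrame({'cust': [1, 1, 3],
                       'amount': [20, 35, 50]})

# Keep every customer (left table) and attach matching orders where they exist
customer_orders = pd.merge(customers, orders,
                           left_on='customer_id', right_on='cust',
                           how='left')
print(customer_orders)
```

Two things are worth noticing in the output: a customer with two orders appears on two rows, and a customer with no orders is kept with `NaN` in the order columns.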

#### Example: Inner and Outer Merge

In [None]:
# Create two DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value1': [1, 2, 3, 4]})

df2 = pd.DataFrame({'key': ['A', 'B', 'E', 'F'],
                    'value2': [5, 6, 7, 8]})

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Perform an inner merge
merged_df = pd.merge(df1, df2, on='key', how='inner')

print("\nInner Merge:")
print(merged_df)

# Perform an outer merge
merged2_df = pd.merge(df1, df2, on='key', how='outer')
print("\nOuter Merge:")
print(merged2_df)


**Explanation**: The inner merge result includes only keys 'A' and 'B' because they are the only keys present in both DataFrames. The outer merge result includes every key from both DataFrames and fills in a missing value (`NaN`) wherever a column has no value for that key in its original DataFrame.
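When checking an outer merge, it can help to see where each row came from. One option is the `indicator=True` argument of `pd.merge()`, sketched here on the same two DataFrames (rebuilt under the names `left` and `right` so the block runs on its own).

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
right = pd.DataFrame({'key': ['A', 'B', 'E', 'F'], 'value2': [5, 6, 7, 8]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
audit = pd.merge(left, right, on='key', how='outer', indicator=True)
print(audit)

# Count how many keys fall into each group
print(audit['_merge'].value_counts())
```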

#### Understanding Joins (`DataFrame.join()`)

The `DataFrame.join()` method is a convenient way to combine two DataFrames on their **indexes**. It's primarily used when the keys you want to join on are the DataFrame indexes.

- **Syntax**: `left_df.join(right_df)`

**Example: Join on Index**

In [None]:
# Create two DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value1': [1, 2, 3, 4]})

df2 = pd.DataFrame({'key': ['A', 'B', 'E', 'F'],
                    'value2': [5, 6, 7, 8]})
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Move the 'key' column into the index of each DataFrame
df1_indexed = df1.set_index('key')
df2_indexed = df2.set_index('key')

# Perform a left join on the index
joined_df = df1_indexed.join(df2_indexed, how='left')

print("\nLeft Join:")
print(joined_df)

# Perform a right join on the index
joined_df_right = df1_indexed.join(df2_indexed, how='right')
print("\nRight Join:")
print(joined_df_right)

**Explanation**: The `join()` method defaults to a left join, so `how='left'` is written out here only for clarity. Since `df1_indexed` has keys 'A', 'B', 'C', and 'D', the left join includes all of these, and `value2` for keys 'C' and 'D' is `NaN` because there is no match in `df2_indexed`. The right join keeps the keys of `df2_indexed` ('A', 'B', 'E', and 'F'), so 'C' and 'D' are dropped and `value1` is `NaN` for 'E' and 'F'.

#### Merge vs. Join: Which to Use?

- **`pd.merge()`**: Use `merge()` when you need to join on specific columns, especially when those columns are not the index. It is the more flexible and general-purpose function.
- **`DataFrame.join()`**: Use `join()` for joining on the DataFrame's index. It's often cleaner and more concise for these specific use cases. You can also join on a column by setting the `on` parameter, but `merge()` is generally preferred for this task.
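
To make the comparison concrete, the following sketch performs the same left join both ways on two small made-up DataFrames and checks that the results match.

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'B', 'E'], 'value2': [5, 6, 7]})

# Option 1: merge directly on the 'key' column
via_merge = pd.merge(left, right, on='key', how='left')

# Option 2: move 'key' into the index, join, then move it back out
via_join = (left.set_index('key')
                .join(right.set_index('key'), how='left')
                .reset_index())

print(via_merge)
print(via_join)
```

When the keys already live in the index, Option 2 shrinks to a single `.join()` call, which is where `join()` is the more convenient choice.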


### Applying Functions

The `.apply()` method in pandas is a powerful and flexible tool for applying a function along an axis of a DataFrame or to the elements of a Series. It's often used when a standard vectorized operation isn't available or is too complex.

#### `.apply()` on a Series

When you use `.apply()` on a **Series**, the function is applied to each individual element in that Series.

**Example**: Let's create a Series and apply a simple function to it. We'll define a function that converts temperatures from Celsius to Fahrenheit.


In [None]:
# Create a Series of temperatures in Celsius
temps_celsius = pd.Series([10, 20, 30, 40])
print("Temperatures in Celsius:")
print(temps_celsius)

# Define a function to convert Celsius to Fahrenheit
def c_to_f(celsius):
    return (celsius * 9/5) + 32

# Apply the function to each element of the Series
temps_fahrenheit = temps_celsius.apply(c_to_f)

print("\nTemperatures in Fahrenheit:")
print(temps_fahrenheit)


**Explanation**: The `.apply()` method takes the `c_to_f` function and applies it to each value in `temps_celsius`, creating a new Series with the converted temperatures.

You can also use a **lambda function** for more concise operations:



In [None]:
# Use a lambda function to perform the same conversion
temps_fahrenheit_lambda = temps_celsius.apply(lambda x: (x * 9/5) + 32)
print("\nTemperatures in Fahrenheit using lambda:")
print(temps_fahrenheit_lambda)

This is a very common pattern in pandas for simple element-wise transformations.

#### `.apply()` on a DataFrame

When you use `.apply()` on a **DataFrame**, the function is applied to each row or each column. You specify the axis along which the function should be applied.

- **`axis=0`**: Applies the function to each **column**. The function will receive a Series (the column) as input.
- **`axis=1`**: Applies the function to each **row**. The function will receive a Series (the row) as input.

#### Example: Applying to each row (`axis=1`)

Let's use a DataFrame of student scores and calculate their average grade for each row.


In [None]:
data = {'Math': [80, 90, 75],
        'Science': [85, 95, 80],
        'English': [90, 85, 70]}
df = pd.DataFrame(data)
print("Initial DataFrame:")
print(df)

# Define a function to calculate the average of a row
def calculate_average(row):
    return (row['Math'] + row['Science'] + row['English']) / 3

# Apply the function to each row
df['Average'] = df.apply(calculate_average, axis=1)
print("\nDataFrame with average grades:")
print(df)



**Explanation**: By specifying `axis=1`, we tell `.apply()` to iterate through each row. The `calculate_average` function receives each row as a Series and can access column values using their labels (e.g., `row['Math']`).

#### Example: Applying to each column (`axis=0`)

You can use `axis=0` to perform column-wise operations, like finding the minimum score for each subject.


In [None]:
# Apply a function to each column to find the minimum value
min_scores = df[['Math', 'Science', 'English']].apply(min, axis=0)

print("Minimum score for each subject:")
print(min_scores)


**Explanation**: In this case, `.apply()` passes each column as a Series to the built-in `min()` function, returning the minimum value for 'Math', 'Science', and 'English' respectively.

While `.apply()` is very versatile, it is usually slower than built-in vectorized pandas operations. For simple arithmetic (like `df['A'] + df['B']`), prefer the direct vectorized operation. Reserve `.apply()` for custom logic that cannot be expressed with vectorized operations, such as a function that needs several columns of each row.
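As a small illustration, the row averages computed earlier with `.apply()` can also be computed with the vectorized `DataFrame.mean(axis=1)`. The sketch rebuilds the scores table under the name `scores` so it runs on its own.

```python
import pandas as pd

scores = pd.DataFrame({'Math': [80, 90, 75],
                       'Science': [85, 95, 80],
                       'English': [90, 85, 70]})

# Row-by-row with .apply(): flexible, but calls a Python function once per row
avg_apply = scores.apply(lambda row: row.mean(), axis=1)

# Vectorized equivalent: one call that operates on the whole DataFrame
avg_vectorized = scores.mean(axis=1)

print(avg_apply)
print(avg_vectorized)
```

On three rows the speed difference is invisible, but it tends to grow with the number of rows.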


### Pivoting and Melting

Pivoting and melting are two common data restructuring operations in pandas. They are inverse operations, used to transform data between "long" and "wide" formats.

#### Pivoting (Wide Format)

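**Pivoting** is the process of reshaping data from a "long" or normalized format to a "wide" format. It takes the unique values from one column and turns them into new column headers. Think of it like creating a pivot table in Excel, with one difference: `DataFrame.pivot()` only reshapes and does not aggregate, so each index/column combination must appear at most once. The main function for this is **`DataFrame.pivot()`**.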

The `pivot()` function takes three arguments:

- **`index`**: The column to use as the new DataFrame index.
- **`columns`**: The column whose unique values will become the new column headers.
- **`values`**: The column(s) whose values will populate the new DataFrame.

Example

Imagine you have sales data for different products over a few months. This data is in a long format.

In [None]:
import pandas as pd

# Long format DataFrame
df_long = pd.DataFrame({
    'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 120, 160, 130, 170]
})

print("Original (Long) DataFrame:")
print(df_long)

To analyze the sales of each product side-by-side, you can pivot the data.

In [None]:
# Pivot to wide format
df_wide = df_long.pivot(index='Month', columns='Product', values='Sales')

print("\nPivoted (Wide) DataFrame:")
print(df_wide)

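The `Month` column becomes the new index, and the unique values from the `Product` column (`A` and `B`) become the new column names. The corresponding `Sales` values fill the table. Note that the rows come out sorted by index label, so the months appear alphabetically (`Feb`, `Jan`, `Mar`) rather than in calendar order.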
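`pivot()` expects each index/column pair to appear only once. When the data contains repeats, `DataFrame.pivot_table()` is the usual alternative because it aggregates them. The sketch below uses made-up data with a duplicated `Jan`/`A` pair, and the choice of `aggfunc='sum'` is an assumption for the example.

```python
import pandas as pd

# Hypothetical data with two 'Jan'/'A' rows, so (Month, Product) is not unique
df_dupes = pd.DataFrame({
    'Month': ['Jan', 'Jan', 'Jan', 'Feb', 'Feb'],
    'Product': ['A', 'A', 'B', 'A', 'B'],
    'Sales': [100, 50, 150, 120, 160],
})

# pivot() cannot decide which 'Jan'/'A' value to keep and raises ValueError
try:
    df_dupes.pivot(index='Month', columns='Product', values='Sales')
except ValueError as err:
    print("pivot() failed:", err)

# pivot_table() aggregates the duplicates instead (here, by summing them)
df_summed = df_dupes.pivot_table(index='Month', columns='Product',
                                 values='Sales', aggfunc='sum')
print(df_summed)
```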

#### Melting (Long Format)

**Melting** is the inverse operation of pivoting. It reshapes a DataFrame from a "wide" format to a "long" format. This is useful when you have data where columns represent variables, and you want to convert these columns into rows. The main function for this is **`pd.melt()`**.

The `melt()` function takes two main arguments:

- **`id_vars`**: The column(s) to remain as identifier variables. These will not be melted.
- **`value_vars`**: The column(s) to melt. These columns' names will become a new variable column, and their values will become a new value column.

Example

Using the `df_wide` DataFrame from the previous example, let's melt it back to the original long format.

In [None]:
# Melt the wide DataFrame back to long format
df_melted = pd.melt(df_wide.reset_index(),
                    id_vars=['Month'],
                    value_vars=['A', 'B'],
                    var_name='Product',
                    value_name='Sales')

print("\nMelted (Long) DataFrame:")
print(df_melted)


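We use `df_wide.reset_index()` to turn the `Month` index back into a column. Then, the `Product` column is created from the melted column headers (`A`, `B`), and the `Sales` column is created from their corresponding values. The melted rows match those of `df_long`, although their order differs: the pivot sorted the months, and melting stacks all of the `A` rows before the `B` rows.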
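One way to confirm that pivoting and melting really undo each other is to run the round trip and compare against the original after sorting both the same way. This is a sketch of that check. The helper names `key_cols`, `original_sorted`, and `round_trip_sorted` are introduced here for illustration.

```python
import pandas as pd

df_long = pd.DataFrame({
    'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 120, 160, 130, 170],
})

# Long -> wide -> long
df_wide = df_long.pivot(index='Month', columns='Product', values='Sales')
df_round_trip = pd.melt(df_wide.reset_index(), id_vars=['Month'],
                        value_vars=['A', 'B'],
                        var_name='Product', value_name='Sales')

# Row order differs after the round trip, so sort both before comparing
key_cols = ['Month', 'Product']
original_sorted = df_long.sort_values(key_cols).reset_index(drop=True)
round_trip_sorted = df_round_trip.sort_values(key_cols).reset_index(drop=True)

print(original_sorted)
print(round_trip_sorted)
```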

### Time Series Functionality

Pandas offers robust functionality for working with time-series data, from parsing dates to resampling data. The primary objects for this are the `DatetimeIndex`, `PeriodIndex`, and `TimedeltaIndex`.

#### Creating a Time-Series Index

The first step is to ensure your DataFrame has a proper time-series index.

- **Using `pd.to_datetime()`**: This function converts a column of string or integer dates into datetime values (a `datetime64` dtype), which is essential for time-series operations.
- **Setting the index**: To unlock time-series functionality, set the datetime column as the DataFrame's index.
- **Slicing by date**: You can use strings to slice by year, month, or day.
    

In [None]:
df = pd.DataFrame({'date': ['2023-01-01', '2023-01-02', '2023-01-03',
                            '2024-01-01', '2024-01-02', '2024-01-03'],
                   'value': [10, 15, 20, 25, 30, 35]})

# Convert the 'date' column to a datetime object
df['date'] = pd.to_datetime(df['date'])
print("\nDataFrame with 'date' column as datetime:")
print(df)

print("\nDataFrame info:")
df.info()  # info() prints its summary itself and returns None

# Set the 'date' column as the DataFrame's index
df.set_index('date', inplace=True)
print("\nDataFrame with 'date' as index:")
print(df)

# Select all data from a specific year
df_2023 = df.loc['2023']
print("\nData for the year 2023:")
print(df_2023)

# Select all data from a specific month
df_jan = df.loc['2023-01']
print("\nData for January 2023:")
print(df_jan)

# Select data for a specific date range
df_slice = df.loc['2023-01-01':'2023-01-03']
print("\nData from 2023-01-01 to 2023-01-03:")
print(df_slice)
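
Real date columns are rarely as tidy as the ISO-style strings above. As a hedged sketch, the example below shows two optional `pd.to_datetime()` arguments that often help: `format`, which states the layout explicitly, and `errors='coerce'`, which turns unparseable entries into `NaT` instead of raising an error. The sample values are made up.

```python
import pandas as pd

# Hypothetical raw values: two day/month/year strings and one that is not a date
raw_dates = pd.Series(['15/09/2025', '16/09/2025', 'not a date'])

# format= states the layout explicitly; errors='coerce' turns failures into NaT
parsed = pd.to_datetime(raw_dates, format='%d/%m/%Y', errors='coerce')
print(parsed)

# Integer dates such as 20250915 can be parsed once converted to strings
int_dates = pd.Series([20250915, 20250916])
parsed_ints = pd.to_datetime(int_dates.astype(str), format='%Y%m%d')
print(parsed_ints)
```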

#### Resampling Time-Series Data

**Resampling** is the process of changing the frequency of your time-series data. It is a powerful tool for aggregation.

- **Downsampling**: Converting high-frequency data to low-frequency data (e.g., from daily to monthly). When downsampling, you must provide an aggregation function like `mean()`, `sum()`, or `first()`.
- **Upsampling**: Converting low-frequency data to high-frequency data (e.g., from monthly to daily). When upsampling, you must decide how to fill the new data points.

In [None]:
# NumPy supplies the random values used below
import numpy as np

# Create a daily time series
daily_data = pd.DataFrame({'value': np.random.rand(365)},
                          index=pd.date_range(start='2023-01-01', periods=365, freq='D'))
print(daily_data)

# Resample from daily to monthly, taking the mean of each month
# ('ME' = month end; pandas versions before 2.2 use 'M' instead)
monthly_mean = daily_data.resample('ME').mean()
print("Downsampled to monthly means:")
print(monthly_mean.head())

# Upsample from monthly to weekly
weekly_data = monthly_mean.resample('W').asfreq()
print("\nUpsampled to weekly frequency (with NaNs):")
print(weekly_data)

# Upsample and fill missing values with forward fill or backward fill
weekly_filled_f = monthly_mean.resample('W').ffill()
weekly_filled_b = monthly_mean.resample('W').bfill()
print("\nUpsampled to weekly frequency (forward fill):")
print(weekly_filled_f)
print("\nUpsampled to weekly frequency (backward fill):")
print(weekly_filled_b)
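
Forward and backward fill copy existing values into the gaps. Another option when upsampling is to interpolate between known points. The sketch below uses three made-up month-start readings (chosen so the interpolated values are easy to check by hand) and fills the days in between linearly.

```python
import pandas as pd

# Hypothetical month-start readings ('MS' = month start)
monthly = pd.Series([0.0, 31.0, 59.0],
                    index=pd.date_range('2023-01-01', periods=3, freq='MS'))

# Upsample to daily and fill the gaps by linear interpolation
daily = monthly.resample('D').interpolate()
print(daily.head())
print(len(daily))
```

Which fill strategy is appropriate depends on the data: interpolation assumes the quantity changes smoothly between observations, which is not always true.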

#### `Timedelta` Objects

A `Timedelta` represents a duration, the difference between two dates or times.

- **Creating a `Timedelta`**: You can create a `Timedelta` by subtracting two `datetime` objects.
- **Adding/Subtracting**: You can add or subtract `Timedelta` objects to/from `datetime` objects.

In [None]:
start = pd.to_datetime('2023-01-01 10:00:00')
end = pd.to_datetime('2023-01-01 11:30:00')

duration = end - start
print("Timedelta object:")
print(duration)

# Add a Timedelta to a datetime
new_time = start + pd.to_timedelta('2 hours')
print("\nNew time after adding 2 hours:")
print(new_time)
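
A few more `Timedelta` operations come up often in practice. This sketch shows converting a duration to seconds, building a `Timedelta` from keyword arguments, and subtracting two datetime columns. The `trips` table is hypothetical.

```python
import pandas as pd

start = pd.Timestamp('2023-01-01 10:00:00')
end = pd.Timestamp('2023-01-01 11:30:00')
duration = end - start

# A Timedelta can be converted to a plain number of seconds
print(duration.total_seconds())

# Timedeltas can also be built from keyword arguments
deadline = start + pd.Timedelta(days=1, hours=2)
print(deadline)

# Subtracting two datetime columns gives a column of Timedeltas
# (hypothetical start/end times for illustration)
trips = pd.DataFrame({
    'start': pd.to_datetime(['2023-01-01 08:00', '2023-01-01 09:15']),
    'end': pd.to_datetime(['2023-01-01 08:45', '2023-01-01 10:45']),
})
trips['minutes'] = (trips['end'] - trips['start']).dt.total_seconds() / 60
print(trips)
```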


### Advanced Operations

Pandas can do even more than what I have shown so far. Here is a quick high-level highlight of handling categorical data, working with MultiIndex objects, and plotting directly from DataFrames.

#### Categorical Data

**Categorical data** is a pandas data type that represents a variable with a fixed, limited number of possible values (categories). It is usually more memory-efficient than object or string data when a small set of values repeats many times, and it can improve performance for certain operations.

You can convert a column to the categorical type using `astype('category')`.

Benefits of categorical data include:

- **Memory Efficiency**: Storing categories as integer codes rather than repeating strings can save a significant amount of memory, especially with large datasets.
- **Performance**: Operations like `groupby()` can be much faster with categorical data.
- **Ordering**: You can define a specific order for your categories, which is useful for sorting and plotting.




In [None]:
df = pd.DataFrame({'product_id': [1, 2, 3, 4],
                   'category': ['shoes', 'clothing', 'shoes', 'accessories']})
print("Initial DataFrame:")
df.info()  # info() prints its summary itself and returns None

# Convert the 'category' column to the 'category' data type
df['category'] = df['category'].astype('category')
print("\nDataFrame after converting 'category' to categorical type:")
df.info()

# Define a specific order for the categories
df['category'] = pd.Categorical(df['category'], categories=['clothing', 'shoes', 'accessories'], ordered=True)
print("\nDataFrame after defining category order:")
print(df)

print("\nCategory codes:")
print(df['category'].cat.codes)
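
The memory and ordering benefits are easier to see on a slightly larger example. The sketch below uses made-up data: it compares memory use for a column with many repeated labels, then sorts an ordered categorical.

```python
import pandas as pd

# Hypothetical column with many repeated labels
labels = pd.Series(['shoes', 'clothing', 'accessories'] * 10_000)
as_category = labels.astype('category')

# deep=True counts the memory used by the strings themselves
print("plain strings:", labels.memory_usage(deep=True), "bytes")
print("categorical:  ", as_category.memory_usage(deep=True), "bytes")

# With ordered=True, sorting follows the category order, not the alphabet
sizes = pd.Series(pd.Categorical(['large', 'small', 'medium'],
                                 categories=['small', 'medium', 'large'],
                                 ordered=True))
print(sizes.sort_values().tolist())
print(sizes.min())
```

The exact byte counts depend on the pandas version and platform, so only the relative difference matters here.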

#### MultiIndex

A **MultiIndex** is a hierarchical index that allows you to have multiple levels of labels on an axis (rows or columns). This is powerful for handling complex, multi-dimensional data, often created after using `groupby()` or `pivot_table()`.

- You can create a MultiIndex by setting multiple columns as the index.
- Slicing and selecting data with a MultiIndex can be done using a tuple.



In [None]:
data = {'city': ['NY', 'NY', 'SF', 'SF'],
        'year': [2020, 2021, 2020, 2021],
        'sales': [100, 150, 200, 250]}

df = pd.DataFrame(data).set_index(['city', 'year'])
print("MultiIndex DataFrame:")
print(df)

# Select data for 'SF' in 2020
sales_sf_2020 = df.loc[('SF', 2020)]
print("\nSales in SF for 2020:")
print(sales_sf_2020)

# Select all years for 'NY'
all_years_ny = df.loc['NY']
print("\nAll years for NY:")
print(all_years_ny)

# Use a slice to select a range of years for a specific city
# (tuple slicing like this requires a sorted MultiIndex, which this one is)
range_years = df.loc[('NY', 2020):('NY', 2021)]
print("\nRange of years for NY from 2020 to 2021:")
print(range_years)
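
Tuples select from the outer level inward. To select on the inner level alone, or to reshape a MultiIndex, a few other methods are handy. This sketch rebuilds the same data under the name `sales` and shows `xs()`, `unstack()`, and `reset_index()`.

```python
import pandas as pd

sales = pd.DataFrame({'city': ['NY', 'NY', 'SF', 'SF'],
                      'year': [2020, 2021, 2020, 2021],
                      'sales': [100, 150, 200, 250]}).set_index(['city', 'year'])

# xs() selects on an inner level: all cities for the year 2021
sales_2021 = sales.xs(2021, level='year')
print(sales_2021)

# unstack() pivots the inner level into columns: one row per city
by_city = sales['sales'].unstack('year')
print(by_city)

# reset_index() flattens the MultiIndex back into ordinary columns
print(sales.reset_index())
```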


#### Plotting

Pandas has built-in plotting functionality that wraps the **Matplotlib** library. This allows for quick and convenient visualization of your data directly from a DataFrame or Series.

To use the plotting functions, you must have Matplotlib installed. The syntax to plot from a DataFrame is:

- **Syntax**: `df.plot(kind='plot_type')`

Common Plot Types

You can specify the plot type using the `kind` parameter or by calling a specific plot method.

- `kind='line'` or `df.plot.line()`: Line plot
- `kind='bar'` or `df.plot.bar()`: Bar plot
- `kind='hist'` or `df.plot.hist()`: Histogram
- `kind='scatter'` or `df.plot.scatter(x='A', y='B')`: Scatter plot
- `kind='box'` or `df.plot.box()`: Box plot


Pandas' plotting functions are great for quick exploratory data analysis, but for more customization and complex plots, it's recommended to use Matplotlib or Seaborn directly.


In [None]:
# Create a DataFrame for plotting (NumPy supplies the random values)
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('2023-01-01', periods=10),
                  columns=['A', 'B', 'C', 'D'])

# Plot a line chart
df.plot(kind='line', title='Line Plot')


In [None]:
# Create a bar plot
df.plot(kind='bar', title='Bar Plot')

In [None]:
# Create a histogram of column 'A'
df['A'].plot(kind='hist', title='Histogram of A')