# 🐼 Pandas Guide: Essential Data Handling and Manipulation

## 📚 Notebook Outline

1. Introduction
2. Data Structures Overview
3. Loading Data
4. Basic Data Inspection & Selection
5. Data Exploration
6. Statistical Summaries and Correlation
7. Grouping, Aggregation & Sorting
8. Data Cleaning and Transformation
9. Data Preprocessing and Feature Engineering
10. Combining Datasets
11. Bonus: Tips for Handling Large Datasets

## 1. Introduction  

This notebook is a practical, example-driven guide to using `pandas`, one of the most important libraries in the Python data science ecosystem. It focuses on the essential operations you'll need for exploring, cleaning, transforming, and analyzing data in real-world projects.

### 📌 Purpose of the Guide

The goal is to:
- Build a strong foundation in pandas for data manipulation tasks.
- Provide clear, reusable examples for future reference.
- Help transition from academic data handling (e.g., NumPy, raw CSVs) to industry-standard practices.

### ⚙️ Why Pandas Is Crucial for Data Science Workflows

Pandas is a core tool for:
- Efficiently reading, cleaning, and transforming data.
- Performing exploratory data analysis (EDA).
- Preparing data for machine learning pipelines.
- Managing data pipelines in reproducible, readable ways.

This guide is part of a larger portfolio aimed at showcasing end-to-end data science workflows using industry-standard tools.

## Imports and Setup

We’ll use `pandas` for data manipulation and `numpy` for some supporting operations.

In [36]:
import pandas as pd
import numpy as np

## 2. Data Structures Overview

At the core of `pandas` are two main data structures:

- **Series**: A one-dimensional labeled array, similar to a column in a spreadsheet or a 1D NumPy array with labels.
- **DataFrame**: A two-dimensional labeled data structure with columns of potentially different types - like a table or spreadsheet in Python.

These structures are optimized for performance and integrate seamlessly with NumPy, making them essential tools in the Python data science ecosystem.

---

### 🧾 What is a DataFrame?

A **pandas DataFrame** is a two-dimensional, labeled data structure that behaves like a table, with rows and columns. Each column can hold a different data type (e.g., int, float, string), and each entry is identified by a row index and column label.

In [39]:
pd.DataFrame({'Apple': [50, 21], 'Banana': [131, 2]}, index=['Monday', 'Tuesday'])

Unnamed: 0,Apple,Banana
Monday,50,131
Tuesday,21,2


### 🔍 What is a Series?

A **pandas Series** is a one-dimensional labeled array that can hold any data type. It's like a single column of a DataFrame, with each value paired to an index label.

Unlike a DataFrame, a Series does not have multiple column labels; it has just one overall `name`.

In [37]:
pd.Series([12, 7, 35], index=['Day 1', 'Day 2', 'Day 3'], name='# of customers')

Day 1    12
Day 2     7
Day 3    35
Name: # of customers, dtype: int64

### ✅ Summary

- Use **Series** for 1D labeled data (e.g., time series, single columns).
- Use **DataFrame** for 2D structured data (e.g., datasets, tables).
- Both structures support rich operations like filtering, aggregation, and transformation.
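
As a small illustration of how the two structures relate, the sketch below reuses the fruit example from above (the variable names are chosen only for this illustration). Selecting one column of a DataFrame yields a Series, and Series that share an index can be combined back into a DataFrame:

```python
import pandas as pd

# Toy data reusing the fruit example from above
fruit_sales = pd.DataFrame(
    {'Apple': [50, 21], 'Banana': [131, 2]},
    index=['Monday', 'Tuesday'],
)

# Selecting a single column returns a Series named after that column
apples = fruit_sales['Apple']
print(type(apples).__name__)  # Series
print(apples.name)            # Apple

# Several Series sharing an index can be combined back into a DataFrame
rebuilt = pd.DataFrame({'Apple': apples, 'Banana': fruit_sales['Banana']})
print(rebuilt)
```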

## 3. Loading Data

A common first step in data analysis is loading your dataset into a pandas DataFrame. The most common format is CSV (comma-separated values), which pandas can read easily with `pd.read_csv()`.

In this example, we will download a dataset directly from Kaggle (https://www.kaggle.com/datasets/zynicide/wine-reviews) and load it into a DataFrame.

---

Notice that the first column in the dataset contains the row indices, not regular data. Passing `index_col=0` tells pandas to use that first column as the DataFrame index, which removes the redundant `Unnamed: 0` column from the loaded data.

In [43]:
# Import necessary libraries
import os
import kagglehub  # Library to download datasets directly from Kaggle

# Download the dataset
path = kagglehub.dataset_download("zynicide/wine-reviews")

print("Path to dataset files:", path)

# Build the file path portably instead of hard-coding a Windows "\\" separator
csv_path = os.path.join(path, "winemag-data-130k-v2.csv")

# Load the CSV file into a DataFrame without fixing the index (default)
wine_data = pd.read_csv(csv_path)
print("DataFrame loaded without index_col:")
display(wine_data.head())

# Reload the CSV, this time fixing the index column
wine_data = pd.read_csv(csv_path, index_col=0)
print("DataFrame loaded with index_col=0:")
display(wine_data.head())

Path to dataset files: C:\Users\lucad\.cache\kagglehub\datasets\zynicide\wine-reviews\versions\4
DataFrame loaded without index_col:


Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


DataFrame loaded with index_col=0:


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


Sometimes you might want to store your downloaded datasets in a custom folder for better project organization. The following snippet shows how to move the downloaded files to a directory of your choice.

Be sure to update the `target_folder` variable with your preferred folder path.

In [None]:
import kagglehub
import shutil
import os

# Uncomment and run this line if you haven't downloaded the dataset yet
# path = kagglehub.dataset_download("zynicide/wine-reviews")

# Specify your desired target folder here
target_folder = "/your/custom/path/wine-reviews"

# Create the target folder if it doesn't exist
os.makedirs(target_folder, exist_ok=True)

# Move the downloaded dataset folder into the target folder
# (because target_folder already exists, `path` becomes a subfolder of it)
shutil.move(path, target_folder)

print(f"Dataset moved to {target_folder}")

### 📋 Summary

- We loaded a dataset from a CSV file using `pd.read_csv()`.  
- You learned how to fix the DataFrame index with the `index_col` parameter.  
- We showed how to move dataset files to a custom folder to keep your workspace organized.
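
To see the effect of `index_col` without downloading anything, here is a minimal sketch that reads a tiny in-memory CSV. The CSV text is made up for illustration; only its layout (an unnamed first column holding row numbers) mirrors the wine dataset:

```python
import io
import pandas as pd

# A tiny in-memory CSV whose first (unnamed) column holds the row numbers
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n2,US,90\n"

default_load = pd.read_csv(io.StringIO(csv_text))
indexed_load = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_load.columns.tolist())  # ['Unnamed: 0', 'country', 'points']
print(indexed_load.columns.tolist())  # ['country', 'points']
```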

---

### 🗂️ Other Data File Formats Supported by pandas

Besides CSV, pandas can load data from many other file formats, including:

- **Excel files** (`.xls`, `.xlsx`) with [`pd.read_excel()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)  
- **JSON files** with [`pd.read_json()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)  
- **SQL databases** via [`pd.read_sql()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html)  
- **Parquet files** (`.parquet`) for efficient columnar storage with [`pd.read_parquet()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html)  
- **HTML tables** using [`pd.read_html()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html)  

This flexibility makes pandas a powerful tool for various real-world data workflows.
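
As one example of another format, the following sketch writes a small, made-up DataFrame to a JSON string and reads it back. It assumes the `records` orientation (one JSON object per row) and wraps the string in `io.StringIO` so that `pd.read_json()` treats it as file-like input:

```python
import io
import pandas as pd

small = pd.DataFrame({'country': ['Italy', 'US'], 'points': [87, 90]})

# Serialize to a JSON string (one object per row), then read it back
json_text = small.to_json(orient='records')
restored = pd.read_json(io.StringIO(json_text))

print(json_text)
print(restored)
```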

## 4. Basic Data Inspection & Selection

Before diving deeper into analysis, it's important to understand the structure and shape of your dataset, and how to select the data you want to work with.

### Checking Dataset Shape

You can quickly check the number of rows and columns using the `.shape` attribute:

In [44]:
print(f"Dataset shape: {wine_data.shape}")

Dataset shape: (129971, 13)


### Accessing Columns: Two Common Ways

You can select a column either by:

- `wine_data['province']`  
  - More general and reliable  
  - Works with any column name, including those with spaces or special characters  
  - Preferred if column names might conflict with DataFrame attributes  
  - Returns a Series when given a single column name

- `wine_data.province`  
  - More concise and convenient for simple column names  
  - Only works if the column name is a valid Python identifier (no spaces, special chars, or starting with numbers)  
  - Can conflict with DataFrame methods or attributes  
  - Not recommended for production code


In [50]:
wine_data['province'] # or wine_data.province (see explanation above)

0         Sicily & Sardinia
1                     Douro
2                    Oregon
3                  Michigan
4                    Oregon
                ...        
129966                Mosel
129967               Oregon
129968               Alsace
129969               Alsace
129970               Alsace
Name: province, Length: 129971, dtype: object
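
The difference between the two access styles shows up as soon as a column name collides with a DataFrame method or contains a space. The frame below is hypothetical and exists only to demonstrate this:

```python
import pandas as pd

# Hypothetical frame with two awkward column names
df = pd.DataFrame({'count': [3, 5], 'unit price': [1.5, 2.0]})

# Bracket access always reaches the column
print(df['count'].tolist())       # [3, 5]
print(df['unit price'].tolist())  # [1.5, 2.0]

# Attribute access resolves to the DataFrame.count method, not the column
print(callable(df.count))         # True
```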

To select multiple columns and get a DataFrame:

In [48]:
wine_data[['country', 'province', 'points']]

Unnamed: 0,country,province,points
0,Italy,Sicily & Sardinia,87
1,Portugal,Douro,87
2,US,Oregon,87
3,US,Michigan,87
4,US,Oregon,87
...,...,...,...
129966,Germany,Mosel,90
129967,US,Oregon,90
129968,France,Alsace,90
129969,France,Alsace,90


### Selecting Rows

- **By position with `.iloc[]` (position-based selection):**  
  Select rows and columns by their integer position (starting from 0). Slices exclude the end position, as in standard Python.

- **By label with `.loc[]` (label-based selection):**  
  Select rows by their index labels and columns by their column names. Slices include the end label.

In [54]:
# Select the second row (remember Python uses zero-based indexing)
wine_data.iloc[1]

# Select the first three rows and all columns
wine_data.iloc[:3, :]

# Select all rows, but only specific columns by name
wine_data.loc[:, ['designation', 'taster_name', 'variety']]

Unnamed: 0,designation,taster_name,variety
0,Vulkà Bianco,Kerin O’Keefe,White Blend
1,Avidagos,Roger Voss,Portuguese Red
2,,Paul Gregutt,Pinot Gris
3,Reserve Late Harvest,Alexander Peartree,Riesling
4,Vintner's Reserve Wild Child Block,Paul Gregutt,Pinot Noir
...,...,...,...
129966,Brauneberger Juffer-Sonnenuhr Spätlese,Anna Lee C. Iijima,Riesling
129967,,Paul Gregutt,Pinot Noir
129968,Kritt,Roger Voss,Gewürztraminer
129969,,Roger Voss,Pinot Gris


## 5. Data Exploration

In this section, we'll cover essential techniques to understand and prepare your data, including basic inspection, conditional filtering, handling missing values, and managing data types.

### 5.1 Inspecting Data

Before diving into analysis or modeling, it's crucial to understand the structure and contents of your dataset. Pandas provides several handy methods to quickly get an overview:

- `head(n)`: Displays the first *n* rows (default 5) to get a quick look at the data.

- `info()`: Provides a concise summary of the DataFrame, including data types, non-null counts, and memory usage.

- `columns`: Lists all column names (features) in the dataset, helping you understand what variables are available.

- `dtypes`: Shows the data type of each column, which is essential for understanding how data will be handled and processed.

- `describe()`: Generates summary statistics for numeric columns (mean, std, min, max, percentiles).

- `unique()`: Returns the unique values in a column, which is useful for understanding categorical data or spotting anomalies.

You can also use `describe()` on non-numeric (categorical) columns to get the count, the number of unique values, the most frequent value, and how often that value occurs.

In [57]:
# Display the first 5 rows
wine_data.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [56]:
# Summary info about the dataset
wine_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129971 entries, 0 to 129970
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                129908 non-null  object 
 1   description            129971 non-null  object 
 2   designation            92506 non-null   object 
 3   points                 129971 non-null  int64  
 4   price                  120975 non-null  float64
 5   province               129908 non-null  object 
 6   region_1               108724 non-null  object 
 7   region_2               50511 non-null   object 
 8   taster_name            103727 non-null  object 
 9   taster_twitter_handle  98758 non-null   object 
 10  title                  129971 non-null  object 
 11  variety                129970 non-null  object 
 12  winery                 129971 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 13.9+ MB


Before diving deeper into the data, it’s useful to get a quick list of all the columns (features) available in the dataset:

In [61]:
# List all columns in the dataset
wine_data.columns

Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
       'variety', 'winery'],
      dtype='object')

In [62]:
# Display data types of each column
wine_data.dtypes

country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

In [55]:
# Summary statistics for numeric columns
wine_data.describe()

Unnamed: 0,points,price
count,129971.0,120975.0
mean,88.447138,35.363389
std,3.03973,41.022218
min,80.0,4.0
25%,86.0,17.0
50%,88.0,25.0
75%,91.0,42.0
max,100.0,3300.0


In [58]:
# Summary statistics for a string/categorical column
wine_data.country.describe()

count     129908
unique        43
top           US
freq       54504
Name: country, dtype: object

In [75]:
wine_data.taster_twitter_handle.unique() # explicitly list all unique values in a column

array(['@kerinokeefe', '@vossroger', '@paulgwine', nan, '@wineschach',
       '@vboone', '@mattkettmann', '@wawinereport', '@gordone_cellars',
       '@JoeCz', '@AnneInVino', '@laurbuzz', '@worldwineguys',
       '@suskostrzewa', '@bkfiona', '@winewchristina'], dtype=object)

### 5.2 Conditional Selection and Filtering

Pandas allows you to filter datasets using boolean conditions, helping you focus on relevant subsets of your data.

- Use comparison operators (`==`, `!=`, `>=`, ...) to build conditions, and combine them with `&` (and), `|` (or), and `~` (not). Wrap each condition in parentheses.
- `.isin()` helps filter rows where a column’s value belongs to a list.
- `.isnull()` and `.notnull()` identify missing or present values, which is useful for filtering before handling missing data.

In [63]:
wine_data.country == 'Argentina'  # Boolean series based on country

0         False
1         False
2         False
3         False
4         False
          ...  
129966    False
129967    False
129968    False
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

In [64]:
wine_data[wine_data.country == 'Argentina']  # Filter rows where country is Argentina

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
16,Argentina,"Baked plum, molasses, balsamic vinegar and che...",Felix,87,30.0,Other,Cafayate,,Michael Schachner,@wineschach,Felix Lavaque 2010 Felix Malbec (Cafayate),Malbec,Felix Lavaque
17,Argentina,Raw black-cherry aromas are direct and simple ...,Winemaker Selection,87,13.0,Mendoza Province,Mendoza,,Michael Schachner,@wineschach,Gaucho Andino 2011 Winemaker Selection Malbec ...,Malbec,Gaucho Andino
183,Argentina,With attractive melon and other tropical aroma...,,88,12.0,Other,Salta,,Michael Schachner,@wineschach,Alamos 2007 Torrontés (Salta),Torrontés,Alamos
224,Argentina,Blackberry and road-tar aromas are dark and st...,Lunta,90,22.0,Mendoza Province,Luján de Cuyo,,Michael Schachner,@wineschach,Mendel 2014 Lunta Malbec (Luján de Cuyo),Malbec,Mendel
231,Argentina,"Meaty and rubbery, but that's young Bonarda. T...",,85,10.0,Mendoza Province,Mendoza,,Michael Schachner,@wineschach,Andean Sky 2007 Bonarda (Mendoza),Bonarda,Andean Sky
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129921,Argentina,There is a select group of under-$20 Malbecs f...,La Madras Vineyard,91,18.0,Mendoza Province,Mendoza,,Michael Schachner,@wineschach,Ricardo Santos 2006 La Madras Vineyard Malbec ...,Malbec,Ricardo Santos
129925,Argentina,"A lively, well-made blend of Tempranillo, Malb...",B Crux,91,24.0,Mendoza Province,Uco Valley,,Michael Schachner,@wineschach,O. Fournier 2005 B Crux Red (Uco Valley),Red Blend,O. Fournier
129932,Argentina,Andeluna's top wines tend to be ripe and plump...,Pasionado,91,55.0,Mendoza Province,Uco Valley,,Michael Schachner,@wineschach,Andeluna 2004 Pasionado Red (Uco Valley),Red Blend,Andeluna
129938,Argentina,Compared to the regular 2006 Malbec from Chaka...,Reserve,91,25.0,Mendoza Province,Luján de Cuyo,,Michael Schachner,@wineschach,Chakana 2006 Reserve Malbec (Luján de Cuyo),Malbec,Chakana


In [66]:
wine_data[(wine_data.country == 'Argentina') & (wine_data.price >= 20)]  # Multiple conditions (AND)
wine_data[(wine_data.country == 'Argentina') | (wine_data.price >= 20)]  # Multiple conditions (OR)
wine_data[wine_data.country.isin(['Italy', 'Argentina'])];  # Filter for countries in a list

In [68]:
wine_data[wine_data.price.notnull()]  # Filter non-missing price values
wine_data[wine_data.price.isnull()];   # Filter missing price values
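
The filtering patterns above can be tried on a small, made-up frame. This sketch also shows negation with `~` and how `.loc[]` can combine a boolean mask with a column selection (the data and variable names are illustrative only):

```python
import numpy as np
import pandas as pd

wines = pd.DataFrame({
    'country': ['Argentina', 'Italy', 'US', 'Argentina'],
    'price': [30.0, np.nan, 14.0, 12.0],
    'points': [87, 87, 90, 85],
})

# Each condition needs its own parentheses because & and | bind tighter than comparisons
pricey_argentina = wines[(wines['country'] == 'Argentina') & (wines['price'] >= 20)]

# ~ negates a boolean mask: everything that is NOT from Argentina
not_argentina = wines[~(wines['country'] == 'Argentina')]

# .loc accepts a mask and a column list at the same time
priced = wines.loc[wines['price'].notnull(), ['country', 'price']]

print(len(pricey_argentina), len(not_argentina), len(priced))  # 1 2 3
```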

### 5.3 Handling Missing Values

Missing data is a common challenge in real-world datasets. It's essential to identify and understand the extent of missing values before deciding how to handle them.

#### Handling Missing Values - Overview

Pandas offers straightforward methods to count and summarize missing data:

- `.isnull()` and `.notnull()` identify missing or present data, returning boolean masks.
- `.sum()` applied on these masks counts missing entries per column or row.
- Calculating total and percentage of missing data helps assess dataset quality.

In [72]:
# Count missing values per column
missing_per_column = wine_data.isnull().sum()
print(missing_per_column)
print()
# Calculate total and percentage of missing values in the dataset
total = np.prod(wine_data.shape)  # np.product was removed in NumPy 2.0; use np.prod
total_missing = missing_per_column.sum()
percent_missing = (total_missing / total) * 100
print(f"Total missing values: {total_missing}")
print(f"Percentage missing: {percent_missing:.2f}%")

country                     63
description                  0
designation              37465
points                       0
price                     8996
province                    63
region_1                 21247
region_2                 79460
taster_name              26244
taster_twitter_handle    31213
title                        0
variety                      1
winery                       0
dtype: int64

Total missing values: 204752
Percentage missing: 12.12%


#### Handling Missing Values - Common Methods

- `.fillna()` replaces missing values with specified defaults or strategies.
- `.replace()` fixes specific incorrect or unwanted values in your dataset.
- Other strategies include dropping missing data (`.dropna()`) or imputing values.

In [73]:
# Replace missing values in a specific column with a default value
wine_data['region_1_filled'] = wine_data['region_1'].fillna("Unknown")

# Replace specific incorrect/unwanted values in a column
wine_data['taster_twitter_handle'] = wine_data['taster_twitter_handle'].replace("@paulgwine\xa0", "@paulgwine")

# (Optional) Drop rows with missing values in certain columns
wine_data_dropped = wine_data.dropna(subset=['price', 'points'])
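
A compact way to get the share of missing values per column is to take the mean of the boolean mask. The sketch below uses a made-up frame and also shows that `fillna()` accepts a dict, so each column can get its own default; the choice of median and `"Unknown"` is only an assumption for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [15.0, np.nan, 14.0, np.nan],
    'region': ['Etna', None, 'Douro', 'Mosel'],
    'points': [87, 88, 90, 91],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Fill each column with its own default by passing a dict to fillna
filled = df.fillna({'price': df['price'].median(), 'region': 'Unknown'})
print(filled)
```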

### 5.4 Data Types and Conversions

Understanding and managing data types is crucial for effective analysis and model building. 

Pandas provides useful methods to inspect and convert data types within a DataFrame.

- Use `.dtypes` to view the data type of each column.

- Sometimes you may need to convert columns to appropriate types for performance or compatibility reasons. Common conversions include:

  - `.astype()` to convert a column to a specific type (e.g., integer, float, category).

  - `pd.to_datetime()` to convert columns containing dates or times into pandas datetime objects for easier manipulation.

In [90]:
# Convert 'points' to float (if needed)
wine_data['points'] = wine_data['points'].astype(float)

In [92]:
# Check data types (use wine_data.points.dtype to check a single column)
wine_data.dtypes

country                   object
description               object
designation               object
points                   float64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
region_1_filled           object
dtype: object

You can create or modify columns by simple assignment, and pandas will infer the data type automatically.

In [93]:
# Assign a new column with a constant value
wine_data['checked'] = 'yes'
wine_data['checked'].head()

0    yes
1    yes
2    yes
3    yes
4    yes
Name: checked, dtype: object
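
Two further conversions that often come up are sketched below on a made-up frame: `astype('category')` for repetitive text columns, and `pd.to_numeric()` for numbers that were read in as text. The column contents, including the `'n/a'` entry, are invented for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['US', 'Italy', 'US', 'US'],
    'price': ['15', '14.5', 'n/a', '30'],   # numbers stored as text, one bad entry
})

# 'category' stores each distinct label once, which helps for repetitive text columns
df['country'] = df['country'].astype('category')

# to_numeric with errors='coerce' turns unparseable entries into NaN instead of raising
df['price'] = pd.to_numeric(df['price'], errors='coerce')

print(df.dtypes)
print(df['price'].isnull().sum())  # 1
```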

#### Parsing Dates

Dates can come in many formats, for example:

- `7/13/07` with format `"%m/%d/%y"`

- `13-7-2007` with format `"%d-%m-%Y"`

You can convert date columns to datetime using:

`pd.to_datetime(data_column, format="%m/%d/%y")`

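Specifying the format makes the conversion faster and unambiguous. If a column mixes several formats, you can let pandas infer the format of each value individually instead, which is generally slower:

`pd.to_datetime(data_column, format="mixed")`

Note that `format="mixed"` requires pandas 2.0 or later; the older `infer_datetime_format=True` argument is deprecated since pandas 2.0.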

Once converted to datetime, you can easily extract components like day, month, or year using the `.dt` accessor.

In [None]:
# Example: converting a string column to datetime
sample_dates = pd.Series(['2022-01-01', '2022-02-15', '2022-03-30'])
sample_dates_dt = pd.to_datetime(sample_dates)
print(sample_dates_dt)
print()
# Now you can easily extract the year, month, or day
print(sample_dates_dt.dt.year)
print(sample_dates_dt.dt.month)
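
The next sketch applies an explicit format to dates written like `7/13/07`, as discussed above. The `errors='coerce'` argument is an addition for illustration: it converts entries that do not match the format into `NaT` instead of raising an error. The sample values are made up:

```python
import pandas as pd

raw = pd.Series(['7/13/07', '12/25/07', 'not a date'])

# An explicit format avoids ambiguity; errors='coerce' maps unparseable entries to NaT
parsed = pd.to_datetime(raw, format='%m/%d/%y', errors='coerce')

print(parsed)
print(parsed.dt.year.tolist())
print(parsed.dt.day_name().tolist())
```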

## 📊 6. Statistical Summaries and Correlation

Before performing deeper analysis or building models, it's helpful to understand the statistical properties and relationships within your data.

Pandas offers a range of built-in functions to quickly compute descriptive statistics and examine correlations between numerical features. These tools allow you to identify trends, spot outliers, and better understand how variables relate to one another.

In this section, we’ll explore:

- Basic statistical methods (`mean()`, `min()`, `max()`, etc.)
- Pairwise correlations between numeric columns using `.corr()`
- Sorting correlation values to find the strongest relationships
- Frequency counts for categorical data (`.value_counts()`)

In [98]:
# statistic functions examples

wine_data.points.mean()
wine_data.points.min()
wine_data.points.max();

In [101]:
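# The .corr() function in pandas computes the pairwise correlation coefficients between numeric columns in a DataFrame.
# Notes:
# 1) Only numeric columns can be correlated; since pandas 2.0, non-numeric columns
#    must be excluded explicitly (numeric_only=True, available from pandas 1.5).
# 2) Uses Pearson correlation by default.
# 3) Pearson correlation measures linear association, so nonlinear relationships can be missed.

wine_data.corr(numeric_only=True)

Unnamed: 0,points,price
points,1.000000,0.416167
price,0.416167,1.000000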

In [102]:
# If using pandas >= 1.5.0, you can use numeric_only=True
# corr_matrix = wine_data.corr(numeric_only=True)

# Alternative method for older versions: select only numeric columns first
corr_matrix = wine_data.select_dtypes(include='number').corr()

# Sort correlation values with respect to 'price'
corr_matrix['price'].sort_values(ascending=False)

price     1.000000
points    0.416167
Name: price, dtype: float64
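
To make the limitation of Pearson correlation concrete, here is a small synthetic example (not taken from the wine data). The column `y` is completely determined by `x`, yet their Pearson correlation is zero because the relationship is a symmetric parabola rather than a line:

```python
import pandas as pd

# y depends perfectly on x, but the relationship is not linear
df = pd.DataFrame({'x': [-2, -1, 0, 1, 2]})
df['y'] = df['x'] ** 2        # symmetric parabola
df['z'] = 3 * df['x'] + 1     # exact linear relationship

corr = df.corr()              # Pearson by default
print(corr.round(3))
```

In the printed matrix, `x` and `z` correlate perfectly while `x` and `y` do not correlate at all, so a low coefficient should not be read as "no relationship".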

### 🧮 Value Counts for Categorical Columns

You can use `.value_counts()` to see how often each unique value appears in a column. This is especially useful for categorical features.

📌 **Note:** `value_counts()` ignores missing values by default.  
Use `dropna=False` if you want to include them: `wine_data.taster_name.value_counts(dropna=False)`

In [105]:
# Count how many times each unique wine taster appears in the dataset
wine_data.taster_name.value_counts() 

Roger Voss            25514
Michael Schachner     15134
Kerin O’Keefe         10776
Virginie Boone         9537
Paul Gregutt           9532
Matt Kettmann          6332
Joe Czerwinski         5147
Sean P. Sullivan       4966
Anna Lee C. Iijima     4415
Jim Gordon             4177
Anne Krebiehl MW       3685
Lauren Buzzeo          1835
Susan Kostrzewa        1085
Mike DeSimone           514
Jeff Jenssen            491
Alexander Peartree      415
Carrie Dykes            139
Fiona Adams              27
Christina Pickard         6
Name: taster_name, dtype: int64
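
Besides raw counts, `value_counts()` can return relative frequencies with `normalize=True`. The sketch below uses a short, made-up Series (two taster names borrowed from the output above plus one missing value) to compare the variants:

```python
import numpy as np
import pandas as pd

tasters = pd.Series(
    ['Roger Voss', 'Paul Gregutt', 'Roger Voss', np.nan],
    name='taster_name',
)

counts = tasters.value_counts()                    # NaN excluded
shares = tasters.value_counts(normalize=True)      # fractions of the non-missing entries
with_missing = tasters.value_counts(dropna=False)  # NaN counted as its own entry

print(counts)
print(shares)
print(with_missing)
```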

## 7. Grouping, Aggregation & Sorting

### Grouping and Aggregation

Grouping allows you to split your DataFrame into subsets based on column values and then apply aggregation functions (like count, sum, mean) to each group. This is useful for summarizing and analyzing data by categories.

For example, you can reproduce the counts from `value_counts()` with `groupby()`. Note that the result is ordered by the group key rather than by frequency:

In [107]:
wine_data.groupby('points').points.count() 
# This groups the data by points and counts the number of entries in each group.

points
80.0       397
81.0       692
82.0      1836
83.0      3025
84.0      6480
85.0      9530
86.0     12600
87.0     16933
88.0     17207
89.0     12226
90.0     15410
91.0     11359
92.0      9613
93.0      6489
94.0      3758
95.0      1535
96.0       523
97.0       229
98.0        77
99.0        33
100.0       19
Name: points, dtype: int64

You can also calculate summary statistics within groups. For example:

In [108]:
# Average price for each points rating
wine_data.groupby('points').price.mean()

points
80.0      16.372152
81.0      17.182353
82.0      18.870767
83.0      18.237353
84.0      19.310215
85.0      19.949562
86.0      22.133759
87.0      24.901884
88.0      28.687523
89.0      32.169640
90.0      36.906622
91.0      43.224252
92.0      51.037763
93.0      63.112216
94.0      81.436938
95.0     109.235420
96.0     159.292531
97.0     207.173913
98.0     245.492754
99.0     284.214286
100.0    485.947368
Name: price, dtype: float64

Or aggregate multiple statistics at once using `.agg()`:

In [109]:
# Row count (including rows with a missing price), minimum, and maximum price per country
wine_data.groupby('country').price.agg([len, 'min', 'max'])

Unnamed: 0_level_0,len,min,max
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Argentina,3800.0,4.0,230.0
Armenia,2.0,14.0,15.0
Australia,2329.0,5.0,850.0
Austria,3345.0,7.0,1100.0
Bosnia and Herzegovina,2.0,12.0,13.0
Brazil,52.0,10.0,60.0
Bulgaria,141.0,8.0,100.0
Canada,257.0,12.0,120.0
Chile,4472.0,5.0,400.0
China,1.0,18.0,18.0


These methods provide flexible ways to explore your data grouped by categorical variables.
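
When you want control over the names of the result columns, named aggregation is one option. The sketch below runs on a made-up frame; the output column names (`n_rows`, `n_priced`, `mean_price`) are hypothetical and chosen only for this example. It also shows the difference between `'size'` (all rows) and `'count'` (non-missing values only):

```python
import numpy as np
import pandas as pd

wines = pd.DataFrame({
    'country': ['US', 'US', 'Italy', 'Italy', 'Italy'],
    'price': [14.0, 65.0, np.nan, 20.0, 30.0],
})

summary = wines.groupby('country').agg(
    n_rows=('price', 'size'),        # rows per group, missing prices included
    n_priced=('price', 'count'),     # rows with a non-missing price
    mean_price=('price', 'mean'),    # NaN values are skipped
)
print(summary)
```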

### Sorting Data

You can sort your DataFrame or grouped results by column values using `sort_values()`:

In [114]:
# Sort the dataset by price in ascending order
wine_data.sort_values(by='price')

# Sort in descending order
wine_data.sort_values(by='price', ascending=False)

# For plain Python lists, use the built-in list.sort() method, which sorts in place
my_list = [3, 1, 2]
my_list.sort()
my_list

[1, 2, 3]
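
`sort_values()` also accepts several columns, each with its own direction. The following sketch uses a made-up frame to rank by points (highest first) and break ties by price (lowest first), then restores the original order with `sort_index()`:

```python
import pandas as pd

wines = pd.DataFrame({
    'country': ['US', 'Italy', 'US', 'Italy'],
    'points': [90, 87, 90, 91],
    'price': [65.0, 15.0, 14.0, 30.0],
})

# Highest points first; ties broken by the lowest price
ranked = wines.sort_values(by=['points', 'price'], ascending=[False, True])
print(ranked)

# sort_values returns a new DataFrame; sort_index brings back the original row order
restored = ranked.sort_index()
print(restored)
```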

## 🧹 8. Data Cleaning and Transformation 

Cleaning and transforming data are essential steps before analysis or modeling. This section covers key tasks to prepare your dataset:

- Handling missing values and imputation

- Checking for and removing duplicates

- Using mapping methods (`map`, `replace`, `apply`) for data transformations

- Renaming columns and index labels for clarity and consistency

### 8.1 Handling Missing Values and Imputation

Handling missing values is a crucial part of cleaning data before analysis or modeling.

You can begin by inspecting how many values are missing per column:

In [116]:
wine_data.isnull().sum()

country                     63
description                  0
designation              37465
points                       0
price                     8996
province                    63
region_1                 21247
region_2                 79460
taster_name              26244
taster_twitter_handle    31213
title                        0
variety                      1
winery                       0
region_1_filled              0
checked                      0
dtype: int64

**Dropping Missing Data**

Use `.dropna()` to remove rows or columns with missing values:

In [118]:
wine_data_copy = wine_data.copy()

# Drop rows with any missing values
wine_data_copy.dropna()

# Drop columns with any missing values
wine_data_copy.dropna(axis=1);

This approach is only recommended when the amount of missing data is small, or when the column/row isn't essential.
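When dropping everything with a missing value is too aggressive, `dropna()` can be restricted with `subset` (only look at certain columns) or `thresh` (require a minimum number of non-missing values). A minimal sketch on made-up data, where the column names only mimic the wine dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in every column
df = pd.DataFrame({
    "country": ["Italy", None, "France", "Spain"],
    "price": [20.0, 15.0, np.nan, np.nan],
    "region_2": [np.nan, np.nan, np.nan, "x"],
})

# Drop a row only when 'price' is missing; gaps in other columns are tolerated
priced = df.dropna(subset=["price"])

# Keep only the columns that have at least 2 non-missing values
dense_cols = df.dropna(axis=1, thresh=2)

print(priced)
print(dense_cols.columns.tolist())
```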

**Imputing Missing Data** with `fillna()`

You can fill missing values with a constant or use forward/backward filling:

In [121]:
wine_data_copy.fillna(0)   # Replace all NaNs with 0
wine_data_copy.ffill()     # Forward fill (fillna(method='ffill') is deprecated since pandas 2.1)
wine_data_copy.bfill();    # Backward fill
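A single constant rarely suits every column. `fillna()` also accepts a dictionary that maps column names to fill values, which allows a different choice per column. The sketch below uses made-up data; the `"Unknown"` placeholder and the use of the median are illustrative assumptions, not recommendations for the wine dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "country": ["Italy", None, "France"],
    "price": [10.0, np.nan, 30.0],
})

# Fill each column with its own value: a label for text, the median for numbers
filled = df.fillna({
    "country": "Unknown",
    "price": df["price"].median(),
})

print(filled)
```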

**Statistical Imputation** with `SimpleImputer` (scikit-learn)

For numeric columns, a more robust strategy is to impute missing values with the column mean or median:

In [124]:
from sklearn.impute import SimpleImputer

# Select only numeric columns for imputation 
numeric_data = wine_data.select_dtypes(include=['number'])

# Create imputer with default strategy='mean'
imputer = SimpleImputer()

# Fit and transform the numeric data
imputed_numeric_data = pd.DataFrame(
    imputer.fit_transform(numeric_data),
    columns=numeric_data.columns,
    index=numeric_data.index,  # keep the original row labels
)

📌 Notes:

- The `'mean'` and `'median'` strategies of `SimpleImputer()` only work on numeric data.

- If you want to impute categorical columns, you can use a different strategy (e.g., 'most_frequent' or 'constant') and handle those columns separately.

💡 Tip:
- `dropna()` returns a new DataFrame by default, but working on a copy still protects the original if you later use `inplace=True` or assign results back to the same variable.

### 8.2 Checking for and Removing Duplicates

Duplicate rows can distort analysis and lead to misleading results. It's important to detect and handle them early in your data cleaning process.

#### 🔍 Check for duplicates

Use `.duplicated()` to identify duplicate rows:

In [125]:
wine_data.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
129966    False
129967    False
129968    False
129969    False
129970    False
Length: 129971, dtype: bool

In [126]:
# To see how many duplicates exist:

wine_data.duplicated().sum()

9983

✅ Check if a column has only unique values

Useful for validating columns that should ideally have unique entries (e.g., IDs, user handles).

You can test whether a specific column contains duplicates with:

In [130]:
wine_data['taster_name'].is_unique

False

This returns ✅ `True` if all entries in the `taster_name` column are unique, and ❌ `False` otherwise.

🧹 **Remove duplicates**

To remove duplicated rows, keeping only the first occurrence:

In [None]:
wine_data_no_duplicates = wine_data.drop_duplicates()

To remove duplicates based on specific columns:

In [128]:
wine_data_no_duplicates_subset = wine_data.drop_duplicates(subset=['taster_name', 'country'])

📌 Note:

- By default, `.duplicated()` and `.drop_duplicates()` consider all columns.
- Use `keep='last'` or `keep=False` in `.drop_duplicates()` to change which duplicates are retained.

In [129]:
# Keep the last occurrence instead of the first
wine_data_last = wine_data.drop_duplicates(keep='last')

# Drop all duplicates (keep none)
wine_data_drop_all = wine_data.drop_duplicates(keep=False)

💡 Tip:
- Always check how many rows you're dropping by comparing the shape before and after deduplication.
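The before/after check from the tip can be sketched as follows. The data is made up, and `keep=False` inside `duplicated()` is used to list every copy of a duplicated row for inspection before anything is dropped:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are exact duplicates
df = pd.DataFrame({
    "taster": ["A", "A", "B", "A"],
    "country": ["US", "US", "US", "IT"],
})

# Inspect every copy of a duplicated row before deciding what to drop
all_copies = df[df.duplicated(keep=False)]

# Drop duplicates and report how many rows were removed
rows_before = len(df)
deduped = df.drop_duplicates()
rows_dropped = rows_before - len(deduped)

# keep=False removes every row that has a duplicate, including the first copy
none_kept = df.drop_duplicates(keep=False)

print(f"Dropped {rows_dropped} of {rows_before} rows")
```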

### 8.3 Using Mapping Methods (map, replace, apply) for Data Transformations

Pandas provides several powerful methods for transforming and cleaning data at the column level:

🔁 `map()`:

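- Used for element-wise transformation.
- `Series.map()` works on a single Series; for element-wise work on a whole DataFrame, pandas 2.1+ offers `DataFrame.map()` (previously `applymap()`).
- Accepts a dictionary or a function. With a dictionary, any value that is not one of its keys becomes `NaN`.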

In [131]:
# Example: map values in a Series (non-destructive); countries missing from the dictionary become NaN
country_mapped = wine_data['country'].map({'Italy': 'IT', 'France': 'FR'})

In [138]:
# Example: subtracting the mean price from each price value
wine_data_price_mean = wine_data.price.mean()
demeaned_price = wine_data.price.map(lambda p: p - wine_data_price_mean)

📌 `map()` returns a new Series with transformed values, leaving the original data untouched.

🔄 `replace()`:

- Similar to `map()`, but values that are not listed are left unchanged instead of becoming `NaN`.
- Can operate on DataFrames or Series.
- Supports replacing one value or many at once.

In [134]:
# Example: replace values in a Series (non-destructive)
country_replaced = wine_data['country'].replace({'US': 'United States', 'UK': 'United Kingdom'})
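The practical difference between `map()` and `replace()` shows up with values that are not in the dictionary. A small sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical toy Series; 'Spain' is deliberately absent from the dictionary
country = pd.Series(["Italy", "France", "Spain"])
codes = {"Italy": "IT", "France": "FR"}

# map(): unmatched values become NaN
mapped = country.map(codes)

# map() with a fallback: fill the NaNs with the original values
mapped_with_fallback = country.map(codes).fillna(country)

# replace(): unmatched values are kept as they are
replaced = country.replace(codes)

print(mapped.tolist())
print(replaced.tolist())
```

Choosing between the two is mostly a question of whether unmatched values should be flagged (`map()`) or passed through (`replace()`).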

🛠️ `apply()`:

- General-purpose transformation method.
- Can be used on entire DataFrames (with axis=1) or on Series.
- Allows you to apply a function to rows or columns.

In [136]:
# Example: apply a function to a column (non-destructive)
points_plus_five = wine_data['points'].apply(lambda x: x + 5)

In [139]:
# Apply a custom function row-wise to re-center prices

wine_data_price_mean = wine_data.price.mean()

def remean_price(row):
    row.price = row.price - wine_data_price_mean
    return row

wine_data_remeaned = wine_data.apply(remean_price, axis='columns')

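⚠️ Unlike `map()`, `apply()` gives access to the full row or column rather than a single value, which makes it suitable for more complex transformations. Row-wise `apply()` calls a Python function once per row, so it can be slow on large DataFrames.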
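A row-wise function can also return a single value per row, which yields a Series that can serve as a derived column. The sketch below uses made-up data and a hypothetical helper, `points_per_dollar`, and compares the result with the equivalent column arithmetic:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"points": [90, 85], "price": [30.0, 10.0]})

def points_per_dollar(row):
    """Combine two columns of one row into a single value."""
    return row["points"] / row["price"]

# Row-wise apply: the function is called once per row
ratio_apply = df.apply(points_per_dollar, axis=1)

# The same result with vectorized column arithmetic
ratio_vectorized = df["points"] / df["price"]

print(ratio_apply.tolist())
```

When a vectorized expression exists, as it does here, it is usually the better choice; `apply()` is most useful when the per-row logic cannot be written as column operations.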

⚡ Bonus: Pandas supports simple vectorized math too

In [140]:
# Re-centering points using direct subtraction
wine_data_points_mean = wine_data.points.mean()
demeaned_points = wine_data.points - wine_data_points_mean

✅ This is often the fastest and cleanest way to transform numeric columns in pandas.

📌 These examples do not modify the original dataset. You can assign the results to a new variable or column if needed.

### 8.4 Renaming Columns and Index Labels and Cleaning Inconsistent Entries

Renaming columns or index labels improves code readability and ensures consistent naming throughout your workflow. Additionally, cleaning inconsistent text entries within columns helps maintain data quality and accuracy.

🏷️ Renaming Columns

Use `.rename()` with the `columns` parameter and a dictionary mapping old names to new ones:

In [142]:
# Rename the 'points' column to 'score'
wine_data_renamed = wine_data.rename(columns={'points': 'score'})

🔹 This returns a new DataFrame by default; the original remains unchanged unless you set `inplace=True`.

You can rename multiple columns at once:

In [144]:
# Rename multiple columns at once
wine_data_renamed = wine_data.rename(columns={
    'points': 'score',
    'price': 'cost'
})

🧾 **Renaming Index Labels**

To rename row index labels (not commonly needed unless working with custom indexes):

In [145]:
wine_data_renamed_index = wine_data.rename(index={0: 'first_row', 1: 'second_row'})

✏️ **Rename All Columns with a Function**

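To rename all columns systematically (e.g., lowercase them and replace spaces with underscores), assign a transformed version of `columns`. This is handy for cleaning up messy column names after loading data. Unlike `.rename()`, this assignment changes `wine_data` itself: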

In [146]:
wine_data.columns = wine_data.columns.str.lower().str.replace(' ', '_')
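If the original DataFrame should stay untouched, `.rename()` also accepts a function for `columns` and returns a new DataFrame. A minimal sketch with made-up, deliberately messy column names:

```python
import pandas as pd

# Hypothetical DataFrame with inconsistent column names
df = pd.DataFrame(columns=[" Country ", "Taster Name", "PRICE"])

# The function is applied to every column name; df itself is not modified
cleaned = df.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))

print(cleaned.columns.tolist())
```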

🧹 **Cleaning Inconsistent Text Entries**

Text data often contains inconsistencies like different capitalizations or trailing spaces. Standardizing these entries improves data quality:

In [147]:
wine_data_copy = wine_data.copy()

# Convert all entries in 'country' column to lowercase
wine_data_copy['country'] = wine_data_copy['country'].str.lower()

# Remove trailing and leading whitespace
wine_data_copy['country'] = wine_data_copy['country'].str.strip()

To detect and fix near-duplicate or misspelled entries, you can use fuzzy string matching with the `fuzzywuzzy` library:

In [149]:
import fuzzywuzzy
from fuzzywuzzy import fuzz, process

# Example: find the 5 closest matches to "Italy" among the distinct, non-missing country names
countries = wine_data['country'].dropna().unique()
matches = process.extract("Italy", countries, limit=5, scorer=fuzz.token_sort_ratio)
print(matches)

[('Italy', 100), ('Israel', 55), ('Portugal', 46), ('Australia', 43), ('Chile', 40)]


You can automate corrections by defining a function that replaces similar entries with a consistent target string:

In [150]:
def replace_matches_in_column(df, column, target_string, min_ratio=80):
    strings = df[column].unique()
    matches = fuzzywuzzy.process.extract(target_string, strings, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]
    df.loc[df[column].isin(close_matches), column] = target_string
    print(f"Replaced {len(close_matches)} entries with '{target_string}'")

# Usage example
replace_matches_in_column(wine_data_copy, 'country', 'italy')

Replaced 1 entries with 'italy'


## 🔧 9. Data Preprocessing and Feature Engineering

Before feeding your dataset into a machine learning model or performing deeper analysis, it’s important to transform your raw data into a cleaner and more informative structure.

This section covers common **preprocessing** and **feature engineering** techniques, including:

- Extracting components from datetime columns  
- Handling categorical variables  
- Scaling and normalizing numerical features 
- Creating categorical bins from continuous data (feature binning)

These techniques help improve model performance and make your data easier to work with.

### 9.1 Date Feature Extraction

If you’ve already converted your date columns to datetime format (see Section 5.4), you can now extract valuable components from them using the `.dt` accessor:

In [None]:
df['date'].dt.year       # Extract the year
df['date'].dt.month      # Extract the month
df['date'].dt.dayofweek  # Extract day of the week (0 = Monday)

These features are useful when building models that may depend on seasonal, monthly, or weekday-based patterns.
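Since `df['date']` above is a generic placeholder, here is a self-contained sketch with two made-up dates. The `is_weekend` column is an illustrative extra, not part of the original example:

```python
import pandas as pd

# Hypothetical toy data: dates stored as strings
df = pd.DataFrame({"date": ["2024-01-15", "2024-07-06"]})

# Convert to datetime first; the .dt accessor only works on datetime columns
df["date"] = pd.to_datetime(df["date"])

df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["dayofweek"] = df["date"].dt.dayofweek   # 0 = Monday, 6 = Sunday
df["is_weekend"] = df["dayofweek"] >= 5     # Saturday or Sunday

print(df)
```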

### ✅ 9.2 Handling Categorical Variables

Categorical variables represent values that fall into distinct categories, like `"country"`, `"taster_name"`, or `"wine_style"`. Most machine learning models require these to be encoded numerically.

In [156]:
# Get the list of categorical (object-dtype) columns in our dataset
wine_data_copy = wine_data.copy()
category = (wine_data_copy.dtypes == 'object')
cat_cols = list(category[category].index)
cat_cols

['country',
 'description',
 'designation',
 'province',
 'region_1',
 'region_2',
 'taster_name',
 'taster_twitter_handle',
 'title',
 'variety',
 'winery',
 'region_1_filled',
 'checked']

There are three main approaches to dealing with categorical variables:

1️⃣ **Drop Categorical Variables**

Only do this if you're sure the columns carry no useful predictive information.

In [152]:
# Drop all object (i.e. categorical) columns
wine_data_numeric = wine_data.select_dtypes(exclude=['object'])

2️⃣ **Ordinal Encoding**

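Use when the categories have a natural order (e.g., "Low", "Medium", "High"). Variables with such a clear ranking are called ordinal variables.

By default, `OrdinalEncoder` assigns integers in sorted (alphabetical) order, which may not match the real ranking; pass `categories=` to set the order explicitly.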

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Copy dataset and get list of categorical columns
wine_data_copy = wine_data.copy()
cat_cols = wine_data_copy.select_dtypes(include=['object']).columns

# Apply ordinal encoding
ordinal_encoder = OrdinalEncoder()
wine_data_copy[cat_cols] = ordinal_encoder.fit_transform(wine_data_copy[cat_cols])
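For a single column whose ranking is known, the order can also be stated explicitly in pandas itself. The sketch below uses a made-up `quality` Series (the wine dataset has no such column) and `pd.Categorical` with `ordered=True`; it is an alternative to `OrdinalEncoder`, not what the cell above does:

```python
import pandas as pd

# Hypothetical ordinal feature
quality = pd.Series(["Low", "High", "Medium", "Low"])

# State the ranking explicitly, from lowest to highest
order = ["Low", "Medium", "High"]
quality_cat = pd.Categorical(quality, categories=order, ordered=True)

# .codes gives the integer position of each value in `order`
encoded = quality_cat.codes

print(encoded.tolist())
```

Alphabetical ordering would have placed "High" before "Low", which is why the explicit `categories` list matters here.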

3️⃣ **One-Hot Encoding**

Use when the categories are nominal, i.e., they have no intrinsic ranking. Each category becomes a separate column with 0/1 values, so no ordering is implied.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Encode only low-cardinality columns: free-text columns such as 'description'
# or 'title' would produce one new column per unique value
low_card_cols = [col for col in cat_cols if wine_data_copy[col].nunique() < 50]

# `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols = pd.DataFrame(
    OH_encoder.fit_transform(wine_data_copy[low_card_cols]),
    columns=OH_encoder.get_feature_names_out(low_card_cols),  # readable string names
    index=wine_data_copy.index,                               # restore index
)

# Drop original categorical columns and join encoded ones
num_data = wine_data_copy.drop(cat_cols, axis=1)
OH_data = pd.concat([num_data, OH_cols], axis=1)

💡 Tip: You can also one-hot encode a single column quickly with `pd.get_dummies()`:

In [155]:
pd.get_dummies(wine_data_copy['country'], drop_first=True).head()

Unnamed: 0,Armenia,Australia,Austria,Bosnia and Herzegovina,Brazil,Bulgaria,Canada,Chile,China,Croatia,...,Serbia,Slovakia,Slovenia,South Africa,Spain,Switzerland,Turkey,US,Ukraine,Uruguay
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
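`pd.get_dummies()` can also encode selected columns of a whole DataFrame while leaving the others in place. A minimal sketch on made-up data; `dtype=int` is passed so the output uses 0/1 integers, since recent pandas versions return booleans by default:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"country": ["US", "Italy", "US"], "points": [90, 88, 92]})

# Encode only 'country'; 'points' is carried over unchanged
encoded = pd.get_dummies(df, columns=["country"], prefix="country", dtype=int)

print(encoded)
```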


### ⚖️ 9.3 Scaling and Normalization

Scaling and normalization are essential data preprocessing steps, especially before feeding data into many machine learning models. They ensure that features are on comparable scales, preventing models from being biased toward variables with larger values.

🤔 What's the Difference?

- Scaling: changes the range of your data (e.g., shrinking everything between 0 and 1).
- Normalization: changes the shape of the distribution of your data (e.g., making it look more like a normal distribution).

🧠 When Should You Use Each?

- Use scaling when your model depends on distances between data points (like K-Nearest Neighbors or SVM) or is trained with gradient descent. Scaling puts all features on an equal footing.

- Use normalization when your algorithm assumes normally distributed data (e.g., Linear Discriminant Analysis, Gaussian Naive Bayes).
    It’s a more radical transformation and may significantly reshape the feature distribution.

🔢 Why Scale or Normalize?

- Many algorithms are sensitive to feature magnitudes, such as:
    - K-Nearest Neighbors
    - Logistic Regression
    - Support Vector Machines
    - Gradient Descent–based models
- Helps speed up convergence in optimization.

⚙️ Common Scalers in `scikit-learn`

📏 **MinMaxScaler**

Scales features to a specific range (default [0, 1]).

✅ Preserves shape of original distribution

⚠️ Sensitive to outliers

In [161]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(wine_data[['price', 'points']])
scaled_df = pd.DataFrame(scaled_data, columns=['price', 'points'])

📊 **StandardScaler**

Standardizes each feature to mean = 0 and standard deviation = 1.

✅ Works well for features with Gaussian-like distributions

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(wine_data[['price', 'points']])

📌 Note: `fit_transform()` returns a NumPy array. If you want a DataFrame back:

In [158]:
scaled_df = pd.DataFrame(scaled_data, columns=['price', 'points'])
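To see what these two scalers compute, the same formulas can be written directly in pandas. This is a sketch on three made-up prices, not a replacement for the scikit-learn classes. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample version (`ddof=1`).

```python
import pandas as pd

# Hypothetical toy prices
prices = pd.Series([10.0, 20.0, 40.0])

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1]
min_max = (prices - prices.min()) / (prices.max() - prices.min())

# Standardization: (x - mean) / std, using the population std like StandardScaler
z_scores = (prices - prices.mean()) / prices.std(ddof=0)

print(min_max.round(3).tolist())
print(z_scores.round(3).tolist())
```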

🧮 **MaxAbsScaler**

Scales by dividing by the maximum absolute value of each feature, resulting in values between -1 and 1.

✅ Ideal for sparse data
✅ Keeps 0s intact

In [None]:
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
scaled_data = scaler.fit_transform(wine_data[['price', 'points']])

🔁 **Normalizer (L2 Norm)**

Rescales each row (not each column) so that it has unit norm.

Useful when the direction of the data matters more than magnitude (e.g., text classification).

In [None]:
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
# Normalizer does not accept NaN, so drop rows with missing values first
normalized_data = normalizer.fit_transform(wine_data[['price', 'points']].dropna())

📦 **Box-Cox Transformation**

Transforms a skewed, strictly positive variable so that its distribution is closer to normal.

Useful for normalization before linear models.

In [None]:
from scipy import stats

# Apply Box-Cox to a single column (must be strictly positive)
boxcox_transformed, fitted_lambda = stats.boxcox(wine_data['price'].dropna())
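For reference, the Box-Cox transform for a fixed λ is `(x**λ - 1) / λ`, with the natural log as the λ = 0 case. The helper below, `box_cox`, is a hypothetical NumPy sketch of that formula; when no λ is passed, `stats.boxcox()` additionally estimates λ by maximum likelihood, which this sketch does not do:

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)              # limiting case as lambda approaches 0
    return (x ** lam - 1.0) / lam

# Hypothetical toy prices
prices = np.array([4.0, 9.0, 16.0])

log_case = box_cox(prices, 0)         # same as np.log(prices)
sqrt_case = box_cox(prices, 0.5)      # equals 2 * (sqrt(x) - 1)

print(sqrt_case.tolist())
```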

### 🔍 Summary Table of Scalers

| Scaler            | Output Range                  | Preserves Distribution Shape | Notes                                            |
|-------------------|-------------------------------|------------------------------|--------------------------------------------------|
| `MinMaxScaler`    | [0, 1]                        | ✅ Yes                       | Sensitive to outliers                            |
| `StandardScaler`  | Unbounded (mean = 0, std = 1) | ✅ Yes                       | Most commonly used; suits Gaussian-like features |
| `MaxAbsScaler`    | [-1, 1]                       | ✅ Yes                       | Good for sparse data, keeps zeros                |
| `Normalizer`      | Unit norm (per row)           | ❌ No                        | Use when direction matters (e.g. text)           |
| `stats.boxcox()`  | ~Gaussian                     | ❌ No                        | Normalization for positive values only           |

### 9.4 Feature Binning with `pd.cut()`

Sometimes it’s useful to convert a continuous numeric feature into categorical bins or ranges, which can simplify models or support grouped summaries.

Pandas provides the `cut()` function to split a numeric column into discrete intervals with optional labels. By default each interval includes its right edge but not its left, so `bins = [75, 85, 95, 100]` produces (75, 85], (85, 95], and (95, 100].

Example:

In [169]:
wine_data_copy = wine_data.copy()

# Define bins and labels for the 'points' column
bins = [75., 85., 95., 100.]
labels = ['tier 3', 'tier 2', 'tier 1']

# Create a new categorical column based on the bins
wine_data_copy['points_range'] = pd.cut(wine_data_copy.points, bins=bins, labels=labels)

# Count how many entries fall into each bin
wine_data_copy['points_range'].value_counts()

tier 2    107130
tier 3     21960
tier 1       881
Name: points_range, dtype: int64
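One consequence of `pd.cut()` using right-closed intervals by default is that a value equal to the lowest bin edge falls outside every bin and becomes `NaN`. The sketch below uses made-up scores and bin edges (starting at 80, unlike the example above) to show the effect and the `include_lowest=True` fix:

```python
import pandas as pd

# Hypothetical toy scores; 80 sits exactly on the lowest bin edge
points = pd.Series([80, 85, 86, 95, 100])
bins = [80, 85, 95, 100]
labels = ["tier 3", "tier 2", "tier 1"]

# Default: intervals are (80, 85], (85, 95], (95, 100], so 80 is not assigned
default_bins = pd.cut(points, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge
inclusive_bins = pd.cut(points, bins=bins, labels=labels, include_lowest=True)

print(default_bins.tolist())
print(inclusive_bins.tolist())
```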

## 📚 10. Combining Datasets

Combining datasets is a fundamental task in data analysis. Whether you're merging additional features, appending new records, or joining datasets from different sources, pandas provides powerful tools to handle this efficiently.

This section will cover:
- Concatenation (`pd.concat`)
- Joining based on index or columns (`.join`)
- Merging on keys (`pd.merge`)
- Dealing with overlapping columns or mismatched indexes

| Method               | Purpose                            | Axis / Direction                  | Notes / Usage Highlights |
|----------------------|------------------------------------|-----------------------------------|--------------------------|
| `pd.concat()`        | Concatenate DataFrames             | axis=0 (rows) or axis=1 (columns) | Stacks DataFrames vertically or horizontally; indices preserved unless `ignore_index=True` |
| `df.join()`          | Join DataFrames by index or key    | N/A                               | Convenient for joining on index or key; similar to merge but simpler syntax for index joins |
| `pd.merge()`         | Merge DataFrames based on keys     | N/A                               | Database-style join; supports inner, outer, left, right joins; merges on columns or indices |
| `DataFrame.append()` | Append rows to a DataFrame (removed) | axis=0 (rows)                   | Deprecated in pandas 1.4 and removed in 2.0; use `pd.concat()` instead |


**Concatenating DataFrames**

Use `pd.concat()` to stack DataFrames one after another (adding rows):

In [166]:
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})

result = pd.concat([df1, df2], ignore_index=True)
# ignore_index=True resets the index in the resulting DataFrame after concatenation

print(result)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40


**Joining DataFrames by Index**

Use `.join()` to join DataFrames side-by-side based on their index.

In [167]:
df1 = pd.DataFrame({'Name': ['Alice', 'Bob']}, index=[0, 1])
df2 = pd.DataFrame({'Age': [25, 30]}, index=[0, 1])

result = df1.join(df2)

print(result)

    Name  Age
0  Alice   25
1    Bob   30


**Merging DataFrames on a Key Column**

Use `pd.merge()` to combine DataFrames based on common columns (keys).

In [168]:
df1 = pd.DataFrame({
    'ID': [1, 2],
    'Name': ['Alice', 'Bob']
})

df2 = pd.DataFrame({
    'ID': [1, 2],
    'Age': [25, 30]
})

result = pd.merge(df1, df2, on='ID')

print(result)

   ID   Name  Age
0   1  Alice   25
1   2    Bob   30
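The example above has identical keys on both sides and no shared columns besides `ID`. When keys only partly overlap, or both tables carry a column with the same name, the `how`, `suffixes`, and `indicator` parameters of `pd.merge()` control the result. A sketch on made-up data (the names `left` and `right` and the `Score` column are illustrative):

```python
import pandas as pd

# Hypothetical tables: IDs overlap only partly, and both have a 'Score' column
left = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Cara"], "Score": [1, 2, 3]})
right = pd.DataFrame({"ID": [2, 3, 4], "Age": [30, 35, 40], "Score": [20, 30, 40]})

# Default inner join: only IDs present in both tables survive;
# the overlapping 'Score' columns get the default suffixes _x and _y
inner = pd.merge(left, right, on="ID")

# Outer join: keep every ID, name the overlapping columns explicitly,
# and record where each row came from in the '_merge' column
outer = pd.merge(
    left, right, on="ID", how="outer",
    suffixes=("_left", "_right"), indicator=True,
)

print(inner)
print(outer)
```

Rows that exist on only one side get `NaN` in the columns coming from the other table, and `_merge` marks them as `left_only` or `right_only`.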


## 🧠 11. Bonus: Tips for Handling Large Datasets

Working with large datasets in pandas can be slow or memory-intensive. Here are some practical strategies:

- Load only needed columns: use `usecols` when reading files like CSVs.

In [None]:
pd.read_csv("data.csv", usecols=["column1", "column2"])

- Read in chunks for massive files:

In [None]:
chunk_iter = pd.read_csv("large_file.csv", chunksize=10000)
for chunk in chunk_iter:
    process(chunk)  # `process` is a placeholder for your own per-chunk logic
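As a runnable illustration of the chunked pattern, the sketch below reads a tiny in-memory CSV (standing in for a large file) two rows at a time and keeps only running totals, so no more than one chunk is held in memory at once. The data and column names are made up:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a large file on disk
csv_text = "country,price\nItaly,10\nFrance,20\nItaly,30\nSpain,40\nItaly,50\n"

total_price = 0.0
row_count = 0

# chunksize makes read_csv return an iterator of DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["country", "price"], chunksize=2):
    italy = chunk[chunk["country"] == "Italy"]   # filter early, per chunk
    total_price += italy["price"].sum()
    row_count += len(italy)

mean_italy_price = total_price / row_count
print(mean_italy_price)
```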

- Use `category` dtype for columns with repeated string values to save memory.

- Filter rows early: Don’t load/process more data than you need.

- Select rows and columns in a single `.loc[]` or `.query()` step rather than chaining several filters, so fewer intermediate copies are created.

- Use libraries optimized for big data, like:
    - polars
    - dask
    - vaex
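The `category` tip above can be checked with `memory_usage(deep=True)`. The sketch uses a made-up Series with three distinct strings repeated many times; actual savings depend on how many distinct values a column has relative to its length:

```python
import pandas as pd

# Hypothetical column: few distinct values, many repetitions
countries = pd.Series(["Italy", "France", "US"] * 10_000)

# category stores each distinct string once plus a small integer code per row
as_category = countries.astype("category")

string_bytes = countries.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"strings: {string_bytes:,} bytes, category: {category_bytes:,} bytes")
```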