# Pandas Comprehensive Training

## 1. Introduction to Pandas

Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation library built on top of NumPy. It is commonly used in data science for cleaning, transforming, and analyzing data.

Key components of Pandas:
- Series: a one-dimensional labeled array.
- DataFrame: a two-dimensional labeled data structure (the most commonly used).


Key Pandas Concepts for ML:
1. Data cleaning & preprocessing (handling missing values, duplicates, etc.)
2. Feature engineering (creating new features, encoding, normalization)
3. Aggregation and grouping (for data summarization)
4. Time series handling (important for forecasting)
5. Performance optimization (working with large datasets)
6. Data visualization integration (matplotlib, seaborn)
    
## 2. Basic Data Structures

### 2.1 Series
A Series is like a column in a table or a one-dimensional array with labels. You can create a Series using Python lists, NumPy arrays, or dictionaries.
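
The three construction routes mentioned above can be sketched as follows. The values and labels are illustrative and not taken from a real dataset.

```python
import numpy as np
import pandas as pd

# From a list: pandas assigns a default integer index 0..n-1
s_list = pd.Series([10, 20, 30])

# From a NumPy array, with explicit labels supplied through index=
s_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=['a', 'b', 'c'])

# From a dictionary: the keys become the index labels
s_dict = pd.Series({'x': 100, 'y': 200})

print(s_array['b'])           # label-based access to one element
print(s_dict.index.tolist())  # the labels taken from the dictionary keys
```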

### 2.2 DataFrame
A DataFrame is a two-dimensional table with labeled axes (rows and columns).
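
As a small sketch of the two labeled axes, the frame below uses made-up row labels (`r1`, `r2`) so that the row index is visibly distinct from the column labels.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Name': ['Alice', 'Bob'], 'Age': [25, 30]},
    index=['r1', 'r2'],  # illustrative row labels
)

print(df_demo.index.tolist())    # row labels
print(df_demo.columns.tolist())  # column labels
print(df_demo.shape)             # (number of rows, number of columns)

# Each column of a DataFrame is itself a Series sharing the row index
age = df_demo['Age']
```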

## 3. Data Selection and Indexing

### 3.1 Selecting Columns
Select a single column by passing its name as a key, which returns a Series. Pass a list of names to get a DataFrame containing several columns.

### 3.2 Selecting Rows by Label
You can use .loc[] to select rows based on their index labels.

### 3.3 Selecting Rows by Position
Use .iloc[] to select rows by position (integer index).
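
The difference between .loc[] and .iloc[] is easiest to see when the index labels are not 0, 1, 2. The sketch below assumes a small frame with illustrative integer labels 10, 20, 30.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Age': [25, 30, 35]},
    index=[10, 20, 30],  # illustrative non-default labels
)

by_label = df_demo.loc[20, 'Age']   # the row whose label is 20
by_position = df_demo.iloc[0, 0]    # the first row and first column, whatever the labels

# .loc slices include the end label; .iloc slices exclude the end position
label_slice = df_demo.loc[10:20]
position_slice = df_demo.iloc[0:1]
```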

### 3.4 Conditional Selection
Filter rows by indexing with a boolean Series, such as the result of a comparison on a column.
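
Conditions can also be combined. This is a minimal sketch using the same sample people as the rest of this tutorial; the variable names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

# Combine conditions with & (and), | (or), ~ (not); wrap each condition in parentheses
older_not_ny = df_demo[(df_demo['Age'] >= 30) & (df_demo['City'] != 'New York')]

# isin() tests membership in a list of values
selected_cities = df_demo[df_demo['City'].isin(['New York', 'Los Angeles'])]
```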

### Multi-Indexing (Hierarchical Indexing)
Hierarchical indexing (MultiIndex) stores two or more index levels on one axis. This lets you represent higher-dimensional data in a two-dimensional DataFrame and select rows by any level.
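
Beyond selecting by the outer level, a MultiIndex can be addressed with a full tuple or by an inner level. The sketch below assumes two levels named `letter` and `number`.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['A', 'B'], ['one', 'two']], names=['letter', 'number']
)
df_demo = pd.DataFrame({'x': [1, 3, 5, 7]}, index=index)

# A tuple with one entry per level selects a single row
single = df_demo.loc[('B', 'one'), 'x']

# xs() takes a cross-section on an inner level and drops that level from the result
ones = df_demo.xs('one', level='number')
```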

### Indexing with .query() for Faster Selection
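The .query() method filters rows with a string expression, which is often easier to read than chained boolean masks. On large DataFrames it can also be faster when the optional numexpr package is installed; on small ones, plain boolean indexing is usually at least as fast.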
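
A query expression can refer to Python variables with the `@` prefix. The sketch below uses a hypothetical threshold, `min_age`, and shows the equivalent boolean mask for comparison.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

min_age = 28  # hypothetical threshold

# @min_age looks up the Python variable from the surrounding scope
via_query = df_demo.query('Age > @min_age and City != "New York"')

# The same filter written as a boolean mask
via_mask = df_demo[(df_demo['Age'] > min_age) & (df_demo['City'] != 'New York')]
```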

In [None]:
import pandas as pd

# Creating a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)


# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)
print(df)


# Select a single column
age_column = df['Age']
print(age_column)

# Select multiple columns
subset = df[['Name', 'Age']]
print(subset)


# Select row with label 1 (Bob)
row_bob = df.loc[1]
print(row_bob)

# Select the first row
first_row = df.iloc[0]
print(first_row)

# Select rows where age is greater than 30
age_filter = df[df['Age'] > 30]
print(age_filter)

.............
# Creating MultiIndex from arrays
arrays = [['A', 'A', 'B', 'B'], ['one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=('letter', 'number'))
df_multi = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], index=index, columns=['x', 'y'])
print(df_multi)

# Accessing data with multi-index
print(df_multi.loc['A'])
.............
# Using query to filter rows (can be faster on large DataFrames)
df_filtered = df.query('Age > 30 and City == "Los Angeles"')  # matches Charlie; "New York" would return no rows
print(df_filtered)
............

## 4. Data Cleaning

### 4.1 Handling Missing Data
Pandas represents missing data as NaN (or None) and provides functions to detect it (isna()), fill it (fillna()), and drop it (dropna()).
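
Before choosing a strategy, it usually helps to measure how much is missing. This is a minimal sketch on an illustrative frame.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Age': [25, np.nan, 35, np.nan],
    'City': ['New York', None, 'Los Angeles', 'Boston'],
})

missing_counts = df_demo.isna().sum()   # number of missing values per column
missing_share = df_demo.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```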

#### Imputing Missing Values
- For numerical features, common choices are the mean, the median, or model-based imputation.
- For categorical features, you might fill missing values with the mode (most frequent value).

#### Advanced Imputation with sklearn's SimpleImputer
sklearn.impute.SimpleImputer offers various strategies like mean, median, or a constant value, and works well in a pipeline.
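
One reason SimpleImputer suits pipelines is that it learns its fill value once and reuses it. The sketch below assumes a hypothetical train/test split and shows the training mean being applied to the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical split; values are illustrative
train = pd.DataFrame({'Age': [20.0, np.nan, 40.0]})
test = pd.DataFrame({'Age': [np.nan, 50.0]})

imputer = SimpleImputer(strategy='mean')
imputer.fit(train[['Age']])  # learns the mean from the training rows only

# transform() returns a 2-D array; ravel() flattens it to one dimension
train_age = imputer.transform(train[['Age']]).ravel()
test_age = imputer.transform(test[['Age']]).ravel()  # filled with the training mean
```

Fitting on the training rows only keeps information from the test rows out of the preprocessing step.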

#### Removing Missing Data
If the proportion of missing data in a column is too high, it may be better to drop that column entirely.

### 4.2 Renaming Columns
You can rename columns using .rename().

### 4.3 Replacing Values
You can replace values in a DataFrame.

In [None]:
# Create DataFrame with NaN values
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, None, 35],
    'City': ['New York', None, 'Los Angeles']
}
df = pd.DataFrame(data)

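# Fill missing values in a column with a constant value
# (a dict limits the fill to 'City', so the numeric 'Age' column keeps its dtype)
df_filled = df.fillna({'City': 'Unknown'})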
print(df_filled)

# Drop rows with missing values
df_dropped = df.dropna()
print(df_dropped)

......

# Rename columns
df_renamed = df.rename(columns={'Name': 'Full Name', 'Age': 'Years'})
print(df_renamed)

.........

# Replace a specific value
df_replaced = df.replace({'New York': 'NYC'})
print(df_replaced)

.........

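# Impute missing values with median (useful for numerical features)
# Assign the result back: fillna(inplace=True) on a selected column relies on chained
# assignment, which recent pandas versions warn about and which does not update df
# under copy-on-write
df['Age'] = df['Age'].fillna(df['Age'].median())
..........

# Impute missing values in categorical column with mode
df['City'] = df['City'].fillna(df['City'].mode()[0])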

.........

from sklearn.impute import SimpleImputer

# Imputation for numeric features
# (fit_transform returns a 2-D array; ravel() flattens it to a single column)
imputer = SimpleImputer(strategy='mean')
df['Age'] = imputer.fit_transform(df[['Age']]).ravel()

# Imputation for categorical features
imputer_cat = SimpleImputer(strategy='most_frequent')
df['City'] = imputer_cat.fit_transform(df[['City']]).ravel()
.................
# Drop columns with more than 30% missing data
# (isna().mean() is the fraction of missing values in each column)
threshold = 0.3
df_cleaned = df.loc[:, df.isna().mean() <= threshold]
..............

## 5. Data Transformation

### 5.1 Apply Functions
You can apply functions to columns or rows using .apply().
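
The column-wise and row-wise forms of .apply() can be sketched as follows. The frame and the derived column names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Element-wise on one column: the function receives each value
df_demo['Name_length'] = df_demo['Name'].apply(len)

# Row-wise with axis=1: the function receives each row as a Series
df_demo['Label'] = df_demo.apply(
    lambda row: f"{row['Name']} ({row['Age']})", axis=1
)

# For plain arithmetic, a vectorized expression is usually simpler and faster than apply
df_demo['Age_in_months'] = df_demo['Age'] * 12
```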

### 5.2 Grouping Data
Use .groupby() to split the data into groups by one or more keys, apply an aggregate function such as mean, sum, or count to each group, and combine the results.
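
Named aggregation is one way to compute several summaries at once and control the output column names. The data and the names `mean_age`, `total_sales`, and `n_rows` below are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'City': ['NYC', 'NYC', 'LA', 'LA'],
    'Age': [20, 30, 40, 50],
    'Sales': [100, 200, 300, 400],
})

# Each keyword is an output column: (source column, aggregation)
summary = df_demo.groupby('City').agg(
    mean_age=('Age', 'mean'),
    total_sales=('Sales', 'sum'),
    n_rows=('Sales', 'count'),
)
print(summary)
```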

### 5.3 Pivot Tables
Pivot tables aggregate values by one set of keys along the rows and another along the columns, producing a compact summary grid.
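
A minimal sketch of a pivot table, assuming an illustrative frame with a `Quarter` column that is not part of the tutorial's sample data:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'City': ['NYC', 'NYC', 'LA', 'LA'],
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q1'],
    'Sales': [100, 200, 300, 500],
})

# Rows are cities, columns are quarters, cells hold the summed sales;
# combinations with no data are filled with 0
pivot = df_demo.pivot_table(
    values='Sales', index='City', columns='Quarter',
    aggfunc='sum', fill_value=0,
)
print(pivot)
```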

### 5.4 Sorting Data
You can sort the data using .sort_values().
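
Sorting by several columns with mixed directions can be sketched as follows, on illustrative data:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'City': ['LA', 'NYC', 'LA'],
    'Age': [30, 25, 35],
})

# Sort by City ascending, then by Age descending within each city
df_sorted = df_demo.sort_values(
    by=['City', 'Age'], ascending=[True, False]
).reset_index(drop=True)  # renumber the rows after sorting
```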

### 5.5 Merging DataFrames
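Pandas combines DataFrames on shared key columns using .merge(), similar to a SQL join. The default is an inner join, which keeps only keys present in both DataFrames.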
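
The effect of the join type is easiest to see when some keys do not match. The sketch below uses illustrative frames in which 'Boston' has no entry in the second table.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'City': ['New York', 'Boston']})
cities = pd.DataFrame({
    'City': ['New York', 'Los Angeles'],
    'Population': [8000000, 4000000],
})

# Inner join (default): only rows whose City appears in both frames
inner = people.merge(cities, on='City')

# Left join: keep every row of `people`; indicator=True adds a _merge column
# that records where each row came from
left = people.merge(cities, on='City', how='left', indicator=True)
print(left)
```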

In [None]:
# Apply a function to a column
df['Age in months'] = df['Age'].apply(lambda x: x * 12 if pd.notna(x) else None)
print(df)

...........
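# Group by a column and calculate the mean of the numeric columns
# (numeric_only=True skips text columns such as 'Name', which pandas 2.x would otherwise reject)
grouped = df.groupby('City').mean(numeric_only=True)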
print(grouped)
...........
# Group by 'City' and calculate mean age
df_grouped = df.groupby('City')['Age'].mean()
print(df_grouped)

# Multiple aggregations on multiple columns
# ('Sales' is not part of the sample data, so illustrative values are added here)
df['Sales'] = [200, 150, 300]
df_grouped_agg = df.groupby('City').agg({
    'Age': 'mean',
    'Sales': 'sum'
})
print(df_grouped_agg)
...........
# Pivot table: mean Sales for each City (rows) and Age (columns); empty cells become 0
pivot_table = df.pivot_table(values='Sales', index='City', columns='Age', aggfunc='mean', fill_value=0)
print(pivot_table)
...........
# Sort by Age
df_sorted = df.sort_values(by='Age')
print(df_sorted)
..........
# Create another DataFrame
data2 = {
    'City': ['New York', 'San Francisco', 'Los Angeles'],
    'Population': [8000000, 870000, 4000000]
}
df2 = pd.DataFrame(data2)

# Merge dataframes on 'City'
merged_df = pd.merge(df, df2, on='City')
print(merged_df)

## 6. Time Series Analysis

Pandas has built-in support for time series data: datetime parsing, a DatetimeIndex for date-based selection, resampling, and shifting.

### 6.1 DateTime Conversion
Ensure that date columns are in proper datetime format for time series analysis.
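
Real date columns often contain entries that cannot be parsed. This is a minimal sketch, with illustrative input strings, of converting while marking bad entries as missing.

```python
import pandas as pd

raw = pd.Series(['2024-01-15', '2024-02-30', 'not a date'])  # illustrative input

# errors='coerce' turns unparseable entries into NaT instead of raising an error
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed)
print(parsed.dt.year)  # the .dt accessor exposes datetime components
```

Here February 30 is not a valid date, so it becomes NaT along with the free-text entry.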

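### 6.2 Resampling Time Series Data
Resampling aggregates a time series to a coarser frequency (for example, daily to monthly) or expands it to a finer one. For forecasting, it is a common way to bring data to the frequency the model works at.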
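
To keep the arithmetic easy to check, the sketch below resamples six daily values into two-day bins rather than months. The frequency string '2D' is a choice made for this example.

```python
import pandas as pd

sales = pd.DataFrame(
    {'Sales': [250, 300, 350, 400, 450, 500]},
    index=pd.date_range('2024-01-01', periods=6, freq='D'),
)

# Group consecutive days into 2-day bins and average each bin
two_day_mean = sales.resample('2D').mean()
print(two_day_mean)
```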

### 6.3 Time Series Lag Features
Lag features expose earlier values of a series (for example, the previous period's sales) as columns, so a model can use past observations as inputs.
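
A minimal sketch of a lag feature and a rolling feature built only from past values; the column names `lag_1` and `rolling_mean_2` are illustrative.

```python
import pandas as pd

sales = pd.DataFrame(
    {'Sales': [250, 300, 350, 400]},
    index=pd.date_range('2024-01-01', periods=4, freq='D'),
)

# Previous row's value
sales['lag_1'] = sales['Sales'].shift(1)

# Mean of the two preceding rows; shifting first keeps the current value out of the window
sales['rolling_mean_2'] = sales['Sales'].shift(1).rolling(2).mean()

# The earliest rows have no history, so they contain NaN and are often dropped before training
features = sales.dropna()
print(features)
```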

In [None]:
# Create a time series DataFrame
dates = pd.date_range('2024-01-01', periods=6)
data = {'Sales': [250, 300, 350, 400, 450, 500]}
df_time_series = pd.DataFrame(data, index=dates)
print(df_time_series)

# Resampling (e.g., monthly average)
# 'ME' (month end) is the alias in pandas 2.2+; older versions use 'M'
df_resampled = df_time_series.resample('ME').mean()
print(df_resampled)

.............

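# The sample df has no 'Date' column, so build a small frame from df_time_series
# with dates stored as strings, as they often arrive from a CSV file
df_sales = df_time_series.rename_axis('Date').reset_index()
df_sales['Date'] = df_sales['Date'].astype(str)

# Convert the string column to datetime
df_sales['Date'] = pd.to_datetime(df_sales['Date'])
print(df_sales['Date'].dt.year)
...............
# Resample data to monthly frequency and take mean ('M' on pandas older than 2.2)
df_resampled = df_sales.resample('ME', on='Date').mean()
.............
# Lag feature (the previous row's sales; here, the previous day)
df_sales['Sales_lag'] = df_sales['Sales'].shift(1)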

## 7. Feature Engineering for Machine Learning

### 7.1 Creating New Features
Sometimes new features are derived from existing ones. For example, creating interaction terms or aggregating features.
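
Two common kinds of derived features, a ratio and an interaction term, can be sketched as follows. The columns `Visits` and `Price` are hypothetical and not part of the tutorial's sample data.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Sales': [200, 300],
    'Visits': [10, 20],
    'Price': [5.0, 4.0],
})

# Ratio feature: sales normalized by the number of visits
df_demo['Sales_per_visit'] = df_demo['Sales'] / df_demo['Visits']

# Interaction term: product of two numeric features
df_demo['Price_x_Visits'] = df_demo['Price'] * df_demo['Visits']
```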

### Feature Scaling (Normalization & Standardization)
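Scaling matters for models that depend on distances or gradient-based optimization, such as k-nearest neighbors, SVMs, linear models, and neural networks; tree-based models are largely insensitive to it. Use MinMaxScaler for normalization to the range [0, 1] or StandardScaler for standardization to zero mean and unit variance.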
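
A scaler learns its parameters from the data it is fitted on. The sketch below assumes a hypothetical train/test split and reuses the training parameters on the test row.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical split; values are illustrative
train = pd.DataFrame({'Age': [20.0, 30.0, 40.0]})
test = pd.DataFrame({'Age': [50.0]})

# Normalization: min and max are learned from the training data
minmax = MinMaxScaler().fit(train[['Age']])
train_norm = minmax.transform(train[['Age']]).ravel()
test_norm = minmax.transform(test[['Age']]).ravel()  # may fall outside [0, 1]

# Standardization: mean and standard deviation are learned from the training data
standard = StandardScaler().fit(train[['Age']])
train_std = standard.transform(train[['Age']]).ravel()
```

The test value 50 lies above the training maximum, so its normalized value exceeds 1; this is expected when parameters come from the training data only.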

### One-Hot Encoding
For categorical variables, one-hot encoding transforms them into binary columns.
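
The effect of one-hot encoding, with and without drop_first, can be sketched on an illustrative column:

```python
import pandas as pd

df_demo = pd.DataFrame({'City': ['LA', 'NYC', 'SF', 'LA']})

# One 0/1 column per category (dtype=int requests integers rather than booleans)
full = pd.get_dummies(df_demo, columns=['City'], dtype=int)

# drop_first=True omits the first category, which is implied when all others are 0
reduced = pd.get_dummies(df_demo, columns=['City'], drop_first=True, dtype=int)

print(full)
print(reduced)
```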

### Label Encoding for Machine Learning
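Label encoding maps each category to an integer and is appropriate when there is an ordinal relationship between categories. Note that scikit-learn's LabelEncoder assigns codes in sorted (alphabetical) order and is intended for target labels, so check that the resulting codes match the intended order.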
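
When the order of the categories matters, it can be stated explicitly. The sketch below uses a hypothetical `sizes` feature and compares an ordered pandas Categorical with LabelEncoder's alphabetical codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sizes = pd.Series(['medium', 'small', 'large', 'small'])  # hypothetical ordinal feature

# Explicit order: small < medium < large
order = ['small', 'medium', 'large']
ordered_codes = pd.Categorical(sizes, categories=order, ordered=True).codes

# LabelEncoder sorts the classes alphabetically: large=0, medium=1, small=2
alphabetical_codes = LabelEncoder().fit_transform(sizes.tolist())
```

With the explicit order, 'small' maps to 0 and 'large' to 2; the alphabetical codes reverse that ranking.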

In [None]:
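# Creating interaction feature
# (cat.codes are arbitrary integers assigned in sorted category order, so treat this
# product as a syntax illustration rather than a meaningful feature)
df['Age_X_City'] = df['Age'] * df['City'].astype('category').cat.codes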

............

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization (zero mean, unit variance)
# (fit_transform returns a 2-D array; ravel() flattens it to a single column)
scaler = StandardScaler()
df['Age_scaled'] = scaler.fit_transform(df[['Age']]).ravel()

# Normalization (scaling to range [0, 1])
scaler = MinMaxScaler()
df['Age_normalized'] = scaler.fit_transform(df[['Age']]).ravel()

..........

# One-hot encode a categorical column (City)
df_encoded = pd.get_dummies(df, columns=['City'], drop_first=True)
print(df_encoded)

..............

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['City_encoded'] = encoder.fit_transform(df['City'])
print(df[['City', 'City_encoded']])
..............

## 8. Pandas Integration with Machine Learning Workflows

### 8.1 Data Pipeline Using sklearn.pipeline
A scikit-learn Pipeline chains preprocessing steps and a model into one estimator, so the same transformations are fitted on the training data and reapplied at prediction time.


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

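# Fit the pipeline
# (X_train and y_train are assumed to exist already, e.g. from sklearn.model_selection.train_test_split)
pipeline.fit(X_train, y_train)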
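
The pipeline above assumes that X_train and y_train already exist. As a self-contained sketch of how a pandas DataFrame might feed such a pipeline, the example below builds a toy dataset (the `Bought` target and all values are invented for illustration) and routes numeric and categorical columns through separate preprocessing steps with ColumnTransformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; 'Bought' is an invented binary target
df_demo = pd.DataFrame({
    'Age': [25, None, 35, 45, 52, 23, 40, 60],
    'City': ['NYC', 'LA', 'LA', 'NYC', 'SF', 'SF', 'NYC', 'LA'],
    'Bought': [0, 0, 1, 1, 1, 0, 1, 1],
})
X = df_demo[['Age', 'City']]
y = df_demo['Bought']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Numeric columns: fill missing values with the median, then standardize
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: one-hot encode; categories unseen in training become all zeros
categorical = OneHotEncoder(handle_unknown='ignore')

preprocess = ColumnTransformer([
    ('num', numeric, ['Age']),
    ('cat', categorical, ['City']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=0)),
])

model.fit(X_train, y_train)          # preprocessing is fitted on the training rows only
predictions = model.predict(X_test)  # the same fitted preprocessing is reapplied here
```

With only eight rows the predictions carry no meaning; the point is the wiring between the DataFrame columns and the preprocessing steps.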