## Advanced Data Manipulation with Pandas

**DataFrames and Series (Advanced Indexing)**

*   **`.loc` and `.iloc`:**
    *   `.loc`:  Label-based indexing (select rows and columns by their labels).
    *   `.iloc`:  Integer-based indexing (select rows and columns by their integer positions).

    ```python
    import pandas as pd
    import numpy as np

    # Create a DataFrame
    data = {'col1': [1, 2, 3, 4],
            'col2': [5, 6, 7, 8],
            'col3': [9, 10, 11, 12]}
    df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])

    # .loc examples
    print(df.loc['A'])        # Row with label 'A'
    print(df.loc[['A', 'C']]) # Rows with labels 'A' and 'C'
    print(df.loc['B', 'col2']) # Value at row 'B', column 'col2'
    print(df.loc['A':'C', 'col1':'col2']) # Slicing with labels

    # .iloc examples
    print(df.iloc[0])        # First row (position 0)
    print(df.iloc[[0, 2]])   # First and third rows
    print(df.iloc[1, 1])     # Value at second row, second column (positions 1, 1)
    print(df.iloc[0:3, 0:2]) # Slicing with integer positions
    ```

*   **Boolean Indexing:**  Select rows based on a condition.

    ```python
    print(df[df['col1'] > 2])  # Rows where 'col1' is greater than 2
    print(df[(df['col2'] >= 6) & (df['col3'] < 12)]) # Multiple conditions
    ```

*   **Multi-Indexing (Hierarchical Indexing):**  Create DataFrames with multiple levels of indices.

    ```python
    index = pd.MultiIndex.from_tuples([('Group1', 'A'), ('Group1', 'B'),
                                     ('Group2', 'A'), ('Group2', 'B')],
                                    names=['Group', 'Letter'])
    data = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
    df_multi = pd.DataFrame(data, index=index)
    print(df_multi)

    # Accessing data with multi-index
    print(df_multi.loc['Group1'])
    print(df_multi.loc[('Group1', 'B')]) # Access a specific row
    print(df_multi.loc['Group1', 'col1']) # Select a column for a specific group
    print(df_multi.xs('A', level='Letter')) # Cross-section: all rows with 'Letter' = 'A'
    ```
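
One detail of the indexers above is easy to miss: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position. The following is a minimal sketch of that difference, plus a boolean mask passed to `.loc` to pick rows and a column in one step. The small DataFrame mirrors the one above, and the variable names (`by_label`, `by_position`, `mask`, `subset`) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]},
                  index=['A', 'B', 'C', 'D'])

# Label slice: the end label 'C' is included
by_label = df.loc['A':'C']
# Position slice: the end position 2 is excluded
by_position = df.iloc[0:2]
print(len(by_label), len(by_position))

# A boolean mask can be combined with a column label inside .loc
mask = (df['col1'] > 1) & (df['col2'] < 8)
subset = df.loc[mask, 'col2']
print(subset.tolist())
```

Passing the mask to `.loc` rather than chaining `df[mask]['col2']` keeps the selection in a single indexing operation, which also matters when assigning values.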

**Data Cleaning and Transformation**

*   **Handling Missing Data (NaN values):**

    ```python
    df = pd.DataFrame({'col1': [1, np.nan, 3, 4],
                      'col2': [5, 6, np.nan, 8]})

    # Check for missing values
    print(df.isnull())
    print(df.isna()) # Same as isnull()
    print(df.isnull().sum())  # Count missing values per column

    # Drop rows or columns with missing values
    print(df.dropna())  # Drop rows with any NaN
    print(df.dropna(axis=1))  # Drop columns with any NaN
    print(df.dropna(thresh=2))  # Drop rows with fewer than 2 non-NaN values

    # Fill missing values
    print(df.fillna(0))  # Fill with 0
    print(df.ffill())  # Forward fill (propagate last valid observation); fillna(method='ffill') is deprecated since pandas 2.1
    print(df.bfill())  # Backward fill
    print(df.fillna(df.mean()))  # Fill with column means
    ```

*   **Data Type Conversions:**

    ```python
    df = pd.DataFrame({'col1': ['1', '2', '3'], 'col2': [1.1, 2.2, 3.3]})

    # Convert to integer
    df['col1'] = df['col1'].astype(int)

    # Convert to float
    # df['col1'] = df['col1'].astype(float) # This will raise an error if the column cannot be converted to float

    # Convert to string
    df['col2'] = df['col2'].astype(str)

    # Convert to datetime
    df['date'] = pd.to_datetime(['2023-10-26', '2023-10-27', '2023-10-28'])
    print(df.dtypes)
    ```

*   **String Manipulation:**

    ```python
    df = pd.DataFrame({'text': ['hello world', 'Python Pandas', 'data science']})

    # String methods (vectorized)
    print(df['text'].str.upper())
    print(df['text'].str.lower())
    print(df['text'].str.contains('Python'))
    print(df['text'].str.split())
    print(df['text'].str.replace(' ', '_'))
    ```

*   **Applying Custom Functions:**

    ```python
    # Using apply (row-wise or column-wise)
    def my_function(row):
        return row['col1'] * 2 + row['col2']

    df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
    df['new_col'] = df.apply(my_function, axis=1)  # Apply row-wise
    print(df)

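    # Element-wise application: applymap was renamed to DataFrame.map in pandas 2.1
    # (applymap is deprecated there); on older versions use df.applymap(...)
    df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
    df_squared = df.map(lambda x: x**2)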
    print(df_squared)
    ```
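
The type-conversion example above notes that `astype` raises an error when a value cannot be converted. A tolerant alternative, sketched here under the assumption that unparseable entries should become missing values, is `pd.to_numeric` with `errors='coerce'`, followed by one of the fill strategies shown earlier. The sample values and the choice of the median as fill value are made up for illustration.

```python
import pandas as pd

# Hypothetical messy input: one entry is not a number
raw = pd.Series(['1', '2', 'n/a', '4'])

# Unparseable strings become NaN instead of raising
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric.isna().sum())

# Fill the resulting gap, here with the median of the valid values
filled = numeric.fillna(numeric.median())
print(filled.tolist())
```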

**Grouping and Aggregation**

*   **`groupby` Operations:**

    ```python
    df = pd.DataFrame({'group': ['A', 'B', 'A', 'B', 'C'],
                       'value': [1, 2, 3, 4, 5]})

    # Group by a single column
    grouped = df.groupby('group')

    # Calculate the mean of each group
    print(grouped.mean())

    # Calculate multiple aggregations
    print(grouped.agg({'value': ['sum', 'mean', 'max']}))

    # Iterate over groups
    for name, group_df in grouped:
        print(f"Group: {name}")
        print(group_df)
    ```

*   **Transformations:**  Apply a function to each group and return a result (Series or DataFrame) with the same index as the original.

    ```python
    # Calculate the z-score within each group
    df = pd.DataFrame({'group': ['A', 'A', 'B', 'B'], 'value': [1, 2, 3, 4]})
    zscore = lambda x: (x - x.mean()) / x.std()
    df['zscore'] = df.groupby('group')['value'].transform(zscore)
    print(df)
    ```

*   **Filtering:**  Select groups based on a condition.

    ```python
    # Keep only groups with a mean greater than 2
    filtered_df = df.groupby('group').filter(lambda x: x['value'].mean() > 2)
    print(filtered_df)
    ```
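
The dictionary form of `agg` shown above produces a two-level column index. When flat, readable column names are preferred, named aggregation is one option. This is a minimal sketch that reuses the sample data from the `groupby` example; the output names `total`, `average`, and `largest` are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({'group': ['A', 'B', 'A', 'B', 'C'],
                   'value': [1, 2, 3, 4, 5]})

# Each keyword names an output column: (input column, aggregation)
summary = df.groupby('group').agg(
    total=('value', 'sum'),
    average=('value', 'mean'),
    largest=('value', 'max'),
)
print(summary)
```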

**Merging, Joining, and Concatenating**

*   **`pd.concat`:**  Concatenate DataFrames along rows or columns.

    ```python
    df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
    df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

    # Concatenate along rows (axis=0, default)
    print(pd.concat([df1, df2]))

    # Concatenate along columns (axis=1)
    print(pd.concat([df1, df2], axis=1))

    # Handle different columns (df3 has no column 'B')
    df3 = pd.DataFrame({'A': [9, 10]}, index=[2, 3])
    print(pd.concat([df1, df3]))  # Outer join on columns: missing 'B' values become NaN
    print(pd.concat([df1, df3], join='inner'))  # Keep only shared columns
    ```

*   **`pd.merge`:**  Join DataFrames based on common columns (like SQL joins).

    ```python
    df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
    df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

    # Inner join (default)
    print(pd.merge(df1, df2, on='key'))

    # Left join
    print(pd.merge(df1, df2, on='key', how='left'))

    # Right join
    print(pd.merge(df1, df2, on='key', how='right'))

    # Outer join
    print(pd.merge(df1, df2, on='key', how='outer'))

    # Joining on different column names
    df3 = pd.DataFrame({'key1': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
    df4 = pd.DataFrame({'key2': ['B', 'C', 'D'], 'value2': [4, 5, 6]})
    print(pd.merge(df3, df4, left_on='key1', right_on='key2'))

    # Joining on index: both frames have a 'key' column, so suffixes disambiguate them
    print(pd.merge(df1, df2, left_index=True, right_index=True, suffixes=('_left', '_right')))
    ```

*   **`.join`:**  Simplified version of `merge`, primarily for joining on the index.

    ```python
    left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                      'B': ['B0', 'B1', 'B2']},
                     index=['K0', 'K1', 'K2'])
    right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                        'D': ['D0', 'D2', 'D3']},
                       index=['K0', 'K2', 'K3'])
    print(left.join(right)) # left join by default
    print(left.join(right, how='outer')) # outer join
    ```
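
When debugging the joins above, it often helps to see where each row came from and to check assumptions about key uniqueness. The sketch below uses two optional `pd.merge` parameters for that: `indicator=True` adds a `_merge` column, and `validate='one_to_one'` raises an error if keys are duplicated on either side. The sample frames are the same as in the `pd.merge` example.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

merged = pd.merge(df1, df2, on='key', how='outer',
                  indicator=True,          # adds '_merge': left_only / right_only / both
                  validate='one_to_one')   # raises MergeError on duplicate keys
print(merged)

# Rows that matched on both sides
matched = merged[merged['_merge'] == 'both']
print(matched['key'].tolist())
```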

**Time Series Data**

*   **Working with Dates and Times:**

    ```python
    # Create a DatetimeIndex
    dates = pd.to_datetime(['2023-10-26', '2023-10-27', '2023-10-28'])
    df = pd.DataFrame({'value': [1, 2, 3]}, index=dates)

    # Access components of dates
    print(df.index.year)
    print(df.index.month)
    print(df.index.day)
    print(df.index.dayofweek)

    # Create a date range
    date_range = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D') # Daily frequency
    print(date_range)

    # Time deltas
    print(pd.Timedelta(days=1))
    print(df.index + pd.Timedelta(days=2))

    ```

*   **Resampling:**  Change the frequency of a time series.

    ```python
    # Create a time series with hourly data
    rng = pd.date_range('2023-01-01', periods=24, freq='h')  # 'H' is deprecated since pandas 2.2
    ts = pd.Series(np.random.randn(len(rng)), index=rng)

    # Resample to daily frequency, taking the mean
    print(ts.resample('D').mean())

    # Resample to 3-hour frequency; when downsampling like this, ffill() simply
    # takes the value at each 3-hour label (filling matters when upsampling)
    print(ts.resample('3h').ffill())
    ```

*   **Shifting:**  Move data forward or backward in time.

    ```python
    # Shift the data by 1 period forward
    print(ts.shift(1))

    # Shift the data by 2 periods backward
    print(ts.shift(-2))
    ```

*   **Window Functions:**  Perform calculations over a sliding window of data.

    ```python
    # Calculate a 3-period rolling mean
    print(ts.rolling(window=3).mean())

    # Calculate a 3-period rolling sum
    print(ts.rolling(window=3).sum())

    # Expanding window (cumulative)
    print(ts.expanding().mean())
    ```
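
Shifting and window functions are commonly combined to build lagged features. The following sketch uses a small, hand-written daily series (the values are invented) so the results are easy to check by hand; `min_periods=1` is an optional setting that lets the rolling mean produce values before the window is full.

```python
import pandas as pd

idx = pd.date_range('2023-01-01', periods=5, freq='D')
ts = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0], index=idx)

features = pd.DataFrame({
    'value': ts,
    'lag_1': ts.shift(1),                 # previous day's value
    'diff_1': ts - ts.shift(1),           # day-over-day change
    'roll_mean_3': ts.rolling(window=3, min_periods=1).mean(),
})
print(features)
```

Without `min_periods=1`, the first two rows of `roll_mean_3` would be NaN, as in the rolling examples above.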

**Pivot Tables and Cross-Tabulations**

*   **`pivot_table`:**  Reshape data from "long" to "wide" format, aggregating values that share the same index and column keys.

    ```python
    df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                       'B': ['A', 'B', 'C'] * 4,
                       'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                       'D': np.random.randn(12),
                       'E': np.random.randn(12)})

    # Create a pivot table
    pivot = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')
    print(pivot)
    ```

*   **`crosstab`:**  Compute a simple cross-tabulation of two (or more) factors.

    ```python
    # Create a cross-tabulation
    cross_tab = pd.crosstab(df['A'], df['C'])
    print(cross_tab)
    ```
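
Because the examples above use random values, their output changes on every run. Here is a small deterministic sketch of the same two functions, using an invented `sales` table, that also shows two optional parameters: `fill_value` for empty cells in `pivot_table` and `margins=True` for row and column totals in `crosstab`.

```python
import pandas as pd

# Hypothetical long-format data
sales = pd.DataFrame({'region': ['N', 'N', 'S', 'S', 'S'],
                      'product': ['x', 'y', 'x', 'x', 'y'],
                      'units': [1, 2, 3, 4, 5]})

# Long -> wide: one row per region, one column per product, summed units
wide = sales.pivot_table(values='units', index='region', columns='product',
                         aggfunc='sum', fill_value=0)
print(wide)

# Frequency counts with an 'All' row and column of totals
counts = pd.crosstab(sales['region'], sales['product'], margins=True)
print(counts)
```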

**Performance Optimization**

*   **Vectorized Operations:**  Use Pandas' built-in vectorized operations (string methods, arithmetic operations) instead of loops whenever possible.  The work then runs in compiled code rather than in the Python interpreter, which is often orders of magnitude faster.

*   **`apply` Efficiently:**
    *   Use `apply` with built-in NumPy functions when possible.
    *   For custom functions, ensure they are optimized (e.g., use NumPy operations inside the function).
    *   Consider using `numba` to compile your custom functions for even greater speed.

*   **Avoid `.iterrows()` and `.itertuples()` (Generally):**  These methods iterate over rows, which is slow.  Use vectorized operations or `apply` instead.  If you *must* iterate, `.itertuples()` is generally faster than `.iterrows()`.

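*   **Data Types:**  Use appropriate data types, such as `category` for low-cardinality string columns and smaller numeric types where the value range allows, to reduce memory use and speed up operations.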

*   **Example:**

    ```python
    import time
    df = pd.DataFrame({'col1': np.random.rand(100000),
                      'col2': np.random.rand(100000)})

    # Slow: Iterating
    def slow_func(df):
        total = 0
        for index, row in df.iterrows():
            total += row['col1'] * row['col2']
        return total

    start_time = time.time()
    slow_func(df)
    end_time = time.time()
    print(f"Iterrows time: {end_time - start_time:.4f} seconds")

    # Fast: Vectorized
    def fast_func(df):
        return (df['col1'] * df['col2']).sum()

    start_time = time.time()
    fast_func(df)
    end_time = time.time()
    print(f"Vectorized time: {end_time - start_time:.4f} seconds")

    # Apply (better than iterrows, but still slower than vectorization)
    def apply_func(row):
        return row['col1'] * row['col2']

    start_time = time.time()
    df.apply(apply_func, axis=1).sum()
    end_time = time.time()
    print(f"Apply time: {end_time - start_time:.4f} seconds")
    ```
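
The data-type advice above can be illustrated in the same spirit. This is a rough sketch, assuming a column of repeated strings and a column of small integers (both generated randomly here, with made-up city names); the exact savings depend on the data and the pandas version.

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({'city': np.random.choice(['Paris', 'Tokyo', 'Lima'], size=n),
                   'count': np.random.randint(0, 100, size=n)})

before = df.memory_usage(deep=True).sum()

# Low-cardinality strings: store each distinct value once, plus small integer codes
df['city'] = df['city'].astype('category')
# Small integers: downcast to the smallest integer type that fits the values
df['count'] = pd.to_numeric(df['count'], downcast='integer')

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(f"Memory: {before:,} bytes -> {after:,} bytes")
```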

**Practice Exercises:**

1.  **Real-World Dataset:**
    *   Find a dataset on Kaggle or another data repository (e.g., UCI Machine Learning Repository).
    *   Load the data into a Pandas DataFrame.
    *   Clean the data:
        *   Handle missing values.
        *   Convert data types as needed.
        *   Perform any necessary string manipulation.
    *   Transform the data:
        *   Create new features.
        *   Apply custom functions.
    *   Analyze the data:
        *   Use `groupby` to aggregate data.
        *   Filter data based on conditions.
        *   Merge or join with other DataFrames if applicable.

2.  **Time Series Analysis:**
    *   Find a dataset with time-based information (e.g., stock prices, weather data).
    *   Load the data into a Pandas DataFrame and set the time column as the index.
    *   Resample the data to different frequencies (e.g., daily, weekly, monthly).
    *   Calculate rolling statistics (e.g., moving averages).
    *   Shift the data to create lagged features.

3.  **Pivot Tables and Cross-Tabulations:**
    *   Using the dataset from Exercise 1 or 2, create pivot tables to summarize the data in different ways.
    *   Create cross-tabulations to analyze the relationships between categorical variables.

This course provides a strong foundation in advanced Pandas techniques. Remember to practice consistently, explore the Pandas documentation, and adapt these techniques to your specific data analysis tasks. The best way to learn is by doing!
