In [7]:
import numpy as np
import pandas as pd

# Generate a DataFrame with normally distributed data
data = pd.DataFrame(np.random.standard_normal((1000, 4)))

# Display descriptive statistics of the data
print("Descriptive Statistics of the Data:")
print(data)
print(data.describe())


Descriptive Statistics of the Data:
            0         1         2         3
0    0.345157 -0.673573 -0.096872 -1.509783
1    0.611159 -2.119816  1.780718  0.347664
2   -0.897091  0.425012  1.850351  0.132433
3    1.321895 -0.213506  0.580945 -2.146815
4   -0.924085 -2.184550 -1.925482 -0.368269
..        ...       ...       ...       ...
995  0.639372 -2.454351 -0.077494 -1.500924
996  0.805329 -0.156093 -1.273927  2.548325
997  0.680172  0.868181 -0.575667 -0.274601
998 -0.365088  1.990230 -2.167744 -0.046019
999  0.975137  0.810957 -0.467959  0.031007

[1000 rows x 4 columns]
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.014259    -0.031829    -0.017740    -0.039970
std       1.016262     0.948918     1.014655     1.013174
min      -2.956010    -2.685481    -3.092381    -3.253687
25%      -0.684655    -0.669440    -0.682163    -0.756287
50%       0.046209    -0.017571    -0.053176    -0.053265
75%  

np.random.standard_normal((1000, 4)) generates a 1000x4 array of random numbers drawn from a standard normal distribution. Because no seed is set, the values differ on every run.
pd.DataFrame(...) converts the array into a DataFrame.
data.describe() summarizes each column: count, mean, standard deviation, minimum, quartiles, and maximum.
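If reproducible numbers are wanted, the same DataFrame can be built from a seeded generator. This is a minimal sketch, not part of the original notebook: the seed value 12345 and the names rng and summary are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so that repeated runs produce the same values
# (the seed 12345 is an arbitrary, illustrative choice)
rng = np.random.default_rng(12345)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# One summary row per statistic, one column per data column
summary = data.describe()
print(summary)
```

With a fixed seed, the rows flagged as outliers in later steps stay the same between runs.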
Step 2: Detect Outliers in a Specific Column
To find values in column 2 that exceed 3 in absolute value:

In [5]:
# Select the column to check for outliers
col = data[2]

# Find values exceeding 3 in absolute value
outliers_col = col[col.abs() > 3]
print("\nOutliers in Column 2:")
print(outliers_col)



Outliers in Column 2:
459   -3.617837
782    3.080968
Name: 2, dtype: float64
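The Boolean-mask selection used here is easier to follow on a handful of values. The sketch below uses a small hand-written Series (the values and the name small are illustrative, not taken from the data above) to show the mask and the selection it produces.

```python
import pandas as pd

# Small illustrative Series; two of the four values exceed 3 in absolute value
small = pd.Series([0.5, -3.6, 1.2, 3.08])

# Boolean mask: True where the absolute value is greater than 3
mask = small.abs() > 3
print(mask)

# Indexing with the mask keeps only the rows where it is True
outliers = small[mask]
print(outliers)
```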


Step 3: Select Rows with Outliers in Any Column
To select every row that contains at least one value exceeding 3 in absolute value:

In [6]:
# Select all rows with any value above 3 or below -3
outliers_rows = data[(data.abs() > 3).any(axis='columns')]
print("\nRows with Outliers in Any Column:")
print(outliers_rows)



Rows with Outliers in Any Column:
            0         1         2         3
116 -3.224694 -0.582590 -1.459096 -0.708833
278 -1.295733 -0.320384  1.799589  3.636438
417 -0.441509 -1.904090  0.727183  4.135169
459 -0.697698 -0.121747 -3.617837 -0.795135
566 -0.075419  0.086795 -1.093684  3.534973
640  3.014346  0.387546  2.714997 -0.437871
688 -0.642721  0.729394 -0.336234 -3.083636
716  1.386277 -3.270058  0.732501 -0.973724
782  0.551717 -2.043327  3.080968 -0.899704
936 -0.390314 -3.162148  0.024568 -1.505562

data.abs() > 3 creates a Boolean DataFrame in which values exceeding 3 in absolute value are marked True.
.any(axis='columns') checks whether each row contains at least one True value.
data[(data.abs() > 3).any(axis='columns')] keeps only the rows in which at least one column has an outlier.
Step 4: Cap Outliers Using np.sign and inplace Operations
To cap the values outside the range of -3 to 3:
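Capping with np.sign can be sketched as follows. np.sign returns -1 or 1 according to the sign of each value, so multiplying by 3 gives the nearest bound of the interval from -3 to 3. The small DataFrame and the name capped are illustrative assumptions; the same assignment should work on the 1000x4 data above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with three values outside [-3, 3]
data = pd.DataFrame([[0.5, 4.1], [-3.6, -1.0], [2.0, 3.0]])

# Work on a copy so the original values stay available for comparison
capped = data.copy()

# Overwrite only the outliers, in place, with -3 or 3 depending on their sign
capped[capped.abs() > 3] = np.sign(capped) * 3
print(capped)
```

Assigning directly to data instead of a copy modifies the original DataFrame in place.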


In [8]:
import pandas as pd
import numpy as np

# Creating a Series with hierarchical index
data = pd.Series(
    np.random.uniform(size=10),  # Random data for the example
    index=[
        ["x", "x", "x", "y", "y", "z", "z", "w", "w", "w"],
        [1, 2, 3, 1, 2, 1, 2, 1, 2, 3]
    ]
)

print("Hierarchical Index Series:")
print(data)


Hierarchical Index Series:
x  1    0.895085
   2    0.125591
   3    0.514164
y  1    0.334689
   2    0.121735
z  1    0.506089
   2    0.817436
w  1    0.667600
   2    0.618742
   3    0.528833
dtype: float64


Outer Level (Level 0):

The first list ["x", "x", "x", "y", "y", "z", "z", "w", "w", "w"] represents the outer level of the index. This level groups the data by the first index.
The unique labels in this level are: x, y, z, w.
Inner Level (Level 1):

The second list [1, 2, 3, 1, 2, 1, 2, 1, 2, 3] represents the inner level of the index. This level represents a sub-index under each label in the outer level.
The unique labels in this level are: 1, 2, 3.
MultiIndex:

These two lists together form a MultiIndex, where each combination of the outer and inner levels forms a unique identifier for each data point in the Series.
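The two levels can also be inspected directly on the index object. The sketch below builds the same index explicitly with pd.MultiIndex.from_arrays; the variable names index, outer_labels, and inner_labels are illustrative.

```python
import pandas as pd

# Same two label lists as in the Series above
index = pd.MultiIndex.from_arrays(
    [
        ["x", "x", "x", "y", "y", "z", "z", "w", "w", "w"],
        [1, 2, 3, 1, 2, 1, 2, 1, 2, 3],
    ]
)

# Number of levels in the index
print(index.nlevels)

# Unique labels of each level, in order of first appearance
outer_labels = index.get_level_values(0).unique().tolist()
inner_labels = index.get_level_values(1).unique().tolist()
print(outer_labels, inner_labels)

# Every (outer, inner) pair identifies exactly one data point
print(index.is_unique)
```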
Visual Representation of the Hierarchical Index
To better understand this, let's visualize how the hierarchical index is structured:

Outer Level (Level 0) | Inner Level (Level 1) | Value
-----------------------------------------------------
x                     | 1                     | 0.895085
x                     | 2                     | 0.125591
x                     | 3                     | 0.514164
y                     | 1                     | 0.334689
y                     | 2                     | 0.121735
z                     | 1                     | 0.506089
z                     | 2                     | 0.817436
w                     | 1                     | 0.667600
w                     | 2                     | 0.618742
w                     | 3                     | 0.528833


Selecting Data Using the Hierarchical Index
With hierarchical indexing, you can easily select data at different levels:

Selecting by the Outer Level Only:

You can select all data for a specific outer level, such as all entries for y:

In [12]:
print(data["y"])


1    0.334689
2    0.121735
dtype: float64


Selecting Multiple Outer Levels:

You can select several outer-level labels at once by passing a list, such as y and w:

In [15]:
print(data.loc[["y", "w"]])

y  1    0.334689
   2    0.121735
w  1    0.667600
   2    0.618742
   3    0.528833
dtype: float64
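A contiguous range of outer labels can be selected with a slice instead of a list. pandas generally expects a sorted MultiIndex for label slicing, so the sketch below sorts the index first. The values are copied from the Series printed earlier; the names sorted_data and selected are illustrative.

```python
import pandas as pd

# Series with the values shown in the earlier output
data = pd.Series(
    [0.895085, 0.125591, 0.514164, 0.334689, 0.121735,
     0.506089, 0.817436, 0.667600, 0.618742, 0.528833],
    index=[
        ["x", "x", "x", "y", "y", "z", "z", "w", "w", "w"],
        [1, 2, 3, 1, 2, 1, 2, 1, 2, 3],
    ],
)

# Sort the index so the outer labels are in order: w, x, y, z
sorted_data = data.sort_index()

# Label slices include both endpoints, so this keeps every x and y entry
selected = sorted_data.loc["x":"y"]
print(selected)
```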


Selecting by the Inner Level Across All Outer Levels:

To select all data for a specific inner-level label (e.g., 2) across every outer level, pass a full slice for the outer level and the label for the inner level:

In [16]:
print(data.loc[:, 2])

x    0.125591
y    0.121735
z    0.817436
w    0.618742
dtype: float64
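The same cross-section can be requested by naming the level explicitly with the xs method, which can be easier to read when an index has more than two levels. This is a sketch using the values printed earlier; the name inner is illustrative.

```python
import pandas as pd

# Series with the values shown in the earlier output
data = pd.Series(
    [0.895085, 0.125591, 0.514164, 0.334689, 0.121735,
     0.506089, 0.817436, 0.667600, 0.618742, 0.528833],
    index=[
        ["x", "x", "x", "y", "y", "z", "z", "w", "w", "w"],
        [1, 2, 3, 1, 2, 1, 2, 1, 2, 3],
    ],
)

# Cross-section: every entry whose inner-level (level 1) label is 2
inner = data.xs(2, level=1)
print(inner)
```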


Reshaping with unstack and stack
Hierarchical indexing is particularly useful for reshaping data.

Unstacking the Data:

The unstack() method pivots the inner level of the index to form columns in a DataFrame:
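A minimal sketch of this round trip, using the values printed earlier (the names wide and restored are illustrative). Outer labels without an inner label 3 get NaN in that column. The dropna() call after stack() is there because, depending on the pandas version, stack() may or may not drop those missing entries on its own.

```python
import numpy as np
import pandas as pd

# Series with the values shown in the earlier output
data = pd.Series(
    [0.895085, 0.125591, 0.514164, 0.334689, 0.121735,
     0.506089, 0.817436, 0.667600, 0.618742, 0.528833],
    index=[
        ["x", "x", "x", "y", "y", "z", "z", "w", "w", "w"],
        [1, 2, 3, 1, 2, 1, 2, 1, 2, 3],
    ],
)

# Inner level becomes the columns; y and z have no entry for 3, so NaN appears
wide = data.unstack()
print(wide)

# stack() moves the columns back into the inner index level
restored = wide.stack().dropna()
print(restored)
```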