# Pandas DataFrame: A Comprehensive Tutorial

## Introduction to DataFrames
A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure in pandas. We can think of a DataFrame as a collection of Series objects that share the same index. DataFrames provide a flexible and powerful way to organize, manipulate, and analyze structured data.
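
The "collection of Series" view can be made concrete with a minimal sketch. The names `population`, `area`, and `frame`, and the values in them, are illustrative assumptions rather than data from this tutorial.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative values)
population = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
area = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])

# Combining them column-wise yields a DataFrame with one shared index
frame = pd.DataFrame({'population': population, 'area': area})

# Each column is itself a Series, and columns may hold different dtypes
print(type(frame['population']))
print(frame.dtypes)
```

Here one column holds integers and the other floats, which is what "potentially heterogeneous" refers to.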

## Installation
```bash
# Install pandas using pip
pip install pandas

# Install pandas using conda
conda install pandas
```

# Importing Libraries


```python
import numpy as np
import pandas as pd
```
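
A quick way to confirm that the installation worked is to print the library versions. This is only a sanity check; the version strings you see depend on your environment.

```python
import numpy as np
import pandas as pd

# Printing the versions confirms that both imports succeeded
print(np.__version__)
print(pd.__version__)
```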

# Creating DataFrames

## Method 1: Creating a DataFrame from a Dictionary
Dictionaries provide a natural way to define column-oriented data:

```python
# Creating a dictionary with state and year data
dict1 = {'state': ['Andhra Pradesh', 'Telangana'], 'year': [2020, 2021]}

# Converting the dictionary to a DataFrame
df = pd.DataFrame(dict1)
print(df)
```

By default, pandas assigns integer row labels (0, 1, ...). We can specify custom row labels with the `index` argument:

```python
# Creating a DataFrame with custom row indices
df = pd.DataFrame(dict1, index=['row1', 'row2'])
print(df)
```

```
               state  year
row1  Andhra Pradesh  2020
row2       Telangana  2021
```
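
After building a DataFrame, it often helps to inspect its labels, shape, and column types. The sketch below uses illustrative data (`data` and `frame` are assumed names, and the values are not from the tutorial) and only standard DataFrame attributes.

```python
import pandas as pd

# Illustrative data in the same shape as the example above
data = {'state': ['Telangana', 'Kerala'], 'year': [2021, 2022]}
frame = pd.DataFrame(data, index=['row1', 'row2'])

print(frame.index.tolist())    # row labels
print(frame.columns.tolist())  # column labels
print(frame.shape)             # (number of rows, number of columns)
print(frame.dtypes)            # one dtype per column
```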


## Method 2: Creating a DataFrame from NumPy Arrays
We can create DataFrames from numerical data using NumPy arrays:

```python
# Setting a random seed for reproducibility
np.random.seed(10)

# Creating a DataFrame from a random 5x5 array
df2 = pd.DataFrame(np.random.randn(5, 5))
print(df2)
```

```
          0         1         2         3         4
0  1.331587  0.715279 -1.545400 -0.008384  0.621336
1 -0.720086  0.265512  0.108549  0.004291 -0.174600
2  0.433026  1.203037 -0.965066  1.028274  0.228630
3  0.445138 -1.136602  0.135137  1.484537 -1.079805
4 -1.977728 -1.743372  0.266070  2.384967  1.123691
```


We can customize both row indices and column names:

```python
# Creating a DataFrame with custom row indices and column names
np.random.seed(10)
df2 = pd.DataFrame(
    np.random.randn(5, 5),
    index=['row1', 'row2', 'row3', 'row4', 'row5'],
    columns=['col1', 'col2', 'col3', 'col4', 'col5']
)
print(df2)
```

```
          col1      col2      col3      col4      col5
row1  1.331587  0.715279 -1.545400 -0.008384  0.621336
row2 -0.720086  0.265512  0.108549  0.004291 -0.174600
row3  0.433026  1.203037 -0.965066  1.028274  0.228630
row4  0.445138 -1.136602  0.135137  1.484537 -1.079805
row5 -1.977728 -1.743372  0.266070  2.384967  1.123691
```
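
As an alternative to seeding NumPy's global state, a local generator can be used. This sketch relies on `np.random.default_rng`, which the tutorial itself does not use. It draws from a different random stream, so its values will not match the output of `np.random.seed(10)` with `randn`. The smaller shape is chosen only to keep the example short.

```python
import numpy as np
import pandas as pd

# A local generator keeps the seed scoped to this object instead of global state
rng = np.random.default_rng(seed=10)

frame = pd.DataFrame(
    rng.standard_normal((3, 2)),
    index=['row1', 'row2', 'row3'],
    columns=['col1', 'col2'],
)
print(frame.shape)

# Re-creating the generator with the same seed reproduces the same values
rng_again = np.random.default_rng(seed=10)
same = np.allclose(frame.to_numpy(), rng_again.standard_normal((3, 2)))
print(same)
```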


# Accessing DataFrame Elements
## Selecting Columns

### Single Column Selection

When we select a single column, pandas returns it as a Series:

```python
# Selecting a single column
col1_series = df2['col1']
print(col1_series)
```

```
row1    1.331587
row2   -0.720086
row3    0.433026
row4    0.445138
row5   -1.977728
Name: col1, dtype: float64
```


```python
# Selecting another column
col4_series = df2['col4']
print(col4_series)
```

```
row1   -0.008384
row2    0.004291
row3    1.028274
row4    1.484537
row5    2.384967
Name: col4, dtype: float64
```


### Multiple Column Selection
To select multiple columns, we use a list of column names:

```python
# Selecting multiple columns
subset = df2[['col1', 'col2']]
print(subset)
```

```
          col1      col2
row1  1.331587  0.715279
row2 -0.720086  0.265512
row3  0.433026  1.203037
row4  0.445138 -1.136602
row5 -1.977728 -1.743372
```
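
The difference between single and double brackets is easy to miss: a single label returns a Series, while a list of labels returns a DataFrame, even when the list has only one entry. A minimal sketch with illustrative data (`frame`, `as_series`, and `as_frame` are assumed names):

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}, index=['row1', 'row2'])

as_series = frame['col1']     # single label -> Series
as_frame = frame[['col1']]    # list with one label -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The one-column DataFrame keeps its two-dimensional shape, which matters when later code expects columns.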


## Creating New Columns
We can create new columns by performing operations on existing columns:

```python
# Creating a new column as the sum of two existing columns
df2['new'] = df2['col1'] + df2['col2']
print(df2)
```

```
          col1      col2      col3      col4      col5       new
row1  1.331587  0.715279 -1.545400 -0.008384  0.621336  2.046865
row2 -0.720086  0.265512  0.108549  0.004291 -0.174600 -0.454574
row3  0.433026  1.203037 -0.965066  1.028274  0.228630  1.636064
row4  0.445138 -1.136602  0.135137  1.484537 -1.079805 -0.691465
row5 -1.977728 -1.743372  0.266070  2.384967  1.123691 -3.721101
```
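
Bracket assignment modifies the DataFrame it is applied to. If the original should stay untouched, `DataFrame.assign()` is one alternative that returns a new DataFrame instead. The sketch below uses illustrative data, and the column name `total` is an assumption chosen for the example.

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1.0, 2.0], 'col2': [0.5, 1.5]})

# assign() returns a new DataFrame; the original keeps its columns
with_total = frame.assign(total=frame['col1'] + frame['col2'])

print(list(frame.columns))
print(list(with_total.columns))
print(with_total['total'].tolist())
```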


# Modifying DataFrames
## Removing Columns
We can remove columns using the `drop()` method with `axis=1`:

```python
# Dropping a column returns a new DataFrame; df2 itself is unchanged
df_without_new = df2.drop('new', axis=1)
print(df_without_new)
```

```
          col1      col2      col3      col4      col5
row1  1.331587  0.715279 -1.545400 -0.008384  0.621336
row2 -0.720086  0.265512  0.108549  0.004291 -0.174600
row3  0.433026  1.203037 -0.965066  1.028274  0.228630
row4  0.445138 -1.136602  0.135137  1.484537 -1.079805
row5 -1.977728 -1.743372  0.266070  2.384967  1.123691
```


Note that the original DataFrame is not modified. To modify the original DataFrame, use `inplace=True`:

```python
# Removing a column permanently
df2.drop('col1', axis=1, inplace=True)
print(df2)
```

```
          col2      col3      col4      col5       new
row1  0.715279 -1.545400 -0.008384  0.621336  2.046865
row2  0.265512  0.108549  0.004291 -0.174600 -0.454574
row3  1.203037 -0.965066  1.028274  0.228630  1.636064
row4 -1.136602  0.135137  1.484537 -1.079805 -0.691465
row5 -1.743372  0.266070  2.384967  1.123691 -3.721101
```
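
Two variations on column removal are worth knowing. The `columns=` keyword of `drop()` can replace the label-plus-`axis=1` form, and reassigning the result is an alternative to `inplace=True`. This is a sketch on illustrative data, not the `df2` used above.

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'new': [4, 6]})

# columns= is equivalent to passing the label with axis=1
trimmed = frame.drop(columns=['new'])

# Reassigning the result is an alternative to inplace=True
frame = frame.drop(columns=['col1'])

print(list(trimmed.columns))
print(list(frame.columns))
```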


## Removing Rows
We can remove rows using the `drop()` method with `axis=0` (which is the default):

```python
# Removing a row permanently
df2.drop('row1', axis=0, inplace=True)
print(df2)
```

```
          col2      col3      col4      col5       new
row2  0.265512  0.108549  0.004291 -0.174600 -0.454574
row3  1.203037 -0.965066  1.028274  0.228630  1.636064
row4 -1.136602  0.135137  1.484537 -1.079805 -0.691465
row5 -1.743372  0.266070  2.384967  1.123691 -3.721101
```
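
Row removal also accepts a list of labels, and it raises an error when a label does not exist. The sketch below, on illustrative data, shows both behaviors along with the `errors='ignore'` option. The label `'row9'` is deliberately absent.

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 2, 3]}, index=['row1', 'row2', 'row3'])

# Drop several rows at once by passing a list of labels
remaining = frame.drop(index=['row1', 'row3'])
print(remaining.index.tolist())

# A missing label raises KeyError by default
try:
    frame.drop(index='row9')
except KeyError as exc:
    print('KeyError:', exc)

# errors='ignore' skips labels that are not present
unchanged = frame.drop(index='row9', errors='ignore')
print(unchanged.index.tolist())
```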


# Advanced Selection Methods
## Using .loc[] for Label-based Indexing
The `.loc[]` accessor is used for label-based indexing:

```python
# Selecting a row by label
row2_series = df2.loc['row2']
print(row2_series)
```

```
col2    0.265512
col3    0.108549
col4    0.004291
col5   -0.174600
new    -0.454574
Name: row2, dtype: float64
```


```python
# Selecting a column using .loc
col2_series = df2.loc[:, 'col2']
print(col2_series)
```

```
row2    0.265512
row3    1.203037
row4   -1.136602
row5   -1.743372
Name: col2, dtype: float64
```
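
`.loc[]` also accepts a row label and a column label together, or lists of labels on either axis. A minimal sketch on illustrative data (`frame`, `value`, and `block` are assumed names):

```python
import pandas as pd

frame = pd.DataFrame(
    {'col2': [0.5, -1.0, 2.0], 'col3': [1.0, 2.0, 3.0]},
    index=['row2', 'row3', 'row4'],
)

# A row label and a column label together return a single value
value = frame.loc['row3', 'col2']
print(value)

# Lists of labels on both axes return a smaller DataFrame
block = frame.loc[['row2', 'row4'], ['col3']]
print(block.index.tolist(), block.columns.tolist())
```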


## Using .iloc[] for Position-based Indexing
The `.iloc[]` accessor is used for integer-location based indexing:

```python
# Selecting a row by position (second row, which is 'row3')
second_row = df2.iloc[1]
print(second_row)
```

```
col2    1.203037
col3   -0.965066
col4    1.028274
col5    0.228630
new     1.636064
Name: row3, dtype: float64
```


```python
# Selecting a column by position (second column, which is 'col3')
second_col = df2.iloc[:, 1]
print(second_col)
```

```
row2    0.108549
row3   -0.965066
row4    0.135137
row5    0.266070
Name: col3, dtype: float64
```
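
Positions in `.iloc[]` are zero-based, may be negative to count from the end, and can be combined for rows and columns. A short sketch on illustrative data:

```python
import pandas as pd

frame = pd.DataFrame(
    {'col2': [10, 20, 30], 'col3': [40, 50, 60]},
    index=['row2', 'row3', 'row4'],
)

first_value = frame.iloc[0, 1]    # first row, second column
last_row = frame.iloc[-1]         # negative positions count from the end
picked = frame.iloc[[0, 2], [0]]  # lists of positions return a DataFrame

print(first_value)
print(last_row.name)
print(picked.index.tolist())
```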


## Slicing with .iloc[] and .loc[]
We can select subsets of the DataFrame by slicing or by passing a list of labels:

```python
# Selecting all rows, and the columns from position 1 onward
subset = df2.iloc[:, 1:]
print(subset)
```

```
          col3      col4      col5       new
row2  0.108549  0.004291 -0.174600 -0.454574
row3 -0.965066  1.028274  0.228630  1.636064
row4  0.135137  1.484537 -1.079805 -0.691465
row5  0.266070  2.384967  1.123691 -3.721101
```


```python
# Selecting specific columns by name
subset = df2.loc[:, ['col3', 'col4']]
print(subset)
```

```
          col3      col4
row2  0.108549  0.004291
row3 -0.965066  1.028274
row4  0.135137  1.484537
row5  0.266070  2.384967
```
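
One difference between the two accessors is easy to trip over when slicing: `.iloc[]` excludes the stop position, as Python slices do, while `.loc[]` includes the stop label. A minimal sketch on illustrative data:

```python
import pandas as pd

frame = pd.DataFrame(
    {'col2': [1, 2, 3, 4], 'col3': [5, 6, 7, 8]},
    index=['row2', 'row3', 'row4', 'row5'],
)

by_position = frame.iloc[0:2]        # positions 0 and 1; the stop is excluded
by_label = frame.loc['row2':'row4']  # row2 through row4; the stop is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```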


## Conclusion

Pandas DataFrames are powerful data structures for data analysis that offer:

1. **Flexible creation methods** - from dictionaries, arrays, and other data sources
2. **Intuitive indexing** - both by label (`.loc[]`) and position (`.iloc[]`)
3. **Easy manipulation** - adding, removing, and transforming columns and rows
4. **Powerful selection capabilities** - for extracting specific subsets of data

This tutorial covered the basics of creating, accessing, and modifying DataFrames. In practice, DataFrames offer many more advanced capabilities for data cleaning, transformation, grouping, and analysis.

## Next Steps

To deepen your understanding of pandas DataFrames, explore these additional topics:
- Data cleaning and preprocessing
- Grouping and aggregating data
- Handling missing values
- Merging and joining DataFrames
- Time series analysis
- Advanced indexing and selection